

1. Concepts

LSF is a suite of workload management products that schedule, monitor and analyze the workload of a network of computers. LSF JobScheduler allows you to schedule your mission-critical jobs across the whole network as if you were using a single mainframe computer.

LSF JobScheduler consists of a set of daemons that provide workload management services across the whole cluster, an API that allows access to such services at the procedure level, and a suite of tools or utilities that end users can use to access such services at the command or GUI level.

This chapter introduces important concepts related to the administration and operation of LSF JobScheduler. You should also read the LSF JobScheduler User's Guide to understand the concepts involved in using LSF JobScheduler.

Definitions

This section contains definitions of terms used in this guide.

Clusters

A cluster is a group of hosts that provides shared computing resources. Hosts can be grouped into clusters in a number of ways. A cluster can contain:

If you have hosts of more than one type, it is often convenient to group them together in the same cluster. LSF JobScheduler allows you to use these hosts transparently, so applications that run on only one host type are available to the entire cluster.

Submission, Master, and Execution Hosts

When LSF JobScheduler runs a job, three hosts are involved. The host from which the job is submitted is the submission host. The job information is sent to the master host, which is the host where the master LIM and mbatchd are running. The job is run on the execution host. It is possible for more than one of these to be the same host.

The master host is displayed by the lsid command:

% lsid
LSF 3.1, Dec 11, 1997
Copyright 1992-1997 Platform Computing Corporation
My cluster name is test_cluster
My master name is hostA

The following example shows the submission and execution hosts for a batch job:

hostD% bsub sleep 60
Job <1502> is submitted to default queue <normal>.
hostD% bjobs 1502
JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1502  user2 RUN  normal hostD     hostB     sleep 60 Nov 22 14:03

The master host is hostA, as shown by the lsid command. The submission host is hostD, and the execution host is hostB.

Fault Tolerance

LSF JobScheduler has a number of features to support fault tolerance. LSF JobScheduler can tolerate the failure of any host or group of hosts in the cluster.

The LSF master host is chosen dynamically. If the current master host becomes unavailable, another host takes over automatically. The master host selection is based on the order in which hosts are listed in the lsf.cluster.cluster file. If the first host in the file is available, that host acts as the master. If the first host is unavailable, the second host takes over, and so on. LSF may be unavailable for a few minutes while hosts wait to be contacted by the new master. If the cluster is partitioned by a network failure, there will be a master on each side of the partition.

Fault tolerance in LSF JobScheduler depends on the event log file, lsb.events. Every event in the system is logged in this file, including all job submissions and job and host status changes. If the master host becomes unavailable, a new master is chosen by the LIMs. The slave daemon sbatchd on the new master starts a new master batch daemon mbatchd. The new mbatchd reads the lsb.events file to recover the state of the system.

If the network is partitioned, only one of the partitions can access the lsb.events log, so services are only available on one side of the partition. A lock file is used to guarantee that only one mbatchd is running in the cluster.

Running jobs are managed by the sbatchd on each server host. When the new mbatchd starts up it polls the sbatchd daemons on each host and finds the current status of its jobs. If sbatchd fails but the host is still running, jobs running on the host are not lost. When sbatchd is restarted it regains control of all jobs running on the host.

If an LSF server host fails, jobs running on that host are lost. No other jobs are affected.

If all of the hosts in a cluster go down, all running jobs are lost. When a host comes back up and takes over as master, it reads the lsb.events file to get the state of all jobs. Jobs that were running when the systems went down are assumed to have exited, and email is sent to the submitting user. Pending jobs remain in their queues, and are scheduled as hosts become available.

Shared Directories and Files

LSF is designed for networks where all hosts have shared file systems, and files have the same names on all hosts. LSF supports the Network File System (NFS), the Andrew File System (AFS), and DCE's Distributed File System (DFS). NFS file systems can be mounted permanently or on demand using automount.

LSF includes support for copying user data to the execution host before running a job, and for copying results back after the job executes. In networks where the file systems are not shared, this can be used to give remote jobs access to local data.

For more information about running LSF on networks where no shared file space is available, see `Using LSF JobScheduler without Shared File Systems' on page 5.

Shared User Directories

To provide transparent remote execution, LSF commands determine the user's current working directory and use that directory on the remote host. For example, if the command cc file.c is executed remotely, cc only finds the correct file.c if the remote command runs in the same directory.

LSF JobScheduler automatically creates a .lsbatch subdirectory in the user's home directory on the execution host. This directory is used to store temporary input and output files for jobs.

Executables and the PATH Environment Variable

Search paths for executables (the PATH environment variable) are passed to the remote execution host unchanged. In mixed clusters, LSF works best when the user binary directories (/usr/bin, /usr/local/bin, etc.) have the same path names on different host types. This makes the PATH variable valid on all hosts.

If your user binaries are NFS-mounted, place all binaries in a shared file system under /usr/local/lsf/mnt (or some similar name), and then make a symbolic link from /usr/local/bin to /usr/local/lsf/mnt/type/bin for the correct host type on each machine. This is what LSF's default installation procedure does.
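
For example, on a host whose type directory is sparc-sol2 (a placeholder; use the correct type name on each machine), the link could be created as root, assuming /usr/local/bin does not already exist as a local directory:

# ln -s /usr/local/lsf/mnt/sparc-sol2/bin /usr/local/bin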

LSF configuration files are normally in a shared directory. This makes administration easier. There is little performance penalty for this, because the configuration files are not read often.

For more information on LSF installation directories, see the LSF Installation Guide.

Using LSF JobScheduler without Shared File Systems

Some networks do not share files between hosts. LSF JobScheduler can still be used on these networks, with reduced fault tolerance.

You must choose one host to act as the LSF JobScheduler master host. The LSF JobScheduler configuration files and working directories must be installed on this host, and the master host must be listed first in the lsf.cluster.cluster file.

If the master host is unavailable, users cannot submit batch jobs or check job status. Running jobs continue to run, but no new jobs are started. When the master host becomes available again, LSF JobScheduler service is resumed.

Some fault tolerance can be introduced by choosing more than one host as possible master hosts, and using NFS to mount the LSF JobScheduler working directory on only these hosts. All the possible master hosts must be listed first in the lsf.cluster.cluster file. As long as one of these hosts is available, LSF JobScheduler continues to operate.
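
For example, the beginning of the Host section in the lsf.cluster.cluster file might list the two candidate master hosts first. The host names, models, and types below are placeholders and some columns are omitted; see `The lsf.cluster.cluster File' on page 86 for the exact format:

Begin Host
HOSTNAME    model      type     server   RESOURCES
hostA       SparcIPC   SUNSOL   1        ()
hostB       SparcIPC   SUNSOL   1        ()
hostC       HP735      HPPA     1        ()
End Host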

Resources and Resource Requirements

LSF provides a powerful means for you to describe your heterogeneous cluster in terms of resources. One of the most important decisions LSF makes when scheduling a job is to map a job's resource requirements onto resources available on individual hosts. There are several types of resource. Load indices measure dynamic resource availability such as a host's CPU load or available swap space. Static resources represent unchanging information such as the number of CPUs a host has, the host type, and the maximum available swap space.

Resources may also be described in terms of where they are located. A shared resource is a resource which is associated with the entire cluster or a subset of hosts within the cluster. In contrast to host-based resources such as memory or swap space, using a shared resource from one machine affects the availability of that resource as seen by other machines. Common examples of shared resources include floating licenses for software packages, shared file systems, and network bandwidth. LSF provides a mechanism to configure which machines share a particular resource and to monitor the availability of those resources.

Resource names may be any string of characters, excluding the characters reserved as operators. The lsinfo command lists the resources available in your cluster.

For a complete description of resources and how they are used, see Chapter 4, `Resources', on page 45 of the LSF JobScheduler User's Guide.

To place a job where it will perform best, resource requirements can be specified for each application. A resource requirement is an expression that contains resource names and operators. Resource requirements can be configured for individual applications, or specified for each job. The detailed format for resource requirements can be found in `Resource Requirement Strings' on page 51 of the LSF JobScheduler User's Guide.
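
For example, a job could be submitted with a resource requirement string such as the following, where myjob is a placeholder for the command to run and the host type and swap threshold are illustrative:

% bsub -R "select[type==HPPA && swp>20] order[cpu]" myjob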

Remote Execution Control

There are two aspects to controlling access to remote execution. The first requirement is to authenticate the user. When a user executes a remote command, the command must be run with that user's permission. The LSF daemons need to know which user is requesting the remote execution. The second requirement is to check access controls on the remote host. The user must be authorized to execute commands remotely on the host.

User Authentication Methods

LSF supports user authentication using privileged ports, authentication using the RFC 931 or RFC 1413 identification protocols, and site-specific external authentication, such as Kerberos and DCE.

The default method is to use privileged ports. To use privileged ports, some of the LSF utilities must be installed with root as the owner of the file and with the setuid bit set.
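
For example, if the LSF binaries were installed without the setuid bit, the LSF administrator could set it as root. The installation directory below is an example; use your site's LSF_BINDIR:

# chown root /usr/local/lsf/bin/bsub
# chmod 4755 /usr/local/lsf/bin/bsub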

Authentication Using Privileged Ports

If a load-sharing program is owned by root and has the setuid bit set, the LSF API functions use a privileged port to communicate with LSF servers, and the servers accept the user ID supplied by the caller. This is the same user authentication mechanism as used by rlogin and rsh.

When a setuid application calls the LSLIB initialization routine, a number of privileged ports are allocated for remote connections to LSF servers. The effective user ID then reverts to the real user ID. Therefore, the number of remote connections is limited. Note that an LSF utility reuses the connection to the RES for all remote task executions on that host, so the number of privileged ports is only a limitation on the number of remote hosts that can be used by a single application, not on the number of remote tasks. Programs using LSLIB can specify the number of privileged ports to be created at initialization time.

Authentication Using Identification Daemons

The RFC 1413 and RFC 931 protocols use an identification daemon running on each client host. Using an identification daemon incurs more overhead, but removes the need for LSF applications to allocate privileged ports. All LSF commands except lsadmin can be run without setuid permission if an identification daemon is used.

You should use identification daemons if your site cannot install programs owned by root with the setuid bit set, or if you have software developers creating new load-sharing applications in C using LSLIB.

An implementation of RFC 931 or RFC 1413, such as pidentd or authd, may be obtained from the public domain (if you have access to Internet FTP, a good source for identification daemons is host ftp.lysator.liu.se, directory pub/ident/servers). RFC 1413 is a more recent standard than RFC 931. LSF is compatible with either.

External Authentication

You can configure your own user authentication scheme using the eauth mechanism of LSF. If external authentication is used, an executable called eauth must be written and installed in LSF_SERVERDIR.

When an LSF client program is invoked (e.g., lsrun), the client program automatically executes eauth -c hostname to get the external authentication data. hostname is the name of the host running the LSF daemon (e.g., RES). The external user authentication data can be passed to LSF via eauth's standard output.

When the LSF daemon receives the request, it executes eauth -s under the primary LSF administrator user ID. The parameter, LSF_EAUTH_USER, must be configured in the /etc/lsf.sudoers file if your site needs to run authentication under another user ID (see `The lsf.sudoers File' on page 89 for details). eauth -s is executed to process the user authentication data. The data is passed to eauth -s via its standard input. The standard input stream has the following format:

uid gid username client_addr client_port user_auth_data_len 
user_auth_data

The variables are listed below:

uid - the user ID of the user who invoked the LSF client program

gid - the group ID of that user

username - the login name of that user

client_addr - the IP address of the client host

client_port - the port number used by the client

user_auth_data_len - the length, in bytes, of the authentication data

user_auth_data - the authentication data produced by eauth -c on the client host

The LSF daemon expects eauth -s to write 1 to its standard output if authentication succeeds, or 0 if authentication fails.

The same eauth -s process can service multiple authentication requests; if the process terminates, the LSF daemon will re-invoke eauth -s on the next authentication request.
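
The following is a minimal sketch of an eauth written as a shell script. It only illustrates the calling convention described above and accepts every request; a real eauth would generate and verify site-specific credentials, for example Kerberos tickets:

#!/bin/sh
case "$1" in
-c)
    # client side: "eauth -c hostname" writes the authentication data
    # for host $2 to its standard output
    echo "dummy-credential-for-$2"
    ;;
-s)
    # server side: "eauth -s" is run by the LSF daemon and can serve
    # several requests; each request arrives on standard input in the
    # format shown above
    while read uid gid username client_addr client_port len
    do
        read user_auth_data
        # a real eauth would verify $user_auth_data here
        echo 1    # write 1 for success, 0 for failure
    done
    ;;
esac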

Example uses of external authentication include support for Kerberos 4 and DCE client authentication using the GSSAPI. These examples can be found in the examples/krb and examples/dce directories in the standard LSF distribution. Installation instructions are found in the README file in these directories.

Security of LSF Authentication

All authentication methods supported by LSF depend on the security of the root account on all hosts in the cluster. If a user can get access to the root account, they can subvert any of the authentication methods. There are no known security holes that allow a non-root user to execute programs with another user's permission.

Some people have particular concerns about security schemes involving RFC 1413 identification daemons. When a request is coming from an unknown host, there is no way to know whether the identification daemon on that host is correctly identifying the originating user.

LSF only accepts job execution requests that originate from hosts within the LSF cluster, so the identification daemon can be trusted. The identification protocol uses a port in the UNIX privileged port range, so it is not possible for an ordinary user to start a hacked identification daemon on an LSF host.

LSF Security

The default authentication method of LSF is to use privileged ports. On UNIX, this requires the binaries that need to be authenticated (for example, bsub) to be made setuid root. NT does not have the concept of setuid binaries and does not restrict which binaries can use privileged ports. There is therefore a risk that a user could discover the format of LSF protocol messages and write a program that tries to communicate with an LSF server. It is recommended that external authentication (via eauth) be used where this security risk is a concern.

The system environment variable LSF_ENVDIR is used by LSF to obtain the location of lsf.conf, which points to important configuration files. Any user who can modify system environment variables can modify LSF_ENVDIR to point to their own configuration and start up programs under the lsfadmin account.

Once the LSF service is started, it only accepts requests from the lsfadmin account. To allow other users to interact with the LSF service, you must set up the lsf.sudoers file under the directory specified by the SYSTEMROOT environment variable. See `The lsf.sudoers File' on page 89 for the format of the lsf.sudoers file.

Only the LSF_STARTUP_USERS and LSF_STARTUP_PATH are used on NT. You should ensure that only authorized users modify the files under the SYSTEMROOT directory.
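
For example, a minimal lsf.sudoers file on NT might contain lines like the following, where the user names and path are placeholders for your site's values:

LSF_STARTUP_USERS="lsfadmin user1"
LSF_STARTUP_PATH=c:\lsf\bin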

All external binaries invoked by the LSF daemons (such as esub, eexec, elim, eauth, and queue level pre- and post-execution commands) are run under the lsfadmin account.

How LSF Chooses Authentication Methods

LSF uses the LSF_AUTH parameter in the lsf.conf file to determine the type of authentication to use.

If an LSF application is not setuid to root, library functions use a non-privileged port. If the LSF_AUTH flag is not set in the lsf.conf file, the connection is rejected. If LSF_AUTH is defined to be ident, the RES on the remote host, or mbatchd in the case of a bsub command, contacts the identification daemon on the local host to verify the user ID. The identification daemon looks directly into the kernel to make sure the network port number being used is attached to a program being run by the specified user.

LSF allows both the setuid and authentication daemon methods to be in effect simultaneously. If the effective user ID of a load-sharing application is root, then a privileged port number is used in contacting the RES. RES always accepts requests from a privileged port on a known host even if LSF_AUTH is defined to be ident. If the effective user ID of the application is not root, and the LSF_AUTH parameter is defined to be ident, then a normal port number is used and RES tries to contact the identification daemon to verify the user's identity.

External user authentication is used if LSF_AUTH is defined to be eauth. In this case, LSF will run the external executable eauth in the LSF_SERVERDIR directory to do the authentication.
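
For example, the authentication method is selected by a single line in the lsf.conf file; leave LSF_AUTH undefined to use the default method, privileged ports:

LSF_AUTH=ident     # use an identification daemon
LSF_AUTH=eauth     # use the external executable eauth in LSF_SERVERDIR

Only one of these lines should appear in lsf.conf.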

The error message "User permission denied" is displayed by lsrun, bsub, and other LSF commands if LSF cannot verify the user's identity. This may be because the LSF applications are not installed setuid, the NFS directory is mounted with the nosuid option, the identification daemon is not available on the local or submitting host, or the external authentication failed.

If you change the authentication type while the LSF daemons are running, you will need to run the command lsfdaemons start on each of the LSF server hosts so that the daemons will use the new authentication method.

Host Authentication Methods

When a batch job or a remote execution request is received, LSF first determines the user's identity. Once the user's identity is known, LSF decides whether it can trust the host from which the request comes.

Trust LSF Host

LSF normally allows remote execution by all users except root, from all hosts in the LSF cluster, i.e. LSF trusts all hosts that are configured into your cluster. The reason for this is that by configuring an LSF cluster, you are turning a network of machines into a single computer. Users must have valid accounts on all hosts. This allows any user to run a job with their own permission on any host in the cluster. Remote execution requests and batch job submissions are rejected if they come from a host not in the LSF cluster.

A site can configure an external executable to perform additional user or host authorization. By defining LSF_AUTH to be eauth, the LSF daemon will invoke eauth -s when it receives a request that needs authentication and authorization. As an example, this eauth can check if the client user is on a list of authorized users.

Using /etc/hosts.equiv

If the LSF_USE_HOSTEQUIV parameter is set in the lsf.conf file, LSF uses the same remote execution access control mechanism as the rsh command. When a job is run on a remote host, the user name and originating host are checked using the ruserok(3) function on the remote host.

This function checks in the /etc/hosts.equiv file and the user's $HOME/.rhosts file to decide if the user has permission to execute jobs.

The name of the local host should be included in this list. RES calls ruserok() for connections from the local host. mbatchd calls ruserok() on the master host, so every LSF JobScheduler user must have a valid account and remote execution permission on the master host.
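
For example, with the parameter set in lsf.conf:

LSF_USE_HOSTEQUIV=y

each host's /etc/hosts.equiv file would list the cluster hosts, one name per line, including the local host (the names below are placeholders):

hostA
hostB
hostC
hostD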

The disadvantage of using the /etc/hosts.equiv and $HOME/.rhosts files is that these files also grant permission to use the rlogin and rsh commands without giving a password. Such access is restricted by security policies at some sites.

See the hosts.equiv(5) and ruserok(3) manual pages for details on the format of the files and the access checks performed.

The error message "User permission denied" is displayed by lsrun, bsub, and other LSF commands if you configure LSF to use ruserok() and the client host is not found in either the /etc/hosts.equiv or the $HOME/.rhosts file on the master or remote host.

User Account Mapping

By default, LSF assumes uniform user accounts throughout the cluster. This means that a job is executed on any host with exactly the same user ID and user login name.

LSF JobScheduler has a mechanism to allow user account mapping across dissimilar name spaces. Account mapping can be done at the individual user level. Individual users of the LSF cluster can set up their own account mapping by setting up a .lsfhosts file in their home directories.
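
As a sketch of the idea (the exact syntax is described in the LSF JobScheduler User's Guide), a user whose login name is user1 on most hosts but guestA on hostC could add a line like the following to the .lsfhosts file under the user1 account:

hostC guestA

A corresponding entry under the guestA account on hostC, naming the user1 account, is also needed so that both accounts agree to the mapping.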

The LSF administrator can disable user account mapping.

How LSF JobScheduler Schedules Jobs

LSF JobScheduler provides the functions of a traditional mainframe job scheduler with transparent operation across a network. Jobs can be submitted and monitored from anywhere in the LSF cluster.

LSF's master scheduler, mbatchd, runs on the master host, accepting client requests such as job submissions, job modifications, and controls. It also dispatches jobs to LSF server hosts when the jobs are ready to run. A slave scheduler, sbatchd, runs on every host configured as an LSF JobScheduler server. Each sbatchd accepts jobs dispatched from the mbatchd, runs them on the local host, and monitors the jobs that are running.

There is a Load Information Manager, LIM, running on every host, that collects and propagates resource and load information. This information is then provided to mbatchd to help choose the most appropriate host for running a job.

When a job is submitted to LSF JobScheduler, it enters a queue. The job remains pending in the queue until all its required conditions are met. Many factors control when and where the job should run:

mbatchd periodically scans through the jobs that are ready to run. In doing so, it first obtains load and resource information from the LIM. When appropriate resources become available for a job, LSF JobScheduler compares the load conditions of all qualified hosts and runs the job on the host that can do the job best. The job is sent to the sbatchd on the most appropriate host via a TCP/IP connection.

When sbatchd receives a job from mbatchd, it initializes the job's execution environment first. Such initialization can be customized to fit the user's preference. By default, all user environment variables from the submission host are automatically copied to the execution host and re-established. The user can also choose to reinitialize the environment on the execution host at job submission time.

A job starter can be configured by the cluster administrator to start the job in a certain environment, such as a particular shell.

When the job is started, sbatchd keeps track of all processes of the job through the Process Information Manager, PIM. The resource consumption information of the whole job is periodically sampled and passed to mbatchd. When the job finishes, sbatchd reports the status back to mbatchd.

Job States

An LSF JobScheduler job goes through a series of state transitions until it eventually completes its task, fails or is terminated. The possible states of a job during its life cycle are shown in Figure 1.

Figure 1. Job States

Many jobs enter only three states:

PEND - waiting in the queue

RUN - dispatched to a host and running

DONE - terminated normally

Pending

A job remains in the PEND state until all conditions for its execution are met. Some of the conditions are:

The bjobs -lp command displays the reason why a job is currently in the PEND state.

Terminated

A job may terminate abnormally for various reasons. Job termination may happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:

A job that has terminated may return to the PEND state once again if it is a repetitive job, and if it is not waiting for the next time schedule.

Suspended

Jobs may also be suspended at any time. A job can be suspended by its owner, by the LSF administrator, by the root user (superuser), or by LSF JobScheduler. There are three different states for suspended jobs:

PSUSP
The job was suspended by its owner or the LSF administrator while in PEND state (a job can also be in the PSUSP state if it was submitted with the hold option).
USUSP
The job was suspended by its owner or the LSF administrator after being dispatched.
SSUSP
The job was suspended by LSF JobScheduler after being dispatched.

The bjobs -s command displays the reason why a job was suspended.

User Suspended

A job may be suspended by its owner or the LSF administrator with the bstop command. These jobs are considered user-suspended (displayed by bjobs as USUSP).

When the user restarts the job with the bresume command, the job is not resumed immediately to prevent overloading. Instead, the job is changed from USUSP to SSUSP (suspended by the system). The SSUSP job is resumed when the host load levels are within the scheduling thresholds for that job, exactly as for jobs suspended because of high load.
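
For example, using the job from the earlier bjobs example:

% bstop 1502
% bresume 1502

After bstop, bjobs shows the job in the USUSP state; after bresume, the job moves to SSUSP and then to RUN when the load conditions permit.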

Pre- and Post-Execution Commands

Each job can be associated with optional pre- and post-execution commands.

If a pre-execution command is specified, the job is held in the queue until the specified pre-execution command returns a successful exit status (zero). While the job is pending, other jobs may go ahead of the waiting job.

If a post-execution command is specified, then the command is run after the job is finished.

Pre- and post-execution commands are arbitrary command lines.

Pre-execution commands can be used to support job starting decisions which cannot be configured directly in LSF JobScheduler.

Post-execution commands are typically used to clean up some state left by the pre-execution and the job execution.

LSF JobScheduler supports both job level and queue level pre-execution. Post-execution is only supported at the queue level.
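
For example, a job-level pre-execution command is given with the -E option of bsub; the file name below is a placeholder for whatever condition the job depends on, and myjob is a placeholder for the command to run:

% bsub -E "test -f /usr/share/data/ready" myjob

The job is not started until the test command returns a zero exit status, that is, until the file exists.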

See `Queue-Level Pre-/Post-Execution Commands' on page 99 for more information about queue-level pre/post-execution commands, and `Pre-execution Commands' on page 72 of the LSF JobScheduler User's Guide for more information about the job-level pre-execution commands.

LSF Daemons

LSF consists of the following daemons, each described in the sections below:

Load Information Manager (LIM)

Remote Execution Server (RES)

Master Batch Daemon (mbatchd)

Slave Batch Daemon (sbatchd)

Alarm Daemon (alarmd)

External Event Daemon (eeventd)

In order to effectively manage LSF JobScheduler, it is important to understand the operation of these daemons.

Load Information Manager (LIM)

LIM is a key server that forms the basis for the concept of an LSF cluster. LIM belongs to LSF Base and provides a single system image called a cluster.

LIM runs on every server host and provides cluster configuration information, load information, and host selection services. The LIMs on all hosts coordinate in collecting and transmitting load and resource information. Load information is transmitted between LIMs in the form of vectors of load indices.

All LIMs in a cluster must share the same cluster configuration in order to work correctly. To run LIM on a host, the host must be configured as a server host in the lsf.cluster.cluster file. See `The lsf.cluster.cluster File' on page 86 for more information.

A master LIM is elected for each LSF cluster. The master LIM receives load information from all slave LIMs and provides services to all hosts. The slave LIMs periodically check their own load conditions and send a load vector to the master if significant changes in load condition are observed. The minimum load information exchange interval is 15 seconds. Slave LIMs also monitor the status of the master LIM and elect a new master if the original one becomes unavailable. This provides high availability: as long as at least one host is up, a master LIM is available to provide services.

The load indices monitored at a site can be extended by directing the LIM to invoke and communicate with an External Load Information Manager (ELIM). The ELIM is responsible for collecting load indices not managed by the LIM. These indices are passed on to the LIM by ELIM through a well-defined protocol.

Remote Execution Server (RES)

RES runs on every machine that runs jobs through LSF. RES provides remote execution and remote file operation services. LSF JobScheduler uses RES to do file transfers across machines. RES is also used to run interactive jobs. Together with LIM, RES belongs to LSF Base.

Master Batch Daemon (mbatchd)

mbatchd runs on the host where master LIM is running. There can be only one mbatchd per LSF cluster. mbatchd is the job scheduler daemon that schedules jobs according to user defined schedules as well as system configured policies. mbatchd gets resource and load information from the master LIM and chooses the most appropriate host to run a job that is ready. Powerful mechanisms are built into mbatchd to provide intelligent and configurable job scheduling functions and reliable operations in cases of failures.

mbatchd works closely with the slave batch daemon (sbatchd) in coordinating job executions.

mbatchd is always automatically started by sbatchd on the master host.

Slave Batch Daemon (sbatchd)

sbatchd is the execution daemon for all batch jobs. Jobs that are ready to run are dispatched to sbatchds from the mbatchd. Together with each job are job specifications which sbatchd uses to run the job and control its execution. sbatchd initiates the job according to its specifications and monitors the job throughout the job's lifetime. When the job finishes, sbatchd reports the job status back to mbatchd. The sbatchd on the master host is responsible for starting mbatchd on that host.

Alarm Daemon (alarmd)

alarmd is the alarm daemon for LSF JobScheduler. alarmd is a daemon started by mbatchd and is used to perform periodic operations on the alarm log file, lsb.alarms.log, including sending renotifications and moving resolved or expired alarm incidents into the alarm history file, lsb.alarms.hist. The alarm log and history files are stored in LSB_SHAREDIR/logdir.

Alarm incidents are appended to the alarm log through the raisealarm command, which is invoked by mbatchd when an alert condition happens. alarmd reads the alarm definition in the lsb.alarms file to determine the method for sending renotifications. See `The lsb.alarms File' on page 101 for details of the alarm configuration file. alarmd is started whenever mbatchd is started, and exits when it detects that the mbatchd which started it is no longer running. It runs under the user account of the primary LSF administrator.

External Event Daemon (eeventd)

LSF has an open system architecture to allow each site to customize the behaviour of the system. External events are site specific conditions that can be used to trigger job scheduling actions. Examples of external events are data arrival, tape silo status, and exceptional conditions. External events are collected by the External Event Daemon (eeventd). The eeventd runs on the same host as the mbatchd and collects site specific events that LSF JobScheduler will use to trigger the scheduling of jobs. LSF JobScheduler comes with a default eeventd that monitors file events. A user site can easily add more event functions to it to monitor more events.

For more details see `External Event Management' on page 68.

Remote File Access

When LSF JobScheduler runs a job, it attempts to run the job in the directory where the bsub command was invoked. If the execution directory is under the user's home directory, sbatchd looks for the path relative to the user's home directory.

If the directory is not available on the execution host, the job is run in /tmp. Any files created by the job, including the standard output and error files, are left on the execution host.

LSF provides support for moving user data from the submission host to the execution host before executing a job, and from the execution host back to the submitting host after the job completes. The file operations can be specified when submitting a job.
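
For example, the -f option of bsub copies a file from the submission host to the execution host before the job starts (`>') or back to the submission host after the job completes (`<'); the file and job names below are placeholders:

% bsub -f "data.in > data.in" -f "results.out < results.out" myjob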

The LSF JobScheduler remote file access mechanism uses lsrcp(1) to process the file transfer. lsrcp first tries to connect to the RES daemon on the submission host to handle the file transfer. If lsrcp cannot contact the RES on the submission host, it attempts to use rcp to copy the file. You must set up the /etc/hosts.equiv or $HOME/.rhosts file in order to use rcp. See the rcp(1) and rsh(1) manual pages for more information on using rcp.

A site may replace lsrcp with its own file transfer mechanism as long as it supports the same syntax as lsrcp(1). This may be done to take advantage of a faster interconnection network or to overcome limitations with the existing lsrcp. sbatchd looks for the lsrcp executable in the LSF_BINDIR directory as specified in the lsf.conf file.

For a complete description of the LSF remote file access facilities, see the bsub(1) manual page and the LSF JobScheduler User's Guide.


