This chapter describes the operating concepts and maintenance tasks of LSF JobScheduler. It requires concepts from `Managing LSF Base' on page 23. The topics covered in this chapter are described in the sections that follow.
Managing error log files for LSF JobScheduler daemons was described in `Managing Error Logs' on page 23. This section discusses the other important log files LSF JobScheduler daemons produce. The LSF JobScheduler log files are found in the directory LSB_SHAREDIR/cluster/logdir.
Each time a job completes or exits, an entry is appended to the lsb.acct file. This file can be used to create accounting summaries of LSF JobScheduler system use. The bacct(1) command produces one form of summary. The lsb.acct file is a text file suitable for processing with awk, perl, or similar tools. See the lsb.acct(5) online help page for details of the contents of this file. Additionally, the LSF API supports calls to process the lsb.acct records. See the LSF Programmer's Guide for details of the LSF API.
If the lsb.acct file grows too large, you can move it to a backup location. Removing lsb.acct will not cause any operational problems for LSF JobScheduler. The daemon automatically creates a new lsb.acct file to replace the moved file.
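For example, a minimal rotation could be performed by hand as follows; the backup file name is illustrative only, and LSB_SHAREDIR/cluster/logdir stands for the actual log directory path:

% cd LSB_SHAREDIR/cluster/logdir
% mv lsb.acct lsb.acct.backup

A new lsb.acct file is created automatically, and the backup can then be archived or processed offline.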
The LSF JobScheduler daemons keep an event log in the lsb.events file. The mbatchd daemon uses this information to recover from server failures, host reboots, and LSF JobScheduler reconfiguration. The lsb.events file is also used by the bhist command to display detailed information about the execution history of jobs, and by the badmin command to display the operational history of hosts, queues, and LSF JobScheduler daemons.
For performance reasons, the mbatchd automatically backs up and rewrites the lsb.events file after every 1000 job completions (this is the default; the value is controlled by the MAX_JOB_NUM parameter in the lsb.params file). The old lsb.events file is moved to lsb.events.1, and each old lsb.events.n file is moved to lsb.events.n+1. The mbatchd never deletes these files. If disk storage is a concern, the LSF administrator should arrange to archive or remove old lsb.events.n files occasionally.
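For example, old backups might be compressed in place before being archived or removed; the exact archiving policy is site-specific, and the commands below are only a sketch, with LSB_SHAREDIR/cluster/logdir standing for the actual log directory:

% cd LSB_SHAREDIR/cluster/logdir
% gzip lsb.events.[1-9]*

Do not touch the active lsb.events file itself, as explained below.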
Do not remove or modify the lsb.events file. Removing or modifying the lsb.events file could cause jobs to be lost.
By default, LSF JobScheduler stores all state information needed to recover from server failures, host reboots, or reconfiguration in a file in the LSB_SHAREDIR directory. Typically, the LSB_SHAREDIR directory resides on a reliable file server which also contains other critical applications necessary for running users' jobs. This is done because, if the central file server is unavailable, users' applications cannot run, and the failure of LSF JobScheduler to continue processing users' jobs is a secondary issue.
For sites not wishing to rely solely on a central file server for recovery information, LSF can be configured to maintain a replica of the recovery file. The replica is stored on the file server and used if the primary copy is unavailable; this is referred to as duplicate event logging. When LSF is configured this way, the primary event log is stored on the first master host and re-synchronized with the replicated copy when the host recovers.
To enable the replication feature, define LSB_LOCALDIR in the lsf.conf file. LSB_LOCALDIR should be a local directory and it should exist ONLY on the first master host (i.e., the first host configured in the lsf.cluster.cluster file).
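For example, the lsf.conf entry might look like the following; the directory path is illustrative only and should be a local directory on the first master host:

LSB_LOCALDIR=/local/lsf/work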
LSB_LOCALDIR is used to store the primary copy of the batch state information. The contents of LSB_LOCALDIR are copied to a replica in LSB_SHAREDIR which resides on a central file server. As before, LSB_SHAREDIR is assumed to be accessible from all hosts which can potentially become the master.
With the replication feature enabled, the following scenarios can occur.
If the file server containing LSB_SHAREDIR goes down, LSF will continue to process jobs. Client commands such as bhist(1) and bacct(1), which directly read LSB_SHAREDIR, will not work. When the file server recovers, the replica in LSB_SHAREDIR will be updated.
If the first master host fails, then the primary copy of the recovery file in the LSB_LOCALDIR directory becomes unavailable. A new master host will be selected which will use the recovery file replica in LSB_SHAREDIR to restore its state and to log future events. There is no replication by the second master.
When the first master host becomes available again, it will update the primary copy in LSB_LOCALDIR from the replica in LSB_SHAREDIR and continue operations as before.
The replication feature improves the reliability of LSF JobScheduler operations provided that the following assumption holds: a network partition does not cause two hosts to run mbatchd simultaneously. This may happen given certain network topologies and failure modes. For example, connectivity is lost between the first master, M1, and both the file server and the secondary master, M2. Both M1 and M2 will run the mbatchd service, with M1 logging events to LSB_LOCALDIR and M2 logging to LSB_SHAREDIR. When connectivity is restored, the changes made by M2 to LSB_SHAREDIR will be lost when M1 updates LSB_SHAREDIR from its copy in LSB_LOCALDIR.
The lsadmin command is used to control the LSF Base daemons, LIM and RES. LSF JobScheduler has the badmin command to perform similar operations on LSF JobScheduler daemons.
To check the status of LSF JobScheduler server hosts and queues, use the bhosts and bqueues commands:
% bhosts
HOST_NAME STATUS MAX NJOBS RUN SSUSP USUSP RSV
hostA ok 1 0 0 0 0 0
hostB closed 2 2 2 0 0 0
hostD ok 8 1 1 0 0 0
% bqueues
QUEUE_NAME PRIO STATUS NJOBS PEND RUN SUSP
night 30 Open:Inactive 4 4 0 0
short 10 Open:Active 1 0 1 0
simulation 10 Open:Active 0 0 0 0
default 1 Open:Active 6 4 2 0
If the status of a server host is `closed', it will not accept more jobs. A server host can become closed if, for example, the LSF administrator explicitly closes it with the badmin command.
An inactive queue will accept new job submissions, but will not dispatch any new jobs. A queue can become inactive if the LSF cluster administrator explicitly inactivates it via the badmin command.
mbatchd automatically logs the history of the LSF JobScheduler daemons in the LSF JobScheduler event log. You can display the administrative history of the batch system using the badmin command.
The badmin hhist command displays the times when LSF JobScheduler server hosts are opened and closed by the LSF administrator.
The badmin qhist command displays the times when queues are opened, closed, activated, and inactivated.
The badmin mbdhist command displays the history of the mbatchd daemon, including the times when the master starts, exits, reconfigures, or changes to a different host.
The badmin hist command displays all LSF JobScheduler history information, including all the events listed above.
You can use the badmin hstartup command to start sbatchd on some or all remote hosts from one host:
% badmin hstartup all
Start up slave batch daemon on <hostA> ......done
Start up slave batch daemon on <hostB> ......done
Start up slave batch daemon on <hostD> ......done
For remote startup to work, the file /etc/lsf.sudoers has to be set up properly, and you have to be able to run rsh across all LSF hosts without having to enter a password. See `The lsf.sudoers File' on page 89 for configuration details of lsf.sudoers.
mbatchd is restarted by the badmin reconfig command. sbatchd can be restarted using the badmin hrestart command:
% badmin hrestart hostD
Restart slave batch daemon on <hostD> ...... done
You can specify more than one host name to restart sbatchd on multiple hosts, or use `all' to refer to all LSF JobScheduler server hosts. Restarting sbatchd on a host does not affect jobs that are running on that host.
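For example, to restart the slave batch daemon on every server host at once (the host names in the output are illustrative):

% badmin hrestart all
Restart slave batch daemon on <hostA> ...... done
Restart slave batch daemon on <hostB> ...... done
Restart slave batch daemon on <hostD> ...... done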
The badmin hshutdown command shuts down sbatchd.
% badmin hshutdown hostD
Shut down slave batch daemon on <hostD> .... done
If sbatchd is shut down, that particular host will not be available for running new jobs. Existing jobs running on that host will continue to completion, but the results will not be sent to the user until sbatchd is later restarted.
To shut down mbatchd, you must first use the badmin hshutdown command to shut down the sbatchd on the master host, and then run the badmin reconfig command. The mbatchd is normally restarted by sbatchd; if there is no sbatchd running on the master host, badmin reconfig causes mbatchd to exit.
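For example, assuming hostA is the current master host, the shutdown sequence might look like this:

% badmin hshutdown hostA
Shut down slave batch daemon on <hostA> .... done
% badmin reconfig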
If mbatchd is shut down, all LSF JobScheduler services will be temporarily unavailable. However, existing jobs will not be affected. When mbatchd is later restarted, the previous status will be restored from the event log file and job scheduling will continue.
Occasionally you may want to drain a server host for rebooting, maintenance, or host removal. This can be achieved by running the badmin hclose command:
% badmin hclose hostB
Close <hostB> ...... done
When a host is open, LSF JobScheduler can dispatch jobs to it. When a host is closed, no new batch jobs are dispatched, but jobs already dispatched to the host continue to execute. To reopen a server host, run the badmin hopen command:
% badmin hopen hostB
Open <hostB> ...... done
To view the history of a JobScheduler server host, run the badmin hhist command:
% badmin hhist hostB
Wed Nov 20 14:41:58: Host <hostB> closed by administrator <lsf>.
Wed Nov 20 15:23:39: Host <hostB> opened by administrator <lsf>.
Each JobScheduler queue can be open or closed, active or inactive. Users can submit jobs to open queues, but not to closed queues. Active queues start jobs on available server hosts, and inactive queues hold all jobs. The LSF administrator can change the state of any queue.
The current status of a particular queue or all queues is displayed by the bqueues(1) command. The bqueues -l option also gives current statistics about the jobs in a particular queue, such as the total number of jobs in the queue and the number of jobs running, suspended, and so on.
% bqueues normal
QUEUE_NAME PRIO STATUS NJOBS PEND RUN SUSP
normal 30 Open:Active 6 4 2 0
When a queue is open, users can submit jobs to the queue. When a queue is closed, users cannot submit jobs to the queue. If a user tries to submit a job to a closed queue, an error message is printed and the job is rejected. If a queue is closed but still active, previously submitted jobs continue to be processed. This allows the LSF administrator to drain a queue.
% badmin qclose normal
Queue <normal> is closed
% bqueues normal
QUEUE_NAME PRIO STATUS NJOBS PEND RUN SUSP
normal 30 Closed:Active 6 4 2 0
% bsub -q normal hostname
normal: Queue has been closed
% badmin qopen normal
Queue <normal> is opened
When a queue is active, jobs in the queue are started if appropriate hosts are available. When a queue is inactive, jobs in the queue are not started. Queues can be activated and inactivated by the LSF administrator using badmin qact and badmin qinact.
If a queue is open and inactive, users can submit jobs to this queue but no new jobs are dispatched to hosts. Currently running jobs continue to execute. This allows the LSF administrator to let running jobs complete before removing queues or making other major changes.
% badmin qinact normal
Queue <normal> is inactivated
% bqueues normal
QUEUE_NAME PRIO STATUS NJOBS PEND RUN SUSP
normal 30 Open:Inactive 0 0 0 0
% badmin qact normal
Queue <normal> is activated
The LSF JobScheduler cluster is a subset of the LSF Base cluster. All servers used by LSF JobScheduler must belong to the base cluster. However, not all servers in the base cluster must provide LSF JobScheduler services.
LSF JobScheduler configuration consists of five files: lsb.params, lsb.hosts, lsb.queues, lsb.alarms, and lsb.calendars. These files are stored in LSB_CONFDIR/cluster/configdir, where cluster is the name of your cluster.
All these files are optional. If any of these files does not exist, LSF JobScheduler will assume a default configuration.
The lsb.params file defines general parameters of LSF JobScheduler system operation, such as the name of the default queue used when the user does not specify one, scheduling intervals for mbatchd and sbatchd, etc. Detailed parameters are described in `The lsb.params File' on page 93.
The lsb.hosts file defines LSF JobScheduler server hosts together with their attributes. Not all LSF hosts defined by the LIM configuration have to be configured to run jobs. Server host attributes include scheduling load thresholds, job slot limits, etc. This file is also used to define host groups. See `Host Section' on page 95 for details of this file.
The lsb.queues file defines job queues. See `The lsb.queues File' on page 97 for more details.
The lsb.alarms file contains definitions of alarms used by LSF JobScheduler. See `The lsb.alarms File' on page 101 for details. The lsb.calendars file defines system calendars for use by all user jobs. See `The lsb.calendars File' on page 103 for details.
When you first install LSF on your cluster, some example queues are already configured for you. You should customize these queues or define new queues to meet your site's needs.
After changing any of the LSF JobScheduler configuration files, you need to run badmin reconfig to tell mbatchd to pick up the new configuration. You must also run it every time you change the LIM configuration.
To add a server host to an LSF JobScheduler configuration, use the following procedure.
1. Edit the LSB_CONFDIR/cluster/configdir/lsb.hosts file to add the new host together with its attributes (a sketch of such an entry appears after this procedure). If you want to limit the added host for use only by some queues, you should also update the lsb.queues file. Since host types and host models, as well as the virtual name `default', can be used to refer to all hosts of that type or model, or to every other LSF host not covered by the definitions, you may not need to change any of the files if the host is already covered.
2. Run badmin reconfig to tell mbatchd to pick up the new configuration.
3. Start sbatchd on the added host by running badmin hstartup, or simply start it by hand.
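A Host section entry for the new host might look like the following sketch; the host name, the column names beyond HOST_NAME, and the values are illustrative only, and `Host Section' on page 95 gives the actual syntax:

Begin Host
HOST_NAME   MXJ   r1m   pg    DISPATCH_WINDOW
hostE       2     3.5   15    ()
End Host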
To make a host no longer a server host, use the following procedure.
1. Run badmin hclose to prevent new batch jobs from starting on the host, and wait for any running jobs on that host to finish. If you wish to shut the host down before all jobs complete, use bkill to kill the running jobs.
2. Edit lsb.hosts and lsb.queues in the LSB_CONFDIR/cluster/configdir directory and remove the host from any of the sections, then run badmin reconfig so that mbatchd picks up the change.
3. Run badmin hshutdown to shut down sbatchd on that host.
You should never remove the master host from LSF JobScheduler. Change the LIM configuration to assign a different default master host if you want to remove your current default master from the LSF JobScheduler server pool.
To add a queue to a cluster, use the following procedure.
1. Edit the LSB_CONFDIR/cluster/configdir/lsb.queues file to add the new queue definition (a sketch of a queue definition appears after this procedure). You can copy another queue definition from this file as a starting point; remember to change the QUEUE_NAME of the copied queue. Save the changes to lsb.queues. See `The lsb.queues File' on page 97 for a complete description of LSF JobScheduler queue configuration.
2. Run badmin ckconfig to check the new queue definition. If any errors are reported, fix the problem and check the configuration again. See `Controlling LSF JobScheduler Servers' on page 55 for an example of normal output from badmin ckconfig.
3. Run badmin reconfig. The master batch daemon (mbatchd) is unavailable for approximately one minute while it reconfigures. Pending and running jobs are not affected.
Adding a queue does not affect pending or running LSF JobScheduler jobs.
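A minimal queue definition might look like the following sketch; the queue name, priority value, and description are illustrative only, and `The lsb.queues File' on page 97 describes the full set of parameters:

Begin Queue
QUEUE_NAME   = priority
PRIORITY     = 40
DESCRIPTION  = Queue for urgent jobs
End Queue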
Before removing a queue, you should make sure there are no jobs in that queue. If you remove a queue that has jobs in it, the jobs are temporarily moved to a lost and found queue. Jobs in the lost and found queue remain pending until the user or the LSF administrator uses the bswitch command to switch the jobs into regular queues. Jobs in other queues are not affected.
In this example, all pending and running jobs in the night queue are moved to the idle queue, and then the night queue is removed.
% badmin qclose night
Queue <night> is closed
Use the bswitch command to move the jobs to another queue. The -q night argument chooses jobs from the night queue, and the job ID number 0 specifies that all jobs should be switched.
% bjobs -u all -q night
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5308 user5 RUN night hostA hostD sleep 500 Nov 21 18:16
5310 user5 PEND night hostA sleep 500 Nov 21 18:17
% bswitch -q night idle 0
Job <5308> is switched to queue <idle>
Job <5310> is switched to queue <idle>
Edit the LSB_CONFDIR/cluster/configdir/lsb.queues file. Remove (or comment out) the definition for the queue being removed.
Save the changes.
Run badmin reconfig. If any problems are reported, fix them and run badmin reconfig again. The JobScheduler system is unavailable for about one minute while the system rereads the configuration.
The LSF administrator can control jobs belonging to any user. Other users may control only their own jobs. Jobs can be suspended, resumed and killed.
The bstop, bresume, bkill, and bdel commands send signals to batch jobs. See the kill(1) man/online help page for a discussion of these signals.
bstop suspends a job. A running job is placed in USUSP status; a pending job that is stopped is placed in PSUSP status, as shown in the example below.
bresume causes a suspended job to resume execution.
See the LSF JobScheduler User's Guide and the online help pages for more information about these commands.
This example shows the use of the bstop and bkill commands:
% bstop 5310
Job <5310> is being stopped
% bjobs 5310
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5310 user5 PSUSP night hostA analysis Nov 21 18:17
% bkill 5310
Job <5310> is being terminated
% bjobs 5310
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5310 user5 EXIT night hostA analysis Nov 21 18:17
When LSF JobScheduler runs your jobs, it tries to make execution as transparent to the user as possible. By default, the execution environment is maintained to be as close to the submission environment as possible. LSF JobScheduler will copy the environment from the submission host to the execution host. It also sets the umask and the current working directory.
Since a network can be heterogeneous, it is often impossible or undesirable to reproduce the submission host's execution environment on the execution host. For example, if a home directory is not shared between the submission and execution hosts, LSF JobScheduler runs the job in /tmp on the execution host.
Users can change the default behaviour by using a job starter. See `Using a Job Starter' on page 66 for details of a job starter.
In addition to environment variables inherited from the user, LSF JobScheduler also sets a few more environment variables for jobs. These are:
LSB_JOBID: Job ID assigned by LSF JobScheduler.
LSB_HOSTS: The list of hosts that are used to run the job. For sequential jobs, this is only one host name. For parallel jobs, this includes multiple host names.
LSB_QUEUE: The name of the queue the job belongs to.
LSB_JOBNAME: Name of the job.
LSB_EXIT_PRE_ABORT: Set to an integer value representing an exit status. A pre-execution command should exit with this value if it wants the job to be aborted instead of requeued or executed.
LSB_JOB_STARTER: Set to the value of the job starter if a job starter is defined for the queue.
LSB_EVENT_ATTRIB: Set to the attributes of external events that were specified in the job's dependency condition. The variable is of the format `event_name1 attribute1 event_name2 ...'.
LS_JOBPID: Set to the process ID of the job.
LS_SUBCWD: The directory on the submission host from which the job was submitted. This is different from PWD only if the directory is not shared across machines or when the execution account is different from the submission account as a result of account mapping.
These variables are set for the convenience of the job. The job does not have to use these variables.
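As a minimal sketch, a job script could read some of these variables at run time; the script contents and output are illustrative only:

#!/bin/sh
# report where and under which job ID this job is running
echo "Job $LSB_JOBID ($LSB_JOBNAME) from queue $LSB_QUEUE running on: $LSB_HOSTS"

Such a script can be submitted with bsub in the usual way.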
Some jobs have to be started under particular shells or require certain setup steps to be performed before the actual job is executed. This is often handled by writing wrapper scripts around the job. The LSF job starter feature allows you to specify an executable which will perform the actual execution of the job, doing any necessary setup beforehand. The job starter can be specified at the queue level using the JOB_STARTER parameter in the lsb.queues file. This allows the LSF JobScheduler queue to control the job startup. For example, the following might be defined in a queue:
Begin Queue
.
JOB_STARTER = xterm -e
.
End Queue
This way all jobs submitted into this queue will be run under an xterm.
The following are other possible uses of a job starter:
`/bin/csh -c' allows C-shell syntax to be used.
`$USER_STARTER' enables users to define their own job starters by setting the environment variable USER_STARTER.
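For instance, assuming a queue whose definition contains JOB_STARTER = $USER_STARTER, a user could select a personal job starter before submitting a job (csh syntax; the starter value and queue name are illustrative):

% setenv USER_STARTER "/bin/ksh -c"
% bsub -q starter_queue myjob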
A job starter is configured at the queue level. See `Job Starter' on page 101 for details.
Calendars are normally created by users using the bcadd command or the xbcal GUI. Calendars that are commonly used may be defined as system calendars, which can be referenced by all users. System calendars are defined in the lsb.calendars configuration file in the LSB_CONFDIR/cluster/configdir directory.
The lsb.calendars file consists of multiple Calendar sections, where each section corresponds to one calendar. Each Calendar section requires the NAME and CAL_EXPR parameters and can optionally contain a DESCRIPTION parameter. The Calendar section is of the form:
Begin Calendar
NAME=<name>
CAL_EXPR=<calendar expression>
DESCRIPTION=<description>
End Calendar
The syntax of the CAL_EXPR parameter is described in the LSF JobScheduler User's Guide and the bcaladd(1) man page. The following is a sample lsb.calendars file:
Begin Calendar
NAME=Holidays
CAL_EXPR=((*:Dec:25)||(*:Jan:1)||(*:Jul:4))
DESCRIPTION=U.S. Holidays
End Calendar
Begin Calendar
NAME=weekends
CAL_EXPR=sat,sun
DESCRIPTION=weekend days
End Calendar
System calendars are owned by the virtual user sys and can be viewed by everybody. The xbcal GUI and the bcal command display the system calendars:
% bcal
CALENDAR_NAME OWNER STATUS LAST_CAL_DAY NEXT_CAL_DAY
holiday sys inactive Fri Jul 4 1997 Thu Dec 25 1997
weekend sys inactive Sun Dec 21 1997 Sat Dec 27 1997
workday sys active Tue Dec 23 1997 Wed Dec 24 1997
System calendars cannot be created with the bcadd command and they cannot be deleted with the bcdel command. When a system calendar is defined, its name becomes a reserved calendar name in the cluster. Consequently, users cannot create a calendar with the same name as a system calendar.
LSF JobScheduler supports the scheduling of jobs based on external site-specific events. A typical use of this feature in a data processing environment is to trigger jobs based on the arrival of data or the availability of tapes. Sites that use storage management systems, for example, can coordinate the dispatch of jobs with the staging of data from hierarchical storage onto disk.
The scheduling daemon (mbatchd) can start up and communicate with an external event daemon (eeventd) to detect the occurrence of events. The eeventd is implemented as an executable called eeventd which resides in LSF_SERVERDIR. Users can submit jobs specifying dependencies on any logical combination of external events using the -w option of the bsub command. External event dependencies can be combined with job, file, and calendar events.
A protocol is defined which allows mbatchd to indicate to the eeventd that a job is waiting on a particular event. The eeventd will monitor the event and possibly take actions to trigger it. When the event occurs, the eeventd informs mbatchd, which will then consider the job as eligible for dispatch provided appropriate hosts are available.
LSF JobScheduler comes with an eeventd for file event detection. If you want to monitor additional site events, you can simply add event detection functions to the existing eeventd. The source code of the default eeventd is also included in the release.
The protocol between the external event daemon, eeventd, and mbatchd consists of a sequence of ASCII messages that are exchanged over a socket pair. The startup sequence and message format for the protocol are described in the eeventd(8) man page.
Each event is identified by an event name. The event name is an arbitrary string, which is site-specific. A user specifies job dependencies on an external event by using the -w option of the bsub command with the event keyword. For example:
% bsub -w `event(tapeXYZ)' myjob
LSF JobScheduler considers the job to be waiting on an event with the name tapeXYZ. LSF JobScheduler does not check the syntax of the event name. The eeventd can reject an event if the syntax is incorrect, preventing the job from being dispatched until the user either modifies the event or removes the job. Alternatively, a site may write a wrapper submission script which checks the syntax of the event before the job is submitted to LSF JobScheduler.
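Since jobs may depend on any logical combination of external events, a submission might also combine several events; the event names below are purely illustrative:

% bsub -w `event(tapeXYZ) && event(diskReady)' myjob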
The following messages are sent from mbatchd to the eeventd:
SUB event_name
Subscribe to the event given by event_name. Whenever a job is submitted with a new event name that mbatchd has not seen before, a subscribe request is sent to the eeventd. The eeventd is expected to monitor the event and, if necessary, to take any actions required for the event to occur.
UNSUB event_name
Unsubscribe from a given event when there are no jobs dependent on this event. This should cause the eeventd to stop monitoring the event.
The following messages are sent from the eeventd to mbatchd:
START event_name event_type [event_attrib]
Tells mbatchd to make the event active. The event_type field should be one of latched, pulse, pulseAll, or exclusive. The different event types control when mbatchd will inactivate an event, as follows:
latched
Not automatically inactivated until an explicit END message is received.
pulse
Automatically inactivated when one job is dispatched. Subsequent START messages on the same event can cause one job to be dispatched each time the event is pulsed.
pulseAll
Automatically inactivated after it is received. For pulseAll events, each job will maintain its own copy of the event state. When a pulseAll event is triggered, all jobs currently waiting on the event will have their copy of the event state marked as active and will be eligible for dispatch. Subsequently submitted jobs will view the event as inactive.
exclusive
Automatically inactivated when one job is dispatched and kept inactive until the job completes. Subsequent attempts by the eeventd to activate the event are ignored until job completion.
The event_attrib is an optional attribute string that can be associated with the event. The event attribute is not interpreted by the system and is passed to a job when it starts via the LSB_EVENT_ATTRIB environment variable. It can be used to communicate information between the event daemon and the job. It is also displayed by the bevents command.
END event_name
Causes the event to be put in the inactive state. If the event is already inactive, this has no effect.
REJECT event_name [event_attribute]
Causes the event to be put in the reject state. This can be used to indicate a syntax error in the event name. Rejected events are considered to be inactive so that jobs waiting on them are not dispatched. The optional event_attrib can be used to give more information about why the job is rejected. This information will be displayed by the bevents command.
The sequence of interactions between mbatchd and the eeventd is shown in Figure 2.
Figure 2. mbatchd and eeventd Interactions
(1) A user submits a job with a dependency on an external event, eventX.
(2) mbatchd scans the event table to see if eventX already exists. If not, it creates the event and sends a subscribe message to the eeventd. The eeventd recognizes eventX and starts monitoring it. If the eeventd cannot recognize eventX, it returns a REJECT message to mbatchd.
(3) The eeventd detects an occurrence of eventX and sends a START message telling mbatchd to try to schedule any jobs waiting on the event. Since eventX is latched, mbatchd will consider the event as active indefinitely. If a job also has a dependency on a calendar, it can be run multiple times while eventX is still active.
(4) The eeventd detects that eventX is no longer occurring and sends an END message. mbatchd considers eventX as inactive and stops scheduling jobs waiting on the event.
Steps (3) and (4) can be repeated multiple times.
(5) When there are no longer any jobs dependent on eventX, an UNSUB message is sent to the eeventd. This should cause the eeventd to stop monitoring eventX.
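Put together, the message exchange for a latched event such as eventX might look like the following schematic trace; the arrows and layout are purely illustrative, and the exact wire format is documented in the eeventd(8) man page:

mbatchd -> eeventd:  SUB eventX
eeventd -> mbatchd:  START eventX latched
eeventd -> mbatchd:  END eventX
mbatchd -> eeventd:  UNSUB eventX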
The external event daemon given in examples/eevent/eevent.c provides an example of a simple event daemon. It receives requests from mbatchd to subscribe and unsubscribe to events. Periodically, it scans the list of subscribed events and toggles the state of each event between active and inactive. The type of the event is chosen based on the event name; that is, event names beginning with the string `exclusive' or `pulse' are treated as exclusive events or pulse events, respectively. Otherwise the event is treated as a latched event.
The handling of file events is implemented using the default external event daemon. The installation scripts automatically install the event daemon that handles file events.
Since only one external event daemon can run on a system, sites requiring file event handling in addition to site-specific events must modify the existing file event daemon. The source is provided in examples/eevent/fevent.c in the distribution directory.
You can monitor all external events using the bevents command.
The LSF JobScheduler system can raise an alarm when exceptions are encountered in processing critical jobs. See the LSF JobScheduler User's Guide for how users can associate an alarm with a failure in a job. Alarms must be configured by the LSF JobScheduler administrator before they can be associated with jobs by users. The alarm configuration defines how a notification is to be delivered.
The administrator can configure an alarm to send notification via email or to invoke a command which sends a page or a message to a system console. Each time an alarm is triggered, a record of the incident is created in a log. The administrator or other users with write access to the log can modify the state of the incident using the xbalarms GUI or the balarms command.
The alarm system supplied with LSF JobScheduler consists of the following components:
lsb.alarms, in LSB_CONFDIR/cluster/configdir, which defines the alarms
raisealarm, in LSF_SERVERDIR, which is invoked by mbatchd in order to trigger an alarm
lsb.alarmlog, in LSB_SHAREDIR/cluster/logdir, which contains a record of each alarm incident (a history log file, lsb.alarmlog.hist, keeps old alarm incident records)
alarmd, started by mbatchd, which performs periodic operations on the alarms
Alarm definitions must be created by the LSF Administrator in the lsb.alarms file located in the LSB_CONFDIR/cluster/configdir directory. The lsb.alarms file consists of multiple "Alarm" sections where each section corresponds to one alarm. Each section has a number of keywords which are used to define the alarm parameters. Each Alarm section requires the NAME and NOTIFICATION parameters. See `The lsb.alarms File' on page 101 for detailed information on each parameter.
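By analogy with the Calendar and Queue sections shown earlier, an Alarm section can be sketched as follows; only NAME and NOTIFICATION are known to be required, the name and notification value shown are illustrative, and `The lsb.alarms File' on page 101 gives the actual syntax:

Begin Alarm
NAME=diskFull
NOTIFICATION=...
.
End Alarm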