This chapter describes the operating concepts and maintenance tasks of LSF JobScheduler. It requires concepts from `Managing LSF Base' on page 23. The topics covered in this chapter are described in the sections that follow.
Managing error log files for LSF JobScheduler daemons was described in `Managing Error Logs' on page 23. This section discusses the other important log files LSF JobScheduler daemons produce. The LSF JobScheduler log files are found in the directory LSB_SHAREDIR/cluster/logdir.
Each time a job completes or exits, an entry is appended to the lsb.acct file. This file can be used to create accounting summaries of LSF JobScheduler system use. The bacct(1) command produces one form of summary. The lsb.acct file is a text file suitable for processing with awk, perl, or similar tools. See the lsb.acct(5) online help page for details of the contents of this file. Additionally, the LSF API supports calls to process the lsb.acct records. See the LSF Programmer's Guide for details of the LSF API.
If the lsb.acct file grows too large, you can move it to a backup location. Removing lsb.acct will not cause any operational problems for LSF JobScheduler. The daemon automatically creates a new lsb.acct file to replace the moved file.
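For example, a minimal rotation could be performed by hand as follows; the backup file name is illustrative only, and LSB_SHAREDIR/cluster/logdir stands for the actual log directory path:

% cd LSB_SHAREDIR/cluster/logdir
% mv lsb.acct lsb.acct.backup

A new lsb.acct file is created automatically, and the backup can then be archived or processed offline.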
The LSF JobScheduler daemons keep an event log in the lsb.events file. The mbatchd daemon uses this information to recover from server failures, host reboots, and LSF JobScheduler reconfiguration. The lsb.events file is also used by the bhist command to display detailed information about the execution history of jobs, and by the badmin command to display the operational history of hosts, queues, and LSF JobScheduler daemons.
For performance reasons, the mbatchd automatically backs up and rewrites the lsb.events file after every 1000 job completions (this is the default; the value is controlled by the MAX_JOB_NUM parameter in the lsb.params file). The old lsb.events file is moved to lsb.events.1, and each old lsb.events.n file is moved to lsb.events.n+1. The mbatchd never deletes these files. If disk storage is a concern, the LSF administrator should arrange to archive or remove old lsb.events.n files occasionally.
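For example, old backups might be compressed in place before being archived or removed; the exact archiving policy is site-specific, and the commands below are only a sketch, with LSB_SHAREDIR/cluster/logdir standing for the actual log directory:

% cd LSB_SHAREDIR/cluster/logdir
% gzip lsb.events.[1-9]*

Do not touch the active lsb.events file itself, as explained below.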
Do not remove or modify the lsb.events file. Removing or modifying the lsb.events file could cause jobs to be lost.
By default, LSF JobScheduler stores all state information needed to recover from server failures, host reboots, or reconfiguration in a file in the LSB_SHAREDIR directory. Typically, the LSB_SHAREDIR directory resides on a reliable file server which also contains other critical applications necessary for running users' jobs. This is done because, if the central file server is unavailable, users' applications cannot run, and the failure of LSF JobScheduler to continue processing users' jobs is a secondary issue.
For sites not wishing to rely solely on a central file server for recovery information, LSF can be configured to maintain a replica of the recovery file. The replica is stored on the file server and used if the primary copy is unavailable; this is referred to as duplicate event logging. When LSF is configured this way, the primary event log is stored on the first master host and re-synchronized with the replicated copy when the host recovers.
To enable the replication feature, define LSB_LOCALDIR in the lsf.conf file. LSB_LOCALDIR should be a local directory and it should exist ONLY on the first master host (i.e., the first host configured in the lsf.cluster.cluster file).
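For example, the lsf.conf entry might look like the following; the directory path is illustrative only and should be a local directory on the first master host:

LSB_LOCALDIR=/local/lsf/work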
LSB_LOCALDIR is used to store the primary copy of the batch state information. The contents of LSB_LOCALDIR are copied to a replica in LSB_SHAREDIR which resides on a central file server. As before, LSB_SHAREDIR is assumed to be accessible from all hosts which can potentially become the master.
With the replication feature enabled, the following scenarios can occur.
If the file server containing LSB_SHAREDIR goes down, LSF will continue to process jobs. Client commands such as bhist(1) and bacct(1), which directly read LSB_SHAREDIR, will not work. When the file server recovers, the replica in LSB_SHAREDIR will be updated.
If the first master host fails, then the primary copy of the recovery file in the LSB_LOCALDIR directory becomes unavailable. A new master host will be selected which will use the recovery file replica in LSB_SHAREDIR to restore its state and to log future events. There is no replication by the second master.
When the first master host becomes available again, it will update the primary copy in LSB_LOCALDIR from the replica in LSB_SHAREDIR and continue operations as before.
The replication feature improves the reliability of LSF JobScheduler operations provided that the following assumption holds: a network partition does not cause two hosts to run mbatchd simultaneously. This may happen given certain network topologies and failure modes. For example, connectivity is lost between the first master, M1, and both the file server and the secondary master, M2. Both M1 and M2 will run the mbatchd service, with M1 logging events to LSB_LOCALDIR and M2 logging to LSB_SHAREDIR. When connectivity is restored, the changes made by M2 to LSB_SHAREDIR will be lost when M1 updates LSB_SHAREDIR from its copy in LSB_LOCALDIR.
The lsadmin command is used to control the LSF Base daemons, LIM and RES. LSF JobScheduler has the badmin command to perform similar operations on LSF JobScheduler daemons.
To check the status of LSF JobScheduler server hosts and queues, use the bhosts and bqueues commands:
% bhosts
HOST_NAME STATUS MAX NJOBS RUN SSUSP USUSP RSV
hostA ok 1 0 0 0 0 0
hostB closed 2 2 2 0 0 0
hostD ok 8 1 1 0 0 0
% bqueues
QUEUE_NAME PRIO STATUS NJOBS PEND RUN SUSP
night 30 Open:Inactive 4 4 0 0
short 10 Open:Active 1 0 1 0
simulation 10 Open:Active 0 0 0 0
default 1 Open:Active 6 4 2 0
If the status of a server host is `closed', it will not accept more jobs. A server host can become closed if, for example, the LSF administrator explicitly closes it with the badmin command.
An inactive queue will accept new job submissions, but will not dispatch any new jobs. A queue can become inactive if the LSF cluster administrator explicitly inactivates it via the badmin command.
mbatchd automatically logs the history of the LSF JobScheduler daemons in the LSF JobScheduler event log. You can display the administrative history of the batch system using the badmin command.
The badmin hhist command displays the times when LSF JobScheduler server hosts are opened and closed by the LSF administrator.
The badmin qhist command displays the times when queues are opened, closed, activated, and inactivated.
The badmin mbdhist command displays the history of the mbatchd daemon, including the times when the master starts, exits, reconfigures, or changes to a different host.
The badmin hist command displays all LSF JobScheduler history information, including all the events listed above.
You can use the badmin hstartup command to start sbatchd on some or all remote hosts from one host:
% badmin hstartup all
Start up slave batch daemon on <hostA> ......done
Start up slave batch daemon on <hostB> ......done
Start up slave batch daemon on <hostD> ......done
For remote startup to work, the file /etc/lsf.sudoers has to be set up properly, and you have to be able to run rsh across all LSF hosts without having to enter a password. See `The lsf.sudoers File' on page 89 for configuration details of lsf.sudoers.
mbatchd is restarted by the badmin reconfig command. sbatchd can be restarted using the badmin hrestart command:
% badmin hrestart hostD
Restart slave batch daemon on <hostD> ...... done
You can specify more than one host name to restart sbatchd on multiple hosts, or use `all' to refer to all LSF JobScheduler server hosts. Restarting sbatchd on a host does not affect jobs that are running on that host.
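For example, to restart the slave batch daemon on every server host at once (the host names in the output are illustrative):

% badmin hrestart all
Restart slave batch daemon on <hostA> ...... done
Restart slave batch daemon on <hostB> ...... done
Restart slave batch daemon on <hostD> ...... done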
The badmin hshutdown command shuts down sbatchd.
% badmin hshutdown hostD
Shut down slave batch daemon on <hostD> .... done
If sbatchd is shut down, that particular host will not be available for running new jobs. Existing jobs running on that host will continue to completion, but the results will not be sent to the user until sbatchd is later restarted.
To shut down mbatchd, you must first use the badmin hshutdown command to shut down the sbatchd on the master host, and then run the badmin reconfig command. The mbatchd is normally restarted by sbatchd; if there is no sbatchd running on the master host, badmin reconfig causes mbatchd to exit.
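For example, assuming hostA is the current master host, the shutdown sequence might look like this:

% badmin hshutdown hostA
Shut down slave batch daemon on <hostA> .... done
% badmin reconfig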
If mbatchd is shut down, all LSF JobScheduler services will be temporarily unavailable. However, existing jobs will not be affected. When mbatchd is later restarted, the previous status will be restored from the event log file and job scheduling will continue.
Occasionally you may want to drain a server host for rebooting, maintenance, or host removal. This can be achieved by running the badmin hclose command:
% badmin hclose hostB
Close <hostB> ...... done
When a host is open, LSF JobScheduler can dispatch jobs to it. When a host is closed, no new batch jobs are dispatched, but jobs already dispatched to the host continue to execute. To reopen a server host, run the badmin hopen command:
% badmin hopen hostB
Open <hostB> ...... done
To view the history of a JobScheduler server host, run the badmin hhist command:
% badmin hhist hostB
Wed Nov 20 14:41:58: Host <hostB> closed by administrator <lsf>.
Wed Nov 20 15:23:39: Host <hostB> opened by administrator <lsf>.
Each JobScheduler queue can be open or closed, active or inactive. Users can submit jobs to open queues, but not to closed queues. Active queues start jobs on available server hosts, and inactive queues hold all jobs. The LSF administrator can change the state of any queue.
The current status of a particular queue or all queues is displayed by the bqueues(1) command. The bqueues -l option also gives current statistics about the jobs in a particular queue, such as the total number of jobs in the queue and the number of jobs running, suspended, and so on.
% bqueues normal
QUEUE_NAME PRIO STATUS NJOBS PEND RUN SUSP
normal 30 Open:Active 6 4 2 0
When a queue is open, users can submit jobs to the queue. When a queue is closed, users cannot submit jobs to the queue. If a user tries to submit a job to a closed queue, an error message is printed and the job is rejected. If a queue is closed but still active, previously submitted jobs continue to be processed. This allows the LSF administrator to drain a queue.
% badmin qclose normal
Queue <normal> is closed
% bqueues normal
QUEUE_NAME PRIO STATUS NJOBS PEND RUN SUSP
normal 30 Closed:Active 6 4 2 0
% bsub -q normal hostname
normal: Queue has been closed
% badmin qopen normal
Queue <normal> is opened
When a queue is active, jobs in the queue are started if appropriate hosts are available. When a queue is inactive, jobs in the queue are not started. Queues can be activated and inactivated by the LSF administrator using badmin qact and badmin qinact.
If a queue is open and inactive, users can submit jobs to this queue but no new jobs are dispatched to hosts. Currently running jobs continue to execute. This allows the LSF administrator to let running jobs complete before removing queues or making other major changes.
% badmin qinact normal
Queue <normal> is inactivated
% bqueues normal
QUEUE_NAME PRIO STATUS NJOBS PEND RUN SUSP
normal 30 Open:Inactive 0 0 0 0
% badmin qact normal
Queue <normal> is activated
The LSF JobScheduler cluster is a subset of the LSF Base cluster. All servers used by LSF JobScheduler must belong to the base cluster. However, not all servers in the base cluster must provide LSF JobScheduler services.
LSF JobScheduler configuration consists of five files: lsb.params, lsb.hosts, lsb.queues, lsb.alarms, and lsb.calendars. These files are stored in LSB_CONFDIR/cluster/configdir, where cluster is the name of your cluster.
All these files are optional. If any of these files does not exist, LSF JobScheduler will assume a default configuration.
The lsb.params file defines general parameters of LSF JobScheduler system operation, such as the name of the default queue used when the user does not specify one, scheduling intervals for mbatchd and sbatchd, etc. Detailed parameters are described in `The lsb.params File' on page 93.
The lsb.hosts file defines LSF JobScheduler server hosts together with their attributes. Not all LSF hosts defined by the LIM configuration have to be configured to run jobs. Server host attributes include scheduling load thresholds, job slot limits, etc. This file is also used to define host groups. See `Host Section' on page 95 for details of this file.
The lsb.queues file defines job queues. See `The lsb.queues File' on page 97 for more details.
The lsb.alarms file contains definitions of alarms used by LSF JobScheduler. See `The lsb.alarms File' on page 101 for details. The lsb.calendars file defines system calendars for use by all user jobs. See `The lsb.calendars File' on page 103 for details.
When you first install LSF on your cluster, some example queues are already configured for you. You should customize these queues or define new queues to meet your site's needs.
After changing any of the LSF JobScheduler configuration files, you need to run badmin reconfig to tell mbatchd to pick up the new configuration. You must also run it every time you change the LIM configuration.
To add a server host to an LSF JobScheduler configuration, use the following procedure.
1. Edit the LSB_CONFDIR/cluster/configdir/lsb.hosts file to add the new host together with its attributes (a sketch of such an entry appears after this procedure). If you want to limit the added host for use only by some queues, you should also update the lsb.queues file. Since host types and host models, as well as the virtual name `default', can be used to refer to all hosts of that type or model, or to every other LSF host not covered by the definitions, you may not need to change any of the files if the host is already covered.
2. Run badmin reconfig to tell mbatchd to pick up the new configuration.
3. Start sbatchd on the added host by running badmin hstartup, or simply start it by hand.
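A Host section entry for the new host might look like the following sketch; the host name, the column names beyond HOST_NAME, and the values are illustrative only, and `Host Section' on page 95 gives the actual syntax:

Begin Host
HOST_NAME   MXJ   r1m   pg    DISPATCH_WINDOW
hostE       2     3.5   15    ()
End Host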
To make a host no longer a server host, use the following procedure.
1. Run badmin hclose to prevent new batch jobs from starting on the host, and wait for any running jobs on that host to finish. If you wish to shut the host down before all jobs complete, use bkill to kill the running jobs.
2. Edit lsb.hosts and lsb.queues in the LSB_CONFDIR/cluster/configdir directory and remove the host from any of the sections, then run badmin reconfig so that mbatchd picks up the change.
3. Run badmin hshutdown to shut down sbatchd on that host.
You should never remove the master host from LSF JobScheduler. Change the LIM configuration to assign a different default master host if you want to remove your current default master from the LSF JobScheduler server pool.
To add a queue to a cluster, use the following procedure.
1. Edit the LSB_CONFDIR/cluster/configdir/lsb.queues file to add the new queue definition (a sketch of a queue definition appears after this procedure). You can copy another queue definition from this file as a starting point; remember to change the QUEUE_NAME of the copied queue. Save the changes to lsb.queues. See `The lsb.queues File' on page 97 for a complete description of LSF JobScheduler queue configuration.
2. Run badmin ckconfig to check the new queue definition. If any errors are reported, fix the problem and check the configuration again. See `Controlling LSF JobScheduler Servers' on page 55 for an example of normal output from badmin ckconfig.
3. Run badmin reconfig. The master batch daemon (mbatchd) is unavailable for approximately one minute while it reconfigures. Pending and running jobs are not affected.
Adding a queue does not affect pending or running LSF JobScheduler jobs.
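A minimal queue definition might look like the following sketch; the queue name, priority value, and description are illustrative only, and `The lsb.queues File' on page 97 describes the full set of parameters:

Begin Queue
QUEUE_NAME   = priority
PRIORITY     = 40
DESCRIPTION  = Queue for urgent jobs
End Queue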
Before removing a queue, you should make sure there are no jobs in that queue. If you remove a queue that has jobs in it, the jobs are temporarily moved to a lost and found queue. Jobs in the lost and found queue remain pending until the user or the LSF administrator uses the bswitch command to switch the jobs into regular queues. Jobs in other queues are not affected.
In this example, all pending and running jobs in the night queue are moved to the idle queue, and then the night queue is removed.
% badmin qclose night
Queue <night> is closed
Use the bswitch command to move the jobs to another queue. The -q night argument chooses jobs from the night queue, and the job ID number 0 specifies that all jobs should be switched.
% bjobs -u all -q night
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5308 user5 RUN night hostA hostD sleep 500 Nov 21 18:16
5310 user5 PEND night hostA sleep 500 Nov 21 18:17
% bswitch -q night idle 0
Job <5308> is switched to queue <idle>
Job <5310> is switched to queue <idle>
Edit the LSB_CONFDIR/cluster/configdir/lsb.queues file. Remove (or comment out) the definition for the queue being removed.
Save the changes.
Run badmin reconfig. If any problems are reported, fix them and run badmin reconfig again. The JobScheduler system is unavailable for about one minute while the system rereads the configuration.
The LSF administrator can control jobs belonging to any user. Other users may control only their own jobs. Jobs can be suspended, resumed and killed.
The bstop, bresume, bkill, and bdel commands send signals to batch jobs. See the kill(1) man/online help page for a discussion of these signals.
bstop suspends a job. A running job is placed in USUSP status; a pending job that is stopped is placed in PSUSP status, as shown in the example below.
bresume causes a suspended job to resume execution.
See the LSF JobScheduler User's Guide and the online help pages for more information about these commands.
This example shows the use of the bstop and bkill commands:
% bstop 5310
Job <5310> is being stopped
% bjobs 5310
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5310 user5 PSUSP night hostA analysis Nov 21 18:17
% bkill 5310
Job <5310> is being terminated
% bjobs 5310
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5310 user5 EXIT night hostA analysis Nov 21 18:17
When LSF JobScheduler runs your jobs, it tries to make execution as transparent to the user as possible. By default, the execution environment is maintained to be as close to the submission environment as possible. LSF JobScheduler will copy the environment from the submission host to the execution host. It also sets the umask and the current working directory.
Since a network can be heterogeneous, it is often impossible or undesirable to reproduce the submission host's execution environment on the execution host. For example, if a home directory is not shared between the submission and execution hosts, LSF JobScheduler runs the job in /tmp on the execution host.
Users can change the default behaviour by using a job starter. See `Using a Job Starter' on page 66 for details of a job starter.
In addition to environment variables inherited from the user, LSF JobScheduler also sets a few more environment variables for jobs. These are:
LSB_JOBID: Job ID assigned by LSF JobScheduler.
LSB_HOSTS: The list of hosts that are used to run the job. For sequential jobs, this is only one host name. For parallel jobs, this includes multiple host names.
LSB_QUEUE: The name of the queue the job belongs to.
LSB_JOBNAME: Name of the job.
LSB_EXIT_PRE_ABORT: Set to an integer value representing an exit status. A pre-execution command should exit with this value if it wants the job to be aborted instead of requeued or executed.
LSB_JOB_STARTER: Set to the value of the job starter if a job starter is defined for the queue.
LSB_EVENT_ATTRIB: Set to the attributes of external events that were specified in the job's dependency condition. The variable is of the format `event_name1 attribute1 event_name2 ...'.
LS_JOBPID: Set to the process ID of the job.
LS_SUBCWD: The directory on the submission host from which the job was submitted. This is different from PWD only if the directory is not shared across machines or when the execution account is different from the submission account as a result of account mapping.
These variables are set for the convenience of the job. The job does not have to use these variables.
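As a minimal sketch, a job script could read some of these variables at run time; the script contents and output are illustrative only:

#!/bin/sh
# report where and under which job ID this job is running
echo "Job $LSB_JOBID ($LSB_JOBNAME) from queue $LSB_QUEUE running on: $LSB_HOSTS"

Such a script can be submitted with bsub in the usual way.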
Some jobs have to be started under particular shells or require certain setup steps to be performed before the actual job is executed. This is often handled by writing wrapper scripts around the job. The LSF job starter feature allows you to specify an executable which will perform the actual execution of the job, doing any necessary setup beforehand. The job starter can be specified at the queue level using the JOB_STARTER parameter in the lsb.queues file. This allows the LSF JobScheduler queue to control the job startup. For example, the following might be defined in a queue:
Begin Queue
.
JOB_STARTER = xterm -e
.
End Queue
This way all jobs submitted into this queue will be run under an xterm.
The following are other possible uses of a job starter:
`/bin/csh -c' allows C-shell syntax to be used.
`$USER_STARTER' enables users to define their own job starters by setting the environment variable USER_STARTER.
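For instance, assuming a queue whose definition contains JOB_STARTER = $USER_STARTER, a user could select a personal job starter before submitting a job (csh syntax; the starter value and queue name are illustrative):

% setenv USER_STARTER "/bin/ksh -c"
% bsub -q starter_queue myjob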
A job starter is configured at the queue level. See `Job Starter' on page 101 for details.
Calendars are normally created by users using the bcadd command or the xbcal GUI. Calendars that are commonly used may be defined as system calendars, which can be referenced by all users. System calendars are defined in the lsb.calendars configuration file in the LSB_CONFDIR/cluster/configdir directory.
The lsb.calendars file consists of multiple Calendar sections, where each section corresponds to one calendar. Each Calendar section requires the NAME and CAL_EXPR parameters and can optionally contain a DESCRIPTION parameter. The Calendar section is of the form:
Begin Calendar
NAME=<name>
CAL_EXPR=<calendar expression>
DESCRIPTION=<description>
End Calendar
The syntax of the CAL_EXPR parameter is described in the LSF JobScheduler User's Guide and the bcaladd(1) man page. The following is a sample lsb.calendars file:
Begin Calendar
NAME=Holidays
CAL_EXPR=((*:Dec:25)||(*:Jan:1)||(*:Jul:4))
DESCRIPTION=U.S. Holidays
End Calendar
Begin Calendar
NAME=weekends
CAL_EXPR=sat,sun
DESCRIPTION=weekend days
End Calendar
System calendars are owned by the virtual user sys and can be viewed by everybody. The xbcal GUI and the bcal command display the system calendars:
% bcal
CALENDAR_NAME OWNER STATUS LAST_CAL_DAY NEXT_CAL_DAY
holiday sys inactive Fri Jul 4 1997 Thu Dec 25 1997
weekend sys inactive Sun Dec 21 1997 Sat Dec 27 1997
workday sys active Tue Dec 23 1997 Wed Dec 24 1997
System calendars cannot be created with the bcadd command and they cannot be deleted with the bcdel command. When a system calendar is defined, its name becomes a reserved calendar name in the cluster. Consequently, users cannot create a calendar with the same name as a system calendar.
LSF JobScheduler supports the scheduling of jobs based on external site-specific events. A typical use of this feature in a data processing environment is to trigger jobs based on the arrival of data or the availability of tapes. Sites that use storage management systems, for example, can coordinate the dispatch of jobs with the staging of data from hierarchical storage onto disk.
The scheduling daemon (mbatchd) can start up and communicate with an external event daemon (eeventd) to detect the occurrence of events. The eeventd is implemented as an executable called eeventd which resides in LSF_SERVERDIR. Users can submit jobs specifying dependencies on any logical combination of external events using the -w option of the bsub command. External event dependencies can be combined with job, file, and calendar events.
A protocol is defined which allows mbatchd to indicate to the eeventd that a job is waiting on a particular event. The eeventd will monitor the event and possibly take actions to trigger it. When the event occurs, the eeventd informs mbatchd, which will then consider the job as eligible for dispatch provided appropriate hosts are available.
LSF JobScheduler comes with an eeventd for file event detection. If you want to monitor additional site events, you can simply add event detection functions to the existing eeventd. The source code of the default eeventd is also included in the release.
The protocol between the external event daemon, eeventd, and mbatchd consists of a sequence of ASCII messages that are exchanged over a socket pair. The startup sequence and message format for the protocol are described in the eeventd(8) man page.
Each event is identified by an event name. The event name is an arbitrary string, which is site-specific. A user specifies job dependencies on an external event by using the -w option of the bsub command with the event keyword. For example:
% bsub -w `event(tapeXYZ)' myjob
LSF JobScheduler considers the job to be waiting on an event with the name tapeXYZ. LSF JobScheduler does not check the syntax of the event name. The eeventd can reject an event if the syntax is incorrect, preventing the job from being dispatched until the user either modifies the event or removes the job. Alternatively, a site may write a wrapper submission script which checks the syntax of the event before the job is submitted to LSF JobScheduler.
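Since jobs may depend on any logical combination of external events, a submission might also combine several events; the event names below are purely illustrative:

% bsub -w `event(tapeXYZ) && event(diskReady)' myjob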
The following messages are sent from mbatchd to the eeventd:
SUB event_name
Subscribe to the event given by event_name. Whenever a job is submitted with a new event name that mbatchd has not seen before, a subscribe request is sent to the eeventd. The eeventd is expected to monitor the event and, if necessary, to take any actions required for the event to occur.
UNSUB event_name
Unsubscribe from a given event when there are no jobs dependent on this event. This should cause the eeventd to stop monitoring the event.
The following messages are sent from the eeventd to mbatchd:
START event_name event_type [event_attrib]
Tells mbatchd to make the event active. The event_type field should be one of latched, pulse, pulseAll, or exclusive. The different event types control when mbatchd will inactivate an event, as follows:
latched
Not automatically inactivated until an explicit END message is received.
pulse
Automatically inactivated when one job is dispatched. Subsequent START messages on the same event can cause one job to be dispatched each time the event is pulsed.
pulseAll
Automatically inactivated after it is received. For pulseAll events, each job will maintain its own copy of the event state. When a pulseAll event is triggered, all jobs currently waiting on the event will have their copy of the event state marked as active and will be eligible for dispatch. Subsequently submitted jobs will view the event as inactive.
exclusive
Automatically inactivated when one job is dispatched and kept inactive until the job completes. Subsequent attempts by the eeventd to activate the event are ignored until job completion.
The event_attrib is an optional attribute string that can be associated with the event. The event attribute is not interpreted by the system and is passed to a job when it starts via the LSB_EVENT_ATTRIB environment variable. It can be used to communicate information between the event daemon and the job. It is also displayed by the bevents command.
END event_name
Causes the event to be put in the inactive state. If the event is already inactive, this has no effect.
REJECT event_name [event_attribute]
Causes the event to be put in the reject state. This can be used to indicate a syntax error in the event name. Rejected events are considered to be inactive so that jobs waiting on them are not dispatched. The optional event_attrib can be used to give more information about why the job is rejected. This information will be displayed by the bevents command.
The sequence of interactions between mbatchd and the eeventd is shown in Figure 2.
Figure 2. mbatchd and eeventd Interactions
(1) A user submits a job with a dependency on an external event, eventX.
(2) mbatchd scans the event table to see if eventX already exists. If not, it creates the event and sends a subscribe message to the eeventd. The eeventd recognizes eventX and starts monitoring it. If the eeventd cannot recognize eventX, it returns a REJECT message to mbatchd.
(3) The eeventd detects an occurrence of eventX and sends a START message telling mbatchd to try to schedule any jobs waiting on the event. Since eventX is latched, mbatchd will consider the event as active indefinitely. If a job also has a dependency on a calendar, it can be run multiple times while eventX is still active.
(4) The eeventd detects that eventX is no longer occurring and sends an END message. mbatchd considers eventX as inactive and stops scheduling jobs waiting on the event.
Steps (3) and (4) can be repeated multiple times.
(5) When there are no longer any jobs dependent on eventX, an UNSUB message is sent to the eeventd. This should cause the eeventd to stop monitoring eventX.
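Put together, the message exchange for a latched event such as eventX might look like the following schematic trace; the arrows and layout are purely illustrative, and the exact wire format is documented in the eeventd(8) man page:

mbatchd -> eeventd:  SUB eventX
eeventd -> mbatchd:  START eventX latched
eeventd -> mbatchd:  END eventX
mbatchd -> eeventd:  UNSUB eventX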
The external event daemon given in examples/eevent/eevent.c provides an example of a simple event daemon. It receives requests from mbatchd to subscribe and unsubscribe to events. Periodically, it scans the list of subscribed events and toggles the state of each event between active and inactive. The type of the event is chosen based on the event name; that is, event names beginning with the string `exclusive' or `pulse' are treated as exclusive events or pulse events, respectively. Otherwise the event is treated as a latched event.
The handling of file events is implemented using the default external event daemon. The installation scripts automatically install the event daemon that handles file events.
Since only one external event daemon can run on a system, sites requiring file event handling in addition to site-specific events must modify the existing file event daemon. The source is provided in examples/eevent/fevent.c in the distribution directory.
You can monitor all external events using the bevents command.
The LSF JobScheduler system can raise an alarm when exceptions are encountered in processing critical jobs. See the LSF JobScheduler User's Guide for how users can associate an alarm with a failure in a job. Alarms must be configured by the LSF JobScheduler administrator before they can be associated with jobs by users. The alarm configuration defines how a notification is to be delivered.
The administrator can configure an alarm to send notification via email or to invoke a command which sends a page or a message to a system console. Each time an alarm is triggered, a record of the incident is created in a log. The administrator or other users with write access to the log can modify the state of the incident using the xbalarms GUI or the balarms command.
The alarm system supplied with LSF JobScheduler consists of the following components:
lsb.alarms, in LSB_CONFDIR/cluster/configdir, which defines the alarms
raisealarm, in LSF_SERVERDIR, which is invoked by mbatchd in order to trigger an alarm
lsb.alarmlog, in LSB_SHAREDIR/cluster/logdir, which contains a record of each alarm incident (a history log file, lsb.alarmlog.hist, keeps old alarm incident records)
alarmd, started by mbatchd, which performs periodic operations on the alarms
Alarm definitions must be created by the LSF Administrator in the lsb.alarms file located in the LSB_CONFDIR/cluster/configdir directory. The lsb.alarms file consists of multiple "Alarm" sections where each section corresponds to one alarm. Each section has a number of keywords which are used to define the alarm parameters. Each Alarm section requires the NAME and NOTIFICATION parameters. See `The lsb.alarms File' on page 101 for detailed information on each parameter.
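By analogy with the Calendar and Queue sections shown earlier, an Alarm section can be sketched as follows; only NAME and NOTIFICATION are known to be required, the name and notification value shown are illustrative, and `The lsb.alarms File' on page 101 gives the actual syntax:

Begin Alarm
NAME=diskFull
NOTIFICATION=...
.
End Alarm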