This chapter describes the LSF JobScheduler configuration files lsb.params, lsb.hosts, lsb.queues, lsb.alarms, and lsb.calendars. These files use the same horizontal and vertical section structure as the LIM configuration files (see `Configuration File Formats' on page 29).
All LSF JobScheduler configuration files are found in the LSB_CONFDIR/cluster/configdir directory.
The lsb.params file defines general parameters used by the LSF JobScheduler cluster. This file contains only one section.
Most of the parameters that can be defined in the lsb.params file control timing within the LSF JobScheduler system. The default settings provide good throughput for long-running batch jobs while adding a minimum of processing overhead to the daemons.
This section and all of its keywords are optional. If a keyword is not present, LSF JobScheduler uses the default value for that keyword.
The valid keywords for this section are:
DEFAULT_QUEUE gives the name of an LSF JobScheduler queue defined in the lsb.queues file. When a user submits a job to the LSF JobScheduler system without explicitly specifying a queue and the user's environment variable LSB_DEFAULTQUEUE is not set, LSF JobScheduler queues the job in the default queue.
If this keyword is not present or no valid value is given, LSF JobScheduler automatically creates a default queue named default with all the default parameters (see `The lsb.queues File' on page 97).
MBD_SLEEP_TIME is the LSF JobScheduler job dispatching interval. It determines how often the LSF JobScheduler system tries to dispatch pending jobs.
The LSF JobScheduler job checking interval. It determines how often the LSF JobScheduler system checks the load conditions of each host to decide whether jobs on the host must be suspended or resumed.
The number of MBD_SLEEP_TIME periods to wait after dispatching a job to a host before dispatching a second job to the same host. If JOB_ACCEPT_INTERVAL is zero, a host may accept more than one job in each job dispatching interval (MBD_SLEEP_TIME).
The maximum number of retries for reaching a non-responding slave batch daemon, sbatchd. The interval between retries is defined by MBD_SLEEP_TIME. If the master batch daemon fails to reach a host and has retried MAX_SBD_FAIL times, the host is considered unavailable.
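The two timing rules above reduce to simple multiplication. The following is an illustrative Python sketch, not part of LSF JobScheduler: the function names are hypothetical, and it assumes MBD_SLEEP_TIME is expressed in seconds, which this section does not state. The second result is only an approximation of how long the master batch daemon keeps retrying.

```python
def second_dispatch_delay(job_accept_interval: int, mbd_sleep_time: int) -> int:
    """Delay before a second job may be dispatched to the same host.

    JOB_ACCEPT_INTERVAL is counted in MBD_SLEEP_TIME periods, so a value
    of zero means no enforced delay between dispatches to one host.
    """
    if job_accept_interval < 0 or mbd_sleep_time <= 0:
        raise ValueError("intervals must be non-negative and the period positive")
    return job_accept_interval * mbd_sleep_time


def retry_window(max_sbd_fail: int, mbd_sleep_time: int) -> int:
    """Approximate time spent retrying sbatchd before a host is unavailable."""
    if max_sbd_fail < 0 or mbd_sleep_time <= 0:
        raise ValueError("retry count must be non-negative and the period positive")
    return max_sbd_fail * mbd_sleep_time


# Hypothetical values chosen only for illustration.
print(second_dispatch_delay(2, 60))  # 2 periods of 60 time units
print(retry_window(3, 60))           # 3 retries, one per period
```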
The amount of time that records for jobs that have finished or been killed remain in `DONE' or `EXIT' status before they are either cleaned out of mbatchd's memory or moved into `PEND' status for the next schedule. During this period, users can still see finished jobs with the bjobs command. For jobs that finished more than CLEAN_PERIOD seconds ago (and as a result have been cleaned out of memory), use the bhist command.
The maximum number of finished jobs whose events are to be stored in an event log file (see the lsb.events(5) manual page). Once the limit is reached, mbatchd switches the event log file. See `LSF JobScheduler Event Log' on page 52.
The lsb.hosts file contains host-related configuration information for the batch server hosts in the cluster. This file is optional.
The optional Host section contains per-host configuration information. Each host, host model or host type can be configured to run a maximum number of jobs. Hosts, host models or host types can also be configured to run jobs only under specific load conditions.
If no hosts, host models or host types are named in this section, LSF JobScheduler uses all hosts in the LSF cluster as server hosts. Otherwise, only the named hosts, host models and host types are used by LSF JobScheduler. If a line in the Host section lists the reserved host name default, LSF JobScheduler uses all hosts in the cluster, and the settings on that line apply to every host not referenced in the section, either explicitly or by listing its model or type.
The first line of this section gives the keywords that apply to the rest of the lines. The keyword HOST_NAME must appear. Other supported keywords are optional.
The name of a host defined in the lsf.cluster.cluster file, a host model or host type defined in the lsf.shared file, or the reserved word default.
The maximum number of job slots for the host. On multiprocessor hosts, MXJ should be set to at least the number of processors to fully use the CPU resource.
r15s, r1m, r15m, ut, pg, io, ls, it, tmp, swp, mem, name
Scheduling and suspending thresholds for the dynamic load indices supported by LIM. Each load index column must contain either the default entry or two numbers separated by a slash `/', with no white space. The first number is the scheduling threshold for the load index; the second number is the suspending threshold. See Section 4, `Resources', beginning on page 45 of the LSF JobScheduler User's Guide for complete descriptions of the load indices.
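The two-number form of a load index entry can be checked mechanically. The following is a minimal Python sketch, assuming an entry such as 3.5/4.5; the names LoadThreshold and parse_threshold are hypothetical, and the sketch deliberately does not handle the default entry, whose notation is not given here.

```python
from typing import NamedTuple


class LoadThreshold(NamedTuple):
    schedule: float  # first number: scheduling threshold
    suspend: float   # second number: suspending threshold


def parse_threshold(entry: str) -> LoadThreshold:
    """Parse a 'schedule/suspend' load index entry.

    Raises ValueError if the entry contains white space, lacks the slash,
    or either side is not a number.
    """
    if any(ch.isspace() for ch in entry):
        raise ValueError("white space is not allowed in a threshold entry")
    schedule, slash, suspend = entry.partition("/")
    if not slash:
        raise ValueError("expected two numbers separated by '/'")
    return LoadThreshold(float(schedule), float(suspend))


# Hypothetical r1m column value.
print(parse_threshold("3.5/4.5"))
```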
The HostGroup section is optional. This section defines names for sets of hosts. A host group name can then be used in other host group and queue definitions, as well as on an LSF JobScheduler command line. When a host group name is used, it has exactly the same effect as listing all of the host names in the group.
The host group section must begin with a line containing the mandatory keywords GROUP_NAME and GROUP_MEMBER. Every other line in this section must contain an alphanumeric string for the group name, followed by a list of host names or previously defined group names enclosed in parentheses and separated by white space.
Host names and host group names can appear in more than one host group. The reserved name all specifies all hosts in the cluster.
Begin HostGroup
GROUP_NAME GROUP_MEMBER
license1 (hostA hostD)
sys_hosts (hostF license1 hostK)
End HostGroup
This example section defines two host groups. The group license1 contains the hosts hostA and hostD; the group sys_hosts contains hostF and hostK, along with all hosts in the group license1. Group names must not conflict with host names.
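Because a group may only refer to previously defined groups, membership can be resolved in a single pass over the section. The following Python sketch illustrates that expansion using the example's host names; the function name and the cluster host list are assumptions for illustration, not LSF JobScheduler code.

```python
def expand_host_groups(definitions, cluster_hosts):
    """Resolve each group to the set of host names it stands for.

    definitions: list of (group_name, members) in file order.
    cluster_hosts: every host in the cluster, used for the reserved name 'all'.
    """
    expanded = {}
    for group, members in definitions:
        hosts = set()
        for member in members:
            if member == "all":
                hosts.update(cluster_hosts)      # reserved name
            elif member in expanded:
                hosts.update(expanded[member])   # previously defined group
            else:
                hosts.add(member)                # plain host name
        expanded[group] = hosts
    return expanded


groups = expand_host_groups(
    [("license1", ["hostA", "hostD"]),
     ("sys_hosts", ["hostF", "license1", "hostK"])],
    cluster_hosts=["hostA", "hostD", "hostF", "hostK"],
)
print(sorted(groups["sys_hosts"]))
```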
The lsb.queues file contains definitions of the queues in an LSF cluster. This file is optional. If no queues are configured, LSF JobScheduler creates a queue named default, with all parameters set to default values (see the description of DEFAULT_QUEUE in `The lsb.params File' on page 93).
Queue definitions are horizontal sections that begin with the line Begin Queue and end with the line End Queue. You can define at most 40 queues in an LSF cluster. Each queue definition contains the following parameters:
The name of the queue. This parameter must be defined, and has no default. The queue name can be any string of non-blank characters up to 40 characters long. It is best to use names 6 to 8 characters long, made up of letters, digits, and possibly underscores `_' or dashes `-'.
This parameter indicates the priority of the queue relative to other LSF JobScheduler queues. Note that this is an LSF JobScheduler dispatching priority, completely independent of the operating system's priority system for time-sharing processes.
LSF JobScheduler tries to schedule jobs from queues with larger PRIORITY values first. This does not mean that jobs in lower priority queues are not scheduled unless higher priority queues are empty. Higher priority queues are checked first, but not all jobs in them are necessarily scheduled. For example, a job might be held because no machine with the right resources is available. Lower priority queues are then checked and, if possible, their jobs are scheduled.
If more than one queue is configured with the same PRIORITY, LSF JobScheduler schedules jobs from all these queues in first-come, first-served order.
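One way to picture the order in which pending jobs are considered is a sort by descending queue PRIORITY, then by submission time. The following Python sketch uses made-up queue names and job records and is not the scheduler's actual algorithm; as noted above, a job considered earlier can still be held while a later one is scheduled.

```python
def consideration_order(jobs, queue_priority):
    """Order jobs by queue PRIORITY (larger first), then first-come, first-served.

    jobs: list of (job_name, queue_name, submit_time) tuples.
    queue_priority: mapping of queue name to its PRIORITY value.
    """
    return sorted(jobs, key=lambda job: (-queue_priority[job[1]], job[2]))


# Hypothetical queues: 'night' and 'normal' share the same PRIORITY.
priorities = {"priority": 43, "night": 20, "normal": 20}
pending = [
    ("j1", "night", 100),
    ("j2", "priority", 105),
    ("j3", "normal", 101),
    ("j4", "night", 99),
]
print([name for name, _, _ in consideration_order(pending, priorities)])
```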
The list of hosts on which jobs from this queue can be run. Each name in the list must be a valid host name, host group name or host partition name as configured in the lsb.hosts file. The name can be optionally followed by +pref_level to indicate the preference for dispatching a job to that host, host group, or host partition. pref_level is a positive number specifying the preference level of that host. If a host preference is not given, it is assumed to be 0.
Hosts at the same level of preference are ordered by load. For example:
HOSTS = hostA+1 hostB hostC+1 servers+3
where servers is a host group name referring to all computer servers. This defines three levels of preferences: run jobs on servers as much as possible, or else on hostA and hostC. Jobs should not run on hostB unless all other hosts are too busy to accept more jobs.
If you use the reserved word 'others', it means jobs should run on all hosts not explicitly listed. You do not need to define this parameter if you want to use all JobScheduler server hosts and you do not need host preferences.
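The preference levels in the HOSTS example above can be derived by splitting each name at the plus sign. The following is an illustrative Python sketch with hypothetical function names; it only groups names by level and does not model load ordering within a level or the reserved word others.

```python
def parse_hosts(value):
    """Map each name in a HOSTS value to its preference level (default 0)."""
    preferences = {}
    for token in value.split():
        name, plus, level = token.partition("+")
        preferences[name] = int(level) if plus else 0
    return preferences


def by_preference(preferences):
    """Group names by preference level, most preferred level first."""
    levels = {}
    for name, level in preferences.items():
        levels.setdefault(level, []).append(name)
    return [levels[level] for level in sorted(levels, reverse=True)]


prefs = parse_hosts("hostA+1 hostB hostC+1 servers+3")
print(by_preference(prefs))
```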
Pre- and post-execution commands can be configured on a per-queue basis. These commands are run on the execution host before and after a job from this queue is run, respectively. By configuring appropriate pre- and/or post-execution commands various situations can be handled such as:
Note that the job-level pre-exec specified with the -E option of bsub is also supported. In some situations (for example, license checking) it is possible to specify a queue-level pre-execution command instead of requiring every job be submitted with the -E option.
The execution commands are specified using the PRE_EXEC and POST_EXEC keywords; for example:
Begin Queue
QUEUE_NAME = priority
PRIORITY = 43
PRE_EXEC = /usr/people/lsf/pri_prexec
POST_EXEC = /usr/people/lsf/pri_postexec
End Queue
The following points should be considered when setting up the pre- and post-execution commands for queues:
/bin/sh -c, so shell features can be used in the command. For example, the following is valid:
PRE_EXEC = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
POST_EXEC = /usr/local/lsf/misc/testq_post | grep -v "Hey!"
LSB_PRE_POST_EXEC_USER in the lsf.sudoers file. See `The lsf.sudoers File' on page 89 for details.
/tmp.
/dev/null. The output from the pre- and post-execution commands can be explicitly redirected to a file for debugging purposes.
PATH environment variable is set to `/bin /usr/bin /sbin /usr/sbin'.
LSB_JOBEXIT_STAT, is set to the exit status of the job. Refer to the manual page for wait(2) for the format of this exit status.
LSB_JOBPEND, is set if the job is requeued. If the job's execution environment could not be set up, LSB_JOBEXIT_STAT is set to 0.
Default: no pre- and post-execution commands
A job starter can be defined for each queue to bring the actual job into the desired environment before execution. The configuration syntax for a job starter is:
JOB_STARTER = starter
Here, the starter string is any executable that can be used to start the job command line.
When LSF JobScheduler runs the job, it executes /bin/sh -c "JOB_STARTER job_command_line". Thus a job starter can be anything that can be run together with the job command line.
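The composition described above amounts to prefixing the job command line with the starter string and handing the result to the shell. The following Python sketch shows only that string composition; the function name, the starter path and the job command are hypothetical, and nothing is executed.

```python
def build_start_command(job_starter, job_command_line):
    """Return the argument vector for /bin/sh -c "JOB_STARTER job_command_line"."""
    if not job_starter.strip():
        raise ValueError("a job starter string is required for this sketch")
    return ["/bin/sh", "-c", f"{job_starter} {job_command_line}"]


# Hypothetical starter that sets up an environment before running the job.
argv = build_start_command("/usr/local/bin/setup_env", "myjob -n 4")
print(argv)
```

Passing the combined string as a single argument after -c is what lets the shell interpret the starter and the job command line together.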
This file defines all alarms in LSF JobScheduler and how alarms should be handled. An alarm provides a mechanism for sending a notification of an alert condition. An alarm is associated with a job when the job is defined. Users can specify when an alarm should be triggered for the job. The alarm definition in this file is read by the raisealarm command, which is invoked by LSF JobScheduler when certain job exceptions are detected. Users can view the alarm definition through the xbalarms GUI or the balarms command. For alarms that require periodic notifications, alarmd reads this file to invoke the appropriate notification method.
The lsb.alarms file consists of multiple "Alarm" sections, where each section corresponds to one alarm. Each section has a number of keywords that define the alarm parameters. The following alarm parameters can be specified in the "Alarm" section:
The name of the alarm. Use any ASCII string with up to 32 characters. The word `default' is reserved and cannot be used as an alarm name.
The notification method to be used to inform users or administrators that the alarm has been triggered. The notification method can be either email or a site-defined executable. To send the notification through email, the value should be:
EMAIL[userlist]
Here, userlist is a list of one or more login names separated by spaces. To invoke a site-defined executable, the value should be of the form:
CMD[cmdname arguments]
In this case, cmdname is the name of the command to be invoked and arguments are any optional arguments supplied to the command.
This parameter controls how often the notification is sent if an alarm incident is not acknowledged. A limit can be set on the maximum number of notifications. The format of this parameter is:
retry_interval or retry_interval/max_retry
Here, retry_interval is the number of minutes between renotifications and max_retry is the maximum number of renotifications. If max_retry is not specified, renotifications are sent until the alarm is acknowledged or until the alarm expires.
By default, if NOTIFICATION_RETRY is not specified, the notification method will only be invoked once.
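The two accepted forms of NOTIFICATION_RETRY can be told apart by the presence of the slash. The following is a minimal Python sketch with hypothetical function names; it computes renotification offsets in minutes after the first notification and ignores acknowledgement and expiration, which would end the sequence earlier.

```python
from itertools import count, islice


def parse_notification_retry(value):
    """Return (retry_interval, max_retry); max_retry is None when omitted."""
    interval, slash, max_retry = value.partition("/")
    interval = int(interval)
    if interval <= 0:
        raise ValueError("retry_interval must be a positive number of minutes")
    return interval, (int(max_retry) if slash else None)


def renotification_minutes(value):
    """Yield minutes after the first notification at which to renotify."""
    interval, max_retry = parse_notification_retry(value)
    offsets = count(interval, interval)  # interval, 2*interval, ...
    return offsets if max_retry is None else islice(offsets, max_retry)


print(list(renotification_minutes("30/5")))
print(list(islice(renotification_minutes("30"), 3)))  # unbounded, so take 3
```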
This parameter specifies the expiration time for an alarm incident in minutes. If an incident is still in the open state after the specified time, its state is automatically changed to expired and the incident is moved to the alarm history file. This is intended to prevent low-priority alarms from filling up the alarm log indefinitely.
By default, the expiration time is infinite.
A brief description of the alarm.
The following are examples of alarm definitions:
Begin Alarm
NAME=DBError
DESCRIPTION=Send Administrator a Page on errors in DBMS
NOTIFICATION=CMD[/usr/local/bin/sendpage dbadmin]
NOTIFICATION_RETRY=30/5
End Alarm
Begin Alarm
NAME=DiskFull
DESCRIPTION=Send LSF administrator a mail when full disk is detected
NOTIFICATION=EMAIL[lsfadmin]
EXPIRATION=40
End Alarm
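To make the section structure concrete, the following Python sketch reads "Alarm" sections into dictionaries keyed by parameter name. It is an illustration only, not how raisealarm or alarmd read the file; the function name is hypothetical and the sketch assumes one KEYWORD=value pair per line.

```python
def parse_alarm_sections(text):
    """Collect each Begin Alarm ... End Alarm block as a dict of parameters."""
    alarms, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Begin Alarm":
            current = {}
        elif line == "End Alarm":
            if current is not None:
                alarms.append(current)
            current = None
        elif current is not None and "=" in line:
            # Split at the first '=' so values may themselves contain '='.
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return alarms


sample = """\
Begin Alarm
NAME=DBError
NOTIFICATION=CMD[/usr/local/bin/sendpage dbadmin]
NOTIFICATION_RETRY=30/5
End Alarm
Begin Alarm
NAME=DiskFull
NOTIFICATION=EMAIL[lsfadmin]
EXPIRATION=40
End Alarm
"""
for alarm in parse_alarm_sections(sample):
    print(alarm["NAME"], alarm["NOTIFICATION"])
```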
This file contains the definitions of system calendars. System calendars are read-only calendars which can be referenced by all users. The use of system calendars is discussed in `System Calendars' on page 66.
The lsb.calendars file consists of multiple "Calendar" sections, each corresponding to a single calendar. Each calendar section requires the NAME and CAL_EXPR parameters and can optionally contain a DESCRIPTION parameter.
A "Calendar" section in the lsb.calendars file is structured as follows:
Begin Calendar
NAME=name
CAL_EXPR=calendar expression
DESCRIPTION=description
End Calendar
The NAME parameter is a character string that names the system calendar.
The syntax of the CAL_EXPR parameter is described in the LSF JobScheduler User's Guide and the bcadd(1) man page.