This chapter describes the operation, maintenance, and tuning of the LSF Base cluster. Since all LSF components depend on LSF Base, its correct operation is essential to the other LSF products.
Error logs contain important information about daemon operations. When you see any abnormal behavior related to any of the LSF daemons, you should check the relevant error logs to find out the cause of the problem.
LSF log files grow over time. These files should occasionally be cleared, either by hand or using automatic scripts.
All LSF log files are reopened each time a message is logged, so if you rename or remove a log file of an LSF daemon, the daemons will automatically create a new log file.
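For example, to clear the LIM log for a host by hand (a sketch; it assumes LSF_LOGDIR is set in your environment and hostA is one of your server hosts):
% mv $LSF_LOGDIR/lim.log.hostA $LSF_LOGDIR/lim.log.hostA.old
% rm $LSF_LOGDIR/lim.log.hostA.old
The LIM creates a fresh lim.log.hostA the next time it logs a message.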
The LSF daemons log messages when they detect problems or unusual situations.
The daemons can be configured to put these messages into files.
On UNIX, the messages can be sent to the system error logs using the syslog facility.
If LSF_LOGDIR is defined in the lsf.conf file, LSF daemons try to store their messages in files in that directory. Note that LSF_LOGDIR must be writable by root. The error log file names for the LSF Base system daemons, LIM and RES, are lim.log.hostname and res.log.hostname. The error log file names for the LSF Batch daemons are sbatchd.log.hostname, mbatchd.log.hostname, and pim.log.hostname.
If LSF_LOGDIR is defined, but the daemons cannot write to files there, the error log files are created in /tmp.
On UNIX, if LSF_LOGDIR is not defined, then errors are logged to syslog using the LOG_DAEMON facility. syslog messages are highly configurable, and the default configuration varies widely from system to system. Start by looking for the file /etc/syslog.conf, and read the manual pages for syslog and/or syslogd.
On Windows NT, if LSF_LOGDIR is defined but the daemons cannot write to files there, the error log files are created in C:\temp.
LSF daemons log error messages at different levels so that you can choose to log all messages, or only those that are deemed critical. Message logging is controlled by the parameter LSF_LOG_MASK in the lsf.conf file. Possible values for this parameter are any of the log priority symbols defined in <syslog.h>. The default value for LSF_LOG_MASK is LOG_WARNING.
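For example, to log all messages, including debugging output, you might set the following in lsf.conf (a sketch; the accepted symbols are those defined in your system's <syslog.h>):
LSF_LOG_MASK=LOG_DEBUG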
If the error log is managed by syslog, it is probably already being automatically cleared.
If LSF daemons cannot find the lsf.conf file when they start, they will not find the definition of LSF_LOGDIR. In this case, error messages go to syslog. If you cannot find any error messages in the log files, they are likely in the syslog.
See `Troubleshooting and Error Messages' on page 239 for a discussion of common problems and error log messages.
The FLEXlm license server daemons log messages about the state of the license servers, and when licenses are checked in or out. This log helps to resolve problems with the license servers and to track license use.
The FLEXlm log is configured by the lsflicsetup
command as described in `Installing
a New Permanent License' in the LSF
Installation Guide. This log file grows over time. You can remove
or rename the existing FLEXlm log file at any time. The script lsf_license
used to run the FLEXlm daemons creates a new log file when necessary.
If you already have a FLEXlm server running for other products and LSF licenses are added to the existing license file, then the FLEXlm log messages go to the same place you previously set up for the other products.
The LSF cluster administrator can monitor the status of the hosts in a cluster, start and stop the LSF daemons, and reconfigure the cluster. Many operations are performed using the lsadmin command, which performs administrative operations on the LSF Base daemons, LIM and RES.
The lshosts
and lsload
commands report the current status and load levels of hosts in an LSF cluster. The lsmon
and xlsmon
commands provide a running display of the same information. The LSF administrator can find unavailable or overloaded hosts with these tools.
% lsload
HOST_NAME  status  r15s   r1m  r15m   ut    pg  ls  it  tmp   swp   mem
hostD      ok       1.3   1.2   0.9  92%   0.0   2  20   5M  148M   88M
hostB      -ok      0.1   0.3   0.7   0%   0.0   1  67  45M   25M   34M
hostA      busy     8.0  *7.0   4.9  84%   4.6   6  17   1M   81M   27M
When the status of a host is preceded by a `-', it means RES is not running on that host. In the above example, RES on hostB is down.
LIM and RES can be restarted to upgrade software or clear persistent errors. Jobs running on the host are not affected by restarting the daemons. The LIM and RES daemons are restarted using the lsadmin
command:
% lsadmin
lsadmin>limrestart hostD
Checking configuration files ...
No errors found.
Restart LIM on <hostD> ...... done
lsadmin>resrestart hostD
Restart RES on <hostD> ...... done
lsadmin>quit
You must log in as the LSF cluster administrator to run the lsadmin command.
The lsadmin
command can be applied to all available hosts by using the host name all
; for example, lsadmin limrestart all
. If a daemon is not responding to network connections, lsadmin displays an error message with the host name. In this case you must kill and restart the daemon manually.
LSF administrators can start up any or all LSF daemons, on any or all LSF hosts, from any host in the LSF cluster. For this to work, the lsf.sudoers file has to be set up properly to allow you to start up daemons as root. You should be able to run rsh across LSF hosts without having to enter a password. See `The lsf.sudoers File' on page 189 for configuration details of lsf.sudoers.
The limstartup
and resstartup
options in lsadmin
allow for the startup of the LIM and RES daemons respectively. Specifying a host name allows for starting up a daemon on a particular host. For example:
% lsadmin limstartup hostA
Starting up LIM on <hostA> ...... done
% lsadmin resstartup hostA
Starting up RES on <hostA> ...... done
The lsadmin
command can be used to start up all available hosts by using the host name all
; for example, lsadmin limstartup all
. All LSF daemons, including LIM, RES, and sbatchd
, can be started on all LSF hosts using the command lsfstartup
.
All LSF daemons can be shut down at any time. If the LIM daemon on the current master host is shut down, another host automatically takes over as master. If the RES daemon is shut down while remote interactive tasks are running on the host, the running tasks continue but no new tasks are accepted. To shut down LIM and RES, use the lsadmin command:
% lsadmin
lsadmin>resshutdown hostD
Shut down RES on <hostD> ...... done
lsadmin>limshutdown hostD
Shut down LIM on <hostD> ...... done
lsadmin>quit
You can run lsadmin reconfig
while the LSF system is in use; users might be unable to submit new jobs for a short time, but all current remote executions are unaffected.
A LIM can be locked to temporarily prevent any further jobs from being sent to the host. The lock can be set to last either for a specified period of time, or until the host is explicitly unlocked. Only the local host can be locked and unlocked.
% lsadmin limlock
Host is locked
% lsload
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
hostD ok 1.3 1.2 0.9 92% 0.0 2 20 5M 148M 28M
hostA busy 8.0 *7.0 4.9 84% 0.6 0 17 *1M 31M 7M
hostC lockU 0.8 1.0 1.1 73% 1.2 3 0 4M 44M 12M
% lsadmin limunlock
Host is unlocked
Only root and the LSF administrator can lock and unlock hosts.
LSF configuration consists of several levels:
lsf.conf--The primary LSF environment configuration file
lsf.shared and lsf.cluster.cluster--Configuration files for the Load Information Manager (LIM), where cluster is the name of your cluster
lsf.task and lsf.task.cluster--The files containing task to default resource requirement string mappings
LSB_CONFDIR/cluster--The directory containing configuration files for LSF Batch
This is the generic LSF environment configuration file. This file defines general installation parameters so that all LSF executables can find the necessary information. This file is typically installed in the LSF_CONFDIR
directory (the same directory as the LIM configuration files), and a symbolic link is made from a convenient directory as defined by the environment variable LSF_ENVDIR
, or the default directory /etc
. This file is created by lsfsetup during installation. Note that many of the parameters in this file are machine specific. The contents of this file are described in detail in `The lsf.conf File' on page 161.
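A sketch of a few typical lsf.conf entries (the directory paths are illustrative only; your installation will differ):
LSF_CONFDIR=/usr/local/lsf/mnt/conf
LSB_CONFDIR=/usr/local/lsf/mnt/conf/lsbatch
LSF_SERVERDIR=/usr/local/lsf/mnt/etc
LSF_LOGDIR=/usr/local/lsf/mnt/log
LSF_LOG_MASK=LOG_WARNING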
LIM is the kernel of your cluster that provides the single system image to all applications. LIM reads the LIM configuration files and determines your cluster and the cluster master host.
LIM files include lsf.shared and lsf.cluster.cluster, where cluster is the name of your LSF cluster. These files define the host members, general host attributes, and resource definitions for your cluster. The individual functions of each of these files are described below.
lsf.shared
defines the available resource names, host types, host models, cluster names, and external load indices that can be used by all clusters. This file is shared by all clusters.
The lsf.cluster.cluster file is a per-cluster configuration file. It contains two types of configuration information: cluster definition information and LIM policy information. Cluster definition information impacts all LSF applications, while LIM policy information impacts applications that rely on LIM's policies for job placement.
The cluster definition information defines cluster administrators, all the hosts that make up the cluster, attributes of each individual host such as host type or host model, and resources using the names defined in lsf.shared
.
LIM policy information defines the load sharing and job placement policies provided by LIM. More details about LIM policies are described in `Tuning LIM Load Thresholds' on page 69.
LIM configuration files are stored in the directory LSF_CONFDIR, as defined in the lsf.conf file. Details of the LIM configuration files are described in `The lsf.shared File' on page 173.
lsf.task
is a system-wide task to `default resource requirement string' mapping file. This file defines mappings between task names and their default resource requirements. LSF maintains a task list for each user in the system. The lsf.task
file is useful for the cluster administrator to set task-to-resource requirement mapping at the system level. Individual users can customize their own list by using the lsrtasks
command (See lsrtasks(1)
man page for details on this command).
When you run a job with an LSF command such as bsub or lsrun, the command consults your task list to find the default resource requirement string for the job if one is not specified explicitly. If a match is not found in your task list, the system assumes a default, which typically means running the job on a host of the same host type as the local host.
There is also a per-cluster file, lsf.task.cluster, that applies to the cluster only and overrides the system-wide definitions. Individual users can override both the system-wide and cluster-wide files by using the lsrtasks command.
The lsf.task and lsf.task.cluster files are installed in the directory LSF_CONFDIR, as defined in the lsf.conf file.
These files define LSF Batch specific configuration such as queues, batch server hosts, and batch user controls. These files are only read by mbatchd
. The LSF Batch configuration relies on LIM configuration. LSF Batch daemons get the cluster configuration information from the LIM via the LSF API.
LSF Batch configuration files are stored in directory LSB_CONFDIR/
cluster, where LSB_CONFDIR
is defined in lsf.conf
, and cluster is the name of your cluster. Details of LSF Batch configuration files are described in `Managing LSF Batch' on page 79.
All configuration files except lsf.conf
use a section-based format. Each file contains a number of sections. Each section starts with a line beginning with the reserved word Begin
followed by a section name, and ends with a line beginning with the reserved word End
followed by the same section name. Begin
, End
, section names, and keywords are all case insensitive.
Sections can either be vertical or horizontal. A horizontal section contains a number of lines, each having the format: keyword = value
, where value is one or more strings. For example:
Begin exampleSection
key1 = string1
key2 = string2 string3
key3 = string4
End exampleSection Begin exampleSection
key1 = STRING1
key2 = STRING2 STRING3
End exampleSection
In many cases you can define more than one object of the same type by giving more than one horizontal section with the same section name.
A vertical section has a line of keywords as the first line. The lines following the first line are values assigned to the corresponding keywords. Values that contain more than one string must be bracketed with `(' and `)'. The above examples can also be expressed in one vertical section:
Begin exampleSection
key1 key2 key3
string1 (string2 string3) string4
STRING1 (STRING2 STRING3) -
End exampleSection
Each line in a vertical section is equivalent to a horizontal section with the same section name.
Some keys in certain sections are optional. For a horizontal section, an optional key does not appear in the section if its value is not defined. For a vertical section, an optional keyword must appear in the keyword line if any line in the section defines a value for that keyword. To specify the default value use `-' or `()' in the corresponding column, as shown for key3
in the example above.
Each line can have multiple columns, separated by either spaces or TAB characters. Lines can be extended by a `\
' (back slash) at the end of a line. A `#
' (pound sign) indicates the beginning of a comment; characters up to the end of the line are not interpreted. Blank lines are ignored.
Below are some examples of LIM configuration files. The detailed explanations of the variables are described in `LSF Base Configuration Reference' on page 161.
Begin Cluster
ClusterName # This line is keyword(s)
test_cluster
End Cluster

Begin HostType
TYPENAME                # This line is keyword(s)
hppa
SUNSOL
rs6000
alpha
NTX86
End HostType

Begin HostModel
MODELNAME  CPUFACTOR    # This line is keyword(s)
HP735      4.0
DEC3000    5.0
ORIGIN2K   8.0
PENTI120   3.0
End HostModel

Begin Resource
RESOURCENAME  TYPE     INTERVAL  INCREASING  DESCRIPTION                        # This line is keyword(s)
hpux          Boolean  ()        ()          (HP-UX operating system)
decunix       Boolean  ()        ()          (Digital Unix)
solaris       Boolean  ()        ()          (Sun Solaris operating system)
NT            Boolean  ()        ()          (Windows NT operating system)
fserver       Boolean  ()        ()          (File Server)
cserver       Boolean  ()        ()          (Compute Server)
scratch       Numeric  30        N           (Shared scratch space on server)
verilog       Numeric  30        N           (Floating licenses for Verilog)
console       String   30        N           (User Logged in on console)
End Resource
Example lsf.cluster.test_cluster
file:
Begin ClusterManager
Manager = lsf user7
End ClusterManager

Begin Host
HOSTNAME  Model     Type   server  swp  Resources
hostA     HP735     hppa   1       2    (fserver hpux)
hostD     ORIGIN2K  sgi    1       2    (cserver)
hostB     PENT200   NTX86  1       2    (NT)
End Host
In the above file, the ClusterManager section uses the horizontal format, while the Host section uses the vertical format.
Other LSF Batch configuration files are described in `Example LSF Batch Configuration Files' on page 136.
This section provides procedures for some common changes to the LIM configuration. There are three different ways for you to change the LIM configuration:
the lsfsetup program, as described in various sections of the LSF Installation Guide
the xlsadmin tool (a graphical application)
a text editor, used to edit the configuration files directly
The following discussions focus on changing configuration files using a text editor so that you can understand the concepts behind the configuration changes. See `Managing an LSF Cluster Using xlsadmin' on page 99 for the use of xlsadmin
in changing configuration files.
If you run LSF Batch, you must restart mbatchd using the badmin reconfig command each time you change the LIM configuration, even if the LSF Batch configuration files do not change. This is necessary because the LSF Batch configuration depends on the LIM configuration.
Edit the HostType section of the lsf.shared file to add the new host type. A host type can be any alphanumeric string up to 29 characters long.
Edit the HostModel section of your lsf.shared file to add the new model together with its CPU speed factor relative to other models.
Add the new host to the Host section of the lsf.cluster.cluster file, with the host name, host type, and all other attributes defined, as shown in `Example Configuration Files' on page 54.
The master LIM and mbatchd daemons run on the first available host in the Host section of your lsf.cluster.cluster file, so you should list reliable batch server hosts first. For more information see `Fault Tolerance' on page 5.
If you are adding a client host, set the SERVER field for the host to 0 (zero).
Run LSF_SERVERDIR/lsf_daemons start and use ps to make sure that res, lim, and sbatchd have started.
The lsf_daemons start command must be run as root. If you are creating a private cluster, do not attempt to use lsf_daemons to start your daemons, as this command will kill all running daemons on the system before starting new ones. Start them manually.
Edit the lsf.cluster.cluster file and remove the unwanted hosts from the Host section.
Run lsadmin resshutdown host1 host2 ..., where host1, host2, ... are the hosts you want to remove from your cluster.
If any users of the host use lstcsh as their login shell, change their login shell to tcsh or csh. Remove lstcsh from the /etc/shells file.
Your cluster is most likely heterogeneous. Even if your computers are all the same, it might still be heterogeneous. For example, some machines are configured as file servers, while others are compute servers; some have more memory, others have less; some have four CPUs, others have only one; some have host-locked software licenses installed, others do not.
LSF provides powerful resource selection mechanisms so that correct hosts with required resources are chosen to run your jobs. For maximum flexibility, you should characterize your resources clearly enough so that users have satisfactory choices. For example, if some of your machines are connected to both Ethernet and FDDI, while others are only connected to Ethernet, then you probably want to define a resource called fddi
and associate the fddi
resource to machines connected to FDDI. This way, users can specify resource fddi
if they want their jobs to run on machines connected to FDDI.
To customize host resources for your cluster, perform the following procedure:
Add the new resource names to the Resource section of the lsf.shared file, with a brief description for each of the added resource names. Resource descriptions are displayed to users by the lsinfo command.
Set up the lsf.task file to reflect the new resources in the resource requirements of the relevant applications. Alternatively, you can leave this to individual users, who can use the lsrtasks command to customize their own files.
Edit the lsf.cluster.cluster file to modify the RESOURCES column of the Host section so that all hosts that have the added resources list the added resource names in that column, as shown in the sketch below.
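A minimal sketch of these changes for the fddi resource described earlier (host names and attributes are illustrative):
In the Resource section of lsf.shared:
fddi     Boolean  ()  ()  (Host connected to FDDI)
In the Host section of lsf.cluster.cluster:
HOSTNAME  Model  Type  server  swp  Resources
hostA     HP735  hppa  1       2    (fserver hpux fddi)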
Resources are defined in the Resource
section of the lsf.shared
file. The definition of a resource involves specifying a name and description, as well as, optionally, the type of its value, its update interval, and whether a higher or lower value indicates greater availability.
The mandatory resource information fields are:
RESOURCENAME
indicating the name of the resource
DESCRIPTION
that should indicate what the resource represents.
The optional resource information fields are:
TYPE
indicating the type of its value (boolean, numeric, or string)
INTERVAL
indicating how often the value is updated (for resources whose value changes dynamically)
INCREASING
flag indicating whether a higher value represents a greater availability of the resource (for numeric resources which can be used for scheduling jobs).
When the optional attributes are not specified, the resource is treated as static and boolean-valued.
The following is a sample of a Resource
section from an lsf.shared
file:
Begin Resource
RESOURCENAME TYPE INTERVAL INCREASING DESCRIPTION
mips Boolean () () (MIPS architecture)
dec Boolean () () (DECStation system)
sparc Boolean () () (SUN SPARC)
hppa Boolean () () (HPPA architecture)
bsd Boolean () () (BSD unix)
sysv Boolean () () (System V UNIX)
hpux Boolean () () (HP-UX UNIX)
aix Boolean () () (AIX UNIX)
nt Boolean () () (Windows NT)
scratch Numeric 30 N (Shared scratch space on server)
synopsys Numeric 30 N (Floating licenses for Synopsys)
verilog Numeric 30 N (Floating licenses for Verilog)
console String 30 N (User Logged in on console)
End Resource
There is no distinction between shared and non-shared resources in the resource definition in the lsf.shared
file.
The NewIndex section in the lsf.shared file is obsolete. To achieve the same effect, the Resource section of the lsf.shared file can be used to define a dynamic numeric resource, and the default keyword can be used in the LOCATION field of the ResourceMap section of the lsf.cluster.cluster file.
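For example, the effect of an old NewIndex entry can be achieved as follows (a sketch; nio is a hypothetical dynamic index collected by an ELIM, and the ResourceMap section is described below):
In the Resource section of lsf.shared:
nio     Numeric  30  Y  (Network I/O rate)
In the ResourceMap section of lsf.cluster.cluster:
nio     ([default])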
Resources are associated with the host(s) on which they
are available in the ResourceMap
section of the lsf.cluster.
cluster
file (where cluster is the name of the cluster). The following fields must be
completed for each resource:
RESOURCENAME
indicating the name of the resource, as defined in the lsf.shared
file
LOCATION
indicating whether the resource is shared or non-shared, across which hosts, and with which initial value(s).
The following is an example of a ResourceMap
section from an lsf.cluster.
cluster
file:
Begin ResourceMap
RESOURCENAME LOCATION
verilog 5@[all]
synopsys (2@[apple] 2@[others])
console (1@[apple] 1@[orange])
End ResourceMap
The possible states of a resource that may be specified in the LOCATION
column are:
For static resources, the LOCATION
column should contain the value of the resource.
The syntax of the information in the LOCATION
field takes one of two forms. For static resources, where the value must be specified, use:
(value1@[host1 host2 ...] value2@[host3 host4] ...)
For dynamic resources, where the value is updated by an ELIM, use:
([host1 host2 ...] [host3 host4] ...)
Each set of hosts listed within the square brackets specifies an instance of the resource. All hosts within the instance share the resource whose quantity is indicated by its value. In the above example, host1
, host2
,... form one instance of the resource, host3
, host4
,... form another instance, and so on.
The same host cannot be in more than one instance of a resource.
Three predefined words have special meaning in this specification:
all refers to all the server hosts in the cluster; for example, value@[all] means the resource is shared by all server hosts in the cluster made up of host1 host2 ... hostn
others refers to the rest of the server hosts listed in the cluster; for example, (2@[apple] 2@[others]) means there are 2 units of synopsys on apple, and 2 shared by all other hosts
default refers to each host; for example, value@[default] is equivalent to (value@[host1] value@[host2] ... value@[hostn]) where host1, ... hostn are all server hosts in the cluster
These syntax examples assume that static resources (requiring values) are being specified. For dynamic resources, use the same syntax but omit the value
.
The following items should be taken into consideration when configuring resources under LSF Base.
In the lsf.cluster.cluster file, the Host section must precede the ResourceMap section, since the ResourceMap section uses the host names defined in the Host section.
The RESOURCES column in the Host section of the lsf.cluster.cluster file should be used to associate static boolean resources with particular hosts. Using the ResourceMap section for static boolean resources results in an empty RESOURCES column in the lshosts(1) display.
Resources defined in the ResourceMap section are treated as shared resources, which are displayed using the lsload -s or lshosts -s commands. The exception is dynamic numeric resources specified using the default predefined word. These are treated together with load indices such as mem and swap, and are viewed using the lsload -l command.
If the ResourceMap
section is not defined, then any dynamic resources specified in lsf.shared
are considered to be host-based (the resource is available on each host in the cluster).
After changing LIM configuration files, you must tell LIM to read the new configuration. Use the lsadmin
command to tell LIM to pick up the new configuration.
Operations can be specified on the command line or entered at a prompt. Run the lsadmin
command with no arguments, and enter help
to see the available operations.
The lsadmin reconfig
command checks the LIM configuration files for errors. If no errors are found, the command confirms that you want to restart the LIMs on all hosts, and reconfigures all the LIM daemons:
% lsadmin reconfig
Checking configuration files ...
No errors found.
Do you really want to restart LIMs on all hosts? [y/n] y
Restart LIM on <hostD> ...... done
Restart LIM on <hostA> ...... done
Restart LIM on <hostC> ...... done
In the above example, no errors are found. If any non-fatal errors are found, the command asks you to confirm the reconfiguration. If fatal errors are found, the reconfiguration is aborted.
If you want to see details on any errors, run the command lsadmin ckconfig -v
. This reports all errors to your terminal.
If you change the configuration file of LIM, you should also reconfigure LSF Batch by running badmin reconfig
because LSF Batch depends on LIM configuration. If you change the configuration of LSF Batch, then you only need to run badmin reconfig
.
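For example, after editing lsf.shared or lsf.cluster.cluster in a cluster that also runs LSF Batch, a typical sequence is:
% lsadmin reconfig
% badmin reconfig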
The values of static external resources are specified through the lsf.cluster.cluster configuration file. All dynamic resources, regardless of whether they are shared or host-based, are collected through an ELIM. An ELIM is started in the following situations:
If the LOCATION field in the ResourceMap section of lsf.cluster.cluster is ([default]), then every host starts an ELIM.
If the LOCATION field in the ResourceMap section of lsf.cluster.cluster is ([all]), then an ELIM is started on the master host.
If the LOCATION field in the ResourceMap section of lsf.cluster.cluster is ([hostA hostB hostC] [hostD hostE hostF]), then an ELIM is started on hostA and on hostD to report the value of that resource for each set of hosts.
If the host reporting the value for an instance goes down, then an ELIM is started on the next available host in the instance. In the above example, if hostA becomes unavailable, an ELIM is started on hostB. If hostA becomes available again, the ELIM on hostB is shut down and the one on hostA is started.
There is only one ELIM on each host, regardless of the number of resources on which it reports. If only cluster-wide resources are to be collected, then an ELIM will only be started on the master host. When LIM starts, the following environment variables are set for ELIM:
LSF_MASTER
: This variable is defined if the ELIM is being invoked on the master host. It is undefined otherwise. This can be used to test whether the ELIM should report on cluster-wide resources that only need to be collected on the master host.
LSF_RESOURCES
: This variable contains a list of resource names (separated by spaces) on which the ELIM is expected to report. A resource name is only put in the list if the host on which the ELIM is running shares an instance of that resource.
The following restrictions apply to the use of shared resources in LSF products.
A shared resource cannot be used as a load threshold in the Hosts section of the lsf.cluster.cluster file.
A shared resource cannot be used in the loadSched/loadStop thresholds, or in the STOP_COND or RESUME_COND parameters in the queue definition in the lsb.queues file.
The ELIM can be any executable program, either an interpreted script or compiled code. Example code for an ELIM is included in the examples
directory in the LSF distribution. The elim.c
file is an ELIM written in C. You can customize this example to collect the load indices you want.
The ELIM communicates with the LIM by periodically writing a load update string to its standard output. The load update string contains the number of indices followed by a list of name-value pairs in the following format:
N name1 value1 name2 value2 ... nameN valueN
For example:
3 tmp2 47.5 nio 344.0 licenses 5
This string reports three indices: tmp2
, nio
, and licenses
, with values 47.5, 344.0, and 5 respectively. Index values must be numbers between -INFINIT_LOAD
and INFINIT_LOAD
as defined in the lsf.h
header file.
If the ELIM is implemented as a C program, as part of initialization it should use setbuf(3)
to establish unbuffered output to stdout
.
The ELIM should ensure that the entire load update string is written successfully to stdout
. This can be done by checking the return value of printf(3s)
if the ELIM is implemented as a C program or as the return code of /bin/echo(1)
from a shell script. The ELIM should exit if it fails to write the load information.
Each LIM sends updated load information to the master every 15 seconds. Depending on how quickly your external load indices change, the ELIM should write the load update string at most once every 15 seconds. If the external load indices rarely change, the ELIM can write the new values only when a change is detected. The LIM continues to use the old values until new values are received.
The executable for the ELIM must be in LSF_SERVERDIR
and must have the name elim
. If LIM expects some resources to be collected by an ELIM according to configuration, it invokes the ELIM automatically on startup. The ELIM runs with the same user id and file access permission as the LIM.
The LIM restarts the ELIM if it exits; to prevent problems in case of a fatal error in the ELIM, it is restarted at most once every 90 seconds. When the LIM terminates, it sends a SIGTERM
signal to the ELIM. The ELIM must exit upon receiving this signal.
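The following is a minimal sketch of an ELIM written in C, along the lines of the elim.c example in the LSF distribution. It reports a single hypothetical external index named scratch; the measurement function is a stub to be replaced with a real check, and the handling of LSF_RESOURCES is only illustrative.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Stub: replace with a real measurement, for example the free space (in MB) on the scratch server. */
static double measure_scratch(void)
{
    return 100.0;
}

int main(void)
{
    /* LSF_RESOURCES lists the indices the LIM expects this host to report. */
    const char *resources = getenv("LSF_RESOURCES");
    int report_scratch = (resources != NULL && strstr(resources, "scratch") != NULL);

    /* Unbuffered output so each load update string reaches the LIM immediately. */
    setbuf(stdout, NULL);

    /* The default SIGTERM action terminates the process, as the LIM requires. */
    for (;;) {
        if (report_scratch) {
            /* Load update string: number of indices, then name-value pairs. */
            if (printf("1 scratch %.1f\n", measure_scratch()) < 0)
                exit(1);    /* exit if the write to stdout fails */
        }
        sleep(15);          /* LIM forwards load information every 15 seconds */
    }
}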
The ELIM can also return values for the built-in load indices. In this case the value produced by the ELIM overrides the value produced by the LIM. The ELIM must ensure that the semantics of any index it supplies are the same as that of the corresponding index returned by the lsinfo(1)
command.
For example, some sites prefer to use /usr/tmp
for temporary files. To override the tmp
load index, write a program that periodically measures the space in the /usr/tmp
file system and writes the value to standard output. Name this program elim
and store it in the LSF_SERVERDIR
directory.
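A minimal sketch of such an override in C (it assumes a POSIX statvfs() is available and reports the available space in megabytes; adapt the measurement so its semantics match what lsinfo(1) documents for tmp on your systems):
#include <stdio.h>
#include <unistd.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs fs;

    setbuf(stdout, NULL);                 /* unbuffered output to the LIM */
    for (;;) {
        if (statvfs("/usr/tmp", &fs) == 0) {
            /* Available space in /usr/tmp, reported as the built-in tmp index (in MB). */
            double mb = (double)fs.f_bavail * (double)fs.f_frsize / (1024.0 * 1024.0);
            if (printf("1 tmp %.1f\n", mb) < 0)
                return 1;                 /* exit if the write to stdout fails */
        }
        sleep(15);
    }
}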
The name of an external load index must not be one of the resource name aliases cpu, idle, logins, or swap. To override one of these indices, use its formal name: r1m, it, ls, or swp.
You must configure the external load index even if you are overriding a built-in load index.
LIM provides very critical services to the all LSF components. In addition to the timely collection of resource information, LIM also provides host selection and job placement policies. If you are using the LSF MultiCluster product, LIM policies also determine how different clusters should exchange load and resource information.
LIM policies are advisory information for applications. Applications can either use the placement decision from the LIM, or make further decisions based on information from the LIM.
Most of the LSF interactive tools, such as lsrun
and lstcsh
, use LIM policies to place jobs on the network. LSF Batch uses load and resource information from LIM and makes its own placement decisions based on other factors in addition to load information.
As was described in `Overview of LSF Configuration Files' on page 50, the LIM configuration files define the load-sharing policies. The LIM configuration parameters that affect LIM policies include:
Load thresholds and dispatch windows for each host in lsf.cluster.cluster. Dispatch windows cause hosts to become locked outside the time windows so that LIM will not advise jobs to go to those hosts. Details of these parameters are described in `Hosts' on page 182.
LIM thresholds and run windows affect the job placement advice of the LIM. Job placement advice is not enforced by LIM. LSF Batch, for example, does not follow the policies of the LIM.
Parameters in the lsf.cluster.cluster file that apply to the LSF MultiCluster product only. These parameters define the relationship between the local cluster and remote clusters and the direction of job placement flows across clusters. See `Managing LSF MultiCluster' on page 143 for details.
There are two main goals in adjusting the LIM configuration parameters: improving response time, and reducing interference with interactive use. To improve response time, LSF should be tuned to correctly select the best available host for each job. To reduce interference, LSF should be tuned to avoid overloading any host.
CPU factors are used to differentiate the relative speed of different machines. LSF runs jobs on the best possible machines so that the response time is minimized. To achieve this, it is important that you define correct CPU factors for each machine model in your cluster by changing the HostModel
section of your lsf.shared
file.
CPU factors should be set based on a benchmark that reflects your work load. (If there is no such benchmark, CPU factors can be set based on raw CPU power.) The CPU factor of the slowest hosts should be set to one, and faster hosts should be proportional to the slowest. For example, consider a cluster with two hosts, hostA and hostB, where hostA takes 30 seconds to run your favourite benchmark and hostB takes 15 seconds to run the same test. hostA should have a CPU factor of 1, and hostB (since it is twice as fast) should have a CPU factor of 2.
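The corresponding HostModel entries would look like the following sketch (ModelA and ModelB are illustrative model names for the two hosts):
Begin HostModel
MODELNAME  CPUFACTOR
ModelA     1.0
ModelB     2.0
End HostModel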
LSF uses a normalized CPU performance rating to decide which host has the most available CPU power. The normalized ratings can be seen by running the lsload -N
command. The hosts in your cluster are displayed in order from best to worst. Normalized CPU run queue length values are based on an estimate of the time it would take each host to run one additional unit of work, given that an unloaded host with CPU factor 1 runs one unit of work in one unit of time.
Incorrect CPU factors can reduce performance in two ways. If the CPU factor for a host is too low, that host may not be selected for job placement when a slower host is available. This means that jobs would not always run on the fastest available host. If the CPU factor is too high, jobs are run on the fast host even when they would finish sooner on a slower but lightly loaded host. This causes the faster host to be overused while the slower hosts are underused.
Both of these conditions are somewhat self-correcting. If the CPU factor for a host is too high, jobs are sent to that host until the CPU load threshold is reached. The LIM then marks that host as busy, and no further jobs will be sent there. If the CPU factor is too low, jobs may be sent to slower hosts. This increases the load on the slower hosts, making LSF more likely to schedule future jobs on the faster host.
The Host
section of the lsf.cluster.
cluster
file can contain busy thresholds for load indices. You do not need to specify
a threshold for every index; indices that are not listed do not affect the scheduling
decision. These thresholds are a major factor in influencing LSF performance.
This section does not describe all LSF load indices; see `Resource
Requirements' on page 24 and `Threshold
Fields' on page 184 for more complete discussions.
The parameters that most often affect performance are:
r15s--The 15-second average CPU run queue length
r1m--The 1-minute average CPU run queue length
r15m--The 15-minute average CPU run queue length
pg--The paging rate in pages per second
swp--The available swap space
For tuning these parameters, you should compare the output of lsload
to the thresholds reported by lshosts -l
.
The lsload
and lsmon
commands display an asterisk `*' next to each load index that exceeds its threshold. For example, consider the following output from lshosts -l
and lsload
:
% lshosts -l
HOST_NAME: hostD
...
LOAD_THRESHOLDS:
r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
-     3.5  -     -   15  -   -   -   -    2M   1M

HOST_NAME: hostA
...
LOAD_THRESHOLDS:
r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
-     3.5  -     -   15  -   -   -   -    2M   1M

% lsload
HOST_NAME  status  r15s  r1m  r15m   ut     pg  ls  it  tmp  swp  mem
hostD      ok       0.0  0.0   0.0    0%   0.0   6   0  30M  32M  10M
hostA      busy     1.9  2.1   1.9   47%  *69.6 21   0  38M  96M  60M
In this example, hostD is ok
. However, hostA is busy
; the pg
(paging rate) index is 69.6, above the threshold of 15.
Other monitoring tools such as xlsmon
also help to show the effects of changes.
If the LIM often reports a host to be busy
when the CPU run queue length is low, the most likely cause is the paging rate threshold. Different operating systems assign subtly different meanings to the paging rate statistic, so the threshold needs to be set at different levels for different host types. In particular, HP-UX systems need to be configured with significantly higher pg
values; try starting at a value of 50 rather than the default of 15.
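For example, the paging rate threshold for an HP-UX host could be raised by setting the pg column in the Host section of the lsf.cluster.cluster file (a sketch with illustrative host and threshold values; the columns must match the keyword line of your own Host section):
Begin Host
HOSTNAME  model  type  server  r1m  pg  swp  RESOURCES
hostH     HP735  hppa  1       3.5  50  2M   (hpux)
End Host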
If the LIM often shows systems busy
when the CPU utilization and run queue lengths are relatively low and the system is responding quickly, try raising the pg
threshold. There is a point of diminishing returns; as the paging rate rises, eventually the system spends too much time waiting for pages and the CPU utilization decreases. Paging rate is the factor that most directly affects perceived interactive response. If a system is paging heavily, it feels very slow.
The CPU run queue threshold can be reduced if you find that interactive jobs slow down your response too much while the LIM still reports your host as ok
. Likewise, it can be increased if hosts become busy at too low a load.
On multi-processor systems, the CPU run queue threshold is compared to the effective run queue length as displayed by the lsload -E
command. The run queue threshold should be configured as the load limit for a single processor. Sites with a variety of uniprocessor and multi-processor machines can use a standard value for r15s
, r1m
and r15m
in the configuration files, and the multi-processor machines will automatically run more jobs. Note that the normalized run queue length printed by lsload -N
is scaled by the number of processors. See Section 4, `Resources', beginning on page 35 of the LSF Batch User's Guide and lsfintro(1)
for the concept of effective and normalized run queue lengths.
Because LSF takes a wide variety of measurements on the hosts in your network, it can be a powerful tool for monitoring and capacity planning. The lsmon
command gives updated information that can quickly identify problems such as inaccessible hosts or unusual load levels. The lsmon -L
option logs the load information to a file for later processing. See the lsmon(1)
and lim.acct(5)
manual pages for more information.
For example, if the paging rate (pg
) on a host is always high, adding memory to the system will give a significant increase in both interactive performance and total throughput. If the pg
index is low but the CPU utilization (ut
) is usually more than 90 percent, the CPU is the limiting resource. Getting a faster host, or adding another host to the network, would provide the best performance improvement. The external load indices can be used to track other limited resources such as user disk space, network traffic, or software licenses.
The xlsmon program is a Motif graphical interface to the LSF load information. The xlsmon display uses colour to highlight busy and unavailable hosts, and can show both the current levels and scrolling histories of selected load indices.
See Section 3, `Cluster Information', beginning on page 25 of the LSF Batch User's Guide for more information about xlsmon
.
LSF software is licensed using the FLEXlm license manager from Globetrotter Software, Inc. The LSF license key controls the hosts allowed to run LSF. The procedures for obtaining, installing, and upgrading license keys are described in `Getting License Key Information' and `Setting Up the License Key' in the LSF Installation Guide. This section provides background information on FLEXlm.
FLEXlm controls the total number of hosts configured in all your LSF clusters. You can organize your hosts into clusters however you choose. Each server host requires at least one license; multi-processor hosts require more than one, as a function of the number of processors. Each client host requires 1/5 of a license.
LSF uses two kinds of FLEXlm license: time-limited DEMO licenses and permanent licenses.
The DEMO license allows you to try LSF out on an unlimited number of hosts on any supported host type. The trial period has a fixed expiry date, and the LSF software will not function after that date. DEMO licenses do not require any additional daemons.
Permanent licenses are the most common. A permanent license limits only the total number of hosts that can run the LSF software, and normally has no time limit. You can choose which hosts in your network will run LSF, and how they are arranged into clusters. Permanent licenses are counted by a license daemon running on one host on your network.
For permanent licenses, you need to choose a license server host and send hardware host identification numbers for the license server host to your software vendor. The vendor uses this information to create a permanent license that is keyed to the license server host. Some host types have a built-in hardware host ID; on others, the hardware address of the primary LAN interface is used.
FLEXlm is used by many software packages because it provides a simple and flexible method for controlling access to licensed software. A single FLEXlm license server can handle licenses for many software packages, even if those packages come from different vendors. This reduces the systems administration load, since you do not need to install a new license manager every time you get a new package.
FLEXlm uses a daemon called lmgrd
to manage permanent licenses. This daemon runs on one host on your network, and handles license requests from all applications. Each license key is associated with a particular software vendor. lmgrd
automatically starts a vendor daemon; the LSF version is called lsf_ld
and is provided by Platform Computing Corporation. The vendor daemon keeps track of all licenses supported by that vendor. DEMO licenses do not require you to run license daemons.
The license server daemons should be run on a reliable host, since licensed software will not run if it cannot contact the server. The FLEXlm daemons create very little load, so they are usually run on the file server. If you are concerned about availability, you can run lmgrd
on a set of three or five hosts. As long as a majority of the license server hosts are available, applications can obtain licenses.
Software licenses are stored in a text file. The default location for this file is
/usr/local/flexlm/licenses/license.dat
, but this can be overridden. For example, when LSF is installed following the default installation procedure, the license file is installed in the same directory where all LSF configuration files are installed; for example, /usr/local/lsf/mnt/conf. The license file must be readable on every host that runs licensed software. It is most convenient to place the license file in a shared NFS directory.
The license.dat
file normally contains:
a SERVER line for each FLEXlm server host. The SERVER line contains the host name, hardware host ID, and network port number for the server
a DAEMON line for each software vendor, which gives the file path name of the vendor daemon
a FEATURE line for each software license. This line contains the number of copies that can be run, along with other necessary information
The FEATURE
line contains an encrypted code to prevent tampering. For permanent licenses, the licenses granted by the FEATURE
line can be accessed only through license servers listed on the SERVER
lines.
For DEMO licenses, no FLEXlm daemons are needed, so the license file contains only the FEATURE
line.
Here is an example of a DEMO license file.
FEATURE lsf_base lsf_ld 3.100 20-Dec-1997 0 5CE371439854221102F7 "Platform" DEMO
FEATURE lsf_batch lsf_ld 3.100 20-Dec-1997 0 3CC371C33076712F433B "Platform" DEMO
FEATURE lsf_multicluster lsf_ld 3.100 20-Dec-1997 0 5C63119330771250944C "Platform" DEMO
This license file allows a site to run LSF Base, Batch, and MultiCluster until December 20, 1997. Note that a DEMO license does not have a SERVER
line and a DAEMON
line because no license server is needed for DEMO licenses.
The following is an example of a permanent license:
SERVER hostD 690a377d 1700
DAEMON lsf_ld /usr/local/lsf/etc/lsf_ld
FEATURE lsf_base lsf_ld 3.100 1-jan-0000 1000 5C239486C4D72739BAF8 "Platform"
FEATURE lsf_batch lsf_ld 3.100 1-jan-0000 1000 6CB344F6E2A5B7A31526 "Platform"
FEATURE lsf_multicluster lsf_ld 3.100 1-jan-0000 1000 5C535446DAE5DEE6B736 "Platform"
LSF uses the notion of license units in calculating the number of licenses required for a product on a host. The number of license units required to run LSF depends on the number of CPUs the host has, as well as the type of the machine. For example, a single-CPU HP-UX machine would require ten license units, whereas a client-only machine would need two license units.
The above license is configured to run on hostD, using TCP port 1700. This license allows 1000 license units for version 3.1 of LSF Base, LSF Batch, and LSF MultiCluster.
FLEXlm provides several utility programs for managing software licenses. These utilities and their manual pages are included in the LSF software distribution.
Because these utilities can be used to shut down the FLEXlm license server, and thus prevent licensed software from running, they are installed in the LSF_SERVERDIR
directory. The file permissions are set so that only root and members of group 0 can use them.
lmcksum   Calculate check sums of the license key information
lmdown    Shut down the FLEXlm server
lmhostid  Display the hardware host ID
lmremove  Remove a feature from the list of checked out features
lmreread  Tell the license daemons to re-read the license file
lmstat    Display the status of the license servers and checked out licenses
lmver     Display the FLEXlm version information for a program or library
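For example, to check the status of the license servers and the licenses currently checked out (the license file path is illustrative):
% lmstat -a -c /usr/local/lsf/mnt/conf/license.dat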
For complete details on these commands, see the on-line manual pages.
FLEXlm only accepts one license key for each feature listed in a license key file. If there is more than one FEATURE
line for the same feature, only the first FEATURE
line is used. To add hosts to your LSF cluster, you must replace the old FEATURE
line with a new one listing the new total number of licenses.
The procedure for updating a license key file to include new license keys is described in `Adding a Permanent License' in the LSF Installation Guide.
The fourth field on the SERVER
line specifies the TCP port number that the FLEXlm server uses. Choose an unused port number. LSF usually uses port numbers in the range 3879 to 3882, so the numbers from 3883 forward are good choices. If the lmgrd
daemon complains that the license server port is in use, you can choose another port number and restart lmgrd
.
For example, if your license file contains the line:
SERVER hostname host-id 1700
and you want your FLEXlm server to use TCP port 3883, change the SERVER line to:
SERVER hostname host-id 3883
LSF Suite 3.1 includes the following products: LSF Base, LSF Batch, LSF JobScheduler, LSF MultiCluster, and LSF Analyzer.
The configuration changes to enable a particular product in a cluster are handled during installation by lsfsetup. If at some later time you want to modify the products in your cluster, edit the PRODUCTS line in the Parameters section of the lsf.cluster.cluster file. You can specify one or more of the strings LSF_Base, LSF_Batch, LSF_JobScheduler, LSF_Analyzer, and LSF_MultiCluster to enable the operation of LSF Base, LSF Batch, LSF JobScheduler, LSF Analyzer, and LSF MultiCluster, respectively. If any of LSF_Batch, LSF_JobScheduler, or LSF_MultiCluster is specified, then LSF_Base is automatically enabled as well.
If the lsf.cluster.cluster file is shared, adding a product name to the PRODUCTS line enables that product for all hosts in the cluster. For example, to enable the operation of LSF Base, LSF Batch, and LSF MultiCluster:
Begin Parameters
PRODUCTS=LSF_Base LSF_Batch LSF_MultiCluster
End Parameters
To enable the operation of LSF Base only:
Begin Parameters
PRODUCTS=LSF_Base
End Parameters
To enable the operation of LSF JobScheduler:
Begin Parameters
PRODUCTS=LSF_JobScheduler LSF_Base
End Parameters
It is possible to indicate that only certain hosts within a cluster run LSF Batch or LSF JobScheduler. This is done by specifying LSF_Batch or LSF_JobScheduler in the RESOURCES field of the Host section of the lsf.cluster.cluster file. For example, the following enables hosts hostA, hostB, and hostC to run LSF JobScheduler, and hosts hostD, hostE, and hostF to run LSF Batch.
Begin Parameters
PRODUCTS=LSF_Batch LSF_Base
End Parameters

Begin Host
HOSTNAME  model   type      server  RESOURCES
hostA     SUN41   SPARCSLC  1       (sparc bsd LSF_JobScheduler)
hostB     HPPA9   HP735     1       (linux LSF_JobScheduler)
hostC     SGI     SGIINDIG  1       (irix cs LSF_JobScheduler)
hostD     SUNSOL  SunSparc  1       (solaris)
hostE     HP_UX   A900      1       (hpux cs bigmem)
hostF     ALPHA   DEC5000   1       (alpha)
End Host
The license file used to serve the cluster must have the corresponding features. A host shows as unlicensed if the license for the component it is configured to run is unavailable. For example, if a cluster is configured to run LSF_JobScheduler on all hosts, and the license file does not contain the LSF_JobScheduler feature, then the hosts will be unlicensed, even if there are licenses for LSF Base.