This chapter describes the operation, maintenance, and tuning of the LSF Base cluster. Since all LSF components depend on LSF Base, its correct operation is essential to the other LSF products.
Error logs contain important information about daemon operations. When you see any abnormal behavior related to any of the LSF daemons, you should check the relevant error logs to find out the cause of the problem.
LSF log files grow over time. These files should occasionally be cleared, either by hand or using automatic scripts.
All LSF log files are reopened each time a message is logged, so if you rename or remove a log file of an LSF daemon, the daemons will automatically create a new log file.
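For example, to clear the LIM log for a host by hand (a sketch; it assumes LSF_LOGDIR is set in your environment and hostA is one of your server hosts):
% mv $LSF_LOGDIR/lim.log.hostA $LSF_LOGDIR/lim.log.hostA.old
% rm $LSF_LOGDIR/lim.log.hostA.old
The LIM creates a fresh lim.log.hostA the next time it logs a message.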
The LSF daemons log messages when they detect problems or unusual situations.
The daemons can be configured to put these messages into files.
On UNIX, the messages can be sent to the system error logs using the syslog facility.
If LSF_LOGDIR is defined in the lsf.conf file, LSF daemons try to store their messages in files in that directory. Note that LSF_LOGDIR must be writable by root. The error log file names for the LSF Base system daemons, LIM and RES, are lim.log.hostname and res.log.hostname. The error log file names for the LSF Batch daemons are sbatchd.log.hostname, mbatchd.log.hostname, and pim.log.hostname.
If LSF_LOGDIR is defined, but the daemons cannot write to files there, the error log files are created in /tmp.
On UNIX, if LSF_LOGDIR is not defined, then errors are logged to syslog using the LOG_DAEMON facility. syslog messages are highly configurable, and the default configuration varies widely from system to system. Start by looking for the file /etc/syslog.conf, and read the manual pages for syslog and/or syslogd.
On Windows NT, if LSF_LOGDIR is defined but the daemons cannot write to files there, the error log files are created in C:\temp.
LSF daemons log error messages at different levels so that you can choose to log all messages, or only those that are deemed critical. Message logging is controlled by the parameter LSF_LOG_MASK in the lsf.conf file. Possible values for this parameter are any of the log priority symbols defined in <syslog.h>. The default value for LSF_LOG_MASK is LOG_WARNING.
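For example, to log all messages, including debugging output, you might set the following in lsf.conf (a sketch; the accepted symbols are those defined in your system's <syslog.h>):
LSF_LOG_MASK=LOG_DEBUG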
If the error log is managed by syslog, it is probably already being automatically cleared.
If LSF daemons cannot find the lsf.conf file when they start, they will not find the definition of LSF_LOGDIR. In this case, error messages go to syslog. If you cannot find any error messages in the log files, they are likely in the syslog.
See `Troubleshooting and Error Messages' on page 239 for a discussion of common problems and error log messages.
The FLEXlm license server daemons log messages about the state of the license servers, and when licenses are checked in or out. This log helps to resolve problems with the license servers and to track license use.
The FLEXlm log is configured by the lsflicsetup
command as described in `Installing
a New Permanent License' in the LSF
Installation Guide. This log file grows over time. You can remove
or rename the existing FLEXlm log file at any time. The script lsf_license
used to run the FLEXlm daemons creates a new log file when necessary.
If you already have a FLEXlm server running for other products and LSF licenses are added to the existing license file, then the FLEXlm log messages go to the same place you previously set up for the other products.
The LSF cluster administrator can monitor the status of the hosts in a cluster, start and stop the LSF daemons, and reconfigure the cluster. Many operations are performed using the lsadmin command, which performs administrative operations on the LSF Base daemons, LIM and RES.
The lshosts
and lsload
commands report the current status and load levels of hosts in an LSF cluster. The lsmon
and xlsmon
commands provide a running display of the same information. The LSF administrator can find unavailable or overloaded hosts with these tools.
% lsload
HOST_NAME  status  r15s   r1m  r15m   ut    pg  ls  it  tmp   swp   mem
hostD      ok       1.3   1.2   0.9  92%   0.0   2  20   5M  148M   88M
hostB      -ok      0.1   0.3   0.7   0%   0.0   1  67  45M   25M   34M
hostA      busy     8.0  *7.0   4.9  84%   4.6   6  17   1M   81M   27M
When the status of a host is preceded by a `-', it means RES is not running on that host. In the above example, RES on hostB is down.
LIM and RES can be restarted to upgrade software or clear persistent errors. Jobs running on the host are not affected by restarting the daemons. The LIM and RES daemons are restarted using the lsadmin
command:
% lsadmin
lsadmin>limrestart hostD
Checking configuration files ...
No errors found.
Restart LIM on <hostD> ...... done
lsadmin>resrestart hostD
Restart RES on <hostD> ...... done
lsadmin>quit
You must log in as the LSF cluster administrator to run the lsadmin command.
The lsadmin
command can be applied to all available hosts by using the host name all
; for example, lsadmin limrestart all
. If a daemon is not responding to network connections, lsadmin displays an error message with the host name. In this case you must kill and restart the daemon manually.
LSF administrators can start up any or all LSF daemons, on any or all LSF hosts, from any host in the LSF cluster. For this to work, the lsf.sudoers file has to be set up properly to allow you to start up daemons as root. You should be able to run rsh across LSF hosts without having to enter a password. See `The lsf.sudoers File' on page 189 for configuration details of lsf.sudoers.
The limstartup
and resstartup
options in lsadmin
allow for the startup of the LIM and RES daemons respectively. Specifying a host name allows for starting up a daemon on a particular host. For example:
% lsadmin limstartup hostA
Starting up LIM on <hostA> ...... done
% lsadmin resstartup hostA
Starting up RES on <hostA> ...... done
The lsadmin
command can be used to start up all available hosts by using the host name all
; for example, lsadmin limstartup all
. All LSF daemons, including LIM, RES, and sbatchd
, can be started on all LSF hosts using the command lsfstartup
.
All LSF daemons can be shut down at any time. If the LIM daemon on the current master host is shut down, another host automatically takes over as master. If the RES daemon is shut down while remote interactive tasks are running on the host, the running tasks continue but no new tasks are accepted. To shut down LIM and RES, use the lsadmin command:
% lsadmin
lsadmin>resshutdown hostD
Shut down RES on <hostD> ...... done
lsadmin>limshutdown hostD
Shut down LIM on <hostD> ...... done
lsadmin>quit
You can run lsadmin reconfig
while the LSF system is in use; users might be unable to submit new jobs for a short time, but all current remote executions are unaffected.
A LIM can be locked to temporarily prevent any further jobs from being sent to the host. The lock can be set to last either for a specified period of time, or until the host is explicitly unlocked. Only the local host can be locked and unlocked.
% lsadmin limlock
Host is locked
% lsload
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
hostD ok 1.3 1.2 0.9 92% 0.0 2 20 5M 148M 28M
hostA busy 8.0 *7.0 4.9 84% 0.6 0 17 *1M 31M 7M
hostC lockU 0.8 1.0 1.1 73% 1.2 3 0 4M 44M 12M
% lsadmin limunlock
Host is unlocked
Only root and the LSF administrator can lock and unlock hosts.
LSF configuration consists of several levels:
lsf.conf--The primary LSF environment configuration file
lsf.shared and lsf.cluster.cluster--Configuration files for the Load Information Manager (LIM), where cluster is the name of your cluster
lsf.task and lsf.task.cluster--The files containing task to default resource requirement string mappings
LSB_CONFDIR/cluster--The directory containing configuration files for LSF Batch
This is the generic LSF environment configuration file. This file defines general installation parameters so that all LSF executables can find the necessary information. This file is typically installed in the LSF_CONFDIR
directory (the same directory as the LIM configuration files), and a symbolic link is made from a convenient directory as defined by the environment variable LSF_ENVDIR
, or the default directory /etc
. This file is created by lsfsetup during installation. Note that many of the parameters in this file are machine specific. The contents of this file are described in detail in `The lsf.conf File' on page 161.
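A sketch of a few typical lsf.conf entries (the directory paths are illustrative only; your installation will differ):
LSF_CONFDIR=/usr/local/lsf/mnt/conf
LSB_CONFDIR=/usr/local/lsf/mnt/conf/lsbatch
LSF_SERVERDIR=/usr/local/lsf/mnt/etc
LSF_LOGDIR=/usr/local/lsf/mnt/log
LSF_LOG_MASK=LOG_WARNING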
LIM is the kernel of your cluster that provides the single system image to all applications. LIM reads the LIM configuration files and determines your cluster and the cluster master host.
LIM files include lsf.shared and lsf.cluster.cluster, where cluster is the name of your LSF cluster. These files define the host members, general host attributes, and resource definitions for your cluster. The individual functions of each of these files are described below.
lsf.shared
defines the available resource names, host types, host models, cluster names, and external load indices that can be used by all clusters. This file is shared by all clusters.
The lsf.cluster.cluster file is a per-cluster configuration file. It contains two types of configuration information: cluster definition information and LIM policy information. Cluster definition information impacts all LSF applications, while LIM policy information impacts applications that rely on LIM's policies for job placement.
The cluster definition information defines cluster administrators, all the hosts that make up the cluster, attributes of each individual host such as host type or host model, and resources using the names defined in lsf.shared
.
LIM policy information defines the load sharing and job placement policies provided by LIM. More details about LIM policies are described in `Tuning LIM Load Thresholds' on page 69.
LIM configuration files are stored in the directory LSF_CONFDIR, as defined in the lsf.conf file. Details of the LIM configuration files are described in `The lsf.shared File' on page 173.
lsf.task
is a system-wide task to `default resource requirement string' mapping file. This file defines mappings between task names and their default resource requirements. LSF maintains a task list for each user in the system. The lsf.task
file is useful for the cluster administrator to set task-to-resource requirement mapping at the system level. Individual users can customize their own list by using the lsrtasks
command (See lsrtasks(1)
man page for details on this command).
When you run a job with an LSF command such as bsub or lsrun, the command consults your task list to find the default resource requirement string for the job if one is not specified explicitly. If a match is not found in your task list, the system assumes a default, which typically means running the job on a host of the same host type as the local host.
There is also a per-cluster file, lsf.task.cluster, that applies to the cluster only and overrides the system-wide definitions. Individual users can override both the system-wide and cluster-wide files by using the lsrtasks command.
The lsf.task and lsf.task.cluster files are installed in the directory LSF_CONFDIR, as defined in the lsf.conf file.
These files define LSF Batch specific configuration such as queues, batch server hosts, and batch user controls. These files are only read by mbatchd
. The LSF Batch configuration relies on LIM configuration. LSF Batch daemons get the cluster configuration information from the LIM via the LSF API.
LSF Batch configuration files are stored in directory LSB_CONFDIR/
cluster, where LSB_CONFDIR
is defined in lsf.conf
, and cluster is the name of your cluster. Details of LSF Batch configuration files are described in `Managing LSF Batch' on page 79.
All configuration files except lsf.conf
use a section-based format. Each file contains a number of sections. Each section starts with a line beginning with the reserved word Begin
followed by a section name, and ends with a line beginning with the reserved word End
followed by the same section name. Begin
, End
, section names, and keywords are all case insensitive.
Sections can either be vertical or horizontal. A horizontal section contains a number of lines, each having the format: keyword = value
, where value is one or more strings. For example:
Begin exampleSection
key1 = string1
key2 = string2 string3
key3 = string4
End exampleSection Begin exampleSection
key1 = STRING1
key2 = STRING2 STRING3
End exampleSection
In many cases you can define more than one object of the same type by giving more than one horizontal section with the same section name.
A vertical section has a line of keywords as the first line. The lines following the first line are values assigned to the corresponding keywords. Values that contain more than one string must be bracketed with `(' and `)'. The above examples can also be expressed in one vertical section:
Begin exampleSection
key1 key2 key3
string1 (string2 string3) string4
STRING1 (STRING2 STRING3) -
End exampleSection
Each line in a vertical section is equivalent to a horizontal section with the same section name.
Some keys in certain sections are optional. For a horizontal section, an optional key does not appear in the section if its value is not defined. For a vertical section, an optional keyword must appear in the keyword line if any line in the section defines a value for that keyword. To specify the default value use `-' or `()' in the corresponding column, as shown for key3
in the example above.
Each line can have multiple columns, separated by either spaces or TAB characters. Lines can be extended by a `\
' (back slash) at the end of a line. A `#
' (pound sign) indicates the beginning of a comment; characters up to the end of the line are not interpreted. Blank lines are ignored.
Below are some examples of LIM configuration files. The detailed explanations of the variables are described in `LSF Base Configuration Reference' on page 161.
Begin Cluster
ClusterName # This line is keyword(s)
test_cluster
End Cluster

Begin HostType
TYPENAME                # This line is keyword(s)
hppa
SUNSOL
rs6000
alpha
NTX86
End HostType

Begin HostModel
MODELNAME  CPUFACTOR    # This line is keyword(s)
HP735      4.0
DEC3000    5.0
ORIGIN2K   8.0
PENTI120   3.0
End HostModel

Begin Resource
RESOURCENAME  TYPE     INTERVAL  INCREASING  DESCRIPTION                        # This line is keyword(s)
hpux          Boolean  ()        ()          (HP-UX operating system)
decunix       Boolean  ()        ()          (Digital Unix)
solaris       Boolean  ()        ()          (Sun Solaris operating system)
NT            Boolean  ()        ()          (Windows NT operating system)
fserver       Boolean  ()        ()          (File Server)
cserver       Boolean  ()        ()          (Compute Server)
scratch       Numeric  30        N           (Shared scratch space on server)
verilog       Numeric  30        N           (Floating licenses for Verilog)
console       String   30        N           (User Logged in on console)
End Resource
Example lsf.cluster.test_cluster
file:
Begin ClusterManager
Manager = lsf user7
End ClusterManager

Begin Host
HOSTNAME  Model     Type   server  swp  Resources
hostA     HP735     hppa   1       2    (fserver hpux)
hostD     ORIGIN2K  sgi    1       2    (cserver)
hostB     PENT200   NTX86  1       2    (NT)
End Host
In the above file, the ClusterManager section uses the horizontal format, while the Host section uses the vertical format.
Other LSF Batch configuration files are described in `Example LSF Batch Configuration Files' on page 136.
This section provides procedures for some common changes to the LIM configuration. There are three different ways for you to change the LIM configuration:
the lsfsetup program, as described in various sections of the LSF Installation Guide
the xlsadmin tool (a graphical application)
a text editor, used to edit the configuration files directly
The following discussions focus on changing configuration files using a text editor so that you can understand the concepts behind the configuration changes. See `Managing an LSF Cluster Using xlsadmin' on page 99 for the use of xlsadmin
in changing configuration files.
If you run LSF Batch, you must restart mbatchd using the badmin reconfig command each time you change the LIM configuration, even if the LSF Batch configuration files do not change. This is necessary because the LSF Batch configuration depends on the LIM configuration.
Edit the HostType section of the lsf.shared file to add the new host type. A host type can be any alphanumeric string up to 29 characters long.
Edit the HostModel section of your lsf.shared file to add the new model together with its CPU speed factor relative to other models.
Add the new host to the Host section of the lsf.cluster.cluster file, with the host name, host type, and all other attributes defined, as shown in `Example Configuration Files' on page 54.
The master LIM and mbatchd daemons run on the first available host in the Host section of your lsf.cluster.cluster file, so you should list reliable batch server hosts first. For more information see `Fault Tolerance' on page 5.
If you are adding a client host, set the SERVER field for the host to 0 (zero).
Run LSF_SERVERDIR/lsf_daemons start and use ps to make sure that res, lim, and sbatchd have started.
The lsf_daemons start command must be run as root. If you are creating a private cluster, do not attempt to use lsf_daemons to start your daemons, as this command will kill all running daemons on the system before starting new ones. Start them manually.
Edit the lsf.cluster.cluster file and remove the unwanted hosts from the Host section.
Run lsadmin resshutdown host1 host2 ..., where host1, host2, ... are the hosts you want to remove from your cluster.
If any users of the host use lstcsh as their login shell, change their login shell to tcsh or csh. Remove lstcsh from the /etc/shells file.
Your cluster is most likely heterogeneous. Even if your computers are all the same, it might still be heterogeneous. For example, some machines are configured as file servers, while others are compute servers; some have more memory, others have less; some have four CPUs, others have only one; some have host-locked software licenses installed, others do not.
LSF provides powerful resource selection mechanisms so that correct hosts with required resources are chosen to run your jobs. For maximum flexibility, you should characterize your resources clearly enough so that users have satisfactory choices. For example, if some of your machines are connected to both Ethernet and FDDI, while others are only connected to Ethernet, then you probably want to define a resource called fddi
and associate the fddi
resource to machines connected to FDDI. This way, users can specify resource fddi
if they want their jobs to run on machines connected to FDDI.
To customize host resources for your cluster, perform the following procedure:
Add the new resource names to the Resource section of the lsf.shared file, with a brief description for each of the added resource names. Resource descriptions are displayed to users by the lsinfo command.
Set up the lsf.task file to reflect the new resources in the resource requirements of the relevant applications. Alternatively, you can leave this to individual users, who can use the lsrtasks command to customize their own files.
Edit the lsf.cluster.cluster file to modify the RESOURCES column of the Host section so that all hosts that have the added resources list the added resource names in that column, as shown in the sketch below.
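A minimal sketch of these changes for the fddi resource described earlier (host names and attributes are illustrative):
In the Resource section of lsf.shared:
fddi     Boolean  ()  ()  (Host connected to FDDI)
In the Host section of lsf.cluster.cluster:
HOSTNAME  Model  Type  server  swp  Resources
hostA     HP735  hppa  1       2    (fserver hpux fddi)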
Resources are defined in the Resource
section of the lsf.shared
file. The definition of a resource involves specifying a name and description, as well as, optionally, the type of its value, its update interval, and whether a higher or lower value indicates greater availability.
The mandatory resource information fields are:
RESOURCENAME
indicating the name of the resource
DESCRIPTION
that should indicate what the resource represents.
The optional resource information fields are:
TYPE
indicating the type of its value (boolean, numeric, or string)
INTERVAL
indicating how often the value is updated (for resources whose value changes dynamically)
INCREASING
flag indicating whether a higher value represents a greater availability of the resource (for numeric resources which can be used for scheduling jobs).
When the optional attributes are not specified, the resource is treated as static and boolean-valued.
The following is a sample of a Resource
section from an lsf.shared
file:
Begin Resource
RESOURCENAME TYPE INTERVAL INCREASING DESCRIPTION
mips Boolean () () (MIPS architecture)
dec Boolean () () (DECStation system)
sparc Boolean () () (SUN SPARC)
hppa Boolean () () (HPPA architecture)
bsd Boolean () () (BSD unix)
sysv Boolean () () (System V UNIX)
hpux Boolean () () (HP-UX UNIX)
aix Boolean () () (AIX UNIX)
nt Boolean () () (Windows NT)
scratch Numeric 30 N (Shared scratch space on server)
synopsys Numeric 30 N (Floating licenses for Synopsys)
verilog Numeric 30 N (Floating licenses for Verilog)
console String 30 N (User Logged in on console)
End Resource
There is no distinction between shared and non-shared resources in the resource definition in the lsf.shared
file.
The NewIndex section in the lsf.shared file is obsolete. To achieve the same effect, the Resource section of the lsf.shared file can be used to define a dynamic numeric resource, and the default keyword can be used in the LOCATION field of the ResourceMap section of the lsf.cluster.cluster file.
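For example, the effect of an old NewIndex entry can be achieved as follows (a sketch; nio is a hypothetical dynamic index collected by an ELIM, and the ResourceMap section is described below):
In the Resource section of lsf.shared:
nio     Numeric  30  Y  (Network I/O rate)
In the ResourceMap section of lsf.cluster.cluster:
nio     ([default])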
Resources are associated with the host(s) on which they
are available in the ResourceMap
section of the lsf.cluster.
cluster
file (where cluster is the name of the cluster). The following fields must be
completed for each resource:
RESOURCENAME
indicating the name of the resource, as defined in the lsf.shared
file
LOCATION
indicating whether the resource is shared or non-shared, across which hosts, and with which initial value(s).
The following is an example of a ResourceMap
section from an lsf.cluster.
cluster
file:
Begin ResourceMap
RESOURCENAME LOCATION
verilog 5@[all]
synopsys (2@[apple] 2@[others])
console (1@[apple] 1@[orange])
End ResourceMap
The possible states of a resource that may be specified in the LOCATION
column are:
For static resources, the LOCATION
column should contain the value of the resource.
The syntax of the information in the LOCATION
field takes one of two forms. For static resources, where the value must be specified, use:
(value1@[host1 host2 ...] value2@[host3 host4] ...)
For dynamic resources, where the value is updated by an ELIM, use:
([host1 host2 ...] [host3 host4] ...)
Each set of hosts listed within the square brackets specifies an instance of the resource. All hosts within the instance share the resource whose quantity is indicated by its value. In the above example, host1
, host2
,... form one instance of the resource, host3
, host4
,... form another instance, and so on.
The same host cannot be in more than one instance of a resource.
Three predefined words have special meaning in this specification:
all refers to all the server hosts in the cluster; for example, value@[all] means the resource is shared by all server hosts in the cluster made up of host1 host2 ... hostn
others refers to the rest of the server hosts listed in the cluster; for example, (2@[apple] 2@[others]) means there are 2 units of synopsys on apple, and 2 shared by all other hosts
default refers to each host; for example, value@[default] is equivalent to (value@[host1] value@[host2] ... value@[hostn]) where host1, ... hostn are all server hosts in the cluster
These syntax examples assume that static resources (requiring values) are being specified. For dynamic resources, use the same syntax but omit the value
.
The following items should be taken into consideration when configuring resources under LSF Base.
In the lsf.cluster.cluster file, the Host section must precede the ResourceMap section, since the ResourceMap section uses the host names defined in the Host section.
The RESOURCES column in the Host section of the lsf.cluster.cluster file should be used to associate static boolean resources with particular hosts. Using the ResourceMap section for static boolean resources results in an empty RESOURCES column in the lshosts(1) display.
Resources defined in the ResourceMap section are treated as shared resources, which are displayed using the lsload -s or lshosts -s commands. The exception is dynamic numeric resources specified using the default predefined word. These are treated together with load indices such as mem and swap, and are viewed using the lsload -l command.
If the ResourceMap
section is not defined, then any dynamic resources specified in lsf.shared
are considered to be host-based (the resource is available on each host in the cluster).
After changing LIM configuration files, you must tell LIM to read the new configuration. Use the lsadmin
command to tell LIM to pick up the new configuration.
Operations can be specified on the command line or entered at a prompt. Run the lsadmin
command with no arguments, and enter help
to see the available operations.
The lsadmin reconfig
command checks the LIM configuration files for errors. If no errors are found, the command confirms that you want to restart the LIMs on all hosts, and reconfigures all the LIM daemons:
% lsadmin reconfig
Checking configuration files ...
No errors found.
Do you really want to restart LIMs on all hosts? [y/n] y
Restart LIM on <hostD> ...... done
Restart LIM on <hostA> ...... done
Restart LIM on <hostC> ...... done
In the above example, no errors are found. If any non-fatal errors are found, the command asks you to confirm the reconfiguration. If fatal errors are found, the reconfiguration is aborted.
If you want to see details on any errors, run the command lsadmin ckconfig -v
. This reports all errors to your terminal.
If you change the configuration file of LIM, you should also reconfigure LSF Batch by running badmin reconfig
because LSF Batch depends on LIM configuration. If you change the configuration of LSF Batch, then you only need to run badmin reconfig
.
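For example, after editing lsf.shared or lsf.cluster.cluster in a cluster that also runs LSF Batch, a typical sequence is:
% lsadmin reconfig
% badmin reconfig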
The values of static external resources are specified through the lsf.cluster.cluster configuration file. All dynamic resources, regardless of whether they are shared or host-based, are collected through an ELIM. An ELIM is started in the following situations:
If the LOCATION field in the ResourceMap section of lsf.cluster.cluster is ([default]), then every host starts an ELIM.
If the LOCATION field in the ResourceMap section of lsf.cluster.cluster is ([all]), then an ELIM is started on the master host.
If the LOCATION field in the ResourceMap section of lsf.cluster.cluster is ([hostA hostB hostC] [hostD hostE hostF]), then an ELIM is started on hostA and on hostD to report the value of that resource for each set of hosts.
If the host reporting the value for an instance goes down, then an ELIM is started on the next available host in the instance. In the above example, if hostA becomes unavailable, an ELIM is started on hostB. If hostA becomes available again, the ELIM on hostB is shut down and the one on hostA is started.
There is only one ELIM on each host, regardless of the number of resources on which it reports. If only cluster-wide resources are to be collected, then an ELIM will only be started on the master host. When LIM starts, the following environment variables are set for ELIM:
LSF_MASTER
: This variable is defined if the ELIM is being invoked on the master host. It is undefined otherwise. This can be used to test whether the ELIM should report on cluster-wide resources that only need to be collected on the master host.
LSF_RESOURCES
: This variable contains a list of resource names (separated by spaces) on which the ELIM is expected to report. A resource name is only put in the list if the host on which the ELIM is running shares an instance of that resource.
The following restrictions apply to the use of shared resources in LSF products.
A shared resource cannot be used as a load threshold in the Hosts section of the lsf.cluster.cluster file.
A shared resource cannot be used in the loadSched/loadStop thresholds, or in the STOP_COND or RESUME_COND parameters in the queue definition in the lsb.queues file.
The ELIM can be any executable program, either an interpreted script or compiled code. Example code for an ELIM is included in the examples
directory in the LSF distribution. The elim.c
file is an ELIM written in C. You can customize this example to collect the load indices you want.
The ELIM communicates with the LIM by periodically writing a load update string to its standard output. The load update string contains the number of indices followed by a list of name-value pairs in the following format:
N name1 value1 name2 value2 ... nameN valueN
For example:
3 tmp2 47.5 nio 344.0 licenses 5
This string reports three indices: tmp2
, nio
, and licenses
, with values 47.5, 344.0, and 5 respectively. Index values must be numbers between -INFINIT_LOAD
and INFINIT_LOAD
as defined in the lsf.h
header file.
If the ELIM is implemented as a C program, as part of initialization it should use setbuf(3)
to establish unbuffered output to stdout
.
The ELIM should ensure that the entire load update string is written successfully to stdout
. This can be done by checking the return value of printf(3s)
if the ELIM is implemented as a C program or as the return code of /bin/echo(1)
from a shell script. The ELIM should exit if it fails to write the load information.
Each LIM sends updated load information to the master every 15 seconds. Depending on how quickly your external load indices change, the ELIM should write the load update string at most once every 15 seconds. If the external load indices rarely change, the ELIM can write the new values only when a change is detected. The LIM continues to use the old values until new values are received.
The executable for the ELIM must be in LSF_SERVERDIR
and must have the name elim
. If LIM expects some resources to be collected by an ELIM according to configuration, it invokes the ELIM automatically on startup. The ELIM runs with the same user id and file access permission as the LIM.
The LIM restarts the ELIM if it exits; to prevent problems in case of a fatal error in the ELIM, it is restarted at most once every 90 seconds. When the LIM terminates, it sends a SIGTERM
signal to the ELIM. The ELIM must exit upon receiving this signal.
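The following is a minimal sketch of an ELIM written in C, along the lines of the elim.c example in the LSF distribution. It reports a single hypothetical external index named scratch; the measurement function is a stub to be replaced with a real check, and the handling of LSF_RESOURCES is only illustrative.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Stub: replace with a real measurement, for example the free space (in MB) on the scratch server. */
static double measure_scratch(void)
{
    return 100.0;
}

int main(void)
{
    /* LSF_RESOURCES lists the indices the LIM expects this host to report. */
    const char *resources = getenv("LSF_RESOURCES");
    int report_scratch = (resources != NULL && strstr(resources, "scratch") != NULL);

    /* Unbuffered output so each load update string reaches the LIM immediately. */
    setbuf(stdout, NULL);

    /* The default SIGTERM action terminates the process, as the LIM requires. */
    for (;;) {
        if (report_scratch) {
            /* Load update string: number of indices, then name-value pairs. */
            if (printf("1 scratch %.1f\n", measure_scratch()) < 0)
                exit(1);    /* exit if the write to stdout fails */
        }
        sleep(15);          /* LIM forwards load information every 15 seconds */
    }
}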
The ELIM can also return values for the built-in load indices. In this case the value produced by the ELIM overrides the value produced by the LIM. The ELIM must ensure that the semantics of any index it supplies are the same as that of the corresponding index returned by the lsinfo(1)
command.
For example, some sites prefer to use /usr/tmp
for temporary files. To override the tmp
load index, write a program that periodically measures the space in the /usr/tmp
file system and writes the value to standard output. Name this program elim
and store it in the LSF_SERVERDIR
directory.
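A minimal sketch of such an override in C (it assumes a POSIX statvfs() is available and reports the available space in megabytes; adapt the measurement so its semantics match what lsinfo(1) documents for tmp on your systems):
#include <stdio.h>
#include <unistd.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs fs;

    setbuf(stdout, NULL);                 /* unbuffered output to the LIM */
    for (;;) {
        if (statvfs("/usr/tmp", &fs) == 0) {
            /* Available space in /usr/tmp, reported as the built-in tmp index (in MB). */
            double mb = (double)fs.f_bavail * (double)fs.f_frsize / (1024.0 * 1024.0);
            if (printf("1 tmp %.1f\n", mb) < 0)
                return 1;                 /* exit if the write to stdout fails */
        }
        sleep(15);
    }
}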
The name of an external load index must not be one of the resource name aliases cpu, idle, logins, or swap. To override one of these indices, use its formal name: r1m, it, ls, or swp.
You must configure the external load index even if you are overriding a built-in load index.
LIM provides very critical services to the all LSF components. In addition to the timely collection of resource information, LIM also provides host selection and job placement policies. If you are using the LSF MultiCluster product, LIM policies also determine how different clusters should exchange load and resource information.
LIM policies are advisory information for applications. Applications can either use the placement decision from the LIM, or make further decisions based on information from the LIM.
Most of the LSF interactive tools, such as lsrun
and lstcsh
, use LIM policies to place jobs on the network. LSF Batch uses load and resource information from LIM and makes its own placement decisions based on other factors in addition to load information.
As was described in `Overview of LSF Configuration Files' on page 50, the LIM configuration files define the load-sharing policies. The LIM configuration parameters that affect LIM policies include:
Load thresholds and dispatch windows for each host in lsf.cluster.cluster. Dispatch windows cause hosts to become locked outside the time windows so that LIM will not advise jobs to go to those hosts. Details of these parameters are described in `Hosts' on page 182.
LIM thresholds and run windows affect the job placement advice of the LIM. Job placement advice is not enforced by LIM. LSF Batch, for example, does not follow the policies of the LIM.
Parameters in the lsf.cluster.cluster file that apply to the LSF MultiCluster product only. These parameters define the relationship between the local cluster and remote clusters and the direction of job placement flows across clusters. See `Managing LSF MultiCluster' on page 143 for details.
There are two main goals in adjusting the LIM configuration parameters: improving response time, and reducing interference with interactive use. To improve response time, LSF should be tuned to correctly select the best available host for each job. To reduce interference, LSF should be tuned to avoid overloading any host.
CPU factors are used to differentiate the relative speed of different machines. LSF runs jobs on the best possible machines so that the response time is minimized. To achieve this, it is important that you define correct CPU factors for each machine model in your cluster by changing the HostModel
section of your lsf.shared
file.
CPU factors should be set based on a benchmark that reflects your work load. (If there is no such benchmark, CPU factors can be set based on raw CPU power.) The CPU factor of the slowest hosts should be set to one, and faster hosts should be proportional to the slowest. For example, consider a cluster with two hosts, hostA and hostB, where hostA takes 30 seconds to run your favourite benchmark and hostB takes 15 seconds to run the same test. hostA should have a CPU factor of 1, and hostB (since it is twice as fast) should have a CPU factor of 2.
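The corresponding HostModel entries would look like the following sketch (ModelA and ModelB are illustrative model names for the two hosts):
Begin HostModel
MODELNAME  CPUFACTOR
ModelA     1.0
ModelB     2.0
End HostModel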
LSF uses a normalized CPU performance rating to decide which host has the most available CPU power. The normalized ratings can be seen by running the lsload -N
command. The hosts in your cluster are displayed in order from best to worst. Normalized CPU run queue length values are based on an estimate of the time it would take each host to run one additional unit of work, given that an unloaded host with CPU factor 1 runs one unit of work in one unit of time.
Incorrect CPU factors can reduce performance in two ways. If the CPU factor for a host is too low, that host may not be selected for job placement when a slower host is available. This means that jobs would not always run on the fastest available host. If the CPU factor is too high, jobs are run on the fast host even when they would finish sooner on a slower but lightly loaded host. This causes the faster host to be overused while the slower hosts are underused.
Both of these conditions are somewhat self-correcting. If the CPU factor for a host is too high, jobs are sent to that host until the CPU load threshold is reached. The LIM then marks that host as busy, and no further jobs will be sent there. If the CPU factor is too low, jobs may be sent to slower hosts. This increases the load on the slower hosts, making LSF more likely to schedule future jobs on the faster host.
The Host
section of the lsf.cluster.
cluster
file can contain busy thresholds for load indices. You do not need to specify
a threshold for every index; indices that are not listed do not affect the scheduling
decision. These thresholds are a major factor in influencing LSF performance.
This section does not describe all LSF load indices; see `Resource
Requirements' on page 24 and `Threshold
Fields' on page 184 for more complete discussions.
The parameters that most often affect performance are:
r15s--The 15-second average CPU run queue length
r1m--The 1-minute average CPU run queue length
r15m--The 15-minute average CPU run queue length
pg--The paging rate in pages per second
swp--The available swap space
For tuning these parameters, you should compare the output of lsload
to the thresholds reported by lshosts -l
.
The lsload
and lsmon
commands display an asterisk `*' next to each load index that exceeds its threshold. For example, consider the following output from lshosts -l
and lsload
:
% lshosts -l
HOST_NAME: hostD
...
LOAD_THRESHOLDS:
r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
-     3.5  -     -   15  -   -   -   -    2M   1M

HOST_NAME: hostA
...
LOAD_THRESHOLDS:
r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
-     3.5  -     -   15  -   -   -   -    2M   1M

% lsload
HOST_NAME  status  r15s  r1m  r15m   ut     pg  ls  it  tmp  swp  mem
hostD      ok       0.0  0.0   0.0    0%   0.0   6   0  30M  32M  10M
hostA      busy     1.9  2.1   1.9   47%  *69.6 21   0  38M  96M  60M
In this example, hostD is ok
. However, hostA is busy
; the pg
(paging rate) index is 69.6, above the threshold of 15.
Other monitoring tools such as xlsmon
also help to show the effects of changes.
If the LIM often reports a host to be busy
when the CPU run queue length is low, the most likely cause is the paging rate threshold. Different operating systems assign subtly different meanings to the paging rate statistic, so the threshold needs to be set at different levels for different host types. In particular, HP-UX systems need to be configured with significantly higher pg
values; try starting at a value of 50 rather than the default of 15.
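For example, the paging rate threshold for an HP-UX host could be raised by setting the pg column in the Host section of the lsf.cluster.cluster file (a sketch with illustrative host and threshold values; the columns must match the keyword line of your own Host section):
Begin Host
HOSTNAME  model  type  server  r1m  pg  swp  RESOURCES
hostH     HP735  hppa  1       3.5  50  2M   (hpux)
End Host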
If the LIM often shows systems busy
when the CPU utilization and run queue lengths are relatively low and the system is responding quickly, try raising the pg
threshold. There is a point of diminishing returns; as the paging rate rises, eventually the system spends too much time waiting for pages and the CPU utilization decreases. Paging rate is the factor that most directly affects perceived interactive response. If a system is paging heavily, it feels very slow.
The CPU run queue threshold can be reduced if you find that interactive jobs slow down your response too much while the LIM still reports your host as ok
. Likewise, it can be increased if hosts become busy at too low a load.
On multi-processor systems, the CPU run queue threshold is compared to the effective run queue length as displayed by the lsload -E
command. The run queue threshold should be configured as the load limit for a single processor. Sites with a variety of uniprocessor and multi-processor machines can use a standard value for r15s
, r1m
and r15m
in the configuration files, and the multi-processor machines will automatically run more jobs. Note that the normalized run queue length printed by lsload -N
is scaled by the number of processors. See Section 4, `Resources', beginning on page 35 of the LSF Batch User's Guide and lsfintro(1)
for the concept of effective and normalized run queue lengths.
Because LSF takes a wide variety of measurements on the hosts in your network, it can be a powerful tool for monitoring and capacity planning. The lsmon
command gives updated information that can quickly identify problems such as inaccessible hosts or unusual load levels. The lsmon -L
option logs the load information to a file for later processing. See the lsmon(1)
and lim.acct(5)
manual pages for more information.
For example, if the paging rate (pg
) on a host is always high, adding memory to the system will give a significant increase in both interactive performance and total throughput. If the pg
index is low but the CPU utilization (ut
) is usually more than 90 percent, the CPU is the limiting resource. Getting a faster host, or adding another host to the network, would provide the best performance improvement. The external load indices can be used to track other limited resources such as user disk space, network traffic, or software licenses.
The xlsmon program is a Motif graphical interface to the LSF load information. The xlsmon display uses colour to highlight busy and unavailable hosts, and can show both the current levels and scrolling histories of selected load indices.
See Section 3, `Cluster Information', beginning on page 25 of the LSF Batch User's Guide for more information about xlsmon
.
LSF software is licensed using the FLEXlm license manager from Globetrotter Software, Inc. The LSF license key controls the hosts allowed to run LSF. The procedures for obtaining, installing, and upgrading license keys are described in `Getting License Key Information' and `Setting Up the License Key' in the LSF Installation Guide. This section provides background information on FLEXlm.
FLEXlm controls the total number of hosts configured in all your LSF clusters. You can organize your hosts into clusters however you choose. Each server host requires at least one license; multi-processor hosts require more than one, as a function of the number of processors. Each client host requires 1/5 of a license.
LSF uses two kinds of FLEXlm license: time-limited DEMO licenses and permanent licenses.
The DEMO license allows you to try LSF out on an unlimited number of hosts on any supported host type. The trial period has a fixed expiry date, and the LSF software will not function after that date. DEMO licenses do not require any additional daemons.
Permanent licenses are the most common. A permanent license limits only the total number of hosts that can run the LSF software, and normally has no time limit. You can choose which hosts in your network will run LSF, and how they are arranged into clusters. Permanent licenses are counted by a license daemon running on one host on your network.
For permanent licenses, you need to choose a license server host and send hardware host identification numbers for the license server host to your software vendor. The vendor uses this information to create a permanent license that is keyed to the license server host. Some host types have a built-in hardware host ID; on others, the hardware address of the primary LAN interface is used.
FLEXlm is used by many software packages because it provides a simple and flexible method for controlling access to licensed software. A single FLEXlm license server can handle licenses for many software packages, even if those packages come from different vendors. This reduces the systems administration load, since you do not need to install a new license manager every time you get a new package.
FLEXlm uses a daemon called lmgrd
to manage permanent licenses. This daemon runs on one host on your network, and handles license requests from all applications. Each license key is associated with a particular software vendor. lmgrd
automatically starts a vendor daemon; the LSF version is called lsf_ld
and is provided by Platform Computing Corporation. The vendor daemon keeps track of all licenses supported by that vendor. DEMO licenses do not require you to run license daemons.
The license server daemons should be run on a reliable host, since licensed software will not run if it cannot contact the server. The FLEXlm daemons create very little load, so they are usually run on the file server. If you are concerned about availability, you can run lmgrd
on a set of three or five hosts. As long as a majority of the license server hosts are available, applications can obtain licenses.
Software licenses are stored in a text file. The default location for this file is
/usr/local/flexlm/licenses/license.dat
, but this can be overridden. For example, when LSF is installed following the default installation procedure, the license file is installed in the same directory where all LSF configuration files are installed; for example, /usr/local/lsf/mnt/conf. The license file must be readable on every host that runs licensed software. It is most convenient to place the license file in a shared NFS directory.
The license.dat
file normally contains:
a SERVER line for each FLEXlm server host. The SERVER line contains the host name, hardware host ID, and network port number for the server
a DAEMON line for each software vendor, which gives the file path name of the vendor daemon
a FEATURE line for each software license. This line contains the number of copies that can be run, along with other necessary information
The FEATURE
line contains an encrypted code to prevent tampering. For permanent licenses, the licenses granted by the FEATURE
line can be accessed only through license servers listed on the SERVER
lines.
For DEMO licenses, no FLEXlm daemons are needed, so the license file contains only the FEATURE
line.
Here is an example of a DEMO license file.
FEATURE lsf_base lsf_ld 3.100 20-Dec-1997 0 5CE371439854221102F7 "Platform" DEMO
FEATURE lsf_batch lsf_ld 3.100 20-Dec-1997 0 3CC371C33076712F433B "Platform" DEMO
FEATURE lsf_multicluster lsf_ld 3.100 20-Dec-1997 0 5C63119330771250944C "Platform" DEMO
This license file allows a site to run LSF Base, Batch, and MultiCluster until December 20, 1997. Note that a DEMO license does not have a SERVER
line and a DAEMON
line because no license server is needed for DEMO licenses.
The following is an example of a permanent license:
SERVER hostD 690a377d 1700
DAEMON lsf_ld /usr/local/lsf/etc/lsf_ld
FEATURE lsf_base lsf_ld 3.100 1-jan-0000 1000 5C239486C4D72739BAF8 "Platform"
FEATURE lsf_batch lsf_ld 3.100 1-jan-0000 1000 6CB344F6E2A5B7A31526 "Platform"
FEATURE lsf_multicluster lsf_ld 3.100 1-jan-0000 1000 5C535446DAE5DEE6B736 "Platform"
LSF uses the notion of license units in calculating the number of licenses required for a product on a host. The number of license units required to run LSF depends on the number of CPUs the host has, as well as the type of the machine. For example, a single-CPU HP-UX machine would require ten license units, whereas a client-only machine would need two license units.
The above license is configured to run on hostD, using TCP port 1700. This license allows 1000 license units for version 3.1 of LSF Base, LSF Batch, and LSF MultiCluster.
FLEXlm provides several utility programs for managing software licenses. These utilities and their manual pages are included in the LSF software distribution.
Because these utilities can be used to shut down the FLEXlm license server, and thus prevent licensed software from running, they are installed in the LSF_SERVERDIR
directory. The file permissions are set so that only root and members of group 0 can use them.
lmcksum   Calculate check sums of the license key information
lmdown    Shut down the FLEXlm server
lmhostid  Display the hardware host ID
lmremove  Remove a feature from the list of checked out features
lmreread  Tell the license daemons to re-read the license file
lmstat    Display the status of the license servers and checked out licenses
lmver     Display the FLEXlm version information for a program or library
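For example, to check the status of the license servers and the licenses currently checked out (the license file path is illustrative):
% lmstat -a -c /usr/local/lsf/mnt/conf/license.dat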
For complete details on these commands, see the on-line manual pages.
FLEXlm only accepts one license key for each feature listed in a license key file. If there is more than one FEATURE
line for the same feature, only the first FEATURE
line is used. To add hosts to your LSF cluster, you must replace the old FEATURE
line with a new one listing the new total number of licenses.
The procedure for updating a license key file to include new license keys is described in `Adding a Permanent License' in the LSF Installation Guide.
The fourth field on the SERVER
line specifies the TCP port number that the FLEXlm server uses. Choose an unused port number. LSF usually uses port numbers in the range 3879 to 3882, so the numbers from 3883 forward are good choices. If the lmgrd
daemon complains that the license server port is in use, you can choose another port number and restart lmgrd
.
For example, if your license file contains the line:
SERVER hostname host-id 1700
and you want your FLEXlm server to use TCP port 3883, change the SERVER line to:
SERVER hostname host-id 3883
LSF Suite 3.1 includes the following products: LSF Base, LSF Batch, LSF JobScheduler, LSF MultiCluster, and LSF Analyzer.
The configuration changes to enable a particular product in a cluster are handled during installation by lsfsetup. If at some later time you want to modify the products in your cluster, edit the PRODUCTS line in the Parameters section of the lsf.cluster.cluster file. You can specify one or more of the strings LSF_Base, LSF_Batch, LSF_JobScheduler, LSF_Analyzer, and LSF_MultiCluster to enable the operation of LSF Base, LSF Batch, LSF JobScheduler, LSF Analyzer, and LSF MultiCluster, respectively. If any of LSF_Batch, LSF_JobScheduler, or LSF_MultiCluster is specified, then LSF_Base is automatically enabled as well.
If the lsf.cluster.cluster file is shared, adding a product name to the PRODUCTS line enables that product for all hosts in the cluster. For example, to enable the operation of LSF Base, LSF Batch, and LSF MultiCluster:
Begin Parameters
PRODUCTS=LSF_Base LSF_Batch LSF_MultiCluster
End Parameters
To enable the operation of LSF Base only:
Begin Parameters
PRODUCTS=LSF_Base
End Parameters
To enable the operation of LSF JobScheduler:
Begin Parameters
PRODUCTS=LSF_JobScheduler LSF_Base
End Parameters
It is possible to indicate that only certain hosts within a cluster run LSF Batch or LSF JobScheduler. This is done by specifying LSF_Batch or LSF_JobScheduler in the RESOURCES field of the Host section of the lsf.cluster.cluster file. For example, the following enables hosts hostA, hostB, and hostC to run LSF JobScheduler, and hosts hostD, hostE, and hostF to run LSF Batch.
Begin Parameters
PRODUCTS=LSF_Batch LSF_Base
End Parameters

Begin Host
HOSTNAME  model   type      server  RESOURCES
hostA     SUN41   SPARCSLC  1       (sparc bsd LSF_JobScheduler)
hostB     HPPA9   HP735     1       (linux LSF_JobScheduler)
hostC     SGI     SGIINDIG  1       (irix cs LSF_JobScheduler)
hostD     SUNSOL  SunSparc  1       (solaris)
hostE     HP_UX   A900      1       (hpux cs bigmem)
hostF     ALPHA   DEC5000   1       (alpha)
End Host
The license file used to serve the cluster must have the corresponding features. A host shows as unlicensed if the license for the component it is configured to run is unavailable. For example, if a cluster is configured to run LSF_JobScheduler on all hosts, and the license file does not contain the LSF_JobScheduler feature, then the hosts will be unlicensed, even if there are licenses for LSF Base.