This section describes the procedures that must be used to start up the LSF daemons, test the LSF cluster configuration, and provide LSF to users at your site. These procedures cannot be performed until after the LSF software has been installed and the hosts have been configured individually (see `Default Installation' on page 13 or `Custom Installation' on page 23).
Before you can start any LSF daemons, you should make sure that your cluster configuration is correct. The lsfsetup program includes an option to check the LSF configuration. The default LSF configuration should work as it is installed following the steps described in `Default Installation Procedures' on page 14.
Log in as the LSF administrator on a host in the cluster defined by the lsf.cluster.cluster file (cluster is the name of the cluster), and run the following command:

lsadmin ckconfig -v

The lsadmin program is located in the LSF_TOP/bin directory.
The output should look something like the following:
Checking configuration files ...
LSF v3.1, Sept 10, 1997
Copyright 1992-1997 Platform Computing Corporation
Reading configuration from /etc/lsf.conf
Dec 21 21:15:51 13412 /usr/local/lsf/etc/lim -C
Dec 21 21:15:52 13412 initLicense: Trying to get license for LIM from source
</usr/local/lsf/conf/license.dat>
Dec 21 21:15:52 13412 main: Got 1 licenses
Dec 21 21:15:52 13412 main: Configuration checked. No fatal errors found.
---------------------------------------------------------
No errors found.
The messages shown above are the normal output from lsadmin ckconfig -v. Other messages may indicate problems with the LSF configuration. Both LSF Batch and LSF JobScheduler require this check to be made.
To check the LSF Batch configuration files, LIM must be running on the master host. If LIM has not been started yet, start it by running LSF_SERVERDIR/lim. Run the lsid program to make sure LIM is available; the lsid program is located in the LSF_TOP/bin directory. Then check the LSF Batch configuration by running:

badmin ckconfig -v
The output should look something like the following:
Checking configuration files ...
Dec 21 21:22:14 13545 mbatchd: LSF_ENVDIR not defined; assuming /etc
Dec 21 21:22:15 13545 minit: Trying to call LIM to get cluster name ...
Dec 21 21:22:17 13545 readHostFile: 3 hosts have been specified in file
</usr/local/lsf/conf/lsbatch/test_cluster/configdir/lsb.hosts>; only these
hosts will be used by lsbatch
Dec 21 21:22:17 13545 Checking Done
---------------------------------------------------------
No fatal errors found.
The above messages are normal; other messages may indicate problems with the LSF configuration.
The LSF daemons can be started using the lsf_daemons program. This program must be run from the root account, so if you are starting daemons for a private cluster, do not use lsf_daemons: start the daemons manually instead.
lsf_daemons start
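For a private cluster, the equivalent manual startup is simply to run the daemon binaries from LSF_SERVERDIR yourself. The directory below is only an example; use the LSF_SERVERDIR defined in your lsf.conf:

% /usr/local/lsf/etc/lim
% /usr/local/lsf/etc/res
% /usr/local/lsf/etc/sbatchd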
Check that the res, lim, and sbatchd processes have started by using the ps command.
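For example, on a system with a System V style ps (the options and output format vary by platform, and the paths and process IDs below are only illustrative), the check might look like this:

% ps -ef | egrep 'lim|res|sbatchd' | grep -v grep
root 13466     1  0 21:30:01 ?     0:01 /usr/local/lsf/etc/lim
root 13470     1  0 21:30:03 ?     0:00 /usr/local/lsf/etc/res
root 13474     1  0 21:30:05 ?     0:00 /usr/local/lsf/etc/sbatchd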
If you choose, you can start the LSF daemons on all machines using the lsadmin and badmin commands. Do this by executing the following commands in order, instead of using the lsf_daemons command.
lsadmin limstartup
lsadmin resstartup
badmin hstartup
The lsfsetup program creates a default LSF Batch configuration (including a set of batch queues), which is used by both LSF Batch and LSF JobScheduler. You do not need to change any LSF Batch files to use the default configuration.
After you have started the LSF daemons in your cluster, you should run some simple tests. Wait a minute or two for all the LIMs to get in touch with each other, to elect a master, and to exchange some setup information.
The testing should be performed as a non-root user. This user's PATH must include the LSF user binaries (LSF_BINDIR, as defined in LSF_ENVDIR/lsf.conf).
Testing consists of running a number of LSF commands and making sure that correct results are reported for all hosts in the cluster. This section shows suggested tests and examples of correct output. The output you see on your system will reflect your local configuration.
The following steps may be performed from any host in the cluster.
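Before running the tests, you can confirm that the LSF user binaries are on your PATH. The directory shown below is only an example and should match LSF_BINDIR at your site:

% which lsid
/usr/local/lsf/bin/lsid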
% lsid
LSF 3.1, Dec 10, 1997
Copyright 1992-1997 Platform Computing Corporation
My cluster name is test_cluster
My master name is hostA
The master name may vary, but it is usually the first host configured in the Hosts section of the lsf.cluster.cluster file.
If the LIM is not available on the local host, lsid displays the following message:

lsid: ls_getmastername failed: LIM is down; try later

If you have just started the LIM, try running lsid a few more times. The error message

lsid: ls_getmastername failed: Cannot locate master LIM now, try later

means that the local LIM is running, but the master LIM has not contacted the local LIM yet. Check the LIM on the first host listed in lsf.cluster.cluster. If it is running, wait for 30 seconds and try lsid again. If it is not running, another LIM takes over as master after one or two minutes.
The lsinfo command displays cluster-wide configuration information.
% lsinfo
RESOURCE_NAME TYPE ORDER DESCRIPTION
r15s Numeric Inc 15-second CPU run queue length
r1m Numeric Inc 1-minute CPU run queue length (alias: cpu)
r15m Numeric Inc 15-minute CPU run queue length
ut Numeric Inc 1-minute CPU utilization (0.0 to 1.0)
pg Numeric Inc Paging rate (pages/second)
ls Numeric Inc Number of login sessions (alias: login)
it Numeric Dec Idle time (minutes) (alias: idle)
tmp Numeric Dec Disk space in /tmp (Mbytes)
mem Numeric Dec Available memory (Mbytes)
ncpus Numeric Dec Number of CPUs
maxmem Numeric Dec Maximum memory (Mbytes)
maxtmp Numeric Dec Maximum /tmp space (Mbytes)
cpuf Numeric Dec CPU factor
type String N/A Host type
model String N/A Host model
status String N/A Host status
server Boolean N/A LSF server host
cserver Boolean N/A Compute Server
solaris Boolean N/A Sun Solaris operating system
fserver Boolean N/A File Server
NT Boolean N/A Windows NT operating system
TYPE_NAME
hppa
SUNSOL
alpha
sgi
NTX86
rs6000

MODEL_NAME   CPU_FACTOR
HP735               4.0
ORIGIN2K            8.0
DEC3000             5.0
PENT200             3.0
The resource names, host types, and host models should be those configured in LSF_CONFDIR/lsf.shared.
The lshosts command displays configuration information about your hosts:
% lshosts
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
hostA hppa HP735 4.00 1 128M 256M Yes (fserver hpux)
hostD sgi ORIGIN2K 8.00 32 512M 1024M Yes (cserver)
hostB NTX86 PENT200 3.00 1 96M 180M Yes (NT)
The output should contain one line for each host configured in the cluster. The type, model, and RESOURCES should be those configured for that host in lsf.cluster.cluster, and cpuf should match the CPU factor given for the host model in lsf.shared.
The lsload command displays the current load levels of the cluster:

% lsload
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
hostA ok 0.3 0.1 0.0 3% 1.0 1 12 122M 116M 56M
hostD ok 0.6 1.2 2.0 23% 3.0 14 0 63M 698M 344M
hostB ok 0.6 0.3 0.0 5% 0.3 1 0 55M 41M 37M
The output contains one line for each host in the cluster.
If any host has unavail in the status column, the master LIM is unable to contact the LIM on that host. This can occur if the LIM was started recently and has not yet contacted the master LIM, if no LIM was started on that host, or if that host was not configured correctly.
If the entry in the status column begins with - (for example, -ok), the RES is not available on that host. RES status is checked every 90 seconds, so allow enough time for the status column to reflect this.
If all these tests succeed, the LIMs on all hosts are running correctly.
The lsgrun command runs a UNIX command on a group of hosts:
% lsgrun -v -m "hostA hostD hostB" hostname
<<Executing hostname on hostA>>
hostA
<<Executing hostname on hostD>>
hostD
<<Executing hostname on hostB>>
hostB
If remote execution fails on any host, check the RES error log on that host.
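The RES error log is written to the directory named by LSF_LOGDIR in lsf.conf (if LSF_LOGDIR is not defined, the daemons log through syslog). For example, assuming LSF_LOGDIR is /usr/local/lsf/log, you could examine the log for hostB as follows:

% tail /usr/local/lsf/log/res.log.hostB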
Next, test LSF Batch by running the LSF Batch commands and making sure that correct results are reported for all hosts in the cluster.
The bhosts command lists the batch server hosts in the cluster:
% bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
hostD ok - 10 1 1 0 0 0
hostA ok - 10 4 2 2 0 0
hostC unavail - 3 1 1 0 0 0
The STATUS column shows the status of sbatchd on that host. If the STATUS column contains unavail, that host is not available: either the sbatchd on that host has not started, or it has started but has not yet contacted the mbatchd. If hosts are still listed as unavailable after roughly three minutes, check the error logs on those hosts.
See the bhosts(1) manual page for explanations of the other columns.
Use the bsub command to submit a job to the default queue:

% bsub sleep 60
Job <1> is submitted to default queue <normal>
If the job you submitted was the first ever, it should have job ID 1. Otherwise, the number varies.
The bqueues command lists the batch queues in the cluster:

% bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
interactive 400 Open:Active - - - - 1 1 0 0
fairshare 300 Open:Active - - - - 2 0 2 0
owners 43 Open:Active - - - - 0 0 0 0
priority 43 Open:Active - - - - 29 29 0 0
night 40 Open:Inactive - - - - 1 1 0 0
short 35 Open:Active - - - - 0 0 0 0
normal 30 Open:Active - - - - 0 0 0 0
idle 20 Open:Active - - - - 0 0 0 0
See the bqueues(1) manual page for an explanation of the output.
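Once you know the queue names, you can also submit a job to a specific queue with the -q option of bsub. The queue name and job ID below are only examples:

% bsub -q short sleep 30
Job <2> is submitted to queue <short>.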
The bjobs command displays the status of batch jobs:

% bjobs
JOBID USER STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1     fred RUN  normal hostA     hostD     sleep 60 Dec 10 22:44
Note that if all hosts are busy, the job is not started immediately, so the STAT column says PEND. This job should take one minute to run. When the job completes, you should receive mail reporting the job completion.
You do not need to read this section if you are not using the LSF MultiCluster product.
LSF MultiCluster unites multiple LSF clusters so that they can share resources transparently while still maintaining resource ownership and autonomy for the individual clusters.
LSF MultiCluster extends the functionality of a single cluster, and configuration involves a few more steps: first you set up a single cluster as described above, then you carry out some additional steps specific to LSF MultiCluster.
You do not need to read this section if you are not using the LSF JobScheduler product.
LSF JobScheduler provides reliable production job scheduling according to user-specified calendars and events. It runs user-defined jobs automatically at the right time, under the right conditions, and on the right machines.
The configuration of LSF JobScheduler is almost the same as that of an LSF Batch cluster, except that you may have to define system-level calendars for your cluster and may need to add additional events to monitor your site.
When you have finished installing and testing the LSF cluster, you can let users try it out. LSF users must add LSF_BINDIR to their PATH environment variable to run the LSF utilities.
Users also need access to the on-line manual pages, which were installed in LSF_MANDIR (as defined in lsf.conf) by the lsfsetup installation procedure. For most versions of UNIX, users should add the directory LSF_MANDIR to their MANPATH environment variable. If your system has a man command that does not understand MANPATH, you should either install the manual pages in the /usr/man directory or get one of the freely available man programs.
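For example, a user of a Bourne-compatible shell might add lines like the following to $HOME/.profile, substituting the LSF_BINDIR and LSF_MANDIR values used at your site (the directories shown are only examples):

PATH=/usr/local/lsf/bin:$PATH
MANPATH=/usr/local/lsf/man:$MANPATH
export PATH MANPATH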
The /etc/lsf.conf file (or LSF_CONFDIR/lsf.conf if you used the Default installation procedure) must be available.
You can use the xlsadmin graphical tool to do most of the cluster configuration and management work that has been described in this chapter.
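For example, assuming LSF_BINDIR is on your PATH and your DISPLAY environment variable points at an X server, you can start the tool in the background:

% xlsadmin &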