

6. After Installation

This section describes how to start the LSF daemons, test the LSF cluster configuration, and make LSF available to users at your site. These procedures can be performed only after the LSF software has been installed and the hosts have been configured individually (see `Default Installation' on page 13 or `Custom Installation' on page 23).

Checking Cluster Configuration

Before you start any LSF daemons, make sure that your cluster configuration is correct. The lsfsetup program includes an option to check the LSF configuration. The default LSF configuration should work as installed if you followed the steps described in `Default Installation Procedures' on page 14.

  1. Log into the first host listed in lsf.cluster.cluster (where cluster is the name of your cluster) as the LSF administrator.
  2. Check the LIM configuration by entering the following command:
    lsadmin ckconfig -v 
    

The lsadmin program is located in the LSF_TOP/bin directory.

  3. Check the output of the command to make sure there are no errors.

    The output should look something like the following:

    Checking configuration files ...
    LSF v3.1, Sept 10, 1997
    Copyright 1992-1997 Platform Computing Corporation
    Reading configuration from /etc/lsf.conf
    Dec 21 21:15:51 13412 /usr/local/lsf/etc/lim -C
    Dec 21 21:15:52 13412 initLicense: Trying to get license for LIM from source 
    </usr/local/lsf/conf/license.dat>
    Dec 21 21:15:52 13412 main: Got 1 licenses
    Dec 21 21:15:52 13412 main: Configuration checked. No fatal errors found.
    ---------------------------------------------------------
    No errors found.

    The messages shown above are the normal output from lsadmin ckconfig -v. Other messages may indicate problems with the LSF configuration.

Checking Batch Daemon Configuration

This check is required for both LSF Batch and LSF JobScheduler.

To check the LSF Batch configuration files, LIM must be running on the master host.

  1. If the LIM is not running, log in as root and start LSF_SERVERDIR/lim.
  2. Wait a minute, and then run the lsid program to make sure the LIM is available. (A minimal sketch of these two steps appears after the sample output below.)

    The lsid program is located in the LSF_TOP/bin directory.

  3. Check the batch configuration by entering the following command:
    badmin ckconfig -v
    The output should look something like the following:
    Checking configuration files ...
    Dec 21 21:22:14 13545 mbatchd: LSF_ENVDIR not defined; assuming /etc
    Dec 21 21:22:15 13545 minit: Trying to call LIM to get cluster name ...
    Dec 21 21:22:17 13545 readHostFile: 3 hosts have been specified in file 
    </usr/local/lsf/conf/lsbatch/test_cluster/configdir/lsb.hosts>; only these 
    hosts will be used by lsbatch
    Dec 21 21:22:17 13545 Checking Done
    ---------------------------------------------------------
    No fatal errors found.
  4. Check the output of the command to make sure there are no errors.

    The above messages are normal; other messages may indicate problems with the LSF configuration.
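
    If you had to start the LIM by hand in step 1, the following is a minimal sketch of steps 1 and 2. It assumes LSF_SERVERDIR is set in your environment and that lsid is on your PATH; adjust the paths to your installation.

    # As root on the master host; LSF_SERVERDIR is as defined in lsf.conf
    $LSF_SERVERDIR/lim

    # Give the LIM a minute to initialize, then confirm that it answers
    sleep 60
    lsid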

Starting the LSF Daemons

The LSF daemons can be started using the lsf_daemons program. This program must be run from the root account, so if you are starting daemons for a private cluster, do not use lsf_daemons: start the daemons manually instead.
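
For a private cluster, a minimal manual startup might look like the following sketch. It assumes LSF_SERVERDIR is set in your environment (as defined in lsf.conf) and that you repeat the startup on each host in the private cluster; adjust the details to your installation.

    # Start the three server daemons from your own (non-root) account
    $LSF_SERVERDIR/lim        # load information manager
    $LSF_SERVERDIR/res        # remote execution server
    $LSF_SERVERDIR/sbatchd    # slave batch daemon

    # Confirm that the daemons are running (use ps aux on BSD-style systems)
    ps -ef | egrep 'lim|res|sbatchd'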

  1. Start the LSF daemons by running the following command:
    lsf_daemons start
    
  2. Use the ps command to check that the lim, res, and sbatchd processes have started.

    Alternatively, you can start the LSF daemons on all hosts using the lsadmin and badmin commands. Instead of running lsf_daemons, execute the following commands in order:

    lsadmin limstartup
    lsadmin resstartup
    badmin hstartup 
    

    lsfsetup creates a default LSF Batch configuration (including a set of batch queues) which is used by both LSF Batch and LSF JobScheduler. You do not need to change any LSF Batch files to use the default configuration.

Testing the LSF Cluster

After you have started the LSF daemons in your cluster, you should run some simple tests. Wait a minute or two for all the LIMs to get in touch with each other, to elect a master, and to exchange some setup information.

The testing should be performed as a non-root user. The user's PATH must include the directory containing the LSF user binaries (LSF_BINDIR, as defined in LSF_ENVDIR/lsf.conf).

Testing consists of running a number of LSF commands and making sure that correct results are reported for all hosts in the cluster. This section shows suggested tests and examples of correct output. The output you see on your system will reflect your local configuration.

The following steps may be performed from any host in the cluster.

Testing LIM

  1. Check cluster name and master host name:
    % lsid
    LSF 3.1, Dec 10, 1997
    Copyright 1992-1997 Platform Computing Corporation
    My cluster name is test_cluster
    My master name is hostA

    The master name may vary but is usually the first host configured in the Hosts section of the lsf.cluster.cluster file.

If the LIM is not available on the local host, lsid displays the following message:

lsid: ls_getmastername failed: LIM is down; try later 

If the LIM has only just been started, it may need a few seconds to initialize; try running lsid a few more times.

The error message

lsid: ls_getmastername failed: Cannot locate master LIM now, try later 

means that the local LIM is running, but the master LIM has not yet contacted the local LIM. Check the LIM on the first host listed in lsf.cluster.cluster. If it is running, wait 30 seconds and try lsid again. If it is not running, another LIM will take over as master after one or two minutes.
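
If you prefer not to retype the command, a small Bourne shell loop such as the following can be used to wait for the LIMs to settle. This is only a sketch; it assumes lsid is on your PATH and returns a non-zero exit status while the LIM is unavailable.

    # Retry lsid every 15 seconds until the master LIM answers
    until lsid
    do
        echo "LIM not ready yet; retrying in 15 seconds..."
        sleep 15
    done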

  2. The lsinfo command displays cluster-wide configuration information:
    % lsinfo
    RESOURCE_NAME   TYPE      ORDER  DESCRIPTION
    r15s            Numeric   Inc    15-second CPU run queue length
    r1m             Numeric   Inc    1-minute CPU run queue length (alias: cpu)
    r15m            Numeric   Inc    15-minute CPU run queue length
    ut              Numeric   Inc    1-minute CPU utilization (0.0 to 1.0)
    pg              Numeric   Inc    Paging rate (pages/second)
    ls              Numeric   Inc    Number of login sessions (alias: login)
    it              Numeric   Dec    Idle time (minutes) (alias: idle)
    tmp             Numeric   Dec    Disk space in /tmp (Mbytes)
    mem             Numeric   Dec    Available memory (Mbytes)
    ncpus           Numeric   Dec    Number of CPUs
    maxmem          Numeric   Dec    Maximum memory (Mbytes)
    maxtmp          Numeric   Dec    Maximum /tmp space (Mbytes)
    cpuf            Numeric   Dec    CPU factor
    type            String    N/A    Host type
    model           String    N/A    Host model
    status          String    N/A    Host status
    server          Boolean   N/A    LSF server host
    cserver         Boolean   N/A    Compute Server
    solaris         Boolean   N/A    Sun Solaris operating system
    fserver         Boolean   N/A    File Server
    NT              Boolean   N/A    Windows NT operating system        

    TYPE_NAME
    hppa
    SUNSOL
    alpha
    sgi
    NTX86
    rs6000

    MODEL_NAME            CPU_FACTOR
    HP735                 4.0
    ORIGIN2K              8.0
    DEC3000               5.0
    PENT200               3.0

    The resource names, host types, and host models should be those configured in LSF_CONFDIR/lsf.shared.

  3. The lshosts command displays configuration information about your hosts:
    % lshosts
    HOST_NAME type   model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
    hostA     hppa   HP735     4.00  1      128M    256M    Yes     (fserver hpux)
    hostD     sgi    ORIGIN2K  8.00  32     512M    1024M   Yes     (cserver)
    hostB     NTX86  PENT200   3.00  1      96M     180M    Yes     (NT)

    The output should contain one line for each host configured in the cluster, and the type, model, and RESOURCES should be those configured for that host in lsf.cluster.cluster. cpuf should match the CPU factor given for the host model in lsf.shared.

  4. Check the current load levels:
    % lsload
    HOST_NAME  status  r15s   r1m  r15m    ut    pg   ls  it   tmp    swp    mem
    hostA      ok      0.3    0.1  0.0     3%    1.0  1   12   122M   116M   56M
    hostD      ok      0.6    1.2  2.0     23%   3.0  14  0    63M    698M   344M
    hostB      ok      0.6    0.3  0.0     5%    0.3  1   0    55M    41M    37M

    The output contains one line for each host in the cluster.

    If any host has unavail in the status column, the master LIM is unable to contact the LIM on that host. This can occur if the LIM was started recently and has not yet contacted the master LIM, or if no LIM was started on that host, or if that host was not configured correctly.

    If the entry in the status column begins with - (for example, -ok), the RES is not available on that host. RES status is checked every 90 seconds, so allow up to 90 seconds for the status column to reflect a change in RES availability.

    If all these tests succeed, the LIMs on all hosts are running correctly.

Testing RES

  1. The lsgrun command runs a UNIX command on a group of hosts:
    % lsgrun -v -m "hostA hostD hostB" hostname
    <<Executing hostname on hostA>>
    hostA
    <<Executing hostname on hostD>>
    hostD
    <<Executing hostname on hostB>>
    hostB

    If remote execution fails on any host, check the RES error log on that host.
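
    If LSF_LOGDIR is defined in lsf.conf, the RES on each host normally writes its errors to a file named res.log.hostname in that directory; otherwise errors go to syslog. A quick check on hostD might look like the following (the log directory shown is only an example):

    # Show the most recent entries in the RES error log on hostD
    tail -20 /usr/local/lsf/log/res.log.hostD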

Testing LSF Batch

Testing consists of running a number of LSF commands and making sure that correct results are reported for all hosts in the cluster.

  1. The bhosts command lists the batch server hosts in the cluster:
    % bhosts
    HOST_NAME   STATUS    JL/U  MAX  NJOBS  RUN  SSUSP USUSP  RSV
    hostD       ok        -     10   1      1    0     0      0
    hostA       ok        -     10   4      2    2     0      0
    hostC       unavail   -     3    1      1    0     0      0

    The STATUS column shows the status of sbatchd on that host. If the STATUS column contains unavail, that host is not available. Either the sbatchd on that host has not started or it has started but has not yet contacted the mbatchd. If hosts are still listed as unavailable after roughly three minutes, check the error logs on those hosts.

See the bhosts(1) manual page for explanations of the other columns.

  2. Submit a job to the default queue:
    % bsub sleep 60
    Job <1> is submitted to default queue <normal>

    If this is the first job ever submitted to the cluster, it is assigned job ID 1; otherwise, the job ID varies.

  3. Check available queues and their configuration parameters:
    % bqueues
    QUEUE_NAME    PRIO   STATUS          MAX  JL/U JL/P JL/H NJOBS  PEND  RUN  SUSP
    interactive   400    Open:Active     -    -    -    -    1      1     0    0
    fairshare     300    Open:Active     -    -    -    -    2      0     2    0
    owners        43     Open:Active     -    -    -    -    0      0     0    0
    priority      43     Open:Active     -    -    -    -    29     29    0    0
    night         40     Open:Inactive   -    -    -    -    1      1     0    0
    short         35     Open:Active     -    -    -    -    0      0     0    0
    normal        30     Open:Active     -    -    -    -    0      0     0    0
    idle          20     Open:Active     -    -    -    -    0      0     0    0

    See the bqueues(1) manual page for an explanation of the output.

  4. Check job status:
    % bjobs
    JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST  JOB_NAME SUBMIT_TIME
    1     fred  RUN  normal hostA     hostD      sleep 60 Dec 10 22:44

    If all hosts are busy, the job is not started immediately, and the STAT column shows PEND. The job should take about one minute to run; when it completes, you should receive mail reporting the job completion.
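
If you would rather collect the job output in a file than receive it by mail, you can submit with the -o option of bsub; for example (the queue and file names below are only examples):

    % bsub -q short -o /tmp/sleep.out sleep 60

With -o, the job report is appended to the specified file when the job finishes, instead of being sent by mail.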

Configuring LSF MultiCluster

You do not need to read this section if you are not using the LSF MultiCluster product.

LSF MultiCluster unites multiple LSF clusters so that they can share resources transparently while still maintaining the resource ownership and autonomy of the individual clusters.

LSF MultiCluster extends the functionality of a single cluster, so configuration involves a few additional steps: first set up each cluster as described above, then perform the extra configuration steps specific to LSF MultiCluster.

Configuring LSF JobScheduler

You do not need to read this section if you are not using the LSF JobScheduler product.

LSF JobScheduler provides reliable production job scheduling according to user-specified calendars and events. It runs user-defined jobs automatically at the right time, under the right conditions, and on the right machines.

Configuring LSF JobScheduler is almost the same as configuring an LSF Batch cluster, except that you may need to define system-level calendars for your cluster and add events to monitor your site.

Providing LSF to Users

When you have finished installing and testing the LSF cluster, you can let users try it out. LSF users must add LSF_BINDIR to their PATH environment variable to run the LSF utilities.

Users also need access to the on-line manual pages, which were installed in LSF_MANDIR (as defined in lsf.conf) by the lsfsetup installation procedure. For most versions of UNIX, users should add the directory LSF_MANDIR to their MANPATH environment variable. If your system has a man command that does not understand MANPATH, you should either install the manual pages in the /usr/man directory or get one of the freely available man programs.
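
For example, users of csh-like shells could add lines such as the following to their .cshrc, and users of Bourne-like shells to their .profile (the directories shown are examples; substitute the actual LSF_BINDIR and LSF_MANDIR values from lsf.conf):

    # csh or tcsh (~/.cshrc) -- directories are examples
    setenv PATH ${PATH}:/usr/local/lsf/bin
    setenv MANPATH /usr/local/lsf/man:/usr/man

    # sh, ksh, or bash (~/.profile) -- directories are examples
    PATH=$PATH:/usr/local/lsf/bin
    MANPATH=/usr/local/lsf/man:/usr/man
    export PATH MANPATH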

Note

The /etc/lsf.conf file (or LSF_CONFDIR/lsf.conf if you used the Default installation procedure) must be available.

Using xlsadmin

You can use the xlsadmin graphical tool to do most of the cluster configuration and management work that has been described in this chapter.


