

3 Programming with LSBLIB

This chapter shows how to use LSBLIB to access the services provided by LSF Batch and LSF JobScheduler. Since LSF Batch and LSF JobScheduler are built on top of LSF Base, LSBLIB relies on services provided by LSLIB. Thus if you use LSBLIB functions, you must link your program with both LSLIB and LSBLIB.

LSF Batch and LSF JobScheduler services are mostly provided by mbatchd, except services for processing event and job log files which do not involve any daemons. LSBLIB is shared by both LSF Batch and LSF JobScheduler. The functions described for LSF Batch in this chapter also apply to LSF JobScheduler, unless explicitly indicated otherwise.

Initializing LSF Batch Applications

Before accessing any of the services provided by LSF Batch and LSF JobScheduler, an application must initialize LSBLIB by calling the following function:

int lsb_init(appname);

On success, it returns 0; otherwise, it returns -1 and sets lsberrno to indicate the error.

The parameter appname is used only if you want to log detailed messages about the transactions inside LSLIB for debugging purposes. The messages are logged only if LSB_CMD_LOG_MASK is defined as LOG_DEBUG1.

The messages will be logged in file LSF_LOGDIR/appname. If appname is NULL, the log file is LSF_LOGDIR/bcmd.

Note

This function must be called before any other function in LSBLIB can be called.
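The log file naming rule described above can be sketched as a small helper. This is only an illustration: example_log_path() is a hypothetical function, not part of LSBLIB, and the directory is passed in by the caller rather than read from the LSF configuration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not an LSBLIB function: builds the debug log path
 * logdir/appname, falling back to logdir/bcmd when appname is NULL.
 * Returns 0 on success, or -1 if the result does not fit in buf. */
int example_log_path(const char *logdir, const char *appname,
                     char *buf, size_t size)
{
    const char *name = (appname != NULL) ? appname : "bcmd";
    size_t dlen, nlen;

    assert(logdir != NULL && buf != NULL);
    dlen = strlen(logdir);
    nlen = strlen(name);
    if (dlen + 1 + nlen + 1 > size)
        return -1;                          /* no room for '/' and '\0' */
    memcpy(buf, logdir, dlen);
    buf[dlen] = '/';
    memcpy(buf + dlen + 1, name, nlen + 1); /* also copies the '\0' */
    return 0;
}
```

Passing NULL for appname yields a path ending in bcmd, which matches the default log file name given above.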

Getting Information about LSF Batch Queues

LSF Batch queues hold the jobs in the LSF Batch system and set scheduling policies and limits on resource usage.

LSBLIB provides a function to get information about the queues in the LSF Batch system. This includes queue name, parameters, statistics, status, resource limits, scheduling policies and parameters, and users and hosts associated with the queue.

The example program in this section uses the following LSBLIB function to get the queue information:

struct queueInfoEnt *lsb_queueinfo(queues,numQueues,hostname,username,options)

On success, this function returns an array containing a queueInfoEnt structure (see below) for each queue of interest and sets *numQueues to the size of the array. On failure, it returns NULL and sets lsberrno to indicate the error. It has the following parameters:

char  **queues;           An array containing names of queues of interest
int   *numQueues;         The number of names in queues
char  *hostname;          Only queues using hostname are of interest
char  *username;          Only queues enabled for user are of interest
int   options;            Reserved for future use; supply 0

To get information on all queues, set *numQueues to 0; *numQueues will be updated to the actual number of queues returned on a successful return.

If *numQueues is 1 and queues is NULL, information on the system default queue is returned.

If hostname is not NULL, then all queues using host hostname as a batch server host will be returned. If username is not NULL, then all queues to which user username is allowed to submit jobs will be returned.

The queueInfoEnt structure is defined in lsbatch.h as

struct queueInfoEnt {
    char  *queue;          Name of the queue
    char  *description;    Description of the queue
    int   priority;        Priority of the queue
    short nice;            Nice value at which jobs in the queue will be run
    char  *userList;       Users allowed to submit jobs to the queue
    char  *hostList;       Hosts to which jobs in the queue may be dispatched
    int   nIdx;            Size of the loadSched and loadStop arrays
    float *loadSched;      Load thresholds that control scheduling of jobs from the queue
    float *loadStop;       Load thresholds that control suspension of jobs from the queue
    int   userJobLimit;    Number of unfinished jobs a user can dispatch from the queue
    int   procJobLimit;    Number of unfinished jobs the queue can dispatch to a processor
    char  *windows;        Queue run window
    int   rLimits[LSF_RLIM_NLIMITS];  The per-process resource limits for jobs
    char  *hostSpec;       Obsolete. Use defaultHostSpec instead
    int   qAttrib;         Attributes of the queue
    int   qStatus;         Status of the queue
    int   maxJobs;         Job slot limit of the queue.
    int   numJobs;         Total number of job slots required by all jobs
    int   numPEND;         Number of job slots needed by pending jobs
    int   numRUN;          Number of job slots used by running jobs
    int   numSSUSP;        Number of job slots used by system suspended jobs
    int   numUSUSP;        Number of job slots used by user suspended jobs
    int   mig;             Queue migration threshold in minutes
    int   schedDelay;      Schedule delay for new jobs
    int   acceptIntvl;     Minimum interval between two jobs dispatched to the same host
    char  *windowsD;       Queue dispatch window
    char  *nqsQueues;      A blank-separated list of NQS queue specifiers
    char  *userShares;     A blank-separated list of user shares
    char  *defaultHostSpec; Value of DEFAULT_HOST_SPEC for the queue in lsb.queues
    int   procLimit;       Maximum number of job slots a job can take
    char  *admins;         Queue level administrators
    char  *preCmd;         Queue level pre-exec command 
    char  *postCmd;        Queue's post-exec command 
    char  *requeueEValues; Queue's requeue exit status 
    int   hostJobLimit;    Per host job slot limit 
    char  *resReq;         Queue level resource requirement 
    int   numRESERVE;      Reserved job slots for pending jobs 
    int   slotHoldTime;    Time period for reserving job slots
    char  *sndJobsTo;      Remote queues to forward jobs to 
    char  *rcvJobsFrom;    Remote queues which can forward to me 
    char  *resumeCond;     Conditions to resume jobs 
    char  *stopCond;       Conditions to suspend jobs 
    char  *jobStarter;     Queue level job starter 
    char  *suspendActCmd;  Action commands for SUSPEND
    char  *resumeActCmd;   Action commands for RESUME 
    char  *terminateActCmd; Action commands for TERMINATE 
    int   sigMap[LSB_SIG_NUM]; Configurable signal mapping 
    char  *preemption;     Preemption policy
    int    maxRschedTime;  Time period for remote cluster to schedule job
};

The variable nIdx is the number of load threshold values for job scheduling. This is in fact the total number of load indices as returned by LIM. The parameters sndJobsTo, rcvJobsFrom, and maxRschedTime are only used with LSF MultiCluster.

For a complete description of the fields in the queueInfoEnt structure, see the lsb_queueinfo(3) man page.

The program below takes a queue name as the first argument and displays information about the named queue.

#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

int 
main (argc, argv)
    int  argc;
    char *argv[];
{
    struct queueInfoEnt *qInfo;
    int  numQueues = 1;
    char *queue=argv[1];
    int  i;

    if (argc != 2) {
        printf("Usage: %s queue_name\n", argv[0]);
        exit(-1);
    }

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init()");
        exit(-1);
    }

    qInfo = lsb_queueinfo(&queue, &numQueues, NULL, NULL, 0); 
    if (qInfo == NULL) { 
        lsb_perror("lsb_queueinfo()");
        exit(-1);
    }

    printf("Information about %s queue:\n", queue);
    printf("Description: %s\n", qInfo[0].description);
    printf("Priority: %d                     Nice:     %d     \n",
            qInfo[0].priority, qInfo[0].nice);
    printf("Maximum number of job slots:");
    if (qInfo[0].maxJobs < INFINIT_INT)
        printf("%5d\n", qInfo[0].maxJobs);
    else
        printf("%5s\n", "unlimited");

    printf("Job slot statistics: PEND(%d) RUN(%d) SUSP(%d) TOTAL(%d).\n",
         qInfo[0].numPEND, qInfo[0].numRUN,
         qInfo[0].numSSUSP + qInfo[0].numUSUSP, qInfo[0].numJobs);

    exit(0);
}

The header file lsbatch.h must be included with every application that uses LSBLIB functions. Note that lsf.h does not have to be explicitly included in your program because lsbatch.h already has lsf.h included. The function lsb_perror() is used in much the same way ls_perror() is used to print error messages regarding function call failure. You could check lsberrno if you want to take different actions for different errors.

In the above program, INFINIT_INT is defined in lsf.h and is used to indicate that there is no limit set for maxJobs. This applies to all LSF API function calls. LSF will supply INFINIT_INT automatically whenever the value for the variable is either invalid (not available) or infinity. This value should be checked for all variables that are optional. For example, if you were to display the loadSched/loadStop values, an INFINIT_INT indicates that the threshold is not configured and is ignored.
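The INFINIT_INT check described above tends to recur for every optional integer field, so it can be convenient to wrap it in one helper. The sketch below is an assumption-laden illustration: example_format_limit() is hypothetical, and EXAMPLE_INFINIT_INT is a local stand-in (set to INT_MAX here) so that the code compiles without lsf.h. A real program would compare against INFINIT_INT from lsf.h instead.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for INFINIT_INT so this sketch compiles without lsf.h.
 * The value INT_MAX is an assumption; use the lsf.h constant in real code. */
#define EXAMPLE_INFINIT_INT INT_MAX

/* Hypothetical helper: formats an optional integer limit into buf,
 * writing "unlimited" when the value is the "not set" marker. */
const char *example_format_limit(int value, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    if (value < EXAMPLE_INFINIT_INT) {
        snprintf(buf, size, "%d", value);
    } else {
        strncpy(buf, "unlimited", size - 1);
        buf[size - 1] = '\0';               /* strncpy may not terminate */
    }
    return buf;
}
```

The same test applies to fields such as maxJobs, userJobLimit, and procJobLimit.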

Note

As with the data structures returned by LSLIB functions, a data structure returned by an LSBLIB function is dynamically allocated inside LSBLIB and is automatically freed the next time the same function is called. You should not attempt to free the space allocated by LSBLIB. If you need to keep this information across calls, make your own copy of the data structure.
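Making your own copy means duplicating the strings as well as the scalar fields, because the pointers inside the returned structure refer to memory that LSBLIB reuses. The sketch below shows one way to do this for a few fields. The exampleQueueCopy structure and the example_* functions are hypothetical and exist only for this illustration; they are not LSBLIB types or calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical caller-owned subset of queueInfoEnt for this sketch. */
struct exampleQueueCopy {
    char *queue;
    char *description;
    int   maxJobs;
};

/* Duplicates a string with malloc; returns NULL for NULL input or failure. */
static char *example_dup(const char *s)
{
    size_t n;
    char *p;

    if (s == NULL)
        return NULL;
    n = strlen(s) + 1;
    p = malloc(n);
    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

/* Releases the strings owned by q and clears the pointers. */
void example_free_queue(struct exampleQueueCopy *q)
{
    assert(q != NULL);
    free(q->queue);
    free(q->description);
    q->queue = NULL;
    q->description = NULL;
}

/* Copies the given fields into dst. Returns 0 on success, -1 if
 * memory could not be allocated (dst is left with NULL strings). */
int example_copy_queue(struct exampleQueueCopy *dst, const char *queue,
                       const char *description, int maxJobs)
{
    assert(dst != NULL);
    dst->queue = example_dup(queue);
    dst->description = example_dup(description);
    dst->maxJobs = maxJobs;
    if ((queue != NULL && dst->queue == NULL) ||
        (description != NULL && dst->description == NULL)) {
        example_free_queue(dst);
        return -1;
    }
    return 0;
}
```

A caller would pass qInfo[0].queue, qInfo[0].description, and qInfo[0].maxJobs right after the query returns, and call example_free_queue() when the copy is no longer needed.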

The above program will produce output similar to the following:

Information about normal queue:
Description: For normal low priority jobs
Priority: 25            Nice: 20
Maximum number of job slots : 40
Job slot statistics: PEND( 5) RUN(12) SUSP(1) TOTAL(18)

Getting Information about LSF Batch Hosts

LSF Batch server hosts execute the jobs in the LSF Batch system.

LSBLIB provides a function to get information about the server hosts in the LSF Batch system. This includes both configured static information and dynamic information. Examples of host information include host name, status, job limits and statistics, dispatch windows, and scheduling parameters.

The example program in this section uses the following LSBLIB function:

struct hostInfoEnt *lsb_hostinfo(hosts, numHosts)

This function gets information about LSF Batch server hosts. On success, it returns an array of hostInfoEnt structures which hold the host information and sets *numHosts to the size of the array. On failure, it returns NULL and sets lsberrno to indicate the error. It has the following parameters:

char  **hosts;             An array of names of hosts of interest
int   *numHosts;           The number of names in hosts

To get information on all hosts, set *numHosts to 0; *numHosts will be set to the actual number of hostInfoEnt structures when this call returns successfully.

If *numHosts is 1 and hosts is NULL, information on the local host is returned.

The hostInfoEnt structure is defined in lsbatch.h as

struct hostInfoEnt {
    char  *host;             Name of the host
    int   hStatus;       Status of host. (see below)
    int   busySched;     Reason host will not schedule jobs
    int   busyStop;      Reason host has suspended jobs
    float cpuFactor;     Host CPU factor, as returned by LIM
    int   nIdx;          Size of the loadSched and loadStop arrays, as returned from LIM
    float *load;         Load LSF Batch used for scheduling batch jobs
    float *loadSched;    Load thresholds that control scheduling of jobs on host
    float *loadStop;     Load thresholds that control suspension of jobs on host
    char  *windows;      Host dispatch window
    int   userJobLimit;  Maximum number of jobs a user can run on host
    int   maxJobs;       Maximum number of jobs that host can process concurrently
    int   numJobs;       Number of jobs running or suspended on host
    int   numRUN;        Number of jobs running on host
    int   numSSUSP;      Number of jobs suspended by sbatchd on host
    int   numUSUSP;      Number of jobs suspended by a user on host
    int   mig;           Migration threshold for jobs on host
    int   attr;          Host attributes
#define H_ATTR_CHKPNTABLE  0x1
#define H_ATTR_CHKPNT_COPY 0x2
    float *realLoad;     The load mbatchd obtained from LIM
    int   numRESERVE;    Num of slots reserved for pending jobs
    int    chkSig;       This variable is obsolete
};

Note the differences between host information returned by LSLIB function ls_gethostinfo() and host information returned by the LSBLIB function lsb_hostinfo(). The former returns general information about the hosts whereas the latter returns LSF Batch specific information about hosts.

For a complete description of the fields in the hostInfoEnt structure, see the lsb_hostinfo(3) man page.

The example program below takes a host name as an argument and displays various information about the named host. It is a simplified version of the LSF Batch bhosts command.

#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

main (argc, argv)
    int  argc;
    char *argv[];
{
    struct hostInfoEnt *hInfo;
    int  numHosts = 1;
    char *hostname = argv[1];
    int  i;

    if (argc != 2) { 
        printf("Usage: %s hostname\n", argv[0]);
        exit(-1);
    }
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    hInfo = lsb_hostinfo(&hostname, &numHosts);

    if (hInfo == NULL) {
        lsb_perror("lsb_hostinfo");
        exit (-1);
    }

    printf("HOST_NAME    STATUS    JL/U  NJOBS  RUN  SSUSP USUSP\n");

    printf ("%-18.18s", hInfo->host);

    if (hInfo->hStatus & HOST_STAT_UNLICENSED) {
        printf(" %-9s\n", "unlicensed");
        exit(0);                  /* don't print other info */
    } else if (hInfo->hStatus & HOST_STAT_UNAVAIL)
        printf(" %-9s",  "unavail");
    else if (hInfo->hStatus & HOST_STAT_UNREACH)
        printf(" %-9s", "unreach");
    else if (hInfo->hStatus & ( HOST_STAT_BUSY | HOST_STAT_WIND
            | HOST_STAT_DISABLED | HOST_STAT_LOCKED
            | HOST_STAT_FULL | HOST_STAT_NO_LIM))
        printf(" %-9s", "closed");
    else
        printf(" %-9s", "ok");

    if (hInfo->userJobLimit < INFINIT_INT)
        printf("%4d", hInfo->userJobLimit);
    else
        printf("%4s", "-");

    printf("%7d  %4d  %4d  %4d\n",
        hInfo->numJobs, hInfo->numRUN, hInfo->numSSUSP, hInfo->numUSUSP);

    exit(0);
}

hStatus is the status of the host. It is the bitwise inclusive OR of some of the following constants defined in lsbatch.h:

HOST_STAT_BUSY
The host load is greater than a scheduling threshold. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_WIND
The host dispatch window is closed. In this status, no new batch job will be accepted.
HOST_STAT_DISABLED
The host has been disabled by the LSF administrator and will not accept jobs. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_LOCKED
The host is locked by an exclusive job. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_FULL
The host has reached its job limit. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_UNREACH
The sbatchd on this host is unreachable.
HOST_STAT_UNAVAIL
The LIM and sbatchd on this host are unreachable.
HOST_STAT_UNLICENSED
The host does not have an LSF license.
HOST_STAT_NO_LIM
The host is running an sbatchd but not a LIM.

If none of the above holds, hStatus is set to HOST_STAT_OK to indicate that the host is ready to accept and run jobs.
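The precedence used by the example program when it turns hStatus into a display label can be isolated in one function. The sketch below assumes stand-in EX_HOST_STAT_* constants with illustrative bit values so that it compiles without lsbatch.h; the real HOST_STAT_* values come from lsbatch.h and may differ. The example_* function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Stand-in bit values for this sketch only; real code should use the
 * HOST_STAT_* constants from lsbatch.h. */
#define EX_HOST_STAT_OK         0x000
#define EX_HOST_STAT_BUSY       0x001
#define EX_HOST_STAT_WIND       0x002
#define EX_HOST_STAT_DISABLED   0x004
#define EX_HOST_STAT_LOCKED     0x008
#define EX_HOST_STAT_FULL       0x010
#define EX_HOST_STAT_UNREACH    0x020
#define EX_HOST_STAT_UNAVAIL    0x040
#define EX_HOST_STAT_UNLICENSED 0x080
#define EX_HOST_STAT_NO_LIM     0x100

/* Any of these bits means no new batch job is scheduled on the host. */
#define EX_HOST_CLOSED_MASK (EX_HOST_STAT_BUSY | EX_HOST_STAT_WIND | \
        EX_HOST_STAT_DISABLED | EX_HOST_STAT_LOCKED | \
        EX_HOST_STAT_FULL | EX_HOST_STAT_NO_LIM)

/* Maps a status word to a label, checking the most severe states first,
 * in the same order as the example program above. */
const char *example_host_status(int hStatus)
{
    if (hStatus & EX_HOST_STAT_UNLICENSED)
        return "unlicensed";
    if (hStatus & EX_HOST_STAT_UNAVAIL)
        return "unavail";
    if (hStatus & EX_HOST_STAT_UNREACH)
        return "unreach";
    if (hStatus & EX_HOST_CLOSED_MASK)
        return "closed";
    return "ok";
}

/* Returns 1 when the host is ready to accept and run jobs, else 0. */
int example_host_accepts_jobs(int hStatus)
{
    const char *label = example_host_status(hStatus);

    assert(label != NULL);
    return strcmp(label, "ok") == 0;
}
```

Because several bits can be set at once, the order of the tests decides which label is shown; a host that is both busy and unavailable is reported as unavail.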

The constant INFINIT_INT defined in lsf.h is used to indicate that there is no limit set for userJobLimit.

The example output from the above program follows:

a.out hostB
HOST_NAME    STATUS    JL/U  NJOBS  RUN  SSUSP USUSP
hostB         ok        -     2      1    1     0

Job Submission and Modification

Job submission and modification are the most common operations in the LSF Batch system. A user can submit jobs to the system and then modify them as long as they have not been started.

LSBLIB provides one function for job submission and one function for job modification.

int lsb_submit(jobSubReq, jobSubReply)
int lsb_modify(jobSubReq, jobSubReply, jobId)

On success, these calls return the job ID, otherwise -1 is returned with lsberrno set to indicate the error. These two functions are similar except that lsb_modify() modifies the parameters of an already submitted job.

Both of these functions take the following parameters:

struct submit      *jobSubReq;      Job specifications
struct submitReply *jobSubReply;    Results of job submission
int   jobId;                        Id of the job to modify (lsb_modify() only)

The submit structure is defined in lsbatch.h as

struct submit {
    int    options;       Indicates which optional fields are present
    int    options2;      Indicates which additional fields are present
    char   *jobName;      Job name (optional)
    char   *queue;        Submit the job to this queue (optional)
    int    numAskedHosts; Size of askedHosts (optional)
    char   **askedHosts;  An array of names of candidate hosts (optional)
    char   *resReq;       Resource requirements of the job (optional)
    int    rLimits[LSF_RLIM_NLIMITS];
                          Limits on system resource use by all of the job's processes
    char   *hostSpec;     Host model used for scaling rlimits (optional)
    int    numProcessors; Initial number of processors needed by the job
    char   *dependCond;   Job dependency condition (optional)
    time_t beginTime;     Dispatch the job on or after beginTime
    time_t termTime;      Job termination deadline
    int    sigValue;      This variable is obsolete
    char   *inFile;       Path name of the job's standard input file (optional)
    char   *outFile;      Path name of the job's standard output file (optional)
    char   *errFile;      Path name of the job's standard error output file (optional)
    char   *command;      Command line of the job
    time_t chkpntPeriod;  Job is checkpointable with this period (optional)
    char   *chkpntDir;    Directory for this job's chk directory (optional)
    int    nxf;           Size of xf (optional)
    struct xFile *xf;     An array of file transfer specifications (optional)
    char   *preExecCmd;   Job's pre-execution command (optional)
    char   *mailUser;     User E-mail address to which the job's output is mailed (optional)
    int    delOptions;    Bits to be removed from options (lsb_modify() only)
    char   *projectName;  Name of the job's project (optional)
    int    maxNumProcessors;  Requested maximum num of job slots for the job
    char   *loginShell;   Login shell to be used to re-initialize environment
    char   *exceptList;   Lists the exception handlers
};

For a complete description of the fields in the submit structure, see the lsb_submit(3) man page.

The submitReply structure is defined in lsbatch.h as

struct submitReply {
    char   *queue;            The queue name the job was submitted to
    int    badJobId;          dependCond contains badJobId but there is no such job
    char   *badJobName;       dependCond contains badJobName but there is no such job
    int    badReqIndx;        Index of a host or resource limit that caused an error
};

The last three variables in the structure submitReply are only used when the lsb_submit() or lsb_modify() function calls fail.

For a complete description of the fields in the submitReply structure, see the lsb_submit(3) man page.

To submit a new job, fill out the submit structure and then call lsb_submit(). The delOptions variable is ignored by the LSF Batch system for the lsb_submit() function call.

The example job submission program below takes the job command line as an argument and submits the job to the LSF Batch system. For simplicity, it is assumed that the job command does not have arguments.

#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

main(argc, argv)
    int  argc;
    char **argv;
{
    struct submit  req;
    struct submitReply  reply;
    int  jobId;
    int  i;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s command\n", argv[0]);
        exit(-1);
    }

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    req.options = 0;
    req.options2 = 0;
    req.resReq = NULL;

    for (i = 0; i < LSF_RLIM_NLIMITS; i++)
        req.rLimits[i] = DEFAULT_RLIMIT;

    req.hostSpec = NULL;
    req.numProcessors = 1;
    req.maxNumProcessors = 1;
    req.beginTime = 0;
    req.termTime  = 0;
    req.command = argv[1];
    req.nxf = 0;
    req.delOptions = 0;

    jobId = lsb_submit(&req, &reply);

    if (jobId < 0) {
        switch (lsberrno) {
        case LSBE_QUEUE_USE:
        case LSBE_QUEUE_CLOSED:
            lsb_perror(reply.queue);
            exit(-1);
        default:
            lsb_perror(NULL);
            exit(-1);
        }
    }
    exit(0);
}

The options field of the submit structure is the bitwise inclusive OR of some of the SUB_* flags defined in lsbatch.h. These flags serve two purposes. Some flags indicate which of the optional fields of the submit structure are present. Those that are not present have default values. Other flags indicate submission options. For a description of these flags, see lsb_submit(3).

Since options indicate which of the optional fields are meaningful, the programmer does not need to initialize the fields that are not chosen by options. All parameters that are not optional must be initialized properly.
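Because an optional field is only honored when its flag is set in options, it helps to set the field and the flag together. The sketch below illustrates that pattern under stated assumptions: exampleSubmit is a cut-down, hypothetical stand-in for the submit structure, and the EX_SUB_* values are illustrative rather than the real SUB_* values from lsbatch.h.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values; real code should use SUB_JOB_NAME and
 * SUB_QUEUE from lsbatch.h. */
#define EX_SUB_JOB_NAME 0x01
#define EX_SUB_QUEUE    0x02

/* Hypothetical, cut-down stand-in for struct submit. */
struct exampleSubmit {
    int  options;
    char *jobName;
    char *queue;
};

/* Starts with no optional fields selected. */
void example_init(struct exampleSubmit *req)
{
    assert(req != NULL);
    req->options = 0;
    req->jobName = NULL;
    req->queue = NULL;
}

/* Sets the queue field and marks it as present in one step. */
void example_set_queue(struct exampleSubmit *req, char *queue)
{
    assert(req != NULL && queue != NULL);
    req->queue = queue;
    req->options |= EX_SUB_QUEUE;
}

/* Sets the job name field and marks it as present in one step. */
void example_set_job_name(struct exampleSubmit *req, char *jobName)
{
    assert(req != NULL && jobName != NULL);
    req->jobName = jobName;
    req->options |= EX_SUB_JOB_NAME;
}
```

Keeping the two assignments in one place avoids the common mistake of filling in a field and forgetting the flag, in which case the field would be treated as absent.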

If the resReq field of the submit structure is NULL, LSBLIB will try to obtain resource requirements for command from the remote task list (see `Getting Task Resource Requirements' on page 38). If the task does not appear in the remote task list, then NULL is passed to the LSF Batch system. mbatchd will then use the default resource requirements with the DFT_FROMTYPE option bit set when making an LSLIB call for host selection from LIM. See `Handling Default Resource Requirements' on page 26 for more information about default resource requirements.

The constant DEFAULT_RLIMIT defined in lsf.h indicates that there is no limit on a resource.

The constants used to index the rlimits array of the submit structure are defined in lsf.h. The resource limits currently supported by LSF Batch are listed below.

Table 3. Resource Limits Supported by LSF Batch

Resource Limit                                Index in rlimits Array
CPU time limit                                LSF_RLIMIT_CPU
File size limit                               LSF_RLIMIT_FSIZE
Data size limit                               LSF_RLIMIT_DATA
Stack size limit                              LSF_RLIMIT_STACK
Core file size limit                          LSF_RLIMIT_CORE
Resident memory size limit                    LSF_RLIMIT_RSS
Number of open files limit                    LSF_RLIMIT_OPEN_MAX
Virtual memory limit                          LSF_RLIMIT_SWAP
Wall-clock time run limit                     LSF_RLIMIT_RUN
Maximum number of processes a job can fork    LSF_RLIMIT_PROCESS

The hostSpec field of the submit structure specifies the host model to use for scaling rlimits[LSF_RLIMIT_CPU] and rlimits[LSF_RLIMIT_RUN] (See lsb_queueinfo(3)). If hostSpec is NULL, the local host's model is assumed.

If the beginTime field of the submit structure is 0, the job is started as soon as possible.

If the termTime field of the submit structure is 0, the job is allowed to run until it reaches a resource limit.
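As an illustration of how beginTime and termTime might be filled in, the sketch below computes both from a start delay and a run window, using 0 for "no constraint" as described above. The function example_set_window() is hypothetical, and the arithmetic assumes that time_t counts seconds, as it does on POSIX systems.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: derives beginTime and termTime from "now".
 * delaySec == 0 leaves beginTime at 0 (start as soon as possible);
 * runSec == 0 leaves termTime at 0 (no termination deadline).
 * Assumes time_t is an arithmetic count of seconds. */
void example_set_window(time_t now, long delaySec, long runSec,
                        time_t *beginTime, time_t *termTime)
{
    time_t start;

    assert(beginTime != NULL && termTime != NULL);
    assert(delaySec >= 0 && runSec >= 0);

    start = now + (time_t)delaySec;
    *beginTime = (delaySec > 0) ? start : 0;
    *termTime  = (runSec > 0) ? start + (time_t)runSec : 0;
}
```

A caller could pass time(NULL) as now and store the results in the beginTime and termTime fields of the submit structure.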

The above example checks the value of lsberrno when lsb_submit() fails. Different actions can be taken depending on the type of the error. All possible error numbers are defined in lsbatch.h. For example, error number LSBE_QUEUE_USE indicates that the user is not authorized to use the queue. The error number LSBE_QUEUE_CLOSED indicates that the queue is closed.

Since a queue name was not specified for the job, the job will be submitted to the default queue. The queue field of the submitReply structure contains the name of the queue to which the job was submitted.

The above program will produce output similar to the following:

Job <5602> is submitted to default queue <default>.

The output from the job will be mailed to the user because the program did not specify a file name for the outFile parameter in the submit structure.

If you are familiar with the bsub command, it may help to know how the fields in the submit structure relate to the bsub command options. This is shown in the following table.

Table 4. submit fields and bsub options

bsub Option                             submit Field                          options
-J job_name_spec                        jobName                               SUB_JOB_NAME
-q queue_name                           queue                                 SUB_QUEUE
-m host_name[+[pref_level]]             askedHosts                            SUB_HOST
-n min_proc[,max_proc]                  numProcessors, maxNumProcessors
-R res_req                              resReq                                SUB_RES_REQ
-c cpu_limit[/host_spec]                rlimits[LSF_RLIMIT_CPU] / hostSpec**  SUB_HOST_SPEC (if host_spec is specified)
-W run_limit[/host_spec]                rlimits[LSF_RLIMIT_RUN] / hostSpec**  SUB_HOST_SPEC (if host_spec is specified)
-F file_limit                           rlimits[LSF_RLIMIT_FSIZE]**
-M mem_limit                            rlimits[LSF_RLIMIT_RSS]**
-D data_limit                           rlimits[LSF_RLIMIT_DATA]**
-S stack_limit                          rlimits[LSF_RLIMIT_STACK]**
-C core_limit                           rlimits[LSF_RLIMIT_CORE]**
-k "chkpnt_dir [chkpnt_period]"         chkpntDir, chkpntPeriod               SUB_CHKPNT_DIR, SUB_CHKPNT_PERIOD (if chkpntPeriod is specified)
-w depend_cond                          dependCond                            SUB_DEPEND_COND
-b begin_time                           beginTime
-t term_time                            termTime
-i in_file                              inFile                                SUB_IN_FILE
-o out_file                             outFile                               SUB_OUT_FILE
-e err_file                             errFile                               SUB_ERR_FILE
-u mail_user                            mailUser                              SUB_MAIL_USER
-f "lfile op [rfile]"                   xf
-E "pre_exec_command [argument ...]"    preExecCmd                            SUB_PRE_EXEC
-L login_shell                          loginShell                            SUB_LOGIN_SHELL
-P project_name                         projectName                           SUB_PROJECT_NAME
-G user_group                           userGroup                             SUB_USER_GROUP
-H                                                                            SUB2_HOLD*
-x                                                                            SUB_EXCLUSIVE
-r                                                                            SUB_RERUNNABLE
-N                                                                            SUB_NOTIFY_END
-B                                                                            SUB_NOTIFY_BEGIN
-I                                                                            SUB_INTERACTIVE
-Ip                                                                           SUB_PTY
-Is                                                                           SUB_PTY_SHELL
-K                                                                            SUB2_BSUB_BLOCK*
-X "exception_cond([params])::action"   exceptList                            SUB_EXCEPT
-T time_event                           timeEvent                             SUB_TIME_EVENT

*  A bitwise OR mask for options2.

** A value of -1 means undefined.


Even if not all options are used, all optional string fields must be initialized to the empty string. For a complete description of the fields in the submit structure, see the lsb_submit(3) manual page.

To modify an already submitted job, fill out a new submit structure to override existing parameters, and use delOptions to remove option bits that were previously specified for the job. Modifying a submitted job is essentially like re-submitting it, so the program above can be used to modify an existing job with minor changes. The one additional parameter that must be specified for job modification is the job ID.
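The effect of delOptions on a job's option bits can be pictured as plain bit arithmetic. The sketch below is only a model of the description above, not the way mbatchd is known to implement it: example_modified_options() is a hypothetical function, and it assumes that a single request never sets and removes the same bit.

```c
#include <assert.h>

/* Hypothetical model of a modification request:
 *   current    - option bits the job was submitted with
 *   added      - option bits set in the new submit structure
 *   delOptions - option bits to be removed
 * Returns the option bits the job would carry afterwards. */
int example_modified_options(int current, int added, int delOptions)
{
    /* Assumption: a request never both sets and removes the same bit. */
    assert((added & delOptions) == 0);
    return (current & ~delOptions) | added;
}
```

For example, a job submitted with bits 0x02 and 0x10 set, then modified with added = 0x01 and delOptions = 0x10, ends up with bits 0x03 under this model.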

Note

All applications that call lsb_submit() and lsb_modify() are subject to authentication constraints described in `Authentication' on page 17.

Getting Information about Batch Jobs

LSBLIB provides functions to get status information about batch jobs. Since the number of jobs in the LSF Batch system could be on the order of many thousands, getting all this information in one message could potentially use a lot of memory space. LSBLIB allows the application to open a stream connection and then read the job records one by one. This way the memory space needed is always the size of one job record.

LSF Batch Job ID

An LSF Batch job ID is stored in a 32-bit integer and consists of two parts: a base ID and an array index. The base ID is stored in the lower 20 bits and the array index in the top 12 bits. The array index is used only when the underlying job is an array job.

LSBLIB provides the following C macros (defined in lsbatch.h) for manipulating job IDs:

LSB_JOBID(base_id, array_index)    Yield a 32-bit LSF Batch job ID
LSB_ARRAY_IDX(job_id)              Yield the array index part of the job ID
LSB_ARRAY_JOBID(job_id)            Yield the base ID part of the job ID
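The bit layout described above (base ID in the lower 20 bits, array index in the top 12 bits) can be sketched as follows. These example_* functions are hypothetical stand-ins that mirror the documented layout; they are not the LSB_* macros themselves, and they use unsigned int (assumed to be at least 32 bits wide) to keep the shifts well defined, whereas the LSBLIB interfaces use int.

```c
#include <assert.h>

#define EX_BASE_BITS  20
#define EX_BASE_MASK  0xFFFFFu   /* lower 20 bits: base ID     */
#define EX_INDEX_MASK 0xFFFu     /* top 12 bits:   array index */

/* Packs a base ID and an array index into one job ID. */
unsigned int example_jobid(unsigned int base_id, unsigned int array_index)
{
    assert(base_id <= EX_BASE_MASK && array_index <= EX_INDEX_MASK);
    return (array_index << EX_BASE_BITS) | base_id;
}

/* Extracts the array index part of a job ID. */
unsigned int example_array_idx(unsigned int job_id)
{
    return (job_id >> EX_BASE_BITS) & EX_INDEX_MASK;
}

/* Extracts the base ID part of a job ID. */
unsigned int example_array_jobid(unsigned int job_id)
{
    return job_id & EX_BASE_MASK;
}
```

With this layout, a non-array job has an array index of 0, so its job ID is numerically equal to its base ID.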

The function calls used to get job information are:

int lsb_openjobinfo(jobId, jobName, user, queue, host, options);
struct jobInfoEnt *lsb_readjobinfo(more);
void lsb_closejobinfo(void);

These functions are used to open a job information connection with mbatchd, read job records, and then close the job information connection.

The lsb_openjobinfo() function takes the following arguments:

int   jobId;                Select job with the given job Id
char  *jobName;             Select job(s) with the given job name
char  *user;                Select job(s) submitted by the named user or user group
char  *queue;               Select job(s) submitted to the named queue
char  *host;                Select job(s) that are dispatched to the named host
int   options;              Selection flags constructed from the bits defined in lsbatch.h

The options parameter contains additional job selection flags defined in lsbatch.h. These are:

ALL_JOB
Select jobs matching any status, including unfinished jobs and recently finished jobs. LSF Batch remembers finished jobs within the CLEAN_PERIOD, as defined in the lsb.params file.
CUR_JOB
Return jobs that have not finished yet.
DONE_JOB
Return jobs that have finished recently.
PEND_JOB
Return jobs that are in the pending status.
SUSP_JOB
Return jobs that are in the suspended status.
LAST_JOB
Return jobs that are submitted most recently.
JGRP_ARRAY_INFO
Return job array information.

If options is 0, then the default is CUR_JOB.

lsb_openjobinfo() returns the total number of matching job records in the connection. It returns -1 on failure and sets lsberrno to indicate the error.

lsb_readjobinfo() takes one argument:

int   *more;                    If not NULL, contains the remaining number of jobs unread

Either this parameter or the return value from the lsb_openjobinfo() can be used to keep track of the number of job records that can be returned from the connection. This parameter is updated each time lsb_readjobinfo() is called.

The jobInfoEnt structure returned by lsb_readjobinfo() is defined in lsbatch.h as:

struct jobInfoEnt {
    int    jobId;                     Job ID
    char   *user;                     Submission user
    /* possible values for the status field */
#define JOB_STAT_PEND   0x01          Job is pending
#define JOB_STAT_PSUSP  0x02          Job is held
#define JOB_STAT_RUN    0x04          Job is running
#define JOB_STAT_SSUSP  0x08          Job is suspended by LSF Batch system
#define JOB_STAT_USUSP  0x10          Job is suspended by user
#define JOB_STAT_EXIT   0x20          Job exited
#define JOB_STAT_DONE   0x40          Job is completed successfully
    int    status;
    int    *reasonTb;                 Pending or suspending reasons
    int    numReasons;                Length of reasonTb vector
    int    reasons;                   Reserved for future use
    int    subreasons;                Reserved for future use
    int    jobPid;                    Process Id of the job
    time_t submitTime;                Time when the job is submitted
    time_t reserveTime;               Time when job slots are reserved
    time_t startTime;                 Time when job is actually started
    time_t predictedStartTime;        Job's predicted start time
    time_t endTime;                   Time when the job finishes
    time_t lastEvent;                 Last time event
    time_t nextEvent;                 Next time event
    int    duration;                  Duration time (minutes)
    float  cpuTime;                   CPU time consumed by the job
    int    umask;                     File mode creation mask for the job
    char   *cwd;                      Current working directory where job is submitted
    char   *subHomeDir;               Submitting user's home directory
    char   *fromHost;                 Host from which the job is submitted
    char   **exHosts;                 Host(s) on which the job executes
    int    numExHosts;                Number of execution hosts
    float  cpuFactor;                 CPU factor of the first execution host
    int    nIdx;                      Number of load indices in the loadSched and loadStop vector
    float  *loadSched;                Stop scheduling new jobs if this threshold is exceeded
    float  *loadStop;                 Stop jobs if this threshold is exceeded
    struct submit submit;             Job submission parameters
    int    exitStatus;                Exit status
    int    execUid;                   User ID under which the job is running
    char   *execHome;                 Home directory of the user denoted by execUid
    char   *execCwd;                  Current working directory where job is running
    char   *execUsername;             User name corresponds to execUid
    time_t jRusageUpdateTime;         Last time job's resource usage is updated
    struct jRusage runRusage;         Last updated job's resource usage

    /* Possible values for the jType field */
#define JGRP_NODE_JOB   1             This structure stores a normal batch job
#define JGRP_NODE_GROUP 2             This structure stores a job group
#define JGRP_NODE_ARRAY 3             This structure stores a job array
    int    jType;
    char   *parentGroup;              For job group use
    char   *jName;                    Job group name if jType is JGRP_NODE_GROUP; job's name otherwise
    /* index into the counter array, only used for job array */
#define JGRP_COUNT_NJOBS  0           Total jobs in the array
#define JGRP_COUNT_PEND   1           Number of pending jobs in the array
#define JGRP_COUNT_NPSUSP 2           Number of held jobs in the array
#define JGRP_COUNT_NRUN   3           Number of running jobs in the array
#define JGRP_COUNT_NSSUSP 4           Number of jobs suspended by the system in the array
#define JGRP_COUNT_NUSUSP 5           Number of jobs suspended by the user in the array
#define JGRP_COUNT_NEXIT  6           Number of exited jobs in the array
#define JGRP_COUNT_NDONE  7           Number of successfully completed jobs
    int    counter[NUM_JGRP_COUNTERS];
};

Under LSF Batch, a jobInfoEnt structure can store either a non-array batch job or a job array, depending on the value of the jType field (JGRP_NODE_JOB or JGRP_NODE_ARRAY, respectively).
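As a minimal sketch of how a program might branch on jType, the helper below maps the three values from the listing to a short label. The function name jTypeLabel is hypothetical and not part of LSBLIB; the #ifndef block only supplies the listed values when lsbatch.h has not been included.

```c
/* Values copied from the jobInfoEnt listing; lsbatch.h normally defines them. */
#ifndef JGRP_NODE_JOB
#define JGRP_NODE_JOB   1
#define JGRP_NODE_GROUP 2
#define JGRP_NODE_ARRAY 3
#endif

/* Hypothetical helper: describe what kind of record a jobInfoEnt holds. */
const char *
jTypeLabel(int jType)
{
    switch (jType) {
    case JGRP_NODE_JOB:
        return "batch job";
    case JGRP_NODE_GROUP:
        return "job group";
    case JGRP_NODE_ARRAY:
        return "job array";
    default:
        return "unknown";
    }
}
```

A caller would typically pass job->jType from a record returned by lsb_readjobinfo().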

Call lsb_closejobinfo() after all job records have been read from the connection.

Below is an example of a simplified bjobs command. This program displays all pending jobs belonging to all users.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lsf/lsbatch.h>

int
main(int argc, char **argv)
{
    int  options = PEND_JOB;
    char *user = "all";             /* match jobs for all users */
    struct jobInfoEnt *job;
    int more;

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    if (lsb_openjobinfo(0, NULL, user, NULL, NULL, options) < 0) {
        lsb_perror("lsb_openjobinfo");
        exit(-1);
    }

    printf("All pending jobs submitted by all users:\n");
    for (;;) {
        job = lsb_readjobinfo(&more);
        if (job == NULL) {
            lsb_perror("lsb_readjobinfo");
            exit(-1);
        }
        /* display the job; %.24s drops the newline that ctime() appends */
        printf("%.24s:\nJob <%d> of user <%s>, submitted from host <%s>\n",
                ctime(&job->submitTime), job->jobId, job->user, job->fromHost);

        if (! more)
            break;
    }

    lsb_closejobinfo();
    exit(0);
}

If you want to print out the reasons why the job is still pending, you can use the function lsb_pendreason(). See lsb_pendreason(3) for details.

The above program will produce output similar to the following:

All pending jobs submitted by all users:
Mon Mar 1 10:34:04 EST 1996:
Job <123> of user <john>, submitted from host <orange>
Mon Mar 1 11:12:11 EST 1996:
Job <126> of user <john>, submitted from host <orange>
Mon Mar 1 14:11:34 EST 1996:
Job <163> of user <ken>, submitted from host <apple>
Mon Mar 1 15:00:56 EST 1996:
Job <199> of user <tim>, submitted from host <pear>
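A listing that mixes job states would also need to turn the status field into text. The sketch below shows one way to do that by testing the JOB_STAT_ bits. The function name jobStatusLabel is hypothetical, and the values in the #ifndef block are assumptions that mirror the status definitions in the jobInfoEnt listing; check them against your lsbatch.h.

```c
/* Status bits as listed for jobInfoEnt.status; lsbatch.h normally defines
 * them. The values here are assumed to match that header. */
#ifndef JOB_STAT_PEND
#define JOB_STAT_PEND  0x01
#define JOB_STAT_PSUSP 0x02
#define JOB_STAT_RUN   0x04
#define JOB_STAT_SSUSP 0x08
#define JOB_STAT_USUSP 0x10
#define JOB_STAT_EXIT  0x20
#define JOB_STAT_DONE  0x40
#endif

/* Hypothetical helper: return a label for the first status bit that is set,
 * or "UNKWN" if none of the bits above is set. */
const char *
jobStatusLabel(int status)
{
    if (status & JOB_STAT_PEND)  return "PEND";
    if (status & JOB_STAT_PSUSP) return "PSUSP";
    if (status & JOB_STAT_RUN)   return "RUN";
    if (status & JOB_STAT_SSUSP) return "SSUSP";
    if (status & JOB_STAT_USUSP) return "USUSP";
    if (status & JOB_STAT_EXIT)  return "EXIT";
    if (status & JOB_STAT_DONE)  return "DONE";
    return "UNKWN";
}
```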

The following program displays the job arrays of all users in the LSF Batch system, with a breakdown of each array's jobs by status. It demonstrates the use of LSBLIB API calls for collecting summary information about a job array.

#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

int
main(int argc, char **argv)
{
    struct jobInfoEnt *job;
    int numJobs;
    int more;

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    numJobs = lsb_openjobinfo(0, NULL, "all", NULL, NULL, ALL_JOB|JGRP_ARRAY_INFO);
    if (numJobs < 0) {
        lsb_perror("lsb_openjobinfo");
        exit(-1);
    }

    printf("JOBID  ARRAY_NAME  OWNER  NJOBS PEND DONE RUN EXIT SSUSP USUSP PSUSP\n");
    more = 1;
    for (;;) {
        if (!more)
            break;

        job = lsb_readjobinfo(&more);
        if (job == NULL) {
            lsb_perror("lsb_readjobinfo");
            exit(-1);
        }

        printf("%-5d   %-8.8s ", LSB_ARRAY_JOBID(job->jobId), job->submit.jobName);
        printf("%8.8s ", job->user);

        /* one column per counter, in the order of the header line */
        printf(" %5d %4d %4d %4d %4d %5d %5d %5d\n",
                job->counter[JGRP_COUNT_NJOBS],
                job->counter[JGRP_COUNT_PEND],
                job->counter[JGRP_COUNT_NDONE],
                job->counter[JGRP_COUNT_NRUN],
                job->counter[JGRP_COUNT_NEXIT],
                job->counter[JGRP_COUNT_NSSUSP],
                job->counter[JGRP_COUNT_NUSUSP],
                job->counter[JGRP_COUNT_NPSUSP]);
    }
    lsb_closejobinfo();

    exit(0);
}

The above program produces output similar to the following:

JOBID    ARRAY_NAME   OWNER     NJOBS PEND DONE  RUN EXIT SSUSP USUSP PSUSP
4205     ja1[1-8]     userA         8    0    0    0    0     0     0     8
4207     ja2[1-2]     userB         2    0    0    0    0     0     0     2
5074     ja3[1-4]     userA         4    0    3    1    0     0     0     0
5075     ja4[1-10]    userC        17    0   13    0    0     4     0     0
5076     ja5[1-4]     userD         4    0    1    0    3     0     0     0
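The counter array can also be combined into derived figures. As a small sketch, the hypothetical helper unfinishedJobs below subtracts the exited and done counts from the total to give the number of array elements that are still pending, running, or suspended. The index values in the #ifndef block are the ones from the jobInfoEnt listing and are only used when lsbatch.h is not included.

```c
/* Indices as listed for the counter array; lsbatch.h normally defines them. */
#ifndef JGRP_COUNT_NJOBS
#define JGRP_COUNT_NJOBS 0
#define JGRP_COUNT_NEXIT 6
#define JGRP_COUNT_NDONE 7
#endif

/* Hypothetical helper: number of array elements that have not finished yet. */
int
unfinishedJobs(const int *counter)
{
    return counter[JGRP_COUNT_NJOBS]
         - counter[JGRP_COUNT_NEXIT]
         - counter[JGRP_COUNT_NDONE];
}
```

For the ja3 row in the sample output (4 jobs, 3 done, none exited), the helper yields 1, which matches the single running job.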

Job Manipulation

After a job has been submitted, it can be manipulated by users in different ways. It can be suspended, resumed, killed, or sent an arbitrary signal.

Note

All applications that manipulate jobs are subject to authentication provisions described in `Authentication' on page 17.

Sending a Signal To a Job

Users can send signals to submitted jobs. If the job has not been started, you can send KILL, TERM, INT, and STOP signals. These cause the job to be cancelled (KILL, TERM, INT) or suspended (STOP). If the job has already started, any signal can be sent to it.
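The rule for jobs that have not started can be written down as a small lookup. The sketch below is illustrative only: the enum and the function pendingSignalEffect are hypothetical names, not part of LSBLIB, and the "not listed" result simply means the signal is not one of the four named above.

```c
#include <signal.h>

/* Hypothetical result codes for this sketch (not part of LSBLIB). */
enum pendSigEffect {
    PEND_SIG_NOT_LISTED,    /* not one of KILL, TERM, INT, STOP */
    PEND_SIG_CANCEL,        /* job is cancelled */
    PEND_SIG_SUSPEND        /* job is suspended */
};

/* Effect of sending sigValue to a job that has not been started yet. */
enum pendSigEffect
pendingSignalEffect(int sigValue)
{
    switch (sigValue) {
    case SIGKILL:
    case SIGTERM:
    case SIGINT:
        return PEND_SIG_CANCEL;
    case SIGSTOP:
        return PEND_SIG_SUSPEND;
    default:
        return PEND_SIG_NOT_LISTED;
    }
}
```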

The LSBLIB call to send a signal to a job is:

int lsb_signaljob(jobId, sigValue);

The jobId parameter is the ID of the job to signal, and sigValue is the signal to send.

The following example takes a job ID as its argument and sends a SIGSTOP signal to the job.

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <lsf/lsbatch.h>

int
main(int argc, char **argv)
{
    if (argc != 2) {
        printf("Usage: %s jobId\n", argv[0]);
        exit(-1);
    }

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* argv[1] is a string; convert it to a numeric job ID */
    if (lsb_signaljob(atoi(argv[1]), SIGSTOP) < 0) {
        lsb_perror("lsb_signaljob");
        exit(-1);
    }

    printf("Job %s is signaled\n", argv[1]);
    exit(0);
}
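The example above converts its argument with atoi(), which returns 0 for input that is not a number, so a mistyped argument would be passed on as job ID 0. A stricter conversion could look like the sketch below. The helper name parseJobId is hypothetical, and the sketch assumes that job IDs are positive decimal integers that fit in an int.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a positive decimal job ID.
 * Returns 0 and stores the value in *jobId on success, -1 otherwise. */
int
parseJobId(const char *arg, int *jobId)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(arg, &end, 10);

    /* reject overflow, empty input, and trailing garbage */
    if (errno != 0 || end == arg || *end != '\0')
        return -1;
    /* assumption: valid job IDs are positive and fit in an int */
    if (value <= 0 || value > INT_MAX)
        return -1;

    *jobId = (int) value;
    return 0;
}
```

A program would call parseJobId(argv[1], &jobId) and print its usage message when the call returns -1.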

Switching a Job To a Different Queue

A job can be switched to a different queue after submission. This can be done even after the job has already started.

The LSBLIB function to switch a job from one queue to another is:

int lsb_switchjob(jobId, queue);

Below is an example program that switches a specified job to a new queue.

#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

int
main(int argc, char **argv)
{
    if (argc != 3) {
        printf("Usage: %s jobId new_queue\n", argv[0]);
        exit(-1);
    }

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* argv[1] is a string; convert it to a numeric job ID */
    if (lsb_switchjob(atoi(argv[1]), argv[2]) < 0) {
        lsb_perror("lsb_switchjob");
        exit(-1);
    }

    printf("Job %s is switched to new queue <%s>\n", argv[1], argv[2]);

    exit(0);
}

Forcing a Job to Run

After a job is submitted to the LSF Batch system, it remains pending until LSF Batch determines that it is ready to run (for details on the factors that govern when and where a job starts to run, see "How LSF Batch Schedules Jobs" in the LSF Batch Administrator's Guide). However, a job can be forced to run on a specified list of hosts immediately using the following LSBLIB function:

int lsb_runjob(runJobReq)

This function takes the runJobReq structure which is defined in lsbatch.h:

struct runJobReq {
    int jobId;          Job ID of the job to start
    int numHosts;       Number of hosts to run the job on
    char **hostname;    Host names where jobs run
    int options;        RUNJOB_REQ_NORMAL or RUNJOB_REQ_NOSTOP
};

A job can be started and run subject to no scheduling constraints, such as job slot limits. If the job is started with the options field being 0 or RUNJOB_REQ_NORMAL, then the job will still be subject to the underlying queue's run windows and to the threshold of the queue and of the job's execution hosts.

To override this, use RUNJOB_REQ_NOSTOP and the job will not be stopped due to the above-mentioned load conditions. However, all of LSBLIB's job manipulation APIs can still be applied to the job.

The following is an example program that runs a specified job on a host that has no batch job running.

#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

int
main(int argc, char **argv)
{
    struct hostInfoEnt *hInfo;
    struct runJobReq runJobReq;
    int numHosts = 0;       /* must be initialized; 0 with a NULL list asks for all hosts */
    int jobId;
    int i;

    if (argc != 2) {
        printf("Usage: %s jobId\n", argv[0]);
        exit(-1);
    }

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    jobId = atoi(argv[1]);

    hInfo = lsb_hostinfo(NULL, &numHosts);
    if (hInfo == NULL) {
        lsb_perror("lsb_hostinfo");
        exit(-1);
    }

    for (i = 0; i < numHosts; i++) {
        /* skip hosts that cannot accept jobs */
        if (hInfo[i].hStatus & (HOST_STAT_BUSY | HOST_STAT_WIND
                                   | HOST_STAT_DISABLED | HOST_STAT_LOCKED
                                   | HOST_STAT_FULL | HOST_STAT_NOLIM
                                   | HOST_STAT_UNLICENSED | HOST_STAT_UNAVAIL
                                   | HOST_STAT_UNREACH))
            continue;

        /* found a vacant host */
        if (hInfo[i].numJobs == 0)
            break;
    }

    if (i == numHosts) {
        fprintf(stderr, "Cannot find vacant host to run job <%d>\n",
                jobId);
        exit(-1);
    }

    /* The job can be stopped due to load conditions */
    runJobReq.jobId = jobId;
    runJobReq.options = 0;
    runJobReq.numHosts = 1;
    runJobReq.hostname = &hInfo[i].host;

    if (lsb_runjob(&runJobReq) < 0) {
        lsb_perror("lsb_runjob");
        exit(-1);
    }

    exit(0);
}
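The host search in the program above can be pulled out into a function that is easy to exercise without a running cluster. The sketch below does this with a stand-in structure: struct sketchHost and findVacantHost are hypothetical names, the two fields mirror hStatus and numJobs from hostInfoEnt, and the set of disqualifying status bits is passed in as a mask rather than hard-coded.

```c
/* Stand-in for the two hostInfoEnt fields the search relies on. */
struct sketchHost {
    int hStatus;    /* status bits, as in hostInfoEnt.hStatus */
    int numJobs;    /* number of batch jobs on the host */
};

/* Return the index of the first usable host with no jobs, or -1 if none. */
int
findVacantHost(const struct sketchHost *hosts, int numHosts, int unusableMask)
{
    int i;

    for (i = 0; i < numHosts; i++) {
        if (hosts[i].hStatus & unusableMask)
            continue;               /* host cannot accept the job */
        if (hosts[i].numJobs == 0)
            return i;               /* usable and vacant */
    }
    return -1;
}
```

Returning -1 instead of comparing the loop index with numHosts keeps the "no host found" case explicit at the call site.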

Processing LSF Batch Log Files

LSF Batch records information about the system and about jobs. This information is logged by mbatchd in the files lsb.events and lsb.acct under the directory $LSB_SHAREDIR/your_cluster/logdir, where LSB_SHAREDIR is defined in the lsf.conf file and your_cluster is the name of your LSF cluster.

mbatchd logs this information for several purposes. First, some of the events serve as a backup of mbatchd's memory, so that if mbatchd crashes, the newly started mbatchd can pick up all the critical information from the event file and restore the current state of LSF Batch. Second, the events can be used to produce historical information about the LSF Batch system and user jobs. Last, the information can be used to produce accounting or statistical reports.

CAUTION!

The lsb.events file contains critical user job information. It should never be modified by your program. Writing into this file may cause the loss of user jobs.

LSBLIB provides a function to read information from these files into a well-defined data structure:

struct eventRec *lsb_geteventrec(log_fp, lineNum)

The parameters are:

FILE  *log_fp;                     File handle for either an event log file or job log file
int   *lineNum;                    Line number of the next event record

The parameter log_fp is a file handle returned by a successful fopen() call. On a successful return, the content of lineNum is updated to the line number of the next event record in the log file. This value can then be used to report the line number when an error occurs while reading the log file. It should be initialized to 0 before lsb_geteventrec() is called for the first time.

This call returns the following data structure:

struct eventRec {
    char  version[MAX_VERSION_LEN];   Version number of the mbatchd
    int   type;                       Type of the event
    int   eventTime;                  Event time stamp
    union eventLog eventLog;          Event data
};

The event type is used to determine the structure of the data in eventLog. LSBLIB remembers the storage allocated for the previously returned data structure and automatically frees it before returning the next event record, so copy any data you need to keep before calling lsb_geteventrec() again.

lsb_geteventrec() returns NULL and sets lsberrno to LSBE_EOF when there are no more records in the event file.

Events are logged by mbatchd for many different purposes. There are job-related events and system-related events. Applications can choose to process certain events and ignore other events. For example, the bhist command processes job-related events only. The currently available event types are listed below.

Table 5. Event Types

Event Type               Description
EVENT_JOB_NEW            New job event
EVENT_JOB_START          mbatchd is trying to start a job
EVENT_JOB_STATUS         Job status change event
EVENT_JOB_SWITCH         Job switched to a new queue
EVENT_JOB_MOVE           Job moved within a queue
EVENT_QUEUE_CTRL         Queue status changed by LSF admin
EVENT_HOST_CTRL          Host status changed by LSF admin
EVENT_MBD_START          New mbatchd start event
EVENT_MBD_DIE            mbatchd resign event
EVENT_MBD_UNFULFILL      mbatchd has an action to be fulfilled
EVENT_JOB_FINISH         Job has finished (logged in lsb.acct only)
EVENT_LOAD_INDEX         Complete list of load index names
EVENT_MIG                Job has migrated
EVENT_PRE_EXEC_START     The pre-execution command started
EVENT_JOB_ROUTE          The job has been routed to NQS
EVENT_JOB_MODIFY         The job has been modified
EVENT_JOB_SIGNAL         Job signal to be delivered
EVENT_CAL_NEW            New calendar event (1)
EVENT_CAL_MODIFY         Calendar modified (1)
EVENT_CAL_DELETE         Calendar deleted (1)
EVENT_JOB_FORCE          Forcing a job to start on specified hosts
EVENT_JOB_FORWARD        Job forwarded to another cluster
EVENT_JOB_ACCEPT         Job from a remote cluster dispatched
EVENT_STATUS_ACK         Job status successfully sent to submission cluster
EVENT_JOB_EXECUTE        Job started successfully
EVENT_JOB_REQUEUE        Job is requeued
EVENT_JOB_SIGACT         A signal action on a job has been initiated or finished
EVENT_JOB_START_ACCEPT   Job accepted by sbatchd

(1) Available only if the LSF JobScheduler component is enabled.

Note that the event type EVENT_JOB_FINISH is used by the lsb.acct file only and all other event types are used by the lsb.events file only. For detailed formats of these log files, see lsb.events(5) and lsb.acct(5).

Each event type corresponds to a different data structure in the union:

union  eventLog { 
    struct jobNewLog     jobNewLog;                EVENT_JOB_NEW
    struct jobStartLog   jobStartLog;              EVENT_JOB_START
    struct jobStatusLog  jobStatusLog;             EVENT_JOB_STATUS
    struct jobSwitchLog  jobSwitchLog;             EVENT_JOB_SWITCH
    struct jobMoveLog    jobMoveLog;               EVENT_JOB_MOVE
    struct queueCtrlLog  queueCtrlLog;             EVENT_QUEUE_CTRL
    struct hostCtrlLog   hostCtrlLog;              EVENT_HOST_CTRL
    struct mbdStartLog   mbdStartLog;              EVENT_MBD_START
    struct mbdDieLog     mbdDieLog;                EVENT_MBD_DIE
    struct unfulfillLog  unfulfillLog;             EVENT_MBD_UNFULFILL
    struct jobFinishLog  jobFinishLog;             EVENT_JOB_FINISH
    struct loadIndexLog  loadIndexLog;             EVENT_LOAD_INDEX
    struct migLog        migLog;                   EVENT_MIG
    struct calendarLog   calendarLog;              Shared by all calendar events
    struct jobForce      jobForceRequestLog;       EVENT_JOB_FORCE
    struct jobForwardLog jobForwardLog;            EVENT_JOB_FORWARD
    struct jobAcceptLog  jobAcceptLog;             EVENT_JOB_ACCEPT
    struct statusAckLog  statusAckLog;             EVENT_STATUS_ACK
    struct signalLog     signalLog;                EVENT_JOB_SIGNAL
    struct jobExecuteLog jobExecuteLog;            EVENT_JOB_EXECUTE
    struct jobRequeueLog jobRequeueLog;            EVENT_JOB_REQUEUE
    struct sigactLog     sigactLog;                EVENT_JOB_SIGACT
    struct jobStartAcceptLog jobStartAcceptLog;    EVENT_JOB_START_ACCEPT
};

The detailed data structures in the above union are defined in lsbatch.h and described in lsb_geteventrec(3).
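The pattern used by eventRec, a type field that says which member of a union is valid, can be illustrated with a miniature stand-alone version. Everything in the sketch below is hypothetical (the sketch-prefixed names, the two event kinds, and the fields); it is not the LSBLIB definition, only a small model of reading the union member that matches the type.

```c
#include <stdio.h>

/* Hypothetical miniature of eventRec: a type tag plus a union. */
enum sketchType { SKETCH_JOB_NEW, SKETCH_JOB_STATUS };

struct sketchNewLog    { int jobId; const char *queue; };
struct sketchStatusLog { int jobId; int jStatus; };

struct sketchRec {
    enum sketchType type;
    union {
        struct sketchNewLog    newLog;     /* valid for SKETCH_JOB_NEW */
        struct sketchStatusLog statusLog;  /* valid for SKETCH_JOB_STATUS */
    } log;
};

/* Format a one-line summary into buf; returns the snprintf() result. */
int
describeRec(const struct sketchRec *rec, char *buf, size_t size)
{
    switch (rec->type) {
    case SKETCH_JOB_NEW:
        return snprintf(buf, size, "job %d submitted to %s",
                        rec->log.newLog.jobId, rec->log.newLog.queue);
    case SKETCH_JOB_STATUS:
        return snprintf(buf, size, "job %d status 0x%x",
                        rec->log.statusLog.jobId,
                        (unsigned) rec->log.statusLog.jStatus);
    default:
        return snprintf(buf, size, "unhandled event type %d",
                        (int) rec->type);
    }
}
```

Reading a member that does not correspond to the type field yields meaningless data, which is why the switch is on the type and each case touches only its own member.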

Below is an example program that takes a job name as its argument and displays a chronological history of all jobs matching that name. This program assumes that the lsb.events file is in /local/lsf/work/cluster1/logdir.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <lsf/lsbatch.h>

#define MAX_MATCHED 1024            /* arbitrary limit for this example */

static int matched[MAX_MATCHED];    /* IDs of jobs whose name matched */
static int numMatched = 0;

/* Return 1 if jobId was recorded as a matching job, 0 otherwise */
static int
isMatched(int jobId)
{
    int i;

    for (i = 0; i < numMatched; i++)
        if (matched[i] == jobId)
            return 1;
    return 0;
}

int
main(int argc, char **argv)
{
    char *eventFile = "/local/lsf/work/cluster1/logdir/lsb.events";
    FILE *fp;
    struct eventRec *record;
    struct jobNewLog *newJob;
    struct jobStartLog *startJob;
    struct jobStatusLog *statusJob;
    int  lineNum = 0;
    char *jobName;
    time_t when;
    int  i;

    if (argc != 2) {
        printf("Usage: %s jobname\n", argv[0]);
        exit(-1);
    }
    jobName = argv[1];

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    fp = fopen(eventFile, "r");
    if (fp == NULL) {
        perror(eventFile);
        exit(-1);
    }

    for (;;) {

        record = lsb_geteventrec(fp, &lineNum);
        if (record == NULL) {
            if (lsberrno == LSBE_EOF)
                break;
            lsb_perror("lsb_geteventrec");
            exit(-1);
        }
        when = record->eventTime;       /* ctime() needs a time_t */

        switch (record->type) {
        case EVENT_JOB_NEW:
            newJob = &(record->eventLog.jobNewLog);
            /* Of the events handled here, only EVENT_JOB_NEW records the
               job name, so remember matching job IDs for later events */
            if (strcmp(newJob->jobName, jobName) != 0)
                continue;
            if (numMatched < MAX_MATCHED)
                matched[numMatched++] = newJob->jobId;
            printf("%.24s: job <%d> submitted by <%s> from <%s> to <%s> queue\n",
                ctime(&when), newJob->jobId, newJob->userName,
                newJob->fromHost, newJob->queue);
            continue;
        case EVENT_JOB_START:
            startJob = &(record->eventLog.jobStartLog);
            if (!isMatched(startJob->jobId))
                continue;
            printf("%.24s: job <%d> started on ",
                        ctime(&when), startJob->jobId);
            for (i = 0; i < startJob->numExHosts; i++)
                printf("<%s> ", startJob->execHosts[i]);
            printf("\n");
            continue;
        case EVENT_JOB_STATUS:
            statusJob = &(record->eventLog.jobStatusLog);
            if (!isMatched(statusJob->jobId))
                continue;
            printf("%.24s: Job <%d> status changed to: ",
                        ctime(&when), statusJob->jobId);
            switch (statusJob->jStatus) {
            case JOB_STAT_PEND:
                printf("pending\n");
                continue;
            case JOB_STAT_RUN:
                printf("running\n");
                continue;
            case JOB_STAT_SSUSP:
            case JOB_STAT_USUSP:
            case JOB_STAT_PSUSP:
                printf("suspended\n");
                continue;
            case JOB_STAT_UNKWN:
                printf("unknown (sbatchd unreachable)\n");
                continue;
            case JOB_STAT_EXIT:
                printf("exited\n");
                continue;
            case JOB_STAT_DONE:
                printf("done\n");
                continue;

            default:
                printf("\nError: unknown job status %d\n", statusJob->jStatus);
                continue;
            }
        default:            /* Only display a few selected event types */
            continue;
        }
    }

    fclose(fp);
    exit(0);
}

Note that in the above program, events that are of no interest are skipped. The job status codes are defined in lsbatch.h. The lsb.acct file stores job accounting information and can be processed similarly. Since there is currently only one event type (EVENT_JOB_FINISH) in the lsb.acct file, the processing is simpler than in the above example.





doc@platform.com

Copyright © 1994-1998 Platform Computing Corporation.
All rights reserved.