This chapter shows how to use LSBLIB to access the services provided by LSF Batch and LSF JobScheduler. Since LSF Batch and LSF JobScheduler are built on top of LSF Base, LSBLIB relies on services provided by LSLIB. Thus if you use LSBLIB functions, you must link your program with both LSLIB and LSBLIB.
LSF Batch and LSF JobScheduler services are mostly provided by mbatchd, except services for processing event and job log files, which do not involve any daemons. LSBLIB is shared by both LSF Batch and LSF JobScheduler. The functions described for LSF Batch in this chapter also apply to LSF JobScheduler, unless explicitly indicated otherwise.
Before accessing any of the services provided by LSF Batch and LSF JobScheduler, an application must initialize LSBLIB. It does this by calling the following function:
int lsb_init(appname);
On success, it returns 0; otherwise, it returns -1 and sets lsberrno to indicate the error.
The parameter appname is used only if you want to log detailed messages about the transactions inside LSLIB for debugging purposes. The messages will be logged only if LSB_CMD_LOG_MASK is defined as LOG_DEBUG1.
The messages will be logged in the file LSF_LOGDIR/appname. If appname is NULL, the log file is LSF_LOGDIR/bcmd.
This function must be called before any other function in LSBLIB can be called.
LSF Batch queues hold the jobs in the LSF Batch system and set scheduling policies and limits on resource usage.
LSBLIB provides a function to get information about the queues in the LSF Batch system. This includes queue name, parameters, statistics, status, resource limits, scheduling policies and parameters, and users and hosts associated with the queue.
The example program in this section uses the following LSBLIB function to get the queue information:
struct queueInfoEnt *lsb_queueinfo(queues,numQueues,hostname,username,options)
On success, this function returns an array containing a queueInfoEnt structure (see below) for each queue of interest and sets *numQueues to the size of the array. On failure, it returns NULL and sets lsberrno to indicate the error. It has the following parameters:
char **queues; An array containing names of queues of interest
int *numQueues; The number of names in queues
char *hostname; Only queues using hostname are of interest
char *username; Only queues enabled for user are of interest
int options; Reserved for future use; supply 0
To get information on all queues, set *numQueues to 0; *numQueues will be updated to the actual number of queues returned on a successful return.
If *numQueues is 1 and queues is NULL, information on the system default queue is returned.
If hostname is not NULL, then all queues using host hostname as a batch server host will be returned. If username is not NULL, then all queues to which user username is allowed to submit jobs will be returned.
The queueInfoEnt structure is defined in lsbatch.h as
struct queueInfoEnt {
char *queue; Name of the queue
char *description; Description of the queue
int priority; Priority of the queue
short nice; Nice value at which jobs in the queue will be run
char *userList; Users allowed to submit jobs to the queue
char *hostList; Hosts to which jobs in the queue may be dispatched
int nIdx; Size of the loadSched and loadStop arrays
float *loadSched; Load thresholds that control scheduling of jobs from the queue
float *loadStop; Load thresholds that control suspension of jobs from the queue
int userJobLimit; Number of unfinished jobs a user can dispatch from the queue
int procJobLimit; Number of unfinished jobs the queue can dispatch to a processor
char *windows; Queue run window
int rLimits[LSF_RLIM_NLIMITS]; The per-process resource limits for jobs
char *hostSpec; Obsolete. Use defaultHostSpec instead
int qAttrib; Attributes of the queue
int qStatus; Status of the queue
int maxJobs; Job slot limit of the queue.
int numJobs; Total number of job slots required by all jobs
int numPEND; Number of job slots needed by pending jobs
int numRUN; Number of job slots used by running jobs
int numSSUSP; Number of job slots used by system suspended jobs
int numUSUSP; Number of job slots used by user suspended jobs
int mig; Queue migration threshold in minutes
int schedDelay; Schedule delay for new jobs
int acceptIntvl; Minimum interval between two jobs dispatched to the same host
char *windowsD; Queue dispatch window
char *nqsQueues; A blank-separated list of NQS queue specifiers
char *userShares; A blank-separated list of user shares
char *defaultHostSpec; Value of DEFAULT_HOST_SPEC for the queue in lsb.queues
int procLimit; Maximum number of job slots a job can take
char *admins; Queue level administrators
char *preCmd; Queue level pre-exec command
char *postCmd; Queue's post-exec command
char *requeueEValues; Queue's requeue exit status
int hostJobLimit; Per host job slot limit
char *resReq; Queue level resource requirement
int numRESERVE; Reserved job slots for pending jobs
int slotHoldTime; Time period for reserving job slots
char *sndJobsTo; Remote queues to forward jobs to
char *rcvJobsFrom; Remote queues which can forward to me
char *resumeCond; Conditions to resume jobs
char *stopCond; Conditions to suspend jobs
char *jobStarter; Queue level job starter
char *suspendActCmd; Action commands for SUSPEND
char *resumeActCmd; Action commands for RESUME
char *terminateActCmd; Action commands for TERMINATE
int sigMap[LSB_SIG_NUM]; Configurable signal mapping
char *preemption; Preemption policy
int maxRschedTime; Time period for remote cluster to schedule job
};
The variable nIdx is the number of load threshold values for job scheduling. This is in fact the total number of load indices as returned by LIM. The parameters sndJobsTo, rcvJobsFrom, and maxRschedTime are only used with LSF MultiCluster.
For a complete description of the fields in the queueInfoEnt structure, see the lsb_queueinfo(3) man page.
The program below takes a queue name as the first argument and displays information about the named queue.
#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>
int
main (argc, argv)
int argc;
char *argv[];
{
struct queueInfoEnt *qInfo;
int numQueues = 1;
char *queue=argv[1];
int i;
if (argc != 2) {
printf("Usage: %s queue_name\n", argv[0]);
exit(-1);
}
if (lsb_init(argv[0]) < 0) {
lsb_perror("lsb_init()");
exit(-1);
}
qInfo = lsb_queueinfo(&queue, &numQueues, NULL, NULL, 0);
if (qInfo == NULL) {
lsb_perror("lsb_queueinfo()");
exit(-1);
}
printf("Information about %s queue:\n", queue);
printf("Description: %s\n", qInfo[0].description);
printf("Priority: %d Nice: %d \n",
qInfo[0].priority, qInfo[0].nice);
printf("Maximum number of job slots:");
if (qInfo[0].maxJobs < INFINIT_INT)
printf("%5d\n", qInfo[0].maxJobs);
else
printf("%5s\n", "unlimited");
printf("Job slot statistics: PEND(%d) RUN(%d) SUSP(%d) TOTAL(%d).\n",
qInfo[0].numPEND, qInfo[0].numRUN,
qInfo[0].numSSUSP + qInfo[0].numUSUSP, qInfo[0].numJobs);
exit(0);
}
The header file lsbatch.h must be included with every application that uses LSBLIB functions. Note that lsf.h does not have to be explicitly included in your program because lsbatch.h already includes lsf.h. The function lsb_perror() is used in much the same way as ls_perror() to print error messages regarding function call failure. You could check lsberrno if you want to take different actions for different errors.
In the above program, INFINIT_INT is defined in lsf.h and is used to indicate that there is no limit set for maxJobs. This applies to all LSF API function calls. LSF will supply INFINIT_INT automatically whenever the value for the variable is either invalid (not available) or infinity. This value should be checked for all variables that are optional. For example, if you were to display the loadSched/loadStop values, an INFINIT_INT indicates that the threshold is not configured and is ignored.
As with the data structures returned by LSLIB functions, the data structures returned by an LSBLIB function are dynamically allocated inside LSBLIB and are automatically freed the next time the same function is called. You should not attempt to free the space allocated by LSBLIB. If you need to keep this information across calls, make your own copy of the data structure.
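One way to keep information across calls is to copy just the fields you need into storage your program owns. The sketch below assumes a hypothetical application-side record, queueSnapshot, and hypothetical helpers snapshot_queue() and free_snapshot(); none of these names come from LSBLIB.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical application-side record holding only the fields we need.
 * It is not an LSBLIB type. */
struct queueSnapshot {
    char *name;      /* private copy of the queue name */
    int   priority;
    int   maxJobs;
};

/* Duplicate a C string into memory owned by the caller. */
static char *copy_string(const char *s)
{
    size_t n;
    char *p;

    if (s == NULL)
        return NULL;
    n = strlen(s) + 1;
    p = malloc(n);
    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

/* Fill snap from library-owned values.
 * Returns 0 on success, -1 if memory could not be allocated. */
int snapshot_queue(struct queueSnapshot *snap, const char *name,
                   int priority, int maxJobs)
{
    snap->name = copy_string(name);
    if (name != NULL && snap->name == NULL)
        return -1;
    snap->priority = priority;
    snap->maxJobs = maxJobs;
    return 0;
}

/* Release the memory owned by a snapshot. */
void free_snapshot(struct queueSnapshot *snap)
{
    free(snap->name);
    snap->name = NULL;
}
```

With a queueInfoEnt result, the call would look like snapshot_queue(&snap, qInfo[0].queue, qInfo[0].priority, qInfo[0].maxJobs), made before the next call to lsb_queueinfo().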
The above program will produce output similar to the following:
Information about normal queue:
Description: For normal low priority jobs
Priority: 25 Nice: 20
Maximum number of job slots : 40
Job slot statistics: PEND( 5) RUN(12) SUSP(1) TOTAL(18)
LSF Batch server hosts execute the jobs in the LSF Batch system.
LSBLIB provides a function to get information about the server hosts in the LSF Batch system. This includes both configured static information as well as dynamic information. Examples of host information include host name, status, job limits and statistics, dispatch windows, and scheduling parameters.
The example program in this section uses the following LSBLIB function:
struct hostInfoEnt *lsb_hostinfo(hosts, numHosts)
This function gets information about LSF Batch server hosts. On success, it returns an array of hostInfoEnt structures which hold the host information and sets *numHosts to the size of the array. On failure, it returns NULL and sets lsberrno to indicate the error. It has the following parameters:
char **hosts; An array of names of hosts of interest
int *numHosts; The number of names in hosts
To get information on all hosts, set *numHosts to 0; *numHosts will be set to the actual number of hostInfoEnt structures when this call returns successfully.
If *numHosts is 1 and hosts is NULL, information on the local host is returned.
The hostInfoEnt structure is defined in lsbatch.h as
struct hostInfoEnt {
char *host; Name of the host
int hStatus; Status of host. (see below)
int busySched; Reason host will not schedule jobs
int busyStop; Reason host has suspended jobs
float cpuFactor; Host CPU factor, as returned by LIM
int nIdx; Size of the loadSched and loadStop arrays, as returned from LIM
float *load; Load LSF Batch used for scheduling batch jobs
float *loadSched; Load thresholds that control scheduling of jobs on host
float *loadStop; Load thresholds that control suspension of jobs on host
char *windows; Host dispatch window
int userJobLimit; Maximum number of jobs a user can run on host
int maxJobs; Maximum number of jobs that host can process concurrently
int numJobs; Number of jobs running or suspended on host
int numRUN; Number of jobs running on host
int numSSUSP; Number of jobs suspended by sbatchd on host
int numUSUSP; Number of jobs suspended by a user on host
int mig; Migration threshold for jobs on host
int attr; Host attributes
#define H_ATTR_CHKPNTABLE 0x1
#define H_ATTR_CHKPNT_COPY 0x2
float *realLoad; The load mbatchd obtained from LIM
int numRESERVE; Num of slots reserved for pending jobs
int chkSig; This variable is obsolete
};
Note the differences between host information returned by the LSLIB function ls_gethostinfo() and host information returned by the LSBLIB function lsb_hostinfo(). The former returns general information about the hosts whereas the latter returns LSF Batch specific information about hosts.
For a complete description of the fields in the hostInfoEnt structure, see the lsb_hostinfo(3) man page.
The example program below takes a host name as an argument and displays various information about the named host. It is a simplified version of the LSF Batch bhosts command.
#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>
main (argc, argv)
int argc;
char *argv[];
{
struct hostInfoEnt *hInfo;
int numHosts = 1;
char *hostname = argv[1];
int i;
if (argc != 2) {
printf("Usage: %s hostname\n", argv[0]);
exit(-1);
}
if (lsb_init(argv[0]) < 0) {
lsb_perror("lsb_init");
exit(-1);
}
hInfo = lsb_hostinfo(&hostname, &numHosts);
if (hInfo == NULL) {
lsb_perror("lsb_hostinfo");
exit (-1);
}
printf("HOST_NAME STATUS JL/U NJOBS RUN SSUSP USUSP\n");
printf ("%-18.18s", hInfo->host);
if (hInfo->hStatus & HOST_STAT_UNLICENSED) {
printf(" %-9s\n", "unlicensed");
exit(0); /* don't print other info */
} else if (hInfo->hStatus & HOST_STAT_UNAVAIL)
printf(" %-9s", "unavail");
else if (hInfo->hStatus & HOST_STAT_UNREACH)
printf(" %-9s", "unreach");
else if (hInfo->hStatus & ( HOST_STAT_BUSY | HOST_STAT_WIND
| HOST_STAT_DISABLED | HOST_STAT_LOCKED
| HOST_STAT_FULL | HOST_STAT_NO_LIM))
printf(" %-9s", "closed");
else
printf(" %-9s", "ok");
if (hInfo->userJobLimit < INFINIT_INT)
printf("%4d", hInfo->userJobLimit);
else
printf("%4s", "-");
printf("%7d %4d %4d %4d\n",
hInfo->numJobs, hInfo->numRUN, hInfo->numSSUSP, hInfo->numUSUSP);
exit(0);
}
hStatus is the status of the host. It is the bitwise inclusive OR of some of the following constants defined in lsbatch.h:
HOST_STAT_BUSY
The host load is greater than a scheduling threshold. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_WIND
The host dispatch window is closed. In this status, no new batch job will be accepted.
HOST_STAT_DISABLED
The host has been disabled by the LSF administrator and will not accept jobs. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_LOCKED
The host is locked by an exclusive job. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_FULL
The host has reached its job limit. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_UNREACH
The sbatchd on this host is unreachable.
HOST_STAT_UNAVAIL
The LIM and sbatchd on this host are unreachable.
HOST_STAT_UNLICENSED
The host does not have an LSF license.
HOST_STAT_NO_LIM
The host is running an sbatchd but not a LIM.
If none of the above holds, hStatus is set to HOST_STAT_OK to indicate that the host is ready to accept and run jobs.
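Because hStatus can have several bits set at once, the order in which the bits are tested decides which label is shown. The sketch below mirrors the test order of the example program (unlicensed, unavail, unreach, closed, ok) in a reusable function. The DEMO_STAT_* names and their numeric values are stand-ins chosen here so the fragment compiles without lsbatch.h; they are assumptions, not the values in the header, and host_status_label() is a hypothetical helper.

```c
/* Stand-in bit values so this sketch compiles without lsbatch.h.
 * The numeric values are assumptions; a real program uses the
 * HOST_STAT_* constants from lsbatch.h. */
#define DEMO_STAT_BUSY       0x001
#define DEMO_STAT_WIND       0x002
#define DEMO_STAT_DISABLED   0x004
#define DEMO_STAT_LOCKED     0x008
#define DEMO_STAT_FULL       0x010
#define DEMO_STAT_UNREACH    0x020
#define DEMO_STAT_UNAVAIL    0x040
#define DEMO_STAT_UNLICENSED 0x080
#define DEMO_STAT_NO_LIM     0x100

/* Any of these bits means the host will not take new batch jobs. */
#define DEMO_STAT_CLOSED_MASK \
    (DEMO_STAT_BUSY | DEMO_STAT_WIND | DEMO_STAT_DISABLED | \
     DEMO_STAT_LOCKED | DEMO_STAT_FULL | DEMO_STAT_NO_LIM)

/* Hypothetical helper: map a status bitmask to a bhosts-style label.
 * The most severe conditions are tested first. */
const char *host_status_label(int hStatus)
{
    if (hStatus & DEMO_STAT_UNLICENSED)
        return "unlicensed";
    if (hStatus & DEMO_STAT_UNAVAIL)
        return "unavail";
    if (hStatus & DEMO_STAT_UNREACH)
        return "unreach";
    if (hStatus & DEMO_STAT_CLOSED_MASK)
        return "closed";
    return "ok";
}
```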
The constant INFINIT_INT defined in lsf.h is used to indicate that there is no limit set for userJobLimit.
The example output from the above program follows:
% a.out hostB
HOST_NAME STATUS JL/U NJOBS RUN SSUSP USUSP
hostB ok - 2 1 1 0
Job submission and modification are the most common operations in the LSF Batch system. A user can submit jobs to the system and then modify them if they have not been started.
LSBLIB provides one function for job submission and one function for job modification.
int lsb_submit(jobSubReq, jobSubReply)
int lsb_modify(jobSubReq, jobSubReply, jobId)
On success, these calls return the job ID; otherwise -1 is returned with lsberrno set to indicate the error. These two functions are similar except that lsb_modify() modifies the parameters of an already submitted job.
Both of these functions use the same data structure:
struct submit *jobSubReq; Job specifications
struct submitReply *jobSubReply; Results of job submission
int jobId; Id of the job to modify (lsb_modify() only)
The submit structure is defined in lsbatch.h as
struct submit {
int options; Indicates which optional fields are present
int options2; Indicates which additional fields are present
char *jobName; Job name (optional)
char *queue; Submit the job to this queue (optional)
int numAskedHosts; Size of askedHosts (optional)
char **askedHosts; An array of names of candidate hosts (optional)
char *resReq; Resource requirements of the job (optional)
int rLimits[LSF_RLIM_NLIMITS]; Limits on system resource use by all of the job's processes
char *hostSpec; Host model used for scaling rlimits (optional)
int numProcessors; Initial number of processors needed by the job
char *dependCond; Job dependency condition (optional)
time_t beginTime; Dispatch the job on or after beginTime
time_t termTime; Job termination deadline
int sigValue; This variable is obsolete
char *inFile; Path name of the job's standard input file (optional)
char *outFile; Path name of the job's standard output file (optional)
char *errFile; Path name of the job's standard error output file (optional)
char *command; Command line of the job
time_t chkpntPeriod; Job is checkpointable with this period (optional)
char *chkpntDir; Directory for this job's chk directory (optional)
int nxf; Size of xf (optional)
struct xFile *xf; An array of file transfer specifications (optional)
char *preExecCmd; Job's pre-execution command (optional)
char *mailUser; User e-mail address to which the job's output is mailed (optional)
int delOptions; Bits to be removed from options (lsb_modify() only)
char *projectName; Name of the job's project (optional)
int maxNumProcessors; Requested maximum number of job slots for the job
char *loginShell; Login shell to be used to re-initialize environment
char *exceptList; Lists the exception handlers
};
For a complete description of the fields in the submit structure, see the lsb_submit(3) man page.
The submitReply structure is defined in lsbatch.h as
struct submitReply {
char *queue; The queue name the job was submitted to
int badJobId; dependCond contains badJobId but there is no such job
char *badJobName; dependCond contains badJobName but there is no such job
int badReqIndx; Index of a host or resource limit that caused an error
};
The last three variables in the submitReply structure are used only when the lsb_submit() or lsb_modify() function call fails.
For a complete description of the fields in the submitReply structure, see the lsb_submit(3) man page.
To submit a new job, all you have to do is fill out the submit structure and then call lsb_submit(). The delOptions variable is ignored by the LSF Batch system for the lsb_submit() function call.
The example job submission program below takes the job command line as an argument and submits the job to the LSF Batch system. For simplicity, it is assumed that the job command does not have arguments.
#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>
main(argc, argv)
int argc;
char **argv;
{
struct submit req;
struct submitReply reply;
int jobId;
int i;
if (argc != 2) {
fprintf(stderr, "Usage: %s command\n", argv[0]);
exit(-1);
}
if (lsb_init(argv[0]) < 0) {
lsb_perror("lsb_init");
exit(-1);
}
req.options = 0;
req.options2 = 0;
req.resReq = NULL;
for (i = 0; i < LSF_RLIM_NLIMITS; i++)
req.rLimits[i] = DEFAULT_RLIMIT;
req.hostSpec = NULL;
req.numProcessors = 1;
req.maxNumProcessors = 1;
req.beginTime = 0;
req.termTime = 0;
req.command = argv[1];
req.nxf = 0;
req.delOptions = 0;
jobId = lsb_submit(&req, &reply);
if (jobId < 0) {
switch (lsberrno) {
case LSBE_QUEUE_USE:
case LSBE_QUEUE_CLOSED:
lsb_perror(reply.queue);
exit(-1);
default:
lsb_perror(NULL);
exit(-1);
}
}
exit(0);
}
The options field of the submit structure is the bitwise inclusive OR of some of the SUB_* flags defined in lsbatch.h. These flags serve two purposes. Some flags indicate which of the optional fields of the submit structure are present. Those that are not present have default values. Other flags indicate submission options. For a description of these flags, see lsb_submit(3).
Since options indicates which of the optional fields are meaningful, the programmer does not need to initialize the fields that are not selected by options. All parameters that are not optional must be initialized properly.
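The pairing of an optional field with its presence flag is easy to get wrong if the two are set in different places. The sketch below sets both in one step. It uses a cut-down stand-in structure, demoSubmit, and stand-in flags with the DEMO_ prefix so that it compiles without lsbatch.h; the flag values and all demo_* helper names are assumptions made for illustration and are not the definitions from the header.

```c
#include <stddef.h>

/* Stand-in flag values; the real SUB_* flags are defined in lsbatch.h
 * and their values may differ. */
#define DEMO_SUB_JOB_NAME 0x01
#define DEMO_SUB_QUEUE    0x02

/* Cut-down stand-in for the submit structure (illustration only). */
struct demoSubmit {
    int   options;   /* which optional fields are present */
    char *jobName;
    char *queue;
    char *command;   /* not optional */
};

/* Set an optional field and mark it as present in one step. */
void demo_set_queue(struct demoSubmit *req, char *queue)
{
    req->queue = queue;
    req->options |= DEMO_SUB_QUEUE;
}

void demo_set_job_name(struct demoSubmit *req, char *jobName)
{
    req->jobName = jobName;
    req->options |= DEMO_SUB_JOB_NAME;
}

/* Test whether an optional field was supplied. */
int demo_has(const struct demoSubmit *req, int flag)
{
    return (req->options & flag) != 0;
}
```

A field whose flag is not set is never inspected in this scheme, which is why it can be left uninitialized.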
If the resReq field of the submit structure is NULL, LSBLIB will try to obtain resource requirements for command from the remote task list (see `Getting Task Resource Requirements' on page 38). If the task does not appear in the remote task list, then NULL is passed to the LSF Batch system. mbatchd will then use the default resource requirements with the DFT_FROMTYPE option bit set when making an LSLIB call for host selection from LIM. See `Handling Default Resource Requirements' on page 26 for more information about default resource requirements.
The constant DEFAULT_RLIMIT defined in lsf.h indicates that there is no limit on a resource.
The constants used to index the rLimits array of the submit structure are defined in lsf.h, and the resource limits currently supported by LSF Batch are listed below.
The hostSpec field of the submit structure specifies the host model to use for scaling rLimits[LSF_RLIMIT_CPU] and rLimits[LSF_RLIMIT_RUN] (see lsb_queueinfo(3)). If hostSpec is NULL, the local host's model is assumed.
If the beginTime field of the submit structure is 0, the job is started as soon as possible.
If the termTime field of the submit structure is 0, the job is allowed to run until it reaches a resource limit.
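Non-zero values for beginTime and termTime are ordinary time_t values. As a minimal sketch, the hypothetical helper below computes a time a given number of minutes after a reference time. It assumes time_t counts seconds, as it does on POSIX systems; minutes_after() is not an LSBLIB function.

```c
#include <time.h>

/* Hypothetical helper: return the time `minutes` minutes after `now`.
 * Assumes time_t is an arithmetic type counting seconds (true on POSIX). */
time_t minutes_after(time_t now, long minutes)
{
    return now + (time_t)(minutes * 60L);
}
```

For example, req.beginTime = minutes_after(time(NULL), 60) asks for dispatch no earlier than one hour from now, and req.termTime = minutes_after(req.beginTime, 480) sets a deadline eight hours after that.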
The above example checks the value of lsberrno when lsb_submit() fails. Different actions can be taken depending on the type of the error. All possible error numbers are defined in lsbatch.h. For example, error number LSBE_QUEUE_USE indicates that the user is not authorized to use the queue. The error number LSBE_QUEUE_CLOSED indicates that the queue is closed.
Since a queue name was not specified for the job, the job will be submitted to the default queue. The queue field of the submitReply structure contains the name of the queue to which the job was submitted.
The above program will produce output similar to the following:
Job <5602> is submitted to default queue <default>.
The output from the job will be mailed to the user because the program did not specify a file name for the outFile parameter in the submit structure.
If you are familiar with the bsub command, it may help to know how the fields in the submit structure relate to the bsub command options. This is provided in the following table.
| ||
* indicates a bitwise OR mask for options2.
** indicates -1 means undefined
Even if not all options are used, all optional string fields must be initialized to the empty string. For a complete description of the fields in the submit structure, see the lsb_submit(3) manual page.
To modify an already submitted job, you can fill out a new submit structure to override existing parameters, and use delOptions to remove option bits that were previously specified for the job. Essentially, modifying a submitted job is like re-submitting the job, so the same program as above can be used to modify an existing job with minor changes. One additional parameter that must be specified for job modification is the job ID. The parameter delOptions can also be set if you want to clear some option bits that were set previously.
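The combined effect of clearing some option bits and setting others can be pictured as plain bit arithmetic. The function below is an illustration of that arithmetic only; it is a hypothetical model of the described behavior, not the code that mbatchd runs, and the order in which LSF Batch applies the two steps is an assumption.

```c
/* Illustration only: remove the bits named in delOptions, then add the
 * bits named in newOptions. Models the described effect of lsb_modify();
 * it is not LSF Batch's implementation. */
int apply_option_changes(int current, int delOptions, int newOptions)
{
    return (current & ~delOptions) | newOptions;
}
```

For instance, a job submitted with option bits 0x7, modified with delOptions 0x2 and new options 0x8, ends up with 0xD under this model.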
All applications that call lsb_submit() and lsb_modify() are subject to authentication constraints described in `Authentication' on page 17.
LSBLIB provides functions to get status information about batch jobs. Since the number of jobs in the LSF Batch system could be on the order of many thousands, getting all this information in one message could potentially use a lot of memory space. LSBLIB allows the application to open a stream connection and then read the job records one by one. This way the memory space needed is always the size of one job record.
An LSF Batch job ID is stored in a 32-bit integer and consists of two parts: a base ID and an array index. The base ID is stored in the lower 20 bits and the array index in the top 12 bits, which are used only when the job is an array job.
LSBLIB provides the following C macros (defined in lsbatch.h) for manipulating job IDs:
LSB_JOBID(base_id, array_index) Yield a 32-bit LSF Batch job ID
LSB_ARRAY_IDX(job_id) Yield array index part of the job ID
LSB_ARRAY_JOBID(job_id) Yield the base ID part of the job ID
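To make the bit layout concrete, the sketch below defines stand-in macros that follow the description above (base ID in the lower 20 bits, array index in the upper 12 bits). The DEMO_ names are assumptions introduced here so the fragment compiles without lsbatch.h; the macros in the header may be written differently, and a real program should use those.

```c
#include <stdint.h>

/* Stand-in macros following the layout described in the text.
 * They are not the definitions from lsbatch.h. */
#define DEMO_BASE_BITS 20
#define DEMO_BASE_MASK ((UINT32_C(1) << DEMO_BASE_BITS) - 1)   /* 0x000FFFFF */

/* Combine a base ID and an array index into one 32-bit job ID. */
#define DEMO_JOBID(base_id, array_index) \
    ((((uint32_t)(array_index)) << DEMO_BASE_BITS) | \
     (((uint32_t)(base_id)) & DEMO_BASE_MASK))

/* Extract the array index (upper 12 bits). */
#define DEMO_ARRAY_IDX(job_id)   (((uint32_t)(job_id)) >> DEMO_BASE_BITS)

/* Extract the base ID (lower 20 bits). */
#define DEMO_ARRAY_JOBID(job_id) (((uint32_t)(job_id)) & DEMO_BASE_MASK)
```

A non-array job has an array index of 0, so its full job ID and its base ID are the same number.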
The function calls used to get job information are:
int lsb_openjobinfo(jobId, jobName, user, queue, host, options);
struct jobInfoEnt *lsb_readjobinfo(more);
void lsb_closejobinfo(void);
These functions are used to open a job information connection with mbatchd, read job records, and then close the job information connection.
The lsb_openjobinfo() function takes the following arguments:
int jobId; Select job with the given job Id
char *jobName; Select job(s) with the given job name
char *user; Select job(s) submitted by the named user or user group
char *queue; Select job(s) submitted to the named queue
char *host; Select job(s) that are dispatched to the named host
int options; Selection flags constructed from the bits defined in lsbatch.h
The options parameter contains additional job selection flags defined in lsbatch.h. These are:
ALL_JOB
Select jobs matching any status, including unfinished jobs and recently finished jobs. LSF Batch remembers finished jobs within the CLEAN_PERIOD, as defined in the lsb.params file.
CUR_JOB
Return jobs that have not finished yet.
DONE_JOB
Return jobs that have finished recently.
PEND_JOB
Return jobs that are in the pending status.
SUSP_JOB
Return jobs that are in the suspended status.
LAST_JOB
Return jobs that are submitted most recently.
JGRP_ARRAY_INFO
Return job array information.
If options is 0, then the default is CUR_JOB.
lsb_openjobinfo() returns the total number of matching job records in the connection. It returns -1 on failure and sets lsberrno to indicate the error.
lsb_readjobinfo() takes one argument:
int *more; If not NULL, contains the remaining number of jobs unread
Either this parameter or the return value from lsb_openjobinfo() can be used to keep track of the number of job records that can be returned from the connection. This parameter is updated each time lsb_readjobinfo() is called.
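The open, read, close sequence and the role of the remaining-record counter can be tried out without a running mbatchd. The sketch below uses stand-in functions (demo_open(), demo_read(), demo_close()) that only mimic the shape of the three LSBLIB calls over a fixed in-memory list; every demo_* name and the demoJob type are hypothetical. The loop seeds its counter from the open call's return value, so it also behaves sensibly when no records match.

```c
#include <stddef.h>

/* Stand-in record and stream; not LSBLIB types or functions. */
struct demoJob { int jobId; };

static struct demoJob demo_jobs[] = { {123}, {126}, {163} };
static int demo_next = 0;
static int demo_total = 0;

/* Open the stream; returns the number of records available. */
int demo_open(void)
{
    demo_next = 0;
    demo_total = (int)(sizeof demo_jobs / sizeof demo_jobs[0]);
    return demo_total;
}

/* Return the next record and store the number still unread in *more. */
struct demoJob *demo_read(int *more)
{
    if (demo_next >= demo_total)
        return NULL;
    if (more != NULL)
        *more = demo_total - demo_next - 1;
    return &demo_jobs[demo_next++];
}

/* Close the stream. */
void demo_close(void)
{
    demo_next = 0;
    demo_total = 0;
}

/* Read records one at a time into ids[]; returns how many were read,
 * or -1 if a read fails. Only one record is held at any moment. */
int demo_collect(int *ids, int max)
{
    int more = demo_open();
    int count = 0;

    while (more > 0 && count < max) {
        struct demoJob *job = demo_read(&more);
        if (job == NULL) {
            demo_close();
            return -1;
        }
        ids[count++] = job->jobId;
    }
    demo_close();
    return count;
}
```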
The jobInfoEnt structure returned by lsb_readjobinfo() is defined in lsbatch.h as:
struct jobInfoEnt {
int jobId; job ID
char *user; submission user
/* possible values for the status field */
#define JOB_STAT_PEND 0x01 job is pending
#define JOB_STAT_PSUSP 0x02 job is held
#define JOB_STAT_RUN 0x04 job is running
#define JOB_STAT_SSUSP 0x08 job is suspended by LSF Batch system
#define JOB_STAT_USUSP 0x10 job is suspended by user
#define JOB_STAT_EXIT 0x20 job exited
#define JOB_STAT_DONE 0x40 job is completed successfully
int status;
int *reasonTb; pending or suspending reasons
int numReasons; length of reasonTb vector
int reasons; reserved for future use
int subreasons; reserved for future use
int jobPid; process Id of the job
time_t submitTime; time when the job is submitted
time_t reserveTime; time when job slots are reserved
time_t startTime; time when job is actually started
time_t predictedStartTime; job's predicted start time
time_t endTime; time when the job finishes
time_t lastEvent; last time event
time_t nextEvent; next time event
int duration; duration time (minutes)
float cpuTime; CPU time consumed by the job
int umask; file mode creation mask for the job
char *cwd; current working directory where job is submitted
char *subHomeDir; submitting user's home directory
char *fromHost; host from which the job is submitted
char **exHosts; host(s) on which the job executes
int numExHosts; number of execution hosts
float cpuFactor; CPU factor of the first execution host
int nIdx; number of load indices in the loadSched and
loadStop vector
float *loadSched; stop scheduling new jobs if this threshold
is exceeded
float *loadStop; stop jobs if this threshold is exceeded
struct submit submit; job submission parameters
int exitStatus; exit status
int execUid; user ID under which the job is running
char *execHome; home directory of the user denoted by execUid
char *execCwd; current working directory where job is running
char *execUsername; user name corresponds to execUid
time_t jRusageUpdateTime; last time job's resource usage is updated
struct jRusage runRusage; last updated job's resource usage
/* Possible values for the jType field */
#define JGRP_NODE_JOB 1 this structure stores a normal batch job
#define JGRP_NODE_GROUP 2 this structure stores a job group
#define JGRP_NODE_ARRAY 3 this structure stores a job array
int jType;
char *parentGroup; for job group use
char *jName; job group name: if jType is JGRP_NODE_GROUP
job's name: otherwise
/* index into the counter array, only used for job array */
#define JGRP_COUNT_NJOBS 0 total jobs in the array
#define JGRP_COUNT_PEND 1 number of pending jobs in the array
#define JGRP_COUNT_NPSUSP 2 number of held jobs in the array
#define JGRP_COUNT_NRUN 3 number of running jobs in the array
#define JGRP_COUNT_NSSUSP 4 number of jobs suspended by the system in the array
#define JGRP_COUNT_NUSUSP 5 number of jobs suspended by the user in the array
#define JGRP_COUNT_NEXIT 6 number of exited jobs in the array
#define JGRP_COUNT_NDONE 7 number of successfully completed jobs
int counter[NUM_JGRP_COUNTERS];
};
Under LSF Batch, the jobInfoEnt structure can store a job array as well as a non-array batch job, depending on the value of the jType field, which can be either JGRP_NODE_JOB or JGRP_NODE_ARRAY.
lsb_closejobinfo() should be called after receiving all job records in the connection.
Below is an example of a simplified bjobs command. This program displays all pending jobs belonging to all users.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lsf/lsbatch.h>
main(argc, argv)
int argc;
char *argv[];
{
int options = PEND_JOB;
char *user = "all"; /* match jobs for all users */
struct jobInfoEnt *job;
int more;
if (lsb_init(argv[0]) < 0) {
lsb_perror("lsb_init");
exit(-1);
}
if (lsb_openjobinfo(0, NULL, user, NULL, NULL, options) < 0) {
lsb_perror("lsb_openjobinfo");
exit(-1);
}
printf("All pending jobs submitted by all users:\n");
for (;;) {
job = lsb_readjobinfo(&more);
if (job == NULL) {
lsb_perror("lsb_readjobinfo");
exit(-1);
}
/* display the job */
printf("%s:\nJob <%d> of user <%s>, submitted from host <%s>\n",
ctime(&job->submitTime), job->jobId, job->user, job->fromHost);
if (! more)
break;
}
lsb_closejobinfo();
exit(0);
}
If you want to print out the reasons why the job is still pending, you can use the function lsb_pendreason(). See lsb_pendreason(3) for details.
The above program will produce output similar to the following:
All pending jobs submitted by all users:
Mon Mar 1 10:34:04 EST 1996:
Job <123> of user <john>, submitted from host <orange>
Mon Mar 1 11:12:11 EST 1996:
Job <126> of user <john>, submitted from host <orange>
Mon Mar 1 14:11:34 EST 1996:
Job <163> of user <ken>, submitted from host <apple>
Mon Mar 1 15:00:56 EST 1996:
Job <199> of user <tim>, submitted from host <pear>
The following program displays the job arrays of all users in the LSF Batch system and displays the breakdown of jobs as far as job status is concerned. The program demonstrates the use of LSBLIB API calls for collecting summary information of a job array.
#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>
int
main(int argc, char **argv)
{
struct jobInfoEnt *job;
int numJobs;
int more;
if (lsb_init(argv[0]) < 0) {
lsb_perror("lsb_init");
exit(-1);
}
numJobs = lsb_openjobinfo(0, NULL, "all", NULL, NULL, ALL_JOB|JGRP_ARRAY_INFO);
if (numJobs < 0) {
lsb_perror("lsb_openjobinfo");
exit(-1);
}
printf("JOBID ARRAY_NAME OWNER NJOBS PEND DONE RUN EXIT SSUSP USUSP PSUSP\n");
more = 1;
for (;;) {
if (!more)
break;
job = lsb_readjobinfo(&more);
if (job == NULL) { /* stop on a failed read instead of dereferencing NULL */
lsb_perror("lsb_readjobinfo");
exit(-1);
}
printf("%-5d %-8.8s ", LSB_ARRAY_JOBID(job->jobId), job->submit.jobName);
printf("%8.8s ", job->user);
printf(" %5d %4d %4d %4d %4d %5d %5d %5d\n",
job->counter[JGRP_COUNT_NJOBS],
job->counter[JGRP_COUNT_PEND],
job->counter[JGRP_COUNT_NDONE],
job->counter[JGRP_COUNT_NRUN],
job->counter[JGRP_COUNT_NEXIT],
job->counter[JGRP_COUNT_NSSUSP],
job->counter[JGRP_COUNT_NUSUSP],
job->counter[JGRP_COUNT_NPSUSP]);
}
lsb_closejobinfo();
exit(0);
}
The above program produces output similar to the following:
JOBID ARRAY_NAME OWNER NJOBS PEND DONE RUN EXIT SSUSP USUSP PSUSP
4205 ja1[1-8] userA 8 0 0 0 0 0 0 8
4207 ja2[1-2] userB 2 0 0 0 0 0 0 2
5074 ja3[1-4] userA 4 0 3 1 0 0 0 0
5075 ja4[1-10] userC 17 0 13 0 0 4 0 0
5076 ja5[1-4] userD 4 0 1 0 3 0 0 0
After a job has been submitted, it can be manipulated by users in different ways. It can be suspended, resumed, killed, or sent an arbitrary signal.
All applications that manipulate jobs are subject to authentication provisions described in `Authentication' on page 17.
Users can send signals to submitted jobs. If the job has not been started, you can send KILL, TERM, INT, and STOP signals. These will cause the job to be cancelled (KILL, TERM, INT) or suspended (STOP). If the job is already started, then any signal can be sent to the job.
The LSBLIB call to send a signal to a job is:
int lsb_signaljob(jobId, sigValue);
The jobId and sigValue parameters are self-explanatory.
The following example takes a job ID as its argument and sends a SIGSTOP signal to the job.
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <lsf/lsbatch.h>

int
main(int argc, char **argv)
{
    if (argc != 2) {
        printf("Usage: %s jobId\n", argv[0]);
        exit(-1);
    }

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* Suspend the job by sending it SIGSTOP */
    if (lsb_signaljob(atoi(argv[1]), SIGSTOP) < 0) {
        lsb_perror("lsb_signaljob");
        exit(-1);
    }

    /* argv[1] is a string, so print it with %s */
    printf("Job %s is signaled\n", argv[1]);
    exit(0);
}
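The example above converts its argument with atoi(), which silently returns 0 for text that is not a number. A stricter conversion might look like the following sketch. The helper parse_job_id() is a hypothetical name, not an LSBLIB function, and the sketch assumes that valid job IDs are positive values that fit in an int.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of LSBLIB.
 * Converts text to a job ID. Returns 0 and stores the ID in *jobId on
 * success; returns -1 and leaves *jobId untouched on bad input.
 */
int parse_job_id(const char *text, int *jobId)
{
    char *end;
    long value;

    assert(text != NULL && jobId != NULL);

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0')    /* no digits, or trailing junk */
        return -1;
    if (errno == ERANGE || value <= 0 || value > INT_MAX)
        return -1;                      /* out of range for a job ID */

    *jobId = (int) value;
    return 0;
}
```

A program would call parse_job_id(argv[1], &jobId) and print its usage message when the call returns -1, instead of passing a meaningless ID to the batch system.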
A job can be switched to a different queue after submission. This can be done even after the job has already started.
The LSBLIB function to switch a job from one queue to another is:
int lsb_switchjob(jobId, queue);
Below is an example program that switches a specified job to a new queue.
#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

int
main(int argc, char **argv)
{
    int jobId;

    if (argc != 3) {
        printf("Usage: %s jobId new_queue\n", argv[0]);
        exit(-1);
    }

    /* lsb_switchjob() expects a numeric job ID, not the argument string */
    jobId = atoi(argv[1]);

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    if (lsb_switchjob(jobId, argv[2]) < 0) {
        lsb_perror("lsb_switchjob");
        exit(-1);
    }

    printf("Job %d is switched to new queue <%s>\n", jobId, argv[2]);
    exit(0);
}
After a job is submitted to the LSF Batch system, it remains pending until LSF Batch determines that it is ready to run (for details on the factors that govern when and where a job starts to run, see "How LSF Batch Schedules Jobs" in the LSF Batch Administrator's Guide). However, a job can be forced to run on a specified list of hosts immediately using the following LSBLIB function:
int lsb_runjob(runJobReq)
This function takes a pointer to a runJobReq structure, which is defined in lsbatch.h:
struct runJobReq {
    int jobId;          /* Job ID of the job to start */
    int numHosts;       /* Number of hosts to run the job on */
    char **hostname;    /* Host names where jobs run */
    int options;        /* RUNJOB_REQ_NORMAL or RUNJOB_REQ_NOSTOP */
};
A job started with lsb_runjob() is not subject to scheduling constraints such as job slot limits. If the options field is 0 or RUNJOB_REQ_NORMAL, the job is still subject to the run windows of its queue and to the thresholds of the queue and of the job's execution hosts.
To override this, use RUNJOB_REQ_NOSTOP; the job will then not be stopped because of the load conditions mentioned above. All of LSBLIB's job manipulation functions can still be applied to the job.
The following is an example program that runs a specified job on a host that has no batch job running.
#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

int
main(int argc, char **argv)
{
    struct hostInfoEnt *hInfo;
    struct runJobReq runJobReq;
    char *hostList[1];
    int numHosts = 0;   /* 0 with a NULL host list asks for all hosts */
    int jobId;
    int i;

    if (argc != 2) {
        printf("Usage: %s jobId\n", argv[0]);
        exit(-1);
    }
    jobId = atoi(argv[1]);

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    hInfo = lsb_hostinfo(NULL, &numHosts);
    if (hInfo == NULL) {
        lsb_perror("lsb_hostinfo");
        exit(-1);
    }

    for (i = 0; i < numHosts; i++) {
        /* skip hosts that cannot accept jobs */
        if (hInfo[i].hStatus & (HOST_STAT_BUSY | HOST_STAT_WIND
                              | HOST_STAT_DISABLED | HOST_STAT_LOCKED
                              | HOST_STAT_FULL | HOST_STAT_NOLIM
                              | HOST_STAT_UNLICENSED | HOST_STAT_UNAVAIL
                              | HOST_STAT_UNREACH))
            continue;
        /* found a vacant host */
        if (hInfo[i].numJobs == 0)
            break;
    }
    if (i == numHosts) {
        fprintf(stderr, "Cannot find vacant host to run job < %d >\n",
                jobId);
        exit(-1);
    }

    /* With options 0 the job can still be stopped due to load conditions */
    hostList[0] = hInfo[i].host;
    runJobReq.jobId = jobId;
    runJobReq.options = 0;
    runJobReq.numHosts = 1;
    runJobReq.hostname = hostList;

    if (lsb_runjob(&runJobReq) < 0) {
        lsb_perror("lsb_runjob");
        exit(-1);
    }
    exit(0);
}
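The host selection step in the program above can be isolated into a function that is easy to exercise without a running cluster. The sketch below is illustrative only: struct sketch_host and the SKETCH_STAT_* flags are stand-ins invented for this example, with arbitrary values. Real code would use struct hostInfoEnt and the HOST_STAT_* constants from lsbatch.h.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in status bits with arbitrary values; NOT the HOST_STAT_* constants. */
#define SKETCH_STAT_BUSY    0x01
#define SKETCH_STAT_LOCKED  0x02
#define SKETCH_STAT_UNAVAIL 0x04
#define SKETCH_UNUSABLE \
    (SKETCH_STAT_BUSY | SKETCH_STAT_LOCKED | SKETCH_STAT_UNAVAIL)

/* Stand-in for the three hostInfoEnt members the selection loop reads. */
struct sketch_host {
    const char *host;   /* host name */
    int hStatus;        /* bitmask of status flags */
    int numJobs;        /* number of batch jobs on the host */
};

/*
 * Return the index of the first host that is usable and runs no batch
 * jobs, or -1 if there is no such host.
 */
int find_vacant_host(const struct sketch_host *hosts, int numHosts)
{
    int i;

    assert(hosts != NULL || numHosts == 0);

    for (i = 0; i < numHosts; i++) {
        if (hosts[i].hStatus & SKETCH_UNUSABLE)
            continue;           /* any unusable bit disqualifies the host */
        if (hosts[i].numJobs == 0)
            return i;
    }
    return -1;
}
```

Returning -1 rather than comparing the loop index with numHosts afterwards keeps the "no vacant host" case explicit at the call site.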
LSF Batch records valuable information about the system and its jobs. mbatchd logs this information in the files lsb.events and lsb.acct under the directory $LSB_SHAREDIR/your_cluster/logdir, where LSB_SHAREDIR is defined in the lsf.conf file and your_cluster is the name of your LSF cluster.
mbatchd logs this information for several purposes. First, some of the events serve as a backup of mbatchd's memory: if mbatchd crashes, the newly started mbatchd reads the critical information from the event file to restore the current state of LSF Batch. Second, the events can be used to produce historical information about the LSF Batch system and user jobs. Last, the information can be used to produce accounting or statistical reports.
The lsb.events file contains critical user job information. Your program must never modify it; writing to this file may cause user jobs to be lost.
LSBLIB provides a function to read information from these files into a well-defined data structure:
struct eventRec *lsb_geteventrec(log_fp, lineNum)
FILE *log_fp;    /* File handle for either an event log file or job log file */
int *lineNum;    /* Line number of the next event record */
The parameter log_fp is the file handle returned by a successful fopen() call. On a successful return, the content of lineNum is updated to the line number of the next event record in the log file. This value can then be used to report the line number if an error occurs while reading the log file. It should be initialized to 0 before lsb_geteventrec() is called for the first time.
This call returns the following data structure:
struct eventRec {
    char version[MAX_VERSION_LEN];    /* Version number of the mbatchd */
    int type;                         /* Type of the event */
    int eventTime;                    /* Event time stamp */
    union eventLog eventLog;          /* Event data */
};
The event type is used to determine the structure of the data in eventLog. LSBLIB remembers the storage allocated for the previously returned data structure and automatically frees it before returning the next event record.
lsb_geteventrec() returns NULL and sets lsberrno to LSBE_EOF when there are no more records in the event file.
Events are logged by mbatchd for many different purposes. There are job-related events and system-related events. Applications can choose to process certain events and ignore others. For example, the bhist command processes job-related events only. The currently available event types are listed below.
New calendar event (1)
Calendar modified (1)
Calendar deleted (1)
(1) Available only if the LSF JobScheduler component is enabled.
Note that the event type EVENT_JOB_FINISH is used by the lsb.acct file only, and all other event types are used by the lsb.events file only. For the detailed formats of these log files, see lsb.events(5) and lsb.acct(5).
Each event type corresponds to a different data structure in the union:
union eventLog {
    struct jobNewLog jobNewLog;                  /* EVENT_JOB_NEW */
    struct jobStartLog jobStartLog;              /* EVENT_JOB_START */
    struct jobStatusLog jobStatusLog;            /* EVENT_JOB_STATUS */
    struct jobSwitchLog jobSwitchLog;            /* EVENT_JOB_SWITCH */
    struct jobMoveLog jobMoveLog;                /* EVENT_JOB_MOVE */
    struct queueCtrlLog queueCtrlLog;            /* EVENT_QUEUE_CTRL */
    struct hostCtrlLog hostCtrlLog;              /* EVENT_HOST_CTRL */
    struct mbdStartLog mbdStartLog;              /* EVENT_MBD_START */
    struct mbdDieLog mbdDieLog;                  /* EVENT_MBD_DIE */
    struct unfulfillLog unfulfillLog;            /* EVENT_MBD_UNFULFILL */
    struct jobFinishLog jobFinishLog;            /* EVENT_JOB_FINISH */
    struct loadIndexLog loadIndexLog;            /* EVENT_LOAD_INDEX */
    struct migLog migLog;                        /* EVENT_MIG */
    struct calendarLog calendarLog;              /* Shared by all calendar events */
    struct jobForce jobForceRequestLog;          /* EVENT_JOB_FORCE */
    struct jobForwardLog jobForwardLog;          /* EVENT_JOB_FORWARD */
    struct jobAcceptLog jobAcceptLog;            /* EVENT_JOB_ACCEPT */
    struct statusAckLog statusAckLog;            /* EVENT_STATUS_ACK */
    struct signalLog signalLog;                  /* EVENT_JOB_SIGNAL */
    struct jobExecuteLog jobExecuteLog;          /* EVENT_JOB_EXECUTE */
    struct jobRequeueLog jobRequeueLog;          /* EVENT_JOB_REQUEUE */
    struct sigactLog sigactLog;                  /* EVENT_JOB_SIGACT */
    struct jobStartAcceptLog jobStartAcceptLog;  /* EVENT_JOB_START_ACCEPT */
};
The detailed data structures in the above union are defined in lsbatch.h and described in lsb_geteventrec(3).
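Because eventLog is a union, only the member that corresponds to the record's type holds valid data, so code must check the type before reading any member. The following is a minimal sketch of that pattern using stand-in types: struct sketch_event, its members, and the SKETCH_* constants are invented for this illustration and are much smaller than the real structures in lsbatch.h.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in event types; NOT the EVENT_* values from lsbatch.h. */
enum sketch_event_type { SKETCH_JOB_NEW = 1, SKETCH_JOB_START = 2 };

/* Stand-in log structures with only a couple of members each. */
struct sketch_new_log   { int jobId; const char *queue; };
struct sketch_start_log { int jobId; int numExHosts; };

struct sketch_event {
    int type;                       /* selects the valid union member */
    union {
        struct sketch_new_log   jobNewLog;
        struct sketch_start_log jobStartLog;
    } eventLog;
};

/*
 * Store the job ID carried by the event in *jobId and return 0.
 * Return -1, leaving *jobId untouched, for types this sketch does not
 * know how to read.
 */
int event_job_id(const struct sketch_event *ev, int *jobId)
{
    assert(ev != NULL && jobId != NULL);

    switch (ev->type) {
    case SKETCH_JOB_NEW:
        *jobId = ev->eventLog.jobNewLog.jobId;
        return 0;
    case SKETCH_JOB_START:
        *jobId = ev->eventLog.jobStartLog.jobId;
        return 0;
    default:
        return -1;                  /* never guess at an unknown layout */
    }
}
```

The same discipline applies to every member: reading jobNewLog fields from a record whose type is not the new-job type yields whatever bytes the other structure happens to store there.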
Below is an example program that takes a job name as its argument and displays a chronological history of all jobs matching that name. The program assumes that the lsb.events file is in /local/lsf/work/cluster1/logdir.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <lsf/lsbatch.h>

#define MAX_MATCHED 1024            /* most matching jobs this example tracks */

static int matched[MAX_MATCHED];    /* IDs of jobs whose name matched */
static int numMatched = 0;

/* Return 1 if jobId belongs to a job whose name matched, else 0 */
static int
isMatched(int jobId)
{
    int i;

    for (i = 0; i < numMatched; i++)
        if (matched[i] == jobId)
            return 1;
    return 0;
}

/* Format an event time stamp without the newline that ctime() appends */
static char *
timeStr(int eventTime)
{
    static char buf[32];
    time_t t = (time_t) eventTime;

    strncpy(buf, ctime(&t), sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    buf[strcspn(buf, "\n")] = '\0';
    return buf;
}

int
main(int argc, char **argv)
{
    char *eventFile = "/local/lsf/work/cluster1/logdir/lsb.events";
    FILE *fp;
    struct eventRec *record;
    struct jobNewLog *newJob;
    struct jobStartLog *startJob;
    struct jobStatusLog *statusJob;
    int lineNum = 0;
    char *jobName;
    int i;

    if (argc != 2) {
        printf("Usage: %s jobname\n", argv[0]);
        exit(-1);
    }
    jobName = argv[1];

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    fp = fopen(eventFile, "r");
    if (fp == NULL) {
        perror(eventFile);
        exit(-1);
    }

    for (;;) {
        record = lsb_geteventrec(fp, &lineNum);
        if (record == NULL) {
            if (lsberrno == LSBE_EOF)
                break;
            lsb_perror("lsb_geteventrec");
            exit(-1);
        }

        switch (record->type) {
        case EVENT_JOB_NEW:
            /* jobName is a member of jobNewLog, so it is valid only here;
               later events are matched by the job IDs remembered below */
            newJob = &(record->eventLog.jobNewLog);
            if (strcmp(newJob->jobName, jobName) != 0)
                break;
            if (numMatched < MAX_MATCHED)
                matched[numMatched++] = newJob->jobId;
            printf("%s: job <%d> submitted by <%s> from <%s> to <%s> queue\n",
                   timeStr(record->eventTime), newJob->jobId, newJob->userName,
                   newJob->fromHost, newJob->queue);
            break;
        case EVENT_JOB_START:
            startJob = &(record->eventLog.jobStartLog);
            if (!isMatched(startJob->jobId))
                break;
            printf("%s: job <%d> started on ",
                   timeStr(record->eventTime), startJob->jobId);
            for (i = 0; i < startJob->numExHosts; i++)
                printf("<%s> ", startJob->execHosts[i]);
            printf("\n");
            break;
        case EVENT_JOB_STATUS:
            statusJob = &(record->eventLog.jobStatusLog);
            if (!isMatched(statusJob->jobId))
                break;
            printf("%s: Job <%d> status changed to: ",
                   timeStr(record->eventTime), statusJob->jobId);
            switch (statusJob->jStatus) {
            case JOB_STAT_PEND:
                printf("pending\n");
                break;
            case JOB_STAT_RUN:
                printf("running\n");
                break;
            case JOB_STAT_SSUSP:
            case JOB_STAT_USUSP:
            case JOB_STAT_PSUSP:
                printf("suspended\n");
                break;
            case JOB_STAT_UNKWN:
                printf("unknown (sbatchd unreachable)\n");
                break;
            case JOB_STAT_EXIT:
                printf("exited\n");
                break;
            case JOB_STAT_DONE:
                printf("done\n");
                break;
            default:
                printf("\nError: unknown job status %d\n", statusJob->jStatus);
                break;
            }
            break;
        default:    /* Only display a few selected event types */
            break;
        }
    }

    fclose(fp);
    exit(0);
}
Note that in the above program, events that are of no interest are skipped. The job status codes are defined in lsbatch.h. The lsb.acct file stores job accounting information and can be processed similarly. Since there is currently only one event type (EVENT_JOB_FINISH) in the lsb.acct file, the processing is simpler than in the above example.
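When printing records from either log file, note that eventTime is declared as an int in struct eventRec, while ctime() expects a pointer to a time_t and returns text that ends in a newline. A small helper can deal with both points. The function format_event_time() below is a hypothetical name, not part of LSBLIB, and is only a sketch of one way to do this.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper, not part of LSBLIB.
 * Writes the ctime() text for eventTime into buf without the trailing
 * newline. Returns buf, or NULL if the time cannot be converted or buf
 * is too small to hold the text.
 */
char *format_event_time(int eventTime, char *buf, size_t len)
{
    time_t t = (time_t) eventTime;  /* copy into a real time_t first */
    const char *text;
    size_t n;

    assert(buf != NULL && len > 0);

    text = ctime(&t);
    if (text == NULL)
        return NULL;

    n = strcspn(text, "\n");        /* length up to the newline */
    if (n >= len)
        return NULL;

    memcpy(buf, text, n);
    buf[n] = '\0';
    return buf;
}
```

Copying the text into a caller-supplied buffer also avoids holding on to ctime()'s static storage, which the next call to ctime() overwrites.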