This chapter describes the commands that report and change the status of your jobs.
The bjobs command reports the status of LSF Batch jobs.
% bjobs
JOBID USER  STAT  QUEUE    FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
3926  user1 RUN   priority hostf     hostc     verilog    Oct 22 13:51
605   user1 SSUSP idle     hostq     hostc     Test4      Oct 17 18:07
1480  user1 PEND  priority hostd               generator  Oct 19 18:13
7678  user1 PEND  priority hostd               verilog    Oct 28 13:08
7679  user1 PEND  priority hosta               coreHunter Oct 28 13:12
7680  user1 PEND  priority hostb               myjob      Oct 28 13:17
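As an illustration only (this script is not part of LSF Batch), the listing above can be summarized by counting jobs per state. The function name count_by_state is hypothetical, and the sketch assumes that the first three whitespace-separated fields of every job line are JOBID, USER and STAT, as in the example. Later columns are not parsed, because EXEC_HOST is empty for pending jobs.

```python
from collections import Counter

def count_by_state(bjobs_output):
    """Count jobs per STAT value in default bjobs output (hypothetical helper)."""
    lines = bjobs_output.strip().splitlines()
    counts = Counter()
    for line in lines[1:]:          # skip the header line
        fields = line.split()
        if len(fields) >= 3:        # JOBID, USER, STAT are always present
            counts[fields[2]] += 1
    return counts

# Sample text copied from the bjobs example above.
SAMPLE = """\
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
3926 user1 RUN priority hostf hostc verilog Oct 22 13:51
605 user1 SSUSP idle hostq hostc Test4 Oct 17 18:07
1480 user1 PEND priority hostd generator Oct 19 18:13
7678 user1 PEND priority hostd verilog Oct 28 13:08
7679 user1 PEND priority hosta coreHunter Oct 28 13:12
7680 user1 PEND priority hostb myjob Oct 28 13:17
"""

if __name__ == "__main__":
    for state, number in sorted(count_by_state(SAMPLE).items()):
        print(state, number)
```

For the sample listing this reports four PEND jobs, one RUN job and one SSUSP job.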
The -a option displays jobs that completed or exited in the recent past, along with pending and running jobs.
The -r option displays only running jobs.
The -u username option displays jobs submitted by other users. The special user name all displays jobs submitted by all users.
For example, to find out who is running jobs on which hosts, enter:
% bjobs -u all
You can also find jobs on specific queues or hosts, find jobs submitted by specific projects, and check the status of specific jobs using their job IDs or names. See the bjobs(1) manual page for more information.
When you submit a job to LSF Batch, it may be held in the queue before it starts running, and it may be suspended while running. The -p option to the bjobs command displays the reasons a job is pending. Because there can be more than one reason why a job is pending, all reasons that contribute to the pending state are reported. For example:
% bjobs -p
7678 user1 PEND priority hostD verilog Oct 28 13:08
Queue's resource requirements not satisfied: 3 hosts;
Unable to reach slave lsbatch server: 1 host;
Not enough job slots: 1 host;
The pending reasons will also mention the number of hosts for each condition.
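The following is a minimal sketch, not an LSF tool, of how the pending-reason lines could be tallied by a script. The name parse_pending_reasons is hypothetical, and the line shape (a reason, a colon, a host count, then "host;" or "hosts;") is assumed from the example above rather than taken from a format specification.

```python
import re

# Assumed line shape, taken from the sample output above:
#   <reason>: <N> host(s);
REASON_LINE = re.compile(r"^\s*(?P<reason>.+?):\s*(?P<count>\d+)\s+hosts?;\s*$")

def parse_pending_reasons(text):
    """Map each pending reason to the number of hosts it applies to."""
    reasons = {}
    for line in text.splitlines():
        match = REASON_LINE.match(line)
        if match:                   # job lines and headers do not match
            reasons[match.group("reason")] = int(match.group("count"))
    return reasons

# Sample text copied from the bjobs -p example above.
SAMPLE = """\
7678 user1 PEND priority hostD verilog Oct 28 13:08
Queue's resource requirements not satisfied: 3 hosts;
Unable to reach slave lsbatch server: 1 host;
Not enough job slots: 1 host;
"""

if __name__ == "__main__":
    for reason, count in parse_pending_reasons(SAMPLE).items():
        print(f"{count} host(s): {reason}")
```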
To get the specific host names along with the pending reasons, use the -p and -l options to the bjobs command. For example:
% bjobs -lp
Job Id <7678>, User <user1>, Project <default>, Status <PEND>, Queue <priority>, Command <verilog>
Mon Oct 28 13:08:11: Submitted from host <hostD>, CWD <$HOME>, Requested Resources <type==any && swp>35>;
PENDING REASONS:
Queue's resource requirements not satisfied: hostb, hostk, hostv;
Unable to reach slave lsbatch server: hostH;
Not enough job slots: hostF;
SCHEDULING PARAMETERS:
          r15s  r1m   r15m  ut    pg    io    ls    it    tmp   swp   mem
loadSched -     0.7   1.0   -     4.0   -     -     -     -     -     -
loadStop  -     1.5   2.5   -     8.0   -     -     -     -     -     -
In a cluster with many hosts (for example, 100 to 200), listing the host names with every pending reason can be too verbose. In that case, use the bjobs command with the -p option only.
The -l option to the bjobs command displays detailed information about job status and parameters, such as the job's current working directory, parameters specified when the job was submitted, and the time when the job started running.
% bjobs -l 7678
Job Id <7678>, User <user1>, Project <default>, Status <PEND>, Queue <priority>, Command <verilog>
Mon Oct 28 13:08:11: Submitted from host <hostD>, CWD <$HOME>, Requested Resources <type==any && swp>35>;
PENDING REASONS:
Queue's resource requirements not satisfied: 3 hosts;
Unable to reach slave lsbatch server: 1 host;
Not enough job slots: 1 host;
SCHEDULING PARAMETERS:
          r15s  r1m   r15m  ut    pg    io    ls    it    tmp   swp   mem
loadSched -     0.7   1.0   -     4.0   -     -     -     -     -     -
loadStop  -     1.5   2.5   -     8.0   -     -     -     -     -     -
The loadSched and loadStop thresholds displayed are those that apply to this job. If the job is pending, the thresholds are taken from the queue. If the job has been dispatched, each threshold is the more restrictive of the queue and execution host thresholds for that load index.
Scheduling is also affected by other queue constraints such as RES_REQ, STOP_COND, RESUME_COND, fairshare policy, and others.
The -s option displays the reasons a batch job was suspended. Because load conditions change constantly, the reported reasons for suspension may be out of date. Once a job is suspended, it does not resume execution until its scheduling conditions are met.
% bjobs -s
JOBID USER  STAT  QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
605   user1 SSUSP idle  hosta     hostc     Test4    Oct 17 18:07
The host load exceeded the following threshold(s):
Paging rate: pg;
Idle time: it;
In the example above, the job was suspended because the paging rate and interactive idle time on the execution host went above the suspending threshold. Even though the paging rate may have dropped back below the scheduling threshold, the job may remain suspended because of another threshold. The job does not resume until all load indices are within their scheduling thresholds.
LSF Batch monitors the resources that jobs consume while they are running. The -l option of the bjobs command displays the current resource usage of the job. As the example below shows, this job-level information includes the CPU time used, memory (MEM) and swap (SWAP) usage, the process group ID (PGID), and the process IDs (PIDs).
The job-level resource usage information is updated at most once every SBD_SLEEP_TIME seconds (see `The lsb.params File' on page 193 of the LSF Batch Administrator's Guide for the value of SBD_SLEEP_TIME). The update is done only if the value for the CPU time, resident memory usage, or virtual memory usage has changed by more than 10 percent from the previous update, or if a new process or process group has been created.
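The update rule just described can be sketched as a small predicate. This is an illustration under stated assumptions, not the sbatchd implementation: the dictionary keys (cpu_time, mem, swap, pids, pgids) and both function names are invented for the sketch, and treating any change from a zero baseline as significant is also an assumption. The SBD_SLEEP_TIME interval itself is not modelled.

```python
def usage_changed(previous, current, threshold=0.10):
    """True if CPU time, memory or swap moved by more than `threshold`."""
    for key in ("cpu_time", "mem", "swap"):
        old, new = previous[key], current[key]
        if old == 0:
            if new != 0:            # assumption: any growth from zero counts
                return True
        elif abs(new - old) / old > threshold:
            return True
    return False

def should_update(previous, current):
    """Apply the rule from the text: >10% change, or a new process or group."""
    new_process = set(current["pids"]) - set(previous["pids"])
    new_group = set(current["pgids"]) - set(previous["pgids"])
    return bool(new_process or new_group) or usage_changed(previous, current)

if __name__ == "__main__":
    before = {"cpu_time": 2, "mem": 147, "swap": 201,
              "pids": [8920, 8921], "pgids": [8920]}
    after = dict(before, mem=170)   # about 16 percent more resident memory
    print(should_update(before, after))
```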
% bjobs -l 1531
Job Id <1531>, User <user1>, Project <default>, Status <RUN>, Queue <priority>, Command <example 200>
Fri Dec 27 13:04:14: Submitted from host <hostA>, CWD <$HOME>, SpecifiedHosts <hostD>;
Fri Dec 27 13:04:19: Started on <hostD>, Execution Home </home/user1>, Execution CWD </home/user1>;
Fri Dec 27 13:05:00: Resource usage collected.
The CPU time used is 2 seconds.
MEM: 147 Kbytes; SWAP: 201 Kbytes
PGID: 8920; PIDs: 8920 8921 8922
SCHEDULING PARAMETERS:
          r15s  r1m   r15m  ut    pg    io    ls    it    tmp   swp   mem
loadSched -     -     -     -     -     -     -     -     -     -     -
loadStop  -     -     -     -     -     -     -     -     -     -     -
Sometimes you want to know what has happened to your job since it was submitted. The bhist command displays a summary of the pending, suspended, and running time of batch jobs. The -l option of the bhist command prints the time information and a complete history of the scheduling events for each job.
% bhist -l 1531
JobId <1531>, User <user1>, Project <default>, Command <example 200>
Fri Dec 27 13:04:14: Submitted from host <hostA> to Queue <priority>, CWD <$HOME>, Specified Hosts <hostD>;
Fri Dec 27 13:04:19: Dispatched to <hostD>;
Fri Dec 27 13:04:19: Starting (Pid 8920);
Fri Dec 27 13:04:20: Running with execution home </home/user1>, Execution CWD </home/user1>, Execution Pid <8920>;
Fri Dec 27 13:05:49: Suspended by the user or administrator;
Fri Dec 27 13:05:56: Suspended: Waiting for re-scheduling after being resumed by user;
Fri Dec 27 13:05:57: Running;
Fri Dec 27 13:07:52: Done successfully. The CPU time used is 28.3 seconds.
Summary of time in seconds spent in various states by Sat Dec 27 13:07:52 1997
PEND  PSUSP  RUN  USUSP  SSUSP  UNKWN  TOTAL
5     0      205  7      1      0      218
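The summary line can be reproduced from the event timestamps. The sketch below is an illustration, not how bhist computes its totals: the event list is transcribed by hand from the example, the function name time_in_states is hypothetical, and counting the job as RUN from the moment it is dispatched (13:04:19) is an assumption chosen because it makes the figures match. All events are assumed to fall on the same day.

```python
from datetime import datetime

# (time, state entered) pairs transcribed from the bhist -l example above.
EVENTS = [
    ("13:04:14", "PEND"),    # submitted
    ("13:04:19", "RUN"),     # dispatched and starting (counted as RUN here)
    ("13:05:49", "USUSP"),   # suspended by the user or administrator
    ("13:05:56", "SSUSP"),   # resumed by user, waiting for re-scheduling
    ("13:05:57", "RUN"),     # running again
    ("13:07:52", "DONE"),    # finished; no time is accumulated after this
]

def time_in_states(events):
    """Sum the seconds spent in each state between consecutive events."""
    totals = {}
    for (start, state), (end, _next_state) in zip(events, events[1:]):
        elapsed = (datetime.strptime(end, "%H:%M:%S")
                   - datetime.strptime(start, "%H:%M:%S"))
        totals[state] = totals.get(state, 0) + int(elapsed.total_seconds())
    totals["TOTAL"] = sum(totals.values())
    return totals

if __name__ == "__main__":
    print(time_in_states(EVENTS))
    # {'PEND': 5, 'RUN': 205, 'USUSP': 7, 'SSUSP': 1, 'TOTAL': 218}
```

The two RUN intervals (90 and 115 seconds) add up to the 205 seconds shown in the summary.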
The -J job_name option of the bhist command displays the history of all LSF Batch jobs with the specified job name. Job names are assigned with the -J job_name option of the bsub command.
LSF keeps job history information after the job exits, so you can look at the history of jobs that completed in the past. The length of the history depends on how often the LSF administrator cleans up the log files.
By default, bhist only displays job history from the current event log file. The -n option to the bhist command allows users to display the history of jobs that completed a long time ago and are no longer listed in the active event log.
The LSF Batch system periodically backs up and prunes the job history log. The -n num_logfiles option tells the bhist command to search through the specified number of log files instead of only searching the current log file. Log files are searched in reverse time order; for example, the command bhist -n 3 searches the current event log file and then the two most recent backup files.
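The search order can be sketched as follows. This is an illustration only: the function name is hypothetical, and the backup naming scheme (the current file name followed by .1 for the most recent backup, .2 for the next, and so on) is an assumption that the text above does not state.

```python
def log_files_to_search(num_logfiles, current="lsb.events"):
    """Return event log file names in the order they would be searched."""
    if num_logfiles < 1:
        raise ValueError("num_logfiles must be at least 1")
    # Assumption: backup N is named "<current>.N", with .1 the most recent.
    backups = [f"{current}.{n}" for n in range(1, num_logfiles)]
    return [current] + backups

if __name__ == "__main__":
    # Models "bhist -n 3": the current log, then the two most recent backups.
    print(log_files_to_search(3))
```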
By default, the bhist command displays information from the job event history file, lsb.events, on a per-job basis. The -t option to bhist(1) can be used to display the events chronologically, instead of grouping all events for each job. The -T option selects only those events within a given time range.
For example, the following displays all events which occurred between 14:00 and 14:30 on a given day:
% bhist -t -T 14:00,14:30
Wed Oct 22 14:01:25: Job <1574> done successfully;
Wed Oct 22 14:03:09: Job <1575> submitted from host to Queue ,
CWD , User , Project , Command , Requested
Resources ;
Wed Oct 22 14:03:18: Job <1575> dispatched to ;
Wed Oct 22 14:03:18: Job <1575> starting (Pid 210);
Wed Oct 22 14:03:18: Job <1575> running with execution home , Execution CWD , Execution Pid <210>;
Wed Oct 22 14:05:06: Job <1577> submitted from host to Queue,
CWD , User , Project , Command , Requested
Resources ;
Wed Oct 22 14:05:11: Job <1577> dispatched to ;
Wed Oct 22 14:05:11: Job <1577> starting (Pid 429);
Wed Oct 22 14:05:12: Job <1577> running with execution home, Execution CWD , Execution Pid <429>;
Wed Oct 22 14:08:26: Job <1578> submitted from host to Queue, CWD , User , Project , Command;
Wed Oct 22 14:10:55: Job <1577> done successfully;
Wed Oct 22 14:16:55: Job <1578> exited;
Wed Oct 22 14:17:04: Job <1575> done successfully;
The output from an LSF Batch job is normally not available until the job is finished. However, LSF Batch provides the bpeek command for you to look at the output the job has produced so far. By default, bpeek shows the output from the most recently submitted job; you can also select the job by queue or execution host, or specify the job ID or job name on the command line.
% bpeek 1234
<< output from stdout >>
Starting phase 1
Phase 1 done
Calculating new parameters
...
Only the job owner can use bpeek to see job output. The bpeek command does not work on a job running under a different user account.
You can use this command to check whether your job is behaving as expected, and kill the job if it is running away or producing unusable results, which can save you time.
The status of individual elements of a job array can be viewed using the bjobs command or the xlsbatch GUI. The JOBID field is the same for all elements of the array, and the JOB_NAME field has the index of the element appended to it, i.e., jobName[index]. The following output shows the result of submitting a job array and viewing it through bjobs.
% bsub -J "myArray[1-5]" sleep 10
Job <212> is submitted to default queue.
% bjobs
JOBID USER  STAT QUEUE   FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
212   user1 RUN  default hostA     hostB     myArray[1] Jul 25 12:45
212   user1 PEND default hostA               myArray[2] Jul 25 12:45
212   user1 PEND default hostA               myArray[3] Jul 25 12:45
212   user1 PEND default hostA               myArray[4] Jul 25 12:45
212   user1 PEND default hostA               myArray[5] Jul 25 12:45
To display summary information about the number of jobs in the different states in the array, use the -A option of bjobs as follows.
% bjobs -A
JOBID NAME               OWNER NJOBS PEND RUN DONE EXIT SSUSP USUSP PSUSP
215   testArray[1-100:2] user1 50    40   5   1    0    0     0     4
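The NJOBS value of 50 follows from the index specification 1-100:2. The sketch below expands such a specification into element indices. It is an illustration only: the function name is hypothetical, and the grammar (comma-separated items, each a single index or start-end with an optional :step) is inferred from the examples in this chapter rather than from a formal syntax definition.

```python
def expand_index_list(spec):
    """Expand an index list such as "1-100:2" or "3, 7" into a list of ints."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        range_part, _, step_part = part.partition(":")
        step = int(step_part) if step_part else 1      # default step is 1
        start_text, dash, end_text = range_part.partition("-")
        start = int(start_text)
        end = int(end_text) if dash else start         # single index
        indices.extend(range(start, end + 1, step))    # end is inclusive
    return indices

if __name__ == "__main__":
    elements = expand_index_list("1-100:2")
    print(len(elements), elements[:3], elements[-1])
    # 50 [1, 3, 5] 99
```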
The history of jobs in the array can be viewed using the bhist command. When the job ID of an array is specified, the history of each element is displayed.
% bhist 212
The history of specific elements can be displayed by appending an index specification to the job ID. For example:
% bhist "212[5]"
The bqueues and bhosts commands display the number of jobs in a queue or dispatched to a host. For more information on these commands, see `Batch Queues' on page 67 and `Batch Hosts' on page 79.
After a job is submitted, you can control it by killing it, suspending it, or resuming it.
The bkill command cancels pending batch jobs and sends signals to running jobs. By default, on UNIX, bkill sends the SIGKILL signal to running jobs. For example, to kill job 3421, enter:
% bkill 3421
Job <3421> is being terminated
Before SIGKILL is sent, SIGINT and SIGTERM are sent to give the job a chance to catch the signals and clean up. The signals are forwarded from the mbatchd to the sbatchd. The sbatchd waits for the job to exit before reporting the status. Because of these delays, for a short period of time after the bkill command has been sent, bjobs may still report that the job is running.
On Windows NT, job control messages replace the SIGINT and SIGTERM signals, and termination is implemented by the TerminateProcess() system call.
The bstop and bresume commands allow you to suspend or resume a job.
% bstop 3421
Job <3421> is being stopped
bstop sends the SIGSTOP signal to sequential jobs and SIGTSTP to parallel jobs. SIGTSTP is sent to a parallel job so the master process can trap the signal and pass it to all the slave processes running on other hosts.
bstop causes the job to be suspended.
To resume the same job, enter:
% bresume 3421
Job <3421> is being resumed
Suspending a job causes your job to go into USUSP state if the job is already started, or into PSUSP state if your job is pending. Resuming a user-suspended job does not put your job into RUN state immediately. If your job was running before the suspension, bresume first puts your job into SSUSP state and then waits for sbatchd to schedule it according to the load conditions.
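The state changes described above can be written as two small lookup tables. This is a sketch for illustration, not LSF code: the names ON_BSTOP, ON_BRESUME and apply_command are invented, and the PSUSP to PEND transition on resume is an assumption, since the text only describes resuming a job that had already been running.

```python
# State changes caused by each command, as described in the text above.
ON_BSTOP = {"RUN": "USUSP", "SSUSP": "USUSP", "PEND": "PSUSP"}
# Assumption: a job suspended while pending returns to PEND when resumed.
ON_BRESUME = {"USUSP": "SSUSP", "PSUSP": "PEND"}

def apply_command(state, command):
    """Return the job state after bstop or bresume (hypothetical model)."""
    table = {"bstop": ON_BSTOP, "bresume": ON_BRESUME}[command]
    # States not listed for a command are left unchanged in this sketch.
    return table.get(state, state)

if __name__ == "__main__":
    state = "RUN"
    for command in ("bstop", "bresume"):
        state = apply_command(state, command)
        print(command, "->", state)
```

A resumed job ends in SSUSP rather than RUN; in this model the later move back to RUN is left to the scheduler and is not represented.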
Each element of a job array runs independently of the others. You can kill, suspend, or resume all elements of the array, or only selected ones.
You can also send an arbitrary signal to all elements of the array, or only selected ones.
Using the job ID of the array operates on all elements of the array. To control particular elements, append the index specification to the job ID. For example:
% bsub -J "myArray[1-50]" sleep 10
Job <212> is submitted to default queue.
% bstop 212
Job <212>: Operation is in progress
% bresume 212
Job <212>: Operation is in progress
% bstop "212[5]"
Job <212[5]> is being stopped
% bstop "212[40-50]"
Job <212[40]> is being stopped
...
Job <212[50]> is being stopped
When a command operates on several elements of the array, the change in the status of the elements may not show up immediately. The system performs the operation in the background while other requests are being serviced.
The job name can also be used to select elements of the array (e.g., bstop -J "myArray[40-50]"). Since multiple job arrays may have the same job name, the command affects all arrays named "myArray".
The bbot and btop commands, which change the queueing position of a job, can operate on individual elements of an array but not on the entire array. For example, use btop "212[5]" to move the element with index 5 in the job array with ID 212 to the first queueing position.
To send an arbitrary signal to your job, use the -s option of the bkill command. You can specify either the signal name or the signal number. On most versions of UNIX, signal names and numbers are listed in the kill(1) or signal(2) manual page. On Windows NT, only customized applications will be able to process job control messages specified with the -s option.
% bkill -s TSTP 3421
Job <3421> is being signaled
This example sends the TSTP signal to job 3421.
Signal numbers are translated across different platforms because different operating systems may have different signal numbering. The real meaning of a specific signal is interpreted by the machine from which the bkill command is issued. For example, if you send signal 18 from a SunOS 4.x host, it means SIGTSTP. If the job is running on an HP-UX host, where SIGTSTP is defined as signal number 25, signal 25 is sent to the job.
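The translation amounts to mapping the number to a signal name on the submitting platform and then back to a number on the execution platform. The sketch below illustrates this with the two values given in the text. The table layout, the platform labels sunos4 and hpux, and the function name are invented for the example and are not part of LSF.

```python
# Signal numbers per platform. Only the two SIGTSTP values come from the
# text above; the platform keys are labels made up for this sketch.
SIGNAL_NUMBERS = {
    "sunos4": {"SIGTSTP": 18},
    "hpux": {"SIGTSTP": 25},
}

def translate_signal(number, from_platform, to_platform):
    """Translate a signal number by way of its name (hypothetical helper)."""
    names = {num: name for name, num in SIGNAL_NUMBERS[from_platform].items()}
    name = names[number]            # KeyError if the number is not in the table
    return SIGNAL_NUMBERS[to_platform][name]

if __name__ == "__main__":
    # Signal 18 sent from SunOS 4.x reaches an HP-UX job as signal 25.
    print(translate_signal(18, "sunos4", "hpux"))
```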
Only the owner of a batch job or an LSF administrator can send signals to a job.
You cannot send arbitrary signals to a pending job; most signals are only valid for running jobs. However, LSF Batch does allow you to kill, suspend and resume pending jobs.
The btop and bbot commands move pending jobs within a queue. btop moves jobs toward the top of the queue, so that they are dispatched before other pending jobs. bbot moves jobs toward the bottom of the queue so that they are dispatched later. The default behaviour is to move the job as close to the top or bottom of the queue as possible. By specifying a position on the command line, you can move a job to an arbitrary position relative to the top or bottom of the queue.
The btop and bbot commands do not allow users to move their own jobs ahead of those submitted by other users; only the dispatch order of the user's own jobs is changed. Only an LSF administrator can move one user's job ahead of another.
The btop and bbot commands have no effect on the job dispatch order when fairshare policies are used.
% bjobs -u all
JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME  SUBMIT_TIME
5308  user2 RUN  normal hostA     hostD     sleep 500 Oct 23 10:16
5309  user2 PEND night  hostA               sleep 200 Oct 23 11:04
5310  user1 PEND night  hostB               myjob     Oct 23 13:45
5311  user2 PEND night  hostA               sleep 700 Oct 23 18:17
% btop 5311
Job <5311> has been moved to position 1 from top.
% bjobs -u all
JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME  SUBMIT_TIME
5308  user2 RUN  normal hostA     hostD     sleep 500 Oct 23 10:16
5311  user2 PEND night  hostA               sleep 700 Oct 23 18:17
5310  user1 PEND night  hostB               myjob     Oct 23 13:45
5309  user2 PEND night  hostA               sleep 200 Oct 23 11:04
Note that user1's job is still in the same position in the queue. user2 cannot use btop to get extra jobs at the top of the queue; when one of user2's jobs moves up in the queue, the rest of that user's jobs move down.
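One way to model this behaviour is to keep the queue positions held by the user fixed and reorder only that user's jobs within them. The sketch below does this for a btop issued by an ordinary user without a position argument. It is inferred from the example above and is not the scheduler's actual algorithm; the function name and the (job ID, user) tuple layout are assumptions.

```python
def btop(queue, job_id):
    """Move job_id ahead of its owner's other pending jobs.

    queue is a list of (job_id, user) tuples in dispatch order.
    Positions held by other users are left untouched.
    """
    owner = dict(queue)[job_id]
    # Positions currently occupied by the owner's jobs.
    slots = [i for i, (_, user) in enumerate(queue) if user == owner]
    own_jobs = [job for job in queue if job[1] == owner]
    moved = (job_id, owner)
    reordered = [moved] + [job for job in own_jobs if job != moved]
    result = list(queue)
    for slot, job in zip(slots, reordered):
        result[slot] = job
    return result

if __name__ == "__main__":
    pending = [(5309, "user2"), (5310, "user1"), (5311, "user2")]
    print(btop(pending, 5311))
    # [(5311, 'user2'), (5310, 'user1'), (5309, 'user2')]
```

The result matches the second bjobs listing: job 5310 keeps its position, and jobs 5311 and 5309 swap.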
The bswitch command switches pending and running jobs from queue to queue. This is useful if you submit a job to the wrong queue, or if the job is suspended because of the queue thresholds or run windows and you would like to resume the job.
% bswitch priority 5309
Job <5309> is switched to queue <priority>
% bjobs -u all
JOBID USER  STAT QUEUE    FROM_HOST EXEC_HOST JOB_NAME  SUBMIT_TIME
5308  user2 RUN  normal   hostA     hostD     sleep 500 Oct 23 10:16
5309  user2 RUN  priority hostA     hostB     sleep 200 Oct 23 11:04
5311  user2 PEND night    hostA               sleep 700 Oct 23 18:17
5310  user1 PEND night    hostB               myjob     Oct 23 13:45
Jobs cannot be modified before they are submitted. The bmod command modifies jobs after they have been submitted. This section discusses modifying pending jobs, dispatched jobs, and job arrays.
For submitted jobs in PEND state, the bmod command is used by the job owner and LSF administrator to modify command line parameters (see `Submitting Batch Jobs' on page 89).
To replace the job command line, use the -Z "newCommand" option. The following example replaces the command line for job 101 with "myjob file":
% bmod -Z "myjob file" 101
To change a specific job parameter, use bmod with the bsub option used to specify the parameter. The specified options replace the submitted options. The following example changes the start time of job 101 to 2:00 a.m.:
% bmod -b 2:00 101
To reset an option to its default submitted value (undo a bmod), append the n character to the option name, and do not include an option value. The following example resets the start time for job 101 back to its original value:
% bmod -bn 101
Resource reservation can be modified after a job has been started to ensure proper reservation and optimal resource utilization.
For dispatched (started) jobs, the bmod command is used by the job owner and LSF administrator to modify resource reservations (see `Resource Reservation' on page 91). A job is usually submitted with a resource reservation for the maximum amount required. This command is used to decrease the reservation, allowing other jobs access to the resource. The following example sets the resource reservation for job 101 to 25MB of memory and 50MB of swap space:
% bmod -R "rusage[mem=25:swp=50]" 101
Individual elements of a job array can be modified after the array is submitted to LSF Batch. For example, this enables individual elements to have different resource requirements or different dependency conditions to control scheduling behavior.
When a job array is submitted, all the elements (jobs) within the array share the same job ID and submission parameters (i.e., resource requirements and submission queue). The bmod command is used by the job owner and LSF administrator to change the resource requirements for individual jobs or the entire array. Use the bswitch command to change the submission queue for individual jobs or the entire array. Both commands use the jobId indexList extension to support job array modifications.
Job array modifications affect only those jobs that have not been dispatched. To make sure the modifications are applied to all specified jobs:
1. Submit the job array on hold using the -H option: bsub -H ...
2. Make the job modifications: bmod, bswitch, ...
3. Release the job: bresume jobId ...
% bmod modification "jobId[indexList]"
% bswitch fromQ toQ "jobId[indexList]"
modification
specifies the resource modification using correct bsub syntax.
fromQ, toQ
specify the queue to which the job was originally submitted, and the queue to which the job is to be switched.
jobId
specifies the job ID of the job array. The double quotes are not required if indexList does not follow the job ID.
indexList
specifies the elements (jobs) to be modified. Elements do not need to occupy contiguous indices.
% bmod -R "mem >= 200" 101
changes the memory requirements for all jobs in the job array to 200MB
% bmod -R "mem >= 500" "101[3, 7]"
changes the memory requirements of jobs 3 and 7 to 500MB
% bmod -w "101[1]" "101[10]"
makes sure job 10 runs after job 1
% bswitch defaultQ priorityQ "101[5]"
changes the submission queue for job 5 from defaultQ to priorityQ
To replace the entire job command line after submission, use bmod with the -Z option, which takes the form bmod -Z "new_command" jobId. Consider the following example:
% bmod -Z "myjob file1" 12223
This command modifies the command for job 12223, changing it to "myjob file1".
To change specific job parameters after submission, use bmod with the option(s) you want to change. The bmod command takes the same options as the bsub command together with a job ID (see `Submitting Batch Jobs' on page 89). The given options replace the existing options of the specified job. For example, the following command changes the start time of job 123 to 2:00 a.m.
% bmod -b 2:00 123
To reset an option to its default value, append the n character to the option name, and do not include an option value. For example:
% bmod -bn 123
Job 123 will be dispatched as soon as possible, ignoring any previously specified start time.
Job arrays can be modified in the same way. Since all jobs share the same set of parameters, modifying the array will affect all jobs in the array.
Most of the operations discussed in this chapter can also be performed using the GUI. The main window of xlsbatch is shown in Figure 4 on page 24.
You can view job details by first selecting a job and then clicking the `Detail' button. The resulting popup window is shown in Figure 12. This gives you the same information as running the bjobs -l command.
Figure 12. Detailed Job Information Popup Window
The `History' button gives you a popup window for job history, which you can otherwise get through the bhist command.
To perform control actions on jobs, such as killing a job or suspending/resuming a job, simply select the job and then click on an action button.
You can also invoke the xbsub window from inside xlsbatch to submit new jobs. If you want to modify a job parameter, simply select the job and click the `Modify' button to get the job modification popup window. Note that this window can also be invoked by running xbmod from the command line. Figure 13 shows the xbmod window. This window is similar to the xbsub window.
Figure 13. Job Modification Window