

7. Tracking Batch Jobs

This chapter describes the commands that report and change the status of your jobs.

Displaying Job Status

The bjobs command reports the status of LSF Batch jobs.

% bjobs 
JOBID USER   STAT   QUEUE     FROM_HOST EXEC_HOST JOB_NAME    SUBMIT_TIME
3926  user1  RUN    priority  hostf     hostc     verilog     Oct 22 13:51
605   user1  SSUSP  idle      hostq     hostc     Test4       Oct 17 18:07
1480  user1  PEND   priority  hostd               generator   Oct 19 18:13
7678  user1  PEND   priority  hostd               verilog     Oct 28 13:08
7679  user1  PEND   priority  hosta               coreHunter  Oct 28 13:12
7680  user1  PEND   priority  hostb               myjob       Oct 28 13:17

The -a option displays jobs that completed or exited in the recent past, along with pending and running jobs.

The -r option displays only running jobs.

The -u username option displays jobs submitted by other users. The special user name all displays jobs submitted by all users.

For example, to find out who is running jobs on which hosts, enter:

% bjobs -u all

You can also find jobs on specific queues or hosts, find jobs submitted by specific projects, and check the status of specific jobs using their job IDs or names. See the bjobs(1) manual page for more information.
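If you post-process the default bjobs listing in a script, one possible approach is to slice each line at the column offsets of the header line, because EXEC_HOST is blank for pending jobs and JOB_NAME may contain spaces. The following Python sketch is an illustration only: the helper name parse_bjobs is hypothetical, and it assumes that the data columns line up with the header as they do in the sample listing above, which may not hold when field values are long.

```python
import re

# Sample text copied from the bjobs listing shown earlier in this section.
SAMPLE = """\
JOBID USER   STAT   QUEUE     FROM_HOST EXEC_HOST JOB_NAME    SUBMIT_TIME
3926  user1  RUN    priority  hostf     hostc     verilog     Oct 22 13:51
605   user1  SSUSP  idle      hostq     hostc     Test4       Oct 17 18:07
1480  user1  PEND   priority  hostd               generator   Oct 19 18:13
"""


def parse_bjobs(text):
    """Return one dict per job, keyed by header name (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Record where each header word starts; data fields are assumed to
    # begin at the same offsets as their header.
    cols = [(m.group(), m.start()) for m in re.finditer(r"\S+", lines[0])]
    jobs = []
    for ln in lines[1:]:
        rec = {}
        for i, (name, start) in enumerate(cols):
            end = cols[i + 1][1] if i + 1 < len(cols) else len(ln)
            rec[name] = ln[start:end].strip()
        jobs.append(rec)
    return jobs


jobs = parse_bjobs(SAMPLE)
# Job IDs of all pending jobs in the sample.
pending = [job["JOBID"] for job in jobs if job["STAT"] == "PEND"]
print(pending)
```

Slicing by offset rather than splitting on white space keeps an empty EXEC_HOST from shifting the remaining fields one column to the left.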

Finding Pending or Suspension Reasons

When you submit a job to LSF Batch, it may be held in the queue before it starts running, and it may be suspended while running. The -p option of the bjobs command displays the reasons a job is pending. Because a job can be pending or suspended for more than one reason, all contributing reasons are reported. For example:

% bjobs -p
7678   user1   PEND   priority   hostD   verilog   Oct 28 13:08
Queue's resource requirements not satisfied:3 hosts;
Unable to reach slave lsbatch server: 1 host;
Not enough job slots: 1 host;


The pending reasons will also mention the number of hosts for each condition. To get the specific host names, along with pending reasons, use the -p and -l options to the bjobs command. For example:

% bjobs -lp
Job Id <7678>, User <user1>, Project <default>, Status <PEND>, Queue <priority>, Command <verilog>
Mon Oct 28 13:08:11: Submitted from host <hostD>,CWD <$HOME>, Re
quested Resources <type==any && swp>35>;
PENDING REASONS:
Queue's resource requirements not satisfied: hostb, hostk, hostv;
Unable to reach slave lsbatch server: hostH;
Not enough job slots: hostF;
SCHEDULING PARAMETERS:
            r15s   r1m  r15m   ut  pg    io   ls    it    tmp    swp    mem
loadSched   -      0.7  1.0    -   4.0   -    -     -     -      -      -
loadStop    -      1.5  2.5    -   8.0   -    -     -     -      -      -

Note

In a cluster with many hosts (100 to 200, for example), listing the host names with every pending reason can be too verbose. In that case, use the bjobs command with the -p option only.
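To work with the host names reported by bjobs -lp in a script, the PENDING REASONS block can be turned into a mapping from each reason to its list of hosts. The Python sketch below is one way to do this, not a supported interface. The function name pending_reasons is hypothetical, and the sketch assumes the block is delimited by the PENDING REASONS: and SCHEDULING PARAMETERS: lines and that each reason fits on one line, as in the example above.

```python
# Sample text copied from the bjobs -lp example above.
SAMPLE = """\
PENDING REASONS:
Queue's resource requirements not satisfied: hostb, hostk, hostv;
Unable to reach slave lsbatch server: hostH;
Not enough job slots: hostF;
SCHEDULING PARAMETERS:
"""


def pending_reasons(text):
    """Map each pending reason to the hosts it applies to (hypothetical helper)."""
    reasons = {}
    in_block = False
    for line in text.splitlines():
        line = line.strip()
        if line == "PENDING REASONS:":
            in_block = True
        elif line == "SCHEDULING PARAMETERS:":
            in_block = False
        elif in_block and ":" in line:
            # The host list follows the last colon and ends with a semicolon.
            reason, _, hosts = line.rstrip(";").rpartition(":")
            reasons[reason.strip()] = [h.strip() for h in hosts.split(",")]
    return reasons


reasons = pending_reasons(SAMPLE)
for reason, hosts in reasons.items():
    print(f"{len(hosts)} host(s): {reason}")
```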

The -l option to the bjobs command displays detailed information about job status and parameters, such as the job's current working directory, parameters specified when the job was submitted, and the time when the job started running.

% bjobs -l 7678
Job Id <7678>, User <user1>, Project <default>, Status <PEND>, Queue <priority>, Command <verilog>
Mon Oct 28 13:08:11: Submitted from host <hostD>,CWD <$HOME>, Re
quested Resources <type==any && swp>35>;
PENDING REASONS:
Queue's resource requirements not satisfied:3 hosts;
Unable to reach slave lsbatch server: 1 host;
Not enough job slots: 1 host;
SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut   pg    io   ls  it  tmp  swp  mem
loadSched  -      0.7  1.0    -    4.0   -    -   -   -    -    -
loadStop   -      1.5  2.5    -    8.0   -    -   -   -    -    -

The loadSched and loadStop thresholds displayed are those that apply to this job. If the job is pending, the thresholds are taken from the queue. If the job has been dispatched, each threshold is the more restrictive of the queue and execution host thresholds for that load index.

Scheduling is also affected by other queue constraints such as RES_REQ, STOP_COND, RESUME_COND, fairshare policy, and others.
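The SCHEDULING PARAMETERS table printed by bjobs -l is easy to read into a script when you want to compare thresholds across jobs. The following Python sketch shows one possible way. The function name parse_thresholds is hypothetical, and the sketch assumes the three-line layout shown in the examples above, with a dash meaning that no threshold is set for that load index.

```python
# Sample text copied from the bjobs -l example above.
SAMPLE = """\
            r15s   r1m  r15m   ut  pg    io   ls    it    tmp    swp    mem
loadSched   -      0.7  1.0    -   4.0   -    -     -     -      -      -
loadStop    -      1.5  2.5    -   8.0   -    -     -     -      -      -
"""


def parse_thresholds(text):
    """Return {row label: {load index: value or None}} (hypothetical helper)."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    indices = rows[0]  # r15s, r1m, ...
    table = {}
    for row in rows[1:]:
        label, values = row[0], row[1:]
        table[label] = {
            idx: None if val == "-" else float(val)
            for idx, val in zip(indices, values)
        }
    return table


thresholds = parse_thresholds(SAMPLE)
# List only the load indices that have a scheduling threshold configured.
configured = {k: v for k, v in thresholds["loadSched"].items() if v is not None}
print(configured)
```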

The -s option displays the reasons a batch job was suspended. Because the load conditions are constantly changing, the reasons for suspension may be out of date. Once the job is suspended it does not resume execution until its scheduling conditions are met.

% bjobs -s
JOBID USER  STAT  QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
605   user1 SSUSP idle  hosta     hostc     Test4    Oct 17 18:07
The host load exceeded the following threshold(s):
Paging rate: pg;
Idle time: it;

In the example above, the job was suspended because the paging rate and interactive idle time on the execution host went above the suspending threshold. Even though the paging rate may have dropped back below the scheduling threshold, the job may remain suspended because of another threshold. The job does not resume until all load indices are within their scheduling thresholds.

Monitoring Resource Consumption of Jobs

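Jobs submitted through the LSF Batch system have the resources they consume monitored while they are running. The -l option of the bjobs command displays the current resource usage of the job. This job-level information includes the CPU time used, the resident memory (MEM) and virtual memory (SWAP) usage, and the process group ID (PGID) and process IDs (PIDs) of the job.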

The job-level resource usage information is updated at a maximum frequency of every SBD_SLEEP_TIME seconds (see `The lsb.params File' on page 193 of the LSF Batch Administrator's Guide for the value of SBD_SLEEP_TIME). The update is done only if the value for the CPU time, resident memory usage, or virtual memory usage has changed by more than 10 percent from the previous update or if a new process or process group has been created.

% bjobs -l 1531
Job Id <1531>, User <user1>, Project <default>, Status <RUN>, Queue
<priority>, Command <example 200>
Fri Dec 27 13:04:14: Submitted from host <hostA>, CWD <$HOME>,
SpecifiedHosts <hostD>;
Fri Dec 27 13:04:19: Started on <hostD>, Execution Home </home/user1
>, Execution CWD </home/user1>;
Fri Dec 27 13:05:00: Resource usage collected.
The CPU time used is 2 seconds.
MEM: 147 Kbytes; SWAP: 201 Kbytes
PGID: 8920; PIDs: 8920 8921 8922
SCHEDULING PARAMETERS:
          r15s   r1m   r15m   ut    pg    io    ls    it    tmp   swp   mem
loadSched -      -     -      -     -     -     -     -     -     -     -
loadStop  -      -     -      -     -     -     -     -     -     -     -
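The update rule for job-level resource usage described above can be sketched in Python as follows. This only illustrates the rule as stated in the text and is not the LSF implementation. The function name usage_changed and the dictionary layout are hypothetical, and the treatment of a zero previous value is an assumption, since the text does not cover that case.

```python
def usage_changed(prev, cur, threshold=0.10):
    """Return True if an update would be sent under the rule in the text."""
    # A new process or process group always triggers an update.
    if set(cur["pids"]) - set(prev["pids"]):
        return True
    if set(cur["pgids"]) - set(prev["pgids"]):
        return True
    # Otherwise compare CPU time, resident memory, and virtual memory.
    for key in ("cpu", "mem", "swap"):
        old, new = prev[key], cur[key]
        if old == 0:
            if new != 0:
                return True  # assumption: any change from zero counts
        elif abs(new - old) / old > threshold:
            return True
    return False


# Values taken from the bjobs -l example above (seconds and Kbytes).
previous = {"cpu": 2, "mem": 147, "swap": 201,
            "pgids": [8920], "pids": [8920, 8921, 8922]}
small = dict(previous, mem=160)   # about 9 percent more resident memory
large = dict(previous, mem=170)   # about 16 percent more resident memory
forked = dict(previous, pids=[8920, 8921, 8922, 8923])

print(usage_changed(previous, small))
print(usage_changed(previous, large))
print(usage_changed(previous, forked))
```

Because small changes are not reported, the figures shown by bjobs -l can lag slightly behind the actual usage of the job.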

Displaying Job History

Sometimes you want to know what has happened to your job since it was submitted. The bhist command displays a summary of the pending, suspended and running time of batch jobs. The -l option of the bhist command prints the time information and a complete history of the scheduling events for each job.

% bhist -l 1531
JobId <1531>, User <user1>, Project <default>, Command <example 200>
Fri Dec 27 13:04:14: Submitted from host <hostA> to Queue <priority
>, CWD <$HOME>, Specified Hosts <hostD>;
Fri Dec 27 13:04:19: Dispatched to <hostD>;
Fri Dec 27 13:04:19: Starting (Pid 8920);
Fri Dec 27 13:04:20: Running with execution home </home/user1>, Exe
cution CWD </home/user1>, Execution Pid <8920>;
Fri Dec 27 13:05:49: Suspended by the user or administrator;
Fri Dec 27 13:05:56: Suspended: Waiting for re-scheduling after bei
ng resumed by user;
Fri Dec 27 13:05:57: Running;
Fri Dec 27 13:07:52: Done successfully. The CPU time used is 28.3 s
econds.
Summary of time in seconds spent in various states by Sat Dec 27 13:07:52 1997
PEND  PSUSP  RUN  USUSP  SSUSP  UNKWN  TOTAL
5     0      205  7      1      0      218
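The figures in the summary line follow from the event times in the history. As a worked illustration, the Python sketch below recomputes them from the timestamps in the example above. The mapping of each event to a job state is an interpretation of the example, the function name time_in_states is hypothetical, and the sketch assumes that all events fall on the same day.

```python
from datetime import datetime


def time_in_states(transitions):
    """Sum the seconds spent in each state.

    transitions is a list of (HH:MM:SS, state) pairs in time order; the
    last pair only marks the end of the final interval.
    """
    fmt = "%H:%M:%S"
    totals = {}
    for (t0, state), (t1, _) in zip(transitions, transitions[1:]):
        delta = datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)
        totals[state] = totals.get(state, 0) + int(delta.total_seconds())
    return totals


# Event times from the bhist -l example; state names are an interpretation.
EVENTS = [
    ("13:04:14", "PEND"),   # submitted
    ("13:04:19", "RUN"),    # dispatched
    ("13:05:49", "USUSP"),  # suspended by the user or administrator
    ("13:05:56", "SSUSP"),  # waiting for re-scheduling after being resumed
    ("13:05:57", "RUN"),    # running again
    ("13:07:52", "DONE"),   # done successfully
]

totals = time_in_states(EVENTS)
print(totals, sum(totals.values()))
```

The two running intervals (90 and 115 seconds) add up to the 205 seconds shown under RUN, and all intervals together give the TOTAL of 218 seconds.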

The -J job_name option of the bhist command displays the history of all LSF Batch jobs with the specified job name. Job names are assigned with the -J job_name option of the bsub command.

LSF keeps job history information after the job exits, so you can look at the history of jobs that completed in the past. The length of the history depends on how often the LSF administrator cleans up the log files.

By default, bhist only displays job history from the current event log file. The -n option to the bhist command allows users to display the history of jobs that completed a long time ago, and are no longer listed in the active event log.

The LSF Batch system periodically backs up and prunes the job history log. The -n num_logfiles option tells the bhist command to search through the specified number of log files instead of only searching the current log file. Log files are searched in reverse time order; for example, the command bhist -n 3 searches the current event log file and then the two most recent backup files.

Viewing Chronological History

By default the bhist command displays information from the job event history file, lsb.events, on a per job basis. The -t option to bhist(1) can be used to display the events chronologically, instead of grouping all events for each job. The -T option allows for selecting only those events within a given time range.

For example, the following displays all events that occurred between 14:00 and 14:30 on a given day:

% bhist -t -T 14:00,14:30
Wed Oct 22 14:01:25: Job <1574> done successfully;
Wed Oct 22 14:03:09: Job <1575> submitted from host to Queue ,
CWD , User , Project , Command , Requested
Resources ;
Wed Oct 22 14:03:18: Job <1575> dispatched to ;
Wed Oct 22 14:03:18: Job <1575> starting (Pid 210);
Wed Oct 22 14:03:18: Job <1575> running with execution home , E
xecution CWD , Execution Pid <210>;
Wed Oct 22 14:05:06: Job <1577> submitted from host to Queue,
CWD , User , Project , Command , Requested
Resources ;
Wed Oct 22 14:05:11: Job <1577> dispatched to ;
Wed Oct 22 14:05:11: Job <1577> starting (Pid 429);
Wed Oct 22 14:05:12: Job <1577> running with execution home, Ex
ecution CWD , Execution Pid <429>;
Wed Oct 22 14:08:26: Job <1578> submitted from host to Queue, C
WD , User , Project , Command;
Wed Oct 22 14:10:55: Job <1577> done successfully;
Wed Oct 22 14:16:55: Job <1578> exited;
Wed Oct 22 14:17:04: Job <1575> done successfully;

Checking Partial Job Output

The output from an LSF Batch job is normally not available until the job is finished. However, LSF Batch provides the bpeek command for you to look at the output the job has produced so far. By default, bpeek shows the output from the most recently submitted job; you can also select the job by queue or execution host, or specify the job ID or job name on the command line.

% bpeek 1234
<< output from stdout >>
Starting phase 1
Phase 1 done
Calculating new parameters
...

Only the job owner can use bpeek to see job output. The bpeek command will not work on a job running under a different user account.

You can use this command to check if your job is behaving as you expected and kill the job if it is running away or producing unusable results. This could save you time.

Tracking Job Arrays

The status of individual elements of a job array can be viewed using the bjobs command or the xlsbatch GUI. The JOBID field is the same for all elements of the array, and the JOB_NAME field has the index of the element appended to it, that is, jobName[index]. The following output shows the result of submitting a job array and viewing it through bjobs.

% bsub -J "myArray[1-5]" sleep 10
Job <212> is submitted to default queue.
% bjobs
JOBID  USER   STAT   QUEUE     FROM_HOST  EXEC_HOST   JOB_NAME    SUBMIT_TIME
212    user1  RUN    default   hostA      hostB       myArray[1]  Jul 25 12:45
212    user1  PEND   default   hostA                  myArray[2]  Jul 25 12:45
212    user1  PEND   default   hostA                  myArray[3]  Jul 25 12:45
212    user1  PEND   default   hostA                  myArray[4]  Jul 25 12:45
212    user1  PEND   default   hostA                  myArray[5]  Jul 25 12:45

To display summary information about the number of jobs in the different states in the array, use the -A option of bjobs as follows.

% bjobs -A
JOBID  NAME               OWNER NJOBS PEND RUN DONE EXIT SSUSP USUSP PSUSP
215    testArray[1-100:2] user1 50    40   5   1    0    0     0     4
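The NJOBS column follows from the index specification in the array name: testArray[1-100:2] has 50 elements, which suggests that the number after the colon is a step. The Python sketch below expands an index specification under that reading. The function name expand_index_list is hypothetical, and the start-end:step interpretation is inferred from the example rather than taken from a syntax definition.

```python
def expand_index_list(spec):
    """Expand a spec such as "1-100:2" or "3, 7" into a list of indices."""
    indices = []
    for part in spec.split(","):
        rng, _, step = part.strip().partition(":")
        start, _, end = rng.partition("-")
        first = int(start)
        last = int(end) if end else first   # a single index has no range
        stride = int(step) if step else 1   # default step of 1 is assumed
        indices.extend(range(first, last + 1, stride))
    return indices


elements = expand_index_list("1-100:2")
print(len(elements), elements[:3], elements[-1])

# The per-state counts in the bjobs -A line above add up to NJOBS.
state_counts = {"PEND": 40, "RUN": 5, "DONE": 1, "EXIT": 0,
                "SSUSP": 0, "USUSP": 0, "PSUSP": 4}
print(sum(state_counts.values()) == len(elements))
```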

The history of jobs in the array can be viewed using the bhist command. When the jobId of an array is specified, the history of each element is displayed.

% bhist 212

The history of specific elements can be displayed by appending an index specification after the job ID. For example:

% bhist "212[5]"

Displaying Queue and Host Status

The bqueues and bhosts commands display the number of jobs in a queue or dispatched to a host. For more information on these commands see `Batch Queues' on page 67 and `Batch Hosts' on page 79.

Job Controls

After a job is submitted, you can control it by killing it, suspending it, or resuming it.

Killing Jobs

The bkill command cancels pending batch jobs and sends signals to running jobs. By default, on UNIX, bkill sends the SIGKILL signal to running jobs. For example, to kill job 3421 enter:

% bkill 3421
Job <3421> is being terminated

Before SIGKILL is sent, SIGINT and SIGTERM are sent to give the job a chance to catch the signals and clean up. The signals are forwarded from the mbatchd to the sbatchd. The sbatchd waits for the job to exit before reporting the status. Because of these delays, for a short period of time after the bkill command has been sent, bjobs may still report that the job is running.

On Windows NT, job control messages replace the SIGINT and SIGTERM signals, and termination is implemented by the TerminateProcess() system call.

Suspending and Resuming Jobs

The bstop and bresume commands allow you to suspend or resume a job.

To suspend job 3421, enter:

% bstop 3421
Job <3421> is being stopped

bstop sends the SIGSTOP signal to sequential jobs and SIGTSTP to parallel jobs. SIGTSTP is sent to a parallel job so the master process can trap the signal and pass it to all the slave processes running on other hosts.

bstop causes the job to be suspended.

To resume the same job, enter:

% bresume 3421
Job <3421> is being resumed

Suspending a job causes your job to go into USUSP state if the job is already started, or to go into PSUSP state if your job is pending. Resuming a user suspended job does not put your job into RUN state immediately. If your job was running before the suspension, bresume first puts your job into SSUSP state and then waits for sbatchd to schedule it according to the load conditions.
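The state changes described above can be summarized as a small lookup table. The Python sketch below models only the transitions mentioned in this section. It is not a description of the scheduler's internals, the function name apply_command is hypothetical, and the PSUSP to PEND transition on bresume is an assumption that the text does not state explicitly.

```python
# (command, current state) -> new state
TRANSITIONS = {
    ("bstop", "RUN"): "USUSP",      # started job suspended by the user
    ("bstop", "PEND"): "PSUSP",     # pending job suspended by the user
    ("bresume", "USUSP"): "SSUSP",  # waits for scheduling by load conditions
    ("bresume", "PSUSP"): "PEND",   # assumption: returns to the pending state
}


def apply_command(command, state):
    """Return the state after the command; unknown pairs leave it unchanged."""
    return TRANSITIONS.get((command, state), state)


state = "RUN"
for command in ("bstop", "bresume"):
    state = apply_command(command, state)
    print(command, "->", state)
```

The job returns from SSUSP to RUN only when it is scheduled again according to the load conditions, which is why bjobs may not show RUN immediately after bresume.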

Controlling Job Arrays

Each element of a job array is run independently of the others. You can kill, suspend, or resume all elements of the array, or only selected ones.

You can send an arbitrary signal to all elements of the array, or only selected ones.

Using the job ID of the array operates on all elements of the array. Selecting particular elements to control requires appending the index specification after the job ID. For example:

% bsub -J "myArray[1-50]" sleep 10
Job <212> is submitted to default queue.
% bstop 212
Job <212>: Operation is in progress
% bresume 212
Job <212>: Operation is in progress
% bstop "212[5]"
Job <212[5]> is being stopped
% bstop "212[40-50]"
Job <212[40]> is being stopped
...
Job <212[50]> is being stopped

Note

When sending a command which operates on several elements of the array, the change in the status of the elements may not show up immediately. The system ensures the operation takes place in the background while other requests are being serviced.

The job name can also be used to select elements of the array (for example, bstop -J "myArray[40-50]"). Because multiple job arrays may have the same job name, the command affects all arrays with the name "myArray".

The bbot and btop commands can change the queuing position of individual elements of an array, but they cannot operate on the entire array. For example, btop "212[5]" moves the element with index 5 in the job array with ID 212 to the first queuing position.

Sending an Arbitrary Signal to a Job

To send an arbitrary signal to your job, use the -s option of the bkill command. You can specify either the signal name or the signal number. On most versions of UNIX, signal names and numbers are listed in the kill(1) or signal(2) manual page. On Windows NT, only customized applications will be able to process job control messages specified with the -s option.

% bkill -s TSTP 3421
Job <3421> is being signaled

This example sends the TSTP signal to job 3421.

Note

Signal numbers are translated across different platforms because different operating systems may have different signal numbering. The real meaning of a specific signal is interpreted by the machine from which the bkill command is issued. For example, signal 18 sent from a SunOS 4.x host means SIGTSTP. If the job is running on an HP-UX host, where SIGTSTP is defined as signal number 25, signal 25 is sent to the job.
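The translation in this note amounts to looking up the signal name on the submitting platform and then the number for that name on the execution platform. The Python sketch below illustrates the idea using only the two values given in the note. The table and the function name translate_signal are hypothetical and do not reflect how LSF stores this information.

```python
# Signal numbers taken from the note above; other entries are omitted.
SIGNAL_NUMBERS = {
    "SunOS 4.x": {"SIGTSTP": 18},
    "HP-UX": {"SIGTSTP": 25},
}


def translate_signal(number, from_platform, to_platform):
    """Map a signal number on one platform to the same signal on another."""
    by_number = {num: name
                 for name, num in SIGNAL_NUMBERS[from_platform].items()}
    name = by_number[number]  # meaning on the host where bkill was issued
    return SIGNAL_NUMBERS[to_platform][name]


print(translate_signal(18, "SunOS 4.x", "HP-UX"))
```

Specifying the signal by name, as in bkill -s TSTP, avoids having to know the numbering on either host.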

Only the owner of a batch job or an LSF administrator can send signals to a job.

You cannot send arbitrary signals to a pending job; most signals are only valid for running jobs. However, LSF Batch does allow you to kill, suspend and resume pending jobs.

Moving Jobs Within and Between Queues

The btop and bbot commands move pending jobs within a queue. btop moves jobs toward the top of the queue, so that they are dispatched before other pending jobs. bbot moves jobs toward the bottom of the queue so that they are dispatched later. The default behaviour is to move the job as close to the top or bottom of the queue as possible. By specifying a position on the command line, you can move a job to an arbitrary position relative to the top or bottom of the queue.

The btop and bbot commands do not allow users to move their own jobs ahead of those submitted by other users; only the dispatch order of the user's own jobs is changed. Only an LSF administrator can move one user's job ahead of another user's.

Note

The btop and bbot commands have no effect on the job dispatch order when fairshare policies are used.

% bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME   SUBMIT_TIME
5308  user2 RUN   normal   hostA      hostD      sleep 500  Oct 23 10:16
5309  user2 PEND  night    hostA                 sleep 200  Oct 23 11:04
5310  user1 PEND  night    hostB                 myjob      Oct 23 13:45
5311  user2 PEND  night    hostA                 sleep 700  Oct 23 18:17
% btop 5311
Job <5311> has been moved to position 1 from top.
% bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME   SUBMIT_TIME
5308  user2 RUN   normal   hostA      hostD      sleep 500  Oct 23 10:16
5311  user2 PEND  night    hostA                 sleep 700  Oct 23 18:17
5310  user1 PEND  night    hostB                 myjob      Oct 23 13:45
5309  user2 PEND  night    hostA                 sleep 200  Oct 23 11:04

Note that user1's job is still in the same position on the queue. User2 cannot use btop to get extra jobs at the top of the queue; when one of his jobs moves up on the queue, the rest of his jobs move down.
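The effect shown in this example can be modelled by reordering only the queue positions that already belong to the job's owner. The Python sketch below reproduces the example under that reading. It models the behaviour described for ordinary users, not the actual scheduler, and the function name btop is reused here purely for illustration.

```python
def btop(queue, job_id):
    """Move job_id ahead of its owner's other pending jobs.

    queue is a list of (job_id, user) pairs in dispatch order. Positions
    held by other users are left untouched. Returns a new list.
    """
    owner = next(user for jid, user in queue if jid == job_id)
    # Positions in the queue currently occupied by the owner's jobs.
    slots = [i for i, (_, user) in enumerate(queue) if user == owner]
    own_jobs = [queue[i] for i in slots]
    target = next(job for job in own_jobs if job[0] == job_id)
    reordered = [target] + [job for job in own_jobs if job[0] != job_id]
    result = list(queue)
    for slot, job in zip(slots, reordered):
        result[slot] = job
    return result


# Pending jobs in the night queue from the example above.
pending = [(5309, "user2"), (5310, "user1"), (5311, "user2")]
print(btop(pending, 5311))
```

Job 5310, which belongs to user1, keeps its position, while jobs 5311 and 5309 exchange the two positions held by user2.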

The bswitch command switches pending and running jobs from queue to queue. This is useful if you submit a job to the wrong queue, or if the job is suspended because of the queue thresholds or run windows and you would like to resume the job.

% bswitch priority 5309
Job <5309> is switched to queue <priority>
% bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME   SUBMIT_TIME
5308  user2 RUN   normal   hostA      hostD      sleep 500  Oct 23 10:16
5309  user2 RUN   priority hostA      hostB      sleep 200  Oct 23 11:04
5311  user2 PEND  night    hostA                 sleep 700  Oct 23 18:17
5310  user1 PEND  night    hostB                 myjob      Oct 23 13:45

Job Modification

Jobs cannot be modified before they are submitted; the bmod command modifies jobs after submission. This section discusses the following topics:

· Submitted Job Modification

· Dispatched Job Modification

· Job Array Modification

Submitted Job Modification

For submitted jobs in PEND state, the bmod command is used by the job owner and LSF administrator to modify command line parameters (see `Submitting Batch Jobs' on page 89).

To replace the job command line, use the -Z "newCommand" option. The following example replaces the command line for job 101 with "myjob file":

% bmod -Z "myjob file" 101

To change a specific job parameter, use bmod with the bsub option used to specify the parameter. The specified options replace the submitted options. The following example changes the start time of job 101 to 2:00 a.m.:

% bmod -b 2:00 101

To reset an option to its default submitted value (undo a bmod), append the n character to the option name, and do not include an option value. The following example resets the start time for job 101 back to its original value:

% bmod -bn 101

Resource reservation can be modified after a job has been started to ensure proper reservation and optimal resource utilization.

Dispatched Job Modification

For dispatched (started) jobs, the bmod command is used by the job owner and LSF administrator to modify resource reservations (see `Resource Reservation' on page 91). A job is usually submitted with a resource reservation for the maximum amount required. This command is used to decrease the reservation, allowing other jobs access to the resource. The following example sets the resource reservation for job 101 to 25MB of memory and 50MB of swap space:

% bmod -R "rusage[mem=25:swp=50]" 101

Individual elements of a job array can be modified after the array is submitted to LSF Batch. For example, this enables individual elements to have different resource requirements or different dependency conditions to control scheduling behavior.

Job Array Modification

When a job array is submitted, all the elements (jobs) within the array share the same job ID and submission parameters (i.e., resource requirements and submission queue). The bmod command is used by the job owner and LSF administrator to change the resource requirements for individual jobs or the entire array. Use the bswitch command to change the submission queue for individual jobs or the entire array. Both commands use the jobId indexList extension to support job array modifications.

Note

Job array modifications affect only those jobs that have not been dispatched. To make sure the modifications are applied to all specified jobs:

1. Submit the job array on hold using the -H option, bsub -H ...

2. Make the job modifications, bmod, bswitch, ...

3. Release the job, bresume jobId ...

Syntax

% bmod modification "jobId[indexList]"

% bswitch fromQ toQ "jobId[indexList]"

modification
specifies the resource modification using correct bsub syntax.
fromQ, toQ
specifies the queue to which the job was originally submitted, and the queue to which the job is to be switched.
jobId
specifies the job ID of the job array. The double quotes are not required if indexList does not follow the job ID.
indexList
specifies the elements (jobs) to be modified. Elements do not need to occupy continuous indices.

Examples

% bmod -R "mem >= 200" 101
changes the memory requirements for all jobs in the job array to 200MB
% bmod -R "mem >= 500" "101[3, 7]"
changes the memory requirements of jobs 3 and 7 to 500MB
% bmod -w "101[1]" "101[10]"
makes sure job 10 runs after job 1
% bswitch defaultQ priorityQ "101[5]"
changes the submission queue for job 5 from defaultQ to priorityQ

To replace the entire job command line after submission, use bmod with the -Z option, which takes the form bmod -Z "new_command" jobId. Consider the following example:

% bmod -Z "myjob file1" 12223

This command modifies the command for job 12223, changing it to "myjob file1".

To change specific job parameters after submission, use bmod with the options you want to change. The bmod command takes the same options as the bsub command together with a job ID (see `Submitting Batch Jobs' on page 89). The given options replace the existing options of the specified job. For example, the following command changes the start time of job 123 to 2:00 a.m.:

% bmod -b 2:00 123

To reset an option to its default value, append the n character to the option name, and do not include an option value. For example:

% bmod -bn 123

Job 123 will be dispatched as soon as possible, ignoring any previously specified start time.

Job arrays can be modified in the same way. Since all jobs share the same set of parameters, modifying the array will affect all jobs in the array.

Job Tracking and Manipulation Using the GUI

Most of the operations discussed in this chapter can also be performed using the GUI. The main window of xlsbatch is shown in Figure 4 on page 24.

To view job details, select a job and then click the `Detail' button. The resulting popup window is shown in Figure 12. It gives the same information as the bjobs -l command.

Figure 12. Detailed Job Information Popup Window

The `History' button opens a popup window showing the job history, the same information that the bhist command provides.

To perform control actions on jobs, such as killing a job or suspending/resuming a job, simply select the job and then click on an action button.

You can also invoke the xbsub window from inside xlsbatch to submit new jobs. To modify a job parameter, select the job and click the `Modify' button to open the job modification popup window. This window can also be invoked by running xbmod from the command line. Figure 13 shows the xbmod window, which is similar to the xbsub window.

Figure 13. Job Modification Window




doc@platform.com

Copyright © 1994-1998 Platform Computing Corporation.
All rights reserved.