

6. Submitting Batch Jobs

This chapter describes how to use the bsub command. Command options are divided into groups with related functions. Topics covered in this chapter are:

Input and Output
Resource Requirements
Resource Reservation
Host Selection
Host Preference
Resource Limits
Pre-Execution Commands
Job Dependencies
Remote File Access
Start and Termination Time
Parallel Jobs
Job Arrays
Specifying a Share Account
Re-initializing Job Environment on the Execution Host
Other bsub Options
Job Scripts
Submitting Jobs Using the Job Submission GUI

The options to the bsub command related to job checkpointing and migration are described in `Checkpointing and Migration' on page 165.

Input and Output

When a batch job completes or exits, LSF Batch by default sends you a job report by electronic mail. The report includes the standard output (stdout) and error output (stderr) of the job. The output from stdout and stderr is merged together in the order printed, as if the job were run interactively. The default standard input (stdin) file is the null device.

The null device is /dev/null.

If you want mail sent to another user, use the -u username option to the bsub command. Mail associated with the job will be sent to the named user instead of to you.

If you do not want output to be sent by mail, you can specify stdout and stderr files. You can also specify the standard input file if the job needs to read input from stdin. For example:

% bsub -q night -i job_in -o job_out -e job_err myjob

submits myjob to the night queue. The job reads its input from file job_in. Standard output is stored in file job_out, and standard error is stored in file job_err. If you specify a -o outfile argument and do not specify a -e errfile argument, the standard output and error are merged and stored in outfile.

The output file created by the -o option to the bsub command normally contains job report information as well as the job output. This information includes the submitting user and host, the execution host, the CPU time (user plus system time) used by the job, and the exit status. If you want to separate the job report information from the job output, use the -N option to specify that the job report information should be sent by email.
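For example, the following submission (myjob stands for your own command) stores the job output in the job_out file while the job report is mailed separately:

% bsub -o job_out -N myjob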

The output files specified by the -o and -e options are created on the execution host. See `Remote File Access' on page 101 for an example of copying the output file back to the submission host if the job executes on a file system that is not shared between the submission and execution hosts.

Resource Requirements

If you need to explicitly specify resource requirements for your job, use the -R option to the bsub command. For example:

% bsub -R "swp > 15 && hpux order[cpu]" myjob

runs myjob on an HP-UX host that is lightly loaded in terms of CPU utilization and has at least 15 megabytes of swap space available. See `Resource Requirement Strings' on page 46 for a complete discussion of resource requirements.

You do not have to specify resource requirements every time you submit a job. The LSF administrator may have already configured the resource requirements for your jobs, or you can put your executable name together with its resource requirements into your personal remote task list. The bsub command automatically uses the resource requirements of the job from the remote task lists. See `Managing Your Task List' on page 53 for more information about displaying task lists and putting tasks into your remote task list.

Resource Reservation

When a job is dispatched, the system assumes that the resources that the job consumes will be reflected in the load information. However, many jobs do not consume the resources they require when they first start. Instead, they typically use the resources over a period of time. For example, a job requiring 100MB of swap space is dispatched to a host having 150MB of available swap space. The job starts off allocating 5MB, gradually increasing the amount consumed to 100MB over a 30-minute period. During this period, another job requiring more than 50MB of swap space should not be started on the same host, to avoid overcommitting the resource.

When submitting a job, you can specify the amount of resources to be reserved through the resource usage (rusage) section of the resource requirement string argument to the bsub command. The syntax of the resource reservation in the rusage section of the resource requirement string is:

res=value[:res=value]...[:res=value][:duration=value][:decay=value]

The res parameter can be any load index. The value parameter is the initial reserved amount. If res or value is not given, the default is to not reserve that resource.

The duration parameter is the time period within which the specified resources should be reserved. It is specified in minutes by default. If the value is followed by the letter 'h', it is specified in hours. For example, 'duration=30' and 'duration=2h' specify a duration of 30 minutes and two hours respectively. If duration is not specified, the default is to reserve the total amount for the lifetime of the job.

The decay parameter indicates how the reserved amount should decrease over the duration. A value of 1, 'decay=1', indicates that the system should linearly decrease the amount reserved over the duration. The default decay value is 0, which causes the total amount to be reserved for the entire duration. Values other than 0 or 1 are unsupported. If duration is not specified, decay is ignored.

When deciding whether to schedule a job on a host, the LSF Batch system considers the reserved resources of jobs that have previously started on that host. For each load index, the amount reserved by all jobs on that host is summed up and subtracted from (or added to, if the index is increasing) the current value of the resource as reported by the LIM, to get the amount available for scheduling new jobs:

available amount = current value - reserved amount for all jobs

For example:

% bsub -R "rusage[tmp=30:duration=30:decay=1]" myjob

will reserve 30MB of /tmp space for the job. As the job runs, the amount reserved will decrease at approximately 1 megabyte/minute such that the reserved amount is 0 after 30 minutes.

The queue-level resource requirement parameter RES_REQ may also specify the resource reservation. If a queue reserves a certain amount of a resource, you cannot use the -R option of the bsub command to reserve a greater amount of that resource. For example, if the output of the bqueues -l command contains:

RES_REQ: rusage[mem=40:swp=80:tmp=100]

the following submission will be rejected, since the requested amount of certain resources exceeds the queue's specification:

% bsub -R "rusage[mem=50:swp=100]" myjob

The amount of resources reserved on each host can be viewed through the -l option of the bhosts command.
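For example, the following command (hostA stands for a host in your cluster) displays detailed information for a single host, including the amount of each resource currently reserved:

% bhosts -l hostA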

Host Selection

If you want to restrict the set of candidate hosts for running your batch job, use the -m option to bsub.

% bsub -q idle -m "hostA hostD hostB" myjob

This command submits myjob to the idle queue and tells LSF Batch to choose one host from hostA, hostD and hostB to run the job. All other LSF Batch scheduling conditions still apply, so the selected host must be eligible to run the job.

If you have applications that need specific resources, it is more flexible to create a new Boolean resource and configure that resource for the appropriate hosts in the LSF cluster. This must be done by the LSF administrator. If you specify a host list using the -m option to bsub, you must change the host list every time you add a new host that supports the desired resources. By using a Boolean resource, the LSF administrator can add, move or remove resources without forcing users to learn about changes to resource configuration.

Host Preference

When several hosts can satisfy the resource requirements of a job, the hosts are ordered by load. However, in certain situations it may be desirable to override this behaviour to give preference to specific hosts, even if they are more heavily loaded.

For example, you may have licensed software which runs on different groups of hosts, but prefer to run on a particular host group because the jobs will finish faster, thereby freeing the software license to be used by other jobs.

Another situation arises in clusters consisting of dedicated batch servers and desktop machines which can also run jobs when no user is logged in. You may prefer to run on the batch servers and only use the desktop machines if no server is available.

The -m option of the bsub command allows you to specify preference by using '+' after the hostname. The special hostname, others, can be used to refer to all the hosts that are not explicitly listed. For example:

% bsub -R "solaris && mem> 10" -m "hostD+ others" myjob

will select all solaris hosts having more than 10 megabytes of memory available. If host 'hostD' satisfies this criteria, it will be picked over any other host which otherwise meets the same criteria. If hostD does not satisfy the criteria, the least loaded host among the others will be selected. All the other hosts are considered as a group and are ordered by load.

You can specify different levels of preference by specifying a number after the '+'. The larger the number, the higher the preference for that host or host group. For example:

% bsub -m "groupA+2 groupB+1 groupC" myjob

gives first preference to hosts in groupA, second preference to hosts in groupB and last preference to those in groupC. The ordering within a group is still determined by the load. You can use the bmgroup command to display the host groups configured in the system.

Note

A queue may also define the host preference for jobs via the HOSTS parameter. The queue specification is ignored if a job specifies its own preference.

You can also exclude a host by specifying a resource requirement using the hname resource:

% bsub -R "hname!=hostb && type==sgi6" myjob

Resource Limits

Resource limits are constraints you or your LSF administrator can specify to limit the use of resources. Jobs that consume more than the specified amount of a resource are signalled or have their priority lowered.

Resource limits can be specified either at the queue level by your LSF administrator or at the job level when you submit a job. Resource limits specified at the queue level are hard limits, while those specified with job submission are soft limits. See the setrlimit(2) manual page for the concepts of hard and soft limits.

The following resource limits can be specified to the bsub command:

-c cpu_limit[/host_spec]
Set the soft CPU time limit to cpu_limit for this batch job. The default is no limit. This option is useful for preventing erroneous jobs from running away, or to avoid using up too many resources. A SIGXCPU signal is sent to all processes belonging to the job when it has accumulated the specified amount of CPU time. If the job has no signal handler for SIGXCPU, this causes it to be killed. LSF Batch keeps track of the CPU time used by all processes of the job.
cpu_limit is in the form [hour:]minute, where minute can be greater than 59. So, 3.5 hours can either be specified as 3:30 or 210. The CPU limit is scaled by the host CPU factors of the submitting and execution hosts. This is done so that the job does approximately the same amount of processing for a given CPU limit, even if it is sent to a host with a faster or slower CPU. For example, if a job is submitted from a host with a CPU factor of 2 and executed on a host with a CPU factor of 3, the CPU time limit is multiplied by 2/3 because the execution host can do the same amount of work as the submission host in 2/3 of the time.
The optional host_spec specifies a host name or a CPU model name defined by LSF. The lsinfo command displays CPU model information. If host_spec is not given, the CPU limit is scaled based on the DEFAULT_HOST_SPEC shown by the bparams -l command. (If DEFAULT_HOST_SPEC is not defined, the fastest batch host in the cluster is used as the default.) If host_spec is given, the appropriate CPU scaling factor for the specified host or CPU model is used to adjust the actual CPU time limit at the execution host. The following example specifies that myjob can run for 10 minutes on a DEC3000 host, or the corresponding time on any other host:
% bsub -c 10/DEC3000 myjob

-W run_limit[/host_spec]
Set the wall-clock run time limit of this batch job. The default is no limit. If the accumulated time the job has spent in the RUN state exceeds this limit, the job is sent a USR2 signal. If the job does not terminate within 10 minutes after being sent this signal, it is killed. run_limit and host_spec have the same format as the argument to the bsub -c option.
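For example, the following submission (myjob stands for your own command) sets a wall-clock run time limit of 3 hours and 30 minutes:
% bsub -W 3:30 myjob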


-F file_limit
Set a per-process (soft) file size limit for each process that belongs to this batch job. If a process of this job attempts to write to a file such that the file size would increase beyond file_limit kilobytes, the kernel sends that process a SIGXFSZ signal. This condition normally terminates the process, but may be caught. The default is no soft limit.


-D data_limit
Set a per-process (soft) data segment size limit for each process that belongs to this batch job. An sbrk() or malloc() call to extend the data segment beyond data_limit kilobytes returns an error. The default is no soft limit.


-S stack_limit
Set a per-process (soft) stack segment size limit for each process that belongs to this batch job. An attempt to extend the stack segment beyond stack_limit kilobytes causes the process to be terminated. The default is no soft limit.


-C core_limit
Set a per-process (soft) core file size limit for each process that belongs to this batch job. On some systems, no core file is produced if the image for the process is larger than core_limit kilobytes. On other systems only the first core_limit kilobytes of the image are dumped. The default is no soft limit.


-M mem_limit
Set the per-process (soft) process resident set size limit to mem_limit kilobytes for all processes that belong to this batch job. Exceeding this limit when free physical memory is in short supply results in a low scheduling priority being assigned to the process. That is, the process is reniced. The default is no soft limit. On HP-UX and Sun Solaris 2.x, a resident set size limit cannot be set, so this option has no effect.

Pre-Execution Commands

Some batch jobs require resources that LSF does not directly support. For example, a batch job may need to reserve a tape drive or check for the availability of a software license.

The -E pre_exec_command option to the bsub command specifies an arbitrary command to run before starting the batch job. When LSF Batch finds a suitable host on which to run a job, the pre-execution command is executed on that host. If the pre-execution command runs successfully, the batch job is started.

An alternative to using the -E pre_exec_command option is for the LSF administrator to set up a queue level pre-execution command. See `Queue-Level Pre-/Post-Execution Commands' on page 224 of the LSF Batch Administrator's Guide for more information.

By default, the pre-execution command is run under the same user ID, environment, and home and working directories as the batch job. For queue-level pre-execution commands, you can specify a different user ID by defining the LSB_PRE_POST_EXEC_USER variable. If the pre-execution command is not in your normal execution path, the full path name of the command must be specified.

For parallel batch jobs, the pre-execution command is run on the first selected host.

The pre-execution command returns information to LSF Batch using its exit status. If the pre-execution command exits with non-zero status, the batch job is not dispatched. The job goes back to the PEND state, and LSF Batch tries to dispatch another job to that host. The next time LSF Batch tries to dispatch jobs, this process is repeated.

LSF Batch assumes that the pre-execution command runs without side effects. For example, if the pre-execution command reserves a software license or other resource, you must take care not to reserve the same resource more than once for the same batch job.

The following example shows a batch job that requires a tape drive. The tapeCheck program is a site-specific program that exits with status zero if the specified tape drive is ready, and one otherwise:

% bsub -E "/usr/local/bin/tapeCheck /dev/rmt0l" myjob

Job Dependencies

Some batch jobs depend on the results of other jobs. For example, a series of jobs could process input data, run a simulation, generate images based on the simulation output, and finally, record the images on a high-resolution film output device. Each step can only be performed when the previous step completes and all subsequent steps must be aborted if any step fails.

The -w depend_cond option to the bsub command specifies a dependency condition, which is a logical expression based on the execution states of preceding batch jobs. When the depend_cond expression evaluates to TRUE, the batch job can be started. Complex conditions can be written using the logical operators `&&' (AND), `||' (OR), `!' (NOT) and parentheses `()'.

If any one of the batch jobs depended upon is not found, bsub fails and the job is not submitted.

Inter-job dependency scheduling can be based on a specific job exit status, so that a suitable recovery job can be initiated in case of specific types of job failures. The exit condition in the dependency string (specified with the -w option of bsub) can be triggered by particular exit codes of the job depended upon. Relational operators can be used when a job needs to be triggered by a range of exit codes.

If there is a space character, a logical operator, or parentheses in the expression string, the string must be enclosed in single or double quotes (' or ") to prevent the shell from interpreting the special characters.

Batch jobs are identified by job ID number or job name. The job ID number is displayed by the bsub command when the job is submitted. The job name is a string specified by the -J job_name option.

In job dependency expressions, numeric job names must be enclosed in quotes.

Note that a numeric job name should be doubly quoted, e.g. -w "'210'", since the UNIX shell treats -w "210" the same as -w 210.

Job names refer to jobs submitted by the same user. If more than one of your jobs has the same name, the condition is tested on the last job submitted with that name.

A wildcard character `*' can be specified at the end of a job name to indicate all jobs matching the name. For example, jobA* will match jobA, jobA1, jobA_test, jobA.log etc. There must be at least one match.

The conditions that can be tested are:

started({jobID | jobName})
If the specified batch job has started running or has run to completion, the condition is TRUE; that is, the job is not in the PEND or PSUSP state, and also is not currently running the pre-execution command if the bsub -E option was specified.
done({jobID | jobName})
If the specified batch job has completed successfully and is in the DONE state, the condition is TRUE. Otherwise, it is FALSE.
exit({jobID | jobName})
If the specified batch job has terminated abnormally and is in the EXIT state, the condition is TRUE. Otherwise, it is FALSE.
exit({jobID | jobName}, [op] code)
If the specified job has terminated with the exit code specified by code, or with an exit code satisfying the relationship expressed by op code, the condition is TRUE. Otherwise, it is FALSE. When a batch job is killed while pending, it is assigned a special exit code of 512.
The op variable may be any of the relational operators `>', `>=', `<', `<=', `==', `!='. The code variable is numeric, representing a job exit code.
ended({jobID | jobName})
If the specified batch job has finished (either in the EXIT or DONE state), the condition is TRUE. Otherwise, it is FALSE.
{jobID | jobName}
Specifying only jobID or jobName is equivalent to done({jobID | jobName}). If the specified batch job has completed successfully and is in the DONE state, the condition is TRUE. Otherwise, it is FALSE.

Job Dependency Examples

done(312) && (started(Job2)||exit(Job3))

The submitted job will not start until job 312 has completed successfully, and either the job named Job2 has started or the job named Job3 has terminated abnormally.

1532 || jobName2 || ended(jobName3*)

The submitted job will not start until either job 1532 has completed, the job named jobName2 has completed, or all jobs with names beginning with jobName3 have finished.

exit(34334, 12)

The submitted job will not start until job 34334 finishes with an exit code of 12.

exit(myjob, < 30)

The submitted job will not start until myjob finishes with an exit code lower than 30.

Note

If you require more extensive dependencies, for example, calendar or event dependencies, you may want to examine the LSF JobScheduler product of LSF Suite.

Remote File Access

LSF is usually used in networks with shared file space. When shared file space is not available, LSF can copy needed files to the execution host before running the job, and copy result files back to the submission host after the job completes.

The -f "[lfile op [rfile]]" option to the bsub command copies a file between the submission host and the execution host. lfile is the file name on the submission host, and rfile is the name on the execution host. op is the operation to perform on the file. lfile and rfile can be absolute or relative file path names. If one of the files is not specified, it defaults to the other, which must be given.

The -f option may be repeated to specify multiple files.

op must be surrounded by white space. The possible values for op are:

>
lfile on the submission host is copied to rfile on the execution host before job execution. rfile is overwritten if it exists
<
rfile on the execution host is copied to lfile on the submission host after the job completes. lfile is overwritten if it exists
<<
rfile is appended to lfile after the job completes. lfile is created if it does not exist
><, <>
equivalent to performing the > and then the < operation. lfile is copied to rfile before the job executes, and rfile is copied back (replacing the previous lfile) after the job completes. `<>' is the same as `><'

You must include lfile with op; otherwise, a syntax error results. When rfile is not given, it is assumed to be the same as lfile.

If the input file specified with the -i argument to bsub is not found on the execution host, the file is copied from the submission host using LSF's remote file access facility and is removed from the execution host after the job finishes.

The output files specified with the -o and -e arguments to bsub are created on the execution host, and are not copied back to the submission host by default. You can use the remote file access facility to copy these files back to the submission host if they are not on a shared file system. For example, the following command stores the job output in the job_out file and copies the file back to the submission host:

% bsub -o job_out -f "job_out <" myjob

If the submission and execution hosts have different directory structures, you must ensure that the directory where rfile and lfile will be placed exists. LSF tries to change the directory to the same path name as the directory where the bsub command was run. If this directory does not exist, the job is run in your home directory on the execution host.

You should specify rfile as a file name with no path when running in non-shared file systems; this places the file in the job's current working directory on the execution host. This way the job will work correctly even if the directory where the bsub command is run does not exist on the execution host. Be careful not to overwrite an existing file in your home directory.

For example, to submit myjob to LSF Batch, with input taken from the file /data/data3 and the output copied back to /data/out3, run the command:

% bsub -f "/data/data3 > data3" -f "/data/out3 < out3" myjob data3 out3

To run the job batch_update, which updates the batch_data file in place, you need to copy the file to the execution host before the job runs and copy it back after the job completes:

% bsub -f "batch_data <>" batch_update batch_data

LSF Batch uses the lsrcp(1) command to transfer files. lsrcp contacts the RES on the remote host to perform the file transfer. If the RES is not available, rcp(1) is used. Because LSF client hosts do not run the RES daemon, jobs that are submitted from client hosts should only specify the -f option to bsub if rcp is allowed. You must set up the permissions for rcp if account mapping is used.

Start and Termination Time

If you do not want LSF Batch to start your job immediately, use the bsub -b option to specify the time after which the job should be dispatched.

% bsub -b 5:00 myjob

The submitted job remains pending until after the local time on the LSF master host reaches 5 A.M. You can also specify a time after which the job should be terminated with the -t option to bsub. The command

% bsub -b 11:12:5:40 -t 11:12:20:30 myjob

submits myjob to the default queue to start after November 12 at 5:40 A.M. If the job is still running on November 12 at 8:30 P.M., it is killed.

Parallel Jobs

LSF Batch can allocate more than one host or processor to run a job, and automatically keeps track of the job status while a parallel job is running. To submit a parallel job, use the -n option of bsub:

% bsub -n 10 myjob

This command submits myjob as a parallel job. The job is started when 10 job slots are available.

For parallel jobs, LSF Batch only starts one controlling process for the batch job. This process is started on the first host in the list of selected hosts. The controlling process is responsible for starting the actual parallel components on all the hosts selected by LSF Batch.

LSF Batch sets a number of environment variables for each batch job. The variable LSB_JOBID is set to the LSF Batch job ID number as printed by bsub. The LSB_HOSTS variable is set to the names of the hosts running the batch job. For a sequential job, LSB_HOSTS is set to a single host name. For a parallel batch job, LSB_HOSTS contains the complete list of hosts that LSF Batch has allocated to that job. Parallel batch jobs must get the list of hosts from the LSB_HOSTS variable and start up all of the job components on the allocated hosts.

In the myjob example above, LSF Batch starts myjob on the first host. myjob reads the LSB_HOSTS environment variable to get the list of hosts and uses the RES to execute subtasks on those hosts.
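For example, the controlling process might be a wrapper script along the following lines (a sketch; mysubtask stands for the program's own component command):

#!/bin/sh
# LSB_HOSTS is set by LSF Batch to the list of allocated host names.
# Start one component on each allocated host through the RES, then
# wait for all components to finish.
for host in $LSB_HOSTS
do
    lsrun -m "$host" mysubtask &
done
wait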

LSF includes scripts for running PVM, P4, and MPI parallel programs as batch jobs. See `Parallel Jobs' on page 181 and the pvmjob(1), p4job(1), and mpijob(1) manual pages for more information.

The following features support parallel jobs running through the LSF Batch system.

Minimum and Maximum Number of Processors

When submitting a parallel job that requires multiple processors, you can specify the minimum and maximum number of processors using the -n option to the bsub command. The syntax of the -n option is:

bsub -n min_proc[,max_proc] <other bsub options>

If max_proc is not specified then it is assumed to be equal to min_proc. For example:

% bsub -n 4,16 myjob

At most, 16 processors can be allocated to this job. If fewer than 16 processors are eligible to run the job, the job can still be started as long as at least 4 processors are eligible. Once the job starts, no more processors are allocated to it, even if more become available later on.

If the specified maximum number is greater than the value of PROCLIMIT defined for the queue to which the job is submitted, the job will be rejected.
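You can check whether a queue defines a processor limit before submitting. For example, assuming a queue named normal:

% bqueues -l normal

If PROCLIMIT is defined for the queue, its value appears in the output of this command.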

Specifying Locality

Sometimes you need to control how the processors selected for a parallel job are distributed across the hosts in the cluster. You can specify "select all the processors for this parallel batch job on the same host", or "do not choose more than n processors on one host", by using the 'span' section in the -R option string. For example:

% bsub -n 4 -R "span[hosts=1]" my_job

This job should be dispatched to a multiprocessor that has at least 4 processors currently eligible to run the 4 components of this job.

% bsub -n 4 -R "span[ptile=1]" myjob

This job should be dispatched to 4 hosts even though some of the 4 hosts may have more than one processor currently available.

Note

The queue may also define the locality for parallel jobs using the RES_REQ parameter. The queue specification is ignored if your job specifies its own locality.

A parallel job may span multiple hosts, with a specifiable number of processes allocated to each host. Thus, a job may be scheduled onto a single multiprocessor host to take advantage of its efficient shared memory, or spread out onto multiple hosts to take advantage of their aggregate memory and swap space. Flexible spanning may also be used to achieve parallel I/O.

The span section of the resource requirement string can specify a processor tiling factor, ptile:

span[ptile=value]

The value is a number greater than 0, indicating that up to value processors on each host should be allocated to the job, regardless of how many processors the host possesses. For example:

% bsub -n 4 -R "span[ptile=2]" myjob

This job should be dispatched to 2 hosts with 2 processors on each host allocated for the job. Each host may have more than 2 processors available.

% bsub -n 4 -R "span[ptile=3]" myjob

In this case, the job must be dispatched to 2 hosts. It takes 3 processors on the first host and 1 processor on the second host.

Job Arrays

The LSF Batch system provides a structure called a job array, which allows multiple jobs to be created with a single job submission (bsub). A job array is a series of independent batch jobs, all of which share the same job ID and submission parameters (resource requirements). The job array elements are referenced using an array index. The dimension and structure of the job array are defined as part of the job name when the job is submitted.

Job array elements (jobs) are scheduled to run independently of each other, using the various policies that govern a user's jobs within the LSF system. After a job array is submitted, the resource requirements for individual jobs and for the entire array are modified using the bmod command (see `Job Array Modification' on page 133). Individual jobs and the entire array are controlled using the bstop, bresume, and bkill commands (see `Controlling Job Arrays' on page 129). The status and history of a job array and its jobs are viewed using the bjobs and bhist commands (see `Tracking Job Arrays' on page 126).

The default maximum size of a job array is 1000 jobs, but this can be increased to 2046 jobs. The MAX_JOB_ARRAY_SIZE parameter specified in the lsb.params file sets the maximum size of a job array.

This section discusses the following topics:

Creating a Job Array
LSB_JOBINDEX Environment Variable
Array Job Dependencies
Handling Input/Output/Error Files for Job Arrays

Creating a Job Array

A job array is created at the time of submission. The job name field of the bsub command is extended to specify the elements of the array. Each element of the array corresponds to a single job and is identified by an index which must be a positive integer.

The index values in an array do not have to be consecutive, and a combination of individual index values and index ranges can be used to define a job array. The following command creates a job array named myJobArray consisting of 100 elements with indices 1 through 100 (myJob stands for the command to run):

% bsub -J "myJobArray[1, 2, 3, 4-50, 51, 52, 53-100]" myJob

The array elements (jobs) are named myJobArray[1], myJobArray[2], ..., myJobArray[n], ..., myJobArray[100].

Syntax

% bsub -J "jobArrayName[indexList, . . .]" command

Note

One blank space must be entered between the -J switch and the first quote in the job array specification.

The job array specification must be enclosed in double quotes.

The square brackets, [ ], around the indexList must be entered exactly as shown.

Note: The job array syntax breaks the convention of using square brackets to indicate optional items.

jobArrayName
Specifies a user defined string used to name the job array. Any combination of the following characters make up a valid jobArrayName:
a-z | A-Z | 0-9 | . | - | _
indexList
Specifies the dimension, structure, and indices of the job array in the following format:
indexList = start [- end [: step]]
start
A unique positive integer specifying the start of a range of job array indices. If a start value is specified without an end value, start specifies an individual index in the job array.
end
A unique positive integer specifying the end of a range of job array indices.
step
A positive integer specifying the value to increment the index values for the preceding range. If omitted, the default value is 1.

Examples of indexList specifications:

[1-10] specifies ten elements with indices 1 through 10.
[1-10:2] specifies five elements with indices 1, 3, 5, 7, and 9 (a step of 2).
[1, 5, 10-12] specifies five elements with indices 1, 5, 10, 11, and 12.

LSB_JOBINDEX Environment Variable

The environment variable LSB_JOBINDEX is set when each job array element is dispatched. Its value corresponds to the job array index. Typically this variable is used within a script to select the job command to be performed based on the job array index. For example:

if [ $LSB_JOBINDEX -eq 1 ]; then
    cmd1
fi
if [ $LSB_JOBINDEX -eq 2 ]; then
    cmd2
fi

Array Job Dependencies

Since each job array has the same set of submission parameters, it is not possible to set up job dependencies between elements of the same array. Similar behaviour can be achieved by creating two job arrays and using array job dependencies. For example, suppose you want to have an array with 100 elements, where the first 50 elements must be run before the next 50. This can be achieved with the following submissions:

% bsub -J "myJob[1-50]" cmd
Job <101> submitted to default queue <normal>.
% bsub -w "done(101)" -J "myJob[51-100]" cmd
Job <102> submitted to default queue <normal>.

The second job array, 102, will wait for the successful completion of all jobs in the first array due to the done(101) dependency. Note that two job arrays can have the same name but a different number of elements in each. Each job array is handled independently of the other.

A job or job array can also depend on the partial completion of another array. The dependency condition functions listed in Table 6 can be used to evaluate the number of jobs of a job array in a given job state. The op in Table 6 is one of the strings "==", ">", "<", ">=", or "<=". num is a non-negative integer. The special string "*" can be used in place of num to mean "all".

Table 6. Dependency Condition Functions

Function                          Description
numrun(array_jobId, op num)       TRUE if the RUN counter satisfies the test
numpend(array_jobId, op num)      TRUE if the PEND counter satisfies the test
numdone(array_jobId, op num)      TRUE if the DONE counter satisfies the test
numexit(array_jobId, op num)      TRUE if the EXIT counter satisfies the test
numended(array_jobId, op num)     TRUE if the DONE+EXIT counter satisfies the test
numhold(array_jobId, op num)      TRUE if the PSUSP counter satisfies the test
numstart(array_jobId, op num)     TRUE if the RUN+SSUSP+USUSP counter satisfies the test

In the following example, the elements in job array 202 will be scheduled when 10 or more elements in job array 201 have completed successfully.

% bsub -J "myJob[1-50]" cmd
Job <201> submitted to default queue <normal>.
% bsub -w "numdone(201,>=10)" -J "myJob[51-100]" cmd
Job <202> submitted to default queue <normal>.

Handling Input/Output/Error Files for Job Arrays

If input, output, or error files are specified for the array, all elements will share these files. To separate the I/O of each element, special strings can be inserted in the I/O file specification to indicate the job ID or the array index of the element. The strings "%J" and "%I" are expanded at job execution time into the job ID and array index, respectively, when found in the input, output, or error specification. Both "%I" and "%J" may be specified simultaneously. For example:

% bsub -J "render[1-5]" -i "frame.%I"  renderit
Job <200> submitted to default queue <normal>.

would result in an array with 5 elements, render[1] through render[5], whose input files are frame.1, frame.2, ..., frame.5, respectively.
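Both strings can also be combined to separate the output of each element. For example, the following variation of the same submission (the output file naming is illustrative) writes the output of each element to out.<jobID>.<index>:

% bsub -J "render[1-5]" -i "frame.%I" -o "out.%J.%I" renderit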

Specifying a Share Account

If the cluster uses fairshare to determine the rate of resource allocation, the order in which jobs are dispatched differs from the default first-come-first-served (FCFS) policy. A user cannot control the priority assigned by the system when fairshare is enabled. However, if a user belongs to more than one group in the share tree, and hence has multiple share accounts, a job can be associated with a particular share account using the -G option of bsub(1). For example, if a user is a member of both the 'test' and 'development' groups, a job can be submitted using the user's share account in the test group as follows:

% bsub -G test myJob

Note that the user must have an account under the group specified; otherwise, the job is rejected. Use bugroup -l to find out if you belong to a group.

Re-initializing Job Environment on the Execution Host

By default, LSF Batch copies the environment of the job from the submission host when the job is submitted. The environment is recreated on the execution host when the job is started. This is convenient in many cases, because the job runs as if it were run interactively on the submission host.

There are cases where you want to use a platform-specific or host-specific environment to run the job, rather than using the same environment as on the submission host. For example, you may want to set up different search paths on the execution host.

The -L shell option to the bsub command causes LSF Batch to emulate a login on the execution host before starting your job. This makes sure that the login start-up files (.profile for /bin/sh, or .cshrc and .login for /bin/csh) are sourced before the job is started. The shell argument specifies the login shell to use.

% bsub -L /a/b/shell myjob
Job <1234> is submitted to default queue <normal>.

This tells LSF Batch to use /a/b/shell as the login shell to reinitialize the environment.

This does not affect the shell under which the job is run. When a login shell is specified with the -L shell option to the bsub command, that shell is only used as a login shell to set the environment. The job is run using /bin/sh, unless you specify otherwise as described in `Running a Job Under a Particular Shell' on page 116. For example, if your job script is written in /bin/sh and your regular login shell is /bin/csh, you can run your job under /bin/sh but use /bin/csh to reinitialize the job environment by sourcing your .cshrc and .login files.
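For example, the following submission (myjob stands for your own command) sources your .cshrc and .login files to reinitialize the environment, while the job itself still runs under /bin/sh:

% bsub -L /bin/csh myjob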

Other bsub Options

This section lists some other bsub options. For details on these options see the bsub(1) manual page.

-x
The job must run exclusively on a host. The job is started on a host that has no other LSF Batch jobs running on it. The host is locked (status lockU) while this job is running so that no other LSF jobs are sent to the host.
-r
Specify that the job is rerunnable. See `Automatically Rerunning and Restarting Jobs' on page 174.
-B
Send email to the job submitter when the job begins executing.
-H
The job is submitted so that it is not scheduled until it is explicitly released by the user or administrator. The job immediately goes into the PSUSP state instead of the PEND state. A bresume(1) command would cause the job to go into the PEND state, where it could be scheduled.
This feature is useful when a job must wait on a condition which cannot be detected through LSF. The user or administrator can manually resume the job when the condition is satisfied, allowing it to be scheduled.
-I
An interactive batch job is submitted to the LSF Batch system. See `Interactive Batch Job Support' on page 145 for more details.
-k "checkdir[ interval ]"
Specify the checkpoint directory and interval. See `Submitting Checkpointable Jobs' on page 169.
-P project
Associate a project name with a job. Project names are logged in the lsb.acct file. You can use the bacct command to gather accounting information on a per-project basis.
On systems running IRIX 6, before the submitted job begins execution, a new array session is created and the project ID corresponding to the project name is assigned to the session.
-K
Force the synchronous execution of a job: the bsub command will not return until the specified job finishes running.
This is useful in cases where the completion of the job is required before proceeding, such as in a job script. If the job needs to be rerun due to transient failures, the command returns only after the job finishes successfully.
For example:
% bsub -K myJob
Job <205> is submitted to default queue <normal>.
<< Waiting for dispatch ...>>

This will cause the bsub command to wait until the job is completed before returning. bsub will exit with the same exit code as the application, so that job submission scripts can take appropriate actions based on any failure conditions.
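For example, a job script might use -K to chain two stages so that the second is submitted only if the first succeeds (a sketch; step1 and step2 stand for your own commands):

#!/bin/sh
# bsub -K blocks until the job finishes and exits with the job's own
# exit code, so ordinary shell logic can sequence the stages.
bsub -K step1 || exit 1
bsub -K step2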

Job Scripts

You can build a job file one line at a time, or create it from another file, by running bsub without a command to submit. When you do this, you start an interactive session where bsub reads command lines from the standard input and submits them as a single batch job. You are prompted with bsub> for each line.

Examples



% bsub -q simulation
bsub> cd /work/data/myhomedir
bsub> myjob arg1 arg2 ......
bsub> rm myjob.log
bsub> ^D
Job <1234> submitted to queue <simulation>.

In this case, the three command lines are submitted to LSF Batch and run as a Bourne shell (/bin/sh) script. Note that only valid Bourne shell command lines are acceptable in this case. Here is another example:

% bsub -q simulation < command_file
Job <1234> submitted to queue <simulation>.

command_file must contain Bourne shell command lines.



C:\> bsub -q simulation
bsub> cd \\server\data\myhomedir
bsub> myjob arg1 arg2 ......
bsub> del myjob.log
bsub> ^Z
Job <1234> submitted to queue <simulation>.

In this case, the three command lines are submitted to LSF Batch and run as a batch file (.BAT). Note that only valid Windows batch file command lines are acceptable in this case. Here is another example:

% bsub -q simulation < command_file
Job <1234> submitted to queue <simulation>.

command_file must contain Windows batch file command lines.

Embedded Submission Options

You can specify job submission options in the script read from the standard input by the bsub command using lines starting with `#BSUB':

% bsub -q simulation
bsub> #BSUB -q test
bsub> #BSUB -o outfile -R "mem>10"
bsub> myjob arg1 arg2
bsub> #BSUB -J simjob
bsub> ^D
Job <1234> submitted to queue <simulation>.

There are a few things to note in this example:

Command line options override embedded options. Although the script specifies #BSUB -q test, the job is submitted to the simulation queue given on the command line.

Embedded options must appear before the first command line of the script. The #BSUB -J simjob line is ignored because it comes after a command line.

As a second example, you can redirect a script to the standard input of the bsub command:

% bsub < myscript
Job <1234> submitted to queue <test>.

The myscript file contains job submission options as well as command lines to execute. When the bsub command reads a script from its standard input, the script file is actually spooled by the LSF Batch system; therefore, the script can be modified right after bsub returns for the next job submission.

When the script is specified on the bsub command line, the script is not spooled:

% bsub myscript
Job <1234> submitted to default queue <normal>.

In this case the command line myscript is spooled by LSF Batch, instead of the contents of the myscript file. Later modifications to the myscript file can affect the job's behaviour.

Running a Job Under a Particular Shell

By default, LSF runs batch jobs using the Bourne (/bin/sh) shell. You can specify the shell under which the job is run. This is done by specifying an interpreter in the first line of the script.

% bsub
bsub> #!/bin/csh -f
bsub> set coredump=`ls |grep core`
bsub> if ( "$coredump" != "") then
bsub> mv core core.`date | cut -d" " -f1`
bsub> endif
bsub> myjob
bsub> ^D
Job <1234> is submitted to default queue <normal>.

The bsub command must read the job script from the standard input to set the execution shell.

If you do not specify a shell in the script, the script is run using /bin/sh. If the first line of the script starts with a `#' not immediately followed by a `!', then /bin/csh is used to run the job. For example:

% bsub
bsub> # This is a comment line. This tells the system to use /bin/csh to
bsub> # interpret the script.
bsub>
bsub> setenv DAY `date | cut -d" " -f1`
bsub> myjob
bsub> ^D
Job <1234> is submitted to default queue <normal>.

If running jobs under a particular shell is frequently required, you can specify an alternate shell using a command-level job starter and run your jobs interactively. See `Command-Level Job Starters' on page 144 for detailed information.

Submitting Jobs Using the Job Submission GUI

LSF Batch provides a GUI for submitting jobs. The main window of xbsub was shown in the figure `xbsub Job Submission Window' on page 23. All the job submission options can be selected using the GUI.

Detailed parameters can be set by clicking the `Advanced' button. The resulting window is shown in Figure 11.

Figure 11. Advanced Parameters of the Job Submission Window

