This chapter describes how to submit and interact with parallel applications in the LSF Batch system.
An extensive and flexible set of tools allows parallel applications to be submitted through the LSF Batch system. Parallel applications can also be executed interactively under control of the Parallel Application Manager (PAM). These tools allow the specification of how, when, and where a parallel application is to be run.
The LSF Parallel system supports batch submission of parallel applications (batch jobs) using the facilities of the LSF Batch System. Interactive execution of parallel applications is also supported under control of the Parallel Application Manager (PAM).
When submitting a parallel batch job, the LSF Parallel system uses the advanced features of the LSF Batch system to select, submit, and interact with the individual tasks of the parallel batch job. The batch job is submitted to a queue using the bsub command and the LSF Batch system attends to the details.
A parallel batch job is submitted to a queue, where it waits until it reaches the front of the queue and the appropriate resources become available. The batch job is then dispatched to the most suitable hosts for execution. This queuing system allows batch jobs to run as soon as suitable host resources become available.
The batch job may not run immediately; it may be queued until the appropriate resources become available.
To use the bsub command to submit a parallel batch job to the LSF Batch system, see Submitting Batch Jobs on page 26.
When interactively executing a parallel batch job, the pam command is used to invoke PAM. When submitting batch jobs using the pam command, the LSF Batch system is bypassed; the jobs are not queued. Batch jobs are run immediately upon entering the command if the specified resource requirements are met. If the resources are not available, the job is not run.
Since the jobs do not wait, interactive job execution is beneficial for debugging parallel applications. Direct interaction is supported. All the input and output is handled transparently between the local and execution hosts.
To use the pam command to execute a parallel batch job interactively, see Interactive Execution on page 34.
The LSF Batch and LSF Parallel products provide commands and man pages for these commands.
If these commands cannot be executed or the man pages cannot be viewed, the appropriate directories may need to be added to the system's path; check with the system administrator.
The LSF Parallel system uses the features of the LSF Batch system to select the most suitable hosts, submit, and interact with parallel batch jobs. The batch job is submitted to a queue using the bsub command, as described in Submitting Batch Jobs on page 26, and the LSF Batch and LSF Parallel systems attend to the rest.
Like serial batch jobs, parallel batch jobs pass through many states. See Batch Job Status on page 23.
Each batch job submitted to the LSF Batch system passes through a series of states until the job completes normally (success) or abnormally (failure). The bjobs command allows the status of the batch jobs to be monitored; see Monitoring Job Status on page 30. The ability to monitor batch job status extends to the individual processes (tasks) of the parallel application.
Figure 3 Batch Job State Diagram
Figure 3 shows the possible states a batch job can pass through when submitted to the LSF Batch system. The diagram also shows the activities and commands that cause the state transitions. The batch job states are described in Table .
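The user-driven part of the state diagram can be sketched as a small lookup table. This is a simplified illustration, not the LSF implementation: it models only the bstop, bresume, and bkill commands described later in this chapter and ignores scheduler-driven transitions such as dispatch and completion. The PSUSP state is not named in this section and is assumed from the standard LSF Batch state set, and the function name next_state is hypothetical.

```python
# Simplified model of user-driven LSF Batch job state changes.
# Assumption: only bstop, bresume, and bkill are modeled; transitions made
# by the scheduler (dispatch, system suspension, completion) are left out.

FINISHED = {"DONE", "EXIT"}

TRANSITIONS = {
    ("RUN", "bstop"): "USUSP",
    ("PEND", "bstop"): "PSUSP",     # PSUSP assumed, not named in this section
    ("USUSP", "bresume"): "RUN",    # the text notes PEND is also possible
    ("PSUSP", "bresume"): "PEND",
}


def next_state(state, command):
    """Return the job state after issuing command against a job in state."""
    if state in FINISHED:
        return state                # finished jobs no longer change state
    if command == "bkill":
        return "EXIT"
    # Unknown combinations leave the state unchanged.
    return TRANSITIONS.get((state, command), state)


print(next_state("RUN", "bstop"))
```

A job in RUN that receives bstop moves to USUSP here, which matches the bstop example later in this chapter.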
The bsub command is used to submit parallel batch jobs to the LSF Batch system. The syntax for using bsub when submitting parallel applications is the same as the LSF Batch system with the addition of the pam option:
% bsub [options] pam [options] job
The pam options used with the bsub command are a subset of the pam command options; see The pam Command on page 35. Since the LSF Batch system does all of the resource allocation and scheduling, the pam options -m, -f, and -n are not necessary and are ignored by the bsub command. The syntax for bsub pam is:
pam [-h][-V][-t][-v]
-t    Suppresses the printing of the process status summary on job completion.
-v    Specifies that the job is to be run in verbose mode. The names of the selected hosts are displayed.
Example: The following command submits a parallel batch job named myjob to the LSF Batch system and requests four processors of any type to run the job:
% bsub -n 4 pam myjob
When the parallel batch job named myjob is submitted to the LSF Batch system and dispatched to host1, host2, host3 and host4, the bjobs command will display:
% bjobs
JOBID USER  STAT  QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
713   user1 RUN   batch host99    host1     myjob    Sep 12 16:30
                                  host2
                                  host3
                                  host4
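When submissions are scripted, the bsub pam command line can be assembled programmatically. The following is a minimal sketch, assuming Python is available on the submission host. The function name bsub_pam_argv is hypothetical, and the set of accepted pam flags is taken only from the syntax line above.

```python
import shlex

# pam flags accepted under bsub, taken from the syntax line in this section.
PAM_FLAGS_UNDER_BSUB = {"-h", "-V", "-t", "-v"}


def bsub_pam_argv(job, nprocs, pam_flags=(), job_args=()):
    """Build the argument list for `bsub -n N pam [flags] job [args]`."""
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    unknown = set(pam_flags) - PAM_FLAGS_UNDER_BSUB
    if unknown:
        raise ValueError(f"unsupported pam flags: {sorted(unknown)}")
    return ["bsub", "-n", str(nprocs), "pam", *pam_flags, job, *job_args]


argv = bsub_pam_argv("myjob", 4)
print(shlex.join(argv))  # the same command line as the example above
```

The resulting list can be passed to subprocess.run(argv) on a host where the LSF commands are installed.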
The bstop command is used to suspend parallel batch jobs running in the LSF Batch system. The syntax for using the bstop command in the LSF Parallel system is:
% bstop jobID
Example: The following command suspends the parallel batch job named myjob running in the LSF Batch system with a job ID of 713:
% bstop 713
When the parallel batch job named myjob is suspended, the bjobs command will display the batch job state of USUSP:
% bjobs
JOBID USER  STAT  QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
713   user1 USUSP batch host99    host1     myjob    Sep 12 16:32
                                  host2
                                  host3
                                  host4
The bresume command is used to resume suspended parallel batch jobs running in the LSF Batch system. The syntax for using the bresume command in the LSF Parallel system is:
% bresume jobID
Example: The following command resumes the suspended parallel batch job named myjob running in the LSF Batch system with a job ID of 713:
% bresume 713
When the parallel batch job named myjob is resumed, the bjobs command will display the batch job state of RUN or PEND:
% bjobs
JOBID USER  STAT  QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
713   user1 RUN   batch host99    host1     myjob    Sep 12 16:34
                                  host2
                                  host3
                                  host4
The bjobs command is used to view the running status and resource usage of parallel batch jobs running in the LSF Batch system. The syntax for using the bjobs command in the LSF Parallel system is:
% bjobs [options]
Example: The following command displays the running status and resource usage of the jobs running in the LSF Batch system:
% bjobs
JOBID USER  STAT  QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
713   user1 RUN   batch host99    host1     myjob    Sep 12 16:34
                                  host2
                                  host3
                                  host4
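For scripted monitoring, the default bjobs listing can be turned into records. The sketch below is an illustration under an assumed layout: a header row, one row per job, and additional rows that carry only an extra EXEC_HOST name for multi-host jobs. The function name parse_bjobs is hypothetical. Rows with an empty EXEC_HOST column (for example, pending jobs) are not handled.

```python
SAMPLE = """\
JOBID USER  STAT  QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
713   user1 RUN   batch host99    host1     myjob    Sep 12 16:34
                                  host2
                                  host3
                                  host4
"""


def parse_bjobs(text):
    """Parse default bjobs output into a list of dicts (assumed layout)."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "JOBID":
            continue  # skip blank lines and the header row
        if fields[0].isdigit() and len(fields) >= 8:
            jobs.append({
                "jobid": int(fields[0]),
                "user": fields[1],
                "stat": fields[2],
                "queue": fields[3],
                "from_host": fields[4],
                "exec_hosts": [fields[5]],
                "job_name": fields[6],
                "submit_time": " ".join(fields[7:]),
            })
        elif jobs and len(fields) == 1:
            # Continuation row: one more execution host for the last job.
            jobs[-1]["exec_hosts"].append(fields[0])
    return jobs


for job in parse_bjobs(SAMPLE):
    print(job["jobid"], job["stat"], ",".join(job["exec_hosts"]))
```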
Example: The following command uses the -l option to display run-time resource usage (CPU, memory, and swap) as well as the running status of the jobs running in the LSF Batch system:
The bkill command is used to terminate parallel batch jobs running in the LSF Batch system. The syntax for using the bkill command in the LSF Parallel system is:
% bkill jobID [options]
Example: The following command terminates the parallel batch job named myjob running in the LSF Batch system with a job ID of 713:
% bkill 713
When the parallel batch job named myjob is terminated, the bjobs command will display the batch job state of EXIT:
% bjobs
JOBID USER  STAT  QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
713   user1 EXIT  batch host99    host1     myjob    Sep 12 16:30
                                  host2
                                  host3
                                  host4
The time taken to terminate a parallel batch job varies and depends on the number of parallel processes.
The LSF Parallel system provides an LSF host type substitution facility to allow a heterogeneous multiple-architecture distributed application to be submitted to the LSF Batch system.
1. The binary will run on each specified platform, or a binary exists for each platform.
2. The binaries for the parallel application are specified using the %a notation format; see Building a Heterogeneous Parallel Application on page 17.
Example: Using the LSF host type extension format to specify the batch job named myjob to run on any two available processors having either Sun Solaris (SUNSOL) or RS6000 (RS6K) architectures, the following command can be used:
% bsub -n 2 pam myjob.%a
To specify SUNSOL and RS6K in an environment with other architectures, the following command is specified with the -R (resource) option:
% bsub -n 2 -R "type==SUNSOL || type==RS6K" pam myjob.%a
For both these examples, the Parallel Application Manager (PAM) substitutes the %a notation with the correct LSF host type extension. The binaries used are named:
Example: Using the LSF host type path name format to specify the batch job named myjob to run on any two processors having either SUNSOL or RS6K architectures, the following command can be used:
% bsub -n 2 pam /user/batch/%a/myjob
To specify SUNSOL and RS6K in an environment with other architectures, the following command is specified with the -R (resource) option:
% bsub -n 2 -R "type==SUNSOL || type==RS6K" pam /user/batch/%a/myjob
For both these examples, the Parallel Application Manager (PAM) substitutes the %a notation with the correct LSF host type path name. The paths used to select the binaries are:
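The %a substitution and the -R type expression used in this section can be illustrated with a short sketch. It assumes that PAM replaces %a with the literal LSF host type name (for example, SUNSOL). The actual substitution is performed by PAM, and the function names here are hypothetical.

```python
def expand_host_type(template, host_type):
    """Replace each %a in template with the given LSF host type name.

    Assumption: the substituted text is the literal host type name.
    """
    return template.replace("%a", host_type)


def type_requirement(host_types):
    """Build a -R string that restricts host selection to the given types."""
    return " || ".join(f"type=={t}" for t in host_types)


for host_type in ("SUNSOL", "RS6K"):
    print(expand_host_type("myjob.%a", host_type))
    print(expand_host_type("/user/batch/%a/myjob", host_type))

print(type_requirement(["SUNSOL", "RS6K"]))
```

Under this assumption, a binary or a directory must exist for every host type that the resource requirement allows.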
The LSF Parallel system uses the Parallel Application Manager (PAM) to control the execution of parallel batch jobs interactively. Batch jobs are executed interactively using the pam command; see The pam Command on page 35. When submitting batch jobs using the pam command, the LSF Batch system is bypassed; the jobs are not queued. Batch jobs are run immediately upon entering the command if the specified resource requirements are met. If the resources are not available, the job is not run. Since the jobs do not wait, interactive job execution is beneficial for debugging parallel applications.
To successfully execute an interactive parallel batch job, the pam command must be reissued at a time when the resources are available. If specific resources are not requested, the LSF Parallel system will run the batch job on the least loaded hosts that meet the batch job's criteria.
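Reissuing the pam command until resources are available can be automated with a simple retry loop. The following is a sketch under a stated assumption: a nonzero exit status is treated as "the job did not run". The exit codes of pam are not documented in this section, and a nonzero status could also mean that the job itself failed. The function name run_until_started and its parameters are hypothetical.

```python
import subprocess
import time


def run_until_started(argv, attempts=5, delay=30.0,
                      run=subprocess.call, sleep=time.sleep):
    """Reissue a command until it exits with status 0 or attempts run out.

    Assumption: a nonzero exit status means the job did not run.
    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if run(argv) == 0:
            return attempt
        if attempt < attempts:
            sleep(delay)  # wait before reissuing the command
    return None
```

A call such as run_until_started(["pam", "-n", "4", "myjob"]) would try the command up to five times, 30 seconds apart. The run and sleep parameters exist so the loop can be exercised without LSF installed.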
Direct interaction is supported. All the input and output is handled transparently between local and execution hosts. All job control signals (e.g., ctrl+x, ctrl+z, and ctrl+l) are propagated to the execution hosts; this allows interaction with the job as if it were being executed locally.
The pam command is used to interactively execute parallel batch jobs in the LSF Parallel system. A subset of the pam command is used as a command option for the bsub command (see The bsub Command on page 26). The syntax for using the pam command is:
pam [-h][-V][-i][-t][-v]
[-server_addr location ]
[| -server_jobid location ]
[| -server_jobname location ]
{-m "host ..." }
{| [-R req] -n num }
job [arg ...]
Example: The following command executes the parallel batch job named myjob on the LSF Parallel system requesting four processors of any type:
% pam -n 4 myjob
Example: The following command uses the -m option to execute the parallel batch job named myjob on host1, host2, and host3:
% pam -m "host1 host2 host3" myjob
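The syntax above offers two alternative ways to select hosts: an explicit host list with -m, or a processor count with -n and an optional -R requirement. The following sketch builds a pam argument list and rejects combinations that mix the two forms. It is an illustration of the syntax only; the function name pam_argv is hypothetical, and the -server options are left out.

```python
def pam_argv(job, hosts=None, nprocs=None, resreq=None, flags=(), job_args=()):
    """Build a pam argument list using either -m or [-R] -n, never both."""
    if hosts and (nprocs is not None or resreq is not None):
        raise ValueError("-m cannot be combined with -n or -R")
    argv = ["pam", *flags]
    if hosts:
        # The host list is passed as one quoted argument, as in the syntax.
        argv += ["-m", " ".join(hosts)]
    elif nprocs is not None:
        if resreq is not None:
            argv += ["-R", resreq]
        argv += ["-n", str(nprocs)]
    else:
        raise ValueError("give either hosts (-m) or nprocs (-n)")
    return argv + [job, *job_args]


print(pam_argv("myjob", nprocs=4))
print(pam_argv("myjob", hosts=["host1", "host2", "host3"]))
```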
After a parallel batch job terminates in a successful (DONE) or failed (EXIT) state, the LSF Parallel system displays the status of all the processes. For example:
Use the -t option to suppress the process status report. For example:
pam -t -n 4 myjob
The lshosts command is used to display information about LSF host configurations including name, type, model, CPU normalization factor, number of CPUs, total memory, and available resources.