4. Submitting Parallel Applications

This chapter describes how to submit and interact with parallel applications in the LSF Batch system.

An extensive and flexible set of tools allows parallel applications to be submitted through the LSF Batch system. Parallel applications can also be executed interactively under control of the Parallel Application Manager (PAM). These tools allow the specification of how, when, and where a parallel application is to be run.

This chapter discusses the following topics:

Job Submission Methods
Batch Execution
  Batch Job Status
  Submitting Batch Jobs
  Suspending Jobs
  Resuming Jobs
  Monitoring Job Status
  Terminating Jobs
  Running Heterogeneous Parallel Applications
Interactive Execution
  The pam Command
  Process Status Report
  Getting Host Information

Job Submission Methods

The LSF Parallel system supports batch submission of parallel applications (batch jobs) using the facilities of the LSF Batch System. Interactive execution of parallel applications is also supported under control of the Parallel Application Manager (PAM).

Batch Execution

When submitting a parallel batch job, the LSF Parallel system uses the advanced features of the LSF Batch system to select, submit, and interact with the individual tasks of the parallel batch job. The batch job is submitted to a queue using the bsub command and the LSF Batch system attends to the details.

A parallel batch job is submitted to a queue, where it waits until it reaches the front of the queue and the appropriate resources become available. The batch job is then dispatched to the most suitable hosts for execution. This queuing system allows batch jobs to run as soon as suitable host resources become available.

Note

The batch job may not be run immediately; it may be queued until the appropriate resources become available.

To use the bsub command to submit a parallel batch job to the LSF Batch system, see Submitting Batch Jobs on page 26.

Interactive Execution

When interactively executing a parallel batch job, the pam command is used to invoke PAM. When submitting batch jobs using the pam command, the LSF Batch system is bypassed; the jobs are not queued. Batch jobs are run immediately upon entering the command if the specified resource requirements are met. If the resources are not available, the job is not run.

Since the jobs do not wait, interactive job execution is beneficial for debugging parallel applications. Direct interaction is supported. All the input and output is handled transparently between the local and execution hosts.

To use the pam command to execute a parallel batch job interactively, see Interactive Execution on page 34.

LSF Parallel and LSF Batch Commands

The LSF Batch and LSF Parallel products provide commands and man pages for these commands.

If these commands cannot be executed or the man pages cannot be viewed, the appropriate directories may need to be added to the system's path; check with the system administrator.

Batch Execution

The LSF Parallel system uses the features of the LSF Batch system to select the most suitable hosts, submit parallel batch jobs, and interact with them. The batch job is submitted to a queue using the bsub command, as described in Submitting Batch Jobs on page 26, and the LSF Batch and LSF Parallel systems attend to the rest.

Like serial batch jobs, parallel batch jobs pass through many states. See Batch Job Status on page 23.

This part of the chapter discusses the following topics:

Batch Job Status
Submitting Batch Jobs
Suspending Jobs
Resuming Jobs
Monitoring Job Status
Terminating Jobs
Running Heterogeneous Parallel Applications

Batch Job Status

Each batch job submitted to the LSF Batch system passes through a series of states until the job completes normally (success) or abnormally (failure). The bjobs command allows the status of the batch jobs to be monitored; see Monitoring Job Status on page 30. The ability to monitor batch job status extends to the individual processes (tasks) of the parallel application.

Figure 3 Batch Job State Diagram

Figure 3 shows the possible states a batch job can pass through when submitted to the LSF Batch system. The diagram also shows the activities and commands that cause the state transitions. The batch job states are described in Table 3.

Table 3 Batch Job State Descriptions

State

Description

PEND

A batch job is pending when it is submitted (using the bsub command) and waiting in a queue. It remains pending until it moves to the head of the queue and all conditions for its execution are met. The conditions may include:

  • Start time specified by the user when the job is submitted
  • Load conditions on qualified hosts
  • Time windows during which:
    • The queue can dispatch jobs
    • Qualified hosts can accept jobs
  • Relative priority to other users and jobs
  • Availability of the specified resources.

RUN

A batch job is running when it has been dispatched to a host.

DONE

A batch job is done when it has normally completed its execution.

PSUSP

The job owner or the LSF Administrator can suspend (using the bstop command) a batch job while it is pending.

The job owner or the LSF Administrator can also resume (using the bresume command) a batch job that is in the PSUSP state; the batch job state then transitions to PEND.

USUSP

The job owner or the LSF Administrator can suspend (using the bstop command) a batch job after it has been dispatched.

The job owner or the LSF Administrator can also resume (using the bresume command) a batch job that is in the USUSP state; the batch job state then transitions to SSUSP.

SSUSP

A batch job can be suspended by the LSF Batch system after it has been dispatched. This is done when the load on the execution host or hosts becomes too high, in order to maximize host performance or to guarantee interactive response time.

The LSF Batch system suspends batch jobs according to their priority unless the scheduling policy associated with the job dictates otherwise. A batch job may also be suspended if the job queue has a time window and the current time falls outside that window.

The LSF Batch system can later resume a system-suspended (SSUSP) job if the load on the execution host decreases or the time window of the queue opens.

EXIT

A batch job can terminate abnormally (fail) from any state for many reasons. Abnormal job termination can occur when the job:

  • Is cancelled (using the bkill command) by the owner or the LSF Administrator while in the PEND, RUN, or USUSP state
  • Is aborted by LSF because it cannot be dispatched before a termination deadline
  • Fails to start successfully (e.g., the wrong executable was specified at the time of job submission)
  • Crashes during execution.
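
The state descriptions above can be summarized as a small transition table. The following Python sketch is only an illustration built from the text of Table 3: the event names (dispatch, system_suspend, and so on) are labels invented for this example, and the assumption that a system-resumed job returns to RUN is not stated explicitly in the table.

```python
# Transitions transcribed from the state descriptions in Table 3.
# The event names are labels invented for this sketch, not LSF identifiers.
TRANSITIONS = {
    ("PEND", "dispatch"): "RUN",
    ("PEND", "bstop"): "PSUSP",
    ("PSUSP", "bresume"): "PEND",
    ("RUN", "bstop"): "USUSP",
    ("USUSP", "bresume"): "SSUSP",
    ("RUN", "system_suspend"): "SSUSP",
    # Assumption: a job resumed by the system goes back to RUN.
    ("SSUSP", "system_resume"): "RUN",
    ("RUN", "complete"): "DONE",
}

TERMINAL_STATES = {"DONE", "EXIT"}


def next_state(state, event):
    """Return the state that follows `state` when `event` occurs."""
    if state in TERMINAL_STATES:
        raise ValueError(f"{state} is a terminal state")
    if event == "fail":
        # Table 3: a job can terminate abnormally from any state.
        return "EXIT"
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event!r}") from None


def replay(events, state="PEND"):
    """Apply a sequence of events to a newly submitted (PEND) job."""
    for event in events:
        state = next_state(state, event)
    return state
```

For example, replay(["dispatch", "bstop", "bresume"]) follows a job that is dispatched, stopped by its owner, and resumed, and ends in the SSUSP state.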

Parallel Batch Job Behavior

Submitting Batch Jobs

The bsub Command

The bsub command is used to submit parallel batch jobs to the LSF Batch system. The syntax for using bsub when submitting parallel applications is the same as in the LSF Batch system, with the addition of the pam option:

   % bsub [options] pam [options] job 

The pam Option

The pam options used with the bsub command are a subset of the pam command options; see The pam Command on page 35. Since the LSF Batch system does all of the resource allocation and scheduling, the pam options -m, -f, and -n are not necessary and are ignored by the bsub command. The syntax for bsub pam is:

   pam [-h][-V][-t][-v] 

The bsub pam options are:

Option

Description

-h

Print command usage to standard error and exit.

-V

Print LSF version to standard error and exit.

-t

Suppress the printing of the process status summary on job completion.

-v

Specifies the job is to be run in verbose mode. The names of the selected hosts are displayed.

Example: The following command submits a parallel batch job named myjob to the LSF Batch system and requests four processors of any type to run the job:

   % bsub -n 4 pam myjob 

When the parallel batch job named myjob is submitted to the LSF Batch system and dispatched to host1, host2, host3 and host4, the bjobs command will display:

% bjobs
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
713   user1    RUN   batch      host99      host1       myjob      Sep 12 16:30
                                            host2
                                            host3
                                            host4
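
Scripts that need the execution hosts of a parallel job have to account for the continuation lines in this listing. The following Python sketch shows one way to do that. It assumes the default column layout shown above, job names without spaces, and jobs that have already been dispatched; the function name parse_bjobs is hypothetical and not part of LSF.

```python
SAMPLE = """\
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
713   user1    RUN   batch      host99      host1       myjob      Sep 12 16:30
                                            host2
                                            host3
                                            host4
"""


def parse_bjobs(text):
    """Parse default bjobs output into a list of job dictionaries."""
    jobs = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("JOBID"):
            continue  # skip blank lines and the header
        if line[0].isspace():
            # Continuation line: one more execution host for the last job.
            if jobs:
                jobs[-1]["exec_hosts"].append(line.strip())
            continue
        fields = line.split()
        if len(fields) < 8 or not fields[0].isdigit():
            continue  # not a dispatched-job line in the layout assumed here
        jobs.append({
            "jobid": int(fields[0]),
            "user": fields[1],
            "stat": fields[2],
            "queue": fields[3],
            "from_host": fields[4],
            "exec_hosts": [fields[5]],
            "job_name": fields[6],
            "submit_time": " ".join(fields[7:]),
        })
    return jobs


if __name__ == "__main__":
    for job in parse_bjobs(SAMPLE):
        print(job["jobid"], job["stat"], ",".join(job["exec_hosts"]))
```

Collecting the hosts into a list per job keeps the four tasks of the example job together under the single job ID.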

Suspending Jobs

The bstop Command

The bstop command is used to suspend parallel batch jobs running in the LSF Batch system. The syntax for using the bstop command in the LSF Parallel system is:

   % bstop jobID

Example: The following command suspends the parallel batch job named myjob running in the LSF Batch system with a job ID of 713:

   % bstop 713 

When the parallel batch job named myjob is suspended, the bjobs command will display the batch job state of USUSP:

% bjobs
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
713   user1    USUSP batch      host99      host1       myjob      Sep 12 16:32
                                            host2
                                            host3
                                            host4

Resuming Jobs

The bresume Command

The bresume command is used to resume suspended parallel batch jobs running in the LSF Batch system. The syntax for using the bresume command in the LSF Parallel system is:

   % bresume jobID 

Example: The following command resumes the suspended parallel batch job named myjob running in the LSF Batch system with a job ID of 713:

   % bresume 713 

When the parallel batch job named myjob is resumed, the bjobs command will display the batch job state of RUN or PEND:

% bjobs
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
713   user1    RUN   batch      host99      host1       myjob      Sep 12 16:34
                                            host2
                                            host3
                                            host4

Monitoring Job Status

The bjobs Command

The bjobs command is used to view the running status and resource usage of parallel batch jobs running in the LSF Batch system. The syntax for using the bjobs command in the LSF Parallel system is:

   % bjobs [options] 

Example: The following command displays the running status and resource usage of the jobs running in the LSF Batch system:

% bjobs
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
713   user1    RUN   batch      host99      host1       myjob      Sep 12 16:34
                                            host2
                                            host3
                                            host4

Example: The following command uses the -l option to display run-time resource usage (CPU, memory, and swap) as well as the running status of the jobs running in the LSF Batch system:

% bjobs -l 
Job Id <713>, User, Project, Status, Queue, Interactive pseudo-terminal mode, 
Command <myjob>
                     
Thu Sep 12 16:39:17: Submitted from host <host99>, CWD <$HOME/Work/utopia/pass/
                     pam>, 2-4 Processors Requested;
Thu Sep 12 16:39:18: Started on 4 Hosts/Processors host1 host2 host3 host4,
                     Execution Home < /pcc/s/user1, Execution CWD < /pcc/s/user1/W
                     ork/utopia/pass/pam;
Thu Sep 12 16:40:41: Resource usage collected.
                     The CPU time used is 2 seconds.
                     MEM: 281 Kbytes;  SWAP: 367 Kbytes
                     PGIDs: 4 PIDs: 4 
 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  
            nresj
 loadSched     - 
 loadStop      - 

Terminating Jobs

The bkill Command

The bkill command is used to terminate parallel batch jobs running in the LSF Batch system. The syntax for using the bkill command in the LSF Parallel system is:

   % bkill jobID [options]  

Example: The following command terminates the parallel batch job named myjob running in the LSF Batch system with a job ID of 713:

   % bkill 713 

When the parallel batch job named myjob is terminated, the bjobs command will display the batch job state of EXIT:

% bjobs
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
713   user1    EXIT  batch      host99      host1       myjob      Sep 12 16:30
                                            host2
                                            host3
                                            host4

Note

The time taken to terminate a parallel batch job varies and depends on the number of parallel processes.

Running Heterogeneous Parallel Applications

The LSF Parallel system provides an LSF host type substitution facility to allow a heterogeneous multiple-architecture distributed application to be submitted to the LSF Batch system.

Assumptions:

1. The binary will run on each specified platform, or a binary exists for each platform.

2. The binaries for the parallel application are specified using the %a notation format; see Building a Heterogeneous Parallel Application on page 17.

Example: To run the batch job named myjob on any two available processors having either the Sun Solaris (SUNSOL) or RS6000 (RS6K) architecture, use the LSF host type extension format in the following command:

  % bsub -n 2 pam myjob.%a 

To restrict the job to SUNSOL and RS6K hosts in an environment with other architectures, add the -R (resource) option to the command:

  % bsub -n 2 -R "type==SUNSOL || type==RS6K" pam myjob.%a    

For both these examples, the Parallel Application Manager (PAM) substitutes the %a notation with the correct LSF host type extension. The binaries used are named:
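
As a rough sketch of this substitution, the following Python fragment assumes that %a is replaced textually by the LSF host type of each selected host. That reading, the function name binary_for_host, and the resulting names are illustrative assumptions rather than the PAM implementation; the authoritative format is described in Building a Heterogeneous Parallel Application.

```python
def binary_for_host(job_template, host_type):
    """Replace the %a placeholder with a host type extension.

    Assumption: the extension is the LSF host type string itself.
    """
    if "%a" not in job_template:
        # A single binary that is expected to run on every platform.
        return job_template
    return job_template.replace("%a", host_type)


if __name__ == "__main__":
    for host_type in ("SUNSOL", "RS6K"):
        print(host_type, "->", binary_for_host("myjob.%a", host_type))
```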

Interactive Execution

The LSF Parallel system uses the Parallel Application Manager (PAM) to control the execution of parallel batch jobs interactively. Batch jobs are executed interactively using the pam command; see The pam Command on page 35. When submitting batch jobs using the pam command, the LSF Batch system is bypassed; the jobs are not queued. Batch jobs are run immediately upon entering the command if the specified resource requirements are met. If the resources are not available, the job is not run. Since the jobs do not wait, interactive job execution is beneficial for debugging parallel applications.

If a job is not run because the requested resources are unavailable, the pam command must be reissued at a time when the resources are available. If specific resources are not requested, the LSF Parallel system will run the batch job on the least loaded hosts that meet the batch job's criteria.

Direct interaction is supported. All the input and output is handled transparently between local and execution hosts. All job control signals (e.g., ctrl+x, ctrl+z, and ctrl+l) are propagated to the execution hosts; this allows interaction with the job as if it were being executed locally.

This part of the chapter discusses the following topics:

The pam Command
Process Status Report
Getting Host Information

The pam Command

The pam command is used to interactively execute parallel batch jobs in the LSF Parallel system. A subset of the pam command is used as a command option for the bsub command (see The bsub Command on page 26). The syntax for using the pam command is:

   pam [-h][-V][-i][-t][-v]
       [-server_addr location | -server_jobid location | -server_jobname location]
       {-m "host ..." | [-R req] -n num}
       job [arg ...]

Option

Description

-h

Print command usage to standard error and exit.

-V

Print LSF version to standard error and exit.

-i

Specifies interactive operation mode. The user is asked whether the application is to be executed on all hosts.

If yes (y) the task is started on all hosts specified in the list.

If no (n) the user must interactively specify the hosts.

-t

Suppress the printing of the job task summary report to the standard output at job completion.

-v

Specifies the job is to be run in verbose mode. The names of the selected hosts are displayed.

-server_addr location

Specifies the location of the PAM server. The location is specified in the hostname:port_no format.

-server_jobid location

Specifies the location of the PAM server. The location is specified using the jobid for the server PAM job.

-server_jobname location

Specifies the location of the PAM server. The location is specified using the jobname for the server PAM job.

-m "host ..."

Specifies the list of hosts on which to run the parallel batch job tasks. The number of host names specified indicates the number of processors requested.

Note

This option cannot be used with options -R or -n, and is ignored when pam is used as a bsub option.

[-R req] -n num

Specifies the number of processors required to run the parallel job.

Note

This option cannot be used with option -m, and is ignored when pam is used as a bsub option.

-R req

Specifies the resource requirements for host selection. The parallel job is executed on the hosts that meet these requirements.

Default: r15s:pg

Note

This option is ignored when pam is used as a bsub option.

job [arg ...]

The name of the parallel job to be run.

Note

This must be the last argument on the pam command line.
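
The option rules above can be checked before pam is invoked. The following Python sketch builds an argument list for an interactive pam run. It reads the braces in the syntax as a required choice between -m and -n, which is an interpretation of the syntax line; the function name build_pam_command and its parameters are hypothetical, and only a few of the options are covered.

```python
def build_pam_command(job, args=(), hosts=None, num=None, req=None,
                      verbose=False, suppress_report=False):
    """Return an argument list for an interactive pam invocation."""
    if hosts and (num is not None or req is not None):
        # The -m option cannot be used with options -R or -n.
        raise ValueError("-m cannot be combined with -R or -n")
    if not hosts and num is None:
        raise ValueError("give either a host list (-m) or a count (-n)")

    cmd = ["pam"]
    if suppress_report:
        cmd.append("-t")
    if verbose:
        cmd.append("-v")
    if hosts:
        # The host list is passed to -m as one quoted argument.
        cmd += ["-m", " ".join(hosts)]
    else:
        if req is not None:
            cmd += ["-R", req]
        cmd += ["-n", str(num)]
    # The job and its arguments must come last on the pam command line.
    cmd.append(job)
    cmd.extend(args)
    return cmd


if __name__ == "__main__":
    print(build_pam_command("myjob", num=4))
    print(build_pam_command("myjob", hosts=["host1", "host2", "host3"]))
```

Returning a list rather than a single string keeps the quoted host list intact if the result is later handed to a process launcher such as subprocess.run.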

Example: The following command executes the parallel batch job named myjob on the LSF Parallel system requesting four processors of any type:

% pam -n 4 myjob
TID  HOST_NAME    COMMAND_LINE             STATUS            TERMINATION_TIME
==== ========== ================  ========================  ==================
1    host1      myjob             Completed                 03/31/98 10:31:58
2    host2      myjob             Completed                 03/31/98 10:31:59
3    host3      myjob             Completed                 03/31/98 10:31:59
4    host4      myjob             Completed                 03/31/98 10:31:58

Example: The following command uses the -m option to execute the parallel batch job named myjob on host1, host2, and host3:

% pam -m "host1 host2 host3" myjob
TID  HOST_NAME    COMMAND_LINE             STATUS            TERMINATION_TIME
==== ========== ================  ========================  ==================
1    host1      myjob             Completed                 03/31/98 10:31:58
2    host2      myjob             Completed                 03/31/98 10:31:59
3    host3      myjob             Completed                 03/31/98 10:31:59

Process Status Report

After a parallel batch job terminates in a successful (DONE) or failed (EXIT) state, the LSF Parallel system displays the status of all the processes. For example:

% pam -n 4 myjob
TID  HOST_NAME    COMMAND_LINE             STATUS            TERMINATION_TIME
==== ========== ================  ========================  ==================
1    host1      myjob             Done                      03/31/98 10:31:58
2    host2      myjob             Done                      03/31/98 10:31:59
3    host3      myjob             Done                      03/31/98 10:31:59
4    host4      myjob             Done                      03/31/98 10:31:59

Table 4 pam -n Job Status

Status

Description

Done

Process successfully completed with exit code of 0

Exit (code)

Process unsuccessfully completed with an exit code of code

Signaled (signal)

Process was terminated by signal

Exit (status unknown)

Connection broken; exit status unknown

Killed by PAM (signal)

PAM shut down the process using signal

Unreachable

PAM is unable to reach the host after a broken connection. There is no way to determine the state of the process

Runaway

Process is still running; cannot be killed by PAM

Suspend

Process suspended

Undefined

???

Run

Process running

Local RES died

???
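
A wrapper script can use the status column to find the tasks that did not finish with Done. The following Python sketch assumes the report layout shown above, with columns separated by at least two spaces and a termination time on every task line. The function names are hypothetical, and the Exit (1) line in the sample is invented for illustration, following the Exit (code) form in Table 4.

```python
import re

# Sample report; the "Exit (1)" task is a made-up example line.
REPORT = """\
TID  HOST_NAME    COMMAND_LINE             STATUS            TERMINATION_TIME
==== ========== ================  ========================  ==================
1    host1      myjob             Done                      03/31/98 10:31:58
2    host2      myjob             Done                      03/31/98 10:31:59
3    host3      myjob             Exit (1)                  03/31/98 10:31:59
4    host4      myjob             Done                      03/31/98 10:31:59
"""


def parse_status_report(text):
    """Parse the pam process status report into task dictionaries."""
    tasks = []
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces, so a
        # status such as "Exit (1)" stays in one piece.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) != 5 or not parts[0].isdigit():
            continue  # header, separator, or a line outside this layout
        tid, host, command, status, finished = parts
        tasks.append({
            "tid": int(tid),
            "host": host,
            "command": command,
            "status": status,
            "termination_time": finished,
        })
    return tasks


def not_done(tasks):
    """Return the tasks whose status is anything other than Done."""
    return [task for task in tasks if task["status"] != "Done"]


if __name__ == "__main__":
    for task in not_done(parse_status_report(REPORT)):
        print(task["tid"], task["host"], task["status"])
```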

Note

Use the -t option to suppress the process status report. For example:

   % pam -t -n 4 myjob

Getting Host Information

The lshosts command is used to display information about LSF host configurations including name, type, model, CPU normalization factor, number of CPUs, total memory, and available resources.

Example:

% lshosts
HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
host1        SGI64  SGI4D35   2.0     1    96M   153M    Yes (lsf_js irix gla)
host99      SUNSOL SunSparc  12.0     4  1024M  1930M    Yes (solaris cs bigmem)
host2        LINUX  I486_33  14.0     1    30M    64M    Yes (linux)
host7        SUN41 SPARCSLC   3.0     1    15M    29M    Yes (sparc bsd sun41)
host3      ALPHA~1  DEC5000   5.0     1    88M   384M    Yes (cs bigmem alpha gla)
host6      ALPHA~1  DEC5000   5.0     1    84M   350M    Yes (gla)
host4       SUNSOL SunSparc  12.0     2   256M   733M    Yes (solaris cs bigmem)
host5          SGI SGIINDIG  15.0     1    96M   300M    Yes (irix)
host8       SUNSOL SunSparc  12.0     1    56M    90M    Yes (solaris cs bigmem)
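
The host types reported by lshosts are the values used in resource requirements such as type==SUNSOL. As an illustration, the following Python sketch parses a listing in the layout shown above and counts the processors available for a set of host types. It assumes that every line has all nine columns with numeric cpuf and ncpus values; the function names are hypothetical, and the sample is a subset of the listing above.

```python
LISTING = """\
HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
host1        SGI64  SGI4D35   2.0     1    96M   153M    Yes (lsf_js irix gla)
host99      SUNSOL SunSparc  12.0     4  1024M  1930M    Yes (solaris cs bigmem)
host4       SUNSOL SunSparc  12.0     2   256M   733M    Yes (solaris cs bigmem)
host8       SUNSOL SunSparc  12.0     1    56M    90M    Yes (solaris cs bigmem)
"""


def parse_lshosts(text):
    """Parse lshosts output into a list of host dictionaries."""
    hosts = []
    for line in text.splitlines():
        # Split off the first eight columns; the rest is the resource list.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[0] == "HOST_NAME":
            continue  # header or a line outside the layout assumed here
        name, htype, model, cpuf, ncpus, maxmem, maxswp, server, res = fields
        hosts.append({
            "name": name,
            "type": htype,
            "model": model,
            "cpuf": float(cpuf),
            "ncpus": int(ncpus),
            "maxmem": maxmem,
            "maxswp": maxswp,
            "server": server == "Yes",
            "resources": res.strip().strip("()").split(),
        })
    return hosts


def processors_of_type(hosts, types):
    """Sum the ncpus column over hosts whose type is in `types`."""
    return sum(host["ncpus"] for host in hosts if host["type"] in types)


if __name__ == "__main__":
    hosts = parse_lshosts(LISTING)
    print(processors_of_type(hosts, {"SUNSOL"}))
```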

 





doc@platform.com

Copyright © 1994-1998 Platform Computing Corporation.
All rights reserved.