1. Introduction

This chapter describes the LSF Parallel system and its architecture.

This chapter discusses the following topics:

What Is the LSF Parallel System?
How Does LSF Parallel Fit Into the LSF Batch System?
LSF Parallel Architecture

What Is the LSF Parallel System?

The LSF Parallel system is a fully supported commercial software system for programming, testing, and running parallel applications in production environments.

The LSF Parallel system is fully integrated with the LSF Batch system, the de facto industry standard resource management software product, to provide load sharing in a distributed system and batch scheduling for compute-intensive jobs. The LSF Parallel system provides support for:

How Does LSF Parallel Fit Into the LSF Batch System?

The LSF Parallel system adopts a layered approach, shown in Figure 1, that is fully integrated with the LSF Batch system. In addition to the LSF Batch system resources, the following components make up the LSF Parallel system:

Figure 1 Major Components of the LSF Parallel system

MPI Library

The Message Passing Interface (MPI) library is a message-passing library that must be linked with any parallel application that is to run in the LSF Batch system. The MPI library translates MPI calls into messages for the machine-dependent layer and interfaces the user application with PAM.

See Vendor MPI Implementations on page 41 for a description of vendor-specific MPI implementations.

PAM

The Parallel Application Manager (PAM) is the point of control for the LSF Parallel system. PAM is fully integrated with the LSF Batch system and interfaces the user application with it. For all parallel application processes (tasks), PAM:

LSF Batch System

The LSF Batch system is a sophisticated resource-based batch job scheduling system. It accepts user jobs and holds them in queues until suitable hosts are available and resource requirements are satisfied. Host selection is based on up-to-the-minute load information provided by the master Load Information Manager (LIM).

LSF Batch runs user jobs on batch server hosts. It has sophisticated controls for sharing hosts with interactive users; there is no need to set aside dedicated hosts for processing batch jobs.

See the LSF Batch User's Guide and the LSF Batch Administrator's Guide for a detailed description of the LSF Batch system.
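The queue-and-dispatch behaviour described above can be illustrated with a small model. The following is a minimal sketch, not LSF code: the names `Job`, `select_hosts`, and `dispatch_pass`, the `max_load` requirement, and the single load figure per host are all assumptions made for illustration. It also hands each host to at most one job per pass, which is a simplification of how LSF Batch shares hosts.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    hosts_needed: int   # number of hosts the job asks for
    max_load: float     # hypothetical requirement: highest acceptable host load


def select_hosts(load_by_host, job):
    """Return the least-loaded hosts that satisfy the job, or None if too few qualify."""
    suitable = sorted(
        (host for host, load in load_by_host.items() if load <= job.max_load),
        key=lambda host: load_by_host[host],
    )
    if len(suitable) < job.hosts_needed:
        return None
    return suitable[: job.hosts_needed]


def dispatch_pass(queue, load_by_host):
    """Try each queued job once; jobs that cannot be placed stay queued in order."""
    dispatched = {}
    still_waiting = deque()
    free = dict(load_by_host)  # simplification: a host goes to at most one job per pass
    while queue:
        job = queue.popleft()
        hosts = select_hosts(free, job)
        if hosts is None:
            still_waiting.append(job)  # held until suitable hosts become available
            continue
        dispatched[job.name] = hosts
        for host in hosts:
            del free[host]
    queue.extend(still_waiting)
    return dispatched
```

In this model a job that cannot be placed is neither rejected nor run on unsuitable hosts; it simply remains in the queue for the next pass, which mirrors the "hold until suitable hosts are available" behaviour described in the text.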

LSF Parallel Architecture

The LSF Parallel system uses the resources of the LSF Batch system for resource selection and for process invocation and control. The process of parallel batch job invocation and control is shown in Figure 2 and described in Table 1. The LSF components shown in Figure 2 are described in Table 2.

Figure 2 LSF Parallel Architecture

The LSF Parallel system also supports interactive parallel job submission. The process is similar to that shown in Figure 2, except that the user request is submitted directly to PAM, which makes a simple placement query to LIM. Job queuing and resource reservations are not supported in interactive mode.
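To make the difference from batch submission concrete, the sketch below models an interactive placement query as a single call that is answered immediately. It is an illustration only: `place_interactive` and the load dictionary are hypothetical names, and the real query protocol between PAM and LIM is not described in this chapter.

```python
def place_interactive(load_by_host, num_tasks):
    """Answer a placement query at once: the least-loaded hosts, or an error.

    Assumption for illustration: one task per host and a single load figure
    per host. The request is never queued and no hosts are reserved.
    """
    if num_tasks > len(load_by_host):
        raise RuntimeError("not enough hosts; interactive requests are not queued")
    ranked = sorted(load_by_host, key=load_by_host.get)
    return ranked[:num_tasks]
```

A request that cannot be satisfied fails straight away in this model, whereas a batch job in the same situation would wait in a queue.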

Table 1 LSF Parallel Batch Job Invocation

Stage  Description

1      User submits a parallel batch job to the LSF Batch system
2      MBD retrieves a list of suitable execution hosts from the master LIM
3      MBD allocates (schedules, reserves) the execution hosts for the parallel batch job
4      MBD dispatches the parallel batch job to the SBD on the first execution host that was allocated to the batch job
5      SBD starts PAM on the same execution host
6      PAM starts RES on each execution host allocated to the batch job
7      RES starts the tasks on each execution host
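The rows of Table 1 can be read as a linear chain of hand-offs between components. The sketch below replays that chain as an ordered event log. It is a toy model under stated assumptions, not the LSF implementation: `launch_parallel_job` is a hypothetical function, the candidate hosts are passed in already ranked (standing in for the master LIM's answer), and one task is placed on each host.

```python
def launch_parallel_job(job_name, num_tasks, ranked_hosts):
    """Return (allocated_hosts, events) for one parallel batch job in a toy model."""
    if num_tasks < 1:
        raise ValueError("a parallel job needs at least one task")
    if len(ranked_hosts) < num_tasks:
        raise RuntimeError("not enough suitable hosts; the job would stay queued")

    events = [("user", f"submit {job_name} to LSF Batch")]
    events.append(("MBD", "retrieve suitable hosts from master LIM"))

    allocated = list(ranked_hosts[:num_tasks])  # assumption: one task per host
    events.append(("MBD", f"allocate {', '.join(allocated)}"))

    first_host = allocated[0]  # top of the execution host list
    events.append(("MBD", f"dispatch {job_name} to SBD on {first_host}"))
    events.append((f"SBD@{first_host}", f"start PAM on {first_host}"))

    # PAM starts a RES on every allocated host, including the first one.
    for host in allocated:
        events.append(("PAM", f"start RES on {host}"))

    # Each RES then starts the task placed on its own host.
    for rank, host in enumerate(allocated):
        events.append((f"RES@{host}", f"start task {rank}"))

    return allocated, events
```

Calling `launch_parallel_job("mpi_job", 2, ["hostA", "hostB", "hostC"])` allocates `hostA` and `hostB`, and the log shows that only the SBD on the first execution host is involved in starting PAM.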

Table 2 Description of LSF Parallel Components

Part

Function

User Request

Batch job submission to the LSF Batch system using the bsub command.

MBD

The Master Batch Daemon is the policy center for the LSF Batch system. It maintains information about batch jobs, hosts, users, and queues. All of this information is used in scheduling batch jobs to hosts.

LIM

The Load Information Manager is a daemon process running on each execution host. LIM monitors the load on its host and exchanges this information with the master LIM.

The master LIM resides on one execution host and collects information from the LIMs on all other hosts in the LSF cluster. If the master LIM becomes unavailable, the LIM on another host automatically takes over.

For batch submission, the master LIM provides this information to the MBD.

For interactive execution, the master LIM provides simple placement advice.

SBD

The Slave Batch Daemons are batch job execution agents residing on the execution hosts. Each SBD receives jobs from the MBD in the form of a job specification and starts RES to run the job according to the specification. SBD reports the batch job status to the MBD whenever the job state changes.

PAM

The Parallel Application Manager is the point of control for the LSF Parallel system. PAM is fully integrated with the LSF system and interfaces the user application with it.

If PAM or its host crashes, each RES will terminate all tasks under its management. This avoids the problem of orphaned processes.

RES

A Remote Execution Server resides on each execution host. RES manages all remote tasks and forwards signals, standard I/O, resource consumption data, and parallel job information between PAM and the tasks.

application task

An individual process of a parallel application.

execution hosts

The most suitable hosts to execute the batch job as determined by the LSF Batch system.

first execution host

The host name at the top of the execution host list as determined by the LSF Batch system.
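The PAM entry in Table 2 states that each RES terminates its tasks if PAM or its host crashes. The sketch below models only that outcome. The classes `Res` and `Pam` and the function `running_tasks` are hypothetical, and the direct call from `Pam.crash` to each `Res` is a modelling shortcut: how a real RES detects the loss of PAM is not described in this chapter.

```python
class Res:
    """Toy stand-in for the RES on one execution host."""

    def __init__(self, host):
        self.host = host
        self.tasks = {}  # task id -> "running" or "terminated"

    def start_task(self, task_id):
        self.tasks[task_id] = "running"

    def on_pam_lost(self):
        """Terminate every task this RES manages so that none is orphaned."""
        for task_id in list(self.tasks):
            self.tasks[task_id] = "terminated"


class Pam:
    """Toy stand-in for PAM, the point of control for all RES instances."""

    def __init__(self, res_list):
        self.res_list = res_list
        self.alive = True

    def crash(self):
        self.alive = False
        # Shortcut: notify each RES directly instead of modelling failure detection.
        for res in self.res_list:
            res.on_pam_lost()


def running_tasks(res_list):
    """Return the sorted ids of all tasks still running on any host."""
    return sorted(
        task_id
        for res in res_list
        for task_id, state in res.tasks.items()
        if state == "running"
    )
```

The point of the design is that cleanup is local: each RES is responsible only for the tasks on its own host, so no surviving central component is needed to find and remove leftover processes.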


doc@platform.com

Copyright © 1994-1998 Platform Computing Corporation.
All rights reserved.