This section of the LSF Batch User's Guide gives a quick introduction to the LSF Base, LSF Batch, LSF Make, and LSF MultiCluster products. After reading the conceptual material, you should be able to begin using LSF. The rest of the guide contains more detailed information on LSF features and commands.
LSF is a suite of workload management products from Platform Computing Corporation. The LSF Suite includes LSF Batch, LSF JobScheduler, LSF MultiCluster, LSF Make, and LSF Analyzer, all running on top of the LSF Base system.
LSF manages, monitors, and analyzes the workload for a heterogeneous network of computers. It unites a group of UNIX and NT computers into a single system to make better use of the resources on a network. Hosts from various vendors can be integrated into a seamless system. You can submit your job and leave the system to find the best host to run your programs.
LSF supports sequential and parallel applications running as interactive and batch jobs. LSF also allows new distributed applications to be developed through the LSF Application Programming Interface (API), C programming libraries, and a tool kit of programs for writing shell scripts.
With LSF you can use a network of heterogeneous computers as a single system. You are no longer limited to the resources on your own workstation. You do not need to rewrite or change your programs to take advantage of LSF. You only need to learn a few simple commands and the resources of your entire network will be within reach.
LSF can automatically select hosts in a heterogeneous environment based on the current load conditions and the resource requirements of the applications.
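The selection idea can be illustrated with a minimal sketch. It is not LSF's actual algorithm: the `Host` fields, the host names, and the rule "filter by requirements, then take the lowest CPU load" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # All fields are hypothetical; they stand in for the static resources
    # and load indices that a real cluster would report.
    name: str
    host_type: str    # illustrative label such as "HP" or "SUN"
    free_mem_mb: int  # memory currently available to user jobs
    cpu_load: float   # current CPU load; lower means less busy


def select_host(hosts, required_type=None, min_mem_mb=0):
    """Return the least-loaded host that meets the requirements, or None."""
    eligible = [
        h for h in hosts
        if (required_type is None or h.host_type == required_type)
        and h.free_mem_mb >= min_mem_mb
    ]
    # Among the eligible hosts, prefer the one with the lowest load.
    return min(eligible, key=lambda h: h.cpu_load, default=None)


hosts = [
    Host("hostA", "SUN", 512, 0.3),
    Host("hostB", "HP", 256, 1.7),
    Host("hostC", "HP", 1024, 0.9),
]
best = select_host(hosts, required_type="HP", min_mem_mb=200)
print(best.name)
```

In this sketch the resource requirements act as a filter and the load acts as the ranking key, so a lightly loaded host of the wrong type is never chosen.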
With LSF, remotely run jobs behave just like jobs run on the local host. Even jobs with complicated terminal controls behave transparently to the user as if they were run locally.
LSF can run batch jobs automatically when required resources become available, or when systems are lightly loaded. LSF maintains full control over the jobs, including the ability to suspend and resume the jobs based on load conditions.
LSF can run both sequential and parallel jobs. Some jobs speed up substantially when run on a group of idle or lightly loaded hosts. For example, the LSF Make program allows you to do your software builds or automated tests many times faster than with traditional makes.
With LSF, you can transparently run software that is not available on your local host. For example, you could run a CAD tool that is only available on an HP host from your Sun workstation. The job would run on the HP and be displayed transparently on your Sun system.
With LSF, system administrators can easily control access to resources.
LSF provides mechanisms for resource and job accounting. This information can help the administrator to find out which applications or users are consuming resources, at what times of the day (or week) the system is busy or idle, and whether certain resources are overloaded and need to be expanded or upgraded.
LSF allows you to write your own load sharing applications, both as shell scripts using the lstools programs and as compiled programs using the LSF application programming libraries.
LSF provides comprehensive resource and load information about all hosts in the network.
Resource information includes the number of processors on each host, total physical memory available to user jobs, the type, model, and relative speed of each host, the special resources available on each host, and the time windows when a host is available for load sharing.
Dynamic load information includes, among other items, the space available in the /tmp directory.
LSF Batch lets you submit batch jobs to a queue, which can have access to many hosts on your network, and can automatically run jobs as soon as a suitable host is available. Resource intensive jobs are processed more efficiently because they are scheduled automatically. You do not have to spend time hunting around on the network to find an idle host with the resources that your job needs.
The system administrator can create multiple queues, and can specify policies for each queue that will help to prioritize and schedule the work.
LSF lets you run interactive jobs on any computer on the network, using your own terminal or workstation. Interactive jobs run immediately and normally require some input through a text-based or graphical user interface. All the input and output is transparently sent between the local host and the job execution host.
You can submit interactive jobs using LSF Batch to take advantage of queues and queuing policies. However, an interactive batch job is subject to the scheduling policies of the submission queue, so it may not be dispatched immediately.
Load sharing in LSF is based on clusters. A cluster is simply a group of hosts. Each cluster has one or more LSF administrators. An administrator is a user account that has permission to change the LSF configuration and perform other maintenance functions. An LSF administrator decides how the hosts are grouped together.
A cluster can contain a mixture of host types. By putting all host types into a single cluster, you can have easy access to the resources available on all host types.
Clusters are normally set up based on administrative boundaries. LSF clusters work best when each user has an account on all hosts in the cluster, and user files are shared among the hosts so that they can be accessed from any host. This way LSF can send a job to any host. You need not worry about whether the job will be able to access the correct files.
LSF can also run batch jobs when files are not shared among the hosts. LSF includes facilities to copy files to and from the host where the batch job is run, so your data will always be in the right place.
LSF can also run batch jobs when user accounts are not shared by all hosts in a cluster. Accounts can be mapped across machines.
LSF MultiCluster supports interoperation across clusters. Your jobs can be forwarded transparently to be run on another cluster within your organization.
LSF is designed to continue operating even if some of the hosts in the cluster are unavailable. One host in the cluster acts as the master, but if the master host becomes unavailable another host takes over. LSF services are available as long as there is one available host in the cluster.
When a host crashes, all jobs running on that host are lost. No other pending or running jobs are affected. Important jobs can be submitted to LSF Batch with an option to automatically restart if the job is lost because of a host failure.
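The behavior described above can be sketched as follows. This is a simplified model, not LSF Batch code: the function name, the job table layout, and the `rerunnable` flag are hypothetical stand-ins for the restart option mentioned in the text.

```python
def handle_host_failure(failed_host, running, pending):
    """Model what happens to jobs when one host crashes.

    running: dict mapping job name -> (host, rerunnable flag)
    pending: list of job names waiting to be dispatched
    Returns the names of jobs that are lost for good.
    """
    lost = []
    for job, (host, rerunnable) in list(running.items()):
        if host != failed_host:
            continue  # jobs on other hosts are not affected
        del running[job]
        if rerunnable:
            pending.append(job)  # requeue so the job can be started again
        else:
            lost.append(job)
    return lost


running = {
    "build": ("hostA", True),
    "report": ("hostA", False),
    "simulate": ("hostB", False),
}
pending = ["cleanup"]
lost = handle_host_failure("hostA", running, pending)
print(lost, pending, sorted(running))
```

Only jobs on the failed host are touched; a job marked as rerunnable goes back to the pending list, while the others are reported as lost.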
Figure 1 shows the structure of LSF Base and how it fits into your system. The software modules that make up LSF Base are shaded.
Figure 1. Structure of LSF Base
A server is a host that runs load shared jobs. The Load Information Manager (LIM) and Remote Execution Server (RES) run on every server. The LIM and RES are implemented as daemons that interface directly with the underlying operating systems and provide users with a uniform, host independent environment. The Load Sharing LIBrary (LSLIB) is the basic interface.
The LIM, RES and LSLIB form the LSF Base system.
The LIM on each server monitors its host's load and exchanges load information with other LIMs. On one host in the cluster, the LIM acts as the master. The master LIM collects information for all hosts and provides that information to the applications. The master LIM is chosen among all the LIMs running in the cluster. If the master LIM becomes unavailable, a LIM on another host will automatically take over the role of master.
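The takeover behavior can be illustrated with a small sketch. The text says only that a master is chosen among the running LIMs; the rule used here (the first available host in a fixed candidate order) is an assumption for illustration, and the host names are hypothetical.

```python
def elect_master(candidates, is_available):
    """Return the first available host in candidate order, or None."""
    for host in candidates:
        if is_available(host):
            return host
    return None


candidates = ["hostA", "hostB", "hostC"]
up = {"hostA": True, "hostB": True, "hostC": True}

master = elect_master(candidates, up.get)

# Simulate the master host becoming unavailable; another host takes over.
up["hostA"] = False
new_master = elect_master(candidates, up.get)
print(master, new_master)
```

As long as at least one candidate is available, the function returns a master, which mirrors the statement that LSF services remain available while one host in the cluster is up.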
The LIM provides simple placement advice for interactive tasks. This information is used by some of the lstools(1) applications (for example, lsrun) to determine which host to run on.
The RES on each server accepts remote execution requests and provides fast, transparent and secure remote execution of interactive jobs.
LSLIB is the Application Programming Interface (API) for the LSF Base system, providing easy access to the services of LIM and RES.
The LSF utilities are a suite of products built on top of LSF Base. The utilities include the following products:
The LSF Batch queuing system uses dynamic load information from the LIM to schedule batch jobs in an LSF cluster. LSF Batch is described further in the section `Structure of LSF Batch' on page 10.
The LSF JobScheduler is a separately licensed product of LSF that manages data processing workload in a distributed environment.
A very large organization may divide its computing resources into a number of autonomous clusters, reflecting the structure of the company. The separately licensed LSF MultiCluster product enables load sharing across clusters, resulting in more efficient use of the resources of the entire organization. LSF MultiCluster is described in more detail in `Using LSF MultiCluster' on page 187.
LSF Analyzer is a graphical tool for comprehensive workload data analysis. It processes cluster-wide job logs from LSF Batch and LSF JobScheduler to produce statistical reports on the usage of system resources by users on different hosts through various queues.
lstcsh is a load-sharing version of tcsh, a popular UNIX command interpreter (shell). In addition to all the features of tcsh, lstcsh allows arbitrary UNIX commands and user programs to be executed remotely. Remote execution with lstcsh is completely transparent. For example, you can run vi remotely, suspend it, and resume it. For more information, see `Using lstcsh' on page 149.
LSF Make is a load sharing version of GNU make. It uses the same makefiles as GNU make and behaves similarly, except that you specify the number of hosts to use to run the make tasks in parallel. Tasks are started on multiple hosts simultaneously to reduce the execution time.
lsmake, the LSF Make executable, is covered by the Free Software Foundation General Public License. tcsh is covered by copyrights held by the University of California. Read the file LSF_MISC/lsmake/COPYING in the LSF software distribution for details.
The lstools are a set of utilities for getting information from LSF and running programs on remote hosts. For example, you can write a script that uses the lstools to find the best hosts satisfying given resource requirements, then run jobs on one or more of the selected hosts.
The parallel tools are a set of utilities for users to run parallel applications using message passing packages. PVM and MPI jobs can be submitted to the LSF Batch system through pvmjob and mpijob, shell scripts for running PVM and MPI jobs under LSF Batch. See `Parallel Jobs' on page 181 for more information.
LSF has a comprehensive set of Graphical User Interface tools that give users complete access to the power and flexibility of the LSF Suite with the convenience of point and click.
Most applications can access LSF through LSF utility programs; they do not need to communicate directly with LSF or be modified to work with it. Nearly all UNIX or Windows NT commands and third-party applications can be run using LSF utilities.
LSF Batch is a distributed batch system built on top of the LSF Base system to provide powerful batch job scheduling services to users. Figure 2 shows the components of LSF Batch and the interactions among them.
Figure 2. Structure of LSF Batch
LSF Batch accepts user jobs and holds them in queues until suitable hosts are available. Host selection is based on up-to-date load information from the master LIM, so LSF Batch can take full advantage of all your hosts without overloading any.
LSF Batch runs user jobs on suitable server hosts that the LSF administrator has chosen. LSF Batch has sophisticated controls for sharing hosts with interactive users, so you do not need dedicated hosts to process batch jobs.
There is one master batch daemon (mbatchd) running in each LSF cluster, and one slave batch daemon (sbatchd) running on each batch server host. User jobs are held in batch queues by mbatchd, which checks the load information on all candidate hosts periodically. When a host with the necessary resources becomes available, mbatchd sends the job to the sbatchd on that host for execution. When more than one host is available, the best host is chosen. The sbatchd controls the execution of the jobs and reports job status to the mbatchd.
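The queue-and-dispatch cycle described above can be sketched as a single scheduling pass. This is a simplified model, not the mbatchd implementation: the load threshold, the assumption that each job adds one unit of load, and all names are hypothetical.

```python
from collections import deque


def dispatch_pass(queue, host_load, max_load=1.0):
    """Run one scheduling pass over the pending jobs.

    queue: deque of job names in submission order
    host_load: dict mapping host name -> current load (updated in place)
    Returns the list of (job, host) pairs dispatched during this pass.
    """
    dispatched = []
    while queue:
        # A host is a candidate only while its load is below the threshold.
        candidates = [h for h, load in host_load.items() if load < max_load]
        if not candidates:
            break  # no suitable host; remaining jobs stay in the queue
        best = min(candidates, key=host_load.get)  # pick the least loaded
        job = queue.popleft()
        host_load[best] += 1.0  # assumed: each job adds one unit of load
        dispatched.append((job, best))
    return dispatched


pending = deque(["job1", "job2", "job3"])
load = {"hostA": 0.2, "hostB": 0.6}
placed = dispatch_pass(pending, load)
print(placed, list(pending))
```

Jobs that cannot be placed stay in the queue, so a later pass with fresh load information can dispatch them once a host becomes suitable.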
The LSF Batch Library (LSBLIB) is the Application Programming Interface (API) for LSF Batch, providing easy access to the services of mbatchd and sbatchd. LSBLIB provides a powerful interface for advanced users to develop new batch processing applications in C.
NQS interoperation allows LSF users to submit jobs to remote NQS servers using the LSF user interface. The LSF administrator can configure LSF Batch queues to forward jobs to NQS queues. Users may then use any supported interface, including LSF Batch commands, lsNQS commands, and xlsbatch, to submit, monitor, signal, and delete batch jobs in NQS queues. This feature provides users with a consistent user interface for jobs running under LSF Batch and NQS.