

F. LSF and NQS

This chapter contains information on registering LSF with the Network Queuing System.

Configuring NQS Interoperation

NQS (Network Queuing System) is a UNIX batch queuing facility that allows users to queue batch jobs to individual UNIX hosts from remote systems. This chapter describes how to configure and use LSF to submit and control batch jobs in NQS queues.

If you are not going to configure LSF to interoperate with NQS, you do not need to read this chapter.

While it is desirable to run LSF on all hosts for transparent resource sharing, this is not always possible. Some of the computing resources may be under separate administrative control, or LSF may not currently be available for some of the hosts.

For example, at sites that use Cray supercomputers, the supercomputer is often not under the control of the workstation system administrators, yet users on the workstation cluster still want to run jobs on it. LSF allows users to submit and control jobs on the Cray system using the same interface as they use for jobs on the local cluster.

LSF queues can be configured to forward jobs to remote NQS queues. Users can submit jobs, send signals to jobs, check the status of jobs, and delete jobs that are forwarded to the remote NQS. Although running on an NQS server outside the LSF cluster, jobs are still managed by LSF Batch in almost the same way as jobs running inside the LSF cluster.

Registering LSF with NQS

This section describes how to configure LSF and NQS so that jobs submitted to LSF can be run on NQS servers. To do this, you should already be familiar with the administration of the NQS system.

Hosts

NQS uses a machine identification number (MID) to identify each NQS host in the network. The MID must be unique and must be the same in the NQS database of each host in the network. LSF uses the NQS protocol to talk with NQS daemons for routing, monitoring, signalling and deleting LSF Batch jobs that run on NQS hosts. Therefore, you must assign a MID to each of the LSF hosts that might become the master host.

To do this, perform the following steps:

  1. Log in to the NQS host as the NQS System Administrator or System Operator.
  2. Run the nmapmgr command to create a MID for each LSF host that might become the master host. Then list all available MIDs. See the NQS nmapmgr(1) manual page for a description of this command.

Users

NQS uses a mechanism similar to ruserok(3) to determine whether access is permitted. When a remote request from LSF is received, NQS performs the following checks in order until access is granted:

  1. NQS looks in the /etc/hosts.equiv file. If the submitting host is listed, requests are allowed as long as the user name is the same on both hosts.
  2. If the submitting host is not listed in the /etc/hosts.equiv file, NQS looks for a .rhosts file in the destination user's home directory. This file must contain the names of both the submitting host and the submitting user.
  3. If access still is not granted, NQS checks the /etc/hosts.nqs file. This file is similar to the .rhosts file, but it can also map remote user names to local user names.

Cray NQS also looks for a .nqshosts file in the destination user's home directory. The .nqshosts file has the same format as the .rhosts file.
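
The lookup order described above can be modeled in a few lines of Python. This is only a sketch of the decision order as this chapter describes it, not NQS code: the function name and the in-memory data structures (a set of host names and two dictionaries) are illustrative assumptions and do not reflect the real file formats. The Cray-specific .nqshosts file is not modeled separately because it has the same format as .rhosts.

```python
def nqs_access_allowed(host, remote_user, local_user,
                       hosts_equiv, rhosts, hosts_nqs):
    """Model the documented order of the NQS access checks.

    hosts_equiv: set of trusted host names (stands in for /etc/hosts.equiv)
    rhosts:      dict local_user -> set of (host, remote_user) pairs
                 (stands in for the destination user's .rhosts file)
    hosts_nqs:   dict (host, remote_user) -> local_user
                 (stands in for /etc/hosts.nqs, which can map user names)
    All three structures are hypothetical simplifications.
    """
    # 1. A trusted host is accepted only if the user name matches on both sides.
    if host in hosts_equiv and remote_user == local_user:
        return True
    # 2. Otherwise the destination user's .rhosts must name host and user.
    if (host, remote_user) in rhosts.get(local_user, set()):
        return True
    # 3. Finally, /etc/hosts.nqs may map the remote user to the local user.
    return hosts_nqs.get((host, remote_user)) == local_user
```

For example, with hosts_equiv = {"hostA"} a request from user7 on hostA for the local account user7 passes the first check, while a request for the account user7l needs a matching .rhosts or hosts.nqs entry.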

NQS treats the LSF cluster just as if it were a remote NQS server, except that jobs never flow to the LSF cluster from NQS hosts.

For LSF users to get permission to run jobs on NQS servers, you must make sure the above setup is done properly. Refer to your local NQS documentation for details on setting up the NQS side.

lsb.nqsmaps

The lsb.nqsmaps file in the LSB_CONFDIR/cluster/configdir directory configures interoperation between LSF and NQS.

Hosts

LSF must use the MIDs of NQS hosts when talking with NQS servers. The Hosts section of the LSB_CONFDIR/cluster/configdir/lsb.nqsmaps file contains the MIDs and operating system types of your NQS hosts.

Begin Hosts
HOST_NAME        MID    OS_TYPE
cray001          1      UNICOS      #NQS host, must specify OS_TYPE
sun0101          2      SOLARIS     #NQS host
sgi006           3      IRIX        #NQS host
hostA            4      -           #LSF host; OS_TYPE is ignored
hostD            5      -           #LSF host
hostB            6      -           #LSF host
End Hosts

Note that the OS_TYPE column is required for NQS hosts only. For hosts in the LSF cluster, OS_TYPE is ignored; the type is specified by the TYPE field in the lsf.cluster.cluster file. The `-' entry is a placeholder.
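
To check what a Hosts section says before installing it, the section can be read with a short script. The following Python sketch is an illustration only and is not part of LSF: the function name parse_section and the SAMPLE text are assumptions, and the parser handles only the simple layout shown above (a keyword line followed by whitespace-separated columns, with # starting a comment).

```python
def parse_section(text, name):
    """Return the rows of a 'Begin <name>' ... 'End <name>' block as dicts."""
    rows, header, inside = [], None, False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        if line == f"Begin {name}":
            inside, header = True, None
        elif line == f"End {name}":
            inside = False
        elif inside and header is None:
            header = line.split()             # first line holds the keywords
        elif inside:
            rows.append(dict(zip(header, line.split())))
    return rows


SAMPLE = """\
Begin Hosts
HOST_NAME        MID    OS_TYPE
cray001          1      UNICOS      #NQS host, must specify OS_TYPE
hostA            4      -           #LSF host; OS_TYPE is ignored
End Hosts
"""

hosts = parse_section(SAMPLE, "Hosts")
mids = {h["HOST_NAME"]: int(h["MID"]) for h in hosts}
print(mids)
```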

User Name Mapping

LSF assumes that users have the same account names and user IDs on all LSF hosts. If the user accounts on the NQS hosts are not the same as on the LSF hosts, the LSF administrator must specify the NQS usernames that correspond to LSF users.

The Users section of the lsb.nqsmaps file contains entries for LSF users and the corresponding account names on NQS hosts. The following example shows two users who have different accounts on the NQS server hosts.

Begin Users
FROM_NAME      TO_NAME
user7          (user7l@cray001 luser7@sgi006)
user4          (suser4@cray001)
End Users

FROM_NAME is the user's login name in the LSF cluster, and TO_NAME is a list of the user's login names on the remote NQS hosts. If a user is not specified in the lsb.nqsmaps file, jobs are sent to the NQS hosts with the same user name.
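
The mapping rule can be sketched as a lookup with a fallback. The Python below is a minimal illustration under stated assumptions, not LSF code: parse_users and remote_user are hypothetical helper names, and the parser accepts only the one-line form shown in the example above.

```python
import re

USER_LINE = re.compile(r"(\S+)\s+\((.*)\)$")


def parse_users(text):
    """Build {lsf_user: {nqs_host: nqs_user}} from the lines of a Users section."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        m = USER_LINE.match(line)
        if not m:                             # Begin/End lines and the keyword line
            continue
        per_host = {}
        for target in m.group(2).split():
            user, host = target.split("@", 1)
            per_host[host] = user
        mapping[m.group(1)] = per_host
    return mapping


def remote_user(mapping, lsf_user, nqs_host):
    """Return the mapped name, or the LSF name itself when no entry applies."""
    return mapping.get(lsf_user, {}).get(nqs_host, lsf_user)


SAMPLE = """\
Begin Users
FROM_NAME      TO_NAME
user7          (user7l@cray001 luser7@sgi006)
user4          (suser4@cray001)
End Users
"""

users = parse_users(SAMPLE)
print(remote_user(users, "user7", "sgi006"))
print(remote_user(users, "user4", "sgi006"))
```

In this sketch user4 has no entry for sgi006, so the LSF name is used unchanged on that host, as it is for any user who does not appear in the file at all.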

Configuring Queues for NQS Jobs

You must configure one or more LSF Batch queues to forward jobs to remote NQS hosts. A forward queue is an LSF Batch queue with the parameter NQS_QUEUES defined. The following queue forwards jobs to the NQS queue named pipe on host cray001:

Begin Queue
QUEUE_NAME  = nqsUse
PRIORITY    = 30
NICE        = 15
QJOB_LIMIT  = 5
UJOB_LIMIT  = ()
CPULIMIT    = 15
NQS_QUEUES  = pipe@cray001
DESCRIPTION = Jobs submitted to this queue are forwarded to NQS_QUEUES
USERS       = all
End Queue

You can specify more than one NQS queue for the NQS_QUEUES parameter. LSF Batch tries to send the job to each queue in the order in which they are listed, until one of the queues accepts the job.
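
The ordering rule amounts to a first-accept loop over the listed destinations. The Python sketch below only illustrates that rule: forward_job and the try_submit callback are hypothetical names that stand in for whatever LSF Batch does internally, and batch@sgi006 is an invented second destination.

```python
def forward_job(job, nqs_queues, try_submit):
    """Offer the job to each 'queue@host' in listed order; return the taker.

    nqs_queues: the NQS_QUEUES value, e.g. "pipe@cray001 batch@sgi006"
    try_submit: hypothetical callback returning True if the queue accepts
    """
    for destination in nqs_queues.split():
        if try_submit(job, destination):
            return destination
    return None   # no queue accepted the job


# Pretend that only the queue on sgi006 accepts jobs right now.
accepts_on_sgi = lambda job, dest: dest.endswith("@sgi006")
chosen = forward_job("job1", "pipe@cray001 batch@sgi006", accepts_on_sgi)
print(chosen)
```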

Since many features of LSF are not supported by NQS, the following queue configuration parameters are ignored for NQS forward queues: PJOB_LIMIT, POLICIES, RUN_WINDOW, DISPATCH_WINDOW, RUNLIMIT, HOSTS, MIG. In addition, scheduling load threshold parameters are ignored because NQS does not provide load information about hosts.

Handling Cray NQS Incompatibilities

Cray NQS is incompatible with some of the public domain versions of NQS. Different versions of NQS on Cray may be incompatible with each other. If your NQS server host is a Cray, some additional steps may be needed in order for LSF to understand the NQS protocol correctly.

If the NQS version on a Cray is NQS 80.42 or NQS 71.3, then no extra setup is needed. For other versions of NQS on a Cray, you need to define NQS_REQUESTS_FLAGS and NQS_QUEUES_FLAGS in the lsb.params file.

NQS_REQUESTS_FLAGS = integer

If the version is NQS 1.1 on a Cray, the value of this flag is 251918848.

For other versions of NQS on a Cray, do the following to get the value for this flag. Run the NQS command:

% qstat -h CrayHost -a

on a workstation, where CrayHost is the host name of the Cray machine. Watch the messages logged by Cray NQS (you need access to the NQS log file on the Cray host):

03/02 12:31:59 I pre_server(): Packet type=<NPK_QSTAT(203)>.
03/02 12:31:59 I pre_server(): Packet contents are as follows:
03/02 12:31:59 I pre_server(): Npk_str[1] = <>.
03/02 12:31:59 I pre_server(): Npk_str[2] = <platform>.
03/02 12:31:59 I pre_server(): Npk_int[1] = <1392767360>.
03/02 12:31:59 I pre_server(): Npk_int[2] = <2147483647>.
03/02 12:31:59 I show_qstat_flags(): Flags=SHO_R_ALLUID SHO_R_SHORT \
    SHO_RS_RUN SHO_RS_STAGE SHO_RS_QUEUED SHO_RS_WAIT SHO_RS_HOLD \
    SHO_RS_ARRIVE SHO_Q_BATCH SHO_Q_PIPE SHO_R_FULL SHO_R_HDR

The value of Npk_int[1] in the above output is the value you need for the parameter NQS_REQUESTS_FLAGS.

NQS_QUEUES_FLAGS = integer

To get the value for this flag, run the NQS command:

% qstat -h CrayHost -p -b -l

on a workstation, where CrayHost is the host name of the Cray machine. Watch the messages logged by Cray NQS (you need to have access to the Cray NQS log file):

03/02 12:32:57 I pre_server(): Packet type=<NPK_QSTAT(203)>.
03/02 12:32:57 I pre_server(): Packet contents are as follows:
03/02 12:32:57 I pre_server(): Npk_str[1] = <>.
03/02 12:32:57 I pre_server(): Npk_str[2] = <platform>.
03/02 12:32:57 I pre_server(): Npk_int[1] = <593494199>.
03/02 12:32:57 I pre_server(): Npk_int[2] = <2147483647>.
03/02 12:32:57 I show_qstat_flags(): Flags=SHO_H_ACCESS SHO_H_DEST \
    SHO_H_LIM SHO_H_RUNL SHO_H_SERV SHO_R_ALLUID SHO_Q_HDR \
    SHO_Q_LIMITS SHO_Q_BATCH SHO_Q_PIPE SHO_Q_FULL

The value of Npk_int[1] in the above output is the value you need for the parameter NQS_QUEUES_FLAGS.
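
Both flag values are read from the log in the same way, so the extraction can be scripted. The Python sketch below assumes that the log lines look exactly like the excerpts in this section; the function name and the SAMPLE_LOG text are illustrative, and the script does nothing to obtain the log from the Cray host.

```python
import re

NPK_INT1 = re.compile(r"Npk_int\[1\]\s*=\s*<(\d+)>")


def qstat_flag_values(log_text):
    """Return every Npk_int[1] value that follows an NPK_QSTAT packet line."""
    values, in_qstat = [], False
    for line in log_text.splitlines():
        if "Packet type=" in line:
            # A new packet starts; remember whether it is a qstat packet.
            in_qstat = "NPK_QSTAT" in line
        elif in_qstat:
            m = NPK_INT1.search(line)
            if m:
                values.append(int(m.group(1)))
    return values


SAMPLE_LOG = """\
03/02 12:32:57 I pre_server(): Packet type=<NPK_QSTAT(203)>.
03/02 12:32:57 I pre_server(): Packet contents are as follows:
03/02 12:32:57 I pre_server(): Npk_str[1] = <>.
03/02 12:32:57 I pre_server(): Npk_str[2] = <platform>.
03/02 12:32:57 I pre_server(): Npk_int[1] = <593494199>.
03/02 12:32:57 I pre_server(): Npk_int[2] = <2147483647>.
"""

for value in qstat_flag_values(SAMPLE_LOG):
    print(f"NQS_QUEUES_FLAGS = {value}")
```

The script cannot tell which qstat invocation produced a packet, so run one command at a time and match each value to the parameter by the time stamp of the command.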

If you are unable to get the required information after running the above NQS commands, make sure that your Cray NQS is configured properly to log these parameters. To do this, run:

% qmgr

and enter show all to get all information. The parameters related to the logging of the information you need are:

Debug level = 3
MESSAGE_Header = Short
MESSAGE_Types:
    Accounting         OFF    CHeckpoint         OFF    COMmand_flow       OFF
    CONfig             OFF    DB_Misc            OFF    DB_Reads           OFF
    DB_Writes          OFF    Flow               OFF    NETWORK_Misc       ON
    NETWORK_Reads      ON     NETWORK_Writes     ON     OPer               OFF
    OUtput             OFF    PACKET_Contents    ON     PACKET_Flow        ON
    PROTOCOL_Contents  ON     PROTOCOL_Flow      ON     RECovery           OFF
    REQuest            OFF    ROuting            OFF    Scheduling         OFF
    USER1              OFF    USER2              OFF    USER3              OFF
    USER4              OFF    USER5              OFF

NQS Forward Queues

To interoperate with NQS, you must configure one or more LSF Batch queues to forward jobs to remote NQS hosts. An NQS forward queue is an LSF Batch queue with the parameter NQS_QUEUES defined.

NQS_QUEUES = queue_name@host_name ...

host_name is an NQS host name which can be the official host name or an alias name known to the LSF master host through gethostbyname(3). queue_name is the name of an NQS queue on this host. NQS destination queues are considered for job routing in the order in which they are listed here. If a queue accepts the job, then it is routed to that queue. If no queue accepts the job, it remains pending in the NQS forward queue.

The lsb.nqsmaps file (see `The lsb.nqsmaps File') must be present in order for LSF Batch to route jobs in this queue to NQS systems.

Since many features of LSF are not supported by NQS, the following queue configuration parameters are ignored for NQS forward queues: PJOB_LIMIT, POLICIES, RUN_WINDOW, DISPATCH_WINDOW, RUNLIMIT, HOSTS, MIG. In addition, scheduling load threshold parameters are ignored because NQS does not provide load information about hosts.

Default: undefined.

DESCRIPTION = text

A brief description of the job queue. This information is displayed by the bqueues -l command. The description can include any characters, including white space. The description can be extended to multiple lines by ending the preceding line with a backslash `\'. The maximum length for the description is 512 characters.

This description should clearly describe the service features of this queue to help users select the proper queue for each job.

The lsb.nqsmaps File

The lsb.nqsmaps file contains information on configuring LSF for interoperation with NQS. This file is optional.

Hosts

NQS uses a machine identification number (MID) to identify each host in the network that communicates using the NQS protocol. This MID must be unique and must be the same in the NQS database of each host in the network. The MID is assigned and entered into the NQS database using the NQS program nmapmgr(1m) or the Cray NQS command qmgr(8). mbatchd uses the NQS protocol to talk with NQS daemons for routing, monitoring, signalling, and deleting LSF Batch jobs that run on NQS hosts. Therefore, the MIDs of the LSF master host and of any LSF host that might become the master host when the current master host is down must be assigned and entered into the NQS database of each host that might process LSF Batch jobs.

In the mandatory Hosts section, list the MIDs of the LSF master host (and potential master hosts) and the NQS hosts that are specified in the lsb.queues file. If an NQS destination queue specified in the lsb.queues file is a pipe queue, the MIDs of all the destination hosts of this pipe queue must be listed here. If a destination queue of this pipe queue is itself a pipe queue, the MIDs of the destination hosts of this queue must also be listed, and so forth.
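
The rule for pipe queues is transitive, so the set of NQS hosts to list is a reachability computation over the pipe queue destinations. The Python sketch below assumes that you have already written down the destinations of each pipe queue (for example, from your NQS administrator); the function name, the dictionary layout and every queue name other than pipe@cray001 are hypothetical.

```python
def nqs_hosts_to_list(start_queues, pipe_destinations):
    """Collect every host reachable from the NQS_QUEUES destinations.

    start_queues:      'queue@host' strings taken from lsb.queues
    pipe_destinations: dict 'queue@host' -> list of 'queue@host' strings;
                       queues that are not pipe queues have no entry
    """
    hosts, seen, todo = set(), set(), list(start_queues)
    while todo:
        queue = todo.pop()
        if queue in seen:          # guards against pipe queues that loop
            continue
        seen.add(queue)
        hosts.add(queue.split("@", 1)[1])
        todo.extend(pipe_destinations.get(queue, []))
    return hosts


pipes = {
    "pipe@cray001": ["batch@cray001", "pipe2@sun0101"],
    "pipe2@sun0101": ["batch@sgi006"],
}
print(sorted(nqs_hosts_to_list(["pipe@cray001"], pipes)))
```

The result covers only the NQS side. The LSF master host and the potential master hosts must still be added to the Hosts section by hand.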

There are three mandatory keywords in this section:

HOST_NAME

The name of an LSF or NQS host. It can be the official host name or an alias host name known to the master batch daemon (mbatchd) through gethostbyname(3).

MID

The machine identification number of an LSF or NQS host. It is assigned by the NQS administrator to each host communicating using the NQS protocol.

OS_TYPE

The operating system (OS) type of the NQS host. At present, its value can be one of ULTRIX, HPUX, AIX, SOLARIS, SUNOS, IRIX, OSF1, CONVEX or UNICOS. It is used by mbatchd to deliver the correct signals to the LSF Batch jobs running on this NQS host. An incorrect OS type would cause unpredictable results. If the host is an LSF host, OS_TYPE is ignored and `-' must be used as a placeholder; the host type is taken from the type field of the Host section in the lsf.cluster.cluster file.

Begin Hosts
HOST_NAME        MID    OS_TYPE
cray001          1      UNICOS      #NQS host, must specify OS_TYPE
sun0101          2      SOLARIS     #NQS host
sgi006           3      IRIX        #NQS host
hostA            4      -           #LSF host; OS_TYPE is ignored
hostD            5      -           #LSF host
hostC            6      -           #LSF host
End Hosts
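
Because an incorrect OS type or a duplicate MID can cause problems that are hard to trace, a simple consistency check on the entries may be useful before the file is installed. The Python sketch below is an illustration, not an LSF tool: check_hosts is a hypothetical name, the rows are assumed to have been read from the Hosts section already, and the script cannot tell whether a host marked with `-' really is an LSF host.

```python
VALID_OS_TYPES = {"ULTRIX", "HPUX", "AIX", "SOLARIS", "SUNOS",
                  "IRIX", "OSF1", "CONVEX", "UNICOS"}


def check_hosts(rows):
    """rows: (host_name, mid, os_type) tuples. Return a list of problems."""
    problems, owner_of_mid = [], {}
    for host, mid, os_type in rows:
        # Every MID must be unique across the section.
        if mid in owner_of_mid:
            problems.append(f"{host}: MID {mid} already used by {owner_of_mid[mid]}")
        else:
            owner_of_mid[mid] = host
        # '-' is the placeholder for LSF hosts; anything else must be a known type.
        if os_type != "-" and os_type not in VALID_OS_TYPES:
            problems.append(f"{host}: unknown OS_TYPE {os_type}")
    return problems


rows = [("cray001", "1", "UNICOS"), ("sgi006", "1", "IRIS"), ("hostA", "4", "-")]
for problem in check_hosts(rows):
    print(problem)
```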

Users

LSF assumes shared and uniform user accounts on all of the LSF hosts. However, if the user accounts on NQS hosts are not the same as on LSF hosts, account mapping is needed so that the network server on the remote NQS host can take on the proper identity attributes. The mapping is performed for all NQS network conversations. In addition, the user name and the remote host name may need to match an entry either in the .rhosts file in the user's home directory, or in the /etc/hosts.equiv file, or in the /etc/hosts.nqs file on the server host. For Cray NQS, the entry may be either in the .rhosts file or in the .nqshosts file in the user's home directory.

This optional section defines the user name mapping from the LSF master host to each of the NQS hosts listed in the Hosts section above (that is, the hosts on which the jobs routed by LSF Batch may run). There are two mandatory keywords:

FROM_NAME

The name of an LSF Batch user. It is a valid login name on the LSF master host.

TO_NAME

A list of user names on NQS hosts to which the corresponding FROM_NAME is mapped. Each of the user names is specified in the form username@hostname. The hostname is the official name or an alias name of an NQS host, while the username is a valid login name on this NQS host. The TO_NAME of a user on a specific NQS host should always be the same when the user's name is mapped from different hosts. If no TO_NAME is specified for an NQS host, LSF Batch assumes that the user has the same user name on this NQS host as on an LSF host.

Begin Users
FROM_NAME       TO_NAME
user3          (user3l@cray001 luser3@sgi006)
user1          (suser1@cray001) # assumed to be user1@sgi006
End Users

If a user is not specified in the lsb.nqsmaps file, jobs are sent to NQS hosts with the same name the user has in LSF.





doc@platform.com

Copyright © 1994-1998 Platform Computing Corporation.
All rights reserved.