This chapter describes some common problems with LSF and LSF Batch operations, answers some frequently asked questions, and provides some instructions for solving problems.
When something goes wrong, the daemons almost always log an error message. The first step is to find the appropriate log and see whether there are any messages.
Specific error log messages are listed in `Error Messages' on page 245.
Error messages of LSF servers are logged either in syslog(3) or in specified files. This is determined by the LSF_LOGDIR definition in the lsf.conf file. For complete instructions on finding the LSF server logs, see `Managing Error Logs' on page 45.
If you configure LSF to log daemon messages using syslog, the destination file is determined by the syslog configuration. On most systems, you can find out which file the LSF messages are logged in with the command:
grep daemon /etc/syslog.conf
Once you have found the syslog file, you can select the LSF error messages with the command:
egrep 'lim|res|batchd' syslog_file
Look at the /etc/syslog.conf file and the manual page for syslog or syslogd for help in finding the system logs.
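The egrep pattern shown earlier also matches unrelated words that contain those strings, such as "address" or "limit". As a rough sketch, a stricter filter could require the daemon name to stand alone as a word. The function name lsf_syslog_lines is hypothetical, and the sketch assumes the daemons tag their messages with the names lim, res, mbatchd or sbatchd.

```shell
# Hypothetical helper: print syslog lines in which an LSF daemon name appears
# as a whole word. Assumes messages are tagged lim, res, mbatchd or sbatchd.
lsf_syslog_lines() {
    grep -E '(^|[^[:alnum:]_])(lim|res|[ms]batchd)([^[:alnum:]_]|$)' "$1"
}
```

Pass the syslog file found in the previous step as the only argument.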
When searching for log messages from LSF servers, you are more likely to find them on the remote machine where LSF put the task than on your local machine where the command was given.
LIM problems are usually logged on the master host. Run lsid to find the master host, and check syslog or the lim.log.hostname file on the master host. The res.log.hostname file contains messages about RES problems, execution problems and setup problems for LSF. Most problems with interactive applications are logged in the remote machine's log files.
Errors from LSF Batch are logged either in the mbatchd.log.hostname file on the master host, or the sbatchd.log.hostname file on the execution host. The bjobs or bhist command tells you the execution host for a specific job.
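When the daemons log to files, the four log names described above follow one pattern. The sketch below only builds the file names; the function name lsf_log_files is hypothetical, and the log directory is passed in as an argument because its location depends on the LSF_LOGDIR setting at your site.

```shell
# Hypothetical helper: list the daemon log file names for one host.
# $1 = log directory (the LSF_LOGDIR value), $2 = host name.
lsf_log_files() {
    for daemon in lim res mbatchd sbatchd; do
        printf '%s/%s.log.%s\n' "$1" "$daemon" "$2"
    done
}
```

The output can be fed to ls or tail to see which of the logs exist on a given host and what they last recorded.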
Most LSF log messages include the name of an internal LSF function to help the developers locate problems. Many error messages can be generated in more than one place, so it is important to report the entire error message when you ask for technical support.
A frequent problem with LSF is non-accessible files due to a non-uniform file space. If a task is run on a remote host where a file it requires cannot be accessed using the same name, an error results. Almost all interactive LSF commands fail if the user's current working directory cannot be found on the remote host.
If you are running NFS, rearranging the NFS mount table may solve the problem. If your system is running the automount server, LSF tries to map the filenames, and in most cases it succeeds. If shared mounts are used, the mapping may break for those files. In such cases, specific measures need to be taken to get around it.
The automount maps must be managed through NIS. When LSF tries to map filenames, it assumes that automounted file systems are mounted under the /tmp_mnt directory.
To share files among NT machines, set up a share on the server and access it from the client. You can access files on the share either by specifying a UNC path (\\server\share\path) or by connecting the share to a local drive name and using a drive:\path syntax. Using UNC is recommended because drive mappings may be different across machines, while UNC allows you to unambiguously refer to a file on the network.
For file sharing across UNIX and NT, you require a third party NFS product on NT to export directories from NT to UNIX.
This section lists some other common problems with the LIM, the RES and interactive applications.
Run the following command to check for errors in the LIM configuration files.
% lsadmin ckconfig -v
This displays most configuration errors. If this does not report any errors, check in the LIM error log.
Sometimes the LIM is up, but executing the lsload command prints the following error message:
Communication time out.
If the LIM has just been started, this is normal, because the LIM needs time to get initialized by reading configuration files and contacting other LIMs.
If the LIM does not become available within one or two minutes, check the LIM error log on the local host.
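The waiting period can be scripted. The following is a minimal sketch, assuming that lsload exits with a non-zero status while the LIM is not answering; the function name wait_for_lim and its two parameters are hypothetical.

```shell
# Hypothetical helper: poll lsload until the LIM answers or the attempts run out.
# $1 = number of attempts, $2 = seconds to wait between attempts.
# Assumes lsload returns a non-zero exit status while the LIM is not responding.
wait_for_lim() {
    attempts=$1
    delay=$2
    while [ "$attempts" -gt 0 ]; do
        if lsload >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        if [ "$attempts" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, wait_for_lim 12 10 polls for roughly two minutes. If it returns a non-zero status, go on to the LIM error log.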
When the local LIM is running but there is no master LIM in the cluster, LSF applications display the following message:
Cannot locate master LIM now, try later.
Check the LIM error logs on the first few hosts listed in the "Host" section of the lsf.cluster.cluster file.
On UNIX hosts, if the RES is unable to read the lsf.conf file and does not know where to write error messages, it logs errors into syslog(3).
On NT hosts, if the RES is unable to read the lsf.conf file and does not know where to write error messages, it logs errors into C:\temp.
If remote execution fails with the following error message, the remote host could not securely determine the user ID of the user requesting remote execution.
User permission denied.
Check the RES error log on the remote host; this usually contains a more detailed error message.
If you are not using an identification daemon (LSF_AUTH is not defined in the lsf.conf file), then all applications that do remote executions must be owned by root with the setuid bit set. This can be done as follows.
% chmod 4755 filename
If the binaries are on an NFS-mounted file system, make sure that the file system is not mounted with the nosuid flag.
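To confirm that a binary really ended up setuid root, a check along the following lines may help. This is a sketch only: the function name is_setuid_root is hypothetical, and it inspects the file's owner and mode bits but cannot tell whether the file system is mounted with nosuid.

```shell
# Hypothetical helper: succeed only if the file is owned by uid 0 (root)
# and has the setuid bit set. It does not detect a nosuid mount.
is_setuid_root() {
    [ -u "$1" ] || return 1
    owner=$(ls -lnLd "$1" | awk '{ print $3 }')
    [ "$owner" = "0" ]
}
```

Call it with the path of each binary that performs remote execution; a non-zero exit status means the ownership or the mode still needs fixing.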
If you are using an identification daemon (defined in the lsf.conf file by LSF_AUTH), inetd must be configured to run the daemon. The identification daemon must not be run directly.
If LSF_USE_HOSTEQUIV is defined in the lsf.conf file, check whether /etc/hosts.equiv or $HOME/.rhosts on the destination host has the client host name in it. Inconsistent host names in a name server with /etc/hosts and /etc/hosts.equiv can also cause this problem.
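A quick way to perform that check on the destination host is sketched below. The function name host_listed is hypothetical, and the sketch is deliberately simple: it compares the first field of each line exactly, so it ignores wildcard entries, netgroups and differences between short and fully qualified host names.

```shell
# Hypothetical helper: succeed if host $2 is named in the first field of the
# hosts.equiv-style file $1. Comment lines are skipped; wildcard and netgroup
# entries are not interpreted.
host_listed() {
    awk -v h="$2" '
        /^[ \t]*#/ { next }
        $1 == h    { found = 1 }
        END        { exit !found }
    ' "$1"
}
```

Run it once against /etc/hosts.equiv and once against the user's .rhosts file, using the client host name exactly as the name server reports it.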
On SGI hosts running a name server, you can try the following command to tell the host name lookup code to search the /etc/hosts file before calling the name server.
% setenv HOSTRESORDER "local,nis,bind"
A command may fail with the following error message due to a non-uniform file name space.
chdir(...) failed: no such file or directory
You are trying to execute a command remotely, where either your current working directory does not exist on the remote host, or your current working directory is mapped to a different name on the remote host.
If your current working directory does not exist on a remote host, you should not execute commands remotely on that host.
If the directory exists, but is mapped to a different name on the remote host, you have to create symbolic links to make them consistent.
LSF can resolve most, but not all, problems using automount. The automount maps must be managed through NIS. Follow the instructions in your Release Notes for obtaining technical support if you are running automount and LSF is not able to locate directories on remote hosts.
This section lists some common problems with LSF Batch.
Most problems are due to incorrect installation or configuration. Check the mbatchd and sbatchd error log files; often the log messages point directly to the problem.
First, check the sbatchd and mbatchd error logs. Try running the following command to check the configuration.
% badmin ckconfig
This reports most errors. You should also check if there is any email from LSF Batch in the LSF administrator's mailbox. If the mbatchd is running but the sbatchd dies on some hosts, it may be because mbatchd has not been configured to use those hosts. See `Host Not Used By LSF Batch' on page 244.
Check whether LIM is running. You can test this by running the lsid command. If LIM is not running properly, follow the suggestions in this chapter to fix the LIM first. You should make sure that LSF and LSF Batch are using the same lsf.conf file. Note that it is possible that mbatchd is temporarily unavailable because the master LIM is temporarily unknown, causing the following error message.
sbatchd: unknown service
Check whether services are registered properly. See `Registering LSF Service Ports' on page 84 of the LSF Installation Guide.
If you configure a list of server hosts in the Host section of the lsb.hosts file, mbatchd allows sbatchd to run only on the hosts listed. If you try to configure an unknown host in the HostGroup or HostPartition sections of the lsb.hosts file, or as a HOSTS definition for a queue in the lsb.queues file, mbatchd logs the following message.
mbatchd on host: LSB_CONFDIR/cluster/configdir/file(line #): Host hostname is not used by lsbatch; ignored
If you start sbatchd on a host that is not known by mbatchd, mbatchd rejects the sbatchd. The sbatchd logs the following message and exits.
This host is not used by lsbatch system.
Both of these errors are most often caused by not running the following commands, in order, after adding a host to the configuration.
lsadmin reconfig
badmin reconfig
You must run both of these before starting the daemons on the new host.
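Because the order matters, the two commands can be wrapped so that the second runs only when the first succeeds. This is a sketch under assumptions: the wrapper name reconfig_after_host_change is hypothetical, both commands may prompt for confirmation, and the sketch assumes lsadmin returns a non-zero exit status when reconfiguration fails.

```shell
# Hypothetical wrapper: reconfigure the LIMs first, then mbatchd.
# badmin reconfig runs only if lsadmin reconfig reports success.
reconfig_after_host_change() {
    lsadmin reconfig && badmin reconfig
}
```

Start the daemons on the new host only after the wrapper returns successfully.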
The following error messages are logged by the LSF daemons, or displayed by the following commands.
lsadmin ckconfig
badmin ckconfig
LSF daemon message logs are described in `Managing Error Logs' on page 45.
The messages listed in this section may be generated by any LSF daemon.
can't open file: error
The daemon could not open the named file for the reason given by error. This error is usually caused by incorrect file permissions or missing files. All directories in the path to the configuration files must have `x' permission for the LSF administrator, and the actual files must have `r' permission. Missing files could be caused by incorrect path names in the lsf.conf file, running LSF daemons on a host where the configuration files have not been installed, or having a symbolic link pointing to a nonexistent file or directory.
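The causes listed above can be checked one by one with a small script. The following is a sketch only; the function name check_config_path is hypothetical, and because the permission tests apply to whoever runs it, it should be run as the LSF administrator on the host where the daemon reported the error.

```shell
# Hypothetical helper: report why a configuration file might fail to open.
# The permission tests apply to the current user, so run it as the LSF administrator.
check_config_path() {
    path=$1
    if [ -L "$path" ] && [ ! -e "$path" ]; then
        echo "dangling symbolic link: $path"
        return 1
    fi
    if [ ! -e "$path" ]; then
        echo "missing: $path"
        return 1
    fi
    # Every directory above the file needs search (x) permission.
    dir=$(dirname "$path")
    while :; do
        if [ ! -x "$dir" ]; then
            echo "no x permission: $dir"
            return 1
        fi
        case $dir in
            /|.) break ;;
        esac
        dir=$(dirname "$dir")
    done
    if [ ! -r "$path" ]; then
        echo "no r permission: $path"
        return 1
    fi
    echo "ok: $path"
}
```

Pass it the file name exactly as it appears in the error message.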
file(line): malloc failed
Memory allocation failed. Either the host does not have enough available memory or swap space, or there is an internal error in the daemon. Check the program load and available swap space on the host; if the swap space is full, you must add more swap space or run fewer (or smaller) programs on that host.
auth_user: getservbyname(ident/tcp) failed: error; ident must be registered in services
LSF_AUTH=ident is defined in the lsf.conf file, but the ident/tcp service is not defined in the services database. Add ident/tcp to the services database, or remove LSF_AUTH from the lsf.conf file and make those LSF binaries that require authentication setuid root.
auth_user: operation(<host>/<port>) failed: error
LSF_AUTH=ident is defined in the lsf.conf file, but the LSF daemon failed to contact the ident daemon on host. Check that ident is defined in host's inetd.conf and the ident daemon is running on host.
auth_user: Authentication data format error (rbuf=<data>) from <host>/<port>
auth_user: Authentication port mismatch (...) from <host>/<port>
LSF_AUTH=ident is defined in the lsf.conf file, but there is a protocol error between LSF and the ident daemon on host. Make sure the ident daemon on host is configured correctly.
userok: Request from bad port (<portno>), denied
LSF_AUTH is not defined, and the LSF daemon received a request that originates from a non-privileged port. The request is not serviced.
Set the LSF binaries (for example, lsrun) to be owned by root with the setuid bit set, or define LSF_AUTH=ident and set up an ident server on all hosts in the cluster. If the binaries are on an NFS-mounted file system, make sure that the file system is not mounted with the nosuid flag.
userok: Forged username suspected from <host>/<port>: <claimed user>/<actual user>
The service request claimed to come from user claimed user but ident authentication returned that the user was actually actual user. The request was not serviced.
userok: ruserok(<host>,<uid>) failed
LSF_USE_HOSTEQUIV is defined in the lsf.conf file, but host has not been set up as an equivalent host (see /etc/hosts.equiv), and user uid has not set up a .rhosts file.
init_AcceptSock: RES service(res) not registered, exiting
init_AcceptSock: res/tcp: unknown service, exiting
initSock: LIM service not registered. See LSF Guide for help
initSock: Service lim/udp is unknown. Read LSF Guide for help
get_ports: <serv> service not registered
The LSF services are not registered. See `Registering LSF Service Ports' on page 84 of the LSF Installation Guide.
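Before re-registering anything, it can help to see which entries are actually missing. The sketch below looks for a service name and protocol in a services file; the function name service_registered is hypothetical, and if your site distributes the services database through NIS, a local file such as /etc/services may not be the authoritative copy.

```shell
# Hypothetical helper: succeed if service $2 with protocol $3 has an entry in
# the services file $1 (for example /etc/services).
service_registered() {
    awk -v name="$2" -v proto="$3" '
        { sub(/#.*/, "") }                          # drop comments
        $1 == name && $2 ~ ("/" proto "$") { found = 1 }
        END { exit !found }
    ' "$1"
}
```

For instance, the messages above mention res/tcp and lim/udp, so service_registered /etc/services res tcp and service_registered /etc/services lim udp are natural first checks.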
init_AcceptSock: Can't bind daemon socket to port <port>: error, exiting
init_ServSock: Could not bind socket to port <port>: error
These error messages can occur if you try to start a second LSF daemon (for example, RES is already running, and you execute RES again). If this is the case, and you want to start the new daemon, kill the running daemon or use the lsadmin or badmin commands to shut down or restart the daemon.
The messages listed in this section are caused by problems in the LSF configuration files. General errors are listed first, and then errors from specific files.
file(line): Section name expected after Begin; ignoring section
file(line): Invalid section name name; ignoring section
The keyword begin at the specified line is not followed by a section name, or is followed by an unrecognized section name.
file(line): section section: Premature EOF
The end of file was reached before reading the end section line for the named section.
file(line): keyword line format error for section section; Ignore this section
The first line of the section should contain a list of keywords. This error is printed when the keyword line is incorrect or contains an unrecognized keyword.
file(line): values do not match keys for section section; Ignoring line
The number of fields on a line in a configuration section does not match the number of keywords. This may be caused by not putting () in a column to represent the default value.
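A field-count mismatch can be located mechanically. The following sketch compares each line of a section with the keyword line that follows Begin. It rests on simplifying assumptions that are not part of LSF: sections are delimited by lines starting with Begin and End, fields are separated by white space, and no value contains embedded spaces. The function name check_field_counts is hypothetical.

```shell
# Hypothetical helper: flag lines whose field count differs from the keyword
# line of their section. Assumes Begin/End delimiters and whitespace-separated
# fields with no embedded spaces; sections in other formats will be misreported.
check_field_counts() {
    awk '
        /^[ \t]*#/ || NF == 0  { next }                      # comments, blank lines
        tolower($1) == "begin" { in_sec = 1; nkeys = 0; next }
        tolower($1) == "end"   { in_sec = 0; next }
        in_sec && nkeys == 0   { nkeys = NF; next }          # keyword line
        in_sec && NF != nkeys  {
            printf "%s(%d): %d fields, expected %d\n", FILENAME, FNR, NF, nkeys
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Each reported line is a candidate for a missing () placeholder.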
file: HostModel section missing or invalid
file: Resource section missing or invalid
file: HostType section missing or invalid
The HostModel, Resource, or HostType section in the lsf.shared file is either missing or contains an unrecoverable error.
file(line): Name name reserved or previously defined. Ignoring index
The name assigned to an external load index must not be the same as any built-in or previously defined resource or load index.
file(line): Duplicate clustername name in section cluster. Ignoring current line
A cluster name is defined twice in the same lsf.shared file. The second definition is ignored.
file(line): Bad cpuFactor for host model model. Ignoring line
The CPU factor declared for the named host model in the lsf.shared file is not a valid number.
file(line): Too many host models, ignoring model name
You can declare a maximum of 25 host models in the lsf.shared file.
file(line): Resource name name too long in section resource. Should be less than 40 characters. Ignoring line
The maximum length of a resource name is 39 characters. Choose a shorter name for the resource.
file(line): Resource name name reserved or previously defined. Ignoring line.
You have attempted to define a resource name that is reserved by LSF or already defined in the lsf.shared file. Choose another name for the resource.
file(line): illegal character in resource name: name, section resource. Line ignored.
Resource names must begin with a letter in the set [a-zA-Z], followed by letters, digits or underscores [a-zA-Z0-9_].
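The length and character rules for resource names can be tested before editing the lsf.shared file. The sketch below applies only those two rules; it cannot tell whether a name is reserved or already defined, and the function name valid_resource_name is hypothetical.

```shell
# Hypothetical helper: succeed if the name is at most 39 characters long,
# starts with a letter, and contains only letters, digits and underscores.
# It does not check for reserved or previously defined names.
valid_resource_name() {
    name=$1
    [ "${#name}" -le 39 ] || return 1
    case $name in
        [a-zA-Z]*) ;;
        *) return 1 ;;
    esac
    case $name in
        *[!a-zA-Z0-9_]*) return 1 ;;
    esac
    return 0
}
```

A name such as big_mem2 passes, while 2big and big-mem are rejected.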
The following messages are logged by the LIM:
main: LIM cannot run without licenses, exiting
The LSF software license key is not found or has expired. Check that
FLEXlm is set up correctly, or contact your LSF technical support.
main: Received request from unlicensed host <host>/<port>
LIM refuses to service requests from hosts that do not have licenses.
Either your LSF license has expired, or you have configured LSF on more hosts
than your license key allows.
initLicense: Trying to get license for LIM from source <LSF_CONFDIR/license.dat>
getLicense: Can't get software license for LIM from license file <LSF_CONFDIR/license.dat>: feature not yet available.
Your LSF license is not yet valid. Check whether the system clock is correct.
findHostbyAddr/<proc>: Host <host>/<port> is unknown by <myhostname>
function: Gethostbyaddr_(<host>/<port>) failed: error
main: Request from unknown host <host>/<port>: error
function: Received request from non-LSF host <host>/<port>
The daemon does not recognize host as an LSF host. The request is not serviced. These messages can occur if host was added to the configuration files, but not all the daemons have been reconfigured to read the new information. If the problem still occurs after reconfiguring all the daemons, check whether host is a multi-addressed host. See "Host Naming" in the LSF Installation Guide.
rcvLoadVector: Sender (<host>/<port>) may have different config?
MasterRegister: Sender (host) may have different config?
LIM detected inconsistent configuration information with the sending LIM. Run the following command so that all the LIMs have the same configuration information.
% lsadmin reconfig
rcvLoadVector: Got load from client-only host <host>/<port>. Kill LIM on <host>/<port>
A LIM is running on an LSF client host. Run the following command, or go to the client host and kill the LIM daemon.
% lsadmin limshutdown host
saveIndx: Unknown index name <name> from ELIM
LIM received an external load index name that is not defined in the lsf.shared file. If name is defined in lsf.shared, reconfigure the LIM. Otherwise, add name to the lsf.shared file and reconfigure all the LIMs.
saveIndx: ELIM over-riding value of index <name>
This is a warning message. The ELIM sent a value for one of the built-in index names. LIM uses the value from ELIM in place of the value obtained from the kernel.
getusr: Protocol error numIndx not read (cc=num): error
getusr: Protocol error on index number (cc=num): error
Protocol error between ELIM and LIM. See `Changing LIM Configuration' on page 55 for a description of the protocol.
These messages are logged by the RES.
doacceptconn: getpwnam(<username>@<host>/<port>) failed: error
doacceptconn: User <username> has uid <uid1> on client host <host>/<port>, uid <uid2> on RES host; assume bad user
authRequest: username/uid <userName>/<uid>@<host>/<port> does not exist
authRequest: Submitter's name <clname>@<clhost> is different from name <lname> on this host
RES assumes that a user has the same userID and username on all the LSF hosts. These messages occur if this assumption is violated. If the user is allowed to use LSF for interactive remote execution, make sure the user's account has the same userID and username on all LSF hosts.
doacceptconn: root remote execution permission denied
authRequest: root job submission rejected
Root tried to execute or submit a job but LSF_ROOT_REX is not defined in the lsf.conf file.
resControl: operation permission denied, uid = <uid>
The user with user ID uid is not allowed to make RES control requests. Only the LSF manager, or root if LSF_ROOT_REX is defined in lsf.conf, can make RES control requests.
resControl: access(respath, X_OK): error
The RES received a reboot request, but failed to find the file respath to re-execute itself. Make sure respath contains the RES binary, and it has execution permission.
The following messages are logged by the mbatchd and sbatchd daemons:
renewJob: Job <jobId>: rename(<from>,<to>) failed: error
mbatchd failed in trying to re-submit a rerunnable job. Check that the file from exists and that the LSF administrator can rename the file. If from is in an AFS directory, check that the LSF administrator's token processing is properly set up (see `Installation on AFS' on page 97 of the LSF Installation Guide).
logJobInfo_: fopen(<logdir/info/jobfile>) failed: error
logJobInfo_: write <logdir/info/jobfile> <data> failed: error
logJobInfo_: seek <logdir/info/jobfile> failed: error
logJobInfo_: write <logdir/info/jobfile> xdrpos <pos> failed: error
logJobInfo_: write <logdir/info/jobfile> xdr buf len <len> failed: error
logJobInfo_: close(<logdir/info/jobfile>) failed: error
rmLogJobInfo: Job <jobId>: can't unlink(<logdir/info/jobfile>): error
rmLogJobInfo_: Job <jobId>: can't stat(<logdir/info/jobfile>): error
readLogJobInfo: Job <jobId> can't open(<logdir/info/jobfile>): error
start_job: Job <jobId>: readLogJobInfo failed: error
readLogJobInfo: Job <jobId>: can't read(<logdir/info/jobfile>) size size: error
initLog: mkdir(<logdir/info>) failed: error
<fname>: fopen(<logdir/file> failed: error
getElogLock: Can't open existing lock file <logdir/file>: error
getElogLock: Error in opening lock file <logdir/file>: error
releaseElogLock: unlink(<logdir/lockfile>) failed: error
touchElogLock: Failed to open lock file <logdir/file>: error
touchElogLock: close <logdir/file> failed: error
mbatchd failed to create, remove, read, or write the log directory or a file in the log directory, for the reason given in error. Check that LSF managerid has read, write, and execute permissions on the logdir directory.
If logdir is on AFS, check that the instructions in `Installation on AFS' on page 97 of the LSF Installation Guide have been followed. Run fs la to verify that the LSF administrator owns logdir and that the directory has the correct ACL.
replay_newjob: File <logfile> at line <line>: Queue <queue> not found, saving to queue <lost_and_found>
replay_switchjob: File <logfile> at line <line>: Destination queue <queue> not found, switching to queue <lost_and_found>
When mbatchd was reconfigured, jobs were found in queue but that queue is no longer in the configuration.
replay_startjob: JobId <jobId>: exec host <host> not found, saving to host <lost_and_found>
When mbatchd was reconfigured, the event log contained jobs dispatched to host, but that host is no longer configured to be used by LSF Batch.
do_restartReq: Failed to get hData of host <hostname>/<hostaddr>
mbatchd received a request from sbatchd on host hostname, but that host is not known to mbatchd. Either the configuration file has been changed but mbatchd has not been reconfigured to pick up the new configuration, or hostname is a client host but the sbatchd daemon is running on that host. Run the following command to reconfigure the mbatchd or kill the sbatchd daemon on hostname.
% badmin reconfig