LSF is integrated with products from Fluent Inc., allowing FLUENT jobs to take advantage of the checkpoint and migration features provided by LSF Batch. This increases the efficiency of the software and means the data is processed faster.
In this appendix, we assume that you are already familiar with using FLUENT software and checkpointing jobs in LSF.
For checkpointing jobs, LSF uses two executable files called echkpnt and erestart. LSF provides special versions of echkpnt and erestart that will allow checkpointing with FLUENT software.
When you submit a checkpointing job, you have to specify a checkpoint directory. Before the job starts running, LSF sets the environment variable LSB_CHKPNT_DIR. The value of LSB_CHKPNT_DIR is a subdirectory of the checkpoint directory specified in the command line. This subdirectory is identified by the job ID and will only contain files relating to the submitted job.
When you checkpoint the FLUENT job, LSF creates a checkpoint trigger file (.check) in the job subdirectory, which will cause the FLUENT software to checkpoint and continue running. A special option is used to create a different trigger file (.exit) which will cause the FLUENT software to checkpoint and exit the job.
The FLUENT software uses the LSB_CHKPNT_DIR environment variable to determine the location of checkpoint trigger files. It checks the job subdirectory periodically while running the job. The FLUENT software does not do any checkpointing unless it finds the LSF trigger file in the job subdirectory. The FLUENT software removes the trigger file after checkpointing the job.
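The trigger-file handshake described above can be sketched in shell. This is only an illustration of the protocol, not how the FLUENT software implements it: the function name check_trigger and the action words it prints are hypothetical, and a temporary directory stands in for the job subdirectory that LSF would normally create.

```shell
#!/bin/sh
# Sketch of one polling pass over the job subdirectory.
# check_trigger is a hypothetical helper, not part of LSF or FLUENT.
check_trigger() {
    dir=${LSB_CHKPNT_DIR:-}
    # Without LSB_CHKPNT_DIR there is nowhere to look, so do nothing.
    [ -n "$dir" ] || { echo "none"; return 0; }
    if [ -f "$dir/.exit" ]; then
        # (an application would write its checkpoint files here)
        rm -f "$dir/.exit"      # remove the trigger so it is handled only once
        echo "checkpoint-and-exit"
    elif [ -f "$dir/.check" ]; then
        # (an application would write its checkpoint files here)
        rm -f "$dir/.check"
        echo "checkpoint-and-continue"
    else
        echo "none"
    fi
}

# Demo: a temporary directory plays the role of the job subdirectory.
LSB_CHKPNT_DIR=$(mktemp -d)
export LSB_CHKPNT_DIR
touch "$LSB_CHKPNT_DIR/.check"   # what LSF does when the job is checkpointed
check_trigger                    # prints: checkpoint-and-continue
check_trigger                    # prints: none (the trigger was removed)
```

Checking .exit before .check is a choice made for this sketch; the source does not say what happens if both files are present.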
If a job is restarted, LSF will attempt to restart the job with the -r option appended to the original FLUENT command. FLUENT software will use the checkpointed data and case files to restart the process from that checkpoint, rather than repeating the entire process.
Each time a job is restarted, it is assigned a new job ID, and a new job subdirectory is created in the checkpoint directory. Files in the checkpoint directory are never deleted by LSF, but you may choose to remove old files once the FLUENT job is finished and the job history is no longer required.
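Because LSF never deletes these files, cleanup is left to you. A minimal sketch for reviewing what has accumulated, assuming each job subdirectory sits directly under the checkpoint directory as described above. The helper name list_job_dirs and the example path are hypothetical, and the helper only lists subdirectories, it deletes nothing.

```shell
#!/bin/sh
# list_job_dirs is a hypothetical helper: it prints the name of every
# subdirectory found directly under the given checkpoint directory.
list_job_dirs() {
    chkpnt_dir=$1
    for d in "$chkpnt_dir"/*/; do
        [ -d "$d" ] || continue   # skip the unexpanded pattern when nothing matches
        basename "$d"             # job subdirectories are named after the job ID
    done
}

# Hypothetical checkpoint directory; prints nothing if it does not exist.
list_job_dirs /scratch/fluent_chkpnt
```

Once you have confirmed that a listed job is finished and its history is no longer needed, its subdirectory can be removed by hand.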
The files that you can use with FLUENT software are available from Platform. Installation instructions are included.
LSF provides special versions of echkpnt and erestart that will allow checkpointing with FLUENT software. You must make sure LSF uses these files instead of the standard versions. There are two ways to do this:
Place the FLUENT versions in the default location, LSF_SERVERDIR, in place of the standard versions.
Set the LSF_ECHKPNTDIR environment variable to point to the FLUENT versions.
The LSF_ECHKPNTDIR environment variable, defined in the lsf.conf file, specifies the location of the echkpnt and erestart files that LSF will use. If this variable is not defined, LSF uses the files in the default location, identified by the environment variable LSF_SERVERDIR.
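The lookup order described above can be expressed as a shell fallback. This is a sketch of the rule only, not LSF's own code: resolve_echkpnt_dir is a hypothetical helper and both paths are made-up examples.

```shell
#!/bin/sh
# resolve_echkpnt_dir is a hypothetical helper illustrating the lookup order.
resolve_echkpnt_dir() {
    # Prefer LSF_ECHKPNTDIR; fall back to LSF_SERVERDIR when it is unset.
    # Treating an empty value like an unset one is an assumption of this sketch.
    echo "${LSF_ECHKPNTDIR:-${LSF_SERVERDIR:-}}"
}

# Hypothetical installation paths.
LSF_SERVERDIR=/usr/local/lsf/etc
LSF_ECHKPNTDIR=/usr/local/lsf/fluent
resolve_echkpnt_dir    # prints: /usr/local/lsf/fluent

unset LSF_ECHKPNTDIR
resolve_echkpnt_dir    # prints: /usr/local/lsf/etc
```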
Submit the job as usual, but include the parameters required for checkpointing. The syntax for the bsub command is:
bsub [-k chkpntDir] [any other regular options to the bsub command] FLUENT command [any other regular options to the FLUENT command] -lsf
The checkpointing feature for FLUENT jobs requires all the following parameters:
-k chkpntDir
Regular option to the bsub command. Specifies the name of the checkpoint directory.
FLUENT command
The regular command used with FLUENT software.
-lsf
Special option to the FLUENT command. Specifies that the FLUENT software is running under LSF, and causes the FLUENT software to check for trigger files in the checkpoint directory if the environment variable LSB_CHKPNT_DIR is set.
This option to the FLUENT command should be documented with the FLUENT software. At the time of printing, the option was -lsf, but this may change.
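As an illustration of how the three required parameters fit together, the following sketch assembles a submission line and prints it for inspection rather than submitting anything. The checkpoint directory and the FLUENT command line (including the journal file name run.jou) are hypothetical placeholders. Substitute the command you normally use.

```shell
#!/bin/sh
# All values below are hypothetical placeholders.
CHKPNT_DIR=/scratch/fluent_chkpnt       # checkpoint directory passed to bsub -k
FLUENT_CMD="fluent 3d -g -i run.jou"    # your usual FLUENT command line

# bsub options come first, then the FLUENT command, with -lsf as the last option.
SUBMIT_CMD="bsub -k $CHKPNT_DIR $FLUENT_CMD -lsf"
echo "$SUBMIT_CMD"
# prints: bsub -k /scratch/fluent_chkpnt fluent 3d -g -i run.jou -lsf
```

After checking the printed line, run it at the prompt to submit the job.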
Checkpoint the FLUENT job manually. The syntax for the bchkpnt command is:
bchkpnt [regular options to bchkpnt] [-k] [jobId]
The following parameters are used with FLUENT:
-k
Regular option to the bchkpnt command. Specifies checkpoint and exit. The job will be killed immediately after being checkpointed. When the job is restarted, it doesn't have to repeat any operations.
jobId
Job ID of the FLUENT job. Specifies which job to checkpoint.
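The two forms of the command can be compared side by side. The sketch below only builds and prints the command lines, and the job ID 1234 is a hypothetical value standing in for the ID reported when the job was submitted.

```shell
#!/bin/sh
JOB_ID=1234    # hypothetical job ID

# Checkpoint the job and let it continue running.
CHECK_CMD="bchkpnt $JOB_ID"
# Checkpoint the job and then kill it.
CHECK_EXIT_CMD="bchkpnt -k $JOB_ID"

echo "$CHECK_CMD"         # prints: bchkpnt 1234
echo "$CHECK_EXIT_CMD"    # prints: bchkpnt -k 1234
```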
Restart the FLUENT job as usual. The syntax for the brestart command is:
brestart [regular options to brestart] chkpntDir [jobId]
The following parameters are used with FLUENT:
chkpntDir
Specifies the checkpoint directory, where the job subdirectory is located.
jobId
Job ID of the FLUENT job. Specifies which job to restart. At this point, the restarted job is assigned a new job ID, and the new job ID starts being used for checkpointing. The job ID changes each time the job is restarted.
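A restart command built from the same pieces might look like the following sketch. Both values are hypothetical: the directory is whatever was given to bsub -k, and 1234 stands for the ID of the job that was checkpointed. The line is printed rather than executed.

```shell
#!/bin/sh
CHKPNT_DIR=/scratch/fluent_chkpnt   # hypothetical checkpoint directory
JOB_ID=1234                         # hypothetical ID of the checkpointed job

RESTART_CMD="brestart $CHKPNT_DIR $JOB_ID"
echo "$RESTART_CMD"
# prints: brestart /scratch/fluent_chkpnt 1234
```

Because the restarted job receives a new job ID, any later bchkpnt or brestart command must use that new ID, not 1234.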