

3. Events and Calendars

LSF JobScheduler manages network-wide events and uses them to drive the scheduling of jobs. LSF JobScheduler responds to the following types of events: time events, job events, job group events, exception events, file events, and user events.

File events and user events are handled by an External Event Server (eeventd). LSF JobScheduler comes with an eeventd daemon that handles file events. A site can modify eeventd to monitor any site-specific events and use them to drive jobs.

Since file events and user events are events external to the LSF JobScheduler, they are also referred to as external events, whereas time events, job events, job group events, and exception events are referred to as built-in events.

Time events are defined by calendars and time expressions, specified when a job is submitted. A calendar is a set of days during which time events occur. A time expression specifies time(s) of the day at which time events occur.

How Are Events Created?

LSF JobScheduler events are used to trigger jobs. As such, events are defined when jobs that are to be triggered by the events are created. There is a difference between a condition and an event. An event is a condition that has been associated with a job as one of its scheduling conditions. A condition that is not associated with a job is not monitored by LSF JobScheduler, and is not considered an event.

When you create a new job, you can specify one or more conditions that will trigger the job's execution. As a user, you only need to worry about the conditions of your jobs, and LSF JobScheduler will automatically register events that monitor the conditions you specify.

Event Status

At any given instant, the status of an event has one of three possible values: active, inactive, or invalid.

Built-in Events

Built-in events are events inside LSF JobScheduler. These events are monitored by LSF JobScheduler rather than by an eeventd.

Time Events

A calendar together with a time expression defines a sequence of time events. Time events are a useful means of triggering repetitive jobs.

A time event has two attributes: a start time and a duration.

See `Calendars and Time Events' on page 31 for detailed information on time events.

Job Events

Frequently, the status changes of some jobs can trigger the scheduling of other jobs. The change in status of a job is a job event. LSF JobScheduler allows you to submit a job so that the scheduling of this job is dependent on the status of some prior job(s).

The following job event functions are provided to specify an inter-job dependency when you submit a job. For all functions, the parameter job is either a jobID or a jobname.

started(job)
The condition is TRUE if the specified job has started running or has already finished.
done(job)
The condition is TRUE if the specified job has finished successfully in the DONE state. A job is considered to have finished successfully if it terminates with exit code 0 (zero).
exit(job)
The condition is TRUE if the specified job has terminated abnormally in the EXIT state. A job is considered to have terminated abnormally if it terminates with a non-zero exit code.
exit(job,exit_code)
The condition is TRUE if the specified job has terminated with the specified exit_code value.
exit(job,op exit_code)
The condition is TRUE if the specified job has terminated with an exit code within the specified range of exit_code, where op is one of >, >=, <, <=, ==, or !=.
ended(job)
The condition is TRUE if the specified job has finished.

The jobname can be preceded by a job group specification to indicate a dependency on a job belonging to a particular group. See `Inter-job Dependencies' on page 83 for examples of using the above functions when submitting a job.
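
The meaning of the job event functions above can be summarized in a small model. This is only an illustrative Python sketch, not LSF JobScheduler code: the JobStatus class and the reduced set of states (PEND, RUN, DONE, EXIT) are assumptions made for the example, and the sketch assumes that every form of exit() requires the job to be in the EXIT state. The function is named exit_ because exit is a Python built-in.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Comparison operators accepted by exit(job, op exit_code).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq, "!=": operator.ne}


@dataclass
class JobStatus:
    """Hypothetical snapshot of a job (not an LSF data structure)."""
    state: str                       # "PEND", "RUN", "DONE" or "EXIT"
    exit_code: Optional[int] = None  # known once the job has finished


def started(job):
    # TRUE once the job is running or has already finished.
    return job.state in ("RUN", "DONE", "EXIT")


def done(job):
    # TRUE if the job finished successfully (exit code 0).
    return job.state == "DONE"


def ended(job):
    # TRUE if the job has finished, successfully or not.
    return job.state in ("DONE", "EXIT")


def exit_(job, code=None, op="=="):
    # Assumption: all forms of exit() require the EXIT state.
    if job.state != "EXIT":
        return False
    if code is None:
        return True
    return OPS[op](job.exit_code, code)
```

For a job that terminated with exit code 3, exit_(job), exit_(job, 3) and exit_(job, 2, ">") all hold in this model, while done(job) does not.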

Job Group Events

A job or a job group can depend on the status of another job group. Using the group dependencies you can set up a sequence of job groups to execute in a particular order. A group itself is not actually executed, but rather the individual jobs under the group. Therefore the successful completion or failure of a group is determined by the state of the jobs in the group.

The following functions are provided to specify job group dependencies:

active(group_spec)
TRUE if the group is in the ACTIVE state. A group is active if it is ready to schedule jobs.
inactive(group_spec)
TRUE if the group is in the INACTIVE state, i.e. no job in the group may be scheduled to run.
hold(group_spec)
TRUE if the group is in the HOLD state.
numrun(group_spec, op num)
TRUE if the number of jobs in RUN state satisfies the test.
numpend(group_spec, op num)
TRUE if the number of jobs in PEND state satisfies the test.
numdone(group_spec, op num)
TRUE if the number of jobs in DONE state satisfies the test.
numexit(group_spec, op num)
TRUE if the number of jobs in EXIT state satisfies the test.
numended(group_spec, op num)
TRUE if the total number of jobs in the DONE or EXIT state satisfies the test.
numstarted(group_spec, op num)
TRUE if the total number of jobs in the RUN, USUSP or SSUSP state satisfies the test.
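
The counting functions above all follow the same pattern: count the jobs of a group that are in certain states, then compare the count with num using op. The following Python sketch illustrates that pattern under simplifying assumptions: a group is represented as a plain list of job state strings, and the set of operators is assumed to be the same as for exit(job, op exit_code). None of the names below come from LSF JobScheduler.

```python
import operator

# Assumed operator set, mirroring the job event functions.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq, "!=": operator.ne}


def count_test(job_states, states, op, num):
    """Count the jobs whose state is in `states` and compare with num."""
    total = sum(1 for s in job_states if s in states)
    return OPS[op](total, num)


def numdone(job_states, op, num):
    return count_test(job_states, ("DONE",), op, num)


def numended(job_states, op, num):
    # Jobs in either the DONE or the EXIT state.
    return count_test(job_states, ("DONE", "EXIT"), op, num)


def numstarted(job_states, op, num):
    # Jobs in the RUN, USUSP or SSUSP state.
    return count_test(job_states, ("RUN", "USUSP", "SSUSP"), op, num)
```

For a group whose jobs are in the states DONE, DONE, EXIT, RUN and PEND, numdone(states, ">=", 2) and numended(states, "==", 3) hold, while numstarted(states, ">", 1) does not.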

Job Exception Events

Exceptions are conditions that arise during the life of a job that may require notification or corrective action. An exception can trigger either a corrective job (by means of an exception event), or an exception handler (such as an alarm).

Since every job can potentially display unique behaviour, the definition of an exception is entirely up to the user who creates the job, i.e. a condition that is considered to be an error for one job may not be considered an error for another job. LSF JobScheduler allows each user to have his/her own definition of exceptions for each of his/her jobs.

When you create a job, you can specify what you would consider to be an exception for that job. You can then specify exception handlers to take care of the exception conditions. When an exception occurs, the associated exception handler will be automatically invoked to take action. Exception handling is discussed in more detail in Section 7, `Exception Handling and Alarms', beginning on page 123.

This section focuses on exception events that can be created as a result of exception handling. LSF JobScheduler provides many exception handlers to recover a job when errors occur. These handlers allow you to specify how you want to handle the job when certain exceptions happen. For example, you can re-run the job, terminate the job, trigger an alarm, or trigger an external exception handler job. Exception events are used to trigger external exception handlers.

As with all other events, an exception event is created when a job that responds to the event is created. An exception event has a name defined by the user who creates the job. The following is an example of how an exception event can be created:

% bsub -w "exception(too_long)" re-init

This command submits a job re-init that runs when exception event too_long occurs. LSF JobScheduler then registers an exception event named too_long. Note that the full name of the event is too_long@user where user is the owner of the job. This allows different users to use the same exception name without causing conflicts.

For the registered exception event to become active, a source for the exception must also be defined. This is done through the special exception event handler function setexcept():

% bsub -X "overrun(60)::setexcept(too_long)" simulation

Here, overrun() is an exception function and setexcept() is an exception handler that sets the exception event too_long to active when the exception condition overrun(60) becomes TRUE. The exception condition overrun(60) becomes TRUE when the job runs for more than 60 minutes.

In the above example, the re-init job creates the event and serves as the external event handler, and the simulation job sets the event to active when the exception happens. It is possible to have more than one job handling the same exception event. In this case, the first job creates the event and other jobs would refer to the event. The event is removed from the system when all exception handling jobs for the event are removed from LSF JobScheduler.
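
The life cycle described above (an event is registered by its first handling job, set to active by setexcept(), and removed with its last handling job) can be pictured with a toy registry. This Python sketch is an illustration only: the ExceptionEvents class, its method names, the initial "inactive" status, and the behaviour of setexcept() for an unregistered event are assumptions, not LSF JobScheduler internals.

```python
class ExceptionEvents:
    """Toy registry mapping 'name@user' to a status and its handler jobs."""

    def __init__(self):
        self.events = {}

    @staticmethod
    def full_name(user, name):
        # The full event name is qualified by the owner of the job.
        return f"{name}@{user}"

    def add_handler_job(self, user, name, job):
        # The first handling job creates the event; later jobs refer to it.
        full = self.full_name(user, name)
        event = self.events.setdefault(
            full, {"status": "inactive", "handlers": set()})
        event["handlers"].add(job)
        return full

    def setexcept(self, user, name):
        # Called when an exception condition such as overrun(60) is TRUE.
        # Assumption: setting an unregistered event has no effect.
        event = self.events.get(self.full_name(user, name))
        if event is None:
            return False
        event["status"] = "active"
        return True

    def remove_handler_job(self, user, name, job):
        # The event disappears together with its last handling job.
        full = self.full_name(user, name)
        event = self.events.get(full)
        if event is None:
            return
        event["handlers"].discard(job)
        if not event["handlers"]:
            del self.events[full]
```

In terms of the example, submitting re-init corresponds to add_handler_job("user1", "too_long", "re-init"), and the overrun of the simulation job corresponds to setexcept("user1", "too_long").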

For a list of all valid exception conditions and exception handlers in LSF JobScheduler see Section 7, `Exception Handling and Alarms', beginning on page 123.

External Events

External events allow your jobs to be triggered by conditions external to LSF JobScheduler. This provides a lot of flexibility for you to integrate your site-specific conditions with the scheduling of your production jobs.

Typical examples of external events include the arrival of a file, the mount of a device, and the detection of an exception situation in the system.

External events are detected by the External Event Daemon (eeventd), which resides on the same server host as the master scheduler daemon (mbatchd). The eeventd communicates with mbatchd using a well-defined protocol. LSF JobScheduler comes with an eeventd that detects file events. A site can easily modify the eeventd to integrate other events by adding event processing functions.

File Events

A file event condition is specified as:

file(file_condition_expression)

where the keyword file() tells the master scheduler (mbatchd) that this is a file condition so that the parameter file_condition_expression should be passed to the eeventd for processing. The file_condition_expression is a logical expression in terms of the following four file status functions:

arrival(file_loc)
This function evaluates to TRUE when the file specified by file_loc arrives. The arrival of a file refers to the transition from non-existence to existence of the file.
exist(file_loc)
This function evaluates to TRUE if the file specified by file_loc exists. Note that this function is different from arrival() in that a transition from a non-existence to existence is not needed. As long as the file exists, the function always evaluates to TRUE.
size(file_loc)
This function evaluates to the size of the file specified by file_loc in bytes. If the file does not exist, this function evaluates to 0.
age(file_loc)
This function evaluates to the age of the file specified by file_loc since the last modification in minutes. If the file does not exist, this function evaluates to 0.

The file_loc in the above functions takes the following form:

[hostname:]absolute_directory/filename

Here, hostname is the name of the host on which the file can be accessed. Note that this host does not have to be the same host on which the job executes. If hostname is not specified, then the Event Server assumes that the file is accessible from any host.

Note

You must specify the absolute path name of the file being evaluated in the above expressions.

An example of a file event condition using the above file event functions is:

file(exist(/tmp/core) && size(/tmp/core)>10M)

A file event is automatically created when a user submits a job with a file event dependency condition. The event is automatically removed when there are no dependent jobs.
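
To make the difference between the four file status functions concrete, the following Python sketch evaluates them against the local file system. It is not the eeventd implementation: the optional hostname prefix of file_loc is ignored, "M" is assumed to mean 2**20 bytes, and the ArrivalWatcher class is a hypothetical helper showing that arrival(), unlike exist(), needs to remember the previous state of the file.

```python
import os
import time


def exist(path):
    # TRUE as long as the file exists; no transition is required.
    return os.path.exists(path)


def size(path):
    # Size in bytes, or 0 if the file does not exist.
    return os.path.getsize(path) if os.path.exists(path) else 0


def age(path, now=None):
    # Minutes since the last modification, or 0 if the file does not exist.
    if not os.path.exists(path):
        return 0
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 60


class ArrivalWatcher:
    """Reports TRUE only on a transition from non-existence to existence."""

    def __init__(self, path):
        self.path = path
        self.was_present = os.path.exists(path)

    def poll(self):
        present = os.path.exists(self.path)
        arrived = present and not self.was_present
        self.was_present = present
        return arrived


MB = 1024 * 1024  # assumption: "10M" means 10 * 2**20 bytes


def core_condition(path="/tmp/core"):
    # Mirrors file(exist(/tmp/core) && size(/tmp/core)>10M).
    return exist(path) and size(path) > 10 * MB
```

A watcher created before the file exists returns TRUE from poll() once, on the first poll after the file appears, and FALSE afterwards, whereas exist() keeps returning TRUE for as long as the file is there.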

See `File Event Dependency' on page 86 for examples of how to use file event conditions when submitting a job.

User Events

LSF JobScheduler provides an open mechanism for sites to implement site specific events by adding more event processing functions into the External Event Daemon (eeventd). A user event condition is specified as:

event(event_spec)

where the keyword event() tells the master scheduler (mbatchd) that this is a user event so that the parameter event_spec should be passed to eeventd for processing. The site is responsible for writing the function that parses and processes the event_spec. See "External Event Management" in the LSF JobScheduler Administrator's Guide for details of how to modify eeventd.

A user event can be used to detect arbitrary site specific environmental status that can trigger production jobs. For example, a `diskisfull' event could be designed to detect the fact that a critical file system is full, and therefore, an exception handling job should be triggered to correct the situation. In this case, the user event condition might be specified as:

event(diskisfull)

This will create an event `diskisfull' which is monitored by the diskisfull event processing function in the eeventd.

See `User Event Dependency' on page 90 for examples of associating user events with job submissions.

Viewing Events

You can view all events in LSF JobScheduler. To view prior job events and job group events, use the bjobs command. To check the status of a time event, look at the time event definition in the detailed job definition information, and check the calendar status using a calendar tool such as bcal or the xbcal GUI.

To view exception events and external events, use the bevents command.

Examples

% bevents
EVENT            OWNER  STATUS    SOURCE  ATTRIBUTE  LAST_UPDATE
size(/tmp/file)  user1  inactive  file    -          Nov 19 18:09:50 1997
too_long@user1   user1  inactive  except  overrun    Nov 19 18:27:01 1997

To view all details of an event, use long format:

% bevents -l
EVENT: size(/tmp/file)>45M
OWNER  STATUS    SOURCE NUM_OF_DEPENDENTS LAST_UPDATE
user1  inactive  file   1                 Nov 19 18:09:50 1997
LAST_DISPATCHED_JOBID   LAST_DISPATCH_TIME
-                       -
ATTRIBUTE: -
EVENT: too_long@user1
OWNER  STATUS    SOURCE     NUM_OF_DEPENDENTS  LAST_UPDATE
mike   inactive  exception  1                  Nov 19 18:27:01 1997
LAST_DISPATCHED_JOBID   LAST_DISPATCH_TIME
-                       -
ATTRIBUTE: "overrun Job[105] User[user1] Queue[priority]"

The long format shows you the complete event information. In the above example, the first event is a file event that watches the size of the file, and has one job--which has never run--depending on it. If a job that depends on the event had been triggered by the event, the LAST_DISPATCHED_JOBID and LAST_DISPATCH_TIME would indicate the job ID and the time the job was dispatched.

The second event in the above example is an exception event and has one job--which has never run--depending on it. The ATTRIBUTE parameter is event type-dependent information. This information is passed to jobs that are triggered by the event via the environment variable LSB_EVENT_ATTRIB. For file events, ATTRIBUTE is empty. For exception events, ATTRIBUTE gives the exception function, job ID of the job that caused the exception, login name of the user who owns the event, and name of the queue to which the job belongs. If you want to define an error recovery job in response to an exception event, for example, you can use the ATTRIBUTE to find the context under which the error occurs.
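
An error recovery job could read the attribute from its environment along the following lines. This Python sketch is hedged: the layout of the attribute string is inferred solely from the sample bevents output above ("overrun Job[105] User[user1] Queue[priority]"), and the function name parse_event_attrib and the dictionary keys are made up for the example.

```python
import os
import re

# Pattern inferred from the sample ATTRIBUTE line only (an assumption).
ATTRIB_PATTERN = re.compile(
    r"(\S+)\s+Job\[(\d+)\]\s+User\[([^\]]*)\]\s+Queue\[([^\]]*)\]")


def parse_event_attrib(attrib):
    """Return the parts of an exception event attribute, or None."""
    match = ATTRIB_PATTERN.fullmatch(attrib.strip().strip('"'))
    if match is None:
        return None
    return {
        "exception": match.group(1),  # exception function, e.g. overrun
        "job_id": int(match.group(2)),
        "user": match.group(3),
        "queue": match.group(4),
    }


# Inside a triggered job the attribute arrives in LSB_EVENT_ATTRIB.
context = parse_event_attrib(os.environ.get("LSB_EVENT_ATTRIB", ""))
```

For file events the attribute is empty, so context is None in that case.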

Calendars and Time Events

What is a Calendar?

A calendar is a set of days defined using one or more calendar rules. A calendar, together with a time expression, defines a sequence of time events that can trigger the execution of repetitive jobs. Calendars are defined and manipulated independently of jobs. This allows multiple jobs to share the same calendar. There are three types of calendars you can use to submit your job: built-in calendars, system calendars, and user calendars.

A calendar has a name, an owner (user), and a description. The name of the calendar and its owner are assigned when it is created, and are case-insensitive. The status of a calendar is determined by whether the current day is one of the days specified in the calendar definition. A calendar is active if the current day is in the calendar's list of days, otherwise it is inactive.

Built-in Calendars and Reserved Names for Calendars

Built-in calendars are provided in LSF JobScheduler to reflect the most commonly used set of days. You can use these calendars directly without having to define them first.

The available built-in calendars supported in LSF JobScheduler include:

Sun, Mon, Tue, Wed, Thu, Fri, Sat
These are the days of the week. For example, the calendar "Sun" means every Sunday.
Day
This calendar refers to every day.

Since built-in calendars have obvious meaning in daily life, you cannot view the status of built-in calendars.

LSF JobScheduler reserves the names of all built-in calendars. You cannot create a user calendar that conflicts in name with one of the built-in calendars.

In addition to built-in calendars, LSF JobScheduler also reserves the following names. These names are not calendar names but they are reserved as building blocks of calendar definition. See `Calendar Expressions and the Command Line Interface' on page 37 for details of the use of these keywords in the definition of calendars.

Listed below are reserved keywords that are not built-in calendar names by themselves.

Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
These are months of the year and are reserved as building blocks of calendar definitions. See `Calendar Expressions and the Command Line Interface' on page 37 for detailed usage information.
Week
This keyword is reserved for use in calendar definitions to specify a period of a week.
Month
This keyword is reserved for use in calendar definitions to specify a period of a month.
Quarter
This keyword is reserved for use in calendar definitions to specify a period of a quarter.
YY
This keyword is reserved for use in calendar definitions to specify a set of years.

System Calendars

System calendars are read-only calendars defined in the LSF JobScheduler configuration by the LSF administrator. System calendars are owned by the virtual user "sys", and can be viewed by everybody. You cannot modify or delete system calendars.

Note

The user account "sys" does not need to exist in the system.

System calendars can be used as normal calendars. When a system calendar is defined, its name becomes a reserved calendar name in the cluster. When the LSF JobScheduler daemons start up, the system calendars are defined in the cluster.

Using the LSF JobScheduler - Calendar GUI

To create a calendar you can use either command line or GUI tools. The GUI tool xbcal lets you create, view, and manipulate calendars.

Creating Calendars

xbcal supports three ways of defining calendars.

Figure 5 shows the xbcal screen that is displayed by choosing File | New Calendar | by Specifying Recurrence Pattern.

Figure 5. Adding the "on_monday" calendar

If you wish to define a set of days with a regular recurrence pattern, you can use the window shown in Figure 5 to create the calendar.

When the set of days you wish to define does not have any regular pattern, use the "by Clicking on Dates" window. A natural calendar is displayed and you can click on the particular days that define the calendar. Figure 6 shows the window for defining a calendar by selecting specific days.

Figure 6. Defining a Calendar by Selecting Specific Days

LSF JobScheduler also provides the flexibility to create calendars on top of existing calendars. For example, if you already have a calendar named "businessdays", you can define a new calendar called "lastbuzday" using the existing calendar businessdays. Figure 7 shows the window for creating the "lastbuzday" calendar out of the existing "businessdays" calendar.

Figure 7. Creating the "lastbuzday" Calendar

The windows shown in Figure 7 and Figure 8 allow you to create a new calendar by combining multiple calendars using logical expressions. The "AND" operator selects days that are common to the two calendars, whereas the "OR" operator merges the days of both calendars. The "NOT" operator selects all days that are not part of a calendar. The "View Occurrences" button creates a popup window that displays the actual days of the newly combined calendar.

Figure 8. Combining Existing Calendars

Calendar Expressions and the Command Line Interface

Calendars can also be defined using the command line interface provided by LSF JobScheduler. In order to use commands to manipulate calendars, you first need to understand the concept of calendar expressions. A calendar expression in LSF JobScheduler is a powerful calendar definition language that provides flexible ways to define arbitrary sets of days.

Simple Calendar Expressions

A simple calendar expression is made up of colon-separated fields, as in "*:*:Fri" and "*:Week(1,*,2):Fri". The fields are described below.

The "YEAR" field defines the set of years during which the set of days will be chosen. Valid values for the "YEAR" field can be any one year, or a list of years separated by commas, such as "1997, 1998". You can also use the keyword "YY" to specify a recurring list of years in the following format:

YY(start_year, end_year, step)

Here, start_year, end_year, and step are integers.

The "MONTH" field specifies the set of months within the years defined by the "YEAR" field. The format of the "MONTH" field can be one or more integers in the range of 1 to 12, separated by commas, such as "1, 3, 5", or one or more of the keywords from "Jan" to "Dec". You can also use the keyword "Month" to specify a recurring list of months in the following format:

Month(start_month, end_month, step)

Here, start_month, end_month, and step are integers.

The "WEEK" field defines the set of weeks within the years defined by the "YEAR" field. The format of the "WEEK" field is:

Week(start_week, end_week, step)

Here, start_week, end_week, and step are integers.

The "DAY" field defines the set of days within the specified months of the year, or weeks of the year. You can specify multiple days separated by commas. Each day can be specified by an integer between 1 and 31 for the days of the month, or between "Sun" and "Sat" for the days of the week. To specify recurring days, you can also use:

Day(start_day, end_day, step)

Here, start_day, end_day, and step are all integers.
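
The YY(), Month(), Week() and Day() forms all describe a stepped range. A minimal Python sketch of how such a range could expand is shown below. It rests on assumptions the text does not state: that the end value is inclusive, and that "*" as the end value (as in "Week(1,*,2)") stands for the last valid value of the field, which the caller passes in as last. The function name expand is hypothetical.

```python
def expand(start, end, step, last=None):
    """Expand (start, end, step) into a list, treating end as inclusive.

    end may be "*", in which case `last` (the largest valid value of the
    field, for example 12 for months) is used instead.
    """
    if end == "*":
        if last is None:
            raise ValueError('"*" needs the last valid value of the field')
        end = last
    return list(range(start, end + 1, step))


every_other_year = expand(1997, 2001, 2)     # in the style of YY(1997, 2001, 2)
quarter_starts = expand(1, 12, 3)            # in the style of Month(1, 12, 3)
alternate_weeks = expand(1, "*", 2, last=52) # in the style of Week(1, *, 2)
```

Under these assumptions, every_other_year is [1997, 1999, 2001] and quarter_starts is [1, 4, 7, 10].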

A special character "*" can be used in any field above to mean "every year", "every month", or "every day".

For each of the reserved keywords described in `Built-in Calendars and Reserved Names for Calendars' on page 31, you can also use sub-indices to select a day, week, month, quarter or year in no particular order, or to select a particular day, week, month, quarter, or year relative to the start of the range.

For instance, "Mon(-1)" refers to the last Monday, "Day(-2)" refers to the second-to-last day, and "Week(3)" refers to the third week.
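
The effect of positive and negative sub-indices can be illustrated with the Python standard library. The sketch below assumes that the range in which the sub-index is counted is a single month, and that positive indices count from the start and negative indices from the end; the function names nth_weekday and nth_day are made up for the example and are not part of LSF JobScheduler.

```python
import calendar
import datetime


def nth_weekday(year, month, weekday, index):
    """Date of the index-th given weekday of a month (-1 means the last)."""
    if index == 0:
        raise ValueError("index must not be 0")
    days_in_month = calendar.monthrange(year, month)[1]
    matches = [datetime.date(year, month, day)
               for day in range(1, days_in_month + 1)
               if datetime.date(year, month, day).weekday() == weekday]
    return matches[index - 1] if index > 0 else matches[index]


def nth_day(year, month, index):
    """Date of the index-th day of a month (-2 means the second-to-last)."""
    if index == 0:
        raise ValueError("index must not be 0")
    days_in_month = calendar.monthrange(year, month)[1]
    day = index if index > 0 else days_in_month + 1 + index
    return datetime.date(year, month, day)


# In the style of "Fri(-1)" within July 1997.
last_friday = nth_weekday(1997, 7, calendar.FRIDAY, -1)
# In the style of "Day(-2)" within February 1997.
second_last_day = nth_day(1997, 2, -2)
```

Here last_friday is 25 July 1997 and second_last_day is 27 February 1997.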

For examples of simple calendar expressions, see `Command Line Interface for Defining Calendars' on page 39.

Command Line Interface for Defining Calendars

Although it is easier to define calendars using the xbcal GUI, LSF JobScheduler also provides a command line interface for calendar manipulations. Calendars can be created using the bcadd command. Below are some examples of creating calendars using bcadd:

% bcadd -d "Back up days on Friday" -t "*:*:Fri" backup_days

This creates a calendar named "backup_days" that includes every Friday. The -d option allows you to give a description of your calendar.

% bcadd -d "bi-weekly pay days on Friday" -t "*:Week(1,*,2):Fri" pay_days

This creates a calendar that is active every two weeks, on Fridays, starting from the beginning of each year.

% bcadd -d "Last Friday of every July" -t "*:Jul:Fri(-1)" report_days

This creates a calendar that is active on the last Friday of July of every year.

% bcadd -d "Quarterly synchronization days" -t "*:quarter:day(1)" quarterly

This creates a calendar that is active on the first day of each quarter.

Complex Calendars

Simple calendar expressions give you a straightforward way to define a calendar. In some cases, however, a set of days is too complex to be defined by a single simple calendar expression. For example, suppose you want to define a calendar that is active on all US holidays and Canadian holidays, but not if it is a Wednesday. It would be difficult to define this using simple calendar expressions.

A combined calendar expression introduces logical operations into calendar definition and provides a structured way to construct complex calendars out of simple calendars.

A combined calendar expression consists of one or more simple calendar expressions and one or more of the logical operators "&&" (AND), "||" (OR), and "!" (NOT). Multiple levels of logical expressions can be constructed by using "(" and ")" to group expressions in the desired order.

The "&&" operator selects days that exist in both calendars, while the "||" operator merges all days in both calendars together. The "!" operator selects all days that are not contained in the calendar it is applied to.

For example, to construct the calendar mentioned above, you would first define a us_holidays calendar and a canadian_holidays calendar using simple calendar expressions, then create a complex calendar using:

% bcadd -t "(canadian_holidays || us_holidays) && ! Wed"  na_holidays

Note

Since "Wed" is a built-in calendar, you do not need to define it beforehand.
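
Because a calendar is a set of days, the logical operators behave like set operations: "&&" is intersection, "||" is union, and "!" is the complement within the days under consideration. The Python sketch below works through the na_holidays expression for 1997. The two holiday sets are deliberately tiny, hypothetical samples (not complete holiday lists), and restricting the complement to a single year is an assumption made to keep the example finite.

```python
import datetime


def days_of_year(year):
    """All dates of one year; the universe against which "!" is taken."""
    first = datetime.date(year, 1, 1)
    count = (datetime.date(year + 1, 1, 1) - first).days
    return {first + datetime.timedelta(days=n) for n in range(count)}


universe = days_of_year(1997)

# Hypothetical, abbreviated sample calendars for illustration only.
us_holidays = {datetime.date(1997, 1, 1),
               datetime.date(1997, 7, 4),
               datetime.date(1997, 12, 25)}
canadian_holidays = {datetime.date(1997, 1, 1),
                     datetime.date(1997, 7, 1),
                     datetime.date(1997, 12, 25)}

# The built-in calendar "Wed": every Wednesday (weekday() == 2).
wed = {day for day in universe if day.weekday() == 2}

# (canadian_holidays || us_holidays) && ! Wed
na_holidays = (canadian_holidays | us_holidays) & (universe - wed)
```

With these sample sets, 1 January 1997 is dropped because it falls on a Wednesday, leaving 1 July, 4 July and 25 December 1997 in na_holidays.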

Manipulating Calendars Using the Command Line Interface

Although you can do all calendar-related operations through the GUI tools, LSF JobScheduler also includes the command line tools necessary for you to manipulate your calendars.

Calendars can be displayed using the bcal command:

% bcal
CALENDAR_NAME  OWNER  STATUS    LAST_CAL_DAY      NEXT_CAL_DAY
businessdays   sys    active    Thu Nov 20 1997   Mon Nov 24 1997
weekdays       sys    active    Thu Nov 20 1997   Mon Nov 24 1997
holidays       sys    inactive  Fri Jul 4 1997    Thu Dec 25 1997
on_monday      user1  inactive  Mon Nov 17 1997   Mon Nov 24 1997

By default, bcal shows all system calendars and the user's own calendars. You can view other users' calendars by using the -u option of the bcal command.

To know more details about each calendar, you can use the -l option of the bcal command:

% bcal -l quarterly
CALENDAR: quarterly
-- First day of each quarter
OWNER  STATUS     CREATION_TIME               LAST_MODIFY_TIME
user1  inactive   Fri Nov 14 17:50:01 1997    -
CAL_EXPRESSION: *:quarter:day(1)
LAST_CAL_DAY:     <Wed Oct 1 1997>
NEXT_CAL_DAY:     <Thu Jan 1 1998>

After a calendar is created, you can modify it using the bcmod command:

% bcmod -d "New description: quarterly monday" -t "*:quarter:mon(1)" quarterly

This overwrites the previous definition of calendar "quarterly". You can only modify your own calendars.

To delete a calendar, use the bcdel command:

% bcdel quarterly

Time Events

Time events in LSF JobScheduler are durations of time that are used as jobs' execution conditions. A time event is defined when a job is created with a time event dependency condition. The time event can be specified in the "Date & Time" area of the xbsub GUI, or via the -T option of the bsub command.

A time event has a start date and time at which the event becomes active, and a duration in which the event remains active. The start date can be represented by a calendar, and the start time and duration must be specified for each job.

When a job is submitted with a time event, LSF JobScheduler monitors the time event status. Once the current time falls within the period defined by the start time and duration, the time event becomes active and triggers the job execution.

Figure 9. Defining a Time Event

Figure 9 is an example of how a time event can be defined when submitting a job using the xbsub GUI.

Note that multiple hours and minutes can be put in the time area to create a time event that repeats multiple times per day. In particular, "*" can be put in the "Hours" or "Minutes" areas to refer to "every hour" or "every minute".

Time Expressions and the Command Line Interface

To define a time event for a job from the command-line interface, a time expression can be specified. A time expression has the following format:

[calendar_name[@user_name]:]hours:minutes[%duration]

Here, calendar_name, @user_name, and %duration are optional. By default, a user will be using his/her own calendars and system calendars. If you intend to use another user's calendar, you must use "calendar_name@user_name" to explicitly specify the owner of the calendar.

If a duration is not specified, LSF JobScheduler assumes a default of one minute.

If a calendar is not explicitly specified, LSF JobScheduler assumes the built-in calendar, "Day", as the calendar. The built-in calendar "Day" means every day. See `Built-in Calendars and Reserved Names for Calendars' on page 31 for detailed information on built-in calendars.

With time expressions, you can submit jobs associated with time events from the command line. For example:

% bsub -T "on_monday:2:00%60" backup_job

This creates a job "backup_job" that will run every Monday between 2:00AM and 3:00AM. The duration is 60 minutes indicating that the event will be active at 2:00AM and remain active until 3:00AM.

If the job is unable to start by 3:00AM, it is considered to have missed its schedule and will not be scheduled until the next time the event becomes active. You can define exception handlers to handle such situations.
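
The time expression format and its defaults can be illustrated with a short parser. This Python sketch is not how LSF JobScheduler parses the -T argument: it handles only a single numeric hour and minute, assumes the duration is given in minutes, treats the active period as starting at the start time and ending just before start plus duration, and ignores periods that cross midnight. The names parse_time_expression and is_active are hypothetical.

```python
import datetime


def parse_time_expression(expr):
    """Parse [calendar_name[@user_name]:]hours:minutes[%duration]."""
    body, _, duration = expr.partition("%")
    parts = body.split(":")
    if len(parts) == 3:
        cal, hours, minutes = parts
    elif len(parts) == 2:
        cal = "Day"  # default: the built-in calendar "Day"
        hours, minutes = parts
    else:
        raise ValueError(f"not a time expression: {expr!r}")
    name, _, owner = cal.partition("@")
    return {
        "calendar": name,
        "owner": owner or None,  # None: own or system calendar
        "hour": int(hours),
        "minute": int(minutes),
        "duration": int(duration) if duration else 1,  # default: 1 minute
    }


def is_active(spec, now, calendar_days):
    """True if `now` lies in the event's period on one of calendar_days."""
    if now.date() not in calendar_days:
        return False
    start = datetime.datetime.combine(
        now.date(), datetime.time(spec["hour"], spec["minute"]))
    end = start + datetime.timedelta(minutes=spec["duration"])
    return start <= now < end


spec = parse_time_expression("on_monday:2:00%60")
mondays = {datetime.date(1997, 11, 24)}  # sample day from on_monday
```

With these definitions, is_active(spec, datetime.datetime(1997, 11, 24, 2, 30), mondays) is True, while the same check at 3:00AM is False, which corresponds to the job having missed its schedule.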

See Section 7, `Exception Handling and Alarms', beginning on page 123 for details on exception handling.





doc@platform.com

Copyright © 1994-1998 Platform Computing Corporation.
All rights reserved.