Job Management

DMOG uses the Slurm job scheduler to manage resources and ensure fair access for all users. To submit jobs to the scheduler, users write a submission script specifying the required resources (such as the number of CPUs, amount of memory, etc.) and input/output files. After submitting a job, Slurm will try to allocate the requested resources and start the job. If the requested resources are not immediately available, the job will be queued until enough resources become available.

In addition to batch mode jobs, Slurm also supports interactive jobs which can be useful for testing code or troubleshooting problems. To start an interactive session, users can request an allocation of resources and then start a shell session on a compute node.
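
For example, a minimal interactive session can be requested with srun. The resource amounts and time limit below are placeholders, and depending on the DMOG configuration a --partition option may also be required:

    # Request 1 task with 4 CPU cores and 8 GB of memory for one hour,
    # then start an interactive shell on the allocated compute node
    srun --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash

Exiting the shell ends the job and releases the allocation.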

Slurm is used not only to submit and monitor jobs but also to describe the resources each job requires. To ensure that your job starts promptly and runs efficiently, it is important to specify appropriate resources in your submission script.

Basic Slurm commands

  • sinfo: This command is used to display information about the nodes available on the cluster, such as their state (idle, down, etc.), partition, number of CPUs, and amount of available memory. It can help users choose appropriate resources for their jobs.

  • squeue: This command is used to display the current queue of jobs on the cluster, including their state, owner, and elapsed run time. With the --start option it also reports the expected start time of pending jobs, which helps users estimate when their jobs will begin running.

  • sbatch: This command is used to submit a batch job to the Slurm scheduler. Users specify the required resources and the commands to run in a submission script and then submit it with sbatch. Slurm then schedules the job and runs it on the requested resources (see the example script below).

  • scancel: This command is used to cancel a running or pending job on the cluster. Users can specify the job ID or job name to be cancelled.

  • scontrol: This command is used to view and modify Slurm configuration and status information. For example, scontrol show job, scontrol show node, and scontrol show partition display detailed information about individual jobs, nodes, and partitions.

Further information about these commands is available in the online manual: man <command>
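
As an example, the following minimal submission script runs a single program on one node. The job name, resource amounts, time limit, and program are placeholders to adapt to your own work, and depending on the DMOG configuration a --partition directive may also be needed:

    #!/bin/bash
    #SBATCH --job-name=my_job        # name shown in squeue
    #SBATCH --ntasks=1               # number of tasks (processes)
    #SBATCH --cpus-per-task=4        # CPU cores per task
    #SBATCH --mem=8G                 # memory per node
    #SBATCH --time=02:00:00          # maximum run time (hh:mm:ss)
    #SBATCH --output=my_job_%j.out   # output file (%j expands to the job ID)

    # Commands executed on the allocated resources
    ./my_program input.dat

If the script is saved as my_job.sh, it is submitted with sbatch my_job.sh, and Slurm prints the job ID it assigns.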

Job queue states and reasons

The squeue command allows users to view information about the state of a job. The default output format of the command is as follows:

  JOBID  PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
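
For example, to list only your own jobs:

    squeue -u $USER

The output might then look as follows (the job IDs, names, partition, and node names here are made up for illustration):

    JOBID  PARTITION  NAME    USER   ST  TIME   NODES  NODELIST(REASON)
    12345  compute    my_job  alice  R   10:02  1      node001
    12346  compute    my_job  alice  PD  0:00   1      (Priority)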

Where:

  JOBID             Job or step ID. For array jobs, the job ID has the form <job_id>_<index>.
  PARTITION         Partition of the job/step.
  NAME              Name of the job/step.
  USER              Owner of the job/step.
  ST                State of the job/step. See below for a description of the most common states.
  TIME              Time used by the job/step, in the format days-hours:minutes:seconds (days and hours are only printed when needed).
  NODES             Number of nodes allocated to the job, or the minimum number of nodes required by a pending job.
  NODELIST(REASON)  For pending jobs, the reason why the job is pending; for failed jobs, the reason why it failed; for all other job states, the list of allocated nodes. See below for a list of the most common reason codes.

During its lifetime, a job passes through several states. The most common states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED. The abbreviations used by squeue for these and some other common states are shown in the table below:

  PD  Pending. Job is waiting for resource allocation.
  R   Running. Job has an allocation and is running.
  S   Suspended. Execution has been suspended and resources have been released for other jobs.
  CA  Cancelled. Job was explicitly cancelled by the user or the system administrator.
  CG  Completing. Job is in the process of completing; some processes on some nodes may still be active.
  CD  Completed. Job has terminated all processes on all nodes with an exit code of zero.
  F   Failed. Job has terminated with a non-zero exit code or other failure condition.
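
The state codes can also be used to filter the output of squeue, for example:

    # Show only your own pending jobs
    squeue -u $USER --states=PENDING

    # Show all running jobs on the cluster
    squeue --states=RUNNING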

If a job is pending or has failed, the reason for its current state is given in the last column of the squeue output. Some of the most common reasons are:

  (Resources)         The job is waiting for resources to become available so that the job's resource request can be fulfilled.
  (Priority)          The job is not allowed to run because at least one higher-priority job is waiting for resources.
  (Dependency)        The job is waiting for another job to finish first (--dependency=... option).
  (TimeLimit)         The job exhausted its time limit.
  (ReqNodeNotAvail)   Some node required by the job is currently not available. The node may be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding.
  (JobLaunchFailure)  The job could not be launched. This may be due to a file system problem, an invalid program name, etc.

For a complete list of job state codes and reason codes, see Job State Codes and Job Reasons.
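
To inspect a single job in more detail, including its state, reason, and requested resources, use scontrol and replace <jobid> with the ID reported by sbatch or shown by squeue:

    scontrol show job <jobid>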