How to Run and Monitor Jobs for Slurm

This tutorial goes through the steps of editing a simple batch script and running it on the Grid. The job script can be used as a base to create your own batch scripts. There is also an overview of commands for monitoring and controlling jobs.

1. Log on to the Grid.

2. Copy the job script to your home directory. Type: cp /wsu/el7/scripts/tutorial/simple_job.sh .

3. Confirm that the script was copied by listing the directory. Type: ls. To view the contents of the script itself, type: cat simple_job.sh

Edit the script to fit your needs. Type: vim simple_job.sh

4. You are now in the vim text editor. Press 'i' to enter insert mode so you can edit. Use the up and down arrows to scroll through the file. Change the email address to your own.

Press 'Esc' and then type ':wq' and press 'Enter' to save and quit.
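
The email change can also be made without opening an editor. The sketch below is only an illustration: it assumes the script sets the address on a line beginning with '#SBATCH --mail-user=' (the contents of simple_job.sh are not shown here, so check your copy first), and the function name set_mail_user is made up for this example.

```shell
#!/usr/bin/env bash
# Replace the address on the "#SBATCH --mail-user=" line of a batch script.
# Assumption: the script contains such a line; if it does not, the file is
# rewritten with identical contents.
set_mail_user() {
    local script="$1" email="$2" tmp
    tmp="$(mktemp)" || return 1
    # Rewrite only the directive line; every other line is copied as-is.
    # Writing back with cat keeps the original file's permissions.
    sed "s|^#SBATCH --mail-user=.*|#SBATCH --mail-user=${email}|" "$script" > "$tmp" \
        && cat "$tmp" > "$script"
    rm -f "$tmp"
}

# Usage (the address is a placeholder):
#   set_mail_user simple_job.sh yourname@example.edu
```

Addresses containing '|' or '&' would need escaping before being passed to sed.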

5. Now that the script is edited you can submit it to run. Batch scripts are submitted using the following command: sbatch simple_job.sh
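
On success, sbatch prints a line of the form 'Submitted batch job <jobid>'. If you want to reuse the job id in later commands, one possible approach is to capture it in a shell variable. The helper name parse_job_id below is hypothetical and the function is a minimal sketch, not part of Slurm.

```shell
#!/usr/bin/env bash
# Extract the numeric job id from the line sbatch prints on success,
# e.g. "Submitted batch job 243915".
parse_job_id() {
    local line="$1"
    local id="${line##* }"          # keep the text after the last space
    case "$id" in
        ''|*[!0-9]*) return 1 ;;    # not a plain number: treat as failure
        *) printf '%s\n' "$id" ;;
    esac
}

# Usage on the Grid (requires Slurm):
#   jobid="$(parse_job_id "$(sbatch simple_job.sh)")"
#   squeue -j "$jobid"
```

sbatch also has a '--parsable' option that prints the job id without the surrounding text, which may be simpler if your site's Slurm version supports it.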

6. The job is submitted and assigned a job id. In this example, the job id is '243915'. You can check the status of your jobs by entering: qme

The same can be done by using the command: squeue -u <username>

This will output the following information:

Squeue Output       Definition
JOBID               Unique number assigned to each job
PARTITION           Partition the job is scheduled to run or is running on
NAME                Name of the job, typically the job script name
USER                User id of the job
ST                  Current state of the job
TIME                Amount of time job has been running
NODES               Number of nodes job is scheduled to run across
NODELIST(REASON)    If running, the list of the nodes the job is running on. If pending, the reason the job is waiting
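
When many jobs are queued, it can help to summarize the squeue listing rather than read it line by line. The following sketch counts jobs per state code. It assumes squeue's default column order described above, with ST as the fifth column, and job names without spaces; the function name count_by_state is made up for this example.

```shell
#!/usr/bin/env bash
# Count jobs per state code from squeue's default output read on stdin.
# Assumes ST is the fifth whitespace-separated column; the header line is skipped.
count_by_state() {
    awk 'NR > 1 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on the Grid (requires Slurm):
#   squeue -u <username> | count_by_state
```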

These are the various job states:

Code    State          Meaning
CA      Canceled       Job was canceled
CD      Completed      Job completed
CF      Configuring    Job resources being configured
CG      Completing     Job is completing
F       Failed         Job terminated with non-zero exit code
NF      Node Fail      Job terminated due to failure of node(s)
PD      Pending        Job is waiting for compute node(s)
R       Running        Job is running on compute node(s)
TO      Timeout        Job terminated upon reaching its time limit
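
If you use these codes in your own scripts, a small lookup can turn them back into readable names. This is a sketch covering only the codes listed in the table above; the function name describe_state is hypothetical.

```shell
#!/usr/bin/env bash
# Map a squeue state code (ST column) to the state name from the table above.
# Codes not listed in the table are reported as unknown.
describe_state() {
    case "$1" in
        CA) echo "Canceled" ;;
        CD) echo "Completed" ;;
        CF) echo "Configuring" ;;
        CG) echo "Completing" ;;
        F)  echo "Failed" ;;
        NF) echo "Node Fail" ;;
        PD) echo "Pending" ;;
        R)  echo "Running" ;;
        TO) echo "Timeout" ;;
        *)  echo "Unknown state code: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   describe_state PD
```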

7. A useful command for getting job information is: scontrol show job <jobid>
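
The output of scontrol show job is a long list of Key=Value pairs. To pull out a single field, something like the following sketch may be enough. It assumes the value you want contains no spaces (true for fields such as JobState or TimeLimit, but not necessarily for paths or job names), and the function name job_field is made up for this example.

```shell
#!/usr/bin/env bash
# Print the value of one Key=Value field from "scontrol show job" output on stdin.
# Assumption: the value contains no whitespace.
job_field() {
    local key="$1"
    # Put every Key=Value pair on its own line, then print the first match.
    tr -s ' \t' '\n\n' |
        awk -F= -v k="$key" '!found && $1 == k { print substr($0, length(k) + 2); found = 1 }'
}

# Usage on the Grid (requires Slurm):
#   scontrol show job <jobid> | job_field JobState
```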

8. After your job has completed, you can get additional information using the command sacct.

Command                 Meaning
sacct -j <jobid>        Get information based on job id
sacct -j <jobid> --format=JobID,Jobname,partition,state,time,MaxRss,MaxVMSize,nodelist
                        For more detailed output, add the '--format' option. See the sacct man page for the complete list of options.
sacct -u <username>     View information for all jobs of a user
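
For scripting, sacct can print pipe-delimited output with '-P' (--parsable2) and drop the header with '-n' (--noheader). The sketch below reads such output and prints the state recorded for the job itself, ignoring step lines such as '<jobid>.batch'. The function name final_state and the sample job id are only illustrative.

```shell
#!/usr/bin/env bash
# Print the State column for the line whose JobID equals the given job id.
# Expects "JobID|State" lines on stdin, as produced by:
#   sacct -n -P -j <jobid> --format=JobID,State
final_state() {
    local jobid="$1"
    awk -F'|' -v id="$jobid" '$1 == id { print $2 }'
}

# Usage on the Grid (requires Slurm):
#   sacct -n -P -j 243915 --format=JobID,State | final_state 243915
```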

9. There are a few useful commands for controlling jobs.

Command                     Meaning
scancel <jobid>             Cancel one job
scancel -u <username>       Cancel all jobs for a user
scontrol hold <jobid>       Hold a job from being scheduled
scontrol release <jobid>    Release a job to be scheduled
scontrol requeue <jobid>    Requeue (cancel and rerun) a job
scancel <jobid>_<index>     Cancel an indexed job in a job array
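
To cancel several tasks of a job array, the '<jobid>_<index>' form can be wrapped in a loop. The following is a cautious sketch, not a site-provided tool: the function name cancel_array_tasks and the DRY_RUN variable are made up for this example, and by default the function only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Cancel array tasks <jobid>_<first> .. <jobid>_<last>, one scancel call per task.
# DRY_RUN defaults to 1, so the commands are only printed; set DRY_RUN=0 to run them.
cancel_array_tasks() {
    local jobid="$1" first="$2" last="$3" i
    for (( i = first; i <= last; i++ )); do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "scancel ${jobid}_${i}"
        else
            scancel "${jobid}_${i}"
        fi
    done
}

# Usage: preview first, then run for real on the Grid.
#   cancel_array_tasks 243915 1 3
#   DRY_RUN=0 cancel_array_tasks 243915 1 3
```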