HPC Grid Tutorial: How to Run and Monitor Jobs for SlurmShare & Print
This tutorial goes through the steps of editing a simple batch script and running it on the Grid. The job script can be used as a base to create your own batch scripts. There is also an overview of commands for monitoring and controlling jobs.
1. Log on to the Grid.
2. Copy the job script to your home directory. Type: cp /wsu/el7/scripts/tutorial/simple_job.sh .
3. The contents of the script can be viewed by typing: ls
Edit the script to fit your needs. Type: vim simple_job.sh
4. You are now in the vim text editor. Press 'i' to insert and edit. Use the up and down arrows to scroll through the file. Edit the email address to your own.
Press 'Esc' and then type ':wq' and press 'Enter' to save and quit.
5. Now that the script is edited you can submit it to run. Batch scripts are submitted using the following command: sbatch simple_job.sh
The job will be submitted and a job id will be given. In this example, the job id is '243915'. You can check the status of your jobs by entering: qme
The same can be done by using the command: squeue -u <username>
This will output the following information:
Unique number assigned to each job
Partition the job is scheduled to run or is running on
Name of the job, typically the job script name
User id of the job
Current state of the job
Amount of time job has been running
Number of nodes job is scheduled to run across
If running, the list of the nodes the job is running on. If pending, the reason the job is waiting
This is the various job states:
Job was canceled
Job resources being configured
Job is completing
Job terminated with non-zero exit code
Job terminated due to failure of node(s)
Job is waiting for compute node(s)
Job is running on compute node(s)
Job terminated upon reaching its time limit
7. A useful command in getting job information is: scontrol show job <jobid>
8. After your job has completed, you can get additional information using the command sacct.
sacct -j <jobid>
Get information based on job id
sacct -j <jobid> --format=JobID,Jobname,partition,state,time,MaxRss,MaxVMSize,nodelist
For a more detailed output add the '--format' option. Reference the man page for it here for complete options of the command.
sacct -u <username>
View information for all jobs of a user
9. There are a few useful commands for controlling jobs.
Cancel one job
scancel -u <username>
Cancel all jobs for a user
scontrol hold <jobid>
Hold a job from being scheduled
scontrol release <jobid>
Release a job to be scheduled
scontrol requeue <jobid>
Requeue (cancel and rerun) a job
Cancel an indexed job in a job array