Slurm Quick Start

Running Jobs / Slurm Scheduler

CAC's Slurm page explains what Slurm is and how to use it to run your jobs. Please take the time to read this page, giving special attention to the parts that pertain to the types of jobs you want to run.

  • NOTE: Users should not run codes on the head node. Users who do so will be notified and have privileges revoked.

Here are a few Slurm commands to get familiar with initially.

Show cluster status and configuration

sinfo
scontrol show partition
scontrol show nodes

Submit a job to run on the cluster

sbatch <script_name>                # batch job
srun -p common --pty /bin/bash -l   # interactive job, starts a login shell

Check status of jobs, or cancel a job

squeue -u <user_id>
squeue -j <job_id>
scontrol show job <job_id>
scancel <job_id>                    # cancel a job

Slurm Examples & Tips

NOTE: Examples are bash scripts with special syntax.

  • Lines that begin with #SBATCH are directives to the scheduler. Bash sees these as comments. If you want such a line to be ignored by the scheduler as well as bash, begin it with two # characters (##SBATCH).
  • #SBATCH directives are equivalent to options to the sbatch command. If the same option appears in the sbatch command, then the command line takes precedence.
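As a minimal sketch of the two points above, the script below has one disabled directive and two active ones. The job name DirectiveDemo and the file name directive_demo.sh are made up for illustration. Because every directive is a comment to bash, the script also runs unchanged outside Slurm.

```shell
#!/bin/bash
#SBATCH -J DirectiveDemo     # active: sets the job name
##SBATCH -t 02:00:00         # disabled: ignored by the scheduler and by bash
#SBATCH -t 00:05:00          # active: 5-minute time limit

# Slurm sets SLURM_JOB_NAME inside a job; fall back to a placeholder elsewhere.
job_name="${SLURM_JOB_NAME:-none (not running under Slurm)}"
echo "job name: $job_name"
```

Submitting it as sbatch -t 00:01:00 directive_demo.sh would replace the 5-minute limit with a 1-minute limit, since the command line takes precedence over the directive.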

Example one-task batch job to run in the partition: common

Example sbatch script to run a job with one task (default) in the 'common' partition (i.e., queue):

#!/bin/bash
#SBATCH -J TestJob           # job name
#SBATCH -p common            # partition (queue) name
#SBATCH -t 00:10:00          # time limit, hh:mm:ss
#SBATCH --ntasks-per-core=1  # disregard hyperthreading
#SBATCH --mem-per-cpu=4GB    # request 4GB/cpu = 8 GB/core
#SBATCH -o testjob-%j.out    # output file (stdout)
#SBATCH -e testjob-%j.err    # error file (stderr)

echo "starting at `date` on `hostname`"

# Print the Slurm job ID
echo "SLURM_JOB_ID=$SLURM_JOB_ID"

echo "hello world `hostname`"

echo "ended at `date` on `hostname`"
exit 0

Notes on one-task jobs:

  • The --ntasks-per-core=1 option assumes 1 task is enough to keep a core busy (default is 2).
  • Due to hyperthreading, 1 core = 2 cpus; therefore, --ntasks-per-core=1 is equivalent to --cpus-per-task=2 or -c 2.
  • Executables that are multithreaded with OpenMP should increase --cpus-per-task to an appropriate value; see the main Slurm page.
  • If the --mem-per-cpu request exceeds the configured maximum, Slurm lowers it to fit and raises --cpus-per-task to compensate.
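The cpu and core arithmetic in these notes can be checked with a small sketch. It assumes, as stated above, that hyperthreading makes each core appear as 2 cpus. The names cores_for_cpus and mem_per_core_gb are hypothetical helpers, not Slurm commands.

```shell
#!/bin/bash
CPUS_PER_CORE=2   # assumption from the notes above: 1 core = 2 cpus

# Number of whole cores needed to supply a given number of cpus (rounds up).
cores_for_cpus() {
    echo $(( ($1 + CPUS_PER_CORE - 1) / CPUS_PER_CORE ))
}

# Memory per core, in GB, implied by a --mem-per-cpu value given in GB.
mem_per_core_gb() {
    echo $(( $1 * CPUS_PER_CORE ))
}

echo "--cpus-per-task=2  -> $(cores_for_cpus 2) core(s)"
echo "--cpus-per-task=15 -> $(cores_for_cpus 15) core(s)"
echo "--mem-per-cpu=4GB  -> $(mem_per_core_gb 4) GB per core"
```

An odd cpu count is rounded up to a whole core, which is why a request for 15 cpus occupies 8 cores.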

Submit/run your job:

sbatch example.sh

View your job:

scontrol show job <job_id>
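
When sbatch accepts a submission it reports the new job's ID, normally in a line of the form "Submitted batch job <job_id>". The sketch below pulls the ID out of such a line so it can be passed to scontrol or squeue. parse_job_id is a hypothetical helper, and the message format is assumed to be the standard one.

```shell
#!/bin/bash
# Print the job ID contained in sbatch's confirmation message.
# Returns non-zero if the message does not look like a confirmation.
parse_job_id() {
    case "$1" in
        "Submitted batch job "[0-9]*)
            id=${1#"Submitted batch job "}   # drop the fixed prefix
            echo "${id%% *}"                 # drop anything after the ID
            ;;
        *)
            return 1
            ;;
    esac
}

parse_job_id "Submitted batch job 12345"
```

On the cluster this might be used as job_id=$(parse_job_id "$(sbatch example.sh)"), followed by scontrol show job "$job_id". The sbatch option --parsable is an alternative that prints the job ID without the surrounding text.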

Example MPI batch job to run in the partition: common

Example sbatch script to run a job with 60 tasks balanced across 3 nodes in the 'common' partition (i.e., queue):

#!/bin/bash
#SBATCH -J TestJob           # job name
#SBATCH -p common            # partition (queue) name
#SBATCH -t 00:10:00          # time limit, hh:mm:ss
#SBATCH -n 60                # number of tasks
#SBATCH -N 3                 # number of nodes
#SBATCH --exclusive          # no other users on nodes
#SBATCH --ntasks-per-core=1  # disregard hyperthreading
#SBATCH -o testjob-%j.out    # output file (stdout)
#SBATCH -e testjob-%j.err    # error file (stderr)

echo "starting at `date` on `hostname`"

# Print Slurm job properties
echo "SLURM_JOB_ID = $SLURM_JOB_ID"
echo "SLURM_NTASKS = $SLURM_NTASKS"
echo "SLURM_JOB_NUM_NODES = $SLURM_JOB_NUM_NODES"
echo "SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
echo "SLURM_JOB_CPUS_PER_NODE = $SLURM_JOB_CPUS_PER_NODE"

mpiexec -n $SLURM_NTASKS ./hello_mpi

echo "ended at `date` on `hostname`"
exit 0

Notes on MPI jobs:

  • Slurm uses -n (same as --ntasks) together with --ntasks-per-core or --cpus-per-task to reserve enough slots for the job.
  • The number of nodes -N (same as --nodes) sets a minimum. To set both min and max, specify a range, e.g., -N3-6.
  • Typically an MPI job needs --exclusive access to nodes for proper load balancing.
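For the 60-task, 3-node example above, a balanced layout means 20 tasks per node. The sketch below does that division for other job sizes, rounding up when the tasks do not divide evenly. tasks_per_node is a hypothetical helper; the result could be given to sbatch's --ntasks-per-node option if you want to state the layout explicitly.

```shell
#!/bin/bash
# Largest number of tasks any one node must hold when ntasks are spread
# as evenly as possible over nnodes.
tasks_per_node() {
    ntasks=$1
    nnodes=$2
    echo $(( (ntasks + nnodes - 1) / nnodes ))
}

echo "60 tasks on 3 nodes: $(tasks_per_node 60 3) per node"
echo "50 tasks on 3 nodes: at most $(tasks_per_node 50 3) per node"
```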

To include or exclude specific nodes in your batch script

To run on a specific node, add the following line to your batch script. (If the job needs more resources than that node provides, Slurm may allocate additional nodes as well.)

#SBATCH --nodelist=c0009

To include one or more nodes that you specifically want, add the following line to your batch script:

#SBATCH --nodelist=<node_names_you_want_to_include>

## e.g., to include c0006:
#SBATCH --nodelist=c0006

## to include c0006 and c0007 (also illustrates shorter syntax):
#SBATCH -w c000[6,7]

To exclude one or more nodes, add the following line to your batch script:

#SBATCH --exclude=<node_names_you_want_to_exclude>

## e.g., to avoid c0006 through c0008, and c0013:
#SBATCH --exclude=c00[06-08,13]

## to exclude c0006 (also illustrates shorter syntax):
#SBATCH -x c0006
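
To double-check which nodes a bracket expression names, you can expand it before submitting. On a Slurm system, scontrol show hostnames 'c00[06-08,13]' prints the expanded list. The sketch below mimics that for the simple form used in the examples above (one bracket group at the end of the name), so the expansion rule can be seen without access to the cluster. expand_nodes is a hypothetical helper written for this illustration, not a Slurm tool.

```shell
#!/bin/bash
# Expand ONE trailing bracket group, e.g. c00[06-08,13] -> c0006 ... c0013.
expand_nodes() {
    spec=$1
    case "$spec" in
        *\[*\]) ;;                       # has a trailing [...] group
        *) echo "$spec"; return 0 ;;     # plain node name: print as-is
    esac
    prefix=${spec%%\[*}                  # text before the bracket
    body=${spec#*\[}                     # text after the opening bracket
    body=${body%\]}                      # drop the closing bracket

    old_ifs=$IFS
    IFS=,
    set -- $body                         # split the group on commas
    IFS=$old_ifs

    for part in "$@"; do
        case "$part" in
            *-*)
                lo=${part%-*}
                hi=${part#*-}
                width=${#lo}             # keep the zero padding of the range start
                # strip leading zeros so the shell does not read 08 as octal
                i=${lo#"${lo%%[1-9]*}"}; i=${i:-0}
                j=${hi#"${hi%%[1-9]*}"}; j=${j:-0}
                while [ "$i" -le "$j" ]; do
                    printf "%s%0${width}d\n" "$prefix" "$i"
                    i=$((i + 1))
                done
                ;;
            *)
                echo "$prefix$part"      # single value: append to the prefix
                ;;
        esac
    done
}

expand_nodes 'c00[06-08,13]'
expand_nodes 'c000[6,7]'
```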

Environment variables defined for tasks that are started with srun

If you submit a batch job in which you run the following script with srun -n $SLURM_NTASKS, you will see how the various environment variables are defined.

#!/bin/bash
echo "Hello from `hostname`," \
"$SLURM_CPUS_ON_NODE CPUs are allocated here," \
"I am rank $SLURM_PROCID on node $SLURM_NODEID," \
"my task ID on this node is $SLURM_LOCALID"

These variables are not defined in the same useful way in the environments of tasks that are started with mpiexec or mpirun.
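
A common use of these per-task variables is to let one task do work that should happen only once, such as writing a header. The sketch below uses $SLURM_PROCID from the script above. The fallback value 0 is an assumption added here so the script also runs outside srun, and task_role is a hypothetical helper.

```shell
#!/bin/bash
# Report what this task should do, based on its rank.
task_role() {
    rank=${SLURM_PROCID:-0}      # global rank of this task; 0 outside srun
    if [ "$rank" -eq 0 ]; then
        echo "rank $rank: writing the run header"
    else
        echo "rank $rank: computing only"
    fi
}

task_role
```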

Use $HOME within your script rather than the full path to your home directory

In order to access files in your home directory, you should use $HOME rather than the full path. To test, you could add to your batch script:

echo "my home dir is $HOME"

Then view the output file you set in your batch script to get the result.

Copy your data to /tmp to avoid heavy I/O from your NFS-mounted $HOME!

We cannot stress enough how important this is to avoid delays on the file systems.

#!/bin/bash
#SBATCH -J TestJob           # job name
#SBATCH -p common            # partition (queue) name
#SBATCH -t 01:30:00          # time limit, hh:mm:ss
#SBATCH -n 1                 # can omit this, default is 1 task
#SBATCH --cpus-per-task=15   # Slurm rounds it to 16 = 8 cores
#SBATCH -o testjob-%j.out    # output file (stdout)
#SBATCH -e testjob-%j.err    # error file (stderr)

echo "starting $SLURM_JOBID at `date` on `hostname`"
echo "my home dir is $HOME"

## copy my data to a local tmp space on the compute node to reduce I/O
MYTMP="/tmp/$USER/$SLURM_JOB_ID"
/usr/bin/mkdir -p "$MYTMP" || exit $?
echo "Copying my data over..."
cp -rp "$SLURM_SUBMIT_DIR/mydatadir" "$MYTMP" || exit $?

## run your job executables here...

echo "ended at `date` on `hostname`"
echo "Copying my data back to $SLURM_SUBMIT_DIR/newdatadir..."
/usr/bin/mkdir -p "$SLURM_SUBMIT_DIR/newdatadir" || exit $?
cp -rp "$MYTMP" "$SLURM_SUBMIT_DIR/newdatadir" || exit $?
## remove your data from the compute node /tmp space
rm -rf "$MYTMP"

exit 0

Explanation: /tmp refers to a local directory that is found on each compute node. It is faster to use /tmp because when you read and write to it, the I/O does not have to go across the network, and it does not have to compete with the other users of a shared network drive (such as the one that holds everyone's $HOME).
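
In the script above, an early exit (for example, after a failed cp) leaves the scratch directory behind in /tmp. One way to guard against that is to run the work in a subshell with an EXIT trap. This is a sketch that assumes the work can be wrapped in a single command; run_in_scratch is a hypothetical helper, not part of Slurm.

```shell
#!/bin/bash
# Usage: run_in_scratch <scratch_dir> <command> [args...]
# Creates the scratch directory, runs the command inside it, and removes the
# directory afterwards whether the command succeeded or failed.
run_in_scratch() {
    (
        scratch=$1
        shift
        mkdir -p "$scratch" || exit $?
        trap 'rm -rf "$scratch"' EXIT    # runs when this subshell exits
        cd "$scratch" || exit $?
        "$@"
        exit $?                          # pass the command's status to the caller
    )
}

# Outside Slurm, fall back to the shell's PID for a unique directory name.
MYTMP="/tmp/${USER:-nobody}/${SLURM_JOB_ID:-$$}"
run_in_scratch "$MYTMP" sh -c 'echo "working in $(pwd)"'
echo "exit status: $?"
```

Anything you need to keep must be copied out by the command itself before it returns, because the directory is gone once run_in_scratch finishes.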

To look at files in /tmp while your job is running, ssh to the login node, then ssh from there to the compute node that your job was assigned. You can then cd /tmp on that node and inspect the files there with cat or less.

Note: if your application produces thousands of output files that you need to save, it is far more efficient to put them all into a single tar or zip file before copying them into $HOME as the final step.
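
A minimal sketch of that final step, assuming the output files sit together in one directory under the scratch space. bundle_outputs and the directory names are hypothetical; the demonstration uses throwaway directories so that it can be tried anywhere.

```shell
#!/bin/bash
# Usage: bundle_outputs <output_dir> <dest_dir>
# Packs <output_dir> into one compressed archive, <dest_dir>/<name>.tar.gz,
# so that a single large file crosses the network instead of many small ones.
bundle_outputs() {
    out_dir=$1
    dest_dir=$2
    name=$(basename "$out_dir")
    mkdir -p "$dest_dir" || return $?
    tar -czf "$dest_dir/$name.tar.gz" -C "$(dirname "$out_dir")" "$name"
}

# Demonstration: create 100 small files, bundle them, and count the entries.
work=$(mktemp -d) || exit 1
mkdir -p "$work/results"
i=1
while [ "$i" -le 100 ]; do
    echo "result $i" > "$work/results/out_$i.txt"
    i=$((i + 1))
done

bundle_outputs "$work/results" "$work/saved" || exit $?
echo "files in archive: $(tar -tzf "$work/saved/results.tar.gz" | grep -c 'out_')"
rm -rf "$work"
```

In a batch script like the one above, the call might look like bundle_outputs "$MYTMP/results" "$SLURM_SUBMIT_DIR/newdatadir", where results is a placeholder for your own output directory.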