Basic job control
specify resources
There are a lot of options to specify resources; we will list just the most important ones for our clusters here:
gpus
In case you want to make use of GPUs, first decide whether you want to use one of the large GPU systems (4x H100, 4x MI210); if so, you have to add the --partition=GPU option. In any case you have to add --gres=gpu:<TYPE>:<N>.
So, for example, if you want to use 2x H100 for your job, you have to add:
Example Submit for 2x H100
--partition=GPU --gres=gpu:h100:2
Another example: if you want to use 1x Nvidia L40S for your job, then you only have to add this:
Example requesting one L40S
--gres=gpu:l40:1
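Putting these options together in a batch script, a minimal sketch could look like the following; the job name and the program call ./my_gpu_program are placeholders for your own application.
#!/bin/bash
#SBATCH --job-name=gpu_test          # placeholder job name
#SBATCH --partition=GPU              # large GPU systems (4x H100 / 4x MI210)
#SBATCH --gres=gpu:h100:2            # request 2x H100
#SBATCH --ntasks=1

# ./my_gpu_program is a placeholder for your own executable
srun ./my_gpu_program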
number of tasks
Specify how many tasks (processes) should run across your whole job. Slurm will reserve these resources as n CPUs for n tasks if --cpus-per-task is not specified otherwise.
Example start 256 tasks (e.g. 256 MPI processes)
#SBATCH --ntasks=256
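As a sketch, a complete submission script for such an MPI job might look like this; ./my_mpi_program is a placeholder for your own executable.
#!/bin/bash
#SBATCH --ntasks=256

# srun launches one MPI process per reserved task
srun ./my_mpi_program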
cpus per task
With --cpus-per-task you specify how many CPUs are reserved for each task.
Example start 8 tasks with 16 cores each
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=16
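For a hybrid MPI/OpenMP job it is common to pass the reserved CPUs per task on to the threading runtime via the SLURM_CPUS_PER_TASK environment variable. A minimal sketch, assuming a hypothetical executable ./my_hybrid_program:
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=16

# let the OpenMP runtime use the reserved cores of each task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_hybrid_program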
number of nodes
You can specify the number of nodes by using --nodes=N.
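For example, to spread 256 tasks evenly over 2 nodes you could combine --nodes with --ntasks-per-node. This is only a sketch; the numbers have to match the core count of the nodes you are using.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128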
check job status
squeue
The easiest way to get an overview of your waiting or running jobs is the squeue command.
squeue
Modify the SQUEUE_FORMAT variable at the end of your .bashrc or .zshrc, following this description, to customize the output of squeue the way you like it best.
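As one possible starting point, you could add a line like the following to your .bashrc or .zshrc; the format codes shown here roughly reproduce the columns used in the squeue examples below and can be adjusted to taste.
export SQUEUE_FORMAT="%.10i %.9P %.12j %.10u %.2t %.10M %.10L %.6D %R"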
sacct
You can find out more details with the sacct command.
The following will show when a job was submitted and when it started:
[bt123456@festus01 hpcadmin]# sacct -X -j 6089 -o jobid,jobname,Submit,Start
JobID JobName Submit Start
------------ ---------- ------------------- -------------------
6089 RUN_GR 2024-12-02T12:37:35 2024-12-02T12:47:07
Tip
To check the submission script of a given job, use:
sacct -j $SLURM_JOB_ID -B
To check what the submission command was, use:
sacct -j $SLURM_JOB_ID -o submitline
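sacct can also report other fields, for example the state, exit code, and runtime of a finished job; the job id 6089 is just the one from the example above.
sacct -X -j 6089 -o jobid,jobname,state,exitcode,elapsed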
check node status
To check the partitions and their node states, e.g. whether the nodes are idle, allocated, drained, and so on, one can use:
sinfo
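If you prefer a per-node view instead of the partition summary, sinfo can also list the nodes individually, for example:
sinfo -N -l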
Interactive jobs
To get an interactive job with an interactive shell on a compute node, you can call srun with the --pty flag.
Start an interactive job with a bash session; 128 cores reserved for one task on a single node.
srun --nodes=1 --cpus-per-task=128 --pty bash -i
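The same resource options as for batch jobs can be combined with an interactive session. As a sketch, an interactive shell with one H100 and 16 cores could be requested like this; adjust the partition, GPU type, and core count to your needs.
srun --partition=GPU --gres=gpu:h100:1 --cpus-per-task=16 --pty bash -i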
Modify existing job
Slurm offers the opportunity to change some parameters of an existing job.
For this purpose, scontrol update jobid=<jobid> <OPTIONS> is used.
Adding mail notification to an existing job
Assuming we have an existing job:
[bt123456@festus01 ~]$ squeue
JOBID PARTITION NAME USER ST TIME TIME_LEFT NODES NODELIST(REASON)
6282 normal hostname bt123456 PD 0:00 8:00:00 1 (BeginTime)
But we have forgotten to add mail notification, both in the submission script and on the submission command line. To add these settings, we can modify the existing job by running this command:
scontrol update jobid=6282 MailUser=bt123456@uni-bayreuth.de MailType=ALL
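Other parameters can be changed in the same way, for example the time limit of the pending job; whether a given change is permitted can depend on the job state and the cluster configuration.
scontrol update jobid=6282 TimeLimit=04:00:00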
For an overview of the options you can change for an already submitted job, please consult the scontrol documentation.
Cancel jobs
Computing jobs, whether running or not, can be canceled using scancel.
Cancel by jobid
scancel 1234
Cancel by user
scancel -u $USER
This will cancel all of your jobs, including your scron jobs.
There are some more options, but these two are the most important ones for regular users.
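Two of those additional options that are occasionally handy are filtering by job state or by job name, for example; hostname is just the job name from the squeue example above.
scancel -u $USER --state=PENDING
scancel -u $USER --name=hostname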