Basic job control
specify resources
There are a lot of options to specify resources; we will list just the most important ones for our clusters here:
gpus
In case you want to make use of GPUs, first decide whether you want to use one of the large GPU systems (4x H100, 4x MI210); if so, you have to add the --partition=GPU option. In any case you have to add --gres=gpu:<TYPE>:<N>.
So, for example, if you want to use 2x H100 for your job, you have to add:
Example Submit for 2x H100
--partition=GPU --gres=gpu:h100:2
Another example: if you want to use 1x Nvidia L40S for your job, then you only have to add this:
Example requesting one L40S
--gres=gpu:l40:1
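Putting these options together in a batch script, a minimal sketch could look like the following; the job name and the program call ./my_gpu_program are placeholders for your own application.
#!/bin/bash
#SBATCH --job-name=gpu_test          # placeholder job name
#SBATCH --partition=GPU              # large GPU systems (4x H100 / 4x MI210)
#SBATCH --gres=gpu:h100:2            # request 2x H100
#SBATCH --ntasks=1

# ./my_gpu_program is a placeholder for your own executable
srun ./my_gpu_program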
number of tasks
Specify how many tasks (processes) should run across your whole job. Slurm will reserve these resources as n CPUs for n tasks if --cpus-per-task is not specified otherwise.
Example start 256 tasks (e.g. 256 MPI processes)
#SBATCH --ntasks=256
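As a sketch, a complete submission script for such an MPI job might look like this; ./my_mpi_program is a placeholder for your own executable.
#!/bin/bash
#SBATCH --ntasks=256

# srun launches one MPI process per reserved task
srun ./my_mpi_program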
cpus per task
With --cpus-per-task you specify how many CPUs are reserved for each task.
Example start 8 tasks with 16 cores each
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=16
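For a hybrid MPI/OpenMP job it is common to pass the reserved CPUs per task on to the threading runtime via the SLURM_CPUS_PER_TASK environment variable. A minimal sketch, assuming a hypothetical executable ./my_hybrid_program:
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=16

# let the OpenMP runtime use the reserved cores of each task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_hybrid_program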
number of nodes
You can specify the number of nodes by using --nodes=N.
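For example, to spread 256 tasks evenly over 2 nodes you could combine --nodes with --ntasks-per-node. This is only a sketch; the numbers have to match the core count of the nodes you are using.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128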
check job status
squeue
The easiest way to get an overview of your waiting or running jobs is the squeue command.
squeue
Modify the SQUEUE_FORMAT variable at the end of your .bashrc or .zshrc, following this description, to customize the output of squeue the way you like it best.
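As one possible starting point, you could add a line like the following to your .bashrc or .zshrc; the format codes shown here roughly reproduce the columns used in the squeue examples below and can be adjusted to taste.
export SQUEUE_FORMAT="%.10i %.9P %.12j %.10u %.2t %.10M %.10L %.6D %R"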
sacct
You can find out more details with the sacct command.
The following will show when a job was submitted and when it started:
[bt123456@festus01 hpcadmin]# sacct -X -j 6089 -o jobid,jobname,Submit,Start
JobID JobName Submit Start
------------ ---------- ------------------- -------------------
6089 RUN_GR 2024-12-02T12:37:35 2024-12-02T12:47:07
Tip
To check the submission script of a given job, use:
sacct -j $SLURM_JOB_ID -B
To check what the submission command was, use:
sacct -j $SLURM_JOB_ID -o submitline
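sacct can also report other fields, for example the state, exit code, and runtime of a finished job; the job id 6089 is just the one from the example above.
sacct -X -j 6089 -o jobid,jobname,state,exitcode,elapsed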
check node status
To check the partitions and their node states, e.g. whether the nodes are idle, allocated, drained, and so on, one can use:
sinfo
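If you prefer a per-node view instead of the partition summary, sinfo can also list the nodes individually, for example:
sinfo -N -l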
Interactive jobs
To get an interactive job with an interactive shell on a compute node, you can call srun with the --pty flag.
Start an interactive job with a bash session; 128 cores reserved for one task on a single node.
srun --nodes=1 --cpus-per-task=128 --pty bash -i
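The same resource options as for batch jobs can be combined with an interactive session. As a sketch, an interactive shell with one H100 and 16 cores could be requested like this; adjust the partition, GPU type, and core count to your needs.
srun --partition=GPU --gres=gpu:h100:1 --cpus-per-task=16 --pty bash -i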
Modify existing job
Slurm offers the opportunity to change some parameters of an existing job.
For this purpose, scontrol update jobid=<jobid> <OPTIONS> is used.
Adding mail notification to an existing job
Assuming we have an existing job:
[bt123456@festus01 ~]$ squeue
JOBID PARTITION NAME USER ST TIME TIME_LEFT NODES NODELIST(REASON)
6282 normal hostname bt123456 PD 0:00 8:00:00 1 (BeginTime)
But we have forgotten to add mail notification, both in the submission script and on the submission command line. To add these settings, we can modify the existing job by running this command:
scontrol update jobid=6282 MailUser=bt123456@uni-bayreuth.de MailType=ALL
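Other parameters can be changed in the same way, for example the time limit of the pending job; whether a given change is permitted can depend on the job state and the cluster configuration.
scontrol update jobid=6282 TimeLimit=04:00:00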
For an overview of the options you can change for an already submitted job, please consult the scontrol documentation.
Cancel jobs
Computing jobs, whether running or not, can be canceled using scancel.
Cancel by jobid
scancel 1234
Cancel by user
scancel -u $USER
This will cancel all of your jobs, including your scron jobs.
There are some more options, but these two are the most important ones for regular users.
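Two of those additional options that are occasionally handy are filtering by job state or by job name, for example; hostname is just the job name from the squeue example above.
scancel -u $USER --state=PENDING
scancel -u $USER --name=hostname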