Sending Signals to jobs
job signaling
Assume we have a C program that investigates Goldbach’s conjecture.
goldbach.c
These code examples are not meant to be efficient! The intention is easily readable code and a good illustration of the job control mechanism in question.
This program verifies Goldbach’s conjecture for all numbers within a specified range. It starts from either 4 or the value saved in a checkpoint file, continuing up to MAX_NUMBER. Upon receiving a SIGUSR1 signal, the program creates a checkpoint file before exiting. If restarted, it will resume from the last checkpoint.
To compile this program, on festus for example, we need an MPI compiler wrapper:
module load gnu/14.1 openmpi/5.0.3
mpicc goldbach.c -o ./goldbach-mpi.x
Now let’s write a simple submission script and save it as goldbach_signal.bash in the same directory as ./goldbach-mpi.x:
- Define the handler function sign_h, which passes SIGUSR1 on to our main program. We need it because Slurm does not pass external signals on to the subprocesses of job steps.
- Fetch the PID of our main program.
- Register a trap for SIGUSR1 that calls our forwarding function sign_h, which signals the main program’s PID.
- Wait 2 seconds so that the main program is not killed before it has finished writing the checkpoint.
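The four steps above can be tried out locally, without Slurm. The following is a minimal sketch and not the original goldbach_signal.bash: the function `worker` is a hypothetical stand-in for the real MPI program, `demo.chkp` stands in for the checkpoint file, and a delayed background `kill` stands in for `scancel`. Only the handler name `sign_h` is taken from the text.

```bash
#!/bin/bash
# Sketch: forward SIGUSR1 from the batch shell to a background main program.
# 'worker' and 'demo.chkp' are hypothetical stand-ins for the real program
# and its checkpoint file.

workdir=$(mktemp -d)

worker() {
    # On SIGUSR1: write a checkpoint, then exit.
    trap 'echo "checkpoint" > "$workdir/demo.chkp"; exit 0' USR1
    while true; do sleep 0.1; done
}

# Step 1: handler that passes SIGUSR1 on to the main program.
sign_h() {
    kill -USR1 "$MAIN_PID"
}

worker &                 # start the main program in the background
MAIN_PID=$!              # Step 2: fetch its PID
trap sign_h USR1         # Step 3: register the handler for SIGUSR1

# Stand-in for scancel: signal this shell after half a second.
SELF=${BASHPID:-$$}
( sleep 0.5; kill -USR1 "$SELF" ) &

# Step 4: 'wait' returns early when a trapped signal arrives, so keep
# waiting until the main program has really exited.
while kill -0 "$MAIN_PID" 2>/dev/null; do
    wait "$MAIN_PID" || true
done

ls "$workdir"            # lists demo.chkp
```

Instead of a fixed 2-second pause, this sketch waits again until the main program has exited. Either approach gives the program time to finish writing its checkpoint before the batch shell ends.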
Now run this job on the dev partition:
sbatch --partition=dev goldbach_signal.bash
Please note the JOBID.
Once the job has started, wait a few seconds to let the computation run. You can use the squeue command to list your jobs.
Now let’s send SIGUSR1 to the job using scancel:
scancel --signal=USR1 --full $JOBID
After a few seconds there should be a file named goldbach.chkp, which is our checkpoint.
job self signaling
So now we know how to send a signal to a job’s main process. Wouldn’t it be practical if a “checkpointing” signal were sent to the job automatically, N seconds before the job ends?
We can do that simply by adding #SBATCH --signal=B:10@30 to the submission script. This sends our job signal 10 (SIGUSR1) no later than 30 seconds before it is due to stop. Let’s add this to our goldbach example submission script and save it as goldbach_selfsignal.bash.
- --signal=[{R|B}:]<sig_num>[@sig_time], where sig_time is between 0 and 65535 seconds. Please read the sbatch docs for further information.
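As a sketch of how the relevant header lines of goldbach_selfsignal.bash might look: the original script is not reproduced here, so the `--time` value is an assumption based on the job's intended 10-minute run time. sbatch accepts a signal name in place of the number, so `USR1` can be written instead of `10`.

```bash
#!/bin/bash
#SBATCH --time=00:10:00       # assumed: 10-minute time limit
#SBATCH --signal=B:USR1@30    # same as B:10@30 on Linux x86-64
```

Because of the `B:` prefix, the signal goes to the batch shell only, so the trap that forwards SIGUSR1 to the main program is still needed.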
Before submitting, let’s clean up our directory from the last job.
Now let’s run the job and wait until it ends.
sbatch --partition=dev goldbach_selfsignal.bash
The job is intended to run for 10 minutes, and our --signal option tells Slurm to send SIGUSR1 about 30 seconds before it ends. So after the job has ended, there should be a checkpoint file in our working directory again.