
Sending Signals to jobs

Job signaling

Assume we have a C program that investigates Goldbach’s conjecture.

goldbach.c

These code examples are not meant to be efficient! The intention is easily readable code and a clear illustration of the job control mechanism in question.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <math.h>
#include <mpi.h>
#include <signal.h>
#include <limits.h>

#define MAX_NUMBER ULLONG_MAX  // upper bound; effectively unbounded, so the run ends via SIGUSR1 or the time limit
#define CHECKPOINT_FILE "goldbach.chkp"

volatile sig_atomic_t stop_calculation = 0;

void handle_signal(int signum) {
    (void)signum;  // unused; the handler only sets the flag
    stop_calculation = 1;
}

// Read the starting value from the checkpoint file
unsigned long long read_checkpoint() {
    FILE *file = fopen(CHECKPOINT_FILE, "r");
    unsigned long long checkpoint_value = 4;  // Start with 4 when there is no checkpoint

    if (file) {
        // %llu matches unsigned long long; on a failed read the default 4 is kept
        fscanf(file, "%llu", &checkpoint_value);
        fclose(file);
    }

    return checkpoint_value;
}

// Write the checkpoint value to the checkpoint file
void write_checkpoint(unsigned long long checkpoint_value) {
    FILE *file = fopen(CHECKPOINT_FILE, "w");
    if (file) {
        fprintf(file, "%llu\n", checkpoint_value);
        fclose(file);
    }
}

// Primality test used to check Goldbach's conjecture
bool is_prime(unsigned long long n) {
    if (n <= 1) return false;
    if (n <= 3) return true;
    if (n % 2 == 0 || n % 3 == 0) return false;

    for (unsigned long long i = 5; i * i <= n; i += 6) {
        if (n % i == 0 || n % (i + 2) == 0) return false;
    }
    return true;
}

bool check_goldbach(unsigned long long n) {
    for (unsigned long long i = 2; i <= n / 2; i++) {
        if (is_prime(i) && is_prime(n - i)) {
            return true;
        }
    }
    return false;
}

int main(int argc, char *argv[]) {
    int rank, size;
    unsigned long long start, i;
    unsigned long long min_value;
    bool local_min_holder;
    MPI_Status status;

    signal(SIGUSR1, handle_signal);  

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // read starting point
    start = read_checkpoint();

    // Round-robin: rank r checks start + 2*r, start + 2*(r + size), ...
    for (i = start + 2 * rank; i <= MAX_NUMBER; i += 2 * size) {
        if (stop_calculation) {

            // Find the lowest i across all ranks
            MPI_Allreduce(&i, &min_value, 1, MPI_UNSIGNED_LONG_LONG, MPI_MIN, MPI_COMM_WORLD);
            local_min_holder = (i == min_value);

            // Only the process with the lowest i writes the checkpoint
            printf("Rank %d: calculation stops at %llu\n", rank, i);
            if (local_min_holder) {
                write_checkpoint(i);
            }
            break;
        }

        if (!check_goldbach(i)) {
            printf("Rank %d: Goldbach conjecture NOT satisfied for: %llu\n", rank, i);
        }
    }

    MPI_Finalize();
    return 0;
}

This program checks Goldbach’s conjecture for all even numbers within a specified range. It starts from either 4 or the value saved in a checkpoint file and continues up to MAX_NUMBER. Upon receiving a SIGUSR1 signal, the program writes a checkpoint file before exiting. If restarted, it resumes from the last checkpoint.

To compile this program, for example on festus, we need an MPI compiler wrapper:

module load gnu/14.1 openmpi/5.0.3
mpicc goldbach.c -o ./goldbach-mpi.x

Now let’s write a simple submission script and save it as goldbach_signal.bash in the same directory as ./goldbach-mpi.x:

#!/bin/bash
#SBATCH -p dev
#SBATCH -t 00:03:00
#SBATCH --ntasks=32
#SBATCH --mail-user=bt123456@uni-bayreuth.de --mail-type=ALL
#SBATCH -N 1

module load gnu/14.1 openmpi/5.0.3 

function sign_h(){  #(1)!
        kill -10 $1
}

mpirun -np 32 ./goldbach-mpi.x &

main_pid=$! #(2)!

trap "sign_h '$main_pid'" SIGUSR1  #(3)!

wait $main_pid # Wait for mainprogram

sleep 2 #(4)! 
  1. This function is what we need to pass SIGUSR1 on to our main program, because Slurm doesn’t pass external signals to subprocesses of job steps.
  2. Fetch the PID of our main program.
  3. Register a SIGUSR1 trap that calls our forwarding function sign_h with the main program’s PID.
  4. Wait 2 seconds so that the main program is not killed before the checkpoint has been written.
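The trap-and-forward pattern from the submission script can be tried locally without Slurm or MPI. The following is only a minimal sketch under stated assumptions: the worker function, the demo.chkp file and the simulated signal are hypothetical stand-ins for mpirun, goldbach.chkp and scancel.

```shell
#!/bin/bash
# Local demo of the forwarding pattern; no Slurm or MPI involved.
workdir=$(mktemp -d)
chkp="$workdir/demo.chkp"        # hypothetical checkpoint file for this demo
self=$BASHPID                    # PID of the shell that installs the trap below

# Stand-in for the main program: counts until SIGUSR1 arrives,
# then writes its counter as a "checkpoint" and exits.
worker() {
    n=0
    trap 'echo "$n" > "$chkp"; exit 0' USR1
    while true; do
        n=$((n + 1))
        sleep 0.1
    done
}

worker &
main_pid=$!

# Forward SIGUSR1 from this script to the worker, as sign_h does above.
trap 'kill -USR1 "$main_pid"' USR1

# Simulate the external signal that scancel would deliver to the batch script.
( sleep 1; kill -USR1 "$self" ) &

# The first wait returns early (non-zero) when the trap fires;
# the second wait then blocks until the worker has really finished.
wait "$main_pid" || wait "$main_pid"

echo "checkpoint value: $(cat "$chkp")"
```

Calling wait a second time instead of sleeping for a fixed period is one possible design choice here; the submission scripts in this section use sleep 2.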

Now run this job on the dev partition:

sbatch --partition=dev goldbach_signal.bash

Please note the JOBID
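If you prefer not to copy the job ID by hand, it can be taken from the confirmation line that sbatch prints ("Submitted batch job <id>"). A minimal sketch, assuming that output format; extract_job_id is a hypothetical helper and the sample line is made up for illustration. Alternatively, sbatch --parsable prints the job ID without the surrounding text.

```shell
#!/bin/bash
# Hypothetical helper: pull the numeric job ID out of sbatch's confirmation line.
extract_job_id() {
    # Expected input looks like: "Submitted batch job 123456"
    awk '/^Submitted batch job/ { print $4 }' <<< "$1"
}

# On the cluster one would write something like:
#   JOBID=$(extract_job_id "$(sbatch --partition=dev goldbach_signal.bash)")
# Here a made-up sample line is used so the sketch runs anywhere.
sample_output="Submitted batch job 123456"
JOBID=$(extract_job_id "$sample_output")
echo "$JOBID"
```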

Once the job has started, wait a few seconds to let the computation run. You can use the squeue command to list your jobs:

[Image: squeue on signal]

Now let’s send a SIGUSR1 to the job using scancel:

scancel --signal=USR1 --full $JOBID

After a few seconds there should be a file named goldbach.chkp, which is our checkpoint:

[Image: checkpoint content]

Job self-signaling

So now we know how to send a signal to a job’s main process. Wouldn’t it be practical if a “checkpointing” signal could be sent to the job automatically, N seconds before the job ends?

We can do that simply by adding #SBATCH --signal=B:10@30. This sends our job a signal (SIGUSR1 == 10) at the latest 30 seconds before it is due to stop. The B: prefix tells Slurm to signal only the batch shell, which is why the script still needs the trap to forward the signal. Let’s add this to the submission script of our goldbach example and save it as goldbach_selfsignal.bash.

#!/bin/bash
#SBATCH -p dev
#SBATCH -t 00:10:00
#SBATCH --ntasks=32
#SBATCH --mail-user=bt123456@uni-bayreuth.de --mail-type=ALL
#SBATCH -N 1
#SBATCH --signal=B:10@30 #(1)!

module load gnu/14.1 openmpi/5.0.3 

function sign_h(){  
        kill -10 $1
}

mpirun -np 32 ./goldbach-mpi.x &

main_pid=$! 

trap "sign_h '$main_pid'" SIGUSR1 

wait $main_pid # Wait for mainprogram

sleep 2 

  1. --signal=[{R|B}:]<sig_num>[@sig_time], where sig_time is between 0 and 65535 seconds. Please read the sbatch docs for further information.
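The timing implied by -t 00:10:00 together with --signal=B:10@30 can be worked out with a little arithmetic. This is a sketch only: the variable names are illustrative, and the 60-second margin is an assumption taken from the note in the sbatch documentation that the signal may be sent up to 60 seconds earlier than specified.

```shell
#!/bin/bash
# Illustrative arithmetic only; nothing here talks to Slurm.
time_limit="00:10:00"   # value of -t in the submission script
sig_time=30             # value after '@' in --signal=B:10@30

# Convert HH:MM:SS to seconds (10# forces base 10 so "08" is not read as octal).
IFS=: read -r hh mm ss <<< "$time_limit"
limit_s=$(( 10#$hh * 3600 + 10#$mm * 60 + 10#$ss ))

latest_s=$(( limit_s - sig_time ))   # signal arrives no later than this
earliest_s=$(( latest_s - 60 ))      # assumed scheduler margin of up to 60 s

echo "signal expected between ${earliest_s}s and ${latest_s}s after job start"
```

The program therefore needs to finish writing its checkpoint within the remaining sig_time seconds, so sig_time should be chosen generously for jobs with larger checkpoints.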

Before submitting, let’s clean up our directory from the last job:

[Image: cleanup]

Now let’s run the job and wait until it ends.

sbatch --partition=dev goldbach_selfsignal.bash

The job is intended to run for 10 minutes, and our --signal option tells Slurm to send SIGUSR1 30 seconds before it ends. So after the job has ended, there should again be a checkpoint file in our working directory:

[Image: checkpoint content]