
Mixed MPI OpenMP

This example is intended to introduce you to the use of Slurm for mixed (distributed/shared) memory programming with OpenMP and MPI.

As always: our examples are only MWEs that illustrate the most basic principles. They are not meant to be efficient or well-styled code in any manner.

Prepare

In this example, we want to use Intel MPI because thread pinning does not require as much configuration as with OpenMPI. So let's load an Intel oneAPI module:

module load OneAPI_2025.0.0
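
For comparison: with OpenMPI you would typically have to spell out the process mapping and thread binding yourself on the mpirun line, roughly along these lines (a hypothetical sketch only, not needed for this example; the exact flags depend on the OpenMPI version):

mpirun --map-by slot:PE=<threads per task> --bind-to core -n <number of tasks> ./your_program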

Create a directory for your example and enter it:

mkdir ~/mpiomp.ex
cd ~/mpiomp.ex

Codes

C code

The following code demonstrates how to mix MPI and OpenMP; more precisely, how to nest OpenMP inside MPI. Please copy and save it as mpi_openmp_example.c inside your project directory.

#define _GNU_SOURCE 1 //(1)!

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <math.h>
#include <sys/syscall.h>

int main(int argc, char *argv[]) {

    int mpi_rank, mpi_size;
    int omp_thread_id, omp_num_threads;
    pid_t process_id;
    char hostname[256];
    int cpu_core;

    // Initialize MPI
    MPI_Init(&argc, &argv); //(2)!
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    // Get process ID and hostname
    process_id = getpid();
    gethostname(hostname, sizeof(hostname));

    // Print header once
    if (mpi_rank == 0) {
        printf("HOSTNAME:CPU-CORE\t<rank>//<thread> (<n threads>) "
               "-> <process id>:<system thread-id>\n");
    }

    #pragma omp parallel private(omp_thread_id, omp_num_threads, cpu_core) //(3)!
    {
        omp_thread_id = omp_get_thread_num();    // Get OpenMP thread ID
        omp_num_threads = omp_get_num_threads(); // Get number of threads

        cpu_core = sched_getcpu();               // Get the CPU core this thread runs on
        pid_t tid = syscall(SYS_gettid);         // Get the kernel thread ID

        #pragma omp critical //(4)!
        {
            printf("%s:%d\t%d//%d (%d)\t->\t%d:%d\n",
                   hostname, cpu_core, mpi_rank, omp_thread_id,
                   omp_num_threads, process_id, tid);
        }
    }

    // Finalize MPI
    MPI_Finalize();
    return 0;
}
  1. This is needed for the sched_getcpu() call to work, i.e. to obtain the CPU core number.
  2. Start distributed memory parallel region (MPI).
  3. Start shared memory parallel region (OpenMP).
  4. Synchronize output.
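
A side note on MPI initialization: in this example only code outside the OpenMP region talks to MPI, so plain MPI_Init is sufficient. If your OpenMP threads were to make MPI calls themselves, the usual approach is to request a thread-support level explicitly with MPI_Init_thread. A minimal sketch (not part of the example above):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided;

    // Request MPI_THREAD_FUNNELED: only the thread that initialized MPI
    // (the OpenMP master) will make MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not provide the requested thread support\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // ... hybrid MPI + OpenMP work as in the example above ...

    MPI_Finalize();
    return 0;
}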

Now let's compile this with Intel's MPI compiler wrapper for Intel's LLVM-based C compiler (icx):

mpiicx -qopenmp mpi_openmp_example.c -o ./ompmpi.x

Submission script

The next step is to write a suitable submission script and save it as mpi_openmp_example.submit.

#!/bin/bash 

#SBATCH --job-name=mpi_omp
#SBATCH --output=%x.out
#SBATCH --nodes=2 #(1)!
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=3
#SBATCH --time=00:03:00

module load OneAPI_2025.0.0

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK} #(2)!
export OMP_PLACES=cores #(3)!
export OMP_PROC_BIND=close #(4)!

mpirun -n ${SLURM_NTASKS} ./ompmpi.x
  1. It is very important to properly set --nodes=, --ntasks-per-node=, and --cpus-per-task=. Please take a look at this cheat sheet and also the FAQ section.
  2. Set the number of OpenMP threads for each MPI process.
  3. Make each OpenMP place, i.e. a location threads can be bound to, a physical core.
  4. Bind the OpenMP threads as close together as possible (at the processor level).
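
If you want to double-check where the OpenMP runtime actually placed the threads, you can additionally export the standard OpenMP 5.0 variable OMP_DISPLAY_AFFINITY in the submission script (optional; how the report is printed depends on the OpenMP runtime in use):

export OMP_DISPLAY_AFFINITY=true # print one affinity line per thread at startup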

Submit and run

Now it's time to submit our example job:

sbatch mpi_openmp_example.submit
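
While the job is pending or running, you can check its state with squeue:

squeue -u $USER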

When the job has completed, the output file should look something like this:

HOSTNAME:CPU-CORE       <rank>//<thread> (<n threads>) -> <process id>:<system thread-id>
s72b03.festus:128       2//0 (3)        ->      1838426:1838426
s72b03.festus:129       2//1 (3)        ->      1838426:1838434
s72b03.festus:131       3//0 (3)        ->      1838427:1838427
s72b03.festus:130       2//2 (3)        ->      1838426:1838437
s72b03.festus:132       3//1 (3)        ->      1838427:1838435
s72b03.festus:133       3//2 (3)        ->      1838427:1838436
s72b02.festus:129       0//1 (3)        ->      1745943:1745949
s72b02.festus:128       0//0 (3)        ->      1745943:1745943
s72b02.festus:131       1//0 (3)        ->      1745944:1745944
s72b02.festus:130       0//2 (3)        ->      1745943:1745951
s72b02.festus:132       1//1 (3)        ->      1745944:1745950
s72b02.festus:133       1//2 (3)        ->      1745944:1745952

As you can see, each node runs two of the four MPI processes (ranks), and each process has spawned its own three OpenMP threads, giving 2 nodes × 2 tasks × 3 threads = 12 output lines in total.
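
If you want to verify this count quickly, one way (assuming the output file name mpi_omp.out implied by the --job-name and --output settings above) is to tally the output lines per node; with 2 tasks × 3 threads per node you should see 6 lines for each host:

tail -n +2 mpi_omp.out | cut -d: -f1 | sort | uniq -c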