Using OpenMP
OpenMP (Open Multi-Processing) is a popular Application Programming Interface (API) for multi-threaded applications. It supports shared-memory multiprocessing programming in C, C++, and Fortran on most platforms, instruction set architectures, and operating systems. The API includes compiler directives and constructs, runtime library routines, and environment variables for thread creation and management.
OpenMP is an explicit (i.e., not automatic) programming model offering the programmer full control over parallelization. Parallelization can be as simple as taking a serial program and inserting parallel compiler directives. OpenMP programs accomplish parallelism exclusively through the use of threads, where a thread is the smallest unit of processing that an operating system can schedule. Typically, the number of threads used matches the number of physical CPU cores, known as one-to-one mapping. However, the optimal number of threads depends on the specific application.
OpenMP uses the fork-join model of parallel execution:
- FORK: The master thread creates a team of parallel threads with access to shared memory. The statements in the program that are enclosed in the parallel region are then executed in parallel among the team of threads.
- JOIN: When the team of threads completes the statements in the parallel region, the threads synchronize and terminate, leaving only the master thread.
Compiling OpenMP programs
OpenMP programs are compatible with most compilers. The following table lists the compilers available on CARC HPC clusters and their corresponding compilation command and option for OpenMP programs:
Compiler family | Module name | Language | Compilation command |
---|---|---|---|
GCC | gcc | C | gcc -fopenmp [...] |
GCC | gcc | C++ | g++ -fopenmp [...] |
GCC | gcc | Fortran | gfortran -fopenmp [...] |
LLVM | llvm | C | clang -fopenmp [...] |
LLVM | llvm | C++ | clang++ -fopenmp [...] |
AOCC | aocc | C | clang -fopenmp [...] |
AOCC | aocc | C++ | clang++ -fopenmp [...] |
Intel | intel-oneapi | C | icx -qopenmp [...] |
Intel | intel-oneapi | C++ | icpx -qopenmp [...] |
Intel | intel-oneapi | Fortran | ifx -qopenmp [...] |
NVHPC | nvhpc | C | nvc -mp [...] |
NVHPC | nvhpc | C++ | nvc++ -mp [...] |
NVHPC | nvhpc | Fortran | nvfortran -mp [...] |
For example, to compile a C program that uses OpenMP with the gcc compiler, enter the following:
module purge
module load gcc/11.3.0
gcc -fopenmp omp_program.c -o omp_program
Offloading to GPUs
Parallel regions of programs can also be offloaded to GPUs using OpenMP via the target directive. However, for offloading to noticeably speed up your program, the regions should contain substantial parallelism and be well structured, with little thread synchronization.
Offloading to GPUs also requires additional compiler flags. For example, using the nvc compiler for a C program using OpenMP:
module purge
module load nvhpc/22.11
nvc -mp=gpu -gpu=cc70 omp_program.c -o omp_program
In this example, the target is a V100 GPU, which has compute capability 7.0 (i.e., cc70).
Consult specific compiler documentation for more information on offloading to GPUs.
Running OpenMP programs
Once the program has been compiled, the next step in running an OpenMP program is to set the OMP_NUM_THREADS environment variable, which indicates the number of threads to use. Typically, this should be set to match the number of CPU cores that you have requested for your Slurm job.
A Slurm job script is a special type of Bash shell script that the Slurm job scheduler recognizes as a job. For a job launching OpenMP parallel programs, a Slurm job script should look similar to the following:
#!/bin/bash
#SBATCH --account=<project_id>
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=1:00:00
module purge
module load gcc/11.3.0
ulimit -s unlimited
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./omp_program
Each line is described below:
Command or Slurm argument | Meaning |
---|---|
#!/bin/bash | Use Bash to execute this script |
#SBATCH | Syntax that allows Slurm to read your requests (ignored by Bash) |
--account=<project_id> | Charge compute resources to <project_id>; enter myaccount to view your available project IDs |
--partition=main | Submit job to the main partition |
--nodes=1 | Use 1 compute node |
--ntasks=1 | Run 1 task (e.g., running an OpenMP program) |
--cpus-per-task=16 | Reserve 16 CPUs for your exclusive use |
--mem=32G | Reserve 32 GB of memory for your exclusive use |
--time=1:00:00 | Reserve resources described for 1 hour |
module purge | Clear environment modules |
module load gcc/11.3.0 | Load the gcc compiler environment module |
ulimit -s unlimited | Set the limit of user stack size to unlimited |
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK | Set the number of threads to parallelize over. This uses the Slurm-provided environment variable SLURM_CPUS_PER_TASK for the number of threads. The number of threads should generally be equal to the requested --cpus-per-task option in your job script and not exceed the number of CPU cores on a compute node |
./omp_program | Run your OpenMP program |
Thread affinity
OpenMP includes thread affinity options that allow binding threads to specific places on a compute node. This may improve the performance of your program, though the optimal values to use depend on your specific application.
Use the environment variables OMP_PLACES and OMP_PROC_BIND to set thread affinity at runtime. The following values are a good starting point:
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
We recommend experimenting and benchmarking to find the optimal binding strategy for your application. Consult the OpenMP documentation for more information on thread affinity and the available options.
Additional resources
If you have questions about or need help with OpenMP or parallel programming, please submit a help ticket and we will assist you.
For hybrid MPI/OpenMP programs, see our MPI guide.