Using OpenMP

OpenMP (Open Multi-Processing) is a widely used application programming interface (API) for writing multi-threaded programs. It supports shared-memory multiprocessing in C, C++, and Fortran on most platforms, instruction set architectures, and operating systems. The API consists of compiler directives and constructs, runtime library routines, and environment variables that control thread creation and management.

OpenMP is an explicit (i.e., not automatic) programming model that gives the programmer full control over parallelization. Parallelizing a program can be as simple as taking serial code and inserting compiler directives around the parts that should run in parallel. OpenMP achieves parallelism exclusively through threads, the smallest units of processing that an operating system can schedule. A common starting point is one thread per physical CPU core (a one-to-one mapping), but the optimal number of threads depends on the specific application.

OpenMP uses the fork-join model of parallel execution:

  • FORK: The master thread creates a team of parallel threads with access to shared memory. The statements in the program that are enclosed in the parallel region are then executed in parallel among the team of threads.
  • JOIN: When the team of threads completes the statements in the parallel region, the threads synchronize at an implicit barrier and terminate, leaving only the master thread.

Compiling OpenMP programs

Most C, C++, and Fortran compilers support OpenMP. The following table lists the compilers available on CARC HPC clusters, with the compilation command and OpenMP option for each language:

Compiler family   Module name    Language   Compilation command
GCC               gcc            C          gcc -fopenmp [...]
GCC               gcc            C++        g++ -fopenmp [...]
GCC               gcc            Fortran    gfortran -fopenmp [...]
LLVM              llvm           C          clang -fopenmp [...]
LLVM              llvm           C++        clang++ -fopenmp [...]
AOCC              aocc           C          clang -fopenmp [...]
AOCC              aocc           C++        clang++ -fopenmp [...]
Intel             intel-oneapi   C          icx -qopenmp [...]
Intel             intel-oneapi   C++        icpx -qopenmp [...]
Intel             intel-oneapi   Fortran    ifx -qopenmp [...]
NVHPC             nvhpc          C          nvc -mp [...]
NVHPC             nvhpc          C++        nvc++ -mp [...]
NVHPC             nvhpc          Fortran    nvfortran -mp [...]

For example, to compile a C program that uses OpenMP with the gcc compiler, enter the following:

module purge
module load gcc/11.3.0
gcc -fopenmp omp_program.c -o omp_program

Offloading to GPUs

Parallel regions of a program can also be offloaded to GPUs using the OpenMP target directive. To see a noticeable speedup, however, the offloaded regions should contain substantial parallelism and require little thread synchronization.

Offloading to GPUs also requires additional compiler flags. For example, to compile a C program that uses OpenMP offloading with the nvc compiler:

module purge
module load nvhpc/22.11
nvc -mp=gpu -gpu=cc70 omp_program.c -o omp_program

In this example, -gpu=cc70 targets GPUs with compute capability 7.0, such as the NVIDIA V100.

Consult specific compiler documentation for more information on offloading to GPUs.

Running OpenMP programs

Once the program has been compiled, set the OMP_NUM_THREADS environment variable to the number of threads to use before running it. Typically, this should match the number of CPU cores requested for your Slurm job.

A Slurm job script is a special type of Bash shell script that the Slurm job scheduler recognizes as a job. For a job launching OpenMP parallel programs, a Slurm job script should look similar to the following:

#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=1:00:00

module purge
module load gcc/11.3.0

ulimit -s unlimited

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./omp_program

Each line is described below:

Command or Slurm argument                      Meaning
#!/bin/bash                                    Use Bash to execute this script
#SBATCH                                        Syntax that allows Slurm to read your requests (ignored by Bash)
--account=<project_id>                         Charge compute resources to <project_id>; enter myaccount to view your available project IDs
--partition=main                               Submit job to the main partition
--nodes=1                                      Use 1 compute node
--ntasks=1                                     Run 1 task (here, a single OpenMP program)
--cpus-per-task=16                             Reserve 16 CPUs for your exclusive use
--mem=32G                                      Reserve 32 GB of memory for your exclusive use
--time=1:00:00                                 Reserve the resources described for 1 hour
module purge                                   Clear environment modules
module load gcc/11.3.0                         Load the gcc compiler environment module
ulimit -s unlimited                            Set the stack size limit to unlimited for this shell and the programs it starts
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK    Set the number of threads to parallelize over, using the Slurm-provided environment variable SLURM_CPUS_PER_TASK. The number of threads should generally equal the --cpus-per-task value in your job script and should not exceed the number of CPU cores on a compute node
./omp_program                                  Run your OpenMP program

Thread affinity

OpenMP includes thread affinity options that allow binding threads to specific places on a compute node. This may improve the performance of your program, though the optimal values to use depend on your specific application.

Use the environment variables OMP_PLACES and OMP_PROC_BIND to set thread affinity at runtime. The following values are a good starting point:

export OMP_PLACES=cores
export OMP_PROC_BIND=spread

We recommend experimenting and benchmarking to find the optimal binding strategy for your application. Consult the OpenMP documentation for more information on thread affinity and the available options.

Additional resources

If you have questions about or need help with OpenMP or parallel programming, please submit a help ticket and we will assist you.

For hybrid MPI/OpenMP programs, see our MPI guide.

OpenMP website
LLNL OpenMP Tutorial
OpenMP code examples
