Using OpenMP

OpenMP (Open Multi-Processing) is a popular Application Programming Interface (API) for multi-threaded applications. It supports shared-memory parallel programming in C, C++, and Fortran on most platforms, instruction set architectures, and operating systems. The API consists of compiler directives, runtime library routines, and environment variables for creating and managing threads.

OpenMP is an explicit (i.e., not automatic) programming model offering the programmer full control over parallelization. Parallelization can be as simple as taking a serial program and inserting parallel compiler directives. OpenMP programs accomplish parallelism exclusively through the use of threads—the smallest unit of processing that can be scheduled by an operating system. Typically, the number of threads used matches the number of physical CPU cores, known as one-to-one mapping. However, the optimal number of threads depends on the specific application.

OpenMP uses the fork-join model of parallel execution:

FORK: The master thread creates a team of parallel threads with access to shared memory. The statements in the program that are enclosed in the parallel region are then executed in parallel among the team of threads.
JOIN: When the team of threads completes the statements in the parallel region, the threads synchronize and terminate, leaving only the master thread.

Compiling OpenMP programs

Most compilers support OpenMP. The following table lists the compilers available on CARC HPC clusters, along with the compilation command and OpenMP option for each:

Compiler family	Module name	Language	Compilation command
GCC	gcc	C	gcc -fopenmp [...]
		C++	g++ -fopenmp [...]
		Fortran	gfortran -fopenmp [...]
LLVM	llvm	C	clang -fopenmp [...]
		C++	clang++ -fopenmp [...]
AOCC	aocc	C	clang -fopenmp [...]
		C++	clang++ -fopenmp [...]
Intel	intel-oneapi	C	icx -qopenmp [...]
		C++	icpx -qopenmp [...]
		Fortran	ifx -qopenmp [...]
NVHPC	nvhpc	C	nvc -mp [...]
		C++	nvc++ -mp [...]
		Fortran	nvfortran -mp [...]

For example, to compile a C program that uses OpenMP with the gcc compiler, enter the following:

module purge
module load gcc/11.3.0
gcc -fopenmp omp_program.c -o omp_program

Offloading to GPUs

Parallel regions of programs can also be offloaded to GPUs using the OpenMP target directive. However, for offloading to produce a noticeable speedup, the regions should contain substantial parallelism and be well structured, with little thread synchronization.

Offloading to GPUs also requires additional compiler flags. For example, to compile a C program that uses OpenMP offloading with the nvc compiler:

module purge
module load nvhpc/22.11
nvc -mp=gpu -gpu=cc70 omp_program.c -o omp_program

In this example, the target is a V100 GPU, which has compute capability 7.0 (i.e., cc70).

Consult specific compiler documentation for more information on offloading to GPUs.

Running OpenMP programs

Once the program has been compiled, the next step in running an OpenMP program is to set the OMP_NUM_THREADS environment variable, which specifies the number of threads to use. Typically, this should match the number of CPU cores that you have requested for your Slurm job.

A Slurm job script is a special type of Bash shell script that the Slurm job scheduler recognizes as a job. For a job launching OpenMP parallel programs, a Slurm job script should look similar to the following:

#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=1:00:00

module purge
module load gcc/11.3.0

ulimit -s unlimited

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./omp_program

Each line is described below:

Command or Slurm argument	Meaning
`#!/bin/bash`	Use Bash to execute this script
`#SBATCH`	Syntax that allows Slurm to read your requests (ignored by Bash)
`--account=<project_id>`	Charge compute resources to <project_id>; enter `myaccount` to view your available project IDs
`--partition=main`	Submit job to the main partition
`--nodes=1`	Use 1 compute node
`--ntasks=1`	Run 1 task (e.g., running an OpenMP program)
`--cpus-per-task=16`	Reserve 16 CPUs for your exclusive use
`--mem=32G`	Reserve 32 GB of memory for your exclusive use
`--time=1:00:00`	Reserve resources described for 1 hour
`module purge`	Clear environment modules
`module load gcc/11.3.0`	Load the `gcc` compiler environment module
`ulimit -s unlimited`	Set the limit of user stack size to unlimited
`export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK`	Set the number of threads to parallelize over. This uses the Slurm-provided environment variable `SLURM_CPUS_PER_TASK` for the number of threads. The number of threads should generally be equal to the requested `--cpus-per-task` option in your job script and not exceed the number of CPU cores on a compute node
`./omp_program`	Run your OpenMP program

Thread affinity

OpenMP includes thread affinity options that allow binding threads to specific places on a compute node. This may improve the performance of your program, though the optimal values to use depend on your specific application.

Use the environment variables OMP_PLACES and OMP_PROC_BIND to set thread affinity at runtime. The following values are a good starting point:

export OMP_PLACES=cores
export OMP_PROC_BIND=spread

We recommend experimenting and benchmarking to find the optimal binding strategy for your application. Consult the OpenMP documentation for more information on thread affinity and the available options.

Additional resources

If you have questions about or need help with OpenMP or parallel programming, please submit a help ticket and we will assist you.

For hybrid MPI/OpenMP programs, see our MPI guide.