Supported Compilers

Last updated January 13, 2025

Table of Contents

1 GNU Compiler Collection (GCC)
2 LLVM
3 Intel
4 NVIDIA High Performance Computing Software Development Kit (NVIDIA HPC SDK)
5 Additional resources

CARC provides multiple C/C++/Fortran compilers, each with their own benefits.

The 3 main parts of a compiler tool set include:

Front end
- Language specific
- Checks source code for syntax errors
- Converts source code into format that can be interpreted by next stage
Optimizer
- Normally language agnostic
- Attempts to speed up code
Back end
- Hardware specific
- Creates binary/executable
- Takes advantage of unique hardware features

1 GNU Compiler Collection (GCC)

The GCC is an open source set of tools for compiling source code. A majority of CARC’s software stack is built with GCC because it’s compatible with most packages and hardware.

Load the latest version available:

$ module load gcc

2 LLVM

LLVM is an open source “collection of modular and reusable complier and toolchain technologies”. By being modular, LLVM attempts to make it possible for users to modify their work at various stages. For example, you could create a front end for your own programming language, but still use LLVM’s existing optimizer and backend.

The Intel, AMD, and NVIDIA compiler suites are based on LLVM which provide backend optimizations for the respective hardware architecture they support.

Load the latest version available:

$ module load llvm

3 Intel

Intel provides compiler tools, an MPI library, and performance optimization tools. It can provide enhanced performance on Intel hardware.

Load the latest version available:

$ module load intel-oneapi

4 NVIDIA High Performance Computing Software Development Kit (NVIDIA HPC SDK)

CARC offers GPUs from NVIDIA to facilitate diverse HPC workloads, including the P100, V100, A100, and A40 GPU models. To complement the GPU hardware, we offer NVIDIA programming tools essential to maximizing productivity and optimize GPU acceleration. The latest programming tools are all included in the NVIDIA HPC SDK, available via module load nvhpc. They include both the former NVIDIA CUDA compilers and PGI compilers, as well as state-of-the-art NVIDIA GPU libraries, debugger, and profilers.

Load the latest version available:

$ module load nvhpc

Once you have loaded the nvhpc module, the NVIDIA CUDA Compiler (nvcc) becomes available:

nvcc --version

The NVIDIA HPC SDK provides several applications:

compilers: nvfortran/nvc/nvc++
nvcc
NCCL
NVSHMEM
cuBLAS
cuFFT
cuFFTMp
cuRAND
cuSOLVER
cuSOLVERMp
cuSPARSE
cuTENSOR
Nsight Compute
Nsight Systems
OpenMPI
HPC-X
UCX
OpenBLAS
Scalapack
Thrust
CUB
libcu++

4.1 Compute Capability

Determine the compute capability of the available NVIDIA GPUs to compile and execute CUDA code efficiently.

Compute capability is represented by a version number (sometimes called its “SM version”) and identifies the the GPU hardware’s supported features. It is used by applications at runtime to determine which hardware features (such as tensor cores and L2 cache) and instructions (such as Bfloat16-precision floating-point operations) are available on the GPU device.

In CUDA, GPUs are named sm_xy, where x denotes the GPU generation number and y the version.

The compute capability version of a particular GPU should not be confused with the CUDA version (e.g. CUDA 10.2, CUDA 11.0, CUDA 11.8).

The compute capability of a NVIDIA GPU compute node can be checked with nvidia-smi.

The following commands are an example of an interactive session on a A40 GPU compute node and a query of its compute capability with nvidia-smi:

$ salloc --partition gpu --gres=gpu:a40:1
$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
8.6

The following table lists the compute capability of the four different GPU device types available on CARC HPC systems:

GPU Model	Compute Capability	Architecture
Tesla P100	6.0	Pascal
NVIDIA V100	7.0	Volta
NVIDIA A100	8.0	NVIDIA Ampere GPU
NVIDIA A40	8.6	NVIDIA Ampere GPU

4.2 CUDA Compiler Options

-arch : Specifies the virtual compute architecture that the PTX code should be generated against. The valid format is: -arch=compute_XY
-code: Specifies the actual sm architecture the SASS code should be generated against and included in the binary. The valid format is: -code=sm_XY
-code: Can also specify which PTX code should be included in the binary for forward compatibility. The valid format is: -code=compute_XY
-gencode: combines both -arch and -code. The valid format is: -gencode=arch=compute_XY,code=sm_XY

To compile CUDA code so that it runs on all of the four types of GPUs available on CARC HPC, use the following CUDA-compiler flags: -code, -arch, -gencode.

Compile-time Compatibility:

-arch=compute_Xa is compatible with -code=sm_Xb when a≤b
-arch=compute_X* is incompatible with -code=sm_Y*

Runtime Compatibility:

binaries built with -code=sm_XY will only run on the X.Y architecture
binaries built with -code=compute_Xa will run on Xb architecture with JIT when b≥a
binaries built with -code=compute_ab will run on cd architecture with JIT when c,d≥a,b

Compile CUDA code so that it runs on all of the four types of GPU architecture available on CARC HPC with --generate-code:

nvcc cuda_code.cu \
--generate-code arch=compute_60,code=sm_60 \
--generate-code arch=compute_70,code=sm_70 \
--generate-code arch=compute_80,code=sm_80 \
--generate-code arch=compute_86,code=sm_86

4.3 CUDA example code

The following commands will initiate an interactive session on a P100 GPU compute node, download (wget) and compile (nvcc) the devicequery.cu CUDA code, and run the generated executable.

$ salloc -p debug --gres=gpu:p100:1
$ module purge
$ module load nvhpc
$ wget <https://raw.githubusercontent.com/welcheb/CUDA_examples/master/devicequery.cu>
$ nvcc -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 devicequery.cu -o devicequery.x
$ srun -n 1 ./devicequery.x

The devicequery.cu CUDA code example can help new users get familiar with the GPU resources available on the HPC cluster. The output for a P100 GPU node should look something similar to the following:

CUDA Device Query...
There are 1 CUDA devices.
CUDA Device #0
Major revision number:         6
Minor revision number:         0
Name:                          Tesla P100-PCIE-16GB
Total global memory:           4186898432
Total shared memory per block: 49152
Total registers per block:     65536
Warp size:                     32
Maximum memory pitch:          2147483647
Maximum threads per block:     1024
Maximum dimension 0 of block:  1024
Maximum dimension 1 of block:  1024
Maximum dimension 2 of block:  64
Maximum dimension 0 of grid:   2147483647
Maximum dimension 1 of grid:   65535
Maximum dimension 2 of grid:   65535
Clock rate:                    1328500
Total constant memory:         65536
Texture alignment:             512
Concurrent copy and execution: Yes
Number of multiprocessors:     56
Kernel execution timeout:      No

5 Additional resources

If you have questions about using compilers, please submit a help ticket and we will assist you.