Introduction
OMPi supports offloading to CUDA GPUs through its dedicated cuda module. First, the compiler generates a CUDA kernel source file for each OpenMP target directive appearing in the user application (a minimal sketch of such a directive follows the list below). The cuda module is then responsible for:
- Compiling the CUDA kernels to CUDA executables,
- Initializing/finalizing a CUDA device,
- Transferring data between the CPU and a CUDA device,
- Offloading the CUDA executables to the requested CUDA device.
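For illustration, here is a minimal sketch of the kind of construct that gets extracted into a CUDA kernel. It uses only standard OpenMP; the variable names and the reduction computed here are purely illustrative:

int main(void)
{
    int a[100], sum = 0;

    for (int i = 0; i < 100; i++)
        a[i] = i;

    /* OMPi extracts this target region into a separate CUDA kernel;
     * the map clauses drive the host/device data transfers that the
     * cuda module performs at runtime. */
    #pragma omp target map(to: a) map(tofrom: sum)
    {
        for (int i = 0; i < 100; i++)
            sum += a[i];
    }

    return 0;
}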
Requirements
The current requirements of the cuda module are the following:
- CUDA Toolkit version 9.0 or greater, with working CUDA drivers
- CUDA architecture version 3.5 or greater (for each GPU)
Before proceeding with the installation, please make sure that:
<cuda-install-dir>/bin is in your PATH environment variable
<cuda-install-dir>/lib64 is in your LD_LIBRARY_PATH environment variable
where <cuda-install-dir> is the installation directory of the CUDA toolkit.
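For example, assuming the toolkit lives in /usr/local/cuda (a common default; adjust the path to your installation):

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH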
Installing OMPi
The first step is to deploy OMPi on your system. Set up the build directory as follows:
meson setup build --prefix=<install-dir>
Advanced configuration: Just-in-time kernel compilation
The cuda module operates either in PTX-JIT or in CUBIN mode. These modes affect the output kernel executable format, as well as the actions taking place in the compiler and runtime parts of OMPi.
Both modes have their pros and cons. CUBIN mode removes many runtime overheads but adds a notable delay during kernel compilation. PTX-JIT mode, on the other hand, yields shorter compilation times, since producing a PTX file is much lighter than producing a binary; however, it incurs extra overheads at runtime. These overheads stem from the compilation and linking of the PTX file with the appropriate device libraries, and can be minimized by utilizing the CUDA JIT cache, a global disk cache for loaded PTX files.
To make sure the cache is enabled, set the corresponding environment variable:
export CUDA_CACHE_DISABLE=0
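Optionally, the cache directory and its maximum size can be tuned through the CUDA_CACHE_PATH and CUDA_CACHE_MAXSIZE environment variables; the values below are merely illustrative (the size is given in bytes):

export CUDA_CACHE_PATH=$HOME/.nv/ComputeCache
export CUDA_CACHE_MAXSIZE=1073741824     # 1 GiB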
By default, OMPi produces PTX files. To disable PTX-JIT mode and have OMPi produce executable CUBIN files instead, set up OMPi with the -Dcuda-jit flag set to false:
meson setup build --prefix=<install-dir> -Dcuda-jit=false
Then OMPi can be compiled and installed as usual:
cd build/
meson compile
meson install
During the installation process, if your system satisfies the minimum requirements and has at least one CUDA GPU installed, OMPi will automatically build the cuda module. The correct installation of the module can be verified by running:
ompiconf --devvinfo
which lists all the identified modules/devices, along with their numeric device IDs. If the above command fails to show a CUDA GPU, please revisit the minimum requirements and ensure you have installed OMPi correctly.
Sample output of devvinfo
The ompiconf --devvinfo command should give an output similar to the following (here a Tesla P40 GPU was identified by OMPi):
MODULE [cuda]:
------
OMPi CUDA device module.
Available devices : 1
device id < 0 > {
    GPU device name           : Tesla P40
    Compute capability        : 6.1
    CUDA toolkit version      : 12.4
    NVIDIA driver version     : 550.90.07
    Num of multiprocessors    : 30
    Cores per multiprocessor  : 128
    Total num of cores        : 3840
    Maximum thread block size : 1024
    Global memory             : 22.4 GBytes
    Shared memory per block   : 48 KBytes
}
------
Quick start
Compiling your OpenMP application with CUDA GPU offloading requires a simple --devs=cuda argument, or the generic --devs=all:
ompicc --devs=cuda app.c
The compiler will produce the main application executable (a.out) and several CUDA kernel executables, one for each OpenMP target construct in the application.
Sample OpenMP application utilizing the GPU
Here is a sample application that can be used to verify the successful deployment of the cuda module:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /*
     * Expected result (assuming that the GPU device ID is 0):
     * Running on CUDA device
     */
    #pragma omp target device(0)
    {
        if (omp_is_initial_device())
            printf("Running on host\n");
        else
            printf("Running on CUDA device\n");
    }
    return 0;
}
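To try it out, save the code as app.c, then compile and run it:

ompicc --devs=cuda app.c
./a.out

If the cuda module is installed correctly and the GPU has device ID 0, the program prints "Running on CUDA device".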
Advanced usage: Multiple compilation jobs
When an application contains several target regions, compilation times may become substantial. You can spawn multiple compilation jobs that handle different kernels in parallel by using the -j<number> device option. For example, to have 4 parallel kernel compilation jobs, invoke ompicc as follows:
ompicc --devs=cuda --devopt -j4 app.c
Notes
The OMPi cuda module has been tested with the following CUDA GPUs:
- Ampere A2
- Tesla P40
- Tegra X1 (Jetson Nano)
- GTX 1050 Ti
- GTX 970
- GT 1030
- GT 730