Introduction
OMPi supports offloading to CUDA GPUs through the dedicated cuda module. As a first step, the compiler generates a CUDA kernel source file for each OpenMP target directive appearing in the user application. The cuda module is then responsible for (see the sketch after the list):
- Compiling the CUDA kernels to CUDA executables,
- Initializing/finalizing a CUDA device,
- Transferring data between the CPU and a CUDA device,
- Offloading the CUDA executables to the requested CUDA device.
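For illustration, here is a minimal, hypothetical sketch of what such an offloaded region looks like at the source level; the function name scale_on_gpu and the size N are made up for this example. The target directive is what gets turned into a CUDA kernel, while the map clause drives the CPU/GPU data transfers handled by the module:

#include <omp.h>

#define N 1024

void scale_on_gpu(double *a)
{
    /* The target region below is compiled into a CUDA kernel;
     * the map(tofrom:) clause makes the runtime copy a[0:N]
     * to the device before execution and back afterwards. */
    #pragma omp target map(tofrom: a[0:N])
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] *= 2.0;
}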
Requirements
The current requirements of the cuda module are the following:
- CUDA Toolkit version 9.0 or greater, with working CUDA drivers
- CUDA compute capability 3.5 or greater (for each GPU)
Before proceeding with the installation, please make sure that:
- <cuda-install-dir>/bin is in your PATH environment variable,
- <cuda-install-dir>/lib64 is in your LD_LIBRARY_PATH environment variable,
where <cuda-install-dir> is the installation directory of the CUDA toolkit.
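For example, assuming the toolkit resides in the default /usr/local/cuda location, the following bash commands set up both variables:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH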
Installing OMPi
The first step is to deploy OMPi on your system. Configure OMPi as follows:
./configure --prefix=<install-dir> LDFLAGS=-rdynamic
Advanced configuration: Just-in-time kernel compilation
The cuda module operates either in CUBIN or in PTX-JIT mode. These modes affect the output kernel executable format, as well as the actions taking place in the compiler and runtime parts of OMPi.
Both modes have their positive and negative aspects. CUBIN mode removes many runtime overheads but adds a notable delay during kernel compilation. PTX-JIT mode, on the other hand, leads to shorter compilation times, since producing a PTX file is much lighter than producing a binary; however, it incurs extra overheads at runtime, related to the compilation and linking of the PTX file with the appropriate device libraries. These overheads can be minimized by utilizing the CUDA JIT cache, a global disk cache for loaded PTX files.
To make sure the cache is enabled, set the corresponding environment variable:
export CUDA_CACHE_DISABLE=0
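If needed, the cache location and maximum size can also be tuned through the standard CUDA_CACHE_PATH and CUDA_CACHE_MAXSIZE environment variables; the values below are merely illustrative (a 1 GB cache under the default Linux location):
export CUDA_CACHE_PATH=$HOME/.nv/ComputeCache
export CUDA_CACHE_MAXSIZE=1073741824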
By default, OMPi produces executable CUBIN files. To enable the PTX-JIT mode, configure OMPi with the --enable-cuda-jit flag:
./configure --prefix=<install-dir> --enable-cuda-jit LDFLAGS=-rdynamic
Then OMPi can be compiled and installed as usual:
make
make install
During the installation process, if your system satisfies the minimum requirements and has at least one CUDA GPU installed, OMPi will automatically build the cuda module. The correct installation of the module can be verified by running:
ompicc --devvinfo
which lists all the identified modules/devices, along with their numeric device IDs. If the above command fails to show a CUDA GPU, please revisit the minimum requirements and ensure you have installed OMPi correctly.
Sample output of devvinfo
The ompicc --devvinfo command should give an output similar to the following (here a Tesla P40 GPU was identified by OMPi):
MODULE [cuda]:
------
OMPi CUDA device module.
Available devices : 1
device id < 1 > {
    name: Tesla P40 (SM v6.1)
    30 multiprocessors
    128 cores per multiprocessor
    3840 cores in total
    1024 maximum thread block size
    23469184 Kbytes of device global memory
}
------
Quick start
Compiling your OpenMP application with CUDA GPU offloading requires no additional arguments. Simply run:
ompicc app.c
The compiler will produce the main application executable (a.out) and several CUDA kernel executables, one for each OpenMP target construct in the application.
Sample OpenMP application utilizing the GPU
Here is a sample application that can be used to verify the successful deployment of the cuda module:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /*
     * Expected result (assuming that the GPU device ID is 1):
     *   Running on CUDA device
     */
    #pragma omp target device(1)
    {
        if (omp_is_initial_device())
            printf("Running on host\n");
        else
            printf("Running on CUDA device\n");
    }
    return 0;
}
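Assuming the GPU was assigned device ID 1 (as reported by ompicc --devvinfo), compiling and running the program:
ompicc app.c
./a.out
should print "Running on CUDA device".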
Advanced usage: Multiple compilation jobs
When an application contains several target regions, compilation times may become substantial. One can spawn multiple compilation jobs that handle the compilation of different kernels in parallel, by using the -j<number> device option. For example, to have 4 parallel kernel compilation jobs, use ompicc as follows:
ompicc --devopt -j4 app.c
Notes
The OMPi cuda module has been tested with the following CUDA GPUs:
- Tesla P40
- Tegra X1 (Jetson Nano)
- GTX 1050 Ti
- GTX 970
- GT 1030
- GT 730