Introduction
OMPi supports offloading to CUDA GPUs through its dedicated cuda module. First, the compiler generates a CUDA kernel source file for each OpenMP target directive appearing in the user application (a minimal sketch of such a directive follows the list below). The cuda module is then responsible for:
- Compiling the CUDA kernels to CUDA executables,
- Initializing/finalizing a CUDA device,
- Transferring data between the CPU and a CUDA device,
- Offloading the CUDA executables to the requested CUDA device.
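For illustration, here is a minimal sketch of the kind of construct that gets extracted into a CUDA kernel. It uses only standard OpenMP; the variable names and the reduction computed here are purely illustrative:

int main(void)
{
    int a[100], sum = 0;

    for (int i = 0; i < 100; i++)
        a[i] = i;

    /* OMPi extracts this target region into a separate CUDA kernel;
     * the map clauses drive the host/device data transfers that the
     * cuda module performs at runtime. */
    #pragma omp target map(to: a) map(tofrom: sum)
    {
        for (int i = 0; i < 100; i++)
            sum += a[i];
    }

    return 0;
}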
Requirements
The current requirements of the cuda module are the following:
- CUDA Toolkit version 9.0 or greater, with working CUDA drivers
- CUDA architecture version 3.5 or greater (for each GPU)
Before proceeding with the installation, please make sure that:
<cuda-install-dir>/bin is in your PATH environment variable
<cuda-install-dir>/lib64 is in your LD_LIBRARY_PATH environment variable
where <cuda-install-dir> is the installation directory of the CUDA toolkit.
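For example, assuming the toolkit lives in /usr/local/cuda (a common default; adjust the path to your installation):

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH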
Installing OMPi
The first step is to deploy OMPi on your system. Set up the build directory as follows:
meson setup build --prefix=<install-dir>
Advanced configuration: Just-in-time kernel compilation
The cuda module operates either in PTX-JIT or in CUBIN mode. These modes affect the output kernel executable format, as well as the actions taking place in the compiler and runtime parts of OMPi.
Both modes have their pros and cons. CUBIN mode removes many runtime overheads but adds a notable delay during kernel compilation. PTX-JIT mode, on the other hand, yields shorter compilation times, since producing a PTX file is much lighter than producing a binary; however, it incurs extra overheads at runtime. These overheads stem from the compilation and linking of the PTX file with the appropriate device libraries, and can be minimized by utilizing the CUDA JIT cache, a global disk cache for loaded PTX files.
To make sure the cache is enabled, set the corresponding environment variable:
export CUDA_CACHE_DISABLE=0
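Optionally, the cache directory and its maximum size can be tuned through the CUDA_CACHE_PATH and CUDA_CACHE_MAXSIZE environment variables; the values below are merely illustrative (the size is given in bytes):

export CUDA_CACHE_PATH=$HOME/.nv/ComputeCache
export CUDA_CACHE_MAXSIZE=1073741824     # 1 GiB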
By default, OMPi produces PTX files. To disable PTX-JIT mode and have OMPi produce executable CUBIN files instead, set up OMPi with the -Dcuda-jit flag set to false:
meson setup build --prefix=<install-dir> -Dcuda-jit=false
Then OMPi can be compiled and installed as usual:
cd build/
meson compile
meson install
During the installation process, if your system satisfies the minimum requirements and has at least one CUDA GPU installed, OMPi will automatically build the cuda module. The correct installation of the module can be verified by running:
ompiconf --devvinfo
which lists all the identified modules/devices, along with their numeric device IDs. If the above command fails to show a CUDA GPU, please revisit the minimum requirements and ensure you have installed OMPi correctly.
Sample output of devvinfo
The ompiconf --devvinfo command should give an output similar to the following (here a Tesla P40 GPU was identified by OMPi):
MODULE [cuda]:
------
OMPi CUDA device module.
Available devices : 1
device id < 0 > {
    GPU device name           : Tesla P40
    Compute capability        : 6.1
    CUDA toolkit version      : 12.4
    NVIDIA driver version     : 550.90.07
    Num of multiprocessors    : 30
    Cores per multiprocessor  : 128
    Total num of cores        : 3840
    Maximum thread block size : 1024
    Global memory             : 22.4 GBytes
    Shared memory per block   : 48 KBytes
}
------
Quick start
Compiling your OpenMP application with CUDA GPU offloading requires a simple --devs=cuda argument, or the generic --devs=all:
ompicc --devs=cuda app.c
The compiler will produce the main application executable (a.out) and several CUDA kernel executables, one for each OpenMP target construct in the application.
Sample OpenMP application utilizing the GPU
Here is a sample application that can be used to verify the successful deployment of the cuda module:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /*
     * Expected result (assuming that the GPU device ID is 0):
     * Running on CUDA device
     */
    #pragma omp target device(0)
    {
        if (omp_is_initial_device())
            printf("Running on host\n");
        else
            printf("Running on CUDA device\n");
    }
    return 0;
}
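To try it out, save the code as app.c, then compile and run it:

ompicc --devs=cuda app.c
./a.out

If the cuda module is installed correctly and the GPU has device ID 0, the program prints "Running on CUDA device".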
Advanced usage: Multiple compilation jobs
When an application contains several target regions, compilation times may become substantial. You can spawn multiple compilation jobs that handle different kernels in parallel by using the -j<number> device option. For example, to have 4 parallel kernel compilation jobs, invoke ompicc as follows:
ompicc --devs=cuda --devopt -j4 app.c
Notes
The OMPi cuda module has been tested with the following CUDA GPUs:
- Ampere A2
- Tesla P40
- Tegra X1 (Jetson Nano)
- GTX 1050 Ti
- GTX 970
- GT 1030
- GT 730