OMPi: Offloading to OpenCL GPUs

Introduction

The latest version of OMPi extends GPU support by adding the ability to offload computations to any GPU through OpenCL. OMPi analyzes the OpenMP target directives appearing in a user application and generates kernel source files written in the OpenCL C language; these are then compiled into OpenCL kernels and linked with an OpenCL library that provides OpenMP facilities during kernel execution. OpenCL support is provided by the new OMPi opencl module.

Requirements

The requirements for the opencl module to be activated are the following:

The standard OpenCL header and development packages must be installed (usually named opencl-headers, opencl-dev or ocl-icd-devel).
A GPU that supports OpenCL version ≥ 1.2 must be present; you can double-check this by running the clinfo utility, if it is installed on your system.
The corresponding vendor drivers and/or relevant libraries must be installed. We cannot provide help on how to install them, as this differs from vendor to vendor: Intel drivers can most probably be found in the intel-neo repository, AMD requires the amdgpu and/or ROCm drivers, and NVIDIA includes the OpenCL drivers/libraries in its CUDA SDK.

Installing OMPi with OpenCL support

You do not need any special preparations; simply deploy OMPi on your system as usual:

meson setup build --prefix=<install-dir>
cd build/
meson compile
meson install

If all the requirements listed above are met, the installation process will automatically detect the GPU(s) and build the opencl module; if something is missing, the module won’t be included. You can verify the correct installation of the opencl module by executing:

ompiconf --devvinfo

which lists all the identified modules/devices, along with their numeric device IDs. If the above command fails to show an OpenCL-capable GPU, double-check that all requirements are met and that OMPi has been installed correctly.

Sample output of devvinfo

The ompiconf --devvinfo command (or ompiconf --devvinfo=opencl) should produce output similar to the following (here an integrated Intel UHD770 GPU was detected by OMPi):

MODULE [opencl]: 
------ 
OMPi module for OpenCL devices 
Available devices : 1 
 
device id < 0 > {  
  GPU device name        : Intel(R) UHD Graphics 770 
  GPU device vendor      : Intel(R) Corporation 
  Vendor device driver   : 24.22.29735.20 
  OpenCL ver. (platform) : 3.0 
  OpenCL ver. (device)   : 3.0 
  OpenCL ver. (compiler) : 1.0, 1.1, 1.2, 3.0 
  Num of compute units   : 32 
  Max workgroup size     : 512 
  Local memory           : 64 KBytes 
  Global memory          : 13.9 GBytes 
  Coarse grain SVM       : yes 
  Fine grain buffer SVM  : no 
  Fine grain system SVM  : no 
  Unified host memory    : yes 
  Supports "double" type : no 
 
  Intel GPU details: 
    | 1 slice(s) 
    | 2 subslice(s) per slice (max) 
    | 16 EUs per subslice (max) 
    | 7 threads per EU (max) 
} 
------

Quick start

To offload to a GPU through OpenCL, compile your application with the --devs=opencl argument:

ompicc --devs=opencl app.c

Note

If you use the --devs=opencl compilation option, the numeric device IDs are those printed by ompiconf --devinfo=opencl. For example, on our system we get:

MODULE [opencl] provides device(s) : 0 1

You may instead compile using the --devs=all option to produce kernels that may run on any GPU target, including OpenCL. Be aware that in this case the numeric device IDs may differ, depending on which other modules are active. If you use --devs=all, the correct device IDs can be found by executing ompiconf --devinfo. On our system it reported:

MODULE [cuda] provides device(s) : 0
MODULE [opencl] provides device(s) : 1 2

Thus the numeric IDs of the two OpenCL-capable GPUs are now 1 and 2.

The compiler will produce the main application executable (a.out) and a number of OpenCL kernel files, one for each target construct in the OpenMP application.

Sample OpenMP application utilizing the GPU

Here is a sample application that can be used to verify the successful deployment of the opencl module:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* 
     * Expected output (assuming that the GPU device ID is 0):
     * Running on an OpenCL device 
     */
    int x = 1;  /* initialized so no indeterminate value is copied to the device */
    #pragma omp target map(tofrom:x) device(0)
    {
        x = omp_is_initial_device();
    }

    if (x) 
        printf("Running on host\n");    
    else 
        printf("Running on an OpenCL device\n");
    
    return 0; 
}

Advanced usage: Speeding up compilation

When an application contains several target regions, compilation times may become substantial. The -j<number> device option spawns multiple jobs that compile different kernels in parallel. For example, to have 4 parallel kernel compilation jobs, use ompicc as follows:

ompicc --devs=all --devopt -j4 app.c

Limitations

Currently, the opencl module has a number of limitations:

Only the datatypes supported by OpenCL C can be used; some GPUs may lack double support.
Only static loop scheduling is supported in target teams distribute parallel for directives.
Stand-alone parallel constructs are not handled yet; the parallel construct must be part of a combined target, teams and/or distribute construct.
OpenCL ≥ 1.2 is required. In addition, the OpenCL C driver/compiler must allow __global declarations at file scope; although this is not part of the OpenCL 1.2 standard (and is only an optional feature in OpenCL 3.0), all major vendors support it.

Notes

The OMPi opencl module has been tested with the following GPUs:

Integrated Intel GPUs (UHD 770, Iris Xe)
AMD Radeon GPUs (R9 285, RX 550)
NVIDIA GPUs (GeForce GT730, Tesla P40, Ampere A2)