OMPi on the NVIDIA Jetson Nano boards

Installing and using OMPi on an NVIDIA Jetson Nano board is not much different from deploying OMPi on any other system. Below, you can find detailed instructions for a Jetson Nano environment.

Minimum requirements

  • Meson 0.58.0 — needed for building OMPi
  • Jetson Linux (L4T) — needed for supporting OpenMP on the CPU
  • JetPack SDK 4.2 — additionally needed for OpenMP offloading to the GPU

Installation

Set up OMPi as follows:

meson setup build --prefix=‹install-dir›

Advanced configuration: Just-in-time kernel compilation

For offloading purposes, OMPi by default enables just-in-time (JIT) kernel compilation. In this mode, the produced kernel files are in the PTX format. This speeds up compilation, but may introduce a minor delay at first kernel launch, as compilation is delegated to the runtime. This delay can be minimized if the CUDA binary cache is exploited; make sure the cache is enabled by setting the corresponding environment variable.

export CUDA_CACHE_DISABLE=0

Alternatively, OMPi offers an offline compilation option that produces executable CUBIN files, eliminating the runtime delay at the cost of slower compilation. It is enabled via the -Dcuda-jit=false flag:

meson setup build --prefix=‹install-dir› -Dcuda-jit=false

Then OMPi can be compiled and installed as usual:

cd build/
meson compile
meson install

Note that the user does not need to provide any additional flags; the OpenMP cuda module is installed by default.

Usage

Compilation

Compiling OpenMP applications with OMPi on a Jetson Nano board is quite straightforward. Simply run:

ompicc --devs=cuda app.c # or --devs=all

The compiler will produce the main application executable (a.out) and several kernel executables, one for each OpenMP target construct in the application.

Advanced usage: Multiple compilation jobs

When an application contains several target regions, the user can speed up kernel compilation by spawning N parallel compilation jobs, using the -jN device option, as follows:

ompicc --devs=cuda --devopt -jN app.c

Testing the CUDA module

The correct installation of OMPi, along with its cuda module, can be verified by running:

ompiconf --devvinfo

On a Jetson Nano 2GB, this command should print the following information, assuming that cuda is the only module installed:

1 configured device module(s): cuda

MODULE [cuda]:
------
OMPi CUDA device module.
Available devices : 1

device id < 1 > {
 GPU device name : NVIDIA Tegra X1
 Compute capability : 5.3
 CUDA toolkit version : 10.2
 Num of multiprocessors : 1
 Cores per multiprocessor : 128
 Total num of cores : 128
 Maximum thread block size : 1024
 Global memory : 1.9 GBytes
 Shared memory per block : 48 KBytes
}
------

The programmer can moreover verify that the module is working properly by compiling the following sample application and running it:

Source code: Sample application utilizing the GPU
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* 
     * Expected result (assuming that the GPU device ID is 1):
     * Running on CUDA device 
     */
    #pragma omp target device(1)
    {
        if (omp_is_initial_device()) 
            printf("Running on host\n");    
        else 
            printf("Running on CUDA device\n"); 
    }
    return 0;
}

Mini-tutorial: Writing OpenMP applications that exploit the GPU

Targeting the Maxwell GPU of a Jetson Nano board is made possible through the OpenMP target-related constructs. With these constructs, the programmer indirectly launches a CUDA grid, consisting of multiple CUDA blocks that contain multiple CUDA threads. Below you can find information about the syntax and usage of the most popular constructs. For detailed information, please consult the official OpenMP specifications.

Construct: #pragma omp target parallel
#pragma omp target parallel
    <structured-block>

This construct launches a single CUDA block consisting of multiple threads. The default number of launched threads is 128; a different number can be specified using the num_threads(N) clause.

Construct: #pragma omp target teams
#pragma omp target teams
    <structured-block>

This construct launches multiple CUDA blocks, each consisting of a single thread. The default number of launched blocks is 1; a different number can be specified using the num_teams(N) clause.

Construct: #pragma omp target teams distribute parallel for
#pragma omp target teams distribute parallel for
    <for-loops>

This construct combines the previous two: it launches multiple CUDA blocks consisting of multiple threads, and additionally distributes the iterations of the loop(s) across all launched blocks and threads.