OMPi on the NVIDIA Jetson Nano boards

Installing and using OMPi on an NVIDIA Jetson Nano board is not much different from deploying OMPi on any other system. Below, you can find detailed instructions for a Jetson Nano environment.

Minimum requirements

Jetson Linux (L4T) — needed for supporting OpenMP on the CPU
JetPack SDK 4.2 — additionally needed for OpenMP offloading to the GPU

Installation

OMPi must be configured using the --disable-dependency-tracking flag, as follows:

./configure --prefix=‹install-dir› --disable-dependency-tracking

Advanced configuration: Just-in-time kernel compilation

For offloading purposes, OMPi by default compiles kernels ahead of time into executable CUBIN files. This eliminates kernel compilation delays at runtime, but can slow down the compilation of the application. Alternatively, OMPi offers just-in-time kernel compilation, enabled with the --enable-cuda-jit flag:

./configure --prefix=‹install-dir› --disable-dependency-tracking --enable-cuda-jit

In this mode, the produced kernel files are in the PTX format. This speeds up compilation of the application, but introduces a minor delay at runtime, because kernel compilation is deferred until the application runs. The delay can be minimized by using the CUDA binary cache. When using this feature, make sure the cache is enabled by setting the corresponding environment variable:

export CUDA_CACHE_DISABLE=0

Then OMPi can be compiled and installed as usual:

make 
make install

Note that the user does not need to provide any additional flags; the OpenMP cuda module is installed by default.

Usage

Compilation

Compiling OpenMP applications with OMPi on a Jetson Nano board is straightforward. Simply run:

ompicc app.c

The compiler produces the main application executable (a.out) and a set of kernel executables, one for each OpenMP target construct in the application.

Advanced usage: Multiple compilation jobs

When an application contains several target regions, the kernels can be compiled in parallel by N jobs, using the -jN device option as follows:

ompicc --devopt -jN app.c

Testing the CUDA module

The correct installation of OMPi, along with its cuda module, can be verified by running:

ompicc --devvinfo

On a Jetson Nano 2GB, this command should print the following information, assuming that cuda is the only module installed:

1 configured device module(s): cuda

MODULE [cuda]:
------
OMPi CUDA device module.
Available devices : 1

device id < 1 > { 
 name: NVIDIA Tegra X1 (SM v5.3)
 1 multiprocessors
 128 cores per multiprocessor
 128 cores in total
 1024 maximum thread block size
 2027460 Kbytes of device global memory
}
------

The programmer can further verify that the module works properly by compiling and running the following sample application:

Source code: Sample application utilizing the GPU

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* 
     * Expected result (assuming that the GPU device ID is 1):
     * Running on CUDA device 
     */
    #pragma omp target device(1)
    {
        if (omp_is_initial_device()) 
            printf("Running on host\n");    
        else 
            printf("Running on CUDA device\n"); 
    }
    return 0;
}

Mini-tutorial: Writing OpenMP applications which exploit the GPU

Targeting the Maxwell GPU of a Jetson Nano board is made possible by the OpenMP target-related constructs. With these constructs, the programmer indirectly launches a CUDA grid that consists of CUDA blocks, each containing CUDA threads. Below you can find information about the syntax and usage of the most popular constructs. For detailed information, please consult the official OpenMP specifications.

Construct: #pragma omp target parallel

#pragma omp target parallel
    <structured-block>

This construct launches a single CUDA block consisting of multiple threads. By default, 128 threads are launched; a different number can be requested with the num_threads(N) clause.
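
As an illustration, the following is a minimal sketch of a function that uses this construct; it is not taken from the OMPi distribution. The function name scale_on_device and the choice of 64 threads are hypothetical, and the sketch assumes that the GPU is the default device (otherwise a device clause can be added, as in the sample application above). The map clause copies the array to the device and back, and the nested for construct shares the loop iterations among the threads of the block.

```c
/* Hypothetical helper: multiplies x[0..n-1] by a, in place. */
void scale_on_device(float *x, int n, float a)
{
    /* One CUDA block with (at most) 64 threads; x is copied to the
     * device before the region and copied back afterwards. */
    #pragma omp target parallel num_threads(64) map(tofrom: x[0:n])
    {
        /* Share the loop iterations among the threads of the block. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            x[i] *= a;
    }
}
```

Calling scale_on_device(v, 8, 2.0f) from main doubles the first 8 elements of v. If the pragmas are ignored by a compiler without OpenMP support, the loop simply runs sequentially on the host.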

Construct: #pragma omp target teams

#pragma omp target teams
    <structured-block>

This construct launches one or more CUDA blocks, each consisting of a single thread. By default, 1 block is launched; a different number can be requested with the num_teams(N) clause.
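
A possible use of this construct is sketched below, under the same assumption that the GPU is the default device. The function name fill_squares and the choice of 4 teams are hypothetical. The nested distribute construct divides the loop iterations among the teams, so that each single-thread block handles a part of the loop.

```c
/* Hypothetical helper: stores i*i into out[i] for i in [0, n). */
void fill_squares(int *out, int n)
{
    /* Up to 4 CUDA blocks of one thread each; out is only copied
     * back from the device, since it is written but never read. */
    #pragma omp target teams num_teams(4) map(from: out[0:n])
    {
        /* Divide the loop iterations among the teams. */
        #pragma omp distribute
        for (int i = 0; i < n; i++)
            out[i] = i * i;
    }
}
```

Without the distribute construct, every team would execute the whole loop redundantly, which is rarely what is intended.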

Construct: #pragma omp target teams distribute parallel for

#pragma omp target teams distribute parallel for
    <for-loops>

This construct combines the previous two: it launches multiple CUDA blocks, each consisting of multiple threads, and distributes the iterations of the loop across the launched blocks and threads.
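
A typical use is a data-parallel loop such as vector addition. The sketch below is illustrative only: the function name vec_add is hypothetical, the GPU is assumed to be the default device, and the numbers of blocks and threads are left to the implementation.

```c
/* Hypothetical helper: computes c[i] = a[i] + b[i] for i in [0, n). */
void vec_add(const float *a, const float *b, float *c, int n)
{
    /* a and b are copied to the device, c is copied back; the loop
     * iterations are spread over all blocks and threads of the grid. */
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The num_teams(N) and num_threads(N) clauses described earlier can be added to this construct as well, to control the number of blocks and the number of threads per block.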