Installing and using OMPi on an NVIDIA Jetson Nano board is not much different from deploying OMPi on any other system. Below, you can find detailed instructions for a Jetson Nano environment.
Minimum requirements
- Jetson Linux (L4T) — needed for supporting OpenMP on the CPU
- JetPack SDK 4.2 — additionally needed for OpenMP offloading to the GPU
Installation
OMPi must be configured using the --disable-dependency-tracking flag, as follows:
./configure --prefix=‹install-dir› --disable-dependency-tracking LDFLAGS=-rdynamic
Advanced configuration: Just-in-time kernel compilation
For offloading purposes, OMPi by default produces executable CUBIN files. This eliminates runtime delays but can slow down the compilation process. Alternatively, OMPi gives the user the option of just-in-time kernel compilation, via the --enable-cuda-jit flag:
./configure --prefix=‹install-dir› --disable-dependency-tracking --enable-cuda-jit LDFLAGS=-rdynamic
In this mode, the produced kernel files are in the PTX format. While this speeds up compilation, it introduces a minor runtime delay, since kernel compilation is deferred to the runtime. This delay can be minimized by exploiting the CUDA binary cache. When using this feature, make sure the cache is enabled by setting the corresponding environment variable:
export CUDA_CACHE_DISABLE=0
Then OMPi can be compiled and installed as usual:
make
make install
Note that the user does not need to provide any additional flags; the OpenMP cuda module is installed by default.
Usage
Compilation
Compiling OpenMP applications with OMPi on a Jetson Nano board is quite straightforward. Simply run:
ompicc app.c
The compiler will produce the main application executable (a.out) and several kernel executables, one for each OpenMP target construct in the application.
Advanced usage: Multiple compilation jobs
When an application contains several target regions, the user can additionally generate N parallel jobs to handle the compilation of the kernels, by using the -jN device option, as follows:
ompicc --devopt -jN app.c
Testing the CUDA module
The correct installation of OMPi, along with its cuda module, can be verified by running:
ompicc --devvinfo
On a Jetson Nano 2GB, this command should print the following information, assuming that cuda is the only module installed:
1 configured device module(s): cuda
MODULE [cuda]:
------
OMPi CUDA device module.
Available devices : 1
device id < 1 > {
name: NVIDIA Tegra X1 (SM v5.3)
1 multiprocessors
128 cores per multiprocessor
128 cores in total
1024 maximum thread block size
2027460 Kbytes of device global memory
}
------
The programmer can further verify that the module is working properly by compiling and running the following sample application:
Source code: Sample application utilizing the GPU
#include <stdio.h>
#include <omp.h>
int main(void)
{
/*
* Expected result (assuming that the GPU device ID is 1):
* Running on CUDA device
*/
#pragma omp target device(1)
{
if (omp_is_initial_device())
printf("Running on host\n");
else
printf("Running on CUDA device\n");
}
return 0;
}
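Assuming the sample above is saved as app.c, it can be compiled and run as shown in the Compilation section:
ompicc app.c
./a.out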
Mini-tutorial: Writing OpenMP applications which exploit the GPU
Targeting the Maxwell GPU of a Jetson Nano board is done through the OpenMP target-related constructs. With these constructs, the programmer indirectly launches a CUDA grid that consists of multiple CUDA blocks, each containing multiple CUDA threads. Below you can find information about the syntax and usage of the most popular constructs. For detailed information, please consult the official OpenMP specifications.
Construct: #pragma omp target parallel
#pragma omp target parallel
<structured-block>
This construct launches a single CUDA block consisting of multiple threads. The default number of launched threads is 128, but a different number can be specified using the num_threads(N) clause.
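For example, the following minimal sketch (a hypothetical snippet; the device ID of 1 matches the sample application above) launches one block of 256 threads, and the first thread reports the team size:
#include <stdio.h>
#include <omp.h>
int main(void)
{
    /* One CUDA block with 256 threads on the GPU (device ID 1) */
    #pragma omp target parallel num_threads(256) device(1)
    {
        if (omp_get_thread_num() == 0)
            printf("Team of %d threads on the device\n", omp_get_num_threads());
    }
    return 0;
}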
Construct: #pragma omp target teams
#pragma omp target teams
<structured-block>
This construct launches multiple CUDA blocks, each consisting of one thread. The default number of launched blocks is 1, but a different number can be specified using the num_teams(N) clause.
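For example, the sketch below (again hypothetical, assuming the GPU device ID is 1) launches 4 blocks with a single thread each, and every team reports its number using the standard OpenMP routines:
#include <stdio.h>
#include <omp.h>
int main(void)
{
    /* 4 CUDA blocks, one thread each, on the GPU (device ID 1) */
    #pragma omp target teams num_teams(4) device(1)
    {
        printf("Hello from team %d of %d\n", omp_get_team_num(), omp_get_num_teams());
    }
    return 0;
}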
Construct: #pragma omp target teams distribute parallel for
#pragma omp target teams distribute parallel for
<for-loops>
This construct combines the previous two: it launches multiple CUDA blocks consisting of multiple threads, and additionally distributes the iterations of the associated loop across the launched blocks and threads.
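As an illustration, the following vector-addition sketch (hypothetical code; device ID 1 and the map clauses are standard OpenMP, not OMPi-specific) spreads the loop iterations over the launched blocks and threads:
#include <stdio.h>
#include <omp.h>
#define N 1024
int main(void)
{
    int i, a[N], b[N], c[N];

    for (i = 0; i < N; i++) {   /* initialize the input vectors on the host */
        a[i] = i;
        b[i] = 2 * i;
    }

    /* Distribute the N iterations across multiple CUDA blocks and threads */
    #pragma omp target teams distribute parallel for device(1) map(to: a, b) map(from: c)
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[%d] = %d\n", N - 1, c[N - 1]);   /* expected output: c[1023] = 3069 */
    return 0;
}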