In cluster environments, OMPi implements a novel runtime mechanism that can exploit the computational resources of any cluster node by treating them as OpenMP devices. In particular, both CPUs and accelerators (such as GPUs) present on remote nodes are virtualized and appear as if they were devices local to the node that executes the OpenMP application. As such, the developer can employ the well-known OpenMP device interface (OpenMP versions ≥ 4.0) and offload portions of the application code to them; the actual kernel execution is carried out on the corresponding remote devices, transparently (see the related publication).
Requirements
In order to use the remote offloading mechanism of OMPi, an MPI installation is required. While not strictly necessary, the MPI installation should preferably support multiple threads. The mechanism was initially developed and tested with OpenMPI 4.1.1. It was also successfully tested with MPICH 4.0.2 and Intel MPI 2018.
» MPI must be installed on the node that will act as the “host” (we call this the primary node, and it executes the main part of an application), as well as on every other node that contains devices you want to offload to (these are the remote or worker nodes). Please note that a correct MPI installation is assumed; it is very easy to encounter problems, especially when multiple MPI implementations are installed on the same machine.
» Support for multiple threads can be verified as follows:
- If OpenMPI is installed, execute the following command and make sure that its output includes “MPI_THREAD_MULTIPLE: yes”:
ompi_info | grep -i thread
- In any other case, compile and execute a simple program like the one given below:
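A minimal program for this check could look as follows (just a sketch; any program that calls MPI_Init_thread() and inspects the provided thread level will do):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request full multithreading support from the MPI library */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided == MPI_THREAD_MULTIPLE)
        printf("MPI_THREAD_MULTIPLE: yes\n");
    else
        printf("MPI_THREAD_MULTIPLE: no (provided level = %d)\n", provided);
    MPI_Finalize();
    return 0;
}

Compile it with mpicc and execute it with mpiexec to see whether multiple threads are supported.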
Setting up OMPi
Configuration file: specifying remote devices
Before installing OMPi, you need to provide a simple text configuration file in your home directory (~/.ompi_remote_devices) which contains the names or IP addresses of the participating cluster nodes, along with their devices. The reason for this is that (most probably) OMPi must be installed across all worker nodes, in order to have complete support for all types of available devices.
To make it easier, a script that can generate a sample configuration file is located in the utilities directory of the OMPi sources. First, you need to execute this script as follows (assuming your current directory is the OMPi sources root):
./utilities/remote_offload_create_config.sh
Then, edit the generated ~/.ompi_remote_devices and add the names or the IP addresses of the cluster nodes along with the devices you would like to utilize.
Important notes:
» The primary node should not be included in the configuration file.
» Make sure to include all the available devices of the remote nodes, including their CPUs as the very first device of every worker node.
» Regarding CPUs, there is flexibility in the device count one can specify. For example, if a node has N processors, each with C cores, one can treat the whole node as 1 CPU device; alternatively, each processor can be treated as a separate device, so the CPU device count should be N; finally, if each core is treated as a separate device, NxC CPU devices should be declared. In the current version of OMPi, 1 CPU device per worker node provides the best performance.
Building and installing OMPi
To facilitate the process, we provide a dedicated script that automates building and installing OMPi across the nodes listed in the configuration file. The script is located in the root directory of the OMPi sources and can be executed as:
./remote_offload_setup.sh --prefix=<installdir> [options]
Apart from all the available options provided in the configure script for a typical OMPi installation, you can also specify any of the following options:
» --cpus-only – Use only the worker CPUs as devices
» --static-procs – Avoid using MPI_Comm_spawn() and create single-threaded MPI processes statically
» --kernel-bundling=sources|binaries – Embed all device kernels into the main executable; this assumes NFS is being used. When sources is given, OMPi will embed the kernel sources into the application and do a JIT compilation during runtime. With the binaries option, OMPi will always pre-compile the kernel sources and embed the binaries.
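For example, to install under a given prefix with pre-compiled kernel binaries bundled into the executables, one could run something like the following (the installation path is just a placeholder):
./remote_offload_setup.sh --prefix=$HOME/ompi-install --kernel-bundling=binaries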
Usage
After installing OMPi, you can verify the correctness of the configuration by checking that the output of the following command agrees with the contents of the ~/.ompi_remote_devices configuration file:
ompiconf --devvinfo
» The above command also reveals the numeric device IDs that should be utilized in device() clauses of target regions (see the small example after these notes).
» If you modify the configuration file, you have to rebuild and re-install OMPi in order for the changes to take effect.
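As an illustration, assuming that ompiconf --devvinfo reports a remote device with ID 1 (the actual IDs depend on your configuration file), a code region can be offloaded to it as follows:

#include <stdio.h>

int main(void)
{
    int x = 0;

    /* Offload to device 1; use an ID reported by ompiconf --devvinfo */
    #pragma omp target device(1) map(tofrom: x)
    {
        x = 42;   /* executed on the (possibly remote) device */
    }
    printf("x = %d\n", x);
    return 0;
}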
Compiling and executing applications
In order to compile an OpenMP program, you simply run:
ompicc program.c
The best way to run the executable, assuming MPI_THREAD_MULTIPLE is supported, is through the mpiexec command:
mpiexec -n 1 ./a.out
OpenMPI may allow you to execute ./a.out directly, but execution through mpiexec works universally.
In case of static, single-threaded MPI, a helper script is generated for you, named ompi-mpirun.sh and located in the same directory as a.out, which can be run as:
./ompi-mpirun.sh ./a.out
Utilizing the user-level remote offloading API
According to the OpenMP specifications, a target code region is offloaded for execution to the default device or to any desired device specified by its numerical ID in a device() clause. To effectively use the available resources, one would need to remember all the available devices and their IDs. To ease this burden, we provide an API that simplifies managing the available devices. The API currently offers four calls:
int   ompx_get_module_num_devices(char *);
int   ompx_get_module_device(char *, int);
char *ompx_get_device_module_name(int);
int   ompx_get_module_node_info(char *, char *, int *);
The first two functions focus on helping with work-sharing between devices. When the user wants to target devices of a specific module type (e.g. CUDA GPUs), he/she can determine their count with ompx_get_module_num_devices(). The function ompx_get_module_device() accepts as parameters the module name and the device index among all devices of the specified module across all nodes. The return value is the global device ID that can be used in a device() clause (a short usage sketch is given at the end of this subsection).
The function ompx_get_device_module_name() is mainly provided for testing and verifying the module type that a device ID belongs to.
Finally, the function ompx_get_module_node_info() provides information regarding a specific module on a specific node. The user passes as parameters the names of the node and the module, as well as a pointer to an integer that upon return will contain the global device ID of the first usable device of that module in the specified node. The function returns the total number of devices of that module in the given node.
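As a sketch of how the first two calls can be combined to share work among all devices of a given module (the module name "cuda" is only an assumed example; use the module names valid for your installation):

#include <stdio.h>

/* Prototypes as listed above (normally made available by OMPi) */
int ompx_get_module_num_devices(char *);
int ompx_get_module_device(char *, int);

int main(void)
{
    /* "cuda" is an assumed module name; it may differ in your installation */
    int ndevs = ompx_get_module_num_devices("cuda");

    /* One chunk of work per device of the module, offloaded in parallel */
    #pragma omp parallel for
    for (int i = 0; i < ndevs; i++)
    {
        int devid = ompx_get_module_device("cuda", i);  /* global device ID */

        #pragma omp target device(devid)
        {
            /* ... kernel code for the i-th chunk ... */
        }
    }
    return 0;
}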
Using Remote Offloading with SLURM
OMPi has also been tested on a system that is managed by the SLURM workload manager. Each SLURM installation might behave differently and modifications might be required in your case, but this section aims to provide some guidance in case you attempt to use our mechanism through SLURM. Note that the system we used did not allow the salloc command.
» Worker nodes become known only after executing a program. Building and installing the required modules beforehand, through the provided installation script, is therefore not possible. The solution is to install OMPi on a single node that contains all the required tools and libraries for the required modules. We were able to load the required toolkits on the login node and install OMPi through the script using the option --cpus-only, which will not attempt to connect remotely to any node.
» The CUDA module contains a verification process that tests the execution of a small program. As the login nodes did not contain a GPU, we also had to manually disable this check.
» When compiling a program with OMPi and remote offloading enabled, a snapshot of the configuration file gets included in the resulting binary. However, during compilation it is impossible to know which nodes will be allocated by SLURM when we submit a job for execution. We provide an option to disable inclusion of the snapshot of the configuration file during installation:
./remote_offload_setup.sh --prefix=<installdir> --ignore-snapshot [options]
This way, the configuration file will be parsed at runtime instead of compile time. Whenever we submitted jobs and SLURM allocated some nodes, we used a script to update the configuration file with the allocated nodes and their devices. We can provide that script, or other assistance with setting up OMPi correctly, upon request.
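For reference, a job script along these lines could be used; the node count, partition name and the helper that rewrites ~/.ompi_remote_devices are placeholders that depend entirely on your system and setup:

#!/bin/bash
#SBATCH --job-name=ompi-offload
#SBATCH --nodes=3                 # primary node plus the desired worker nodes
#SBATCH --partition=compute       # placeholder partition name

# Show the nodes that SLURM allocated for this job
scontrol show hostnames "$SLURM_JOB_NODELIST"

# Rewrite ~/.ompi_remote_devices with the allocated worker nodes and their
# devices (placeholder for your own helper script)
./update_ompi_remote_devices.sh

# Launch the application; only the primary MPI process is started explicitly
mpiexec -n 1 ./a.out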