In cluster environments, OMPi implements a novel runtime mechanism that can exploit the computational resources of any cluster node by treating them as OpenMP devices. In particular, both CPUs and accelerators (such as GPUs) present on remote nodes are virtualized and appear as if they were devices local to the node that executes the OpenMP application. As such, the developer can employ the well-known OpenMP device interface (OpenMP versions ≥ 4.0) and offload portions of the application code to them; the actual kernel execution is carried out on the corresponding remote devices, transparently (see the related publication).
Requirements
In order to use the remote offloading mechanism of OMPi, an MPI installation is required. While not strictly necessary, the MPI installation should preferably support multiple threads. The mechanism was initially developed and tested with OpenMPI 4.1.1. It was also successfully tested with MPICH 4.0.2 and Intel MPI 2018.
» MPI must be installed on the node that will act as the “host” (we call this the primary node, and it executes the main part of an application), as well as on every other node that contains devices you want to offload to (these are the remote or worker nodes). Please note that a correct MPI installation is assumed; it is very easy to encounter problems, especially when multiple MPI implementations are installed on the same machine.
» Support for multiple threads can be verified as follows:
- If OpenMPI is installed, execute the following command and make sure that its output includes “MPI_THREAD_MULTIPLE: yes”:
ompi_info | grep -i thread
- In any other case, compile and execute a simple program like the one given below:
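A minimal program for this check could look as follows (just a sketch; any program that calls MPI_Init_thread() and inspects the provided thread level will do):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request full multithreading support from the MPI library */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided == MPI_THREAD_MULTIPLE)
        printf("MPI_THREAD_MULTIPLE: yes\n");
    else
        printf("MPI_THREAD_MULTIPLE: no (provided level = %d)\n", provided);
    MPI_Finalize();
    return 0;
}

Compile it with mpicc and execute it with mpiexec to see whether multiple threads are supported.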
Setting up OMPi
Configuration file: specifying remote devices
Before installing OMPi, you need to provide a simple text configuration file in your home directory (~/.ompi_remote_devices) which contains the names or IP addresses of the participating cluster nodes, along with their devices. The reason for this is that (most probably) OMPi must be installed across all worker nodes, in order to have complete support for all types of available devices.
To make it easier, a script that can generate a sample configuration file is located in the utilities directory of the OMPi sources. First, you need to execute this script as follows (assuming your current directory is the OMPi sources root):
./utilities/remote_offload_create_config.sh
Then, edit the generated ~/.ompi_remote_devices and add the names or the IP addresses of the cluster nodes along with the devices you would like to utilize.
Important notes:
» The primary node should not be included in the configuration file.
» Make sure to include all the available devices of the remote nodes, including their CPUs as the very first device of every worker node.
» Regarding CPUs, there is flexibility in the device count one can specify. For example, if a node has N processors, each with C cores, one can treat the whole node as 1 CPU device; alternatively, each processor can be treated as a separate device, so the CPU device count should be N; finally, if each core is treated as a separate device, NxC CPU devices should be declared. In the current version of OMPi, 1 CPU device per worker node provides the best performance.
Building and installing OMPi
To facilitate the process, we provide a dedicated script that automates building and installing OMPi across the nodes listed in the configuration file. The script is located in the root directory of the OMPi sources and can be executed as:
./remote_offload_setup.sh --prefix=<installdir> [options]
Apart from all the available options provided in the configure script for a typical OMPi installation, you can also specify any of the following options:
» --cpus-only – Use only the worker CPUs as devices
» --static-procs – Avoid using MPI_Comm_spawn() and create single-threaded MPI processes statically
» --kernel-bundling=sources|binaries – Embed all device kernels into the main executable; this assumes NFS is being used. When sources is given, OMPi will embed the kernel sources into the application and do a JIT compilation during runtime. With the binaries option, OMPi will always pre-compile the kernel sources and embed the binaries.
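For example, to install under a given prefix with pre-compiled kernel binaries bundled into the executables, one could run something like the following (the installation path is just a placeholder):
./remote_offload_setup.sh --prefix=$HOME/ompi-install --kernel-bundling=binaries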
Usage
After installing OMPi, you can verify the correctness of the configuration by checking that the output of the following command agrees with the contents of the ~/.ompi_remote_devices configuration file:
ompiconf --devvinfo
» The above command also reveals the numeric device IDs that should be utilized in device() clauses of target regions (see the small example after these notes).
» If you modify the configuration file, you have to rebuild and re-install OMPi in order for the changes to take effect.
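As an illustration, assuming that ompiconf --devvinfo reports a remote device with ID 1 (the actual IDs depend on your configuration file), a code region can be offloaded to it as follows:

#include <stdio.h>

int main(void)
{
    int x = 0;

    /* Offload to device 1; use an ID reported by ompiconf --devvinfo */
    #pragma omp target device(1) map(tofrom: x)
    {
        x = 42;   /* executed on the (possibly remote) device */
    }
    printf("x = %d\n", x);
    return 0;
}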
Compiling and executing applications
In order to compile an OpenMP program, you simply run:
ompicc program.c
The best way to run the executable, assuming MPI_THREAD_MULTIPLE is supported, is through the mpiexec command:
mpiexec -n 1 ./a.out
OpenMPI may allow you to execute ./a.out directly, but execution through mpiexec works universally.
In case of static, single-threaded MPI, a helper script is generated for you, named ompi-mpirun.sh and located in the same directory as a.out, which can be run as:
./ompi-mpirun.sh ./a.out
Utilizing the user-level remote offloading API
According to the OpenMP specifications, a target code region is offloaded for execution to the default device or to any desired device specified by its numerical ID in a device() clause. To effectively use the available resources, one would need to remember all the available devices and their IDs. To ease this burden, we provide an API that simplifies managing the available devices. The API currently offers four calls:
int   ompx_get_module_num_devices(char *);
int   ompx_get_module_device(char *, int);
char *ompx_get_device_module_name(int);
int   ompx_get_module_node_info(char *, char *, int *);
The first two functions focus on helping with work-sharing between devices. When the user wants to target devices of a specific module type (e.g. CUDA GPUs), he/she can determine their count with ompx_get_module_num_devices(). The function ompx_get_module_device() accepts as parameters the module name and the device index among all devices of the specified module across all nodes. The return value is the global device ID that can be used in a device() clause (a short usage sketch is given at the end of this subsection).
The function ompx_get_device_module_name() is mainly provided for testing and verifying the module type that a device ID belongs to.
Finally, the function ompx_get_module_node_info() provides information regarding a specific module on a specific node. The user passes as parameters the names of the node and the module, as well as a pointer to an integer that upon return will contain the global device ID of the first usable device of that module in the specified node. The function returns the total number of devices of that module in the given node.
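As a sketch of how the first two calls can be combined to share work among all devices of a given module (the module name "cuda" is only an assumed example; use the module names valid for your installation):

#include <stdio.h>

/* Prototypes as listed above (normally made available by OMPi) */
int ompx_get_module_num_devices(char *);
int ompx_get_module_device(char *, int);

int main(void)
{
    /* "cuda" is an assumed module name; it may differ in your installation */
    int ndevs = ompx_get_module_num_devices("cuda");

    /* One chunk of work per device of the module, offloaded in parallel */
    #pragma omp parallel for
    for (int i = 0; i < ndevs; i++)
    {
        int devid = ompx_get_module_device("cuda", i);  /* global device ID */

        #pragma omp target device(devid)
        {
            /* ... kernel code for the i-th chunk ... */
        }
    }
    return 0;
}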
Using Remote Offloading with SLURM
OMPi has also been tested on a system that is managed by the SLURM workload manager. Each SLURM installation might behave differently and modifications might be required in your case, but this section aims to provide some guidance in case you attempt to use our mechanism through SLURM. Note that the system we used did not allow the salloc command.
» Worker nodes become known only after executing a program. Building and installing the required modules beforehand, through the provided installation script, is therefore not possible. The solution is to install OMPi on a single node that contains all the required tools and libraries for the required modules. We were able to load the required toolkits on the login node and install OMPi through the script using the option --cpus-only, which will not attempt to connect remotely to any node.
» The CUDA module contains a verification process that tests the execution of a small program. As the login nodes did not contain a GPU, we also had to manually disable this check.
» When compiling a program with OMPi and remote offloading enabled, a snapshot of the configuration file gets included in the resulting binary. However, during compilation it is impossible to know which nodes will be allocated by SLURM when we submit a job for execution. We provide an option to disable inclusion of the snapshot of the configuration file during installation:
./remote_offload_setup.sh --prefix=<installdir> --ignore-snapshot [options]
This way, the configuration file will be parsed at runtime instead of compile time. Whenever we submitted jobs and SLURM allocated some nodes, we used a script to update the configuration file with the allocated nodes and their devices. We can provide that script, or other assistance with setting up OMPi correctly, upon request.
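For reference, a job script along these lines could be used; the node count, partition name and the helper that rewrites ~/.ompi_remote_devices are placeholders that depend entirely on your system and setup:

#!/bin/bash
#SBATCH --job-name=ompi-offload
#SBATCH --nodes=3                 # primary node plus the desired worker nodes
#SBATCH --partition=compute       # placeholder partition name

# Show the nodes that SLURM allocated for this job
scontrol show hostnames "$SLURM_JOB_NODELIST"

# Rewrite ~/.ompi_remote_devices with the allocated worker nodes and their
# devices (placeholder for your own helper script)
./update_ompi_remote_devices.sh

# Launch the application; only the primary MPI process is started explicitly
mpiexec -n 1 ./a.out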