| 84 | |
| 85 | == I/O forwarding == |
| 86 | |
| 87 | MSA-aware collective I/O can make more efficient use of the storage system by using, as collectors, a subset of tasks that are well suited to performing I/O operations. |
| 88 | However, the collective I/O approach imposes additional constraints that make it inapplicable in certain scenarios: |
| 89 | - By design, collective I/O operations force application tasks to coordinate so that they all perform the same sequence of operations. This is at odds with SIONlib's world view of separate files per task that can be accessed independently. |
| 90 | - Collector tasks in general have to be application tasks, i.e. they have to run the user's application. This can cause conflicts on MSA systems if the nodes that are capable of performing I/O operations efficiently are part of a module that the user application does not map well onto. |
| 91 | |
| 92 | I/O forwarding can help in both scenarios. |
| 93 | It works by relaying calls to low-level I/O functions (e.g. `open`, `write`, `stat`) via a remote procedure call (RPC) mechanism from a client task (running the user's application) to a server task (running a dedicated server program), which then executes the functions on behalf of the client. |
| 94 | Because the server tasks are dedicated to performing I/O, they can dynamically respond to individual requests from client tasks rather than imposing coordination constraints. |
| 95 | Also, on MSA systems, the server tasks can run on different modules than the user application. |
| 96 | |
| 97 | I/O forwarding has been implemented in SIONlib through an additional software package, SIONfwd (https://gitlab.version.fz-juelich.de/SIONlib/SIONfwd). |
| 98 | It consists of a server program and a corresponding client library that is used by SIONlib to relay the low-level I/O operations that it wants to perform to the server. |
| 99 | The implementation uses a custom-made, minimal RPC mechanism based only on MPI's message passing, ports, and pack/unpack mechanisms. |
| 100 | In the future we intend to evaluate more general and more optimised third-party RPC solutions as they become available. |
| 101 | |
| 102 | To use I/O forwarding in SIONlib, the SIONfwd package first has to be installed (it uses a standard CMake-based build system) and SIONlib has to be configured to make use of it: |
| 103 | |
| 104 | {{{#!sh |
| 105 | ./configure --enable-sionfwd=/path/to/sionfwd # ... more configure arguments |
| 106 | }}} |
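
The SIONfwd installation that has to precede this configure step can be sketched as follows. This is a minimal sketch that assumes a conventional out-of-source CMake build. The function name `build_sionfwd`, the source directory argument, and the install prefix are illustrative and not taken from the SIONfwd documentation, and additional CMake options (e.g. to select the MPI compiler) may be required on a given system.

```shell
# Hypothetical helper: configure, build, and install SIONfwd from a
# source checkout using a conventional out-of-source CMake build.
#   $1: path to the SIONfwd source tree (assumed to be checked out already)
#   $2: installation prefix, later passed to --enable-sionfwd
build_sionfwd() {
    src_dir=$1
    prefix=$2
    mkdir -p "$src_dir/build" &&
    (
        cd "$src_dir/build" &&
        cmake -DCMAKE_INSTALL_PREFIX="$prefix" .. &&
        make &&
        make install
    )
}

# Example call (paths are placeholders):
# build_sionfwd /path/to/SIONfwd-source /path/to/sionfwd
```

The prefix chosen here is the path that is then passed to SIONlib's `--enable-sionfwd` option.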
| 107 | |
| 108 | In the user application, just like MSA-aware collectives, I/O forwarding has to be selected when opening a file (I/O forwarding is treated as an additional low-level API alongside POSIX and C standard I/O). |
| 109 | This is done by adding the word `sionfwd` to the `file_mode` argument of SIONlib's `open` functions: |
| 110 | |
| 111 | {{{#!c |
| 112 | sion_paropen_mpi("filename", "...,sionfwd,...", ...); |
| 113 | }}} |
| 114 | |
| 115 | Although MPI in principle provides a mechanism for dynamically spawning additional processes, it is not used to spawn the forwarding server processes, for two reasons. |
| 116 | First, the feature is loosely specified, with many of the details left for implementations to decide. |
| 117 | This makes it hard to precisely control process placement, which is especially important on MSA systems. |
| 118 | Second, the resources necessary to run the server tasks (additional compute nodes) in many cases have to be requested at job submission time anyway. |
| 119 | Thus, the server tasks have to be launched from the user's job script before the application tasks are launched. |
| 120 | A typical job script could look like this: |
| 121 | |
| 122 | {{{#!sh |
| 123 | #!/bin/bash |
| 124 | # Slurm's heterogeneous jobs can be used to partition resources |
| 125 | # for the user's application and the forwarding server, even |
| 126 | # when not running on an MSA system. |
| 127 | #SBATCH --nodes=32 --partition=dp-cn |
| 128 | #SBATCH packjob |
| 129 | #SBATCH --nodes=4 --cpus-per-task=1 --partition=dp-dam |
| 130 | |
| 131 | module load intel-para SIONlib/1.7.6 |
| 132 | |
| 133 | # Defines a shell function sionfwd-spawn that is used to |
| 134 | # facilitate communication of MPI ports connection details |
| 135 | # between the server and the client. |
| 136 | eval "$(sionfwd-server bash-defs)" |
| 137 | |
| 138 | # Spawns the server, captures the connection details and |
| 139 | # exports them to the environment to be picked up by the |
| 140 | # client library used from the user's application. |
| 141 | sionfwd-spawn srun --pack-group 1 sionfwd-server |
| 142 | |
| 143 | # Spawn the user application. |
| 144 | srun --pack-group 0 user_application |
| 145 | |
| 146 | # Shut down the server. |
| 147 | srun --pack-group 0 sionfwd-server shutdown |
| 148 | |
| 149 | # Wait for all tasks to end. |
| 150 | wait |
| 151 | }}} |