Allow for compiler+accelerator specific MPI overrides #231
Example of the output:
# Set things up
ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ export EESSI_ACCELERATOR_TARGET_OVERRIDE=accel/nvidia/cc86
ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ module load EESSI/2025.06
Module for EESSI/2025.06 loaded successfully
{EESSI/2025.06} ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ echo $MODULEPATH
/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/accel/nvidia/cc80/modules/all:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/modules/all:/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/accel/nvidia/cc80/modules/all:/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/modules/all:/cvmfs/software.eessi.io/init/modules
{EESSI/2025.06} ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ module load EESSI-extend
-- Using /tmp/$USER as a temporary working directory for installations, you can override this by setting the environment variable WORKING_DIR and reloading the module (e.g., /dev/shm is a common option)
Configuring for use of EESSI_USER_INSTALL under /home/ocaisa/eessi
-- To create installations for EESSI, you _must_ have write permissions to /home/ocaisa/eessi/versions/2025.06/software/linux/aarch64/neoverse_n1
-- You may wish to configure a sources directory for EasyBuild (for example, via setting the environment variable EASYBUILD_SOURCEPATH) to allow you to reuse existing sources for packages.
# Pretend to want to do a build
{EESSI/2025.06} ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ eb OSU-Micro-Benchmarks-7.5.1-gompi-2025b-CUDA-12.9.1.eb --stop prepare --rebuild --hooks=./eb_hooks.py
== Temporary log file in case of crash /tmp/eb-uflhewm6/easybuild-0ha7tv9j.log
== found valid index for /cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/5.3.0/easybuild/easyconfigs, so using it...
== Running parse hook for OSU-Micro-Benchmarks-7.5.1-gompi-2025b-CUDA-12.9.1.eb...
== found valid index for /cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/5.3.0/easybuild/easyconfigs, so using it...
== Running parse hook for gompi-2025b.eb...
...
== Running parse hook for lfbf-2025b.eb...
== processing EasyBuild easyconfig
/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/5.3.0/easybuild/easyconfigs/o/OSU-Micro-Benchmarks/OSU-Micro-Benchmarks-7.5.1-gompi-2025b-CUDA-12.9.1.eb
== building and installing OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1...
>> installation prefix: /home/ocaisa/eessi/versions/2025.06/software/linux/aarch64/neoverse_n1/software/OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1
== fetching files and verifying checksums...
== Running pre-fetch hook...
>> sources:
>> /tmp/ocaisa/easybuild/sources/o/OSU-Micro-Benchmarks/osu-micro-benchmarks-7.5.1.tar.gz [SHA256: 160d0d5e3c3cb022520ecb247e9875bb0973b1d3cadccd6c17624f8407c52e22]
== ... (took < 1 sec)
== creating build dir, resetting environment...
>> build dir: /tmp/ocaisa/easybuild/build/OSUMicroBenchmarks/7.5.1/gompi-2025b-CUDA-12.9.1
== Running post-ready hook...
WARNING: Deprecated functionality, will no longer work in EasyBuild v6.0: Easyconfig parameter 'parallel' is deprecated, use 'max_parallel' or the parallel property instead.; see
https://docs.easybuild.io/deprecated-functionality/ for more information
== ... (took < 1 sec)
== unpacking...
>> running shell command:
tar xzf /tmp/ocaisa/easybuild/sources/o/OSU-Micro-Benchmarks/osu-micro-benchmarks-7.5.1.tar.gz
[started at: 2026-05-14 16:03:37]
[working dir: /tmp/ocaisa/easybuild/build/OSUMicroBenchmarks/7.5.1/gompi-2025b-CUDA-12.9.1]
[output and state saved to /tmp/eb-uflhewm6/run-shell-cmd-output/tar-gfx7xw93]
>> command completed: exit 0, ran in < 1s
== ... (took < 1 sec)
== patching...
== ... (took < 1 sec)
== preparing...
== Running pre-prepare hook...
== Updated rpath_override_dirs (to allow overriding MPI family OpenMPI):
/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-CUDA-12.9.1/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-CUDA-12.9.1/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib64
>> loading toolchain module: gompi/2025b
== ... (took < 1 sec)
...
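The ordering of `rpath_override_dirs` in the log above (accelerator-specific directories first, then the generic ones, each with `lib` and `lib64`) can be sketched as follows. This is only an illustration: the function name `build_rpath_override_dirs` and its parameters are hypothetical and are not taken from `eb_hooks.py`.

```python
import os


def build_rpath_override_dirs(base_dir, mpi_family, variant_stubs, lib_subdirs=("lib", "lib64")):
    """Return a colon-separated list of override directories.

    Hypothetical helper: variant_stubs is expected to be ordered from most
    specific (e.g. an accelerator-specific stub) to least specific.
    """
    dirs = []
    for stub in variant_stubs:
        for libdir in lib_subdirs:
            dirs.append(os.path.join(base_dir, "rpath_overrides", mpi_family, stub, libdir))
    return ":".join(dirs)


# Base path and stubs copied from the log output above
base = "/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1"
print(build_rpath_override_dirs(base, "OpenMPI", ["system-CUDA-12.9.1", "system"]))
```

Listing the most specific stub first means a CUDA-specific override library, if present, is found before the generic one.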
This increases the complexity a bit, but it might be necessary:
# If the package relies on CUDA or ROCm, the MPI layer may require different overrides
# for different CUDA/ROCm versions with specific compiler families
if self.cfg.eessi_gpu_dependency:
    gpu_stub = f"{self.toolchain.COMPILER_FAMILY}-{self.cfg.eessi_gpu_dependency[0]}-{self.cfg.eessi_gpu_dependency[1]}"
What this should look like depends somewhat on whether the dependency is CUDA or ROCm. For CUDA it's pretty easy, as the dep is CUDA/<version>. For ROCm it's a bit more complicated: the dep is ROCm-LLVM, and the interesting part is actually the versionsuffix.
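A rough sketch of how the stub could differ between the two cases is shown below. It assumes the dependency is a tuple of `(name, version)` or `(name, version, versionsuffix)`, and that the ROCm release can be read from the versionsuffix of `ROCm-LLVM`. The function name `make_gpu_stub` and the example versionsuffix `-ROCm-6.4.1` are hypothetical, used only for illustration.

```python
def make_gpu_stub(compiler_family, dep):
    """Build a '<compiler>-<accelerator>-<version>' stub (hypothetical helper).

    dep is assumed to be (name, version) or (name, version, versionsuffix).
    """
    name, version = dep[0], dep[1]
    versionsuffix = dep[2] if len(dep) > 2 else ""
    if name == "CUDA":
        # CUDA case: the dependency name and version are all that is needed
        return f"{compiler_family}-{name}-{version}"
    if name == "ROCm-LLVM":
        # ROCm case: assumption that the versionsuffix carries the ROCm release,
        # with a leading '-' that is stripped here
        return f"{compiler_family}-{versionsuffix.lstrip('-')}"
    raise ValueError(f"Unsupported accelerator dependency: {name}")


print(make_gpu_stub("GCC", ("CUDA", "12.9.1")))
print(make_gpu_stub("GCC", ("ROCm-LLVM", "19.0.0", "-ROCm-6.4.1")))
```

The actual layout of the ROCm versionsuffix depends on the work in #228, so the second branch should be read as a placeholder.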
Before we can make a decision here, we first need #228
No feedback to date; I'm waiting on someone to review it.
@TopRichard can you review it?
Discussed in the support meeting: @TopRichard said he has already tested this for CUDA and that it works. We agreed he'll add a review here, including the steps he took to test it. I can then try to mimic that for ROCm and validate that it also works there.
I have tested this locally, integrating the changes introduced in the PR into
For clarity: this is currently blocked by #228 for the ROCm side of things.
if dep[0] in top_level_accelerator_packages:
    # Store the dependency as a property for later potential use
    # (e.g., accelerator-specific MPI RPATH overrides)
    ec.eessi_gpu_dependency = dep
Just as a reminder: this will also need to be done for the ROCm side of things after #228 gets merged. You'll need to merge main into this feature branch, resolve any conflicts (both PRs touch this same part of the code), then add some ec.eessi_gpu_dependency = ... to the ROCm side of things.
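The pattern of recording the accelerator dependency for later use can be sketched in isolation as below. This is not the hook code itself: `record_gpu_dependency` is a hypothetical helper, a `SimpleNamespace` stands in for the easyconfig object, and the contents of `top_level_accelerator_packages` are assumed.

```python
from types import SimpleNamespace

# Assumption: which packages count as top-level accelerator packages
top_level_accelerator_packages = ("CUDA",)


def record_gpu_dependency(ec, dependencies, accelerator_packages=top_level_accelerator_packages):
    """Store the first accelerator dependency on ec (hypothetical helper).

    Sets ec.eessi_gpu_dependency to None when there is no accelerator
    dependency, so later checks such as `if ec.eessi_gpu_dependency:` are safe.
    """
    ec.eessi_gpu_dependency = None
    for dep in dependencies:
        if dep[0] in accelerator_packages:
            ec.eessi_gpu_dependency = dep
            break
    return ec.eessi_gpu_dependency


ec = SimpleNamespace()  # stand-in for an easyconfig object
deps = [("UCX", "1.19.0"), ("CUDA", "12.9.1")]  # illustrative dependency list
print(record_gpu_dependency(ec, deps))
```

Extending this to ROCm would, under these assumptions, mean adding the relevant ROCm package name to the tuple and deciding which fields of the dependency to keep.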
#228 is merged.
Alternative to #230, where we focus only on the potential need for CUDA/ROCm variants.
This also opens the door to other types of variants, but the options would be multiplicative, so I haven't included any until we hit a need for them.
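To make the "multiplicative" concern concrete, here is a small sketch that enumerates variant stubs as the product of independent axes. The axis values and the third axis are purely hypothetical; the helper `variant_stubs` is not part of the PR.

```python
from itertools import product


def variant_stubs(*axes):
    """Return one stub per combination of values across all axes (hypothetical helper)."""
    return ["-".join(combo) for combo in product(*axes)]


compilers = ["GCC", "LLVM"]                  # illustrative compiler families
accelerators = ["CUDA-12.9.1", "ROCm-6.4.1"]  # illustrative accelerator variants
other_axis = ["variantA", "variantB"]         # hypothetical additional variant type

two_axes = variant_stubs(compilers, accelerators)
three_axes = variant_stubs(compilers, accelerators, other_axis)
print(len(two_axes), len(three_axes))
```

Each added axis multiplies the number of override directories that would have to be considered, which is the reason for limiting the scope to CUDA/ROCm variants for now.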