Skip to content

Phase-3: Riemann hot-path decomposition into shared GPU device helpers#1572

Open
sbryngelson wants to merge 10 commits into
MFlowCode:masterfrom
sbryngelson:phase3-riemann
Open

Phase-3: Riemann hot-path decomposition into shared GPU device helpers#1572
sbryngelson wants to merge 10 commits into
MFlowCode:masterfrom
sbryngelson:phase3-riemann

Conversation

@sbryngelson

Copy link
Copy Markdown
Member

Stacked on #1551#1552#1553#1555#1556 — its own content is the top 4 commits. Completes the refactoring roadmap's Phase 3 (hot-path decomposition), the final phase of the series.

Summary

  1. Per-solver benchmark coverage. 5eq_rk3_weno3_hll and 5eq_rk3_weno3_lf join the benchmark suite (previously only HLLC and hypo-HLL had solver-specific coverage), so the performance gates below actually measure every solver the changes touch.

  2. Tier-1 shared device helpers (m_riemann_state.fpp, all GPU_ROUTINE(seq, cray_inline)): mixture-properties aggregation (14 call sites across HLL/HLLC/LF), interface-Reynolds averaging (10 sites), and the HLLC star-momentum flux expression (6 sites). Net −50 LOC from the hot loops. Each conversion verified bitwise-identical on live lanes by independent review; blocks whose arithmetic genuinely differs between solvers were deliberately left inline rather than homogenized. Incidental fix: dead viscous lanes previously stored 2/(1/uninitialized) into Re_avg — now a defined value (fixes the UB/FP-trap hazard).

  3. Hypoelastic interface energy unified under maintainer rulings (recorded in-repo): one shared helper with the verysmall gate (HLL's G > 1000 stability floor and its TODO take out comment retired), unconditional damage scaling, and the dimension-correct shear_indices doubling test; the dead hypoelastic energy blocks in HLLC (×2) and LF deleted outright (the case validator restricts hypoelasticity to HLL). Net −56 LOC. Wave-speed estimates untouched everywhere.

Performance gate (<5% ceiling, maintainer-set)

Simulation grind-time ratios vs. the pre-series baseline, all benchmark cases:

stage range watched case
Tier-1 helpers 0.99–1.04 5eq_rk3_weno3_hllc 1.02 (faster)
Hypoelastic unification 0.98–1.03 hypo_hll 1.03 (faster)

(Ratios are speedups: >1 is faster. Pre/post exec scatter is sub-second I/O noise, documented in the working notes.)

Behavior notes

Docs

gpuParallelization.md gains a "Writing GPU_ROUTINE Device Helpers" section (the macro was previously undocumented): the FLOP threshold, the seq+cray_inline idiom and why Cray requires it, the caller-loads/helper-computes pattern, AMD case-optimization dimensioning, and declare-scoping — with a real helper from this PR as the worked example.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR is the final stage of a stacked refactor series that decomposes the Riemann-solver hot paths into shared GPU device helpers, tightens multi-rank MPI broadcast correctness via registry-driven generation, and updates build/docs infrastructure to support these changes and performance gating.

Changes:

  • Extracts shared, inlined GPU_ROUTINE(seq, cray_inline) device helpers (mixture aggregation, interface Reynolds averaging, hypoelastic energy, HLLC star momentum flux) and reduces solver hot-loop duplication.
  • Moves MPI broadcast lists toward registry-driven generation while preserving manual “irregular” residues (struct roots, nested member patterns), addressing prior multi-rank divergence/UB hazards.
  • Refactors CMake build plumbing into modular .cmake includes and documents the GPU helper pattern.

Reviewed changes

Copilot reviewed 42 out of 45 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/simulation/m_riemann_state.fpp Adds shared GPU device helpers and consolidates per-sweep Riemann state logic.
toolchain/mfc/params/generators/fortran_gen.py Implements generated MPI broadcast include emission and supporting classification/type-mapping helpers.
cmake/MFCTargets.cmake Centralizes target setup/linking logic for MPI/OpenACC/OpenMP/GPU builds.
cmake/Fypp.cmake Wires Fypp preprocessing with explicit generated-include dependencies.
cmake/ParamsCodegen.cmake Adds build-time (ninja-tracked) generation of Fortran parameter include fragments.
cmake/GPU.cmake Collects compiler/GPU-related flags and IPO/LTO handling into a dedicated module.

Comment thread cmake/MFCTargets.cmake Outdated
if (NOT TARGET OpenMP::OpenMP_Fortran)
message(FATAL_ERROR "OpenMP + Fortran is unsupported.")
endif()
set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY])
Comment thread cmake/MFCTargets.cmake Outdated

foreach (a_target ${IPO_TARGETS})
set_target_properties(${a_target} PROPERTIES Fortran_PREPROCESS ON)
message(STATUS ${CMAKE_Fortran_COMPILER_ID})
Comment thread cmake/MFCTargets.cmake Outdated
Comment on lines +212 to +218
target_compile_options(${ARGS_TARGET}
PRIVATE -gpu=mem:unified:managedalloc -cuda
)
# "This option must appear in both the compile and link lines" -- NVHPC Docs
target_link_options(${ARGS_TARGET}
PRIVATE -gpu=mem:unified:managedalloc -cuda
)
Comment thread cmake/MFCTargets.cmake Outdated
Comment on lines +229 to +231
target_compile_options(${ARGS_TARGET}
PRIVATE -DFRONTIER_UNIFIED)
endif()
Comment thread cmake/MFCTargets.cmake Outdated
Comment on lines +242 to +244
target_compile_options(${ARGS_TARGET}
PRIVATE -DFRONTIER_UNIFIED)
endif()
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 11, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
@sbryngelson

Copy link
Copy Markdown
Member Author

Thanks Copilot — all five findings verified and addressed in b43e445. Context: cmake/MFCTargets.cmake was created by pure motion from the monolithic CMakeLists.txt (the split was verified by byte-identical configure graphs), so all five lines pre-exist on master — but four were worth fixing on substance: the unified-memory flags (the NVHPC managedalloc compile/link pair and both FRONTIER_UNIFIED definitions) now apply to ${a_target} so the NVHPC two-pass IPO lib target receives them consistently, and the leftover debug message(STATUS) is gone. The set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line was deleted outright rather than de-bracketed: set(ENV{...}) only affects the configure-time CMake process environment, so it never influenced the built binaries regardless of the bracket bug.

sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 12, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 79.76190% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.81%. Comparing base (b4be438) to head (f3c96cc).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/simulation/m_riemann_solver_hllc.fpp 58.62% 12 Missing ⚠️
src/simulation/m_riemann_state.fpp 92.10% 0 Missing and 3 partials ⚠️
src/simulation/m_riemann_solver_hll.fpp 81.81% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1572      +/-   ##
==========================================
+ Coverage   60.51%   60.81%   +0.30%     
==========================================
  Files          83       83              
  Lines       19905    19793     -112     
  Branches     2950     2928      -22     
==========================================
- Hits        12046    12038       -8     
+ Misses       5866     5772      -94     
+ Partials     1993     1983      -10     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 12, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 12, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 12, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 12, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
…indrical source)

Adds three GPU_ROUTINE(cray_inline) helpers to m_riemann_state and replaces the duplicated inline blocks in the HLL, HLLC, and LF solvers (HLLD untouched): s_accumulate_mixture_properties (14 call sites: hll 2, lf 2, hllc 10 across the 6-eq/4-eq/5-eq-bubbles/5-eq branches), s_compute_interface_reynolds (10 call sites: hll 2, lf 4, hllc 4), and f_compute_hllc_star_momentum_flux (6 call sites: hllc cylindrical and azimuthal geometric source flux).

NOT NFC: calls replace inline arithmetic, so floating-point ordering may shift within golden tolerance. The hllc 5-eq Re_max Reynolds variant is replaced by the shared form: live lanes are arithmetically identical; previously-uninitialized dead lanes (Re_size(i)==0) now get the defined 1/sgm_eps value. The hllc 5-eq branch loads post-limiter alpha into new alpha_lim arrays so alpha_L/R keep their pre-limiter values used downstream. The HLL/LF cylindrical source blocks and the hllc 5-eq-bubbles (1-alpha) Reynolds variant use different arithmetic and stay inline.
…dead solver copies

Implements Task 2 of the Phase-3 Riemann decomposition plan per Spencer's rulings (2026-06-11) on the BLOCKED-ON-PHYSICS adjudication:

1. Energy gate: the helper gates on G > verysmall, not HLL's hard-coded 1000 floor; HLL's TODO asking for its removal is honored and deleted. This is a deliberate behavior change for soft-G (G in (verysmall, 1000]) hypoelastic runs; golden tests are unaffected (test shear moduli are 5e4/1e5).

2. Damage scaling: the (1 - damage) factor applies whenever cont_damage is on (HLL's existing behavior); the helper carries it.

3. Dead copies: the hypoelastic energy-loading blocks in HLLC (both instances) and LF are deleted, since case_validator.py:577 restricts hypoelasticity to the HLL solver. LF's G_L/G_R locals became unused and are removed from the declarations and the private list; HLLC's tau_e/G locals remain in use by its hyperelastic and elastic flux/wave-speed code.

4. Wave-speed corrections are untouched in all solvers; the hyper-tau_e divergence is filed upstream separately.

The helper uses the shear_indices-based energy doubling (HLL's dimension-correct form, not the hard-coded i==2/4/5). The caller loads the SF-indexed stress and damage values and passes them in, keeping tau_e_L/R available for the stress-flux and wave-speed blocks; the mixture moduli G_L/G_R are intent(out) for the elastic wave speeds.
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 12, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
… HLLC; doc fixes

Guard inventory (every tau_e/G consumer in the two files was classified by its
exact guard before deletion):
- LF: hypoelastic momentum flux (else if hypoelasticity), hypoelastic energy
  flux (else if hypoelasticity), stress evolution flux (if hypoelasticity),
  cylindrical tau_sigmasigma source (if cyl_coord .and. hypoelasticity) - all
  deleted.
- HLLC: stress evolution flux (if hypoelasticity), both 6-eqn and 5-eqn copies
  - deleted. The hyperelasticity-gated tau_e/G/xi_field loads and the
  elasticity-gated wave speeds and elastic momentum/energy fluxes are
  untouched (live under hyperelasticity, which is legal with all solvers).
- HLL is untouched.

case_validator.py prohibits hypoelasticity with any solver but HLL, so these
blocks were unreachable; after PR MFlowCode#1572 deleted the hypoelastic energy blocks
they consumed tau_e_L/R that nothing in these files loads. Per the maintainer
ruling: prefer deleting code - lifting the restriction later means adding the
code back deliberately.

Unused locals removed from LF: tau_e_L/R and flux_tau_L/R (zero refs after
deletion) and xi_field_L/R (orphaned since the solver split, never referenced
in LF) - declarations and GPU private-list entries. HLLC declarations are
unchanged (tau_e/G remain live on the hyper/elasticity paths).

GPU census: LF -7 acc seq-loop directives (2 deleted seq loops x 3 sweep
copies + 1 cyl-only), HLLC -6 (1 seq loop x 3 copies x 2 blocks), OMP path
unchanged (seq loops emit nothing under OMP); all deltas trace to deleted
blocks.

Doc fixes: rescope the cray_inline guidance to cross-module helpers only,
describe the AMD case-opt guard as sitting on the helper's dummy declaration
with matching caller-local guards, and note in
s_compute_hypoelastic_interface_energy that the verysmall gate is the
deliberate ruling that replaced HLL's former G > 1000 stability floor and
retired its TODO.
@sbryngelson

Copy link
Copy Markdown
Member Author

Code review

Found 2 issues, both fixed in fc7f770:

  1. The hypoelastic energy-block deletions left LF and HLLC with hypoelasticity-gated flux/source blocks consuming tau_e_L/R that nothing loads (flagged independently by three review passes). Unreachable through the toolchain (case_validator.py:577 restricts hypoelasticity to HLL), but a latent-UB trap if that restriction is ever lifted. Resolved by completing the dead-copy deletion under the same maintainer ruling: every hypoelasticity-gated block in LF/HLLC removed after a guard-by-guard inventory; all elasticity/hyperelasticity-gated code untouched (with HLLC, elasticity implies hyperelasticity, so the kept consumers are always fed by the kept loads). −60 LOC of dead code, plus orphaned locals.

https://github.com/MFlowCode/MFC/blob/a577f15e09134f4f12bd49646bee9ed8fcab84a0/src/simulation/m_riemann_solver_lf.fpp#L443-L450

  1. The new GPU_ROUTINE documentation overstated cray_inline=True as required "always" while two same-module helpers in the same file correctly omit it, and misplaced the AMD dimensioning guard as "caller-side". Both passages corrected (cray_inline scoped to cross-module helpers; the guard sits on the helper's dummy declaration), and the hypoelastic helper's doc comment now records the deliberate verysmall-gate ruling.

https://github.com/MFlowCode/MFC/blob/a577f15e09134f4f12bd49646bee9ed8fcab84a0/docs/documentation/gpuParallelization.md#L636-L640

Verified sound across five independent review passes: helper algebra bitwise-faithful at all 30 call sites (sampled), no upstream hotfixes lost in any replaced block (history audit vs #1545/#1556 merges), Herschel-Bulkley viscous arguments preserved, cmake a_target fixes correct, all five prior Copilot points confirmed addressed, merge-base current. Known and deliberately preserved pending physics rulings: #1570 (damage-R offset), #1557/#1571 — declared in the PR description.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@sbryngelson

Copy link
Copy Markdown
Member Author

Device validation complete on the current head's content (860ecb3b1): full 594-test suite run on GPU nodes under both backends — OpenACC and OpenMP offload — ALL TESTS PASSED on each, alongside the 4-way CPU golden gate. Every change in this PR now has empirical GPU evidence in addition to the emission-level proofs.

…#1588) through the s_compute_interface_reynolds extraction

Phase-3 extracts the viscous Reynolds computation into a cray-inlined GPU_ROUTINE that read the declare-target Re_size static directly, which silently reintroduces the AMD flang stale-cross-TU read fixed in MFlowCode#1588. Thread MFlowCode#1588's host-captured Re_size (Re_size_loc1/2 + firstprivate) through the helper and its 10 call sites, and apply the merge() pattern to hllc's inline block.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants