Phase-3: Riemann hot-path decomposition into shared GPU device helpers by sbryngelson · Pull Request #1572 · MFlowCode/MFC

sbryngelson · 2026-06-11T21:53:29Z

Stacked on #1551 → #1552 → #1553 → #1555 → #1556 — its own content is the top 4 commits. Completes the refactoring roadmap's Phase 3 (hot-path decomposition), the final phase of the series.

Summary

Per-solver benchmark coverage. 5eq_rk3_weno3_hll and 5eq_rk3_weno3_lf join the benchmark suite (previously only HLLC and hypo-HLL had solver-specific coverage), so the performance gates below actually measure every solver the changes touch.
Tier-1 shared device helpers (m_riemann_state.fpp, all GPU_ROUTINE(seq, cray_inline)): mixture-properties aggregation (14 call sites across HLL/HLLC/LF), interface-Reynolds averaging (10 sites), and the HLLC star-momentum flux expression (6 sites). Net −50 LOC from the hot loops. Each conversion verified bitwise-identical on live lanes by independent review; blocks whose arithmetic genuinely differs between solvers were deliberately left inline rather than homogenized. Incidental fix: dead viscous lanes previously stored 2/(1/uninitialized) into Re_avg — now a defined value (fixes the UB/FP-trap hazard).
Hypoelastic interface energy unified under maintainer rulings (recorded in-repo): one shared helper with the verysmall gate (HLL's G > 1000 stability floor and its TODO take out comment retired), unconditional damage scaling, and the dimension-correct shear_indices doubling test; the dead hypoelastic energy blocks in HLLC (×2) and LF deleted outright (the case validator restricts hypoelasticity to HLL). Net −56 LOC. Wave-speed estimates untouched everywhere.

Performance gate (<5% ceiling, maintainer-set)

Simulation grind-time ratios vs. the pre-series baseline, all benchmark cases:

stage	range	watched case
Tier-1 helpers	0.99–1.04	`5eq_rk3_weno3_hllc` 1.02 (faster)
Hypoelastic unification	0.98–1.03	`hypo_hll` 1.03 (faster)

(Ratios are speedups: >1 is faster. Pre/post exec scatter is sub-second I/O noise, documented in the working notes.)

Behavior notes

Golden suite green (4-way sharded) after each stage — the changes are golden-test-silent, but two are real for untested regimes and deliberate: soft-G hypoelastic runs (verysmall < G ≤ 1000) now receive elastic energy the stability floor previously suppressed (maintainer ruling), and dead-lane Re_avg values changed from garbage to defined (consumers gated; results-inert).
Fixes HLLC hypoelastic energy block mis-doubles the theta-theta normal stress in 2D-axisymmetric (currently dead code) #1566 (the deleted HLLC blocks contained the hard-coded axisym shear-doubling bug).
Fixes Uninitialized read into Re_avg on dead viscous lanes (UB; FP-trap hazard in debug builds) #1567 (the uninitialized Re_avg lanes).
Deliberately NOT touched, tracked for physics rulings: Hyperelastic wave-speed estimates: HLLC includes the tau_e correction, HLL deliberately omits it #1557, HLL hypoelastic damage load reads the LEFT cell's damage for the RIGHT state (missing +1 offset) #1570, HLL hyperelastic wave-speed branch reads shear moduli that only the hypoelastic path initializes #1571 (elastic wave-speed formulation divergence, the damage-R load offset, HLL's hyperelastic G initialization).

Docs

gpuParallelization.md gains a "Writing GPU_ROUTINE Device Helpers" section (the macro was previously undocumented): the FLOP threshold, the seq+cray_inline idiom and why Cray requires it, the caller-loads/helper-computes pattern, AMD case-optimization dimensioning, and declare-scoping — with a real helper from this PR as the worked example.

Copilot

Pull request overview

This PR is the final stage of a stacked refactor series that decomposes the Riemann-solver hot paths into shared GPU device helpers, tightens multi-rank MPI broadcast correctness via registry-driven generation, and updates build/docs infrastructure to support these changes and performance gating.

Changes:

Extracts shared, inlined GPU_ROUTINE(seq, cray_inline) device helpers (mixture aggregation, interface Reynolds averaging, hypoelastic energy, HLLC star momentum flux) and reduces solver hot-loop duplication.
Moves MPI broadcast lists toward registry-driven generation while preserving manual “irregular” residues (struct roots, nested member patterns), addressing prior multi-rank divergence/UB hazards.
Refactors CMake build plumbing into modular .cmake includes and documents the GPU helper pattern.

Reviewed changes

Copilot reviewed 42 out of 45 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`src/simulation/m_riemann_state.fpp`	Adds shared GPU device helpers and consolidates per-sweep Riemann state logic.
`toolchain/mfc/params/generators/fortran_gen.py`	Implements generated MPI broadcast include emission and supporting classification/type-mapping helpers.
`cmake/MFCTargets.cmake`	Centralizes target setup/linking logic for MPI/OpenACC/OpenMP/GPU builds.
`cmake/Fypp.cmake`	Wires Fypp preprocessing with explicit generated-include dependencies.
`cmake/ParamsCodegen.cmake`	Adds build-time (ninja-tracked) generation of Fortran parameter include fragments.
`cmake/GPU.cmake`	Collects compiler/GPU-related flags and IPO/LTO handling into a dedicated module.

+                if (NOT TARGET OpenMP::OpenMP_Fortran)
+                    message(FATAL_ERROR "OpenMP + Fortran is unsupported.")
+                endif()
+                set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY])


+
+    foreach (a_target ${IPO_TARGETS})
+        set_target_properties(${a_target} PROPERTIES Fortran_PREPROCESS ON)
+	    message(STATUS ${CMAKE_Fortran_COMPILER_ID})


+                    target_compile_options(${ARGS_TARGET}
+                        PRIVATE -gpu=mem:unified:managedalloc -cuda
+                    )
+                    # "This option must appear in both the compile and link lines" -- NVHPC Docs
+                    target_link_options(${ARGS_TARGET}
+                        PRIVATE -gpu=mem:unified:managedalloc -cuda
+                    )


+                    target_compile_options(${ARGS_TARGET}
+                        PRIVATE -DFRONTIER_UNIFIED)
+                endif()


+                    target_compile_options(${ARGS_TARGET}
+                        PRIVATE -DFRONTIER_UNIFIED)
+                endif()


Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.

sbryngelson · 2026-06-11T23:55:56Z

Thanks Copilot — all five findings verified and addressed in b43e445. Context: cmake/MFCTargets.cmake was created by pure motion from the monolithic CMakeLists.txt (the split was verified by byte-identical configure graphs), so all five lines pre-exist on master — but four were worth fixing on substance: the unified-memory flags (the NVHPC managedalloc compile/link pair and both FRONTIER_UNIFIED definitions) now apply to ${a_target} so the NVHPC two-pass IPO lib target receives them consistently, and the leftover debug message(STATUS) is gone. The set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line was deleted outright rather than de-bracketed: set(ENV{...}) only affects the configure-time CMake process environment, so it never influenced the built binaries regardless of the bracket bug.

Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.

codecov · 2026-06-12T02:05:13Z

Codecov Report

❌ Patch coverage is 79.76190% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.81%. Comparing base (b4be438) to head (f3c96cc).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
src/simulation/m_riemann_solver_hllc.fpp	58.62%	12 Missing ⚠️
src/simulation/m_riemann_state.fpp	92.10%	0 Missing and 3 partials ⚠️
src/simulation/m_riemann_solver_hll.fpp	81.81%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1572      +/-   ##
==========================================
+ Coverage   60.51%   60.81%   +0.30%     
==========================================
  Files          83       83              
  Lines       19905    19793     -112     
  Branches     2950     2928      -22     
==========================================
- Hits        12046    12038       -8     
+ Misses       5866     5772      -94     
+ Partials     1993     1983      -10

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.

…indrical source) Adds three GPU_ROUTINE(cray_inline) helpers to m_riemann_state and replaces the duplicated inline blocks in the HLL, HLLC, and LF solvers (HLLD untouched): s_accumulate_mixture_properties (14 call sites: hll 2, lf 2, hllc 10 across the 6-eq/4-eq/5-eq-bubbles/5-eq branches), s_compute_interface_reynolds (10 call sites: hll 2, lf 4, hllc 4), and f_compute_hllc_star_momentum_flux (6 call sites: hllc cylindrical and azimuthal geometric source flux). NOT NFC: calls replace inline arithmetic, so floating-point ordering may shift within golden tolerance. The hllc 5-eq Re_max Reynolds variant is replaced by the shared form: live lanes are arithmetically identical; previously-uninitialized dead lanes (Re_size(i)==0) now get the defined 1/sgm_eps value. The hllc 5-eq branch loads post-limiter alpha into new alpha_lim arrays so alpha_L/R keep their pre-limiter values used downstream. The HLL/LF cylindrical source blocks and the hllc 5-eq-bubbles (1-alpha) Reynolds variant use different arithmetic and stay inline.

…dead solver copies Implements Task 2 of the Phase-3 Riemann decomposition plan per Spencer's rulings (2026-06-11) on the BLOCKED-ON-PHYSICS adjudication: 1. Energy gate: the helper gates on G > verysmall, not HLL's hard-coded 1000 floor; HLL's TODO asking for its removal is honored and deleted. This is a deliberate behavior change for soft-G (G in (verysmall, 1000]) hypoelastic runs; golden tests are unaffected (test shear moduli are 5e4/1e5). 2. Damage scaling: the (1 - damage) factor applies whenever cont_damage is on (HLL's existing behavior); the helper carries it. 3. Dead copies: the hypoelastic energy-loading blocks in HLLC (both instances) and LF are deleted, since case_validator.py:577 restricts hypoelasticity to the HLL solver. LF's G_L/G_R locals became unused and are removed from the declarations and the private list; HLLC's tau_e/G locals remain in use by its hyperelastic and elastic flux/wave-speed code. 4. Wave-speed corrections are untouched in all solvers; the hyper-tau_e divergence is filed upstream separately. The helper uses the shear_indices-based energy doubling (HLL's dimension-correct form, not the hard-coded i==2/4/5). The caller loads the SF-indexed stress and damage values and passes them in, keeping tau_e_L/R available for the stress-flux and wave-speed blocks; the mixture moduli G_L/G_R are intent(out) for the elastic wave speeds.

Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.

… HLLC; doc fixes Guard inventory (every tau_e/G consumer in the two files was classified by its exact guard before deletion): - LF: hypoelastic momentum flux (else if hypoelasticity), hypoelastic energy flux (else if hypoelasticity), stress evolution flux (if hypoelasticity), cylindrical tau_sigmasigma source (if cyl_coord .and. hypoelasticity) - all deleted. - HLLC: stress evolution flux (if hypoelasticity), both 6-eqn and 5-eqn copies - deleted. The hyperelasticity-gated tau_e/G/xi_field loads and the elasticity-gated wave speeds and elastic momentum/energy fluxes are untouched (live under hyperelasticity, which is legal with all solvers). - HLL is untouched. case_validator.py prohibits hypoelasticity with any solver but HLL, so these blocks were unreachable; after PR MFlowCode#1572 deleted the hypoelastic energy blocks they consumed tau_e_L/R that nothing in these files loads. Per the maintainer ruling: prefer deleting code - lifting the restriction later means adding the code back deliberately. Unused locals removed from LF: tau_e_L/R and flux_tau_L/R (zero refs after deletion) and xi_field_L/R (orphaned since the solver split, never referenced in LF) - declarations and GPU private-list entries. HLLC declarations are unchanged (tau_e/G remain live on the hyper/elasticity paths). GPU census: LF -7 acc seq-loop directives (2 deleted seq loops x 3 sweep copies + 1 cyl-only), HLLC -6 (1 seq loop x 3 copies x 2 blocks), OMP path unchanged (seq loops emit nothing under OMP); all deltas trace to deleted blocks. Doc fixes: rescope the cray_inline guidance to cross-module helpers only, describe the AMD case-opt guard as sitting on the helper's dummy declaration with matching caller-local guards, and note in s_compute_hypoelastic_interface_energy that the verysmall gate is the deliberate ruling that replaced HLL's former G > 1000 stability floor and retired its TODO.

sbryngelson · 2026-06-12T04:11:45Z

Code review

Found 2 issues, both fixed in fc7f770:

The hypoelastic energy-block deletions left LF and HLLC with hypoelasticity-gated flux/source blocks consuming tau_e_L/R that nothing loads (flagged independently by three review passes). Unreachable through the toolchain (case_validator.py:577 restricts hypoelasticity to HLL), but a latent-UB trap if that restriction is ever lifted. Resolved by completing the dead-copy deletion under the same maintainer ruling: every hypoelasticity-gated block in LF/HLLC removed after a guard-by-guard inventory; all elasticity/hyperelasticity-gated code untouched (with HLLC, elasticity implies hyperelasticity, so the kept consumers are always fed by the kept loads). −60 LOC of dead code, plus orphaned locals.

https://github.com/MFlowCode/MFC/blob/a577f15e09134f4f12bd49646bee9ed8fcab84a0/src/simulation/m_riemann_solver_lf.fpp#L443-L450

The new GPU_ROUTINE documentation overstated cray_inline=True as required "always" while two same-module helpers in the same file correctly omit it, and misplaced the AMD dimensioning guard as "caller-side". Both passages corrected (cray_inline scoped to cross-module helpers; the guard sits on the helper's dummy declaration), and the hypoelastic helper's doc comment now records the deliberate verysmall-gate ruling.

https://github.com/MFlowCode/MFC/blob/a577f15e09134f4f12bd49646bee9ed8fcab84a0/docs/documentation/gpuParallelization.md#L636-L640

Verified sound across five independent review passes: helper algebra bitwise-faithful at all 30 call sites (sampled), no upstream hotfixes lost in any replaced block (history audit vs #1545/#1556 merges), Herschel-Bulkley viscous arguments preserved, cmake a_target fixes correct, all five prior Copilot points confirmed addressed, merge-base current. Known and deliberately preserved pending physics rulings: #1570 (damage-R offset), #1557/#1571 — declared in the PR description.

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

sbryngelson · 2026-06-12T06:08:08Z

Device validation complete on the current head's content (860ecb3b1): full 594-test suite run on GPU nodes under both backends — OpenACC and OpenMP offload — ALL TESTS PASSED on each, alongside the 4-way CPU golden gate. Every change in this PR now has empirical GPU evidence in addition to the emission-level proofs.

…#1588) through the s_compute_interface_reynolds extraction Phase-3 extracts the viscous Reynolds computation into a cray-inlined GPU_ROUTINE that read the declare-target Re_size static directly, which silently reintroduces the AMD flang stale-cross-TU read fixed in MFlowCode#1588. Thread MFlowCode#1588's host-captured Re_size (Re_size_loc1/2 + firstprivate) through the helper and its 10 call sites, and apply the merge() pattern to hllc's inline block.

…ll on Cray GPU-OMP)

Copilot AI review requested due to automatic review settings June 11, 2026 21:53

Copilot started reviewing on behalf of sbryngelson June 11, 2026 21:54 View session

Copilot AI reviewed Jun 11, 2026

View reviewed changes

sbryngelson force-pushed the phase3-riemann branch from b43e445 to ecb2baa Compare June 12, 2026 02:55

sbryngelson force-pushed the phase3-riemann branch from ecb2baa to d3bb0bd Compare June 12, 2026 03:20

sbryngelson added 5 commits June 11, 2026 23:35

bench: add per-solver Riemann benchmark variants

839fead

docs: document the GPU_ROUTINE helper pattern; record Phase-3 completion

f1ffcbe

sbryngelson force-pushed the phase3-riemann branch from d3bb0bd to a577f15 Compare June 12, 2026 03:35

sbryngelson mentioned this pull request Jun 12, 2026

Fix three broadcast defects found by post-merge audit; harden the generator #1581

Merged

sbryngelson added 2 commits June 15, 2026 09:30

Merge branch 'master' into phase3-riemann

159bda2

sbryngelson added 2 commits June 16, 2026 09:00

HLL: add s_M/s_P/xi_M/xi_P to private list (fix CCE-19 defaultmap spi…

a6d9d9d

…ll on Cray GPU-OMP)

Merge branch 'master' into phase3-riemann

f3c96cc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase-3: Riemann hot-path decomposition into shared GPU device helpers#1572

Phase-3: Riemann hot-path decomposition into shared GPU device helpers#1572
sbryngelson wants to merge 10 commits into
MFlowCode:masterfrom
sbryngelson:phase3-riemann

sbryngelson commented Jun 11, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

sbryngelson commented Jun 11, 2026

Uh oh!

codecov Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

sbryngelson commented Jun 12, 2026

Uh oh!

sbryngelson commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

Conversation

sbryngelson commented Jun 11, 2026

Summary

Performance gate (<5% ceiling, maintainer-set)

Behavior notes

Docs

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

sbryngelson commented Jun 11, 2026

Uh oh!

codecov Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sbryngelson commented Jun 12, 2026

Code review

Uh oh!

sbryngelson commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jun 12, 2026 •

edited

Loading