
Experimental performance family (default-off): load-balance infrastructure, active-box windowing, block-structured AMR, hybrid WENO/Riemann sensors #1628

Draft
sbryngelson wants to merge 111 commits into MFlowCode:master from sbryngelson:up/mega

sbryngelson (Member) commented Jul 3, 2026

Summary

This PR adds an opt-in, default-off family of performance features and the measurement
infrastructure they rest on. With all flags at their defaults the only touched production
path is s_mpi_decompose_computational_domain, refactored to compute its equal split
through the new m_box module (byte-identical; covered by the existing suite).

Load-balance infrastructure (common + sim):

  • m_box: t_box + partition arithmetic; shared by the decomposer, AMR, and the
    weighted splitter.
  • m_load_weight + load_weight_wrt: per-cell load-weight field (active-box, EL-bubble,
    IB, phase-change Newton-iteration contributors) with field output and a per-rank
    imbalance metric.
  • m_sfc_partition + sfc_partition_wrt: Morton-SFC tile ordering and chains-on-chains
    balanced partition, reported as a predicted-imbalance diagnostic.
  • m_load_balance + load_balance: experimental weighted static Cartesian decomposition
    at init (requires parallel_io), with a min-cells feasibility floor and, when AMR is
    on, fine-work-aware weighting with a deterministic feasibility clamp.
  • m_rank_timing + rank_time_wrt: per-rank compute-time diagnostic (halo exchange
    excluded; device-synced on GPU).
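As a rough illustration of the SFC diagnostic and the imbalance metric listed above, the sketch below orders tiles by a 2D Morton key, splits the ordered per-tile weights into contiguous chunks with a standard bisection-on-bottleneck chains-on-chains solver, and reports max/mean load. The function names, the integer weights, and the max/mean definition of imbalance are assumptions for illustration; this is not the m_sfc_partition or m_load_weight code.

```python
def morton2d(i, j, bits=16):
    """Interleave the bits of (i, j) into one Morton (Z-order) key."""
    key = 0
    for b in range(bits):
        key |= ((i >> b) & 1) << (2 * b)
        key |= ((j >> b) & 1) << (2 * b + 1)
    return key


def _parts_needed(weights, cap):
    """Greedy probe: how many contiguous chunks are needed if none may exceed cap."""
    parts, load = 1, 0
    for w in weights:
        if load + w > cap:
            parts, load = parts + 1, w
        else:
            load += w
    return parts


def chains_on_chains(weights, nparts):
    """Minimize the heaviest contiguous chunk (non-negative integer weights assumed)."""
    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if _parts_needed(weights, mid) <= nparts:
            hi = mid
        else:
            lo = mid + 1
    # lo is now the optimal bottleneck; recover the chunk loads greedily.
    loads, load = [], 0
    for w in weights:
        if load + w > lo:
            loads.append(load)
            load = w
        else:
            load += w
    loads.append(load)
    return lo, loads


def imbalance(loads):
    """Assumed metric: heaviest rank divided by the mean rank load."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical 4x2 tile grid with one expensive tile, split across 3 ranks.
tiles = [(i, j) for j in range(2) for i in range(4)]
weight = {t: 1 for t in tiles}
weight[(2, 1)] = 8
ordered = sorted(tiles, key=lambda t: morton2d(*t))
bottleneck, loads = chains_on_chains([weight[t] for t in ordered], 3)
print(bottleneck, loads, imbalance(loads))
```

The greedy recovery can return fewer chunks than ranks; a real diagnostic would count the idle ranks as zero load when computing the mean.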

Active-box windowing (sim):

  • m_active_box + active_box: restricts reconstruction/Riemann/RK windows to a
    light-cone-grown box around non-ambient flow; a debug tripwire guards under-growth.
    Golden-tested (ECABA006) to stay a strict subset while matching the full-domain
    solution.
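The light-cone growth can be pictured in 1D as follows. This is a minimal sketch under stated assumptions: the non-ambient test is a simple tolerance on one scalar, the growth is the distance the fastest wave can travel before the next box update plus a stencil radius, and every name and parameter here is hypothetical rather than taken from m_active_box.

```python
import math


def active_box_1d(q, q_ambient, tol, max_speed, dt, dx, steps, stencil_radius):
    """Return (lo, hi) cell indices of the grown active window, or None if all ambient."""
    active = [i for i, v in enumerate(q) if abs(v - q_ambient) > tol]
    if not active:
        return None
    # Cells a signal can cross in `steps` time steps, plus the reconstruction stencil.
    grow = math.ceil(max_speed * dt * steps / dx) + stencil_radius
    lo = max(0, active[0] - grow)
    hi = min(len(q) - 1, active[-1] + grow)
    return lo, hi


# Hypothetical field: ambient everywhere except two disturbed cells.
q = [1.0] * 20
q[9] = q[10] = 2.0
print(active_box_1d(q, 1.0, 1e-12, max_speed=2.0, dt=0.25, dx=0.5,
                    steps=3, stencil_radius=3))
```

Under-growing the window would let a wave leave the box between updates, which is the failure the debug tripwire is described as guarding against.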

Block-structured AMR (sim):

  • m_amr + m_amr_registers + amr: a two-level 2:1 refined hierarchy with
    conservative restriction / conservative-linear prolongation, per-stage flux registers
    with Berger–Colella refluxing, gradient-based dynamic regrid (amr_regrid_int,
    amr_tag_eps, amr_buf), optional dt/2 subcycling (amr_subcycle), multi-rank
    operation with a mirror-decomposed fine level (patches may span rank boundaries; fine
    halo exchange; distributed flux registers; rank-local regrid), and GPU builds
    (device-resident fine fields and registers, on-device ghost fill/RK/restriction).
    Requires WENO, SSP-RK3, model_eqns=2, single fluid (checker-enforced).
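For readers unfamiliar with the transfer operators, the following 1D sketch shows one common form of conservative restriction and conservative-linear prolongation at a 2:1 ratio on a uniform grid. The minmod limiter and the zero slope at the array ends are assumptions made for the example; the PR does not state which limiter or boundary treatment m_amr uses.

```python
def minmod(a, b):
    """Smaller-magnitude argument if both share a sign, otherwise zero."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b


def restrict(fine):
    """Conservative restriction: each coarse cell is the mean of its two children."""
    return [0.5 * (fine[2 * i] + fine[2 * i + 1]) for i in range(len(fine) // 2)]


def prolong(coarse):
    """Conservative-linear prolongation: children sit at -/+ 1/4 of the coarse width."""
    n = len(coarse)
    fine = []
    for i, c in enumerate(coarse):
        if 0 < i < n - 1:
            slope = minmod(c - coarse[i - 1], coarse[i + 1] - c)
        else:
            slope = 0.0  # assumed: no slope where a neighbor is missing
        fine += [c - 0.25 * slope, c + 0.25 * slope]
    return fine


coarse = [1.0, 2.0, 4.0, 4.0]
fine = prolong(coarse)
print(fine, restrict(fine))
```

Because the two child values are symmetric about the coarse value, restricting a prolonged field returns the coarse field, which is the property that keeps the two levels consistent.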

Hybrid reconstruction/flux sensors (sim):

  • hybrid_weno (+hybrid_weno_eps): linear-optimal reconstruction in smooth cells, full
    WENO only at flagged discontinuities (Jameson-type density+pressure sensor,
    stencil-dilated, halo-aware).
  • hybrid_riemann (+hybrid_smooth_flux): cheap central/Rusanov flux in smooth cells,
    full HLLC at discontinuities (5- and 6-equation blocks).
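A Jameson-type sensor of the kind described above can be sketched in 1D as a normalized second difference of density and pressure, thresholded and then dilated by the stencil radius. The exact normalization, the use of the maximum of the two sensors, and the treatment of the end cells (left unflagged here, where a halo-aware version would use neighbor data) are assumptions for illustration; `eps` plays the role of a threshold like hybrid_weno_eps.

```python
def jameson(f, i):
    """Normalized second difference of f at cell i (0 when smooth or uniform)."""
    num = abs(f[i + 1] - 2.0 * f[i] + f[i - 1])
    den = abs(f[i + 1]) + 2.0 * abs(f[i]) + abs(f[i - 1])
    return num / den if den > 0.0 else 0.0


def flag_cells(rho, p, eps, dilate):
    """True where full WENO/HLLC would be used; False where the cheap path suffices."""
    n = len(rho)
    raw = [False] * n
    for i in range(1, n - 1):
        raw[i] = max(jameson(rho, i), jameson(p, i)) > eps
    # Dilate so every stencil that touches a flagged cell is also flagged.
    flags = [False] * n
    for i, hit in enumerate(raw):
        if hit:
            for k in range(max(0, i - dilate), min(n, i + dilate + 1)):
                flags[k] = True
    return flags


step = [1.0, 1.0, 1.0, 1.0, 10.0, 10.0, 10.0, 10.0]
print(flag_cells(step, step, eps=0.1, dilate=1))
```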

Motivation

Measured rank imbalance on heterogeneous-cost workloads (bubbles, IB, phase change)
motivates first-class measurement tools; the active box and hybrid sensors give direct
speedups on localized-flow / mostly-smooth cases; AMR concentrates resolution where the
flow needs it, and the load-balance coupling keeps the refined work spread across ranks.

Testing

  • Four new golden tests: 5ECBB926 (AMR static patch), 1CBACEB5 (AMR dynamic regrid),
    852CCB81 (AMR subcycling), ECABA006 (active_box 3D strict-subset).
  • Locally verified on gfortran/CPU: three-target build; full precheck; the 4 new goldens
    + 4 existing goldens (incl. both periodic-IBM cases, exercising the merge of #1618,
    "Fix periodic ib issues") all pass; load_balance+amr np=2 end-to-end smoke produces
    the analytically predicted weighted offsets and completes; amr np=2 spanning-patch
    run completes.
  • GPU: nvfortran OpenACC build of the simulation target (the GPU-accelerated
    executable) verified locally (see PR checks for the full matrix).
  • All parameters default off; case_validator entries, case.md docs, and
    module_categories are included.

Known-untested configurations

Delegated to CI: Cray ftn, Intel ifx, AMD flang, OpenMP target offload, single/mixed
precision. Hybrid WENO/Riemann ship without a dedicated golden case (flagging for
reviewer judgment; the sensors are default-off and checker-guarded).

Review guide

The 75 commits are arc-ordered and cleanly arc-separable — reviewing by arc is much
easier than by file:

  1. 2760da7d…2bb5fdc4 active-box (11)
  2. bbf6b2a9…14b837c6 load-weight field + contributors (8)
  3. 0161fac0…2795e266 SFC partition diagnostic (6)
  4. 6df9c1f0…c43c02a5 weighted decomposition (load_balance) (8)
  5. 95398eb3…cc7882d1 rank timing (4)
  6. 21c60ffa…5082b535 hybrid WENO/Riemann (10)
  7. 74b58771…de244407 m_box refactor + validation hygiene (4)
  8. 352f564e…03b59516 AMR: static hierarchy → restriction/prolongation → fine advance →
    refluxing → regrid → subcycling → multi-rank → GPU → mirror decomposition →
    load-balance coupling (20)
  9. a1a7e3ad merge of upstream/master (num_procs_x/y/z promotion adopted from #1618, "Fix periodic ib issues")

Addendum: features added after the initial draft

  • Multi-fluid AMR (5-equation multi-component): per-fluid conservative reflux (per-fluid mass defects ~1e-15 through refluxed+subcycled+regridding advance), sum-preserving volume-fraction prolongation (n−1 fractions + closure), mpp_lim required for num_fluids > 1; shock–material-interface demo validated. Known bounded limitation: alpha-sum deviation up to ~5.7e-3 at coarse cells historically hosting a patch face during shock crossing (non-growing, mpp_lim-damped; the volume-fraction K-term is deliberately not refluxed — it is non-conservative).
  • 3D validation: free-stream exact (0.0) in 3D with subcycling+regrid armed; 3D blast regrid-tracked at ~1e-14 defects; np=2 seam-spanning element-exact; code audit found no y/z asymmetry.
  • Golden coverage is now 5 cases: 1D static / 1D regrid / 1D subcycle / 3D static / 1D two-fluid.
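One reading of "n−1 fractions + closure" is sketched below: the first n−1 volume fractions are prolonged independently, and the last is set per child cell so that the fractions sum to one. The data layout and function name are hypothetical, and any clipping or mpp_lim interaction is deliberately left out.

```python
def close_fractions(alpha_children):
    """alpha_children[k][c] is the prolonged fraction of fluid k (k < n-1) in child c.

    Returns all n fluids, with the last one computed as 1 - sum(others) per child.
    """
    nchild = len(alpha_children[0])
    last = [1.0 - sum(a[c] for a in alpha_children) for c in range(nchild)]
    return alpha_children + [last]


# Three fluids, two child cells: only the first two fractions are prolonged.
alphas = close_fractions([[0.25, 0.5], [0.5, 0.25]])
print(alphas)
print([sum(a[c] for a in alphas) for c in range(2)])
```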

Further additions

  • AMR restart (SP10): fine-level save/restore with regridded-box persistence, both IO modes (serial per-rank + parallel MPI-IO); restart continuation element-exact (save-then-restart == straight run), same-num_procs required (np-flexible restart is future work).
  • Viscous AMR (SP11): viscous prohibition lifted; viscous stress/work refluxed through the existing registers (enters rhs as a flux_src_n face-flux difference, same form as advective flux) so coarse/fine boundaries match total flux; energy conservation 0.0, accuracy triplet coarse 2.49e-4 ≫ two-level 6.89e-5 ≈ fine 5.04e-5. A fine-ghost-coordinate bug (viscous gradient using stale coarse dx at the fine subdomain/patch edge — invisible to WENO, which uses only interior dx) was found by an np=2 exactness probe and fixed; the fine viscous seam is now byte-exact across ranks. Residual: a bounded (~1e-6) np-dependence remains only at the coarse/fine patch boundary from prolongation-derived ghost gradients (AMR's inherently-approximate coupling zone); the density-gradient tagger senses shear poorly (buffered/static patch recommended; error-estimator taggers are future work).
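The reason a viscous term in face-flux form can reuse the registers is that a flux register only sees the total flux through a coarse/fine face. The 1D sketch below is a simplified, hypothetical register (names and weights are assumptions, not m_amr_registers): it accumulates the fine-minus-coarse flux mismatch, with each fine substep weighted by its share of the coarse step, and then corrects the uncovered coarse cells next to the block.

```python
class FluxRegister:
    """Accumulates (fine - coarse) total flux at one coarse/fine face."""

    def __init__(self):
        self.defect = 0.0

    def add_coarse(self, advective, viscous=0.0):
        self.defect -= advective + viscous

    def add_fine(self, advective, viscous=0.0, weight=1.0):
        # weight: fraction of the coarse step this fine substep covers (0.5 for dt/2).
        self.defect += weight * (advective + viscous)


def reflux(u, dt, dx, i_left, i_right, reg_left, reg_right):
    """Correct the coarse cells just left and right of a fine block."""
    u = list(u)
    u[i_left] -= dt / dx * reg_left.defect    # block's left face is this cell's right face
    u[i_right] += dt / dx * reg_right.defect  # block's right face is this cell's left face
    return u


left, right = FluxRegister(), FluxRegister()
left.add_coarse(0.75, viscous=0.25)
left.add_fine(1.0, viscous=0.5, weight=0.5)
left.add_fine(1.5, viscous=0.5, weight=0.5)
right.add_coarse(2.0)
right.add_fine(1.0)
print(reflux([1.0, 1.0, 1.0, 1.0], dt=0.25, dx=0.5,
             i_left=0, i_right=3, reg_left=left, reg_right=right))
```

After the correction, the coarse neighbors have effectively been advanced with the fine-level flux, so the mass leaving the block equals the mass entering its neighbors.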

Multi-block AMR + terminology

  • Multi-block AMR (SP12a): tagged cells are clustered (Berger–Rigoutsos + a min-separation merge) into a LIST of separated refinement blocks, so multiple separated features (e.g. a shock and a contact that have separated) each get their own tight block instead of one bounding box wastefully spanning both plus the smooth gap between — measured 66% fewer cells refined on a two-interface Sod, conservation and np=1==np=2 element-exactness preserved. amr_max_blocks (default 4; N fixed-size slots, ~N× device memory — compute efficiency is the goal, memory efficiency a follow-up), amr_cluster_eff (default 0.7). Fine blocks stay ≥ buff_size apart ⇒ no fine–fine coupling; all existing per-block machinery (multi-rank, GPU, subcycle, viscous, multi-fluid) loops over the block list unchanged.
  • Terminology: AMR's refinement regions are now called blocks (amr_block_beg/end, amr_max_blocks) — disambiguated from MFC's initial-condition patch_icpp. (Draft-stage rename; golden values unchanged.)
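The clustering idea can be illustrated in 1D. The sketch below is a deliberate simplification of Berger–Rigoutsos: it accepts a bounding box once its fill efficiency reaches a threshold (playing the role of amr_cluster_eff) and otherwise splits at the widest untagged gap, where the full algorithm uses tag signatures and inflection points. The merge step and the `min_sep` parameter are hypothetical stand-ins for the min-separation merge; the amr_max_blocks cap is not modeled.

```python
def _split(idx, min_eff):
    """Recursively split a sorted list of tagged indices into efficient boxes."""
    lo, hi = idx[0], idx[-1]
    eff = len(idx) / (hi - lo + 1)
    if eff >= min_eff or len(idx) == 1:
        return [(lo, hi)]
    # Cut at the widest run of untagged cells (simplified splitting rule).
    _, k = max((idx[j + 1] - idx[j], j) for j in range(len(idx) - 1))
    return _split(idx[:k + 1], min_eff) + _split(idx[k + 1:], min_eff)


def merge_close(blocks, min_sep):
    """Merge neighboring blocks separated by fewer than min_sep untagged cells."""
    merged = [blocks[0]]
    for lo, hi in blocks[1:]:
        plo, phi = merged[-1]
        if lo - phi - 1 < min_sep:
            merged[-1] = (plo, hi)
        else:
            merged.append((lo, hi))
    return merged


def cluster_1d(tags, min_eff=0.7, min_sep=4):
    idx = [i for i, t in enumerate(tags) if t]
    if not idx:
        return []
    return merge_close(_split(idx, min_eff), min_sep)


# Two separated features (for example a shock and a contact) on a 32-cell grid.
tags = [i in (2, 3, 4, 20, 21, 22) for i in range(32)]
print(cluster_1d(tags))
```

A single bounding box over both features would cover 21 cells to refine 6 tagged ones; splitting yields two 3-cell blocks instead.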

Euler-Euler bubbles under AMR (SP13)

  • Bubbly-flow AMR (monodisperse, polytropic — the simplest Euler-Euler config): bubble moments (in q_cons) are refluxed by the existing register machinery; prolongation is realizability-preserving (radius moment nR > 0 maintained across coarse→fine, analogous to the multi-fluid volume-fraction closure). Validated: conservation defects ~1e-15 through refluxed+subcycled+regridding advance, moments stay realizable, AMR beats the coarse solution, np=1==np=2 element-exact. Non-polytropic / QBMM / polydisperse / Lagrangian bubbles remain explicitly gated (future work — non-polytropic additionally needs per-block pb/mv handling).

Phase-change (relax) under AMR (SP15)

  • Phase-change / pressure relaxation now works under AMR: the per-cell equilibration (relax) runs on each fine block before restriction (a new s_amr_relax_fine), so the refined solution is properly relaxed. Cell-local — no reflux, no c/f coupling. Machine-precision conservation, free-stream preserved, np=1==np=2 bit-exact. Config: model_eqns=2, relax=T, num_fluids>1, mpp_lim=T.

Validation hardening (blind spots closed)

  • GPU correctness of all AMR physics: the AMR goldens (which exercise two-fluid, viscous, bubbles, multi-block, phase-change) were executed on NVIDIA GPU (V100, OpenACC/nvhpc) and match the CPU results within tolerance — 9/9. Every physics rung is GPU-correct, not merely GPU-compilable.
  • Cross-feature interaction coverage: added combined-feature goldens — viscous+multifluid+multiblock+subcycle and bubbles+multiblock+subcycle+regrid — both conserve to ~1e-16, with the viscous+multifluid+multiblock case verified np=1==np=2 element-exact (reflux + fine halo + multi-block + physics all correct together under MPI).

Chemistry under AMR (SP16) + surface-tension limitation

  • Chemistry (reactions + advection) under AMR, multi-rank: species mass fractions get a sum/positivity-preserving prolongation closure (like the multi-fluid volume-fraction closure); reactions run per-cell on the fine blocks; a chemistry+AMR temperature-ghost MPI-exchange bug (uninitialized seam-ghost temperature → NaN in the reacting-EOS Newton solve at rank boundaries) was found and fixed, so np=1==np=2 is element-exact (species bit-identical). Machine-precision conservation, realizability exact. Added the suite's first multi-rank AMR golden.
  • Surface tension is explicitly unsupported under AMR (documented prohibit): the capillary force depends on the interface normal, which the prolonged fine ghost color cannot reproduce consistently across a 2:1 coarse/fine boundary; three fixes were attempted and diagnosed (conservation is structural, but the seam force imbalance is not tamable by fine-block-only corrections). See the AMR docs section.

Further physics rungs (SP17–SP20)

  • Chemistry species diffusion under AMR (SP17): species diffusion now works under AMR — its flux_src is captured by the coarse/fine flux registers (mirroring the viscous path) and refluxed, and the temperature ghost is exchanged at rank seams (the same broadening the reactions fix required). Removed the diffusion prohibit; np=1==np=2 element-exact, machine-precision conservation.
  • Non-polytropic + polydisperse bubbles under AMR (SP18): Euler-Euler bubble support (SP13, previously monodisperse+polytropic only) now covers non-polytropic and polydisperse (nb ≥ 1) configs, with a per-block moment-realizability floor applied to all positive moments on prolongation. Conservation is machine-precision for polytropic; the non-polytropic source-term model carries a ~7e-10 defect that is np-invariant (identical np=1/np=2 — a model property, not an AMR decomposition leak). Removed the polytropic/monodisperse gates.
  • QBMM bubbles under AMR (SP19): polytropic QBMM is supported — the CHyQMOM 6-moment set lives entirely in q_cons and is injected piecewise-constant at prolongation so every fine/ghost child inherits a realizable moment set (variance c20 > 0), keeping the inversion NaN-free; moments reflux/restrict on the standard conservative path. np=2 element-exact, conservation ~1e-15. Non-polytropic QBMM stays gated (its pb/mv quadrature side-state is a global array the fine advance would corrupt through the swap).
  • Static immersed boundaries under AMR (SP20): a fixed, single, non-STL body on a static block (amr_regrid_int = 0) is now resolved on the refined level — each fine block carries its own fine-grid IB markers/ghost points computed from the geometry, and the fine advance applies the ghost-cell IB correction per RK stage. The IB forcing is non-conservative by construction at the body (ghost-cell method), while the flux reflux still conserves to machine precision away from it. Moving/multi-body/STL/dynamic-regrid-with-IB remain gated. A body straddling a rank seam is rejected at startup (the fine-IB image-point stencil across the seam is not yet decomposition-exact) rather than silently producing a small surface error — keep the body within a single rank's subdomain.
