A field report
Building a per-kernel GROMACS benchmark harness for the Radeon RX 9070
Notes from one weekend of plumbing — ROCm 7.12, a custom LLVM, a homemade Discord agent, and an attempt to turn molecular dynamics into a tight compiler-iteration loop on a brand-new GPU.
I'm a GPU compiler developer at AMD. I work on the AMDGPU backend in LLVM — the part that turns HIP and OpenCL source into the ISA that runs on Radeon and Instinct silicon. When my workstation showed up with a brand-new Radeon RX 9070 (RDNA 4, gfx1201), the first thing I wanted was a tight feedback loop: edit a backend pass, rebuild LLVM, rebuild a real GPU application, run a benchmark, see whether the kernels got faster. Six minutes between hitting save and seeing a number.
This is a writeup of the harness I ended up with. It runs GROMACS molecular dynamics workloads under rocprofv3, parses the per-kernel timings, diffs them against a stable baseline, and pings me on Discord with the verdict. The published Kutzner benchmark suite from MPI-NAT (the canonical GROMACS performance inputs) is the fixed point.
The stack: ROCm 7.12 on RDNA 4
The reference platform:
| Component | Choice |
|---|---|
| CPU | AMD Ryzen 7 9800X3D (Zen 5, 8 cores, 96 MB 3D V-Cache) |
| GPU | AMD Radeon RX 9070 (RDNA 4, gfx1201, 16 GB VRAM) |
| Memory | 32 GB DDR5 |
| OS | Ubuntu 24.04.4 LTS, kernel 6.17 (HWE) |
| ROCm | 7.12.0, installed under /opt/rocm/core-7.12/ |
| GROMACS source | GitLab branch 4947-hip-feature-enablement |
| LLVM | github.com/gandhi56/llvm-project at commit d7fead0f (clang-23) |
# Pinned ROCm environment
sudo tee /etc/profile.d/rocm.sh > /dev/null <<'EOF'
export ROCM_PATH=/opt/rocm/core-7.12
export HIP_PATH=/opt/rocm/core-7.12
export PATH=$ROCM_PATH/bin:$PATH
export LD_LIBRARY_PATH=$ROCM_PATH/lib:${LD_LIBRARY_PATH:-}
EOF
Two ROCm-specific tools deserve a mention: rocprofv3 for kernel-level tracing, and rocprof-compute (the renamed Omniperf) for counter-level analysis. Both ship with ROCm 7.x; both work on gfx1201, with the caveat that rocprof-compute's roofline reference isn't calibrated for RDNA 4 yet (kernel-level analysis works fine, just the roofline overlay is approximate).
Two GROMACS builds, one custom LLVM
GROMACS has two HIP-enabled paths right now:
- The 2026 mainline, which has HIP merged for the NBNxM kernels only — stable, conservative.
- The `4947-hip-feature-enablement` branch on GitLab, AMD-maintained, tracking 2025.4 rolling into 2026, with the full HIP optimizations including PME spread/solve/gather and the bonded kernels. This is where AMD's GROMACS performance work lives, and where the kernels you'd want to tune for RDNA 4 actually exist.
My setup uses two parallel builds installed side by side:
| Build | Compiler | Purpose |
|---|---|---|
| `/opt/gromacs/stable-rocm7.12` | Stock hipcc + amdclang++ | Reference baseline. Untouched between LLVM iterations. |
| `/opt/gromacs/dev-llvm-<hash>-<timestamp>` | My LLVM directly (clang++) | What I iterate on. A new install prefix per build, never overwriting the last. |
| `/opt/gromacs/dev-current` | (symlink) | Points at the most recent dev install. Scripts dereference this. |
The stable build is straightforward — AMD's standard recipe with the architecture overridden to gfx1201:
cd ~/build/gromacs-stable
cmake ~/src/gromacs-amd-hip \
-G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DGMX_GPU=HIP \
-DCMAKE_HIP_COMPILER=/opt/rocm/core-7.12/bin/amdclang++ \
-DCMAKE_HIP_COMPILER_ROCM_ROOT=/opt/rocm/core-7.12 \
-DGMX_HIP_TARGET_ARCH=gfx1201 \
-DCMAKE_PREFIX_PATH=/opt/rocm/core-7.12 \
-DGMX_MPI=OFF \
-DGMX_OPENMP=ON \
-DGMX_BUILD_OWN_FFTW=OFF \
-DCMAKE_INSTALL_PREFIX=/opt/gromacs/stable-rocm7.12 \
-DGMX_DOUBLE=OFF \
-DGMX_SIMD=AVX2_256 \
-DGMX_USE_RDTSCP=ON
ninja -j$(nproc) install
The dev build is the same recipe with three changes:
LLVM_HASH=$(git -C /home/anshil/llvm-project rev-parse --short HEAD)
PREFIX=/opt/gromacs/dev-llvm-${LLVM_HASH}-$(date +%Y%m%d-%H%M)
HIP_FLAGS="--rocm-path=/opt/rocm/core-7.12 --offload-arch=gfx1201 \
-fPIC -fno-gpu-rdc -ffast-math -munsafe-fp-atomics \
-fdenormal-fp-math=ieee -fcuda-flush-denormals-to-zero \
-fno-slp-vectorize -Wno-unused-command-line-argument -Wno-pass-failed"
cmake ~/src/gromacs-amd-hip \
-G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DGMX_GPU=HIP \
-DCMAKE_HIP_COMPILER=/home/anshil/llvm-project/build/bin/clang++ \
-DCMAKE_HIP_COMPILER_ROCM_ROOT=/opt/rocm/core-7.12 \
-DCMAKE_HIP_FLAGS="$HIP_FLAGS" \
-DGMX_HIP_TARGET_ARCH=gfx1201 \
-DCMAKE_HIP_ARCHITECTURES=gfx1201 \
-DGPU_TARGETS=gfx1201 \
-DCMAKE_PREFIX_PATH=/opt/rocm/core-7.12 \
-DGMX_MPI=OFF -DGMX_OPENMP=ON -DGMX_BUILD_OWN_FFTW=OFF \
-DCMAKE_INSTALL_PREFIX=$PREFIX \
-DGMX_DOUBLE=OFF -DGMX_SIMD=AVX2_256 -DGMX_USE_RDTSCP=ON \
-DHIPCC_HAS_TARGET_ARCH_gfx1201=TRUE
ninja -j$(nproc) install
ln -sfn $PREFIX /opt/gromacs/dev-current
The three changes from stable:

- `CMAKE_HIP_COMPILER` points directly at my `clang++`. CMake 3.27+ takes the HIP language compiler directly rather than through a hipcc wrapper.
- `CMAKE_HIP_FLAGS` carries the flags hipcc would otherwise inject. Going through clang directly puts those on you.
- `HIPCC_HAS_TARGET_ARCH_gfx1201=TRUE` seeds the cache variable GROMACS's probe would otherwise populate, skipping the probe.
The install prefix is timestamped with the LLVM commit hash. Every dev build gets its own directory; old builds are never overwritten. Months from now I can dig back to dev-llvm-d7fead0f-20260516-0243 and re-run any benchmark against it.
Benchmark inputs: Kutzner benchMEM and benchRIB
For compiler work, the inputs matter as much as the harness. I picked from Carsten Kutzner's free GROMACS benchmark set¹ — the canonical performance inputs that published GROMACS papers compare against. Using these means my numbers are directly comparable to anything other GROMACS people publish.
| System | Atoms | Description | Why I picked it |
|---|---|---|---|
| `benchMEM` | 82,000 | Membrane protein in water | Small enough for ~3-minute iteration runs. Mixed GPU/CPU workload because membrane lipid constraints prevent full GPU residency. |
| `benchRIB` | 2,000,000 | Ribosome in water | Large enough to saturate the GPU and stress memory traffic. Used for weekly validation runs, not daily iteration. |
The other Kutzner systems either don't fit (benchPEP at 12 M atoms exceeds the 9070's 16 GB VRAM) or are too small to be meaningful (the binding affinity studies at 6-36k atoms). benchMEM is the workhorse; benchRIB is for "before I commit this patch upstream" validation.
cd /var/lib/gromacs-runs/_systems
# benchMEM (82k atoms, membrane protein)
mkdir -p benchMEM && cd benchMEM
wget -O benchMEM.zip https://www.mpinat.mpg.de/benchMEM
unzip benchMEM.zip
cp benchMEM.tpr topol.tpr # gmxrun expects topol.tpr
# benchRIB (2M atoms, ribosome)
cd ..
mkdir -p benchRIB && cd benchRIB
wget -O benchRIB.zip https://www.mpinat.mpg.de/benchRIB
unzip benchRIB.zip
cp benchRIB.tpr topol.tpr
The TPR files are pre-built run inputs. No grompp needed — straight to mdrun. The license is CC-BY 4.0; if you use them in published work, the attribution goes to Dept. of Theoretical and Computational Biophysics, Max Planck Institute for Multidisciplinary Sciences, Göttingen, https://www.mpinat.mpg.de/grubmueller/bench.
The wrapper scripts
Three scripts in /usr/local/bin, composing top-down: gmxrun wraps a single mdrun invocation with notifications and result capture. kernel-bench wraps gmxrun with rocprofv3 trace parsing. bench-llvm-commit wraps everything in the outer iteration loop.
gmxrun
Wraps a single mdrun invocation with notifications, result capture, and optional kernel/counter profiling. The entry point for every other tool below.
gmxrun <build> <system> [tag] [-- <extra mdrun args>]
Arguments:
- `build` — a directory name under `/opt/gromacs/` (e.g. `stable-rocm7.12`, `dev-current`). The script sources `$GMX_PREFIX/bin/GMXRC` from there.
- `system` — a directory under `/var/lib/gromacs-runs/_systems/` containing a `topol.tpr` (e.g. `benchMEM`, `benchRIB`).
- `tag` — an arbitrary label that becomes part of the run directory name. Defaults to `HHMMSS`.
- Everything after `--` passes through to `mdrun` unchanged.
Two flags are intercepted before they reach mdrun:
- `--profile=trace` — wrap the run in `rocprofv3 --hip-trace --kernel-trace --stats`, producing per-kernel CSVs under `rocprof/`.
- `--profile=compute` — wrap in `rocprof-compute profile --no-roof` for counter-level analysis under `rocpc/`.
Side effects, in order:

1. Creates `/var/lib/gromacs-runs/<build>__<system>__<tag>/` and copies `topol.tpr` into it.
2. Pins the run to the discrete GPU (`HIP_VISIBLE_DEVICES=0`, `ROCR_VISIBLE_DEVICES=0`).
3. Posts a start message to `$DISCORD_WEBHOOK_URL` if it's set.
4. Runs `mdrun -v -deffnm topol -ntmpi 1 -ntomp 16 -pin on -resethway` plus your pass-through args.
5. Parses ns/day from `topol.log`.
6. Writes a structured `result.json` (build, system, tag, gmx_version, duration_seconds, ns_per_day, return_code, finished, profile_mode, timestamp).
7. Posts a completion message, or a failure message with the tail of the log on failure.
Examples:
# Plain run
gmxrun stable-rocm7.12 benchMEM baseline -- -nsteps 30000
# Same, with kernel-level tracing
gmxrun dev-current benchMEM r1 -- -nsteps 30000 --profile=trace
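The log-scrape and result-capture steps reduce to a few lines of Python. This is a minimal sketch of the idea, not the script itself; `parse_ns_per_day` and `write_result` are names I made up, and the `Performance:` line format is the one mdrun prints in its timing summary.

```python
import json
import re
import time

def parse_ns_per_day(log_text):
    """Pull ns/day out of a GROMACS log.

    mdrun's timing summary ends with a line like
    'Performance:       43.712        0.549', where the first
    column is ns/day and the second is hours/ns.
    """
    m = re.search(r"^Performance:\s+([\d.]+)", log_text, re.MULTILINE)
    return float(m.group(1)) if m else None

def write_result(run_dir, **fields):
    """Dump one structured result record into the run directory."""
    fields.setdefault("timestamp", time.strftime("%Y-%m-%dT%H:%M:%S"))
    with open(f"{run_dir}/result.json", "w") as f:
        json.dump(fields, f, indent=2)
```

Keeping the record as flat JSON is what makes the later `jq` queries over historical runs trivial.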
Three design notes on this script worth surfacing:
- Every run gets a unique directory, with the build name encoded in the path. No accidental overwrites of last week's baseline.
- The mdrun args don't force `-nb gpu` or `-pme gpu`. Letting GROMACS use its heuristic defaults means each build picks the right offload level for the system, which keeps the comparison apples-to-apples between builds. Hardcoding GPU residency flags also breaks benchMEM, which can't use `-update gpu` due to constraint topology.
- `-resethway` is on by default. This is the single most important methodology choice. GROMACS spends the first 1000-8000 steps tuning PME grid sizes — that's bursty, non-representative work that pollutes ns/day if you include it in the measurement. `-resethway` resets all timing counters at the halfway point. With `-nsteps 30000`, you tune for 15000 steps, then measure 15000 steps of steady state.
kernel-bench
A thin wrapper that runs gmxrun with --profile=trace, then post-processes the rocprofv3 CSV into a per-category kernel breakdown and diffs against a baseline.
kernel-bench <build> <system> [tag] [-- <extra mdrun args>]
Arguments are identical to gmxrun. The script does three things after the trace finishes:
1. Picks up the rocprofv3 CSV at `rocprof/<host>/*kernel_stats.csv` inside the run directory.
2. Collapses C++ template instantiations — `void pmeSplineAndSpreadKernel<4, false, true>(...)` becomes `pmeSplineAndSpreadKernel` — so different instantiations of the same kernel get aggregated into one row.
3. Categorizes each kernel into `nbnxm`, `pme_spread`, `pme_solve`, `pme_gather`, `fft`, `integrate`, `lincs`, `settle`, `memory`, or `other` by substring match on the name.
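The collapse-and-categorize steps are plain string processing. A sketch of the approach, with the caveat that the substring table below is my illustrative guess, not the script's actual mapping:

```python
import re

def collapse(name):
    """Strip template arguments and the parameter list so every
    instantiation of a kernel aggregates into one row:
    'void pmeSplineAndSpreadKernel<4, false, true>(...)'
      -> 'pmeSplineAndSpreadKernel'
    """
    name = re.sub(r"<.*", "", name)   # drop template args and everything after
    name = re.sub(r"\(.*", "", name)  # drop the parameter list
    return name.split()[-1]           # drop leading 'void' etc.

# Checked in order; first match on the lowercased name wins.
# These patterns are illustrative, not the harness's real table.
CATEGORY_PATTERNS = [
    ("nbnxm", "nbnxm"), ("spread", "pme_spread"), ("solve", "pme_solve"),
    ("gather", "pme_gather"), ("fft", "fft"), ("leapfrog", "integrate"),
    ("lincs", "lincs"), ("settle", "settle"),
    ("copybuffer", "memory"), ("fillbuffer", "memory"),
]

def categorize(kernel_name):
    low = kernel_name.lower()
    for needle, category in CATEGORY_PATTERNS:
        if needle in low:
            return category
    return "other"
```

Collapsing before categorizing matters: the category totals should count all template variants of a kernel as one logical kernel.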
It writes kernel_result.json in the run directory with totals per category (milliseconds and percent of GPU time) and per-kernel rows sorted by total time. Then it diffs against the baseline at /var/lib/gromacs-runs/stable-rocm7.12__<system>__kernel-baseline/kernel_result.json and posts a per-category breakdown to Discord:
📊 `dev-current/benchMEM/llvm-d7fead0f` kernel breakdown vs baseline
🚀 `nbnxm`: 24108.42 ms (−3.29% vs 24928.79)
• `pme_spread`: 27445.30 ms (−0.20% vs 27501.67)
…
The diff arrows fire at ±2%: 🚀 for >2% faster, 🐢 for >2% slower, • otherwise.
One special tag: if tag is exactly kernel-baseline, the diff step is skipped — that run is the new baseline.
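The threshold logic is small enough to state exactly. A sketch of the ±2% rule, with function names of my own invention (faster means less GPU time, so the percentage is negative):

```python
def delta_pct(new_ms, base_ms):
    """Signed percent change in per-category GPU time vs baseline."""
    return (new_ms - base_ms) / base_ms * 100.0

def arrow(pct):
    """±2% thresholds: rocket if meaningfully faster, turtle if slower."""
    if pct < -2.0:
        return "🚀"
    if pct > 2.0:
        return "🐢"
    return "•"
```

Running it on the nbnxm line from the example message above, `delta_pct(24108.42, 24928.79)` comes out around −3.29, which clears the threshold and earns the 🚀.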
Examples:
# Establish (or refresh) the kernel-level baseline
kernel-bench stable-rocm7.12 benchMEM kernel-baseline -- -nsteps 30000
# Measure a dev build against it; tag with the LLVM commit
kernel-bench dev-current benchMEM "llvm-$(git -C ~/llvm-project rev-parse --short HEAD)" -- -nsteps 30000
One implementation note worth surfacing: rocprofv3 writes CSVs under rocprof/<hostname>/<pid>_kernel_stats.csv. The hostname subdirectory is for multi-host runs; on a single-host setup it just adds a layer of glob to navigate.
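Navigating that layout is one glob. A sketch of the lookup, assuming a single-GPU, single-process run (the `find_kernel_stats` helper is hypothetical):

```python
from glob import glob
from pathlib import Path

def find_kernel_stats(run_dir):
    """Locate the rocprofv3 kernel-stats CSV regardless of hostname/pid.

    rocprofv3 writes rocprof/<hostname>/<pid>_kernel_stats.csv; the
    wildcard levels absorb both the hostname and the pid.
    """
    hits = glob(str(Path(run_dir) / "rocprof" / "*" / "*kernel_stats.csv"))
    if not hits:
        raise FileNotFoundError(f"no kernel_stats.csv under {run_dir}/rocprof/")
    # Single-host setup: expect exactly one; prefer the newest if several.
    return max(hits, key=lambda p: Path(p).stat().st_mtime)
```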
bench-llvm-commit
The outermost wrapper, and the one I actually run after editing the AMDGPU backend. Rebuilds LLVM, rebuilds GROMACS dev incrementally, runs the bench through kernel-bench, and pings Discord with the headline verdict.
bench-llvm-commit [--no-llvm-rebuild] [--no-gmx-rebuild]
[--system NAME] [--tag TAG] [--steps N]
Flags:
- `--no-llvm-rebuild` — skip the `ninja -C $LLVM_DIR/build` step. Useful when you only changed GROMACS, or just want to re-measure.
- `--no-gmx-rebuild` — skip the incremental GROMACS rebuild. Useful for noise-floor runs against the existing dev binary.
- `--system NAME` — defaults to `benchMEM`; pass `benchRIB` for the 2 M-atom run.
- `--tag TAG` — defaults to the LLVM commit short hash, with `-dirty` appended if the LLVM working tree has uncommitted changes.
- `--steps N` — passed through as `-nsteps N`; defaults to 30000.
What it does, in order:

1. Sources `~/ai-ops/.env` for `DISCORD_WEBHOOK_URL`.
2. Rebuilds LLVM, aborting with a notification on failure.
3. Captures the LLVM short hash, plus a `-dirty` suffix if the tree isn't clean.
4. Rebuilds GROMACS dev with `HIP_CLANG_PATH` and `PATH` pointed at the freshly built clang.
5. Resolves `/opt/gromacs/dev-current` through its symlink to get the timestamped build directory.
6. Calls kernel-bench with that build, the chosen system, and the tag.
7. Computes the ns/day delta against the `stable-rocm7.12__<system>__baseline` run.
8. Stamps the LLVM commit hash into the run's `result.json` and posts the headline.
The final Discord message looks like:
🚀 LLVM `d7fead0f` on `benchMEM`: 43.7 ns/day (baseline 41.5, Δ +2.200, +5.30%)
The emoji thresholds match kernel-bench's: 🚀 for >2% faster than baseline, 🐢 for >2% slower, 📊 otherwise.
Examples:
# The common case: edit LLVM, run this, wait six minutes
bench-llvm-commit
# Re-measure without rebuilding anything
bench-llvm-commit --no-llvm-rebuild --no-gmx-rebuild --tag noise-r2
# Big-system validation before a commit goes up
bench-llvm-commit --system benchRIB --steps 10000
After this script runs, every result.json under /var/lib/gromacs-runs/ has the LLVM commit hash stamped on it. Six months from now I can cat .../result.json | jq .llvm_commit on any historical run and know which compiler revision produced which number. For compiler-development work that traceability is non-negotiable.
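The same traceability is scriptable without `jq`. A sketch that walks the runs tree and maps each run to the compiler revision that produced it; `history` is a hypothetical helper, not one of the three scripts:

```python
import json
from pathlib import Path

def history(runs_root, system="benchMEM"):
    """List (run_dir, llvm_commit, ns_per_day) for every recorded run
    of the given system, oldest directory name first.

    Runs from the stock-compiler baseline have no llvm_commit stamp,
    so they show up as 'stock'.
    """
    out = []
    for rj in sorted(Path(runs_root).glob(f"*__{system}__*/result.json")):
        r = json.loads(rj.read_text())
        out.append((rj.parent.name, r.get("llvm_commit", "stock"),
                    r.get("ns_per_day")))
    return out
```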
Per-kernel profiling with rocprofv3
The reason single ns/day numbers aren't enough is that a patch's effects on different kernels are independent. A patch that helps NBNxM by 5% can hurt PME spread by 3%. A patch to FMA selection touches one kernel cluster; a patch to atomic operation scheduling touches a different one. Knowing the aggregate ns/day moved is useful but not actionable.
What rocprofv3 in --hip-trace --kernel-trace --stats mode gives you is a CSV with, per kernel, total wall time, call count, average duration, min/max. For GROMACS on benchMEM, the categories that show up:
| Category | % GPU time (typical) | What it does |
|---|---|---|
| `nbnxm` | 34–40% | Non-bonded forces (vdW + short-range Coulomb), cluster pairs |
| `pme_spread` | 28–38% | B-spline interpolation of charges onto the PME grid |
| `fft` | 22–25% | 3D FFT for PME solve (VkFFT inside the GROMACS build) |
| `pme_solve` | 2–4% | Solve in Fourier space (force/energy convolution) |
| `pme_gather` | 1–2% | Inverse interpolation: forces from grid back to atoms |
| `memory` | ~2% | `__amd_rocclr_copyBuffer`, `__amd_rocclr_fillBufferAligned` |
| `other` | ~2% | Reductions, transpose, transform helpers |
From a compiler-codegen perspective, these categories exercise materially different things:
- nbnxm stresses FMA throughput, register pressure (cluster pairs hold many atoms in registers), LDS, VMEM coalescing, and subgroup intrinsics. RDNA 4 is wave32-native, so anything that affects wave32 codegen surfaces here.
- pme_spread stresses atomic adds (`-munsafe-fp-atomics` matters), strided memory writes, transcendental codegen (sin/cos for theta), and polynomial FMA chains for spline coefficients.
- fft is VkFFT, a separate third-party library compiled on its own. Your LLVM patches don't affect this — a useful baseline that ought to be constant across runs.
- pme_solve is pure Fourier-space math: complex multiplication, sums, divisions. Small but a useful canary for floating-point codegen changes.
The diff format the harness produces puts this front and center. A Discord notification from a typical bench-llvm-commit run looks like:
📊 `dev-current/benchMEM/llvm-d7fead0f` kernel breakdown vs baseline
🚀 `nbnxm`: 24108.42 ms (−3.29% vs 24928.79)
• `pme_spread`: 27445.30 ms (−0.20% vs 27501.67)
• `pme_solve`: 1378.61 ms (+0.70% vs 1369.02)
• `pme_gather`: 705.18 ms (+0.44% vs 702.07)
• `fft`: 16012.84 ms (−0.09% vs 16027.30)
This tells me exactly what my patch did: it helped nbnxm by 3.3% (consistent with FMA-pipeline changes I was working on), left PME mostly untouched (expected — different kernels), and didn't break anything elsewhere. The fft line being essentially flat is a useful sanity check that the toolchain swap itself isn't introducing wildly different behavior.
Methodology and the gotchas worth knowing
The numbers above only mean what they appear to mean if the methodology underneath is sound. The hard-won lessons:
1. Always use -resethway
GROMACS's PME tuner converges to an optimal grid by trying different operating points during the first thousands of steps. Those tuning steps run at 10× the wall time of steady state. If you include them in the measurement, your ns/day number is dominated by tuner exploration. -resethway resets the timing counters at the halfway mark, throwing out the tuning window.
This makes single-shot benchmarks much more reproducible. Without it, run-to-run variance can be 30%; with it, more like 2%.
2. Pin clocks before benchmarking
The 9070 has dynamic clocks that respond to thermal and power state. The CPU's schedutil governor downclocks aggressively. Both are death for run-to-run reproducibility. Before any A/B:
sudo rocm-smi --setperflevel high
sudo cpupower frequency-set -g performance
This costs ~20W idle and a couple of degrees of GPU temperature. Worth it for the noise reduction.
3. Two builds that converge to different PME grids aren't directly comparable
This bit me. My first dev vs stable comparison showed dev 2.2× faster overall, but every individual kernel slower. That's only possible if the tuner picked different operating points. Dev's tuner found a grid that was overall faster but used kernels less efficiently. Per-kernel comparisons between such runs are apples-to-oranges.
The fix is either to pin the PME grid with -notunepme + an explicit -rcoulomb (eliminating the tuner's freedom), or to focus on the aggregate ns/day for end-to-end performance claims and use kernel deltas only as directional signals. I prefer the second; pinning the grid removes a real adaptive behavior that users get for free.
4. benchMEM can't use -update gpu
It will fail with: "The number of coupled constraints is higher than supported in the GPU LINCS code." benchMEM's membrane lipids have constraint chains that exceed what GROMACS's GPU LINCS implementation handles. Update + constraints stay on the CPU for this system; only nb, pme, pmefft, and bonded run on the GPU.
This means the bench wrapper shouldn't hardcode GPU residency flags — let mdrun pick its own heuristic defaults per-system. Different inputs support different levels of offload; hardcoding kills portability and breaks benchMEM.
5. Don't share builds between machines
The HIP compiler flags include -march=skylake-avx512 or similar host-CPU-specific flags that cmake auto-detects. Builds aren't portable. If you move /opt/gromacs/... to another machine, expect either crashes or silent codegen-quality changes.
6. Variance, variance, variance
One run is not a measurement. Three runs are a measurement. The typical recipe:
for r in 1 2 3; do
kernel-bench stable-rocm7.12 benchMEM "baseline-r$r" -- -nsteps 30000
sleep 15
done
for r in 1 2 3; do
kernel-bench dev-current benchMEM "test-r$r" -- -nsteps 30000
sleep 15
done
Six runs, ~20 minutes total. Use the median per build, look at spread. A 3% delta with 2% per-run noise floor is borderline; a 5% delta is real signal. Anything that doesn't survive across three runs in each direction isn't a finding.
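In code, that rule of thumb amounts to comparing per-build medians against the larger of the nominal noise floor and the observed spread. This is my formalization of the recipe above, not a function the harness actually ships:

```python
from statistics import median

def verdict(baseline_runs, test_runs, noise_floor_pct=2.0):
    """Compare per-build medians (ns/day) from repeated runs.

    Returns (delta_pct, is_signal): the median-vs-median percent delta,
    and whether it exceeds both the nominal noise floor and the
    worst observed within-build spread.
    """
    base, test = median(baseline_runs), median(test_runs)
    delta = (test - base) / base * 100.0
    spread = max(
        (max(runs) - min(runs)) / median(runs) * 100.0
        for runs in (baseline_runs, test_runs)
    )
    return delta, abs(delta) > max(noise_floor_pct, spread)
```

With three baseline runs around 41.5 ns/day and three test runs around 43.7, the ~5% delta survives; a 1% delta with the same spread would not.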
Aside: the Discord agent
The Discord notifications throughout this writeup come from a small local AI agent. Brief description, because it's tangentially relevant: an Ollama instance running gpt-oss:20b on the same 9070, a py-cord bot listening for DMs, and the bot owns its own agent loop — it executes Python tools that read system status, parse GROMACS logs, and post webhook notifications. From anywhere in the world (Tailscale handles the connectivity), I can DM the bot "how's my last bench" and get back a kernel breakdown, or get a proactive ping the moment a long simulation finishes or crashes.
The agent is incidental to the benchmark methodology — webhooks alone would do the alert side, and I could query result.json files from a phone over SSH. But the conversational layer turns out to be genuinely useful for "compare these two runs and tell me what's different" queries, where the agent reads kernel_result.json from both and produces a summary I can act on without booting a laptop. That's a separate post, though.
What I haven't verified yet
This writeup describes infrastructure that works. It does not yet describe a robust performance result.
The single dev-vs-stable comparison I have was collected before I'd applied the methodology fixes — no -resethway, no clock pinning, no CPU governor change, single-shot rather than three-run-each. The numbers in that comparison are not a defensible compiler benchmark result. They were what made me realize the methodology needed to be tightened, and they're the reason the methodology section above exists.
The infrastructure is now correct. The clean re-baselines and re-comparisons happen in a separate session, with the perf settings pinned and three runs in each direction. Those are the numbers I'll publish when I have actual codegen patches to evaluate. This post is about the harness; the perf results that the harness produces are a separate writeup.
A few other things I'd flag for anyone copying this setup:
- rocprofv3 column names vary across point releases. The CSV parser in kernel-bench assumes `Name,Calls,TotalDurationNs`. If you're on a different rocprof version, check the header line and adjust the column lookup. My script tries several candidate column names but isn't exhaustive.
- VkFFT in the trace is not your codegen. It's a third-party Vulkan-FFT library that GROMACS uses for the 3D FFT. If you want to A/B your LLVM against rocFFT for the FFT step, rebuild with `-DGMX_GPU_FFT_LIBRARY=rocFFT` as a separate build configuration.
- rocprof-compute's roofline reference isn't calibrated for RDNA 4. Kernel-level counters work fine, but the roofline overlay won't be accurate. Use rocprof-compute for VALU utilization, occupancy, and LDS busy %, but ignore the roofline plot for now.
- The published Kutzner inputs are old. The TPR files have `VERSION 4.6.3-dev-20130701` in their headers. GROMACS reads them fine because TPR file versioning is backward-compatible, but if you compare your numbers to historical published benchmarks, make sure the comparison also held the GROMACS version constant — kernel performance has changed substantially across releases.
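The tolerant column lookup mentioned in the first bullet can be sketched like this. The candidate lists are illustrative, not a complete inventory of what rocprofv3 releases have actually emitted:

```python
import csv
import io

# Candidate header names per logical field, tried in order.
# Real rocprofv3 releases vary; this table is a guess, not exhaustive.
CANDIDATES = {
    "name":     ["Name", "Kernel_Name", "KernelName"],
    "calls":    ["Calls", "Count"],
    "total_ns": ["TotalDurationNs", "TotalDuration(ns)", "DurationNs"],
}

def read_kernel_stats(csv_text):
    """Parse a kernel-stats CSV into (name, calls, total_ns) tuples,
    resolving each field against whatever header this rocprof wrote."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    header = rows[0].keys() if rows else []

    def pick(field):
        for cand in CANDIDATES[field]:
            if cand in header:
                return cand
        raise KeyError(f"no column for {field!r}; header was {list(header)}")

    n, c, t = pick("name"), pick("calls"), pick("total_ns")
    return [(r[n], int(r[c]), int(r[t])) for r in rows]
```

Failing loudly with the actual header in the error message beats silently reading the wrong column when the next ROCm point release renames things.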
Two days of plumbing. A working iteration loop that goes from "edit clang source" to "Discord message with kernel-level delta" in about six minutes. Whether the LLVM patches I'm working on actually produce wins is the next question, and the right one. But until the harness was real, asking that question was guesswork. Now it's measurement.
Footnotes
1. The Kutzner suite is one of multiple GROMACS benchmark traditions. The other big one is the water benchmarks from the GROMACS team itself, available at `ftp.gromacs.org/benchmarks/water_GMX50_bare.tar.gz` — water systems from 1.5k to 3M atoms in one tarball. Useful for spanning a wider range of system sizes if you want to study kernel performance vs problem size, but less well-attributed in the literature than benchMEM/benchRIB.
Set in Inter (body) and Fira Code (code), matching the rest of the site.
Code blocks rendered with hand-applied span markup rather than a syntax-highlighting library, for portability.
Benchmark inputs courtesy of Carsten Kutzner / MPI for Multidisciplinary Sciences, used under CC-BY 4.0.