A field report

Building a per-kernel GROMACS benchmark harness for the Radeon RX 9070

Notes from one weekend of plumbing — ROCm 7.12, a custom LLVM, a homemade Discord agent, and an attempt to turn molecular dynamics into a tight compiler-iteration loop on a brand-new GPU.

Author: Anshil Gandhi · Reading time: ~30 min

I'm a GPU compiler developer at AMD. I work on the AMDGPU backend in LLVM — the part that turns HIP and OpenCL source into the ISA that runs on Radeon and Instinct silicon. When my workstation showed up with a brand-new Radeon RX 9070 (RDNA 4, gfx1201), the first thing I wanted was a tight feedback loop: edit a backend pass, rebuild LLVM, rebuild a real GPU application, run a benchmark, see whether the kernels got faster. Six minutes between hitting save and seeing a number.

This is a writeup of the harness I ended up with. It runs GROMACS molecular dynamics workloads under rocprofv3, parses the per-kernel timings, diffs them against a stable baseline, and pings me on Discord with the verdict. The published Kutzner benchmark suite from MPI-NAT (the canonical GROMACS performance inputs) is the fixed point.

The stack: ROCm 7.12 on RDNA 4

The reference platform:

Component         Choice
CPU               AMD Ryzen 7 9800X3D (Zen 5, 8 cores, 96 MB 3D V-Cache)
GPU               AMD Radeon RX 9070 (RDNA 4, gfx1201, 16 GB VRAM)
Memory            32 GB DDR5
OS                Ubuntu 24.04.4 LTS, kernel 6.17 (HWE)
ROCm              7.12.0, installed under /opt/rocm/core-7.12/
GROMACS source    GitLab branch 4947-hip-feature-enablement
LLVM              github.com/gandhi56/llvm-project at commit d7fead0f (clang-23)

# Pinned ROCm environment
sudo tee /etc/profile.d/rocm.sh > /dev/null <<'EOF'
export ROCM_PATH=/opt/rocm/core-7.12
export HIP_PATH=/opt/rocm/core-7.12
export PATH=$ROCM_PATH/bin:$PATH
export LD_LIBRARY_PATH=$ROCM_PATH/lib:${LD_LIBRARY_PATH:-}
EOF

Two ROCm-specific tools deserve a mention: rocprofv3 for kernel-level tracing, and rocprof-compute (the renamed Omniperf) for counter-level analysis. Both ship with ROCm 7.x; both work on gfx1201, with the caveat that rocprof-compute's roofline reference isn't calibrated for RDNA 4 yet (kernel-level analysis works fine, just the roofline overlay is approximate).
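
The trace mode the harness leans on later has this shape (flags per the rocprofv3 docs; the -d output directory name is my choice):

# Kernel trace plus per-kernel stats; everything after -- is the app under test
rocprofv3 --hip-trace --kernel-trace --stats -d rocprof -- \
    gmx mdrun -v -deffnm topol -nsteps 30000 -resethway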

Two GROMACS builds, one custom LLVM

GROMACS has two HIP-enabled paths right now: the native HIP port that landed upstream with GROMACS 2025, and AMD's fuller enablement work on the GROMACS GitLab, the 4947-hip-feature-enablement branch this writeup builds from.

My setup uses two parallel builds installed side by side:

Build                                       Compiler                    Purpose
/opt/gromacs/stable-rocm7.12                Stock hipcc + amdclang++    Reference baseline. Untouched between LLVM iterations.
/opt/gromacs/dev-llvm-<hash>-<timestamp>    My LLVM directly (clang++)  What I iterate on. A new install prefix per build, never overwriting the last.
/opt/gromacs/dev-current                    (symlink)                   Points at the most recent dev install. Scripts dereference this.

The stable build is straightforward — AMD's standard recipe with the architecture overridden to gfx1201:

cd ~/build/gromacs-stable
cmake ~/src/gromacs-amd-hip \
  -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DGMX_GPU=HIP \
  -DCMAKE_HIP_COMPILER=/opt/rocm/core-7.12/bin/amdclang++ \
  -DCMAKE_HIP_COMPILER_ROCM_ROOT=/opt/rocm/core-7.12 \
  -DGMX_HIP_TARGET_ARCH=gfx1201 \
  -DCMAKE_PREFIX_PATH=/opt/rocm/core-7.12 \
  -DGMX_MPI=OFF \
  -DGMX_OPENMP=ON \
  -DGMX_BUILD_OWN_FFTW=OFF \
  -DCMAKE_INSTALL_PREFIX=/opt/gromacs/stable-rocm7.12 \
  -DGMX_DOUBLE=OFF \
  -DGMX_SIMD=AVX2_256 \
  -DGMX_USE_RDTSCP=ON

ninja -j$(nproc) install

The dev build is the same recipe with three changes:

LLVM_HASH=$(git -C /home/anshil/llvm-project rev-parse --short HEAD)
PREFIX=/opt/gromacs/dev-llvm-${LLVM_HASH}-$(date +%Y%m%d-%H%M)
HIP_FLAGS="--rocm-path=/opt/rocm/core-7.12 --offload-arch=gfx1201 \
   -fPIC -fno-gpu-rdc -ffast-math -munsafe-fp-atomics \
   -fdenormal-fp-math=ieee -fcuda-flush-denormals-to-zero \
   -fno-slp-vectorize -Wno-unused-command-line-argument -Wno-pass-failed"

cmake ~/src/gromacs-amd-hip \
  -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DGMX_GPU=HIP \
  -DCMAKE_HIP_COMPILER=/home/anshil/llvm-project/build/bin/clang++ \
  -DCMAKE_HIP_COMPILER_ROCM_ROOT=/opt/rocm/core-7.12 \
  -DCMAKE_HIP_FLAGS="$HIP_FLAGS" \
  -DGMX_HIP_TARGET_ARCH=gfx1201 \
  -DCMAKE_HIP_ARCHITECTURES=gfx1201 \
  -DGPU_TARGETS=gfx1201 \
  -DCMAKE_PREFIX_PATH=/opt/rocm/core-7.12 \
  -DGMX_MPI=OFF -DGMX_OPENMP=ON -DGMX_BUILD_OWN_FFTW=OFF \
  -DCMAKE_INSTALL_PREFIX=$PREFIX \
  -DGMX_DOUBLE=OFF -DGMX_SIMD=AVX2_256 -DGMX_USE_RDTSCP=ON \
  -DHIPCC_HAS_TARGET_ARCH_gfx1201=TRUE

ninja -j$(nproc) install
ln -sfn $PREFIX /opt/gromacs/dev-current

The three changes from stable:

  1. CMAKE_HIP_COMPILER points directly at my clang++. CMake 3.27+ takes the HIP language compiler directly rather than through a hipcc wrapper.
  2. CMAKE_HIP_FLAGS carries the flags hipcc would otherwise inject. Going through clang directly puts those on you.
  3. HIPCC_HAS_TARGET_ARCH_gfx1201=TRUE seeds the cache variable GROMACS's probe would otherwise populate, skipping the probe.

The install prefix carries both the LLVM commit hash and a timestamp. Every dev build gets its own directory; old builds are never overwritten. Months from now I can dig back to dev-llvm-d7fead0f-20260516-0243 and re-run any benchmark against it.
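
Re-running against one of those later is just a matter of sourcing that prefix's GMXRC (path from the example above):

# Point the shell at a historical dev build and confirm what you got
source /opt/gromacs/dev-llvm-d7fead0f-20260516-0243/bin/GMXRC
gmx --version | head -n 3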

Benchmark inputs: Kutzner benchMEM and benchRIB

For compiler work, the inputs matter as much as the harness. I picked from Carsten Kutzner's free GROMACS benchmark set¹ — the canonical performance inputs that published GROMACS papers compare against. Using these means my numbers are directly comparable to anything other GROMACS people publish.

System     Atoms       Description                 Why I picked it
benchMEM   82,000      Membrane protein in water   Small enough for ~3-minute iteration runs. Mixed GPU/CPU workload because membrane lipid constraints prevent full GPU residency.
benchRIB   2,000,000   Ribosome in water           Large enough to saturate the GPU and stress memory traffic. Used for weekly validation runs, not daily iteration.

The other Kutzner systems either don't fit (benchPEP at 12 M atoms exceeds the 9070's 16 GB VRAM) or are too small to be meaningful (the binding affinity studies at 6-36k atoms). benchMEM is the workhorse; benchRIB is for "before I commit this patch upstream" validation.

cd /var/lib/gromacs-runs/_systems

# benchMEM (82k atoms, membrane protein)
mkdir -p benchMEM && cd benchMEM
wget -O benchMEM.zip https://www.mpinat.mpg.de/benchMEM
unzip benchMEM.zip
cp benchMEM.tpr topol.tpr  # gmxrun expects topol.tpr

# benchRIB (2M atoms, ribosome)
cd ..
mkdir -p benchRIB && cd benchRIB
wget -O benchRIB.zip https://www.mpinat.mpg.de/benchRIB
unzip benchRIB.zip
cp benchRIB.tpr topol.tpr

The TPR files are pre-built run inputs. No grompp needed — straight to mdrun. The license is CC-BY 4.0; if you use them in published work, the attribution goes to Dept. of Theoretical and Computational Biophysics, Max Planck Institute for Multidisciplinary Sciences, Göttingen, https://www.mpinat.mpg.de/grubmueller/bench.
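
A quick sanity check that a downloaded tpr is what it claims to be (gmx dump ships with every GROMACS install):

# Print the atom count straight out of the run input
source /opt/gromacs/stable-rocm7.12/bin/GMXRC
gmx dump -s benchMEM/topol.tpr 2>/dev/null | grep -m1 natoms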

The wrapper scripts

Three scripts in /usr/local/bin, composing top-down: gmxrun wraps a single mdrun invocation with notifications and result capture. kernel-bench wraps gmxrun with rocprofv3 trace parsing. bench-llvm-commit wraps everything in the outer iteration loop.

gmxrun

Wraps a single mdrun invocation with notifications, result capture, and optional kernel/counter profiling. The entry point for every other tool below.

gmxrun <build> <system> [tag] [-- <extra mdrun args>]

Arguments:

  - <build>: the name of a GROMACS install under /opt/gromacs/ (stable-rocm7.12, dev-current, or a timestamped dev prefix).
  - <system>: a benchmark input directory under /var/lib/gromacs-runs/_systems/ (benchMEM or benchRIB).
  - [tag]: a free-form label; the run directory is named <build>__<system>__<tag>.
  - Anything after -- is passed through to mdrun untouched.

Two flags are intercepted before they reach mdrun:

  - --profile=trace wraps the run in rocprofv3 kernel tracing; this is the mode kernel-bench uses.
  - --profile=counters runs it under rocprof-compute for counter-level analysis instead.

Side effects, in order:

  1. Creates /var/lib/gromacs-runs/<build>__<system>__<tag>/ and copies topol.tpr into it.
  2. Pins the run to the discrete GPU (HIP_VISIBLE_DEVICES=0, ROCR_VISIBLE_DEVICES=0).
  3. Posts a start message to $DISCORD_WEBHOOK_URL if it's set.
  4. Runs mdrun -v -deffnm topol -ntmpi 1 -ntomp 16 -pin on -resethway plus your pass-through args.
  5. Parses ns/day from topol.log and writes a structured result.json (build, system, tag, gmx_version, duration_seconds, ns_per_day, return_code, finished, profile_mode, timestamp).
  6. Posts a completion message, or a failure message with the tail of the log.
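
Step 5 is the one worth showing concretely. A minimal sketch of the capture, not the script verbatim (variable names are mine; the Performance: line is what mdrun prints at the end of its log, ns/day first):

# Grab ns/day from mdrun's closing Performance line and record it
ns_per_day=$(awk '/^Performance:/ {print $2}' topol.log)
jq -n --arg build "$build" --arg system "$system" --arg tag "$tag" \
      --argjson ns "${ns_per_day:-0}" \
      '{build: $build, system: $system, tag: $tag,
        ns_per_day: $ns, timestamp: (now | todate)}' > result.json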

Examples:

# Plain run
gmxrun stable-rocm7.12 benchMEM baseline -- -nsteps 30000

# Same, with kernel-level tracing
gmxrun dev-current     benchMEM r1       -- -nsteps 30000 --profile=trace

Three design notes on this script worth surfacing:

kernel-bench

A thin wrapper that runs gmxrun with --profile=trace, then post-processes the rocprofv3 CSV into a per-category kernel breakdown and diffs against a baseline.

kernel-bench <build> <system> [tag] [-- <extra mdrun args>]

Arguments are identical to gmxrun. The script does three things after the trace finishes:

  1. Buckets every kernel in the rocprofv3 stats CSV into a category (nbnxm, pme_spread, fft, and so on).
  2. Writes kernel_result.json in the run directory with totals per category (milliseconds and percent of GPU time) and per-kernel rows sorted by total time.
  3. Diffs against the baseline at /var/lib/gromacs-runs/stable-rocm7.12__<system>__kernel-baseline/kernel_result.json and posts a per-category breakdown to Discord:

📊 dev-current/benchMEM/llvm-d7fead0f kernel breakdown vs baseline
🚀 nbnxm: 24108.42 ms (−3.29% vs 24928.79)
• pme_spread: 27445.30 ms (−0.20% vs 27501.67)

The diff arrows fire at ±2%: 🚀 for >2% faster, 🐢 for >2% slower, • otherwise.
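
The check itself, for completeness (verdict is my name for it; thresholds as stated, and since these are kernel-time deltas, negative means faster):

# Map a signed percent delta to the emoji verdict
verdict() {
  awk -v d="$1" 'BEGIN {
    if      (d < -2) print "🚀"
    else if (d >  2) print "🐢"
    else             print "•"
  }'
}
verdict -3.29   # -> 🚀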

One special tag: if tag is exactly kernel-baseline, the diff step is skipped — that run is the new baseline.

Examples:

# Establish (or refresh) the kernel-level baseline
kernel-bench stable-rocm7.12 benchMEM kernel-baseline -- -nsteps 30000

# Measure a dev build against it; tag with the LLVM commit
kernel-bench dev-current benchMEM "llvm-$(git -C ~/llvm-project rev-parse --short HEAD)" -- -nsteps 30000

One implementation note worth surfacing: rocprofv3 writes CSVs under rocprof/<hostname>/<pid>_kernel_stats.csv. The hostname subdirectory is for multi-host runs; on a single-host setup it just adds a layer of glob to navigate.
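
The post-processing starts with exactly that glob. A hedged sketch of the bucketing step (two loud assumptions: total duration sits in CSV column 3, and kernel names carry no embedded commas; demangled template names often do, so the real script wants a CSV-aware parser rather than awk):

# Bucket per-kernel totals into the categories used throughout this post
csv=$(ls rocprof/*/*_kernel_stats.csv | head -n1)
awk -F',' 'NR > 1 {
  cat = "other"
  if      ($0 ~ /nbnxm/)                 cat = "nbnxm"
  else if ($0 ~ /spread/)                cat = "pme_spread"
  else if ($0 ~ /gather/)                cat = "pme_gather"
  else if ($0 ~ /solve/)                 cat = "pme_solve"
  else if ($0 ~ /fft|FFT/)               cat = "fft"
  else if ($0 ~ /copyBuffer|fillBuffer/) cat = "memory"
  ms[cat] += $3 / 1e6                    # assumed: column 3 = total ns
} END { for (c in ms) printf "%-12s %12.2f ms\n", c, ms[c] }' "$csv"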

bench-llvm-commit

The outermost wrapper, and the one I actually run after editing the AMDGPU backend. Rebuilds LLVM, rebuilds GROMACS dev incrementally, runs the bench through kernel-bench, and pings Discord with the headline verdict.

bench-llvm-commit [--no-llvm-rebuild] [--no-gmx-rebuild]
                  [--system NAME] [--tag TAG] [--steps N]

Flags:

  - --no-llvm-rebuild: skip the LLVM rebuild and reuse the existing clang.
  - --no-gmx-rebuild: skip the GROMACS dev rebuild.
  - --system NAME: which benchmark system to run (default benchMEM).
  - --tag TAG: override the default llvm-<hash> tag.
  - --steps N: passed to mdrun as -nsteps (default 30000).

What it does, in order:

  1. Sources ~/ai-ops/.env for DISCORD_WEBHOOK_URL.
  2. Rebuilds LLVM, aborting with a notification on failure.
  3. Captures the LLVM short hash, plus a -dirty suffix if the tree isn't clean.
  4. Rebuilds GROMACS dev with HIP_CLANG_PATH and PATH pointed at the freshly built clang.
  5. Resolves /opt/gromacs/dev-current through its symlink to get the timestamped build directory.
  6. Calls kernel-bench with that build, the chosen system, and the tag.
  7. Computes the ns/day delta against the stable-rocm7.12__<system>__baseline run.
  8. Stamps the LLVM commit hash into the run's result.json and posts the headline.
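
Condensed, the core of it looks like this (the notify helper is hypothetical, and the ninja build-tree paths are assumptions; everything else is the paths above):

# Sketch of the rebuild-and-dispatch core, not the script verbatim
ninja -C ~/llvm-project/build || { notify "❌ LLVM build failed"; exit 1; }
LLVM_HASH=$(git -C ~/llvm-project rev-parse --short HEAD)
git -C ~/llvm-project diff --quiet || LLVM_HASH="${LLVM_HASH}-dirty"

HIP_CLANG_PATH=~/llvm-project/build/bin PATH=~/llvm-project/build/bin:$PATH \
  ninja -C ~/build/gromacs-dev install

BUILD=$(basename "$(readlink -f /opt/gromacs/dev-current)")
kernel-bench "$BUILD" "${SYSTEM:-benchMEM}" "llvm-${LLVM_HASH}" -- \
  -nsteps "${STEPS:-30000}"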

The final Discord message looks like:

🚀 LLVM d7fead0f on benchMEM: 43.7 ns/day (baseline 41.5, Δ +2.200, +5.30%)

The emoji thresholds match kernel-bench's: 🚀 for >2% faster than baseline, 🐢 for >2% slower, 📊 otherwise.

Examples:

# The common case: edit LLVM, run this, wait six minutes
bench-llvm-commit

# Re-measure without rebuilding anything
bench-llvm-commit --no-llvm-rebuild --no-gmx-rebuild --tag noise-r2

# Big-system validation before a commit goes up
bench-llvm-commit --system benchRIB --steps 10000

After this script runs, every result.json under /var/lib/gromacs-runs/ has the LLVM commit hash stamped on it. Six months from now I can cat .../result.json | jq .llvm_commit on any historical run and know which compiler revision produced which number. For compiler-development work that traceability is non-negotiable.
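
It also makes history queries one-liners. Sweeping every recorded run (field names from the result.json schema above; the fallback covers runs that predate the stamping):

# commit -> ns/day for every run ever recorded
for f in /var/lib/gromacs-runs/*/result.json; do
  jq -r '[(.llvm_commit // "-"), (.ns_per_day | tostring),
          .build, .system, .tag] | join("  ")' "$f"
done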

Per-kernel profiling with rocprofv3

The reason a single ns/day number isn't enough is that the kernels move independently. A patch that helps NBNxM by 5% can hurt PME spread by 3% and nearly wash out in the aggregate. A patch to FMA selection touches one kernel cluster; a patch to atomic-operation scheduling touches a different one. Knowing that the aggregate ns/day moved is useful but not actionable.

What rocprofv3 in --hip-trace --kernel-trace --stats mode gives you is a CSV with, per kernel, total wall time, call count, average duration, min/max. For GROMACS on benchMEM, the categories that show up:

Category     % GPU time (typical)   What it does
nbnxm        34–40%                 Non-bonded forces (vdW + short-range Coulomb), cluster pairs
pme_spread   28–38%                 B-spline interpolation of charges onto the PME grid
fft          22–25%                 3D FFT for PME solve (VkFFT inside the GROMACS build)
pme_solve    2–4%                   Solve in Fourier space (force/energy convolution)
pme_gather   1–2%                   Inverse interpolation: forces from grid back to atoms
memory       ~2%                    __amd_rocclr_copyBuffer, __amd_rocclr_fillBufferAligned
other        ~2%                    Reductions, transpose, transform helpers

From a compiler-codegen perspective, these categories exercise materially different things. nbnxm is the FLOP-dense one: pair-interaction inner loops whose throughput lives and dies on FMA selection and scheduling. pme_spread and pme_gather are dominated by scattered grid access and atomic accumulation, so they respond to atomic-operation and addressing codegen instead. And fft runs through VkFFT, compiled on its own terms, which makes it a handy control that flags on the GROMACS HIP kernels mostly can't touch.

The diff format the harness produces puts this front and center. A Discord notification from a typical bench-llvm-commit run looks like:

📊 dev-current/benchMEM/llvm-d7fead0f kernel breakdown vs baseline
🚀 nbnxm: 24108.42 ms (−3.29% vs 24928.79)
• pme_spread: 27445.30 ms (−0.20% vs 27501.67)
• pme_solve: 1378.61 ms (+0.70% vs 1369.02)
• pme_gather: 705.18 ms (+0.44% vs 702.07)
• fft: 16012.84 ms (−0.09% vs 16027.30)

This tells me exactly what my patch did: it helped nbnxm by 3.3% (consistent with FMA-pipeline changes I was working on), left PME mostly untouched (expected — different kernels), and didn't break anything elsewhere. The fft line being essentially flat is a useful sanity check that the toolchain swap itself isn't introducing wildly different behavior.

Methodology and the gotchas worth knowing

The numbers above only mean what they appear to mean if the methodology underneath is sound. The hard-won lessons:

1. Always use -resethway

GROMACS's PME tuner converges on an optimal grid by trying different operating points over the first few thousand steps. Those tuning steps can run at 10× the wall time of steady state. If you include them in the measurement, your ns/day number is dominated by tuner exploration. -resethway resets the timing counters at the halfway mark, throwing out the tuning window.

This makes single-shot benchmarks much more reproducible. Without it, run-to-run variance can be 30%; with it, more like 2%.

2. Pin clocks before benchmarking

The 9070 has dynamic clocks that respond to thermal and power state. The CPU's schedutil governor downclocks aggressively. Both are death for run-to-run reproducibility. Before any A/B:

sudo rocm-smi --setperflevel high
sudo cpupower frequency-set -g performance

This costs ~20W idle and a couple of degrees of GPU temperature. Worth it for the noise reduction.

3. Two builds that converge to different PME grids aren't directly comparable

This bit me. My first dev vs stable comparison showed dev 2.2× faster overall, but every individual kernel slower. That's only possible if the tuner picked different operating points. Dev's tuner found a grid that was overall faster but used kernels less efficiently. Per-kernel comparisons between such runs are apples-to-oranges.

The fix is either to pin the PME operating point with -notunepme (the cutoff then stays at the rcoulomb baked into the .tpr, eliminating the tuner's freedom), or to focus on the aggregate ns/day for end-to-end performance claims and use kernel deltas only as directional signals. I prefer the second; pinning the grid removes a real adaptive behavior that users get for free.
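
When I do want the pinned variant, it's one extra pass-through flag on both sides of the A/B (-notunepme is a standard mdrun option):

# Tuner disabled on both baseline and test so per-kernel diffs line up
kernel-bench stable-rocm7.12 benchMEM kernel-baseline -- -nsteps 30000 -notunepme
kernel-bench dev-current     benchMEM pinned-r1       -- -nsteps 30000 -notunepme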

4. benchMEM can't use -update gpu

It will fail with: "The number of coupled constraints is higher than supported in the GPU LINCS code." benchMEM's membrane lipids have constraint chains that exceed what GROMACS's GPU LINCS implementation handles. Update + constraints stay on the CPU for this system; only nb, pme, pmefft, and bonded run on the GPU.

This means the bench wrapper shouldn't hardcode GPU residency flags — let mdrun pick its own heuristic defaults per-system. Different inputs support different levels of offload; hardcoding kills portability and breaks benchMEM.
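
Offload choices therefore ride in as pass-through args, chosen per system. A sketch (whether benchRIB actually takes full offload is mdrun's call, and it refuses loudly when it can't, as benchMEM shows):

# Push extra offload per system through the pass-through args, never the wrapper
gmxrun dev-current benchRIB offload-probe -- -nsteps 10000 -update gpu -bonded gpu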

5. Don't share builds between machines

The compile lines bake in host-CPU-specific -march flags that CMake auto-detects, matched to whatever the build host is. Builds aren't portable. If you move /opt/gromacs/... to another machine, expect either crashes or silent codegen-quality changes.

6. Variance, variance, variance

One run is not a measurement. Three runs are a measurement. The typical recipe:

for r in 1 2 3; do
  kernel-bench stable-rocm7.12 benchMEM "baseline-r$r" -- -nsteps 30000
  sleep 15
done

for r in 1 2 3; do
  kernel-bench dev-current benchMEM "test-r$r" -- -nsteps 30000
  sleep 15
done

Six runs, ~20 minutes total. Use the median per build, look at spread. A 3% delta with 2% per-run noise floor is borderline; a 5% delta is real signal. Anything that doesn't survive across three runs in each direction isn't a finding.
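
Picking the median is a one-liner over the three result.json files (jq slurp mode; the middle of three sorted values):

# Median ns/day across the three baseline runs
jq -s 'map(.ns_per_day) | sort | .[1]' \
  /var/lib/gromacs-runs/stable-rocm7.12__benchMEM__baseline-r*/result.json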

Aside: the Discord agent

The Discord notifications throughout this writeup come from a small local AI agent. Brief description, because it's only tangentially relevant: an Ollama instance running gpt-oss:20b on the same 9070, plus a py-cord bot that listens for DMs and owns its own agent loop, executing Python tools that read system status, parse GROMACS logs, and post webhook notifications. From anywhere in the world (Tailscale handles the connectivity), I can DM the bot "how's my last bench" and get back a kernel breakdown, or get a proactive ping the moment a long simulation finishes or crashes.

The agent is incidental to the benchmark methodology — webhooks alone would do the alert side, and I could query result.json files from a phone over SSH. But the conversational layer turns out to be genuinely useful for "compare these two runs and tell me what's different" queries, where the agent reads kernel_result.json from both and produces a summary I can act on without booting a laptop. That's a separate post, though.
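
For completeness, the webhook-only fallback is a single curl (payload shape per Discord's webhook API; the message text here is the earlier example headline):

# Bare Discord webhook post: all the alerting the harness strictly needs
curl -sS -H 'Content-Type: application/json' \
     -d '{"content": "🚀 LLVM d7fead0f on benchMEM: 43.7 ns/day"}' \
     "$DISCORD_WEBHOOK_URL"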

What I haven't verified yet

This writeup describes infrastructure that works. It does not yet describe a robust performance result.

The single dev-vs-stable comparison I have was collected before I'd applied the methodology fixes — no -resethway, no clock pinning, no CPU governor change, single-shot rather than three-run-each. The numbers in that comparison are not a defensible compiler benchmark result. They were what made me realize the methodology needed to be tightened, and they're the reason the methodology section above exists.

The infrastructure is now correct. The clean re-baselines and re-comparisons will happen in a separate session, with the perf settings pinned and three runs in each direction. Those are the numbers I'll publish when I have actual codegen patches to evaluate. This post is about the harness; the perf results the harness produces are a separate writeup.

A few other things I'd flag for anyone copying this setup:


Two days of plumbing. A working iteration loop that goes from "edit clang source" to "Discord message with kernel-level delta" in about six minutes. Whether the LLVM patches I'm working on actually produce wins is the next question, and the right one. But until the harness was real, asking that question was guesswork. Now it's measurement.

Footnotes

  1. The Kutzner suite is one of multiple GROMACS benchmark traditions. The other big one is the water benchmarks from the GROMACS team itself, available at ftp.gromacs.org/benchmarks/water_GMX50_bare.tar.gz — water systems from 1.5k to 3M atoms in one tarball. Useful for spanning a wider range of system sizes if you want to study kernel performance vs problem size, but less well-attributed in the literature than benchMEM/benchRIB.


Benchmark inputs courtesy of Carsten Kutzner / MPI for Multidisciplinary Sciences, used under CC-BY 4.0.