Golden tests are the reason my nonogram app survived contact with more than one screen size. I wrote about that — render the UI at five viewports, commit the resulting images, and let CI scream the next time a layout change shifts a pixel. They’re the cheapest insurance I have against being clever about something I can’t see.

They also have one infuriating property: the reference images depend on who rendered them.

The macOS-versus-Linux problem

Flutter’s golden tests compare a freshly rendered widget against a committed PNG. Font rasterization is not identical across platforms: macOS and Linux hint and anti-alias text differently. So a golden generated on my MacBook and a golden generated on GitHub’s ubuntu-latest runner disagree by a handful of sub-pixel grey values. Not enough for a human to see. More than enough to fail a pixel-exact comparison that tolerates zero differences.
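To make the failure mode concrete, here is a small illustration in JavaScript (the language github-script uses later in this post). It is not Flutter's comparator, which is written in Dart. The pixel values and the countDifferingPixels helper are invented for the example. All it shows is that an exact comparison treats a three-step grey difference on one anti-aliased edge the same way it treats a broken layout.

```javascript
// Illustration only: two tiny 2x2 RGBA "renders" of the same glyph edge.
// The values are made up; real goldens are PNGs decoded into buffers like these.
const macRender = Uint8Array.from([
  0, 0, 0, 255,        128, 128, 128, 255,
  255, 255, 255, 255,  255, 255, 255, 255,
]);
const linuxRender = Uint8Array.from([
  0, 0, 0, 255,        131, 131, 131, 255, // anti-aliased edge, 3 steps lighter
  255, 255, 255, 255,  255, 255, 255, 255,
]);

// Counts pixels where any of the four RGBA channels differ.
function countDifferingPixels(a, b) {
  if (a.length !== b.length) throw new Error('image sizes differ');
  let differing = 0;
  for (let i = 0; i < a.length; i += 4) {
    if (a[i] !== b[i] || a[i + 1] !== b[i + 1] ||
        a[i + 2] !== b[i + 2] || a[i + 3] !== b[i + 3]) {
      differing++;
    }
  }
  return differing;
}

// An exact comparison passes only when zero pixels differ.
const differing = countDifferingPixels(macRender, linuxRender);
console.log(differing === 0
  ? 'golden matches'
  : `golden mismatch: ${differing} pixel(s)`);
// prints "golden mismatch: 1 pixel(s)"
```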

Which means the obvious workflow is a trap:

  1. I change the layout.
  2. I run flutter test --update-goldens locally to regenerate the references.
  3. I commit the new PNGs.
  4. CI renders the same widgets on Linux, compares them against my macOS images, and fails.

The references I just “fixed” are wrong the moment they leave my laptop. The only goldens CI will ever accept are goldens CI made itself. So the rule became: never commit a golden from your Mac. But then how do you update them at all?

Option one: a manual Linux workflow (not good enough)

My first answer was a workflow_dispatch job — update-goldens.yml — that I could trigger by hand from the Actions tab. It checks out the branch, runs flutter test --update-goldens on Linux, and commits the result.

It worked. It was also annoying in a specific way: I’d be sitting in a pull request, reviewing a visual diff, and to accept it I had to leave the PR, find the right workflow, pick the right branch from a dropdown, run it, and come back. Context switch every time. On a solo project that’s death by a thousand papercuts — the kind of friction that eventually makes you stop writing golden tests at all.

I wanted to approve a screenshot the same way I approve everything else: by typing in the PR.

Option two: a /approve-goldens bot

So I built one. approve-goldens.yml listens for issue_comment events and wakes up only when someone comments /approve-goldens on a pull request. The whole loop:

  1. Check permission. It calls getCollaboratorPermissionLevel and bails unless the commenter has write or admin. A comment is a public surface; anyone can type into it. The job has contents: write, so the gate matters — I don’t want a drive-by comment pushing commits to a branch.
  2. Acknowledge. It reacts to the comment with a 🚀 so I know it heard me. Silent automation feels broken even when it’s working — the same lesson the solver performance post is built around.
  3. Find the PR’s branch. issue_comment doesn’t carry the head ref, so it looks up the PR by number and checks out pr.head.ref.
  4. Regenerate on Linux. flutter pub get, then flutter test --update-goldens — on ubuntu-latest, the same environment that will judge them.
  5. Commit only if something changed. git add test/goldens/ and a staged-diff check. No empty commits. If the goldens were already current, it says so and stops.
  6. Push to the PR branch, then re-run the full suite to confirm the new references actually pass.
  7. Report back with a commit status (build → success/failure) and a comment: ✅ updated and verified, ⚠️ updated but tests failed, or ℹ️ nothing needed.
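Steps 1 and 3 are the parts most worth seeing in code. What follows is a rough sketch rather than the actual contents of approve-goldens.yml. It assumes the script runs inside an actions/github-script step, which provides github (an Octokit client) and context. The helper names isApprovalComment, commenterCanWrite, and findHeadRef are hypothetical, and matching the command exactly after trimming is an assumption too.

```javascript
// Sketch only: helper names are hypothetical, not taken from the real workflow.
// `github` and `context` are the objects actions/github-script injects.

// True only for a "/approve-goldens" comment left on a pull request.
// issue_comment also fires for plain issues, which have no `pull_request` key.
// Exact match after trimming is an assumption about how the command is parsed.
function isApprovalComment(payload) {
  const onPullRequest = Boolean(payload.issue && payload.issue.pull_request);
  const body = (payload.comment && payload.comment.body) || '';
  return onPullRequest && body.trim() === '/approve-goldens';
}

// Step 1: only collaborators with write or admin may trigger a push.
async function commenterCanWrite(github, context) {
  const { data } = await github.rest.repos.getCollaboratorPermissionLevel({
    owner: context.repo.owner,
    repo: context.repo.repo,
    username: context.payload.comment.user.login,
  });
  return ['write', 'admin'].includes(data.permission);
}

// Step 3: issue_comment carries no head ref, so look the PR up by number.
async function findHeadRef(github, context) {
  const { data: pr } = await github.rest.pulls.get({
    owner: context.repo.owner,
    repo: context.repo.repo,
    pull_number: context.issue.number,
  });
  return pr.head.ref;
}
```

In a real step the script would call these in order and stop early when either check fails. How the ref reaches the checkout step (a step output is one option) is left out here.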

Now the workflow is: open the PR, look at the rendered diff, and if the new look is correct, type /approve-goldens. Thirty seconds later the canonical references exist, generated in the only environment whose opinion counts, and the build goes green. I never touch a PNG on my Mac.

The part I didn’t expect to care about

The detail that makes this pleasant rather than merely correct is that the bot talks back. The emoji reaction, the commit status, the three distinct result comments — none of that is load-bearing. The goldens would update without any of it.

But automation you can’t see is automation you don’t trust, and automation you don’t trust is automation you eventually route around by hand. The 🚀 is there so that in the two-second gap between my comment and the job spinning up, I know the machine is listening. That’s the same instinct behind dimming the board while a hard puzzle generates. The cost is a few lines of github-script. The payoff is that I actually keep using the system.
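For a sense of scale, the talking-back parts might look roughly like this inside an actions/github-script step. This is a sketch under assumptions, not the real script: the function names, the changed and testsPassed flags, and the comment wording are all made up for illustration. It also assumes the "nothing needed" case posts only a comment and no commit status, which the description above does not settle.

```javascript
// Sketch only: names, flags, and message wording are hypothetical.
// `github` and `context` are the objects actions/github-script injects.

// Step 2: react to the triggering comment so the author sees it was received.
async function acknowledge(github, context) {
  await github.rest.reactions.createForIssueComment({
    owner: context.repo.owner,
    repo: context.repo.repo,
    comment_id: context.payload.comment.id,
    content: 'rocket',
  });
}

// Step 7: map the run's outcome to a status state and a comment body.
// `changed` and `testsPassed` stand in for results of earlier steps.
function describeOutcome(changed, testsPassed) {
  if (!changed) {
    // Assumption: no commit status when there was nothing to commit.
    return { state: null, body: 'ℹ️ Goldens already current, nothing needed.' };
  }
  return testsPassed
    ? { state: 'success', body: '✅ Goldens updated and verified.' }
    : { state: 'failure', body: '⚠️ Goldens updated, but tests failed.' };
}

// Posts the commit status (when there is one) and the result comment.
async function report(github, context, sha, outcome) {
  const { owner, repo } = context.repo;
  if (outcome.state !== null) {
    await github.rest.repos.createCommitStatus({
      owner, repo, sha,
      state: outcome.state,
      context: 'build',
    });
  }
  await github.rest.issues.createComment({
    owner, repo,
    issue_number: context.issue.number,
    body: outcome.body,
  });
}
```

Keeping describeOutcome a pure function means the three messages can be checked without touching the GitHub API at all.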

There’s a sibling flow I’ll save for another post: PRs also publish their rendered golden diffs to a golden-review/pr-<number>/ path on GitHub Pages, so “look at the visual diff” is a link, not a compare.html buried in test/failures/. Reviewing a screenshot change should be as easy as clicking it.

What I’d tell anyone with platform-dependent test artifacts

  • If an artifact is environment-specific, generate it in that environment — full stop. Don’t fight font rendering. Don’t try to make macOS match Linux. Make the artifact in the place that grades it and never anywhere else.
  • Put the approval where the work already is. A workflow_dispatch job is correct and unused. A /command in the PR is correct and used. On a solo project, “used” is the entire ballgame.
  • Gate write-capable automation on real permissions. A bot that can push needs to check who’s asking. One getCollaboratorPermissionLevel call.
  • Make the bot acknowledge itself. The reaction and the status comment aren’t polish. They’re the difference between a tool you trust and a tool you babysit.

The whole thing is about a hundred lines of YAML. It removed the single most common reason I had to leave a pull request, and it’s the reason golden tests are still load-bearing in this project instead of quietly abandoned.


This is part of a short series on the engineering behind shipping the nonogram app, alongside posts on the puzzle solver and the two-and-a-half-year road to TestFlight.