Brad Kinnard
Posted on May 31

Your AI's tests pass. That doesn't mean the code works.

#ai #testing #opensource #programming

You ask a coding agent to fix a bug. It writes the code, writes the tests, CI goes green, you merge. The bug's still there.

The agent's job was to turn the check green. The honest way to do that is to fix the code. The lazy way is to write a test that passes no matter what the code does. CI can't tell those two apart. A green check means the tests passed, not that the code is right.

It's easy to miss in review, because the test sits right there looking like proof:

```javascript
test("parses the config", () => {
  const result = parseConfig(rawInput);
  expect(result).toBeDefined();
});
```

That passes whether parseConfig works perfectly or returns nothing useful on every input. It checks nothing. Adding more tests like it just raises your coverage number, not your odds of catching a bad change.

So I built ClaimCheck (https://github.com/moonrunnerkc/claimcheck). Instead of trusting the agent's tests, it tries to break them. If a test still passes after the supposedly fixed code is broken on purpose, the test was never really checking the fix, and it gets blocked. Same answer every time, no AI making the call.

So far it's caught every cheat in a set of twelve hand-built cases. Twelve is small, and there's no public release yet, so treat that as a direction, not a finished result.

Some cheats slip through anyway. If the agent writes a real, solid test that locks in the wrong answer, every check passes. The only way to know the answer's wrong is to already know the right one, and nothing in the pull request can tell you that except the agent you're trying to catch. The one thing that helps is a clue from outside it, like a human-written bug report you can run the fix against.

There's a second, wider tool, Swarm Orchestrator (https://github.com/moonrunnerkc/swarm-orchestrator). It flags suspicious
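
To make the break-it-on-purpose idea concrete, here is a minimal sketch in plain Node.js. It is not ClaimCheck's actual implementation. Every name in it (`brokenParseConfig`, `weakTest`, `strongTest`, `passes`, `checksTheFix`) is hypothetical, and so are this `parseConfig` body and the `key = value` input format, which are assumed only for illustration. The rule it demonstrates is the one described above: a test only counts as checking the fix if it passes on the fixed code and fails once that code is deliberately broken.

```javascript
// Hypothetical sketch of the "break the fix and rerun the test" idea.
// None of these names come from ClaimCheck itself.

// Stand-in for the "fixed" code: parses "key = value" lines into an object.
function parseConfig(raw) {
  return Object.fromEntries(
    raw
      .split("\n")
      .filter(Boolean)
      .map((line) => line.split("=").map((part) => part.trim()))
  );
}

// The same function broken on purpose: returns nothing useful on every input.
function brokenParseConfig(_raw) {
  return {};
}

// Equivalent of expect(result).toBeDefined(): passes for almost anything.
function weakTest(impl) {
  const result = impl("port = 8080");
  if (result === undefined) throw new Error("expected a defined result");
}

// A test that pins down the actual behaviour.
function strongTest(impl) {
  const result = impl("port = 8080");
  if (result.port !== "8080") throw new Error("expected port to be 8080");
}

// Runs a test against an implementation and reports pass/fail.
function passes(test, impl) {
  try {
    test(impl);
    return true;
  } catch {
    return false;
  }
}

// A test checks the fix only if it passes on the fixed code
// and fails on the deliberately broken code.
function checksTheFix(test, fixedImpl, brokenImpl) {
  return passes(test, fixedImpl) && !passes(test, brokenImpl);
}

console.log(checksTheFix(weakTest, parseConfig, brokenParseConfig)); // false
console.log(checksTheFix(strongTest, parseConfig, brokenParseConfig)); // true
```

The decision involves no model and no randomness, so the same inputs give the same verdict every time. It also shows the limit described above: if `strongTest` asserted the wrong port, it would still fail against the broken version and so would still be accepted.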