Your AI's tests pass. That doesn't mean the code works.


Blizine Admin

Brad Kinnard
Posted on May 31

#ai #testing #opensource #programming

You ask a coding agent to fix a bug. It writes the code, writes the tests, CI goes green, you merge. The bug's still there.

The agent's job was to turn the check green. The honest way to do that is to fix the code. The lazy way is to write a test that passes no matter what the code does. CI can't tell those two apart. A green check means the tests passed, not that the code is right.

It's easy to miss in review, because the test sits right there looking like proof:

```javascript
test("parses the config", () => {
  const result = parseConfig(rawInput);
  expect(result).toBeDefined();
});
```

That passes whether parseConfig works perfectly or returns an empty, useless value on every input. All it checks is that the call didn't throw and didn't return undefined. Adding more tests like it just raises your coverage number, not your odds of catching a bad change.

So I built ClaimCheck (https://github.com/moonrunnerkc/claimcheck). Instead of trusting the agent's tests, it tries to break them. If a test still passes after the supposedly fixed code is broken on purpose, the test was never really checking the fix, and it gets blocked. Same answer every time, no AI making the call.

So far it's caught every cheat in a set of twelve hand-built cases. Twelve is small, and there's no public release yet, so treat that as a direction, not a finished result.

Some cheats slip through anyway. If the agent writes a real, solid test that locks in the wrong answer, every check passes. The only way to know the answer's wrong is to already know the right one, and nothing in the pull request can tell you that except the agent you're trying to catch. The one thing that helps is a clue from outside the pull request, like a human-written bug report you can run the fix against.

There's a second, wider tool, Swarm Orchestrator (https://github.com/moonrunnerkc/swarm-orchestrator). It flags suspicious

📰Dev.to — dev.to
