Sharing the test harness I use for prompt regressions
After breaking prod twice with "small" prompt tweaks, I now run every prompt change against 25 fixed scenarios and diff the output against the previous run. If more than 30% of the lines in the diff change, I have to write a justification. It has caught 4 regressions in 6 weeks. Code in comments; happy to expand.
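For anyone who wants the shape of the check before digging into the real code, here is a minimal sketch in Python. It is not the actual harness: `changed_fraction`, `check_scenarios`, and the dict-of-strings layout are hypothetical names chosen for illustration, and "30% of lines" is interpreted here as the share of lines that do not match between the baseline and the new output.

```python
import difflib

# From the post: more than 30% changed lines requires a written justification.
THRESHOLD = 0.30


def changed_fraction(baseline: str, candidate: str) -> float:
    """Fraction of lines that differ between two outputs (0.0 to 1.0)."""
    old = baseline.splitlines()
    new = candidate.splitlines()
    if not old and not new:
        return 0.0
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    # Count lines that appear, in order, in both outputs.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / max(len(old), len(new))


def check_scenarios(baselines: dict[str, str], candidates: dict[str, str]) -> list[str]:
    """Return the names of scenarios whose output changed past the threshold."""
    flagged = []
    for name, baseline in baselines.items():
        # Assumption: every baseline scenario has a candidate output.
        if changed_fraction(baseline, candidates[name]) > THRESHOLD:
            flagged.append(name)
    return flagged
```

Usage would be something like loading the 25 stored baseline outputs into `baselines`, running the new prompt to fill `candidates`, and failing the run if `check_scenarios` returns anything without a justification attached. Dividing by the longer of the two outputs keeps a response that doubles in length from slipping under the threshold.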