SWE-bench resolved rates lag behind HumanEval claims
While vendors highlight HumanEval pass rates, SWE-bench results show under 25% success for most general-purpose models on real GitHub issues. An arXiv preprint on code repair notes that hallucination rates increase significantly when models refactor legacy codebases lacking tests. Builders should prioritize repository-level benchmarks over snippet-completion metrics.
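To make the distinction concrete, here is a minimal sketch of what a repository-level metric measures: the fraction of real issues where a generated patch makes the project's own test suite pass ("resolved"), rather than the fraction of isolated snippets completed. The result schema below (`instance_id` / `resolved` fields) is hypothetical and simplified, not the official SWE-bench report format.

```python
# Hypothetical per-issue results: did the model's patch make the
# repository's test suite pass for that GitHub issue?
results = [
    {"instance_id": "repo-1", "resolved": True},
    {"instance_id": "repo-2", "resolved": False},
    {"instance_id": "repo-3", "resolved": False},
    {"instance_id": "repo-4", "resolved": False},
]

def resolved_rate(results):
    """Fraction of issues whose patch passed the repo's tests."""
    if not results:
        return 0.0
    return sum(r["resolved"] for r in results) / len(results)

print(f"resolved rate: {resolved_rate(results):.0%}")  # 1 of 4 issues -> 25%
```

A snippet benchmark like HumanEval checks each completion against a handful of unit tests in isolation; a resolved rate like this one only credits the model when a change integrates with an existing codebase, which is why the two numbers diverge so sharply.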