SWE-bench resolved rates lag behind HumanEval claims
While vendors highlight HumanEval pass rates, SWE-bench results show under 25% success for most general-purpose models on real GitHub issues. An arXiv preprint on code repair notes that hallucination rates increase significantly when models refactor legacy codebases lacking tests. Builders should prioritize repository-level benchmarks over snippet-completion metrics.
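To make the distinction concrete, here is a minimal sketch of what a repository-level metric measures: the fraction of real issues where a generated patch makes the project's own test suite pass ("resolved"), rather than the fraction of isolated snippets completed. The result schema below (`instance_id` / `resolved` fields) is hypothetical and simplified, not the official SWE-bench report format.

```python
# Hypothetical per-issue results: did the model's patch make the
# repository's test suite pass for that GitHub issue?
results = [
    {"instance_id": "repo-1", "resolved": True},
    {"instance_id": "repo-2", "resolved": False},
    {"instance_id": "repo-3", "resolved": False},
    {"instance_id": "repo-4", "resolved": False},
]

def resolved_rate(results):
    """Fraction of issues whose patch passed the repo's tests."""
    if not results:
        return 0.0
    return sum(r["resolved"] for r in results) / len(results)

print(f"resolved rate: {resolved_rate(results):.0%}")  # 1 of 4 issues -> 25%
```

A snippet benchmark like HumanEval checks each completion against a handful of unit tests in isolation; a resolved rate like this one only credits the model when a change integrates with an existing codebase, which is why the two numbers diverge so sharply.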