New evaluation framework for reasoning tasks released
The authors introduce a benchmark targeting multi-step logical deduction. Initial results show significantly more variance among open-weight models than among closed systems. This suggests that current alignment techniques may prioritize helpfulness over rigorous accuracy.