arXiv · 6d ago · Research
AI browsing agents hit a new milestone on WebArena benchmark
A team at DeepMind has published a web agent that scores 78% on WebArena, up from the 52% scored by last year's leader. The paper breaks down the changes in exploration policy.
News
A paper shows that 3B-parameter student models can reach 90% of GPT-4 quality on narrow domains when coached by a larger model during training.
The new open-weight model outperforms prior 7B leaders on MATH and GSM8K, and nearly matches GPT-4-mini on long-context reasoning.