DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints Paper • 2601.18137 • Published 26 days ago • 26
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows Paper • 2512.13168 • Published Dec 15, 2025 • 52
RoboChallenge: Large-scale Real-robot Evaluation of Embodied Policies Paper • 2510.17950 • Published Oct 20, 2025 • 9