MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28 • 173
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain Paper • 2509.26507 • Published Sep 30 • 535
Less is More: Recursive Reasoning with Tiny Networks Paper • 2510.04871 • Published Oct 6 • 497
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? Paper • 2510.02209 • Published Oct 2 • 53
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge Paper • 2506.21506 • Published Jun 26 • 51
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA Paper • 2505.21115 • Published May 27 • 140
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning Paper • 2504.17192 • Published Apr 24 • 120
RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy Paper • 2503.24388 • Published Mar 31 • 29
Article 🦸🏻#14: What Is MCP, and Why Is Everyone – Suddenly! – Talking About It? Mar 17 • 345
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers Paper • 2502.15007 • Published Feb 20 • 174
Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems Paper • 2502.11098 • Published Feb 16 • 13
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? Paper • 2502.12115 • Published Feb 17 • 46