ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities Paper • 2408.04682 • Published Aug 8, 2024
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains Paper • 2407.18961 • Published Jul 18, 2024
Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling Paper • 2306.03117 • Published Jun 5, 2023