arxiv:2604.05015

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Published on Apr 6 · Submitted by Chaoyou Fu on Apr 8
#1 Paper of the day
Abstract

Video-MME-v2 presents a comprehensive benchmark for evaluating video understanding models through a progressive hierarchy and group-based evaluation to assess robustness and faithfulness.

AI-generated summary

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck in which errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance when subtitles are available but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.

Community


The group-based non-linear evaluation catches my eye because it ties accuracy to consistency across related queries and coherent multi-step reasoning, not just per-question hits. I'd love to know how they operationalize 'valid reasoning' across a group, especially how they handle cases where a model lands on a correct answer via spurious cues versus actually tracing a justification. They report that thinking-based reasoning leans on textual cues, which nudges scores upward when subtitles exist but hurts in purely visual settings; an ablation removing subtitles across all tasks would help isolate this bias. The arxivlens breakdown helped me parse the method details without diving into the whole appendix, a nice touch for a quick digest. Still, it's worth checking how this scales beyond the current dataset and whether the non-linear scoring could be gamed by models producing plausible but unsupported reasoning.

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Video-MME-v2 introduces a next-generation benchmark for evaluating multimodal video understanding models. Built over 3,300 human-hours with 12 annotators and 50 independent reviewers (up to 5 rounds of quality assurance), it structures video comprehension into a tri-level hierarchy and replaces per-question accuracy with a group-based non-linear evaluation that penalizes guessing. The benchmark reveals a substantial gap between the best frontier model (Gemini-3-Pro at 49.4%) and human experts (90.7%), highlighting how far current systems still fall short of human-level video understanding.

Key Idea

The benchmark organizes video comprehension into three progressively harder levels: visual information aggregation at the base, temporal dynamics modeling in the middle, and complex multimodal reasoning at the top. This tri-level hierarchy ensures that evaluation covers the full spectrum of video understanding rather than testing isolated capabilities.

[Figure: tri-level hierarchy of video comprehension]

Method / Approach

Video-MME-v2 replaces traditional per-question accuracy with a group-based non-linear evaluation protocol. Related questions are grouped together, and a model must answer all questions in a group correctly for the group to count as passed. This design penalizes lucky guessing -- a model that randomly gets individual questions right still fails at the group level, producing a more trustworthy signal of genuine understanding.
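The contrast between per-question and group-level scoring can be sketched in a few lines. This is a minimal illustration of the idea described above, not code from the paper; the function names and the toy data are invented for the example.

```python
# Group-based scoring sketch: a group of related questions counts as
# passed only if every question in it is answered correctly, so isolated
# lucky guesses no longer raise the score.

def per_question_accuracy(results):
    """Fraction of individual questions answered correctly."""
    flat = [ok for group in results for ok in group]
    return sum(flat) / len(flat)

def group_accuracy(results):
    """Fraction of groups in which *all* related questions are correct."""
    return sum(all(group) for group in results) / len(results)

# Toy data: three groups of related questions (True = correct answer).
results = [
    [True, True, True],    # fully consistent, group passes
    [True, False, True],   # one slip, the whole group fails
    [False, True, False],  # scattered hits earn no group credit
]

print(per_question_accuracy(results))  # ~0.67: looks respectable
print(group_accuracy(results))         # ~0.33: the stricter group-level view
```

The gap between the two numbers is exactly the "trustworthy signal" the protocol is after: fragmented correctness inflates per-question accuracy but contributes nothing at the group level.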

[Figure: group-based evaluation protocol]

Results

Gemini-3-Pro, the top-performing frontier model, scores only 49.4% compared to human experts at 90.7% -- a 41.3-point gap. The study also finds that thinking-based reasoning strategies depend heavily on textual cues: they improve performance when subtitles are available but actually hurt accuracy in purely visual settings, suggesting current models lean on language shortcuts rather than true visual reasoning.

[Figure: performance gap between frontier models and human experts]
