arxiv:2408.07867

Continuous Perception Matters: Diagnosing Temporal Integration Failures in Multimodal Models

Published on Aug 15, 2024

Authors:

Abstract

CP-Bench highlights a fundamental limitation in multimodal models by demonstrating their failure to integrate visual observations over time in a simple cube-counting task.

AI-generated summary

Continuous perception, the ability to integrate visual observations over time in a continuous stream fashion, is essential for robust real-world understanding, yet remains largely untested in current multimodal models. We introduce CP-Bench, a minimal and fully controlled benchmark designed to isolate this capability using an extremely simple task: counting identical cubes in a synthetic scene while the camera moves and only reveals subsets of objects at any moment. Despite the simplicity of the setting, we find that state-of-the-art open-source and commercial models, including Qwen-3-VL, InternVL3, GPT-5, and Gemini-3-Pro, fail dramatically. A static-camera control variant confirms that the failure arises not from object recognition but from an inability to accumulate evidence across time. Further experiments show that neither higher sampling FPS, perception- or spatial-enhanced models, nor finetuning with additional videos leads to meaningful cross-temporal generalization. Our results reveal a fundamental limitation in modern multimodal architectures and training paradigms. CP-Bench provides a simple yet powerful diagnostic tool and establishes a clean testbed for developing models capable of genuine time-consistent visual reasoning.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2408.07867 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2408.07867 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.