AudioVisual-Caption/ASID-Captioner-3B
Image-Text-to-Text • 5B
Video Understanding, Audio-Visual, Multimodal LLMs, Video Captioning, Instruction Tuning, Dataset Curation, Qwen-based, Open-source, Fully-Open-MLLMs
[🏠 Homepage] [📖 Arxiv Paper] [🤗 Models & Datasets] [💻 Code]
We build ASID-Caption, a data-and-model suite for fine-grained audiovisual video understanding.
Our goal is to move beyond “one video → one generic caption”. By providing attribute-structured supervision and quality-verified annotations, we enable models to produce descriptions that are more complete, more controllable, and more temporally consistent, covering both visual content and audio cues.