Vision Task
• An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels (arXiv:2406.09415)
• 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (arXiv:2406.09406)
• VideoGUI: A Benchmark for GUI Automation from Instructional Videos (arXiv:2406.10227)
• What If We Recaption Billions of Web Images with LLaMA-3? (arXiv:2406.08478)
• INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model (arXiv:2407.16198)
• VideoGameBunny: Towards Vision Assistants for Video Games (arXiv:2407.15295)
• LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding (arXiv:2407.15754)
• Theia: Distilling Diverse Vision Foundation Models for Robot Learning (arXiv:2407.20179)
• SHIC: Shape-Image Correspondences with no Keypoint Supervision (arXiv:2407.18907)
• arXiv:2407.21017
• Improving 2D Feature Representations by 3D-Aware Fine-Tuning (arXiv:2407.20229)
• Diffusion Models as Data Mining Tools (arXiv:2408.02752)
• Segment Anything with Multiple Modalities (arXiv:2408.09085)
• Sapiens: Foundation for Human Vision Models (arXiv:2408.12569)
• HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments (arXiv:2408.10945)
• Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos (arXiv:2410.16259)
• DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion (arXiv:2411.04928)