Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence

📖 Pelican-VL 1.0 Report   |   🤗 Hugging Face   |   🤖 ModelScope
🧰 Quick Start   |   🌐 Project Website   |   🛠️ Evaluation

🚀 News

  • 2025-12-01: The Pelican1.0-VL-3B model checkpoint has been released on 🤗 Hugging Face Link and 🤖 ModelScope Link.

  • 2025-11-13: The Pelican1.0-VL-7B and Pelican1.0-VL-72B model checkpoints have been released on 🤗 Hugging Face Link and 🤖 ModelScope Link.

  • 2025-10-30: We have released the Pelican-VL 1.0 Report. The open-source 7B and 72B models are coming soon. For more details, please check our report!

Introduction

We present Pelican-VL 1.0, a new family of open-source embodied brain models with parameter scales ranging from 7B to 72B. Pelican-VL 1.0 is currently the largest-scale open-source embodied multimodal brain model. Its core advantage lies in the deep integration of large-scale multimodal data with an adaptive, self-correcting learning mechanism.

Overview:

🌟 Highlights:

  • Multimodal Understanding and Reasoning: Pelican-VL processes both visual and textual inputs, trained on massive datasets of images, videos, and cross-modal annotations. It not only recognizes objects accurately but also performs physical reasoning, spatial relationship understanding, and functional prediction based on scene context. For example, in closed environments like kitchens or supermarkets, it can distinguish the placement of fruits and vegetables, counter locations, and plan picking or placing actions accordingly.

  • Spatio-Temporal Cognition: The model's training includes tens of thousands of hours of video and dynamic scene question-answering, enabling it to understand continuous temporal sequences. When processing video frames, Pelican-VL captures object motion and the temporal order of actions, allowing it to make coherent inferences about complex sequential tasks, for instance, determining "which item should be moved first before operating the next."

  • Embodied Interaction Capabilities: In robotic tasks such as object grasping, navigation, and collaborative manipulation, Pelican-VL not only understands high-level task goals but also produces detailed action sequences along with feasibility assessments for each step. This means that upon receiving an instruction, the model can determine appropriate grasp points and devise corresponding manipulation strategies. Its multi-task proficiency spans grasping, navigation, and human–robot interaction, demonstrating strong cross-task generalization.

  • Self-Correction and Iterative Learning: Through DPPO cyclic training, Pelican-VL exhibits a "self-correcting" capability. After each reinforcement learning cycle, the model automatically generates new challenging samples for retraining, similar to repeated practice and reflection. Over time, its weaknesses are gradually addressed, and its abilities continuously improve. This process mirrors the concept of "deliberate practice," allowing Pelican-VL to advance iteratively and achieve performance on par with top-tier proprietary systems.

Performance

Overall Tasks

Performance comparison of Pelican-VL 1.0. (Left) Comparison against models with ≤100B parameters. The shaded (pink) region highlights the performance gain over our baseline. (Right) Comparison against models with ≥100B parameters, including leading open-source and proprietary models, where our model also demonstrates SOTA performance.

Detail Dimensions

Benchmark performance radar comparison of Pelican-VL 1.0 (72B) against other models across nine dimensions.

💡 Downstream Applications

Please see our project website: 🌐 pelican-vl.github.io

🧠 Open-Source Weights

We have released our Pelican models on 🤗 Hugging Face and 🤖 ModelScope:

| Model Name | Parameters | Hugging Face | ModelScope |
|---|---|---|---|
| Pelican1.0-VL-7B | 7B | 🤗 Link | 🤖 Link |
| Pelican1.0-VL-72B | 72B | 🤗 Link | 🤖 Link |
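
If you prefer to download a checkpoint to a local directory before use, the following is a minimal sketch using the Hugging Face and ModelScope CLIs; the repository id X-Humanoid/Pelican1.0-VL-7B is an assumption based on the model names above, so please confirm it on the linked model page.

# Download a checkpoint locally (repo id assumed; verify on the model page)
huggingface-cli download X-Humanoid/Pelican1.0-VL-7B --local-dir ./Pelican1.0-VL-7B

# Or download from ModelScope
modelscope download --model X-Humanoid/Pelican1.0-VL-7B --local_dir ./Pelican1.0-VL-7B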

Quick Start

Here we provide a simple LoRA fine-tuning script together with some embodied data samples, so you can experience how to experiment with embodied data. Training is based on Swift (ms-swift), an LLM training and deployment framework.

๐Ÿ› ๏ธ Installation

# Option 1: install from PyPI
# pip install ms-swift -U

# Option 2: install from source
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .

pip install "qwen-vl-utils[decord]==0.0.11"  # qwen-vl-utils 0.0.11, decord 0.6.0
pip install deepspeed==0.16.9                # distributed training
pip install wandb==0.21.0                    # experiment tracking
pip install msgspec
pip install transformers==4.51.1
pip install flash-attn==2.6.1 --no-build-isolation  # only if your GPU supports FlashAttention
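
As an optional sanity check (a minimal sketch, assuming the commands above completed without errors), you can confirm that the ms-swift CLI is available and that CUDA is visible before launching training:

# Verify the ms-swift CLI and CUDA visibility
swift sft --help | head -n 5
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"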

LoRA Fine-Tuning

Dataset Source

All embodied data used in this demo are JSON files derived from public datasets on Hugging Face:

| Dataset Name | Type | Link |
|---|---|---|
| Cosmos Reasoning SFT Data | Video | 🤗 Link |
| Robopoint GQA Data | Image | 🤗 Link |
| VSI-Bench ScanNetpp Data | Video | 🤗 Link |

Download the three datasets from the links above and place the files in a local directory (e.g., /datasets/xxx) that matches the paths in the JSON files (or modify the JSON paths to point to your local storage).
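
If you later want to add your own samples, the sketch below shows a single training example in the messages-style custom-dataset format that ms-swift accepts; the file path and content are purely illustrative, and the exact schema of the provided JSON files may differ, so inspect one of the downloaded files first.

# Illustrative only: write one custom sample in ms-swift's messages format
cat <<'EOF' > /datasets/my_custom_example.json
[
  {
    "messages": [
      {"role": "user", "content": "<image>Where should the robot place the apple?"},
      {"role": "assistant", "content": "On the empty shelf to the left of the counter."}
    ],
    "images": ["/datasets/images/kitchen_0001.jpg"]
  }
]
EOF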

# Launch LoRA fine-tuning from the command line (8 GPUs).
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift sft \
    --model Qwen2.5-VL-7B-Instruct \
    --dataset /datasets/robopoint_example_500.json \
              /datasets/vsibench_example_500.json \
              /datasets/cosmos_example_500.json \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 5e-5 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout 0.1 \
    --freeze_vit true \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --split_dataset_ratio 0.1 \
    --data_seed 42 \
    --eval_steps 200 \
    --save_strategy epoch \
    --logging_steps 1 \
    --max_length 8192 \
    --output_dir /xxx/output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 16 \
    --save_only_model true \
    --attn_impl flash_attn

After training is complete, use the following command to infer with the trained weights:

  • Here, --adapters should be replaced with the path of the last checkpoint folder generated during training. Since the adapter folder contains the training parameter file args.json, there is no need to specify --model or --system separately; Swift reads these parameters automatically.
# Using an interactive command line for inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift infer \
    --adapters /xxx/output/checkpoint-xxx \
    --stream true \
    --infer_backend pt \
    --max_new_tokens 2048

For more detailed parameters, please refer to the official Swift documentation.
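
If you want to deploy the fine-tuned model without loading the adapter separately, ms-swift can merge the LoRA weights into the base model via swift export; the sketch below reuses the checkpoint placeholder from above.

# Optional: merge the LoRA adapter into the base model weights
# (the merged weights are typically written alongside the checkpoint)
swift export \
    --adapters /xxx/output/checkpoint-xxx \
    --merge_lora true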

Evaluation Reproduction

To facilitate faithful reproduction of our reported results, we summarize our official evaluation settings below.

Please refer to Evaluation.md.

📬 Contact Us

  • Email: {vito.dai, jason.ju}@x-humanoid.com

Citation

If you find our Pelican-VL useful in your research, please cite:

@article{Pelican-VL-1.0,
  title={Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence},
  author={Yi Zhang and Che Liu and Xiancong Ren and Hanchu Ni and Shuai Zhang and Zeyuan Ding and Jiayu Hu and Hanzhe Shan and Zhenwei Niu and Zhaoyang Liu and Yue Zhao and Junbo Qi and Qinfan Zhang and Dengjie Li and Yidong Wang and Jiachen Luo and Yong Dai and Jian Tang and Xiaozhu Ju},
  journal={arXiv preprint arXiv:2511.00108},
  year={2025}
}

๐Ÿ—จ๏ธ Discussions ๐Ÿ—จ๏ธ

If you're interested in Pelican-VL, you're welcome to join our WeChat group for discussions.
