Xuanwu

Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems


Xuanwu VL-2B is a compact vision-language foundation model for industrial content ecosystems. It is designed to balance general multimodal understanding, fine-grained content moderation, and adversarial OCR robustness within an approximately 2B-parameter budget.

The model combines an InternViT-300M vision encoder, a lightweight 2-layer MLP projector, and a Qwen3-1.7B language backbone, together with dynamic high-resolution perception, business-oriented data curation, structured chain-of-thought supervision, and GRPO-based post-training alignment.

Highlights

  • Compact ~2B architecture for deployment-sensitive moderation settings.
  • Dynamic high-resolution perception with up to 12 local 448 x 448 tiles plus a global thumbnail.
  • Progressive three-stage training: pre-training, mid-training, and post-training.
  • Structured moderation reasoning in the form of Observation -> Extraction -> Reasoning -> Conclusion.
  • Strong business moderation and adversarial OCR performance while retaining competitive general multimodal capability.

Model Details

| Item | Value |
| --- | --- |
| Model type | Autoregressive vision-language model |
| Architecture | InternViT-300M + 2-layer MLP projector + Qwen3-1.7B |
| Parameters | Approximately 2B |
| Visual front end | Dynamic tiling with up to 12 local crops plus 1 global thumbnail |
| Token control | Pixel unshuffle; each 448 x 448 tile contributes 256 visual tokens |
| Context length | Up to 16,384 packed tokens |
| Training stack | DeepSpeed, bf16 / AMP, FlashAttention-2 |
| Hardware | 64 x NVIDIA A100 80GB GPUs |
| Training cost | ~3,500 GPU hours |
| Language coverage | Primarily English and Chinese |
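
The per-tile token budget above follows directly from the tiling scheme. A minimal sketch of the arithmetic, assuming an InternVL-style setup with a 14-pixel ViT patch size and a 2x2 pixel-unshuffle factor (both are assumptions; the card only states the 256-token result):

```python
def visual_tokens(num_tiles: int, tile: int = 448, patch: int = 14,
                  unshuffle: int = 2, thumbnail: bool = True) -> int:
    """Count visual tokens for dynamic tiling (hypothetical helper).

    Each tile is split into (tile / patch)^2 ViT patches, then pixel
    unshuffle merges unshuffle x unshuffle patches into one token.
    """
    patches_per_side = tile // patch                      # 448 / 14 = 32
    tokens_per_tile = (patches_per_side // unshuffle) ** 2  # 16 * 16 = 256
    total_tiles = num_tiles + (1 if thumbnail else 0)
    return total_tiles * tokens_per_tile

# Maximum configuration: 12 local tiles + 1 global thumbnail
print(visual_tokens(12))  # 13 * 256 = 3328
```

At the maximum of 12 local tiles plus the global thumbnail, the visual front end therefore contributes 3,328 of the 16,384-token packed context.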

Training

| Stage | Effective scale | Purpose |
| --- | --- | --- |
| Pre-training | 18.63M | Cross-modal alignment and general image-text learning |
| Mid-training | 2.801M | General-capability retention plus moderation and adversarial OCR injection |
| SFT | 8.408M | High-fidelity supervised tuning for rules, format, and reasoning |
| RL | 810k | GRPO alignment for classification, format, and OCR character alignment |

The raw pretraining inventory contains 20,078,399 source samples across nine top-level categories, including captioning, chart and table understanding, VQA, OCR, document understanding, science, mathematics, and text-only data. Mid-training and post-training further add business moderation data, adversarial OCR data, and manually reviewed LLM-assisted annotations.
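
The RL stage rewards three things at once: the classification decision, adherence to the structured format, and character-level OCR fidelity. The sketch below illustrates how such a composite reward could look; it is not the paper's actual reward function, and the weights and the similarity-based OCR term are assumptions:

```python
import difflib
import re

SECTIONS = ["[Observation]", "[Extraction]", "[Reasoning]", "[Conclusion]"]

def reward(response: str, gold_label: str, gold_text: str) -> float:
    """Toy GRPO-style composite reward (illustrative, not the paper's).

    Combines a format check, a classification match, and a
    character-level similarity score for the extracted OCR text.
    """
    fmt = float(all(tag in response for tag in SECTIONS))

    m = re.search(r"\[Conclusion\]\s*(.+)", response)
    pred = m.group(1).strip() if m else ""
    cls = float(pred == gold_label)

    m = re.search(r"\[Extraction\]\s*(.+)", response)
    extracted = m.group(1).strip() if m else ""
    ocr = difflib.SequenceMatcher(None, extracted, gold_text).ratio()

    # Weights are arbitrary placeholders for illustration.
    return 0.2 * fmt + 0.5 * cls + 0.3 * ocr
```

A group of sampled responses would be scored this way and their advantages normalized within the group, which is the defining step of GRPO.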

Evaluation

Headline Results

| Area | Metric | Xuanwu VL-2B | Reference |
| --- | --- | --- | --- |
| General multimodal | OpenCompass average-7 | 67.90 | InternVL 3.5 2B: 64.27 |
| Text-only retention | average-9 | 58.38 | InternVL 3.5 2B: 59.02 |
| Business moderation | Average recall over 7 categories | 94.38 | InternVL 3.5 2B: 47.98 |
| Adversarial OCR | Weighted overall recall | 82.82 | Gemini-2.5-Pro: 76.72 |

General Multimodal Benchmarks

| Benchmark | InternVL 3.5 2B | Xuanwu VL-2B |
| --- | --- | --- |
| HallusionBench | 46.78 | 47.32 |
| AI2D | 77.95 | 82.19 |
| MMStar | 56.20 | 60.47 |
| OCRBench | 83.10 | 89.80 |
| MMBench v1.1 | 75.08 | 79.02 |
| MMMU (val) | 50.51 | 48.11 |
| MathVista | 60.30 | 68.40 |
| average-7 | 64.27 | 67.90 |
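
The average-7 figures can be reproduced directly from the seven benchmark rows:

```python
# Per-benchmark scores in table order: HallusionBench, AI2D, MMStar,
# OCRBench, MMBench v1.1, MMMU (val), MathVista.
xuanwu = [47.32, 82.19, 60.47, 89.80, 79.02, 48.11, 68.40]
internvl = [46.78, 77.95, 56.20, 83.10, 75.08, 50.51, 60.30]

def avg7(scores):
    """Unweighted mean over the seven benchmarks, rounded to 2 decimals."""
    return round(sum(scores) / len(scores), 2)

print(avg7(xuanwu), avg7(internvl))  # 67.9 64.27
```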

Business Moderation and Adversarial OCR

Xuanwu VL-2B reaches 94.38 average recall over seven business moderation categories and 82.82 weighted overall recall on the adversarial OCR benchmark. On adversarial OCR, it performs particularly strongly on aigc, noise, warp, and watermark subsets.

Gemini-2.5-Pro is reported in the paper as a zero-shot commercial control model without domain adaptation. The comparison positions Xuanwu VL-2B as a task-specialized reference point; it is not a claim of superiority under identical training conditions.

Prompt Format

For difficult moderation and adversarial OCR cases, the paper uses the following structured response pattern:

[Observation] describe the main subjects and background
[Extraction] recover visible or concealed text and symbols
[Reasoning] compare the extracted evidence against moderation rules
[Conclusion] output the final decision (Safe / Violating-Category)

Reported evaluation results use greedy decoding with temperature = 0 and max_tokens = 8192.
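
Responses in this format can be validated and parsed mechanically. The helper below is an illustrative sketch, not tooling shipped with the model:

```python
import re

def parse_moderation_response(text: str) -> dict:
    """Split an Observation/Extraction/Reasoning/Conclusion response
    into a dict keyed by section name (illustrative helper)."""
    parts = re.split(r"\[(Observation|Extraction|Reasoning|Conclusion)\]", text)
    # re.split keeps the captured section names at odd indices,
    # with each section's body immediately following its name.
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

resp = ("[Observation] A promotional image with dense text overlays\n"
        "[Extraction] 'limited offer' watermark in the corner\n"
        "[Reasoning] The text matches the advertising-spam rule\n"
        "[Conclusion] Violating-Spam")
print(parse_moderation_response(resp)["Conclusion"])  # Violating-Spam
```

A downstream moderation pipeline could route on the `Conclusion` field while surfacing the other sections as the human-readable rationale.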

Intended Uses

  • Research on industrial multimodal systems, especially content moderation and adversarial OCR.
  • Deployment-sensitive multimodal understanding where a compact 2B-scale model is preferred.
  • Assistive moderation workflows with structured explanations and OCR-aware reasoning.

Limitations

  • The model can still miss violations in extreme cases such as ultra-dense overlapping watermarks or nearly invisible hidden text.
  • In long reasoning chains, the model may occasionally hallucinate or over-attribute a violation because of local high-risk cues.
  • Moderation behavior is shaped partly by business-specific policies and data, so transfer to other platforms or jurisdictions is not guaranteed.
  • Outputs should be verified before use in high-stakes review decisions.

Citation

@article{zhang2026xuanwu,
  title={Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems},
  author={Zhang, Zhiqian and Zhao, Xu and Xu, Xiaoqing and Liang, Guangdong and Wang, Weijia and Lv, Xiaolei and Li, Bo and Gao, Jun},
  journal={arXiv preprint arXiv:2603.29211},
  year={2026}
}
