Xuanwu
Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems
Xuanwu VL-2B is a compact vision-language foundation model for industrial content ecosystems. It is designed to balance general multimodal understanding, fine-grained content moderation, and adversarial OCR robustness within an approximately 2B-parameter budget.
The model combines an InternViT-300M vision encoder, a lightweight 2-layer MLP projector, and the Qwen3-1.7B language model, together with dynamic high-resolution perception, business-oriented data curation, structured chain-of-thought supervision, and GRPO-based post-training alignment.
Highlights
- Compact ~2B architecture for deployment-sensitive moderation settings.
- Dynamic high-resolution perception with up to 12 local 448 x 448 tiles plus a global thumbnail.
- Progressive three-stage training: pre-training, mid-training, and post-training.
- Structured moderation reasoning in the form of Observation -> Extraction -> Reasoning -> Conclusion.
- Strong business moderation and adversarial OCR performance while retaining competitive general multimodal capability.
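The dynamic tiling in the highlights above can be sketched as a search over tile grids whose aspect ratio best matches the input image. This is a hypothetical reimplementation (the paper does not publish its selection rule), and `pick_tile_grid` is an assumed name:

```python
from math import inf

def pick_tile_grid(width, height, max_tiles=12):
    # Choose the (cols, rows) grid whose aspect ratio best matches the
    # image, subject to cols * rows <= max_tiles local 448 x 448 tiles.
    target = width / height
    best, best_diff = (1, 1), inf
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(cols / rows - target)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best

cols, rows = pick_tile_grid(1344, 896)  # 3:2 image -> (3, 2): six local tiles
```

The selected grid is then cropped into `cols * rows` local tiles, and a downscaled copy of the full image serves as the global thumbnail.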
Model Details
| Item | Value |
|---|---|
| Model type | Autoregressive vision-language model |
| Architecture | InternViT-300M + 2-layer MLP projector + Qwen3-1.7B |
| Parameters | Approximately 2B |
| Visual front end | Dynamic tiling with up to 12 local crops plus 1 global thumbnail |
| Token control | Pixel Unshuffle; each 448 x 448 tile contributes 256 visual tokens |
| Context length | Up to 16,384 packed tokens |
| Training stack | DeepSpeed, bf16 / AMP, Flash Attention-2 |
| Hardware | 64 x NVIDIA A100 80GB GPUs |
| Training cost | ~3,500 GPU hours |
| Language coverage | Primarily English and Chinese |
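The 256-token-per-tile figure in the table follows from standard ViT patching plus Pixel Unshuffle. The patch size of 14 and the 2x2 unshuffle factor below are assumptions consistent with InternViT, not values stated in this card; only the 256-token result is from the table:

```python
def visual_tokens(tile=448, patch=14, unshuffle=2):
    # Patches per tile side, then Pixel Unshuffle merges each
    # unshuffle x unshuffle group of patches into one visual token.
    side = tile // patch                  # 448 / 14 = 32 patches per side
    return (side // unshuffle) ** 2       # (32 / 2)^2 = 256 tokens

per_tile = visual_tokens()      # 256, matching the table
budget = (12 + 1) * per_tile    # 12 local tiles + 1 thumbnail = 3328 tokens
```

At the maximum tiling, visual tokens therefore occupy 3,328 of the 16,384-token packed context, leaving ample room for text.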
Training
| Stage | Effective scale | Purpose |
|---|---|---|
| Pre-training | 18.63M | Cross-modal alignment and general image-text learning |
| Mid-training | 2.801M | General-capability retention plus moderation and adversarial OCR injection |
| SFT | 8.408M | High-fidelity supervised tuning for rules, format, and reasoning |
| RL | 810k | GRPO post-training rewarding classification accuracy, output format, and OCR character alignment |
The raw pretraining inventory contains 20,078,399 source samples across nine top-level categories, including captioning, chart and table understanding, VQA, OCR, document understanding, science, mathematics, and text-only data. Mid-training and post-training further add business moderation data, adversarial OCR data, and manually reviewed LLM-assisted annotations.
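The RL stage's reward signals (classification, format, OCR character alignment) could be combined as sketched below. This is a minimal hypothetical composition: the function names, the similarity-ratio OCR term, and the weights are all assumptions, since the paper's exact reward breakdown is not reproduced here:

```python
from difflib import SequenceMatcher

SECTIONS = ["Observation", "Extraction", "Reasoning", "Conclusion"]

def format_reward(response):
    # 1.0 if all four bracketed sections appear, in order; else 0.0.
    pos = -1
    for name in SECTIONS:
        i = response.find(f"[{name}]")
        if i <= pos:
            return 0.0
        pos = i
    return 1.0

def ocr_reward(pred_text, gold_text):
    # Character-level alignment via a similarity ratio; the paper's
    # actual OCR reward (e.g. edit-distance based) may differ.
    return SequenceMatcher(None, pred_text, gold_text).ratio()

def total_reward(response, pred_label, gold_label, pred_text, gold_text,
                 w_cls=0.5, w_fmt=0.2, w_ocr=0.3):
    # Hypothetical weights; the paper does not publish this breakdown.
    return (w_cls * float(pred_label == gold_label)
            + w_fmt * format_reward(response)
            + w_ocr * ocr_reward(pred_text, gold_text))
```

In GRPO, a scalar reward like this is computed per sampled response and advantages are normalized within each group of samples for the same prompt.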
Evaluation
Headline Results
| Area | Metric | Xuanwu VL-2B | Reference |
|---|---|---|---|
| General multimodal | OpenCompass average-7 | 67.90 | InternVL 3.5 2B: 64.27 |
| Text-only retention | average-9 | 58.38 | InternVL 3.5 2B: 59.02 |
| Business moderation | average recall over 7 categories | 94.38 | InternVL 3.5 2B: 47.98 |
| Adversarial OCR | weighted overall recall | 82.82 | Gemini-2.5-Pro: 76.72 |
General Multimodal Benchmarks
| Benchmark | InternVL 3.5 2B | Xuanwu VL-2B |
|---|---|---|
| HallusionBench | 46.78 | 47.32 |
| AI2D | 77.95 | 82.19 |
| MMStar | 56.20 | 60.47 |
| OCRBench | 83.10 | 89.80 |
| MMBench v1.1 | 75.08 | 79.02 |
| MMMU (val) | 50.51 | 48.11 |
| MathVista | 60.30 | 68.40 |
| average-7 | 64.27 | 67.90 |
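The average-7 row can be verified directly from the seven benchmark rows above:

```python
xuanwu = [47.32, 82.19, 60.47, 89.80, 79.02, 48.11, 68.40]
internvl = [46.78, 77.95, 56.20, 83.10, 75.08, 50.51, 60.30]

def avg7(scores):
    # Mean of the seven benchmark scores, rounded as reported.
    return round(sum(scores) / len(scores), 2)

# avg7(xuanwu) == 67.90 and avg7(internvl) == 64.27, matching the table.
```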
Business Moderation and Adversarial OCR
Xuanwu VL-2B reaches 94.38 average recall over seven business moderation categories and 82.82 weighted overall recall on the adversarial OCR benchmark. On adversarial OCR, it performs particularly strongly on aigc, noise, warp, and watermark subsets.
Gemini-2.5-Pro is reported in the paper as a zero-shot commercial control model without domain adaptation. The comparison is intended as a task-specialized reference rather than a claim of superiority under identical training conditions.
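Note that the two headline metrics aggregate differently: business moderation uses an unweighted average of per-category recalls, while adversarial OCR uses a recall weighted by subset size. A sketch of the weighted variant, with illustrative counts that are not figures from the paper:

```python
def weighted_recall(hits, totals):
    # Overall recall weighted by subset size: correctly recalled
    # samples over all samples, summed across subsets.
    return sum(hits.values()) / sum(totals.values())

# Made-up counts for illustration only:
hits = {"aigc": 80, "noise": 45, "warp": 30}
totals = {"aigc": 100, "noise": 50, "warp": 40}
overall = weighted_recall(hits, totals)  # 155 / 190
```

Under this scheme, larger subsets contribute proportionally more to the overall score than in a plain per-subset average.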
Prompt Format
For difficult moderation and adversarial OCR cases, the paper uses the following structured response pattern:
[Observation] describe the main subjects and background
[Extraction] recover visible or concealed text and symbols
[Reasoning] compare the extracted evidence against moderation rules
[Conclusion] output the final decision (Safe / Violating-Category)
Reported evaluation results use greedy decoding with temperature = 0 and max_tokens = 8192.
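Responses in this format can be split back into their four sections with a small parser; `parse_moderation_response` is a hypothetical helper, not part of the released code:

```python
import re

def parse_moderation_response(text):
    # Split a structured response on the bracketed section markers.
    parts = re.split(r"\[(Observation|Extraction|Reasoning|Conclusion)\]", text)
    # re.split with a capturing group yields [prefix, name1, body1, name2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

resp = ("[Observation] a street sign partially covered by stickers "
        "[Extraction] hidden text: 'example' "
        "[Reasoning] the recovered text matches no prohibited category "
        "[Conclusion] Safe")
sections = parse_moderation_response(resp)  # sections["Conclusion"] == "Safe"
```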
Intended Uses
- Research on industrial multimodal systems, especially content moderation and adversarial OCR.
- Deployment-sensitive multimodal understanding where a compact 2B-scale model is preferred.
- Assistive moderation workflows with structured explanations and OCR-aware reasoning.
Limitations
- The model can still miss violations in extreme cases such as ultra-dense overlapping watermarks or nearly invisible hidden text.
- In long reasoning chains, the model may occasionally hallucinate or over-attribute a violation because of local high-risk cues.
- Moderation behavior is shaped partly by business-specific policies and data, so transfer to other platforms or jurisdictions is not guaranteed.
- Outputs should be verified before use in high-stakes review decisions.
Citation
@article{zhang2026xuanwu,
  title={Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems},
  author={Zhang, Zhiqian and Zhao, Xu and Xu, Xiaoqing and Liang, Guangdong and Wang, Weijia and Lv, Xiaolei and Li, Bo and Gao, Jun},
  journal={arXiv preprint arXiv:2603.29211},
  year={2026}
}