Xuanwu
Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems
Xuanwu VL-2B is a compact vision-language foundation model for industrial content ecosystems. It is designed to balance general multimodal understanding, fine-grained content moderation, and adversarial OCR robustness within an approximately 2B-parameter budget.
The model combines an InternViT-300M vision encoder, a lightweight 2-layer MLP projector, and the Qwen3-1.7B language model, together with dynamic high-resolution perception, business-oriented data curation, structured chain-of-thought supervision, and GRPO-based post-training alignment.
Highlights
- Compact ~2B architecture for deployment-sensitive moderation settings.
- Dynamic high-resolution perception with up to 12 local 448 x 448 tiles plus a global thumbnail.
- Progressive three-stage training: pre-training, mid-training, and post-training.
- Structured moderation reasoning in the form of Observation -> Extraction -> Reasoning -> Conclusion.
- Strong business moderation and adversarial OCR performance while retaining competitive general multimodal capability.
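The dynamic tiling in the highlights above can be sketched as a search over tile grids whose aspect ratio best matches the input image. This is a hypothetical reimplementation (the paper does not publish its selection rule), and `pick_tile_grid` is an assumed name:

```python
from math import inf

def pick_tile_grid(width, height, max_tiles=12):
    # Choose the (cols, rows) grid whose aspect ratio best matches the
    # image, subject to cols * rows <= max_tiles local 448 x 448 tiles.
    target = width / height
    best, best_diff = (1, 1), inf
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(cols / rows - target)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best

cols, rows = pick_tile_grid(1344, 896)  # 3:2 image -> (3, 2): six local tiles
```

The selected grid is then cropped into `cols * rows` local tiles, and a downscaled copy of the full image serves as the global thumbnail.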
Model Details
| Item | Value |
|---|---|
| Model type | Autoregressive vision-language model |
| Architecture | InternViT-300M + 2-layer MLP projector + Qwen3-1.7B |
| Parameters | Approximately 2B |
| Visual front end | Dynamic tiling with up to 12 local crops plus 1 global thumbnail |
| Token control | Pixel Unshuffle; each 448 x 448 tile contributes 256 visual tokens |
| Context length | Up to 16,384 packed tokens |
| Training stack | DeepSpeed, bf16 / AMP, Flash Attention-2 |
| Hardware | 64 x NVIDIA A100 80GB GPUs |
| Training cost | ~3,500 GPU hours |
| Language coverage | Primarily English and Chinese |
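The 256-token-per-tile figure in the table follows from standard ViT patching plus Pixel Unshuffle. The patch size of 14 and the 2x2 unshuffle factor below are assumptions consistent with InternViT, not values stated in this card; only the 256-token result is from the table:

```python
def visual_tokens(tile=448, patch=14, unshuffle=2):
    # Patches per tile side, then Pixel Unshuffle merges each
    # unshuffle x unshuffle group of patches into one visual token.
    side = tile // patch                  # 448 / 14 = 32 patches per side
    return (side // unshuffle) ** 2       # (32 / 2)^2 = 256 tokens

per_tile = visual_tokens()      # 256, matching the table
budget = (12 + 1) * per_tile    # 12 local tiles + 1 thumbnail = 3328 tokens
```

At the maximum tiling, visual tokens therefore occupy 3,328 of the 16,384-token packed context, leaving ample room for text.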
Training
| Stage | Effective scale | Purpose |
|---|---|---|
| Pre-training | 18.63M | Cross-modal alignment and general image-text learning |
| Mid-training | 2.801M | General-capability retention plus moderation and adversarial OCR injection |
| SFT | 8.408M | High-fidelity supervised tuning for rules, format, and reasoning |
| RL | 810k | GRPO post-training rewarding classification accuracy, output format, and OCR character alignment |
The raw pretraining inventory contains 20,078,399 source samples across nine top-level categories, including captioning, chart and table understanding, VQA, OCR, document understanding, science, mathematics, and text-only data. Mid-training and post-training further add business moderation data, adversarial OCR data, and manually reviewed LLM-assisted annotations.
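The RL stage's reward signals (classification, format, OCR character alignment) could be combined as sketched below. This is a minimal hypothetical composition: the function names, the similarity-ratio OCR term, and the weights are all assumptions, since the paper's exact reward breakdown is not reproduced here:

```python
from difflib import SequenceMatcher

SECTIONS = ["Observation", "Extraction", "Reasoning", "Conclusion"]

def format_reward(response):
    # 1.0 if all four bracketed sections appear, in order; else 0.0.
    pos = -1
    for name in SECTIONS:
        i = response.find(f"[{name}]")
        if i <= pos:
            return 0.0
        pos = i
    return 1.0

def ocr_reward(pred_text, gold_text):
    # Character-level alignment via a similarity ratio; the paper's
    # actual OCR reward (e.g. edit-distance based) may differ.
    return SequenceMatcher(None, pred_text, gold_text).ratio()

def total_reward(response, pred_label, gold_label, pred_text, gold_text,
                 w_cls=0.5, w_fmt=0.2, w_ocr=0.3):
    # Hypothetical weights; the paper does not publish this breakdown.
    return (w_cls * float(pred_label == gold_label)
            + w_fmt * format_reward(response)
            + w_ocr * ocr_reward(pred_text, gold_text))
```

In GRPO, a scalar reward like this is computed per sampled response and advantages are normalized within each group of samples for the same prompt.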
Evaluation
Headline Results
| Area | Metric | Xuanwu VL-2B | Reference |
|---|---|---|---|
| General multimodal | OpenCompass average-7 | 67.90 | InternVL 3.5 2B: 64.27 |
| Text-only retention | average-9 | 58.38 | InternVL 3.5 2B: 59.02 |
| Business moderation | average recall over 7 categories | 94.38 | InternVL 3.5 2B: 47.98 |
| Adversarial OCR | weighted overall recall | 82.82 | Gemini-2.5-Pro: 76.72 |
General Multimodal Benchmarks
| Benchmark | InternVL 3.5 2B | Xuanwu VL-2B |
|---|---|---|
| HallusionBench | 46.78 | 47.32 |
| AI2D | 77.95 | 82.19 |
| MMStar | 56.20 | 60.47 |
| OCRBench | 83.10 | 89.80 |
| MMBench v1.1 | 75.08 | 79.02 |
| MMMU (val) | 50.51 | 48.11 |
| MathVista | 60.30 | 68.40 |
| average-7 | 64.27 | 67.90 |
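The average-7 row can be verified directly from the seven benchmark rows above:

```python
xuanwu = [47.32, 82.19, 60.47, 89.80, 79.02, 48.11, 68.40]
internvl = [46.78, 77.95, 56.20, 83.10, 75.08, 50.51, 60.30]

def avg7(scores):
    # Mean of the seven benchmark scores, rounded as reported.
    return round(sum(scores) / len(scores), 2)

# avg7(xuanwu) == 67.90 and avg7(internvl) == 64.27, matching the table.
```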
Business Moderation and Adversarial OCR
Xuanwu VL-2B reaches 94.38 average recall over seven business moderation categories and 82.82 weighted overall recall on the adversarial OCR benchmark. On adversarial OCR, it performs particularly strongly on aigc, noise, warp, and watermark subsets.
Gemini-2.5-Pro is reported in the paper as a zero-shot commercial control model without domain adaptation. The comparison is intended as a task-specialized reference rather than a claim of superiority under identical training conditions.
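Note that the two headline metrics aggregate differently: business moderation uses an unweighted average of per-category recalls, while adversarial OCR uses a recall weighted by subset size. A sketch of the weighted variant, with illustrative counts that are not figures from the paper:

```python
def weighted_recall(hits, totals):
    # Overall recall weighted by subset size: correctly recalled
    # samples over all samples, summed across subsets.
    return sum(hits.values()) / sum(totals.values())

# Made-up counts for illustration only:
hits = {"aigc": 80, "noise": 45, "warp": 30}
totals = {"aigc": 100, "noise": 50, "warp": 40}
overall = weighted_recall(hits, totals)  # 155 / 190
```

Under this scheme, larger subsets contribute proportionally more to the overall score than in a plain per-subset average.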
Prompt Format
For difficult moderation and adversarial OCR cases, the paper uses the following structured response pattern:
[Observation] describe the main subjects and background
[Extraction] recover visible or concealed text and symbols
[Reasoning] compare the extracted evidence against moderation rules
[Conclusion] output the final decision (Safe / Violating-Category)
Reported evaluation results use greedy decoding with temperature = 0 and max_tokens = 8192.
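Responses in this format can be split back into their four sections with a small parser; `parse_moderation_response` is a hypothetical helper, not part of the released code:

```python
import re

def parse_moderation_response(text):
    # Split a structured response on the bracketed section markers.
    parts = re.split(r"\[(Observation|Extraction|Reasoning|Conclusion)\]", text)
    # re.split with a capturing group yields [prefix, name1, body1, name2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

resp = ("[Observation] a street sign partially covered by stickers "
        "[Extraction] hidden text: 'example' "
        "[Reasoning] the recovered text matches no prohibited category "
        "[Conclusion] Safe")
sections = parse_moderation_response(resp)  # sections["Conclusion"] == "Safe"
```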
Intended Uses
- Research on industrial multimodal systems, especially content moderation and adversarial OCR.
- Deployment-sensitive multimodal understanding where a compact 2B-scale model is preferred.
- Assistive moderation workflows with structured explanations and OCR-aware reasoning.
Limitations
- The model can still miss violations in extreme cases such as ultra-dense overlapping watermarks or nearly invisible hidden text.
- In long reasoning chains, the model may occasionally hallucinate or over-attribute a violation because of local high-risk cues.
- Moderation behavior is shaped partly by business-specific policies and data, so transfer to other platforms or jurisdictions is not guaranteed.
- Outputs should be verified before use in high-stakes review decisions.
Citation
@article{zhang2026xuanwu,
  title={Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems},
  author={Zhang, Zhiqian and Zhao, Xu and Xu, Xiaoqing and Liang, Guangdong and Wang, Weijia and Lv, Xiaolei and Li, Bo and Gao, Jun},
  journal={arXiv preprint arXiv:2603.29211},
  year={2026}
}