Darwin-9B-Opus


Qwen3.5 Dense 9B | Reasoning | Chain-of-Thought | 131K Context | 201 Languages | BF16 | Apache 2.0


Technical Definitions

| Term | Definition | Measurement |
|---|---|---|
| Model MRI | Layer-level profiling of tensor health indicators | L2 norm, Shannon entropy, std per tensor across all layers |
| `LayerMRI.compare_layers` | Per-tensor A vs B quality comparison yielding optimal `ratio_b` | `score = entropy * 0.5 + std * 0.3 + clamp(norm, 100) * 0.002` per model; `ratio_b = score_b / (score_a + score_b)` |
| MRI-Guided Merge | Per-tensor merge ratios derived from parent diagnostics (70% MRI + 30% genome) | `final_ratio = mri_ratio * 0.7 + genome_ratio * 0.3` |
| DARE-TIES | Merge algorithm: random binary mask on delta, then weighted addition | `merged = A + (B - A) * random_mask(density) * ratio` |
| Transplant A / B | When the MRI ratio falls below 0.05 or above 0.95, one parent is used entirely | No interpolation; direct tensor copy |
| Evolutionary Search | CMA-ES population evolution over genome space (ratio, attn, ffn, embed, density_a, density_b) | Phase 1: 200 steps heuristic proxy; Phase 2: 10 steps real benchmark |
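The DARE-TIES formula in the table can be illustrated with a few lines of NumPy (the array shapes and the `density`/`ratio` values here are illustrative, not taken from an actual Darwin run):

```python
import numpy as np

rng = np.random.default_rng(0)

A = rng.normal(size=(4, 4))   # Father tensor
B = rng.normal(size=(4, 4))   # Mother tensor
density, ratio = 0.5, 0.6     # illustrative genome values

# Random binary mask on the delta, then weighted addition
mask = rng.random(A.shape) < density
merged = A + (B - A) * mask * ratio

# Wherever the mask is 0, the merged tensor keeps the Father's weights
assert np.allclose(merged[~mask], A[~mask])
```

Masked-out positions stay identical to A; masked-in positions move toward B by `ratio`.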

Overview

Darwin-9B-Opus is a 9B-parameter dense reasoning model created with Darwin V5. Both parent models share the identical Qwen3.5-9B architecture: the Mother is a LoRA SFT on the same base, not a different architecture.

| Role | Model | Training |
|---|---|---|
| Father | Qwen/Qwen3.5-9B | Original pre-training + RLHF |
| Mother | Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled | LoRA SFT with text-only Claude 4.6 Opus reasoning chains |

How Darwin V5 Works

Darwin V5 does not use mergekit or any external merge library. It implements DARE-TIES merge directly via PyTorch tensor operations, with MRI-guided per-layer ratios. The algorithm is inspired by the DARE-TIES method but re-implemented from scratch to support per-tensor diagnostic-guided ratios.

Merge Implementation (actual code logic)

# For each tensor pair (A, B) across all safetensor shards:
ta = model_a[key]       # Father tensor
tb = model_b[key]       # Mother tensor

# 1. MRI diagnoses both tensors
diag_a = LayerMRI.diagnose_tensor(ta)  # {norm, entropy, std}
diag_b = LayerMRI.diagnose_tensor(tb)  # {norm, entropy, std}

# 2. Quality score comparison determines ratio_b
score_a = diag_a["entropy"] * 0.5 + diag_a["std"] * 0.3 + min(diag_a["norm"], 100) * 0.002
score_b = diag_b["entropy"] * 0.5 + diag_b["std"] * 0.3 + min(diag_b["norm"], 100) * 0.002
mri_ratio = score_b / (score_a + score_b)  # Higher = Mother is better

# 3. Final ratio = MRI 70% + evolutionary genome 30%
final_ratio = mri_ratio * 0.7 + genome_type_ratio * 0.3

# 4. DARE-TIES merge with per-tensor ratio
mask = torch.rand_like(tb) < density_b
delta = (tb - ta) * mask
merged = (ta + delta * final_ratio).bfloat16()
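The transplant rule from the definitions table (ratio below 0.05 or above 0.95) is not shown in the snippet above. Combined with the merge step, the per-tensor logic might look like this NumPy sketch (the function name and signature are illustrative, not the actual Darwin V5 API):

```python
import numpy as np

def merge_tensor(ta, tb, final_ratio, density_b, rng):
    """DARE-TIES merge with the transplant rule: a NumPy sketch of the
    per-tensor logic described above (names are illustrative)."""
    # Transplant: extreme ratios mean one parent is used entirely,
    # with no interpolation (direct tensor copy)
    if final_ratio < 0.05:
        return ta.copy()          # use Father tensor entirely
    if final_ratio > 0.95:
        return tb.copy()          # use Mother tensor entirely
    # Otherwise: random binary mask on the delta, weighted addition
    mask = rng.random(tb.shape) < density_b
    delta = (tb - ta) * mask
    return ta + delta * final_ratio
```

With `density_b = 1.0` the mask keeps every delta element, reducing to plain linear interpolation by `final_ratio`.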

Pipeline

Phase 0: Model MRI
  For every tensor in both parents, measure:
    - L2 norm (layer energy)
    - Shannon entropy (weight distribution uniformity)
    - Standard deviation (activation spread)
  Compare A vs B quality scores -> per-tensor ratio prescription
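The Phase 0 measurements and the A-vs-B comparison can be sketched as follows. The histogram binning and exact normalization of the real `LayerMRI.diagnose_tensor` are not specified in this card, so treat this as an illustrative approximation; only the score weights come from the definitions table:

```python
import numpy as np

def diagnose_tensor(t, bins=256):
    """Per-tensor MRI sketch: L2 norm (layer energy), Shannon entropy of
    the weight histogram (distribution uniformity), and std (spread)."""
    flat = np.asarray(t, dtype=np.float64).ravel()
    hist, _ = np.histogram(flat, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                       # drop empty bins before taking logs
    return {
        "norm": float(np.linalg.norm(flat)),
        "entropy": float(-(p * np.log2(p)).sum()),
        "std": float(flat.std()),
    }

def mri_ratio(diag_a, diag_b):
    """A-vs-B quality comparison using the weights from the definitions
    table; higher means the Mother (B) tensor scores better."""
    score = lambda d: d["entropy"] * 0.5 + d["std"] * 0.3 + min(d["norm"], 100) * 0.002
    return score(diag_b) / (score(diag_a) + score(diag_b))
```

Identical parents yield a ratio of exactly 0.5, i.e. an even blend.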

Phase 1: Evolutionary Search (200 steps, heuristic proxy)
  Population of 20 genomes (ratio, attn, ffn, embed, density_a, density_b)
  Fitness: heuristic score based on genome balance + differentiation
  Selection -> SLERP crossover -> Gaussian mutation
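The crossover and mutation steps above can be sketched like this: a textbook SLERP over genome vectors plus clipped Gaussian noise. The `sigma` value and the [0, 1] clipping bounds are assumptions for illustration, not values taken from Darwin V5:

```python
import numpy as np

def slerp(g1, g2, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two genome vectors
    (standard SLERP; the exact variant used is not specified)."""
    g1, g2 = np.asarray(g1, float), np.asarray(g2, float)
    n1 = g1 / (np.linalg.norm(g1) + eps)
    n2 = g2 / (np.linalg.norm(g2) + eps)
    theta = np.arccos(np.clip(n1 @ n2, -1.0, 1.0))
    if theta < eps:                    # nearly parallel: fall back to lerp
        return (1 - t) * g1 + t * g2
    return (np.sin((1 - t) * theta) * g1 + np.sin(t * theta) * g2) / np.sin(theta)

def mutate(genome, sigma=0.05, rng=None):
    """Gaussian mutation, clipped to a [0, 1] genome range
    (sigma and bounds are illustrative)."""
    if rng is None:
        rng = np.random.default_rng()
    return np.clip(genome + rng.normal(0.0, sigma, size=len(genome)), 0.0, 1.0)
```

A child genome would then be `mutate(slerp(parent1, parent2))`, keeping all six genes (ratio, attn, ffn, embed, density_a, density_b) in valid range.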

Phase 2: Real Merge + Benchmark (10 steps)
  Top genomes from Phase 1 undergo actual tensor merge
  Each merge: MRI prescription (70%) + genome ratio (30%)
  Fitness: real benchmark score (ARC-Challenge)
  Best model selected and auto-uploaded

Phase 3: Health Check
  Layer-by-layer importance comparison: child vs both parents
  Detect interference (child >> parents) or function loss (parents >> child)
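A minimal sketch of the Phase 3 check, using the per-layer L2 norm as the importance proxy. Both the thresholds and the choice of norm as proxy are assumptions; the card does not define the actual importance metric:

```python
import numpy as np

def health_check(child, father, mother, hi=2.0, lo=0.5):
    """Compare each child tensor against the stronger parent tensor and
    flag interference (child >> parents) or function loss (parents >> child).
    Thresholds hi/lo are illustrative."""
    report = {}
    for key in child:
        c = np.linalg.norm(child[key])
        p = max(np.linalg.norm(father[key]), np.linalg.norm(mother[key]))
        if c > p * hi:
            report[key] = "interference"    # child >> parents
        elif c < p * lo:
            report[key] = "function_loss"   # parents >> child
        else:
            report[key] = "healthy"
    return report
```

In the real pipeline this would iterate over every tensor key shared by the three checkpoints.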

What Makes This Different from Standard Merging

| Capability | Standard DARE-TIES | Darwin V5 |
|---|---|---|
| Implementation | mergekit library call | Direct PyTorch tensor operations |
| Ratio selection | Uniform ratio across all tensors | Per-tensor ratio from MRI diagnosis |
| Pre-merge analysis | None | Tensor-level norm/entropy/std profiling |
| Ratio determination | Human-set or grid search | MRI 70% + evolutionary genome 30% |
| Post-merge validation | Benchmark score only | Layer-by-layer child vs. parents comparison |
| Transplant support | No | ratio < 0.05 -> use A entirely; ratio > 0.95 -> use B entirely |
| Failure diagnosis | "Score went down" | Per-tensor quality delta identifies problematic layers |

Model Specifications

Architecture Qwen3.5 Dense (Gated DeltaNet hybrid)
Total Parameters 9B
Precision BF16
Context Length 131,072 native
Languages 201
Thinking <think> tag chain-of-thought reasoning
License Apache 2.0

Hardware Requirements

| Setup | VRAM | Status |
|---|---|---|
| BF16 full precision | ~20 GB | Baseline footprint |
| NVIDIA RTX 4090 (24 GB) | 24 GB | Comfortable |
| NVIDIA A100 (40 GB) | 40 GB | Very comfortable |
| NVIDIA T4 (16 GB) | 16 GB | Requires quantization |

Usage

Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "FINAL-Bench/Darwin-9B-Opus",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-9B-Opus",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

SGLang

python -m sglang.launch_server \
  --model-path FINAL-Bench/Darwin-9B-Opus \
  --tp 1 \
  --mem-fraction-static 0.90 \
  --context-length 32768 \
  --trust-remote-code

vLLM

vllm serve FINAL-Bench/Darwin-9B-Opus \
  --trust-remote-code \
  --enforce-eager

Evolution Details

Engine Darwin V5 (Evolutionary Merge + Layer-Level Diagnostics)
Merge Method DARE-TIES (direct PyTorch implementation, no external library)
MRI Integration Per-tensor diagnosis: norm, entropy, std -> ratio prescription
Ratio Formula final_ratio = mri_ratio * 0.7 + genome_ratio * 0.3
Evolution Phase 1: 200 steps proxy + Phase 2: 10 steps real benchmark
Best Score 0.8508 (ARC-Challenge)
Infrastructure 4 x NVIDIA H100 NVL (100GB each)

Acknowledgements

  • Korean Government — GPU Support Program research grant
  • Qwen Team — Qwen3.5 base architecture
  • Jackrong — Claude 4.6 Opus Reasoning Distilled model
  • DARE (Yu et al., 2023) and TIES-Merging (Yadav et al., 2023) algorithms (re-implemented, not library-dependent)

Built By

Developer VIDRAFT
Engine Darwin V5
Base Architecture Qwen3.5-9B

Citation

@misc{vidraft_darwin_9b_opus,
  title        = {Darwin-9B-Opus: Diagnostic-Guided Evolutionary Merge},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-9B-Opus}}
}