Good question! Let me clarify - the MALM and Qwen models don't share tokens directly.
MALM has its own tokenizer in which each function name is a single token. It encodes the query, searches its memory bank, and returns the top-matching functions as plain text (function name, signature, docstring, and code).
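To make the retrieval side concrete, here's a minimal sketch. `FunctionEntry`, `MemoryBank`, and the word-overlap scoring are all hypothetical stand-ins, not MALM's actual API - the real system matches the encoded query against its single-token keys, which is what makes retrieval exact.

```python
# Hypothetical sketch of the retrieval side: each function name acts as a
# single-token key into a memory bank, and retrieval returns stored text.
from dataclasses import dataclass

@dataclass
class FunctionEntry:
    name: str        # the single-token key in MALM's vocabulary
    signature: str
    docstring: str
    code: str

    def as_text(self) -> str:
        """Render the entry as the plain text MALM would return."""
        return f"{self.signature}\n\"\"\"{self.docstring}\"\"\"\n{self.code}"

class MemoryBank:
    """Toy memory bank keyed by function name (one token per key)."""

    def __init__(self, entries: list[FunctionEntry]):
        self.entries = {e.name: e for e in entries}

    def retrieve(self, query: str, top_k: int = 3) -> list[FunctionEntry]:
        # Stand-in scoring by word overlap; real MALM encodes the query and
        # matches it against its single-token keys, so a hit is exact.
        words = set(query.lower().split())
        def score(e: FunctionEntry) -> int:
            return len(words & set(e.docstring.lower().split()))
        return sorted(self.entries.values(), key=score, reverse=True)[:top_k]
```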
This retrieved text is then simply concatenated with the user query to form the prompt for Qwen. Qwen tokenizes this combined prompt with its own tokenizer and generates code.
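Here's a minimal sketch of that hand-off, assuming the HuggingFace `transformers` API; the exact Qwen checkpoint and the `build_prompt` format are assumptions, since the source just says "Qwen".

```python
# Sketch of the hand-off: retrieved text is prepended to the query as plain
# text, then tokenized by Qwen's own tokenizer (not MALM's).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed Qwen variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def build_prompt(query: str, retrieved: list[str]) -> str:
    """Plain-text concatenation: retrieved function text, then the query."""
    context = "\n\n".join(retrieved)
    return f"# Relevant functions:\n{context}\n\n# Task:\n{query}\n"

retrieved = ["def parse_csv(path):\n    '''Parse a CSV file into dicts.'''\n    ..."]
prompt = build_prompt("Read sales.csv and sum the revenue column", retrieved)

inputs = tokenizer(prompt, return_tensors="pt")    # Qwen's tokenizer
output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]  # drop the prompt
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```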
So the flow is (sketched end to end after the list):
- User query goes to MALM
- MALM retrieves relevant functions as text
- Text prompt = query + retrieved code
- Qwen tokenizes this prompt with its own tokenizer
- Qwen generates output
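Tying those five steps together, a single orchestration function might look like the sketch below. It reuses the toy `MemoryBank` and `build_prompt` from the earlier snippets; all of these names are hypothetical, not MALM's actual interface.

```python
# End-to-end sketch of the flow above (hypothetical names throughout).
def answer(query: str, bank: MemoryBank, tokenizer, model) -> str:
    entries = bank.retrieve(query, top_k=3)                       # steps 1-2
    prompt = build_prompt(query, [e.as_text() for e in entries])  # step 3
    inputs = tokenizer(prompt, return_tensors="pt")               # step 4
    output_ids = model.generate(**inputs, max_new_tokens=256)     # step 5
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```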
There's no token-level integration - just text passing between the two models. MALM acts as a retrieval layer that provides relevant context, and Qwen does the generation with that context in its prompt.
The single-token-per-key insight matters only inside MALM, where it enables perfect retrieval. Qwen just sees regular text.