# Data Workflows
This directory contains the dataset download helpers and export scripts used for the challenge.
Canonical local layout:

- `data/datasets/<dataset_name>/`
- `data/tokenizers/`
- `data/manifest.json`
- `data/docs_selected.jsonl`
- `data/docs_selected.source_manifest.json`
## Downloading Published Data
Download the cached FineWeb export for a tokenizer variant with:
```bash
python3 data/cached_challenge_fineweb.py --variant sp1024
```

This populates `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/`.
By default it downloads the full validation split and 8B training tokens (80 train shards).
To fetch more training shards, pass `--train-shards`:

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 180
```
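After a download finishes, a quick sanity check is to count shard files on disk. Below is a minimal sketch that assumes train shards follow a `fineweb_train_*` naming mirroring the documented `fineweb_val_*` validation split; the train pattern is an assumption, not confirmed by these scripts.

```python
from pathlib import Path

# Hedged post-download check: fineweb_val_* is the documented validation
# naming; fineweb_train_* is an assumed mirror of it.
root = Path("data/datasets/fineweb10B_sp1024")
train_shards = sorted(root.glob("fineweb_train_*"))
val_shards = sorted(root.glob("fineweb_val_*"))
print(f"{len(train_shards)} train shards, {len(val_shards)} val shards")
# The defaults above give 80 train shards; --train-shards 180 gives 180.
```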
The downloader is manifest-driven and can fetch just a prefix of the train shards from a larger published export. With the current shard size of `100_000_000` tokens, 10B retokenized training tokens corresponds to 100 train shards:
```bash
MATCHED_FINEWEB_REPO_ID=your-hf-username/your-dataset-repo \
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=your_50B_export_root \
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 100
```
Validation is always downloaded in full from the fixed `fineweb_val_*` split. Training on the first N train shards means training on the prefix of the same frozen shuffled export, so the data order stays aligned with the baseline for that tokenizer family.
The default published repo is `willdepueoai/parameter-golf`, with the export rooted under the repo subdirectory `datasets/`.
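To make the manifest-driven behavior concrete, here is a minimal sketch of a prefix download using `huggingface_hub`. The manifest schema shown (a `train_shards` list of ordered shard filenames), the remote manifest path, and the `repo_type` are assumptions for illustration; the real `data/manifest.json` layout is whatever the export scripts write.

```python
import json
from huggingface_hub import hf_hub_download

# Illustrative prefix download. Assumes the manifest lists train shard
# filenames in their frozen shuffled order under a "train_shards" key;
# the real manifest schema may differ.
REPO_ID = "willdepueoai/parameter-golf"  # default published repo
ROOT = "datasets"                        # default export root in that repo

manifest_path = hf_hub_download(
    REPO_ID, f"{ROOT}/manifest.json", repo_type="dataset"  # repo_type assumed
)
with open(manifest_path) as f:
    manifest = json.load(f)

n_shards = 100  # 10B tokens at 100_000_000 tokens per shard
for name in manifest["train_shards"][:n_shards]:
    hf_hub_download(REPO_ID, f"{ROOT}/{name}", repo_type="dataset")
```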
## Rebuilding Tokenizers From Published Docs
To retrain a tokenizer or re-export shards from exactly the same selected documents, run the standalone retokenizer against the published docs cache:
```bash
python3 data/download_hf_docs_and_tokenize.py \
  --repo-id your-hf-username/your-dataset-repo \
  --remote-root your_50B_export_root \
  --output-root /tmp/my_custom_tokenizer_export \
  --tokenizer-config ./data/tokenizer_specs.json \
  --max-train-tokens 8000000000
```
The sidecar `docs_selected.source_manifest.json` includes `docs_sha256`, so users can verify they are rebuilding from the exact same document list and order as the baseline export.
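A minimal verification sketch, assuming the sidecar is JSON with a top-level `docs_sha256` field holding the hex digest of the raw bytes of `docs_selected.jsonl` (the field name comes from this README; the exact schema and what the hash covers are assumptions):

```python
import hashlib
import json

# Verify the local docs cache against the published checksum.
# Assumption: docs_sha256 is the SHA-256 hex digest of the raw bytes
# of docs_selected.jsonl; the sidecar's exact schema may differ.
with open("data/docs_selected.source_manifest.json") as f:
    expected = json.load(f)["docs_sha256"]

h = hashlib.sha256()
with open("data/docs_selected.jsonl", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

assert h.hexdigest() == expected, "docs cache does not match published export"
```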
## Useful Knobs
For CPU-heavy exports, useful knobs are:
```bash
MATCHED_FINEWEB_SP_BATCH_SIZE=2048
MATCHED_FINEWEB_TOKENIZER_THREADS=16
MATCHED_FINEWEB_TIKTOKEN_THREADS=16
MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE=512
```
These control, in order: the tokenizer encode batch size during shard export, the tokenizer thread count, the tiktoken thread count, and the batched GPT-2 decode size for the blobstore docs-cache path.
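As a rough illustration of how such knobs are typically consumed, the sketch below reads them from the environment and chunks encoding accordingly; the default values and the `encode_batch` callable are illustrative assumptions, not the scripts' actual internals.

```python
import os

# Illustrative only: how batch-size and thread knobs are commonly wired up.
# The fallback defaults here are placeholders, not the scripts' real defaults.
SP_BATCH_SIZE = int(os.environ.get("MATCHED_FINEWEB_SP_BATCH_SIZE", "1024"))
TOKENIZER_THREADS = int(os.environ.get("MATCHED_FINEWEB_TOKENIZER_THREADS", "8"))

def encode_in_batches(docs, encode_batch):
    """Feed documents to a batch encoder in fixed-size chunks to bound memory."""
    for i in range(0, len(docs), SP_BATCH_SIZE):
        yield from encode_batch(docs[i:i + SP_BATCH_SIZE])
```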
When rebuilding locally, `--max-train-tokens 8000000000` matches the published 8B-train-token export. With the default shard size of `100_000_000`, that produces 80 train shards plus the full validation split.
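The shard arithmetic can be sanity-checked directly:

```python
SHARD_TOKENS = 100_000_000  # default shard size noted above

def train_shard_count(token_budget: int) -> int:
    # Assumes the token budget is an exact multiple of the shard size.
    return token_budget // SHARD_TOKENS

assert train_shard_count(8_000_000_000) == 80    # published 8B export
assert train_shard_count(10_000_000_000) == 100  # the 10B prefix example above
```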