# Data Workflows
This directory contains the dataset download helpers and export scripts used for the challenge.
Canonical local layout:

- `data/datasets/<dataset_name>/`
- `data/tokenizers/`
- `data/manifest.json`
- `data/docs_selected.jsonl`
- `data/docs_selected.source_manifest.json`
## Downloading Published Data
Download the cached FineWeb export for a tokenizer variant with:
```bash
python3 data/cached_challenge_fineweb.py --variant sp1024
```

This populates `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/`.
By default it downloads the full validation split and 8B training tokens (80 train shards).
To fetch more training shards, pass `--train-shards`:

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 180
```
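After a download finishes, a quick sanity check is to count shard files on disk. Below is a minimal sketch that assumes train shards follow a `fineweb_train_*` naming mirroring the documented `fineweb_val_*` validation split; the train pattern is an assumption, not confirmed by these scripts.

```python
from pathlib import Path

# Hedged post-download check: fineweb_val_* is the documented validation
# naming; fineweb_train_* is an assumed mirror of it.
root = Path("data/datasets/fineweb10B_sp1024")
train_shards = sorted(root.glob("fineweb_train_*"))
val_shards = sorted(root.glob("fineweb_val_*"))
print(f"{len(train_shards)} train shards, {len(val_shards)} val shards")
# The defaults above give 80 train shards; --train-shards 180 gives 180.
```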
The downloader is manifest-driven and can fetch just a prefix of the train shards from a larger published export. With the current shard size of `100_000_000` tokens, 10B retokenized training tokens corresponds to 100 train shards:
```bash
MATCHED_FINEWEB_REPO_ID=your-hf-username/your-dataset-repo \
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=your_50B_export_root \
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 100
```
Validation is always downloaded in full from the fixed `fineweb_val_*` split. Training on the first N train shards means training on the prefix of the same frozen shuffled export, so the data order stays aligned with the baseline for that tokenizer family.
The default published repo is `willdepueoai/parameter-golf`, with the export rooted under the repo subdirectory `datasets/`.
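To make the manifest-driven behavior concrete, here is a minimal sketch of a prefix download using `huggingface_hub`. The manifest schema shown (a `train_shards` list of ordered shard filenames), the remote manifest path, and the `repo_type` are assumptions for illustration; the real `data/manifest.json` layout is whatever the export scripts write.

```python
import json
from huggingface_hub import hf_hub_download

# Illustrative prefix download. Assumes the manifest lists train shard
# filenames in their frozen shuffled order under a "train_shards" key;
# the real manifest schema may differ.
REPO_ID = "willdepueoai/parameter-golf"  # default published repo
ROOT = "datasets"                        # default export root in that repo

manifest_path = hf_hub_download(
    REPO_ID, f"{ROOT}/manifest.json", repo_type="dataset"  # repo_type assumed
)
with open(manifest_path) as f:
    manifest = json.load(f)

n_shards = 100  # 10B tokens at 100_000_000 tokens per shard
for name in manifest["train_shards"][:n_shards]:
    hf_hub_download(REPO_ID, f"{ROOT}/{name}", repo_type="dataset")
```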
## Rebuilding Tokenizers From Published Docs
To retrain a tokenizer or re-export shards from exactly the same selected documents, run the standalone retokenizer against the published docs cache:
```bash
python3 data/download_hf_docs_and_tokenize.py \
  --repo-id your-hf-username/your-dataset-repo \
  --remote-root your_50B_export_root \
  --output-root /tmp/my_custom_tokenizer_export \
  --tokenizer-config ./data/tokenizer_specs.json \
  --max-train-tokens 8000000000
```
The sidecar `docs_selected.source_manifest.json` includes `docs_sha256`, so users can verify they are rebuilding from the exact same document list and order as the baseline export.
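A minimal verification sketch, assuming the sidecar is JSON with a top-level `docs_sha256` field holding the hex digest of the raw bytes of `docs_selected.jsonl` (the field name comes from this README; the exact schema and what the hash covers are assumptions):

```python
import hashlib
import json

# Verify the local docs cache against the published checksum.
# Assumption: docs_sha256 is the SHA-256 hex digest of the raw bytes
# of docs_selected.jsonl; the sidecar's exact schema may differ.
with open("data/docs_selected.source_manifest.json") as f:
    expected = json.load(f)["docs_sha256"]

h = hashlib.sha256()
with open("data/docs_selected.jsonl", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

assert h.hexdigest() == expected, "docs cache does not match published export"
```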
## Useful Knobs
For CPU-heavy exports, useful knobs are:
```bash
MATCHED_FINEWEB_SP_BATCH_SIZE=2048
MATCHED_FINEWEB_TOKENIZER_THREADS=16
MATCHED_FINEWEB_TIKTOKEN_THREADS=16
MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE=512
```
These control, in order: the tokenizer encode batch size during shard export, the tokenizer thread count, the tiktoken thread count, and the batched GPT-2 decode size for the blobstore docs-cache path.
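As a rough illustration of how such knobs are typically consumed, the sketch below reads them from the environment and chunks encoding accordingly; the default values and the `encode_batch` callable are illustrative assumptions, not the scripts' actual internals.

```python
import os

# Illustrative only: how batch-size and thread knobs are commonly wired up.
# The fallback defaults here are placeholders, not the scripts' real defaults.
SP_BATCH_SIZE = int(os.environ.get("MATCHED_FINEWEB_SP_BATCH_SIZE", "1024"))
TOKENIZER_THREADS = int(os.environ.get("MATCHED_FINEWEB_TOKENIZER_THREADS", "8"))

def encode_in_batches(docs, encode_batch):
    """Feed documents to a batch encoder in fixed-size chunks to bound memory."""
    for i in range(0, len(docs), SP_BATCH_SIZE):
        yield from encode_batch(docs[i:i + SP_BATCH_SIZE])
```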
When rebuilding locally, `--max-train-tokens 8000000000` matches the published 8B-train-token export. With the default shard size of `100_000_000`, that produces 80 train shards plus the full validation split.
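The shard arithmetic can be sanity-checked directly:

```python
SHARD_TOKENS = 100_000_000  # default shard size noted above

def train_shard_count(token_budget: int) -> int:
    # Assumes the token budget is an exact multiple of the shard size.
    return token_budget // SHARD_TOKENS

assert train_shard_count(8_000_000_000) == 80    # published 8B export
assert train_shard_count(10_000_000_000) == 100  # the 10B prefix example above
```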