From a15093adad328a650d421e53c078cbd2c45beb0e Mon Sep 17 00:00:00 2001 From: Will DePue Date: Wed, 18 Mar 2026 09:32:01 -0700 Subject: Launch snapshot --- data/README.md | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 66 insertions(+) create mode 100644 data/README.md (limited to 'data/README.md') diff --git a/data/README.md b/data/README.md new file mode 100644 index 0000000..e1920ad --- /dev/null +++ b/data/README.md @@ -0,0 +1,66 @@ +# Data Workflows + +This directory contains the dataset download helpers and export scripts used for the challenge. + +Canonical local layout: +- `data/datasets//` +- `data/tokenizers/` +- `data/manifest.json` +- `data/docs_selected.jsonl` +- `data/docs_selected.source_manifest.json` + +## Downloading Published Data + +Download the cached FineWeb export for a tokenizer variant with: + +```bash +python3 data/cached_challenge_fineweb.py --variant sp1024 +``` + +This populates `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/`. +By default it downloads the full validation split and 8B training tokens (80 train shards). + +To fetch more training shards, pass `--train-shards`: + +```bash +python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 180 +``` + +The downloader is manifest-driven and can fetch only a prefix of train shards from a larger published export. With the current shard size of `100_000_000` tokens, `10B` retokenized training tokens is `100` train shards: + +```bash +MATCHED_FINEWEB_REPO_ID=your-hf-username/your-dataset-repo \ +MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=your_50B_export_root \ +python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 100 +``` + +Validation is always downloaded in full from the fixed `fineweb_val_*` split. Training on the first `N` train shards means training on the prefix of the same frozen shuffled export, so the data order stays aligned with the baseline for that tokenizer family. + +The default published repo is `willdepueoai/parameter-golf`, with the export rooted under the repo subdirectory `datasets/`. + +## Rebuilding Tokenizers From Published Docs + +To retrain a tokenizer or re-export shards from exactly the same selected documents, run the standalone retokenizer against the published docs cache: + +```bash +python3 data/download_hf_docs_and_tokenize.py \ + --repo-id your-hf-username/your-dataset-repo \ + --remote-root your_50B_export_root \ + --output-root /tmp/my_custom_tokenizer_export \ + --tokenizer-config ./data/tokenizer_specs.json +``` + +The sidecar `docs_selected.source_manifest.json` includes `docs_sha256`, so users can verify they are rebuilding from the exact same document list and order as the baseline export. + +## Useful Knobs + +For CPU-heavy exports, useful knobs are: + +```bash +MATCHED_FINEWEB_SP_BATCH_SIZE=2048 +MATCHED_FINEWEB_TOKENIZER_THREADS=16 +MATCHED_FINEWEB_TIKTOKEN_THREADS=16 +MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE=512 +``` + +These control batched tokenizer encoding during shard export, tokenizer thread count, tiktoken thread count, and batched GPT-2 decode for the blobstore docs-cache path. -- cgit v1.2.3