From a15093adad328a650d421e53c078cbd2c45beb0e Mon Sep 17 00:00:00 2001 From: Will DePue Date: Wed, 18 Mar 2026 09:32:01 -0700 Subject: Launch snapshot --- .../2026-03-17_NaiveBaseline/README.md | 46 ++++++++++++++++++++++ 1 file changed, 46 insertions(+) create mode 100644 records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md (limited to 'records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md') diff --git a/records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md b/records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md new file mode 100644 index 0000000..1eb352a --- /dev/null +++ b/records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md @@ -0,0 +1,46 @@ +This record captures the `Simple Baseline`. + +Trainer changes in this snapshot: +- current repository `train_gpt.py` snapshot copied into the record folder +- published `fineweb10B_sp1024` dataset and tokenizer loaded from the new Hugging Face export +- 10-minute wallclock cap on `8xH100` +- periodic validation every `200` steps on the full `fineweb_val_*` split + +Configuration: +- Layout: `VOCAB_SIZE=1024 NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2` +- Tied output/input embeddings: `TIE_EMBEDDINGS=1` +- Tied embedding LR: `TIED_EMBED_LR=0.05` +- Batching: `TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=1024` + +Command (track-relevant params): +```bash +NCCL_IB_DISABLE=1 \ +RUN_ID=hf_verify_sp1024_8gpu \ +DATA_PATH=/root/code/parameter-golf/data/datasets/fineweb10B_sp1024 \ +TOKENIZER_PATH=/root/code/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \ +VOCAB_SIZE=1024 \ +MAX_WALLCLOCK_SECONDS=600 \ +TRAIN_LOG_EVERY=50 \ +VAL_LOSS_EVERY=200 \ +torchrun --standalone --nproc_per_node=8 /root/code/parameter-golf/train_gpt.py +``` + +Key metrics (from `train.log`): +- Timed training stopped at `13780/20000` steps due to the wallclock cap. +- Pre-quant eval at stop: `val_loss:2.0606`, `val_bpb:1.2172` +- Post-quant roundtrip eval: `val_loss:2.0727`, `val_bpb:1.2244` +- Exact printed metric: `final_int8_zlib_roundtrip_exact val_bpb:1.22436570` +- Train time: `600038ms` (`step_avg:43.54ms`) +- Peak memory: `10184 MiB allocated`, `10200 MiB reserved` +- Serialized model int8+zlib: `15815847 bytes` +- Code size: `47642 bytes` +- Total submission size int8+zlib: `15863489 bytes` + +Training volume: +- Global batch: `524288` tokens/step +- Total train tokens seen: `7224688640` + +Included files: +- `train_gpt.py` (code snapshot used for the run) +- `train.log` (exact remote training log) +- `submission.json` (leaderboard metadata) -- cgit v1.2.3