This record captures the `Simple Baseline` run.
Trainer changes in this snapshot:
- current repository `train_gpt.py` snapshot copied into the record folder
- published `fineweb10B_sp1024` dataset and tokenizer loaded from the new Hugging Face export
- 10-minute wallclock cap on `8xH100`
- periodic validation every `200` steps on the full `fineweb_val_*` split
Configuration:
- Layout: `VOCAB_SIZE=1024 NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2`
- Tied output/input embeddings: `TIE_EMBEDDINGS=1`
- Tied embedding LR: `TIED_EMBED_LR=0.05`
- Batching: `TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=1024`
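As a rough sanity check on model scale, the layout above implies a parameter count in the low tens of millions. A minimal back-of-envelope sketch (assuming no biases, norm parameters ignored, a plain non-gated MLP, and `head_dim = MODEL_DIM / NUM_HEADS` — none of which are confirmed by the code snapshot):

```python
# Hypothetical parameter-count estimate for the layout above.
# Assumptions (not confirmed by train_gpt.py): no biases, norm params
# ignored, non-gated MLP, tied input/output embedding counted once.
VOCAB_SIZE, NUM_LAYERS, MODEL_DIM = 1024, 9, 512
NUM_HEADS, NUM_KV_HEADS, MLP_MULT = 8, 4, 2

head_dim = MODEL_DIM // NUM_HEADS        # 64
kv_dim = NUM_KV_HEADS * head_dim         # 256 (grouped-query attention)

embed = VOCAB_SIZE * MODEL_DIM           # tied, so counted once
attn = MODEL_DIM * MODEL_DIM * 2         # Q and output projections
attn += MODEL_DIM * kv_dim * 2           # K and V projections (GQA)
mlp = MODEL_DIM * (MLP_MULT * MODEL_DIM) * 2  # up + down projections

total = embed + NUM_LAYERS * (attn + mlp)
print(f"~{total / 1e6:.1f}M params")     # ~17.0M params
```

At int8 precision, ~17M parameters is in the right ballpark for the ~15.8 MB serialized artifact reported below.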
Command (track-relevant params):
```bash
NCCL_IB_DISABLE=1 \
RUN_ID=hf_verify_sp1024_8gpu \
DATA_PATH=/root/code/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/root/code/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MAX_WALLCLOCK_SECONDS=600 \
TRAIN_LOG_EVERY=50 \
VAL_LOSS_EVERY=200 \
torchrun --standalone --nproc_per_node=8 /root/code/parameter-golf/train_gpt.py
```
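The `MAX_WALLCLOCK_SECONDS=600` cap explains why training stopped early at step 13780 of 20000. One plausible way such a cap is enforced inside a training loop (a hypothetical sketch; the actual `train_gpt.py` logic is not shown in this record):

```python
import time

# Hypothetical wallclock-cap check; not the actual train_gpt.py code.
MAX_WALLCLOCK_SECONDS = 600
TOTAL_STEPS = 20000

start = time.monotonic()
for step in range(1, TOTAL_STEPS + 1):
    # ... one optimizer step would go here ...
    if time.monotonic() - start >= MAX_WALLCLOCK_SECONDS:
        print(f"wallclock cap hit at step {step}/{TOTAL_STEPS}")
        break
```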
Key metrics (from `train.log`):
- Timed training stopped at `13780/20000` steps due to the wallclock cap.
- Pre-quant eval at stop: `val_loss:2.0606`, `val_bpb:1.2172`
- Post-quant roundtrip eval: `val_loss:2.0727`, `val_bpb:1.2244`
- Exact printed metric: `final_int8_zlib_roundtrip_exact val_bpb:1.22436570`
- Train time: `600038ms` (`step_avg:43.54ms`)
- Peak memory: `10184 MiB allocated`, `10200 MiB reserved`
- Serialized model int8+zlib: `15815847 bytes`
- Code size: `47642 bytes`
- Total submission size int8+zlib: `15863489 bytes`
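The reported sizes and timing are internally consistent; a quick cross-check of the arithmetic, using only the figures above:

```python
# Cross-check the reported figures (all values copied from train.log).
model_bytes = 15_815_847   # serialized model, int8+zlib
code_bytes = 47_642        # train_gpt.py snapshot
total_bytes = 15_863_489   # reported total submission size
assert model_bytes + code_bytes == total_bytes

train_ms = 600_038         # total timed training
steps = 13_780             # steps completed before the cap
assert round(train_ms / steps, 2) == 43.54  # matches step_avg:43.54ms
```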
Training volume:
- Global batch: `524288` tokens/step
- Total train tokens seen: `7224688640`
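The total token count follows directly from the global batch size and the step count at which the wallclock cap fired:

```python
# 524288 tokens/step over the 13780 completed steps (figures from above).
TOKENS_PER_STEP = 524_288
STEPS_COMPLETED = 13_780
total_tokens = TOKENS_PER_STEP * STEPS_COMPLETED
print(total_tokens)  # 7224688640, matching the reported total
```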
Included files:
- `train_gpt.py` (code snapshot used for the run)
- `train.log` (exact remote training log)
- `submission.json` (leaderboard metadata)