diff options
| author | Matthew Li <156706407+mattqlf@users.noreply.github.com> | 2026-03-19 13:28:12 -0400 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2026-03-19 10:28:12 -0700 |
| commit | d84a3e819100504d96879e1e36d022efa5cbb81b (patch) | |
| tree | 99bdebeb83904d27e409f9a9f2df26905fcec06b /records/track_10min_16mb/2026-03-19_SlidingWindowEval/README.md | |
| parent | 194bb8766eb19ee21618490132594d533d1455ad (diff) | |
Add record: Sliding Window Eval (stride=64), val_bpb=1.1925 (#50)
Diffstat (limited to 'records/track_10min_16mb/2026-03-19_SlidingWindowEval/README.md')
| -rw-r--r-- | records/track_10min_16mb/2026-03-19_SlidingWindowEval/README.md | 78 |
1 files changed, 78 insertions, 0 deletions
diff --git a/records/track_10min_16mb/2026-03-19_SlidingWindowEval/README.md b/records/track_10min_16mb/2026-03-19_SlidingWindowEval/README.md new file mode 100644 index 0000000..e4c8ef6 --- /dev/null +++ b/records/track_10min_16mb/2026-03-19_SlidingWindowEval/README.md @@ -0,0 +1,78 @@ +This record implements sliding window evaluation, showing that eval strategies alone can provide significant improvements. + +**Note on `train_gpt.py`:** The included script contains some unused experimental code paths (QAT, looped architectures) that are **all disabled by default** and were not active during this run. Only the sliding window evaluation code (`eval_val_sliding`, `forward_logits`, `EVAL_STRIDE`, `EVAL_BATCH_SEQS`) is used. The command below shows the exact invocation. + +## Key Idea: Sliding Window Evaluation + +The baseline evaluates by chopping the validation set into non-overlapping 1024-token chunks. The problem is that the first token in each chunk has zero context. On average, each token gets ~512 tokens of context. + +Sliding window evaluation uses overlapping windows with a configurable stride. With `EVAL_STRIDE=64` and `TRAIN_SEQ_LEN=1024`, each window advances by 64 tokens, but only the rightmost 64 tokens (which have 960+ tokens of context) are scored. Every token in the validation set is scored exactly once, but with near-maximum context. + +## Results + +| Metric | Naive Baseline | This Submission | +|---|---|---| +| Pre-quant val_bpb | 1.2172 | 1.2196 | +| **Post-quant val_bpb** | **1.2244** | **1.1925** | +| **Improvement** | — | **-0.0319** | +| Training steps | 13,780 | 13,450 | +| Eval time (8xH100) | ~16s | 70s | +| Artifact size | 15,863,489 bytes | 15,874,829 bytes | + +The pre-quant BPB is nearly identical (training is the same). The 0.032 improvement comes entirely from scoring tokens with richer context during evaluation. + +## Configuration + +Architecture and training are identical to the Naive Baseline: +- Layout: `VOCAB_SIZE=1024 NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2` +- Tied output/input embeddings: `TIE_EMBEDDINGS=1` +- Tied embedding LR: `TIED_EMBED_LR=0.05` +- Batching: `TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=1024` + +Evaluation-specific parameters: +- `EVAL_STRIDE=64` (sliding window stride; baseline uses non-overlapping = stride 1024) +- `EVAL_BATCH_SEQS=1024` (number of windows per forward pass for GPU utilization) + +## Command + +```bash +RUN_ID=8xh100_slide64_v2 \ +DATA_PATH=./data/datasets/fineweb10B_sp1024/ \ +TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \ +VOCAB_SIZE=1024 \ +NUM_LOOPS=1 \ +LORA_RANK=0 \ +QAT=0 \ +EVAL_STRIDE=64 \ +EVAL_BATCH_SEQS=1024 \ +MAX_WALLCLOCK_SECONDS=600 \ +TRAIN_LOG_EVERY=200 \ +VAL_LOSS_EVERY=1000 \ +torchrun --standalone --nproc_per_node=8 train_gpt.py +``` + +The `NUM_LOOPS=1 LORA_RANK=0 QAT=0` flags explicitly disable all unused code paths (these are also the defaults). + +## Key Metrics (from `train.log`) + +- Timed training stopped at `13450/20000` steps due to the wallclock cap. +- Pre-quant eval at stop: `val_loss:2.0592`, `val_bpb:1.2196` +- Post-quant sliding window eval: `val_loss:2.0135`, `val_bpb:1.1925` +- Exact printed metric: `final_int8_zlib_roundtrip_exact val_bpb:1.19250007` +- Train time: `600028ms` (`step_avg:44.61ms`) +- Peak memory: `10119 MiB allocated`, `10294 MiB reserved` +- Eval time: `69881ms` (sliding window, stride=64, batch_seqs=1024) +- Serialized model int8+zlib: `15816489 bytes` +- Code size: `58340 bytes` +- Total submission size int8+zlib: `15874829 bytes` + +## Training Volume + +- Global batch: `524288` tokens/step +- Total train tokens seen: `7,055,769,600` + +## Included Files + +- `train_gpt.py` (code snapshot used for the run, includes `eval_val_sliding` function) +- `train.log` (exact remote training log) +- `submission.json` (leaderboard metadata) |
