| author | Nan Liu <45443761+nanlliu@users.noreply.github.com> | 2026-03-19 15:26:46 -0700 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2026-03-19 15:26:46 -0700 |
| commit | 9ac12c26d550481a1a486ce2b450b1ffed60b832 (patch) | |
| tree | fa30460ec2e96320f9f8761c31df31f798490f94 /records/track_10min_16mb/2026-03-18_LowerLR/submission.json | |
| parent | ae882089b58c74d37a02eda8358219f41cd5f4e1 (diff) | |
Record: 10L Mixed Precision: val_bpb=1.2147 (10 layers + int6 middle layers) (#39)
* Add Lower LR submission: val_bpb=1.2230 (MATRIX_LR=0.02)
Systematic LR sweep showed default Muon/Adam learning rates (0.04) were
too high. MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03 gives
consistent improvement. Same 9L/512d architecture, no other changes.
* Add 10L Mixed Precision submission: val_bpb=1.2147
10 transformer layers (vs baseline 9) with mixed int8/int6 compression:
- Full int8 for first/last 3 layers (precision-sensitive)
- Int6 (step=4 rounding) for middle layers 3-6 (compression-friendly)
- Lower LR: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03
- Artifact: 15,928,974 bytes (under 16MB cap)
- Improvement: 0.0097 bpb / 0.0217 nats over baseline (1.2244)
Also adds PRUNE_RATIO and INT4_LAYERS/INT4_STEP support to train_gpt.py
for mixed-precision post-training quantization.
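The int8/int6 split described above can be illustrated with a minimal sketch. It is not the code from train_gpt.py: the function names are hypothetical, and it assumes one plausible reading of "step=4 rounding", namely that int8 codes are snapped to multiples of 4 so that only 64 distinct levels (6 bits) remain and the stored bytes compress better.

```python
def layer_precision(layer_idx, int6_layers=range(3, 7)):
    """Return the assumed precision for a layer of the 10-layer model.

    Layers 3-6 use int6; the first and last three layers stay at int8.
    """
    return "int6" if layer_idx in int6_layers else "int8"


def round_to_step(q_values, step=4):
    """Snap int8 codes to multiples of `step` (hypothetical helper).

    The level index is clamped so the result stays inside [-128, 127].
    Python's round() uses round-half-to-even, so a code of 2 maps to 0.
    """
    lo, hi = -128 // step, 127 // step  # -32 and 31 when step is 4
    snapped = []
    for q in q_values:
        level = max(lo, min(hi, round(q / step)))
        snapped.append(level * step)
    return snapped


# Precision plan for layers 0..9.
plan = [layer_precision(i) for i in range(10)]
print(plan)

# Number of distinct values left after snapping the full int8 range.
levels = set(round_to_step(range(-128, 128)))
print(len(levels))
```

With step 4 the int8 range collapses to 64 levels, which is what makes the middle layers effectively 6-bit while keeping the same storage layout.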
* Revert root train_gpt.py to upstream baseline
The root script should remain the baseline. Submission-specific
modifications (PRUNE_RATIO, INT4_LAYERS, INT4_STEP) only belong
in the records/ folder copy.
Diffstat (limited to 'records/track_10min_16mb/2026-03-18_LowerLR/submission.json')
| -rw-r--r-- | records/track_10min_16mb/2026-03-18_LowerLR/submission.json | 11 |
1 files changed, 11 insertions, 0 deletions
diff --git a/records/track_10min_16mb/2026-03-18_LowerLR/submission.json b/records/track_10min_16mb/2026-03-18_LowerLR/submission.json
new file mode 100644
index 0000000..42ec327
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-18_LowerLR/submission.json
@@ -0,0 +1,11 @@
+{
+  "author": "Nan Liu",
+  "github_id": "nanlliu",
+  "name": "Lower LR",
+  "blurb": "Same 9x512 SP-1024 KV4 tied-embedding baseline architecture with lower Muon/Adam learning rates (MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03). Systematic LR sweep showed default 0.04 was too high; optimal is ~0.02.",
+  "date": "2026-03-18T22:30:00Z",
+  "val_loss": 2.06492760,
+  "val_bpb": 1.22296644,
+  "bytes_total": 15854246,
+  "bytes_code": 50919
+}
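The metadata in submission.json can be sanity-checked against the figures quoted in the commit message. The sketch below is illustrative only: `summarize` is a hypothetical helper, the 16,000,000-byte cap is an assumption (the commit says only "16MB"), and the baseline of 1.2244 is the value quoted in the commit message.

```python
import json

# Numeric fields copied from the submission.json added in this commit.
SUBMISSION_JSON = """
{
  "name": "Lower LR",
  "val_loss": 2.06492760,
  "val_bpb": 1.22296644,
  "bytes_total": 15854246,
  "bytes_code": 50919
}
"""

ASSUMED_CAP_BYTES = 16_000_000  # assumption: "16MB" read as decimal megabytes
BASELINE_BPB = 1.2244           # baseline quoted in the commit message


def summarize(raw, cap=ASSUMED_CAP_BYTES, baseline=BASELINE_BPB):
    """Return size headroom and the bpb gain over the quoted baseline."""
    sub = json.loads(raw)
    return {
        "under_cap": sub["bytes_total"] <= cap,
        "headroom_bytes": cap - sub["bytes_total"],
        "bpb_rounded": round(sub["val_bpb"], 4),
        "gain_vs_baseline": round(baseline - sub["val_bpb"], 4),
    }


summary = summarize(SUBMISSION_JSON)
print(summary)
```

Under these assumptions the rounded val_bpb matches the 1.2230 headline of the Lower LR submission, and the artifact sits below the assumed cap.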
