<feed xmlns='http://www.w3.org/2005/Atom'>
<title>parameter-golf.git/records/track_10min_16mb/2026-03-19_10L_MixedPrecision, branch main</title>
<subtitle>Unnamed repository; edit this file 'description' to name the repository.
</subtitle>
<link rel='alternate' type='text/html' href='https://git.blackhao.com/parameter-golf.git/'/>
<entry>
<title>Record: 10L Mixed Precision: val_bpb=1.2147 (10 layers + int6 middle layers) (#39)</title>
<updated>2026-03-19T22:26:46+00:00</updated>
<author>
<name>Nan Liu</name>
<email>45443761+nanlliu@users.noreply.github.com</email>
</author>
<published>2026-03-19T22:26:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.blackhao.com/parameter-golf.git/commit/?id=9ac12c26d550481a1a486ce2b450b1ffed60b832'/>
<id>9ac12c26d550481a1a486ce2b450b1ffed60b832</id>
<content type='text'>
* Add Lower LR submission: val_bpb=1.2230 (MATRIX_LR=0.02)

Systematic LR sweep showed default Muon/Adam learning rates (0.04) were
too high. MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03 gives
consistent improvement. Same 9L/512d architecture, no other changes.
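
For reference, a minimal sketch of how these settings might be wired into
train_gpt.py; the variable names are the ones quoted above, but reading them
from the environment (and the 0.04 defaults) is an assumption, not the actual
diff:

    import os

    # Optimizer learning rates for the three parameter groups named above.
    # Assumption: overridable via environment variables, defaulting to 0.04.
    MATRIX_LR = float(os.environ.get("MATRIX_LR", "0.04"))          # weight matrices (Muon)
    SCALAR_LR = float(os.environ.get("SCALAR_LR", "0.04"))          # scalar params (Adam)
    TIED_EMBED_LR = float(os.environ.get("TIED_EMBED_LR", "0.04"))  # tied embedding (Adam)

    # e.g.  MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03 python train_gpt.py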

* Add 10L Mixed Precision submission: val_bpb=1.2147

10 transformer layers (vs baseline 9) with mixed int8/int6 compression:
- Full int8 for first/last 3 layers (precision-sensitive)
- Int6 (step=4 rounding, sketched after this list) for middle layers 3-6 (compression-friendly)
- Lower LR: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03
- Artifact: 15,928,974 bytes (under 16MB cap)
- Improvement: 0.0097 bpb / 0.0217 nats over baseline (1.2244)
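
The quantization pass, as a hedged sketch (the function and loop structure are
illustrative; only the layer split and the step=4 rounding come from this
record):

    import torch

    def quantize(w, step=1):
        # Symmetric post-training quantization onto the int8 grid [-127, 127].
        # step=1 keeps the full int8 resolution; step=4 keeps only multiples
        # of 4, roughly 64 levels, i.e. the ~6-bit precision described above.
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        q = torch.round(w / (scale * step)) * step
        q = q.clamp(-128 + step, 128 - step)
        return q * scale   # dequantized weights are written back for eval/packing

    def quantize_blocks(blocks):
        n = len(blocks)                        # n == 10 for this submission
        coarse = set(range(3, n - 3))          # middle layers 3..6
        for i, block in enumerate(blocks):
            step = 4 if i in coarse else 1     # int6 middle, full int8 elsewhere
            for p in block.parameters():
                p.data.copy_(quantize(p.data, step=step))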

Also adds PRUNE_RATIO and INT4_LAYERS/INT4_STEP support to train_gpt.py
for mixed-precision post-training quantization.
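
A sketch of how the records/ copy could expose those knobs (the names are from
this record; the parsing and defaults shown here are assumptions):

    import os

    PRUNE_RATIO = float(os.environ.get("PRUNE_RATIO", "0.0"))  # assumed: fraction of weights pruned
    INT4_STEP = int(os.environ.get("INT4_STEP", "1"))          # rounding step on the int8 grid
    INT4_LAYERS = os.environ.get("INT4_LAYERS", "")            # e.g. "3,4,5,6"
    coarse_layers = {int(i) for i in INT4_LAYERS.split(",") if i.strip()}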

* Revert root train_gpt.py to upstream baseline

The root script should remain the baseline. Submission-specific
modifications (PRUNE_RATIO, INT4_LAYERS, INT4_STEP) belong only
in the records/ folder copy.</content>
</entry>
</feed>
