records/track_10min_16mb/2026-03-19_10L_MixedPrecision/README.md



This record captures the `10L Mixed Precision` submission.

## Summary

Three key improvements over the baseline:

1. **10 transformer layers** instead of 9, adding depth for better language modeling
2. **Lower learning rates**: `MATRIX_LR=0.02`, `SCALAR_LR=0.02`, `TIED_EMBED_LR=0.03` (vs. defaults of 0.04/0.04/0.05)
3. **Mixed int8/int6 compression**: middle layers (3, 4, 5, 6) use int6 precision (int8 values rounded to the nearest multiple of 4) for better zlib compression, while the first and last layers keep full int8

The 10-layer model at dim=512 has 18.9M parameters, which compress to 17.6MB with standard int8+zlib, 1.6MB over the 16MB cap. Reducing the 4 middle layers to int6 (64 quantization levels instead of 256) brings the compressed size down to 15.9MB, at a cost of only 0.0018 bpb from quantization.
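
The effect of coarser levels on zlib output can be sketched on synthetic data. This is a minimal illustration, not the code in `train_gpt.py`: the helper names (`quantize_int8`, `coarsen`, `zlib_size`), the per-tensor max scaling, the Gaussian stand-in weights, and the clipping range are all assumptions made for the example.

```python
import random
import zlib

def quantize_int8(values, scale):
    """Symmetric int8 quantization: nearest integer, clipped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def coarsen(q, step=4):
    """Snap each int8 value to the nearest multiple of `step`.

    Clipping to [-128, 128 - step] keeps the result inside the int8 range
    while staying on the grid, so step=4 leaves at most 64 distinct levels.
    """
    lo, hi = -128, 128 - step
    return [max(lo, min(hi, round(v / step) * step)) for v in q]

def zlib_size(q):
    """Compressed size in bytes of the values stored as two's-complement int8."""
    raw = bytes(v & 0xFF for v in q)
    return len(zlib.compress(raw, 9))

# Synthetic stand-in for one weight tensor (assumption: roughly Gaussian weights).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
scale = max(abs(w) for w in weights) / 127.0

q8 = quantize_int8(weights, scale)   # 256-level grid
q6 = coarsen(q8, step=4)             # 64-level grid ("int6")

size8, size6 = zlib_size(q8), zlib_size(q6)
print(f"int8+zlib: {size8} bytes, int6+zlib: {size6} bytes")
```

The coarsened values are still stored one byte each; the saving comes entirely from zlib seeing fewer distinct symbols, which is why the change only pays off after compression.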

## Configuration

- Layout: `VOCAB_SIZE=1024 NUM_LAYERS=10 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2`
- Tied output/input embeddings: `TIE_EMBEDDINGS=1`
- Learning rates: `MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03`
- Mixed precision: `INT4_LAYERS=3,4,5,6 INT4_STEP=4` (despite the `INT4_` prefix, a step of 4 leaves 256/4 = 64 levels, i.e. int6, for the middle layers)
- Batching: `TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=1024`

## Command

```bash
RUN_ID=exp45_10L_int6_mid \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MAX_WALLCLOCK_SECONDS=600 \
VAL_LOSS_EVERY=200 \
TRAIN_LOG_EVERY=50 \
NUM_LAYERS=10 \
MATRIX_LR=0.02 \
SCALAR_LR=0.02 \
TIED_EMBED_LR=0.03 \
INT4_LAYERS=3,4,5,6 \
INT4_STEP=4 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Key Metrics

- Pre-quant eval: `val_loss:2.0480`, `val_bpb:1.2129`
- Post-quant (int8/int6 mixed + zlib): `val_loss:2.0510`, `val_bpb:1.2147`
- Exact: `final_int8_zlib_roundtrip_exact val_bpb:1.21474500`
- Quantization gap: 0.0018 bpb (vs baseline's 0.0093)
- Train time: `599732ms` (`step_avg:45.78ms`)
- Steps: 13,100/20,000 (wallclock limited)
- Peak memory: 11,389 MiB allocated
- Artifact: 15,928,974 bytes (code: 48,917 + model: 15,880,057)
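
The figures above can be cross-checked with a few lines of arithmetic. The sketch below only recombines numbers already listed; the one assumption is that the 16MB cap means 16,000,000 bytes (the stricter, decimal reading).

```python
CODE_BYTES = 48_917
MODEL_BYTES = 15_880_057
CAP_BYTES = 16_000_000  # assumption: decimal megabytes

artifact_bytes = CODE_BYTES + MODEL_BYTES
headroom_bytes = CAP_BYTES - artifact_bytes

# Quantization gap: post-quant val_bpb minus pre-quant val_bpb.
quant_gap = round(1.2147 - 1.2129, 4)

# Steps implied by total train time and average step time.
implied_steps = 599_732 / 45.78

print(f"artifact: {artifact_bytes:,} bytes, headroom: {headroom_bytes:,} bytes")
print(f"quantization gap: {quant_gap} bpb, implied steps: {implied_steps:.0f}")
```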

## Compression Analysis

| Layer Group | Precision | Reason |
|---|---|---|
| Layers 0-2 (early) | int8 (256 levels) | Critical for input processing |
| Layers 3-6 (middle) | int6 (64 levels) | Less sensitive, saves ~1.7MB (17.6MB to 15.9MB) |
| Layers 7-9 (late) | int8 (256 levels) | Critical for output quality |
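
As a rough illustration of how the layer grouping in this table follows from the two settings, the sketch below maps each layer index to its number of quantization levels. `layer_levels` is a hypothetical helper written for this example; how `train_gpt.py` actually parses `INT4_LAYERS` and `INT4_STEP` is not shown here.

```python
def layer_levels(num_layers, coarse_layers, step):
    """Map each 0-based layer index to its number of quantization levels.

    Layers listed in `coarse_layers` (comma-separated string) get 256 // step
    levels; all other layers keep the full 256 int8 levels.
    """
    coarse = {int(tok) for tok in coarse_layers.split(",") if tok.strip()}
    return {
        layer: (256 // step if layer in coarse else 256)
        for layer in range(num_layers)
    }

plan = layer_levels(num_layers=10, coarse_layers="3,4,5,6", step=4)
for layer, levels in plan.items():
    print(f"layer {layer}: {levels} levels")
```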

## LR Sweep Results

A sweep on the 9-layer baseline showed that the default LR (0.04) was too high:

| MATRIX_LR | val_bpb (9L baseline) |
|---|---|
| 0.04 (default) | 1.2286 |
| 0.02 (optimal) | 1.2230 |

## Note on Hardware

The run was performed on 8xH200 (`step_avg:45.78ms`). The H100 baseline ran at 43.54ms/step with 9 layers; 10 layers would take an estimated ~47-48ms/step on H100, yielding ~12,500-12,700 steps within the 600s wallclock limit. Results should be comparable.
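
The step estimate is simple division of the wallclock budget by the per-step time. A minimal sketch follows, where `steps_within` is a hypothetical helper and the 47ms and 48ms figures are the estimates quoted above, not measurements:

```python
def steps_within(budget_s, step_ms):
    """Whole training steps that fit in `budget_s` seconds at `step_ms` per step."""
    return int(budget_s * 1000 // step_ms)

h100_low = steps_within(600, 48)      # slower estimate -> fewer steps
h100_high = steps_within(600, 47)     # faster estimate -> more steps
h200_bound = steps_within(600, 45.78) # upper bound at the measured H200 step time

print(f"H100 estimate: {h100_low}-{h100_high} steps; H200 bound: {h200_bound} steps")
```

The H200 bound sits slightly above the 13,100 steps actually logged, which is expected since the run stopped at 599,732ms rather than exactly 600s.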

## Included Files

- `train_gpt.py` (code snapshot)
- `train.log` (training log)
- `submission.json` (metadata)