# C512 "Why slow + plateau" diagnostic campaign (2026-06-17)
Standalone record of the 7-experiment campaign the user proposed to diagnose why EP at C=512 BPE
plateaus at val CE 2.40 (orphan) and never reaches the BP capability band (1.0–1.5). Self-contained;
companion logs in `/tmp/lt_ep/`, general arc in `FINDINGS.md`.
---
## 0. Setup and the user's hypothesis
- **Models** (TinyStories BPE-4096, C=512 H=16 T=256 B=24; random ln4096 = 8.318):
  - EP "orphan" (lr 8e-4, warmup 800, resinit 0.1, holo N=2, t2sel 40, jr_floor 0.1, res_target
    1.5e-3, jr_max 32, pema 0.999): **best val CE 2.4037**, ran full 20k, zero excursions.
  - Standard BP transformer (same C/H/T, mlp=4): **best val CE 1.6953**.
- Target: BP capability band **1.0–1.5** (BP-C512 itself is 1.70; BP needs C≥1024 to enter the band).
- **User's hypothesis (verbatim intent):** the "full-time slow + plateau" is **not** simple EP
LR-inequivalence; it is a conflict between the **equilibrium architecture's contractive operating
point** and **long-context mixing**, with LR merely expressing it as "8e-4 lives, 9e-4 explodes."
Supporting prior: at a matched tight residual operating point EP ≥ BPTT; the apparent gap is
EP-tight vs loose-BPTT (non-contractive), and the gap grows with T.
- **Three plateau types the user distinguished:** (1) invalid plateau (res ~1e-2, gate skipping,
λ pinned); (2) valid-but-over-contracted plateau (res tiny, cos high, gate-skip low, still far
from BP); (3) controller-fight / estimator-controller mismatch.
- **Priority order requested:** exp3 (BPTT+ctl) → exp2 (LR-sweep) → exp4 (warmup/gate decouple)
→ exp5 (λ grid) → exp6/exp7 (branch + mixing). exp1 (triangulation) foundational.
---
## 1. The seven experiments — spec, what ran, result, read
### exp3 — C512 BPTT+controller (the decisive matched cut)
- **Spec:** replicate the orphan EP recipe (resinit, qk-norm, warmup, jr_floor, res_target, gate,
pema, lr 8e-4) but swap the EP task gradient for exact BPTT. If it plateaus ~2.2–2.4 → tax is
architecture/controller; if ~1.9 → EP estimator/noise.
- **Ran:** `--mode bptt`, all else matched, 20k steps, fuse on.
- **Result:** descended cleanly to **best 3.85 @step 1400**, then **destabilized at step 1600**
(jr→32, res→0.07) and lodged in a broken basin (val 6.27, res 0.068 — just under the 0.1 fuse,
no abort) for 8400+ steps. Killed at step ~12.8k. **Worse than the EP orphan (3.85 vs 2.40).**
- **Read:** the answer came back neither 2.2–2.4 nor 1.9 — BPTT **broke**. At C512 the contraction
controller does **not** keep the exact gradient on the manifold; EP's implicit contraction-bias is
what kept the orphan alive. Inverts the S0 result (BPTT+ctl 1.635 < EP 1.676). *Caveat: single
lr/seed; BPTT might be stable at another lr.*
### exp2 — one-step loss-decrease LR-sweep at the plateau ckpt
- **Spec:** fixed ckpt + batch, gradients {BPTT, EP-task, EP-task+jac}, one-step update over an lr
grid, same-graph eval ΔCE. Answers "is it LR-inequivalence."
- **Result** (ΔCE, negative = better): BPTT best **−0.161** (lr 1e-2); EP best **−0.042** (lr 1e-4),
**diverges for lr ≥ 3e-4**. cos(g_EP,g_BPTT) 0.107, k=|g_EP|/|g_BPTT| ≈ 486.
- **Read:** **not LR.** EP's best one-step descent over *all* lr is ~4× smaller than BPTT's
(−0.042 vs −0.161); rescaling the magnitude cannot fix a direction whose cosine with the true
gradient is only 0.107.
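The logic of the one-step sweep can be illustrated on a toy problem. The sketch below is not the campaign's `lr_sweep.py`: the quadratic `loss`, the helper `one_step_sweep`, and both gradient vectors are hypothetical stand-ins. It shows that when lr is swept freely, the best achievable one-step decrease depends on the direction's alignment with the true gradient, not on its norm. The real CE surface is not quadratic, so the sizes here are illustrative only.

```python
import math

def loss(w):
    # Toy quadratic stand-in for the fixed-batch CE; minimum at w = 0.
    return 0.5 * sum(x * x for x in w)

def one_step_sweep(w, g, lrs):
    """Return {lr: loss(w - lr*g) - loss(w)} for a fixed point and a fixed gradient."""
    base = loss(w)
    return {lr: loss([wi - lr * gi for wi, gi in zip(w, g)]) - base for lr in lrs}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

w = [1.0, 0.0]
g_true = [1.0, 0.0]                    # exact gradient of the toy loss at w
g_est = [0.1, math.sqrt(1.0 - 0.01)]   # assumed estimate: unit norm, cos 0.1 to g_true

lrs = [10 ** (e / 4) for e in range(-16, 5)]   # log grid from 1e-4 to 1e1

best_true = min(one_step_sweep(w, g_true, lrs).values())
best_est = min(one_step_sweep(w, g_est, lrs).values())

print(f"cos(g_est, g_true) = {cosine(g_est, g_true):.3f}")
print(f"best one-step delta, true gradient: {best_true:+.4f}")
print(f"best one-step delta, estimate:      {best_est:+.4f}")
```

In this toy the best decrease scales with cos², so multiplying `g_est` by any constant only shifts which lr is optimal and leaves the best delta unchanged.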
### exp1 — gradient triangulation, multi-checkpoint (foundational)
- **Spec:** at step 0/200/800/2000/plateau, per-group cos of {g_EP, g_BPTT-150, g_BPTT-400},
norm-ratio k, batch-to-batch variance, + res.
- **Result:**
| ckpt | val | res | EP·BPTT150 | BPTT150·400 | k | EP self-cos | BPTT self-cos |
|---|---|---|---|---|---|---|---|
| step0 (random) | 9.90 | 2.1e-2 | 0.261 | 0.459 | 0.319 | +0.654 | +0.617 |
| step200 | 6.07 | 4.2e-9 | **0.989** | 1.000 | 1.000 | +0.371 | +0.376 |
| step800 | 4.11 | 3.9e-4 | 0.964 | 0.935 | 0.844 | +0.755 | +0.721 |
| step2000 | 3.56 | 2.9e-4 | 0.885 | 0.845 | 0.675 | +0.445 | +0.358 |
| **plateau (2.40)** | 2.40 | 2.6e-5 | **−0.045** | **1.000** | **4223** | **−0.273** | **+0.957** |
- **Read (decisive):** the EP estimator is **faithful while descending** (cos 0.99→0.89, steps
200–2000) and undergoes an **SNR/coherence collapse at the optimum** — cos→0, k→4000, and crucially
the **batch self-cos goes −0.27** (EP gradients on different batches anti-correlate) while BPTT is
+0.96. BPTT-150≡BPTT-400 (cos 1.00) the entire time ⇒ the true gradient is always well-defined.
**The 2.40 plateau is an EP estimator bias-floor / batch-incoherence near the optimum** — not
horizon ambiguity (ruled out here), not LR (exp2).
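The three quantities in the table (cross-estimator cosine, norm ratio k, batch self-cos) can be sketched as below. This is a minimal illustration, not `triangulation.py`: the definition of batch self-cos as the mean pairwise cosine across batches is an assumption, and the toy vectors are constructed by hand rather than taken from any checkpoint.

```python
import math
from itertools import combinations

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def norm_ratio(g_est, g_ref):
    # k = |g_est| / |g_ref|
    return norm(g_est) / norm(g_ref)

def batch_self_cos(grads):
    # Assumed definition: mean pairwise cosine between gradients that the
    # same estimator produces on different batches.
    pairs = list(combinations(grads, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

reference = [1.0, 0.0, 0.0]   # hypothetical well-defined "true" gradient

# Toy descending regime: batch-specific terms are small next to the shared part.
coherent = [[1.0, 0.1, 0.0], [1.0, -0.1, 0.0], [1.0, 0.0, 0.1]]
# Toy near-optimum regime: the shared part is tiny next to batch-specific terms.
incoherent = [[0.01, 100.0, 0.0], [0.01, -100.0, 0.0], [0.01, 0.0, 100.0]]

coh = batch_self_cos(coherent)
inc = batch_self_cos(incoherent)
k = norm_ratio(incoherent[0], reference)
align = cosine(incoherent[0], reference)

print(f"self-cos (coherent)   = {coh:+.3f}")
print(f"self-cos (incoherent) = {inc:+.3f}")
print(f"k = {k:.1f}, cos to reference = {align:+.5f}")
```

The incoherent toy set reproduces the qualitative signature of the plateau row: near-zero alignment with the reference, a large norm ratio, and a negative batch self-cos.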
### exp4 — warmup / validity-gate decouple (3 arms)
- **Spec:** armA current warmup; armB full-strength contraction early + task LR still warmup;
armC no-warmup + smaller resinit / lower branch LR. Metric: gate-skip / res-under-gate / early val.
- **Result:** armA (warmup): res stuck ~1.5e-2 for ~50 steps → gate skips the nudge → **no task
learning for ~50–60 steps** (cos≈0, the reported grad is pure jacreg); cos jumps to 0.99 once
res<gate by step ~60; best 4.11@800. armB (`cprewarm 200`): res 1.7e-2 after the pre-phase,
dead-zone only **partially** mitigated (val 5.97 vs 6.07 @100, best 4.13@600). armC (no-warmup +
resinit 0.03): **no dead-zone** (cos 0.99 from step 20, best 4.03@300) but a violent excursion
(res→0.14 @400, recovered), ended 4.71.
- **Read:** the warmup×gate dead-zone is real; establishing contraction once (armB) isn't enough —
warmup's tiny lr can't deepen it. Smallest-resinit + no-warmup (armC) learns fastest early but is
excursion-prone.
### exp5 — λ floor / target / jr_max grid (warm-start from 2.40)
- **Spec:** floor {0.1,0.05,0.03} × target {1.5e-3,3e-3,5e-3} × jr_max {4,8,16,32}, 1–2k steps, fuse on.
- **Result (best val CE, 1200 warm steps):** jr_max 4 → 2.376, **jr_max 8 → 2.3648**, jr_max 16 →
2.416, jr_max 32 (base) → 2.463; floor 0.05/target1.5e-3 → 2.477; floor 0.03/target5e-3 → 2.482.
(orphan = 2.404.)
- **Read:** **jr_max ≤ 8 beats jr_max 32 by ~0.09–0.10** (2.376 / 2.365 vs 2.463) and dips
~0.03–0.04 below the orphan (2.404), so the jr_max ceiling was a small tax; floor/target
relaxation does not help. Still far from 1.0–1.5: the estimator floor dominates.
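For reference, the gaps between configurations can be recomputed directly from the values reported above. The dictionary keys are labels chosen here for illustration; the numbers are the ones quoted in the result.

```python
orphan = 2.404   # orphan best val CE, as quoted in this section

# Best val CE after 1200 warm steps, as reported for exp5.
results = {
    "jr_max=4": 2.376,
    "jr_max=8": 2.3648,
    "jr_max=16": 2.416,
    "jr_max=32 (base)": 2.463,
    "floor=0.05, target=1.5e-3": 2.477,
    "floor=0.03, target=5e-3": 2.482,
}

base = results["jr_max=32 (base)"]
ranked = sorted(results.items(), key=lambda kv: kv[1])

for name, ce in ranked:
    print(f"{name:28s} {ce:.4f}  vs base {ce - base:+.4f}  vs orphan {ce - orphan:+.4f}")

best_name, best_ce = ranked[0]
```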
### exp6 — branch-growth + attention entropy + qk-RMS (incl. at plateau)
- **Spec:** log |attn|/|xin|, |ffn|/|xin|, |WO|, |pj|, |fc|, attention entropy, qk RMS, logit std;
if branch contribution is small at the plateau, add trainable residual gates α with larger LR.
- **Result:** training (resinit 0.1): |WO| 1.95→4.91, |pj|→6.57, attn/xin 56→3.5, attn entropy
4.56→3.88, qk_rms pinned 1.0. **At the plateau:** |pj|=44, |fc|=58, attn/xin≈1.0, attn entropy
3.62, qk_rms 1.0.
- **Read:** branches grow fine and are **large** at the plateau ⇒ **no "stuck-small branch"** — the
trainable-α-gate intervention is **not indicated**. qk-norm is bounding qk RMS (1.0) as designed.
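As a reading aid for the entropy numbers, the sketch below computes a mean attention entropy and an RMS. It assumes causal masking, natural logarithms, and a plain average over query positions; the actual reduction used by `--probe_branch` is not stated here, so treat those choices as assumptions. Under them, perfectly uniform attention at T=256 gives a ceiling of about 4.56, which coincides with the reported starting value.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    # Shannon entropy in nats.
    return -sum(q * math.log(q) for q in p if q > 0.0)

def mean_causal_entropy(logit_rows):
    """logit_rows[t] holds the t+1 logits query t may attend to (causal mask assumed)."""
    return sum(entropy(softmax(row)) for row in logit_rows) / len(logit_rows)

def rms(v):
    return math.sqrt(sum(x * x for x in v) / len(v))

T = 256
uniform_rows = [[0.0] * (t + 1) for t in range(T)]   # all-equal logits: uniform attention
ceiling = mean_causal_entropy(uniform_rows)          # equals ln(T!) / T

print(f"uniform causal ceiling at T={T}: {ceiling:.4f}")
print(f"closed form lgamma(T+1)/T:      {math.lgamma(T + 1) / T:.4f}")
print(f"rms of a unit-RMS vector:       {rms([1.0, -1.0, 1.0, -1.0]):.1f}")
```

If the probe does follow these conventions, the drop from 4.56 to 3.62 would measure how far attention has sharpened away from uniform.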
### exp7 — mixing / influence length + per-position CE (EP vs BP)
- **Spec:** finite-diff |∂z*_t/∂x_{t−d}| vs distance + per-position CE, comparing tight-EP,
BPTT-ctl-tight, BPTT-ctl-loose, standard BP.
- **Bug caught & fixed:** the EP probe first embedded the input with random-init weights (before
loading the ckpt) → garbage (per-pos CE ~7); fixed to embed post-load and re-ran.
- **Result — influence ||Δ·_q||/δ vs distance d:**
| d | 1 | 8 | 16 | 64 | 200 |
|---|---|---|---|---|---|
| BP | 5.34 | 1.81 | 0.88 | 0.15 | 0.003 |
| EP-orphan | 0.238 | 0.102 | 0.071 | 0.019 | 0.006 |
| BP ÷d1 | 1.0 | 0.34 | 0.16 | 0.029 | 0.0006 |
| EP ÷d1 | 1.0 | 0.43 | 0.30 | 0.080 | 0.025 |
- **Per-position CE:** BP uniformly ~0.7–1.0 lower than EP at every position (no long-range-specific gap).
- **Read:** both decay with distance. EP is ~20× weaker in **absolute** coupling at short range
(22× at d=1, shrinking to ~8× at d=64, and reversed at d=200: 0.006 vs 0.003; scale-caveated,
z* vs hidden norm), but **normalized to d=1, EP reaches farther than BP** at every distance
measured. ⇒ **does NOT support (and arguably refutes) "contraction cuts long-range mixing."**
EP's deficit is **uniform across position**, consistent with the estimator floor (exp1), not a
mixing-length deficit.
- **Honest gap:** a stable BPTT-ctl-tight ckpt does not exist (BPTT broke, exp3), so the cross-method
comparison is **EP-vs-BP only**, not the full tight/loose/BP set requested.
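The normalized rows of the influence table follow from the raw rows by dividing by the d=1 entry. The short check below recomputes them from the reported values and also prints the absolute BP/EP ratio at each distance; the variable names are chosen here for illustration.

```python
distances = [1, 8, 16, 64, 200]
bp = [5.34, 1.81, 0.88, 0.15, 0.003]      # reported BP influence
ep = [0.238, 0.102, 0.071, 0.019, 0.006]  # reported EP-orphan influence

def normalize_by_first(xs):
    # Express each value relative to the d=1 entry.
    return [x / xs[0] for x in xs]

bp_rel = normalize_by_first(bp)
ep_rel = normalize_by_first(ep)
abs_ratio = [b / e for b, e in zip(bp, ep)]   # absolute BP / EP coupling

for d, br, er, ar in zip(distances, bp_rel, ep_rel, abs_ratio):
    print(f"d={d:3d}  BP/d1={br:.4f}  EP/d1={er:.4f}  BP/EP abs={ar:.1f}")
```

Beyond d=1 the normalized EP value exceeds the normalized BP value at every listed distance, while the absolute ratio falls from about 22 at d=1 to below 1 at d=200.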
---
## 2. Consolidated verdict (data-level)
**The C512 2.40 plateau is an EP estimator SNR/coherence collapse near the optimum** (exp1: cos→0,
batch self-cos −0.27, k→4000, true gradient clean). Ruled out as causes: LR (exp2), horizon
ambiguity (exp1), λ floor/target (exp5), mixing-length / contraction-cut (exp7). The architecture
has a **separate** wall: exact-gradient BPTT can't stay stable at C512 (exp3 broke at 1600). EP's
contraction-bias makes it the more robust of the two here.
Of the user's three plateau types, the orphan is type (2) **valid-but-over-contracted** — but the
sharper statement is that "over-contracted" is not the mechanism; the binding constraint is the
**EP estimator bias floor**, which becomes batch-incoherent once the true gradient shrinks below it.
**Two levers the data points at (decision pending):**
- (a) **Reduce the EP estimator bias floor near the optimum** — N=4 holo / tracking-AEP / lock-in /
navg averaging; jr_max≤8 also buys ~0.04 (exp5). The lever for EP's own plateau.
- (b) **The architecture's stability margin for exact gradients** — the separate exp3 wall.
---
## 3. Code / logs
- Trainer flags added this round: `--snap` (exp1 ckpts), `--cprewarm` (exp4 armB), `--probe_branch`
(exp6, now with attn entropy + qk RMS), `--probe_bptt` (cos/k along trajectory).
- Scripts: `lr_sweep.py` (exp2), `triangulation.py` (exp1, +batch self-cos), `mixing_probe.py`
(exp7 EP, bug-fixed), `bp_mixing.py` (exp7 BP), `bp_charlm.py --ckpt` (BP ckpt save).
- Orchestration: `master.sh` (first pass), `master2.sh` (full to-spec versions).
- Logs: `triangulation_multi.log`, `exp4_armB.log`/`exp4_armC.log`, `lam_jr{4,8,16}.log`,
`exp6_train.log`/`exp6_plateau.log`, `bp_mixing.log`, `mixing2.log` (+ the corrected re-run).
- Checkpoints: `ep_C512_bpe.best.pt` (orphan 2.40), `bp_C512.pt` (BP 1.70), `snap_{0,200,800,2000}.pt`,
`lam_jr8.pt` (2.365), `lam_t5.pt`.