# C512 "Why slow + plateau" diagnostic campaign (2026-06-17)

Standalone record of the 7-experiment campaign the user proposed to diagnose why EP at C=512 BPE plateaus at val CE 2.40 (orphan) and never reaches the BP capability band (1.0–1.5). Self-contained; companion logs in `/tmp/lt_ep/`, general arc in `FINDINGS.md`.

---

## 0. Setup and the user's hypothesis

- **Models** (TinyStories BPE-4096, C=512 H=16 T=256 B=24; random ln4096 = 8.318):
  - EP "orphan" (lr 8e-4, warmup 800, resinit 0.1, holo N=2, t2sel 40, jr_floor 0.1, res_target 1.5e-3, jr_max 32, pema 0.999): **best val CE 2.4037**, ran full 20k, zero excursions.
  - Standard BP transformer (same C/H/T, mlp=4): **best val CE 1.6953**.
  - Target: BP capability band **1.0–1.5** (BP-C512 itself is 1.70; BP needs C≥1024 to enter the band).
- **User's hypothesis (verbatim intent):** the "full-time slow + plateau" is **not** simple EP LR-inequivalence; it is a conflict between the **equilibrium architecture's contractive operating point** and **long-context mixing**, with LR merely expressing it as "8e-4 lives, 9e-4 explodes." Supporting prior: at a matched tight residual operating point EP ≥ BPTT; the apparent gap is EP-tight vs loose-BPTT (non-contractive), and the gap grows with T.
- **Three plateau types the user distinguished:** (1) invalid plateau (res ~1e-2, gate skipping, λ pinned); (2) valid-but-over-contracted plateau (res tiny, cos high, gate-skip low, still far from BP); (3) controller-fight / estimator-controller mismatch.
- **Priority order requested:** exp3 (BPTT+ctl) → exp2 (LR-sweep) → exp4 (warmup/gate decouple) → exp5 (λ grid) → exp6/exp7 (branch + mixing). exp1 (triangulation) foundational.

---

## 1. The seven experiments: spec, what ran, result, read

### exp3: C512 BPTT+controller (the decisive matched cut)

- **Spec:** replicate the orphan EP recipe (resinit, qk-norm, warmup, jr_floor, res_target, gate, pema, lr 8e-4) but swap the EP task gradient for exact BPTT.
  If it plateaus ~2.2–2.4 → tax is architecture/controller; if ~1.9 → EP estimator/noise.
- **Ran:** `--mode bptt`, all else matched, 20k steps, fuse on.
- **Result:** descended cleanly to **best 3.85 @step 1400**, then **destabilized at step 1600** (jr→32, res→0.07) and lodged in a broken basin (val 6.27, res 0.068, just under the 0.1 fuse, so no abort) for 8400+ steps. Killed at step ~12.8k. **Worse than the EP orphan (3.85 vs 2.40).**
- **Read:** the answer came back neither 2.2–2.4 nor 1.9: BPTT **broke**. At C512 the contraction controller does **not** keep the exact gradient on the manifold; EP's implicit contraction-bias is what kept the orphan alive. This inverts the S0 result (BPTT+ctl 1.635 < EP 1.676). *Caveat: single lr/seed; BPTT might be stable at another lr.*

### exp2: one-step loss-decrease LR-sweep at the plateau ckpt

- **Spec:** fixed ckpt + batch, gradients {BPTT, EP-task, EP-task+jac}, one-step update over an lr grid, same-graph eval ΔCE. Answers "is it LR-inequivalence."
- **Result** (ΔCE, negative = better): BPTT best **−0.161** (lr 1e-2); EP best **−0.042** (lr 1e-4), **diverges for lr ≥ 3e-4**. cos(g_EP, g_BPTT) = 0.107, k = |g_EP|/|g_BPTT| ≈ 486.
- **Read:** **not LR.** EP's best one-step descent over *all* lr is about 4× smaller than BPTT's (−0.042 vs −0.161); a magnitude rescale can't fix a direction whose cosine with the true gradient is only ~0.10.

### exp1: gradient triangulation, multi-checkpoint (foundational)

- **Spec:** at step 0/200/800/2000/plateau, per-group cos of {g_EP, g_BPTT-150, g_BPTT-400}, norm-ratio k, batch-to-batch variance, + res.
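
The quantities named in the exp1 spec (per-group cosine, norm ratio k, batch-to-batch coherence) can be sketched as below. This is a minimal illustration, not the campaign's actual script: the function names (`cosine`, `norm_ratio`, `self_cos`, `triangulate`) are hypothetical, each parameter group is assumed to be a flat list of floats, and "self-cos" is assumed to mean the mean pairwise cosine between gradients computed on different batches.

```python
import math
from itertools import combinations


def dot(a, b):
    """Inner product of two equal-length flat vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm of a flat vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b, eps=1e-12):
    """Cosine similarity; eps guards against zero-norm gradients."""
    return dot(a, b) / (norm(a) * norm(b) + eps)


def norm_ratio(g_ep, g_bptt, eps=1e-12):
    """k = |g_EP| / |g_BPTT| for one parameter group."""
    return norm(g_ep) / (norm(g_bptt) + eps)


def self_cos(grads_per_batch):
    """Mean pairwise cosine between gradients from different batches
    (assumed definition of the batch self-cos; needs >= 2 batches)."""
    pairs = list(combinations(grads_per_batch, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def triangulate(g_ep, g_bptt_short, g_bptt_long):
    """Per-group report. Each argument maps group name -> flat gradient;
    short/long stand in for the two BPTT horizons (e.g. 150 and 400)."""
    report = {}
    for name in g_ep:
        report[name] = {
            "ep_vs_bptt": cosine(g_ep[name], g_bptt_short[name]),
            "bptt_short_vs_long": cosine(g_bptt_short[name], g_bptt_long[name]),
            "k": norm_ratio(g_ep[name], g_bptt_short[name]),
        }
    return report


if __name__ == "__main__":
    # Toy gradients for a single hypothetical group, for illustration only.
    g_ep = {"attn": [1.0, 0.0]}
    g_bptt = {"attn": [1.0, 1.0]}
    print(triangulate(g_ep, g_bptt, g_bptt))
    print(self_cos([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]))
```

Computing the cosine per group rather than on one concatenated vector keeps a single large-norm group from dominating the alignment figure, which matters when k differs by orders of magnitude between checkpoints.
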
- **Result:**

  | ckpt | val | res | EP·BPTT150 | BPTT150·400 | k | EP self-cos | BPTT self-cos |
  |---|---|---|---|---|---|---|---|
  | step0 (random) | 9.90 | 2.1e-2 | 0.261 | 0.459 | 0.319 | +0.654 | +0.617 |
  | step200 | 6.07 | 4.2e-9 | **0.989** | 1.000 | 1.000 | +0.371 | +0.376 |
  | step800 | 4.11 | 3.9e-4 | 0.964 | 0.935 | 0.844 | +0.755 | +0.721 |
  | step2000 | 3.56 | 2.9e-4 | 0.885 | 0.845 | 0.675 | +0.445 | +0.358 |
  | **plateau (2.40)** | 2.40 | 2.6e-5 | **−0.045** | **1.000** | **4223** | **−0.273** | **+0.957** |

- **Read (decisive):** the EP estimator is **faithful while descending** (cos 0.99→0.89, steps 200–2000) and undergoes an **SNR/coherence collapse at the optimum**: cos→0, k→4000, and crucially the **batch self-cos goes −0.27** (EP gradients on different batches anti-correlate) while BPTT is +0.96. BPTT-150≡BPTT-400 (cos 1.00) the entire time ⇒ the true gradient is always well-defined. **The 2.40 plateau is an EP estimator bias-floor / batch-incoherence near the optimum**, not horizon ambiguity (ruled out here) and not LR (exp2).

### exp4: warmup / validity-gate decouple (3 arms)

- **Spec:** armA current warmup; armB full-strength contraction early + task LR still warmup; armC no-warmup + smaller resinit / lower branch LR. Metric: gate-skip / res-under-gate / early val.
- **Result:** armA (warmup): res stuck ~1.5e-2 for ~50 steps → gate skips the nudge → **no task learning for ~50–60 steps** (cos≈0, the reported grad is pure jacreg); cos jumps to 0.99 once res