Diffstat (limited to 'docs/campaign/C512_PLATEAU_CAMPAIGN.md')
-rw-r--r-- docs/campaign/C512_PLATEAU_CAMPAIGN.md 159
1 files changed, 159 insertions, 0 deletions
diff --git a/docs/campaign/C512_PLATEAU_CAMPAIGN.md b/docs/campaign/C512_PLATEAU_CAMPAIGN.md
new file mode 100644
index 0000000..6066c21
--- /dev/null
+++ b/docs/campaign/C512_PLATEAU_CAMPAIGN.md
@@ -0,0 +1,159 @@
+# C512 "Why slow + plateau" diagnostic campaign (2026-06-17)
+
+Standalone record of the 7-experiment campaign the user proposed to diagnose why EP at C=512 BPE
+plateaus at val CE 2.40 (orphan) and never reaches the BP capability band (1.0–1.5). Self-contained;
+companion logs in `/tmp/lt_ep/`, general arc in `FINDINGS.md`.
+
+---
+
+## 0. Setup and the user's hypothesis
+
+- **Models** (TinyStories BPE-4096, C=512 H=16 T=256 B=24; random ln4096 = 8.318):
+ - EP "orphan" (lr 8e-4, warmup 800, resinit 0.1, holo N=2, t2sel 40, jr_floor 0.1, res_target
+ 1.5e-3, jr_max 32, pema 0.999): **best val CE 2.4037**, ran full 20k, zero excursions.
+ - Standard BP transformer (same C/H/T, mlp=4): **best val CE 1.6953**.
+ - Target: BP capability band **1.0–1.5** (BP-C512 itself is 1.70; BP needs C≥1024 to enter the band).
+
+- **User's hypothesis (verbatim intent):** the "full-time slow + plateau" is **not** simple EP
+ LR-inequivalence; it is a conflict between the **equilibrium architecture's contractive operating
+ point** and **long-context mixing**, with LR merely expressing it as "8e-4 lives, 9e-4 explodes."
+ Supporting prior: at a matched tight residual operating point EP ≥ BPTT; the apparent gap is
+ EP-tight vs loose-BPTT (non-contractive), and the gap grows with T.
+
+- **Three plateau types the user distinguished:** (1) invalid plateau (res ~1e-2, gate skipping,
+ λ pinned); (2) valid-but-over-contracted plateau (res tiny, cos high, gate-skip low, still far
+ from BP); (3) controller-fight / estimator-controller mismatch.
+
+- **Priority order requested:** exp3 (BPTT+ctl) → exp2 (LR-sweep) → exp4 (warmup/gate decouple)
+ → exp5 (λ grid) → exp6/exp7 (branch + mixing). exp1 (triangulation) foundational.
+
+---
+
+## 1. The seven experiments — spec, what ran, result, read
+
+### exp3 — C512 BPTT+controller (the decisive matched cut)
+- **Spec:** replicate the orphan EP recipe (resinit, qk-norm, warmup, jr_floor, res_target, gate,
+ pema, lr 8e-4) but swap the EP task gradient for exact BPTT. If it plateaus ~2.2–2.4 → tax is
+ architecture/controller; if ~1.9 → EP estimator/noise.
+- **Ran:** `--mode bptt`, all else matched, 20k steps, fuse on.
+- **Result:** descended cleanly to **best 3.85 @step 1400**, then **destabilized at step 1600**
+ (jr→32, res→0.07) and lodged in a broken basin (val 6.27, res 0.068 — just under the 0.1 fuse,
+ no abort) for 8400+ steps. Killed at step ~12.8k. **Worse than the EP orphan (3.85 vs 2.40).**
+- **Read:** the answer came back neither 2.2–2.4 nor 1.9 — BPTT **broke**. At C512 the contraction
+ controller does **not** keep the exact gradient on the manifold; EP's implicit contraction-bias is
+ what kept the orphan alive. Inverts the S0 result (BPTT+ctl 1.635 < EP 1.676). *Caveat: single
+ lr/seed; BPTT might be stable at another lr.*
+
+### exp2 — one-step loss-decrease LR-sweep at the plateau ckpt
+- **Spec:** fixed ckpt + batch, gradients {BPTT, EP-task, EP-task+jac}, one-step update over an lr
+ grid, same-graph eval ΔCE. Answers "is it LR-inequivalence."
+- **Result** (ΔCE, negative = better): BPTT best **−0.161** (lr 1e-2); EP best **−0.042** (lr 1e-4),
+ **diverges for lr ≥ 3e-4**. cos(g_EP,g_BPTT) 0.107, k=|g_EP|/|g_BPTT| ≈ 486.
+- **Read:** **not LR.** EP's best one-step descent over *all* lr is ~4× smaller than BPTT's (−0.042
+ vs −0.161); a magnitude rescale can't fix a direction with cos ≈ 0.10 to the true gradient.
+
+### exp1 — gradient triangulation, multi-checkpoint (foundational)
+- **Spec:** at step 0/200/800/2000/plateau, per-group cos of {g_EP, g_BPTT-150, g_BPTT-400},
+ norm-ratio k, batch-to-batch variance, + res.
+- **Result:**
+
+ | ckpt | val | res | EP·BPTT150 | BPTT150·400 | k | EP self-cos | BPTT self-cos |
+ |---|---|---|---|---|---|---|---|
+ | step0 (random) | 9.90 | 2.1e-2 | 0.261 | 0.459 | 0.319 | +0.654 | +0.617 |
+ | step200 | 6.07 | 4.2e-9 | **0.989** | 1.000 | 1.000 | +0.371 | +0.376 |
+ | step800 | 4.11 | 3.9e-4 | 0.964 | 0.935 | 0.844 | +0.755 | +0.721 |
+ | step2000 | 3.56 | 2.9e-4 | 0.885 | 0.845 | 0.675 | +0.445 | +0.358 |
+ | **plateau (2.40)** | 2.40 | 2.6e-5 | **−0.045** | **1.000** | **4223** | **−0.273** | **+0.957** |
+
+- **Read (decisive):** the EP estimator is **faithful while descending** (cos 0.99→0.89, steps
+ 200–2000) and undergoes an **SNR/coherence collapse at the optimum** — cos→0, k→4000, and crucially
+ the **batch self-cos goes −0.27** (EP gradients on different batches anti-correlate) while BPTT is
+ +0.96. BPTT-150≡BPTT-400 (cos 1.00) the entire time ⇒ the true gradient is always well-defined.
+ **The 2.40 plateau is an EP estimator bias-floor / batch-incoherence near the optimum** — not
+ horizon ambiguity (ruled out here), not LR (exp2).
+
+### exp4 — warmup / validity-gate decouple (3 arms)
+- **Spec:** armA current warmup; armB full-strength contraction early + task LR still warmup;
+ armC no-warmup + smaller resinit / lower branch LR. Metric: gate-skip / res-under-gate / early val.
+- **Result:** armA (warmup): res stuck ~1.5e-2 for ~50 steps → gate skips the nudge → **no task
+ learning for ~50–60 steps** (cos≈0, the reported grad is pure jacreg); cos jumps to 0.99 once
+ res<gate by step ~60; best 4.11@800. armB (`cprewarm 200`): res 1.7e-2 after the pre-phase,
+ dead-zone only **partially** mitigated (val 5.97 vs 6.07 @100, best 4.13@600). armC (no-warmup +
+ resinit 0.03): **no dead-zone** (cos 0.99 from step 20, best 4.03@300) but a violent excursion
+ (res→0.14 @400, recovered), ended 4.71.
+- **Read:** the warmup×gate dead-zone is real; establishing contraction once (armB) isn't enough —
+ warmup's tiny lr can't deepen it. Smallest-resinit + no-warmup (armC) learns fastest early but is
+ excursion-prone.
+
+### exp5 — λ floor / target / jr_max grid (warm-start from 2.40)
+- **Spec:** floor {0.1,0.05,0.03} × target {1.5e-3,3e-3,5e-3} × jr_max {4,8,16,32}, 1–2k steps, fuse on.
+- **Result (best val CE, 1200 warm steps):** jr_max 4 → 2.376, **jr_max 8 → 2.3648**, jr_max 16 →
+ 2.416, jr_max 32 (base) → 2.463; floor 0.05/target1.5e-3 → 2.477; floor 0.03/target5e-3 → 2.482.
+ (orphan = 2.404.)
+- **Read:** **jr_max ≤ 8 beats the warm-started jr_max 32 base (2.463) by ~0.09–0.10** and dips
+ ~0.03–0.04 below the orphan (2.404): the jr_max ceiling was a small tax; floor/target relaxation
+ does not help. Still far from 1–1.5: the estimator floor dominates.
+
+### exp6 — branch-growth + attention entropy + qk-RMS (incl. at plateau)
+- **Spec:** log |attn|/|xin|, |ffn|/|xin|, |WO|, |pj|, |fc|, attention entropy, qk RMS, logit std;
+ if branch contribution is small at the plateau, add trainable residual gates α with larger LR.
+- **Result:** training (resinit 0.1): |WO| 1.95→4.91, |pj|→6.57, attn/xin 56→3.5, attn entropy
+ 4.56→3.88, qk_rms pinned 1.0. **At the plateau:** |pj|=44, |fc|=58, attn/xin≈1.0, attn entropy
+ 3.62, qk_rms 1.0.
+- **Read:** branches grow fine and are **large** at the plateau ⇒ **no "stuck-small branch"** — the
+ trainable-α-gate intervention is **not indicated**. qk-norm is bounding qk RMS (1.0) as designed.
+
+### exp7 — mixing / influence length + per-position CE (EP vs BP)
+- **Spec:** finite-diff |∂z*_t/∂x_{t−d}| vs distance + per-position CE, comparing tight-EP,
+ BPTT-ctl-tight, BPTT-ctl-loose, standard BP.
+- **Bug caught & fixed:** the EP probe first embedded the input with random-init weights (before
+ loading the ckpt) → garbage (per-pos CE ~7); fixed to embed post-load and re-ran.
+- **Result — influence ||Δ·_q||/δ vs distance d:**
+
+ | d | 1 | 8 | 16 | 64 | 200 |
+ |---|---|---|---|---|---|
+ | BP | 5.34 | 1.81 | 0.88 | 0.15 | 0.003 |
+ | EP-orphan | 0.238 | 0.102 | 0.071 | 0.019 | 0.006 |
+ | BP ÷d1 | 1.0 | 0.34 | 0.16 | 0.029 | 0.0006 |
+ | EP ÷d1 | 1.0 | 0.43 | 0.30 | 0.080 | 0.025 |
+
+ Per-position CE: BP uniformly ~0.7–1.0 lower than EP at every position (no long-range-specific gap).
+- **Read:** both decay with distance; EP is ~20× weaker in **absolute** coupling at d=1 (0.238 vs 5.34;
+ scale-caveated, z* vs hidden norm), but **normalized to d=1, EP reaches farther than BP** (0.025 vs
+ 0.0006 at d=200). ⇒ **does NOT support, arguably refutes, "contraction cuts long-range mixing."** The
+ EP deficit is **uniform across position**: consistent with the estimator floor (exp1), not mixing length.
+- **Honest gap:** a stable BPTT-ctl-tight ckpt does not exist (BPTT broke, exp3), so the cross-method
+ comparison is **EP-vs-BP only**, not the full tight/loose/BP set requested.
+
+---
+
+## 2. Consolidated verdict (data-level)
+
+**The C512 2.40 plateau is an EP estimator SNR/coherence collapse near the optimum** (exp1: cos→0,
+batch self-cos −0.27, k→4000, true gradient clean). Ruled out as causes: LR (exp2), horizon
+ambiguity (exp1), λ floor/target (exp5), mixing-length / contraction-cut (exp7). The architecture
+has a **separate** wall: exact-gradient BPTT can't stay stable at C512 (exp3 broke at 1600). EP's
+contraction-bias makes it the more robust of the two here.
+
+Of the user's three plateau types, the orphan is type (2) **valid-but-over-contracted** — but the
+sharper statement is that "over-contracted" is not the mechanism; the binding constraint is the
+**EP estimator bias floor**, which becomes batch-incoherent once the true gradient shrinks below it.
+
+**Two levers the data points at (decision pending):**
+- (a) **Reduce the EP estimator bias floor near the optimum** — N=4 holo / tracking-AEP / lock-in /
+ navg averaging; jr_max ≤ 8 also buys ~0.03–0.04 over the orphan (exp5). The lever for EP's own plateau.
+- (b) **The architecture's stability margin for exact gradients** — the separate exp3 wall.
+
+---
+
+## 3. Code / logs
+
+- Trainer flags added this round: `--snap` (exp1 ckpts), `--cprewarm` (exp4 armB), `--probe_branch`
+ (exp6, now with attn entropy + qk RMS), `--probe_bptt` (cos/k along trajectory).
+- Scripts: `lr_sweep.py` (exp2), `triangulation.py` (exp1, +batch self-cos), `mixing_probe.py`
+ (exp7 EP, bug-fixed), `bp_mixing.py` (exp7 BP), `bp_charlm.py --ckpt` (BP ckpt save).
+- Orchestration: `master.sh` (first pass), `master2.sh` (full to-spec versions).
+- Logs: `triangulation_multi.log`, `exp4_armB.log`/`exp4_armC.log`, `lam_jr{4,8,16}.log`,
+ `exp6_train.log`/`exp6_plateau.log`, `bp_mixing.log`, `mixing2.log` (+ the corrected re-run).
+- Checkpoints: `ep_C512_bpe.best.pt` (orphan 2.40), `bp_C512.pt` (BP 1.70), `snap_{0,200,800,2000}.pt`,
+ `lam_jr8.pt` (2.365), `lam_t5.pt`.
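
A minimal sketch of the exp1 quantities (EP·BPTT cosine, norm-ratio k, batch self-cos), offered as a reading aid outside the diff and not taken from `triangulation.py`. Assumptions: each gradient is already flattened into a plain list of floats, and batch self-cos is taken here to be the mean pairwise cosine between gradients of one estimator on different batches. All function names are hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length flattened gradients."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm of a flattened gradient."""
    return math.sqrt(dot(a, a))


def cosine(a, b, eps=1e-12):
    """Cosine similarity, e.g. cos(g_EP, g_BPTT) on the same batch."""
    return dot(a, b) / (norm(a) * norm(b) + eps)


def norm_ratio(g_ep, g_bptt, eps=1e-12):
    """k = |g_EP| / |g_BPTT| (the norm ratio reported in the exp1 table)."""
    return norm(g_ep) / (norm(g_bptt) + eps)


def batch_self_cos(grads):
    """Mean pairwise cosine between one estimator's gradients on different batches.

    Assumed definition: the source does not spell out how self-cos is aggregated.
    """
    pairs = [(i, j) for i in range(len(grads)) for j in range(i + 1, len(grads))]
    return sum(cosine(grads[i], grads[j]) for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_batches = 500, 4
    signal = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    # Toy "coherent" estimator: a shared signal plus small per-batch noise.
    coherent = [[s + 0.1 * rng.gauss(0.0, 1.0) for s in signal] for _ in range(n_batches)]
    # Toy "incoherent" estimator: per-batch noise only, no shared signal.
    incoherent = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_batches)]

    print("coherent self-cos:  ", batch_self_cos(coherent))
    print("incoherent self-cos:", batch_self_cos(incoherent))
    print("k (noise vs signal): ", norm_ratio(incoherent[0], signal))
```

On toy data of this kind, gradients that share a common signal give a self-cos near +1 and pure-noise gradients give a value near 0. A clearly negative value such as the −0.27 reported at the plateau needs per-batch components that actively anti-correlate, which this toy setup does not model.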