From b83947778e2c776f757a07d4719b7ce961d7ed55 Mon Sep 17 00:00:00 2001
From: Yuren Hao <yurenh2@illinois.edu>
Date: Fri, 3 Jul 2026 05:56:50 -0500
Subject: =?UTF-8?q?Initial=20commit:=20ept=20=E2=80=94=20backprop-free=20e?=
 =?UTF-8?q?quilibrium=20transformer=20(EP)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
---
 docs/campaign/C512_PLATEAU_CAMPAIGN.md             | 159 ++++++
 docs/campaign/C512_ROUND2_ABCD.md                  | 118 ++++
 docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md         | 260 +++++++++
 docs/campaign/FINDINGS.md                          | 609 +++++++++++++++++++++
 ...SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md |  81 +++
 5 files changed, 1227 insertions(+)
 create mode 100644 docs/campaign/C512_PLATEAU_CAMPAIGN.md
 create mode 100644 docs/campaign/C512_ROUND2_ABCD.md
 create mode 100644 docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md
 create mode 100644 docs/campaign/FINDINGS.md
 create mode 100644 docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md

(limited to 'docs/campaign')

diff --git a/docs/campaign/C512_PLATEAU_CAMPAIGN.md b/docs/campaign/C512_PLATEAU_CAMPAIGN.md
new file mode 100644
index 0000000..6066c21
--- /dev/null
+++ b/docs/campaign/C512_PLATEAU_CAMPAIGN.md
@@ -0,0 +1,159 @@
+# C512 "Why slow + plateau" diagnostic campaign (2026-06-17)
+
+Standalone record of the 7-experiment campaign the user proposed to diagnose why EP at C=512 BPE
+plateaus at val CE 2.40 (orphan) and never reaches the BP capability band (1.0–1.5). Self-contained;
+companion logs in `/tmp/lt_ep/`, general arc in `FINDINGS.md`.
+
+---
+
+## 0. Setup and the user's hypothesis
+
+- **Models** (TinyStories BPE-4096, C=512 H=16 T=256 B=24; random ln4096 = 8.318):
+  - EP "orphan" (lr 8e-4, warmup 800, resinit 0.1, holo N=2, t2sel 40, jr_floor 0.1, res_target
+    1.5e-3, jr_max 32, pema 0.999): **best val CE 2.4037**, ran full 20k, zero excursions.
+  - Standard BP transformer (same C/H/T, mlp=4): **best val CE 1.6953**.
+  - Target: BP capability band **1.0–1.5** (BP-C512 itself is 1.70; BP needs C≥1024 to enter the band).
+
+- **User's hypothesis (verbatim intent):** the "full-time slow + plateau" is **not** simple EP
+  LR-inequivalence; it is a conflict between the **equilibrium architecture's contractive operating
+  point** and **long-context mixing**, with LR merely expressing it as "8e-4 lives, 9e-4 explodes."
+  Supporting prior: at a matched tight residual operating point EP ≥ BPTT; the apparent gap is
+  EP-tight vs loose-BPTT (non-contractive), and the gap grows with T.
+
+- **Three plateau types the user distinguished:** (1) invalid plateau (res ~1e-2, gate skipping,
+  λ pinned); (2) valid-but-over-contracted plateau (res tiny, cos high, gate-skip low, still far
+  from BP); (3) controller-fight / estimator-controller mismatch.
+
+- **Priority order requested:** exp3 (BPTT+ctl) → exp2 (LR-sweep) → exp4 (warmup/gate decouple)
+  → exp5 (λ grid) → exp6/exp7 (branch + mixing). exp1 (triangulation) foundational.
+
+---
+
+## 1. The seven experiments — spec, what ran, result, read
+
+### exp3 — C512 BPTT+controller (the decisive matched cut)
+- **Spec:** replicate the orphan EP recipe (resinit, qk-norm, warmup, jr_floor, res_target, gate,
+  pema, lr 8e-4) but swap the EP task gradient for exact BPTT. If it plateaus ~2.2–2.4 → tax is
+  architecture/controller; if ~1.9 → EP estimator/noise.
+- **Ran:** `--mode bptt`, all else matched, 20k steps, fuse on.
+- **Result:** descended cleanly to **best 3.85 @step 1400**, then **destabilized at step 1600**
+  (jr→32, res→0.07) and lodged in a broken basin (val 6.27, res 0.068 — just under the 0.1 fuse,
+  no abort) for 8400+ steps. Killed at step ~12.8k. **Worse than the EP orphan (3.85 vs 2.40).**
+- **Read:** the answer came back neither 2.2–2.4 nor 1.9 — BPTT **broke**. At C512 the contraction
+  controller does **not** keep the exact gradient on the manifold; EP's implicit contraction-bias is
+  what kept the orphan alive. Inverts the S0 result (BPTT+ctl 1.635 < EP 1.676). *Caveat: single
+  lr/seed; BPTT might be stable at another lr.*
+
+### exp2 — one-step loss-decrease LR-sweep at the plateau ckpt
+- **Spec:** fixed ckpt + batch, gradients {BPTT, EP-task, EP-task+jac}, one-step update over an lr
+  grid, same-graph eval ΔCE. Answers "is it LR-inequivalence."
+- **Result** (ΔCE, negative = better): BPTT best **−0.161** (lr 1e-2); EP best **−0.042** (lr 1e-4),
+  **diverges for lr ≥ 3e-4**. cos(g_EP,g_BPTT) 0.107, k=|g_EP|/|g_BPTT| ≈ 486.
+- **Read:** **not LR.** EP's best one-step descent over *all* lr is 4× below BPTT's; a magnitude
+  rescale can't fix a 0.10-aligned direction.
+
+### exp1 — gradient triangulation, multi-checkpoint (foundational)
+- **Spec:** at step 0/200/800/2000/plateau, per-group cos of {g_EP, g_BPTT-150, g_BPTT-400},
+  norm-ratio k, batch-to-batch variance, + res.
+- **Result:**
+
+  | ckpt | val | res | EP·BPTT150 | BPTT150·400 | k | EP self-cos | BPTT self-cos |
+  |---|---|---|---|---|---|---|---|
+  | step0 (random) | 9.90 | 2.1e-2 | 0.261 | 0.459 | 0.319 | +0.654 | +0.617 |
+  | step200 | 6.07 | 4.2e-9 | **0.989** | 1.000 | 1.000 | +0.371 | +0.376 |
+  | step800 | 4.11 | 3.9e-4 | 0.964 | 0.935 | 0.844 | +0.755 | +0.721 |
+  | step2000 | 3.56 | 2.9e-4 | 0.885 | 0.845 | 0.675 | +0.445 | +0.358 |
+  | **plateau (2.40)** | 2.40 | 2.6e-5 | **−0.045** | **1.000** | **4223** | **−0.273** | **+0.957** |
+
+- **Read (decisive):** the EP estimator is **faithful while descending** (cos 0.99→0.89, steps
+  200–2000) and undergoes an **SNR/coherence collapse at the optimum** — cos→0, k→4000, and crucially
+  the **batch self-cos goes −0.27** (EP gradients on different batches anti-correlate) while BPTT is
+  +0.96. BPTT-150≡BPTT-400 (cos 1.00) the entire time ⇒ the true gradient is always well-defined.
+  **The 2.40 plateau is an EP estimator bias-floor / batch-incoherence near the optimum** — not
+  horizon ambiguity (ruled out here), not LR (exp2).
+
+### exp4 — warmup / validity-gate decouple (3 arms)
+- **Spec:** armA current warmup; armB full-strength contraction early + task LR still warmup;
+  armC no-warmup + smaller resinit / lower branch LR. Metric: gate-skip / res-under-gate / early val.
+- **Result:** armA (warmup): res stuck ~1.5e-2 for ~50 steps → gate skips the nudge → **no task
+  learning for ~50–60 steps** (cos≈0, the reported grad is pure jacreg); cos jumps to 0.99 once
+  res<gate by step ~60; best 4.11@800. armB (`cprewarm 200`): res 1.7e-2 after the pre-phase,
+  dead-zone only **partially** mitigated (val 5.97 vs 6.07 @100, best 4.13@600). armC (no-warmup +
+  resinit 0.03): **no dead-zone** (cos 0.99 from step 20, best 4.03@300) but a violent excursion
+  (res→0.14 @400, recovered), ended 4.71.
+- **Read:** the warmup×gate dead-zone is real; establishing contraction once (armB) isn't enough —
+  warmup's tiny lr can't deepen it. Smallest-resinit + no-warmup (armC) learns fastest early but is
+  excursion-prone.
+
+### exp5 — λ floor / target / jr_max grid (warm-start from 2.40)
+- **Spec:** floor {0.1,0.05,0.03} × target {1.5e-3,3e-3,5e-3} × jr_max {4,8,16,32}, 1–2k steps, fuse on.
+- **Result (best val CE, 1200 warm steps):** jr_max 4 → 2.376, **jr_max 8 → 2.3648**, jr_max 16 →
+  2.416, jr_max 32 (base) → 2.463; floor 0.05/target1.5e-3 → 2.477; floor 0.03/target5e-3 → 2.482.
+  (orphan = 2.404.)
+- **Read:** **jr_max ≤ 8 beats the jr_max 32 warm arm by ~0.09–0.10** (2.365–2.376 vs 2.463) and dips
+  ~0.03–0.04 below the orphan (2.404): the jr_max ceiling was a small tax; floor/target relaxation
+  does not help. Still far from 1–1.5: the estimator floor dominates.
+
+### exp6 — branch-growth + attention entropy + qk-RMS (incl. at plateau)
+- **Spec:** log |attn|/|xin|, |ffn|/|xin|, |WO|, |pj|, |fc|, attention entropy, qk RMS, logit std;
+  if branch contribution is small at the plateau, add trainable residual gates α with larger LR.
+- **Result:** training (resinit 0.1): |WO| 1.95→4.91, |pj|→6.57, attn/xin 56→3.5, attn entropy
+  4.56→3.88, qk_rms pinned 1.0. **At the plateau:** |pj|=44, |fc|=58, attn/xin≈1.0, attn entropy
+  3.62, qk_rms 1.0.
+- **Read:** branches grow fine and are **large** at the plateau ⇒ **no "stuck-small branch"** — the
+  trainable-α-gate intervention is **not indicated**. qk-norm is bounding qk RMS (1.0) as designed.
+
+### exp7 — mixing / influence length + per-position CE (EP vs BP)
+- **Spec:** finite-diff |∂z*_t/∂x_{t−d}| vs distance + per-position CE, comparing tight-EP,
+  BPTT-ctl-tight, BPTT-ctl-loose, standard BP.
+- **Bug caught & fixed:** the EP probe first embedded the input with random-init weights (before
+  loading the ckpt) → garbage (per-pos CE ~7); fixed to embed post-load and re-ran.
+- **Result — influence ||Δ·_q||/δ vs distance d:**
+
+  | d | 1 | 8 | 16 | 64 | 200 |
+  |---|---|---|---|---|---|
+  | BP | 5.34 | 1.81 | 0.88 | 0.15 | 0.003 |
+  | EP-orphan | 0.238 | 0.102 | 0.071 | 0.019 | 0.006 |
+  | BP ÷d1 | 1.0 | 0.34 | 0.16 | 0.029 | 0.0006 |
+  | EP ÷d1 | 1.0 | 0.43 | 0.30 | 0.080 | 0.025 |
+
+  Per-position CE: BP uniformly ~0.7–1.0 lower than EP at every position (no long-range-specific gap).
+- **Read:** both decay with distance; EP ~20× weaker in **absolute** coupling (scale-caveated — z*
+  vs hidden norm), but **normalized, EP reaches farther than BP**. ⇒ **does NOT support — arguably
+  refutes — "contraction cuts long-range mixing."** EP's deficit is **uniform across position**,
+  consistent with the estimator floor (exp1), not a mixing-length deficit.
+- **Honest gap:** a stable BPTT-ctl-tight ckpt does not exist (BPTT broke, exp3), so the cross-method
+  comparison is **EP-vs-BP only**, not the full tight/loose/BP set requested.
+
+---
+
+## 2. Consolidated verdict (data-level)
+
+**The C512 2.40 plateau is an EP estimator SNR/coherence collapse near the optimum** (exp1: cos→0,
+batch self-cos −0.27, k→4000, true gradient clean). Ruled out as causes: LR (exp2), horizon
+ambiguity (exp1), λ floor/target (exp5), mixing-length / contraction-cut (exp7). The architecture
+has a **separate** wall: exact-gradient BPTT can't stay stable at C512 (exp3 broke at 1600). EP's
+contraction-bias makes it the more robust of the two here.
+
+Of the user's three plateau types, the orphan is type (2) **valid-but-over-contracted** — but the
+sharper statement is that "over-contracted" is not the mechanism; the binding constraint is the
+**EP estimator bias floor**, which becomes batch-incoherent once the true gradient shrinks below it.
+
+**Two levers the data points at (decision pending):**
+- (a) **Reduce the EP estimator bias floor near the optimum** — N=4 holo / tracking-AEP / lock-in /
+  navg averaging; jr_max≤8 also buys ~0.04 (exp5). The lever for EP's own plateau.
+- (b) **The architecture's stability margin for exact gradients** — the separate exp3 wall.
+
+---
+
+## 3. Code / logs
+
+- Trainer flags added this round: `--snap` (exp1 ckpts), `--cprewarm` (exp4 armB), `--probe_branch`
+  (exp6, now with attn entropy + qk RMS), `--probe_bptt` (cos/k along trajectory).
+- Scripts: `lr_sweep.py` (exp2), `triangulation.py` (exp1, +batch self-cos), `mixing_probe.py`
+  (exp7 EP, bug-fixed), `bp_mixing.py` (exp7 BP), `bp_charlm.py --ckpt` (BP ckpt save).
+- Orchestration: `master.sh` (first pass), `master2.sh` (full to-spec versions).
+- Logs: `triangulation_multi.log`, `exp4_armB.log`/`exp4_armC.log`, `lam_jr{4,8,16}.log`,
+  `exp6_train.log`/`exp6_plateau.log`, `bp_mixing.log`, `mixing2.log` (+ the corrected re-run).
+- Checkpoints: `ep_C512_bpe.best.pt` (orphan 2.40), `bp_C512.pt` (BP 1.70), `snap_{0,200,800,2000}.pt`,
+  `lam_jr8.pt` (2.365), `lam_t5.pt`.
diff --git a/docs/campaign/C512_ROUND2_ABCD.md b/docs/campaign/C512_ROUND2_ABCD.md
new file mode 100644
index 0000000..65e5282
--- /dev/null
+++ b/docs/campaign/C512_ROUND2_ABCD.md
@@ -0,0 +1,118 @@
+# C512 Round-2 (A/B/C/D) — Spec Status, Results, and What Followed
+
+**Scope.** Status of the last long spec the user ordered (the A/B/C/D round, after the 7-experiment
+campaign in `C512_PLATEAU_CAMPAIGN.md`), its results, the unfinished gaps, and the work that grew out
+of it (this session). Setting: EP-trained equilibrium transformer block, C=512 BPE-4096, TinyStories,
+T=256, B=24. Random ln(4096)=8.318; BP single-block reference (depth=1, 7.48M) best **1.7921** (20k).
+
+Convention: raw data; conclusions are the user's.
+
+---
+
+## The spec (as ordered)
+
+> Next round (do A+B first, then decide N=4 / lock-in / tracking-AEP vs nudged-phase redesign):
+> - **(A)** plateau estimator ablation at a fixed ckpt — holo N=2/4/8, r=.02–.2, T2=20–120 + adaptive,
+>   tracking-AEP, vs BPTT150/400/800.
+> - **(B)** navg multi-batch one-step sweep (navg 1–16): variance-or-bias test.
+> - **(C)** stable BPTT+controller rescue — lr 2e-4–8e-4 × jr_max 8–32, ≥2 seeds, 2–5k cutoff (the
+>   still-missing matched architecture ceiling).
+> - **(D)** per-group late-collapse localization — attn QKVO / FFN fc-pj / LN / emb / head-clamp /
+>   jacreg + diagnostics |a|, |z^β−z*|, |Jv−Jᵀv|, selected-T2.
+
+---
+
+## Per-item status & results
+
+| item | status | result (raw) |
+|---|---|---|
+| **A** estimator ablation | ✅ core done | At the 2.40 plateau ckpt vs BPTT400: **N2-frozen** cos **−0.045**, self-cos **−0.27**, k **4133**; **N4/N8** fix magnitude only (k→30) NOT direction; **r** .05/.10/.20 do nothing; **fixed-T2** 20/80/120 diverge (need snapshot selection); **tracking-AEP (common-mode)** cos **0.997**, self-cos **+0.95**, k **0.9**. Diagnostic at z*: **\|Jv−Jᵀv\|/\|Jv\| = 1.37** (highly non-normal → the frozen-at-z* AEP linearization IS the bias floor). Script: `plateau_ablation.py`. |
+| **B** navg sweep | ✅ done | cos(EP,BPTT) over navg 1/2/4/8/16 = **0.37 / 0.40 / 0.55 / −0.30 / −0.21**; self-cos rises (0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8. ⇒ **deterministic BIAS, not variance** — averaging/restart won't fix it. Script: `navg_sweep.py`. |
+| **C** stable BPTT rescue | 🟡 partial | Ran lr{2e-4, 4e-4, 6e-4} × jr_max{16, 32}, **single seed**, ~2.5k cutoff (`bpttr_lr*_jr*.pt`). Found: **stable exact-gradient BPTT exists** at lr ≤ 6e-4 (never broke); lr 8e-4 broke @ step 1600; **jr_max 16 > 32**. Winner (lr6e-4, jr16) extended to 12k = `bptt_ceiling` (best **2.1055** @ step 7400). Scripts: `bptt_rescue.sh`, `chain_next.sh`. |
+| **D** per-group localization | ✅ core done | Late collapse is **UNIFORM** across WQ/WK/WV/WO, fc/pj, ln1/ln2, tok/pos (all cos −0.03..−0.08, k 3k–18k) — a single shared cause (the corrupted contrast `a` poisons every VF gradient equally), NOT attention/LN-specific. + the \|Jv−Jᵀv\|=1.37 diagnostic. Script: `plateau_ablation.py` (per-group mode). |
+
+### Decision gate (after A+B)
+Per spec, A+B chose the direction: **switch the nudged-phase estimator to common-mode tracking-AEP**
+(N=4/lock-in/navg rejected — A showed they fix magnitude not direction; B showed it's bias not
+variance). The mechanism: the bias floor is the **frozen-at-z\*** AEP linearization; re-linearizing at
+the moving common mode z̄=½(z₊+z₋) restores cos −0.045 → 0.997.
+
+### Follow-on (the spec's "next necessary test": does `--track` descend below 2.40 in TRAINING)
+✅ Done. Warm-start from the 2.40 orphan ckpt, jr_max 8:
+- **TRACK-warm (tracking-AEP): 2.40 → best 2.1628 @ 2500, still descending, stable** (res ~6e-5).
+- **STD-warm control (standard estimator, same recipe): ABORTED @ step 397** (res→0.21).
+- **TRACK-fresh (from scratch): ABORTED @ 968.**
+⇒ tracking-AEP descends past the plateau but is **not a from-scratch drop-in**; recipe is **two-phase
+warm-track** (std estimator → ~2.40, then switch to tracking-AEP). Continued (`trkcont`) → best **2.1116**.
+
+---
+
+## Unfinished gaps (do not change conclusions, but were specified)
+1. **C — ≥2 seeds:** only single-seed runs. The one item that affects the rigor of the "matched
+   architecture ceiling" claim.
+2. **C — jr_max=8** not included in the BPTT rescue grid (only 16/32; exp5 covered jr_max 8 for EP).
+3. **A — BPTT800** reference not run separately (BPTT150-vs-400 cos=1.0 from exp1 answered the horizon
+   question, but the spec listed 800).
+4. **D — sub-diagnostics** |a|, |z^β−z*|, and a head-clamp arm not individually logged.
+
+---
+
+## What followed (this session — beyond the A/B/C/D spec)
+
+These grew out of the round; they were **not** in the spec.
+
+1. **bptt_ceiling (architecture-ceiling run, lr6e-4 jr16, 12k).** Best **2.1055** @ step 7400, then
+   **controller-hijack blowup** @ ~8800 (jr→16, val CE→3.4), lodged at ~3.3 through 12000 (jr pinned at
+   16 the entire tail, res bounded 2–5e-3 — never recovered). Logs: `bptt_ceiling.log`.
+
+2. **BPTT-vs-BP "slow vs misaligned" diagnosis** (`lr_overlay.py`). Plotting val-CE vs cumulative-lr
+   (∫lr, Adam path-length proxy): per unit path-length **BPTT meets/beats BP down to 2.11** (cumL ratio
+   1.0→0.33) — the raw-step slowness is the 5× lr cap (6e-4 vs BP's 3e-3), not inefficiency. BPTT never
+   reached BP's 1.74 because it **blew up before converging** (stability-limited), not capacity. S0
+   evidence that capacity is fine when it converges: **BPTT-ctl 1.635 ≈ BP 1.610** (gap 0.025). A
+   weight-import capacity probe was designed but **deemed confounded** (5 formula gaps: damping halves
+   z*, coupled-vs-sequential FFN, attn biases, qknorm, final-LN) and not run.
+
+3. **Controller-hijack diagnosis** (`blowup_diag.py`, at the 2.1055 ckpt). Decomposing the update into
+   CE and jacreg terms: **cos(g_CE, g_jac) = −0.008** (orthogonal); at jr=16, **\|jr·g_jac\|/\|g_CE\| =
+   3.25** and **cos(g_CE, total) = 0.288** — i.e. ~71% of the update is the penalty, not loss descent.
+   Per-group, the penalty over-drives **LN gains 34×**, attn 3.2×. res stays bounded throughout ⇒ NOT a
+   relaxation blowup; a controller-penalty-driven weight-space detour. Trigger: near the optimum res
+   settled ~1.8e-3 (> target 1.5e-3) → controller ramped jr to max → hijack.
+
+4. **"Remove/replace the penalty" structural arms** (running now, all BPTT, base = bptt_ceiling recipe):
+   - `freeze_floor` (GPU1): `--jacreg 0.1 --jr_floor 0.1 --jr_max 0.1` → jr frozen at 0.1, no adaptive ramp.
+   - `specnorm` / Lipschitz (GPU0): `--jacreg 0 --specnorm --rho 0.9` → spectral-norm attention weights ≤0.9, off-gradient, no penalty.
+   - Queued: `no-penalty` (`--jacreg 0`), `energy` (`--attn_mode energy`), `mono` (`--attn_mode mono`).
+   - Watcher `watch_arms.sh` fires when both finish. Logs: `freeze_floor.log`, `specnorm.log`.
+
+5. **DEQ training-stability literature review** (deep-research, 24 sources, 25/25 claims verified).
+   Three families: (i) **soft Jacobian penalty (Bai 2021)** — author-admitted heuristic + DIGITAL-only
+   (relies on Broyden, deliberately doesn't enforce contraction) → wrong tool for analog; (ii)
+   **structural contraction** (monDEQ / NEMON-l∞ / pcDEQ / BiLipNet) — provable, and monDEQ physically =
+   a passive resistor-diode circuit → **analog-native**; (iii) **energy/Hopfield attention** (softmax =
+   one CCCP energy-descent step, provable monotone energy decrease) → the analog-compatible AND
+   attention-preserving route. Recommended construction (medium conf): energy/Hopfield attention +
+   monotone/NEMON-l∞ contractive FFN, plain Picard/averaged (= our εF Euler) iteration, no root-finder.
+   Full report relayed in chat; open questions = our research frontier.
+
+---
+
+## Where it stands / open items
+- **Round A/B/C/D:** substantially complete (4/4 + the decisive tracking-AEP training follow-on);
+  4 specified gaps remain (above), the load-bearing one being **C multi-seed**.
+- **Reframe (this session):** the explosion is a **DEQ forward-stability** problem (shared by EP and
+  BPTT, since EP≈BPTT gradient), not an EP problem — so the fix belongs to the DEQ/structural toolkit,
+  not EP-literature hacks. The soft penalty is both admitted-heuristic and analog-incompatible.
+- **Running:** freeze_floor, specnorm (~10h) — testing whether removing/replacing the penalty keeps
+  the block stable past the ~9k blowup point and whether it descends below 2.1055.
+- **Next-round formal plan (NOT yet executed):** WSD schedule redesign; fix-bundle re-run; converged
+  BP/BPTT/EP comparison; C512-vs-C≥1024 capability question. Plus (lit-informed) prioritize the
+  `energy` arm; consider NEMON-l∞ over Winston-Kolter for the structural `mono` arm.
+
+## Artifacts
+- Scripts: `/tmp/lt_ep/{plateau_ablation,navg_sweep,bptt_rescue,blowup_diag,lr_overlay,cos_now}.py`,
+  `bptt_rescue.sh`, `chain_next.sh`, `watch_arms.sh`.
+- Logs/ckpts: `/tmp/lt_ep/{bptt_ceiling,freeze_floor,specnorm,trkcont,track_warm,std_warm,bpttr_*}.{log,pt}`.
+- Records: `C512_PLATEAU_CAMPAIGN.md` (round-1, exp1-7), `FINDINGS.md` (2026-06-17 round 2 + 2026-06-18),
+  memory `ep-c512-plateau-mechanism.md`.
diff --git a/docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md b/docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md
new file mode 100644
index 0000000..54897bc
--- /dev/null
+++ b/docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md
@@ -0,0 +1,260 @@
+# C512 EP below-2.10 divergence — diagnosis + fix (2026-06-20)
+
+> Written to disk after `/tmp/lt_ep` was wiped (10-day /tmp cleanup) — all local code/ckpts/logs lost.
+> This file preserves the diagnosis + the exact code change so it can be rebuilt.
+
+## Symptom
+EP (frozen jr=0.1, tracking-AEP, qknorm, c=1) descends to best **~2.09**, then within ~200 steps:
+res 5e-3→0.15, cos(g_EP,g_BPTT) 0.98→~0, val CE 2.09→4+, ABORT. Same ~2.09 floor every run
+(erf & tanh both: 2.0919 / 2.0921). **Exact-BPTT, identical recipe, sails past 2.09 → 1.72 (freeze_wsd).**
+
+## Root cause (Codex-confirmed, 5-way corroborated)
+**EP optimizes the fixed point; BPTT optimizes the finite unroll — only the finite unroll defends the residual.**
+- `ep_step` relaxes T1=150 → `z150`, then REFINES to t1max=300 → `z*`, computes the EP gradient at `z*`.
+- But **eval and `bptt_step` use `z150`**. EP's gradient never feels the T1 residual.
+- As attention gets expressive → contraction weakens → `z150` drifts from `z*` → res@T1 inflates → EP
+  estimate (valid only at small res) dies → blowup.
+- BPTT differentiates the actual 150-step unroll → a non-converging unroll → bad CE → gradient implicitly
+  rewards strong contraction. **That defend-the-residual term is what EP structurally lacks.**
+- cos-0.977 holds only AT the fixed point; the missing perpendicular is ~21% (=√(1−0.977²)) = the
+  finite-horizon transient gradient. (T1=∞ would make even BPTT lose it — BPTT's stability IS the finite truncation.)
+
+Two layers: (1) point-mismatch (refinement) — `--t1max 150` alone pushed wall 2.09→2.05 but still blew@600;
+(2) gradient-flavor — needs the explicit penalty.
+
+Diverged state = forward bifurcation to a **limit cycle** (eval_relax: res floors ~6e-2, oscillates, 150→4000
+steps don't help, CE ~3.7). FTLE stayed negative (−0.027..−0.050). ⇒ adaptive/more steps can't fix it (res as
+stopping-criterion chases a vanishing fixed point); only res as a COST (penalty) prevents the drift.
+
+Ruled out: jacreg (3% of grad, orthogonal, removing changes cos <0.001); GELU erf/tanh (both diverge identically);
+the validity gate (removing it diverges FASTER); forward-stability/FTLE (stable-BPTT −0.0347 weaker than diverged-EP −0.0377).
+
+## THE FIX: `--resreg` (T1-residual penalty) — exact edits to lt_ep_train.py / ep_step
+
+1. signature: `def ep_step(..., corr_every=1, res_gate=0.0, resreg=0.0):`
+
+2. right after `res_used = res` (BEFORE the `if t1max > T1:` refinement):
+```python
+    zT, resT1 = zs, res                           # the T1 free-phase state (what eval/BPTT use), BEFORE refinement
+```
+
+3. just before the final `return grads, res` (after the jacreg block):
+```python
+    if resreg > 0 and resT1 > 7e-4:                      # defend z_T1 (BPTT gets this implicitly; EP at z* doesn't)
+        with torch.enable_grad():
+            Fz = blk.tforce(zT, xin0)                     # deterministic thick force at z_T1 (params live, zT/xin0 detached)
+            Rr = (eps * Fz).pow(2).sum() / (zT.pow(2).sum() + 1e-9)    # ~ (T1 residual)^2
+            grr = torch.autograd.grad(Rr, blk.block, allow_unused=True)
+        ratio = resreg * min(1.0, resT1 / 2e-2)          # linear in res, capped at resreg for res >= 2e-2 (~0.035*resreg at the 7e-4 threshold)
+        gtask = math.sqrt(sum(float((grads[id(p)] ** 2).sum()) for p in blk.block if grads.get(id(p)) is not None) + 1e-20)
+        gres = math.sqrt(sum(float((g ** 2).sum()) for g in grr if g is not None) + 1e-20)
+        lam = ratio * gtask / gres                       # scale penalty to `ratio` of the task-grad norm
+        for p, g in zip(blk.block, grr):
+            if g is not None:
+                grads[id(p)] = g * lam if grads.get(id(p)) is None else grads[id(p)] + lam * g
+    return grads, res
+```
+
+4. argparse: `ap.add_argument('--resreg', type=float, default=0.0)`
+
+5. the ep_step CALL in the training loop: append `, cfg.resreg` as the last arg.
+
+**RUN IT WITH `--res_gate 0`** (else the gate bypasses the penalty). Keep `--t1max 300`.
+
+## Validation status (INTERRUPTED — must re-run)
+`ep_resreg2` (res_gate=0, resreg=0.2, warm from erf-2.09 ckpt, t1max=300): held res **1–5e-4 through step 1000**,
+best **2.0573** (lowest any EP run reached, past the wall), zero inflation. Deleted by /tmp wipe before reaching ~1.8.
+No-penalty variants for contrast: ep_nogate blew@100, gated ep_resreg@200, ep_t1max150 reached 2.05 then blew@600.
+
+## What was lost in the /tmp wipe (must rebuild)
+- Code: lt_ep_train.py (with all post-06-14 edits: --resreg, --gelu, --probe_bptt, --wsd, --lcap/--linf, empty_cache,
+  tracking-AEP), plus the probes (grad_decomp.py, ftle_probe.py +--gelu, eval_relax.py, cos_now.py, train_lyap.py).
+  Base (06-14) survives in `/home/yurenh2/ept/lt_ep_code/`; re-apply edits from this doc + the session transcript
+  `/home/yurenh2/.claude/projects/-home-yurenh2-ept/46bc4d43-*.jsonl`.
+- ALL checkpoints: orphan (ep_C512_bpe.best.pt 2.40), freeze_wsd.pt (BPTT 1.72), ep_frzjr_wsd.pt (2.09),
+  ep_resreg2, ep_tanh_frz — gone (H200 backup source 157.66.255.42 is also down). Must retrain from scratch.
+- Data tinystories_bpe: regenerate via `lt_ep_code/prepare_tinystories_bpe.py`.
+
+## Re-run plan
+1. Reconstruct lt_ep_train.py = lt_ep_code/06-14 base + the edits above (+ the other recent flags).
+2. Regenerate tinystories_bpe data. **Work entirely under /home/yurenh2/ept — NEVER /tmp.**
+3. Retrain orphan (~20k) → freeze_wsd (BPTT WSD 16k, expect ~1.72) → EP+resreg (res_gate=0 resreg 0.2, WSD).
+4. Confirm EP+resreg descends past 2.09 toward ~1.8 (BPTT ceiling). Sweep resreg 0.2–1.0 if it plateaus.
+5. NEXT-ROUND code fixes (genuine bugs, kept this round only for comparability with the diverging baselines):
+   - GELU consistency: model force uses erf (F.gelu) but holo nudge (cgelu/rforce/cforce) uses tanh-form → nudge fixed point ≠ z* → ~4.7e-4 force error injects a spurious common-mode drift each step (estimator bias). FIX = `--gelu tanh` (make the model tanh too; the holo complex extension REQUIRES tanh = entire fn, so consistency means everything tanh). NOTE: fixing it does NOT fix the below-2.10 divergence (tested 6/18, that's the residual-defense issue) — but it's still a real bug worth fixing.
+   - cforce (holo_ep.py:41-45) omits the q/k RMSNorm that real attention applies → biases holo_a/holo_a_select with --qknorm (not the --track path, but fix anyway).
+   - Stale argparse defaults (footguns, harmless when flags passed explicitly): --attn_mode default 'real' (should be the intended mode), --data default '/tmp/lt_ep/data/shakespeare_char' (line ~320). Also tighten resreg scaling to pure task-grad norm (currently task+jacreg, ~3% off).
+
+---
+## 2026-06-21 session update — rebuild after /tmp wipe + validation IN PROGRESS
+
+REBUILD (all under /home/yurenh2/ept/ep_run, never /tmp):
+- Code ep_run/lt_ep_train.py = lt_ep_code (a recent ~6/20 snapshot despite its misleading 6/14 mtime — it already has res_gate/gelu/t1max/track/resinit/full ep_step+holo) + my 3 edits (--resreg block, --wsd schedule, line-20 DD data-path fix). `diff` verified clean; holo_ep.py byte-identical to base.
+- Bug-check forensics: holo_ep.py last edited 2026-06-13 -> the estimator (holo_a_track, AEP correction, rforce) is COMPLETE/not stale. Post-6/14 lt_ep_train edits NOT carried over are all diagnostic/other-run-type (--probe_bptt,--probe_branch,--snap,--cprewarm,--specnorm/--linf, bptt_step sep-mode, empty_cache) — none touched by the EP+resreg+track+frozen-jr run.
+- Data: regenerated tinystories_bpe to ep_run/data/ (361M train tokens, vocab 4096; val=tail-15MB of train.txt, ~0.8% overlap — fine for tracking divergence/descent).
+
+VALIDATION RUN (from random init; pid 696706 GPU1):
+  python3 lt_ep_train.py --mode ep --attn_mode thick --B 24 --C 512 --H 16 --T 256 --c 1.0 \
+   --jacreg 0.1 --jr_floor 0.1 --jr_max 0.1 --holo 2 --hr 0.02 --t2sel 40 --track --pema 0.999 --t1max 300 \
+   --res_est 1e-4 --res_gate 0 --resreg 0.2 --qknorm --resinit 0.1 --warmup 800 --T1 150 --T2 20 --lr 6e-4 \
+   --wsd 0.25 --steps 32000 --log 200 --save_every 500 --abort_res 0.3 --data data/tinystories_bpe \
+   --ckpt runs/ep_resreg_scratch.pt --state runs/ep_resreg_scratch.state
+  Status @step 6400 (~20%, 8h, 0.23 it/s): val 2.5070 descending (FASTER than orphan: 2.51@6.4k vs orphan's 2.40@20k). res BOUNDED ~1-2e-2 by resreg (spikes to 2e-2 get pulled back to ~4e-3 — resreg WORKING, not running away like the no-resreg baselines that hit res 0.15->abort@2.09). Decisive break-2.09 test ~10h out; ~1.8 needs the WSD lr-decay (step 24k+). Watcher key: does res stay bounded through 2.09.
+
+resreg CONTAMINATION probe (ep_run/resreg_probe.py, at step-3600 ckpt, res 1e-2):
+  cos(VF,BPTT)=0.7166 vs cos(VF+resreg,BPTT)=0.7108 -> delta -0.006: resreg DOES directly contaminate (small). cos(resreg_dir,BPTT)=-0.047 -> resreg is ~ORTHOGONAL to the true gradient; it is NOT adding back BPTT's residual-defense term (my earlier hypothesis was WRONG) — it's a roughly-orthogonal constraint direction. NET still favorable: the tiny -0.006 direct cost (bounded by the 0.2 ratio cap) buys keeping res low so the EP estimate doesn't COLLAPSE (cos 0.72 held vs ->0 runaway+diverge). CAVEAT the probe also exposes: the EP estimate itself is only 0.72-aligned at res 1e-2 (vs 0.98 at res 1e-4) -> descent on a mediocre gradient -> MAY plateau above BPTT's 1.72. If it stalls >~1.85 or res creeps past 2e-2: retune tighter resreg (ramp-to-full by res~3-5e-3 instead of 2e-2) and/or stronger early contraction (frozen jr=0.1 from scratch is weaker than the orphan's adaptive jr which held res~1e-4).
+
+CODEX review of rebuilt code (verdict triaged): the 2 "High" bugs are STALE-DEFAULT footguns, HARMLESS here (run passes --attn_mode thick + --data explicitly, verified via /proc): line 304 --attn_mode default 'real'; line 320 --data default '/tmp/lt_ep/data/shakespeare_char'. GELU (erf model vs tanh nudge, ~4.7e-4 force err) + cforce-missing-qknorm = real bugs but status-quo (kept this round for comparability) -> next-round fixes (point 5 above). Codex CONFIRMED CORRECT: AEP contrast sign (not flipped/doubled), Wh readout grad, jacreg, the resreg sign/logic + res_gate=0 gate-bypass, frozen-jr clamp@0.1, WSD schedule.
+
+COST (EP vs same-param BP): per-step EP on the equilibrium block ~= ~100x a depth-1 BP transformer in DIGITAL SIM — because the block relaxes ~150-300 steps (effective depth); even exact-grad BPTT is ~100x depth-1 BP. EP/BPTT ~= 1.5x (EP 0.23 it/s vs freeze_wsd BPTT 0.34 it/s). The ~100x is the cost of SIMULATING the analog relaxation on a GPU; on analog relaxation hardware the free phase is physical settling (~free) + EP's local update needs no backward -> EP competitive/better than BP there. Per-quality the gap is smaller (equilibrium block beats same-param standard transformer, rule tax 0.041); EP memory << BPTT (no unrolled graph).
+
+DELIVERABLE: method-intro PDF /home/yurenh2/ept/ept_method_intro.{tex,pdf} (9pg, EP-literate audience; AEP/holomorphic/CET/tracking-AEP background; strict attribution; results honest+ongoing-marked; compiled with tectonic since conda pdflatex is broken). VERIFY before external use: the post-Jan-2026 citations (Scurria arXiv:2602.03670 AsymEP; CET=Hoier/Kerjan/Scellier ICLR2026 AM workshop OpenReview Qrfml76eWJ; Laborieux 2021) + the Høier-vs-Hoier spelling.
+
+---
+## 2026-06-21 (cont.) — divergence is CONTRACTION loss, not res; the missing BPTT term MEASURED
+From-scratch EP (c=1, frozen jr=0.1, track, resreg=0.2) diverged at val 2.25 (ABOVE the 2.09 wall):
+res 1.1e-2 -> 1.7e-1 -> 3.4e-1 then COLLAPSED to a trivial fixed point (res 1.4e-8, val 10). So it is a
+bifurcation: good fixed point destroyed -> brief limit cycle -> degenerate attractor. NOT a numerical blowup.
+
+MEASURED the missing term M = g_BPTT - g_EP at the pre-blowup ckpt (step 9200, res~1e-2; resreg_probe.py):
+  cos(g_EP, g_BPTT) = 0.557 (decayed from 0.72 @ step3600 at the SAME res -> estimate degrades as operator hardens)
+  |M|/|BPTT| = 0.84,  cos(M,BPTT) = +0.91,  cos(M,EP) = +0.16,  cos(M,resreg) = -0.10
+  (|EP|=0.94, |BPTT|=2.2 -> EP captures only ~42% of the true gradient, 56 deg off, near the instability)
+=> EP is missing 84% of the true gradient (M ~ the BPTT direction). resreg points AWAY from M (-0.10):
+   resreg is NOT the missing term and NOT a weak proxy -- it is the wrong direction. resreg retired as the fix.
+
+CODEX THEORY: correct state variable = contraction margin mu_P(J) (one-sided-Lipschitz/log-norm), NOT res
+(lagging symptom; cos decays at fixed res) and NOT spectral radius (non-normal, |Jv-Jtv|/|Jv|~1.4 -> jacreg
+targets the WRONG quantity). BPTT's stabilizer = the finite-horizon gradient term (the (I+eps*J)^k products that
+ALSO govern contraction -> automatic contraction feedback); EP graded at refined z* structurally discards it.
+The fix that (i) preserves the objective [no loss force] and (ii) runs on analog [no digital root-finder] =
+CONTRACTION-BY-CONSTRUCTION (mu_P(J) < -alpha). KICKER: analog HW can never do BPTT -> can never access M ->
+structural contraction is the FORCED hardware design constraint, not merely the cleanest fix.
+
+EXPERIMENT (test causality + the fix; both with resreg=0, no residual term):
+ (A) c-bump c=1->3 from scratch [runs/ep_c3_scratch] -- does stronger structural damping alone prevent divergence?
+ (B) HARD spectral-norm contraction constraint --specnorm (codex implementing) -- warm from preblow_9200 + from scratch.
+ If divergence vanishes with NO residual term -> contraction-loss confirmed as root; principled fix validated.
+
+---
+## 2026-06-21 (cont. 2) — the equivalence-condition diagnosis (deep paper read: Ernoult2019 + AsymEP + LZ-JacobianHomeostasis)
+EP≡BPTT (Ernoult 2019, arXiv:1905.13633, Thm1/Eq16) needs: H1 symmetric Jacobian J=Jᵀ (Eq51 transpose-cancellation = load-bearing);
+H2 free phase EXACTLY at fixed point s*=s_T=..=s_{T-K}; H3 β→0. Ernoult gives NO contraction-rate truncation bound (just assumes
+convergence; my earlier "contraction-rate condition" guess was WRONG). BPTT needs NEITHER H1 nor H2 (differentiates the actual
+finite-T graph) -> that's why BPTT sails to 1.7 while EP can't.
+
+OUR RUN VIOLATES BOTH, via the SAME quantity S^-1 A (S,A = sym/antisym parts of J; our |Jv-Jᵀv|/|Jv|≈1.4 IS ‖S^-1 A‖≈1.4):
+- H2 (dominant, first-order): res~1e-2 -> not at fixed point -> EP & BPTT seeded by different states. "EP≈BPTT only at res≲1e-4,
+  noise at 1e-2" is textbook H2 failure.
+- H1 (makes it sudden, sets M): both AsymEP (Scurria arXiv:2602.03670) and LZ (Laborieux-Zenke "Jacobian Homeostasis"
+  arXiv:2309.02214, ICLR2024) reduce the asymmetry bias to S^-1 A. LZ Eq11: EP error = δ - 2 S^-1 A δ, valid only ‖S^-1 A‖→0.
+  At 1.4 the map (S+A)^-1(S-A) → near-rotation (eigs→-1): right direction, lost magnitude = our cos(M,BPTT)=0.91, |M|=84%.
+
+KEY NEW REALIZATION: our AEP (J→Jᵀ) correction is OUT OF WARRANTY. It is FIRST-ORDER in S^-1 A and assumes ‖S^-1 A‖≪1
+(AsymEP error series Σ(S^-1 A)^{2n+1} converges only if ρ(S^-1 A)<1; they never state it). At 1.4 it may be O(1) wrong. AND it
+is a FIXED linearization at x̄⁰, so res~1e-2 -> wrong anchor -> H2 POISONS the AEP correction too. The cos-0.99 we trust holds
+ONLY at small-asymmetry + tight-convergence — the regime we leave.
+
+THE GAP, PRECISE: NOT "can't converge". Even at perfect convergence, at ‖S^-1 A‖=1.4 the H1 bias is O(1), and the BPTT-1.7
+solution NEEDS that non-normality. So EP's valid regime (‖S^-1 A‖≪1) and the expressive regime (≈1.4) DON'T OVERLAP. Structural.
+
+REFRAMES THE CONTRACTION EXPERIMENTS: specnorm/c3 cap ‖J‖/contraction, NOT the asymmetry ratio ‖A‖/‖S‖ that drives the bias.
+The literature's correct lever = Jacobian Homeostasis (LZ Eq12: penalize ‖A‖² -> J functionally symmetric, works w/ untied weights)
+= the principled version of our jacreg — BUT it caps exactly the non-normality the 1.7 solution needs (expressivity tax; confirms
+the "crutch" worry). Holomorphic nudge only removes finite-β bias, NOT the H1/H2 bias (so β is NOT our dominant error).
+
+THE ONE PATH WITHOUT THE EXPRESSIVITY TAX (novel): the AEP correction is first-order; a HIGHER-ORDER AEP (more Neumann terms in
+S^-1 A) extends the warranty past ‖S^-1 A‖~1 WITHOUT capping non-normality — fix the gradient, not the model. Attacks H1 fundamentally.
+
+NEXT MEASUREMENTS (cheap, on preblow_9200 ckpt): (1) measure ‖S^-1 A‖ / ρ(S^-1 A) DIRECTLY (not the 1.4 proxy) and track over
+training (warranty gauge; predict it grows toward the tip). (2) test M ≈ -2 S^-1 A · g_BPTT — if yes, the asymmetry bias IS the gap
+(not convergence) and higher-order AEP is the lever. Sources: Ernoult 1905.13633; Scurria 2602.03670; LZ 2309.02214 + Holo 2209.00530.
+
+---
+## 2026-06-21 (cont. 3) — CORRECTION: the "AEP out of warranty / ρ(S⁻¹A)<1" framing (cont.2) was MIS-ATTRIBUTED
+Deep read of Scurria 2026 (arXiv:2602.03670, verbatim Eq.24/47) overturns cont.2:
+- Our AEP = the paper's AsymEP (single-state antisym correction −2A(z−z*); holo_a_track re-anchors A at the common mode z̄).
+  It is EXACT at β→0 for ARBITRARY asymmetry: Eq.24 `J_FA(x̄⁰)=J−2A=Jᵀ` is an algebraic IDENTITY (any A magnitude).
+- The ρ(S⁻¹A)<1 / Neumann series Σ(S⁻¹A)^{2n+1} (Eq.47) is the **VF BASELINE's** error (uncorrected EP), which AsymEP REMOVES.
+  We + GPT-Pro (following Laborieux-Zenke, whose first-order −2S⁻¹Aδ analyzes UNCORRECTED EP) MIS-ATTRIBUTED the VF error
+  to AsymEP. It is NOT a restriction on our correction.
+- "AsymEP = first-order projection of Dyadic EP" = DIMENSIONAL projection (2n→n), NOT first-order-in-β. Both exact at β→0.
+- Dyadic EP (Scurria §4) = a SEPARATE 2n doubled-field method (aux field z′ + augmented energy H=−(z−z′)ᵀF((z+z′)/2));
+  we DON'T implement it. It buys finite-displacement robustness + a real energy, NOT extra β→0 exactness. ⇒ optional upgrade, NOT the fix.
+- ⇒ the asym_probe ρ(S⁻¹A)=3.3 result answers the WRONG question (VF's error, + measured at a no-fixed-point pathological ckpt). DISCARD as a validity verdict.
+
+REAL DIAGNOSIS (re-centered): the ONLY fundamental obstacle is **H2 = free-phase convergence/contraction**. AsymEP handles
+asymmetry EXACTLY *given convergence*. The cos-0.557 / M-84% gap was measured at res 1e-2 (NOT converged) ⇒ it's H2, not H1.
+Asymmetry is only "upstream" (expressive attention → weak contraction → H2 fails). FIX DIRECTION = contraction/convergence
+defense (specnorm / damping / monotone / residual-defense) — VALIDATED. The cont.2 "asymmetry/ρ/AEP-warranty" line was a wrong turn.
+
+EMPIRICAL CHECK (running): oracle-adjoint probe — at a HEALTHY converged ckpt (specnorm, res~1e-4): is
+g_EP(AsymEP) ≈ g_transpose(exact adjoint via GMRES Jᵀλ=−ℓ) ≈ g_BPTT? cos→1 confirms AsymEP exact + problem = convergence.
+(The earlier 0.557 was the pathological non-converged point.)
+
+---
+## 2026-06-21 (cont. 5) — ROOT-CAUSE CANDIDATE: --hr 0.02 is 10x too small (should be 0.2) -> noisy EP gradient
+oracle-adjoint probe (asym_probe --diag) on a PERFECTLY converged ckpt (res 1.5e-9): cos(g_EP, g_transpose[=exact
+adjoint=g_BPTT, cross-checked cos 1.0]) = 0.63 mean over 6 batches (SYSTEMATIC, not a bad batch). BETA/r SWEEP: cos gets
+WORSE as beta->0 (hr 0.04->0.86, 0.02->0.72, 0.005->0.38, 0.002->0.15) = 1/beta noise-amplification of the nudge-convergence
+error (NOT finite-beta bias; NOT a holo/track/t2sel component -- plain EP ~= FULL ~0.72). Sweeping hr UP: 0.06->0.85,
+0.1->0.95, 0.15->0.98, **0.2->0.986**, 0.3->0.991. => at the DOCUMENTED hr=0.2 (METHODS line 130) the estimator is FINE
+(cos 0.986, in the historical 0.85-0.99 range). Our runs (this whole session + the EP_BELOW210 "diverging recipe") used
+--hr 0.02 = 10x too small -> noise regime -> 0.72-cos gradient -> plausibly THE destabilizer of the below-2.10 divergence.
+CODE IS CORRECT; mis-set parameter (hr regression from the validated 0.2). FIX = --hr 0.2. Decisive test: pure EP with
+--hr 0.2 (runs/ep_hr02) -- does it stop diverging? The probe was correct; the convergence/asymmetry/Dyadic theory dive was
+largely chasing a symptom of the mis-set hr. (NOTE: also revisit t2sel=40 vs documented 120, secondary.)
+
+## 2026-06-22 (cont. 6) — DECISIVE: divergence is STRUCTURAL (equilibrium gradient is blind to finite-time contraction). hr & t2sel are NOT the fix.
+Deliverables this round: `ep_run/GPT55_BUG_HUNT.md`, `ep_run/CODEX_VERDICT.md`, `ep_run/FWD_MODE_DESIGN.md` (multi-agent: gpt-5.5 + fugu-ultra via codex).
+
+**cont.5's "hr=0.2 fixes it" was WRONG/incomplete.** ep_hr02 (--hr 0.2) STILL diverged (step 2800, val 2.87). hr is a real gradient-QUALITY knob but not the divergence fix.
+
+**Clean chain (all on a fixed val batch):**
+1. **BPTT deconfounder:** BPTT (exact grad) → CE **1.83**, free-phase contraction ratio **ρ≈0.982**, res tight ~1e-3 throughout. Divergence is **EP-specific**; a contractive low-loss solution exists.
+2. **The "0.94 estimator ceiling" = ADJOINT-PHASE TRUNCATION, not hr/β noise, not fundamental.** At a trained, tightly-converged operator (free res 2.2e-5): hr-ceiling sweep cos(g_EP,g_transpose) **flat ~0.94 across hr 0.04–0.8**; t2sel sweep (fixed hr) cos **0.74→0.998 as t2sel 10→160**. The code refines the FREE phase but never converges the NUDGED/adjoint phase (gpt-5.5 bug-hunt #1,#9). So EP CAN compute the near-exact equilibrium adjoint (0.998) at t2sel≥160. cos(g_transpose,g_BPTT)=0.975@res2e-5, =1.0@res1e-9 (finite-horizon gap, res-dependent).
+3. **Bifurcation probe (mechanism):** as EP trains, free-phase **ρ drifts→1** (0.989@val4.96→0.998@val2.74); redx baseline blew step3300 (CE 2.74→41). **Damping does NOT rescue** — ρ-vs-c on the marginal op: more c makes ρ WORSE (0.998→0.9999); on BPTT's op more c helps (0.982→0.96). EP drives weights into a **damping-resistant marginal-contraction region** BPTT avoids. Sudden blowup because **res_T1≈ρ^150** is hyper-sensitive near ρ=1 (gradual ρ-drift + ρ^150 = sudden bifurcation).
+
+**VERDICT (codex/gpt-5.5, PROVEN; see CODEX_VERDICT.md): (b) STRUCTURAL.** Equilibrium gradient differentiates L(z*); BPTT differentiates deployed L(z_T), T=150. The equilibrium gradient `grad=L_θ+λᵀF_θ` (F_zᵀλ=−L_z) **lacks the finite-horizon contraction-defense term**. Scalar proof: F=−k(z−z*) → ∀k, z* identical → **dL(z*)/dk=0** (blind to contraction rate), but z_T−z*=(1−εk)^T(z₀−z*) (exponent T = number of Euler steps, not a transpose) → **dL(z_T)/dk ∝ T(1−εk)^{T−1}** (BPTT sees k; huge near ρ=1). A param that erodes contraction with z* fixed has ZERO equilibrium gradient but large BPTT gradient. **No t2sel/hr adds a term absent from the fixed-point gradient.** t2sel=160 only makes EP a better *equilibrium* gradient = still wrong objective → ep_t2fix predicted to still blow.
+
+**FIX (physically realizable; FWD_MODE_DESIGN.md): change the OBJECTIVE to finite-horizon — forward-mode/RTRL eligibility traces.** Forward-relax T=150, propagate `e_{t+1}=Φ_z(z_t)e_t+ε·∂F_θ/∂p` alongside the state, update `Δp ∝ −L_z(z_T)ᵀe_T`. Forward-only, local, NOT reverse BPTT, NOT a digital solver — keeps the contraction-defense terms. Exact RTRL O(P·dimz) → practical: UORO / random-direction directional-deriv / blockwise low-rank. Alt: explicit contraction homeostat (penalize log-norm μ(J)).
+
+**In-flight confirmations (2026-06-22):** ep_t2fix (t2sel=160, runs/ep_t2fix.log + t2fix_rho.log) — predicted ρ→1 + blow; oracle-adjoint training (GMRES exact adjoint each step, warm-start redx s2000, runs/oracle_adjoint.log) — DEFINITIVE (b) test (if even the exact equilibrium grad blows → (b)); forward-mode rule design = the fix.
+**Tooling:** Sakana fugu-ultra wired into the codex plugin (broker spawn patched: `-c model=fugu-ultra -c model_provider=sakana`, key ~/.codex/sakana.key); gpt-5.5 via direct `codex exec`.
+
+## 2026-06-22 (cont. 7) — CORRECTION: the divergence is a forward LIMIT CYCLE = a DISCRETE-EULER ARTIFACT (continuous/analog system is stable). cont.6's (b)-structural/forward-mode verdict is SUPERSEDED.
+Measured on genuine this-session ckpts (full dossier: `ep_run/EP_DIAGNOSIS_DOSSIER.md`; fugu Q1-Q4: `ep_run/FUGU_VERDICT_FULL.md`).
+1. **It's a LIMIT CYCLE, no fixed point.** eval_relax on the marginal ckpt redx **s3200** (val 2.74): relax from embedding 6000 steps → res FLOORS ~2.3e-2 and OSCILLATES (tail 2.08-2.73e-2, non-monotone), no fixed point. Reproduces the old (lost-run) finding on a genuine ckpt. ⇒ there is **no z\*** at the diverging operator → cont.6's "equilibrium gradient L(z\*) blind to contraction → forward-mode RTRL" (assumed z\* exists) is **moot here**. The bifurcation-probe "ρ=0.998 / slow convergence" was a **transient artifact** (caught the initial decay, missed the floor+oscillation).
+2. **Attention drives the cycle.** Knockout WO×α: α=1.0 cycle (res 2.5e-2), 0.7 smaller, 0.4 nearly gone, **0.2 CONVERGES (res 3e-4, true fixed point)**. Non-conservative attention is causally responsible.
+3. **DECISIVE — it's a DISCRETE-EULER ARTIFACT.** ε-sweep on s3200 (εN fixed = same settling): res-floor + oscillation shrink **monotonically** as ε↓ — 0.1→2.3e-2(cycle), 0.05→7e-3, 0.03→3e-3, 0.02→2e-3, **0.01→8.9e-4 CONVERGED**. ⇒ the cycle is the **explicit-Euler instability of a stiff rotating attention mode** whose CONTINUOUS eigenvalue is **Re(μ)<0 (stable)**: `|1+εμ|>1` at ε=0.1 but `<1` at ε=0.01. **So the below-2.1 "divergence" is largely a DIGITAL-SIM INTEGRATION ARTIFACT (ε=0.1 explicit Euler too coarse), NOT a fundamental non-conservative instability — and ANALOG hardware (continuous relaxation, ε→0) would NOT have it.** Big positive for the thesis.
+- **fugu-ultra (Q1-Q4)** concurred + was appropriately cautious: confirmed attention-driven oscillatory non-convergence, flagged the Euler-artifact alternative (now confirmed), said "Hopf" needs the eigenvalue-of-M crossing to be measured. Fix (Q2): finer integration (smaller ε / implicit/Anderson) + **adaptive jacreg homeostat** (the validated-2.40 mechanism, which the diverging runs *froze* weak — the regression) + **res_gate ON** + qknorm/WO-gain rails + monitor TAIL oscillation (a short residual probe is what fooled us). Thesis (Q3): sub-threshold attention IS expressive enough (BPTT 1.83 proves stable expressive solutions exist), threshold ≈ α_c 0.3; tradeoff real but not fundamental. Primitive (Q4): keep below the (discrete) instability — the code (`ep_step`/holo) REQUIRES a stationary z\*; training *through* the cycle is invalid → gate updates when res says equilibrium lost.
+- **CAVEAT:** the ε-sweep proves the *discrete* artifact at s3200 (continuous-stable there); a *true* continuous Hopf could still emerge deeper in training. **Decisive confirmation (running): `ep_eps05`** = warm-start s2000 (pre-cycle, val 3.13) + **ε=0.05 (T1=300)** → does it clear redx's 2.74 blow point without cycling? Files: eval_relax_s3200.py / knockout_s3200.py / eps_sweep_s3200.py.
+
+## 2026-06-23 — RESULT: ε-TRAINING-MONOTONICITY CONFIRMS THE MECHANISM (wall moves with ε).
+Three training runs, identical recipe except ε:
+- ε=0.1, t2sel=40 (redx): blew @ best **2.74**
+- ε=0.1, t2sel=160 (ep_t2fix): blew @ best **3.02** — the BETTER gradient (cos 0.998 vs 0.84) blew EARLIER ⇒ gradient quality is NOT the lever (a cleaner gradient if anything drives into the stiff regime faster).
+- ε=0.05 (ep_eps05, warm-start s2000): blew @ best **2.41** — cleared *past* redx's 2.74 with res pinned ~5e-3 the whole way down (2.74→2.45), then res inflated 1.3e-2→2.0e-2→4.6e-2→val13 @step3100.
+
+**⇒ smaller ε → strictly lower wall (2.74 → 2.41); the blow point shifts MONOTONICALLY with the integration step.** This proves the divergence IS the explicit-Euler instability of the *stiffening* rotating attention mode (|1+εμ|>1). Fixed ε only RELOCATES the wall, doesn't eliminate it — as the attention keeps stiffening past ~2.4, even ε=0.05 eventually gets too coarse → cycle regrows → blow. The ep_t2fix contrast (better gradient still blows, earlier) seals **"integration, not gradient."** Analog HW (continuous relaxation, ε→0) is unaffected. Pre-blow ckpt preserved: `runs/ep_eps05.pt` @step2850 best2.4059. (2.41 lands near the old 2.40 estimator-plateau — likely coincidence; this is the res-inflation/stiffness blow, not the estimator floor.)
+**NEXT (user's plan — try jacreg + physics-inferred methods):** (a) **adaptive ε** [#30, gate now passed] — shrink ε as res climbs, big ε when smooth → should ELIMINATE the wall, not just move it; (b) **jacreg** (adaptive jr — reduce rotating-mode stiffness → raise the Euler-stability margin → wall up / fixed-ε floor lower; the early-recipe defense); (c) ε=0.02/0.03 to confirm wall→~2.2. Two complementary axes: integration (a,c — SIM-only, no expressivity cost) vs stiffness (b — changes model, also speeds HW settling).
+
+## 2026-06-23 (cont. 8) — MAJOR CORRECTION: GENUINE Hopf instability, NOT an Euler artifact. cont.7's "Euler-artifact / analog-immune / integration-fix" framing is SUPERSEDED.
+Direct leading-eigenvalue measurement (matrix-free FINITE-DIFFERENCE-JVP Arnoldi on M=I+εJ, `eig_probe.py`) + Anderson fixed-point-solver control (`anderson_control.py`). THREE independent confirmations:
+1. **eval_relax**: s3200 floors/oscillates, never reaches a fixed point.
+2. **Leading eigenvalues of M at the operating point**: s2000 (healthy, val 3.13) all |λ|<1, Re μ<0 (STABLE); s3200 (val 2.74) top |λ|=1.044, **Re μ=+0.44**, with an unstable **COMPLEX** pair (μ=+0.26±1.37j); ep_eps05 (val 2.41) top |λ|=1.14, **Re μ=+1.35**, with a complex pair (μ=+0.65±2.08j). Leading Re μ grows −0.02→+0.44→+1.35 as CE drops; rotating |Im μ| ~0→1.37→2.08.
+3. **Anderson control**: s2000 converges (root, marginally stable μ≈0). s3200 — plain relax floors (res 2.7e-2), Anderson CANNOT reach a root (best res 1.4e-3), and the near-root has **UNSTABLE COMPLEX** eigenvalues (Re μ=+0.24, |Im μ|≈2.0). The solver itself cannot manufacture the absent stable equilibrium (fugu's exact caveat).
+
+⇒ **The below-2.1 divergence is a GENUINE continuous-time Hopf-type instability** of the learned non-conservative attention operator — a complex eigenvalue pair crosses Re μ=0 between val 3.13 and 2.74 (|Im μ|~2 rotation), EMERGING + GROWING as EP trains the attention expressive. **NOT a discrete-Euler artifact; the continuous ODE / analog HW is ALSO unstable → analog is NOT immune.**
+
+WHERE cont.7 WENT WRONG: the "Euler-artifact" inference rested on the ε-sweep "converging at ε=0.01" — but that was the **step** residual `ε·g`, not `g→0` (fugu caught this); the force-residual `g` floors high (no fixed point). The direct eigenvalue measurement refutes Re μ<0. RECONCILES with all prior data: ε-monotonicity (smaller ε → lower wall) = smaller ε makes `|1+εμ|>1` grow SLOWER → delays the blow → wall relocates (2.74→2.41), doesn't close (Re μ>0 is unstable for ANY ε); ep_t2fix (better gradient blew earlier) = cleaner gradient drives into the unstable regime faster; jacreg history (adaptive→2.40, frozen→blow) = jacreg cuts ‖J_nc‖ → pushes the complex pair back to Re μ<0.
+
+**FIX (FLIPPED): integration (adaptive ε / Anderson / implicit) only DELAYS** — can't converge to an unstable/absent fixed point (Anderson got 1.4e-3, not a root = direct proof). **The REAL fix is MODEL-SIDE: jacreg** (reduce ‖J_nc‖ → push the rotating complex pair back to Re μ<0), per fugu's Q2 (FUGU_OPTIONS_VERDICT.md): use an *adaptive bounded* jacreg homeostat (the validated 2.40 runs used adaptive jacreg; diverging runs froze it). **NEXT: jacreg run + verify with `eig_probe.py` that jacreg holds Re μ<0 / keeps the complex pair below the bifurcation.** Files: eig_probe.py, anderson_control.py, adaptive_eps_calib*.py, FUGU_OPTIONS_VERDICT.md.
+
+## 2026-06-23 (cont. 9) — FIX CONFIRMED AT THE MECHANISM LEVEL: adaptive jacreg pushes Re μ<0, kills the Hopf, restores the fixed point.
+`ep_jacreg` = warm-start s2000 + ε=0.1 + **ADAPTIVE jacreg** (`--jacreg 0.1 --jr_floor 0.1 --jr_max 16 --res_target 1e-3 --res_gate 0`, controller :520-529). The ONLY change vs redx (which froze `jr_max=jr_floor=0.1`) is releasing the controller. It descends cleanly **past redx's 2.74 blow** (best 2.7495, res controlled ~1e-3, jr 0.1–0.3, no blow; ~4.5e-4 CE/step after a slow warm-start re-adjustment).
+**eig_probe (`eig_jacreg.py`) at the SAME loss ~2.74 — the decisive A/B:**
+- FROZEN jacreg (redx s3200, blew): leading μ=+0.45 (real) + +0.26±1.37j (**COMPLEX/rotating**), |λ|>1 UNSTABLE, g_floor=0.26 (cycle, no fixed point).
+- ADAPTIVE jacreg (ep_jacreg @2.75): leading μ=−0.23,−0.25,−0.27 (**ALL real, NO rotating pair**), |λ|<1 STABLE, g_floor=**0.0001** (TRUE fixed point restored).
+⇒ adaptive jacreg moved the leading eigenvalue from **Re μ=+0.45 (unstable rotating Hopf) → Re μ=−0.23 (stable real)**, killing the rotating instability and restoring a true fixed point (g 0.26→1e-4). This confirms BOTH the diagnosis (Hopf from the non-conservative ‖J_nc‖) AND the fix (jacreg reduces ‖J_nc‖ → Re μ<0), at the eigenvalue level, not just the loss. **Bonus:** the restored fixed point makes AsymEP valid again (it requires a stationary state — invalid on redx's cycle). NEXT: does it clear 2.41 (ep_eps05's blow) → 2.09 → below; re-run eig_probe at 2.41 to confirm jacreg holds Re μ<0 deeper. Production = compose with adaptive ε (fugu Q3) for sim fidelity/speed. Files: eig_jacreg.py, ep_jacreg.log.
+**OBS (oscillation, in-progress):** windowed val_std SHRANK monotonically as it descended (0.159→0.031, steps 50→3000) = converging. At step 3400 a TRANSIENT val spike (2.65→3.67) self-recovered over ~150 steps (→2.79), best preserved (2.5624). KEY: throughout the spike **res stayed LOW (4e-4→8e-5) and jr LOW (0.1)** → it is NOT a Hopf/relaxation instability (res did not inflate) but a benign weight/CE training transient (a momentarily-off update; relaxation stayed healthy). So jacreg's Hopf-suppression is intact; spike is a separate phenomenon. Watch: if such spikes recur/grow, investigate — but "res-controlled-during-spike" already rules out Hopf re-emergence.
diff --git a/docs/campaign/FINDINGS.md b/docs/campaign/FINDINGS.md
new file mode 100644
index 0000000..0cb8b53
--- /dev/null
+++ b/docs/campaign/FINDINGS.md
@@ -0,0 +1,609 @@
+# EP / AEP for Transformers — Findings
+
+**Question.** Can transformers (attention + FFN) be trained with Equilibrium Propagation (EP),
+i.e. without backprop / without Feedback Alignment? Started from literature, reproduced the SOTA,
+characterized AEP (EP for non-conservative dynamics), and ported the recipe to a real char-LM.
+
+---
+
+## TL;DR
+
+1. **EP/AEP give FAITHFUL LOCAL GRADIENTS for every transformer component** — attention
+   cosine **0.99**, FFN **1.0** vs the true BP gradient. This solves the credit-assignment problem that
+   **Feedback Alignment fails at**: FA only trains layers adjacent to the loss (output proj cos≈1.0)
+   and leaves upstream layers as noise (**attention Q/K/V cos≈0.25, FFN fc cos≈−0.01**).
+2. **Recipe to make real (non-conservative) attention EP-able: DAMPING + AEP.** A damping term
+   `s·(attn(z) − c·z)` (c≥1) creates a stable fixed point at *any* attention strength, while keeping
+   the map non-conservative (independent Q/K/V); AEP then recovers the gradient (0.99–1.0).
+3. **Stable end-to-end EP training: SOLVED** by a residual-driven continuous controller on a soft
+   Jacobian penalty (Bai-2021-style, Hutchinson) + damping. EP trains 12k+ steps without blowup.
+   **There is NO structural "EP ceiling"** (a ~2.5-CE wall was claimed mid-project and is
+   **retracted** — see *2026-06-09* section): EP has a quantitative **validity threshold**
+   (free-phase residual ≲ few·1e-4, nudge inside the linear-response window), and meeting it costs
+   regularization tax + steps — both reducible, neither a wall.
+
+---
+
+## 2026-06-20 — below-2.10 EP divergence: ROOT CAUSE + the residual-defense fix
+
+The C512 EP wall (EP frozen-jr descends to best ~2.09, then SUDDENLY diverges within ~200 steps:
+res 5e-3→0.15, cos(g_EP,g_BPTT) 0.98→~0, CE→4+, abort — while exact-BPTT with the IDENTICAL recipe
+sails past to 1.72, freeze_wsd) is diagnosed. NOT the controller, NOT jacreg, NOT the erf/tanh GELU,
+NOT a loss-landscape wall.
+
+**Root cause (Codex-confirmed, 5-way corroborated): EP optimizes the FIXED POINT; BPTT optimizes the
+FINITE UNROLL — only the finite unroll defends the residual.** `ep_step` relaxes T1=150→z150, then REFINES
+to t1max=300→z* and takes the gradient at z*; but eval & `bptt_step` use z150. EP never "feels" the T1
+residual → as attention gets expressive, contraction weakens, z150 drifts from z*, res@T1 inflates, the
+EP estimate (valid only at small res) dies → blowup. BPTT differentiates the actual 150 steps, so a
+non-converging unroll → bad CE → its gradient implicitly rewards strong contraction. That defend-the-residual
+term is what EP structurally lacks: cos-0.977 holds only AT the fixed point; the missing ⊥ component is ~21%
+(=√(1−0.977²)) = the finite-horizon transient gradient (T1=∞ would make even BPTT lose it — BPTT's stability
+IS the finite truncation; the equilibrium/implicit gradient only cares WHERE the fixed point is, not how fast
+you reach it → no contraction-reward).
+
+Two layers: (1) point-mismatch (refinement) — `--t1max 150` alone moved the wall 2.09→2.05 but still blew@600;
+(2) gradient-flavor — needs the explicit penalty. The diverged state is a forward bifurcation to a LIMIT CYCLE,
+not under-relaxation (eval_relax: res floors ~6e-2 and oscillates, 150→4000 relax steps don't help, CE ~3.7;
+FTLE stays negative −0.027..−0.050 — single random-vector FTLE misses the cycle). ⇒ adaptive / more relaxation
+steps CANNOT fix it (res as a STOPPING criterion just chases a vanishing fixed point); only res as a COST
+(penalty) prevents the drift.
+
+**Fix — explicit T1-residual penalty `--resreg`**: defend z150 with R=‖εF(z150)‖²/(‖z150‖²+ε), grad w.r.t. θ
+(z150 detached), task-relative scale (ratio = resreg·min(1, res@T1/2e-2), deadband res@T1>7e-4), added to the
+EP gradient. **MUST run res_gate=0** (the validity gate early-returns jacreg-only above the gate, bypassing the
+penalty exactly when res is high — first gated attempt blew@200). Keep t1max=300 (estimator accuracy) + penalty
+(defends z150). Analog-compatible (one extra force measurement + local VF gradient, no root-finder); more
+targeted than jacreg (which penalizes ‖J_nc‖, not the residual that explodes).
+
+**Validation (interrupted by /tmp wipe, needs re-run):** ep_resreg2 (res_gate=0, resreg=0.2, warm from erf-2.09)
+held res pinned 1–5e-4 through step 1000, best **2.0573** (past the wall, lowest any EP run reached), zero
+inflation — where every no-penalty variant blew (ep_nogate@100, gated ep_resreg@200, ep_t1max150 reached 2.05
+then blew@600). The run + ALL local ckpts were deleted 2026-06-20 by the /tmp 10-day cleanup before reaching
+~1.8. Corroboration (5): Codex independent diagnosis; BPTT-from-2.09 control (res stayed 1.8e-4 + descended,
+same fresh-opt restart where EP blew → the EP *update* is the destabilizer); FTLE (stable-BPTT 1.72 has WEAKER
+contraction −0.0347 than diverged-EP −0.0377 → not forward-stability); eval_relax (limit cycle); gradient
+decomposition (jacreg acquitted: 3% of grad, orthogonal, removing changes cos<0.001). Verbatim code + re-run
+plan: **EP_BELOW210_DIAGNOSIS_FIX.md**.
+
+---
+
+## Background (literature)
+
+- **EP** — Scellier & Bengio 2017. Energy-based; free phase relaxes to a fixed point, nudged phase
+  with strength β; local weight update from the two equilibria. Centered/symmetric nudging
+  (Laborieux 2021) reduces the gradient-estimator bias.
+- **EP ≡ BPTT** — Ernoult et al. 2019: EP updates match BPTT gradients in an RNN with static input,
+  in the limit β→0 with a converged free phase and enough nudged steps. (Relevant: as β→0 EP is
+  unbiased — so our instability is NOT primarily gradient bias.)
+- **Holomorphic EP** — Laborieux & Zenke 2022: exact gradients via finite oscillations, removing the
+  β bias/noise trade-off.
+- **CET** — Høier, Kerjan, Scellier (ICLR 2026 AM workshop), arXiv via OpenReview `Qrfml76eWJ`.
+  Convergent Energy Transformer, EP-trained, CelebA masked completion, EP ≈ TBPTE. The current SOTA
+  for "EP + attention". Conservative (energy) attention with tied value → guaranteed fixed point.
+- **AEP / AsymEP** — Scurria, Vanden Abeele, Mognetti, Massar, "EP for Non-Conservative Systems",
+  arXiv:2602.03670. Nudged-phase correction `−2 A_J(x*)(x−x*)` (A_J = antisym part of the Jacobian
+  at the free eq) turns the nudged Jacobian J → Jᵀ, giving the exact gradient even when no energy
+  function exists. **The correction is theirs**; their scope is feedforward/Hopfield nets on static
+  MNIST/CIFAR (no attention, no sequence, explicit Jacobian, no stability controller). Their
+  force-form **VF** baseline collapses without the correction. See 2026-06-16 for the full ours/theirs
+  boundary.
+- **FRE-RNN** — "Toward Practical EP", arXiv:2508.11659 (Zhuo Liu et al.). Fixes EP instability via
+  **feedback regulation (reduce spectral radius → fast convergence) + residual connections**
+  (vanishing gradients). Directly relevant to our open problem (keep the free phase converged).
+- Adjacent energy-transformers (BP-trained, not EP): Energy Transformer (2302.07253),
+  EBT "Scalable Learners and Thinkers" (2507.02092), NrGPT (2512.16762).
+
+---
+
+## The arc (what we built & found)
+
+| stage | finding | evidence |
+|---|---|---|
+| literature | EP+attention works only narrowly; CET is SOTA (CelebA, EP≈BP, single block) | CET full read |
+| MVP (CET repro) | EP trains energy-attention+memory; **cosine 0.99; EP≈TBPTE** | CIFAR 0.539/0.546; FMNIST 0.272/0.278 |
+| AEP (non-cons.) | corrects real-attention gradient: toy 0.30→**0.9975**; CET attn-params 0.75→**1.0** | `aep_*.py` |
+| characterization | needs T2≥20–40, T1≥80; β-insensitive (static); **advantage grows with depth** (K=2: naive 0.05 vs AEP 0.99); only changes nudged phase (free eq identical); 1.44× cost | sweeps |
+| fundamental limit | **strongly non-conservative attention has NO fixed point → whole EP family fails** (projection bounds magnitude but does not create a fixed point) | residual stuck 3e-2 |
+| **F: damping** | `s·(attn−c·z)`, c≥1 → stable fixed point at any s → **AEP 0.99–1.0** even at s=8 | `aep_contractive2.py` |
+| Option-2 LM attention | **AEP gives LM causal attention 0.993** (FA 0.25 — FA kills Q/K/V) | `lt_ep_attention.py` |
+| Option-2 H1 (FFN) | **EP Hopfield-memory gives FFN 1.000** (FA −0.01 — the abandonment reason) | `lt_ep_ffn.py` |
+| Option-2 H2 (train) | **BPTT trains (val CE 2.16↓, random 4.17); EP local training destabilizes early** | `lt_ep_train.py`, `H2_*.log` |
+
+**Why FA failed (the project's pain, confirmed):** FA only delivers a usable gradient to the layer
+right before the loss (output projection / FFN proj, cos≈1.0); all upstream layers (attention Q/K/V,
+FFN fc) get random-routed error → cos≈0. EP/AEP fix all of them.
+
+**Why EP training destabilizes (mechanism):** BPTT differentiates whatever the fixed T1-step
+relaxation computes (a deep weight-tied net) — it does NOT need convergence. EP *assumes* the free
+phase is at a fixed point; gradient descent on the loss pushes attention to be more expressive /
+non-conservative → fixed point lost → relaxation diverges → EP estimate (and its jvp/vjp correction)
+goes non-finite. Damping + clipping + weight-caps delay but don't prevent it.
+*(Superseded: the residual-controlled Jacobian penalty DOES prevent it — see the 2026-06-09 section.)*
+
+---
+
+## Key numbers
+
+- CET MVP (masked completion, test masked-MSE): CIFAR **EP 0.539 ≈ TBPTE 0.546**; FMNIST **EP 0.272 ≈ TBPTE 0.278**; BP-transformer 0.106/0.126; trivial(visible-mean) 0.583.
+- EP-vs-BPTT gradient cosine (CET): global **0.99**; attention WQ/WK 0.98.
+- AEP recovers non-conservative attention: toy naive 0.30 → AEP **0.9975**; CET attn-params 0.75 → **1.0**; damped high-s (s=8) AEP **0.99–1.0**.
+- LM (Shakespeare char): AEP attention **0.993** (FA 0.25); EP Hopfield-FFN **1.000** (FA −0.01).
+- H2 training: BPTT val CE **2.16 ↓** (random ln65=4.17); EP non-finite / no sustained training.
+
+---
+
+## Open problem & next directions
+
+The frontier is **stable end-to-end EP training**. Hypotheses / plans:
+1. **Keep the free phase converged during training** (FRE-RNN, 2508.11659): feedback regulation to
+   keep the spectral radius < 1 + residual connections. Most directly targets the mechanism.
+2. **β bias/noise trade-off** — Ernoult 2019 says EP≡BPTT as β→0 with converged phases; but small β
+   amplifies the `(z₋−z₊)/2β` noise. Holomorphic EP (Laborieux–Zenke 2022) removes this trade-off.
+   Diagnostic: log cosine(EP-grad, BPTT-grad) *during* training — does it start 0.99 and drop?
+3. **Unified conservative energy** (no AEP): make the whole LM one energy (energy-attention, tied
+   value) → guaranteed fixed point → plain EP is stable (we saw CET-EP is stable). Trades attention
+   expressivity for stability.
+
+## 2026-06-09 — H2 training era: the "wall" refuted, the real mechanism found
+
+**Setting.** Shakespeare char-LM, single equilibrium block, C=128 H=4, same param budget everywhere.
+`--attn_mode thin/real` = clamp + Hopfield-FFN + damped real attention; `thick` = DEQ-transformer
+block (pre-LN + attention + untied 4× GELU FFN + residual + damping). Stabilizer = damping `c` +
+soft Jacobian-norm penalty λ (Hutchinson on the non-conservative force) driven by a continuous
+controller on the free-phase residual (residual, NOT spectral radius, is the right signal — the
+attention Jacobian is non-normal, transient growth invisible to ρ).
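+
+A minimal sketch of why the spectral radius can miss this, assuming a hypothetical 2×2 toy matrix
+in place of the real attention Jacobian (the matrix values and helper names are illustrative and
+are not taken from `lt_ep_train.py`): every eigenvalue lies inside the unit circle, yet a
+perturbation grows well above its initial size before it decays.
+
+```python
+import math
+
+# Hypothetical toy "Jacobian": upper-triangular, so its eigenvalues are the diagonal entries.
+# The spectral radius is 0.9 (< 1), but the large off-diagonal term makes it non-normal.
+A = [[0.9, 5.0],
+     [0.0, 0.9]]
+
+def matvec(M, v):
+    return [M[0][0] * v[0] + M[0][1] * v[1],
+            M[1][0] * v[0] + M[1][1] * v[1]]
+
+def norm(v):
+    return math.sqrt(sum(x * x for x in v))
+
+def perturbation_norms(M, v0, steps):
+    """Norms of M^t v0 for t = 0..steps (linearized relaxation of a perturbation)."""
+    v, out = list(v0), [norm(v0)]
+    for _ in range(steps):
+        v = matvec(M, v)
+        out.append(norm(v))
+    return out
+
+spectral_radius = max(abs(A[0][0]), abs(A[1][1]))
+norms = perturbation_norms(A, [0.0, 1.0], 200)
+peak = max(norms)
+print(f"rho={spectral_radius}  initial={norms[0]}  peak={peak:.1f}  final={norms[-1]:.2e}")
+```
+
+A residual-based signal sees the transient directly, whereas ρ stays at 0.9 throughout.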
+
+**Scoreboard (best val CE; random ln65 = 4.17):**
+
+| run | result |
+|---|---|
+| BP transformer, MLP=4 | 1.68 |
+| BP transformer, MLP=1 (param-matched) | 1.79 |
+| thick-BPTT (exact grad, same arch as thick-EP) | **1.95** (res ~1e-3→8e-3, learned contraction unaided) |
+| mono-BPTT (monDEQ) | 2.11 |
+| thin-BPTT | 2.21 (optimum at res ~1.5e-2) |
+| **thick-EP champion (R5): R3 recipe + EMA-smoothed controller, 14k** | **2.0467** @ step 5k — within 0.10 of exact-gradient thick-BPTT |
+| thick-EP (R3): λ-floor 0.1, res_target 5e-4, c=1, 8k | 2.0784 — beats thin-BPTT and mono-BPTT |
+| thick-EP (R1): λ-floor 1, c=2, 14k | 2.1504 — 2.467@3k → 2.259@7.2k → 2.150@14k: monotone, no plateau, just slow |
+| thick-EP (R4): λ-floor 0.1, res_target 2e-4, 14k | 2.1352 — tight target ⇒ controller thrash (λ swings 0.5↔13 late) |
+| thick-EP (R2): λ→0, c=0.5 | 2.3572 @ step 800, then **collapse to a degenerate fixed point** (res→0 exactly, val ≫ random) |
+| thick-EP, c=2 λ=1, 3k (the run the "wall" was called on) | 2.4665 (just undertrained) |
+| thin-EP super-long 12k | 2.4847 plateau — **invalid-regime run**: res 2.4e-2–6e-2 the whole time, λ pinned at max 16 |
+| thin-EP 3k, λ=16 | 2.5593 |
+
+**Retraction.** Mid-project verdict — "EP is capped at ~2.5 by a convergence⟷richness wall
+(rich blocks need damping that destroys their expressivity)" — is **wrong**, called prematurely at
+3k steps when thick-EP ≈ thin-EP ≈ 2.47–2.50 by coincidence (two different slow/broken modes passing
+through the same value). Three independent refutations: (1) the 8k continuation sailed through the "wall"
+(2.2592@7.2k, monotone); (2) the slack run hit 2.357 in 800 steps; (3) the gradient probe below shows
+the EP estimator is healthy exactly where the stabilized runs operate. Also wrong in the original
+story: damping c does NOT aid convergence for LN-inside blocks (LN Jacobian ∝ 1/σ(z): damping shrinks
+z*, *inflating* the Jacobian — measured: thick plain-relax res 8.8e-3 at c=0 vs 3.4e-2 at c=2,
+`solver_wall.py`), and the scary "init residual~12" was an unnormalized absolute-norm print.
+
+**The real mechanism — a validity threshold, not a wall** (`grad_quality.py`: cosine of each
+estimator vs long-horizon BPTT (T1=400) reference, at a 300-step-BPTT-pretrained thick block):
+
+| estimator | free-phase res | cos vs BPTT-400 (3 batches) |
+|---|---|---|
+| EP T1=400, T2=20 | ~5e-5 | **0.85, 0.83, 0.88** — healthy |
+| EP T1=150, T2=20 | 2e-4–1.6e-3 | 0.21 / 0.86 / 0.27 — marginal, batch-dependent |
+| EP T1=50, T2=20 | 2.4e-3–3.6e-3 | −0.03 / 0.34 / 0.55 — broken |
+| BPTT T1=150 (itself!) | — | 0.45 / 0.77 / 0.36 vs BPTT-400 |
+| EP T1=150, **T2=60** | same as T1=150 | 0.01 / 0.07 / −0.19 — **nudged phase leaves the linear-response window** |
+
+Readings: (a) at tight convergence the EP estimator agrees 0.85 with the exact gradient — the
+EP-specific overhead (finite β, T2 truncation, AEP clipping) is only ~0.15 of misalignment;
+(b) **the horizon/mixing cost is shared by BPTT itself** (BPTT-150 vs BPTT-400 cos 0.36–0.77):
+slow-mixing equilibria are expensive for everyone, not an EP defect; (c) EP needs res ≲ few·1e-4 to
+be valid — at res ~1e-2 (where the thin super-long run lived) EP updates are noise, which is what
+that 2.48 "plateau" actually was: 12k steps of noise + λ=16 Jacobian-shrinking gradient; (d) longer
+nudged phase is NOT better — the AEP correction is linearized at z*, so T2 must stay in the
+linear-response window (T2≈20 at eps=0.1 good, T2=60 destroys the estimate).
+
+**Decomposition of the EP→BP gap (final numbers):** architecture (BPTT−BP) = 1.949−1.79 =
+**0.16** (small — the energy/fixed-point architecture is fine); training rule (EP−BPTT) =
+2.047−1.949 = **0.10** — slower optimization (estimator cos ~0.85) + the λ tax of staying inside
+the validity region. Total EP→param-matched-BP = **0.26** (the "wall" story claimed ~0.7 was
+structural). EP-trained thick (2.047) beats exact-gradient thin (2.206) and monDEQ (2.111):
+EP *does* cash in architecture richness, given validity + steps.
+
+**λ-study (final, prediction confirmed: middle wins):** R5 2.047 < R3 2.078 < R4 2.135 < R1 2.150
+< R2 2.357. The two failure modes flanking the optimum: λ→0 ⇒ fast descent then *collapse* (not
+explosion — a dead res=0 equilibrium); λ heavy ⇒ no failure, just a constant tax (R1 monotone to
+2.150@14k). R4's lesson: too-tight res_target puts the multiplicative controller on the residual
+noise floor — `(res/target)^0.3` per-step updates compound the noise and λ thrashes 0.5↔13,
+degrading late training. R5's fix: EMA-smoothed residual signal (`--res_ema 0.9`) kept λ quiet
+(0.1–1.0) and set the champion 2.0467 by step 5k.
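+
+The controller behaviour described above can be sketched as follows. This is an illustration
+under stated assumptions, not the code in `lt_ep_train.py`: the function names, the clamping
+order, and the alternating residual series are hypothetical. Only the `(res/target)^0.3` update,
+the EMA decay 0.9, the floor 0.1 and the cap 16 come from the text.
+
+```python
+def controller_step(lam, res_signal, target, floor=0.1, lam_max=16.0, power=0.3):
+    """One multiplicative update, lam *= (res/target)**power, clamped to [floor, lam_max]."""
+    factor = (res_signal / target) ** power
+    return min(max(lam * factor, floor), lam_max)
+
+def smoothed(residuals, decay):
+    """EMA of the residual signal; decay=0.0 returns the raw signal unchanged."""
+    out, state = [], residuals[0]
+    for r in residuals:
+        state = decay * state + (1.0 - decay) * r
+        out.append(state)
+    return out
+
+def lambda_trace(residuals, target, decay, lam0=1.0):
+    lam, trace = lam0, []
+    for r in smoothed(residuals, decay):
+        lam = controller_step(lam, r, target)
+        trace.append(lam)
+    return trace
+
+def max_step_ratio(trace):
+    """Largest factor by which lambda changes between consecutive steps."""
+    return max(max(a, b) / min(a, b) for a, b in zip(trace, trace[1:]))
+
+target = 2e-4
+# Hypothetical noisy residual: alternates above and below the target, with mean equal to it.
+residuals = [target * (1.75 if t % 2 == 0 else 0.25) for t in range(40)]
+raw = lambda_trace(residuals, target, decay=0.0)
+ema = lambda_trace(residuals, target, decay=0.9)
+print(max_step_ratio(raw), max_step_ratio(ema))
+```
+
+On this toy series the raw signal makes λ jump by a larger factor per step than the EMA-smoothed
+one, which is the thrash-versus-quiet contrast between R4 and R5 in miniature.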
+
+**Open item (the actual next frontier):** every stable recipe peaks mid-run (~5–6k) and then drifts
+up 0.1–0.2 despite cosine lr decay — EP's late-training noise floor. Suspects: VF-estimator noise
+dominating at small lr (cos 0.85 ⇒ persistent gradient-direction error), λ/lr balance shifting late,
+weight-norm caps binding. Holomorphic EP (removes β bias/noise) is the principled candidate for the
+remaining ~0.10; T2 must stay in the AEP linear-response window (T2≈20, NOT longer).
+
+## 2026-06-10 — estimator/controller co-design: EP reaches 1.74, beats exact-gradient BPTT
+
+The Holomorphic-EP line (see METHODS §4.3) + controller re-calibration, in causal order:
+
+1. **R7 — holomorphic clamp-free nudge** (N=2, r=0.02 Cauchy readout, no g-clamp / no corr-clip,
+   closed-form ∇zCE): best **2.029**, faster wall-clock than plain EP (3.3 vs 2.45 it/s). Probe
+   discoveries: the legacy clamps were the dominant estimator error at marginal residuals
+   (cos 0.27→0.89 at res 1.6e-3); N and r are flat ⇒ β-bias/noise was never the binding error;
+   T2 truncation is the remainder (T2=120 → cos 0.985 when the nudged phase is stable; adaptive-T2
+   early-stopping defeated by non-normal transient growth — open).
+2. **R8 — param-EMA alone: tie (2.031).** Exposed the real late-drift mechanism: λ pinned at
+   jr_max=16 in the tail — the controller enforcing the OLD validity margin (5e-4) against weights
+   that want res~1e-3, which the new estimator tolerates fine (cos 0.89 @ 1.6e-3).
+3. **R9 — controller re-calibrated to the new estimator** (target 1.5e-3, λmax 4) + param-EMA:
+   **best 1.7399**, sustained EMA plateau 1.75–1.79, λ quiet at 0.1–0.5, tail ~1.92. Worth 0.29 —
+   the single largest improvement of the project. Lesson: **estimator and controller must be
+   co-designed**; upgrading one without re-tuning the other converts the controller into the main
+   source of harm.
+4. **14k controls** (same horizon as R9): BP MLP=4 **1.610**, BP MLP=1 1.689. Param-shape
+   correction: the **thick** block's 131k FFN matches **MLP=4**, not MLP=1 (which matches the thin
+   block's 33k Hopfield memory) — so thick-EP's like-for-like BP baseline is 1.610.
+   Unregularized **BPTT-14k destabilizes** (res→4.7e-2, val→3.0, best 2.021 — worse than its own
+   3k run, 1.949): the equilibrium architecture NEEDS the contraction control loop for long
+   training; EP carries it out of estimator necessity, while bare BPTT has none.
+5. **The decisive control — BPTT + identical λ-controller + param-EMA: 1.6348**, tail stable.
+   Verdict: most of EP-beats-bare-BPTT was the regularizer; with equipment matched, exact
+   gradients lead EP (R9) by **0.105** — which matches the measured estimator misalignment
+   (cos 0.85–0.90). Bonus: the controlled equilibrium block (1.635) beats the MLP=1 BP baseline
+   (1.689) — the contraction controller is good for the architecture independent of training
+   rule; EP merely forced its discovery.
+6. **Adaptive T1/T2 (R10): EP 1.6755** — ate 0.064 of the 0.105 rule tax. Adaptive T2 solved by
+   *hindsight snapshot selection* (judge by increments of the contrast estimate a_t, never by step
+   sizes — non-normal transients can't fool it; dangerous batches self-limit to t≈20–30, stable
+   ones run to ~110 and collect cos 0.987; probe mean 0.871→0.932). Adaptive T1 companion: the
+   λ-controller signal stays sampled at T1=150 (R9 peace preserved), then relaxation refines to
+   res≤1e-4 before nudging — long-T2 gains exist ONLY from tight equilibria (at res~1e-3 long T2
+   hurts). **Final 14k decomposition: architecture 0.025 (1.635 vs 1.610), EP rule tax 0.041
+   (1.676 vs 1.635), total 0.066.** EP now beats the MLP=1 BP baseline (1.689).
+   *Multi-seed (3/arm):* EP **1.680 ± 0.005** vs BPTT-ctl **1.639 ± 0.006** ⇒ tax
+   **0.041 ± 0.005 (~9σ)** — confirmed real.
+7. Refuted along the way: λ-floor∝lr annealing (R6) reproduces the λ→0 death; post-mortem of that
+   death (R2/R6): **explosion frozen by fp32 absorption masquerading as convergence** (val 60–77
+   with res≡0: ε·F < ulp(z); uncapped tok/pos/fc/pj blow up) — the λ floor is the anti-collapse
+   mechanism, not just a tax.
+
+## Code map
+
+- `~/ept/cet_mvp.py` — CET energy model + EP & TBPTE training (App. B faithful).
+- `~/ept/bp_transformer.py` — vanilla BP transformer baseline.
+- `~/ept/cet_aep.py` — CETReal (real non-conservative attention) + AEP, VF gradient, damping, clip.
+- `~/ept/aep_characterize.py` / `aep_depth.py` / `aep_projected.py` / `aep_contractive2.py` — AEP sweeps.
+- `~/ept/aep_option1.py` — projected (constrained) adjoint.
+- `/tmp/lt_ep/lt_ep_attention.py` — AEP on the LM's causal attention (gradient quality).
+- `/tmp/lt_ep/lt_ep_ffn.py` — EP Hopfield-memory vs FA-FFN (gradient quality).
+- `/tmp/lt_ep/lt_ep_train.py` — H2 end-to-end EP vs BPTT training of the equilibrium LM block
+  (`--attn_mode real/energy/mono/thick`, jacreg controller `--jacreg/--jr_floor/--res_target`).
+- `/tmp/lt_ep/grad_quality.py` — EP-vs-exact gradient cosine vs residual level (the validity probe).
+- `/tmp/lt_ep/solver_wall.py` — plain relax vs Anderson convergence per damping level (thin vs thick).
+
+*(local-transformer working copy in `/tmp/lt_ep`; original `~/local-transformer` untouched.)*
+
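+## 2026-06-13 — late-drift diagnosis
+
+Every stable recipe bottoms out mid-run and then val drifts upward (S0 ~0.05, mild; S1/L2 ~0.5, severe). Multi-hypothesis diagnosis (drift_diag.py, 13 columns / 200 steps):
+- **Embedding-runaway hypothesis rejected**: |emb| grows to 11.9, then weight decay pulls it back to 11.6; self-limiting, not the culprit.
+- **Drift is harmless to the reported numbers**: saving the best checkpoint suffices (L2 drifts to 1.70 but reports best 1.214; warm-track reports 1.14).
+- **Drift does not hide a lower solution**: after warm-track bottoms out at 1.14 at step 4800, lr 3e-4 drifts and lr 1e-4 freezes (1.15); neither finds anything lower ⇒ 1.14 is the true floor of this recipe, and drift = noisy gradients jostling the model around on that floor.
+- **Mechanism = late steps taken along bad directions**: a low lr suppresses the drift (the anti-drift continuation freezes at 1.16); S0 `cos`(EP,BPTT400) falling 0.85→0.3 late in training confirms the direction degrades.
+- **Severity scales with size**: S0 (C128/T64) ≈0.05, S1 (C256/T256) ≈0.5.
+- Open (S1 warm-start diagnosis + bctl control): is the late cos collapse EP-estimator degradation (A, → fixable with N=4/tracking) or horizon blur at a slow-mixing equilibrium (B, shared by EP and BPTT, → the architecture must mix faster)? bctl=cos(BPTT150,BPTT400): if it collapses too, then B.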
+
+## 2026-06-13 (cont.) — "why EP far at S1": operating-point + bias/variance diagnosis
+
+Reframe: at the SAME (tight, res 1.5e-3) operating point EP (1.14–1.39) is at least as good as BPTT (1.52; lower CE is better). The apparent
+"gap" is EP(1.14) vs LOOSE-BPTT(0.97) — an operating point (res 1e-2, non-contractive) EP cannot
+use. Mechanism: T=256 long-range mixing wants non-contractive dynamics; EP needs contraction for the
+fixed point to exist; contraction suppresses the mixing. Signature: gap grows with T (S0/T64: 0.04;
+S1/T256: 0.17).
+
+bias_var.py (v4b ckpt, 16 batches, EP vs BPTT-400, BPTT-150 control), per group [mean-cos | cos-means]:
+- all  0.24|0.36 ;  BPTT-horizon ctrl 0.44|0.57
+- attn 0.22|0.33 ; ffn 0.32|0.49 ; ln 0.19|0.30 ; emb 0.31|0.37
+Readings: (1) two EXACT gradients (BPTT-150 vs 400) only reach cos-means 0.57 — slow mixing makes
+"the gradient" horizon-ambiguous; a PERFECT EP estimator's ceiling here is ~0.57, NOT 1. (2) Both
+EP and BPTT improve with averaging (variance present, partly recoverable via navg). (3) EP-specific
+residual concentrates in **attention + LN** (EP cos-means 0.33/0.30 vs BPTT 0.57/0.65); **FFN nearly
+fine (0.49)**, emb on par. ⇒ the AEP-corrected softmax/LN path is the weak link, pointing the fix at
+qk-norm (bounds logits, conditions the attention Jacobian) + N=4 holo + navg. Chasing cos→1 is the
+wrong target; matching BPTT's 0.57 is the right one (~0.21 of recoverable EP-specific alignment).
+
+## 2026-06-13 (cont.) — scaling to BPE/50M: stability via small-init residual branches
+
+50M (C=2048) from scratch ABORTED (init res 26; qk-norm+c=2 insufficient — res stuck 0.38). Fix:
+**ReZero/Fixup-style small-init of the residual-branch output projections** (WO, pj ×0.1) makes the
+thick block ≈ -(z-xin)-c·z (near-identity, trivially contractive) at init; training grows them.
+Result at 15M (C=1024): res **1.1e-8 at step 50** (vs abort). The big-width init-instability wall is
+a random-init artifact, not fundamental; --resinit folds into the standard recipe. Separately,
+torch.compile's 14.6× was a SMALL-model (launch-overhead) artifact; at C≥1024 the relaxation is
+compute-bound (fp32 ~15 TFLOP/s) → ~0.06 it/s at 15M, ~0.03 at 50M. Demo feasibility: 15M solo
+(~2-4 days); 50M needs multi-GPU data-parallel (per-sample contrast → gradient all-reduce) or the
+external cards. The fp32 contrast (catastrophic-cancellation-bound, bf16 dead) keeps it compute-heavy
+— again the "digital expensive / physical free" axis.
+
+## 2026-06-14 — the big-model EP recipe (scaling past S1 needs a stability stack)
+
+Scaling EP from S1 (C=256/T=256/char) to C=1024/T=512/BPE-4096 broke the S1 recipe: stable for ~50
+steps then violent free-phase blowups (res → 0.5–1.4, jr pinned at max) — the controller's stability
+margin shrinks with width/context. CAUTION: a step-50 res≈1e-8 snapshot is NOT proof of stability;
+must watch through warmup-end peak-lr (~step 500–1000). The stack that fixes it (each necessary):
+  1. **resinit** (WO,pj ×0.1) — near-identity contractive block at init.
+  2. **qk-norm** (Qwen3 q/k RMSNorm) — bounds attention logits / Jacobian (also fixes the attn/LN
+     estimator weak link from the 06-13 bias_var diagnosis).
+  3. **lr warmup** (≥1000 steps linear) — let the λ-controller establish contraction before big steps.
+  4. **µP-style lr scaling**: lr 1e-3 (C=256) → **4e-4 (C=1024)**; 1e-3 caused a 400-step instability
+     episode; 4e-4 produced only brief excursions that recovered.
+  5. **higher jr_max** (16 → 32) — bigger non-conservative force needs more penalty headroom.
+  6. validity gate + fuse retained (catch the residual transient excursions).
+Locked 15M demo recipe (running, 16k steps, best-ckpt): C=1024 H=16 T=512, the 6 items above,
+holo N=2 + adaptive T1/T2, pema 0.999. Smoke val descends 5.86→3.94 BPE-CE by step 1000.
+Cost: ~0.05 it/s (fp32, compute-bound; compile's 14.6× was a small-model launch-overhead artifact)
+⇒ 15M ~3 days solo. 50M (C=2048) still unstable even with resinit 0.03 + warmup — needs a dedicated
+init/lr curriculum AND multi-GPU; deferred as a sub-project (not on the demo critical path).
+
+## 2026-06-16 — C=512 BPE scale: baselines, AsymEP attribution, early-slowness dissection
+
+**C=512 BPE-4096 scoreboard** (TinyStories, T=256, B=24; random ln4096 = 8.318; target band 1.0–1.5):
+
+| run | result |
+|---|---|
+| BP standard transformer (7.48M, no warmup, lr 3e-3) | **best 1.7921** (20k, DONE) |
+| EP thick "orphan" (lr 8e-4, warmup800, resinit0.1, holo2, t2sel40) | **best 2.4037** (20k, DONE; jr pinned at floor 0.1 the ENTIRE run — zero excursions) |
+| EP thick "chin" (lr 9e-4, full-Chinchilla 24k target) | ABORT @5401, best 3.2408 |
+| BPTT thick (exact grad, same arch) | ABORT @2113, best 3.7331 |
+
+Data notes (NOT verdicts): (a) BP-C512 itself = 1.79 sits just *above* the 1–1.5 band (BP needs
+C≥1024 to enter it); EP-orphan 2.40 is 0.61 above BP-C512. (b) **the exact-gradient BPTT twin
+aborted EARLIEST** (2113, 3.73) — even worse than the EP runs — consistent with "training walks off
+the contractive manifold, exact gradient included"; lr/seed not matched across these, treat as
+directional. (c) abort = the `res>0.1 for 100 consecutive steps` fuse. **lr is the knife-edge**:
+8e-4 (orphan) had NO excursions and finished; 9e-4 (chin) had periodic jr→32 / res→0.2–0.5
+excursions, most recovered, one didn't make it back inside the 100-step fuse window → abort@5401.
+Narrow lr-stability margin at C=512. (d) orphan late curve is a plateau: best moved only
+2.4557→2.4037 over the last 3600 steps.
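+
+The abort rule in (c) can be sketched as a consecutive-step counter. The class and method names
+here are hypothetical; only the threshold (res > 0.1) and the 100-step window come from the notes
+above, and the real fuse in the training script may differ in detail.
+
+```python
+class ResidualFuse:
+    """Trips when the residual exceeds `threshold` for `patience` consecutive steps."""
+
+    def __init__(self, threshold=0.1, patience=100):
+        self.threshold = threshold
+        self.patience = patience
+        self.streak = 0  # number of consecutive steps with res > threshold
+
+    def update(self, res):
+        """Feed one step's free-phase residual; returns True when the run should abort."""
+        self.streak = self.streak + 1 if res > self.threshold else 0
+        return self.streak >= self.patience
+
+# An excursion that recovers inside the window does not trip the fuse ...
+fuse = ResidualFuse()
+recovered = [fuse.update(0.3) for _ in range(99)] + [fuse.update(0.01)]
+# ... but one that stays above the threshold for the full window does.
+fuse = ResidualFuse()
+stuck = [fuse.update(0.3) for _ in range(100)]
+print(any(recovered), stuck[-1])
+```
+
+Because the counter resets on any step at or below the threshold, a run that sits just under 0.1
+never aborts, however long it stays there.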
+
+**AsymEP attribution (web-verified, the J→Jᵀ correction is NOT ours).** arXiv:2602.03670,
+Scurria / Vanden Abeele / Mognetti / Massar, "EP for Non-Conservative Systems", names the method
+**AsymEP**. Their correction `−2 A_J(z*)(z−z*)` is **mathematically identical** to ours
+(`Jv−Jᵀv = (J−Jᵀ)v = 2 A_J v`). Their scope: Hopfield nets + feedforward MLPs (≤500 hidden) +
+one CIFAR-10 feedforward CNN; **static inputs; explicit Jacobian construction + decomposition;
+NO residual/validity controller; NO attention / transformer / sequence model; do NOT combine with
+holomorphic**. Best result: CIFAR-10 FF-CNN AsymEP 89.74% vs BP 90.66% (their variational "Dyadic
+EP" matches BP, p=0.75). Their **"VF" baseline** (the force-form ⟨a,∂F/∂θ⟩ readout — *prior art, not
+ours*) **collapses without the correction** (CIFAR-10 → 10% = chance; MNIST 64% vs AsymEP 92.7%) —
+matching our measured cos≈0.25 for uncorrected non-conservative attention. **Ours on this line**:
+matrix-free jvp−vjp (their explicit Jacobian is infeasible at B·T·C dim); softmax-attention
+application; the holomorphic combination (the real-linear correction preserves holomorphy);
+common-mode-tracking + lock-in variants; the validity-threshold/controller/gate stability stack they
+have none of; and the transformer-LM application + scale. (⇒ corrected EP_DERIVATION.md, which had
+mislabeled force-form VF as our step.)
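+
+The identity `Jv−Jᵀv = 2 A_J v`, with `A_J = ½(J−Jᵀ)`, can be checked numerically on a small
+explicit matrix. The sketch below is illustrative only: the 3×3 values are arbitrary, and it
+builds `J` explicitly for readability, whereas the notes above describe a matrix-free jvp−vjp at
+scale.
+
+```python
+import math
+
+def matvec(M, v):
+    return [sum(m * x for m, x in zip(row, v)) for row in M]
+
+def transpose(M):
+    return [list(col) for col in zip(*M)]
+
+def norm(v):
+    return math.sqrt(sum(x * x for x in v))
+
+# Arbitrary non-symmetric toy Jacobian and probe vector (hypothetical values).
+J = [[-1.0, 2.0, 0.0],
+     [0.5, -1.0, 3.0],
+     [0.0, -0.5, -1.0]]
+v = [1.0, -2.0, 0.5]
+
+Jt = transpose(J)
+# Antisymmetric part A_J = (J - J^T) / 2.
+A_J = [[0.5 * (J[i][j] - Jt[i][j]) for j in range(3)] for i in range(3)]
+
+jv_minus_jtv = [a - b for a, b in zip(matvec(J, v), matvec(Jt, v))]
+two_A_v = [2.0 * x for x in matvec(A_J, v)]
+
+# Relative size of the non-conservative part along v, i.e. |Jv - J^T v| / |Jv|.
+ratio = norm(jv_minus_jtv) / norm(matvec(J, v))
+print(jv_minus_jtv, two_A_v, ratio)
+```
+
+For a symmetric (conservative) `J` the correction vanishes and the ratio is zero; the larger the
+ratio, the more the uncorrected estimator is missing.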
+
+## 2026-06-17 (cont.) — full to-spec versions of exp1/4/5/6/7 (first pass was partial)
+
+The 06-17 campaign ran lighter/subset versions; re-ran the FULL specs. New findings:
+
+**exp1 (FULL) — multi-checkpoint cos-evolution (snap 0/200/800/2000 + plateau).** EP·BPTT150 cos:
+step0(random,res2e-2) 0.26 → step200(res1e-9) **0.99** → step800 0.96 → step2000 0.89 →
+plateau(2.40) **−0.05**; BPTT150·BPTT400 = 0.46→**1.00** throughout (true gradient always clean).
+New **batch self-cos** at the plateau: EP **−0.27** (gradients on different batches anti-correlated)
+vs BPTT **+0.96**. ⇒ decisive: the EP estimator is faithful while descending and undergoes an
+**SNR/coherence collapse near the optimum** (true gradient shrinks below the EP bias floor, which is
+batch-incoherent) — NOT horizon ambiguity, NOT LR. This is THE 2.40-plateau mechanism.
+
+**exp5 (FULL) — jr_max sweep (the dropped dimension, and it mattered).** Warm-started from 2.40:
+**jr_max 8 → 2.3648**, jr_max 4 → 2.376 — both BEAT jr_max 32 (2.463) and the orphan (2.404) by
+~0.04. jr_max 16 → 2.416. floor/target fill still ≥2.47. ⇒ the jr_max=32 ceiling was a small tax;
+~8 is better. (Earlier "not a controller tax" was incomplete — floor/target don't help, but the
+ceiling does.) Still far from the 1–1.5 band; the estimator floor dominates.
+
+**exp4 (FULL) — 3 arms.** armA (warmup): ~50-step gate dead-zone. armB (`cprewarm 200` = full-lr
+contraction before task warmup): res 1.7e-2 after pre-phase, dead-zone only PARTIALLY mitigated
+(val 5.97 vs 6.07 @100) — establishing contraction once isn't enough; warmup's tiny lr can't deepen
+it. armC (no-warmup + resinit 0.03): NO dead-zone (cos 0.99 from step20, best 4.03@300) but a violent
+excursion (res→0.14@400, recovered). ⇒ small-resinit+no-warmup learns fastest early but is
+excursion-prone; the cprewarm decoupling is a weak fix.
+
+**exp6 (FULL) — branch norms + attention entropy + qk RMS, incl. at plateau.** Branches grow fine
+(|WO| 1.95→4.91 over training); at the 2.40 plateau they are LARGE (|pj|=44, |fc|=58, attn/xin≈1.0),
+attn entropy 4.56→3.62 (sharpening), qk_rms pinned 1.0 (qknorm working). ⇒ no "stuck-small branch" —
+the trainable-α-gate intervention is NOT indicated.
+
+**exp7 (FULL, after a probe bug fix) — EP vs BP influence + per-position CE.** [Bug caught: the
+EP probe embedded the input with random-init weights before loading the ckpt → garbage (per-pos CE
+~7); fixed to embed post-load.] Clean influence ||Δz*_q||/δ vs distance d: **BP** 5.34(d1)/0.88(d16)/
+0.15(d64)/0.003(d200); **EP-orphan** 0.24/0.071/0.019/0.006. Both decay with distance. EP ~20×
+weaker in ABSOLUTE coupling (scale-caveated — z* vs hidden norm), but NORMALIZED to d=1 EP reaches
+*farther* than BP (0.30 vs 0.16 at d16; 0.08 vs 0.03 at d64). Per-position CE: EP uniformly ~0.7–1.0
+worse than BP at every position (no long-range-specific failure). ⇒ **does NOT support — arguably
+refutes — "contraction cuts long-range mixing"**; EP's deficit is uniform, consistent with the
+estimator floor (exp1) being the binding constraint. (BP-C512 trained to **1.6953**.)
+Honest gap: a stable BPTT-ctl-tight ckpt does not exist (BPTT broke at C512), so the cross-method
+influence comparison is EP-vs-BP only, not the full tight/loose/BP set.
+
+**Campaign verdict (data; user concludes):** the C512 2.40 plateau is, decisively, an **EP estimator
+SNR/coherence collapse near the optimum** (exp1: cos→0, self-cos −0.27, k→4000, true gradient clean).
+It is not LR (exp2), not horizon (exp1), not λ-floor/target (exp5) — though jr_max≤8 buys ~0.04
+(exp5). The architecture has a SEPARATE wall: exact-gradient BPTT can't stay stable at C512 (exp3).
+The contractive-mixing hypothesis is NOT supported (exp7: EP's range is comparable to BP, deficit is
+uniform). ⇒ the EP lever is reducing the estimator bias floor near the optimum (N=4 holo / tracking-
+AEP / lock-in / navg); the architecture lever is the stability margin for exact gradients.
+
+**Early-slowness dissection (C512 BPE) — cos/k probe CONFIRMED; ablations STILL RUNNING.**
+From existing logs: **EP ≈ BPTT at every early step** (step 200: 5.84 = 5.84; step 800: 3.86 vs
+4.05; step 2000: 3.58 vs 3.73), both **~1.2 behind BP** (2.76 @800). ⇒ early slowness is *not* the
+EP estimator — the exact-gradient twin on the same architecture is equally slow; the gap is the
+equilibrium architecture + recipe vs a standard transformer, present already at step 200 (pre
+warmup-end). Probe (`--probe_bptt`: cos(g_EP,g_BPTT) + k=|gEP|/|gBPTT| along the REAL trajectory):
+- **Estimator quality tracks res tightly**: res≈1e-9 → cos 0.93–0.99, k≈1.0 (all groups); a res
+  spike to 1.4e-3 (mid-excursion) → cos all 0.86 / attn 0.70 / emb 0.13, k→0.4. ⇒ the earlier
+  kinit *synthetic* probe (cos 0.33 / k 0.42 "at init") was a probe-operating-point artifact; on the
+  real trajectory the EP gradient is faithful (cos≈1, k≈1) **whenever the free phase is converged**,
+  and **k≈1 means no adaptive-k is needed in the converged regime**.
+- **warmup × validity-gate early-stall mechanism**: with warmup, early lr is tiny → the contraction
+  homeostat can't pull res under the gate (5e-3) → free-phase res stuck ~1.5e-2 for ~50 steps →
+  the **validity gate skips the nudge** → the reported g_EP is pure jacreg/contraction (cos≈0,
+  k≈11) = *the task is not being learned at all* for ~50–60 steps. res drops < gate by ~step 60–80
+  → cos jumps to 0.99. No-warmup: res converged (1e-9) by step 20, cos 0.98 from step 20
+  (val 5.72 vs warmup 6.07 @ step100).
+- **CAVEAT — the tradeoff is UNRESOLVED**: the no-warmup arm hit an instability excursion ~step 180
+  (res→1.4e-3, jr→32, recovering by 200) — i.e. warmup's stability role is real. Whether no-warmup
+  survives to 1200, plus the **resinit (arm B)** and **lr 2e-3 (arm D)** attributions, are STILL
+  RUNNING as of 16:00 — no verdict drawn. Probe instrumentation: `lt_ep_train.py --probe_bptt N`;
+  launcher `/tmp/lt_ep/early_dissect.sh`.
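+
+The two probe quantities used in the bullets above, cos(g_EP, g_BPTT) and k=|gEP|/|gBPTT|, can be
+written as a short sketch over flattened gradients. The helper names are hypothetical and the
+vectors are toy values; the actual `--probe_bptt` instrumentation is not reproduced here.
+
+```python
+import math
+
+def dot(a, b):
+    return sum(x * y for x, y in zip(a, b))
+
+def grad_cos(g_est, g_ref):
+    """Cosine between an estimated gradient and a reference gradient (flattened lists)."""
+    return dot(g_est, g_ref) / (math.sqrt(dot(g_est, g_est)) * math.sqrt(dot(g_ref, g_ref)))
+
+def grad_scale(g_est, g_ref):
+    """k = |g_est| / |g_ref|: how much larger the estimate is than the reference."""
+    return math.sqrt(dot(g_est, g_est)) / math.sqrt(dot(g_ref, g_ref))
+
+g_bptt = [3.0, 4.0]
+aligned = [6.0, 8.0]       # same direction, twice the magnitude
+orthogonal = [-4.0, 3.0]   # same magnitude, no shared direction
+print(grad_cos(aligned, g_bptt), grad_scale(aligned, g_bptt))
+print(grad_cos(orthogonal, g_bptt), grad_scale(orthogonal, g_bptt))
+```
+
+Applying the same two functions to each parameter group's flattened gradient gives the per-group
+readings (all / attn / emb); cos and k are independent, so a faithful direction with the wrong
+scale and a correctly scaled but useless direction are told apart.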
+
+## 2026-06-17 — C512 diagnostic campaign (7 experiments): the 2.40 plateau is an EP estimator bias-floor; exact-gradient BPTT is LESS stable than EP at C512
+
+Controlled campaign at C512 BPE (orphan ckpt = EP best 2.4037). Raw results + the causal cut each
+experiment was built for. Scripts: `lr_sweep.py`, `triangulation.py`, `mixing_probe.py`,
+`lt_ep_train.py --probe_branch/--probe_bptt`; orchestration `master.sh`.
+
+**exp3 — BPTT+controller, the decisive matched cut (GPU0).** Same arch+controller+lr 8e-4 as the EP
+orphan; only the task gradient differs (exact vs EP). Descended cleanly to best **3.85 @step1400**,
+then **DESTABILIZED at step 1600** (jr→32, res→0.07) and lodged in a broken basin (val 6.27,
+res ~0.068 — just under the 0.1 fuse, so no abort) for 8400+ steps. **The exact-gradient twin did
+WORSE than EP (3.85-stuck vs EP 2.40-completed).** ⇒ at C512 the contraction controller does NOT
+keep the exact gradient on the manifold; EP's implicit contraction-bias is what kept the orphan
+alive. Inverts the S0 picture (BPTT+ctl 1.635 < EP 1.676). Caveat: single lr/seed.
+
+**exp1 — triangulation at the 2.40 plateau (estimator vs horizon).** BPTT150 vs BPTT400
+cos = **1.000** (all/attn/ffn; emb 0.986) — true gradient perfectly well-defined, ZERO horizon
+ambiguity — but EP vs BPTT cos = **−0.045** (orthogonal), k=|gEP|/|gBPTT| ≈ **4000**. ⇒ the plateau
+is an **EP-specific estimator failure**, not slow-mixing horizon ambiguity.
+
+**exp2 — one-step LR-sweep at the plateau.** BPTT best ΔCE −0.16 (lr 1e-2); EP best −0.04 (lr 1e-4),
+diverges for lr≥3e-4. cos 0.11, k 486. ⇒ NOT LR-inequivalence; EP's direction caps one-step descent
+4× below BPTT at any lr.
+
+**exp1+exp2 synthesis:** the EP estimator has a ~fixed bias/noise floor. k runs **~1 during active
+training → ~4000 at the plateau** (true gradient shrank below the EP bias). 2.40 is the SNR
+crossover; below it the true gradient is smaller than the EP bias so EP can't see it. Not LR, not
+horizon, not λ. ⇒ the EP lever is **reducing the estimator floor near the optimum** (N=4 holo /
+tracking-AEP / lock-in / navg averaging), distinct from the architecture's stability lever.
+
+**exp5 — λ floor/target grid (warm-start from orphan).** base 0.1/1.5e-3 → 2.463; 0.05/3e-3 → 2.491;
+0.03/3e-3 → 2.485; 0.1/5e-3 → 2.488 — ALL worse than orphan 2.40 (slight late-drift up). ⇒ relaxing
+the controller does NOT break the plateau — not a λ-floor tax. (floor 0.03 reached jr=0 without
+collapse over 1200 warm-started steps.)
+
+**exp6 — branch-growth (resinit vs no-resinit).** resinit 0.1: branches grow gradually & stably
+(|WO| 2.35→4.91, |pj| 3.23→6.57 over 1000 steps; res ~1e-4; best 3.71). no-resinit: branches start
+large (|WO|~20, |pj|~20), excursion at step 800 (res→0.23, jr→32; recovered, |WO|→14.7; best 4.04).
+⇒ resinit's gradual growth is stabilizing; branches do grow (not stuck-small).
+
+**exp7 — mixing/influence length.** ||Δz_q*||/δ FLAT across distance (~0.007 at d=1 and d=200) and
+weak; loose-target (lam_t5) similar (~0.005). ⇒ no distance-decay signature; INCONCLUSIVE on the
+contraction-cuts-mixing hypothesis (no non-contractive/BP reference; coupling uniformly weak).
+
+**exp4 — warmup/gate (probes done).** no-warmup best 3.56@1200 (survived its step-180 excursion);
+warmup 4.11@800. warmup×validity-gate early-stall confirmed (warmup keeps res>gate ~50 steps →
+nudge skipped → no task learning early).
+
+**Net (data; user to conclude):** the 1–1.5 band is blocked at C512 by TWO distinct walls —
+(1) EP's estimator bias-floor caps EP at 2.40 (cos→0, k→4000 at the plateau; true gradient is
+clean), and (2) the equilibrium architecture+controller can't keep the EXACT gradient stable at
+C512 (BPTT broke at step 1600). EP's contraction-bias makes it the more robust of the two here.
+The contractive-conflict hypothesis is supported on the BPTT side; the EP plateau is specifically
+an estimator-SNR-floor effect.
+
+## 2026-06-17 (round 2) — the plateau bias floor IS the frozen AEP linearization; tracking-AEP fixes it (gradient level)
+
+Estimator ablation at the 2.40 plateau ckpt (`plateau_ablation.py`, vs BPTT400, n_grad=4):
+
+| estimator | cos | k | batch self-cos |
+|---|---|---|---|
+| N2 frozen-AEP (current) | −0.045 | 4133 | −0.27 (incoherent) |
+| N4 / N8 | −0.03 / −0.01 | 31 / 29 | ~0 |
+| N2 r=.05/.10/.20 | −0.04 | ~4200 | −0.27 |
+| N2 fixed-T2 20/80/120 | nan (diverges w/o snapshot selection) | — | — |
+| **track_N2 (common-mode AEP)** | **0.997** | **0.9** | **+0.95 (coherent)** |
+
+Diagnostic at z*: **|Jv−Jᵀv|/|Jv| = 1.37** (J highly non-normal). ⇒ the bias floor IS the
+frozen-at-z* AEP linearization. N=4/8 fix only the magnitude (k 4000→30), not the direction; r and
+T2 do nothing; **tracking-AEP (re-linearize at the moving common mode z̄=½(z₊+z₋)) restores cos
+−0.045→0.997, coherence −0.27→+0.95, magnitude 4133×→0.9×.**
+
+**exp D (per-group):** collapse is UNIFORM across WQ/WK/WV/WO/fc/pj/ln1/ln2/tok/pos (all cos
+−0.03..−0.08, k 3k–18k) — the corrupted contrast `a` poisons every VF gradient equally (single
+shared cause, not attention/LN-specific unlike the S1 bias_var diagnosis).
+
+**exp B (navg):** cos(EP,BPTT) over navg = 0.37→0.40→0.55→−0.30→−0.21 (1/2/4/8/16); self-cos rises
+(0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8. ⇒
+**deterministic bias, not variance** — averaging converges to the wrong (biased) direction, so
+navg/restart-averaging will NOT fix it.
+
+### 2026-06-17 (round 2b) — free-phase ε is mistuned ~3× (monDEQ α-analysis analog); Anderson untapped
+
+Free-phase convergence probe at the 2.40 ckpt (`solver_probe.py`, res_F=‖F(z)‖/‖z‖, eps-independent,
+B=2, 4 batches). Steps to res_F≤1e-4 / ≤1e-6, plain Euler: ε0.05→470/844, **ε0.1 (ours)→236/548**,
+ε0.2→121/274, **ε0.3→85/182** (~2.8× faster than 0.1, same fixed point, stable), ε0.5→stalls at
+res~0.9 (non-convergent — the stable ceiling is ~0.3–0.4). Anderson(m=5): ε0.1→85/197, **ε0.2→58/118
+(~4× vs our current 0.1)**, robust across ε. ⇒ **ε=0.1 is ~3× too small**; ε≈0.3 or Anderson@ε0.2 cut
+free-phase force-evals ~3–4× (and let T1≈100 always clear res≤1e-4, dropping adaptive-T1) → directly
+attacks the dominant EP compute cost AND tightens res (better estimator). This IS the "preconditioned/
+Anderson relaxation" open item, now quantified. Caveats: free phase at trained point / B=2 (optimal ε
+may shift early-training / slow-mixing); the **nudged phase shares ε but has AEP+holo dynamics — bigger
+ε there needs a separate stability check** before changing. Confirms the monotone-operator α-sensitivity
+(non-normal dynamics, |Jv−Jᵀv|/|Jv|=1.37) carries over to our hand-set Euler step.
+
+**Decision:** the EP lever collapses to **common-mode tracking-AEP** (not N=4 / lock-in / navg).
+Caveat: validated at the gradient level at one ckpt; the one-step ΔCE is magnitude-confounded, so the
+NECESSARY next test is **training** — does EP with `--track` descend below 2.40 (`holo_a_track`
+already wired)? exp C (BPTT+controller rescue, lr 2e-4/4e-4/6e-4 × jr_max 16/32) runs separately for
+the architecture-ceiling question.
+
+## 2026-06-18 — tracking-AEP VALIDATED IN TRAINING: it breaks the 2.40 plateau
+
+Warm-track test (warm-start from the 2.40 orphan ckpt, jr_max 8, warmup 0, 2500 steps):
+- **TRACK-warm (common-mode tracking-AEP): 2.40 → best 2.1628 @2500, still descending**, stable
+  (res ~6e-5, jr at floor). ⇒ tracking-AEP descends PAST the 2.40 plateau — the round-2 estimator-
+  floor fix works in training, exactly as the fixed-ckpt probe predicted (cos −0.05→0.997).
+- **STD-warm control (standard estimator, same recipe): ABORTED @step 397** (blew up, res→0.21).
+- **TRACK-fresh (tracking from scratch): ABORTED @968** (blew up from random init).
+
+⇒ under the identical aggressive recipe (jr_max 8), tracking-AEP is BOTH stable AND descends, while
+the standard estimator blows up. But tracking-AEP is **not a from-scratch drop-in** (fresh blows up)
+— the recipe is **two-phase warm-track**: standard estimator to ~2.40 (where it's stable), then
+switch to common-mode tracking-AEP to descend below the floor (matches the S1 warm-track champion).
+Caveat: std-control's abort is partly the aggressive jr_max 8 (orphan used jr_max 32, stable at
+2.40); the clean claim is "at jr_max 8 tracking survives+descends where std blows up". Reaching 2.16
+closed ~0.24 of the 0.61 EP-vs-BP gap and the run was still descending → continuing (`trkcont` from 2.16) to find
+the ceiling.
+
+**exp C — architecture ceiling (BPTT+controller rescue, 2500 steps):** stable exact-gradient exists
+(lr≤6e-4 never breaks, unlike lr8e-4 which broke @1600). Best: lr6e-4 jr16 → 3.08, still descending.
+At equal step (2500): **BP 2.34 ▸ BPTT 3.08 ▸ EP-orphan 3.55** ⇒ the ~10× proportional slowdown
+(confirmed: median 9.4× in the descending regime) splits into an lr-stability-capped architecture
+handicap (~0.74, BP→BPTT; the equilibrium block can't take BP's 3e-3 lr) AND an EP-estimator
+handicap (~0.47, BPTT→EP). jr_max 16 > 32 confirmed again. Long ceiling run (`bptt_ceiling`,
+lr6e-4 jr16, 12k) pending.
+
+**Found + staged a real code inconsistency:** free phase / model / readout use erf-GELU (F.gelu)
+while the holo nudge uses tanh-GELU (cgelu) → z* isn't the nudge-force's fixed point → spurious
+common-mode drift in the contrast every step (which tracking-AEP partly absorbs — possibly part of
+why it wins). Fix staged as `--gelu tanh` (additive, default erf). Separately: free-phase ε=0.1 is
+~3× too small (ε0.3 or Anderson → ~3–4× fewer relaxation steps; `solver_probe`).
+
+---
+
+## 2026-06-24 — below-CE-2.1 divergence: DIAGNOSED (genuine Hopf) + FIXED (resreg/jacreg = finite-T1 LE control)
+
+Full write-up: **`SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md`**. This pins down the forward-stability wall that sat under TL;DR #3 ("no loss ceiling, but a validity threshold that costs reg + steps") — the threshold is now mechanistically a **Hopf bifurcation**, and the "regularization tax" is specifically **LE control**.
+
+**The mechanism (3 independent confirmations).** Below CE~2.1 the relaxation diverges because the learned **non-conservative attention** operator undergoes a **genuine continuous-time Hopf bifurcation** — a complex eigenvalue pair crosses **Re μ > 0** as EP makes attention expressive. Matrix-free FD-JVP Arnoldi on M=I+εJ (`ep_run/eig_probe.py`): leading continuous μ=(λ−1)/ε goes **−0.024 (s2000, CE 3.13, STABLE) → +0.44±1.37j (s3200, 2.74) → +1.35±2.08j (ep_eps05, 2.41)** — Re μ and |Im μ| both grow as CE drops. Confirmed by (a) eval limit-cycle + attention-knockout (α=0.2 converges), (b) Anderson can't reach a root at s3200 (near-root Re μ=+0.24 rotating). **Earlier "Euler artifact" read (and the cont.6 forward-mode/RTRL framing) are SUPERSEDED** — fugu caught that the ε-sweep "convergence" was the *step* residual ε·g, not g→0; an intermediate misread. (The Euler over-amplification is a real *second* layer — ε relocates the wall 2.74→2.41 — but the root cause is Re μ>0, which no ε closes.)
+
+**Why equilibrium-EP hits this and BPTT doesn't (structural).** Equilibrium-EP optimizes L(z*); its gradient is blind to *finite-time contraction* (a param eroding ρ→1 with z* fixed has zero equilibrium gradient). BPTT differentiates the T1 unroll, so a non-converged z_T1 → bad output → penalized → it implicitly defends contraction. So EP needs that defense added back explicitly.
+
+**The fix — control the finite-T1 Lyapunov exponent (LE).** Stability ⟺ ρ(M)<1 ⟺ finite-T1 LE<0. Two ~orthogonal handles, they stack:
+- **resreg** (`lt_ep_train.py:220-231`) = penalize the T1-residual ‖εF(z_T1)‖ ~ ρ^T1 = exp(T1·LE) → a **direct LE penalty** (also catches the non-normal transients eigenvalues miss). **The PROVEN below-2.1 stabilizer.**
+- **jacreg** (`:211-219`) = penalize ‖J_nc·v‖ → shrinks |Im μ|, pushes the pair to Re μ<0 (cause-side). `ep_run/eig_jacreg.py` confirmed at the mechanism level: at the same CE~2.74, frozen-jacreg = Re μ=+0.45 rotating g_floor 0.26, adaptive-jacreg = **Re μ=−0.23 real, g_floor 1e-4** (Hopf killed, true fixed point restored, AsymEP validity restored).
+
+**★ The 2.09 recipe = FROM SCRATCH + resreg=0.2 + FROZEN jr** (NOT adaptive jacreg — that was a session-long detour; cmd at `EP_BELOW210_DIAGNOSIS_FIX.md:97-101`). Original ep_resreg2 reached **2.0573** (lowest EP ever, lost to /tmp wipe; rebuild 2.22). This session: ep_resreg_warm (resreg eager, warm from stable s2000) descending smoothly through 2.30 (peak res 1.6e-2, no spikes) — the clean 2.09 test; ep_rr_scratch (from-scratch, proven recipe) launched as the robust confirmation. **Warm-start ONLY from a STABLE operator (s2000, Re μ<0); resreg/jacreg PREVENT instability, they can't RESCUE an already-blown one** (warming from ep_eps05@2.41 blew).
+
+**Infra (#14).** `--compile` EXONERATED + SAFE (compiles the free-phase `tforce`; numerically identical to eager — z150 rel-diff 1.6e-7, pure fp32; ~1.43x, ~3.3x with t2sel40). **`--tf32` UNSAFE — do NOT use** (10-bit mantissa ≈ 1e-3 perturbation; the relaxation is hyper-sensitive — ε 0.1→0.05 alone moved the wall 0.33 — so TF32 is too coarse). EP parallelism for the deep/scaling phase: no sequential backward, coupled equilibrium stack (#13) → depth-parallel, adaptive-T1 (residual-stopped) cleaner than adaptive-ε.
+
+**Lit anchor.** Hopfield-ResNet (arXiv 2509.26003) EP-trained 12 layers — but it's **conservative** (energy Φ, symmetric weights, no oscillation by construction). Confirms non-conservativity is our culprit; **we are the first to EP-train non-conservative attention** (which has the Hopf), solved via resreg/jacreg.
diff --git a/docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md b/docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md
new file mode 100644
index 0000000..d605140
--- /dev/null
+++ b/docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md
@@ -0,0 +1,81 @@
+# Session 2026-06-22→24: below-2.1 wall — DIAGNOSIS FLIPPED to genuine Hopf, FIX = resreg/jacreg (LE control), 2.09 recipe recovered
+
+THE definitive write-up of this multi-day session. Supersedes the "Euler-artifact" framing in EP_BELOW210 cont.7.
+
+## 1. HEADLINE — the diagnosis flipped twice, landed on GENUINE Hopf instability
+The C512 EP below-CE-2.1 divergence is a **genuine continuous-time Hopf instability** of the learned **non-conservative attention** operator: a complex eigenvalue pair crosses **Re μ > 0** as EP training makes attention more expressive. It is **NOT a discrete-Euler artifact** (that was an intermediate wrong turn), so analog/continuous hardware is NOT automatically immune.
+
+### Evidence chain (3 independent methods, all in ep_run/)
+1. **eval_relax_s3200.py** — relax the marginal ckpt redx s3200 (val 2.74) 6000 steps → res floors ~2.3e-2, OSCILLATES, no fixed point = limit cycle.
+2. **knockout_s3200.py** — scale attention output WO×α: cycle scales with α, CONVERGES at α=0.2 → the **non-conservative attention drives the cycle**.
+3. **eps_sweep_s3200.py** — cycle amplitude shrinks monotonically as ε↓ (0.1→2.3e-2, 0.01→8.9e-4). *Intermediate misread*: thought "Euler artifact, analog immune" (cont.7). **fugu caught the bug**: that 8.9e-4 at ε=0.01 is the STEP residual `r=ε·g`, i.e. the SAME force-floor g≈0.09 — so the sweep proves the oscillation is discrete-amplified but does NOT prove a clean continuous fixed point exists.
+4. **ε-TRAINING-monotonicity (3 runs)**: ε=0.1→blew@**2.74**; ε=0.1,t2sel=160 (BETTER gradient)→blew@**3.02** EARLIER (⇒ gradient quality is NOT the lever); ε=0.05→blew@**2.41**. Smaller ε → strictly lower wall.
+5. **eig_probe.py** (matrix-free FINITE-DIFFERENCE-JVP Arnoldi on M=I+εJ — DECISIVE): leading continuous μ=(λ−1)/ε at the operating point:
+   - s2000 (3.13): all |λ|<1, **Re μ=−0.024 STABLE**
+   - s3200 (2.74): top |λ|=1.044, **Re μ=+0.44, COMPLEX** (μ=+0.26±1.37j rotating)
+   - ep_eps05 (2.41): top |λ|=1.14, **Re μ=+1.35**, complex (±2.08j)
+   - **Re μ grows −0.02→+0.44→+1.35 as CE drops, |Im μ| grows ~0→1.37→2.08.** GENUINE instability, growing.
+   - NOTE: autograd JVP gave 0 (blk.force detaches internally) → use **FD-JVP** `J·v ≈ ‖v‖·(F(z+h·u)−F(z))/h` with unit direction `u=v/‖v‖` and step h=1e-3·‖z‖.
+6. **anderson_control.py** — s3200: plain relax floors (res 2.7e-2), Anderson CANNOT reach a root (best 1.4e-3), near-root has **Re μ=+0.24, |Im μ|≈2.0 UNSTABLE rotating**. Solver can't manufacture an absent/unstable equilibrium.
+
+### Reconciles with everything
+- ε-monotonicity: with Re μ>0, `|1+εμ|>1` for ANY ε; smaller ε just makes the blow-up SLOWER (less discrete over-amplification, esp. the rotating (εb)² term) → delays → wall RELOCATES (2.74→2.41), never closes.
+- ep_t2fix earlier-blow: cleaner gradient drives into the stiff/unstable regime faster.
+- It's a COMPOUND: continuous instability (root, Re μ>0) + explicit-Euler over-amplification (2nd layer). ε attacks layer 2 (delay); jacreg/resreg attack the root.
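+
+A short expansion makes the two layers explicit. It is a sketch, assuming the dynamics near the operating point are dominated by a single leading pair μ = a ± ib of J (a and b are notation introduced here, not names from the code):
+
+$$
+|1+\varepsilon\mu|^2 = (1+\varepsilon a)^2 + (\varepsilon b)^2 = 1 + 2\varepsilon a + \varepsilon^2\,(a^2+b^2)
+$$
+
+$$
+\frac{1}{\varepsilon}\,\ln|1+\varepsilon\mu| = a + \frac{\varepsilon}{2}\,(b^2-a^2) + O(\varepsilon^2)
+$$
+
+For a > 0 the first line exceeds 1 for every ε > 0, so no step size restores contraction. The second line is the growth rate per unit of continuous time: it tends to a as ε → 0 (the root layer), and the ε(b²−a²)/2 correction is the explicit-Euler over-amplification, which is positive when the rotation dominates (|b| > |a|) and shrinks linearly with ε (the second layer).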
+
+## 2. THE FIX — resreg AND jacreg both work, by controlling the finite-T1 LE
+Both add back the **finite-time contraction defense** that equilibrium-EP's L(z*) structurally lacks (and that BPTT has implicitly — BPTT differentiates the T1 unroll, so a non-converged z_T1 → bad output → penalized).
+
+- **jacreg** = penalize ‖J_nc·v‖ (the non-conservative/rotating Jacobian = attention+FFN; lt_ep_train.py:211-219). **Cause-side**: shrinks |Im μ| → pushes the complex pair back to Re μ<0.
+- **resreg** = penalize the T1-residual ‖εF(z_T1)‖ (lt_ep_train.py:220-231). **Symptom-side**: residual ~ρ^T1=exp(T1·LE), so resreg ≈ a DIRECT finite-T1 Lyapunov-exponent penalty (catches non-normal transients the eigenvalues miss). ~orthogonal to BPTT-grad (cos −0.047) — a constraint that keeps res low so the EP estimate doesn't collapse.
+- **Geometry**: relaxation map M=I+εJ; stable ⟺ ρ(M)<1 ⟺ finite-T1 LE<0. Hopf = non-conservative part pushes complex μ past Re μ=0. resreg controls LE directly (output), jacreg controls the structural source (‖J_nc‖). **They stack** (orthogonal handles; cause+symptom).
+- **eig_jacreg.py CONFIRMED jacreg at the mechanism level** (cont.9): at the SAME loss ~2.74, FROZEN jacreg (redx) = Re μ=+0.45 rotating UNSTABLE g_floor 0.26; ADAPTIVE jacreg (ep_jacreg @2.75) = **Re μ=−0.23 STABLE real, g_floor 0.0001 (true fixed point restored)**. jacreg killed the Hopf + restored AsymEP validity.
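+
+The residual-to-LE link used above can be written out as a short derivation. It is a sketch, assuming the relaxation is the explicit Euler map z_{t+1} = z_t + εF(z_t) and that z_t is close enough to z* for F to be linearized; r_t is notation introduced here:
+
+$$
+r_t := z_{t+1} - z_t = \varepsilon F(z_t), \qquad r_{t+1} = \varepsilon F(z_t + r_t) \approx (I + \varepsilon J)\, r_t = M\, r_t
+$$
+
+$$
+\|r_{T_1}\| \approx \|M^{T_1} r_0\| \sim \rho(M)^{T_1}\,\|r_0\| = e^{T_1\cdot\mathrm{LE}}\,\|r_0\|, \qquad \mathrm{LE} := \ln\rho(M)
+$$
+
+Under these assumptions, penalizing ‖εF(z_T1)‖ at fixed T1 penalizes exp(T1·LE), and LE < 0 is the same condition as ρ(M) < 1. Because ‖M^T1‖ can exceed ρ(M)^T1 when M is non-normal, the residual also registers transient growth that the spectrum alone does not show.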
+
+## 3. ★★ THE 2.09 CONFIG (recovered from EP_BELOW210:97-101) — the key stabilizer is RESREG, NOT jacreg ★★
+The session spent days on adaptive-jacreg; the USER pushed to find the actual 2.09 recipe. It is **FROM SCRATCH + resreg=0.2 + FROZEN jr=0.1** (the original ep_resreg2 reached **2.0573**, lowest EP ever; lost to /tmp wipe; rebuilt ep_resreg_scratch reached 2.22):
+```
+python3 lt_ep_train.py --mode ep --attn_mode thick --B 24 --C 512 --H 16 --T 256 --c 1.0 \
+  --jacreg 0.1 --jr_floor 0.1 --jr_max 0.1 --holo 2 --hr 0.02 --t2sel 40 --track --pema 0.999 --t1max 300 \
+  --res_est 1e-4 --res_gate 0 --resreg 0.2 --qknorm --resinit 0.1 --warmup 800 --T1 150 --T2 20 --lr 6e-4 \
+  --wsd 0.25 --steps 32000 ...   # NO --init_ckpt = from scratch
+```
+- jr_max=jr_floor=0.1 = jacreg FROZEN (controller off). Adaptive jacreg = release jr_max (e.g. 16); controller :520-529 ramps jr by `(res/res_target)^0.3`.
+- res_gate MUST be 0 (the gate early-returns before the resreg penalty → bypasses it; res_gate≠0 blew @200 historically).
+
+## 4. RUNS this session (all C512, warm from s2000 unless noted)
+- **s2000** = `runs/redx_traj/s2000.pt` (redx step 2000, val 3.13) = the STABLE pre-bifurcation start (Re μ=−0.024). Use THIS to warm-start; ep_eps05.pt (2.41, Re μ=+1.35) is UNSTABLE — resreg/jacreg can't RESCUE an already-unstable operator (they PREVENT, they don't reverse).
+- ep_jacreg (adaptive jacreg, warm s2000): twitchy (jr-spikes to 15 → CE spikes), crawled to ~2.32 then ~stuck/slow. Broke past 2.74/2.41 cleanly though.
+- **ep_resreg_warm** (resreg=0.2 eager, t2sel=160, warm s2000): SMOOTH (peak res 1.6e-2, no spikes), LEADING at **2.2985** — the clean 2.09 test, still alive.
+- ep_resreg_fast (resreg t2sel40, warm from UNSTABLE ep_eps05@2.41): BLEW because the start point was unstable. (Confirmed by direct eval: fp32 & TF32 both give 2.55 → the ckpt loads fine and TF32 didn't change the relaxation; the operator just destabilizes under training.)
+- ep_resreg_c (resreg t2sel40 +compile, warm s2000): BLEW @2.31. Cause = **t2sel40 (lean gradient)**, NOT compile, NOT warm-start (ep_resreg_warm uses the same warm-start and is fine).
+- ep_rr_scratch (FROM SCRATCH original recipe +compile): launched 2026-06-24 05:32 on GPU3 — tests the user's "from-scratch is robust" hypothesis + the proven 2.09 path.
+- ep_eps05 (ε=0.05): blew @2.41 (the ε-monotonicity run).
+
+## 5. INFRA / #14 speedup findings
+- **compile: EXONERATED + SAFE.** lt_ep_train has `--compile` (compiles the FREE-phase via `tforce`, the no_grad fast path; gradient stays eager). Verified numerically identical: tforce vs force rel-diff 9e-7; compile-z150 vs eager-z150 **1.6e-7** (just fp32 op-reorder rounding). Speedup ~1.43x (free phase) / ~3.3x with t2sel40. reduce-overhead/CUDA-graphs BROKEN (0.07x, graph breaks — needs fullgraph/static-shape fix).
+- **TF32: DROPPED (user decision).** `--tf32` exists (lt_ep_train:368, sets allow_tf32). 10-bit mantissa ≈ 1e-3 precision loss. The relaxation is HYPER-precision-sensitive (ε 0.1→0.05 moved the wall 0.33!), so TF32's 1e-3 perturbation is too coarse → risky. compile(fp32, 1.6e-7) is ~10⁶× below the sensitivity scale → safe; TF32 is not. **DO NOT use --tf32.**
+- **EP parallelism advantages for #14 (esp. for the scaling/deep phase):**
+  1. NO sequential backward (vs BPTT's N reverse layers) + NO activation graph (memory-light).
+  2. COUPLED equilibrium stack (#13, like the Hopfield-ResNet) → all layers relax CONCURRENTLY each step → depth parallelizes (vs BPTT's 2N sequential). (DEQ-style z*=f(z*) with deep f does NOT parallelize.)
+  3. **adaptive-T1** (relax until residual<tol, the t1max machinery already does this for z*) — easier/cleaner than adaptive-ε (convergence signal is cleaner than overshoot). Speed + auto-converged readout + cap-hit=instability-flag.
+  4. adaptive-ε (#30) as 2nd-order.
+- GPU: 4× RTX A6000 (49GB). GPU0/1/3 = ours; **GPU2 = others' NV-Embed-v2 server (port 8555/8556) — DON'T TOUCH.** For RENTING (user found cheap provider, RTX models + H800, no A100/H100): **EP is memory-light + FLOP-bound → consumer 4090/5090 = best $/FLOP** for the bulk; 48-80GB (A6000/H800) only for BPTT-twins (memory-heavy) + 0.6B. EP per-step FLOPs ~1.5-3x BPTT (2 relaxations vs fwd+bwd) but that cost is SIM-only (analog relaxation is free physics).
+
+## 6. Hopfield-ResNet paper (arxiv 2509.26003) — confirms our diagnosis
+"Scaling EP to Deeper Architectures" trained 12-conv Hopfield-ResNet with EP. It is **CONSERVATIVE** (energy function Φ, SYMMETRIC weights, monotone energy descent — "no oscillation or limit cycles"). No attention, no non-symmetric ops. **Confirms: non-conservativity is OUR culprit; conservative systems get depth free (no Hopf). We are the first to EP-train NON-conservative attention (which has the Hopf), solved via jacreg/resreg.** Good for the dossier: prior EP-deep = conservative/no-attention; ours = the harder non-conservative case.
+
+## 7. fugu-ultra consultations (all in ep_run/)
+- FUGU_VERDICT_FULL.md (Q1-Q4): confirmed attention-driven oscillation, FLAGGED the Euler-artifact (we then measured it's a true Hopf), said the eigenpair is the decisive measurement. Fix=adaptive jacreg homeostat+res_gate; sub-threshold attention IS expressive (BPTT 1.83 proves it); keep below the instability.
+- FUGU_OPTIONS_VERDICT.md (Q1-Q3): the step-vs-force-residual CORRECTION; adaptive ε eliminates only if ε_min<ε_crit; jacreg raises ε_crit (model-side, same wall) + a real analog settling benefit + true-Hopf insurance; Anderson/implicit are emulator-fidelity (analog-faithful), jacreg edits the model. Recommended: adaptive-ε+Anderson(both phases)+bounded jacreg, report FORCE-residual + Re μ.
+
+## 8. KEY FILES
+- Probes: ep_run/{eig_probe.py, anderson_control.py, eig_jacreg.py, adaptive_eps_calib.py, adaptive_eps_calib2.py, eps_sweep_s3200.py, knockout_s3200.py, eval_relax_s3200.py, compile_bench.py}
+- Dossiers: ep_run/{EP_DIAGNOSIS_DOSSIER.md, FUGU_VERDICT_FULL.md, FUGU_Q_OPTIONS.md, FUGU_OPTIONS_VERDICT.md, FUGU_Q1_VERDICT.md}
+- EP_BELOW210_DIAGNOSIS_FIX.md: cont.6 (structural/forward-mode — SUPERSEDED), cont.7 (ε-artifact — SUPERSEDED), **cont.8 (Hopf correction — CURRENT)**, **cont.9 (jacreg confirmed at mechanism level)** + OBS (oscillation = benign weight transient, not Hopf), **2026-06-23 ε-monotonicity RESULT**, the 2.09 recipe at :97-101.
+- Code: lt_ep_train.py — force/tforce:81-106, relax:123, ep_step:140, jacreg:211-219, resreg:220-231, jr controller:520-529, --compile (works), --tf32 (DON'T use).
+
+## 9. CURRENT STATE + NEXT
+- Running: ep_jacreg (~2.32), ep_resreg_warm (LEADING 2.2985, clean 2.09 test), ep_rr_scratch (from-scratch, just launched). Watchers: ep_resreg_check.py→2.20, ep_jacreg_binary.py→2.30 (detached→/tmp), ep_rr_scratch needs one.
+- **THE open question: does resreg break 2.09?** ep_resreg_warm (warm+t2sel160) at 2.2985 smooth → likely; ep_rr_scratch (from-scratch, proven recipe) = the robust confirmation.
+- Lessons: (a) the 2.09 stabilizer is RESREG (from scratch), not adaptive jacreg; (b) warm-start ONLY from a STABLE operator (s2000), never an already-blown one (ep_eps05); (c) t2sel40 (lean grad) is fragile deep, t2sel160 safer; (d) compile safe (fp32), TF32 unsafe (precision); (e) the run is precision-hyper-sensitive.
+- Recurring bug to avoid: `pkill -f "ckpt runs/X.pt"` SELF-MATCHES the bash → exit 144; kill by explicit PID instead. And `nohup python … &` inside a run_in_background bash DETACHES it (no notify) — run `python3 watcher.py` directly as the tracked task.
-- 
cgit v1.2.3