Initial commit: ept — backprop-free equilibrium transformer (EP)

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
author: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
committer: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
commit: b83947778e2c776f757a07d4719b7ce961d7ed55 (patch)
tree: b9cc01d7adda691d9156d9d04f4fb2f644674e96 /docs/campaign/FINDINGS.md
1 files changed, 609 insertions, 0 deletions
diff --git a/docs/campaign/FINDINGS.md b/docs/campaign/FINDINGS.md
new file mode 100644
index 0000000..0cb8b53
--- /dev/null
+++ b/docs/campaign/FINDINGS.md
@@ -0,0 +1,609 @@
+# EP / AEP for Transformers — Findings
+
+**Question.** Can transformers (attention + FFN) be trained with Equilibrium Propagation (EP),
+i.e. without backprop / without Feedback Alignment? Started from literature, reproduced the SOTA,
+characterized AEP (EP for non-conservative dynamics), and ported the recipe to a real char-LM.
+
+---
+
+## TL;DR
+
+1. **EP/AEP give FAITHFUL LOCAL GRADIENTS for every transformer component** — attention
+   cosine **0.99**, FFN **1.0** vs the true BP gradient. This solves the credit-assignment that
+   **Feedback Alignment fails at**: FA only trains layers adjacent to the loss (output proj cos≈1.0)
+   and leaves upstream layers as noise (**attention Q/K/V cos≈0.25, FFN fc cos≈−0.01**).
+2. **Recipe to make real (non-conservative) attention EP-able: DAMPING + AEP.** A damping term
+   `s·(attn(z) − c·z)` (c≥1) creates a stable fixed point at *any* attention strength, while keeping
+   the map non-conservative (independent Q/K/V); AEP then recovers the gradient (0.99–1.0).
+3. **Stable end-to-end EP training: SOLVED** by a residual-driven continuous controller on a soft
+   Jacobian penalty (Bai-2021-style, Hutchinson) + damping. EP trains 12k+ steps without blowup.
+   **There is NO structural "EP ceiling"** (a ~2.5-CE wall was claimed mid-project and is
+   **retracted** — see *2026-06-09* section): EP has a quantitative **validity threshold**
+   (free-phase residual ≲ few·1e-4, nudge inside the linear-response window), and meeting it costs
+   regularization tax + steps — both reducible, neither a wall.
+
+---
+
+## 2026-06-20 — below-2.10 EP divergence: ROOT CAUSE + the residual-defense fix
+
+The C512 EP wall (EP frozen-jr descends to best ~2.09, then SUDDENLY diverges within ~200 steps:
+res 5e-3→0.15, cos(g_EP,g_BPTT) 0.98→~0, CE→4+, abort — while exact-BPTT with the IDENTICAL recipe
+sails past to 1.72, freeze_wsd) is diagnosed. NOT the controller, NOT jacreg, NOT the erf/tanh GELU,
+NOT a loss-landscape wall.
+
+**Root cause (Codex-confirmed, 5-way corroborated): EP optimizes the FIXED POINT; BPTT optimizes the
+FINITE UNROLL — only the finite unroll defends the residual.** `ep_step` relaxes T1=150→z150, then REFINES
+to t1max=300→z* and takes the gradient at z*; but eval & `bptt_step` use z150. EP never "feels" the T1
+residual → as attention gets expressive, contraction weakens, z150 drifts from z*, res@T1 inflates, the
+EP estimate (valid only at small res) dies → blowup. BPTT differentiates the actual 150 steps, so a
+non-converging unroll → bad CE → its gradient implicitly rewards strong contraction. That defend-the-residual
+term is what EP structurally lacks: cos-0.977 holds only AT the fixed point; the missing ⊥ component is ~21%
+(=√(1−0.977²)) = the finite-horizon transient gradient (T1=∞ would make even BPTT lose it — BPTT's stability
+IS the finite truncation; the equilibrium/implicit gradient only cares WHERE the fixed point is, not how fast
+you reach it → no contraction-reward).
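+
+A worked version of the ~21% figure above, as a sketch: it assumes g_EP and g_BPTT are compared as
+plain vectors at angle θ, so the fraction of the (unit-norm) BPTT gradient that EP misses is sin θ.
+The symbol θ is notation introduced here for illustration.
+
+```latex
+\cos\theta = \frac{\langle g_{\mathrm{EP}},\, g_{\mathrm{BPTT}}\rangle}{\|g_{\mathrm{EP}}\|\,\|g_{\mathrm{BPTT}}\|} = 0.977
+\quad\Longrightarrow\quad
+\sin\theta = \sqrt{1 - 0.977^{2}} = \sqrt{0.045471} \approx 0.213
+```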
+
+Two layers: (1) point-mismatch (refinement) — `--t1max 150` alone moved the wall 2.09→2.05 but still blew@600;
+(2) gradient-flavor — needs the explicit penalty. The diverged state is a forward bifurcation to a LIMIT CYCLE,
+not under-relaxation (eval_relax: res floors ~6e-2 and oscillates, 150→4000 relax steps don't help, CE ~3.7;
+FTLE stays negative −0.027..−0.050 — single random-vector FTLE misses the cycle). ⇒ adaptive / more relaxation
+steps CANNOT fix it (res as a STOPPING criterion just chases a vanishing fixed point); only res as a COST
+(penalty) prevents the drift.
+
+**Fix — explicit T1-residual penalty `--resreg`**: defend z150 with R=‖εF(z150)‖²/(‖z150‖²+ε), grad w.r.t. θ
+(z150 detached), task-relative scale (ratio = resreg·min(1, res@T1/2e-2), deadband res@T1>7e-4), added to the
+EP gradient. **MUST run res_gate=0** (the validity gate early-returns jacreg-only above the gate, bypassing the
+penalty exactly when res is high — first gated attempt blew@200). Keep t1max=300 (estimator accuracy) + penalty
+(defends z150). Analog-compatible (one extra force measurement + local VF gradient, no root-finder); more
+targeted than jacreg (which penalizes ‖J_nc‖, not the residual that explodes).
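+
+The penalty above, written out as a sketch. Assumptions: ε is the relaxation step size multiplying the
+force F, ε₀ is the small stabilizer in the denominator (the text writes both as ε), sg(·) marks the
+detached z150, and the deadband is read as "active only above 7e-4". How `ratio` rescales R against the
+task gradient is not spelled out here, so only the stated pieces are shown.
+
+```latex
+R(\theta) = \frac{\|\epsilon\, F_\theta(\mathrm{sg}(z_{150}))\|^{2}}{\|z_{150}\|^{2} + \epsilon_0},
+\qquad
+\mathrm{ratio} = \mathrm{resreg}\cdot \min\!\left(1,\ \frac{\mathrm{res@T1}}{2\times 10^{-2}}\right)
+\quad \text{applied only when } \mathrm{res@T1} > 7\times 10^{-4}
+```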
+
+**Validation (interrupted by /tmp wipe, needs re-run):** ep_resreg2 (res_gate=0, resreg=0.2, warm from erf-2.09)
+held res pinned 1–5e-4 through step 1000, best **2.0573** (past the wall, lowest any EP run reached), zero
+inflation — where every no-penalty variant blew (ep_nogate@100, gated ep_resreg@200, ep_t1max150 reached 2.05
+then blew@600). The run + ALL local ckpts were deleted 2026-06-20 by the /tmp 10-day cleanup before reaching
+~1.8. Corroboration (5): Codex independent diagnosis; BPTT-from-2.09 control (res stayed 1.8e-4 + descended,
+same fresh-opt restart where EP blew → the EP *update* is the destabilizer); FTLE (stable-BPTT 1.72 has WEAKER
+contraction −0.0347 than diverged-EP −0.0377 → not forward-stability); eval_relax (limit cycle); gradient
+decomposition (jacreg acquitted: 3% of grad, orthogonal, removing changes cos<0.001). Verbatim code + re-run
+plan: **EP_BELOW210_DIAGNOSIS_FIX.md**.
+
+---
+
+## Background (literature)
+
+- **EP** — Scellier & Bengio 2017. Energy-based; free phase relaxes to a fixed point, nudged phase
+  with strength β; local weight update from the two equilibria. Centered/symmetric nudging
+  (Laborieux 2021) reduces the gradient-estimator bias.
+- **EP ≡ BPTT** — Ernoult et al. 2019: EP updates match BPTT gradients in an RNN with static input,
+  in the limit β→0 with a converged free phase and enough nudged steps. (Relevant: as β→0 EP is
+  unbiased — so our instability is NOT primarily gradient bias.)
+- **Holomorphic EP** — Laborieux & Zenke 2022: exact gradients via finite oscillations, removing the
+  β bias/noise trade-off.
+- **CET** — Høier, Kerjan, Scellier (ICLR 2026 AM workshop), arXiv via OpenReview `Qrfml76eWJ`.
+  Convergent Energy Transformer, EP-trained, CelebA masked completion, EP ≈ TBPTE. The current SOTA
+  for "EP + attention". Conservative (energy) attention with tied value → guaranteed fixed point.
+- **AEP / AsymEP** — Scurria, Vanden Abeele, Mognetti, Massar, "EP for Non-Conservative Systems",
+  arXiv:2602.03670. Nudged-phase correction `−2 A_J(x*)(x−x*)` (A_J = antisym part of the Jacobian
+  at the free eq) turns the nudged Jacobian J → Jᵀ, giving the exact gradient even when no energy
+  function exists. **The correction is theirs**; their scope is feedforward/Hopfield nets on static
+  MNIST/CIFAR (no attention, no sequence, explicit Jacobian, no stability controller). Their
+  force-form **VF** baseline collapses without the correction. See 2026-06-16 for the full ours/theirs
+  boundary.
+- **FRE-RNN** — "Toward Practical EP", arXiv:2508.11659 (Zhuo Liu et al.). Fixes EP instability via
+  **feedback regulation (reduce spectral radius → fast convergence) + residual connections**
+  (vanishing gradients). Directly relevant to our open problem (keep the free phase converged).
+- Adjacent energy-transformers (BP-trained, not EP): Energy Transformer (2302.07253),
+  EBT "Scalable Learners and Thinkers" (2507.02092), NrGPT (2512.16762).
+
+---
+
+## The arc (what we built & found)
+
+| stage | finding | evidence |
+|---|---|---|
+| literature | EP+attention works only narrowly; CET is SOTA (CelebA, EP≈BP, single block) | CET full read |
+| MVP (CET repro) | EP trains energy-attention+memory; **cosine 0.99; EP≈TBPTE** | CIFAR 0.539/0.546; FMNIST 0.272/0.278 |
+| AEP (non-cons.) | corrects real-attention gradient: toy 0.30→**0.9975**; CET attn-params 0.75→**1.0** | `aep_*.py` |
+| characterization | needs T2≥20–40, T1≥80; β-insensitive (static); **advantage grows with depth** (K=2: naive 0.05 vs AEP 0.99); only changes nudged phase (free eq identical); 1.44× cost | sweeps |
+| fundamental limit | **strongly non-conservative attention has NO fixed point → whole EP family fails** (projection bounds magnitude but does not create a fixed point) | residual stuck 3e-2 |
+| **F: damping** | `s·(attn−c·z)`, c≥1 → stable fixed point at any s → **AEP 0.99–1.0** even at s=8 | `aep_contractive2.py` |
+| Option-2 LM attention | **AEP gives LM causal attention 0.993** (FA 0.25 — FA kills Q/K/V) | `lt_ep_attention.py` |
+| Option-2 H1 (FFN) | **EP Hopfield-memory gives FFN 1.000** (FA −0.01 — the abandonment reason) | `lt_ep_ffn.py` |
+| Option-2 H2 (train) | **BPTT trains (val CE 2.16↓, random 4.17); EP local training destabilizes early** | `lt_ep_train.py`, `H2_*.log` |
+
+**Why FA failed (the project's pain, confirmed):** FA only delivers a usable gradient to the layer
+right before the loss (output projection / FFN proj, cos≈1.0); all upstream layers (attention Q/K/V,
+FFN fc) get random-routed error → cos≈0. EP/AEP fix all of them.
+
+**Why EP training destabilizes (mechanism):** BPTT differentiates whatever the fixed T1-step
+relaxation computes (a deep weight-tied net) — it does NOT need convergence. EP *assumes* the free
+phase is at a fixed point; gradient descent on the loss pushes attention to be more expressive /
+non-conservative → fixed point lost → relaxation diverges → EP estimate (and its jvp/vjp correction)
+goes non-finite. Damping + clipping + weight-caps delay but don't prevent it.
+*(Superseded: the residual-controlled Jacobian penalty DOES prevent it — see the 2026-06-09 section.)*
+
+---
+
+## Key numbers
+
+- CET MVP (masked completion, test masked-MSE): CIFAR **EP 0.539 ≈ TBPTE 0.546**; FMNIST **EP 0.272 ≈ TBPTE 0.278**; BP-transformer 0.106/0.126; trivial(visible-mean) 0.583.
+- EP-vs-BPTT gradient cosine (CET): global **0.99**; attention WQ/WK 0.98.
+- AEP recovers non-conservative attention: toy naive 0.30 → AEP **0.9975**; CET attn-params 0.75 → **1.0**; damped high-s (s=8) AEP **0.99–1.0**.
+- LM (Shakespeare char): AEP attention **0.993** (FA 0.25); EP Hopfield-FFN **1.000** (FA −0.01).
+- H2 training: BPTT val CE **2.16 ↓** (random ln65=4.17); EP non-finite / no sustained training.
+
+---
+
+## Open problem & next directions
+
+The frontier is **stable end-to-end EP training**. Hypotheses / plans:
+1. **Keep the free phase converged during training** (FRE-RNN, 2508.11659): feedback regulation to
+   keep the spectral radius < 1 + residual connections. Most directly targets the mechanism.
+2. **β bias/noise trade-off** — Ernoult 2019 says EP≡BPTT as β→0 with converged phases; but small β
+   amplifies the `(z₋−z₊)/2β` noise. Holomorphic EP (Laborieux–Zenke 2022) removes this trade-off.
+   Diagnostic: log cosine(EP-grad, BPTT-grad) *during* training — does it start 0.99 and drop?
+3. **Unified conservative energy** (no AEP): make the whole LM one energy (energy-attention, tied
+   value) → guaranteed fixed point → plain EP is stable (we saw CET-EP is stable). Trades attention
+   expressivity for stability.
+
+## 2026-06-09 — H2 training era: the "wall" refuted, the real mechanism found
+
+**Setting.** Shakespeare char-LM, single equilibrium block, C=128 H=4, same param budget everywhere.
+`--attn_mode thin/real` = clamp + Hopfield-FFN + damped real attention; `thick` = DEQ-transformer
+block (pre-LN + attention + untied 4× GELU FFN + residual + damping). Stabilizer = damping `c` +
+soft Jacobian-norm penalty λ (Hutchinson on the non-conservative force) driven by a continuous
+controller on the free-phase residual (residual, NOT spectral radius, is the right signal — the
+attention Jacobian is non-normal, transient growth invisible to ρ).
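+
+A sketch of the Hutchinson estimate behind the soft Jacobian-norm penalty. It uses the standard identity
+for the squared Frobenius norm; J_nc (Jacobian of the non-conservative force at z) and the probe count m
+are notation assumed here, and whether the code probes with Jv or Jᵀv is not stated in this section.
+
+```latex
+\|J_{nc}\|_F^{2} = \operatorname{tr}\!\left(J_{nc}^{\top} J_{nc}\right)
+= \mathbb{E}_{v}\!\left[\|J_{nc}\, v\|^{2}\right],
+\quad \mathbb{E}[v v^{\top}] = I
+\qquad\Longrightarrow\qquad
+\lambda\,\|J_{nc}\|_F^{2} \approx \frac{\lambda}{m}\sum_{i=1}^{m} \|J_{nc}\, v_i\|^{2}
+```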
+
+**Scoreboard (best val CE; random ln65 = 4.17):**
+
+| run | result |
+|---|---|
+| BP transformer, MLP=4 | 1.68 |
+| BP transformer, MLP=1 (param-matched) | 1.79 |
+| thick-BPTT (exact grad, same arch as thick-EP) | **1.95** (res ~1e-3→8e-3, learned contraction unaided) |
+| mono-BPTT (monDEQ) | 2.11 |
+| thin-BPTT | 2.21 (optimum at res ~1.5e-2) |
+| **thick-EP champion (R5): R3 recipe + EMA-smoothed controller, 14k** | **2.0467** @ step 5k — within 0.10 of exact-gradient thick-BPTT |
+| thick-EP (R3): λ-floor 0.1, res_target 5e-4, c=1, 8k | 2.0784 — beats thin-BPTT and mono-BPTT |
+| thick-EP (R1): λ-floor 1, c=2, 14k | 2.1504 — 2.467@3k → 2.259@7.2k → 2.150@14k: monotone, no plateau, just slow |
+| thick-EP (R4): λ-floor 0.1, res_target 2e-4, 14k | 2.1352 — tight target ⇒ controller thrash (λ swings 0.5↔13 late) |
+| thick-EP (R2): λ→0, c=0.5 | 2.3572 @ step 800, then **collapse to a degenerate fixed point** (res→0 exactly, val ≫ random) |
+| thick-EP, c=2 λ=1, 3k (the run the "wall" was called on) | 2.4665 (just undertrained) |
+| thin-EP super-long 12k | 2.4847 plateau — **invalid-regime run**: res 2.4e-2–6e-2 the whole time, λ pinned at max 16 |
+| thin-EP 3k, λ=16 | 2.5593 |
+
+**Retraction.** Mid-project verdict — "EP is capped at ~2.5 by a convergence⟷richness wall
+(rich blocks need damping that destroys their expressivity)" — is **wrong**, called prematurely at
+3k steps when thick-EP ≈ thin-EP ≈ 2.47–2.50 by coincidence (two different slow/broken modes passing
+the same value). Three independent refutations: (1) the 8k continuation sailed through the "wall"
+(2.2592@7.2k, monotone); (2) the slack run hit 2.357 in 800 steps; (3) the gradient probe below shows
+the EP estimator is healthy exactly where the stabilized runs operate. Also wrong in the original
+story: damping c does NOT aid convergence for LN-inside blocks (LN Jacobian ∝ 1/σ(z): damping shrinks
+z*, *inflating* the Jacobian — measured: thick plain-relax res 8.8e-3 at c=0 vs 3.4e-2 at c=2,
+`solver_wall.py`), and the scary "init residual~12" was an unnormalized absolute-norm print.
+
+**The real mechanism — a validity threshold, not a wall** (`grad_quality.py`: cosine of each
+estimator vs long-horizon BPTT (T1=400) reference, at a 300-step-BPTT-pretrained thick block):
+
+| estimator | free-phase res | cos vs BPTT-400 (3 batches) |
+|---|---|---|
+| EP T1=400, T2=20 | ~5e-5 | **0.85, 0.83, 0.88** — healthy |
+| EP T1=150, T2=20 | 2e-4–1.6e-3 | 0.21 / 0.86 / 0.27 — marginal, batch-dependent |
+| EP T1=50, T2=20 | 2.4e-3–3.6e-3 | −0.03 / 0.34 / 0.55 — broken |
+| BPTT T1=150 (itself!) | — | 0.45 / 0.77 / 0.36 vs BPTT-400 |
+| EP T1=150, **T2=60** | same as T1=150 | 0.01 / 0.07 / −0.19 — **nudged phase leaves the linear-response window** |
+
+Readings: (a) at tight convergence the EP estimator agrees 0.85 with the exact gradient — the
+EP-specific overhead (finite β, T2 truncation, AEP clipping) is only ~0.15 of misalignment;
+(b) **the horizon/mixing cost is shared by BPTT itself** (BPTT-150 vs BPTT-400 cos 0.35–0.77):
+slow-mixing equilibria are expensive for everyone, not an EP defect; (c) EP needs res ≲ few·1e-4 to
+be valid — at res ~1e-2 (where the thin super-long run lived) EP updates are noise, which is what
+that 2.48 "plateau" actually was: 12k steps of noise + λ=16 Jacobian-shrinking gradient; (d) longer
+nudged phase is NOT better — the AEP correction is linearized at z*, so T2 must stay in the
+linear-response window (T2≈20 at eps=0.1 good, T2=60 destroys the estimate).
+
+**Decomposition of the EP→BP gap (final numbers):** architecture (BPTT−BP) = 1.949−1.79 =
+**0.16** (small — the energy/fixed-point architecture is fine); training rule (EP−BPTT) =
+2.047−1.949 = **0.10** — slower optimization (estimator cos ~0.85) + the λ tax of staying inside
+the validity region. Total EP→param-matched-BP = **0.26** (the "wall" story claimed ~0.7 was
+structural). EP-trained thick (2.047) beats exact-gradient thin (2.206) and monDEQ (2.111):
+EP *does* cash in architecture richness, given validity + steps.
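+
+The arithmetic behind the decomposition, using the scoreboard values at the precision quoted above:
+
+```latex
+\underbrace{1.949 - 1.79}_{\text{architecture}} = 0.159 \approx 0.16,\qquad
+\underbrace{2.047 - 1.949}_{\text{training rule}} = 0.098 \approx 0.10,\qquad
+2.047 - 1.79 = 0.257 \approx 0.26
+```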
+
+**λ-study (final, prediction confirmed: middle wins):** R5 2.047 < R3 2.078 < R4 2.135 < R1 2.150
+< R2 2.357. The two failure modes flanking the optimum: λ→0 ⇒ fast descent then *collapse* (not
+explosion — a dead res=0 equilibrium); λ heavy ⇒ no failure, just a constant tax (R1 monotone to
+2.150@14k). R4's lesson: too-tight res_target puts the multiplicative controller on the residual
+noise floor — `(res/target)^0.3` per-step updates compound the noise and λ thrashes 0.5↔13,
+degrading late training. R5's fix: EMA-smoothed residual signal (`--res_ema 0.9`) kept λ quiet
+(0.1–1.0) and set the champion 2.0467 by step 5k.
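+
+A sketch of the controller the R4/R5 lesson refers to. Only the `(res/target)^0.3` factor and the EMA
+coefficient 0.9 come from the text; the clip to [λ_floor, λ_max] and the exact EMA form are assumptions
+about how the floor, the cap and `--res_ema` enter.
+
+```latex
+\bar r_t = 0.9\,\bar r_{t-1} + 0.1\,\mathrm{res}_t,\qquad
+\lambda_{t+1} = \operatorname{clip}\!\left(\lambda_t \left(\frac{\bar r_t}{\mathrm{target}}\right)^{0.3},\ \lambda_{\mathrm{floor}},\ \lambda_{\max}\right)
+```
+
+With the raw residual in place of the EMA, noise in res near the target is multiplied into λ at every
+step, which is the thrash attributed to R4.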
+
+**Open item (the actual next frontier):** every stable recipe peaks mid-run (~5–6k) and then drifts
+up 0.1–0.2 despite cosine lr decay — EP's late-training noise floor. Suspects: VF-estimator noise
+dominating at small lr (cos 0.85 ⇒ persistent gradient-direction error), λ/lr balance shifting late,
+weight-norm caps binding. Holomorphic EP (removes β bias/noise) is the principled candidate for the
+remaining ~0.10; T2 must stay in the AEP linear-response window (T2≈20, NOT longer).
+
+## 2026-06-10 — estimator/controller co-design: EP reaches 1.74, beats exact-gradient BPTT
+
+The Holomorphic-EP line (see METHODS §4.3) + controller re-calibration, in causal order:
+
+1. **R7 — holomorphic clamp-free nudge** (N=2, r=0.02 Cauchy readout, no g-clamp / no corr-clip,
+   closed-form ∇zCE): best **2.029**, faster wall-clock than plain EP (3.3 vs 2.45 it/s). Probe
+   discoveries: the legacy clamps were the dominant estimator error at marginal residuals
+   (cos 0.27→0.89 at res 1.6e-3); N and r are flat ⇒ β-bias/noise was never the binding error;
+   T2 truncation is the remainder (T2=120 → cos 0.985 when the nudged phase is stable; adaptive-T2
+   early-stopping defeated by non-normal transient growth — open).
+2. **R8 — param-EMA alone: tie (2.031).** Exposed the real late-drift mechanism: λ pinned at
+   jr_max=16 in the tail — the controller enforcing the OLD validity margin (5e-4) against weights
+   that want res~1e-3, which the new estimator tolerates fine (cos 0.89 @ 1.6e-3).
+3. **R9 — controller re-calibrated to the new estimator** (target 1.5e-3, λmax 4) + param-EMA:
+   **best 1.7399**, sustained EMA plateau 1.75–1.79, λ quiet at 0.1–0.5, tail ~1.92. Worth 0.29 —
+   the single largest improvement of the project. Lesson: **estimator and controller must be
+   co-designed**; upgrading one without re-tuning the other converts the controller into the main
+   source of harm.
+4. **14k controls** (same horizon as R9): BP MLP=4 **1.610**, BP MLP=1 1.689. Param-shape
+   correction: the **thick** block's 131k FFN matches **MLP=4**, not MLP=1 (which matches the thin
+   block's 33k Hopfield memory) — so thick-EP's like-for-like BP baseline is 1.610.
+   Unregularized **BPTT-14k destabilizes** (res→4.7e-2, val→3.0, best 2.021 — worse than its own
+   3k run, 1.949): the equilibrium architecture NEEDS the contraction control loop for long
+   training; EP carries it out of estimator necessity, bare BPTT got nothing.
+5. **The decisive control — BPTT + identical λ-controller + param-EMA: 1.6348**, tail stable.
+   Verdict: most of EP-beats-bare-BPTT was the regularizer; with equipment matched, exact
+   gradients lead EP (R9) by **0.105** — which matches the measured estimator misalignment
+   (cos 0.85–0.90). Bonus: the controlled equilibrium block (1.635) beats the MLP=1 BP baseline
+   (1.689) — the contraction controller is good for the architecture independent of training
+   rule; EP merely forced its discovery.
+6. **Adaptive T1/T2 (R10): EP 1.6755** — ate 0.064 of the 0.105 rule tax. Adaptive T2 solved by
+   *hindsight snapshot selection* (judge by increments of the contrast estimate a_t, never by step
+   sizes — non-normal transients can't fool it; dangerous batches self-limit to t≈20–30, stable
+   ones run to ~110 and collect cos 0.987; probe mean 0.871→0.932). Adaptive T1 companion: the
+   λ-controller signal stays sampled at T1=150 (R9 peace preserved), then relaxation refines to
+   res≤1e-4 before nudging — long-T2 gains exist ONLY from tight equilibria (at res~1e-3 long T2
+   hurts). **Final 14k decomposition: architecture 0.025 (1.635 vs 1.610), EP rule tax 0.041
+   (1.676 vs 1.635), total 0.066.** EP now beats the MLP=1 BP baseline (1.689).
+   *Multi-seed (3/arm):* EP **1.680 ± 0.005** vs BPTT-ctl **1.639 ± 0.006** ⇒ tax
+   **0.041 ± 0.005 (~9σ)** — confirmed real.
+7. Refuted along the way: λ-floor∝lr annealing (R6) reproduces the λ→0 death; post-mortem of that
+   death (R2/R6): **explosion frozen by fp32 absorption masquerading as convergence** (val 60–77
+   with res≡0: ε·F < ulp(z); uncapped tok/pos/fc/pj blow up) — the λ floor is the anti-collapse
+   mechanism, not just a tax.
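+
+The 14k decomposition from item 6 as arithmetic, at the precision quoted there:
+
+```latex
+1.635 - 1.610 = 0.025,\qquad 1.676 - 1.635 = 0.041,\qquad 0.025 + 0.041 = 0.066,\qquad
+\underbrace{1.680 - 1.639}_{\text{multi-seed means}} = 0.041
+```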
+
+## Code map
+
+- `~/ept/cet_mvp.py` — CET energy model + EP & TBPTE training (App. B faithful).
+- `~/ept/bp_transformer.py` — vanilla BP transformer baseline.
+- `~/ept/cet_aep.py` — CETReal (real non-conservative attention) + AEP, VF gradient, damping, clip.
+- `~/ept/aep_characterize.py` / `aep_depth.py` / `aep_projected.py` / `aep_contractive2.py` — AEP sweeps.
+- `~/ept/aep_option1.py` — projected (constrained) adjoint.
+- `/tmp/lt_ep/lt_ep_attention.py` — AEP on the LM's causal attention (gradient quality).
+- `/tmp/lt_ep/lt_ep_ffn.py` — EP Hopfield-memory vs FA-FFN (gradient quality).
+- `/tmp/lt_ep/lt_ep_train.py` — H2 end-to-end EP vs BPTT training of the equilibrium LM block
+  (`--attn_mode real/energy/mono/thick`, jacreg controller `--jacreg/--jr_floor/--res_target`).
+- `/tmp/lt_ep/grad_quality.py` — EP-vs-exact gradient cosine vs residual level (the validity probe).
+- `/tmp/lt_ep/solver_wall.py` — plain relax vs Anderson convergence per damping level (thin vs thick).
+
+*(local-transformer working copy in `/tmp/lt_ep`; original `~/local-transformer` untouched.)*
+
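+## 2026-06-13 — late-drift diagnosis
+
+Every stable recipe bottoms out mid-run and then val drifts upward (S0 ~0.05, mild; S1/L2 ~0.5, severe). Multi-hypothesis diagnosis (drift_diag.py, 13 columns / 200 steps):
+- **Embedding-runaway hypothesis rejected**: |emb| rose to 11.9, then weight decay pushed it back to 11.6; self-limiting, not the culprit.
+- **Drift is harmless to the reported numbers**: saving the best checkpoint suffices (L2 drifted to 1.70 but reports best 1.214; warm-track reports 1.14).
+- **Drift does not hide a lower solution**: after warm-track bottomed out at 1.14 at step 4800, lr 3e-4 drifts and lr 1e-4 freezes (1.15); neither finds anything lower ⇒ 1.14 is this recipe's true floor, and drift = being jostled on that floor by noisy gradients.
+- **Mechanism = late-stage steps along bad directions**: a low lr pins the drift down (the anti-drift continuation froze at 1.16); S0 `cos`(EP,BPTT400) falling 0.85→0.3 late in training confirms that the direction degrades.
+- **Severity grows with scale**: S0 (C128/T64) ≈0.05, S1 (C256/T256) ≈0.5.
+- Open (S1 warm-start diagnosis + bctl control): is the late cos collapse EP-estimator degradation (A, → fixable with N=4/tracking) or horizon ambiguity of a slow-mixing equilibrium (B, shared by EP and BPTT, → architecture change to speed up mixing)? bctl=cos(BPTT150,BPTT400): if it collapses too, then B.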
+
+## 2026-06-13 (cont.) — "why EP far at S1": operating-point + bias/variance diagnosis
+
+Reframe: at the SAME (tight, res 1.5e-3) operating point EP (1.14–1.39) ≥ BPTT (1.52). The apparent
+"gap" is EP(1.14) vs LOOSE-BPTT(0.97) — an operating point (res 1e-2, non-contractive) EP cannot
+use. Mechanism: T=256 long-range mixing wants non-contractive dynamics; EP needs contraction for the
+fixed point to exist; contraction suppresses the mixing. Signature: gap grows with T (S0/T64: 0.04;
+S1/T256: 0.17).
+
+bias_var.py (v4b ckpt, 16 batches, EP vs BPTT-400, BPTT-150 control), per group [mean-cos | cos-means]:
+- all  0.24|0.36 ;  BPTT-horizon ctrl 0.44|0.57
+- attn 0.22|0.33 ; ffn 0.32|0.49 ; ln 0.19|0.30 ; emb 0.31|0.37
+Readings: (1) two EXACT gradients (BPTT-150 vs 400) only reach cos-means 0.57 — slow mixing makes
+"the gradient" horizon-ambiguous; a PERFECT EP estimator's ceiling here is ~0.57, NOT 1. (2) Both
+EP and BPTT improve with averaging (variance present, partly recoverable via navg). (3) EP-specific
+residual concentrates in **attention + LN** (EP cos-means 0.33/0.30 vs BPTT 0.57/0.65); **FFN nearly
+fine (0.49)**, emb on par. ⇒ the AEP-corrected softmax/LN path is the weak link, pointing the fix at
+qk-norm (bounds logits, conditions the attention Jacobian) + N=4 holo + navg. Chasing cos→1 is the
+wrong target; matching BPTT's 0.57 is the right one (0.57 − 0.36 ≈ 0.21 of recoverable EP-specific alignment).
+
+## 2026-06-13 (cont.) — scaling to BPE/50M: stability via small-init residual branches
+
+50M (C=2048) from scratch ABORTED (init res 26; qk-norm+c=2 insufficient — res stuck 0.38). Fix:
+**ReZero/Fixup-style small-init of the residual-branch output projections** (WO, pj ×0.1) makes the
+thick block ≈ -(z-xin)-c·z (near-identity, trivially contractive) at init; training grows them.
+Result at 15M (C=1024): res **1.1e-8 at step 50** (vs abort). The big-width init-instability wall is
+a random-init artifact, not fundamental; --resinit folds into the standard recipe. Separately,
+torch.compile's 14.6× was a SMALL-model (launch-overhead) artifact; at C≥1024 the relaxation is
+compute-bound (fp32 ~15 TFLOP/s) → ~0.06 it/s at 15M, ~0.03 at 50M. Demo feasibility: 15M solo
+(~2-4 days); 50M needs multi-GPU data-parallel (per-sample contrast → gradient all-reduce) or the
+external cards. The fp32 contrast (catastrophic-cancellation-bound, bf16 dead) keeps it compute-heavy
+— again the "digital expensive / physical free" axis.
+
+## 2026-06-14 — the big-model EP recipe (scaling past S1 needs a stability stack)
+
+Scaling EP from S1 (C=256/T=256/char) to C=1024/T=512/BPE-4096 broke the S1 recipe: stable for ~50
+steps then violent free-phase blowups (res → 0.5–1.4, jr pinned at max) — the controller's stability
+margin shrinks with width/context. CAUTION: a step-50 res≈1e-8 snapshot is NOT proof of stability;
+must watch through warmup-end peak-lr (~step 500–1000). The stack that fixes it (each necessary):
+  1. **resinit** (WO,pj ×0.1) — near-identity contractive block at init.
+  2. **qk-norm** (Qwen3 q/k RMSNorm) — bounds attention logits / Jacobian (also fixes the attn/LN
+     estimator weak link from the 06-13 bias_var diagnosis).
+  3. **lr warmup** (≥1000 steps linear) — let the λ-controller establish contraction before big steps.
+  4. **µP-style lr scaling**: lr 1e-3 (C=256) → **4e-4 (C=1024)**; 1e-3 caused a 400-step instability
+     episode, 4e-4 only brief recovered excursions.
+  5. **higher jr_max** (16 → 32) — bigger non-conservative force needs more penalty headroom.
+  6. validity gate + fuse retained (catch the residual transient excursions).
+Locked 15M demo recipe (running, 16k steps, best-ckpt): C=1024 H=16 T=512, the 6 items above,
+holo N=2 + adaptive T1/T2, pema 0.999. Smoke val descends 5.86→3.94 BPE-CE by step 1000.
+Cost: ~0.05 it/s (fp32, compute-bound; compile's 14.6× was a small-model launch-overhead artifact)
+⇒ 15M ~3 days solo. 50M (C=2048) still unstable even with resinit 0.03 + warmup — needs a dedicated
+init/lr curriculum AND multi-GPU; deferred as a sub-project (not on the demo critical path).
+
+## 2026-06-16 — C=512 BPE scale: baselines, AsymEP attribution, early-slowness dissection
+
+**C=512 BPE-4096 scoreboard** (TinyStories, T=256, B=24; random ln4096 = 8.318; target band 1.0–1.5):
+
+| run | result |
+|---|---|
+| BP standard transformer (7.48M, no warmup, lr 3e-3) | **best 1.7921** (20k, DONE) |
+| EP thick "orphan" (lr 8e-4, warmup800, resinit0.1, holo2, t2sel40) | **best 2.4037** (20k, DONE; jr pinned at floor 0.1 the ENTIRE run — zero excursions) |
+| EP thick "chin" (lr 9e-4, full-Chinchilla 24k target) | ABORT @5401, best 3.2408 |
+| BPTT thick (exact grad, same arch) | ABORT @2113, best 3.7331 |
+
+Data notes (NOT verdicts): (a) BP-C512 itself = 1.79 sits just *above* the 1–1.5 band (BP needs
+C≥1024 to enter it); EP-orphan 2.40 is 0.61 above BP-C512. (b) **the exact-gradient BPTT twin
+aborted EARLIEST** (2113, 3.73) — even worse than the EP runs — consistent with "training walks off
+the contractive manifold, exact gradient included"; lr/seed not matched across these, treat as
+directional. (c) abort = the `res>0.1 for 100 consecutive steps` fuse. **lr is the knife-edge**:
+8e-4 (orphan) had NO excursions and finished; 9e-4 (chin) had periodic jr→32 / res→0.2–0.5
+excursions, most recovered, one didn't make it back inside the 100-step fuse window → abort@5401.
+Narrow lr-stability margin at C=512. (d) orphan late curve is a plateau: best moved only
+2.4557→2.4037 over the last 3600 steps.
+
+**AsymEP attribution (web-verified, the J→Jᵀ correction is NOT ours).** arXiv:2602.03670,
+Scurria / Vanden Abeele / Mognetti / Massar, "EP for Non-Conservative Systems", names the method
+**AsymEP**. Their correction `−2 A_J(z*)(z−z*)` is **mathematically identical** to ours
+(`Jv−Jᵀv = (J−Jᵀ)v = 2 A_J v`). Their scope: Hopfield nets + feedforward MLPs (≤500 hidden) +
+one CIFAR-10 feedforward CNN; **static inputs; explicit Jacobian construction + decomposition;
+NO residual/validity controller; NO attention / transformer / sequence model; do NOT combine with
+holomorphic**. Best result: CIFAR-10 FF-CNN AsymEP 89.74% vs BP 90.66% (their variational "Dyadic
+EP" matches BP, p=0.75). Their **"VF" baseline** (the force-form ⟨a,∂F/∂θ⟩ readout — *prior art, not
+ours*) **collapses without the correction** (CIFAR-10 → 10% = chance; MNIST 64% vs AsymEP 92.7%) —
+matching our measured cos≈0.25 for uncorrected non-conservative attention. **Ours on this line**:
+matrix-free jvp−vjp (their explicit Jacobian is infeasible at B·T·C dim); softmax-attention
+application; the holomorphic combination (the real-linear correction preserves holomorphy);
+common-mode-tracking + lock-in variants; the validity-threshold/controller/gate stability stack they
+have none of; and the transformer-LM application + scale. (⇒ corrected EP_DERIVATION.md, which had
+mislabeled force-form VF as our step.)
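+
+Why the correction turns J into Jᵀ, as a short derivation. Notation assumed here: δ = z − z* is the
+nudged-phase deviation, and the nudged dynamics are linearized at the free equilibrium.
+
+```latex
+A_J = \tfrac{1}{2}\left(J - J^{\top}\right)
+\quad\Longrightarrow\quad
+J\,\delta - 2A_J\,\delta = J\,\delta - \left(J - J^{\top}\right)\delta = J^{\top}\delta,
+\qquad
+2A_J\, v = J v - J^{\top} v \ \ \text{(one jvp minus one vjp, no explicit Jacobian)}
+```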
+
+## 2026-06-17 (cont.) — full to-spec versions of exp1/4/5/6/7 (first pass was partial)
+
+The 06-17 campaign ran lighter/subset versions; re-ran the FULL specs. New findings:
+
+**exp1 (FULL) — multi-checkpoint cos-evolution (snap 0/200/800/2000 + plateau).** EP·BPTT150 cos:
+step0(random,res2e-2) 0.26 → step200(res1e-9) **0.99** → step800 0.96 → step2000 0.89 →
+plateau(2.40) **−0.05**; BPTT150·BPTT400 = 0.46→**1.00** throughout (true gradient always clean).
+New **batch self-cos** at the plateau: EP **−0.27** (gradients on different batches anti-correlated)
+vs BPTT **+0.96**. ⇒ decisive: the EP estimator is faithful while descending and undergoes an
+**SNR/coherence collapse near the optimum** (true gradient shrinks below the EP bias floor, which is
+batch-incoherent) — NOT horizon ambiguity, NOT LR. This is THE 2.40-plateau mechanism.
+
+**exp5 (FULL) — jr_max sweep (the dropped dimension, and it mattered).** Warm-started from 2.40:
+**jr_max 8 → 2.3648**, jr_max 4 → 2.376 — both BEAT jr_max 32 (2.463) and the orphan (2.404) by
+~0.04. jr_max 16 → 2.416. floor/target fill still ≥2.47. ⇒ the jr_max=32 ceiling was a small tax;
+~8 is better. (Earlier "not a controller tax" was incomplete — floor/target don't help, but the
+ceiling does.) Still far from the 1–1.5 band; the estimator floor dominates.
+
+**exp4 (FULL) — 3 arms.** armA (warmup): ~50-step gate dead-zone. armB (`cprewarm 200` = full-lr
+contraction before task warmup): res 1.7e-2 after pre-phase, dead-zone only PARTIALLY mitigated
+(val 5.97 vs 6.07 @100) — establishing contraction once isn't enough; warmup's tiny lr can't deepen
+it. armC (no-warmup + resinit 0.03): NO dead-zone (cos 0.99 from step20, best 4.03@300) but a violent
+excursion (res→0.14@400, recovered). ⇒ small-resinit+no-warmup learns fastest early but is
+excursion-prone; the cprewarm decoupling is a weak fix.
+
+**exp6 (FULL) — branch norms + attention entropy + qk RMS, incl. at plateau.** Branches grow fine
+(|WO| 1.95→4.91 over training); at the 2.40 plateau they are LARGE (|pj|=44, |fc|=58, attn/xin≈1.0),
+attn entropy 4.56→3.62 (sharpening), qk_rms pinned 1.0 (qknorm working). ⇒ no "stuck-small branch" —
+the trainable-α-gate intervention is NOT indicated.
+
+**exp7 (FULL, after a probe bug fix) — EP vs BP influence + per-position CE.** [Bug caught: the
+EP probe embedded the input with random-init weights before loading the ckpt → garbage (per-pos CE
+~7); fixed to embed post-load.] Clean influence ||Δz*_q||/δ vs distance d: **BP** 5.34(d1)/0.88(d16)/
+0.15(d64)/0.003(d200); **EP-orphan** 0.24/0.071/0.019/0.006. Both decay with distance. EP ~20×
+weaker in ABSOLUTE coupling (scale-caveated — z* vs hidden norm), but NORMALIZED to d=1 EP reaches
+*farther* than BP (0.30 vs 0.16 at d16; 0.08 vs 0.03 at d64). Per-position CE: EP uniformly ~0.7–1.0
+worse than BP at every position (no long-range-specific failure). ⇒ **does NOT support — arguably
+refutes — "contraction cuts long-range mixing"**; EP's deficit is uniform, consistent with the
+estimator floor (exp1) being the binding constraint. (BP-C512 trained to **1.6953**.)
+Honest gap: a stable BPTT-ctl-tight ckpt does not exist (BPTT broke at C512), so the cross-method
+influence comparison is EP-vs-BP only, not the full tight/loose/BP set.
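
The "normalized to d=1" comparison in exp7 is plain arithmetic on the influence numbers quoted there. A short check, with the dictionary layout and function name chosen only for this example:

```python
# Influence ||dz*_q|| / delta at distance d, copied from the exp7 (FULL) numbers.
influence = {
    "BP":        {1: 5.34, 16: 0.88,  64: 0.15,  200: 0.003},
    "EP-orphan": {1: 0.24, 16: 0.071, 64: 0.019, 200: 0.006},
}

def normalize_to_d1(curve):
    """Divide every entry by the d=1 value so only the decay shape remains."""
    base = curve[1]
    return {d: v / base for d, v in curve.items()}

normalized = {name: normalize_to_d1(curve) for name, curve in influence.items()}
absolute_ratio_d1 = influence["BP"][1] / influence["EP-orphan"][1]

for name, curve in normalized.items():
    print(name, {d: round(v, 3) for d, v in curve.items()})
print("BP/EP absolute coupling at d=1:", round(absolute_ratio_d1, 1))
```

The normalization removes the absolute scale (which the text flags as scale-caveated) and leaves only how quickly influence falls off with distance.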
+
+**Campaign verdict (data; user concludes):** the C512 2.40 plateau is, decisively, an **EP estimator
+SNR/coherence collapse near the optimum** (exp1: cos→0, self-cos −0.27, k→4000, true gradient clean).
+It is not LR (exp2), not horizon (exp1), not λ-floor/target (exp5) — though jr_max≤8 buys ~0.04
+(exp5). The architecture has a SEPARATE wall: exact-gradient BPTT can't stay stable at C512 (exp3).
+The contractive-mixing hypothesis is NOT supported (exp7: EP's range is comparable to BP, deficit is
+uniform). ⇒ the EP lever is reducing the estimator bias floor near the optimum (N=4 holo / tracking-
+AEP / lock-in / navg); the architecture lever is the stability margin for exact gradients.
+
+**Early-slowness dissection (C512 BPE) — cos/k probe CONFIRMED; ablations STILL RUNNING.**
+From existing logs: **EP ≈ BPTT at every early step** (step 200: 5.84 = 5.84; step 800: 3.86 vs
+4.05; step 2000: 3.58 vs 3.73), both **~1.2 behind BP** (2.76 @800). ⇒ early slowness is *not* the
+EP estimator — the exact-gradient twin on the same architecture is equally slow; the gap is the
+equilibrium architecture + recipe vs a standard transformer, present already at step 200 (pre
+warmup-end). Probe (`--probe_bptt`: cos(g_EP,g_BPTT) + k=|gEP|/|gBPTT| along the REAL trajectory):
+- **Estimator quality tracks res tightly**: res≈1e-9 → cos 0.93–0.99, k≈1.0 (all groups); a res
+  spike to 1.4e-3 (mid-excursion) → cos all 0.86 / attn 0.70 / emb 0.13, k→0.4. ⇒ the earlier
+  kinit *synthetic* probe (cos 0.33 / k 0.42 "at init") was a probe-operating-point artifact; on the
+  real trajectory the EP gradient is faithful (cos≈1, k≈1) **whenever the free phase is converged**,
+  and **k≈1 means no adaptive-k is needed in the converged regime**.
+- **warmup × validity-gate early-stall mechanism**: with warmup, early lr is tiny → the contraction
+  homeostat can't pull res under the gate (5e-3) → free-phase res stuck ~1.5e-2 for ~50 steps →
+  the **validity gate skips the nudge** → the reported g_EP is pure jacreg/contraction (cos≈0,
+  k≈11) = *the task is not being learned at all* for ~50–60 steps. res drops < gate by ~step 60–80
+  → cos jumps to 0.99. No-warmup: res converged (1e-9) by step 20, cos 0.98 from step 20
+  (val 5.72 vs warmup 6.07 @ step100).
+- **CAVEAT — the tradeoff is UNRESOLVED**: the no-warmup arm hit an instability excursion ~step 180
+  (res→1.4e-3, jr→32, recovering by 200) — i.e. warmup's stability role is real. Whether no-warmup
+  survives to 1200, plus the **resinit (arm B)** and **lr 2e-3 (arm D)** attributions, are STILL
+  RUNNING as of 16:00 — no verdict drawn. Probe instrumentation: `lt_ep_train.py --probe_bptt N`;
+  launcher `/tmp/lt_ep/early_dissect.sh`.
+
+## 2026-06-17 — C512 diagnostic campaign (7 experiments): the 2.40 plateau is an EP estimator bias-floor; exact-gradient BPTT is LESS stable than EP at C512
+
+Controlled campaign at C512 BPE (orphan ckpt = EP best 2.4037). Raw results + the causal cut each
+experiment was built for. Scripts: `lr_sweep.py`, `triangulation.py`, `mixing_probe.py`,
+`lt_ep_train.py --probe_branch/--probe_bptt`; orchestration `master.sh`.
+
+**exp3 — BPTT+controller, the decisive matched cut (GPU0).** Same arch+controller+lr 8e-4 as the EP
+orphan; only the task gradient differs (exact vs EP). Descended cleanly to best **3.85 @step1400**,
+then **DESTABILIZED at step 1600** (jr→32, res→0.07) and lodged in a broken basin (val 6.27,
+res ~0.068 — just under the 0.1 fuse, so no abort) for 8400+ steps. **The exact-gradient twin did
+WORSE than EP (3.85-stuck vs EP 2.40-completed).** ⇒ at C512 the contraction controller does NOT
+keep the exact gradient on the manifold; EP's implicit contraction-bias is what kept the orphan
+alive. Inverts the S0 picture (BPTT+ctl 1.635 < EP 1.676). Caveat: single lr/seed.
+
+**exp1 — triangulation at the 2.40 plateau (estimator vs horizon).** BPTT150 vs BPTT400
+cos = **1.000** (all/attn/ffn; emb 0.986) — true gradient perfectly well-defined, ZERO horizon
+ambiguity — but EP vs BPTT cos = **−0.045** (orthogonal), k=|gEP|/|gBPTT| ≈ **4000**. ⇒ the plateau
+is an **EP-specific estimator failure**, not slow-mixing horizon ambiguity.
+
+**exp2 — one-step LR-sweep at the plateau.** BPTT best ΔCE −0.16 (lr 1e-2); EP best −0.04 (lr 1e-4),
+diverges for lr≥3e-4. cos 0.11, k 486. ⇒ NOT LR-inequivalence; EP's direction caps one-step descent
+4× below BPTT at any lr.
+
+**exp1+exp2 synthesis:** the EP estimator has a ~fixed bias/noise floor. k runs **~1 during active
+training → ~4000 at the plateau** (true gradient shrank below the EP bias). 2.40 is the SNR
+crossover; below it the true gradient is smaller than the EP bias so EP can't see it. Not LR, not
+horizon, not λ. ⇒ the EP lever is **reducing the estimator floor near the optimum** (N=4 holo /
+tracking-AEP / lock-in / navg averaging), distinct from the architecture's stability lever.
+
+**exp5 — λ floor/target grid (warm-start from orphan).** base 0.1/1.5e-3 → 2.463; 0.05/3e-3 → 2.491;
+0.03/3e-3 → 2.485; 0.1/5e-3 → 2.488 — ALL worse than orphan 2.40 (slight late-drift up). ⇒ relaxing
+the controller does NOT break the plateau — not a λ-floor tax. (floor 0.03 reached jr=0 without
+collapse over 1200 warm-started steps.)
+
+**exp6 — branch-growth (resinit vs no-resinit).** resinit 0.1: branches grow gradually & stably
+(|WO| 2.35→4.91, |pj| 3.23→6.57 over 1000 steps; res ~1e-4; best 3.71). no-resinit: branches start
+large (|WO|~20, |pj|~20), excursion at step 800 (res→0.23, jr→32; recovered, |WO|→14.7; best 4.04).
+⇒ resinit's gradual growth is stabilizing; branches do grow (not stuck-small).
+
+**exp7 — mixing/influence length.** ||Δz_q*||/δ FLAT across distance (~0.007 at d=1 and d=200) and
+weak; loose-target (lam_t5) similar (~0.005). ⇒ no distance-decay signature; INCONCLUSIVE on the
+contraction-cuts-mixing hypothesis (no non-contractive/BP reference; coupling uniformly weak).
+
+**exp4 — warmup/gate (probes done).** no-warmup best 3.56@1200 (survived its step-180 excursion);
+warmup 4.11@800. warmup×validity-gate early-stall confirmed (warmup keeps res>gate ~50 steps →
+nudge skipped → no task learning early).
+
+**Net (data; user to conclude):** the 1–1.5 band is blocked at C512 by TWO distinct walls —
+(1) EP's estimator bias-floor caps EP at 2.40 (cos→0, k→4000 at the plateau; true gradient is
+clean), and (2) the equilibrium architecture+controller can't keep the EXACT gradient stable at
+C512 (BPTT broke at step 1600). EP's contraction-bias makes it the more robust of the two here.
+The contractive-conflict hypothesis is supported on the BPTT side; the EP plateau is specifically
+an estimator-SNR-floor effect.
+
+## 2026-06-17 (round 2) — the plateau bias floor IS the frozen AEP linearization; tracking-AEP fixes it (gradient level)
+
+Estimator ablation at the 2.40 plateau ckpt (`plateau_ablation.py`, vs BPTT400, n_grad=4):
+
+| estimator | cos | k | batch self-cos |
+|---|---|---|---|
+| N2 frozen-AEP (current) | −0.045 | 4133 | −0.27 (incoherent) |
+| N4 / N8 | −0.03 / −0.01 | 31 / 29 | ~0 |
+| N2 r=.05/.10/.20 | −0.04 | ~4200 | −0.27 |
+| N2 fixed-T2 20/80/120 | nan (diverges w/o snapshot selection) | — | — |
+| **track_N2 (common-mode AEP)** | **0.997** | **0.9** | **+0.95 (coherent)** |
+
+Diagnostic at z*: **|Jv−Jᵀv|/|Jv| = 1.37** (J highly non-normal). ⇒ the bias floor IS the
+frozen-at-z* AEP linearization. N=4/8 fix only the magnitude (k 4000→30), not the direction; r and
+T2 do nothing; **tracking-AEP (re-linearize at the moving common mode z̄=½(z₊+z₋)) restores cos
+−0.045→0.997, coherence −0.27→+0.95, magnitude 4133×→0.9×.**
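
For intuition about the diagnostic |Jv−Jᵀv|/|Jv|, here is a minimal sketch on explicit 2×2 matrices. The real probe presumably works matrix-free on the model Jacobian; the explicit matrices, the probe vector, and the function names below are assumptions for illustration only.

```python
import math

def matvec(J, v):
    return [sum(Jij * vj for Jij, vj in zip(row, v)) for row in J]

def transpose(J):
    return [list(col) for col in zip(*J)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def asymmetry_ratio(J, v):
    """|Jv - J^T v| / |Jv| for an explicit matrix J and probe vector v."""
    Jv = matvec(J, v)
    JTv = matvec(transpose(J), v)
    return norm([a - b for a, b in zip(Jv, JTv)]) / norm(Jv)

v = [1.0, 2.0]
symmetric = [[2.0, 1.0], [1.0, 3.0]]        # J = J^T: conservative-like, ratio 0
antisymmetric = [[0.0, -1.0], [1.0, 0.0]]   # J^T = -J: pure rotation, ratio 2
mixed = [[-1.0, -2.0], [1.0, -1.0]]         # damped rotation, in between

ratios = {
    "symmetric": asymmetry_ratio(symmetric, v),
    "antisymmetric": asymmetry_ratio(antisymmetric, v),
    "mixed": asymmetry_ratio(mixed, v),
}
print(ratios)
```

The ratio is 0 when J acts symmetrically on v and 2 when J is purely antisymmetric, so a measured 1.37 sits well toward the rotation-dominated end of that range.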
+
+**exp D (per-group):** collapse is UNIFORM across WQ/WK/WV/WO/fc/pj/ln1/ln2/tok/pos (all cos
+−0.03..−0.08, k 3k–18k) — the corrupted contrast `a` poisons every VF gradient equally (single
+shared cause, not attention/LN-specific unlike the S1 bias_var diagnosis).
+
+**exp B (navg):** cos(EP,BPTT) over navg = 0.37→0.40→0.55→−0.30→−0.21 (1/2/4/8/16); self-cos rises
+(0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8. ⇒
+**deterministic bias, not variance** — averaging converges to the wrong (biased) direction, so
+navg/restart-averaging will NOT fix it.
+
+### 2026-06-17 (round 2b) — free-phase ε is mistuned ~3× (monDEQ α-analysis analog); Anderson untapped
+
+Free-phase convergence probe at the 2.40 ckpt (`solver_probe.py`, res_F=‖F(z)‖/‖z‖, eps-independent,
+B=2, 4 batches). Steps to res_F≤1e-4 / ≤1e-6:
+
+| solver | ε | steps to ≤1e-4 | steps to ≤1e-6 | note |
+|---|---|---|---|---|
+| plain Euler | 0.05 | 470 | 844 | |
+| plain Euler | **0.1 (ours)** | 236 | 548 | |
+| plain Euler | 0.2 | 121 | 274 | |
+| plain Euler | **0.3** | 85 | 182 | ~2.8× faster than 0.1, same fixed point, stable |
+| plain Euler | 0.5 | stalls | stalls | stalls at res~0.9 (non-convergent; the stable ceiling is ~0.3–0.4) |
+| Anderson (m=5) | 0.1 | 85 | 197 | |
+| Anderson (m=5) | **0.2** | 58 | 118 | ~4× vs our current 0.1 |
+
+Anderson is robust across ε. ⇒ **ε=0.1 is ~3× too small**; ε≈0.3 or Anderson@ε0.2 cut
+free-phase force-evals ~3–4× (and let T1≈100 always clear res≤1e-4, dropping adaptive-T1) → directly
+attacks the dominant EP compute cost AND tightens res (better estimator). This IS the
+"preconditioned/Anderson relaxation" open item, now quantified. Caveats: free phase at trained point / B=2 (optimal ε
+may shift early-training / slow-mixing); the **nudged phase shares ε but has AEP+holo dynamics, so a bigger
+ε there needs a separate stability check** before changing. Confirms the monotone-operator α-sensitivity
+(non-normal dynamics, |Jv−Jᵀv|/|Jv|=1.37) carries over to our hand-set Euler step.
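
The shape of this ε-sweep (faster convergence as ε grows, then a hard stall) can be reproduced on a toy problem. The sketch below uses a made-up 2×2 linear force with eigenvalues −1±1j, not the model's `tforce`; the matrix, start point, and function names are assumptions, and the toy's stability ceiling (ε<1) differs from the measured ~0.3–0.4.

```python
import math

# Toy non-conservative linear force F(z) = A z + b; A has eigenvalues -1 +/- 1j.
A = [[-1.0, -1.0], [1.0, -1.0]]
b = [1.0, 0.0]

def force(z):
    return [A[0][0] * z[0] + A[0][1] * z[1] + b[0],
            A[1][0] * z[0] + A[1][1] * z[1] + b[1]]

def steps_to_converge(eps, tol, max_steps=5000):
    """Plain Euler z <- z + eps*F(z). Returns the first step at which
    res_F = |F(z)|/|z| <= tol, or None if the iteration stalls or diverges."""
    z = [1.0, 0.0]
    for step in range(1, max_steps + 1):
        f = force(z)
        z = [z[0] + eps * f[0], z[1] + eps * f[1]]
        f = force(z)
        z_norm = math.hypot(*z)
        res = math.hypot(*f) / max(z_norm, 1e-30)
        if res <= tol:
            return step
        if not math.isfinite(res) or z_norm > 1e12:
            return None  # blew up: this step size is past the stable ceiling
    return None

sweep = {eps: (steps_to_converge(eps, 1e-4), steps_to_converge(eps, 1e-6))
         for eps in (0.05, 0.1, 0.2, 0.3, 1.2)}
for eps, counts in sweep.items():
    print(eps, counts)
```

In this toy the per-step contraction factor is |1+εμ|, so the step count falls as ε grows toward the optimum and the iteration fails once |1+εμ| exceeds 1. The same reasoning is why a separate stability check is needed before raising ε in the nudged phase.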
+
+**Decision:** the EP lever collapses to **common-mode tracking-AEP** (not N=4 / lock-in / navg).
+Caveat: validated at the gradient level at one ckpt; the one-step ΔCE is magnitude-confounded, so the
+NECESSARY next test is **training** — does EP with `--track` descend below 2.40 (`holo_a_track`
+already wired)? exp C (BPTT+controller rescue, lr 2e-4/4e-4/6e-4 × jr_max 16/32) runs separately for
+the architecture-ceiling question.
+
+## 2026-06-18 — tracking-AEP VALIDATED IN TRAINING: it breaks the 2.40 plateau
+
+Warm-track test (warm-start from the 2.40 orphan ckpt, jr_max 8, warmup 0, 2500 steps):
+- **TRACK-warm (common-mode tracking-AEP): 2.40 → best 2.1628 @2500, still descending**, stable
+  (res ~6e-5, jr at floor). ⇒ tracking-AEP descends PAST the 2.40 plateau — the round-2 estimator-
+  floor fix works in training, exactly as the fixed-ckpt probe predicted (cos −0.05→0.997).
+- **STD-warm control (standard estimator, same recipe): ABORTED @step 397** (blew up, res→0.21).
+- **TRACK-fresh (tracking from scratch): ABORTED @968** (blew from random init).
+
+⇒ under the identical aggressive recipe (jr_max 8), tracking-AEP is BOTH stable AND descends, while
+the standard estimator blows up. But tracking-AEP is **not a from-scratch drop-in** (fresh blows up)
+— the recipe is **two-phase warm-track**: standard estimator to ~2.40 (where it's stable), then
+switch to common-mode tracking-AEP to descend below the floor (matches the S1 warm-track champion).
+Caveat: std-control's abort is partly the aggressive jr_max 8 (orphan used jr_max 32, stable at
+2.40); the clean claim is "at jr_max 8 tracking survives+descends where std blows up". The 2.16 closed
+~0.24 of the 0.61 EP-vs-BP gap and was still descending → continuing (`trkcont` from 2.16) to find
+the ceiling.
+
+**exp C — architecture ceiling (BPTT+controller rescue, 2500 steps):** stable exact-gradient exists
+(lr≤6e-4 never breaks, unlike lr8e-4 which broke @1600). Best: lr6e-4 jr16 → 3.08, still descending.
+At equal step (2500): **BP 2.34 ▸ BPTT 3.08 ▸ EP-orphan 3.55** ⇒ the ~10× proportional slowdown
+(confirmed: median 9.4× in the descending regime) splits into an lr-stability-capped architecture
+handicap (~0.74, BP→BPTT; the equilibrium block can't take BP's 3e-3 lr) AND an EP-estimator
+handicap (~0.47, BPTT→EP). jr_max 16 > 32 confirmed again. Long ceiling run (`bptt_ceiling`,
+lr6e-4 jr16, 12k) pending.
+
+**Found + staged a real code inconsistency:** free phase / model / readout use erf-GELU (F.gelu)
+while the holo nudge uses tanh-GELU (cgelu) → z* isn't the nudge-force's fixed point → spurious
+common-mode drift in the contrast every step (which tracking-AEP partly absorbs — possibly part of
+why it wins). Fix staged as `--gelu tanh` (additive, default erf). Separately: free-phase ε=0.1 is
+~3× too small (ε0.3 or Anderson → ~3–4× fewer relaxation steps; `solver_probe`).
+
+---
+
+## 2026-06-24 — below-CE-2.1 divergence: DIAGNOSED (genuine Hopf) + FIXED (resreg/jacreg = finite-T1 LE control)
+
+Full write-up: **`SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md`**. This pins down the forward-stability wall that sat under TL;DR #3 ("no loss ceiling, but a validity threshold that costs reg + steps") — the threshold is now mechanistically a **Hopf bifurcation**, and the "regularization tax" is specifically **LE control**.
+
+**The mechanism (3 independent confirmations).** Below CE~2.1 the relaxation diverges because the learned **non-conservative attention** operator undergoes a **genuine continuous-time Hopf bifurcation** — a complex eigenvalue pair crosses **Re μ > 0** as EP makes attention expressive. Matrix-free FD-JVP Arnoldi on M=I+εJ (`ep_run/eig_probe.py`): leading continuous μ=(λ−1)/ε goes **−0.024 (s2000, CE 3.13, STABLE) → +0.44±1.37j (s3200, 2.74) → +1.35±2.08j (ep_eps05, 2.41)** — Re μ and |Im μ| both grow as CE drops. Confirmed by (a) eval limit-cycle + attention-knockout (α=0.2 converges), (b) Anderson can't reach a root at s3200 (near-root Re μ=+0.24 rotating). **Earlier "Euler artifact" read (and the cont.6 forward-mode/RTRL framing) are SUPERSEDED** — fugu caught that the ε-sweep "convergence" was the *step* residual ε·g, not g→0; an intermediate misread. (The Euler over-amplification is a real *second* layer — ε relocates the wall 2.74→2.41 — but the root cause is Re μ>0, which no ε closes.)
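
The claim that no ε closes a Re μ>0 instability follows from the Euler map λ=1+εμ: |λ|² = 1 + 2ε·Re μ + ε²|μ|², which exceeds 1 for every ε>0 when Re μ≥0. A small worked check using the eigenvalues quoted above; the function names are illustrative, and only the leading eigenvalue is considered, so the stable-ε bound ignores all other modes.

```python
def discrete_multiplier(mu, eps):
    """Euler map M = I + eps*J sends a continuous eigenvalue mu to 1 + eps*mu."""
    return 1 + eps * mu

def continuous_eigenvalue(lam, eps):
    """Inverse map used in the text: mu = (lambda - 1) / eps."""
    return (lam - 1) / eps

def max_stable_eps(mu):
    """Largest eps with |1 + eps*mu| < 1, from
    |1 + eps*mu|^2 = 1 + 2*eps*Re(mu) + eps^2*|mu|^2 < 1.
    Returns 0.0 when Re(mu) >= 0: no positive step size is stable."""
    if mu.real >= 0:
        return 0.0
    return -2 * mu.real / abs(mu) ** 2

# Leading continuous eigenvalues quoted above (one member of each conjugate pair).
probes = {
    "s2000 (CE 3.13)": complex(-0.024, 0.0),
    "s3200 (CE 2.74)": complex(0.44, 1.37),
    "ep_eps05 (CE 2.41)": complex(1.35, 2.08),
}

radii = {}
for name, mu in probes.items():
    radii[name] = {eps: abs(discrete_multiplier(mu, eps)) for eps in (0.1, 0.05, 0.01)}
    print(name, radii[name], "max stable eps:", max_stable_eps(mu))
```

Shrinking ε pulls |λ| toward 1 from above for the two unstable checkpoints but never below it, which matches the reading that ε only relocates the wall.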
+
+**Why equilibrium-EP hits this and BPTT doesn't (structural).** Equilibrium-EP optimizes L(z*); its gradient is blind to *finite-time contraction* (a param eroding ρ→1 with z* fixed has zero equilibrium gradient). BPTT differentiates the T1 unroll, so a non-converged z_T1 → bad output → penalized → it implicitly defends contraction. So EP needs that defense added back explicitly.
+
+**The fix — control the finite-T1 Lyapunov exponent (LE).** Stability ⟺ ρ(M)<1 ⟺ finite-T1 LE<0. Two ~orthogonal handles, they stack:
+- **resreg** (`lt_ep_train.py:220-231`) = penalize the T1-residual ‖εF(z_T1)‖ ~ ρ^T1 = exp(T1·LE) → a **direct LE penalty** (also catches the non-normal transients eigenvalues miss). **The PROVEN below-2.1 stabilizer.**
+- **jacreg** (`:211-219`) = penalize ‖J_nc·v‖ → shrinks |Im μ|, pushes the pair to Re μ<0 (cause-side). `ep_run/eig_jacreg.py` confirmed at the mechanism level: at the same CE~2.74, frozen-jacreg = Re μ=+0.45 rotating g_floor 0.26, adaptive-jacreg = **Re μ=−0.23 real, g_floor 1e-4** (Hopf killed, true fixed point restored, AsymEP validity restored).
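
The relation "T1-residual ~ ρ^T1 = exp(T1·LE)" can be checked on a toy relaxation. The sketch below is not the code at `lt_ep_train.py:220-231`; the 2×2 force, the function names, and the way the weight 0.2 multiplies the residual are all assumptions made for illustration.

```python
import math

def relax_residuals(a, w, eps, T1):
    """Euler-relax the toy force F(z) = A z + b with A = [[-a, -w], [w, -a]]
    (eigenvalues -a +/- w*1j) and return the step residuals |eps*F(z_t)|, t=0..T1."""
    b = (1.0, 0.0)
    z = (0.0, 0.0)
    residuals = []
    for _ in range(T1 + 1):
        f = (-a * z[0] - w * z[1] + b[0], w * z[0] - a * z[1] + b[1])
        residuals.append(eps * math.hypot(*f))
        z = (z[0] + eps * f[0], z[1] + eps * f[1])
    return residuals

def finite_time_le(residuals):
    """Average log growth rate per step: LE = ln(res_T1 / res_0) / T1."""
    T1 = len(residuals) - 1
    return math.log(residuals[-1] / residuals[0]) / T1

def resreg_penalty(residuals, weight):
    """Hypothetical penalty term: weight times the final step residual."""
    return weight * residuals[-1]

eps, T1 = 0.1, 150
stable = relax_residuals(a=1.0, w=1.0, eps=eps, T1=T1)      # Re(mu) = -1 < 0
unstable = relax_residuals(a=-0.44, w=1.37, eps=eps, T1=T1)  # Re(mu) = +0.44 > 0

le_stable = finite_time_le(stable)
le_unstable = finite_time_le(unstable)
print("stable   LE:", le_stable, "penalty:", resreg_penalty(stable, 0.2))
print("unstable LE:", le_unstable, "penalty:", resreg_penalty(unstable, 0.2))
```

For this linear toy the residual shrinks or grows by exactly ρ=|1+εμ| per step, so the measured LE equals ln ρ and the penalty is small only when LE<0. In the real model the link is approximate, and the text notes the residual also picks up non-normal transients that eigenvalues miss.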
+
+**★ The 2.09 recipe = FROM SCRATCH + resreg=0.2 + FROZEN jr** (NOT adaptive jacreg — that was a session-long detour; cmd at `EP_BELOW210_DIAGNOSIS_FIX.md:97-101`). Original ep_resreg2 reached **2.0573** (lowest EP ever, lost to /tmp wipe; rebuild 2.22). This session: ep_resreg_warm (resreg eager, warm from stable s2000) descending smoothly through 2.30 (peak res 1.6e-2, no spikes) — the clean 2.09 test; ep_rr_scratch (from-scratch, proven recipe) launched as the robust confirmation. **Warm-start ONLY from a STABLE operator (s2000, Re μ<0); resreg/jacreg PREVENT instability, they can't RESCUE an already-blown one** (warming from ep_eps05@2.41 blew).
+
+**Infra (#14).** `--compile` EXONERATED + SAFE (compiles the free-phase `tforce`; numerically identical to eager — z150 rel-diff 1.6e-7, pure fp32; ~1.43x, ~3.3x with t2sel40). **`--tf32` UNSAFE — do NOT use** (10-bit mantissa ≈ 1e-3 perturbation; the relaxation is hyper-sensitive — ε 0.1→0.05 alone moved the wall 0.33 — so TF32 is too coarse). EP parallelism for the deep/scaling phase: no sequential backward, coupled equilibrium stack (#13) → depth-parallel, adaptive-T1 (residual-stopped) cleaner than adaptive-ε.
+
+**Lit anchor.** Hopfield-ResNet (arXiv 2509.26003) EP-trained 12 layers — but it's **conservative** (energy Φ, symmetric weights, no oscillation by construction). Confirms non-conservativity is our culprit; **we are the first to EP-train non-conservative attention** (which has the Hopf), solved via resreg/jacreg.