# EP / AEP for Transformers: Findings

**Question.** Can transformers (attention + FFN) be trained with Equilibrium Propagation (EP), i.e. without backprop and without Feedback Alignment? We started from the literature, reproduced the SOTA, characterized AEP (EP for non-conservative dynamics), and ported the recipe to a real char-LM.

---

## TL;DR

1. **EP/AEP give FAITHFUL LOCAL GRADIENTS for every transformer component**: attention cosine **0.99**, FFN **1.0** vs the true BP gradient. This solves the credit-assignment problem that **Feedback Alignment fails at**: FA only trains layers adjacent to the loss (output proj cos≈1.0) and leaves upstream layers as noise (**attention Q/K/V cos≈0.25, FFN fc cos≈−0.01**).
2. **Recipe to make real (non-conservative) attention EP-able: DAMPING + AEP.** A damping term `s·(attn(z) − c·z)` (c≥1) creates a stable fixed point at *any* attention strength, while keeping the map non-conservative (independent Q/K/V); AEP then recovers the gradient (0.99–1.0).
3. **Stable end-to-end EP training: SOLVED** by a residual-driven continuous controller on a soft Jacobian penalty (Bai-2021-style, Hutchinson) + damping. EP trains 12k+ steps without blowup. **There is NO structural "EP ceiling"** (a ~2.5-CE wall was claimed mid-project and is **retracted**; see the *2026-06-09* section): EP has a quantitative **validity threshold** (free-phase residual ≲ few·1e-4, nudge inside the linear-response window), and meeting it costs a regularization tax + steps. Both are reducible; neither is a wall.

---

## 2026-06-20 - below-2.10 EP divergence: ROOT CAUSE + the residual-defense fix

The C512 EP wall is diagnosed. Symptom: EP frozen-jr descends to best ~2.09, then SUDDENLY diverges within ~200 steps (res 5e-3→0.15, cos(g_EP,g_BPTT) 0.98→~0, CE→4+, abort), while exact-BPTT with the IDENTICAL recipe sails past to 1.72 (freeze_wsd). The cause is NOT the controller, NOT jacreg, NOT the erf/tanh GELU, NOT a loss-landscape wall.

**Root cause (Codex-confirmed, 5-way corroborated): EP optimizes the FIXED POINT; BPTT optimizes the FINITE UNROLL. Only the finite unroll defends the residual.** `ep_step` relaxes T1=150→z150, then REFINES to t1max=300→z* and takes the gradient at z*; but eval & `bptt_step` use z150. EP never "feels" the T1 residual → as attention gets expressive, contraction weakens, z150 drifts from z*, res@T1 inflates, the EP estimate (valid only at small res) dies → blowup. BPTT differentiates the actual 150 steps, so a non-converging unroll → bad CE → its gradient implicitly rewards strong contraction.

That defend-the-residual term is what EP structurally lacks: cos 0.977 holds only AT the fixed point; the missing ⊥ component is ~21% (=√(1−0.977²)) = the finite-horizon transient gradient. (T1=∞ would make even BPTT lose it: BPTT's stability IS the finite truncation; the equilibrium/implicit gradient only cares WHERE the fixed point is, not how fast you reach it → no contraction reward.)

Two layers:

1. point mismatch (refinement): `--t1max 150` alone moved the wall 2.09→2.05 but still blew@600;
2. gradient flavor: needs the explicit penalty.

The diverged state is a forward bifurcation to a LIMIT CYCLE, not under-relaxation (eval_relax: res floors at ~6e-2 and oscillates, 150→4000 relax steps don't help, CE ~3.7; FTLE stays negative, −0.027..−0.050; single random-vector FTLE misses the cycle). ⇒ adaptive / more relaxation steps CANNOT fix it (res as a STOPPING criterion just chases a vanishing fixed point); only res as a COST (penalty) prevents the drift.

**Fix: explicit T1-residual penalty `--resreg`.** Defend z150 with R=‖εF(z150)‖²/(‖z150‖²+ε), grad w.r.t. θ (z150 detached), task-relative scale (ratio = resreg·min(1, res@T1/2e-2), deadband res@T1>7e-4), added to the EP gradient. **MUST run res_gate=0** (the validity gate early-returns jacreg-only above the gate, bypassing the penalty exactly when res is high; the first gated attempt blew@200).
Keep t1max=300 (estimator accuracy) + penalty (defends z150). Analog-compatible (one extra force measurement + local VF gradient, no root-finder); more targeted than jacreg (which penalizes ‖J_nc‖, not the residual that explodes).

**Validation (interrupted by /tmp wipe, needs re-run):** ep_resreg2 (res_gate=0, resreg=0.2, warm from erf-2.09) held res pinned at 1–5e-4 through step 1000, best **2.0573** (past the wall, the lowest any EP run reached), zero inflation, where every no-penalty variant blew (ep_nogate@100, gated ep_resreg@200, ep_t1max150 reached 2.05 then blew@600). The run + ALL local ckpts were deleted 2026-06-20 by the /tmp 10-day cleanup before reaching ~1.8.

Corroboration (5):

1. Codex independent diagnosis;
2. BPTT-from-2.09 control (res stayed 1.8e-4 + descended, same fresh-opt restart where EP blew → the EP *update* is the destabilizer);
3. FTLE (stable-BPTT 1.72 has WEAKER contraction −0.0347 than diverged-EP −0.0377 → not forward-stability);
4. eval_relax (limit cycle);
5. gradient decomposition (jacreg acquitted: 3% of grad, orthogonal, removing it changes cos by <0.001).

Verbatim code + re-run plan: **EP_BELOW210_DIAGNOSIS_FIX.md**.

---

## Background (literature)

- **EP**: Scellier & Bengio 2017. Energy-based; free phase relaxes to a fixed point, nudged phase with strength β; local weight update from the two equilibria. Centered/symmetric nudging (Laborieux 2021) reduces the gradient-estimator bias.
- **EP ≡ BPTT**: Ernoult et al. 2019. EP updates match BPTT gradients in an RNN with static input, in the limit β→0 with a converged free phase and enough nudged steps. (Relevant: as β→0 EP is unbiased, so our instability is NOT primarily gradient bias.)
- **Holomorphic EP**: Laborieux & Zenke 2022. Exact gradients via finite oscillations, removing the β bias/noise trade-off.
- **CET**: Høier, Kerjan, Scellier (ICLR 2026 AM workshop), arXiv via OpenReview `Qrfml76eWJ`. Convergent Energy Transformer, EP-trained, CelebA masked completion, EP ≈ TBPTE.
  The current SOTA for "EP + attention". Conservative (energy) attention with tied value → guaranteed fixed point.
- **AEP / AsymEP**: Scurria, Vanden Abeele, Mognetti, Massar, "EP for Non-Conservative Systems", arXiv:2602.03670. Nudged-phase correction `−2 A_J(x*)(x−x*)` (A_J = antisymmetric part of the Jacobian at the free equilibrium) turns the nudged Jacobian J → Jᵀ, giving the exact gradient even when no energy function exists. **The correction is theirs**; their scope is feedforward/Hopfield nets on static MNIST/CIFAR (no attention, no sequence, explicit Jacobian, no stability controller). Their force-form **VF** baseline collapses without the correction. See 2026-06-16 for the full ours/theirs boundary.
- **FRE-RNN**: "Toward Practical EP", arXiv:2508.11659 (Zhuo Liu et al.). Fixes EP instability via **feedback regulation (reduce spectral radius → fast convergence) + residual connections** (vanishing gradients). Directly relevant to our open problem (keep the free phase converged).
- Adjacent energy-transformers (BP-trained, not EP): Energy Transformer (2302.07253), EBT "Scalable Learners and Thinkers" (2507.02092), NrGPT (2512.16762).

---

## The arc (what we built & found)

| stage | finding | evidence |
|---|---|---|
| literature | EP+attention works only narrowly; CET is SOTA (CelebA, EP≈BP, single block) | CET full read |
| MVP (CET repro) | EP trains energy-attention+memory; **cosine 0.99; EP≈TBPTE** | CIFAR 0.539/0.546; FMNIST 0.272/0.278 |
| AEP (non-cons.) | corrects real-attention gradient: toy 0.30→**0.9975**; CET attn-params 0.75→**1.0** | `aep_*.py` |
| characterization | needs T2≥20–40, T1≥80; β-insensitive (static); **advantage grows with depth** (K=2: naive 0.05 vs AEP 0.99); only changes nudged phase (free eq identical); 1.44× cost | sweeps |
| fundamental limit | **strongly non-conservative attention has NO fixed point → whole EP family fails** (projection bounds magnitude but does not create a fixed point) | residual stuck 3e-2 |
| **F: damping** | `s·(attn−c·z)`, c≥1 → stable fixed point at any s → **AEP 0.99–1.0** even at s=8 | `aep_contractive2.py` |
| Option-2 LM attention | **AEP gives LM causal attention 0.993** (FA 0.25; FA kills Q/K/V) | `lt_ep_attention.py` |
| Option-2 H1 (FFN) | **EP Hopfield-memory gives FFN 1.000** (FA −0.01, the abandonment reason) | `lt_ep_ffn.py` |
| Option-2 H2 (train) | **BPTT trains (val CE 2.16↓, random 4.17); EP local training destabilizes early** | `lt_ep_train.py`, `H2_*.log` |

**Why FA failed (the project's pain, confirmed):** FA only delivers a usable gradient to the layer right before the loss (output projection / FFN proj, cos≈1.0); all upstream layers (attention Q/K/V, FFN fc) get random-routed error → cos≈0. EP/AEP fix all of them.

**Why EP training destabilizes (mechanism):** BPTT differentiates whatever the fixed T1-step relaxation computes (a deep weight-tied net); it does NOT need convergence. EP *assumes* the free phase is at a fixed point; gradient descent on the loss pushes attention to be more expressive / non-conservative → fixed point lost → relaxation diverges → EP estimate (and its jvp/vjp correction) goes non-finite. Damping + clipping + weight-caps delay but don't prevent it.
*(Superseded: the residual-controlled Jacobian penalty DOES prevent it; see the 2026-06-09 section.)*

---

## Key numbers

- CET MVP (masked completion, test masked-MSE): CIFAR **EP 0.539 ≈ TBPTE 0.546**; FMNIST **EP 0.272 ≈ TBPTE 0.278**; BP-transformer 0.106/0.126; trivial(visible-mean) 0.583.
- EP-vs-BPTT gradient cosine (CET): global **0.99**; attention WQ/WK 0.98.
- AEP recovers non-conservative attention: toy naive 0.30 → AEP **0.9975**; CET attn-params 0.75 → **1.0**; damped high-s (s=8) AEP **0.99–1.0**.
- LM (Shakespeare char): AEP attention **0.993** (FA 0.25); EP Hopfield-FFN **1.000** (FA −0.01).
- H2 training: BPTT val CE **2.16 ↓** (random ln65=4.17); EP non-finite / no sustained training.

---

## Open problem & next directions

The frontier is **stable end-to-end EP training**. Hypotheses / plans:

1. **Keep the free phase converged during training** (FRE-RNN, 2508.11659): feedback regulation to keep the spectral radius < 1 + residual connections. Most directly targets the mechanism.
2. **β bias/noise trade-off**: Ernoult 2019 says EP≡BPTT as β→0 with converged phases; but small β amplifies the `(z₋−z₊)/2β` noise. Holomorphic EP (Laborieux–Zenke 2022) removes this trade-off. Diagnostic: log cosine(EP-grad, BPTT-grad) *during* training. Does it start at 0.99 and drop?
3. **Unified conservative energy** (no AEP): make the whole LM one energy (energy-attention, tied value) → guaranteed fixed point → plain EP is stable (we saw CET-EP is stable). Trades attention expressivity for stability.

## 2026-06-09 - H2 training era: the "wall" refuted, the real mechanism found

**Setting.** Shakespeare char-LM, single equilibrium block, C=128 H=4, same param budget everywhere. `--attn_mode thin/real` = clamp + Hopfield-FFN + damped real attention; `thick` = DEQ-transformer block (pre-LN + attention + untied 4× GELU FFN + residual + damping).
Stabilizer = damping `c` + soft Jacobian-norm penalty λ (Hutchinson on the non-conservative force) driven by a continuous controller on the free-phase residual (residual, NOT spectral radius, is the right signal: the attention Jacobian is non-normal, so transient growth is invisible to ρ).

**Scoreboard (best val CE; random ln65 = 4.17):**

| run | result |
|---|---|
| BP transformer, MLP=4 | 1.68 |
| BP transformer, MLP=1 (param-matched) | 1.79 |
| thick-BPTT (exact grad, same arch as thick-EP) | **1.95** (res ~1e-3→8e-3, learned contraction unaided) |
| mono-BPTT (monDEQ) | 2.11 |
| thin-BPTT | 2.21 (optimum at res ~1.5e-2) |
| **thick-EP champion (R5): R3 recipe + EMA-smoothed controller, 14k** | **2.0467** @ step 5k, within 0.10 of exact-gradient thick-BPTT |
| thick-EP (R3): λ-floor 0.1, res_target 5e-4, c=1, 8k | 2.0784, beats thin-BPTT and mono-BPTT |
| thick-EP (R1): λ-floor 1, c=2, 14k | 2.1504 (2.467@3k → 2.259@7.2k → 2.150@14k: monotone, no plateau, just slow) |
| thick-EP (R4): λ-floor 0.1, res_target 2e-4, 14k | 2.1352 (tight target ⇒ controller thrash, λ swings 0.5↔13 late) |
| thick-EP (R2): λ→0, c=0.5 | 2.3572 @ step 800, then **collapse to a degenerate fixed point** (res→0 exactly, val ≫ random) |
| thick-EP, c=2 λ=1, 3k (the run the "wall" was called on) | 2.4665 (just undertrained) |
| thin-EP super-long 12k | 2.4847 plateau, an **invalid-regime run**: res 2.4e-2–6e-2 the whole time, λ pinned at max 16 |
| thin-EP 3k, λ=16 | 2.5593 |

**Retraction.** The mid-project verdict, "EP is capped at ~2.5 by a convergence⟷richness wall (rich blocks need damping that destroys their expressivity)", is **wrong**. It was called prematurely at 3k steps, when thick-EP ≈ thin-EP ≈ 2.47–2.50 by coincidence (two different slow/broken modes passing through the same value).
Three independent refutations: (1) the 8k continuation sailed through the "wall" (2.2592@7.2k, monotone); (2) the slack run hit 2.357 in 800 steps; (3) the gradient probe below shows the EP estimator is healthy exactly where the stabilized runs operate.

Also wrong in the original story: damping c does NOT aid convergence for LN-inside blocks (LN Jacobian ∝ 1/σ(z): damping shrinks z*, *inflating* the Jacobian; measured: thick plain-relax res 8.8e-3 at c=0 vs 3.4e-2 at c=2, `solver_wall.py`), and the scary "init residual~12" was an unnormalized absolute-norm print.

**The real mechanism: a validity threshold, not a wall** (`grad_quality.py`: cosine of each estimator vs a long-horizon BPTT (T1=400) reference, at a 300-step-BPTT-pretrained thick block):

| estimator | free-phase res | cos vs BPTT-400 (3 batches) |
|---|---|---|
| EP T1=400, T2=20 | ~5e-5 | **0.85, 0.83, 0.88** (healthy) |
| EP T1=150, T2=20 | 2e-4–1.6e-3 | 0.21 / 0.86 / 0.27 (marginal, batch-dependent) |
| EP T1=50, T2=20 | 2.4e-3–3.6e-3 | −0.03 / 0.34 / 0.55 (broken) |
| BPTT T1=150 (itself!) | n/a | 0.45 / 0.77 / 0.36 vs BPTT-400 |
| EP T1=150, **T2=60** | same as T1=150 | 0.01 / 0.07 / −0.19 (**nudged phase leaves the linear-response window**) |

Readings:

- (a) at tight convergence the EP estimator agrees 0.85 with the exact gradient; the EP-specific overhead (finite β, T2 truncation, AEP clipping) is only ~0.15 of misalignment;
- (b) **the horizon/mixing cost is shared by BPTT itself** (BPTT-150 vs BPTT-400 cos 0.35–0.77): slow-mixing equilibria are expensive for everyone, not an EP defect;
- (c) EP needs res ≲ few·1e-4 to be valid; at res ~1e-2 (where the thin super-long run lived) EP updates are noise, which is what that 2.48 "plateau" actually was: 12k steps of noise + λ=16 Jacobian-shrinking gradient;
- (d) a longer nudged phase is NOT better: the AEP correction is linearized at z*, so T2 must stay in the linear-response window (T2≈20 at eps=0.1 is good, T2=60 destroys the estimate).

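
The alignment figures in the probe table are cosines between flattened gradient estimates. The following is a minimal sketch of that measurement, assuming each estimator hands back a dict of per-parameter gradients as flat lists of floats. The helper names (`flatten`, `grad_cosine`, `perp_fraction`) and the toy gradients are hypothetical; this is not the code in `grad_quality.py`.

```python
import math

def flatten(grads):
    """Concatenate per-parameter gradients (dict: name -> list of floats) in a fixed key order."""
    return [g for name in sorted(grads) for g in grads[name]]

def grad_cosine(grads_a, grads_b):
    """Global cosine between two gradient estimates over the same parameters."""
    a, b = flatten(grads_a), flatten(grads_b)
    if len(a) != len(b):
        raise ValueError("gradient estimates cover different parameter shapes")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Convention for this sketch: a zero gradient has cosine 0 with anything.
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def perp_fraction(cos):
    """Relative size of the component orthogonal to the reference: sqrt(1 - cos^2)."""
    return math.sqrt(max(0.0, 1.0 - cos * cos))

# Toy stand-ins for an exact (reference) gradient and a slightly misaligned estimate.
g_ref = {"wq": [1.0, 0.0], "fc": [0.0, 2.0]}
g_est = {"wq": [1.0, 0.5], "fc": [0.0, 2.0]}

cos = grad_cosine(g_est, g_ref)
print(f"cos = {cos:.4f}, orthogonal fraction = {perp_fraction(cos):.4f}")
```

`perp_fraction` reproduces the conversion used elsewhere in these notes: a cosine of 0.977 leaves an orthogonal component of about 21%.
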
**Decomposition of the EP→BP gap (final numbers):** architecture (BPTT−BP) = 1.949−1.79 = **0.16** (small; the energy/fixed-point architecture is fine); training rule (EP−BPTT) = 2.047−1.949 = **0.10**, i.e. slower optimization (estimator cos ~0.85) + the λ tax of staying inside the validity region. Total EP→param-matched-BP = **0.26** (the "wall" story claimed ~0.7 was structural). EP-trained thick (2.047) beats exact-gradient thin (2.206) and monDEQ (2.111): EP *does* cash in architecture richness, given validity + steps.

**λ-study (final, prediction confirmed: middle wins):** R5 2.047 < R3 2.078 < R4 2.135 < R1 2.150 < R2 2.357. The two failure modes flanking the optimum: λ→0 ⇒ fast descent then *collapse* (not explosion; a dead res=0 equilibrium); λ heavy ⇒ no failure, just a constant tax (R1 monotone to 2.150@14k). R4's lesson: a too-tight res_target puts the multiplicative controller on the residual noise floor; `(res/target)^0.3` per-step updates compound the noise and λ thrashes 0.5↔13, degrading late training. R5's fix: an EMA-smoothed residual signal (`--res_ema 0.9`) kept λ quiet (0.1–1.0) and set the champion 2.0467 by step 5k.

**Open item (the actual next frontier):** every stable recipe peaks mid-run (~5–6k) and then drifts up 0.1–0.2 despite cosine lr decay: EP's late-training noise floor. Suspects: VF-estimator noise dominating at small lr (cos 0.85 ⇒ persistent gradient-direction error), λ/lr balance shifting late, weight-norm caps binding. Holomorphic EP (removes β bias/noise) is the principled candidate for the remaining ~0.10; T2 must stay in the AEP linear-response window (T2≈20, NOT longer).

## 2026-06-10 - estimator/controller co-design: EP reaches 1.74, beats exact-gradient BPTT

The Holomorphic-EP line (see METHODS §4.3) + controller re-calibration, in causal order:

1. **R7: holomorphic clamp-free nudge** (N=2, r=0.02 Cauchy readout, no g-clamp / no corr-clip, closed-form ∇zCE): best **2.029**, faster wall-clock than plain EP (3.3 vs 2.45 it/s).
   Probe discoveries: the legacy clamps were the dominant estimator error at marginal residuals (cos 0.27→0.89 at res 1.6e-3); N and r are flat ⇒ β-bias/noise was never the binding error; T2 truncation is the remainder (T2=120 → cos 0.985 when the nudged phase is stable; adaptive-T2 early stopping is defeated by non-normal transient growth, still open).
2. **R8: param-EMA alone, a tie (2.031).** Exposed the real late-drift mechanism: λ pinned at jr_max=16 in the tail, i.e. the controller enforcing the OLD validity margin (5e-4) against weights that want res~1e-3, which the new estimator tolerates fine (cos 0.89 @ 1.6e-3).
3. **R9: controller re-calibrated to the new estimator** (target 1.5e-3, λmax 4) + param-EMA: **best 1.7399**, sustained EMA plateau 1.75–1.79, λ quiet at 0.1–0.5, tail ~1.92. Worth 0.29, the single largest improvement of the project. Lesson: **estimator and controller must be co-designed**; upgrading one without re-tuning the other converts the controller into the main source of harm.
4. **14k controls** (same horizon as R9): BP MLP=4 **1.610**, BP MLP=1 1.689. Param-shape correction: the **thick** block's 131k FFN matches **MLP=4**, not MLP=1 (which matches the thin block's 33k Hopfield memory), so thick-EP's like-for-like BP baseline is 1.610. Unregularized **BPTT-14k destabilizes** (res→4.7e-2, val→3.0, best 2.021, worse than its own 3k run at 1.949): the equilibrium architecture NEEDS the contraction control loop for long training; EP carries it out of estimator necessity, bare BPTT got nothing.
5. **The decisive control: BPTT + identical λ-controller + param-EMA, 1.6348**, tail stable. Verdict: most of EP-beats-bare-BPTT was the regularizer; with equipment matched, exact gradients lead EP (R9) by **0.105**, which matches the measured estimator misalignment (cos 0.85–0.90).
   Bonus: the controlled equilibrium block (1.635) beats the MLP=1 BP baseline (1.689): the contraction controller is good for the architecture independent of training rule; EP merely forced its discovery.
6. **Adaptive T1/T2 (R10): EP 1.6755**, which ate 0.064 of the 0.105 rule tax. Adaptive T2 is solved by *hindsight snapshot selection* (judge by increments of the contrast estimate a_t, never by step sizes; non-normal transients can't fool it; dangerous batches self-limit to t≈20–30, stable ones run to ~110 and collect cos 0.987; probe mean 0.871→0.932). Adaptive T1 companion: the λ-controller signal stays sampled at T1=150 (R9 peace preserved), then relaxation refines to res≤1e-4 before nudging; long-T2 gains exist ONLY from tight equilibria (at res~1e-3 long T2 hurts). **Final 14k decomposition: architecture 0.025 (1.635 vs 1.610), EP rule tax 0.041 (1.676 vs 1.635), total 0.066.** EP now beats the MLP=1 BP baseline (1.689). *Multi-seed (3/arm):* EP **1.680 ± 0.005** vs BPTT-ctl **1.639 ± 0.006** ⇒ tax **0.041 ± 0.005 (~9σ)**, confirmed real.
7. Refuted along the way: λ-floor∝lr annealing (R6) reproduces the λ→0 death; post-mortem of that death (R2/R6): **explosion frozen by fp32 absorption masquerading as convergence** (val 60–77 with res≡0: ε·F < ulp(z); uncapped tok/pos/fc/pj blow up). The λ floor is the anti-collapse mechanism, not just a tax.

## Code map

- `~/ept/cet_mvp.py`: CET energy model + EP & TBPTE training (App. B faithful).
- `~/ept/bp_transformer.py`: vanilla BP transformer baseline.
- `~/ept/cet_aep.py`: CETReal (real non-conservative attention) + AEP, VF gradient, damping, clip.
- `~/ept/aep_characterize.py` / `aep_depth.py` / `aep_projected.py` / `aep_contractive2.py`: AEP sweeps.
- `~/ept/aep_option1.py`: projected (constrained) adjoint.
- `/tmp/lt_ep/lt_ep_attention.py`: AEP on the LM's causal attention (gradient quality).
- `/tmp/lt_ep/lt_ep_ffn.py`: EP Hopfield-memory vs FA-FFN (gradient quality).
- `/tmp/lt_ep/lt_ep_train.py`: H2 end-to-end EP vs BPTT training of the equilibrium LM block (`--attn_mode real/energy/mono/thick`, jacreg controller `--jacreg/--jr_floor/--res_target`).
- `/tmp/lt_ep/grad_quality.py`: EP-vs-exact gradient cosine vs residual level (the validity probe).
- `/tmp/lt_ep/solver_wall.py`: plain relax vs Anderson convergence per damping level (thin vs thick).

*(local-transformer working copy in `/tmp/lt_ep`; original `~/local-transformer` untouched.)*

## 2026-06-13 - late-drift diagnosis

Every stable recipe bottoms out mid-run and then val drifts upward (S0 ~0.05, mild; S1/L2 ~0.5, severe). Multi-hypothesis diagnosis (`drift_diag.py`, 13 columns per 200 steps):

- **Embedding-runaway hypothesis rejected**: |emb| rose to 11.9 and was then pushed back to 11.6 by weight decay; self-limiting, not the culprit.
- **Drift is harmless to the reported numbers**: saving the best checkpoint is enough (L2 drifted to 1.70 but reports best 1.214; warm-track reports 1.14).
- **Drift does not hide a lower solution**: after warm-track bottomed out at 1.14 at step 4800, lr 3e-4 drifts and lr 1e-4 freezes (1.15); neither finds anything lower ⇒ 1.14 is the true floor of this recipe, and drift = being jostled on the floor by noisy gradients.
- **Mechanism = late-training steps along bad directions**: a low lr pins the drift (the anti-drift continuation froze at 1.16); S0 cos(EP, BPTT400) falling 0.85→0.3 late in training confirms that the direction degrades.
- **Severity grows with scale**: S0 (C128/T64) ≈0.05, S1 (C256/T256) ≈0.5.
- Still open (S1 warm-start diagnosis + bctl control): is the late cos collapse a degradation of the EP estimator (A, fixable with N=4 / tracking) or horizon ambiguity of a slow-mixing equilibrium (B, shared by EP and BPTT, fix = an architecture that mixes faster)? bctl = cos(BPTT150, BPTT400): if it collapses too, then B.

## 2026-06-13 (cont.) - "why EP far at S1": operating-point + bias/variance diagnosis

Reframe: at the SAME (tight, res 1.5e-3) operating point EP (1.14–1.39) ≥ BPTT (1.52). The apparent "gap" is EP(1.14) vs LOOSE-BPTT(0.97), an operating point (res 1e-2, non-contractive) EP cannot use. Mechanism: T=256 long-range mixing wants non-contractive dynamics; EP needs contraction for the fixed point to exist; contraction suppresses the mixing. Signature: gap grows with T (S0/T64: 0.04; S1/T256: 0.17).

`bias_var.py` (v4b ckpt, 16 batches, EP vs BPTT-400, BPTT-150 control), per group [mean-cos | cos-means]:

- all 0.24|0.36; BPTT-horizon ctrl 0.44|0.57
- attn 0.22|0.33; ffn 0.32|0.49; ln 0.19|0.30; emb 0.31|0.37

Readings: (1) two EXACT gradients (BPTT-150 vs 400) only reach cos-means 0.57: slow mixing makes "the gradient" horizon-ambiguous; a PERFECT EP estimator's ceiling here is ~0.57, NOT 1. (2) Both EP and BPTT improve with averaging (variance present, partly recoverable via navg). (3) The EP-specific residual concentrates in **attention + LN** (EP cos-means 0.33/0.30 vs BPTT 0.57/0.65); **FFN nearly fine (0.49)**, emb on par. ⇒ the AEP-corrected softmax/LN path is the weak link, pointing the fix at qk-norm (bounds logits, conditions the attention Jacobian) + N=4 holo + navg. Chasing cos→1 is the wrong target; matching BPTT's 0.57 is the right one (~0.21 of recoverable EP-specific alignment).

## 2026-06-13 (cont.) - scaling to BPE/50M: stability via small-init residual branches

50M (C=2048) from scratch ABORTED (init res 26; qk-norm+c=2 insufficient, res stuck at 0.38). Fix: **ReZero/Fixup-style small-init of the residual-branch output projections** (WO, pj ×0.1) makes the thick block ≈ -(z-xin)-c·z (near-identity, trivially contractive) at init; training grows them. Result at 15M (C=1024): res **1.1e-8 at step 50** (vs abort). The big-width init-instability wall is a random-init artifact, not fundamental; `--resinit` folds into the standard recipe.

Separately, torch.compile's 14.6× was a SMALL-model (launch-overhead) artifact; at C≥1024 the relaxation is compute-bound (fp32 ~15 TFLOP/s) → ~0.06 it/s at 15M, ~0.03 at 50M. Demo feasibility: 15M solo (~2-4 days); 50M needs multi-GPU data-parallel (per-sample contrast → gradient all-reduce) or the external cards. The fp32 contrast (catastrophic-cancellation-bound, bf16 dead) keeps it compute-heavy; again the "digital expensive / physical free" axis.

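
The effect of small-initializing a residual branch can be seen in a two-dimensional toy. This is a sketch under stated assumptions, not the thick block: the force is `F(z) = -(z - x) - c*z + branch(z)`, and the branch is a hypothetical strongly non-conservative term (a 90-degree rotation applied to `tanh(z)` with gain 20) standing in for an attention/FFN branch whose output projection is scaled at init. All names (`force`, `relax`, `branch_scale`) are illustrative.

```python
import math

def force(z, x, c, branch_scale, gain=20.0):
    """Toy force F(z) = -(z - x) - c*z + branch(z).

    branch(z) = branch_scale * gain * R tanh(z), with R a 90-degree rotation,
    is a purely antisymmetric (non-conservative) coupling.
    """
    z1, z2 = z
    b1 = -branch_scale * gain * math.tanh(z2)
    b2 = branch_scale * gain * math.tanh(z1)
    return (-(z1 - x[0]) - c * z1 + b1,
            -(z2 - x[1]) - c * z2 + b2)

def relax(x, c=1.0, branch_scale=1.0, eps=0.1, steps=300):
    """Plain relaxation z <- z + eps*F(z); returns (z, relative residual)."""
    z = (0.0, 0.0)
    for _ in range(steps):
        f = force(z, x, c, branch_scale)
        z = (z[0] + eps * f[0], z[1] + eps * f[1])
    f = force(z, x, c, branch_scale)
    # Relative residual: size of one more relaxation step compared to |z|.
    res = eps * math.hypot(f[0], f[1]) / (math.hypot(z[0], z[1]) + 1e-12)
    return z, res

x_in = (0.5, -0.3)
_, res_full = relax(x_in, branch_scale=1.0)   # full-strength branch at init
_, res_small = relax(x_in, branch_scale=0.1)  # branch output scaled by 0.1

print(f"full-scale branch : res = {res_full:.2e}")
print(f"0.1-scaled branch : res = {res_small:.2e}")
```

With the full-strength branch the only fixed point of this toy is repelling under the discrete relaxation, so the residual never settles; with the branch scaled by 0.1 the update map is a contraction and the residual falls to numerical zero. Only the qualitative contrast carries over to the real block.
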
## 2026-06-14 - the big-model EP recipe (scaling past S1 needs a stability stack)

Scaling EP from S1 (C=256/T=256/char) to C=1024/T=512/BPE-4096 broke the S1 recipe: stable for ~50 steps, then violent free-phase blowups (res → 0.5–1.4, jr pinned at max); the controller's stability margin shrinks with width/context. CAUTION: a step-50 res≈1e-8 snapshot is NOT proof of stability; must watch through the warmup-end peak lr (~step 500–1000). The stack that fixes it (each necessary):

1. **resinit** (WO,pj ×0.1): near-identity contractive block at init.
2. **qk-norm** (Qwen3 q/k RMSNorm): bounds attention logits / Jacobian (also fixes the attn/LN estimator weak link from the 06-13 bias_var diagnosis).
3. **lr warmup** (≥1000 steps linear): let the λ-controller establish contraction before big steps.
4. **µP-style lr scaling**: lr 1e-3 (C=256) → **4e-4 (C=1024)**; 1e-3 caused a 400-step instability episode, 4e-4 only brief recovered excursions.
5. **higher jr_max** (16 → 32): a bigger non-conservative force needs more penalty headroom.
6. validity gate + fuse retained (catch the residual transient excursions).

Locked 15M demo recipe (running, 16k steps, best-ckpt): C=1024 H=16 T=512, the 6 items above, holo N=2 + adaptive T1/T2, pema 0.999. Smoke val descends 5.86→3.94 BPE-CE by step 1000. Cost: ~0.05 it/s (fp32, compute-bound; compile's 14.6× was a small-model launch-overhead artifact) ⇒ 15M ~3 days solo. 50M (C=2048) is still unstable even with resinit 0.03 + warmup; it needs a dedicated init/lr curriculum AND multi-GPU, and is deferred as a sub-project (not on the demo critical path).

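
Two pieces of this stack, the residual-driven λ controller and the linear lr warmup, can be sketched in a few lines. The sketch assumes the multiplicative `(res/target)^0.3` update with an EMA-smoothed residual quoted in the 2026-06-09 notes, clamped between a floor and `jr_max`; the class, function, and argument names are hypothetical, and the actual controller in `lt_ep_train.py` may differ in detail.

```python
class ResidualController:
    """Sketch of a multiplicative penalty-weight controller driven by the free-phase residual."""

    def __init__(self, target=1.5e-3, floor=0.1, jr_max=32.0, gain=0.3, ema=0.9, lam=1.0):
        self.target, self.floor, self.jr_max = target, floor, jr_max
        self.gain, self.ema, self.lam = gain, ema, lam
        self.res_ema = None

    def update(self, res):
        # Smooth the residual so per-step noise does not compound into lambda thrash.
        if self.res_ema is None:
            self.res_ema = res
        else:
            self.res_ema = self.ema * self.res_ema + (1.0 - self.ema) * res
        # Raise lambda when the smoothed residual is above target, lower it when below.
        self.lam *= (self.res_ema / self.target) ** self.gain
        # The floor guards against the lambda -> 0 collapse; jr_max caps the penalty.
        self.lam = min(self.jr_max, max(self.floor, self.lam))
        return self.lam


def warmup_lr(step, peak_lr=4e-4, warmup_steps=1000):
    """Linear warmup to peak_lr, constant afterwards (lr decay omitted in this sketch)."""
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


ctl = ResidualController()
for _ in range(50):           # residual far above target: lambda climbs to the cap
    ctl.update(1e-2)
lam_high = ctl.lam
for _ in range(500):          # residual far below target: lambda decays to the floor
    ctl.update(1e-5)
lam_low = ctl.lam
print(lam_high, lam_low, warmup_lr(0), warmup_lr(999))
```

The clamp order matters for the failure modes recorded above: the floor is what prevents the λ→0 collapse, while the cap bounds the tax paid during excursions.
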
## 2026-06-16 - C=512 BPE scale: baselines, AsymEP attribution, early-slowness dissection

**C=512 BPE-4096 scoreboard** (TinyStories, T=256, B=24; random ln4096 = 8.318; target band 1.0–1.5):

| run | result |
|---|---|
| BP standard transformer (7.48M, no warmup, lr 3e-3) | **best 1.7921** (20k, DONE) |
| EP thick "orphan" (lr 8e-4, warmup800, resinit0.1, holo2, t2sel40) | **best 2.4037** (20k, DONE; jr pinned at floor 0.1 the ENTIRE run, zero excursions) |
| EP thick "chin" (lr 9e-4, full-Chinchilla 24k target) | ABORT @5401, best 3.2408 |
| BPTT thick (exact grad, same arch) | ABORT @2113, best 3.7331 |

Data notes (NOT verdicts):

- (a) BP-C512 itself = 1.79 sits just *above* the 1–1.5 band (BP needs C≥1024 to enter it); EP-orphan 2.40 is 0.61 above BP-C512.
- (b) **the exact-gradient BPTT twin aborted EARLIEST** (2113, 3.73), even worse than the EP runs, consistent with "training walks off the contractive manifold, exact gradient included"; lr/seed not matched across these, treat as directional.
- (c) abort = the `res>0.1 for 100 consecutive steps` fuse. **lr is the knife-edge**: 8e-4 (orphan) had NO excursions and finished; 9e-4 (chin) had periodic jr→32 / res→0.2–0.5 excursions, most recovered, one didn't make it back inside the 100-step fuse window → abort@5401. Narrow lr-stability margin at C=512.
- (d) the orphan late curve is a plateau: best moved only 2.4557→2.4037 over the last 3600 steps.

**AsymEP attribution (web-verified, the J→Jᵀ correction is NOT ours).** arXiv:2602.03670, Scurria / Vanden Abeele / Mognetti / Massar, "EP for Non-Conservative Systems", names the method **AsymEP**. Their correction `−2 A_J(z*)(z−z*)` is **mathematically identical** to ours (`Jv−Jᵀv = (J−Jᵀ)v = 2 A_J v`). Their scope: Hopfield nets + feedforward MLPs (≤500 hidden) + one CIFAR-10 feedforward CNN; **static inputs; explicit Jacobian construction + decomposition; NO residual/validity controller; NO attention / transformer / sequence model; do NOT combine with holomorphic**.
Best result: CIFAR-10 FF-CNN AsymEP 89.74% vs BP 90.66% (their variational "Dyadic EP" matches BP, p=0.75). Their **"VF" baseline** (the force-form ⟨a,∂F/∂θ⟩ readout; *prior art, not ours*) **collapses without the correction** (CIFAR-10 → 10% = chance; MNIST 64% vs AsymEP 92.7%), matching our measured cos≈0.25 for uncorrected non-conservative attention.

**Ours on this line**: matrix-free jvp−vjp (their explicit Jacobian is infeasible at B·T·C dim); softmax-attention application; the holomorphic combination (the real-linear correction preserves holomorphy); common-mode-tracking + lock-in variants; the validity-threshold/controller/gate stability stack they have none of; and the transformer-LM application + scale. (⇒ corrected EP_DERIVATION.md, which had mislabeled force-form VF as our step.)

## 2026-06-17 (cont.) - full to-spec versions of exp1/4/5/6/7 (first pass was partial)

The 06-17 campaign ran lighter/subset versions; we re-ran the FULL specs. New findings:

**exp1 (FULL): multi-checkpoint cos-evolution (snap 0/200/800/2000 + plateau).** EP·BPTT150 cos: step0 (random, res 2e-2) 0.26 → step200 (res 1e-9) **0.99** → step800 0.96 → step2000 0.89 → plateau (2.40) **−0.05**; BPTT150·BPTT400 = 0.46→**1.00** throughout (true gradient always clean). New **batch self-cos** at the plateau: EP **−0.27** (gradients on different batches anti-correlated) vs BPTT **+0.96**. ⇒ decisive: the EP estimator is faithful while descending and undergoes an **SNR/coherence collapse near the optimum** (the true gradient shrinks below the EP bias floor, which is batch-incoherent); NOT horizon ambiguity, NOT LR. This is THE 2.40-plateau mechanism.

**exp5 (FULL): jr_max sweep (the dropped dimension, and it mattered).** Warm-started from 2.40: **jr_max 8 → 2.3648**, jr_max 4 → 2.376; both BEAT jr_max 32 (2.463) and the orphan (2.404) by ~0.04. jr_max 16 → 2.416. floor/target fill still ≥2.47. ⇒ the jr_max=32 ceiling was a small tax; ~8 is better.
(Earlier "not a controller tax" was incomplete: floor/target don't help, but the ceiling does.) Still far from the 1–1.5 band; the estimator floor dominates.

**exp4 (FULL): 3 arms.** armA (warmup): ~50-step gate dead-zone. armB (`cprewarm 200` = full-lr contraction before task warmup): res 1.7e-2 after the pre-phase, dead-zone only PARTIALLY mitigated (val 5.97 vs 6.07 @100); establishing contraction once isn't enough, and warmup's tiny lr can't deepen it. armC (no-warmup + resinit 0.03): NO dead-zone (cos 0.99 from step 20, best 4.03@300) but a violent excursion (res→0.14@400, recovered). ⇒ small-resinit + no-warmup learns fastest early but is excursion-prone; the cprewarm decoupling is a weak fix.

**exp6 (FULL): branch norms + attention entropy + qk RMS, incl. at plateau.** Branches grow fine (|WO| 1.95→4.91 over training); at the 2.40 plateau they are LARGE (|pj|=44, |fc|=58, attn/xin≈1.0), attn entropy 4.56→3.62 (sharpening), qk_rms pinned at 1.0 (qknorm working). ⇒ no "stuck-small branch"; the trainable-α-gate intervention is NOT indicated.

**exp7 (FULL, after a probe bug fix): EP vs BP influence + per-position CE.** [Bug caught: the EP probe embedded the input with random-init weights before loading the ckpt → garbage (per-pos CE ~7); fixed to embed post-load.] Clean influence ||Δz*_q||/δ vs distance d: **BP** 5.34 (d1) / 0.88 (d16) / 0.15 (d64) / 0.003 (d200); **EP-orphan** 0.24 / 0.071 / 0.019 / 0.006. Both decay with distance. EP is ~20× weaker in ABSOLUTE coupling (scale-caveated: z* vs hidden norm), but NORMALIZED to d=1 EP reaches *farther* than BP (0.30 vs 0.16 at d16; 0.08 vs 0.03 at d64). Per-position CE: EP is uniformly ~0.7–1.0 worse than BP at every position (no long-range-specific failure). ⇒ **does NOT support, and arguably refutes, "contraction cuts long-range mixing"**; EP's deficit is uniform, consistent with the estimator floor (exp1) being the binding constraint. (BP-C512 trained to **1.6953**.)
Honest gap: a stable BPTT-ctl-tight ckpt does not exist (BPTT broke at C512), so the cross-method influence comparison is EP-vs-BP only, not the full tight/loose/BP set.

**Campaign verdict (data; user concludes):** the C512 2.40 plateau is, decisively, an **EP estimator SNR/coherence collapse near the optimum** (exp1: cos→0, self-cos −0.27, k→4000, true gradient clean). It is not LR (exp2), not horizon (exp1), not λ-floor/target (exp5), though jr_max≤8 buys ~0.04 (exp5). The architecture has a SEPARATE wall: exact-gradient BPTT can't stay stable at C512 (exp3). The contractive-mixing hypothesis is NOT supported (exp7: EP's range is comparable to BP, deficit is uniform). ⇒ the EP lever is reducing the estimator bias floor near the optimum (N=4 holo / tracking-AEP / lock-in / navg); the architecture lever is the stability margin for exact gradients.

**Early-slowness dissection (C512 BPE): cos/k probe CONFIRMED; ablations STILL RUNNING.** From existing logs: **EP ≈ BPTT at every early step** (step 200: 5.84 = 5.84; step 800: 3.86 vs 4.05; step 2000: 3.58 vs 3.73), both **~1.2 behind BP** (2.76 @800). ⇒ early slowness is *not* the EP estimator: the exact-gradient twin on the same architecture is equally slow; the gap is the equilibrium architecture + recipe vs a standard transformer, present already at step 200 (pre warmup-end). Probe (`--probe_bptt`: cos(g_EP,g_BPTT) + k=|gEP|/|gBPTT| along the REAL trajectory):

- **Estimator quality tracks res tightly**: res≈1e-9 → cos 0.93–0.99, k≈1.0 (all groups); a res spike to 1.4e-3 (mid-excursion) → cos all 0.86 / attn 0.70 / emb 0.13, k→0.4. ⇒ the earlier kinit *synthetic* probe (cos 0.33 / k 0.42 "at init") was a probe-operating-point artifact; on the real trajectory the EP gradient is faithful (cos≈1, k≈1) **whenever the free phase is converged**, and **k≈1 means no adaptive-k is needed in the converged regime**.
- **warmup × validity-gate early-stall mechanism**: with warmup, early lr is tiny → the contraction homeostat can't pull res under the gate (5e-3) → free-phase res stuck ~1.5e-2 for ~50 steps → the **validity gate skips the nudge** → the reported g_EP is pure jacreg/contraction (cos≈0, k≈11) = *the task is not being learned at all* for ~50–60 steps. res drops < gate by ~step 60–80 → cos jumps to 0.99. No-warmup: res converged (1e-9) by step 20, cos 0.98 from step 20 (val 5.72 vs warmup 6.07 @ step100).
- **CAVEAT — the tradeoff is UNRESOLVED**: the no-warmup arm hit an instability excursion ~step 180 (res→1.4e-3, jr→32, recovering by 200) — i.e. warmup's stability role is real. Whether no-warmup survives to 1200, plus the **resinit (arm B)** and **lr 2e-3 (arm D)** attributions, are STILL RUNNING as of 16:00 — no verdict drawn.

Probe instrumentation: `lt_ep_train.py --probe_bptt N`; launcher `/tmp/lt_ep/early_dissect.sh`.

## 2026-06-17 — C512 diagnostic campaign (7 experiments): the 2.40 plateau is an EP estimator bias-floor; exact-gradient BPTT is LESS stable than EP at C512

Controlled campaign at C512 BPE (orphan ckpt = EP best 2.4037). Raw results + the causal cut each experiment was built for. Scripts: `lr_sweep.py`, `triangulation.py`, `mixing_probe.py`, `lt_ep_train.py --probe_branch/--probe_bptt`; orchestration `master.sh`.

**exp3 — BPTT+controller, the decisive matched cut (GPU0).** Same arch+controller+lr 8e-4 as the EP orphan; only the task gradient differs (exact vs EP). Descended cleanly to best **3.85 @step1400**, then **DESTABILIZED at step 1600** (jr→32, res→0.07) and lodged in a broken basin (val 6.27, res ~0.068 — just under the 0.1 fuse, so no abort) for 8400+ steps. **The exact-gradient twin did WORSE than EP (3.85-stuck vs EP 2.40-completed).** ⇒ at C512 the contraction controller does NOT keep the exact gradient on the manifold; EP's implicit contraction-bias is what kept the orphan alive. Inverts the S0 picture (BPTT+ctl 1.635 < EP 1.676).
Caveat: single lr/seed.

**exp1 — triangulation at the 2.40 plateau (estimator vs horizon).** BPTT150 vs BPTT400 cos = **1.000** (all/attn/ffn; emb 0.986) — true gradient perfectly well-defined, ZERO horizon ambiguity — but EP vs BPTT cos = **−0.045** (orthogonal), k=|gEP|/|gBPTT| ≈ **4000**. ⇒ the plateau is an **EP-specific estimator failure**, not slow-mixing horizon ambiguity.

**exp2 — one-step LR-sweep at the plateau.** BPTT best ΔCE −0.16 (lr 1e-2); EP best −0.04 (lr 1e-4), diverges for lr≥3e-4. cos 0.11, k 486. ⇒ NOT LR-inequivalence; EP's direction caps one-step descent 4× below BPTT at any lr.

**exp1+exp2 synthesis:** the EP estimator has a ~fixed bias/noise floor. k runs **~1 during active training → ~4000 at the plateau** (true gradient shrank below the EP bias). 2.40 is the SNR crossover; below it the true gradient is smaller than the EP bias so EP can't see it. Not LR, not horizon, not λ. ⇒ the EP lever is **reducing the estimator floor near the optimum** (N=4 holo / tracking-AEP / lock-in / navg averaging), distinct from the architecture's stability lever.

**exp5 — λ floor/target grid (warm-start from orphan).** base 0.1/1.5e-3 → 2.463; 0.05/3e-3 → 2.491; 0.03/3e-3 → 2.485; 0.1/5e-3 → 2.488 — ALL worse than orphan 2.40 (slight late-drift up). ⇒ relaxing the controller does NOT break the plateau — not a λ-floor tax. (floor 0.03 reached jr=0 without collapse over 1200 warm-started steps.)

**exp6 — branch-growth (resinit vs no-resinit).** resinit 0.1: branches grow gradually & stably (|WO| 2.35→4.91, |pj| 3.23→6.57 over 1000 steps; res ~1e-4; best 3.71). no-resinit: branches start large (|WO|~20, |pj|~20), excursion at step 800 (res→0.23, jr→32; recovered, |WO|→14.7; best 4.04). ⇒ resinit's gradual growth is stabilizing; branches do grow (not stuck-small).

**exp7 — mixing/influence length.** ||Δz_q*||/δ FLAT across distance (~0.007 at d=1 and d=200) and weak; loose-target (lam_t5) similar (~0.005).
⇒ no distance-decay signature; INCONCLUSIVE on the contraction-cuts-mixing hypothesis (no non-contractive/BP reference; coupling uniformly weak).

**exp4 — warmup/gate (probes done).** no-warmup best 3.56@1200 (survived its step-180 excursion); warmup 4.11@800. warmup×validity-gate early-stall confirmed (warmup keeps res>gate ~50 steps → nudge skipped → no task learning early).

**Net (data; user to conclude):** the 1–1.5 band is blocked at C512 by TWO distinct walls — (1) EP's estimator bias-floor caps EP at 2.40 (cos→0, k→4000 at the plateau; true gradient is clean), and (2) the equilibrium architecture+controller can't keep the EXACT gradient stable at C512 (BPTT broke at step 1600). EP's contraction-bias makes it the more robust of the two here. The contractive-conflict hypothesis is supported on the BPTT side; the EP plateau is specifically an estimator-SNR-floor effect.

## 2026-06-17 (round 2) — the plateau bias floor IS the frozen AEP linearization; tracking-AEP fixes it (gradient level)

Estimator ablation at the 2.40 plateau ckpt (`plateau_ablation.py`, vs BPTT400, n_grad=4):

| estimator | cos | k | batch self-cos |
|---|---|---|---|
| N2 frozen-AEP (current) | −0.045 | 4133 | −0.27 (incoherent) |
| N4 / N8 | −0.03 / −0.01 | 31 / 29 | ~0 |
| N2 r=.05/.10/.20 | −0.04 | ~4200 | −0.27 |
| N2 fixed-T2 20/80/120 | nan (diverges w/o snapshot selection) | — | — |
| **track_N2 (common-mode AEP)** | **0.997** | **0.9** | **+0.95 (coherent)** |

Diagnostic at z*: **|Jv−Jᵀv|/|Jv| = 1.37** (J highly non-normal). ⇒ the bias floor IS the frozen-at-z* AEP linearization.
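
The three quantities reported in this ablation (cos, k, and the Jv vs Jᵀv diagnostic) can be written down in a few lines. The sketch below is a toy on plain Python lists and small explicit matrices; the function names are illustrative and not taken from `plateau_ablation.py`, and the real probes presumably operate matrix-free on flattened parameter gradients and Jacobian-vector products.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(g_est, g_ref):
    """cos(g_est, g_ref): direction agreement between two flattened gradients."""
    return dot(g_est, g_ref) / (norm(g_est) * norm(g_ref))

def magnitude_ratio(g_est, g_ref):
    """k = |g_est| / |g_ref|; 1.0 means the estimator has the right scale."""
    return norm(g_est) / norm(g_ref)

def matvec(J, v):
    return [dot(row, v) for row in J]

def asymmetry_ratio(J, v):
    """|Jv - J^T v| / |Jv| for an explicit square J (toy stand-in for JVP vs VJP)."""
    Jv = matvec(J, v)
    JT = [list(col) for col in zip(*J)]
    JTv = matvec(JT, v)
    return norm([a - b for a, b in zip(Jv, JTv)]) / norm(Jv)

# Toy gradients: same direction, estimator twice as large.
g_ref = [1.0, 2.0, 2.0]
g_est = [2.0, 4.0, 4.0]
print("cos =", cosine(g_est, g_ref), " k =", magnitude_ratio(g_est, g_ref))

# Symmetric (conservative) J versus a pure rotation generator.
J_sym = [[2.0, 1.0], [1.0, 3.0]]
J_rot = [[0.0, -1.0], [1.0, 0.0]]
print("symmetric J:", asymmetry_ratio(J_sym, [1.0, -2.0]))
print("rotation J: ", asymmetry_ratio(J_rot, [3.0, 4.0]))
```

The ratio is zero when J is symmetric along the probed direction and grows with the antisymmetric part, so it is best read as a measure of how far the dynamics are from conservative for that particular v.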
N=4/8 fix only the magnitude (k 4000→30), not the direction; r and T2 do nothing; **tracking-AEP (re-linearize at the moving common mode z̄=½(z₊+z₋)) restores cos −0.045→0.997, coherence −0.27→+0.95, magnitude 4133×→0.9×.**

**exp D (per-group):** collapse is UNIFORM across WQ/WK/WV/WO/fc/pj/ln1/ln2/tok/pos (all cos −0.03..−0.08, k 3k–18k) — the corrupted contrast `a` poisons every VF gradient equally (single shared cause, not attention/LN-specific unlike the S1 bias_var diagnosis).

**exp B (navg):** cos(EP,BPTT) over navg = 0.37→0.40→0.55→−0.30→−0.21 (1/2/4/8/16); self-cos rises (0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8. ⇒ **deterministic bias, not variance** — averaging converges to the wrong (biased) direction, so navg/restart-averaging will NOT fix it.

### 2026-06-17 (round 2b) — free-phase ε is mistuned ~3× (monDEQ α-analysis analog); Anderson untapped

Free-phase convergence probe at the 2.40 ckpt (`solver_probe.py`, res_F=‖F(z)‖/‖z‖, eps-independent, B=2, 4 batches). Steps to res_F≤1e-4 / ≤1e-6, plain Euler: ε0.05→470/844, **ε0.1 (ours)→236/548**, ε0.2→121/274, **ε0.3→85/182** (~2.8× faster than 0.1, same fixed point, stable), ε0.5→stalls at res~0.9 (non-convergent — the stable ceiling is ~0.3–0.4). Anderson(m=5): ε0.1→85/197, **ε0.2→58/118 (~4× vs our current 0.1)**, robust across ε. ⇒ **ε=0.1 is ~3× too small**; ε≈0.3 or Anderson@ε0.2 cut free-phase force-evals ~3–4× (and let T1≈100 always clear res≤1e-4, dropping adaptive-T1) → directly attacks the dominant EP compute cost AND tightens res (better estimator). This IS the "preconditioned/Anderson relaxation" open item, now quantified. Caveats: free phase at trained point / B=2 (optimal ε may shift early-training / slow-mixing); the **nudged phase shares ε but has AEP+holo dynamics — bigger ε there needs a separate stability check** before changing.
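
The shape of that ε sweep (faster convergence as ε grows, then a hard stability ceiling) can be reproduced on a two-dimensional toy. The sketch below is not `solver_probe.py`: it assumes a hypothetical linear force F(z) = Az + b with eigenvalues −1 ± 2i, uses plain Euler only (no Anderson), and its step counts say nothing about the real operator.

```python
import math

# Hypothetical non-conservative toy force F(z) = A z + b, eigenvalues -1 +/- 2i.
A = [[-1.0, -2.0], [2.0, -1.0]]
b = [1.0, 0.0]

def force(z):
    return [A[i][0] * z[0] + A[i][1] * z[1] + b[i] for i in range(2)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def steps_to_tol(eps, tol=1e-6, max_steps=5000):
    """Plain Euler z <- z + eps*F(z); steps until |F(z)|/|z| <= tol, else None."""
    z = [1.0, 1.0]
    for step in range(1, max_steps + 1):
        f = force(z)
        z = [zi + eps * fi for zi, fi in zip(z, f)]
        nz = norm(z)
        if nz > 1e6:  # iterate blew up: eps is past the stable ceiling
            return None
        if norm(force(z)) / nz <= tol:
            return step
    return None

for eps in (0.05, 0.1, 0.2, 0.3, 0.5):
    print(f"eps={eps}: steps={steps_to_tol(eps)}")
```

For this toy the per-step contraction factor is sqrt((1−ε)² + 4ε²), which is smallest at ε = 0.2 and exceeds 1 above ε = 0.4, so the sweep has both an optimum and a ceiling even though every eigenvalue has negative real part.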
This probe confirms that the monotone-operator α-sensitivity (non-normal dynamics, |Jv−Jᵀv|/|Jv|=1.37) carries over to our hand-set Euler step.

**Decision:** the EP lever collapses to **common-mode tracking-AEP** (not N=4 / lock-in / navg). Caveat: validated at the gradient level at one ckpt; the one-step ΔCE is magnitude-confounded, so the NECESSARY next test is **training** — does EP with `--track` descend below 2.40 (`holo_a_track` already wired)? exp C (BPTT+controller rescue, lr 2e-4/4e-4/6e-4 × jr_max 16/32) runs separately for the architecture-ceiling question.

## 2026-06-18 — tracking-AEP VALIDATED IN TRAINING: it breaks the 2.40 plateau

Warm-track test (warm-start from the 2.40 orphan ckpt, jr_max 8, warmup 0, 2500 steps):

- **TRACK-warm (common-mode tracking-AEP): 2.40 → best 2.1628 @2500, still descending**, stable (res ~6e-5, jr at floor). ⇒ tracking-AEP descends PAST the 2.40 plateau — the round-2 estimator-floor fix works in training, exactly as the fixed-ckpt probe predicted (cos −0.05→0.997).
- **STD-warm control (standard estimator, same recipe): ABORTED @step 397** (blew up, res→0.21).
- **TRACK-fresh (tracking from scratch): ABORTED @968** (blew from random init).

⇒ under the identical aggressive recipe (jr_max 8), tracking-AEP is BOTH stable AND descends, while the standard estimator blows up. But tracking-AEP is **not a from-scratch drop-in** (fresh blows up) — the recipe is **two-phase warm-track**: standard estimator to ~2.40 (where it's stable), then switch to common-mode tracking-AEP to descend below the floor (matches the S1 warm-track champion). Caveat: std-control's abort is partly the aggressive jr_max 8 (orphan used jr_max 32, stable at 2.40); the clean claim is "at jr_max 8 tracking survives+descends where std blows". The 2.16 ate ~0.24 of the 0.61 EP-vs-BP gap and was still descending → continuing (`trkcont` from 2.16) to find the ceiling.
**exp C — architecture ceiling (BPTT+controller rescue, 2500 steps):** stable exact-gradient exists (lr≤6e-4 never breaks, unlike lr8e-4 which broke @1600). Best: lr6e-4 jr16 → 3.08, still descending. At equal step (2500): **BP 2.34 ▸ BPTT 3.08 ▸ EP-orphan 3.55** ⇒ the ~10× proportional slowdown (confirmed: median 9.4× in the descending regime) splits into an lr-stability-capped architecture handicap (~0.74, BP→BPTT; the equilibrium block can't take BP's 3e-3 lr) AND an EP-estimator handicap (~0.47, BPTT→EP). jr_max 16 > 32 confirmed again. Long ceiling run (`bptt_ceiling`, lr6e-4 jr16, 12k) pending.

**Found + staged a real code inconsistency:** free phase / model / readout use erf-GELU (F.gelu) while the holo nudge uses tanh-GELU (cgelu) → z* isn't the nudge-force's fixed point → spurious common-mode drift in the contrast every step (which tracking-AEP partly absorbs — possibly part of why it wins). Fix staged as `--gelu tanh` (additive, default erf). Separately: free-phase ε=0.1 is ~3× too small (ε0.3 or Anderson → ~3–4× fewer relaxation steps; `solver_probe`).

---

## 2026-06-24 — below-CE-2.1 divergence: DIAGNOSED (genuine Hopf) + FIXED (resreg/jacreg = finite-T1 LE control)

Full write-up: **`SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md`**. This pins down the forward-stability wall that sat under TL;DR #3 ("no loss ceiling, but a validity threshold that costs reg + steps") — the threshold is now mechanistically a **Hopf bifurcation**, and the "regularization tax" is specifically **LE control**.

**The mechanism (3 independent confirmations).** Below CE~2.1 the relaxation diverges because the learned **non-conservative attention** operator undergoes a **genuine continuous-time Hopf bifurcation** — a complex eigenvalue pair crosses **Re μ > 0** as EP makes attention expressive.
Matrix-free FD-JVP Arnoldi on M=I+εJ (`ep_run/eig_probe.py`): leading continuous μ=(λ−1)/ε goes **−0.024 (s2000, CE 3.13, STABLE) → +0.44±1.37j (s3200, 2.74) → +1.35±2.08j (ep_eps05, 2.41)** — Re μ and |Im μ| both grow as CE drops. Confirmed by (a) eval limit-cycle + attention-knockout (α=0.2 converges), (b) Anderson can't reach a root at s3200 (near-root Re μ=+0.24 rotating). **Earlier "Euler artifact" read (and the cont.6 forward-mode/RTRL framing) are SUPERSEDED** — fugu caught that the ε-sweep "convergence" was the *step* residual ε·g, not g→0; an intermediate misread. (The Euler over-amplification is a real *second* layer — ε relocates the wall 2.74→2.41 — but the root cause is Re μ>0, which no ε closes.)

**Why equilibrium-EP hits this and BPTT doesn't (structural).** Equilibrium-EP optimizes L(z*); its gradient is blind to *finite-time contraction* (a param eroding ρ→1 with z* fixed has zero equilibrium gradient). BPTT differentiates the T1 unroll, so a non-converged z_T1 → bad output → penalized → it implicitly defends contraction. So EP needs that defense added back explicitly.

**The fix — control the finite-T1 Lyapunov exponent (LE).** Stability ⟺ ρ(M)<1 ⟺ finite-T1 LE<0. Two ~orthogonal handles, they stack:

- **resreg** (`lt_ep_train.py:220-231`) = penalize the T1-residual ‖εF(z_T1)‖ ~ ρ^T1 = exp(T1·LE) → a **direct LE penalty** (also catches the non-normal transients eigenvalues miss). **The PROVEN below-2.1 stabilizer.**
- **jacreg** (`:211-219`) = penalize ‖J_nc·v‖ → shrinks |Im μ|, pushes the pair to Re μ<0 (cause-side). `ep_run/eig_jacreg.py` confirmed at the mechanism level: at the same CE~2.74, frozen-jacreg = Re μ=+0.45 rotating g_floor 0.26, adaptive-jacreg = **Re μ=−0.23 real, g_floor 1e-4** (Hopf killed, true fixed point restored, AsymEP validity restored).

**★ The 2.09 recipe = FROM SCRATCH + resreg=0.2 + FROZEN jr** (NOT adaptive jacreg — that was a session-long detour; cmd at `EP_BELOW210_DIAGNOSIS_FIX.md:97-101`).
Original ep_resreg2 reached **2.0573** (lowest EP ever, lost to /tmp wipe; rebuild 2.22). This session: ep_resreg_warm (resreg eager, warm from stable s2000) descending smoothly through 2.30 (peak res 1.6e-2, no spikes) — the clean 2.09 test; ep_rr_scratch (from-scratch, proven recipe) launched as the robust confirmation. **Warm-start ONLY from a STABLE operator (s2000, Re μ<0); resreg/jacreg PREVENT instability, they can't RESCUE an already-blown one** (warming from ep_eps05@2.41 blew).

**Infra (#14).** `--compile` EXONERATED + SAFE (compiles the free-phase `tforce`; numerically identical to eager — z150 rel-diff 1.6e-7, pure fp32; ~1.43x, ~3.3x with t2sel40). **`--tf32` UNSAFE — do NOT use** (10-bit mantissa ≈ 1e-3 perturbation; the relaxation is hyper-sensitive — ε 0.1→0.05 alone moved the wall 0.33 — so TF32 is too coarse). EP parallelism for the deep/scaling phase: no sequential backward, coupled equilibrium stack (#13) → depth-parallel, adaptive-T1 (residual-stopped) cleaner than adaptive-ε.

**Lit anchor.** Hopfield-ResNet (arXiv 2509.26003) EP-trained 12 layers — but it's **conservative** (energy Φ, symmetric weights, no oscillation by construction). Confirms non-conservativity is our culprit; **we are the first to EP-train non-conservative attention** (which has the Hopf), solved via resreg/jacreg.

---

## 2026-07-03 — ω/norm family REFUTED as stability signal; fingerprint story RETRACTED; eigreg v2 (true map-eig) + the "one fundamental reg" verdict

**Trigger.** Fable judge-pass on the reg causal chain (`‖J‖_F ⟹ ω ⟹ α → res`) caught that `eig_control.num_abscissa` used *plain* power iteration on the indefinite Sym(J_nc) — which converges to the largest-|λ| end (or a mixture), not λmax. Fixed with shifted PI; then audited everything with gold-standard Lanczos (`ep_run/eig_recheck.py`, scipy eigsh LA/SA, matvec-only, B=6 z_T1 states, same seeded batches).
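
The plain-vs-shifted distinction is easy to reproduce on a 2×2 indefinite symmetric matrix. The sketch below is a toy, not the code in `eig_control`: the matrix, the helper names, and the choice of the Frobenius norm as the shift are all assumptions made for illustration.

```python
import math

def matvec(S, v):
    return [sum(S[i][j] * v[j] for j in range(len(v))) for i in range(len(S))]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def rayleigh(S, v):
    Sv = matvec(S, v)
    return sum(a * c for a, c in zip(v, Sv)) / sum(a * a for a in v)

def power_iteration(S, shift=0.0, iters=200):
    """Power iteration on S + shift*I; returns the Rayleigh quotient w.r.t. S."""
    v = normalize([1.0] * len(S))
    for _ in range(iters):
        Sv = matvec(S, v)
        v = normalize([a + shift * c for a, c in zip(Sv, v)])
    return rayleigh(S, v)

# Indefinite symmetric toy: eigenvalues are (-10 +/- sqrt(208)) / 2,
# i.e. about +2.21 and -12.21.
S = [[1.0, 4.0], [4.0, -11.0]]

# The Frobenius norm bounds every |lambda|, so S + shift*I has no negative
# eigenvalues and its dominant eigenvector is the lambda_max one.
shift = math.sqrt(sum(x * x for row in S for x in row))

plain = power_iteration(S)             # lands on the largest-|lambda| end
shifted = power_iteration(S, shift)    # lands on lambda_max
print("plain PI:  ", plain)
print("shifted PI:", shifted)
```

Here the negative end dominates in magnitude, so plain iteration reports the most negative eigenvalue with full confidence. When the two ends have nearly equal magnitude, plain iteration instead converges very slowly and its Rayleigh quotient is a mixture whose sign depends on the start vector, which is the failure described above.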
**Retraction.** The `--fingerprint` story "warm s2000 = deeply contractive (num_abscissa −10.14) vs from-scratch plateau = near-boundary (+1.11); training drifts the operator toward the boundary" is **withdrawn**: both numbers were PI-mixture artifacts (sign set by which spectrum end slightly dominates). The audit table:

| operator | λmax(Sym) | λmin(Sym) | oldPI (bug) | res(abs,150) | val |
|---|---|---|---|---|---|
| rand-init | +5.02 | −4.98 | −0.16 | 13.6 | 8.49 |
| s2000 (warm src) | **+12.84** | −15.62 | −6.43 | 2.65 | 3.10 |
| resreg-scratch | +10.64 | −9.85 | +1.40 | 6.24 | 2.24 |
| fast-adaptive | +7.89 | −7.46 | +2.41 | 1.69 | 2.18 |
| warm-fast (live) | +8.47 | −8.21 | −0.08 | 2.71 | 2.34 |
| self-restart (live) | +10.67 | −10.62 | +0.10 | 1.01 | 2.39 |

**ω is dead here**: +5..+13 on every operator, stable and unstable alike; the *stablest* op (s2000, true α≈−0.02 per eig_probe) has the *largest* ω. Non-normality gap ω−α ≈ 10, and ω anti-correlates with stability (rand +5 → s2000 +12.8). An ω-margin penalty would fire permanently → over-constraint ≫ jacreg. Corollary: **the whole norm family (jacreg's ‖J‖_F, log-norm, ω) measures the wrong axis** on this operator; the frozen jr=0.1 floor in the ★2.09 recipe is a weak generic contraction bias, not spectral control. `num_abscissa`/`eig_penalty` demoted to diagnostics.

**eigreg v2 = `eig_control.spec_penalty`**: soft one-sided penalty on the TRUE leading map eigenvalue |λ|(M), M = I+εJ_F, via warm-started 2-D Rayleigh-Ritz (captures the complex Hopf pair; matvec-only; aep-dynamics 'spectral' ported, applied softly). Wired to `--eigreg` / `--eig_margin 0.995`; `--fingerprint` now reports ρ/Re μ instead of ω. Smoke (`eig_v2_smoke.py`): mechanics work (fires, 11 finite grads at rand-init), BUT precision ±0.01..0.04 vs ARPACK (cluster under-resolution at s2000, field-of-values overshoot at rand-init on the non-normal M).
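
The relation between the Euler-map eigenvalue λ of M = I+εJ and the continuous-time μ, together with one possible shape for a soft one-sided penalty, is sketched below. This is not `spec_penalty`: the hinge-squared form is a hypothetical choice, ε = 0.1 is assumed, and only the 0.995 margin and the two μ readings come from these notes.

```python
def lam_from_mu(mu, eps):
    """Euler-map eigenvalue lambda = 1 + eps*mu for a continuous-time eigenvalue mu."""
    return 1.0 + eps * mu

def mu_from_lam(lam, eps):
    """Inverse map mu = (lambda - 1) / eps."""
    return (lam - 1.0) / eps

def one_sided_penalty(rho, margin=0.995):
    """Hypothetical hinge-squared penalty: zero while rho <= margin, smooth above."""
    excess = max(0.0, rho - margin)
    return excess * excess

eps = 0.1  # assumed Euler step

# Map the two eig_probe readings quoted earlier onto the discrete map.
for mu in (complex(-0.024, 0.0), complex(0.44, 1.37)):
    lam = lam_from_mu(mu, eps)
    print(f"mu={mu}: lambda={lam}, |lambda|={abs(lam):.5f}")

# The penalty is silent below the margin and grows smoothly above it.
for rho in (0.990, 0.995, 1.010, 1.050):
    print(f"rho={rho}: penalty={one_sided_penalty(rho):.2e}")
```

Because |λ|² = (1 + ε·Re μ)² + (ε·Im μ)², a penalty on |λ| responds to both the real part and the rotation rate of the pair, which is one reason the step size ε and the margin interact.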
**The decisive twist (`eig_v2_depth.py`, s2000, same batch, ARPACK at 3 depths):**

| relax depth | res(abs) | top-3 \|λ\|(M) |
|---|---|---|
| 150 | 2.58 | 1.00880, 1.00615, 1.00509 |
| 400 | 4.6e-2 | 0.99798, 0.99788, 0.99753 |
| 800 | 3.3e-3 | 0.99942, 0.99804, 0.99698 |

(μ=(λ−1)/ε ⇒ −0.02..−0.006 at depth — matches eig_probe's −0.024.) So: (a) the >1 readings at z_T1=150 are **unconverged-state contamination** — spec_penalty applied at the training state would punish healthy operators; use it in *measure mode*: deep-relaxed state (400+), small sub-batch, every K steps. (b) The fixed-point spectrum's top is not an isolated pair but a **band of slow modes hugging the boundary from below** — a rank-2 scalpel cannot compress a band; only broadband control (resreg) can. The Hopf failure mode is a complex pair **detaching from the band** (eig_probe: +0.44±1.37j in blown runs) — *that* is the scalpel's target.

**Verdict on "which reg is fundamental" (the resreg+jacreg redundancy question):**

1. **Fundamental quantity = finite-horizon path growth (finite-T1 LE) — resreg's axis.** Broadband (handles the band), path-true (no state-contamination issue), and exactly the quantity EP-estimator validity needs. resreg's known weakness is its *shape*, not its axis: ρ^T1 is a cliff (silent until ρ≈1, then explosive — the live jr/res oscillation in ep_warm_fast is that cliff ringing).
2. **The "one reg" candidate: de-cliff resreg → a graded finite-horizon LE penalty (aep 'floss' ported, "floss-ept")** — penalize the measured per-step contraction rate over the last q relax steps (differentiable through q≈8–16 steps on a sub-batch), giving a linear early signal of the same fundamental quantity. resreg = its cliff-shaped special case; spec_penalty = its rank-1 approximation.
3. **spec_penalty stays as the surgical scalpel** for a detached Hopf pair (measure-mode application), **jacreg retires** from stability duty.
4. Three-arm from-scratch ablation when GPU frees: floss-ept-only vs resreg-only (never cleanly run — ROUND2 only had specnorm+jacreg0 / queued no-penalty) vs proven pair (resreg 0.2 + frozen jr 0.1) — logs res + measure-mode ρ + val.
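
The contrast in items 1 and 2 between a cliff-shaped terminal residual and a graded per-step rate can be seen on synthetic residual traces. The sketch below only covers the measurement side and is not floss-ept itself: the function names and the target rate are hypothetical, the traces are pure geometric decay, and nothing here is differentiated through the relax steps.

```python
import math

def finite_horizon_le(residuals, q):
    """Mean per-step log growth of the residual over the last q relax steps.

    Negative means the relaxation is still contracting; near zero means stalled.
    """
    r_old, r_new = residuals[-q - 1], residuals[-1]
    return math.log(r_new / r_old) / q

def graded_penalty(le, target=-0.02):
    """Hypothetical hinge: linear in how far the measured rate sits above a target."""
    return max(0.0, le - target)

def terminal_residual(rho, T1, r0=1.0):
    """Cliff-shaped proxy: residual left after T1 steps of uniform contraction rho."""
    return r0 * rho ** T1

# Synthetic traces with a known per-step contraction factor rho.
T1, q = 150, 8
for rho in (0.95, 0.99, 0.999):
    trace = [rho ** t for t in range(T1 + 1)]
    le = finite_horizon_le(trace, q)
    print(f"rho={rho}: LE={le:+.4f}  graded={graded_penalty(le):.4f}  "
          f"res@T1={terminal_residual(rho, T1):.2e}")
```

On these traces the terminal residual stays tiny until ρ is close to 1 and then rises quickly, while the measured rate moves roughly linearly in 1−ρ, which is the early-signal property the graded penalty is meant to provide.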