# EP below-CE-2.1 divergence: complete diagnosis dossier (2026-06-22, CORRECTED)

## Setup

Equilibrium transformer block: fixed point of a damped relaxation `z ← z + ε·F(z)`, ε=0.1, `F(z) = x_in − (1+c)z + Attn(LN z) + FFN(LN z)`, c=1.

- Attention is **non-conservative** (independent WQ, WK, WV, WO; qknorm applies RMSNorm to q and k).
- Untied 4× GELU FFN.
- Trained **backprop-free** with **AsymEP** (Scurria 2602.03670: nudged dynamics get `−2A_J(x⁰)(x−x⁰)`, making the nudged Jacobian equal to Jᵀ at the free equilibrium).
- Code: `lt_ep_train.py` (`force`/`tforce`:81-106, `relax`:123, `ep_step`:140), `holo_ep.py` (holomorphic estimator).
- Eval/BPTT use the T1=150 relaxed state.

**SYMPTOM:** EP descends, then **suddenly** diverges below CE≈2.1 (e.g. val 2.74 → 41 within ~100 steps, T1-residual 2.5e-2 → 0.42). Exact **BPTT on the identical model trains cleanly to CE 1.83.**

## CORRECTED diagnosis (measured this round; supersedes earlier framings)

**Fact 1: it is a forward LIMIT CYCLE; there is no fixed point at the diverging operator.** `eval_relax` on the marginal ckpt redx **s3200** (val 2.74, just before the blowup): relax from the embedding for **6000 steps** → `res(t): 50→3.6e-2, 150→2.3e-2, 500→2.5e-2, 1000→2.5e-2, 2000→2.3e-2, 4000→2.3e-2, 6000→2.4e-2`, tail (last 1000) min 2.08e-2 / max 2.73e-2, **non-monotone**. It **floors at ~2.3e-2 and oscillates; it never reaches a fixed point.** (Reproduces an earlier lost-run finding: limit cycle, FTLE<0.)

**Fact 2: the cycle is driven by the non-conservative ATTENTION.** Knockout: scale the attention output (`WO ×= α`), `eval_relax` for 3000 steps:

```
α=1.0: res-floor 2.5e-2, osc 6.0e-3   CYCLE
α=0.7: res-floor 1.6e-2, osc 3.0e-3   CYCLE (smaller)
α=0.4: res-floor 4.1e-3, osc 5.3e-4   nearly gone
α=0.2: res-floor 3.2e-4                CONVERGED (true fixed point restored)
α=0.0: res-floor 1.3e-3, osc 1.2e-3   tiny FFN-only cycle
```

Reducing the attention from α=1.0 to α=0.2 monotonically shrinks the cycle and restores convergence; with attention removed entirely (α=0.0) only a tiny FFN-only cycle remains.
**The attention's non-conservativity drives the limit cycle.**

**Fact 3 (hypothesis): a Hopf-type bifurcation.** A relaxation `z←z+εF(z)` (map `M=I+εJ`) can only *oscillate* (limit cycle) if `M` has a **complex eigenvalue pair crossing |λ|=1**. A symmetric/conservative J has real eigenvalues → monotone convergence or blow-up, never a cycle. As EP training grows the attention asymmetry/gain, a complex pair crosses → Hopf → limit cycle → readout of a cycle-point degrades → CE explodes.

## RETRACTED framings (do not anchor on these)

- codex's "(b) structural: equilibrium gradient L(z\*) blind to contraction → forward-mode/RTRL fix" **assumed a fixed point z\* exists.** It does not at the diverging operator (limit cycle). The scalar counterexample (param changing convergence rate but not z\*) is moot when z\* doesn't exist.
- "ρ drifts to 0.998 / slow convergence" was a **transient artifact** of a ρ-probe window (it caught the initial 3.6e-2→2.3e-2 decay and missed the floor+oscillation).

## Still-valid facts (about the GRADIENT estimator; a separate axis from the forward cycle)

- BPTT (exact grad) → CE 1.83, converges; its trajectory does NOT drive the attention into the cycling regime.
- AsymEP gradient is accurate WHEN a converged fixed point exists: cos(g_EP, exact-adjoint)=0.99 at hr=0.2, res 1e-9; the "0.94 ceiling" was nudged/adjoint-phase truncation (cos rises 0.74→0.998 as nudge-depth t2sel 10→160). That is, the estimator is fine *given a fixed point*, but at the diverging state there is none.

## AEP paper (Scurria 2602.03670) context

- AsymEP is exact AT the **stationary state** (needs convergence). Appendix G.3 explicitly treats the **stability of non-conservative dynamics** (they can **oscillate**), controlled by the **asymmetry ratio r_str** (Eq 37-38: `J = γ(√(1−r_str²)·S̃/‖S̃‖ + r_str·Ã/‖Ã‖)`), the **gain γ**, and a conservative init `Var[J]∝1/N`. "AsymEP reduces oscillations."
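The Fact 3 hypothesis can be made concrete on a toy problem. The sketch below is an illustration only, not the model's Jacobian: it assumes a hypothetical 2×2 matrix `J = [[-s, ω], [-ω, -s]]`, whose symmetric part is `−s·I` and whose antisymmetric strength is `ω`. Its eigenvalues are `−s ± iω`, so the relaxation map `M = I + εJ` has eigenvalues `1 − εs ± iεω`. The names `euler_map_eigs`, `spectral_radius`, and `critical_eps` are invented for this example.

```python
def euler_map_eigs(s, omega, eps):
    """Eigenvalues of M = I + eps*J for the toy J = [[-s, omega], [-omega, -s]].

    J has eigenvalues -s +/- i*omega, hence M has 1 - eps*s +/- i*eps*omega.
    """
    lam = complex(1.0 - eps * s, eps * omega)
    return lam, lam.conjugate()


def spectral_radius(s, omega, eps):
    """|lambda| of the (conjugate) eigenvalue pair of M."""
    return abs(euler_map_eigs(s, omega, eps)[0])


def critical_eps(s, omega):
    """Largest step size with |lambda| < 1, valid for s > 0.

    |lambda|^2 = 1 - 2*eps*s + eps^2*(s^2 + omega^2) < 1
    is equivalent to eps < 2*s / (s^2 + omega^2).
    """
    return 2.0 * s / (s * s + omega * omega)


if __name__ == "__main__":
    s = 2.0  # toy damping, loosely mimicking the -(1+c)z term with c=1
    for omega in (0.0, 3.0, 7.0):
        for eps in (0.1, 0.05):
            rho = spectral_radius(s, omega, eps)
            state = "unstable" if rho > 1.0 else "stable"
            print(f"omega={omega:4.1f} eps={eps:4.2f} |lambda|={rho:.4f} {state}")
    print("critical eps at omega=7:", critical_eps(s, 7.0))
```

In this toy the continuous-time flow is always stable (real part `−s < 0`), yet the Euler map becomes unstable once `ω` is large enough at fixed ε, and shrinking ε restores stability. A genuine continuous-time Hopf (real part of a pair of J's eigenvalues crossing zero) would not be cured by a smaller ε. The toy is linear, so it shows only the eigenvalue crossing; a bounded limit cycle additionally needs a saturating nonlinearity.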
---

## Q1 (THIS query): CONFIRM THE MECHANISM

Is the divergence a **non-conservative Hopf bifurcation**: the attention's antisymmetric part A drives a **complex conjugate eigenvalue pair of the relaxation map M=I+εJ across |λ|=1**, producing the forward limit cycle (Facts 1-3)?

1. Is the evidence (limit cycle in Fact 1 + the attention-scaling knockout in Fact 2) **conclusive** for a Hopf bifurcation? If not, what is the gap, and which alternative (e.g. a real-eigenvalue saddle-node, a 2-cycle from the discrete Euler step εF, an FFN contribution) is not yet excluded?
2. What is the **single cleanest measurement** to nail it? For example: compute the eigenvalues of `M=I+εJ` at s3200 (is there a complex pair with |λ|≥1, or a real λ≥1?); a Floquet/period analysis of the cycle; or an ε-sweep (does shrinking ε convert the cycle to convergence, which would distinguish a continuous-time from a discrete-Euler instability?).
3. Verify the mechanism against the actual `lt_ep_train.py` force/relax code.

## Q2: THE FIX

Given Q1 (a Hopf bifurcation from the attention's non-conservativity), what is the best way to keep the operator **below** the bifurcation (so a fixed point exists and AsymEP is valid) while preserving as much attention expressivity as possible? Candidates:

- (a) **adaptive asymmetry penalty** (our `jacreg` penalizes ‖J_nc‖≈‖A‖, ramped on the residual/cycle onset; the validated 2.40 runs used this, the diverging runs froze it weak);
- (b) **structural r_str-style parameterization** (bound the antisymmetric part by construction, paper Eq 38);
- (c) **gain control** (γ scaling / qknorm, to bound the spectral gain);
- (d) a **direct cycle-amplitude / log-norm μ_P(J) penalty**.

Which is most effective AND analog-realizable (forward-only, local)? Give a concrete recipe.
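As a reference point for candidate (b), the following is a minimal sketch of the Eq 37-38 form `J = γ(√(1−r_str²)·S̃/‖S̃‖ + r_str·Ã/‖Ã‖)` applied to a small dense matrix. Several things here are assumptions rather than facts from the paper or from `lt_ep_train.py`: S̃ and Ã are taken to be the symmetric and antisymmetric parts of one raw matrix `W`, the norm is taken to be Frobenius, and the helper names (`sym_antisym`, `rstr_parameterize`, `asymmetry_ratio`) are hypothetical. In the real block the Jacobian is state-dependent, so this only illustrates the bound, not how to impose it on attention.

```python
import math


def sym_antisym(W):
    """Split a square matrix (list of lists) into S = (W+W^T)/2 and A = (W-W^T)/2."""
    n = len(W)
    S = [[0.5 * (W[i][j] + W[j][i]) for j in range(n)] for i in range(n)]
    A = [[0.5 * (W[i][j] - W[j][i]) for j in range(n)] for i in range(n)]
    return S, A


def fro_norm(M):
    """Frobenius norm (assumed norm; the source does not specify one)."""
    return math.sqrt(sum(x * x for row in M for x in row))


def rstr_parameterize(W, r_str, gamma, tiny=1e-12):
    """Return gamma * (sqrt(1 - r^2) * S/||S|| + r * A/||A||) for W = S + A."""
    if not 0.0 <= r_str <= 1.0:
        raise ValueError("r_str must lie in [0, 1]")
    S, A = sym_antisym(W)
    cs = gamma * math.sqrt(1.0 - r_str ** 2) / max(fro_norm(S), tiny)
    ca = gamma * r_str / max(fro_norm(A), tiny)
    n = len(W)
    return [[cs * S[i][j] + ca * A[i][j] for j in range(n)] for i in range(n)]


def asymmetry_ratio(J):
    """||A|| / sqrt(||S||^2 + ||A||^2): the fraction of J's norm that is antisymmetric."""
    S, A = sym_antisym(J)
    return fro_norm(A) / math.sqrt(fro_norm(S) ** 2 + fro_norm(A) ** 2)


if __name__ == "__main__":
    W = [[1.0, 2.0, 0.5], [-1.0, 0.3, 4.0], [2.0, -3.0, -0.7]]
    J = rstr_parameterize(W, r_str=0.3, gamma=1.5)
    print("||J||_F =", fro_norm(J), " asymmetry ratio =", asymmetry_ratio(J))
```

Because symmetric and antisymmetric matrices are orthogonal under the Frobenius inner product, this construction fixes both the overall gain (`‖J‖_F = γ`) and the antisymmetric fraction (`r_str`) regardless of the raw `W`, which is what "bounded by construction" means here.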
## Q3: THE THESIS

Can a non-conservative attention stay **sub-Hopf** (no limit cycle) AND be expressive enough for coherent language, or is there a **fundamental expressivity-vs-stability tradeoff** (the expressivity needs asymmetry/gain that triggers the bifurcation)? Estimate the bifurcation threshold (in r_str/γ terms) for this architecture and whether the sub-threshold regime suffices for an LM. Is a hybrid (bounded-asymmetry core + thin correction) the realistic ceiling?

## Q4: EQUILIBRIUM vs NON-EQUILIBRIUM PRIMITIVE

AsymEP requires a **stationary state**, which does not exist in the limit cycle. Two routes:

- (i) keep the operator below the Hopf (fixed point exists → AsymEP exact), accepting the expressivity bound;
- (ii) **embrace the non-equilibrium** (limit-cycle) computation with a learning rule native to it (oscillatory / reservoir / time-averaged).

Which is the right primitive for analog hardware, and is (ii) even tractable with a local forward-only rule?

---

Answer **Q1 → Q2 → Q3 → Q4 in order**, each rigorously and grounded in the code/data. Be decisive.