1 files changed, 99 insertions, 0 deletions
diff --git a/ep_run/EP_DIAGNOSIS_DOSSIER.md b/ep_run/EP_DIAGNOSIS_DOSSIER.md
new file mode 100644
index 0000000..22308fa
--- /dev/null
+++ b/ep_run/EP_DIAGNOSIS_DOSSIER.md
@@ -0,0 +1,99 @@
+# EP below-CE-2.1 divergence — complete diagnosis dossier (2026-06-22, CORRECTED)
+
+## Setup
+Equilibrium transformer block: fixed point of a damped relaxation `z ← z + ε·F(z)`, ε=0.1,
+`F(z) = x_in − (1+c)z + Attn(LN z) + FFN(LN z)`, c=1. Attention is **non-conservative** (independent
+WQ,WK,WV,WO; qknorm applies RMSNorm to q and k). Untied 4× GELU FFN. Trained **backprop-free** with **AsymEP**
+(Scurria 2602.03670: nudged dynamics get `−2A_J(x⁰)(x−x⁰)`, making the nudged Jacobian = Jᵀ at the
+free equilibrium). Code: `lt_ep_train.py` (`force`/`tforce`:81-106, `relax`:123, `ep_step`:140),
+`holo_ep.py` (holomorphic estimator). Eval/BPTT use the T1=150 relaxed state.
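+
+A short check of why the extra term transposes the Jacobian. This is a sketch that assumes `A_J` denotes the
+antisymmetric part of the free Jacobian `J = ∂F/∂z` evaluated at `x⁰`; the symbols `S_J` and `J_nudged` are
+notation introduced here for illustration, not taken from the paper or the code.
+```latex
+J = S_J + A_J, \qquad S_J = \tfrac{1}{2}\,(J + J^\top), \qquad A_J = \tfrac{1}{2}\,(J - J^\top)
+\\
+J_{\text{nudged}} = J - 2A_J = J - (J - J^\top) = J^\top
+```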
+
+**SYMPTOM:** EP descends, then **suddenly** diverges below CE≈2.1 (e.g. val 2.74 → 41 within ~100 steps,
+T1-residual 2.5e-2 → 0.42). Exact **BPTT on the identical model trains cleanly to CE 1.83.**
+
+## CORRECTED diagnosis (measured this round — supersedes earlier framings)
+
+**Fact 1 — it is a forward LIMIT CYCLE; the relaxation reaches no fixed point at the diverging operator.**
+`eval_relax` on the marginal ckpt redx **s3200** (val 2.74, just before the blowup): relax from the
+embedding for **6000 steps** →
+`res(t): 50→3.6e-2, 150→2.3e-2, 500→2.5e-2, 1000→2.5e-2, 2000→2.3e-2, 4000→2.3e-2, 6000→2.4e-2`,
+tail(last 1000) min 2.08e-2 / max 2.73e-2, **non-monotone**. It **floors ~2.3e-2 and oscillates — never
+reaches a fixed point.** (Reproduces an earlier lost-run finding: limit cycle, FTLE<0.)
+
+**Fact 2 — the cycle is driven by the non-conservative ATTENTION.**
+Knockout: scale the attention output (`WO ×= α`), eval_relax 3000 steps:
+```
+α=1.0: res-floor 2.5e-2, osc 6.0e-3   CYCLE
+α=0.7: res-floor 1.6e-2, osc 3.0e-3   CYCLE (smaller)
+α=0.4: res-floor 4.1e-3, osc 5.3e-4   nearly gone
+α=0.2: res-floor 3.2e-4               CONVERGED (true fixed point restored)
+α=0.0: res-floor 1.3e-3, osc 1.2e-3   tiny FFN-only cycle
+```
+From α=1.0 down to α=0.2, reducing the attention monotonically shrinks the cycle and restores convergence; at
+α=0.0 a tiny FFN-only cycle remains (floor 1.3e-3). **The attention's non-conservativity drives the limit cycle.**
+
+**Fact 3 — hypothesis: a Hopf-type bifurcation.** A relaxation `z←z+εF(z)` (map `M=I+εJ`) can sustain an
+oscillation of period > 2 (limit cycle) only if `M` has a **complex eigenvalue pair crossing |λ|=1**. A symmetric/conservative
+J has real eigenvalues → monotone convergence, blow-up, or a period-2 flip (real λ<−1, i.e. an eigenvalue of J below −2/ε=−20), never a longer cycle. As EP training grows the attention
+asymmetry/gain, a complex pair crosses → Hopf → limit cycle → readout of a cycle-point degrades → CE explodes.
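+
+A worked form of the crossing condition, as a sketch. Here `μ = a + ib` is an eigenvalue of `J` at the state
+being linearized and `λ = 1 + εμ` is the matching eigenvalue of `M`; the symbols `μ`, `a`, `b`, `K` are notation
+introduced here, and `K` stands for the Jacobian of the attention-plus-FFN terms in the Setup definition of `F`.
+```latex
+|\lambda|^2 = (1 + \varepsilon a)^2 + \varepsilon^2 b^2 < 1
+\iff a < -\tfrac{\varepsilon}{2}\,(a^2 + b^2) = -\tfrac{\varepsilon}{2}\,|\mu|^2
+\\
+J = -(1 + c)\,I + K, \qquad
+K = \partial_z\big[\mathrm{Attn}(\mathrm{LN}\,z) + \mathrm{FFN}(\mathrm{LN}\,z)\big]
+\;\Rightarrow\; a = -2 + \mathrm{Re}\,\operatorname{eig}(K) \quad (c = 1)
+```
+With ε=0.1 the stable region is `a < −0.05·|μ|²`, which is stricter than the continuous-time condition `a < 0`:
+a pair with `a < 0` but large `|b|` is stable for the flow yet unstable for the Euler map.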
+
+## RETRACTED framings (do not anchor on these)
+- codex's "(b) structural: equilibrium gradient L(z\*) blind to contraction → forward-mode/RTRL fix"
+  — **assumed the relaxation converges to a fixed point z\*.** It does not at the diverging operator (limit cycle). The scalar
+  counterexample (param changing convergence rate but not z\*) is moot when z\* is never reached.
+- "ρ drifts to 0.998 / slow convergence" — was a **transient artifact** of a ρ-probe window (caught the
+  initial 3.6e-2→2.3e-2 decay, missed the floor+oscillation).
+
+## Still-valid facts (about the GRADIENT estimator — separate axis from the forward cycle)
+- BPTT (exact grad) → CE 1.83, converges; its trajectory does NOT drive the attention into the cycling regime.
+- AsymEP gradient is accurate WHEN a converged fixed point exists: cos(g_EP, exact-adjoint)=0.99 at hr=0.2,
+  res 1e-9; the "0.94 ceiling" was nudged/adjoint-phase truncation (cos rises 0.74→0.998 as nudge-depth
+  t2sel 10→160). I.e. the estimator is fine *given a converged fixed point*, but at the diverging state the relaxation never reaches one.
+
+## AEP paper (Scurria 2602.03670) context
+- AsymEP is exact AT the **stationary state** (needs convergence). Appendix G.3 explicitly treats the
+  **stability of non-conservative dynamics** — they can **oscillate** — controlled by the **asymmetry ratio
+  r_str** (Eq 37-38: `J = γ(√(1−r_str²)·S̃/‖S̃‖ + r_str·Ã/‖Ã‖)`) + **gain γ** + conservative init
+  `Var[J]∝1/N`. "AsymEP reduces oscillations."
+
+---
+
+## Q1 (THIS query) — CONFIRM THE MECHANISM
+Is the divergence a **non-conservative Hopf bifurcation**: the attention's antisymmetric part A drives a
+**complex conjugate eigenvalue pair of the relaxation map M=I+εJ across |λ|=1**, producing the forward limit
+cycle (Facts 1-3)?
+1. Is the evidence (limit cycle in Fact 1 + the attention-scaling knockout in Fact 2) **conclusive** for a
+   Hopf bifurcation, or what is the gap / what alternative (e.g. real-eigenvalue saddle-node, a 2-cycle from
+   the discrete Euler step εF, an FFN contribution) is not yet excluded?
+2. What is the **single cleanest measurement** to nail it — e.g. compute the eigenvalues of `M=I+εJ` at
+   s3200 (is there a complex pair with |λ|≥1, vs a real λ≥1)? a Floquet/period analysis of the cycle? an
+   ε-sweep (does shrinking ε convert the cycle to convergence — distinguishing a continuous-time vs
+   discrete-Euler instability)?
+3. Verify the mechanism against the actual `lt_ep_train.py` force/relax code.
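+
+For item 2, the period of the residual oscillation near onset already separates the candidates. This is a
+linearized sketch; `ρ`, `θ`, `T` are notation introduced here, with `μ = a + ib` an eigenvalue of `J`.
+```latex
+\lambda = 1 + \varepsilon\mu = \rho\,e^{\pm i\theta}, \qquad
+\theta = \operatorname{atan2}(\varepsilon b,\; 1 + \varepsilon a), \qquad
+T = \frac{2\pi}{\theta}\ \text{steps}
+\\
+\text{flip (real } \lambda < -1\text{)}:\ \theta = \pi \Rightarrow T = 2, \qquad
+\text{complex pair with } \varepsilon |b| \ll 1:\ T \approx \frac{2\pi}{\varepsilon |b|} \gg 2
+```
+A period that stays at 2 steps points to the discrete flip; a long period that scales like 1/ε under an ε-sweep
+points to a complex pair.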
+
+## Q2 — THE FIX
+Given Q1 (a Hopf bifurcation from the attention's non-conservativity), what is the best way to keep the
+operator **below** the bifurcation (so a stable fixed point exists and AsymEP is valid) while preserving as much
+attention expressivity as possible? Candidates: (a) **adaptive asymmetry penalty** (our `jacreg` penalizes
+‖J_nc‖≈‖A‖, ramped on the residual/cycle onset; the validated 2.40 runs used this, the diverging runs froze
+it weak); (b) **structural r_str-style parameterization** (bound the antisymmetric part by construction,
+paper Eq 38); (c) **gain control** (γ scaling / qknorm — bound the spectral gain); (d) a **direct
+cycle-amplitude / log-norm μ_P(J) penalty**. Which is most effective AND analog-realizable (forward-only,
+local)? Give a concrete recipe.
+
+## Q3 — THE THESIS
+Can a non-conservative attention stay **sub-Hopf** (no limit cycle) AND be expressive enough for coherent
+language, or is there a **fundamental expressivity-vs-stability tradeoff** (the expressivity needs
+asymmetry/gain that triggers the bifurcation)? Estimate the bifurcation threshold (in r_str/γ terms) for this
+architecture and whether the sub-threshold regime suffices for an LM. Is a hybrid (bounded-asymmetry core +
+thin correction) the realistic ceiling?
+
+## Q4 — EQUILIBRIUM vs NON-EQUILIBRIUM PRIMITIVE
+AsymEP requires a **stationary state**, which the relaxation never reaches in the limit-cycle regime. Two routes: (i) keep the
+operator below the Hopf (fixed point exists → AsymEP exact), accepting the expressivity bound; (ii) **embrace
+the non-equilibrium** (limit-cycle) computation with a learning rule native to it (oscillatory / reservoir /
+time-averaged). Which is the right primitive for analog hardware, and is (ii) even tractable with a local
+forward-only rule?
+
+---
+Answer **Q1 → Q2 → Q3 → Q4 in order**, each rigorously and grounded in the code/data. Be decisive.