Diffstat (limited to 'ep_run/EP_DIAGNOSIS_DOSSIER.md')
| -rw-r--r-- | ep_run/EP_DIAGNOSIS_DOSSIER.md | 99 |
1 files changed, 99 insertions, 0 deletions
diff --git a/ep_run/EP_DIAGNOSIS_DOSSIER.md b/ep_run/EP_DIAGNOSIS_DOSSIER.md
new file mode 100644
index 0000000..22308fa
--- /dev/null
+++ b/ep_run/EP_DIAGNOSIS_DOSSIER.md
@@ -0,0 +1,99 @@
+# EP below-CE-2.1 divergence — complete diagnosis dossier (2026-06-22, CORRECTED)
+
+## Setup
+Equilibrium transformer block: fixed point of a damped relaxation `z ← z + ε·F(z)`, ε=0.1,
+`F(z) = x_in − (1+c)z + Attn(LN z) + FFN(LN z)`, c=1. Attention is **non-conservative** (independent
+WQ,WK,WV,WO; qknorm RMSNorms q,k). Untied 4× GELU FFN. Trained **backprop-free** with **AsymEP**
+(Scurria 2602.03670: nudged dynamics get `−2A_J(x⁰)(x−x⁰)`, making the nudged Jacobian = Jᵀ at the
+free equilibrium). Code: `lt_ep_train.py` (`force`/`tforce`:81-106, `relax`:123, `ep_step`:140),
+`holo_ep.py` (holomorphic estimator). Eval/BPTT use the T1=150 relaxed state.
+
+**SYMPTOM:** EP descends, then **suddenly** diverges below CE≈2.1 (e.g. val 2.74 → 41 within ~100 steps,
+T1-residual 2.5e-2 → 0.42). Exact **BPTT on the identical model trains cleanly to CE 1.83.**
+
+## CORRECTED diagnosis (measured this round — supersedes earlier framings)
+
+**Fact 1 — it is a forward LIMIT CYCLE, there is no fixed point at the diverging operator.**
+`eval_relax` on the marginal ckpt redx **s3200** (val 2.74, just before the blowup): relax from the
+embedding for **6000 steps** →
+`res(t): 50→3.6e-2, 150→2.3e-2, 500→2.5e-2, 1000→2.5e-2, 2000→2.3e-2, 4000→2.3e-2, 6000→2.4e-2`,
+tail(last 1000) min 2.08e-2 / max 2.73e-2, **non-monotone**. It **floors ~2.3e-2 and oscillates — never
+reaches a fixed point.** (Reproduces an earlier lost-run finding: limit cycle, FTLE<0.)
+
+**Fact 2 — the cycle is driven by the non-conservative ATTENTION.**
+Knockout: scale the attention output (`WO ×= α`), eval_relax 3000 steps:
+```
+α=1.0: res-floor 2.5e-2, osc 6.0e-3   CYCLE
+α=0.7: res-floor 1.6e-2, osc 3.0e-3   CYCLE (smaller)
+α=0.4: res-floor 4.1e-3, osc 5.3e-4   nearly gone
+α=0.2: res-floor 3.2e-4               CONVERGED (true fixed point restored)
+α=0.0: res-floor 1.3e-3, osc 1.2e-3   tiny FFN-only cycle
+```
+Reducing the attention monotonically shrinks the cycle and restores convergence. **The attention's
+non-conservativity drives the limit cycle.**
+
+**Fact 3 — hypothesis: a Hopf-type bifurcation.** A relaxation `z←z+εF(z)` (map `M=I+εJ`) can only
+*oscillate* (limit cycle) if `M` has a **complex eigenvalue pair crossing |λ|=1**. A symmetric/conservative
+J has real eigenvalues → monotone convergence or blow-up, never a cycle. As EP training grows the attention
+asymmetry/gain, a complex pair crosses → Hopf → limit cycle → readout of a cycle-point degrades → CE explodes.
+
+## RETRACTED framings (do not anchor on these)
+- codex's "(b) structural: equilibrium gradient L(z\*) blind to contraction → forward-mode/RTRL fix"
+  — **assumed a fixed point z\* exists.** It does not at the diverging operator (limit cycle). The scalar
+  counterexample (param changing convergence rate but not z\*) is moot when z\* doesn't exist.
+- "ρ drifts to 0.998 / slow convergence" — was a **transient artifact** of a ρ-probe window (caught the
+  initial 3.6e-2→2.3e-2 decay, missed the floor+oscillation).
+
+## Still-valid facts (about the GRADIENT estimator — separate axis from the forward cycle)
+- BPTT (exact grad) → CE 1.83, converges; its trajectory does NOT drive the attention into the cycling regime.
+- AsymEP gradient is accurate WHEN a converged fixed point exists: cos(g_EP, exact-adjoint)=0.99 at hr=0.2,
+  res 1e-9; the "0.94 ceiling" was nudged/adjoint-phase truncation (cos rises 0.74→0.998 as nudge-depth
+  t2sel 10→160). i.e. the estimator is fine *given a fixed point* — but at the diverging state there is none.
+
+## AEP paper (Scurria 2602.03670) context
+- AsymEP is exact AT the **stationary state** (needs convergence). Appendix G.3 explicitly treats the
+  **stability of non-conservative dynamics** — they can **oscillate** — controlled by the **asymmetry ratio
+  r_str** (Eq 37-38: `J = γ(√(1−r_str²)·S̃/‖S̃‖ + r_str·Ã/‖Ã‖)`) + **gain γ** + conservative init
+  `Var[J]∝1/N`. "AsymEP reduces oscillations."
+
+---
+
+## Q1 (THIS query) — CONFIRM THE MECHANISM
+Is the divergence a **non-conservative Hopf bifurcation**: the attention's antisymmetric part A drives a
+**complex conjugate eigenvalue pair of the relaxation map M=I+εJ across |λ|=1**, producing the forward limit
+cycle (Facts 1-3)?
+1. Is the evidence (limit cycle in Fact 1 + the attention-scaling knockout in Fact 2) **conclusive** for a
+   Hopf bifurcation, or what is the gap / what alternative (e.g. real-eigenvalue saddle-node, a 2-cycle from
+   the discrete Euler step εF, an FFN contribution) is not yet excluded?
+2. What is the **single cleanest measurement** to nail it — e.g. compute the eigenvalues of `M=I+εJ` at
+   s3200 (is there a complex pair with |λ|≥1, vs a real λ≥1)? a Floquet/period analysis of the cycle? an
+   ε-sweep (does shrinking ε convert the cycle to convergence — distinguishing a continuous-time vs
+   discrete-Euler instability)?
+3. Verify the mechanism against the actual `lt_ep_train.py` force/relax code.
+
+## Q2 — THE FIX
+Given Q1 (a Hopf bifurcation from the attention's non-conservativity), what is the best way to keep the
+operator **below** the bifurcation (so a fixed point exists and AsymEP is valid) while preserving as much
+attention expressivity as possible? Candidates: (a) **adaptive asymmetry penalty** (our `jacreg` penalizes
+‖J_nc‖≈‖A‖, ramped on the residual/cycle onset; the validated 2.40 runs used this, the diverging runs froze
+it weak); (b) **structural r_str-style parameterization** (bound the antisymmetric part by construction,
+paper Eq 38); (c) **gain control** (γ scaling / qknorm — bound the spectral gain); (d) a **direct
+cycle-amplitude / log-norm μ_P(J) penalty**. Which is most effective AND analog-realizable (forward-only,
+local)? Give a concrete recipe.
+
+## Q3 — THE THESIS
+Can a non-conservative attention stay **sub-Hopf** (no limit cycle) AND be expressive enough for coherent
+language, or is there a **fundamental expressivity-vs-stability tradeoff** (the expressivity needs
+asymmetry/gain that triggers the bifurcation)? Estimate the bifurcation threshold (in r_str/γ terms) for this
+architecture and whether the sub-threshold regime suffices for an LM. Is a hybrid (bounded-asymmetry core +
+thin correction) the realistic ceiling?
+
+## Q4 — EQUILIBRIUM vs NON-EQUILIBRIUM PRIMITIVE
+AsymEP requires a **stationary state**, which does not exist in the limit cycle. Two routes: (i) keep the
+operator below the Hopf (fixed point exists → AsymEP exact), accepting the expressivity bound; (ii) **embrace
+the non-equilibrium** (limit-cycle) computation with a learning rule native to it (oscillatory / reservoir /
+time-averaged). Which is the right primitive for analog hardware, and is (ii) even tractable with a local
+forward-only rule?
+
+---
+Answer **Q1 → Q2 → Q3 → Q4 in order**, each rigorously and grounded in the code/data. Be decisive.
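
The measurements proposed in Q1.2 (eigenvalues of `M=I+εJ`, a long-relaxation residual tail, an ε-sweep) can be illustrated on a toy system. This is a minimal sketch under stated assumptions, not the code in `lt_ep_train.py`: the names `toy_force`, `relax_residuals`, `leading_multiplier` and `critical_eps` are hypothetical, and a 2-D Hopf normal form stands in for the real `F(z)`, with `mu` playing the role of the symmetric part of J and `omega` the antisymmetric part.

```python
import cmath
import math

EPS = 0.1  # relaxation step size; matches the dossier's ε


def toy_force(z, mu, omega=1.0):
    """Hypothetical 2-D stand-in for F(z) (Hopf normal form), NOT the model's force.

    mu is the symmetric part of the Jacobian at z = 0, omega the antisymmetric
    (rotational) part; the cubic term saturates growth into a cycle.
    """
    x, y = z
    gain = mu - (x * x + y * y)
    return (gain * x - omega * y, gain * y + omega * x)


def relax_residuals(mu, steps=3000, eps=EPS):
    """Iterate z <- z + eps*F(z) from a fixed start; return ||F(z)|| per step."""
    z = (0.3, 0.0)
    residuals = []
    for _ in range(steps):
        fx, fy = toy_force(z, mu)
        residuals.append(math.hypot(fx, fy))
        z = (z[0] + eps * fx, z[1] + eps * fy)
    return residuals


def leading_multiplier(mu, omega=1.0, eps=EPS):
    """Largest-modulus eigenvalue of M = I + eps*J, J = [[mu, -omega], [omega, mu]]."""
    a = 1.0 + eps * mu   # diagonal entries of M
    b = eps * omega      # off-diagonal entries of M are -b and +b
    trace, det = 2.0 * a, a * a + b * b
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return max((trace + root) / 2.0, (trace - root) / 2.0, key=abs)


def critical_eps(mu, omega=1.0):
    """Step size at which |1 + eps*(mu + i*omega)| = 1; meaningful only for mu < 0."""
    return -2.0 * mu / (mu * mu + omega * omega)


for mu in (-0.5, 0.5):
    lam = leading_multiplier(mu)
    tail = relax_residuals(mu)[-1000:]
    is_complex = abs(lam.imag) > 1e-12
    print(f"mu={mu:+.1f} |lambda|={abs(lam):.4f} complex={is_complex} "
          f"tail min={min(tail):.2e} max={max(tail):.2e}")
```

In this toy the two diagnostics agree: at `mu=-0.5` the complex pair has modulus about 0.955 and the residual decays to zero, while at `mu=+0.5` the modulus is about 1.055 and the residual floors near 0.74 instead of converging. Unlike the trace in Fact 1, the toy's residual is constant on the cycle rather than oscillating, so it illustrates only the floor, not the non-monotone tail. `critical_eps(-0.5)` returns 0.8: a contracting but rotating J still goes unstable under the Euler map once ε exceeds −2μ/(μ²+ω²), which is the discrete-step alternative the ε-sweep would need to rule out. For the real model, J would have to be taken at the relaxed state of the checkpoint and passed to a general eigen-solver.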
