# EP below-CE-2.1 divergence — complete diagnosis dossier (2026-06-22, CORRECTED)

## Setup
Equilibrium transformer block: fixed point of a damped relaxation `z ← z + ε·F(z)`, ε=0.1,
`F(z) = x_in − (1+c)z + Attn(LN z) + FFN(LN z)`, c=1. Attention is **non-conservative** (independent
WQ,WK,WV,WO; qknorm applies RMSNorm to q and k). Untied 4× GELU FFN. Trained **backprop-free** with **AsymEP**
(Scurria 2602.03670: the nudged dynamics get the extra term `−2A_J(x⁰)(x−x⁰)`, with `A_J = (J−Jᵀ)/2` the
antisymmetric part of the Jacobian, so the nudged Jacobian `J − 2A_J` equals Jᵀ at the free equilibrium).
Code: `lt_ep_train.py` (`force`/`tforce`:81-106, `relax`:123, `ep_step`:140),
`holo_ep.py` (holomorphic estimator). Eval/BPTT use the T1=150 relaxed state.
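
The relaxation described in this Setup can be sketched on a toy problem. This is a minimal sketch and not the code in `lt_ep_train.py`: `toy_force`, the weak random coupling `W` (a stand-in for `Attn(LN z) + FFN(LN z)`), and the relative residual `‖F(z)‖/‖z‖` are all assumptions made for illustration.

```python
import numpy as np

def relax(force, z0, eps=0.1, steps=150):
    """Damped relaxation z <- z + eps * F(z); returns the final state and the residual trace."""
    z = z0.copy()
    res = np.empty(steps)
    for t in range(steps):
        f = force(z)
        # Assumed residual definition: force norm relative to state norm.
        res[t] = np.linalg.norm(f) / (np.linalg.norm(z) + 1e-12)
        z = z + eps * f
    return z, res

rng = np.random.default_rng(0)
n, c = 16, 1.0
x_in = rng.normal(size=n)
# Hypothetical weak coupling; small enough that the toy map is a contraction.
W = 0.3 * rng.normal(size=(n, n)) / np.sqrt(n)

def toy_force(z):
    # Same shape as F(z) = x_in - (1+c) z + G(z), with G a toy nonlinearity.
    return x_in - (1.0 + c) * z + W @ np.tanh(z)

# Relax from the input, as eval_relax is described as relaxing from the embedding.
z_star, res = relax(toy_force, x_in, eps=0.1, steps=150)
print(f"res(0)={res[0]:.2e}  res(149)={res[-1]:.2e}")
```

Because the toy coupling is weak, the residual here decays steadily; the dossier's symptom is the case where it does not.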

**SYMPTOM:** EP descends, then **suddenly** diverges below CE≈2.1 (e.g. val 2.74 → 41 within ~100 steps,
T1-residual 2.5e-2 → 0.42). Exact **BPTT on the identical model trains cleanly to CE 1.83.**

## CORRECTED diagnosis (measured this round — supersedes earlier framings)

**Fact 1 — it is a forward LIMIT CYCLE, there is no fixed point at the diverging operator.**
`eval_relax` on the marginal ckpt redx **s3200** (val 2.74, just before the blowup): relax from the
embedding for **6000 steps** →
`res(t): 50→3.6e-2, 150→2.3e-2, 500→2.5e-2, 1000→2.5e-2, 2000→2.3e-2, 4000→2.3e-2, 6000→2.4e-2`,
tail(last 1000) min 2.08e-2 / max 2.73e-2, **non-monotone**. It **floors ~2.3e-2 and oscillates — never
reaches a fixed point.** (Reproduces an earlier lost-run finding: limit cycle, FTLE<0.)
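
The tail statistics quoted in Fact 1 (min, max, non-monotone) can be computed from any residual trace along these lines. This is a sketch: `tail_stats` is a hypothetical helper, the peak-to-peak band is an assumed definition of "osc", and `fake_res` is a synthetic trace, not data from the s3200 checkpoint.

```python
import numpy as np

def tail_stats(res, tail=1000):
    """Summarise the last `tail` residuals of a relaxation trace."""
    r = np.asarray(res, dtype=float)[-tail:]
    return {
        "min": float(r.min()),                      # residual floor
        "max": float(r.max()),
        "osc": float(r.max() - r.min()),            # peak-to-peak band (assumed definition)
        "monotone": bool(np.all(np.diff(r) <= 0)),  # True only if the tail never increases
    }

# Synthetic stand-in: a transient decay onto an oscillating floor.
t = np.arange(6000)
fake_res = 2.3e-2 + 1.3e-2 * np.exp(-t / 60.0) + 3.0e-3 * np.sin(2 * np.pi * t / 97.0)
stats = tail_stats(fake_res, tail=1000)
print(stats)

# A genuinely converging trace for comparison.
conv_stats = tail_stats(np.exp(-t / 50.0), tail=1000)
print(conv_stats)
```

A flat floor with `monotone=False` and a stable `osc` band is the signature the dossier reads as a cycle; a converging trace keeps shrinking and stays monotone.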

**Fact 2 — the cycle is driven by the non-conservative ATTENTION.**
Knockout: scale the attention output (`WO ×= α`), eval_relax 3000 steps:
```
α=1.0: res-floor 2.5e-2, osc 6.0e-3   CYCLE
α=0.7: res-floor 1.6e-2, osc 3.0e-3   CYCLE (smaller)
α=0.4: res-floor 4.1e-3, osc 5.3e-4   nearly gone
α=0.2: res-floor 3.2e-4               CONVERGED (true fixed point restored)
α=0.0: res-floor 1.3e-3, osc 1.2e-3   tiny FFN-only cycle
```
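Reducing α from 1.0 to 0.2 monotonically shrinks the cycle and restores convergence. At α=0.0 a much smaller
FFN-only oscillation (floor 1.3e-3) reappears, so the FFN can cycle weakly on its own, but **the attention's
non-conservativity is the dominant driver of the limit cycle.**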
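
The logic of the knockout can be reproduced on a two-dimensional toy. This is a sketch under stated assumptions: the matrix `A`, the `tanh` saturation and the starting point are hypothetical stand-ins for the attention branch, and the toy has no FFN term, so it cannot reproduce the small α=0.0 cycle in the table.

```python
import numpy as np

# Hypothetical non-symmetric coupling: rotation-like, with gain.
A = np.array([[3.0, -4.0], [4.0, 3.0]])

def res_floor(alpha, eps=0.1, steps=3000, tail=500):
    """Relax z <- z + eps*F(z) with F(z) = -2 z + alpha * A tanh(z); report tail floor and band."""
    z = np.array([0.5, -0.3])
    res = np.empty(steps)
    for t in range(steps):
        f = -2.0 * z + alpha * (A @ np.tanh(z))  # -(1+c) z with c=1, plus the scaled coupling
        res[t] = np.linalg.norm(f)
        z = z + eps * f
    r = res[-tail:]
    return float(r.min()), float(r.max() - r.min())

floors = {}
for alpha in (1.0, 0.7, 0.4, 0.2, 0.0):
    floor, osc = res_floor(alpha)
    floors[alpha] = floor
    print(f"alpha={alpha:.1f}: res-floor {floor:.1e}, osc {osc:.1e}")
```

In this toy the multipliers of the linearised map are `1 + ε(−2 + 3α ± 4αi)`, which leave the unit circle near α≈0.58 for ε=0.1, so the two largest α values cycle and the rest converge.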

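**Fact 3 (hypothesis): a Hopf-type bifurcation.** For the relaxation `z←z+εF(z)`, linearised as the map
`M=I+εJ`, a sustained smooth oscillation (limit cycle) requires a **complex eigenvalue pair of `M` crossing
|λ|=1** (for a discrete map this is the Neimark-Sacker analogue of a Hopf bifurcation). A symmetric/conservative
J has real eigenvalues → monotone convergence or blow-up, or at most a period-2 flip if a real eigenvalue of `M`
passes −1 (J eigenvalue below −2/ε = −20), never a smooth cycle. As EP training grows the attention
asymmetry/gain, a complex pair crosses → Hopf (under this hypothesis the fixed point would still exist but be
unstable, hence unreachable by relaxation) → limit cycle → readout of a cycle-point degrades → CE explodes.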
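
The eigenvalue criterion in Fact 3 can be checked mechanically once a Jacobian is available. The helper below is a hypothetical sketch, not part of `lt_ep_train.py`: it labels the leading multiplier of `M = I + εJ` for small explicit matrices, and it also reports the real-eigenvalue cases (blow-up past +1, period-2 flip past −1) so they can be told apart from a complex pair.

```python
import numpy as np

def classify_leading_multiplier(J, eps=0.1, tol=1e-9):
    """Eigen-decompose M = I + eps*J and label the multiplier of largest modulus."""
    lam = np.linalg.eigvals(np.eye(J.shape[0]) + eps * J)
    lead = lam[np.argmax(np.abs(lam))]
    rho = float(abs(lead))
    if rho < 1.0:
        kind = "stable (all |lambda| < 1)"
    elif abs(lead.imag) > tol:
        kind = "complex pair outside unit circle (Hopf-type)"
    elif lead.real > 0:
        kind = "real multiplier > +1 (monotone blow-up)"
    else:
        kind = "real multiplier < -1 (flip / period-2)"
    return rho, kind

# Illustrative 2x2 Jacobians (assumptions, not measured from the model).
J_sym = np.diag([-2.0, -0.5])                 # symmetric, damped
J_rot = np.array([[1.0, -4.0], [4.0, 1.0]])   # -2I plus a rotation-like coupling
J_stiff = np.diag([-25.0, -1.0])              # symmetric but too stiff for eps = 0.1

for name, J in [("sym", J_sym), ("rot", J_rot), ("stiff", J_stiff)]:
    rho, kind = classify_leading_multiplier(J)
    print(f"{name}: rho={rho:.3f}  {kind}")
```

For the real model the Jacobian would have to be evaluated at (or near) the unstable fixed point, which relaxation alone does not reach; that step is outside this sketch.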

## RETRACTED framings (do not anchor on these)
- codex's "(b) structural: equilibrium gradient L(z\*) blind to contraction → forward-mode/RTRL fix"
  — **assumed a fixed point z\* exists.** It does not at the diverging operator (limit cycle). The scalar
  counterexample (param changing convergence rate but not z\*) is moot when z\* doesn't exist.
- "ρ drifts to 0.998 / slow convergence" — was a **transient artifact** of a ρ-probe window (caught the
  initial 3.6e-2→2.3e-2 decay, missed the floor+oscillation).

## Still-valid facts (about the GRADIENT estimator — separate axis from the forward cycle)
- BPTT (exact grad) → CE 1.83, converges; its trajectory does NOT drive the attention into the cycling regime.
- AsymEP gradient is accurate WHEN a converged fixed point exists: cos(g_EP, exact-adjoint)=0.99 at hr=0.2,
  res 1e-9; the "0.94 ceiling" was nudged/adjoint-phase truncation (cos rises 0.74→0.998 as nudge-depth
  t2sel goes 10→160). That is, the estimator is fine *given a fixed point*, but at the diverging state there is none.
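
The cosine figures above compare two gradient estimates over all parameters at once. A minimal sketch of that metric, assuming each estimate is a list of per-parameter arrays (the helper name `grad_cosine` and the noisy stand-in gradients are hypothetical):

```python
import numpy as np

def grad_cosine(grads_a, grads_b):
    """Cosine similarity between two gradient estimates given as lists of arrays."""
    a = np.concatenate([np.ravel(g) for g in grads_a])
    b = np.concatenate([np.ravel(g) for g in grads_b])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-30))

rng = np.random.default_rng(0)
# Stand-ins: an "exact" gradient and a noisy estimate of it, one array per parameter tensor.
g_exact = [rng.normal(size=(8, 8)), rng.normal(size=8)]
g_est = [g + 0.1 * rng.normal(size=g.shape) for g in g_exact]
print(f"cos = {grad_cosine(g_est, g_exact):.3f}")
```

Flattening everything into one vector weights each tensor by its gradient norm; a per-tensor cosine would be a different (also reasonable) choice.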

## AsymEP paper (Scurria 2602.03670) context
- AsymEP is exact AT the **stationary state** (needs convergence). Appendix G.3 explicitly treats the
  **stability of non-conservative dynamics** — they can **oscillate** — controlled by the **asymmetry ratio
  r_str** (Eq 37-38: `J = γ(√(1−r_str²)·S̃/‖S̃‖ + r_str·Ã/‖Ã‖)`) + **gain γ** + conservative init
  `Var[J]∝1/N`. "AsymEP reduces oscillations."
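
The parameterization quoted from Eq 37-38 can be built directly from a random matrix. This is a sketch under assumptions: the Frobenius norm is assumed for `‖S̃‖` and `‖Ã‖` (the dossier does not say which norm the paper uses), and `make_jacobian` / `asymmetry_ratio` are hypothetical helper names.

```python
import numpy as np

def make_jacobian(n, gamma, r_str, rng):
    """J = gamma * (sqrt(1 - r_str^2) * S/|S| + r_str * A/|A|), Frobenius norms assumed."""
    G = rng.normal(size=(n, n))
    S = 0.5 * (G + G.T)        # symmetric (conservative) part
    A = 0.5 * (G - G.T)        # antisymmetric (non-conservative) part
    S /= np.linalg.norm(S)
    A /= np.linalg.norm(A)
    return gamma * (np.sqrt(1.0 - r_str**2) * S + r_str * A)

def asymmetry_ratio(J):
    """Recover r_str from a matrix as |antisymmetric part| / |J| (Frobenius)."""
    return float(np.linalg.norm(0.5 * (J - J.T)) / np.linalg.norm(J))

rng = np.random.default_rng(1)
J = make_jacobian(32, gamma=1.5, r_str=0.3, rng=rng)
print(f"|J|_F = {np.linalg.norm(J):.3f}, recovered r_str = {asymmetry_ratio(J):.3f}")
```

Under the Frobenius assumption the symmetric and antisymmetric parts are orthogonal, so `‖J‖` equals γ for every `r_str` and the two knobs separate cleanly: γ sets the overall gain, `r_str` only the split.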

---

## Q1 (THIS query) — CONFIRM THE MECHANISM
Is the divergence a **non-conservative Hopf bifurcation**: the attention's antisymmetric part A drives a
**complex conjugate eigenvalue pair of the relaxation map M=I+εJ across |λ|=1**, producing the forward limit
cycle (Facts 1-3)?
1. Is the evidence (limit cycle in Fact 1 + the attention-scaling knockout in Fact 2) **conclusive** for a
   Hopf bifurcation? If not, what is the gap: which alternatives (e.g. a real-eigenvalue saddle-node, a 2-cycle
   from the discrete Euler step εF, an FFN contribution) are not yet excluded?
2. What is the **single cleanest measurement** to nail it — e.g. compute the eigenvalues of `M=I+εJ` at
   s3200 (is there a complex pair with |λ|≥1, vs a real λ≥1)? a Floquet/period analysis of the cycle? an
   ε-sweep (does shrinking ε convert the cycle to convergence — distinguishing a continuous-time vs
   discrete-Euler instability)?
3. Verify the mechanism against the actual `lt_ep_train.py` force/relax code.
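
The ε-sweep in item 2 has a simple linear counterpart that shows what it would distinguish. For a Jacobian eigenvalue `λ = a + ib`, the Euler multiplier satisfies `|1 + ελ| < 1` exactly when `ε < −2a/(a²+b²)`, which is only possible if `a < 0`. The sketch below applies that to explicit toy matrices; `eps_sweep`, `diagnose`, `J1` and `J2` are hypothetical, and on the real model the empirical version would be rerunning `eval_relax` at several ε.

```python
import numpy as np

def eps_sweep(J, eps_grid):
    """Spectral radius of M = I + eps*J for each eps in the grid."""
    lam = np.linalg.eigvals(J)
    rho = np.array([np.max(np.abs(1.0 + e * lam)) for e in eps_grid])
    return lam, rho

def diagnose(J, eps=0.1):
    """Separate a continuous-time instability from one created by the Euler step size."""
    lam = np.linalg.eigvals(J)
    if np.max(np.abs(1.0 + eps * lam)) < 1.0:
        return "stable at this eps"
    if np.max(lam.real) >= 0.0:
        return "continuous-time instability (no eps > 0 stabilises it)"
    # All real parts negative: every eigenvalue is stable for eps below its own bound.
    eps_crit = np.min(-2.0 * lam.real / np.abs(lam) ** 2)
    return f"discrete-Euler instability (stable for eps < {eps_crit:.3g})"

J1 = np.array([[-0.5, -6.0], [6.0, -0.5]])  # damped in continuous time, fast rotation
J2 = np.array([[1.0, -4.0], [4.0, 1.0]])    # growing in continuous time

eps_grid = np.array([0.2, 0.1, 0.05, 0.02, 0.01])
_, rho_grid = eps_sweep(J1, eps_grid)
for e, r in zip(eps_grid, rho_grid):
    print(f"J1  eps={e:.2f}  rho={r:.4f}")
print("J1:", diagnose(J1))
print("J2:", diagnose(J2))
```

If shrinking ε restores convergence, the cycle is at least partly a step-size effect; if it persists at every ε, the underlying flow itself has crossed into instability.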

## Q2 — THE FIX
Given Q1 (a Hopf bifurcation from the attention's non-conservativity), what is the best way to keep the
operator **below** the bifurcation (so a fixed point exists and AsymEP is valid) while preserving as much
attention expressivity as possible? Candidates: (a) **adaptive asymmetry penalty** (our `jacreg` penalizes
‖J_nc‖≈‖A‖, ramped on the residual/cycle onset; the validated 2.40 runs used this, the diverging runs froze
it weak); (b) **structural r_str-style parameterization** (bound the antisymmetric part by construction,
paper Eq 38); (c) **gain control** (γ scaling / qknorm — bound the spectral gain); (d) a **direct
cycle-amplitude / log-norm μ_P(J) penalty**. Which is most effective AND analog-realizable (forward-only,
local)? Give a concrete recipe.
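
Candidates (a) and (d) can be made concrete on an explicit matrix. The sketch below is illustrative only and rests on several assumptions: it uses the unweighted 2-norm log-norm `μ₂(J) = λ_max((J+Jᵀ)/2)` (that is, `P = I`, not a tuned `μ_P`), it needs the full Jacobian (so it is neither forward-only nor local), and `penalty`, `margin`, `w_mu`, `w_asym` are hypothetical names unrelated to the actual `jacreg`.

```python
import numpy as np

def log_norm_2(J):
    """Logarithmic 2-norm: largest eigenvalue of the symmetric part of J."""
    return float(np.linalg.eigvalsh(0.5 * (J + J.T)).max())

def asym_norm(J):
    """Frobenius norm of the antisymmetric part of J."""
    return float(np.linalg.norm(0.5 * (J - J.T)))

def penalty(J, margin=0.1, w_mu=1.0, w_asym=0.0):
    """Hinge on the log-norm plus an optional asymmetry term (weights are placeholders)."""
    return w_mu * max(0.0, log_norm_2(J) + margin) ** 2 + w_asym * asym_norm(J) ** 2

# Strong rotation with healthy damping: the log-norm sees only the damping.
J = np.array([[-2.0, -7.0], [7.0, -2.0]])
eps = 0.1
rho_M = float(np.max(np.abs(np.linalg.eigvals(np.eye(2) + eps * J))))
print(f"mu_2(J) = {log_norm_2(J):.2f}, ||A|| = {asym_norm(J):.2f}, "
      f"penalty = {penalty(J):.2f}, rho(I + eps J) = {rho_M:.3f}")
```

In this example the log-norm is negative and the hinge penalty is zero, yet the Euler map at ε=0.1 is unstable, because `μ₂` ignores the antisymmetric part entirely. A log-norm penalty alone therefore bounds the continuous-time flow but not the discrete relaxation; it would need to be paired with a bound on the rotation rate or a smaller ε.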

## Q3 — THE THESIS
Can a non-conservative attention stay **sub-Hopf** (no limit cycle) AND be expressive enough for coherent
language, or is there a **fundamental expressivity-vs-stability tradeoff** (the expressivity needs
asymmetry/gain that triggers the bifurcation)? Estimate the bifurcation threshold (in r_str/γ terms) for this
architecture and whether the sub-threshold regime suffices for an LM. Is a hybrid (bounded-asymmetry core +
thin correction) the realistic ceiling?

## Q4 — EQUILIBRIUM vs NON-EQUILIBRIUM PRIMITIVE
AsymEP requires a **stationary state**, which does not exist in the limit cycle. Two routes: (i) keep the
operator below the Hopf (fixed point exists → AsymEP exact), accepting the expressivity bound; (ii) **embrace
the non-equilibrium** (limit-cycle) computation with a learning rule native to it (oscillatory / reservoir /
time-averaged). Which is the right primitive for analog hardware, and is (ii) even tractable with a local
forward-only rule?

---
Answer **Q1 → Q2 → Q3 → Q4 in order**, each rigorously and grounded in the code/data. Be decisive.