# C512 EP below-2.10 divergence — diagnosis + fix (2026-06-20)

> Written to disk after `/tmp/lt_ep` was wiped (10-day /tmp cleanup) — all local code/ckpts/logs lost.
> This file preserves the diagnosis + the exact code change so it can be rebuilt.

## Symptom
EP (frozen jr=0.1, tracking-AEP, qknorm, c=1) descends to best **~2.09**, then within ~200 steps:
res 5e-3→0.15, cos(g_EP,g_BPTT) 0.98→~0, val CE 2.09→4+, ABORT. Same ~2.09 floor every run
(erf & tanh both: 2.0919 / 2.0921). **Exact-BPTT, identical recipe, sails past 2.09 → 1.72 (freeze_wsd).**

## Root cause (Codex-confirmed, 5-way corroborated)
**EP optimizes the fixed point; BPTT optimizes the finite unroll — only the finite unroll defends the residual.**
- `ep_step` relaxes T1=150 → `z150`, then REFINES to t1max=300 → `z*`, computes the EP gradient at `z*`.
- But **eval and `bptt_step` use `z150`**. EP's gradient never feels the T1 residual.
- As attention gets expressive → contraction weakens → `z150` drifts from `z*` → res@T1 inflates → EP
  estimate (valid only at small res) dies → blowup.
- BPTT differentiates the actual 150-step unroll → a non-converging unroll → bad CE → gradient implicitly
  rewards strong contraction. **That defend-the-residual term is what EP structurally lacks.**
- cos-0.977 holds only AT the fixed point; the missing perpendicular is ~21% (=√(1−0.977²)) = the
  finite-horizon transient gradient. (T1=∞ would make even BPTT lose it — BPTT's stability IS the finite truncation.)

Two layers: (1) point-mismatch (refinement) — `--t1max 150` alone pushed wall 2.09→2.05 but still blew@600;
(2) gradient-flavor — needs the explicit penalty.

Diverged state = forward bifurcation to a **limit cycle** (eval_relax: res floors ~6e-2, oscillates, 150→4000
steps don't help, CE ~3.7). FTLE stayed negative (−0.027..−0.050). ⇒ adaptive/more steps can't fix it (res as
stopping-criterion chases a vanishing fixed point); only res as a COST (penalty) prevents the drift.

Ruled out: jacreg (3% of grad, orthogonal, removing changes cos <0.001); GELU erf/tanh (both diverge identically);
the validity gate (removing it diverges FASTER); forward-stability/FTLE (stable-BPTT −0.0347 weaker than diverged-EP −0.0377).

## THE FIX: `--resreg` (T1-residual penalty) — exact edits to lt_ep_train.py / ep_step

1. signature: `def ep_step(..., corr_every=1, res_gate=0.0, resreg=0.0):`

2. right after `res_used = res` (BEFORE the `if t1max > T1:` refinement):
```python
    zT, resT1 = zs, res                           # the T1 free-phase state (what eval/BPTT use), BEFORE refinement
```

3. just before the final `return grads, res` (after the jacreg block):
```python
    if resreg > 0 and resT1 > 7e-4:                      # defend z_T1 (BPTT gets this implicitly; EP at z* doesn't)
        with torch.enable_grad():
            Fz = blk.tforce(zT, xin0)                     # deterministic thick force at z_T1 (params live, zT/xin0 detached)
            Rr = (eps * Fz).pow(2).sum() / (zT.pow(2).sum() + 1e-9)    # ~ (T1 residual)^2
            grr = torch.autograd.grad(Rr, blk.block, allow_unused=True)
        ratio = resreg * min(1.0, resT1 / 2e-2)          # linear in res, reaching resreg at res=2e-2 (capped); ~0.035*resreg at the 7e-4 gate
        gtask = math.sqrt(sum(float((grads[id(p)] ** 2).sum()) for p in blk.block if grads.get(id(p)) is not None) + 1e-20)
        gres = math.sqrt(sum(float((g ** 2).sum()) for g in grr if g is not None) + 1e-20)
        lam = ratio * gtask / gres                       # scale penalty to `ratio` of the task-grad norm
        for p, g in zip(blk.block, grr):
            if g is not None:
                grads[id(p)] = g * lam if grads.get(id(p)) is None else grads[id(p)] + lam * g
    return grads, res
```

4. argparse: `ap.add_argument('--resreg', type=float, default=0.0)`

5. the ep_step CALL in the training loop: append `, cfg.resreg` as the last arg.

**RUN IT WITH `--res_gate 0`** (else the gate bypasses the penalty). Keep `--t1max 300`.
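
The scaling rule in step 3 can be exercised without the model. A minimal sketch, assuming plain Python lists stand in for the per-parameter gradient tensors; `combine_with_penalty`, `l2_norm` and the toy vectors are hypothetical names for illustration, and only the 7e-4 gate, the 2e-2 ramp scale and the `ratio * gtask / gres` rule come from the edit above:

```python
import math

def l2_norm(vectors):
    """Global L2 norm over a list of flat gradient vectors."""
    return math.sqrt(sum(x * x for v in vectors for x in v) + 1e-20)

def combine_with_penalty(task_grads, pen_grads, res_t1, resreg,
                         gate=7e-4, ramp_scale=2e-2):
    """Add the penalty gradient, rescaled to a fraction of the task-gradient norm.

    No penalty at or below the gate; otherwise
    lam = resreg * min(1, res / ramp_scale) * |g_task| / |g_pen|.
    """
    if resreg <= 0 or res_t1 <= gate:
        return [list(g) for g in task_grads], 0.0
    ratio = resreg * min(1.0, res_t1 / ramp_scale)
    lam = ratio * l2_norm(task_grads) / l2_norm(pen_grads)
    combined = [[t + lam * p for t, p in zip(tg, pg)]
                for tg, pg in zip(task_grads, pen_grads)]
    return combined, lam

# Toy gradients (illustrative only): |g_task| = 5, |g_pen| = 1.
task = [[3.0, 0.0], [0.0, 4.0]]
pen = [[0.0, 1.0], [0.0, 0.0]]

_, lam_small = combine_with_penalty(task, pen, res_t1=5e-4, resreg=0.2)  # below gate
_, lam_mid = combine_with_penalty(task, pen, res_t1=1e-2, resreg=0.2)    # mid ramp
_, lam_cap = combine_with_penalty(task, pen, res_t1=1e-1, resreg=0.2)    # capped
print(lam_small, lam_mid, lam_cap)
```

Because `lam` is normalised by both norms, the added term never exceeds `resreg` times the task-gradient norm, whatever the raw size of the penalty gradient.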

## Validation status (INTERRUPTED — must re-run)
`ep_resreg2` (res_gate=0, resreg=0.2, warm from erf-2.09 ckpt, t1max=300): held res **1–5e-4 through step 1000**,
best **2.0573** (lowest any EP run reached, past the wall), zero inflation. Deleted by /tmp wipe before reaching ~1.8.
No-penalty variants for contrast: ep_nogate blew@100, gated ep_resreg@200, ep_t1max150 reached 2.05 then blew@600.

## What was lost in the /tmp wipe (must rebuild)
- Code: lt_ep_train.py (with all post-06-14 edits: --resreg, --gelu, --probe_bptt, --wsd, --lcap/--linf, empty_cache,
  tracking-AEP), plus the probes (grad_decomp.py, ftle_probe.py +--gelu, eval_relax.py, cos_now.py, train_lyap.py).
  Base (06-14) survives in `/home/yurenh2/ept/lt_ep_code/`; re-apply edits from this doc + the session transcript
  `/home/yurenh2/.claude/projects/-home-yurenh2-ept/46bc4d43-*.jsonl`.
- ALL checkpoints: orphan (ep_C512_bpe.best.pt 2.40), freeze_wsd.pt (BPTT 1.72), ep_frzjr_wsd.pt (2.09),
  ep_resreg2, ep_tanh_frz — gone (H200 backup source 157.66.255.42 is also down). Must retrain from scratch.
- Data tinystories_bpe: regenerate via `lt_ep_code/prepare_tinystories_bpe.py`.

## Re-run plan
1. Reconstruct lt_ep_train.py = lt_ep_code/06-14 base + the edits above (+ the other recent flags).
2. Regenerate tinystories_bpe data. **Work entirely under /home/yurenh2/ept — NEVER /tmp.**
3. Retrain orphan (~20k) → freeze_wsd (BPTT WSD 16k, expect ~1.72) → EP+resreg (res_gate=0 resreg 0.2, WSD).
4. Confirm EP+resreg descends past 2.09 toward ~1.8 (BPTT ceiling). Sweep resreg 0.2–1.0 if it plateaus.
5. NEXT-ROUND code fixes (genuine bugs, kept this round only for comparability with the diverging baselines):
   - GELU consistency: model force uses erf (F.gelu) but holo nudge (cgelu/rforce/cforce) uses tanh-form → nudge fixed point ≠ z* → ~4.7e-4 force error injects a spurious common-mode drift each step (estimator bias). FIX = `--gelu tanh` (make the model tanh too; the holo complex extension REQUIRES tanh = entire fn, so consistency means everything tanh). NOTE: fixing it does NOT fix the below-2.10 divergence (tested 6/18, that's the residual-defense issue) — but it's still a real bug worth fixing.
   - cforce (holo_ep.py:41-45) omits the q/k RMSNorm that real attention applies → biases holo_a/holo_a_select with --qknorm (not the --track path, but fix anyway).
   - Stale argparse defaults (footguns, harmless when flags passed explicitly): --attn_mode default 'real' (should be the intended mode), --data default '/tmp/lt_ep/data/shakespeare_char' (line ~320). Also tighten resreg scaling to pure task-grad norm (currently task+jacreg, ~3% off).

---
## 2026-06-21 session update — rebuild after /tmp wipe + validation IN PROGRESS

REBUILD (all under /home/yurenh2/ept/ep_run, never /tmp):
- Code ep_run/lt_ep_train.py = lt_ep_code (a recent ~6/20 snapshot despite its misleading 6/14 mtime — it already has res_gate/gelu/t1max/track/resinit/full ep_step+holo) + my 3 edits (--resreg block, --wsd schedule, line-20 DD data-path fix). `diff` verified clean; holo_ep.py byte-identical to base.
- Bug-check forensics: holo_ep.py last edited 2026-06-13 -> the estimator (holo_a_track, AEP correction, rforce) is COMPLETE/not stale. Post-6/14 lt_ep_train edits NOT carried over are all diagnostic/other-run-type (--probe_bptt,--probe_branch,--snap,--cprewarm,--specnorm/--linf, bptt_step sep-mode, empty_cache) — none touched by the EP+resreg+track+frozen-jr run.
- Data: regenerated tinystories_bpe to ep_run/data/ (361M train tokens, vocab 4096; val=tail-15MB of train.txt, ~0.8% overlap — fine for tracking divergence/descent).

VALIDATION RUN (from random init; pid 696706 GPU1):
  python3 lt_ep_train.py --mode ep --attn_mode thick --B 24 --C 512 --H 16 --T 256 --c 1.0 \
   --jacreg 0.1 --jr_floor 0.1 --jr_max 0.1 --holo 2 --hr 0.02 --t2sel 40 --track --pema 0.999 --t1max 300 \
   --res_est 1e-4 --res_gate 0 --resreg 0.2 --qknorm --resinit 0.1 --warmup 800 --T1 150 --T2 20 --lr 6e-4 \
   --wsd 0.25 --steps 32000 --log 200 --save_every 500 --abort_res 0.3 --data data/tinystories_bpe \
   --ckpt runs/ep_resreg_scratch.pt --state runs/ep_resreg_scratch.state
  Status @step 6400 (~20%, 8h, 0.23 it/s): val 2.5070 descending (FASTER than orphan: 2.51@6.4k vs orphan's 2.40@20k). res BOUNDED ~1-2e-2 by resreg (spikes to 2e-2 get pulled back to ~4e-3 — resreg WORKING, not running away like the no-resreg baselines that hit res 0.15->abort@2.09). Decisive break-2.09 test ~10h out; ~1.8 needs the WSD lr-decay (step 24k+). Watcher key: does res stay bounded through 2.09.

resreg CONTAMINATION probe (ep_run/resreg_probe.py, at step-3600 ckpt, res 1e-2):
- cos(VF,BPTT)=0.7166 vs cos(VF+resreg,BPTT)=0.7108 -> delta -0.006: resreg DOES directly contaminate (small).
- cos(resreg_dir,BPTT)=-0.047 -> resreg is ~ORTHOGONAL to the true gradient; it is NOT adding back BPTT's residual-defense term (my earlier hypothesis was WRONG). It is a roughly-orthogonal constraint direction.
- NET still favorable: the tiny -0.006 direct cost (bounded by the 0.2 ratio cap) buys keeping res low so the EP estimate doesn't COLLAPSE (cos 0.72 held vs ->0 runaway+diverge).
- CAVEAT the probe also exposes: the EP estimate itself is only 0.72-aligned at res 1e-2 (vs 0.98 at res 1e-4) -> descent on a mediocre gradient -> MAY plateau above BPTT's 1.72.
- If it stalls >~1.85 or res creeps past 2e-2: retune tighter resreg (ramp-to-full by res~3-5e-3 instead of 2e-2) and/or stronger early contraction (frozen jr=0.1 from scratch is weaker than the orphan's adaptive jr which held res~1e-4).

CODEX review of rebuilt code (verdict triaged): the 2 "High" bugs are STALE-DEFAULT footguns, HARMLESS here (run passes --attn_mode thick + --data explicitly, verified via /proc): line 304 --attn_mode default 'real'; line 320 --data default '/tmp/lt_ep/data/shakespeare_char'. GELU (erf model vs tanh nudge, ~4.7e-4 force err) + cforce-missing-qknorm = real bugs but status-quo (kept this round for comparability) -> next-round fixes (point 5 above). Codex CONFIRMED CORRECT: AEP contrast sign (not flipped/doubled), Wh readout grad, jacreg, the resreg sign/logic + res_gate=0 gate-bypass, frozen-jr clamp@0.1, WSD schedule.

COST (EP vs same-param BP): per-step EP on the equilibrium block ~= ~100x a depth-1 BP transformer in DIGITAL SIM — because the block relaxes ~150-300 steps (effective depth); even exact-grad BPTT is ~100x depth-1 BP. EP/BPTT ~= 1.5x (EP 0.23 it/s vs freeze_wsd BPTT 0.34 it/s). The ~100x is the cost of SIMULATING the analog relaxation on a GPU; on analog relaxation hardware the free phase is physical settling (~free) + EP's local update needs no backward -> EP competitive/better than BP there. Per-quality the gap is smaller (equilibrium block beats same-param standard transformer, rule tax 0.041); EP memory << BPTT (no unrolled graph).

DELIVERABLE: method-intro PDF /home/yurenh2/ept/ept_method_intro.{tex,pdf} (9pg, EP-literate audience; AEP/holomorphic/CET/tracking-AEP background; strict attribution; results honest+ongoing-marked; compiled with tectonic since conda pdflatex is broken). VERIFY before external use: the post-Jan-2026 citations (Scurria arXiv:2602.03670 AsymEP; CET=Hoier/Kerjan/Scellier ICLR2026 AM workshop OpenReview Qrfml76eWJ; Laborieux 2021) + Høier-vs-Hoier spelling.

---
## 2026-06-21 (cont.) — divergence is CONTRACTION loss, not res; the missing BPTT term MEASURED
From-scratch EP (c=1, frozen jr=0.1, track, resreg=0.2) diverged at val 2.25 (ABOVE the 2.09 wall):
res 1.1e-2 -> 1.7e-1 -> 3.4e-1 then COLLAPSED to a trivial fixed point (res 1.4e-8, val 10). So it is a
bifurcation: good fixed point destroyed -> brief limit cycle -> degenerate attractor. NOT a numerical blowup.

MEASURED the missing term M = g_BPTT - g_EP at the pre-blowup ckpt (step 9200, res~1e-2; resreg_probe.py):
  cos(g_EP, g_BPTT) = 0.557 (decayed from 0.72 @ step3600 at the SAME res -> estimate degrades as operator hardens)
  |M|/|BPTT| = 0.84,  cos(M,BPTT) = +0.91,  cos(M,EP) = +0.16,  cos(M,resreg) = -0.10
  (|EP|=0.94, |BPTT|=2.2 -> EP captures only ~42% of the true gradient, 56 deg off, near the instability)
=> EP is missing 84% of the true gradient (M ~ the BPTT direction). resreg points AWAY from M (-0.10):
   resreg is NOT the missing term and NOT a weak proxy -- it is the wrong direction. resreg retired as the fix.
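
The figures quoted above can be cross-checked against each other using only the law of cosines. A small sketch, assuming M = g_BPTT - g_EP exactly and taking the three rounded inputs (|EP|, |BPTT|, cos(g_EP, g_BPTT)) at face value; the variable names are illustrative:

```python
import math

# Rounded values quoted in the measurement above.
norm_ep, norm_bptt, cos_ep_bptt = 0.94, 2.2, 0.557

dot = norm_ep * norm_bptt * cos_ep_bptt                     # g_EP . g_BPTT
norm_m = math.sqrt(norm_bptt**2 + norm_ep**2 - 2.0 * dot)   # |g_BPTT - g_EP|

rel_m = norm_m / norm_bptt                                  # |M| / |BPTT|
cos_m_bptt = (norm_bptt**2 - dot) / (norm_m * norm_bptt)    # cos(M, BPTT)
cos_m_ep = (dot - norm_ep**2) / (norm_m * norm_ep)          # cos(M, EP)
angle_deg = math.degrees(math.acos(cos_ep_bptt))            # angle between EP and BPTT
norm_ratio = norm_ep / norm_bptt                            # |EP| / |BPTT|

print(f"|M|/|BPTT|={rel_m:.3f}  cos(M,BPTT)={cos_m_bptt:.3f}  "
      f"cos(M,EP)={cos_m_ep:.3f}  angle={angle_deg:.1f} deg  |EP|/|BPTT|={norm_ratio:.3f}")
```

The derived values should agree with the reported 0.84, +0.91, +0.16, 56 deg and ~42% to roughly two decimals, which suggests the probe numbers are internally consistent rather than independently noisy.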

CODEX THEORY: correct state variable = contraction margin mu_P(J) (one-sided-Lipschitz/log-norm), NOT res
(lagging symptom; cos decays at fixed res) and NOT spectral radius (non-normal, |Jv-Jtv|/|Jv|~1.4 -> jacreg
targets the WRONG quantity). BPTT's stabilizer = the finite-horizon gradient term (the (I+eps*J)^k products that
ALSO govern contraction -> automatic contraction feedback); EP graded at refined z* structurally discards it.
The fix that (i) preserves the objective [no loss force] and (ii) runs on analog [no digital root-finder] =
CONTRACTION-BY-CONSTRUCTION (mu_P(J) < -alpha). KICKER: analog HW can never do BPTT -> can never access M ->
structural contraction is the FORCED hardware design constraint, not merely the cleanest fix.

EXPERIMENT (test causality + the fix; both with resreg=0, no residual term):
 (A) c-bump c=1->3 from scratch [runs/ep_c3_scratch] -- does stronger structural damping alone prevent divergence?
 (B) HARD spectral-norm contraction constraint --specnorm (codex implementing) -- warm from preblow_9200 + from scratch.
 If divergence vanishes with NO residual term -> contraction-loss confirmed as root; principled fix validated.

---
## 2026-06-21 (cont. 2) — the equivalence-condition diagnosis (deep paper read: Ernoult2019 + AsymEP + LZ-JacobianHomeostasis)
EP≡BPTT (Ernoult 2019, arXiv:1905.13633, Thm1/Eq16) needs: H1 symmetric Jacobian J=Jᵀ (Eq51 transpose-cancellation = load-bearing);
H2 free phase EXACTLY at fixed point s*=s_T=..=s_{T-K}; H3 β→0. Ernoult gives NO contraction-rate truncation bound (just assumes
convergence; my earlier "contraction-rate condition" guess was WRONG). BPTT needs NEITHER H1 nor H2 (differentiates the actual
finite-T graph) -> that's why BPTT sails to 1.7 while EP can't.

OUR RUN VIOLATES BOTH, via the SAME quantity S^-1 A (S,A = sym/antisym parts of J; our |Jv-Jᵀv|/|Jv|≈1.4 IS ‖S^-1 A‖≈1.4):
- H2 (dominant, first-order): res~1e-2 -> not at fixed point -> EP & BPTT seeded by different states. "EP≈BPTT only at res≲1e-4,
  noise at 1e-2" is textbook H2 failure.
- H1 (makes it sudden, sets M): both AsymEP (Scurria arXiv:2602.03670) and LZ (Laborieux-Zenke "Jacobian Homeostasis"
  arXiv:2309.02214, ICLR2024) reduce the asymmetry bias to S^-1 A. LZ Eq11: EP error = δ - 2 S^-1 A δ, valid only ‖S^-1 A‖→0.
  At 1.4 the map (S+A)^-1(S-A) → near-rotation (eigs→-1): right direction, lost magnitude = our cos(M,BPTT)=0.91, |M|=84%.

KEY NEW REALIZATION: our AEP (J→Jᵀ) correction is OUT OF WARRANTY. It is FIRST-ORDER in S^-1 A and assumes ‖S^-1 A‖≪1
(AsymEP error series Σ(S^-1 A)^{2n+1} converges only if ρ(S^-1 A)<1; they never state it). At 1.4 it may be O(1) wrong. AND it
is a FIXED linearization at x̄⁰, so res~1e-2 -> wrong anchor -> H2 POISONS the AEP correction too. The cos-0.99 we trust holds
ONLY at small-asymmetry + tight-convergence — the regime we leave.

THE GAP, PRECISE: NOT "can't converge". Even at perfect convergence, at ‖S^-1 A‖=1.4 the H1 bias is O(1), and the BPTT-1.7
solution NEEDS that non-normality. So EP's valid regime (‖S^-1 A‖≪1) and the expressive regime (≈1.4) DON'T OVERLAP. Structural.

REFRAMES THE CONTRACTION EXPERIMENTS: specnorm/c3 cap ‖J‖/contraction, NOT the asymmetry ratio ‖A‖/‖S‖ that drives the bias.
The literature's correct lever = Jacobian Homeostasis (LZ Eq12: penalize ‖A‖² -> J functionally symmetric, works w/ untied weights)
= the principled version of our jacreg — BUT it caps exactly the non-normality the 1.7 solution needs (expressivity tax; confirms
the "crutch" worry). Holomorphic nudge only removes finite-β bias, NOT the H1/H2 bias (so β is NOT our dominant error).

THE ONE PATH WITHOUT THE EXPRESSIVITY TAX (novel): the AEP correction is first-order; a HIGHER-ORDER AEP (more Neumann terms in
S^-1 A) extends the warranty past ‖S^-1 A‖~1 WITHOUT capping non-normality — fix the gradient, not the model. Attacks H1 fundamentally.

NEXT MEASUREMENTS (cheap, on preblow_9200 ckpt): (1) measure ‖S^-1 A‖ / ρ(S^-1 A) DIRECTLY (not the 1.4 proxy) and track over
training (warranty gauge; predict it grows toward the tip). (2) test M ≈ -2 S^-1 A · g_BPTT — if yes, the asymmetry bias IS the gap
(not convergence) and higher-order AEP is the lever. Sources: Ernoult 1905.13633; Scurria 2602.03670; LZ 2309.02214 + Holo 2209.00530.

---
## 2026-06-21 (cont. 3) — CORRECTION: the "AEP out of warranty / ρ(S⁻¹A)<1" framing (cont.2) was MIS-ATTRIBUTED
Deep read of Scurria 2026 (arXiv:2602.03670, verbatim Eq.24/47) overturns cont.2:
- Our AEP = the paper's AsymEP (single-state antisym correction −2A(z−z*); holo_a_track re-anchors A at the common mode z̄).
  It is EXACT at β→0 for ARBITRARY asymmetry: Eq.24 `J_FA(x̄⁰)=J−2A=Jᵀ` is an algebraic IDENTITY (any A magnitude).
- The ρ(S⁻¹A)<1 / Neumann series Σ(S⁻¹A)^{2n+1} (Eq.47) is the **VF BASELINE's** error (uncorrected EP), which AsymEP REMOVES.
  We + GPT-Pro (following Laborieux-Zenke, whose first-order −2S⁻¹Aδ analyzes UNCORRECTED EP) MIS-ATTRIBUTED the VF error
  to AsymEP. It is NOT a restriction on our correction.
- "AsymEP = first-order projection of Dyadic EP" = DIMENSIONAL projection (2n→n), NOT first-order-in-β. Both exact at β→0.
- Dyadic EP (Scurria §4) = a SEPARATE 2n doubled-field method (aux field z′ + augmented energy H=−(z−z′)ᵀF((z+z′)/2));
  we DON'T implement it. It buys finite-displacement robustness + a real energy, NOT extra β→0 exactness. ⇒ optional upgrade, NOT the fix.
- ⇒ the asym_probe ρ(S⁻¹A)=3.3 result answers the WRONG question (VF's error, + measured at a no-fixed-point pathological ckpt). DISCARD as a validity verdict.
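
The claim that `J−2A=Jᵀ` holds for any asymmetry magnitude is pure linear algebra and can be checked directly. A minimal sketch on hand-picked 2×2 matrices (the matrices and helper names are illustrative assumptions; this checks only the identity, not the EP estimator or `holo_a_track`):

```python
def transpose(m):
    """Transpose of a square matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def antisymmetric_part(j):
    """A = (J - J^T) / 2."""
    jt = transpose(j)
    n = len(j)
    return [[0.5 * (j[r][c] - jt[r][c]) for c in range(n)] for r in range(n)]

def corrected_jacobian(j):
    """J - 2A, the single-state antisymmetric correction."""
    a = antisymmetric_part(j)
    n = len(j)
    return [[j[r][c] - 2.0 * a[r][c] for c in range(n)] for r in range(n)]

def max_abs_diff(x, y):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(a - b) for rx, ry in zip(x, y) for a, b in zip(rx, ry))

mild = [[-1.0, 0.1], [-0.1, -1.0]]       # antisymmetric part much smaller than symmetric part
strong = [[-1.0, 50.0], [-50.0, -1.0]]   # antisymmetric part dominates

for name, j in (("mild", mild), ("strong", strong)):
    print(name, max_abs_diff(corrected_jacobian(j), transpose(j)))
```

Nothing in the identity depends on the size of A relative to S, which is the point of the correction in this section.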

REAL DIAGNOSIS (re-centered): the ONLY fundamental obstacle is **H2 = free-phase convergence/contraction**. AsymEP handles
asymmetry EXACTLY *given convergence*. The cos-0.557 / M-84% gap was measured at res 1e-2 (NOT converged) ⇒ it's H2, not H1.
Asymmetry is only "upstream" (expressive attention → weak contraction → H2 fails). FIX DIRECTION = contraction/convergence
defense (specnorm / damping / monotone / residual-defense) — VALIDATED. The cont.2 "asymmetry/ρ/AEP-warranty" line was a wrong turn.

EMPIRICAL CHECK (running): oracle-adjoint probe — at a HEALTHY converged ckpt (specnorm, res~1e-4): is
g_EP(AsymEP) ≈ g_transpose(exact adjoint via GMRES Jᵀλ=−ℓ) ≈ g_BPTT? cos→1 confirms AsymEP exact + problem = convergence.
(The earlier 0.557 was the pathological non-converged point.)

---
## 2026-06-21 (cont. 5) — ROOT-CAUSE CANDIDATE: --hr 0.02 is 10x too small (should be 0.2) -> noisy EP gradient
oracle-adjoint probe (asym_probe --diag) on a PERFECTLY converged ckpt (res 1.5e-9): cos(g_EP, g_transpose[=exact
adjoint=g_BPTT, cross-checked cos 1.0]) = 0.63 mean over 6 batches (SYSTEMATIC, not a bad batch). BETA/r SWEEP: cos gets
WORSE as beta->0 (hr 0.04->0.86, 0.02->0.72, 0.005->0.38, 0.002->0.15) = 1/beta noise-amplification of the nudge-convergence
error (NOT finite-beta bias; NOT a holo/track/t2sel component -- plain EP ~= FULL ~0.72). Sweeping hr UP: 0.06->0.85,
0.1->0.95, 0.15->0.98, **0.2->0.986**, 0.3->0.991. => at the DOCUMENTED hr=0.2 (METHODS line 130) the estimator is FINE
(cos 0.986, in the historical 0.85-0.99 range). Our runs (this whole session + the EP_BELOW210 "diverging recipe") used
--hr 0.02 = 10x too small -> noise regime -> 0.72-cos gradient -> plausibly THE destabilizer of the below-2.10 divergence.
CODE IS CORRECT; mis-set parameter (hr regression from the validated 0.2). FIX = --hr 0.2. Decisive test: pure EP with
--hr 0.2 (runs/ep_hr02) -- does it stop diverging? The probe was correct; the convergence/asymmetry/Dyadic theory dive was
largely chasing a symptom of the mis-set hr. (NOTE: also revisit t2sel=40 vs documented 120, secondary.)
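
The 1/beta amplification described above is the generic behaviour of any finite-difference style estimator whose perturbed evaluation carries a fixed error. A toy sketch, assuming a scalar function with a small curvature `c` and a constant evaluation error `delta` standing in for the nudge-convergence error; none of these names or numbers come from the EP code:

```python
def nudged_estimate(beta, delta, c=0.1):
    """One-sided difference (f(beta) + delta - f(0)) / beta for f(x) = x + 0.5*c*x^2.

    The true derivative at 0 is 1. `delta` models an error in the nudged
    evaluation that does not shrink with beta.
    """
    f_beta = beta + 0.5 * c * beta * beta
    return (f_beta + delta - 0.0) / beta

def estimate_error(beta, delta, c=0.1):
    """Absolute error of the estimate: finite-beta bias plus delta / beta."""
    return abs(nudged_estimate(beta, delta, c) - 1.0)

delta = 2e-3
for beta in (0.002, 0.02, 0.2):
    print(beta, estimate_error(beta, delta))
```

In this toy the error is dominated by `delta / beta` at small `beta`, so shrinking the nudge makes the estimate worse, not better, until the evaluation error itself is reduced.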

## 2026-06-22 (cont. 6) — DECISIVE: divergence is STRUCTURAL (equilibrium gradient is blind to finite-time contraction). hr & t2sel are NOT the fix.
Deliverables this round: `ep_run/GPT55_BUG_HUNT.md`, `ep_run/CODEX_VERDICT.md`, `ep_run/FWD_MODE_DESIGN.md` (multi-agent: gpt-5.5 + fugu-ultra via codex).

**cont.5's "hr=0.2 fixes it" was WRONG/incomplete.** ep_hr02 (--hr 0.2) STILL diverged (step 2800, val 2.87). hr is a real gradient-QUALITY knob but not the divergence fix.

**Clean chain (all on a fixed val batch):**
1. **BPTT deconfounder:** BPTT (exact grad) → CE **1.83**, free-phase contraction ratio **ρ≈0.982**, res tight ~1e-3 throughout. Divergence is **EP-specific**; a contractive low-loss solution exists.
2. **The "0.94 estimator ceiling" = ADJOINT-PHASE TRUNCATION, not hr/β noise, not fundamental.** At a trained, tightly-converged operator (free res 2.2e-5): hr-ceiling sweep cos(g_EP,g_transpose) **flat ~0.94 across hr 0.04–0.8**; t2sel sweep (fixed hr) cos **0.74→0.998 as t2sel 10→160**. The code refines the FREE phase but never converges the NUDGED/adjoint phase (gpt-5.5 bug-hunt #1,#9). So EP CAN compute the near-exact equilibrium adjoint (0.998) at t2sel≥160. cos(g_transpose,g_BPTT)=0.975@res2e-5, =1.0@res1e-9 (finite-horizon gap, res-dependent).
3. **Bifurcation probe (mechanism):** as EP trains, free-phase **ρ drifts→1** (0.989@val4.96→0.998@val2.74); redx baseline blew step3300 (CE 2.74→41). **Damping does NOT rescue** — ρ-vs-c on the marginal op: more c makes ρ WORSE (0.998→0.9999); on BPTT's op more c helps (0.982→0.96). EP drives weights into a **damping-resistant marginal-contraction region** BPTT avoids. Sudden blowup because **res_T1≈ρ^150** is hyper-sensitive near ρ=1 (gradual ρ-drift + ρ^150 = sudden bifurcation).
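
The hyper-sensitivity of `ρ^150` near ρ=1 is easy to tabulate. A short sketch, assuming the free phase behaves as a pure geometric contraction with a single ratio ρ (the same idealisation as `res_T1≈ρ^150`); the function names are illustrative:

```python
def residual_fraction(rho, steps=150):
    """Fraction of the initial residual left after `steps` contractions by rho."""
    return rho ** steps

def sensitivity(rho, steps=150):
    """d(rho^steps)/d(rho): how strongly the T1 residual reacts to a drift in rho."""
    return steps * rho ** (steps - 1)

# Contraction ratios quoted in the chain above.
for rho in (0.96, 0.982, 0.989, 0.998, 0.9999):
    print(f"rho={rho:<7} rho^150={residual_fraction(rho):.4f}  "
          f"d/drho={sensitivity(rho):.1f}")
```

A drift in ρ of under 2% (0.982 to 0.998) moves the surviving residual fraction by more than an order of magnitude, which is how a gradual ρ-drift can look like a sudden blowup at fixed T1.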

**VERDICT (codex/gpt-5.5, PROVEN; see CODEX_VERDICT.md): (b) STRUCTURAL.** Equilibrium gradient differentiates L(z*); BPTT differentiates deployed L(z_T), T=150. The equilibrium gradient `grad=L_θ+λᵀF_θ` (F_zᵀλ=−L_z) **lacks the finite-horizon contraction-defense term**. Scalar proof: F=−k(z−z*) → ∀k, z* identical → **dL(z*)/dk=0** (blind to contraction rate), but z_T−z*=(1−εk)^T(z₀−z*) → **dL(z_T)/dk ∝ T(1−εk)^{T−1}** (BPTT sees k; huge near ρ=1). A param that erodes contraction with z* fixed has ZERO equilibrium gradient but large BPTT gradient. **No t2sel/hr adds a term absent from the fixed-point gradient.** t2sel=160 only makes EP a better *equilibrium* gradient = still wrong objective → ep_t2fix predicted to still blow.
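
The scalar argument above can be reproduced numerically. A minimal sketch, assuming a quadratic loss and arbitrary toy values for `z0`, `z_star` and `eps` (all hypothetical); it compares the equilibrium gradient, a finite-difference gradient through the 150-step unroll, and the closed form:

```python
def relax(k, z0=1.0, z_star=0.3, eps=0.1, steps=150):
    """Explicit-Euler relaxation z <- z + eps * F(z) with F = -k * (z - z_star)."""
    z = z0
    for _ in range(steps):
        z = z + eps * (-k * (z - z_star))
    return z

def loss(z, target=0.0):
    return 0.5 * (z - target) ** 2

def grad_equilibrium(k, z_star=0.3, h=1e-6):
    """Finite difference of L(z*): the fixed point does not depend on k (for k > 0)."""
    fixed_point = lambda kk: z_star
    return (loss(fixed_point(k + h)) - loss(fixed_point(k - h))) / (2 * h)

def grad_finite_horizon(k, h=1e-6, **kw):
    """Central finite difference of L(z_T) through the unrolled relaxation."""
    return (loss(relax(k + h, **kw)) - loss(relax(k - h, **kw))) / (2 * h)

def grad_analytic(k, z0=1.0, z_star=0.3, eps=0.1, steps=150):
    """L_z(z_T) * dz_T/dk with dz_T/dk = -eps*T*(1-eps*k)^(T-1) * (z0 - z_star)."""
    z_t = z_star + (1 - eps * k) ** steps * (z0 - z_star)
    dz_dk = -eps * steps * (1 - eps * k) ** (steps - 1) * (z0 - z_star)
    return z_t * dz_dk

for k in (1.0, 0.02):   # rho = 1 - eps*k = 0.9 and 0.998
    print(k, grad_equilibrium(k), grad_finite_horizon(k), grad_analytic(k))
```

At the weakly contracting setting the finite-horizon gradient with respect to `k` is of order one while the equilibrium gradient is exactly zero, which is the asymmetry the verdict relies on.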

**FIX (physically realizable; FWD_MODE_DESIGN.md): change the OBJECTIVE to finite-horizon — forward-mode/RTRL eligibility traces.** Forward-relax T=150, propagate `e_{t+1}=Φ_z(z_t)e_t+ε·∂F_θ/∂p` alongside the state, update `Δp ∝ −L_z(z_T)ᵀe_T`. Forward-only, local, NOT reverse BPTT, NOT a digital solver — keeps the contraction-defense terms. Exact RTRL O(P·dimz) → practical: UORO / random-direction directional-deriv / blockwise low-rank. Alt: explicit contraction homeostat (penalize log-norm μ(J)).
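
On the same scalar toy, the eligibility-trace recursion can be checked against the closed-form sensitivity. A sketch under stated assumptions: F = −k(z − z*), a single parameter p = k, Φ_z = 1 + ε·∂F/∂z, and made-up values for `z0`, `z_star` and `eps`; it is an illustration of the recursion, not the rule in FWD_MODE_DESIGN.md:

```python
def forward_trace(k, z0=1.0, z_star=0.3, eps=0.1, steps=150):
    """Relax the state and carry e_t = dz_t/dk forward alongside it."""
    z, e = z0, 0.0
    for _ in range(steps):
        dF_dz = -k                      # Jacobian of F at z_t
        dF_dk = -(z - z_star)           # direct parameter sensitivity at z_t
        e = (1.0 + eps * dF_dz) * e + eps * dF_dk   # e_{t+1} = Phi_z e_t + eps * dF/dk
        z = z + eps * (-k * (z - z_star))           # z_{t+1} = z_t + eps * F(z_t)
    return z, e

def analytic_sensitivity(k, z0=1.0, z_star=0.3, eps=0.1, steps=150):
    """Closed form dz_T/dk = -eps * T * (1 - eps*k)^(T-1) * (z0 - z_star)."""
    return -eps * steps * (1 - eps * k) ** (steps - 1) * (z0 - z_star)

k = 0.02
z_T, e_T = forward_trace(k)
grad_k = (z_T - 0.0) * e_T      # L_z(z_T) * e_T for L = 0.5 * z^2
step_k = -grad_k                # direction of the update, Delta p proportional to -L_z e_T
print(z_T, e_T, analytic_sensitivity(k), step_k)
```

The trace is built in a single forward pass with no stored trajectory, and here it makes `k` increase (stronger contraction) because that lowers the finite-horizon loss.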

**In-flight confirmations (2026-06-22):** ep_t2fix (t2sel=160, runs/ep_t2fix.log + t2fix_rho.log) — predicted ρ→1 + blow; oracle-adjoint training (GMRES exact adjoint each step, warm-start redx s2000, runs/oracle_adjoint.log) — DEFINITIVE (b) test (if even the exact equilibrium grad blows → (b)); forward-mode rule design = the fix.
**Tooling:** Sakana fugu-ultra wired into the codex plugin (broker spawn patched: `-c model=fugu-ultra -c model_provider=sakana`, key ~/.codex/sakana.key); gpt-5.5 via direct `codex exec`.

## 2026-06-22 (cont. 7) — CORRECTION: the divergence is a forward LIMIT CYCLE = a DISCRETE-EULER ARTIFACT (continuous/analog system is stable). cont.6's (b)-structural/forward-mode verdict is SUPERSEDED.
Measured on genuine this-session ckpts (full dossier: `ep_run/EP_DIAGNOSIS_DOSSIER.md`; fugu Q1-Q4: `ep_run/FUGU_VERDICT_FULL.md`).
1. **It's a LIMIT CYCLE, no fixed point.** eval_relax on the marginal ckpt redx **s3200** (val 2.74): relax from embedding 6000 steps → res FLOORS ~2.3e-2 and OSCILLATES (tail 2.08-2.73e-2, non-monotone), no fixed point. Reproduces the old (lost-run) finding on a genuine ckpt. ⇒ there is **no z\*** at the diverging operator → cont.6's "equilibrium gradient L(z\*) blind to contraction → forward-mode RTRL" (assumed z\* exists) is **moot here**. The bifurcation-probe "ρ=0.998 / slow convergence" was a **transient artifact** (caught the initial decay, missed the floor+oscillation).
2. **Attention drives the cycle.** Knockout WO×α: α=1.0 cycle (res 2.5e-2), 0.7 smaller, 0.4 nearly gone, **0.2 CONVERGES (res 3e-4, true fixed point)**. Non-conservative attention is causally responsible.
3. **DECISIVE — it's a DISCRETE-EULER ARTIFACT.** ε-sweep on s3200 (εN fixed = same settling): res-floor + oscillation shrink **monotonically** as ε↓ — 0.1→2.3e-2(cycle), 0.05→7e-3, 0.03→3e-3, 0.02→2e-3, **0.01→8.9e-4 CONVERGED**. ⇒ the cycle is the **explicit-Euler instability of a stiff rotating attention mode** whose CONTINUOUS eigenvalue is **Re(μ)<0 (stable)**: `|1+εμ|>1` at ε=0.1 but `<1` at ε=0.01. **So the below-2.1 "divergence" is largely a DIGITAL-SIM INTEGRATION ARTIFACT (ε=0.1 explicit Euler too coarse), NOT a fundamental non-conservative instability — and ANALOG hardware (continuous relaxation, ε→0) would NOT have it.** Big positive for the thesis.
- **fugu-ultra (Q1-Q4)** concurred + was appropriately cautious: confirmed attention-driven oscillatory non-convergence, flagged the Euler-artifact alternative (now confirmed), said "Hopf" needs the eigenvalue-of-M crossing to be measured. Fix (Q2): finer integration (smaller ε / implicit/Anderson) + **adaptive jacreg homeostat** (the validated-2.40 mechanism, which the diverging runs *froze* weak — the regression) + **res_gate ON** + qknorm/WO-gain rails + monitor TAIL oscillation (a short residual probe is what fooled us). Thesis (Q3): sub-threshold attention IS expressive enough (BPTT 1.83 proves stable expressive solutions exist), threshold ≈ α_c 0.3; tradeoff real but not fundamental. Primitive (Q4): keep below the (discrete) instability — the code (`ep_step`/holo) REQUIRES a stationary z\*; training *through* the cycle is invalid → gate updates when res says equilibrium lost.
- **CAVEAT:** the ε-sweep proves the *discrete* artifact at s3200 (continuous-stable there); a *true* continuous Hopf could still emerge deeper in training. **Decisive confirmation (running): `ep_eps05`** = warm-start s2000 (pre-cycle, val 3.13) + **ε=0.05 (T1=300)** → does it clear redx's 2.74 blow point without cycling? Files: eval_relax_s3200.py / knockout_s3200.py / eps_sweep_s3200.py.

## 2026-06-23 — RESULT: ε-TRAINING-MONOTONICITY CONFIRMS THE MECHANISM (wall moves with ε).
Three training runs, identical recipe except ε:
- ε=0.1, t2sel=40 (redx): blew @ best **2.74**
- ε=0.1, t2sel=160 (ep_t2fix): blew @ best **3.02** — the BETTER gradient (cos 0.998 vs 0.84) blew EARLIER ⇒ gradient quality is NOT the lever (a cleaner gradient if anything drives into the stiff regime faster).
- ε=0.05 (ep_eps05, warm-start s2000): blew @ best **2.41** — cleared *past* redx's 2.74 with res pinned ~5e-3 the whole way down (2.74→2.45), then res inflated 1.3e-2→2.0e-2→4.6e-2→val13 @step3100.

**⇒ smaller ε → strictly lower wall (2.74 → 2.41); the blow point shifts MONOTONICALLY with the integration step.** This proves the divergence IS the explicit-Euler instability of the *stiffening* rotating attention mode (|1+εμ|>1). Fixed ε only RELOCATES the wall, doesn't eliminate it — as the attention keeps stiffening past ~2.4, even ε=0.05 eventually gets too coarse → cycle regrows → blow. The ep_t2fix contrast (better gradient still blows, earlier) seals **"integration, not gradient."** Analog HW (continuous relaxation, ε→0) is unaffected. Pre-blow ckpt preserved: `runs/ep_eps05.pt` @step2850 best2.4059. (2.41 lands near the old 2.40 estimator-plateau — likely coincidence; this is the res-inflation/stiffness blow, not the estimator floor.)
**NEXT (user's plan — try jacreg + physics-inferred methods):** (a) **adaptive ε** [#30, gate now passed] — shrink ε as res climbs, big ε when smooth → should ELIMINATE the wall, not just move it; (b) **jacreg** (adaptive jr — reduce rotating-mode stiffness → raise the Euler-stability margin → wall up / fixed-ε floor lower; the early-recipe defense); (c) ε=0.02/0.03 to confirm wall→~2.2. Two complementary axes: integration (a,c — SIM-only, no expressivity cost) vs stiffness (b — changes model, also speeds HW settling).

## 2026-06-23 (cont. 8) — MAJOR CORRECTION: GENUINE Hopf instability, NOT an Euler artifact. cont.7's "Euler-artifact / analog-immune / integration-fix" framing is SUPERSEDED.
Direct leading-eigenvalue measurement (matrix-free FINITE-DIFFERENCE-JVP Arnoldi on M=I+εJ, `eig_probe.py`) + Anderson fixed-point-solver control (`anderson_control.py`). THREE independent confirmations:
1. **eval_relax**: s3200 floors/oscillates, never reaches a fixed point.
2. **Leading eigenvalues of M at the operating point**: s2000 (healthy, val 3.13) all |λ|<1, Re μ<0 (STABLE); s3200 (val 2.74) top |λ|=1.044, **Re μ=+0.44, COMPLEX** (μ=+0.26±1.37j); ep_eps05 (val 2.41) top |λ|=1.14, **Re μ=+1.35**, complex (μ=+0.65±2.08j). Re μ grows −0.02→+0.44→+1.35 as CE drops; rotating |Im μ| ~0→1.37→2.08.
3. **Anderson control**: s2000 converges (root, marginally stable μ≈0). s3200 — plain relax floors (res 2.7e-2), Anderson CANNOT reach a root (best res 1.4e-3), and the near-root has **UNSTABLE COMPLEX** eigenvalues (Re μ=+0.24, |Im μ|≈2.0). The solver itself cannot manufacture the absent stable equilibrium (fugu's exact caveat).

⇒ **The below-2.1 divergence is a GENUINE continuous-time Hopf-type instability** of the learned non-conservative attention operator — a complex eigenvalue pair crosses Re μ=0 between val 3.13 and 2.74 (|Im μ|~2 rotation), EMERGING + GROWING as EP trains the attention expressive. **NOT a discrete-Euler artifact; the continuous ODE / analog HW is ALSO unstable → analog is NOT immune.**

WHERE cont.7 WENT WRONG: the "Euler-artifact" inference rested on the ε-sweep "converging at ε=0.01" — but that was the **step** residual `ε·g`, not `g→0` (fugu caught this); the force-residual `g` floors high (no fixed point). The direct eigenvalue measurement refutes Re μ<0. RECONCILES with all prior data: ε-monotonicity (smaller ε → lower wall) = smaller ε makes `|1+εμ|>1` grow SLOWER → delays the blow → wall relocates (2.74→2.41), doesn't close (Re μ>0 is unstable for ANY ε); ep_t2fix (better gradient blew earlier) = cleaner gradient drives into the unstable regime faster; jacreg history (adaptive→2.40, frozen→blow) = jacreg cuts ‖J_nc‖ → pushes the complex pair back to Re μ<0.
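
The reconciliation above turns on how `|1+εμ|` behaves for a complex μ. A small sketch using the rotating pair quoted for s3200 and, for contrast, a hypothetical weakly damped mode (`mu_artifact`, not a measured value) that would be a pure Euler artifact:

```python
import math

def euler_gain(mu, eps):
    """Per-step amplification |1 + eps*mu| of explicit Euler on dz/dt = mu * z."""
    return abs(1.0 + eps * mu)

def growth_rate_per_time(mu, eps):
    """Log-amplification per unit of simulated time, log|1 + eps*mu| / eps."""
    return math.log(euler_gain(mu, eps)) / eps

mu_measured = complex(0.26, 1.37)     # rotating pair reported for s3200 (Re mu > 0)
mu_artifact = complex(-0.05, 1.37)    # HYPOTHETICAL stable mode, for contrast only

for eps in (0.1, 0.05, 0.01):
    print(f"eps={eps:<5} measured gain={euler_gain(mu_measured, eps):.5f} "
          f"rate={growth_rate_per_time(mu_measured, eps):.3f}  "
          f"artifact gain={euler_gain(mu_artifact, eps):.5f}")
```

For Re μ > 0 the gain stays above 1 at every ε and only the growth rate per unit time falls (toward Re μ), so the wall moves but does not close; only a mode with Re μ < 0, like the hypothetical one, flips from unstable to stable as ε shrinks.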

**FIX (FLIPPED): integration (adaptive ε / Anderson / implicit) only DELAYS** — can't converge to an unstable/absent fixed point (Anderson got 1.4e-3, not a root = direct proof). **The REAL fix is MODEL-SIDE: jacreg** (reduce ‖J_nc‖ → push the rotating complex pair back to Re μ<0), per fugu's Q2 (FUGU_OPTIONS_VERDICT.md): use an *adaptive bounded* jacreg homeostat (the validated 2.40 runs used adaptive jacreg; diverging runs froze it). **NEXT: jacreg run + verify with `eig_probe.py` that jacreg holds Re μ<0 / keeps the complex pair below the bifurcation.** Files: eig_probe.py, anderson_control.py, adaptive_eps_calib*.py, FUGU_OPTIONS_VERDICT.md.

## 2026-06-23 (cont. 9) — FIX CONFIRMED AT THE MECHANISM LEVEL: adaptive jacreg pushes Re μ<0, kills the Hopf, restores the fixed point.
`ep_jacreg` = warm-start s2000 + ε=0.1 + **ADAPTIVE jacreg** (`--jacreg 0.1 --jr_floor 0.1 --jr_max 16 --res_target 1e-3 --res_gate 0`, controller :520-529). The ONLY change vs redx (which froze `jr_max=jr_floor=0.1`) is releasing the controller. It descends cleanly **past redx's 2.74 blow** (best 2.7495, res controlled ~1e-3, jr 0.1–0.3, no blow; ~4.5e-4 CE/step after a slow warm-start re-adjustment).
**eig_probe (`eig_jacreg.py`) at the SAME loss ~2.74 — the decisive A/B:**
- FROZEN jacreg (redx s3200, blew): leading μ=+0.45 (real) + +0.26±1.37j (**COMPLEX/rotating**), |λ|>1 UNSTABLE, g_floor=0.26 (cycle, no fixed point).
- ADAPTIVE jacreg (ep_jacreg @2.75): leading μ=−0.23,−0.25,−0.27 (**ALL real, NO rotating pair**), |λ|<1 STABLE, g_floor=**0.0001** (TRUE fixed point restored).
⇒ adaptive jacreg moved the leading eigenvalue from **Re μ=+0.45 (unstable rotating Hopf) → Re μ=−0.23 (stable real)**, killing the rotating instability and restoring a true fixed point (g 0.26→1e-4). This confirms BOTH the diagnosis (Hopf from the non-conservative ‖J_nc‖) AND the fix (jacreg reduces ‖J_nc‖ → Re μ<0), at the eigenvalue level, not just the loss. **Bonus:** the restored fixed point makes AsymEP valid again (it requires a stationary state — invalid on redx's cycle). NEXT: does it clear 2.41 (ep_eps05's blow) → 2.09 → below; re-run eig_probe at 2.41 to confirm jacreg holds Re μ<0 deeper. Production = compose with adaptive ε (fugu Q3) for sim fidelity/speed. Files: eig_jacreg.py, ep_jacreg.log.
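
The controller code itself is not reproduced in this file. As a rough illustration of what "adaptive bounded" versus "frozen" means, here is a sketch of one possible homeostat; the multiplicative log-error rule and the `gain` parameter are assumptions made for this example, and only the names `jr_floor`, `jr_max` and `res_target` mirror the flags quoted above:

```python
import math

def update_jr(jr, res, res_target=1e-3, jr_floor=0.1, jr_max=16.0, gain=0.1):
    """One controller step, clamped to [jr_floor, jr_max].

    HYPOTHETICAL rule: scale jr up when res is above target and down when
    below, proportionally to the log of the ratio.
    """
    error = math.log(max(res, 1e-12) / res_target)
    jr_new = jr * math.exp(gain * error)
    return min(jr_max, max(jr_floor, jr_new))

# Released controller: jr rises while res sits above target.
jr = 0.1
for res in (5e-4, 2e-3, 1e-2, 1e-2, 5e-4):
    jr = update_jr(jr, res)
    print(f"res={res:.0e} -> jr={jr:.4f}")

# Frozen configuration (jr_max == jr_floor): the same rule cannot respond.
jr_frozen = update_jr(0.1, 1e-1, jr_floor=0.1, jr_max=0.1)
print("frozen:", jr_frozen)
```

With `jr_max == jr_floor` the clamp pins the coefficient regardless of the residual, which is the sense in which the diverging runs had no feedback from res to the regulariser.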
**OBS (oscillation, in-progress):** windowed val_std SHRANK monotonically as it descended (0.159→0.031, steps 50→3000) = converging. At step 3400 a TRANSIENT val spike (2.65→3.67) self-recovered over ~150 steps (→2.79), best preserved (2.5624). KEY: throughout the spike **res stayed LOW (4e-4→8e-5) and jr LOW (0.1)** → it is NOT a Hopf/relaxation instability (res did not inflate) but a benign weight/CE training transient (a momentarily-off update; relaxation stayed healthy). So jacreg's Hopf-suppression is intact; spike is a separate phenomenon. Watch: if such spikes recur/grow, investigate — but "res-controlled-during-spike" already rules out Hopf re-emergence.