Diffstat (limited to 'docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md')
| -rw-r--r-- | docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md | 260 |
1 files changed, 260 insertions, 0 deletions
diff --git a/docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md b/docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md new file mode 100644 index 0000000..54897bc --- /dev/null +++ b/docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md @@ -0,0 +1,260 @@ +# C512 EP below-2.10 divergence — diagnosis + fix (2026-06-20) + +> Written to disk after `/tmp/lt_ep` was wiped (10-day /tmp cleanup) — all local code/ckpts/logs lost. +> This file preserves the diagnosis + the exact code change so it can be rebuilt. + +## Symptom +EP (frozen jr=0.1, tracking-AEP, qknorm, c=1) descends to best **~2.09**, then within ~200 steps: +res 5e-3→0.15, cos(g_EP,g_BPTT) 0.98→~0, val CE 2.09→4+, ABORT. Same ~2.09 floor every run +(erf & tanh both: 2.0919 / 2.0921). **Exact-BPTT, identical recipe, sails past 2.09 → 1.72 (freeze_wsd).** + +## Root cause (Codex-confirmed, 5-way corroborated) +**EP optimizes the fixed point; BPTT optimizes the finite unroll — only the finite unroll defends the residual.** +- `ep_step` relaxes T1=150 → `z150`, then REFINES to t1max=300 → `z*`, computes the EP gradient at `z*`. +- But **eval and `bptt_step` use `z150`**. EP's gradient never feels the T1 residual. +- As attention gets expressive → contraction weakens → `z150` drifts from `z*` → res@T1 inflates → EP + estimate (valid only at small res) dies → blowup. +- BPTT differentiates the actual 150-step unroll → a non-converging unroll → bad CE → gradient implicitly + rewards strong contraction. **That defend-the-residual term is what EP structurally lacks.** +- cos-0.977 holds only AT the fixed point; the missing perpendicular is ~21% (=√(1−0.977²)) = the + finite-horizon transient gradient. (T1=∞ would make even BPTT lose it — BPTT's stability IS the finite truncation.) + +Two layers: (1) point-mismatch (refinement) — `--t1max 150` alone pushed wall 2.09→2.05 but still blew@600; +(2) gradient-flavor — needs the explicit penalty. 
+ +Diverged state = forward bifurcation to a **limit cycle** (eval_relax: res floors ~6e-2, oscillates, 150→4000 +steps don't help, CE ~3.7). FTLE stayed negative (−0.027..−0.050). ⇒ adaptive/more steps can't fix it (res as +stopping-criterion chases a vanishing fixed point); only res as a COST (penalty) prevents the drift. + +Ruled out: jacreg (3% of grad, orthogonal, removing changes cos <0.001); GELU erf/tanh (both diverge identically); +the validity gate (removing it diverges FASTER); forward-stability/FTLE (stable-BPTT −0.0347 weaker than diverged-EP −0.0377). + +## THE FIX: `--resreg` (T1-residual penalty) — exact edits to lt_ep_train.py / ep_step + +1. signature: `def ep_step(..., corr_every=1, res_gate=0.0, resreg=0.0):` + +2. right after `res_used = res` (BEFORE the `if t1max > T1:` refinement): +```python + zT, resT1 = zs, res # the T1 free-phase state (what eval/BPTT use), BEFORE refinement +``` + +3. just before the final `return grads, res` (after the jacreg block): +```python + if resreg > 0 and resT1 > 7e-4: # defend z_T1 (BPTT gets this implicitly; EP at z* doesn't) + with torch.enable_grad(): + Fz = blk.tforce(zT, xin0) # deterministic thick force at z_T1 (params live, zT/xin0 detached) + Rr = (eps * Fz).pow(2).sum() / (zT.pow(2).sum() + 1e-9) # ~ (T1 residual)^2 + grr = torch.autograd.grad(Rr, blk.block, allow_unused=True) + ratio = resreg * min(1.0, resT1 / 2e-2) # ramp 0->resreg as res 7e-4->2e-2, capped + gtask = math.sqrt(sum(float((grads[id(p)] ** 2).sum()) for p in blk.block if grads.get(id(p)) is not None) + 1e-20) + gres = math.sqrt(sum(float((g ** 2).sum()) for g in grr if g is not None) + 1e-20) + lam = ratio * gtask / gres # scale penalty to `ratio` of the task-grad norm + for p, g in zip(blk.block, grr): + if g is not None: + grads[id(p)] = g * lam if grads.get(id(p)) is None else grads[id(p)] + lam * g + return grads, res +``` + +4. argparse: `ap.add_argument('--resreg', type=float, default=0.0)` + +5. 
the ep_step CALL in the training loop: append `, cfg.resreg` as the last arg. + +**RUN IT WITH `--res_gate 0`** (else the gate bypasses the penalty). Keep `--t1max 300`. + +## Validation status (INTERRUPTED — must re-run) +`ep_resreg2` (res_gate=0, resreg=0.2, warm from erf-2.09 ckpt, t1max=300): held res **1–5e-4 through step 1000**, +best **2.0573** (lowest any EP run reached, past the wall), zero inflation. Deleted by /tmp wipe before reaching ~1.8. +No-penalty variants for contrast: ep_nogate blew@100, gated ep_resreg@200, ep_t1max150 reached 2.05 then blew@600. + +## What was lost in the /tmp wipe (must rebuild) +- Code: lt_ep_train.py (with all post-06-14 edits: --resreg, --gelu, --probe_bptt, --wsd, --lcap/--linf, empty_cache, + tracking-AEP), plus the probes (grad_decomp.py, ftle_probe.py +--gelu, eval_relax.py, cos_now.py, train_lyap.py). + Base (06-14) survives in `/home/yurenh2/ept/lt_ep_code/`; re-apply edits from this doc + the session transcript + `/home/yurenh2/.claude/projects/-home-yurenh2-ept/46bc4d43-*.jsonl`. +- ALL checkpoints: orphan (ep_C512_bpe.best.pt 2.40), freeze_wsd.pt (BPTT 1.72), ep_frzjr_wsd.pt (2.09), + ep_resreg2, ep_tanh_frz — gone (H200 backup source 157.66.255.42 is also down). Must retrain from scratch. +- Data tinystories_bpe: regenerate via `lt_ep_code/prepare_tinystories_bpe.py`. + +## Re-run plan +1. Reconstruct lt_ep_train.py = lt_ep_code/06-14 base + the edits above (+ the other recent flags). +2. Regenerate tinystories_bpe data. **Work entirely under /home/yurenh2/ept — NEVER /tmp.** +3. Retrain orphan (~20k) → freeze_wsd (BPTT WSD 16k, expect ~1.72) → EP+resreg (res_gate=0 resreg 0.2, WSD). +4. Confirm EP+resreg descends past 2.09 toward ~1.8 (BPTT ceiling). Sweep resreg 0.2–1.0 if it plateaus. +5. 
NEXT-ROUND code fixes (genuine bugs, kept this round only for comparability with the diverging baselines): + - GELU consistency: model force uses erf (F.gelu) but holo nudge (cgelu/rforce/cforce) uses tanh-form → nudge fixed point ≠ z* → ~4.7e-4 force error injects a spurious common-mode drift each step (estimator bias). FIX = `--gelu tanh` (make the model tanh too; the holo complex extension REQUIRES tanh = entire fn, so consistency means everything tanh). NOTE: fixing it does NOT fix the below-2.10 divergence (tested 6/18, that's the residual-defense issue) — but it's still a real bug worth fixing. + - cforce (holo_ep.py:41-45) omits the q/k RMSNorm that real attention applies → biases holo_a/holo_a_select with --qknorm (not the --track path, but fix anyway). + - Stale argparse defaults (footguns, harmless when flags passed explicitly): --attn_mode default 'real' (should be the intended mode), --data default '/tmp/lt_ep/data/shakespeare_char' (line ~320). Also tighten resreg scaling to pure task-grad norm (currently task+jacreg, ~3% off). + +--- +## 2026-06-21 session update — rebuild after /tmp wipe + validation IN PROGRESS + +REBUILD (all under /home/yurenh2/ept/ep_run, never /tmp): +- Code ep_run/lt_ep_train.py = lt_ep_code (a recent ~6/20 snapshot despite its misleading 6/14 mtime — it already has res_gate/gelu/t1max/track/resinit/full ep_step+holo) + my 3 edits (--resreg block, --wsd schedule, line-20 DD data-path fix). `diff` verified clean; holo_ep.py byte-identical to base. +- Bug-check forensics: holo_ep.py last edited 2026-06-13 -> the estimator (holo_a_track, AEP correction, rforce) is COMPLETE/not stale. Post-6/14 lt_ep_train edits NOT carried over are all diagnostic/other-run-type (--probe_bptt,--probe_branch,--snap,--cprewarm,--specnorm/--linf, bptt_step sep-mode, empty_cache) — none touched by the EP+resreg+track+frozen-jr run. 
+- Data: regenerated tinystories_bpe to ep_run/data/ (361M train tokens, vocab 4096; val=tail-15MB of train.txt, ~0.8% overlap — fine for tracking divergence/descent). + +VALIDATION RUN (from random init; pid 696706 GPU1): + python3 lt_ep_train.py --mode ep --attn_mode thick --B 24 --C 512 --H 16 --T 256 --c 1.0 \ + --jacreg 0.1 --jr_floor 0.1 --jr_max 0.1 --holo 2 --hr 0.02 --t2sel 40 --track --pema 0.999 --t1max 300 \ + --res_est 1e-4 --res_gate 0 --resreg 0.2 --qknorm --resinit 0.1 --warmup 800 --T1 150 --T2 20 --lr 6e-4 \ + --wsd 0.25 --steps 32000 --log 200 --save_every 500 --abort_res 0.3 --data data/tinystories_bpe \ + --ckpt runs/ep_resreg_scratch.pt --state runs/ep_resreg_scratch.state + Status @step 6400 (~20%, 8h, 0.23 it/s): val 2.5070 descending (FASTER than orphan: 2.51@6.4k vs orphan's 2.40@20k). res BOUNDED ~1-2e-2 by resreg (spikes to 2e-2 get pulled back to ~4e-3 — resreg WORKING, not running away like the no-resreg baselines that hit res 0.15->abort@2.09). Decisive break-2.09 test ~10h out; ~1.8 needs the WSD lr-decay (step 24k+). Watcher key: does res stay bounded through 2.09. + +resreg CONTAMINATION probe (ep_run/resreg_probe.py, at step-3600 ckpt, res 1e-2): + cos(VF,BPTT)=0.7166 vs cos(VF+resreg,BPTT)=0.7108 -> delta -0.006: resreg DOES directly contaminate (small). cos(resreg_dir,BPTT)=-0.047 -> resreg is ~ORTHOGONAL to the true gradient; it is NOT adding back BPTT's residual-defense term (my earlier hypothesis was WRONG) — it's a roughly-orthogonal constraint direction. NET still favorable: the tiny -0.006 direct cost (bounded by the 0.2 ratio cap) buys keeping res low so the EP estimate doesn't COLLAPSE (cos 0.72 held vs ->0 runaway+diverge). CAVEAT the probe also exposes: the EP estimate itself is only 0.72-aligned at res 1e-2 (vs 0.98 at res 1e-4) -> descent on a mediocre gradient -> MAY plateau above BPTT's 1.72. 
If it stalls >~1.85 or res creeps past 2e-2: retune tighter resreg (ramp-to-full by res~3-5e-3 instead of 2e-2) and/or stronger early contraction (frozen jr=0.1 from scratch is weaker than the orphan's adaptive jr which held res~1e-4). + +CODEX review of rebuilt code (verdict triaged): the 2 "High" bugs are STALE-DEFAULT footguns, HARMLESS here (run passes --attn_mode thick + --data explicitly, verified via /proc): line 304 --attn_mode default 'real'; line 320 --data default '/tmp/lt_ep/data/shakespeare_char'. GELU (erf model vs tanh nudge, ~4.7e-4 force err) + cforce-missing-qknorm = real bugs but status-quo (kept this round for comparability) -> next-round fixes (point 5 above). Codex CONFIRMED CORRECT: AEP contrast sign (not flipped/doubled), Wh readout grad, jacreg, the resreg sign/logic + res_gate=0 gate-bypass, frozen-jr clamp@0.1, WSD schedule. + +COST (EP vs same-param BP): per-step EP on the equilibrium block ~= ~100x a depth-1 BP transformer in DIGITAL SIM — because the block relaxes ~150-300 steps (effective depth); even exact-grad BPTT is ~100x depth-1 BP. EP/BPTT ~= 1.5x (EP 0.23 it/s vs freeze_wsd BPTT 0.34 it/s). The ~100x is the cost of SIMULATING the analog relaxation on a GPU; on analog relaxation hardware the free phase is physical settling (~free) + EP's local update needs no backward -> EP competitive/better than BP there. Per-quality the gap is smaller (equilibrium block beats same-param standard transformer, rule tax 0.041); EP memory << BPTT (no unrolled graph). + +DELIVERABLE: method-intro PDF /home/yurenh2/ept/ept_method_intro.{tex,pdf} (9pg, EP-literate audience; AEP/holomorphic/CET/tracking-AEP background; strict attribution; results honest+ongoing-marked; compiled with tectonic since conda pdflatex is broken). VERIFY before external use: the post-Jan-2026 citations (Scurria arXiv:2602.03670 AsymEP; CET=Hoier/Kerjan/Scellier ICLR2026 AM workshop OpenReview Qrfml76eWJ; Laborieux 2021) + Hoier-vs-Hoier spelling. 
+ +--- +## 2026-06-21 (cont.) — divergence is CONTRACTION loss, not res; the missing BPTT term MEASURED +From-scratch EP (c=1, frozen jr=0.1, track, resreg=0.2) diverged at val 2.25 (ABOVE the 2.09 wall): +res 1.1e-2 -> 1.7e-1 -> 3.4e-1 then COLLAPSED to a trivial fixed point (res 1.4e-8, val 10). So it is a +bifurcation: good fixed point destroyed -> brief limit cycle -> degenerate attractor. NOT a numerical blowup. + +MEASURED the missing term M = g_BPTT - g_EP at the pre-blowup ckpt (step 9200, res~1e-2; resreg_probe.py): + cos(g_EP, g_BPTT) = 0.557 (decayed from 0.72 @ step3600 at the SAME res -> estimate degrades as operator hardens) + |M|/|BPTT| = 0.84, cos(M,BPTT) = +0.91, cos(M,EP) = +0.16, cos(M,resreg) = -0.10 + (|EP|=0.94, |BPTT|=2.2 -> EP captures only ~42% of the true gradient, 56 deg off, near the instability) +=> EP is missing 84% of the true gradient (M ~ the BPTT direction). resreg points AWAY from M (-0.10): + resreg is NOT the missing term and NOT a weak proxy -- it is the wrong direction. resreg retired as the fix. + +CODEX THEORY: correct state variable = contraction margin mu_P(J) (one-sided-Lipschitz/log-norm), NOT res +(lagging symptom; cos decays at fixed res) and NOT spectral radius (non-normal, |Jv-Jtv|/|Jv|~1.4 -> jacreg +targets the WRONG quantity). BPTT's stabilizer = the finite-horizon gradient term (the (I+eps*J)^k products that +ALSO govern contraction -> automatic contraction feedback); EP graded at refined z* structurally discards it. +The fix that (i) preserves the objective [no loss force] and (ii) runs on analog [no digital root-finder] = +CONTRACTION-BY-CONSTRUCTION (mu_P(J) < -alpha). KICKER: analog HW can never do BPTT -> can never access M -> +structural contraction is the FORCED hardware design constraint, not merely the cleanest fix. 
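
Consistency check on the M measurement above (a minimal sketch, not part of `resreg_probe.py`): given only the two reported norms and cos(g_EP, g_BPTT), the law of cosines fixes |M|/|BPTT|, cos(M,BPTT) and cos(M,EP) for M = g_BPTT - g_EP. The helper name `missing_term_stats` is hypothetical; the inputs are the rounded values quoted above, so agreement is only expected to about two digits.

```python
import math

def missing_term_stats(norm_ep, norm_bptt, cos_ep_bptt):
    """Geometry of M = g_BPTT - g_EP from two norms and one cosine (law of cosines)."""
    dot = norm_ep * norm_bptt * cos_ep_bptt            # <g_EP, g_BPTT>
    norm_m = math.sqrt(norm_ep**2 + norm_bptt**2 - 2.0 * dot)
    return {
        "rel_norm": norm_m / norm_bptt,                             # |M| / |BPTT|
        "cos_m_bptt": (norm_bptt**2 - dot) / (norm_m * norm_bptt),  # cos(M, BPTT)
        "cos_m_ep": (dot - norm_ep**2) / (norm_m * norm_ep),        # cos(M, EP)
        "angle_deg": math.degrees(math.acos(cos_ep_bptt)),          # angle between EP and BPTT
        "norm_ratio": norm_ep / norm_bptt,                          # |EP| / |BPTT|
    }

# Rounded values reported at step 9200: |EP|=0.94, |BPTT|=2.2, cos=0.557
stats = missing_term_stats(0.94, 2.2, 0.557)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The reported 0.84 / +0.91 / +0.16 / 56 deg / ~42% all follow from those three inputs, so the five numbers are one measurement viewed five ways rather than independent evidence.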
+ +EXPERIMENT (test causality + the fix; both with resreg=0, no residual term): + (A) c-bump c=1->3 from scratch [runs/ep_c3_scratch] -- does stronger structural damping alone prevent divergence? + (B) HARD spectral-norm contraction constraint --specnorm (codex implementing) -- warm from preblow_9200 + from scratch. + If divergence vanishes with NO residual term -> contraction-loss confirmed as root; principled fix validated. + +--- +## 2026-06-21 (cont. 2) — the equivalence-condition diagnosis (deep paper read: Ernoult2019 + AsymEP + LZ-JacobianHomeostasis) +EP≡BPTT (Ernoult 2019, arXiv:1905.13633, Thm1/Eq16) needs: H1 symmetric Jacobian J=Jᵀ (Eq51 transpose-cancellation = load-bearing); +H2 free phase EXACTLY at fixed point s*=s_T=..=s_{T-K}; H3 β→0. Ernoult gives NO contraction-rate truncation bound (just assumes +convergence; my earlier "contraction-rate condition" guess was WRONG). BPTT needs NEITHER H1 nor H2 (differentiates the actual +finite-T graph) -> that's why BPTT sails to 1.7 while EP can't. + +OUR RUN VIOLATES BOTH, via the SAME quantity S^-1 A (S,A = sym/antisym parts of J; our |Jv-Jᵀv|/|Jv|≈1.4 IS ‖S^-1 A‖≈1.4): +- H2 (dominant, first-order): res~1e-2 -> not at fixed point -> EP & BPTT seeded by different states. "EP≈BPTT only at res≲1e-4, + noise at 1e-2" is textbook H2 failure. +- H1 (makes it sudden, sets M): both AsymEP (Scurria arXiv:2602.03670) and LZ (Laborieux-Zenke "Jacobian Homeostasis" + arXiv:2309.02214, ICLR2024) reduce the asymmetry bias to S^-1 A. LZ Eq11: EP error = δ - 2 S^-1 A δ, valid only ‖S^-1 A‖→0. + At 1.4 the map (S+A)^-1(S-A) → near-rotation (eigs→-1): right direction, lost magnitude = our cos(M,BPTT)=0.91, |M|=84%. + +KEY NEW REALIZATION: our AEP (J→Jᵀ) correction is OUT OF WARRANTY. It is FIRST-ORDER in S^-1 A and assumes ‖S^-1 A‖≪1 +(AsymEP error series Σ(S^-1 A)^{2n+1} converges only if ρ(S^-1 A)<1; they never state it). At 1.4 it may be O(1) wrong. 
AND it +is a FIXED linearization at x̄⁰, so res~1e-2 -> wrong anchor -> H2 POISONS the AEP correction too. The cos-0.99 we trust holds +ONLY at small-asymmetry + tight-convergence — the regime we leave. + +THE GAP, PRECISE: NOT "can't converge". Even at perfect convergence, at ‖S^-1 A‖=1.4 the H1 bias is O(1), and the BPTT-1.7 +solution NEEDS that non-normality. So EP's valid regime (‖S^-1 A‖≪1) and the expressive regime (≈1.4) DON'T OVERLAP. Structural. + +REFRAMES THE CONTRACTION EXPERIMENTS: specnorm/c3 cap ‖J‖/contraction, NOT the asymmetry ratio ‖A‖/‖S‖ that drives the bias. +The literature's correct lever = Jacobian Homeostasis (LZ Eq12: penalize ‖A‖² -> J functionally symmetric, works w/ untied weights) += the principled version of our jacreg — BUT it caps exactly the non-normality the 1.7 solution needs (expressivity tax; confirms +the "crutch" worry). Holomorphic nudge only removes finite-β bias, NOT the H1/H2 bias (so β is NOT our dominant error). + +THE ONE PATH WITHOUT THE EXPRESSIVITY TAX (novel): the AEP correction is first-order; a HIGHER-ORDER AEP (more Neumann terms in +S^-1 A) extends the warranty past ‖S^-1 A‖~1 WITHOUT capping non-normality — fix the gradient, not the model. Attacks H1 fundamentally. + +NEXT MEASUREMENTS (cheap, on preblow_9200 ckpt): (1) measure ‖S^-1 A‖ / ρ(S^-1 A) DIRECTLY (not the 1.4 proxy) and track over +training (warranty gauge; predict it grows toward the tip). (2) test M ≈ -2 S^-1 A · g_BPTT — if yes, the asymmetry bias IS the gap +(not convergence) and higher-order AEP is the lever. Sources: Ernoult 1905.13633; Scurria 2602.03670; LZ 2309.02214 + Holo 2209.00530. + +--- +## 2026-06-21 (cont. 3) — CORRECTION: the "AEP out of warranty / ρ(S⁻¹A)<1" framing (cont.2) was MIS-ATTRIBUTED +Deep read of Scurria 2026 (arXiv:2602.03670, verbatim Eq.24/47) overturns cont.2: +- Our AEP = the paper's AsymEP (single-state antisym correction −2A(z−z*); holo_a_track re-anchors A at the common mode z̄). 
+ It is EXACT at β→0 for ARBITRARY asymmetry: Eq.24 `J_FA(x̄⁰)=J−2A=Jᵀ` is an algebraic IDENTITY (any A magnitude). +- The ρ(S⁻¹A)<1 / Neumann series Σ(S⁻¹A)^{2n+1} (Eq.47) is the **VF BASELINE's** error (uncorrected EP), which AsymEP REMOVES. + We + GPT-Pro (following Laborieux-Zenke, whose first-order −2S⁻¹Aδ analyzes UNCORRECTED EP) MIS-ATTRIBUTED the VF error + to AsymEP. It is NOT a restriction on our correction. +- "AsymEP = first-order projection of Dyadic EP" = DIMENSIONAL projection (2n→n), NOT first-order-in-β. Both exact at β→0. +- Dyadic EP (Scurria §4) = a SEPARATE 2n doubled-field method (aux field z′ + augmented energy H=−(z−z′)ᵀF((z+z′)/2)); + we DON'T implement it. It buys finite-displacement robustness + a real energy, NOT extra β→0 exactness. ⇒ optional upgrade, NOT the fix. +- ⇒ the asym_probe ρ(S⁻¹A)=3.3 result answers the WRONG question (VF's error, + measured at a no-fixed-point pathological ckpt). DISCARD as a validity verdict. + +REAL DIAGNOSIS (re-centered): the ONLY fundamental obstacle is **H2 = free-phase convergence/contraction**. AsymEP handles +asymmetry EXACTLY *given convergence*. The cos-0.557 / M-84% gap was measured at res 1e-2 (NOT converged) ⇒ it's H2, not H1. +Asymmetry is only "upstream" (expressive attention → weak contraction → H2 fails). FIX DIRECTION = contraction/convergence +defense (specnorm / damping / monotone / residual-defense) — VALIDATED. The cont.2 "asymmetry/ρ/AEP-warranty" line was a wrong turn. + +EMPIRICAL CHECK (running): oracle-adjoint probe — at a HEALTHY converged ckpt (specnorm, res~1e-4): is +g_EP(AsymEP) ≈ g_transpose(exact adjoint via GMRES Jᵀλ=−ℓ) ≈ g_BPTT? cos→1 confirms AsymEP exact + problem = convergence. +(The earlier 0.557 was the pathological non-converged point.) + +--- +## 2026-06-21 (cont. 
5) — ROOT-CAUSE CANDIDATE: --hr 0.02 is 10x too small (should be 0.2) -> noisy EP gradient +oracle-adjoint probe (asym_probe --diag) on a PERFECTLY converged ckpt (res 1.5e-9): cos(g_EP, g_transpose[=exact +adjoint=g_BPTT, cross-checked cos 1.0]) = 0.63 mean over 6 batches (SYSTEMATIC, not a bad batch). BETA/r SWEEP: cos gets +WORSE as beta->0 (hr 0.04->0.86, 0.02->0.72, 0.005->0.38, 0.002->0.15) = 1/beta noise-amplification of the nudge-convergence +error (NOT finite-beta bias; NOT a holo/track/t2sel component -- plain EP ~= FULL ~0.72). Sweeping hr UP: 0.06->0.85, +0.1->0.95, 0.15->0.98, **0.2->0.986**, 0.3->0.991. => at the DOCUMENTED hr=0.2 (METHODS line 130) the estimator is FINE +(cos 0.986, in the historical 0.85-0.99 range). Our runs (this whole session + the EP_BELOW210 "diverging recipe") used +--hr 0.02 = 10x too small -> noise regime -> 0.72-cos gradient -> plausibly THE destabilizer of the below-2.10 divergence. +CODE IS CORRECT; mis-set parameter (hr regression from the validated 0.2). FIX = --hr 0.2. Decisive test: pure EP with +--hr 0.2 (runs/ep_hr02) -- does it stop diverging? The probe was correct; the convergence/asymmetry/Dyadic theory dive was +largely chasing a symptom of the mis-set hr. (NOTE: also revisit t2sel=40 vs documented 120, secondary.) + +## 2026-06-22 (cont. 6) — DECISIVE: divergence is STRUCTURAL (equilibrium gradient is blind to finite-time contraction). hr & t2sel are NOT the fix. +Deliverables this round: `ep_run/GPT55_BUG_HUNT.md`, `ep_run/CODEX_VERDICT.md`, `ep_run/FWD_MODE_DESIGN.md` (multi-agent: gpt-5.5 + fugu-ultra via codex). + +**cont.5's "hr=0.2 fixes it" was WRONG/incomplete.** ep_hr02 (--hr 0.2) STILL diverged (step 2800, val 2.87). hr is a real gradient-QUALITY knob but not the divergence fix. + +**Clean chain (all on a fixed val batch):** +1. **BPTT deconfounder:** BPTT (exact grad) → CE **1.83**, free-phase contraction ratio **ρ≈0.982**, res tight ~1e-3 throughout. 
Divergence is **EP-specific**; a contractive low-loss solution exists. +2. **The "0.94 estimator ceiling" = ADJOINT-PHASE TRUNCATION, not hr/β noise, not fundamental.** At a trained, tightly-converged operator (free res 2.2e-5): hr-ceiling sweep cos(g_EP,g_transpose) **flat ~0.94 across hr 0.04–0.8**; t2sel sweep (fixed hr) cos **0.74→0.998 as t2sel 10→160**. The code refines the FREE phase but never converges the NUDGED/adjoint phase (gpt-5.5 bug-hunt #1,#9). So EP CAN compute the near-exact equilibrium adjoint (0.998) at t2sel≥160. cos(g_transpose,g_BPTT)=0.975@res2e-5, =1.0@res1e-9 (finite-horizon gap, res-dependent). +3. **Bifurcation probe (mechanism):** as EP trains, free-phase **ρ drifts→1** (0.989@val4.96→0.998@val2.74); redx baseline blew step3300 (CE 2.74→41). **Damping does NOT rescue** — ρ-vs-c on the marginal op: more c makes ρ WORSE (0.998→0.9999); on BPTT's op more c helps (0.982→0.96). EP drives weights into a **damping-resistant marginal-contraction region** BPTT avoids. Sudden blowup because **res_T1≈ρ^150** is hyper-sensitive near ρ=1 (gradual ρ-drift + ρ^150 = sudden bifurcation). + +**VERDICT (codex/gpt-5.5, PROVEN — CODEX_VERDICT.md): (b) STRUCTURAL.** Equilibrium gradient differentiates L(z*); BPTT differentiates deployed L(z_T), T=150. The equilibrium gradient `grad=L_θ+λᵀF_θ` (F_zᵀλ=−L_z) **lacks the finite-horizon contraction-defense term**. Scalar proof: F=−k(z−z*) → ∀k, z* identical → **dL(z*)/dk=0** (blind to contraction rate), but z_T−z*=(1−εk)^T(z₀−z*) → **dL(z_T)/dk ∝ T(1−εk)^{T−1}** (BPTT sees k; huge near ρ=1). A param that erodes contraction with z* fixed has ZERO equilibrium gradient but large BPTT gradient. **No t2sel/hr adds a term absent from the fixed-point gradient.** t2sel=160 only makes EP a better *equilibrium* gradient = still wrong objective → ep_t2fix predicted to still blow. 
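
The scalar proof in the verdict above can be reproduced numerically. This is a minimal sketch under stated assumptions, not code from the run: a quadratic loss with its target at the fixed point, z₀=0, z*=1, ε=0.1, T=150, and k=0.02 so that ρ=1−εk=0.998. All function names are hypothetical.

```python
def relax(k, eps=0.1, T=150, z0=0.0, z_star=1.0):
    """Explicit-Euler relaxation of F = -k (z - z_star) for T steps; returns z_T."""
    z = z0
    for _ in range(T):
        z += eps * (-k * (z - z_star))
    return z

def loss(z, target=1.0):
    """Assumed quadratic loss; target placed at the fixed point z_star = 1."""
    return 0.5 * (z - target) ** 2

def loss_equilibrium(k, z_star=1.0):
    """L(z*): the root of F does not depend on k, so neither does this loss."""
    return loss(z_star)

def loss_finite(k):
    """L(z_T): the deployed finite-horizon loss that BPTT differentiates."""
    return loss(relax(k))

def central_diff(f, k, h=1e-6):
    """Central finite-difference derivative of f at k."""
    return (f(k + h) - f(k - h)) / (2.0 * h)

def grad_finite_closed_form(k, eps=0.1, T=150, z0=0.0, z_star=1.0, target=1.0):
    """Closed form from z_T - z* = (1 - eps k)^T (z0 - z*)."""
    rho = 1.0 - eps * k
    z_T = z_star + rho**T * (z0 - z_star)
    dzT_dk = -eps * T * rho ** (T - 1) * (z0 - z_star)
    return (z_T - target) * dzT_dk

k = 0.02  # rho = 0.998, the marginal regime
print("dL(z*)/dk  =", central_diff(loss_equilibrium, k))
print("dL(z_T)/dk =", central_diff(loss_finite, k))
print("closed form=", grad_finite_closed_form(k))
```

The equilibrium derivative is exactly zero for every k, while the finite-horizon derivative is negative here (a larger k lowers the deployed loss), i.e. only the finite-horizon objective rewards contraction.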
+ +**FIX (physically realizable; FWD_MODE_DESIGN.md): change the OBJECTIVE to finite-horizon — forward-mode/RTRL eligibility traces.** Forward-relax T=150, propagate `e_{t+1}=Φ_z(z_t)e_t+ε·∂F_θ/∂p` alongside the state, update `Δp ∝ −L_z(z_T)ᵀe_T`. Forward-only, local, NOT reverse BPTT, NOT a digital solver — keeps the contraction-defense terms. Exact RTRL O(P·dimz) → practical: UORO / random-direction directional-deriv / blockwise low-rank. Alt: explicit contraction homeostat (penalize log-norm μ(J)). + +**In-flight confirmations (2026-06-22):** ep_t2fix (t2sel=160, runs/ep_t2fix.log + t2fix_rho.log) — predicted ρ→1 + blow; oracle-adjoint training (GMRES exact adjoint each step, warm-start redx s2000, runs/oracle_adjoint.log) — DEFINITIVE (b) test (if even the exact equilibrium grad blows → (b)); forward-mode rule design = the fix. +**Tooling:** Sakana fugu-ultra wired into the codex plugin (broker spawn patched: `-c model=fugu-ultra -c model_provider=sakana`, key ~/.codex/sakana.key); gpt-5.5 via direct `codex exec`. + +## 2026-06-22 (cont. 7) — CORRECTION: the divergence is a forward LIMIT CYCLE = a DISCRETE-EULER ARTIFACT (continuous/analog system is stable). cont.6's (b)-structural/forward-mode verdict is SUPERSEDED. +Measured on genuine this-session ckpts (full dossier: `ep_run/EP_DIAGNOSIS_DOSSIER.md`; fugu Q1-Q4: `ep_run/FUGU_VERDICT_FULL.md`). +1. **It's a LIMIT CYCLE, no fixed point.** eval_relax on the marginal ckpt redx **s3200** (val 2.74): relax from embedding 6000 steps → res FLOORS ~2.3e-2 and OSCILLATES (tail 2.08-2.73e-2, non-monotone), no fixed point. Reproduces the old (lost-run) finding on a genuine ckpt. ⇒ there is **no z\*** at the diverging operator → cont.6's "equilibrium gradient L(z\*) blind to contraction → forward-mode RTRL" (assumed z\* exists) is **moot here**. The bifurcation-probe "ρ=0.998 / slow convergence" was a **transient artifact** (caught the initial decay, missed the floor+oscillation). +2. 
**Attention drives the cycle.** Knockout WO×α: α=1.0 cycle (res 2.5e-2), 0.7 smaller, 0.4 nearly gone, **0.2 CONVERGES (res 3e-4, true fixed point)**. Non-conservative attention is causally responsible. +3. **DECISIVE — it's a DISCRETE-EULER ARTIFACT.** ε-sweep on s3200 (εN fixed = same settling): res-floor + oscillation shrink **monotonically** as ε↓ — 0.1→2.3e-2(cycle), 0.05→7e-3, 0.03→3e-3, 0.02→2e-3, **0.01→8.9e-4 CONVERGED**. ⇒ the cycle is the **explicit-Euler instability of a stiff rotating attention mode** whose CONTINUOUS eigenvalue is **Re(μ)<0 (stable)**: `|1+εμ|>1` at ε=0.1 but `<1` at ε=0.01. **So the below-2.1 "divergence" is largely a DIGITAL-SIM INTEGRATION ARTIFACT (ε=0.1 explicit Euler too coarse), NOT a fundamental non-conservative instability — and ANALOG hardware (continuous relaxation, ε→0) would NOT have it.** Big positive for the thesis. +- **fugu-ultra (Q1-Q4)** concurred + was appropriately cautious: confirmed attention-driven oscillatory non-convergence, flagged the Euler-artifact alternative (now confirmed), said "Hopf" needs the eigenvalue-of-M crossing to be measured. Fix (Q2): finer integration (smaller ε / implicit/Anderson) + **adaptive jacreg homeostat** (the validated-2.40 mechanism, which the diverging runs *froze* weak — the regression) + **res_gate ON** + qknorm/WO-gain rails + monitor TAIL oscillation (a short residual probe is what fooled us). Thesis (Q3): sub-threshold attention IS expressive enough (BPTT 1.83 proves stable expressive solutions exist), threshold ≈ α_c 0.3; tradeoff real but not fundamental. Primitive (Q4): keep below the (discrete) instability — the code (`ep_step`/holo) REQUIRES a stationary z\*; training *through* the cycle is invalid → gate updates when res says equilibrium lost. +- **CAVEAT:** the ε-sweep proves the *discrete* artifact at s3200 (continuous-stable there); a *true* continuous Hopf could still emerge deeper in training. 
**Decisive confirmation (running): `ep_eps05`** = warm-start s2000 (pre-cycle, val 3.13) + **ε=0.05 (T1=300)** → does it clear redx's 2.74 blow point without cycling? Files: eval_relax_s3200.py / knockout_s3200.py / eps_sweep_s3200.py. + +## 2026-06-23 — RESULT: ε-TRAINING-MONOTONICITY CONFIRMS THE MECHANISM (wall moves with ε). +Three training runs, identical recipe except ε: +- ε=0.1, t2sel=40 (redx): blew @ best **2.74** +- ε=0.1, t2sel=160 (ep_t2fix): blew @ best **3.02** — the BETTER gradient (cos 0.998 vs 0.84) blew EARLIER ⇒ gradient quality is NOT the lever (a cleaner gradient if anything drives into the stiff regime faster). +- ε=0.05 (ep_eps05, warm-start s2000): blew @ best **2.41** — cleared *past* redx's 2.74 with res pinned ~5e-3 the whole way down (2.74→2.45), then res inflated 1.3e-2→2.0e-2→4.6e-2→val13 @step3100. + +**⇒ smaller ε → strictly lower wall (2.74 → 2.41); the blow point shifts MONOTONICALLY with the integration step.** This proves the divergence IS the explicit-Euler instability of the *stiffening* rotating attention mode (|1+εμ|>1). Fixed ε only RELOCATES the wall, doesn't eliminate it — as the attention keeps stiffening past ~2.4, even ε=0.05 eventually gets too coarse → cycle regrows → blow. The ep_t2fix contrast (better gradient still blows, earlier) seals **"integration, not gradient."** Analog HW (continuous relaxation, ε→0) is unaffected. Pre-blow ckpt preserved: `runs/ep_eps05.pt` @step2850 best2.4059. (2.41 lands near the old 2.40 estimator-plateau — likely coincidence; this is the res-inflation/stiffness blow, not the estimator floor.) +**NEXT (user's plan — try jacreg + physics-inferred methods):** (a) **adaptive ε** [#30, gate now passed] — shrink ε as res climbs, big ε when smooth → should ELIMINATE the wall, not just move it; (b) **jacreg** (adaptive jr — reduce rotating-mode stiffness → raise the Euler-stability margin → wall up / fixed-ε floor lower; the early-recipe defense); (c) ε=0.02/0.03 to confirm wall→~2.2. 
Two complementary axes: integration (a,c — SIM-only, no expressivity cost) vs stiffness (b — changes model, also speeds HW settling). + +## 2026-06-23 (cont. 8) — MAJOR CORRECTION: GENUINE Hopf instability, NOT an Euler artifact. cont.7's "Euler-artifact / analog-immune / integration-fix" framing is SUPERSEDED. +Direct leading-eigenvalue measurement (matrix-free FINITE-DIFFERENCE-JVP Arnoldi on M=I+εJ, `eig_probe.py`) + Anderson fixed-point-solver control (`anderson_control.py`). THREE independent confirmations: +1. **eval_relax**: s3200 floors/oscillates, never reaches a fixed point. +2. **Leading eigenvalues of M at the operating point**: s2000 (healthy, val 3.13) all |λ|<1, Re μ<0 (STABLE); s3200 (val 2.74) top |λ|=1.044, **Re μ=+0.44, COMPLEX** (μ=+0.26±1.37j); ep_eps05 (val 2.41) top |λ|=1.14, **Re μ=+1.35**, complex (μ=+0.65±2.08j). Re μ grows −0.02→+0.44→+1.35 as CE drops; rotating |Im μ| ~0→1.37→2.08. +3. **Anderson control**: s2000 converges (root, marginally stable μ≈0). s3200 — plain relax floors (res 2.7e-2), Anderson CANNOT reach a root (best res 1.4e-3), and the near-root has **UNSTABLE COMPLEX** eigenvalues (Re μ=+0.24, |Im μ|≈2.0). The solver itself cannot manufacture the absent stable equilibrium (fugu's exact caveat). + +⇒ **The below-2.1 divergence is a GENUINE continuous-time Hopf-type instability** of the learned non-conservative attention operator — a complex eigenvalue pair crosses Re μ=0 between val 3.13 and 2.74 (|Im μ|~2 rotation), EMERGING + GROWING as EP trains the attention expressive. **NOT a discrete-Euler artifact; the continuous ODE / analog HW is ALSO unstable → analog is NOT immune.** + +WHERE cont.7 WENT WRONG: the "Euler-artifact" inference rested on the ε-sweep "converging at ε=0.01" — but that was the **step** residual `ε·g`, not `g→0` (fugu caught this); the force-residual `g` floors high (no fixed point). The direct eigenvalue measurement refutes Re μ<0. 
RECONCILES with all prior data: ε-monotonicity (smaller ε → lower wall) = smaller ε makes `|1+εμ|>1` grow SLOWER → delays the blow → wall relocates (2.74→2.41), doesn't close (Re μ>0 is unstable for ANY ε); ep_t2fix (better gradient blew earlier) = cleaner gradient drives into the unstable regime faster; jacreg history (adaptive→2.40, frozen→blow) = jacreg cuts ‖J_nc‖ → pushes the complex pair back to Re μ<0. + +**FIX (FLIPPED): integration (adaptive ε / Anderson / implicit) only DELAYS** — can't converge to an unstable/absent fixed point (Anderson got 1.4e-3, not a root = direct proof). **The REAL fix is MODEL-SIDE: jacreg** (reduce ‖J_nc‖ → push the rotating complex pair back to Re μ<0), per fugu's Q2 (FUGU_OPTIONS_VERDICT.md): use an *adaptive bounded* jacreg homeostat (the validated 2.40 runs used adaptive jacreg; diverging runs froze it). **NEXT: jacreg run + verify with `eig_probe.py` that jacreg holds Re μ<0 / keeps the complex pair below the bifurcation.** Files: eig_probe.py, anderson_control.py, adaptive_eps_calib*.py, FUGU_OPTIONS_VERDICT.md. + +## 2026-06-23 (cont. 9) — FIX CONFIRMED AT THE MECHANISM LEVEL: adaptive jacreg pushes Re μ<0, kills the Hopf, restores the fixed point. +`ep_jacreg` = warm-start s2000 + ε=0.1 + **ADAPTIVE jacreg** (`--jacreg 0.1 --jr_floor 0.1 --jr_max 16 --res_target 1e-3 --res_gate 0`, controller :520-529). The ONLY change vs redx (which froze `jr_max=jr_floor=0.1`) is releasing the controller. It descends cleanly **past redx's 2.74 blow** (best 2.7495, res controlled ~1e-3, jr 0.1–0.3, no blow; ~4.5e-4 CE/step after a slow warm-start re-adjustment). +**eig_probe (`eig_jacreg.py`) at the SAME loss ~2.74 — the decisive A/B:** +- FROZEN jacreg (redx s3200, blew): leading μ=+0.45 (real) + +0.26±1.37j (**COMPLEX/rotating**), |λ|>1 UNSTABLE, g_floor=0.26 (cycle, no fixed point). 
+- ADAPTIVE jacreg (ep_jacreg @2.75): leading μ=−0.23,−0.25,−0.27 (**ALL real, NO rotating pair**), |λ|<1 STABLE, g_floor=**0.0001** (TRUE fixed point restored). +⇒ adaptive jacreg moved the leading eigenvalue from **Re μ=+0.45 (unstable rotating Hopf) → Re μ=−0.23 (stable real)**, killing the rotating instability and restoring a true fixed point (g 0.26→1e-4). This confirms BOTH the diagnosis (Hopf from the non-conservative ‖J_nc‖) AND the fix (jacreg reduces ‖J_nc‖ → Re μ<0), at the eigenvalue level, not just the loss. **Bonus:** the restored fixed point makes AsymEP valid again (it requires a stationary state — invalid on redx's cycle). NEXT: does it clear 2.41 (ep_eps05's blow) → 2.09 → below; re-run eig_probe at 2.41 to confirm jacreg holds Re μ<0 deeper. Production = compose with adaptive ε (fugu Q3) for sim fidelity/speed. Files: eig_jacreg.py, ep_jacreg.log. +**OBS (oscillation, in-progress):** windowed val_std SHRANK monotonically as it descended (0.159→0.031, steps 50→3000) = converging. At step 3400 a TRANSIENT val spike (2.65→3.67) self-recovered over ~150 steps (→2.79), best preserved (2.5624). KEY: throughout the spike **res stayed LOW (4e-4→8e-5) and jr LOW (0.1)** → it is NOT a Hopf/relaxation instability (res did not inflate) but a benign weight/CE training transient (a momentarily-off update; relaxation stayed healthy). So jacreg's Hopf-suppression is intact; spike is a separate phenomenon. Watch: if such spikes recur/grow, investigate — but "res-controlled-during-spike" already rules out Hopf re-emergence. |
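
The cont. 7 vs cont. 8 distinction reduces to the explicit-Euler multiplier |1+εμ| of a continuous eigenvalue μ. The sketch below is illustrative, not taken from `eig_probe.py`: it uses the μ values reported above, plus one clearly hypothetical stable rotating mode (μ=−0.05+1.37j) to show what a true Euler artifact would look like. Function names are assumptions.

```python
def euler_multiplier(mu, eps):
    """|1 + eps*mu|: per-step growth factor of mode mu under explicit Euler."""
    return abs(1.0 + eps * mu)

def max_stable_eps(mu):
    """Largest eps with |1 + eps*mu| < 1, from |1+eps*mu|^2 = 1 + 2*eps*Re(mu) + eps^2*|mu|^2.

    Returns 0.0 when Re(mu) >= 0: no positive step size stabilises such a mode.
    """
    if mu.real >= 0.0:
        return 0.0
    return -2.0 * mu.real / abs(mu) ** 2

modes = {
    "s3200 complex pair (measured)": complex(0.26, 1.37),
    "ep_jacreg leading (measured)": complex(-0.23, 0.0),
    "stable rotating mode (HYPOTHETICAL)": complex(-0.05, 1.37),
}
for name, mu in modes.items():
    growth = {eps: round(euler_multiplier(mu, eps), 4) for eps in (0.1, 0.05, 0.01)}
    print(name, "| eps_max =", round(max_stable_eps(mu), 4), "|", growth)
```

For the measured Re μ>0 pair the multiplier exceeds 1 at every ε and only approaches 1 more slowly as ε shrinks, which matches "smaller ε relocates the wall but does not close it". The hypothetical Re μ<0 rotating mode is unstable at ε=0.1 yet stable at ε=0.01, which is the pattern cont. 7 assumed and the eigenvalue measurement ruled out.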
