# Fugu query — fix options for the EP below-CE-2.1 divergence (2026-06-23)

Read for full chain: EP_DIAGNOSIS_DOSSIER.md, EP_BELOW210_DIAGNOSIS_FIX.md (cont.6/cont.7), FUGU_VERDICT_FULL.md.
Code: lt_ep_train.py (force/tforce :81-106, relax :123 `z=z+eps*blk.force(z,xin).detach()`, ep_step :140, jacreg :211-219), holo_ep.py.

## CONFIRMED diagnosis (measured this session)
The divergence is the **explicit-Euler instability of the *stiffening* rotating (non-conservative) attention mode**:
- relax is explicit Euler `z←z+εF`, ε=0.1 → stability object = discrete map M=I+εJ. Attention is non-conservative (independent WQ/WK/WV/WO) → J has a complex eigenvalue μ=a+ib with **a<0 (continuous-STABLE)**, b≠0. Euler is unstable when |1+εμ|>1; since |1+εμ|²=1+2εa+ε²(a²+b²), that means ε>ε_crit=−2a/(a²+b²). EP training makes attention more expressive → mode stiffens (b grows) → ε_crit shrinks → at fixed ε the mode crosses ε_crit → forward LIMIT CYCLE → loss blows up.
- **ε-monotonicity (decisive), 3 training runs identical except ε:** ε=0.1→blew@**2.74**; ε=0.1,t2sel=160 (BETTER gradient, cos0.998)→blew@**3.02** (EARLIER — gradient is NOT the lever); ε=0.05→blew@**2.41**. Smaller ε → strictly lower wall; fixed ε only RELOCATES it (as attention keeps stiffening, even ε=0.05 eventually too coarse).
- **⇒ digital-sim integration artifact:** Re μ<0 ⇒ the CONTINUOUS ODE (analog HW, ε→0) is stable; only the coarse explicit-Euler sim cycles. eval ε-sweep on marginal ckpt s3200 confirms (res-floor shrinks monotonically as ε↓; ε=0.01 converges).
- AsymEP gradient is accurate WHEN a fixed point exists (cos 0.99 vs exact adjoint at tight res); the failure is the FORWARD relaxation losing its fixed point, not the gradient.
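
A minimal numeric check of the ε_crit formula above, as a sketch only. The values a=−1 and the sweep over b are hypothetical illustration values, not measured eigenvalues from the model; the function names are made up for this example.

```python
def eps_crit(a: float, b: float) -> float:
    """Largest stable explicit-Euler step for an eigenvalue mu = a + ib with a < 0."""
    if a >= 0:
        raise ValueError("mode is not continuous-stable (need a < 0)")
    return -2.0 * a / (a * a + b * b)


def euler_gain(eps: float, a: float, b: float) -> float:
    """Per-step amplification |1 + eps*mu| of the discrete map M = I + eps*J."""
    return abs(1 + eps * complex(a, b))


if __name__ == "__main__":
    a, eps = -1.0, 0.1  # hypothetical decay rate; eps = 0.1 as in the notes
    for b in (1.0, 3.0, 4.0, 5.0, 8.0):  # hypothetical stiffening sweep
        print(f"b={b:3.1f}  eps_crit={eps_crit(a, b):.4f}  "
              f"gain(eps={eps})={euler_gain(eps, a, b):.4f}")
```

With these toy numbers the mode is Euler-stable at ε=0.1 for b=4 (ε_crit=2/17≈0.118) and unstable for b=5 (ε_crit=2/26≈0.077), while ε=0.05 is still stable at b=5: the wall moves with ε even though a<0 throughout.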

## The 3 candidate fixes
(a) **adaptive ε** [#30]: closed-loop step-size — grow ε when contracting, shrink on residual OVERSHOOT. Integration-axis; SIM-only; no model change / no expressivity cost.
(b) **jacreg**: penalize ‖J_nc·v‖ (non-conservative Jacobian = attention+FFN; lt_ep_train.py:211-219 JVP through blk.nc_force) → reduces |Im μ| → raises ε_crit. MODEL change. Early 2.40-validated runs used *adaptive* jacreg; diverging runs froze it weak.
(c) **smaller fixed ε** (0.02/0.03): just moves the wall down; confirmatory, not a fix.
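
To make option (b) concrete, here is a toy sketch of a ‖J·v‖² penalty estimated with random unit probes. It is not the code at lt_ep_train.py:211-219: the real penalty goes through `blk.nc_force`, whereas `nc_force` below is a hypothetical 2-D linear stand-in, and the JVP is taken by central differences rather than autograd. The exact form of the real penalty (norm, probe distribution, weighting) is not stated in these notes and is assumed here.

```python
import math
import random


def nc_force(z, a, b):
    """Hypothetical stand-in for a non-conservative force: F = J z, J = [[a, -b], [b, a]]."""
    x, y = z
    return (a * x - b * y, b * x + a * y)


def jvp_fd(f, z, v, h=1e-5):
    """Central-difference Jacobian-vector product J(z) v (stand-in for an autograd JVP)."""
    zp = tuple(zi + h * vi for zi, vi in zip(z, v))
    zm = tuple(zi - h * vi for zi, vi in zip(z, v))
    return tuple((p - m) / (2 * h) for p, m in zip(f(zp), f(zm)))


def jacreg_penalty(f, z, n_probes=8, seed=0):
    """Mean of ||J v||^2 over random unit probe vectors v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_probes):
        v = [rng.gauss(0.0, 1.0) for _ in z]
        norm = math.sqrt(sum(vi * vi for vi in v))
        v = [vi / norm for vi in v]
        jv = jvp_fd(f, z, v)
        total += sum(c * c for c in jv)
    return total / n_probes


if __name__ == "__main__":
    a, b = -1.0, 3.0  # hypothetical eigenvalue a + ib
    pen = jacreg_penalty(lambda z: nc_force(z, a, b), (0.3, -0.7))
    print(f"penalty={pen:.6f}  a^2+b^2={a * a + b * b:.6f}")
```

For this particular toy J every unit probe gives ‖J·v‖² = a²+b² = |μ|², which is exactly the denominator of ε_crit, so shrinking the penalty by reducing b raises ε_crit. Whether the same link holds for the real non-normal attention+FFN Jacobian is not something this sketch establishes.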

## Adaptive-ε calibration so far (eval-only on s3200 cycling-op + s2000 smooth-op; ground-truthed by the ε-sweep)
- NAIVE controller (shrink if g=‖F‖/‖z‖ falls <2%/step): WRONG — parks ε at floor 0.005 on ALL ops, because slow per-step contraction (ρ~0.99) is misread as "ε too big." Conflates small-ε's slow contraction with instability.
- CORRECTED (shrink only on OVERSHOOT g_t>g_{t-1}; grow otherwise): adapts as desired. s3200 (stiff): ε→0.003-0.008, g floors ~0.10 (≈ε=0.01 benchmark 0.09); s2000 (smooth): converges to **g=0** (true fixed pt), ε grows toward 0.1. Tradeoff in overshoot tolerance: strict (tol=1.0) catches the s3200 cycle but over-shrinks near s2000's converged tail (float-noise ticks → spurious shrink); permissive (tol=1.02) stays fast on smooth but MISSES s3200's slow cycle (stays ε~0.095, g~0.23). → suggests an EMA-smoothed signal / step-doubling local-error estimate / noise-relative tol.
- NOTE: s3200 g floors ~0.09 even at tiny ε (genuinely no fixed point at the marginal op, OR just slow finite-step convergence — ambiguous); s2000→g=0 cleanly.
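
The overshoot rule from the CORRECTED controller can be sketched on a toy problem as follows. Everything here is an assumption for illustration: the force is a 2-D linear rotating mode around a hypothetical fixed point, the shrink/grow factors (0.5 / 1.1) are invented, and only ε_min=0.005, ε_max=0.1, and the g_t > tol·g_{t-1} test are taken from the notes. A linear toy diverges rather than settling into a limit cycle, so it only illustrates the step-size logic.

```python
import math


def force(z, z_star, a, b):
    """Toy force F = J (z - z_star), J = [[a, -b], [b, a]], eigenvalues a ± ib."""
    dx, dy = z[0] - z_star[0], z[1] - z_star[1]
    return (a * dx - b * dy, b * dx + a * dy)


def relax(a, b, steps, adaptive, eps=0.1, eps_min=0.005, eps_max=0.1,
          shrink=0.5, grow=1.1, tol=1.0):
    """Explicit-Euler relaxation; returns (final eps, first g, last g)."""
    z_star = (10.0, 0.0)   # hypothetical fixed point, far from the origin
    z = (10.1, 0.0)        # start slightly off the fixed point
    g_first = g_prev = None
    for _ in range(steps):
        f = force(z, z_star, a, b)
        g = math.hypot(*f) / math.hypot(*z)   # relative residual g = |F| / |z|
        if g_first is None:
            g_first = g
        elif adaptive:
            if g > tol * g_prev:              # overshoot: residual went up
                eps = max(eps * shrink, eps_min)
            else:                             # still contracting: speed up
                eps = min(eps * grow, eps_max)
        g_prev = g
        z = (z[0] + eps * f[0], z[1] + eps * f[1])   # Euler step z <- z + eps*F
    return eps, g_first, g_prev


if __name__ == "__main__":
    print("stiff, fixed eps   :", relax(-1.0, 8.0, steps=200, adaptive=False))
    print("stiff, adaptive    :", relax(-1.0, 8.0, steps=2000, adaptive=True))
    print("smooth, adaptive   :", relax(-1.0, 1.0, steps=200, adaptive=True))
```

On the stiff toy mode (b=8, ε_crit=2/65≈0.031) fixed ε=0.1 diverges, while the controller sawtooths just below ε_crit and keeps contracting; on the smooth mode (b=1) ε stays at the 0.1 ceiling. The sketch uses the strict tol=1.0 setting and does not address the noise-tail issue described above.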

## QUESTIONS — answer Q1→Q3 in order, decisive, grounded in code/data, flag solid-vs-uncertain.

**Q1 — Evaluate (a)/(b)/(c).** Which ELIMINATES the wall vs merely RELOCATES it? Which fits the *analog* target (continuous relaxation)? Rank + justify. Is adaptive ε guaranteed to eliminate the wall, or does its ε_min floor just relocate it (like fixed small ε) IF the rotating mode stiffens without bound — and does b/|Im μ| stiffen unboundedly as CE drops, or saturate?

**Q2 — The jacreg paradox.** If the divergence is fundamentally a *simulation* discrete-Euler artifact (Re μ<0, continuous/analog stable), WHY does jacreg work? Is it just **raising ε_crit** (cutting |Im μ| so fixed ε becomes stable) — fixing the SAME discretization wall from the model side — or fixing a genuine continuous-time problem? **Does jacreg's benefit transfer to analog hardware** (continuous, no ε): (i) papering over a sim artifact analog wouldn't have, or (ii) also improving the continuous dynamics (faster/less-ringing settling — useful on analog)? So: is jacreg a "real fix" or a sim-crutch, relative to adaptive ε?

**Q3 — Anderson / implicit integrators.** Can Anderson acceleration or implicit/semi-implicit (IMEX) integration replace explicit Euler to kill the discretization instability? (i) Compatible with **AsymEP** — which needs free equilibrium z*, nudged equilibrium, and the local Jacobian at z* (the −2A correction)? Does changing the forward integrator break the EP gradient estimator (which assumes relaxation reaches z*)? (ii) Implicit Euler's per-step nonlinear solve — tractable for a transformer block, or self-defeating (the solve is itself a relaxation)? (iii) Anderson on the fixed-point iteration — only speeds convergence, or also STABILIZES (suppresses the limit cycle)? (iv) Does the integrator choice matter for the **analog** target (continuous, no integrator), or is this purely sim-side acceleration — i.e. is the right framing "analog HW does the true continuous relaxation; the sim just needs a faithful+cheap emulator, and adaptive-ε / Anderson / implicit are all just better emulators"?