Diffstat (limited to 'ep_run/FUGU_OPTIONS_VERDICT.md')
| -rw-r--r-- | ep_run/FUGU_OPTIONS_VERDICT.md | 263 |
1 files changed, 263 insertions, 0 deletions
diff --git a/ep_run/FUGU_OPTIONS_VERDICT.md b/ep_run/FUGU_OPTIONS_VERDICT.md
new file mode 100644
index 0000000..4e3ed25
--- /dev/null
+++ b/ep_run/FUGU_OPTIONS_VERDICT.md
@@ -0,0 +1,263 @@
+# FUGU_OPTIONS_VERDICT — Q1–Q3 (independently verified)
+
+Scope: answers grounded in `lt_ep_train.py` (`force`/`tforce` :81-106, `relax` :123-133,
+`ep_step` :140-232, `jacreg` :211-219, weight caps :52-53/398-399/563-567), `holo_ep.py`,
+the calibration probes (`adaptive_eps_calib.py`, `adaptive_eps_calib2.py`, `eps_sweep_s3200.py`,
+`jnc_scaling.py`, `lt_ep_anderson.py`), and the diagnosis dossiers. Each claim is flagged
+**[SOLID]** (proved by code/data in repo) or **[UNCERTAIN]** (reasoned, not measured here).
+
+---
+
+## Shared mechanism (the object all three questions act on)
+
+**[SOLID]** The active free relaxation is explicit (forward) Euler:
+`z = z + eps * blk.force(z, xin).detach()` (`relax`, :123-133). In thick mode the force is
+`F(z) = -(z - xin) + Attn(LN1 z) + FFN(LN2 z) - c*z` (`tforce`/`force` :81-85, :102-106), c=1.
+So the per-step linear stability object is the **discrete map** `M = I + eps*J`, `J = dF/dz`.
+
+**[SOLID]** For a continuous eigenvalue `mu = a + i b` of `J`, the Euler multiplier is
+`lambda = 1 + eps*mu`, and the map is stable iff `|1+eps*mu| < 1`, i.e.
+`eps < eps_crit = -2a/(a^2 + b^2)` for `a < 0`. A continuous-STABLE rotating mode (`a<0`, `b` large)
+is destabilized purely by too-large `eps`.
+
+**[SOLID]** The ε-monotonicity training data are decisive that this is an *integration* wall, not a
+*gradient-quality* wall: eps=0.1 blew @ CE 2.74; eps=0.1 with a strictly better gradient (t2sel=160,
+cos 0.998) blew EARLIER @ 3.02; eps=0.05 reached 2.41 before blowing. Better gradient → not later but
+earlier; smaller step → strictly lower wall. That is exactly the `|1+eps*mu|>1` signature.
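As a quick numerical check of the Euler stability condition above, the sketch below evaluates `eps_crit` and the multiplier magnitude for one hypothetical mode. The values `a = -2`, `b = 8` are illustrative assumptions (only `a = -2` mirrors the passive leak with `c=1`); they are not eigenvalues measured in the repo, and the helper names are invented for this example.

```python
def eps_crit(a: float, b: float) -> float:
    """Largest forward-Euler step that keeps |1 + eps*mu| < 1 for mu = a + i*b."""
    if a >= 0.0:
        # Not continuous-stable: no positive step size makes Euler contract.
        return 0.0
    return -2.0 * a / (a * a + b * b)


def euler_multiplier(eps: float, a: float, b: float) -> float:
    """Magnitude of the per-step multiplier lambda = 1 + eps*mu."""
    return abs(1.0 + eps * complex(a, b))


# Hypothetical damped-but-rotating mode (assumed values, not measured).
a, b = -2.0, 8.0
crit = eps_crit(a, b)
for eps in (0.1, 0.05, 0.01):
    lam = euler_multiplier(eps, a, b)
    status = "stable" if lam < 1.0 else "UNSTABLE"
    print(f"eps={eps:<5} |lambda|={lam:.4f}  {status}  (eps_crit={crit:.4f})")
```

For this assumed mode the continuous dynamics decay (`a < 0`), yet `eps=0.1` lies above `eps_crit = 4/68` and the discrete map grows, while `eps=0.05` contracts.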
+
+### One correction to the dossier's "continuous/analog is stable at s3200" framing
+**[SOLID — verified, refines prior verdict]** The eps-sweep "CONVERGED at eps=0.01" is measured with a
+*different residual* than the cycle floor. `eps_sweep_s3200.py:17` reports the **step** residual
+`r = ‖z2-z‖/‖z‖ = eps·‖F‖/‖z‖`; `adaptive_eps_calib.py:15` reports the **force** residual
+`g = ‖F‖/‖z‖`. At eps=0.01 the sweep's `r≈8.9e-4` is just `0.01 × 0.089` — i.e. the *same* force-floor
+`g≈0.09` that is called a "cycle" at eps=0.1. `FUGU_Q_OPTIONS.md` itself flags this:
+"s3200 g floors ~0.09 even at tiny ε (genuinely no fixed point at the marginal op, OR just slow
+finite-step convergence — ambiguous)."
+**Implication:** the eps-sweep robustly proves *the oscillation/blow-up is a discrete-Euler artifact*
+(the cycle amplitude dies as eps→0). It does **not** by itself prove the s3200 operator has a true
+attracting fixed point (g→0) in continuous time — the force floor g≈0.09 persists. The clean
+continuous-stable case is s2000 (g→0). So "analog HW would have no problem" is **[SOLID]** for the
+*oscillatory blow-up* but **[UNCERTAIN]** for "s3200 settles to a usable equilibrium." The decisive
+missing measurement remains the leading eigenpair of `J`/`M` at a continued fixed-point branch
+(sign of `Re mu`).
+
+---
+
+## Q1 — Evaluate (a) adaptive ε, (b) jacreg, (c) smaller fixed ε
+
+**Bottom line:**
+- **(c) smaller fixed ε — RELOCATES the wall. [SOLID]** Already shown empirically (2.74→2.41).
+- **(b) jacreg — RAISES/RELOCATES the wall from the model side. [SOLID it raises eps_crit; UNCERTAIN whether it can eliminate]** It lifts `eps_crit` by cutting `|Im mu|`/gain, but at fixed ε it is still a wall in `eps_crit`-space; it also taxes the expressivity it suppresses.
+- **(a) adaptive ε — ELIMINATES the fixed-ε wall *iff* its floor stays below the instantaneous `eps_crit`; otherwise it degenerates to (c). [SOLID for the mechanism; the guarantee is conditional]**
+
+### Ranking
+**To remove the measured software wall while preserving the model and the analog target:**
+1. **Adaptive ε / robust solver** — only option that removes the *fixed-step* wall with **zero model/expressivity cost** and **zero change to the analog target**. It is pure integration-axis.
+2. **jacreg** — effective secondary homeostat; raises `eps_crit`, but changes the learned operator and can cap the non-normality the good (BPTT-1.83) solution uses.
+3. **smaller fixed ε** — diagnostic/fallback only; permanently pays the small-step cost on *every* example (including smooth ones) and still fails once stiffening passes the new floor.
+
+**For the analog (continuous) target specifically:** adaptive ε and smaller fixed ε are *emulator*
+choices that leave the model identical to what analog HW runs — they are the right kind of fix.
+jacreg *changes the model that analog HW would run* (see Q2).
+
+### (a) Adaptive ε — grounded in code
+**[SOLID]** `adaptive_eps_calib2.py` uses the correct signal: shrink only on **overshoot**
+(`g_t > prev*tol` → `eps*=down`), grow otherwise. The naive `adaptive_eps_calib.py` controller
+(shrink on slow contraction) is shown to mis-park ε at the floor on all ops — it conflates small-ε's
+slow contraction with instability. The corrected controller behaves as a continuous-relaxation
+emulator: stiff s3200 → ε to 0.003-0.008; smooth s2000 → ε grows toward 0.1 and reaches g=0.
+
+### Is adaptive ε *guaranteed* to eliminate the wall? — the eps_min question
+**[SOLID, decisive]** No, not unconditionally. With a hard floor `eps_min`, adaptive ε eliminates the
+wall only while `eps_min < eps_crit = -2a/(a^2+b^2)`. If training keeps stiffening the rotating mode so
+`eps_crit` falls below `eps_min`, adaptive ε becomes a fixed small step at the floor — i.e. it
+**degenerates into option (c) and merely relocates the wall.** So the guarantee is conditional on the
+floor, and equivalently on whether `eps_crit` (hence `|Im mu|`) is bounded away from where the floor
+sits.
+
+### Does |Im μ| (b) saturate or grow unboundedly as CE drops?
+This is the crux, and the honest answer is split:
+
+- **[SOLID] There IS structural stiffness-bounding machinery in the code that argues for saturation.**
+  (i) `qknorm` RMSNorms q,k → softmax logits are bounded regardless of weight growth (`attn` :63-67);
+  (ii) **weight-norm caps**: `capw = {WQ,WK,WV,WO,Wm,Wh,fc,pj}` are each projected back to
+  `capx × initial-norm` every optimizer step (`:52-53`, `:398-399`, `:563-567`); (iii) damping `c=1`
+  gives a passive `-(1+c)z = xin-2z` contraction floor; (iv) LayerNorm bounds input scale into attn/FFN;
+  (v) weight decay. With qknorm + capped projections + LN, the per-matrix gains feeding `J_nc` cannot
+  grow without bound, which bounds `|Im mu|` and therefore keeps `eps_crit` bounded **below**. This is a
+  genuine reason to expect `|Im mu|` to **saturate** (or at least be bounded) rather than diverge.
+
+- **[SOLID, opposing data point] But within the *observed* range stiffness was still rising:** fixed
+  ε=0.1→0.05 moved the wall 2.74→2.41 rather than removing it, i.e. `eps_crit` was still falling across
+  that CE interval. So saturation, if it exists, had not yet bitten in the measured window.
+
+- **[UNCERTAIN] No direct eigenvalue/`|Im mu|`-vs-CE trace exists in the repo.** `jnc_scaling.py`
+  measures `‖J_nc‖` growth-per-step vs width but is not a CE-resolved `|Im mu|` curve. So whether `b`
+  truly plateaus before `eps_crit` reaches a practical `eps_min` is **not measured**.
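The overshoot-triggered controller and the `eps_min` condition discussed above can be illustrated on a toy problem. This is a minimal sketch under stated assumptions, not the logic of `adaptive_eps_calib2.py` (whose source is not reproduced here): the scalar complex force `F(z) = mu*(z - z_star)`, the mode `mu = -2 + 8i`, and every parameter name and default are hypothetical. Rejecting the overshooting step and reporting a floor hit are design choices of this sketch.

```python
def force(z: complex, mu: complex, z_star: complex) -> complex:
    """Toy linear force F(z) = mu * (z - z_star); its only root is z_star."""
    return mu * (z - z_star)


def relax_adaptive(z, mu, z_star, eps=0.1, eps_min=1e-4, eps_max=0.1,
                   tol=1.0, down=0.5, up=1.2, g_stop=1e-6, max_iter=20000):
    """Forward Euler with an overshoot-triggered step size.

    Returns (z, eps, g, hit_floor), where g = |F|/|z| is the force residual.
    """
    g_prev = abs(force(z, mu, z_star)) / abs(z)
    hit_floor = False
    for _ in range(max_iter):
        if g_prev < g_stop:
            break
        z_try = z + eps * force(z, mu, z_star)
        g_try = abs(force(z_try, mu, z_star)) / abs(z_try)
        if g_try > g_prev * tol:
            # Overshoot: reject the trial step and shrink eps.
            if eps <= eps_min:
                # Overshoot persists at the floor: report it instead of stepping on.
                hit_floor = True
                break
            eps = max(eps * down, eps_min)
            continue
        # Residual did not grow: accept the step and let eps grow again.
        z, g_prev = z_try, g_try
        eps = min(eps * up, eps_max)
    return z, eps, g_prev, hit_floor


def relax_fixed(z, mu, z_star, eps, steps):
    """Plain fixed-step forward Euler, for comparison."""
    for _ in range(steps):
        z = z + eps * force(z, mu, z_star)
    return z


# Assumed mode with eps_crit = 4/68, i.e. below the fixed step 0.1.
mu, z_star, z0 = complex(-2.0, 8.0), complex(10.0, 0.0), complex(10.5, 0.0)
z_fix = relax_fixed(z0, mu, z_star, eps=0.1, steps=50)
z_ad, eps_ad, g_ad, hit_floor = relax_adaptive(z0, mu, z_star)
print("fixed eps=0.1 error:", abs(z_fix - z_star))
print("adaptive error     :", abs(z_ad - z_star), "eps:", eps_ad, "hit_floor:", hit_floor)
```

Calling `relax_adaptive(z0, mu, z_star, eps_min=0.08)` places the floor above this mode's `eps_crit`; the sketch then stops with `hit_floor=True` rather than continuing as a fixed small step, which is the conditional nature of the guarantee in miniature.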
+
+**Synthesis (decisive, hedged correctly):** adaptive ε is the best wall-eliminator and the only
+zero-tax, analog-faithful one — **and** the code's caps/qknorm/damping make it *likely* that `|Im mu|`
+is bounded, so a sufficiently small `eps_min` should eliminate (not merely relocate) the wall in
+practice. But this is a *bounded-floor* guarantee, not an unconditional one: if `|Im mu|` were to grow
+without bound, any finite `eps_min` is eventually a wall. **Recommended:** make the floor itself
+log an `eps_crit` proxy (overshoot persisting at the floor) and either drop the floor, reject the step,
+or hand off to Anderson — i.e. fail-open rather than fail-into-(c).
+
+---
+
+## Q2 — The jacreg paradox
+
+**Verdict: no paradox. jacreg works by RAISING `eps_crit` from the model side — it fixes the SAME
+discretization wall, not a demonstrated continuous-time instability. Relative to adaptive ε it is a
+sim-crutch for the measured failure, but it carries a *separate, real* analog benefit (settling
+quality), and it would become a genuine fix if a true continuous instability (Re μ≥0) ever emerges.**
+
+### Why a model-side stiffness penalty fixes a simulation artifact — mechanism
+**[SOLID]** `jacreg` is a Hutchinson JVP penalty `R = jacreg·‖J_nc·er‖²/‖er‖²` (`:211-219`), and in thick
+mode `nc_force = Attn + FFN` (`:92-97`). Minimizing `‖J_nc‖` reduces the learned non-conservative
+gain, which reduces the rotating component `|b|=|Im mu|` (and non-normal amplification). Since
+`eps_crit = -2a/(a^2+b^2)`, smaller `|b|` → **larger** `eps_crit` → fixed ε=0.1 stays under the
+Euler-stability boundary longer. So jacreg moves the *same* `|1+eps*mu|=1` wall by shrinking `b`, while
+adaptive ε moves the *same* wall by shrinking `eps`. Two knobs on one inequality.
+
+### Raising eps_crit vs fixing a continuous-time problem
+**[SOLID for measured regime]** For s3200-type failures the relevant mode has `Re mu < 0` (the cycle
+dies as eps→0). There is no *established* continuous instability to fix, so jacreg's contribution there
+is purely "raise eps_crit" — discretization-wall relief from the model side.
+**[UNCERTAIN beyond it]** If training ever drives `Re mu → 0⁺` (a true Hopf), then no integrator
+(adaptive ε, implicit, Anderson) can stabilize the original continuous equilibrium; only a model-side
+change (jacreg, stronger damping/c, structural monotonicity, gain/asymmetry bounds) is a real fix.
+jacreg is the insurance policy for that case.
+
+### Does the benefit transfer to analog hardware? — two benefits, separated
+**[SOLID] (i) The "prevents eps=0.1 Euler blow-up" benefit does NOT transfer.** Analog HW has no `eps`
+and does not iterate `z←z+εF`; it performs continuous relaxation. If `Re mu<0`, the analog ODE is
+stable and never had this wall. To the extent jacreg only buys eps_crit headroom, it is papering over a
+sim artifact analog wouldn't have — a crutch.
+
+**[SOLID/UNCERTAIN-magnitude] (ii) The "less stiff/less ringy continuous dynamics" benefit DOES
+transfer.** Even with `Re mu<0`, a large `|Im mu|` mode has a poor damping ratio: it rings, settles
+slowly, demands more bandwidth, longer observation/integration windows, and is more noise/delay
+sensitive — all of which degrade the *physical* free-phase settling and the readout of nudged
+equilibria on analog HW. Reducing `‖J_nc‖` improves the continuous damping ratio. So jacreg is *also* a
+legitimate analog settling/robustness regularizer. **[UNCERTAIN]** the size of this analog benefit is
+not measured here.
+
+### Real fix or sim-crutch, relative to adaptive ε?
+**[SOLID]** For the *confirmed explicit-Euler artifact*:
+- **adaptive ε / Anderson / implicit = the real fix of the emulator** — they preserve the learned
+  vector field and make the digital sim stop inventing a cycle the analog system wouldn't have.
+- **jacreg = a model-changing crutch for that artifact**, but simultaneously a *real* (if secondary)
+  analog settling regularizer and the *only* lever if a genuine continuous instability appears.
+
+**Recommended composition (not "either/or"):** (1) use adaptive ε / a real solver as the primary
+emulator fix so the sim is faithful; (2) keep jacreg as a **bounded, adaptive** homeostat
+(the controller already exists, `:520-529`) sized for analog settling-time/robustness or true
+marginality — NOT as a strong fixed penalty that taxes the non-normality the BPTT-1.83 solution needs.
+The historical evidence fits this: the validated ~2.40 runs used *adaptive* jacreg; the diverging runs
+*froze it weak* — i.e. they removed the homeostat, not the integrator.
+
+---
+
+## Q3 — Anderson acceleration / implicit (IMEX) integrators
+
+**Verdict: Yes — they can replace explicit Euler as the *solver* and kill the discretization
+instability, and they are compatible with AsymEP *provided they converge to the same equilibria of the
+same vector fields*. They change nothing about the analog model; they are emulator-fidelity choices.
+Implicit Euler is unconditionally stable but per-step expensive (the solve is itself a relaxation).
+Anderson is the more practical lever: it both accelerates and can suppress the Euler cycle when a true
+fixed point exists, but it is not guaranteed and needs damping/restarts/residual gating.**
+
+### (i) Compatibility with AsymEP
+**[SOLID]** The EP estimator depends on the *states*, not on how they were reached. `ep_step` computes
+`zs = relax(...)` and treats it as the free equilibrium (`:142-144`); the AsymEP correction uses local
+`Jv = jvp(nc_force, zs, v)`, `JTv = vjp(nc_force, zs, v)`, `corr = Jv - JTv` at `zs` (`:172-178`); the
+parameter gradient is `(a * f).sum()` with `f = force(zs.detach(), xin, cg=True)` (`:202-205`). None of
+this requires explicit Euler — it requires that `zs` is a genuine root `F(zs)≈0` and that the nudged
+states are equilibria of the nudged/corrected force. A better solver that returns the *same roots* is
+fully compatible, and the `-2A` correction is computed *at* `z*` regardless of the solver that found it.
+
+**[SOLID — important, refines prior framing] The nudged phase must also be re-solved.** The free phase
+is not the only explicit-Euler loop: the nudge (`nudge()` :163-180) and every holomorphic estimator
+(`holo_a`, `holo_a_select2`, `holo_a_track`, `holo_a_lockin` in `holo_ep.py`) advance with
+`z = z + eps*(f - corr)`. The `-2A` correction lives *inside* these loops. So "swap the integrator"
+means swap it in **both** phases; a solver that converges the free `z*` but leaves the nudged phase on
+coarse Euler will still corrupt `a = -dz*/dβ`.
+
+**[SOLID] Hard limit:** if the continuous field has no attracting root in the operating regime, no
+solver can manufacture the stationary state AsymEP needs — it will fail, find a spurious root, or
+return a numerical artifact. A solver fixes *integration*, not *non-existence of equilibrium*. (This is
+why the s3200 force-floor ambiguity from the Shared-mechanism section matters: confirm a true root
+exists before trusting AsymEP there.)
+
+### (ii) Implicit / IMEX — tractable or self-defeating?
+**[SOLID, theory]** Backward Euler multiplier is `1/(1-h·mu)`, A-stable: for any `Re mu<0` it is stable
+at *every* step size, so it would kill the stiff-rotation Euler cycle outright.
+**[SOLID, cost]** Each backward step solves `y - h·F(y) - z_n = 0`, where `F` contains LN, causal
+softmax attention, and FFN. A Newton/Krylov/Picard solve needs several force evals and matrix-free
+JVP/VJP linear solves over the full `B·T·C` state per step — i.e. **the per-step solve is itself a
+relaxation/root-find**, which is the self-defeating risk for a default inner loop.
+**[UNCERTAIN/qualitative] IMEX nuance:** making only the cheap leak `-(1+c)z` implicit is trivial but
+does **not** tame the dangerous learned rotating attention mode (the danger is in `J_nc`, not the leak);
+treating `J_nc` implicitly reintroduces the big linear solve. So implicit/IMEX is best as a **robust
+fallback / macro-step / offline reference**, not the default per-step integrator.
+
+### (iii) Anderson — speed only, or stabilization too?
+**[SOLID, conceptual]** Anderson (DEQ-style; `lt_ep_anderson.py` stores recent `X`, `G(X)=z+εF`, solves
+a small regularized least-squares for the mixing coefficients, extrapolates) is a quasi-Newton/GMRES-on-the-residual.
+For a Picard/Euler map whose oscillatory multiplier sits just outside the unit circle,
+the residual-minimizing extrapolation can **suppress the limit cycle**, not merely speed a contracting
+one — so it is more than acceleration. `lt_ep_anderson.py` is explicitly framed as exactly this test
+("can a fixed-point solver converge the free phase where plain relaxation cannot?").
+**[SOLID, caveats]** Not guaranteed: it cannot create a root that doesn't exist; aggressive mixing can
+diverge; it needs damping (β-mixing), restarts, and residual-monotonicity gating; and (per (i)) it must
+wrap the nudged phase too. Net: **strongest practical candidate** — cheaper than full implicit Newton,
+able to stabilize when a root exists, but must be safeguarded.
+
+### (iv) Does integrator choice matter for the ANALOG target?
+**[SOLID] For the analog model itself: no.** Analog HW performs the true continuous relaxation of `F`;
+it runs no explicit Euler, no Anderson, no backward Euler. The integrator is not part of the deployed
+computation.
+**[SOLID] For digital training/eval of that target: yes, decisively.** Coarse explicit Euler can invent
+a limit cycle the analog system would never exhibit, corrupting both the loss and the equilibrium the
+EP gradient is taken at. The correct framing — and the right way to state it in the thesis — is exactly:
+
+> Analog HW does the true continuous relaxation; the simulator only needs a **faithful + cheap emulator**
+> of that relaxation. Adaptive ε, Anderson, and implicit/IMEX are all just *better emulators* — they
+> change the simulation's fidelity/cost, not the EP objective or the analog primitive.
+
+The one asymmetry to keep in mind: **jacreg is NOT in this "just a better emulator" bucket** (it edits
+the model the analog HW would run), whereas adaptive ε / Anderson / implicit ARE. That is the precise
+sense in which the integrator family is the analog-faithful fix and jacreg is the model-side one.
+
+### Recommended solver strategy
+1. Replace fixed ε=0.1 explicit Euler in the **free** phase with an overshoot/step-rejection adaptive
+   solver (the corrected `adaptive_eps_calib2.py` logic), with a fail-open floor (Q1).
+2. Add **damped Anderson with restarts + residual gating** for both free and nudged phases once the
+   residual stalls/cycles; solve `F=0` rather than running a fixed Euler count and hoping.
+3. Keep **implicit/backward Euler as a reference/fallback**, not the default inner loop (per-step cost).
+4. Leave **AsymEP unchanged in principle**: find `z*`, find nudged equilibria, apply `Jv-JTv` at `z*`,
+   and **gate the update** (`res_gate`, `:153-162`) when residual says no stationary state was found.
+5. Retain **jacreg as a bounded adaptive homeostat** (analog settling / true-Hopf insurance), not as the
+   primary fix.
+6. For analog claims, report **solver-independent diagnostics**: force residual `‖F(z*)‖/‖z*‖` (NOT just
+   the eps-scaled step residual — they differ by a factor of eps, which confounded the eps-sweep), and,
+   when feasible, the leading continuous `mu` (sign of `Re mu`) and settling/ringing time.
+
+---
+
+## Summary table
+
+| Option | Eliminates or relocates wall | Changes model? | Analog-faithful? | Verdict |
+|---|---|---|---|---|
+| (a) adaptive ε | Eliminates if eps_min < eps_crit; else relocates | No | Yes (emulator) | **Primary fix** [SOLID mechanism; bounded-floor guarantee] |
+| (b) jacreg | Raises eps_crit (relocates in eps_crit-space) | Yes | No for the wall; yes for settling | **Secondary homeostat / crutch + true-Hopf insurance** |
+| (c) smaller fixed ε | Relocates only | No | Yes but inefficient | **Diagnostic / fallback** [SOLID] |
+| Anderson | Can eliminate cycle if a root exists | No | Yes (emulator) | **Best practical solver, needs safeguards** |
+| Implicit/IMEX | Eliminates (A-stable) | No | Yes (emulator) | **Correct but per-step costly; fallback/reference** |
+
+Key uncertainties flagged: (1) whether `|Im mu|` saturates vs grows as CE drops is **not directly
+measured** — code caps/qknorm/damping argue for bounded, but ε=0.1→0.05 data show it was still rising
+in-window; (2) whether s3200 has a true continuous fixed point (g→0) vs only a dead oscillation is
+**ambiguous** because the eps-sweep's step-residual ≠ force-residual; the clean continuous-stable
+evidence is s2000, not s3200.
