Initial commit: ept — backprop-free equilibrium transformer (EP)

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
author: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
committer: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
commit: b83947778e2c776f757a07d4719b7ce961d7ed55 (patch)
tree: b9cc01d7adda691d9156d9d04f4fb2f644674e96 /ep_run/FUGU_OPTIONS_VERDICT.md
1 files changed, 263 insertions, 0 deletions
diff --git a/ep_run/FUGU_OPTIONS_VERDICT.md b/ep_run/FUGU_OPTIONS_VERDICT.md
new file mode 100644
index 0000000..4e3ed25
--- /dev/null
+++ b/ep_run/FUGU_OPTIONS_VERDICT.md
@@ -0,0 +1,263 @@
+# FUGU_OPTIONS_VERDICT — Q1–Q3 (independently verified)
+
+Scope: answers grounded in `lt_ep_train.py` (`force`/`tforce` :81-106, `relax` :123-133,
+`ep_step` :140-232, `jacreg` :211-219, weight caps :52-53/398-399/563-567), `holo_ep.py`,
+the calibration probes (`adaptive_eps_calib.py`, `adaptive_eps_calib2.py`, `eps_sweep_s3200.py`,
+`jnc_scaling.py`, `lt_ep_anderson.py`), and the diagnosis dossiers. Each claim is flagged
+**[SOLID]** (proved by code/data in repo) or **[UNCERTAIN]** (reasoned, not measured here).
+
+---
+
+## Shared mechanism (the object all three questions act on)
+
+**[SOLID]** The active free relaxation is explicit (forward) Euler:
+`z = z + eps * blk.force(z, xin).detach()` (`relax`, :123-133). In thick mode the force is
+`F(z) = -(z - xin) + Attn(LN1 z) + FFN(LN2 z) - c*z` (`tforce`/`force` :81-85, :102-106), c=1.
+So the per-step linear stability object is the **discrete map** `M = I + eps*J`, `J = dF/dz`.
+
+**[SOLID]** For a continuous eigenvalue `mu = a + i b` of `J`, the Euler multiplier is
+`lambda = 1 + eps*mu`, and the map is stable iff `|1+eps*mu| < 1`, i.e.
+`eps < eps_crit = -2a/(a^2 + b^2)` for `a < 0`. A continuous-STABLE rotating mode (`a<0`, `b` large)
+is destabilized purely by too-large `eps`.
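+
+As a check on the bound above, the squared multiplier can be expanded directly. This is a worked sketch
+using only the symbols already defined (`mu = a + i b`, `eps > 0`); the numbers in the last line are
+illustrative assumptions, not eigenvalues measured in the repo.
+
+```latex
+\begin{align*}
+|1+\varepsilon\mu|^2 &= (1+\varepsilon a)^2 + \varepsilon^2 b^2
+  = 1 + 2\varepsilon a + \varepsilon^2\,(a^2+b^2), \\
+|1+\varepsilon\mu| < 1
+  &\iff 2a + \varepsilon\,(a^2+b^2) < 0
+  \iff \varepsilon < \varepsilon_{\mathrm{crit}} = \frac{-2a}{a^2+b^2}
+  \qquad (a<0,\ \varepsilon>0), \\
+a=-2,\ b=8 \ \text{(assumed)}
+  &\implies \varepsilon_{\mathrm{crit}} = \tfrac{4}{68} \approx 0.059:
+  \quad |1+0.1\mu|^2 = 1.28 > 1, \quad |1+0.05\mu|^2 = 0.97 < 1.
+\end{align*}
+```
+
+With these assumed values the mode is continuous-stable (`a<0`) yet grows at eps=0.1 and contracts at
+eps=0.05, which is the qualitative pattern attributed here to a too-large `eps`.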
+
+**[SOLID]** The ε-monotonicity training data show decisively that this is an *integration* wall, not a
+*gradient-quality* wall: eps=0.1 blew up at CE 2.74; eps=0.1 with a strictly better gradient (t2sel=160,
+cos 0.998) blew up EARLIER, at CE 3.02; eps=0.05 reached CE 2.41 before blowing up. A better gradient did
+not delay the blow-up, while a smaller step moved the wall to a lower CE. That is exactly the `|1+eps*mu|>1` signature.
+
+### One correction to the dossier's "continuous/analog is stable at s3200" framing
+**[SOLID — verified, refines prior verdict]** The eps-sweep "CONVERGED at eps=0.01" is measured with a
+*different residual* than the cycle floor. `eps_sweep_s3200.py:17` reports the **step** residual
+`r = ‖z2-z‖/‖z‖ = eps·‖F‖/‖z‖`; `adaptive_eps_calib.py:15` reports the **force** residual
+`g = ‖F‖/‖z‖`. At eps=0.01 the sweep's `r≈8.9e-4` is just `0.01 × 0.089` — i.e. the *same* force-floor
+`g≈0.09` that is called a "cycle" at eps=0.1. `FUGU_Q_OPTIONS.md` itself flags this:
+"s3200 g floors ~0.09 even at tiny ε (genuinely no fixed point at the marginal op, OR just slow
+finite-step convergence — ambiguous)."
+**Implication:** the eps-sweep robustly proves *the oscillation/blow-up is a discrete-Euler artifact*
+(the cycle amplitude dies as eps→0). It does **not** by itself prove the s3200 operator has a true
+attracting fixed point (g→0) in continuous time — the force floor g≈0.09 persists. The clean
+continuous-stable case is s2000 (g→0). So "analog HW would have no problem" is **[SOLID]** for the
+*oscillatory blow-up* but **[UNCERTAIN]** for "s3200 settles to a usable equilibrium." The decisive
+missing measurement remains the leading eigenpair of `J`/`M` at a continued fixed-point branch
+(sign of `Re mu`).
+
+---
+
+## Q1 — Evaluate (a) adaptive ε, (b) jacreg, (c) smaller fixed ε
+
+**Bottom line:**
+- **(c) smaller fixed ε — RELOCATES the wall. [SOLID]** Already shown empirically: halving ε from 0.1 to 0.05 moved the blow-up from CE 2.74 to CE 2.41 without removing it.
+- **(b) jacreg — RAISES/RELOCATES the wall from the model side. [SOLID it raises eps_crit; UNCERTAIN whether it can eliminate]** It lifts `eps_crit` by cutting `|Im mu|`/gain, but at fixed ε it is still a wall in `eps_crit`-space; it also taxes the expressivity it suppresses.
+- **(a) adaptive ε — ELIMINATES the fixed-ε wall *iff* its floor stays below the instantaneous `eps_crit`; otherwise it degenerates to (c). [SOLID for the mechanism; the guarantee is conditional]**
+
+### Ranking
+**To remove the measured software wall while preserving the model and the analog target:**
+1. **Adaptive ε / robust solver** — only option that removes the *fixed-step* wall with **zero model/expressivity cost** and **zero change to the analog target**. It is pure integration-axis.
+2. **jacreg** — effective secondary homeostat; raises `eps_crit`, but changes the learned operator and can cap the non-normality the good (BPTT-1.83) solution uses.
+3. **smaller fixed ε** — diagnostic/fallback only; permanently pays the small-step cost on *every* example (including smooth ones) and still fails once stiffening passes the new floor.
+
+**For the analog (continuous) target specifically:** adaptive ε and smaller fixed ε are *emulator*
+choices that leave the model identical to what analog HW runs — they are the right kind of fix.
+jacreg *changes the model that analog HW would run* (see Q2).
+
+### (a) Adaptive ε — grounded in code
+**[SOLID]** `adaptive_eps_calib2.py` uses the correct signal: shrink only on **overshoot**
+(`g_t > prev*tol` → `eps*=down`), grow otherwise. The naive `adaptive_eps_calib.py` controller
+(shrink on slow contraction) is shown to mis-park ε at the floor on all ops — it conflates small-ε's
+slow contraction with instability. The corrected controller behaves as a continuous-relaxation
+emulator: stiff s3200 → ε to 0.003-0.008; smooth s2000 → ε grows toward 0.1 and reaches g=0.
+
+### Is adaptive ε *guaranteed* to eliminate the wall? — the eps_min question
+**[SOLID, decisive]** No, not unconditionally. With a hard floor `eps_min`, adaptive ε eliminates the
+wall only while `eps_min < eps_crit = -2a/(a^2+b^2)`. If training keeps stiffening the rotating mode so
+`eps_crit` falls below `eps_min`, adaptive ε becomes a fixed small step at the floor — i.e. it
+**degenerates into option (c) and merely relocates the wall.** So the guarantee is conditional: it holds
+only while `eps_crit` stays above the floor, which in turn requires `|Im mu|` to stay bounded (and
+`Re mu` to stay away from 0) as training proceeds.
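+
+The floor condition can be inverted to give the largest rotation rate a given floor tolerates. A minimal
+worked sketch, assuming a single mode `mu = a + i b` with `a < 0`; the values `a = -2` (what the passive
+leak alone would give) and `eps_min = 0.003` are assumptions chosen for illustration, not repo settings.
+
+```latex
+\begin{align*}
+\varepsilon_{\min} < \varepsilon_{\mathrm{crit}} = \frac{-2a}{a^2+b^2}
+  &\iff b^2 < \frac{-2a}{\varepsilon_{\min}} - a^2, \\
+a=-2,\ \varepsilon_{\min}=0.003 \ \text{(assumed)}
+  &\implies b^2 < \frac{4}{0.003} - 4 \approx 1329.3
+  \implies |b| \lesssim 36.5 .
+\end{align*}
+```
+
+Any mode rotating faster than this bound would leave the floored controller unstable, which is the degenerate case (c).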
+
+### Does |Im μ| (i.e. `|b|`) saturate or grow unboundedly as CE drops?
+This is the crux, and the honest answer is split:
+
+- **[SOLID] There IS structural stiffness-bounding machinery in the code that argues for saturation.**
+  (i) `qknorm` RMSNorms q,k → softmax logits are bounded regardless of weight growth (`attn` :63-67);
+  (ii) **weight-norm caps**: `capw = {WQ,WK,WV,WO,Wm,Wh,fc,pj}` are each projected back to
+  `capx × initial-norm` every optimizer step (`:52-53`, `:398-399`, `:563-567`); (iii) damping `c=1`
+  makes the passive part of the force `-(z - xin) - c*z = xin - 2z`, a contraction floor of `-2I` in `J`;
+  (iv) LayerNorm bounds input scale into attn/FFN; (v) weight decay. With qknorm + capped projections + LN,
+  the per-matrix gains feeding `J_nc` cannot grow without bound, which bounds `|Im mu|` and therefore keeps
+  `eps_crit` bounded away from zero (as long as `Re mu` itself stays away from 0). This is a genuine reason to expect `|Im mu|` to **saturate** (or at least be bounded) rather than diverge.
+
+- **[SOLID, opposing data point] But within the *observed* range stiffness was still rising:** fixed
+  ε=0.1→0.05 moved the wall 2.74→2.41 rather than removing it, i.e. `eps_crit` was still falling across
+  that CE interval. So saturation, if it exists, had not yet bitten in the measured window.
+
+- **[UNCERTAIN] No direct eigenvalue/`|Im mu|`-vs-CE trace exists in the repo.** `jnc_scaling.py`
+  measures `‖J_nc‖` growth-per-step vs width but is not a CE-resolved `|Im mu|` curve. So whether `b`
+  truly plateaus before `eps_crit` reaches a practical `eps_min` is **not measured**.
+
+**Synthesis (decisive, hedged correctly):** adaptive ε is the best wall-eliminator and the only
+zero-tax, analog-faithful one — **and** the code's caps/qknorm/damping make it *likely* that `|Im mu|`
+is bounded, so a sufficiently small `eps_min` should eliminate (not merely relocate) the wall in
+practice. But this is a *bounded-floor* guarantee, not an unconditional one: if `|Im mu|` were to grow
+without bound, any finite `eps_min` is eventually a wall. **Recommended:** when ε reaches the floor,
+log an `eps_crit` proxy (overshoot persisting at the floor) and then either lower or remove the floor,
+reject the step, or hand off to Anderson, i.e. fail-open rather than fail-into-(c).
+
+---
+
+## Q2 — The jacreg paradox
+
+**Verdict: no paradox. jacreg works by RAISING `eps_crit` from the model side — it fixes the SAME
+discretization wall, not a demonstrated continuous-time instability. Relative to adaptive ε it is a
+sim-crutch for the measured failure, but it carries a *separate, real* analog benefit (settling
+quality), and it would become a genuine fix if a true continuous instability (Re μ≥0) ever emerges.**
+
+### Why a model-side stiffness penalty fixes a simulation artifact — mechanism
+**[SOLID]** `jacreg` is a Hutchinson JVP penalty `R = jacreg·‖J_nc·er‖²/‖er‖²` (`:211-219`), and in thick
+mode `nc_force = Attn + FFN` (`:92-97`). Minimizing `‖J_nc‖` reduces the learned non-conservative
+gain, which reduces the rotating component `|b|=|Im mu|` (and non-normal amplification). Since
+`eps_crit = -2a/(a^2+b^2)`, smaller `|b|` → **larger** `eps_crit` → fixed ε=0.1 stays under the
+Euler-stability boundary longer. So jacreg moves the *same* `|1+eps*mu|=1` wall by shrinking `b`, while
+adaptive ε moves the *same* wall by shrinking `eps`. Two knobs on one inequality.
+
+### Raising eps_crit vs fixing a continuous-time problem
+**[SOLID for measured regime]** For s3200-type failures the relevant mode has `Re mu < 0` (the cycle
+dies as eps→0). There is no *established* continuous instability to fix, so jacreg's contribution there
+is purely "raise eps_crit" — discretization-wall relief from the model side.
+**[UNCERTAIN beyond it]** If training ever pushes `Re mu` up through zero (`Re mu ≥ 0`, a true Hopf bifurcation), then no integrator
+(adaptive ε, implicit, Anderson) can stabilize the original continuous equilibrium; only a model-side
+change (jacreg, stronger damping/c, structural monotonicity, gain/asymmetry bounds) is a real fix.
+jacreg is the insurance policy for that case.
+
+### Does the benefit transfer to analog hardware? — two benefits, separated
+**[SOLID] (i) The "prevents eps=0.1 Euler blow-up" benefit does NOT transfer.** Analog HW has no `eps`
+and does not iterate `z←z+εF`; it performs continuous relaxation. If `Re mu<0`, the analog ODE is
+stable and never had this wall. To the extent jacreg only buys eps_crit headroom, it is papering over a
+sim artifact analog wouldn't have — a crutch.
+
+**[SOLID/UNCERTAIN-magnitude] (ii) The "less stiff/less ringy continuous dynamics" benefit DOES
+transfer.** Even with `Re mu<0`, a large `|Im mu|` mode has a poor damping ratio: it rings, settles
+slowly, demands more bandwidth, longer observation/integration windows, and is more noise/delay
+sensitive — all of which degrade the *physical* free-phase settling and the readout of nudged
+equilibria on analog HW. Reducing `‖J_nc‖` improves the continuous damping ratio. So jacreg is *also* a
+legitimate analog settling/robustness regularizer. **[UNCERTAIN]** the size of this analog benefit is
+not measured here.
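+
+The ringing claim can be made quantitative with the standard second-order damping ratio. A short worked
+sketch for a conjugate pair `mu = a ± i b` with `a < 0`; the numbers are assumed for illustration, and the
+symbol `zeta` is introduced here, not taken from the repo.
+
+```latex
+\begin{align*}
+z(t) \propto e^{at}\cos(bt+\phi), \qquad
+\zeta &= \frac{-a}{\sqrt{a^2+b^2}}, \qquad
+\text{cycles per $e$-fold of decay} = \frac{|b|}{2\pi|a|}, \\
+a=-2,\ b=8 \ \text{(assumed)} &\implies \zeta = \tfrac{2}{\sqrt{68}} \approx 0.24, \\
+a=-2,\ b=4 \ \text{(assumed)} &\implies \zeta = \tfrac{2}{\sqrt{20}} \approx 0.45 .
+\end{align*}
+```
+
+Halving `|b|` at fixed `a` roughly doubles the damping ratio in this example, which is the sense in which a
+smaller `‖J_nc‖` would settle with less ringing.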
+
+### Real fix or sim-crutch, relative to adaptive ε?
+**[SOLID]** For the *confirmed explicit-Euler artifact*:
+- **adaptive ε / Anderson / implicit = the real fix of the emulator** — they preserve the learned
+  vector field and make the digital sim stop inventing a cycle the analog system wouldn't have.
+- **jacreg = a model-changing crutch for that artifact**, but simultaneously a *real* (if secondary)
+  analog settling regularizer and the *only* lever if a genuine continuous instability appears.
+
+**Recommended composition (not "either/or"):** (1) use adaptive ε / a real solver as the primary
+emulator fix so the sim is faithful; (2) keep jacreg as a **bounded, adaptive** homeostat
+(the controller already exists, `:520-529`) sized for analog settling-time/robustness or true
+marginality — NOT as a strong fixed penalty that taxes the non-normality the BPTT-1.83 solution needs.
+The historical evidence fits this: the validated ~2.40 runs used *adaptive* jacreg; the diverging runs
+*froze it weak* — i.e. they removed the homeostat, not the integrator.
+
+---
+
+## Q3 — Anderson acceleration / implicit (IMEX) integrators
+
+**Verdict: Yes — they can replace explicit Euler as the *solver* and kill the discretization
+instability, and they are compatible with AsymEP *provided they converge to the same equilibria of the
+same vector fields*. They change nothing about the analog model; they are emulator-fidelity choices.
+Implicit Euler is unconditionally stable but per-step expensive (the solve is itself a relaxation).
+Anderson is the more practical lever: it both accelerates and can suppress the Euler cycle when a true
+fixed point exists, but it is not guaranteed and needs damping/restarts/residual gating.**
+
+### (i) Compatibility with AsymEP
+**[SOLID]** The EP estimator depends on the *states*, not on how they were reached. `ep_step` computes
+`zs = relax(...)` and treats it as the free equilibrium (`:142-144`); the AsymEP correction uses local
+`Jv = jvp(nc_force, zs, v)`, `JTv = vjp(nc_force, zs, v)`, `corr = Jv - JTv` at `zs` (`:172-178`); the
+parameter gradient is `(a * f).sum()` with `f = force(zs.detach(), xin, cg=True)` (`:202-205`). None of
+this requires explicit Euler — it requires that `zs` is a genuine root `F(zs)≈0` and that the nudged
+states are equilibria of the nudged/corrected force. A better solver that returns the *same roots* is
+fully compatible, and the `-2A` correction is computed *at* `z*` regardless of the solver that found it.
+
+**[SOLID — important, refines prior framing] The nudged phase must also be re-solved.** The free phase
+is not the only explicit-Euler loop: the nudge (`nudge()` :163-180) and every holomorphic estimator
+(`holo_a`, `holo_a_select2`, `holo_a_track`, `holo_a_lockin` in `holo_ep.py`) advance with
+`z = z + eps*(f - corr)`. The `-2A` correction lives *inside* these loops. So "swap the integrator"
+means swap it in **both** phases; a solver that converges the free `z*` but leaves the nudged phase on
+coarse Euler will still corrupt `a = -dz*/dβ`.
+
+**[SOLID] Hard limit:** if the continuous field has no attracting root in the operating regime, no
+solver can manufacture the stationary state AsymEP needs — it will fail, find a spurious root, or
+return a numerical artifact. A solver fixes *integration*, not *non-existence of equilibrium*. (This is
+why the s3200 force-floor ambiguity from the Shared-mechanism section matters: confirm a true root
+exists before trusting AsymEP there.)
+
+### (ii) Implicit / IMEX — tractable or self-defeating?
+**[SOLID, theory]** Backward Euler multiplier is `1/(1-h·mu)`, A-stable: for any `Re mu<0` it is stable
+at *every* step size, so it would kill the stiff-rotation Euler cycle outright.
+**[SOLID, cost]** Each backward step solves `y - h·F(y) - z_n = 0`, where `F` contains LN, causal
+softmax attention, and FFN. A Newton/Krylov/Picard solve needs several force evals and matrix-free
+JVP/VJP linear solves over the full `B·T·C` state per step — i.e. **the per-step solve is itself a
+relaxation/root-find**, which is the self-defeating risk for a default inner loop.
+**[UNCERTAIN/qualitative] IMEX nuance:** making only the cheap leak `-(1+c)z` implicit is trivial but
+does **not** tame the dangerous learned rotating attention mode (the danger is in `J_nc`, not the leak);
+treating `J_nc` implicitly reintroduces the big linear solve. So implicit/IMEX is best as a **robust
+fallback / macro-step / offline reference**, not the default per-step integrator.
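+
+Both statements can be checked on a scalar test mode. A worked sketch, assuming a single eigenvalue
+`mu = a + i b` with `a < 0`, step `h > 0`, and for the IMEX line a split `mu = -(1+c) + nu` with `c = 1`,
+where `nu` is the `J_nc` contribution; the numbers are illustrative assumptions, not measured values.
+
+```latex
+\begin{align*}
+\text{backward Euler:}\quad
+\left|\frac{1}{1-h\mu}\right|^2 &= \frac{1}{(1-ha)^2 + h^2 b^2} < 1
+  \quad \text{for all } h>0, \text{ since } 1-ha > 1, \\
+\text{leak-implicit IMEX:}\quad
+z_{n+1} = \frac{z_n + h\,\nu\, z_n}{1+2h}
+  &\implies |\lambda| = \frac{|1+h\nu|}{1+2h}, \\
+\mu=-2+8i,\ \nu = 8i,\ h=0.1 \ \text{(assumed)}
+  &\implies |\lambda_{\mathrm{BE}}| = \tfrac{1}{\sqrt{2.08}} \approx 0.69, \quad
+  |\lambda_{\mathrm{IMEX}}| = \tfrac{\sqrt{1.64}}{1.2} \approx 1.07, \quad
+  |\lambda_{\mathrm{FE}}| = \sqrt{1.28} \approx 1.13 .
+\end{align*}
+```
+
+In this example only the fully implicit step contracts; treating just the leak implicitly barely improves
+on forward Euler.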
+
+### (iii) Anderson — speed only, or stabilization too?
+**[SOLID, conceptual]** Anderson (DEQ-style; `lt_ep_anderson.py` stores recent `X`, `G(X)=z+εF`, solves
+a small regularized least-squares problem for the mixing coefficients, then extrapolates) acts as a multisecant
+quasi-Newton method, essentially equivalent to GMRES on the residual when the map is linear. For a Picard/Euler map whose oscillatory multiplier sits just outside the unit circle,
+the residual-minimizing extrapolation can **suppress the limit cycle**, not merely speed a contracting
+one — so it is more than acceleration. `lt_ep_anderson.py` is explicitly framed as exactly this test
+("can a fixed-point solver converge the free phase where plain relaxation cannot?").
+**[SOLID, caveats]** Not guaranteed: it cannot create a root that doesn't exist; aggressive mixing can
+diverge; it needs damping (a mixing factor, distinct from the EP nudge β), restarts, and residual-monotonicity gating; and (per (i)) it must
+wrap the nudged phase too. Net: **strongest practical candidate** — cheaper than full implicit Newton,
+able to stabilize when a root exists, but must be safeguarded.
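+
+For orientation, one common (type-II, DEQ-style) form of the damped Anderson update is written out below.
+This is the textbook form, and whether `lt_ep_anderson.py` matches it term by term is an assumption; the
+symbols `m` (history length), `lambda` (regularizer), and `beta_mix` (mixing factor, unrelated to the EP
+nudge β) are introduced here for illustration.
+
+```latex
+\begin{align*}
+G(z) &= z + \varepsilon F(z), \qquad r_k = G(z_k) - z_k = \varepsilon F(z_k), \\
+\alpha^{(k)} &= \arg\min_{\alpha \in \mathbb{R}^{m},\ \sum_i \alpha_i = 1}
+  \Big\| \sum_{i=0}^{m-1} \alpha_i\, r_{k-i} \Big\|^2 + \lambda \|\alpha\|^2, \\
+z_{k+1} &= (1-\beta_{\mathrm{mix}}) \sum_{i=0}^{m-1} \alpha^{(k)}_i z_{k-i}
+  \;+\; \beta_{\mathrm{mix}} \sum_{i=0}^{m-1} \alpha^{(k)}_i\, G(z_{k-i}) .
+\end{align*}
+```
+
+With `m = 1` and `beta_mix = 1` this reduces to the plain Euler/Picard step `z ← G(z)`. A fixed point of
+the update still satisfies `F(z*) = 0`, so the solver changes how the root is found, not which root AsymEP uses.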
+
+### (iv) Does integrator choice matter for the ANALOG target?
+**[SOLID] For the analog model itself: no.** Analog HW performs the true continuous relaxation of `F`;
+it runs no explicit Euler, no Anderson, no backward Euler. The integrator is not part of the deployed
+computation.
+**[SOLID] For digital training/eval of that target: yes, decisively.** Coarse explicit Euler can invent
+a limit cycle the analog system would never exhibit, corrupting both the loss and the equilibrium the
+EP gradient is taken at. The correct framing — and the right way to state it in the thesis — is exactly:
+
+> Analog HW does the true continuous relaxation; the simulator only needs a **faithful + cheap emulator**
+> of that relaxation. Adaptive ε, Anderson, and implicit/IMEX are all just *better emulators* — they
+> change the simulation's fidelity/cost, not the EP objective or the analog primitive.
+
+The one asymmetry to keep in mind: **jacreg is NOT in this "just a better emulator" bucket** (it edits
+the model the analog HW would run), whereas adaptive ε / Anderson / implicit ARE. That is the precise
+sense in which the integrator family is the analog-faithful fix and jacreg is the model-side one.
+
+### Recommended solver strategy
+1. Replace fixed ε=0.1 explicit Euler in the **free** phase with an overshoot/step-rejection adaptive
+   solver (the corrected `adaptive_eps_calib2.py` logic), with a fail-open floor (Q1).
+2. Add **damped Anderson with restarts + residual gating** for both free and nudged phases once the
+   residual stalls/cycles; solve `F=0` rather than running a fixed Euler count and hoping.
+3. Keep **implicit/backward Euler as a reference/fallback**, not the default inner loop (per-step cost).
+4. Leave **AsymEP unchanged in principle**: find `z*`, find nudged equilibria, apply `Jv-JTv` at `z*`,
+   and **gate the update** (`res_gate`, `:153-162`) when residual says no stationary state was found.
+5. Retain **jacreg as a bounded adaptive homeostat** (analog settling / true-Hopf insurance), not as the
+   primary fix.
+6. For analog claims, report **solver-independent diagnostics**: force residual `‖F(z*)‖/‖z*‖` (NOT just
+   the eps-scaled step residual — they differ by a factor of eps, which confounded the eps-sweep), and,
+   when feasible, the leading continuous `mu` (sign of `Re mu`) and settling/ringing time.
+
+---
+
+## Summary table
+
+| Option | Eliminates or relocates wall | Changes model? | Analog-faithful? | Verdict |
+|---|---|---|---|---|
+| (a) adaptive ε | Eliminates if eps_min < eps_crit; else relocates | No | Yes (emulator) | **Primary fix** [SOLID mechanism; bounded-floor guarantee] |
+| (b) jacreg | Raises eps_crit (relocates in eps_crit-space) | Yes | No for the wall; yes for settling | **Secondary homeostat / crutch + true-Hopf insurance** |
+| (c) smaller fixed ε | Relocates only | No | Yes but inefficient | **Diagnostic / fallback** [SOLID] |
+| Anderson | Can eliminate cycle if a root exists | No | Yes (emulator) | **Best practical solver, needs safeguards** |
+| Implicit/IMEX | Eliminates (A-stable) | No | Yes (emulator) | **Correct but per-step costly; fallback/reference** |
+
+Key uncertainties flagged: (1) whether `|Im mu|` saturates or keeps growing as CE drops is **not directly
+measured**: the code's caps/qknorm/damping argue for a bound, but the ε=0.1→0.05 data show it was still rising
+in-window; (2) whether s3200 has a true continuous fixed point (g→0), or only an oscillation that dies as
+eps→0 while the force floor g≈0.09 persists, is **ambiguous** because the eps-sweep reports the step residual, not the force residual; the clean continuous-stable
+evidence is s2000, not s3200.