# FUGU_OPTIONS_VERDICT — Q1–Q3 (independently verified)

Scope: answers grounded in `lt_ep_train.py` (`force`/`tforce` :81-106, `relax` :123-133,
`ep_step` :140-232, `jacreg` :211-219, weight caps :52-53/398-399/563-567), `holo_ep.py`,
the calibration probes (`adaptive_eps_calib.py`, `adaptive_eps_calib2.py`, `eps_sweep_s3200.py`,
`jnc_scaling.py`, `lt_ep_anderson.py`), and the diagnosis dossiers. Each claim is flagged
**[SOLID]** (proved by code/data in repo) or **[UNCERTAIN]** (reasoned, not measured here).

---

## Shared mechanism (the object all three questions act on)

**[SOLID]** The active free relaxation is explicit (forward) Euler:
`z = z + eps * blk.force(z, xin).detach()` (`relax`, :123-133). In thick mode the force is
`F(z) = -(z - xin) + Attn(LN1 z) + FFN(LN2 z) - c*z` (`tforce`/`force` :81-85, :102-106), c=1.
So the per-step linear stability object is the **discrete map** `M = I + eps*J`, `J = dF/dz`.

**[SOLID]** For a continuous eigenvalue `mu = a + i b` of `J`, the Euler multiplier is
`lambda = 1 + eps*mu`. Since `|lambda|^2 = 1 + 2*eps*a + eps^2*(a^2 + b^2)`, the map is stable iff
`|1+eps*mu| < 1`, i.e. `eps < eps_crit = -2a/(a^2 + b^2)` for `a < 0`; for `a >= 0` no `eps > 0` is
stable. A continuous-STABLE rotating mode (`a<0`, `b` large) is therefore destabilized purely by too-large `eps`.
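
As a quick numerical check of the bound, the following minimal sketch evaluates `eps_crit` and the Euler multiplier for a single mode. The eigenvalue `mu = -1 + 10i` is an illustrative assumption, not a value measured in the repo, and the helper names are hypothetical.

```python
def eps_crit(mu: complex) -> float:
    """Largest forward-Euler step with |1 + eps*mu| < 1; 0.0 if Re(mu) >= 0."""
    a, b = mu.real, mu.imag
    if a >= 0.0:
        return 0.0  # no positive step is stable for a non-decaying mode
    return -2.0 * a / (a * a + b * b)


def euler_multiplier(mu: complex, eps: float) -> float:
    """Modulus of the per-step multiplier lambda = 1 + eps*mu."""
    return abs(1.0 + eps * mu)


mu = complex(-1.0, 10.0)   # hypothetical stiff rotating mode (a < 0, large b)
ec = eps_crit(mu)          # equals 2/101 for this mu
for eps in (0.1, 0.05, 0.01):
    print(f"eps={eps}: |lambda|={euler_multiplier(mu, eps):.4f}, stable={eps < ec}")
```

For this assumed mode, `eps=0.1` and `eps=0.05` both lie above `eps_crit ≈ 0.0198`, while `eps=0.01` lies below it, even though the continuous mode decays in all three cases.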

**[SOLID]** The ε-monotonicity training data show decisively that this is an *integration* wall, not a
*gradient-quality* wall: eps=0.1 blew up at CE 2.74; eps=0.1 with a strictly better gradient (t2sel=160,
cos 0.998) blew up EARLIER, at CE 3.02; eps=0.05 reached CE 2.41 before blowing up. A better gradient
moved the blow-up earlier, not later; a smaller step moved the wall to a strictly lower CE. That is
exactly the `|1+eps*mu|>1` signature.

### One correction to the dossier's "continuous/analog is stable at s3200" framing
**[SOLID — verified, refines prior verdict]** The eps-sweep "CONVERGED at eps=0.01" is measured with a
*different residual* than the cycle floor. `eps_sweep_s3200.py:17` reports the **step** residual
`r = ‖z2-z‖/‖z‖ = eps·‖F‖/‖z‖`; `adaptive_eps_calib.py:15` reports the **force** residual
`g = ‖F‖/‖z‖`. At eps=0.01 the sweep's `r≈8.9e-4` is just `0.01 × 0.089` — i.e. the *same* force-floor
`g≈0.09` that is called a "cycle" at eps=0.1. `FUGU_Q_OPTIONS.md` itself flags this:
"s3200 g floors ~0.09 even at tiny ε (genuinely no fixed point at the marginal op, OR just slow
finite-step convergence — ambiguous)."
**Implication:** the eps-sweep robustly proves *the oscillation/blow-up is a discrete-Euler artifact*
(the cycle amplitude dies as eps→0). It does **not** by itself prove the s3200 operator has a true
attracting fixed point (g→0) in continuous time — the force floor g≈0.09 persists. The clean
continuous-stable case is s2000 (g→0). So "analog HW would have no problem" is **[SOLID]** for the
*oscillatory blow-up* but **[UNCERTAIN]** for "s3200 settles to a usable equilibrium." The decisive
missing measurement remains the leading eigenpair of `J`/`M` at a continued fixed-point branch
(sign of `Re mu`).
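
The factor-of-eps relation between the two residuals can be checked directly. This is a minimal sketch: `toy_force` is a hypothetical stand-in for the real force, and the function names are not taken from the probe scripts.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def force_residual(F, z):
    """g = ||F(z)|| / ||z||: independent of the step size."""
    return norm(F(z)) / norm(z)


def step_residual(F, z, eps):
    """r = ||z2 - z|| / ||z|| after one Euler step z2 = z + eps*F(z)."""
    f = F(z)
    z2 = [zi + eps * fi for zi, fi in zip(z, f)]
    return norm([a - b for a, b in zip(z2, z)]) / norm(z)


def toy_force(z):
    # Hypothetical stand-in for blk.force; any vector field works here.
    return [-0.3 * z[0] + 0.2 * z[1], -0.2 * z[0] - 0.1 * z[1]]


z, eps = [1.0, 2.0], 0.01
g = force_residual(toy_force, z)
r = step_residual(toy_force, z, eps)
print(r, eps * g)        # the two agree up to rounding
print(8.9e-4 / 0.01)     # the sweep's step residual converted back to a force residual
```

Dividing a reported step residual by its `eps` is therefore enough to put the sweep and the calibration probe on the same scale before comparing them.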

---

## Q1 — Evaluate (a) adaptive ε, (b) jacreg, (c) smaller fixed ε

**Bottom line:**
- **(c) smaller fixed ε — RELOCATES the wall. [SOLID]** Already shown empirically (2.74→2.41).
- **(b) jacreg — RAISES/RELOCATES the wall from the model side. [SOLID it raises eps_crit; UNCERTAIN whether it can eliminate]** It lifts `eps_crit` by cutting `|Im mu|`/gain, but at fixed ε it is still a wall in `eps_crit`-space; it also taxes the expressivity it suppresses.
- **(a) adaptive ε — ELIMINATES the fixed-ε wall *iff* its floor stays below the instantaneous `eps_crit`; otherwise it degenerates to (c). [SOLID for the mechanism; the guarantee is conditional]**

### Ranking
**To remove the measured software wall while preserving the model and the analog target:**
1. **Adaptive ε / robust solver** — only option that removes the *fixed-step* wall with **zero model/expressivity cost** and **zero change to the analog target**. It is pure integration-axis.
2. **jacreg** — effective secondary homeostat; raises `eps_crit`, but changes the learned operator and can cap the non-normality the good (BPTT-1.83) solution uses.
3. **smaller fixed ε** — diagnostic/fallback only; permanently pays the small-step cost on *every* example (including smooth ones) and still fails once stiffening passes the new floor.

**For the analog (continuous) target specifically:** adaptive ε and smaller fixed ε are *emulator*
choices that leave the model identical to what analog HW runs — they are the right kind of fix.
jacreg *changes the model that analog HW would run* (see Q2).

### (a) Adaptive ε — grounded in code
**[SOLID]** `adaptive_eps_calib2.py` uses the correct signal: shrink only on **overshoot**
(`g_t > prev*tol` → `eps*=down`), grow otherwise. The naive `adaptive_eps_calib.py` controller
(shrink on slow contraction) is shown to mis-park ε at the floor on all ops — it conflates small-ε's
slow contraction with instability. The corrected controller behaves as a continuous-relaxation
emulator: stiff s3200 → ε to 0.003-0.008; smooth s2000 → ε grows toward 0.1 and reaches g=0.
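
The overshoot rule can be sketched as a pure update function plus a relaxation loop. This is an illustration of the rule as described here, not a copy of `adaptive_eps_calib2.py`: the factor `up`, the default values of `tol`, `down`, `eps_min`, `eps_max`, and the toy force are all assumptions.

```python
import math


def update_eps(eps, g, g_prev, tol=1.0, down=0.5, up=1.1, eps_min=1e-3, eps_max=0.1):
    """Shrink eps only when the force residual overshoots; otherwise grow it."""
    if g > g_prev * tol:
        eps *= down      # overshoot: the last step was too large
    else:
        eps *= up        # contracting (even slowly): allow a larger step
    return min(max(eps, eps_min), eps_max)


def relax_adaptive(force, z, eps=0.1, steps=200, **ctrl):
    """Explicit-Euler relaxation with the overshoot controller; returns z and the eps trace."""
    g_prev, trace = None, []
    for _ in range(steps):
        f = force(z)
        g = math.hypot(*f) / max(math.hypot(*z), 1e-12)
        if g_prev is not None:
            eps = update_eps(eps, g, g_prev, **ctrl)
        z = tuple(zi + eps * fi for zi, fi in zip(z, f))
        g_prev = g
        trace.append(eps)
    return z, trace


def toy_force(z, a=-1.0, b=10.0, z_star=(1.0, 0.0)):
    # Hypothetical linear rotating mode mu = a + ib around z_star.
    dx, dy = z[0] - z_star[0], z[1] - z_star[1]
    return (a * dx - b * dy, b * dx + a * dy)


z_end, eps_trace = relax_adaptive(toy_force, (1.5, 0.0))
print(min(eps_trace), eps_trace[-1])
```

The toy run only illustrates the mechanics (the first Euler step at `eps=0.1` overshoots and triggers a shrink); it is not a reproduction of the calibration results. A slow but monotone decrease of `g` never shrinks `eps`, which is the distinction from the naive controller.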

### Is adaptive ε *guaranteed* to eliminate the wall? — the eps_min question
**[SOLID, decisive]** No, not unconditionally. With a hard floor `eps_min`, adaptive ε eliminates the
wall only while `eps_min < eps_crit = -2a/(a^2+b^2)`. If training keeps stiffening the rotating mode so
`eps_crit` falls below `eps_min`, adaptive ε becomes a fixed small step at the floor — i.e. it
**degenerates into option (c) and merely relocates the wall.** So the guarantee is conditional on the
floor, and equivalently on whether `eps_crit` (hence `|Im mu|`) is bounded away from where the floor
sits.
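
To get a feel for where a given floor stops helping, the condition `eps_min < eps_crit` can be inverted for the largest tolerable `|Im mu|`. The sketch below assumes, purely for illustration, that the real part is set by the passive leak alone (`a = -2`); no such eigenvalue is measured in the repo, and the function name is hypothetical.

```python
import math


def max_rotation_for_floor(a: float, eps_min: float) -> float:
    """Largest |Im mu| for which eps_min < eps_crit = -2a/(a^2 + b^2), given Re mu = a < 0."""
    if a >= 0.0:
        return 0.0  # no stable Euler step exists, whatever the floor
    b_sq = -2.0 * a / eps_min - a * a
    return math.sqrt(b_sq) if b_sq > 0.0 else 0.0


a = -2.0   # assumed: real part set by the passive leak only (c=1)
for eps_min in (0.01, 0.003, 0.001):
    b_max = max_rotation_for_floor(a, eps_min)
    print(f"eps_min={eps_min}: floor holds while |Im mu| < {b_max:.1f}")
```

Under that assumption the tolerable rotation rate grows only like `1/sqrt(eps_min)`, so lowering the floor buys headroom slowly.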

### Does |Im μ| (b) saturate or grow unboundedly as CE drops?
This is the crux, and the honest answer is split:

- **[SOLID] There IS structural stiffness-bounding machinery in the code that argues for saturation.**
  (i) `qknorm` RMSNorms q,k → softmax logits are bounded regardless of weight growth (`attn` :63-67);
  (ii) **weight-norm caps**: `capw = {WQ,WK,WV,WO,Wm,Wh,fc,pj}` are each projected back to
  `capx × initial-norm` every optimizer step (`:52-53`, `:398-399`, `:563-567`); (iii) damping `c=1`
  gives a passive leak `xin - (1+c)z = xin - 2z`, i.e. a contraction floor; (iv) LayerNorm bounds the
  input scale into attn/FFN; (v) weight decay. With qknorm + capped projections + LN, the per-matrix
  gains feeding `J_nc` cannot grow without bound, which bounds `|Im mu|` and therefore keeps `eps_crit`
  bounded **below** (away from zero, as long as `Re mu` itself stays away from zero). This is a
  genuine reason to expect `|Im mu|` to **saturate** (or at least be bounded) rather than diverge.

- **[SOLID, opposing data point] But within the *observed* range stiffness was still rising:** fixed
  ε=0.1→0.05 moved the wall 2.74→2.41 rather than removing it, i.e. `eps_crit` was still falling across
  that CE interval. So saturation, if it exists, had not yet bitten in the measured window.

- **[UNCERTAIN] No direct eigenvalue/`|Im mu|`-vs-CE trace exists in the repo.** `jnc_scaling.py`
  measures `‖J_nc‖` growth-per-step vs width but is not a CE-resolved `|Im mu|` curve. So whether `b`
  truly plateaus before `eps_crit` reaches a practical `eps_min` is **not measured**.

**Synthesis (decisive, hedged correctly):** adaptive ε is the best wall-eliminator and the only
zero-tax, analog-faithful one — **and** the code's caps/qknorm/damping make it *likely* that `|Im mu|`
is bounded, so a sufficiently small `eps_min` should eliminate (not merely relocate) the wall in
practice. But this is a *bounded-floor* guarantee, not an unconditional one: if `|Im mu|` were to grow
without bound, any finite `eps_min` would eventually become a wall. **Recommended:** have the controller
log an `eps_crit` proxy (overshoot persisting at the floor) and, when it fires, either drop the floor,
reject the step, or hand off to Anderson, i.e. fail open rather than fail into (c).
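
One way to implement the "overshoot persisting at the floor" proxy is a small counter that escalates after a few consecutive hits. This is a hypothetical sketch: the class name, the `patience` threshold, and the returned action strings are assumptions, and what "escalate" triggers is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class FloorMonitor:
    """Counts consecutive overshoots that occur while eps is pinned at eps_min."""
    eps_min: float
    patience: int = 3      # assumed threshold; tune per run
    hits: int = 0

    def observe(self, eps: float, overshoot: bool) -> str:
        if overshoot and eps <= self.eps_min:
            self.hits += 1   # eps_crit has likely dropped below the floor
        else:
            self.hits = 0
        return "escalate" if self.hits >= self.patience else "continue"


mon = FloorMonitor(eps_min=0.003)
events = [(0.01, True), (0.003, True), (0.003, True), (0.003, True)]
actions = [mon.observe(eps, over) for eps, over in events]
print(actions)
```

An overshoot above the floor resets the counter because the controller can still shrink `eps`; only repeated overshoot at the floor signals that the step size has run out of room.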

---

## Q2 — The jacreg paradox

**Verdict: no paradox. jacreg works by RAISING `eps_crit` from the model side — it fixes the SAME
discretization wall, not a demonstrated continuous-time instability. Relative to adaptive ε it is a
sim-crutch for the measured failure, but it carries a *separate, real* analog benefit (settling
quality), and it would become a genuine fix if a true continuous instability (Re μ≥0) ever emerges.**

### Why a model-side stiffness penalty fixes a simulation artifact — mechanism
**[SOLID]** `jacreg` is a Hutchinson JVP penalty `R = jacreg·‖J_nc·er‖²/‖er‖²` (`:211-219`), and in thick
mode `nc_force = Attn + FFN` (`:92-97`). Minimizing `‖J_nc‖` reduces the learned non-conservative
gain, which reduces the rotating component `|b|=|Im mu|` (and non-normal amplification). Since
`eps_crit = -2a/(a^2+b^2)`, smaller `|b|` → **larger** `eps_crit` → fixed ε=0.1 stays under the
Euler-stability boundary longer. So jacreg moves the *same* `|1+eps*mu|=1` wall by shrinking `b`, while
adaptive ε moves the *same* wall by shrinking `eps`. Two knobs on one inequality.
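
The shape of the penalty can be illustrated without autodiff. The sketch below replaces the JVP with a central finite difference so that it runs on the standard library alone; the repo's penalty is described as a JVP, so this is a stand-in, and the rotational toy force and all function names are assumptions.

```python
import random


def jvp_fd(fn, z, v, h=1e-4):
    """Central finite-difference JVP: (fn(z + h v) - fn(z - h v)) / (2h)."""
    zp = [zi + h * vi for zi, vi in zip(z, v)]
    zm = [zi - h * vi for zi, vi in zip(z, v)]
    return [(p - m) / (2.0 * h) for p, m in zip(fn(zp), fn(zm))]


def jacreg_penalty(nc_force, z, jacreg, rng):
    """R = jacreg * ||J_nc e||^2 / ||e||^2 for one random Rademacher probe e."""
    e = [rng.choice((-1.0, 1.0)) for _ in z]
    je = jvp_fd(nc_force, z, e)
    return jacreg * sum(x * x for x in je) / sum(x * x for x in e)


def make_rotation_force(b):
    # Hypothetical purely rotational force with J_nc = [[0, -b], [b, 0]].
    return lambda z: [-b * z[1], b * z[0]]


rng = random.Random(0)
for b in (10.0, 5.0):
    R = jacreg_penalty(make_rotation_force(b), [0.3, -0.7], jacreg=1e-3, rng=rng)
    print(f"b={b}: penalty={R:.4f}")
```

For this toy the penalty equals `jacreg * b^2` for any probe, so halving the rotation rate cuts the penalty by four, which is the pressure that pushes `|Im mu|` down and `eps_crit` up.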

### Raising eps_crit vs fixing a continuous-time problem
**[SOLID for measured regime]** For s3200-type failures the relevant mode has `Re mu < 0` (the cycle
dies as eps→0). There is no *established* continuous instability to fix, so jacreg's contribution there
is purely "raise eps_crit" — discretization-wall relief from the model side.
**[UNCERTAIN beyond it]** If training ever drives `Re mu` up to or through zero (`Re mu ≥ 0`, a true Hopf), then no integrator
(adaptive ε, implicit, Anderson) can stabilize the original continuous equilibrium; only a model-side
change (jacreg, stronger damping/c, structural monotonicity, gain/asymmetry bounds) is a real fix.
jacreg is the insurance policy for that case.

### Does the benefit transfer to analog hardware? — two benefits, separated
**[SOLID] (i) The "prevents eps=0.1 Euler blow-up" benefit does NOT transfer.** Analog HW has no `eps`
and does not iterate `z←z+εF`; it performs continuous relaxation. If `Re mu<0`, the analog ODE is
stable and never had this wall. To the extent jacreg only buys eps_crit headroom, it is papering over a
sim artifact analog wouldn't have — a crutch.

**[SOLID/UNCERTAIN-magnitude] (ii) The "less stiff/less ringy continuous dynamics" benefit DOES
transfer.** Even with `Re mu<0`, a large `|Im mu|` mode has a poor damping ratio: it rings, settles
slowly, demands more bandwidth, longer observation/integration windows, and is more noise/delay
sensitive — all of which degrade the *physical* free-phase settling and the readout of nudged
equilibria on analog HW. Reducing `‖J_nc‖` improves the continuous damping ratio. So jacreg is *also* a
legitimate analog settling/robustness regularizer. **[UNCERTAIN]** the size of this analog benefit is
not measured here.
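
The ringing argument can be made concrete with two standard second-order quantities. This is a sketch under the assumption of a single decaying mode `mu = a + ib` with `a < 0`; the example eigenvalues and the 2% settling criterion are illustrative choices, not repo measurements.

```python
import math


def damping_ratio(mu: complex) -> float:
    """zeta = -Re(mu)/|mu| for a decaying oscillatory mode (1 = no ringing, near 0 = heavy ringing)."""
    return -mu.real / abs(mu)


def cycles_to_settle(mu: complex, frac: float = 0.02) -> float:
    """Oscillation cycles completed while the envelope exp(Re(mu) t) decays to `frac`."""
    t_settle = math.log(1.0 / frac) / -mu.real
    return abs(mu.imag) * t_settle / (2.0 * math.pi)


for mu in (complex(-1.0, 10.0), complex(-1.0, 5.0)):   # hypothetical modes
    print(mu, round(damping_ratio(mu), 3), round(cycles_to_settle(mu), 2))
```

At fixed `Re mu`, halving `|Im mu|` halves the number of ring cycles the hardware has to sit through before the envelope has decayed, and raises the damping ratio accordingly.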

### Real fix or sim-crutch, relative to adaptive ε?
**[SOLID]** For the *confirmed explicit-Euler artifact*:
- **adaptive ε / Anderson / implicit = the real fix of the emulator** — they preserve the learned
  vector field and make the digital sim stop inventing a cycle the analog system wouldn't have.
- **jacreg = a model-changing crutch for that artifact**, but simultaneously a *real* (if secondary)
  analog settling regularizer and the *only* lever if a genuine continuous instability appears.

**Recommended composition (not "either/or"):** (1) use adaptive ε / a real solver as the primary
emulator fix so the sim is faithful; (2) keep jacreg as a **bounded, adaptive** homeostat
(the controller already exists, `:520-529`) sized for analog settling-time/robustness or true
marginality — NOT as a strong fixed penalty that taxes the non-normality the BPTT-1.83 solution needs.
The historical evidence fits this: the validated ~2.40 runs used *adaptive* jacreg; the diverging runs
*froze it weak* — i.e. they removed the homeostat, not the integrator.

---

## Q3 — Anderson acceleration / implicit (IMEX) integrators

**Verdict: Yes — they can replace explicit Euler as the *solver* and kill the discretization
instability, and they are compatible with AsymEP *provided they converge to the same equilibria of the
same vector fields*. They change nothing about the analog model; they are emulator-fidelity choices.
Implicit Euler is unconditionally stable but per-step expensive (the solve is itself a relaxation).
Anderson is the more practical lever: it both accelerates and can suppress the Euler cycle when a true
fixed point exists, but it is not guaranteed and needs damping/restarts/residual gating.**

### (i) Compatibility with AsymEP
**[SOLID]** The EP estimator depends on the *states*, not on how they were reached. `ep_step` computes
`zs = relax(...)` and treats it as the free equilibrium (`:142-144`); the AsymEP correction uses local
`Jv = jvp(nc_force, zs, v)`, `JTv = vjp(nc_force, zs, v)`, `corr = Jv - JTv` at `zs` (`:172-178`); the
parameter gradient is `(a * f).sum()` with `f = force(zs.detach(), xin, cg=True)` (`:202-205`). None of
this requires explicit Euler — it requires that `zs` is a genuine root `F(zs)≈0` and that the nudged
states are equilibria of the nudged/corrected force. A better solver that returns the *same roots* is
fully compatible, and the `-2A` correction is computed *at* `z*` regardless of the solver that found it.
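
That `Jv - JTv` picks out only the antisymmetric part of the Jacobian can be verified on a small explicit matrix. In this sketch the matrix `J` stands in for the Jacobian of `nc_force` at `zs`; its entries and the helper names are hypothetical, and the real code obtains both products from `jvp`/`vjp` rather than from an explicit matrix.

```python
def matvec(J, v):
    return [sum(Jij * vj for Jij, vj in zip(row, v)) for row in J]


def transpose(J):
    return [list(col) for col in zip(*J)]


def asym_correction(J, v):
    """corr = J v - J^T v, the quantity written as Jv - JTv."""
    jv, jtv = matvec(J, v), matvec(transpose(J), v)
    return [x - y for x, y in zip(jv, jtv)]


J = [[-1.0, -10.0], [6.0, -1.0]]   # hypothetical non-symmetric Jacobian at zs
v = [0.5, 2.0]
A = [[(J[i][j] - J[j][i]) / 2.0 for j in range(2)] for i in range(2)]  # antisymmetric part
corr = asym_correction(J, v)
two_A_v = [2.0 * x for x in matvec(A, v)]
print(corr, two_A_v)   # identical: only the antisymmetric part of J contributes
```

Nothing in this computation refers to how `zs` was reached, which is the sense in which the correction is solver-independent.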

**[SOLID — important, refines prior framing] The nudged phase must also be re-solved.** The free phase
is not the only explicit-Euler loop: the nudge (`nudge()` :163-180) and every holomorphic estimator
(`holo_a`, `holo_a_select2`, `holo_a_track`, `holo_a_lockin` in `holo_ep.py`) advance with
`z = z + eps*(f - corr)`. The `-2A` correction lives *inside* these loops. So "swap the integrator"
means swap it in **both** phases; a solver that converges the free `z*` but leaves the nudged phase on
coarse Euler will still corrupt `a = -dz*/dβ`.

**[SOLID] Hard limit:** if the continuous field has no attracting root in the operating regime, no
solver can manufacture the stationary state AsymEP needs — it will fail, find a spurious root, or
return a numerical artifact. A solver fixes *integration*, not *non-existence of equilibrium*. (This is
why the s3200 force-floor ambiguity from the Shared-mechanism section matters: confirm a true root
exists before trusting AsymEP there.)

### (ii) Implicit / IMEX — tractable or self-defeating?
**[SOLID, theory]** The backward-Euler multiplier for step size `h` (the role `eps` plays above) is
`1/(1-h·mu)`, which is A-stable: for any `Re mu<0` its modulus is below 1 at *every* step size, so it
would kill the stiff-rotation Euler cycle outright.
**[SOLID, cost]** Each backward step solves `y - h·F(y) - z_n = 0`, where `F` contains LN, causal
softmax attention, and FFN. A Newton/Krylov/Picard solve needs several force evals and matrix-free
JVP/VJP linear solves over the full `B·T·C` state per step — i.e. **the per-step solve is itself a
relaxation/root-find**, which is the self-defeating risk for a default inner loop.
**[UNCERTAIN/qualitative] IMEX nuance:** making only the cheap leak `-(1+c)z` implicit is trivial but
does **not** tame the dangerous learned rotating attention mode (the danger is in `J_nc`, not the leak);
treating `J_nc` implicitly reintroduces the big linear solve. So implicit/IMEX is best as a **robust
fallback / macro-step / offline reference**, not the default per-step integrator.
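
The three per-step multipliers can be compared on a single scalar mode. This is a toy sketch: it assumes one eigenvalue `mu = a + ib` and, for the IMEX variant, that the entire real part is the cheap leak treated implicitly while the rotation stays explicit. That split is an illustrative assumption; in the real operator the learned `J_nc` also contributes to the real part.

```python
def forward_euler_mult(mu: complex, h: float) -> float:
    """|1 + h*mu|: explicit Euler."""
    return abs(1.0 + h * mu)


def backward_euler_mult(mu: complex, h: float) -> float:
    """|1 / (1 - h*mu)|: fully implicit Euler."""
    return abs(1.0 / (1.0 - h * mu))


def imex_leak_mult(mu: complex, h: float) -> float:
    """Leak Re(mu) implicit, rotation Im(mu) explicit: |(1 + i h b) / (1 - h a)|."""
    return abs((1.0 + 1j * h * mu.imag) / (1.0 - h * mu.real))


mu = complex(-1.0, 10.0)   # hypothetical stiff rotating mode
for h in (0.01, 0.1, 1.0):
    print(h, forward_euler_mult(mu, h), backward_euler_mult(mu, h), imex_leak_mult(mu, h))
```

For this assumed mode at `h=0.1`, forward Euler and leak-only IMEX both have a multiplier above 1 while backward Euler stays below 1, consistent with the point that the implicit treatment has to cover the rotating part to help.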

### (iii) Anderson — speed only, or stabilization too?
**[SOLID, conceptual]** Anderson (DEQ-style; `lt_ep_anderson.py` stores recent `X`, `G(X)=z+εF`, solves
a small regularized least-squares problem for the mixing coefficients, and extrapolates) acts as a
quasi-Newton / GMRES-like method on the residual. For a Picard/Euler map whose oscillatory multiplier
sits just outside the unit circle, the residual-minimizing extrapolation can **suppress the limit
cycle**, not merely speed up a contracting one, so it is more than acceleration. `lt_ep_anderson.py` is
explicitly framed as exactly this test
("can a fixed-point solver converge the free phase where plain relaxation cannot?").
**[SOLID, caveats]** Not guaranteed: it cannot create a root that doesn't exist; aggressive mixing can
diverge; it needs damping (β-mixing), restarts, and residual-monotonicity gating; and (per (i)) it must
wrap the nudged phase too. Net: **strongest practical candidate** — cheaper than full implicit Newton,
able to stabilize when a root exists, but must be safeguarded.
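
The stabilization claim can be seen on a two-dimensional linear toy where the plain Euler map diverges. The sketch below is not `lt_ep_anderson.py`: it is a bare Anderson iteration with a two-difference history window, written for 2-vectors only, and it omits the damping, restarts, and residual gating that the caveats call for. The mode `mu = -1 + 10i`, the step `eps = 0.1`, and all names are assumptions.

```python
import math

A_RE, B_IM, EPS = -1.0, 10.0, 0.1   # hypothetical mode mu = a + ib and Euler step
Z_STAR = (1.0, 0.0)                  # known root of the toy force


def euler_map(z):
    """G(z) = z + eps*F(z) for the linear toy force F(z) = J (z - z*)."""
    dx, dy = z[0] - Z_STAR[0], z[1] - Z_STAR[1]
    fx, fy = A_RE * dx - B_IM * dy, B_IM * dx + A_RE * dy
    return (z[0] + EPS * fx, z[1] + EPS * fy)


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def sub(u, v):
    return (u[0] - v[0], u[1] - v[1])


def solve_ls(cols, f, lam):
    """Regularized least squares for gamma in min ||f - sum_i gamma_i cols[i]|| (<= 2 columns)."""
    if not cols:
        return []
    if len(cols) == 1:
        return [dot(cols[0], f) / (dot(cols[0], cols[0]) + lam)]
    p, q = cols
    m11, m12, m22 = dot(p, p) + lam, dot(p, q), dot(q, q) + lam
    r1, r2 = dot(p, f), dot(q, f)
    det = m11 * m22 - m12 * m12
    return [(r1 * m22 - r2 * m12) / det, (m11 * r2 - m12 * r1) / det]


def anderson(G, x0, iters=50, lam=1e-12, tol=1e-6):
    """Anderson mixing with a two-difference history window; returns (x, iterations used)."""
    x, hist_f, hist_g = tuple(x0), [], []
    for k in range(iters):
        g = G(x)
        f = sub(g, x)                      # fixed-point residual G(x) - x = eps*F(x)
        if math.hypot(*f) < tol:
            return x, k
        hist_f = (hist_f + [f])[-3:]
        hist_g = (hist_g + [g])[-3:]
        d_f = [sub(hist_f[i + 1], hist_f[i]) for i in range(len(hist_f) - 1)]
        d_g = [sub(hist_g[i + 1], hist_g[i]) for i in range(len(hist_g) - 1)]
        gamma = solve_ls(d_f, f, lam)
        x = g                              # plain step, then subtract the mixed history
        for gi, dgi in zip(gamma, d_g):
            x = (x[0] - gi * dgi[0], x[1] - gi * dgi[1])
    return x, iters


def plain_iterate(G, x0, iters):
    x = tuple(x0)
    for _ in range(iters):
        x = G(x)
    return x


x_aa, used = anderson(euler_map, (0.0, 0.0))
x_pi = plain_iterate(euler_map, (0.0, 0.0), 50)
print("Anderson:", x_aa, "after", used, "iterations")
print("Plain Euler distance from root:", math.hypot(x_pi[0] - 1.0, x_pi[1]))
```

On a linear problem this small the history spans the whole space, so the extrapolation lands on the root almost immediately; that is a best case, and a nonlinear, high-dimensional force gives no such guarantee, which is why the safeguards remain necessary.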

### (iv) Does integrator choice matter for the ANALOG target?
**[SOLID] For the analog model itself: no.** Analog HW performs the true continuous relaxation of `F`;
it runs no explicit Euler, no Anderson, no backward Euler. The integrator is not part of the deployed
computation.
**[SOLID] For digital training/eval of that target: yes, decisively.** Coarse explicit Euler can invent
a limit cycle the analog system would never exhibit, corrupting both the loss and the equilibrium the
EP gradient is taken at. The correct framing — and the right way to state it in the thesis — is exactly:

> Analog HW does the true continuous relaxation; the simulator only needs a **faithful + cheap emulator**
> of that relaxation. Adaptive ε, Anderson, and implicit/IMEX are all just *better emulators* — they
> change the simulation's fidelity/cost, not the EP objective or the analog primitive.

The one asymmetry to keep in mind: **jacreg is NOT in this "just a better emulator" bucket** (it edits
the model the analog HW would run), whereas adaptive ε / Anderson / implicit ARE. That is the precise
sense in which the integrator family is the analog-faithful fix and jacreg is the model-side one.

### Recommended solver strategy
1. Replace fixed ε=0.1 explicit Euler in the **free** phase with an overshoot/step-rejection adaptive
   solver (the corrected `adaptive_eps_calib2.py` logic), with a fail-open floor (Q1).
2. Add **damped Anderson with restarts + residual gating** for both free and nudged phases once the
   residual stalls/cycles; solve `F=0` rather than running a fixed Euler count and hoping.
3. Keep **implicit/backward Euler as a reference/fallback**, not the default inner loop (per-step cost).
4. Leave **AsymEP unchanged in principle**: find `z*`, find nudged equilibria, apply `Jv-JTv` at `z*`,
   and **gate the update** (`res_gate`, `:153-162`) when residual says no stationary state was found.
5. Retain **jacreg as a bounded adaptive homeostat** (analog settling / true-Hopf insurance), not as the
   primary fix.
6. For analog claims, report **solver-independent diagnostics**: force residual `‖F(z*)‖/‖z*‖` (NOT just
   the eps-scaled step residual — they differ by a factor of eps, which confounded the eps-sweep), and,
   when feasible, the leading continuous `mu` (sign of `Re mu`) and settling/ringing time.

---

## Summary table

| Option | Eliminates or relocates wall | Changes model? | Analog-faithful? | Verdict |
|---|---|---|---|---|
| (a) adaptive ε | Eliminates if eps_min < eps_crit; else relocates | No | Yes (emulator) | **Primary fix** [SOLID mechanism; bounded-floor guarantee] |
| (b) jacreg | Raises eps_crit (relocates in eps_crit-space) | Yes | No for the wall; yes for settling | **Secondary homeostat / crutch + true-Hopf insurance** |
| (c) smaller fixed ε | Relocates only | No | Yes but inefficient | **Diagnostic / fallback** [SOLID] |
| Anderson | Can eliminate cycle if a root exists | No | Yes (emulator) | **Best practical solver, needs safeguards** |
| Implicit/IMEX | Eliminates (A-stable) | No | Yes (emulator) | **Correct but per-step costly; fallback/reference** |

Key uncertainties flagged: (1) whether `|Im mu|` saturates vs grows as CE drops is **not directly
measured** — code caps/qknorm/damping argue for bounded, but ε=0.1→0.05 data show it was still rising
in-window; (2) whether s3200 has a true continuous fixed point (g→0) vs only a dead oscillation is
**ambiguous** because the eps-sweep's step-residual ≠ force-residual; the clean continuous-stable
evidence is s2000, not s3200.