Diffstat (limited to 'ep_run/FUGU_VERDICT_FULL.md')
| -rw-r--r-- | ep_run/FUGU_VERDICT_FULL.md | 160 |
1 files changed, 160 insertions, 0 deletions
diff --git a/ep_run/FUGU_VERDICT_FULL.md b/ep_run/FUGU_VERDICT_FULL.md
new file mode 100644
index 0000000..3c06293
--- /dev/null
+++ b/ep_run/FUGU_VERDICT_FULL.md
@@ -0,0 +1,160 @@
+# FUGU_VERDICT_FULL — Q1–Q4
+
+## Q1 — Mechanism: confirm/refute the non-conservative Hopf claim
+
+**Verdict:** confirm the broad failure mode, but do **not** overclaim the exact bifurcation label yet. The code/data are conclusive for an **attention-dominated non-conservative forward oscillatory instability with no usable fixed point** at redx `s3200`. They are **not yet conclusive** that the route is specifically a local continuous-time Hopf bifurcation of a fixed point. The best current statement is:
+
+> The implemented relaxation map `z_{t+1} = z_t + eps*F(z_t)` has crossed from a stationary computation into an attention-driven oscillatory attractor. The most likely local mechanism is a complex-conjugate pair of the map Jacobian `M = I + eps*J` crossing `|lambda| = 1` — a Hopf/Neimark-Sacker-type instability of the Euler relaxation. But the eigenvalue crossing has not yet been measured, so the exact bifurcation class remains a hypothesis.
+
+Grounding in the code:
+
+- `relax` is explicit Euler: `z = z + eps * blk.force(z, xin).detach()` in `lt_ep_train.py:123-133`. Therefore the relevant stability object for the implemented computation is the **discrete map** `M = I + eps*J`, not only the continuous vector field `J=dF/dz`.
+- In the relevant `attn_mode='thick'` branch, `tforce` / `force` implement
+  `F(z) = -(z - xin) + Attn(LN1(z)) + FFN(LN2(z)) - c*z` (`lt_ep_train.py:81-85`, `102-106`). With `c=1`, the passive term is `xin - 2z`; learned attention/FFN Jacobian must fit inside that contraction margin.
+- Attention is genuinely non-conservative in the implementation: independent `WQ/WK/WV/WO`, causal softmax, optional q/k RMSNorm (`qknorm`) in `lt_ep_train.py:58-68`. It is not the gradient of the tied conservative `attn_energy` path.
+- In thick mode, `nc_force` includes **attention plus the untied FFN** (`lt_ep_train.py:92-97`). Thus the knockout supports “attention is dominant,” but the code also explains why a tiny FFN-only oscillation can remain when attention output is zeroed.
+
+What the measurements prove:
+
+1. **It is not slow convergence.** At s3200 the residual decays initially and then floors/oscillates around `~2.3e-2` through 6000 relaxation steps, with non-monotone tail `2.08e-2` to `2.73e-2`. That rules out the earlier “rho close to one but still convergent” framing for the actual forward computation.
+2. **Attention is causally responsible for the large cycle.** Scaling `WO` monotonically shrinks the oscillation: `alpha=1.0` cycles, `0.7` cycles smaller, `0.4` is nearly gone, and `0.2` restores a true fixed point. That is strong causal evidence that attention’s non-conservative/gain component drives the failure.
+3. **The estimator is not the primary explanation once no fixed point exists.** `ep_step` assumes `zs = relax(...)` is a free equilibrium and forms the AEP/nudged update around it. The dossier says AsymEP is accurate when a fixed point exists; at s3200 the required object is absent.
+
+What remains unexcluded:
+
+- **Discrete Euler artifact vs continuous Hopf.** For a continuous eigenvalue `mu=a+ib` of `J`, the Euler multiplier is `lambda=1+eps*mu`; stability requires `(1+eps*a)^2 + (eps*b)^2 < 1`. A stiff rotating mode with `a<0` can still have `|1+eps*mu|>1` at `eps=0.1`. Then the digital relaxation cycles even if the underlying continuous-time analog ODE would converge for smaller `eps`.
+- **Real-multiplier alternatives.** A real `lambda` crossing `+1` would indicate saddle-node/pitchfork/loss of stationary solutions; a real `lambda` crossing `-1` would indicate a flip/period-doubling route. The observed smooth oscillation and attention scaling favor a complex pair, but do not prove one.
+- **Global/coexisting-attractor route.** The long relaxation proves that the trajectory from the embedding does not settle to a stationary computation. It does not, by itself, prove the cycle emerged through a local fixed-point Hopf rather than a global basin/coexisting-attractor mechanism.
+- **FFN contribution.** Because `alpha=0` still leaves a tiny cycle and thick-mode `nc_force` includes the FFN, the precise claim is “attention-dominated,” not “attention-only.”
+
+**Single cleanest confirming measurement:** perform an **attention-output-scale continuation with leading eigenvalues of the actual Euler map `M=I+eps*J` at the converged fixed point just below the transition**.
+
+Concretely, for the s3200 checkpoint: set `WO <- alpha*WO`, solve to tight fixed-point residual for subcritical `alpha`, compute leading eigenvalues of `J=dF/dz` and `M=I+eps*J` at `z*(alpha)`, and increase `alpha` until convergence is lost. This is cleaner than eigenvalues at an arbitrary point on the already-existing cycle, because Hopf is a fixed-point stability statement; Floquet analysis is useful second, but characterizes the cycle after it exists.
+
+Outcomes:
+
+- **Complex pair of `M` reaches/crosses `|lambda|=1` at the same `alpha_c` where the fixed point disappears:** confirms the Hopf/Neimark-Sacker mechanism for the implemented relaxation map.
+- **The corresponding continuous eigenvalues `mu` have `Re(mu)` crossing zero:** confirms a true continuous-time Hopf, relevant to analog ODE hardware.
+- **`|1+eps*mu|>1` while `Re(mu)<0`, and smaller `eps` restores convergence:** the failure is mainly an explicit-Euler/stiff-rotation artifact, not a continuous-time Hopf.
+- **A real `lambda>=1`:** not Hopf; look for saddle-node/pitchfork/loss of stationary solution.
+- **A real `lambda<=-1`:** not Hopf; a discrete flip/period-doubling route is implicated.
+- **All fixed-point multipliers stay inside the unit circle up to loss of convergence:** likely global/coexisting attractor or basin issue rather than local Hopf.
+- **Floquet multipliers of the observed cycle all stable except phase:** confirms a stable limit cycle, but still does not identify how the stationary solution was lost.
+
+So Q1: **confirm attention-driven non-conservative oscillatory non-convergence; keep “Hopf” as the leading, not-yet-proven, local mechanism until the fixed-point eigenvalue continuation is measured.**
+
+---
+
+## Q2 — Fix: keep the operator below Hopf while preserving expressivity
+
+**Verdict:** the best immediate fix is a **residual-triggered adaptive stability homeostat**, implemented primarily with adaptive `jacreg`, plus `qknorm` and modest attention-gain/spectral guardrails. Direct cycle/residual penalties should be alarms/gates, not the primary shaping objective. A structural `r_str` parameterization is the cleanest long-term analog design, but it is less immediately surgical for the current transformer attention code.
+
+Important detail: the current `jacreg` is not a pure antisymmetric penalty. In `lt_ep_train.py:211-219`, it estimates `||J_nc v||^2 / ||v||^2` by a JVP through `blk.nc_force`; in thick mode `J_nc` is attention plus FFN (`lt_ep_train.py:92-97`). Thus it penalizes learned non-conservative/gain response — a proxy for dangerous rotating dynamics — not exactly `||(J-J^T)/2||`. That proxy is nevertheless the best-supported control knob in the dossier.
+
+Why adaptive `jacreg` is the right primary fix:
+
+- It targets the learned recurrent Jacobian that must remain inside the passive `-2z` contraction margin.
+- The controller is already wired to the right observable: the free-phase residual. `lt_ep_train.py:520-529` increases `jr` when `res/res_target` rises and relaxes it when dynamics settle.
+- The failure is abrupt; a fixed weak penalty can allow training to walk past the bifurcation. The penalty must adapt to residual/cycle onset.
+- The dossier states the validated stable runs used adaptive `jacreg`, while diverging runs froze it weakly.
+- It preserves more expressivity than simply shrinking all attention: attention can remain strong where it does not destroy the fixed point.
+
+Role of the other candidates:
+
+- **Structural `r_str` bound:** best theoretical/hardware guarantee if the recurrent operator can be parameterized as bounded symmetric plus bounded antisymmetric components. But in this code the attention state-Jacobian is data-dependent through LN, q/k projections, softmax, values, and `WO`; a simple weight-level `r_str` does not directly bound the actual rotating eigenvalues. Use this for redesign, not as the immediate rescue.
+- **Gain control / `gamma` / `qknorm`:** necessary guardrail, insufficient alone. `qknorm` is already enabled in the s3200 scripts, yet `alpha=1` cycles. The `WO` knockout proves gain matters; use gain caps, but do not rely on blunt global gain reduction as the main solution.
+- **Direct cycle-amplitude / residual penalty:** extremely analog-measurable, but symptom-level. It activates when the operator is already near/off the stationary manifold and may punish slow-but-stable modes. Use it to gate invalid EP updates and drive the homeostat.
+- **Log-norm / contraction penalty:** theoretically stronger if computed in the right metric, but global and expensive; less obviously forward-local. Use as an offline diagnostic or occasional calibration, not the main analog training primitive.
+
+Concrete recipe:
+
+1. **Keep `qknorm` on** for thick attention. It bounds q/k logits and reduces Jacobian stiffness, but is not sufficient by itself.
+2. **Initialize inside the basin.** Use small residual-branch initialization (`resinit < 1`, scaling `WO` and `pj`) and keep `c=1` or stronger initial leak.
+3. **Use adaptive `jacreg` with nonzero floor and enough ceiling.** Keep the existing controller structure. Set `res_target` well below the measured cycle floor, roughly `1e-3` to `5e-3`; keep `jr_max` high enough to recover, e.g. the code’s `16` scale; use residual EMA to avoid controller thrash.
+4. **Turn on `res_gate`.** If free/refined residual exceeds the validity gate, skip task EP/nudge gradients and apply only stabilization. Since the observed cycle floor is around `2.3e-2`, a gate of order `5e-3` to `1e-2` is appropriate.
+5. **Prefer branch-aware regularization if modifying code.** Penalize attention’s learned Jacobian more strongly than FFN, because the knockout identifies attention as dominant; keep a lighter FFN penalty because the FFN-only tiny cycle exists.
+6. **Add slow attention-output gain rails.** Because post-hoc `WO` scaling restores convergence at `alpha=0.2` and is near-safe by `alpha=0.4`, impose a soft cap/homeostat on `WO` or attention-output spectral/branch gain. Use it as a rail, not the primary objective.
+7. **Use `resreg` only as secondary T1 protection.** The `resreg` term (`lt_ep_train.py:220-231`) protects the finite `T1` state used by evaluation/BPTT, but it does not replace fixed-point stability control.
+8. **Monitor tail oscillation, not only a short residual probe.** The previous false “slow convergence” framing came from seeing transient decay and missing the residual floor. Track `T1` residual plus tail min/max or autocorrelation.
+
+Analog-realizable version:
+
+- Measure `||z_{t+1}-z_t||/||z_t||` or continuous `||F(z)||/||z||` locally during settling.
+- Approximate `jacreg` forward-only by injecting small random state perturbations `eta` and measuring `F_nc(z+delta eta)-F_nc(z)`; use that to locally reduce attention/FFN array gains or asymmetry budgets.
+- Do not require exact software `vjp`/`J-J^T` as the hardware primitive unless the substrate supports reciprocal probes. For hardware, use forward perturb-and-measure gain/curl proxies plus residual gating.
+
+So Q2: **primary fix = adaptive `jacreg`-style stability homeostasis; guardrails = `qknorm`, small residual initialization, spectral/gain caps, and `res_gate`; long-term clean analog design = structural `r_str/gamma` bounded operator.**
+
+---
+
+## Q3 — Thesis: can sub-Hopf non-conservative attention be expressive enough?
+
+**Verdict:** yes, at least for this architecture/scale. The data show a practical expressivity-vs-stability tradeoff in rotating/gain budget, but not a fundamental theorem that coherent language requires post-Hopf dynamics.
+
+The right thesis is:
+
+> Non-conservative attention can be expressive below the Hopf boundary, but it must operate with a measured stability margin. Beyond that margin, the model becomes an oscillator rather than a valid equilibrium language model.
+
+Evidence:
+
+- Exact BPTT on the identical model trains cleanly to CE `1.83` and does not drive the forward operator into the cycling regime. That strongly suggests the architecture contains stable expressive LM solutions.
+- The dossier says AsymEP matches the exact adjoint when a true fixed point exists. Therefore the failure is not an inherent fixed-point gradient ceiling; it is the EP trajectory crossing the stationary-computation boundary.
+- The knockout gives a local threshold estimate at s3200: `alpha=1.0` cycles, `0.7` cycles, `0.4` is nearly gone but still floored, `0.2` converges. For this checkpoint/batch/`eps=0.1`, the strict fixed-point critical attention-output scale is roughly between `0.2` and `0.4`, plausibly near `0.3` of the trained s3200 attention-output gain.
+
+In `r_str/gamma` terms:
+
+- The code does not measure `r_str` directly, so a numeric `r_str` threshold would be fake precision.
+- The dangerous quantity is the **effective rotating learned Jacobian relative to the contraction margin**: roughly asymmetric/rotating fraction times total attention/FFN gain, divided by the passive damping from `-(1+c)z`.
+- For the implemented Euler map, the boundary is `|1+eps*mu|=1`, not just `Re(mu)=0`. With `eps=0.1`, high imaginary frequency can destabilize the map even when continuous-time damping remains negative.
+- Operationally, the threshold is wherever the leading complex multiplier of `M=I+eps*J` reaches one. In the s3200 `WO`-scale coordinate, stay on the `alpha<=0.2-0.3` side for a strict fixed-point criterion unless the weights re-adapt under regularization.
+
+Does the sub-threshold regime suffice?
+
+- **For a coherent small LM:** yes, the BPTT result is strong evidence.
+- **For maximum transformer expressivity:** stability imposes a cost. It limits sharp recurrent routing, high non-normal amplification, and strong directed cycles. Extra capacity should come from width, depth, longer but stable settling, or controlled feedforward correction, not uncontrolled curl.
+- **For current AsymEP:** stable runs around CE `2.40` do not prove a ceiling; they show the present local training/control recipe has not yet matched BPTT.
+
+Is a hybrid the ceiling?
+
+**Likely yes for competitive analog language hardware.** The realistic design is:
+
+1. a bounded-asymmetry equilibrium core that stays sub-Hopf and supports exact AsymEP;
+2. non-conservative attention inside a measured `r_str/gamma` or spectral-margin budget, with qknorm and gain homeostasis;
+3. a thin explicit correction/readout/feedforward/digital-clocked path for operations that would otherwise require too much recurrent curl.
+
+So Q3: **sub-Hopf non-conservative attention can be expressive enough; the tradeoff is real but practical, not proven fundamental. The local s3200 threshold is `alpha_c ~ 0.2-0.4` — probably near `0.3` — or, generally, the point where the leading complex multiplier of `M` hits the unit circle. A bounded-asymmetry core plus thin correction is the realistic ceiling.**
+
+---
+
+## Q4 — Primitive: equilibrium AsymEP or native non-equilibrium learning?
+
+**Verdict:** for this codebase and near-term analog hardware, the right primitive is **equilibrium computation kept below Hopf**. Preserve a stationary state `z*`; then AsymEP is exact in the regime it assumes. Do not embrace the accidental limit cycle as the main primitive unless you replace the objective, readout, and learning rule.
+
+Why the current code requires a stationary state:
+
+- `ep_step` first computes `zs = relax(...)` and treats it as the free equilibrium (`lt_ep_train.py:142-146`).
+- The AEP correction uses `v = z - zs`, `Jv`, and `JTv` at `zs` (`lt_ep_train.py:172-178`). That is a local stationary-state linearization. A phase point on a cycle is not the `z*` required by the implicit-gradient formula.
+- The block parameter gradient is formed from `(a * f).sum()` with `f = blk.force(zs.detach(), xin, cg=True)` (`lt_ep_train.py:202-205`). If no fixed point exists, this is not the derivative of a stationary solve.
+- `holo_ep.py` has the same assumption: `holo_a`, `holo_a_select2`, and `holo_a_track` expand nudged trajectories around `zs`; `holo_a_lockin` is a demodulated nudge estimator around a base state, not a full learning rule for a self-sustained free limit cycle.
+
+What a true non-equilibrium route would require:
+
+1. define the computation as a periodic orbit, phase-averaged state, invariant measure, or reservoir trajectory instead of `z*`;
+2. define CE on a time/phase average, not an arbitrary T1 sample;
+3. handle the neutral phase mode of the orbit;
+4. replace fixed-point adjoints with Floquet/Poincare adjoints, eligibility traces, perturbation learning, or lock-in demodulation over periods;
+5. keep the orbit stable while learning all recurrent attention weights;
+6. demonstrate locality, forward-only operation, and sample efficiency for language-scale credit assignment.
+
+Is route (ii) tractable forward-only?
+
+- **For fixed reservoirs plus readout:** yes; oscillatory analog reservoirs can be useful and trainable with local/readout or perturbation rules.
+- **For full recurrent attention weights:** not currently as a clean replacement for AsymEP. It is possible as a research program — phase-demodulated perturbation learning, e-prop-like traces, Floquet-local approximations — but it will be approximate, noisy, phase-sensitive, and likely much less efficient.
+- **For this project:** training through the s3200 cycle with existing EP/holo estimators is invalid. Gate task updates when residual indicates loss of equilibrium; apply stabilizing homeostasis; resume only after a fixed point is restored.
+
+Analog-hardware conclusion:
+
+- Equilibrium analog hardware has a simple primitive: settle, nudge, measure local contrast, update.
+- A limit-cycle primitive requires clocks/phase references, demodulation windows, eligibility storage, and phase-stable credit assignment. That may be viable for special-purpose oscillatory reservoirs, but it gives up the main simplicity and exactness of EP.
+
+So Q4: **keep the operator below Hopf.** Treat non-equilibrium oscillatory learning as a separate reservoir/auxiliary research direction, not the central primitive for this AsymEP transformer. The system should be a bounded non-conservative equilibrium machine, not an accidental oscillator.
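
Review note on the Q1 "Discrete Euler artifact vs continuous Hopf" case: the condition `(1+eps*a)^2 + (eps*b)^2 < 1` quoted in the verdict can be checked directly for any measured eigenvalue `mu = a + ib`. The following is a minimal sketch, not code from `lt_ep_train.py`. The function names and the sample eigenvalues are hypothetical and were chosen only to illustrate how a damped but fast-rotating mode can fail at `eps=0.1` yet pass at a smaller step. The step bound follows from expanding `|1+eps*mu|^2 = 1 + 2*eps*a + eps^2*|mu|^2`, which is below one exactly when `eps < -2a/|mu|^2` for `a < 0`.

```python
def euler_multiplier(mu: complex, eps: float) -> complex:
    """Multiplier lambda = 1 + eps*mu of the explicit-Euler map
    z <- z + eps*F(z) for a mode with continuous eigenvalue mu."""
    return 1.0 + eps * mu


def max_stable_eps(mu: complex) -> float:
    """Largest step size with |1 + eps*mu| < 1 for this mode.

    Returns 0.0 when Re(mu) >= 0, because no positive step
    stabilizes a mode that is not damped in continuous time.
    """
    a = mu.real
    if a >= 0.0:
        return 0.0
    return -2.0 * a / (abs(mu) ** 2)


def classify(mu: complex, eps: float) -> str:
    """Sort one mode into the three cases discussed in Q1."""
    lam = euler_multiplier(mu, eps)
    if abs(lam) < 1.0:
        return "map-stable"           # discrete relaxation converges
    if mu.real < 0.0:
        return "euler-artifact"       # ODE damped, Euler map is not
    return "continuous-unstable"      # Re(mu) >= 0: true loss of stability


# Hypothetical eigenvalues, for illustration only (not measured values).
stiff_rotating = complex(-0.5, 12.0)   # damped, high imaginary frequency
passive_only = complex(-2.0, 0.0)      # the -2z contraction with c=1
growing = complex(0.1, 3.0)            # positive real part
```

Given a measured `mu`, `classify(mu, 0.1)` separates the Euler-artifact outcome from a continuous-time instability, and `max_stable_eps(mu)` gives the step size below which that single mode would converge under explicit Euler.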
