# FUGU_VERDICT_FULL — Q1–Q4

## Q1 — Mechanism: confirm/refute the non-conservative Hopf claim

**Verdict:** confirm the broad failure mode, but do **not** overclaim the exact bifurcation label yet.

The code/data are conclusive for an **attention-dominated non-conservative forward oscillatory instability with no usable fixed point** at redx `s3200`. They are **not yet conclusive** that the route is specifically a local continuous-time Hopf bifurcation of a fixed point. The best current statement is:

> The implemented relaxation map `z_{t+1} = z_t + eps*F(z_t)` has crossed from a stationary computation into an attention-driven oscillatory attractor. The most likely local mechanism is a complex-conjugate pair of eigenvalues of the map Jacobian `M = I + eps*J` crossing `|lambda| = 1`, i.e. a Hopf/Neimark-Sacker-type instability of the Euler relaxation. But the eigenvalue crossing has not yet been measured, so the exact bifurcation class remains a hypothesis.

Grounding in the code:

- `relax` is explicit Euler: `z = z + eps * blk.force(z, xin).detach()` in `lt_ep_train.py:123-133`. Therefore the relevant stability object for the implemented computation is the **discrete map** `M = I + eps*J`, not only the continuous vector-field Jacobian `J=dF/dz`.
- In the relevant `attn_mode='thick'` branch, `tforce` / `force` implement `F(z) = -(z - xin) + Attn(LN1(z)) + FFN(LN2(z)) - c*z` (`lt_ep_train.py:81-85`, `102-106`). With `c=1`, the passive term is `xin - 2z`; the learned attention/FFN Jacobian must fit inside that contraction margin.
- Attention is genuinely non-conservative in the implementation: independent `WQ/WK/WV/WO`, causal softmax, optional q/k RMSNorm (`qknorm`) in `lt_ep_train.py:58-68`. It is not the gradient of the tied conservative `attn_energy` path.
- In thick mode, `nc_force` includes **attention plus the untied FFN** (`lt_ep_train.py:92-97`).
  Thus the knockout supports “attention is dominant,” but the code also explains why a tiny FFN-only oscillation can remain when attention output is zeroed.

What the measurements prove:

1. **It is not slow convergence.** At s3200 the residual decays initially and then floors/oscillates around `~2.3e-2` through 6000 relaxation steps, with a non-monotone tail ranging from `2.08e-2` to `2.73e-2`. That rules out the earlier “rho close to one but still convergent” framing for the actual forward computation.
2. **Attention is causally responsible for the large cycle.** Scaling `WO` by `alpha` monotonically shrinks the oscillation: `alpha=1.0` cycles, `0.7` cycles with smaller amplitude, at `0.4` the cycle is nearly gone, and `0.2` restores a true fixed point. That is strong causal evidence that attention’s non-conservative/gain component drives the failure.
3. **The estimator is not the primary explanation once no fixed point exists.** `ep_step` assumes `zs = relax(...)` is a free equilibrium and forms the AEP/nudged update around it. The dossier says AsymEP is accurate when a fixed point exists; at s3200 the required object is absent.

What remains unexcluded:

- **Discrete Euler artifact vs continuous Hopf.** For a continuous eigenvalue `mu=a+ib` of `J`, the Euler multiplier is `lambda=1+eps*mu`; stability requires `(1+eps*a)^2 + (eps*b)^2 < 1`. A stiff rotating mode with `a<0` can still have `|1+eps*mu|>1` at `eps=0.1`. Then the digital relaxation cycles even though the underlying continuous-time analog ODE, and the Euler map at a smaller `eps`, would converge.
- **Real-multiplier alternatives.** A real `lambda` crossing `+1` would indicate saddle-node/pitchfork/loss of stationary solutions; a real `lambda` crossing `-1` would indicate a flip/period-doubling route. The observed smooth oscillation and attention scaling favor a complex pair, but do not prove one.
- **Global/coexisting-attractor route.** The long relaxation proves that the trajectory from the embedding does not settle to a stationary computation.
  It does not, by itself, prove the cycle emerged through a local fixed-point Hopf rather than a global basin/coexisting-attractor mechanism.
- **FFN contribution.** Because `alpha=0` still leaves a tiny cycle and thick-mode `nc_force` includes the FFN, the precise claim is “attention-dominated,” not “attention-only.”

**Single cleanest confirming measurement:** perform an **attention-output-scale continuation with leading eigenvalues of the actual Euler map `M=I+eps*J` at the converged fixed point just below the transition**. Concretely, for the s3200 checkpoint: set `WO <- alpha*WO`, solve to tight fixed-point residual for subcritical `alpha`, compute leading eigenvalues of `J=dF/dz` and `M=I+eps*J` at `z*(alpha)`, and increase `alpha` until convergence is lost. This is cleaner than eigenvalues at an arbitrary point on the already-existing cycle, because Hopf is a fixed-point stability statement; Floquet analysis is useful second, but characterizes the cycle after it exists.

Outcomes:

- **Complex pair of `M` reaches/crosses `|lambda|=1` at the same `alpha_c` where the fixed point disappears:** confirms the Hopf/Neimark-Sacker mechanism for the implemented relaxation map.
- **The corresponding continuous eigenvalues `mu` have `Re(mu)` crossing zero:** confirms a true continuous-time Hopf, relevant to analog ODE hardware.
- **`|1+eps*mu|>1` while `Re(mu)<0`, and smaller `eps` restores convergence:** the failure is mainly an explicit-Euler/stiff-rotation artifact, not a continuous-time Hopf.
- **A real `lambda>=1`:** not Hopf; look for saddle-node/pitchfork/loss of stationary solution.
- **A real `lambda<=-1`:** not Hopf; a discrete flip/period-doubling route is implicated.
- **All fixed-point multipliers stay inside the unit circle up to loss of convergence:** likely global/coexisting attractor or basin issue rather than local Hopf.
- **Floquet multipliers of the observed cycle all stable except phase:** confirms a stable limit cycle, but still does not identify how the stationary solution was lost.

So Q1: **confirm attention-driven non-conservative oscillatory non-convergence; keep “Hopf” as the leading, not-yet-proven, local mechanism until the fixed-point eigenvalue continuation is measured.**

---

## Q2 — Fix: keep the operator below Hopf while preserving expressivity

**Verdict:** the best immediate fix is a **residual-triggered adaptive stability homeostat**, implemented primarily with adaptive `jacreg`, plus `qknorm` and modest attention-gain/spectral guardrails. Direct cycle/residual penalties should be alarms/gates, not the primary shaping objective. A structural `r_str` parameterization is the cleanest long-term analog design, but it is less immediately surgical for the current transformer attention code.

Important detail: the current `jacreg` is not a pure antisymmetric penalty. In `lt_ep_train.py:211-219`, it estimates `||J_nc v||^2 / ||v||^2` by a JVP through `blk.nc_force`; in thick mode `J_nc` is attention plus FFN (`lt_ep_train.py:92-97`). Thus it penalizes the learned non-conservative/gain response (a proxy for dangerous rotating dynamics), not exactly `||(J-J^T)/2||`. That proxy is nevertheless the best-supported control knob in the dossier.

Why adaptive `jacreg` is the right primary fix:

- It targets the learned recurrent Jacobian that must remain inside the passive `-2z` contraction margin.
- The controller is already wired to the right observable: the free-phase residual. `lt_ep_train.py:520-529` increases `jr` when `res/res_target` rises and relaxes it when dynamics settle.
- The failure is abrupt; a fixed weak penalty can allow training to walk past the bifurcation. The penalty must adapt to residual/cycle onset.
- The dossier states the validated stable runs used adaptive `jacreg`, while diverging runs froze it weakly.
- It preserves more expressivity than simply shrinking all attention: attention can remain strong where it does not destroy the fixed point.

Role of the other candidates:

- **Structural `r_str` bound:** best theoretical/hardware guarantee if the recurrent operator can be parameterized as bounded symmetric plus bounded antisymmetric components. But in this code the attention state-Jacobian is data-dependent through LN, q/k projections, softmax, values, and `WO`; a simple weight-level `r_str` does not directly bound the actual rotating eigenvalues. Use this for redesign, not as the immediate rescue.
- **Gain control / `gamma` / `qknorm`:** necessary guardrail, insufficient alone. `qknorm` is already enabled in the s3200 scripts, yet `alpha=1` cycles. The `WO` knockout proves gain matters; use gain caps, but do not rely on blunt global gain reduction as the main solution.
- **Direct cycle-amplitude / residual penalty:** extremely analog-measurable, but symptom-level. It activates when the operator is already near/off the stationary manifold and may punish slow-but-stable modes. Use it to gate invalid EP updates and drive the homeostat.
- **Log-norm / contraction penalty:** theoretically stronger if computed in the right metric, but global and expensive; less obviously forward-local. Use as an offline diagnostic or occasional calibration, not the main analog training primitive.

Concrete recipe:

1. **Keep `qknorm` on** for thick attention. It bounds q/k logits and reduces Jacobian stiffness, but is not sufficient by itself.
2. **Initialize inside the basin.** Use small residual-branch initialization (`resinit < 1`, scaling `WO` and `pj`) and keep `c=1` or stronger initial leak.
3. **Use adaptive `jacreg` with nonzero floor and enough ceiling.** Keep the existing controller structure. Set `res_target` well below the measured cycle floor, roughly `1e-3` to `5e-3`; keep `jr_max` high enough to recover, e.g.
   the code’s `16` scale; use residual EMA to avoid controller thrash.
4. **Turn on `res_gate`.** If free/refined residual exceeds the validity gate, skip task EP/nudge gradients and apply only stabilization. Since the observed cycle floor is around `2.3e-2`, a gate of order `5e-3` to `1e-2` is appropriate.
5. **Prefer branch-aware regularization if modifying code.** Penalize attention’s learned Jacobian more strongly than FFN, because the knockout identifies attention as dominant; keep a lighter FFN penalty because the FFN-only tiny cycle exists.
6. **Add slow attention-output gain rails.** Because post-hoc `WO` scaling restores convergence at `alpha=0.2` and is near-safe by `alpha=0.4`, impose a soft cap/homeostat on `WO` or attention-output spectral/branch gain. Use it as a rail, not the primary objective.
7. **Use `resreg` only as secondary T1 protection.** The `resreg` term (`lt_ep_train.py:220-231`) protects the finite `T1` state used by evaluation/BPTT, but it does not replace fixed-point stability control.
8. **Monitor tail oscillation, not only a short residual probe.** The previous false “slow convergence” framing came from seeing transient decay and missing the residual floor. Track `T1` residual plus tail min/max or autocorrelation.

Analog-realizable version:

- Measure `||z_{t+1}-z_t||/||z_t||` or continuous `||F(z)||/||z||` locally during settling.
- Approximate `jacreg` forward-only by injecting small random state perturbations `eta` and measuring `F_nc(z+delta eta)-F_nc(z)`; use that to locally reduce attention/FFN array gains or asymmetry budgets.
- Do not require exact software `vjp`/`J-J^T` as the hardware primitive unless the substrate supports reciprocal probes. For hardware, use forward perturb-and-measure gain/curl proxies plus residual gating.
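
The forward-only perturb-and-measure proxy described in the analog-realizable list can be sketched as follows. This is a minimal NumPy illustration, not the code in `lt_ep_train.py`: `gain_proxy` and `toy_nc_force` are hypothetical names, and the toy operator is a pure scaled rotation chosen so that the expected value is known in closed form. With unit-norm probes `eta`, the finite difference `(F_nc(z + delta*eta) - F_nc(z)) / delta` approximates `J_nc eta`, so the averaged squared norm estimates `||J_nc v||^2 / ||v||^2` without any `vjp`.

```python
import numpy as np

# 90-degree rotation: a purely antisymmetric (rotating) toy operator.
R = np.array([[0.0, -1.0], [1.0, 0.0]])

def toy_nc_force(z, alpha=1.0, gain=3.0):
    """Hypothetical stand-in for a non-conservative branch; alpha mimics WO scaling."""
    return alpha * gain * (R @ z)

def gain_proxy(f_nc, z, delta=1e-3, n_probes=8, seed=0):
    """Forward-only estimate of mean ||J_nc eta||^2 over random unit probes eta."""
    rng = np.random.default_rng(seed)
    base = f_nc(z)  # unperturbed response, measured once
    total = 0.0
    for _ in range(n_probes):
        eta = rng.standard_normal(z.shape)
        eta /= np.linalg.norm(eta)                 # unit probe, so ||v||^2 = 1
        jv = (f_nc(z + delta * eta) - base) / delta  # finite-difference J_nc @ eta
        total += float(np.sum(jv * jv))
    return total / n_probes

if __name__ == "__main__":
    z = np.array([1.0, -0.5])
    for alpha in (1.0, 0.4, 0.2):
        g = gain_proxy(lambda s: toy_nc_force(s, alpha=alpha), z)
        print(f"alpha={alpha:.1f}  gain proxy={g:.4f}")
```

For this toy operator the proxy equals `(alpha * gain)^2`, so scaling the output by `alpha` reduces it quadratically. On a real attention block the estimate is state-dependent and noisy, so it would need averaging over states and probes before driving any gain or asymmetry budget.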
So Q2: **primary fix = adaptive `jacreg`-style stability homeostasis; guardrails = `qknorm`, small residual initialization, spectral/gain caps, and `res_gate`; long-term clean analog design = structural `r_str/gamma` bounded operator.**

---

## Q3 — Thesis: can sub-Hopf non-conservative attention be expressive enough?

**Verdict:** yes, at least for this architecture/scale. The data show a practical expressivity-vs-stability tradeoff in rotating/gain budget, but not a fundamental theorem that coherent language requires post-Hopf dynamics. The right thesis is:

> Non-conservative attention can be expressive below the Hopf boundary, but it must operate with a measured stability margin. Beyond that margin, the model becomes an oscillator rather than a valid equilibrium language model.

Evidence:

- Exact BPTT on the identical model trains cleanly to CE `1.83` and does not drive the forward operator into the cycling regime. That strongly suggests the architecture contains stable expressive LM solutions.
- The dossier says AsymEP matches the exact adjoint when a true fixed point exists. Therefore the failure is not an inherent fixed-point gradient ceiling; it is the EP trajectory crossing the stationary-computation boundary.
- The knockout gives a local threshold estimate at s3200: `alpha=1.0` cycles, `0.7` cycles, `0.4` is nearly gone but still floored, `0.2` converges. For this checkpoint/batch/`eps=0.1`, the strict fixed-point critical attention-output scale is roughly between `0.2` and `0.4`, plausibly near `0.3` of the trained s3200 attention-output gain.

In `r_str/gamma` terms:

- The code does not measure `r_str` directly, so a numeric `r_str` threshold would be fake precision.
- The dangerous quantity is the **effective rotating learned Jacobian relative to the contraction margin**: roughly asymmetric/rotating fraction times total attention/FFN gain, divided by the passive damping from `-(1+c)z`.
- For the implemented Euler map, the boundary is `|1+eps*mu|=1`, not just `Re(mu)=0`. With `eps=0.1`, high imaginary frequency can destabilize the map even when continuous-time damping remains negative.
- Operationally, the threshold is wherever the leading complex multiplier of `M=I+eps*J` reaches unit modulus. In the s3200 `WO`-scale coordinate, stay on the `alpha<=0.2-0.3` side for a strict fixed-point criterion unless the weights re-adapt under regularization.

Does the sub-threshold regime suffice?

- **For a coherent small LM:** yes, the BPTT result is strong evidence.
- **For maximum transformer expressivity:** stability imposes a cost. It limits sharp recurrent routing, high non-normal amplification, and strong directed cycles. Extra capacity should come from width, depth, longer but stable settling, or controlled feedforward correction, not uncontrolled curl.
- **For current AsymEP:** stable runs around CE `2.40` do not prove a ceiling; they show the present local training/control recipe has not yet matched BPTT.

Is a hybrid the ceiling? **Likely yes for competitive analog language hardware.** The realistic design is:

1. a bounded-asymmetry equilibrium core that stays sub-Hopf and supports exact AsymEP;
2. non-conservative attention inside a measured `r_str/gamma` or spectral-margin budget, with qknorm and gain homeostasis;
3. a thin explicit correction/readout/feedforward/digital-clocked path for operations that would otherwise require too much recurrent curl.

So Q3: **sub-Hopf non-conservative attention can be expressive enough; the tradeoff is real but practical, not proven fundamental. The local s3200 threshold is `alpha_c ~ 0.2-0.4` (probably near `0.3`) or, generally, the point where the leading complex multiplier of `M` hits the unit circle. A bounded-asymmetry core plus thin correction is the realistic ceiling.**

---

## Q4 — Primitive: equilibrium AsymEP or native non-equilibrium learning?
**Verdict:** for this codebase and near-term analog hardware, the right primitive is **equilibrium computation kept below Hopf**. Preserve a stationary state `z*`; then AsymEP is exact in the regime it assumes. Do not embrace the accidental limit cycle as the main primitive unless you replace the objective, readout, and learning rule.

Why the current code requires a stationary state:

- `ep_step` first computes `zs = relax(...)` and treats it as the free equilibrium (`lt_ep_train.py:142-146`).
- The AEP correction uses `v = z - zs`, `Jv`, and `JTv` at `zs` (`lt_ep_train.py:172-178`). That is a local stationary-state linearization. A phase point on a cycle is not the `z*` required by the implicit-gradient formula.
- The block parameter gradient is formed from `(a * f).sum()` with `f = blk.force(zs.detach(), xin, cg=True)` (`lt_ep_train.py:202-205`). If no fixed point exists, this is not the derivative of a stationary solve.
- `holo_ep.py` has the same assumption: `holo_a`, `holo_a_select2`, and `holo_a_track` expand nudged trajectories around `zs`; `holo_a_lockin` is a demodulated nudge estimator around a base state, not a full learning rule for a self-sustained free limit cycle.

What a true non-equilibrium route would require:

1. define the computation as a periodic orbit, phase-averaged state, invariant measure, or reservoir trajectory instead of `z*`;
2. define CE on a time/phase average, not an arbitrary T1 sample;
3. handle the neutral phase mode of the orbit;
4. replace fixed-point adjoints with Floquet/Poincare adjoints, eligibility traces, perturbation learning, or lock-in demodulation over periods;
5. keep the orbit stable while learning all recurrent attention weights;
6. demonstrate locality, forward-only operation, and sample efficiency for language-scale credit assignment.

Is route (ii), native non-equilibrium learning, tractable forward-only?

- **For fixed reservoirs plus readout:** yes; oscillatory analog reservoirs can be useful and trainable with local/readout or perturbation rules.
- **For full recurrent attention weights:** not currently as a clean replacement for AsymEP. It is possible as a research program (phase-demodulated perturbation learning, e-prop-like traces, Floquet-local approximations), but it will be approximate, noisy, phase-sensitive, and likely much less efficient.
- **For this project:** training through the s3200 cycle with existing EP/holo estimators is invalid. Gate task updates when residual indicates loss of equilibrium; apply stabilizing homeostasis; resume only after a fixed point is restored.

Analog-hardware conclusion:

- Equilibrium analog hardware has a simple primitive: settle, nudge, measure local contrast, update.
- A limit-cycle primitive requires clocks/phase references, demodulation windows, eligibility storage, and phase-stable credit assignment. That may be viable for special-purpose oscillatory reservoirs, but it gives up the main simplicity and exactness of EP.

So Q4: **keep the operator below Hopf.** Treat non-equilibrium oscillatory learning as a separate reservoir/auxiliary research direction, not the central primitive for this AsymEP transformer. The system should be a bounded non-conservative equilibrium machine, not an accidental oscillator.
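
The gating rule above (skip task updates whenever the residual signals loss of equilibrium) can be illustrated on a toy system. The sketch below is an assumption-laden stand-in, not the project code. It uses a 2-D linear force `F(z) = xin - 2z + g*R*z`, where the leak matches the `c=1` case and `R` is a rotation. All helper names (`make_force`, `euler_spectral_radius`, `relax_residuals`, `equilibrium_ok`) are hypothetical. The gate value `5e-3` is taken from the range suggested in the Q2 recipe. Because the toy is linear, instability shows up as divergence rather than a bounded cycle, but the residual gate trips the same way.

```python
import numpy as np

R = np.array([[0.0, -1.0], [1.0, 0.0]])  # rotation generator, eigenvalues +/- i

def make_force(g, xin):
    """Toy force F(z) = xin - 2 z + g R z: passive leak plus a rotating term."""
    def force(z):
        return xin - 2.0 * z + g * (R @ z)
    return force

def euler_spectral_radius(g, eps):
    """Largest |lambda| of the Euler map M = I + eps*J with J = -2 I + g R."""
    J = -2.0 * np.eye(2) + g * R
    M = np.eye(2) + eps * J
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def relax_residuals(force, z0, eps=0.1, steps=1000):
    """Explicit-Euler relaxation that records ||z_{t+1}-z_t|| / ||z_t|| per step."""
    z = z0.astype(float).copy()
    res = np.empty(steps)
    for t in range(steps):
        dz = eps * force(z)
        res[t] = np.linalg.norm(dz) / max(np.linalg.norm(z), 1e-12)
        z = z + dz
    return z, res

def equilibrium_ok(res, gate=5e-3, tail=200):
    """Gate on the worst residual in the tail, not on a short early probe."""
    return bool(np.max(res[-tail:]) < gate)

if __name__ == "__main__":
    xin = np.array([1.0, 0.5])
    for g, eps in ((1.0, 0.1), (8.0, 0.1), (8.0, 0.01)):
        _, res = relax_residuals(make_force(g, xin), xin, eps=eps)
        rho = euler_spectral_radius(g, eps)
        print(f"g={g} eps={eps} rho(M)={rho:.3f} task_update_allowed={equilibrium_ok(res)}")
```

In this toy, `g=8` has `Re(mu) = -2 < 0` yet fails the gate at `eps=0.1` and passes at `eps=0.01`, which is the explicit-Euler artifact case listed in Q1. A training loop would apply only stabilization while `equilibrium_ok` is false.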