path: root/docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md
Diffstat (limited to 'docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md')
-rw-r--r-- docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md 125
1 files changed, 125 insertions, 0 deletions
diff --git a/docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md b/docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md
new file mode 100644
index 0000000..55b511c
--- /dev/null
+++ b/docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md
@@ -0,0 +1,125 @@
+# Physics answer for Q1–Q7, grounded in the actual simulator code
+
+*Independent verification pass. Read-only: I did **not** run any training, probe, or GPU job. Every line citation below was re-checked against the files on disk; empirical numbers are quoted from probe **logs already on disk** (`ep_run/runs/*.log`), not from new runs.*
+
+---
+
+## 0. The decisive facts the code actually establishes (read these first)
+
+These control all seven answers. Two of them overturn parts of the brief.
+
+**F1 — The active "thick" force is a damped residual DEQ-transformer block.**
+`tforce(z,xin) = -(z-xin) + attn(LN z) + FFN(LN z) - c·z`, i.e. `xin - (1+c)z + attn(LN z) + FFN(LN z)`. The FFN uses **tanh-approx GELU**. (`lt_ep_train.py:81-85`; same branch in `force(...,'thick')` at `lt_ep_train.py:102-106`.) Q,K,V,O are **independent** matrices (`lt_ep_train.py:39-40`), so attention is genuinely non-conservative; `--qknorm` optionally applies RMSNorm to q/k (`lt_ep_train.py:63-65`).
+
+**F2 — This code's EP is NOT a forward-only, settled-state-only rule. It already uses the transpose.**
+The AEP correction explicitly forms **both** `Jv` (jvp) and **`Jᵀv` (vjp)** of the non-conservative force and injects `-(Jv-Jᵀv)` into the nudged dynamics (`lt_ep_train.py:172-178`). The production holomorphic paths do the same with `torch.func.jvp`/`torch.func.vjp` (`holo_ep.py:181-185` plain N=2; `holo_ep.py:236-239` tracking). The block-param gradient is then `grad((a·f).sum(), block)` (`lt_ep_train.py:202-207`) and the readout head gets its own local grad `dCE/dWh` (`lt_ep_train.py:208-210`). **Implication:** on a GPU the transpose is a free autograd call, but on a one-way analog relaxation substrate `Jᵀv` is exactly the capability you do not have. So the simulator is *not itself* a demonstration of a physically forward-only rule — a critical caveat the brief glosses.
+
+**F3 — The implemented AEP is an algebraic `J→Jᵀ` identity, not a first-order `S⁻¹A` series, when the correction is not clipped.**
+Subtracting `(J-Jᵀ)v` turns the nudged differential operator `J` into `J-(J-Jᵀ)=Jᵀ` exactly, for any asymmetry (`lt_ep_train.py:176`, `holo_ep.py:239`). Caveat: the **plain real** nudge path clips the correction if `‖corr‖>‖f‖` (`lt_ep_train.py:177-178`), so that path is not algebraically exact under clipping. The holomorphic/production paths shown in `holo_ep.py` do not apply that norm clip at the same point (`holo_ep.py:181-185,236-239`). The "ρ(S⁻¹A)<1 / Neumann series" warranty applies to the **uncorrected** vector-field EP baseline, **not** to this correction — and the diagnosis file itself retracts the earlier mis-attribution (`EP_BELOW210_DIAGNOSIS_FIX.md:174-184`). **The brief's framing in its §1/§2 (AEP "out of warranty" at ‖S⁻¹A‖≈1.4) does not match the code.**
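+
+A worked form of the identity above, as a sketch: write `J = S + A` with `S = (J+Jᵀ)/2` symmetric and `A = (J-Jᵀ)/2` antisymmetric. The perturbation symbol `δz` and the linearized form of the nudged dynamics are assumptions made for illustration; the code itself is not reproduced here.
+
+```latex
+\begin{aligned}
+J - J^{\top} &= (S + A) - (S - A) = 2A, \\
+\dot{\delta z} &= J\,\delta z - (J - J^{\top})\,\delta z
+               = (S + A - 2A)\,\delta z
+               = (S - A)\,\delta z
+               = J^{\top}\,\delta z .
+\end{aligned}
+```
+
+No step expands in powers of `S⁻¹A`, so no convergence condition on `ρ(S⁻¹A)` enters, provided the correction is applied unclipped.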
+
+**F4 — The "‖S⁻¹A‖≈1.4" number in the brief is a different quantity than ρ(S⁻¹A).**
+The brief line 11, and the older diagnosis at `EP_BELOW210_DIAGNOSIS_FIX.md:146`, equate `‖A‖/‖S‖`-scale ≈1.4 with `|Jv−Jᵀv|/|Jv|≈1.4`. That ratio is `|2Av|/|Jv|`, **not** `ρ(S⁻¹A)` or `‖S⁻¹A‖`. When the code actually measures `S⁻¹A` (`asym_probe.py:325-332,364-432`), it gets `‖S⁻¹A‖₂≈10–13` and `ρ` between 1.6 (Arnoldi) and 4.6 (power) — and even those are at **non-converged** points with MINRES solve residuals ~0.15 (`runs/oracle_probe.log:110-114`; `runs/asym_probe_full.log:145-148`). So the "1.4" is a loose proxy, the real `S⁻¹A` operator is large and **ill-measured**, and no single trustworthy ρ value exists in the logs. Treat all ρ-based claims as soft.
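+
+To see why the proxy and `ρ(S⁻¹A)` can disagree, here is a toy 2×2 case. It is purely illustrative: the scalars `s`, `a` and the matrices below are assumptions, not values from the code or the logs.
+
+```latex
+\begin{aligned}
+S &= -s\,I_2, \qquad
+A = a\begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \qquad
+J = S + A, \qquad s, a > 0, \\
+\frac{\lVert Jv - J^{\top}v\rVert}{\lVert Jv\rVert}
+  &= \frac{\lVert 2Av\rVert}{\lVert Jv\rVert}
+   = \frac{2a}{\sqrt{s^{2} + a^{2}}} < 2, \\
+\rho(S^{-1}A) &= \lVert S^{-1}A\rVert_2 = \frac{a}{s}.
+\end{aligned}
+```
+
+In this toy the ratio never exceeds 2 while `a/s` is unbounded: `a/s = 10` gives a ratio of about 1.99, and a ratio of 1.4 corresponds to `a/s ≈ 0.98`. The toy says nothing about the real operator, where `S` is not a multiple of the identity; it only shows that one number cannot be read off from the other.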
+
+**F5 — The cleanest converged measurement says the adjoint is EXACT and the gap is the ESTIMATOR.**
+At a tightly converged checkpoint (`runs/diag_probe.log`, step-200 ckpt, free-phase residual **1.6e-9**): on the block-parameter scope used by the diagnostic (readout `Wh` excluded; `diag_probe.log:6`), `cos(g_transpose, g_BPTT) = 1.000000` on all 6 batches, while `cos(g_EP, g_transpose)` is only **0.51–0.72** (`diag_probe.log:13-44`). The β/hr sweep at that *same* converged state shows cos **rising with β**: 0.002→0.15, 0.005→0.38, 0.01→0.56, 0.02→0.72, 0.04→0.86 (`diag_probe.log:46-50`). **Reading:** (i) the exact transpose adjoint *is* BPTT (so it is not a finite-time-vs-fixed-point gap, at least at T1=150 with res→0); (ii) the EP shortfall is a 1/β noise-amplification / nudge-conditioning problem, consistent with the diagnosis' "hr=0.02 is 10× too small" note (`EP_BELOW210_DIAGNOSIS_FIX.md:196-205`). This is the single most important empirical anchor and the prior draft under-weighted it.
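+
+The 1/β reading can be sketched with a generic finite-difference estimate. Here `g(β)` is a hypothetical smooth settled-state quantity whose derivative at `β = 0` is the wanted gradient, and `e_β`, `e_0` are hypothetical settling errors; none of these symbols is named in the code, and the estimator in `lt_ep_train.py` may differ in form.
+
+```latex
+\hat g(\beta)
+  = \frac{\bigl(g(\beta) + e_\beta\bigr) - \bigl(g(0) + e_0\bigr)}{\beta}
+  = g'(0) + O(\beta) + \frac{e_\beta - e_0}{\beta}.
+```
+
+With the settling error held fixed, the last term shrinks as β grows, which is the direction of the trend in the sweep above, up to the point where the `O(β)` bias takes over.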
+
+**F6 — EP and BPTT can optimize different states (point mismatch).**
+EP relaxes to T1, records the T1 residual, then *refines* up to `t1max` before computing the gradient (`lt_ep_train.py:143-152`); but `evaluate` (`:279-287`) and `bptt_step` (`:260-265`) both use the T1 state. With `t1max>T1` the EP grad is taken at a *different, tighter* point than the one being evaluated. This is real, but F5 shows it is **not** the dominant failure once you actually converge T1.
+
+**F7 — The "stabilizers" are proxies, not the physical quantities the brief asks about.**
+`jacreg` penalizes `‖J_nc v‖²` for random `v` (Hutchinson `‖J_nc‖_F²`), **not** `‖A‖`, ρ(S⁻¹A), or a log-norm μ_P(J) (`lt_ep_train.py:211-216`). `specnorm` projects **weight** singular values post-step (`lt_ep_train.py:290-332`, applied at `:561-562`), not the state Jacobian. The no-jacreg path instead feedback-controls the damping `blk.c` on the residual (`lt_ep_train.py:530-534`). None is a contraction-by-construction guarantee.
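+
+Why a Hutchinson-style penalty cannot single out the antisymmetric part, as a sketch. The probe distribution (`E[vvᵀ] = I`) is an assumption; `J` stands for the `J_nc` above, split as `J = S + A` with `S` symmetric and `A` antisymmetric.
+
+```latex
+\begin{aligned}
+\mathbb{E}_v\,\lVert Jv\rVert^{2}
+  &= \mathbb{E}_v\,\operatorname{tr}\!\bigl(J^{\top}J\,vv^{\top}\bigr)
+   = \operatorname{tr}(J^{\top}J)
+   = \lVert J\rVert_F^{2}, \\
+\lVert J\rVert_F^{2}
+  &= \lVert S\rVert_F^{2} + \lVert A\rVert_F^{2}
+  \quad\text{since}\quad \operatorname{tr}(S^{\top}A) = 0 .
+\end{aligned}
+```
+
+The penalty therefore shrinks `S` and `A` together and does not target `‖A‖` specifically.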
+
+**F8 — Stale-assumption / dead-parameter caveats.**
+The diagnosis lists old bugs (holo `cforce` missing q/k norm; erf-vs-tanh GELU mismatch: `EP_BELOW210_DIAGNOSIS_FIX.md:83-86`). In the current code thick `tforce` uses tanh GELU (`lt_ep_train.py:84`) and the nudge forces include q/k RMSNorm (`holo_ep.py:44-46,142-144`), so those are largely closed. `asym_probe.py` still exposes `--gelu` with default `erf` and stores it on `blk` (`asym_probe.py:35,119`), but `EQBlock.force/tforce` in this file never read `blk.gelu`; they hard-code tanh-approx GELU (`lt_ep_train.py:84,105`). So the current mismatch is a stale/no-op probe flag, **not** an erf-vs-tanh force mismatch in the active code. Also: in `thick` mode `Wm`, `P`, `Q` are allocated and listed in `blk.block` (`lt_ep_train.py:41-50`) but never used by the thick force (`lt_ep_train.py:81-85`) — so any "parameter count" / energy-vs-free comparison must exclude them.
+
+---
+
+## Q1 — Is the obstruction FUNDAMENTAL or an estimator artifact?
+
+**(a) Verdict — split, and the code resolves it decisively for the converged regime.**
+- *Plain forward-only settled-state EP* on a non-conservative equilibrium: **FUNDAMENTAL** bias. The implicit-function gradient needs `λ` solving `Jᵀλ = -∂L/∂z`; a one-way nudge relaxes under `J`, giving `J⁻¹`, and `J⁻¹≠J⁻ᵀ` when `J≠Jᵀ`. No reading of settled states of the *forward* dynamics supplies the missing `Jᵀ` action.
+- *AEP as actually used in the unclipped holomorphic/tracking paths* (F2/F3): the `J→Jᵀ` swap is **exact at any asymmetry** in the β→0, converged limit; the plain real path can lose exactness when its correction clip triggers. So there is **no fundamental asymmetry no-go for this estimator** — the brief's central "ρ(S⁻¹A)<1 wall" is **not** what the code suffers from.
+- *What actually remains* (F5): an **estimator artifact** — at hr/β=0.02 the EP contrast is only 0.5–0.7 aligned with the (exact) adjoint even at residual 1e-9, and it *improves monotonically with β*. The gap is 1/β noise amplification of the nudge-convergence error, not the asymmetry.
+
+The brief's own sub-question — "is `J⁻¹−J⁻ᵀ` fundamentally inaccessible to a forward relaxation?" — is answered: **yes for a strictly forward-only substrate; no for this simulator, because it pays for `Jᵀv` with a digital VJP.**
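+
+The adjoint equation used above follows from the implicit function theorem. In this sketch `f(z,θ)` is the force, `z*` its fixed point, `L` the loss, and `J = ∂f/∂z` at `z*`. The symbol `θ` for the block parameters, the invertibility of `J`, and the assumption that `L` depends on `θ` only through `z*` are added for illustration.
+
+```latex
+\begin{aligned}
+f(z^{*},\theta) = 0
+  \;&\Rightarrow\;
+  \frac{dz^{*}}{d\theta} = -J^{-1}\,\frac{\partial f}{\partial\theta}, \\
+\frac{dL}{d\theta}
+  &= \frac{\partial L}{\partial z}\,\frac{dz^{*}}{d\theta}
+   = -\frac{\partial L}{\partial z}\,J^{-1}\,\frac{\partial f}{\partial\theta}
+   = \lambda^{\top}\frac{\partial f}{\partial\theta},
+  \qquad
+  J^{\top}\lambda = -\Bigl(\frac{\partial L}{\partial z}\Bigr)^{\!\top}.
+\end{aligned}
+```
+
+A relaxation under `J` alone yields `J⁻¹` acting on the loss gradient, which agrees with the required `J⁻ᵀ` action for every loss gradient only when `J = Jᵀ`.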
+
+**(b) Physical-realizability:** Forward-only, no transpose → **No** (can't get exact non-conservative gradients). Forward relaxation **+ a physical transpose/adjoint phase** → **Yes in principle**, but that is a *second* physical primitive (see Q3), not vanilla EP. The current code is local in the weight update but **not** transpose-free.
+
+**(c) Cheapest falsifier:** On one converged ckpt + one batch, run an `asym_probe.py --diag`-style three-way comparison with the AEP correction **disabled** (set `corr` to zero in the nudge, `lt_ep_train.py:176`). The mechanism "the transpose is what AEP buys" is falsified if **AEP-off EP still matches `g_transpose`** at high asymmetry. (All machinery exists: `asym_probe.py:804-854`; no training needed.)
+
+---
+
+## Q2 — Decouple non-normality from lost contraction. Is the GOOD solution intrinsically high ρ(S⁻¹A)?
+
+**(a) Verdict — ENGINEERING gap, and the brief's premise is unproven by the code.**
+The damping enters as `-(1+c)z` (F1), so at a healthy fixed point `S≈-(1+c)I` and `S⁻¹A≈-A/(1+c)` — raising `c` shrinks `S⁻¹A` *mechanically*. The code lets you test this (`--c`, `lt_ep_train.py:348,397`). But "expressivity *requires* ‖S⁻¹A‖≈1.4" is **not** established: (i) the 1.4 is a proxy, not `S⁻¹A` (F4); (ii) the real `S⁻¹A` measurements (10–13, ρ 1.6–4.6) were taken at **non-converged** or **pre-blowup** ckpts with garbage solve residuals (F4), exactly where contraction is already lost — so they confound "intrinsic asymmetry" with "S near-singular," which is the brief's own worry (brief line 15). The diagnosis concurs the ρ verdict "answered the wrong question" (`EP_BELOW210_DIAGNOSIS_FIX.md:184`).
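+
+Taking the approximation `S ≈ -(1+c)I` above at face value, the scaling is explicit. This is a sketch under that approximation only: it ignores the symmetric parts of the attention and FFN Jacobians, which also contribute to `S`, and it assumes `A` itself does not change with `c`.
+
+```latex
+S \approx -(1+c)\,I
+  \;\Rightarrow\;
+  S^{-1}A \approx -\frac{A}{1+c},
+  \qquad
+  \rho(S^{-1}A) \approx \frac{\rho(A)}{1+c},
+  \qquad
+  \rho(S^{-1}A) < 1 \iff c > \rho(A) - 1 .
+```
+
+Under these assumptions a `--c` sweep moves `ρ(S⁻¹A)` by a known factor, which is what makes it a usable control variable for the falsifier in (c).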
+
+Caveats from code: with `jacreg>0` (the documented recipe) the residual controller moves `jr`, **not** `c` (`:520-529`), so `c` is effectively fixed mid-run; only the no-jacreg branch adapts `c` (`:530-534`). `qknorm` bounds logits but does not guarantee contraction (`:63-68`). The `mono` branch's "sym(J)<0 guaranteed" comment (`:107-111`) is **false for the full operator** because it still adds raw `self.attn(zr)` (`:111`), whose antisymmetric part is unconstrained.
+
+**(b) Physical-realizability:** `c`, q/k-norm, weight scaling → forward-only, physical, **Yes**. `jacreg`/`specnorm` as written are autograd/SVD digital ops → **not** native local analog rules. A passive metric / per-channel `c` that holds ‖S⁻¹A‖<1 *uniformly without a penalty* is **not implemented**.
+
+**(c) Cheapest falsifier:** A `--c` sweep (e.g. 1,2,3,4) from identical init/recipe, logging val-CE, T1 residual, and one-batch `cos(g_EP,g_transpose)` + measured ρ(S⁻¹A) at matched CE. Mechanism "lost contraction, not intrinsic asymmetry" is falsified if large `c` keeps residual/ρ healthy but EP **still** collapses at the same CE wall. (`runs/ep_c3_scratch.*` already exists from such a run.)
+
+---
+
+## Q3 — Can the physical hardware compute its own adjoint (transpose dynamics)?
+
+**(a) Verdict — ENGINEERING gap *if* a transpose twin is allowed; FUNDAMENTAL *if* the substrate is strictly one-way.** The simulator already *assumes* transpose access everywhere (F2; `asym_probe.py` even solves `Jᵀλ=-ℓ` by GMRES, `:290-323,443-454`). So the code proves the *conditional*: given `Jᵀ`, non-conservative equilibrium gradients are exact (F5: `cos(g_transpose,g_BPTT)=1.0`). It says **nothing** about whether a one-way damped analog attention circuit can produce `Jᵀ` for free — and a reciprocal `J=Jᵀ` circuit (the EBM case) is precisely what independent Q,K,V,O (`:39-40`) breaks.
+
+Hamiltonian-echo / time-reversal does **not** drop in here: the simulated dynamics are explicitly **dissipative** (`-(1+c)z`, `:85`). Reversing them needs active gain or a built transpose-twin network, not passive settling.
+
+**(b) Physical-realizability:** Same substrate, forward only → **No**. Substrate with a programmed transpose twin / reciprocal-by-construction coupling / reversible read path → **plausibly Yes**, at the cost of a second adjoint relaxation + calibrating the transposed operators. Current code = the digital stand-in for that.
+
+**(c) Cheapest falsifier:** `asym_probe.py` on a converged ckpt prints `cos(g_transpose,g_BPTT)`. "Transpose hardware is sufficient" is falsified if `g_transpose ≠ g_BPTT` even at tight residual — then the missing piece is finite-time, not the transpose. (Logs show it is **NOT** falsified: cos=1.0 at res 1e-9, `diag_probe.log:13`.)
+
+---
+
+## Q4 — Is "equilibrium" the right primitive? Finite-time vs fixed-point.
+
+**(a) Verdict — ENGINEERING gap, and the code partly contradicts the brief's "finite-time is the stabilizer" thesis.** Point mismatch is real (F6). The brief/diagnosis claim BPTT's edge is the finite-horizon transient term that settled-state EP structurally lacks (`EP_BELOW210_DIAGNOSIS_FIX.md:12-20`). **But** F5 shows that at T1=150 with residual→0, `g_transpose` (a pure *fixed-point* adjoint, no unrolling) equals `g_BPTT` **exactly** (cos 1.0). If a finite-horizon term were the missing physics, the fixed-point adjoint could not match BPTT. So at convergence the two coincide and the residual gap is the estimator (F5), not a finite-time term. The diagnosis reaches the same retraction (`:186-189`). The resreg "finite-time defense" (`lt_ep_train.py:220-231`) was measured ~orthogonal to the true missing gradient (`EP_BELOW210_DIAGNOSIS_FIX.md:119-124`) — consistent: it was patching a non-problem.
+
+The brief is right in the *divergent* regime: once contraction is lost, z150 ≠ z* and BPTT-of-the-unroll implicitly rewards contraction while settled-state EP does not. So finite-time matters as a **stability crutch off-equilibrium**, not as a missing **gradient component at equilibrium**.
+
+**(b) Physical-realizability:** Pure fixed-point EP → physically clean but mis-targeted off-equilibrium. Exact finite-T BPTT → **not** realizable (no backward-through-time). Forward eligibility/RTRL physical accumulation → possible in principle, **not** in this code, and O(params²) state — poor scaling.
+
+**(c) Cheapest falsifier:** One ckpt, one batch: compare `g_EP(t1max=T1)`, `g_EP(t1max=2000)`, `g_BPTT(T1)`, `g_transpose`. "Finite-time term is the gap" is falsified if `g_transpose=g_BPTT`, which is what the existing logs show at convergence (`diag_probe.log:13`), so the hypothesis is already essentially falsified there.
+
+---
+
+## Q5 — Non-normality as physics: must it hurt, and can it be bounded for free?
+
+**(a) Verdict — ENGINEERING gap; the code uses proxies, not the right invariant.** The physically natural regularizer is contraction-in-a-metric (log-norm `μ_P(J)<0`) / passivity, which a passive RC network enforces automatically. The code's controls are all proxies (F7): damping `c` (a genuine negative-diagonal contribution, `:85`), `qknorm` (logit bound), `jacreg` (‖J_nc‖_F², symmetric-blind), `specnorm` (weight σ, not state-Jacobian). None measures or bounds `μ_P(J)` or `‖A‖`. So bounded non-normality is **sufficient in principle** (and probably plenty for LM-scale computation), but this block is an *unconstrained active non-reciprocal* module + damping, not a passivity-by-construction element.
+
+**(b) Physical-realizability:** Passive/contractive design → **Yes**, but restricts the operator class. `jacreg`/`specnorm` → digital. Unconstrained active attention → realizable only with active elements that can ring/overshoot (the transient growth) and thus need explicit contraction safeguards.
+
+**(c) Cheapest falsifier:** Add a one-batch log-norm monitor `max_v ⟨v,Sv⟩/‖v‖²` reusing the existing `Operators.sv` (`asym_probe.py:247-250`); compare unconstrained vs `--specnorm` vs high-`c` at matched CE. "Passivity prevents the failure" is falsified if a run with consistently `μ_P(J)<0` and low residual **still** shows EP gradient collapse.
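+
+For reference, the log-norm in question, written out as a sketch. The metric `P` (symmetric positive definite), the norm `‖x‖_P = √(xᵀPx)`, and the linearized flow `δz' = J δz` with constant `J` are the standard textbook setup, assumed here rather than taken from the code.
+
+```latex
+\begin{aligned}
+\mu_P(J) &= \lambda_{\max}\!\Bigl(
+   \tfrac{1}{2}\bigl(P^{1/2} J P^{-1/2} + P^{-1/2} J^{\top} P^{1/2}\bigr)\Bigr), \\
+\mu_I(J) &= \lambda_{\max}(S)
+   = \max_{v \neq 0} \frac{\langle v, Sv\rangle}{\lVert v\rVert^{2}}, \\
+\lVert \delta z(t)\rVert_P &\le e^{\mu_P(J)\,t}\,\lVert \delta z(0)\rVert_P .
+\end{aligned}
+```
+
+With `P = I` this reduces to the `max_v ⟨v,Sv⟩/‖v‖²` monitor proposed above, so `μ_I(J) < 0` certifies contraction only in the Euclidean metric; a run could still be contractive in some other `P` that the monitor does not see.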
+
+---
+
+## Q6 — The conservative-attention escape: fundamental or engineering cost?
+
+**(a) Verdict — ENGINEERING cost; no fundamental cost is demonstrated by the code.** An `energy` mode exists: a scalar `E` plus LSE attention-energy, differentiated to give a conservative force (`lt_ep_train.py:70-76,113-117`). But it is **not** an apples-to-apples symmetric version of the thick block: thick uses independent Q,K,V,O + LN + untied 4× FFN (`:39-46,81-85`), whereas `attn_energy` is a **tied** LSE over q·k with **no V/O path** (`:70-76`) — a strictly weaker, value-less form. So any measured 0.15–0.2 CE tax reflects *this particular* energy parameterization, not an information-theoretic bound on conservative attention. A partially-conservative sweet spot (small controlled `A` + exact AEP/transpose) is plausible and untested in code.
+
+**(b) Physical-realizability:** Fully conservative energy attention → **Yes**, cleanest analog-equilibrium route (EP exact, `J=Jᵀ`, no transpose phase). Partially conservative w/ bounded `A` → Yes if `A` small or transpose twin available. Free directed attention w/o safeguards → **No** for exact local EP.
+
+**(c) Cheapest falsifier:** Active-parameter-matched Pareto sweep — `energy` vs `thick` vs a tied/penalized-`A` interpolant — logging CE, residual, cos, asymmetry. "Conservative cost is only engineering" is falsified if **every** near-conservative variant at matched *active* params (excluding dead Wm/P/Q, F8) stays stuck at a large CE gap while free attention succeeds.
+
+---
+
+## Q7 — The make-or-break: readable ~10–50M LM inside the analog-EP regime?
+
+**(a) Verdict — NOT proven by the code, NOT ruled out.** The hardware-realizable regime is conservative/contractive + tight convergence + local update + **no digital VJP/BPTT**. The current strongest path (thick AEP) **violates the no-transpose constraint** (F2), so it is not itself a proof of the thesis. The honest physical candidate is `energy`/`mono`/contractive attention — which in this code is the *weaker, value-less* form (Q6). And the current model is small and confounded: with C=512, T=256, vocab=4096, most params are embeddings/readout, and Wm/P/Q are dead in thick mode (F8). So no clean active-parameter-matched analog-regime LM result exists in these files. Empirically, BPTT reaches val-CE **1.83** (`runs/bptt_clean.log:162`) while the converged-from-scratch EP/AEP recipe stalls/diverges around **2.0–2.9** depending on hr (`runs/ep_hr02.log:16-22` aborts at 2.87 with hr=0.02; the diagnosis attributes this to mis-set hr, `EP_BELOW210_DIAGNOSIS_FIX.md:196-205`). That gap is currently an **estimator/hr** gap (F5), not a proven expressivity ceiling.
+
+**(b) Physical-realizability:** Strict conservative/contractive local EP → plausible, unproven for readable 10–50M. Current thick AEP → not a valid proof (digital transpose + dead params). Hybrid (conservative/contractive analog core + thin transpose-capable or digital non-conservative correction) → most realistic near-term, but breaks the strict "no backward at all" thesis.
+
+**(c) Cheapest falsifier:** Train the smallest *clean* hardware-faithful model — active-param-matched `energy`/`mono`, **AEP/VJP off**, **no BPTT**, same tokenizer/data/eval — and sample text at fixed CE milestones. Thesis falsified if such a model can't cross a pre-declared readability/CE bar that same-budget free directed attention clears. Must report **active** param count (exclude dead Wm/P/Q, F8).
+
+---
+
+## Bottom line
+
+1. The brief's headline no-go — "AEP dies because ρ(S⁻¹A)>1" — **does not match the code**. The unclipped holomorphic/tracking AEP correction implements an exact `J→Jᵀ` identity (F3), and the diagnosis already retracts the warranty claim (`EP_BELOW210_DIAGNOSIS_FIX.md:174-184`).
+2. The "1.4" asymmetry number is a proxy `|2Av|/|Jv|`, not `‖S⁻¹A‖`; the real measured `S⁻¹A` is ~10–13 with ρ unresolved (1.6–4.6) and only at non-converged points (F4).
+3. The **true fundamental** obstruction is narrow and physical: a strictly forward-only, transpose-free substrate cannot realize `J⁻ᵀ` for a non-conservative equilibrium (Q1/Q3). This simulator sidesteps it with digital VJPs, so it is not a forward-only proof of concept.
+4. The **dominant practical failure** in this code is an **estimator-conditioning** problem (β/hr too small ⇒ 1/β noise), demonstrated at residual 1e-9 where `g_transpose=g_BPTT` exactly but `cos(g_EP,g_transpose)`=0.5–0.7 and rises with β (F5). Finite-time and asymmetry are secondary/off-equilibrium effects.
+5. The hardware-clean route is conservative/contractive attention or a real physical transpose twin — `jacreg`/`specnorm` are proxies (F7), the `energy` mode is value-less (Q6), and readable small LMs in the strict regime remain **plausible but unproven** in these files (Q7).