6 files changed, 438 insertions, 0 deletions
diff --git a/docs/hardware/COLLABORATOR_BRIEF.md b/docs/hardware/COLLABORATOR_BRIEF.md
new file mode 100644
index 0000000..3728657
--- /dev/null
+++ b/docs/hardware/COLLABORATOR_BRIEF.md
@@ -0,0 +1,46 @@
+# Backprop-free analog training of a transformer — collaboration brief
+**One-page ask for hardware-side collaborators · 2026-06-21 · Yuren Hao (UIUC)**
+
+## The idea in three sentences
+We train a **transformer block as a physical equilibrium (fixed-point) system** using **Equilibrium Propagation
+(EP)** — no backpropagation. The forward pass is a damped relaxation `z ← z + ε·F(z)` that **settles** to a fixed
+point (on analog hardware, the settling *is* the physics — nearly free); the weight update is **local**, computed
+from the contrast between a free settle and a slightly-nudged settle. This is exactly the computation an analog
+in-memory / memristive array is good at — and unlike every shipping analog-AI chip (all inference-only), it needs
+**in-situ weight update**, which is the open opportunity.
+
+## Why now / why it's real (not speculative)
+- **Algorithm side (ours, in simulation):** EP's gradient matches true backprop (cosine ≈ 0.99–1.0 per component);
+  the equilibrium transformer trains stably and **matches/beats a same-parameter BP transformer** on language modeling.
+  We are currently scaling the recipe; a fix for the one known instability (a residual-defense term) is under validation.
+- **Hardware precedent exists:** local contrastive/EP learning has been physically demonstrated (self-learning analog
+  resistor networks, ~1 µs settling, on-chip weight update from a local free-vs-clamped difference; EP on a D-Wave
+  Ising machine). **But nobody has built an EP-trained *transformer* in analog hardware — that is the first-mover demo.**
+- **Endurance clears the bar:** HfOx-class RRAM survives ~10^10 write cycles; a training run needs ≤10^8 device writes
+  (fewer with digital-accumulate-then-threshold-program). Endurance is not the blocker — update linearity/symmetry is
+  the real device challenge.
+
+## What a hardware demo needs (three layers) — and the UIUC ECE fit
+| Layer | What it does | Closest collaborator |
+|---|---|---|
+| **Trainable device** | in-situ-updatable analog weights (RRAM/FeFET/ECRAM) — *the part you cannot buy* | **Wenjuan Zhu** (UIUC ECE, memristor/RRAM/FeFET/2D devices) |
+| **In-memory MVM circuit** | analog matrix-vector multiply + on-chip weight write-back | **Naresh Shanbhag** (UIUC ECE) — his JSSC-2018 DIMA chip *already* does analog MVM **+ on-chip SGD weight write-back** in 65nm; nearest existing substrate |
+| **Mixed-signal glue / control loop** | ADC/DAC to read settled states + apply the nudge; switched-cap integrators = relaxation primitives | **Pavan Hanumolu** (UIUC ECE, data converters / PLL / switched-cap) |
+| **EP control + sim** | the settle→nudge→settle→local-Δθ loop, noise/endurance de-risk in simulation | **us** (FPGA + the trained model + analog-noise sim already built) |
+
+**Escalation / device frontier:** **H.-S. Philip Wong (黄汉森, Stanford EE / TSMC Chief Scientist)** — NeuRRAM (Nature
+2022) is the most EP-relevant analog-MVM substrate (inference-only today); the RRAM-device heavyweight + a TSMC-foundry
+path, reachable via a Stanford student contact.
+
+## The concrete ask (staged, modular — stitch existing capabilities, no startup-scale custom fab)
+- **Phase 1:** put ONE equilibrium-transformer block on an existing in-situ-trainable substrate (Shanbhag's DIMA-class
+  chip + Hanumolu converter/integrator glue; Zhu devices) + our FPGA EP-control loop → prove end-to-end analog EP training.
+- **Phase 2:** scale weights (foundry RRAM MPW — e.g. SkyWater S130 + Weebit ReRAM IP — or a fixed-weight inference array
+  for the forward path with the trainable layer in-situ).
+- **What we bring:** the validated algorithm, the trained model + scaling data, the EP control logic, and a simulator
+  that already models analog non-idealities (device noise / quantization / asymmetric update) to de-risk before tape-out.
+
+**Bottom line:** the science is done in sim and the hardware pieces all exist in-house at UIUC ECE — this is a
+stitching + first-demo opportunity, not a multi-year custom-silicon program.
+
+*(Backing detail + citations: HW_RESEARCH_FINDINGS.md; method: ept_method_intro.pdf)*
diff --git a/docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md b/docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md
new file mode 100644
index 0000000..55b511c
--- /dev/null
+++ b/docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md
@@ -0,0 +1,125 @@
+# Physics answer for Q1–Q7, grounded in the actual simulator code
+
+*Independent verification pass. Read-only: I did **not** run any training, probe, or GPU job. Every line citation below was re-checked against the files on disk; empirical numbers are quoted from probe **logs already on disk** (`ep_run/runs/*.log`), not from new runs.*
+
+---
+
+## 0. The decisive facts the code actually establishes (read these first)
+
+These control all seven answers. Two of them overturn parts of the brief.
+
+**F1 — The active "thick" force is a damped residual DEQ-transformer block.**
+`tforce(z,xin) = -(z-xin) + attn(LN z) + FFN(LN z) - c·z`, i.e. `x_in - (1+c)z + Attn + FFN`. FFN uses **tanh-approx GELU**. (`lt_ep_train.py:81-85`; same branch in `force(...,'thick')` at `lt_ep_train.py:102-106`.) Q,K,V,O are **independent** matrices (`lt_ep_train.py:39-40`), so attention is genuinely non-conservative; `--qknorm` optionally RMSNorms q/k (`lt_ep_train.py:63-65`).
+
+**F2 — This code's EP is NOT a forward-only, settled-state-only rule. It already uses the transpose.**
+The AEP correction explicitly forms **both** `Jv` (jvp) and **`Jᵀv` (vjp)** of the non-conservative force and injects `-(Jv-Jᵀv)` into the nudged dynamics (`lt_ep_train.py:172-178`). The production holomorphic paths do the same with `torch.func.jvp`/`torch.func.vjp` (`holo_ep.py:181-185` plain N=2; `holo_ep.py:236-239` tracking). The block-param gradient is then `grad((a·f).sum(), block)` (`lt_ep_train.py:202-207`) and the readout head gets its own local grad `dCE/dWh` (`lt_ep_train.py:208-210`). **Implication:** on a GPU the transpose is a free autograd call, but on a one-way analog relaxation substrate `Jᵀv` is exactly the capability you do not have. So the simulator is *not itself* a demonstration of a physically forward-only rule — a critical caveat the brief glosses.
+
+**F3 — The implemented AEP is an algebraic `J→Jᵀ` identity, not a first-order `S⁻¹A` series, when the correction is not clipped.**
+Subtracting `(J-Jᵀ)v` turns the nudged differential operator `J` into `J-(J-Jᵀ)=Jᵀ` exactly, for any asymmetry (`lt_ep_train.py:176`, `holo_ep.py:239`). Caveat: the **plain real** nudge path clips the correction if `‖corr‖>‖f‖` (`lt_ep_train.py:177-178`), so that path is not algebraically exact under clipping. The holomorphic/production paths shown in `holo_ep.py` do not apply that norm clip at the same point (`holo_ep.py:181-185,236-239`). The "ρ(S⁻¹A)<1 / Neumann series" warranty applies to the **uncorrected** vector-field EP baseline, **not** to this correction — and the diagnosis file itself retracts the earlier mis-attribution (`EP_BELOW210_DIAGNOSIS_FIX.md:174-184`). **The brief's framing in its §1/§2 (AEP "out of warranty" at ‖S⁻¹A‖≈1.4) does not match the code.**
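+
+A short worked identity behind F3 and F4, offered as a sketch. It assumes the standard split `J = S + A` with `S = (J+Jᵀ)/2` and `A = (J-Jᵀ)/2`, which is how `S` and `A` are read throughout this note; the code itself only forms `Jv` and `Jᵀv`.
+
+```latex
+% Assumed decomposition (not a quote from lt_ep_train.py):
+%   S = (J + J^T)/2 is the symmetric part, A = (J - J^T)/2 the antisymmetric part.
+\begin{aligned}
+J &= S + A, \qquad J^{\top} = S - A, \\
+Jv - J^{\top}v &= 2Av, \\
+Jv - \bigl(Jv - J^{\top}v\bigr) &= (S + A)v - 2Av = (S - A)v = J^{\top}v .
+\end{aligned}
+% No smallness condition on A is used, so the swap J -> J^T holds at any asymmetry.
+% The proxy |Jv - J^T v| / |Jv| equals |2Av| / |Jv| and contains no S^{-1}.
+```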
+
+**F4 — The "‖S⁻¹A‖≈1.4" number in the brief is a different quantity than ρ(S⁻¹A).**
+The brief line 11, and the older diagnosis at `EP_BELOW210_DIAGNOSIS_FIX.md:146`, equate `‖A‖/‖S‖`-scale ≈1.4 with `|Jv−Jᵀv|/|Jv|≈1.4`. That ratio is `|2Av|/|Jv|`, **not** `ρ(S⁻¹A)` or `‖S⁻¹A‖`. When the code actually measures `S⁻¹A` (`asym_probe.py:325-332,364-432`), it gets `‖S⁻¹A‖₂≈10–13` and `ρ` between 1.6 (Arnoldi) and 4.6 (power) — and even those are at **non-converged** points with MINRES solve residuals ~0.15 (`runs/oracle_probe.log:110-114`; `runs/asym_probe_full.log:145-148`). So the "1.4" is a loose proxy, the real `S⁻¹A` operator is large and **ill-measured**, and no single trustworthy ρ value exists in the logs. Treat all ρ-based claims as soft.
+
+**F5 — The cleanest converged measurement says the adjoint is EXACT and the gap is the ESTIMATOR.**
+At a tightly converged checkpoint (`runs/diag_probe.log`, step-200 ckpt, free-phase residual **1.6e-9**): on the block-parameter scope used by the diagnostic (readout `Wh` excluded; `diag_probe.log:6`), `cos(g_transpose, g_BPTT) = 1.000000` on all 6 batches, while `cos(g_EP, g_transpose)` is only **0.51–0.72** (`diag_probe.log:13-44`). The β/hr sweep at that *same* converged state shows cos **rising with β**: 0.002→0.15, 0.005→0.38, 0.01→0.56, 0.02→0.72, 0.04→0.86 (`diag_probe.log:46-50`). **Reading:** (i) the exact transpose adjoint *is* BPTT (so it is not a finite-time-vs-fixed-point gap, at least at T1=150 with res→0); (ii) the EP shortfall is a 1/β noise-amplification / nudge-conditioning problem, consistent with the diagnosis' "hr=0.02 is 10× too small" note (`EP_BELOW210_DIAGNOSIS_FIX.md:196-205`). This is the single most important empirical anchor and the prior draft under-weighted it.
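+
+One way to read the β trend, as a hedged sketch and not a result taken from the logs: model each settled state as carrying a convergence error of size `δ`, and treat the EP estimate as a one-sided finite difference in β. The symbols `Φ`, `δ`, `C` and `L` are hypothetical and introduced only for this illustration; the actual estimator in `lt_ep_train.py` has a different concrete form.
+
+```latex
+% Hypothetical model: \Phi is the local quantity contrasted between phases,
+% z_\beta the exact nudged fixed point, \hat z_\beta the state actually reached.
+\hat g_{\mathrm{EP}}(\beta)
+  = \frac{1}{\beta}\Bigl[\partial_\theta \Phi(\hat z_\beta) - \partial_\theta \Phi(\hat z_0)\Bigr],
+\qquad \|\hat z_\beta - z_\beta\| \le \delta, \quad \|\hat z_0 - z_0\| \le \delta .
+
+% With g the \beta \to 0 limit and L a Lipschitz constant of \partial_\theta\Phi in z:
+\bigl\|\hat g_{\mathrm{EP}}(\beta) - g\bigr\|
+  \;\le\; \underbrace{C\,\beta}_{\text{truncation bias}}
+  \;+\; \underbrace{\frac{2L\,\delta}{\beta}}_{\text{amplified settle error}} .
+```
+
+Under this model, a cosine that keeps rising as β grows suggests the `δ/β` term still dominates over the range swept, which is the regime the "hr too small" reading describes.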
+
+**F6 — EP and BPTT can optimize different states (point mismatch).**
+EP relaxes to T1, records the T1 residual, then *refines* up to `t1max` before computing the gradient (`lt_ep_train.py:143-152`); but `evaluate` (`:279-287`) and `bptt_step` (`:260-265`) both use the T1 state. With `t1max>T1` the EP grad is taken at a *different, tighter* point than the one being evaluated. This is real, but F5 shows it is **not** the dominant failure once you actually converge T1.
+
+**F7 — The "stabilizers" are proxies, not the physical quantities the brief asks about.**
+`jacreg` penalizes `‖J_nc v‖²` for random `v` (Hutchinson `‖J_nc‖_F²`), **not** `‖A‖`, ρ(S⁻¹A), or a log-norm μ_P(J) (`lt_ep_train.py:211-216`). `specnorm` projects **weight** singular values post-step (`lt_ep_train.py:290-332`, applied at `:561-562`), not the state Jacobian. The no-jacreg path instead feedback-controls the damping `blk.c` on the residual (`lt_ep_train.py:530-534`). None is a contraction-by-construction guarantee.
+
+**F8 — Stale-assumption / dead-parameter caveats.**
+The diagnosis lists old bugs (holo `cforce` missing q/k norm; erf-vs-tanh GELU mismatch: `EP_BELOW210_DIAGNOSIS_FIX.md:83-86`). In the current code thick `tforce` uses tanh GELU (`lt_ep_train.py:84`) and the nudge forces include q/k RMSNorm (`holo_ep.py:44-46,142-144`), so those are largely closed. `asym_probe.py` still exposes `--gelu` with default `erf` and stores it on `blk` (`asym_probe.py:35,119`), but `EQBlock.force/tforce` in this file never read `blk.gelu`; they hard-code tanh-approx GELU (`lt_ep_train.py:84,105`). So the current mismatch is a stale/no-op probe flag, **not** an erf-vs-tanh force mismatch in the active code. Also: in `thick` mode `Wm`, `P`, `Q` are allocated and listed in `blk.block` (`lt_ep_train.py:41-50`) but never used by the thick force (`lt_ep_train.py:81-85`) — so any "parameter count" / energy-vs-free comparison must exclude them.
+
+---
+
+## Q1 — Is the obstruction FUNDAMENTAL or an estimator artifact?
+
+**(a) Verdict — split, and the code resolves it decisively for the converged regime.**
+- *Plain forward-only settled-state EP* on a non-conservative equilibrium: **FUNDAMENTAL** bias. The implicit-function gradient needs `λ` solving `Jᵀλ = -∂L/∂z`; a one-way nudge relaxes under `J`, giving `J⁻¹`, and `J⁻¹≠J⁻ᵀ` when `J≠Jᵀ`. No reading of settled states of the *forward* dynamics supplies the missing `Jᵀ` action.
+- *AEP as actually used in the unclipped holomorphic/tracking paths* (F2/F3): the `J→Jᵀ` swap is **exact at any asymmetry** in the β→0, converged limit; the plain real path can lose exactness when its correction clip triggers. So there is **no fundamental asymmetry no-go for this estimator** — the brief's central "ρ(S⁻¹A)<1 wall" is **not** what the code suffers from.
+- *What actually remains* (F5): an **estimator artifact** — at hr/β=0.02 the EP contrast is only 0.5–0.7 aligned with the (exact) adjoint even at residual 1e-9, and it *improves monotonically with β*. The gap is 1/β noise amplification of the nudge-convergence error, not the asymmetry.
+
+The brief's own sub-question — "is `J⁻¹−J⁻ᵀ` fundamentally inaccessible to a forward relaxation?" — is answered: **yes for a strictly forward-only substrate; no for this simulator, because it pays for `Jᵀv` with a digital VJP.**
+
+**(b) Physical-realizability:** Forward-only, no transpose → **No** (can't get exact non-conservative gradients). Forward relaxation **+ a physical transpose/adjoint phase** → **Yes in principle**, but that is a *second* physical primitive (see Q3), not vanilla EP. The current code is local in the weight update but **not** transpose-free.
+
+**(c) Cheapest falsifier:** On one converged ckpt + one batch, run an `asym_probe.py --diag`-style three-way comparison with the AEP correction **disabled** (set `corr` to zero in the nudge, `lt_ep_train.py:176`). The mechanism "the transpose is what AEP buys" is falsified if **AEP-off EP still matches `g_transpose`** at high asymmetry. (All machinery exists: `asym_probe.py:804-854`; no training needed.)
+
+---
+
+## Q2 — Decouple non-normality from lost contraction. Is the GOOD solution intrinsically high ρ(S⁻¹A)?
+
+**(a) Verdict — ENGINEERING gap, and the brief's premise is unproven by the code.**
+The damping enters as `-(1+c)z` (F1), so at a healthy fixed point `S≈-(1+c)I` and `S⁻¹A≈-A/(1+c)` — raising `c` shrinks `S⁻¹A` *mechanically*. The code lets you test this (`--c`, `lt_ep_train.py:348,397`). But "expressivity *requires* ‖S⁻¹A‖≈1.4" is **not** established: (i) the 1.4 is a proxy, not `S⁻¹A` (F4); (ii) the real `S⁻¹A` measurements (10–13, ρ 1.6–4.6) were taken at **non-converged** or **pre-blowup** ckpts with garbage solve residuals (F4), exactly where contraction is already lost — so they confound "intrinsic asymmetry" with "S near-singular," which is the brief's own worry (brief line 15). The diagnosis concurs the ρ verdict "answered the wrong question" (`EP_BELOW210_DIAGNOSIS_FIX.md:184`).
+
+Caveats from code: with `jacreg>0` (the documented recipe) the residual controller moves `jr`, **not** `c` (`:520-529`), so `c` is effectively fixed mid-run; only the no-jacreg branch adapts `c` (`:530-534`). `qknorm` bounds logits but does not guarantee contraction (`:63-68`). The `mono` branch's "sym(J)<0 guaranteed" comment (`:107-111`) is **false for the full operator** because it still adds raw `self.attn(zr)` (`:111`), whose antisymmetric part is unconstrained.
+
+**(b) Physical-realizability:** `c`, q/k-norm, weight scaling → forward-only, physical, **Yes**. `jacreg`/`specnorm` as written are autograd/SVD digital ops → **not** native local analog rules. A passive metric / per-channel `c` that holds ‖S⁻¹A‖<1 *uniformly without a penalty* is **not implemented**.
+
+**(c) Cheapest falsifier:** A `--c` sweep (e.g. 1,2,3,4) from identical init/recipe, logging val-CE, T1 residual, and one-batch `cos(g_EP,g_transpose)` + measured ρ(S⁻¹A) at matched CE. Mechanism "lost contraction, not intrinsic asymmetry" is falsified if large `c` keeps residual/ρ healthy but EP **still** collapses at the same CE wall. (`runs/ep_c3_scratch.*` already exists from such a run.)
+
+---
+
+## Q3 — Can the physical hardware compute its own adjoint (transpose dynamics)?
+
+**(a) Verdict — ENGINEERING gap *if* a transpose twin is allowed; FUNDAMENTAL *if* the substrate is strictly one-way.** The simulator already *assumes* transpose access everywhere (F2; `asym_probe.py` even solves `Jᵀλ=-ℓ` by GMRES, `:290-323,443-454`). So the code proves the *conditional*: given `Jᵀ`, non-conservative equilibrium gradients are exact (F5: `cos(g_transpose,g_BPTT)=1.0`). It says **nothing** about whether a one-way damped analog attention circuit can produce `Jᵀ` for free — and a reciprocal `J=Jᵀ` circuit (the EBM case) is precisely what independent Q,K,V,O (`:39-40`) breaks.
+
+Hamiltonian-echo / time-reversal does **not** drop in here: the simulated dynamics are explicitly **dissipative** (`-(1+c)z`, `:85`). Reversing them needs active gain or a built transpose-twin network, not passive settling.
+
+**(b) Physical-realizability:** Same substrate, forward only → **No**. Substrate with a programmed transpose twin / reciprocal-by-construction coupling / reversible read path → **plausibly Yes**, at the cost of a second adjoint relaxation + calibrating the transposed operators. Current code = the digital stand-in for that.
+
+**(c) Cheapest falsifier:** `asym_probe.py` on a converged ckpt prints `cos(g_transpose,g_BPTT)`. "Transpose hardware is sufficient" is falsified if `g_transpose ≠ g_BPTT` even at tight residual — then the missing piece is finite-time, not the transpose. (Logs show it is **NOT** falsified: cos=1.0 at res 1e-9, `diag_probe.log:13`.)
+
+---
+
+## Q4 — Is "equilibrium" the right primitive? Finite-time vs fixed-point.
+
+**(a) Verdict — ENGINEERING gap, and the code partly contradicts the brief's "finite-time is the stabilizer" thesis.** Point mismatch is real (F6). The brief/diagnosis claim BPTT's edge is the finite-horizon transient term that settled-state EP structurally lacks (`EP_BELOW210_DIAGNOSIS_FIX.md:12-20`). **But** F5 shows that at T1=150 with residual→0, `g_transpose` (a pure *fixed-point* adjoint, no unrolling) equals `g_BPTT` **exactly** (cos 1.0). If a finite-horizon term were the missing physics, the fixed-point adjoint could not match BPTT. So at convergence the two coincide and the residual gap is the estimator (F5), not a finite-time term. The diagnosis reaches the same retraction (`:186-189`). The resreg "finite-time defense" (`lt_ep_train.py:220-231`) was measured ~orthogonal to the true missing gradient (`EP_BELOW210_DIAGNOSIS_FIX.md:119-124`) — consistent: it was patching a non-problem.
+
+The brief is right in the *divergent* regime: once contraction is lost, z150 ≠ z* and BPTT-of-the-unroll implicitly rewards contraction while settled-state EP does not. So finite-time matters as a **stability crutch off-equilibrium**, not as a missing **gradient component at equilibrium**.
+
+**(b) Physical-realizability:** Pure fixed-point EP → physically clean but mis-targeted off-equilibrium. Exact finite-T BPTT → **not** realizable (no backward-through-time). Forward eligibility/RTRL physical accumulation → possible in principle, **not** in this code, and its sensitivity state `∂z/∂θ` grows as O(state dim × params), so it scales poorly.
+
+**(c) Cheapest falsifier:** One ckpt, one batch: compare `g_EP(t1max=T1)`, `g_EP(t1max=2000)`, `g_BPTT(T1)`, `g_transpose`. "Finite-time term is the gap" is falsified if `g_transpose=g_BPTT`, which holds at convergence (`diag_probe.log:13`), so the existing logs already essentially falsify it there.
+
+---
+
+## Q5 — Non-normality as physics: must it hurt, and can it be bounded for free?
+
+**(a) Verdict — ENGINEERING gap; the code uses proxies, not the right invariant.** The physically natural regularizer is contraction-in-a-metric (log-norm `μ_P(J)<0`) / passivity, which a passive RC network enforces automatically. The code's controls are all proxies (F7): damping `c` (a genuine negative-diagonal contribution, `:85`), `qknorm` (logit bound), `jacreg` (‖J_nc‖_F², symmetric-blind), `specnorm` (weight σ, not state-Jacobian). None measures or bounds `μ_P(J)` or `‖A‖`. So bounded non-normality is **sufficient in principle** (and probably plenty for LM-scale computation), but this block is an *unconstrained active non-reciprocal* module + damping, not a passivity-by-construction element.
+
+**(b) Physical-realizability:** Passive/contractive design → **Yes**, but restricts the operator class. `jacreg`/`specnorm` → digital. Unconstrained active attention → realizable only with active elements that can ring/overshoot (the transient growth) and thus need explicit contraction safeguards.
+
+**(c) Cheapest falsifier:** Add a one-batch log-norm monitor `max_v ⟨v,Sv⟩/‖v‖²` reusing the existing `Operators.sv` (`asym_probe.py:247-250`); compare unconstrained vs `--specnorm` vs high-`c` at matched CE. "Passivity prevents the failure" is falsified if a run with consistently `μ_P(J)<0` and low residual **still** shows EP gradient collapse.
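+
+For reference, the quantity this monitor estimates is the Euclidean logarithmic norm. The sketch below states the standard definition and the transient bound it gives for a frozen `J`. The weighted version assumes `P` is symmetric positive definite, which this note does not specify.
+
+```latex
+% Euclidean log-norm: depends only on the symmetric part S, since <v, A v> = 0.
+\mu_2(J) = \lambda_{\max}\!\left(\tfrac{1}{2}(J + J^{\top})\right)
+         = \max_{v \ne 0} \frac{\langle v, S v\rangle}{\|v\|^{2}},
+\qquad \|e^{Jt}\|_2 \le e^{\mu_2(J)\,t} \quad (t \ge 0).
+
+% Weighted version, assuming P = P^T > 0 and \|v\|_P^2 = v^T P v:
+\mu_P(J) = \lambda_{\max}\!\left(\tfrac{1}{2}\bigl(P^{1/2} J P^{-1/2} + P^{-1/2} J^{\top} P^{1/2}\bigr)\right).
+```
+
+A negative value bounds the transient gain by 1 in the chosen norm, which is why it is a stricter check than a spectral-radius or Frobenius proxy.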
+
+---
+
+## Q6 — The conservative-attention escape: fundamental or engineering cost?
+
+**(a) Verdict — ENGINEERING cost; no fundamental cost is demonstrated by the code.** An `energy` mode exists: a scalar `E` plus LSE attention-energy, differentiated to give a conservative force (`lt_ep_train.py:70-76,113-117`). But it is **not** an apples-to-apples symmetric version of the thick block: thick uses independent Q,K,V,O + LN + untied 4× FFN (`:39-46,81-85`), whereas `attn_energy` is a **tied** LSE over q·k with **no V/O path** (`:70-76`) — a strictly weaker, value-less form. So any measured 0.15–0.2 CE tax reflects *this particular* energy parameterization, not an information-theoretic bound on conservative attention. A partially-conservative sweet spot (small controlled `A` + exact AEP/transpose) is plausible and untested in code.
+
+**(b) Physical-realizability:** Fully conservative energy attention → **Yes**, cleanest analog-equilibrium route (EP exact, `J=Jᵀ`, no transpose phase). Partially conservative w/ bounded `A` → Yes if `A` small or transpose twin available. Free directed attention w/o safeguards → **No** for exact local EP.
+
+**(c) Cheapest falsifier:** Active-parameter-matched Pareto sweep — `energy` vs `thick` vs a tied/penalized-`A` interpolant — logging CE, residual, cos, asymmetry. "Conservative cost is only engineering" is falsified if **every** near-conservative variant at matched *active* params (excluding dead Wm/P/Q, F8) stays stuck at a large CE gap while free attention succeeds.
+
+---
+
+## Q7 — The make-or-break: readable ~10–50M LM inside the analog-EP regime?
+
+**(a) Verdict — NOT proven by the code, NOT ruled out.** The hardware-realizable regime is conservative/contractive + tight convergence + local update + **no digital VJP/BPTT**. The current strongest path (thick AEP) **violates the no-transpose constraint** (F2), so it is not itself a proof of the thesis. The honest physical candidate is `energy`/`mono`/contractive attention — which in this code is the *weaker, value-less* form (Q6). And the current model is small and confounded: with C=512, T=256, vocab=4096, most params are embeddings/readout, and Wm/P/Q are dead in thick mode (F8). So no clean active-parameter-matched analog-regime LM result exists in these files. Empirically, BPTT reaches val-CE **1.83** (`runs/bptt_clean.log:162`) while the converged-from-scratch EP/AEP recipe stalls/diverges around **2.0–2.9** depending on hr (`runs/ep_hr02.log:16-22` aborts at 2.87 with hr=0.02; the diagnosis attributes this to mis-set hr, `EP_BELOW210_DIAGNOSIS_FIX.md:196-205`). That gap is currently an **estimator/hr** gap (F5), not a proven expressivity ceiling.
+
+**(b) Physical-realizability:** Strict conservative/contractive local EP → plausible, unproven for readable 10–50M. Current thick AEP → not a valid proof (digital transpose + dead params). Hybrid (conservative/contractive analog core + thin transpose-capable or digital non-conservative correction) → most realistic near-term, but breaks the strict "no backward at all" thesis.
+
+**(c) Cheapest falsifier:** Train the smallest *clean* hardware-faithful model — active-param-matched `energy`/`mono`, **AEP/VJP off**, **no BPTT**, same tokenizer/data/eval — and sample text at fixed CE milestones. Thesis falsified if such a model can't cross a pre-declared readability/CE bar that same-budget free directed attention clears. Must report **active** param count (exclude dead Wm/P/Q, F8).
+
+---
+
+## Bottom line
+
+1. The brief's headline no-go — "AEP dies because ρ(S⁻¹A)>1" — **does not match the code**. The unclipped holomorphic/tracking AEP correction implements an exact `J→Jᵀ` identity (F3), and the diagnosis already retracts the warranty claim (`EP_BELOW210_DIAGNOSIS_FIX.md:174-184`).
+2. The "1.4" asymmetry number is a proxy `|2Av|/|Jv|`, not `‖S⁻¹A‖`; the real measured `S⁻¹A` is ~10–13 with ρ unresolved (1.6–4.6) and only at non-converged points (F4).
+3. The **true fundamental** obstruction is narrow and physical: a strictly forward-only, transpose-free substrate cannot realize `J⁻ᵀ` for a non-conservative equilibrium (Q1/Q3). This simulator sidesteps it with digital VJPs, so it is not a forward-only proof of concept.
+4. The **dominant practical failure** in this code is an **estimator-conditioning** problem (β/hr too small ⇒ 1/β noise), demonstrated at residual 1e-9 where `g_transpose=g_BPTT` exactly but `cos(g_EP,g_transpose)`=0.5–0.7 and rises with β (F5). Finite-time and asymmetry are secondary/off-equilibrium effects.
+5. The hardware-clean route is conservative/contractive attention or a real physical transpose twin — `jacreg`/`specnorm` are proxies (F7), the `energy` mode is value-less (Q6), and readable small LMs in the strict regime remain **plausible but unproven** in these files (Q7).
diff --git a/docs/hardware/GPT_PRO_PHYSICS_ANSWER.md b/docs/hardware/GPT_PRO_PHYSICS_ANSWER.md
new file mode 100644
index 0000000..6dfbde2
--- /dev/null
+++ b/docs/hardware/GPT_PRO_PHYSICS_ANSWER.md
@@ -0,0 +1,46 @@
+# GPT-Pro physics consult — actionable synthesis (2026-06-21)
+Full answer pasted by user (in session). This file = the load-bearing extraction + the prescribed experiment.
+
+## The decisive reframe: the no-go is REAL but NARROW
+- **Same-dynamics, settled-state-only EP computes `J⁻¹ℓ`; the true implicit gradient needs `J⁻ᵀℓ`.**
+  True adjoint `λ = −J⁻ᵀℓ` (ℓ=∂L/∂z*); EP's infinitesimal nudge contrast solves `J s = −ℓ` ⇒ `s = −J⁻¹ℓ`.
+  Exactness for all ℓ ⇒ requires `J = Jᵀ` (self-adjoint in the physical metric P: `PJ=JᵀP`). **That is the precise no-go.**
+- **But the BROADER no-go is FALSE.** `ρ(S⁻¹A)<1` is ONLY the convergence condition of the *first-order* (Neumann) AEP
+  expansion — NOT a universal impossibility. Exact non-conservative learning EXISTS if you alter the learning dynamics:
+  - **AsymEP** (Scurria 2026): add a local antisymmetric correction in the nudged phase so the modified nudged Jacobian = `Jᵀ`.
+  - **Dyadic EP** (Scurria 2026): double the state space; the auxiliary variable becomes a physical error signal ∝ cost gradient. EXACT for arbitrary non-conservative stationary systems.
+- ⇒ **The obstruction MOVES from "gradient estimator impossible" to "can the HARDWARE embody the adjoint (transpose couplings / doubled state) without becoming a hidden digital backprop engine?"** Math exists; physical realizability is the open problem.
+- Exact error operator (better than the ρ proxy): `E = J⁻¹Jᵀ − I = −2 J⁻¹A`. `‖S⁻¹A‖` is a diagnostic only when S is well-conditioned.
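+
+The relations in the bullets above can be chained in a few lines. This is a sketch using this file's sign convention (`λ = −J⁻ᵀℓ`, `s = −J⁻¹ℓ`) and assuming the fixed point is written `F(z*, θ) = 0` with `J = ∂F/∂z` invertible.
+
+```latex
+\begin{aligned}
+F(z^*,\theta) = 0 \;&\Rightarrow\; \frac{dz^*}{d\theta} = -J^{-1}\,\frac{\partial F}{\partial\theta}, \\
+\frac{dL}{d\theta} = \Bigl(\frac{dz^*}{d\theta}\Bigr)^{\top} \ell
+  &= -\Bigl(\frac{\partial F}{\partial\theta}\Bigr)^{\top} J^{-\top} \ell
+   = \Bigl(\frac{\partial F}{\partial\theta}\Bigr)^{\top} \lambda,
+   \qquad \lambda = -J^{-\top}\ell, \\
+s = -J^{-1}\ell &= J^{-1}J^{\top}\lambda = (I + E)\,\lambda, \\
+E = J^{-1}J^{\top} - I &= J^{-1}(J^{\top} - J) = -2\,J^{-1}A,
+   \qquad A = \tfrac{1}{2}(J - J^{\top}).
+\end{aligned}
+% E vanishes for every \ell exactly when A = 0, i.e. J = J^T.
+```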
+
+## THE PRESCRIBED PRIORITY EXPERIMENT (GPT-Pro's stated #1: "oracle adjoint physics ablation")
+At the SAME checkpoints, implement THREE gradient estimators + sweep an asymmetry knob ε (scale of the antisymmetric attention component):
+1. **vanilla/our EP** (`g_EP`) — gives `J⁻¹ℓ`-flavored.
+2. **exact transpose relaxation** (`g_transpose`): solve `Jᵀ λ = −ℓ` (GMRES; Jᵀ via vjp), then `g = (∂F/∂θ)ᵀ λ`. = the EXACT implicit gradient.
+3. **Dyadic/AsymEP** with exact local `A` from autograd as a simulator ORACLE.
+Reference: `g_BPTT` (finite-T).
+**DECISION:** if `g_transpose` (and/or dyadic) stays aligned + correctly scaled vs `g_BPTT` even at `ρ(S⁻¹A)>1` (closing our measured 84% gap) ⇒ **the collapse is NOT a mathematical no-go — it's a hardware-realization / convergence problem** ⇒ the research problem becomes "build a local transpose/dyadic substrate." If `g_transpose` ALSO fails ⇒ culprit is finite residual / nonlinearity / saddle instability / loss of contraction, NOT the conservativity assumption.
+- MUST run at a HEALTHY tight fixed point (the BPTT good solution, or a converged specnorm ckpt) — at the divergence edge (no fixed point) the solve is ill-conditioned (we saw it: MINRES resid 0.2).
+
+## Q2 (damping) — our suspicion CONFIRMED as plausible
+`‖S⁻¹A‖ ≤ ‖A_G‖ / ((1+c) − ‖S_G‖)`. Two routes to high ρ: numerator `‖A_G‖` grows, OR contraction margin `(1+c)−‖S_G‖` collapses (S→singular). Weak-contraction good solutions fit the SECOND. Raising `c` buys EP-validity ONLY for *gratuitous* asymmetry; if you rescale attn/FFN by `(1+c)` to preserve the map, A_G & S_G scale too and ρ may not improve. **Damping sweep is high-value.** Decompose ρ into `‖A‖`, log-norm margin `μ_P(J)`, transient gain `sup_t‖e^{Jt}‖`.
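+
+A sketch of where the bound comes from, under assumptions not spelled out in this file: `J = −(1+c)I + G` with `G` the Jacobian of the attention + FFN terms, `S_G` and `A_G` its symmetric and antisymmetric parts, an induced operator norm, and `‖S_G‖ < 1+c`.
+
+```latex
+\begin{aligned}
+S &= -(1+c)I + S_G, \qquad A = A_G, \\
+S^{-1} &= -\frac{1}{1+c}\Bigl(I - \frac{S_G}{1+c}\Bigr)^{-1}, \qquad
+\|S^{-1}\| \le \frac{1}{1+c}\cdot\frac{1}{1 - \|S_G\|/(1+c)} = \frac{1}{(1+c) - \|S_G\|}, \\
+\|S^{-1}A\| &\le \|S^{-1}\|\,\|A_G\| \le \frac{\|A_G\|}{(1+c) - \|S_G\|}.
+\end{aligned}
+% Rescaling check: if G is replaced by (1+c)G, both norms scale by (1+c) and the
+% bound becomes \|A_G\| / (1 - \|S_G\|), which no longer depends on c.
+```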
+
+## Q3 (physical adjoint) — buildable, not free
+- Transpose relaxation `λ̇ = Jᵀλ + ℓ` = recurrent-backprop/adjoint relaxation (NOT BPTT-through-time), physically allowed IF substrate has transpose couplings. Reciprocal crossbar: `Wᵀ` = swap drive/sense ports. Nonreciprocal attention: transpose includes state-dependent transposes of LN/softmax/QK/value/output ⇒ a "transpose twin" = duplicate routing + calibration + derivative circuits + extra phase.
+- **Hamiltonian Echo (HEB/HEL)**: physical adjoint via time-reversal — but does NOT directly extend to DAMPED dissipative dynamics (dissipation destroys time-reversal info). Need bath/ancilla embedding or dyadic doubling. (Pourcel 2026: Lagrangian EP ⊇ HEL, forward-only, passes-independent-of-size, local.)
+- Passive RC/resistive networks are energy-based (Kirchhoff ⇒ energy) — that's WHY local EP works there; they will NOT natively give `Jᵀ` for directed attention.
+- Falsify: simulate transpose mismatch `Jᵀ+Δ` (sparsity/calibration/quantization error) → measure cos/norm-error/stability vs `‖Δ‖/‖J‖` = the HARDWARE TOLERANCE SPEC.
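+
+Two small facts make the transpose-relaxation and tolerance bullets concrete. This is a linearized sketch: `J` is frozen at the fixed point, `Δ` is the transpose mismatch from the last bullet, and `M = J⁻ᵀΔ` is shorthand introduced here.
+
+```latex
+% (1) Fixed point and stability of the adjoint relaxation.
+\dot\lambda = J^{\top}\lambda + \ell \;\Rightarrow\; \lambda^* = -J^{-\top}\ell,
+\qquad \operatorname{spec}(J^{\top}) = \operatorname{spec}(J),
+% so the adjoint phase is linearly stable exactly when the forward fixed point is.
+
+% (2) Effect of an imperfect transpose, with M = J^{-\top}\Delta and \|M\| < 1.
+\lambda_\Delta = -(J^{\top} + \Delta)^{-1}\ell = (I + M)^{-1}\lambda^*,
+\qquad \frac{\|\lambda_\Delta - \lambda^*\|}{\|\lambda^*\|} \le \frac{\|M\|}{1 - \|M\|}.
+```
+
+Since `‖M‖ ≤ ‖J⁻¹‖‖Δ‖`, the tolerable `‖Δ‖/‖J‖` shrinks with the condition number of `J`, so a tolerance spec measured at a well-conditioned checkpoint may not carry over to one near the divergence edge.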
+
+## Q5/Q6 — non-normality & conservative cost
+- Non-normality is NOT purely bad (nnRNN uses it for expressivity w/ stable spectra). Right objective: bound transient gain + adjoint mismatch, not kill A. Physical regularizer = contraction in a metric (`μ_P(J)<−m`), NOT Euclidean spectral radius.
+- Conservative-attention's 0.15–0.2 CE gap is likely **parameterization/efficiency cost, NOT an information-theoretic wall** (conservative systems can encode directed maps via latent/paired compartments). Practical path = partially-conservative `F = F_energy + ε F_skew`, controlled ε per head/layer, keep `ρ<1` if using vanilla EP.
+
+## Q7 — make-or-break (staged, not monolithic)
+Train 4× (10–50M, same corpus): (a) free non-conservative attn + BPTT [oracle]; (b) conservative Energy/Hopfield/CET + exact digital grad [architecture upper bound]; (c) same conservative + EP at strict residual [EP-estimator test]; (d) partially-conservative ε-skew + exact transpose/Dyadic EP. The result names the WALL: architecture (a bad) / EP-estimator (b good, c bad) / physical-adjoint (transpose works, local impl fails) / non-equilibrium (only finite-T BPTT readable).
+
+## Refs surfaced (pursue)
+Dyadic EP + AsymEP = Scurria 2026 (arXiv:2602.03670, the EXACT non-conservative method). Lagrangian EP / Hamiltonian Echo Learning = Pourcel 2026 (arXiv:2506.06248) + RHEL (arXiv:2506.05259, forward-only BPTT proxy). Energy Transformer = Hoover/Krotov (arXiv:2302.07253); Convergent Energy Transformer (EP-trainable, OpenReview Qrfml76eWJ; already in our HW collab docs). Monotone-operator equilibrium nets = Winston-Kolter (arXiv:2006.08591). Resistive-net universal approx = arXiv:2312.15063. nnRNN = arXiv:1905.12080.
+
+## NEXT (our plan)
+1. BUILD the oracle-adjoint probe (extend asym_probe: add `g_transpose` via GMRES solve of `Jᵀλ=−ℓ` + `(∂F/∂θ)ᵀλ`; + synthetic ε-asymmetry sweep on the J operator). [codex]
+2. RUN at the BPTT good solution (watcher pings) + the ε-sweep on a current converged ckpt. DECIDES fundamental-vs-hardware.
+3. Secondary: ρ(S⁻¹A) decomposition (‖A‖ / μ_P margin / transient gain) at BPTT ckpts; damping-c continuation.
diff --git a/docs/hardware/HW_RESEARCH_FINDINGS.md b/docs/hardware/HW_RESEARCH_FINDINGS.md
new file mode 100644
index 0000000..ea5a6b8
--- /dev/null
+++ b/docs/hardware/HW_RESEARCH_FINDINGS.md
@@ -0,0 +1,98 @@
+# Analog-hardware substrate research — findings (2026-06-21)
+
+Deep-research run (108 agents, 25 sources, 118 claims → 22 adversarially-verified 3-0/2-1).
+Raw verified claims + source URLs + quotes: `hw_research_claims.json`. Synthesis below is mine.
+(The run's auto-synthesis step died on a mid-run /login 401; no DATA lost — all 22 verified claims recovered.)
+
+## THE decisive split confirmed: TRAINABLE-but-small vs LARGE-but-fixed
+The single most important filter — does the substrate support **in-situ weight update** (EP needs it) — cleanly partitions the market:
+
+### LARGE but FIXED-WEIGHT (inference-only — fail EP's in-situ filter as-is)
+- **Mythic M1076** (analog flash CIM): **80M weights/chip**, eval boards / M.2 / PCIe cards exist. BUT explicitly **inference-only** — train off-device, program once. [mythic.ai, 3-0]
+- **IBM HERMES** (PCM, 14nm, 64×256×256 = **4.2M weights**, mixed-signal): research chip, **inference-only**, weights programmed once via hardware-aware training. [Nature Electronics 2023, 3-0]
+- **MRAM / PCM crossbars** generally: program-once, fixed during inference; authors state in-situ training "increases energy + degrades device lifespan" → why the whole field avoids it. [Science, NCBI, 3-0]
+- → These give SCALE (the tens-of-M you want) but can't do EP's repeated local updates without re-flashing.
+
+### TRAINABLE in-situ (small, but the EP-correct regime)
+- **Bulk-switching memristor CIM module** (arXiv 2305.14547): experimentally implements **on-chip mixed-precision TRAINING** with in-situ VMM. KEY mechanism: **digital high-precision update accumulation, physically program the memristor only when accumulated Δw exceeds a threshold** — exactly the hybrid scheme that limits write/endurance stress in an EP loop. [3-0] ← **this is the template for our update path.**
+- **In-situ training demonstrated** on memristor crossbars for MLP/CNN/LSTM/RL — local in-array updates during a learning loop are physically real. [arXiv, 3-0]
+- Constraints to design around: limited NVM **endurance**, **asymmetric/nonlinear** weight update, variability, retention, stuck-at-faults. Compensation methods exist (stochastic rounding etc.). [escholarship, 2-1/3-0]
+
+## EP / equilibrium learning ALREADY physically realized (precedent exists!)
+- **PNAS — self-learning analog resistor network** (Coupled Learning, EP-cousin): XOR + nonlinear regression learned **fully in-situ, NO computer, NO backprop**. Weights = transistor gate-voltage on a local 22µF cap, updated by on-edge circuitry from the **local free-vs-clamped difference**. Forward = physical settling, **τ≈1µs**; learning on 18ms timescale. [PNAS, 3-0] ← **proof the whole concept works in COTS-buildable analog.**
+- **EP on D-Wave** (quantum annealer Ising machine): the physical machine does both free + nudge relaxation to steady state (settling is physical). Learning rule is **local** (updates from the two equilibrium states, no backprop). Caveat (1-1): weights live on the classical computer; only couplings loaded per phase → hybrid, not fully in-situ. [Nature, 3-0 on the local-rule claim]
+- → EP/local learning on physical equilibrium hardware is **demonstrated**, not speculative. Our contribution would be doing it for a TRANSFORMER block at scale.
+
+## Softmax/attention in analog (the hard part)
+- Confirmed open challenge: Transformers need frequent Q/K/V updates, which **conflicts with crossbars' weakness at reprogramming** — flagged as an open HW problem. [arXiv, 3-0]
+- (The energy/Hopfield-attention analog-native route verification was among the 3 claims killed by the 401 — needs a re-run. The pragmatic mixed-signal answer — softmax/LN/GELU in FPGA, linear+relaxation in analog — was the framing, not contradicted.)
+
+## BOTTOM LINE for our build (synthesis)
+The market splits exactly as feared: **you cannot buy one module that is both tens-of-M AND in-situ-trainable.** So:
+- **Phase 1 (trainable, small) — DO THIS FIRST.** Stitch a **bulk-switching/memristor CIM eval module** (in-situ, threshold-accumulated update) + an **FPGA** (softmax/LN/GELU + the EP control loop: settle→nudge→settle→local Δθ). Prove ONE equilibrium-transformer block trains end-to-end via EP in analog. The PNAS resistor-network + the memristor-training paper together show every piece is real.
+- **Phase 2 (scale) — LARGE-but-fixed used cleverly.** Use Mythic-80M / HERMES-class for the bulk fixed linear MVM (the relaxation forward), and keep ONLY the trainable/updated weights on the in-situ substrate, OR do mixed-signal "analog-forward, digital-accumulate, periodic-reflash" updates (the threshold-program trick) to tolerate their write limits.
+- **Update path = the crux.** Adopt the verified hybrid: **accumulate Δθ in digital high-precision, physically program the analog weight only when |Δθ|>threshold.** This is what makes EP survive endurance limits.
+- **De-risk in sim first (free):** the code's `--fnoise` already models multiplicative analog noise — sweep device noise / quantization / asymmetric-update in the 1B sim before buying anything.
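+
+The update path above can be stated as a rule per weight. This is a sketch, not the cited paper's exact scheme: the accumulator `a`, threshold `τ`, device step `δ`, and round-to-nearest quantizer `Q` are names introduced here for illustration, and `τ > δ/2` is assumed.
+
+```latex
+\begin{align*}
+a_t' &= a_{t-1} + \Delta\theta_t
+  && \text{(digital, high precision)}\\
+(w_t,\ a_t) &=
+\begin{cases}
+\big(w_{t-1} + Q(a_t'),\ \ a_t' - Q(a_t')\big) & \text{if } |a_t'| \ge \tau
+  \quad \text{(one physical write)}\\
+\big(w_{t-1},\ \ a_t'\big) & \text{otherwise (no write)}
+\end{cases}\\
+Q(a) &= \delta \cdot \operatorname{round}(a/\delta),
+\qquad |a - Q(a)| \le \delta/2,\\
+N_{\text{writes}} &\le \min\!\Big(T,\ \frac{\sum_{t=1}^{T} |\Delta\theta_t|}{\tau - \delta/2}\Big).
+\end{align*}
+```
+
+The bound follows because the residual after a write is at most `δ/2`, so the accumulator must collect at least `τ − δ/2` of update magnitude before the next write. Raising `τ` trades write count against how stale the analog weight is allowed to become.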
+
+## Re-run #2 (2026-06-21, focused) — GAP 1 SOLVED, GAPs 2/3 still thin
+Raw: `hw_research_claims2.json`. 107 agents, clean run (no auth drop).
+
+### GAP 1 — analog attention: ANSWERED. It exists, across substrates, but all inference-only.
+- **Real fabricated silicon**: UCSD **65nm charge-based SRAM-CIM attention** chip (Moradifirouzabadi/Dodla/Kang, arXiv 2409.04940, ESSERC 2024) — first charge-based analog CIM in SRAM for transformers, **measured 14.8 TOPS/W**, 9-T bitcell does Q·Kᵀ via capacitor charge-sharing. [high]
+- **Jülich gain-cell in-memory attention** (Leroux et al., arXiv 2409.19315, Nature Comp Sci 2025): charge-on-capacitor, **~70,000× lower energy / ~100× faster than GPU** (simulated). [high]
+- Memristor: Nature Sci Reports 2024 self-attention accel (128×128, 2-bit); **STAR RRAM softmax engine** (arXiv 2401.17582). Photonic: TFLN-microring softmax PROPOSAL (arXiv 2603.12934, not fabricated). [high/med]
+- **Softmax IS analog-realizable in principle**: a subthreshold source-coupled differential-pair / WTA network computes normalized-exp **"for free" via KCL at the shared tail node** (translinear). [high] — so an energy/LSE-attention analog route is physically grounded.
+- **BUT GAP 1(c) CONFIRMED**: real prototypes **overwhelmingly use the mixed-signal split we proposed** — softmax/LN/normalization in DIGITAL/LUT/FPGA, only the linear maps + dot-products in analog. So our architecture choice is the validated one. [high]
+- **EVERY analog-attention implementation found is INFERENCE-ONLY / fixed-weight** (Jülich uses offline HW-aware init + offline backprop fine-tune before deploy). Reinforces: nobody has done in-situ-trained analog attention → that IS our novel contribution. [high]
+- Noise budget datapoint: a variation-aware memristor-ViT sim tolerates **~35% compute error + ~10% conductance variation** while matching digital Top-1 (MDPI Electronics 2026) — encouraging for the `--fnoise` de-risk. [med]
+- Caveat (Sillman, arXiv 2305.13649): an analog softmax block only pays off INSIDE a fully-analog system; isolating it behind ADC/DAC dwarfs the saving → keep softmax digital UNLESS going fully analog. [med]
+
+### GAP 2 (buy-now SKUs) + GAP 3 (endurance/ECRAM) — STILL OPEN
+The re-run did NOT substantively verify these (its own summary says so). The one product claim (Knowm $800 kit) was REFUTED/split. So procurement (TetraMem/Mythic/Anadigm/Aspinity SKU+price+order-today) and the **make-or-break endurance budget (RRAM/PCM/FeFET/Flash vs ECRAM writes-to-failure)** remain genuinely unanswered. Indirect signal only: NVM rejected for KV-cache because of slow/high-energy/low-endurance writes; gain-cells chosen for endurance.
+
+### Still to pin (3rd focused pass — procurement + endurance ONLY)
+1. SKU-level buy-now: TetraMem MX100, Mythic dev kit, Anadigm AN231E04 board, Aspinity AML100, any RRAM eval kit — orderable today? price? (deep-research struggles here — may need vendor sites / direct contact, not web search.)
+2. Per-device **write endurance**: RRAM/PCM/FeFET/Flash/**ECRAM** cycles-to-failure; is ECRAM the symmetric-update + endurance fix, and is it available outside research labs? (Likely research-only — flag if so.)
+3. With digital-accumulate-then-threshold-program, how many physical writes does a ~30k-step EP run actually incur, vs device endurance?
+
+## UIUC ECE collaboration map (2026-06-21, user-directed — the hardware-side gap)
+User's reachable hardware groups (ALL ECE — this is the team's missing layer). The key insight: the
+three UIUC ECE groups span EXACTLY the three layers an in-situ-EP analog demo needs, and together they
+SOLVE the market's fatal gap (you can't BUY an in-situ-trainable analog array — but you can fab one in-house):
+- **Wenjuan Zhu (UIUC ECE) = DEVICE layer** [user-confirmed]: memristor/RRAM/FeFET / 2D-material devices.
+  This is the in-situ-trainable substrate that is research-only on the market — her group can FABRICATE it.
+- **Naresh Shanbhag (UIUC ECE) = CIRCUIT/ARCH layer**: SRAM in-memory compute (DIMA/C3SRAM line) — the analog MVM.
+- **Pavan Hanumolu (UIUC ECE) = MIXED-SIGNAL GLUE**: ADC/DAC, PLL, switched-cap — the converters + analog
+  integrator for the relaxation/control loop (settle→nudge→settle→local Δθ).
+- Tao Chen (USTC) = hardware but NOT EP; Stanford = student can broker intros (Wong RRAM / Murmann-legacy CIM / etc.).
+STRATEGY SHIFT: not "buy a board" — it's in-house fab of the trainable substrate (Zhu) + CIM circuit (Shanbhag)
++ converter glue (Hanumolu) + our FPGA/EP control loop. Sourcing deep-research w1kuw4zmz profiling all + industry;
+its "Zhu" angle mis-targets ML-accel (wrong layer) — corrected to Wenjuan-Zhu-device here; will run a focused
+pass on her device work + merge.
+
+### Sourcing run RESULTS (w1kuw4zmz, 11 high-conf named-paper findings; raw: hw_groups_claims.json)
+**HEADLINE: Shanbhag (UIUC) is the closest match of ANY named group — and it's the ONLY group silicon that already does analog MVM + genuine on-chip in-situ weight update.**
+- **Shanbhag DIMA chip** (Gonugondla/Kang/Shanbhag, **JSSC 2018**, "Variation-Tolerant In-Memory ML Classifier via On-Chip Training"): 65nm, 16kB 6T-SRAM, analog MVM via "functional read" + charge-sharing, AND a **dedicated on-chip digital trainer doing SGD + writing weights back to the array each batch** (random→within 1% of FP in ~400 batches). In-situ training gave **2.4× energy cut** at iso-accuracy. ⇒ the two EP-critical mechanisms (analog MVM + on-chip weight write-back) ALREADY in one fabricated UIUC chip. CAVEAT: it's a single-layer SVM, batch-SGD — **no relaxation/settling loop, no multilayer credit assignment**. We'd add the equilibrium dynamics + two-phase EP rule on top. [high]
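+- Related SRAM-CIM: **C3SRAM** (JSSC 2020), a capacitive-coupling XNOR-MAC macro, but inference-only binary. Its authors are Jiang/Yin/Seo/Seok (Columbia + ASU), so treat it as related work rather than Shanbhag-group silicon.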
+- **Hanumolu** — (the run under-covered him; he's the ADC/DAC+integrator glue, still the right converter-layer partner; pin specific silicon separately.)
+- **"Zhu"/Lin correction CONFIRMED**: Yingyan **Celine Lin** is UIUC-PhD, now **Georgia Tech** (not current UIUC faculty, surname Lin not Zhu), digital-accel/co-design, **no analog/RRAM/in-situ silicon**. ⇒ the device-layer partner is **Wenjuan Zhu** (user-confirmed), NOT Lin.
+- **USTC Tao Chen (陈涛)**: device/materials "in-materio" reservoir computing (disordered dopant-atom networks in Si, Nature 2020). Real device work but NOT circuit-CIM, NOT EP — confirms user's "no EP hardware." Possible device-physics collaborator, not a demo host.
+- **Stanford NeuRRAM** (Wong + Raina + UCSD Cauwenberghs, **Nature 2022**): 48-core, ~3M-cell RRAM analog-CIM — the most EP-relevant *substrate* (analog MVM at scale), **BUT INFERENCE-ONLY** (weights programmed offline; only chip-in-the-loop forward fine-tune). Gives the MVM primitive, not native in-situ learning.
+  - **Wong = H.-S. Philip Wong (Hon-Sum Philip Wong, 黄汉森)**: Stanford EE, Willard R. & Inez Kerr Bell Professor; also **TSMC Chief Scientist**. RRAM/memristor, 3D monolithic integration, in-memory computing; co-author of NeuRRAM + the canonical "Metal–Oxide RRAM" review (Proc. IEEE 2012). THE RRAM-device heavyweight for the trainable-substrate conversation (reach via the Stanford student contact). TSMC tie = a path to real foundry RRAM.
+- **Industry = all inference-only**: TetraMem **MX100** (Nature Electronics 2025; 10 cores, 248×256 1T1R RRAM+RISC-V) ships real silicon but **inference-only** (no in-situ update). Mythic/EnCharge class same.
+- **DIY in-situ test-chip path**: **SkyWater S130 + Weebit Nano 256Kb ReRAM IP** (JEDEC/AEC-Q100 qualified 2023, open SKY130 PDK) = foundry RRAM access for an MPW — a lab can fab its OWN trainable RRAM array. [med]
+- **EP-on-hardware = still only SPICE sim**: "Memristor Crossbar Circuits Implementing Equilibrium Propagation" (Oh et al., Kookmin U) is circuit simulation, NOT silicon. ⇒ **no fabricated EP-transformer hardware exists anywhere — the demo is genuinely novel.** [high]
+
+**BOTTOM LINE FOR THE PITCH:** lead with **Shanbhag** (his JSSC-2018 chip already proves analog-MVM + on-chip-training in one die — the nearest substrate; we add relaxation + EP) + **Wenjuan Zhu** (trainable device) + **Hanumolu** (converter glue) = a complete in-house UIUC-ECE stack. Stanford **Wong** as the RRAM-device escalation (via the student). Industry (TetraMem/Mythic) only useful for the fixed-weight Phase-2 forward path. Nobody has built EP-transformer hardware → first-mover.
+
+### Hanumolu profile (targeted, 2026-06-21)
+**Pavan Kumar Hanumolu** — Seendripu Family Professor, UIUC ECE (since 2013; prior Oregon State); member of CSL's Integrated Circuits & Systems Group. "Top-five mixed-signal IC researchers worldwide," NSF CAREER 2010, heavy JSSC/ISSCC record. Work: **energy-efficient analog/mixed-signal — time-based ADCs, continuous-time filters, ultralow-jitter clocking/PLLs, high-speed serial links, switched-cap, DC-DC power conversion.** ⇒ exactly the **converter + analog-integrator + feedback-loop** layer the EP control loop needs (ADC/DAC glue to read settled states + apply the nudge; switched-cap integrators ARE relaxation-dynamics primitives). Note: his published silicon is converters/links/clocking, NOT CIM — he's the glue/control-loop partner, not the MVM substrate. (Also co-founded Omni Design Technologies — converter IP.) [sources: ece.illinois.edu/.../hanumolu, icsg.csl.illinois.edu]
+
+### GAP 3 — endurance budget: the make-or-break number, and it CLEARS the bar
+The feasibility question: an EP run does ~tens-of-thousands of update STEPS; with digital-accumulate-then-threshold-program, physical device writes are FEWER than steps. How many cycles do devices survive?
+- **HfOx RRAM: up to ~10^10 cycles** endurance (best-in-class metal-oxide). [arxiv 1909.01771, IOP 10.1088/1361-6641/abf29d]
+- AlOx / weaker oxides: only ~10^4 — material choice matters a lot.
+- **Budget check**: MNIST-class training writes ~10^4 cycles; gradient training can scale to **~10^8** cycles. ⇒ **HfOx (10^10) has ~100× headroom even over a 10^8-write run** — endurance is NOT a blocker IF you use HfOx-class RRAM + the threshold-accumulate scheme (which cuts writes below step-count). [web-search snippets, med-high]
+- **Device-nudge insight**: an EP/Coupled-Learning *nudge* changes resistance far less than a full state write, so per-nudge endurance is plausibly >> rated full-write endurance (needs empirical confirmation, but favorable). 
+- **ECRAM (electrochemical RAM)** = the symmetric/linear-analog-update + high-endurance technology specifically aimed at in-situ training: "open-loop analog programmable electrochemical memory array" (Nature Comms 2023, s41467-023-41958-4) — but **research-only** (not commercially available; lab/foundry fab). It's the device-physics frontier Wenjuan Zhu / Wong-type collaborators work in.
+- **VERDICT: endurance is survivable** with HfOx-class RRAM (10^10) + threshold-program; ECRAM is the better-but-research-only upgrade. The make-or-break risk is NOT endurance — it's **update linearity/symmetry + device variation** (the asymmetric-nonlinear-update problem), which the digital-accumulate scheme + compensation (stochastic rounding) mitigates. [from earlier run + this]
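+
+The headroom arithmetic behind this verdict, as a worked check. It uses only the figures quoted above (10^10 and 10^4 cycle endurance, a 10^8-write worst case, a ~30k-step EP run) plus one stated assumption: at most one physical write per device per update step, which the threshold-accumulate scheme can only reduce.
+
+```latex
+\begin{align*}
+\text{headroom} &= \frac{\text{endurance (cycles)}}{\text{writes per device}},\\
+\text{HfO}_x,\ 10^{8}\text{-write run:}\quad
+  &\frac{10^{10}}{10^{8}} = 100,\\
+\text{HfO}_x,\ 3\times10^{4}\text{-step EP run:}\quad
+  &\frac{10^{10}}{3\times10^{4}} \approx 3.3\times10^{5},\\
+\text{AlO}_x,\ 3\times10^{4}\text{-step EP run:}\quad
+  &\frac{10^{4}}{3\times10^{4}} \approx 0.33 \ (<1,\ \text{fails in the worst case}).
+\end{align*}
+```
+
+So the weaker oxides only survive if thresholding cuts physical writes to roughly a third of the step count or less, while HfOx-class devices clear the bar without relying on it.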
diff --git a/docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md b/docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md
new file mode 100644
index 0000000..46f79e3
--- /dev/null
+++ b/docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md
@@ -0,0 +1,65 @@
+# Deep-reasoning brief: the physics of backprop-free equilibrium learning on non-conservative systems
+*Self-contained briefing for a deep-reasoning consult. Goal: identify whether the central obstruction is fundamental, and what the physically-realizable escape routes are. 2026-06-21.*
+
+---
+
+## 0. One-paragraph setup
+We train a **transformer as a physical equilibrium system** — a fixed point `z*` of a damped relaxation `ż = F(z)`, `F(z) = x_in − (1+c)z + Attn(LN z) + FFN(LN z)` (`c>0` damping) — **without backpropagation**, using **Equilibrium Propagation (EP)**: a free relaxation to `z*`, a nudged relaxation with the output perturbed by `β ∂L/∂z`, and a **local** weight update from the contrast of the two settled states. The motivation is **analog hardware**: the relaxation is physical settling (≈ free), and the update is local (no backward pass). On such hardware **BPTT is physically impossible** (no backward-through-time). The problem: EP trains cleanly until attention becomes expressive, then the gradient estimate collapses and training diverges — while exact BPTT with the identical model sails to a good solution.
+
+## 1. What is rigorously established (the constraints any answer must respect)
+- **EP ≡ BPTT only under** (Ernoult 2019): **H1** symmetric Jacobian `∂F/∂z = (∂F/∂z)ᵀ` (i.e. F is a gradient of a scalar — *conservative*); **H2** free phase *exactly* at the fixed point; **H3** `β→0`. BPTT needs **none** of these — it differentiates the actual finite computation graph.
+- **Softmax self-attention is non-conservative**: independent Q/K (and untied FFN) ⇒ `J ≠ Jᵀ`. We measure the non-normality `‖A‖/‖S‖`-scale ≈ **1.4** (S, A = symmetric/antisymmetric parts of J).
+- The standard fix, the **AEP correction** (inject `−2A(x−x*)` in the nudged phase to make the nudged Jacobian `J→Jᵀ`), is **first-order in `S⁻¹A`** — its error is a Neumann series `Σ (S⁻¹A)^{2n+1}` that **converges only if `ρ(S⁻¹A) < 1`**. At our `‖S⁻¹A‖ ≈ 1.4` it is *out of warranty*.
+- The **good solutions are weakly contractive** — a BPTT model at good loss has finite-time Lyapunov ≈ **−0.0347** (barely inside the stable disk). The instability is **non-normal transient growth**, not an eigenvalue crossing (spectral radius is a misleading stability signal here).
+- Empirically the EP↔BPTT gradient gap `M = g_BPTT − g_EP` reaches **~84 % of `‖g_BPTT‖`** just before divergence; EP retains the *direction* (cos 0.91) but loses the *magnitude/structure*. EP gradient ≈ BPTT only when the free-phase residual `res ≲ 1e-4`; at `res ≈ 1e-2` it is noise.
+- The damping `−(1+c)z` is what makes `S` well-conditioned at a *healthy* fixed point (`S ≈ −(1+c)I` dominant ⇒ `S⁻¹A ≈ −A/(1+c)`, small); `ρ(S⁻¹A)` blows up only *as contraction is lost* (S → singular). So large `ρ(S⁻¹A)` is partly a *symptom* of lost contraction, not purely intrinsic asymmetry. **(This is unresolved — see Q2.)**
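+
+A short worked sketch of where the odd-power Neumann series comes from. It assumes only the split `J = S + A` with `S` symmetric and invertible and `A` antisymmetric (so `Jᵀ = S − A`); the shorthand `X = S⁻¹A` is introduced here.
+
+```latex
+\begin{align*}
+J^{-1}    &= (S + A)^{-1} = (I + X)^{-1} S^{-1} = \sum_{n \ge 0} (-X)^{n}\, S^{-1},\\
+J^{-\top} &= (S - A)^{-1} = (I - X)^{-1} S^{-1} = \sum_{n \ge 0} X^{n}\, S^{-1},\\
+J^{-\top} - J^{-1} &= 2 \sum_{n \ge 0} X^{2n+1}\, S^{-1},
+\qquad X = S^{-1} A .
+\end{align*}
+```
+
+The even powers cancel and only odd powers of `S⁻¹A` survive. Each series converges exactly when `ρ(S⁻¹A) < 1`; beyond that the closed-form inverses can still exist, but no truncation of the expansion approximates them.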
+
+## 2. The central tension (the thing to crack)
+**EP's gradient fidelity requires near-conservativity (`‖S⁻¹A‖ ≪ 1`) and tight convergence; the expressive power of attention comes precisely from its non-conservativity (`‖S⁻¹A‖ ≈ 1.4`).** BPTT escapes because it needs neither H1 nor H2 — but BPTT is not physically realizable on analog relaxation hardware. So the question is whether a **local, physically-realizable, backprop-free** rule can get the correct gradient for a **non-conservative equilibrium**, or whether there is a fundamental obstruction.
+
+---
+
+## QUESTIONS FOR DEEP REASONING
+
+### Q1 — Is the obstruction FUNDAMENTAL or an estimator artifact? (the no-go question)
+Is there a **theorem** of the form *"any local learning rule that reads only the settled states of a physical relaxation can recover the true loss-gradient of a non-conservative equilibrium system only to fidelity bounded by a function of the operator's non-normality `‖S⁻¹A‖`"*? Or is the `ρ(S⁻¹A)<1` limit specific to the *first-order* AEP construction, with a different (possibly non-perturbative) local estimator that is exact at any asymmetry?
+- Concretely: the true adjoint needs `J⁻ᵀ`; EP's nudged relaxation naturally produces `J⁻¹`. For **symmetric** J these coincide (why EP works on energy-based systems). Is the gap `J⁻¹ − J⁻ᵀ` *fundamentally* inaccessible to any forward-only physical relaxation, or can a cleverly-coupled relaxation produce `J⁻ᵀ`?
+- Relevant: implicit-function-theorem / adjoint-method structure; whether the adjoint of a *dissipative non-reciprocal* system is itself the steady state of *some* physically-runnable dynamics.
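+
+A minimal sketch of the two inverses named in the first bullet. It assumes the nudged phase runs `ż = F(z) − β ∂L/∂z` (the sign convention is our assumption) and that the local contrast takes the vector-field form `ĝ = −(∂F/∂θ)ᵀ (z^β − z*)/β`, where `z^β` denotes the nudged fixed point.
+
+```latex
+\begin{align*}
+F(z^\beta) = \beta\,\frac{\partial L}{\partial z}
+  \;&\Rightarrow\; J\,(z^\beta - z^*) = \beta\,\frac{\partial L}{\partial z} + O(\beta^2)
+  \;\Rightarrow\; \lim_{\beta \to 0} \frac{z^\beta - z^*}{\beta} = J^{-1}\frac{\partial L}{\partial z},\\
+\hat g &= -\Big(\frac{\partial F}{\partial \theta}\Big)^{\!\top} J^{-1}\,\frac{\partial L}{\partial z}
+  && \text{(what the nudge delivers)},\\
+\frac{dL}{d\theta} &= -\Big(\frac{\partial F}{\partial \theta}\Big)^{\!\top} J^{-\top}\frac{\partial L}{\partial z}
+  && \text{(implicit function theorem)},\\
+\frac{dL}{d\theta} - \hat g &= -\Big(\frac{\partial F}{\partial \theta}\Big)^{\!\top}
+  \big(J^{-\top} - J^{-1}\big)\frac{\partial L}{\partial z}.
+\end{align*}
+```
+
+For symmetric `J` the last line vanishes, which is H1. For non-reciprocal `J` it is exactly the `J⁻¹ − J⁻ᵀ` gap this question asks about.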
+
+### Q2 — Decouple non-normality from lost-contraction. Is the GOOD solution actually high-`ρ(S⁻¹A)`?
+We suspect `ρ(S⁻¹A)` is large at the *divergence edge* mostly because contraction is lost (`S` near-singular), not because the *good* solution is intrinsically that asymmetric. At a healthy fixed point the damping makes `S ≈ −(1+c)I`, so `ρ(S⁻¹A) ≈ ‖A‖/(1+c)`.
+- Does the **expressivity of attention** require large **`‖A‖`** (antisymmetric coupling) per se, or only large **`‖J‖`** (which the damping + a larger `c` could offset while keeping `‖S⁻¹A‖<1`)? I.e. **can one have an expressive attention operator that is strongly damped (large `1+c`) so that `S⁻¹A` stays sub-1 — buying EP-validity by *scaling*, not by killing asymmetry?** What is the trade between damping `c`, expressivity, and `ρ(S⁻¹A)`?
+- Is there a reparameterization (e.g. normalizing the value/output path, or `c` per-channel) that holds `ρ(S⁻¹A)<1` *uniformly over training* without an explicit penalty?
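+
+The scaling argument can be made explicit in the idealized limit quoted above. This is a sketch that assumes `S` is exactly `−(1+c)I`; in the real block `S` also carries the symmetric parts of the attention and FFN Jacobians, so the threshold below is only indicative.
+
+```latex
+\begin{align*}
+S = -(1+c)\,I \;&\Rightarrow\; S^{-1}A = -\frac{A}{1+c},\\
+A^{\top} = -A \;&\Rightarrow\; A \text{ is normal with imaginary spectrum, so } \rho(A) = \|A\|_2,\\
+\rho(S^{-1}A) = \frac{\|A\|_2}{1+c} < 1 \;&\Longleftrightarrow\; c > \|A\|_2 - 1 .
+\end{align*}
+```
+
+Under this idealization the damping needed for EP validity grows linearly with the antisymmetric coupling, which is the trade the question asks to quantify against expressivity.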
+
+### Q3 — Can the PHYSICAL hardware compute its own adjoint (transpose dynamics)?
+EP works on conservative systems because a **reciprocal** (symmetric-coupling) physical network has `J=Jᵀ`, so the same physical relaxation computes both the forward state and the adjoint. Our attention is **non-reciprocal** (independent Q/K = directed coupling).
+- Is there a **physical mechanism** to run the *transpose/adjoint* dynamics of a non-reciprocal dissipative network on the same substrate — e.g. **time-reversal / Hamiltonian-echo backpropagation** (López-Pastor & Marquardt 2023, for Hamiltonian systems), a physically-coupled "transpose twin" network, or an Onsager-reciprocity-breaking trick? Does HEB extend from *Hamiltonian* (lossless) to our *damped dissipative* setting, and at what cost?
+- If a substrate can physically realize **both** `J` and `Jᵀ` relaxations, EP/AEP becomes *exact* for non-conservative systems with **no** perturbative `ρ(S⁻¹A)<1` restriction. Is that physically buildable (analog crossbar / photonic / spintronic), and what does the transpose operation cost?
+
+### Q4 — Is "equilibrium" even the right primitive? Finite-time vs fixed-point computation
+BPTT's stabilizer is exactly the **finite-horizon truncation gradient** (it differentiates the *actual* 150-step settling, not the `T→∞` fixed point). The good BPTT solution rides a **weakly-contractive / not-fully-settled** state — arguably a *finite-time transient*, not a true equilibrium.
+- Should the physical computation be defined at a **fixed finite settling time** `T` (a "finite-time fixed point") rather than `T→∞`? Is there a **local** learning rule whose target is the finite-time output — i.e. a physically-realizable analog of BPTT's finite-horizon term — without an actual backward pass? (e.g. forward-mode/RTRL-like physical accumulation, two-timescale dynamics, or an auxiliary physical variable that accumulates the transient sensitivity.)
+- More radically: should we **embrace non-equilibrium computation** (transient / limit-cycle / wave dynamics) as the primitive and seek a local rule for *that* (reservoir-computing, oscillatory/Ising-machine, physical-learning lineages), rather than forcing a fixed-point interpretation onto a computation that is natively transient?
+
+### Q5 — Non-normality as physics: does it have to hurt, and can it be bounded for free?
+The failure is **non-normal transient amplification** — a stable operator whose perturbations grow transiently before decaying. In a real relaxation circuit this should be a *physical* phenomenon (ringing / overshoot before settling).
+- What is the **physically natural regularizer** of non-normality? Is there a passivity / dissipativity / **contraction-in-a-metric (log-norm `μ_P(J)<0`)** structure that a physical substrate enforces *automatically* (e.g. a passive RC network is contractive by construction), and does building attention as a *passive* element bound `‖S⁻¹A‖` without a learned penalty?
+- Is there an expressivity argument that **bounded non-normality is sufficient** for language modeling — i.e. that the useful computations of attention live at modest `‖S⁻¹A‖` and the divergence-causing high non-normality is *gratuitous* (an artifact of unconstrained training), not *functional*?
+
+### Q6 — The conservative-attention escape: fundamental cost or engineering cost?
+Energy-based attention exists — **modern Hopfield networks / Energy Transformer (Krotov), tied-value LSE-energy attention** — which are conservative by construction (`H1` holds, EP is *exact*, runs natively on analog equilibrium hardware). We measure it costs ~0.15–0.2 CE vs free (non-conservative) attention.
+- Is the expressivity loss of conservative/energy attention **fundamental** (information-theoretic: a symmetric-coupling associative memory provably cannot represent what directed Q≠K attention can) or **engineering** (current energy-attention forms are just suboptimal, and a better conservative parameterization closes the gap)?
+- Is there a **partially-conservative** sweet spot: an attention with a *controlled, small* antisymmetric part (`‖S⁻¹A‖<1`, inside the first-order AEP warranty) that captures most of the expressivity — and what is the **Pareto frontier of [gradient fidelity ↔ asymmetry ↔ task loss]**?
+
+### Q7 — THE make-or-break for the hardware thesis
+Granting that analog relaxation hardware is **structurally confined to (near-)conservative, contractive, equilibrium-based local learning**: **is a "readable" small language model (≈10–50M params, coherent text) achievable inside that regime?** Or does coherent language *require* the high non-normality / finite-time-transient expressivity that provably breaks local equilibrium learning?
+- If achievable: what is the *minimum* non-normality / contraction margin a coherent-language equilibrium model needs, and does it fit inside `‖S⁻¹A‖<1` (+ comfortable contraction)?
+- If not: what is the realistic *ceiling* of analog-EP-trainable models, and is there a hybrid (mostly-conservative core + a thin digitally-assisted non-conservative correction) that lifts it?
+
+---
+
+## What an actionable answer looks like
+For each: (a) is it a **fundamental obstruction** (cite/sketch the no-go) or an **engineering gap** (sketch the construction)? (b) the **physical realizability** verdict (can it run on a relaxation substrate with local updates, no backward pass?) (c) the **cheapest experiment** that would falsify the proposed mechanism on our simulator (we can measure `ρ(S⁻¹A)`, contraction margin, gradient cos vs a non-conservative knob, energy-vs-free-attention loss, damping-`c` sweeps). We are not looking for "use BPTT" or "use a digital solver" — both are physically excluded. We want the **physics of a local, forward-only, non-conservative equilibrium learning rule**, or the proof that none can exist beyond bounded asymmetry.
+
+## Key references the reasoner should engage
+Scellier–Bengio 2017 (EP / energy); Ernoult et al. 2019 (EP≡BPTT conditions); Scurria et al. 2026 (AsymEP, the `J→Jᵀ` first-order correction); Laborieux–Zenke 2023 (Jacobian homeostasis, the `S⁻¹A` bias + functional-symmetry fix); López-Pastor–Marquardt 2023 (Hamiltonian echo backprop = physical adjoint via time-reversal); Krotov / Hoover et al. 2023 (Energy Transformer); Winston–Kolter 2020 (monotone-operator equilibrium nets = contraction by construction); Stern–Murugan 2023 (physical learning review); Kendall et al. 2020 (EP on analog resistor networks).
diff --git a/docs/hardware/SCALING_AND_HARDWARE_PLAN.md b/docs/hardware/SCALING_AND_HARDWARE_PLAN.md
new file mode 100644
index 0000000..8e93a40
--- /dev/null
+++ b/docs/hardware/SCALING_AND_HARDWARE_PLAN.md
@@ -0,0 +1,58 @@
+# EP scaling-to-1B sim cost + modular analog-hardware plan (2026-06-21)
+
+## PART 1 — 1B simulation cost (H200)
+
+Anchor: your 1B **BP** run ≈ 36h × 8 H200 = **288 H200-hours**.
+
+Cost driver = the equilibrium block relaxes ~150–300 steps (T1=150 + refine to t1max=300) — this is the
+WHOLE cost, ~width-independent, so the per-step factor measured at C512 holds at 1B:
+- EP/BP per-step FLOPs: **~50–100×** (each relaxation step is a forward-only 2ND pass, repeated ~150–300 times; one BP step is 6ND). Measured: C512 EP 0.23 it/s
+  vs a depth-1 C512 BP ~tens it/s ⇒ ~65–130×. **EP/BPTT ≈ only 1.5×**: the cost is the equilibrium, not EP.
+- Reducers: bf16 + torch.compile + the compiled free-phase fast-path (blk._cstep) ≈ 2–4×; refine exits early
+  while contractive (×~0.6); EP's low memory (no unrolled graph) packs bigger batches than BPTT. Net effective **~20–50×**.
+
+**Estimate: 1B EP ≈ 20–50 × 288 ≈ 6,000–15,000 H200-hours** ≈ $18k–60k @ ~$3–4/H200-hr (or burn AWS credits).
+Wall-clock: ~30–75 days on 8×H200, or **~4–9 days on 64×H200** (same H200-hours — scale OUT for wall-clock, cost is fixed).
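+
+A worked check of the arithmetic behind these figures. It assumes the 288 H200-hour anchor, the 20–50× net factor, and the $3–4 per H200-hour rate quoted above. The first line assumes a single relaxation of `T = 150` to `300` forward-only steps, which is our reading of the 2ND vs 6ND comparison rather than a measured figure.
+
+```latex
+\begin{align*}
+\text{per-step ratio} &\approx \frac{T \cdot 2ND}{6ND} = \frac{T}{3}
+  \in [50,\ 100] \quad \text{for } T \in [150,\ 300],\\
+\text{H200-hours} &= (20\text{--}50) \times 288 = 5{,}760\text{--}14{,}400
+  \approx 6{,}000\text{--}15{,}000,\\
+\text{cost} &\approx 6{,}000 \times \$3 \ \text{to}\ 15{,}000 \times \$4
+  = \$18\text{k}\text{--}\$60\text{k},\\
+\text{wall-clock, 8 GPUs} &= \frac{5{,}760\text{--}14{,}400}{8 \times 24}
+  = 30\text{--}75\ \text{days},\\
+\text{wall-clock, 64 GPUs} &= \frac{5{,}760\text{--}14{,}400}{64 \times 24}
+  \approx 3.8\text{--}9.4\ \text{days}.
+\end{align*}
+```
+
+The wall-clock ranges come from the unrounded 5,760–14,400 H200-hour figures, and they assume perfect data-parallel scaling from 8 to 64 GPUs.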
+
+**Recommendation (don't burn 10k H200-hr blind):**
+1. Ship the **speed package (task #14)** FIRST — directly cuts the bill 2–4×.
+2. Validate the recipe + MEASURE the real EP/BP factor on a **~100–300M rung** (scaling-law dossier) before committing 1B.
+3. Then the 1B run. Use EP's memory advantage to pack batch; data-parallel across many H200s for wall-clock.
+
+## PART 2 — Modular analog-hardware plan (tens-of-M, COTS, NO custom fab)
+
+**Principle that makes this cheap:** the damped equilibrium block is a *physical settling system* (monDEQ ≈ passive
+resistor/op-amp circuit, Chaffey 2025). The free-phase relaxation = the hardware SETTLING — so the ~100× sim cost becomes
+**~free physical settle**, and EP's update is **local** (no backprop). The algorithm is DESIGNED for this substrate; the IP
+is substrate-agnostic, so demo on whatever COTS analog you can buy — never fab.
+
+**Decompose the block → what each part needs:**
+| part | hardware | note |
+|---|---|---|
+| linear maps WQ/K/V/O, fc, pj | analog MVM (crossbar) | bulk of params + compute |
+| relaxation z←z+εF(z) | physical feedback (RC + the −c·z damping resistor) | THE win — free settle |
+| nonlinearities: softmax, GELU, LN | the analog-HARD part | do mixed-signal in FPGA, OR use energy/Hopfield attention (analog-native) |
+| EP update (local) | needs **updatable** analog weights | the key constraint |
+
+**Substrate options (COTS/existing; key tradeoff = weight updatability — VERIFY current availability/specs):**
+- **(A) Memristor/ReRAM crossbar eval kits** — in-situ updatable ⇒ cleanest for EP TRAINING; small/research-grade (TetraMem, CrossBar, Knowm, academic arrays).
+- **(B) Analog-inference compute modules** (~tens-of-M weights, e.g. Mythic-class) — large MVM but FIXED/re-flash weights ⇒ mixed-signal EP (analog forward, digital re-program for the update). Matches "tens of M" in MVM size.
+- **(C) FPAA + discrete op-amps** (Anadigm / GT RASP) — fully-analog SMALL block, true physical settle, fully programmable.
+
+**Recommended architecture = mixed-signal, stitched (no fab):**
+- analog MVM substrate (A for trainable, or B for scale) does the linear F(z) + the relaxation feedback (the energy win);
+- a COTS **FPGA / RFSoC** (fast ADC/DAC) does the nonlinearities (softmax/GELU/LN) + the EP control loop
+  (drive settle → apply nudge βg → settle → measure contrast → compute + apply local Δθ);
+- ADC/DAC glue between them. All COTS modules + a PCB. The "out-of-the-box" stitch.
+
+**Out-of-the-box levers:**
+- **Softmax is the analog-hard piece** → for the HARDWARE demo use the energy/Hopfield (LSE, tied-value) attention
+  variant (analog-native, conservative) even if the sim keeps softmax; or keep softmax in the FPGA (small fraction of compute).
+- The code's **`--fnoise` optics-noise model already exists** → simulate analog non-idealities (device noise, quantization,
+  variation) IN the 1B sim to de-risk the hardware before buying anything.
+- **Stage it:** Phase-1 = ONE small block on FPAA/discrete + FPGA, prove the EP analog loop trains a toy task end-to-end;
+  Phase-2 = scale the crossbar to tens-of-M. Tens-of-M is the Phase-2 target, not the first build.
+
+**Next step:** a deep-research pass on "2026 COTS / eval-board analog-compute substrates with in-situ weight update for
+equilibrium-network training" to pin the specific modules (memristor kits, Mythic-class availability, FPAA, RFSoC) — the
+specific availability is the thing to verify, not guess.