# Methods — EP-Trained Equilibrium Transformer Language Model

Complete technical notes for discussion: architecture, how attention/FFN are made EP-trainable,
the training rule, every stabilizer/regularizer and its reason, the LM setting, validation
methodology, results, and open problems. Code paths at the end. (Companion doc:
`/home/yurenh2/ept/FINDINGS.md` — the project arc and findings log.)

---

## 1. Problem statement

Train a transformer-class language model where **both attention and FFN learn without
backpropagation through the computation** — using Equilibrium Propagation: two (or N) relaxations
of the same dynamics plus a local contrast readout. The questions: (a) does it train at all,
(b) what does it cost vs the exact gradient (BPTT on the same architecture) and vs a standard
BP transformer at equal parameters, (c) what are the actual failure mechanisms.

Headline result (all at 14k steps, fully-controlled comparison): best EP model reaches
**val CE 1.676** (adaptive T1/T2, run R10). Like-for-like standard BP transformer (MLP=4 — the
same parameter shape as the thick block, see §2) reaches 1.610 ⇒ **total gap 0.066**, decomposed:
**architecture ≈ 0.025** (BPTT + the same stabilizer + same param-EMA on the identical block:
1.635) and **training rule ≈ 0.041** (EP 1.676 vs that control) — and EP beats the thin-matched
BP MLP=1 baseline (1.689). Unregularized BPTT *destabilizes* on long horizons (walks off the
contractive manifold, res→5e-2, best 2.021 — worse than its 3k run, 1.949): the stabilization
loop EP carries out of estimator necessity (residual-driven Jacobian-penalty controller) is what
the equilibrium architecture itself needs for long training. Random = ln 65 = 4.17.

## 2. LM setting (data, embeddings, readout, evaluation)

- **Corpus**: Shakespeare character-level (nanoGPT preprocessing): `train.bin`/`val.bin` uint16
  token streams + `meta.pkl`, vocab = 65 chars (~1.1 MB text). Local copy:
  `/tmp/lt_ep/data/shakespeare_char`.
- **Batching**: random crops, B=32 sequences × T=64 context; next-char targets (shift by 1).
- **Embeddings**: learned token table `tok ∈ R^{65×128}` (init N(0, 0.02²)) + learned absolute
  positional table `pos ∈ R^{64×128}`. Input injection `x_in = tok[idx] + pos`. **Embeddings are
  trained by EP too** — they enter the force through the input-clamp term −(z − x_in), so the same
  vector-field readout (Sec. 4) delivers their gradient. No pretrained embeddings, no BP path.
- **Readout head**: logits = z* Wh, `Wh ∈ R^{128×65}`, trained with its **own local CE gradient**
  ∂CE/∂Wh at the free equilibrium (loss-adjacent layer — local learning suffices; this is standard
  in EP setups and is not backprop through the dynamics).
- **Objective / eval**: mean next-token cross-entropy over all B·T positions; val CE = average over
  8 fresh validation batches, computed by running the same free-phase relaxation (T1 steps) used in
  training — i.e. the eval graph equals the inference graph. Random baseline ln(65) = 4.174.
- **Model size**: C=128, H=4 heads, single equilibrium block (weight-tied recurrence ⇒ effective
  depth = T1). Parameter matching to the BP baseline depends on the variant: the **thin** block's
  Hopfield memory (Wm: 128×256 ≈ 33k) matches BP **MLP=1** (2C² ≈ 33k); the **thick** block's
  untied FFN (fc+pj = 2·4C² ≈ 131k) matches BP **MLP=4** (131k) — so thick-block results compare
  against MLP=4 (1.610), thin-block results against MLP=1 (1.689).

## 3. Architecture: the equilibrium block

State `z ∈ R^{B×T×C}` (one vector per token position). Dynamics ż = F(z); inference = relax to
fixed point z* (Euler: z ← z + ε F(z), ε=0.1, T1=150 steps), predict from z*.

We built four force variants (all share the input clamp and the readout):

### 3.1 `thick` — DEQ-transformer block (the winner)

```
F(z) = −(z − x_in)                            input clamp (leak toward embedding)
     + Attn(LN1(z))                           causal multi-head softmax attention, separate WQ WK WV WO
     + W2 · GELU(W1 · LN2(z) + b1) + b2       untied 4× FFN (W1: C→4C, W2: 4C→C)
     − c·z                                    damping (c=1–2)
```

LN1/LN2 carry learned affine (g, b). This is exactly a pre-LN transformer block written as a
*force* instead of a layer stack — same form DEQ uses as its fixed-point map. It is strongly
**non-conservative**: no scalar energy has this gradient (Q≠K asymmetric coupling, untied FFN).
EP is made exact for it via the AEP correction (Sec. 4.2).

### 3.2 `real`/thin — Hopfield-FFN + damped real attention

```
F(z) = −∇_z [ ½‖z − x_in‖² + E_mem(z) ] + s·(Attn(z) − c·z)
E_mem(z) = −Σ relu(z Wm)²                     modern-Hopfield / dense associative memory, Wm: C×256
```

The FFN here is **energy-based**: E_mem is the dense-associative-memory energy; its force
2·relu(zWm)Wmᵀ is a one-hidden-layer FFN with *tied* weights (Wm in, Wmᵀ out). Attention remains a
raw non-conservative force with damping. This was our first stable trainable variant.

### 3.3 `energy` — fully conservative attention (CET-style)

Attention folded into the energy: `E_att = −(1/γ) Σ_heads,i LSE_j(γ q_i·k_j)` (causal-masked,
**tied value** — the force of this energy mixes values v=k), plus `½c‖z‖²` confinement because
E_att+E_mem alone are unbounded below. F = −∇E exactly ⇒ classic EP applies with no correction.
This is the CET (Høier/Kerjan/Scellier) route, which we reproduced separately on vision
(masked CelebA/CIFAR completion, EP ≈ TBPTE). Trade-off: tied value + reciprocal coupling = the
least expressive attention.

### 3.4 `mono` — monDEQ-structured contraction

`F(z) = −(m·z + z PᵀP) + z(Q−Qᵀ)ᵀ + x_in − ∇E_mem + s·Attn(z)` — the linear part is a monotone
operator by construction (Winston–Kolter): symmetric part ⪯ −m·I guaranteed, antisymmetric part
(Q−Qᵀ) free (non-reciprocal coupling at no stability cost). Guaranteed unique fixed point for the
linear core; softmax attention sits on top with gain s. Ablation for "how much does guaranteed
contraction cost": BPTT-mono = 2.11 vs BPTT-thick 1.95.

### How attention is "EP-ified": the two routes, explicitly

1. **Energy route** (3.3): make attention conservative (tied value, LSE energy) so F = −∇E and
   vanilla EP is valid. Cost: expressivity (Q≠K asymmetry and free value are what make attention
   attention).
2. **Force route** (3.1, 3.2 — ours): keep real attention as a non-conservative force and repair
   the *estimator* instead, with the AEP correction (Sec 4.2) which restores exact gradients for
   non-reciprocal couplings. Validated gradient cosine vs autograd: attention 0.99, FFN 1.00,
   full LM block 0.99 (and FA, for contrast, gives Q/K/V ≈ 0.25, upstream FFN ≈ −0.01).

### How attention and FFN are trained *jointly*

There is no per-module schedule or pipeline: attention, FFN, LN affines, and embeddings are all
terms of the **same force** F. One free relaxation + one nudged ensemble produce the contrast state
`a`; every parameter θ gets its gradient from the single vector-field formula ∇θ = ∂⟨a, F(z*,θ)⟩/∂θ
(Sec 4.1), which decomposes into purely **local** per-term updates (each force term touches only its
own parameters). The readout Wh is the only separately-trained parameter (local CE gradient).

## 4. Training rule

### 4.1 Vector-field (force-form) EP with symmetric nudging

- **Free phase**: z⁰ = x_in; z^{t+1} = z^t + ε F(z^t), T1=150, ε=0.1 → z*. Monitor relative
  residual `res = ‖z⁺−z*‖/‖z*‖` (one extra step) — the load-bearing health signal (Sec 5).
- **Nudged phases** (±β, β=0.02, T2=20 steps from z*): relax the augmented force
  `F(z) ∓ β ∇_z CE(z)` (the CE gradient w.r.t. the *state* — local at the readout).
- **Contrast**: `a = (z₋ − z₊)/(2β)` ≈ −dz*/dβ (centered ⇒ O(β²) bias; Laborieux-style symmetric
  nudging).
- **Parameter update** (force form, valid for non-gradient dynamics): for all force params θ,
  `∇θ L ≈ ∂/∂θ ⟨a, F(z*; θ)⟩` — one autograd call **at the fixed point only** (this is a local
  Hebbian-style contrast in θ for each force term; autograd here is per-term bookkeeping, not
  backprop through time/steps).
- At a converged fixed point and β→0 this equals the implicit/equilibrium gradient
  (Scellier–Bengio; Ernoult: EP ≡ BPTT stepwise under convergence).

### 4.2 AEP correction (non-conservative repair)

For non-conservative F the naive nudged phase linearizes around z* with Jacobian J, but the correct
adjoint needs Jᵀ. Following **AsymEP** (Scurria, Vanden Abeele, Mognetti, Massar, "EP for
Non-Conservative Systems", arXiv:2602.03670), we add to the nudged force the term `−(Jv − Jᵀv)`,
v = z − z*, where J = ∂F_nc/∂z at z* and F_nc = the non-conservative part (attention, or
attention+FFN for `thick`). This is **identical to their `−2 A_J(z*)(z−z*)`** (A_J = antisym part of
J; `Jv−Jᵀv = (J−Jᵀ)v = 2 A_J v`) — **the correction is theirs, not ours**. What is ours on this
line: (i) the **matrix-free jvp+vjp** form (one of each per nudged step) — AsymEP builds the full
Jacobian explicitly and decomposes it, which is infeasible at transformer state dim B·T·C; (ii) the
application to **softmax attention** (data-dependent Jacobian — they test only feedforward
nets/Hopfield on static MNIST/CIFAR, no attention/sequence); (iii) the **holomorphic combination**
(§4.3 — the correction is real-linear so it preserves holomorphy; they use plain ±β); (iv) the
**common-mode-tracking** linearization variant (§4.3/below). Their force-form **VF** readout (= our
`⟨a,∂F/∂θ⟩`) is *prior art* and is the baseline that collapses without the correction (their CIFAR
VF = 10% chance), matching our cos≈0.25 for uncorrected attention. Effect: nudged-phase Jacobian J → Jᵀ ⇒ a approximates the *adjoint*
response −(I−Jᵀ)⁻¹-type solve ⇒ exact gradients for Q≠K attention (measured 0.99–1.0).
Caveats: the correction is **linearized at z***, so the nudged trajectory must stay in the
linear-response window — T2≈20 at ε=0.1 is inside; T2=60+ can leave it (Sec 7).

### 4.3 Holomorphic EP upgrade (current)

Replace the 2-point real ±β difference by N points on a **complex circle** β_k = r·e^{2πik/N}
(Laborieux–Zenke 2022): relax the *holomorphically extended* dynamics (manual complex LN with
non-conjugate variance, softmax as exp-ratio, tanh-form GELU; the AEP correction is linear with
real coefficients so it preserves holomorphy — apply to Re/Im parts separately), then read
`a = −Re[(1/Nr) Σ_k e^{−iφ_k}(z_k − z*)]` (discrete Cauchy formula; bias O(r^N) instead of O(r²)).
**No clamps inside the holomorphic nudge** (clamps are non-analytic and break the bias order).

Probe findings (cos vs long-horizon-BPTT reference, 300-step-pretrained thick block):
- The **clamps were the dominant estimator error at marginal residuals**: at res 1.6e-3, plain EP
  cos 0.27 → clamp-free 0.89. (The clamps existed to protect early training; they were silently
  poisoning mid-training updates whenever res drifted up.)
- N and r are flat (N=2…8, r=0.02…0.2 all ≈ equal): finite-β bias and 1/β noise are *not* the
  binding error at this scale.
- The remaining ~0.12 misalignment is **T2 truncation**: with stable nudged dynamics, T2=120 →
  cos 0.985. But on slow-mixing batches long T2 diverges (AEP linearization error compounds on
  near-critical modes). Step-size-based early stopping FAILS (non-normal transient growth triggers
  it at t≈6–39; same pathology that makes spectral radius the wrong free-phase signal).
- **Adaptive T2, solved by hindsight selection** (`holo_a_select`): run the nudged phases to
  T2max=120 in lockstep, snapshot the contrast a_t every 10 steps, return the snapshot with the
  smallest increment (= most settled), early-exit only on clear blowup (inc > 5× the running min).
  Judging stability by increments of the *quantity of interest* — not step sizes — makes transient
  growth harmless. Probe: never worse than fixed T2=20; mean cos 0.871 → **0.932** at tight
  equilibria (0.853 / 0.987 / 0.956 per batch; the dangerous batch self-limits to t≈20–30).
- **Adaptive T1 companion**: long-T2 gains require a tight free phase (res ≲ 1e-4; at res ~1e-3
  long T2 actively hurts). So the free phase is two-stage: the λ-controller's residual signal is
  still sampled at T1=150 (R9 semantics preserved — no new λ war), then relaxation continues in
  chunks of 50 until res ≤ 1e-4 (cap 500) before the nudged phases. Compute buys tightness;
  λ pressure does not.

Training outcomes: **R7** (N=2, r=0.02, fixed T2=20): best 2.0289, faster wall-clock than plain EP
(3.3 vs 2.45 it/s — the holomorphic nudge's ∇_z CE is closed-form). **R10** (R9's controller +
adaptive T1/T2): **best 1.6755** (sustained EMA plateau 1.68–1.70 around step 8–10k; ~0.7 it/s).

## 5. Stabilization & regularization — what, and exactly why

**The governing fact (measured, not assumed):** the EP estimator has a **validity threshold** in
free-phase residual. Gradient cosine vs exact reference: res ≈ 5e-5 → 0.85–0.88; res ≈ 1e-3 →
0.2–0.9 (batch-dependent); res ≈ 3e-3 → ≈ 0–0.5; res ≈ 1e-2 → noise. BPTT has no such threshold
(it differentiates the actual finite computation, converged or not) — *this asymmetry, and nothing
deeper, is the EP-specific difficulty*. There is **no structural ceiling**: an early "EP caps at
~2.5" verdict was refuted (it conflated two undertrained/invalid-regime runs; see FINDINGS).

Each stabilizer and its reason:

1. **Damping −c·z** — creates/strengthens a fixed point for raw attention forces; *required* for
   the thin/real variant (attention alone has no fixed point at high gain: residual floor ~3e-2,
   no equilibrium to find). **Caveat for LN-inside blocks (`thick`)**: damping shrinks ‖z*‖ and the
   LN Jacobian scales like 1/σ(z) ⇒ damping *inflates* the effective Jacobian — measured: thick
   plain-relax residual 8.8e-3 at c=0 vs 3.4e-2 at c=2. So for `thick`, c is kept small (1) and is
   NOT the stabilizer; the Jacobian penalty is.
2. **Soft Jacobian penalty** λ‖J_nc(z*)‖²_F (Hutchinson estimator: one jvp on a random probe vector,
   differentiated w.r.t. θ; Bai et al. 2021 "Stabilizing Equilibrium Models by Jacobian
   Regularization") — the actual stabilizer: keeps the free phase contractive ⇒ keeps the estimator
   inside its validity region. Soft penalty ≻ hard constraints (spectral-norm capping the attention
   matrices to ρ=0.9 was tried: too restrictive, kills learning — consistent with FRE-RNN-style
   regulation being preferable to hard projection).
3. **Why the control signal is the residual, NOT the spectral radius**: the attention/block Jacobian
   is highly **non-normal** — transient growth is invisible to eigenvalues (measured: ρ(J)=0.94
   "stable" while the relaxation diverged at res 0.21). The one-step residual *is* the transient;
   control on it.
4. **Continuous λ controller**: λ ← clip( λ · (res_EMA/target)^0.3 , floor, 16 ), per step.
   - **EMA on the signal (0.9)**: the raw residual is noisy; a multiplicative controller on a noisy
     signal random-walks (measured thrash λ 0.5↔13 when the target sat at the noise floor) and the
     thrashing λ itself perturbs training. EMA removed it and gave the current best run.
   - **Target = 5e-4**: just inside the validity threshold (few·1e-4), with margin; NOT tighter —
     demanding res ≪ threshold buys nothing and costs expressivity (a 2e-4-target run was worse).
     For reference, BPTT's own optima sit at res ~1e-3–2e-2: good solutions are only mildly
     contractive; we ask for slightly more than BPTT needs, because our *estimator* needs it.
   - **Floor**: λ may shrink when res is healthy but **must not vanish — the floor is
     load-bearing**. Floor=λ₀ (never off) is a permanent tax (2.150); floor 0.1 is the sweet spot
     (2.078 → 2.047 with signal-EMA → 2.029 with the holomorphic estimator). Two independent runs
     prove λ≲0.02 is fatal at *any* stage: λ→0 from the start (R2) and λ-floor annealed with lr
     (R6) both ended in the same death — val CE 60–77 with res ≡ 0.0. Post-mortem: this is **an
     explosion disguised as convergence by floating point**, not a dead state: ‖z*‖ and the
     uncapped parameters (tok/pos/fc/pj) blow up under temporarily-invalid gradients until
     ε·F < ulp(z) and the relaxation freezes (res = 0 by absorption), with huge confidently-wrong
     logits. The λ penalty (whose θ-gradient touches fc/pj/LN/attention) is what keeps that basin
     out of reach; it cannot be annealed away. The late-drift hypothesis "persistent penalty
     gradient vs vanishing task signal" is hence only half-true — the persistent pressure is also
     the anti-collapse mechanism. Current anti-drift attempt: **parameter EMA** (decay 0.999,
     evaluated alongside raw weights), which targets late-phase estimator-noise wander without
     touching the stability loop at all.
5. **Weight-norm caps** (renorm to 3× init norm on WQ,WK,WV,WO,Wm,Wh after each step): blunt safety
   net against runaway during transients when the estimator is temporarily invalid. Rarely binding
   in healthy runs.
6. **Nudge clamp g.clamp(±2) and AEP-correction clip (‖corr‖ ≤ ‖F‖)** — *legacy*: protected the
   nudged relaxation early in training, but measured to be the main estimator error at marginal
   residuals (cos 0.27 → 0.89 once removed). Replaced by the clamp-free holomorphic nudge; a
   non-finite-gradient step-skip + the λ controller now carry the early-training safety.
7. **Optimizer**: AdamW lr 1e-3, wd 1e-4, cosine to 5%, grad-norm clip 5.0, skip non-finite steps.
   β(EP)=0.02, ε=0.1, T1=150, T2=20 everywhere unless stated.

### 5.x T1-residual penalty (`--resreg`) — defend the evaluated state (2026-06-20)

EP's gradient is the fixed-point/implicit gradient: it only cares WHERE the fixed point is, not how fast the
relaxation reaches it, so it has no reward for keeping the block contractive. BPTT — differentiating the finite
T1=150 unroll, which is what eval actually uses — gets that reward implicitly (a non-converging unroll → bad
output → high CE). This asymmetry is why frozen-jr EP diverges past ~2.09 (res inflates → forward bifurcates to
a limit cycle) while exact-BPTT with the identical recipe descends to 1.72 (see FINDINGS 2026-06-20; the EP run
refines the free phase to t1max=300=z* and grades there, so it never feels the residual of the evaluated z150).

The fix gives EP that missing term explicitly — penalize the T1 free-phase residual of the state actually
evaluated, `z150 = relax(xin, T1)` taken BEFORE any t1max refinement:
- `R_res = ‖ε·F(z150)‖² / (‖z150‖²+ε)`, gradient w.r.t. θ with z150 detached (`blk.tforce`);
- scaled task-relative: `ratio = resreg·min(1, res@T1 / 2e-2)`, deadband `res@T1 > 7e-4`,
  `λ = ratio·‖g_task‖/‖g_res‖`, added to the EP gradient.
- **Run with `res_gate=0`** — the validity gate early-returns (jacreg-only) above the gate, which would bypass
  the penalty exactly when res is high. Keep `t1max=300` (estimator accuracy) + the penalty (defends z150).

Analog-compatible (one extra force measurement + the same local vector-field gradient, no digital root-finder)
and more targeted than jacreg (which penalizes ‖J_nc‖_F, not the actual residual vector that explodes). Validated
res-tight through step 1000 / best 2.0573 (past the 2.09 wall) before a /tmp wipe; full re-validation pending.

## 6. Validation methodology (how we know the estimator/claims are right)

- **Gradient-cosine probes**: at a fixed realistic operating point (300 BPTT steps from init —
  no contraction penalty, the "natural" weight region), compare every estimator against a
  long-horizon BPTT reference (T1=400), per parameter group (attn / ffn / LN / emb). This is what
  exposed the validity threshold, the clamp damage, and the T2 truncation.
- **Horizon control**: BPTT-150 vs BPTT-400 cosine is itself only 0.35–0.77 on slow-mixing batches
  — the "finite horizon vs true equilibrium gradient" cost is shared by everyone, EP is not
  special; at matched horizon EP is within ~0.15 of BPTT.
- **BPTT-as-ablation**: BPTT on the identical architecture isolates *training-rule* cost (EP−BPTT)
  from *architecture* cost (BPTT−BP). BPTT is an ablation, not the target; BP is the target.
- **Same-graph eval**: val CE is computed through the same T1-step relaxation used in training, so
  no train/eval mismatch flatters either method.
- **Gradient-cosine has a lifecycle**: early/mid training it measures estimator quality (0.93 →
  0.79–0.85 across scale); late in training, at slow-mixing trained points, even two *exact*
  gradients at different horizons decorrelate (cos(BPTT-150, BPTT-800) = 0.25 at the trained S1
  point) — no single "true gradient" exists to cosine against, and the meaningful arbiter becomes
  the training outcome on the horizon-matched eval objective. Validity-threshold claims here are
  early/mid-phase statements. Late-phase corollary: EP's res target 1.5e-3 at S1 is already
  optimal — cos rises monotonically with tightness and the loose-weights/refined-measurement
  variant nulls at training level: the measurability-contraction tax is rigid across the interval
  (the physical escape is oscillatory/lock-in measurement, not operating-point engineering).

## 7. Results

**14k-step matched comparison (the honest table; thick block ≈ BP MLP=4 in parameter shape):**

| training rule | architecture / recipe | best val CE |
|---|---|---|
| BP | standard transformer, MLP=4 (**like-for-like for thick**) | **1.610** |
| BPTT + R9's λ-controller + param-EMA | thick (exact grad, same stabilization as EP) | **1.635** — tail stable |
| **EP (R10)** | **thick; R9 + adaptive T1 (refine to res≤1e-4) + adaptive T2 (selection, cap 120)** | **1.676** (EMA plateau 1.68–1.70) |
| BP | standard transformer, MLP=1 (thin-matched) | 1.689 |
| EP (R9) | thick; holo nudge + recalibrated controller (target 1.5e-3, λmax 4) + param-EMA | 1.740 |
| BPTT (exact grad) | thick, unregularized | 2.021 — **destabilizes late** (res→4.7e-2, val→3.0) |

3k-era and ablation numbers (shorter schedule):

| run | best val CE |
|---|---|
| BPTT thick, 3k (its best showing) | 1.949 |
| EP R7: holo estimator, old tight controller (target 5e-4) | 2.029 (late λ pinned 16 ⇒ drift) |
| EP R8: R7 + param-EMA | 2.031 (EMA alone ≠ fix; the λ fight dominates) |
| EP R5/R3: plain estimator generations | 2.047 / 2.078 |
| BPTT monDEQ / thin, 3k | 2.111 / 2.206 |
| EP R2 (λ→0) / R6 (λ-floor∝lr) | 2.357 / 2.501 — both die by fp-absorption explosion |
| random | 4.174 |

Reading: (a) final decomposition — **architecture tax ≈ 0.025** (1.635 vs 1.610), **EP rule tax
≈ 0.041** (1.676 vs 1.635), total **0.066** to the like-for-like BP transformer; EP beats the
thin-matched MLP=1 baseline. (b) EP beats *bare* BPTT at both horizons, but the controlled
comparison shows most of that win was EP's mandatory stabilization loop doubling as regularization
— bare exact-gradient training walks off the contractive manifold at 14k, and the same controller
that EP cannot live without lifts BPTT to 1.635 (also beating MLP=1): **the contraction controller
is good for the architecture regardless of training rule; EP simply forced its discovery.**
(c) The estimator and the controller must be **co-designed**: upgrading the estimator
(holomorphic, clamp-free) widened the validity region from res≲5e-4 to ~1.5e-3, and re-calibrating
the controller to that wider region (R7→R9) was worth **0.29**; adaptive T1/T2 (R9→R10) was worth
another **0.064**, matching the probe's cos 0.871→0.932. (d) **Multi-seed confirmation (3 seeds per arm)**: EP
1.6755/1.6851/1.6786 → **1.680 ± 0.005** vs BPTT+controller 1.6348/1.6459/1.6365 →
**1.639 ± 0.006**; the rule tax is **0.041 ± 0.005 (~9σ)** — real, tightly reproducible, and
consistent with the measured estimator misalignment (cos 0.85–0.93).

**Scale rung S1 (TinyStories char, C=256 H=8 T=256, 0.92M params; random ln127 = 4.84):**

| run | best char-CE (BPC) |
|---|---|
| BP same-shape, 14k | 0.827 (1.19) |
| **BPTT-ctl, loose target 1e-2, 14k** | **1.009 (1.46)** |
| BPTT-ctl, tight target 1.5e-3, 14k | 1.521 — ⇒ **controller-mismatch tax 0.51** |
| EP v4b (validity gate, lr 1e-3, 20k from scratch) | 1.393 |
| EP L2 (v4b recipe, 40k from scratch) | 1.214 |
| **EP warm-track (v4b → phase-2: common-mode tracking + loosened target)** | **1.141** — EP champion |

S1 scale lessons: (a) containment must scale with model size (λ ceiling, cap list, fuse);
(b) the **validity gate is load-bearing** — off-equilibrium EP updates poison weights (three
deaths before the gate, alive after); (c) the estimator validity threshold tightens with scale
(res 1e-4 → 1e-5 for full quality; rescue is compute-bounded, saturating at cos ≈ 0.85);
(d) **the controller operating point is part of the training rule**: EP needs validity-tight
targets, exact-gradient methods want loose ones — match controllers for rule-tax measurements,
but report each method at its own best operating point for ceilings.

**Optimizer pricing (S0-Shakespeare, R10 recipe, 8k steps; AdamW ≈ 1.70 at 8k):** EP-SaI
(per-tensor lr from init g-SNR, frozen = one calibrated gain line per array) **2.048**;
SGDM 2.166; Lion 2.175; Lion+LARS 2.244. Per-tensor calibration recovers ~0.12 of the 0.47
uniform-scale gap; the remaining ~0.35 measures the value of per-coordinate adaptivity under EP's
noisy heteroscedastic gradients. Pretraining therefore stays in the digital shell; fine-tuning is
exempt (SGD suffices in the RL/fine-tune regime, with <0.02% sparse updates — an endurance gift;
Mukherjee et al., arXiv:2602.07729).

**Hardware twin v4 (S0, 8-bit program-verify + 30% static mismatch + σ=1e-4 white + 4× restart
averaging): best 1.937 — 90% of the clean improvement** (clean 1.68). Noise laws measured:
contrast pollution strictly linear in σ; √N restart averaging; snapshot SNR ≈ 1/53 at σ=0.3%,
r=0.2 ⇒ hardware needs ~10³–10⁴ lock-in averages per update (ms at MHz loops — physically trivial,
digitally prohibitive: the noise dimension repeats the compute story). Discovery en route: the
frozen AEP correction has a clean-environment instability at nudge horizons ≳150–300 steps
(spectrum of J(z) − J* + J*ᵀ uncontrolled) — windows ≤120 steps + restart averaging circumvent it;
the single-trajectory oscillatory (true lock-in) estimator awaits a fix for this horizon limit.

## 8. Open problems

1. **Late drift — mostly solved, mechanism identified**: the drift was the controller *fighting*
   the weights (λ pinned at max enforcing a target the upgraded estimator no longer needs; R7/R8
   tails). Re-calibrating target/λmax to the holomorphic estimator's wider validity region removed
   the fight (R9: λ stays 0.1–0.5, tail drift shrank from ~0.3 to ~0.15 above best). Refuted
   routes: λ-floor annealing (R6 ⇒ fp-absorption explosion — the floor is load-bearing);
   param-EMA alone (R8 — smooths wobble ~0.05, can't fix the fight). Residual ~0.15 tail drift
   remains open (estimator direction bias near optimum is the suspect).
2. **Adaptive T2 — SOLVED** by hindsight snapshot selection (§4.3): judge by increments of the
   contrast estimate, not step sizes; select the most-settled snapshot. Probe mean cos 0.932;
   training −0.064 val CE (R10). Possible refinements: larger N with selection, per-batch T2max.
3. **Mixing time**: slow-mixing equilibria make *all* gradients horizon-expensive (BPTT included);
   conditioning the dynamics for fast mixing (preconditioned/Anderson relaxation that preserves the
   EP contrast structure) is unexplored here.
4. **Scale**: depth-1 block, char-level, C=128. The mechanisms (validity threshold, non-normality,
   controller design) are dynamics-level and should transfer; the constants will move.

## 9. Hardware translation — can this now-complex algorithm still run on EP hardware?

Audit of every component of the final recipe (R10) against an analog/in-memory substrate. The
surprise: most of the added "complexity" is *control*, and control is cheap in analog; several of
our fixes specifically REMOVED digital artifacts.

| algorithm component | analog realization | difficulty |
|---|---|---|
| free phase (T1≈500 Euler steps — our digital bottleneck) | physical settling, ns–µs, "free" | trivial (hardware's whole pitch) |
| adaptive T1 ("relax until res≤1e-4") | settling detector = comparator on dz/dt | trivial |
| symmetric nudging ±β | output-node current injection | standard EP hardware |
| **holomorphic N-point circle** | **AC-modulated nudge + lock-in (homodyne) detection** — the Cauchy sum over phases IS lock-in readout; this is Laborieux–Zenke's "finite-size oscillations" taken literally | standard measurement technique; *more* native than DC differencing, and the standard weapon against analog noise floors |
| clamp removal (our biggest estimator fix) | hardware never had clamps; saturation is smooth | already done by physics |
| VF update ⟨a, ∂F/∂θ⟩ | local Hebbian outer product (contrast × presynaptic activity) per crossbar; autograd was only digital bookkeeping | native |
| λ-controller (residual → EMA → multiplicative λ, floor/cap) | a **neuromodulator/homeostatic loop**: 1 measurable scalar (settling ripple) → RC filter (the EMA) → log-domain integrator with rails (floor/cap) → global broadcast scaling a local anti-Hebbian (contraction) rule | a handful of analog components |
| λ floor = anti-collapse (R2/R6 lesson) | minimum leak conductance — never let homeostasis switch off | natural |
| adaptive T2 (snapshot selection) | sample-and-hold bank on the *contrast* signal + stability gate on the OUTPUT quantity (the transferable lesson: never gate on state velocity — non-normal transients fool it) | cheap |
| weight caps (3× init) | device conductance range | physics gives it free |
| param-EMA | slow/fast weight pairs (volatile + nonvolatile device) | known proposals |
| AdamW | per-synapse capacitor for momentum; second moment is the real gap (shared by all analog-training schemes) | open engineering |
| softmax attention + LN circuits | analog WTA / divisive-normalization primitives; data-dependent T×T attention in-memory | hard, but an *inference*-hardware problem shared by all analog-transformer efforts, not EP-specific |

**The two genuine research obstacles:**

1. **The AEP correction (J → Jᵀ) in physics.** Crossbars give Wᵀ for free (drive the other side),
   but the correction needs the *circuit's* transposed Jacobian, including data-dependent softmax
   parts. The classical answer exists: the **adjoint network** (Director–Rohrer 1969, circuit
   sensitivity theory) — constructible by reversing non-reciprocal elements; nobody has built
   AEP-learning with one yet. The alternative is a measured price list from our own ablations:
   accept *reciprocal* (energy-based, tied-value) attention and skip the correction entirely,
   costing ~0.15–0.2 CE (monDEQ 2.11 / energy-mode vs thick 1.95 under exact gradients).
2. **Precision budget vs the validity threshold.** Analog noise floors (~1%, 8–10 effective bits)
   sit exactly where we measured EP gradients dying (res ~1e-2). Mitigations, all quantified here:
   the N-point estimator tolerates **r=0.2** (10× nudge signal at equal bias — probe: flat in r);
   lock-in detection buys orders of magnitude of SNR below the noise floor; free and nudged phases
   run on the *same* devices so static mismatch cancels in the contrast (EP's structural advantage
   over on-chip backprop — note the adjoint path partially forfeits this and needs care); and
   hardware's 10⁶× speed headroom converts to phase-averaging. Our cos-vs-residual and cos-vs-r
   tables (§5, §4.3) are, read this way, the **spec sheet** for an analog EP-transformer design.

**Memristor-crossbar specifics — the Jᵀ question.** Needing Jᵀ does NOT disqualify memristor EP
platforms; crossbars are the most transpose-friendly analog substrate there is (drive rows→read
columns = Wx; drive columns→read rows = Wᵀy — the property on-chip-BP designs rely on). The fork:
(a) *passive-reciprocal* platforms (resistor-coupled, Kirchhoff/coupled-learning style) are J=Jᵀ by
physics — no correction needed, but they cannot express non-reciprocal attention even at
inference; on these, run the **reciprocal recipe**: LSE energy attention (tied value) + Hopfield
FFN — whose force 2·relu(zWm)Wmᵀ uses ONE crossbar driven in both directions — at the measured
~0.15–0.2 CE expressivity cost, zero hardware changes. (b) *active-periphery* platforms
(DAC→crossbar→ADC loops; periphery already breaks reciprocity) get Jᵀ as **transposed reads of the
same arrays + frozen local gains** (gelu′, softmax p, LN 1/σ from the free-phase operating point)
— the Director–Rohrer adjoint network in crossbar form: a periphery/routing redesign, not a
different device technology. Scope note: Jᵀ appears ONLY in the nudged phase applied to one vector;
the free phase needs nothing, weight updates are local outer products, and even the λ penalty needs
only ‖Jv‖² (forward perturbation + response energy, no transpose). Unlike on-chip BP there is no
per-layer activation storage or strict reverse scheduling — one held operating point z* suffices.
Numerics for the hardware team: large nudge amplitude r≈0.2 + multi-phase/AC (lock-in) readout is
validated equivalent to r=0.02 (10× signal headroom); small-signal DC differencing dies at ~1e-3
noise (our tf32 experiment: cos→−0.03). Suggested collaboration phasing: (1) reciprocal demo on
the existing rig (zero redesign, pay 0.2), (2) transposed-periphery nudge → full AEP, buy it back.

**On "training arbitrary analog circuits"** (the bigger question): classic EP requires
energy-based (reciprocal) circuits. AEP lifts this to *any circuit with a stable fixed point* —
IF the antisymmetric correction is realizable (adjoint network) or waived (reciprocal trade).
What this project adds to that picture is the missing stability half: **training pushes arbitrary
circuits off the contractive manifold** (bare-BPTT-14k showed even exact gradients walk off it),
and a residual-driven homeostatic controller both prevents this AND improves learning — with the
hard constraint that its floor never anneals to zero (fp-absorption collapse; in hardware:
latch-up). Combined with agnostic/physical EP (Scellier et al. 2022 — no circuit model needed,
contrast is measured) and small-scale physical demonstrations (Dillavou et al.'s self-learning
resistor networks; Laydevant et al.'s Ising-machine EP, 2024; memristor activity-difference
training), the pieces for "arbitrary stable analog circuit + adjoint or reciprocity + homeostatic
contraction control = trainable" are all individually demonstrated; this work supplies the
control law and the quantitative budgets.

Inference note: causal attention's lower-triangular coupling means autoregressive generation
settles *incrementally* — a new token's state relaxes with past states frozen, so EP inference is
one physical settling per token, not a re-relaxation of the sequence.

**Component BOM (assuming the current recipe survives the ladder unchanged):**
(1) bidirectional-read analog weight arrays (ReRAM/PCM/analog-Flash/gain-cell/switched-cap) — all
W and Wᵀ including the AEP adjoint reads; (2) state-integrator arrays (capacitor+OTA per state
variable; K·T·C nodes — ~1M at the 33M demo, ~17M at 0.6B; the sequence dimension T dominates);
(3) analog attention primitives — large-fan-in current/charge-domain softmax-WTA + T² score
sample-and-hold for the frozen nudge gains — **the hardest, least-shelf-ready component**;
(4) divisive-normalization circuits (LN/RMS/qk-norm); (5) mixed-signal periphery: DAC/ADC arrays,
S&H banks, and **lock-in (synchronous-detection) channels** — large-r AC nudging is mandatory
(small-signal DC differencing dies at analog noise floors; measured digitally via the tf32
experiment); (6) control plane: settling comparator + RC filter (res-EMA) + log-domain integrator
with rails (λ controller) + a global **learn-enable line (= the validity gate)** + fuse — a
handful of components or one MCU; (7) weight-update machinery: coincidence pulse programming
(local outer products), with device nonlinearity/endurance the classic pain point; (8) an FPGA
phase sequencer (settle→hold→nudge±→snapshot→update).

**Six-month prototype plan (borrow physics, don't fab — main track: optical).**
*Primary — desktop optical EP machine (Goodman MVM + electronic loop):* one off-the-shelf LCoS SLM
(~2M pixels, $15–25k) holds, with differential encoding, ~1M signed analog weights — the fully
digitally-validated R10 thick block (C=128, 12C² ≈ 200k weights) occupies **one tenth of one SLM**
(5× headroom). Weights (WQ/K/V/O, FFN) static on the SLM during settling; state z (C=128) cycles
through a 128-channel DAC-driven source array → SLM → cylindrical-lens summation → photodiode
array; nonlinearity, T² attention scores (negligible digitally at T=64), Euler integration, and
the λ/gate control law in loop electronics; one loop pass = one Euler step. **Timing is set by
B·T multiplexing, not settling**: each Euler step = B·T·(~6 matrices) MVM passes; at loop rates
0.1–1 MHz and B=4–8, settle ≈ sub-second and a 14k-step training run ≈ hours; SLM refresh once
per training step (60 Hz ample). **Wᵀ for AEP: program the transposed panels alongside W in the
spare SLM area** (reverse-propagation reciprocity remains the Phase-2 elegance; don't gate the
prototype on bidirectional alignment). Precision framing: master weights live fp32 in the digital
shell, the SLM holds a fresh ~8-bit projection each step — standard QAT regime; the open question
is per-pass multiplicative noise + slow drift (speckle/calibration, the optical ~1% floor), which
is exactly what the spec-sheet arsenal (r=0.2 nudging, lock-in/homodyne — the field's native
measurement, same-device contrast cancellation, λ-controller) was built for, and which is
**pre-validated digitally by an optics-noise-model run** (8-bit weight quantization per step +
1–2% multiplicative force-eval noise + drift) before any purchase. Novelty: photonics has in-situ
BP and BP-free local learning (Science 2023 ×2); EP-on-optics exists only as oscillator theory —
"EP-trained transformer on optics" is unclaimed. Budget $20–50k; optics ~2 months, loop
electronics 1–2 months (rehearsed by the PCB track, same parts/skills), calibration+training 2.
*Secondary (one day of email, no more):* Mythic M1076 as a borrowed settle engine — gated solely
on SDK raw-MVM access + incremental writes; flash endurance marginal beyond few-k-step demos.
(Laydevant et al.'s D-Wave EP precedent: reviewers accept rented physics.)
*Fallback / rehearsal (zero-dependency):* board-level reciprocal block (C=8–16, Hopfield Wm
driven bidirectionally + energy attention), digipot/MDAC weights, OTA+cap integrators, Red Pitaya
AC nudge + lock-in, comparator+RC λ loop — small headline, but lands continuous settling, lock-in
contrast readout, and the homeostatic control law in real electronics, and its loop electronics
ARE the optical track's loop electronics. Six months buys no foundry CMOS — but **university cleanrooms (e.g., UIUC HMNTL) are a different
category, and they fabricate exactly the one BOM item money can't buy: the weight arrays.**
Passive BEOL memristor crossbars (bottom electrode / ALD HfOx-TaOx / top electrode, 3–4 masks,
µm linewidth, contact or maskless litho) are the academic-cleanroom comfort zone; the practical
per-array limit is sneak-path-set (~64×64–128×128 for 1R passive with V/2 biasing), and a handful
of tiled arrays covers a C=32 reciprocal block (C=128 ≈ a dozen-array wiring project). FeFET
(three-terminal, on the same line's ferroelectric pedigree) cures sneak paths for a few extra
masks. The algorithm side has already bought insurance for first-batch device quality: Phase-1
keeps fp32 masters with **program-verify writes (only ~6-bit iterative programmability needed —
no pretty pulse physics)**, and 10–50% device mismatch is absorbed by same-device contrast
cancellation + the λ-controller — validated in digital twin runs (per-step N-bit weight
projection + per-pass multiplicative noise + static mismatch; see `--wq_bits/--fnoise/--wmis`).
Discipline: fab ONLY the arrays; anything on Digi-Key stays COTS board-level (student-process
CMOS periphery would be stone-age). This raises the board-track ceiling from digipot (~10³
weights, C=8) to homemade crossbars (10⁴–10⁵ weights, C=32–64) without leaving campus. Execution
for a no-fab-experience team: the standard academic rentals — (1) recipe-owner collaboration
(their senior student runs their existing process; weeks of routine work; co-authorship; you
never gown up), (2) apprenticeship via facility training + staff engineers (executed by a
recruited student), (3) paid staff-run fabrication. Find the recipe owner before booking tool
time; lead the pitch with the device-twin plot ("your first-batch devices suffice — proven").

**Recommended Phase-1 architecture: analog equilibrium core + digital optimizer shell.** Physics
performs only the expensive part (settling + contrast measurement = the ×300–1000 digital
overhead); contrasts are ADC'd out, Adam/schedules/λ-logic stay digital, weights DAC back. This
sidesteps component (7)'s update-nonlinearity pain and the missing analog Adam, at no loss of the
compute advantage. Phase-2: full in-array updates.

**Sizing correction — causal serialization (the assumption that kills the wafer-scale monster):**
naive sizing assumes the whole sequence's state must be physically resident during settling
(K·T·C integrator nodes — hundreds of millions at 8B scale, 1e4–5e4 mm²). Causality removes this:
the free phase settles **token-by-token** (token t's equilibrium depends only on tokens ≤ t; past
states live in an ordinary digital KV-cache), and the nudged phase is the adjoint of a
time-lower-triangular system = an exact **reverse sweep** (upper-triangular back-substitution,
exact to the same order as the AEP linearization). Physical state requirement drops from K·T·C to
**K·C per token-slice** (÷T ≈ ÷2048): ~15 mm² of integrators; sequence caches are ~1 GB of
commodity DRAM. Readout also streams per token-slice (~10⁵ values/token, batch-accumulated in
charge domain) — ADC throughput lands at the standard IMC design point, wall-clock days for an
8B-Chinchilla run at MWh-scale energy. Remaining big item: weight arrays only — 0.6B ≈ 2–4 chips
@28nm (university-consortium scale, $5–20M staged program); 8B ≈ 5–8 reticle dies @7nm (gen-3).
New throughput consideration: serialized operation makes τ_settle the rate limit, and the spring
chain's ~K² mixing tax favors **shallow-wide stacks (K=4–8)** — which the ladder data already
supports (thick single block ≈ same-shape BP). Program staging: MPW single-block demo (33M-class,
$1–5M) → 0.6B 2–4-chip machine → 8B gen-3. Economics read: capex-dominated, opex→0 — the pitch is
2–3 orders of magnitude energy and edge/continual learning, not cloud-GPU rent replacement.

**Compute reality (digital simulation)**: EP's per-step force-eval budget E ≈ 700–3000 makes full
EP training cost ≈ E/3 ≈ **230–1000× the BP cost** at equal tokens (Chinchilla 20×: 0.6B ⇒ 12B
tokens ⇒ ~4.3e19 BP-FLOP ⇒ ~1e22-class EP-FLOP: H100-cluster scale for one run, years on 4×A6000).
This multiplier is exactly what physical settling eliminates — the algorithm is expensive in
digital silicon and native in physics. Note TinyStories (~0.6B tokens) Chinchilla-matches ~30M
params — precisely the planned "readable stories" demo scale (S4).

## 10. Code map (all on timan1)

- `/tmp/lt_ep/lt_ep_train.py` — main trainer: EQBlock (all four force variants), `ep_step`
  (VF-EP + AEP + optional holomorphic nudge), `bptt_step`, λ controller, caps.
  Key flags: `--mode ep|bptt --attn_mode thick|real|energy|mono --jacreg --jr_floor --res_target
  --res_ema --jr_lrcouple --holo N --hr r --c --T1 --T2 --eps --beta`.
- `/tmp/lt_ep/holo_ep.py` — holomorphic force/softmax/LN/GELU, `holo_a` (Cauchy readout), probe.
- `/tmp/lt_ep/grad_quality.py` — estimator-vs-exact cosine probe (validity threshold measurement).
- `/tmp/lt_ep/solver_wall.py` — plain vs Anderson free-phase convergence per damping level.
- `/tmp/lt_ep/bp_charlm.py` — param-matched standard BP transformer baseline.
- `/home/yurenh2/ept/cet_mvp.py`, `cet_aep.py`, `aep_*.py` — CET reproduction + AEP validation
  (vision side; gradient-fidelity numbers in Sec 3/FINDINGS).
- Run logs: `/tmp/lt_ep/thickep_*.log`, `H2_*.json`.
- Data: `/tmp/lt_ep/data/shakespeare_char/{train,val}.bin, meta.pkl`.

Hardware: 1× RTX A6000 per run (shared node); plain-EP ~2.4 it/s, holo-EP(N=2) ~1.5–2 it/s at
B=32, T=64, C=128, T1=150, T2=20. A 14k-step run ≈ 1.6–2.5 h.