1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
|
# Methods — EP-Trained Equilibrium Transformer Language Model
Complete technical notes for discussion: architecture, how attention/FFN are made EP-trainable,
the training rule, every stabilizer/regularizer and its reason, the LM setting, validation
methodology, results, and open problems. Code paths at the end. (Companion doc:
`/home/yurenh2/ept/FINDINGS.md` — the project arc and findings log.)
---
## 1. Problem statement
Train a transformer-class language model where **both attention and FFN learn without
backpropagation through the computation** — using Equilibrium Propagation: two (or N) relaxations
of the same dynamics plus a local contrast readout. The questions: (a) does it train at all,
(b) what does it cost vs the exact gradient (BPTT on the same architecture) and vs a standard
BP transformer at equal parameters, (c) what are the actual failure mechanisms.
Headline result (all at 14k steps, fully-controlled comparison): best EP model reaches
**val CE 1.676** (adaptive T1/T2, run R10). Like-for-like standard BP transformer (MLP=4 — the
same parameter shape as the thick block, see §2) reaches 1.610 ⇒ **total gap 0.066**, decomposed:
**architecture ≈ 0.025** (BPTT + the same stabilizer + same param-EMA on the identical block:
1.635) and **training rule ≈ 0.041** (EP 1.676 vs that control) — and EP beats the thin-matched
BP MLP=1 baseline (1.689). Unregularized BPTT *destabilizes* on long horizons (walks off the
contractive manifold, res→5e-2, best 2.021 — worse than its 3k run, 1.949): the stabilization
loop EP carries out of estimator necessity (residual-driven Jacobian-penalty controller) is what
the equilibrium architecture itself needs for long training. Random = ln 65 = 4.17.
## 2. LM setting (data, embeddings, readout, evaluation)
- **Corpus**: Shakespeare character-level (nanoGPT preprocessing): `train.bin`/`val.bin` uint16
token streams + `meta.pkl`, vocab = 65 chars (~1.1 MB text). Local copy:
`/tmp/lt_ep/data/shakespeare_char`.
- **Batching**: random crops, B=32 sequences × T=64 context; next-char targets (shift by 1).
- **Embeddings**: learned token table `tok ∈ R^{65×128}` (init N(0, 0.02²)) + learned absolute
positional table `pos ∈ R^{64×128}`. Input injection `x_in = tok[idx] + pos`. **Embeddings are
trained by EP too** — they enter the force through the input-clamp term −(z − x_in), so the same
vector-field readout (Sec. 4) delivers their gradient. No pretrained embeddings, no BP path.
- **Readout head**: logits = z* Wh, `Wh ∈ R^{128×65}`, trained with its **own local CE gradient**
∂CE/∂Wh at the free equilibrium (loss-adjacent layer — local learning suffices; this is standard
in EP setups and is not backprop through the dynamics).
- **Objective / eval**: mean next-token cross-entropy over all B·T positions; val CE = average over
8 fresh validation batches, computed by running the same free-phase relaxation (T1 steps) used in
training — i.e. the eval graph equals the inference graph. Random baseline ln(65) = 4.174.
- **Model size**: C=128, H=4 heads, single equilibrium block (weight-tied recurrence ⇒ effective
depth = T1). Parameter matching to the BP baseline depends on the variant: the **thin** block's
Hopfield memory (Wm: 128×256 ≈ 33k) matches BP **MLP=1** (2C² ≈ 33k); the **thick** block's
untied FFN (fc+pj = 2·4C² ≈ 131k) matches BP **MLP=4** (131k) — so thick-block results compare
against MLP=4 (1.610), thin-block results against MLP=1 (1.689).
## 3. Architecture: the equilibrium block
State `z ∈ R^{B×T×C}` (one vector per token position). Dynamics ż = F(z); inference = relax to
fixed point z* (Euler: z ← z + ε F(z), ε=0.1, T1=150 steps), predict from z*.
We built four force variants (all share the input clamp and the readout):
### 3.1 `thick` — DEQ-transformer block (the winner)
```
F(z) = −(z − x_in) input clamp (leak toward embedding)
+ Attn(LN1(z)) causal multi-head softmax attention, separate WQ WK WV WO
+ W2 · GELU(W1 · LN2(z) + b1) + b2 untied 4× FFN (W1: C→4C, W2: 4C→C)
− c·z damping (c=1–2)
```
LN1/LN2 carry learned affine (g, b). This is exactly a pre-LN transformer block written as a
*force* instead of a layer stack — same form DEQ uses as its fixed-point map. It is strongly
**non-conservative**: no scalar energy has this gradient (Q≠K asymmetric coupling, untied FFN).
EP is made exact for it via the AEP correction (Sec. 4.2).
### 3.2 `real`/thin — Hopfield-FFN + damped real attention
```
F(z) = −∇_z [ ½‖z − x_in‖² + E_mem(z) ] + s·(Attn(z) − c·z)
E_mem(z) = −Σ relu(z Wm)² modern-Hopfield / dense associative memory, Wm: C×256
```
The FFN here is **energy-based**: E_mem is the dense-associative-memory energy; its force
2·relu(zWm)Wmᵀ is a one-hidden-layer FFN with *tied* weights (Wm in, Wmᵀ out). Attention remains a
raw non-conservative force with damping. This was our first stable trainable variant.
### 3.3 `energy` — fully conservative attention (CET-style)
Attention folded into the energy: `E_att = −(1/γ) Σ_heads,i LSE_j(γ q_i·k_j)` (causal-masked,
**tied value** — the force of this energy mixes values v=k), plus `½c‖z‖²` confinement because
E_att+E_mem alone are unbounded below. F = −∇E exactly ⇒ classic EP applies with no correction.
This is the CET (Høier/Kerjan/Scellier) route, which we reproduced separately on vision
(masked CelebA/CIFAR completion, EP ≈ TBPTE). Trade-off: tied value + reciprocal coupling = the
least expressive attention.
### 3.4 `mono` — monDEQ-structured contraction
`F(z) = −(m·z + z PᵀP) + z(Q−Qᵀ)ᵀ + x_in − ∇E_mem + s·Attn(z)` — the linear part is a monotone
operator by construction (Winston–Kolter): symmetric part ⪯ −m·I guaranteed, antisymmetric part
(Q−Qᵀ) free (non-reciprocal coupling at no stability cost). Guaranteed unique fixed point for the
linear core; softmax attention sits on top with gain s. Ablation for "how much does guaranteed
contraction cost": BPTT-mono = 2.11 vs BPTT-thick 1.95.
### How attention is "EP-ified": the two routes, explicitly
1. **Energy route** (3.3): make attention conservative (tied value, LSE energy) so F = −∇E and
vanilla EP is valid. Cost: expressivity (Q≠K asymmetry and free value are what make attention
attention).
2. **Force route** (3.1, 3.2 — ours): keep real attention as a non-conservative force and repair
the *estimator* instead, with the AEP correction (Sec 4.2) which restores exact gradients for
non-reciprocal couplings. Validated gradient cosine vs autograd: attention 0.99, FFN 1.00,
full LM block 0.99 (and FA, for contrast, gives Q/K/V ≈ 0.25, upstream FFN ≈ −0.01).
### How attention and FFN are trained *jointly*
There is no per-module schedule or pipeline: attention, FFN, LN affines, and embeddings are all
terms of the **same force** F. One free relaxation + one nudged ensemble produce the contrast state
`a`; every parameter θ gets its gradient from the single vector-field formula ∇θ = ∂⟨a, F(z*,θ)⟩/∂θ
(Sec 4.1), which decomposes into purely **local** per-term updates (each force term touches only its
own parameters). The readout Wh is the only separately-trained parameter (local CE gradient).
## 4. Training rule
### 4.1 Vector-field (force-form) EP with symmetric nudging
- **Free phase**: z⁰ = x_in; z^{t+1} = z^t + ε F(z^t), T1=150, ε=0.1 → z*. Monitor relative
residual `res = ‖z⁺−z*‖/‖z*‖` (one extra step) — the load-bearing health signal (Sec 5).
- **Nudged phases** (±β, β=0.02, T2=20 steps from z*): relax the augmented force
`F(z) ∓ β ∇_z CE(z)` (the CE gradient w.r.t. the *state* — local at the readout).
- **Contrast**: `a = (z₋ − z₊)/(2β)` ≈ −dz*/dβ (centered ⇒ O(β²) bias; Laborieux-style symmetric
nudging).
- **Parameter update** (force form, valid for non-gradient dynamics): for all force params θ,
`∇θ L ≈ ∂/∂θ ⟨a, F(z*; θ)⟩` — one autograd call **at the fixed point only** (this is a local
Hebbian-style contrast in θ for each force term; autograd here is per-term bookkeeping, not
backprop through time/steps).
- At a converged fixed point and β→0 this equals the implicit/equilibrium gradient
(Scellier–Bengio; Ernoult: EP ≡ BPTT stepwise under convergence).
### 4.2 AEP correction (non-conservative repair)
For non-conservative F the naive nudged phase linearizes around z* with Jacobian J, but the correct
adjoint needs Jᵀ. Following **AsymEP** (Scurria, Vanden Abeele, Mognetti, Massar, "EP for
Non-Conservative Systems", arXiv:2602.03670), we add to the nudged force the term `−(Jv − Jᵀv)`,
v = z − z*, where J = ∂F_nc/∂z at z* and F_nc = the non-conservative part (attention, or
attention+FFN for `thick`). This is **identical to their `−2 A_J(z*)(z−z*)`** (A_J = antisym part of
J; `Jv−Jᵀv = (J−Jᵀ)v = 2 A_J v`) — **the correction is theirs, not ours**. What is ours on this
line: (i) the **matrix-free jvp+vjp** form (one of each per nudged step) — AsymEP builds the full
Jacobian explicitly and decomposes it, which is infeasible at transformer state dim B·T·C; (ii) the
application to **softmax attention** (data-dependent Jacobian — they test only feedforward
nets/Hopfield on static MNIST/CIFAR, no attention/sequence); (iii) the **holomorphic combination**
(§4.3 — the correction is real-linear so it preserves holomorphy; they use plain ±β); (iv) the
**common-mode-tracking** linearization variant (§4.3/below). Their force-form **VF** readout (= our
`⟨a,∂F/∂θ⟩`) is *prior art* and is the baseline that collapses without the correction (their CIFAR
VF = 10% chance), matching our cos≈0.25 for uncorrected attention. Effect: nudged-phase Jacobian J → Jᵀ ⇒ a approximates the *adjoint*
response −(I−Jᵀ)⁻¹-type solve ⇒ exact gradients for Q≠K attention (measured 0.99–1.0).
Caveats: the correction is **linearized at z***, so the nudged trajectory must stay in the
linear-response window — T2≈20 at ε=0.1 is inside; T2=60+ can leave it (Sec 7).
### 4.3 Holomorphic EP upgrade (current)
Replace the 2-point real ±β difference by N points on a **complex circle** β_k = r·e^{2πik/N}
(Laborieux–Zenke 2022): relax the *holomorphically extended* dynamics (manual complex LN with
non-conjugate variance, softmax as exp-ratio, tanh-form GELU; the AEP correction is linear with
real coefficients so it preserves holomorphy — apply to Re/Im parts separately), then read
`a = −Re[(1/Nr) Σ_k e^{−iφ_k}(z_k − z*)]` (discrete Cauchy formula; bias O(r^N) instead of O(r²)).
**No clamps inside the holomorphic nudge** (clamps are non-analytic and break the bias order).
Probe findings (cos vs long-horizon-BPTT reference, 300-step-pretrained thick block):
- The **clamps were the dominant estimator error at marginal residuals**: at res 1.6e-3, plain EP
cos 0.27 → clamp-free 0.89. (The clamps existed to protect early training; they were silently
poisoning mid-training updates whenever res drifted up.)
- N and r are flat (N=2…8, r=0.02…0.2 all ≈ equal): finite-β bias and 1/β noise are *not* the
binding error at this scale.
- The remaining ~0.12 misalignment is **T2 truncation**: with stable nudged dynamics, T2=120 →
cos 0.985. But on slow-mixing batches long T2 diverges (AEP linearization error compounds on
near-critical modes). Step-size-based early stopping FAILS (non-normal transient growth triggers
it at t≈6–39; same pathology that makes spectral radius the wrong free-phase signal).
- **Adaptive T2, solved by hindsight selection** (`holo_a_select`): run the nudged phases to
T2max=120 in lockstep, snapshot the contrast a_t every 10 steps, return the snapshot with the
smallest increment (= most settled), early-exit only on clear blowup (inc > 5× the running min).
Judging stability by increments of the *quantity of interest* — not step sizes — makes transient
growth harmless. Probe: never worse than fixed T2=20; mean cos 0.871 → **0.932** at tight
equilibria (0.853 / 0.987 / 0.956 per batch; the dangerous batch self-limits to t≈20–30).
- **Adaptive T1 companion**: long-T2 gains require a tight free phase (res ≲ 1e-4; at res ~1e-3
long T2 actively hurts). So the free phase is two-stage: the λ-controller's residual signal is
still sampled at T1=150 (R9 semantics preserved — no new λ war), then relaxation continues in
chunks of 50 until res ≤ 1e-4 (cap 500) before the nudged phases. Compute buys tightness;
λ pressure does not.
Training outcomes: **R7** (N=2, r=0.02, fixed T2=20): best 2.0289, faster wall-clock than plain EP
(3.3 vs 2.45 it/s — the holomorphic nudge's ∇_z CE is closed-form). **R10** (R9's controller +
adaptive T1/T2): **best 1.6755** (sustained EMA plateau 1.68–1.70 around step 8–10k; ~0.7 it/s).
## 5. Stabilization & regularization — what, and exactly why
**The governing fact (measured, not assumed):** the EP estimator has a **validity threshold** in
free-phase residual. Gradient cosine vs exact reference: res ≈ 5e-5 → 0.85–0.88; res ≈ 1e-3 →
0.2–0.9 (batch-dependent); res ≈ 3e-3 → ≈ 0–0.5; res ≈ 1e-2 → noise. BPTT has no such threshold
(it differentiates the actual finite computation, converged or not) — *this asymmetry, and nothing
deeper, is the EP-specific difficulty*. There is **no structural ceiling**: an early "EP caps at
~2.5" verdict was refuted (it conflated two undertrained/invalid-regime runs; see FINDINGS).
Each stabilizer and its reason:
1. **Damping −c·z** — creates/strengthens a fixed point for raw attention forces; *required* for
the thin/real variant (attention alone has no fixed point at high gain: residual floor ~3e-2,
no equilibrium to find). **Caveat for LN-inside blocks (`thick`)**: damping shrinks ‖z*‖ and the
LN Jacobian scales like 1/σ(z) ⇒ damping *inflates* the effective Jacobian — measured: thick
plain-relax residual 8.8e-3 at c=0 vs 3.4e-2 at c=2. So for `thick`, c is kept small (1) and is
NOT the stabilizer; the Jacobian penalty is.
2. **Soft Jacobian penalty** λ‖J_nc(z*)‖²_F (Hutchinson estimator: one jvp on a random probe vector,
differentiated w.r.t. θ; Bai et al. 2021 "Stabilizing Equilibrium Models by Jacobian
Regularization") — the actual stabilizer: keeps the free phase contractive ⇒ keeps the estimator
inside its validity region. Soft penalty ≻ hard constraints (spectral-norm capping the attention
matrices to ρ=0.9 was tried: too restrictive, kills learning — consistent with FRE-RNN-style
regulation being preferable to hard projection).
3. **Why the control signal is the residual, NOT the spectral radius**: the attention/block Jacobian
is highly **non-normal** — transient growth is invisible to eigenvalues (measured: ρ(J)=0.94
"stable" while the relaxation diverged at res 0.21). The one-step residual *is* the transient;
control on it.
4. **Continuous λ controller**: λ ← clip( λ · (res_EMA/target)^0.3 , floor, 16 ), per step.
- **EMA on the signal (0.9)**: the raw residual is noisy; a multiplicative controller on a noisy
signal random-walks (measured thrash λ 0.5↔13 when the target sat at the noise floor) and the
thrashing λ itself perturbs training. EMA removed it and gave the current best run.
- **Target = 5e-4**: just inside the validity threshold (few·1e-4), with margin; NOT tighter —
demanding res ≪ threshold buys nothing and costs expressivity (a 2e-4-target run was worse).
For reference, BPTT's own optima sit at res ~1e-3–2e-2: good solutions are only mildly
contractive; we ask for slightly more than BPTT needs, because our *estimator* needs it.
- **Floor**: λ may shrink when res is healthy but **must not vanish — the floor is
load-bearing**. Floor=λ₀ (never off) is a permanent tax (2.150); floor 0.1 is the sweet spot
(2.078 → 2.047 with signal-EMA → 2.029 with the holomorphic estimator). Two independent runs
prove λ≲0.02 is fatal at *any* stage: λ→0 from the start (R2) and λ-floor annealed with lr
(R6) both ended in the same death — val CE 60–77 with res ≡ 0.0. Post-mortem: this is **an
explosion disguised as convergence by floating point**, not a dead state: ‖z*‖ and the
uncapped parameters (tok/pos/fc/pj) blow up under temporarily-invalid gradients until
ε·F < ulp(z) and the relaxation freezes (res = 0 by absorption), with huge confidently-wrong
logits. The λ penalty (whose θ-gradient touches fc/pj/LN/attention) is what keeps that basin
out of reach; it cannot be annealed away. The late-drift hypothesis "persistent penalty
gradient vs vanishing task signal" is hence only half-true — the persistent pressure is also
the anti-collapse mechanism. Current anti-drift attempt: **parameter EMA** (decay 0.999,
evaluated alongside raw weights), which targets late-phase estimator-noise wander without
touching the stability loop at all.
5. **Weight-norm caps** (renorm to 3× init norm on WQ,WK,WV,WO,Wm,Wh after each step): blunt safety
net against runaway during transients when the estimator is temporarily invalid. Rarely binding
in healthy runs.
6. **Nudge clamp g.clamp(±2) and AEP-correction clip (‖corr‖ ≤ ‖F‖)** — *legacy*: protected the
nudged relaxation early in training, but measured to be the main estimator error at marginal
residuals (cos 0.27 → 0.89 once removed). Replaced by the clamp-free holomorphic nudge; a
non-finite-gradient step-skip + the λ controller now carry the early-training safety.
7. **Optimizer**: AdamW lr 1e-3, wd 1e-4, cosine to 5%, grad-norm clip 5.0, skip non-finite steps.
β(EP)=0.02, ε=0.1, T1=150, T2=20 everywhere unless stated.
### 5.x T1-residual penalty (`--resreg`) — defend the evaluated state (2026-06-20)
EP's gradient is the fixed-point/implicit gradient: it only cares WHERE the fixed point is, not how fast the
relaxation reaches it, so it has no reward for keeping the block contractive. BPTT — differentiating the finite
T1=150 unroll, which is what eval actually uses — gets that reward implicitly (a non-converging unroll → bad
output → high CE). This asymmetry is why frozen-jr EP diverges past ~2.09 (res inflates → forward bifurcates to
a limit cycle) while exact-BPTT with the identical recipe descends to 1.72 (see FINDINGS 2026-06-20; the EP run
refines the free phase to t1max=300=z* and grades there, so it never feels the residual of the evaluated z150).
The fix gives EP that missing term explicitly — penalize the T1 free-phase residual of the state actually
evaluated, `z150 = relax(xin, T1)` taken BEFORE any t1max refinement:
- `R_res = ‖ε·F(z150)‖² / (‖z150‖²+ε)`, gradient w.r.t. θ with z150 detached (`blk.tforce`);
- scaled task-relative: `ratio = resreg·min(1, res@T1 / 2e-2)`, deadband `res@T1 > 7e-4`,
`λ = ratio·‖g_task‖/‖g_res‖`, added to the EP gradient.
- **Run with `res_gate=0`** — the validity gate early-returns (jacreg-only) above the gate, which would bypass
the penalty exactly when res is high. Keep `t1max=300` (estimator accuracy) + the penalty (defends z150).
Analog-compatible (one extra force measurement + the same local vector-field gradient, no digital root-finder)
and more targeted than jacreg (which penalizes ‖J_nc‖_F, not the actual residual vector that explodes). Validated
res-tight through step 1000 / best 2.0573 (past the 2.09 wall) before a /tmp wipe; full re-validation pending.
## 6. Validation methodology (how we know the estimator/claims are right)
- **Gradient-cosine probes**: at a fixed realistic operating point (300 BPTT steps from init —
no contraction penalty, the "natural" weight region), compare every estimator against a
long-horizon BPTT reference (T1=400), per parameter group (attn / ffn / LN / emb). This is what
exposed the validity threshold, the clamp damage, and the T2 truncation.
- **Horizon control**: BPTT-150 vs BPTT-400 cosine is itself only 0.35–0.77 on slow-mixing batches
— the "finite horizon vs true equilibrium gradient" cost is shared by everyone, EP is not
special; at matched horizon EP is within ~0.15 of BPTT.
- **BPTT-as-ablation**: BPTT on the identical architecture isolates *training-rule* cost (EP−BPTT)
from *architecture* cost (BPTT−BP). BPTT is an ablation, not the target; BP is the target.
- **Same-graph eval**: val CE is computed through the same T1-step relaxation used in training, so
no train/eval mismatch flatters either method.
- **Gradient-cosine has a lifecycle**: early/mid training it measures estimator quality (0.93 →
0.79–0.85 across scale); late in training, at slow-mixing trained points, even two *exact*
gradients at different horizons decorrelate (cos(BPTT-150, BPTT-800) = 0.25 at the trained S1
point) — no single "true gradient" exists to cosine against, and the meaningful arbiter becomes
the training outcome on the horizon-matched eval objective. Validity-threshold claims here are
early/mid-phase statements. Late-phase corollary: EP's res target 1.5e-3 at S1 is already
optimal — cos rises monotonically with tightness and the loose-weights/refined-measurement
variant nulls at training level: the measurability-contraction tax is rigid across the interval
(the physical escape is oscillatory/lock-in measurement, not operating-point engineering).
## 7. Results
**14k-step matched comparison (the honest table; thick block ≈ BP MLP=4 in parameter shape):**
| training rule | architecture / recipe | best val CE |
|---|---|---|
| BP | standard transformer, MLP=4 (**like-for-like for thick**) | **1.610** |
| BPTT + R9's λ-controller + param-EMA | thick (exact grad, same stabilization as EP) | **1.635** — tail stable |
| **EP (R10)** | **thick; R9 + adaptive T1 (refine to res≤1e-4) + adaptive T2 (selection, cap 120)** | **1.676** (EMA plateau 1.68–1.70) |
| BP | standard transformer, MLP=1 (thin-matched) | 1.689 |
| EP (R9) | thick; holo nudge + recalibrated controller (target 1.5e-3, λmax 4) + param-EMA | 1.740 |
| BPTT (exact grad) | thick, unregularized | 2.021 — **destabilizes late** (res→4.7e-2, val→3.0) |
3k-era and ablation numbers (shorter schedule):
| run | best val CE |
|---|---|
| BPTT thick, 3k (its best showing) | 1.949 |
| EP R7: holo estimator, old tight controller (target 5e-4) | 2.029 (late λ pinned 16 ⇒ drift) |
| EP R8: R7 + param-EMA | 2.031 (EMA alone ≠ fix; the λ fight dominates) |
| EP R5/R3: plain estimator generations | 2.047 / 2.078 |
| BPTT monDEQ / thin, 3k | 2.111 / 2.206 |
| EP R2 (λ→0) / R6 (λ-floor∝lr) | 2.357 / 2.501 — both die by fp-absorption explosion |
| random | 4.174 |
Reading: (a) final decomposition — **architecture tax ≈ 0.025** (1.635 vs 1.610), **EP rule tax
≈ 0.041** (1.676 vs 1.635), total **0.066** to the like-for-like BP transformer; EP beats the
thin-matched MLP=1 baseline. (b) EP beats *bare* BPTT at both horizons, but the controlled
comparison shows most of that win was EP's mandatory stabilization loop doubling as regularization
— bare exact-gradient training walks off the contractive manifold at 14k, and the same controller
that EP cannot live without lifts BPTT to 1.635 (also beating MLP=1): **the contraction controller
is good for the architecture regardless of training rule; EP simply forced its discovery.**
(c) The estimator and the controller must be **co-designed**: upgrading the estimator
(holomorphic, clamp-free) widened the validity region from res≲5e-4 to ~1.5e-3, and re-calibrating
the controller to that wider region (R7→R9) was worth **0.29**; adaptive T1/T2 (R9→R10) was worth
another **0.064**, matching the probe's cos 0.871→0.932. (d) **Multi-seed confirmation (3 seeds per arm)**: EP
1.6755/1.6851/1.6786 → **1.680 ± 0.005** vs BPTT+controller 1.6348/1.6459/1.6365 →
**1.639 ± 0.006**; the rule tax is **0.041 ± 0.005 (~9σ)** — real, tightly reproducible, and
consistent with the measured estimator misalignment (cos 0.85–0.93).
**Scale rung S1 (TinyStories char, C=256 H=8 T=256, 0.92M params; random ln127 = 4.84):**
| run | best char-CE (BPC) |
|---|---|
| BP same-shape, 14k | 0.827 (1.19) |
| **BPTT-ctl, loose target 1e-2, 14k** | **1.009 (1.46)** |
| BPTT-ctl, tight target 1.5e-3, 14k | 1.521 — ⇒ **controller-mismatch tax 0.51** |
| EP v4b (validity gate, lr 1e-3, 20k from scratch) | 1.393 |
| EP L2 (v4b recipe, 40k from scratch) | 1.214 |
| **EP warm-track (v4b → phase-2: common-mode tracking + loosened target)** | **1.141** — EP champion |
S1 scale lessons: (a) containment must scale with model size (λ ceiling, cap list, fuse);
(b) the **validity gate is load-bearing** — off-equilibrium EP updates poison weights (three
deaths before the gate, alive after); (c) the estimator validity threshold tightens with scale
(res 1e-4 → 1e-5 for full quality; rescue is compute-bounded, saturating at cos ≈ 0.85);
(d) **the controller operating point is part of the training rule**: EP needs validity-tight
targets, exact-gradient methods want loose ones — match controllers for rule-tax measurements,
but report each method at its own best operating point for ceilings.
**Optimizer pricing (S0-Shakespeare, R10 recipe, 8k steps; AdamW ≈ 1.70 at 8k):** EP-SaI
(per-tensor lr from init g-SNR, frozen = one calibrated gain line per array) **2.048**;
SGDM 2.166; Lion 2.175; Lion+LARS 2.244. Per-tensor calibration recovers ~0.12 of the 0.47
uniform-scale gap; the remaining ~0.35 measures the value of per-coordinate adaptivity under EP's
noisy heteroscedastic gradients. Pretraining therefore stays in the digital shell; fine-tuning is
exempt (SGD suffices in the RL/fine-tune regime, with <0.02% sparse updates — an endurance gift;
Mukherjee et al., arXiv:2602.07729).
**Hardware twin v4 (S0, 8-bit program-verify + 30% static mismatch + σ=1e-4 white + 4× restart
averaging): best 1.937 — 90% of the clean improvement** (clean 1.68). Noise laws measured:
contrast pollution strictly linear in σ; √N restart averaging; snapshot SNR ≈ 1/53 at σ=0.3%,
r=0.2 ⇒ hardware needs ~10³–10⁴ lock-in averages per update (ms at MHz loops — physically trivial,
digitally prohibitive: the noise dimension repeats the compute story). Discovery en route: the
frozen AEP correction has a clean-environment instability at nudge horizons ≳150–300 steps
(spectrum of J(z) − J* + J*ᵀ uncontrolled) — windows ≤120 steps + restart averaging circumvent it;
the single-trajectory oscillatory (true lock-in) estimator awaits a fix for this horizon limit.
## 8. Open problems
1. **Late drift — mostly solved, mechanism identified**: the drift was the controller *fighting*
the weights (λ pinned at max enforcing a target the upgraded estimator no longer needs; R7/R8
tails). Re-calibrating target/λmax to the holomorphic estimator's wider validity region removed
the fight (R9: λ stays 0.1–0.5, tail drift shrank from ~0.3 to ~0.15 above best). Refuted
routes: λ-floor annealing (R6 ⇒ fp-absorption explosion — the floor is load-bearing);
param-EMA alone (R8 — smooths wobble ~0.05, can't fix the fight). Residual ~0.15 tail drift
remains open (estimator direction bias near optimum is the suspect).
2. **Adaptive T2 — SOLVED** by hindsight snapshot selection (§4.3): judge by increments of the
contrast estimate, not step sizes; select the most-settled snapshot. Probe mean cos 0.932;
training −0.064 val CE (R10). Possible refinements: larger N with selection, per-batch T2max.
3. **Mixing time**: slow-mixing equilibria make *all* gradients horizon-expensive (BPTT included);
conditioning the dynamics for fast mixing (preconditioned/Anderson relaxation that preserves the
EP contrast structure) is unexplored here.
4. **Scale**: depth-1 block, char-level, C=128. The mechanisms (validity threshold, non-normality,
controller design) are dynamics-level and should transfer; the constants will move.
## 9. Hardware translation — can this now-complex algorithm still run on EP hardware?
Audit of every component of the final recipe (R10) against an analog/in-memory substrate. The
surprise: most of the added "complexity" is *control*, and control is cheap in analog; several of
our fixes specifically REMOVED digital artifacts.
| algorithm component | analog realization | difficulty |
|---|---|---|
| free phase (T1≈500 Euler steps — our digital bottleneck) | physical settling, ns–µs, "free" | trivial (hardware's whole pitch) |
| adaptive T1 ("relax until res≤1e-4") | settling detector = comparator on dz/dt | trivial |
| symmetric nudging ±β | output-node current injection | standard EP hardware |
| **holomorphic N-point circle** | **AC-modulated nudge + lock-in (homodyne) detection** — the Cauchy sum over phases IS lock-in readout; this is Laborieux–Zenke's "finite-size oscillations" taken literally | standard measurement technique; *more* native than DC differencing, and the standard weapon against analog noise floors |
| clamp removal (our biggest estimator fix) | hardware never had clamps; saturation is smooth | already done by physics |
| VF update ⟨a, ∂F/∂θ⟩ | local Hebbian outer product (contrast × presynaptic activity) per crossbar; autograd was only digital bookkeeping | native |
| λ-controller (residual → EMA → multiplicative λ, floor/cap) | a **neuromodulator/homeostatic loop**: 1 measurable scalar (settling ripple) → RC filter (the EMA) → log-domain integrator with rails (floor/cap) → global broadcast scaling a local anti-Hebbian (contraction) rule | a handful of analog components |
| λ floor = anti-collapse (R2/R6 lesson) | minimum leak conductance — never let homeostasis switch off | natural |
| adaptive T2 (snapshot selection) | sample-and-hold bank on the *contrast* signal + stability gate on the OUTPUT quantity (the transferable lesson: never gate on state velocity — non-normal transients fool it) | cheap |
| weight caps (3× init) | device conductance range | physics gives it free |
| param-EMA | slow/fast weight pairs (volatile + nonvolatile device) | known proposals |
| AdamW | per-synapse capacitor for momentum; second moment is the real gap (shared by all analog-training schemes) | open engineering |
| softmax attention + LN circuits | analog WTA / divisive-normalization primitives; data-dependent T×T attention in-memory | hard, but an *inference*-hardware problem shared by all analog-transformer efforts, not EP-specific |
**The two genuine research obstacles:**
1. **The AEP correction (J → Jᵀ) in physics.** Crossbars give Wᵀ for free (drive the other side),
but the correction needs the *circuit's* transposed Jacobian, including data-dependent softmax
parts. The classical answer exists: the **adjoint network** (Director–Rohrer 1969, circuit
sensitivity theory) — constructible by reversing non-reciprocal elements; nobody has built
AEP-learning with one yet. The alternative is a measured price list from our own ablations:
accept *reciprocal* (energy-based, tied-value) attention and skip the correction entirely,
costing ~0.15–0.2 CE (monDEQ 2.11 / energy-mode vs thick 1.95 under exact gradients).
2. **Precision budget vs the validity threshold.** Analog noise floors (~1%, 8–10 effective bits)
sit exactly where we measured EP gradients dying (res ~1e-2). Mitigations, all quantified here:
the N-point estimator tolerates **r=0.2** (10× nudge signal at equal bias — probe: flat in r);
lock-in detection buys orders of magnitude of SNR below the noise floor; free and nudged phases
run on the *same* devices so static mismatch cancels in the contrast (EP's structural advantage
over on-chip backprop — note the adjoint path partially forfeits this and needs care); and
hardware's 10⁶× speed headroom converts to phase-averaging. Our cos-vs-residual and cos-vs-r
tables (§5, §4.3) are, read this way, the **spec sheet** for an analog EP-transformer design.
**Memristor-crossbar specifics — the Jᵀ question.** Needing Jᵀ does NOT disqualify memristor EP
platforms; crossbars are the most transpose-friendly analog substrate there is (drive rows→read
columns = Wx; drive columns→read rows = Wᵀy — the property on-chip-BP designs rely on). The fork:
(a) *passive-reciprocal* platforms (resistor-coupled, Kirchhoff/coupled-learning style) are J=Jᵀ by
physics — no correction needed, but they cannot express non-reciprocal attention even at
inference; on these, run the **reciprocal recipe**: LSE energy attention (tied value) + Hopfield
FFN — whose force 2·relu(zWm)Wmᵀ uses ONE crossbar driven in both directions — at the measured
~0.15–0.2 CE expressivity cost, zero hardware changes. (b) *active-periphery* platforms
(DAC→crossbar→ADC loops; periphery already breaks reciprocity) get Jᵀ as **transposed reads of the
same arrays + frozen local gains** (gelu′, softmax p, LN 1/σ from the free-phase operating point)
— the Director–Rohrer adjoint network in crossbar form: a periphery/routing redesign, not a
different device technology. Scope note: Jᵀ appears ONLY in the nudged phase applied to one vector;
the free phase needs nothing, weight updates are local outer products, and even the λ penalty needs
only ‖Jv‖² (forward perturbation + response energy, no transpose). Unlike on-chip BP there is no
per-layer activation storage or strict reverse scheduling — one held operating point z* suffices.
Numerics for the hardware team: large nudge amplitude r≈0.2 + multi-phase/AC (lock-in) readout is
validated equivalent to r=0.02 (10× signal headroom); small-signal DC differencing dies at ~1e-3
noise (our tf32 experiment: cos→−0.03). Suggested collaboration phasing: (1) reciprocal demo on
the existing rig (zero redesign, pay 0.2), (2) transposed-periphery nudge → full AEP, buy it back.
**On "training arbitrary analog circuits"** (the bigger question): classic EP requires
energy-based (reciprocal) circuits. AEP lifts this to *any circuit with a stable fixed point* —
IF the antisymmetric correction is realizable (adjoint network) or waived (reciprocal trade).
What this project adds to that picture is the missing stability half: **training pushes arbitrary
circuits off the contractive manifold** (bare-BPTT-14k showed even exact gradients walk off it),
and a residual-driven homeostatic controller both prevents this AND improves learning — with the
hard constraint that its floor never anneals to zero (fp-absorption collapse; in hardware:
latch-up). Combined with agnostic/physical EP (Scellier et al. 2022 — no circuit model needed,
contrast is measured) and small-scale physical demonstrations (Dillavou et al.'s self-learning
resistor networks; Laydevant et al.'s Ising-machine EP, 2024; memristor activity-difference
training), the pieces for "arbitrary stable analog circuit + adjoint or reciprocity + homeostatic
contraction control = trainable" are all individually demonstrated; this work supplies the
control law and the quantitative budgets.
Inference note: causal attention's lower-triangular coupling means autoregressive generation
settles *incrementally* — a new token's state relaxes with past states frozen, so EP inference is
one physical settling per token, not a re-relaxation of the sequence.
**Component BOM (assuming the current recipe survives the ladder unchanged):**
(1) bidirectional-read analog weight arrays (ReRAM/PCM/analog-Flash/gain-cell/switched-cap) — all
W and Wᵀ including the AEP adjoint reads; (2) state-integrator arrays (capacitor+OTA per state
variable; K·T·C nodes — ~1M at the 33M demo, ~17M at 0.6B; the sequence dimension T dominates);
(3) analog attention primitives — large-fan-in current/charge-domain softmax-WTA + T² score
sample-and-hold for the frozen nudge gains — **the hardest, least-shelf-ready component**;
(4) divisive-normalization circuits (LN/RMS/qk-norm); (5) mixed-signal periphery: DAC/ADC arrays,
S&H banks, and **lock-in (synchronous-detection) channels** — large-r AC nudging is mandatory
(small-signal DC differencing dies at analog noise floors; measured digitally via the tf32
experiment); (6) control plane: settling comparator + RC filter (res-EMA) + log-domain integrator
with rails (λ controller) + a global **learn-enable line (= the validity gate)** + fuse — a
handful of components or one MCU; (7) weight-update machinery: coincidence pulse programming
(local outer products), with device nonlinearity/endurance the classic pain point; (8) an FPGA
phase sequencer (settle→hold→nudge±→snapshot→update).
**Six-month prototype plan (borrow physics, don't fab — main track: optical).**
*Primary — desktop optical EP machine (Goodman MVM + electronic loop):* one off-the-shelf LCoS SLM
(~2M pixels, $15–25k) holds, with differential encoding, ~1M signed analog weights — the fully
digitally-validated R10 thick block (C=128, 12C² ≈ 200k weights) occupies **one tenth of one SLM**
(5× headroom). Weights (WQ/K/V/O, FFN) static on the SLM during settling; state z (C=128) cycles
through a 128-channel DAC-driven source array → SLM → cylindrical-lens summation → photodiode
array; nonlinearity, T² attention scores (negligible digitally at T=64), Euler integration, and
the λ/gate control law in loop electronics; one loop pass = one Euler step. **Timing is set by
B·T multiplexing, not settling**: each Euler step = B·T·(~6 matrices) MVM passes; at loop rates
0.1–1 MHz and B=4–8, settle ≈ sub-second and a 14k-step training run ≈ hours; SLM refresh once
per training step (60 Hz ample). **Wᵀ for AEP: program the transposed panels alongside W in the
spare SLM area** (reverse-propagation reciprocity remains the Phase-2 elegance; don't gate the
prototype on bidirectional alignment). Precision framing: master weights live fp32 in the digital
shell, the SLM holds a fresh ~8-bit projection each step — standard QAT regime; the open question
is per-pass multiplicative noise + slow drift (speckle/calibration, the optical ~1% floor), which
is exactly what the spec-sheet arsenal (r=0.2 nudging, lock-in/homodyne — the field's native
measurement, same-device contrast cancellation, λ-controller) was built for, and which is
**pre-validated digitally by an optics-noise-model run** (8-bit weight quantization per step +
1–2% multiplicative force-eval noise + drift) before any purchase. Novelty: photonics has in-situ
BP and BP-free local learning (Science 2023 ×2); EP-on-optics exists only as oscillator theory —
"EP-trained transformer on optics" is unclaimed. Budget $20–50k; optics ~2 months, loop
electronics 1–2 months (rehearsed by the PCB track, same parts/skills), calibration+training 2.
*Secondary (one day of email, no more):* Mythic M1076 as a borrowed settle engine — gated solely
on SDK raw-MVM access + incremental writes; flash endurance marginal beyond few-k-step demos.
(Laydevant et al.'s D-Wave EP precedent: reviewers accept rented physics.)
*Fallback / rehearsal (zero-dependency):* board-level reciprocal block (C=8–16, Hopfield Wm
driven bidirectionally + energy attention), digipot/MDAC weights, OTA+cap integrators, Red Pitaya
AC nudge + lock-in, comparator+RC λ loop — small headline, but lands continuous settling, lock-in
contrast readout, and the homeostatic control law in real electronics, and its loop electronics
ARE the optical track's loop electronics. Six months buys no foundry CMOS — but **university cleanrooms (e.g., UIUC HMNTL) are a different
category, and they fabricate exactly the one BOM item money can't buy: the weight arrays.**
Passive BEOL memristor crossbars (bottom electrode / ALD HfOx-TaOx / top electrode, 3–4 masks,
µm linewidth, contact or maskless litho) are the academic-cleanroom comfort zone; the practical
per-array limit is sneak-path-set (~64×64–128×128 for 1R passive with V/2 biasing), and a handful
of tiled arrays covers a C=32 reciprocal block (C=128 ≈ a dozen-array wiring project). FeFET
(three-terminal, on the same line's ferroelectric pedigree) cures sneak paths for a few extra
masks. The algorithm side has already bought insurance for first-batch device quality: Phase-1
keeps fp32 masters with **program-verify writes (only ~6-bit iterative programmability needed —
no pretty pulse physics)**, and 10–50% device mismatch is absorbed by same-device contrast
cancellation + the λ-controller — validated in digital twin runs (per-step N-bit weight
projection + per-pass multiplicative noise + static mismatch; see `--wq_bits/--fnoise/--wmis`).
Discipline: fab ONLY the arrays; anything on Digi-Key stays COTS board-level (student-process
CMOS periphery would be stone-age). This raises the board-track ceiling from digipot (~10³
weights, C=8) to homemade crossbars (10⁴–10⁵ weights, C=32–64) without leaving campus. Execution
for a no-fab-experience team: the standard academic rentals — (1) recipe-owner collaboration
(their senior student runs their existing process; weeks of routine work; co-authorship; you
never gown up), (2) apprenticeship via facility training + staff engineers (executed by a
recruited student), (3) paid staff-run fabrication. Find the recipe owner before booking tool
time; lead the pitch with the device-twin plot ("your first-batch devices suffice — proven").
**Recommended Phase-1 architecture: analog equilibrium core + digital optimizer shell.** Physics
performs only the expensive part (settling + contrast measurement = the ×300–1000 digital
overhead); contrasts are ADC'd out, Adam/schedules/λ-logic stay digital, weights DAC back. This
sidesteps component (7)'s update-nonlinearity pain and the missing analog Adam, at no loss of the
compute advantage. Phase-2: full in-array updates.
**Sizing correction — causal serialization (the assumption that kills the wafer-scale monster):**
naive sizing assumes the whole sequence's state must be physically resident during settling
(K·T·C integrator nodes — hundreds of millions at 8B scale, 1e4–5e4 mm²). Causality removes this:
the free phase settles **token-by-token** (token t's equilibrium depends only on tokens ≤ t; past
states live in an ordinary digital KV-cache), and the nudged phase is the adjoint of a
time-lower-triangular system = an exact **reverse sweep** (upper-triangular back-substitution,
exact to the same order as the AEP linearization). Physical state requirement drops from K·T·C to
**K·C per token-slice** (÷T ≈ ÷2048): ~15 mm² of integrators; sequence caches are ~1 GB of
commodity DRAM. Readout also streams per token-slice (~10⁵ values/token, batch-accumulated in
charge domain) — ADC throughput lands at the standard IMC design point, wall-clock days for an
8B-Chinchilla run at MWh-scale energy. Remaining big item: weight arrays only — 0.6B ≈ 2–4 chips
@28nm (university-consortium scale, $5–20M staged program); 8B ≈ 5–8 reticle dies @7nm (gen-3).
New throughput consideration: serialized operation makes τ_settle the rate limit, and the spring
chain's ~K² mixing tax favors **shallow-wide stacks (K=4–8)** — which the ladder data already
supports (thick single block ≈ same-shape BP). Program staging: MPW single-block demo (33M-class,
$1–5M) → 0.6B 2–4-chip machine → 8B gen-3. Economics read: capex-dominated, opex→0 — the pitch is
2–3 orders of magnitude energy and edge/continual learning, not cloud-GPU rent replacement.
**Compute reality (digital simulation)**: EP's per-step force-eval budget E ≈ 700–3000 makes full
EP training cost ≈ E/3 ≈ **230–1000× the BP cost** at equal tokens (Chinchilla 20×: 0.6B ⇒ 12B
tokens ⇒ ~4.3e19 BP-FLOP ⇒ ~1e22-class EP-FLOP: H100-cluster scale for one run, years on 4×A6000).
This multiplier is exactly what physical settling eliminates — the algorithm is expensive in
digital silicon and native in physics. Note TinyStories (~0.6B tokens) Chinchilla-matches ~30M
params — precisely the planned "readable stories" demo scale (S4).
## 10. Code map (all on timan1)
- `/tmp/lt_ep/lt_ep_train.py` — main trainer: EQBlock (all four force variants), `ep_step`
(VF-EP + AEP + optional holomorphic nudge), `bptt_step`, λ controller, caps.
Key flags: `--mode ep|bptt --attn_mode thick|real|energy|mono --jacreg --jr_floor --res_target
--res_ema --jr_lrcouple --holo N --hr r --c --T1 --T2 --eps --beta`.
- `/tmp/lt_ep/holo_ep.py` — holomorphic force/softmax/LN/GELU, `holo_a` (Cauchy readout), probe.
- `/tmp/lt_ep/grad_quality.py` — estimator-vs-exact cosine probe (validity threshold measurement).
- `/tmp/lt_ep/solver_wall.py` — plain vs Anderson free-phase convergence per damping level.
- `/tmp/lt_ep/bp_charlm.py` — param-matched standard BP transformer baseline.
- `/home/yurenh2/ept/cet_mvp.py`, `cet_aep.py`, `aep_*.py` — CET reproduction + AEP validation
(vision side; gradient-fidelity numbers in Sec 3/FINDINGS).
- Run logs: `/tmp/lt_ep/thickep_*.log`, `H2_*.json`.
- Data: `/tmp/lt_ep/data/shakespeare_char/{train,val}.bin, meta.pkl`.
Hardware: 1× RTX A6000 per run (shared node); plain-EP ~2.4 it/s, holo-EP(N=2) ~1.5–2 it/s at
B=32, T=64, C=128, T1=150, T2=20. A 14k-step run ≈ 1.6–2.5 h.
|