# EP / AEP for Transformers — Findings

**Question.** Can transformers (attention + FFN) be trained with Equilibrium Propagation (EP),
i.e. without backprop / without Feedback Alignment? The campaign started from the literature, reproduced
the SOTA, characterized AEP (EP for non-conservative dynamics), and ported the recipe to a real char-LM.

---

## TL;DR

1. **EP/AEP give FAITHFUL LOCAL GRADIENTS for every transformer component** — attention
   cosine **0.99**, FFN **1.0** vs the true BP gradient. This solves the credit-assignment that
   **Feedback Alignment fails at**: FA only trains layers adjacent to the loss (output proj cos≈1.0)
   and leaves upstream layers as noise (**attention Q/K/V cos≈0.25, FFN fc cos≈−0.01**).
2. **Recipe to make real (non-conservative) attention EP-able: DAMPING + AEP.** A damping term
   `s·(attn(z) − c·z)` (c≥1) creates a stable fixed point at *any* attention strength, while keeping
   the map non-conservative (independent Q/K/V); AEP then recovers the gradient (0.99–1.0).
3. **Stable end-to-end EP training: SOLVED** by a residual-driven continuous controller on a soft
   Jacobian penalty (Bai-2021-style, Hutchinson) + damping. EP trains 12k+ steps without blowup.
   **There is NO structural "EP ceiling"** (a ~2.5-CE wall was claimed mid-project and is
   **retracted** — see *2026-06-09* section): EP has a quantitative **validity threshold**
   (free-phase residual ≲ few·1e-4, nudge inside the linear-response window), and meeting it costs
   regularization tax + steps — both reducible, neither a wall.

---

## 2026-06-20 — below-2.10 EP divergence: ROOT CAUSE + the residual-defense fix

The C512 EP wall (EP frozen-jr descends to best ~2.09, then SUDDENLY diverges within ~200 steps:
res 5e-3→0.15, cos(g_EP,g_BPTT) 0.98→~0, CE→4+, abort — while exact-BPTT with the IDENTICAL recipe
sails past to 1.72, freeze_wsd) is diagnosed. NOT the controller, NOT jacreg, NOT the erf/tanh GELU,
NOT a loss-landscape wall.

**Root cause (Codex-confirmed, 5-way corroborated): EP optimizes the FIXED POINT; BPTT optimizes the
FINITE UNROLL — only the finite unroll defends the residual.** `ep_step` relaxes T1=150→z150, then REFINES
to t1max=300→z* and takes the gradient at z*; but eval & `bptt_step` use z150. EP never "feels" the T1
residual → as attention gets expressive, contraction weakens, z150 drifts from z*, res@T1 inflates, the
EP estimate (valid only at small res) dies → blowup. BPTT differentiates the actual 150 steps, so a
non-converging unroll → bad CE → its gradient implicitly rewards strong contraction. That defend-the-residual
term is what EP structurally lacks: cos-0.977 holds only AT the fixed point; the missing ⊥ component is ~21%
(=√(1−0.977²)) = the finite-horizon transient gradient (T1=∞ would make even BPTT lose it — BPTT's stability
IS the finite truncation; the equilibrium/implicit gradient only cares WHERE the fixed point is, not how fast
you reach it → no contraction-reward).
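
The ~21% figure follows from splitting a unit-norm EP gradient into a part along the BPTT gradient (length cos) and a part orthogonal to it (length √(1−cos²)). A minimal arithmetic check, with an illustrative function name that is not taken from the project code:

```python
import math

def orthogonal_fraction(cos_sim: float) -> float:
    """Norm of the component of a unit vector orthogonal to a reference direction."""
    return math.sqrt(1.0 - cos_sim ** 2)

frac = orthogonal_fraction(0.977)
print(f"{frac:.3f}")  # about 0.213, i.e. the ~21% quoted above
```

So a cosine that looks close to 1 still leaves roughly a fifth of the gradient norm outside the BPTT direction.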

Two layers: (1) point-mismatch (refinement) — `--t1max 150` alone moved the wall 2.09→2.05 but still blew@600;
(2) gradient-flavor — needs the explicit penalty. The diverged state is a forward bifurcation to a LIMIT CYCLE,
not under-relaxation (eval_relax: res floors ~6e-2 and oscillates, 150→4000 relax steps don't help, CE ~3.7;
FTLE stays negative −0.027..−0.050 — single random-vector FTLE misses the cycle). ⇒ adaptive / more relaxation
steps CANNOT fix it (res as a STOPPING criterion just chases a vanishing fixed point); only res as a COST
(penalty) prevents the drift.

**Fix — explicit T1-residual penalty `--resreg`**: defend z150 with R=‖εF(z150)‖²/(‖z150‖²+ε), grad w.r.t. θ
(z150 detached), task-relative scale (ratio = resreg·min(1, res@T1/2e-2), deadband res@T1>7e-4), added to the
EP gradient. **MUST run res_gate=0** (the validity gate early-returns jacreg-only above the gate, bypassing the
penalty exactly when res is high — first gated attempt blew@200). Keep t1max=300 (estimator accuracy) + penalty
(defends z150). Analog-compatible (one extra force measurement + local VF gradient, no root-finder); more
targeted than jacreg (which penalizes ‖J_nc‖, not the residual that explodes).
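
The scaling rule above can be sketched as follows. This is only one reading of the text, not the project implementation (which lives in EP_BELOW210_DIAGNOSIS_FIX.md). The names `resreg_ratio` and `relative_residual_penalty` are hypothetical, the default step size `eps_step=0.1` and the stabilizer `eps_den` are assumed values, and the deadband is assumed to switch the penalty off at or below 7e-4.

```python
def resreg_ratio(res_t1, resreg=0.2, ramp=2e-2, deadband=7e-4):
    """Task-relative weight of the T1-residual penalty (assumed reading).

    Off at or below the deadband; otherwise grows linearly with res@T1
    and saturates at `resreg` once res@T1 reaches `ramp`.
    """
    if res_t1 <= deadband:
        return 0.0
    return resreg * min(1.0, res_t1 / ramp)


def relative_residual_penalty(force, z, eps_step=0.1, eps_den=1e-8):
    """R = ||eps_step * F(z)||^2 / (||z||^2 + eps_den) on plain Python lists."""
    num = sum((eps_step * f) ** 2 for f in force)
    den = sum(v * v for v in z) + eps_den
    return num / den


for res in (5e-4, 5e-3, 0.15):
    print(res, resreg_ratio(res))
```

In this reading a healthy run (res@T1 around 1e-4 to 5e-4) pays no penalty at all, while a run drifting toward the 0.15 blowup level pays the full `resreg` weight.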

**Validation (interrupted by /tmp wipe, needs re-run):** ep_resreg2 (res_gate=0, resreg=0.2, warm from erf-2.09)
held res pinned 1–5e-4 through step 1000, best **2.0573** (past the wall, lowest any EP run reached), zero
inflation — where every no-penalty variant blew (ep_nogate@100, gated ep_resreg@200, ep_t1max150 reached 2.05
then blew@600). The run + ALL local ckpts were deleted 2026-06-20 by the /tmp 10-day cleanup before reaching
~1.8. Corroboration (5): Codex independent diagnosis; BPTT-from-2.09 control (res stayed 1.8e-4 + descended,
same fresh-opt restart where EP blew → the EP *update* is the destabilizer); FTLE (stable-BPTT 1.72 has WEAKER
contraction −0.0347 than diverged-EP −0.0377 → not forward-stability); eval_relax (limit cycle); gradient
decomposition (jacreg acquitted: 3% of grad, orthogonal, removing changes cos<0.001). Verbatim code + re-run
plan: **EP_BELOW210_DIAGNOSIS_FIX.md**.

---

## Background (literature)

- **EP** — Scellier & Bengio 2017. Energy-based; free phase relaxes to a fixed point, nudged phase
  with strength β; local weight update from the two equilibria. Centered/symmetric nudging
  (Laborieux 2021) reduces the gradient-estimator bias.
- **EP ≡ BPTT** — Ernoult et al. 2019: EP updates match BPTT gradients in an RNN with static input,
  in the limit β→0 with a converged free phase and enough nudged steps. (Relevant: as β→0 EP is
  unbiased — so our instability is NOT primarily gradient bias.)
- **Holomorphic EP** — Laborieux & Zenke 2022: exact gradients via finite oscillations, removing the
  β bias/noise trade-off.
- **CET** — Høier, Kerjan, Scellier (ICLR 2026 AM workshop), arXiv via OpenReview `Qrfml76eWJ`.
  Convergent Energy Transformer, EP-trained, CelebA masked completion, EP ≈ TBPTE. The current SOTA
  for "EP + attention". Conservative (energy) attention with tied value → guaranteed fixed point.
- **AEP / AsymEP** — Scurria, Vanden Abeele, Mognetti, Massar, "EP for Non-Conservative Systems",
  arXiv:2602.03670. Nudged-phase correction `−2 A_J(x*)(x−x*)` (A_J = antisym part of the Jacobian
  at the free eq) turns the nudged Jacobian J → Jᵀ, giving the exact gradient even when no energy
  function exists. **The correction is theirs**; their scope is feedforward/Hopfield nets on static
  MNIST/CIFAR (no attention, no sequence, explicit Jacobian, no stability controller). Their
  force-form **VF** baseline collapses without the correction. See 2026-06-16 for the full ours/theirs
  boundary.
- **FRE-RNN** — "Toward Practical EP", arXiv:2508.11659 (Zhuo Liu et al.). Fixes EP instability via
  **feedback regulation (reduce spectral radius → fast convergence) + residual connections**
  (vanishing gradients). Directly relevant to our open problem (keep the free phase converged).
- Adjacent energy-transformers (BP-trained, not EP): Energy Transformer (2302.07253),
  EBT "Scalable Learners and Thinkers" (2507.02092), NrGPT (2512.16762).

---

## The arc (what we built & found)

| stage | finding | evidence |
|---|---|---|
| literature | EP+attention works only narrowly; CET is SOTA (CelebA, EP≈BP, single block) | CET full read |
| MVP (CET repro) | EP trains energy-attention+memory; **cosine 0.99; EP≈TBPTE** | CIFAR 0.539/0.546; FMNIST 0.272/0.278 |
| AEP (non-cons.) | corrects real-attention gradient: toy 0.30→**0.9975**; CET attn-params 0.75→**1.0** | `aep_*.py` |
| characterization | needs T2≥20–40, T1≥80; β-insensitive (static); **advantage grows with depth** (K=2: naive 0.05 vs AEP 0.99); only changes nudged phase (free eq identical); 1.44× cost | sweeps |
| fundamental limit | **strongly non-conservative attention has NO fixed point → whole EP family fails** (projection bounds magnitude but does not create a fixed point) | residual stuck 3e-2 |
| **F: damping** | `s·(attn−c·z)`, c≥1 → stable fixed point at any s → **AEP 0.99–1.0** even at s=8 | `aep_contractive2.py` |
| Option-2 LM attention | **AEP gives LM causal attention 0.993** (FA 0.25 — FA kills Q/K/V) | `lt_ep_attention.py` |
| Option-2 H1 (FFN) | **EP Hopfield-memory gives FFN 1.000** (FA −0.01 — the abandonment reason) | `lt_ep_ffn.py` |
| Option-2 H2 (train) | **BPTT trains (val CE 2.16↓, random 4.17); EP local training destabilizes early** | `lt_ep_train.py`, `H2_*.log` |

**Why FA failed (the project's pain, confirmed):** FA only delivers a usable gradient to the layer
right before the loss (output projection / FFN proj, cos≈1.0); all upstream layers (attention Q/K/V,
FFN fc) get random-routed error → cos≈0. EP/AEP fix all of them.

**Why EP training destabilizes (mechanism):** BPTT differentiates whatever the fixed T1-step
relaxation computes (a deep weight-tied net) — it does NOT need convergence. EP *assumes* the free
phase is at a fixed point; gradient descent on the loss pushes attention to be more expressive /
non-conservative → fixed point lost → relaxation diverges → EP estimate (and its jvp/vjp correction)
goes non-finite. Damping + clipping + weight-caps delay but don't prevent it.
*(Superseded: the residual-controlled Jacobian penalty DOES prevent it — see the 2026-06-09 section.)*
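
The role of the damping term `s·(attn(z) − c·z)` can be illustrated on a toy system. The sketch below is an assumption-laden stand-in, not the project's attention block: `tanh(W z + x)` with an antisymmetric `W` plays the part of a bounded non-conservative coupling, and the Euler step size `eps` and all names are hypothetical. Because `s` multiplies both the coupling and the damping, the fixed-point condition `g(z) = c·z` does not depend on `s`.

```python
import math

def damped_force(z, x, w=0.5, s=8.0, c=1.0):
    """F(z) = s * (g(z) - c*z), g(z) = tanh(W z + x), W = [[0, w], [-w, 0]].

    W is antisymmetric, so the coupling has no energy function (non-conservative).
    """
    g0 = math.tanh(w * z[1] + x[0])
    g1 = math.tanh(-w * z[0] + x[1])
    return [s * (g0 - c * z[0]), s * (g1 - c * z[1])]


def relax(x, steps=500, eps=0.02, **kw):
    """Euler relaxation z <- z + eps * F(z); returns the state and ||F(z)||."""
    z = [0.0, 0.0]
    for _ in range(steps):
        f = damped_force(z, x, **kw)
        z = [z[0] + eps * f[0], z[1] + eps * f[1]]
    return z, math.hypot(*damped_force(z, x, **kw))


x = [0.7, -0.3]
z_strong, res_strong = relax(x, s=8.0)            # strong coupling
z_weak, res_weak = relax(x, steps=3000, s=2.0)    # weaker coupling, slower relaxation
print(res_strong, res_weak)
```

In this toy the update map is a contraction because the coupling's Lipschitz constant (`w = 0.5`) is below `c` and `eps·s·c ≤ 1`; that is a property of the toy, and real attention needs the measured residual to confirm convergence.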

---

## Key numbers

- CET MVP (masked completion, test masked-MSE): CIFAR **EP 0.539 ≈ TBPTE 0.546**; FMNIST **EP 0.272 ≈ TBPTE 0.278**; BP-transformer 0.106/0.126; trivial(visible-mean) 0.583.
- EP-vs-BPTT gradient cosine (CET): global **0.99**; attention WQ/WK 0.98.
- AEP recovers non-conservative attention: toy naive 0.30 → AEP **0.9975**; CET attn-params 0.75 → **1.0**; damped high-s (s=8) AEP **0.99–1.0**.
- LM (Shakespeare char): AEP attention **0.993** (FA 0.25); EP Hopfield-FFN **1.000** (FA −0.01).
- H2 training: BPTT val CE **2.16 ↓** (random ln65=4.17); EP non-finite / no sustained training.

---

## Open problem & next directions

The frontier is **stable end-to-end EP training**. Hypotheses / plans:
1. **Keep the free phase converged during training** (FRE-RNN, 2508.11659): feedback regulation to
   keep the spectral radius < 1 + residual connections. Most directly targets the mechanism.
2. **β bias/noise trade-off** — Ernoult 2019 says EP≡BPTT as β→0 with converged phases; but small β
   amplifies the `(z₋−z₊)/2β` noise. Holomorphic EP (Laborieux–Zenke 2022) removes this trade-off.
   Diagnostic: log cosine(EP-grad, BPTT-grad) *during* training — does it start 0.99 and drop?
3. **Unified conservative energy** (no AEP): make the whole LM one energy (energy-attention, tied
   value) → guaranteed fixed point → plain EP is stable (we saw CET-EP is stable). Trades attention
   expressivity for stability.

## 2026-06-09 — H2 training era: the "wall" refuted, the real mechanism found

**Setting.** Shakespeare char-LM, single equilibrium block, C=128 H=4, same param budget everywhere.
`--attn_mode thin/real` = clamp + Hopfield-FFN + damped real attention; `thick` = DEQ-transformer
block (pre-LN + attention + untied 4× GELU FFN + residual + damping). Stabilizer = damping `c` +
soft Jacobian-norm penalty λ (Hutchinson on the non-conservative force) driven by a continuous
controller on the free-phase residual (residual, NOT spectral radius, is the right signal — the
attention Jacobian is non-normal, transient growth invisible to ρ).

**Scoreboard (best val CE; random ln65 = 4.17):**

| run | result |
|---|---|
| BP transformer, MLP=4 | 1.68 |
| BP transformer, MLP=1 (param-matched) | 1.79 |
| thick-BPTT (exact grad, same arch as thick-EP) | **1.95** (res ~1e-3→8e-3, learned contraction unaided) |
| mono-BPTT (monDEQ) | 2.11 |
| thin-BPTT | 2.21 (optimum at res ~1.5e-2) |
| **thick-EP champion (R5): R3 recipe + EMA-smoothed controller, 14k** | **2.0467** @ step 5k — within 0.10 of exact-gradient thick-BPTT |
| thick-EP (R3): λ-floor 0.1, res_target 5e-4, c=1, 8k | 2.0784 — beats thin-BPTT and mono-BPTT |
| thick-EP (R1): λ-floor 1, c=2, 14k | 2.1504 — 2.467@3k → 2.259@7.2k → 2.150@14k: monotone, no plateau, just slow |
| thick-EP (R4): λ-floor 0.1, res_target 2e-4, 14k | 2.1352 — tight target ⇒ controller thrash (λ swings 0.5↔13 late) |
| thick-EP (R2): λ→0, c=0.5 | 2.3572 @ step 800, then **collapse to a degenerate fixed point** (res→0 exactly, val ≫ random) |
| thick-EP, c=2 λ=1, 3k (the run the "wall" was called on) | 2.4665 (just undertrained) |
| thin-EP super-long 12k | 2.4847 plateau — **invalid-regime run**: res 2.4e-2–6e-2 the whole time, λ pinned at max 16 |
| thin-EP 3k, λ=16 | 2.5593 |

**Retraction.** Mid-project verdict — "EP is capped at ~2.5 by a convergence⟷richness wall
(rich blocks need damping that destroys their expressivity)" — is **wrong**, called prematurely at
3k steps when thick-EP ≈ thin-EP ≈ 2.47–2.50 by coincidence (two different slow/broken modes passing
the same value). Three independent refutations: (1) the 8k continuation sailed through the "wall"
(2.2592@7.2k, monotone); (2) the slack run hit 2.357 in 800 steps; (3) the gradient probe below shows
the EP estimator is healthy exactly where the stabilized runs operate. Also wrong in the original
story: damping c does NOT aid convergence for LN-inside blocks (LN Jacobian ∝ 1/σ(z): damping shrinks
z*, *inflating* the Jacobian — measured: thick plain-relax res 8.8e-3 at c=0 vs 3.4e-2 at c=2,
`solver_wall.py`), and the scary "init residual~12" was an unnormalized absolute-norm print.

**The real mechanism — a validity threshold, not a wall** (`grad_quality.py`: cosine of each
estimator vs long-horizon BPTT (T1=400) reference, at a 300-step-BPTT-pretrained thick block):

| estimator | free-phase res | cos vs BPTT-400 (3 batches) |
|---|---|---|
| EP T1=400, T2=20 | ~5e-5 | **0.85, 0.83, 0.88** — healthy |
| EP T1=150, T2=20 | 2e-4–1.6e-3 | 0.21 / 0.86 / 0.27 — marginal, batch-dependent |
| EP T1=50, T2=20 | 2.4e-3–3.6e-3 | −0.03 / 0.34 / 0.55 — broken |
| BPTT T1=150 (itself!) | — | 0.45 / 0.77 / 0.36 vs BPTT-400 |
| EP T1=150, **T2=60** | same as T1=150 | 0.01 / 0.07 / −0.19 — **nudged phase leaves the linear-response window** |

Readings: (a) at tight convergence the EP estimator agrees 0.85 with the exact gradient — the
EP-specific overhead (finite β, T2 truncation, AEP clipping) is only ~0.15 of misalignment;
(b) **the horizon/mixing cost is shared by BPTT itself** (BPTT-150 vs BPTT-400 cos 0.36–0.77):
slow-mixing equilibria are expensive for everyone, not an EP defect; (c) EP needs res ≲ few·1e-4 to
be valid — at res ~1e-2 (where the thin super-long run lived) EP updates are noise, which is what
that 2.48 "plateau" actually was: 12k steps of noise + λ=16 Jacobian-shrinking gradient; (d) longer
nudged phase is NOT better — the AEP correction is linearized at z*, so T2 must stay in the
linear-response window (T2≈20 at eps=0.1 good, T2=60 destroys the estimate).

**Decomposition of the EP→BP gap (final numbers):** architecture (BPTT−BP) = 1.949−1.79 =
**0.16** (small — the energy/fixed-point architecture is fine); training rule (EP−BPTT) =
2.047−1.949 = **0.10** — slower optimization (estimator cos ~0.85) + the λ tax of staying inside
the validity region. Total EP→param-matched-BP = **0.26** (the "wall" story claimed ~0.7 was
structural). EP-trained thick (2.047) beats exact-gradient thin (2.206) and monDEQ (2.111):
EP *does* cash in architecture richness, given validity + steps.
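
The decomposition is plain subtraction over three scoreboard entries; a short check of the arithmetic, using only numbers quoted above (variable names are illustrative):

```python
bp_param_matched = 1.79   # BP transformer, MLP=1 (param-matched)
thick_bptt = 1.949        # exact gradient, same architecture as thick-EP
thick_ep = 2.047          # R5 champion

architecture_gap = thick_bptt - bp_param_matched
rule_gap = thick_ep - thick_bptt
total_gap = thick_ep - bp_param_matched
print(f"{architecture_gap:.2f} {rule_gap:.2f} {total_gap:.2f}")  # 0.16 0.10 0.26
```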

**λ-study (final, prediction confirmed: middle wins):** R5 2.047 < R3 2.078 < R4 2.135 < R1 2.150
< R2 2.357. The two failure modes flanking the optimum: λ→0 ⇒ fast descent then *collapse* (not
explosion — a dead res=0 equilibrium); λ heavy ⇒ no failure, just a constant tax (R1 monotone to
2.150@14k). R4's lesson: too-tight res_target puts the multiplicative controller on the residual
noise floor — `(res/target)^0.3` per-step updates compound the noise and λ thrashes 0.5↔13,
degrading late training. R5's fix: EMA-smoothed residual signal (`--res_ema 0.9`) kept λ quiet
(0.1–1.0) and set the champion 2.0467 by step 5k.
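
The controller behaviour described above can be sketched as a multiplicative update with clipping. The exact rule in `lt_ep_train.py` is not reproduced here. The class below assumes `λ ← clip(λ·(signal/target)^0.3, floor, max)` with an optional EMA on the residual, and the class and argument names are hypothetical.

```python
class ResidualController:
    """Multiplicative lambda controller driven by the free-phase residual (sketch)."""

    def __init__(self, lam=1.0, floor=0.1, lam_max=16.0,
                 target=5e-4, power=0.3, ema=0.0):
        self.lam, self.floor, self.lam_max = lam, floor, lam_max
        self.target, self.power, self.ema = target, power, ema
        self.signal = None  # (optionally smoothed) residual the controller reacts to

    def update(self, res):
        if self.signal is None or self.ema == 0.0:
            self.signal = res
        else:
            self.signal = self.ema * self.signal + (1.0 - self.ema) * res
        self.lam *= (self.signal / self.target) ** self.power
        self.lam = min(self.lam_max, max(self.floor, self.lam))
        return self.lam


raw = ResidualController(ema=0.0)
smooth = ResidualController(ema=0.9)
for ctrl in (raw, smooth):
    ctrl.update(5e-4)   # on target: lambda unchanged
    ctrl.update(5e-3)   # a single 10x residual spike
print(raw.lam, smooth.lam)
```

On the single spike the unsmoothed controller roughly doubles λ (10^0.3), while the EMA version moves it by about 1.9^0.3, which is the kind of damping of noise-driven swings the R5 fix relies on.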

**Open item (the actual next frontier):** every stable recipe peaks mid-run (~5–6k) and then drifts
up 0.1–0.2 despite cosine lr decay — EP's late-training noise floor. Suspects: VF-estimator noise
dominating at small lr (cos 0.85 ⇒ persistent gradient-direction error), λ/lr balance shifting late,
weight-norm caps binding. Holomorphic EP (removes β bias/noise) is the principled candidate for the
remaining ~0.10; T2 must stay in the AEP linear-response window (T2≈20, NOT longer).

## 2026-06-10 — estimator/controller co-design: EP reaches 1.74, beats exact-gradient BPTT

The Holomorphic-EP line (see METHODS §4.3) + controller re-calibration, in causal order:

1. **R7 — holomorphic clamp-free nudge** (N=2, r=0.02 Cauchy readout, no g-clamp / no corr-clip,
   closed-form ∇zCE): best **2.029**, faster wall-clock than plain EP (3.3 vs 2.45 it/s). Probe
   discoveries: the legacy clamps were the dominant estimator error at marginal residuals
   (cos 0.27→0.89 at res 1.6e-3); N and r are flat ⇒ β-bias/noise was never the binding error;
   T2 truncation is the remainder (T2=120 → cos 0.985 when the nudged phase is stable; adaptive-T2
   early-stopping defeated by non-normal transient growth — open).
2. **R8 — param-EMA alone: tie (2.031).** Exposed the real late-drift mechanism: λ pinned at
   jr_max=16 in the tail — the controller enforcing the OLD validity margin (5e-4) against weights
   that want res~1e-3, which the new estimator tolerates fine (cos 0.89 @ 1.6e-3).
3. **R9 — controller re-calibrated to the new estimator** (target 1.5e-3, λmax 4) + param-EMA:
   **best 1.7399**, sustained EMA plateau 1.75–1.79, λ quiet at 0.1–0.5, tail ~1.92. Worth 0.29 —
   the single largest improvement of the project. Lesson: **estimator and controller must be
   co-designed**; upgrading one without re-tuning the other converts the controller into the main
   source of harm.
4. **14k controls** (same horizon as R9): BP MLP=4 **1.610**, BP MLP=1 1.689. Param-shape
   correction: the **thick** block's 131k FFN matches **MLP=4**, not MLP=1 (which matches the thin
   block's 33k Hopfield memory) — so thick-EP's like-for-like BP baseline is 1.610.
   Unregularized **BPTT-14k destabilizes** (res→4.7e-2, val→3.0, best 2.021 — worse than its own
   3k run, 1.949): the equilibrium architecture NEEDS the contraction control loop for long
   training; EP carries it out of estimator necessity, bare BPTT got nothing.
5. **The decisive control — BPTT + identical λ-controller + param-EMA: 1.6348**, tail stable.
   Verdict: most of EP-beats-bare-BPTT was the regularizer; with equipment matched, exact
   gradients lead EP (R9) by **0.105** — which matches the measured estimator misalignment
   (cos 0.85–0.90). Bonus: the controlled equilibrium block (1.635) beats the MLP=1 BP baseline
   (1.689) — the contraction controller is good for the architecture independent of training
   rule; EP merely forced its discovery.
6. **Adaptive T1/T2 (R10): EP 1.6755** — ate 0.064 of the 0.105 rule tax. Adaptive T2 solved by
   *hindsight snapshot selection* (judge by increments of the contrast estimate a_t, never by step
   sizes — non-normal transients can't fool it; dangerous batches self-limit to t≈20–30, stable
   ones run to ~110 and collect cos 0.987; probe mean 0.871→0.932). Adaptive T1 companion: the
   λ-controller signal stays sampled at T1=150 (R9 peace preserved), then relaxation refines to
   res≤1e-4 before nudging — long-T2 gains exist ONLY from tight equilibria (at res~1e-3 long T2
   hurts). **Final 14k decomposition: architecture 0.025 (1.635 vs 1.610), EP rule tax 0.041
   (1.676 vs 1.635), total 0.066.** EP now beats the MLP=1 BP baseline (1.689).
   *Multi-seed (3/arm):* EP **1.680 ± 0.005** vs BPTT-ctl **1.639 ± 0.006** ⇒ tax
   **0.041 ± 0.005 (~9σ)** — confirmed real.
7. Refuted along the way: λ-floor∝lr annealing (R6) reproduces the λ→0 death; post-mortem of that
   death (R2/R6): **explosion frozen by fp32 absorption masquerading as convergence** (val 60–77
   with res≡0: ε·F < ulp(z); uncapped tok/pos/fc/pj blow up) — the λ floor is the anti-collapse
   mechanism, not just a tax.
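
The absorption effect in the post-mortem (ε·F < ulp(z), so the measured residual is exactly zero although the force is not) can be reproduced with stdlib float32 rounding. The values of `z`, `force` and `eps` below are illustrative, not taken from the runs.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

z = f32(1000.0)      # a large state entry, as after a weight blowup
force = 1e-4         # nonzero force: the state is NOT at a fixed point
eps = 0.1            # relaxation step size (assumed)
z_next = f32(z + eps * force)

ulp = 2.0 ** -14     # float32 spacing for values in [512, 1024)
print(z_next == z, eps * force < ulp / 2)  # True True
```

A residual computed from `z_next - z` would read exactly 0 here, which is how an exploded state can pass for a converged one.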

## Code map

- `~/ept/cet_mvp.py` — CET energy model + EP & TBPTE training (App. B faithful).
- `~/ept/bp_transformer.py` — vanilla BP transformer baseline.
- `~/ept/cet_aep.py` — CETReal (real non-conservative attention) + AEP, VF gradient, damping, clip.
- `~/ept/aep_characterize.py` / `aep_depth.py` / `aep_projected.py` / `aep_contractive2.py` — AEP sweeps.
- `~/ept/aep_option1.py` — projected (constrained) adjoint.
- `/tmp/lt_ep/lt_ep_attention.py` — AEP on the LM's causal attention (gradient quality).
- `/tmp/lt_ep/lt_ep_ffn.py` — EP Hopfield-memory vs FA-FFN (gradient quality).
- `/tmp/lt_ep/lt_ep_train.py` — H2 end-to-end EP vs BPTT training of the equilibrium LM block
  (`--attn_mode real/energy/mono/thick`, jacreg controller `--jacreg/--jr_floor/--res_target`).
- `/tmp/lt_ep/grad_quality.py` — EP-vs-exact gradient cosine vs residual level (the validity probe).
- `/tmp/lt_ep/solver_wall.py` — plain relax vs Anderson convergence per damping level (thin vs thick).

*(local-transformer working copy in `/tmp/lt_ep`; original `~/local-transformer` untouched.)*

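## 2026-06-13 — late-drift diagnosis

Every stable recipe bottoms out mid-run, after which val drifts upward (S0: mild, ~0.05; S1/L2: severe, ~0.5). Multi-hypothesis diagnosis (`drift_diag.py`, 13 columns / 200 steps):
- **Embedding-runaway hypothesis rejected**: |emb| rose to 11.9, then weight decay pushed it back to 11.6; self-limiting, not the culprit.
- **Drift is harmless to the reported numbers**: saving the best checkpoint suffices (L2 drifted to 1.70 but reports best 1.214; warm-track reports 1.14).
- **Drift hides no lower solution**: after warm-track bottomed out at 1.14 at step 4800, lr 3e-4 drifts and lr 1e-4 freezes (1.15); neither finds anything lower ⇒ 1.14 is this recipe's true floor, and drift = being jostled on that floor by noisy gradients.
- **Mechanism = late-training steps along bad directions**: a low lr holds the drift down (the anti-drift continuation freezes at 1.16); S0 cos(EP, BPTT400) falling 0.85→0.3 late in training confirms that the direction degrades.
- **Severity grows with scale**: S0 (C128/T64) ≈0.05, S1 (C256/T256) ≈0.5.
- Still open (S1 warm-start diagnosis + bctl control): is the late cos collapse EP-estimator degradation (A, fixable via N=4 / tracking) or horizon ambiguity of slow-mixing equilibria (B, shared by EP and BPTT, fix = architecture that speeds up mixing)? bctl = cos(BPTT150, BPTT400): if it collapses too, then B.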

## 2026-06-13 (cont.) — "why EP far at S1": operating-point + bias/variance diagnosis

Reframe: at the SAME (tight, res 1.5e-3) operating point EP (1.14–1.39) is at least as good as BPTT (1.52). The apparent
"gap" is EP(1.14) vs LOOSE-BPTT(0.97) — an operating point (res 1e-2, non-contractive) EP cannot
use. Mechanism: T=256 long-range mixing wants non-contractive dynamics; EP needs contraction for the
fixed point to exist; contraction suppresses the mixing. Signature: gap grows with T (S0/T64: 0.04;
S1/T256: 0.17).

bias_var.py (v4b ckpt, 16 batches, EP vs BPTT-400, BPTT-150 control), per group [mean-cos | cos-means]:
- all  0.24|0.36 ;  BPTT-horizon ctrl 0.44|0.57
- attn 0.22|0.33 ; ffn 0.32|0.49 ; ln 0.19|0.30 ; emb 0.31|0.37
Readings: (1) two EXACT gradients (BPTT-150 vs 400) only reach cos-means 0.57 — slow mixing makes
"the gradient" horizon-ambiguous; a PERFECT EP estimator's ceiling here is ~0.57, NOT 1. (2) Both
EP and BPTT improve with averaging (variance present, partly recoverable via navg). (3) EP-specific
residual concentrates in **attention + LN** (EP cos-means 0.33/0.30 vs BPTT 0.57/0.65); **FFN nearly
fine (0.49)**, emb on par. ⇒ the AEP-corrected softmax/LN path is the weak link, pointing the fix at
qk-norm (bounds logits, conditions the attention Jacobian) + N=4 holo + navg. Chasing cos→1 is the
wrong target; the right one is matching BPTT's 0.57 (~0.21 of recoverable EP-specific alignment).
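
The two columns `[mean-cos | cos-means]` are read here as "mean of per-batch cosines" versus "cosine of the batch-averaged gradients"; that reading, and the helper names, are assumptions rather than the definitions in `bias_var.py`. A toy case shows why the second number is higher when part of the error is zero-mean noise:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mean_of_cos(grads, refs):
    """Average the per-batch cosines."""
    return sum(cosine(g, r) for g, r in zip(grads, refs)) / len(grads)

def cos_of_means(grads, refs):
    """Average the gradients over batches first, then take one cosine."""
    mean = lambda vs: [sum(col) / len(vs) for col in zip(*vs)]
    return cosine(mean(grads), mean(refs))

# Two batches: same signal [1, 0], opposite noise along the second axis.
est = [[1.0, 2.0], [1.0, -2.0]]
ref = [[1.0, 0.0], [1.0, 0.0]]
print(mean_of_cos(est, ref), cos_of_means(est, ref))
```

Averaging cancels the noise, so the gap between the two columns is a rough indicator of how much misalignment is variance (recoverable with navg) rather than bias.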

## 2026-06-13 (cont.) — scaling to BPE/50M: stability via small-init residual branches

50M (C=2048) from scratch ABORTED (init res 26; qk-norm+c=2 insufficient — res stuck 0.38). Fix:
**ReZero/Fixup-style small-init of the residual-branch output projections** (WO, pj ×0.1) makes the
thick block ≈ -(z-xin)-c·z (near-identity, trivially contractive) at init; training grows them.
Result at 15M (C=1024): res **1.1e-8 at step 50** (vs abort). The big-width init-instability wall is
a random-init artifact, not fundamental; --resinit folds into the standard recipe. Separately,
torch.compile's 14.6× was a SMALL-model (launch-overhead) artifact; at C≥1024 the relaxation is
compute-bound (fp32 ~15 TFLOP/s) → ~0.06 it/s at 15M, ~0.03 at 50M. Demo feasibility: 15M solo
(~2-4 days); 50M needs multi-GPU data-parallel (per-sample contrast → gradient all-reduce) or the
external cards. The fp32 contrast (catastrophic-cancellation-bound, bf16 dead) keeps it compute-heavy
— again the "digital expensive / physical free" axis.

## 2026-06-14 — the big-model EP recipe (scaling past S1 needs a stability stack)

Scaling EP from S1 (C=256/T=256/char) to C=1024/T=512/BPE-4096 broke the S1 recipe: stable for ~50
steps then violent free-phase blowups (res → 0.5–1.4, jr pinned at max) — the controller's stability
margin shrinks with width/context. CAUTION: a step-50 res≈1e-8 snapshot is NOT proof of stability;
must watch through warmup-end peak-lr (~step 500–1000). The stack that fixes it (each necessary):
  1. **resinit** (WO,pj ×0.1) — near-identity contractive block at init.
  2. **qk-norm** (Qwen3 q/k RMSNorm) — bounds attention logits / Jacobian (also fixes the attn/LN
     estimator weak link from the 06-13 bias_var diagnosis).
  3. **lr warmup** (≥1000 steps linear) — let the λ-controller establish contraction before big steps.
  4. **µP-style lr scaling**: lr 1e-3 (C=256) → **4e-4 (C=1024)**; 1e-3 caused a 400-step instability
     episode, 4e-4 only brief recovered excursions.
  5. **higher jr_max** (16 → 32) — bigger non-conservative force needs more penalty headroom.
  6. validity gate + fuse retained (catch the residual transient excursions).
Locked 15M demo recipe (running, 16k steps, best-ckpt): C=1024 H=16 T=512, the 6 items above,
holo N=2 + adaptive T1/T2, pema 0.999. Smoke val descends 5.86→3.94 BPE-CE by step 1000.
Cost: ~0.05 it/s (fp32, compute-bound; compile's 14.6× was a small-model launch-overhead artifact)
⇒ 15M ~3 days solo. 50M (C=2048) still unstable even with resinit 0.03 + warmup — needs a dedicated
init/lr curriculum AND multi-GPU; deferred as a sub-project (not on the demo critical path).
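
Items 3 and 4 of the stack amount to a learning-rate schedule. The sketch below uses the quoted peak (4e-4), warmup length (1000 steps) and run length (16k steps); the cosine decay after warmup and the function name are assumptions, since the 15M recipe's decay shape is not stated here.

```python
import math

def lr_at(step, peak_lr=4e-4, warmup=1000, total=16000, min_lr=0.0):
    """Linear warmup to peak_lr, then (assumed) cosine decay to min_lr."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 499, 999, 8500, 16000):
    print(s, lr_at(s))
```

The slow ramp is what gives the λ-controller time to establish contraction before full-size steps arrive, which is why stability has to be judged after the warmup-end peak rather than at step 50.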

## 2026-06-16 — C=512 BPE scale: baselines, AsymEP attribution, early-slowness dissection

**C=512 BPE-4096 scoreboard** (TinyStories, T=256, B=24; random ln4096 = 8.318; target band 1.0–1.5):

| run | result |
|---|---|
| BP standard transformer (7.48M, no warmup, lr 3e-3) | **best 1.7921** (20k, DONE) |
| EP thick "orphan" (lr 8e-4, warmup800, resinit0.1, holo2, t2sel40) | **best 2.4037** (20k, DONE; jr pinned at floor 0.1 the ENTIRE run — zero excursions) |
| EP thick "chin" (lr 9e-4, full-Chinchilla 24k target) | ABORT @5401, best 3.2408 |
| BPTT thick (exact grad, same arch) | ABORT @2113, best 3.7331 |

Data notes (NOT verdicts): (a) BP-C512 itself = 1.79 sits just *above* the 1–1.5 band (BP needs
C≥1024 to enter it); EP-orphan 2.40 is 0.61 above BP-C512. (b) **the exact-gradient BPTT twin
aborted EARLIEST** (2113, 3.73) — even worse than the EP runs — consistent with "training walks off
the contractive manifold, exact gradient included"; lr/seed not matched across these, treat as
directional. (c) abort = the `res>0.1 for 100 consecutive steps` fuse. **lr is the knife-edge**:
8e-4 (orphan) had NO excursions and finished; 9e-4 (chin) had periodic jr→32 / res→0.2–0.5
excursions, most recovered, one didn't make it back inside the 100-step fuse window → abort@5401.
Narrow lr-stability margin at C=512. (d) orphan late curve is a plateau: best moved only
2.4557→2.4037 over the last 3600 steps.
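
The abort fuse in note (c), `res>0.1 for 100 consecutive steps`, is a streak counter. A minimal sketch with a hypothetical class name:

```python
class ResidualFuse:
    """Abort when the residual stays above a threshold for `patience` steps in a row."""

    def __init__(self, threshold=0.1, patience=100):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0

    def step(self, res):
        """Feed one residual reading; return True when training should abort."""
        self.streak = self.streak + 1 if res > self.threshold else 0
        return self.streak >= self.patience


fuse = ResidualFuse()
recovered = [fuse.step(0.3) for _ in range(99)] + [fuse.step(1e-3)]   # excursion recovers
blown = [fuse.step(0.3) for _ in range(100)]                         # excursion does not
print(any(recovered), blown[-1])  # False True
```

This matches the behaviour reported for the 9e-4 run: excursions that return inside the window reset the counter, and only the one that stayed out for the full 100 steps triggered the abort.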

**AsymEP attribution (web-verified, the J→Jᵀ correction is NOT ours).** arXiv:2602.03670,
Scurria / Vanden Abeele / Mognetti / Massar, "EP for Non-Conservative Systems", names the method
**AsymEP**. Their correction `−2 A_J(z*)(z−z*)` is **mathematically identical** to ours
(`Jv−Jᵀv = (J−Jᵀ)v = 2 A_J v`). Their scope: Hopfield nets + feedforward MLPs (≤500 hidden) +
one CIFAR-10 feedforward CNN; **static inputs; explicit Jacobian construction + decomposition;
NO residual/validity controller; NO attention / transformer / sequence model; do NOT combine with
holomorphic**. Best result: CIFAR-10 FF-CNN AsymEP 89.74% vs BP 90.66% (their variational "Dyadic
EP" matches BP, p=0.75). Their **"VF" baseline** (the force-form ⟨a,∂F/∂θ⟩ readout — *prior art, not
ours*) **collapses without the correction** (CIFAR-10 → 10% = chance; MNIST 64% vs AsymEP 92.7%) —
matching our measured cos≈0.25 for uncorrected non-conservative attention. **Ours on this line**:
matrix-free jvp−vjp (their explicit Jacobian is infeasible at B·T·C dim); softmax-attention
application; the holomorphic combination (the real-linear correction preserves holomorphy);
common-mode-tracking + lock-in variants; the validity-threshold/controller/gate stability stack they
have none of; and the transformer-LM application + scale. (⇒ corrected EP_DERIVATION.md, which had
mislabeled force-form VF as our step.)
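
The identity quoted above can be checked numerically on a small explicit Jacobian. The 2×2 matrix and vector are arbitrary illustrative values; the explicit `J` is used only because the example is tiny, whereas the text notes the project evaluates the same quantity matrix-free as jvp − vjp.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

J = [[-1.0, 2.0], [0.5, -1.5]]          # a non-symmetric (non-conservative) Jacobian
Jt = transpose(J)
A = [[(J[i][j] - Jt[i][j]) / 2.0 for j in range(2)] for i in range(2)]  # antisymmetric part A_J
v = [0.3, -0.7]                          # stands in for the displacement z - z*

Jv, Jtv, Av = matvec(J, v), matvec(Jt, v), matvec(A, v)
corrected = [jv - 2.0 * av for jv, av in zip(Jv, Av)]   # J v - 2 A_J v
print(corrected, Jtv)
```

Both printed vectors agree, which is the sense in which adding `−2 A_J(z*)(z−z*)` to the nudged dynamics replaces J by Jᵀ.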

## 2026-06-17 (cont.) — full to-spec versions of exp1/4/5/6/7 (first pass was partial)

The 06-17 campaign ran lighter/subset versions; re-ran the FULL specs. New findings:

**exp1 (FULL) — multi-checkpoint cos-evolution (snap 0/200/800/2000 + plateau).** EP·BPTT150 cos:
step0(random,res2e-2) 0.26 → step200(res1e-9) **0.99** → step800 0.96 → step2000 0.89 →
plateau(2.40) **−0.05**; BPTT150·BPTT400 = 0.46→**1.00** throughout (true gradient always clean).
New **batch self-cos** at the plateau: EP **−0.27** (gradients on different batches anti-correlated)
vs BPTT **+0.96**. ⇒ decisive: the EP estimator is faithful while descending and undergoes an
**SNR/coherence collapse near the optimum** (true gradient shrinks below the EP bias floor, which is
batch-incoherent) — NOT horizon ambiguity, NOT LR. This is THE 2.40-plateau mechanism.

**exp5 (FULL) — jr_max sweep (the dropped dimension, and it mattered).** Warm-started from 2.40:
**jr_max 8 → 2.3648**, jr_max 4 → 2.376 — both BEAT jr_max 32 (2.463) and the orphan (2.404) by
~0.04. jr_max 16 → 2.416. floor/target fill still ≥2.47. ⇒ the jr_max=32 ceiling was a small tax;
~8 is better. (Earlier "not a controller tax" was incomplete — floor/target don't help, but the
ceiling does.) Still far from the 1–1.5 band; the estimator floor dominates.

**exp4 (FULL) — 3 arms.** armA (warmup): ~50-step gate dead-zone. armB (`cprewarm 200` = full-lr
contraction before task warmup): res 1.7e-2 after pre-phase, dead-zone only PARTIALLY mitigated
(val 5.97 vs 6.07 @100) — establishing contraction once isn't enough; warmup's tiny lr can't deepen
it. armC (no-warmup + resinit 0.03): NO dead-zone (cos 0.99 from step20, best 4.03@300) but a violent
excursion (res→0.14@400, recovered). ⇒ small-resinit+no-warmup learns fastest early but is
excursion-prone; the cprewarm decoupling is a weak fix.

**exp6 (FULL) — branch norms + attention entropy + qk RMS, incl. at plateau.** Branches grow fine
(|WO| 1.95→4.91 over training); at the 2.40 plateau they are LARGE (|pj|=44, |fc|=58, attn/xin≈1.0),
attn entropy 4.56→3.62 (sharpening), qk_rms pinned 1.0 (qknorm working). ⇒ no "stuck-small branch" —
the trainable-α-gate intervention is NOT indicated.

**exp7 (FULL, after a probe bug fix) — EP vs BP influence + per-position CE.** [Bug caught: the
EP probe embedded the input with random-init weights before loading the ckpt → garbage (per-pos CE
~7); fixed to embed post-load.] Clean influence ||Δz*_q||/δ vs distance d: **BP** 5.34(d1)/0.88(d16)/
0.15(d64)/0.003(d200); **EP-orphan** 0.24/0.071/0.019/0.006. Both decay with distance. EP ~20×
weaker in ABSOLUTE coupling (scale-caveated — z* vs hidden norm), but NORMALIZED to d=1 EP reaches
*farther* than BP (0.30 vs 0.16 at d16; 0.08 vs 0.03 at d64). Per-position CE: EP uniformly ~0.7–1.0
worse than BP at every position (no long-range-specific failure). ⇒ **does NOT support — arguably
refutes — "contraction cuts long-range mixing"**; EP's deficit is uniform, consistent with the
estimator floor (exp1) being the binding constraint. (BP-C512 trained to **1.6953**.)
Honest gap: a stable BPTT-ctl-tight ckpt does not exist (BPTT broke at C512), so the cross-method
influence comparison is EP-vs-BP only, not the full tight/loose/BP set.
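
The normalized-reach comparison is plain arithmetic on the influence values quoted above; the short check below reproduces it. Only the numbers come from the log, the variable names are illustrative.

```python
# Influence ||dz*_q|| / delta by distance d, copied from the exp7 (FULL) entry.
bp = {1: 5.34, 16: 0.88, 64: 0.15, 200: 0.003}
ep = {1: 0.24, 16: 0.071, 64: 0.019, 200: 0.006}

def normalize_to_d1(influence):
    """Divide every entry by the d=1 value so only the decay shape remains."""
    return {d: value / influence[1] for d, value in influence.items()}

bp_norm = normalize_to_d1(bp)
ep_norm = normalize_to_d1(ep)
abs_ratio_d1 = bp[1] / ep[1]          # absolute coupling gap at d=1

print(f"absolute BP/EP at d=1: {abs_ratio_d1:.1f}x")
for d in (16, 64, 200):
    print(f"d={d}: BP {bp_norm[d]:.3f}  EP {ep_norm[d]:.3f}")
```

The absolute gap at d=1 and the normalized values at d16 and d64 match the figures quoted in the entry (~20×, 0.30 vs 0.16, 0.08 vs 0.03).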

**Campaign verdict (data; user concludes):** the C512 2.40 plateau is, decisively, an **EP estimator
SNR/coherence collapse near the optimum** (exp1: cos→0, self-cos −0.27, k→4000, true gradient clean).
It is not LR (exp2), not horizon (exp1), not λ-floor/target (exp5) — though jr_max≤8 buys ~0.04
(exp5). The architecture has a SEPARATE wall: exact-gradient BPTT can't stay stable at C512 (exp3).
The contractive-mixing hypothesis is NOT supported (exp7: EP's range is comparable to BP, deficit is
uniform). ⇒ the EP lever is reducing the estimator bias floor near the optimum (N=4 holo / tracking-
AEP / lock-in / navg); the architecture lever is the stability margin for exact gradients.

**Early-slowness dissection (C512 BPE) — cos/k probe CONFIRMED; ablations STILL RUNNING.**
From existing logs: **EP ≈ BPTT at every early step** (step 200: 5.84 = 5.84; step 800: 3.86 vs
4.05; step 2000: 3.58 vs 3.73), both **~1.2 behind BP** (2.76 @800). ⇒ early slowness is *not* the
EP estimator — the exact-gradient twin on the same architecture is equally slow; the gap is the
equilibrium architecture + recipe vs a standard transformer, present already at step 200 (pre
warmup-end). Probe (`--probe_bptt`: cos(g_EP,g_BPTT) + k=|gEP|/|gBPTT| along the REAL trajectory):
- **Estimator quality tracks res tightly**: res≈1e-9 → cos 0.93–0.99, k≈1.0 (all groups); a res
  spike to 1.4e-3 (mid-excursion) → cos all 0.86 / attn 0.70 / emb 0.13, k→0.4. ⇒ the earlier
  kinit *synthetic* probe (cos 0.33 / k 0.42 "at init") was a probe-operating-point artifact; on the
  real trajectory the EP gradient is faithful (cos≈1, k≈1) **whenever the free phase is converged**,
  and **k≈1 means no adaptive-k is needed in the converged regime**.
- **warmup × validity-gate early-stall mechanism**: with warmup, early lr is tiny → the contraction
  homeostat can't pull res under the gate (5e-3) → free-phase res stuck ~1.5e-2 for ~50 steps →
  the **validity gate skips the nudge** → the reported g_EP is pure jacreg/contraction (cos≈0,
  k≈11) = *the task is not being learned at all* for ~50–60 steps. res drops < gate by ~step 60–80
  → cos jumps to 0.99. No-warmup: res converged (1e-9) by step 20, cos 0.98 from step 20
  (val 5.72 vs warmup 6.07 @ step100).
- **CAVEAT — the tradeoff is UNRESOLVED**: the no-warmup arm hit an instability excursion ~step 180
  (res→1.4e-3, jr→32, recovering by 200) — i.e. warmup's stability role is real. Whether no-warmup
  survives to 1200, plus the **resinit (arm B)** and **lr 2e-3 (arm D)** attributions, are STILL
  RUNNING as of 16:00 — no verdict drawn. Probe instrumentation: `lt_ep_train.py --probe_bptt N`;
  launcher `/tmp/lt_ep/early_dissect.sh`.

## 2026-06-17 — C512 diagnostic campaign (7 experiments): the 2.40 plateau is an EP estimator bias-floor; exact-gradient BPTT is LESS stable than EP at C512

Controlled campaign at C512 BPE (orphan ckpt = EP best 2.4037). Raw results + the causal cut each
experiment was built for. Scripts: `lr_sweep.py`, `triangulation.py`, `mixing_probe.py`,
`lt_ep_train.py --probe_branch/--probe_bptt`; orchestration `master.sh`.

**exp3 — BPTT+controller, the decisive matched cut (GPU0).** Same arch+controller+lr 8e-4 as the EP
orphan; only the task gradient differs (exact vs EP). Descended cleanly to best **3.85 @step1400**,
then **DESTABILIZED at step 1600** (jr→32, res→0.07) and lodged in a broken basin (val 6.27,
res ~0.068 — just under the 0.1 fuse, so no abort) for 8400+ steps. **The exact-gradient twin did
WORSE than EP (3.85-stuck vs EP 2.40-completed).** ⇒ at C512 the contraction controller does NOT
keep the exact gradient on the manifold; EP's implicit contraction-bias is what kept the orphan
alive. Inverts the S0 picture (BPTT+ctl 1.635 < EP 1.676). Caveat: single lr/seed.

**exp1 — triangulation at the 2.40 plateau (estimator vs horizon).** BPTT150 vs BPTT400
cos = **1.000** (all/attn/ffn; emb 0.986) — true gradient perfectly well-defined, ZERO horizon
ambiguity — but EP vs BPTT cos = **−0.045** (orthogonal), k=|gEP|/|gBPTT| ≈ **4000**. ⇒ the plateau
is an **EP-specific estimator failure**, not slow-mixing horizon ambiguity.

**exp2 — one-step LR-sweep at the plateau.** BPTT best ΔCE −0.16 (lr 1e-2); EP best −0.04 (lr 1e-4),
diverges for lr≥3e-4. cos 0.11, k 486. ⇒ NOT LR-inequivalence; EP's direction caps one-step descent
4× below BPTT at any lr.

**exp1+exp2 synthesis:** the EP estimator has a ~fixed bias/noise floor. k runs **~1 during active
training → ~4000 at the plateau** (true gradient shrank below the EP bias). 2.40 is the SNR
crossover; below it the true gradient is smaller than the EP bias so EP can't see it. Not LR, not
horizon, not λ. ⇒ the EP lever is **reducing the estimator floor near the optimum** (N=4 holo /
tracking-AEP / lock-in / navg averaging), distinct from the architecture's stability lever.
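
A toy model of the SNR crossover, as a sketch only: the estimator is taken to be the true gradient plus a fixed-norm bias, and the bias is placed exactly orthogonal to the true gradient (an assumption made to keep the arithmetic closed-form). As the true gradient shrinks, cos falls toward 0 and k grows like bias/true.

```python
import numpy as np

def estimator_vs_true(true_norm, bias_norm=1.0, dim=64):
    """Return (cos, k) of g_est = g_true + bias measured against g_true."""
    g_true = np.zeros(dim)
    g_true[0] = true_norm
    bias = np.zeros(dim)
    bias[1] = bias_norm                 # orthogonal to g_true by construction
    g_est = g_true + bias
    cos = g_est @ g_true / (np.linalg.norm(g_est) * np.linalg.norm(g_true))
    k = np.linalg.norm(g_est) / np.linalg.norm(g_true)
    return float(cos), float(k)

# Sweep from "actively descending" (true gradient >> bias) to "at the plateau".
sweep = {s: estimator_vs_true(s) for s in (100.0, 1.0, 1e-2, 2.5e-4)}
for s, (cos, k) in sweep.items():
    print(f"|g_true|/|bias| = {s:g}: cos = {cos:.4f}, k = {k:.1f}")
```

In this toy, k≈4000 corresponds to a true gradient about 1/4000 the size of the bias; it says nothing about where the bias comes from.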

**exp5 — λ floor/target grid (warm-start from orphan).** base 0.1/1.5e-3 → 2.463; 0.05/3e-3 → 2.491;
0.03/3e-3 → 2.485; 0.1/5e-3 → 2.488 — ALL worse than orphan 2.40 (slight late-drift up). ⇒ relaxing
the controller does NOT break the plateau — not a λ-floor tax. (floor 0.03 reached jr=0 without
collapse over 1200 warm-started steps.)

**exp6 — branch-growth (resinit vs no-resinit).** resinit 0.1: branches grow gradually & stably
(|WO| 2.35→4.91, |pj| 3.23→6.57 over 1000 steps; res ~1e-4; best 3.71). no-resinit: branches start
large (|WO|~20, |pj|~20), excursion at step 800 (res→0.23, jr→32; recovered, |WO|→14.7; best 4.04).
⇒ resinit's gradual growth is stabilizing; branches do grow (not stuck-small).

**exp7 — mixing/influence length.** ||Δz_q*||/δ FLAT across distance (~0.007 at d=1 and d=200) and
weak; loose-target (lam_t5) similar (~0.005). ⇒ no distance-decay signature; INCONCLUSIVE on the
contraction-cuts-mixing hypothesis (no non-contractive/BP reference; coupling uniformly weak).

**exp4 — warmup/gate (probes done).** no-warmup best 3.56@1200 (survived its step-180 excursion);
warmup 4.11@800. warmup×validity-gate early-stall confirmed (warmup keeps res>gate ~50 steps →
nudge skipped → no task learning early).

**Net (data; user to conclude):** the 1–1.5 band is blocked at C512 by TWO distinct walls —
(1) EP's estimator bias-floor caps EP at 2.40 (cos→0, k→4000 at the plateau; true gradient is
clean), and (2) the equilibrium architecture+controller can't keep the EXACT gradient stable at
C512 (BPTT broke at step 1600). EP's contraction-bias makes it the more robust of the two here.
The contractive-conflict hypothesis is supported on the BPTT side; the EP plateau is specifically
an estimator-SNR-floor effect.

## 2026-06-17 (round 2) — the plateau bias floor IS the frozen AEP linearization; tracking-AEP fixes it (gradient level)

Estimator ablation at the 2.40 plateau ckpt (`plateau_ablation.py`, vs BPTT400, n_grad=4):

| estimator | cos | k | batch self-cos |
|---|---|---|---|
| N2 frozen-AEP (current) | −0.045 | 4133 | −0.27 (incoherent) |
| N4 / N8 | −0.03 / −0.01 | 31 / 29 | ~0 |
| N2 r=.05/.10/.20 | −0.04 | ~4200 | −0.27 |
| N2 fixed-T2 20/80/120 | nan (diverges w/o snapshot selection) | — | — |
| **track_N2 (common-mode AEP)** | **0.997** | **0.9** | **+0.95 (coherent)** |

Diagnostic at z*: **|Jv−Jᵀv|/|Jv| = 1.37** (J highly non-normal). ⇒ the bias floor IS the
frozen-at-z* AEP linearization. N=4/8 fix only the magnitude (k 4000→30), not the direction; r and
T2 do nothing; **tracking-AEP (re-linearize at the moving common mode z̄=½(z₊+z₋)) restores cos
−0.045→0.997, coherence −0.27→+0.95, magnitude 4133×→0.9×.**

**exp D (per-group):** collapse is UNIFORM across WQ/WK/WV/WO/fc/pj/ln1/ln2/tok/pos (all cos
−0.03..−0.08, k 3k–18k) — the corrupted contrast `a` poisons every VF gradient equally (single
shared cause, not attention/LN-specific unlike the S1 bias_var diagnosis).

**exp B (navg):** cos(EP,BPTT) over navg = 0.37→0.40→0.55→−0.30→−0.21 (1/2/4/8/16); self-cos rises
(0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8. ⇒
**deterministic bias, not variance** — averaging converges to the wrong (biased) direction, so
navg/restart-averaging will NOT fix it.
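
Why averaging cannot help against a deterministic bias can be seen in a toy model: each estimate is true gradient + fixed bias + fresh noise. The sizes, the bias direction, and the noise level below are assumptions picked for illustration and do not reproduce the measured sequence; the point is only that self-cos rises with navg while the averaged direction stays wrong.

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def averaged_estimate(rng, g_true, bias, noise_std, navg):
    """Mean of navg estimates, each = g_true + bias + independent noise."""
    noise = rng.standard_normal((navg, g_true.size)) * noise_std
    return g_true + bias + noise.mean(axis=0)

def probe(navg, seed=0, dim=1000, noise_std=0.1):
    """Return (cos to true gradient, self-cos between two independent averages)."""
    rng = np.random.default_rng(seed)
    g_true = np.zeros(dim)
    g_true[0] = 0.1                     # small true gradient (near the optimum)
    bias = np.zeros(dim)
    bias[0] = -0.3                      # bias component opposing the true gradient
    bias[1] = 1.0                       # larger bias component orthogonal to it
    a = averaged_estimate(rng, g_true, bias, noise_std, navg)
    b = averaged_estimate(rng, g_true, bias, noise_std, navg)
    return cos_sim(a, g_true), cos_sim(a, b)

results = {n: probe(n) for n in (1, 4, 16, 64)}
for n, (cos_true, self_cos) in results.items():
    print(f"navg={n:3d}: cos(est, true) = {cos_true:+.3f}, self-cos = {self_cos:+.3f}")
```

Averaging removes the noise term only, so the estimate converges to g_true + bias; if that sum points away from g_true, more averaging makes the estimate more confidently wrong.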

### 2026-06-17 (round 2b) — free-phase ε is mistuned ~3× (monDEQ α-analysis analog); Anderson untapped

Free-phase convergence probe at the 2.40 ckpt (`solver_probe.py`, res_F=‖F(z)‖/‖z‖, eps-independent,
B=2, 4 batches). Steps to reach res_F≤1e-4 / ≤1e-6:

| solver | ε | steps to ≤1e-4 | steps to ≤1e-6 | note |
|---|---|---|---|---|
| plain Euler | 0.05 | 470 | 844 | |
| **plain Euler (ours)** | **0.1** | **236** | **548** | current setting |
| plain Euler | 0.2 | 121 | 274 | |
| **plain Euler** | **0.3** | **85** | **182** | ~2.8× faster than 0.1, same fixed point, stable |
| plain Euler | 0.5 | stalls | stalls | non-convergent at res~0.9; the stable ceiling is ~0.3–0.4 |
| Anderson(m=5) | 0.1 | 85 | 197 | |
| **Anderson(m=5)** | **0.2** | **58** | **118** | ~4× vs our current 0.1 |

Anderson is robust across ε. ⇒ **ε=0.1 is ~3× too small**; ε≈0.3 or Anderson@ε0.2 cut free-phase
force-evals ~3–4× (and let T1≈100 always clear res≤1e-4, dropping adaptive-T1) → directly attacks the
dominant EP compute cost AND tightens res (better estimator). This IS the "preconditioned/Anderson
relaxation" open item, now quantified. Caveats: free phase at trained point / B=2 (optimal ε may
shift early-training / slow-mixing); the **nudged phase shares ε but has AEP+holo dynamics, so a
bigger ε there needs a separate stability check** before changing. Confirms the monotone-operator
α-sensitivity (non-normal dynamics, |Jv−Jᵀv|/|Jv|=1.37) carries over to our hand-set Euler step.
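
A toy version of the ε sweep, as a sketch under stated assumptions: a 2-D linear force F(z) = A z + b with eigenvalues −1 ± 1.5j stands in for the free-phase dynamics, and `steps_to_tol` is a hypothetical helper, not `solver_probe.py`. For this toy the plain-Euler stability ceiling is ε < 2/3.25 ≈ 0.615, so ε=0.7 never converges. Anderson acceleration is not shown.

```python
import numpy as np

def steps_to_tol(A, b, eps, tol, z0, max_steps=2000):
    """Plain Euler z <- z + eps * F(z) with F(z) = A z + b.

    Returns the number of steps until res_F = ||F(z)|| / ||z|| <= tol,
    or None if the tolerance is not reached within max_steps.
    """
    z = z0.copy()
    for step in range(max_steps + 1):
        F = A @ z + b
        if np.linalg.norm(F) / np.linalg.norm(z) <= tol:
            return step
        z = z + eps * F
    return None

# Toy rotating, contracting dynamics: eigenvalues of A are -1 +/- 1.5j.
A = np.array([[-1.0, 1.5],
              [-1.5, -1.0]])
b = np.array([1.0, 0.0])
z0 = np.array([1.0, 1.0])

sweep = {eps: steps_to_tol(A, b, eps, tol=1e-6, z0=z0)
         for eps in (0.05, 0.1, 0.3, 0.7)}
for eps, steps in sweep.items():
    print(f"eps={eps}: {'no convergence' if steps is None else steps}")
```

The same pattern as in the probe appears: step count falls as ε grows until the per-step map I+εJ leaves the unit disc, after which the iteration stops converging at all.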

**Decision:** the EP lever collapses to **common-mode tracking-AEP** (not N=4 / lock-in / navg).
Caveat: validated at the gradient level at one ckpt; the one-step ΔCE is magnitude-confounded, so the
NECESSARY next test is **training** — does EP with `--track` descend below 2.40 (`holo_a_track`
already wired)? exp C (BPTT+controller rescue, lr 2e-4/4e-4/6e-4 × jr_max 16/32) runs separately for
the architecture-ceiling question.

## 2026-06-18 — tracking-AEP VALIDATED IN TRAINING: it breaks the 2.40 plateau

Warm-track test (warm-start from the 2.40 orphan ckpt, jr_max 8, warmup 0, 2500 steps):
- **TRACK-warm (common-mode tracking-AEP): 2.40 → best 2.1628 @2500, still descending**, stable
  (res ~6e-5, jr at floor). ⇒ tracking-AEP descends PAST the 2.40 plateau — the round-2 estimator-
  floor fix works in training, exactly as the fixed-ckpt probe predicted (cos −0.05→0.997).
- **STD-warm control (standard estimator, same recipe): ABORTED @step 397** (blew up, res→0.21).
- **TRACK-fresh (tracking from scratch): ABORTED @968** (blew from random init).

⇒ under the identical aggressive recipe (jr_max 8), tracking-AEP is BOTH stable AND descends, while
the standard estimator blows up. But tracking-AEP is **not a from-scratch drop-in** (fresh blows up)
— the recipe is **two-phase warm-track**: standard estimator to ~2.40 (where it's stable), then
switch to common-mode tracking-AEP to descend below the floor (matches the S1 warm-track champion).
Caveat: std-control's abort is partly the aggressive jr_max 8 (orphan used jr_max 32, stable at
2.40); the clean claim is "at jr_max 8 tracking survives+descends where std blows". The 2.16 result
closes ~0.24 of the 0.61 EP-vs-BP gap and was still descending, so the run continues (`trkcont` from
2.16) to find the ceiling.

**exp C — architecture ceiling (BPTT+controller rescue, 2500 steps):** stable exact-gradient exists
(lr≤6e-4 never breaks, unlike lr8e-4 which broke @1600). Best: lr6e-4 jr16 → 3.08, still descending.
At equal step (2500): **BP 2.34 ▸ BPTT 3.08 ▸ EP-orphan 3.55** ⇒ the ~10× proportional slowdown
(confirmed: median 9.4× in the descending regime) splits into an lr-stability-capped architecture
handicap (~0.74, BP→BPTT; the equilibrium block can't take BP's 3e-3 lr) AND an EP-estimator
handicap (~0.47, BPTT→EP). jr_max 16 > 32 confirmed again. Long ceiling run (`bptt_ceiling`,
lr6e-4 jr16, 12k) pending.

**Found + staged a real code inconsistency:** free phase / model / readout use erf-GELU (F.gelu)
while the holo nudge uses tanh-GELU (cgelu) → z* isn't the nudge-force's fixed point → spurious
common-mode drift in the contrast every step (which tracking-AEP partly absorbs — possibly part of
why it wins). Fix staged as `--gelu tanh` (additive, default erf). Separately: free-phase ε=0.1 is
~3× too small (ε0.3 or Anderson → ~3–4× fewer relaxation steps; `solver_probe`).

---

## 2026-06-24 — below-CE-2.1 divergence: DIAGNOSED (genuine Hopf) + FIXED (resreg/jacreg = finite-T1 LE control)

Full write-up: **`SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md`**. This pins down the forward-stability wall that sat under TL;DR #3 ("no loss ceiling, but a validity threshold that costs reg + steps") — the threshold is now mechanistically a **Hopf bifurcation**, and the "regularization tax" is specifically **LE control**.

**The mechanism (3 independent confirmations).** Below CE~2.1 the relaxation diverges because the learned **non-conservative attention** operator undergoes a **genuine continuous-time Hopf bifurcation** — a complex eigenvalue pair crosses **Re μ > 0** as EP makes attention expressive. Matrix-free FD-JVP Arnoldi on M=I+εJ (`ep_run/eig_probe.py`): leading continuous μ=(λ−1)/ε goes **−0.024 (s2000, CE 3.13, STABLE) → +0.44±1.37j (s3200, 2.74) → +1.35±2.08j (ep_eps05, 2.41)** — Re μ and |Im μ| both grow as CE drops. Confirmed by (a) eval limit-cycle + attention-knockout (α=0.2 converges), (b) Anderson can't reach a root at s3200 (near-root Re μ=+0.24 rotating). **Earlier "Euler artifact" read (and the cont.6 forward-mode/RTRL framing) are SUPERSEDED** — fugu caught that the ε-sweep "convergence" was the *step* residual ε·g, not g→0; an intermediate misread. (The Euler over-amplification is a real *second* layer — ε relocates the wall 2.74→2.41 — but the root cause is Re μ>0, which no ε closes.)
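
A minimal sketch of the eigenvalue probe idea, assuming a toy 4-D force with a known unstable rotating pair; this is not `ep_run/eig_probe.py`, and the function names are hypothetical. It runs Arnoldi on M = I + εJ using only finite-difference JVPs, then maps each λ back to the continuous-time μ = (λ−1)/ε.

```python
import numpy as np

# Toy Jacobian (an assumption, not the model's). It is block upper-triangular,
# so its eigenvalues are 0.4 +/- 1.5j (unstable rotating pair), -1 and -2.
A = np.array([[ 0.4,  1.5,  0.7,  0.2],
              [-1.5,  0.4,  0.3,  0.5],
              [ 0.0,  0.0, -1.0,  0.6],
              [ 0.0,  0.0,  0.0, -2.0]])

def force(z):
    """Toy force with fixed point z* = 0; its Jacobian at z* is A."""
    return A @ z - 0.1 * z ** 3

def fd_jvp(F, z, v, h=1e-6):
    """Forward finite-difference Jacobian-vector product J(z) v."""
    return (F(z + h * v) - F(z)) / h

def arnoldi_eigs(matvec, n, k, seed=0):
    """Eigenvalues of the k-step Arnoldi Hessenberg matrix (Ritz values)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    q0 = rng.standard_normal(n)
    Q[:, 0] = q0 / np.linalg.norm(q0)
    m = k
    for j in range(k):
        w = matvec(Q[:, j])
        for i in range(j + 1):              # modified Gram-Schmidt
            H[i, j] = Q[:, i] @ w
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-10:             # Krylov space exhausted
            m = j + 1
            break
        Q[:, j + 1] = w / H[j + 1, j]
    return np.linalg.eigvals(H[:m, :m])

eps = 0.1
z_star = np.zeros(4)
step_map = lambda v: v + eps * fd_jvp(force, z_star, v)   # v -> (I + eps J) v

lam = arnoldi_eigs(step_map, n=4, k=4)
mu = (lam - 1.0) / eps                       # continuous-time eigenvalues
lead = mu[np.argmax(mu.real)]
print("spectral radius of M:", np.max(np.abs(lam)))
print("leading mu:", lead)
```

Here the full Krylov space is used because n is 4; at model scale only a few Arnoldi steps are affordable, and the Ritz values approximate the leading eigenvalues rather than recovering them exactly.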

**Why equilibrium-EP hits this and BPTT doesn't (structural).** Equilibrium-EP optimizes L(z*); its gradient is blind to *finite-time contraction* (a param eroding ρ→1 with z* fixed has zero equilibrium gradient). BPTT differentiates the T1 unroll, so a non-converged z_T1 → bad output → penalized → it implicitly defends contraction. So EP needs that defense added back explicitly.

**The fix — control the finite-T1 Lyapunov exponent (LE).** Stability ⟺ ρ(M)<1 ⟺ finite-T1 LE<0. Two ~orthogonal handles, they stack:
- **resreg** (`lt_ep_train.py:220-231`) = penalize the T1-residual ‖εF(z_T1)‖ ~ ρ^T1 = exp(T1·LE) → a **direct LE penalty** (also catches the non-normal transients eigenvalues miss). **The PROVEN below-2.1 stabilizer.**
- **jacreg** (`:211-219`) = penalize ‖J_nc·v‖ → shrinks |Im μ|, pushes the pair to Re μ<0 (cause-side). `ep_run/eig_jacreg.py` confirmed at the mechanism level: at the same CE~2.74, frozen-jacreg = Re μ=+0.45 rotating g_floor 0.26, adaptive-jacreg = **Re μ=−0.23 real, g_floor 1e-4** (Hopf killed, true fixed point restored, AsymEP validity restored).
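
The link between the T1-residual and the finite-T1 LE can be sketched on a toy linear relaxation. Assumptions: F(z) = A z with a 2×2 rotation-plus-decay block and fixed point 0; `finite_T_le` is a hypothetical helper, not the code at `lt_ep_train.py:220-231`. The residual after T1 Euler steps scales like ρ^T1 = exp(T1·LE), so its log divided by T1 recovers the LE.

```python
import numpy as np

def rot_block(a, w):
    """2x2 block with eigenvalues a +/- w j."""
    return np.array([[a, w], [-w, a]])

def finite_T_le(A, eps, T1, z0):
    """Finite-T1 Lyapunov exponent estimated from the step residual ||eps F(z)||.

    Returns (LE estimate, residual at T1). The residual at T1 is the quantity
    a resreg-style penalty would act on.
    """
    z = z0.copy()
    r0 = np.linalg.norm(eps * (A @ z))
    for _ in range(T1):
        z = z + eps * (A @ z)               # Euler relaxation step
    rT = np.linalg.norm(eps * (A @ z))
    return float(np.log(rT / r0) / T1), float(rT)

eps, T1 = 0.1, 150
z0 = np.array([1.0, 0.5])

A_stable = rot_block(-1.0, 1.5)             # Re mu < 0
A_hopf = rot_block(0.4, 1.5)                # Re mu > 0, past the bifurcation

le_stable, res_stable = finite_T_le(A_stable, eps, T1, z0)
le_hopf, res_hopf = finite_T_le(A_hopf, eps, T1, z0)

rho_stable = np.max(np.abs(np.linalg.eigvals(np.eye(2) + eps * A_stable)))
print("stable: LE =", le_stable, " log rho =", np.log(rho_stable))
print("hopf:   LE =", le_hopf, " residual at T1 =", res_hopf)
```

For a non-normal operator the residual-based estimate can differ from log ρ(M) at finite T1 because of transient growth, which is the case the resreg bullet says eigenvalues alone miss.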

**★ The 2.09 recipe = FROM SCRATCH + resreg=0.2 + FROZEN jr** (NOT adaptive jacreg — that was a session-long detour; cmd at `EP_BELOW210_DIAGNOSIS_FIX.md:97-101`). Original ep_resreg2 reached **2.0573** (lowest EP ever, lost to /tmp wipe; rebuild 2.22). This session: ep_resreg_warm (resreg eager, warm from stable s2000) descending smoothly through 2.30 (peak res 1.6e-2, no spikes) — the clean 2.09 test; ep_rr_scratch (from-scratch, proven recipe) launched as the robust confirmation. **Warm-start ONLY from a STABLE operator (s2000, Re μ<0); resreg/jacreg PREVENT instability, they can't RESCUE an already-blown one** (warming from ep_eps05@2.41 blew).

**Infra (#14).** `--compile` EXONERATED + SAFE (compiles the free-phase `tforce`; numerically identical to eager — z150 rel-diff 1.6e-7, pure fp32; ~1.43x, ~3.3x with t2sel40). **`--tf32` UNSAFE — do NOT use** (10-bit mantissa ≈ 1e-3 perturbation; the relaxation is hyper-sensitive — ε 0.1→0.05 alone moved the wall 0.33 — so TF32 is too coarse). EP parallelism for the deep/scaling phase: no sequential backward, coupled equilibrium stack (#13) → depth-parallel, adaptive-T1 (residual-stopped) cleaner than adaptive-ε.
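
The "10-bit mantissa ≈ 1e-3" figure can be checked with the standard library. The sketch below emulates TF32 by truncating a float32 to 10 explicit mantissa bits; real hardware rounds rather than truncates, so this is an assumption that only bounds the size of the perturbation.

```python
import struct

def truncate_to_tf32(x):
    """Keep 10 explicit mantissa bits of the float32 encoding of x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000                  # clear the low 13 of 23 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

unit = 2.0 ** -10                       # spacing of a 10-bit mantissa near 1.0
x = 1.2345678
rel_err = abs(truncate_to_tf32(x) - x) / abs(x)

print("2**-10 =", unit)
print("relative error for x =", x, "is", rel_err)
```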

**Lit anchor.** Hopfield-ResNet (arXiv 2509.26003) EP-trained 12 layers, but it is **conservative** (energy Φ, symmetric weights, no oscillation by construction). This confirms non-conservativity is our culprit; **we are the first to EP-train non-conservative attention** (which exhibits the Hopf bifurcation), solved via resreg/jacreg.