path: root/docs/method
author Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
committer Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
commit b83947778e2c776f757a07d4719b7ce961d7ed55 (patch)
tree b9cc01d7adda691d9156d9d04f4fb2f644674e96 /docs/method
Initial commit: ept — backprop-free equilibrium transformer (EP)
Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
Diffstat (limited to 'docs/method')
-rw-r--r-- docs/method/ARCHITECTURE.md 117
-rw-r--r-- docs/method/EP_DERIVATION.md 215
-rw-r--r-- docs/method/METHODS.md 576
-rw-r--r-- docs/method/READING.md 58
-rw-r--r-- docs/method/READING_EN.md 54
5 files changed, 1020 insertions, 0 deletions
diff --git a/docs/method/ARCHITECTURE.md b/docs/method/ARCHITECTURE.md
new file mode 100644
index 0000000..213b192
--- /dev/null
+++ b/docs/method/ARCHITECTURE.md
@@ -0,0 +1,117 @@
+# Architecture: EP/AEP-trained Equilibrium Transformer — implementation details
+
+Based on the actual implementation in `/tmp/lt_ep/lt_ep_train.py` (`EQBlock` + `ep_step`).
+
+## 0. Overview — one unified force, one relaxation
+
+The whole block is **one dynamical system**: token state `z ∈ R^{B,T,C}` relaxes to a fixed
+point under a **single force** that bundles input-clamp + FFN + attention:
+
+```
+F(z) = − ∂/∂z [ ½‖z − x_in‖² + E_mem(z) ] + s·( Attn(z) − c·z )
+ └────── conservative part (symmetric Jacobian) ──┘ └── non-conservative (non-reciprocal) ──┘
+ input clamp Hopfield memory = FFN causal self-attention + damping
+```
+- `x_in = tok[idx] + pos` — input embedding, clamped as a boundary condition.
+- **Forward = relaxation**: `z ← z + ε·F(z)` for T1 steps → fixed point `z*`; read out `logits = z*·W_h`.
+- Conservative (FFN/clamp) and non-conservative (attention) live in **one force, one relaxation** — the basis of "unified" training.
+
+---
+
+## 1. FFN — standard EP (the clean part)
+
+The FFN is rewritten as a **modern Hopfield memory energy**:
+```
+E_mem(z) = − Σ_{token, m} relu( z · W_m )²_m # W_m ∈ R^{C×M}, M memories
+```
+Its force `−∂E_mem/∂z = 2·relu(zW_m)·W_mᵀ` is exactly a **tied-weight 2-layer MLP
+(W_in = W_m, W_out = W_mᵀ)**, whose ReLU activation is the derivative of the squared-ReLU energy = the FFN.
+
+- **Conservative** (scalar energy, symmetric Jacobian) → **standard EP is exact, no correction**.
+- Verified: FFN-param gradient cosine vs backprop = **1.000** (`lt_ep_ffn.py`).
+- This is textbook EP / Hopfield — already demonstrated on memristor crossbars in the literature.
+
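The identity between the Hopfield energy and a tied-weight MLP can be checked numerically. The following is a minimal NumPy sketch, not the repository's `lt_ep_ffn.py`: the function names and toy sizes are assumptions, and it compares the closed-form force against a finite-difference gradient of `E_mem`.

```python
import numpy as np

def e_mem(z, W):
    """Hopfield memory energy: E_mem(z) = -sum(relu(z @ W) ** 2)."""
    return -np.sum(np.maximum(z @ W, 0.0) ** 2)

def ffn_force(z, W):
    """Closed-form force -dE_mem/dz = 2 * relu(z @ W) @ W.T (tied-weight MLP)."""
    return 2.0 * np.maximum(z @ W, 0.0) @ W.T

def numeric_force(z, W, h=1e-6):
    """Central finite-difference estimate of -dE_mem/dz, one coordinate at a time."""
    f = np.zeros_like(z)
    for idx in np.ndindex(z.shape):
        dz = np.zeros_like(z)
        dz[idx] = h
        f[idx] = -(e_mem(z + dz, W) - e_mem(z - dz, W)) / (2 * h)
    return f

rng = np.random.default_rng(0)
z = rng.standard_normal((3, 8))          # 3 tokens, C = 8 (toy sizes, hypothetical)
W = rng.standard_normal((8, 16)) * 0.3   # C x M memory matrix
max_err = np.max(np.abs(ffn_force(z, W) - numeric_force(z, W)))
```

A small `max_err` indicates that the force really is the gradient of a scalar, which is the property that lets standard EP apply to this term without correction.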
+---
+
+## 2. Attention — how it is "EP-ified" (the novel part)
+
+**Step A — rewrite attention as a FORCE** (not a feedforward layer): tokens relax under it.
+```
+Attn(z) = [ softmax( Q(z)K(z)ᵀ/√d , causal ) · V(z) ] · W_O
+ Q=zW_Q, K=zW_K, V=zW_V (independent projections ⇒ NON-reciprocal: i→j ≠ j→i)
+force term = s·( Attn(z) − c·z ) # −c·z damping ⇒ contraction ⇒ a stable fixed point exists
+```
+
+**Step B — fix the non-reciprocity bias (AEP correction).** Because Q≠K and V is independent,
+the attention Jacobian `J_attn` is **asymmetric** — it is not the gradient of any scalar. Plain EP's
+nudged phase relaxes under `J`, but the correct adjoint needs `Jᵀ`, so plain EP gives a **biased**
+gradient. AEP adds, in the nudged phase:
+```
+v = z − z* # deviation from the free equilibrium
+corr = s·( J_attn·v − J_attnᵀ·v ) # = 2·A_J·v , A_J = (J − Jᵀ)/2 (antisymmetric part)
+ J_attn·v = jvp(Attn, z*, v) # forward-mode (one forward probe)
+ J_attnᵀ·v = vjp(Attn, z*, v) # reverse-mode (one backward probe)
+f_nudged = F(z) − sign·β·∂C/∂z − clip(corr)
+```
+Effect: the attention part of the nudged linearization becomes `s·J·v − s·(J−Jᵀ)v = s·Jᵀ·v`
+— i.e. **J is turned into Jᵀ**, giving the correct adjoint.
+- The damping `−c·z` is **symmetric** (Jacobian −cI) ⇒ cancels in `J−Jᵀ` ⇒ the correction only
+ sees attention's **non-reciprocal** part.
+- Verified: attention-param gradient cosine vs backprop = **0.99–1.0** (plain EP: 0.25–0.78).
+- Hardware note: `jvp/vjp` = two probe directions; **non-reciprocal coupling is exactly what real
+ analog devices have** — AEP removes the usual "symmetric weights" requirement of EP hardware.
+
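The effect of the correction can be seen on a toy linear system where everything is available in closed form. This is a sketch under stated assumptions, not the code in `ep_step`: the force `F(z) = A z + b`, the quadratic cost, and all names are hypothetical, and the Jacobian is an explicit matrix here, whereas the document obtains `J·v` and `Jᵀ·v` from jvp/vjp probes. Because `∂F/∂b = I`, the vector-field estimator for `b` is the contrast `a` itself, so `a` can be compared directly with the exact implicit gradient.

```python
import numpy as np

def relax(force, z0, eps=0.1, steps=1000):
    """Euler relaxation z <- z + eps * force(z)."""
    z = z0.copy()
    for _ in range(steps):
        z = z + eps * force(z)
    return z

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
n = 16
# Non-symmetric but contractive linear force F(z) = A z + b (toy stand-in for the block).
A = -2.0 * np.eye(n) + 0.5 * rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)
y = rng.standard_normal(n)            # target of the toy cost C(z) = 0.5 * ||z - y||^2
F = lambda z: A @ z + b
grad_C = lambda z: z - y

z_star = relax(F, np.zeros(n))        # free phase
beta = 0.02

def contrast(use_aep):
    """Centered contrast a = (z_minus - z_plus) / (2 beta), with or without the correction."""
    def nudged(sign):
        def force(z):
            v = z - z_star
            corr = (A @ v - A.T @ v) if use_aep else 0.0   # J v - J^T v
            return F(z) - sign * beta * grad_C(z) - corr
        return relax(force, z_star)
    return (nudged(-1.0) - nudged(+1.0)) / (2 * beta)

# Exact gradient of C(z*(b)) w.r.t. b via the implicit function theorem: -A^{-T} grad_C(z*).
exact = -np.linalg.solve(A.T, grad_C(z_star))
cos_plain = cosine(contrast(False), exact)
cos_aep = cosine(contrast(True), exact)
```

In this toy setting the corrected contrast should align with the exact gradient up to the O(β²) bias, while the uncorrected one solves with `A` instead of `Aᵀ` and is misaligned by an amount that grows with the antisymmetric part of `A`.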
+---
+
+## 3. End-to-end unified training
+
+**One relaxation, one estimator, trains the whole block.** Key fact:
+
+> The antisymmetric Jacobian of the **full** force, `A_J`, equals the antisymmetric part of
+> (conservative + attention) = **attention's antisymmetric part alone** (the conservative FFN/clamp
+> have symmetric Jacobians → contribute 0 to `A_J`).
+
+So **the AEP correction needs to act on the attention term only**; the FFN/clamp ride along in the
+conservative part and are trained correctly by standard EP — **one relaxation, one correction, trains
+everything.**
+
+**Training step (`ep_step`) = 3 phases + 1 local update:**
+```
+① free phase: z* = relax(F, x_in, T1) # to the fixed point
+② nudged ±: z₊ = relax( F − β·∂C/∂z − corr , from z*, T2 ) # +β
+ z₋ = relax( F + β·∂C/∂z − corr , from z*, T2 ) # −β (centered EP)
+③ adjoint: a = (z₋ − z₊) / (2β) # read from the two nudged equilibria
+④ local update:
+ • equilibrium params (W_Q,W_K,W_V,W_O, W_m, embeddings tok/pos):
+ ∇_θ L = ⟨ a , ∂F/∂θ(z*) ⟩ # vector-field EP estimator — one formula for attn+FFN+embed
+ (code: autograd.grad( (a·F(z*,θ)).sum(), θ ) , θ live, z* fixed)
+ • readout W_h (outside the equilibrium): direct local gradient ∂C/∂W_h |_{z*}
+```
+- Why attention (non-conservative) and FFN (conservative) train under the *same* estimator:
+ `⟨a, ∂F/∂θ⟩` is uniform over all equilibrium params; the AEP correction only modifies `a` (making it
+ the correct adjoint for the non-reciprocal system); the FFN's `∂F/∂θ` is already correct.
+- Embeddings `tok/pos` enter the force through the clamp `½‖z−x_in‖²` (`∂F/∂x_in = +I`), so the same
+ `⟨a, ∂F/∂θ⟩` yields their gradient.
+- **Stability (feedback regulation, from FRE-RNN 2508.11659):** at each step, measure the free-phase
+  residual `res = ‖relax(z*,1)−z*‖/‖z*‖`; if `res` is too large, **increase damping `c`** (lower the
+  spectral radius → keep converging); if it is very small, relax `c`. This maintains the "free phase has
+  converged" condition (Ernoult 2019: EP ≡ BPTT in the β→0, converged limit) throughout training.
+
+**Measured (this implementation):** end-to-end EP trains a char-LM to **val CE 2.95** (random 4.17,
+backprop on the same architecture 2.10), with **zero non-finite steps** under feedback regulation.
+
+---
+
+## One-line summary
+> **One energy/force, one relaxation, one local estimator.** FFN = conservative Hopfield energy →
+> *standard EP* (exact). Attention = a *non-reciprocal force*; AEP turns the nudged-phase `J` into `Jᵀ`
+> via two probes (jvp/vjp) → exact gradient. Since the full force's antisymmetric part comes only from
+> attention, **one AEP correction + standard EP train the whole block end-to-end**; the readout trains
+> directly on the cost; damping + feedback-regulation keep the system convergent.
+
+## Hardware-relevant primitives
+- **local, no weight transport**: every weight updates from locally-available equilibrium states.
+- **compute = relaxation to a fixed point**: maps to oscillators / memristor crossbars / optics / Ising.
+- **two phases, same circuit**: free + nudged differ only by a small output nudge β.
+- **non-reciprocal coupling OK (a feature)**: AEP handles asymmetric `J`; `jvp`/`vjp` = two probe directions.
+- **dissipation `c` is a physical knob**: feedback-regulated to keep the system in the convergent regime.
diff --git a/docs/method/EP_DERIVATION.md b/docs/method/EP_DERIVATION.md
new file mode 100644
index 0000000..9adcc38
--- /dev/null
+++ b/docs/method/EP_DERIVATION.md
@@ -0,0 +1,215 @@
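+# From standard EP to the current version: the causal path of each modification
+
+Companion docs: `METHODS.md` (the full method, organized by topic) / `FINDINGS.md` (project timeline).
+This document is the **diff view**: starting from textbook EP, each layer of modification corresponds to
+**one implicit premise of standard EP that the transformer breaks**, and records "what the standard
+does → why it fails → what we change → where the code is".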
+
+---
+
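+## 0. Standard EP (Scellier–Bengio 2017) and its four implicit premises
+
+Standard EP runs on an **energy function** `E(z, θ)`:
+
+- **Free phase**: dynamics `ż = −∇_z E`, relax to the energy minimum `z*` (the free fixed point).
+- **Nudged (weakly clamped) phase**: add the loss to the energy with strength β, `E_β = E + β·ℓ(z)`, relax to `z_β`.
+- **Gradient**: `∂L/∂θ ≈ (1/β)[ ∂E/∂θ(z_β) − ∂E/∂θ(z*) ]` (one-sided, bias O(β)).
+
+For this to hold, it **assumes four premises by default**:
+
+| | Premise | Meaning |
+|---|---|---|
+| **A** | **Conservative / symmetric** | A scalar energy E exists, so the Jacobian `J = ∂F/∂z` is symmetric (`J = Jᵀ`). |
+| **B** | **Free phase has converged** | The readout is taken at the true fixed point `z*`; residual ≈ 0. |
+| **C** | **Small-β linear response + clean nudge** | β→0, the nudge is only a perturbation; any clamp added to prevent divergence does not affect the estimate. |
+| **D** | **A stable fixed point always exists during training** | After every weight update, the free phase still converges to a stable fixed point. |
+
+**The transformer breaks all four.** Every modification in the current version repairs one of them. The four blocks below cover them in turn.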
+
+---
+
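+## Breakage A: conservativeness (force-form EP + AEP correction)
+
+**What the standard does**: start from an energy E, `F = −∇E`, and take the gradient from `∂E/∂θ`.
+
+**Why it fails**: a transformer block has no energy. Attention with `Q≠K` (non-reciprocal coupling)
+and the untied FFN (`W1≠W2ᵀ`) make `J ≠ Jᵀ`, so **no E can be written down**. Forcing the energy route
+(`energy` mode: tied-value LSE energy) makes the block conservative, but at the cost of expressivity
+(measured: thick 1.95 vs energy/mono 2.11, a gap of 0.15–0.2 CE).
+
+**The fix (two steps)**:
+
+1. **Force-form EP (VF, vector-field readout)**: drop the energy, write the dynamics directly as a
+   force `F(z)`, and relax `ż=F(z)` to a fixed point. The gradient no longer uses `∂E/∂θ`; it uses the
+   **vector-field readout** instead:
+   ```
+   ∂L/∂θ ≈ ∂/∂θ ⟨a, F(z*; θ)⟩      a = contrast state (see Breakage C)
+   ```
+   This step calls autograd only once, at the fixed point (per-term local bookkeeping, not BPTT).
+   Attention, FFN, LN, and embeddings are all terms of the same F: **trained jointly, with no
+   per-module schedule**.
+   ⚠️ **Correction (not our invention)**: force-form VF is an **existing method**; it is exactly the
+   baseline in the AsymEP paper (arXiv:2602.03670). Moreover, **VF on its own collapses on
+   non-conservative systems** (in their results VF is at 10% chance on CIFAR-10, and 64% vs AsymEP's
+   92.7% on MNIST), which matches our measured "uncorrected attention cos≈0.25". So this step is
+   **not our contribution**; it is the "starting point that collapses". What actually repairs it is the
+   antisymmetric correction in step 2 below (also AsymEP's).
+
+2. **AEP correction (non-conservative repair)**: in force form, the naive nudge linearizes around z*
+   with J, but the correct adjoint needs **Jᵀ**. Without the correction, the attention gradient has
+   cos ≈ 0.25 (essentially wrong). The correction adds `−(Jv − Jᵀv)` to the nudged force, with
+   `v = z − z*`; the matrix-free implementation is one jvp + one vjp. Effect: the nudged-phase
+   Jacobian becomes `J → Jᵀ` ⇒ a approaches the true adjoint response ⇒ **the `Q≠K` attention gradient
+   cosine recovers to 0.99–1.0** (measured: attn 0.99 / ffn 1.00 / full block 0.99).
+   - Source: AEP, "EP for non-conservative systems" (arXiv:2602.03670).
+   - Key property: the correction term is **linear in (z−z*) with real coefficients** → it does not
+     break the analyticity needed by the holomorphic estimator later.
+   - Note: the correction is **linearized at z***, so the nudged trajectory must stay inside the
+     linear-response window (T2≈20 is inside; T2≳150–300 leaves it, see the horizon limit in the
+     hardware-twin section).
+
+**Code**: `lt_ep_train.py::force` (thick force), the AEP block inside `nudge()` in `ep_step`, and
+`nc_force` (the non-conservative part, used by AEP/jacreg). `--attn_mode thick`.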
+
+---
+
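+## Breakage B: convergence (residual as the health signal + adaptive T1)
+
+**What the standard does**: fix T1, assume convergence, and read out directly at "z*".
+
+**Why it fails**: the EP estimator has a **validity threshold** (measured, not assumed): the gradient
+cosine vs the exact reference degrades sharply with the free-phase relative residual
+`res = ‖z⁺−z*‖/‖z*‖`:
+```
+res ≈ 5e-5  → cos 0.85–0.88
+res ≈ 1e-3  → cos 0.2–0.9 (batch-dependent)
+res ≈ 3e-3  → cos ≈ 0–0.5
+res ≈ 1e-2  → noise
+```
+**BPTT has no such threshold** (it differentiates the actual finite computation and returns a
+self-consistent gradient whether or not the relaxation converged). This asymmetry, and nothing
+deeper, is the EP-specific difficulty.
+
+**The fix**:
+1. **Make res the primary health signal** (not the spectral radius; see Breakage D). Each training
+   step takes one extra relaxation step to measure res.
+2. **Adaptive T1**: after measuring res at the fixed T1=150, if it is still > `res_est` (default 1e-4),
+   keep relaxing in chunks of 50 steps until res ≤ threshold or the `t1max` cap is reached.
+   **Compute buys tightness.**
+   - Important detail: the λ-controller's res signal is **still sampled at the fixed T1=150** (this
+     keeps the controller semantics unchanged and avoids a new λ war); only the z* passed to the
+     nudge is refined further.
+
+**Code**: the `res` computation at the top of `ep_step` + the `t1max/res_est` while-refine block.
+`--t1max 500 --res_est 1e-4`.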
+
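The two-stage free phase described above can be sketched as follows. This is a minimal NumPy illustration, not the `ep_step` code: the function names, the return tuple, and the slow linear test force are assumptions; only the default values (T1=150, chunks of 50, cap 500, threshold 1e-4, ε=0.1) are taken from the text.

```python
import numpy as np

def residual(force, z, eps):
    """Relative one-step residual res = ||z_next - z|| / ||z||."""
    return float(np.linalg.norm(eps * force(z)) / np.linalg.norm(z))

def free_phase_adaptive(force, z0, eps=0.1, t1=150, chunk=50, t1max=500, res_est=1e-4):
    """Relax t1 steps, record the controller residual, then refine in chunks."""
    z = z0.copy()
    for _ in range(t1):
        z = z + eps * force(z)
    res_ctrl = residual(force, z, eps)     # sampled at the fixed T1, for the controller
    steps, res = t1, res_ctrl
    while res > res_est and steps < t1max:
        n = min(chunk, t1max - steps)
        for _ in range(n):
            z = z + eps * force(z)
        steps += n
        res = residual(force, z, eps)      # residual of the refined state
    return z, res_ctrl, res, steps

# Slowly contracting toy force (hypothetical): not yet converged after 150 steps.
target = np.array([1.0, -2.0, 0.5])
slow_force = lambda z: -0.2 * (z - target)
z_star, res_ctrl, res_used, steps = free_phase_adaptive(slow_force, np.zeros(3))
```

Returning `res_ctrl` separately from `res_used` mirrors the design point in the text: the controller keeps seeing the residual at the fixed horizon, while the nudged phases start from the tighter state.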
+---
+
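+## Breakage C: small-β linear response + clean nudge (symmetric nudge + holomorphic estimator + adaptive T2)
+
+**What the standard does**: one-sided +β (bias O(β)); and, to stop the nudge from blowing up the
+relaxation, **clamps** inside the nudge (a hard projection on the output, `g.clamp(±2)`, and a
+`‖corr‖≤‖F‖` clip on the AEP correction).
+
+**Why it fails**:
+- The O(β) bias of one-sided nudging forces β to be very small, while estimator noise ∝
+  (convergence error)/β, so a small β amplifies noise.
+- **The clamp is a non-analytic hard projection.** Measured: at marginal residuals (res 1.6e-3) the
+  clamp is the **dominant source of estimator error**: plain EP cos 0.27, 0.89 once the clamp is
+  removed. The clamp was meant to protect early training, yet it **silently poisons every update**
+  as soon as the residual drifts up mid-training.
+
+**The fix (three steps)**:
+
+1. **Symmetric (two-sided) nudge**: `a = (z₋ − z₊)/(2β)`, centered ⇒ bias O(β²) (Laborieux 2021).
+
+2. **Holomorphic EP (clamp-free, Cauchy readout on a complex circle)** (Laborieux–Zenke 2022):
+   replace ±β with N points `β_k = r·e^{2πik/N}` on the circle `|β|=r` in the complex plane, relax the
+   **holomorphically extended** force, and read
+   ```
+   a = −Re[ (1/Nr) Σ_k e^{−iφ_k} (z_k − z*) ]      bias O(r^N)
+   ```
+   The bias drops from O(r²) to O(r^N) ⇒ r can be 5–10× larger (at equal bias), and the 1/β noise
+   amplification drops by the same factor.
+   - Holomorphic extension of the force: hand-written complex LN (non-conjugate variance), softmax as
+     a ratio of exps, GELU in tanh form (holomorphic near the real axis).
+   - **No clamp at all inside the nudge** (clamps are non-analytic and would destroy the O(r^N)
+     order); monitor `max|z−z*|` instead.
+   - The AEP correction is linear with real coefficients ⇒ it preserves analyticity; apply it to the
+     real and imaginary parts separately.
+   - Measured sweep: N (2…8) and r (0.02…0.2) are essentially flat ⇒ **neither finite-β bias nor 1/β
+     noise is the bottleneck at this scale**; the remaining ~0.12 misalignment is T2 truncation
+     (next step).
+
+3. **Adaptive T2 (hindsight snapshot selection)**: T2 truncation costs ~0.12 cos, but on slow-mixing
+   batches a long T2 diverges (non-normal transient growth; **step-size-based early stopping fails**,
+   because the transient falsely triggers it at t≈6–39). Solution: run in lockstep to T2max, snapshot
+   the contrast state `a_t` every K steps, **return the snapshot with the smallest increment (the most
+   settled one)**, and early-exit only on a clear blowup (increment > 5× the running minimum).
+   **The criterion is the increment of the quantity of interest, a, not the step size** ⇒ transient
+   growth is harmless. Measured: never worse than fixed T2=20; mean cos 0.871 → 0.932.
+
+**Code**: `holo_ep.py`: `cln/csoftmax_masked/cgelu/cforce` (holomorphic force), `holo_a` (Cauchy
+readout), `holo_a_select` (adaptive T2), `holo_a_select2` (N=2 phase-batched fast path, numerically
+equivalent to select). The old clamps live in `ep_step::nudge` as `g.clamp(±2)` / the `corr` clip; they
+are **superseded by the holomorphic route** (that is the legacy path).
+`--holo 2 --hr 0.2 --t2sel 120`.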
+
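The bias order of the Cauchy readout can be illustrated on a scalar toy response. This sketch uses only the Python standard library and is not `holo_a`: the function name is hypothetical, and `exp` stands in for the holomorphic map β ↦ z*(β). It estimates the derivative at β=0; the document's contrast `a` is the negative real part of the same sum.

```python
import cmath
import math

def cauchy_derivative(f, r, n_points):
    """Estimate f'(0) from n_points samples on the circle |beta| = r."""
    f0 = f(0.0)
    total = 0.0 + 0.0j
    for k in range(n_points):
        phi = 2.0 * math.pi * k / n_points
        beta_k = r * cmath.exp(1j * phi)
        total += cmath.exp(-1j * phi) * (f(beta_k) - f0)
    return (total / (n_points * r)).real

# Toy holomorphic response curve; its exact derivative at 0 is 1.
f = cmath.exp
r = 0.2
errors = {n: abs(cauchy_derivative(f, r, n) - 1.0) for n in (2, 4, 8)}
```

With N=2 the sum reduces to the centered ±r difference, whose leading error is r²/6 for this toy function; each doubling of N removes more Taylor terms, which is why a larger r becomes affordable.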
+---
+
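+## Breakage D: a stable fixed point throughout training (λ-controller + validity gate + abort fuse + architectural stabilizers)
+
+**What the standard does**: assume that after every update the free phase still converges to a stable
+fixed point.
+
+**Why it fails**: **training pushes the dynamics off the contractive manifold.** This is not
+EP-specific: measured, even bare BPTT (exact gradients) run to 14k steps walks off the contractive
+manifold (res→4.7e-2, best degrades to 2.021, worse than its own 3k run). Once off it, the EP estimator
+enters its invalid region (the threshold of Breakage B), the update direction becomes wrong, and
+positive feedback pushes the weights further away.
+
+**The fix (four pieces, from soft to hard)**:
+
+1. **Residual-driven continuous λ-controller (soft Jacobian penalty)**: the core stabilizer.
+   - Penalty: `λ‖J_nc(z*)‖²_F`, Hutchinson-estimated (one jvp on a random probe, differentiated
+     w.r.t. θ; Bai 2021). It keeps the free phase contractive ⇒ keeps the estimator inside its
+     validity region.
+   - Control law (per step): `λ ← clip( λ · (res_EMA / target)^0.3 , floor , max )`.
+   - **Why control res and not the spectral radius**: the block Jacobian is highly **non-normal**, so
+     transient growth is invisible to the eigenvalues (measured: ρ(J)=0.94, "stable", while the
+     relaxation diverged to res 0.21). The one-step residual **is** that transient, so control on it.
+   - **EMA on the signal (0.9)**: the raw res is noisy, and a multiplicative controller random-walks
+     on noise (measured: λ thrashing between 0.5↔13); the thrashing λ in turn perturbs training. The
+     EMA removes the thrash.
+   - **target ≈ 5e-4**: just inside the validity threshold (few×1e-4), with margin; not tighter
+     (tighter wastes compute and hurts expressivity). For reference, BPTT's own optima sit at res
+     1e-3–2e-2: good solutions are only **mildly contractive**; we ask for slightly more than BPTT
+     because the **estimator** needs it.
+   - **The floor is load-bearing (it cannot be annealed to 0)**: two independent experiments show that
+     λ≲0.02 is fatal at any stage (R2 with λ→0 from the start, and R6 with the λ-floor annealed along
+     with lr, both died the same death: val CE 60–77 with res≡0). Post-mortem: this is **an explosion
+     disguised as convergence by floating point**. ‖z*‖ and the uncapped parameters grow under
+     temporarily invalid gradients until `ε·F < ulp(z)`, the relaxation freezes (res=0 is caused by
+     absorption), and the logits are huge and confidently wrong. The θ-gradient of the λ penalty
+     touches fc/pj/LN/attn, and that is precisely the mechanism that keeps this basin out of reach.
+2. **Validity gate**: when `res_used > res_gate`, the EP update is mathematically undefined ⇒ **apply
+   only the homeostat (jacreg) and skip the nudge entirely** (a fast recovery step). Measured at S1
+   scale, it is load-bearing: three deaths before the gate, survival after it. `--res_gate`.
+3. **Abort fuse**: `res > abort_res` (default 0.1) for **100 consecutive steps** ⇒ stop and keep the
+   best ckpt. A pure safety net against burning machine time in the invalid region. `--abort_res 0.1`.
+4. **Architecture-level stabilizers (only needed as scale grows)**:
+   - **resinit** (ReZero/Fixup: WO and pj scaled by 0.1): the block is a near-identity contraction at
+     initialization, so large widths start stably. `--resinit 0.1`.
+   - **qknorm** (Qwen3-style q/k RMSNorm): bounds the attention logits/Jacobian. `--qknorm`.
+   - **Damping −c·z**: creates/strengthens a fixed point for raw attention; required for thin/real.
+     In thick, however, LN sits inside the block, and damping actually inflates the effective Jacobian
+     (`J∝1/σ(z)`), so thick keeps c small (=1) and leaves stabilization to the λ penalty.
+   - **Weight-norm cap** (WQ/K/V/O/Wm/Wh/fc/pj renormed to 3× init): a blunt safety net during
+     transients; rarely triggered in healthy runs.
+
+**Code**: the jacreg block + validity-gate branch in `ep_step`; in the `main` training loop, the
+control law (`jr = min(jr_max, max(flo, jr*exp(0.3*log(rs/rtgt))))`), the `badct` fuse, the cap
+renorm, and the `resinit/qknorm` injection.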
+
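The control law and the abort fuse above are simple enough to sketch in plain Python. The classes below are hypothetical and are not the training loop in `main`: the initial λ of 1.0, the choice to start the EMA at the target, and the guard against a zero residual are assumptions; the exponent 0.3, EMA 0.9, target 5e-4, and 100-step patience follow the text, and the floor 0.1 and cap 16 are illustrative defaults.

```python
class LambdaController:
    """Residual-driven multiplicative controller for the Jacobian-penalty weight."""

    def __init__(self, lam=1.0, target=5e-4, floor=0.1, lam_max=16.0, ema=0.9, gain=0.3):
        self.lam, self.target, self.floor, self.lam_max = lam, target, floor, lam_max
        self.ema, self.gain = ema, gain
        self.res_ema = target          # start the filter at the set point (assumption)

    def step(self, res):
        """Update the residual EMA, then lam <- clip(lam * (res_ema / target) ** gain)."""
        self.res_ema = self.ema * self.res_ema + (1.0 - self.ema) * res
        ratio = max(self.res_ema, 1e-12) / self.target   # guard against res == 0
        self.lam = min(self.lam_max, max(self.floor, self.lam * ratio ** self.gain))
        return self.lam


class AbortFuse:
    """Trips after `patience` consecutive steps with res > abort_res."""

    def __init__(self, abort_res=0.1, patience=100):
        self.abort_res, self.patience, self.bad = abort_res, patience, 0

    def step(self, res):
        self.bad = self.bad + 1 if res > self.abort_res else 0
        return self.bad >= self.patience


ctrl = LambdaController()
for _ in range(200):
    lam_high = ctrl.step(5e-3)   # residual 10x above target: lam rises to the cap
for _ in range(200):
    lam_low = ctrl.step(1e-5)    # residual far below target: lam decays to the floor

fuse = AbortFuse()
tripped = [fuse.step(0.5) for _ in range(100)]
```

Clipping at the floor rather than at zero is the point stressed in the text: the penalty can relax when the residual is healthy, but it never switches off.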
+---
+
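+## Shell: the engineering layer that is orthogonal to EP but required
+
+These are not "modifications of EP", but the current version depends on them to reach the current
+numbers:
+
+- **Readout head Wh**: trained with its own local CE gradient `∂CE/∂Wh` (at the free z*), which does
+  **not** pass through the dynamics. This is standard practice in EP setups, not BP.
+- **Parameter EMA** (decay 0.999, evaluated alongside the raw weights): suppresses late-phase
+  estimator-noise drift without touching the stability loop. `--pema`.
+- **Optimizer / schedule**: AdamW (lr 1e-3, wd 1e-4), warmup→cosine, grad-norm clip 5.0, skip
+  non-finite steps.
+  - **Warmup is load-bearing for large models**: it lets the controller establish contraction before
+    large step sizes can kick the weights out of the basin. (The BP baseline is stable even without
+    warmup; this is one EP↔BP asymmetry.)
+- **lr calibration (k calibration)**: `k = |g_EP|/|g_BPTT|` per parameter group, `lr_EP = lr_BPTT/k`.
+  **But AdamW normalizes per coordinate ⇒ it absorbs the scale k ⇒ k has no effect under Adam**; k only
+  matters for SGD/hardware (fixed-gain lines).
+  - Theoretical basis: Ernoult 2019. With a converged free phase and β→0, **EP ≡ BPTT** (step by
+    step), so the lr correspondence is EP↔BPTT, **not** EP↔BP (BP and BPTT have different physical
+    meanings, so their lrs should not be expected to correspond directly).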
+
+---
+
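+## One-page cheat sheet: modification → broken premise → flag → measured effect
+
+| Modification | Broken standard premise | flag | Measured effect |
+|---|---|---|---|
+| Force-form VF *(existing, not ours)* | A conservative | `--attn_mode thick` | Yields energy-free EP, but **collapses on its own** (cos≈0.25) |
+| AEP antisymmetric correction J→Jᵀ *(AsymEP's, not ours)* | A conservative | (built into thick) | Attention gradient cos 0.25 → 0.99 |
+| ↳ Ours: matrix-free form + application to attention + holomorphic combination + common-mode tracking | (scale/engineering) | jvp−vjp / `--holo` / `holo_a_track` | Lets the two rows above run a transformer LM |
+| Symmetric nudge | C small β | (default) | bias O(β)→O(β²) |
+| Holomorphic + clamp-free | C clean nudge | `--holo 2 --hr 0.2` | Marginal-residual cos 0.27 → 0.89; r can grow 10× |
+| Adaptive T2 selection | C linear response | `--t2sel 120` | mean cos 0.871 → 0.932; training −0.064 |
+| Residual as health signal | B convergence | (built in) | Exposes the validity threshold res≲few×1e-4 |
+| Adaptive T1 | B convergence | `--t1max 500 --res_est 1e-4` | Refines z* into the validity region, where long T2 pays off |
+| λ-controller (soft Jacobian penalty) | D stability | `--jacreg .. --res_target 5e-4 --res_ema 0.9` | Keeps contraction; the floor cannot be annealed (else fp-absorption explosion) |
+| Validity gate | D stability | `--res_gate` | S1: 3 deaths before the gate, survival after |
+| Abort fuse | D stability | `--abort_res 0.1` | Stops after 100 consecutive steps with res>0.1, keeps best |
+| resinit / qknorm | D stability (large width) | `--resinit 0.1 --qknorm` | Stable start at large C; bounds the attention Jacobian |
+
+**In one sentence**: standard EP assumes "conservative, converged, small β with a clean nudge, always
+stable during training"; the transformer breaks all four. A is repaired with **force form + AEP**
+(turning J into Jᵀ); B with the **residual signal + adaptive T1** (getting into the validity region);
+C with the **holomorphic clamp-free estimator + adaptive T2** (a clean estimate that neither truncates
+nor diverges); D with the **residual-driven λ-controller (load-bearing floor) + gate + fuse +
+architectural stabilizers** (pulling training back onto the contractive manifold). Of these, A's AEP
+and D's λ-controller are the two most substantive departures; the rest is mostly "cleaning up the
+estimator / keeping it in the validity region".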
diff --git a/docs/method/METHODS.md b/docs/method/METHODS.md
new file mode 100644
index 0000000..2a7d255
--- /dev/null
+++ b/docs/method/METHODS.md
@@ -0,0 +1,576 @@
+# Methods — EP-Trained Equilibrium Transformer Language Model
+
+Complete technical notes for discussion: architecture, how attention/FFN are made EP-trainable,
+the training rule, every stabilizer/regularizer and its reason, the LM setting, validation
+methodology, results, and open problems. Code paths at the end. (Companion doc:
+`/home/yurenh2/ept/FINDINGS.md` — the project arc and findings log.)
+
+---
+
+## 1. Problem statement
+
+Train a transformer-class language model where **both attention and FFN learn without
+backpropagation through the computation** — using Equilibrium Propagation: two (or N) relaxations
+of the same dynamics plus a local contrast readout. The questions: (a) does it train at all,
+(b) what does it cost vs the exact gradient (BPTT on the same architecture) and vs a standard
+BP transformer at equal parameters, (c) what are the actual failure mechanisms.
+
+Headline result (all at 14k steps, fully-controlled comparison): best EP model reaches
+**val CE 1.676** (adaptive T1/T2, run R10). Like-for-like standard BP transformer (MLP=4 — the
+same parameter shape as the thick block, see §2) reaches 1.610 ⇒ **total gap 0.066**, decomposed:
+**architecture ≈ 0.025** (BPTT + the same stabilizer + same param-EMA on the identical block:
+1.635) and **training rule ≈ 0.041** (EP 1.676 vs that control) — and EP beats the thin-matched
+BP MLP=1 baseline (1.689). Unregularized BPTT *destabilizes* on long horizons (walks off the
+contractive manifold, res→5e-2, best 2.021 — worse than its 3k run, 1.949): the stabilization
+loop EP carries out of estimator necessity (residual-driven Jacobian-penalty controller) is what
+the equilibrium architecture itself needs for long training. Random = ln 65 = 4.17.
+
+## 2. LM setting (data, embeddings, readout, evaluation)
+
+- **Corpus**: Shakespeare character-level (nanoGPT preprocessing): `train.bin`/`val.bin` uint16
+ token streams + `meta.pkl`, vocab = 65 chars (~1.1 MB text). Local copy:
+ `/tmp/lt_ep/data/shakespeare_char`.
+- **Batching**: random crops, B=32 sequences × T=64 context; next-char targets (shift by 1).
+- **Embeddings**: learned token table `tok ∈ R^{65×128}` (init N(0, 0.02²)) + learned absolute
+ positional table `pos ∈ R^{64×128}`. Input injection `x_in = tok[idx] + pos`. **Embeddings are
+ trained by EP too** — they enter the force through the input-clamp term −(z − x_in), so the same
+ vector-field readout (Sec. 4) delivers their gradient. No pretrained embeddings, no BP path.
+- **Readout head**: logits = z* Wh, `Wh ∈ R^{128×65}`, trained with its **own local CE gradient**
+ ∂CE/∂Wh at the free equilibrium (loss-adjacent layer — local learning suffices; this is standard
+ in EP setups and is not backprop through the dynamics).
+- **Objective / eval**: mean next-token cross-entropy over all B·T positions; val CE = average over
+ 8 fresh validation batches, computed by running the same free-phase relaxation (T1 steps) used in
+ training — i.e. the eval graph equals the inference graph. Random baseline ln(65) = 4.174.
+- **Model size**: C=128, H=4 heads, single equilibrium block (weight-tied recurrence ⇒ effective
+ depth = T1). Parameter matching to the BP baseline depends on the variant: the **thin** block's
+ Hopfield memory (Wm: 128×256 ≈ 33k) matches BP **MLP=1** (2C² ≈ 33k); the **thick** block's
+ untied FFN (fc+pj = 2·4C² ≈ 131k) matches BP **MLP=4** (131k) — so thick-block results compare
+ against MLP=4 (1.610), thin-block results against MLP=1 (1.689).
+
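The parameter-matching arithmetic in the last bullet, and the random baseline, can be reproduced in a few lines. The variable names are illustrative, and biases and LayerNorm affines are ignored, as they appear to be in the quoted counts.

```python
import math

C = 128          # model width
M = 256          # Hopfield memories in the thin block
vocab = 65       # Shakespeare character vocabulary

thin_ffn = C * M                 # Wm: 128 x 256
bp_mlp1 = 2 * C * C              # BP baseline, MLP ratio 1: C->C and C->C
thick_ffn = 2 * (4 * C * C)      # fc (C->4C) + pj (4C->C)
bp_mlp4 = 2 * (4 * C * C)        # BP baseline, MLP ratio 4
random_ce = math.log(vocab)      # cross-entropy of a uniform guess, in nats
```

The thin block matches MLP=1 at 32,768 weights and the thick block matches MLP=4 at 131,072, which is why the two variants are compared against different baselines.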
+## 3. Architecture: the equilibrium block
+
+State `z ∈ R^{B×T×C}` (one vector per token position). Dynamics ż = F(z); inference = relax to
+fixed point z* (Euler: z ← z + ε F(z), ε=0.1, T1=150 steps), predict from z*.
+
+We built four force variants (all share the input clamp and the readout):
+
+### 3.1 `thick` — DEQ-transformer block (the winner)
+
+```
+F(z) = −(z − x_in) input clamp (leak toward embedding)
+ + Attn(LN1(z)) causal multi-head softmax attention, separate WQ WK WV WO
+ + W2 · GELU(W1 · LN2(z) + b1) + b2 untied 4× FFN (W1: C→4C, W2: 4C→C)
+ − c·z damping (c=1–2)
+```
+
+LN1/LN2 carry learned affine (g, b). This is exactly a pre-LN transformer block written as a
+*force* instead of a layer stack — same form DEQ uses as its fixed-point map. It is strongly
+**non-conservative**: no scalar energy has this gradient (Q≠K asymmetric coupling, untied FFN).
+EP is made exact for it via the AEP correction (Sec. 4.2).
+
+### 3.2 `real`/thin — Hopfield-FFN + damped real attention
+
+```
+F(z) = −∇_z [ ½‖z − x_in‖² + E_mem(z) ] + s·(Attn(z) − c·z)
+E_mem(z) = −Σ relu(z Wm)² modern-Hopfield / dense associative memory, Wm: C×256
+```
+
+The FFN here is **energy-based**: E_mem is the dense-associative-memory energy; its force
+2·relu(zWm)Wmᵀ is a one-hidden-layer FFN with *tied* weights (Wm in, Wmᵀ out). Attention remains a
+raw non-conservative force with damping. This was our first stable trainable variant.
+
+### 3.3 `energy` — fully conservative attention (CET-style)
+
+Attention folded into the energy: `E_att = −(1/γ) Σ_heads,i LSE_j(γ q_i·k_j)` (causal-masked,
+**tied value** — the force of this energy mixes values v=k), plus `½c‖z‖²` confinement because
+E_att+E_mem alone are unbounded below. F = −∇E exactly ⇒ classic EP applies with no correction.
+This is the CET (Høier/Kerjan/Scellier) route, which we reproduced separately on vision
+(masked CelebA/CIFAR completion, EP ≈ TBPTE). Trade-off: tied value + reciprocal coupling = the
+least expressive attention.
+
+### 3.4 `mono` — monDEQ-structured contraction
+
+`F(z) = −(m·z + z PᵀP) + z(Q−Qᵀ)ᵀ + x_in − ∇E_mem + s·Attn(z)` — the linear part is a monotone
+operator by construction (Winston–Kolter): symmetric part ⪯ −m·I guaranteed, antisymmetric part
+(Q−Qᵀ) free (non-reciprocal coupling at no stability cost). Guaranteed unique fixed point for the
+linear core; softmax attention sits on top with gain s. Ablation for "how much does guaranteed
+contraction cost": BPTT-mono = 2.11 vs BPTT-thick 1.95.
+
+### How attention is "EP-ified": the two routes, explicitly
+
+1. **Energy route** (3.3): make attention conservative (tied value, LSE energy) so F = −∇E and
+ vanilla EP is valid. Cost: expressivity (Q≠K asymmetry and free value are what make attention
+ attention).
+2. **Force route** (3.1, 3.2 — ours): keep real attention as a non-conservative force and repair
+ the *estimator* instead, with the AEP correction (Sec 4.2) which restores exact gradients for
+ non-reciprocal couplings. Validated gradient cosine vs autograd: attention 0.99, FFN 1.00,
+ full LM block 0.99 (and FA, for contrast, gives Q/K/V ≈ 0.25, upstream FFN ≈ −0.01).
+
+### How attention and FFN are trained *jointly*
+
+There is no per-module schedule or pipeline: attention, FFN, LN affines, and embeddings are all
+terms of the **same force** F. One free relaxation + one nudged ensemble produce the contrast state
+`a`; every parameter θ gets its gradient from the single vector-field formula ∇θ = ∂⟨a, F(z*,θ)⟩/∂θ
+(Sec 4.1), which decomposes into purely **local** per-term updates (each force term touches only its
+own parameters). The readout Wh is the only separately-trained parameter (local CE gradient).
+
+## 4. Training rule
+
+### 4.1 Vector-field (force-form) EP with symmetric nudging
+
+- **Free phase**: z⁰ = x_in; z^{t+1} = z^t + ε F(z^t), T1=150, ε=0.1 → z*. Monitor relative
+ residual `res = ‖z⁺−z*‖/‖z*‖` (one extra step) — the load-bearing health signal (Sec 5).
+- **Nudged phases** (±β, β=0.02, T2=20 steps from z*): relax the augmented force
+ `F(z) ∓ β ∇_z CE(z)` (the CE gradient w.r.t. the *state* — local at the readout).
+- **Contrast**: `a = (z₋ − z₊)/(2β)` ≈ −dz*/dβ (centered ⇒ O(β²) bias; Laborieux-style symmetric
+ nudging).
+- **Parameter update** (force form, valid for non-gradient dynamics): for all force params θ,
+ `∇θ L ≈ ∂/∂θ ⟨a, F(z*; θ)⟩` — one autograd call **at the fixed point only** (this is a local
+ Hebbian-style contrast in θ for each force term; autograd here is per-term bookkeeping, not
+ backprop through time/steps).
+- At a converged fixed point and β→0 this equals the implicit/equilibrium gradient
+ (Scellier–Bengio; Ernoult: EP ≡ BPTT stepwise under convergence).
+
+### 4.2 AEP correction (non-conservative repair)
+
+For non-conservative F the naive nudged phase linearizes around z* with Jacobian J, but the correct
+adjoint needs Jᵀ. Following **AsymEP** (Scurria, Vanden Abeele, Mognetti, Massar, "EP for
+Non-Conservative Systems", arXiv:2602.03670), we add to the nudged force the term `−(Jv − Jᵀv)`,
+v = z − z*, where J = ∂F_nc/∂z at z* and F_nc = the non-conservative part (attention, or
+attention+FFN for `thick`). This is **identical to their `−2 A_J(z*)(z−z*)`** (A_J = antisym part of
+J; `Jv−Jᵀv = (J−Jᵀ)v = 2 A_J v`) — **the correction is theirs, not ours**. What is ours on this
+line: (i) the **matrix-free jvp+vjp** form (one of each per nudged step) — AsymEP builds the full
+Jacobian explicitly and decomposes it, which is infeasible at transformer state dim B·T·C; (ii) the
+application to **softmax attention** (data-dependent Jacobian — they test only feedforward
+nets/Hopfield on static MNIST/CIFAR, no attention/sequence); (iii) the **holomorphic combination**
+(§4.3 — the correction is real-linear so it preserves holomorphy; they use plain ±β); (iv) the
+**common-mode-tracking** linearization variant (§4.3/below). Their force-form **VF** readout (= our
+`⟨a,∂F/∂θ⟩`) is *prior art* and is the baseline that collapses without the correction (their CIFAR
+VF = 10% chance), matching our cos≈0.25 for uncorrected attention.
+
+Effect: nudged-phase Jacobian J → Jᵀ ⇒ a approximates the *adjoint* response (a −(I−Jᵀ)⁻¹-type
+solve) ⇒ exact gradients for Q≠K attention (measured 0.99–1.0).
+
+Caveats: the correction is **linearized at z***, so the nudged trajectory must stay in the
+linear-response window: T2≈20 at ε=0.1 is inside; T2=60+ can leave it (Sec 7).
+
+### 4.3 Holomorphic EP upgrade (current)
+
+Replace the 2-point real ±β difference by N points on a **complex circle** β_k = r·e^{2πik/N}
+(Laborieux–Zenke 2022): relax the *holomorphically extended* dynamics (manual complex LN with
+non-conjugate variance, softmax as exp-ratio, tanh-form GELU; the AEP correction is linear with
+real coefficients so it preserves holomorphy — apply to Re/Im parts separately), then read
+`a = −Re[(1/Nr) Σ_k e^{−iφ_k}(z_k − z*)]` (discrete Cauchy formula; bias O(r^N) instead of O(r²)).
+**No clamps inside the holomorphic nudge** (clamps are non-analytic and break the bias order).
+
+Probe findings (cos vs long-horizon-BPTT reference, 300-step-pretrained thick block):
+- The **clamps were the dominant estimator error at marginal residuals**: at res 1.6e-3, plain EP
+ cos 0.27 → clamp-free 0.89. (The clamps existed to protect early training; they were silently
+ poisoning mid-training updates whenever res drifted up.)
+- N and r are flat (N=2…8, r=0.02…0.2 all ≈ equal): finite-β bias and 1/β noise are *not* the
+ binding error at this scale.
+- The remaining ~0.12 misalignment is **T2 truncation**: with stable nudged dynamics, T2=120 →
+ cos 0.985. But on slow-mixing batches long T2 diverges (AEP linearization error compounds on
+ near-critical modes). Step-size-based early stopping FAILS (non-normal transient growth triggers
+ it at t≈6–39; same pathology that makes spectral radius the wrong free-phase signal).
+- **Adaptive T2, solved by hindsight selection** (`holo_a_select`): run the nudged phases to
+ T2max=120 in lockstep, snapshot the contrast a_t every 10 steps, return the snapshot with the
+ smallest increment (= most settled), early-exit only on clear blowup (inc > 5× the running min).
+ Judging stability by increments of the *quantity of interest* — not step sizes — makes transient
+ growth harmless. Probe: never worse than fixed T2=20; mean cos 0.871 → **0.932** at tight
+ equilibria (0.853 / 0.987 / 0.956 per batch; the dangerous batch self-limits to t≈20–30).
+- **Adaptive T1 companion**: long-T2 gains require a tight free phase (res ≲ 1e-4; at res ~1e-3
+ long T2 actively hurts). So the free phase is two-stage: the λ-controller's residual signal is
+ still sampled at T1=150 (R9 semantics preserved — no new λ war), then relaxation continues in
+ chunks of 50 until res ≤ 1e-4 (cap 500) before the nudged phases. Compute buys tightness;
+ λ pressure does not.
+
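The hindsight selection rule can be sketched independently of the model. This is not `holo_a_select`: the function name and the callable `contrast_at(t)` are hypothetical stand-ins for the lockstep nudged relaxations, and the synthetic trajectory is an assumption chosen so that the contrast first settles and then blows up.

```python
import math

def select_snapshot(contrast_at, t2max=120, every=10, blowup=5.0):
    """Return the snapshot whose increment from the previous snapshot is smallest."""
    prev = contrast_at(every)
    best_inc, best_t, best_a = math.inf, every, prev
    last_t = every
    for t in range(2 * every, t2max + 1, every):
        cur = contrast_at(t)
        last_t = t
        inc = math.dist(cur, prev)            # change in the contrast itself, not the step size
        if inc < best_inc:
            best_inc, best_t, best_a = inc, t, cur
        elif inc > blowup * best_inc:         # clear blowup: stop and keep the best so far
            break
        prev = cur
    return best_t, best_a, last_t

# Synthetic contrast trajectory (illustrative): it settles toward 1.0 while a slowly
# growing unstable mode eventually takes over.
def toy_contrast(t):
    return [1.0 - math.exp(-t / 15.0) + 1e-6 * math.exp(t / 8.0)]

best_t, best_a, last_t = select_snapshot(toy_contrast)
```

Because the rule only compares increments of the contrast, a transient that inflates individual step sizes early on does not trigger the exit; only sustained growth of the readout quantity does.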
+Training outcomes: **R7** (N=2, r=0.02, fixed T2=20): best 2.0289, faster wall-clock than plain EP
+(3.3 vs 2.45 it/s — the holomorphic nudge's ∇_z CE is closed-form). **R10** (R9's controller +
+adaptive T1/T2): **best 1.6755** (sustained EMA plateau 1.68–1.70 around step 8–10k; ~0.7 it/s).
+
+## 5. Stabilization & regularization — what, and exactly why
+
+**The governing fact (measured, not assumed):** the EP estimator has a **validity threshold** in
+free-phase residual. Gradient cosine vs exact reference: res ≈ 5e-5 → 0.85–0.88; res ≈ 1e-3 →
+0.2–0.9 (batch-dependent); res ≈ 3e-3 → ≈ 0–0.5; res ≈ 1e-2 → noise. BPTT has no such threshold
+(it differentiates the actual finite computation, converged or not) — *this asymmetry, and nothing
+deeper, is the EP-specific difficulty*. There is **no structural ceiling**: an early "EP caps at
+~2.5" verdict was refuted (it conflated two undertrained/invalid-regime runs; see FINDINGS).
+
+Each stabilizer and its reason:
+
+1. **Damping −c·z** — creates/strengthens a fixed point for raw attention forces; *required* for
+ the thin/real variant (attention alone has no fixed point at high gain: residual floor ~3e-2,
+ no equilibrium to find). **Caveat for LN-inside blocks (`thick`)**: damping shrinks ‖z*‖ and the
+ LN Jacobian scales like 1/σ(z) ⇒ damping *inflates* the effective Jacobian — measured: thick
+ plain-relax residual 8.8e-3 at c=0 vs 3.4e-2 at c=2. So for `thick`, c is kept small (1) and is
+ NOT the stabilizer; the Jacobian penalty is.
+2. **Soft Jacobian penalty** λ‖J_nc(z*)‖²_F (Hutchinson estimator: one jvp on a random probe vector,
+ differentiated w.r.t. θ; Bai et al. 2021 "Stabilizing Equilibrium Models by Jacobian
+ Regularization") — the actual stabilizer: keeps the free phase contractive ⇒ keeps the estimator
+ inside its validity region. Soft penalty ≻ hard constraints (spectral-norm capping the attention
+ matrices to ρ=0.9 was tried: too restrictive, kills learning — consistent with FRE-RNN-style
+ regulation being preferable to hard projection).
+3. **Why the control signal is the residual, NOT the spectral radius**: the attention/block Jacobian
+ is highly **non-normal** — transient growth is invisible to eigenvalues (measured: ρ(J)=0.94
+ "stable" while the relaxation diverged at res 0.21). The one-step residual *is* the transient;
+ control on it.
+4. **Continuous λ controller**: λ ← clip( λ · (res_EMA/target)^0.3 , floor, 16 ), per step.
+ - **EMA on the signal (0.9)**: the raw residual is noisy; a multiplicative controller on a noisy
+ signal random-walks (measured thrash λ 0.5↔13 when the target sat at the noise floor) and the
+ thrashing λ itself perturbs training. EMA removed it and gave the current best run.
+ - **Target = 5e-4**: just inside the validity threshold (few·1e-4), with margin; NOT tighter —
+ demanding res ≪ threshold buys nothing and costs expressivity (a 2e-4-target run was worse).
+ For reference, BPTT's own optima sit at res ~1e-3–2e-2: good solutions are only mildly
+ contractive; we ask for slightly more than BPTT needs, because our *estimator* needs it.
+ - **Floor**: λ may shrink when res is healthy but **must not vanish — the floor is
+ load-bearing**. Floor=λ₀ (never off) is a permanent tax (2.150); floor 0.1 is the sweet spot
+ (2.078 → 2.047 with signal-EMA → 2.029 with the holomorphic estimator). Two independent runs
+ prove λ≲0.02 is fatal at *any* stage: λ→0 from the start (R2) and λ-floor annealed with lr
+ (R6) both ended in the same death — val CE 60–77 with res ≡ 0.0. Post-mortem: this is **an
+ explosion disguised as convergence by floating point**, not a dead state: ‖z*‖ and the
+ uncapped parameters (tok/pos/fc/pj) blow up under temporarily-invalid gradients until
+ ε·F < ulp(z) and the relaxation freezes (res = 0 by absorption), with huge confidently-wrong
+ logits. The λ penalty (whose θ-gradient touches fc/pj/LN/attention) is what keeps that basin
+ out of reach; it cannot be annealed away. The late-drift hypothesis "persistent penalty
+ gradient vs vanishing task signal" is hence only half-true — the persistent pressure is also
+ the anti-collapse mechanism. Current anti-drift attempt: **parameter EMA** (decay 0.999,
+ evaluated alongside raw weights), which targets late-phase estimator-noise wander without
+ touching the stability loop at all.
+5. **Weight-norm caps** (renorm to 3× init norm on WQ,WK,WV,WO,Wm,Wh after each step): blunt safety
+ net against runaway during transients when the estimator is temporarily invalid. Rarely binding
+ in healthy runs.
+6. **Nudge clamp g.clamp(±2) and AEP-correction clip (‖corr‖ ≤ ‖F‖)** — *legacy*: protected the
+ nudged relaxation early in training, but measured to be the main estimator error at marginal
+ residuals (cos 0.27 → 0.89 once removed). Replaced by the clamp-free holomorphic nudge; a
+ non-finite-gradient step-skip + the λ controller now carry the early-training safety.
+7. **Optimizer**: AdamW lr 1e-3, wd 1e-4, cosine to 5%, grad-norm clip 5.0, skip non-finite steps.
+ β(EP)=0.02, ε=0.1, T1=150, T2=20 everywhere unless stated.
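The continuous λ controller in item 4 can be sketched in a few lines. This is a minimal illustration, not the code in `lt_ep_train.py`: the function names `ema` and `update_lambda` are hypothetical, and only the constants (EMA 0.9, target 5e-4, exponent 0.3, floor 0.1, cap 16) come from the text above.

```python
def ema(prev, x, decay=0.9):
    """Exponential moving average of the noisy residual signal (decay 0.9 per the text)."""
    return decay * prev + (1.0 - decay) * x


def update_lambda(lam, res_ema, target=5e-4, power=0.3, floor=0.1, cap=16.0):
    """One controller step: lam <- clip(lam * (res_ema / target) ** power, floor, cap)."""
    lam = lam * (res_ema / target) ** power
    return min(max(lam, floor), cap)


# Toy trace (made-up residuals): lam rises while the smoothed residual is above
# target and relaxes toward the floor once it drops below.
lam, res_ema = 1.0, 5e-4
for raw_res in [2e-3, 2e-3, 1e-3, 5e-4, 2e-4, 1e-4]:
    res_ema = ema(res_ema, raw_res)
    lam = update_lambda(lam, res_ema)
    print(f"raw={raw_res:.1e}  ema={res_ema:.2e}  lam={lam:.3f}")
```

Because the update is multiplicative, a residual sitting exactly at target leaves λ unchanged, and the floor guarantees the penalty never switches off.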
+
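The Hutchinson estimate behind the soft Jacobian penalty (item 2 above) rests on the identity E‖Jv‖² = ‖J‖²_F for probes v with identity covariance. The sketch below checks that identity on a small fixed matrix. It is an assumption-laden toy: a plain matrix-vector product stands in for the jvp, the matrix `J` is made up, and many probes are averaged only to make the unbiasedness visible (the text uses one probe per step).

```python
import random


def matvec(J, v):
    """Stand-in for a Jacobian-vector product on a linear map."""
    return [sum(a * b for a, b in zip(row, v)) for row in J]


def frob_sq(J):
    """Exact squared Frobenius norm, for comparison."""
    return sum(a * a for row in J for a in row)


def hutchinson_frob_sq(J, n_probes, rng):
    """Average of ||J v||^2 over random +/-1 (Rademacher) probe vectors."""
    n = len(J[0])
    total = 0.0
    for _ in range(n_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += sum(x * x for x in matvec(J, v))
    return total / n_probes


J = [[0.5, -0.2, 0.1],
     [0.3, 0.4, -0.1],
     [0.0, 0.2, 0.6]]  # hypothetical 3x3 Jacobian
exact = frob_sq(J)
estimate = hutchinson_frob_sq(J, 2000, random.Random(0))
print(f"exact={exact:.4f}  hutchinson={estimate:.4f}")
```

In training, only the forward product Jv is needed, which is why the penalty requires no transpose.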
+### 5.x T1-residual penalty (`--resreg`) — defend the evaluated state (2026-06-20)
+
+EP's gradient is the fixed-point/implicit gradient: it only cares WHERE the fixed point is, not how fast the
+relaxation reaches it, so it has no reward for keeping the block contractive. BPTT — differentiating the finite
+T1=150 unroll, which is what eval actually uses — gets that reward implicitly (a non-converging unroll → bad
+output → high CE). This asymmetry is why frozen-jr EP diverges past ~2.09 (res inflates → forward bifurcates to
+a limit cycle) while exact-BPTT with the identical recipe descends to 1.72 (see FINDINGS 2026-06-20; the EP run
+refines the free phase to t1max=300=z* and grades there, so it never feels the residual of the evaluated z150).
+
+The fix gives EP that missing term explicitly — penalize the T1 free-phase residual of the state actually
+evaluated, `z150 = relax(xin, T1)` taken BEFORE any t1max refinement:
+- `R_res = ‖ε·F(z150)‖² / (‖z150‖²+ε)`, gradient w.r.t. θ with z150 detached (`blk.tforce`);
+- scaled task-relative: `ratio = resreg·min(1, res@T1 / 2e-2)`, deadband `res@T1 > 7e-4`,
+ `λ = ratio·‖g_task‖/‖g_res‖`, added to the EP gradient.
+- **Run with `res_gate=0`** — the validity gate early-returns (jacreg-only) above the gate, which would bypass
+ the penalty exactly when res is high. Keep `t1max=300` (estimator accuracy) + the penalty (defends z150).
+
+Analog-compatible (one extra force measurement + the same local vector-field gradient, no digital root-finder)
+and more targeted than jacreg (which penalizes ‖J_nc‖_F, not the actual residual vector that explodes). Validated
+res-tight through step 1000 / best 2.0573 (past the 2.09 wall) before a /tmp wipe; full re-validation pending.
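The task-relative scaling of the residual penalty can be written out as below. This is a sketch under stated assumptions: `resreg_lambda` and `combine` are hypothetical names, gradients are flat Python lists, the default `resreg=1.0` is a placeholder, and the zero-norm guard is an addition not mentioned in the text. The reference residual 2e-2 and the deadband 7e-4 are taken from the bullets above.

```python
def resreg_lambda(res_t1, g_task_norm, g_res_norm,
                  resreg=1.0, ref=2e-2, deadband=7e-4):
    """Penalty weight: ratio * ||g_task|| / ||g_res||, zero inside the deadband."""
    if res_t1 <= deadband or g_res_norm == 0.0:  # guard is an assumption
        return 0.0
    ratio = resreg * min(1.0, res_t1 / ref)
    return ratio * g_task_norm / g_res_norm


def combine(g_ep, g_res, lam):
    """Add the scaled residual-penalty gradient to the EP gradient."""
    return [a + lam * b for a, b in zip(g_ep, g_res)]


# Below the deadband the penalty is off; at res = 1e-2 it runs at half strength.
print(resreg_lambda(5e-4, 1.0, 1.0))
print(resreg_lambda(1e-2, 2.0, 4.0, resreg=0.5))
print(combine([1.0, 2.0], [10.0, -10.0], 0.1))
```

Normalizing by the gradient-norm ratio keeps the penalty at a fixed fraction of the task gradient, so it cannot dominate when the task signal is small.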
+
+## 6. Validation methodology (how we know the estimator/claims are right)
+
+- **Gradient-cosine probes**: at a fixed realistic operating point (300 BPTT steps from init —
+ no contraction penalty, the "natural" weight region), compare every estimator against a
+ long-horizon BPTT reference (T1=400), per parameter group (attn / ffn / LN / emb). This is what
+ exposed the validity threshold, the clamp damage, and the T2 truncation.
+- **Horizon control**: BPTT-150 vs BPTT-400 cosine is itself only 0.35–0.77 on slow-mixing batches
+ — the "finite horizon vs true equilibrium gradient" cost is shared by everyone, EP is not
+ special; at matched horizon EP is within ~0.15 of BPTT.
+- **BPTT-as-ablation**: BPTT on the identical architecture isolates *training-rule* cost (EP−BPTT)
+ from *architecture* cost (BPTT−BP). BPTT is an ablation, not the target; BP is the target.
+- **Same-graph eval**: val CE is computed through the same T1-step relaxation used in training, so
+ no train/eval mismatch flatters either method.
+- **Gradient-cosine has a lifecycle**: early/mid training it measures estimator quality (0.93 →
+ 0.79–0.85 across scale); late in training, at slow-mixing trained points, even two *exact*
+ gradients at different horizons decorrelate (cos(BPTT-150, BPTT-800) = 0.25 at the trained S1
+ point) — no single "true gradient" exists to cosine against, and the meaningful arbiter becomes
+ the training outcome on the horizon-matched eval objective. Validity-threshold claims here are
+ early/mid-phase statements. Late-phase corollary: EP's res target 1.5e-3 at S1 is already
+ optimal — cos rises monotonically with tightness and the loose-weights/refined-measurement
+ variant nulls at training level: the measurability-contraction tax is rigid across the interval
+ (the physical escape is oscillatory/lock-in measurement, not operating-point engineering).
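A per-group gradient-cosine probe of the kind described in the first bullet can be sketched as follows. The group names follow the text (attn / ffn / LN / emb); the function names and the toy gradient values are hypothetical, and real gradients would be flattened tensors rather than short lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two flat gradient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else float("nan")


def cosine_by_group(estimate, reference):
    """Compare an estimator against a reference gradient, one cosine per parameter group."""
    return {group: cosine(estimate[group], reference[group]) for group in reference}


# Toy gradients: attn is perfectly aligned, emb is anti-aligned.
reference = {"attn": [1.0, 0.0, 2.0], "ffn": [0.5, -1.0], "LN": [1.0, 1.0], "emb": [0.0, 3.0]}
estimate = {"attn": [2.0, 0.0, 4.0], "ffn": [0.5, 1.0], "LN": [1.0, 0.0], "emb": [0.0, -3.0]}
report = cosine_by_group(estimate, reference)
for group, value in report.items():
    print(f"{group:5s} cos = {value:+.3f}")
```

Reporting per group rather than one global cosine keeps a large, well-estimated group from masking a small, badly estimated one.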
+
+## 7. Results
+
+**14k-step matched comparison (the honest table; thick block ≈ BP MLP=4 in parameter shape):**
+
+| training rule | architecture / recipe | best val CE |
+|---|---|---|
+| BP | standard transformer, MLP=4 (**like-for-like for thick**) | **1.610** |
+| BPTT + R9's λ-controller + param-EMA | thick (exact grad, same stabilization as EP) | **1.635** — tail stable |
+| **EP (R10)** | **thick; R9 + adaptive T1 (refine to res≤1e-4) + adaptive T2 (selection, cap 120)** | **1.676** (EMA plateau 1.68–1.70) |
+| BP | standard transformer, MLP=1 (thin-matched) | 1.689 |
+| EP (R9) | thick; holo nudge + recalibrated controller (target 1.5e-3, λmax 4) + param-EMA | 1.740 |
+| BPTT (exact grad) | thick, unregularized | 2.021 — **destabilizes late** (res→4.7e-2, val→3.0) |
+
+3k-era and ablation numbers (shorter schedule):
+
+| run | best val CE |
+|---|---|
+| BPTT thick, 3k (its best showing) | 1.949 |
+| EP R7: holo estimator, old tight controller (target 5e-4) | 2.029 (late λ pinned 16 ⇒ drift) |
+| EP R8: R7 + param-EMA | 2.031 (EMA alone ≠ fix; the λ fight dominates) |
+| EP R5/R3: plain estimator generations | 2.047 / 2.078 |
+| BPTT monDEQ / thin, 3k | 2.111 / 2.206 |
+| EP R2 (λ→0) / R6 (λ-floor∝lr) | 2.357 / 2.501 — both die by fp-absorption explosion |
+| random | 4.174 |
+
+Reading: (a) final decomposition — **architecture tax ≈ 0.025** (1.635 vs 1.610), **EP rule tax
+≈ 0.041** (1.676 vs 1.635), total **0.066** to the like-for-like BP transformer; EP beats the
+thin-matched MLP=1 baseline. (b) EP beats *bare* BPTT at both horizons, but the controlled
+comparison shows most of that win was EP's mandatory stabilization loop doubling as regularization
+— bare exact-gradient training walks off the contractive manifold at 14k, and the same controller
+that EP cannot live without lifts BPTT to 1.635 (also beating MLP=1): **the contraction controller
+is good for the architecture regardless of training rule; EP simply forced its discovery.**
+(c) The estimator and the controller must be **co-designed**: upgrading the estimator
+(holomorphic, clamp-free) widened the validity region from res≲5e-4 to ~1.5e-3, and re-calibrating
+the controller to that wider region (R7→R9) was worth **0.29**; adaptive T1/T2 (R9→R10) was worth
+another **0.064**, matching the probe's cos 0.871→0.932. (d) **Multi-seed confirmation (3 seeds per arm)**: EP
+1.6755/1.6851/1.6786 → **1.680 ± 0.005** vs BPTT+controller 1.6348/1.6459/1.6365 →
+**1.639 ± 0.006**; the rule tax is **0.041 ± 0.005 (~9σ)** — real, tightly reproducible, and
+consistent with the measured estimator misalignment (cos 0.85–0.93).
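The multi-seed numbers in (d) can be rechecked from the six values quoted there. The text does not say how the uncertainty was computed; the sketch assumes sample standard deviations per arm and a standard error of the difference of means, which reproduces a significance of roughly 9σ.

```python
from math import sqrt
from statistics import mean, stdev

ep = [1.6755, 1.6851, 1.6786]      # EP, three seeds (from the text)
bptt = [1.6348, 1.6459, 1.6365]    # BPTT + controller, three seeds (from the text)

rule_tax = mean(ep) - mean(bptt)
# Standard error of the difference of two independent means (assumed method).
se = sqrt(stdev(ep) ** 2 / len(ep) + stdev(bptt) ** 2 / len(bptt))

print(f"EP   {mean(ep):.3f} +/- {stdev(ep):.3f}")
print(f"BPTT {mean(bptt):.3f} +/- {stdev(bptt):.3f}")
print(f"rule tax {rule_tax:.3f}, SE {se:.4f}, {rule_tax / se:.1f} sigma")
```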
+
+**Scale rung S1 (TinyStories char, C=256 H=8 T=256, 0.92M params; random ln127 = 4.84):**
+
+| run | best char-CE (BPC) |
+|---|---|
+| BP same-shape, 14k | 0.827 (1.19) |
+| **BPTT-ctl, loose target 1e-2, 14k** | **1.009 (1.46)** |
+| BPTT-ctl, tight target 1.5e-3, 14k | 1.521 — ⇒ **controller-mismatch tax 0.51** |
+| EP v4b (validity gate, lr 1e-3, 20k from scratch) | 1.393 |
+| EP L2 (v4b recipe, 40k from scratch) | 1.214 |
+| **EP warm-track (v4b → phase-2: common-mode tracking + loosened target)** | **1.141** — EP champion |
+
+S1 scale lessons: (a) containment must scale with model size (λ ceiling, cap list, fuse);
+(b) the **validity gate is load-bearing** — off-equilibrium EP updates poison weights (three
+deaths before the gate, alive after); (c) the estimator validity threshold tightens with scale
+(res 1e-4 → 1e-5 for full quality; rescue is compute-bounded, saturating at cos ≈ 0.85);
+(d) **the controller operating point is part of the training rule**: EP needs validity-tight
+targets, exact-gradient methods want loose ones — match controllers for rule-tax measurements,
+but report each method at its own best operating point for ceilings.
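The S1 table reports cross-entropy in nats with bits-per-character in parentheses. The conversion, the random baseline, and the controller-mismatch tax can be verified with a few lines; `nats_to_bits` is a hypothetical helper name, and the input numbers are read off the table.

```python
import math


def nats_to_bits(ce_nats):
    """Convert a cross-entropy in nats to bits (per character here)."""
    return ce_nats / math.log(2)


random_ce = math.log(127)                # uniform guess over 127 symbols
bp_bpc = nats_to_bits(0.827)             # BP same-shape row
bptt_bpc = nats_to_bits(1.009)           # BPTT-ctl, loose target row
mismatch_tax = 1.521 - 1.009             # tight-target minus loose-target BPTT-ctl

print(f"random = {random_ce:.2f} nats")
print(f"BP = {bp_bpc:.2f} BPC, BPTT-ctl = {bptt_bpc:.2f} BPC")
print(f"controller-mismatch tax = {mismatch_tax:.2f} nats")
```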
+
+**Optimizer pricing (S0-Shakespeare, R10 recipe, 8k steps; AdamW ≈ 1.70 at 8k):** EP-SaI
+(per-tensor lr from init g-SNR, frozen = one calibrated gain line per array) **2.048**;
+SGDM 2.166; Lion 2.175; Lion+LARS 2.244. Per-tensor calibration recovers ~0.12 of the 0.47
+uniform-scale gap; the remaining ~0.35 measures the value of per-coordinate adaptivity under EP's
+noisy heteroscedastic gradients. Pretraining therefore stays in the digital shell; fine-tuning is
+exempt (SGD suffices in the RL/fine-tune regime, with <0.02% sparse updates — an endurance gift;
+Mukherjee et al., arXiv:2602.07729).
+
+**Hardware twin v4 (S0, 8-bit program-verify + 30% static mismatch + σ=1e-4 white + 4× restart
+averaging): best 1.937 — 90% of the clean improvement** (clean 1.68). Noise laws measured:
+contrast pollution strictly linear in σ; √N restart averaging; snapshot SNR ≈ 1/53 at σ=0.3%,
+r=0.2 ⇒ hardware needs ~10³–10⁴ lock-in averages per update (ms at MHz loops — physically trivial,
+digitally prohibitive: the noise dimension repeats the compute story). Discovery en route: the
+frozen AEP correction has a clean-environment instability at nudge horizons ≳150–300 steps
+(spectrum of J(z) − J* + J*ᵀ uncontrolled) — windows ≤120 steps + restart averaging circumvent it;
+the single-trajectory oscillatory (true lock-in) estimator awaits a fix for this horizon limit.
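The "~10³–10⁴ lock-in averages" figure follows from the √N averaging law and the measured snapshot SNR of about 1/53. A rough check, assuming the goal is a post-averaging SNR between 1 and 2 (the target values and the 1 MHz loop rate used for timing are illustrative assumptions, not measurements):

```python
import math


def averages_needed(noise_to_signal, target_snr):
    """Averaging N snapshots improves SNR by sqrt(N), so N = (target_snr * noise/signal)^2."""
    return math.ceil((target_snr * noise_to_signal) ** 2)


NOISE_TO_SIGNAL = 53.0   # inverse of the snapshot SNR ~ 1/53 quoted above
LOOP_RATE_HZ = 1e6       # assumed loop rate for the timing estimate

for target in (1.0, 2.0):
    n = averages_needed(NOISE_TO_SIGNAL, target)
    print(f"target SNR {target}: {n} averages, {1e3 * n / LOOP_RATE_HZ:.1f} ms per update")
```

Both cases land in the low-thousands to ten-thousand range and at millisecond timescales, consistent with the estimate in the paragraph above.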
+
+## 8. Open problems
+
+1. **Late drift — mostly solved, mechanism identified**: the drift was the controller *fighting*
+ the weights (λ pinned at max enforcing a target the upgraded estimator no longer needs; R7/R8
+ tails). Re-calibrating target/λmax to the holomorphic estimator's wider validity region removed
+ the fight (R9: λ stays 0.1–0.5, tail drift shrank from ~0.3 to ~0.15 above best). Refuted
+ routes: λ-floor annealing (R6 ⇒ fp-absorption explosion — the floor is load-bearing);
+ param-EMA alone (R8 — smooths wobble ~0.05, can't fix the fight). Residual ~0.15 tail drift
+ remains open (estimator direction bias near optimum is the suspect).
+2. **Adaptive T2 — SOLVED** by hindsight snapshot selection (§4.3): judge by increments of the
+ contrast estimate, not step sizes; select the most-settled snapshot. Probe mean cos 0.932;
+ training −0.064 val CE (R10). Possible refinements: larger N with selection, per-batch T2max.
+3. **Mixing time**: slow-mixing equilibria make *all* gradients horizon-expensive (BPTT included);
+ conditioning the dynamics for fast mixing (preconditioned/Anderson relaxation that preserves the
+ EP contrast structure) is unexplored here.
+4. **Scale**: depth-1 block, char-level, C=128. The mechanisms (validity threshold, non-normality,
+ controller design) are dynamics-level and should transfer; the constants will move.
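Item 4 lists non-normality among the mechanisms expected to transfer. A toy 2×2 example shows why eigenvalues alone miss it: the matrix below has spectral radius 0.94, yet the iterate grows several-fold before it decays. The matrix and the starting vector are made up for illustration and are unrelated to the block's actual Jacobian.

```python
import math

# Upper-triangular, so the eigenvalues are the diagonal entries (both 0.94).
J = [[0.94, 1.0],
     [0.0, 0.94]]


def step(J, x):
    """One linear iteration x <- J x."""
    return [J[0][0] * x[0] + J[0][1] * x[1],
            J[1][0] * x[0] + J[1][1] * x[1]]


def norm_trajectory(J, x0, n_steps):
    """Euclidean norm of the iterate after each step, starting from x0."""
    x = list(x0)
    norms = [math.hypot(x[0], x[1])]
    for _ in range(n_steps):
        x = step(J, x)
        norms.append(math.hypot(x[0], x[1]))
    return norms


spectral_radius = max(abs(J[0][0]), abs(J[1][1]))
norms = norm_trajectory(J, [0.0, 1.0], 400)
peak = max(norms)
print(f"spectral radius = {spectral_radius}")
print(f"peak norm = {peak:.2f} at step {norms.index(peak)}, final norm = {norms[-1]:.1e}")
```

A controller watching the spectral radius would call this system stable throughout; one watching the step-to-step residual sees the transient directly.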
+
+## 9. Hardware translation — can this now-complex algorithm still run on EP hardware?
+
+Audit of every component of the final recipe (R10) against an analog/in-memory substrate. The
+surprise: most of the added "complexity" is *control*, and control is cheap in analog; several of
+our fixes specifically REMOVED digital artifacts.
+
+| algorithm component | analog realization | difficulty |
+|---|---|---|
+| free phase (T1≈500 Euler steps — our digital bottleneck) | physical settling, ns–µs, "free" | trivial (hardware's whole pitch) |
+| adaptive T1 ("relax until res≤1e-4") | settling detector = comparator on dz/dt | trivial |
+| symmetric nudging ±β | output-node current injection | standard EP hardware |
+| **holomorphic N-point circle** | **AC-modulated nudge + lock-in (homodyne) detection** — the Cauchy sum over phases IS lock-in readout; this is Laborieux–Zenke's "finite-size oscillations" taken literally | standard measurement technique; *more* native than DC differencing, and the standard weapon against analog noise floors |
+| clamp removal (our biggest estimator fix) | hardware never had clamps; saturation is smooth | already done by physics |
+| VF update ⟨a, ∂F/∂θ⟩ | local Hebbian outer product (contrast × presynaptic activity) per crossbar; autograd was only digital bookkeeping | native |
+| λ-controller (residual → EMA → multiplicative λ, floor/cap) | a **neuromodulator/homeostatic loop**: 1 measurable scalar (settling ripple) → RC filter (the EMA) → log-domain integrator with rails (floor/cap) → global broadcast scaling a local anti-Hebbian (contraction) rule | a handful of analog components |
+| λ floor = anti-collapse (R2/R6 lesson) | minimum leak conductance — never let homeostasis switch off | natural |
+| adaptive T2 (snapshot selection) | sample-and-hold bank on the *contrast* signal + stability gate on the OUTPUT quantity (the transferable lesson: never gate on state velocity — non-normal transients fool it) | cheap |
+| weight caps (3× init) | device conductance range | physics gives it free |
+| param-EMA | slow/fast weight pairs (volatile + nonvolatile device) | known proposals |
+| AdamW | per-synapse capacitor for momentum; second moment is the real gap (shared by all analog-training schemes) | open engineering |
+| softmax attention + LN circuits | analog WTA / divisive-normalization primitives; data-dependent T×T attention in-memory | hard, but an *inference*-hardware problem shared by all analog-transformer efforts, not EP-specific |
+
+**The two genuine research obstacles:**
+
+1. **The AEP correction (J → Jᵀ) in physics.** Crossbars give Wᵀ for free (drive the other side),
+ but the correction needs the *circuit's* transposed Jacobian, including data-dependent softmax
+ parts. The classical answer exists: the **adjoint network** (Director–Rohrer 1969, circuit
+ sensitivity theory) — constructible by reversing non-reciprocal elements; nobody has built
+ AEP-learning with one yet. The alternative is a measured price list from our own ablations:
+ accept *reciprocal* (energy-based, tied-value) attention and skip the correction entirely,
+ costing ~0.15–0.2 CE (monDEQ 2.11 / energy-mode vs thick 1.95 under exact gradients).
+2. **Precision budget vs the validity threshold.** Analog noise floors (~1%, 8–10 effective bits)
+ sit exactly where we measured EP gradients dying (res ~1e-2). Mitigations, all quantified here:
+ the N-point estimator tolerates **r=0.2** (10× nudge signal at equal bias — probe: flat in r);
+ lock-in detection buys orders of magnitude of SNR below the noise floor; free and nudged phases
+ run on the *same* devices so static mismatch cancels in the contrast (EP's structural advantage
+ over on-chip backprop — note the adjoint path partially forfeits this and needs care); and
+ hardware's 10⁶× speed headroom converts to phase-averaging. Our cos-vs-residual and cos-vs-r
+ tables (§5, §4.3) are, read this way, the **spec sheet** for an analog EP-transformer design.
+
+**Memristor-crossbar specifics — the Jᵀ question.** Needing Jᵀ does NOT disqualify memristor EP
+platforms; crossbars are the most transpose-friendly analog substrate there is (drive rows→read
+columns = Wx; drive columns→read rows = Wᵀy — the property on-chip-BP designs rely on). The fork:
+(a) *passive-reciprocal* platforms (resistor-coupled, Kirchhoff/coupled-learning style) are J=Jᵀ by
+physics — no correction needed, but they cannot express non-reciprocal attention even at
+inference; on these, run the **reciprocal recipe**: LSE energy attention (tied value) + Hopfield
+FFN — whose force 2·relu(zWm)Wmᵀ uses ONE crossbar driven in both directions — at the measured
+~0.15–0.2 CE expressivity cost, zero hardware changes. (b) *active-periphery* platforms
+(DAC→crossbar→ADC loops; periphery already breaks reciprocity) get Jᵀ as **transposed reads of the
+same arrays + frozen local gains** (gelu′, softmax p, LN 1/σ from the free-phase operating point)
+— the Director–Rohrer adjoint network in crossbar form: a periphery/routing redesign, not a
+different device technology. Scope note: Jᵀ appears ONLY in the nudged phase applied to one vector;
+the free phase needs nothing, weight updates are local outer products, and even the λ penalty needs
+only ‖Jv‖² (forward perturbation + response energy, no transpose). Unlike on-chip BP there is no
+per-layer activation storage or strict reverse scheduling — one held operating point z* suffices.
+Numerics for the hardware team: large nudge amplitude r≈0.2 + multi-phase/AC (lock-in) readout is
+validated equivalent to r=0.02 (10× signal headroom); small-signal DC differencing dies at ~1e-3
+noise (our tf32 experiment: cos→−0.03). Suggested collaboration phasing: (1) reciprocal demo on
+the existing rig (zero redesign, pay 0.2), (2) transposed-periphery nudge → full AEP, buy it back.
+
+**On "training arbitrary analog circuits"** (the bigger question): classic EP requires
+energy-based (reciprocal) circuits. AEP lifts this to *any circuit with a stable fixed point* —
+IF the antisymmetric correction is realizable (adjoint network) or waived (reciprocal trade).
+What this project adds to that picture is the missing stability half: **training pushes arbitrary
+circuits off the contractive manifold** (bare-BPTT-14k showed even exact gradients walk off it),
+and a residual-driven homeostatic controller both prevents this AND improves learning — with the
+hard constraint that its floor never anneals to zero (fp-absorption collapse; in hardware:
+latch-up). Combined with agnostic/physical EP (Scellier et al. 2022 — no circuit model needed,
+contrast is measured) and small-scale physical demonstrations (Dillavou et al.'s self-learning
+resistor networks; Laydevant et al.'s Ising-machine EP, 2024; memristor activity-difference
+training), the pieces for "arbitrary stable analog circuit + adjoint or reciprocity + homeostatic
+contraction control = trainable" are all individually demonstrated; this work supplies the
+control law and the quantitative budgets.
+
+Inference note: causal attention's lower-triangular coupling means autoregressive generation
+settles *incrementally* — a new token's state relaxes with past states frozen, so EP inference is
+one physical settling per token, not a re-relaxation of the sequence.
+
+**Component BOM (assuming the current recipe survives the ladder unchanged):**
+(1) bidirectional-read analog weight arrays (ReRAM/PCM/analog-Flash/gain-cell/switched-cap) — all
+W and Wᵀ including the AEP adjoint reads; (2) state-integrator arrays (capacitor+OTA per state
+variable; K·T·C nodes — ~1M at the 33M demo, ~17M at 0.6B; the sequence dimension T dominates);
+(3) analog attention primitives — large-fan-in current/charge-domain softmax-WTA + T² score
+sample-and-hold for the frozen nudge gains — **the hardest, least-shelf-ready component**;
+(4) divisive-normalization circuits (LN/RMS/qk-norm); (5) mixed-signal periphery: DAC/ADC arrays,
+S&H banks, and **lock-in (synchronous-detection) channels** — large-r AC nudging is mandatory
+(small-signal DC differencing dies at analog noise floors; measured digitally via the tf32
+experiment); (6) control plane: settling comparator + RC filter (res-EMA) + log-domain integrator
+with rails (λ controller) + a global **learn-enable line (= the validity gate)** + fuse — a
+handful of components or one MCU; (7) weight-update machinery: coincidence pulse programming
+(local outer products), with device nonlinearity/endurance the classic pain point; (8) an FPGA
+phase sequencer (settle→hold→nudge±→snapshot→update).
+
+**Six-month prototype plan (borrow physics, don't fab — main track: optical).**
+*Primary — desktop optical EP machine (Goodman MVM + electronic loop):* one off-the-shelf LCoS SLM
+(~2M pixels, $15–25k) holds, with differential encoding, ~1M signed analog weights — the fully
+digitally-validated R10 thick block (C=128, 12C² ≈ 200k weights) occupies **one tenth of one SLM**
+(5× headroom). Weights (WQ/K/V/O, FFN) static on the SLM during settling; state z (C=128) cycles
+through a 128-channel DAC-driven source array → SLM → cylindrical-lens summation → photodiode
+array; nonlinearity, T² attention scores (negligible digitally at T=64), Euler integration, and
+the λ/gate control law in loop electronics; one loop pass = one Euler step. **Timing is set by
+B·T multiplexing, not settling**: each Euler step = B·T·(~6 matrices) MVM passes; at loop rates
+0.1–1 MHz and B=4–8, settle ≈ sub-second and a 14k-step training run ≈ hours; SLM refresh once
+per training step (60 Hz ample). **Wᵀ for AEP: program the transposed panels alongside W in the
+spare SLM area** (reverse-propagation reciprocity remains the Phase-2 elegance; don't gate the
+prototype on bidirectional alignment). Precision framing: master weights live fp32 in the digital
+shell, the SLM holds a fresh ~8-bit projection each step — standard QAT regime; the open question
+is per-pass multiplicative noise + slow drift (speckle/calibration, the optical ~1% floor), which
+is exactly what the spec-sheet arsenal (r=0.2 nudging, lock-in/homodyne — the field's native
+measurement, same-device contrast cancellation, λ-controller) was built for, and which is
+**pre-validated digitally by an optics-noise-model run** (8-bit weight quantization per step +
+1–2% multiplicative force-eval noise + drift) before any purchase. Novelty: photonics has in-situ
+BP and BP-free local learning (Science 2023 ×2); EP-on-optics exists only as oscillator theory —
+"EP-trained transformer on optics" is unclaimed. Budget $20–50k; optics ~2 months, loop
+electronics 1–2 months (rehearsed by the PCB track, same parts/skills), calibration+training 2.
+*Secondary (one day of email, no more):* Mythic M1076 as a borrowed settle engine — gated solely
+on SDK raw-MVM access + incremental writes; flash endurance marginal beyond few-k-step demos.
+(Laydevant et al.'s D-Wave EP precedent: reviewers accept rented physics.)
+*Fallback / rehearsal (zero-dependency):* board-level reciprocal block (C=8–16, Hopfield Wm
+driven bidirectionally + energy attention), digipot/MDAC weights, OTA+cap integrators, Red Pitaya
+AC nudge + lock-in, comparator+RC λ loop — small headline, but lands continuous settling, lock-in
+contrast readout, and the homeostatic control law in real electronics, and its loop electronics
+ARE the optical track's loop electronics.
+Six months buys no foundry CMOS, but **university cleanrooms (e.g., UIUC HMNTL) are a different
+category, and they fabricate exactly the one BOM item money can't buy: the weight arrays.**
+Passive BEOL memristor crossbars (bottom electrode / ALD HfOx-TaOx / top electrode, 3–4 masks,
+µm linewidth, contact or maskless litho) are the academic-cleanroom comfort zone; the practical
+per-array limit is sneak-path-set (~64×64–128×128 for 1R passive with V/2 biasing), and a handful
+of tiled arrays covers a C=32 reciprocal block (C=128 ≈ a dozen-array wiring project). FeFET
+(three-terminal, on the same line's ferroelectric pedigree) cures sneak paths for a few extra
+masks. The algorithm side has already bought insurance for first-batch device quality: Phase-1
+keeps fp32 masters with **program-verify writes (only ~6-bit iterative programmability needed —
+no pretty pulse physics)**, and 10–50% device mismatch is absorbed by same-device contrast
+cancellation + the λ-controller — validated in digital twin runs (per-step N-bit weight
+projection + per-pass multiplicative noise + static mismatch; see `--wq_bits/--fnoise/--wmis`).
+Discipline: fab ONLY the arrays; anything on Digi-Key stays COTS board-level (student-process
+CMOS periphery would be stone-age). This raises the board-track ceiling from digipot (~10³
+weights, C=8) to homemade crossbars (10⁴–10⁵ weights, C=32–64) without leaving campus. Execution
+for a no-fab-experience team: the standard academic rentals — (1) recipe-owner collaboration
+(their senior student runs their existing process; weeks of routine work; co-authorship; you
+never gown up), (2) apprenticeship via facility training + staff engineers (executed by a
+recruited student), (3) paid staff-run fabrication. Find the recipe owner before booking tool
+time; lead the pitch with the device-twin plot ("your first-batch devices suffice — proven").
+
+**Recommended Phase-1 architecture: analog equilibrium core + digital optimizer shell.** Physics
+performs only the expensive part (settling + contrast measurement = the ×300–1000 digital
+overhead); contrasts are ADC'd out, Adam/schedules/λ-logic stay digital, weights DAC back. This
+sidesteps component (7)'s update-nonlinearity pain and the missing analog Adam, at no loss of the
+compute advantage. Phase-2: full in-array updates.
+
+**Sizing correction — causal serialization (the assumption that kills the wafer-scale monster):**
+naive sizing assumes the whole sequence's state must be physically resident during settling
+(K·T·C integrator nodes — hundreds of millions at 8B scale, 1e4–5e4 mm²). Causality removes this:
+the free phase settles **token-by-token** (token t's equilibrium depends only on tokens ≤ t; past
+states live in an ordinary digital KV-cache), and the nudged phase is the adjoint of a
+time-lower-triangular system = an exact **reverse sweep** (upper-triangular back-substitution,
+exact to the same order as the AEP linearization). Physical state requirement drops from K·T·C to
+**K·C per token-slice** (÷T ≈ ÷2048): ~15 mm² of integrators; sequence caches are ~1 GB of
+commodity DRAM. Readout also streams per token-slice (~10⁵ values/token, batch-accumulated in
+charge domain) — ADC throughput lands at the standard IMC design point, wall-clock days for an
+8B-Chinchilla run at MWh-scale energy. Remaining big item: weight arrays only — 0.6B ≈ 2–4 chips
+@28nm (university-consortium scale, $5–20M staged program); 8B ≈ 5–8 reticle dies @7nm (gen-3).
+New throughput consideration: serialized operation makes τ_settle the rate limit, and the spring
+chain's ~K² mixing tax favors **shallow-wide stacks (K=4–8)** — which the ladder data already
+supports (thick single block ≈ same-shape BP). Program staging: MPW single-block demo (33M-class,
+$1–5M) → 0.6B 2–4-chip machine → 8B gen-3. Economics read: capex-dominated, opex→0 — the pitch is
+2–3 orders of magnitude energy and edge/continual learning, not cloud-GPU rent replacement.
+
+**Compute reality (digital simulation)**: EP's per-step force-eval budget E ≈ 700–3000 makes full
+EP training cost ≈ E/3 ≈ **230–1000× the BP cost** at equal tokens (Chinchilla 20×: 0.6B ⇒ 12B
+tokens ⇒ ~4.3e19 BP-FLOP ⇒ ~1e22-class EP-FLOP: H100-cluster scale for one run, years on 4×A6000).
+This multiplier is exactly what physical settling eliminates — the algorithm is expensive in
+digital silicon and native in physics. Note TinyStories (~0.6B tokens) Chinchilla-matches ~30M
+params — precisely the planned "readable stories" demo scale (S4).
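The compute estimate above can be reproduced with back-of-envelope arithmetic. The sketch assumes the common 6·N·D rule of thumb for BP training FLOPs, which the text does not state explicitly but which matches its ~4.3e19 figure; the variable names are illustrative.

```python
def bp_train_flops(n_params, n_tokens):
    """Rule-of-thumb BP training cost: about 6 FLOPs per parameter per token (assumption)."""
    return 6.0 * n_params * n_tokens


CHINCHILLA_TOKENS_PER_PARAM = 20
n_params = 0.6e9
n_tokens = CHINCHILLA_TOKENS_PER_PARAM * n_params       # 12B tokens
bp = bp_train_flops(n_params, n_tokens)

# EP multiplier ~ E/3 for a per-step force-eval budget E of 700 to 3000.
ep_low, ep_high = bp * 700 / 3, bp * 3000 / 3

tinystories_tokens = 0.6e9
matched_params = tinystories_tokens / CHINCHILLA_TOKENS_PER_PARAM

print(f"BP: {bp:.2e} FLOP")
print(f"EP: {ep_low:.2e} to {ep_high:.2e} FLOP")
print(f"TinyStories Chinchilla-matched size: {matched_params / 1e6:.0f}M params")
```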
+
+## 10. Code map (all on timan1)
+
+- `/tmp/lt_ep/lt_ep_train.py` — main trainer: EQBlock (all four force variants), `ep_step`
+ (VF-EP + AEP + optional holomorphic nudge), `bptt_step`, λ controller, caps.
+ Key flags: `--mode ep|bptt --attn_mode thick|real|energy|mono --jacreg --jr_floor --res_target
+ --res_ema --jr_lrcouple --holo N --hr r --c --T1 --T2 --eps --beta`.
+- `/tmp/lt_ep/holo_ep.py` — holomorphic force/softmax/LN/GELU, `holo_a` (Cauchy readout), probe.
+- `/tmp/lt_ep/grad_quality.py` — estimator-vs-exact cosine probe (validity threshold measurement).
+- `/tmp/lt_ep/solver_wall.py` — plain vs Anderson free-phase convergence per damping level.
+- `/tmp/lt_ep/bp_charlm.py` — param-matched standard BP transformer baseline.
+- `/home/yurenh2/ept/cet_mvp.py`, `cet_aep.py`, `aep_*.py` — CET reproduction + AEP validation
+ (vision side; gradient-fidelity numbers in Sec 3/FINDINGS).
+- Run logs: `/tmp/lt_ep/thickep_*.log`, `H2_*.json`.
+- Data: `/tmp/lt_ep/data/shakespeare_char/{train,val}.bin, meta.pkl`.
+
+Hardware: 1× RTX A6000 per run (shared node); plain-EP ~2.4 it/s, holo-EP(N=2) ~1.5–2 it/s at
+B=32, T=64, C=128, T1=150, T2=20. A 14k-step run ≈ 1.6–2.5 h.
diff --git a/docs/method/READING.md b/docs/method/READING.md
new file mode 100644
index 0000000..12561a3
--- /dev/null
+++ b/docs/method/READING.md
@@ -0,0 +1,58 @@
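+# Project Reading List: Training Equilibrium Transformers with EP (incl. the analog-hardware track)
+
+Listed in learning order. The seven core papers are the minimal set for understanding this project; the side tracks are on-demand. Each entry notes why to read it.
+After the core list, go straight to the internal docs: `METHODS.md` (current system state) → `FINDINGS.md` (chronicle of findings).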
+
+## Core (must-read, in order)
+
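+1. **Equilibrium Propagation**, Scellier & Bengio 2017, arXiv:1602.05179
+ Where it all starts: free phase / nudged phase / local contrastive update. Read it until you can recite the two-phase structure from memory.
+2. **EP ≡ BPTT**, Ernoult et al. 2019, arXiv:1905.13633
+ Why EP computes the true gradient, and at what cost (the free phase must converge, and β→0). The origin of this project's "regime of validity" notion.
+3. **Scaling EP (symmetric nudging)**, Laborieux et al. 2021, arXiv:2006.03824
+ The centered ±β difference cancels the first-order bias; EP's first run on CIFAR. The basic form of our contrastive readout.
+4. **Holomorphic EP**, Laborieux & Zenke 2022, arXiv:2209.00530
+ N points in the complex plane / oscillating phases → exact gradient at finite β. The theoretical root of our estimator and of the hardware lock-in story.
+ (Important expectation management: we measured that its "oscillatory" form is only required under white noise; in a clean digital setting N=2 suffices.)
+5. **AEP: EP for non-conservative systems**, arXiv:2602.03670
+ The antisymmetric correction −(J−Jᵀ)(z−z*): it flips the nudged-phase linearization from J to Jᵀ, making true attention with Q≠K trainable by EP.
+ The most important external method for this project. Our extension: common-mode-tracking linearization (see METHODS §4.3).
+6. **CET: Convergent Energy Transformer**, Høier, Kerjan, Scellier, ICLR'26 AM workshop (OpenReview: Qrfml76eWJ)
+ EP training of energy-based (reciprocal, tied-value) attention; the SOTA before we entered. We reproduced it (cet_mvp.py),
+ and it is also the recipe for the "reciprocal concession" hardware route (Phase-0).
+7. **DEQ: Deep Equilibrium Models**, Bai et al. 2019, arXiv:1909.01377
+ The general framework of the equilibrium-architecture family: weight-tied fixed-point networks match explicit transformers. Our thick block is a DEQ-style block.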
+
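+## Stability side track (understanding our control law)
+
+- **Stabilizing equilibrium models by Jacobian regularization**, Bai et al. 2021, arXiv:2106.14342: the source of the λ penalty.
+- **monDEQ**, Winston & Kolter 2020, arXiv:2006.08591: structurally guaranteed unique fixed point; our mono ablation.
+- **FRE-RNN (Toward Practical EP)**, arXiv:2508.11659: feedback regulation of the spectral radius; the spiritual predecessor of our res-driven controller
+ (note our finding: with a non-normal Jacobian the spectral radius is the wrong signal and the residual must be used instead; see METHODS §5).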
+
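+## Hardware side track (analog implementation route)
+
+- **EP for analog circuits**, Kendall et al. 2020, arXiv:2006.01981: the founding proposal for EP on analog hardware.
+- **Physical demonstration of a learning network**, Dillavou et al. 2022, Phys. Rev. Applied: contrastive local learning in a real resistor network.
+- **Ising-machine EP**, Laydevant et al. 2024, Nature Communications: rented physics (D-Wave) is publishable too; this is the precedent.
+- **Agnostic physics-driven learning**, Scellier et al. 2022, arXiv:2205.15021: EP that needs no circuit model.
+- For contrast: **Physics-aware training (PNN)**, Wright et al. 2022, Nature: physical forward pass + digital backward pass (the road we do not take).
+- Circuit-theory classic: **the adjoint network**, Director & Rohrer 1969 (IEEE Trans. Circuit Theory): the physical construction of Jᵀ.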
+
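+## Optimizer side track (the hardware-friendly optimizer debate)
+
+- **Why Transformers Need Adam (Hessian heterogeneity)**, NeurIPS 2024.
+- **SGD-SaI**, arXiv:2412.11768: per-block lr fixed at initialization → SGDM catches up with AdamW (the prototype of our EP-SaI; we measured that it recovers only part of the gap).
+- **Do We Need Adam? (pure SGD in the RL stage + 0.02% sparse updates)**, Mukherjee et al., arXiv:2602.07729 (Hao Peng's group, UIUC).
+- Lion, arXiv:2302.06675: sign updates = fixed-amplitude pulse programming (the hardware view).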
+
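+## Corpus and background
+
+- **TinyStories**, arXiv:2305.07759: small models can write coherent stories; the basis for our ladder corpora and for the scale of the "readable" demo.
+- Universal Transformer, arXiv:1807.03819: the precedent for weight-tied depth.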
+
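+## Internal docs (after the core list)
+
+1. `~/ept/METHODS.md`: the whole system, covering architecture, estimator, control law, scaling laws, hardware translation, and BOM.
+2. `~/ept/FINDINGS.md`: the chronicle of every failure, post-mortem, and fix (refutation of the "wall", the gate, the noise campaign).
+3. Code: `~/ept/lt_ep_code/` (backup); active experiments live on timan1:/tmp/lt_ep/.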
diff --git a/docs/method/READING_EN.md b/docs/method/READING_EN.md
new file mode 100644
index 0000000..351c5ea
--- /dev/null
+++ b/docs/method/READING_EN.md
@@ -0,0 +1,54 @@
+# Project Reading List — Training Equilibrium Transformers with EP (incl. the analog-hardware track)
+
+Ordered for learning. The seven **core** papers are the minimal set to understand this project; the side tracks are on-demand. Each entry notes *why* to read it. After the core list, go straight to the internal docs: `METHODS.md` (current system) → `FINDINGS.md` (chronicle of findings).
+
+## Core (must-read, in order)
+
+1. **Equilibrium Propagation** — Scellier & Bengio 2017, arXiv:1602.05179
+ Where it all starts: free phase / nudged phase / local contrastive update. Read it until you can recite the two-phase structure from memory.
+2. **EP ≡ BPTT** — Ernoult et al. 2019, arXiv:1905.13633
+ *Why* EP computes the true gradient, and at what price (the free phase must converge + β→0). The origin of this project's "regime of validity" notion.
+3. **Scaling EP (symmetric nudging)** — Laborieux et al. 2021, arXiv:2006.03824
+ The centered ±β difference cancels the first-order bias; EP's first run on CIFAR. The baseline form we read everything against.
+4. **Holomorphic EP** — Laborieux & Zenke 2022, arXiv:2209.00530
+ N complex-plane points / oscillating phases → exact gradient at finite β. The theoretical root of our estimator and of the hardware lock-in story.
+ (A caveat to set expectations: we find empirically that its "oscillatory" form is only *required* under white noise; in a clean digital setting N=2 suffices.)
+5. **AEP: EP for non-conservative systems** — arXiv:2602.03670
+ The antisymmetric correction −(J−Jᵀ)(z−z*): it flips the nudged-phase linearization from J to Jᵀ, making true attention with Q≠K EP-trainable. The single most important external method for this project. Our extension: common-mode-tracking linearization (see METHODS §4.3).
+6. **CET: Convergent Energy Transformer** — Høier, Kerjan, Scellier, ICLR'26 AM workshop (OpenReview: Qrfml76eWJ)
+ EP training of an energy-based (reciprocal, tied-value) attention — the SOTA before we entered. We reproduced it (cet_mvp.py); it's also the recipe for the "reciprocity-concession" hardware track (Phase-0).
+7. **DEQ: Deep Equilibrium Models** — Bai et al. 2019, arXiv:1909.01377
+ The foundational reference for the equilibrium-architecture family: a weight-tied fixed-point net matches an explicit transformer. Our "thick" block is a DEQ-style block.
+
+## Stability side-track (to understand our control laws)
+
+- **Stabilizing Equilibrium Models by Jacobian Regularization** — Bai et al. 2021, arXiv:2106.14342: the source of the λ-penalty.
+- **monDEQ (Monotone Operator Equilibrium Networks)** — Winston & Kolter 2020, arXiv:2006.08591: a structural guarantee of a unique fixed point; our `mono` ablation.
+- **FRE-RNN (Toward Practical EP)** — arXiv:2508.11659: feedback that regulates the spectral radius; the spiritual predecessor of our residual-driven controller. (Note our finding: under a non-normal Jacobian the spectral radius is the *wrong* signal — you must use the residual. See METHODS §5.)
+
+## Hardware side-track (the analog-implementation route)
+
+- **EP on analog circuits** — Kendall et al. 2020, arXiv:2006.01981: the founding proposal for EP on analog hardware.
+- **Physical-learning-network demonstration** — Dillavou et al. 2022, Phys. Rev. Applied: contrastive local learning on a real resistor network.
+- **EP on an Ising machine** — Laydevant et al. 2024, Nature Communications: work on *rented* physical hardware (D-Wave) was publishable too; this is the precedent.
+- **Agnostic physics-driven learning** — Scellier et al. 2022, arXiv:2205.15021: EP without needing a circuit model.
+- Contrast case: **Physics-aware training (PNN)** — Wright et al. 2022, Nature: physical forward + digital backprop (the road we *don't* take).
+- Circuit-theory classic: **the adjoint network** — Director & Rohrer 1969 (IEEE Trans. Circuit Theory): the physical construction of Jᵀ.
+
+## Optimizer side-track (the fight over a hardware-friendly optimizer)
+
+- **Why Transformers Need Adam (Hessian heterogeneity)** — NeurIPS 2024.
+- **SGD-SaI** — arXiv:2412.11768: initialization sets a per-block lr → SGDM matches AdamW (the prototype for our EP-SaI; empirically it only recovers part of the gap).
+- **Do We Need Adam? (pure SGD + 0.02% sparse updates in the RL stage)** — Mukherjee et al., arXiv:2602.07729 (Hao Peng's group, UIUC).
+- **Lion** — arXiv:2302.06675: the sign update = fixed-amplitude pulse programming (the hardware view).
+
+## Corpus & background
+
+- **TinyStories** — arXiv:2305.07759: small models can write coherent stories; the basis for our ladder corpus and the size target of the "legible" demo.
+- **Universal Transformer** — arXiv:1807.03819: the precedent for weight-tied depth.
+
+## Internal docs (after the core list)
+
+1. `~/ept/METHODS.md` — the full system: architecture, estimator, control laws, scaling laws, hardware translation + BOM.
+2. `~/ept/FINDINGS.md` — the chronicle: every failure, post-mortem, and fix (refuting "the wall", the gate, the noise campaign).
+3. Code: `~/ept/lt_ep_code/` (backup); active experiments under `~/ept/ep_run/`.