# Project Reading List — Training Equilibrium Transformers with EP (incl. the analog-hardware track)

Ordered for learning. The seven **core** papers are the minimal set to understand this project; the side tracks are on-demand. Each entry notes *why* to read it. After the core list, go straight to the internal docs: `METHODS.md` (current system) → `FINDINGS.md` (chronicle of findings).

## Core (must-read, in order)

1. **Equilibrium Propagation** — Scellier & Bengio 2017, arXiv:1602.05179
   Where it all starts: free phase / nudged phase / local contrastive update. Read it until you can recite the two-phase structure from memory.
2. **EP ≡ BPTT** — Ernoult et al. 2019, arXiv:1905.13633
   *Why* EP computes the true gradient, and at what price: the free phase must converge and β → 0. The origin of this project's "regime of validity" notion.
3. **Scaling EP (symmetric nudging)** — Laborieux et al. 2021, arXiv:2006.03824
   The centered ±β difference cancels the first-order bias in β; the first demonstration of EP on CIFAR. The baseline form we read everything against.
4. **Holomorphic EP** — Laborieux & Zenke 2022, arXiv:2209.00530
   N complex-plane points / oscillating phases → exact gradient at finite β. The theoretical root of our estimator and of the hardware lock-in story.
   (Caveat: we find empirically that its "oscillatory" form is only *required* under white noise; in a clean digital setting N=2 suffices.)
5. **AEP: EP for non-conservative systems** — arXiv:2602.03670
   The antisymmetric correction −(J−Jᵀ)(z−z*): it flips the nudged-phase linearization from J to Jᵀ, making true attention with Q≠K EP-trainable. The single most important external method for this project. Our extension: common-mode-tracking linearization (see METHODS §4.3).
6. **CET: Convergent Energy Transformer** — Høier, Kerjan, Scellier, ICLR'26 AM workshop (OpenReview: Qrfml76eWJ)
   EP training of an energy-based (reciprocal, tied-value) attention, and the state of the art before this project. We reproduced it (`cet_mvp.py`); it is also the recipe for the "reciprocity-concession" hardware track (Phase-0).
7. **DEQ: Deep Equilibrium Models** — Bai et al. 2019, arXiv:1909.01377
   The blueprint for the equilibrium-architecture family: a weight-tied fixed-point net matches an explicit transformer. Our "thick" block is a DEQ-style block.
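
To make items 1, 3 and 4 concrete, here is a minimal sketch of the three estimators on a toy model. Everything in it is an assumption for illustration, not project code: the quadratic energy, the squared-error loss, and the names `relax`, `dE_dW`, `ep_one_sided` and `ep_n_point` are all hypothetical. The N-point form is written as a discretized contour integral over nudges of fixed magnitude β, which is one way to read the holomorphic construction; with `n_points=2` it reduces to the centered ±β difference.

```python
import numpy as np

def relax(W, x, y, beta, steps=600, lr=0.1):
    """Settle the state s at a stationary point of F = E + beta * L.

    Toy model (assumed for illustration only):
      E = 0.5 * ||s||^2 - s . (W x),   L = 0.5 * ||s - y||^2.
    beta may be complex, which the N-point estimator relies on.
    """
    s = np.zeros(W.shape[0], dtype=complex)
    for _ in range(steps):
        grad_F = (s - W @ x) + beta * (s - y)  # dE/ds + beta * dL/ds
        s = s - lr * grad_F
    return s

def dE_dW(s, x):
    """Partial derivative of the toy energy w.r.t. W at a fixed state s."""
    return -np.outer(s, x)

def ep_one_sided(W, x, y, beta):
    """Free phase vs. nudged phase; the bias is first order in beta."""
    free, nudged = relax(W, x, y, 0.0), relax(W, x, y, beta)
    return ((dE_dW(nudged, x) - dE_dW(free, x)) / beta).real

def ep_n_point(W, x, y, beta, n_points=2):
    """N nudges equally spaced on a circle of radius beta in the complex plane."""
    total = np.zeros(W.shape, dtype=complex)
    for k in range(n_points):
        phase = np.exp(2j * np.pi * k / n_points)
        s = relax(W, x, y, beta * phase)
        total += dE_dW(s, x) / phase
    return (total / (n_points * beta)).real

rng = np.random.default_rng(0)
W, x, y = rng.normal(size=(3, 4)), rng.normal(size=4), rng.normal(size=3)
true_grad = np.outer(W @ x - y, x)  # dL/dW at the free equilibrium s* = W x
beta = 0.2

errors = {
    "one-sided": np.abs(ep_one_sided(W, x, y, beta) - true_grad).max(),
    "N=2": np.abs(ep_n_point(W, x, y, beta, 2) - true_grad).max(),
    "N=4": np.abs(ep_n_point(W, x, y, beta, 4) - true_grad).max(),
}
for name, err in errors.items():
    print(f"{name:>9}: max abs error {err:.2e}")
```

On this toy model the error at fixed β shrinks from the one-sided estimate to N=2 to N=4, which is the pattern the three papers describe. It says nothing about noise robustness, which is where the caveat under item 4 comes in.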

## Stability side-track (to understand our control laws)

- **Stabilizing Equilibrium Models via Jacobian Regularization** — Bai et al. 2021, arXiv:2106.14342: the source of the λ-penalty.
- **monDEQ (Monotone Operator Equilibrium Networks)** — Winston & Kolter 2020, arXiv:2006.08591: a structural guarantee of a unique fixed point; our `mono` ablation.
- **FRE-RNN (Toward Practical EP)** — arXiv:2508.11659: feedback that regulates the spectral radius; the spiritual predecessor of our residual-driven controller. (Note our finding: under a non-normal Jacobian the spectral radius is the *wrong* signal — you must use the residual. See METHODS §5.)
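
The note on non-normal Jacobians can be illustrated with a two-dimensional linear iteration. This is a sketch under stated assumptions: the matrix `J`, the drive `b`, and the step count are made up for the example and are not taken from the project. The spectral radius is well below 1, yet the step-to-step residual first grows by more than an order of magnitude before it decays, which is the behavior a spectral-radius signal cannot see.

```python
import numpy as np

# Hypothetical non-normal Jacobian: both eigenvalues equal 0.5, but the large
# off-diagonal entry couples the second coordinate strongly into the first.
J = np.array([[0.5, 20.0],
              [0.0, 0.5]])
b = np.array([0.0, 1.0])

spectral_radius = np.abs(np.linalg.eigvals(J)).max()

# Fixed-point iteration z <- J z + b, tracking the residual ||z_next - z||.
z = np.zeros(2)
residuals = []
for _ in range(60):
    z_next = J @ z + b
    residuals.append(np.linalg.norm(z_next - z))
    z = z_next

peak_step = int(np.argmax(residuals))
print(f"spectral radius: {spectral_radius:.2f}")
print(f"first residual:  {residuals[0]:.2f}")
print(f"peak residual:   {max(residuals):.2f} at step {peak_step}")
print(f"final residual:  {residuals[-1]:.2e}")
```

The iteration does converge in the end, as the spectral radius predicts, but a controller that only watched the eigenvalues would have no warning of the transient. The residual reports it directly.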

## Hardware side-track (the analog-implementation route)

- **EP on analog circuits** — Kendall et al. 2020, arXiv:2006.01981: the founding proposal for EP on analog hardware.
- **Physical-learning-network demonstration** — Dillavou et al. 2022, Phys. Rev. Applied: contrastive local learning on a real resistor network.
- **EP on an Ising machine** — Laydevant et al. 2024, Nature Communications: the precedent that results obtained on *rented* physics (D-Wave) are publishable.
- **Agnostic physics-driven learning** — Scellier et al. 2022, arXiv:2205.15021: EP without needing a circuit model.
- Contrast case: **Physics-aware training (PNN)** — Wright et al. 2022, Nature: physical forward + digital backprop (the road we *don't* take).
- Circuit-theory classic: **the adjoint network** — Director & Rohrer 1969 (IEEE Trans. Circuit Theory): the physical construction of Jᵀ.

## Optimizer side-track (the fight over a hardware-friendly optimizer)

- **Why Transformers Need Adam (Hessian heterogeneity)** — NeurIPS 2024.
- **SGD-SaI** — arXiv:2412.11768: initialization sets a per-block lr → SGDM matches AdamW (the prototype for our EP-SaI; empirically it only recovers part of the gap).
- **Do We Need Adam? (pure SGD + 0.02% sparse updates in the RL stage)** — Mukherjee et al., arXiv:2602.07729 (Hao Peng's group, UIUC).
- **Lion** — arXiv:2302.06675: the sign update = fixed-amplitude pulse programming (the hardware view).
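
The "fixed-amplitude pulse" reading of Lion is easy to see in code. The sketch below follows the update rule as published (sign of an interpolated momentum, decoupled weight decay, then a momentum update). The function name `lion_step`, the test gradient and its scales are assumptions for illustration, and weight decay is set to zero so that the step size is exactly `lr`.

```python
import numpy as np

def lion_step(theta, grad, momentum, lr=1e-3, beta1=0.9, beta2=0.99,
              weight_decay=0.0):
    """One Lion update on NumPy arrays (illustrative helper, not project code)."""
    # Direction: sign of the interpolation between momentum and fresh gradient.
    update = np.sign(beta1 * momentum + (1.0 - beta1) * grad)
    # Parameter step: every coordinate moves by lr, plus decoupled weight decay.
    new_theta = theta - lr * (update + weight_decay * theta)
    # Momentum is updated after the step, with a separate coefficient.
    new_momentum = beta2 * momentum + (1.0 - beta2) * grad
    return new_theta, new_momentum

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
momentum = np.zeros(5)
# Gradient entries spanning eight orders of magnitude.
grad = rng.normal(size=5) * np.array([1e-4, 1e-2, 1.0, 1e2, 1e4])

new_theta, momentum = lion_step(theta, grad, momentum)
step_sizes = np.abs(new_theta - theta)
print(step_sizes)
```

Every coordinate moves by the same amount regardless of gradient magnitude, so on hardware each update only needs a direction bit per weight and one shared pulse amplitude.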

## Corpus & background

- **TinyStories** — arXiv:2305.07759: small models can write coherent stories; the basis for our ladder corpus and the size target of the "legible" demo.
- **Universal Transformer** — arXiv:1807.03819: the precedent for weight-tied depth.

## Internal docs (after the core list)

1. `~/ept/METHODS.md` — the full system: architecture, estimator, control laws, scaling laws, hardware translation + BOM.
2. `~/ept/FINDINGS.md` — the chronicle: every failure, post-mortem, and fix (refuting "the wall", the gate, the noise campaign).
3. Code: `~/ept/lt_ep_code/` (backup); active experiments under `~/ept/ep_run/`.