Diffstat (limited to 'docs/method/READING_EN.md')
-rw-r--r-- docs/method/READING_EN.md 54
1 files changed, 54 insertions, 0 deletions
diff --git a/docs/method/READING_EN.md b/docs/method/READING_EN.md
new file mode 100644
index 0000000..351c5ea
--- /dev/null
+++ b/docs/method/READING_EN.md
@@ -0,0 +1,54 @@
+# Project Reading List — Training Equilibrium Transformers with EP (incl. the analog-hardware track)
+
+Ordered for learning. The seven **core** papers are the minimal set to understand this project; the side tracks are on-demand. Each entry notes *why* to read it. After the core list, go straight to the internal docs: `METHODS.md` (current system) → `FINDINGS.md` (chronicle of findings).
+
+## Core (must-read, in order)
+
+1. **Equilibrium Propagation** — Scellier & Bengio 2017, arXiv:1602.05179
+ Where it all starts: free phase / nudged phase / local contrastive update. Read it until you can recite the two-phase structure from memory.
+2. **EP ≡ BPTT** — Ernoult et al. 2019, arXiv:1905.13633
+ *Why* EP computes the true gradient, and at what price: the free phase must converge and β must tend to 0. The origin of this project's "regime of validity" notion.
+3. **Scaling EP (symmetric nudging)** — Laborieux et al. 2021, arXiv:2006.03824
+ The centered ±β difference cancels the first-order bias; EP's first run on CIFAR. The baseline form we read everything against.
+4. **Holomorphic EP** — Laborieux & Zenke 2022, arXiv:2209.00530
+ N complex-plane points / oscillating phases → exact gradient at finite β. The theoretical root of our estimator and of the hardware lock-in story.
+ (A caveat to set expectations: we find empirically that the "oscillatory" form is *required* only under white noise; in a clean digital setting, N=2 suffices.)
+5. **AEP: EP for non-conservative systems** — arXiv:2602.03670
+ The antisymmetric correction −(J−Jᵀ)(z−z*): it flips the nudged-phase linearization from J to Jᵀ, making true attention with Q≠K EP-trainable. The single most important external method for this project. Our extension: common-mode-tracking linearization (see METHODS §4.3).
+6. **CET: Convergent Energy Transformer** — Høier, Kerjan, Scellier, ICLR'26 AM workshop (OpenReview: Qrfml76eWJ)
+ EP training of an energy-based (reciprocal, tied-value) attention, and the state of the art before this project. We reproduced it (`cet_mvp.py`); it is also the recipe for the "reciprocity-concession" hardware track (Phase-0).
+7. **DEQ: Deep Equilibrium Models** — Bai et al. 2019, arXiv:1909.01377
+ The master plan of the equilibrium-architecture family: a weight-tied fixed-point net matches an explicit transformer. Our "thick" block is a DEQ-style block.
+
+## Stability side-track (to understand our control laws)
+
+- **Stabilizing Equilibrium Models by Jacobian Regularization** — Bai et al. 2021, arXiv:2106.14342: the source of the λ-penalty.
+- **monDEQ (Monotone Operator Equilibrium Networks)** — Winston & Kolter 2020, arXiv:2006.08591: a structural guarantee of a unique fixed point; our `mono` ablation.
+- **FRE-RNN (Toward Practical EP)** — arXiv:2508.11659: feedback that regulates the spectral radius; the spiritual predecessor of our residual-driven controller. (Note our finding: under a non-normal Jacobian the spectral radius is the *wrong* signal — you must use the residual. See METHODS §5.)
+
+## Hardware side-track (the analog-implementation route)
+
+- **EP on analog circuits** — Kendall et al. 2020, arXiv:2006.01981: the founding proposal for EP on analog hardware.
+- **Physical-learning-network demonstration** — Dillavou et al. 2022, Phys. Rev. Applied: contrastive local learning on a real resistor network.
+- **EP on an Ising machine** — Laydevant et al. 2024, Nature Communications: the precedent that results on *rented* physics (D-Wave) are publishable.
+- **Agnostic physics-driven learning** — Scellier et al. 2022, arXiv:2205.15021: EP without needing a circuit model.
+- Contrast case: **Physics-aware training (PNN)** — Wright et al. 2022, Nature: physical forward + digital backprop (the road we *don't* take).
+- Circuit-theory classic: **the adjoint network** — Director & Rohrer 1969 (IEEE Trans. Circuit Theory): the physical construction of Jᵀ.
+
+## Optimizer side-track (the fight over a hardware-friendly optimizer)
+
+- **Why Transformers Need Adam (Hessian heterogeneity)** — NeurIPS 2024.
+- **SGD-SaI** — arXiv:2412.11768: initialization sets a per-block lr → SGDM matches AdamW (the prototype for our EP-SaI; empirically it only recovers part of the gap).
+- **Do We Need Adam? (pure SGD + 0.02% sparse updates in the RL stage)** — Mukherjee et al., arXiv:2602.07729 (Hao Peng's group, UIUC).
+- **Lion** — arXiv:2302.06675: the sign update = fixed-amplitude pulse programming (the hardware view).
+
+## Corpus & background
+
+- **TinyStories** — arXiv:2305.07759: small models can write coherent stories; the basis for our ladder corpus and the size target of the "legible" demo.
+- **Universal Transformer** — arXiv:1807.03819: the precedent for weight-tied depth.
+
+## Internal docs (after the core list)
+
+1. `~/ept/METHODS.md` — the full system: architecture, estimator, control laws, scaling laws, hardware translation + BOM.
+2. `~/ept/FINDINGS.md` — the chronicle: every failure, post-mortem, and fix (refuting "the wall", the gate, the noise campaign).
+3. Code: `~/ept/lt_ep_code/` (backup); active experiments under `~/ept/ep_run/`.
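
Core entry 3 says the centered ±β difference cancels the first-order bias. The following is a minimal Python sketch of that claim on a toy scalar system. Everything in it is an assumption made for illustration and is not code from this project: the energy `E(theta, s) = s**2/2 - theta*x*s`, the cost `C(s) = (s - y)**2/2`, and the hypothetical helpers `relax`, `dE_dtheta`, and `ep_estimates`.

```python
def relax(theta, x, y, beta, steps=200, lr=0.5):
    """Settle the state s to the minimum of F = E + beta * C.

    Toy assumption: E = s^2/2 - theta*x*s and C = (s - y)^2/2,
    so dF/ds = (s - theta*x) + beta*(s - y).
    """
    s = 0.0
    for _ in range(steps):
        dF_ds = (s - theta * x) + beta * (s - y)
        s -= lr * dF_ds  # plain gradient descent on the state
    return s


def dE_dtheta(x, s):
    """Partial derivative of the toy energy with respect to theta."""
    return -x * s


def ep_estimates(theta, x, y, beta):
    """Return (one-sided EP estimate, centered EP estimate, true gradient)."""
    s_free = relax(theta, x, y, 0.0)    # free phase
    s_pos = relax(theta, x, y, +beta)   # nudged phase at +beta
    s_neg = relax(theta, x, y, -beta)   # nudged phase at -beta

    one_sided = (dE_dtheta(x, s_pos) - dE_dtheta(x, s_free)) / beta
    centered = (dE_dtheta(x, s_pos) - dE_dtheta(x, s_neg)) / (2.0 * beta)

    # At the free equilibrium s* = theta*x, so the loss is (theta*x - y)^2 / 2.
    true_grad = x * (theta * x - y)
    return one_sided, centered, true_grad


if __name__ == "__main__":
    for beta in (0.2, 0.1, 0.05):
        one_sided, centered, true_grad = ep_estimates(0.5, 2.0, 3.0, beta)
        print(beta, abs(one_sided - true_grad), abs(centered - true_grad))
```

In this toy case the one-sided error shrinks roughly in proportion to β, while the centered error shrinks in proportion to β², which is the cancellation the entry refers to. Real EP networks are not quadratic, so this shows only the order of the bias, not its size in practice.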