diff options
| author | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
|---|---|---|
| committer | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
| commit | b83947778e2c776f757a07d4719b7ce961d7ed55 (patch) | |
| tree | b9cc01d7adda691d9156d9d04f4fb2f644674e96 /docs/method/ARCHITECTURE.md | |
Initial commit: ept — backprop-free equilibrium transformer (EP)
Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
Diffstat (limited to 'docs/method/ARCHITECTURE.md')
| -rw-r--r-- | docs/method/ARCHITECTURE.md | 117 |
1 files changed, 117 insertions, 0 deletions
diff --git a/docs/method/ARCHITECTURE.md b/docs/method/ARCHITECTURE.md new file mode 100644 index 0000000..213b192 --- /dev/null +++ b/docs/method/ARCHITECTURE.md @@ -0,0 +1,117 @@ +# Architecture: EP/AEP-trained Equilibrium Transformer — implementation details + +Based on the actual implementation in `/tmp/lt_ep/lt_ep_train.py` (`EQBlock` + `ep_step`). + +## 0. Overview — one unified force, one relaxation + +The whole block is **one dynamical system**: token state `z ∈ R^{B,T,C}` relaxes to a fixed +point under a **single force** that bundles input-clamp + FFN + attention: + +``` +F(z) = − ∂/∂z [ ½‖z − x_in‖² + E_mem(z) ] + s·( Attn(z) − c·z ) + └────── conservative part (symmetric Jacobian) ──┘ └── non-conservative (non-reciprocal) ──┘ + input clamp Hopfield memory = FFN causal self-attention + damping +``` +- `x_in = tok[idx] + pos` — input embedding, clamped as a boundary condition. +- **Forward = relaxation**: `z ← z + ε·F(z)` for T1 steps → fixed point `z*`; read out `logits = z*·W_h`. +- Conservative (FFN/clamp) and non-conservative (attention) live in **one force, one relaxation** — the basis of "unified" training. + +--- + +## 1. FFN — standard EP (the clean part) + +The FFN is rewritten as a **modern Hopfield memory energy**: +``` +E_mem(z) = − Σ_{token, m} relu( z · W_m )²_m # W_m ∈ R^{C×M}, M memories +``` +Its force `−∂E_mem/∂z = 2·W_m·[relu(zW_m) ⊙ 1_{zW_m>0}]` is exactly a **tied-weight 2-layer MLP +(W_in=W_out=W_m) with squared-ReLU** = the FFN. + +- **Conservative** (scalar energy, symmetric Jacobian) → **standard EP is exact, no correction**. +- Verified: FFN-param gradient cosine vs backprop = **1.000** (`lt_ep_ffn.py`). +- This is textbook EP / Hopfield — already demonstrated on memristor crossbars in the literature. + +--- + +## 2. Attention — how it is "EP-ified" (the novel part) + +**Step A — rewrite attention as a FORCE** (not a feedforward layer): tokens relax under it. +``` +Attn(z) = [ softmax( Q(z)K(z)ᵀ/√d , causal ) · V(z) ] · W_O + Q=zW_Q, K=zW_K, V=zW_V (independent projections ⇒ NON-reciprocal: i→j ≠ j→i) +force term = s·( Attn(z) − c·z ) # −c·z damping ⇒ contraction ⇒ a stable fixed point exists +``` + +**Step B — fix the non-reciprocity bias (AEP correction).** Because Q≠K and V is independent, +the attention Jacobian `J_attn` is **asymmetric** — it is not the gradient of any scalar. Plain EP's +nudged phase relaxes under `J`, but the correct adjoint needs `Jᵀ`, so plain EP gives a **biased** +gradient. AEP adds, in the nudged phase: +``` +v = z − z* # deviation from the free equilibrium +corr = s·( J_attn·v − J_attnᵀ·v ) # = 2·A_J·v , A_J = (J − Jᵀ)/2 (antisymmetric part) + J_attn·v = jvp(Attn, z*, v) # forward-mode (one forward probe) + J_attnᵀ·v = vjp(Attn, z*, v) # reverse-mode (one backward probe) +f_nudged = F(z) − sign·β·∂C/∂z − clip(corr) +``` +Effect: the attention part of the nudged linearization becomes `s·J·v − s·(J−Jᵀ)v = s·Jᵀ·v` +— i.e. **J is turned into Jᵀ**, giving the correct adjoint. +- The damping `−c·z` is **symmetric** (Jacobian −cI) ⇒ cancels in `J−Jᵀ` ⇒ the correction only + sees attention's **non-reciprocal** part. +- Verified: attention-param gradient cosine vs backprop = **0.99–1.0** (plain EP: 0.25–0.78). +- Hardware note: `jvp/vjp` = two probe directions; **non-reciprocal coupling is exactly what real + analog devices have** — AEP removes the usual "symmetric weights" requirement of EP hardware. + +--- + +## 3. End-to-end unified training + +**One relaxation, one estimator, trains the whole block.** Key fact: + +> The antisymmetric Jacobian of the **full** force, `A_J`, equals the antisymmetric part of +> (conservative + attention) = **attention's antisymmetric part alone** (the conservative FFN/clamp +> have symmetric Jacobians → contribute 0 to `A_J`). + +So **the AEP correction needs to act on the attention term only**; the FFN/clamp ride along in the +conservative part and are trained correctly by standard EP — **one relaxation, one correction, trains +everything.** + +**Training step (`ep_step`) = 3 phases + 1 local update:** +``` +① free phase: z* = relax(F, x_in, T1) # to the fixed point +② nudged ±: z₊ = relax( F − β·∂C/∂z − corr , from z*, T2 ) # +β + z₋ = relax( F + β·∂C/∂z − corr , from z*, T2 ) # −β (centered EP) +③ adjoint: a = (z₋ − z₊) / (2β) # read from the two nudged equilibria +④ local update: + • equilibrium params (W_Q,W_K,W_V,W_O, W_m, embeddings tok/pos): + ∇_θ L = ⟨ a , ∂F/∂θ(z*) ⟩ # vector-field EP estimator — one formula for attn+FFN+embed + (code: autograd.grad( (a·F(z*,θ)).sum(), θ ) , θ live, z* fixed) + • readout W_h (outside the equilibrium): direct local gradient ∂C/∂W_h |_{z*} +``` +- Why attention (non-conservative) and FFN (conservative) train under the *same* estimator: + `⟨a, ∂F/∂θ⟩` is uniform over all equilibrium params; the AEP correction only modifies `a` (making it + the correct adjoint for the non-reciprocal system); the FFN's `∂F/∂θ` is already correct. +- Embeddings `tok/pos` enter the force through the clamp `½‖z−x_in‖²` (`∂F/∂x_in = +I`), so the same + `⟨a, ∂F/∂θ⟩` yields their gradient. +- **Stability (feedback regulation, from FRE-RNN 2508.11659):** each step measure the free-phase + residual `res = ‖relax(z*,1)−z*‖/‖z*‖`; if `res` is too large, **increase damping `c`** (lower the + spectral radius → keep converging); if very small, relax `c`. This maintains the "free phase has + converged" condition (Ernoult 2019: EP ≡ BPTT in the β→0, converged limit) throughout training. + +**Measured (this implementation):** end-to-end EP trains a char-LM to **val CE 2.95** (random 4.17, +backprop on the same architecture 2.10), with **zero non-finite steps** under feedback regulation. + +--- + +## One-line summary +> **One energy/force, one relaxation, one local estimator.** FFN = conservative Hopfield energy → +> *standard EP* (exact). Attention = a *non-reciprocal force*; AEP turns the nudged-phase `J` into `Jᵀ` +> via two probes (jvp/vjp) → exact gradient. Since the full force's antisymmetric part comes only from +> attention, **one AEP correction + standard EP train the whole block end-to-end**; the readout trains +> directly on the cost; damping + feedback-regulation keep the system convergent. + +## Hardware-relevant primitives +- **local, no weight transport**: every weight updates from locally-available equilibrium states. +- **compute = relaxation to a fixed point**: maps to oscillators / memristor crossbars / optics / Ising. +- **two phases, same circuit**: free + nudged differ only by a small output nudge β. +- **non-reciprocal coupling OK (a feature)**: AEP handles asymmetric `J`; `jvp`/`vjp` = two probe directions. +- **dissipation `c` is a physical knob**: feedback-regulated to keep the system in the convergent regime. |
