# Backprop-free analog training of a transformer: collaboration brief

**One-page ask for hardware-side collaborators · 2026-06-21 · Yuren Hao (UIUC)**

## The idea in three sentences

We train a **transformer block as a physical equilibrium (fixed-point) system** using **Equilibrium Propagation (EP)**, with no backpropagation. The forward pass is a damped relaxation `z ← z + ε·F(z)` that **settles** to a fixed point (on analog hardware, the settling *is* the physics, nearly free); the weight update is **local**, computed from the contrast between a free settle and a slightly-nudged settle. This is exactly the computation an analog in-memory / memristive array is good at, and unlike every shipping analog-AI chip (all inference-only), it needs **in-situ weight update**, which is the open opportunity.

## Why now / why it's real (not speculative)

- **Algorithm side (ours, in simulation):** EP's gradient matches true backprop (cosine ≈ 0.99–1.0 per component); the equilibrium transformer trains stably and **matches/beats a same-parameter BP transformer** on language modeling. Currently scaling the recipe; a fix for the one known instability (a residual-defense term) is under validation.
- **Hardware precedent exists:** local contrastive/EP learning has been physically demonstrated (self-learning analog resistor networks, ~1 µs settling, on-chip weight update from a local free-vs-clamped difference; EP on a D-Wave Ising machine). **But nobody has built an EP-trained *transformer* in analog hardware; that is the first-mover demo.**
- **Endurance clears the bar:** HfOx-class RRAM survives ~10^10 write cycles; a training run needs ≤10^8 device writes (fewer with digital-accumulate-then-threshold-program). Endurance is not the blocker; update linearity/symmetry is the real device challenge.

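
The settle, nudge, and local-contrast loop described above can be illustrated on a toy problem. The following is a minimal NumPy sketch under stated assumptions, not the equilibrium transformer or the project's simulator: it assumes a small tanh energy model `E(z) = ½|z|² − ½ρ(z)ᵀWρ(z) − ρ(z)ᵀUx` with symmetric `W`, a squared-error cost on the first `k` state components, and hypothetical names (`force`, `settle`, `ep_grad_U`, `fd_grad_U`) chosen only for illustration. It estimates the gradient for the input weights `U` from one free settle and one nudged settle, then compares it with a finite-difference gradient of the settled cost.

```python
import numpy as np

def force(z, W, U, x):
    """F(z) = -dE/dz for the assumed toy energy with rho = tanh."""
    r = np.tanh(z)
    return -z + (1.0 - r**2) * (W @ r + U @ x)

def settle(z, W, U, x, y, beta=0.0, eps=0.2, steps=500):
    """Damped relaxation z <- z + eps * (F(z) - beta * dC/dz)."""
    k = y.size
    z = z.copy()
    for _ in range(steps):
        nudge = np.zeros_like(z)
        nudge[:k] = z[:k] - y          # dC/dz for C = 0.5 * |z[:k] - y|^2
        z = z + eps * (force(z, W, U, x) - beta * nudge)
    return z

def cost(z, y):
    return 0.5 * np.sum((z[: y.size] - y) ** 2)

def ep_grad_U(W, U, x, y, beta=1e-3):
    """EP estimate: (dE/dU at nudged state - dE/dU at free state) / beta."""
    z_free = settle(np.zeros(W.shape[0]), W, U, x, y)
    z_nudged = settle(z_free, W, U, x, y, beta=beta)  # warm start from free state
    # dE/dU = -rho(z) x^T, so the update needs only local activity and input.
    return -np.outer(np.tanh(z_nudged) - np.tanh(z_free), x) / beta

def fd_grad_U(W, U, x, y, h=1e-5):
    """Reference gradient of the settled cost by central finite differences."""
    g = np.zeros_like(U)
    z0 = np.zeros(W.shape[0])
    for idx in np.ndindex(*U.shape):
        U_plus, U_minus = U.copy(), U.copy()
        U_plus[idx] += h
        U_minus[idx] -= h
        c_plus = cost(settle(z0, W, U_plus, x, y), y)
        c_minus = cost(settle(z0, W, U_minus, x, y), y)
        g[idx] = (c_plus - c_minus) / (2 * h)
    return g

# Toy sizes and scales are arbitrary illustration values.
rng = np.random.default_rng(0)
n, m, k = 6, 3, 2
A = rng.normal(size=(n, n))
W = 0.1 * (A + A.T) / 2               # symmetric and small so the settle converges
U = 0.3 * rng.normal(size=(n, m))
x = rng.normal(size=m)
y = rng.normal(size=k)

g_ep = ep_grad_U(W, U, x, y)
g_fd = fd_grad_U(W, U, x, y)
cosine = float(np.sum(g_ep * g_fd) / (np.linalg.norm(g_ep) * np.linalg.norm(g_fd)))
print(f"cosine(EP, finite-difference) = {cosine:.4f}")
```

In this toy setting the printed cosine should be close to 1 for small `beta`, which is the same kind of check as the gradient-agreement claim above. On hardware, the two `settle` calls would be physical relaxations and only the contrast in `ep_grad_U` would need to be computed and written back.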
## What a hardware demo needs (three layers), and the UIUC ECE fit

| Layer | What it does | Closest collaborator |
|---|---|---|
| **Trainable device** | in-situ-updatable analog weights (RRAM/FeFET/ECRAM); *the part you cannot buy* | **Wenjuan Zhu** (UIUC ECE, memristor/RRAM/FeFET/2D devices) |
| **In-memory MVM circuit** | analog matrix-vector multiply + on-chip weight write-back | **Naresh Shanbhag** (UIUC ECE): his JSSC-2018 DIMA chip *already* does analog MVM **+ on-chip SGD weight write-back** in 65nm; nearest existing substrate |
| **Mixed-signal glue / control loop** | ADC/DAC to read settled states + apply the nudge; switched-cap integrators = relaxation primitives | **Pavan Hanumolu** (UIUC ECE, data converters / PLL / switched-cap) |
| **EP control + sim** | the settle→nudge→settle→local-Δθ loop, noise/endurance de-risk in simulation | **us** (FPGA + the trained model + analog-noise sim already built) |

**Escalation / device frontier:** **H.-S. Philip Wong (黄汉森, Stanford EE / TSMC Chief Scientist)**: NeuRRAM (Nature 2022) is the most EP-relevant analog-MVM substrate (inference-only today); the RRAM-device heavyweight + a TSMC-foundry path, reachable via a Stanford student contact.

## The concrete ask (staged, modular: stitch existing capabilities, no startup-scale custom fab)

- **Phase 1:** put ONE equilibrium-transformer block on an existing in-situ-trainable substrate (Shanbhag's DIMA-class chip + Hanumolu converter/integrator glue; Zhu devices) + our FPGA EP-control loop → prove end-to-end analog EP training.
- **Phase 2:** scale weights (foundry RRAM MPW, e.g. SkyWater S130 + Weebit ReRAM IP, or a fixed-weight inference array for the forward path with the trainable layer in-situ).
- **What we bring:** the validated algorithm, the trained model + scaling data, the EP control logic, and a simulator that already models analog non-idealities (device noise / quantization / asymmetric update) to de-risk before tape-out.

**Bottom line:** the science is done in sim and the hardware pieces all exist in-house at UIUC ECE: this is a stitching + first-demo opportunity, not a multi-year custom-silicon program.

*(Backing detail + citations: HW_RESEARCH_FINDINGS.md; method: ept_method_intro.pdf)*