Initial commit: ept — backprop-free equilibrium transformer (EP)

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
author: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
committer: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
commit: b83947778e2c776f757a07d4719b7ce961d7ed55 (patch)
tree: b9cc01d7adda691d9156d9d04f4fb2f644674e96 /docs/hardware/HW_RESEARCH_FINDINGS.md
1 files changed, 98 insertions, 0 deletions
diff --git a/docs/hardware/HW_RESEARCH_FINDINGS.md b/docs/hardware/HW_RESEARCH_FINDINGS.md
new file mode 100644
index 0000000..ea5a6b8
--- /dev/null
+++ b/docs/hardware/HW_RESEARCH_FINDINGS.md
@@ -0,0 +1,98 @@
+# Analog-hardware substrate research — findings (2026-06-21)
+
+Deep-research run (108 agents, 25 sources, 118 claims → 22 adversarially-verified 3-0/2-1).
+Raw verified claims + source URLs + quotes: `hw_research_claims.json`. Synthesis below is mine.
+(The run's auto-synthesis step died on a mid-run /login 401; no DATA lost — all 22 verified claims recovered.)
+
+## THE decisive split confirmed: TRAINABLE-but-small vs LARGE-but-fixed
+The single most important filter — does the substrate support **in-situ weight update** (EP needs it) — cleanly partitions the market:
+
+### LARGE but FIXED-WEIGHT (inference-only — fail EP's in-situ filter as-is)
+- **Mythic M1076** (analog flash CIM): **80M weights/chip**, eval boards / M.2 / PCIe cards exist. BUT explicitly **inference-only** — train off-device, program once. [mythic.ai, 3-0]
+- **IBM HERMES** (PCM, 14nm, 64×256×256 = **4.2M weights**, mixed-signal): research chip, **inference-only**, weights programmed once via hardware-aware training. [Nature Electronics 2023, 3-0]
+- **MRAM / PCM crossbars** generally: program-once, fixed during inference; authors state in-situ training "increases energy + degrades device lifespan" → why the whole field avoids it. [Science, NCBI, 3-0]
+- → These give SCALE (the tens-of-M you want) but can't do EP's repeated local updates without re-flashing.
+
+### TRAINABLE in-situ (small, but the EP-correct regime)
+- **Bulk-switching memristor CIM module** (arXiv 2305.14547): experimentally implements **on-chip mixed-precision TRAINING** with in-situ VMM. KEY mechanism: **digital high-precision update accumulation, physically program the memristor only when accumulated Δw exceeds a threshold** — exactly the hybrid scheme that limits write/endurance stress in an EP loop. [3-0] ← **this is the template for our update path.**
+- **In-situ training demonstrated** on memristor crossbars for MLP/CNN/LSTM/RL — local in-array updates during a learning loop are physically real. [arXiv, 3-0]
+- Constraints to design around: limited NVM **endurance**, **asymmetric/nonlinear** weight update, variability, retention, stuck-at-faults. Compensation methods exist (stochastic rounding etc.). [escholarship, 2-1/3-0]
+
+## EP / equilibrium learning ALREADY physically realized (precedent exists!)
+- **PNAS — self-learning analog resistor network** (Coupled Learning, EP-cousin): XOR + nonlinear regression learned **fully in-situ, NO computer, NO backprop**. Weights = transistor gate-voltage on a local 22µF cap, updated by on-edge circuitry from the **local free-vs-clamped difference**. Forward = physical settling, **τ≈1µs**; learning on 18ms timescale. [PNAS, 3-0] ← **proof the whole concept works in COTS-buildable analog.**
+- **EP on D-Wave** (quantum annealer Ising machine): the physical machine does both free + nudge relaxation to steady state (settling is physical). Learning rule is **local** (updates from the two equilibrium states, no backprop). Caveat (1-1): weights live on the classical computer; only couplings loaded per phase → hybrid, not fully in-situ. [Nature, 3-0 on the local-rule claim]
+- → EP/local learning on physical equilibrium hardware is **demonstrated**, not speculative. Our contribution would be doing it for a TRANSFORMER block at scale.
+
+## Softmax/attention in analog (the hard part)
+- Confirmed open challenge: Transformers need frequent Q/K/V updates, which **conflicts with crossbars' weakness at reprogramming** — flagged as an open HW problem. [arXiv, 3-0]
+- (The energy/Hopfield-attention analog-native route verification was among the 3 claims killed by the 401 — needs a re-run. The pragmatic mixed-signal answer — softmax/LN/GELU in FPGA, linear+relaxation in analog — was the framing, not contradicted.)
+
+## BOTTOM LINE for our build (synthesis)
+The market splits exactly as feared: **you cannot buy one module that is both tens-of-M AND in-situ-trainable.** So:
+- **Phase 1 (trainable, small) — DO THIS FIRST.** Stitch a **bulk-switching/memristor CIM eval module** (in-situ, threshold-accumulated update) + an **FPGA** (softmax/LN/GELU + the EP control loop: settle→nudge→settle→local Δθ). Prove ONE equilibrium-transformer block trains end-to-end via EP in analog. The PNAS resistor-network + the memristor-training paper together show every piece is real.
+- **Phase 2 (scale) — LARGE-but-fixed used cleverly.** Use Mythic-80M / HERMES-class for the bulk fixed linear MVM (the relaxation forward), and keep ONLY the trainable/updated weights on the in-situ substrate, OR do mixed-signal "analog-forward, digital-accumulate, periodic-reflash" updates (the threshold-program trick) to tolerate their write limits.
+- **Update path = the crux.** Adopt the verified hybrid: **accumulate Δθ in digital high-precision, physically program the analog weight only when |Δθ|>threshold.** This is what makes EP survive endurance limits.
+- **De-risk in sim first (free):** the code's `--fnoise` already models multiplicative analog noise — sweep device noise / quantization / asymmetric-update in the 1B sim before buying anything.
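+
+A minimal sketch of the update path above in equations, assuming the standard two-phase EP estimator (free phase at nudge strength 0, nudged phase at strength β) and a reset-to-zero accumulator. The symbols E, s, β, η, a, τ, w and the function Q (rounding to the nearest programmable device level) are illustrative notation, not taken from `ep_run/` or from arXiv 2305.14547:
+
+```latex
+% Local EP step from the two settled states s^0 (free) and s^\beta (nudged):
+\Delta\theta_t = -\frac{\eta}{\beta}\left[
+    \frac{\partial E}{\partial\theta}\big(s^{\beta},\theta\big)
+  - \frac{\partial E}{\partial\theta}\big(s^{0},\theta\big)\right]
+
+% Digital high-precision accumulation (no device write):
+a_t = a_{t-1} + \Delta\theta_t
+
+% Physical programming of the analog weight w only when the accumulator crosses \tau:
+\text{if } |a_t| > \tau:\quad w \leftarrow w + Q(a_t), \qquad a_t \leftarrow 0
+```
+
+Resetting the accumulator to zero discards the rounding residual a_t - Q(a_t). Carrying the residual forward instead is an equally plausible variant; which one the hardware paper uses should be checked against the source.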
+
+## Re-run #2 (2026-06-21, focused) — GAP 1 SOLVED, GAPs 2/3 still thin
+Raw: `hw_research_claims2.json`. 107 agents, clean run (no auth drop).
+
+### GAP 1 — analog attention: ANSWERED. It exists, across substrates, but all inference-only.
+- **Real fabricated silicon**: UCSD **65nm charge-based SRAM-CIM attention** chip (Moradifirouzabadi/Dodla/Kang, arXiv 2409.04940, ESSERC 2024) — first charge-based analog CIM in SRAM for transformers, **measured 14.8 TOPS/W**, 9-T bitcell does Q·Kᵀ via capacitor charge-sharing. [high]
+- **Jülich gain-cell in-memory attention** (Leroux et al., arXiv 2409.19315, Nature Comp Sci 2025): charge-on-capacitor, **~70,000× lower energy / ~100× faster than GPU** (simulated). [high]
+- Memristor: Nature Sci Reports 2024 self-attention accel (128×128, 2-bit); **STAR RRAM softmax engine** (arXiv 2401.17582). Photonic: TFLN-microring softmax PROPOSAL (arXiv 2603.12934, not fabricated). [high/med]
+- **Softmax IS analog-realizable in principle**: a subthreshold source-coupled differential-pair / WTA network computes normalized-exp **"for free" via KCL at the shared tail node** (translinear). [high] — so an energy/LSE-attention analog route is physically grounded.
+- **BUT GAP 1(c) CONFIRMED**: real prototypes **overwhelmingly use the mixed-signal split we proposed** — softmax/LN/normalization in DIGITAL/LUT/FPGA, only the linear maps + dot-products in analog. So our architecture choice is the validated one. [high]
+- **EVERY analog-attention implementation found is INFERENCE-ONLY / fixed-weight** (Jülich uses offline HW-aware init + offline backprop fine-tune before deploy). Reinforces: nobody has done in-situ-trained analog attention → that IS our novel contribution. [high]
+- Noise budget datapoint: a variation-aware memristor-ViT sim tolerates **~35% compute error + ~10% conductance variation** while matching digital Top-1 (MDPI Electronics 2026) — encouraging for the `--fnoise` de-risk. [med]
+- Caveat (Sillman, arXiv 2305.13649): an analog softmax block only pays off INSIDE a fully-analog system; isolated behind ADC/DAC, the converter overhead dwarfs the saving → keep softmax digital UNLESS going fully analog. [med]
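+
+The "for free via KCL" claim can be written out under the usual subthreshold MOS model. This is a textbook-style sketch, not taken from any of the cited chips. It assumes all branch transistors are in saturation, and I_0, κ, U_T, the shared source voltage V_s and the tail bias current I_b are assumed symbols:
+
+```latex
+% Subthreshold drain current of branch i (gate input V_i, shared source node V_s):
+I_i = I_0 \exp\!\left(\frac{\kappa V_i - V_s}{U_T}\right)
+
+% KCL at the shared tail node fixes the sum of the branch currents:
+\sum_j I_j = I_b
+\;\Rightarrow\;
+I_0\, e^{-V_s/U_T} = \frac{I_b}{\sum_j \exp(\kappa V_j / U_T)}
+
+% Substituting back, the common factor cancels and each branch carries a softmax share:
+I_i = I_b \, \frac{\exp(\kappa V_i / U_T)}{\sum_j \exp(\kappa V_j / U_T)}
+```
+
+The logits are κV_i/U_T, so the effective softmax temperature is U_T/κ and the usable input voltage range sets how sharp the normalized-exp output can be.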
+
+### GAP 2 (buy-now SKUs) + GAP 3 (endurance/ECRAM) — STILL OPEN
+The re-run did NOT substantively verify these (its own summary says so). The one product claim (Knowm $800 kit) was REFUTED/split. So procurement (TetraMem/Mythic/Anadigm/Aspinity SKU+price+order-today) and the **make-or-break endurance budget (RRAM/PCM/FeFET/Flash vs ECRAM writes-to-failure)** remain genuinely unanswered. Indirect signal only: NVM rejected for KV-cache because of slow/high-energy/low-endurance writes; gain-cells chosen for endurance.
+
+### Still to pin (3rd focused pass — procurement + endurance ONLY)
+1. SKU-level buy-now: TetraMem MX100, Mythic dev kit, Anadigm AN231E04 board, Aspinity AML100, any RRAM eval kit — orderable today? price? (deep-research struggles here — may need vendor sites / direct contact, not web search.)
+2. Per-device **write endurance**: RRAM/PCM/FeFET/Flash/**ECRAM** cycles-to-failure; is ECRAM the symmetric-update + endurance fix, and is it available outside research labs? (Likely research-only — flag if so.)
+3. With digital-accumulate-then-threshold-program, how many physical writes does a ~30k-step EP run actually incur, vs device endurance?
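+
+Item 3 has a simple upper bound that can be checked before any measurement. A sketch, assuming the accumulator resets to zero after each write and that Δθ_t is the per-step update of a single weight (N, τ, W and δ_max are illustrative symbols):
+
+```latex
+% Each physical write requires the accumulator to travel from 0 to beyond \tau,
+% so the updates between two consecutive writes satisfy \sum |\Delta\theta_t| > \tau.
+% Summing over all W writes of one weight in an N-step run:
+W \, \tau < \sum_{t=1}^{N} |\Delta\theta_t|
+\;\Rightarrow\;
+W < \frac{1}{\tau} \sum_{t=1}^{N} |\Delta\theta_t| \le \frac{N \, \delta_{\max}}{\tau},
+\qquad \delta_{\max} = \max_t |\Delta\theta_t|
+```
+
+With the ~30k-step run from item 3 and a purely hypothetical ratio δ_max/τ = 0.1, the bound gives W < 3,000 writes per weight; the real ratio has to come from the sim.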
+
+## UIUC ECE collaboration map (2026-06-21, user-directed — the hardware-side gap)
+User's reachable hardware groups (ALL ECE — this is the team's missing layer). The key insight: the
+three UIUC ECE groups span EXACTLY the three layers an in-situ-EP analog demo needs, and together they
+SOLVE the market's fatal gap (you can't BUY an in-situ-trainable analog array — but you can fab one in-house):
+- **Wenjuan Zhu (UIUC ECE) = DEVICE layer** [user-confirmed]: memristor/RRAM/FeFET / 2D-material devices.
+  This is the in-situ-trainable substrate that is research-only on the market — her group can FABRICATE it.
+- **Naresh Shanbhag (UIUC ECE) = CIRCUIT/ARCH layer**: SRAM in-memory compute (DIMA/C3SRAM line) — the analog MVM.
+- **Pavan Hanumolu (UIUC ECE) = MIXED-SIGNAL GLUE**: ADC/DAC, PLL, switched-cap — the converters + analog
+  integrator for the relaxation/control loop (settle→nudge→settle→local Δθ).
+- Tao Chen (USTC) = hardware but NOT EP; Stanford = student can broker intros (Wong RRAM / Murmann-legacy CIM / etc.).
+
+STRATEGY SHIFT: not "buy a board" but in-house fab of the trainable substrate (Zhu) + CIM circuit (Shanbhag) + converter glue (Hanumolu)
+and our FPGA/EP control loop. Sourcing deep-research run w1kuw4zmz is profiling all of these groups plus industry; its "Zhu" angle
+mis-targets ML-accel (wrong layer), corrected to Wenjuan Zhu (device) here; will run a focused pass on her device work and merge.
+
+### Sourcing run RESULTS (w1kuw4zmz, 11 high-conf named-paper findings; raw: hw_groups_claims.json)
+**HEADLINE: Shanbhag (UIUC) is the closest match of ANY named group, and his group's chip is the ONLY silicon found that already does analog MVM + genuine on-chip in-situ weight update.**
+- **Shanbhag DIMA chip** (Gonugondla/Kang/Shanbhag, **JSSC 2018**, "Variation-Tolerant In-Memory ML Classifier via On-Chip Training"): 65nm, 16kB 6T-SRAM, analog MVM via "functional read" + charge-sharing, AND a **dedicated on-chip digital trainer doing SGD + writing weights back to the array each batch** (random→within 1% of FP in ~400 batches). In-situ training gave **2.4× energy cut** at iso-accuracy. ⇒ the two EP-critical mechanisms (analog MVM + on-chip weight write-back) ALREADY in one fabricated UIUC chip. CAVEAT: it's a single-layer SVM, batch-SGD — **no relaxation/settling loop, no multilayer credit assignment**. We'd add the equilibrium dynamics + two-phase EP rule on top. [high]
+- Shanbhag also: **C3SRAM** (w/ ASU, JSSC 2020) — capacitive-coupling XNOR-MAC, but inference-only binary.
+- **Hanumolu** — (the run under-covered him; he's the ADC/DAC+integrator glue, still the right converter-layer partner; pin specific silicon separately.)
+- **"Zhu"/Lin correction CONFIRMED**: Yingyan **Celine Lin** is UIUC-PhD, now **Georgia Tech** (not current UIUC faculty, surname Lin not Zhu), digital-accel/co-design, **no analog/RRAM/in-situ silicon**. ⇒ the device-layer partner is **Wenjuan Zhu** (user-confirmed), NOT Lin.
+- **USTC Tao Chen (陈涛)**: device/materials "in-materio" reservoir computing (disordered dopant-atom networks in Si, Nature 2020). Real device work but NOT circuit-CIM, NOT EP — confirms user's "no EP hardware." Possible device-physics collaborator, not a demo host.
+- **Stanford NeuRRAM** (Wong + Raina + UCSD Cauwenberghs, **Nature 2022**): 48-core, ~3M-cell RRAM analog-CIM — the most EP-relevant *substrate* (analog MVM at scale), **BUT INFERENCE-ONLY** (weights programmed offline; only chip-in-the-loop forward fine-tune). Gives the MVM primitive, not native in-situ learning.
+  - **Wong = H.-S. Philip Wong (Hon-Sum Philip Wong, 黄汉森)**: Stanford EE, Willard R. & Inez Kerr Bell Professor; also **TSMC Chief Scientist**. RRAM/memristor, 3D monolithic integration, in-memory computing; co-author of NeuRRAM and lead author of the canonical "Metal-Oxide RRAM" review (Proc. IEEE 2012). THE RRAM-device heavyweight for the trainable-substrate conversation (reach via the Stanford student contact). TSMC tie = a path to real foundry RRAM.
+- **Industry = all inference-only**: TetraMem **MX100** (Nature Electronics 2025; 10 cores, 248×256 1T1R RRAM+RISC-V) ships real silicon but **inference-only** (no in-situ update). Mythic/EnCharge class same.
+- **DIY in-situ test-chip path**: **SkyWater S130 + Weebit Nano 256Kb ReRAM IP** (JEDEC/AEC-Q100 qualified 2023, open SKY130 PDK) = foundry RRAM access for an MPW — a lab can fab its OWN trainable RRAM array. [med]
+- **EP-on-hardware = still only SPICE sim**: "Memristor Crossbar Circuits Implementing Equilibrium Propagation" (Oh et al., Kookmin U) is circuit simulation, NOT silicon. ⇒ **no fabricated EP-transformer hardware exists anywhere — the demo is genuinely novel.** [high]
+
+**BOTTOM LINE FOR THE PITCH:** lead with **Shanbhag** (his JSSC-2018 chip already proves analog-MVM + on-chip-training in one die — the nearest substrate; we add relaxation + EP) + **Wenjuan Zhu** (trainable device) + **Hanumolu** (converter glue) = a complete in-house UIUC-ECE stack. Stanford **Wong** as the RRAM-device escalation (via the student). Industry (TetraMem/Mythic) only useful for the fixed-weight Phase-2 forward path. Nobody has built EP-transformer hardware → first-mover.
+
+### Hanumolu profile (targeted, 2026-06-21)
+**Pavan Kumar Hanumolu** — Seendripu Family Professor, UIUC ECE (since 2013; prior Oregon State); member of CSL's Integrated Circuits & Systems Group. "Top-five mixed-signal IC researchers worldwide," NSF CAREER 2010, heavy JSSC/ISSCC record. Work: **energy-efficient analog/mixed-signal — time-based ADCs, continuous-time filters, ultralow-jitter clocking/PLLs, high-speed serial links, switched-cap, DC-DC power conversion.** ⇒ exactly the **converter + analog-integrator + feedback-loop** layer the EP control loop needs (ADC/DAC glue to read settled states + apply the nudge; switched-cap integrators ARE relaxation-dynamics primitives). Note: his published silicon is converters/links/clocking, NOT CIM — he's the glue/control-loop partner, not the MVM substrate. (Also co-founded Omni Design Technologies — converter IP.) [sources: ece.illinois.edu/.../hanumolu, icsg.csl.illinois.edu]
+
+### GAP 3 — endurance budget: the make-or-break number, and it CLEARS the bar
+The feasibility question: an EP run does ~tens-of-thousands of update STEPS; with digital-accumulate-then-threshold-program, physical device writes are FEWER than steps. How many cycles do devices survive?
+- **HfOx RRAM: up to ~10^10 cycles** endurance (best-in-class metal-oxide). [arxiv 1909.01771, IOP 10.1088/1361-6641/abf29d]
+- AlOx / weaker oxides: only ~10^4 — material choice matters a lot.
+- **Budget check**: MNIST-class training writes ~10^4 cycles; gradient training can scale to **~10^8** cycles. ⇒ **HfOx (10^10) has ~100× headroom even over a 10^8-write run** — endurance is NOT a blocker IF you use HfOx-class RRAM + the threshold-accumulate scheme (which cuts writes below step-count). [web-search snippets, med-high]
+- **Device-nudge insight**: an EP/Coupled-Learning *nudge* changes resistance far less than a full state write, so per-nudge endurance is plausibly >> rated full-write endurance (needs empirical confirmation, but favorable). 
+- **ECRAM (electrochemical RAM)** = the symmetric/linear-analog-update + high-endurance technology specifically aimed at in-situ training: "open-loop analog programmable electrochemical memory array" (Nature Comms 2023, s41467-023-41958-4) — but **research-only** (not commercially available; lab/foundry fab). It's the device-physics frontier Wenjuan Zhu / Wong-type collaborators work in.
+- **VERDICT: endurance is survivable** with HfOx-class RRAM (10^10) + threshold-program; ECRAM is the better-but-research-only upgrade. The make-or-break risk is NOT endurance — it's **update linearity/symmetry + device variation** (the asymmetric-nonlinear-update problem), which the digital-accumulate scheme + compensation (stochastic rounding) mitigates. [from earlier run + this]
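+
+The headroom figures above, worked through. The endurance and write counts are the ones quoted in this section; the last line assumes the worst case of one physical write per EP step for the ~30k-step run mentioned earlier:
+
+```latex
+% Headroom = rated endurance / writes incurred
+\frac{10^{10}}{10^{8}} = 10^{2}
+  \quad (\text{HfO}_x \text{ vs. a } 10^{8}\text{-write gradient run})
+
+\frac{10^{10}}{10^{4}} = 10^{6}
+  \quad (\text{HfO}_x \text{ vs. MNIST-class training})
+
+\frac{10^{4}}{10^{8}} = 10^{-4}
+  \quad (\text{AlO}_x\text{-class oxide vs. the same } 10^{8}\text{-write run: fails})
+
+% Worst case for a ~30k-step EP run, one write per step, no thresholding:
+\frac{10^{10}}{3 \times 10^{4}} \approx 3.3 \times 10^{5}
+```
+
+Threshold programming can only lower the write count below the step count, so it widens these margins rather than narrowing them.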