| author | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
|---|---|---|
| committer | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
| commit | b83947778e2c776f757a07d4719b7ce961d7ed55 (patch) | |
| tree | b9cc01d7adda691d9156d9d04f4fb2f644674e96 /docs/hardware/HW_RESEARCH_FINDINGS.md | |
Initial commit: ept — backprop-free equilibrium transformer (EP)
Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
Diffstat (limited to 'docs/hardware/HW_RESEARCH_FINDINGS.md')
| -rw-r--r-- | docs/hardware/HW_RESEARCH_FINDINGS.md | 98 |
1 files changed, 98 insertions, 0 deletions
diff --git a/docs/hardware/HW_RESEARCH_FINDINGS.md b/docs/hardware/HW_RESEARCH_FINDINGS.md
new file mode 100644
index 0000000..ea5a6b8
--- /dev/null
+++ b/docs/hardware/HW_RESEARCH_FINDINGS.md
@@ -0,0 +1,98 @@
+# Analog-hardware substrate research — findings (2026-06-21)
+
+Deep-research run (108 agents, 25 sources, 118 claims → 22 adversarially-verified 3-0/2-1).
+Raw verified claims + source URLs + quotes: `hw_research_claims.json`. Synthesis below is mine.
+(The run's auto-synthesis step died on a mid-run /login 401; no DATA lost — all 22 verified claims recovered.)
+
+## THE decisive split confirmed: TRAINABLE-but-small vs LARGE-but-fixed
+The single most important filter — does the substrate support **in-situ weight update** (EP needs it) — cleanly partitions the market:
+
+### LARGE but FIXED-WEIGHT (inference-only — fail EP's in-situ filter as-is)
+- **Mythic M1076** (analog flash CIM): **80M weights/chip**, eval boards / M.2 / PCIe cards exist. BUT explicitly **inference-only** — train off-device, program once. [mythic.ai, 3-0]
+- **IBM HERMES** (PCM, 14nm, 64×256×256 = **4.2M weights**, mixed-signal): research chip, **inference-only**, weights programmed once via hardware-aware training. [Nature Electronics 2023, 3-0]
+- **MRAM / PCM crossbars** generally: program-once, fixed during inference; authors state in-situ training "increases energy + degrades device lifespan" → why the whole field avoids it. [Science, NCBI, 3-0]
+- → These give SCALE (the tens-of-M you want) but can't do EP's repeated local updates without re-flashing.
+
+### TRAINABLE in-situ (small, but the EP-correct regime)
+- **Bulk-switching memristor CIM module** (arXiv 2305.14547): experimentally implements **on-chip mixed-precision TRAINING** with in-situ VMM. KEY mechanism: **digital high-precision update accumulation, physically program the memristor only when accumulated Δw exceeds a threshold** — exactly the hybrid scheme that limits write/endurance stress in an EP loop. [3-0] ← **this is the template for our update path.**
+- **In-situ training demonstrated** on memristor crossbars for MLP/CNN/LSTM/RL — local in-array updates during a learning loop are physically real. [arXiv, 3-0]
+- Constraints to design around: limited NVM **endurance**, **asymmetric/nonlinear** weight update, variability, retention, stuck-at-faults. Compensation methods exist (stochastic rounding etc.). [escholarship, 2-1/3-0]
+
+## EP / equilibrium learning ALREADY physically realized (precedent exists!)
+- **PNAS — self-learning analog resistor network** (Coupled Learning, EP-cousin): XOR + nonlinear regression learned **fully in-situ, NO computer, NO backprop**. Weights = transistor gate-voltage on a local 22µF cap, updated by on-edge circuitry from the **local free-vs-clamped difference**. Forward = physical settling, **τ≈1µs**; learning on 18ms timescale. [PNAS, 3-0] ← **proof the whole concept works in COTS-buildable analog.**
+- **EP on D-Wave** (quantum annealer Ising machine): the physical machine does both free + nudge relaxation to steady state (settling is physical). Learning rule is **local** (updates from the two equilibrium states, no backprop). Caveat (1-1): weights live on the classical computer; only couplings loaded per phase → hybrid, not fully in-situ. [Nature, 3-0 on the local-rule claim]
+- → EP/local learning on physical equilibrium hardware is **demonstrated**, not speculative. Our contribution would be doing it for a TRANSFORMER block at scale.
+
+## Softmax/attention in analog (the hard part)
+- Confirmed open challenge: Transformers need frequent Q/K/V updates, which **conflicts with crossbars' weakness at reprogramming** — flagged as an open HW problem. [arXiv, 3-0]
+- (The energy/Hopfield-attention analog-native route verification was among the 3 claims killed by the 401 — needs a re-run. The pragmatic mixed-signal answer — softmax/LN/GELU in FPGA, linear+relaxation in analog — was the framing, not contradicted.)
+
+## BOTTOM LINE for our build (synthesis)
+The market splits exactly as feared: **you cannot buy one module that is both tens-of-M AND in-situ-trainable.** So:
+- **Phase 1 (trainable, small) — DO THIS FIRST.** Stitch a **bulk-switching/memristor CIM eval module** (in-situ, threshold-accumulated update) + an **FPGA** (softmax/LN/GELU + the EP control loop: settle→nudge→settle→local Δθ). Prove ONE equilibrium-transformer block trains end-to-end via EP in analog. The PNAS resistor-network + the memristor-training paper together show every piece is real.
+- **Phase 2 (scale) — LARGE-but-fixed used cleverly.** Use Mythic-80M / HERMES-class for the bulk fixed linear MVM (the relaxation forward), and keep ONLY the trainable/updated weights on the in-situ substrate, OR do mixed-signal "analog-forward, digital-accumulate, periodic-reflash" updates (the threshold-program trick) to tolerate their write limits.
+- **Update path = the crux.** Adopt the verified hybrid: **accumulate Δθ in digital high-precision, physically program the analog weight only when |Δθ|>threshold.** This is what makes EP survive endurance limits.
+- **De-risk in sim first (free):** the code's `--fnoise` already models multiplicative analog noise — sweep device noise / quantization / asymmetric-update in the 1B sim before buying anything.
+
+## Re-run #2 (2026-06-21, focused) — GAP 1 SOLVED, GAPs 2/3 still thin
+Raw: `hw_research_claims2.json`. 107 agents, clean run (no auth drop).
+
+### GAP 1 — analog attention: ANSWERED. It exists, across substrates, but all inference-only.
+- **Real fabricated silicon**: UCSD **65nm charge-based SRAM-CIM attention** chip (Moradifirouzabadi/Dodla/Kang, arXiv 2409.04940, ESSERC 2024) — first charge-based analog CIM in SRAM for transformers, **measured 14.8 TOPS/W**, 9-T bitcell does Q·Kᵀ via capacitor charge-sharing. [high]
+- **Jülich gain-cell in-memory attention** (Leroux et al., arXiv 2409.19315, Nature Comp Sci 2025): charge-on-capacitor, **~70,000× energy / ~100× speed vs GPU** (simulated). [high]
+- Memristor: Nature Sci Reports 2024 self-attention accel (128×128, 2-bit); **STAR RRAM softmax engine** (arXiv 2401.17582). Photonic: TFLN-microring softmax PROPOSAL (arXiv 2603.12934, not fabricated). [high/med]
+- **Softmax IS analog-realizable in principle**: a subthreshold source-coupled differential-pair / WTA network computes normalized-exp **"for free" via KCL at the shared tail node** (translinear). [high] — so an energy/LSE-attention analog route is physically grounded.
+- **BUT GAP 1(c) CONFIRMED**: real prototypes **overwhelmingly use the mixed-signal split we proposed** — softmax/LN/normalization in DIGITAL/LUT/FPGA, only the linear maps + dot-products in analog. So our architecture choice is the validated one. [high]
+- **EVERY analog-attention implementation found is INFERENCE-ONLY / fixed-weight** (Jülich uses offline HW-aware init + offline backprop fine-tune before deploy). Reinforces: nobody has done in-situ-trained analog attention → that IS our novel contribution. [high]
+- Noise budget datapoint: a variation-aware memristor-ViT sim tolerates **~35% compute error + ~10% conductance variation** while matching digital Top-1 (MDPI Electronics 2026) — encouraging for the `--fnoise` de-risk. [med]
+- Caveat (Sillman, arXiv 2305.13649): an analog softmax block only pays off INSIDE a fully-analog system; isolating it behind ADC/DAC dwarfs the saving → keep softmax digital UNLESS going fully analog. [med]
+
+### GAP 2 (buy-now SKUs) + GAP 3 (endurance/ECRAM) — STILL OPEN
+The re-run did NOT substantively verify these (its own summary says so). The one product claim (Knowm $800 kit) was REFUTED/split. So procurement (TetraMem/Mythic/Anadigm/Aspinity SKU+price+order-today) and the **make-or-break endurance budget (RRAM/PCM/FeFET/Flash vs ECRAM writes-to-failure)** remain genuinely unanswered. Indirect signal only: NVM rejected for KV-cache because of slow/high-energy/low-endurance writes; gain-cells chosen for endurance.
+
+### Still to pin (3rd focused pass — procurement + endurance ONLY)
+1. SKU-level buy-now: TetraMem MX100, Mythic dev kit, Anadigm AN231E04 board, Aspinity AML100, any RRAM eval kit — orderable today? price? (deep-research struggles here — may need vendor sites / direct contact, not web search.)
+2. Per-device **write endurance**: RRAM/PCM/FeFET/Flash/**ECRAM** cycles-to-failure; is ECRAM the symmetric-update + endurance fix, and is it available outside research labs? (Likely research-only — flag if so.)
+3. With digital-accumulate-then-threshold-program, how many physical writes does a ~30k-step EP run actually incur, vs device endurance?
+
+## UIUC ECE collaboration map (2026-06-21, user-directed — the hardware-side gap)
+User's reachable hardware groups (ALL ECE — this is the team's missing layer). The key insight: the
+three UIUC ECE groups span EXACTLY the three layers an in-situ-EP analog demo needs, and together they
+SOLVE the market's fatal gap (you can't BUY an in-situ-trainable analog array — but you can fab one in-house):
+- **Wenjuan Zhu (UIUC ECE) = DEVICE layer** [user-confirmed]: memristor/RRAM/FeFET / 2D-material devices.
+  This is the in-situ-trainable substrate that is research-only on the market — her group can FABRICATE it.
+- **Naresh Shanbhag (UIUC ECE) = CIRCUIT/ARCH layer**: SRAM in-memory compute (DIMA/C3SRAM line) — the analog MVM.
+- **Pavan Hanumolu (UIUC ECE) = MIXED-SIGNAL GLUE**: ADC/DAC, PLL, switched-cap — the converters + analog
+  integrator for the relaxation/control loop (settle→nudge→settle→local Δθ).
+- Tao Chen (USTC) = hardware but NOT EP; Stanford = student can broker intros (Wong RRAM / Murmann-legacy CIM / etc.).
+STRATEGY SHIFT: not "buy a board" — it's in-house fab of the trainable substrate (Zhu) + CIM circuit (Shanbhag)
++ converter glue (Hanumolu) + our FPGA/EP control loop. Sourcing deep-research w1kuw4zmz profiling all + industry;
+its "Zhu" angle mis-targets ML-accel (wrong layer) — corrected to Wenjuan-Zhu-device here; will run a focused
+pass on her device work + merge.
+
+### Sourcing run RESULTS (w1kuw4zmz, 11 high-conf named-paper findings; raw: hw_groups_claims.json)
+**HEADLINE: Shanbhag (UIUC) is the closest match of ANY named group — and it's the ONLY group silicon that already does analog MVM + genuine on-chip in-situ weight update.**
+- **Shanbhag DIMA chip** (Gonugondla/Kang/Shanbhag, **JSSC 2018**, "Variation-Tolerant In-Memory ML Classifier via On-Chip Training"): 65nm, 16kB 6T-SRAM, analog MVM via "functional read" + charge-sharing, AND a **dedicated on-chip digital trainer doing SGD + writing weights back to the array each batch** (random→within 1% of FP in ~400 batches). In-situ training gave **2.4× energy cut** at iso-accuracy. ⇒ the two EP-critical mechanisms (analog MVM + on-chip weight write-back) ALREADY in one fabricated UIUC chip. CAVEAT: it's a single-layer SVM, batch-SGD — **no relaxation/settling loop, no multilayer credit assignment**. We'd add the equilibrium dynamics + two-phase EP rule on top. [high]
+- Shanbhag also: **C3SRAM** (w/ ASU, JSSC 2020) — capacitive-coupling XNOR-MAC, but inference-only binary.
+- **Hanumolu** — (the run under-covered him; he's the ADC/DAC+integrator glue, still the right converter-layer partner; pin specific silicon separately.)
+- **"Zhu"/Lin correction CONFIRMED**: Yingyan **Celine Lin** is UIUC-PhD, now **Georgia Tech** (not current UIUC faculty, surname Lin not Zhu), digital-accel/co-design, **no analog/RRAM/in-situ silicon**. ⇒ the device-layer partner is **Wenjuan Zhu** (user-confirmed), NOT Lin.
+- **USTC Tao Chen (陈涛)**: device/materials "in-materio" reservoir computing (disordered dopant-atom networks in Si, Nature 2020). Real device work but NOT circuit-CIM, NOT EP — confirms user's "no EP hardware." Possible device-physics collaborator, not a demo host.
+- **Stanford NeuRRAM** (Wong + Raina + UCSD Cauwenberghs, **Nature 2022**): 48-core, ~3M-cell RRAM analog-CIM — the most EP-relevant *substrate* (analog MVM at scale), **BUT INFERENCE-ONLY** (weights programmed offline; only chip-in-the-loop forward fine-tune). Gives the MVM primitive, not native in-situ learning.
+  - **Wong = H.-S. Philip Wong (Hon-Sum Philip Wong, 黄汉森)**: Stanford EE, Willard R. & Inez Kerr Bell Professor; also **TSMC Chief Scientist**. RRAM/memristor, 3D monolithic integration, in-memory computing; co-author of NeuRRAM + the canonical "Memristive devices for computing" review (Nature Nanotech 2013). THE RRAM-device heavyweight for the trainable-substrate conversation (reach via the Stanford student contact). TSMC tie = a path to real foundry RRAM.
+- **Industry = all inference-only**: TetraMem **MX100** (Nature Electronics 2025; 10 cores, 248×256 1T1R RRAM+RISC-V) ships real silicon but **inference-only** (no in-situ update). Mythic/EnCharge class same.
+- **DIY in-situ test-chip path**: **SkyWater S130 + Weebit Nano 256Kb ReRAM IP** (JEDEC/AEC-Q100 qualified 2023, open SKY130 PDK) = foundry RRAM access for an MPW — a lab can fab its OWN trainable RRAM array. [med]
+- **EP-on-hardware = still only SPICE sim**: "Memristor Crossbar Circuits Implementing Equilibrium Propagation" (Oh et al., Kookmin U) is circuit simulation, NOT silicon. ⇒ **no fabricated EP-transformer hardware exists anywhere — the demo is genuinely novel.** [high]
+
+**BOTTOM LINE FOR THE PITCH:** lead with **Shanbhag** (his JSSC-2018 chip already proves analog-MVM + on-chip-training in one die — the nearest substrate; we add relaxation + EP) + **Wenjuan Zhu** (trainable device) + **Hanumolu** (converter glue) = a complete in-house UIUC-ECE stack. Stanford **Wong** as the RRAM-device escalation (via the student). Industry (TetraMem/Mythic) only useful for the fixed-weight Phase-2 forward path. Nobody has built EP-transformer hardware → first-mover.
+
+### Hanumolu profile (targeted, 2026-06-21)
+**Pavan Kumar Hanumolu** — Seendripu Family Professor, UIUC ECE (since 2013; prior Oregon State); member of CSL's Integrated Circuits & Systems Group. "Top-five mixed-signal IC researchers worldwide," NSF CAREER 2010, heavy JSSC/ISSCC record. Work: **energy-efficient analog/mixed-signal — time-based ADCs, continuous-time filters, ultralow-jitter clocking/PLLs, high-speed serial links, switched-cap, DC-DC power conversion.** ⇒ exactly the **converter + analog-integrator + feedback-loop** layer the EP control loop needs (ADC/DAC glue to read settled states + apply the nudge; switched-cap integrators ARE relaxation-dynamics primitives). Note: his published silicon is converters/links/clocking, NOT CIM — he's the glue/control-loop partner, not the MVM substrate. (Also co-founded Omni Design Technologies — converter IP.) [sources: ece.illinois.edu/.../hanumolu, icsg.csl.illinois.edu]
+
+### GAP 3 — endurance budget: the make-or-break number, and it CLEARS the bar
+The feasibility question: an EP run does ~tens-of-thousands of update STEPS; with digital-accumulate-then-threshold-program, physical device writes are FEWER than steps. How many cycles do devices survive?
+- **HfOx RRAM: up to ~10^10 cycles** endurance (best-in-class metal-oxide). [arxiv 1909.01771, IOP 10.1088/1361-6641/abf29d]
+- AlOx / weaker oxides: only ~10^4 — material choice matters a lot.
+- **Budget check**: MNIST-class training writes ~10^4 cycles; gradient training can scale to **~10^8** cycles. ⇒ **HfOx (10^10) has ~100× headroom even over a 10^8-write run** — endurance is NOT a blocker IF you use HfOx-class RRAM + the threshold-accumulate scheme (which cuts writes below step-count). [web-search snippets, med-high]
+- **Device-nudge insight**: an EP/Coupled-Learning *nudge* changes resistance far less than a full state write, so per-nudge endurance is plausibly >> rated full-write endurance (needs empirical confirmation, but favorable).
+- **ECRAM (electrochemical RAM)** = the symmetric/linear-analog-update + high-endurance technology specifically aimed at in-situ training: "open-loop analog programmable electrochemical memory array" (Nature Comms 2023, s41467-023-41958-4) — but **research-only** (not commercially available; lab/foundry fab). It's the device-physics frontier Wenjuan Zhu / Wong-type collaborators work in.
+- **VERDICT: endurance is survivable** with HfOx-class RRAM (10^10) + threshold-program; ECRAM is the better-but-research-only upgrade. The make-or-break risk is NOT endurance — it's **update linearity/symmetry + device variation** (the asymmetric-nonlinear-update problem), which the digital-accumulate scheme + compensation (stochastic rounding) mitigates. [from earlier run + this]
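
A reader's sketch of the digital-accumulate-then-threshold-program update path discussed in the findings, as a rough way to approach the open question of how many physical writes a ~30k-step EP run incurs. Everything here is an assumption for illustration: the function names `simulate_threshold_writes` and `endurance_headroom`, the Gaussian random increments standing in for real local EP updates, and the step-size and threshold values. Only the ~30k step count and the endurance figures (~10^10 for HfOx, ~10^4 for AlOx) come from the document. It is not the repository's update code.

```python
import random


def simulate_threshold_writes(n_weights, n_steps, step_std, threshold, seed=0):
    """Accumulate updates digitally; program a device only when |acc| >= threshold.

    The Gaussian increment is a hypothetical stand-in for a local EP update,
    not a model of real EP gradients.
    """
    rng = random.Random(seed)
    acc = [0.0] * n_weights      # digital high-precision accumulators
    device = [0.0] * n_weights   # analog weights as physically programmed
    writes = [0] * n_weights     # physical programming events per device
    for _ in range(n_steps):
        for i in range(n_weights):
            acc[i] += rng.gauss(0.0, step_std)
            if abs(acc[i]) >= threshold:
                device[i] += acc[i]  # one physical write carries the whole sum
                writes[i] += 1
                acc[i] = 0.0
    return device, acc, writes


def endurance_headroom(max_writes, endurance_cycles):
    """Ratio of rated cycles-to-failure to the worst-case per-device write count."""
    if max_writes == 0:
        return float("inf")
    return endurance_cycles / max_writes


if __name__ == "__main__":
    # Assumed toy settings: 8 weights, 30k steps, step_std and threshold arbitrary.
    _, _, writes = simulate_threshold_writes(8, 30_000, 1e-3, 1e-2)
    worst = max(writes)
    print("worst-case writes per device:", worst)
    print("HfOx headroom (1e10 cycles):", endurance_headroom(worst, 1e10))
    print("AlOx headroom (1e4 cycles):", endurance_headroom(worst, 1e4))
```

Because each device is programmed at most once per step, the per-device write count can never exceed the step count. Real EP updates are correlated rather than a random walk, so the actual write count would need to come from the simulator, not from this stand-in.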
