| author | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
|---|---|---|
| committer | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
| commit | b83947778e2c776f757a07d4719b7ce961d7ed55 (patch) | |
| tree | b9cc01d7adda691d9156d9d04f4fb2f644674e96 /docs/hardware/SCALING_AND_HARDWARE_PLAN.md | |
Initial commit: ept — backprop-free equilibrium transformer (EP)
Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
Diffstat (limited to 'docs/hardware/SCALING_AND_HARDWARE_PLAN.md')
| -rw-r--r-- | docs/hardware/SCALING_AND_HARDWARE_PLAN.md | 58 |
1 files changed, 58 insertions, 0 deletions
diff --git a/docs/hardware/SCALING_AND_HARDWARE_PLAN.md b/docs/hardware/SCALING_AND_HARDWARE_PLAN.md
new file mode 100644
index 0000000..8e93a40
--- /dev/null
+++ b/docs/hardware/SCALING_AND_HARDWARE_PLAN.md
@@ -0,0 +1,58 @@
+# EP scaling-to-1B sim cost + modular analog-hardware plan (2026-06-21)
+
+## PART 1 — 1B simulation cost (H200)
+
+Anchor: your 1B **BP** run ≈ 36h × 8 H200 = **288 H200-hours**.
+
+Cost driver = the equilibrium block relaxes ~150–300 steps (T1=150 + refine to t1max=300) — this is the
+WHOLE cost, ~width-independent, so the per-step factor measured at C512 holds at 1B:
+- EP/BP per-step FLOPs ≈ **~50–100×** (relaxation is forward-only, 2ND each; BP is 6ND). Measured: C512 EP 0.23 it/s
+ vs a depth-1 C512 BP ~tens it/s ⇒ ~65–130×. **EP/BPTT ≈ only 1.5×** — the cost is the equilibrium, not EP.
+- Reducers: bf16 + torch.compile + the compiled free-phase fast-path (blk._cstep) ≈ 2–4×; refine exits early
+ while contractive (×~0.6); EP's low memory (no unrolled graph) packs bigger batches than BPTT. Net effective **~20–50×**.
+
+**Estimate: 1B EP ≈ 20–50 × 288 ≈ 6,000–15,000 H200-hours** ≈ $18k–60k @ ~$3–4/H200-hr (or burn AWS credits).
+Wall-clock: ~30–75 days on 8×H200, or **~4–9 days on 64×H200** (same H200-hours — scale OUT for wall-clock, cost is fixed).
+
+**Recommendation (don't burn 10k H200-hr blind):**
+1. Ship the **speed package (task #14)** FIRST — directly cuts the bill 2–4×.
+2. Validate the recipe + MEASURE the real EP/BP factor on a **~100–300M rung** (scaling-law dossier) before committing 1B.
+3. Then the 1B run. Use EP's memory advantage to pack batch; data-parallel across many H200s for wall-clock.
+
+## PART 2 — Modular analog-hardware plan (tens-of-M, COTS, NO custom fab)
+
+**Principle that makes this cheap:** the damped equilibrium block is a *physical settling system* (monDEQ ≈ passive
+resistor/op-amp circuit, Chaffey 2025). The free-phase relaxation = the hardware SETTLING — so the ~100× sim cost becomes
+**~free physical settle**, and EP's update is **local** (no backprop). The algorithm is DESIGNED for this substrate; the IP
+is substrate-agnostic, so demo on whatever COTS analog you can buy — never fab.
+
+**Decompose the block → what each part needs:**
+| part | hardware | note |
+|---|---|---|
+| linear maps WQ/K/V/O, fc, pj | analog MVM (crossbar) | bulk of params + compute |
+| relaxation z←z+εF(z) | physical feedback (RC + the −c·z damping resistor) | THE win — free settle |
+| nonlinearities: softmax, GELU, LN | the analog-HARD part | do mixed-signal in FPGA, OR use energy/Hopfield attention (analog-native) |
+| EP update (local) | needs **updatable** analog weights | the key constraint |
+
+**Substrate options (COTS/existing; key tradeoff = weight updatability — VERIFY current availability/specs):**
+- **(A) Memristor/ReRAM crossbar eval kits** — in-situ updatable ⇒ cleanest for EP TRAINING; small/research-grade (TetraMem, CrossBar, Knowm, academic arrays).
+- **(B) Analog-inference compute modules** (~tens-of-M weights, e.g. Mythic-class) — large MVM but FIXED/re-flash weights ⇒ mixed-signal EP (analog forward, digital re-program for the update). Matches "tens of M" in MVM size.
+- **(C) FPAA + discrete op-amps** (Anadigm / GT RASP) — fully-analog SMALL block, true physical settle, fully programmable.
+
+**Recommended architecture = mixed-signal, stitched (no fab):**
+- analog MVM substrate (A for trainable, or B for scale) does the linear F(z) + the relaxation feedback (the energy win);
+- a COTS **FPGA / RFSoC** (fast ADC/DAC) does the nonlinearities (softmax/GELU/LN) + the EP control loop
+ (drive settle → apply nudge βg → settle → measure contrast → compute + apply local Δθ);
+- ADC/DAC glue between them. All COTS modules + a PCB. The "out-of-the-box" stitch.
+
+**Out-of-the-box levers:**
+- **Softmax is the analog-hard piece** → for the HARDWARE demo use the energy/Hopfield (LSE, tied-value) attention
+ variant (analog-native, conservative) even if the sim keeps softmax; or keep softmax in the FPGA (small fraction of compute).
+- The code's **`--fnoise` optics-noise model already exists** → simulate analog non-idealities (device noise, quantization,
+ variation) IN the 1B sim to de-risk the hardware before buying anything.
+- **Stage it:** Phase-1 = ONE small block on FPAA/discrete + FPGA, prove the EP analog loop trains a toy task end-to-end;
+ Phase-2 = scale the crossbar to tens-of-M. Tens-of-M is the Phase-2 target, not the first build.
+
+**Next step:** a deep-research pass on "2026 COTS / eval-board analog-compute substrates with in-situ weight update for
+equilibrium-network training" to pin the specific modules (memristor kits, Mythic-class availability, FPAA, RFSoC) — the
+specific availability is the thing to verify, not guess.
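
Reader's note on PART 1 of the added file: the cost estimate is plain arithmetic on three inputs the plan states (the 288 H200-hour BP anchor, the net 20–50× EP/BP factor, and ~$3–4 per H200-hour). The sketch below reproduces it so the ranges can be rechecked when the measured factor changes. The function names are hypothetical and not part of `ep_run/`, and the wall-clock figure assumes ideal data-parallel scaling, which the plan implies but does not measure.

```python
"""Back-of-envelope check of the PART 1 numbers; every input is quoted from the plan."""

BP_H200_HOURS = 36 * 8             # anchor: 1B BP run, 36 h on 8 H200 = 288 H200-hours
EP_OVER_BP = (20, 50)              # net effective EP/BP cost factor after the reducers
USD_PER_H200_HOUR = (3.0, 4.0)     # the plan's rough ~$3-4 per H200-hour


def per_step_flop_ratio(relax_steps: int) -> float:
    """EP/BP per-step FLOPs: each relaxation step is forward-only (2ND), one BP step is 6ND."""
    return relax_steps * 2 / 6


def estimate(factor: float, usd_per_hour: float, gpus: int) -> dict:
    """Total GPU-hours, cost, and wall-clock days (assumes ideal data-parallel scaling)."""
    hours = factor * BP_H200_HOURS
    return {
        "h200_hours": hours,
        "usd": hours * usd_per_hour,
        "days": hours / gpus / 24,
    }


if __name__ == "__main__":
    for steps in (150, 300):
        print(f"{steps} relaxation steps -> {per_step_flop_ratio(steps):.0f}x BP per step")
    for gpus in (8, 64):
        low = estimate(EP_OVER_BP[0], USD_PER_H200_HOUR[0], gpus)
        high = estimate(EP_OVER_BP[1], USD_PER_H200_HOUR[1], gpus)
        print(f"{gpus} GPUs: {low['h200_hours']:.0f}-{high['h200_hours']:.0f} H200-hours, "
              f"${low['usd']:.0f}-${high['usd']:.0f}, "
              f"{low['days']:.1f}-{high['days']:.1f} days")
```

Total H200-hours and dollars do not depend on `gpus`; only the `days` field does. This is the plan's point that scaling out buys wall-clock time at fixed cost.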

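
Reader's note on PART 2: the EP control loop named in the recommended architecture (drive settle, apply nudge, settle, measure contrast, apply local Δθ) can be illustrated on a deliberately tiny system. The sketch below is an assumption-laden toy and not the repository's block: a single damped linear unit with energy E = ½·c·s² − s·(w·x) and a squared-error nudge, relaxed with the same z←z+εF(z) rule the table mentions. All names (`settle`, `ep_step`, `train`) and all constants are hypothetical; softmax, GELU, LN and `blk._cstep` are not modelled.

```python
"""Toy equilibrium-propagation loop on one damped linear unit (illustrative only)."""


def settle(w, x, s, beta=0.0, y=0.0, c=1.0, eps=0.2, steps=100):
    """Relax s <- s + eps*F(s) with F(s) = w.x - c*s - beta*(s - y).

    F is minus the gradient of E + beta*C, where E = 0.5*c*s^2 - s*(w.x)
    and C = 0.5*(s - y)^2; the -c*s term plays the role of the damping resistor.
    """
    drive = sum(wi * xi for wi, xi in zip(w, x))
    for _ in range(steps):
        s += eps * (drive - c * s - beta * (s - y))
    return s


def ep_step(w, x, y, beta=0.1, lr=0.1):
    """One EP update: free settle, nudged settle, contrast, local weight change."""
    s_free = settle(w, x, s=0.0)                        # free phase
    s_nudged = settle(w, x, s=s_free, beta=beta, y=y)   # nudged phase, warm-started
    contrast = (s_nudged - s_free) / beta               # measured contrast
    # Since dE/dw_i = -s*x_i, the EP rule -(lr/beta)*(dE/dw_i|nudged - dE/dw_i|free)
    # reduces to lr * x_i * contrast: it needs only the local input and state difference.
    return [wi + lr * xi * contrast for wi, xi in zip(w, x)]


def train(data, epochs=300):
    """Per-sample EP updates starting from zero weights."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            w = ep_step(w, x, y)
    return w


# Targets generated by the (hypothetical) true weights (2, -1) with c = 1.
DATA = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]

if __name__ == "__main__":
    print(train(DATA))
```

For this unit the contrast has a closed form, (y − s_free)/(c + β), so the update matches the gradient of the squared error up to the factor c/(c + β), which tends to 1 as β shrinks. On hardware the two `settle` calls would be physical settling rather than a loop, which is the saving the plan is after.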