Initial commit: ept — backprop-free equilibrium transformer (EP)

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
author: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
committer: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
commit: b83947778e2c776f757a07d4719b7ce961d7ed55 (patch)
tree: b9cc01d7adda691d9156d9d04f4fb2f644674e96 /docs/hardware/SCALING_AND_HARDWARE_PLAN.md
1 files changed, 58 insertions, 0 deletions
diff --git a/docs/hardware/SCALING_AND_HARDWARE_PLAN.md b/docs/hardware/SCALING_AND_HARDWARE_PLAN.md
new file mode 100644
index 0000000..8e93a40
--- /dev/null
+++ b/docs/hardware/SCALING_AND_HARDWARE_PLAN.md
@@ -0,0 +1,58 @@
+# EP scaling-to-1B sim cost + modular analog-hardware plan (2026-06-21)
+
+## PART 1 — 1B simulation cost (H200)
+
+Anchor: your 1B **BP** run ≈ 36h × 8 H200 = **288 H200-hours**.
+
+Cost driver: the equilibrium block relaxes for ~150–300 steps (T1=150, plus refinement up to t1max=300). This relaxation is
+essentially the WHOLE cost and is roughly width-independent, so the per-step factor measured at C512 should hold at 1B:
+- EP/BP per-step FLOPs ≈ **50–100×**: each relaxation step is forward-only at ~2ND, so 150–300 steps cost 300–600 ND against 6ND for BP.
+  Measured: C512 EP at 0.23 it/s vs a depth-1 C512 BP at ~tens of it/s ⇒ ~65–130×. **EP/BPTT ≈ only 1.5×**: the cost is the equilibrium, not EP.
+- Reducers: bf16 + torch.compile + the compiled free-phase fast-path (blk._cstep) ≈ 2–4×; refine exits early
+  while contractive (×~0.6); EP's low memory (no unrolled graph) packs bigger batches than BPTT. Net effective **~20–50×**.
+
+**Estimate: 1B EP ≈ 20–50 × 288 ≈ 6,000–15,000 H200-hours** ≈ $18k–60k @ ~$3–4/H200-hr (or burn AWS credits).
+Wall-clock: ~30–75 days on 8×H200, or **~4–9 days on 64×H200** (same H200-hours — scale OUT for wall-clock, cost is fixed).
+
+**Recommendation (don't burn 10k H200-hr blind):**
+1. Ship the **speed package (task #14)** FIRST — directly cuts the bill 2–4×.
+2. Validate the recipe + MEASURE the real EP/BP factor on a **~100–300M rung** (scaling-law dossier) before committing to 1B.
+3. Then the 1B run. Use EP's memory advantage to pack batch; data-parallel across many H200s for wall-clock.
+
+## PART 2 — Modular analog-hardware plan (tens-of-M, COTS, NO custom fab)
+
+**Principle that makes this cheap:** the damped equilibrium block is a *physical settling system* (monDEQ ≈ passive
+resistor/op-amp circuit, Chaffey 2025). The free-phase relaxation = the hardware SETTLING — so the ~100× sim cost becomes
+**~free physical settle**, and EP's update is **local** (no backprop). The algorithm is DESIGNED for this substrate; the IP
+is substrate-agnostic, so demo on whatever COTS analog you can buy — never fab.
+
+**Decompose the block → what each part needs:**
+| part | hardware | note |
+|---|---|---|
+| linear maps WQ/K/V/O, fc, pj | analog MVM (crossbar) | bulk of params + compute |
+| relaxation z←z+εF(z) | physical feedback (RC + the −c·z damping resistor) | THE win — free settle |
+| nonlinearities: softmax, GELU, LN | FPGA (mixed-signal), OR energy/Hopfield attention (analog-native) | the analog-HARD part |
+| EP update (local) | needs **updatable** analog weights | the key constraint |
+
+**Substrate options (COTS/existing; key tradeoff = weight updatability — VERIFY current availability/specs):**
+- **(A) Memristor/ReRAM crossbar eval kits** — in-situ updatable ⇒ cleanest for EP TRAINING; small/research-grade (TetraMem, CrossBar, Knowm, academic arrays).
+- **(B) Analog-inference compute modules** (~tens-of-M weights, e.g. Mythic-class) — large MVM but FIXED/re-flash weights ⇒ mixed-signal EP (analog forward, digital re-program for the update). Matches "tens of M" in MVM size.
+- **(C) FPAA + discrete op-amps** (Anadigm / GT RASP) — fully-analog SMALL block, true physical settle, fully programmable.
+
+**Recommended architecture = mixed-signal, stitched (no fab):**
+- analog MVM substrate (A for trainable, or B for scale) does the linear F(z) + the relaxation feedback (the energy win);
+- a COTS **FPGA / RFSoC** (fast ADC/DAC) does the nonlinearities (softmax/GELU/LN) + the EP control loop
+  (drive settle → apply nudge βg → settle → measure contrast → compute + apply local Δθ);
+- ADC/DAC glue between them. All COTS modules + a PCB. The "out-of-the-box" stitch.
+
+**Out-of-the-box levers:**
+- **Softmax is the analog-hard piece** → for the HARDWARE demo use the energy/Hopfield (LSE, tied-value) attention
+  variant (analog-native, conservative) even if the sim keeps softmax; or keep softmax in the FPGA (small fraction of compute).
+- The code's **`--fnoise` optics-noise model already exists** → simulate analog non-idealities (device noise, quantization,
+  variation) IN the 1B sim to de-risk the hardware before buying anything.
+- **Stage it:** Phase-1 = ONE small block on FPAA/discrete + FPGA, prove the EP analog loop trains a toy task end-to-end;
+  Phase-2 = scale the crossbar to tens-of-M. Tens-of-M is the Phase-2 target, not the first build.
+
+**Next step:** a deep-research pass on "2026 COTS / eval-board analog-compute substrates with in-situ weight update for
+equilibrium-network training" to pin the specific modules (memristor kits, Mythic-class availability, FPAA, RFSoC) — the
+specific availability is the thing to verify, not guess.
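
The Part 1 arithmetic in the added file can be reproduced with a short script. This is a minimal sketch, not code from the repository: the function names `flop_ratio` and `ep_cost_estimate` are hypothetical, the inputs are the figures quoted in the doc (36 h × 8 H200, a net 20–50× factor, $3–4 per H200-hour), and it assumes ideal linear scaling when going from 8 to 64 GPUs.

```python
def flop_ratio(relax_steps, fwd_flops_nd=2.0, bp_flops_nd=6.0):
    """EP/BP per-step FLOP ratio: relax_steps forward passes (~2ND each) vs one BP step (~6ND)."""
    return relax_steps * fwd_flops_nd / bp_flops_nd


def ep_cost_estimate(bp_hours=36.0, bp_gpus=8, factor_range=(20.0, 50.0),
                     usd_per_gpu_hour=(3.0, 4.0), target_gpus=64):
    """Scale the BP anchor run by the net EP/BP factor (all defaults are the doc's quoted figures)."""
    bp_gpu_hours = bp_hours * bp_gpus
    lo = factor_range[0] * bp_gpu_hours
    hi = factor_range[1] * bp_gpu_hours
    return {
        "bp_gpu_hours": bp_gpu_hours,
        "ep_gpu_hours": (lo, hi),
        # Cheapest case pairs the low factor with the low price, and vice versa.
        "usd": (lo * usd_per_gpu_hour[0], hi * usd_per_gpu_hour[1]),
        # Wall-clock assumes perfect data-parallel scaling (an idealization).
        "days_on_bp_gpus": (lo / bp_gpus / 24, hi / bp_gpus / 24),
        "days_on_target_gpus": (lo / target_gpus / 24, hi / target_gpus / 24),
    }


if __name__ == "__main__":
    print("per-step FLOP ratio:", flop_ratio(150), "to", flop_ratio(300))
    for key, value in ep_cost_estimate().items():
        print(key, value)
```

The unrounded range is 5,760–14,400 H200-hours, which the doc rounds to 6,000–15,000. The wall-clock figures change with GPU count, while the H200-hour total does not.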
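
The damped relaxation z←z+εF(z) named in the Part 2 table can be illustrated with a toy fixed-point loop. This is a sketch under stated assumptions, not the repository's block: the form F(z) = −c·z + tanh(Wz + x), the function name `settle`, and the tolerance-based exit rule are all illustrative choices. Only the step budget (T1=150, t1max=300) and the −c·z damping term come from the doc.

```python
import math


def settle(W, x, c=1.0, eps=0.1, t1=150, t1max=300, tol=1e-6):
    """Iterate z <- z + eps * F(z) with F(z) = -c*z + tanh(W z + x) (assumed toy dynamics).

    Runs at least t1 steps, then keeps refining up to t1max and exits as soon
    as the largest |F(z)| component drops below tol.
    """
    n = len(x)
    z = [0.0] * n
    for step in range(1, t1max + 1):
        # F(z): damping term plus a bounded nonlinearity of the linear drive.
        f = [-c * z[i] + math.tanh(sum(W[i][j] * z[j] for j in range(n)) + x[i])
             for i in range(n)]
        z = [z[i] + eps * f[i] for i in range(n)]
        residual = max(abs(v) for v in f)  # measured at the pre-update state
        if step >= t1 and residual < tol:
            break
    return z, step, residual


if __name__ == "__main__":
    # Small coupling keeps the map contractive for this toy case.
    W = [[0.0, 0.3], [-0.3, 0.0]]
    x = [0.5, -0.2]
    z, steps, residual = settle(W, x)
    print("steps:", steps, "residual:", residual, "z:", z)
```

In the simulation, every one of these steps is a full forward pass. On an analog substrate the same loop is what the physical settling would perform, which is the cost argument Part 2 relies on.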