# EP scaling-to-1B sim cost + modular analog-hardware plan (2026-06-21)

## PART 1: 1B simulation cost (H200)

Anchor: your 1B **BP** run ≈ 36h × 8 H200 = **288 H200-hours**.

Cost driver: the equilibrium block relaxes for ~150–300 steps (T1=150, plus refine up to t1max=300). This is the WHOLE cost and is roughly width-independent, so the per-step factor measured at C512 holds at 1B:

- EP/BP per-step FLOPs ≈ **50–100×** (each relaxation step is forward-only at 2ND, so 150–300 steps cost 300–600 ND; BP is 6ND). Measured: C512 EP at 0.23 it/s vs a depth-1 C512 BP at ~tens of it/s ⇒ ~65–130×. **EP/BPTT ≈ only 1.5×**: the cost is the equilibrium, not EP.
- Reducers: bf16 + torch.compile + the compiled free-phase fast-path (blk._cstep) ≈ 2–4×; refine exits early while contractive (×~0.6); EP's low memory (no unrolled graph) packs bigger batches than BPTT. Net effective **~20–50×**.

**Estimate: 1B EP ≈ 20–50 × 288 ≈ 6,000–15,000 H200-hours** ≈ $18k–60k @ ~$3–4/H200-hr (or burn AWS credits). Wall-clock: ~30–75 days on 8×H200, or **~4–9 days on 64×H200** (same H200-hours: scale OUT for wall-clock, cost is fixed).

**Recommendation (don't burn 10k H200-hr blind):**

1. Ship the **speed package (task #14)** FIRST. It directly cuts the bill 2–4×.
2. Validate the recipe + MEASURE the real EP/BP factor on a **~100–300M rung** (scaling-law dossier) before committing to 1B.
3. Then the 1B run. Use EP's memory advantage to pack batch; data-parallel across many H200s for wall-clock.

## PART 2: Modular analog-hardware plan (tens-of-M, COTS, NO custom fab)

**Principle that makes this cheap:** the damped equilibrium block is a *physical settling system* (monDEQ ≈ passive resistor/op-amp circuit, Chaffey 2025). The free-phase relaxation = the hardware SETTLING, so the ~100× sim cost becomes a **~free physical settle**, and EP's update is **local** (no backprop). The algorithm is DESIGNED for this substrate; the IP is substrate-agnostic, so demo on whatever COTS analog you can buy. Never fab.
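To make the link between the settle and the ~100× figure concrete, here is a minimal sketch under stated assumptions. It relaxes a toy damped block, counts the steps to equilibrium, and then redoes the Part 1 accounting. The toy map `tanh(W z + x)`, the helpers `settle`, `ep_over_bp_flops`, and `h200_hours`, the 64-unit size, and the values ε = 0.5 and c = 1 are all hypothetical. None of this is the project's block or `blk._cstep`; only the 2ND/6ND accounting, the 150–300 step range, and the 288 H200-hour anchor come from Part 1.

```python
import numpy as np

def settle(W, x, eps=0.5, c=1.0, tol=1e-6, max_steps=2000):
    """Damped relaxation z <- z + eps * (tanh(W z + x) - c z).

    Returns the settled state and the number of steps taken.
    Toy stand-in for the equilibrium block, not the real one.
    """
    z = np.zeros_like(x)
    for step in range(1, max_steps + 1):
        drive = np.tanh(W @ z + x) - c * z   # F(z) with the -c*z damping term
        z = z + eps * drive
        if np.linalg.norm(drive) < tol:      # stop once the state stops moving
            return z, step
    return z, max_steps

def ep_over_bp_flops(relax_steps):
    """Per-step FLOP ratio: relax_steps forward passes (2ND each) vs BP (6ND)."""
    return relax_steps * 2.0 / 6.0

def h200_hours(factor, bp_hours=36 * 8):
    """Scale the 288 H200-hour BP anchor by an EP/BP cost factor."""
    return factor * bp_hours

rng = np.random.default_rng(0)
n = 64
W = rng.standard_normal((n, n))
W *= 0.9 / np.linalg.norm(W, 2)              # spectral norm 0.9, so the map is contractive
x = rng.standard_normal(n)

z_star, steps = settle(W, x)
residual = np.linalg.norm(np.tanh(W @ z_star + x) - z_star)
print(f"toy block settled in {steps} steps, residual {residual:.2e}")

for t in (150, 300):
    print(f"{t} relaxation steps -> EP/BP per-step FLOPs ~{ep_over_bp_flops(t):.0f}x")
for f in (20, 50):
    print(f"net factor {f}x -> {h200_hours(f):,} H200-hours")
```

In simulation every one of those relaxation steps is a paid forward pass; on an analog substrate the same loop is the circuit settling, which is the point of the principle above.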
**Decompose the block → what each part needs:**

| Part | Hardware | Note |
|---|---|---|
| Linear maps WQ/K/V/O, fc, pj | analog MVM (crossbar) | bulk of params + compute |
| Relaxation z←z+εF(z) | physical feedback (RC + the −c·z damping resistor) | THE win: free settle |
| Nonlinearities: softmax, GELU, LN | the analog-HARD part | do mixed-signal in FPGA, OR use energy/Hopfield attention (analog-native) |
| EP update (local) | needs **updatable** analog weights | the key constraint |

**Substrate options (COTS/existing; key tradeoff = weight updatability; VERIFY current availability/specs):**

- **(A) Memristor/ReRAM crossbar eval kits:** in-situ updatable ⇒ cleanest for EP TRAINING; small/research-grade (TetraMem, CrossBar, Knowm, academic arrays).
- **(B) Analog-inference compute modules** (~tens-of-M weights, e.g. Mythic-class): large MVM but FIXED/re-flash weights ⇒ mixed-signal EP (analog forward, digital re-program for the update). Matches "tens of M" in MVM size.
- **(C) FPAA + discrete op-amps** (Anadigm / GT RASP): fully-analog SMALL block, true physical settle, fully programmable.

**Recommended architecture = mixed-signal, stitched (no fab):**

- an analog MVM substrate (A for trainable, or B for scale) does the linear F(z) + the relaxation feedback (the energy win);
- a COTS **FPGA / RFSoC** (fast ADC/DAC) does the nonlinearities (softmax/GELU/LN) + the EP control loop (drive settle → apply nudge βg → settle → measure contrast → compute + apply local Δθ);
- ADC/DAC glue between them. All COTS modules + a PCB. The "out-of-the-box" stitch.

**Out-of-the-box levers:**

- **Softmax is the analog-hard piece** → for the HARDWARE demo use the energy/Hopfield (LSE, tied-value) attention variant (analog-native, conservative) even if the sim keeps softmax; or keep softmax in the FPGA (small fraction of compute).
- The code's **`--fnoise` optics-noise model already exists** → simulate analog non-idealities (device noise, quantization, variation) IN the 1B sim to de-risk the hardware before buying anything.
- **Stage it:** Phase-1 = ONE small block on FPAA/discrete + FPGA, prove the EP analog loop trains a toy task end-to-end; Phase-2 = scale the crossbar to tens-of-M. Tens-of-M is the Phase-2 target, not the first build.

**Next step:** a deep-research pass on "2026 COTS / eval-board analog-compute substrates with in-situ weight update for equilibrium-network training" to pin down the specific modules (memristor kits, Mythic-class availability, FPAA, RFSoC). The specific availability is the thing to verify, not guess.
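As a reference point for the Phase-1 toy task, the EP control loop from the recommended architecture (drive settle → apply nudge → settle → measure contrast → apply local Δθ) can be sketched in a few lines. This is a minimal sketch, assuming a deliberately trivial linear block with energy E = ½‖s‖² − sᵀWx and cost C = ½‖s − y‖², chosen because its EP update provably reduces to a scaled gradient step. The names `settle`, `ep_step`, and `train`, the sizes, and the Gaussian `noise` term are hypothetical. The `noise` argument is only a stand-in for settle-time non-idealities; it is not the `--fnoise` model.

```python
import numpy as np

def settle(W, X, Y, beta, S_init, rng, eps=0.5, steps=50, noise=0.0):
    """Relax S by gradient descent on F = E + beta * C.

    E = 0.5*||s||^2 - s.(W x) and C = 0.5*||s - y||^2 (toy assumptions).
    `noise` adds Gaussian perturbation to every relaxation step.
    """
    S = S_init.copy()
    drive_in = X @ W.T                        # the linear map, i.e. the analog MVM
    for _ in range(steps):
        force = drive_in - S - beta * (S - Y) # -dF/dS
        force = force + noise * rng.standard_normal(S.shape)
        S = S + eps * force
    return S

def ep_step(W, X, Y, rng, beta=0.5, lr=0.5, noise=0.0):
    """One EP update: free settle, nudged settle, local contrastive update."""
    S_free = settle(W, X, Y, 0.0, np.zeros_like(Y), rng, noise=noise)
    S_nudged = settle(W, X, Y, beta, S_free, rng, noise=noise)
    # dE/dW = -s x^T, so the contrast gives +(1/beta) * (s_nudged - s_free) x^T.
    # Only the local states and inputs are used; nothing is backpropagated.
    dW = (S_nudged - S_free).T @ X / (beta * len(X))
    return W + lr * dW

def loss(W, X, Y):
    return 0.5 * np.mean(np.sum((X @ W.T - Y) ** 2, axis=1))

def train(noise, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((256, 8))
    W_true = rng.standard_normal((4, 8))
    Y = X @ W_true.T                          # toy regression target
    W = np.zeros((4, 8))
    start = loss(W, X, Y)
    for _ in range(epochs):
        W = ep_step(W, X, Y, rng, noise=noise)
    return start, loss(W, X, Y)

start, clean = train(noise=0.0)
_, noisy = train(noise=0.01)
print(f"loss: start {start:.3f}, clean settle {clean:.2e}, noisy settle {noisy:.2e}")
```

In the mixed-signal stitch, `settle` would be the physical substrate and `ep_step` the FPGA control loop. Sweeping `noise` here is a cheap way to see how much settle-time perturbation the contrastive update tolerates before buying hardware, though a real study would use the existing noise model rather than this stand-in.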