docs/hardware/SCALING_AND_HARDWARE_PLAN.md



# EP scaling-to-1B sim cost + modular analog-hardware plan (2026-06-21)

## PART 1 — 1B simulation cost (H200)

Anchor: your 1B **BP** run ≈ 36h × 8 H200 = **288 H200-hours**.

Cost driver: the equilibrium block relaxes for ~150–300 steps (T1=150, plus refine up to t1max=300). This is the
WHOLE cost and the step count is ~width-independent, so the per-step factor measured at C512 should hold at 1B:
- EP/BP per-step FLOPs ≈ **50–100×**: relaxation is forward-only, so 150–300 steps at 2ND each is 300–600ND, against 6ND for one BP step.
  Measured: C512 EP 0.23 it/s vs a depth-1 C512 BP ~tens it/s ⇒ ~65–130×. **EP/BPTT ≈ only 1.5×**: the cost is the equilibrium, not EP.
- Reducers: bf16 + torch.compile + the compiled free-phase fast-path (blk._cstep) ≈ 2–4×; refine exits early
  while contractive (×~0.6); EP's low memory (no unrolled graph) packs bigger batches than BPTT. Net effective **~20–50×**.
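
The per-step ratio above can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the repo: the function names are hypothetical, and the 15 and 30 it/s BP throughputs are back-solved from the stated ~65–130× range rather than measured values.

```python
def ep_over_bp_flop_ratio(relax_steps: int,
                          fwd_flops_per_nd: float = 2.0,
                          bp_flops_per_nd: float = 6.0) -> float:
    """FLOP ratio of one EP step (relax_steps forward passes) to one BP step."""
    return relax_steps * fwd_flops_per_nd / bp_flops_per_nd


def measured_ratio(bp_it_per_s: float, ep_it_per_s: float = 0.23) -> float:
    """Throughput ratio BP/EP; 0.23 it/s is the C512 EP figure quoted above."""
    return bp_it_per_s / ep_it_per_s


flop_low = ep_over_bp_flop_ratio(150)    # 150 * 2 / 6 = 50
flop_high = ep_over_bp_flop_ratio(300)   # 300 * 2 / 6 = 100

# ASSUMPTION: "tens it/s" is read as 15-30 it/s to reproduce the ~65-130x range.
meas_low = measured_ratio(15.0)
meas_high = measured_ratio(30.0)

print(f"FLOP-count ratio: {flop_low:.0f}x to {flop_high:.0f}x")
print(f"Throughput ratio: {meas_low:.0f}x to {meas_high:.0f}x")
```

The FLOP-count and throughput ranges overlap, which is the basis for treating the relaxation step count as the dominant cost term.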

**Estimate: 1B EP ≈ 20–50 × 288 ≈ 6,000–15,000 H200-hours** ≈ $18k–60k @ ~$3–4/H200-hr (or burn AWS credits).
Wall-clock: ~30–75 days on 8×H200, or **~4–9 days on 64×H200** (same H200-hours, assuming near-linear data-parallel scaling: scale OUT for wall-clock, cost is fixed).
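
The estimate can be reproduced from the 288 H200-hour anchor. The sketch below uses hypothetical names (`RunEstimate`, `estimate`) and assumes perfect linear scaling across GPUs, which a real run would only approximate.

```python
from dataclasses import dataclass


@dataclass
class RunEstimate:
    gpu_hours: float
    cost_usd: float
    wall_days: float


def estimate(bp_gpu_hours: float, ep_factor: float,
             usd_per_gpu_hour: float, n_gpus: int) -> RunEstimate:
    """Scale the BP anchor by the net EP/BP factor; assumes linear scale-out."""
    gpu_hours = bp_gpu_hours * ep_factor
    return RunEstimate(
        gpu_hours=gpu_hours,
        cost_usd=gpu_hours * usd_per_gpu_hour,
        wall_days=gpu_hours / n_gpus / 24.0,
    )


BP_ANCHOR = 36 * 8  # 36 h on 8 H200 = 288 H200-hours

low_8 = estimate(BP_ANCHOR, 20, 3.0, 8)     # 5760 h, $17,280, 30.0 days
high_8 = estimate(BP_ANCHOR, 50, 4.0, 8)    # 14400 h, $57,600, 75.0 days
low_64 = estimate(BP_ANCHOR, 20, 3.0, 64)   # same hours and cost, 3.75 days
high_64 = estimate(BP_ANCHOR, 50, 4.0, 64)  # same hours and cost, 9.375 days

for name, r in [("8 GPU low", low_8), ("8 GPU high", high_8),
                ("64 GPU low", low_64), ("64 GPU high", high_64)]:
    print(f"{name}: {r.gpu_hours:.0f} H200-h, ${r.cost_usd:,.0f}, {r.wall_days:.1f} days")
```

The rounded figures in the text (6,000–15,000 H200-hours, $18k–60k) sit slightly above these unrounded values.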

**Recommendation (don't burn 10k H200-hr blind):**
1. Ship the **speed package (task #14)** FIRST — directly cuts the bill 2–4×.
2. Validate the recipe + MEASURE the real EP/BP factor on a **~100–300M rung** (scaling-law dossier) before committing 1B.
3. Then the 1B run. Use EP's memory advantage to pack batch; data-parallel across many H200s for wall-clock.

## PART 2 — Modular analog-hardware plan (tens-of-M, COTS, NO custom fab)

**Principle that makes this cheap:** the damped equilibrium block is a *physical settling system* (monDEQ ≈ passive
resistor/op-amp circuit, Chaffey 2025). The free-phase relaxation = the hardware SETTLING — so the ~100× sim cost becomes
**~free physical settle**, and EP's update is **local** (no backprop). The algorithm is DESIGNED for this substrate; the IP
is substrate-agnostic, so demo on whatever COTS analog you can buy — never fab.

**Decompose the block → what each part needs:**
| part | hardware | note |
|---|---|---|
| linear maps WQ/K/V/O, fc, pj | analog MVM (crossbar) | bulk of params + compute |
| relaxation z←z+εF(z) | physical feedback (RC + the −c·z damping resistor) | THE win — free settle |
| nonlinearities: softmax, GELU, LN | FPGA (mixed-signal), OR energy/Hopfield attention (analog-native) | the analog-HARD part |
| EP update (local) | needs **updatable** analog weights | the key constraint |
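
To make the relaxation row concrete, here is a minimal sketch of the damped update z←z+εF(z) in plain Python. The dynamics `F(z) = tanh(Wz + x) − c·z` are an assumption chosen for illustration (the real block uses attention and MLP maps), and `W`, `x`, `c`, `eps` are hypothetical toy values picked so the map is contractive.

```python
import math


def matvec(W, z):
    """Plain-Python matrix-vector product."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]


def F(z, W, x, c):
    """ASSUMED toy dynamics: nonlinear drive tanh(Wz + x) minus damping c*z."""
    pre = matvec(W, z)
    return [math.tanh(p + xi) - c * zi for p, xi, zi in zip(pre, x, z)]


def relax(W, x, c=1.0, eps=0.5, max_steps=300, tol=1e-9):
    """Iterate z <- z + eps*F(z); exit early once the update is below tol."""
    z = [0.0] * len(x)
    for step in range(1, max_steps + 1):
        dz = F(z, W, x, c)
        z = [zi + eps * d for zi, d in zip(z, dz)]
        if max(abs(d) for d in dz) < tol:
            return z, step
    return z, max_steps


# Row sums of |W| are at most 0.6 < c, so the damped map is a contraction.
W = [[0.2, -0.1, 0.0],
     [0.1, 0.3, -0.2],
     [0.0, 0.1, 0.2]]
x = [0.5, -0.3, 0.8]

z_star, steps = relax(W, x)
residual = max(abs(d) for d in F(z_star, W, x, 1.0))
print(f"settled in {steps} steps, residual {residual:.2e}")
```

In simulation every iteration costs a forward pass; in the analog picture the same fixed point is reached by the circuit settling, which is where the saving comes from.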

**Substrate options (COTS/existing; key tradeoff = weight updatability — VERIFY current availability/specs):**
- **(A) Memristor/ReRAM crossbar eval kits** — in-situ updatable ⇒ cleanest for EP TRAINING; small/research-grade (TetraMem, CrossBar, Knowm, academic arrays).
- **(B) Analog-inference compute modules** (~tens-of-M weights, e.g. Mythic-class) — large MVM but FIXED/re-flash weights ⇒ mixed-signal EP (analog forward, digital re-program for the update). Matches "tens of M" in MVM size.
- **(C) FPAA + discrete op-amps** (Anadigm / GT RASP) — fully-analog SMALL block, true physical settle, fully programmable.

**Recommended architecture = mixed-signal, stitched (no fab):**
- analog MVM substrate (A for trainable, or B for scale) does the linear F(z) + the relaxation feedback (the energy win);
- a COTS **FPGA / RFSoC** (fast ADC/DAC) does the nonlinearities (softmax/GELU/LN) + the EP control loop
  (drive settle → apply nudge βg → settle → measure contrast → compute + apply local Δθ);
- ADC/DAC glue between them. All COTS modules + a PCB. The "out-of-the-box" stitch.
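
The control loop in the second bullet can be sketched on a one-weight toy model. Everything here is an illustrative assumption, not the project's code: a scalar state `s`, energy `E = 0.5·s² − w·x·s`, cost `C = 0.5·(s − y)²`, and the function names `settle` and `ep_step`. The structure follows standard two-phase equilibrium propagation.

```python
def settle(w, x, y=0.0, beta=0.0, s=0.0, eps=0.5, steps=50):
    """Relax s by gradient descent on E(w, s) + beta * C(s, y)."""
    for _ in range(steps):
        dE_ds = s - w * x   # from E = 0.5*s^2 - w*x*s
        dC_ds = s - y       # from C = 0.5*(s - y)^2
        s -= eps * (dE_ds + beta * dC_ds)
    return s


def ep_step(w, x, y, beta=0.1, lr=0.2):
    """One EP update: free settle, nudged settle, contrast, local update."""
    s_free = settle(w, x)                             # drive settle (beta = 0)
    s_nudged = settle(w, x, y, beta, s=s_free)        # apply nudge, settle again
    # Contrast of dE/dw = -x*s between the nudged and free equilibria.
    grad_est = ((-x * s_nudged) - (-x * s_free)) / beta
    return w - lr * grad_est                          # local weight update


# Toy task: learn w so that the equilibrium s = w*x matches y = 2*x.
data = [(x, 2.0 * x) for x in (-1.0, 0.5, 1.0, 1.5)]
w = 0.0
for _ in range(200):
    for x, y in data:
        w = ep_step(w, x, y)

print(f"learned w = {w:.6f}")
```

The update uses only quantities available at the weight (`x` and the two settled values of `s`), which is the locality property the hardware plan relies on. On the stitched system, `settle` would be the analog substrate and `ep_step` the FPGA control loop.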

**Out-of-the-box levers:**
- **Softmax is the analog-hard piece** → for the HARDWARE demo use the energy/Hopfield (LSE, tied-value) attention
  variant (analog-native, conservative) even if the sim keeps softmax; or keep softmax in the FPGA (small fraction of compute).
- The code's **`--fnoise` optics-noise model already exists** → simulate analog non-idealities (device noise, quantization,
  variation) IN the 1B sim to de-risk the hardware before buying anything.
- **Stage it:** Phase-1 = ONE small block on FPAA/discrete + FPGA, prove the EP analog loop trains a toy task end-to-end;
  Phase-2 = scale the crossbar to tens-of-M. Tens-of-M is the Phase-2 target, not the first build.
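
As a rough illustration of simulating analog non-idealities, the sketch below applies weight quantization and multiplicative device noise to a matrix-vector product. It is not the `--fnoise` model, whose internals are not described here; `quantize`, `noisy_mvm`, and all parameter values are hypothetical.

```python
import random


def quantize(v, bits, v_max):
    """Uniform symmetric quantization to 2**bits - 1 levels on [-v_max, v_max]."""
    levels = 2 ** (bits - 1) - 1
    clipped = max(-v_max, min(v_max, v))
    return round(clipped / v_max * levels) / levels * v_max


def noisy_mvm(W, x, bits=6, w_max=1.0, sigma=0.02, rng=None):
    """MVM with quantized weights and per-device multiplicative Gaussian noise."""
    rng = rng or random.Random(0)
    out = []
    for row in W:
        acc = 0.0
        for w, xj in zip(row, x):
            # ASSUMED device model: stored weight = quantized value * (1 + noise)
            w_dev = quantize(w, bits, w_max) * (1.0 + rng.gauss(0.0, sigma))
            acc += w_dev * xj
        out.append(acc)
    return out


W = [[0.50, -0.25, 0.10],
     [0.05, 0.80, -0.60]]
x = [1.0, 2.0, -1.0]

ideal = [sum(w * xj for w, xj in zip(row, x)) for row in W]
noisy = noisy_mvm(W, x, bits=6, sigma=0.02, rng=random.Random(42))
errors = [abs(a - b) for a, b in zip(ideal, noisy)]
print("ideal:", ideal)
print("noisy:", noisy)
print("abs error:", errors)
```

Sweeping `bits` and `sigma` in a model like this gives a first estimate of how much device variation the relaxation and the EP contrast can tolerate before buying hardware.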

**Next step:** a deep-research pass on "2026 COTS / eval-board analog-compute substrates with in-situ weight update for
equilibrium-network training" to pin the specific modules (memristor kits, Mythic-class availability, FPAA, RFSoC) — the
specific availability is the thing to verify, not guess.