Diffstat (limited to 'docs/outreach/SCELLIER_OUTREACH.md')
-rw-r--r-- docs/outreach/SCELLIER_OUTREACH.md | 62
1 files changed, 62 insertions, 0 deletions
diff --git a/docs/outreach/SCELLIER_OUTREACH.md b/docs/outreach/SCELLIER_OUTREACH.md
new file mode 100644
index 0000000..35ad9e0
--- /dev/null
+++ b/docs/outreach/SCELLIER_OUTREACH.md
@@ -0,0 +1,62 @@
+# Scellier (Rain AI) outreach — 2-stage framing, compute estimate, invite plan
+
+**Goal:** present the work to Ben Scellier (benjamin@rain.ai, Rain AI; EP inventor; VF-EP author) as a **2-stage project**, attach a PDF report once the aep-dynamics figures are done, invite him for **any feedback** (experiments / framing / writing), and — when complete — invite him as **one of the senior authors**. No venue planned yet. Email thread already open (he offered AWS credits + asked about collaboration depth; he's likely waiting on our ept update / methods note).
+
+## The 2-stage framing (this is the pitch)
+
+- **Stage 1 — Dynamics & convergence of EP with asymmetric weights (aep-dynamics).** Current AsymEP literature ASSUMES convergence. We characterize **WHEN** the non-conservative VF actually converges, **WHY** it fails (a supercritical Hopf bifurcation of the free-phase operator), and the **steerings** that guarantee convergence (spectral / floss / adaptleak), with their trade-offs. **Almost done. MLP/CNN/RNN only — does NOT touch the transformer or LLM scaling** (keeps ept un-exposed). This is the self-contained piece that gets Ben as senior author — it's his wheelhouse and it directly answers the convergence pain point he already knows about (the "Jacobian control explodes, kills the ~1-cosine gradient" issue → now a characterized Hopf + a tunable steering trade-off, a much stronger story).
+- **Stage 2 — Scaling EP-trained transformers to LLMs (ept).** Standard non-conservative attention (Q≠K≠V≠O) + adjoint-consistent nudging. 15M debug model already within-noise of BP; 50M on TinyStories looks great. Next = **scale ladder + ablations**, which need compute → the AWS / collaborator-GPU conversation. This is the ongoing bigger collaboration.
+
+Clean separation: **Stage 1 = paper + senior-authorship; Stage 2 = compute collaboration.**
+
+## Compute estimate (rough — the EP overhead range dominates the uncertainty)
+
+**EP overhead over BP:** ~15–60× per token (relaxation steps × nudged-phase evaluations); the **nudging-phase design is the lever** — getting it to ~15× (vs 60×) is what makes 1B affordable. This constant is exactly what physical settling removes (the hardware argument).
+
+**Anchor:** 1B LLM, BP, ≈ 36 h on 8×H200 = **288 H200-GPU-hrs** (Chinchilla-ish ~20B tokens).
+**GPU throughput factors (bf16, rough):** H200 ≈ H100 ≈ **6.4× A6000** ≈ **11× A5000**.
+
+**Scale ladder (compute-optimal, compute ∝ N²):**
+
+| rung | BP (H200-hr) | EP (H200-hr, 15–60×) | EP (A6000-hr) | run where |
+|---|---|---|---|---|
+| 50M | ~0.7 | 11–43 | 70–280 | local/Stanford A6000/A5000 |
+| 150M | ~6.5 | 100–390 | 640–2500 | local/Stanford A6000 |
+| 400M | ~46 | 700–2800 | 4500–18000 | borderline → AWS |
+| 1B (full Chinchilla) | 288 | **4300–17000** | 28000–110000 (≈ years on A6000×2) | **AWS H100** |
+| 1B (SHORT validation, ~1–2B tok) | ~15–30 | **230–1700** | — | AWS H100, affordable |
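+
+A worked version of the ladder arithmetic, as a sketch: it assumes only the quadratic scaling, the 288 H200-hr anchor at 1B / ~20B tokens, and the 6.4× A6000 factor stated above. The symbols $N$ (parameters), $D$ (tokens), $k$ (EP overhead) and $C$ (compute) are notation introduced here for illustration.
+
+```latex
+\begin{aligned}
+C_{\mathrm{BP}}(N) &\approx 288\left(\frac{N}{10^{9}}\right)^{2}\ \text{H200-hr} && \text{(compute-optimal rungs)}\\
+C_{\mathrm{EP}}(N) &= k\,C_{\mathrm{BP}}(N), \quad k\in[15,60]\\
+C_{\mathrm{EP}}^{\mathrm{A6000}}(N) &\approx 6.4\,C_{\mathrm{EP}}(N)\\
+C_{\mathrm{BP}}^{\mathrm{short}}(D) &\approx 288\,\frac{D}{20\times 10^{9}}\ \text{H200-hr} && \text{(1B params, } D \text{ tokens)}
+\end{aligned}
+```
+
+For example, $N = 400\mathrm{M}$ gives $C_{\mathrm{BP}} \approx 288 \times 0.16 \approx 46$ H200-hr and $C_{\mathrm{EP}} \approx 690$–$2800$ H200-hr, matching the 400M row after rounding; $D = 1$–$2\mathrm{B}$ gives $C_{\mathrm{BP}}^{\mathrm{short}} \approx 14$–$29$ H200-hr, matching the short-validation row.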
+
+**Ablations** (debug scale 15–50M on TinyStories, small fixed corpus): ~5–25 EP-H200-hr each (~30–160 A6000-hr); ~15–20 runs → **~500–3000 A6000-hr total** → fits the 6 local/Stanford cards (~5 A6000-equiv) over ~1–3 weeks.
+
+**Resource mapping:**
+- **Ablations + small rungs (50–150M):** local A6000×2 + Stanford A6000×2 + A5000×2 (on-demand; more queued). Weeks of wall-clock.
+- **400M–1B:** too big for A6000s (1B-full EP = 28k–110k A6000-hr). → **Ben's AWS H100.** Full-Chinchilla 1B = **4300–17000 H100-hr** (≈ 540–2100 hrs on one 8×H100 node ≈ 22–88 days; faster multi-node). **SHORT-run 1B validation (1–2B tokens) ≈ 230–1700 H100-hr**: the affordable path, and likely all the paper needs (an "it scales" point, not a full SOTA 1B).
+
+**The ask to Ben (Stage 2):** AWS H100 for the 400M–1B rungs, on the order of **a few thousand H100-hrs** for the short-run ladder (4.3k–17k if full Chinchilla 1B). Note the cost is **highly sensitive to the EP overhead** (15× vs 60×) → optimizing the nudging phase is the priority lever.
+
+## Invite / deliverable plan
+
+1. Finish the aep-dynamics figures (P1 cover [Yuren] · P2 phenomenon ✅ · P3 MLP ✅ · P4 dropped→text · P5 cross-arch [in progress, 200-seed bands running]).
+2. Write up Stage-1 dynamics → **attach a PDF report** to Ben.
+3. Email: report the 2-stage framing + ept progress (resreg_warm best **1.9313**, just under the agreed loss < 2.0 success bar) + the Stage-2 compute estimate above.
+4. Invite: "we'd value any feedback — experiments, framing, writing — and, when complete, would be honored to have you as one of the senior authors." No venue committed yet.
+
+## CET delta — what ept adds over Scellier's own CET (openreview Qrfml76eWJ)
+
+**CET (Høier, Kerjan, Scellier, ICLR'26 AM workshop):** an ENERGY (conservative) transformer = convergent Energy-Transformer. Attention is modern-Hopfield energy: `A=Q·K`, `E^att=-1/γ Σlog Σexp(γA)`, force = −∇E → **conservative**. Allows Q≠K (separate W^Q,W^K) **but NO free V/O projections** (values = tokens via the energy gradient; no W^V/W^O). Vanilla centered EP (Laborieux), T1=150 free / T2=5 nudge. Task = CelebA masked-image completion (MSE); single energy block, temporally unrolled. EP MSE 0.01422 ≈ TBPTE 0.01376. Convergence is FREE (gradient flow) but it's an energy surrogate, NOT the real transformer; even its convergence is hand-waved (App. A: sometimes saddles, no proof PGD descends, relies on same T=150 at train/test).
+
+**ept adds:**
+1. **Real non-conservative attention (independent Q/K/V/O):** the actual LLM architecture. CET drops V/O to keep attention a gradient field; we keep them → Jacobian non-symmetric (|Jv−Jᵀv|/|Jv|≈1.4), NOT energy descent → convergence NOT free.
+2. **Adjoint-consistent nudging (−2A_J):** vanilla EP (CET's) gives the WRONG gradient on non-gradient dynamics; our correction recovers cosine similarity ≈ 1 with the BPTT gradient. Extends EP from energy-based → non-conservative dynamics.
+3. **Convergence characterization + steering (aep-dynamics):** CET gets convergence by construction (+ hand-waves it); we characterize the real instability (supercritical Hopf as attention becomes expressive) and steer it (spectral/floss/adaptleak).
+4. **Language + scale + depth:** CET = single block, image completion, CelebA. ept = real SLM (TinyStories→1B), multi-block, language; 50M within-noise of BP.
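+
+The ratio |Jv−Jᵀv|/|Jv| quoted in item 1 can be written out as follows. This is a sketch of the definition as read from the inline expression: $J$ is the Jacobian of the dynamics (where it is evaluated is not stated here), and the probe vector $v$ and the name $\rho$ are notation assumed for illustration.
+
+```latex
+\rho(v) = \frac{\lVert Jv - J^{\top}v\rVert}{\lVert Jv\rVert},
+\qquad
+\rho \equiv 0 \ \text{if } J = J^{\top} \ \text{(gradient field)},
+\qquad
+\rho \equiv 2 \ \text{if } J = -J^{\top} \ \text{(purely rotational, } Jv \neq 0\text{)}.
+```
+
+On this scale the reported value of about 1.4 sits well away from the symmetric case, which is why energy descent cannot be assumed.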
+
+**Honest trade:** ept GIVES UP the energy/Lyapunov interpretation (no global energy → the convergence problem is the price). Complementary: CET = EP for the energy-restricted transformer (clean); ept = EP for the REAL non-conservative transformer (full architecture, convergence earned).
+
+**Pitch to Scellier = build-on-CET:** "you showed EP trains an energy-restricted transformer; we extend it to the full non-conservative transformer (independent Q/K/V/O) via a corrected nudging + a convergence theory, at LM scale." He's a CET author → instantly legible + on his hardware mission.
+
+## Status anchors (so this survives compaction)
+- Ben thread: open since 2026-06; he offered AWS credits + asked compute estimate + collaboration depth; awaiting our update.
+- Original success bar agreed with Ben: **transformer val loss < 2.0**. ept resreg_warm now best 1.9313 (just clears the bar).
+- aep-dynamics = Stage 1 (dynamics, MLP/CNN/RNN, no transformer). ept = Stage 2 (transformer scaling).
+- Related: [[hw-outreach-plan-gated]] (Scellier already on the theory-side outreach list).