# Draft reply to Ben Scellier (benjamin@rain.ai) — subject: Re: Scaling transformer trained by EP

Hi Ben,

Thanks again — and apologies for the delay; I wanted to send something concrete, and to wait until the model's samples were actually legible before sharing. The work has since split into two stages that map cleanly onto your offer.

**Stage 1 — Dynamics & convergence of EP with asymmetric (non-conservative) weights.** This is now a self-contained piece; a short report is attached (5 pp). AsymEP / VF-EP assume the free phase converges — we characterize *when* it does. The non-conservative free-phase operator undergoes a **supercritical Hopf bifurcation** (we confirm ℓ₁ < 0 via the normal form) as the coupling crosses a boundary set by gain/leak × asymmetry; past it the fixed point gives way to a bounded limit cycle and EP breaks. We then characterize three steerings that hold the system below the boundary — a spectral-abscissa projection, a residual-driven adaptive leak, and a finite-time-Lyapunov (gradient-flossing) penalty — and, measured over 200 seeds, report an honest trade-off: the spectral projection is the only universally robust one, while the other two are each robust on a single architecture and parameter-sensitive on the other. It's all MLP/CNN/RNN, so it stands on its own — and it gives us enough of a handle on the scalability question (what makes the free phase converge or break) to scale the transformer with some confidence.

I think this sits naturally beside your CET: CET secures convergence *by construction* (the conservative attention energy, no free V/O), whereas this studies the complementary regime — the full non-conservative operator with independent Q/K/V/O, where convergence is precisely what is no longer free. I'd be very grateful for any feedback on the report — experiments, framing, writing, all welcome. And if the direction interests you, we would be honored to have you as one of the senior authors; no venue is committed yet, so there's room to shape it together.

**Stage 2 — Scaling the EP-trained transformer (the larger, ongoing effort).** This carries a good deal of additional method and engineering that Stage 1 doesn't cover — the adjoint-consistent nudge that makes the non-conservative attention trainable (it recovers the exact gradient, cosine ≈ 1 with BPTT), the block architecture, and the open analog-friendly-optimizer question. A brief progress note: we've been iterating recipes at the ~15M debug scale (faster turnaround), and the best of these on TinyStories just reached val CE ≈ 1.97 — for us that's roughly the threshold where generations become legible — and it produces coherent little stories from the prompt *"Once upon a time,"*, e.g.:

> *"Once upon a time, there was a little girl named Lily. She had a puppy named Max. Lily loved to play with her pet… Lily asked her mom, 'Can I play with Max?' Her mommy replied, 'Sure, but first, you have to play with me.'"*

(This is our current best debug recipe; we haven't migrated it to 50M yet — that's the immediate next step.)

On compute-only vs. methodology: I'd much prefer the methodology side — your guidance, especially on convergence and on the analog-friendly optimizer, would be invaluable.

A more detailed compute picture (rough; the EP overhead dominates the uncertainty, and H100- and H200-hours are treated as interchangeable):
- **EP overhead over BP:** ~15–60× per token (relaxation steps × nudged-phase evaluations; the nudged-phase design is the main lever, and this constant is exactly what physical settling would remove).
- **Anchor:** a 1B BP run ≈ 36 h on 8×H200 ≈ 290 H200-GPU-hours.
- **Ladder (EP, compute-optimal):**
  - 50M ≈ 10–45 H200-hr · 150M ≈ 100–400 H200-hr → both fit our **local + Stanford A6000/A5000s**
  - 400M ≈ 0.7k–2.8k H100-hr → **AWS**
  - **1B short-run validation (~1–2B tokens) ≈ 230–1,700 H100-hr → AWS (the target: an "it scales" point, not a headline 1B)**
  - 1B full Chinchilla ≈ 4.3k–17k H100-hr → only if we go for a flagship 1B
- **Ablations** (at the 15–50M debug scale) ≈ 0.5k–3k A6000-hours total → **local**.

So the concrete AWS ask is on the order of a **few thousand H100-hours** for the 400M–1B short-run rungs; the small rungs and ablations we cover ourselves. I can send a finer per-experiment breakdown and a Stage-2 methods note whenever useful.

Thank you again — really grateful for the interest and the offer.

Best,
Yuren

---
NOTES (not for sending):
- Attach: report.pdf (Stage-1 dynamics).
- Demo excerpt = seed=4 (Lily+Max, cleanest). Alt = seed=7 (Timmy + toy car, full broke→fixed→learned arc).
- 2.0 is framed as OUR internal legibility threshold (no explicit prior agreement with Ben).
- Result is the best ~15M debug recipe (val CE 1.97, pema), prompt "Once upon a time,", NOT yet migrated to 50M.
- Compute numbers from SCELLIER_OUTREACH.md (H200≈H100; ×15–60 EP overhead).
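
Sanity check for the ladder figures quoted in the email. This is a minimal sketch, not the derivation used in SCELLIER_OUTREACH.md: it assumes the 1B BP anchor is a compute-optimal run at roughly 20 tokens per parameter, that cost scales with params × tokens, and that the EP overhead is a flat 15–60× multiplier. The names `bp_gpu_hours`, `ep_gpu_hours`, and `TOKENS_PER_PARAM` are illustrative only. Under those assumptions the script lands close to the quoted ranges (within rounding).

```python
# Rough compute-ladder check. Assumptions (NOT taken from the source notes):
#   - the 1B BP anchor run is compute-optimal at ~20 tokens per parameter
#   - BP cost scales linearly with (params * tokens)
#   - EP costs a flat 15-60x multiple of BP, independent of scale

ANCHOR_PARAMS = 1e9
ANCHOR_GPU_HOURS = 36 * 8      # 36 h on 8 GPUs, from the email's anchor
TOKENS_PER_PARAM = 20          # assumed Chinchilla-style ratio
EP_OVERHEAD = (15, 60)         # low/high EP-over-BP multiplier from the email


def bp_gpu_hours(params, tokens=None):
    """Estimated BP GPU-hours, scaled from the 1B anchor by params * tokens."""
    if tokens is None:
        tokens = TOKENS_PER_PARAM * params  # compute-optimal token budget
    anchor_tokens = TOKENS_PER_PARAM * ANCHOR_PARAMS
    return ANCHOR_GPU_HOURS * (params / ANCHOR_PARAMS) * (tokens / anchor_tokens)


def ep_gpu_hours(params, tokens=None):
    """(low, high) EP GPU-hours after applying the overhead range."""
    bp = bp_gpu_hours(params, tokens)
    return EP_OVERHEAD[0] * bp, EP_OVERHEAD[1] * bp


# Compute-optimal rungs of the ladder.
for label, n in [("50M", 50e6), ("150M", 150e6), ("400M", 400e6), ("1B", 1e9)]:
    lo, hi = ep_gpu_hours(n)
    print(f"{label:>5} compute-optimal: {lo:,.0f} to {hi:,.0f} GPU-hr")

# 1B short-run validation: low end uses 1B tokens, high end uses 2B tokens.
short_lo, _ = ep_gpu_hours(1e9, tokens=1e9)
_, short_hi = ep_gpu_hours(1e9, tokens=2e9)
print(f"   1B short run: {short_lo:,.0f} to {short_hi:,.0f} GPU-hr")
```

If the anchor run used a different token budget, only `TOKENS_PER_PARAM` changes the short-run estimate; the compute-optimal rungs depend only on the ratio of parameter counts squared.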