Initial commit: ept — backprop-free equilibrium transformer (EP)

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
author: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
committer: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
commit: b83947778e2c776f757a07d4719b7ce961d7ed55 (patch)
tree: b9cc01d7adda691d9156d9d04f4fb2f644674e96 /docs/outreach
3 files changed, 303 insertions, 0 deletions
diff --git a/docs/outreach/EMAIL_DRAFT_BEN.md b/docs/outreach/EMAIL_DRAFT_BEN.md
new file mode 100644
index 0000000..8e7532f
--- /dev/null
+++ b/docs/outreach/EMAIL_DRAFT_BEN.md
@@ -0,0 +1,42 @@
+# Draft reply to Ben Scellier (benjamin@rain.ai) — subject: Re: Scaling transformer trained by EP
+
+Hi Ben,
+
+Thanks again — and apologies for the delay; I wanted to send something concrete, and to wait until the model's samples were actually legible before sharing. The work has since split into two stages that map cleanly onto your offer.
+
+**Stage 1 — Dynamics & convergence of EP with asymmetric (non-conservative) weights.** This is now a self-contained piece; a short report is attached (5 pp). AsymEP / VF-EP assume the free phase converges — we characterize *when* it does. The non-conservative free-phase operator undergoes a **supercritical Hopf bifurcation** (we confirm ℓ₁ < 0 via the normal form) as the coupling crosses a boundary set by gain/leak × asymmetry; past it the fixed point gives way to a bounded limit cycle and EP breaks. We then characterize three steerings that hold the system below the boundary — a spectral-abscissa projection, a residual-driven adaptive leak, and a finite-time-Lyapunov (gradient-flossing) penalty — and, measured over 200 seeds, report an honest trade-off: the spectral projection is the only universally robust one, while the other two are each robust on a single architecture and parameter-sensitive on the other. It's all MLP/CNN/RNN, so it stands on its own — and it gives us enough of a handle on the scalability question (what makes the free phase converge or break) to scale the transformer with some confidence.
+
+I think this sits naturally beside your CET: CET secures convergence *by construction* (the conservative attention energy, no free V/O), whereas this studies the complementary regime — the full non-conservative operator with independent Q/K/V/O, where convergence is precisely what is no longer free. I'd be very grateful for any feedback on the report — experiments, framing, writing, all welcome. And if the direction interests you, we would be honored to have you as one of the senior authors; no venue is committed yet, so there's room to shape it together.
+
+**Stage 2 — Scaling the EP-trained transformer (the larger, ongoing effort).** This carries a good deal of additional method and engineering that Stage 1 doesn't cover — the adjoint-consistent nudge that makes the non-conservative attention trainable (it recovers the exact gradient, cosine ≈ 1 with BPTT), the block architecture, and the open analog-friendly-optimizer question. A brief progress note: we've been iterating recipes at the ~15M debug scale (faster turnaround), and the best of these on TinyStories just reached val CE ≈ 1.97 — for us that's roughly the threshold where generations become legible — and it produces coherent little stories from the prompt *"Once upon a time,"*, e.g.:
+
+> *"Once upon a time, there was a little girl named Lily. She had a puppy named Max. Lily loved to play with her pet… Lily asked her mom, 'Can I play with Max?' Her mommy replied, 'Sure, but first, you have to play with me.'"*
+
+(This is our current best debug recipe; we haven't migrated it to 50M yet — that's the immediate next step.)
+
+On compute-only vs. methodology: I'd much prefer the methodology side — your guidance, especially on convergence and on the analog-friendly optimizer, would be invaluable.
+
+A more detailed compute picture (rough — the EP overhead dominates the uncertainty):
+- **EP overhead over BP:** ~15–60× per token (relaxation steps × nudged-phase evaluations; the nudging-phase design is the main lever, and this constant is exactly what physical settling would remove).
+- **Anchor:** a 1B BP run ≈ 36 h on 8×H200 ≈ 290 H200-GPU-hours.
+- **Ladder (EP, compute-optimal):**
+  - 50M ≈ 10–45 H200-hr · 150M ≈ 100–400 H200-hr → both fit our **local + Stanford A6000/A5000s**
+  - 400M ≈ 0.7k–2.8k H100-hr → **AWS**
+  - **1B short-run validation (~1–2B tokens) ≈ 230–1,700 H100-hr → AWS (the target: an "it scales" point, not a headline 1B)**
+  - 1B full Chinchilla ≈ 4.3k–17k H100-hr → only if we go for a flagship 1B
+- **Ablations** (at the 15–50M debug scale) ≈ 0.5k–3k A6000-hours total → **local**.
+
+So the concrete AWS ask is on the order of a **few thousand H100-hours** for the 400M–1B short-run rungs; the small rungs and ablations we cover ourselves. I can send a finer per-experiment breakdown and a Stage-2 methods note whenever useful.
+
+Thank you again — really grateful for the interest and the offer.
+
+Best,
+Yuren
+
+---
+NOTES (not for sending):
+- Attach: report.pdf (Stage-1 dynamics).
+- Demo excerpt = seed=4 (Lily+Max, cleanest). Alt = seed=7 (Timmy + toy car, full broke→fixed→learned arc).
+- 2.0 is framed as OUR internal legibility threshold (no explicit prior agreement with Ben).
+- Result is the best ~15M debug recipe (val CE 1.97, pema), prompt "Once upon a time,", NOT yet migrated to 50M.
+- Compute numbers from SCELLIER_OUTREACH.md (H200≈H100; ×15–60 EP overhead).
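Illustrative sketch for the Stage-1 claim in EMAIL_DRAFT_BEN.md (not part of the committed file, and not the ep_run code): the radial part of the textbook Hopf normal form, dr/dt = mu*r + l1*r^3, shows the behaviour the draft describes when l1 < 0. Below the boundary (mu < 0) the state settles to the fixed point; past it (mu > 0) it settles onto a bounded limit cycle of radius sqrt(-mu/l1). The names `settle_radius` and `predicted_radius`, the value l1 = -1, and the forward-Euler integrator are assumptions made for this toy; mu stands in for how far the coupling sits from the boundary.

```python
import math


def settle_radius(mu, l1=-1.0, r0=0.5, dt=0.01, steps=20000):
    """Forward-Euler integration of dr/dt = mu*r + l1*r**3 (radial Hopf normal form).

    mu, l1, r0, dt and steps are illustrative values, not taken from the report.
    """
    r = r0
    for _ in range(steps):
        r += dt * (mu * r + l1 * r ** 3)
    return r


def predicted_radius(mu, l1=-1.0):
    """Limit-cycle radius of the supercritical case (l1 < 0); zero below the boundary."""
    return math.sqrt(-mu / l1) if mu > 0 else 0.0


# Below the boundary the oscillation amplitude decays to the fixed point.
below = settle_radius(mu=-0.2)
# Past the boundary the amplitude saturates at a bounded limit cycle.
above = settle_radius(mu=0.2)

print(f"mu=-0.2: settled radius {below:.3e} (predicted {predicted_radius(-0.2):.3f})")
print(f"mu=+0.2: settled radius {above:.6f} (predicted {predicted_radius(0.2):.6f})")
```

Because l1 < 0 the cycle that appears is bounded and grows continuously from zero amplitude, which is why the draft calls the bifurcation supercritical rather than a hard loss of stability.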
diff --git a/docs/outreach/OUTREACH_TARGETS.md b/docs/outreach/OUTREACH_TARGETS.md
new file mode 100644
index 0000000..ffbd177
--- /dev/null
+++ b/docs/outreach/OUTREACH_TARGETS.md
@@ -0,0 +1,199 @@
+# EP analog-hardware collaboration — outreach targets (2026-06-21)
+Per-group PhD/PI profiles from 5 research agents. Accuracy discipline: emails only where published or netid on an
+official directory; "—" = not public, route via PI (no invented addresses). Verify "current" status before sending —
+students graduate. Companion: COLLABORATOR_BRIEF.md (the one-pager), HW_RESEARCH_FINDINGS.md (citations).
+
+## The cross-cutting framing (true at EVERY group — this is our wedge)
+Every group has EITHER the analog substrate OR an on-chip-training piece — **none has backprop-free LOCAL
+in-situ update on analog weights**. The EP local-update rule is the genuinely new thing WE bring; everyone else
+does inference-only analog MVM, or on-chip *gradient/backprop* training. First-mover gap. Say it plainly.
+
+## Strategy
+PhD-first / cc-PI where there's a clear hands-on student; PI-direct where the group is small or no student fits.
+UIUC first (home turf, all 3 layers local). Stanford = Phase-2 warm intro via your student. THU = strongest
+substrate, hardest access.
+
+---
+
+## 1. Shanbhag (UIUC ECE) — LEAD. Route: PhD-first, cc Shanbhag. The closest existing substrate.
+His JSSC-2018 DIMA chip already did analog MVM + on-chip SGD weight write-back. Current bench is rich but no
+single student spans all of {analog-MVM substrate + settling + on-chip learning} — pitch the trio:
+- **Soonha Hwang** `soonhah2@illinois.edu` — HIGH, **email first**. Building a *transformer* mixed-signal CIM chip
+  (28nm DiT accelerator, ESSERC 2025); grad 2028 (multi-year runway). Owns substrate (a)+(b). Caveat: his chip reads
+  inference-only — in-situ weight update would be new (= our piece).
+- **Mihir Kavishwar** `mihirvk2@illinois.edu` (publicly listed) — HIGH. Analog-MVM + compute-SNR-optimal ADCs
+  (CACTUS, arXiv 2507.09776). The readout/SNR layer the equilibrium readout lives or dies on.
+- **Vignesh Sundaresha** `vs49@illinois.edu` — HIGH on the *learning* axis. GEARnn in-situ on-edge training
+  (arXiv 2410.07691). Caveat: algorithm-level, standard training (not EP, not yet a mixed-signal learning circuit).
+- Others: Shuo Li (postdoc, analog-CIM characterization — verify still here), Saion Roy (MRAM/resistive CIM + error
+  compensation — but drifting to security, see Hanumolu note), Kaining Zhou (CIM simulation framework, kainingz@).
+
+## 2. Mingu Kang (UC SAN DIEGO faculty) — the in-situ-update know-how that LEFT UIUC. Route: PI-direct (peer faculty).
+**Built the original DIMA on-chip-SGD-write-back substrate** we cite (PhD 2017 w/ Shanbhag), now PI at UCSD. The
+person who most owns "analog MVM + on-chip weight update" anywhere. Strong faculty-level collaborator specifically for
+the update-loop piece. (Sujan Gonugondla, the other DIMA-trainer author, → Amazon, industry.)
+
+## 3. Wenjuan Zhu (UIUC ECE) — device layer. Route: PI-DIRECT (small group, device experts graduated).
+Email **wjzhu@illinois.edu**. HONEST FRAMING: their ferroelectric work is **memory/logic (FeFET memory, CAM,
+reconfigurable transistors), NOT synaptic weight-update training** — fit is device-platform overlap (vdW / CuInP₂S₆
+FeFETs = nonvolatile, electrically-set, multilevel conductance). Pitch = "extend your FeFETs to in-situ analog
+training," do NOT imply they already do it. Name-drop:
+- **Junzhe Kang** `junzhek2@illinois.edu` — the CIPS-FeFET / in-memory-computing lead (ACS Nano 2024/2025). On the
+  graduation boundary (2024 dissertation, still publishing 2026) — verify status.
+- **Ye Lin** — current student on the CIPS platform (—, via PI). Alumni (now industry): Zijing Zhao→Apple, Hojoon Ryu→Intel.
+
+## 4. Hanumolu (UIUC ECE) — converter/control-loop glue. Route: PI-DIRECT (no student is a pure data-converter).
+Email **hanumolu@illinois.edu**, ask him to route. Reality: his group is a **clocking / high-speed-link / frequency-
+reference** shop, not a data-converter shop — no current PhD has ADC/DAC/switched-cap as primary thesis. Best fits if
+he points to a student:
+- **Mahmoud Khalil** `mkhalil4@illinois.edu` — best converter+loop match: sampling-PLL (ISSCC 2024, 1st author) +
+  industry ADC + DC-DC converter experience. The settle→nudge feedback loop is his native language.
+- **Sujay Patel** `sujaysp2@illinois.edu` — mixed-signal links/equalizers + recovery loops (CICC 2026).
+- **Jason (Shuozhen) Liu** `sl111@illinois.edu` (netid single-sourced, verify) — only one stating ADC focus, but junior/no papers.
+- NOTE: the most topically-relevant IMC+ADC person, **Saion Roy**, is **Shanbhag's** grad (now security postdoc @ Northeastern), NOT Hanumolu's — don't mis-target.
+
+## 5. Wong + Raina (STANFORD) — Phase-2 escalation. Route: WARM INTRO via your Stanford student.
+Stanford leads **foundry-RRAM-as-weights + on-chip training** (CHIMERA/MINOTAUR) + **NeuRRAM analog-MVM inference** —
+but NOT analog in-situ gradient programming (the EP piece is still ours).
+- **Jeffrey Yu** `jeffreyy@stanford.edu` (Raina, current) — PRIMARY. On-chip transformer fine-tuning w/ RRAM
+  (MINOTAUR; 8-bit transformer fine-tuning ISCA 2024). The bridge to "train a transformer on RRAM," still in the building.
+- **Shuhan Liu** (Wong, final-year — move fast) — RRAM device/array + edge continual training (IEDM 2024). Loop in for device side.
+- Refs (not resident): Kartik Prabhu (CHIMERA/MINOTAUR; prob. → Meta, verify), Weier Wan (NeuRRAM lead → CTO Aizip, industry consult).
+
+## 6. Tsinghua (THU) — strongest in-situ SUBSTRATE in the world, but backprop-family + hardest access.
+Wu/Gao/Qian LEMON lab + Jianshi Tang, School of Integrated Circuits. THE clear #1 for fabricated, system-integrated
+silicon that closes the weight-write loop ON-CHIP during learning — the inference-only barrier everyone else hits, they've
+crossed. BUT (the "not EP-flavored" point, confirmed sharply): every on-chip rule they've shipped is **backprop-family /
+NON-local** (Sign-Backprop [Gao et al., Neural Networks 2018]; STELLAR's sign-SGD), **MLP/CNN-scale, never a transformer,
+never EP**. Their only energy/attractor touchpoint is a 2015 Hopfield associative-memory (Hebbian recall, not training).
+⇒ clean whitespace: we'd bring the first local/EP rule + first in-situ transformer to the one group with write-capable silicon.
+- **Jianshi Tang (唐建石)** — **jtang@tsinghua.edu.cn** (ONLY university-published email; Tenured Assoc. Prof + Vice Dean) →
+  BEST first contact, the device/integration translator. Senior enough to commit, reachable.
+- **Bin Gao (高滨)**: technical co-target; owns the on-chip update circuitry EP would repurpose (first author of the Sign-Backprop rule).
+- **Huaqiang Wu (吴华强)** — ultimate PI (LEMON lab, http://stor.ime.tsinghua.edu.cn/), hardest to reach cold.
+- Builders: Peng Yao (Nature-2020 CNN, postdoc-level), Wenbin Zhang (STELLAR/Science 2023). Chips: Nature 2020 CNN (HYBRID-
+  trained — only last FC layer in-situ, updates computed in software), STELLAR (Science 2023, full on-chip sign-SGD, 784×100×10
+  MLP), Attar (Sci China Inf Sci 2025 — RRAM transformer but INFERENCE-only). HfOx endurance 10^7 cycles (Nat Electron 2024).
+- ACCESS caveat: top-3-globally, Nature/Science yearly, many suitors + strategic/IP/scope sensitivities. Cold email MUST lead
+  with the specific complementary asset (working EP transformer needing exactly their write-capable substrate → offers them the
+  first local-learning + first in-situ transformer result on their hardware). Warm intro or concrete joint-demo proposal needed.
+- NB for the EP-native map: the agent flagged **Williams–Kumar–Kendall, "Activity-difference training of DNNs using memristor
+  crossbars," Nature Electronics 2023** — an EP-FAMILY (contrastive) rule ON REAL memristor crossbars = a candidate unicorn bridge;
+  and **Grollier (CNRS/Thales), "Training an Ising machine with EP," Nat Commun 2024**. (Confirm in the EP-native synthesis.)
+
+---
+
+## 7. EP-NATIVE complement groups — the "EP-flavored" people (fills the gap EVERY substrate group has)
+The substrate groups (1-6) are EP-poor by design (= our first-mover gap). The EP-native community is a SEPARATE
+world — EP-rich, mostly device-light. Pair one of each.
+- **UPenn physical-learning** (Durian + Andrea Liu faculty; **Dillavou** `dillavou@sas.upenn.edu` hands-on, now part-ARIA;
+  **Menny Stern → own group @ AMOLF Amsterdam**): **Coupled Learning** = EP's experimental sibling, built on real
+  self-adjusting analog circuits (PNAS 2024 "Machine learning without a processor"). The closest real-hardware analog to our method.
+- **Benjamin Scellier** (EP CO-INVENTOR; now **Rain / Rain AI UK**, ARIA-funded; `benjamin@rain.ai`, bscellier.github.io):
+  source authority on EP estimators + energy-based formulations = squarely our AEP/holomorphic domain. HIGH (industry posture).
+- **Axel Laborieux** (→ **Huawei Zurich**; laborieux-axel.github.io) + **Friedemann Zenke** (FMI Basel, senior gateway):
+  co-invented BOTH holomorphic EP (NeurIPS 2022) AND asymmetric EP / Jacobian homeostasis (ICLR 2024) — **literally the two
+  ingredients we build on.** THE algorithm-theory complement. + Maxence Ernoult (→ DeepMind), the estimator-bias-scaling track.
+- **Dmitry Krotov** (MIT-IBM): the **Energy Transformer** (NeurIPS 2023) IS our forward model (energy→fixed-point attention) —
+  trained by autodiff; "train it without backprop" is exactly our EP contribution. HIGH theory complement, no hardware.
+
+## 8. UNICORNS — EP-native AND real updatable device (the rare bridges)
+- **Julie Grollier** (CNRS, **Laboratoire Albert Fert**, Paris-Saclay; neurophysics.cnrs-thales.fr): **ran EP on PHYSICAL
+  hardware** — "Training an Ising machine with EP," Nat Commun 2024 (D-Wave); spintronic-native. The cleanest unicorn: EP-on-
+  hardware experience + device substrate. **Single best EP-native complement.**
+- **Yi / Kendall / Williams / Kumar** — "Activity-difference training of DNNs using memristor crossbars," **Nature Electronics
+  2023**: contrastive two-phase (EP-flavored) training executed on a **fabricated 64×64 RRAM chip** = the closest "EP-on-silicon"
+  that exists, and it overlaps our RRAM/CIM world. Suhas Kumar @ **Sandia**, R. Stanley Williams @ **Texas A&M**, Kendall @ Rain.
+- **Damien Querlioz** (CNRS, **C2N** Paris-Saclay; `damien.querlioz@c2n.upsaclay.fr`): EP-algorithm-native + a real RRAM fab
+  pipeline (CEA-Leti / Elisa Vianello) — near-unicorn (his fabricated learning demos are Bayesian, not yet EP). **Most credible
+  Western partner to actually FABRICATE EP on a crossbar.**
+- ⇒ **Université Paris-Saclay (Grollier + Querlioz, who co-author) = the global EP-on-hardware cluster.**
+- Also EP+device-intent (sim-now): Talatchian/Peters (SPINTEC Grenoble, EP-under-analog-noise), Alex Gower (Cambridge/Nokia,
+  EP on oscillator Ising machines), Kaushik Roy/Sumeet Gupta (Purdue — EP algo + CIM/spintronic devices, not yet fused).
+- NOT EP (don't chase): IBM (Ambrogio/Burr/Sebastian, PCM backprop), Ielmini/PoliMi, McMahon/Wright (physics-aware backprop,
+  not EP), Marquardt (Hamiltonian echo). Rain AI the COMPANY = distressed/acquihire-pending → engage Kendall as an individual.
+
+## REVISED pairing recommendation (the answer to "not EP-flavored")
+- **Lead EP-native = Grollier** (unicorn) × a crossbar substrate (**Tsinghua-Wu/Gao** most mature, **Wong/Raina** most reachable).
+- **Western fab route = Querlioz + Vianello (CEA-Leti)** × **Wenjuan Zhu** — both real updatable-device fabs; Querlioz brings EP fluency.
+- **Algorithm de-risk layer = Laborieux/Zenke** — own the holo + asymmetric-EP bias theory that decides if EP survives analog noise
+  on ANY substrate. **Shanbhag** pairs best here as the systems/CIM error-tolerance partner (his expertise), not the device fab.
+- **High-value individual outreach: Scellier** + the **Kumar/Kendall activity-difference team** (your proof that contrastive-equilibrium
+  training already runs on a real memristor chip).
+
+---
+
+## Recommended sequencing
+1. **Shanbhag trio first** (Hwang+Kavishwar, cc Shanbhag; mention Sundaresha) — home dept, closest substrate, richest bench.
+   Consider a Zhai-brokered/in-person intro instead of cold email (same department = warmest path).
+2. **Parallel UIUC PI-direct**: Wenjuan Zhu (device, "extend your FeFETs") + Hanumolu (glue, "point me to a converter student").
+3. **Mingu Kang (UCSD)** — peer-faculty email for the in-situ-update expertise specifically.
+4. **Stanford warm intro** (Jeffrey Yu) via your student — Phase 2.
+5. **THU** — only if/when a connection exists; else cite as the substrate precedent, not a near-term collaborator.
+
+---
+
+## ⏸ STATUS (2026-06-21): HOLD — DO NOT SEND until the 33M demo + scaling dossier
+**User decision (CONFIRMED 2026-06-21): outreach is gated on the ~33M "readable" demo + scaling-law dossier (task #15), NOT the
+C512/2.09 milestone.** Send nothing until there's a readable-generation demo + a scaling-law dossier to lead with.
+(C512 EP descending past the 2.09 wall toward ~1.8 is a prerequisite step that validates the recipe, NOT the outreach gate —
+the gate is the bigger, showable 33M artifact.) Until then: no contact with anyone above.
+When the bar is met: set sender title, render COLLABORATOR_BRIEF.pdf, attach it + ept_method_intro.pdf, optionally ask Prof. Zhai
+for a warm intro to Shanbhag/Hanumolu first. All profiles/contacts/pairing/drafts above are durable and ready.
+
+## Email drafts (READY, gated — copy-paste when the bar is met)
+
+### Draft 1 — Shanbhag group · To: Soonha Hwang (soonhah2@), Mihir Kavishwar (mihirvk2@) · cc: Shanbhag
+Subject: Backprop-free (Equilibrium-Propagation) transformer training — a fit for your in-memory CIM work?
+
+Hi Soonha and Mihir,
+
+I'm Yuren Hao, working on backprop-free training in ChengXiang Zhai's group at UIUC. We've gotten Equilibrium Propagation (EP)
+to train a transformer as a physical equilibrium system: the forward pass is a damped relaxation that settles to a fixed point,
+and the weight update is local — computed from a free vs. a nudged settle, no backpropagation. In simulation the EP gradient
+matches backprop (cosine ≈ 1) and comes within a small gap of a same-parameter backprop-trained transformer.
+
+Your DiT compute-in-memory accelerator, together with Mihir's compute-SNR-optimal ADC work, is the closest existing substrate I've
+found to what this needs: analog MVM + a settling loop. The one new ingredient is EP's in-situ local weight update, which is
+actually a simpler thing to put on a crossbar than on-chip backprop.
+
+Could I grab 20 minutes to explore whether a small demo — one equilibrium-transformer block on a CIM substrate + our EP control
+loop — is feasible? A one-page overview and short method note are attached. (cc'ing Prof. Shanbhag.)
+
+Thanks, Yuren
+
+### Draft 2 — Wenjuan Zhu · To: wjzhu@illinois.edu (PI-direct)
+Subject: Extending your vdW / CuInP₂S₆ FeFETs to in-situ-trainable analog weights?
+
+Dear Prof. Zhu,
+
+I'm Yuren Hao, working on backprop-free training in ChengXiang Zhai's group at UIUC. We train a transformer as a physical
+equilibrium system using Equilibrium Propagation — no backprop — where learning is a local update from two settled states, and
+the key hardware need is an analog weight whose conductance can be updated in-situ during training.
+
+Your group's vdW / CuInP₂S₆ ferroelectric reconfigurable devices — nonvolatile, electrically programmable, multilevel
+conductance — look like a strong fit for exactly that role. I realize that work has centered on memory and logic rather than
+training, so I'd be keen to explore whether those devices could serve as in-situ-trainable analog synapses for an EP-trained network.
+
+Would you have 20 minutes for me to share what we have (a working EP-transformer in simulation + an analog-noise model) and
+discuss feasibility? One-page overview and a method note attached.
+
+Best, Yuren
+
+### Draft 3 — Hanumolu · To: hanumolu@illinois.edu (PI-direct, ask to route)
+Subject: Mixed-signal converter / control-loop partner for an analog EP-training demo?
+
+Dear Prof. Hanumolu,
+
+I'm Yuren Hao, working on backprop-free training in ChengXiang Zhai's group at UIUC. We're building toward an analog hardware
+demo of Equilibrium Propagation — training a transformer as a physical equilibrium system, where the forward pass is an analog
+settling loop and the weight update is local (no backprop).
+
+Beyond the in-memory compute array, this needs a mixed-signal layer your group is ideally suited for: fast ADC/DAC to read the
+settled state and apply a small "nudge," and switched-cap integrators for the relaxation/control loop. Since that's converter /
+feedback-loop expertise rather than the ML side, could you point me to a student who might be interested — or spare 15 minutes
+to discuss?
+
+A one-pager and method note are attached. Best, Yuren
+
+### Title TODO (all drafts): set Yuren's title ("PhD student" / "researcher") before sending.
diff --git a/docs/outreach/SCELLIER_OUTREACH.md b/docs/outreach/SCELLIER_OUTREACH.md
new file mode 100644
index 0000000..35ad9e0
--- /dev/null
+++ b/docs/outreach/SCELLIER_OUTREACH.md
@@ -0,0 +1,62 @@
+# Scellier (Rain AI) outreach — 2-stage framing, compute estimate, invite plan
+
+**Goal:** present the work to Ben Scellier (benjamin@rain.ai, Rain AI; EP inventor; VF-EP author) as a **2-stage project**, attach a PDF report once the aep-dynamics figures are done, invite him for **any feedback** (experiments / framing / writing), and — when complete — invite him as **one of the senior authors**. No venue planned yet. Email thread already open (he offered AWS credits + asked about collaboration depth; he's likely waiting on our ept update / methods note).
+
+## The 2-stage framing (this is the pitch)
+
+- **Stage 1 — Dynamics & convergence of EP with asymmetric weights (aep-dynamics).** Current AsymEP literature ASSUMES convergence. We characterize **WHEN** the non-conservative VF actually converges, **WHY** it fails (a supercritical Hopf bifurcation of the free-phase operator), and the **steerings** that guarantee convergence (spectral / floss / adaptleak), with their trade-offs. **Almost done. MLP/CNN/RNN only — does NOT touch the transformer or LLM scaling** (keeps ept un-exposed). This is the self-contained piece that gets Ben as senior author — it's his wheelhouse and it directly answers the convergence pain point he already knows about (the "Jacobian control explodes, kills the ~1-cosine gradient" issue → now a characterized Hopf + a tunable steering trade-off, a much stronger story).
+- **Stage 2 — Scaling EP-trained transformers to LLMs (ept).** Standard non-conservative attention (Q≠K≠V≠O) + adjoint-consistent nudging. 15M debug model already within-noise of BP; 50M on TinyStories looks great. Next = **scale ladder + ablations**, which need compute → the AWS / collaborator-GPU conversation. This is the ongoing bigger collaboration.
+
+Clean separation: **Stage 1 = paper + senior-authorship; Stage 2 = compute collaboration.**
+
+## Compute estimate (rough — the EP overhead range dominates the uncertainty)
+
+**EP overhead over BP:** ~15–60× per token (relaxation steps × nudged-phase evaluations); the **nudging-phase design is the lever** — getting it to ~15× (vs 60×) is what makes 1B affordable. This constant is exactly what physical settling removes (the hardware argument).
+
+**Anchor:** 1B LLM, BP, ≈ 36 h on 8×H200 = **288 H200-GPU-hrs** (Chinchilla-ish ~20B tokens).
+**GPU throughput factors (bf16, rough):** H200 ≈ H100 ≈ **6.4× A6000** ≈ **~11× A5000**.
+
+**Scale ladder (compute-optimal, compute ∝ N²):**
+
+| rung | BP (H200-hr) | EP (H200-hr, 15–60×) | EP (A6000-hr) | run where |
+|---|---|---|---|---|
+| 50M  | ~0.7 | 11–43      | 70–280        | local/Stanford A6000/A5000 |
+| 150M | ~6.5 | 100–390    | 640–2500      | local/Stanford A6000 |
+| 400M | ~46  | 700–2800   | 4500–18000    | borderline → AWS |
+| 1B (full Chinchilla) | 288 | **4300–17000** | 28000–110000 (≈ years on A6000×2) | **AWS H100** |
+| 1B (SHORT validation, ~1–2B tok) | ~15–30 | **230–1700** | — | AWS H100, affordable |
+
+**Ablations** (debug scale 15–50M on TinyStories, small fixed corpus): ~5–25 EP-H200-hr each (~30–160 A6000-hr); ~15–20 runs → **~500–3000 A6000-hr total** → fits the 6 local/Stanford cards (~5 A6000-equiv) over ~1–3 weeks.
+
+**Resource mapping:**
+- **Ablations + small rungs (50–150M):** local A6000×2 + Stanford A6000×2 + A5000×2 (on-demand; more queued). Weeks of wall-clock.
+- **400M–1B:** too big for A6000s (1B-full EP = 28k–110k A6000-hr). → **Ben's AWS H100.** Full-Chinchilla 1B = **4300–17000 H100-hr** (≈ 540–2100 hrs on one 8×H100 node ≈ 22–88 days; faster multi-node). **SHORT-run 1B validation (1–2B tokens) ≈ 230–1700 H100-hr**: the affordable path, likely what the paper needs (an "it scales" point, not a full SOTA 1B).
+
+**The ask to Ben (Stage 2):** AWS H100 for the 400M–1B rungs, on the order of a **few thousand H100-hrs** for the short-run ladder (4.3k–17k if full Chinchilla 1B). Note the cost is **highly sensitive to the EP overhead** (15× vs 60×) → optimizing the nudging phase is the priority lever.
+
+## Invite / deliverable plan
+
+1. Finish the aep-dynamics figures (P1 cover [Yuren] · P2 phenomenon ✅ · P3 MLP ✅ · P4 dropped→text · P5 cross-arch [in progress, 200-seed bands running]).
+2. Write up Stage-1 dynamics → **attach a PDF report** to Ben.
+3. Email: report the 2-stage framing + ept progress (resreg_warm best **1.9313**, basically at the agreed loss<2.0 success bar) + the Stage-2 compute estimate above.
+4. Invite: "we'd value any feedback — experiments, framing, writing — and, when complete, would be honored to have you as one of the senior authors." No venue committed yet.
+
+## CET delta — what ept adds over Scellier's own CET (openreview Qrfml76eWJ)
+
+**CET (Høier, Kerjan, Scellier, ICLR'26 AM workshop):** an ENERGY (conservative) transformer = convergent Energy-Transformer. Attention is modern-Hopfield energy: `A=Q·K`, `E^att=-1/γ Σlog Σexp(γA)`, force = ∇E → **conservative**. Allows Q≠K (separate W^Q,W^K) **but NO free V/O projections** (values = tokens via the energy gradient; no W^V/W^O). Vanilla centered EP (Laborieux), T1=150 free / T2=5 nudge. Task = CELEBA masked-image completion (MSE); single energy block, temporally unrolled. EP MSE 0.01422 ≈ TBPTE 0.01376. Convergence is FREE (gradient flow) but it's an energy surrogate, NOT the real transformer; even its convergence is hand-waved (App. A: sometimes saddles, no proof PGD descends, relies on same T=150 at train/test).
+
+**ept adds:**
+1. **Real non-conservative attention (independent Q/K/V/O)** — the actual LLM architecture. CET drops V/O to keep attention a gradient field; we keep them → Jacobian non-normal (|Jv−Jᵀv|/|Jv|≈1.4), NOT energy descent → convergence NOT free.
+2. **Adjoint-consistent nudging (−2A_J)** — vanilla EP (CET's) gives the WRONG gradient on non-gradient dynamics; our correction recovers cosine≈1 with BPTT. Extends EP from energy-based → non-conservative dynamics.
+3. **Convergence characterization + steering (aep-dynamics)** — CET gets convergence by construction (+ hand-waves it); we characterize the real instability (supercritical Hopf as attention becomes expressive) and steer it (spectral/floss/adaptleak).
+4. **Language + scale + depth** — CET = single block, image completion, CELEBA. ept = real SLM (TinyStories→1B), multi-block, language; 50M within-noise of BP.
+
+**Honest trade:** ept GIVES UP the energy/Lyapunov interpretation (no global energy → the convergence problem is the price). Complementary: CET = EP for the energy-restricted transformer (clean); ept = EP for the REAL non-conservative transformer (full architecture, convergence earned).
+
+**Pitch to Scellier = build-on-CET:** "you showed EP trains an energy-restricted transformer; we extend it to the full non-conservative transformer (independent Q/K/V/O) via a corrected nudging + a convergence theory, at LM scale." He's a CET author → instantly legible + on his hardware mission.
+
+## Status anchors (so this survives compaction)
+- Ben thread: open since 2026-06; he offered AWS credits + asked compute estimate + collaboration depth; awaiting our update.
+- Original success bar agreed with Ben: **transformer val loss < 2.0**. ept resreg_warm now best 1.9313 (≈ at the bar).
+- aep-dynamics = Stage 1 (dynamics, MLP/CNN/RNN, no transformer). ept = Stage 2 (transformer scaling).
+- Related: [[hw-outreach-plan-gated]] (Scellier already on the theory-side outreach list).
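Illustrative sketch for the compute estimate in SCELLIER_OUTREACH.md (not part of the committed file): the scale-ladder table can be reproduced from three inputs stated there, namely the anchor (1B BP ≈ 36 h on 8×H200), the compute ∝ N² rule, and the 15–60× EP overhead, plus the H200 ≈ 6.4× A6000 throughput factor. The function and constant names below are hypothetical, and the doc's table rounds the resulting values.

```python
ANCHOR_PARAMS = 1e9            # 1B-parameter anchor run
ANCHOR_BP_H200_HR = 36 * 8     # 36 h on 8 x H200 = 288 H200-GPU-hours
EP_OVERHEAD = (15, 60)         # EP cost multiplier over BP (low, high)
A6000_PER_H200 = 6.4           # rough bf16 throughput factor quoted in the doc


def bp_hours(params):
    """BP cost in H200-hours, scaling the anchor with compute proportional to N**2."""
    return ANCHOR_BP_H200_HR * (params / ANCHOR_PARAMS) ** 2


def ep_hours(params):
    """(low, high) EP cost in H200-hours for the 15-60x overhead range."""
    bp = bp_hours(params)
    return tuple(bp * k for k in EP_OVERHEAD)


def ladder(rungs):
    """Rows of (name, BP H200-hr, EP low, EP high, EP low A6000-hr, EP high A6000-hr)."""
    rows = []
    for name, params in rungs:
        lo, hi = ep_hours(params)
        rows.append((name, bp_hours(params), lo, hi,
                     lo * A6000_PER_H200, hi * A6000_PER_H200))
    return rows


RUNGS = [("50M", 50e6), ("150M", 150e6), ("400M", 400e6), ("1B", 1e9)]

for name, bp, lo, hi, a_lo, a_hi in ladder(RUNGS):
    print(f"{name:>5}  BP {bp:7.2f}  EP {lo:7.0f}-{hi:7.0f} H200-hr  "
          f"({a_lo:7.0f}-{a_hi:7.0f} A6000-hr)")
```

Changing `EP_OVERHEAD` to a single tighter value shows directly how much of the AWS ask depends on the nudging-phase constant, which is the sensitivity the doc flags.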