Diffstat (limited to 'docs')
| -rw-r--r-- | docs/COLAB_50M.md | 117 |
| -rw-r--r-- | docs/campaign/C512_PLATEAU_CAMPAIGN.md | 159 |
| -rw-r--r-- | docs/campaign/C512_ROUND2_ABCD.md | 118 |
| -rw-r--r-- | docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md | 260 |
| -rw-r--r-- | docs/campaign/FINDINGS.md | 609 |
| -rw-r--r-- | docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md | 81 |
| -rw-r--r-- | docs/hardware/COLLABORATOR_BRIEF.md | 46 |
| -rw-r--r-- | docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md | 125 |
| -rw-r--r-- | docs/hardware/GPT_PRO_PHYSICS_ANSWER.md | 46 |
| -rw-r--r-- | docs/hardware/HW_RESEARCH_FINDINGS.md | 98 |
| -rw-r--r-- | docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md | 65 |
| -rw-r--r-- | docs/hardware/SCALING_AND_HARDWARE_PLAN.md | 58 |
| -rw-r--r-- | docs/method/ARCHITECTURE.md | 117 |
| -rw-r--r-- | docs/method/EP_DERIVATION.md | 215 |
| -rw-r--r-- | docs/method/METHODS.md | 576 |
| -rw-r--r-- | docs/method/READING.md | 58 |
| -rw-r--r-- | docs/method/READING_EN.md | 54 |
| -rw-r--r-- | docs/outreach/EMAIL_DRAFT_BEN.md | 42 |
| -rw-r--r-- | docs/outreach/OUTREACH_TARGETS.md | 199 |
| -rw-r--r-- | docs/outreach/SCELLIER_OUTREACH.md | 62 |
| -rw-r--r-- | docs/paper/PAPER_A_OUTLINE.md | 57 |
21 files changed, 3162 insertions, 0 deletions
diff --git a/docs/COLAB_50M.md b/docs/COLAB_50M.md new file mode 100644 index 0000000..9236049 --- /dev/null +++ b/docs/COLAB_50M.md @@ -0,0 +1,117 @@ +# Colab 50M EP run — self-contained notebook cells + +Goal: train the 50M (C=2048) EP energy-transformer on TinyStories-BPE on a Colab A100/H100, +which fixes the two things timan1's A6000 can't: fp32 throughput and the big-width init instability +(more headroom to tune the curriculum + bigger batch). Checkpoints to Google Drive with full-state +resume, so the 12 h Colab timeout never loses progress — just re-run the training cell to continue. + +PREP (once, on your laptop): download from timan1 `~/ept/ept_colab.tar.gz` (16 KB, the code) and +upload it to your Google Drive root as `ept_colab.tar.gz`. Data is regenerated in-notebook (or +upload `~/ept/lt_ep_code/.../tinystories_bpe` bins to Drive to skip the ~40 min prep — optional). + +──────────────────────────────────────────────────────────────────────── +## Cell 1 — setup, Drive, GPU, deps +```python +import torch, subprocess, os +print(torch.__version__, torch.cuda.get_device_name(0)) +assert tuple(int(p) for p in torch.__version__.split("+")[0].split(".")[:2]) >= (2, 1), "need torch>=2.1 for torch.func/compile"  # int compare; a string compare mis-orders "2.10" vs "2.2" +print(subprocess.run(["nvidia-smi","--query-gpu=name,memory.total","--format=csv,noheader"], + capture_output=True,text=True).stdout) +from google.colab import drive; drive.mount('/content/drive') +!pip -q install tokenizers +WORK="/content/work"; DRIVE="/content/drive/MyDrive"; os.makedirs(WORK, exist_ok=True) +!tar xzf {DRIVE}/ept_colab.tar.gz -C {WORK} +print("code:", os.listdir(WORK)) +``` + +## Cell 2 — data (regenerate, cached to Drive; skip if bins already uploaded) +```python +import os +DATA="/content/drive/MyDrive/ept_data/tinystories_bpe" +if os.path.exists(f"{DATA}/train.bin"): + print("BPE bins found on Drive — reusing.") +else: + os.makedirs("/content/drive/MyDrive/ept_data/tinystories", exist_ok=True) + %cd /content/drive/MyDrive/ept_data/tinystories + !test -f train.txt || wget -q -O 
train.txt https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt + !test -f valid.txt || wget -q -O valid.txt https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt + # point the prep script at the Drive paths (rewrite the longer _bpe path first so the shorter prefix replace cannot clobber it) + src=open(f"{WORK}/prepare_tinystories_bpe.py").read() + src=src.replace("/tmp/lt_ep/data/tinystories_bpe",DATA) + src=src.replace("/tmp/lt_ep/data/tinystories","/content/drive/MyDrive/ept_data/tinystories") + open(f"{WORK}/prep_bpe_colab.py","w").write(src) + %cd {WORK} + !python prep_bpe_colab.py +print("data:", os.listdir(DATA)) +``` + +## Cell 3 — STABILITY SMOKE (always run first; ~15 min). Must survive past warmup without abort. +```python +%cd {WORK} +# C=2048 starting curriculum (muP-scaled from C=1024's lr 4e-4 -> ~2e-4; longer warmup; gentler resinit) +!python lt_ep_train.py --mode ep --attn_mode thick --B 16 --C 2048 --H 16 --T 512 \ + --c 1.0 --jacreg 1.0 --jr_floor 0.1 --res_target 1.5e-3 --jr_max 64 --res_ema 0.9 \ + --holo 2 --hr 0.02 --pema 0.999 --t1max 300 --res_est 1e-4 --t2sel 60 --res_gate 5e-3 \ + --qknorm --resinit 0.05 --warmup 2500 --compile --T1 150 --T2 20 --lr 2e-4 \ + --steps 1200 --log 100 --data {DATA} +# READ THE OUTPUT: if it ABORTs or res spikes >0.1 repeatedly through steps 600-1200, the curriculum +# is still too hot -> lower lr to 1e-4 and/or resinit 0.03 and/or warmup 4000, re-run this cell. +# If res stays <1e-2 and val descends past step 1000, the curriculum is good -> go to Cell 4. +``` + +## Cell 3b — KEEP-ALIVE (run once, then it auto-clicks connect every 60s to beat the ~90min idle kill) +Open the browser JS console (F12 → Console) on the Colab tab and paste: +```javascript +function keepAlive(){ + document.querySelector("colab-connect-button")?.shadowRoot?.querySelector("#connect")?.click(); +} +setInterval(keepAlive, 60000); +``` +This beats ONLY the idle timeout. 
The HARD cap (free 12h / Pro 24h, and Pro+ background execution +is unreliable in 2026) is unbeatable — which is why Cell 4 is built to RESUME. When Colab drops you, +just reconnect and re-run Cell 4 (after Cells 1-2 if the runtime was recycled, since they remount Drive, re-extract the code and redefine WORK/DATA); it continues from the last `--save_every` checkpoint on Drive. + +## Cell 4 — FULL RUN with Drive full-state resume. Re-run this exact cell after EVERY disconnect. +```python +%cd {WORK} +ST="/content/drive/MyDrive/ept_ckpt/s4_50m.state"; CK="/content/drive/MyDrive/ept_ckpt/s4_50m.best.pt" +import os; os.makedirs("/content/drive/MyDrive/ept_ckpt", exist_ok=True) +# --resume loads ST (weights+optimizer+sched+step+jr+best) if present -> idempotent across timeouts. +# --save_every 100 = atomic full-state save every 100 steps -> a kill loses at most ~100 steps. +!python lt_ep_train.py --mode ep --attn_mode thick --B 16 --C 2048 --H 16 --T 512 \ + --c 1.0 --jacreg 1.0 --jr_floor 0.1 --res_target 1.5e-3 --jr_max 64 --res_ema 0.9 \ + --holo 2 --hr 0.02 --pema 0.999 --t1max 300 --res_est 1e-4 --t2sel 60 --res_gate 5e-3 \ + --qknorm --resinit 0.05 --warmup 2500 --compile --T1 150 --T2 20 --lr 2e-4 \ + --steps 24000 --log 200 --save_every 100 --data {DATA} --ckpt {CK} --state {ST} --resume +# IMPORTANT: match every flag here to the curriculum that PASSED Cell 3 (esp. lr/warmup/resinit). +# On the FIRST run ST won't exist (fresh start, prints init residual); every re-run prints "[resume] from ...". +``` + +### Checkpointing guarantees (tested on timan1) +- `--state` writes the FULL state (weights + AdamW moments + LR-schedule position + step + λ + best) + to `ST.tmp` then `os.replace` → **atomic**: a kill mid-write leaves the previous good `ST` intact. +- `--resume` continues the LR schedule and optimizer momentum exactly (not a cold warm-start): + verified step 150 → resumed 151 with val still descending monotonically. +- State size at 50M ≈ 1 GB (weights+pema+opt); `--save_every 100` ≈ a 1 GB Drive write every + ~20 min of A100 wall-clock (well under Drive's daily quota). 
Lower to 50 if you want ≤10-min loss. +- `--ckpt` (CK) separately keeps the best-val weights for sampling (Cell 5), updated only on improvement. + +## Cell 5 — sample stories from the best checkpoint (run anytime; reads CK from Drive) +```python +%cd {WORK} +CK="/content/drive/MyDrive/ept_ckpt/s4_50m.best.pt" +!python sample_eq.py --ckpt {CK} --data {DATA} --C 2048 --H 16 --T 512 --use_pema --n 4 \ + --prompt "Once upon a time" --temp 0.8 --topk 40 +``` +Note: sample_eq.py reads vocab from meta.pkl; for BPE it prints token ids unless decoded — if it +shows numbers not text, ping me and I'll add the BPE decode (tokenizer.json is in {DATA}). + +──────────────────────────────────────────────────────────────────────── +NOTES +- The curriculum in Cells 3/4 is a STARTING GUESS for C=2048 (we never got it stable on timan1). + Cell 3 is there precisely to dial it in fast on the better GPU before committing Cell 4's long run. +- Full-state resume tested on timan1 (step 150 → resumed 151, optimizer/schedule intact). +- Expected cost: A100 fp32 ~2-3x an A6000 → ~0.06-0.1 it/s → 24k steps ~3-4 days of wall-clock + ACROSS resumes (so leave it, re-run Cell 4 whenever Colab drops you). H100 faster. +- sample_eq.py BPE-decode gap is the one known rough edge; tell me if Cell 5 prints ids. +``` diff --git a/docs/campaign/C512_PLATEAU_CAMPAIGN.md b/docs/campaign/C512_PLATEAU_CAMPAIGN.md new file mode 100644 index 0000000..6066c21 --- /dev/null +++ b/docs/campaign/C512_PLATEAU_CAMPAIGN.md @@ -0,0 +1,159 @@ +# C512 "Why slow + plateau" diagnostic campaign (2026-06-17) + +Standalone record of the 7-experiment campaign the user proposed to diagnose why EP at C=512 BPE +plateaus at val CE 2.40 (orphan) and never reaches the BP capability band (1.0–1.5). Self-contained; +companion logs in `/tmp/lt_ep/`, general arc in `FINDINGS.md`. + +--- + +## 0. 
Setup and the user's hypothesis + +- **Models** (TinyStories BPE-4096, C=512 H=16 T=256 B=24; random ln4096 = 8.318): + - EP "orphan" (lr 8e-4, warmup 800, resinit 0.1, holo N=2, t2sel 40, jr_floor 0.1, res_target + 1.5e-3, jr_max 32, pema 0.999): **best val CE 2.4037**, ran full 20k, zero excursions. + - Standard BP transformer (same C/H/T, mlp=4): **best val CE 1.6953**. + - Target: BP capability band **1.0–1.5** (BP-C512 itself is 1.70; BP needs C≥1024 to enter the band). + +- **User's hypothesis (verbatim intent):** the "full-time slow + plateau" is **not** simple EP + LR-inequivalence; it is a conflict between the **equilibrium architecture's contractive operating + point** and **long-context mixing**, with LR merely expressing it as "8e-4 lives, 9e-4 explodes." + Supporting prior: at a matched tight residual operating point EP ≥ BPTT; the apparent gap is + EP-tight vs loose-BPTT (non-contractive), and the gap grows with T. + +- **Three plateau types the user distinguished:** (1) invalid plateau (res ~1e-2, gate skipping, + λ pinned); (2) valid-but-over-contracted plateau (res tiny, cos high, gate-skip low, still far + from BP); (3) controller-fight / estimator-controller mismatch. + +- **Priority order requested:** exp3 (BPTT+ctl) → exp2 (LR-sweep) → exp4 (warmup/gate decouple) + → exp5 (λ grid) → exp6/exp7 (branch + mixing). exp1 (triangulation) foundational. + +--- + +## 1. The seven experiments — spec, what ran, result, read + +### exp3 — C512 BPTT+controller (the decisive matched cut) +- **Spec:** replicate the orphan EP recipe (resinit, qk-norm, warmup, jr_floor, res_target, gate, + pema, lr 8e-4) but swap the EP task gradient for exact BPTT. If it plateaus ~2.2–2.4 → tax is + architecture/controller; if ~1.9 → EP estimator/noise. +- **Ran:** `--mode bptt`, all else matched, 20k steps, fuse on. 
+- **Result:** descended cleanly to **best 3.85 @step 1400**, then **destabilized at step 1600** + (jr→32, res→0.07) and lodged in a broken basin (val 6.27, res 0.068 — just under the 0.1 fuse, + no abort) for 8400+ steps. Killed at step ~12.8k. **Worse than the EP orphan (3.85 vs 2.40).** +- **Read:** the answer came back neither 2.2–2.4 nor 1.9 — BPTT **broke**. At C512 the contraction + controller does **not** keep the exact gradient on the manifold; EP's implicit contraction-bias is + what kept the orphan alive. Inverts the S0 result (BPTT+ctl 1.635 < EP 1.676). *Caveat: single + lr/seed; BPTT might be stable at another lr.* + +### exp2 — one-step loss-decrease LR-sweep at the plateau ckpt +- **Spec:** fixed ckpt + batch, gradients {BPTT, EP-task, EP-task+jac}, one-step update over an lr + grid, same-graph eval ΔCE. Answers "is it LR-inequivalence." +- **Result** (ΔCE, negative = better): BPTT best **−0.161** (lr 1e-2); EP best **−0.042** (lr 1e-4), + **diverges for lr ≥ 3e-4**. cos(g_EP,g_BPTT) 0.107, k=|g_EP|/|g_BPTT| ≈ 486. +- **Read:** **not LR.** EP's best one-step descent over *all* lr is 4× below BPTT's; a magnitude + rescale can't fix a 0.10-aligned direction. + +### exp1 — gradient triangulation, multi-checkpoint (foundational) +- **Spec:** at step 0/200/800/2000/plateau, per-group cos of {g_EP, g_BPTT-150, g_BPTT-400}, + norm-ratio k, batch-to-batch variance, + res. 
+- **Result:** + + | ckpt | val | res | EP·BPTT150 | BPTT150·400 | k | EP self-cos | BPTT self-cos | + |---|---|---|---|---|---|---|---| + | step0 (random) | 9.90 | 2.1e-2 | 0.261 | 0.459 | 0.319 | +0.654 | +0.617 | + | step200 | 6.07 | 4.2e-9 | **0.989** | 1.000 | 1.000 | +0.371 | +0.376 | + | step800 | 4.11 | 3.9e-4 | 0.964 | 0.935 | 0.844 | +0.755 | +0.721 | + | step2000 | 3.56 | 2.9e-4 | 0.885 | 0.845 | 0.675 | +0.445 | +0.358 | + | **plateau (2.40)** | 2.40 | 2.6e-5 | **−0.045** | **1.000** | **4223** | **−0.273** | **+0.957** | + +- **Read (decisive):** the EP estimator is **faithful while descending** (cos 0.99→0.89, steps + 200–2000) and undergoes an **SNR/coherence collapse at the optimum** — cos→0, k→4000, and crucially + the **batch self-cos goes −0.27** (EP gradients on different batches anti-correlate) while BPTT is + +0.96. BPTT-150≡BPTT-400 (cos 1.00) the entire time ⇒ the true gradient is always well-defined. + **The 2.40 plateau is an EP estimator bias-floor / batch-incoherence near the optimum** — not + horizon ambiguity (ruled out here), not LR (exp2). + +### exp4 — warmup / validity-gate decouple (3 arms) +- **Spec:** armA current warmup; armB full-strength contraction early + task LR still warmup; + armC no-warmup + smaller resinit / lower branch LR. Metric: gate-skip / res-under-gate / early val. +- **Result:** armA (warmup): res stuck ~1.5e-2 for ~50 steps → gate skips the nudge → **no task + learning for ~50–60 steps** (cos≈0, the reported grad is pure jacreg); cos jumps to 0.99 once + res<gate by step ~60; best 4.11@800. armB (`cprewarm 200`): res 1.7e-2 after the pre-phase, dead- + zone only **partially** mitigated (val 5.97 vs 6.07 @100, best 4.13@600). armC (no-warmup + + resinit 0.03): **no dead-zone** (cos 0.99 from step 20, best 4.03@300) but a violent excursion + (res→0.14 @400, recovered), ended 4.71. 
+- **Read:** the warmup×gate dead-zone is real; establishing contraction once (armB) isn't enough — + warmup's tiny lr can't deepen it. Smallest-resinit + no-warmup (armC) learns fastest early but is + excursion-prone. + +### exp5 — λ floor / target / jr_max grid (warm-start from 2.40) +- **Spec:** floor {0.1,0.05,0.03} × target {1.5e-3,3e-3,5e-3} × jr_max {4,8,16,32}, 1–2k steps, fuse on. +- **Result (best val CE, 1200 warm steps):** jr_max 4 → 2.376, **jr_max 8 → 2.3648**, jr_max 16 → + 2.416, jr_max 32 (base) → 2.463; floor 0.05/target1.5e-3 → 2.477; floor 0.03/target5e-3 → 2.482. + (orphan = 2.404.) +- **Read:** **jr_max ≤ 8 beats 32 by ~0.04** and dips just below the orphan — the jr_max ceiling was + a small tax; floor/target relaxation does not help. Still far from 1–1.5: the estimator floor + dominates. + +### exp6 — branch-growth + attention entropy + qk-RMS (incl. at plateau) +- **Spec:** log |attn|/|xin|, |ffn|/|xin|, |WO|, |pj|, |fc|, attention entropy, qk RMS, logit std; + if branch contribution is small at the plateau, add trainable residual gates α with larger LR. +- **Result:** training (resinit 0.1): |WO| 1.95→4.91, |pj|→6.57, attn/xin 56→3.5, attn entropy + 4.56→3.88, qk_rms pinned 1.0. **At the plateau:** |pj|=44, |fc|=58, attn/xin≈1.0, attn entropy + 3.62, qk_rms 1.0. +- **Read:** branches grow fine and are **large** at the plateau ⇒ **no "stuck-small branch"** — the + trainable-α-gate intervention is **not indicated**. qk-norm is bounding qk RMS (1.0) as designed. + +### exp7 — mixing / influence length + per-position CE (EP vs BP) +- **Spec:** finite-diff |∂z*_t/∂x_{t−d}| vs distance + per-position CE, comparing tight-EP, + BPTT-ctl-tight, BPTT-ctl-loose, standard BP. +- **Bug caught & fixed:** the EP probe first embedded the input with random-init weights (before + loading the ckpt) → garbage (per-pos CE ~7); fixed to embed post-load and re-ran. 
+- **Result — influence ||Δ·_q||/δ vs distance d:** + + | d | 1 | 8 | 16 | 64 | 200 | + |---|---|---|---|---|---| + | BP | 5.34 | 1.81 | 0.88 | 0.15 | 0.003 | + | EP-orphan | 0.238 | 0.102 | 0.071 | 0.019 | 0.006 | + | BP ÷d1 | 1.0 | 0.34 | 0.16 | 0.029 | 0.0006 | + | EP ÷d1 | 1.0 | 0.43 | 0.30 | 0.080 | 0.025 | + + Per-position CE: BP uniformly ~0.7–1.0 lower than EP at every position (no long-range-specific gap). +- **Read:** both decay with distance; EP ~20× weaker in **absolute** coupling (scale-caveated — z* + vs hidden norm), but **normalized, EP reaches farther than BP**. ⇒ **does NOT support — arguably + refutes — "contraction cuts long-range mixing."** EP's deficit is **uniform across position**, + consistent with the estimator floor (exp1), not a mixing-length deficit. +- **Honest gap:** a stable BPTT-ctl-tight ckpt does not exist (BPTT broke, exp3), so the cross-method + comparison is **EP-vs-BP only**, not the full tight/loose/BP set requested. + +--- + +## 2. Consolidated verdict (data-level) + +**The C512 2.40 plateau is an EP estimator SNR/coherence collapse near the optimum** (exp1: cos→0, +batch self-cos −0.27, k→4000, true gradient clean). Ruled out as causes: LR (exp2), horizon +ambiguity (exp1), λ floor/target (exp5), mixing-length / contraction-cut (exp7). The architecture +has a **separate** wall: exact-gradient BPTT can't stay stable at C512 (exp3 broke at 1600). EP's +contraction-bias makes it the more robust of the two here. + +Of the user's three plateau types, the orphan is type (2) **valid-but-over-contracted** — but the +sharper statement is that "over-contracted" is not the mechanism; the binding constraint is the +**EP estimator bias floor**, which becomes batch-incoherent once the true gradient shrinks below it. + +**Two levers the data points at (decision pending):** +- (a) **Reduce the EP estimator bias floor near the optimum** — N=4 holo / tracking-AEP / lock-in / + navg averaging; jr_max≤8 also buys ~0.04 (exp5). 
The lever for EP's own plateau. +- (b) **The architecture's stability margin for exact gradients** — the separate exp3 wall. + +--- + +## 3. Code / logs + +- Trainer flags added this round: `--snap` (exp1 ckpts), `--cprewarm` (exp4 armB), `--probe_branch` + (exp6, now with attn entropy + qk RMS), `--probe_bptt` (cos/k along trajectory). +- Scripts: `lr_sweep.py` (exp2), `triangulation.py` (exp1, +batch self-cos), `mixing_probe.py` + (exp7 EP, bug-fixed), `bp_mixing.py` (exp7 BP), `bp_charlm.py --ckpt` (BP ckpt save). +- Orchestration: `master.sh` (first pass), `master2.sh` (full to-spec versions). +- Logs: `triangulation_multi.log`, `exp4_armB.log`/`exp4_armC.log`, `lam_jr{4,8,16}.log`, + `exp6_train.log`/`exp6_plateau.log`, `bp_mixing.log`, `mixing2.log` (+ the corrected re-run). +- Checkpoints: `ep_C512_bpe.best.pt` (orphan 2.40), `bp_C512.pt` (BP 1.70), `snap_{0,200,800,2000}.pt`, + `lam_jr8.pt` (2.365), `lam_t5.pt`. diff --git a/docs/campaign/C512_ROUND2_ABCD.md b/docs/campaign/C512_ROUND2_ABCD.md new file mode 100644 index 0000000..65e5282 --- /dev/null +++ b/docs/campaign/C512_ROUND2_ABCD.md @@ -0,0 +1,118 @@ +# C512 Round-2 (A/B/C/D) — Spec Status, Results, and What Followed + +**Scope.** Status of the last long spec the user ordered (the A/B/C/D round, after the 7-experiment +campaign in `C512_PLATEAU_CAMPAIGN.md`), its results, the unfinished gaps, and the work that grew out +of it (this session). Setting: EP-trained equilibrium transformer block, C=512 BPE-4096, TinyStories, +T=256, B=24. Random ln(4096)=8.318; BP single-block reference (depth=1, 7.48M) best **1.7921** (20k). + +Convention: raw data; conclusions are the user's. + +--- + +## The spec (as ordered) + +> Next round (do A+B first, then decide N=4 / lock-in / tracking-AEP vs nudged-phase redesign): +> - **(A)** plateau estimator ablation at a fixed ckpt — holo N=2/4/8, r=.02–.2, T2=20–120 + adaptive, +> tracking-AEP, vs BPTT150/400/800. 
+> - **(B)** navg multi-batch one-step sweep (navg 1–16): variance-or-bias test. +> - **(C)** stable BPTT+controller rescue — lr 2e-4–8e-4 × jr_max 8–32, ≥2 seeds, 2–5k cutoff (the +> still-missing matched architecture ceiling). +> - **(D)** per-group late-collapse localization — attn QKVO / FFN fc-pj / LN / emb / head-clamp / +> jacreg + diagnostics |a|, |z^β−z*|, |Jv−Jᵀv|, selected-T2. + +--- + +## Per-item status & results + +| item | status | result (raw) | +|---|---|---| +| **A** estimator ablation | ✅ core done | At the 2.40 plateau ckpt vs BPTT400: **N2-frozen** cos **−0.045**, self-cos **−0.27**, k **4133**; **N4/N8** fix magnitude only (k→30) NOT direction; **r** .05/.10/.20 do nothing; **fixed-T2** 20/80/120 diverge (need snapshot selection); **tracking-AEP (common-mode)** cos **0.997**, self-cos **+0.95**, k **0.9**. Diagnostic at z*: **\|Jv−Jᵀv\|/\|Jv\| = 1.37** (highly non-normal → the frozen-at-z* AEP linearization IS the bias floor). Script: `plateau_ablation.py`. | +| **B** navg sweep | ✅ done | cos(EP,BPTT) over navg 1/2/4/8/16 = **0.37 / 0.40 / 0.55 / −0.30 / −0.21**; self-cos rises (0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8. ⇒ **deterministic BIAS, not variance** — averaging/restart won't fix it. Script: `navg_sweep.py`. | +| **C** stable BPTT rescue | 🟡 partial | Ran lr{2e-4, 4e-4, 6e-4} × jr_max{16, 32}, **single seed**, ~2.5k cutoff (`bpttr_lr*_jr*.pt`). Found: **stable exact-gradient BPTT exists** at lr ≤ 6e-4 (never broke); lr 8e-4 broke @ step 1600; **jr_max 16 > 32**. Winner (lr6e-4, jr16) extended to 12k = `bptt_ceiling` (best **2.1055** @ step 7400). Scripts: `bptt_rescue.sh`, `chain_next.sh`. | +| **D** per-group localization | ✅ core done | Late collapse is **UNIFORM** across WQ/WK/WV/WO, fc/pj, ln1/ln2, tok/pos (all cos −0.03..−0.08, k 3k–18k) — a single shared cause (the corrupted contrast `a` poisons every VF gradient equally), NOT attention/LN-specific. 
+ the \|Jv−Jᵀv\|=1.37 diagnostic. Script: `plateau_ablation.py` (per-group mode). | + +### Decision gate (after A+B) +Per spec, A+B chose the direction: **switch the nudged-phase estimator to common-mode tracking-AEP** +(N=4/lock-in/navg rejected — A showed they fix magnitude not direction; B showed it's bias not +variance). The mechanism: the bias floor is the **frozen-at-z\*** AEP linearization; re-linearizing at +the moving common mode z̄=½(z₊+z₋) restores cos −0.045 → 0.997. + +### Follow-on (the spec's "next necessary test": does `--track` descend below 2.40 in TRAINING) +✅ Done. Warm-start from the 2.40 orphan ckpt, jr_max 8: +- **TRACK-warm (tracking-AEP): 2.40 → best 2.1628 @ 2500, still descending, stable** (res ~6e-5). +- **STD-warm control (standard estimator, same recipe): ABORTED @ step 397** (res→0.21). +- **TRACK-fresh (from scratch): ABORTED @ 968.** +⇒ tracking-AEP descends past the plateau but is **not a from-scratch drop-in**; recipe is **two-phase +warm-track** (std estimator → ~2.40, then switch to tracking-AEP). Continued (`trkcont`) → best **2.1116**. + +--- + +## Unfinished gaps (do not change conclusions, but were specified) +1. **C — ≥2 seeds:** only single-seed runs. The one item that affects the rigor of the "matched + architecture ceiling" claim. +2. **C — jr_max=8** not included in the BPTT rescue grid (only 16/32; exp5 covered jr_max 8 for EP). +3. **A — BPTT800** reference not run separately (BPTT150-vs-400 cos=1.0 from exp1 answered the horizon + question, but the spec listed 800). +4. **D — sub-diagnostics** |a|, |z^β−z*|, and a head-clamp arm not individually logged. + +--- + +## What followed (this session — beyond the A/B/C/D spec) + +These grew out of the round; they were **not** in the spec. + +1. 
**bptt_ceiling (architecture-ceiling run, lr6e-4 jr16, 12k).** Best **2.1055** @ step 7400, then + **controller-hijack blowup** @ ~8800 (jr→16, val CE→3.4), lodged at ~3.3 through 12000 (jr pinned at + 16 the entire tail, res bounded 2–5e-3 — never recovered). Logs: `bptt_ceiling.log`. + +2. **BPTT-vs-BP "slow vs misaligned" diagnosis** (`lr_overlay.py`). Plotting val-CE vs cumulative-lr + (∫lr, Adam path-length proxy): per unit path-length **BPTT meets/beats BP down to 2.11** (cumL ratio + 1.0→0.33) — the raw-step slowness is the 5× lr cap (6e-4 vs BP's 3e-3), not inefficiency. BPTT never + reached BP's 1.74 because it **blew up before converging** (stability-limited), not capacity. S0 + evidence that capacity is fine when it converges: **BPTT-ctl 1.635 ≈ BP 1.610** (gap 0.025). A + weight-import capacity probe was designed but **deemed confounded** (5 formula gaps: damping halves + z*, coupled-vs-sequential FFN, attn biases, qknorm, final-LN) and not run. + +3. **Controller-hijack diagnosis** (`blowup_diag.py`, at the 2.1055 ckpt). Decomposing the update into + CE and jacreg terms: **cos(g_CE, g_jac) = −0.008** (orthogonal); at jr=16, **\|jr·g_jac\|/\|g_CE\| = + 3.25** and **cos(g_CE, total) = 0.288** — i.e. ~71% of the update is the penalty, not loss descent. + Per-group, the penalty over-drives **LN gains 34×**, attn 3.2×. res stays bounded throughout ⇒ NOT a + relaxation blowup; a controller-penalty-driven weight-space detour. Trigger: near the optimum res + settled ~1.8e-3 (> target 1.5e-3) → controller ramped jr to max → hijack. + +4. **"Remove/replace the penalty" structural arms** (running now, all BPTT, base = bptt_ceiling recipe): + - `freeze_floor` (GPU1): `--jacreg 0.1 --jr_floor 0.1 --jr_max 0.1` → jr frozen at 0.1, no adaptive ramp. + - `specnorm` / Lipschitz (GPU0): `--jacreg 0 --specnorm --rho 0.9` → spectral-norm attention weights ≤0.9, off-gradient, no penalty. 
+ - Queued: `no-penalty` (`--jacreg 0`), `energy` (`--attn_mode energy`), `mono` (`--attn_mode mono`). + - Watcher `watch_arms.sh` fires when both finish. Logs: `freeze_floor.log`, `specnorm.log`. + +5. **DEQ training-stability literature review** (deep-research, 24 sources, 25/25 claims verified). + Three families: (i) **soft Jacobian penalty (Bai 2021)** — author-admitted heuristic + DIGITAL-only + (relies on Broyden, deliberately doesn't enforce contraction) → wrong tool for analog; (ii) + **structural contraction** (monDEQ / NEMON-l∞ / pcDEQ / BiLipNet) — provable, and monDEQ physically = + a passive resistor-diode circuit → **analog-native**; (iii) **energy/Hopfield attention** (softmax = + one CCCP energy-descent step, provable monotone energy decrease) → the analog-compatible AND + attention-preserving route. Recommended construction (medium conf): energy/Hopfield attention + + monotone/NEMON-l∞ contractive FFN, plain Picard/averaged (= our εF Euler) iteration, no root-finder. + Full report relayed in chat; open questions = our research frontier. + +--- + +## Where it stands / open items +- **Round A/B/C/D:** substantially complete (4/4 + the decisive tracking-AEP training follow-on); + 4 specified gaps remain (above), the load-bearing one being **C multi-seed**. +- **Reframe (this session):** the explosion is a **DEQ forward-stability** problem (shared by EP and + BPTT, since EP≈BPTT gradient), not an EP problem — so the fix belongs to the DEQ/structural toolkit, + not EP-literature hacks. The soft penalty is both admitted-heuristic and analog-incompatible. +- **Running:** freeze_floor, specnorm (~10h) — testing whether removing/replacing the penalty keeps + the block stable past the ~9k blowup point and whether it descends below 2.1055. +- **Next-round formal plan (NOT yet executed):** WSD schedule redesign; fix-bundle re-run; converged + BP/BPTT/EP comparison; C512-vs-C≥1024 capability question. 
Plus (lit-informed) prioritize the + `energy` arm; consider NEMON-l∞ over Winston-Kolter for the structural `mono` arm. + +## Artifacts +- Scripts: `/tmp/lt_ep/{plateau_ablation,navg_sweep,bptt_rescue,blowup_diag,lr_overlay,cos_now}.py`, + `bptt_rescue.sh`, `chain_next.sh`, `watch_arms.sh`. +- Logs/ckpts: `/tmp/lt_ep/{bptt_ceiling,freeze_floor,specnorm,trkcont,track_warm,std_warm,bpttr_*}.{log,pt}`. +- Records: `C512_PLATEAU_CAMPAIGN.md` (round-1, exp1-7), `FINDINGS.md` (2026-06-17 round 2 + 2026-06-18), + memory `ep-c512-plateau-mechanism.md`. diff --git a/docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md b/docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md new file mode 100644 index 0000000..54897bc --- /dev/null +++ b/docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md @@ -0,0 +1,260 @@ +# C512 EP below-2.10 divergence — diagnosis + fix (2026-06-20) + +> Written to disk after `/tmp/lt_ep` was wiped (10-day /tmp cleanup) — all local code/ckpts/logs lost. +> This file preserves the diagnosis + the exact code change so it can be rebuilt. + +## Symptom +EP (frozen jr=0.1, tracking-AEP, qknorm, c=1) descends to best **~2.09**, then within ~200 steps: +res 5e-3→0.15, cos(g_EP,g_BPTT) 0.98→~0, val CE 2.09→4+, ABORT. Same ~2.09 floor every run +(erf & tanh both: 2.0919 / 2.0921). **Exact-BPTT, identical recipe, sails past 2.09 → 1.72 (freeze_wsd).** + +## Root cause (Codex-confirmed, 5-way corroborated) +**EP optimizes the fixed point; BPTT optimizes the finite unroll — only the finite unroll defends the residual.** +- `ep_step` relaxes T1=150 → `z150`, then REFINES to t1max=300 → `z*`, computes the EP gradient at `z*`. +- But **eval and `bptt_step` use `z150`**. EP's gradient never feels the T1 residual. +- As attention gets expressive → contraction weakens → `z150` drifts from `z*` → res@T1 inflates → EP + estimate (valid only at small res) dies → blowup. +- BPTT differentiates the actual 150-step unroll → a non-converging unroll → bad CE → gradient implicitly + rewards strong contraction. 
**That defend-the-residual term is what EP structurally lacks.** +- cos-0.977 holds only AT the fixed point; the missing perpendicular is ~21% (=√(1−0.977²)) = the + finite-horizon transient gradient. (T1=∞ would make even BPTT lose it — BPTT's stability IS the finite truncation.) + +Two layers: (1) point-mismatch (refinement) — `--t1max 150` alone pushed wall 2.09→2.05 but still blew@600; +(2) gradient-flavor — needs the explicit penalty. + +Diverged state = forward bifurcation to a **limit cycle** (eval_relax: res floors ~6e-2, oscillates, 150→4000 +steps don't help, CE ~3.7). FTLE stayed negative (−0.027..−0.050). ⇒ adaptive/more steps can't fix it (res as +stopping-criterion chases a vanishing fixed point); only res as a COST (penalty) prevents the drift. + +Ruled out: jacreg (3% of grad, orthogonal, removing changes cos <0.001); GELU erf/tanh (both diverge identically); +the validity gate (removing it diverges FASTER); forward-stability/FTLE (stable-BPTT −0.0347 weaker than diverged-EP −0.0377). + +## THE FIX: `--resreg` (T1-residual penalty) — exact edits to lt_ep_train.py / ep_step + +1. signature: `def ep_step(..., corr_every=1, res_gate=0.0, resreg=0.0):` + +2. right after `res_used = res` (BEFORE the `if t1max > T1:` refinement): +```python + zT, resT1 = zs, res # the T1 free-phase state (what eval/BPTT use), BEFORE refinement +``` + +3. 
just before the final `return grads, res` (after the jacreg block): +```python + if resreg > 0 and resT1 > 7e-4: # defend z_T1 (BPTT gets this implicitly; EP at z* doesn't) + with torch.enable_grad(): + Fz = blk.tforce(zT, xin0) # deterministic thick force at z_T1 (params live, zT/xin0 detached) + Rr = (eps * Fz).pow(2).sum() / (zT.pow(2).sum() + 1e-9) # ~ (T1 residual)^2 + grr = torch.autograd.grad(Rr, blk.block, allow_unused=True) + ratio = resreg * min(1.0, resT1 / 2e-2) # ramp 0->resreg as res 7e-4->2e-2, capped + gtask = math.sqrt(sum(float((grads[id(p)] ** 2).sum()) for p in blk.block if grads.get(id(p)) is not None) + 1e-20) + gres = math.sqrt(sum(float((g ** 2).sum()) for g in grr if g is not None) + 1e-20) + lam = ratio * gtask / gres # scale penalty to `ratio` of the task-grad norm + for p, g in zip(blk.block, grr): + if g is not None: + grads[id(p)] = g * lam if grads.get(id(p)) is None else grads[id(p)] + lam * g + return grads, res +``` + +4. argparse: `ap.add_argument('--resreg', type=float, default=0.0)` + +5. the ep_step CALL in the training loop: append `, cfg.resreg` as the last arg. + +**RUN IT WITH `--res_gate 0`** (else the gate bypasses the penalty). Keep `--t1max 300`. + +## Validation status (INTERRUPTED — must re-run) +`ep_resreg2` (res_gate=0, resreg=0.2, warm from erf-2.09 ckpt, t1max=300): held res **1–5e-4 through step 1000**, +best **2.0573** (lowest any EP run reached, past the wall), zero inflation. Deleted by /tmp wipe before reaching ~1.8. +No-penalty variants for contrast: ep_nogate blew@100, gated ep_resreg@200, ep_t1max150 reached 2.05 then blew@600. + +## What was lost in the /tmp wipe (must rebuild) +- Code: lt_ep_train.py (with all post-06-14 edits: --resreg, --gelu, --probe_bptt, --wsd, --lcap/--linf, empty_cache, + tracking-AEP), plus the probes (grad_decomp.py, ftle_probe.py +--gelu, eval_relax.py, cos_now.py, train_lyap.py). 
+ Base (06-14) survives in `/home/yurenh2/ept/lt_ep_code/`; re-apply edits from this doc + the session transcript + `/home/yurenh2/.claude/projects/-home-yurenh2-ept/46bc4d43-*.jsonl`. +- ALL checkpoints: orphan (ep_C512_bpe.best.pt 2.40), freeze_wsd.pt (BPTT 1.72), ep_frzjr_wsd.pt (2.09), + ep_resreg2, ep_tanh_frz — gone (H200 backup source 157.66.255.42 is also down). Must retrain from scratch. +- Data tinystories_bpe: regenerate via `lt_ep_code/prepare_tinystories_bpe.py`. + +## Re-run plan +1. Reconstruct lt_ep_train.py = lt_ep_code/06-14 base + the edits above (+ the other recent flags). +2. Regenerate tinystories_bpe data. **Work entirely under /home/yurenh2/ept — NEVER /tmp.** +3. Retrain orphan (~20k) → freeze_wsd (BPTT WSD 16k, expect ~1.72) → EP+resreg (res_gate=0 resreg 0.2, WSD). +4. Confirm EP+resreg descends past 2.09 toward ~1.8 (BPTT ceiling). Sweep resreg 0.2–1.0 if it plateaus. +5. NEXT-ROUND code fixes (genuine bugs, kept this round only for comparability with the diverging baselines): + - GELU consistency: model force uses erf (F.gelu) but holo nudge (cgelu/rforce/cforce) uses tanh-form → nudge fixed point ≠ z* → ~4.7e-4 force error injects a spurious common-mode drift each step (estimator bias). FIX = `--gelu tanh` (make the model tanh too; the holo complex extension REQUIRES tanh = entire fn, so consistency means everything tanh). NOTE: fixing it does NOT fix the below-2.10 divergence (tested 6/18, that's the residual-defense issue) — but it's still a real bug worth fixing. + - cforce (holo_ep.py:41-45) omits the q/k RMSNorm that real attention applies → biases holo_a/holo_a_select with --qknorm (not the --track path, but fix anyway). + - Stale argparse defaults (footguns, harmless when flags passed explicitly): --attn_mode default 'real' (should be the intended mode), --data default '/tmp/lt_ep/data/shakespeare_char' (line ~320). Also tighten resreg scaling to pure task-grad norm (currently task+jacreg, ~3% off). 
+ +--- +## 2026-06-21 session update — rebuild after /tmp wipe + validation IN PROGRESS + +REBUILD (all under /home/yurenh2/ept/ep_run, never /tmp): +- Code ep_run/lt_ep_train.py = lt_ep_code (a recent ~6/20 snapshot despite its misleading 6/14 mtime — it already has res_gate/gelu/t1max/track/resinit/full ep_step+holo) + my 3 edits (--resreg block, --wsd schedule, line-20 DD data-path fix). `diff` verified clean; holo_ep.py byte-identical to base. +- Bug-check forensics: holo_ep.py last edited 2026-06-13 -> the estimator (holo_a_track, AEP correction, rforce) is COMPLETE/not stale. Post-6/14 lt_ep_train edits NOT carried over are all diagnostic/other-run-type (--probe_bptt,--probe_branch,--snap,--cprewarm,--specnorm/--linf, bptt_step sep-mode, empty_cache) — none touched by the EP+resreg+track+frozen-jr run. +- Data: regenerated tinystories_bpe to ep_run/data/ (361M train tokens, vocab 4096; val=tail-15MB of train.txt, ~0.8% overlap — fine for tracking divergence/descent). + +VALIDATION RUN (from random init; pid 696706 GPU1): + python3 lt_ep_train.py --mode ep --attn_mode thick --B 24 --C 512 --H 16 --T 256 --c 1.0 \ + --jacreg 0.1 --jr_floor 0.1 --jr_max 0.1 --holo 2 --hr 0.02 --t2sel 40 --track --pema 0.999 --t1max 300 \ + --res_est 1e-4 --res_gate 0 --resreg 0.2 --qknorm --resinit 0.1 --warmup 800 --T1 150 --T2 20 --lr 6e-4 \ + --wsd 0.25 --steps 32000 --log 200 --save_every 500 --abort_res 0.3 --data data/tinystories_bpe \ + --ckpt runs/ep_resreg_scratch.pt --state runs/ep_resreg_scratch.state + Status @step 6400 (~20%, 8h, 0.23 it/s): val 2.5070 descending (FASTER than orphan: 2.51@6.4k vs orphan's 2.40@20k). res BOUNDED ~1-2e-2 by resreg (spikes to 2e-2 get pulled back to ~4e-3 — resreg WORKING, not running away like the no-resreg baselines that hit res 0.15->abort@2.09). Decisive break-2.09 test ~10h out; ~1.8 needs the WSD lr-decay (step 24k+). Watcher key: does res stay bounded through 2.09. 
+ +resreg CONTAMINATION probe (ep_run/resreg_probe.py, at step-3600 ckpt, res 1e-2): + cos(VF,BPTT)=0.7166 vs cos(VF+resreg,BPTT)=0.7108 -> delta -0.006: resreg DOES directly contaminate (small). cos(resreg_dir,BPTT)=-0.047 -> resreg is ~ORTHOGONAL to the true gradient; it is NOT adding back BPTT's residual-defense term (my earlier hypothesis was WRONG) — it's a roughly-orthogonal constraint direction. NET still favorable: the tiny -0.006 direct cost (bounded by the 0.2 ratio cap) buys keeping res low so the EP estimate doesn't COLLAPSE (cos 0.72 held vs ->0 runaway+diverge). CAVEAT the probe also exposes: the EP estimate itself is only 0.72-aligned at res 1e-2 (vs 0.98 at res 1e-4) -> descent on a mediocre gradient -> MAY plateau above BPTT's 1.72. If it stalls >~1.85 or res creeps past 2e-2: retune tighter resreg (ramp-to-full by res~3-5e-3 instead of 2e-2) and/or stronger early contraction (frozen jr=0.1 from scratch is weaker than the orphan's adaptive jr which held res~1e-4). + +CODEX review of rebuilt code (verdict triaged): the 2 "High" bugs are STALE-DEFAULT footguns, HARMLESS here (run passes --attn_mode thick + --data explicitly, verified via /proc): line 304 --attn_mode default 'real'; line 320 --data default '/tmp/lt_ep/data/shakespeare_char'. GELU (erf model vs tanh nudge, ~4.7e-4 force err) + cforce-missing-qknorm = real bugs but status-quo (kept this round for comparability) -> next-round fixes (point 5 above). Codex CONFIRMED CORRECT: AEP contrast sign (not flipped/doubled), Wh readout grad, jacreg, the resreg sign/logic + res_gate=0 gate-bypass, frozen-jr clamp@0.1, WSD schedule. + +COST (EP vs same-param BP): per-step EP on the equilibrium block ~= ~100x a depth-1 BP transformer in DIGITAL SIM — because the block relaxes ~150-300 steps (effective depth); even exact-grad BPTT is ~100x depth-1 BP. EP/BPTT ~= 1.5x (EP 0.23 it/s vs freeze_wsd BPTT 0.34 it/s). 
The ~100x is the cost of SIMULATING the analog relaxation on a GPU; on analog relaxation hardware the free phase is physical settling (~free) + EP's local update needs no backward -> EP competitive/better than BP there. Per-quality the gap is smaller (equilibrium block beats same-param standard transformer, rule tax 0.041); EP memory << BPTT (no unrolled graph). + +DELIVERABLE: method-intro PDF /home/yurenh2/ept/ept_method_intro.{tex,pdf} (9pg, EP-literate audience; AEP/holomorphic/CET/tracking-AEP background; strict attribution; results honest+ongoing-marked; compiled with tectonic since conda pdflatex is broken). VERIFY before external use: the post-Jan-2026 citations (Scurria arXiv:2602.03670 AsymEP; CET=Hoier/Kerjan/Scellier ICLR2026 AM workshop OpenReview Qrfml76eWJ; Laborieux 2021) + Hoier-vs-Hoier spelling. + +--- +## 2026-06-21 (cont.) — divergence is CONTRACTION loss, not res; the missing BPTT term MEASURED +From-scratch EP (c=1, frozen jr=0.1, track, resreg=0.2) diverged at val 2.25 (ABOVE the 2.09 wall): +res 1.1e-2 -> 1.7e-1 -> 3.4e-1 then COLLAPSED to a trivial fixed point (res 1.4e-8, val 10). So it is a +bifurcation: good fixed point destroyed -> brief limit cycle -> degenerate attractor. NOT a numerical blowup. + +MEASURED the missing term M = g_BPTT - g_EP at the pre-blowup ckpt (step 9200, res~1e-2; resreg_probe.py): + cos(g_EP, g_BPTT) = 0.557 (decayed from 0.72 @ step3600 at the SAME res -> estimate degrades as operator hardens) + |M|/|BPTT| = 0.84, cos(M,BPTT) = +0.91, cos(M,EP) = +0.16, cos(M,resreg) = -0.10 + (|EP|=0.94, |BPTT|=2.2 -> EP captures only ~42% of the true gradient, 56 deg off, near the instability) +=> EP is missing 84% of the true gradient (M ~ the BPTT direction). resreg points AWAY from M (-0.10): + resreg is NOT the missing term and NOT a weak proxy -- it is the wrong direction. resreg retired as the fix. 
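The figures reported for the missing term M = g_BPTT - g_EP follow from the two norms and the single cosine by the law of cosines, so they can be cross-checked without the checkpoint. The sketch below assumes only the rounded values quoted above (|EP|=0.94, |BPTT|=2.2, cos=0.557); the helper name `missing_term_stats` is hypothetical and is not part of `resreg_probe.py`.

```python
import math

def missing_term_stats(norm_ep, norm_bptt, cos_ep_bptt):
    """Derive the statistics of M = g_BPTT - g_EP from two norms and one cosine."""
    dot = norm_ep * norm_bptt * cos_ep_bptt              # <g_EP, g_BPTT>
    norm_m = math.sqrt(norm_bptt**2 + norm_ep**2 - 2.0 * dot)
    return {
        "rel_M": norm_m / norm_bptt,                                   # |M| / |BPTT|
        "cos_M_bptt": (norm_bptt**2 - dot) / (norm_m * norm_bptt),     # cos(M, BPTT)
        "cos_M_ep": (dot - norm_ep**2) / (norm_m * norm_ep),           # cos(M, EP)
        "angle_deg": math.degrees(math.acos(cos_ep_bptt)),             # angle EP vs BPTT
        "norm_ratio": norm_ep / norm_bptt,                             # |EP| / |BPTT|
    }

# Rounded inputs taken from the probe output quoted in the text.
stats = missing_term_stats(norm_ep=0.94, norm_bptt=2.2, cos_ep_bptt=0.557)
for name, value in stats.items():
    print(f"{name:12s} {value:.3f}")
```

With these rounded inputs the sketch reproduces |M|/|BPTT| of about 0.84, cos(M,BPTT) of about 0.91, the roughly 56 degree angle and the roughly 42% norm ratio. cos(M,EP) comes out near 0.15, which agrees with the reported +0.16 to within input rounding.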
+ +CODEX THEORY: correct state variable = contraction margin mu_P(J) (one-sided-Lipschitz/log-norm), NOT res +(lagging symptom; cos decays at fixed res) and NOT spectral radius (non-normal, |Jv-Jtv|/|Jv|~1.4 -> jacreg +targets the WRONG quantity). BPTT's stabilizer = the finite-horizon gradient term (the (I+eps*J)^k products that +ALSO govern contraction -> automatic contraction feedback); EP graded at refined z* structurally discards it. +The fix that (i) preserves the objective [no loss force] and (ii) runs on analog [no digital root-finder] = +CONTRACTION-BY-CONSTRUCTION (mu_P(J) < -alpha). KICKER: analog HW can never do BPTT -> can never access M -> +structural contraction is the FORCED hardware design constraint, not merely the cleanest fix. + +EXPERIMENT (test causality + the fix; both with resreg=0, no residual term): + (A) c-bump c=1->3 from scratch [runs/ep_c3_scratch] -- does stronger structural damping alone prevent divergence? + (B) HARD spectral-norm contraction constraint --specnorm (codex implementing) -- warm from preblow_9200 + from scratch. + If divergence vanishes with NO residual term -> contraction-loss confirmed as root; principled fix validated. + +--- +## 2026-06-21 (cont. 2) — the equivalence-condition diagnosis (deep paper read: Ernoult2019 + AsymEP + LZ-JacobianHomeostasis) +EP≡BPTT (Ernoult 2019, arXiv:1905.13633, Thm1/Eq16) needs: H1 symmetric Jacobian J=Jᵀ (Eq51 transpose-cancellation = load-bearing); +H2 free phase EXACTLY at fixed point s*=s_T=..=s_{T-K}; H3 β→0. Ernoult gives NO contraction-rate truncation bound (just assumes +convergence; my earlier "contraction-rate condition" guess was WRONG). BPTT needs NEITHER H1 nor H2 (differentiates the actual +finite-T graph) -> that's why BPTT sails to 1.7 while EP can't. 
+ +OUR RUN VIOLATES BOTH, via the SAME quantity S^-1 A (S,A = sym/antisym parts of J; our |Jv-Jᵀv|/|Jv|≈1.4 IS ‖S^-1 A‖≈1.4): +- H2 (dominant, first-order): res~1e-2 -> not at fixed point -> EP & BPTT seeded by different states. "EP≈BPTT only at res≲1e-4, + noise at 1e-2" is textbook H2 failure. +- H1 (makes it sudden, sets M): both AsymEP (Scurria arXiv:2602.03670) and LZ (Laborieux-Zenke "Jacobian Homeostasis" + arXiv:2309.02214, ICLR2024) reduce the asymmetry bias to S^-1 A. LZ Eq11: EP error = δ - 2 S^-1 A δ, valid only ‖S^-1 A‖→0. + At 1.4 the map (S+A)^-1(S-A) → near-rotation (eigs→-1): right direction, lost magnitude = our cos(M,BPTT)=0.91, |M|=84%. + +KEY NEW REALIZATION: our AEP (J→Jᵀ) correction is OUT OF WARRANTY. It is FIRST-ORDER in S^-1 A and assumes ‖S^-1 A‖≪1 +(AsymEP error series Σ(S^-1 A)^{2n+1} converges only if ρ(S^-1 A)<1; they never state it). At 1.4 it may be O(1) wrong. AND it +is a FIXED linearization at x̄⁰, so res~1e-2 -> wrong anchor -> H2 POISONS the AEP correction too. The cos-0.99 we trust holds +ONLY at small-asymmetry + tight-convergence — the regime we leave. + +THE GAP, PRECISE: NOT "can't converge". Even at perfect convergence, at ‖S^-1 A‖=1.4 the H1 bias is O(1), and the BPTT-1.7 +solution NEEDS that non-normality. So EP's valid regime (‖S^-1 A‖≪1) and the expressive regime (≈1.4) DON'T OVERLAP. Structural. + +REFRAMES THE CONTRACTION EXPERIMENTS: specnorm/c3 cap ‖J‖/contraction, NOT the asymmetry ratio ‖A‖/‖S‖ that drives the bias. +The literature's correct lever = Jacobian Homeostasis (LZ Eq12: penalize ‖A‖² -> J functionally symmetric, works w/ untied weights) += the principled version of our jacreg — BUT it caps exactly the non-normality the 1.7 solution needs (expressivity tax; confirms +the "crutch" worry). Holomorphic nudge only removes finite-β bias, NOT the H1/H2 bias (so β is NOT our dominant error). 
+ +THE ONE PATH WITHOUT THE EXPRESSIVITY TAX (novel): the AEP correction is first-order; a HIGHER-ORDER AEP (more Neumann terms in +S^-1 A) extends the warranty past ‖S^-1 A‖~1 WITHOUT capping non-normality — fix the gradient, not the model. Attacks H1 fundamentally. + +NEXT MEASUREMENTS (cheap, on preblow_9200 ckpt): (1) measure ‖S^-1 A‖ / ρ(S^-1 A) DIRECTLY (not the 1.4 proxy) and track over +training (warranty gauge; predict it grows toward the tip). (2) test M ≈ -2 S^-1 A · g_BPTT — if yes, the asymmetry bias IS the gap +(not convergence) and higher-order AEP is the lever. Sources: Ernoult 1905.13633; Scurria 2602.03670; LZ 2309.02214 + Holo 2209.00530. + +--- +## 2026-06-21 (cont. 3) — CORRECTION: the "AEP out of warranty / ρ(S⁻¹A)<1" framing (cont.2) was MIS-ATTRIBUTED +Deep read of Scurria 2026 (arXiv:2602.03670, verbatim Eq.24/47) overturns cont.2: +- Our AEP = the paper's AsymEP (single-state antisym correction −2A(z−z*); holo_a_track re-anchors A at the common mode z̄). + It is EXACT at β→0 for ARBITRARY asymmetry: Eq.24 `J_FA(x̄⁰)=J−2A=Jᵀ` is an algebraic IDENTITY (any A magnitude). +- The ρ(S⁻¹A)<1 / Neumann series Σ(S⁻¹A)^{2n+1} (Eq.47) is the **VF BASELINE's** error (uncorrected EP), which AsymEP REMOVES. + We + GPT-Pro (following Laborieux-Zenke, whose first-order −2S⁻¹Aδ analyzes UNCORRECTED EP) MIS-ATTRIBUTED the VF error + to AsymEP. It is NOT a restriction on our correction. +- "AsymEP = first-order projection of Dyadic EP" = DIMENSIONAL projection (2n→n), NOT first-order-in-β. Both exact at β→0. +- Dyadic EP (Scurria §4) = a SEPARATE 2n doubled-field method (aux field z′ + augmented energy H=−(z−z′)ᵀF((z+z′)/2)); + we DON'T implement it. It buys finite-displacement robustness + a real energy, NOT extra β→0 exactness. ⇒ optional upgrade, NOT the fix. +- ⇒ the asym_probe ρ(S⁻¹A)=3.3 result answers the WRONG question (VF's error, + measured at a no-fixed-point pathological ckpt). DISCARD as a validity verdict. 
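The claim that Eq.24 is an algebraic identity for any asymmetry magnitude can be checked numerically. The sketch below is plain linear algebra on random matrices, under the assumption that J splits into a symmetric part S and an antisymmetric part A; it says nothing about nudged-phase convergence or about the code in `holo_ep.py`, and every function name in it is hypothetical.

```python
import random

def transpose(m):
    return [list(col) for col in zip(*m)]

def antisym_part(J):
    """A = (J - J^T) / 2."""
    Jt = transpose(J)
    n = len(J)
    return [[0.5 * (J[i][j] - Jt[i][j]) for j in range(n)] for i in range(n)]

def corrected_jacobian(J):
    """J - 2A, the Jacobian after the antisymmetric correction."""
    A = antisym_part(J)
    n = len(J)
    return [[J[i][j] - 2.0 * A[i][j] for j in range(n)] for i in range(n)]

def max_abs_diff(X, Y):
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

def random_jacobian(n, asym_scale, rng):
    """J = S + asym_scale * A0 with S symmetric and A0 antisymmetric."""
    B = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    K = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    return [[0.5 * (B[i][j] + B[j][i]) + asym_scale * 0.5 * (K[i][j] - K[j][i])
             for j in range(n)] for i in range(n)]

rng = random.Random(0)
max_err = 0.0
for asym_scale in (0.1, 1.4, 10.0):       # weak, moderate and strong asymmetry
    J = random_jacobian(4, asym_scale, rng)
    err = max_abs_diff(corrected_jacobian(J), transpose(J))
    max_err = max(max_err, err)
    print(f"asym_scale={asym_scale:5.1f}  max|J - 2A - J^T| = {err:.2e}")
```

The residual stays at floating-point rounding level for every asymmetry scale, which is the sense in which the identity places no restriction on the size of A.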
+ +REAL DIAGNOSIS (re-centered): the ONLY fundamental obstacle is **H2 = free-phase convergence/contraction**. AsymEP handles +asymmetry EXACTLY *given convergence*. The cos-0.557 / M-84% gap was measured at res 1e-2 (NOT converged) ⇒ it's H2, not H1. +Asymmetry is only "upstream" (expressive attention → weak contraction → H2 fails). FIX DIRECTION = contraction/convergence +defense (specnorm / damping / monotone / residual-defense) — VALIDATED. The cont.2 "asymmetry/ρ/AEP-warranty" line was a wrong turn. + +EMPIRICAL CHECK (running): oracle-adjoint probe — at a HEALTHY converged ckpt (specnorm, res~1e-4): is +g_EP(AsymEP) ≈ g_transpose(exact adjoint via GMRES Jᵀλ=−ℓ) ≈ g_BPTT? cos→1 confirms AsymEP exact + problem = convergence. +(The earlier 0.557 was the pathological non-converged point.) + +--- +## 2026-06-21 (cont. 5) — ROOT-CAUSE CANDIDATE: --hr 0.02 is 10x too small (should be 0.2) -> noisy EP gradient +oracle-adjoint probe (asym_probe --diag) on a PERFECTLY converged ckpt (res 1.5e-9): cos(g_EP, g_transpose[=exact +adjoint=g_BPTT, cross-checked cos 1.0]) = 0.63 mean over 6 batches (SYSTEMATIC, not a bad batch). BETA/r SWEEP: cos gets +WORSE as beta->0 (hr 0.04->0.86, 0.02->0.72, 0.005->0.38, 0.002->0.15) = 1/beta noise-amplification of the nudge-convergence +error (NOT finite-beta bias; NOT a holo/track/t2sel component -- plain EP ~= FULL ~0.72). Sweeping hr UP: 0.06->0.85, +0.1->0.95, 0.15->0.98, **0.2->0.986**, 0.3->0.991. => at the DOCUMENTED hr=0.2 (METHODS line 130) the estimator is FINE +(cos 0.986, in the historical 0.85-0.99 range). Our runs (this whole session + the EP_BELOW210 "diverging recipe") used +--hr 0.02 = 10x too small -> noise regime -> 0.72-cos gradient -> plausibly THE destabilizer of the below-2.10 divergence. +CODE IS CORRECT; mis-set parameter (hr regression from the validated 0.2). FIX = --hr 0.2. Decisive test: pure EP with +--hr 0.2 (runs/ep_hr02) -- does it stop diverging? 
The probe was correct; the convergence/asymmetry/Dyadic theory dive was +largely chasing a symptom of the mis-set hr. (NOTE: also revisit t2sel=40 vs documented 120, secondary.) + +## 2026-06-22 (cont. 6) — DECISIVE: divergence is STRUCTURAL (equilibrium gradient is blind to finite-time contraction). hr & t2sel are NOT the fix. +Deliverables this round: `ep_run/GPT55_BUG_HUNT.md`, `ep_run/CODEX_VERDICT.md`, `ep_run/FWD_MODE_DESIGN.md` (multi-agent: gpt-5.5 + fugu-ultra via codex). + +**cont.5's "hr=0.2 fixes it" was WRONG/incomplete.** ep_hr02 (--hr 0.2) STILL diverged (step 2800, val 2.87). hr is a real gradient-QUALITY knob but not the divergence fix. + +**Clean chain (all on a fixed val batch):** +1. **BPTT deconfounder:** BPTT (exact grad) → CE **1.83**, free-phase contraction ratio **ρ≈0.982**, res tight ~1e-3 throughout. Divergence is **EP-specific**; a contractive low-loss solution exists. +2. **The "0.94 estimator ceiling" = ADJOINT-PHASE TRUNCATION, not hr/β noise, not fundamental.** At a trained, tightly-converged operator (free res 2.2e-5): hr-ceiling sweep cos(g_EP,g_transpose) **flat ~0.94 across hr 0.04–0.8**; t2sel sweep (fixed hr) cos **0.74→0.998 as t2sel 10→160**. The code refines the FREE phase but never converges the NUDGED/adjoint phase (gpt-5.5 bug-hunt #1,#9). So EP CAN compute the near-exact equilibrium adjoint (0.998) at t2sel≥160. cos(g_transpose,g_BPTT)=0.975@res2e-5, =1.0@res1e-9 (finite-horizon gap, res-dependent). +3. **Bifurcation probe (mechanism):** as EP trains, free-phase **ρ drifts→1** (0.989@val4.96→0.998@val2.74); redx baseline blew step3300 (CE 2.74→41). **Damping does NOT rescue** — ρ-vs-c on the marginal op: more c makes ρ WORSE (0.998→0.9999); on BPTT's op more c helps (0.982→0.96). EP drives weights into a **damping-resistant marginal-contraction region** BPTT avoids. Sudden blowup because **res_T1≈ρ^150** is hyper-sensitive near ρ=1 (gradual ρ-drift + ρ^150 = sudden bifurcation). 
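The "gradual ρ-drift + ρ^150 = sudden" argument in item 3 can be made concrete with a few lines of arithmetic. This is a minimal sketch that takes the approximation res_T1 ≈ ρ^T1 at face value (unit initial residual, a single contraction ratio); the function name is hypothetical and the ρ values are the ones quoted in the chain above.

```python
T1 = 150  # free-phase relaxation steps

def residual_after(rho, steps=T1):
    """Residual left after `steps` contractions by a constant ratio rho (unit start)."""
    return rho ** steps

def sensitivity(rho, steps=T1):
    """d(rho^steps)/d(rho): how fast the residual grows when rho drifts upward."""
    return steps * rho ** (steps - 1)

for rho in (0.96, 0.982, 0.989, 0.998, 0.9999):
    print(f"rho={rho:<7} res_T1~{residual_after(rho):.4f}  d(res)/d(rho)~{sensitivity(rho):.1f}")
```

Under this approximation a drift in ρ of under 2% (0.982 to 0.998) raises the T1 residual by roughly an order of magnitude, and the slope with respect to ρ grows by a similar factor over the same range.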
+ +**VERDICT (codex/gpt-5.5, PROVEN — CODEX_VERDICT.md): (b) STRUCTURAL.** Equilibrium gradient differentiates L(z*); BPTT differentiates deployed L(z_T), T=150. The equilibrium gradient `grad=L_θ+λᵀF_θ` (F_zᵀλ=−L_z) **lacks the finite-horizon contraction-defense term**. Scalar proof: F=−k(z−z*) → ∀k, z* identical → **dL(z*)/dk=0** (blind to contraction rate), but z_T−z*=(1−εk)^T(z₀−z*) → **dL(z_T)/dk ∝ T(1−εk)^{T−1}** (BPTT sees k; huge near ρ=1). A param that erodes contraction with z* fixed has ZERO equilibrium gradient but large BPTT gradient. **No t2sel/hr adds a term absent from the fixed-point gradient.** t2sel=160 only makes EP a better *equilibrium* gradient = still wrong objective → ep_t2fix predicted to still blow. + +**FIX (physically realizable; FWD_MODE_DESIGN.md): change the OBJECTIVE to finite-horizon — forward-mode/RTRL eligibility traces.** Forward-relax T=150, propagate `e_{t+1}=Φ_z(z_t)e_t+ε·∂F_θ/∂p` alongside the state, update `Δp ∝ −L_z(z_T)ᵀe_T`. Forward-only, local, NOT reverse BPTT, NOT a digital solver — keeps the contraction-defense terms. Exact RTRL O(P·dimz) → practical: UORO / random-direction directional-deriv / blockwise low-rank. Alt: explicit contraction homeostat (penalize log-norm μ(J)). + +**In-flight confirmations (2026-06-22):** ep_t2fix (t2sel=160, runs/ep_t2fix.log + t2fix_rho.log) — predicted ρ→1 + blow; oracle-adjoint training (GMRES exact adjoint each step, warm-start redx s2000, runs/oracle_adjoint.log) — DEFINITIVE (b) test (if even the exact equilibrium grad blows → (b)); forward-mode rule design = the fix. +**Tooling:** Sakana fugu-ultra wired into the codex plugin (broker spawn patched: `-c model=fugu-ultra -c model_provider=sakana`, key ~/.codex/sakana.key); gpt-5.5 via direct `codex exec`. + +## 2026-06-22 (cont. 7) — CORRECTION: the divergence is a forward LIMIT CYCLE = a DISCRETE-EULER ARTIFACT (continuous/analog system is stable). cont.6's (b)-structural/forward-mode verdict is SUPERSEDED. 
+Measured on genuine this-session ckpts (full dossier: `ep_run/EP_DIAGNOSIS_DOSSIER.md`; fugu Q1-Q4: `ep_run/FUGU_VERDICT_FULL.md`). +1. **It's a LIMIT CYCLE, no fixed point.** eval_relax on the marginal ckpt redx **s3200** (val 2.74): relax from embedding 6000 steps → res FLOORS ~2.3e-2 and OSCILLATES (tail 2.08-2.73e-2, non-monotone), no fixed point. Reproduces the old (lost-run) finding on a genuine ckpt. ⇒ there is **no z\*** at the diverging operator → cont.6's "equilibrium gradient L(z\*) blind to contraction → forward-mode RTRL" (assumed z\* exists) is **moot here**. The bifurcation-probe "ρ=0.998 / slow convergence" was a **transient artifact** (caught the initial decay, missed the floor+oscillation). +2. **Attention drives the cycle.** Knockout WO×α: α=1.0 cycle (res 2.5e-2), 0.7 smaller, 0.4 nearly gone, **0.2 CONVERGES (res 3e-4, true fixed point)**. Non-conservative attention is causally responsible. +3. **DECISIVE — it's a DISCRETE-EULER ARTIFACT.** ε-sweep on s3200 (εN fixed = same settling): res-floor + oscillation shrink **monotonically** as ε↓ — 0.1→2.3e-2(cycle), 0.05→7e-3, 0.03→3e-3, 0.02→2e-3, **0.01→8.9e-4 CONVERGED**. ⇒ the cycle is the **explicit-Euler instability of a stiff rotating attention mode** whose CONTINUOUS eigenvalue is **Re(μ)<0 (stable)**: `|1+εμ|>1` at ε=0.1 but `<1` at ε=0.01. **So the below-2.1 "divergence" is largely a DIGITAL-SIM INTEGRATION ARTIFACT (ε=0.1 explicit Euler too coarse), NOT a fundamental non-conservative instability — and ANALOG hardware (continuous relaxation, ε→0) would NOT have it.** Big positive for the thesis. +- **fugu-ultra (Q1-Q4)** concurred + was appropriately cautious: confirmed attention-driven oscillatory non-convergence, flagged the Euler-artifact alternative (now confirmed), said "Hopf" needs the eigenvalue-of-M crossing to be measured. 
Fix (Q2): finer integration (smaller ε / implicit/Anderson) + **adaptive jacreg homeostat** (the validated-2.40 mechanism, which the diverging runs *froze* weak — the regression) + **res_gate ON** + qknorm/WO-gain rails + monitor TAIL oscillation (a short residual probe is what fooled us). Thesis (Q3): sub-threshold attention IS expressive enough (BPTT 1.83 proves stable expressive solutions exist), threshold ≈ α_c 0.3; tradeoff real but not fundamental. Primitive (Q4): keep below the (discrete) instability — the code (`ep_step`/holo) REQUIRES a stationary z\*; training *through* the cycle is invalid → gate updates when res says equilibrium lost. +- **CAVEAT:** the ε-sweep proves the *discrete* artifact at s3200 (continuous-stable there); a *true* continuous Hopf could still emerge deeper in training. **Decisive confirmation (running): `ep_eps05`** = warm-start s2000 (pre-cycle, val 3.13) + **ε=0.05 (T1=300)** → does it clear redx's 2.74 blow point without cycling? Files: eval_relax_s3200.py / knockout_s3200.py / eps_sweep_s3200.py. + +## 2026-06-23 — RESULT: ε-TRAINING-MONOTONICITY CONFIRMS THE MECHANISM (wall moves with ε). +Three training runs, identical recipe except ε: +- ε=0.1, t2sel=40 (redx): blew @ best **2.74** +- ε=0.1, t2sel=160 (ep_t2fix): blew @ best **3.02** — the BETTER gradient (cos 0.998 vs 0.84) blew EARLIER ⇒ gradient quality is NOT the lever (a cleaner gradient if anything drives into the stiff regime faster). +- ε=0.05 (ep_eps05, warm-start s2000): blew @ best **2.41** — cleared *past* redx's 2.74 with res pinned ~5e-3 the whole way down (2.74→2.45), then res inflated 1.3e-2→2.0e-2→4.6e-2→val13 @step3100. + +**⇒ smaller ε → strictly lower wall (2.74 → 2.41); the blow point shifts MONOTONICALLY with the integration step.** This proves the divergence IS the explicit-Euler instability of the *stiffening* rotating attention mode (|1+εμ|>1). 
Fixed ε only RELOCATES the wall, doesn't eliminate it — as the attention keeps stiffening past ~2.4, even ε=0.05 eventually gets too coarse → cycle regrows → blow. The ep_t2fix contrast (better gradient still blows, earlier) seals **"integration, not gradient."** Analog HW (continuous relaxation, ε→0) is unaffected. Pre-blow ckpt preserved: `runs/ep_eps05.pt` @step2850 best2.4059. (2.41 lands near the old 2.40 estimator-plateau — likely coincidence; this is the res-inflation/stiffness blow, not the estimator floor.) +**NEXT (user's plan — try jacreg + physics-inferred methods):** (a) **adaptive ε** [#30, gate now passed] — shrink ε as res climbs, big ε when smooth → should ELIMINATE the wall, not just move it; (b) **jacreg** (adaptive jr — reduce rotating-mode stiffness → raise the Euler-stability margin → wall up / fixed-ε floor lower; the early-recipe defense); (c) ε=0.02/0.03 to confirm wall→~2.2. Two complementary axes: integration (a,c — SIM-only, no expressivity cost) vs stiffness (b — changes model, also speeds HW settling). + +## 2026-06-23 (cont. 8) — MAJOR CORRECTION: GENUINE Hopf instability, NOT an Euler artifact. cont.7's "Euler-artifact / analog-immune / integration-fix" framing is SUPERSEDED. +Direct leading-eigenvalue measurement (matrix-free FINITE-DIFFERENCE-JVP Arnoldi on M=I+εJ, `eig_probe.py`) + Anderson fixed-point-solver control (`anderson_control.py`). THREE independent confirmations: +1. **eval_relax**: s3200 floors/oscillates, never reaches a fixed point. +2. **Leading eigenvalues of M at the operating point**: s2000 (healthy, val 3.13) all |λ|<1, Re μ<0 (STABLE); s3200 (val 2.74) top |λ|=1.044, **Re μ=+0.44, COMPLEX** (μ=+0.26±1.37j); ep_eps05 (val 2.41) top |λ|=1.14, **Re μ=+1.35**, complex (μ=+0.65±2.08j). Re μ grows −0.02→+0.44→+1.35 as CE drops; rotating |Im μ| ~0→1.37→2.08. +3. **Anderson control**: s2000 converges (root, marginally stable μ≈0). 
s3200 — plain relax floors (res 2.7e-2), Anderson CANNOT reach a root (best res 1.4e-3), and the near-root has **UNSTABLE COMPLEX** eigenvalues (Re μ=+0.24, |Im μ|≈2.0). The solver itself cannot manufacture the absent stable equilibrium (fugu's exact caveat). + +⇒ **The below-2.1 divergence is a GENUINE continuous-time Hopf-type instability** of the learned non-conservative attention operator — a complex eigenvalue pair crosses Re μ=0 between val 3.13 and 2.74 (|Im μ|~2 rotation), EMERGING + GROWING as EP trains the attention expressive. **NOT a discrete-Euler artifact; the continuous ODE / analog HW is ALSO unstable → analog is NOT immune.** + +WHERE cont.7 WENT WRONG: the "Euler-artifact" inference rested on the ε-sweep "converging at ε=0.01" — but that was the **step** residual `ε·g`, not `g→0` (fugu caught this); the force-residual `g` floors high (no fixed point). The direct eigenvalue measurement refutes Re μ<0. RECONCILES with all prior data: ε-monotonicity (smaller ε → lower wall) = smaller ε makes `|1+εμ|>1` grow SLOWER → delays the blow → wall relocates (2.74→2.41), doesn't close (Re μ>0 is unstable for ANY ε); ep_t2fix (better gradient blew earlier) = cleaner gradient drives into the unstable regime faster; jacreg history (adaptive→2.40, frozen→blow) = jacreg cuts ‖J_nc‖ → pushes the complex pair back to Re μ<0. + +**FIX (FLIPPED): integration (adaptive ε / Anderson / implicit) only DELAYS** — can't converge to an unstable/absent fixed point (Anderson got 1.4e-3, not a root = direct proof). **The REAL fix is MODEL-SIDE: jacreg** (reduce ‖J_nc‖ → push the rotating complex pair back to Re μ<0), per fugu's Q2 (FUGU_OPTIONS_VERDICT.md): use an *adaptive bounded* jacreg homeostat (the validated 2.40 runs used adaptive jacreg; diverging runs froze it). 
**NEXT: jacreg run + verify with `eig_probe.py` that jacreg holds Re μ<0 / keeps the complex pair below the bifurcation.** Files: eig_probe.py, anderson_control.py, adaptive_eps_calib*.py, FUGU_OPTIONS_VERDICT.md. + +## 2026-06-23 (cont. 9) — FIX CONFIRMED AT THE MECHANISM LEVEL: adaptive jacreg pushes Re μ<0, kills the Hopf, restores the fixed point. +`ep_jacreg` = warm-start s2000 + ε=0.1 + **ADAPTIVE jacreg** (`--jacreg 0.1 --jr_floor 0.1 --jr_max 16 --res_target 1e-3 --res_gate 0`, controller :520-529). The ONLY change vs redx (which froze `jr_max=jr_floor=0.1`) is releasing the controller. It descends cleanly **past redx's 2.74 blow** (best 2.7495, res controlled ~1e-3, jr 0.1–0.3, no blow; ~4.5e-4 CE/step after a slow warm-start re-adjustment). +**eig_probe (`eig_jacreg.py`) at the SAME loss ~2.74 — the decisive A/B:** +- FROZEN jacreg (redx s3200, blew): leading μ=+0.45 (real) + +0.26±1.37j (**COMPLEX/rotating**), |λ|>1 UNSTABLE, g_floor=0.26 (cycle, no fixed point). +- ADAPTIVE jacreg (ep_jacreg @2.75): leading μ=−0.23,−0.25,−0.27 (**ALL real, NO rotating pair**), |λ|<1 STABLE, g_floor=**0.0001** (TRUE fixed point restored). +⇒ adaptive jacreg moved the leading eigenvalue from **Re μ=+0.45 (unstable rotating Hopf) → Re μ=−0.23 (stable real)**, killing the rotating instability and restoring a true fixed point (g 0.26→1e-4). This confirms BOTH the diagnosis (Hopf from the non-conservative ‖J_nc‖) AND the fix (jacreg reduces ‖J_nc‖ → Re μ<0), at the eigenvalue level, not just the loss. **Bonus:** the restored fixed point makes AsymEP valid again (it requires a stationary state — invalid on redx's cycle). NEXT: does it clear 2.41 (ep_eps05's blow) → 2.09 → below; re-run eig_probe at 2.41 to confirm jacreg holds Re μ<0 deeper. Production = compose with adaptive ε (fugu Q3) for sim fidelity/speed. Files: eig_jacreg.py, ep_jacreg.log. 
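The eigenvalue readings in these entries all go through the explicit-Euler multiplier λ = 1 + εμ of the map M = I + εJ. The sketch below shows why a pair with Re μ > 0 cannot be stabilised by shrinking ε, whereas a rotating but continuous-stable pair can. The first μ is the complex pair reported for redx s3200; the second (`mu_stiff_stable`) is a hypothetical value chosen purely for contrast, and both helper functions are illustrative names, not code from `eig_probe.py`.

```python
def euler_multiplier(mu, eps):
    """|1 + eps*mu|: growth factor per explicit-Euler step for continuous eigenvalue mu."""
    return abs(1 + eps * mu)

def critical_eps(mu):
    """Largest step size with |1 + eps*mu| < 1, or None if no positive eps works.

    |1 + eps*mu|^2 = 1 + 2*eps*Re(mu) + eps^2*|mu|^2 < 1  <=>  eps < -2*Re(mu)/|mu|^2.
    """
    if mu.real >= 0:
        return None
    return -2.0 * mu.real / abs(mu) ** 2

mu_measured = complex(0.26, 1.37)        # complex pair reported for redx s3200 (Re mu > 0)
mu_stiff_stable = complex(-0.05, 1.37)   # HYPOTHETICAL: same rotation, but Re mu < 0

for label, mu in (("measured", mu_measured), ("hypothetical", mu_stiff_stable)):
    row = "  ".join(f"eps={e}: {euler_multiplier(mu, e):.5f}" for e in (0.1, 0.05, 0.01))
    print(f"{label:12s} critical_eps={critical_eps(mu)}  {row}")
```

For the hypothetical stable pair the multiplier exceeds 1 at ε=0.1 and drops below 1 once ε is under roughly 0.053, which is the pattern a pure discretisation artifact would show. For the measured pair the multiplier stays above 1 at every ε, so a smaller step only slows the growth per step.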
+**OBS (oscillation, in-progress):** windowed val_std SHRANK monotonically as it descended (0.159→0.031, steps 50→3000) = converging. At step 3400 a TRANSIENT val spike (2.65→3.67) self-recovered over ~150 steps (→2.79), best preserved (2.5624). KEY: throughout the spike **res stayed LOW (4e-4→8e-5) and jr LOW (0.1)** → it is NOT a Hopf/relaxation instability (res did not inflate) but a benign weight/CE training transient (a momentarily-off update; relaxation stayed healthy). So jacreg's Hopf-suppression is intact; spike is a separate phenomenon. Watch: if such spikes recur/grow, investigate — but "res-controlled-during-spike" already rules out Hopf re-emergence. diff --git a/docs/campaign/FINDINGS.md b/docs/campaign/FINDINGS.md new file mode 100644 index 0000000..0cb8b53 --- /dev/null +++ b/docs/campaign/FINDINGS.md @@ -0,0 +1,609 @@ +# EP / AEP for Transformers — Findings + +**Question.** Can transformers (attention + FFN) be trained with Equilibrium Propagation (EP), +i.e. without backprop / without Feedback Alignment? Started from literature, reproduced the SOTA, +characterized AEP (EP for non-conservative dynamics), and ported the recipe to a real char-LM. + +--- + +## TL;DR + +1. **EP/AEP give FAITHFUL LOCAL GRADIENTS for every transformer component** — attention + cosine **0.99**, FFN **1.0** vs the true BP gradient. This solves the credit-assignment that + **Feedback Alignment fails at**: FA only trains layers adjacent to the loss (output proj cos≈1.0) + and leaves upstream layers as noise (**attention Q/K/V cos≈0.25, FFN fc cos≈−0.01**). +2. **Recipe to make real (non-conservative) attention EP-able: DAMPING + AEP.** A damping term + `s·(attn(z) − c·z)` (c≥1) creates a stable fixed point at *any* attention strength, while keeping + the map non-conservative (independent Q/K/V); AEP then recovers the gradient (0.99–1.0). +3. 
**Stable end-to-end EP training: SOLVED** by a residual-driven continuous controller on a soft + Jacobian penalty (Bai-2021-style, Hutchinson) + damping. EP trains 12k+ steps without blowup. + **There is NO structural "EP ceiling"** (a ~2.5-CE wall was claimed mid-project and is + **retracted** — see *2026-06-09* section): EP has a quantitative **validity threshold** + (free-phase residual ≲ few·1e-4, nudge inside the linear-response window), and meeting it costs + regularization tax + steps — both reducible, neither a wall. + +--- + +## 2026-06-20 — below-2.10 EP divergence: ROOT CAUSE + the residual-defense fix + +The C512 EP wall (EP frozen-jr descends to best ~2.09, then SUDDENLY diverges within ~200 steps: +res 5e-3→0.15, cos(g_EP,g_BPTT) 0.98→~0, CE→4+, abort — while exact-BPTT with the IDENTICAL recipe +sails past to 1.72, freeze_wsd) is diagnosed. NOT the controller, NOT jacreg, NOT the erf/tanh GELU, +NOT a loss-landscape wall. + +**Root cause (Codex-confirmed, 5-way corroborated): EP optimizes the FIXED POINT; BPTT optimizes the +FINITE UNROLL — only the finite unroll defends the residual.** `ep_step` relaxes T1=150→z150, then REFINES +to t1max=300→z* and takes the gradient at z*; but eval & `bptt_step` use z150. EP never "feels" the T1 +residual → as attention gets expressive, contraction weakens, z150 drifts from z*, res@T1 inflates, the +EP estimate (valid only at small res) dies → blowup. BPTT differentiates the actual 150 steps, so a +non-converging unroll → bad CE → its gradient implicitly rewards strong contraction. That defend-the-residual +term is what EP structurally lacks: cos-0.977 holds only AT the fixed point; the missing ⊥ component is ~21% +(=√(1−0.977²)) = the finite-horizon transient gradient (T1=∞ would make even BPTT lose it — BPTT's stability +IS the finite truncation; the equilibrium/implicit gradient only cares WHERE the fixed point is, not how fast +you reach it → no contraction-reward). 
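A scalar toy model makes the fixed-point versus finite-unroll distinction concrete. The sketch below is an illustration under stated assumptions, not the project's training code: the force is linear, F = -k (z - z*), the loss is quadratic, and every name (`unroll`, `finite_horizon_grad`, the chosen constants) is hypothetical. The parameter k changes only the contraction rate ρ = 1 - εk, never the location of z*.

```python
EPS, T, Z0, Z_STAR, TARGET = 0.1, 150, 0.0, 1.0, 1.0

def loss(z):
    return 0.5 * (z - TARGET) ** 2

def unroll(k):
    """T explicit-Euler steps of dz/dt = -k (z - z*), started from Z0."""
    z = Z0
    for _ in range(T):
        z = z + EPS * (-k * (z - Z_STAR))
    return z

def equilibrium_grad(k):
    """dL(z*)/dk: z* does not depend on k, so the fixed-point loss is flat in k."""
    return 0.0

def finite_horizon_grad(k):
    """dL(z_T)/dk from z_T - z* = (1 - eps*k)^T (z0 - z*)."""
    rho = 1.0 - EPS * k
    dz_dk = -EPS * T * rho ** (T - 1) * (Z0 - Z_STAR)
    return (unroll(k) - TARGET) * dz_dk

def finite_difference(k, h=1e-6):
    return (loss(unroll(k + h)) - loss(unroll(k - h))) / (2 * h)

for k in (0.18, 0.02):                       # rho = 0.982 and rho = 0.998
    print(f"rho={1 - EPS * k:.3f}  equilibrium={equilibrium_grad(k):+.4f}  "
          f"finite-horizon={finite_horizon_grad(k):+.4f}  fd-check={finite_difference(k):+.4f}")
```

In this toy the equilibrium gradient with respect to k is identically zero, while the finite-horizon gradient is negative (more contraction lowers the loss at step T) and grows by about two orders of magnitude as ρ moves from 0.982 to 0.998.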
+ +Two layers: (1) point-mismatch (refinement) — `--t1max 150` alone moved the wall 2.09→2.05 but still blew@600; +(2) gradient-flavor — needs the explicit penalty. The diverged state is a forward bifurcation to a LIMIT CYCLE, +not under-relaxation (eval_relax: res floors ~6e-2 and oscillates, 150→4000 relax steps don't help, CE ~3.7; +FTLE stays negative −0.027..−0.050 — single random-vector FTLE misses the cycle). ⇒ adaptive / more relaxation +steps CANNOT fix it (res as a STOPPING criterion just chases a vanishing fixed point); only res as a COST +(penalty) prevents the drift. + +**Fix — explicit T1-residual penalty `--resreg`**: defend z150 with R=‖εF(z150)‖²/(‖z150‖²+ε), grad w.r.t. θ +(z150 detached), task-relative scale (ratio = resreg·min(1, res@T1/2e-2), deadband res@T1>7e-4), added to the +EP gradient. **MUST run res_gate=0** (the validity gate early-returns jacreg-only above the gate, bypassing the +penalty exactly when res is high — first gated attempt blew@200). Keep t1max=300 (estimator accuracy) + penalty +(defends z150). Analog-compatible (one extra force measurement + local VF gradient, no root-finder); more +targeted than jacreg (which penalizes ‖J_nc‖, not the residual that explodes). + +**Validation (interrupted by /tmp wipe, needs re-run):** ep_resreg2 (res_gate=0, resreg=0.2, warm from erf-2.09) +held res pinned 1–5e-4 through step 1000, best **2.0573** (past the wall, lowest any EP run reached), zero +inflation — where every no-penalty variant blew (ep_nogate@100, gated ep_resreg@200, ep_t1max150 reached 2.05 +then blew@600). The run + ALL local ckpts were deleted 2026-06-20 by the /tmp 10-day cleanup before reaching +~1.8. 
Corroboration (5): Codex independent diagnosis; BPTT-from-2.09 control (res stayed 1.8e-4 + descended, +same fresh-opt restart where EP blew → the EP *update* is the destabilizer); FTLE (stable-BPTT 1.72 has WEAKER +contraction −0.0347 than diverged-EP −0.0377 → not forward-stability); eval_relax (limit cycle); gradient +decomposition (jacreg acquitted: 3% of grad, orthogonal, removing changes cos<0.001). Verbatim code + re-run +plan: **EP_BELOW210_DIAGNOSIS_FIX.md**. + +--- + +## Background (literature) + +- **EP** — Scellier & Bengio 2017. Energy-based; free phase relaxes to a fixed point, nudged phase + with strength β; local weight update from the two equilibria. Centered/symmetric nudging + (Laborieux 2021) reduces the gradient-estimator bias. +- **EP ≡ BPTT** — Ernoult et al. 2019: EP updates match BPTT gradients in an RNN with static input, + in the limit β→0 with a converged free phase and enough nudged steps. (Relevant: as β→0 EP is + unbiased — so our instability is NOT primarily gradient bias.) +- **Holomorphic EP** — Laborieux & Zenke 2022: exact gradients via finite oscillations, removing the + β bias/noise trade-off. +- **CET** — Høier, Kerjan, Scellier (ICLR 2026 AM workshop), arXiv via OpenReview `Qrfml76eWJ`. + Convergent Energy Transformer, EP-trained, CELEBA masked completion, EP ≈ TBPTE. The current SOTA + for "EP + attention". Conservative (energy) attention with tied value → guaranteed fixed point. +- **AEP / AsymEP** — Scurria, Vanden Abeele, Mognetti, Massar, "EP for Non-Conservative Systems", + arXiv:2602.03670. Nudged-phase correction `−2 A_J(x*)(x−x*)` (A_J = antisym part of the Jacobian + at the free eq) turns the nudged Jacobian J → Jᵀ, giving the exact gradient even when no energy + function exists. **The correction is theirs**; their scope is feedforward/Hopfield nets on static + MNIST/CIFAR (no attention, no sequence, explicit Jacobian, no stability controller). Their + force-form **VF** baseline collapses without the correction. 
See 2026-06-16 for the full ours/theirs + boundary. +- **FRE-RNN** — "Toward Practical EP", arXiv:2508.11659 (Zhuo Liu et al.). Fixes EP instability via + **feedback regulation (reduce spectral radius → fast convergence) + residual connections** + (vanishing gradients). Directly relevant to our open problem (keep the free phase converged). +- Adjacent energy-transformers (BP-trained, not EP): Energy Transformer (2302.07253), + EBT "Scalable Learners and Thinkers" (2507.02092), NrGPT (2512.16762). + +--- + +## The arc (what we built & found) + +| stage | finding | evidence | +|---|---|---| +| literature | EP+attention works only narrowly; CET is SOTA (CelebA, EP≈BP, single block) | CET full read | +| MVP (CET repro) | EP trains energy-attention+memory; **cosine 0.99; EP≈TBPTE** | CIFAR 0.539/0.546; FMNIST 0.272/0.278 | +| AEP (non-cons.) | corrects real-attention gradient: toy 0.30→**0.9975**; CET attn-params 0.75→**1.0** | `aep_*.py` | +| characterization | needs T2≥20–40, T1≥80; β-insensitive (static); **advantage grows with depth** (K=2: naive 0.05 vs AEP 0.99); only changes nudged phase (free eq identical); 1.44× cost | sweeps | +| fundamental limit | **strongly non-conservative attention has NO fixed point → whole EP family fails** (projection bounds magnitude but does not create a fixed point) | residual stuck 3e-2 | +| **F: damping** | `s·(attn−c·z)`, c≥1 → stable fixed point at any s → **AEP 0.99–1.0** even at s=8 | `aep_contractive2.py` | +| Option-2 LM attention | **AEP gives LM causal attention 0.993** (FA 0.25 — FA kills Q/K/V) | `lt_ep_attention.py` | +| Option-2 H1 (FFN) | **EP Hopfield-memory gives FFN 1.000** (FA −0.01 — the abandonment reason) | `lt_ep_ffn.py` | +| Option-2 H2 (train) | **BPTT trains (val CE 2.16↓, random 4.17); EP local training destabilizes early** | `lt_ep_train.py`, `H2_*.log` | + +**Why FA failed (the project's pain, confirmed):** FA only delivers a usable gradient to the layer +right before the loss (output projection / FFN 
proj, cos≈1.0); all upstream layers (attention Q/K/V, +FFN fc) get random-routed error → cos≈0. EP/AEP fix all of them. + +**Why EP training destabilizes (mechanism):** BPTT differentiates whatever the fixed T1-step +relaxation computes (a deep weight-tied net) — it does NOT need convergence. EP *assumes* the free +phase is at a fixed point; gradient descent on the loss pushes attention to be more expressive / +non-conservative → fixed point lost → relaxation diverges → EP estimate (and its jvp/vjp correction) +goes non-finite. Damping + clipping + weight-caps delay but don't prevent it. +*(Superseded: the residual-controlled Jacobian penalty DOES prevent it — see the 2026-06-09 section.)* + +--- + +## Key numbers + +- CET MVP (masked completion, test masked-MSE): CIFAR **EP 0.539 ≈ TBPTE 0.546**; FMNIST **EP 0.272 ≈ TBPTE 0.278**; BP-transformer 0.106/0.126; trivial(visible-mean) 0.583. +- EP-vs-BPTT gradient cosine (CET): global **0.99**; attention WQ/WK 0.98. +- AEP recovers non-conservative attention: toy naive 0.30 → AEP **0.9975**; CET attn-params 0.75 → **1.0**; damped high-s (s=8) AEP **0.99–1.0**. +- LM (Shakespeare char): AEP attention **0.993** (FA 0.25); EP Hopfield-FFN **1.000** (FA −0.01). +- H2 training: BPTT val CE **2.16 ↓** (random ln65=4.17); EP non-finite / no sustained training. + +--- + +## Open problem & next directions + +The frontier is **stable end-to-end EP training**. Hypotheses / plans: +1. **Keep the free phase converged during training** (FRE-RNN, 2508.11659): feedback regulation to + keep the spectral radius < 1 + residual connections. Most directly targets the mechanism. +2. **β bias/noise trade-off** — Ernoult 2019 says EP≡BPTT as β→0 with converged phases; but small β + amplifies the `(z₋−z₊)/2β` noise. Holomorphic EP (Laborieux–Zenke 2022) removes this trade-off. + Diagnostic: log cosine(EP-grad, BPTT-grad) *during* training — does it start 0.99 and drop? +3. 
**Unified conservative energy** (no AEP): make the whole LM one energy (energy-attention, tied + value) → guaranteed fixed point → plain EP is stable (we saw CET-EP is stable). Trades attention + expressivity for stability. + +## 2026-06-09 — H2 training era: the "wall" refuted, the real mechanism found + +**Setting.** Shakespeare char-LM, single equilibrium block, C=128 H=4, same param budget everywhere. +`--attn_mode thin/real` = clamp + Hopfield-FFN + damped real attention; `thick` = DEQ-transformer +block (pre-LN + attention + untied 4× GELU FFN + residual + damping). Stabilizer = damping `c` + +soft Jacobian-norm penalty λ (Hutchinson on the non-conservative force) driven by a continuous +controller on the free-phase residual (residual, NOT spectral radius, is the right signal — the +attention Jacobian is non-normal, transient growth invisible to ρ). + +**Scoreboard (best val CE; random ln65 = 4.17):** + +| run | result | +|---|---| +| BP transformer, MLP=4 | 1.68 | +| BP transformer, MLP=1 (param-matched) | 1.79 | +| thick-BPTT (exact grad, same arch as thick-EP) | **1.95** (res ~1e-3→8e-3, learned contraction unaided) | +| mono-BPTT (monDEQ) | 2.11 | +| thin-BPTT | 2.21 (optimum at res ~1.5e-2) | +| **thick-EP champion (R5): R3 recipe + EMA-smoothed controller, 14k** | **2.0467** @ step 5k — within 0.10 of exact-gradient thick-BPTT | +| thick-EP (R3): λ-floor 0.1, res_target 5e-4, c=1, 8k | 2.0784 — beats thin-BPTT and mono-BPTT | +| thick-EP (R1): λ-floor 1, c=2, 14k | 2.1504 — 2.467@3k → 2.259@7.2k → 2.150@14k: monotone, no plateau, just slow | +| thick-EP (R4): λ-floor 0.1, res_target 2e-4, 14k | 2.1352 — tight target ⇒ controller thrash (λ swings 0.5↔13 late) | +| thick-EP (R2): λ→0, c=0.5 | 2.3572 @ step 800, then **collapse to a degenerate fixed point** (res→0 exactly, val ≫ random) | +| thick-EP, c=2 λ=1, 3k (the run the "wall" was called on) | 2.4665 (just undertrained) | +| thin-EP super-long 12k | 2.4847 plateau — **invalid-regime run**: res 
2.4e-2–6e-2 the whole time, λ pinned at max 16 | +| thin-EP 3k, λ=16 | 2.5593 | + +**Retraction.** Mid-project verdict — "EP is capped at ~2.5 by a convergence⟷richness wall +(rich blocks need damping that destroys their expressivity)" — is **wrong**, called prematurely at +3k steps when thick-EP ≈ thin-EP ≈ 2.47–2.50 by coincidence (two different slow/broken modes passing +the same value). Three independent refutations: (1) the 8k continuation sailed through the "wall" +(2.2592@7.2k, monotone); (2) the slack run hit 2.357 in 800 steps; (3) the gradient probe below shows +the EP estimator is healthy exactly where the stabilized runs operate. Also wrong in the original +story: damping c does NOT aid convergence for LN-inside blocks (LN Jacobian ∝ 1/σ(z): damping shrinks +z*, *inflating* the Jacobian — measured: thick plain-relax res 8.8e-3 at c=0 vs 3.4e-2 at c=2, +`solver_wall.py`), and the scary "init residual~12" was an unnormalized absolute-norm print. + +**The real mechanism — a validity threshold, not a wall** (`grad_quality.py`: cosine of each +estimator vs long-horizon BPTT (T1=400) reference, at a 300-step-BPTT-pretrained thick block): + +| estimator | free-phase res | cos vs BPTT-400 (3 batches) | +|---|---|---| +| EP T1=400, T2=20 | ~5e-5 | **0.85, 0.83, 0.88** — healthy | +| EP T1=150, T2=20 | 2e-4–1.6e-3 | 0.21 / 0.86 / 0.27 — marginal, batch-dependent | +| EP T1=50, T2=20 | 2.4e-3–3.6e-3 | −0.03 / 0.34 / 0.55 — broken | +| BPTT T1=150 (itself!) 
| — | 0.45 / 0.77 / 0.36 vs BPTT-400 | +| EP T1=150, **T2=60** | same as T1=150 | 0.01 / 0.07 / −0.19 — **nudged phase leaves the linear-response window** | + +Readings: (a) at tight convergence the EP estimator agrees 0.85 with the exact gradient — the +EP-specific overhead (finite β, T2 truncation, AEP clipping) is only ~0.15 of misalignment; +(b) **the horizon/mixing cost is shared by BPTT itself** (BPTT-150 vs BPTT-400 cos 0.35–0.77): +slow-mixing equilibria are expensive for everyone, not an EP defect; (c) EP needs res ≲ few·1e-4 to +be valid — at res ~1e-2 (where the thin super-long run lived) EP updates are noise, which is what +that 2.48 "plateau" actually was: 12k steps of noise + λ=16 Jacobian-shrinking gradient; (d) longer +nudged phase is NOT better — the AEP correction is linearized at z*, so T2 must stay in the +linear-response window (T2≈20 at eps=0.1 good, T2=60 destroys the estimate). + +**Decomposition of the EP→BP gap (final numbers):** architecture (BPTT−BP) = 1.949−1.79 = +**0.16** (small — the energy/fixed-point architecture is fine); training rule (EP−BPTT) = +2.047−1.949 = **0.10** — slower optimization (estimator cos ~0.85) + the λ tax of staying inside +the validity region. Total EP→param-matched-BP = **0.26** (the "wall" story claimed ~0.7 was +structural). EP-trained thick (2.047) beats exact-gradient thin (2.206) and monDEQ (2.111): +EP *does* cash in architecture richness, given validity + steps. + +**λ-study (final, prediction confirmed: middle wins):** R5 2.047 < R3 2.078 < R4 2.135 < R1 2.150 +< R2 2.357. The two failure modes flanking the optimum: λ→0 ⇒ fast descent then *collapse* (not +explosion — a dead res=0 equilibrium); λ heavy ⇒ no failure, just a constant tax (R1 monotone to +2.150@14k). R4's lesson: too-tight res_target puts the multiplicative controller on the residual +noise floor — `(res/target)^0.3` per-step updates compound the noise and λ thrashes 0.5↔13, +degrading late training. 
R5's fix: EMA-smoothed residual signal (`--res_ema 0.9`) kept λ quiet +(0.1–1.0) and set the champion 2.0467 by step 5k. + +**Open item (the actual next frontier):** every stable recipe peaks mid-run (~5–6k) and then drifts +up 0.1–0.2 despite cosine lr decay — EP's late-training noise floor. Suspects: VF-estimator noise +dominating at small lr (cos 0.85 ⇒ persistent gradient-direction error), λ/lr balance shifting late, +weight-norm caps binding. Holomorphic EP (removes β bias/noise) is the principled candidate for the +remaining ~0.10; T2 must stay in the AEP linear-response window (T2≈20, NOT longer). + +## 2026-06-10 — estimator/controller co-design: EP reaches 1.74, beats exact-gradient BPTT + +The Holomorphic-EP line (see METHODS §4.3) + controller re-calibration, in causal order: + +1. **R7 — holomorphic clamp-free nudge** (N=2, r=0.02 Cauchy readout, no g-clamp / no corr-clip, + closed-form ∇zCE): best **2.029**, faster wall-clock than plain EP (3.3 vs 2.45 it/s). Probe + discoveries: the legacy clamps were the dominant estimator error at marginal residuals + (cos 0.27→0.89 at res 1.6e-3); N and r are flat ⇒ β-bias/noise was never the binding error; + T2 truncation is the remainder (T2=120 → cos 0.985 when the nudged phase is stable; adaptive-T2 + early-stopping defeated by non-normal transient growth — open). +2. **R8 — param-EMA alone: tie (2.031).** Exposed the real late-drift mechanism: λ pinned at + jr_max=16 in the tail — the controller enforcing the OLD validity margin (5e-4) against weights + that want res~1e-3, which the new estimator tolerates fine (cos 0.89 @ 1.6e-3). +3. **R9 — controller re-calibrated to the new estimator** (target 1.5e-3, λmax 4) + param-EMA: + **best 1.7399**, sustained EMA plateau 1.75–1.79, λ quiet at 0.1–0.5, tail ~1.92. Worth 0.29 — + the single largest improvement of the project. 
Lesson: **estimator and controller must be + co-designed**; upgrading one without re-tuning the other converts the controller into the main + source of harm. +4. **14k controls** (same horizon as R9): BP MLP=4 **1.610**, BP MLP=1 1.689. Param-shape + correction: the **thick** block's 131k FFN matches **MLP=4**, not MLP=1 (which matches the thin + block's 33k Hopfield memory) — so thick-EP's like-for-like BP baseline is 1.610. + Unregularized **BPTT-14k destabilizes** (res→4.7e-2, val→3.0, best 2.021 — worse than its own + 3k run, 1.949): the equilibrium architecture NEEDS the contraction control loop for long + training; EP carries it out of estimator necessity, bare BPTT got nothing. +5. **The decisive control — BPTT + identical λ-controller + param-EMA: 1.6348**, tail stable. + Verdict: most of EP-beats-bare-BPTT was the regularizer; with equipment matched, exact + gradients lead EP (R9) by **0.105** — which matches the measured estimator misalignment + (cos 0.85–0.90). Bonus: the controlled equilibrium block (1.635) beats the MLP=1 BP baseline + (1.689) — the contraction controller is good for the architecture independent of training + rule; EP merely forced its discovery. +6. **Adaptive T1/T2 (R10): EP 1.6755** — ate 0.064 of the 0.105 rule tax. Adaptive T2 solved by + *hindsight snapshot selection* (judge by increments of the contrast estimate a_t, never by step + sizes — non-normal transients can't fool it; dangerous batches self-limit to t≈20–30, stable + ones run to ~110 and collect cos 0.987; probe mean 0.871→0.932). Adaptive T1 companion: the + λ-controller signal stays sampled at T1=150 (R9 peace preserved), then relaxation refines to + res≤1e-4 before nudging — long-T2 gains exist ONLY from tight equilibria (at res~1e-3 long T2 + hurts). **Final 14k decomposition: architecture 0.025 (1.635 vs 1.610), EP rule tax 0.041 + (1.676 vs 1.635), total 0.066.** EP now beats the MLP=1 BP baseline (1.689). 
+ *Multi-seed (3/arm):* EP **1.680 ± 0.005** vs BPTT-ctl **1.639 ± 0.006** ⇒ tax + **0.041 ± 0.005 (~9σ)** — confirmed real. +7. Refuted along the way: λ-floor∝lr annealing (R6) reproduces the λ→0 death; post-mortem of that + death (R2/R6): **explosion frozen by fp32 absorption masquerading as convergence** (val 60–77 + with res≡0: ε·F < ulp(z); uncapped tok/pos/fc/pj blow up) — the λ floor is the anti-collapse + mechanism, not just a tax. + +## Code map + +- `~/ept/cet_mvp.py` — CET energy model + EP & TBPTE training (App. B faithful). +- `~/ept/bp_transformer.py` — vanilla BP transformer baseline. +- `~/ept/cet_aep.py` — CETReal (real non-conservative attention) + AEP, VF gradient, damping, clip. +- `~/ept/aep_characterize.py` / `aep_depth.py` / `aep_projected.py` / `aep_contractive2.py` — AEP sweeps. +- `~/ept/aep_option1.py` — projected (constrained) adjoint. +- `/tmp/lt_ep/lt_ep_attention.py` — AEP on the LM's causal attention (gradient quality). +- `/tmp/lt_ep/lt_ep_ffn.py` — EP Hopfield-memory vs FA-FFN (gradient quality). +- `/tmp/lt_ep/lt_ep_train.py` — H2 end-to-end EP vs BPTT training of the equilibrium LM block + (`--attn_mode real/energy/mono/thick`, jacreg controller `--jacreg/--jr_floor/--res_target`). +- `/tmp/lt_ep/grad_quality.py` — EP-vs-exact gradient cosine vs residual level (the validity probe). +- `/tmp/lt_ep/solver_wall.py` — plain relax vs Anderson convergence per damping level (thin vs thick). 
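For orientation, the residual-driven jacreg controller behind `--jr_floor/--res_target` can be sketched as follows. This is a minimal sketch under assumptions, not the code in `lt_ep_train.py`: it combines the multiplicative `(res/target)^0.3` update and the `--res_ema 0.9` smoothing described in the λ-study with a clip to [floor, max]. The class name `ResidualController`, its fields, and the update order are hypothetical; the defaults are the R9 values quoted in this document (target 1.5e-3, λmax 4, floor 0.1).

```python
class ResidualController:
    """Hypothetical multiplicative controller for the Jacobian-penalty weight lambda."""

    def __init__(self, res_target=1.5e-3, jr_floor=0.1, jr_max=4.0,
                 gain=0.3, res_ema=0.9, lam_init=1.0):
        self.res_target = res_target
        self.jr_floor = jr_floor
        self.jr_max = jr_max
        self.gain = gain
        self.res_ema = res_ema
        self.lam = lam_init
        self.smoothed = None  # EMA of the free-phase residual

    def update(self, res):
        # Smooth the residual so per-step noise does not compound multiplicatively.
        if self.smoothed is None:
            self.smoothed = res
        else:
            self.smoothed = self.res_ema * self.smoothed + (1.0 - self.res_ema) * res
        # Raise lambda when the smoothed residual is above target, lower it when below.
        self.lam *= (self.smoothed / self.res_target) ** self.gain
        # Keep lambda inside [jr_floor, jr_max]; the floor is the anti-collapse guard.
        self.lam = min(self.jr_max, max(self.jr_floor, self.lam))
        return self.lam


if __name__ == "__main__":
    ctl = ResidualController()
    for res in (2e-3, 2e-3, 1e-3, 5e-4, 5e-4):
        print(f"res={res:.1e} -> lambda={ctl.update(res):.3f}")
```

Setting `res_ema=0.0` recovers the unsmoothed controller, which is the variant the R4 run suggests will thrash when the target sits on the residual noise floor.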
+ +*(local-transformer working copy in `/tmp/lt_ep`; original `~/local-transformer` untouched.)* + +## 2026-06-13 — late-drift diagnosis + +Every stable recipe bottoms out mid-run and then val drifts up (S0 ~0.05, mild; S1/L2 ~0.5, severe). Multi-hypothesis diagnosis (drift_diag.py, 13 columns per 200 steps): +- **Embedding-runaway hypothesis rejected**: |emb| grows to 11.9, then weight decay pulls it back to 11.6; self-limiting, not the culprit. +- **Drift does not harm reported numbers**: saving the best checkpoint suffices (L2 drifts to 1.70 but reports best 1.214; warm-track reports 1.14). +- **Drift does not hide a lower solution**: after warm-track bottoms out at 1.14 (4800 steps), lr 3e-4 drifts and lr 1e-4 freezes (1.15); neither finds anything lower ⇒ 1.14 is the true floor of this recipe, and drift = being jostled on that floor by noisy gradients. +- **Mechanism = late-training steps along bad directions**: low lr pins the drift (the anti-drift continuation freezes at 1.16); S0 `cos`(EP,BPTT400) falling 0.85→0.3 late in training confirms that the direction degrades. +- **Severity grows with scale**: S0 (C128/T64) ≈0.05, S1 (C256/T256) ≈0.5. +- Open (S1 warm-start diagnosis + bctl control): is the late cos collapse EP-estimator degradation (A, → fixable by N=4/tracking) or horizon ambiguity of a slow-mixing equilibrium (B, shared by EP and BPTT, → the architecture must speed up mixing)? bctl=cos(BPTT150,BPTT400): if it collapses too, then B. + +## 2026-06-13 (cont.) — "why EP far at S1": operating-point + bias/variance diagnosis + +Reframe: at the SAME (tight, res 1.5e-3) operating point EP (1.14–1.39) ≥ BPTT (1.52). The apparent +"gap" is EP(1.14) vs LOOSE-BPTT(0.97) — an operating point (res 1e-2, non-contractive) EP cannot +use. Mechanism: T=256 long-range mixing wants non-contractive dynamics; EP needs contraction for the +fixed point to exist; contraction suppresses the mixing. Signature: gap grows with T (S0/T64: 0.04; +S1/T256: 0.17). + +bias_var.py (v4b ckpt, 16 batches, EP vs BPTT-400, BPTT-150 control), per group [mean-cos | cos-means]: +- all 0.24|0.36 ; BPTT-horizon ctrl 0.44|0.57 +- attn 0.22|0.33 ; ffn 0.32|0.49 ; ln 0.19|0.30 ; emb 0.31|0.37 +Readings: (1) two EXACT gradients (BPTT-150 vs 400) only reach cos-means 0.57 — slow mixing makes +"the gradient" horizon-ambiguous; a PERFECT EP estimator's ceiling here is ~0.57, NOT 1. (2) Both +EP and BPTT improve with averaging (variance present, partly recoverable via navg). (3) EP-specific +residual concentrates in **attention + LN** (EP cos-means 0.33/0.30 vs BPTT 0.57/0.65); **FFN nearly +fine (0.49)**, emb on par. 
⇒ the AEP-corrected softmax/LN path is the weak link, pointing the fix at +qk-norm (bounds logits, conditions the attention Jacobian) + N=4 holo + navg. Chasing cos→1 is the +wrong target; matching BPTT's 0.57 is (~0.21 of recoverable EP-specific alignment). + +## 2026-06-13 (cont.) — scaling to BPE/50M: stability via small-init residual branches + +50M (C=2048) from scratch ABORTED (init res 26; qk-norm+c=2 insufficient — res stuck 0.38). Fix: +**ReZero/Fixup-style small-init of the residual-branch output projections** (WO, pj ×0.1) makes the +thick block ≈ -(z-xin)-c·z (near-identity, trivially contractive) at init; training grows them. +Result at 15M (C=1024): res **1.1e-8 at step 50** (vs abort). The big-width init-instability wall is +a random-init artifact, not fundamental; --resinit folds into the standard recipe. Separately, +torch.compile's 14.6× was a SMALL-model (launch-overhead) artifact; at C≥1024 the relaxation is +compute-bound (fp32 ~15 TFLOP/s) → ~0.06 it/s at 15M, ~0.03 at 50M. Demo feasibility: 15M solo +(~2-4 days); 50M needs multi-GPU data-parallel (per-sample contrast → gradient all-reduce) or the +external cards. The fp32 contrast (catastrophic-cancellation-bound, bf16 dead) keeps it compute-heavy +— again the "digital expensive / physical free" axis. + +## 2026-06-14 — the big-model EP recipe (scaling past S1 needs a stability stack) + +Scaling EP from S1 (C=256/T=256/char) to C=1024/T=512/BPE-4096 broke the S1 recipe: stable for ~50 +steps then violent free-phase blowups (res → 0.5–1.4, jr pinned at max) — the controller's stability +margin shrinks with width/context. CAUTION: a step-50 res≈1e-8 snapshot is NOT proof of stability; +must watch through warmup-end peak-lr (~step 500–1000). The stack that fixes it (each necessary): + 1. **resinit** (WO,pj ×0.1) — near-identity contractive block at init. + 2. 
**qk-norm** (Qwen3 q/k RMSNorm) — bounds attention logits / Jacobian (also fixes the attn/LN + estimator weak link from the 06-13 bias_var diagnosis). + 3. **lr warmup** (≥1000 steps linear) — let the λ-controller establish contraction before big steps. + 4. **µP-style lr scaling**: lr 1e-3 (C=256) → **4e-4 (C=1024)**; 1e-3 caused a 400-step instability + episode, 4e-4 only brief recovered excursions. + 5. **higher jr_max** (16 → 32) — bigger non-conservative force needs more penalty headroom. + 6. validity gate + fuse retained (catch the residual transient excursions). +Locked 15M demo recipe (running, 16k steps, best-ckpt): C=1024 H=16 T=512, the 6 items above, +holo N=2 + adaptive T1/T2, pema 0.999. Smoke val descends 5.86→3.94 BPE-CE by step 1000. +Cost: ~0.05 it/s (fp32, compute-bound; compile's 14.6× was a small-model launch-overhead artifact) +⇒ 15M ~3 days solo. 50M (C=2048) still unstable even with resinit 0.03 + warmup — needs a dedicated +init/lr curriculum AND multi-GPU; deferred as a sub-project (not on the demo critical path). + +## 2026-06-16 — C=512 BPE scale: baselines, AsymEP attribution, early-slowness dissection + +**C=512 BPE-4096 scoreboard** (TinyStories, T=256, B=24; random ln4096 = 8.318; target band 1.0–1.5): + +| run | result | +|---|---| +| BP standard transformer (7.48M, no warmup, lr 3e-3) | **best 1.7921** (20k, DONE) | +| EP thick "orphan" (lr 8e-4, warmup800, resinit0.1, holo2, t2sel40) | **best 2.4037** (20k, DONE; jr pinned at floor 0.1 the ENTIRE run — zero excursions) | +| EP thick "chin" (lr 9e-4, full-Chinchilla 24k target) | ABORT @5401, best 3.2408 | +| BPTT thick (exact grad, same arch) | ABORT @2113, best 3.7331 | + +Data notes (NOT verdicts): (a) BP-C512 itself = 1.79 sits just *above* the 1–1.5 band (BP needs +C≥1024 to enter it); EP-orphan 2.40 is 0.61 above BP-C512. 
(b) **the exact-gradient BPTT twin +aborted EARLIEST** (2113, 3.73) — even worse than the EP runs — consistent with "training walks off +the contractive manifold, exact gradient included"; lr/seed not matched across these, treat as +directional. (c) abort = the `res>0.1 for 100 consecutive steps` fuse. **lr is the knife-edge**: +8e-4 (orphan) had NO excursions and finished; 9e-4 (chin) had periodic jr→32 / res→0.2–0.5 +excursions, most recovered, one didn't make it back inside the 100-step fuse window → abort@5401. +Narrow lr-stability margin at C=512. (d) orphan late curve is a plateau: best moved only +2.4557→2.4037 over the last 3600 steps. + +**AsymEP attribution (web-verified, the J→Jᵀ correction is NOT ours).** arXiv:2602.03670, +Scurria / Vanden Abeele / Mognetti / Massar, "EP for Non-Conservative Systems", names the method +**AsymEP**. Their correction `−2 A_J(z*)(z−z*)` is **mathematically identical** to ours +(`Jv−Jᵀv = (J−Jᵀ)v = 2 A_J v`). Their scope: Hopfield nets + feedforward MLPs (≤500 hidden) + +one CIFAR-10 feedforward CNN; **static inputs; explicit Jacobian construction + decomposition; +NO residual/validity controller; NO attention / transformer / sequence model; do NOT combine with +holomorphic**. Best result: CIFAR-10 FF-CNN AsymEP 89.74% vs BP 90.66% (their variational "Dyadic +EP" matches BP, p=0.75). Their **"VF" baseline** (the force-form ⟨a,∂F/∂θ⟩ readout — *prior art, not +ours*) **collapses without the correction** (CIFAR-10 → 10% = chance; MNIST 64% vs AsymEP 92.7%) — +matching our measured cos≈0.25 for uncorrected non-conservative attention. **Ours on this line**: +matrix-free jvp−vjp (their explicit Jacobian is infeasible at B·T·C dim); softmax-attention +application; the holomorphic combination (the real-linear correction preserves holomorphy); +common-mode-tracking + lock-in variants; the validity-threshold/controller/gate stability stack they +have none of; and the transformer-LM application + scale. 
(⇒ corrected EP_DERIVATION.md, which had +mislabeled force-form VF as our step.) + +## 2026-06-17 (cont.) — full to-spec versions of exp1/4/5/6/7 (first pass was partial) + +The 06-17 campaign ran lighter/subset versions; re-ran the FULL specs. New findings: + +**exp1 (FULL) — multi-checkpoint cos-evolution (snap 0/200/800/2000 + plateau).** EP·BPTT150 cos: +step0(random,res2e-2) 0.26 → step200(res1e-9) **0.99** → step800 0.96 → step2000 0.89 → +plateau(2.40) **−0.05**; BPTT150·BPTT400 = 0.46→**1.00** throughout (true gradient always clean). +New **batch self-cos** at the plateau: EP **−0.27** (gradients on different batches anti-correlated) +vs BPTT **+0.96**. ⇒ decisive: the EP estimator is faithful while descending and undergoes an +**SNR/coherence collapse near the optimum** (true gradient shrinks below the EP bias floor, which is +batch-incoherent) — NOT horizon ambiguity, NOT LR. This is THE 2.40-plateau mechanism. + +**exp5 (FULL) — jr_max sweep (the dropped dimension, and it mattered).** Warm-started from 2.40: +**jr_max 8 → 2.3648**, jr_max 4 → 2.376 — both BEAT jr_max 32 (2.463) and the orphan (2.404) by +~0.04. jr_max 16 → 2.416. floor/target fill still ≥2.47. ⇒ the jr_max=32 ceiling was a small tax; +~8 is better. (Earlier "not a controller tax" was incomplete — floor/target don't help, but the +ceiling does.) Still far from the 1–1.5 band; the estimator floor dominates. + +**exp4 (FULL) — 3 arms.** armA (warmup): ~50-step gate dead-zone. armB (`cprewarm 200` = full-lr +contraction before task warmup): res 1.7e-2 after pre-phase, dead-zone only PARTIALLY mitigated +(val 5.97 vs 6.07 @100) — establishing contraction once isn't enough; warmup's tiny lr can't deepen +it. armC (no-warmup + resinit 0.03): NO dead-zone (cos 0.99 from step20, best 4.03@300) but a violent +excursion (res→0.14@400, recovered). ⇒ small-resinit+no-warmup learns fastest early but is +excursion-prone; the cprewarm decoupling is a weak fix. 
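The batch self-cos quoted for exp1 can be read as the mean pairwise cosine between flattened gradients computed on different batches at the same checkpoint. The sketch below assumes that definition (the probe's exact averaging is not given in this document) and uses a hypothetical helper `batch_self_cos` on NumPy arrays; the synthetic gradients in the usage lines are illustrative only.

```python
import itertools

import numpy as np


def batch_self_cos(grads):
    """Mean pairwise cosine between per-batch gradient vectors (each a 1-D array)."""
    unit = [g / np.linalg.norm(g) for g in grads]
    pairs = itertools.combinations(range(len(unit)), 2)
    return float(np.mean([unit[i] @ unit[j] for i, j in pairs]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.normal(size=1000)
    # Coherent: a shared direction plus small per-batch noise.
    coherent = [signal + 0.2 * rng.normal(size=1000) for _ in range(4)]
    # Incoherent: independent directions per batch.
    incoherent = [rng.normal(size=1000) for _ in range(4)]
    print("coherent  :", batch_self_cos(coherent))
    print("incoherent:", batch_self_cos(incoherent))
```

A value near +1 means different batches agree on the update direction; a value near 0 or below means averaging over batches cannot recover a usable direction.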
+ +**exp6 (FULL) — branch norms + attention entropy + qk RMS, incl. at plateau.** Branches grow fine +(|WO| 1.95→4.91 over training); at the 2.40 plateau they are LARGE (|pj|=44, |fc|=58, attn/xin≈1.0), +attn entropy 4.56→3.62 (sharpening), qk_rms pinned 1.0 (qknorm working). ⇒ no "stuck-small branch" — +the trainable-α-gate intervention is NOT indicated. + +**exp7 (FULL, after a probe bug fix) — EP vs BP influence + per-position CE.** [Bug caught: the +EP probe embedded the input with random-init weights before loading the ckpt → garbage (per-pos CE +~7); fixed to embed post-load.] Clean influence ||Δz*_q||/δ vs distance d: **BP** 5.34(d1)/0.88(d16)/ +0.15(d64)/0.003(d200); **EP-orphan** 0.24/0.071/0.019/0.006. Both decay with distance. EP ~20× +weaker in ABSOLUTE coupling (scale-caveated — z* vs hidden norm), but NORMALIZED to d=1 EP reaches +*farther* than BP (0.30 vs 0.16 at d16; 0.08 vs 0.03 at d64). Per-position CE: EP uniformly ~0.7–1.0 +worse than BP at every position (no long-range-specific failure). ⇒ **does NOT support — arguably +refutes — "contraction cuts long-range mixing"**; EP's deficit is uniform, consistent with the +estimator floor (exp1) being the binding constraint. (BP-C512 trained to **1.6953**.) +Honest gap: a stable BPTT-ctl-tight ckpt does not exist (BPTT broke at C512), so the cross-method +influence comparison is EP-vs-BP only, not the full tight/loose/BP set. + +**Campaign verdict (data; user concludes):** the C512 2.40 plateau is, decisively, an **EP estimator +SNR/coherence collapse near the optimum** (exp1: cos→0, self-cos −0.27, k→4000, true gradient clean). +It is not LR (exp2), not horizon (exp1), not λ-floor/target (exp5) — though jr_max≤8 buys ~0.04 +(exp5). The architecture has a SEPARATE wall: exact-gradient BPTT can't stay stable at C512 (exp3). +The contractive-mixing hypothesis is NOT supported (exp7: EP's range is comparable to BP, deficit is +uniform). 
⇒ the EP lever is reducing the estimator bias floor near the optimum (N=4 holo / tracking- +AEP / lock-in / navg); the architecture lever is the stability margin for exact gradients. + +**Early-slowness dissection (C512 BPE) — cos/k probe CONFIRMED; ablations STILL RUNNING.** +From existing logs: **EP ≈ BPTT at every early step** (step 200: 5.84 = 5.84; step 800: 3.86 vs +4.05; step 2000: 3.58 vs 3.73), both **~1.2 behind BP** (2.76 @800). ⇒ early slowness is *not* the +EP estimator — the exact-gradient twin on the same architecture is equally slow; the gap is the +equilibrium architecture + recipe vs a standard transformer, present already at step 200 (pre +warmup-end). Probe (`--probe_bptt`: cos(g_EP,g_BPTT) + k=|gEP|/|gBPTT| along the REAL trajectory): +- **Estimator quality tracks res tightly**: res≈1e-9 → cos 0.93–0.99, k≈1.0 (all groups); a res + spike to 1.4e-3 (mid-excursion) → cos all 0.86 / attn 0.70 / emb 0.13, k→0.4. ⇒ the earlier + kinit *synthetic* probe (cos 0.33 / k 0.42 "at init") was a probe-operating-point artifact; on the + real trajectory the EP gradient is faithful (cos≈1, k≈1) **whenever the free phase is converged**, + and **k≈1 means no adaptive-k is needed in the converged regime**. +- **warmup × validity-gate early-stall mechanism**: with warmup, early lr is tiny → the contraction + homeostat can't pull res under the gate (5e-3) → free-phase res stuck ~1.5e-2 for ~50 steps → + the **validity gate skips the nudge** → the reported g_EP is pure jacreg/contraction (cos≈0, + k≈11) = *the task is not being learned at all* for ~50–60 steps. res drops < gate by ~step 60–80 + → cos jumps to 0.99. No-warmup: res converged (1e-9) by step 20, cos 0.98 from step 20 + (val 5.72 vs warmup 6.07 @ step100). +- **CAVEAT — the tradeoff is UNRESOLVED**: the no-warmup arm hit an instability excursion ~step 180 + (res→1.4e-3, jr→32, recovering by 200) — i.e. warmup's stability role is real. 
Whether no-warmup + survives to 1200, plus the **resinit (arm B)** and **lr 2e-3 (arm D)** attributions, are STILL + RUNNING as of 16:00 — no verdict drawn. Probe instrumentation: `lt_ep_train.py --probe_bptt N`; + launcher `/tmp/lt_ep/early_dissect.sh`. + +## 2026-06-17 — C512 diagnostic campaign (7 experiments): the 2.40 plateau is an EP estimator bias-floor; exact-gradient BPTT is LESS stable than EP at C512 + +Controlled campaign at C512 BPE (orphan ckpt = EP best 2.4037). Raw results + the causal cut each +experiment was built for. Scripts: `lr_sweep.py`, `triangulation.py`, `mixing_probe.py`, +`lt_ep_train.py --probe_branch/--probe_bptt`; orchestration `master.sh`. + +**exp3 — BPTT+controller, the decisive matched cut (GPU0).** Same arch+controller+lr 8e-4 as the EP +orphan; only the task gradient differs (exact vs EP). Descended cleanly to best **3.85 @step1400**, +then **DESTABILIZED at step 1600** (jr→32, res→0.07) and lodged in a broken basin (val 6.27, +res ~0.068 — just under the 0.1 fuse, so no abort) for 8400+ steps. **The exact-gradient twin did +WORSE than EP (3.85-stuck vs EP 2.40-completed).** ⇒ at C512 the contraction controller does NOT +keep the exact gradient on the manifold; EP's implicit contraction-bias is what kept the orphan +alive. Inverts the S0 picture (BPTT+ctl 1.635 < EP 1.676). Caveat: single lr/seed. + +**exp1 — triangulation at the 2.40 plateau (estimator vs horizon).** BPTT150 vs BPTT400 +cos = **1.000** (all/attn/ffn; emb 0.986) — true gradient perfectly well-defined, ZERO horizon +ambiguity — but EP vs BPTT cos = **−0.045** (orthogonal), k=|gEP|/|gBPTT| ≈ **4000**. ⇒ the plateau +is an **EP-specific estimator failure**, not slow-mixing horizon ambiguity. + +**exp2 — one-step LR-sweep at the plateau.** BPTT best ΔCE −0.16 (lr 1e-2); EP best −0.04 (lr 1e-4), +diverges for lr≥3e-4. cos 0.11, k 486. ⇒ NOT LR-inequivalence; EP's direction caps one-step descent +4× below BPTT at any lr. 
+ +**exp1+exp2 synthesis:** the EP estimator has a ~fixed bias/noise floor. k runs **~1 during active +training → ~4000 at the plateau** (true gradient shrank below the EP bias). 2.40 is the SNR +crossover; below it the true gradient is smaller than the EP bias so EP can't see it. Not LR, not +horizon, not λ. ⇒ the EP lever is **reducing the estimator floor near the optimum** (N=4 holo / +tracking-AEP / lock-in / navg averaging), distinct from the architecture's stability lever. + +**exp5 — λ floor/target grid (warm-start from orphan).** base 0.1/1.5e-3 → 2.463; 0.05/3e-3 → 2.491; +0.03/3e-3 → 2.485; 0.1/5e-3 → 2.488 — ALL worse than orphan 2.40 (slight late-drift up). ⇒ relaxing +the controller does NOT break the plateau — not a λ-floor tax. (floor 0.03 reached jr=0 without +collapse over 1200 warm-started steps.) + +**exp6 — branch-growth (resinit vs no-resinit).** resinit 0.1: branches grow gradually & stably +(|WO| 2.35→4.91, |pj| 3.23→6.57 over 1000 steps; res ~1e-4; best 3.71). no-resinit: branches start +large (|WO|~20, |pj|~20), excursion at step 800 (res→0.23, jr→32; recovered, |WO|→14.7; best 4.04). +⇒ resinit's gradual growth is stabilizing; branches do grow (not stuck-small). + +**exp7 — mixing/influence length.** ||Δz_q*||/δ FLAT across distance (~0.007 at d=1 and d=200) and +weak; loose-target (lam_t5) similar (~0.005). ⇒ no distance-decay signature; INCONCLUSIVE on the +contraction-cuts-mixing hypothesis (no non-contractive/BP reference; coupling uniformly weak). + +**exp4 — warmup/gate (probes done).** no-warmup best 3.56@1200 (survived its step-180 excursion); +warmup 4.11@800. warmup×validity-gate early-stall confirmed (warmup keeps res>gate ~50 steps → +nudge skipped → no task learning early). 
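The bias-floor reading in the exp1+exp2 synthesis above can be illustrated with a toy model: an estimator that returns the true gradient plus a bias of fixed size. The assumption that the bias is orthogonal to the true gradient is ours, made for simplicity, and the magnitudes are illustrative rather than measured. Under it, cos = |g|/√(|g|²+|b|²) and k = √(|g|²+|b|²)/|g|.

```python
import math

def floor_model(g_norm, bias_norm):
    """cos and k for g_ep = g + b with b orthogonal to g (toy assumption)."""
    ep_norm = math.hypot(g_norm, bias_norm)
    return g_norm / ep_norm, ep_norm / g_norm

BIAS = 1.0  # fixed estimator floor, arbitrary units (hypothetical)
for g_norm in (100.0, 10.0, 1.0, 0.01, 0.00025):
    cos, k = floor_model(g_norm, BIAS)
    print(f"|g|={g_norm:<8g} cos={cos:.5f} k={k:.1f}")
```

While |g| is much larger than the floor, the model gives cos≈1 and k≈1. Once |g| falls to 1/4000 of the floor, k≈4000 and cos≈0, which is the qualitative pattern reported at the plateau. The toy cannot produce the slightly negative cosine that was measured, so it is a picture of the SNR crossover only, not a fit.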
+
+**Net (data; user to conclude):** the 1–1.5 band is blocked at C512 by TWO distinct walls:
+(1) EP's estimator bias-floor caps EP at 2.40 (cos→0, k→4000 at the plateau; true gradient is
+clean), and (2) the equilibrium architecture+controller can't keep the EXACT gradient stable at
+C512 (BPTT broke at step 1600). EP's contraction-bias makes it the more robust of the two here.
+The contractive-conflict hypothesis is supported on the BPTT side; the EP plateau is specifically
+an estimator-SNR-floor effect.
+
+## 2026-06-17 (round 2) — the plateau bias floor IS the frozen AEP linearization; tracking-AEP fixes it (gradient level)
+
+Estimator ablation at the 2.40 plateau ckpt (`plateau_ablation.py`, vs BPTT400, n_grad=4):
+
+| estimator | cos | k | batch self-cos |
+|---|---|---|---|
+| N2 frozen-AEP (current) | −0.045 | 4133 | −0.27 (incoherent) |
+| N4 / N8 | −0.03 / −0.01 | 31 / 29 | ~0 |
+| N2 r=.05/.10/.20 | −0.04 | ~4200 | −0.27 |
+| N2 fixed-T2 20/80/120 | nan (diverges w/o snapshot selection) | — | — |
+| **track_N2 (common-mode AEP)** | **0.997** | **0.9** | **+0.95 (coherent)** |
+
+Diagnostic at z*: **|Jv−Jᵀv|/|Jv| = 1.37** (J highly non-normal). ⇒ the bias floor IS the
+frozen-at-z* AEP linearization. N=4/8 fix only the magnitude (k 4000→30), not the direction; r and
+T2 do nothing; **tracking-AEP (re-linearize at the moving common mode z̄=½(z₊+z₋)) restores
+cos −0.045→0.997, coherence −0.27→+0.95, magnitude 4133×→0.9×.**
+
+**exp D (per-group):** collapse is UNIFORM across WQ/WK/WV/WO/fc/pj/ln1/ln2/tok/pos (all
+cos −0.03..−0.08, k 3k–18k): the corrupted contrast `a` poisons every VF gradient equally (single
+shared cause, not attention/LN-specific unlike the S1 bias_var diagnosis).
+
+**exp B (navg):** cos(EP,BPTT) over navg = 0.37→0.40→0.55→−0.30→−0.21 (1/2/4/8/16); self-cos rises
+(0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8.
⇒ +**deterministic bias, not variance** — averaging converges to the wrong (biased) direction, so +navg/restart-averaging will NOT fix it. + +### 2026-06-17 (round 2b) — free-phase ε is mistuned ~3× (monDEQ α-analysis analog); Anderson untapped + +Free-phase convergence probe at the 2.40 ckpt (`solver_probe.py`, res_F=‖F(z)‖/‖z‖, eps-independent, +B=2, 4 batches). Steps to res_F≤1e-4 / ≤1e-6, plain Euler: ε0.05→470/844, **ε0.1 (ours)→236/548**, +ε0.2→121/274, **ε0.3→85/182** (~2.8× faster than 0.1, same fixed point, stable), ε0.5→stalls at +res~0.9 (non-convergent — the stable ceiling is ~0.3–0.4). Anderson(m=5): ε0.1→85/197, **ε0.2→58/118 +(~4× vs our current 0.1)**, robust across ε. ⇒ **ε=0.1 is ~3× too small**; ε≈0.3 or Anderson@ε0.2 cut +free-phase force-evals ~3–4× (and let T1≈100 always clear res≤1e-4, dropping adaptive-T1) → directly +attacks the dominant EP compute cost AND tightens res (better estimator). This IS the "preconditioned/ +Anderson relaxation" open item, now quantified. Caveats: free phase at trained point / B=2 (optimal ε +may shift early-training / slow-mixing); the **nudged phase shares ε but has AEP+holo dynamics — bigger +ε there needs a separate stability check** before changing. Confirms the monotone-operator α-sensitivity +(non-normal dynamics, |Jv−Jᵀv|/|Jv|=1.37) carries over to our hand-set Euler step. + +**Decision:** the EP lever collapses to **common-mode tracking-AEP** (not N=4 / lock-in / navg). +Caveat: validated at the gradient level at one ckpt; the one-step ΔCE is magnitude-confounded, so the +NECESSARY next test is **training** — does EP with `--track` descend below 2.40 (`holo_a_track` +already wired)? exp C (BPTT+controller rescue, lr 2e-4/4e-4/6e-4 × jr_max 16/32) runs separately for +the architecture-ceiling question. 
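The step-size sensitivity measured by `solver_probe.py` can be reproduced qualitatively on a toy problem. The sketch below runs plain Euler relaxation `z ← z + ε·F(z)` on a 2-D linear force with a rotating (non-symmetric) Jacobian and counts steps until res_F=‖F(z)‖/‖z‖ drops below a tolerance. Everything here is an assumption for illustration: the operator, the value `OMEGA = 1.8` (chosen so the explicit-Euler stability limit lands near ε≈0.47), and the function names. It is not the transformer force, and Anderson acceleration is not sketched.

```python
import math

OMEGA = 1.8  # rotation rate of the toy operator (illustrative choice)

def force(z, b=(1.0, 0.0)):
    """Toy linear force F(z) = A z + b with A = [[-1, OMEGA], [-OMEGA, -1]]."""
    x, y = z
    return (-x + OMEGA * y + b[0], -OMEGA * x - y + b[1])

def steps_to_tol(eps, tol=1e-6, max_steps=5000):
    """Plain Euler relaxation; return the first step with res_F <= tol, else None."""
    z = (1.0, 0.0)
    for step in range(1, max_steps + 1):
        f = force(z)
        z = (z[0] + eps * f[0], z[1] + eps * f[1])
        f_new = force(z)
        res_f = math.hypot(*f_new) / max(math.hypot(*z), 1e-30)
        if res_f <= tol:
            return step
    return None  # stalled or diverged within max_steps

results = {eps: steps_to_tol(eps) for eps in (0.05, 0.1, 0.2, 0.3, 0.5)}
for eps, n in results.items():
    print(f"eps={eps:<4} steps={n}")
```

In this toy the step count falls as ε grows from 0.05 to 0.3, and ε=0.5 never converges because |1+εμ|>1 for the eigenvalues μ=−1±1.8j. The exact location of the optimum differs from the measured run, so only the shape (too-small ε wastes steps, too-large ε stalls) should be read from it.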
+ +## 2026-06-18 — tracking-AEP VALIDATED IN TRAINING: it breaks the 2.40 plateau + +Warm-track test (warm-start from the 2.40 orphan ckpt, jr_max 8, warmup 0, 2500 steps): +- **TRACK-warm (common-mode tracking-AEP): 2.40 → best 2.1628 @2500, still descending**, stable + (res ~6e-5, jr at floor). ⇒ tracking-AEP descends PAST the 2.40 plateau — the round-2 estimator- + floor fix works in training, exactly as the fixed-ckpt probe predicted (cos −0.05→0.997). +- **STD-warm control (standard estimator, same recipe): ABORTED @step 397** (blew up, res→0.21). +- **TRACK-fresh (tracking from scratch): ABORTED @968** (blew from random init). + +⇒ under the identical aggressive recipe (jr_max 8), tracking-AEP is BOTH stable AND descends, while +the standard estimator blows up. But tracking-AEP is **not a from-scratch drop-in** (fresh blows up) +— the recipe is **two-phase warm-track**: standard estimator to ~2.40 (where it's stable), then +switch to common-mode tracking-AEP to descend below the floor (matches the S1 warm-track champion). +Caveat: std-control's abort is partly the aggressive jr_max 8 (orphan used jr_max 32, stable at +2.40); the clean claim is "at jr_max 8 tracking survives+descends where std blows". The 2.16 ate +~0.24 of the 0.61 EP-vs-BP gap and was still descending → continuing (`trkcont` from 2.16) to find +the ceiling. + +**exp C — architecture ceiling (BPTT+controller rescue, 2500 steps):** stable exact-gradient exists +(lr≤6e-4 never breaks, unlike lr8e-4 which broke @1600). Best: lr6e-4 jr16 → 3.08, still descending. +At equal step (2500): **BP 2.34 ▸ BPTT 3.08 ▸ EP-orphan 3.55** ⇒ the ~10× proportional slowdown +(confirmed: median 9.4× in the descending regime) splits into an lr-stability-capped architecture +handicap (~0.74, BP→BPTT; the equilibrium block can't take BP's 3e-3 lr) AND an EP-estimator +handicap (~0.47, BPTT→EP). jr_max 16 > 32 confirmed again. Long ceiling run (`bptt_ceiling`, +lr6e-4 jr16, 12k) pending. 
+ +**Found + staged a real code inconsistency:** free phase / model / readout use erf-GELU (F.gelu) +while the holo nudge uses tanh-GELU (cgelu) → z* isn't the nudge-force's fixed point → spurious +common-mode drift in the contrast every step (which tracking-AEP partly absorbs — possibly part of +why it wins). Fix staged as `--gelu tanh` (additive, default erf). Separately: free-phase ε=0.1 is +~3× too small (ε0.3 or Anderson → ~3–4× fewer relaxation steps; `solver_probe`). + +--- + +## 2026-06-24 — below-CE-2.1 divergence: DIAGNOSED (genuine Hopf) + FIXED (resreg/jacreg = finite-T1 LE control) + +Full write-up: **`SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md`**. This pins down the forward-stability wall that sat under TL;DR #3 ("no loss ceiling, but a validity threshold that costs reg + steps") — the threshold is now mechanistically a **Hopf bifurcation**, and the "regularization tax" is specifically **LE control**. + +**The mechanism (3 independent confirmations).** Below CE~2.1 the relaxation diverges because the learned **non-conservative attention** operator undergoes a **genuine continuous-time Hopf bifurcation** — a complex eigenvalue pair crosses **Re μ > 0** as EP makes attention expressive. Matrix-free FD-JVP Arnoldi on M=I+εJ (`ep_run/eig_probe.py`): leading continuous μ=(λ−1)/ε goes **−0.024 (s2000, CE 3.13, STABLE) → +0.44±1.37j (s3200, 2.74) → +1.35±2.08j (ep_eps05, 2.41)** — Re μ and |Im μ| both grow as CE drops. Confirmed by (a) eval limit-cycle + attention-knockout (α=0.2 converges), (b) Anderson can't reach a root at s3200 (near-root Re μ=+0.24 rotating). **Earlier "Euler artifact" read (and the cont.6 forward-mode/RTRL framing) are SUPERSEDED** — fugu caught that the ε-sweep "convergence" was the *step* residual ε·g, not g→0; an intermediate misread. (The Euler over-amplification is a real *second* layer — ε relocates the wall 2.74→2.41 — but the root cause is Re μ>0, which no ε closes.) 
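The relation between the discrete multiplier λ of M=I+εJ and the continuous rate μ=(λ−1)/ε, and the claim that no ε closes a Re μ>0 mode, can be checked numerically. The sketch below uses the two μ values quoted in the paragraph above (s2000 and s3200); the helper names are hypothetical and the code does not reproduce `eig_probe.py`.

```python
import math

def mu_from_lambda(lam, eps):
    """Continuous-time rate from an eigenvalue of M = I + eps*J."""
    return (lam - 1) / eps

def euler_gain(mu, eps):
    """Per-step amplification |1 + eps*mu| of explicit Euler on this mode."""
    return abs(1 + eps * mu)

def growth_rate(mu, eps):
    """Growth per unit relaxation time, log|1 + eps*mu| / eps."""
    return math.log(euler_gain(mu, eps)) / eps

MU_STABLE = complex(-0.024, 0.0)  # s2000 value quoted above
MU_HOPF = complex(0.44, 1.37)     # s3200 value quoted above

gains = {eps: euler_gain(MU_HOPF, eps) for eps in (0.1, 0.05, 0.01, 0.001)}
rates = {eps: growth_rate(MU_HOPF, eps) for eps in gains}
for eps in gains:
    print(f"eps={eps:<6} gain={gains[eps]:.6f} rate={rates[eps]:.4f}")
print("stable mode gain at eps=0.1:", euler_gain(MU_STABLE, 0.1))
```

Since |1+εμ|² = 1 + 2ε·Re μ + ε²|μ|², the gain exceeds 1 for every ε>0 once Re μ>0. The per-unit-time rate shows the two layers separately: it tends to Re μ as ε→0 (the continuous instability) and sits above Re μ at larger ε (the explicit-Euler over-amplification, dominated by the rotating term).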
+ +**Why equilibrium-EP hits this and BPTT doesn't (structural).** Equilibrium-EP optimizes L(z*); its gradient is blind to *finite-time contraction* (a param eroding ρ→1 with z* fixed has zero equilibrium gradient). BPTT differentiates the T1 unroll, so a non-converged z_T1 → bad output → penalized → it implicitly defends contraction. So EP needs that defense added back explicitly. + +**The fix — control the finite-T1 Lyapunov exponent (LE).** Stability ⟺ ρ(M)<1 ⟺ finite-T1 LE<0. Two ~orthogonal handles, they stack: +- **resreg** (`lt_ep_train.py:220-231`) = penalize the T1-residual ‖εF(z_T1)‖ ~ ρ^T1 = exp(T1·LE) → a **direct LE penalty** (also catches the non-normal transients eigenvalues miss). **The PROVEN below-2.1 stabilizer.** +- **jacreg** (`:211-219`) = penalize ‖J_nc·v‖ → shrinks |Im μ|, pushes the pair to Re μ<0 (cause-side). `ep_run/eig_jacreg.py` confirmed at the mechanism level: at the same CE~2.74, frozen-jacreg = Re μ=+0.45 rotating g_floor 0.26, adaptive-jacreg = **Re μ=−0.23 real, g_floor 1e-4** (Hopf killed, true fixed point restored, AsymEP validity restored). + +**★ The 2.09 recipe = FROM SCRATCH + resreg=0.2 + FROZEN jr** (NOT adaptive jacreg — that was a session-long detour; cmd at `EP_BELOW210_DIAGNOSIS_FIX.md:97-101`). Original ep_resreg2 reached **2.0573** (lowest EP ever, lost to /tmp wipe; rebuild 2.22). This session: ep_resreg_warm (resreg eager, warm from stable s2000) descending smoothly through 2.30 (peak res 1.6e-2, no spikes) — the clean 2.09 test; ep_rr_scratch (from-scratch, proven recipe) launched as the robust confirmation. **Warm-start ONLY from a STABLE operator (s2000, Re μ<0); resreg/jacreg PREVENT instability, they can't RESCUE an already-blown one** (warming from ep_eps05@2.41 blew). + +**Infra (#14).** `--compile` EXONERATED + SAFE (compiles the free-phase `tforce`; numerically identical to eager — z150 rel-diff 1.6e-7, pure fp32; ~1.43x, ~3.3x with t2sel40). 
**`--tf32` UNSAFE — do NOT use** (10-bit mantissa ≈ 1e-3 perturbation; the relaxation is hyper-sensitive — ε 0.1→0.05 alone moved the wall 0.33 — so TF32 is too coarse). EP parallelism for the deep/scaling phase: no sequential backward, coupled equilibrium stack (#13) → depth-parallel, adaptive-T1 (residual-stopped) cleaner than adaptive-ε. + +**Lit anchor.** Hopfield-ResNet (arXiv 2509.26003) EP-trained 12 layers — but it's **conservative** (energy Φ, symmetric weights, no oscillation by construction). Confirms non-conservativity is our culprit; **we are the first to EP-train non-conservative attention** (which has the Hopf), solved via resreg/jacreg. diff --git a/docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md b/docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md new file mode 100644 index 0000000..d605140 --- /dev/null +++ b/docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md @@ -0,0 +1,81 @@ +# Session 2026-06-22→24: below-2.1 wall — DIAGNOSIS FLIPPED to genuine Hopf, FIX = resreg/jacreg (LE control), 2.09 recipe recovered + +THE definitive write-up of this multi-day session. Supersedes the "Euler-artifact" framing in EP_BELOW210 cont.7. + +## 1. HEADLINE — the diagnosis flipped twice, landed on GENUINE Hopf instability +The C512 EP below-CE-2.1 divergence is a **genuine continuous-time Hopf instability** of the learned **non-conservative attention** operator — a complex eigenvalue pair crossing **Re μ > 0** as EP trains the attention expressive. **NOT a discrete-Euler artifact** (that was an intermediate wrong turn). Analog/continuous hardware is NOT automatically immune. + +### Evidence chain (3 independent methods, all in ep_run/) +1. **eval_relax_s3200.py** — relax the marginal ckpt redx s3200 (val 2.74) 6000 steps → res floors ~2.3e-2, OSCILLATES, no fixed point = limit cycle. +2. 
**knockout_s3200.py** — scale attention output WO×α: cycle scales with α, CONVERGES at α=0.2 → the **non-conservative attention drives the cycle**. +3. **eps_sweep_s3200.py** — cycle amplitude shrinks monotonically as ε↓ (0.1→2.3e-2, 0.01→8.9e-4). *Intermediate misread*: thought "Euler artifact, analog immune" (cont.7). **fugu caught the bug**: that 8.9e-4 at ε=0.01 is the STEP residual `r=ε·g`, i.e. the SAME force-floor g≈0.09 — so the sweep proves the oscillation is discrete-amplified but does NOT prove a clean continuous fixed point exists. +4. **ε-TRAINING-monotonicity (3 runs)**: ε=0.1→blew@**2.74**; ε=0.1,t2sel=160 (BETTER gradient)→blew@**3.02** EARLIER (⇒ gradient quality is NOT the lever); ε=0.05→blew@**2.41**. Smaller ε → strictly lower wall. +5. **eig_probe.py** (matrix-free FINITE-DIFFERENCE-JVP Arnoldi on M=I+εJ — DECISIVE): leading continuous μ=(λ−1)/ε at the operating point: + - s2000 (3.13): all |λ|<1, **Re μ=−0.024 STABLE** + - s3200 (2.74): top |λ|=1.044, **Re μ=+0.44, COMPLEX** (μ=+0.26±1.37j rotating) + - ep_eps05 (2.41): top |λ|=1.14, **Re μ=+1.35**, complex (±2.08j) + - **Re μ grows −0.02→+0.44→+1.35 as CE drops, |Im μ| grows ~0→1.37→2.08.** GENUINE instability, growing. + - NOTE: autograd JVP gave 0 (blk.force detaches internally) → use **FD-JVP** `(F(z+h·u)−F(z))/h·‖v‖`, h=1e-3·‖z‖. +6. **anderson_control.py** — s3200: plain relax floors (res 2.7e-2), Anderson CANNOT reach a root (best 1.4e-3), near-root has **Re μ=+0.24, |Im μ|≈2.0 UNSTABLE rotating**. Solver can't manufacture an absent/unstable equilibrium. + +### Reconciles with everything +- ε-monotonicity: with Re μ>0, `|1+εμ|>1` for ANY ε; smaller ε just makes the blow-up SLOWER (less discrete over-amplification, esp. the rotating (εb)² term) → delays → wall RELOCATES (2.74→2.41), never closes. +- ep_t2fix earlier-blow: cleaner gradient drives into the stiff/unstable regime faster. 
+- It's a COMPOUND: continuous instability (root, Re μ>0) + explicit-Euler over-amplification (2nd layer). ε attacks layer 2 (delay); jacreg/resreg attack the root. + +## 2. THE FIX — resreg AND jacreg both work, by controlling the finite-T1 LE +Both add back the **finite-time contraction defense** that equilibrium-EP's L(z*) structurally lacks (and that BPTT has implicitly — BPTT differentiates the T1 unroll, so a non-converged z_T1 → bad output → penalized). + +- **jacreg** = penalize ‖J_nc·v‖ (the non-conservative/rotating Jacobian = attention+FFN; lt_ep_train.py:211-219). **Cause-side**: shrinks |Im μ| → pushes the complex pair back to Re μ<0. +- **resreg** = penalize the T1-residual ‖εF(z_T1)‖ (lt_ep_train.py:220-231). **Symptom-side**: residual ~ρ^T1=exp(T1·LE), so resreg ≈ a DIRECT finite-T1 Lyapunov-exponent penalty (catches non-normal transients the eigenvalues miss). ~orthogonal to BPTT-grad (cos −0.047) — a constraint that keeps res low so the EP estimate doesn't collapse. +- **Geometry**: relaxation map M=I+εJ; stable ⟺ ρ(M)<1 ⟺ finite-T1 LE<0. Hopf = non-conservative part pushes complex μ past Re μ=0. resreg controls LE directly (output), jacreg controls the structural source (‖J_nc‖). **They stack** (orthogonal handles; cause+symptom). +- **eig_jacreg.py CONFIRMED jacreg at the mechanism level** (cont.9): at the SAME loss ~2.74, FROZEN jacreg (redx) = Re μ=+0.45 rotating UNSTABLE g_floor 0.26; ADAPTIVE jacreg (ep_jacreg @2.75) = **Re μ=−0.23 STABLE real, g_floor 0.0001 (true fixed point restored)**. jacreg killed the Hopf + restored AsymEP validity. + +## 3. ★★ THE 2.09 CONFIG (recovered from EP_BELOW210:97-101) — the key stabilizer is RESREG, NOT jacreg ★★ +The session spent days on adaptive-jacreg; the USER pushed to find the actual 2.09 recipe. 
It is **FROM SCRATCH + resreg=0.2 + FROZEN jr=0.1** (the original ep_resreg2 reached **2.0573**, lowest EP ever; lost to /tmp wipe; rebuilt ep_resreg_scratch reached 2.22): +``` +python3 lt_ep_train.py --mode ep --attn_mode thick --B 24 --C 512 --H 16 --T 256 --c 1.0 \ + --jacreg 0.1 --jr_floor 0.1 --jr_max 0.1 --holo 2 --hr 0.02 --t2sel 40 --track --pema 0.999 --t1max 300 \ + --res_est 1e-4 --res_gate 0 --resreg 0.2 --qknorm --resinit 0.1 --warmup 800 --T1 150 --T2 20 --lr 6e-4 \ + --wsd 0.25 --steps 32000 ... # NO --init_ckpt = from scratch +``` +- jr_max=jr_floor=0.1 = jacreg FROZEN (controller off). Adaptive jacreg = release jr_max (e.g. 16); controller :520-529 ramps jr by `(res/res_target)^0.3`. +- res_gate MUST be 0 (the gate early-returns before the resreg penalty → bypasses it; res_gate≠0 blew @200 historically). + +## 4. RUNS this session (all C512, warm from s2000 unless noted) +- **s2000** = `runs/redx_traj/s2000.pt` (redx step 2000, val 3.13) = the STABLE pre-bifurcation start (Re μ=−0.024). Use THIS to warm-start; ep_eps05.pt (2.41, Re μ=+1.35) is UNSTABLE — resreg/jacreg can't RESCUE an already-unstable operator (they PREVENT, they don't reverse). +- ep_jacreg (adaptive jacreg, warm s2000): twitchy (jr-spikes to 15 → CE spikes), crawled to ~2.32 then ~stuck/slow. Broke past 2.74/2.41 cleanly though. +- **ep_resreg_warm** (resreg=0.2 eager, t2sel=160, warm s2000): SMOOTH (peak res 1.6e-2, no spikes), LEADING at **2.2985** — the clean 2.09 test, still alive. +- ep_resreg_fast (resreg t2sel40, warm from UNSTABLE ep_eps05@2.41): BLEW — start-point was unstable. (Confirmed by direct eval: fp32 & TF32 both 2.55 → load fine, TF32 didn't change relaxation; the operator just destabilizes under training.) +- ep_resreg_c (resreg t2sel40 +compile, warm s2000): BLEW @2.31. Cause = **t2sel40 (lean gradient)**, NOT compile, NOT warm-start (resreg_warm same warm-start is fine). 
+- ep_rr_scratch (FROM SCRATCH original recipe +compile): launched 2026-06-24 05:32 on GPU3 — tests the user's "from-scratch is robust" hypothesis + the proven 2.09 path. +- ep_eps05 (ε=0.05): blew @2.41 (the ε-monotonicity run). + +## 5. INFRA / #14 speedup findings +- **compile: EXONERATED + SAFE.** lt_ep_train has `--compile` (compiles the FREE-phase via `tforce`, the no_grad fast path; gradient stays eager). Verified numerically identical: tforce vs force rel-diff 9e-7; compile-z150 vs eager-z150 **1.6e-7** (just fp32 op-reorder rounding). Speedup ~1.43x (free phase) / ~3.3x with t2sel40. reduce-overhead/CUDA-graphs BROKEN (0.07x, graph breaks — needs fullgraph/static-shape fix). +- **TF32: DROPPED (user decision).** `--tf32` exists (lt_ep_train:368, sets allow_tf32). 10-bit mantissa ≈ 1e-3 precision loss. The relaxation is HYPER-precision-sensitive (ε 0.1→0.05 moved the wall 0.33!), so TF32's 1e-3 perturbation is too coarse → risky. compile(fp32, 1.6e-7) is ~10⁶× below the sensitivity scale → safe; TF32 is not. **DO NOT use --tf32.** +- **EP parallelism advantages for #14 (esp. for the scaling/deep phase):** + 1. NO sequential backward (vs BPTT's N reverse layers) + NO activation graph (memory-light). + 2. COUPLED equilibrium stack (#13, like the Hopfield-ResNet) → all layers relax CONCURRENTLY each step → depth parallelizes (vs BPTT's 2N sequential). (DEQ-style z*=f(z*) with deep f does NOT parallelize.) + 3. **adaptive-T1** (relax until residual<tol, the t1max machinery already does this for z*) — easier/cleaner than adaptive-ε (convergence signal is cleaner than overshoot). Speed + auto-converged readout + cap-hit=instability-flag. + 4. adaptive-ε (#30) as 2nd-order. +- GPU: 4× RTX A6000 (49GB). 
GPU0/1/3 = ours; **GPU2 = others' NV-Embed-v2 server (port 8555/8556) — DON'T TOUCH.** For RENTING (user found cheap provider, RTX models + H800, no A100/H100): **EP is memory-light + FLOP-bound → consumer 4090/5090 = best $/FLOP** for the bulk; 48-80GB (A6000/H800) only for BPTT-twins (memory-heavy) + 0.6B. EP per-step FLOPs ~1.5-3x BPTT (2 relaxations vs fwd+bwd) but that cost is SIM-only (analog relaxation is free physics). + +## 6. Hopfield-ResNet paper (arxiv 2509.26003) — confirms our diagnosis +"Scaling EP to Deeper Architectures" trained 12-conv Hopfield-ResNet with EP. It is **CONSERVATIVE** (energy function Φ, SYMMETRIC weights, monotone energy descent — "no oscillation or limit cycles"). No attention, no non-symmetric ops. **Confirms: non-conservativity is OUR culprit; conservative systems get depth free (no Hopf). We are the first to EP-train NON-conservative attention (which has the Hopf), solved via jacreg/resreg.** Good for the dossier: prior EP-deep = conservative/no-attention; ours = the harder non-conservative case. + +## 7. fugu-ultra consultations (all in ep_run/) +- FUGU_VERDICT_FULL.md (Q1-Q4): confirmed attention-driven oscillation, FLAGGED the Euler-artifact (we then measured it's a true Hopf), said the eigenpair is the decisive measurement. Fix=adaptive jacreg homeostat+res_gate; sub-threshold attention IS expressive (BPTT 1.83 proves it); keep below the instability. +- FUGU_OPTIONS_VERDICT.md (Q1-Q3): the step-vs-force-residual CORRECTION; adaptive ε eliminates only if ε_min<ε_crit; jacreg raises ε_crit (model-side, same wall) + a real analog settling benefit + true-Hopf insurance; Anderson/implicit are emulator-fidelity (analog-faithful), jacreg edits the model. Recommended: adaptive-ε+Anderson(both phases)+bounded jacreg, report FORCE-residual + Re μ. + +## 8. 
KEY FILES +- Probes: ep_run/{eig_probe.py, anderson_control.py, eig_jacreg.py, adaptive_eps_calib.py, adaptive_eps_calib2.py, eps_sweep_s3200.py, knockout_s3200.py, eval_relax_s3200.py, compile_bench.py} +- Dossiers: ep_run/{EP_DIAGNOSIS_DOSSIER.md, FUGU_VERDICT_FULL.md, FUGU_Q_OPTIONS.md, FUGU_OPTIONS_VERDICT.md, FUGU_Q1_VERDICT.md} +- EP_BELOW210_DIAGNOSIS_FIX.md: cont.6 (structural/forward-mode — SUPERSEDED), cont.7 (ε-artifact — SUPERSEDED), **cont.8 (Hopf correction — CURRENT)**, **cont.9 (jacreg confirmed at mechanism level)** + OBS (oscillation = benign weight transient, not Hopf), **2026-06-23 ε-monotonicity RESULT**, the 2.09 recipe at :97-101. +- Code: lt_ep_train.py — force/tforce:81-106, relax:123, ep_step:140, jacreg:211-219, resreg:220-231, jr controller:520-529, --compile (works), --tf32 (DON'T use). + +## 9. CURRENT STATE + NEXT +- Running: ep_jacreg (~2.32), ep_resreg_warm (LEADING 2.2985, clean 2.09 test), ep_rr_scratch (from-scratch, just launched). Watchers: ep_resreg_check.py→2.20, ep_jacreg_binary.py→2.30 (detached→/tmp), ep_rr_scratch needs one. +- **THE open question: does resreg break 2.09?** ep_resreg_warm (warm+t2sel160) at 2.2985 smooth → likely; ep_rr_scratch (from-scratch, proven recipe) = the robust confirmation. +- Lessons: (a) the 2.09 stabilizer is RESREG (from scratch), not adaptive jacreg; (b) warm-start ONLY from a STABLE operator (s2000), never an already-blown one (ep_eps05); (c) t2sel40 (lean grad) is fragile deep, t2sel160 safer; (d) compile safe (fp32), TF32 unsafe (precision); (e) the run is precision-hyper-sensitive. +- Recurring bug to avoid: `pkill -f "ckpt runs/X.pt"` SELF-MATCHES the bash → exit 144; kill by explicit PID instead. And `nohup python … &` inside a run_in_background bash DETACHES it (no notify) — run `python3 watcher.py` directly as the tracked task. 
diff --git a/docs/hardware/COLLABORATOR_BRIEF.md b/docs/hardware/COLLABORATOR_BRIEF.md new file mode 100644 index 0000000..3728657 --- /dev/null +++ b/docs/hardware/COLLABORATOR_BRIEF.md @@ -0,0 +1,46 @@ +# Backprop-free analog training of a transformer — collaboration brief +**One-page ask for hardware-side collaborators · 2026-06-21 · Yuren Hao (UIUC)** + +## The idea in three sentences +We train a **transformer block as a physical equilibrium (fixed-point) system** using **Equilibrium Propagation +(EP)** — no backpropagation. The forward pass is a damped relaxation `z ← z + ε·F(z)` that **settles** to a fixed +point (on analog hardware, the settling *is* the physics — nearly free); the weight update is **local**, computed +from the contrast between a free settle and a slightly-nudged settle. This is exactly the computation an analog +in-memory / memristive array is good at — and unlike every shipping analog-AI chip (all inference-only), it needs +**in-situ weight update**, which is the open opportunity. + +## Why now / why it's real (not speculative) +- **Algorithm side (ours, in simulation):** EP's gradient matches true backprop (cosine ≈ 0.99–1.0 per component); + the equilibrium transformer trains stably and **matches/beats a same-parameter BP transformer** on language modeling. + Currently scaling the recipe; a fix for the one known instability (a residual-defense term) is under validation. +- **Hardware precedent exists:** local contrastive/EP learning has been physically demonstrated (self-learning analog + resistor networks, ~1 µs settling, on-chip weight update from a local free-vs-clamped difference; EP on a D-Wave + Ising machine). **But nobody has built an EP-trained *transformer* in analog hardware — that is the first-mover demo.** +- **Endurance clears the bar:** HfOx-class RRAM survives ~10^10 write cycles; a training run needs ≤10^8 device writes + (fewer with digital-accumulate-then-threshold-program). 
Endurance is not the blocker — update linearity/symmetry is + the real device challenge. + +## What a hardware demo needs (three layers) — and the UIUC ECE fit +| Layer | What it does | Closest collaborator | +|---|---|---| +| **Trainable device** | in-situ-updatable analog weights (RRAM/FeFET/ECRAM) — *the part you cannot buy* | **Wenjuan Zhu** (UIUC ECE, memristor/RRAM/FeFET/2D devices) | +| **In-memory MVM circuit** | analog matrix-vector multiply + on-chip weight write-back | **Naresh Shanbhag** (UIUC ECE) — his JSSC-2018 DIMA chip *already* does analog MVM **+ on-chip SGD weight write-back** in 65nm; nearest existing substrate | +| **Mixed-signal glue / control loop** | ADC/DAC to read settled states + apply the nudge; switched-cap integrators = relaxation primitives | **Pavan Hanumolu** (UIUC ECE, data converters / PLL / switched-cap) | +| **EP control + sim** | the settle→nudge→settle→local-Δθ loop, noise/endurance de-risk in simulation | **us** (FPGA + the trained model + analog-noise sim already built) | + +**Escalation / device frontier:** **H.-S. Philip Wong (黄汉森, Stanford EE / TSMC Chief Scientist)** — NeuRRAM (Nature +2022) is the most EP-relevant analog-MVM substrate (inference-only today); the RRAM-device heavyweight + a TSMC-foundry +path, reachable via a Stanford student contact. + +## The concrete ask (staged, modular — stitch existing capabilities, no startup-scale custom fab) +- **Phase 1:** put ONE equilibrium-transformer block on an existing in-situ-trainable substrate (Shanbhag's DIMA-class + chip + Hanumolu converter/integrator glue; Zhu devices) + our FPGA EP-control loop → prove end-to-end analog EP training. +- **Phase 2:** scale weights (foundry RRAM MPW — e.g. SkyWater S130 + Weebit ReRAM IP — or a fixed-weight inference array + for the forward path with the trainable layer in-situ). 
+- **What we bring:** the validated algorithm, the trained model + scaling data, the EP control logic, and a simulator + that already models analog non-idealities (device noise / quantization / asymmetric update) to de-risk before tape-out. + +**Bottom line:** the science is done in sim and the hardware pieces all exist in-house at UIUC ECE — this is a +stitching + first-demo opportunity, not a multi-year custom-silicon program. + +*(Backing detail + citations: HW_RESEARCH_FINDINGS.md; method: ept_method_intro.pdf)* diff --git a/docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md b/docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md new file mode 100644 index 0000000..55b511c --- /dev/null +++ b/docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md @@ -0,0 +1,125 @@ +# Physics answer for Q1–Q7, grounded in the actual simulator code + +*Independent verification pass. Read-only: I did **not** run any training, probe, or GPU job. Every line citation below was re-checked against the files on disk; empirical numbers are quoted from probe **logs already on disk** (`ep_run/runs/*.log`), not from new runs.* + +--- + +## 0. The decisive facts the code actually establishes (read these first) + +These control all seven answers. Two of them overturn parts of the brief. + +**F1 — The active "thick" force is a damped residual DEQ-transformer block.** +`tforce(z,xin) = -(z-xin) + attn(LN z) + FFN(LN z) - c·z`, i.e. `x_in - (1+c)z + Attn + FFN`. FFN uses **tanh-approx GELU**. (`lt_ep_train.py:81-85`; same branch in `force(...,'thick')` at `lt_ep_train.py:102-106`.) Q,K,V,O are **independent** matrices (`lt_ep_train.py:39-40`), so attention is genuinely non-conservative; `--qknorm` optionally RMSNorms q/k (`lt_ep_train.py:63-65`). + +**F2 — This code's EP is NOT a forward-only, settled-state-only rule. 
It already uses the transpose.** +The AEP correction explicitly forms **both** `Jv` (jvp) and **`Jᵀv` (vjp)** of the non-conservative force and injects `-(Jv-Jᵀv)` into the nudged dynamics (`lt_ep_train.py:172-178`). The production holomorphic paths do the same with `torch.func.jvp`/`torch.func.vjp` (`holo_ep.py:181-185` plain N=2; `holo_ep.py:236-239` tracking). The block-param gradient is then `grad((a·f).sum(), block)` (`lt_ep_train.py:202-207`) and the readout head gets its own local grad `dCE/dWh` (`lt_ep_train.py:208-210`). **Implication:** on a GPU the transpose is a free autograd call, but on a one-way analog relaxation substrate `Jᵀv` is exactly the capability you do not have. So the simulator is *not itself* a demonstration of a physically forward-only rule — a critical caveat the brief glosses. + +**F3 — The implemented AEP is an algebraic `J→Jᵀ` identity, not a first-order `S⁻¹A` series, when the correction is not clipped.** +Subtracting `(J-Jᵀ)v` turns the nudged differential operator `J` into `J-(J-Jᵀ)=Jᵀ` exactly, for any asymmetry (`lt_ep_train.py:176`, `holo_ep.py:239`). Caveat: the **plain real** nudge path clips the correction if `‖corr‖>‖f‖` (`lt_ep_train.py:177-178`), so that path is not algebraically exact under clipping. The holomorphic/production paths shown in `holo_ep.py` do not apply that norm clip at the same point (`holo_ep.py:181-185,236-239`). The "ρ(S⁻¹A)<1 / Neumann series" warranty applies to the **uncorrected** vector-field EP baseline, **not** to this correction — and the diagnosis file itself retracts the earlier mis-attribution (`EP_BELOW210_DIAGNOSIS_FIX.md:174-184`). **The brief's framing in its §1/§2 (AEP "out of warranty" at ‖S⁻¹A‖≈1.4) does not match the code.** + +**F4 — The "‖S⁻¹A‖≈1.4" number in the brief is a different quantity than ρ(S⁻¹A).** +The brief line 11, and the older diagnosis at `EP_BELOW210_DIAGNOSIS_FIX.md:146`, equate `‖A‖/‖S‖`-scale ≈1.4 with `|Jv−Jᵀv|/|Jv|≈1.4`. 
That ratio is `|2Av|/|Jv|`, **not** `ρ(S⁻¹A)` or `‖S⁻¹A‖`. When the code actually measures `S⁻¹A` (`asym_probe.py:325-332,364-432`), it gets `‖S⁻¹A‖₂≈10–13` and `ρ` between 1.6 (Arnoldi) and 4.6 (power) — and even those are at **non-converged** points with MINRES solve residuals ~0.15 (`runs/oracle_probe.log:110-114`; `runs/asym_probe_full.log:145-148`). So the "1.4" is a loose proxy, the real `S⁻¹A` operator is large and **ill-measured**, and no single trustworthy ρ value exists in the logs. Treat all ρ-based claims as soft. + +**F5 — The cleanest converged measurement says the adjoint is EXACT and the gap is the ESTIMATOR.** +At a tightly converged checkpoint (`runs/diag_probe.log`, step-200 ckpt, free-phase residual **1.6e-9**): on the block-parameter scope used by the diagnostic (readout `Wh` excluded; `diag_probe.log:6`), `cos(g_transpose, g_BPTT) = 1.000000` on all 6 batches, while `cos(g_EP, g_transpose)` is only **0.51–0.72** (`diag_probe.log:13-44`). The β/hr sweep at that *same* converged state shows cos **rising with β**: 0.002→0.15, 0.005→0.38, 0.01→0.56, 0.02→0.72, 0.04→0.86 (`diag_probe.log:46-50`). **Reading:** (i) the exact transpose adjoint *is* BPTT (so it is not a finite-time-vs-fixed-point gap, at least at T1=150 with res→0); (ii) the EP shortfall is a 1/β noise-amplification / nudge-conditioning problem, consistent with the diagnosis' "hr=0.02 is 10× too small" note (`EP_BELOW210_DIAGNOSIS_FIX.md:196-205`). This is the single most important empirical anchor and the prior draft under-weighted it. + +**F6 — EP and BPTT can optimize different states (point mismatch).** +EP relaxes to T1, records the T1 residual, then *refines* up to `t1max` before computing the gradient (`lt_ep_train.py:143-152`); but `evaluate` (`:279-287`) and `bptt_step` (`:260-265`) both use the T1 state. With `t1max>T1` the EP grad is taken at a *different, tighter* point than the one being evaluated. 
This is real, but F5 shows it is **not** the dominant failure once you actually converge T1. + +**F7 — The "stabilizers" are proxies, not the physical quantities the brief asks about.** +`jacreg` penalizes `‖J_nc v‖²` for random `v` (Hutchinson `‖J_nc‖_F²`), **not** `‖A‖`, ρ(S⁻¹A), or a log-norm μ_P(J) (`lt_ep_train.py:211-216`). `specnorm` projects **weight** singular values post-step (`lt_ep_train.py:290-332`, applied at `:561-562`), not the state Jacobian. The no-jacreg path instead feedback-controls the damping `blk.c` on the residual (`lt_ep_train.py:530-534`). None is a contraction-by-construction guarantee. + +**F8 — Stale-assumption / dead-parameter caveats.** +The diagnosis lists old bugs (holo `cforce` missing q/k norm; erf-vs-tanh GELU mismatch: `EP_BELOW210_DIAGNOSIS_FIX.md:83-86`). In the current code thick `tforce` uses tanh GELU (`lt_ep_train.py:84`) and the nudge forces include q/k RMSNorm (`holo_ep.py:44-46,142-144`), so those are largely closed. `asym_probe.py` still exposes `--gelu` with default `erf` and stores it on `blk` (`asym_probe.py:35,119`), but `EQBlock.force/tforce` in this file never read `blk.gelu`; they hard-code tanh-approx GELU (`lt_ep_train.py:84,105`). So the current mismatch is a stale/no-op probe flag, **not** an erf-vs-tanh force mismatch in the active code. Also: in `thick` mode `Wm`, `P`, `Q` are allocated and listed in `blk.block` (`lt_ep_train.py:41-50`) but never used by the thick force (`lt_ep_train.py:81-85`) — so any "parameter count" / energy-vs-free comparison must exclude them. + +--- + +## Q1 — Is the obstruction FUNDAMENTAL or an estimator artifact? + +**(a) Verdict — split, and the code resolves it decisively for the converged regime.** +- *Plain forward-only settled-state EP* on a non-conservative equilibrium: **FUNDAMENTAL** bias. The implicit-function gradient needs `λ` solving `Jᵀλ = -∂L/∂z`; a one-way nudge relaxes under `J`, giving `J⁻¹`, and `J⁻¹≠J⁻ᵀ` when `J≠Jᵀ`. 
No reading of settled states of the *forward* dynamics supplies the missing `Jᵀ` action. +- *AEP as actually used in the unclipped holomorphic/tracking paths* (F2/F3): the `J→Jᵀ` swap is **exact at any asymmetry** in the β→0, converged limit; the plain real path can lose exactness when its correction clip triggers. So there is **no fundamental asymmetry no-go for this estimator** — the brief's central "ρ(S⁻¹A)<1 wall" is **not** what the code suffers from. +- *What actually remains* (F5): an **estimator artifact** — at hr/β=0.02 the EP contrast is only 0.5–0.7 aligned with the (exact) adjoint even at residual 1e-9, and it *improves monotonically with β*. The gap is 1/β noise amplification of the nudge-convergence error, not the asymmetry. + +The brief's own sub-question — "is `J⁻¹−J⁻ᵀ` fundamentally inaccessible to a forward relaxation?" — is answered: **yes for a strictly forward-only substrate; no for this simulator, because it pays for `Jᵀv` with a digital VJP.** + +**(b) Physical-realizability:** Forward-only, no transpose → **No** (can't get exact non-conservative gradients). Forward relaxation **+ a physical transpose/adjoint phase** → **Yes in principle**, but that is a *second* physical primitive (see Q3), not vanilla EP. The current code is local in the weight update but **not** transpose-free. + +**(c) Cheapest falsifier:** On one converged ckpt + one batch, run `asym_probe.py --diag` style three-way compare with the AEP correction **disabled** (set `corr` to zero in the nudge, `lt_ep_train.py:176`). Mechanism "the transpose is what AEP buys" is falsified if **AEP-off EP still matches `g_transpose`** at high asymmetry. (All machinery exists: `asym_probe.py:804-854`; no training.) + +--- + +## Q2 — Decouple non-normality from lost contraction. Is the GOOD solution intrinsically high ρ(S⁻¹A)? 
+ +**(a) Verdict — ENGINEERING gap, and the brief's premise is unproven by the code.** +The damping enters as `-(1+c)z` (F1), so at a healthy fixed point `S≈-(1+c)I` and `S⁻¹A≈-A/(1+c)` — raising `c` shrinks `S⁻¹A` *mechanically*. The code lets you test this (`--c`, `lt_ep_train.py:348,397`). But "expressivity *requires* ‖S⁻¹A‖≈1.4" is **not** established: (i) the 1.4 is a proxy, not `S⁻¹A` (F4); (ii) the real `S⁻¹A` measurements (10–13, ρ 1.6–4.6) were taken at **non-converged** or **pre-blowup** ckpts with garbage solve residuals (F4), exactly where contraction is already lost — so they confound "intrinsic asymmetry" with "S near-singular," which is the brief's own worry (brief line 15). The diagnosis concurs the ρ verdict "answered the wrong question" (`EP_BELOW210_DIAGNOSIS_FIX.md:184`). + +Caveats from code: with `jacreg>0` (the documented recipe) the residual controller moves `jr`, **not** `c` (`:520-529`), so `c` is effectively fixed mid-run; only the no-jacreg branch adapts `c` (`:530-534`). `qknorm` bounds logits but does not guarantee contraction (`:63-68`). The `mono` branch's "sym(J)<0 guaranteed" comment (`:107-111`) is **false for the full operator** because it still adds raw `self.attn(zr)` (`:111`), whose antisymmetric part is unconstrained. + +**(b) Physical-realizability:** `c`, q/k-norm, weight scaling → forward-only, physical, **Yes**. `jacreg`/`specnorm` as written are autograd/SVD digital ops → **not** native local analog rules. A passive metric / per-channel `c` that holds ‖S⁻¹A‖<1 *uniformly without a penalty* is **not implemented**. + +**(c) Cheapest falsifier:** A `--c` sweep (e.g. 1,2,3,4) from identical init/recipe, logging val-CE, T1 residual, and one-batch `cos(g_EP,g_transpose)` + measured ρ(S⁻¹A) at matched CE. Mechanism "lost contraction, not intrinsic asymmetry" is falsified if large `c` keeps residual/ρ healthy but EP **still** collapses at the same CE wall. (`runs/ep_c3_scratch.*` already exists from such a run.) 
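
A toy check of the damping argument in Q2(a), as a minimal sketch only. It assumes a linearised `J = -(1+c)I + S_G + A_G` with a small random symmetric `S_G` and a random antisymmetric `A_G`; these are hypothetical stand-ins, not operators measured from any checkpoint, and `rho_and_norm` is an illustrative helper, not a function from `asym_probe.py`. It shows `S⁻¹A` shrinking mechanically as `c` grows and compares it to the `-A/(1+c)` approximation. It says nothing about whether a trained model keeps `S_G` small.

```python
import numpy as np

def rho_and_norm(c, S_G, A_G):
    """Spectral radius and 2-norm of S^{-1} A for S = -(1+c) I + S_G (toy model)."""
    n = S_G.shape[0]
    S = -(1.0 + c) * np.eye(n) + S_G
    X = np.linalg.solve(S, A_G)                      # X = S^{-1} A
    rho = np.max(np.abs(np.linalg.eigvals(X)))
    return rho, np.linalg.norm(X, 2), X

rng = np.random.default_rng(1)
n = 8
M = rng.standard_normal((n, n))
S_G = 0.05 * (M + M.T)                               # small symmetric part (assumed)
B = rng.standard_normal((n, n))
A_G = 0.5 * (B - B.T)                                # antisymmetric part (assumed)

results = {}
for c in (1.0, 2.0, 3.0, 4.0):
    rho, norm2, X = rho_and_norm(c, S_G, A_G)
    # Elementary norm bound, valid while (1+c) > ||S_G||_2.
    bound = np.linalg.norm(A_G, 2) / ((1.0 + c) - np.linalg.norm(S_G, 2))
    # Relative deviation from the S ~ -(1+c) I approximation.
    approx_err = np.linalg.norm(X + A_G / (1.0 + c), 2) / norm2
    results[c] = (rho, norm2, bound, approx_err)
    print(f"c={c:.0f}  rho={rho:.3f}  norm={norm2:.3f}  "
          f"bound={bound:.3f}  approx_err={approx_err:.3f}")
```

In this toy the shrinkage is purely the `1/(1+c)` rescaling; the interesting case for the real `--c` sweep is whether the trained attention/FFN weights grow to compensate, which a linear sketch cannot show.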
+ +--- + +## Q3 — Can the physical hardware compute its own adjoint (transpose dynamics)? + +**(a) Verdict — ENGINEERING gap *if* a transpose twin is allowed; FUNDAMENTAL *if* the substrate is strictly one-way.** The simulator already *assumes* transpose access everywhere (F2; `asym_probe.py` even solves `Jᵀλ=-ℓ` by GMRES, `:290-323,443-454`). So the code proves the *conditional*: given `Jᵀ`, non-conservative equilibrium gradients are exact (F5: `cos(g_transpose,g_BPTT)=1.0`). It says **nothing** about whether a one-way damped analog attention circuit can produce `Jᵀ` for free — and a reciprocal `J=Jᵀ` circuit (the EBM case) is precisely what independent Q,K,V,O (`:39-40`) breaks. + +Hamiltonian-echo / time-reversal does **not** drop in here: the simulated dynamics are explicitly **dissipative** (`-(1+c)z`, `:85`). Reversing them needs active gain or a built transpose-twin network, not passive settling. + +**(b) Physical-realizability:** Same substrate, forward only → **No**. Substrate with a programmed transpose twin / reciprocal-by-construction coupling / reversible read path → **plausibly Yes**, at the cost of a second adjoint relaxation + calibrating the transposed operators. Current code = the digital stand-in for that. + +**(c) Cheapest falsifier:** `asym_probe.py` on a converged ckpt prints `cos(g_transpose,g_BPTT)`. "Transpose hardware is sufficient" is falsified if `g_transpose ≠ g_BPTT` even at tight residual — then the missing piece is finite-time, not the transpose. (Logs show it is **NOT** falsified: cos=1.0 at res 1e-9, `diag_probe.log:13`.) + +--- + +## Q4 — Is "equilibrium" the right primitive? Finite-time vs fixed-point. + +**(a) Verdict — ENGINEERING gap, and the code partly contradicts the brief's "finite-time is the stabilizer" thesis.** Point mismatch is real (F6). The brief/diagnosis claim BPTT's edge is the finite-horizon transient term that settled-state EP structurally lacks (`EP_BELOW210_DIAGNOSIS_FIX.md:12-20`). 
**But** F5 shows that at T1=150 with residual→0, `g_transpose` (a pure *fixed-point* adjoint, no unrolling) equals `g_BPTT` **exactly** (cos 1.0). If a finite-horizon term were the missing physics, the fixed-point adjoint could not match BPTT. So at convergence the two coincide and the residual gap is the estimator (F5), not a finite-time term. The diagnosis reaches the same retraction (`:186-189`). The resreg "finite-time defense" (`lt_ep_train.py:220-231`) was measured ~orthogonal to the true missing gradient (`EP_BELOW210_DIAGNOSIS_FIX.md:119-124`) — consistent: it was patching a non-problem. + +The brief is right in the *divergent* regime: once contraction is lost, z150 ≠ z* and BPTT-of-the-unroll implicitly rewards contraction while settled-state EP does not. So finite-time matters as a **stability crutch off-equilibrium**, not as a missing **gradient component at equilibrium**. + +**(b) Physical-realizability:** Pure fixed-point EP → physically clean but mis-targeted off-equilibrium. Exact finite-T BPTT → **not** realizable (no backward-through-time). Forward eligibility/RTRL physical accumulation → possible in principle, **not** in this code, and O(params²) state — poor scaling. + +**(c) Cheapest falsifier:** One ckpt, one batch: compare `g_EP(t1max=T1)`, `g_EP(t1max=2000)`, `g_BPTT(T1)`, `g_transpose`. "Finite-time term is the gap" is falsified if `g_transpose=g_BPTT` (it does, `diag_probe.log:13`) — already essentially falsified by existing logs at convergence. + +--- + +## Q5 — Non-normality as physics: must it hurt, and can it be bounded for free? + +**(a) Verdict — ENGINEERING gap; the code uses proxies, not the right invariant.** The physically natural regularizer is contraction-in-a-metric (log-norm `μ_P(J)<0`) / passivity, which a passive RC network enforces automatically. 
The code's controls are all proxies (F7): damping `c` (a genuine negative-diagonal contribution, `:85`), `qknorm` (logit bound), `jacreg` (‖J_nc‖_F², symmetric-blind), `specnorm` (weight σ, not state-Jacobian). None measures or bounds `μ_P(J)` or `‖A‖`. So bounded non-normality is **sufficient in principle** (and probably plenty for LM-scale computation), but this block is an *unconstrained active non-reciprocal* module + damping, not a passivity-by-construction element. + +**(b) Physical-realizability:** Passive/contractive design → **Yes**, but restricts the operator class. `jacreg`/`specnorm` → digital. Unconstrained active attention → realizable only with active elements that can ring/overshoot (the transient growth) and thus need explicit contraction safeguards. + +**(c) Cheapest falsifier:** Add a one-batch log-norm monitor `max_v ⟨v,Sv⟩/‖v‖²` reusing the existing `Operators.sv` (`asym_probe.py:247-250`); compare unconstrained vs `--specnorm` vs high-`c` at matched CE. "Passivity prevents the failure" is falsified if a run with consistently `μ_P(J)<0` and low residual **still** shows EP gradient collapse. + +--- + +## Q6 — The conservative-attention escape: fundamental or engineering cost? + +**(a) Verdict — ENGINEERING cost; no fundamental cost is demonstrated by the code.** An `energy` mode exists: a scalar `E` plus LSE attention-energy, differentiated to give a conservative force (`lt_ep_train.py:70-76,113-117`). But it is **not** an apples-to-apples symmetric version of the thick block: thick uses independent Q,K,V,O + LN + untied 4× FFN (`:39-46,81-85`), whereas `attn_energy` is a **tied** LSE over q·k with **no V/O path** (`:70-76`) — a strictly weaker, value-less form. So any measured 0.15–0.2 CE tax reflects *this particular* energy parameterization, not an information-theoretic bound on conservative attention. A partially-conservative sweet spot (small controlled `A` + exact AEP/transpose) is plausible and untested in code. 
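
To make the Q6(a) distinction concrete, here is a small NumPy sketch under stated assumptions. It is not the repo's `attn_energy` or `EQBlock`; the shapes, weights and the names `energy_force` / `free_force` are hypothetical. A force obtained by differentiating a scalar LSE energy over q·k has a symmetric state Jacobian (`J=Jᵀ`), while a softmax attention with independent Q,K,V,O maps generically does not.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, h, beta = 4, 3, 3, 1.0                         # toy sizes (assumed)
Wq, Wk, Wv = (rng.standard_normal((d, h)) for _ in range(3))
Wo = rng.standard_normal((h, d))

def softmax_rows(L):
    L = L - L.max(axis=1, keepdims=True)             # stabilise the exponentials
    P = np.exp(L)
    return P / P.sum(axis=1, keepdims=True)

def energy(z):
    """E(z) = -(1/beta) * sum_i logsumexp_j(beta * q_i . k_j); no V/O path."""
    Z = z.reshape(T, d)
    L = beta * (Z @ Wq) @ (Z @ Wk).T
    m = L.max(axis=1, keepdims=True)
    return -np.sum(m[:, 0] + np.log(np.exp(L - m).sum(axis=1))) / beta

def energy_force(z):
    """Conservative force f = -dE/dz, written out analytically."""
    Z = z.reshape(T, d)
    Q, K = Z @ Wq, Z @ Wk
    P = softmax_rows(beta * Q @ K.T)
    return (P @ K @ Wq.T + P.T @ Q @ Wk.T).ravel()

def free_force(z):
    """Directed attention with independent Q, K, V, O maps."""
    Z = z.reshape(T, d)
    P = softmax_rows(beta * (Z @ Wq) @ (Z @ Wk).T)
    return (P @ (Z @ Wv) @ Wo).ravel()

def fd_jacobian(f, x, eps=1e-5):
    """Central finite-difference Jacobian of f at x."""
    J = np.zeros((x.size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        J[:, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

def asymmetry(J):
    return np.linalg.norm(J - J.T) / np.linalg.norm(J)

z0 = 0.5 * rng.standard_normal(T * d)
asym_energy = asymmetry(fd_jacobian(energy_force, z0))
asym_free = asymmetry(fd_jacobian(free_force, z0))
print("relative asymmetry, energy force:", asym_energy)
print("relative asymmetry, free attention:", asym_free)
```

The same `asymmetry` measure could be applied to an interpolant between the two forces to probe the partially-conservative regime, though that interpolant is not something the code in these files defines.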
+ +**(b) Physical-realizability:** Fully conservative energy attention → **Yes**, cleanest analog-equilibrium route (EP exact, `J=Jᵀ`, no transpose phase). Partially conservative w/ bounded `A` → Yes if `A` small or transpose twin available. Free directed attention w/o safeguards → **No** for exact local EP. + +**(c) Cheapest falsifier:** Active-parameter-matched Pareto sweep — `energy` vs `thick` vs a tied/penalized-`A` interpolant — logging CE, residual, cos, asymmetry. "Conservative cost is only engineering" is falsified if **every** near-conservative variant at matched *active* params (excluding dead Wm/P/Q, F8) stays stuck at a large CE gap while free attention succeeds. + +--- + +## Q7 — The make-or-break: readable ~10–50M LM inside the analog-EP regime? + +**(a) Verdict — NOT proven by the code, NOT ruled out.** The hardware-realizable regime is conservative/contractive + tight convergence + local update + **no digital VJP/BPTT**. The current strongest path (thick AEP) **violates the no-transpose constraint** (F2), so it is not itself a proof of the thesis. The honest physical candidate is `energy`/`mono`/contractive attention — which in this code is the *weaker, value-less* form (Q6). And the current model is small and confounded: with C=512, T=256, vocab=4096, most params are embeddings/readout, and Wm/P/Q are dead in thick mode (F8). So no clean active-parameter-matched analog-regime LM result exists in these files. Empirically, BPTT reaches val-CE **1.83** (`runs/bptt_clean.log:162`) while the converged-from-scratch EP/AEP recipe stalls/diverges around **2.0–2.9** depending on hr (`runs/ep_hr02.log:16-22` aborts at 2.87 with hr=0.02; the diagnosis attributes this to mis-set hr, `EP_BELOW210_DIAGNOSIS_FIX.md:196-205`). That gap is currently an **estimator/hr** gap (F5), not a proven expressivity ceiling. + +**(b) Physical-realizability:** Strict conservative/contractive local EP → plausible, unproven for readable 10–50M. 
Current thick AEP → not a valid proof (digital transpose + dead params). Hybrid (conservative/contractive analog core + thin transpose-capable or digital non-conservative correction) → most realistic near-term, but breaks the strict "no backward at all" thesis. + +**(c) Cheapest falsifier:** Train the smallest *clean* hardware-faithful model — active-param-matched `energy`/`mono`, **AEP/VJP off**, **no BPTT**, same tokenizer/data/eval — and sample text at fixed CE milestones. Thesis falsified if such a model can't cross a pre-declared readability/CE bar that same-budget free directed attention clears. Must report **active** param count (exclude dead Wm/P/Q, F8). + +--- + +## Bottom line + +1. The brief's headline no-go — "AEP dies because ρ(S⁻¹A)>1" — **does not match the code**. The unclipped holomorphic/tracking AEP correction implements an exact `J→Jᵀ` identity (F3), and the diagnosis already retracts the warranty claim (`EP_BELOW210_DIAGNOSIS_FIX.md:174-184`). +2. The "1.4" asymmetry number is a proxy `|2Av|/|Jv|`, not `‖S⁻¹A‖`; the real measured `S⁻¹A` is ~10–13 with ρ unresolved (1.6–4.6) and only at non-converged points (F4). +3. The **true fundamental** obstruction is narrow and physical: a strictly forward-only, transpose-free substrate cannot realize `J⁻ᵀ` for a non-conservative equilibrium (Q1/Q3). This simulator sidesteps it with digital VJPs, so it is not a forward-only proof of concept. +4. The **dominant practical failure** in this code is an **estimator-conditioning** problem (β/hr too small ⇒ 1/β noise), demonstrated at residual 1e-9 where `g_transpose=g_BPTT` exactly but `cos(g_EP,g_transpose)`=0.5–0.7 and rises with β (F5). Finite-time and asymmetry are secondary/off-equilibrium effects. +5. 
The hardware-clean route is conservative/contractive attention or a real physical transpose twin — `jacreg`/`specnorm` are proxies (F7), the `energy` mode is value-less (Q6), and readable small LMs in the strict regime remain **plausible but unproven** in these files (Q7). diff --git a/docs/hardware/GPT_PRO_PHYSICS_ANSWER.md b/docs/hardware/GPT_PRO_PHYSICS_ANSWER.md new file mode 100644 index 0000000..6dfbde2 --- /dev/null +++ b/docs/hardware/GPT_PRO_PHYSICS_ANSWER.md @@ -0,0 +1,46 @@ +# GPT-Pro physics consult — actionable synthesis (2026-06-21) +Full answer pasted by user (in session). This file = the load-bearing extraction + the prescribed experiment. + +## The decisive reframe: the no-go is REAL but NARROW +- **Same-dynamics, settled-state-only EP computes `J⁻¹ℓ`; the true implicit gradient needs `J⁻ᵀℓ`.** + True adjoint `λ = −J⁻ᵀℓ` (ℓ=∂L/∂z*); EP's infinitesimal nudge contrast solves `J s = −ℓ` ⇒ `s = −J⁻¹ℓ`. + Exactness for all ℓ ⇒ requires `J = Jᵀ` (self-adjoint in the physical metric P: `PJ=JᵀP`). **That is the precise no-go.** +- **But the BROADER no-go is FALSE.** `ρ(S⁻¹A)<1` is ONLY the convergence condition of the *first-order* (Neumann) AEP + expansion — NOT a universal impossibility. Exact non-conservative learning EXISTS if you alter the learning dynamics: + - **AsymEP** (Scurria 2026): add a local antisymmetric correction in the nudged phase so the modified nudged Jacobian = `Jᵀ`. + - **Dyadic EP** (Scurria 2026): double the state space; the auxiliary variable becomes a physical error signal ∝ cost gradient. EXACT for arbitrary non-conservative stationary systems. +- ⇒ **The obstruction MOVES from "gradient estimator impossible" to "can the HARDWARE embody the adjoint (transpose couplings / doubled state) without becoming a hidden digital backprop engine?"** Math exists; physical realizability is the open problem. +- Exact error operator (better than the ρ proxy): `E = J⁻¹Jᵀ − I = −2 J⁻¹A`. `‖S⁻¹A‖` is a diagnostic only when S is well-conditioned. 
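
The reframe above can be checked on a toy linear operator. This is a minimal sketch, not the repo's probe: `J`, `S`, `A` and `ell` are random stand-ins assumed for illustration. It verifies that the same-dynamics solve `s = −J⁻¹ℓ` and the adjoint `λ = −J⁻ᵀℓ` differ by exactly `Eλ` with `E = J⁻¹Jᵀ − I = −2J⁻¹A`, and that subtracting `(J−Jᵀ)` from the nudged operator recovers the adjoint.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Hypothetical linearised Jacobian J = S + A at a fixed point:
# S symmetric negative definite (damped), A antisymmetric (non-reciprocal).
M = rng.standard_normal((n, n))
S = -(2.0 * np.eye(n) + 0.1 * (M + M.T))
B = rng.standard_normal((n, n))
A = 0.5 * (B - B.T)
J = S + A
ell = rng.standard_normal(n)                 # stands in for dL/dz*

s_ep = -np.linalg.solve(J, ell)              # same-dynamics nudge: J s = -ell
lam = -np.linalg.solve(J.T, ell)             # true adjoint: J^T lam = -ell

# Error operator: E = J^{-1} J^T - I, which equals -2 J^{-1} A.
E = np.linalg.solve(J, J.T) - np.eye(n)
E_alt = -2.0 * np.linalg.solve(J, A)

# AsymEP-style correction: the nudged operator J - (J - J^T) is J^T.
J_corrected = J - (J - J.T)
s_aep = -np.linalg.solve(J_corrected, ell)

cos = s_ep @ lam / (np.linalg.norm(s_ep) * np.linalg.norm(lam))
print("cos(s_ep, lam):", cos)
print("||E||_2:", np.linalg.norm(E, 2))
print("max |s_aep - lam|:", np.max(np.abs(s_aep - lam)))
```

Scaling `A` by a factor ε in this toy gives a quick feel for how the alignment between `s_ep` and `λ` degrades with asymmetry, while the corrected solve stays exact by construction.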
+ +## THE PRESCRIBED PRIORITY EXPERIMENT (GPT-Pro's stated #1: "oracle adjoint physics ablation") +At the SAME checkpoints, implement THREE gradient estimators + sweep an asymmetry knob ε (scale of the antisymmetric attention component): +1. **vanilla/our EP** (`g_EP`) — gives `J⁻¹ℓ`-flavored. +2. **exact transpose relaxation** (`g_transpose`): solve `Jᵀ λ = −ℓ` (GMRES; Jᵀ via vjp), then `g = (∂F/∂θ)ᵀ λ`. = the EXACT implicit gradient. +3. **Dyadic/AsymEP** with exact local `A` from autograd as a simulator ORACLE. +Reference: `g_BPTT` (finite-T). +**DECISION:** if `g_transpose` (and/or dyadic) stays aligned + correctly scaled vs `g_BPTT` even at `ρ(S⁻¹A)>1` (closing our measured 84% gap) ⇒ **the collapse is NOT a mathematical no-go — it's a hardware-realization / convergence problem** ⇒ the research problem becomes "build a local transpose/dyadic substrate." If `g_transpose` ALSO fails ⇒ culprit is finite residual / nonlinearity / saddle instability / loss of contraction, NOT the conservativity assumption. +- MUST run at a HEALTHY tight fixed point (the BPTT good solution, or a converged specnorm ckpt) — at the divergence edge (no fixed point) the solve is ill-conditioned (we saw it: MINRES resid 0.2). + +## Q2 (damping) — our suspicion CONFIRMED as plausible +`‖S⁻¹A‖ ≤ ‖A_G‖ / ((1+c) − ‖S_G‖)`. Two routes to high ρ: numerator `‖A_G‖` grows, OR contraction margin `(1+c)−‖S_G‖` collapses (S→singular). Weak-contraction good solutions fit the SECOND. Raising `c` buys EP-validity ONLY for *gratuitous* asymmetry; if you rescale attn/FFN by `(1+c)` to preserve the map, A_G & S_G scale too and ρ may not improve. **Damping sweep is high-value.** Decompose ρ into `‖A‖`, log-norm margin `μ_P(J)`, transient gain `sup_t‖e^{Jt}‖`. + +## Q3 (physical adjoint) — buildable, not free +- Transpose relaxation `λ̇ = Jᵀλ + ℓ` = recurrent-backprop/adjoint relaxation (NOT BPTT-through-time), physically allowed IF substrate has transpose couplings. 
Reciprocal crossbar: `Wᵀ` = swap drive/sense ports. Nonreciprocal attention: transpose includes state-dependent transposes of LN/softmax/QK/value/output ⇒ a "transpose twin" = duplicate routing + calibration + derivative circuits + extra phase. +- **Hamiltonian Echo (HEB/HEL)**: physical adjoint via time-reversal — but does NOT directly extend to DAMPED dissipative dynamics (dissipation destroys time-reversal info). Need bath/ancilla embedding or dyadic doubling. (Pourcel 2026: Lagrangian EP ⊇ HEL, forward-only, passes-independent-of-size, local.) +- Passive RC/resistive networks are energy-based (Kirchhoff ⇒ energy) — that's WHY local EP works there; they will NOT natively give `Jᵀ` for directed attention. +- Falsify: simulate transpose mismatch `Jᵀ+Δ` (sparsity/calibration/quantization error) → measure cos/norm-error/stability vs `‖Δ‖/‖J‖` = the HARDWARE TOLERANCE SPEC. + +## Q5/Q6 — non-normality & conservative cost +- Non-normality is NOT purely bad (nnRNN uses it for expressivity w/ stable spectra). Right objective: bound transient gain + adjoint mismatch, not kill A. Physical regularizer = contraction in a metric (`μ_P(J)<−m`), NOT Euclidean spectral radius. +- Conservative-attention's 0.15–0.2 CE gap is likely **parameterization/efficiency cost, NOT an information-theoretic wall** (conservative systems can encode directed maps via latent/paired compartments). Practical path = partially-conservative `F = F_energy + ε F_skew`, controlled ε per head/layer, keep `ρ<1` if using vanilla EP. + +## Q7 — make-or-break (staged, not monolithic) +Train 4× (10–50M, same corpus): (a) free non-conservative attn + BPTT [oracle]; (b) conservative Energy/Hopfield/CET + exact digital grad [architecture upper bound]; (c) same conservative + EP at strict residual [EP-estimator test]; (d) partially-conservative ε-skew + exact transpose/Dyadic EP. 
The result names the WALL: architecture (a bad) / EP-estimator (b good, c bad) / physical-adjoint (transpose works, local impl fails) / non-equilibrium (only finite-T BPTT readable). + +## Refs surfaced (pursue) +Dyadic EP + AsymEP = Scurria 2026 (arXiv:2602.03670, the EXACT non-conservative method). Lagrangian EP / Hamiltonian Echo Learning = Pourcel 2026 (arXiv:2506.06248) + RHEL (arXiv:2506.05259, forward-only BPTT proxy). Energy Transformer = Hoover/Krotov (arXiv:2302.07253); Convergent Energy Transformer (EP-trainable, OpenReview Qrfml76eWJ — already in our HW collab docs). Monotone-operator equ. nets = Winston-Kolter (arXiv:2006.08591). Resistive-net universal approx = arXiv:2312.15063. nnRNN = arXiv:1905.12080. + +## NEXT (our plan) +1. BUILD the oracle-adjoint probe (extend asym_probe: add `g_transpose` via GMRES solve of `Jᵀλ=−ℓ` + `(∂F/∂θ)ᵀλ`; + synthetic ε-asymmetry sweep on the J operator). [codex] +2. RUN at the BPTT good solution (watcher pings) + the ε-sweep on a current converged ckpt. DECIDES fundamental-vs-hardware. +3. Secondary: ρ(S⁻¹A) decomposition (‖A‖ / μ_P margin / transient gain) at BPTT ckpts; damping-c continuation. diff --git a/docs/hardware/HW_RESEARCH_FINDINGS.md b/docs/hardware/HW_RESEARCH_FINDINGS.md new file mode 100644 index 0000000..ea5a6b8 --- /dev/null +++ b/docs/hardware/HW_RESEARCH_FINDINGS.md @@ -0,0 +1,98 @@ +# Analog-hardware substrate research — findings (2026-06-21) + +Deep-research run (108 agents, 25 sources, 118 claims → 22 adversarially-verified 3-0/2-1). +Raw verified claims + source URLs + quotes: `hw_research_claims.json`. Synthesis below is mine. +(The run's auto-synthesis step died on a mid-run /login 401; no DATA lost — all 22 verified claims recovered.) 
+ +## THE decisive split confirmed: TRAINABLE-but-small vs LARGE-but-fixed +The single most important filter — does the substrate support **in-situ weight update** (EP needs it) — cleanly partitions the market: + +### LARGE but FIXED-WEIGHT (inference-only — fail EP's in-situ filter as-is) +- **Mythic M1076** (analog flash CIM): **80M weights/chip**, eval boards / M.2 / PCIe cards exist. BUT explicitly **inference-only** — train off-device, program once. [mythic.ai, 3-0] +- **IBM HERMES** (PCM, 14nm, 64×256×256 = **4.2M weights**, mixed-signal): research chip, **inference-only**, weights programmed once via hardware-aware training. [Nature Electronics 2023, 3-0] +- **MRAM / PCM crossbars** generally: program-once, fixed during inference; authors state in-situ training "increases energy + degrades device lifespan" → why the whole field avoids it. [Science, NCBI, 3-0] +- → These give SCALE (the tens-of-M you want) but can't do EP's repeated local updates without re-flashing. + +### TRAINABLE in-situ (small, but the EP-correct regime) +- **Bulk-switching memristor CIM module** (arXiv 2305.14547): experimentally implements **on-chip mixed-precision TRAINING** with in-situ VMM. KEY mechanism: **digital high-precision update accumulation, physically program the memristor only when accumulated Δw exceeds a threshold** — exactly the hybrid scheme that limits write/endurance stress in an EP loop. [3-0] ← **this is the template for our update path.** +- **In-situ training demonstrated** on memristor crossbars for MLP/CNN/LSTM/RL — local in-array updates during a learning loop are physically real. [arXiv, 3-0] +- Constraints to design around: limited NVM **endurance**, **asymmetric/nonlinear** weight update, variability, retention, stuck-at-faults. Compensation methods exist (stochastic rounding etc.). [escholarship, 2-1/3-0] + +## EP / equilibrium learning ALREADY physically realized (precedent exists!) 
+- **PNAS — self-learning analog resistor network** (Coupled Learning, EP-cousin): XOR + nonlinear regression learned **fully in-situ, NO computer, NO backprop**. Weights = transistor gate-voltage on a local 22µF cap, updated by on-edge circuitry from the **local free-vs-clamped difference**. Forward = physical settling, **τ≈1µs**; learning on 18ms timescale. [PNAS, 3-0] ← **proof the whole concept works in COTS-buildable analog.** +- **EP on D-Wave** (quantum annealer Ising machine): the physical machine does both free + nudge relaxation to steady state (settling is physical). Learning rule is **local** (updates from the two equilibrium states, no backprop). Caveat (1-1): weights live on the classical computer; only couplings loaded per phase → hybrid, not fully in-situ. [Nature, 3-0 on the local-rule claim] +- → EP/local learning on physical equilibrium hardware is **demonstrated**, not speculative. Our contribution would be doing it for a TRANSFORMER block at scale. + +## Softmax/attention in analog (the hard part) +- Confirmed open challenge: Transformers need frequent Q/K/V updates, which **conflicts with crossbars' weakness at reprogramming** — flagged as an open HW problem. [arXiv, 3-0] +- (The energy/Hopfield-attention analog-native route verification was among the 3 claims killed by the 401 — needs a re-run. The pragmatic mixed-signal answer — softmax/LN/GELU in FPGA, linear+relaxation in analog — was the framing, not contradicted.) + +## BOTTOM LINE for our build (synthesis) +The market splits exactly as feared: **you cannot buy one module that is both tens-of-M AND in-situ-trainable.** So: +- **Phase 1 (trainable, small) — DO THIS FIRST.** Stitch a **bulk-switching/memristor CIM eval module** (in-situ, threshold-accumulated update) + an **FPGA** (softmax/LN/GELU + the EP control loop: settle→nudge→settle→local Δθ). Prove ONE equilibrium-transformer block trains end-to-end via EP in analog. 
The PNAS resistor-network + the memristor-training paper together show every piece is real. +- **Phase 2 (scale) — LARGE-but-fixed used cleverly.** Use Mythic-80M / HERMES-class for the bulk fixed linear MVM (the relaxation forward), and keep ONLY the trainable/updated weights on the in-situ substrate, OR do mixed-signal "analog-forward, digital-accumulate, periodic-reflash" updates (the threshold-program trick) to tolerate their write limits. +- **Update path = the crux.** Adopt the verified hybrid: **accumulate Δθ in digital high-precision, physically program the analog weight only when |Δθ|>threshold.** This is what makes EP survive endurance limits. +- **De-risk in sim first (free):** the code's `--fnoise` already models multiplicative analog noise — sweep device noise / quantization / asymmetric-update in the 1B sim before buying anything. + +## Re-run #2 (2026-06-21, focused) — GAP 1 SOLVED, GAPs 2/3 still thin +Raw: `hw_research_claims2.json`. 107 agents, clean run (no auth drop). + +### GAP 1 — analog attention: ANSWERED. It exists, across substrates, but all inference-only. +- **Real fabricated silicon**: UCSD **65nm charge-based SRAM-CIM attention** chip (Moradifirouzabadi/Dodla/Kang, arXiv 2409.04940, ESSERC 2024) — first charge-based analog CIM in SRAM for transformers, **measured 14.8 TOPS/W**, 9-T bitcell does Q·Kᵀ via capacitor charge-sharing. [high] +- **Jülich gain-cell in-memory attention** (Leroux et al., arXiv 2409.19315, Nature Comp Sci 2025): charge-on-capacitor, **~70,000× energy / ~100× speed vs GPU** (simulated). [high] +- Memristor: Nature Sci Reports 2024 self-attention accel (128×128, 2-bit); **STAR RRAM softmax engine** (arXiv 2401.17582). Photonic: TFLN-microring softmax PROPOSAL (arXiv 2603.12934, not fabricated). [high/med] +- **Softmax IS analog-realizable in principle**: a subthreshold source-coupled differential-pair / WTA network computes normalized-exp **"for free" via KCL at the shared tail node** (translinear). 
[high] — so an energy/LSE-attention analog route is physically grounded. +- **BUT GAP 1(c) CONFIRMED**: real prototypes **overwhelmingly use the mixed-signal split we proposed** — softmax/LN/normalization in DIGITAL/LUT/FPGA, only the linear maps + dot-products in analog. So our architecture choice is the validated one. [high] +- **EVERY analog-attention implementation found is INFERENCE-ONLY / fixed-weight** (Jülich uses offline HW-aware init + offline backprop fine-tune before deploy). Reinforces: nobody has done in-situ-trained analog attention → that IS our novel contribution. [high] +- Noise budget datapoint: a variation-aware memristor-ViT sim tolerates **~35% compute error + ~10% conductance variation** while matching digital Top-1 (MDPI Electronics 2026) — encouraging for the `--fnoise` de-risk. [med] +- Caveat (Sillman, arXiv 2305.13649): an analog softmax block only pays off INSIDE a fully-analog system; isolating it behind ADC/DAC dwarfs the saving → keep softmax digital UNLESS going fully analog. [med] + +### GAP 2 (buy-now SKUs) + GAP 3 (endurance/ECRAM) — STILL OPEN +The re-run did NOT substantively verify these (its own summary says so). The one product claim (Knowm $800 kit) was REFUTED/split. So procurement (TetraMem/Mythic/Anadigm/Aspinity SKU+price+order-today) and the **make-or-break endurance budget (RRAM/PCM/FeFET/Flash vs ECRAM writes-to-failure)** remain genuinely unanswered. Indirect signal only: NVM rejected for KV-cache because of slow/high-energy/low-endurance writes; gain-cells chosen for endurance. + +### Still to pin (3rd focused pass — procurement + endurance ONLY) +1. SKU-level buy-now: TetraMem MX100, Mythic dev kit, Anadigm AN231E04 board, Aspinity AML100, any RRAM eval kit — orderable today? price? (deep-research struggles here — may need vendor sites / direct contact, not web search.) +2. 
Per-device **write endurance**: RRAM/PCM/FeFET/Flash/**ECRAM** cycles-to-failure; is ECRAM the symmetric-update + endurance fix, and is it available outside research labs? (Likely research-only — flag if so.) +3. With digital-accumulate-then-threshold-program, how many physical writes does a ~30k-step EP run actually incur, vs device endurance? + +## UIUC ECE collaboration map (2026-06-21, user-directed — the hardware-side gap) +User's reachable hardware groups (ALL ECE — this is the team's missing layer). The key insight: the +three UIUC ECE groups span EXACTLY the three layers an in-situ-EP analog demo needs, and together they +SOLVE the market's fatal gap (you can't BUY an in-situ-trainable analog array — but you can fab one in-house): +- **Wenjuan Zhu (UIUC ECE) = DEVICE layer** [user-confirmed]: memristor/RRAM/FeFET / 2D-material devices. + This is the in-situ-trainable substrate that is research-only on the market — her group can FABRICATE it. +- **Naresh Shanbhag (UIUC ECE) = CIRCUIT/ARCH layer**: SRAM in-memory compute (DIMA/C3SRAM line) — the analog MVM. +- **Pavan Hanumolu (UIUC ECE) = MIXED-SIGNAL GLUE**: ADC/DAC, PLL, switched-cap — the converters + analog + integrator for the relaxation/control loop (settle→nudge→settle→local Δθ). +- Tao Chen (USTC) = hardware but NOT EP; Stanford = student can broker intros (Wong RRAM / Murmann-legacy CIM / etc.). +STRATEGY SHIFT: not "buy a board" — it's in-house fab of the trainable substrate (Zhu) + CIM circuit (Shanbhag) ++ converter glue (Hanumolu) + our FPGA/EP control loop. Sourcing deep-research w1kuw4zmz profiling all + industry; +its "Zhu" angle mis-targets ML-accel (wrong layer) — corrected to Wenjuan-Zhu-device here; will run a focused +pass on her device work + merge. 
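
A back-of-envelope sketch for open question 3 above (physical writes under digital-accumulate-then-threshold-program). Everything numeric here is an assumption for illustration: the Gaussian per-step Δθ, its 1e-3 scale, the 1e-2 threshold and the 32 weights are hypothetical, and the write is idealised (exact, symmetric, no device noise). `threshold_program` is an illustrative helper, not code from the repo or from the cited memristor paper.

```python
import numpy as np

def threshold_program(updates, threshold):
    """Accumulate per-step updates digitally; 'program' a weight only when
    its accumulated update exceeds the threshold in magnitude.

    updates: array of shape (steps, n_weights).
    Returns (programmed weights, leftover accumulator, per-weight write counts).
    """
    steps, n = updates.shape
    acc = np.zeros(n)                        # high-precision digital accumulator
    w = np.zeros(n)                          # idealised analog weights
    writes = np.zeros(n, dtype=np.int64)     # physical write events per weight
    for delta in updates:
        acc += delta
        fire = np.abs(acc) > threshold
        w[fire] += acc[fire]                 # idealised exact programming pulse
        writes[fire] += 1
        acc[fire] = 0.0
    return w, acc, writes

rng = np.random.default_rng(0)
steps, n_weights = 30_000, 32                # ~30k-step run; toy weight count
updates = 1e-3 * rng.standard_normal((steps, n_weights))   # assumed update scale
threshold = 1e-2                             # assumed programming threshold

w, acc, writes = threshold_program(updates, threshold)
print("max writes per weight:", writes.max(), "of", steps, "steps")
print("mean writes per weight:", writes.mean())
```

Replacing the synthetic `updates` with logged Δθ from a simulated EP run, and the idealised write with an asymmetric or quantised one, would turn this into an actual endurance estimate to compare against device cycles-to-failure once those figures are pinned down.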
+ +### Sourcing run RESULTS (w1kuw4zmz, 11 high-conf named-paper findings; raw: hw_groups_claims.json) +**HEADLINE: Shanbhag (UIUC) is the closest match of ANY named group — and it's the ONLY group silicon that already does analog MVM + genuine on-chip in-situ weight update.** +- **Shanbhag DIMA chip** (Gonugondla/Kang/Shanbhag, **JSSC 2018**, "Variation-Tolerant In-Memory ML Classifier via On-Chip Training"): 65nm, 16kB 6T-SRAM, analog MVM via "functional read" + charge-sharing, AND a **dedicated on-chip digital trainer doing SGD + writing weights back to the array each batch** (random→within 1% of FP in ~400 batches). In-situ training gave **2.4× energy cut** at iso-accuracy. ⇒ the two EP-critical mechanisms (analog MVM + on-chip weight write-back) ALREADY in one fabricated UIUC chip. CAVEAT: it's a single-layer SVM, batch-SGD — **no relaxation/settling loop, no multilayer credit assignment**. We'd add the equilibrium dynamics + two-phase EP rule on top. [high] +- Shanbhag also: **C3SRAM** (w/ ASU, JSSC 2020) — capacitive-coupling XNOR-MAC, but inference-only binary. +- **Hanumolu** — (the run under-covered him; he's the ADC/DAC+integrator glue, still the right converter-layer partner; pin specific silicon separately.) +- **"Zhu"/Lin correction CONFIRMED**: Yingyan **Celine Lin** is UIUC-PhD, now **Georgia Tech** (not current UIUC faculty, surname Lin not Zhu), digital-accel/co-design, **no analog/RRAM/in-situ silicon**. ⇒ the device-layer partner is **Wenjuan Zhu** (user-confirmed), NOT Lin. +- **USTC Tao Chen (陈涛)**: device/materials "in-materio" reservoir computing (disordered dopant-atom networks in Si, Nature 2020). Real device work but NOT circuit-CIM, NOT EP — confirms user's "no EP hardware." Possible device-physics collaborator, not a demo host. 
+- **Stanford NeuRRAM** (Wong + Raina + UCSD Cauwenberghs, **Nature 2022**): 48-core, ~3M-cell RRAM analog-CIM — the most EP-relevant *substrate* (analog MVM at scale), **BUT INFERENCE-ONLY** (weights programmed offline; only chip-in-the-loop forward fine-tune). Gives the MVM primitive, not native in-situ learning. + - **Wong = H.-S. Philip Wong (Hon-Sum Philip Wong, 黄汉森)**: Stanford EE, Willard R. & Inez Kerr Bell Professor; also **TSMC Chief Scientist**. RRAM/memristor, 3D monolithic integration, in-memory computing; co-author of NeuRRAM + the canonical "Memristive devices for computing" review (Nature Nanotech 2013). THE RRAM-device heavyweight for the trainable-substrate conversation (reach via the Stanford student contact). TSMC tie = a path to real foundry RRAM. +- **Industry = all inference-only**: TetraMem **MX100** (Nature Electronics 2025; 10 cores, 248×256 1T1R RRAM+RISC-V) ships real silicon but **inference-only** (no in-situ update). Mythic/EnCharge class same. +- **DIY in-situ test-chip path**: **SkyWater S130 + Weebit Nano 256Kb ReRAM IP** (JEDEC/AEC-Q100 qualified 2023, open SKY130 PDK) = foundry RRAM access for an MPW — a lab can fab its OWN trainable RRAM array. [med] +- **EP-on-hardware = still only SPICE sim**: "Memristor Crossbar Circuits Implementing Equilibrium Propagation" (Oh et al., Kookmin U) is circuit simulation, NOT silicon. ⇒ **no fabricated EP-transformer hardware exists anywhere — the demo is genuinely novel.** [high] + +**BOTTOM LINE FOR THE PITCH:** lead with **Shanbhag** (his JSSC-2018 chip already proves analog-MVM + on-chip-training in one die — the nearest substrate; we add relaxation + EP) + **Wenjuan Zhu** (trainable device) + **Hanumolu** (converter glue) = a complete in-house UIUC-ECE stack. Stanford **Wong** as the RRAM-device escalation (via the student). Industry (TetraMem/Mythic) only useful for the fixed-weight Phase-2 forward path. Nobody has built EP-transformer hardware → first-mover. 
+ +### Hanumolu profile (targeted, 2026-06-21) +**Pavan Kumar Hanumolu** — Seendripu Family Professor, UIUC ECE (since 2013; prior Oregon State); member of CSL's Integrated Circuits & Systems Group. "Top-five mixed-signal IC researchers worldwide," NSF CAREER 2010, heavy JSSC/ISSCC record. Work: **energy-efficient analog/mixed-signal — time-based ADCs, continuous-time filters, ultralow-jitter clocking/PLLs, high-speed serial links, switched-cap, DC-DC power conversion.** ⇒ exactly the **converter + analog-integrator + feedback-loop** layer the EP control loop needs (ADC/DAC glue to read settled states + apply the nudge; switched-cap integrators ARE relaxation-dynamics primitives). Note: his published silicon is converters/links/clocking, NOT CIM — he's the glue/control-loop partner, not the MVM substrate. (Also co-founded Omni Design Technologies — converter IP.) [sources: ece.illinois.edu/.../hanumolu, icsg.csl.illinois.edu] + +### GAP 3 — endurance budget: the make-or-break number, and it CLEARS the bar +The feasibility question: an EP run does ~tens-of-thousands of update STEPS; with digital-accumulate-then-threshold-program, physical device writes are FEWER than steps. How many cycles do devices survive? +- **HfOx RRAM: up to ~10^10 cycles** endurance (best-in-class metal-oxide). [arxiv 1909.01771, IOP 10.1088/1361-6641/abf29d] +- AlOx / weaker oxides: only ~10^4 — material choice matters a lot. +- **Budget check**: MNIST-class training writes ~10^4 cycles; gradient training can scale to **~10^8** cycles. ⇒ **HfOx (10^10) has ~100× headroom even over a 10^8-write run** — endurance is NOT a blocker IF you use HfOx-class RRAM + the threshold-accumulate scheme (which cuts writes below step-count). [web-search snippets, med-high] +- **Device-nudge insight**: an EP/Coupled-Learning *nudge* changes resistance far less than a full state write, so per-nudge endurance is plausibly >> rated full-write endurance (needs empirical confirmation, but favorable). 
+- **ECRAM (electrochemical RAM)** = the symmetric/linear-analog-update + high-endurance technology specifically aimed at in-situ training: "open-loop analog programmable electrochemical memory array" (Nature Comms 2023, s41467-023-41958-4) — but **research-only** (not commercially available; lab/foundry fab). It's the device-physics frontier Wenjuan Zhu / Wong-type collaborators work in. +- **VERDICT: endurance is survivable** with HfOx-class RRAM (10^10) + threshold-program; ECRAM is the better-but-research-only upgrade. The make-or-break risk is NOT endurance — it's **update linearity/symmetry + device variation** (the asymmetric-nonlinear-update problem), which the digital-accumulate scheme + compensation (stochastic rounding) mitigates. [from earlier run + this] diff --git a/docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md b/docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md new file mode 100644 index 0000000..46f79e3 --- /dev/null +++ b/docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md @@ -0,0 +1,65 @@ +# Deep-reasoning brief: the physics of backprop-free equilibrium learning on non-conservative systems +*Self-contained briefing for a deep-reasoning consult. Goal: identify whether the central obstruction is fundamental, and what the physically-realizable escape routes are. 2026-06-21.* + +--- + +## 0. One-paragraph setup +We train a **transformer as a physical equilibrium system** — a fixed point `z*` of a damped relaxation `ż = F(z)`, `F(z) = x_in − (1+c)z + Attn(LN z) + FFN(LN z)` (`c>0` damping) — **without backpropagation**, using **Equilibrium Propagation (EP)**: a free relaxation to `z*`, a nudged relaxation with the output perturbed by `β ∂L/∂z`, and a **local** weight update from the contrast of the two settled states. The motivation is **analog hardware**: the relaxation is physical settling (≈ free), and the update is local (no backward pass). On such hardware **BPTT is physically impossible** (no backward-through-time). 
The problem: EP trains cleanly until attention becomes expressive, then the gradient estimate collapses and training diverges — while exact BPTT with the identical model sails to a good solution. + +## 1. What is rigorously established (the constraints any answer must respect) +- **EP ≡ BPTT only under** (Ernoult 2019): **H1** symmetric Jacobian `∂F/∂z = (∂F/∂z)ᵀ` (i.e. F is a gradient of a scalar — *conservative*); **H2** free phase *exactly* at the fixed point; **H3** `β→0`. BPTT needs **none** of these — it differentiates the actual finite computation graph. +- **Softmax self-attention is non-conservative**: independent Q/K (and untied FFN) ⇒ `J ≠ Jᵀ`. We measure the non-normality `‖A‖/‖S‖`-scale ≈ **1.4** (S, A = symmetric/antisymmetric parts of J). +- The standard fix, the **AEP correction** (inject `−2A(x−x*)` in the nudged phase to make the nudged Jacobian `J→Jᵀ`), is **first-order in `S⁻¹A`** — its error is a Neumann series `Σ (S⁻¹A)^{2n+1}` that **converges only if `ρ(S⁻¹A) < 1`**. At our `‖S⁻¹A‖ ≈ 1.4` it is *out of warranty*. +- The **good solutions are weakly contractive** — a BPTT model at good loss has finite-time Lyapunov ≈ **−0.0347** (barely inside the stable disk). The instability is **non-normal transient growth**, not an eigenvalue crossing (spectral radius is a misleading stability signal here). +- Empirically the EP↔BPTT gradient gap `M = g_BPTT − g_EP` reaches **~84 % of `‖g_BPTT‖`** just before divergence; EP retains the *direction* (cos 0.91) but loses the *magnitude/structure*. EP gradient ≈ BPTT only when the free-phase residual `res ≲ 1e-4`; at `res ≈ 1e-2` it is noise. +- The damping `−(1+c)z` is what makes `S` well-conditioned at a *healthy* fixed point (`S ≈ −(1+c)I` dominant ⇒ `S⁻¹A ≈ −A/(1+c)`, small); `ρ(S⁻¹A)` blows up only *as contraction is lost* (S → singular). So large `ρ(S⁻¹A)` is partly a *symptom* of lost contraction, not purely intrinsic asymmetry. **(This is unresolved — see Q2.)** + +## 2. 
The central tension (the thing to crack) +**EP's gradient fidelity requires near-conservativity (`‖S⁻¹A‖ ≪ 1`) and tight convergence; the expressive power of attention comes precisely from its non-conservativity (`‖S⁻¹A‖ ≈ 1.4`).** BPTT escapes because it needs neither H1 nor H2 — but BPTT is not physically realizable on analog relaxation hardware. So the question is whether a **local, physically-realizable, backprop-free** rule can get the correct gradient for a **non-conservative equilibrium**, or whether there is a fundamental obstruction. + +--- + +## QUESTIONS FOR DEEP REASONING + +### Q1 — Is the obstruction FUNDAMENTAL or an estimator artifact? (the no-go question) +Is there a **theorem** of the form *"any local learning rule that reads only the settled states of a physical relaxation can recover the true loss-gradient of a non-conservative equilibrium system only to fidelity bounded by a function of the operator's non-normality `‖S⁻¹A‖`"*? Or is the `ρ(S⁻¹A)<1` limit specific to the *first-order* AEP construction, with a different (possibly non-perturbative) local estimator that is exact at any asymmetry? +- Concretely: the true adjoint needs `J⁻ᵀ`; EP's nudged relaxation naturally produces `J⁻¹`. For **symmetric** J these coincide (why EP works on energy-based systems). Is the gap `J⁻¹ − J⁻ᵀ` *fundamentally* inaccessible to any forward-only physical relaxation, or can a cleverly-coupled relaxation produce `J⁻ᵀ`? +- Relevant: implicit-function-theorem / adjoint-method structure; whether the adjoint of a *dissipative non-reciprocal* system is itself the steady state of *some* physically-runnable dynamics. + +### Q2 — Decouple non-normality from lost-contraction. Is the GOOD solution actually high-`ρ(S⁻¹A)`? +We suspect `ρ(S⁻¹A)` is large at the *divergence edge* mostly because contraction is lost (`S` near-singular), not because the *good* solution is intrinsically that asymmetric. 
At a healthy fixed point the damping makes `S ≈ −(1+c)I`, so `ρ(S⁻¹A) ≈ ‖A‖/(1+c)`. +- Does the **expressivity of attention** require large **`‖A‖`** (antisymmetric coupling) per se, or only large **`‖J‖`** (which the damping + a larger `c` could offset while keeping `‖S⁻¹A‖<1`)? I.e. **can one have an expressive attention operator that is strongly damped (large `1+c`) so that `S⁻¹A` stays sub-1 — buying EP-validity by *scaling*, not by killing asymmetry?** What is the trade between damping `c`, expressivity, and `ρ(S⁻¹A)`? +- Is there a reparameterization (e.g. normalizing the value/output path, or `c` per-channel) that holds `ρ(S⁻¹A)<1` *uniformly over training* without an explicit penalty? + +### Q3 — Can the PHYSICAL hardware compute its own adjoint (transpose dynamics)? +EP works on conservative systems because a **reciprocal** (symmetric-coupling) physical network has `J=Jᵀ`, so the same physical relaxation computes both the forward state and the adjoint. Our attention is **non-reciprocal** (independent Q/K = directed coupling). +- Is there a **physical mechanism** to run the *transpose/adjoint* dynamics of a non-reciprocal dissipative network on the same substrate — e.g. **time-reversal / Hamiltonian-echo backpropagation** (López-Pastor & Marquardt 2023, for Hamiltonian systems), a physically-coupled "transpose twin" network, or an Onsager-reciprocity-breaking trick? Does HEB extend from *Hamiltonian* (lossless) to our *damped dissipative* setting, and at what cost? +- If a substrate can physically realize **both** `J` and `Jᵀ` relaxations, EP/AEP becomes *exact* for non-conservative systems with **no** perturbative `ρ(S⁻¹A)<1` restriction. Is that physically buildable (analog crossbar / photonic / spintronic), and what does the transpose operation cost? + +### Q4 — Is "equilibrium" even the right primitive? 
Finite-time vs fixed-point computation +BPTT's stabilizer is exactly the **finite-horizon truncation gradient** (it differentiates the *actual* 150-step settling, not the `T→∞` fixed point). The good BPTT solution rides a **weakly-contractive / not-fully-settled** state — arguably a *finite-time transient*, not a true equilibrium. +- Should the physical computation be defined at a **fixed finite settling time** `T` (a "finite-time fixed point") rather than `T→∞`? Is there a **local** learning rule whose target is the finite-time output — i.e. a physically-realizable analog of BPTT's finite-horizon term — without an actual backward pass? (e.g. forward-mode/RTRL-like physical accumulation, two-timescale dynamics, or an auxiliary physical variable that accumulates the transient sensitivity.) +- More radically: should we **embrace non-equilibrium computation** (transient / limit-cycle / wave dynamics) as the primitive and seek a local rule for *that* (reservoir-computing, oscillatory/Ising-machine, physical-learning lineages), rather than forcing a fixed-point interpretation onto a computation that is natively transient? + +### Q5 — Non-normality as physics: does it have to hurt, and can it be bounded for free? +The failure is **non-normal transient amplification** — a stable operator whose perturbations grow transiently before decaying. In a real relaxation circuit this should be a *physical* phenomenon (ringing / overshoot before settling). +- What is the **physically natural regularizer** of non-normality? Is there a passivity / dissipativity / **contraction-in-a-metric (log-norm `μ_P(J)<0`)** structure that a physical substrate enforces *automatically* (e.g. a passive RC network is contractive by construction), and does building attention as a *passive* element bound `‖S⁻¹A‖` without a learned penalty? +- Is there an expressivity argument that **bounded non-normality is sufficient** for language modeling — i.e. 
that the useful computations of attention live at modest `‖S⁻¹A‖` and the divergence-causing high non-normality is *gratuitous* (an artifact of unconstrained training), not *functional*? + +### Q6 — The conservative-attention escape: fundamental cost or engineering cost? +Energy-based attention exists — **modern Hopfield networks / Energy Transformer (Krotov), tied-value LSE-energy attention** — which are conservative by construction (`H1` holds, EP is *exact*, runs natively on analog equilibrium hardware). We measure it costs ~0.15–0.2 CE vs free (non-conservative) attention. +- Is the expressivity loss of conservative/energy attention **fundamental** (information-theoretic: a symmetric-coupling associative memory provably cannot represent what directed Q≠K attention can) or **engineering** (current energy-attention forms are just suboptimal, and a better conservative parameterization closes the gap)? +- Is there a **partially-conservative** sweet spot: an attention with a *controlled, small* antisymmetric part (`‖S⁻¹A‖<1`, inside the first-order AEP warranty) that captures most of the expressivity — and what is the **Pareto frontier of [gradient fidelity ↔ asymmetry ↔ task loss]**? + +### Q7 — THE make-or-break for the hardware thesis +Granting that analog relaxation hardware is **structurally confined to (near-)conservative, contractive, equilibrium-based local learning**: **is a "readable" small language model (≈10–50M params, coherent text) achievable inside that regime?** Or does coherent language *require* the high non-normality / finite-time-transient expressivity that provably breaks local equilibrium learning? +- If achievable: what is the *minimum* non-normality / contraction margin a coherent-language equilibrium model needs, and does it fit inside `‖S⁻¹A‖<1` (+ comfortable contraction)? 
+- If not: what is the realistic *ceiling* of analog-EP-trainable models, and is there a hybrid (mostly-conservative core + a thin digitally-assisted non-conservative correction) that lifts it? + +--- + +## What an actionable answer looks like +For each: (a) is it a **fundamental obstruction** (cite/sketch the no-go) or an **engineering gap** (sketch the construction)? (b) the **physical realizability** verdict (can it run on a relaxation substrate with local updates, no backward pass?) (c) the **cheapest experiment** that would falsify the proposed mechanism on our simulator (we can measure `ρ(S⁻¹A)`, contraction margin, gradient cos vs a non-conservative knob, energy-vs-free-attention loss, damping-`c` sweeps). We are not looking for "use BPTT" or "use a digital solver" — both are physically excluded. We want the **physics of a local, forward-only, non-conservative equilibrium learning rule**, or the proof that none can exist beyond bounded asymmetry. + +## Key references the reasoner should engage +Scellier–Bengio 2017 (EP / energy); Ernoult et al. 2019 (EP≡BPTT conditions); Scurria et al. 2026 (AsymEP, the `J→Jᵀ` first-order correction); Laborieux–Zenke 2023 (Jacobian homeostasis, the `S⁻¹A` bias + functional-symmetry fix); López-Pastor–Marquardt 2023 (Hamiltonian echo backprop = physical adjoint via time-reversal); Krotov / Hoover et al. 2023 (Energy Transformer); Winston–Kolter 2020 (monotone-operator equilibrium nets = contraction by construction); Stern–Murugan 2023 (physical learning review); Kendall et al. 2020 (EP on analog resistor networks). diff --git a/docs/hardware/SCALING_AND_HARDWARE_PLAN.md b/docs/hardware/SCALING_AND_HARDWARE_PLAN.md new file mode 100644 index 0000000..8e93a40 --- /dev/null +++ b/docs/hardware/SCALING_AND_HARDWARE_PLAN.md @@ -0,0 +1,58 @@ +# EP scaling-to-1B sim cost + modular analog-hardware plan (2026-06-21) + +## PART 1 — 1B simulation cost (H200) + +Anchor: your 1B **BP** run ≈ 36h × 8 H200 = **288 H200-hours**. 
+ +Cost driver = the equilibrium block relaxes ~150–300 steps (T1=150 + refine to t1max=300) — this is the +WHOLE cost, ~width-independent, so the per-step factor measured at C512 holds at 1B: +- EP/BP per-step FLOPs ≈ **~50–100×** (relaxation is forward-only, 2ND each; BP is 6ND). Measured: C512 EP 0.23 it/s + vs a depth-1 C512 BP ~tens it/s ⇒ ~65–130×. **EP/BPTT ≈ only 1.5×** — the cost is the equilibrium, not EP. +- Reducers: bf16 + torch.compile + the compiled free-phase fast-path (blk._cstep) ≈ 2–4×; refine exits early + while contractive (×~0.6); EP's low memory (no unrolled graph) packs bigger batches than BPTT. Net effective **~20–50×**. + +**Estimate: 1B EP ≈ 20–50 × 288 ≈ 6,000–15,000 H200-hours** ≈ $18k–60k @ ~$3–4/H200-hr (or burn AWS credits). +Wall-clock: ~30–75 days on 8×H200, or **~4–9 days on 64×H200** (same H200-hours — scale OUT for wall-clock, cost is fixed). + +**Recommendation (don't burn 10k H200-hr blind):** +1. Ship the **speed package (task #14)** FIRST — directly cuts the bill 2–4×. +2. Validate the recipe + MEASURE the real EP/BP factor on a **~100–300M rung** (scaling-law dossier) before committing 1B. +3. Then the 1B run. Use EP's memory advantage to pack batch; data-parallel across many H200s for wall-clock. + +## PART 2 — Modular analog-hardware plan (tens-of-M, COTS, NO custom fab) + +**Principle that makes this cheap:** the damped equilibrium block is a *physical settling system* (monDEQ ≈ passive +resistor/op-amp circuit, Chaffey 2025). The free-phase relaxation = the hardware SETTLING — so the ~100× sim cost becomes +**~free physical settle**, and EP's update is **local** (no backprop). The algorithm is DESIGNED for this substrate; the IP +is substrate-agnostic, so demo on whatever COTS analog you can buy — never fab. 
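The PART 1 estimate can be re-derived with a few lines of arithmetic. The sketch below is only a back-of-envelope check: every input is copied from the text above (the 288 H200-hour BP anchor, the net 20–50× factor, the $3–4 per H200-hour price, and the 2ND vs 6ND FLOP counts), while the names `raw_flop_ratio` and `ep_budget` are hypothetical helpers introduced here for illustration.

```python
# Back-of-envelope reproduction of the PART 1 numbers.
# All inputs are copied from the text; the function names are illustrative only.
BP_H200_HOURS = 36 * 8          # anchor: 36 h x 8 H200 = 288 H200-hours
EP_OVER_BP = (20, 50)           # net effective EP/BP cost factor after the reducers
USD_PER_H200_HOUR = (3.0, 4.0)  # price range quoted in the text


def raw_flop_ratio(relax_steps):
    """Raw EP/BP per-step FLOP ratio: relax_steps forward passes (2ND each) vs one BP step (6ND)."""
    return relax_steps * 2 / 6


def ep_budget(n_gpus):
    """Return (low, high) ranges for H200-hours, dollars, and wall-clock days on n_gpus GPUs."""
    lo, hi = (factor * BP_H200_HOURS for factor in EP_OVER_BP)
    return {
        "h200_hours": (lo, hi),
        "usd": (lo * USD_PER_H200_HOUR[0], hi * USD_PER_H200_HOUR[1]),
        "wall_clock_days": (lo / n_gpus / 24, hi / n_gpus / 24),
    }


print("raw EP/BP FLOP ratio:", raw_flop_ratio(150), "to", raw_flop_ratio(300))
for gpus in (8, 64):
    print(gpus, "x H200:", ep_budget(gpus))
```

The unrounded range is 5,760–14,400 H200-hours, which the text rounds to 6,000–15,000. Total H200-hours do not depend on the GPU count; only the wall-clock figure does, which is the "scale OUT for wall-clock, cost is fixed" point.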
+ +**Decompose the block → what each part needs:** +| part | hardware | note | +|---|---|---| +| linear maps WQ/K/V/O, fc, pj | analog MVM (crossbar) | bulk of params + compute | +| relaxation z←z+εF(z) | physical feedback (RC + the −c·z damping resistor) | THE win — free settle | +| nonlinearities: softmax, GELU, LN | the analog-HARD part | do mixed-signal in FPGA, OR use energy/Hopfield attention (analog-native) | +| EP update (local) | needs **updatable** analog weights | the key constraint | + +**Substrate options (COTS/existing; key tradeoff = weight updatability — VERIFY current availability/specs):** +- **(A) Memristor/ReRAM crossbar eval kits** — in-situ updatable ⇒ cleanest for EP TRAINING; small/research-grade (TetraMem, CrossBar, Knowm, academic arrays). +- **(B) Analog-inference compute modules** (~tens-of-M weights, e.g. Mythic-class) — large MVM but FIXED/re-flash weights ⇒ mixed-signal EP (analog forward, digital re-program for the update). Matches "tens of M" in MVM size. +- **(C) FPAA + discrete op-amps** (Anadigm / GT RASP) — fully-analog SMALL block, true physical settle, fully programmable. + +**Recommended architecture = mixed-signal, stitched (no fab):** +- analog MVM substrate (A for trainable, or B for scale) does the linear F(z) + the relaxation feedback (the energy win); +- a COTS **FPGA / RFSoC** (fast ADC/DAC) does the nonlinearities (softmax/GELU/LN) + the EP control loop + (drive settle → apply nudge βg → settle → measure contrast → compute + apply local Δθ); +- ADC/DAC glue between them. All COTS modules + a PCB. The "out-of-the-box" stitch. + +**Out-of-the-box levers:** +- **Softmax is the analog-hard piece** → for the HARDWARE demo use the energy/Hopfield (LSE, tied-value) attention + variant (analog-native, conservative) even if the sim keeps softmax; or keep softmax in the FPGA (small fraction of compute). 
+- The code's **`--fnoise` optics-noise model already exists** → simulate analog non-idealities (device noise, quantization, + variation) IN the 1B sim to de-risk the hardware before buying anything. +- **Stage it:** Phase-1 = ONE small block on FPAA/discrete + FPGA, prove the EP analog loop trains a toy task end-to-end; + Phase-2 = scale the crossbar to tens-of-M. Tens-of-M is the Phase-2 target, not the first build. + +**Next step:** a deep-research pass on "2026 COTS / eval-board analog-compute substrates with in-situ weight update for +equilibrium-network training" to pin the specific modules (memristor kits, Mythic-class availability, FPAA, RFSoC) — the +specific availability is the thing to verify, not guess. diff --git a/docs/method/ARCHITECTURE.md b/docs/method/ARCHITECTURE.md new file mode 100644 index 0000000..213b192 --- /dev/null +++ b/docs/method/ARCHITECTURE.md @@ -0,0 +1,117 @@ +# Architecture: EP/AEP-trained Equilibrium Transformer — implementation details + +Based on the actual implementation in `/tmp/lt_ep/lt_ep_train.py` (`EQBlock` + `ep_step`). + +## 0. Overview — one unified force, one relaxation + +The whole block is **one dynamical system**: token state `z ∈ R^{B,T,C}` relaxes to a fixed +point under a **single force** that bundles input-clamp + FFN + attention: + +``` +F(z) = − ∂/∂z [ ½‖z − x_in‖² + E_mem(z) ] + s·( Attn(z) − c·z ) + └────── conservative part (symmetric Jacobian) ──┘ └── non-conservative (non-reciprocal) ──┘ + input clamp Hopfield memory = FFN causal self-attention + damping +``` +- `x_in = tok[idx] + pos` — input embedding, clamped as a boundary condition. +- **Forward = relaxation**: `z ← z + ε·F(z)` for T1 steps → fixed point `z*`; read out `logits = z*·W_h`. +- Conservative (FFN/clamp) and non-conservative (attention) live in **one force, one relaxation** — the basis of "unified" training. + +--- + +## 1. 
FFN — standard EP (the clean part) + +The FFN is rewritten as a **modern Hopfield memory energy**: +``` +E_mem(z) = − Σ_{token, m} relu( z · W_m )²_m # W_m ∈ R^{C×M}, M memories +``` +Its force `−∂E_mem/∂z = 2·W_m·[relu(zW_m) ⊙ 1_{zW_m>0}]` is exactly a **tied-weight 2-layer MLP +(W_in=W_out=W_m) with squared-ReLU** = the FFN. + +- **Conservative** (scalar energy, symmetric Jacobian) → **standard EP is exact, no correction**. +- Verified: FFN-param gradient cosine vs backprop = **1.000** (`lt_ep_ffn.py`). +- This is textbook EP / Hopfield — already demonstrated on memristor crossbars in the literature. + +--- + +## 2. Attention — how it is "EP-ified" (the novel part) + +**Step A — rewrite attention as a FORCE** (not a feedforward layer): tokens relax under it. +``` +Attn(z) = [ softmax( Q(z)K(z)ᵀ/√d , causal ) · V(z) ] · W_O + Q=zW_Q, K=zW_K, V=zW_V (independent projections ⇒ NON-reciprocal: i→j ≠ j→i) +force term = s·( Attn(z) − c·z ) # −c·z damping ⇒ contraction ⇒ a stable fixed point exists +``` + +**Step B — fix the non-reciprocity bias (AEP correction).** Because Q≠K and V is independent, +the attention Jacobian `J_attn` is **asymmetric** — it is not the gradient of any scalar. Plain EP's +nudged phase relaxes under `J`, but the correct adjoint needs `Jᵀ`, so plain EP gives a **biased** +gradient. AEP adds, in the nudged phase: +``` +v = z − z* # deviation from the free equilibrium +corr = s·( J_attn·v − J_attnᵀ·v ) # = 2·A_J·v , A_J = (J − Jᵀ)/2 (antisymmetric part) + J_attn·v = jvp(Attn, z*, v) # forward-mode (one forward probe) + J_attnᵀ·v = vjp(Attn, z*, v) # reverse-mode (one backward probe) +f_nudged = F(z) − sign·β·∂C/∂z − clip(corr) +``` +Effect: the attention part of the nudged linearization becomes `s·J·v − s·(J−Jᵀ)v = s·Jᵀ·v` +— i.e. **J is turned into Jᵀ**, giving the correct adjoint. +- The damping `−c·z` is **symmetric** (Jacobian −cI) ⇒ cancels in `J−Jᵀ` ⇒ the correction only + sees attention's **non-reciprocal** part. 
+- Verified: attention-param gradient cosine vs backprop = **0.99–1.0** (plain EP: 0.25–0.78). +- Hardware note: `jvp/vjp` = two probe directions; **non-reciprocal coupling is exactly what real + analog devices have** — AEP removes the usual "symmetric weights" requirement of EP hardware. + +--- + +## 3. End-to-end unified training + +**One relaxation, one estimator, trains the whole block.** Key fact: + +> The antisymmetric Jacobian of the **full** force, `A_J`, equals the antisymmetric part of +> (conservative + attention) = **attention's antisymmetric part alone** (the conservative FFN/clamp +> have symmetric Jacobians → contribute 0 to `A_J`). + +So **the AEP correction needs to act on the attention term only**; the FFN/clamp ride along in the +conservative part and are trained correctly by standard EP — **one relaxation, one correction, trains +everything.** + +**Training step (`ep_step`) = 3 phases + 1 local update:** +``` +① free phase: z* = relax(F, x_in, T1) # to the fixed point +② nudged ±: z₊ = relax( F − β·∂C/∂z − corr , from z*, T2 ) # +β + z₋ = relax( F + β·∂C/∂z − corr , from z*, T2 ) # −β (centered EP) +③ adjoint: a = (z₋ − z₊) / (2β) # read from the two nudged equilibria +④ local update: + • equilibrium params (W_Q,W_K,W_V,W_O, W_m, embeddings tok/pos): + ∇_θ L = ⟨ a , ∂F/∂θ(z*) ⟩ # vector-field EP estimator — one formula for attn+FFN+embed + (code: autograd.grad( (a·F(z*,θ)).sum(), θ ) , θ live, z* fixed) + • readout W_h (outside the equilibrium): direct local gradient ∂C/∂W_h |_{z*} +``` +- Why attention (non-conservative) and FFN (conservative) train under the *same* estimator: + `⟨a, ∂F/∂θ⟩` is uniform over all equilibrium params; the AEP correction only modifies `a` (making it + the correct adjoint for the non-reciprocal system); the FFN's `∂F/∂θ` is already correct. +- Embeddings `tok/pos` enter the force through the clamp `½‖z−x_in‖²` (`∂F/∂x_in = +I`), so the same + `⟨a, ∂F/∂θ⟩` yields their gradient. 
+- **Stability (feedback regulation, from FRE-RNN 2508.11659):** each step measure the free-phase + residual `res = ‖relax(z*,1)−z*‖/‖z*‖`; if `res` is too large, **increase damping `c`** (lower the + spectral radius → keep converging); if very small, relax `c`. This maintains the "free phase has + converged" condition (Ernoult 2019: EP ≡ BPTT in the β→0, converged limit) throughout training. + +**Measured (this implementation):** end-to-end EP trains a char-LM to **val CE 2.95** (random 4.17, +backprop on the same architecture 2.10), with **zero non-finite steps** under feedback regulation. + +--- + +## One-line summary +> **One energy/force, one relaxation, one local estimator.** FFN = conservative Hopfield energy → +> *standard EP* (exact). Attention = a *non-reciprocal force*; AEP turns the nudged-phase `J` into `Jᵀ` +> via two probes (jvp/vjp) → exact gradient. Since the full force's antisymmetric part comes only from +> attention, **one AEP correction + standard EP train the whole block end-to-end**; the readout trains +> directly on the cost; damping + feedback-regulation keep the system convergent. + +## Hardware-relevant primitives +- **local, no weight transport**: every weight updates from locally-available equilibrium states. +- **compute = relaxation to a fixed point**: maps to oscillators / memristor crossbars / optics / Ising. +- **two phases, same circuit**: free + nudged differ only by a small output nudge β. +- **non-reciprocal coupling OK (a feature)**: AEP handles asymmetric `J`; `jvp`/`vjp` = two probe directions. +- **dissipation `c` is a physical knob**: feedback-regulated to keep the system in the convergent regime. 
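The `J → Jᵀ` claim from Section 2 can be checked on a toy problem. The sketch below is an assumption-laden illustration, not the `ep_step` implementation: it uses a small *linear* force `F(z) = J z + b` with an explicit asymmetric matrix `J` (so the jvp/vjp probes reduce to `J @ v` and `J.T @ v`), a quadratic cost, and plain Euler relaxation. All names and constants (`relax`, `adjoint_estimates`, the sizes, `beta`, `eps`) are hypothetical. It compares the centered contrast `a = (z₋ − z₊)/(2β)` with and without the antisymmetric correction against the exact adjoint `−J⁻ᵀ ∂C/∂z`.

```python
import numpy as np


def relax(force, z0, steps=600, eps=0.2):
    """Euler relaxation z <- z + eps * force(z), run for a fixed number of steps."""
    z = z0.copy()
    for _ in range(steps):
        z = z + eps * force(z)
    return z


def adjoint_estimates(n=6, beta=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    # Damped but asymmetric Jacobian: -2*I plays the role of the damping term.
    J = -2.0 * np.eye(n) + 0.3 * rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    y = rng.standard_normal(n)

    F = lambda z: J @ z + b        # toy linear force; its Jacobian is J everywhere
    grad_C = lambda z: z - y       # gradient of the cost C = 0.5 * ||z - y||^2

    z_star = relax(F, np.zeros(n))                 # free phase
    corr = lambda z: (J - J.T) @ (z - z_star)      # AEP term: (J - J^T) v = 2 * A_J * v

    def centered_a(use_corr):
        def nudged(sign):
            f = lambda z: F(z) - sign * beta * grad_C(z) - (corr(z) if use_corr else 0.0)
            return relax(f, z_star)
        z_plus, z_minus = nudged(+1.0), nudged(-1.0)
        return (z_minus - z_plus) / (2.0 * beta)

    a_true = -np.linalg.solve(J.T, grad_C(z_star))  # exact adjoint: -J^{-T} dC/dz
    return a_true, centered_a(True), centered_a(False)


def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


a_true, a_aep, a_plain = adjoint_estimates()
print("cos(AEP-corrected, exact):", cos(a_aep, a_true))
print("cos(plain EP, exact):     ", cos(a_plain, a_true))
```

In this linear setting the corrected nudged dynamics linearize to `Jᵀ v − β ∂C/∂z` up to an `O(β)` shift, so the centered contrast matches the exact adjoint up to `O(β²)`, whereas the uncorrected contrast converges to `−J⁻¹ ∂C/∂z` instead. The toy says nothing about the nonlinear attention force, where the correction is only a linearization at `z*`.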
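diff --git a/docs/method/EP_DERIVATION.md b/docs/method/EP_DERIVATION.md new file mode 100644 index 0000000..9adcc38 --- /dev/null +++ b/docs/method/EP_DERIVATION.md @@ -0,0 +1,215 @@ +# From standard EP to the current version: the causal path of layer-by-layer modifications + +Companion documents: `METHODS.md` (the full method, organized by topic) / `FINDINGS.md` (project timeline). +This document is a **differential view**: starting from textbook EP, each layer of modification corresponds to **one implicit premise of standard EP that the +transformer breaks**, recording "what standard EP does → why it fails → how we change it → where the code lives". + +--- + +## 0. Standard EP (Scellier–Bengio 2017) and its four implicit premises + +Standard EP operates on an **energy function** `E(z, θ)`: + +- **Free phase**: dynamics `ż = −∇_z E`, relaxing to the energy minimum `z*` (the free fixed point). +- **Clamped phase**: add the loss to the energy with strength β, `E_β = E + β·ℓ(z)`, and relax to `z_β`. +- **Gradient**: `∂L/∂θ ≈ (1/β)[ ∂E/∂θ(z_β) − ∂E/∂θ(z*) ]` (one-sided, bias O(β)). + +It works because it **assumes four premises by default**: + +| | Premise | Meaning | +|---|---|---| +| **A** | **Conservative / symmetric** | A scalar energy E exists, so the Jacobian `J = ∂F/∂z` is symmetric (`J = Jᵀ`). | +| **B** | **Free phase has converged** | The readout is taken at the true fixed point `z*`; residual ≈ 0. | +| **C** | **Small-β linear response + clean nudge** | β→0, so the clamping is only a perturbation; any clamp added to prevent divergence does not affect the estimate. | +| **D** | **A stable fixed point always exists during training** | After every weight update, the free phase still converges to a stable fixed point. | + +**The transformer breaks all four.** Every modification in the current version repairs one of them. Four parts follow. + +--- + +## Broken premise A (conservativity): force-form EP + AEP correction + +**What standard EP does**: start from an energy E, `F = −∇E`, and compute the gradient from `∂E/∂θ`. + +**Why it fails**: a transformer block has no energy. Attention with `Q≠K` (non-reciprocal coupling) and an untied FFN +(`W1≠W2ᵀ`) give `J ≠ Jᵀ`, so **no E can be written down**. Forcing the energy route (`energy` mode: tied-value +LSE energy) makes the block conservative, but at the cost of expressivity (measured: thick 1.95 vs energy/mono 2.11, a gap of 0.15–0.2 CE). + +**The fix (two steps)**: + +1. **Force-form EP (VF, vector-field readout)**: drop the energy, write the dynamics directly as a force `F(z)`, and relax + `ż=F(z)` to a fixed point. The gradient no longer uses `∂E/∂θ`; it uses the **vector-field readout** instead: + ``` + ∂L/∂θ ≈ ∂/∂θ ⟨a, F(z*; θ)⟩ a = contrast state (see broken premise C) + ``` + This step calls autograd only once, at the fixed point (term-by-term local bookkeeping, not BPTT). Attention, FFN, LN, and + embedding are all terms of the same F, so they are **trained jointly, with no per-module scheduling**. + ⚠️ **Correction (not our invention)**: force-form VF is an **existing method**; it is exactly the baseline in the AsymEP paper (arXiv:2602.03670). + Moreover, **VF used alone on a non-conservative system collapses** (in their results, VF = 10% on CIFAR-10, i.e. chance; MNIST 64% vs + AsymEP 92.7%), which matches our own measurement of "uncorrected attention cos≈0.25". So this step is **not our contribution**; + it is the "starting point that collapses". What actually repairs it is the antisymmetric correction in step 2 below (also from AsymEP). + +2. 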
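**AEP correction (non-conservative repair)**: in force form, the naive nudge linearizes around z* with J, but the correct
+   adjoint needs **Jᵀ**. Without the correction → attention gradient cos ≈ 0.25 (essentially wrong).
+   Correction: add `−(Jv − Jᵀv)` to the nudged force, `v = z − z*`; the matrix-free implementation is one jvp plus one vjp.
+   Effect: turns the nudged-phase Jacobian `J → Jᵀ` ⇒ a approximates the true adjoint response ⇒ **the `Q≠K` attention gradient
+   cos recovers to 0.99–1.0** (measured: attn 0.99 / ffn 1.00 / full block 0.99).
+   - Source: AEP, "EP for Non-Conservative Systems" (arXiv:2602.03670).
+   - Key property: the correction term is **linear in (z−z*) with real coefficients** → it does not break the analyticity of the holomorphic estimator used later.
+   - Note: the correction is **linearized at z***, so the nudged trajectory must stay inside the linear-response window (T2≈20 is inside;
+     T2≳150–300 escapes it, see the horizon limit in the hardware-twin paragraph).
+
+**Code**: `lt_ep_train.py::force` (thick force), the AEP block inside `nudge()` in `ep_step`,
+`nc_force` (the non-conservative part, used by AEP/jacreg). `--attn_mode thick`.
+
+---
+
+## Broken premise B (convergence): the residual as health signal + adaptive T1
+
+**What standard EP does**: fix T1, assume convergence, and read out directly at "z*".
+
+**Why it fails**: the EP estimator has a **validity threshold** (measured, not assumed): the gradient cos vs the exact reference degrades sharply with the free-phase
+relative residual `res = ‖z⁺−z*‖/‖z*‖`:
+```
+res ≈ 5e-5  → cos 0.85–0.88
+res ≈ 1e-3  → cos 0.2–0.9 (batch-dependent)
+res ≈ 3e-3  → cos ≈ 0–0.5
+res ≈ 1e-2  → noise
+```
+**BPTT has no such threshold** (it differentiates the actual finite computation and returns a self-consistent gradient whether or not it converged). This asymmetry,
+and nothing deeper, is the EP-specific difficulty.
+
+**Fix**:
+1. **Make res the primary health signal** (not the spectral radius; see broken premise D). Each step takes one extra relax step to measure res.
+2. **Adaptive T1**: after measuring res at the fixed T1=150, if it is still > `res_est` (default 1e-4), keep relaxing in chunks of 50 steps
+   until res ≤ threshold or the `t1max` cap is reached. **Compute buys tightness**.
+   - Important detail: the λ-controller's res signal is **still sampled at the fixed T1=150** (keeps the controller semantics unchanged and introduces no new
+     λ war); only the z* passed to the nudge is refined to be tighter.
+
+**Code**: the `res` computation at the top of `ep_step` + the while-refine block for `t1max/res_est`.
+`--t1max 500 --res_est 1e-4`.
+
+---
+
+## Broken premise C (small-β linear response + clean nudge): symmetric nudge + holomorphic estimator + adaptive T2
+
+**What standard EP does**: one-sided +β (bias O(β)), and, to keep the nudge from blowing up the relaxation, **clamps** inside the nudge
+(a hard projection of the output, `g.clamp(±2)`, and clipping of the AEP correction to `‖corr‖≤‖F‖`).
+
+**Why it fails**:
+- The O(β) bias of one-sided β forces β to be very small, while estimator noise ∝ (convergence error)/β, so small β amplifies noise.
+- **A clamp is a non-analytic hard projection**. Measured: at marginal residuals (res 1.6e-3) the clamp is the **dominant source of estimator error**:
+  plain EP cos 0.27, and 0.89 once the clamp is removed. The clamp was meant to protect early training, yet as soon as the residual drifts up mid-training it
+  **silently poisons every update**.
+
+**Fix (three steps)**:
+
+1. **Symmetric (two-sided) nudge**: `a = (z₋ − z₊)/(2β)`, centered ⇒ bias O(β²) (Laborieux 2021).
+
+2. 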
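**Holomorphic EP (clamp-free, complex-circle Cauchy readout)** (Laborieux–Zenke 2022): replace ±β by N points on the complex-plane circle
+   `|β|=r`, `β_k = r·e^{2πik/N}`, relax the **holomorphically extended** force, and read
+   ```
+   a = −Re[ (1/Nr) Σ_k e^{−iφ_k} (z_k − z*) ]        bias O(r^N)
+   ```
+   The bias drops from O(r²) to O(r^N) ⇒ r can be 5–10× larger (at equal bias), and the 1/β noise amplification falls by the same factor.
+   - Holomorphic extension of the force: hand-written complex LN (non-conjugate variance), softmax as an exp ratio, tanh-form GELU (entire function).
+   - **No clamp at all inside the nudge** (a clamp is non-analytic and destroys the O(r^N) order); monitor `max|z−z*|` instead.
+   - The AEP correction is linear with real coefficients ⇒ it preserves analyticity; apply it to the real and imaginary parts separately.
+   - Measured sweep: N (2…8) and r (0.02…0.2) are essentially flat ⇒ **neither finite-β bias nor 1/β noise is the
+     bottleneck at this scale**; the remaining ~0.12 misalignment is T2 truncation (next step).
+
+3. **Adaptive T2 (hindsight snapshot selection)**: T2 truncation is worth ~0.12 cos, but on slow-mixing batches long T2 diverges
+   (non-normal transient growth; **step-size-based early stopping fails**, since the transient falsely triggers it at t≈6–39). Solution: run in lockstep to
+   T2max, snapshot the contrast state `a_t` every K steps, **return the snapshot with the smallest increment (the most settled)**, and early-exit only on a clear blowup
+   (increment > 5× the running minimum). **The criterion is "the increment of the quantity of interest, a", not the step size** ⇒ transient growth is harmless.
+   Measured: never worse than fixed T2=20; mean cos 0.871 → 0.932.
+
+**Code**: `holo_ep.py`: `cln/csoftmax_masked/cgelu/cforce` (holomorphic force), `holo_a` (Cauchy readout),
+`holo_a_select` (adaptive T2), `holo_a_select2` (N=2 phase-batched fast path, numerically equivalent to select).
+The old clamps are the `g.clamp(±2)`/`corr` clipping inside `ep_step::nudge`; they are **superseded by the holomorphic route** (that is the legacy path).
+`--holo 2 --hr 0.2 --t2sel 120`.
+
+---
+
+## Broken premise D (a stable fixed point always exists during training): λ-controller + validity gate + abort fuse + architectural stabilizers
+
+**What standard EP does**: assume that after every update the free phase still converges to a stable fixed point.
+
+**Why it fails**: **training pushes the dynamics off the contractive manifold**. This is not EP-specific: measured, even bare BPTT (exact gradient)
+run to 14k walks off the contractive manifold (res→4.7e-2, best degrades to 2.021, worse than its 3k run). Once off it, the EP estimator
+enters the invalid region (the threshold of broken premise B), the update direction becomes wrong, and positive feedback pushes the weights further away.
+
+**Fix (four pieces, from soft to hard)**:
+
+1. 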
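**Residual-driven continuous λ-controller (soft Jacobian penalty)**: the core stabilizer.
+   - Penalty term: `λ‖J_nc(z*)‖²_F`, Hutchinson estimate (one jvp on a random probe, differentiated w.r.t. θ; Bai 2021).
+     Keeps the free phase contractive ⇒ keeps the estimator inside its validity region.
+   - Control law (per step): `λ ← clip( λ · (res_EMA / target)^0.3 , floor , max )`.
+   - **Why control res and not the spectral radius**: the block Jacobian is highly **non-normal**, and transient growth is invisible to eigenvalues
+     (measured: ρ(J)=0.94 "stable" while the relaxation diverged to res 0.21). The one-step residual **is** that transient, so control it.
+   - **EMA on the signal (0.9)**: the raw res is noisy, and a multiplicative controller on a noisy signal random-walks (measured: λ thrashing
+     between 0.5↔13); the thrashing λ itself then perturbs training. The EMA removes the thrash.
+   - **target ≈ 5e-4**: just inside the validity threshold (few×1e-4), with margin; not tighter (tighter wastes compute and hurts expressivity).
+     For reference: BPTT's own optima sit at res 1e-3–2e-2, so good solutions are only **mildly contractive**; we ask for slightly more than BPTT
+     because the **estimator** needs it.
+   - **The floor is load-bearing (it must not be annealed to 0)**: two independent experiments show that λ≲0.02 is fatal at any stage
+     (R2 with λ→0 from the start, and R6 with the λ-floor annealed with lr, both died the same death: val CE 60–77 with res≡0).
+     Post-mortem: this is **an explosion disguised as convergence by floating point**: ‖z*‖ and the uncapped parameters grow under temporarily invalid gradients until
+     `ε·F < ulp(z)`, the relaxation freezes (res=0 is caused by absorption), and the logits are huge and confidently wrong. The λ penalty's
+     θ-gradient touches fc/pj/LN/attn, and is exactly the mechanism that keeps this basin out of reach.
+2. **Validity gate**: when `res_used > res_gate`, the EP update is mathematically undefined ⇒ **apply only the
+   homeostat (jacreg) and skip the nudge entirely** (a fast recovery step). At S1 scale it is measured to be load-bearing: three deaths before the gate,
+   alive after it. `--res_gate`.
+3. **Abort fuse**: `res > abort_res` (default 0.1) for **100 consecutive steps** ⇒ stop and keep the best ckpt.
+   A pure safety net that prevents burning machine time in the invalid region. `--abort_res 0.1`.
+4. 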
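**Architectural stabilizers (only needed as scale grows)**:
+   - **resinit** (ReZero/Fixup: WO and pj multiplied by 0.1): the block is a near-identity contraction at initialization, giving a stable start at large width. `--resinit 0.1`.
+   - **qknorm** (Qwen3 q/k RMSNorm): bounds the attention logits/Jacobian. `--qknorm`.
+   - **Damping −c·z**: creates/strengthens a fixed point for raw attention; required for thin/real. But in thick the LN is inside, so damping
+     actually raises the effective Jacobian (`J∝1/σ(z)`); thick therefore keeps c small (=1) and leaves stabilization to the λ penalty.
+   - **Weight-norm cap** (WQ/K/V/O/Wm/Wh/fc/pj renormed to 3× init): a blunt safety net for transients, rarely triggered when healthy.
+
+**Code**: the jacreg block + validity-gate branch of `ep_step`; the control law in the `main` training loop (`jr = min(jr_max,
+max(flo, jr*exp(0.3*log(rs/rtgt))))`), the `badct` fuse, cap renorm, and `resinit/qknorm` injection.
+
+---
+
+## Outer shell: engineering layers orthogonal to EP but required
+
+These are not "modifications of EP", but the current version depends on them to reach the current numbers:
+
+- **Readout head Wh**: uses its own local CE gradient `∂CE/∂Wh` (at the free z*), **not** passing through the dynamics.
+  This is standard practice in EP setups, not BP.
+- **Parameter EMA** (decay 0.999, evaluated alongside the raw weights): suppresses late-phase estimator-noise drift without touching the stability loop. `--pema`.
+- **Optimizer / schedule**: AdamW (lr 1e-3, wd 1e-4), warmup→cosine, grad-norm clip 5.0, skip non-finite steps.
+  - **Warmup is load-bearing for large models**: it lets the controller establish contraction before larger steps can kick the weights out of the basin. (The BP baseline is stable even without
+    warmup; this is one EP↔BP asymmetry.)
+- **lr calibration (k calibration)**: `k = |g_EP|/|g_BPTT|` per group, `lr_EP = lr_BPTT/k`. **But AdamW normalizes per coordinate
+  ⇒ it absorbs the scale of k ⇒ k has no effect under Adam**; k only matters under SGD/hardware (fixed gain lines).
+  - Theoretical basis: Ernoult 2019 shows that with a converged free phase and β→0, **EP ≡ BPTT** (stepwise), so the lr correspondence is
+    EP↔BPTT, **not** EP↔BP (BP and BPTT mean physically different things, so their lr should not correspond directly anyway).
+
+---
+
+## One-page cheat sheet: modification → broken premise → flag → measured effect
+
+| Modification | Broken standard premise | flag | Measured effect |
+|---|---|---|---|
+| Force-form VF *(existing, not ours)* | A conservative | `--attn_mode thick` | writes down energy-free EP, but **collapses when used alone** (cos≈0.25) |
+| AEP antisymmetric correction J→Jᵀ *(AsymEP's, not ours)* | A conservative | (built into thick) | attention gradient cos 0.25 → 0.99 |
+| ↳ Ours: matrix-free form + applied to attention + holomorphic combination + common-mode tracking | (scale/engineering) | jvp−vjp / `--holo` / `holo_a_track` | lets the two rows above run a transformer LM |
+| Symmetric nudge | C small β | (default) | bias O(β)→O(β²) |
+| Holomorphic + clamp-free | C clean nudge | `--holo 2 --hr 0.2` | marginal-residual cos 0.27 → 0.89; r can grow 10× |
+| Adaptive T2 selection | C linear response | `--t2sel 120` | mean cos 0.871 → 0.932; training −0.064 |
+| Residual as health signal | B convergence | (built in) | exposes the validity threshold res≲few×1e-4 |
+| Adaptive T1 | B convergence | `--t1max 500 --res_est 1e-4` | refines z* into the validity region, so long T2 pays off |
+| λ-controller (soft Jac penalty) | D stability | `--jacreg .. 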
--res_target 5e-4 --res_ema 0.9` | keeps contraction; the floor must not be annealed (otherwise fp-absorption explosion) |
+| Validity gate | D stability | `--res_gate` | S1: three deaths before the gate, alive after |
+| Abort fuse | D stability | `--abort_res 0.1` | stops after 100 consecutive steps with res>0.1, keeps best |
+| resinit / qknorm | D stability (large width) | `--resinit 0.1 --qknorm` | stable start at large C; bounds the attention Jacobian |
+
+**In one sentence**: standard EP assumes "conservative, converged, small-β and clean, always stable during training"; the transformer breaks all four.
+A is repaired with **force form + AEP** (turning J into Jᵀ); B with **the residual signal + adaptive T1** (getting into the validity region);
+C with **holomorphic clamp-free + adaptive T2** (clean estimate + neither truncated nor divergent); D with **the residual-driven λ-controller
+(load-bearing floor) + gate + fuse + architectural stabilizers** (pulling training back onto the contractive manifold). Of these, A's AEP and D's
+λ-controller are the two most substantive departures; the rest is mostly "cleaning up the estimator / keeping it in the validity region".
diff --git a/docs/method/METHODS.md b/docs/method/METHODS.md
new file mode 100644
index 0000000..2a7d255
--- /dev/null
+++ b/docs/method/METHODS.md
@@ -0,0 +1,576 @@
+# Methods — EP-Trained Equilibrium Transformer Language Model
+
+Complete technical notes for discussion: architecture, how attention/FFN are made EP-trainable,
+the training rule, every stabilizer/regularizer and its reason, the LM setting, validation
+methodology, results, and open problems. Code paths at the end. (Companion doc:
+`/home/yurenh2/ept/FINDINGS.md` — the project arc and findings log.)
+
+---
+
+## 1. Problem statement
+
+Train a transformer-class language model where **both attention and FFN learn without
+backpropagation through the computation** — using Equilibrium Propagation: two (or N) relaxations
+of the same dynamics plus a local contrast readout. The questions: (a) does it train at all,
+(b) what does it cost vs the exact gradient (BPTT on the same architecture) and vs a standard
+BP transformer at equal parameters, (c) what are the actual failure mechanisms.
+
+Headline result (all at 14k steps, fully-controlled comparison): best EP model reaches
+**val CE 1.676** (adaptive T1/T2, run R10). 
Like-for-like standard BP transformer (MLP=4 — the +same parameter shape as the thick block, see §2) reaches 1.610 ⇒ **total gap 0.066**, decomposed: +**architecture ≈ 0.025** (BPTT + the same stabilizer + same param-EMA on the identical block: +1.635) and **training rule ≈ 0.041** (EP 1.676 vs that control) — and EP beats the thin-matched +BP MLP=1 baseline (1.689). Unregularized BPTT *destabilizes* on long horizons (walks off the +contractive manifold, res→5e-2, best 2.021 — worse than its 3k run, 1.949): the stabilization +loop EP carries out of estimator necessity (residual-driven Jacobian-penalty controller) is what +the equilibrium architecture itself needs for long training. Random = ln 65 = 4.17. + +## 2. LM setting (data, embeddings, readout, evaluation) + +- **Corpus**: Shakespeare character-level (nanoGPT preprocessing): `train.bin`/`val.bin` uint16 + token streams + `meta.pkl`, vocab = 65 chars (~1.1 MB text). Local copy: + `/tmp/lt_ep/data/shakespeare_char`. +- **Batching**: random crops, B=32 sequences × T=64 context; next-char targets (shift by 1). +- **Embeddings**: learned token table `tok ∈ R^{65×128}` (init N(0, 0.02²)) + learned absolute + positional table `pos ∈ R^{64×128}`. Input injection `x_in = tok[idx] + pos`. **Embeddings are + trained by EP too** — they enter the force through the input-clamp term −(z − x_in), so the same + vector-field readout (Sec. 4) delivers their gradient. No pretrained embeddings, no BP path. +- **Readout head**: logits = z* Wh, `Wh ∈ R^{128×65}`, trained with its **own local CE gradient** + ∂CE/∂Wh at the free equilibrium (loss-adjacent layer — local learning suffices; this is standard + in EP setups and is not backprop through the dynamics). +- **Objective / eval**: mean next-token cross-entropy over all B·T positions; val CE = average over + 8 fresh validation batches, computed by running the same free-phase relaxation (T1 steps) used in + training — i.e. the eval graph equals the inference graph. 
Random baseline ln(65) = 4.174. +- **Model size**: C=128, H=4 heads, single equilibrium block (weight-tied recurrence ⇒ effective + depth = T1). Parameter matching to the BP baseline depends on the variant: the **thin** block's + Hopfield memory (Wm: 128×256 ≈ 33k) matches BP **MLP=1** (2C² ≈ 33k); the **thick** block's + untied FFN (fc+pj = 2·4C² ≈ 131k) matches BP **MLP=4** (131k) — so thick-block results compare + against MLP=4 (1.610), thin-block results against MLP=1 (1.689). + +## 3. Architecture: the equilibrium block + +State `z ∈ R^{B×T×C}` (one vector per token position). Dynamics ż = F(z); inference = relax to +fixed point z* (Euler: z ← z + ε F(z), ε=0.1, T1=150 steps), predict from z*. + +We built four force variants (all share the input clamp and the readout): + +### 3.1 `thick` — DEQ-transformer block (the winner) + +``` +F(z) = −(z − x_in) input clamp (leak toward embedding) + + Attn(LN1(z)) causal multi-head softmax attention, separate WQ WK WV WO + + W2 · GELU(W1 · LN2(z) + b1) + b2 untied 4× FFN (W1: C→4C, W2: 4C→C) + − c·z damping (c=1–2) +``` + +LN1/LN2 carry learned affine (g, b). This is exactly a pre-LN transformer block written as a +*force* instead of a layer stack — same form DEQ uses as its fixed-point map. It is strongly +**non-conservative**: no scalar energy has this gradient (Q≠K asymmetric coupling, untied FFN). +EP is made exact for it via the AEP correction (Sec. 4.2). + +### 3.2 `real`/thin — Hopfield-FFN + damped real attention + +``` +F(z) = −∇_z [ ½‖z − x_in‖² + E_mem(z) ] + s·(Attn(z) − c·z) +E_mem(z) = −Σ relu(z Wm)² modern-Hopfield / dense associative memory, Wm: C×256 +``` + +The FFN here is **energy-based**: E_mem is the dense-associative-memory energy; its force +2·relu(zWm)Wmᵀ is a one-hidden-layer FFN with *tied* weights (Wm in, Wmᵀ out). Attention remains a +raw non-conservative force with damping. This was our first stable trainable variant. 
+ +### 3.3 `energy` — fully conservative attention (CET-style) + +Attention folded into the energy: `E_att = −(1/γ) Σ_heads,i LSE_j(γ q_i·k_j)` (causal-masked, +**tied value** — the force of this energy mixes values v=k), plus `½c‖z‖²` confinement because +E_att+E_mem alone are unbounded below. F = −∇E exactly ⇒ classic EP applies with no correction. +This is the CET (Høier/Kerjan/Scellier) route, which we reproduced separately on vision +(masked CelebA/CIFAR completion, EP ≈ TBPTE). Trade-off: tied value + reciprocal coupling = the +least expressive attention. + +### 3.4 `mono` — monDEQ-structured contraction + +`F(z) = −(m·z + z PᵀP) + z(Q−Qᵀ)ᵀ + x_in − ∇E_mem + s·Attn(z)` — the linear part is a monotone +operator by construction (Winston–Kolter): symmetric part ⪯ −m·I guaranteed, antisymmetric part +(Q−Qᵀ) free (non-reciprocal coupling at no stability cost). Guaranteed unique fixed point for the +linear core; softmax attention sits on top with gain s. Ablation for "how much does guaranteed +contraction cost": BPTT-mono = 2.11 vs BPTT-thick 1.95. + +### How attention is "EP-ified": the two routes, explicitly + +1. **Energy route** (3.3): make attention conservative (tied value, LSE energy) so F = −∇E and + vanilla EP is valid. Cost: expressivity (Q≠K asymmetry and free value are what make attention + attention). +2. **Force route** (3.1, 3.2 — ours): keep real attention as a non-conservative force and repair + the *estimator* instead, with the AEP correction (Sec 4.2) which restores exact gradients for + non-reciprocal couplings. Validated gradient cosine vs autograd: attention 0.99, FFN 1.00, + full LM block 0.99 (and FA, for contrast, gives Q/K/V ≈ 0.25, upstream FFN ≈ −0.01). + +### How attention and FFN are trained *jointly* + +There is no per-module schedule or pipeline: attention, FFN, LN affines, and embeddings are all +terms of the **same force** F. 
One free relaxation + one nudged ensemble produce the contrast state +`a`; every parameter θ gets its gradient from the single vector-field formula ∇θ = ∂⟨a, F(z*,θ)⟩/∂θ +(Sec 4.1), which decomposes into purely **local** per-term updates (each force term touches only its +own parameters). The readout Wh is the only separately-trained parameter (local CE gradient). + +## 4. Training rule + +### 4.1 Vector-field (force-form) EP with symmetric nudging + +- **Free phase**: z⁰ = x_in; z^{t+1} = z^t + ε F(z^t), T1=150, ε=0.1 → z*. Monitor relative + residual `res = ‖z⁺−z*‖/‖z*‖` (one extra step) — the load-bearing health signal (Sec 5). +- **Nudged phases** (±β, β=0.02, T2=20 steps from z*): relax the augmented force + `F(z) ∓ β ∇_z CE(z)` (the CE gradient w.r.t. the *state* — local at the readout). +- **Contrast**: `a = (z₋ − z₊)/(2β)` ≈ −dz*/dβ (centered ⇒ O(β²) bias; Laborieux-style symmetric + nudging). +- **Parameter update** (force form, valid for non-gradient dynamics): for all force params θ, + `∇θ L ≈ ∂/∂θ ⟨a, F(z*; θ)⟩` — one autograd call **at the fixed point only** (this is a local + Hebbian-style contrast in θ for each force term; autograd here is per-term bookkeeping, not + backprop through time/steps). +- At a converged fixed point and β→0 this equals the implicit/equilibrium gradient + (Scellier–Bengio; Ernoult: EP ≡ BPTT stepwise under convergence). + +### 4.2 AEP correction (non-conservative repair) + +For non-conservative F the naive nudged phase linearizes around z* with Jacobian J, but the correct +adjoint needs Jᵀ. Following **AsymEP** (Scurria, Vanden Abeele, Mognetti, Massar, "EP for +Non-Conservative Systems", arXiv:2602.03670), we add to the nudged force the term `−(Jv − Jᵀv)`, +v = z − z*, where J = ∂F_nc/∂z at z* and F_nc = the non-conservative part (attention, or +attention+FFN for `thick`). This is **identical to their `−2 A_J(z*)(z−z*)`** (A_J = antisym part of +J; `Jv−Jᵀv = (J−Jᵀ)v = 2 A_J v`) — **the correction is theirs, not ours**. 
What is ours on this +line: (i) the **matrix-free jvp+vjp** form (one of each per nudged step) — AsymEP builds the full +Jacobian explicitly and decomposes it, which is infeasible at transformer state dim B·T·C; (ii) the +application to **softmax attention** (data-dependent Jacobian — they test only feedforward +nets/Hopfield on static MNIST/CIFAR, no attention/sequence); (iii) the **holomorphic combination** +(§4.3 — the correction is real-linear so it preserves holomorphy; they use plain ±β); (iv) the +**common-mode-tracking** linearization variant (§4.3/below). Their force-form **VF** readout (= our +`⟨a,∂F/∂θ⟩`) is *prior art* and is the baseline that collapses without the correction (their CIFAR +VF = 10% chance), matching our cos≈0.25 for uncorrected attention. Effect: nudged-phase Jacobian J → Jᵀ ⇒ a approximates the *adjoint* +response −(I−Jᵀ)⁻¹-type solve ⇒ exact gradients for Q≠K attention (measured 0.99–1.0). +Caveats: the correction is **linearized at z***, so the nudged trajectory must stay in the +linear-response window — T2≈20 at ε=0.1 is inside; T2=60+ can leave it (Sec 7). + +### 4.3 Holomorphic EP upgrade (current) + +Replace the 2-point real ±β difference by N points on a **complex circle** β_k = r·e^{2πik/N} +(Laborieux–Zenke 2022): relax the *holomorphically extended* dynamics (manual complex LN with +non-conjugate variance, softmax as exp-ratio, tanh-form GELU; the AEP correction is linear with +real coefficients so it preserves holomorphy — apply to Re/Im parts separately), then read +`a = −Re[(1/Nr) Σ_k e^{−iφ_k}(z_k − z*)]` (discrete Cauchy formula; bias O(r^N) instead of O(r²)). +**No clamps inside the holomorphic nudge** (clamps are non-analytic and break the bias order). + +Probe findings (cos vs long-horizon-BPTT reference, 300-step-pretrained thick block): +- The **clamps were the dominant estimator error at marginal residuals**: at res 1.6e-3, plain EP + cos 0.27 → clamp-free 0.89. 
(The clamps existed to protect early training; they were silently + poisoning mid-training updates whenever res drifted up.) +- N and r are flat (N=2…8, r=0.02…0.2 all ≈ equal): finite-β bias and 1/β noise are *not* the + binding error at this scale. +- The remaining ~0.12 misalignment is **T2 truncation**: with stable nudged dynamics, T2=120 → + cos 0.985. But on slow-mixing batches long T2 diverges (AEP linearization error compounds on + near-critical modes). Step-size-based early stopping FAILS (non-normal transient growth triggers + it at t≈6–39; same pathology that makes spectral radius the wrong free-phase signal). +- **Adaptive T2, solved by hindsight selection** (`holo_a_select`): run the nudged phases to + T2max=120 in lockstep, snapshot the contrast a_t every 10 steps, return the snapshot with the + smallest increment (= most settled), early-exit only on clear blowup (inc > 5× the running min). + Judging stability by increments of the *quantity of interest* — not step sizes — makes transient + growth harmless. Probe: never worse than fixed T2=20; mean cos 0.871 → **0.932** at tight + equilibria (0.853 / 0.987 / 0.956 per batch; the dangerous batch self-limits to t≈20–30). +- **Adaptive T1 companion**: long-T2 gains require a tight free phase (res ≲ 1e-4; at res ~1e-3 + long T2 actively hurts). So the free phase is two-stage: the λ-controller's residual signal is + still sampled at T1=150 (R9 semantics preserved — no new λ war), then relaxation continues in + chunks of 50 until res ≤ 1e-4 (cap 500) before the nudged phases. Compute buys tightness; + λ pressure does not. + +Training outcomes: **R7** (N=2, r=0.02, fixed T2=20): best 2.0289, faster wall-clock than plain EP +(3.3 vs 2.45 it/s — the holomorphic nudge's ∇_z CE is closed-form). **R10** (R9's controller + +adaptive T1/T2): **best 1.6755** (sustained EMA plateau 1.68–1.70 around step 8–10k; ~0.7 it/s). + +## 5. 
Stabilization & regularization — what, and exactly why + +**The governing fact (measured, not assumed):** the EP estimator has a **validity threshold** in +free-phase residual. Gradient cosine vs exact reference: res ≈ 5e-5 → 0.85–0.88; res ≈ 1e-3 → +0.2–0.9 (batch-dependent); res ≈ 3e-3 → ≈ 0–0.5; res ≈ 1e-2 → noise. BPTT has no such threshold +(it differentiates the actual finite computation, converged or not) — *this asymmetry, and nothing +deeper, is the EP-specific difficulty*. There is **no structural ceiling**: an early "EP caps at +~2.5" verdict was refuted (it conflated two undertrained/invalid-regime runs; see FINDINGS). + +Each stabilizer and its reason: + +1. **Damping −c·z** — creates/strengthens a fixed point for raw attention forces; *required* for + the thin/real variant (attention alone has no fixed point at high gain: residual floor ~3e-2, + no equilibrium to find). **Caveat for LN-inside blocks (`thick`)**: damping shrinks ‖z*‖ and the + LN Jacobian scales like 1/σ(z) ⇒ damping *inflates* the effective Jacobian — measured: thick + plain-relax residual 8.8e-3 at c=0 vs 3.4e-2 at c=2. So for `thick`, c is kept small (1) and is + NOT the stabilizer; the Jacobian penalty is. +2. **Soft Jacobian penalty** λ‖J_nc(z*)‖²_F (Hutchinson estimator: one jvp on a random probe vector, + differentiated w.r.t. θ; Bai et al. 2021 "Stabilizing Equilibrium Models by Jacobian + Regularization") — the actual stabilizer: keeps the free phase contractive ⇒ keeps the estimator + inside its validity region. Soft penalty ≻ hard constraints (spectral-norm capping the attention + matrices to ρ=0.9 was tried: too restrictive, kills learning — consistent with FRE-RNN-style + regulation being preferable to hard projection). +3. **Why the control signal is the residual, NOT the spectral radius**: the attention/block Jacobian + is highly **non-normal** — transient growth is invisible to eigenvalues (measured: ρ(J)=0.94 + "stable" while the relaxation diverged at res 0.21). 
The one-step residual *is* the transient; + control on it. +4. **Continuous λ controller**: λ ← clip( λ · (res_EMA/target)^0.3 , floor, 16 ), per step. + - **EMA on the signal (0.9)**: the raw residual is noisy; a multiplicative controller on a noisy + signal random-walks (measured thrash λ 0.5↔13 when the target sat at the noise floor) and the + thrashing λ itself perturbs training. EMA removed it and gave the current best run. + - **Target = 5e-4**: just inside the validity threshold (few·1e-4), with margin; NOT tighter — + demanding res ≪ threshold buys nothing and costs expressivity (a 2e-4-target run was worse). + For reference, BPTT's own optima sit at res ~1e-3–2e-2: good solutions are only mildly + contractive; we ask for slightly more than BPTT needs, because our *estimator* needs it. + - **Floor**: λ may shrink when res is healthy but **must not vanish — the floor is + load-bearing**. Floor=λ₀ (never off) is a permanent tax (2.150); floor 0.1 is the sweet spot + (2.078 → 2.047 with signal-EMA → 2.029 with the holomorphic estimator). Two independent runs + prove λ≲0.02 is fatal at *any* stage: λ→0 from the start (R2) and λ-floor annealed with lr + (R6) both ended in the same death — val CE 60–77 with res ≡ 0.0. Post-mortem: this is **an + explosion disguised as convergence by floating point**, not a dead state: ‖z*‖ and the + uncapped parameters (tok/pos/fc/pj) blow up under temporarily-invalid gradients until + ε·F < ulp(z) and the relaxation freezes (res = 0 by absorption), with huge confidently-wrong + logits. The λ penalty (whose θ-gradient touches fc/pj/LN/attention) is what keeps that basin + out of reach; it cannot be annealed away. The late-drift hypothesis "persistent penalty + gradient vs vanishing task signal" is hence only half-true — the persistent pressure is also + the anti-collapse mechanism. 
Current anti-drift attempt: **parameter EMA** (decay 0.999, + evaluated alongside raw weights), which targets late-phase estimator-noise wander without + touching the stability loop at all. +5. **Weight-norm caps** (renorm to 3× init norm on WQ,WK,WV,WO,Wm,Wh after each step): blunt safety + net against runaway during transients when the estimator is temporarily invalid. Rarely binding + in healthy runs. +6. **Nudge clamp g.clamp(±2) and AEP-correction clip (‖corr‖ ≤ ‖F‖)** — *legacy*: protected the + nudged relaxation early in training, but measured to be the main estimator error at marginal + residuals (cos 0.27 → 0.89 once removed). Replaced by the clamp-free holomorphic nudge; a + non-finite-gradient step-skip + the λ controller now carry the early-training safety. +7. **Optimizer**: AdamW lr 1e-3, wd 1e-4, cosine to 5%, grad-norm clip 5.0, skip non-finite steps. + β(EP)=0.02, ε=0.1, T1=150, T2=20 everywhere unless stated. + +### 5.x T1-residual penalty (`--resreg`) — defend the evaluated state (2026-06-20) + +EP's gradient is the fixed-point/implicit gradient: it only cares WHERE the fixed point is, not how fast the +relaxation reaches it, so it has no reward for keeping the block contractive. BPTT — differentiating the finite +T1=150 unroll, which is what eval actually uses — gets that reward implicitly (a non-converging unroll → bad +output → high CE). This asymmetry is why frozen-jr EP diverges past ~2.09 (res inflates → forward bifurcates to +a limit cycle) while exact-BPTT with the identical recipe descends to 1.72 (see FINDINGS 2026-06-20; the EP run +refines the free phase to t1max=300=z* and grades there, so it never feels the residual of the evaluated z150). + +The fix gives EP that missing term explicitly — penalize the T1 free-phase residual of the state actually +evaluated, `z150 = relax(xin, T1)` taken BEFORE any t1max refinement: +- `R_res = ‖ε·F(z150)‖² / (‖z150‖²+ε)`, gradient w.r.t. 
θ with z150 detached (`blk.tforce`); +- scaled task-relative: `ratio = resreg·min(1, res@T1 / 2e-2)`, deadband `res@T1 > 7e-4`, + `λ = ratio·‖g_task‖/‖g_res‖`, added to the EP gradient. +- **Run with `res_gate=0`** — the validity gate early-returns (jacreg-only) above the gate, which would bypass + the penalty exactly when res is high. Keep `t1max=300` (estimator accuracy) + the penalty (defends z150). + +Analog-compatible (one extra force measurement + the same local vector-field gradient, no digital root-finder) +and more targeted than jacreg (which penalizes ‖J_nc‖_F, not the actual residual vector that explodes). Validated +res-tight through step 1000 / best 2.0573 (past the 2.09 wall) before a /tmp wipe; full re-validation pending. + +## 6. Validation methodology (how we know the estimator/claims are right) + +- **Gradient-cosine probes**: at a fixed realistic operating point (300 BPTT steps from init — + no contraction penalty, the "natural" weight region), compare every estimator against a + long-horizon BPTT reference (T1=400), per parameter group (attn / ffn / LN / emb). This is what + exposed the validity threshold, the clamp damage, and the T2 truncation. +- **Horizon control**: BPTT-150 vs BPTT-400 cosine is itself only 0.35–0.77 on slow-mixing batches + — the "finite horizon vs true equilibrium gradient" cost is shared by everyone, EP is not + special; at matched horizon EP is within ~0.15 of BPTT. +- **BPTT-as-ablation**: BPTT on the identical architecture isolates *training-rule* cost (EP−BPTT) + from *architecture* cost (BPTT−BP). BPTT is an ablation, not the target; BP is the target. +- **Same-graph eval**: val CE is computed through the same T1-step relaxation used in training, so + no train/eval mismatch flatters either method. 
+- **Gradient-cosine has a lifecycle**: early/mid training it measures estimator quality (0.93 → + 0.79–0.85 across scale); late in training, at slow-mixing trained points, even two *exact* + gradients at different horizons decorrelate (cos(BPTT-150, BPTT-800) = 0.25 at the trained S1 + point) — no single "true gradient" exists to cosine against, and the meaningful arbiter becomes + the training outcome on the horizon-matched eval objective. Validity-threshold claims here are + early/mid-phase statements. Late-phase corollary: EP's res target 1.5e-3 at S1 is already + optimal — cos rises monotonically with tightness and the loose-weights/refined-measurement + variant nulls at training level: the measurability-contraction tax is rigid across the interval + (the physical escape is oscillatory/lock-in measurement, not operating-point engineering). + +## 7. Results + +**14k-step matched comparison (the honest table; thick block ≈ BP MLP=4 in parameter shape):** + +| training rule | architecture / recipe | best val CE | +|---|---|---| +| BP | standard transformer, MLP=4 (**like-for-like for thick**) | **1.610** | +| BPTT + R9's λ-controller + param-EMA | thick (exact grad, same stabilization as EP) | **1.635** — tail stable | +| **EP (R10)** | **thick; R9 + adaptive T1 (refine to res≤1e-4) + adaptive T2 (selection, cap 120)** | **1.676** (EMA plateau 1.68–1.70) | +| BP | standard transformer, MLP=1 (thin-matched) | 1.689 | +| EP (R9) | thick; holo nudge + recalibrated controller (target 1.5e-3, λmax 4) + param-EMA | 1.740 | +| BPTT (exact grad) | thick, unregularized | 2.021 — **destabilizes late** (res→4.7e-2, val→3.0) | + +3k-era and ablation numbers (shorter schedule): + +| run | best val CE | +|---|---| +| BPTT thick, 3k (its best showing) | 1.949 | +| EP R7: holo estimator, old tight controller (target 5e-4) | 2.029 (late λ pinned 16 ⇒ drift) | +| EP R8: R7 + param-EMA | 2.031 (EMA alone ≠ fix; the λ fight dominates) | +| EP R5/R3: plain estimator generations | 
2.047 / 2.078 | +| BPTT monDEQ / thin, 3k | 2.111 / 2.206 | +| EP R2 (λ→0) / R6 (λ-floor∝lr) | 2.357 / 2.501 — both die by fp-absorption explosion | +| random | 4.174 | + +Reading: (a) final decomposition — **architecture tax ≈ 0.025** (1.635 vs 1.610), **EP rule tax +≈ 0.041** (1.676 vs 1.635), total **0.066** to the like-for-like BP transformer; EP beats the +thin-matched MLP=1 baseline. (b) EP beats *bare* BPTT at both horizons, but the controlled +comparison shows most of that win was EP's mandatory stabilization loop doubling as regularization +— bare exact-gradient training walks off the contractive manifold at 14k, and the same controller +that EP cannot live without lifts BPTT to 1.635 (also beating MLP=1): **the contraction controller +is good for the architecture regardless of training rule; EP simply forced its discovery.** +(c) The estimator and the controller must be **co-designed**: upgrading the estimator +(holomorphic, clamp-free) widened the validity region from res≲5e-4 to ~1.5e-3, and re-calibrating +the controller to that wider region (R7→R9) was worth **0.29**; adaptive T1/T2 (R9→R10) was worth +another **0.064**, matching the probe's cos 0.871→0.932. (d) **Multi-seed confirmation (3 seeds per arm)**: EP +1.6755/1.6851/1.6786 → **1.680 ± 0.005** vs BPTT+controller 1.6348/1.6459/1.6365 → +**1.639 ± 0.006**; the rule tax is **0.041 ± 0.005 (~9σ)** — real, tightly reproducible, and +consistent with the measured estimator misalignment (cos 0.85–0.93). 
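The multi-seed figures in (d) can be recomputed from the six per-seed values quoted there. This sketch assumes the sample standard deviation for the per-arm spread and treats the two arms as independent when forming the standard error of the difference; the source does not state which convention it used, so the choice is an assumption.

```python
import math
import statistics

# Per-seed best val CE, as quoted in the text.
ep = [1.6755, 1.6851, 1.6786]
bptt = [1.6348, 1.6459, 1.6365]

ep_mean, ep_sd = statistics.mean(ep), statistics.stdev(ep)
bptt_mean, bptt_sd = statistics.mean(bptt), statistics.stdev(bptt)

# Rule tax = gap between the two arm means.
tax = ep_mean - bptt_mean
# Assumed convention: standard error of the difference of two independent means.
tax_se = math.sqrt(ep_sd ** 2 / len(ep) + bptt_sd ** 2 / len(bptt))
sigmas = tax / tax_se

print(f"EP   {ep_mean:.3f} +/- {ep_sd:.3f}")
print(f"BPTT {bptt_mean:.3f} +/- {bptt_sd:.3f}")
print(f"tax  {tax:.3f}, se {tax_se:.4f}, {sigmas:.1f} sigma")
```

Under these assumptions the per-arm means and spreads round to the quoted values, and the gap comes out at roughly nine standard errors.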
+
+**Scale rung S1 (TinyStories char, C=256 H=8 T=256, 0.92M params; random ln127 = 4.84):**
+
+| run | best char-CE (BPC) |
+|---|---|
+| BP same-shape, 14k | 0.827 (1.19) |
+| **BPTT-ctl, loose target 1e-2, 14k** | **1.009 (1.46)** |
+| BPTT-ctl, tight target 1.5e-3, 14k | 1.521 — ⇒ **controller-mismatch tax 0.51** |
+| EP v4b (validity gate, lr 1e-3, 20k from scratch) | 1.393 |
+| EP L2 (v4b recipe, 40k from scratch) | 1.214 |
+| **EP warm-track (v4b → phase-2: common-mode tracking + loosened target)** | **1.141** — EP champion |
+
+S1 scale lessons: (a) containment must scale with model size (λ ceiling, cap list, fuse);
+(b) the **validity gate is load-bearing** — off-equilibrium EP updates poison weights (three
+deaths before the gate, alive after); (c) the estimator validity threshold tightens with scale
+(res 1e-4 → 1e-5 for full quality; rescue is compute-bounded, saturating at cos ≈ 0.85);
+(d) **the controller operating point is part of the training rule**: EP needs validity-tight
+targets, exact-gradient methods want loose ones — match controllers for rule-tax measurements,
+but report each method at its own best operating point for ceilings.
+
+**Optimizer pricing (S0-Shakespeare, R10 recipe, 8k steps; AdamW ≈ 1.70 at 8k):** EP-SaI
+(per-tensor lr from init g-SNR, frozen = one calibrated gain line per array) **2.048**;
+SGDM 2.166; Lion 2.175; Lion+LARS 2.244. Per-tensor calibration recovers ~0.12 of the 0.47
+uniform-scale gap; the remaining ~0.35 measures the value of per-coordinate adaptivity under EP's
+noisy heteroscedastic gradients. Pretraining therefore stays in the digital shell; fine-tuning is
+exempt (SGD suffices in the RL/fine-tune regime, with <0.02% sparse updates — an endurance gift;
+Mukherjee et al., arXiv:2602.07729).
+
+**Hardware twin v4 (S0, 8-bit program-verify + 30% static mismatch + σ=1e-4 white + 4× restart
+averaging): best 1.937 — 90% of the clean improvement** (clean 1.68). Noise laws measured:
+contrast pollution strictly linear in σ; √N restart averaging; snapshot SNR ≈ 1/53 at σ=0.3%,
+r=0.2 ⇒ hardware needs ~10³–10⁴ lock-in averages per update (ms at MHz loops — physically trivial,
+digitally prohibitive: the noise dimension repeats the compute story). Discovery en route: the
+frozen AEP correction has a clean-environment instability at nudge horizons ≳150–300 steps
+(spectrum of J(z) − J* + J*ᵀ uncontrolled) — windows ≤120 steps + restart averaging circumvent it;
+the single-trajectory oscillatory (true lock-in) estimator awaits a fix for this horizon limit.
+
+## 8. Open problems
+
+1. **Late drift — mostly solved, mechanism identified**: the drift was the controller *fighting*
+   the weights (λ pinned at max enforcing a target the upgraded estimator no longer needs; R7/R8
+   tails). Re-calibrating target/λmax to the holomorphic estimator's wider validity region removed
+   the fight (R9: λ stays 0.1–0.5, tail drift shrank from ~0.3 to ~0.15 above best). Refuted
+   routes: λ-floor annealing (R6 ⇒ fp-absorption explosion — the floor is load-bearing);
+   param-EMA alone (R8 — smooths wobble ~0.05, can't fix the fight). Residual ~0.15 tail drift
+   remains open (estimator direction bias near optimum is the suspect).
+2. **Adaptive T2 — SOLVED** by hindsight snapshot selection (§4.3): judge by increments of the
+   contrast estimate, not step sizes; select the most-settled snapshot. Probe mean cos 0.932;
+   training −0.064 val CE (R10). Possible refinements: larger N with selection, per-batch T2max.
+3. **Mixing time**: slow-mixing equilibria make *all* gradients horizon-expensive (BPTT included);
+   conditioning the dynamics for fast mixing (preconditioned/Anderson relaxation that preserves the
+   EP contrast structure) is unexplored here.
+4. **Scale**: depth-1 block, char-level, C=128. The mechanisms (validity threshold, non-normality,
+   controller design) are dynamics-level and should transfer; the constants will move.
+
+## 9. Hardware translation — can this now-complex algorithm still run on EP hardware?
+
+Audit of every component of the final recipe (R10) against an analog/in-memory substrate. The
+surprise: most of the added "complexity" is *control*, and control is cheap in analog; several of
+our fixes specifically REMOVED digital artifacts.
+
+| algorithm component | analog realization | difficulty |
+|---|---|---|
+| free phase (T1≈500 Euler steps — our digital bottleneck) | physical settling, ns–µs, "free" | trivial (hardware's whole pitch) |
+| adaptive T1 ("relax until res≤1e-4") | settling detector = comparator on dz/dt | trivial |
+| symmetric nudging ±β | output-node current injection | standard EP hardware |
+| **holomorphic N-point circle** | **AC-modulated nudge + lock-in (homodyne) detection** — the Cauchy sum over phases IS lock-in readout; this is Laborieux–Zenke's "finite-size oscillations" taken literally | standard measurement technique; *more* native than DC differencing, and the standard weapon against analog noise floors |
+| clamp removal (our biggest estimator fix) | hardware never had clamps; saturation is smooth | already done by physics |
+| VF update ⟨a, ∂F/∂θ⟩ | local Hebbian outer product (contrast × presynaptic activity) per crossbar; autograd was only digital bookkeeping | native |
+| λ-controller (residual → EMA → multiplicative λ, floor/cap) | a **neuromodulator/homeostatic loop**: 1 measurable scalar (settling ripple) → RC filter (the EMA) → log-domain integrator with rails (floor/cap) → global broadcast scaling a local anti-Hebbian (contraction) rule | a handful of analog components |
+| λ floor = anti-collapse (R2/R6 lesson) | minimum leak conductance — never let homeostasis switch off | natural |
+| adaptive T2 (snapshot selection) | sample-and-hold bank on the *contrast* signal + stability gate on the OUTPUT quantity (the transferable lesson: never gate on state velocity — non-normal transients fool it) | cheap |
+| weight caps (3× init) | device conductance range | physics gives it free |
+| param-EMA | slow/fast weight pairs (volatile + nonvolatile device) | known proposals |
+| AdamW | per-synapse capacitor for momentum; second moment is the real gap (shared by all analog-training schemes) | open engineering |
+| softmax attention + LN circuits | analog WTA / divisive-normalization primitives; data-dependent T×T attention in-memory | hard, but an *inference*-hardware problem shared by all analog-transformer efforts, not EP-specific |
+
+**The two genuine research obstacles:**
+
+1. **The AEP correction (J → Jᵀ) in physics.** Crossbars give Wᵀ for free (drive the other side),
+   but the correction needs the *circuit's* transposed Jacobian, including data-dependent softmax
+   parts. The classical answer exists: the **adjoint network** (Director–Rohrer 1969, circuit
+   sensitivity theory) — constructible by reversing non-reciprocal elements; nobody has built
+   AEP-learning with one yet. The alternative is a measured price list from our own ablations:
+   accept *reciprocal* (energy-based, tied-value) attention and skip the correction entirely,
+   costing ~0.15–0.2 CE (monDEQ 2.11 / energy-mode vs thick 1.95 under exact gradients).
+2. **Precision budget vs the validity threshold.** Analog noise floors (~1%, 8–10 effective bits)
+   sit exactly where we measured EP gradients dying (res ~1e-2). Mitigations, all quantified here:
+   the N-point estimator tolerates **r=0.2** (10× nudge signal at equal bias — probe: flat in r);
+   lock-in detection buys orders of magnitude of SNR below the noise floor; free and nudged phases
+   run on the *same* devices so static mismatch cancels in the contrast (EP's structural advantage
+   over on-chip backprop — note the adjoint path partially forfeits this and needs care); and
+   hardware's 10⁶× speed headroom converts to phase-averaging. Our cos-vs-residual and cos-vs-r
+   tables (§5, §4.3) are, read this way, the **spec sheet** for an analog EP-transformer design.
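
The claim that an N-point circle estimator tolerates a large nudge radius, and that the Cauchy sum over phases is the same arithmetic as lock-in demodulation, can be illustrated on a scalar toy. This is a minimal sketch under stated assumptions: `circle_derivative` is a hypothetical helper written for this note (it is not the project's `holo_a`), and `tanh` stands in for a holomorphic force. With N=2 the formula reduces to ordinary ± central differencing.

```python
import cmath
import math


def circle_derivative(f, z0, r, n_points):
    """Estimate f'(z0) from n_points samples on a circle of radius r.

    Discretized Cauchy integral: each sample is multiplied by the conjugate
    phase exp(-i*theta) and the products are averaged, which is the same
    arithmetic as demodulating a response against a reference oscillation.
    """
    total = 0j
    for k in range(n_points):
        theta = 2.0 * math.pi * k / n_points
        phase = cmath.exp(1j * theta)
        total += f(z0 + r * phase) * phase.conjugate()
    return total / (n_points * r)


f = cmath.tanh                      # toy holomorphic "force" (assumption)
z0 = 0.3                            # toy operating point (assumption)
exact = 1.0 - math.tanh(z0) ** 2    # analytic derivative of tanh at z0

r = 0.2                             # large nudge radius, as quoted in the text
err_n2 = abs(circle_derivative(f, z0, r, 2) - exact)  # plain +/- differencing
err_n8 = abs(circle_derivative(f, z0, r, 8) - exact)  # 8-phase readout

print(f"N=2 error: {err_n2:.2e}")
print(f"N=8 error: {err_n8:.2e}")
```

The leading bias of the N-point estimate scales as r^N, so adding phases lets the radius grow without paying in bias. The sketch is noise-free; it says nothing about the measured noise laws, which come from the project's own probes.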
+ +**Memristor-crossbar specifics — the Jᵀ question.** Needing Jᵀ does NOT disqualify memristor EP +platforms; crossbars are the most transpose-friendly analog substrate there is (drive rows→read +columns = Wx; drive columns→read rows = Wᵀy — the property on-chip-BP designs rely on). The fork: +(a) *passive-reciprocal* platforms (resistor-coupled, Kirchhoff/coupled-learning style) are J=Jᵀ by +physics — no correction needed, but they cannot express non-reciprocal attention even at +inference; on these, run the **reciprocal recipe**: LSE energy attention (tied value) + Hopfield +FFN — whose force 2·relu(zWm)Wmᵀ uses ONE crossbar driven in both directions — at the measured +~0.15–0.2 CE expressivity cost, zero hardware changes. (b) *active-periphery* platforms +(DAC→crossbar→ADC loops; periphery already breaks reciprocity) get Jᵀ as **transposed reads of the +same arrays + frozen local gains** (gelu′, softmax p, LN 1/σ from the free-phase operating point) +— the Director–Rohrer adjoint network in crossbar form: a periphery/routing redesign, not a +different device technology. Scope note: Jᵀ appears ONLY in the nudged phase applied to one vector; +the free phase needs nothing, weight updates are local outer products, and even the λ penalty needs +only ‖Jv‖² (forward perturbation + response energy, no transpose). Unlike on-chip BP there is no +per-layer activation storage or strict reverse scheduling — one held operating point z* suffices. +Numerics for the hardware team: large nudge amplitude r≈0.2 + multi-phase/AC (lock-in) readout is +validated equivalent to r=0.02 (10× signal headroom); small-signal DC differencing dies at ~1e-3 +noise (our tf32 experiment: cos→−0.03). Suggested collaboration phasing: (1) reciprocal demo on +the existing rig (zero redesign, pay 0.2), (2) transposed-periphery nudge → full AEP, buy it back. + +**On "training arbitrary analog circuits"** (the bigger question): classic EP requires +energy-based (reciprocal) circuits. 
AEP lifts this to *any circuit with a stable fixed point* — +IF the antisymmetric correction is realizable (adjoint network) or waived (reciprocal trade). +What this project adds to that picture is the missing stability half: **training pushes arbitrary +circuits off the contractive manifold** (bare-BPTT-14k showed even exact gradients walk off it), +and a residual-driven homeostatic controller both prevents this AND improves learning — with the +hard constraint that its floor never anneals to zero (fp-absorption collapse; in hardware: +latch-up). Combined with agnostic/physical EP (Scellier et al. 2022 — no circuit model needed, +contrast is measured) and small-scale physical demonstrations (Dillavou et al.'s self-learning +resistor networks; Laydevant et al.'s Ising-machine EP, 2024; memristor activity-difference +training), the pieces for "arbitrary stable analog circuit + adjoint or reciprocity + homeostatic +contraction control = trainable" are all individually demonstrated; this work supplies the +control law and the quantitative budgets. + +Inference note: causal attention's lower-triangular coupling means autoregressive generation +settles *incrementally* — a new token's state relaxes with past states frozen, so EP inference is +one physical settling per token, not a re-relaxation of the sequence. 
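
The incremental-settling argument in the inference note can be checked on a toy causal system. The sketch below is an assumption-laden illustration, not the project's dynamics: it uses a scalar state per token, a hypothetical `coupling` function with row sums of 0.3 (so the map is a contraction), and plain fixed-point iteration in place of Euler relaxation. It compares settling all tokens jointly against settling one token at a time with past states frozen.

```python
import math

T = 6                                   # toy sequence length
x = [0.5, -0.2, 0.8, 0.1, -0.6, 0.3]    # per-token input drive (made up)


def coupling(t, s):
    """Causal coupling: token t only sees tokens s <= t; each row sums to 0.3."""
    return 0.3 / (t + 1) if s <= t else 0.0


def relax_target(z, t):
    """Value token t relaxes toward, given the current states z."""
    return math.tanh(x[t] + sum(coupling(t, s) * z[s] for s in range(t + 1)))


def settle_jointly(n_iters=200):
    """Relax the whole sequence at once."""
    z = [0.0] * T
    for _ in range(n_iters):
        z = [relax_target(z, t) for t in range(T)]
    return z


def settle_incrementally(n_iters=200):
    """Relax one new token at a time; earlier states stay frozen."""
    z = [0.0] * T
    for t in range(T):
        for _ in range(n_iters):
            z[t] = relax_target(z, t)
    return z


joint = settle_jointly()
incremental = settle_incrementally()
max_gap = max(abs(a - b) for a, b in zip(joint, incremental))
print(f"largest disagreement between the two schedules: {max_gap:.1e}")
```

Because the coupling is lower-triangular, token t's equilibrium never depends on later tokens, so both schedules reach the same fixed point; the incremental one simply never revisits settled states.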
+ +**Component BOM (assuming the current recipe survives the ladder unchanged):** +(1) bidirectional-read analog weight arrays (ReRAM/PCM/analog-Flash/gain-cell/switched-cap) — all +W and Wᵀ including the AEP adjoint reads; (2) state-integrator arrays (capacitor+OTA per state +variable; K·T·C nodes — ~1M at the 33M demo, ~17M at 0.6B; the sequence dimension T dominates); +(3) analog attention primitives — large-fan-in current/charge-domain softmax-WTA + T² score +sample-and-hold for the frozen nudge gains — **the hardest, least-shelf-ready component**; +(4) divisive-normalization circuits (LN/RMS/qk-norm); (5) mixed-signal periphery: DAC/ADC arrays, +S&H banks, and **lock-in (synchronous-detection) channels** — large-r AC nudging is mandatory +(small-signal DC differencing dies at analog noise floors; measured digitally via the tf32 +experiment); (6) control plane: settling comparator + RC filter (res-EMA) + log-domain integrator +with rails (λ controller) + a global **learn-enable line (= the validity gate)** + fuse — a +handful of components or one MCU; (7) weight-update machinery: coincidence pulse programming +(local outer products), with device nonlinearity/endurance the classic pain point; (8) an FPGA +phase sequencer (settle→hold→nudge±→snapshot→update). + +**Six-month prototype plan (borrow physics, don't fab — main track: optical).** +*Primary — desktop optical EP machine (Goodman MVM + electronic loop):* one off-the-shelf LCoS SLM +(~2M pixels, $15–25k) holds, with differential encoding, ~1M signed analog weights — the fully +digitally-validated R10 thick block (C=128, 12C² ≈ 200k weights) occupies **one tenth of one SLM** +(5× headroom). 
Weights (WQ/K/V/O, FFN) static on the SLM during settling; state z (C=128) cycles +through a 128-channel DAC-driven source array → SLM → cylindrical-lens summation → photodiode +array; nonlinearity, T² attention scores (negligible digitally at T=64), Euler integration, and +the λ/gate control law in loop electronics; one loop pass = one Euler step. **Timing is set by +B·T multiplexing, not settling**: each Euler step = B·T·(~6 matrices) MVM passes; at loop rates +0.1–1 MHz and B=4–8, settle ≈ sub-second and a 14k-step training run ≈ hours; SLM refresh once +per training step (60 Hz ample). **Wᵀ for AEP: program the transposed panels alongside W in the +spare SLM area** (reverse-propagation reciprocity remains the Phase-2 elegance; don't gate the +prototype on bidirectional alignment). Precision framing: master weights live fp32 in the digital +shell, the SLM holds a fresh ~8-bit projection each step — standard QAT regime; the open question +is per-pass multiplicative noise + slow drift (speckle/calibration, the optical ~1% floor), which +is exactly what the spec-sheet arsenal (r=0.2 nudging, lock-in/homodyne — the field's native +measurement, same-device contrast cancellation, λ-controller) was built for, and which is +**pre-validated digitally by an optics-noise-model run** (8-bit weight quantization per step + +1–2% multiplicative force-eval noise + drift) before any purchase. Novelty: photonics has in-situ +BP and BP-free local learning (Science 2023 ×2); EP-on-optics exists only as oscillator theory — +"EP-trained transformer on optics" is unclaimed. Budget $20–50k; optics ~2 months, loop +electronics 1–2 months (rehearsed by the PCB track, same parts/skills), calibration+training 2. +*Secondary (one day of email, no more):* Mythic M1076 as a borrowed settle engine — gated solely +on SDK raw-MVM access + incremental writes; flash endurance marginal beyond few-k-step demos. +(Laydevant et al.'s D-Wave EP precedent: reviewers accept rented physics.) 
+*Fallback / rehearsal (zero-dependency):* board-level reciprocal block (C=8–16, Hopfield Wm +driven bidirectionally + energy attention), digipot/MDAC weights, OTA+cap integrators, Red Pitaya +AC nudge + lock-in, comparator+RC λ loop — small headline, but lands continuous settling, lock-in +contrast readout, and the homeostatic control law in real electronics, and its loop electronics +ARE the optical track's loop electronics. Six months buys no foundry CMOS — but **university cleanrooms (e.g., UIUC HMNTL) are a different +category, and they fabricate exactly the one BOM item money can't buy: the weight arrays.** +Passive BEOL memristor crossbars (bottom electrode / ALD HfOx-TaOx / top electrode, 3–4 masks, +µm linewidth, contact or maskless litho) are the academic-cleanroom comfort zone; the practical +per-array limit is sneak-path-set (~64×64–128×128 for 1R passive with V/2 biasing), and a handful +of tiled arrays covers a C=32 reciprocal block (C=128 ≈ a dozen-array wiring project). FeFET +(three-terminal, on the same line's ferroelectric pedigree) cures sneak paths for a few extra +masks. The algorithm side has already bought insurance for first-batch device quality: Phase-1 +keeps fp32 masters with **program-verify writes (only ~6-bit iterative programmability needed — +no pretty pulse physics)**, and 10–50% device mismatch is absorbed by same-device contrast +cancellation + the λ-controller — validated in digital twin runs (per-step N-bit weight +projection + per-pass multiplicative noise + static mismatch; see `--wq_bits/--fnoise/--wmis`). +Discipline: fab ONLY the arrays; anything on Digi-Key stays COTS board-level (student-process +CMOS periphery would be stone-age). This raises the board-track ceiling from digipot (~10³ +weights, C=8) to homemade crossbars (10⁴–10⁵ weights, C=32–64) without leaving campus. 
Execution +for a no-fab-experience team: the standard academic rentals — (1) recipe-owner collaboration +(their senior student runs their existing process; weeks of routine work; co-authorship; you +never gown up), (2) apprenticeship via facility training + staff engineers (executed by a +recruited student), (3) paid staff-run fabrication. Find the recipe owner before booking tool +time; lead the pitch with the device-twin plot ("your first-batch devices suffice — proven"). + +**Recommended Phase-1 architecture: analog equilibrium core + digital optimizer shell.** Physics +performs only the expensive part (settling + contrast measurement = the ×300–1000 digital +overhead); contrasts are ADC'd out, Adam/schedules/λ-logic stay digital, weights DAC back. This +sidesteps component (7)'s update-nonlinearity pain and the missing analog Adam, at no loss of the +compute advantage. Phase-2: full in-array updates. + +**Sizing correction — causal serialization (the assumption that kills the wafer-scale monster):** +naive sizing assumes the whole sequence's state must be physically resident during settling +(K·T·C integrator nodes — hundreds of millions at 8B scale, 1e4–5e4 mm²). Causality removes this: +the free phase settles **token-by-token** (token t's equilibrium depends only on tokens ≤ t; past +states live in an ordinary digital KV-cache), and the nudged phase is the adjoint of a +time-lower-triangular system = an exact **reverse sweep** (upper-triangular back-substitution, +exact to the same order as the AEP linearization). Physical state requirement drops from K·T·C to +**K·C per token-slice** (÷T ≈ ÷2048): ~15 mm² of integrators; sequence caches are ~1 GB of +commodity DRAM. Readout also streams per token-slice (~10⁵ values/token, batch-accumulated in +charge domain) — ADC throughput lands at the standard IMC design point, wall-clock days for an +8B-Chinchilla run at MWh-scale energy. 
Remaining big item: weight arrays only — 0.6B ≈ 2–4 chips +@28nm (university-consortium scale, $5–20M staged program); 8B ≈ 5–8 reticle dies @7nm (gen-3). +New throughput consideration: serialized operation makes τ_settle the rate limit, and the spring +chain's ~K² mixing tax favors **shallow-wide stacks (K=4–8)** — which the ladder data already +supports (thick single block ≈ same-shape BP). Program staging: MPW single-block demo (33M-class, +$1–5M) → 0.6B 2–4-chip machine → 8B gen-3. Economics read: capex-dominated, opex→0 — the pitch is +2–3 orders of magnitude energy and edge/continual learning, not cloud-GPU rent replacement. + +**Compute reality (digital simulation)**: EP's per-step force-eval budget E ≈ 700–3000 makes full +EP training cost ≈ E/3 ≈ **230–1000× the BP cost** at equal tokens (Chinchilla 20×: 0.6B ⇒ 12B +tokens ⇒ ~4.3e19 BP-FLOP ⇒ ~1e22-class EP-FLOP: H100-cluster scale for one run, years on 4×A6000). +This multiplier is exactly what physical settling eliminates — the algorithm is expensive in +digital silicon and native in physics. Note TinyStories (~0.6B tokens) Chinchilla-matches ~30M +params — precisely the planned "readable stories" demo scale (S4). + +## 10. Code map (all on timan1) + +- `/tmp/lt_ep/lt_ep_train.py` — main trainer: EQBlock (all four force variants), `ep_step` + (VF-EP + AEP + optional holomorphic nudge), `bptt_step`, λ controller, caps. + Key flags: `--mode ep|bptt --attn_mode thick|real|energy|mono --jacreg --jr_floor --res_target + --res_ema --jr_lrcouple --holo N --hr r --c --T1 --T2 --eps --beta`. +- `/tmp/lt_ep/holo_ep.py` — holomorphic force/softmax/LN/GELU, `holo_a` (Cauchy readout), probe. +- `/tmp/lt_ep/grad_quality.py` — estimator-vs-exact cosine probe (validity threshold measurement). +- `/tmp/lt_ep/solver_wall.py` — plain vs Anderson free-phase convergence per damping level. +- `/tmp/lt_ep/bp_charlm.py` — param-matched standard BP transformer baseline. 
+- `/home/yurenh2/ept/cet_mvp.py`, `cet_aep.py`, `aep_*.py` — CET reproduction + AEP validation + (vision side; gradient-fidelity numbers in Sec 3/FINDINGS). +- Run logs: `/tmp/lt_ep/thickep_*.log`, `H2_*.json`. +- Data: `/tmp/lt_ep/data/shakespeare_char/{train,val}.bin, meta.pkl`. + +Hardware: 1× RTX A6000 per run (shared node); plain-EP ~2.4 it/s, holo-EP(N=2) ~1.5–2 it/s at +B=32, T=64, C=128, T1=150, T2=20. A 14k-step run ≈ 1.6–2.5 h. diff --git a/docs/method/READING.md b/docs/method/READING.md new file mode 100644 index 0000000..12561a3 --- /dev/null +++ b/docs/method/READING.md @@ -0,0 +1,58 @@ +# 项目阅读清单 — EP 训练平衡态 Transformer(含模拟硬件路线) + +按学习顺序排列。主线七篇是看懂本项目的最小集;支线按需。每篇标注"为什么读"。 +读完主线后直接读内部文档:`METHODS.md`(系统现状)→ `FINDINGS.md`(发现编年史)。 + +## 主线(必读,按顺序) + +1. **Equilibrium Propagation** — Scellier & Bengio 2017, arXiv:1602.05179 + 一切的起点:free phase / nudged phase / 局部对比更新。读到能背出两相结构为止。 +2. **EP ≡ BPTT** — Ernoult et al. 2019, arXiv:1905.13633 + EP 为什么算的是真梯度、以及代价(自由相必须收敛 + β→0)。本项目的"有效域"概念源头。 +3. **Scaling EP(对称 nudging)** — Laborieux et al. 2021, arXiv:2006.03824 + ±β 居中差分消一阶偏差;EP 第一次上 CIFAR。我们对比读出的基本形态。 +4. **Holomorphic EP** — Laborieux & Zenke 2022, arXiv:2209.00530 + 复平面 N 点 / 振荡相位 → 有限 β 精确梯度。我们的估计器与硬件锁相故事的理论根。 + (重要预期管理:本项目实测其"振荡"形态在白噪声下才是必需品,干净数字环境 N=2 即可。) +5. **AEP:非保守系统的 EP** — arXiv:2602.03670 + 反对称修正 −(J−Jᵀ)(z−z*):把 nudged 线性化从 J 翻成 Jᵀ,使 Q≠K 的真 attention 可 EP。 + 本项目最重要的外部方法。我们的扩展:共模跟踪线性化(见 METHODS §4.3)。 +6. **CET:Convergent Energy Transformer** — Høier, Kerjan, Scellier, ICLR'26 AM workshop(OpenReview: Qrfml76eWJ) + 能量式(互易、tied-value)attention 的 EP 训练,我们入场前的 SOTA。我们复现过(cet_mvp.py), + 也是"互易让步"硬件路线(Phase-0)的配方。 +7. **DEQ:Deep Equilibrium Models** — Bai et al. 2019, arXiv:1909.01377 + 平衡态架构家族总纲:权重共享不动点网络匹配显式 transformer。我们的 thick 块即 DEQ 式块。 + +## 稳定性支线(理解我们的控制律) + +- **Jacobian 正则稳定平衡模型** — Bai et al. 
2021, arXiv:2106.14342:λ 惩罚的出处。 +- **monDEQ** — Winston & Kolter 2020, arXiv:2006.08591:结构保证唯一不动点;我们的 mono 消融。 +- **FRE-RNN(Toward Practical EP)** — arXiv:2508.11659:反馈调节谱半径;我们 res 驱动 controller 的精神前身 + (注意我们的发现:非正规雅可比下谱半径是错误信号,必须用残差——METHODS §5)。 + +## 硬件支线(模拟实现路线) + +- **模拟电路 EP** — Kendall et al. 2020, arXiv:2006.01981:EP 上模拟硬件的开山提案。 +- **物理学习网络实物演示** — Dillavou et al. 2022, Phys. Rev. Applied:真电阻网络的对比局部学习。 +- **Ising 机 EP** — Laydevant et al. 2024, Nature Communications:租来的物理(D-Wave)也能发——先例。 +- **Agnostic physics-driven learning** — Scellier et al. 2022, arXiv:2205.15021:不需要电路模型的 EP。 +- 对照组:**Physics-aware training(PNN)** — Wright et al. 2022, Nature:物理前向 + 数字反传(我们不走的路)。 +- 电路理论经典:**伴随网络** — Director & Rohrer 1969(IEEE Trans. Circuit Theory):Jᵀ 的物理构造。 + +## 优化器支线(硬件友好优化器之争) + +- **Why Transformers Need Adam(Hessian 异质性)** — NeurIPS 2024。 +- **SGD-SaI** — arXiv:2412.11768:初始化定每块 lr → SGDM 追平 AdamW(我们 EP-SaI 的原型,实测只赎回部分)。 +- **Do We Need Adam?(RL 阶段纯 SGD + 0.02% 稀疏更新)** — Mukherjee et al., arXiv:2602.07729(UIUC Hao Peng 组)。 +- Lion — arXiv:2302.06675:sign 更新 = 固定幅度脉冲编程(硬件视角)。 + +## 语料与背景 + +- **TinyStories** — arXiv:2305.07759:小模型可以写连贯故事;我们阶梯语料与"能看"demo 规模设定的依据。 +- Universal Transformer — arXiv:1807.03819:权重共享深度的先例。 + +## 内部文档(读完主线后) + +1. `~/ept/METHODS.md` — 系统全貌:架构、估计器、控制律、规模法则、硬件翻译与 BOM。 +2. `~/ept/FINDINGS.md` — 编年史:每个失败、验尸与修复("墙"的证伪、闸门、噪声战役)。 +3. 代码:`~/ept/lt_ep_code/`(备份);活跃实验在 timan1:/tmp/lt_ep/。 diff --git a/docs/method/READING_EN.md b/docs/method/READING_EN.md new file mode 100644 index 0000000..351c5ea --- /dev/null +++ b/docs/method/READING_EN.md @@ -0,0 +1,54 @@ +# Project Reading List — Training Equilibrium Transformers with EP (incl. the analog-hardware track) + +Ordered for learning. The seven **core** papers are the minimal set to understand this project; the side tracks are on-demand. Each entry notes *why* to read it. After the core list, go straight to the internal docs: `METHODS.md` (current system) → `FINDINGS.md` (chronicle of findings). 
+ +## Core (must-read, in order) + +1. **Equilibrium Propagation** — Scellier & Bengio 2017, arXiv:1602.05179 + Where it all starts: free phase / nudged phase / local contrastive update. Read it until you can recite the two-phase structure from memory. +2. **EP ≡ BPTT** — Ernoult et al. 2019, arXiv:1905.13633 + *Why* EP computes the true gradient, and at what price (the free phase must converge + β→0). The origin of this project's "regime of validity" notion. +3. **Scaling EP (symmetric nudging)** — Laborieux et al. 2021, arXiv:2006.03824 + The centered ±β difference cancels the first-order bias; EP's first run on CIFAR. The baseline form we read everything against. +4. **Holomorphic EP** — Laborieux & Zenke 2022, arXiv:2209.00530 + N complex-plane points / oscillating phases → exact gradient at finite β. The theoretical root of our estimator and of the hardware lock-in story. + (Important expectation management: we find empirically that its "oscillatory" form is only *required* under white noise; in a clean digital setting N=2 suffices.) +5. **AEP: EP for non-conservative systems** — arXiv:2602.03670 + The antisymmetric correction −(J−Jᵀ)(z−z*): it flips the nudged-phase linearization from J to Jᵀ, making true attention with Q≠K EP-trainable. The single most important external method for this project. Our extension: common-mode-tracking linearization (see METHODS §4.3). +6. **CET: Convergent Energy Transformer** — Høier, Kerjan, Scellier, ICLR'26 AM workshop (OpenReview: Qrfml76eWJ) + EP training of an energy-based (reciprocal, tied-value) attention — the SOTA before we entered. We reproduced it (cet_mvp.py); it's also the recipe for the "reciprocity-concession" hardware track (Phase-0). +7. **DEQ: Deep Equilibrium Models** — Bai et al. 2019, arXiv:1909.01377 + The master plan of the equilibrium-architecture family: a weight-tied fixed-point net matches an explicit transformer. Our "thick" block is a DEQ-style block. 
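
As a reading aid for items 1 and 3, the bias of one-sided versus centered nudging can be worked out exactly on a one-variable energy model. This is a toy chosen for this note, not code from the project: it assumes E(θ, s) = ½s² − θxs and C(s) = ½(s − y)², for which the nudged equilibrium has a closed form, and all function names are hypothetical.

```python
def nudged_state(theta, x, y, beta):
    """Minimizer of E + beta*C for E = 0.5*s**2 - theta*x*s, C = 0.5*(s - y)**2."""
    return (theta * x + beta * y) / (1.0 + beta)


def dE_dtheta(s, x):
    """Partial derivative of the toy energy with respect to theta at state s."""
    return -s * x


def one_sided(theta, x, y, beta):
    """Free phase versus a single +beta nudged phase."""
    s0 = nudged_state(theta, x, y, 0.0)
    sb = nudged_state(theta, x, y, beta)
    return (dE_dtheta(sb, x) - dE_dtheta(s0, x)) / beta


def symmetric(theta, x, y, beta):
    """Centered difference between +beta and -beta nudged phases."""
    sp = nudged_state(theta, x, y, +beta)
    sm = nudged_state(theta, x, y, -beta)
    return (dE_dtheta(sp, x) - dE_dtheta(sm, x)) / (2.0 * beta)


theta, x, y, beta = 0.7, 1.5, 2.0, 0.1
true_grad = (theta * x - y) * x     # dC/dtheta at the free equilibrium s* = theta*x
err_one = abs(one_sided(theta, x, y, beta) - true_grad)
err_sym = abs(symmetric(theta, x, y, beta) - true_grad)
print(f"one-sided error: {err_one:.4f}")
print(f"symmetric error: {err_sym:.4f}")
```

In this toy the one-sided estimate equals the true gradient divided by (1 + β) and the symmetric one equals it divided by (1 − β²), so the leading error drops from order β to order β².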
+ +## Stability side-track (to understand our control laws) + +- **Stabilizing Equilibrium Models via Jacobian Regularization** — Bai et al. 2021, arXiv:2106.14342: the source of the λ-penalty. +- **monDEQ (Monotone Operator Equilibrium Networks)** — Winston & Kolter 2020, arXiv:2006.08591: a structural guarantee of a unique fixed point; our `mono` ablation. +- **FRE-RNN (Toward Practical EP)** — arXiv:2508.11659: feedback that regulates the spectral radius; the spiritual predecessor of our residual-driven controller. (Note our finding: under a non-normal Jacobian the spectral radius is the *wrong* signal — you must use the residual. See METHODS §5.) + +## Hardware side-track (the analog-implementation route) + +- **EP on analog circuits** — Kendall et al. 2020, arXiv:2006.01981: the founding proposal for EP on analog hardware. +- **Physical-learning-network demonstration** — Dillavou et al. 2022, Phys. Rev. Applied: contrastive local learning on a real resistor network. +- **EP on an Ising machine** — Laydevant et al. 2024, Nature Communications: even *rented* physics (D-Wave) can publish — the precedent. +- **Agnostic physics-driven learning** — Scellier et al. 2022, arXiv:2205.15021: EP without needing a circuit model. +- Contrast case: **Physics-aware training (PNN)** — Wright et al. 2022, Nature: physical forward + digital backprop (the road we *don't* take). +- Circuit-theory classic: **the adjoint network** — Director & Rohrer 1969 (IEEE Trans. Circuit Theory): the physical construction of Jᵀ. + +## Optimizer side-track (the fight over a hardware-friendly optimizer) + +- **Why Transformers Need Adam (Hessian heterogeneity)** — NeurIPS 2024. +- **SGD-SaI** — arXiv:2412.11768: initialization sets a per-block lr → SGDM matches AdamW (the prototype for our EP-SaI; empirically it only recovers part of the gap). +- **Do We Need Adam? (pure SGD + 0.02% sparse updates in the RL stage)** — Mukherjee et al., arXiv:2602.07729 (Hao Peng's group, UIUC). 
+- **Lion** — arXiv:2302.06675: the sign update = fixed-amplitude pulse programming (the hardware view). + +## Corpus & background + +- **TinyStories** — arXiv:2305.07759: small models can write coherent stories; the basis for our ladder corpus and the size target of the "legible" demo. +- **Universal Transformer** — arXiv:1807.03819: the precedent for weight-tied depth. + +## Internal docs (after the core list) + +1. `~/ept/METHODS.md` — the full system: architecture, estimator, control laws, scaling laws, hardware translation + BOM. +2. `~/ept/FINDINGS.md` — the chronicle: every failure, post-mortem, and fix (refuting "the wall", the gate, the noise campaign). +3. Code: `~/ept/lt_ep_code/` (backup); active experiments under `~/ept/ep_run/`. diff --git a/docs/outreach/EMAIL_DRAFT_BEN.md b/docs/outreach/EMAIL_DRAFT_BEN.md new file mode 100644 index 0000000..8e7532f --- /dev/null +++ b/docs/outreach/EMAIL_DRAFT_BEN.md @@ -0,0 +1,42 @@ +# Draft reply to Ben Scellier (benjamin@rain.ai) — subject: Re: Scaling transformer trained by EP + +Hi Ben, + +Thanks again — and apologies for the delay; I wanted to send something concrete, and to wait until the model's samples were actually legible before sharing. The work has since split into two stages that map cleanly onto your offer. + +**Stage 1 — Dynamics & convergence of EP with asymmetric (non-conservative) weights.** This is now a self-contained piece; a short report is attached (5 pp). AsymEP / VF-EP assume the free phase converges — we characterize *when* it does. The non-conservative free-phase operator undergoes a **supercritical Hopf bifurcation** (we confirm ℓ₁ < 0 via the normal form) as the coupling crosses a boundary set by gain/leak × asymmetry; past it the fixed point gives way to a bounded limit cycle and EP breaks. 
We then characterize three steerings that hold the system below the boundary — a spectral-abscissa projection, a residual-driven adaptive leak, and a finite-time-Lyapunov (gradient-flossing) penalty — and, measured over 200 seeds, report an honest trade-off: the spectral projection is the only universally robust one, while the other two are each robust on a single architecture and parameter-sensitive on the other. It's all MLP/CNN/RNN, so it stands on its own — and it gives us enough of a handle on the scalability question (what makes the free phase converge or break) to scale the transformer with some confidence. + +I think this sits naturally beside your CET: CET secures convergence *by construction* (the conservative attention energy, no free V/O), whereas this studies the complementary regime — the full non-conservative operator with independent Q/K/V/O, where convergence is precisely what is no longer free. I'd be very grateful for any feedback on the report — experiments, framing, writing, all welcome. And if the direction interests you, we would be honored to have you as one of the senior authors; no venue is committed yet, so there's room to shape it together. + +**Stage 2 — Scaling the EP-trained transformer (the larger, ongoing effort).** This carries a good deal of additional method and engineering that Stage 1 doesn't cover — the adjoint-consistent nudge that makes the non-conservative attention trainable (it recovers the exact gradient, cosine ≈ 1 with BPTT), the block architecture, and the open analog-friendly-optimizer question. A brief progress note: we've been iterating recipes at the ~15M debug scale (faster turnaround), and the best of these on TinyStories just reached val CE ≈ 1.97 — for us that's roughly the threshold where generations become legible — and it produces coherent little stories from the prompt *"Once upon a time,"*, e.g.: + +> *"Once upon a time, there was a little girl named Lily. She had a puppy named Max. 
Lily loved to play with her pet… Lily asked her mom, 'Can I play with Max?' Her mommy replied, 'Sure, but first, you have to play with me.'"*
+
+(This is our current best debug recipe; we haven't migrated it to 50M yet — that's the immediate next step.)
+
+On compute-only vs. methodology: I'd much prefer the methodology side — your guidance, especially on convergence and on the analog-friendly optimizer, would be invaluable.
+
+A more detailed compute picture (rough — the EP overhead dominates the uncertainty):
+- **EP overhead over BP:** ~15–60× per token (relaxation steps × nudged-phase evaluations; the nudging-phase design is the main lever, and this constant is exactly what physical settling would remove).
+- **Anchor:** a 1B BP run ≈ 36 h on 8×H200 ≈ 290 H200-GPU-hours.
+- **Ladder (EP, compute-optimal):**
+  - 50M ≈ 10–45 H200-hr · 150M ≈ 100–400 H200-hr → both fit our **local + Stanford A6000/A5000s**
+  - 400M ≈ 0.7k–2.8k H100-hr → **AWS**
+  - **1B short-run validation (~1–2B tokens) ≈ 230–1,700 H100-hr → AWS (the target — an "it scales" point, not a headline 1B)**
+  - 1B full Chinchilla ≈ 4.3k–17k H100-hr → only if we go for a flagship 1B
+- **Ablations** (at the 15–50M debug scale) ≈ 0.5k–3k A6000-hours total → **local**.
+
+So the concrete AWS ask is on the order of a **few thousand H100-hours** for the 400M–1B short-run rungs; the small rungs and ablations we cover ourselves. I can send a finer per-experiment breakdown and a Stage-2 methods note whenever useful.
+
+Thank you again — really grateful for the interest and the offer.
+
+Best,
+Yuren
+
+---
+NOTES (not for sending):
+- Attach: report.pdf (Stage-1 dynamics).
+- Demo excerpt = seed=4 (Lily+Max, cleanest). Alt = seed=7 (Timmy + toy car, full broke→fixed→learned arc).
+- 2.0 is framed as OUR internal legibility threshold (no explicit prior agreement with Ben).
+- Result is the best ~15M debug recipe (val CE 1.97, pema), prompt "Once upon a time,", NOT yet migrated to 50M.
+- Compute numbers from SCELLIER_OUTREACH.md (H200≈H100; ×15–60 EP overhead). diff --git a/docs/outreach/OUTREACH_TARGETS.md b/docs/outreach/OUTREACH_TARGETS.md new file mode 100644 index 0000000..ffbd177 --- /dev/null +++ b/docs/outreach/OUTREACH_TARGETS.md @@ -0,0 +1,199 @@ +# EP analog-hardware collaboration — outreach targets (2026-06-21) +Per-group PhD/PI profiles from 5 research agents. Accuracy discipline: emails only where published or netid on an +official directory; "—" = not public, route via PI (no invented addresses). Verify "current" status before sending — +students graduate. Companion: COLLABORATOR_BRIEF.md (the one-pager), HW_RESEARCH_FINDINGS.md (citations). + +## The cross-cutting framing (true at EVERY group — this is our wedge) +Every group has EITHER the analog substrate OR an on-chip-training piece — **none has backprop-free LOCAL +in-situ update on analog weights**. The EP local-update rule is the genuinely new thing WE bring; everyone else +does inference-only analog MVM, or on-chip *gradient/backprop* training. First-mover gap. Say it plainly. + +## Strategy +PhD-first / cc-PI where there's a clear hands-on student; PI-direct where the group is small or no student fits. +UIUC first (home turf, all 3 layers local). Stanford = Phase-2 warm intro via your student. THU = strongest +substrate, hardest access. + +--- + +## 1. Shanbhag (UIUC ECE) — LEAD. Route: PhD-first, cc Shanbhag. The closest existing substrate. +His JSSC-2018 DIMA chip already did analog MVM + on-chip SGD weight write-back. Current bench is rich but no +single student spans all of {analog-MVM substrate + settling + on-chip learning} — pitch the trio: +- **Soonha Hwang** `soonhah2@illinois.edu` — HIGH, **email first**. Building a *transformer* mixed-signal CIM chip + (28nm DiT accelerator, ESSERC 2025); grad 2028 (multi-year runway). Owns substrate (a)+(b). Caveat: his chip reads + inference-only — in-situ weight update would be new (= our piece). 
+- **Mihir Kavishwar** `mihirvk2@illinois.edu` (publicly listed) — HIGH. Analog-MVM + compute-SNR-optimal ADCs + (CACTUS, arXiv 2507.09776). The readout/SNR layer the equilibrium readout lives or dies on. +- **Vignesh Sundaresha** `vs49@illinois.edu` — HIGH on the *learning* axis. GEARnn in-situ on-edge training + (arXiv 2410.07691). Caveat: algorithm-level, standard training (not EP, not yet a mixed-signal learning circuit). +- Others: Shuo Li (postdoc, analog-CIM characterization — verify still here), Saion Roy (MRAM/resistive CIM + error + compensation — but drifting to security, see Hanumolu note), Kaining Zhou (CIM simulation framework, kainingz@). + +## 2. Mingu Kang (UC SAN DIEGO faculty) — the in-situ-update know-how that LEFT UIUC. Route: PI-direct (peer faculty). +**Built the original DIMA on-chip-SGD-write-back substrate** we cite (PhD 2017 w/ Shanbhag), now PI at UCSD. The +person who most owns "analog MVM + on-chip weight update" anywhere. Strong faculty-level collaborator specifically for +the update-loop piece. (Sujan Gonugondla, the other DIMA-trainer author, → Amazon, industry.) + +## 3. Wenjuan Zhu (UIUC ECE) — device layer. Route: PI-DIRECT (small group, device experts graduated). +Email **wjzhu@illinois.edu**. HONEST FRAMING: their ferroelectric work is **memory/logic (FeFET memory, CAM, +reconfigurable transistors), NOT synaptic weight-update training** — fit is device-platform overlap (vdW / CuInP₂S₆ +FeFETs = nonvolatile, electrically-set, multilevel conductance). Pitch = "extend your FeFETs to in-situ analog +training," do NOT imply they already do it. Name-drop: +- **Junzhe Kang** `junzhek2@illinois.edu` — the CIPS-FeFET / in-memory-computing lead (ACS Nano 2024/2025). On the + graduation boundary (2024 dissertation, still publishing 2026) — verify status. +- **Ye Lin** — current student on the CIPS platform (—, via PI). Alumni (now industry): Zijing Zhao→Apple, Hojoon Ryu→Intel. + +## 4. Hanumolu (UIUC ECE) — converter/control-loop glue. 
Route: PI-DIRECT (no student is a pure data-converter). +Email **hanumolu@illinois.edu**, ask him to route. Reality: his group is a **clocking / high-speed-link / frequency- +reference** shop, not a data-converter shop — no current PhD has ADC/DAC/switched-cap as primary thesis. Best fits if +he points to a student: +- **Mahmoud Khalil** `mkhalil4@illinois.edu` — best converter+loop match: sampling-PLL (ISSCC 2024, 1st author) + + industry ADC + DC-DC converter experience. The settle→nudge feedback loop is his native language. +- **Sujay Patel** `sujaysp2@illinois.edu` — mixed-signal links/equalizers + recovery loops (CICC 2026). +- **Jason (Shuozhen) Liu** `sl111@illinois.edu` (netid single-sourced, verify) — only one stating ADC focus, but junior/no papers. +- NOTE: the most topically-relevant IMC+ADC person, **Saion Roy**, is **Shanbhag's** grad (now security postdoc @ Northeastern), NOT Hanumolu's — don't mis-target. + +## 5. Wong + Raina (STANFORD) — Phase-2 escalation. Route: WARM INTRO via your Stanford student. +Stanford leads **foundry-RRAM-as-weights + on-chip training** (CHIMERA/MINOTAUR) + **NeuRRAM analog-MVM inference** — +but NOT analog in-situ gradient programming (the EP piece is still ours). +- **Jeffrey Yu** `jeffreyy@stanford.edu` (Raina, current) — PRIMARY. On-chip transformer fine-tuning w/ RRAM + (MINOTAUR; 8-bit transformer fine-tuning ISCA 2024). The bridge to "train a transformer on RRAM," still in the building. +- **Shuhan Liu** (Wong, final-year — move fast) — RRAM device/array + edge continual training (IEDM 2024). Loop in for device side. +- Refs (not resident): Kartik Prabhu (CHIMERA/MINOTAUR; prob. → Meta, verify), Weier Wan (NeuRRAM lead → CTO Aizip, industry consult). + +## 6. Tsinghua (THU) — strongest in-situ SUBSTRATE in the world, but backprop-family + hardest access. +Wu/Gao/Qian LEMON lab + Jianshi Tang, School of Integrated Circuits. 
THE clear #1 for fabricated, system-integrated +silicon that closes the weight-write loop ON-CHIP during learning — the inference-only barrier everyone else hits, they've +crossed. BUT (the "not EP-flavored" point, confirmed sharply): every on-chip rule they've shipped is **backprop-family / +NON-local** (Sign-Backprop [Gao et al., Neural Networks 2018]; STELLAR's sign-SGD), **MLP/CNN-scale, never a transformer, +never EP**. Their only energy/attractor touchpoint is a 2015 Hopfield associative-memory (Hebbian recall, not training). +⇒ clean whitespace: we'd bring the first local/EP rule + first in-situ transformer to the one group with write-capable silicon. +- **Jianshi Tang (唐建石)** — **jtang@tsinghua.edu.cn** (ONLY university-published email; Tenured Assoc. Prof + Vice Dean) → + BEST first contact, the device/integration translator. Senior enough to commit, reachable. +- **Bin Gao (高滨)** — technical co-target; owns the on-chip update circuitry EP would repurpose (first-author the Sign-Backprop rule). +- **Huaqiang Wu (吴华强)** — ultimate PI (LEMON lab, http://stor.ime.tsinghua.edu.cn/), hardest to reach cold. +- Builders: Peng Yao (Nature-2020 CNN, postdoc-level), Wenbin Zhang (STELLAR/Science 2023). Chips: Nature 2020 CNN (HYBRID- + trained — only last FC layer in-situ, updates computed in software), STELLAR (Science 2023, full on-chip sign-SGD, 784×100×10 + MLP), Attar (Sci China Inf Sci 2025 — RRAM transformer but INFERENCE-only). HfOx endurance 10^7 cycles (Nat Electron 2024). +- ACCESS caveat: top-3-globally, Nature/Science yearly, many suitors + strategic/IP/scope sensitivities. Cold email MUST lead + with the specific complementary asset (working EP transformer needing exactly their write-capable substrate → offers them the + first local-learning + first in-situ transformer result on their hardware). Warm intro or concrete joint-demo proposal needed. 
+- NB for the EP-native map: the agent flagged **Williams–Kumar–Kendall, "Activity-difference training of DNNs using memristor + crossbars," Nature Electronics 2023** — an EP-FAMILY (contrastive) rule ON REAL memristor crossbars = a candidate unicorn bridge; + and **Grollier (CNRS/Thales), "Training an Ising machine with EP," Nat Commun 2024**. (Confirm in the EP-native synthesis.) + +--- + +## 7. EP-NATIVE complement groups — the "EP-flavored" people (fills the gap EVERY substrate group has) +The substrate groups (1-6) are EP-poor by design (= our first-mover gap). The EP-native community is a SEPARATE +world — EP-rich, mostly device-light. Pair one of each. +- **UPenn physical-learning** (Durian + Andrea Liu faculty; **Dillavou** `dillavou@sas.upenn.edu` hands-on, now part-ARIA; + **Menny Stern → own group @ AMOLF Amsterdam**): **Coupled Learning** = EP's experimental sibling, built on real + self-adjusting analog circuits (PNAS 2024 "Machine learning without a processor"). The closest real-hardware analog to our method. +- **Benjamin Scellier** (EP CO-INVENTOR; now **Rain / Rain AI UK**, ARIA-funded; `benjamin@rain.ai`, bscellier.github.io): + source authority on EP estimators + energy-based formulations = squarely our AEP/holomorphic domain. HIGH (industry posture). +- **Axel Laborieux** (→ **Huawei Zurich**; laborieux-axel.github.io) + **Friedemann Zenke** (FMI Basel, senior gateway): + co-invented BOTH holomorphic EP (NeurIPS 2022) AND asymmetric EP / Jacobian homeostasis (ICLR 2024) — **literally the two + ingredients we build on.** THE algorithm-theory complement. + Maxence Ernoult (→ DeepMind), the estimator-bias-scaling track. +- **Dmitry Krotov** (MIT-IBM): the **Energy Transformer** (NeurIPS 2023) IS our forward model (energy→fixed-point attention) — + trained by autodiff; "train it without backprop" is exactly our EP contribution. HIGH theory complement, no hardware. + +## 8. 
UNICORNS — EP-native AND real updatable device (the rare bridges) +- **Julie Grollier** (CNRS, **Laboratoire Albert Fert**, Paris-Saclay; neurophysics.cnrs-thales.fr): **ran EP on PHYSICAL + hardware** — "Training an Ising machine with EP," Nat Commun 2024 (D-Wave); spintronic-native. The cleanest unicorn: EP-on- + hardware experience + device substrate. **Single best EP-native complement.** +- **Yi / Kendall / Williams / Kumar** — "Activity-difference training of DNNs using memristor crossbars," **Nature Electronics + 2023**: contrastive two-phase (EP-flavored) training executed on a **fabricated 64×64 RRAM chip** = the closest "EP-on-silicon" + that exists, and it overlaps our RRAM/CIM world. Suhas Kumar @ **Sandia**, R. Stanley Williams @ **Texas A&M**, Kendall @ Rain. +- **Damien Querlioz** (CNRS, **C2N** Paris-Saclay; `damien.querlioz@c2n.upsaclay.fr`): EP-algorithm-native + a real RRAM fab + pipeline (CEA-Leti / Elisa Vianello) — near-unicorn (his fabricated learning demos are Bayesian, not yet EP). **Most credible + Western partner to actually FABRICATE EP on a crossbar.** +- ⇒ **Université Paris-Saclay (Grollier + Querlioz, who co-author) = the global EP-on-hardware cluster.** +- Also EP+device-intent (sim-now): Talatchian/Peters (SPINTEC Grenoble, EP-under-analog-noise), Alex Gower (Cambridge/Nokia, + EP on oscillator Ising machines), Kaushik Roy/Sumeet Gupta (Purdue — EP algo + CIM/spintronic devices, not yet fused). +- NOT EP (don't chase): IBM (Ambrogio/Burr/Sebastian, PCM backprop), Ielmini/PoliMi, McMahon/Wright (physics-aware backprop, + not EP), Marquardt (Hamiltonian echo). Rain AI the COMPANY = distressed/acquihire-pending → engage Kendall as an individual. + +## REVISED pairing recommendation (the answer to "not EP-flavored") +- **Lead EP-native = Grollier** (unicorn) × a crossbar substrate (**Tsinghua-Wu/Gao** most mature, **Wong/Raina** most reachable). 
+- **Western fab route = Querlioz + Vianello (CEA-Leti)** × **Wenjuan Zhu** — both real updatable-device fabs; Querlioz brings EP fluency. +- **Algorithm de-risk layer = Laborieux/Zenke** — own the holo + asymmetric-EP bias theory that decides if EP survives analog noise + on ANY substrate. **Shanbhag** pairs best here as the systems/CIM error-tolerance partner (his expertise), not the device fab. +- **High-value individual outreach: Scellier** + the **Kumar/Kendall activity-difference team** (your proof contrastive-equilibrium + training already runs on a real memristor chip). + +--- + +## Recommended sequencing +1. **Shanbhag trio first** (Hwang+Kavishwar, cc Shanbhag; mention Sundaresha) — home dept, closest substrate, richest bench. + Consider a Zhai-brokered/in-person intro instead of cold email (same department = warmest path). +2. **Parallel UIUC PI-direct**: Wenjuan Zhu (device, "extend your FeFETs") + Hanumolu (glue, "point me to a converter student"). +3. **Mingu Kang (UCSD)** — peer-faculty email for the in-situ-update expertise specifically. +4. **Stanford warm intro** (Jeffrey Yu) via your student — Phase 2. +5. **THU** — only if/when a connection exists; else cite as the substrate precedent, not a near-term collaborator. + +--- + +## ⏸ STATUS (2026-06-21): HOLD — DO NOT SEND until the 33M demo + scaling dossier +**User decision (CONFIRMED 2026-06-21): outreach is gated on the ~33M "能看" demo + scaling-law dossier (task #15) — NOT the +C512/2.09 milestone.** Send nothing until there's a readable-generation ("能看") demo + a scaling-law dossier to lead with. +(C512 EP descending past the 2.09 wall toward ~1.8 is a prerequisite step that validates the recipe, NOT the outreach gate — +the gate is the bigger, showable 33M artifact.) Until then: no contact with anyone above. +When the bar is met: set sender title, render COLLABORATOR_BRIEF.pdf, attach + ept_method_intro.pdf, optionally ask Prof. Zhai +for a warm intro to Shanbhag/Hanumolu first. 
All profiles/contacts/pairing/drafts above are durable and ready. + +## Email drafts (READY, gated — copy-paste when the bar is met) + +### Draft 1 — Shanbhag group · To: Soonha Hwang (soonhah2@), Mihir Kavishwar (mihirvk2@) · cc: Shanbhag +Subject: Backprop-free (Equilibrium-Propagation) transformer training — a fit for your in-memory CIM work? + +Hi Soonha and Mihir, + +I'm Yuren Hao, working on backprop-free training in ChengXiang Zhai's group at UIUC. We've gotten Equilibrium Propagation (EP) +to train a transformer as a physical equilibrium system: the forward pass is a damped relaxation that settles to a fixed point, +and the weight update is local — computed from a free vs. a nudged settle, no backpropagation. In simulation the EP gradient +matches backprop (cosine ≈ 1) and comes within a small gap of a same-parameter backprop-trained transformer. + +Your DiT memory-in-compute accelerator — and Mihir's compute-SNR-optimal ADC work — is the closest existing substrate I've +found to what this needs: analog MVM + a settling loop. The one new ingredient is EP's in-situ local weight update, which is +actually a simpler thing to put on a crossbar than on-chip backprop. + +Could I grab 20 minutes to explore whether a small demo — one equilibrium-transformer block on a CIM substrate + our EP control +loop — is feasible? A one-page overview and short method note are attached. (cc'ing Prof. Shanbhag.) + +Thanks, Yuren + +### Draft 2 — Wenjuan Zhu · To: wjzhu@illinois.edu (PI-direct) +Subject: Extending your vdW / CuInP₂S₆ FeFETs to in-situ-trainable analog weights? + +Dear Prof. Zhu, + +I'm Yuren Hao, working on backprop-free training in ChengXiang Zhai's group at UIUC. We train a transformer as a physical +equilibrium system using Equilibrium Propagation — no backprop — where learning is a local update from two settled states, and +the key hardware need is an analog weight whose conductance can be updated in-situ during training. 
+ +Your group's vdW / CuInP₂S₆ ferroelectric reconfigurable devices — nonvolatile, electrically programmable, multilevel +conductance — look like a strong fit for exactly that role. I realize that work has centered on memory and logic rather than +training, so I'd be keen to explore whether those devices could serve as in-situ-trainable analog synapses for an EP-trained network. + +Would you have 20 minutes for me to share what we have (a working EP-transformer in simulation + an analog-noise model) and +discuss feasibility? One-page overview and a method note attached. + +Best, Yuren + +### Draft 3 — Hanumolu · To: hanumolu@illinois.edu (PI-direct, ask to route) +Subject: Mixed-signal converter / control-loop partner for an analog EP-training demo? + +Dear Prof. Hanumolu, + +I'm Yuren Hao, working on backprop-free training in ChengXiang Zhai's group at UIUC. We're building toward an analog hardware +demo of Equilibrium Propagation — training a transformer as a physical equilibrium system, where the forward pass is an analog +settling loop and the weight update is local (no backprop). + +Beyond the in-memory compute array, this needs a mixed-signal layer your group is ideally suited for: fast ADC/DAC to read the +settled state and apply a small "nudge," and switched-cap integrators for the relaxation/control loop. Since that's converter / +feedback-loop expertise rather than the ML side, could you point me to a student who might be interested — or spare 15 minutes +to discuss? + +A one-pager and method note are attached. Best, Yuren + +### Title TODO (all drafts): set Yuren's title ("PhD student" / "researcher") before sending. 
diff --git a/docs/outreach/SCELLIER_OUTREACH.md b/docs/outreach/SCELLIER_OUTREACH.md new file mode 100644 index 0000000..35ad9e0 --- /dev/null +++ b/docs/outreach/SCELLIER_OUTREACH.md @@ -0,0 +1,62 @@ +# Scellier (Rain AI) outreach — 2-stage framing, compute estimate, invite plan + +**Goal:** present the work to Ben Scellier (benjamin@rain.ai, Rain AI; EP inventor; VF-EP author) as a **2-stage project**, attach a PDF report once the aep-dynamics figures are done, invite him for **any feedback** (experiments / framing / writing), and — when complete — invite him as **one of the senior authors**. No venue planned yet. Email thread already open (he offered AWS credits + asked about collaboration depth; he's likely waiting on our ept update / methods note). + +## The 2-stage framing (this is the pitch) + +- **Stage 1 — Dynamics & convergence of EP with asymmetric weights (aep-dynamics).** Current AsymEP literature ASSUMES convergence. We characterize **WHEN** the non-conservative VF actually converges, **WHY** it fails (a supercritical Hopf bifurcation of the free-phase operator), and the **steerings** that guarantee convergence (spectral / floss / adaptleak), with their trade-offs. **Almost done. MLP/CNN/RNN only — does NOT touch the transformer or LLM scaling** (keeps ept un-exposed). This is the self-contained piece that gets Ben as senior author — it's his wheelhouse and it directly answers the convergence pain point he already knows about (the "Jacobian control explodes, kills the ~1-cosine gradient" issue → now a characterized Hopf + a tunable steering trade-off, a much stronger story). +- **Stage 2 — Scaling EP-trained transformers to LLMs (ept).** Standard non-conservative attention (Q≠K≠V≠O) + adjoint-consistent nudging. 15M debug model already within-noise of BP; 50M on TinyStories looks great. Next = **scale ladder + ablations**, which need compute → the AWS / collaborator-GPU conversation. This is the ongoing bigger collaboration. 
+
+Clean separation: **Stage 1 = paper + senior-authorship; Stage 2 = compute collaboration.**
+
+## Compute estimate (rough — the EP overhead range dominates the uncertainty)
+
+**EP overhead over BP:** ~15–60× per token (relaxation steps × nudged-phase evaluations); the **nudging-phase design is the lever** — getting it to ~15× (vs 60×) is what makes 1B affordable. This constant is exactly what physical settling removes (the hardware argument).
+
+**Anchor:** 1B LLM, BP, ≈ 36 h on 8×H200 = **288 H200-GPU-hrs** (Chinchilla-ish ~20B tokens).
+**GPU throughput factors (bf16, rough):** H200 ≈ H100 ≈ **6.4× A6000** ≈ **~11× A5000**.
+
+**Scale ladder (compute-optimal: tokens ∝ N, so compute ∝ N²):**
+
+| rung | BP (H200-hr) | EP (H200-hr, 15–60×) | EP (A6000-hr) | run where |
+|---|---|---|---|---|
+| 50M | ~0.7 | 11–43 | 70–280 | local/Stanford A6000/A5000 |
+| 150M | ~6.5 | 100–390 | 640–2500 | local/Stanford A6000 |
+| 400M | ~46 | 700–2800 | 4500–18000 | borderline → AWS |
+| 1B (full Chinchilla) | 288 | **4300–17000** | 28000–110000 (≈ years on A6000×2) | **AWS H100** |
+| 1B (SHORT validation, ~1–2B tok) | ~15–30 | **230–1700** | — | AWS H100, affordable |
+
+**Ablations** (debug scale 15–50M on TinyStories, small fixed corpus): ~5–25 EP-H200-hr each (~30–160 A6000-hr); ~15–20 runs → **~500–3000 A6000-hr total** → fits the 6 local/Stanford cards (~5 A6000-equiv) over ~1–3 weeks.
+
+**Resource mapping:**
+- **Ablations + small rungs (50–150M):** local A6000×2 + Stanford A6000×2 + A5000×2 (on-demand; more queued). Weeks of wall-clock.
+- **400M–1B:** too big for A6000s (1B-full EP = 28k–110k A6000-hr). → **Ben's AWS H100.** Full-Chinchilla 1B = **4300–17000 H100-hr** (≈ 540–2100 hrs on one 8×H100 node ≈ 22–88 days; faster multi-node). **SHORT-run 1B validation (1–2B tokens) ≈ 230–1700 H100-hr** — the affordable path, likely what the paper needs (an "it scales" point, not a full SOTA 1B).
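The ladder figures above follow from three stated anchors: 288 H200-GPU-hrs for the 1B BP run, compute ∝ N², and a 15–60× EP overhead. The sketch below reproduces that arithmetic so the rungs can be recomputed if an anchor changes. The function and constant names are illustrative and not part of the repository, and the numbers are the document's rough estimates, not measurements.

```python
# Minimal sketch of the scale-ladder arithmetic, using the document's anchors.
# All names here are hypothetical helpers, not code from the ept repository.

ANCHOR_PARAMS_B = 1.0       # anchor model size, in billions of parameters
ANCHOR_BP_H200_HR = 288.0   # 36 h on 8 x H200 for the 1B BP run
EP_OVERHEAD = (15.0, 60.0)  # stated EP-over-BP cost range per token
A6000_PER_H200 = 6.4        # stated rough bf16 throughput ratio


def bp_h200_hours(params_b):
    """BP cost of a compute-optimal run, assuming compute scales as N^2."""
    return ANCHOR_BP_H200_HR * (params_b / ANCHOR_PARAMS_B) ** 2


def ep_h200_hours(params_b):
    """(low, high) EP cost in H200-hours for the stated overhead range."""
    bp = bp_h200_hours(params_b)
    return bp * EP_OVERHEAD[0], bp * EP_OVERHEAD[1]


def ladder(rungs_b):
    """One row per rung: BP cost, EP range, and the A6000 equivalent."""
    rows = []
    for n in rungs_b:
        lo, hi = ep_h200_hours(n)
        rows.append({
            "params_b": n,
            "bp_h200_hr": bp_h200_hours(n),
            "ep_h200_hr": (lo, hi),
            "ep_a6000_hr": (lo * A6000_PER_H200, hi * A6000_PER_H200),
        })
    return rows


if __name__ == "__main__":
    for row in ladder([0.05, 0.15, 0.4, 1.0]):
        lo, hi = row["ep_h200_hr"]
        a_lo, a_hi = row["ep_a6000_hr"]
        print(f"{row['params_b']:>5}B  BP {row['bp_h200_hr']:8.1f}  "
              f"EP {lo:8.0f}-{hi:<8.0f}  A6000 {a_lo:8.0f}-{a_hi:<8.0f}")
```

For the 1B rung this gives 4320 to 17280 H200-hr, which the table rounds to 4300–17000. Because the overhead multiplies every rung, moving the low end of `EP_OVERHEAD` is the quickest way to see how sensitive the AWS ask is to the nudging-phase design.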
+ +**The ask to Ben (Stage 2):** AWS H100 for the 400M–1B rungs — order **~few thousand H100-hrs** for the short-run ladder (5k–17k if full Chinchilla 1B). Note the cost is **highly sensitive to the EP overhead** (15× vs 60×) → optimizing the nudging phase is the priority lever. + +## Invite / deliverable plan + +1. Finish the aep-dynamics figures (P1 cover [Yuren] · P2 phenomenon ✅ · P3 MLP ✅ · P4 dropped→text · P5 cross-arch [in progress, 200-seed bands running]). +2. Write up Stage-1 dynamics → **attach a PDF report** to Ben. +3. Email: report the 2-stage framing + ept progress (resreg_warm best **1.9313**, basically at the agreed loss<2.0 success bar) + the Stage-2 compute estimate above. +4. Invite: "we'd value any feedback — experiments, framing, writing — and, when complete, would be honored to have you as one of the senior authors." No venue committed yet. + +## CET delta — what ept adds over Scellier's own CET (openreview Qrfml76eWJ) + +**CET (Høier, Kerjan, Scellier, ICLR'26 AM workshop):** an ENERGY (conservative) transformer = convergent Energy-Transformer. Attention is modern-Hopfield energy: `A=Q·K`, `E^att=-1/γ Σlog Σexp(γA)`, force = ∇E → **conservative**. Allows Q≠K (separate W^Q,W^K) **but NO free V/O projections** (values = tokens via the energy gradient; no W^V/W^O). Vanilla centered EP (Laborieux), T1=150 free / T2=5 nudge. Task = CELEBA masked-image completion (MSE); single energy block, temporally unrolled. EP MSE 0.01422 ≈ TBPTE 0.01376. Convergence is FREE (gradient flow) but it's an energy surrogate, NOT the real transformer; even its convergence is hand-waved (App. A: sometimes saddles, no proof PGD descends, relies on same T=150 at train/test). + +**ept adds:** +1. **Real non-conservative attention (independent Q/K/V/O)** — the actual LLM architecture. CET drops V/O to keep attention a gradient field; we keep them → Jacobian non-normal (|Jv−Jᵀv|/|Jv|≈1.4), NOT energy descent → convergence NOT free. +2. 
**Adjoint-consistent nudging (−2A_J)** — vanilla EP (CET's) gives the WRONG gradient on non-gradient dynamics; our correction recovers cosine≈1 with BPTT. Extends EP from energy-based → non-conservative dynamics. +3. **Convergence characterization + steering (aep-dynamics)** — CET gets convergence by construction (+ hand-waves it); we characterize the real instability (supercritical Hopf as attention becomes expressive) and steer it (spectral/floss/adaptleak). +4. **Language + scale + depth** — CET = single block, image completion, CELEBA. ept = real SLM (TinyStories→1B), multi-block, language; 50M within-noise of BP. + +**Honest trade:** ept GIVES UP the energy/Lyapunov interpretation (no global energy → the convergence problem is the price). Complementary: CET = EP for the energy-restricted transformer (clean); ept = EP for the REAL non-conservative transformer (full architecture, convergence earned). + +**Pitch to Scellier = build-on-CET:** "you showed EP trains an energy-restricted transformer; we extend it to the full non-conservative transformer (independent Q/K/V/O) via a corrected nudging + a convergence theory, at LM scale." He's a CET author → instantly legible + on his hardware mission. + +## Status anchors (so this survives compaction) +- Ben thread: open since 2026-06; he offered AWS credits + asked compute estimate + collaboration depth; awaiting our update. +- Original success bar agreed with Ben: **transformer val loss < 2.0**. ept resreg_warm now best 1.9313 (≈ at the bar). +- aep-dynamics = Stage 1 (dynamics, MLP/CNN/RNN, no transformer). ept = Stage 2 (transformer scaling). +- Related: [[hw-outreach-plan-gated]] (Scellier already on the theory-side outreach list). 
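The asymmetry ratio |Jv−Jᵀv|/|Jv| quoted in the CET delta can be illustrated on a toy matrix. A conservative force field has a symmetric Jacobian (it is the Hessian of an energy), so the ratio is zero. An unstructured Gaussian matrix, used here only as a stand-in for a non-conservative operator, should give a ratio near √2. This is a sketch under those assumptions: the helper names are hypothetical, and it is not the probe used to measure the transformer's Jacobian.

```python
# Toy illustration of the asymmetry ratio |Jv - J^T v| / |Jv|.
# Pure standard library; the matrices are random stand-ins, not model Jacobians.
import math
import random


def matvec(m, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def asymmetry_ratio(jac, probes):
    """Mean of |Jv - J^T v| / |Jv| over the given probe vectors."""
    jac_t = transpose(jac)
    ratios = []
    for v in probes:
        jv = matvec(jac, v)
        jtv = matvec(jac_t, v)
        diff = [a - b for a, b in zip(jv, jtv)]
        ratios.append(norm(diff) / norm(jv))
    return sum(ratios) / len(ratios)


def random_matrix(n, rng):
    """n x n matrix with i.i.d. standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


def symmetrize(m):
    """Symmetric part of m, standing in for a conservative (Hessian) Jacobian."""
    n = len(m)
    return [[0.5 * (m[i][j] + m[j][i]) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    n = 128
    probes = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(8)]
    free = random_matrix(n, rng)
    print("conservative (symmetric) ratio:", asymmetry_ratio(symmetrize(free), probes))
    print("unstructured ratio:", asymmetry_ratio(free, probes))
```

In a real model the products Jv and Jᵀv would come from forward-mode and reverse-mode automatic differentiation, not from an explicit matrix; the dense version here is only meant to make the quantity concrete.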
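diff --git a/docs/paper/PAPER_A_OUTLINE.md b/docs/paper/PAPER_A_OUTLINE.md
new file mode 100644
index 0000000..9b0c0ef
--- /dev/null
+++ b/docs/paper/PAPER_A_OUTLINE.md
@@ -0,0 +1,57 @@
+# Paper A: skeleton (claim architecture + figure list)
+
+## One-sentence claim (the spine)
+
+Transformer language models can be trained end-to-end without backpropagation, by equilibrium
+propagation; we map every cost this incurs — and show each one is denominated in a currency that
+digital simulation pays and physical relaxation does not.
+
+Candidate titles:
+- "Training transformer language models by physical relaxation"
+- "Equilibrium propagation trains language models: the complete cost map"
+- "Backprop-free transformers: what it costs, and why physics doesn't pay it"
+
+## Single claim and downgrade list
+
+CLAIM (ordered by strength of evidence):
+1. Existence: EP trains an attention+FFN LM end-to-end (S0/S1 actually run; the 33M "readable" (能看) demo is the capability anchor).
+2. Mechanism map + recipe: validity-domain thresholds, gate, controller co-design, holo/tracking estimators, guardrails/circuit breakers;
+   each with a quantitative number and a post-mortem record; rule tax 0.041±0.005 (9σ, S0) → the tax-vs-scale trend at S1.
+3. Device tolerance envelope (hardware-aware sim, Rasch/Nat Comm methodology): linear noise law, √N averaging law,
+   90% of the improvement retained under program-verify 6-8 bit + 30% mismatch, necessity of phase locking (the death of tf32).
+4. Cost inversion (Figure 1): E≈700–3000 force evaluations/step ⇒ digital ×100–1000; physical settle ⇒ ~0.
+
+NOT-CLAIM (stated explicitly in the limitations and the main text):
+- No claim of SOTA or of quality above 33M (1B is only a 2k-step stability certificate + cost model);
+- No claim of a physical demo (paper B); no claim of measured energy consumption.
+
+## Figure list
+
+- **Fig 1 (unifying figure)**: EP's requirements (settle / nudge averaging / noise integration / contraction control) × two cost columns
+  (digital simulation, measured ×N vs physical ~free), with our measured number in each row. The whole paper's argument, compressed.
+- Fig 2: architecture and training rule (thick-block force field, two-phase/holomorphic estimators, control-loop block diagram, including gate = learn-enable).
+- Fig 3: S0 final table + ablation ladder (waterfall chart of each stage's contribution along the wall-falsification trajectory 2.55→1.68; multi-seed error bars).
+- Fig 4: scale ladder (S0→S1→33M): tax-vs-scale curve + timeline of failure modes and their corresponding guardrails.
+- Fig 5: 33M "readable" (能看) demo: story samples (EP vs BP side by side) + BPC.
+- Fig 6: hardware twin: linear noise law + √N law + device-twin training curves; inset on the necessity of phase locking (tf32).
+- Appendix figures: optimizer pricing table, collapse post-mortem (fp absorption), AEP long-horizon instability, tracking-estimator probes, etc.
+
+## Sections
+
+1. Intro: motivation for BP-free training (physical learning) + one-sentence claim + Fig 1.
+2. Setup: equilibrium block, VF-EP, AEP; relation to CET/DEQ (two routes to an energy formulation).
+3. The recipe: estimators (holo/clamp-free/selection/tracking) + control laws (res signal, λ controller,
+   gate, circuit breaker), tied together under the theme of "co-design".
+4. Results: full S0 tax bill (multi-seed) → S1 ladder and tax-vs-scale → 33M demo.
+5. Hardware-aware validation: twin + scaling laws + tolerance envelope (methodology cites Rasch/aihwkit).
+6. The inversion: cost model (E×, 1B bill) + one paragraph on BOM-level implications (pointing to paper B).
+7. Limitations & open problems (honest list: S1 rule tax 0.39, estimator lifetime, AEP horizon, pure-analog optimizer tax).
+
+## Submission sequence
+
+33M demo lands (~July) → post to arXiv immediately (priority) → trial submission to Nature proper (desk decision in 1–3 weeks, a free lottery ticket,
+framed in the "capability + inversion" pattern) → if rejected, move seamlessly to ICLR (end of September); the main Nature-family effort is reserved for paper B (physical demo).
+
+## Data source index
+
+All experiment records: METHODS.md (system state) / FINDINGS.md (chronicle) / lt_ep_code (code backup).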
