Initial commit: ept — backprop-free equilibrium transformer (EP)

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
author: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
committer: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
commit: b83947778e2c776f757a07d4719b7ce961d7ed55 (patch)
tree: b9cc01d7adda691d9156d9d04f4fb2f644674e96 /docs/campaign/C512_ROUND2_ABCD.md
1 files changed, 118 insertions, 0 deletions
diff --git a/docs/campaign/C512_ROUND2_ABCD.md b/docs/campaign/C512_ROUND2_ABCD.md
new file mode 100644
index 0000000..65e5282
--- /dev/null
+++ b/docs/campaign/C512_ROUND2_ABCD.md
@@ -0,0 +1,118 @@
+# C512 Round-2 (A/B/C/D) — Spec Status, Results, and What Followed
+
+**Scope.** Status of the last long spec the user ordered (the A/B/C/D round, after the 7-experiment
+campaign in `C512_PLATEAU_CAMPAIGN.md`), its results, the unfinished gaps, and the work that grew out
+of it (this session). Setting: EP-trained equilibrium transformer block, C=512 BPE-4096, TinyStories,
+T=256, B=24. Random ln(4096)=8.318; BP single-block reference (depth=1, 7.48M) best **1.7921** (20k).
+
+Convention: raw data; conclusions are the user's.
+
+---
+
+## The spec (as ordered)
+
+> Next round (do A+B first, then decide N=4 / lock-in / tracking-AEP vs nudged-phase redesign):
+> - **(A)** plateau estimator ablation at a fixed ckpt — holo N=2/4/8, r=.02–.2, T2=20–120 + adaptive,
+>   tracking-AEP, vs BPTT150/400/800.
+> - **(B)** navg multi-batch one-step sweep (navg 1–16): variance-or-bias test.
+> - **(C)** stable BPTT+controller rescue — lr 2e-4–8e-4 × jr_max 8–32, ≥2 seeds, 2–5k cutoff (the
+>   still-missing matched architecture ceiling).
+> - **(D)** per-group late-collapse localization — attn QKVO / FFN fc-pj / LN / emb / head-clamp /
+>   jacreg + diagnostics |a|, |z^β−z*|, |Jv−Jᵀv|, selected-T2.
+
+---
+
+## Per-item status & results
+
+| item | status | result (raw) |
+|---|---|---|
+| **A** estimator ablation | ✅ core done | At the 2.40 plateau ckpt vs BPTT400: **N2-frozen** cos **−0.045**, self-cos **−0.27**, k **4133**; **N4/N8** fix magnitude only (k→30) NOT direction; **r** .05/.10/.20 do nothing; **fixed-T2** 20/80/120 diverge (need snapshot selection); **tracking-AEP (common-mode)** cos **0.997**, self-cos **+0.95**, k **0.9**. Diagnostic at z*: **\|Jv−Jᵀv\|/\|Jv\| = 1.37** (highly non-normal → the frozen-at-z* AEP linearization IS the bias floor). Script: `plateau_ablation.py`. |
+| **B** navg sweep | ✅ done | cos(EP,BPTT) over navg 1/2/4/8/16 = **0.37 / 0.40 / 0.55 / −0.30 / −0.21**; self-cos rises (0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8. ⇒ **deterministic BIAS, not variance** — averaging/restart won't fix it. Script: `navg_sweep.py`. |
+| **C** stable BPTT rescue | 🟡 partial | Ran lr{2e-4, 4e-4, 6e-4} × jr_max{16, 32}, **single seed**, ~2.5k cutoff (`bpttr_lr*_jr*.pt`). Found: **stable exact-gradient BPTT exists** at lr ≤ 6e-4 (never broke); lr 8e-4 broke @ step 1600; **jr_max 16 > 32**. Winner (lr6e-4, jr16) extended to 12k = `bptt_ceiling` (best **2.1055** @ step 7400). Scripts: `bptt_rescue.sh`, `chain_next.sh`. |
+| **D** per-group localization | ✅ core done | Late collapse is **UNIFORM** across WQ/WK/WV/WO, fc/pj, ln1/ln2, tok/pos (all cos −0.03..−0.08, k 3k–18k) — a single shared cause (the corrupted contrast `a` poisons every VF gradient equally), NOT attention/LN-specific. + the \|Jv−Jᵀv\|=1.37 diagnostic. Script: `plateau_ablation.py` (per-group mode). |
+
+### Decision gate (after A+B)
+Per spec, A+B chose the direction: **switch the nudged-phase estimator to common-mode tracking-AEP**
+(N=4/lock-in/navg rejected — A showed they fix magnitude not direction; B showed it's bias not
+variance). The mechanism: the bias floor is the **frozen-at-z\*** AEP linearization; re-linearizing at
+the moving common mode z̄=½(z₊+z₋) restores cos −0.045 → 0.997.
+
+### Follow-on (the spec's "next necessary test": does `--track` descend below 2.40 in TRAINING)
+✅ Done. Warm-start from the 2.40 orphan ckpt, jr_max 8:
+- **TRACK-warm (tracking-AEP): 2.40 → best 2.1628 @ 2500, still descending, stable** (res ~6e-5).
+- **STD-warm control (standard estimator, same recipe): ABORTED @ step 397** (res→0.21).
+- **TRACK-fresh (from scratch): ABORTED @ 968.**
+⇒ tracking-AEP descends past the plateau but is **not a from-scratch drop-in**; recipe is **two-phase
+warm-track** (std estimator → ~2.40, then switch to tracking-AEP). Continued (`trkcont`) → best **2.1116**.
+
+---
+
+## Unfinished gaps (do not change conclusions, but were specified)
+1. **C — ≥2 seeds:** only single-seed runs. The one item that affects the rigor of the "matched
+   architecture ceiling" claim.
+2. **C — jr_max=8** not included in the BPTT rescue grid (only 16/32; exp5 covered jr_max 8 for EP).
+3. **A — BPTT800** reference not run separately (BPTT150-vs-400 cos=1.0 from exp1 answered the horizon
+   question, but the spec listed 800).
+4. **D — sub-diagnostics** |a|, |z^β−z*|, and a head-clamp arm not individually logged.
+
+---
+
+## What followed (this session — beyond the A/B/C/D spec)
+
+These grew out of the round; they were **not** in the spec.
+
+1. **bptt_ceiling (architecture-ceiling run, lr6e-4 jr16, 12k).** Best **2.1055** @ step 7400, then
+   **controller-hijack blowup** @ ~8800 (jr→16, val CE→3.4), lodged at ~3.3 through 12000 (jr pinned at
+   16 the entire tail, res bounded 2–5e-3 — never recovered). Logs: `bptt_ceiling.log`.
+
+2. **BPTT-vs-BP "slow vs misaligned" diagnosis** (`lr_overlay.py`). Plotting val-CE vs cumulative-lr
+   (∫lr, Adam path-length proxy): per unit path-length **BPTT meets/beats BP down to 2.11** (cumL ratio
+   1.0→0.33) — the raw-step slowness is the 5× lr cap (6e-4 vs BP's 3e-3), not inefficiency. BPTT never
+   reached BP's 1.74 because it **blew up before converging** (stability-limited), not capacity. S0
+   evidence that capacity is fine when it converges: **BPTT-ctl 1.635 ≈ BP 1.610** (gap 0.025). A
+   weight-import capacity probe was designed but **deemed confounded** (5 formula gaps: damping halves
+   z*, coupled-vs-sequential FFN, attn biases, qknorm, final-LN) and not run.
+
+3. **Controller-hijack diagnosis** (`blowup_diag.py`, at the 2.1055 ckpt). Decomposing the update into
+   CE and jacreg terms: **cos(g_CE, g_jac) = −0.008** (orthogonal); at jr=16, **\|jr·g_jac\|/\|g_CE\| =
+   3.25** and **cos(g_CE, total) = 0.288** — i.e. ~71% of the update is the penalty, not loss descent.
+   Per-group, the penalty over-drives **LN gains 34×**, attn 3.2×. res stays bounded throughout ⇒ NOT a
+   relaxation blowup; a controller-penalty-driven weight-space detour. Trigger: near the optimum res
+   settled ~1.8e-3 (> target 1.5e-3) → controller ramped jr to max → hijack.
+
+4. **"Remove/replace the penalty" structural arms** (running now, all BPTT, base = bptt_ceiling recipe):
+   - `freeze_floor` (GPU1): `--jacreg 0.1 --jr_floor 0.1 --jr_max 0.1` → jr frozen at 0.1, no adaptive ramp.
+   - `specnorm` / Lipschitz (GPU0): `--jacreg 0 --specnorm --rho 0.9` → spectral-norm attention weights ≤0.9, off-gradient, no penalty.
+   - Queued: `no-penalty` (`--jacreg 0`), `energy` (`--attn_mode energy`), `mono` (`--attn_mode mono`).
+   - Watcher `watch_arms.sh` fires when both finish. Logs: `freeze_floor.log`, `specnorm.log`.
+
+5. **DEQ training-stability literature review** (deep-research, 24 sources, 25/25 claims verified).
+   Three families: (i) **soft Jacobian penalty (Bai 2021)** — author-admitted heuristic + DIGITAL-only
+   (relies on Broyden, deliberately doesn't enforce contraction) → wrong tool for analog; (ii)
+   **structural contraction** (monDEQ / NEMON-l∞ / pcDEQ / BiLipNet) — provable, and monDEQ physically =
+   a passive resistor-diode circuit → **analog-native**; (iii) **energy/Hopfield attention** (softmax =
+   one CCCP energy-descent step, provable monotone energy decrease) → the analog-compatible AND
+   attention-preserving route. Recommended construction (medium conf): energy/Hopfield attention +
+   monotone/NEMON-l∞ contractive FFN, plain Picard/averaged (= our εF Euler) iteration, no root-finder.
+   Full report relayed in chat; open questions = our research frontier.
+
+---
+
+## Where it stands / open items
+- **Round A/B/C/D:** substantially complete (4/4 + the decisive tracking-AEP training follow-on);
+  4 specified gaps remain (above), the load-bearing one being **C multi-seed**.
+- **Reframe (this session):** the explosion is a **DEQ forward-stability** problem (shared by EP and
+  BPTT, since EP≈BPTT gradient), not an EP problem — so the fix belongs to the DEQ/structural toolkit,
+  not EP-literature hacks. The soft penalty is both admitted-heuristic and analog-incompatible.
+- **Running:** freeze_floor, specnorm (~10h) — testing whether removing/replacing the penalty keeps
+  the block stable past the ~9k blowup point and whether it descends below 2.1055.
+- **Next-round formal plan (NOT yet executed):** WSD schedule redesign; fix-bundle re-run; converged
+  BP/BPTT/EP comparison; C512-vs-C≥1024 capability question. Plus (lit-informed) prioritize the
+  `energy` arm; consider NEMON-l∞ over Winston-Kolter for the structural `mono` arm.
+
+## Artifacts
+- Scripts: `/tmp/lt_ep/{plateau_ablation,navg_sweep,bptt_rescue,blowup_diag,lr_overlay,cos_now}.py`,
+  `bptt_rescue.sh`, `chain_next.sh`, `watch_arms.sh`.
+- Logs/ckpts: `/tmp/lt_ep/{bptt_ceiling,freeze_floor,specnorm,trkcont,track_warm,std_warm,bpttr_*}.{log,pt}`.
+- Records: `C512_PLATEAU_CAMPAIGN.md` (round-1, exp1-7), `FINDINGS.md` (2026-06-17 round 2 + 2026-06-18),
+  memory `ep-c512-plateau-mechanism.md`.
author	Yuren Hao <yurenh2@illinois.edu>	2026-07-03 05:56:50 -0500
committer	Yuren Hao <yurenh2@illinois.edu>	2026-07-03 05:56:50 -0500
commit	b83947778e2c776f757a07d4719b7ce961d7ed55 (patch)
tree	b9cc01d7adda691d9156d9d04f4fb2f644674e96 /docs/campaign/C512_ROUND2_ABCD.md