| author | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
|---|---|---|
| committer | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
| commit | b83947778e2c776f757a07d4719b7ce961d7ed55 | |
| tree | b9cc01d7adda691d9156d9d04f4fb2f644674e96 /docs/campaign/C512_ROUND2_ABCD.md | |
Initial commit: ept — backprop-free equilibrium transformer (EP)
Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
Diffstat (limited to 'docs/campaign/C512_ROUND2_ABCD.md')
| -rw-r--r-- | docs/campaign/C512_ROUND2_ABCD.md | 118 |
1 files changed, 118 insertions, 0 deletions
diff --git a/docs/campaign/C512_ROUND2_ABCD.md b/docs/campaign/C512_ROUND2_ABCD.md
new file mode 100644
index 0000000..65e5282
--- /dev/null
+++ b/docs/campaign/C512_ROUND2_ABCD.md
@@ -0,0 +1,118 @@
+# C512 Round-2 (A/B/C/D) — Spec Status, Results, and What Followed
+
+**Scope.** Status of the last long spec the user ordered (the A/B/C/D round, after the 7-experiment
+campaign in `C512_PLATEAU_CAMPAIGN.md`), its results, the unfinished gaps, and the work that grew out
+of it (this session). Setting: EP-trained equilibrium transformer block, C=512 BPE-4096, TinyStories,
+T=256, B=24. Random ln(4096)=8.318; BP single-block reference (depth=1, 7.48M) best **1.7921** (20k).
+
+Convention: raw data; conclusions are the user's.
+
+---
+
+## The spec (as ordered)
+
+> Next round (do A+B first, then decide N=4 / lock-in / tracking-AEP vs nudged-phase redesign):
+> - **(A)** plateau estimator ablation at a fixed ckpt — holo N=2/4/8, r=.02–.2, T2=20–120 + adaptive,
+>   tracking-AEP, vs BPTT150/400/800.
+> - **(B)** navg multi-batch one-step sweep (navg 1–16): variance-or-bias test.
+> - **(C)** stable BPTT+controller rescue — lr 2e-4–8e-4 × jr_max 8–32, ≥2 seeds, 2–5k cutoff (the
+>   still-missing matched architecture ceiling).
+> - **(D)** per-group late-collapse localization — attn QKVO / FFN fc-pj / LN / emb / head-clamp /
+>   jacreg + diagnostics |a|, |z^β−z*|, |Jv−Jᵀv|, selected-T2.
+
+---
+
+## Per-item status & results
+
+| item | status | result (raw) |
+|---|---|---|
+| **A** estimator ablation | ✅ core done | At the 2.40 plateau ckpt vs BPTT400: **N2-frozen** cos **−0.045**, self-cos **−0.27**, k **4133**; **N4/N8** fix magnitude only (k→30) NOT direction; **r** .05/.10/.20 do nothing; **fixed-T2** 20/80/120 diverge (need snapshot selection); **tracking-AEP (common-mode)** cos **0.997**, self-cos **+0.95**, k **0.9**. Diagnostic at z*: **\|Jv−Jᵀv\|/\|Jv\| = 1.37** (highly non-normal → the frozen-at-z* AEP linearization IS the bias floor). Script: `plateau_ablation.py`. |
+| **B** navg sweep | ✅ done | cos(EP,BPTT) over navg 1/2/4/8/16 = **0.37 / 0.40 / 0.55 / −0.30 / −0.21**; self-cos rises (0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8. ⇒ **deterministic BIAS, not variance** — averaging/restart won't fix it. Script: `navg_sweep.py`. |
+| **C** stable BPTT rescue | 🟡 partial | Ran lr{2e-4, 4e-4, 6e-4} × jr_max{16, 32}, **single seed**, ~2.5k cutoff (`bpttr_lr*_jr*.pt`). Found: **stable exact-gradient BPTT exists** at lr ≤ 6e-4 (never broke); lr 8e-4 broke @ step 1600; **jr_max 16 > 32**. Winner (lr6e-4, jr16) extended to 12k = `bptt_ceiling` (best **2.1055** @ step 7400). Scripts: `bptt_rescue.sh`, `chain_next.sh`. |
+| **D** per-group localization | ✅ core done | Late collapse is **UNIFORM** across WQ/WK/WV/WO, fc/pj, ln1/ln2, tok/pos (all cos −0.03..−0.08, k 3k–18k) — a single shared cause (the corrupted contrast `a` poisons every VF gradient equally), NOT attention/LN-specific. + the \|Jv−Jᵀv\|=1.37 diagnostic. Script: `plateau_ablation.py` (per-group mode). |
+
+### Decision gate (after A+B)
+Per spec, A+B chose the direction: **switch the nudged-phase estimator to common-mode tracking-AEP**
+(N=4/lock-in/navg rejected — A showed they fix magnitude not direction; B showed it's bias not
+variance). The mechanism: the bias floor is the **frozen-at-z\*** AEP linearization; re-linearizing at
+the moving common mode z̄=½(z₊+z₋) restores cos −0.045 → 0.997.
+
+### Follow-on (the spec's "next necessary test": does `--track` descend below 2.40 in TRAINING)
+✅ Done. Warm-start from the 2.40 orphan ckpt, jr_max 8:
+- **TRACK-warm (tracking-AEP): 2.40 → best 2.1628 @ 2500, still descending, stable** (res ~6e-5).
+- **STD-warm control (standard estimator, same recipe): ABORTED @ step 397** (res→0.21).
+- **TRACK-fresh (from scratch): ABORTED @ 968.**
+⇒ tracking-AEP descends past the plateau but is **not a from-scratch drop-in**; recipe is **two-phase
+warm-track** (std estimator → ~2.40, then switch to tracking-AEP). Continued (`trkcont`) → best **2.1116**.
+
+---
+
+## Unfinished gaps (do not change conclusions, but were specified)
+1. **C — ≥2 seeds:** only single-seed runs. The one item that affects the rigor of the "matched
+   architecture ceiling" claim.
+2. **C — jr_max=8** not included in the BPTT rescue grid (only 16/32; exp5 covered jr_max 8 for EP).
+3. **A — BPTT800** reference not run separately (BPTT150-vs-400 cos=1.0 from exp1 answered the horizon
+   question, but the spec listed 800).
+4. **D — sub-diagnostics** |a|, |z^β−z*|, and a head-clamp arm not individually logged.
+
+---
+
+## What followed (this session — beyond the A/B/C/D spec)
+
+These grew out of the round; they were **not** in the spec.
+
+1. **bptt_ceiling (architecture-ceiling run, lr6e-4 jr16, 12k).** Best **2.1055** @ step 7400, then
+   **controller-hijack blowup** @ ~8800 (jr→16, val CE→3.4), lodged at ~3.3 through 12000 (jr pinned at
+   16 the entire tail, res bounded 2–5e-3 — never recovered). Logs: `bptt_ceiling.log`.
+
+2. **BPTT-vs-BP "slow vs misaligned" diagnosis** (`lr_overlay.py`). Plotting val-CE vs cumulative-lr
+   (∫lr, Adam path-length proxy): per unit path-length **BPTT meets/beats BP down to 2.11** (cumL ratio
+   1.0→0.33) — the raw-step slowness is the 5× lr cap (6e-4 vs BP's 3e-3), not inefficiency. BPTT never
+   reached BP's 1.74 because it **blew up before converging** (stability-limited), not capacity. S0
+   evidence that capacity is fine when it converges: **BPTT-ctl 1.635 ≈ BP 1.610** (gap 0.025). A
+   weight-import capacity probe was designed but **deemed confounded** (5 formula gaps: damping halves
+   z*, coupled-vs-sequential FFN, attn biases, qknorm, final-LN) and not run.
+
+3. **Controller-hijack diagnosis** (`blowup_diag.py`, at the 2.1055 ckpt). Decomposing the update into
+   CE and jacreg terms: **cos(g_CE, g_jac) = −0.008** (orthogonal); at jr=16, **\|jr·g_jac\|/\|g_CE\| =
+   3.25** and **cos(g_CE, total) = 0.288** — i.e. ~71% of the update is the penalty, not loss descent.
+   Per-group, the penalty over-drives **LN gains 34×**, attn 3.2×. res stays bounded throughout ⇒ NOT a
+   relaxation blowup; a controller-penalty-driven weight-space detour. Trigger: near the optimum res
+   settled ~1.8e-3 (> target 1.5e-3) → controller ramped jr to max → hijack.
+
+4. **"Remove/replace the penalty" structural arms** (running now, all BPTT, base = bptt_ceiling recipe):
+   - `freeze_floor` (GPU1): `--jacreg 0.1 --jr_floor 0.1 --jr_max 0.1` → jr frozen at 0.1, no adaptive ramp.
+   - `specnorm` / Lipschitz (GPU0): `--jacreg 0 --specnorm --rho 0.9` → spectral-norm attention weights ≤0.9, off-gradient, no penalty.
+   - Queued: `no-penalty` (`--jacreg 0`), `energy` (`--attn_mode energy`), `mono` (`--attn_mode mono`).
+   - Watcher `watch_arms.sh` fires when both finish. Logs: `freeze_floor.log`, `specnorm.log`.
+
+5. **DEQ training-stability literature review** (deep-research, 24 sources, 25/25 claims verified).
+   Three families: (i) **soft Jacobian penalty (Bai 2021)** — author-admitted heuristic + DIGITAL-only
+   (relies on Broyden, deliberately doesn't enforce contraction) → wrong tool for analog; (ii)
+   **structural contraction** (monDEQ / NEMON-l∞ / pcDEQ / BiLipNet) — provable, and monDEQ physically =
+   a passive resistor-diode circuit → **analog-native**; (iii) **energy/Hopfield attention** (softmax =
+   one CCCP energy-descent step, provable monotone energy decrease) → the analog-compatible AND
+   attention-preserving route. Recommended construction (medium conf): energy/Hopfield attention +
+   monotone/NEMON-l∞ contractive FFN, plain Picard/averaged (= our εF Euler) iteration, no root-finder.
+   Full report relayed in chat; open questions = our research frontier.
+
+---
+
+## Where it stands / open items
+- **Round A/B/C/D:** substantially complete (4/4 + the decisive tracking-AEP training follow-on);
+  4 specified gaps remain (above), the load-bearing one being **C multi-seed**.
+- **Reframe (this session):** the explosion is a **DEQ forward-stability** problem (shared by EP and
+  BPTT, since EP≈BPTT gradient), not an EP problem — so the fix belongs to the DEQ/structural toolkit,
+  not EP-literature hacks. The soft penalty is both admitted-heuristic and analog-incompatible.
+- **Running:** freeze_floor, specnorm (~10h) — testing whether removing/replacing the penalty keeps
+  the block stable past the ~9k blowup point and whether it descends below 2.1055.
+- **Next-round formal plan (NOT yet executed):** WSD schedule redesign; fix-bundle re-run; converged
+  BP/BPTT/EP comparison; C512-vs-C≥1024 capability question. Plus (lit-informed) prioritize the
+  `energy` arm; consider NEMON-l∞ over Winston-Kolter for the structural `mono` arm.
+
+## Artifacts
+- Scripts: `/tmp/lt_ep/{plateau_ablation,navg_sweep,bptt_rescue,blowup_diag,lr_overlay,cos_now}.py`,
+  `bptt_rescue.sh`, `chain_next.sh`, `watch_arms.sh`.
+- Logs/ckpts: `/tmp/lt_ep/{bptt_ceiling,freeze_floor,specnorm,trkcont,track_warm,std_warm,bpttr_*}.{log,pt}`.
+- Records: `C512_PLATEAU_CAMPAIGN.md` (round-1, exp1-7), `FINDINGS.md` (2026-06-17 round 2 + 2026-06-18),
+  memory `ep-c512-plateau-mechanism.md`.
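
Reader's note on the controller-hijack numbers in the diff above: the three reported figures (cos(g_CE, g_jac) = −0.008, |jr·g_jac|/|g_CE| = 3.25, cos(g_CE, total) = 0.288) can be cross-checked with plain vector algebra. The sketch below is not taken from `blowup_diag.py`. It assumes the total update is the plain sum g_CE + jr·g_jac, with no optimizer preconditioning, and the helper name `cos_with_total` is hypothetical.

```python
import math


def cos_with_total(ratio: float, cos_ab: float) -> float:
    """Return cos(a, a + b) given ratio = |b|/|a| and cos_ab = cos(a, b).

    Assumption (not confirmed by the source): the total update is the
    plain sum a + b, where a = g_CE and b = jr * g_jac.
    """
    # Work in units of |a|: a.(a+b) = 1 + ratio*cos_ab
    dot = 1.0 + ratio * cos_ab
    # |a+b|^2 = 1 + ratio^2 + 2*ratio*cos_ab
    norm = math.sqrt(1.0 + ratio ** 2 + 2.0 * ratio * cos_ab)
    return dot / norm


# Values as reported in the controller-hijack diagnosis.
ratio = 3.25          # |jr * g_jac| / |g_CE| at jr = 16
cos_ce_jac = -0.008   # cos(g_CE, g_jac)

predicted = cos_with_total(ratio, cos_ce_jac)
print(f"predicted cos(g_CE, total) = {predicted:.3f} (reported: 0.288)")
```

Under that assumption the predicted cosine lands within rounding distance of the reported 0.288, so the three figures are mutually consistent. If the script measures the update after Adam scaling rather than the raw gradient sum, this identity would only hold approximately.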
