# C512 Round-2 (A/B/C/D) — Spec Status, Results, and What Followed **Scope.** Status of the last long spec the user ordered (the A/B/C/D round, after the 7-experiment campaign in `C512_PLATEAU_CAMPAIGN.md`), its results, the unfinished gaps, and the work that grew out of it (this session). Setting: EP-trained equilibrium transformer block, C=512 BPE-4096, TinyStories, T=256, B=24. Random ln(4096)=8.318; BP single-block reference (depth=1, 7.48M) best **1.7921** (20k). Convention: raw data; conclusions are the user's. --- ## The spec (as ordered) > Next round (do A+B first, then decide N=4 / lock-in / tracking-AEP vs nudged-phase redesign): > - **(A)** plateau estimator ablation at a fixed ckpt — holo N=2/4/8, r=.02–.2, T2=20–120 + adaptive, > tracking-AEP, vs BPTT150/400/800. > - **(B)** navg multi-batch one-step sweep (navg 1–16): variance-or-bias test. > - **(C)** stable BPTT+controller rescue — lr 2e-4–8e-4 × jr_max 8–32, ≥2 seeds, 2–5k cutoff (the > still-missing matched architecture ceiling). > - **(D)** per-group late-collapse localization — attn QKVO / FFN fc-pj / LN / emb / head-clamp / > jacreg + diagnostics |a|, |z^β−z*|, |Jv−Jᵀv|, selected-T2. --- ## Per-item status & results | item | status | result (raw) | |---|---|---| | **A** estimator ablation | ✅ core done | At the 2.40 plateau ckpt vs BPTT400: **N2-frozen** cos **−0.045**, self-cos **−0.27**, k **4133**; **N4/N8** fix magnitude only (k→30) NOT direction; **r** .05/.10/.20 do nothing; **fixed-T2** 20/80/120 diverge (need snapshot selection); **tracking-AEP (common-mode)** cos **0.997**, self-cos **+0.95**, k **0.9**. Diagnostic at z*: **\|Jv−Jᵀv\|/\|Jv\| = 1.37** (highly non-normal → the frozen-at-z* AEP linearization IS the bias floor). Script: `plateau_ablation.py`. | | **B** navg sweep | ✅ done | cos(EP,BPTT) over navg 1/2/4/8/16 = **0.37 / 0.40 / 0.55 / −0.30 / −0.21**; self-cos rises (0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8. ⇒ **deterministic BIAS, not variance** — averaging/restart won't fix it. Script: `navg_sweep.py`. | | **C** stable BPTT rescue | 🟡 partial | Ran lr{2e-4, 4e-4, 6e-4} × jr_max{16, 32}, **single seed**, ~2.5k cutoff (`bpttr_lr*_jr*.pt`). Found: **stable exact-gradient BPTT exists** at lr ≤ 6e-4 (never broke); lr 8e-4 broke @ step 1600; **jr_max 16 > 32**. Winner (lr6e-4, jr16) extended to 12k = `bptt_ceiling` (best **2.1055** @ step 7400). Scripts: `bptt_rescue.sh`, `chain_next.sh`. | | **D** per-group localization | ✅ core done | Late collapse is **UNIFORM** across WQ/WK/WV/WO, fc/pj, ln1/ln2, tok/pos (all cos −0.03..−0.08, k 3k–18k) — a single shared cause (the corrupted contrast `a` poisons every VF gradient equally), NOT attention/LN-specific. + the \|Jv−Jᵀv\|=1.37 diagnostic. Script: `plateau_ablation.py` (per-group mode). | ### Decision gate (after A+B) Per spec, A+B chose the direction: **switch the nudged-phase estimator to common-mode tracking-AEP** (N=4/lock-in/navg rejected — A showed they fix magnitude not direction; B showed it's bias not variance). The mechanism: the bias floor is the **frozen-at-z\*** AEP linearization; re-linearizing at the moving common mode z̄=½(z₊+z₋) restores cos −0.045 → 0.997. ### Follow-on (the spec's "next necessary test": does `--track` descend below 2.40 in TRAINING) ✅ Done. Warm-start from the 2.40 orphan ckpt, jr_max 8: - **TRACK-warm (tracking-AEP): 2.40 → best 2.1628 @ 2500, still descending, stable** (res ~6e-5). - **STD-warm control (standard estimator, same recipe): ABORTED @ step 397** (res→0.21). - **TRACK-fresh (from scratch): ABORTED @ 968.** ⇒ tracking-AEP descends past the plateau but is **not a from-scratch drop-in**; recipe is **two-phase warm-track** (std estimator → ~2.40, then switch to tracking-AEP). Continued (`trkcont`) → best **2.1116**. --- ## Unfinished gaps (do not change conclusions, but were specified) 1. **C — ≥2 seeds:** only single-seed runs. The one item that affects the rigor of the "matched architecture ceiling" claim. 2. **C — jr_max=8** not included in the BPTT rescue grid (only 16/32; exp5 covered jr_max 8 for EP). 3. **A — BPTT800** reference not run separately (BPTT150-vs-400 cos=1.0 from exp1 answered the horizon question, but the spec listed 800). 4. **D — sub-diagnostics** |a|, |z^β−z*|, and a head-clamp arm not individually logged. --- ## What followed (this session — beyond the A/B/C/D spec) These grew out of the round; they were **not** in the spec. 1. **bptt_ceiling (architecture-ceiling run, lr6e-4 jr16, 12k).** Best **2.1055** @ step 7400, then **controller-hijack blowup** @ ~8800 (jr→16, val CE→3.4), lodged at ~3.3 through 12000 (jr pinned at 16 the entire tail, res bounded 2–5e-3 — never recovered). Logs: `bptt_ceiling.log`. 2. **BPTT-vs-BP "slow vs misaligned" diagnosis** (`lr_overlay.py`). Plotting val-CE vs cumulative-lr (∫lr, Adam path-length proxy): per unit path-length **BPTT meets/beats BP down to 2.11** (cumL ratio 1.0→0.33) — the raw-step slowness is the 5× lr cap (6e-4 vs BP's 3e-3), not inefficiency. BPTT never reached BP's 1.74 because it **blew up before converging** (stability-limited), not capacity. S0 evidence that capacity is fine when it converges: **BPTT-ctl 1.635 ≈ BP 1.610** (gap 0.025). A weight-import capacity probe was designed but **deemed confounded** (5 formula gaps: damping halves z*, coupled-vs-sequential FFN, attn biases, qknorm, final-LN) and not run. 3. **Controller-hijack diagnosis** (`blowup_diag.py`, at the 2.1055 ckpt). Decomposing the update into CE and jacreg terms: **cos(g_CE, g_jac) = −0.008** (orthogonal); at jr=16, **\|jr·g_jac\|/\|g_CE\| = 3.25** and **cos(g_CE, total) = 0.288** — i.e. ~71% of the update is the penalty, not loss descent. Per-group, the penalty over-drives **LN gains 34×**, attn 3.2×. res stays bounded throughout ⇒ NOT a relaxation blowup; a controller-penalty-driven weight-space detour. Trigger: near the optimum res settled ~1.8e-3 (> target 1.5e-3) → controller ramped jr to max → hijack. 4. **"Remove/replace the penalty" structural arms** (running now, all BPTT, base = bptt_ceiling recipe): - `freeze_floor` (GPU1): `--jacreg 0.1 --jr_floor 0.1 --jr_max 0.1` → jr frozen at 0.1, no adaptive ramp. - `specnorm` / Lipschitz (GPU0): `--jacreg 0 --specnorm --rho 0.9` → spectral-norm attention weights ≤0.9, off-gradient, no penalty. - Queued: `no-penalty` (`--jacreg 0`), `energy` (`--attn_mode energy`), `mono` (`--attn_mode mono`). - Watcher `watch_arms.sh` fires when both finish. Logs: `freeze_floor.log`, `specnorm.log`. 5. **DEQ training-stability literature review** (deep-research, 24 sources, 25/25 claims verified). Three families: (i) **soft Jacobian penalty (Bai 2021)** — author-admitted heuristic + DIGITAL-only (relies on Broyden, deliberately doesn't enforce contraction) → wrong tool for analog; (ii) **structural contraction** (monDEQ / NEMON-l∞ / pcDEQ / BiLipNet) — provable, and monDEQ physically = a passive resistor-diode circuit → **analog-native**; (iii) **energy/Hopfield attention** (softmax = one CCCP energy-descent step, provable monotone energy decrease) → the analog-compatible AND attention-preserving route. Recommended construction (medium conf): energy/Hopfield attention + monotone/NEMON-l∞ contractive FFN, plain Picard/averaged (= our εF Euler) iteration, no root-finder. Full report relayed in chat; open questions = our research frontier. --- ## Where it stands / open items - **Round A/B/C/D:** substantially complete (4/4 + the decisive tracking-AEP training follow-on); 4 specified gaps remain (above), the load-bearing one being **C multi-seed**. - **Reframe (this session):** the explosion is a **DEQ forward-stability** problem (shared by EP and BPTT, since EP≈BPTT gradient), not an EP problem — so the fix belongs to the DEQ/structural toolkit, not EP-literature hacks. The soft penalty is both admitted-heuristic and analog-incompatible. - **Running:** freeze_floor, specnorm (~10h) — testing whether removing/replacing the penalty keeps the block stable past the ~9k blowup point and whether it descends below 2.1055. - **Next-round formal plan (NOT yet executed):** WSD schedule redesign; fix-bundle re-run; converged BP/BPTT/EP comparison; C512-vs-C≥1024 capability question. Plus (lit-informed) prioritize the `energy` arm; consider NEMON-l∞ over Winston-Kolter for the structural `mono` arm. ## Artifacts - Scripts: `/tmp/lt_ep/{plateau_ablation,navg_sweep,bptt_rescue,blowup_diag,lr_overlay,cos_now}.py`, `bptt_rescue.sh`, `chain_next.sh`, `watch_arms.sh`. - Logs/ckpts: `/tmp/lt_ep/{bptt_ceiling,freeze_floor,specnorm,trkcont,track_warm,std_warm,bpttr_*}.{log,pt}`. - Records: `C512_PLATEAU_CAMPAIGN.md` (round-1, exp1-7), `FINDINGS.md` (2026-06-17 round 2 + 2026-06-18), memory `ep-c512-plateau-mechanism.md`.