docs/campaign/C512_ROUND2_ABCD.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118

# C512 Round-2 (A/B/C/D) — Spec Status, Results, and What Followed

**Scope.** Status of the last long spec the user ordered (the A/B/C/D round, after the 7-experiment
campaign in `C512_PLATEAU_CAMPAIGN.md`), its results, the unfinished gaps, and the work that grew out
of it (this session). Setting: EP-trained equilibrium transformer block, C=512 BPE-4096, TinyStories,
T=256, B=24. Random ln(4096)=8.318; BP single-block reference (depth=1, 7.48M) best **1.7921** (20k).

Convention: raw data; conclusions are the user's.

---

## The spec (as ordered)

> Next round (do A+B first, then decide N=4 / lock-in / tracking-AEP vs nudged-phase redesign):
> - **(A)** plateau estimator ablation at a fixed ckpt — holo N=2/4/8, r=.02–.2, T2=20–120 + adaptive,
>   tracking-AEP, vs BPTT150/400/800.
> - **(B)** navg multi-batch one-step sweep (navg 1–16): variance-or-bias test.
> - **(C)** stable BPTT+controller rescue — lr 2e-4–8e-4 × jr_max 8–32, ≥2 seeds, 2–5k cutoff (the
>   still-missing matched architecture ceiling).
> - **(D)** per-group late-collapse localization — attn QKVO / FFN fc-pj / LN / emb / head-clamp /
>   jacreg + diagnostics |a|, |z^β−z*|, |Jv−Jᵀv|, selected-T2.

---

## Per-item status & results

| item | status | result (raw) |
|---|---|---|
| **A** estimator ablation | ✅ core done | At the 2.40 plateau ckpt vs BPTT400: **N2-frozen** cos **−0.045**, self-cos **−0.27**, k **4133**; **N4/N8** fix magnitude only (k→30) NOT direction; **r** .05/.10/.20 do nothing; **fixed-T2** 20/80/120 diverge (need snapshot selection); **tracking-AEP (common-mode)** cos **0.997**, self-cos **+0.95**, k **0.9**. Diagnostic at z*: **\|Jv−Jᵀv\|/\|Jv\| = 1.37** (highly non-normal → the frozen-at-z* AEP linearization IS the bias floor). Script: `plateau_ablation.py`. |
| **B** navg sweep | ✅ done | cos(EP,BPTT) over navg 1/2/4/8/16 = **0.37 / 0.40 / 0.55 / −0.30 / −0.21**; self-cos rises (0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8. ⇒ **deterministic BIAS, not variance** — averaging/restart won't fix it. Script: `navg_sweep.py`. |
| **C** stable BPTT rescue | 🟡 partial | Ran lr{2e-4, 4e-4, 6e-4} × jr_max{16, 32}, **single seed**, ~2.5k cutoff (`bpttr_lr*_jr*.pt`). Found: **stable exact-gradient BPTT exists** at lr ≤ 6e-4 (never broke); lr 8e-4 broke @ step 1600; **jr_max 16 > 32**. Winner (lr6e-4, jr16) extended to 12k = `bptt_ceiling` (best **2.1055** @ step 7400). Scripts: `bptt_rescue.sh`, `chain_next.sh`. |
| **D** per-group localization | ✅ core done | Late collapse is **UNIFORM** across WQ/WK/WV/WO, fc/pj, ln1/ln2, tok/pos (all cos −0.03..−0.08, k 3k–18k) — a single shared cause (the corrupted contrast `a` poisons every VF gradient equally), NOT attention/LN-specific. + the \|Jv−Jᵀv\|=1.37 diagnostic. Script: `plateau_ablation.py` (per-group mode). |

### Decision gate (after A+B)
Per spec, A+B chose the direction: **switch the nudged-phase estimator to common-mode tracking-AEP**
(N=4/lock-in/navg rejected — A showed they fix magnitude not direction; B showed it's bias not
variance). The mechanism: the bias floor is the **frozen-at-z\*** AEP linearization; re-linearizing at
the moving common mode z̄=½(z₊+z₋) restores cos −0.045 → 0.997.

### Follow-on (the spec's "next necessary test": does `--track` descend below 2.40 in TRAINING)
✅ Done. Warm-start from the 2.40 orphan ckpt, jr_max 8:
- **TRACK-warm (tracking-AEP): 2.40 → best 2.1628 @ 2500, still descending, stable** (res ~6e-5).
- **STD-warm control (standard estimator, same recipe): ABORTED @ step 397** (res→0.21).
- **TRACK-fresh (from scratch): ABORTED @ 968.**
⇒ tracking-AEP descends past the plateau but is **not a from-scratch drop-in**; recipe is **two-phase
warm-track** (std estimator → ~2.40, then switch to tracking-AEP). Continued (`trkcont`) → best **2.1116**.

---

## Unfinished gaps (do not change conclusions, but were specified)
1. **C — ≥2 seeds:** only single-seed runs. The one item that affects the rigor of the "matched
   architecture ceiling" claim.
2. **C — jr_max=8** not included in the BPTT rescue grid (only 16/32; exp5 covered jr_max 8 for EP).
3. **A — BPTT800** reference not run separately (BPTT150-vs-400 cos=1.0 from exp1 answered the horizon
   question, but the spec listed 800).
4. **D — sub-diagnostics** |a|, |z^β−z*|, and a head-clamp arm not individually logged.

---

## What followed (this session — beyond the A/B/C/D spec)

These grew out of the round; they were **not** in the spec.

1. **bptt_ceiling (architecture-ceiling run, lr6e-4 jr16, 12k).** Best **2.1055** @ step 7400, then
   **controller-hijack blowup** @ ~8800 (jr→16, val CE→3.4), lodged at ~3.3 through 12000 (jr pinned at
   16 the entire tail, res bounded 2–5e-3 — never recovered). Logs: `bptt_ceiling.log`.

2. **BPTT-vs-BP "slow vs misaligned" diagnosis** (`lr_overlay.py`). Plotting val-CE vs cumulative-lr
   (∫lr, Adam path-length proxy): per unit path-length **BPTT meets/beats BP down to 2.11** (cumL ratio
   1.0→0.33) — the raw-step slowness is the 5× lr cap (6e-4 vs BP's 3e-3), not inefficiency. BPTT never
   reached BP's 1.74 because it **blew up before converging** (stability-limited), not capacity. S0
   evidence that capacity is fine when it converges: **BPTT-ctl 1.635 ≈ BP 1.610** (gap 0.025). A
   weight-import capacity probe was designed but **deemed confounded** (5 formula gaps: damping halves
   z*, coupled-vs-sequential FFN, attn biases, qknorm, final-LN) and not run.

3. **Controller-hijack diagnosis** (`blowup_diag.py`, at the 2.1055 ckpt). Decomposing the update into
   CE and jacreg terms: **cos(g_CE, g_jac) = −0.008** (orthogonal); at jr=16, **\|jr·g_jac\|/\|g_CE\| =
   3.25** and **cos(g_CE, total) = 0.288** — i.e. ~71% of the update is the penalty, not loss descent.
   Per-group, the penalty over-drives **LN gains 34×**, attn 3.2×. res stays bounded throughout ⇒ NOT a
   relaxation blowup; a controller-penalty-driven weight-space detour. Trigger: near the optimum res
   settled ~1.8e-3 (> target 1.5e-3) → controller ramped jr to max → hijack.

4. **"Remove/replace the penalty" structural arms** (running now, all BPTT, base = bptt_ceiling recipe):
   - `freeze_floor` (GPU1): `--jacreg 0.1 --jr_floor 0.1 --jr_max 0.1` → jr frozen at 0.1, no adaptive ramp.
   - `specnorm` / Lipschitz (GPU0): `--jacreg 0 --specnorm --rho 0.9` → spectral-norm attention weights ≤0.9, off-gradient, no penalty.
   - Queued: `no-penalty` (`--jacreg 0`), `energy` (`--attn_mode energy`), `mono` (`--attn_mode mono`).
   - Watcher `watch_arms.sh` fires when both finish. Logs: `freeze_floor.log`, `specnorm.log`.

5. **DEQ training-stability literature review** (deep-research, 24 sources, 25/25 claims verified).
   Three families: (i) **soft Jacobian penalty (Bai 2021)** — author-admitted heuristic + DIGITAL-only
   (relies on Broyden, deliberately doesn't enforce contraction) → wrong tool for analog; (ii)
   **structural contraction** (monDEQ / NEMON-l∞ / pcDEQ / BiLipNet) — provable, and monDEQ physically =
   a passive resistor-diode circuit → **analog-native**; (iii) **energy/Hopfield attention** (softmax =
   one CCCP energy-descent step, provable monotone energy decrease) → the analog-compatible AND
   attention-preserving route. Recommended construction (medium conf): energy/Hopfield attention +
   monotone/NEMON-l∞ contractive FFN, plain Picard/averaged (= our εF Euler) iteration, no root-finder.
   Full report relayed in chat; open questions = our research frontier.

---

## Where it stands / open items
- **Round A/B/C/D:** substantially complete (4/4 + the decisive tracking-AEP training follow-on);
  4 specified gaps remain (above), the load-bearing one being **C multi-seed**.
- **Reframe (this session):** the explosion is a **DEQ forward-stability** problem (shared by EP and
  BPTT, since EP≈BPTT gradient), not an EP problem — so the fix belongs to the DEQ/structural toolkit,
  not EP-literature hacks. The soft penalty is both admitted-heuristic and analog-incompatible.
- **Running:** freeze_floor, specnorm (~10h) — testing whether removing/replacing the penalty keeps
  the block stable past the ~9k blowup point and whether it descends below 2.1055.
- **Next-round formal plan (NOT yet executed):** WSD schedule redesign; fix-bundle re-run; converged
  BP/BPTT/EP comparison; C512-vs-C≥1024 capability question. Plus (lit-informed) prioritize the
  `energy` arm; consider NEMON-l∞ over Winston-Kolter for the structural `mono` arm.

## Artifacts
- Scripts: `/tmp/lt_ep/{plateau_ablation,navg_sweep,bptt_rescue,blowup_diag,lr_overlay,cos_now}.py`,
  `bptt_rescue.sh`, `chain_next.sh`, `watch_arms.sh`.
- Logs/ckpts: `/tmp/lt_ep/{bptt_ceiling,freeze_floor,specnorm,trkcont,track_warm,std_warm,bpttr_*}.{log,pt}`.
- Records: `C512_PLATEAU_CAMPAIGN.md` (round-1, exp1-7), `FINDINGS.md` (2026-06-17 round 2 + 2026-06-18),
  memory `ep-c512-plateau-mechanism.md`.