Diffstat (limited to 'docs/campaign')
| -rw-r--r-- | docs/campaign/C512_PLATEAU_CAMPAIGN.md | 159 |
| -rw-r--r-- | docs/campaign/C512_ROUND2_ABCD.md | 118 |
| -rw-r--r-- | docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md | 260 |
| -rw-r--r-- | docs/campaign/FINDINGS.md | 609 |
| -rw-r--r-- | docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md | 81 |
5 files changed, 1227 insertions, 0 deletions
diff --git a/docs/campaign/C512_PLATEAU_CAMPAIGN.md b/docs/campaign/C512_PLATEAU_CAMPAIGN.md
new file mode 100644
index 0000000..6066c21
--- /dev/null
+++ b/docs/campaign/C512_PLATEAU_CAMPAIGN.md
@@ -0,0 +1,159 @@
+# C512 "Why slow + plateau" diagnostic campaign (2026-06-17)
+
+Standalone record of the 7-experiment campaign the user proposed to diagnose why EP at C=512 BPE
+plateaus at val CE 2.40 (orphan) and never reaches the BP capability band (1.0–1.5). Self-contained;
+companion logs in `/tmp/lt_ep/`, general arc in `FINDINGS.md`.
+
+---
+
+## 0. Setup and the user's hypothesis
+
+- **Models** (TinyStories BPE-4096, C=512 H=16 T=256 B=24; random ln4096 = 8.318):
+  - EP "orphan" (lr 8e-4, warmup 800, resinit 0.1, holo N=2, t2sel 40, jr_floor 0.1, res_target
+    1.5e-3, jr_max 32, pema 0.999): **best val CE 2.4037**, ran full 20k, zero excursions.
+  - Standard BP transformer (same C/H/T, mlp=4): **best val CE 1.6953**.
+  - Target: BP capability band **1.0–1.5** (BP-C512 itself is 1.70; BP needs C≥1024 to enter the band).
+
+- **User's hypothesis (verbatim intent):** the "full-time slow + plateau" is **not** simple EP
+  LR-inequivalence; it is a conflict between the **equilibrium architecture's contractive operating
+  point** and **long-context mixing**, with LR merely expressing it as "8e-4 lives, 9e-4 explodes."
+  Supporting prior: at a matched tight residual operating point EP ≥ BPTT; the apparent gap is
+  EP-tight vs loose-BPTT (non-contractive), and the gap grows with T.
+
+- **Three plateau types the user distinguished:** (1) invalid plateau (res ~1e-2, gate skipping,
+  λ pinned); (2) valid-but-over-contracted plateau (res tiny, cos high, gate-skip low, still far
+  from BP); (3) controller-fight / estimator-controller mismatch.
+
+- **Priority order requested:** exp3 (BPTT+ctl) → exp2 (LR-sweep) → exp4 (warmup/gate decouple)
+  → exp5 (λ grid) → exp6/exp7 (branch + mixing). exp1 (triangulation) foundational.
+
+---
+
+## 1. The seven experiments — spec, what ran, result, read
+
+### exp3 — C512 BPTT+controller (the decisive matched cut)
+- **Spec:** replicate the orphan EP recipe (resinit, qk-norm, warmup, jr_floor, res_target, gate,
+  pema, lr 8e-4) but swap the EP task gradient for exact BPTT. If it plateaus ~2.2–2.4 → tax is
+  architecture/controller; if ~1.9 → EP estimator/noise.
+- **Ran:** `--mode bptt`, all else matched, 20k steps, fuse on.
+- **Result:** descended cleanly to **best 3.85 @step 1400**, then **destabilized at step 1600**
+  (jr→32, res→0.07) and lodged in a broken basin (val 6.27, res 0.068 — just under the 0.1 fuse,
+  no abort) for 8400+ steps. Killed at step ~12.8k. **Worse than the EP orphan (3.85 vs 2.40).**
+- **Read:** the answer came back neither 2.2–2.4 nor 1.9 — BPTT **broke**. At C512 the contraction
+  controller does **not** keep the exact gradient on the manifold; EP's implicit contraction-bias is
+  what kept the orphan alive. Inverts the S0 result (BPTT+ctl 1.635 < EP 1.676). *Caveat: single
+  lr/seed; BPTT might be stable at another lr.*
+
+### exp2 — one-step loss-decrease LR-sweep at the plateau ckpt
+- **Spec:** fixed ckpt + batch, gradients {BPTT, EP-task, EP-task+jac}, one-step update over an lr
+  grid, same-graph eval ΔCE. Answers "is it LR-inequivalence."
+- **Result** (ΔCE, negative = better): BPTT best **−0.161** (lr 1e-2); EP best **−0.042** (lr 1e-4),
+  **diverges for lr ≥ 3e-4**. cos(g_EP,g_BPTT) 0.107, k=|g_EP|/|g_BPTT| ≈ 486.
+- **Read:** **not LR.** EP's best one-step descent over *all* lr is 4× below BPTT's; a magnitude
+  rescale can't fix a 0.10-aligned direction.
+
+### exp1 — gradient triangulation, multi-checkpoint (foundational)
+- **Spec:** at step 0/200/800/2000/plateau, per-group cos of {g_EP, g_BPTT-150, g_BPTT-400},
+  norm-ratio k, batch-to-batch variance, + res.
+- **Result:**
+
+  | ckpt | val | res | EP·BPTT150 | BPTT150·400 | k | EP self-cos | BPTT self-cos |
+  |---|---|---|---|---|---|---|---|
+  | step0 (random) | 9.90 | 2.1e-2 | 0.261 | 0.459 | 0.319 | +0.654 | +0.617 |
+  | step200 | 6.07 | 4.2e-9 | **0.989** | 1.000 | 1.000 | +0.371 | +0.376 |
+  | step800 | 4.11 | 3.9e-4 | 0.964 | 0.935 | 0.844 | +0.755 | +0.721 |
+  | step2000 | 3.56 | 2.9e-4 | 0.885 | 0.845 | 0.675 | +0.445 | +0.358 |
+  | **plateau (2.40)** | 2.40 | 2.6e-5 | **−0.045** | **1.000** | **4223** | **−0.273** | **+0.957** |
+
+- **Read (decisive):** the EP estimator is **faithful while descending** (cos 0.99→0.89, steps
+  200–2000) and undergoes an **SNR/coherence collapse at the optimum** — cos→0, k→4000, and crucially
+  the **batch self-cos goes −0.27** (EP gradients on different batches anti-correlate) while BPTT is
+  +0.96. BPTT-150≡BPTT-400 (cos 1.00) the entire time ⇒ the true gradient is always well-defined.
+  **The 2.40 plateau is an EP estimator bias-floor / batch-incoherence near the optimum** — not
+  horizon ambiguity (ruled out here), not LR (exp2).
+
+### exp4 — warmup / validity-gate decouple (3 arms)
+- **Spec:** armA current warmup; armB full-strength contraction early + task LR still warmup;
+  armC no-warmup + smaller resinit / lower branch LR. Metric: gate-skip / res-under-gate / early val.
+- **Result:** armA (warmup): res stuck ~1.5e-2 for ~50 steps → gate skips the nudge → **no task
+  learning for ~50–60 steps** (cos≈0, the reported grad is pure jacreg); cos jumps to 0.99 once
+  res<gate by step ~60; best 4.11@800. armB (`cprewarm 200`): res 1.7e-2 after the pre-phase,
+  dead-zone only **partially** mitigated (val 5.97 vs 6.07 @100, best 4.13@600). armC (no-warmup +
+  resinit 0.03): **no dead-zone** (cos 0.99 from step 20, best 4.03@300) but a violent excursion
+  (res→0.14 @400, recovered), ended 4.71.
+- **Read:** the warmup×gate dead-zone is real; establishing contraction once (armB) isn't enough —
+  warmup's tiny lr can't deepen it. Smallest-resinit + no-warmup (armC) learns fastest early but is
+  excursion-prone.
+
+### exp5 — λ floor / target / jr_max grid (warm-start from 2.40)
+- **Spec:** floor {0.1,0.05,0.03} × target {1.5e-3,3e-3,5e-3} × jr_max {4,8,16,32}, 1–2k steps, fuse on.
+- **Result (best val CE, 1200 warm steps):** jr_max 4 → 2.376, **jr_max 8 → 2.3648**, jr_max 16 →
+  2.416, jr_max 32 (base) → 2.463; floor 0.05/target1.5e-3 → 2.477; floor 0.03/target5e-3 → 2.482.
+  (orphan = 2.404.)
+- **Read:** **jr_max ≤ 8 beats 32 by ~0.04** and dips just below the orphan — the jr_max ceiling was
+  a small tax; floor/target relaxation does not help. Still far from 1–1.5: the estimator floor
+  dominates.
+
+### exp6 — branch-growth + attention entropy + qk-RMS (incl. at plateau)
+- **Spec:** log |attn|/|xin|, |ffn|/|xin|, |WO|, |pj|, |fc|, attention entropy, qk RMS, logit std;
+  if branch contribution is small at the plateau, add trainable residual gates α with larger LR.
+- **Result:** training (resinit 0.1): |WO| 1.95→4.91, |pj|→6.57, attn/xin 56→3.5, attn entropy
+  4.56→3.88, qk_rms pinned 1.0. **At the plateau:** |pj|=44, |fc|=58, attn/xin≈1.0, attn entropy
+  3.62, qk_rms 1.0.
+- **Read:** branches grow fine and are **large** at the plateau ⇒ **no "stuck-small branch"** — the
+  trainable-α-gate intervention is **not indicated**. qk-norm is bounding qk RMS (1.0) as designed.
+
+### exp7 — mixing / influence length + per-position CE (EP vs BP)
+- **Spec:** finite-diff |∂z*_t/∂x_{t−d}| vs distance + per-position CE, comparing tight-EP,
+  BPTT-ctl-tight, BPTT-ctl-loose, standard BP.
+- **Bug caught & fixed:** the EP probe first embedded the input with random-init weights (before
+  loading the ckpt) → garbage (per-pos CE ~7); fixed to embed post-load and re-ran.
+- **Result — influence ||Δ·_q||/δ vs distance d:**
+
+  | d | 1 | 8 | 16 | 64 | 200 |
+  |---|---|---|---|---|---|
+  | BP | 5.34 | 1.81 | 0.88 | 0.15 | 0.003 |
+  | EP-orphan | 0.238 | 0.102 | 0.071 | 0.019 | 0.006 |
+  | BP ÷d1 | 1.0 | 0.34 | 0.16 | 0.029 | 0.0006 |
+  | EP ÷d1 | 1.0 | 0.43 | 0.30 | 0.080 | 0.025 |
+
+  Per-position CE: BP uniformly ~0.7–1.0 lower than EP at every position (no long-range-specific gap).
+- **Read:** both decay with distance; EP ~20× weaker in **absolute** coupling (scale-caveated — z*
+  vs hidden norm), but **normalized, EP reaches farther than BP**. ⇒ **does NOT support — arguably
+  refutes — "contraction cuts long-range mixing."** EP's deficit is **uniform across position**,
+  consistent with the estimator floor (exp1), not a mixing-length deficit.
+- **Honest gap:** a stable BPTT-ctl-tight ckpt does not exist (BPTT broke, exp3), so the cross-method
+  comparison is **EP-vs-BP only**, not the full tight/loose/BP set requested.
+
+---
+
+## 2. Consolidated verdict (data-level)
+
+**The C512 2.40 plateau is an EP estimator SNR/coherence collapse near the optimum** (exp1: cos→0,
+batch self-cos −0.27, k→4000, true gradient clean). Ruled out as causes: LR (exp2), horizon
+ambiguity (exp1), λ floor/target (exp5), mixing-length / contraction-cut (exp7). The architecture
+has a **separate** wall: exact-gradient BPTT can't stay stable at C512 (exp3 broke at 1600). EP's
+contraction-bias makes it the more robust of the two here.
+
+Of the user's three plateau types, the orphan is type (2) **valid-but-over-contracted** — but the
+sharper statement is that "over-contracted" is not the mechanism; the binding constraint is the
+**EP estimator bias floor**, which becomes batch-incoherent once the true gradient shrinks below it.
+
+**Two levers the data points at (decision pending):**
+- (a) **Reduce the EP estimator bias floor near the optimum** — N=4 holo / tracking-AEP / lock-in /
+  navg averaging; jr_max≤8 also buys ~0.04 (exp5). The lever for EP's own plateau.
+- (b) **The architecture's stability margin for exact gradients** — the separate exp3 wall.
+
+---
+
+## 3. Code / logs
+
+- Trainer flags added this round: `--snap` (exp1 ckpts), `--cprewarm` (exp4 armB), `--probe_branch`
+  (exp6, now with attn entropy + qk RMS), `--probe_bptt` (cos/k along trajectory).
+- Scripts: `lr_sweep.py` (exp2), `triangulation.py` (exp1, +batch self-cos), `mixing_probe.py`
+  (exp7 EP, bug-fixed), `bp_mixing.py` (exp7 BP), `bp_charlm.py --ckpt` (BP ckpt save).
+- Orchestration: `master.sh` (first pass), `master2.sh` (full to-spec versions).
+- Logs: `triangulation_multi.log`, `exp4_armB.log`/`exp4_armC.log`, `lam_jr{4,8,16}.log`,
+  `exp6_train.log`/`exp6_plateau.log`, `bp_mixing.log`, `mixing2.log` (+ the corrected re-run).
+- Checkpoints: `ep_C512_bpe.best.pt` (orphan 2.40), `bp_C512.pt` (BP 1.70), `snap_{0,200,800,2000}.pt`,
+  `lam_jr8.pt` (2.365), `lam_t5.pt`.
diff --git a/docs/campaign/C512_ROUND2_ABCD.md b/docs/campaign/C512_ROUND2_ABCD.md
new file mode 100644
index 0000000..65e5282
--- /dev/null
+++ b/docs/campaign/C512_ROUND2_ABCD.md
@@ -0,0 +1,118 @@
+# C512 Round-2 (A/B/C/D) — Spec Status, Results, and What Followed
+
+**Scope.** Status of the last long spec the user ordered (the A/B/C/D round, after the 7-experiment
+campaign in `C512_PLATEAU_CAMPAIGN.md`), its results, the unfinished gaps, and the work that grew out
+of it (this session). Setting: EP-trained equilibrium transformer block, C=512 BPE-4096, TinyStories,
+T=256, B=24. Random ln(4096)=8.318; BP single-block reference (depth=1, 7.48M) best **1.7921** (20k).
+
+Convention: raw data; conclusions are the user's.
+
+---
+
+## The spec (as ordered)
+
+> Next round (do A+B first, then decide N=4 / lock-in / tracking-AEP vs nudged-phase redesign):
+> - **(A)** plateau estimator ablation at a fixed ckpt — holo N=2/4/8, r=.02–.2, T2=20–120 + adaptive,
+>   tracking-AEP, vs BPTT150/400/800.
+> - **(B)** navg multi-batch one-step sweep (navg 1–16): variance-or-bias test.
+> - **(C)** stable BPTT+controller rescue — lr 2e-4–8e-4 × jr_max 8–32, ≥2 seeds, 2–5k cutoff (the
+>   still-missing matched architecture ceiling).
+> - **(D)** per-group late-collapse localization — attn QKVO / FFN fc-pj / LN / emb / head-clamp /
+>   jacreg + diagnostics |a|, |z^β−z*|, |Jv−Jᵀv|, selected-T2.
+
+---
+
+## Per-item status & results
+
+| item | status | result (raw) |
+|---|---|---|
+| **A** estimator ablation | ✅ core done | At the 2.40 plateau ckpt vs BPTT400: **N2-frozen** cos **−0.045**, self-cos **−0.27**, k **4133**; **N4/N8** fix magnitude only (k→30) NOT direction; **r** .05/.10/.20 do nothing; **fixed-T2** 20/80/120 diverge (need snapshot selection); **tracking-AEP (common-mode)** cos **0.997**, self-cos **+0.95**, k **0.9**. Diagnostic at z*: **\|Jv−Jᵀv\|/\|Jv\| = 1.37** (highly non-normal → the frozen-at-z* AEP linearization IS the bias floor). Script: `plateau_ablation.py`. |
+| **B** navg sweep | ✅ done | cos(EP,BPTT) over navg 1/2/4/8/16 = **0.37 / 0.40 / 0.55 / −0.30 / −0.21**; self-cos rises (0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8. ⇒ **deterministic BIAS, not variance** — averaging/restart won't fix it. Script: `navg_sweep.py`. |
+| **C** stable BPTT rescue | 🟡 partial | Ran lr{2e-4, 4e-4, 6e-4} × jr_max{16, 32}, **single seed**, ~2.5k cutoff (`bpttr_lr*_jr*.pt`). Found: **stable exact-gradient BPTT exists** at lr ≤ 6e-4 (never broke); lr 8e-4 broke @ step 1600; **jr_max 16 > 32**. Winner (lr6e-4, jr16) extended to 12k = `bptt_ceiling` (best **2.1055** @ step 7400). Scripts: `bptt_rescue.sh`, `chain_next.sh`. |
+| **D** per-group localization | ✅ core done | Late collapse is **UNIFORM** across WQ/WK/WV/WO, fc/pj, ln1/ln2, tok/pos (all cos −0.03..−0.08, k 3k–18k) — a single shared cause (the corrupted contrast `a` poisons every VF gradient equally), NOT attention/LN-specific. + the \|Jv−Jᵀv\|=1.37 diagnostic. Script: `plateau_ablation.py` (per-group mode). |
+
+### Decision gate (after A+B)
+Per spec, A+B chose the direction: **switch the nudged-phase estimator to common-mode tracking-AEP**
+(N=4/lock-in/navg rejected — A showed they fix magnitude not direction; B showed it's bias not
+variance). The mechanism: the bias floor is the **frozen-at-z\*** AEP linearization; re-linearizing at
+the moving common mode z̄=½(z₊+z₋) restores cos −0.045 → 0.997.
+
+### Follow-on (the spec's "next necessary test": does `--track` descend below 2.40 in TRAINING)
+✅ Done. Warm-start from the 2.40 orphan ckpt, jr_max 8:
+- **TRACK-warm (tracking-AEP): 2.40 → best 2.1628 @ 2500, still descending, stable** (res ~6e-5).
+- **STD-warm control (standard estimator, same recipe): ABORTED @ step 397** (res→0.21).
+- **TRACK-fresh (from scratch): ABORTED @ 968.**
+⇒ tracking-AEP descends past the plateau but is **not a from-scratch drop-in**; recipe is **two-phase
+warm-track** (std estimator → ~2.40, then switch to tracking-AEP). Continued (`trkcont`) → best **2.1116**.
+
+---
+
+## Unfinished gaps (do not change conclusions, but were specified)
+1. **C — ≥2 seeds:** only single-seed runs. The one item that affects the rigor of the "matched
+   architecture ceiling" claim.
+2. **C — jr_max=8** not included in the BPTT rescue grid (only 16/32; exp5 covered jr_max 8 for EP).
+3. **A — BPTT800** reference not run separately (BPTT150-vs-400 cos=1.0 from exp1 answered the horizon
+   question, but the spec listed 800).
+4. **D — sub-diagnostics** |a|, |z^β−z*|, and a head-clamp arm not individually logged.
+
+---
+
+## What followed (this session — beyond the A/B/C/D spec)
+
+These grew out of the round; they were **not** in the spec.
+
+1. **bptt_ceiling (architecture-ceiling run, lr6e-4 jr16, 12k).** Best **2.1055** @ step 7400, then
+   **controller-hijack blowup** @ ~8800 (jr→16, val CE→3.4), lodged at ~3.3 through 12000 (jr pinned at
+   16 the entire tail, res bounded 2–5e-3 — never recovered). Logs: `bptt_ceiling.log`.
+
+2. **BPTT-vs-BP "slow vs misaligned" diagnosis** (`lr_overlay.py`). Plotting val-CE vs cumulative-lr
+   (∫lr, Adam path-length proxy): per unit path-length **BPTT meets/beats BP down to 2.11** (cumL ratio
+   1.0→0.33) — the raw-step slowness is the 5× lr cap (6e-4 vs BP's 3e-3), not inefficiency. BPTT never
+   reached BP's 1.74 because it **blew up before converging** (stability-limited), not capacity. S0
+   evidence that capacity is fine when it converges: **BPTT-ctl 1.635 ≈ BP 1.610** (gap 0.025). A
+   weight-import capacity probe was designed but **deemed confounded** (5 formula gaps: damping halves
+   z*, coupled-vs-sequential FFN, attn biases, qknorm, final-LN) and not run.
+
+3. **Controller-hijack diagnosis** (`blowup_diag.py`, at the 2.1055 ckpt). Decomposing the update into
+   CE and jacreg terms: **cos(g_CE, g_jac) = −0.008** (orthogonal); at jr=16, **\|jr·g_jac\|/\|g_CE\| =
+   3.25** and **cos(g_CE, total) = 0.288** — i.e. ~71% of the update is the penalty, not loss descent.
+   Per-group, the penalty over-drives **LN gains 34×**, attn 3.2×. res stays bounded throughout ⇒ NOT a
+   relaxation blowup; a controller-penalty-driven weight-space detour. Trigger: near the optimum res
+   settled ~1.8e-3 (> target 1.5e-3) → controller ramped jr to max → hijack.
+
+4. **"Remove/replace the penalty" structural arms** (running now, all BPTT, base = bptt_ceiling recipe):
+   - `freeze_floor` (GPU1): `--jacreg 0.1 --jr_floor 0.1 --jr_max 0.1` → jr frozen at 0.1, no adaptive ramp.
+   - `specnorm` / Lipschitz (GPU0): `--jacreg 0 --specnorm --rho 0.9` → spectral-norm attention weights ≤0.9, off-gradient, no penalty.
+   - Queued: `no-penalty` (`--jacreg 0`), `energy` (`--attn_mode energy`), `mono` (`--attn_mode mono`).
+   - Watcher `watch_arms.sh` fires when both finish. Logs: `freeze_floor.log`, `specnorm.log`.
+
+5. **DEQ training-stability literature review** (deep-research, 24 sources, 25/25 claims verified).
+   Three families: (i) **soft Jacobian penalty (Bai 2021)** — author-admitted heuristic + DIGITAL-only
+   (relies on Broyden, deliberately doesn't enforce contraction) → wrong tool for analog; (ii)
+   **structural contraction** (monDEQ / NEMON-l∞ / pcDEQ / BiLipNet) — provable, and monDEQ physically =
+   a passive resistor-diode circuit → **analog-native**; (iii) **energy/Hopfield attention** (softmax =
+   one CCCP energy-descent step, provable monotone energy decrease) → the analog-compatible AND
+   attention-preserving route. Recommended construction (medium conf): energy/Hopfield attention +
+   monotone/NEMON-l∞ contractive FFN, plain Picard/averaged (= our εF Euler) iteration, no root-finder.
+   Full report relayed in chat; open questions = our research frontier.
+
+---
+
+## Where it stands / open items
+- **Round A/B/C/D:** substantially complete (4/4 + the decisive tracking-AEP training follow-on);
+  4 specified gaps remain (above), the load-bearing one being **C multi-seed**.
+- **Reframe (this session):** the explosion is a **DEQ forward-stability** problem (shared by EP and
+  BPTT, since EP≈BPTT gradient), not an EP problem — so the fix belongs to the DEQ/structural toolkit,
+  not EP-literature hacks. The soft penalty is both admitted-heuristic and analog-incompatible.
+- **Running:** freeze_floor, specnorm (~10h) — testing whether removing/replacing the penalty keeps
+  the block stable past the ~9k blowup point and whether it descends below 2.1055.
+- **Next-round formal plan (NOT yet executed):** WSD schedule redesign; fix-bundle re-run; converged
+  BP/BPTT/EP comparison; C512-vs-C≥1024 capability question. Plus (lit-informed) prioritize the
+  `energy` arm; consider NEMON-l∞ over Winston-Kolter for the structural `mono` arm.
+
+## Artifacts
+- Scripts: `/tmp/lt_ep/{plateau_ablation,navg_sweep,bptt_rescue,blowup_diag,lr_overlay,cos_now}.py`,
+  `bptt_rescue.sh`, `chain_next.sh`, `watch_arms.sh`.
+- Logs/ckpts: `/tmp/lt_ep/{bptt_ceiling,freeze_floor,specnorm,trkcont,track_warm,std_warm,bpttr_*}.{log,pt}`.
+- Records: `C512_PLATEAU_CAMPAIGN.md` (round-1, exp1-7), `FINDINGS.md` (2026-06-17 round 2 + 2026-06-18),
+  memory `ep-c512-plateau-mechanism.md`.
diff --git a/docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md b/docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md
new file mode 100644
index 0000000..54897bc
--- /dev/null
+++ b/docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md
@@ -0,0 +1,260 @@
+# C512 EP below-2.10 divergence — diagnosis + fix (2026-06-20)
+
+> Written to disk after `/tmp/lt_ep` was wiped (10-day /tmp cleanup) — all local code/ckpts/logs lost.
+> This file preserves the diagnosis + the exact code change so it can be rebuilt.
+
+## Symptom
+EP (frozen jr=0.1, tracking-AEP, qknorm, c=1) descends to best **~2.09**, then within ~200 steps:
+res 5e-3→0.15, cos(g_EP,g_BPTT) 0.98→~0, val CE 2.09→4+, ABORT. Same ~2.09 floor every run
+(erf & tanh both: 2.0919 / 2.0921). **Exact-BPTT, identical recipe, sails past 2.09 → 1.72 (freeze_wsd).**
+
+## Root cause (Codex-confirmed, 5-way corroborated)
+**EP optimizes the fixed point; BPTT optimizes the finite unroll — only the finite unroll defends the residual.**
+- `ep_step` relaxes T1=150 → `z150`, then REFINES to t1max=300 → `z*`, computes the EP gradient at `z*`.
+- But **eval and `bptt_step` use `z150`**. EP's gradient never feels the T1 residual.
+- As attention gets expressive → contraction weakens → `z150` drifts from `z*` → res@T1 inflates → EP
+  estimate (valid only at small res) dies → blowup.
+- BPTT differentiates the actual 150-step unroll → a non-converging unroll → bad CE → gradient implicitly
+  rewards strong contraction. **That defend-the-residual term is what EP structurally lacks.**
+- cos-0.977 holds only AT the fixed point; the missing perpendicular is ~21% (=√(1−0.977²)) = the
+  finite-horizon transient gradient. (T1=∞ would make even BPTT lose it — BPTT's stability IS the finite truncation.)
+
+Two layers: (1) point-mismatch (refinement) — `--t1max 150` alone pushed wall 2.09→2.05 but still blew@600;
+(2) gradient-flavor — needs the explicit penalty.
+
+Diverged state = forward bifurcation to a **limit cycle** (eval_relax: res floors ~6e-2, oscillates, 150→4000
+steps don't help, CE ~3.7). FTLE stayed negative (−0.027..−0.050). ⇒ adaptive/more steps can't fix it (res as
+stopping-criterion chases a vanishing fixed point); only res as a COST (penalty) prevents the drift.
+
+Ruled out: jacreg (3% of grad, orthogonal, removing changes cos <0.001); GELU erf/tanh (both diverge identically);
+the validity gate (removing it diverges FASTER); forward-stability/FTLE (stable-BPTT −0.0347 weaker than diverged-EP −0.0377).
+
+## THE FIX: `--resreg` (T1-residual penalty) — exact edits to lt_ep_train.py / ep_step
+
+1. signature: `def ep_step(..., corr_every=1, res_gate=0.0, resreg=0.0):`
+
+2. right after `res_used = res` (BEFORE the `if t1max > T1:` refinement):
+```python
+    zT, resT1 = zs, res   # the T1 free-phase state (what eval/BPTT use), BEFORE refinement
+```
+
+3. just before the final `return grads, res` (after the jacreg block):
+```python
+    if resreg > 0 and resT1 > 7e-4:   # defend z_T1 (BPTT gets this implicitly; EP at z* doesn't)
+        with torch.enable_grad():
+            Fz = blk.tforce(zT, xin0)   # deterministic thick force at z_T1 (params live, zT/xin0 detached)
+            Rr = (eps * Fz).pow(2).sum() / (zT.pow(2).sum() + 1e-9)   # ~ (T1 residual)^2
+            grr = torch.autograd.grad(Rr, blk.block, allow_unused=True)
+        ratio = resreg * min(1.0, resT1 / 2e-2)   # ramp 0->resreg as res 7e-4->2e-2, capped
+        gtask = math.sqrt(sum(float((grads[id(p)] ** 2).sum()) for p in blk.block if grads.get(id(p)) is not None) + 1e-20)
+        gres = math.sqrt(sum(float((g ** 2).sum()) for g in grr if g is not None) + 1e-20)
+        lam = ratio * gtask / gres   # scale penalty to `ratio` of the task-grad norm
+        for p, g in zip(blk.block, grr):
+            if g is not None:
+                grads[id(p)] = g * lam if grads.get(id(p)) is None else grads[id(p)] + lam * g
+    return grads, res
+```
+
+4. argparse: `ap.add_argument('--resreg', type=float, default=0.0)`
+
+5. the ep_step CALL in the training loop: append `, cfg.resreg` as the last arg.
+
+**RUN IT WITH `--res_gate 0`** (else the gate bypasses the penalty). Keep `--t1max 300`.
+
+## Validation status (INTERRUPTED — must re-run)
+`ep_resreg2` (res_gate=0, resreg=0.2, warm from erf-2.09 ckpt, t1max=300): held res **1–5e-4 through step 1000**,
+best **2.0573** (lowest any EP run reached, past the wall), zero inflation. Deleted by /tmp wipe before reaching ~1.8.
+No-penalty variants for contrast: ep_nogate blew@100, gated ep_resreg@200, ep_t1max150 reached 2.05 then blew@600.
+
+## What was lost in the /tmp wipe (must rebuild)
+- Code: lt_ep_train.py (with all post-06-14 edits: --resreg, --gelu, --probe_bptt, --wsd, --lcap/--linf, empty_cache,
+  tracking-AEP), plus the probes (grad_decomp.py, ftle_probe.py +--gelu, eval_relax.py, cos_now.py, train_lyap.py).
+  Base (06-14) survives in `/home/yurenh2/ept/lt_ep_code/`; re-apply edits from this doc + the session transcript
+  `/home/yurenh2/.claude/projects/-home-yurenh2-ept/46bc4d43-*.jsonl`.
+- ALL checkpoints: orphan (ep_C512_bpe.best.pt 2.40), freeze_wsd.pt (BPTT 1.72), ep_frzjr_wsd.pt (2.09),
+  ep_resreg2, ep_tanh_frz — gone (H200 backup source 157.66.255.42 is also down). Must retrain from scratch.
+- Data tinystories_bpe: regenerate via `lt_ep_code/prepare_tinystories_bpe.py`.
+
+## Re-run plan
+1. Reconstruct lt_ep_train.py = lt_ep_code/06-14 base + the edits above (+ the other recent flags).
+2. Regenerate tinystories_bpe data. **Work entirely under /home/yurenh2/ept — NEVER /tmp.**
+3. Retrain orphan (~20k) → freeze_wsd (BPTT WSD 16k, expect ~1.72) → EP+resreg (res_gate=0 resreg 0.2, WSD).
+4. Confirm EP+resreg descends past 2.09 toward ~1.8 (BPTT ceiling). Sweep resreg 0.2–1.0 if it plateaus.
+5. NEXT-ROUND code fixes (genuine bugs, kept this round only for comparability with the diverging baselines):
+   - GELU consistency: model force uses erf (F.gelu) but holo nudge (cgelu/rforce/cforce) uses tanh-form → nudge fixed point ≠ z* → ~4.7e-4 force error injects a spurious common-mode drift each step (estimator bias). FIX = `--gelu tanh` (make the model tanh too; the holo complex extension REQUIRES tanh = entire fn, so consistency means everything tanh). NOTE: fixing it does NOT fix the below-2.10 divergence (tested 6/18, that's the residual-defense issue) — but it's still a real bug worth fixing.
+   - cforce (holo_ep.py:41-45) omits the q/k RMSNorm that real attention applies → biases holo_a/holo_a_select with --qknorm (not the --track path, but fix anyway).
+   - Stale argparse defaults (footguns, harmless when flags passed explicitly): --attn_mode default 'real' (should be the intended mode), --data default '/tmp/lt_ep/data/shakespeare_char' (line ~320). Also tighten resreg scaling to pure task-grad norm (currently task+jacreg, ~3% off).
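
The `~4.7e-4` figure quoted for the erf-vs-tanh GELU mismatch is about the size of the largest pointwise gap between the two standard GELU forms. The sketch below scans for that gap in pure Python. It assumes the usual definitions (exact GELU via `math.erf`, tanh form with the 0.044715 cubic coefficient); the function names and the scan grid are illustrative and are not taken from `lt_ep_train.py` or `holo_ep.py`, and whether the notes' "force error" is exactly this pointwise gap is an assumption.

```python
import math


def gelu_erf(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh-form GELU approximation (0.044715 cubic coefficient)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def max_gelu_gap(lo: float = -6.0, hi: float = 6.0, n: int = 12001):
    """Scan a uniform grid and return (largest |erf - tanh| gap, x where it occurs)."""
    best_gap, best_x = 0.0, lo
    for i in range(n):
        x = lo + (hi - lo) * i / (n - 1)
        gap = abs(gelu_erf(x) - gelu_tanh(x))
        if gap > best_gap:
            best_gap, best_x = gap, x
    return best_gap, best_x


gap, x_at = max_gelu_gap()
print(f"max |gelu_erf - gelu_tanh| = {gap:.2e} at x = {x_at:+.2f}")
```

If the scan lands near the quoted value, the mismatch is a property of the two activation forms themselves, which fits the note that only making both sides use the same form removes it.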
+
+---
+## 2026-06-21 session update — rebuild after /tmp wipe + validation IN PROGRESS
+
+REBUILD (all under /home/yurenh2/ept/ep_run, never /tmp):
+- Code ep_run/lt_ep_train.py = lt_ep_code (a recent ~6/20 snapshot despite its misleading 6/14 mtime — it already has res_gate/gelu/t1max/track/resinit/full ep_step+holo) + my 3 edits (--resreg block, --wsd schedule, line-20 DD data-path fix). `diff` verified clean; holo_ep.py byte-identical to base.
+- Bug-check forensics: holo_ep.py last edited 2026-06-13 -> the estimator (holo_a_track, AEP correction, rforce) is COMPLETE/not stale. Post-6/14 lt_ep_train edits NOT carried over are all diagnostic/other-run-type (--probe_bptt,--probe_branch,--snap,--cprewarm,--specnorm/--linf, bptt_step sep-mode, empty_cache) — none touched by the EP+resreg+track+frozen-jr run.
+- Data: regenerated tinystories_bpe to ep_run/data/ (361M train tokens, vocab 4096; val=tail-15MB of train.txt, ~0.8% overlap — fine for tracking divergence/descent).
+
+VALIDATION RUN (from random init; pid 696706 GPU1):
+  python3 lt_ep_train.py --mode ep --attn_mode thick --B 24 --C 512 --H 16 --T 256 --c 1.0 \
+    --jacreg 0.1 --jr_floor 0.1 --jr_max 0.1 --holo 2 --hr 0.02 --t2sel 40 --track --pema 0.999 --t1max 300 \
+    --res_est 1e-4 --res_gate 0 --resreg 0.2 --qknorm --resinit 0.1 --warmup 800 --T1 150 --T2 20 --lr 6e-4 \
+    --wsd 0.25 --steps 32000 --log 200 --save_every 500 --abort_res 0.3 --data data/tinystories_bpe \
+    --ckpt runs/ep_resreg_scratch.pt --state runs/ep_resreg_scratch.state
+  Status @step 6400 (~20%, 8h, 0.23 it/s): val 2.5070 descending (FASTER than orphan: 2.51@6.4k vs orphan's 2.40@20k). res BOUNDED ~1-2e-2 by resreg (spikes to 2e-2 get pulled back to ~4e-3 — resreg WORKING, not running away like the no-resreg baselines that hit res 0.15->abort@2.09). Decisive break-2.09 test ~10h out; ~1.8 needs the WSD lr-decay (step 24k+). Watcher key: does res stay bounded through 2.09.
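
The run above combines `--warmup 800`, `--wsd 0.25`, `--steps 32000` and `--lr 6e-4`, and the status note places the lr decay at step 24k+, which matches a decay phase covering the last 25% of training. A hedged sketch of such a warmup-stable-decay schedule follows. The linear warmup, the linear decay to zero, and the name `wsd_lr` are assumptions for illustration, since the trainer's actual `--wsd` implementation is not shown in these notes.

```python
def wsd_lr(step: int,
           total_steps: int = 32000,
           peak_lr: float = 6e-4,
           warmup: int = 800,
           decay_frac: float = 0.25) -> float:
    """Warmup-stable-decay learning rate (illustrative shape, not the trainer's code).

    Phases: linear warmup to peak_lr, flat at peak_lr, then a linear decay
    to zero over the final `decay_frac` of training.
    """
    decay_start = int(total_steps * (1.0 - decay_frac))  # 24000 for the defaults
    if step < warmup:
        return peak_lr * (step + 1) / warmup              # warmup phase
    if step < decay_start:
        return peak_lr                                    # stable phase
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - decay_start)  # decay phase


for s in (0, 799, 6400, 24000, 28000, 32000):
    print(s, f"{wsd_lr(s):.2e}")
```

Under these assumptions the status checkpoint at step 6400 sits well inside the flat phase, so the descent reported there happens entirely at peak lr.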
+
+resreg CONTAMINATION probe (ep_run/resreg_probe.py, at step-3600 ckpt, res 1e-2):
+  cos(VF,BPTT)=0.7166 vs cos(VF+resreg,BPTT)=0.7108 -> delta -0.006: resreg DOES directly contaminate (small). cos(resreg_dir,BPTT)=-0.047 -> resreg is ~ORTHOGONAL to the true gradient; it is NOT adding back BPTT's residual-defense term (my earlier hypothesis was WRONG) — it's a roughly-orthogonal constraint direction. NET still favorable: the tiny -0.006 direct cost (bounded by the 0.2 ratio cap) buys keeping res low so the EP estimate doesn't COLLAPSE (cos 0.72 held vs ->0 runaway+diverge). CAVEAT the probe also exposes: the EP estimate itself is only 0.72-aligned at res 1e-2 (vs 0.98 at res 1e-4) -> descent on a mediocre gradient -> MAY plateau above BPTT's 1.72. If it stalls >~1.85 or res creeps past 2e-2: retune tighter resreg (ramp-to-full by res~3-5e-3 instead of 2e-2) and/or stronger early contraction (frozen jr=0.1 from scratch is weaker than the orphan's adaptive jr which held res~1e-4).
+
+CODEX review of rebuilt code (verdict triaged): the 2 "High" bugs are STALE-DEFAULT footguns, HARMLESS here (run passes --attn_mode thick + --data explicitly, verified via /proc): line 304 --attn_mode default 'real'; line 320 --data default '/tmp/lt_ep/data/shakespeare_char'. GELU (erf model vs tanh nudge, ~4.7e-4 force err) + cforce-missing-qknorm = real bugs but status-quo (kept this round for comparability) -> next-round fixes (point 5 above). Codex CONFIRMED CORRECT: AEP contrast sign (not flipped/doubled), Wh readout grad, jacreg, the resreg sign/logic + res_gate=0 gate-bypass, frozen-jr clamp@0.1, WSD schedule.
+
+COST (EP vs same-param BP): per-step EP on the equilibrium block ~= ~100x a depth-1 BP transformer in DIGITAL SIM — because the block relaxes ~150-300 steps (effective depth); even exact-grad BPTT is ~100x depth-1 BP. EP/BPTT ~= 1.5x (EP 0.23 it/s vs freeze_wsd BPTT 0.34 it/s). The ~100x is the cost of SIMULATING the analog relaxation on a GPU; on analog relaxation hardware the free phase is physical settling (~free) + EP's local update needs no backward -> EP competitive/better than BP there. Per-quality the gap is smaller (equilibrium block beats same-param standard transformer, rule tax 0.041); EP memory << BPTT (no unrolled graph).
+
+DELIVERABLE: method-intro PDF /home/yurenh2/ept/ept_method_intro.{tex,pdf} (9pg, EP-literate audience; AEP/holomorphic/CET/tracking-AEP background; strict attribution; results honest+ongoing-marked; compiled with tectonic since conda pdflatex is broken). VERIFY before external use: the post-Jan-2026 citations (Scurria arXiv:2602.03670 AsymEP; CET=Hoier/Kerjan/Scellier ICLR2026 AM workshop OpenReview Qrfml76eWJ; Laborieux 2021) + Hoier-vs-Hoier spelling.
+
+---
+## 2026-06-21 (cont.) — divergence is CONTRACTION loss, not res; the missing BPTT term MEASURED
+From-scratch EP (c=1, frozen jr=0.1, track, resreg=0.2) diverged at val 2.25 (ABOVE the 2.09 wall):
+res 1.1e-2 -> 1.7e-1 -> 3.4e-1 then COLLAPSED to a trivial fixed point (res 1.4e-8, val 10). So it is a
+bifurcation: good fixed point destroyed -> brief limit cycle -> degenerate attractor. NOT a numerical blowup.
+
+MEASURED the missing term M = g_BPTT - g_EP at the pre-blowup ckpt (step 9200, res~1e-2; resreg_probe.py):
+  cos(g_EP, g_BPTT) = 0.557 (decayed from 0.72 @ step3600 at the SAME res -> estimate degrades as operator hardens)
+  |M|/|BPTT| = 0.84, cos(M,BPTT) = +0.91, cos(M,EP) = +0.16, cos(M,resreg) = -0.10
+  (|EP|=0.94, |BPTT|=2.2 -> EP captures only ~42% of the true gradient, 56 deg off, near the instability)
+=> EP is missing 84% of the true gradient (M ~ the BPTT direction). resreg points AWAY from M (-0.10):
+   resreg is NOT the missing term and NOT a weak proxy -- it is the wrong direction. resreg retired as the fix.
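
The figures reported for M = g_BPTT - g_EP are not independent: given only |EP|, |BPTT| and cos(g_EP, g_BPTT), the law of cosines fixes |M|/|BPTT|, cos(M,BPTT) and cos(M,EP). A small check, assuming all norms and cosines were taken over the same parameter set (the helper name `missing_term_stats` is illustrative and not part of `resreg_probe.py`):

```python
import math


def missing_term_stats(norm_ep: float, norm_bptt: float, cos_ep_bptt: float) -> dict:
    """Derive the geometry of M = g_BPTT - g_EP from two norms and one cosine."""
    dot = norm_ep * norm_bptt * cos_ep_bptt                       # <g_EP, g_BPTT>
    norm_m = math.sqrt(norm_ep ** 2 + norm_bptt ** 2 - 2.0 * dot)  # |g_BPTT - g_EP|
    return {
        "M_over_BPTT": norm_m / norm_bptt,
        "cos_M_BPTT": (norm_bptt ** 2 - dot) / (norm_m * norm_bptt),
        "cos_M_EP": (dot - norm_ep ** 2) / (norm_m * norm_ep),
        "EP_over_BPTT": norm_ep / norm_bptt,
        "angle_deg": math.degrees(math.acos(cos_ep_bptt)),
    }


# Values reported at the step-9200 checkpoint.
stats = missing_term_stats(norm_ep=0.94, norm_bptt=2.2, cos_ep_bptt=0.557)
for name, value in stats.items():
    print(f"{name:>13s} = {value:.3f}")
```

The derived values can be compared against the reported 0.84, +0.91, +0.16, ~42% and 56 deg. Only cos(M,resreg) needs the resreg direction itself and cannot be recovered this way.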
+ +CODEX THEORY: correct state variable = contraction margin mu_P(J) (one-sided-Lipschitz/log-norm), NOT res +(lagging symptom; cos decays at fixed res) and NOT spectral radius (non-normal, |Jv-Jtv|/|Jv|~1.4 -> jacreg +targets the WRONG quantity). BPTT's stabilizer = the finite-horizon gradient term (the (I+eps*J)^k products that +ALSO govern contraction -> automatic contraction feedback); EP graded at refined z* structurally discards it. +The fix that (i) preserves the objective [no loss force] and (ii) runs on analog [no digital root-finder] = +CONTRACTION-BY-CONSTRUCTION (mu_P(J) < -alpha). KICKER: analog HW can never do BPTT -> can never access M -> +structural contraction is the FORCED hardware design constraint, not merely the cleanest fix. + +EXPERIMENT (test causality + the fix; both with resreg=0, no residual term): + (A) c-bump c=1->3 from scratch [runs/ep_c3_scratch] -- does stronger structural damping alone prevent divergence? + (B) HARD spectral-norm contraction constraint --specnorm (codex implementing) -- warm from preblow_9200 + from scratch. + If divergence vanishes with NO residual term -> contraction-loss confirmed as root; principled fix validated. + +--- +## 2026-06-21 (cont. 2) — the equivalence-condition diagnosis (deep paper read: Ernoult2019 + AsymEP + LZ-JacobianHomeostasis) +EP≡BPTT (Ernoult 2019, arXiv:1905.13633, Thm1/Eq16) needs: H1 symmetric Jacobian J=Jᵀ (Eq51 transpose-cancellation = load-bearing); +H2 free phase EXACTLY at fixed point s*=s_T=..=s_{T-K}; H3 β→0. Ernoult gives NO contraction-rate truncation bound (just assumes +convergence; my earlier "contraction-rate condition" guess was WRONG). BPTT needs NEITHER H1 nor H2 (differentiates the actual +finite-T graph) -> that's why BPTT sails to 1.7 while EP can't. 
+ +OUR RUN VIOLATES BOTH, via the SAME quantity S^-1 A (S,A = sym/antisym parts of J; our |Jv-Jᵀv|/|Jv|≈1.4 IS ‖S^-1 A‖≈1.4): +- H2 (dominant, first-order): res~1e-2 -> not at fixed point -> EP & BPTT seeded by different states. "EP≈BPTT only at res≲1e-4, + noise at 1e-2" is textbook H2 failure. +- H1 (makes it sudden, sets M): both AsymEP (Scurria arXiv:2602.03670) and LZ (Laborieux-Zenke "Jacobian Homeostasis" + arXiv:2309.02214, ICLR2024) reduce the asymmetry bias to S^-1 A. LZ Eq11: EP error = δ - 2 S^-1 A δ, valid only ‖S^-1 A‖→0. + At 1.4 the map (S+A)^-1(S-A) → near-rotation (eigs→-1): right direction, lost magnitude = our cos(M,BPTT)=0.91, |M|=84%. + +KEY NEW REALIZATION: our AEP (J→Jᵀ) correction is OUT OF WARRANTY. It is FIRST-ORDER in S^-1 A and assumes ‖S^-1 A‖≪1 +(AsymEP error series Σ(S^-1 A)^{2n+1} converges only if ρ(S^-1 A)<1; they never state it). At 1.4 it may be O(1) wrong. AND it +is a FIXED linearization at x̄⁰, so res~1e-2 -> wrong anchor -> H2 POISONS the AEP correction too. The cos-0.99 we trust holds +ONLY at small-asymmetry + tight-convergence — the regime we leave. + +THE GAP, PRECISE: NOT "can't converge". Even at perfect convergence, at ‖S^-1 A‖=1.4 the H1 bias is O(1), and the BPTT-1.7 +solution NEEDS that non-normality. So EP's valid regime (‖S^-1 A‖≪1) and the expressive regime (≈1.4) DON'T OVERLAP. Structural. + +REFRAMES THE CONTRACTION EXPERIMENTS: specnorm/c3 cap ‖J‖/contraction, NOT the asymmetry ratio ‖A‖/‖S‖ that drives the bias. +The literature's correct lever = Jacobian Homeostasis (LZ Eq12: penalize ‖A‖² -> J functionally symmetric, works w/ untied weights) += the principled version of our jacreg — BUT it caps exactly the non-normality the 1.7 solution needs (expressivity tax; confirms +the "crutch" worry). Holomorphic nudge only removes finite-β bias, NOT the H1/H2 bias (so β is NOT our dominant error). 
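ASYMMETRY PROXY, toy version (a sketch under stated assumptions): the quantity |Jv-Jᵀv|/|Jv| only needs matrix-vector products, which in the real probe would presumably be JVP/VJP calls. Here the Jacobian is an explicit 2x2 matrix and every name is hypothetical, so this illustrates the measurement, not the project's code.

```python
import math
import random

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def asymmetry_proxy(jac, n_probes=256, seed=0):
    """Mean of |Jv - J^T v| / |Jv| over random Gaussian probes v.
    Because J - J^T = 2A, this compares the antisymmetric part with J itself."""
    rng = random.Random(seed)
    jac_t = transpose(jac)
    total = 0.0
    for _ in range(n_probes):
        v = [rng.gauss(0.0, 1.0) for _ in range(len(jac))]
        jv, jtv = matvec(jac, v), matvec(jac_t, v)
        total += norm([a - b for a, b in zip(jv, jtv)]) / norm(jv)
    return total / n_probes

symmetric = [[-1.0, 0.3], [0.3, -1.0]]     # A = 0
rotating = [[-1.0, 1.0], [-1.0, -1.0]]     # S = -I, A = [[0, 1], [-1, 0]]

print("symmetric:", asymmetry_proxy(symmetric))
print("rotating :", asymmetry_proxy(rotating))
```

In this toy the rotating matrix gives a proxy of sqrt(2) while its ‖S⁻¹A‖ is 1, so the proxy and ‖S⁻¹A‖ are related but need not coincide in general; that is one reason to measure ‖S⁻¹A‖ directly rather than rely on the proxy.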
+ +THE ONE PATH WITHOUT THE EXPRESSIVITY TAX (novel): the AEP correction is first-order; a HIGHER-ORDER AEP (more Neumann terms in +S^-1 A) extends the warranty past ‖S^-1 A‖~1 WITHOUT capping non-normality — fix the gradient, not the model. Attacks H1 fundamentally. + +NEXT MEASUREMENTS (cheap, on preblow_9200 ckpt): (1) measure ‖S^-1 A‖ / ρ(S^-1 A) DIRECTLY (not the 1.4 proxy) and track over +training (warranty gauge; predict it grows toward the tip). (2) test M ≈ -2 S^-1 A · g_BPTT — if yes, the asymmetry bias IS the gap +(not convergence) and higher-order AEP is the lever. Sources: Ernoult 1905.13633; Scurria 2602.03670; LZ 2309.02214 + Holo 2209.00530. + +--- +## 2026-06-21 (cont. 3) — CORRECTION: the "AEP out of warranty / ρ(S⁻¹A)<1" framing (cont.2) was MIS-ATTRIBUTED +Deep read of Scurria 2026 (arXiv:2602.03670, verbatim Eq.24/47) overturns cont.2: +- Our AEP = the paper's AsymEP (single-state antisym correction −2A(z−z*); holo_a_track re-anchors A at the common mode z̄). + It is EXACT at β→0 for ARBITRARY asymmetry: Eq.24 `J_FA(x̄⁰)=J−2A=Jᵀ` is an algebraic IDENTITY (any A magnitude). +- The ρ(S⁻¹A)<1 / Neumann series Σ(S⁻¹A)^{2n+1} (Eq.47) is the **VF BASELINE's** error (uncorrected EP), which AsymEP REMOVES. + We + GPT-Pro (following Laborieux-Zenke, whose first-order −2S⁻¹Aδ analyzes UNCORRECTED EP) MIS-ATTRIBUTED the VF error + to AsymEP. It is NOT a restriction on our correction. +- "AsymEP = first-order projection of Dyadic EP" = DIMENSIONAL projection (2n→n), NOT first-order-in-β. Both exact at β→0. +- Dyadic EP (Scurria §4) = a SEPARATE 2n doubled-field method (aux field z′ + augmented energy H=−(z−z′)ᵀF((z+z′)/2)); + we DON'T implement it. It buys finite-displacement robustness + a real energy, NOT extra β→0 exactness. ⇒ optional upgrade, NOT the fix. +- ⇒ the asym_probe ρ(S⁻¹A)=3.3 result answers the WRONG question (VF's error, + measured at a no-fixed-point pathological ckpt). DISCARD as a validity verdict. 
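LINEAR TOY for the correction above (a sketch, not the holo/track code path): with F(z)=Jz+b and L=½|z-y|², the β→0 nudged response can be written in closed form, so the uncorrected VF estimate, the J-2A corrected estimate and the exact gradient dL/db can be compared directly. The 2x2 matrix, `b`, `y` and all helper names are assumptions chosen for illustration; the sign convention takes the estimate as -δ because ∂F/∂b = I.

```python
import math

def solve2(m, r):
    """Solve the 2x2 system m x = r by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [(r[0] * d - b * r[1]) / det, (a * r[1] - c * r[0]) / det]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def transpose(m):
    return [list(row) for row in zip(*m)]

def loss_at_fixed_point(jac, bias, target):
    """L(z*) with z* the fixed point of F(z) = jac z + bias."""
    z_star = solve2(jac, [-x for x in bias])
    return 0.5 * sum((z - t) ** 2 for z, t in zip(z_star, target))

J = [[-1.0, 1.4], [-1.4, -1.0]]   # stable but strongly non-symmetric (S = -I)
b = [1.0, 0.5]
y = [0.2, -0.3]

Jt = transpose(J)
A = [[(J[i][j] - Jt[i][j]) / 2.0 for j in range(2)] for i in range(2)]
J_fa = [[J[i][j] - 2.0 * A[i][j] for j in range(2)] for i in range(2)]  # J - 2A

z_star = solve2(J, [-x for x in b])
L_z = [z - t for z, t in zip(z_star, y)]

g_exact = [-x for x in solve2(Jt, L_z)]    # adjoint solve: J^T lam = -L_z
g_vf = [-x for x in solve2(J, L_z)]        # uncorrected nudged response
g_asym = [-x for x in solve2(J_fa, L_z)]   # nudged response with the -2A term

print("max |J_fa - J^T| =", max(abs(J_fa[i][j] - Jt[i][j]) for i in range(2) for j in range(2)))
print("cos(VF, exact)     =", cosine(g_vf, g_exact))
print("cos(AsymEP, exact) =", cosine(g_asym, g_exact))
```

In this toy the corrected estimate matches the exact gradient for an antisymmetric part 1.4x the symmetric one, while the uncorrected estimate is badly misaligned; the toy assumes a converged free phase, so it says nothing about the H2 failure.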
+ +REAL DIAGNOSIS (re-centered): the ONLY fundamental obstacle is **H2 = free-phase convergence/contraction**. AsymEP handles +asymmetry EXACTLY *given convergence*. The cos-0.557 / M-84% gap was measured at res 1e-2 (NOT converged) ⇒ it's H2, not H1. +Asymmetry is only "upstream" (expressive attention → weak contraction → H2 fails). FIX DIRECTION = contraction/convergence +defense (specnorm / damping / monotone / residual-defense) — VALIDATED. The cont.2 "asymmetry/ρ/AEP-warranty" line was a wrong turn. + +EMPIRICAL CHECK (running): oracle-adjoint probe — at a HEALTHY converged ckpt (specnorm, res~1e-4): is +g_EP(AsymEP) ≈ g_transpose(exact adjoint via GMRES Jᵀλ=−ℓ) ≈ g_BPTT? cos→1 confirms AsymEP exact + problem = convergence. +(The earlier 0.557 was the pathological non-converged point.) + +--- +## 2026-06-21 (cont. 5) — ROOT-CAUSE CANDIDATE: --hr 0.02 is 10x too small (should be 0.2) -> noisy EP gradient +oracle-adjoint probe (asym_probe --diag) on a PERFECTLY converged ckpt (res 1.5e-9): cos(g_EP, g_transpose[=exact +adjoint=g_BPTT, cross-checked cos 1.0]) = 0.63 mean over 6 batches (SYSTEMATIC, not a bad batch). BETA/r SWEEP: cos gets +WORSE as beta->0 (hr 0.04->0.86, 0.02->0.72, 0.005->0.38, 0.002->0.15) = 1/beta noise-amplification of the nudge-convergence +error (NOT finite-beta bias; NOT a holo/track/t2sel component -- plain EP ~= FULL ~0.72). Sweeping hr UP: 0.06->0.85, +0.1->0.95, 0.15->0.98, **0.2->0.986**, 0.3->0.991. => at the DOCUMENTED hr=0.2 (METHODS line 130) the estimator is FINE +(cos 0.986, in the historical 0.85-0.99 range). Our runs (this whole session + the EP_BELOW210 "diverging recipe") used +--hr 0.02 = 10x too small -> noise regime -> 0.72-cos gradient -> plausibly THE destabilizer of the below-2.10 divergence. +CODE IS CORRECT; mis-set parameter (hr regression from the validated 0.2). FIX = --hr 0.2. Decisive test: pure EP with +--hr 0.2 (runs/ep_hr02) -- does it stop diverging? 
The probe was correct; the convergence/asymmetry/Dyadic theory dive was +largely chasing a symptom of the mis-set hr. (NOTE: also revisit t2sel=40 vs documented 120, secondary.) + +## 2026-06-22 (cont. 6) — DECISIVE: divergence is STRUCTURAL (equilibrium gradient is blind to finite-time contraction). hr & t2sel are NOT the fix. +Deliverables this round: `ep_run/GPT55_BUG_HUNT.md`, `ep_run/CODEX_VERDICT.md`, `ep_run/FWD_MODE_DESIGN.md` (multi-agent: gpt-5.5 + fugu-ultra via codex). + +**cont.5's "hr=0.2 fixes it" was WRONG/incomplete.** ep_hr02 (--hr 0.2) STILL diverged (step 2800, val 2.87). hr is a real gradient-QUALITY knob but not the divergence fix. + +**Clean chain (all on a fixed val batch):** +1. **BPTT deconfounder:** BPTT (exact grad) → CE **1.83**, free-phase contraction ratio **ρ≈0.982**, res tight ~1e-3 throughout. Divergence is **EP-specific**; a contractive low-loss solution exists. +2. **The "0.94 estimator ceiling" = ADJOINT-PHASE TRUNCATION, not hr/β noise, not fundamental.** At a trained, tightly-converged operator (free res 2.2e-5): hr-ceiling sweep cos(g_EP,g_transpose) **flat ~0.94 across hr 0.04–0.8**; t2sel sweep (fixed hr) cos **0.74→0.998 as t2sel 10→160**. The code refines the FREE phase but never converges the NUDGED/adjoint phase (gpt-5.5 bug-hunt #1,#9). So EP CAN compute the near-exact equilibrium adjoint (0.998) at t2sel≥160. cos(g_transpose,g_BPTT)=0.975@res2e-5, =1.0@res1e-9 (finite-horizon gap, res-dependent). +3. **Bifurcation probe (mechanism):** as EP trains, free-phase **ρ drifts→1** (0.989@val4.96→0.998@val2.74); redx baseline blew step3300 (CE 2.74→41). **Damping does NOT rescue** — ρ-vs-c on the marginal op: more c makes ρ WORSE (0.998→0.9999); on BPTT's op more c helps (0.982→0.96). EP drives weights into a **damping-resistant marginal-contraction region** BPTT avoids. Sudden blowup because **res_T1≈ρ^150** is hyper-sensitive near ρ=1 (gradual ρ-drift + ρ^150 = sudden bifurcation). 
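WHY A GRADUAL ρ-DRIFT LOOKS SUDDEN (minimal sketch): treating the free phase as a linear contraction with one per-step ratio ρ (an assumption; the real operator is nonlinear and non-normal), the residual left after T1=150 steps scales as ρ^150 relative to the initial residual. The helper names are illustrative.

```python
T1 = 150  # free-phase relaxation steps quoted in the notes

def residual_after(rho, steps=T1):
    """Residual remaining after `steps` iterations of a linear contraction
    with per-step ratio rho, relative to the initial residual."""
    return rho ** steps

def sensitivity(rho, steps=T1):
    """d(rho**steps)/d(rho): reaction of the T1 residual to a drift in rho."""
    return steps * rho ** (steps - 1)

# Contraction ratios quoted in the bifurcation probe above.
for rho in (0.96, 0.982, 0.989, 0.998, 0.9999):
    print(f"rho={rho:<7} rho^T1={residual_after(rho):.4f}  "
          f"d/drho={sensitivity(rho):8.2f}")
```

Between ρ=0.982 and ρ=0.998 the relative T1 residual grows by roughly an order of magnitude, and the sensitivity to further drift grows with it, which is the arithmetic behind "gradual ρ-drift + ρ^150 = sudden bifurcation".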
+ +**VERDICT (codex/gpt-5.5, PROVEN — CODEX_VERDICT.md): (b) STRUCTURAL.** Equilibrium gradient differentiates L(z*); BPTT differentiates deployed L(z_T), T=150. The equilibrium gradient `grad=L_θ+λᵀF_θ` (F_zᵀλ=−L_z) **lacks the finite-horizon contraction-defense term**. Scalar proof: F=−k(z−z*) → ∀k, z* identical → **dL(z*)/dk=0** (blind to contraction rate), but z_T−z*=(1−εk)^T(z₀−z*) → **dL(z_T)/dk ∝ T(1−εk)^{T−1}** (BPTT sees k; huge near ρ=1). A param that erodes contraction with z* fixed has ZERO equilibrium gradient but large BPTT gradient. **No t2sel/hr adds a term absent from the fixed-point gradient.** t2sel=160 only makes EP a better *equilibrium* gradient = still wrong objective → ep_t2fix predicted to still blow. + +**FIX (physically realizable; FWD_MODE_DESIGN.md): change the OBJECTIVE to finite-horizon — forward-mode/RTRL eligibility traces.** Forward-relax T=150, propagate `e_{t+1}=Φ_z(z_t)e_t+ε·∂F_θ/∂p` alongside the state, update `Δp ∝ −L_z(z_T)ᵀe_T`. Forward-only, local, NOT reverse BPTT, NOT a digital solver — keeps the contraction-defense terms. Exact RTRL O(P·dimz) → practical: UORO / random-direction directional-deriv / blockwise low-rank. Alt: explicit contraction homeostat (penalize log-norm μ(J)). + +**In-flight confirmations (2026-06-22):** ep_t2fix (t2sel=160, runs/ep_t2fix.log + t2fix_rho.log) — predicted ρ→1 + blow; oracle-adjoint training (GMRES exact adjoint each step, warm-start redx s2000, runs/oracle_adjoint.log) — DEFINITIVE (b) test (if even the exact equilibrium grad blows → (b)); forward-mode rule design = the fix. +**Tooling:** Sakana fugu-ultra wired into the codex plugin (broker spawn patched: `-c model=fugu-ultra -c model_provider=sakana`, key ~/.codex/sakana.key); gpt-5.5 via direct `codex exec`. + +## 2026-06-22 (cont. 7) — CORRECTION: the divergence is a forward LIMIT CYCLE = a DISCRETE-EULER ARTIFACT (continuous/analog system is stable). cont.6's (b)-structural/forward-mode verdict is SUPERSEDED. 
+Measured on genuine this-session ckpts (full dossier: `ep_run/EP_DIAGNOSIS_DOSSIER.md`; fugu Q1-Q4: `ep_run/FUGU_VERDICT_FULL.md`). +1. **It's a LIMIT CYCLE, no fixed point.** eval_relax on the marginal ckpt redx **s3200** (val 2.74): relax from embedding 6000 steps → res FLOORS ~2.3e-2 and OSCILLATES (tail 2.08-2.73e-2, non-monotone), no fixed point. Reproduces the old (lost-run) finding on a genuine ckpt. ⇒ there is **no z\*** at the diverging operator → cont.6's "equilibrium gradient L(z\*) blind to contraction → forward-mode RTRL" (assumed z\* exists) is **moot here**. The bifurcation-probe "ρ=0.998 / slow convergence" was a **transient artifact** (caught the initial decay, missed the floor+oscillation). +2. **Attention drives the cycle.** Knockout WO×α: α=1.0 cycle (res 2.5e-2), 0.7 smaller, 0.4 nearly gone, **0.2 CONVERGES (res 3e-4, true fixed point)**. Non-conservative attention is causally responsible. +3. **DECISIVE — it's a DISCRETE-EULER ARTIFACT.** ε-sweep on s3200 (εN fixed = same settling): res-floor + oscillation shrink **monotonically** as ε↓ — 0.1→2.3e-2(cycle), 0.05→7e-3, 0.03→3e-3, 0.02→2e-3, **0.01→8.9e-4 CONVERGED**. ⇒ the cycle is the **explicit-Euler instability of a stiff rotating attention mode** whose CONTINUOUS eigenvalue is **Re(μ)<0 (stable)**: `|1+εμ|>1` at ε=0.1 but `<1` at ε=0.01. **So the below-2.1 "divergence" is largely a DIGITAL-SIM INTEGRATION ARTIFACT (ε=0.1 explicit Euler too coarse), NOT a fundamental non-conservative instability — and ANALOG hardware (continuous relaxation, ε→0) would NOT have it.** Big positive for the thesis. +- **fugu-ultra (Q1-Q4)** concurred + was appropriately cautious: confirmed attention-driven oscillatory non-convergence, flagged the Euler-artifact alternative (now confirmed), said "Hopf" needs the eigenvalue-of-M crossing to be measured. 
Fix (Q2): finer integration (smaller ε / implicit/Anderson) + **adaptive jacreg homeostat** (the validated-2.40 mechanism, which the diverging runs *froze* weak — the regression) + **res_gate ON** + qknorm/WO-gain rails + monitor TAIL oscillation (a short residual probe is what fooled us). Thesis (Q3): sub-threshold attention IS expressive enough (BPTT 1.83 proves stable expressive solutions exist), threshold ≈ α_c 0.3; tradeoff real but not fundamental. Primitive (Q4): keep below the (discrete) instability — the code (`ep_step`/holo) REQUIRES a stationary z\*; training *through* the cycle is invalid → gate updates when res says equilibrium lost. +- **CAVEAT:** the ε-sweep proves the *discrete* artifact at s3200 (continuous-stable there); a *true* continuous Hopf could still emerge deeper in training. **Decisive confirmation (running): `ep_eps05`** = warm-start s2000 (pre-cycle, val 3.13) + **ε=0.05 (T1=300)** → does it clear redx's 2.74 blow point without cycling? Files: eval_relax_s3200.py / knockout_s3200.py / eps_sweep_s3200.py. + +## 2026-06-23 — RESULT: ε-TRAINING-MONOTONICITY CONFIRMS THE MECHANISM (wall moves with ε). +Three training runs, identical recipe except ε: +- ε=0.1, t2sel=40 (redx): blew @ best **2.74** +- ε=0.1, t2sel=160 (ep_t2fix): blew @ best **3.02** — the BETTER gradient (cos 0.998 vs 0.84) blew EARLIER ⇒ gradient quality is NOT the lever (a cleaner gradient if anything drives into the stiff regime faster). +- ε=0.05 (ep_eps05, warm-start s2000): blew @ best **2.41** — cleared *past* redx's 2.74 with res pinned ~5e-3 the whole way down (2.74→2.45), then res inflated 1.3e-2→2.0e-2→4.6e-2→val13 @step3100. + +**⇒ smaller ε → strictly lower wall (2.74 → 2.41); the blow point shifts MONOTONICALLY with the integration step.** This proves the divergence IS the explicit-Euler instability of the *stiffening* rotating attention mode (|1+εμ|>1). 
Fixed ε only RELOCATES the wall, doesn't eliminate it — as the attention keeps stiffening past ~2.4, even ε=0.05 eventually gets too coarse → cycle regrows → blow. The ep_t2fix contrast (better gradient still blows, earlier) seals **"integration, not gradient."** Analog HW (continuous relaxation, ε→0) is unaffected. Pre-blow ckpt preserved: `runs/ep_eps05.pt` @step2850 best2.4059. (2.41 lands near the old 2.40 estimator-plateau — likely coincidence; this is the res-inflation/stiffness blow, not the estimator floor.) +**NEXT (user's plan — try jacreg + physics-inferred methods):** (a) **adaptive ε** [#30, gate now passed] — shrink ε as res climbs, big ε when smooth → should ELIMINATE the wall, not just move it; (b) **jacreg** (adaptive jr — reduce rotating-mode stiffness → raise the Euler-stability margin → wall up / fixed-ε floor lower; the early-recipe defense); (c) ε=0.02/0.03 to confirm wall→~2.2. Two complementary axes: integration (a,c — SIM-only, no expressivity cost) vs stiffness (b — changes model, also speeds HW settling). + +## 2026-06-23 (cont. 8) — MAJOR CORRECTION: GENUINE Hopf instability, NOT an Euler artifact. cont.7's "Euler-artifact / analog-immune / integration-fix" framing is SUPERSEDED. +Direct leading-eigenvalue measurement (matrix-free FINITE-DIFFERENCE-JVP Arnoldi on M=I+εJ, `eig_probe.py`) + Anderson fixed-point-solver control (`anderson_control.py`). THREE independent confirmations: +1. **eval_relax**: s3200 floors/oscillates, never reaches a fixed point. +2. **Leading eigenvalues of M at the operating point**: s2000 (healthy, val 3.13) all |λ|<1, Re μ<0 (STABLE); s3200 (val 2.74) top |λ|=1.044, **Re μ=+0.44, COMPLEX** (μ=+0.26±1.37j); ep_eps05 (val 2.41) top |λ|=1.14, **Re μ=+1.35**, complex (μ=+0.65±2.08j). Re μ grows −0.02→+0.44→+1.35 as CE drops; rotating |Im μ| ~0→1.37→2.08. +3. **Anderson control**: s2000 converges (root, marginally stable μ≈0). 
s3200 — plain relax floors (res 2.7e-2), Anderson CANNOT reach a root (best res 1.4e-3), and the near-root has **UNSTABLE COMPLEX** eigenvalues (Re μ=+0.24, |Im μ|≈2.0). The solver itself cannot manufacture the absent stable equilibrium (fugu's exact caveat). + +⇒ **The below-2.1 divergence is a GENUINE continuous-time Hopf-type instability** of the learned non-conservative attention operator — a complex eigenvalue pair crosses Re μ=0 between val 3.13 and 2.74 (|Im μ|~2 rotation), EMERGING + GROWING as EP trains the attention expressive. **NOT a discrete-Euler artifact; the continuous ODE / analog HW is ALSO unstable → analog is NOT immune.** + +WHERE cont.7 WENT WRONG: the "Euler-artifact" inference rested on the ε-sweep "converging at ε=0.01" — but that was the **step** residual `ε·g`, not `g→0` (fugu caught this); the force-residual `g` floors high (no fixed point). The direct eigenvalue measurement refutes Re μ<0. RECONCILES with all prior data: ε-monotonicity (smaller ε → lower wall) = smaller ε makes `|1+εμ|>1` grow SLOWER → delays the blow → wall relocates (2.74→2.41), doesn't close (Re μ>0 is unstable for ANY ε); ep_t2fix (better gradient blew earlier) = cleaner gradient drives into the unstable regime faster; jacreg history (adaptive→2.40, frozen→blow) = jacreg cuts ‖J_nc‖ → pushes the complex pair back to Re μ<0. + +**FIX (FLIPPED): integration (adaptive ε / Anderson / implicit) only DELAYS** — can't converge to an unstable/absent fixed point (Anderson got 1.4e-3, not a root = direct proof). **The REAL fix is MODEL-SIDE: jacreg** (reduce ‖J_nc‖ → push the rotating complex pair back to Re μ<0), per fugu's Q2 (FUGU_OPTIONS_VERDICT.md): use an *adaptive bounded* jacreg homeostat (the validated 2.40 runs used adaptive jacreg; diverging runs froze it). 
**NEXT: jacreg run + verify with `eig_probe.py` that jacreg holds Re μ<0 / keeps the complex pair below the bifurcation.** Files: eig_probe.py, anderson_control.py, adaptive_eps_calib*.py, FUGU_OPTIONS_VERDICT.md. + +## 2026-06-23 (cont. 9) — FIX CONFIRMED AT THE MECHANISM LEVEL: adaptive jacreg pushes Re μ<0, kills the Hopf, restores the fixed point. +`ep_jacreg` = warm-start s2000 + ε=0.1 + **ADAPTIVE jacreg** (`--jacreg 0.1 --jr_floor 0.1 --jr_max 16 --res_target 1e-3 --res_gate 0`, controller :520-529). The ONLY change vs redx (which froze `jr_max=jr_floor=0.1`) is releasing the controller. It descends cleanly **past redx's 2.74 blow** (best 2.7495, res controlled ~1e-3, jr 0.1–0.3, no blow; ~4.5e-4 CE/step after a slow warm-start re-adjustment). +**eig_probe (`eig_jacreg.py`) at the SAME loss ~2.74 — the decisive A/B:** +- FROZEN jacreg (redx s3200, blew): leading μ=+0.45 (real) + +0.26±1.37j (**COMPLEX/rotating**), |λ|>1 UNSTABLE, g_floor=0.26 (cycle, no fixed point). +- ADAPTIVE jacreg (ep_jacreg @2.75): leading μ=−0.23,−0.25,−0.27 (**ALL real, NO rotating pair**), |λ|<1 STABLE, g_floor=**0.0001** (TRUE fixed point restored). +⇒ adaptive jacreg moved the leading eigenvalue from **Re μ=+0.45 (unstable rotating Hopf) → Re μ=−0.23 (stable real)**, killing the rotating instability and restoring a true fixed point (g 0.26→1e-4). This confirms BOTH the diagnosis (Hopf from the non-conservative ‖J_nc‖) AND the fix (jacreg reduces ‖J_nc‖ → Re μ<0), at the eigenvalue level, not just the loss. **Bonus:** the restored fixed point makes AsymEP valid again (it requires a stationary state — invalid on redx's cycle). NEXT: does it clear 2.41 (ep_eps05's blow) → 2.09 → below; re-run eig_probe at 2.41 to confirm jacreg holds Re μ<0 deeper. Production = compose with adaptive ε (fugu Q3) for sim fidelity/speed. Files: eig_jacreg.py, ep_jacreg.log. 
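READING THE EIGENVALUES (a sketch, independent of eig_probe.py): an eigenvalue μ of J maps to λ = 1+εμ of the Euler step map M = I+εJ. Since |1+εμ|² = 1 + 2ε·Re μ + ε²|μ|², a mode with Re μ>0 has |λ|>1 for every ε>0, while a mode with Re μ<0 is Euler-stable only for ε < -2·Re μ/|μ|². The function names and the last "hypothetical" mode are assumptions added for illustration; the other μ values are the ones quoted above.

```python
def euler_multiplier(mu, eps):
    """Eigenvalue of the explicit-Euler step map M = I + eps*J that
    corresponds to a continuous-time eigenvalue mu of J."""
    return 1.0 + eps * mu

def max_stable_eps(mu):
    """Largest step size with |1 + eps*mu| < 1, or None when Re(mu) >= 0
    (then no positive step size is stable)."""
    mu = complex(mu)
    if mu.real >= 0.0:
        return None
    return -2.0 * mu.real / abs(mu) ** 2

modes = {
    "frozen jacreg, real mode": 0.45,
    "frozen jacreg, rotating pair": complex(0.26, 1.37),
    "adaptive jacreg, leading mode": -0.23,
    "hypothetical stiff stable rotation": complex(-0.05, 2.0),
}
for name, mu in modes.items():
    lam = euler_multiplier(mu, eps=0.1)
    print(f"{name:36s} |lambda|={abs(lam):.3f}  max stable eps={max_stable_eps(mu)}")
```

Only the hypothetical last mode would be a pure integration artifact (unstable at ε=0.1, stable at small ε); the frozen-jacreg modes have Re μ>0 and stay unstable for any step size, which is why the fix has to act on the model rather than on the integrator.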
+**OBS (oscillation, in-progress):** windowed val_std SHRANK monotonically as it descended (0.159→0.031, steps 50→3000) = converging. At step 3400 a TRANSIENT val spike (2.65→3.67) self-recovered over ~150 steps (→2.79), best preserved (2.5624). KEY: throughout the spike **res stayed LOW (4e-4→8e-5) and jr LOW (0.1)** → it is NOT a Hopf/relaxation instability (res did not inflate) but a benign weight/CE training transient (a momentarily-off update; relaxation stayed healthy). So jacreg's Hopf-suppression is intact; spike is a separate phenomenon. Watch: if such spikes recur/grow, investigate — but "res-controlled-during-spike" already rules out Hopf re-emergence. diff --git a/docs/campaign/FINDINGS.md b/docs/campaign/FINDINGS.md new file mode 100644 index 0000000..0cb8b53 --- /dev/null +++ b/docs/campaign/FINDINGS.md @@ -0,0 +1,609 @@ +# EP / AEP for Transformers — Findings + +**Question.** Can transformers (attention + FFN) be trained with Equilibrium Propagation (EP), +i.e. without backprop / without Feedback Alignment? Started from literature, reproduced the SOTA, +characterized AEP (EP for non-conservative dynamics), and ported the recipe to a real char-LM. + +--- + +## TL;DR + +1. **EP/AEP give FAITHFUL LOCAL GRADIENTS for every transformer component** — attention + cosine **0.99**, FFN **1.0** vs the true BP gradient. This solves the credit-assignment that + **Feedback Alignment fails at**: FA only trains layers adjacent to the loss (output proj cos≈1.0) + and leaves upstream layers as noise (**attention Q/K/V cos≈0.25, FFN fc cos≈−0.01**). +2. **Recipe to make real (non-conservative) attention EP-able: DAMPING + AEP.** A damping term + `s·(attn(z) − c·z)` (c≥1) creates a stable fixed point at *any* attention strength, while keeping + the map non-conservative (independent Q/K/V); AEP then recovers the gradient (0.99–1.0). +3. 
**Stable end-to-end EP training: SOLVED** by a residual-driven continuous controller on a soft + Jacobian penalty (Bai-2021-style, Hutchinson) + damping. EP trains 12k+ steps without blowup. + **There is NO structural "EP ceiling"** (a ~2.5-CE wall was claimed mid-project and is + **retracted** — see *2026-06-09* section): EP has a quantitative **validity threshold** + (free-phase residual ≲ few·1e-4, nudge inside the linear-response window), and meeting it costs + regularization tax + steps — both reducible, neither a wall. + +--- + +## 2026-06-20 — below-2.10 EP divergence: ROOT CAUSE + the residual-defense fix + +The C512 EP wall (EP frozen-jr descends to best ~2.09, then SUDDENLY diverges within ~200 steps: +res 5e-3→0.15, cos(g_EP,g_BPTT) 0.98→~0, CE→4+, abort — while exact-BPTT with the IDENTICAL recipe +sails past to 1.72, freeze_wsd) is diagnosed. NOT the controller, NOT jacreg, NOT the erf/tanh GELU, +NOT a loss-landscape wall. + +**Root cause (Codex-confirmed, 5-way corroborated): EP optimizes the FIXED POINT; BPTT optimizes the +FINITE UNROLL — only the finite unroll defends the residual.** `ep_step` relaxes T1=150→z150, then REFINES +to t1max=300→z* and takes the gradient at z*; but eval & `bptt_step` use z150. EP never "feels" the T1 +residual → as attention gets expressive, contraction weakens, z150 drifts from z*, res@T1 inflates, the +EP estimate (valid only at small res) dies → blowup. BPTT differentiates the actual 150 steps, so a +non-converging unroll → bad CE → its gradient implicitly rewards strong contraction. That defend-the-residual +term is what EP structurally lacks: cos-0.977 holds only AT the fixed point; the missing ⊥ component is ~21% +(=√(1−0.977²)) = the finite-horizon transient gradient (T1=∞ would make even BPTT lose it — BPTT's stability +IS the finite truncation; the equilibrium/implicit gradient only cares WHERE the fixed point is, not how fast +you reach it → no contraction-reward). 
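SCALAR ILLUSTRATION of the root cause (a toy under stated assumptions, not `ep_step`/`bptt_step`): take dz/dt = -k(z - z*), relaxed with explicit Euler for T1=150 steps. The fixed point z* does not depend on k, so the loss at the fixed point has zero gradient in k, while the loss at z150 does depend on k and rewards faster contraction. The values z*=1, z0=0, target 1.2, ε=0.1, k=0.18 (so ρ=0.982) and all function names are chosen for illustration.

```python
Z_STAR, Z0, TARGET, EPS, T1 = 1.0, 0.0, 1.2, 0.1, 150

def relax(k, steps):
    """Explicit-Euler relaxation of dz/dt = -k (z - Z_STAR)."""
    z = Z0
    for _ in range(steps):
        z = z + EPS * (-k * (z - Z_STAR))
    return z

def loss(z):
    return 0.5 * (z - TARGET) ** 2

def grad_k(k, steps, h=1e-6):
    """Central-difference d loss(z_steps) / dk."""
    return (loss(relax(k + h, steps)) - loss(relax(k - h, steps))) / (2.0 * h)

def analytic_finite_horizon_grad(k, steps=T1):
    """Closed form from z_T - z* = (1 - eps*k)**T * (z0 - z*)."""
    rho = 1.0 - EPS * k
    z_t = Z_STAR + rho ** steps * (Z0 - Z_STAR)
    return (z_t - TARGET) * (Z0 - Z_STAR) * steps * rho ** (steps - 1) * (-EPS)

k = 0.18                                   # rho = 1 - eps*k = 0.982
finite_horizon = grad_k(k, T1)             # what differentiating the 150-step unroll sees
equilibrium = grad_k(k, 20000)             # effectively at the fixed point
print("dL(z_150)/dk =", finite_horizon)
print("analytic     =", analytic_finite_horizon_grad(k))
print("dL(z*)/dk    =", equilibrium)
```

The finite-horizon gradient is negative (larger k, i.e. stronger contraction, lowers the deployed loss), while the fixed-point gradient vanishes; any parameter that changes only the contraction rate is invisible to a gradient taken at z*.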
+ +Two layers: (1) point-mismatch (refinement) — `--t1max 150` alone moved the wall 2.09→2.05 but still blew@600; +(2) gradient-flavor — needs the explicit penalty. The diverged state is a forward bifurcation to a LIMIT CYCLE, +not under-relaxation (eval_relax: res floors ~6e-2 and oscillates, 150→4000 relax steps don't help, CE ~3.7; +FTLE stays negative −0.027..−0.050 — single random-vector FTLE misses the cycle). ⇒ adaptive / more relaxation +steps CANNOT fix it (res as a STOPPING criterion just chases a vanishing fixed point); only res as a COST +(penalty) prevents the drift. + +**Fix — explicit T1-residual penalty `--resreg`**: defend z150 with R=‖εF(z150)‖²/(‖z150‖²+ε), grad w.r.t. θ +(z150 detached), task-relative scale (ratio = resreg·min(1, res@T1/2e-2), deadband res@T1>7e-4), added to the +EP gradient. **MUST run res_gate=0** (the validity gate early-returns jacreg-only above the gate, bypassing the +penalty exactly when res is high — first gated attempt blew@200). Keep t1max=300 (estimator accuracy) + penalty +(defends z150). Analog-compatible (one extra force measurement + local VF gradient, no root-finder); more +targeted than jacreg (which penalizes ‖J_nc‖, not the residual that explodes). + +**Validation (interrupted by /tmp wipe, needs re-run):** ep_resreg2 (res_gate=0, resreg=0.2, warm from erf-2.09) +held res pinned 1–5e-4 through step 1000, best **2.0573** (past the wall, lowest any EP run reached), zero +inflation — where every no-penalty variant blew (ep_nogate@100, gated ep_resreg@200, ep_t1max150 reached 2.05 +then blew@600). The run + ALL local ckpts were deleted 2026-06-20 by the /tmp 10-day cleanup before reaching +~1.8. 
Corroboration (5): Codex independent diagnosis; BPTT-from-2.09 control (res stayed 1.8e-4 + descended, +same fresh-opt restart where EP blew → the EP *update* is the destabilizer); FTLE (stable-BPTT 1.72 has WEAKER +contraction −0.0347 than diverged-EP −0.0377 → not forward-stability); eval_relax (limit cycle); gradient +decomposition (jacreg acquitted: 3% of grad, orthogonal, removing changes cos<0.001). Verbatim code + re-run +plan: **EP_BELOW210_DIAGNOSIS_FIX.md**. + +--- + +## Background (literature) + +- **EP** — Scellier & Bengio 2017. Energy-based; free phase relaxes to a fixed point, nudged phase + with strength β; local weight update from the two equilibria. Centered/symmetric nudging + (Laborieux 2021) reduces the gradient-estimator bias. +- **EP ≡ BPTT** — Ernoult et al. 2019: EP updates match BPTT gradients in an RNN with static input, + in the limit β→0 with a converged free phase and enough nudged steps. (Relevant: as β→0 EP is + unbiased — so our instability is NOT primarily gradient bias.) +- **Holomorphic EP** — Laborieux & Zenke 2022: exact gradients via finite oscillations, removing the + β bias/noise trade-off. +- **CET** — Høier, Kerjan, Scellier (ICLR 2026 AM workshop), arXiv via OpenReview `Qrfml76eWJ`. + Convergent Energy Transformer, EP-trained, CELEBA masked completion, EP ≈ TBPTE. The current SOTA + for "EP + attention". Conservative (energy) attention with tied value → guaranteed fixed point. +- **AEP / AsymEP** — Scurria, Vanden Abeele, Mognetti, Massar, "EP for Non-Conservative Systems", + arXiv:2602.03670. Nudged-phase correction `−2 A_J(x*)(x−x*)` (A_J = antisym part of the Jacobian + at the free eq) turns the nudged Jacobian J → Jᵀ, giving the exact gradient even when no energy + function exists. **The correction is theirs**; their scope is feedforward/Hopfield nets on static + MNIST/CIFAR (no attention, no sequence, explicit Jacobian, no stability controller). Their + force-form **VF** baseline collapses without the correction. 
See 2026-06-16 for the full ours/theirs + boundary. +- **FRE-RNN** — "Toward Practical EP", arXiv:2508.11659 (Zhuo Liu et al.). Fixes EP instability via + **feedback regulation (reduce spectral radius → fast convergence) + residual connections** + (vanishing gradients). Directly relevant to our open problem (keep the free phase converged). +- Adjacent energy-transformers (BP-trained, not EP): Energy Transformer (2302.07253), + EBT "Scalable Learners and Thinkers" (2507.02092), NrGPT (2512.16762). + +--- + +## The arc (what we built & found) + +| stage | finding | evidence | +|---|---|---| +| literature | EP+attention works only narrowly; CET is SOTA (CelebA, EP≈BP, single block) | CET full read | +| MVP (CET repro) | EP trains energy-attention+memory; **cosine 0.99; EP≈TBPTE** | CIFAR 0.539/0.546; FMNIST 0.272/0.278 | +| AEP (non-cons.) | corrects real-attention gradient: toy 0.30→**0.9975**; CET attn-params 0.75→**1.0** | `aep_*.py` | +| characterization | needs T2≥20–40, T1≥80; β-insensitive (static); **advantage grows with depth** (K=2: naive 0.05 vs AEP 0.99); only changes nudged phase (free eq identical); 1.44× cost | sweeps | +| fundamental limit | **strongly non-conservative attention has NO fixed point → whole EP family fails** (projection bounds magnitude but does not create a fixed point) | residual stuck 3e-2 | +| **F: damping** | `s·(attn−c·z)`, c≥1 → stable fixed point at any s → **AEP 0.99–1.0** even at s=8 | `aep_contractive2.py` | +| Option-2 LM attention | **AEP gives LM causal attention 0.993** (FA 0.25 — FA kills Q/K/V) | `lt_ep_attention.py` | +| Option-2 H1 (FFN) | **EP Hopfield-memory gives FFN 1.000** (FA −0.01 — the abandonment reason) | `lt_ep_ffn.py` | +| Option-2 H2 (train) | **BPTT trains (val CE 2.16↓, random 4.17); EP local training destabilizes early** | `lt_ep_train.py`, `H2_*.log` | + +**Why FA failed (the project's pain, confirmed):** FA only delivers a usable gradient to the layer +right before the loss (output projection / FFN 
proj, cos≈1.0); all upstream layers (attention Q/K/V, +FFN fc) get random-routed error → cos≈0. EP/AEP fix all of them. + +**Why EP training destabilizes (mechanism):** BPTT differentiates whatever the fixed T1-step +relaxation computes (a deep weight-tied net) — it does NOT need convergence. EP *assumes* the free +phase is at a fixed point; gradient descent on the loss pushes attention to be more expressive / +non-conservative → fixed point lost → relaxation diverges → EP estimate (and its jvp/vjp correction) +goes non-finite. Damping + clipping + weight-caps delay but don't prevent it. +*(Superseded: the residual-controlled Jacobian penalty DOES prevent it — see the 2026-06-09 section.)* + +--- + +## Key numbers + +- CET MVP (masked completion, test masked-MSE): CIFAR **EP 0.539 ≈ TBPTE 0.546**; FMNIST **EP 0.272 ≈ TBPTE 0.278**; BP-transformer 0.106/0.126; trivial(visible-mean) 0.583. +- EP-vs-BPTT gradient cosine (CET): global **0.99**; attention WQ/WK 0.98. +- AEP recovers non-conservative attention: toy naive 0.30 → AEP **0.9975**; CET attn-params 0.75 → **1.0**; damped high-s (s=8) AEP **0.99–1.0**. +- LM (Shakespeare char): AEP attention **0.993** (FA 0.25); EP Hopfield-FFN **1.000** (FA −0.01). +- H2 training: BPTT val CE **2.16 ↓** (random ln65=4.17); EP non-finite / no sustained training. + +--- + +## Open problem & next directions + +The frontier is **stable end-to-end EP training**. Hypotheses / plans: +1. **Keep the free phase converged during training** (FRE-RNN, 2508.11659): feedback regulation to + keep the spectral radius < 1 + residual connections. Most directly targets the mechanism. +2. **β bias/noise trade-off** — Ernoult 2019 says EP≡BPTT as β→0 with converged phases; but small β + amplifies the `(z₋−z₊)/2β` noise. Holomorphic EP (Laborieux–Zenke 2022) removes this trade-off. + Diagnostic: log cosine(EP-grad, BPTT-grad) *during* training — does it start 0.99 and drop? +3. 
**Unified conservative energy** (no AEP): make the whole LM one energy (energy-attention, tied + value) → guaranteed fixed point → plain EP is stable (we saw CET-EP is stable). Trades attention + expressivity for stability. + +## 2026-06-09 — H2 training era: the "wall" refuted, the real mechanism found + +**Setting.** Shakespeare char-LM, single equilibrium block, C=128 H=4, same param budget everywhere. +`--attn_mode thin/real` = clamp + Hopfield-FFN + damped real attention; `thick` = DEQ-transformer +block (pre-LN + attention + untied 4× GELU FFN + residual + damping). Stabilizer = damping `c` + +soft Jacobian-norm penalty λ (Hutchinson on the non-conservative force) driven by a continuous +controller on the free-phase residual (residual, NOT spectral radius, is the right signal — the +attention Jacobian is non-normal, transient growth invisible to ρ). + +**Scoreboard (best val CE; random ln65 = 4.17):** + +| run | result | +|---|---| +| BP transformer, MLP=4 | 1.68 | +| BP transformer, MLP=1 (param-matched) | 1.79 | +| thick-BPTT (exact grad, same arch as thick-EP) | **1.95** (res ~1e-3→8e-3, learned contraction unaided) | +| mono-BPTT (monDEQ) | 2.11 | +| thin-BPTT | 2.21 (optimum at res ~1.5e-2) | +| **thick-EP champion (R5): R3 recipe + EMA-smoothed controller, 14k** | **2.0467** @ step 5k — within 0.10 of exact-gradient thick-BPTT | +| thick-EP (R3): λ-floor 0.1, res_target 5e-4, c=1, 8k | 2.0784 — beats thin-BPTT and mono-BPTT | +| thick-EP (R1): λ-floor 1, c=2, 14k | 2.1504 — 2.467@3k → 2.259@7.2k → 2.150@14k: monotone, no plateau, just slow | +| thick-EP (R4): λ-floor 0.1, res_target 2e-4, 14k | 2.1352 — tight target ⇒ controller thrash (λ swings 0.5↔13 late) | +| thick-EP (R2): λ→0, c=0.5 | 2.3572 @ step 800, then **collapse to a degenerate fixed point** (res→0 exactly, val ≫ random) | +| thick-EP, c=2 λ=1, 3k (the run the "wall" was called on) | 2.4665 (just undertrained) | +| thin-EP super-long 12k | 2.4847 plateau — **invalid-regime run**: res 
2.4e-2–6e-2 the whole time, λ pinned at max 16 | +| thin-EP 3k, λ=16 | 2.5593 | + +**Retraction.** Mid-project verdict — "EP is capped at ~2.5 by a convergence⟷richness wall +(rich blocks need damping that destroys their expressivity)" — is **wrong**, called prematurely at +3k steps when thick-EP ≈ thin-EP ≈ 2.47–2.50 by coincidence (two different slow/broken modes passing +the same value). Three independent refutations: (1) the 8k continuation sailed through the "wall" +(2.2592@7.2k, monotone); (2) the slack run hit 2.357 in 800 steps; (3) the gradient probe below shows +the EP estimator is healthy exactly where the stabilized runs operate. Also wrong in the original +story: damping c does NOT aid convergence for LN-inside blocks (LN Jacobian ∝ 1/σ(z): damping shrinks +z*, *inflating* the Jacobian — measured: thick plain-relax res 8.8e-3 at c=0 vs 3.4e-2 at c=2, +`solver_wall.py`), and the scary "init residual~12" was an unnormalized absolute-norm print. + +**The real mechanism — a validity threshold, not a wall** (`grad_quality.py`: cosine of each +estimator vs long-horizon BPTT (T1=400) reference, at a 300-step-BPTT-pretrained thick block): + +| estimator | free-phase res | cos vs BPTT-400 (3 batches) | +|---|---|---| +| EP T1=400, T2=20 | ~5e-5 | **0.85, 0.83, 0.88** — healthy | +| EP T1=150, T2=20 | 2e-4–1.6e-3 | 0.21 / 0.86 / 0.27 — marginal, batch-dependent | +| EP T1=50, T2=20 | 2.4e-3–3.6e-3 | −0.03 / 0.34 / 0.55 — broken | +| BPTT T1=150 (itself!) 
| — | 0.45 / 0.77 / 0.36 vs BPTT-400 | +| EP T1=150, **T2=60** | same as T1=150 | 0.01 / 0.07 / −0.19 — **nudged phase leaves the linear-response window** | + +Readings: (a) at tight convergence the EP estimator agrees 0.85 with the exact gradient — the +EP-specific overhead (finite β, T2 truncation, AEP clipping) is only ~0.15 of misalignment; +(b) **the horizon/mixing cost is shared by BPTT itself** (BPTT-150 vs BPTT-400 cos 0.35–0.77): +slow-mixing equilibria are expensive for everyone, not an EP defect; (c) EP needs res ≲ few·1e-4 to +be valid — at res ~1e-2 (where the thin super-long run lived) EP updates are noise, which is what +that 2.48 "plateau" actually was: 12k steps of noise + λ=16 Jacobian-shrinking gradient; (d) longer +nudged phase is NOT better — the AEP correction is linearized at z*, so T2 must stay in the +linear-response window (T2≈20 at eps=0.1 good, T2=60 destroys the estimate). + +**Decomposition of the EP→BP gap (final numbers):** architecture (BPTT−BP) = 1.949−1.79 = +**0.16** (small — the energy/fixed-point architecture is fine); training rule (EP−BPTT) = +2.047−1.949 = **0.10** — slower optimization (estimator cos ~0.85) + the λ tax of staying inside +the validity region. Total EP→param-matched-BP = **0.26** (the "wall" story claimed ~0.7 was +structural). EP-trained thick (2.047) beats exact-gradient thin (2.206) and monDEQ (2.111): +EP *does* cash in architecture richness, given validity + steps. + +**λ-study (final, prediction confirmed: middle wins):** R5 2.047 < R3 2.078 < R4 2.135 < R1 2.150 +< R2 2.357. The two failure modes flanking the optimum: λ→0 ⇒ fast descent then *collapse* (not +explosion — a dead res=0 equilibrium); λ heavy ⇒ no failure, just a constant tax (R1 monotone to +2.150@14k). R4's lesson: too-tight res_target puts the multiplicative controller on the residual +noise floor — `(res/target)^0.3` per-step updates compound the noise and λ thrashes 0.5↔13, +degrading late training. 
R5's fix: EMA-smoothed residual signal (`--res_ema 0.9`) kept λ quiet +(0.1–1.0) and set the champion 2.0467 by step 5k. + +**Open item (the actual next frontier):** every stable recipe peaks mid-run (~5–6k) and then drifts +up 0.1–0.2 despite cosine lr decay — EP's late-training noise floor. Suspects: VF-estimator noise +dominating at small lr (cos 0.85 ⇒ persistent gradient-direction error), λ/lr balance shifting late, +weight-norm caps binding. Holomorphic EP (removes β bias/noise) is the principled candidate for the +remaining ~0.10; T2 must stay in the AEP linear-response window (T2≈20, NOT longer). + +## 2026-06-10 — estimator/controller co-design: EP reaches 1.74, beats exact-gradient BPTT + +The Holomorphic-EP line (see METHODS §4.3) + controller re-calibration, in causal order: + +1. **R7 — holomorphic clamp-free nudge** (N=2, r=0.02 Cauchy readout, no g-clamp / no corr-clip, + closed-form ∇zCE): best **2.029**, faster wall-clock than plain EP (3.3 vs 2.45 it/s). Probe + discoveries: the legacy clamps were the dominant estimator error at marginal residuals + (cos 0.27→0.89 at res 1.6e-3); N and r are flat ⇒ β-bias/noise was never the binding error; + T2 truncation is the remainder (T2=120 → cos 0.985 when the nudged phase is stable; adaptive-T2 + early-stopping defeated by non-normal transient growth — open). +2. **R8 — param-EMA alone: tie (2.031).** Exposed the real late-drift mechanism: λ pinned at + jr_max=16 in the tail — the controller enforcing the OLD validity margin (5e-4) against weights + that want res~1e-3, which the new estimator tolerates fine (cos 0.89 @ 1.6e-3). +3. **R9 — controller re-calibrated to the new estimator** (target 1.5e-3, λmax 4) + param-EMA: + **best 1.7399**, sustained EMA plateau 1.75–1.79, λ quiet at 0.1–0.5, tail ~1.92. Worth 0.29 — + the single largest improvement of the project. 
Lesson: **estimator and controller must be + co-designed**; upgrading one without re-tuning the other converts the controller into the main + source of harm. +4. **14k controls** (same horizon as R9): BP MLP=4 **1.610**, BP MLP=1 1.689. Param-shape + correction: the **thick** block's 131k FFN matches **MLP=4**, not MLP=1 (which matches the thin + block's 33k Hopfield memory) — so thick-EP's like-for-like BP baseline is 1.610. + Unregularized **BPTT-14k destabilizes** (res→4.7e-2, val→3.0, best 2.021 — worse than its own + 3k run, 1.949): the equilibrium architecture NEEDS the contraction control loop for long + training; EP carries it out of estimator necessity, bare BPTT got nothing. +5. **The decisive control — BPTT + identical λ-controller + param-EMA: 1.6348**, tail stable. + Verdict: most of EP-beats-bare-BPTT was the regularizer; with equipment matched, exact + gradients lead EP (R9) by **0.105** — which matches the measured estimator misalignment + (cos 0.85–0.90). Bonus: the controlled equilibrium block (1.635) beats the MLP=1 BP baseline + (1.689) — the contraction controller is good for the architecture independent of training + rule; EP merely forced its discovery. +6. **Adaptive T1/T2 (R10): EP 1.6755** — ate 0.064 of the 0.105 rule tax. Adaptive T2 solved by + *hindsight snapshot selection* (judge by increments of the contrast estimate a_t, never by step + sizes — non-normal transients can't fool it; dangerous batches self-limit to t≈20–30, stable + ones run to ~110 and collect cos 0.987; probe mean 0.871→0.932). Adaptive T1 companion: the + λ-controller signal stays sampled at T1=150 (R9 peace preserved), then relaxation refines to + res≤1e-4 before nudging — long-T2 gains exist ONLY from tight equilibria (at res~1e-3 long T2 + hurts). **Final 14k decomposition: architecture 0.025 (1.635 vs 1.610), EP rule tax 0.041 + (1.676 vs 1.635), total 0.066.** EP now beats the MLP=1 BP baseline (1.689). 
+ *Multi-seed (3/arm):* EP **1.680 ± 0.005** vs BPTT-ctl **1.639 ± 0.006** ⇒ tax + **0.041 ± 0.005 (~9σ)** — confirmed real. +7. Refuted along the way: λ-floor∝lr annealing (R6) reproduces the λ→0 death; post-mortem of that + death (R2/R6): **explosion frozen by fp32 absorption masquerading as convergence** (val 60–77 + with res≡0: ε·F < ulp(z); uncapped tok/pos/fc/pj blow up) — the λ floor is the anti-collapse + mechanism, not just a tax. + +## Code map + +- `~/ept/cet_mvp.py` — CET energy model + EP & TBPTE training (App. B faithful). +- `~/ept/bp_transformer.py` — vanilla BP transformer baseline. +- `~/ept/cet_aep.py` — CETReal (real non-conservative attention) + AEP, VF gradient, damping, clip. +- `~/ept/aep_characterize.py` / `aep_depth.py` / `aep_projected.py` / `aep_contractive2.py` — AEP sweeps. +- `~/ept/aep_option1.py` — projected (constrained) adjoint. +- `/tmp/lt_ep/lt_ep_attention.py` — AEP on the LM's causal attention (gradient quality). +- `/tmp/lt_ep/lt_ep_ffn.py` — EP Hopfield-memory vs FA-FFN (gradient quality). +- `/tmp/lt_ep/lt_ep_train.py` — H2 end-to-end EP vs BPTT training of the equilibrium LM block + (`--attn_mode real/energy/mono/thick`, jacreg controller `--jacreg/--jr_floor/--res_target`). +- `/tmp/lt_ep/grad_quality.py` — EP-vs-exact gradient cosine vs residual level (the validity probe). +- `/tmp/lt_ep/solver_wall.py` — plain relax vs Anderson convergence per damping level (thin vs thick). 
+ +*(local-transformer working copy in `/tmp/lt_ep`; original `~/local-transformer` untouched.)* + +## 2026-06-13 — late-drift diagnosis + +Every stable recipe bottoms out mid-run, then val drifts up (S0 ~0.05, mild; S1/L2 ~0.5, severe). Multi-hypothesis diagnosis (drift_diag.py, 13 columns / 200 steps): +- **Embedding-runaway hypothesis rejected**: |emb| grows to 11.9, then weight decay pulls it back to 11.6; self-limiting, not the culprit. +- **Drift does not affect reported numbers**: saving the best checkpoint suffices (L2 drifts to 1.70 but reports best 1.214; warm-track reports 1.14). +- **Drift hides no lower solution**: after warm-track bottoms out at 1.14 at step 4800, lr 3e-4 drifts and lr 1e-4 freezes (1.15); neither finds anything lower ⇒ 1.14 is this recipe's true floor, and drift = being jostled on that floor by noisy gradients. +- **Mechanism = late-training steps along bad directions**: a low lr pins the drift down (the anti-drift continuation freezes at 1.16); S0 `cos`(EP,BPTT400) falling 0.85→0.3 late in training confirms that the direction degrades. +- **Severity grows with scale**: S0 (C128/T64) ≈0.05, S1 (C256/T256) ≈0.5. +- Open (S1 warm-start diagnosis + bctl control): is the late cos collapse EP-estimator degradation (A; → fixable by N=4/tracking) or horizon ambiguity of a slow-mixing equilibrium (B; shared by EP and BPTT, → speed up mixing architecturally)? bctl=cos(BPTT150,BPTT400): if it collapses too, then B. + +## 2026-06-13 (cont.) — "why EP far at S1": operating-point + bias/variance diagnosis + +Reframe: at the SAME (tight, res 1.5e-3) operating point EP (1.14–1.39) ≥ BPTT (1.52). The apparent +"gap" is EP(1.14) vs LOOSE-BPTT(0.97) — an operating point (res 1e-2, non-contractive) EP cannot +use. Mechanism: T=256 long-range mixing wants non-contractive dynamics; EP needs contraction for the +fixed point to exist; contraction suppresses the mixing. Signature: gap grows with T (S0/T64: 0.04; +S1/T256: 0.17). + +bias_var.py (v4b ckpt, 16 batches, EP vs BPTT-400, BPTT-150 control), per group [mean-cos | cos-means]: +- all 0.24|0.36 ; BPTT-horizon ctrl 0.44|0.57 +- attn 0.22|0.33 ; ffn 0.32|0.49 ; ln 0.19|0.30 ; emb 0.31|0.37 +Readings: (1) two EXACT gradients (BPTT-150 vs 400) only reach cos-means 0.57 — slow mixing makes +"the gradient" horizon-ambiguous; a PERFECT EP estimator's ceiling here is ~0.57, NOT 1. (2) Both +EP and BPTT improve with averaging (variance present, partly recoverable via navg). (3) EP-specific +residual concentrates in **attention + LN** (EP cos-means 0.33/0.30 vs BPTT 0.57/0.65); **FFN nearly +fine (0.49)**, emb on par. 
⇒ the AEP-corrected softmax/LN path is the weak link, pointing the fix at +qk-norm (bounds logits, conditions the attention Jacobian) + N=4 holo + navg. Chasing cos→1 is the +wrong target; matching BPTT's 0.57 is (~0.21 of recoverable EP-specific alignment). + +## 2026-06-13 (cont.) — scaling to BPE/50M: stability via small-init residual branches + +50M (C=2048) from scratch ABORTED (init res 26; qk-norm+c=2 insufficient — res stuck 0.38). Fix: +**ReZero/Fixup-style small-init of the residual-branch output projections** (WO, pj ×0.1) makes the +thick block ≈ -(z-xin)-c·z (near-identity, trivially contractive) at init; training grows them. +Result at 15M (C=1024): res **1.1e-8 at step 50** (vs abort). The big-width init-instability wall is +a random-init artifact, not fundamental; --resinit folds into the standard recipe. Separately, +torch.compile's 14.6× was a SMALL-model (launch-overhead) artifact; at C≥1024 the relaxation is +compute-bound (fp32 ~15 TFLOP/s) → ~0.06 it/s at 15M, ~0.03 at 50M. Demo feasibility: 15M solo +(~2-4 days); 50M needs multi-GPU data-parallel (per-sample contrast → gradient all-reduce) or the +external cards. The fp32 contrast (catastrophic-cancellation-bound, bf16 dead) keeps it compute-heavy +— again the "digital expensive / physical free" axis. + +## 2026-06-14 — the big-model EP recipe (scaling past S1 needs a stability stack) + +Scaling EP from S1 (C=256/T=256/char) to C=1024/T=512/BPE-4096 broke the S1 recipe: stable for ~50 +steps then violent free-phase blowups (res → 0.5–1.4, jr pinned at max) — the controller's stability +margin shrinks with width/context. CAUTION: a step-50 res≈1e-8 snapshot is NOT proof of stability; +must watch through warmup-end peak-lr (~step 500–1000). The stack that fixes it (each necessary): + 1. **resinit** (WO,pj ×0.1) — near-identity contractive block at init. + 2. 
**qk-norm** (Qwen3 q/k RMSNorm) — bounds attention logits / Jacobian (also fixes the attn/LN + estimator weak link from the 06-13 bias_var diagnosis). + 3. **lr warmup** (≥1000 steps linear) — let the λ-controller establish contraction before big steps. + 4. **µP-style lr scaling**: lr 1e-3 (C=256) → **4e-4 (C=1024)**; 1e-3 caused a 400-step instability + episode, 4e-4 only brief recovered excursions. + 5. **higher jr_max** (16 → 32) — bigger non-conservative force needs more penalty headroom. + 6. validity gate + fuse retained (catch the residual transient excursions). +Locked 15M demo recipe (running, 16k steps, best-ckpt): C=1024 H=16 T=512, the 6 items above, +holo N=2 + adaptive T1/T2, pema 0.999. Smoke val descends 5.86→3.94 BPE-CE by step 1000. +Cost: ~0.05 it/s (fp32, compute-bound; compile's 14.6× was a small-model launch-overhead artifact) +⇒ 15M ~3 days solo. 50M (C=2048) still unstable even with resinit 0.03 + warmup — needs a dedicated +init/lr curriculum AND multi-GPU; deferred as a sub-project (not on the demo critical path). + +## 2026-06-16 — C=512 BPE scale: baselines, AsymEP attribution, early-slowness dissection + +**C=512 BPE-4096 scoreboard** (TinyStories, T=256, B=24; random ln4096 = 8.318; target band 1.0–1.5): + +| run | result | +|---|---| +| BP standard transformer (7.48M, no warmup, lr 3e-3) | **best 1.7921** (20k, DONE) | +| EP thick "orphan" (lr 8e-4, warmup800, resinit0.1, holo2, t2sel40) | **best 2.4037** (20k, DONE; jr pinned at floor 0.1 the ENTIRE run — zero excursions) | +| EP thick "chin" (lr 9e-4, full-Chinchilla 24k target) | ABORT @5401, best 3.2408 | +| BPTT thick (exact grad, same arch) | ABORT @2113, best 3.7331 | + +Data notes (NOT verdicts): (a) BP-C512 itself = 1.79 sits just *above* the 1–1.5 band (BP needs +C≥1024 to enter it); EP-orphan 2.40 is 0.61 above BP-C512. 
(b) **the exact-gradient BPTT twin +aborted EARLIEST** (2113, 3.73) — even worse than the EP runs — consistent with "training walks off +the contractive manifold, exact gradient included"; lr/seed not matched across these, treat as +directional. (c) abort = the `res>0.1 for 100 consecutive steps` fuse. **lr is the knife-edge**: +8e-4 (orphan) had NO excursions and finished; 9e-4 (chin) had periodic jr→32 / res→0.2–0.5 +excursions, most recovered, one didn't make it back inside the 100-step fuse window → abort@5401. +Narrow lr-stability margin at C=512. (d) orphan late curve is a plateau: best moved only +2.4557→2.4037 over the last 3600 steps. + +**AsymEP attribution (web-verified, the J→Jᵀ correction is NOT ours).** arXiv:2602.03670, +Scurria / Vanden Abeele / Mognetti / Massar, "EP for Non-Conservative Systems", names the method +**AsymEP**. Their correction `−2 A_J(z*)(z−z*)` is **mathematically identical** to ours +(`Jv−Jᵀv = (J−Jᵀ)v = 2 A_J v`). Their scope: Hopfield nets + feedforward MLPs (≤500 hidden) + +one CIFAR-10 feedforward CNN; **static inputs; explicit Jacobian construction + decomposition; +NO residual/validity controller; NO attention / transformer / sequence model; do NOT combine with +holomorphic**. Best result: CIFAR-10 FF-CNN AsymEP 89.74% vs BP 90.66% (their variational "Dyadic +EP" matches BP, p=0.75). Their **"VF" baseline** (the force-form ⟨a,∂F/∂θ⟩ readout — *prior art, not +ours*) **collapses without the correction** (CIFAR-10 → 10% = chance; MNIST 64% vs AsymEP 92.7%) — +matching our measured cos≈0.25 for uncorrected non-conservative attention. **Ours on this line**: +matrix-free jvp−vjp (their explicit Jacobian is infeasible at B·T·C dim); softmax-attention +application; the holomorphic combination (the real-linear correction preserves holomorphy); +common-mode-tracking + lock-in variants; the validity-threshold/controller/gate stability stack they +have none of; and the transformer-LM application + scale. 
(⇒ corrected EP_DERIVATION.md, which had +mislabeled force-form VF as our step.) + +## 2026-06-17 (cont.) — full to-spec versions of exp1/4/5/6/7 (first pass was partial) + +The 06-17 campaign ran lighter/subset versions; re-ran the FULL specs. New findings: + +**exp1 (FULL) — multi-checkpoint cos-evolution (snap 0/200/800/2000 + plateau).** EP·BPTT150 cos: +step0(random,res2e-2) 0.26 → step200(res1e-9) **0.99** → step800 0.96 → step2000 0.89 → +plateau(2.40) **−0.05**; BPTT150·BPTT400 = 0.46→**1.00** throughout (true gradient always clean). +New **batch self-cos** at the plateau: EP **−0.27** (gradients on different batches anti-correlated) +vs BPTT **+0.96**. ⇒ decisive: the EP estimator is faithful while descending and undergoes an +**SNR/coherence collapse near the optimum** (true gradient shrinks below the EP bias floor, which is +batch-incoherent) — NOT horizon ambiguity, NOT LR. This is THE 2.40-plateau mechanism. + +**exp5 (FULL) — jr_max sweep (the dropped dimension, and it mattered).** Warm-started from 2.40: +**jr_max 8 → 2.3648**, jr_max 4 → 2.376 — both BEAT jr_max 32 (2.463) and the orphan (2.404) by +~0.04. jr_max 16 → 2.416. floor/target fill still ≥2.47. ⇒ the jr_max=32 ceiling was a small tax; +~8 is better. (Earlier "not a controller tax" was incomplete — floor/target don't help, but the +ceiling does.) Still far from the 1–1.5 band; the estimator floor dominates. + +**exp4 (FULL) — 3 arms.** armA (warmup): ~50-step gate dead-zone. armB (`cprewarm 200` = full-lr +contraction before task warmup): res 1.7e-2 after pre-phase, dead-zone only PARTIALLY mitigated +(val 5.97 vs 6.07 @100) — establishing contraction once isn't enough; warmup's tiny lr can't deepen +it. armC (no-warmup + resinit 0.03): NO dead-zone (cos 0.99 from step20, best 4.03@300) but a violent +excursion (res→0.14@400, recovered). ⇒ small-resinit+no-warmup learns fastest early but is +excursion-prone; the cprewarm decoupling is a weak fix. 
+ +**exp6 (FULL) — branch norms + attention entropy + qk RMS, incl. at plateau.** Branches grow fine +(|WO| 1.95→4.91 over training); at the 2.40 plateau they are LARGE (|pj|=44, |fc|=58, attn/xin≈1.0), +attn entropy 4.56→3.62 (sharpening), qk_rms pinned 1.0 (qknorm working). ⇒ no "stuck-small branch" — +the trainable-α-gate intervention is NOT indicated. + +**exp7 (FULL, after a probe bug fix) — EP vs BP influence + per-position CE.** [Bug caught: the +EP probe embedded the input with random-init weights before loading the ckpt → garbage (per-pos CE +~7); fixed to embed post-load.] Clean influence ||Δz*_q||/δ vs distance d: **BP** 5.34(d1)/0.88(d16)/ +0.15(d64)/0.003(d200); **EP-orphan** 0.24/0.071/0.019/0.006. Both decay with distance. EP ~20× +weaker in ABSOLUTE coupling (scale-caveated — z* vs hidden norm), but NORMALIZED to d=1 EP reaches +*farther* than BP (0.30 vs 0.16 at d16; 0.08 vs 0.03 at d64). Per-position CE: EP uniformly ~0.7–1.0 +worse than BP at every position (no long-range-specific failure). ⇒ **does NOT support — arguably +refutes — "contraction cuts long-range mixing"**; EP's deficit is uniform, consistent with the +estimator floor (exp1) being the binding constraint. (BP-C512 trained to **1.6953**.) +Honest gap: a stable BPTT-ctl-tight ckpt does not exist (BPTT broke at C512), so the cross-method +influence comparison is EP-vs-BP only, not the full tight/loose/BP set. + +**Campaign verdict (data; user concludes):** the C512 2.40 plateau is, decisively, an **EP estimator +SNR/coherence collapse near the optimum** (exp1: cos→0, self-cos −0.27, k→4000, true gradient clean). +It is not LR (exp2), not horizon (exp1), not λ-floor/target (exp5) — though jr_max≤8 buys ~0.04 +(exp5). The architecture has a SEPARATE wall: exact-gradient BPTT can't stay stable at C512 (exp3). +The contractive-mixing hypothesis is NOT supported (exp7: EP's range is comparable to BP, deficit is +uniform). 
⇒ the EP lever is reducing the estimator bias floor near the optimum (N=4 holo / tracking- +AEP / lock-in / navg); the architecture lever is the stability margin for exact gradients. + +**Early-slowness dissection (C512 BPE) — cos/k probe CONFIRMED; ablations STILL RUNNING.** +From existing logs: **EP ≈ BPTT at every early step** (step 200: 5.84 = 5.84; step 800: 3.86 vs +4.05; step 2000: 3.58 vs 3.73), both **~1.2 behind BP** (2.76 @800). ⇒ early slowness is *not* the +EP estimator — the exact-gradient twin on the same architecture is equally slow; the gap is the +equilibrium architecture + recipe vs a standard transformer, present already at step 200 (pre +warmup-end). Probe (`--probe_bptt`: cos(g_EP,g_BPTT) + k=|gEP|/|gBPTT| along the REAL trajectory): +- **Estimator quality tracks res tightly**: res≈1e-9 → cos 0.93–0.99, k≈1.0 (all groups); a res + spike to 1.4e-3 (mid-excursion) → cos all 0.86 / attn 0.70 / emb 0.13, k→0.4. ⇒ the earlier + kinit *synthetic* probe (cos 0.33 / k 0.42 "at init") was a probe-operating-point artifact; on the + real trajectory the EP gradient is faithful (cos≈1, k≈1) **whenever the free phase is converged**, + and **k≈1 means no adaptive-k is needed in the converged regime**. +- **warmup × validity-gate early-stall mechanism**: with warmup, early lr is tiny → the contraction + homeostat can't pull res under the gate (5e-3) → free-phase res stuck ~1.5e-2 for ~50 steps → + the **validity gate skips the nudge** → the reported g_EP is pure jacreg/contraction (cos≈0, + k≈11) = *the task is not being learned at all* for ~50–60 steps. res drops < gate by ~step 60–80 + → cos jumps to 0.99. No-warmup: res converged (1e-9) by step 20, cos 0.98 from step 20 + (val 5.72 vs warmup 6.07 @ step100). +- **CAVEAT — the tradeoff is UNRESOLVED**: the no-warmup arm hit an instability excursion ~step 180 + (res→1.4e-3, jr→32, recovering by 200) — i.e. warmup's stability role is real. 
Whether no-warmup + survives to 1200, plus the **resinit (arm B)** and **lr 2e-3 (arm D)** attributions, are STILL + RUNNING as of 16:00 — no verdict drawn. Probe instrumentation: `lt_ep_train.py --probe_bptt N`; + launcher `/tmp/lt_ep/early_dissect.sh`. + +## 2026-06-17 — C512 diagnostic campaign (7 experiments): the 2.40 plateau is an EP estimator bias-floor; exact-gradient BPTT is LESS stable than EP at C512 + +Controlled campaign at C512 BPE (orphan ckpt = EP best 2.4037). Raw results + the causal cut each +experiment was built for. Scripts: `lr_sweep.py`, `triangulation.py`, `mixing_probe.py`, +`lt_ep_train.py --probe_branch/--probe_bptt`; orchestration `master.sh`. + +**exp3 — BPTT+controller, the decisive matched cut (GPU0).** Same arch+controller+lr 8e-4 as the EP +orphan; only the task gradient differs (exact vs EP). Descended cleanly to best **3.85 @step1400**, +then **DESTABILIZED at step 1600** (jr→32, res→0.07) and lodged in a broken basin (val 6.27, +res ~0.068 — just under the 0.1 fuse, so no abort) for 8400+ steps. **The exact-gradient twin did +WORSE than EP (3.85-stuck vs EP 2.40-completed).** ⇒ at C512 the contraction controller does NOT +keep the exact gradient on the manifold; EP's implicit contraction-bias is what kept the orphan +alive. Inverts the S0 picture (BPTT+ctl 1.635 < EP 1.676). Caveat: single lr/seed. + +**exp1 — triangulation at the 2.40 plateau (estimator vs horizon).** BPTT150 vs BPTT400 +cos = **1.000** (all/attn/ffn; emb 0.986) — true gradient perfectly well-defined, ZERO horizon +ambiguity — but EP vs BPTT cos = **−0.045** (orthogonal), k=|gEP|/|gBPTT| ≈ **4000**. ⇒ the plateau +is an **EP-specific estimator failure**, not slow-mixing horizon ambiguity. + +**exp2 — one-step LR-sweep at the plateau.** BPTT best ΔCE −0.16 (lr 1e-2); EP best −0.04 (lr 1e-4), +diverges for lr≥3e-4. cos 0.11, k 486. ⇒ NOT LR-inequivalence; EP's direction caps one-step descent +4× below BPTT at any lr. 
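The two diagnostics quoted in exp1 and exp2, the cosine between an estimated and a reference gradient and the norm ratio k=|gEP|/|gBPTT|, can be sketched as below. This is a minimal NumPy illustration, not the project's probe code: `cos_and_k`, `toy_grads`, and the parameter names `attn.WQ` / `ffn.fc` are hypothetical, and the toy gradients are synthetic (a reference signal of adjustable size plus a fixed unit-norm bias orthogonal to it).

```python
import numpy as np

def cos_and_k(g_est, g_ref, prefix=""):
    """Cosine and norm ratio k = |g_est| / |g_ref| over parameters whose name starts with prefix."""
    names = sorted(n for n in g_ref if n.startswith(prefix))
    a = np.concatenate([np.ravel(g_est[n]) for n in names])
    b = np.concatenate([np.ravel(g_ref[n]) for n in names])
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)), float(na / nb)

def toy_grads(signal_scale, seed=0):
    """Synthetic pair: reference = signal_scale * u, estimate = reference + fixed orthogonal bias."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(64)
    u /= np.linalg.norm(u)
    bias = rng.standard_normal(64)
    bias -= (bias @ u) * u          # make the bias orthogonal to the signal direction
    bias /= np.linalg.norm(bias)    # unit-norm bias floor
    ref = signal_scale * u
    est = ref + bias
    split = lambda v: {"attn.WQ": v[:32], "ffn.fc": v[32:]}
    return split(est), split(ref)

# Large true gradient: the fixed bias is negligible.
cos_hi, k_hi = cos_and_k(*toy_grads(100.0))
# Tiny true gradient: the same bias dominates the estimate.
cos_lo, k_lo = cos_and_k(*toy_grads(1e-3))
print(cos_hi, k_hi, cos_lo, k_lo)
```

In this toy, shrinking the reference gradient while the bias stays fixed drives the cosine toward 0 and k far above 1. Passing `prefix="attn"` restricts both quantities to one parameter group, in the spirit of the per-group (all/attn/ffn/emb) readings above.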
+ +**exp1+exp2 synthesis:** the EP estimator has a ~fixed bias/noise floor. k runs **~1 during active +training → ~4000 at the plateau** (true gradient shrank below the EP bias). 2.40 is the SNR +crossover; below it the true gradient is smaller than the EP bias so EP can't see it. Not LR, not +horizon, not λ. ⇒ the EP lever is **reducing the estimator floor near the optimum** (N=4 holo / +tracking-AEP / lock-in / navg averaging), distinct from the architecture's stability lever. + +**exp5 — λ floor/target grid (warm-start from orphan).** base 0.1/1.5e-3 → 2.463; 0.05/3e-3 → 2.491; +0.03/3e-3 → 2.485; 0.1/5e-3 → 2.488 — ALL worse than orphan 2.40 (slight late-drift up). ⇒ relaxing +the controller does NOT break the plateau — not a λ-floor tax. (floor 0.03 reached jr=0 without +collapse over 1200 warm-started steps.) + +**exp6 — branch-growth (resinit vs no-resinit).** resinit 0.1: branches grow gradually & stably +(|WO| 2.35→4.91, |pj| 3.23→6.57 over 1000 steps; res ~1e-4; best 3.71). no-resinit: branches start +large (|WO|~20, |pj|~20), excursion at step 800 (res→0.23, jr→32; recovered, |WO|→14.7; best 4.04). +⇒ resinit's gradual growth is stabilizing; branches do grow (not stuck-small). + +**exp7 — mixing/influence length.** ||Δz_q*||/δ FLAT across distance (~0.007 at d=1 and d=200) and +weak; loose-target (lam_t5) similar (~0.005). ⇒ no distance-decay signature; INCONCLUSIVE on the +contraction-cuts-mixing hypothesis (no non-contractive/BP reference; coupling uniformly weak). + +**exp4 — warmup/gate (probes done).** no-warmup best 3.56@1200 (survived its step-180 excursion); +warmup 4.11@800. warmup×validity-gate early-stall confirmed (warmup keeps res>gate ~50 steps → +nudge skipped → no task learning early). 
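The gate and fuse behaviour referred to in this log (the nudge is skipped while the free-phase residual exceeds the gate; the run aborts after `res>0.1` for 100 consecutive steps) can be sketched as follows. This is a minimal illustration, not the code in `lt_ep_train.py`: the class name, method name, and return strings are hypothetical, and only the thresholds 5e-3, 0.1, and 100 are taken from this document.

```python
class ValidityGate:
    """Hypothetical sketch: decide per step whether to nudge, skip the nudge, or abort."""

    def __init__(self, gate=5e-3, fuse_res=0.1, fuse_steps=100):
        self.gate = gate              # residual above this => EP estimate treated as invalid
        self.fuse_res = fuse_res      # residual above this counts toward the fuse
        self.fuse_steps = fuse_steps  # consecutive bad steps before aborting
        self.streak = 0

    def check(self, res):
        # Count consecutive steps above the fuse level; any good step resets the count.
        self.streak = self.streak + 1 if res > self.fuse_res else 0
        if self.streak >= self.fuse_steps:
            return "abort"
        return "skip_nudge" if res > self.gate else "nudge"

# Warmup-like trajectory: residual stuck above the gate for 50 steps, then converged.
gate = ValidityGate()
warmup_res = [1.5e-2] * 50 + [1e-9] * 50
actions = [gate.check(r) for r in warmup_res]
skipped = actions.count("skip_nudge")

# Blown-up trajectory: residual above the fuse level on every step.
blown = ValidityGate()
blown_actions = [blown.check(0.2) for _ in range(100)]

# A residual just under the fuse level never trips it, however long it lasts.
lodged = ValidityGate()
lodged_actions = {lodged.check(0.068) for _ in range(500)}
print(skipped, blown_actions[-1], lodged_actions)
```

The first trajectory shows the early-stall pattern: every step with the residual above the gate contributes no task gradient. The third mirrors a run sitting at res ~0.068, which skips the nudge indefinitely without ever aborting.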
+ +**Net (data; user to conclude):** the 1–1.5 band is blocked at C512 by TWO distinct walls — +(1) EP's estimator bias-floor caps EP at 2.40 (cos→0, k→4000 at the plateau; true gradient is +clean), and (2) the equilibrium architecture+controller can't keep the EXACT gradient stable at +C512 (BPTT broke at step 1600). EP's contraction-bias makes it the more robust of the two here. +The contractive-conflict hypothesis is supported on the BPTT side; the EP plateau is specifically +an estimator-SNR-floor effect. + +## 2026-06-17 (round 2) — the plateau bias floor IS the frozen AEP linearization; tracking-AEP fixes it (gradient level) + +Estimator ablation at the 2.40 plateau ckpt (`plateau_ablation.py`, vs BPTT400, n_grad=4): + +| estimator | cos | k | batch self-cos | +|---|---|---|---| +| N2 frozen-AEP (current) | −0.045 | 4133 | −0.27 (incoherent) | +| N4 / N8 | −0.03 / −0.01 | 31 / 29 | ~0 | +| N2 r=.05/.10/.20 | −0.04 | ~4200 | −0.27 | +| N2 fixed-T2 20/80/120 | nan (diverges w/o snapshot selection) | — | — | +| **track_N2 (common-mode AEP)** | **0.997** | **0.9** | **+0.95 (coherent)** | + +Diagnostic at z*: **|Jv−Jᵀv|/|Jv| = 1.37** (J highly non-normal). ⇒ the bias floor IS the +frozen-at-z* AEP linearization. N=4/8 fix only the magnitude (k 4000→30), not the direction; r and +T2 do nothing; **tracking-AEP (re-linearize at the moving common mode z̄=½(z₊+z₋)) restores cos +−0.045→0.997, coherence −0.27→+0.95, magnitude 4133×→0.9×.** + +**exp D (per-group):** collapse is UNIFORM across WQ/WK/WV/WO/fc/pj/ln1/ln2/tok/pos (all cos +−0.03..−0.08, k 3k–18k) — the corrupted contrast `a` poisons every VF gradient equally (single +shared cause, not attention/LN-specific unlike the S1 bias_var diagnosis). + +**exp B (navg):** cos(EP,BPTT) over navg = 0.37→0.40→0.55→−0.30→−0.21 (1/2/4/8/16); self-cos rises +(0.58→0.87) but toward a direction NEGATIVELY correlated with BPTT; dCE diverges at navg≥8. 
⇒ +**deterministic bias, not variance** — averaging converges to the wrong (biased) direction, so +navg/restart-averaging will NOT fix it. + +### 2026-06-17 (round 2b) — free-phase ε is mistuned ~3× (monDEQ α-analysis analog); Anderson untapped + +Free-phase convergence probe at the 2.40 ckpt (`solver_probe.py`, res_F=‖F(z)‖/‖z‖, eps-independent, +B=2, 4 batches). Steps to res_F≤1e-4 / ≤1e-6, plain Euler: ε0.05→470/844, **ε0.1 (ours)→236/548**, +ε0.2→121/274, **ε0.3→85/182** (~2.8× faster than 0.1, same fixed point, stable), ε0.5→stalls at +res~0.9 (non-convergent — the stable ceiling is ~0.3–0.4). Anderson(m=5): ε0.1→85/197, **ε0.2→58/118 +(~4× vs our current 0.1)**, robust across ε. ⇒ **ε=0.1 is ~3× too small**; ε≈0.3 or Anderson@ε0.2 cut +free-phase force-evals ~3–4× (and let T1≈100 always clear res≤1e-4, dropping adaptive-T1) → directly +attacks the dominant EP compute cost AND tightens res (better estimator). This IS the "preconditioned/ +Anderson relaxation" open item, now quantified. Caveats: free phase at trained point / B=2 (optimal ε +may shift early-training / slow-mixing); the **nudged phase shares ε but has AEP+holo dynamics — bigger +ε there needs a separate stability check** before changing. Confirms the monotone-operator α-sensitivity +(non-normal dynamics, |Jv−Jᵀv|/|Jv|=1.37) carries over to our hand-set Euler step. + +**Decision:** the EP lever collapses to **common-mode tracking-AEP** (not N=4 / lock-in / navg). +Caveat: validated at the gradient level at one ckpt; the one-step ΔCE is magnitude-confounded, so the +NECESSARY next test is **training** — does EP with `--track` descend below 2.40 (`holo_a_track` +already wired)? exp C (BPTT+controller rescue, lr 2e-4/4e-4/6e-4 × jr_max 16/32) runs separately for +the architecture-ceiling question. 
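The ε-sensitivity of a plain Euler free phase (a sweet spot in ε, and a ceiling above which the relaxation no longer converges) can be reproduced on a toy problem. The sketch below is an illustration only, not `solver_probe.py`: the 2-D linear vector field, the starting point, and the helper name `steps_to_tol` are assumptions, and its step counts have nothing to do with the numbers measured on the real model. It uses the same ε-independent residual res_F=‖F(z)‖/‖z‖.

```python
import numpy as np

# Toy rotating contractive field: the Jacobian A has eigenvalues -1 ± 3j.
A = np.array([[-1.0, -3.0], [3.0, -1.0]])
b = np.array([1.0, 2.0])

def F(z):
    return A @ z + b

def steps_to_tol(eps, tol=1e-4, max_steps=5000):
    """Euler steps z <- z + eps*F(z) until ||F(z)||/||z|| <= tol; None if never reached."""
    z = np.ones(2)
    for t in range(1, max_steps + 1):
        z = z + eps * F(z)
        if np.linalg.norm(F(z)) / np.linalg.norm(z) <= tol:
            return t
    return None

results = {eps: steps_to_tol(eps) for eps in (0.05, 0.1, 0.15, 0.25)}
for eps, steps in results.items():
    print(f"eps={eps}: steps={steps}")
```

For this field the Euler multiplier has squared modulus 1 − 2ε + 10ε², so the iteration contracts only for ε < 0.2 and is fastest at ε = 0.1; the imaginary part of the eigenvalues sets the ceiling. Anderson acceleration is not covered by this sketch.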
+ +## 2026-06-18 — tracking-AEP VALIDATED IN TRAINING: it breaks the 2.40 plateau + +Warm-track test (warm-start from the 2.40 orphan ckpt, jr_max 8, warmup 0, 2500 steps): +- **TRACK-warm (common-mode tracking-AEP): 2.40 → best 2.1628 @2500, still descending**, stable + (res ~6e-5, jr at floor). ⇒ tracking-AEP descends PAST the 2.40 plateau — the round-2 estimator- + floor fix works in training, exactly as the fixed-ckpt probe predicted (cos −0.05→0.997). +- **STD-warm control (standard estimator, same recipe): ABORTED @step 397** (blew up, res→0.21). +- **TRACK-fresh (tracking from scratch): ABORTED @968** (blew from random init). + +⇒ under the identical aggressive recipe (jr_max 8), tracking-AEP is BOTH stable AND descends, while +the standard estimator blows up. But tracking-AEP is **not a from-scratch drop-in** (fresh blows up) +— the recipe is **two-phase warm-track**: standard estimator to ~2.40 (where it's stable), then +switch to common-mode tracking-AEP to descend below the floor (matches the S1 warm-track champion). +Caveat: std-control's abort is partly the aggressive jr_max 8 (orphan used jr_max 32, stable at +2.40); the clean claim is "at jr_max 8 tracking survives+descends where std blows". The 2.16 ate +~0.24 of the 0.61 EP-vs-BP gap and was still descending → continuing (`trkcont` from 2.16) to find +the ceiling. + +**exp C — architecture ceiling (BPTT+controller rescue, 2500 steps):** stable exact-gradient exists +(lr≤6e-4 never breaks, unlike lr8e-4 which broke @1600). Best: lr6e-4 jr16 → 3.08, still descending. +At equal step (2500): **BP 2.34 ▸ BPTT 3.08 ▸ EP-orphan 3.55** ⇒ the ~10× proportional slowdown +(confirmed: median 9.4× in the descending regime) splits into an lr-stability-capped architecture +handicap (~0.74, BP→BPTT; the equilibrium block can't take BP's 3e-3 lr) AND an EP-estimator +handicap (~0.47, BPTT→EP). jr_max 16 > 32 confirmed again. Long ceiling run (`bptt_ceiling`, +lr6e-4 jr16, 12k) pending. 
+
+**Found + staged a real code inconsistency:** free phase / model / readout use erf-GELU (F.gelu)
+while the holo nudge uses tanh-GELU (cgelu) → z* isn't the nudge-force's fixed point → spurious
+common-mode drift in the contrast every step (which tracking-AEP partly absorbs — possibly part of
+why it wins). Fix staged as `--gelu tanh` (additive, default erf). Separately: free-phase ε=0.1 is
+~3× too small (ε0.3 or Anderson → ~3–4× fewer relaxation steps; `solver_probe`).
+
+---
+
+## 2026-06-24 — below-CE-2.1 divergence: DIAGNOSED (genuine Hopf) + FIXED (resreg/jacreg = finite-T1 LE control)
+
+Full write-up: **`SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md`**. This pins down the forward-stability wall that sat under TL;DR #3 ("no loss ceiling, but a validity threshold that costs reg + steps") — the threshold is now mechanistically a **Hopf bifurcation**, and the "regularization tax" is specifically **LE control**.
+
+**The mechanism (3 independent confirmations).** Below CE~2.1 the relaxation diverges because the learned **non-conservative attention** operator undergoes a **genuine continuous-time Hopf bifurcation** — a complex eigenvalue pair crosses **Re μ > 0** as EP makes attention expressive. Matrix-free FD-JVP Arnoldi on M=I+εJ (`ep_run/eig_probe.py`): leading continuous μ=(λ−1)/ε goes **−0.024 (s2000, CE 3.13, STABLE) → +0.44±1.37j (s3200, 2.74) → +1.35±2.08j (ep_eps05, 2.41)** — Re μ and |Im μ| both grow as CE drops. Confirmed by (a) eval limit-cycle + attention-knockout (α=0.2 converges), (b) Anderson can't reach a root at s3200 (near-root Re μ=+0.24 rotating). **Earlier "Euler artifact" read (and the cont.6 forward-mode/RTRL framing) are SUPERSEDED** — fugu caught that the ε-sweep "convergence" was the *step* residual ε·g, not g→0; an intermediate misread. (The Euler over-amplification is a real *second* layer — ε relocates the wall 2.74→2.41 — but the root cause is Re μ>0, which no ε closes.)
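A minimal sketch of the eigenvalue mapping used in the mechanism paragraph above, under loud assumptions: a toy linear force whose eigenvalues are set to the s3200 value quoted there (0.44 ± 1.37j), a dense `eigvals` call standing in for Arnoldi, and a finite-difference quotient along the unit direction u = v/‖v‖. The names `fd_jvp` and `continuous_mu` are hypothetical and this is not the code in `ep_run/eig_probe.py`. It recovers μ = (λ−1)/ε from M = I + εJ and checks that with Re μ > 0 the Euler multiplier |1 + εμ| exceeds 1 for every ε tried.

```python
import numpy as np

def fd_jvp(F, z, v, rel_h=1e-3):
    """Finite-difference J·v: (F(z + h·u) - F(z)) / h · ||v||, with u = v/||v|| (assumed)."""
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return np.zeros_like(z)
    h = rel_h * np.linalg.norm(z)
    return (F(z + h * (v / nv)) - F(z)) / h * nv

def continuous_mu(F, z, eps):
    """Eigenvalues lambda of M = I + eps·J, mapped to continuous mu = (lambda - 1)/eps.
    Dense for the toy; a matrix-free probe would run Arnoldi on the same matvec."""
    cols = [e + eps * fd_jvp(F, z, e) for e in np.eye(z.size)]
    lam = np.linalg.eigvals(np.stack(cols, axis=1))
    return (lam - 1.0) / eps

# Toy linear force with eigenvalues 0.44 ± 1.37j (the quoted s3200 pair).
A = np.array([[0.44, -1.37],
              [1.37,  0.44]])
F = lambda z: A @ z
z0 = np.array([1.0, 0.5])

rates = []
for eps in (0.1, 0.05, 0.01):
    mu = continuous_mu(F, z0, eps)
    mult = np.abs(1.0 + eps * mu).max()   # per-step Euler multiplier
    rates.append(np.log(mult) / eps)      # growth rate per unit time
    print(f"eps={eps}: mu={mu[0]:.3f}, |1+eps*mu|={mult:.4f}, rate={rates[-1]:.3f}")
```

In the toy, the per-unit-time growth rate ln|1+εμ|/ε stays above Re μ = 0.44 for every ε and only approaches it as ε shrinks, which is the sense in which a smaller ε delays the blow-up without removing it.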
+
+**Why equilibrium-EP hits this and BPTT doesn't (structural).** Equilibrium-EP optimizes L(z*); its gradient is blind to *finite-time contraction* (a param eroding ρ→1 with z* fixed has zero equilibrium gradient). BPTT differentiates the T1 unroll, so a non-converged z_T1 → bad output → penalized → it implicitly defends contraction. So EP needs that defense added back explicitly.
+
+**The fix — control the finite-T1 Lyapunov exponent (LE).** Stability ⟺ ρ(M)<1 ⟺ finite-T1 LE<0. Two ~orthogonal handles, they stack:
+- **resreg** (`lt_ep_train.py:220-231`) = penalize the T1-residual ‖εF(z_T1)‖ ~ ρ^T1 = exp(T1·LE) → a **direct LE penalty** (also catches the non-normal transients eigenvalues miss). **The PROVEN below-2.1 stabilizer.**
+- **jacreg** (`:211-219`) = penalize ‖J_nc·v‖ → shrinks |Im μ|, pushes the pair to Re μ<0 (cause-side). `ep_run/eig_jacreg.py` confirmed at the mechanism level: at the same CE~2.74, frozen-jacreg = Re μ=+0.45 rotating g_floor 0.26, adaptive-jacreg = **Re μ=−0.23 real, g_floor 1e-4** (Hopf killed, true fixed point restored, AsymEP validity restored).
+
+**★ The 2.09 recipe = FROM SCRATCH + resreg=0.2 + FROZEN jr** (NOT adaptive jacreg — that was a session-long detour; cmd at `EP_BELOW210_DIAGNOSIS_FIX.md:97-101`). Original ep_resreg2 reached **2.0573** (lowest EP ever, lost to /tmp wipe; rebuild 2.22). This session: ep_resreg_warm (resreg eager, warm from stable s2000) descending smoothly through 2.30 (peak res 1.6e-2, no spikes) — the clean 2.09 test; ep_rr_scratch (from-scratch, proven recipe) launched as the robust confirmation. **Warm-start ONLY from a STABLE operator (s2000, Re μ<0); resreg/jacreg PREVENT instability, they can't RESCUE an already-blown one** (warming from ep_eps05@2.41 blew).
+
+**Infra (#14).** `--compile` EXONERATED + SAFE (compiles the free-phase `tforce`; numerically identical to eager — z150 rel-diff 1.6e-7, pure fp32; ~1.43x, ~3.3x with t2sel40). **`--tf32` UNSAFE — do NOT use** (10-bit mantissa ≈ 1e-3 perturbation; the relaxation is hyper-sensitive — ε 0.1→0.05 alone moved the wall 0.33 — so TF32 is too coarse). EP parallelism for the deep/scaling phase: no sequential backward, coupled equilibrium stack (#13) → depth-parallel, adaptive-T1 (residual-stopped) cleaner than adaptive-ε.
+
+**Lit anchor.** Hopfield-ResNet (arXiv 2509.26003) EP-trained 12 layers — but it's **conservative** (energy Φ, symmetric weights, no oscillation by construction). Confirms non-conservativity is our culprit; **we are the first to EP-train non-conservative attention** (which has the Hopf), solved via resreg/jacreg.
diff --git a/docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md b/docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md
new file mode 100644
index 0000000..d605140
--- /dev/null
+++ b/docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md
@@ -0,0 +1,81 @@
+# Session 2026-06-22→24: below-2.1 wall — DIAGNOSIS FLIPPED to genuine Hopf, FIX = resreg/jacreg (LE control), 2.09 recipe recovered
+
+THE definitive write-up of this multi-day session. Supersedes the "Euler-artifact" framing in EP_BELOW210 cont.7.
+
+## 1. HEADLINE — the diagnosis flipped twice, landed on GENUINE Hopf instability
+The C512 EP below-CE-2.1 divergence is a **genuine continuous-time Hopf instability** of the learned **non-conservative attention** operator — a complex eigenvalue pair crossing **Re μ > 0** as EP trains the attention expressive. **NOT a discrete-Euler artifact** (that was an intermediate wrong turn). Analog/continuous hardware is NOT automatically immune.
+
+### Evidence chain (3 independent methods, all in ep_run/)
+1. **eval_relax_s3200.py** — relax the marginal ckpt redx s3200 (val 2.74) 6000 steps → res floors ~2.3e-2, OSCILLATES, no fixed point = limit cycle.
+2. **knockout_s3200.py** — scale attention output WO×α: cycle scales with α, CONVERGES at α=0.2 → the **non-conservative attention drives the cycle**.
+3. **eps_sweep_s3200.py** — cycle amplitude shrinks monotonically as ε↓ (0.1→2.3e-2, 0.01→8.9e-4). *Intermediate misread*: thought "Euler artifact, analog immune" (cont.7). **fugu caught the bug**: that 8.9e-4 at ε=0.01 is the STEP residual `r=ε·g`, i.e. the SAME force-floor g≈0.09 — so the sweep proves the oscillation is discrete-amplified but does NOT prove a clean continuous fixed point exists.
+4. **ε-TRAINING-monotonicity (3 runs)**: ε=0.1→blew@**2.74**; ε=0.1,t2sel=160 (BETTER gradient)→blew@**3.02** EARLIER (⇒ gradient quality is NOT the lever); ε=0.05→blew@**2.41**. Smaller ε → strictly lower wall.
+5. **eig_probe.py** (matrix-free FINITE-DIFFERENCE-JVP Arnoldi on M=I+εJ — DECISIVE): leading continuous μ=(λ−1)/ε at the operating point:
+   - s2000 (3.13): all |λ|<1, **Re μ=−0.024 STABLE**
+   - s3200 (2.74): top |λ|=1.044, **Re μ=+0.44, COMPLEX** (μ=+0.26±1.37j rotating)
+   - ep_eps05 (2.41): top |λ|=1.14, **Re μ=+1.35**, complex (±2.08j)
+   - **Re μ grows −0.02→+0.44→+1.35 as CE drops, |Im μ| grows ~0→1.37→2.08.** GENUINE instability, growing.
+   - NOTE: autograd JVP gave 0 (blk.force detaches internally) → use **FD-JVP** `(F(z+h·u)−F(z))/h·‖v‖`, h=1e-3·‖z‖.
+6. **anderson_control.py** — s3200: plain relax floors (res 2.7e-2), Anderson CANNOT reach a root (best 1.4e-3), near-root has **Re μ=+0.24, |Im μ|≈2.0 UNSTABLE rotating**. Solver can't manufacture an absent/unstable equilibrium.
+
+### Reconciles with everything
+- ε-monotonicity: with Re μ>0, `|1+εμ|>1` for ANY ε; smaller ε just makes the blow-up SLOWER (less discrete over-amplification, esp. the rotating (εb)² term) → delays → wall RELOCATES (2.74→2.41), never closes.
+- ep_t2fix earlier-blow: cleaner gradient drives into the stiff/unstable regime faster.
+- It's a COMPOUND: continuous instability (root, Re μ>0) + explicit-Euler over-amplification (2nd layer). ε attacks layer 2 (delay); jacreg/resreg attack the root.
+
+## 2. THE FIX — resreg AND jacreg both work, by controlling the finite-T1 LE
+Both add back the **finite-time contraction defense** that equilibrium-EP's L(z*) structurally lacks (and that BPTT has implicitly — BPTT differentiates the T1 unroll, so a non-converged z_T1 → bad output → penalized).
+
+- **jacreg** = penalize ‖J_nc·v‖ (the non-conservative/rotating Jacobian = attention+FFN; lt_ep_train.py:211-219). **Cause-side**: shrinks |Im μ| → pushes the complex pair back to Re μ<0.
+- **resreg** = penalize the T1-residual ‖εF(z_T1)‖ (lt_ep_train.py:220-231). **Symptom-side**: residual ~ρ^T1=exp(T1·LE), so resreg ≈ a DIRECT finite-T1 Lyapunov-exponent penalty (catches non-normal transients the eigenvalues miss). ~orthogonal to BPTT-grad (cos −0.047) — a constraint that keeps res low so the EP estimate doesn't collapse.
+- **Geometry**: relaxation map M=I+εJ; stable ⟺ ρ(M)<1 ⟺ finite-T1 LE<0. Hopf = non-conservative part pushes complex μ past Re μ=0. resreg controls LE directly (output), jacreg controls the structural source (‖J_nc‖). **They stack** (orthogonal handles; cause+symptom).
+- **eig_jacreg.py CONFIRMED jacreg at the mechanism level** (cont.9): at the SAME loss ~2.74, FROZEN jacreg (redx) = Re μ=+0.45 rotating UNSTABLE g_floor 0.26; ADAPTIVE jacreg (ep_jacreg @2.75) = **Re μ=−0.23 STABLE real, g_floor 0.0001 (true fixed point restored)**. jacreg killed the Hopf + restored AsymEP validity.
+
+## 3. ★★ THE 2.09 CONFIG (recovered from EP_BELOW210:97-101) — the key stabilizer is RESREG, NOT jacreg ★★
+The session spent days on adaptive-jacreg; the USER pushed to find the actual 2.09 recipe. It is **FROM SCRATCH + resreg=0.2 + FROZEN jr=0.1** (the original ep_resreg2 reached **2.0573**, lowest EP ever; lost to /tmp wipe; rebuilt ep_resreg_scratch reached 2.22):
+```
+python3 lt_ep_train.py --mode ep --attn_mode thick --B 24 --C 512 --H 16 --T 256 --c 1.0 \
+  --jacreg 0.1 --jr_floor 0.1 --jr_max 0.1 --holo 2 --hr 0.02 --t2sel 40 --track --pema 0.999 --t1max 300 \
+  --res_est 1e-4 --res_gate 0 --resreg 0.2 --qknorm --resinit 0.1 --warmup 800 --T1 150 --T2 20 --lr 6e-4 \
+  --wsd 0.25 --steps 32000 ... # NO --init_ckpt = from scratch
+```
+- jr_max=jr_floor=0.1 = jacreg FROZEN (controller off). Adaptive jacreg = release jr_max (e.g. 16); controller :520-529 ramps jr by `(res/res_target)^0.3`.
+- res_gate MUST be 0 (the gate early-returns before the resreg penalty → bypasses it; res_gate≠0 blew @200 historically).
+
+## 4. RUNS this session (all C512, warm from s2000 unless noted)
+- **s2000** = `runs/redx_traj/s2000.pt` (redx step 2000, val 3.13) = the STABLE pre-bifurcation start (Re μ=−0.024). Use THIS to warm-start; ep_eps05.pt (2.41, Re μ=+1.35) is UNSTABLE — resreg/jacreg can't RESCUE an already-unstable operator (they PREVENT, they don't reverse).
+- ep_jacreg (adaptive jacreg, warm s2000): twitchy (jr-spikes to 15 → CE spikes), crawled to ~2.32 then ~stuck/slow. Broke past 2.74/2.41 cleanly though.
+- **ep_resreg_warm** (resreg=0.2 eager, t2sel=160, warm s2000): SMOOTH (peak res 1.6e-2, no spikes), LEADING at **2.2985** — the clean 2.09 test, still alive.
+- ep_resreg_fast (resreg t2sel40, warm from UNSTABLE ep_eps05@2.41): BLEW — start-point was unstable. (Confirmed by direct eval: fp32 & TF32 both 2.55 → load fine, TF32 didn't change relaxation; the operator just destabilizes under training.)
+- ep_resreg_c (resreg t2sel40 +compile, warm s2000): BLEW @2.31. Cause = **t2sel40 (lean gradient)**, NOT compile, NOT warm-start (resreg_warm same warm-start is fine).
+- ep_rr_scratch (FROM SCRATCH original recipe +compile): launched 2026-06-24 05:32 on GPU3 — tests the user's "from-scratch is robust" hypothesis + the proven 2.09 path.
+- ep_eps05 (ε=0.05): blew @2.41 (the ε-monotonicity run).
+
+## 5. INFRA / #14 speedup findings
+- **compile: EXONERATED + SAFE.** lt_ep_train has `--compile` (compiles the FREE-phase via `tforce`, the no_grad fast path; gradient stays eager). Verified numerically identical: tforce vs force rel-diff 9e-7; compile-z150 vs eager-z150 **1.6e-7** (just fp32 op-reorder rounding). Speedup ~1.43x (free phase) / ~3.3x with t2sel40. reduce-overhead/CUDA-graphs BROKEN (0.07x, graph breaks — needs fullgraph/static-shape fix).
+- **TF32: DROPPED (user decision).** `--tf32` exists (lt_ep_train:368, sets allow_tf32). 10-bit mantissa ≈ 1e-3 precision loss. The relaxation is HYPER-precision-sensitive (ε 0.1→0.05 moved the wall 0.33!), so TF32's 1e-3 perturbation is too coarse → risky. compile(fp32, 1.6e-7) is ~10⁶× below the sensitivity scale → safe; TF32 is not. **DO NOT use --tf32.**
+- **EP parallelism advantages for #14 (esp. for the scaling/deep phase):**
+  1. NO sequential backward (vs BPTT's N reverse layers) + NO activation graph (memory-light).
+  2. COUPLED equilibrium stack (#13, like the Hopfield-ResNet) → all layers relax CONCURRENTLY each step → depth parallelizes (vs BPTT's 2N sequential). (DEQ-style z*=f(z*) with deep f does NOT parallelize.)
+  3. **adaptive-T1** (relax until residual<tol, the t1max machinery already does this for z*) — easier/cleaner than adaptive-ε (convergence signal is cleaner than overshoot). Speed + auto-converged readout + cap-hit=instability-flag.
+  4. adaptive-ε (#30) as 2nd-order.
+- GPU: 4× RTX A6000 (49GB). GPU0/1/3 = ours; **GPU2 = others' NV-Embed-v2 server (port 8555/8556) — DON'T TOUCH.** For RENTING (user found cheap provider, RTX models + H800, no A100/H100): **EP is memory-light + FLOP-bound → consumer 4090/5090 = best $/FLOP** for the bulk; 48-80GB (A6000/H800) only for BPTT-twins (memory-heavy) + 0.6B. EP per-step FLOPs ~1.5-3x BPTT (2 relaxations vs fwd+bwd) but that cost is SIM-only (analog relaxation is free physics).
+
+## 6. Hopfield-ResNet paper (arxiv 2509.26003) — confirms our diagnosis
+"Scaling EP to Deeper Architectures" trained 12-conv Hopfield-ResNet with EP. It is **CONSERVATIVE** (energy function Φ, SYMMETRIC weights, monotone energy descent — "no oscillation or limit cycles"). No attention, no non-symmetric ops. **Confirms: non-conservativity is OUR culprit; conservative systems get depth free (no Hopf). We are the first to EP-train NON-conservative attention (which has the Hopf), solved via jacreg/resreg.** Good for the dossier: prior EP-deep = conservative/no-attention; ours = the harder non-conservative case.
+
+## 7. fugu-ultra consultations (all in ep_run/)
+- FUGU_VERDICT_FULL.md (Q1-Q4): confirmed attention-driven oscillation, FLAGGED the Euler-artifact (we then measured it's a true Hopf), said the eigenpair is the decisive measurement. Fix=adaptive jacreg homeostat+res_gate; sub-threshold attention IS expressive (BPTT 1.83 proves it); keep below the instability.
+- FUGU_OPTIONS_VERDICT.md (Q1-Q3): the step-vs-force-residual CORRECTION; adaptive ε eliminates only if ε_min<ε_crit; jacreg raises ε_crit (model-side, same wall) + a real analog settling benefit + true-Hopf insurance; Anderson/implicit are emulator-fidelity (analog-faithful), jacreg edits the model. Recommended: adaptive-ε+Anderson(both phases)+bounded jacreg, report FORCE-residual + Re μ.
+
+## 8. KEY FILES
+- Probes: ep_run/{eig_probe.py, anderson_control.py, eig_jacreg.py, adaptive_eps_calib.py, adaptive_eps_calib2.py, eps_sweep_s3200.py, knockout_s3200.py, eval_relax_s3200.py, compile_bench.py}
+- Dossiers: ep_run/{EP_DIAGNOSIS_DOSSIER.md, FUGU_VERDICT_FULL.md, FUGU_Q_OPTIONS.md, FUGU_OPTIONS_VERDICT.md, FUGU_Q1_VERDICT.md}
+- EP_BELOW210_DIAGNOSIS_FIX.md: cont.6 (structural/forward-mode — SUPERSEDED), cont.7 (ε-artifact — SUPERSEDED), **cont.8 (Hopf correction — CURRENT)**, **cont.9 (jacreg confirmed at mechanism level)** + OBS (oscillation = benign weight transient, not Hopf), **2026-06-23 ε-monotonicity RESULT**, the 2.09 recipe at :97-101.
+- Code: lt_ep_train.py — force/tforce:81-106, relax:123, ep_step:140, jacreg:211-219, resreg:220-231, jr controller:520-529, --compile (works), --tf32 (DON'T use).
+
+## 9. CURRENT STATE + NEXT
+- Running: ep_jacreg (~2.32), ep_resreg_warm (LEADING 2.2985, clean 2.09 test), ep_rr_scratch (from-scratch, just launched). Watchers: ep_resreg_check.py→2.20, ep_jacreg_binary.py→2.30 (detached→/tmp), ep_rr_scratch needs one.
+- **THE open question: does resreg break 2.09?** ep_resreg_warm (warm+t2sel160) at 2.2985 smooth → likely; ep_rr_scratch (from-scratch, proven recipe) = the robust confirmation.
+- Lessons: (a) the 2.09 stabilizer is RESREG (from scratch), not adaptive jacreg; (b) warm-start ONLY from a STABLE operator (s2000), never an already-blown one (ep_eps05); (c) t2sel40 (lean grad) is fragile deep, t2sel160 safer; (d) compile safe (fp32), TF32 unsafe (precision); (e) the run is precision-hyper-sensitive.
+- Recurring bug to avoid: `pkill -f "ckpt runs/X.pt"` SELF-MATCHES the bash → exit 144; kill by explicit PID instead. And `nohup python … &` inside a run_in_background bash DETACHES it (no notify) — run `python3 watcher.py` directly as the tracked task.
