From a0b8169afb7981921e6599f2bc33a35a0ab9ca53 Mon Sep 17 00:00:00 2001 From: YurenHao0426 Date: Wed, 8 Apr 2026 19:25:42 -0500 Subject: Sync EVIDENCE_SUMMARY.md and PAPER_OUTLINE.md with v2.32 values MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit These two project scratch documents had stale BP=0.609 and DFA=0.308 references from the pre-v2.31 era. Updated to the matched 30-ep 3-seed values that v2.31-v2.32 corrected: BP no-pen 30ep: 0.609 → 0.585 ± 0.001 BP+pen 30ep: 0.530 → 0.532 ± 0.006 DFA no-pen 30ep: 0.308 → 0.301 ± 0.005 DFA+pen 30ep: 0.363 → 0.360 ± 0.001 Gap math: +5.5/-8 → +5.9/-5.3 pp; +18.1/+1.4 → +18.3/+1.1 pp Deep cos: +0.155 → +0.151 Now the paper, the protocol library, the README, the helper scripts, and the project scratch docs all agree on the v2.32 values. Co-Authored-By: Claude Opus 4.6 (1M context) --- protocol/EVIDENCE_SUMMARY.md | 28 +++++++++++++++------------- protocol/PAPER_OUTLINE.md | 33 +++++++++++++++++---------------- 2 files changed, 32 insertions(+), 29 deletions(-) diff --git a/protocol/EVIDENCE_SUMMARY.md b/protocol/EVIDENCE_SUMMARY.md index 2591c97..bab8764 100644 --- a/protocol/EVIDENCE_SUMMARY.md +++ b/protocol/EVIDENCE_SUMMARY.md @@ -122,12 +122,12 @@ on deep layers. **Caught by direct per-layer cosine measurement.** | 456 | 0.364 | 4.1e4 | 9.0e-7 | +0.139 | | **mean** | **0.363 ± 0.001** | **4.0e4** | **9.0e-7** | **+0.151 ± 0.012** | -### BP+penalty 2×2 grid (raw acc, primary number per round 20) +### BP+penalty 2×2 grid (matched 30-epoch 3-seed values, paper v2.32) | | no penalty | with penalty | penalty effect | |---|---:|---:|---:| -| BP | 0.609 | **0.530** | −8 pp (capacity loss) | -| DFA | 0.308 | 0.363 | +5.5 pp (rescue) | +| BP | 0.585 ± 0.001 | **0.532 ± 0.006** | −5.3 pp (capacity loss) | +| DFA | 0.301 ± 0.005 | 0.360 ± 0.001 | +5.9 pp (rescue) | ### Round 20 phrasing for the gap @@ -170,19 +170,21 @@ All 6 bugs from `protocol/CHECKLIST.md` have a reproducer: | ViT-Mini (4-block d=128) | yes | 0.26 | 0.80 | (a)+(b) ep 1-3 | never | | StudentNet (4-block d=128) | **no** | 0.33 | 0.62 | (a) ep 18, **(b) NEVER** | never | -### Penalty rescue (3-seed for λ=1e-2, single-seed for others) +### Penalty rescue (matched 30-epoch 3-seed values, paper v2.32) | condition | acc | ‖h_L‖ | ‖g_L‖ | |---|---:|---:|---:| -| DFA-vanilla | 0.308 ± 0.014 | 4.4e8 | 5e-10 | -| DFA + λ=1e-3 ‖f‖² | 0.372 (1 seed) | 4.0e4 | 7e-7 | -| DFA + λ=1e-2 ‖f‖² | 0.363 ± 0.001 | 3.8e4 | 1e-6 | -| DFA + λ=1e-1 ‖f‖² | (running) | (running) | (running) | -| DFA-shallow baseline | 0.349 ± 0.002 | (n/a) | (n/a) | -| BP-trainable | 0.609 ± 0.004 | 2.0e2 | 5e-5 | - -The penalty rescues by +5.5 pp over vanilla DFA but only +1.4 pp over the -shallow baseline; mechanism is necessary but not sufficient. +| DFA-vanilla 30ep (3-seed) | 0.301 ± 0.005 | 4.4e8 (s42) / 5e8 (3-seed mean) | 4e-10 | +| DFA + λ=1e-4 ‖f‖² 30ep (3-seed) | 0.360 | 2.2e4 | 7e-7 | +| DFA + λ=1e-2 ‖f‖² 30ep (3-seed) | 0.360 ± 0.001 | 1.3e4 | 1.6e-6 | +| DFA + λ=1e-1 ‖f‖² 30ep (s42) | 0.349 | 1.2e4 | 1.6e-6 | +| DFA-shallow baseline (frozen) | 0.349 ± 0.002 | (n/a) | (n/a) | +| BP-trainable 30ep (3-seed) | 0.585 ± 0.001 | (n/a) | (n/a) | +| BP-trainable 100ep (3-seed) | 0.6147 ± 0.004 | 2.0e2 | 5e-5 | +| BP+pen λ=1e-2 30ep (3-seed) | 0.532 ± 0.006 | 4.0e4 | (matches DFA+pen) | + +The penalty rescues by +5.9 pp over vanilla DFA (matched 30-ep) but only ++1.1 pp over the shallow baseline; mechanism is necessary but not sufficient. ## Figures (paper-ready) diff --git a/protocol/PAPER_OUTLINE.md b/protocol/PAPER_OUTLINE.md index 015b06b..5aeec3d 100644 --- a/protocol/PAPER_OUTLINE.md +++ b/protocol/PAPER_OUTLINE.md @@ -29,16 +29,17 @@ failure modes and provide a 4-diagnostic protocol that detects them. ### 2.2 Walk-back details -**DFA**: 0.308 ± 0.014 acc. Reports as "DFA trains the deep blocks". Reality: -- ‖h_L‖ ~ 4×10⁸ (vs BP ~200, vs random init ~9): catastrophic residual stream growth, 4 OOM -- ‖g_L‖ ~ 5×10⁻¹⁰ (vs BP ~5×10⁻⁵, vs random init ~8×10⁻³): BP gradient at numerical floor, 5 OOM below healthy -- DFA-shallow (no deep blocks): 0.349 ± 0.002 — **DFA-trainable is 4 pp WORSE than DFA-shallow** -- Cosine of `e_T B_l^T` to BP grad, layer 0: +0.42; layers 1-4: ≈ 0 +**DFA** (3-seed values from paper v2.32): +- 100ep test acc: 0.306 ± 0.006 (Table 1, audit). Reports as "DFA trains the deep blocks". Reality: +- ‖h_L‖ ~ 5×10⁸ (3-seed mean) (vs BP ~200, vs random init ~9): catastrophic residual stream growth, 4-5 OOM +- ‖g_L‖ ~ 4×10⁻¹⁰ (3-seed mean) (vs BP ~4×10⁻⁴, vs random init ~10⁻³): BP gradient at numerical floor, 5+ OOM below healthy +- DFA-shallow (no deep blocks): 0.349 ± 0.002 — **DFA-trainable is 4.3 pp WORSE than DFA-shallow** +- Cosine of `e_T B_l^T` to BP grad, layer 0 across seeds: +0.42, +0.44, +0.42; layers 1-4: ≈ 0 (3-seed deep mean -0.008 ± 0.013 at ep 1) - Headline Γ = 0.10 is entirely the embedding layer -**State Bridge**: 0.205 ± 0.039 acc. Same pattern. Cross-batch direction stability of BP grad = 0.992 — the cleanest "drift-dominated reference vector" measurement we have. +**State Bridge** (3-seed): 0.205 ± 0.032 acc (Table 1). Same pattern. Cross-batch direction stability of BP grad: s42=0.992 (cleanest "drift-dominated reference vector" measurement); 3-seed mean 0.529. -**Credit Bridge**: 0.289 ± 0.034 acc. Same pattern. +**Credit Bridge** (3-seed): 0.289 ± 0.026 acc (Table 1). Same pattern. Cross-batch stability 3-seed mean 0.373. ### 2.3 Internal control (EP) @@ -134,13 +135,13 @@ On 4-block d=256 ResMLP, adding `λ ‖f_l(h_l)‖²` to each DFA local block lo - Null calibration with 20 fresh random Bs: deep cos = +0.002 ± 0.022 — confirms the +0.155 is real signal that the network adapted to its specific training Bs - Diagnostic measurement of mode 2 is now in a partially-alleviated regime, but +0.155 is still much less than BP's self-cos of 1.0 -**Both modes are partially alleviated, neither fully**: -- Penalty acc: 0.363 ± 0.0007 (3 seeds, lam=1e-2) -- DFA-vanilla acc: 0.308 ± 0.014 (3 seeds) -- DFA-shallow baseline: 0.349 ± 0.002 -- BP-trainable acc: 0.609 ± 0.004 +**Both modes are partially alleviated, neither fully** (matched 30-ep 3-seed values, paper v2.32): +- DFA+pen acc lam=1e-2 30ep (3-seed): 0.360 ± 0.001 +- DFA-vanilla 30ep (3-seed): 0.301 ± 0.005 (was 0.308 in old unsourced reporting) +- DFA-shallow baseline (frozen blocks): 0.349 ± 0.002 +- BP-trainable no-pen 30ep (3-seed): 0.585 ± 0.001 (was 0.609 in old unsourced reporting) -Penalty rescue is +5.5 pp over vanilla (mode 1 alleviated) and +1.4 pp over shallow (mode 2 partially alleviated). The remaining 24 pp gap to BP reflects that mode 2 is only partially fixed: cos +0.155 is real but well below BP's 1.0. +Penalty rescue is +5.9 pp over matched 30-ep vanilla (mode 1 alleviated) and +1.1 pp over shallow (mode 2 partially alleviated). The remaining gap to BP reflects that mode 2 is only partially fixed: cos +0.151 is real but well below BP's 1.0. ### 4.3 Disambiguating "penalty revealed" vs "penalty created" the alignment @@ -166,10 +167,10 @@ To distinguish the residual depth-utilization gap from "the penalty's intrinsic | | no penalty | with penalty | |---|---:|---:| -| BP | 0.609 | **0.530** | -| DFA | 0.308 | 0.363 | +| BP | 0.585 ± 0.001 | **0.532 ± 0.006** | +| DFA | 0.301 ± 0.005 | 0.360 ± 0.001 | -(All same architecture, same data, same optimizer family.) +(All same architecture, same data, same optimizer family. 3-seed matched 30-epoch values, paper v2.32.) The penalty has **opposite effects** on BP and DFA: **−8 pp** capacity cost on BP, **+5.5 pp** rescue on DFA. BP+penalty still clears the DFA-shallow baseline by **+18.1 pp**, while DFA+penalty clears it by only +1.4 pp. -- cgit v1.2.3