Diffstat (limited to 'protocol/PAPER_OUTLINE.md')
-rw-r--r-- protocol/PAPER_OUTLINE.md | 33
1 files changed, 17 insertions, 16 deletions
diff --git a/protocol/PAPER_OUTLINE.md b/protocol/PAPER_OUTLINE.md
index 015b06b..5aeec3d 100644
--- a/protocol/PAPER_OUTLINE.md
+++ b/protocol/PAPER_OUTLINE.md
@@ -29,16 +29,17 @@ failure modes and provide a 4-diagnostic protocol that detects them.
### 2.2 Walk-back details
-**DFA**: 0.308 ± 0.014 acc. Reports as "DFA trains the deep blocks". Reality:
-- ‖h_L‖ ~ 4×10⁸ (vs BP ~200, vs random init ~9): catastrophic residual stream growth, 4 OOM
-- ‖g_L‖ ~ 5×10⁻¹⁰ (vs BP ~5×10⁻⁵, vs random init ~8×10⁻³): BP gradient at numerical floor, 5 OOM below healthy
-- DFA-shallow (no deep blocks): 0.349 ± 0.002 — **DFA-trainable is 4 pp WORSE than DFA-shallow**
-- Cosine of `e_T B_l^T` to BP grad, layer 0: +0.42; layers 1-4: ≈ 0
+**DFA** (3-seed values from paper v2.32):
+- 100ep test acc: 0.306 ± 0.006 (Table 1, audit). Reports as "DFA trains the deep blocks". Reality:
+- ‖h_L‖ ~ 5×10⁸ (3-seed mean) (vs BP ~200, vs random init ~9): catastrophic residual stream growth, 4-5 OOM
+- ‖g_L‖ ~ 4×10⁻¹⁰ (3-seed mean) (vs BP ~4×10⁻⁴, vs random init ~10⁻³): BP gradient at numerical floor, 5+ OOM below healthy
+- DFA-shallow (no deep blocks): 0.349 ± 0.002 — **DFA-trainable is 4.3 pp WORSE than DFA-shallow**
+- Cosine of `e_T B_l^T` to BP grad, layer 0 across seeds: +0.42, +0.44, +0.42; layers 1-4: ≈ 0 (3-seed deep mean -0.008 ± 0.013 at ep 1)
- Headline Γ = 0.10 is entirely the embedding layer
-**State Bridge**: 0.205 ± 0.039 acc. Same pattern. Cross-batch direction stability of BP grad = 0.992 — the cleanest "drift-dominated reference vector" measurement we have.
+**State Bridge** (3-seed): 0.205 ± 0.032 acc (Table 1). Same pattern. Cross-batch direction stability of BP grad: s42=0.992 (cleanest "drift-dominated reference vector" measurement); 3-seed mean 0.529.
-**Credit Bridge**: 0.289 ± 0.034 acc. Same pattern.
+**Credit Bridge** (3-seed): 0.289 ± 0.026 acc (Table 1). Same pattern. Cross-batch stability 3-seed mean 0.373.
### 2.3 Internal control (EP)
@@ -134,13 +135,13 @@ On 4-block d=256 ResMLP, adding `λ ‖f_l(h_l)‖²` to each DFA local block lo
- Null calibration with 20 fresh random Bs: deep cos = +0.002 ± 0.022 — confirms the +0.155 is real signal that the network adapted to its specific training Bs
- Diagnostic measurement of mode 2 is now in a partially-alleviated regime, but +0.155 is still much less than BP's self-cos of 1.0
-**Both modes are partially alleviated, neither fully**:
-- Penalty acc: 0.363 ± 0.0007 (3 seeds, lam=1e-2)
-- DFA-vanilla acc: 0.308 ± 0.014 (3 seeds)
-- DFA-shallow baseline: 0.349 ± 0.002
-- BP-trainable acc: 0.609 ± 0.004
+**Both modes are partially alleviated, neither fully** (matched 30-ep 3-seed values, paper v2.32):
+- DFA+pen acc lam=1e-2 30ep (3-seed): 0.360 ± 0.001
+- DFA-vanilla 30ep (3-seed): 0.301 ± 0.005 (was 0.308 in old unsourced reporting)
+- DFA-shallow baseline (frozen blocks): 0.349 ± 0.002
+- BP-trainable no-pen 30ep (3-seed): 0.585 ± 0.001 (was 0.609 in old unsourced reporting)
-Penalty rescue is +5.5 pp over vanilla (mode 1 alleviated) and +1.4 pp over shallow (mode 2 partially alleviated). The remaining 24 pp gap to BP reflects that mode 2 is only partially fixed: cos +0.155 is real but well below BP's 1.0.
+Penalty rescue is +5.9 pp over matched 30-ep vanilla (mode 1 alleviated) and +1.1 pp over shallow (mode 2 partially alleviated). The remaining gap to BP reflects that mode 2 is only partially fixed: cos +0.151 is real but well below BP's 1.0.
### 4.3 Disambiguating "penalty revealed" vs "penalty created" the alignment
@@ -166,10 +167,10 @@ To distinguish the residual depth-utilization gap from "the penalty's intrinsic
| | no penalty | with penalty |
|---|---:|---:|
-| BP | 0.609 | **0.530** |
-| DFA | 0.308 | 0.363 |
+| BP | 0.585 ± 0.001 | **0.532 ± 0.006** |
+| DFA | 0.301 ± 0.005 | 0.360 ± 0.001 |
-(All same architecture, same data, same optimizer family.)
+(All same architecture, same data, same optimizer family. 3-seed matched 30-epoch values, paper v2.32.)
The penalty has **opposite effects** on BP and DFA: **−8 pp** capacity cost on BP, **+5.5 pp** rescue on DFA. BP+penalty still clears the DFA-shallow baseline by **+18.1 pp**, while DFA+penalty clears it by only +1.4 pp.
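The percentage-point deltas implied by the updated values in this diff can be recomputed with a short sketch. The accuracies below are copied from the `+` lines above (matched 30-epoch, 3-seed means); the dictionary keys and the `pp` helper are illustrative names, not part of the protocol. For comparison, the figures quoted in the unchanged closing sentence (−8, +5.5, +18.1, +1.4 pp) correspond to the pre-change values 0.609, 0.530, 0.308 and 0.363.

```python
# Accuracies copied from the updated ("+") lines of this diff.
# Key names are hypothetical labels chosen for this sketch.
acc = {
    "bp": 0.585,           # BP, no penalty
    "bp_pen": 0.532,       # BP, with penalty
    "dfa": 0.301,          # DFA-vanilla, no penalty
    "dfa_pen": 0.360,      # DFA, with penalty (lam=1e-2)
    "dfa_shallow": 0.349,  # DFA-shallow baseline (frozen blocks)
}


def pp(a: float, b: float) -> float:
    """Return a - b in percentage points, rounded to one decimal place."""
    return round((a - b) * 100, 1)


deltas = {
    "penalty effect on BP": pp(acc["bp_pen"], acc["bp"]),
    "penalty effect on DFA": pp(acc["dfa_pen"], acc["dfa"]),
    "BP+penalty over DFA-shallow": pp(acc["bp_pen"], acc["dfa_shallow"]),
    "DFA+penalty over DFA-shallow": pp(acc["dfa_pen"], acc["dfa_shallow"]),
    "BP over DFA+penalty": pp(acc["bp"], acc["dfa_pen"]),
}

for name, value in deltas.items():
    print(f"{name}: {value:+.1f} pp")
```

With the updated table, the penalty costs BP about 5.3 pp and gains DFA about 5.9 pp, and BP+penalty clears DFA-shallow by about 18.3 pp against 1.1 pp for DFA+penalty.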