# FA Diagnostic Protocol — Evidence Summary

A consolidated snapshot of all the empirical evidence the protocol package
in this repo has produced. Each item includes the reproducibility command
and the file or memory entry where the result is recorded.

## §1 The protocol itself

| component | location | reproduce |
|---|---|---|
| Reference implementation | `protocol/protocol.py` | `import protocol; protocol.diagnose(...)` |
| Reporting class | `protocol/report.py` | — |
| Smoke test (BP/DFA/EP) | `protocol/smoke_test.py` | `python -m protocol.smoke_test` |
| Reporting template (paper appendix form) | `protocol/REPORTING_TEMPLATE.md` | — |
| Pipeline pitfalls checklist | `protocol/CHECKLIST.md` | — |
| README | `protocol/README.md` | — |

## §2 Audit findings

| evidence | result | reproduce |
|---|---|---|
| 5-method audit table, seed 42 | BP/EP pass; DFA fails (a)(b)(d); SB/CB fail all 4 | `python -m protocol.examples.audit_table --seeds 42` |
| 5-method audit table, 3 seeds | DFA/SB/CB walked back on 3/3 seeds via (a)+(b)+(d) | `python -m protocol.examples.audit_table --seeds 42 123 456` |
| Random init sanity (3 seeds) | All trustworthy at ~10% chance accuracy | `python -m protocol.examples.random_init_sanity` |
| Stability sub-mode bimodality | DFA/SB/CB show drift OR noise sub-mode depending on seed; both are degenerate, both are caught by (b) | (in audit table output) |

## §3 Decision utility

| evidence | result | reproduce |
|---|---|---|
| 7-strategy ablation (S0-S_full) | Field standard (S1) walks back **0/5** methods; full protocol walks back **3/5** | `python -m protocol.examples.ablation_decision_utility` |
| Per-diagnostic recall | (a), (b), (d) each catch all 3 failures independently; (c) catches 2/3 (drift sub-modes only) | (in ablation output) |
| Threshold sensitivity | Diagnostic (a) has a **63× separation gap** between healthy and degenerate; (b) has a **24,338× gap** | `python -m protocol.examples.threshold_sensitivity` |

## §4 Temporal validation

| evidence | result | reproduce |
|---|---|---|
| Single-seed temporal evolution (ResMLP) | Diagnostics fire by epoch 4-11 of 100 epochs | `python -m protocol.examples.temporal_diagnostic_evolution --seed 42 --arch resmlp` |
| 3-seed temporal evolution (ResMLP) | (b) fires ep 3-4, (a) fires ep 8-11 across all seeds | iterate over seeds 42/123/456 |
| ViT-Mini temporal evolution (3 seeds) | (a) fires ep 1, (b) fires ep 2-3 — even faster than ResMLP | iterate `--arch vit` over seeds |
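| StudentNet (no out_ln) temporal evolution (3 seeds) | (a) fires ep 14-25, **(b) NEVER fires**; without the terminal LN the gradient collapse is not observed (an association across the architectures tested, not a causal identification of LayerNorm; see the round 18 softening in §3.7) | iterate `--arch no_outln` over seeds |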
| Cross-architecture temporal figure | 3×3 grid: ResMLP/ViT/StudentNet × ‖h_L‖/‖g_L‖/acc | `python -m protocol.examples.plot_temporal_cross_arch --seed 42` |
| Training-monitor early-stop demo | DFA: protocol fires at ep 4, saves 96% compute, 0pp acc loss; BP: never fires, runs to completion | `python -m protocol.examples.training_monitor_demo` |
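
The early-stop row can be read as a per-epoch monitoring loop. The following is a minimal sketch, not the code in `protocol.examples.training_monitor_demo`. It assumes that (a) fires when max-per-block growth exceeds 50 and (b) fires when the hidden-layer BP gradient norm drops below 1e-7 (the defaults quoted in this summary). The helper names and the toy trajectory are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

GROWTH_MAX = 50.0   # assumed default for diagnostic (a)
GRAD_FLOOR = 1e-7   # assumed default for diagnostic (b)


def fired(growth: float, grad_norm: float) -> List[str]:
    """Return the labels of the diagnostics that fire for one epoch."""
    reasons = []
    if growth > GROWTH_MAX:
        reasons.append("a")
    if grad_norm < GRAD_FLOOR:
        reasons.append("b")
    return reasons


def first_fire_epoch(history: Iterable[Tuple[float, float]]) -> Optional[int]:
    """1-indexed epoch at which any diagnostic first fires, or None."""
    for epoch, (growth, grad_norm) in enumerate(history, start=1):
        if fired(growth, grad_norm):
            return epoch
    return None


def compute_saved(stop_epoch: Optional[int], total_epochs: int) -> float:
    """Fraction of the epoch budget skipped by stopping at stop_epoch."""
    if stop_epoch is None:
        return 0.0
    return 1.0 - stop_epoch / total_epochs


# Made-up (growth, grad_norm) trajectory for illustration only, not repo data:
# the gradient norm crosses the floor at epoch 4, before growth crosses 50.
toy = [(1.2, 1e-4), (3.0, 1e-5), (9.0, 1e-6), (20.0, 5e-8), (80.0, 1e-9)]
stop = first_fire_epoch(toy)
print(stop, compute_saved(stop, total_epochs=100))
```

Stopping at epoch 4 of a 100-epoch budget skips 96% of the epochs, which is the arithmetic behind the "saves 96% compute" figure. A run whose diagnostics never fire returns `None` and runs to completion, as reported for BP.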

## §3.5 Threshold sensitivity (calibration evidence)

| evidence | result | reproduce |
|---|---|---|
| Threshold sensitivity sweep | (a) **63× separation gap**, (b) **24,338× separation gap** between healthy and degenerate | `python -m protocol.examples.threshold_sensitivity` |
| Default thresholds (a)+(b) | sit cleanly in the middle of substantial margins; verdicts robust to ±50% perturbation | (in sensitivity output) |
| Diagnostic (d) frozen-baseline threshold | **NOT robust** — penalized DFA at λ=1e-2 fires at 2 pp threshold, passes at 1 pp threshold; at λ=1e-3 (1 seed) margin is +2.3 pp which passes at 2 pp. The (d) verdict depends on both threshold choice and λ choice. | `python -m protocol.examples.threshold_d_sensitivity` |
| Round 18 lesson | Soften language: "after the penalty correction, the depth contribution is at most 1.4 pp above the random-blocks baseline at λ=1e-2 — much smaller than BP's +26 pp gap over shallow", NOT "the deep blocks are passive". | n/a |
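
The contrast between the robust (b) threshold and the non-robust (d) threshold can be rechecked from numbers quoted in this summary. This sketch assumes that (b) fires when ‖g_L‖ falls below the floor and that (d) fires when the accuracy margin over the shallow baseline is below the threshold. The ‖g_L‖ values are the conservative ends of the ranges in the headline audit table, and the helper names are hypothetical.

```python
GRAD_FLOOR = 1e-7  # default floor for diagnostic (b)

# Conservative ends of the ‖g_L‖ ranges in the headline audit table:
# lowest quoted value for healthy methods, highest for degenerate ones.
healthy = {"BP": 3.6e-4, "EP": 1e-4}
degenerate = {"DFA": 4e-9, "State Bridge": 2e-9, "Credit Bridge": 4e-9}


def b_fires(grad_norm: float, floor: float) -> bool:
    return grad_norm < floor


def b_verdicts(floor: float) -> dict:
    values = {**healthy, **degenerate}
    return {name: b_fires(g, floor) for name, g in values.items()}


# (b): verdicts are unchanged when the floor is perturbed by ±50%.
baseline = b_verdicts(GRAD_FLOOR)
b_robust = all(b_verdicts(GRAD_FLOOR * s) == baseline for s in (0.5, 1.5))


def d_fires(acc: float, shallow_acc: float, threshold_pp: float) -> bool:
    """(d) fires when depth adds less than threshold_pp over the shallow baseline."""
    margin_pp = (acc - shallow_acc) * 100.0
    return margin_pp < threshold_pp


SHALLOW = 0.349  # DFA-shallow baseline accuracy
d_lam_1e2 = {t: d_fires(0.363, SHALLOW, t) for t in (1.0, 2.0)}  # margin 1.4 pp
d_lam_1e3 = {t: d_fires(0.372, SHALLOW, t) for t in (1.0, 2.0)}  # margin 2.3 pp

print(b_robust)
print(d_lam_1e2)
print(d_lam_1e3)
```

Under these assumptions the (b) verdicts survive a ±50% change in the floor. The (d) verdict at λ=1e-2 flips between the 1 pp and 2 pp thresholds, and the λ=1e-3 run passes at both.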

## §3.6 Cross-width validation (d=512)

| evidence | result | reproduce |
|---|---|---|
| 4-method audit at d=512 | BP trustworthy on 3/3 seeds; DFA/SB/CB walked back on 3/3 (same as d=256) | `python -m protocol.examples.audit_d512` |
| Width effect | max-per-block growth is HIGHER at d=512 (6e3-7e4 vs ~1e3 at d=256) | (in d=512 output) |

## §3.7 Cross-architecture: CNN (no terminal LN, BatchNorm)

| method × 3-seed | acc | max/block growth | ‖g_3‖ | verdict |
|---|---:|---:|---:|---|
| BP CNN | 0.866 ± 0.003 | 1.31× | 4e-5 | trustworthy |
| State Bridge CNN | 0.633 ± 0.005 | 2.40× | 2e-3 | trustworthy |
| **DFA CNN** | **0.566 ± 0.021** | **237×** | 1e-3 | walk-back via (a) only |
| EP CNN | 0.512 ± 0.023 | 11.6× | ~6.6e-1 | trustworthy |
| Credit Bridge CNN | 0.325 ± 0.009 | 96× | 3e-3 | walk-back via (a) only |

**Key**: diagnostic (b) NEVER fires on CNN. Without terminal LN, BP grad does
not collapse below 1e-7. Combined with the StudentNet result, **(b) appears
restricted to the terminal-normalized architectures we audited** (round 18
softening: this is observational association across the architectures
tested, not causal identification of LayerNorm). DFA CNN reaches 0.566
(much higher than DFA ResMLP 0.31 / DFA ViT 0.24), consistent with the
literature: classical FA papers report DFA working on shallow CNNs but
failing on modern Transformers. On CNN the cosine remains in a measurable
regime (Γ=0.916 for DFA), but the training trajectory exhibits extreme
scale distortion (max-per-block growth 237×), so the headline Γ alone is
not a trustworthy summary of learning quality even though the cosine
itself is well-defined.

Reproduce: `python -m protocol.examples.audit_cnn`
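
The verdict column of the CNN table can be rechecked by applying the default (a) and (b) thresholds to the two numeric columns. This is a sketch under the assumption that (a) fires when max-per-block growth exceeds 50 and (b) fires when ‖g_3‖ is below 1e-7. Only these two diagnostics are modeled, so "trustworthy" here means only that neither of them fired. `fired_diagnostics` is a hypothetical helper, not part of `protocol.examples.audit_cnn`.

```python
GROWTH_MAX = 50.0   # assumed default for diagnostic (a)
GRAD_FLOOR = 1e-7   # assumed default for diagnostic (b)

# (max-per-block growth, ‖g_3‖) copied from the CNN table.
cnn_rows = {
    "BP CNN": (1.31, 4e-5),
    "State Bridge CNN": (2.40, 2e-3),
    "DFA CNN": (237.0, 1e-3),
    "EP CNN": (11.6, 6.6e-1),
    "Credit Bridge CNN": (96.0, 3e-3),
}


def fired_diagnostics(growth: float, grad_norm: float) -> list:
    """Labels of the diagnostics that fire for one (growth, grad_norm) pair."""
    out = []
    if growth > GROWTH_MAX:
        out.append("a")
    if grad_norm < GRAD_FLOOR:
        out.append("b")
    return out


verdicts = {name: fired_diagnostics(*row) for name, row in cnn_rows.items()}
for name, reasons in verdicts.items():
    if reasons:
        label = "walk-back via " + "+".join(f"({r})" for r in reasons)
    else:
        label = "trustworthy"
    print(f"{name}: {label}")
```

With these thresholds only DFA CNN and Credit Bridge CNN are walked back, both via (a) alone, and (b) fires on no row.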

## §4 Two distinct failure modes (LOCKED — round 20)

The §4 framing is locked after rounds 18-20. Two distinct failure modes,
six independent validations.

### Mode 1: measurement degeneracy via terminal-LayerNorm gradient cancellation

Residual stream growth → BP gradient at hidden layers collapses below the
1e-7 floor → cosine alignment metric measured against degenerate reference.
**Caught by diagnostic (b).** Direct empirical evidence is the 5-method
audit table where DFA/SB/CB all have ‖g_L‖ ~ 1e-9.

### Mode 2: low intrinsic credit-direction quality of random feedback

Even in the meaningful regime (vanilla DFA at ep 1, ‖g‖ ~ 10⁻⁶), DFA's
local credit signal `e_T B_l^T` is essentially uncorrelated with BP grad
on deep layers. **Caught by direct per-layer cosine measurement.**

### Six independent validations of the converged framing

| # | evidence | result | reproduce |
|---|---|---|---|
| 1 | Direct deep-layer cosine on penalized DFA, 3 seeds | layer-mean +0.186 ± 0.007; deep mean +0.155 ± 0.025 | `experiments/measure_direction_quality_existing_ckpt.py` on `results/dfa_pen_short/dfa_pen_lam0.01_s{42,123,456}.pt` |
| 2 | Null calibration with 20 fresh random Bs | training-Bs deep cos +0.16 vs fresh-Bs +0.002 ± 0.022 | `experiments/null_calibration_penalized_cos.py` |
| 3 | Hypothesis B disambiguation (vanilla early-epoch) | vanilla deep cos -0.008 ± 0.013 across 3 seeds × ep 1, even with ‖g‖ in meaningful regime | `experiments/vanilla_dfa_early_ckpt.py` + measure script |
| 4 | BP+penalty capacity-cost 2×2 control | BP+pen acc 0.530 (-8 pp); DFA+pen 0.363 (+5.5 pp); 17 pp residual gap consistent with credit quality | `experiments/bp_with_penalty_control.py` |
| 5 | Multi-seed lock-in (round 20) | 24 measurements (3 seeds × 2 epochs × 4 deep layers) all in [-0.04, +0.02] | iterate measure script over s42/s123/s456 × ep1/ep2 |
| 6 | **Perturbation correlation triangulation** (round 19 alt metric) | vanilla deep ρ +0.002, penalized deep ρ +0.080 ± 0.011 (3 seeds). Different metric (no eps clamp, sample-level loss change), same pattern as cos | `experiments/perturbation_correlation_audit.py` |

### Penalty rescue 3-seed table (λ=1e-2)

| seed | acc | ‖h_L‖ | ‖g_2‖ | deep cos l1-l4 mean |
|---:|---:|---:|---:|---:|
| 42 | 0.363 | 3.8e4 | 9.9e-7 | +0.163 |
| 123 | 0.362 | 4.1e4 | 8.1e-7 | +0.151 |
| 456 | 0.364 | 4.1e4 | 9.0e-7 | +0.139 |
| **mean** | **0.363 ± 0.001** | **4.0e4** | **9.0e-7** | **+0.151 ± 0.012** |
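
The summary row is consistent with a mean and a sample standard deviation (n−1 denominator) over the three seeds. That reading is inferred from the numbers, not stated in the source. A short check with the standard library:

```python
from statistics import mean, stdev

# Per-seed values copied from the table above (seeds 42, 123, 456).
acc = [0.363, 0.362, 0.364]
deep_cos = [0.163, 0.151, 0.139]

acc_mean, acc_std = mean(acc), stdev(acc)            # stdev uses n-1
cos_mean, cos_std = mean(deep_cos), stdev(deep_cos)

print(f"acc      {acc_mean:.3f} ± {acc_std:.3f}")
print(f"deep cos {cos_mean:+.3f} ± {cos_std:.3f}")
```

A population standard deviation (n denominator) would give roughly 0.010 for the cosine column, so the reported ±0.012 matches the sample form.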

### BP+penalty 2×2 grid (raw acc, primary number per round 20)

| | no penalty | with penalty | penalty effect |
|---|---:|---:|---:|
| BP | 0.609 | **0.530** | −8 pp (capacity loss) |
| DFA | 0.308 | 0.363 | +5.5 pp (rescue) |
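
The "penalty effect" column and the 17 pp residual gap cited in validation #4 follow directly from the four grid entries. The arithmetic is spelled out below; the `pp` helper is only illustrative.

```python
# Raw accuracies from the 2×2 grid.
acc = {
    ("BP", "none"): 0.609,
    ("BP", "penalty"): 0.530,
    ("DFA", "none"): 0.308,
    ("DFA", "penalty"): 0.363,
}


def pp(delta: float) -> float:
    """Convert an accuracy difference to percentage points, rounded to 0.1 pp."""
    return round(delta * 100.0, 1)


bp_effect = pp(acc[("BP", "penalty")] - acc[("BP", "none")])      # cost on BP
dfa_effect = pp(acc[("DFA", "penalty")] - acc[("DFA", "none")])   # rescue on DFA
residual_gap = pp(acc[("BP", "penalty")] - acc[("DFA", "penalty")])
unpenalized_gap = pp(acc[("BP", "none")] - acc[("DFA", "none")])

print(bp_effect, dfa_effect, residual_gap, unpenalized_gap)
```

The BP effect comes out at −7.9 pp (shown as −8 pp in the table) and the matched-penalty gap at 16.7 pp (quoted as 17 pp), down from 30.1 pp without the penalty.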

### Round 20 phrasing for the gap

**Lower bound on non-capacity gap**: matched penalty controls show that only part of DFA's deficit is attributable to the representational/optimization cost of the penalty itself; a substantial residual remains and is consistent with poorer credit assignment.

**Cannot rule out (caveats)**:
- BP uses end-to-end loss, DFA uses local block losses — the 2×2 isn't a perfectly clean isolation of "credit quality" in a vacuum
- The "different optimization regime" objection: penalty hurts BP (-8 pp) while helping DFA (+5.5 pp), opposite of what a generally-beneficial regime shift would do, so this is unlikely but not airtight

## §5 Pipeline pitfalls reproducers

All 6 bugs from `protocol/CHECKLIST.md` have a reproducer:

| bug | reproducer | finding |
|---|---|---|
| 1 — `tensor.norm(-1)` is the p=−1 norm over all elements, not the L2 norm along the last dim | `verify_pitfalls.py` (synthetic) | rows `[3,4]` and `[6,8]` return the scalar 1.143 instead of `[5, 10]` |
| 2 — `F.cosine_similarity` clamps eps=1e-8 | `verify_pitfalls.py` (synthetic) | true cos 0.018 reported as 0.0009 (off by 20×) |
| 3 — fp16 underflow | `verify_pitfalls.py` (synthetic) | 5e-10 → 0 in fp16; cosine NaN; bf16 works |
| 4 — Bs are training-specific | `verify_pitfalls_4_6.py` (DFA s42) | training Bs Γ=+0.068; 10 fresh Bs draws Γ=+0.0043±0.007 |
| 5 — aggregation strategy spread | `verify_pitfalls_4_6.py` (DFA s42) | 4 strategies span [-0.028, +0.074]; **sign flips** |
| 6 — layer-0 dominance | `verify_pitfalls_4_6.py` (DFA s42) | Γ_l0=+0.43; hidden-only mean -0.022 (negative) |
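
The arithmetic behind pitfalls 1, 2, 3 and 6 can be replayed without PyTorch. This is a pure-Python sketch, not the contents of `verify_pitfalls.py`. The clamp in `cosine` is an assumed form of the eps handling (each norm clamped to eps), bf16 is emulated by truncating a float32 to its top 16 bits, and the pitfall 6 line assumes a plain mean over one input layer plus four hidden layers.

```python
import math
import struct

# Pitfall 1: a p = -1 "norm" over all elements vs. row-wise L2 norms.
rows = [[3.0, 4.0], [6.0, 8.0]]
flat = [x for row in rows for x in row]
p_minus_one = sum(abs(x) ** -1 for x in flat) ** -1             # p = -1 over all elements
row_l2 = [math.sqrt(sum(x * x for x in row)) for row in rows]   # intended result


# Pitfall 2: clamping a tiny norm up to eps shrinks the reported cosine.
def cosine(x, y, eps=0.0):
    dot = sum(a * b for a, b in zip(x, y))
    nx = max(math.sqrt(sum(a * a for a in x)), eps)
    ny = max(math.sqrt(sum(b * b for b in y)), eps)
    return dot / (nx * ny)


c = 0.018
g = [5e-10 * c, 5e-10 * math.sqrt(1.0 - c * c)]  # norm 5e-10, cosine 0.018 vs ref
ref = [1.0, 0.0]
true_cos = cosine(g, ref)
clamped_cos = cosine(g, ref, eps=1e-8)

# Pitfall 3: 5e-10 is below the smallest fp16 subnormal (about 6e-8).
as_fp16 = struct.unpack("<e", struct.pack("<e", 5e-10))[0]
# bf16 keeps the float32 exponent range; emulate it by truncating the mantissa.
bf16_bytes = struct.pack(">f", 5e-10)[:2] + b"\x00\x00"
as_bf16 = struct.unpack(">f", bf16_bytes)[0]

# Pitfall 6: one large layer-0 cosine hides a negative hidden-layer mean.
gamma_l0, hidden_mean, n_hidden = 0.43, -0.022, 4
all_layer_mean = (gamma_l0 + n_hidden * hidden_mean) / (1 + n_hidden)

print(p_minus_one, row_l2)
print(true_cos, clamped_cos)
print(as_fp16, as_bf16)
print(all_layer_mean)
```

With a gradient norm of 5e-10 and eps=1e-8, the clamp scales the cosine by 5e-10 / 1e-8 = 0.05, which is the 20× shrink reported for pitfall 2. Under the layer-count assumption above, the all-layer mean comes out near +0.068 even though the hidden-only mean is negative.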

## Headline quantitative results (paper table)

### 5-method audit, ResMLP CIFAR-10 (3-seed mean ± std)

| method | acc | ‖h_L‖ | ‖g_L‖ | stability | verdict |
|---|---:|---:|---:|---:|---|
| BP | 0.615±0.004 | 2.0e2 | 3.6e-4 | 0.10 (range 0.08-0.11) | trustworthy |
| DFA | 0.306±0.012 | 4-7e8 | 2-4e-9 | bimodal | walked back: (a)+(b)+(d) |
| State Bridge | 0.205±0.039 | 7e7-2e8 | 2e-9 | mostly drift | walked back: all 4 |
| Credit Bridge | 0.289±0.034 | 1-6e8 | 9e-10 to 4e-9 | drift on 2/3 seeds | walked back: all 4 (mostly) |
| EP | 0.316±0.038 | 3-8e3 | 1-2e-4 | -0.04 to 0.12 | trustworthy |

### Cross-architecture (s42)

| arch | terminal LN | DFA acc | BP acc | DFA fires | BP fires |
|---|:---:|---:|---:|---:|:---:|
| ResMLP (4-block d=256) | yes | 0.31 | 0.61 | (a)+(b) ep 4-8 | never |
| ViT-Mini (4-block d=128) | yes | 0.26 | 0.80 | (a)+(b) ep 1-3 | never |
| StudentNet (4-block d=128) | **no** | 0.33 | 0.62 | (a) ep 18, **(b) NEVER** | never |

### Penalty rescue (3-seed for λ=1e-2, single-seed for others)

| condition | acc | ‖h_L‖ | ‖g_L‖ |
|---|---:|---:|---:|
| DFA-vanilla | 0.308 ± 0.014 | 4.4e8 | 5e-10 |
| DFA + λ=1e-3 ‖f‖² | 0.372 (1 seed) | 4.0e4 | 7e-7 |
| DFA + λ=1e-2 ‖f‖² | 0.363 ± 0.001 | 3.8e4 | 1e-6 |
| DFA + λ=1e-1 ‖f‖² | (running) | (running) | (running) |
| DFA-shallow baseline | 0.349 ± 0.002 | (n/a) | (n/a) |
| BP-trainable | 0.609 ± 0.004 | 2.0e2 | 5e-5 |

The penalty lifts accuracy by +5.5 pp over vanilla DFA (0.308 → 0.363) but only
+1.4 pp over the shallow baseline (0.349), so the penalty mechanism
(suppressing residual-stream growth) is necessary but not sufficient.

## Figures (paper-ready)

- `results/protocol_audit/figure_audit_5method.png` — §2/§3 hero figure: 4-panel horizontal bar chart of the 5-method audit, color-coded healthy/degenerate, threshold lines drawn
- `results/protocol_audit/figure_cross_arch_temporal_s42.png` — §4 cross-architecture temporal evolution: 3 architectures × 3 metric rows (‖h_L‖, ‖g_L‖, acc) with floor line

## Status of evidence

- §1 protocol package: **DONE**, committed
- §2 audit findings: **DONE** for ResMLP at d=256 (3 seeds), d=512 (3 seeds), and CNN (3 seeds, 5 methods). 11 method×architecture combinations × 3 seeds = 33 audited conditions. ViT audit waiting on checkpoint training
- §3 decision utility: **DONE**, ablation table + threshold sensitivity ready
- §3 hero figure: **DONE**
- §4 temporal validation: **DONE** for 3 architectures × 3 seeds (ResMLP, ViT, StudentNet)
- §4 cross-arch figure: **DONE**
- §4 two failure modes (separability): **DONE** via penalty partial audit + vanilla cosine measurement
- §5 pipeline pitfalls: **DONE**, all 6 reproducers
- Direction-quality direct test on penalized checkpoint: **RUNNING** (gives mechanistic explanation of (d) failure on penalized net)
- ResNet experiment: **RUNNING** (ep 1 of 60, slow due to GPU contention)
- ViT checkpoint training: **RUNNING** (ep 1 of 60)
- λ=1e-1 penalty data: **RUNNING** (ep 40 of 100)
- ViT 5-method audit: **WAITING** for ViT checkpoint

## Open questions for paper writeup

- Should the protocol report (a)/(b)/(d) as 3 redundant tests or as 1 composite indicator with 3 input signals? Current view: keep them separate so authors can see *why* the verdict was triggered.
- Should the paper claim the protocol generalizes to architectures not yet tested? Current view: scope claims tightly to "modern pre-LN residual architectures" and explicitly note ResNet/BatchNorm and StudentNet as the architectural extremes we've tested.
- Should the paper recommend specific threshold defaults, or leave them caller-configurable? Current view: ship defaults (50, 1e-7, 0.30, 2pp) with the calibration evidence and let users override.