# FA Diagnostic Protocol — Evidence Summary

A consolidated snapshot of the empirical evidence produced by the protocol
package in this repo. Each item lists the command that reproduces it and the
file or memory entry where the result is recorded.

## §1 The protocol itself

| component | location | reproduce |
|---|---|---|
| Reference implementation | `protocol/protocol.py` | `import protocol; protocol.diagnose(...)` |
| Reporting class | `protocol/report.py` | — |
| Smoke test (BP/DFA/EP) | `protocol/smoke_test.py` | `python -m protocol.smoke_test` |
| Reporting template (paper appendix form) | `protocol/REPORTING_TEMPLATE.md` | — |
| Pipeline pitfalls checklist | `protocol/CHECKLIST.md` | — |
| README | `protocol/README.md` | — |

## §2 Audit findings

| evidence | result | reproduce |
|---|---|---|
| 5-method audit table, seed 42 | BP/EP pass; DFA fails (a)(b)(d); State Bridge (SB) and Credit Bridge (CB) fail all 4 | `python -m protocol.examples.audit_table --seeds 42` |
| 5-method audit table, 3 seeds | DFA/SB/CB walked back on 3/3 seeds via (a)+(b)+(d) | `python -m protocol.examples.audit_table --seeds 42 123 456` |
| Random init sanity (3 seeds) | All trustworthy at ~10% chance accuracy | `python -m protocol.examples.random_init_sanity` |
| Stability sub-mode bimodality | DFA/SB/CB show drift OR noise sub-mode depending on seed; both are degenerate, both are caught by (b) | (in audit table output) |

## §3 Decision utility

| evidence | result | reproduce |
|---|---|---|
| 7-strategy ablation (S0-S_full) | Field standard (S1) walks back **0/5** methods; full protocol walks back **3/5** | `python -m protocol.examples.ablation_decision_utility` |
| Per-diagnostic recall | (a), (b), (d) each catch all 3 failures independently; (c) catches 2/3 (drift sub-modes only) | (in ablation output) |
| Threshold sensitivity | Diagnostic (a) has a **63× separation gap** between healthy and degenerate; (b) has a **24,338× gap** | `python -m protocol.examples.threshold_sensitivity` |
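
One way to read a "separation gap" is as the ratio between the closest healthy and degenerate values of a diagnostic's statistic. The sketch below assumes that definition (the `threshold_sensitivity` script may compute it differently) and assumes diagnostic (b) is driven by ‖g_L‖. Its inputs are the rounded ‖g_L‖ endpoints from the headline audit table in this document, so the result only approximates the reported 24,338×. `separation_gap` is a hypothetical helper, not part of the protocol package.

```python
import math


def separation_gap(healthy, degenerate):
    """Ratio between the closest healthy and degenerate values.

    Assumes every healthy value is larger than every degenerate one,
    as with a collapsed gradient norm.
    """
    lo_healthy = min(healthy)
    hi_degenerate = max(degenerate)
    if hi_degenerate <= 0 or lo_healthy <= hi_degenerate:
        raise ValueError("populations overlap or are non-positive")
    return lo_healthy / hi_degenerate


# Rounded ‖g_L‖ endpoints from the headline audit table (3 seeds).
healthy_g = [3.6e-4, 1e-4, 2e-4]                 # BP, EP range
degenerate_g = [2e-9, 4e-9, 2e-9, 9e-10, 4e-9]   # DFA, State Bridge, Credit Bridge

gap = separation_gap(healthy_g, degenerate_g)
# The geometric mean is equally far, in log space, from both populations.
threshold = math.sqrt(min(healthy_g) * max(degenerate_g))
print(f"gap ~ {gap:.0f}x, log-midpoint threshold ~ {threshold:.1e}")
```

A wide gap means the verdict is insensitive to the exact threshold: any cut-off between the two populations classifies every run the same way.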

## §4 Temporal validation

| evidence | result | reproduce |
|---|---|---|
| Single-seed temporal evolution (ResMLP) | Diagnostics fire by epoch 4-11 of 100 epochs | `python -m protocol.examples.temporal_diagnostic_evolution --seed 42 --arch resmlp` |
| 3-seed temporal evolution (ResMLP) | (b) fires ep 3-4, (a) fires ep 8-11 across all seeds | rerun the single-seed command with `--seed` 42, 123, 456 |
| ViT-Mini temporal evolution (3 seeds) | (a) fires ep 1, (b) fires ep 2-3; faster than ResMLP | same command with `--arch vit`, iterated over the seeds |
| StudentNet (no out_ln) temporal evolution (3 seeds) | (a) fires ep 14-25, **(b) NEVER fires**; terminal LN is causally responsible for the gradient collapse | same command with `--arch no_outln`, iterated over the seeds |
| Cross-architecture temporal figure | 3×3 grid: ResMLP/ViT/StudentNet × ‖h_L‖/‖g_L‖/acc | `python -m protocol.examples.plot_temporal_cross_arch --seed 42` |
| Training-monitor early-stop demo | DFA: protocol fires at ep 4, saves 96% compute, 0pp acc loss; BP: never fires, runs to completion | `python -m protocol.examples.training_monitor_demo` |
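
The early-stop row reduces to simple bookkeeping: stop at the first epoch where a diagnostic fires and count the epochs that were not run. A minimal sketch, assuming 1-indexed epochs, a 100-epoch budget, and a per-epoch boolean "fired" signal. `first_fire_epoch` and `compute_saved` are hypothetical names, not the API of `training_monitor_demo`, and the traces are made up to match the firing epochs reported above.

```python
from typing import Iterable, Optional


def first_fire_epoch(fired: Iterable[bool]) -> Optional[int]:
    """Return the 1-indexed epoch of the first firing, or None if it never fires."""
    for epoch, flag in enumerate(fired, start=1):
        if flag:
            return epoch
    return None


def compute_saved(stop_epoch: Optional[int], total_epochs: int) -> float:
    """Fraction of the epoch budget skipped by stopping after `stop_epoch`."""
    if stop_epoch is None:
        return 0.0  # never fired: the run goes to completion
    return 1.0 - stop_epoch / total_epochs


total = 100
dfa_fired = [False] * 3 + [True] + [False] * 96  # hypothetical trace: fires at epoch 4
bp_fired = [False] * total                       # hypothetical trace: never fires

dfa_stop = first_fire_epoch(dfa_fired)
bp_stop = first_fire_epoch(bp_fired)
print(dfa_stop, compute_saved(dfa_stop, total))
print(bp_stop, compute_saved(bp_stop, total))
```

Stopping after epoch 4 of 100 skips 96 epochs, which is where the 96% figure comes from under these assumptions.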

## §5 Pipeline pitfalls reproducers

All 6 bugs from `protocol/CHECKLIST.md` have a reproducer:

| bug | reproducer | finding |
|---|---|---|
| 1: `tensor.norm(-1)` is L_{-1} | `verify_pitfalls.py` (synthetic) | rows `[3,4]` and `[6,8]` return the scalar 1.143 instead of the row norms `[5, 10]` |
| 2: `F.cosine_similarity` clamps eps=1e-8 | `verify_pitfalls.py` (synthetic) | true cos 0.018 reported as 0.0009 (off by 20×) |
| 3: fp16 underflow | `verify_pitfalls.py` (synthetic) | 5e-10 underflows to 0 in fp16, so the cosine is NaN; bf16 works |
| 4: feedback matrices (Bs) are training-specific | `verify_pitfalls_4_6.py` (DFA s42) | Γ=+0.068 with the training Bs; Γ=+0.0043±0.007 over 10 fresh draws of Bs |
| 5: aggregation strategy spread | `verify_pitfalls_4_6.py` (DFA s42) | 4 strategies span [-0.028, +0.074]; **sign flips** |
| 6: layer-0 dominance | `verify_pitfalls_4_6.py` (DFA s42) | Γ_l0=+0.43; hidden-only mean is -0.022 (negative) |
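
Pitfalls 1 and 3 can be checked by hand without PyTorch. The sketch below is not `verify_pitfalls.py`; it reproduces the arithmetic in plain Python. It uses the `struct` module's half-precision format for fp16 and a truncated float32 as a stand-in for bf16. This is an approximation: real bf16 casts round to nearest, so the low bits may differ.

```python
import math
import struct

rows = [[3.0, 4.0], [6.0, 8.0]]

# Pitfall 1: a p=-1 norm over all elements vs. the intended row-wise L2 norm.
flat = [x for row in rows for x in row]
l_minus1 = sum(abs(x) ** -1 for x in flat) ** -1   # (1/3 + 1/4 + 1/6 + 1/8)^-1
row_l2 = [math.sqrt(sum(x * x for x in row)) for row in rows]
print(round(l_minus1, 3), row_l2)


# Pitfall 3: 5e-10 is below fp16's smallest subnormal (about 6e-8), so it rounds to zero.
def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16_truncated(x: float) -> float:
    """Keep the top 16 bits of the float32 encoding (truncation, not rounding)."""
    hi = struct.pack(">f", x)[:2]
    return struct.unpack(">f", hi + b"\x00\x00")[0]


small = 5e-10
print(to_fp16(small), to_bf16_truncated(small))
```

Once both norms underflow to zero, a cosine computed from them is 0/0, which is the NaN reported in the table. bf16 keeps float32's exponent range, so the same value survives with reduced precision.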

## Headline quantitative results (paper table)

### 5-method audit, ResMLP CIFAR-10 (3-seed mean ± std)

| method | acc | ‖h_L‖ | ‖g_L‖ | stability | verdict |
|---|---:|---:|---:|---:|---|
| BP | 0.615±0.004 | 2.0e2 | 3.6e-4 | 0.10 (range 0.08-0.11) | trustworthy |
| DFA | 0.306±0.012 | 4-7e8 | 2-4e-9 | bimodal | walked back: (a)+(b)+(d) |
| State Bridge | 0.205±0.039 | 7e7-2e8 | 2e-9 | mostly drift | walked back: all 4 |
| Credit Bridge | 0.289±0.034 | 1-6e8 | 9e-10 to 4e-9 | drift on 2/3 seeds | walked back: all 4 (mostly) |
| EP | 0.316±0.038 | 3-8e3 | 1-2e-4 | -0.04 to 0.12 | trustworthy |

### Cross-architecture (s42)

| arch | terminal LN | DFA acc | BP acc | DFA fires | BP fires |
|---|:---:|---:|---:|---:|:---:|
| ResMLP (4-block d=256) | yes | 0.31 | 0.61 | (a)+(b) ep 4-8 | never |
| ViT-Mini (4-block d=128) | yes | 0.26 | 0.80 | (a)+(b) ep 1-3 | never |
| StudentNet (4-block d=128) | **no** | 0.33 | 0.62 | (a) ep 18, **(b) NEVER** | never |

### Penalty rescue (3-seed for λ=1e-2, single-seed for others)

| condition | acc | ‖h_L‖ | ‖g_L‖ |
|---|---:|---:|---:|
| DFA-vanilla | 0.308 ± 0.014 | 4.4e8 | 5e-10 |
| DFA + λ=1e-3 ‖f‖² | 0.372 (1 seed) | 4.0e4 | 7e-7 |
| DFA + λ=1e-2 ‖f‖² | 0.363 ± 0.001 | 3.8e4 | 1e-6 |
| DFA + λ=1e-1 ‖f‖² | (running) | (running) | (running) |
| DFA-shallow baseline | 0.349 ± 0.002 | (n/a) | (n/a) |
| BP-trainable | 0.609 ± 0.004 | 2.0e2 | 5e-5 |

The λ=1e-2 penalty improves accuracy by 5.5 pp over vanilla DFA (0.308 → 0.363)
but by only 1.4 pp over the shallow baseline (0.349 → 0.363); the mechanism is
necessary but not sufficient.

## Status of evidence

- §1 protocol package: **DONE**, committed
- §2 audit findings: **DONE** for ResMLP (single-seed and 3-seed); ViT audit waiting on checkpoint training
- §3 decision utility: **DONE**, ablation table + sensitivity analysis ready
- §4 temporal validation: **DONE** for 3 architectures × 3 seeds (ResMLP, ViT, StudentNet)
- §5 pipeline pitfalls: **DONE**, all 6 reproducers
- §4 figure: **DONE** for cross-arch single-seed
- Direction-quality direct test: **RUNNING** (vanilla phase silent)
- ResNet experiment: **RUNNING** (ep 1 of 60)
- ViT checkpoint training: **RUNNING** (just started)
- λ=1e-1 penalty data: **RUNNING** (ep 30 of 100)
- Penalized DFA protocol application: **WAITING** for direction-quality checkpoint

## Open questions for paper writeup

- Should the protocol report (a)/(b)/(d) as 3 redundant tests or as 1 composite indicator with 3 input signals? Current view: keep them separate so authors can see *why* the verdict was triggered.
- Should the paper claim the protocol generalizes to architectures not yet tested? Current view: scope claims tightly to "modern pre-LN residual architectures" and explicitly note ResNet/BatchNorm and StudentNet as the architectural extremes we've tested.
- Should the paper recommend specific threshold defaults, or leave them caller-configurable? Current view: ship defaults (50, 1e-7, 0.30, 2pp) with the calibration evidence and let users override.
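
The "keep them separate" option in the first question can be sketched as a verdict that is derived from the per-diagnostic flags but never replaces them. This is an illustrative sketch only: `DiagnosticFlags` and `verdict` are hypothetical names and do not describe `protocol/report.py`. The output strings mirror the verdict column of the audit table.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DiagnosticFlags:
    """One boolean per diagnostic; True means the diagnostic fired."""
    a: bool = False
    b: bool = False
    c: bool = False
    d: bool = False


def verdict(flags: DiagnosticFlags) -> str:
    """Summarize the flags while keeping the triggering diagnostics visible."""
    fired = [f.name for f in fields(flags) if getattr(flags, f.name)]
    if not fired:
        return "trustworthy"
    if len(fired) == len(fields(flags)):
        return "walked back: all 4"
    return "walked back: " + "+".join(f"({name})" for name in fired)


print(verdict(DiagnosticFlags()))                        # trustworthy
print(verdict(DiagnosticFlags(a=True, b=True, d=True)))  # walked back: (a)+(b)+(d)
print(verdict(DiagnosticFlags(True, True, True, True)))  # walked back: all 4
```

Because the flags stay on the object, a reader of the report can still tell a run walked back by (b) alone from one walked back by all four, which a single composite score would hide.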