Add EVIDENCE_SUMMARY.md: consolidated snapshot of all protocol evidence

Single-document overview of every result the protocol package has produced so far, with reproducibility commands and the file/memory entry where each result is recorded. Organized by paper section (§1 protocol, §2 audit, §3 decision utility, §4 temporal validation, §5 pitfalls). Includes the headline tables (3-seed audit, cross-architecture, penalty sweep) ready for the paper, and an explicit status field for each ongoing experiment. This is a reading guide for anyone (codex, future-me, the user) who needs to know what evidence is ready and how to reproduce it.
author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-07 23:15:47 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-07 23:15:47 -0500
commit: 5dadf7b78cbd3332b48a3ec0c385e3aeaea253a6 (patch)
tree: a23e43b7b0786cdc0a9ca7d304b1aca8d9d5bd73 /protocol
parent: 33a84534418c0459dc3988bdd53df09dcd3ab676 (diff)
1 files changed, 111 insertions, 0 deletions
diff --git a/protocol/EVIDENCE_SUMMARY.md b/protocol/EVIDENCE_SUMMARY.md
new file mode 100644
index 0000000..517f7c7
--- /dev/null
+++ b/protocol/EVIDENCE_SUMMARY.md
@@ -0,0 +1,111 @@
+# FA Diagnostic Protocol — Evidence Summary
+
+A consolidated snapshot of all the empirical evidence the protocol package
+in this repo has produced. Each item includes the reproducibility command
+and the file or memory entry where the result is recorded.
+
+## §1 The protocol itself
+
+| component | location | reproduce |
+|---|---|---|
+| Reference implementation | `protocol/protocol.py` | `import protocol; protocol.diagnose(...)` |
+| Reporting class | `protocol/report.py` | — |
+| Smoke test (BP/DFA/EP) | `protocol/smoke_test.py` | `python -m protocol.smoke_test` |
+| Reporting template (paper appendix form) | `protocol/REPORTING_TEMPLATE.md` | — |
+| Pipeline pitfalls checklist | `protocol/CHECKLIST.md` | — |
+| README | `protocol/README.md` | — |
+
+## §2 Audit findings
+
+| evidence | result | reproduce |
+|---|---|---|
+| 5-method audit table, seed 42 | BP/EP pass; DFA fails (a)(b)(d); SB/CB (State Bridge, Credit Bridge) fail all 4 | `python -m protocol.examples.audit_table --seeds 42` |
+| 5-method audit table, 3 seeds | DFA/SB/CB walked back on 3/3 seeds via (a)+(b)+(d) | `python -m protocol.examples.audit_table --seeds 42 123 456` |
+| Random init sanity (3 seeds) | All trustworthy at ~10% chance accuracy | `python -m protocol.examples.random_init_sanity` |
+| Stability sub-mode bimodality | DFA/SB/CB show drift OR noise sub-mode depending on seed; both are degenerate, both are caught by (b) | (in audit table output) |
+
+## §3 Decision utility
+
+| evidence | result | reproduce |
+|---|---|---|
+| 7-strategy ablation (S0-S_full) | Field standard (S1) walks back **0/5** methods; full protocol walks back **3/5** | `python -m protocol.examples.ablation_decision_utility` |
+| Per-diagnostic recall | (a), (b), (d) each catch all 3 failures independently; (c) catches 2/3 (drift sub-modes only) | (in ablation output) |
+| Threshold sensitivity | Diagnostic (a) has a **63× separation gap** between healthy and degenerate; (b) has a **24,338× gap** | `python -m protocol.examples.threshold_sensitivity` |
+
+## §4 Temporal validation
+
+| evidence | result | reproduce |
+|---|---|---|
+| Single-seed temporal evolution (ResMLP) | Diagnostics fire by epoch 4-11 of 100 epochs | `python -m protocol.examples.temporal_diagnostic_evolution --seed 42 --arch resmlp` |
+| 3-seed temporal evolution (ResMLP) | (b) fires ep 3-4, (a) fires ep 8-11 across all seeds | rerun the single-seed command with `--seed 42`, `--seed 123`, `--seed 456` |
+| ViT-Mini temporal evolution (3 seeds) | (a) fires ep 1, (b) fires ep 2-3; even faster than ResMLP | same command with `--arch vit`, once per seed (42/123/456) |
+| StudentNet (no out_ln) temporal evolution (3 seeds) | (a) fires ep 14-25, **(b) NEVER fires**: terminal LN is causally responsible for the gradient collapse | same command with `--arch no_outln`, once per seed (42/123/456) |
+| Cross-architecture temporal figure | 3×3 grid: ResMLP/ViT/StudentNet × ‖h_L‖/‖g_L‖/acc | `python -m protocol.examples.plot_temporal_cross_arch --seed 42` |
+| Training-monitor early-stop demo | DFA: protocol fires at ep 4, saves 96% compute, 0pp acc loss; BP: never fires, runs to completion | `python -m protocol.examples.training_monitor_demo` |
+
+## §5 Pipeline pitfalls reproducers
+
+All 6 bugs from `protocol/CHECKLIST.md` have a reproducer:
+
+| bug | reproducer | finding |
+|---|---|---|
+| 1: `tensor.norm(-1)` computes the L_{-1} norm, not the norm over the last dim | `verify_pitfalls.py` (synthetic) | on rows `[3,4]`, `[6,8]` it returns 1.143 instead of the row norms [5, 10] |
+| 2: `F.cosine_similarity` clamps norms at eps=1e-8 | `verify_pitfalls.py` (synthetic) | true cos 0.018 reported as 0.0009 (off by 20×) |
+| 3: fp16 underflow | `verify_pitfalls.py` (synthetic) | 5e-10 underflows to 0 in fp16, so the cosine is NaN; bf16 works |
+| 4: Bs are training-specific | `verify_pitfalls_4_6.py` (DFA s42) | training Bs Γ=+0.068; 10 fresh Bs draws Γ=+0.0043±0.007 |
+| 5: aggregation strategy spread | `verify_pitfalls_4_6.py` (DFA s42) | 4 strategies span [-0.028, +0.074]; **sign flips** |
+| 6: layer-0 dominance | `verify_pitfalls_4_6.py` (DFA s42) | Γ_l0=+0.43; hidden-only mean -0.022 (negative) |
+
+## Headline quantitative results (paper table)
+
+### 5-method audit, ResMLP CIFAR-10 (3-seed mean ± std)
+
+| method | acc | ‖h_L‖ | ‖g_L‖ | stability | verdict |
+|---|---:|---:|---:|---:|---|
+| BP | 0.615±0.004 | 2.0e2 | 3.6e-4 | 0.10 (range 0.08-0.11) | trustworthy |
+| DFA | 0.306±0.012 | 4-7e8 | 2-4e-9 | bimodal | walked back: (a)+(b)+(d) |
+| State Bridge | 0.205±0.039 | 7e7-2e8 | 2e-9 | mostly drift | walked back: all 4 |
+| Credit Bridge | 0.289±0.034 | 1-6e8 | 9e-10 to 4e-9 | drift on 2/3 seeds | walked back: all 4 (mostly) |
+| EP | 0.316±0.038 | 3-8e3 | 1-2e-4 | -0.04 to 0.12 | trustworthy |
+
+### Cross-architecture (s42)
+
+| arch | terminal LN | DFA acc | BP acc | DFA fires | BP fires |
+|---|:---:|---:|---:|---:|:---:|
+| ResMLP (4-block d=256) | yes | 0.31 | 0.61 | (a)+(b) ep 4-8 | never |
+| ViT-Mini (4-block d=128) | yes | 0.26 | 0.80 | (a)+(b) ep 1-3 | never |
+| StudentNet (4-block d=128) | **no** | 0.33 | 0.62 | (a) ep 18, **(b) NEVER** | never |
+
+### Penalty rescue (3-seed for λ=1e-2, single-seed for others)
+
+| condition | acc | ‖h_L‖ | ‖g_L‖ |
+|---|---:|---:|---:|
+| DFA-vanilla | 0.308 ± 0.014 | 4.4e8 | 5e-10 |
+| DFA + λ=1e-3 ‖f‖² | 0.372 (1 seed) | 4.0e4 | 7e-7 |
+| DFA + λ=1e-2 ‖f‖² | 0.363 ± 0.001 | 3.8e4 | 1e-6 |
+| DFA + λ=1e-1 ‖f‖² | (running) | (running) | (running) |
+| DFA-shallow baseline | 0.349 ± 0.002 | (n/a) | (n/a) |
+| BP-trainable | 0.609 ± 0.004 | 2.0e2 | 5e-5 |
+
+The λ=1e-2 penalty gains +5.5 pp over vanilla DFA (0.363 vs 0.308) but only +1.4 pp over
+the shallow baseline (0.363 vs 0.349); the mechanism is necessary but not sufficient.
+
+## Status of evidence
+
+- §1 protocol package: **DONE**, committed
+- §2 audit findings: **DONE** for ResMLP single + 3-seed; ViT audit waiting on checkpoint training
+- §3 decision utility: **DONE**, ablation table + sensitivity analysis ready
+- §4 temporal validation: **DONE** for 3 architectures × 3 seeds (ResMLP, ViT, StudentNet)
+- §5 pipeline pitfalls: **DONE**, all 6 reproducers
+- §4 figure: **DONE** for cross-arch single-seed
+- Direction-quality direct test: **RUNNING** (vanilla phase silent)
+- ResNet experiment: **RUNNING** (ep 1 of 60)
+- ViT checkpoint training: **RUNNING** (just started)
+- λ=1e-1 penalty data: **RUNNING** (ep 30 of 100)
+- Penalized DFA protocol application: **WAITING** for direction-quality checkpoint
+
+## Open questions for paper writeup
+
+- Should the protocol report (a)/(b)/(d) as 3 redundant tests or as 1 composite indicator with 3 input signals? Current view: keep them separate so authors can see *why* the verdict was triggered.
+- Should the paper claim the protocol generalizes to architectures not yet tested? Current view: scope claims tightly to "modern pre-LN residual architectures" and explicitly note ResNet/BatchNorm and StudentNet as the architectural extremes we've tested.
+- Should the paper recommend specific threshold defaults, or leave them caller-configurable? Current view: ship defaults (50, 1e-7, 0.30, 2pp) with the calibration evidence and let users override.
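
The synthetic findings for pitfalls 1-3 in the §5 table can be checked by hand. The block below is a minimal pure-Python sketch of that arithmetic, not the contents of `verify_pitfalls.py`. The helper names (`p_norm`, `cosine`, `to_fp16`, `to_bf16`) are hypothetical. The gradient norm of 5e-10 used for pitfall 2 is an assumption, chosen because it matches the magnitude quoted for pitfall 3 and reproduces the 20× factor. The clamp is modeled as `max(norm, eps)` on each norm. The bf16 conversion is emulated by rounding a float32 to its top 16 bits.

```python
import math
import struct


def p_norm(values, p):
    """Vector p-norm of a flat sequence: (sum |x|^p)^(1/p)."""
    return sum(abs(x) ** p for x in values) ** (1.0 / p)


# Pitfall 1: a bare -1 passed to norm() is read as the order p, not the axis.
rows = [[3.0, 4.0], [6.0, 8.0]]
l_minus_1 = p_norm([x for row in rows for x in row], -1)  # ~1.143
row_l2 = [p_norm(row, 2) for row in rows]                 # [5.0, 10.0]


# Pitfall 2: clamping each norm at eps shrinks the cosine of tiny vectors.
def cosine(a, b, eps=0.0):
    """Cosine similarity with each norm clamped from below at eps."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (max(p_norm(a, 2), eps) * max(p_norm(b, 2), eps))


TRUE_COS = 0.018
G_NORM = 5e-10  # assumed gradient magnitude (hypothetical, see lead-in)
g = [G_NORM * TRUE_COS, G_NORM * math.sqrt(1.0 - TRUE_COS ** 2)]
ref = [1.0, 0.0]
cos_true = cosine(g, ref)                # ~0.018
cos_clamped = cosine(g, ref, eps=1e-8)   # ~0.0009, i.e. 20x too small


# Pitfall 3: 5e-10 is below the smallest fp16 subnormal (~6e-8).
def to_fp16(x):
    """Round-trip a float through IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Emulate bf16 by rounding a float32 to its top 16 bits.

    Sketch only: finite inputs are assumed, NaN/inf are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


g_fp16 = to_fp16(5e-10)  # 0.0, so any cosine built on it is 0/0
g_bf16 = to_bf16(5e-10)  # close to 5e-10, bf16 keeps the fp32 exponent range

print(f"L_-1 norm: {l_minus_1:.3f}, row L2 norms: {row_l2}")
print(f"cos true: {cos_true:.4f}, cos clamped: {cos_clamped:.4f}")
print(f"fp16: {g_fp16}, bf16: {g_bf16:.3e}")
```

In this particular example the reference vector has unit norm, so clamping the product of the norms instead of each norm separately would give the same 0.0009.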