| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-07 23:15:47 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-07 23:15:47 -0500 |
| commit | 5dadf7b78cbd3332b48a3ec0c385e3aeaea253a6 (patch) | |
| tree | a23e43b7b0786cdc0a9ca7d304b1aca8d9d5bd73 /protocol | |
| parent | 33a84534418c0459dc3988bdd53df09dcd3ab676 (diff) | |
Add EVIDENCE_SUMMARY.md: consolidated snapshot of all protocol evidence

Single-document overview of every result the protocol package has
produced so far, with reproducibility commands and the file/memory entry
where each result is recorded. Organized by paper section (§1 protocol,
§2 audit, §3 decision utility, §4 temporal validation, §5 pitfalls).
Includes the headline tables (3-seed audit, cross-architecture, penalty
sweep) ready for the paper, and an explicit status field for each
ongoing experiment.

This is a reading guide for anyone (codex, future-me, the user) who
needs to know what evidence is ready and how to reproduce it.

Diffstat (limited to 'protocol')
| -rw-r--r-- | protocol/EVIDENCE_SUMMARY.md | 111 |
1 files changed, 111 insertions, 0 deletions
diff --git a/protocol/EVIDENCE_SUMMARY.md b/protocol/EVIDENCE_SUMMARY.md
new file mode 100644
index 0000000..517f7c7
--- /dev/null
+++ b/protocol/EVIDENCE_SUMMARY.md
@@ -0,0 +1,111 @@
+# FA Diagnostic Protocol — Evidence Summary
+
+A consolidated snapshot of all the empirical evidence the protocol package
+in this repo has produced. Each item includes the reproducibility command
+and the file or memory entry where the result is recorded.
+
+## §1 The protocol itself
+
+| component | location | reproduce |
+|---|---|---|
+| Reference implementation | `protocol/protocol.py` | `import protocol; protocol.diagnose(...)` |
+| Reporting class | `protocol/report.py` | — |
+| Smoke test (BP/DFA/EP) | `protocol/smoke_test.py` | `python -m protocol.smoke_test` |
+| Reporting template (paper appendix form) | `protocol/REPORTING_TEMPLATE.md` | — |
+| Pipeline pitfalls checklist | `protocol/CHECKLIST.md` | — |
+| README | `protocol/README.md` | — |
+
+## §2 Audit findings
+
+| evidence | result | reproduce |
+|---|---|---|
+| 5-method audit table, seed 42 | BP/EP pass; DFA fails (a)(b)(d); SB/CB fail all 4 | `python -m protocol.examples.audit_table --seeds 42` |
+| 5-method audit table, 3 seeds | DFA/SB/CB walked back on 3/3 seeds via (a)+(b)+(d) | `python -m protocol.examples.audit_table --seeds 42 123 456` |
+| Random init sanity (3 seeds) | All trustworthy at ~10% chance accuracy | `python -m protocol.examples.random_init_sanity` |
+| Stability sub-mode bimodality | DFA/SB/CB show drift OR noise sub-mode depending on seed; both are degenerate, both are caught by (b) | (in audit table output) |
+
+## §3 Decision utility
+
+| evidence | result | reproduce |
+|---|---|---|
+| 7-strategy ablation (S0-S_full) | Field standard (S1) walks back **0/5** methods; full protocol walks back **3/5** | `python -m protocol.examples.ablation_decision_utility` |
+| Per-diagnostic recall | (a), (b), (d) each catch all 3 failures independently; (c) catches 2/3 (drift sub-modes only) | (in ablation output) |
+| Threshold sensitivity | Diagnostic (a) has a **63× separation gap** between healthy and degenerate; (b) has a **24,338× gap** | `python -m protocol.examples.threshold_sensitivity` |
+
+## §4 Temporal validation
+
+| evidence | result | reproduce |
+|---|---|---|
+| Single-seed temporal evolution (ResMLP) | Diagnostics fire by epoch 4-11 of 100 epochs | `python -m protocol.examples.temporal_diagnostic_evolution --seed 42 --arch resmlp` |
+| 3-seed temporal evolution (ResMLP) | (b) fires ep 3-4, (a) fires ep 8-11 across all seeds | iterate over seeds 42/123/456 |
+| ViT-Mini temporal evolution (3 seeds) | (a) fires ep 1, (b) fires ep 2-3 — even faster than ResMLP | iterate `--arch vit` over seeds |
+| StudentNet (no out_ln) temporal evolution (3 seeds) | (a) fires ep 14-25, **(b) NEVER fires** — terminal LN is causally responsible for the gradient collapse | iterate `--arch no_outln` over seeds |
+| Cross-architecture temporal figure | 3×3 grid: ResMLP/ViT/StudentNet × ‖h_L‖/‖g_L‖/acc | `python -m protocol.examples.plot_temporal_cross_arch --seed 42` |
+| Training-monitor early-stop demo | DFA: protocol fires at ep 4, saves 96% compute, 0pp acc loss; BP: never fires, runs to completion | `python -m protocol.examples.training_monitor_demo` |
+
+## §5 Pipeline pitfalls reproducers
+
+All 6 bugs from `protocol/CHECKLIST.md` have a reproducer:
+
+| bug | reproducer | finding |
+|---|---|---|
+| 1 — `tensor.norm(-1)` is L_{-1} | `verify_pitfalls.py` (synthetic) | `[3,4]/[6,8]` returns 1.143 instead of [5, 10] |
+| 2 — `F.cosine_similarity` clamps eps=1e-8 | `verify_pitfalls.py` (synthetic) | true cos 0.018 reported as 0.0009 (off by 20×) |
+| 3 — fp16 underflow | `verify_pitfalls.py` (synthetic) | 5e-10 → 0 in fp16; cosine NaN; bf16 works |
+| 4 — Bs are training-specific | `verify_pitfalls_4_6.py` (DFA s42) | training Bs Γ=+0.068; 10 fresh Bs draws Γ=+0.0043±0.007 |
+| 5 — aggregation strategy spread | `verify_pitfalls_4_6.py` (DFA s42) | 4 strategies span [-0.028, +0.074]; **sign flips** |
+| 6 — layer-0 dominance | `verify_pitfalls_4_6.py` (DFA s42) | Γ_l0=+0.43; hidden-only mean -0.022 (negative) |
+
+## Headline quantitative results (paper table)
+
+### 5-method audit, ResMLP CIFAR-10 (3-seed mean ± std)
+
+| method | acc | ‖h_L‖ | ‖g_L‖ | stability | verdict |
+|---|---:|---:|---:|---:|---|
+| BP | 0.615±0.004 | 2.0e2 | 3.6e-4 | 0.10 (range 0.08-0.11) | trustworthy |
+| DFA | 0.306±0.012 | 4-7e8 | 2-4e-9 | bimodal | walked back: (a)+(b)+(d) |
+| State Bridge | 0.205±0.039 | 7e7-2e8 | 2e-9 | mostly drift | walked back: all 4 |
+| Credit Bridge | 0.289±0.034 | 1-6e8 | 9e-10 to 4e-9 | drift on 2/3 seeds | walked back: all 4 (mostly) |
+| EP | 0.316±0.038 | 3-8e3 | 1-2e-4 | -0.04 to 0.12 | trustworthy |
+
+### Cross-architecture (s42)
+
+| arch | terminal LN | DFA acc | BP acc | DFA fires | BP fires |
+|---|:---:|---:|---:|---:|:---:|
+| ResMLP (4-block d=256) | yes | 0.31 | 0.61 | (a)+(b) ep 4-8 | never |
+| ViT-Mini (4-block d=128) | yes | 0.26 | 0.80 | (a)+(b) ep 1-3 | never |
+| StudentNet (4-block d=128) | **no** | 0.33 | 0.62 | (a) ep 18, **(b) NEVER** | never |
+
+### Penalty rescue (3-seed for λ=1e-2, single-seed for others)
+
+| condition | acc | ‖h_L‖ | ‖g_L‖ |
+|---|---:|---:|---:|
+| DFA-vanilla | 0.308 ± 0.014 | 4.4e8 | 5e-10 |
+| DFA + λ=1e-3 ‖f‖² | 0.372 (1 seed) | 4.0e4 | 7e-7 |
+| DFA + λ=1e-2 ‖f‖² | 0.363 ± 0.001 | 3.8e4 | 1e-6 |
+| DFA + λ=1e-1 ‖f‖² | (running) | (running) | (running) |
+| DFA-shallow baseline | 0.349 ± 0.002 | (n/a) | (n/a) |
+| BP-trainable | 0.609 ± 0.004 | 2.0e2 | 5e-5 |
+
+The penalty rescues by +5.5 pp over vanilla DFA but only +1.4 pp over the
+shallow baseline; mechanism is necessary but not sufficient.
+
+## Status of evidence
+
+- §1 protocol package: **DONE**, committed
+- §2 audit findings: **DONE** for ResMLP single + 3-seed; ViT audit waiting on checkpoint training
+- §3 decision utility: **DONE**, ablation table + sensitivity analysis ready
+- §4 temporal validation: **DONE** for 3 architectures × 3 seeds (ResMLP, ViT, StudentNet)
+- §5 pipeline pitfalls: **DONE**, all 6 reproducers
+- §4 figure: **DONE** for cross-arch single-seed
+- Direction-quality direct test: **RUNNING** (vanilla phase silent)
+- ResNet experiment: **RUNNING** (ep 1 of 60)
+- ViT checkpoint training: **RUNNING** (just started)
+- λ=1e-1 penalty data: **RUNNING** (ep 30 of 100)
+- Penalized DFA protocol application: **WAITING** for direction-quality checkpoint
+
+## Open questions for paper writeup
+
+- Should the protocol report (a)/(b)/(d) as 3 redundant tests or as 1 composite indicator with 3 input signals? Current view: keep them separate so authors can see *why* the verdict was triggered.
+- Should the paper claim the protocol generalizes to architectures not yet tested? Current view: scope claims tightly to "modern pre-LN residual architectures" and explicitly note ResNet/BatchNorm and StudentNet as the architectural extremes we've tested.
+- Should the paper recommend specific threshold defaults, or leave them caller-configurable? Current view: ship defaults (50, 1e-7, 0.30, 2pp) with the calibration evidence and let users override.
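
Reviewer note on the §5 table in the diff above: the arithmetic behind pitfalls 1 and 3 can be checked without PyTorch. The sketch below is not the contents of `verify_pitfalls.py`, only a stdlib illustration of the same numbers. The helpers `p_norm`, `to_fp16`, and `to_bf16` are hypothetical names introduced here, and bf16 is emulated by truncating a float32 to its top 16 bits (an assumption; real hardware usually rounds to nearest).

```python
import struct


def p_norm(values, p):
    """Vector p-norm of a flat sequence: (sum |x|^p)^(1/p)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def to_fp16(x):
    """Round-trip a Python float through IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of a float32.

    Assumption: truncation instead of round-to-nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


rows = [[3.0, 4.0], [6.0, 8.0]]

# Pitfall 1: a bare -1 is read as the norm order p = -1 over all elements,
# not as "L2 norm along the last dimension".
flat = [v for row in rows for v in row]
wrong = p_norm(flat, -1)                   # about 1.143, one scalar
right = [p_norm(row, 2) for row in rows]   # per-row L2 norms

# Pitfall 3: 5e-10 is below the smallest fp16 subnormal, so it flushes to 0;
# bf16 keeps the float32 exponent range, so the value survives.
tiny = 5e-10
fp16_value = to_fp16(tiny)
bf16_value = to_bf16(tiny)

print("L_{-1} over all elements:", round(wrong, 3))
print("row-wise L2:", right)
print("5e-10 in fp16:", fp16_value)
print("5e-10 in bf16 (truncated):", bf16_value)
```

A value that flushes to zero in fp16 makes the denominator of a cosine zero as well, which is consistent with the "cosine NaN" entry in the table.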
