# FA Diagnostic Protocol — Evidence Summary

A consolidated snapshot of the empirical evidence produced by the protocol
package in this repo. Each item lists its reproduction command (where one
exists) and the file or memory entry where the result is recorded.

## §1 The protocol itself

| component | location | reproduce |
|---|---|---|
| Reference implementation | `protocol/protocol.py` | `import protocol; protocol.diagnose(...)` |
| Reporting class | `protocol/report.py` | — |
| Smoke test (BP/DFA/EP) | `protocol/smoke_test.py` | `python -m protocol.smoke_test` |
| Reporting template (paper appendix form) | `protocol/REPORTING_TEMPLATE.md` | — |
| Pipeline pitfalls checklist | `protocol/CHECKLIST.md` | — |
| README | `protocol/README.md` | — |

## §2 Audit findings

| evidence | result | reproduce |
|---|---|---|
| 5-method audit table, seed 42 | BP/EP pass; DFA fails (a)(b)(d); SB/CB fail all 4 | `python -m protocol.examples.audit_table --seeds 42` |
| 5-method audit table, 3 seeds | DFA/SB/CB walked back on 3/3 seeds via (a)+(b)+(d) | `python -m protocol.examples.audit_table --seeds 42 123 456` |
| Random init sanity (3 seeds) | All trustworthy at ~10% chance accuracy | `python -m protocol.examples.random_init_sanity` |
| Stability sub-mode bimodality | DFA/SB/CB show drift OR noise sub-mode depending on seed; both are degenerate, both are caught by (b) | (in audit table output) |

## §3 Decision utility

| evidence | result | reproduce |
|---|---|---|
| 7-strategy ablation (S0-S_full) | Field standard (S1) walks back **0/5** methods; full protocol walks back **3/5** | `python -m protocol.examples.ablation_decision_utility` |
| Per-diagnostic recall | (a), (b), (d) each catch all 3 failures independently; (c) catches 2/3 (drift sub-modes only) | (in ablation output) |
| Threshold sensitivity | Diagnostic (a) has a **63× separation gap** between healthy and degenerate; (b) has a **24,338× gap** | `python -m protocol.examples.threshold_sensitivity` |

## §4 Temporal validation

| evidence | result | reproduce |
|---|---|---|
| Single-seed temporal evolution (ResMLP) | Diagnostics fire by epoch 4-11 of 100 epochs | `python -m protocol.examples.temporal_diagnostic_evolution --seed 42 --arch resmlp` |
| 3-seed temporal evolution (ResMLP) | (b) fires ep 3-4, (a) fires ep 8-11 across all seeds | iterate over seeds 42/123/456 |
| ViT-Mini temporal evolution (3 seeds) | (a) fires ep 1, (b) fires ep 2-3 — even faster than ResMLP | iterate `--arch vit` over seeds |
| StudentNet (no out_ln) temporal evolution (3 seeds) | (a) fires ep 14-25, **(b) NEVER fires**; consistent with terminal LN being associated with the gradient collapse (observational, not a causal identification; see §3.7) | iterate `--arch no_outln` over seeds |
| Cross-architecture temporal figure | 3×3 grid: ResMLP/ViT/StudentNet × ‖h_L‖/‖g_L‖/acc | `python -m protocol.examples.plot_temporal_cross_arch --seed 42` |
| Training-monitor early-stop demo | DFA: protocol fires at ep 4, saves 96% compute, 0pp acc loss; BP: never fires, runs to completion | `python -m protocol.examples.training_monitor_demo` |
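
The early-stop row can be read as a simple monitor over per-epoch logs. The sketch below is a minimal illustration, not the code in `training_monitor_demo`: the function names are hypothetical, the traces are synthetic, and it assumes that (a) fires when max per-block growth exceeds 50× and (b) fires when the gradient norm drops below the 1e-7 floor.

```python
from typing import Optional, Sequence

GROWTH_MAX = 50.0   # assumed default threshold for diagnostic (a)
GRAD_FLOOR = 1e-7   # gradient-norm floor used by diagnostic (b)


def first_firing_epoch(growth: Sequence[float],
                       grad_norm: Sequence[float]) -> Optional[int]:
    """Return the first 1-indexed epoch where (a) or (b) fires, else None."""
    for epoch, (g, n) in enumerate(zip(growth, grad_norm), start=1):
        if g > GROWTH_MAX or n < GRAD_FLOOR:
            return epoch
    return None


def compute_saved(stop_epoch: Optional[int], total_epochs: int) -> float:
    """Fraction of the epoch budget skipped by stopping at stop_epoch."""
    if stop_epoch is None:
        return 0.0
    return 1.0 - stop_epoch / total_epochs


# Synthetic traces (not repo data): one collapsing run, one healthy run.
collapsing_growth = [2.0, 8.0, 30.0, 40.0, 400.0]
collapsing_grad = [1e-4, 1e-5, 1e-6, 5e-8, 1e-9]
healthy_growth = [1.1, 1.2, 1.2, 1.3, 1.3]
healthy_grad = [4e-4, 4e-4, 3e-4, 4e-4, 4e-4]

stop = first_firing_epoch(collapsing_growth, collapsing_grad)
saved = compute_saved(stop, total_epochs=100)
never = first_firing_epoch(healthy_growth, healthy_grad)
print(f"collapsing run stops at epoch {stop}, saving {saved:.0%} of the budget")
print(f"healthy run fires: {never}")
```

Stopping at epoch 4 of 100 corresponds to the 96% compute saving quoted in the table; the healthy trace never triggers either check and runs to completion.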

## §3.5 Threshold sensitivity (calibration evidence)

| evidence | result | reproduce |
|---|---|---|
| Threshold sensitivity sweep | (a) **63× separation gap**, (b) **24,338× separation gap** between healthy and degenerate | `python -m protocol.examples.threshold_sensitivity` |
| Default thresholds (a)+(b) | sit cleanly in the middle of substantial margins; verdicts robust to ±50% perturbation | (in sensitivity output) |
| Diagnostic (d) frozen-baseline threshold | **NOT robust** — penalized DFA at λ=1e-2 fires at 2 pp threshold, passes at 1 pp threshold; at λ=1e-3 (1 seed) margin is +2.3 pp which passes at 2 pp. The (d) verdict depends on both threshold choice and λ choice. | `python -m protocol.examples.threshold_d_sensitivity` |
| Round 18 lesson | Soften language: "after the penalty correction, the depth contribution is at most 1.4 pp above the random-blocks baseline at λ=1e-2 — much smaller than BP's +26 pp gap over shallow", NOT "the deep blocks are passive". | n/a |
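
The contrast between the robust (b) threshold and the fragile (d) threshold can be checked with a few lines of arithmetic. This is a hedged sketch: the gradient norms are rounded values read off the 5-method audit table in this summary (range endpoints chosen conservatively), so the computed gap only approximates the 24,338× figure. The gap definition and the firing rule for (d) are assumptions, and the helper names are hypothetical.

```python
# Rounded inputs taken from tables in this summary, not new measurements.
healthy_grad = {"BP": 3.6e-4, "EP": 1e-4}
degenerate_grad = {"DFA": 4e-9, "State Bridge": 2e-9, "Credit Bridge": 4e-9}
GRAD_FLOOR = 1e-7    # diagnostic (b) floor
FROZEN_ACC = 0.349   # frozen shallow baseline accuracy


def separation_gap(healthy, degenerate):
    """Assumed definition: smallest healthy value over largest degenerate value."""
    return min(healthy.values()) / max(degenerate.values())


def b_verdicts(floor):
    """Map each method to True when (b) would fire at the given floor."""
    merged = {**healthy_grad, **degenerate_grad}
    return {name: norm < floor for name, norm in merged.items()}


def d_fires(margin_pp, threshold_pp):
    """Assumed rule: (d) fires when the margin over the frozen baseline is too small."""
    return margin_pp < threshold_pp


gap = separation_gap(healthy_grad, degenerate_grad)
b_robust = all(b_verdicts(GRAD_FLOOR * s) == b_verdicts(GRAD_FLOOR)
               for s in (0.5, 1.5))

margin_lam_1e2 = (0.363 - FROZEN_ACC) * 100   # about 1.4 pp at lambda=1e-2
d_at_2pp = d_fires(margin_lam_1e2, 2.0)       # fires
d_at_1pp = d_fires(margin_lam_1e2, 1.0)       # passes
d_lam_1e3 = d_fires(2.3, 2.0)                 # +2.3 pp margin passes at 2 pp

print(f"(b) gap ~ {gap:.0f}x, stable under +/-50%: {b_robust}")
print(f"(d) at lambda=1e-2: fires@2pp={d_at_2pp}, fires@1pp={d_at_1pp}")
print(f"(d) at lambda=1e-3: fires@2pp={d_lam_1e3}")
```

The (b) verdicts do not move when the floor is scaled by 0.5 or 1.5, while the (d) verdict flips between a 1 pp and a 2 pp threshold because the margin itself is only about 1.4 pp.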

## §3.6 Cross-width validation (d=512)

| evidence | result | reproduce |
|---|---|---|
| 4-method audit at d=512 | BP trustworthy on 3/3 seeds; DFA/SB/CB walked back on 3/3 (same as d=256) | `python -m protocol.examples.audit_d512` |
| Width effect | max-per-block growth is HIGHER at d=512 (6e3-7e4 vs ~1e3 at d=256) | (in d=512 output) |

## §3.7 Cross-architecture: CNN (no terminal LN, BatchNorm)

| method (3 seeds) | acc (mean ± std) | max/block growth | ‖g_3‖ | verdict |
|---|---:|---:|---:|---|
| BP CNN | 0.866 ± 0.003 | 1.31× | 4e-5 | trustworthy |
| State Bridge CNN | 0.633 ± 0.005 | 2.40× | 2e-3 | trustworthy |
| **DFA CNN** | **0.566 ± 0.021** | **237×** | 1e-3 | walk-back via (a) only |
| EP CNN | 0.512 ± 0.023 | 11.6× | ~6.6e-1 | trustworthy |
| Credit Bridge CNN | 0.325 ± 0.009 | 96× | 3e-3 | walk-back via (a) only |

**Key**: diagnostic (b) NEVER fires on CNN. Without terminal LN, BP grad does
not collapse below 1e-7. Combined with the StudentNet result, **(b) appears
restricted to the terminal-normalized architectures we audited** (round 18
softening: this is observational association across the architectures
tested, not causal identification of LayerNorm). DFA CNN reaches 0.566
(much higher than DFA ResMLP 0.31 / DFA ViT 0.24), consistent with the
literature: classical FA papers report DFA working on shallow CNNs but
failing on modern Transformers. On CNN the cosine remains in a measurable
regime (Γ=0.916 for DFA), but the training trajectory exhibits extreme
scale distortion (max-per-block growth 237×), so the headline Γ alone is
not a trustworthy summary of learning quality even though the cosine
itself is well-defined.

Reproduce: `python -m protocol.examples.audit_cnn`
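
As a sanity check on the verdict column, the (a) and (b) checks can be applied directly to the table rows. This is a minimal sketch rather than the logic in `audit_cnn`: it assumes (a) fires above 50× max per-block growth and (b) fires below the 1e-7 gradient floor, and it ignores (c) and (d). The `fired` helper is hypothetical.

```python
GROWTH_MAX = 50.0   # assumed default threshold for diagnostic (a)
GRAD_FLOOR = 1e-7   # gradient-norm floor for diagnostic (b)

# (max per-block growth, gradient norm at block 3), copied from the CNN table.
cnn_rows = {
    "BP CNN": (1.31, 4e-5),
    "State Bridge CNN": (2.40, 2e-3),
    "DFA CNN": (237.0, 1e-3),
    "EP CNN": (11.6, 6.6e-1),
    "Credit Bridge CNN": (96.0, 3e-3),
}


def fired(growth: float, grad_norm: float) -> list:
    """Return the labels of the diagnostics that fire for one table row."""
    labels = []
    if growth > GROWTH_MAX:
        labels.append("a")
    if grad_norm < GRAD_FLOOR:
        labels.append("b")
    return labels


results = {name: fired(*row) for name, row in cnn_rows.items()}
for name, labels in results.items():
    verdict = "walk-back via " + "+".join(labels) if labels else "trustworthy"
    print(f"{name:20s} {verdict}")
```

Under these assumed thresholds only DFA CNN and Credit Bridge CNN trip a check, and in both cases it is (a) alone, matching the verdict column.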

## §4 Two distinct failure modes (LOCKED — round 20)

The §4 framing is locked after rounds 18-20. Two distinct failure modes,
six independent validations.

### Mode 1: measurement degeneracy via terminal-LayerNorm gradient cancellation

Residual-stream growth → the BP gradient at hidden layers collapses below the
1e-7 floor → the cosine-alignment metric is measured against a degenerate reference.
**Caught by diagnostic (b).** Direct empirical evidence is the 5-method
audit table where DFA/SB/CB all have ‖g_L‖ ~ 1e-9.

### Mode 2: low intrinsic credit-direction quality of random feedback

Even in the meaningful regime (vanilla DFA at ep 1, ‖g‖ ~ 10⁻⁶), DFA's
local credit signal `e_T B_l^T` is essentially uncorrelated with BP grad
on deep layers. **Caught by direct per-layer cosine measurement.**

### Six independent validations of the converged framing

| # | evidence | result | reproduce |
|---|---|---|---|
| 1 | Direct deep-layer cosine on penalized DFA, 3 seeds | layer-mean +0.186 ± 0.007; deep mean +0.155 ± 0.025 | `experiments/measure_direction_quality_existing_ckpt.py` on `results/dfa_pen_short/dfa_pen_lam0.01_s{42,123,456}.pt` |
| 2 | Null calibration with 20 fresh random Bs | training-Bs deep cos +0.16 vs fresh-Bs +0.002 ± 0.022 | `experiments/null_calibration_penalized_cos.py` |
| 3 | Hypothesis B disambiguation (vanilla early-epoch) | vanilla deep cos -0.008 ± 0.013 across 3 seeds × ep 1, even with ‖g‖ in meaningful regime | `experiments/vanilla_dfa_early_ckpt.py` + measure script |
| 4 | BP+penalty capacity-cost 2×2 control | BP+pen acc 0.530 (-8 pp); DFA+pen 0.363 (+5.5 pp); 17 pp residual gap consistent with credit quality | `experiments/bp_with_penalty_control.py` |
| 5 | Multi-seed lock-in (round 20) | 24 measurements (3 seeds × 2 epochs × 4 deep layers) all in [-0.04, +0.02] | iterate measure script over s42/s123/s456 × ep1/ep2 |
| 6 | **Perturbation correlation triangulation** (round 19 alt metric) | vanilla deep ρ +0.002, penalized deep ρ +0.080 ± 0.011 (3 seeds). Different metric (no eps clamp, sample-level loss change), same pattern as cos | `experiments/perturbation_correlation_audit.py` |
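
Row 2 can be read as a z-score of the training-B cosine against the fresh-B null distribution. The sketch below shows that arithmetic using the summary statistics from the table; the helper names are hypothetical, and the sample list in the second call is synthetic, included only to show the per-draw form.

```python
import statistics


def null_z(observed: float, null_mean: float, null_std: float) -> float:
    """Standard deviations separating an observed value from the null mean."""
    return (observed - null_mean) / null_std


def null_z_from_samples(observed: float, null_samples: list) -> float:
    """Same score, computed from individual null draws (sample std, n-1)."""
    return null_z(observed,
                  statistics.mean(null_samples),
                  statistics.stdev(null_samples))


# Summary statistics from row 2: training Bs +0.16, fresh Bs +0.002 +/- 0.022.
z_training_bs = null_z(0.16, 0.002, 0.022)
print(f"z = {z_training_bs:.2f}")

# Synthetic null draws, for illustration only.
z_synthetic = null_z_from_samples(0.16, [0.02, -0.02, 0.0, 0.01, -0.01])
print(f"z (synthetic draws) = {z_synthetic:.2f}")
```

With the reported numbers the training-B cosine sits roughly seven null standard deviations above the fresh-B mean, which is the sense in which the measured alignment is specific to the training Bs.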

### Penalty rescue 3-seed table (λ=1e-2)

| seed | acc | ‖h_L‖ | ‖g_2‖ | deep cos l1-l4 mean |
|---:|---:|---:|---:|---:|
| 42 | 0.363 | 3.8e4 | 9.9e-7 | +0.163 |
| 123 | 0.362 | 4.1e4 | 8.1e-7 | +0.151 |
| 456 | 0.364 | 4.1e4 | 9.0e-7 | +0.139 |
| **mean** | **0.363 ± 0.001** | **4.0e4** | **9.0e-7** | **+0.151 ± 0.012** |
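
The mean row can be recomputed from the three seed rows. A short check, assuming the ± values are sample standard deviations (n-1), which is what reproduces the deep-cosine entry:

```python
import statistics

# Per-seed values copied from the table above (seeds 42, 123, 456).
acc = [0.363, 0.362, 0.364]
deep_cos = [0.163, 0.151, 0.139]

acc_mean, acc_std = statistics.mean(acc), statistics.stdev(acc)
cos_mean, cos_std = statistics.mean(deep_cos), statistics.stdev(deep_cos)
cos_pstd = statistics.pstdev(deep_cos)  # population std, for comparison

print(f"acc      {acc_mean:.3f} +/- {acc_std:.3f}")
print(f"deep cos {cos_mean:+.3f} +/- {cos_std:.3f} (population std {cos_pstd:.3f})")
```

The sample standard deviation rounds to 0.012 for the deep cosine, while the population form rounds to 0.010, so the table is consistent with the n-1 convention.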

### BP+penalty 2×2 grid (matched 30-epoch 3-seed values, paper v2.32)

| | no penalty | with penalty | penalty effect |
|---|---:|---:|---:|
| BP | 0.585 ± 0.001 | **0.532 ± 0.006** | −5.3 pp (capacity loss) |
| DFA | 0.301 ± 0.005 | 0.360 ± 0.001 | +5.9 pp (rescue) |
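
The penalty-effect column and the residual gap follow directly from the four cell means. A small worked check (the dictionary layout and helper name are illustrative only):

```python
# Cell means copied from the 2x2 grid above.
grid = {
    ("BP", "none"): 0.585,
    ("BP", "pen"): 0.532,
    ("DFA", "none"): 0.301,
    ("DFA", "pen"): 0.360,
}


def diff_pp(a: float, b: float) -> float:
    """Accuracy difference a - b in percentage points, rounded to 0.1 pp."""
    return round((a - b) * 100, 1)


bp_effect = diff_pp(grid[("BP", "pen")], grid[("BP", "none")])
dfa_effect = diff_pp(grid[("DFA", "pen")], grid[("DFA", "none")])
gap_without_penalty = diff_pp(grid[("BP", "none")], grid[("DFA", "none")])
gap_with_penalty = diff_pp(grid[("BP", "pen")], grid[("DFA", "pen")])

print(f"penalty effect: BP {bp_effect:+.1f} pp, DFA {dfa_effect:+.1f} pp")
print(f"BP-DFA gap: {gap_without_penalty:.1f} pp -> {gap_with_penalty:.1f} pp")
```

The penalty narrows the BP-DFA gap from 28.4 pp to 17.2 pp; the remaining 17.2 pp is the residual gap that persists once both methods pay the same penalty.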

### Vanilla FA vs DFA (2026-04-22, commit 88ff85c)

**PAPER-CHANGING FINDING.** FA (Lillicrap 2016: sequential backward pass with d×d random matrices) behaves very differently from DFA on the same architecture: higher accuracy, a non-degenerate deep cosine, and no Mode 1(b) firing.

| | FA | DFA |
|---|---|---|
| Test acc (100ep, 3-seed, d=256) | **0.401 ± 0.009** | 0.306 ± 0.008 |
| vs frozen 0.349 | **5.2 pp above** | 4.3 pp below |
| Deep cos | **+0.33** | ~0 (degenerate) |
| ‖h_L‖ | ~10⁵ | ~5×10⁸ |
| ‖g_L‖ | ~10⁻⁶ (meaningful) | ~10⁻¹⁰ (floor) |
| Mode 1(b) fires? | **NO** | YES |

Same local loss ⟨f_l, a_l⟩, same architecture, same optimizer. The only difference is how a_l is computed (sequential backward pass vs direct projection). FA's sequential backward pass retains credit quality (deep cos +0.33) and does not show the catastrophic Mode 1 growth seen under DFA. **This is the strongest empirical support so far for the Mode 2 → Mode 1 causal hypothesis.**

Source: `results/fa_main_audit/results_cifar10.json`

FA depth sweep (d=512, 100ep, s42):

| L | FA acc | FA deep cos | DFA acc | DFA deep cos |
|---|---|---|---|---|
| 2 | 0.350 | +0.96 | — | — |
| 4 | 0.424 | +0.29 | — | — |
| 6 | 0.401 | +0.16 | — | — |
| 8 | 0.409 | +0.11 | 0.306 | ~0 |
| 12 | 0.404 | +0.09 | 0.309 | ~0 |

### Round 20 phrasing for the gap

**Lower bound on non-capacity gap**: matched penalty controls show that only part of DFA's deficit is attributable to the representational/optimization cost of the penalty itself; a substantial residual remains and is consistent with poorer credit assignment.

**Cannot rule out (caveats)**:
- BP uses end-to-end loss, DFA uses local block losses — the 2×2 isn't a perfectly clean isolation of "credit quality" in a vacuum
- The "different optimization regime" objection: the penalty hurts BP (-8 pp; -5.3 pp in the matched 30-epoch grid) while helping DFA (+5.5 pp; +5.9 pp matched), the opposite of what a generally beneficial regime shift would do, so this is unlikely but not airtight

## §5 Pipeline pitfalls reproducers

All 6 bugs from `protocol/CHECKLIST.md` have a reproducer:

| bug | reproducer | finding |
|---|---|---|
| 1 — `tensor.norm(-1)` is L_{-1} | `verify_pitfalls.py` (synthetic) | `[3,4]/[6,8]` returns 1.143 instead of [5, 10] |
| 2 — `F.cosine_similarity` clamps eps=1e-8 | `verify_pitfalls.py` (synthetic) | true cos 0.018 reported as 0.0009 (off by 20×) |
| 3 — fp16 underflow | `verify_pitfalls.py` (synthetic) | 5e-10 → 0 in fp16; cosine NaN; bf16 works |
| 4 — Bs are training-specific | `verify_pitfalls_4_6.py` (DFA s42) | training Bs Γ=+0.068; 10 fresh Bs draws Γ=+0.0043±0.007 |
| 5 — aggregation strategy spread | `verify_pitfalls_4_6.py` (DFA s42) | 4 strategies span [-0.028, +0.074]; **sign flips** |
| 6 — layer-0 dominance | `verify_pitfalls_4_6.py` (DFA s42) | Γ_l0=+0.43; hidden-only mean -0.022 (negative) |
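
The three synthetic pitfalls can be illustrated without PyTorch. The sketch below is a standard-library emulation, not the contents of `verify_pitfalls.py`, and all helper names are hypothetical. It assumes the cosine clamps each vector norm from below at `eps` (the exact clamping form has differed across PyTorch versions), and it emulates bf16 by truncating a float32, whereas real conversions usually round to nearest.

```python
import math
import struct


# Pitfall 1: a positional -1 is read as the norm order p, not as a dim.
def p_norm(values, p):
    """Vector p-norm over a flat list of values."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


rows = [[3.0, 4.0], [6.0, 8.0]]
flat = [v for row in rows for v in row]
wrong_norm = p_norm(flat, -1)                      # p = -1 over all elements
intended_norms = [math.hypot(*row) for row in rows]  # L2 over the last dim


# Pitfall 2: clamping tiny norms at eps shrinks the reported cosine.
def cosine(x, y, eps=0.0):
    """Cosine similarity with each norm clamped from below at eps (assumed form)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = max(math.sqrt(sum(a * a for a in x)), eps)
    ny = max(math.sqrt(sum(b * b for b in y)), eps)
    return dot / (nx * ny)


tiny_grad = [3e-10, 4e-10]      # norm 5e-10, below eps = 1e-8
reference = [1.0, 0.0]
true_cos = cosine(tiny_grad, reference)
clamped_cos = cosine(tiny_grad, reference, eps=1e-8)


# Pitfall 3: 5e-10 underflows in fp16 but survives in bf16.
def to_fp16(x):
    """Round-trip a float through IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Emulate bf16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


print(f"pitfall 1: {wrong_norm:.3f} vs intended {intended_norms}")
print(f"pitfall 2: true {true_cos:.3f}, clamped {clamped_cos:.3f}")
print(f"pitfall 3: fp16 {to_fp16(5e-10)}, bf16 {to_bf16(5e-10):.3e}")
```

With a gradient norm of 5e-10 and `eps=1e-8`, the clamped cosine is 20× smaller than the true one, the same factor reported for pitfall 2. Once the fp16 value is zero, any cosine computed from it becomes 0/0, which is where the NaN in pitfall 3 comes from.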

## Headline quantitative results (paper table)

### 5-method audit, ResMLP CIFAR-10 (3-seed mean ± std)

| method | acc | ‖h_L‖ | ‖g_L‖ | stability | verdict |
|---|---:|---:|---:|---:|---|
| BP | 0.615±0.004 | 2.0e2 | 3.6e-4 | 0.10 (range 0.08-0.11) | trustworthy |
| DFA | 0.306±0.012 | 4-7e8 | 2-4e-9 | bimodal | walked back: (a)+(b)+(d) |
| State Bridge | 0.205±0.039 | 7e7-2e8 | 2e-9 | mostly drift | walked back: all 4 |
| Credit Bridge | 0.289±0.034 | 1-6e8 | 9e-10 to 4e-9 | drift on 2/3 seeds | walked back: all 4 (mostly) |
| EP | 0.316±0.038 | 3-8e3 | 1-2e-4 | -0.04 to 0.12 | trustworthy |

### Cross-architecture (s42)

| arch | terminal LN | DFA acc | BP acc | DFA fires | BP fires |
|---|:---:|---:|---:|---:|:---:|
| ResMLP (4-block d=256) | yes | 0.31 | 0.61 | (a)+(b) ep 4-8 | never |
| ViT-Mini (4-block d=128) | yes | 0.26 | 0.80 | (a)+(b) ep 1-3 | never |
| StudentNet (4-block d=128) | **no** | 0.33 | 0.62 | (a) ep 18, **(b) NEVER** | never |

### Penalty rescue (matched 30-epoch 3-seed values, paper v2.32)

| condition | acc | ‖h_L‖ | ‖g_L‖ |
|---|---:|---:|---:|
| DFA-vanilla 30ep (3-seed) | 0.301 ± 0.005 | 4.4e8 (s42) / 5e8 (3-seed mean) | 4e-10 |
| DFA + λ=1e-4 ‖f‖² 30ep (3-seed) | 0.360 | 2.2e4 | 7e-7 |
| DFA + λ=1e-2 ‖f‖² 30ep (3-seed) | 0.360 ± 0.001 | 1.3e4 | 1.6e-6 |
| DFA + λ=1e-1 ‖f‖² 30ep (s42) | 0.349 | 1.2e4 | 1.6e-6 |
| DFA-shallow baseline (frozen) | 0.349 ± 0.002 | (n/a) | (n/a) |
| BP-trainable 30ep (3-seed) | 0.585 ± 0.001 | (n/a) | (n/a) |
| BP-trainable 100ep (3-seed) | 0.615 ± 0.004 | 2.0e2 | 5e-5 |
| BP+pen λ=1e-2 30ep (3-seed) | 0.532 ± 0.006 | 4.0e4 | (matches DFA+pen) |

The penalty recovers +5.9 pp over vanilla DFA (matched 30 epochs) but only
+1.1 pp over the frozen shallow baseline, so suppressing residual-stream
growth appears necessary but not sufficient for the deep blocks to contribute.

## Figures (paper-ready)

- `results/protocol_audit/figure_audit_5method.png` — §2/§3 hero figure: 4-panel horizontal bar chart of the 5-method audit, color-coded healthy/degenerate, threshold lines drawn
- `results/protocol_audit/figure_cross_arch_temporal_s42.png` — §4 cross-architecture temporal evolution: 3 architectures × 3 metric rows (‖h_L‖, ‖g_L‖, acc) with floor line

## Status of evidence

- §1 protocol package: **DONE**, committed
- §2 audit findings: **DONE** for ResMLP at d=256 (3 seeds), d=512 (3 seeds), and CNN (3 seeds, 5 methods). 11 method×architecture combinations × 3 seeds = 33 audited conditions. ViT audit waiting on checkpoint training
- §3 decision utility: **DONE**, ablation table + threshold sensitivity ready
- §3 hero figure: **DONE**
- §4 temporal validation: **DONE** for 3 architectures × 3 seeds (ResMLP, ViT, StudentNet)
- §4 cross-arch figure: **DONE**
- §4 two failure modes (separability): **DONE** via penalty partial audit + vanilla cosine measurement
- §5 pipeline pitfalls: **DONE**, all 6 reproducers
- Direction-quality direct test on penalized checkpoint: **RUNNING** (gives mechanistic explanation of (d) failure on penalized net)
- ResNet experiment: **RUNNING** (ep 1 of 60, slow due to GPU contention)
- ViT checkpoint training: **RUNNING** (ep 1 of 60)
- λ=1e-1 penalty data: **RUNNING** (ep 40 of 100)
- ViT 5-method audit: **WAITING** for ViT checkpoint

## Open questions for paper writeup

- Should the protocol report (a)/(b)/(d) as 3 redundant tests or as 1 composite indicator with 3 input signals? Current view: keep them separate so authors can see *why* the verdict was triggered.
- Should the paper claim the protocol generalizes to architectures not yet tested? Current view: scope claims tightly to "modern pre-LN residual architectures" and explicitly note ResNet/BatchNorm and StudentNet as the architectural extremes we've tested.
- Should the paper recommend specific threshold defaults, or leave them caller-configurable? Current view: ship defaults ((a) 50, (b) 1e-7, (c) 0.30, (d) 2 pp) with the calibration evidence and let users override.