# Paper draft outline: NeurIPS 2026 Evaluations & Datasets Track

Round 17 + 18 verdict locked in. This outline reflects ALL the evidence collected as of 2026-04-08, with round 18 language softening applied throughout.

**Working title** (round 17): *Beyond Accuracy and Alignment: A Diagnostic Evaluation Protocol for Feedback Alignment*

**Top-line claim**: standard FA evaluation on modern residual networks (headline accuracy + Γ alignment) is unreliable; we identify two distinct failure modes and provide a 4-diagnostic protocol that detects them.

## §1 Introduction

**Discovery-first hook** (round 16 narrative arc, kept under round 17 E&D framing):

- Modern Feedback Alignment evaluation reports headline accuracy and Γ alignment as evidence that the method is training successfully on a given architecture.
- We show, on standard 4-block pre-LayerNorm ResMLP and ViT-Mini at d=256/d=512 on CIFAR-10, that **both** signals silently fail. DFA "31% accuracy" walks back to "below the architecture-matched random-blocks baseline by 4 pp" (active harm). DFA Γ ≈ 0.10 walks back to "measured against a BP gradient at the numerical floor (~10⁻⁹), with the headline driven entirely by the embedding layer".
- The evaluation failure is not unique to DFA: State Bridge and Credit Bridge show the same pattern. The protocol catches all three on every seed we tested.
- We present the first systematic diagnostic protocol that catches these failures, with a reference implementation, validation across 3 architecture families (ResMLP, ViT-Mini, CNN), and an ablation showing 0/5 walked back by the field-standard reporting versus 3/5 by our protocol.
## §2 Audit findings

### 2.1 Audit on 4-block d=256 ResMLP CIFAR-10 (5 methods × 3 seeds)

- BP, EP: trustworthy on all 6 method×seed combinations
- DFA, State Bridge, Credit Bridge: walked back on all 9 combinations
- Headline numbers (3-seed mean ± std): table from `audit_table_s42_s123_s456.json`

### 2.2 Walk-back details

**DFA** (3-seed values from paper v2.32):

- 100ep test acc: 0.306 ± 0.006 (Table 1, audit). Reports as "DFA trains the deep blocks". Reality:
  - ‖h_L‖ ~ 5×10⁸ (3-seed mean) (vs BP ~200, vs random init ~9): catastrophic residual stream growth, 4-5 OOM
  - ‖g_L‖ ~ 4×10⁻¹⁰ (3-seed mean) (vs BP ~4×10⁻⁴, vs random init ~10⁻³): BP gradient at numerical floor, 5+ OOM below healthy
  - DFA-shallow (no deep blocks): 0.349 ± 0.002, so **DFA-trainable is 4.3 pp WORSE than DFA-shallow**
  - Cosine of `e_T B_l^T` to BP grad, layer 0 across seeds: +0.42, +0.44, +0.42; layers 1-4: ≈ 0 (3-seed deep mean -0.008 ± 0.013 at ep 1)
  - Headline Γ = 0.10 is entirely the embedding layer

**State Bridge** (3-seed): 0.205 ± 0.032 acc (Table 1). Same pattern. Cross-batch direction stability of BP grad: s42=0.992 (cleanest "drift-dominated reference vector" measurement); 3-seed mean 0.529.

**Credit Bridge** (3-seed): 0.289 ± 0.026 acc (Table 1). Same pattern. Cross-batch stability 3-seed mean 0.373.

### 2.3 Internal control (EP)

EP on the same architecture: 0.316 ± 0.038 acc. ‖h_L‖ ~ 5×10³ (bounded), ‖g_L‖ ~ 1×10⁻⁴ (healthy), per-layer cosines small-but-meaningful. **The protocol passes EP** even though EP's headline accuracy is also low. This isolates "the network is in a degenerate measurement regime" from "the method underperforms"; they are different failure modes with different evidence. EP is the §2 internal control that validates the protocol does not over-flag.

## §3 The diagnostic protocol

### 3.1 Four diagnostics

(a) **per-layer ‖h_l‖**: max-per-block growth `max_l (‖h_{l+1}‖ / ‖h_l‖)`. Threshold: 50×.

(b) **per-layer ‖g_l‖**: deepest-layer BP gradient norm. Floor: 10⁻⁷.

(c) **cross-batch direction stability**: mean pairwise cosine of normalized BP-grad direction across disjoint minibatches. Drift ceiling: 0.30.

(d) **frozen-blocks baseline**: trained-blocks acc must clear an architecture-matched random-untrained-blocks baseline by ≥ 2 pp.

(a)/(b)/(d) are independently sufficient for binary detection on the cases we audited. (c) is for sub-mode discrimination (drift-dominated vs noise-dominated reference vector); it does not catch new positives.

### 3.2 Decision-utility ablation

S0 (acc only) walks back 0/5 methods. S1 (acc + Γ field-standard) walks back 0/5 methods. S_full (full protocol) walks back 3/5 (DFA, SB, CB). The protocol's added value is 3 silent failures detected.

### 3.3 Threshold sensitivity

- (a) has a **63× separation gap** between healthy max (EP s123 = 11×) and degenerate min (CB s123 = 694×). Threshold 50× sits cleanly in the gap; verdicts robust to ±50% perturbation.
- (b) has a **24,338× separation gap**. Threshold 10⁻⁷ similarly robust.
- (d) is **NOT robust**: penalized DFA at λ=1e-2 fires at 2 pp, passes at 1 pp; at λ=1e-3 the margin is +2.3 pp (single seed), which passes at 2 pp. The (d) verdict on penalized conditions depends on both threshold and intervention strength.

### 3.4 Temporal validation (when does it fire during training?)

Replayed per-epoch DFA training data through the thresholds. On 4-block d=256 ResMLP across 3 seeds:

- diagnostic (b) first fires at epoch 3-4 of 100
- diagnostic (a) first fires at epoch 8-11
- DFA accuracy at the moment of firing is essentially identical to its final value (the "training" is doing nothing useful for the headline number after epoch 1)
- about 96% of compute could be saved by using the protocol as an early-stop criterion (stopping when (b) first fires)

On ViT-Mini (3 seeds): (a) fires at ep 1, (b) fires at ep 2-3. Even faster.

BP never fires on any of 9 BP×architecture×seed conditions.
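The threshold checks in §3.1 and the per-epoch replay in §3.4 reduce to a few comparisons. The following is a minimal sketch in plain Python, not the `protocol/` package itself: the function names, the strict-inequality conventions at the thresholds, and the assumption that per-layer norms are already logged as positive floats are all illustrative.

```python
from typing import Optional, Sequence

GROWTH_THRESHOLD = 50.0    # diagnostic (a): max per-block growth ratio
GRAD_FLOOR = 1e-7          # diagnostic (b): deepest-layer BP gradient norm
BASELINE_MARGIN_PP = 2.0   # diagnostic (d): margin over random-blocks baseline


def max_block_growth(h_norms: Sequence[float]) -> float:
    """max_l ||h_{l+1}|| / ||h_l|| over consecutive layers (norms assumed > 0)."""
    return max(nxt / cur for cur, nxt in zip(h_norms, h_norms[1:]))


def diag_a_fires(h_norms: Sequence[float], threshold: float = GROWTH_THRESHOLD) -> bool:
    """(a) fires when the largest per-block growth exceeds the threshold."""
    return max_block_growth(h_norms) > threshold


def diag_b_fires(g_norm_deepest: float, floor: float = GRAD_FLOOR) -> bool:
    """(b) fires when the deepest-layer BP gradient norm is below the floor."""
    return g_norm_deepest < floor


def diag_d_fires(trained_acc: float, random_blocks_acc: float,
                 margin_pp: float = BASELINE_MARGIN_PP) -> bool:
    """(d) fires when trained blocks fail to clear the baseline by margin_pp."""
    return (trained_acc - random_blocks_acc) * 100.0 < margin_pp


def first_firing_epoch(fired_per_epoch: Sequence[bool]) -> Optional[int]:
    """1-indexed epoch at which a diagnostic first fires, or None if it never does."""
    for epoch, fired in enumerate(fired_per_epoch, start=1):
        if fired:
            return epoch
    return None
```

With the matched 30-epoch accuracies quoted in §4.2 (penalized DFA 0.360, shallow baseline 0.349), `diag_d_fires` returns `True` at a 2 pp margin and `False` at 1 pp, which is the threshold dependence noted in §3.3.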
### 3.5 Cross-architecture validation

3 architecture families × 3 seeds × 2 methods (BP/DFA) = 18 conditions, plus 5-method × 3-seed audit on CNN.

| arch | terminal LN | BP fires? | DFA (a) fires? | DFA (b) fires? |
|---|:---:|:---:|:---:|:---:|
| ResMLP (d=256) | yes | never | yes (ep 8-11) | yes (ep 3-4) |
| ResMLP (d=512) | yes | never | yes (3-seed mean) | yes (3-seed mean) |
| ViT-Mini | yes | never | yes (ep 1) | yes (ep 2-3) |
| StudentNet (synthetic) | **no** | never | yes (ep 14-25) | **NEVER** |
| CNN (BatchNorm) | **no** | never | yes (5-method audit) | **NEVER** |

**[Round 18 softening]** Across the architectures we audited, **diagnostic (b) appears restricted to terminal-LayerNorm architectures**. In the absence of terminal LN, the BP gradient at hidden layers does not collapse to the numerical floor, even on DFA. This is observational association, not causal identification of LayerNorm; a future architecture without terminal LN where (b) fires would refine the claim.

### 3.6 Sub-mode discrimination via (c)

(c) splits the (a)+(b) failures into two interpretable sub-modes:

- **drift-dominated**: stability ≈ 0.5-1.0. Reference BP grad direction is sample-invariant. (Example: SB s42 stability = 0.992.)
- **noise-dominated**: stability ≈ 0. Reference direction is random per batch. (Example: DFA s42 stability = 0.047.)

The same method falls into different sub-modes on different seeds. (c) is for interpretation; (b) is the primary detector regardless of sub-mode.
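Diagnostic (c) as described above (mean pairwise cosine of normalized BP-grad directions across disjoint minibatches) can be sketched in plain Python. This is an illustration under stated assumptions, not the reference implementation: gradients are taken to be flattened into lists of floats, the names `direction_stability` and `sub_mode` are hypothetical, and mapping "above the 0.30 ceiling" to drift-dominated is an assumed reading of the threshold.

```python
import math
from itertools import combinations
from typing import List, Sequence

DRIFT_CEILING = 0.30  # threshold quoted for diagnostic (c)


def unit(vec: Sequence[float]) -> List[float]:
    """Normalize explicitly (no eps clamp) so tiny norms such as 1e-10 survive."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("zero gradient has no direction")
    return [x / norm for x in vec]


def direction_stability(grads: Sequence[Sequence[float]]) -> float:
    """Mean pairwise cosine between gradient directions from disjoint minibatches."""
    if len(grads) < 2:
        raise ValueError("need at least two minibatch gradients")
    units = [unit(g) for g in grads]
    cosines = [sum(a * b for a, b in zip(u, v)) for u, v in combinations(units, 2)]
    return sum(cosines) / len(cosines)


def sub_mode(stability: float, ceiling: float = DRIFT_CEILING) -> str:
    """Assumed labeling: above the ceiling is drift-dominated, else noise-dominated."""
    return "drift-dominated" if stability > ceiling else "noise-dominated"
```

Normalizing each gradient before taking dot products keeps the measurement about direction only, so a reference gradient at the numerical floor can still be classified by how consistent its direction is across batches.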
## §4 Two distinct failure modes (round 19 final framing)

### 4.1 The two modes

**Mode 1: measurement degeneracy via terminal LayerNorm gradient cancellation**

- DFA's local block losses have no global scale constraint
- Block parameters grow ~95× rel-delta on ResMLP (vs BP ~2.7×)
- Residual stream ‖h_L‖ ~ 10⁸ on ResMLP / ViT (4 OOM)
- Terminal LayerNorm Jacobian rescaling drives ‖g_L‖ to ~10⁻¹⁰ (below the `F.cosine_similarity` eps clamp and well below the 10⁻⁷ floor)
- The cosine alignment metric Γ is then computed against a numerical-floor reference vector; the value is mathematically defined but uninterpretable

**This is caught by diagnostic (b)**: ‖g_L‖ floor check.

**Mode 2: low intrinsic credit-direction quality of random feedback**

- Even when the BP gradient at hidden layers is in the meaningful regime (vanilla DFA at epoch 1, ‖g_L‖ ≈ 6×10⁻⁷), the deep-layer cosine of `e_T B_l^T` to BP grad is **essentially zero** (l1-l4 in [-0.05, +0.02] across vanilla ep 1-5)
- This is not measurement noise: the same checkpoint shows l0 cos = +0.42, well above noise. The deep blocks specifically have zero alignment.
- Random feedback projects the error signal in directions that are largely uncorrelated with the per-layer BP gradient, on the deep blocks of pre-LayerNorm residual networks

**This is caught by direct per-layer cosine measurement** (in the meaningful regime).
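The interaction between the two modes can be made concrete with a guarded per-layer cosine that refuses to report a value when the reference gradient is below the (b) floor. The sketch below is plain Python and hypothetical throughout: `clamped_cosine` only emulates a clamp of the form `max(‖v‖, eps)` (an assumption about how the eps enters; check the installed library version), and `guarded_cosine` / `per_layer_cosines` are illustrative names rather than functions from the `protocol/` package.

```python
import math
from typing import List, Optional, Sequence

GRAD_FLOOR = 1e-7  # diagnostic (b) floor


def l2_norm(vec: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in vec))


def clamped_cosine(a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine with each norm clamped from below by eps (assumed form of the clamp)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (max(l2_norm(a), eps) * max(l2_norm(b), eps))


def guarded_cosine(feedback: Sequence[float], bp_grad: Sequence[float],
                   floor: float = GRAD_FLOOR) -> Optional[float]:
    """Unclamped cosine, or None when the BP gradient is below the floor (mode 1)."""
    g_norm = l2_norm(bp_grad)
    f_norm = l2_norm(feedback)
    if g_norm < floor or f_norm == 0.0:
        return None  # measurement regime is degenerate; do not report a number
    dot = sum(x * y for x, y in zip(feedback, bp_grad))
    return dot / (f_norm * g_norm)


def per_layer_cosines(feedbacks: Sequence[Sequence[float]],
                      bp_grads: Sequence[Sequence[float]],
                      floor: float = GRAD_FLOOR) -> List[Optional[float]]:
    """One guarded cosine per layer; never aggregated across layers."""
    return [guarded_cosine(f, g, floor) for f, g in zip(feedbacks, bp_grads)]
```

Under this emulation, a perfectly aligned pair whose BP gradient has norm 4×10⁻¹⁰ reports 0.04 from `clamped_cosine` (a 25× distortion), while `guarded_cosine` returns `None`. Mode 2 is only read off layers where a number is returned.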
### 4.2 Causal validation: penalty rescue partially alleviates BOTH modes

On 4-block d=256 ResMLP, adding `λ ‖f_l(h_l)‖²` to each DFA local block loss (3 seeds at λ=1e-2, 30 epochs):

**Mode 1 alleviation** (residual stream + BP grad):

- ‖h_L‖: 4×10⁸ → ~4×10⁴ (4 OOM rescue, 3 seeds)
- ‖g_L‖: 5×10⁻¹⁰ → ~10⁻⁶ (4 OOM rescue, 3 seeds)
- Diagnostic (b) passes after the penalty

**Mode 2 alleviation** (deep credit alignment):

- Vanilla deep-layer cos (l1-l4): essentially zero
- Penalized deep-layer cos (l1-l4) 3-seed mean: **+0.155 ± 0.025**
- Null calibration with 20 fresh random Bs: deep cos = +0.002 ± 0.022, which confirms the +0.155 is real signal that the network adapted to its specific training Bs
- Diagnostic measurement of mode 2 is now in a partially-alleviated regime, but +0.155 is still much less than BP's self-cos of 1.0

**Both modes are partially alleviated, neither fully** (matched 30-ep 3-seed values, paper v2.32):

- DFA+pen acc λ=1e-2 30ep (3-seed): 0.360 ± 0.001
- DFA-vanilla 30ep (3-seed): 0.301 ± 0.005 (was 0.308 in old unsourced reporting)
- DFA-shallow baseline (frozen blocks): 0.349 ± 0.002
- BP-trainable no-pen 30ep (3-seed): 0.585 ± 0.001 (was 0.609 in old unsourced reporting)

Penalty rescue is +5.9 pp over matched 30-ep vanilla (mode 1 alleviated) and +1.1 pp over shallow (mode 2 partially alleviated). The remaining gap to BP reflects that mode 2 is only partially fixed: cos +0.155 is real but well below BP's 1.0.

### 4.3 Disambiguating "penalty revealed" vs "penalty created" the alignment

Round 19 disambiguation experiment: we trained vanilla DFA (s42) for 5 epochs, saved a checkpoint after each epoch, and measured deep-layer cos at every checkpoint:

| epoch | ‖g_L‖ | meaningful? | l1 cos | l2 cos | l3 cos | l4 cos |
|---:|---:|:---:|---:|---:|---:|---:|
| 1 | 1.4e-6 | yes | +0.005 | -0.028 | -0.039 | -0.038 |
| 2 | 3.2e-7 | yes | -0.002 | -0.040 | -0.055 | -0.054 |
| 3 | 1.3e-7 | borderline | +0.007 | -0.039 | -0.054 | -0.054 |
| 4 | 6.8e-8 | no | +0.013 | -0.034 | -0.052 | -0.052 |
| 5 | 4.3e-8 | no | +0.016 | -0.036 | -0.055 | -0.055 |

**Even at epoch 1, where ‖g_L‖ is well above the floor (in the meaningful measurement regime), the deep cosines are essentially zero.** Compare to penalized DFA at ep 30: deep cos ~+0.17.

This means the penalty intervention **created** the +0.17 alignment; it did not just make a previously-hidden alignment measurable. The mechanism is plausibly: with no penalty, the local-loss inner product between `f_l(h_l)` and the feedback signal `e_T B_l^T` can be increased indefinitely by inflating ‖f_l‖, so the optimizer pushes parameters in directions uncorrelated with BP grad. With the penalty, ‖f_l‖ is constrained, so the optimizer must instead orient the direction of `f_l` more carefully, which incidentally yields better (partial) alignment with BP grad.

This is the strongest causal evidence we have: **the two modes are mechanistically distinct (round 20 wording: "strongly supports" rather than "confirmed"), and the penalty's role is not just numerical (preventing collapse) but training-trajectory-altering (creating partial alignment that wasn't there in vanilla)**.

Multi-seed lock-in (3 seeds × {ep 1, ep 2}, 24 measurements total) gives deep-layer cosines all in [-0.04, +0.02], 3-seed mean -0.008 ± 0.013 at ep 1, which closes the single-seed-fluke objection.

### 4.4 Capacity-cost control: BP+penalty 2×2

To distinguish the residual depth-utilization gap from "the penalty's intrinsic capacity-regularization cost", we ran end-to-end BP with the same `λ ‖f_l(h_l)‖²` penalty for 30 epochs:

| | no penalty | with penalty |
|---|---:|---:|
| BP | 0.585 ± 0.001 | **0.532 ± 0.006** |
| DFA | 0.301 ± 0.005 | 0.360 ± 0.001 |

(All same architecture, same data, same optimizer family.
3-seed matched 30-epoch values, paper v2.32.)

The penalty has **opposite effects** on BP and DFA: **−5.3 pp** capacity cost on BP (0.585 → 0.532), **+5.9 pp** rescue on DFA (0.301 → 0.360). BP+penalty still clears the DFA-shallow baseline (0.349) by **+18.3 pp**, while DFA+penalty clears it by only +1.1 pp.

**[Round 20 phrasing]**: this is *not* a clean isolation of "credit quality" in a vacuum; it identifies a **lower bound on the residual performance gap under matched architecture, data, optimizer family, and matched penalty, after accounting for the penalty's direct capacity cost in BP**. Stated more cautiously: *"matched penalty controls show that only part of DFA's deficit is attributable to the representational/optimization cost of the penalty itself; a substantial residual remains and is consistent with poorer credit assignment"*.

A counterargument would be that the penalty places BP into a fundamentally better optimization regime unrelated to capacity, but this is unlikely because the penalty *hurts* BP by 5.3 pp while *helping* DFA by 5.9 pp, the opposite of the pattern expected from a generally beneficial regime shift.

### 4.5 Partial alleviation explains the residual gap

The remaining 22.5 pp gap from penalized DFA (0.360) to BP-trainable (0.585) is dominantly explained by the partial credit-quality cost identified above: ~17 pp of the ~22.5 pp residual is the matched-penalty gap (BP+penalty 0.532 vs DFA+penalty 0.360), and the remaining ~5 pp is the penalty's direct capacity cost on BP. The (d) diagnostic margin (penalty +1.1 pp over shallow) is consistent: the deep blocks contribute *some* useful signal (because cos > 0), but the magnitude is small.
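The margins in the 2×2 comparison can be rechecked directly from the matched 30-epoch, 3-seed mean accuracies quoted in §4.2 and §4.4. The snippet below is a small arithmetic sketch in plain Python; the dictionary keys and the helper names `pp` and `margins` are hypothetical labels chosen for this illustration, and only the mean values (not the ± std) are used.

```python
# Matched 30-epoch, 3-seed mean accuracies as quoted in §4.2 and §4.4.
ACC = {
    "bp": 0.585,
    "bp_pen": 0.532,
    "dfa": 0.301,
    "dfa_pen": 0.360,
    "dfa_shallow": 0.349,
}


def pp(a: float, b: float) -> float:
    """Difference a - b in percentage points, rounded to one decimal."""
    return round((a - b) * 100.0, 1)


def margins(acc: dict = ACC) -> dict:
    """Pairwise margins used in the capacity-cost argument."""
    return {
        "penalty_effect_on_bp": pp(acc["bp_pen"], acc["bp"]),
        "penalty_effect_on_dfa": pp(acc["dfa_pen"], acc["dfa"]),
        "bp_pen_over_shallow": pp(acc["bp_pen"], acc["dfa_shallow"]),
        "dfa_pen_over_shallow": pp(acc["dfa_pen"], acc["dfa_shallow"]),
        "bp_over_dfa_pen": pp(acc["bp"], acc["dfa_pen"]),
        "matched_penalty_gap": pp(acc["bp_pen"], acc["dfa_pen"]),
    }


if __name__ == "__main__":
    for name, value in margins().items():
        print(f"{name}: {value:+.1f} pp")
```

On these inputs the matched-penalty gap (`bp_pen` minus `dfa_pen`) plus the magnitude of the penalty's effect on BP sums to the full gap between BP and penalized DFA, which is the decomposition the section relies on.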
### 4.6 Why this framing is paper-cleaner

The new framing has several improvements over the original "scale + direction" claim:

- **Empirically grounded**: both modes are directly measured (not inferred from observable proxies)
- **Honest about measurement**: mode 2 is only measurable in the meaningful regime (i.e., after mode 1 is alleviated), and we say so explicitly
- **Causal control**: the vanilla early-epoch checkpoint sweep disambiguates "penalty revealed" vs "penalty created"
- **Null calibration**: fresh-Bs control rules out measurement artifacts
- **Capacity-cost control**: BP+penalty 2×2 disambiguates capacity loss from credit quality
- **Multi-seed lock-in**: 3 seeds × {ep 1, ep 2} vanilla cosine closes the single-seed-fluke objection
- **Avoids the "two failure modes via (d)" claim**: (d) is now reframed as a depth-utilization measure, not a credit-quality test

### Six independent validations supporting the framing

1. Direct deep-layer cosine measurement on penalized DFA: 3-seed mean +0.155 ± 0.025
2. Null calibration with 20 fresh random Bs: deep cos +0.002 ± 0.022, which confirms the training-Bs +0.155 is real
3. Hypothesis-B disambiguation (vanilla early-epoch ep 1, ‖g‖ in meaningful regime): deep cos -0.008 ± 0.013 across 3 seeds, which confirms the penalty creates (not just reveals) the alignment
4. BP+penalty capacity-cost control: the penalty costs BP only −5.3 pp (0.585 → 0.532); the 17 pp residual gap is consistent with credit-quality cost
5. Multi-seed lock-in: 24 measurements (3 seeds × 2 ep × 4 deep layers) all in [-0.04, +0.02]; single-seed-fluke objection closed
6. **Perturbation correlation cross-metric triangulation**: vanilla deep ρ -0.003 ± 0.005, penalized deep ρ +0.080 ± 0.011 (3 seeds). A different metric (loss-change-based, no eps clamp, sample-level) confirms the same vanilla→penalized transition. Rules out single-metric artifacts on either the cos or the ρ side.

## §5 Pipeline pitfalls catalog (appendix)

Seven evaluation-pipeline bugs we found in our own dogfood codebase.
Each has a reproducer in `protocol/examples/verify_pitfalls*.py`.

1. `tensor.norm(-1)` computes the p = −1 norm of the whole tensor, not the L2 norm along `dim=-1`
2. `F.cosine_similarity` clamps the divisor by `eps=1e-8`, distorting cosines by 10-100× when ‖g‖ is at the floor
3. fp16 mixed precision underflows BP grads at hidden layers (~5×10⁻¹⁰ → 0)
4. Random feedback `Bs` are training-specific; Γ is not invariant to reseeding them
5. Aggregation strategy across layers/samples/batches gives 2.5-3.5× spread on the same data; sign can flip
6. **[Round 18 reframing]** Per-layer Γ structure is hidden by aggregation. Layer-0 dominance is one specific manifestation (ResMLP DFA: l0 +0.42, l1-4 ≈ 0). On ViT-Mini all layers are uniformly near zero. The general lesson is to always report per-layer.
7. **[Round 18 disclosure]** Auxiliary networks (`Bs`, bridge models) not persisted alongside checkpoints can cause post-hoc Γ scripts to silently fall back to `cos(BP, BP) = 1.0` and report perfect alignment. We discovered this in our own pipeline.

## §6 Reference implementation

`protocol/` package: 4 diagnostics, DiagnosticReport class, smoke test, audit table, decision-utility ablation, temporal evolution, sensitivity analyses, reporting template, checklist. ~20 Python files, all syntax-clean. Smoke test passes BP/DFA/EP checkpoint verdicts.

Reproducibility: every result in this paper reproduces from `protocol/examples/*.py` against checkpoints in `results/`.

## Limitations and walk-backs (appendix or §7)

We walked back our own claims multiple times during this work. We report these explicitly as part of the paper:

1. "DFA trains ViT-Mini to 24% accuracy" → walked back to "DFA-frozen-random-blocks ViT also gets 25.7%; the deep blocks are passengers" (codex round 6)
2. "DFA trains ResMLP to 31%" → walked back to "DFA-trainable is 4 pp BELOW DFA-shallow on ResMLP; DFA training the deep blocks actively destroys value" (codex round 8)
3. "Penalty rescues to 36.3% above shallow baseline → second failure mode established by (d)" → softened to "the (d) verdict on penalized DFA depends on the intervention strength" (codex round 18)
4. "Layer 0 always dominates Γ" → softened to "aggregation hides per-layer structure that depends on the architecture; on ResMLP layer 0 dominates, on ViT-Mini all layers are uniformly near zero" (round 18 follow-up)
5. "(b) is causally specific to terminal LN" → softened to "(b) appears restricted to terminal-normalized architectures we audited" (round 18 follow-up)
6. "Two failure modes (scale + direction)" → "one unified failure mode (scale → LN → measurement degeneracy)" (cos +0.17 walk-back, 2026-04-08, before disambiguation experiment)
7. **"One unified failure mode" → "two distinct failure modes (measurement degeneracy + low intrinsic random-feedback alignment quality)"** (round 19 disambiguation, 2026-04-08, after vanilla early-epoch checkpoint sweep showed deep cos ~0 even in meaningful regime)

The walk-back chain converges on a more honest framing each time. Walk-backs 6 and 7 happened on the same day; both are documented in memory and reflected here. The final two-distinct-modes framing is empirically grounded with direct measurement, null calibration, and a causal disambiguation control.

## Open experimental questions

- ViT-Mini 5-method audit (waiting on checkpoint training)
- Multi-seed λ=1e-3 (running)
- Direct deep-block credit measurement on penalized DFA (rerunning at `--epochs 30` with checkpoint save)
- Threshold robustness sweep on additional architectures
- BatchNorm ResNet as a 4th non-LN architecture (deferred; CNN already covers this case)