# Paper draft outline: NeurIPS 2026 Evaluations & Datasets Track

Round 17 + 18 verdict locked-in. This outline reflects ALL the evidence collected as of 2026-04-08, with round 18 language softening applied throughout.

**Working title** (round 17): *Beyond Accuracy and Alignment: A Diagnostic Evaluation Protocol for Feedback Alignment*

**Top-line claim**: standard FA evaluation on modern residual networks (headline accuracy + Γ alignment) is unreliable; we identify two distinct failure modes and provide a 4-diagnostic protocol that detects them.

## §1 Introduction

**Discovery-first hook** (round 16 narrative arc, kept under round 17 E&D framing):
- Modern Feedback Alignment evaluation reports headline accuracy and Γ alignment as evidence that the method is training successfully on a given architecture.
- We show, on standard 4-block pre-LayerNorm ResMLP and ViT-Mini at d=256/d=512 on CIFAR-10, that **both** signals silently fail. DFA "31% accuracy" walks back to "below the architecture-matched random-blocks baseline by 4 pp" (active harm). DFA Γ ≈ 0.10 walks back to "measured against a BP gradient at the numerical floor (~10⁻⁹), with the headline driven entirely by the embedding layer".
- The evaluation failure is not unique to DFA: State Bridge and Credit Bridge show the same pattern. The protocol catches all three on every seed we tested.
- We present the first systematic diagnostic protocol that catches these failures, with a reference implementation, validation across 3 architecture families (ResMLP, ViT-Mini, CNN), and an ablation showing 0/5 methods walked back by the field-standard reporting versus 3/5 by our protocol.

## §2 Audit findings

### 2.1 Audit on 4-block d=256 ResMLP CIFAR-10 (5 methods × 3 seeds)

- BP, EP: trustworthy on all 6 method×seed combinations
- DFA, State Bridge, Credit Bridge: walked back on all 9 method×seed combinations
- Headline numbers (3-seed mean ± std): table from `audit_table_s42_s123_s456.json`

### 2.2 Walk-back details

**DFA**: 0.308 ± 0.014 acc.
Reports as "DFA trains the deep blocks". Reality:
- ‖h_L‖ ~ 4×10⁸ (vs BP ~200, vs random init ~9): catastrophic residual stream growth, 4 OOM
- ‖g_L‖ ~ 5×10⁻¹⁰ (vs BP ~5×10⁻⁵, vs random init ~8×10⁻³): BP gradient at numerical floor, 5 OOM below healthy
- DFA-shallow (no deep blocks): 0.349 ± 0.002, so **DFA-trainable is 4 pp WORSE than DFA-shallow**
- Cosine of `e_T B_l^T` to BP grad, layer 0: +0.42; layers 1-4: ≈ 0
- Headline Γ = 0.10 is entirely the embedding layer

**State Bridge**: 0.205 ± 0.039 acc. Same pattern. Cross-batch direction stability of BP grad = 0.992, the cleanest "drift-dominated reference vector" measurement we have.

**Credit Bridge**: 0.289 ± 0.034 acc. Same pattern.

### 2.3 Internal control (EP)

EP on the same architecture: 0.316 ± 0.038 acc. ‖h_L‖ ~ 5×10³ (bounded), ‖g_L‖ ~ 1×10⁻⁴ (healthy), per-layer cosines small-but-meaningful. **The protocol passes EP** even though EP's headline accuracy is also low. This isolates "the network is in a degenerate measurement regime" from "the method underperforms": they are different failure modes with different evidence. EP is the §2 internal control that validates the protocol does not over-flag.

## §3 The diagnostic protocol

### 3.1 Four diagnostics

(a) **per-layer ‖h_l‖**: max-per-block growth `max_l (‖h_{l+1}‖ / ‖h_l‖)`. Threshold: 50×.

(b) **per-layer ‖g_l‖**: deepest-layer BP gradient norm. Floor: 10⁻⁷.

(c) **cross-batch direction stability**: mean pairwise cosine of normalized BP-grad direction across disjoint minibatches. Drift ceiling: 0.30.

(d) **frozen-blocks baseline**: trained-blocks acc must clear an architecture-matched random-untrained-blocks baseline by ≥ 2 pp.

(a)/(b)/(d) are independently sufficient for binary detection on the cases we audited. (c) is for sub-mode discrimination (drift-dominated vs noise-dominated reference vector); it does not catch new positives.

### 3.2 Decision-utility ablation

S0 (acc only) walks back 0/5 methods. S1 (acc + Γ field-standard) walks back 0/5 methods.
S_full (full protocol) walks back 3/5 (DFA, SB, CB). The protocol's added value is 3 silent failures detected.

### 3.3 Threshold sensitivity

- (a) has a **63× separation gap** between healthy max (EP s123 = 11×) and degenerate min (CB s123 = 694×). Threshold 50× sits cleanly in the gap; verdicts robust to ±50% perturbation.
- (b) has a **24,338× separation gap**. Threshold 10⁻⁷ similarly robust.
- (d) is **NOT robust**: penalized DFA at λ=1e-2 fires at 2 pp, passes at 1 pp; at λ=1e-3 the margin is +2.3 pp (single seed) which passes at 2 pp. The (d) verdict on penalized conditions depends on both threshold and intervention strength.

### 3.4 Temporal validation (when does it fire during training?)

Replayed per-epoch DFA training data through the thresholds. On 4-block d=256 ResMLP across 3 seeds:
- diagnostic (b) first fires at epoch 3-4 of 100
- diagnostic (a) first fires at epoch 8-11
- DFA accuracy at the moment of firing is essentially identical to its final value (the "training" is doing nothing useful for the headline number after epoch 1)
- 96% of compute could be saved by using the protocol as an early-stop criterion

On ViT-Mini (3 seeds): (a) fires at ep 1, (b) fires at ep 2-3. Even faster.

BP never fires on any of 9 BP×architecture×seed conditions.

### 3.5 Cross-architecture validation

3 architecture families × 3 seeds × 2 methods (BP/DFA) = 18 conditions, plus 5-method × 3-seed audit on CNN.

| arch | terminal LN | BP fires? | DFA (a) fires? | DFA (b) fires? |
|---|:---:|:---:|:---:|:---:|
| ResMLP (d=256) | yes | never | yes (ep 8-11) | yes (ep 3-4) |
| ResMLP (d=512) | yes | never | yes (3-seed mean) | yes (3-seed mean) |
| ViT-Mini | yes | never | yes (ep 1) | yes (ep 2-3) |
| StudentNet (synthetic) | **no** | never | yes (ep 14-25) | **NEVER** |
| CNN (BatchNorm) | **no** | never | yes (5-method audit) | **NEVER** |

**[Round 18 softening]** Across the architectures we audited, **diagnostic (b) appears restricted to terminal-LayerNorm architectures**.
In the absence of terminal LN, the BP gradient at hidden layers does not collapse to the numerical floor, even on DFA. This is observational association, not causal identification of LayerNorm; a future architecture without terminal LN where (b) fires would refine the claim.

### 3.6 Sub-mode discrimination via (c)

(c) splits the (a)+(b) failures into two interpretable sub-modes:
- **drift-dominated**: stability ≈ 0.5-1.0. Reference BP grad direction is sample-invariant. (Example: SB s42 stability = 0.992.)
- **noise-dominated**: stability ≈ 0. Reference direction is random per batch. (Example: DFA s42 stability = 0.047.)

The same method falls into different sub-modes on different seeds. (c) is for interpretation; (b) is the primary detector regardless of sub-mode.

## §4 Two failure modes

### 4.1 Mechanism story (the discovery layer)

**Failure mode 1: residual-stream-amplified BP-grad collapse**
- DFA's local block losses have no global scale constraint
- Block parameters grow ~95× rel-delta on ResMLP (vs BP ~2.7×)
- w1·w2 product per block ~5×10⁴; block outputs grow 10⁷-10⁸×
- Residual stream ‖h_L‖ ~ 10⁸ on ResMLP / ViT (4 OOM)
- Terminal LayerNorm Jacobian rescaling drives ‖g_L‖ to ~10⁻¹⁰ (5 OOM below BP)
- Γ becomes a measurement of cosine to a numerical-floor reference vector

This is what diagnostics (a) and (b) detect.

**Failure mode 2: depth utilization**
- Even after the scale pathology is corrected, the deep blocks may not contribute over a random-untrained-blocks baseline
- Diagnostic (d) detects this via the frozen-blocks comparison
- **[Round 18 caveat]** The (d) verdict on penalized DFA depends on the intervention strength (λ): at λ=1e-2 the margin is +1.4 ± 0.05 pp; at λ=1e-3 it is +2.3 pp. There is a **real tradeoff between penalty strength and depth utilization**: stronger penalty contains scale better but kills more depth contribution; weaker penalty preserves depth but keeps more scale pathology.
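
The binary checks behind the two failure modes can be sketched in a few lines. This is a minimal illustration, not the `protocol/` package API: the function names are hypothetical, the thresholds are the ones listed in §3.1, and the inputs are numbers quoted in this outline. It assumes the (d) margin is taken against the DFA-shallow accuracy, as in the margins reported for penalized DFA.

```python
# Thresholds as stated in §3.1 of this outline.
GROWTH_CEILING = 50.0   # (a): max per-block growth ratio ‖h_{l+1}‖ / ‖h_l‖
GRAD_FLOOR = 1e-7       # (b): deepest-layer BP gradient norm
MARGIN_PP = 2.0         # (d): required margin over the baseline, in pp


def fires_a(max_growth_ratio, ceiling=GROWTH_CEILING):
    """(a) fires when the largest per-block growth ratio exceeds the ceiling."""
    return max_growth_ratio > ceiling


def fires_b(deepest_grad_norm, floor=GRAD_FLOOR):
    """(b) fires when the deepest-layer BP gradient norm is below the floor."""
    return deepest_grad_norm < floor


def margin_pp(trained_acc, baseline_acc):
    """Accuracy margin of trained blocks over the baseline, in percentage points."""
    return 100.0 * (trained_acc - baseline_acc)


def fires_d(margin, required_pp=MARGIN_PP):
    """(d) fires when the margin does not clear the required number of pp."""
    return margin < required_pp


SHALLOW_ACC = 0.349  # DFA-shallow accuracy quoted in §2.2 (assumed (d) baseline)

verdicts = {
    "a_EP_s123": fires_a(11.0),        # healthy max growth quoted in §3.3
    "a_CB_s123": fires_a(694.0),       # degenerate min growth quoted in §3.3
    "b_DFA": fires_b(5e-10),           # vanilla DFA ‖g_L‖
    "b_EP": fires_b(1e-4),             # EP ‖g_L‖
    "d_vanilla": fires_d(margin_pp(0.308, SHALLOW_ACC)),
    "d_lam1e-2_at_2pp": fires_d(margin_pp(0.363, SHALLOW_ACC)),
    "d_lam1e-2_at_1pp": fires_d(margin_pp(0.363, SHALLOW_ACC), required_pp=1.0),
    "d_lam1e-3_at_2pp": fires_d(margin_pp(0.372, SHALLOW_ACC)),
}

for name, fired in verdicts.items():
    print(f"{name}: {'FIRES' if fired else 'passes'}")
```

With these inputs, the (d) verdict for λ=1e-2 flips between the 2 pp and 1 pp thresholds while (a) and (b) sit far from theirs, which is the threshold dependence flagged in the caveat above.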
### 4.2 Causal validation: penalty rescue

On 4-block d=256 ResMLP, adding `λ ‖f_l(h_l)‖²` to each DFA local block loss:
- λ=1e-2 (3 seeds): ‖h_L‖ 4×10⁸ → 4×10⁴ (4 OOM rescue), ‖g_L‖ 5×10⁻¹⁰ → ~10⁻⁶ (4 OOM rescue), acc 0.308 → 0.363 (+5.5 pp over vanilla, +1.4 pp over shallow)
- λ=1e-3: similar magnitude rescue, acc 0.372 (+2.3 pp over shallow, single seed; multi-seed verification in progress)

**Round 18 framing**: this **partially dissociates the two putative failure modes by intervention**. The penalty alleviates the scale-related diagnostics (a) and (b) but does not bring depth contribution in line with BP. The two failure modes expose **distinct intervention surfaces**. Full mechanistic separability requires direct deep-block credit-quality measurement on the penalized checkpoint (in progress).

## §5 Pipeline pitfalls catalog (appendix)

7 evaluation-pipeline bugs we found in our own dogfood codebase. Each has a reproducer in `protocol/examples/verify_pitfalls*.py`.

1. `tensor.norm(-1)` is L_{-1} of the whole tensor, not L_2 along dim=-1
2. `F.cosine_similarity` clamps the divisor by `eps=1e-8`, distorting cosines by 10-100× when ‖g‖ is at the floor
3. fp16 mixed precision underflows BP grads at hidden layers (~5×10⁻¹⁰ → 0)
4. Random feedback `Bs` are training-specific; Γ is not invariant to reseeding them
5. Aggregation strategy across layers/samples/batches gives 2.5-3.5× spread on the same data; sign can flip
6. **[Round 18 reframing]** Per-layer Γ structure is hidden by aggregation. Layer-0 dominance is one specific manifestation (ResMLP DFA: l0 +0.42, l1-4 ≈ 0). On ViT-Mini all layers are uniformly near zero. The general lesson is to always report per-layer.
7. **[Round 18 disclosure]** Auxiliary networks (`Bs`, bridge models) not persisted alongside checkpoints can cause post-hoc Γ scripts to silently fall back to `cos(BP, BP) = 1.0` and report perfect alignment. We discovered this in our own pipeline.
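
The eps-clamp pitfall in the catalog above can be illustrated without any framework. The sketch below is a pure-Python emulation, not a call into PyTorch. It assumes each vector norm is clamped from below by `eps` before dividing; the exact clamping rule may differ between library versions. The vectors `g` and `r` are hypothetical, chosen so that `g` has norm 5×10⁻¹⁰ and points in exactly the same direction as `r`.

```python
import math


def exact_cosine(u, v):
    """Cosine similarity in float64 with no clamping of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clamped_cosine(u, v, eps=1e-8):
    """Cosine with each norm clamped from below by eps (assumed clamping rule)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = max(math.sqrt(sum(a * a for a in u)), eps)
    norm_v = max(math.sqrt(sum(b * b for b in v)), eps)
    return dot / (norm_u * norm_v)


# Hypothetical vectors: g sits at the ~5e-10 norm floor, r is a healthy-scale
# vector pointing in the same direction, so the true cosine is 1.
g = [3e-10, 4e-10]
r = [3.0, 4.0]

true_cos = exact_cosine(g, r)
reported_cos = clamped_cosine(g, r)
distortion = true_cos / reported_cos

print(f"true cosine     = {true_cos:.3f}")
print(f"clamped cosine  = {reported_cos:.3f}")
print(f"distortion      = {distortion:.1f}x")
```

Under this assumption the clamp replaces the 5×10⁻¹⁰ norm with 10⁻⁸, shrinking the reported cosine by the ratio of the two. One way to sidestep this is to normalize the vectors in float64 before taking the dot product.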
## §6 Reference implementation

`protocol/` package: 4 diagnostics, DiagnosticReport class, smoke test, audit table, decision-utility ablation, temporal evolution, sensitivity analyses, reporting template, checklist. ~20 Python files, all syntax-clean. Smoke test passes BP/DFA/EP checkpoint verdicts.

Reproducibility: every result in this paper reproduces from `protocol/examples/*.py` against checkpoints in `results/`.

## Limitations and walk-backs (appendix or §7)

We walked back our own claims multiple times during this work. Reporting these explicitly as part of the paper:

1. "DFA trains ViT-Mini to 24% accuracy" → walked back to "DFA-frozen-random-blocks ViT also gets 25.7%; the deep blocks are passengers" (codex round 6)
2. "DFA trains ResMLP to 31%" → walked back to "DFA-trainable is 4 pp BELOW DFA-shallow on ResMLP; DFA training the deep blocks actively destroys value" (codex round 8)
3. "Penalty rescues to 36.3% above shallow baseline → second failure mode established by (d)" → softened to "the (d) verdict on penalized DFA depends on the intervention strength; the two failure modes are partially dissociated by intervention" (codex round 18)
4. "Layer 0 always dominates Γ" → softened to "aggregation hides per-layer structure that depends on the architecture; on ResMLP layer 0 dominates, on ViT-Mini all layers are uniformly near zero" (round 18 follow-up)
5. "(b) is causally specific to terminal LN" → softened to "(b) appears restricted to terminal-normalized architectures we audited" (round 18 follow-up)

## Open experimental questions

- ViT-Mini 5-method audit (waiting on checkpoint training)
- Multi-seed λ=1e-3 (running)
- Direct deep-block credit measurement on penalized DFA (rerunning at --epochs 30 with checkpoint save)
- Threshold robustness sweep on additional architectures
- BatchNorm ResNet as a 4th non-LN architecture (deferred: CNN already covers this case)