# Paper draft outline

NeurIPS 2026 Evaluations & Datasets Track. Round 17 + 18 verdict locked-in.
This outline reflects ALL the evidence collected as of 2026-04-08, with
round 18 language softening applied throughout.

**Working title** (round 17): *Beyond Accuracy and Alignment: A Diagnostic
Evaluation Protocol for Feedback Alignment*

**Top-line claim**: standard FA evaluation on modern residual networks
(headline accuracy + Γ alignment) is unreliable; we identify two distinct
failure modes and provide a 4-diagnostic protocol that detects them.

## §1 Introduction

**Discovery-first hook** (round 16 narrative arc, kept under round 17 E&D framing):
- Modern Feedback Alignment evaluation reports headline accuracy and Γ alignment as evidence that the method is training successfully on a given architecture.
- We show, on standard 4-block pre-LayerNorm ResMLP and ViT-Mini at d=256/d=512 on CIFAR-10, that **both** signals silently fail. DFA "31% accuracy" walks back to "below the architecture-matched random-blocks baseline by 4 pp" (active harm). DFA Γ ≈ 0.10 walks back to "measured against a BP gradient at the numerical floor (~10⁻⁹), with the headline driven entirely by the embedding layer".
- The evaluation failure is not unique to DFA: State Bridge and Credit Bridge show the same pattern. The protocol catches all three on every seed we tested.
- We present the first systematic diagnostic protocol that catches these failures, with a reference implementation, validation across 3 architecture families (ResMLP, ViT-Mini, CNN), and an ablation showing 0/5 walked back by the field-standard reporting versus 3/5 by our protocol.

## §2 Audit findings

### 2.1 Audit on 4-block d=256 ResMLP CIFAR-10 (5 methods × 3 seeds)

- BP, EP: trustworthy on all 6 method×seed combinations (2 methods × 3 seeds)
- DFA, State Bridge, Credit Bridge: walked back on all 9 method×seed combinations (3 methods × 3 seeds)
- Headline numbers (3-seed mean ± std): table from `audit_table_s42_s123_s456.json`

### 2.2 Walk-back details

**DFA**: 0.308 ± 0.014 acc. Reports as "DFA trains the deep blocks". Reality:
- ‖h_L‖ ~ 4×10⁸ (vs BP ~200, vs random init ~9): catastrophic residual stream growth, 4 OOM
- ‖g_L‖ ~ 5×10⁻¹⁰ (vs BP ~5×10⁻⁵, vs random init ~8×10⁻³): BP gradient at numerical floor, 5 OOM below healthy
- DFA-shallow (no deep blocks): 0.349 ± 0.002 — **DFA-trainable is 4 pp WORSE than DFA-shallow**
- Cosine of `e_T B_l^T` to BP grad, layer 0: +0.42; layers 1-4: ≈ 0
- Headline Γ = 0.10 is entirely the embedding layer

**State Bridge**: 0.205 ± 0.039 acc. Same pattern. Cross-batch direction stability of BP grad = 0.992 — the cleanest "drift-dominated reference vector" measurement we have.

**Credit Bridge**: 0.289 ± 0.034 acc. Same pattern.

### 2.3 Internal control (EP)

EP on the same architecture: 0.316 ± 0.038 acc. ‖h_L‖ ~ 5×10³ (bounded), ‖g_L‖ ~ 1×10⁻⁴ (healthy), per-layer cosines small-but-meaningful. **The protocol passes EP** even though EP's headline accuracy is also low. This isolates "the network is in a degenerate measurement regime" from "the method underperforms" — they are different failure modes with different evidence. EP is the §2 internal control that validates the protocol does not over-flag.

## §3 The diagnostic protocol

### 3.1 Four diagnostics

- (a) **per-layer ‖h_l‖**: max-per-block growth `max_l (‖h_{l+1}‖ / ‖h_l‖)`. Threshold: 50×.
- (b) **per-layer ‖g_l‖**: deepest-layer BP gradient norm. Floor: 10⁻⁷.
- (c) **cross-batch direction stability**: mean pairwise cosine of normalized BP-grad direction across disjoint minibatches. Drift ceiling: 0.30.
- (d) **frozen-blocks baseline**: trained-blocks acc must clear an architecture-matched random-untrained-blocks baseline by ≥ 2 pp.

(a)/(b)/(d) are independently sufficient for binary detection on the cases we audited. (c) is for sub-mode discrimination (drift-dominated vs noise-dominated reference vector); it does not catch new positives.
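
A minimal sketch of how (a), (b) and (d) might combine into a binary verdict, using the thresholds listed above. The names `Verdict`, `max_block_growth` and `diagnose` are hypothetical and are not the API of the `protocol/` package. The per-layer norm profile in the usage lines is made up for illustration (only its last entry echoes the ‖h_L‖ quoted in §2.2); the gradient norm and accuracies are the DFA numbers from §2.2.

```python
from dataclasses import dataclass

# Thresholds as stated in this outline.
GROWTH_THRESHOLD = 50.0  # (a) max per-block growth of ||h_l||
GRAD_FLOOR = 1e-7        # (b) deepest-layer BP gradient norm
MARGIN_PP = 2.0          # (d) required margin over the frozen-blocks baseline


@dataclass
class Verdict:  # hypothetical container, not the package's DiagnosticReport
    a_fires: bool
    b_fires: bool
    d_fires: bool

    @property
    def walked_back(self) -> bool:
        # Any single diagnostic firing is treated as sufficient.
        return self.a_fires or self.b_fires or self.d_fires


def max_block_growth(h_norms):
    """(a): max over l of ||h_{l+1}|| / ||h_l|| for a list of per-layer norms."""
    return max(nxt / cur for cur, nxt in zip(h_norms, h_norms[1:]))


def diagnose(h_norms, g_deep_norm, acc_trained, acc_frozen):
    """Apply (a), (b), (d); accuracies are fractions in [0, 1]."""
    margin_pp = 100.0 * (acc_trained - acc_frozen)
    return Verdict(
        a_fires=max_block_growth(h_norms) > GROWTH_THRESHOLD,
        b_fires=g_deep_norm < GRAD_FLOOR,
        d_fires=margin_pp < MARGIN_PP,
    )


# Illustrative (made-up) per-layer profile ending at ||h_L|| ~ 4e8.
dfa = diagnose([9.0, 2e3, 6e5, 1e7, 4e8], g_deep_norm=5e-10,
               acc_trained=0.308, acc_frozen=0.349)
print(dfa.walked_back)  # True: all three diagnostics fire on these inputs
```

Diagnostic (c) is left out of the verdict on purpose, since it is described above as a sub-mode discriminator rather than a detector.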

### 3.2 Decision-utility ablation

S0 (acc only) walks back 0/5 methods. S1 (acc + Γ field-standard) walks back 0/5 methods. S_full (full protocol) walks back 3/5 (DFA, SB, CB). The protocol's added value is the 3 silent failures it detects that S0 and S1 both miss.

### 3.3 Threshold sensitivity

- (a) has a **63× separation gap** between healthy max (EP s123 = 11×) and degenerate min (CB s123 = 694×). Threshold 50× sits cleanly in the gap; verdicts robust to ±50% perturbation.
- (b) has a **24,338× separation gap**. Threshold 10⁻⁷ similarly robust.
- (d) is **NOT robust**: penalized DFA at λ=1e-2 fires at 2 pp, passes at 1 pp; at λ=1e-3 the margin is +2.3 pp (single seed) which passes at 2 pp. The (d) verdict on penalized conditions depends on both threshold and intervention strength.
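
The arithmetic behind these sensitivity statements can be checked directly. The sketch below assumes that "robust to ±50%" means the perturbed threshold stays strictly inside the gap between the healthy maximum and the degenerate minimum; the helper names are hypothetical.

```python
def separation_gap(healthy_max, degenerate_min):
    """Ratio between the smallest degenerate value and the largest healthy one."""
    return degenerate_min / healthy_max


def robust_to_perturbation(threshold, healthy_max, degenerate_min, rel=0.5):
    """True if threshold*(1 +/- rel) still separates healthy from degenerate."""
    lo, hi = threshold * (1 - rel), threshold * (1 + rel)
    return healthy_max < lo and hi < degenerate_min


def d_fires(margin_pp, threshold_pp):
    """(d) fires when the margin over the frozen-blocks baseline is too small."""
    return margin_pp < threshold_pp


# (a): EP s123 = 11x (healthy max) vs CB s123 = 694x (degenerate min).
gap_a = separation_gap(11.0, 694.0)                   # about 63.09
robust_a = robust_to_perturbation(50.0, 11.0, 694.0)  # 25x..75x stays in the gap

# (d): penalized DFA at lambda=1e-2 has a +1.4 pp margin, so the verdict
# flips between the 2 pp and 1 pp thresholds.
flips = d_fires(1.4, 2.0) != d_fires(1.4, 1.0)
```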

### 3.4 Temporal validation (when does it fire during training?)

Replayed per-epoch DFA training data through the thresholds. On 4-block d=256 ResMLP across 3 seeds:
- diagnostic (b) first fires at epoch 3-4 of 100
- diagnostic (a) first fires at epoch 8-11
- DFA accuracy at the moment of firing is essentially identical to its final value (the "training" is doing nothing useful for the headline number after epoch 1)
- Using the first firing of (b) as an early-stop criterion (epoch 3-4 of 100) would save roughly 96% of compute
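
The first-firing epoch and the compute-saving figure can be reproduced with a few lines. This is a sketch under two assumptions: epochs are counted from 1, and stopping at the firing epoch still pays for that epoch. The trace `g_trace` is made up for illustration and is not measured data.

```python
def first_fire_epoch(values, fires):
    """Return the 1-indexed epoch at which fires(value) first holds, else None."""
    for epoch, value in enumerate(values, start=1):
        if fires(value):
            return epoch
    return None


def compute_saved(fire_epoch, total_epochs):
    """Fraction of epochs skipped if training stops at fire_epoch."""
    return 1.0 - fire_epoch / total_epochs


# Made-up per-epoch deepest-layer gradient norms (illustration only).
g_trace = [8e-3, 3e-5, 4e-7, 6e-8, 2e-9]
epoch_b = first_fire_epoch(g_trace, lambda g: g < 1e-7)  # diagnostic (b) floor
saved = compute_saved(epoch_b, total_epochs=100)
print(epoch_b, round(saved, 2))  # 4 0.96
```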

On ViT-Mini (3 seeds): (a) fires at ep 1, (b) fires at ep 2-3. Even faster.

BP never fires on any of 9 BP×architecture×seed conditions.

### 3.5 Cross-architecture validation

3 architecture families × 3 seeds × 2 methods (BP/DFA) = 18 conditions, plus 5-method × 3-seed audit on CNN.

| arch | terminal LN | BP fires? | DFA (a) fires? | DFA (b) fires? |
|---|:---:|:---:|:---:|:---:|
| ResMLP (d=256) | yes | never | yes (ep 8-11) | yes (ep 3-4) |
| ResMLP (d=512) | yes | never | yes (3-seed mean) | yes (3-seed mean) |
| ViT-Mini | yes | never | yes (ep 1) | yes (ep 2-3) |
| StudentNet (synthetic) | **no** | never | yes (ep 14-25) | **NEVER** |
| CNN (BatchNorm) | **no** | never | yes (5-method audit) | **NEVER** |

**[Round 18 softening]** Across the architectures we audited, **diagnostic (b) appears restricted to terminal-LayerNorm architectures**. In the absence of terminal LN, the BP gradient at hidden layers does not collapse to the numerical floor, even on DFA. This is observational association, not causal identification of LayerNorm — a future architecture without terminal LN where (b) fires would refine the claim.

### 3.6 Sub-mode discrimination via (c)

(c) splits the (a)+(b) failures into two interpretable sub-modes:
- **drift-dominated**: stability ≈ 0.5-1.0. Reference BP grad direction is sample-invariant. (Example: SB s42 stability = 0.992.)
- **noise-dominated**: stability ≈ 0. Reference direction is random per batch. (Example: DFA s42 stability = 0.047.)

The same method falls into different sub-modes on different seeds. (c) is for interpretation; (b) is the primary detector regardless of sub-mode.
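
A minimal sketch of the stability statistic in (c), assuming each minibatch gradient has been flattened to a plain list of floats. The function name `direction_stability` is hypothetical. Vectors are normalized explicitly, with no epsilon clamp, so that floor-level norms do not distort the cosines.

```python
import math
from itertools import combinations


def _unit(v):
    """Normalize v to unit L2 norm; refuse exactly-zero vectors."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("zero gradient: direction undefined")
    return [x / n for x in v]


def direction_stability(batch_grads):
    """Mean pairwise cosine between gradient directions from disjoint batches."""
    if len(batch_grads) < 2:
        raise ValueError("need at least two minibatches")
    units = [_unit(g) for g in batch_grads]
    cosines = [sum(a * b for a, b in zip(u, v))
               for u, v in combinations(units, 2)]
    return sum(cosines) / len(cosines)


# Same direction at very different (floor-level) scales: stability near 1.
drift_like = direction_stability([[1e-10, 2e-10], [3e-10, 6e-10], [2e-9, 4e-9]])
# Mutually orthogonal directions: stability 0.
noise_like = direction_stability([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
```

The two toy inputs correspond to the drift-dominated and noise-dominated extremes described above; real measurements fall between them.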

## §4 Two failure modes

### 4.1 Mechanism story (the discovery layer)

**Failure mode 1: residual-stream-amplified BP-grad collapse**
- DFA's local block losses have no global scale constraint
- Block parameters grow ~95× rel-delta on ResMLP (vs BP ~2.7×)
- w1·w2 product per block ~5×10⁴; block outputs grow 10⁷-10⁸×
- Residual stream ‖h_L‖ ~ 10⁸ on ResMLP / ViT (4 OOM)
- Terminal LayerNorm Jacobian rescaling drives ‖g_L‖ to ~10⁻¹⁰ (5 OOM below BP)
- Γ becomes a measurement of cosine to a numerical-floor reference vector

This is what diagnostics (a) and (b) detect.
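
The LayerNorm rescaling step can be illustrated numerically. The sketch below is a toy model, not the measurement pipeline: it assumes a LayerNorm without affine parameters and with eps = 0, and uses the standard closed-form input gradient. Scaling the input by a factor c leaves the normalized output unchanged and shrinks the input gradient by 1/c, which is the direction of the effect described in the list above.

```python
import math


def layernorm_input_grad(h, upstream):
    """Gradient of sum_i upstream[i] * LN(h)[i] w.r.t. h (no affine, eps = 0)."""
    d = len(h)
    mu = sum(h) / d
    sigma = math.sqrt(sum((x - mu) ** 2 for x in h) / d)
    h_hat = [(x - mu) / sigma for x in h]
    mean_up = sum(upstream) / d
    mean_up_hat = sum(u * z for u, z in zip(upstream, h_hat)) / d
    # dL/dh_j = (u_j - mean(u) - h_hat_j * mean(u * h_hat)) / sigma
    return [(u - mean_up - z * mean_up_hat) / sigma
            for u, z in zip(upstream, h_hat)]


def l2(v):
    return math.sqrt(sum(x * x for x in v))


# Arbitrary toy vectors (illustration only).
h = [0.5, -1.0, 2.0, 0.25]
upstream = [1.0, 0.0, -0.5, 0.3]

g_small = l2(layernorm_input_grad(h, upstream))
g_large = l2(layernorm_input_grad([1e4 * x for x in h], upstream))
ratio = g_small / g_large  # close to 1e4: 4 OOM of input growth costs 4 OOM of gradient
```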

**Failure mode 2: depth utilization**
- Even after the scale pathology is corrected, the deep blocks may not contribute over a random-untrained-blocks baseline
- Diagnostic (d) detects this via the frozen-blocks comparison
- **[Round 18 caveat]** The (d) verdict on penalized DFA depends on the intervention strength (λ): at λ=1e-2 the margin is +1.4 ± 0.05 pp; at λ=1e-3 it is +2.3 pp. There is a **real tradeoff between penalty strength and depth utilization** — stronger penalty contains scale better but kills more depth contribution; weaker penalty preserves depth but keeps more scale pathology.

### 4.2 Causal validation: penalty rescue

On 4-block d=256 ResMLP, adding `λ ‖f_l(h_l)‖²` to each DFA local block loss:
- λ=1e-2 (3 seeds): ‖h_L‖ 4×10⁸ → 4×10⁴ (4 OOM rescue), ‖g_L‖ 5×10⁻¹⁰ → ~10⁻⁶ (4 OOM rescue), acc 0.308 → 0.363 (+5.5 pp over vanilla, +1.4 pp over shallow)
- λ=1e-3: similar magnitude rescue, acc 0.372 (+2.3 pp over shallow, single seed; multi-seed verification in progress)
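
A sketch of the penalized local loss and of the accuracy margins quoted above. The reduction used for ‖f_l(h_l)‖² (here a plain sum of squares over one sample's block output, with no batch or feature averaging) is an assumption, and both function names are hypothetical.

```python
def penalized_local_loss(local_loss, block_output, lam):
    """local_loss + lam * ||f_l(h_l)||^2 for one sample's block output."""
    return local_loss + lam * sum(x * x for x in block_output)


def margin_pp(acc, baseline):
    """Accuracy difference in percentage points, rounded to one decimal."""
    return round(100.0 * (acc - baseline), 1)


# Toy block output with squared norm 25 (illustration only).
loss = penalized_local_loss(1.0, [3.0, 4.0], lam=1e-2)  # 1.0 + 0.01 * 25

# Margins from the accuracies reported in this section.
over_vanilla = margin_pp(0.363, 0.308)       # 5.5
over_shallow_1e2 = margin_pp(0.363, 0.349)   # 1.4
over_shallow_1e3 = margin_pp(0.372, 0.349)   # 2.3
```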

**Round 18 framing**: this **partially dissociates the two putative failure modes by intervention**. The penalty alleviates the scale-related diagnostics (a) and (b) but does not bring depth contribution in line with BP. The two failure modes expose **distinct intervention surfaces**.

Full mechanistic separability requires direct deep-block credit-quality measurement on the penalized checkpoint (in progress).

## §5 Pipeline pitfalls catalog (appendix)

7 evaluation-pipeline bugs we found in our own dogfood codebase. Each has a reproducer in `protocol/examples/verify_pitfalls*.py`.

1. `tensor.norm(-1)` computes the p = −1 norm over the whole tensor (the first positional argument is `p`), not the L_2 norm along `dim=-1`
2. `F.cosine_similarity` clamps the divisor by `eps=1e-8`, distorting cosines by 10-100× when ‖g‖ is at the floor
3. fp16 mixed precision underflows BP grads at hidden layers (~5×10⁻¹⁰ → 0)
4. Random feedback `Bs` are training-specific; Γ is not invariant to reseeding them
5. Aggregation strategy across layers/samples/batches gives 2.5-3.5× spread on the same data; sign can flip
6. **[Round 18 reframing]** Per-layer Γ structure is hidden by aggregation. Layer-0 dominance is one specific manifestation (ResMLP DFA: l0 +0.42, l1-4 ≈ 0). On ViT-Mini all layers are uniformly near zero. The general lesson is to always report per-layer.
7. **[Round 18 disclosure]** Auxiliary networks (`Bs`, bridge models) not persisted alongside checkpoints can cause post-hoc Γ scripts to silently fall back to `cos(BP, BP) = 1.0` and report perfect alignment. We discovered this in our own pipeline.
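
The first three pitfalls can be illustrated with pure-Python stand-ins. These are emulations, not calls into PyTorch and not the reproducers in `protocol/examples/`. In particular, `clamped_cosine` assumes the clamp is applied to the product of the two norms; the exact placement of the clamp in a given PyTorch version is an assumption here. The vectors are toy values chosen so that ‖g‖ = 5×10⁻¹⁰.

```python
import math
import struct


def p_norm(values, p):
    """Vector p-norm: (sum |x|^p)^(1/p)."""
    return sum(abs(x) ** p for x in values) ** (1.0 / p)


# Pitfall 1: passing -1 as the norm order is not "L2 along the last dim".
row = [3.0, 4.0]
l2_norm = p_norm(row, 2)         # 5.0
l_minus1_norm = p_norm(row, -1)  # 12/7, about 1.714


def exact_cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def clamped_cosine(a, b, eps=1e-8):
    """Cosine with the divisor clamped from below by eps (assumed clamp form)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / max(na * nb, eps)


# Pitfall 2: a gradient at the numerical floor against a unit vector.
g_floor = [3e-10, 4e-10]
ref = [1.0, 0.0]
true_cos = exact_cosine(g_floor, ref)       # 0.6
clamped_cos = clamped_cosine(g_floor, ref)  # 0.03, a 20x distortion


def to_fp16(x):
    """Round-trip a Python float through IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Pitfall 3: floor-level gradients underflow to zero in fp16.
underflowed = to_fp16(5e-10)  # 0.0
healthy = to_fp16(5e-5)       # nonzero (subnormal in fp16)
```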

## §6 Reference implementation

`protocol/` package: 4 diagnostics, DiagnosticReport class, smoke test, audit table, decision-utility ablation, temporal evolution, sensitivity analyses, reporting template, checklist. ~20 Python files, all syntax-clean. Smoke test passes BP/DFA/EP checkpoint verdicts.

Reproducibility: every result in this paper reproduces from `protocol/examples/*.py` against checkpoints in `results/`.

## Limitations and walk-backs (appendix or §7)

We walked back our own claims multiple times during this work. Reporting these explicitly as part of the paper:

1. "DFA trains ViT-Mini to 24% accuracy" → walked back to "DFA-frozen-random-blocks ViT also gets 25.7%; the deep blocks are passengers" (codex round 6)
2. "DFA trains ResMLP to 31%" → walked back to "DFA-trainable is 4 pp BELOW DFA-shallow on ResMLP; DFA training the deep blocks actively destroys value" (codex round 8)
3. "Penalty rescues to 36.3% above shallow baseline → second failure mode established by (d)" → softened to "the (d) verdict on penalized DFA depends on the intervention strength; the two failure modes are partially dissociated by intervention" (codex round 18)
4. "Layer 0 always dominates Γ" → softened to "aggregation hides per-layer structure that depends on the architecture; on ResMLP layer 0 dominates, on ViT-Mini all layers are uniformly near zero" (round 18 follow-up)
5. "(b) is causally specific to terminal LN" → softened to "(b) appears restricted to the terminal-LayerNorm architectures among those we audited" (round 18 follow-up)

## Open experimental questions

- ViT-Mini 5-method audit (waiting on checkpoint training)
- Multi-seed λ=1e-3 (running)
- Direct deep-block credit measurement on penalized DFA (rerunning at --epochs 30 with checkpoint save)
- Threshold robustness sweep on additional architectures
- BatchNorm ResNet as a 4th non-LN architecture (deferred — CNN already covers this case)