# Paper draft outline

NeurIPS 2026 Evaluations & Datasets Track. Round 17 + 18 verdict locked-in.
This outline reflects ALL the evidence collected as of 2026-04-08, with
round 18 language softening applied throughout.

**Working title** (round 17): *Beyond Accuracy and Alignment: A Diagnostic
Evaluation Protocol for Feedback Alignment*

**Top-line claim**: standard FA evaluation on modern residual networks
(headline accuracy + Γ alignment) is unreliable; we identify two distinct
failure modes and provide a 4-diagnostic protocol that detects them.

## §1 Introduction

**Discovery-first hook** (round 16 narrative arc, kept under round 17 E&D framing):
- Modern Feedback Alignment evaluation reports headline accuracy and Γ alignment as evidence that the method is training successfully on a given architecture.
- We show, on standard 4-block pre-LayerNorm ResMLP and ViT-Mini at d=256/d=512 on CIFAR-10, that **both** signals silently fail. DFA "31% accuracy" walks back to "below the architecture-matched random-blocks baseline by 4 pp" (active harm). DFA Γ ≈ 0.10 walks back to "measured against a BP gradient at the numerical floor (~10⁻⁹), with the headline driven entirely by the embedding layer".
- The evaluation failure is not unique to DFA: State Bridge and Credit Bridge show the same pattern. The protocol catches all three on every seed we tested.
- We present the first systematic diagnostic protocol that catches these failures, with a reference implementation, validation across 3 architecture families (ResMLP, ViT-Mini, CNN), and an ablation showing 0/5 walked back by the field-standard reporting versus 3/5 by our protocol.

## §2 Audit findings

### 2.1 Audit on 4-block d=256 ResMLP CIFAR-10 (5 methods × 3 seeds)

- BP, EP: trustworthy on all 6 method×seed combinations (2 methods × 3 seeds)
- DFA, State Bridge, Credit Bridge: walked back on all 9 combinations
- Headline numbers (3-seed mean ± std): table from `audit_table_s42_s123_s456.json`

### 2.2 Walk-back details

**DFA** (3-seed values from paper v2.32):
- 100ep test acc: 0.306 ± 0.006 (Table 1, audit). Reports as "DFA trains the deep blocks". Reality:
- ‖h_L‖ ~ 5×10⁸ (3-seed mean) (vs BP ~200, vs random init ~9): catastrophic residual stream growth, roughly 6 OOM above BP
- ‖g_L‖ ~ 4×10⁻¹⁰ (3-seed mean) (vs BP ~4×10⁻⁴, vs random init ~10⁻³): BP gradient at numerical floor, 5+ OOM below healthy
- DFA-shallow (no deep blocks): 0.349 ± 0.002 — **DFA-trainable is 4.3 pp WORSE than DFA-shallow**
- Cosine of `e_T B_l^T` to BP grad, layer 0 across seeds: +0.42, +0.44, +0.42; layers 1-4: ≈ 0 (3-seed deep mean -0.008 ± 0.013 at ep 1)
- Headline Γ = 0.10 is entirely the embedding layer

**State Bridge** (3-seed): 0.205 ± 0.032 acc (Table 1). Same pattern. Cross-batch direction stability of BP grad: s42=0.992 (cleanest "drift-dominated reference vector" measurement); 3-seed mean 0.529.

**Credit Bridge** (3-seed): 0.289 ± 0.026 acc (Table 1). Same pattern. Cross-batch stability 3-seed mean 0.373.

### 2.3 Internal control (EP)

EP on the same architecture: 0.316 ± 0.038 acc. ‖h_L‖ ~ 5×10³ (bounded), ‖g_L‖ ~ 1×10⁻⁴ (healthy), per-layer cosines small-but-meaningful. **The protocol passes EP** even though EP's headline accuracy is also low. This isolates "the network is in a degenerate measurement regime" from "the method underperforms" — they are different failure modes with different evidence. EP is the §2 internal control that validates the protocol does not over-flag.

## §3 The diagnostic protocol

### 3.1 Four diagnostics

(a) **per-layer ‖h_l‖**: max-per-block growth `max_l (‖h_{l+1}‖ / ‖h_l‖)`. Threshold: 50×.
(b) **per-layer ‖g_l‖**: deepest-layer BP gradient norm. Floor: 10⁻⁷.
(c) **cross-batch direction stability**: mean pairwise cosine of normalized BP-grad direction across disjoint minibatches. Drift ceiling: 0.30.
(d) **frozen-blocks baseline**: trained-blocks acc must clear an architecture-matched random-untrained-blocks baseline by ≥ 2 pp.

(a)/(b)/(d) are independently sufficient for binary detection on the cases we audited. (c) is for sub-mode discrimination (drift-dominated vs noise-dominated reference vector); it does not catch new positives.
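
The four checks can be sketched as a single verdict function. This is a minimal illustration, not the `protocol/` implementation: the names `Verdict` and `diagnose` are hypothetical, the inputs are assumed to be precomputed scalars, and reading the 0.30 drift ceiling as "above means drift-dominated" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Thresholds quoted in the list above. Every name here is hypothetical.
GROWTH_MAX = 50.0      # (a) max per-block growth of ||h_l||
GRAD_FLOOR = 1e-7      # (b) floor on the deepest-layer BP gradient norm
DRIFT_CEILING = 0.30   # (c) assumed reading: above this is drift-dominated
MARGIN_PP = 2.0        # (d) required margin over the random-blocks baseline


@dataclass
class Verdict:
    a_fires: bool
    b_fires: bool
    d_fires: bool
    sub_mode: Optional[str]  # only set when (a) or (b) fires

    @property
    def walked_back(self) -> bool:
        # (a), (b), (d) are each treated as sufficient on their own.
        return self.a_fires or self.b_fires or self.d_fires


def diagnose(h_norms: Sequence[float], g_deepest: float, stability: float,
             acc: float, baseline_acc: float) -> Verdict:
    # (a) largest ratio between consecutive residual-stream norms
    growth = max(nxt / cur for cur, nxt in zip(h_norms, h_norms[1:]))
    a = growth > GROWTH_MAX
    b = g_deepest < GRAD_FLOOR
    d = (acc - baseline_acc) * 100.0 < MARGIN_PP
    # (c) only labels the sub-mode; it never turns a pass into a fail
    sub = None
    if a or b:
        sub = "drift-dominated" if stability > DRIFT_CEILING else "noise-dominated"
    return Verdict(a, b, d, sub)


# Illustrative inputs only: the per-layer norms are made up, not audit data.
healthy = diagnose([9.0, 20.0, 60.0, 120.0, 200.0], 4e-4, 0.10, 0.585, 0.349)
degenerate = diagnose([9.0, 9e2, 9e4, 9e6, 5e8], 4e-10, 0.047, 0.306, 0.349)
print(healthy.walked_back, degenerate.walked_back, degenerate.sub_mode)
```

Keeping (c) out of `walked_back` mirrors the statement that it discriminates sub-modes but does not catch new positives.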

### 3.2 Decision-utility ablation

S0 (acc only) walks back 0/5 methods. S1 (acc + Γ field-standard) walks back 0/5 methods. S_full (full protocol) walks back 3/5 (DFA, SB, CB). The protocol's added value is 3 silent failures detected.

### 3.3 Threshold sensitivity

- (a) has a **63× separation gap** between healthy max (EP s123 = 11×) and degenerate min (CB s123 = 694×). Threshold 50× sits cleanly in the gap; verdicts robust to ±50% perturbation.
- (b) has a **24,338× separation gap**. Threshold 10⁻⁷ similarly robust.
- (d) is **NOT robust**: penalized DFA at λ=1e-2 fires at 2 pp, passes at 1 pp; at λ=1e-3 the margin is +2.3 pp (single seed) which passes at 2 pp. The (d) verdict on penalized conditions depends on both threshold and intervention strength.
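
The robustness statements above reduce to simple arithmetic, sketched here. The helper names are hypothetical, and the 1.1 pp margin for λ=1e-2 is taken as 0.360 − 0.349 from the matched 30-epoch values in §4.2, which is an assumption about which margin the (d) bullet refers to.

```python
def separation_gap(healthy_max: float, degenerate_min: float) -> float:
    # Ratio between the worst healthy value and the mildest degenerate one.
    return degenerate_min / healthy_max


def verdicts_stable(threshold: float, healthy_max: float,
                    degenerate_min: float, rel: float = 0.5) -> bool:
    # Verdicts are unchanged if the whole perturbed threshold range
    # still lies strictly between the two populations.
    lo, hi = threshold * (1.0 - rel), threshold * (1.0 + rel)
    return healthy_max < lo and hi < degenerate_min


def d_fires(margin_pp: float, threshold_pp: float) -> bool:
    # (d) fires when trained blocks fail to clear the baseline by the margin.
    return margin_pp < threshold_pp


# (a): healthy max 11x (EP s123), degenerate min 694x (CB s123)
gap_a = separation_gap(11.0, 694.0)
stable_a = verdicts_stable(50.0, 11.0, 694.0)

# (d): margins over DFA-shallow in percentage points
margin_lam_1e2 = 1.1   # assumed: 0.360 - 0.349
margin_lam_1e3 = 2.3   # single seed, as quoted above
print(round(gap_a), stable_a)
print(d_fires(margin_lam_1e2, 2.0), d_fires(margin_lam_1e2, 1.0))
print(d_fires(margin_lam_1e3, 2.0))
```

The same margin flips verdict between a 1 pp and a 2 pp threshold, which is the sense in which (d) is not robust.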

### 3.4 Temporal validation (when does it fire during training?)

Replayed per-epoch DFA training data through the thresholds. On 4-block d=256 ResMLP across 3 seeds:
- diagnostic (b) first fires at epoch 3-4 of 100
- diagnostic (a) first fires at epoch 8-11
- DFA accuracy at the moment of firing is essentially identical to its final value (the "training" is doing nothing useful for the headline number after epoch 1)
- 96% of compute could be saved by using the protocol as an early-stop criterion

On ViT-Mini (3 seeds): (a) fires at ep 1, (b) fires at ep 2-3. Even faster.

BP never fires on any of 9 BP×architecture×seed conditions.
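
A replay of this kind can be sketched in a few lines. The function names are hypothetical, and the short trace reuses the five vanilla DFA s42 ‖g_L‖ values listed in the §4.3 table as a stand-in for the logged per-epoch data; treating that 5-epoch run as representative of the 100-epoch run is an assumption.

```python
from typing import Callable, Optional, Sequence

GRAD_FLOOR = 1e-7  # diagnostic (b) floor


def first_firing_epoch(per_epoch: Sequence[float],
                       fires: Callable[[float], bool]) -> Optional[int]:
    # Epochs are 1-indexed; returns None if the diagnostic never fires.
    for epoch, value in enumerate(per_epoch, start=1):
        if fires(value):
            return epoch
    return None


def compute_saved(first_epoch: int, total_epochs: int) -> float:
    # Fraction of the training budget left unused when stopping at first_epoch.
    return 1.0 - first_epoch / total_epochs


# ||g_L|| at epochs 1..5 for vanilla DFA s42 (values from the section 4.3 table)
g_trace = [1.4e-6, 3.2e-7, 1.3e-7, 6.8e-8, 4.3e-8]
epoch_b = first_firing_epoch(g_trace, lambda g: g < GRAD_FLOOR)
print(epoch_b, compute_saved(epoch_b, 100))
```

On this trace (b) first fires at epoch 4, which matches the epoch 3-4 range and the 96% figure quoted above.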

### 3.5 Cross-architecture validation

3 architecture families × 3 seeds × 2 methods (BP/DFA) = 18 conditions, plus 5-method × 3-seed audit on CNN.

| arch | terminal LN | BP fires? | DFA (a) fires? | DFA (b) fires? |
|---|:---:|:---:|:---:|:---:|
| ResMLP (d=256) | yes | never | yes (ep 8-11) | yes (ep 3-4) |
| ResMLP (d=512) | yes | never | yes (3-seed mean) | yes (3-seed mean) |
| ViT-Mini | yes | never | yes (ep 1) | yes (ep 2-3) |
| StudentNet (synthetic) | **no** | never | yes (ep 14-25) | **NEVER** |
| CNN (BatchNorm) | **no** | never | yes (5-method audit) | **NEVER** |

**[Round 18 softening]** Across the architectures we audited, **diagnostic (b) appears restricted to terminal-LayerNorm architectures**. In the absence of terminal LN, the BP gradient at hidden layers does not collapse to the numerical floor, even on DFA. This is observational association, not causal identification of LayerNorm — a future architecture without terminal LN where (b) fires would refine the claim.

### 3.6 Sub-mode discrimination via (c)

(c) splits the (a)+(b) failures into two interpretable sub-modes:
- **drift-dominated**: stability ≈ 0.5-1.0. Reference BP grad direction is sample-invariant. (Example: SB s42 stability = 0.992.)
- **noise-dominated**: stability ≈ 0. Reference direction is random per batch. (Example: DFA s42 stability = 0.047.)

The same method falls into different sub-modes on different seeds. (c) is for interpretation; (b) is the primary detector regardless of sub-mode.
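
As a rough sketch of how (c) could be computed, the snippet below takes one flattened BP-gradient vector per disjoint minibatch and averages the pairwise cosines. Function names are hypothetical, the toy vectors are not audit data, and treating 0.30 as the boundary between the two sub-modes is an assumption based on the drift ceiling in §3.1.

```python
import math
from itertools import combinations
from typing import List, Sequence


def unit(v: Sequence[float]) -> List[float]:
    # Rescale by the largest magnitude before normalising, so gradients
    # near 1e-10 do not lose precision when squared.
    scale = max(abs(x) for x in v)
    if scale == 0.0:
        raise ValueError("zero gradient has no direction")
    scaled = [x / scale for x in v]
    norm = math.sqrt(sum(x * x for x in scaled))
    return [x / norm for x in scaled]


def direction_stability(batch_grads: Sequence[Sequence[float]]) -> float:
    # Mean pairwise cosine between unit gradient directions.
    units = [unit(g) for g in batch_grads]
    pairs = list(combinations(units, 2))
    return sum(sum(a * b for a, b in zip(u, v)) for u, v in pairs) / len(pairs)


def sub_mode(stability: float, ceiling: float = 0.30) -> str:
    return "drift-dominated" if stability > ceiling else "noise-dominated"


# Toy floor-level gradients: nearly identical directions vs orthogonal ones.
drifting = [[1.0e-10, 2.0e-10, 0.0],
            [1.1e-10, 1.9e-10, 0.1e-10],
            [0.9e-10, 2.1e-10, -0.1e-10]]
noisy = [[1e-10, 0.0, 0.0], [0.0, 1e-10, 0.0], [0.0, 0.0, 1e-10]]
print(sub_mode(direction_stability(drifting)), sub_mode(direction_stability(noisy)))
```

Normalising each vector before taking inner products avoids any eps clamp, which matters because the reference gradients here sit near the numerical floor.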

## §4 Two distinct failure modes (round 19 final framing)

### 4.1 The two modes

**Mode 1 — measurement degeneracy via terminal LayerNorm gradient cancellation**
- DFA's local block losses have no global scale constraint
- Block parameters grow ~95× rel-delta on ResMLP (vs BP ~2.7×)
- Residual stream ‖h_L‖ ~ 10⁸ on ResMLP / ViT (4 OOM)
- Terminal LayerNorm Jacobian rescaling drives ‖g_L‖ to ~10⁻¹⁰ (below the F.cosine_similarity eps clamp and well below the 10⁻⁷ floor)
- The cosine alignment metric Γ is then computed against a numerical-floor reference vector — the value is mathematically defined but uninterpretable

**This is caught by diagnostic (b)**: ‖g_L‖ floor check.

**Mode 2 — low intrinsic credit-direction quality of random feedback**
- Even when the BP gradient at hidden layers is in the meaningful regime (vanilla DFA at epoch 1, ‖g_L‖ ≈ 6×10⁻⁷), the deep-layer cosine of `e_T B_l^T` to BP grad is **essentially zero** (l1-l4 in [-0.055, +0.016] across vanilla ep 1-5)
- This is not measurement noise: the same checkpoint shows l0 cos = +0.42, well above noise. The deep blocks specifically have zero alignment.
- Random feedback projects the error signal in directions that are largely uncorrelated with the per-layer BP gradient, on the deep blocks of pre-LayerNorm residual networks

**This is caught by direct per-layer cosine measurement** (in the meaningful regime).

### 4.2 Causal validation: penalty rescue partially alleviates BOTH modes

On 4-block d=256 ResMLP, adding `λ ‖f_l(h_l)‖²` to each DFA local block loss (3 seeds at λ=1e-2, 30 epochs):

**Mode 1 alleviation** (residual stream + BP grad):
- ‖h_L‖: 4×10⁸ → ~4×10⁴ (4 OOM rescue, 3 seeds)
- ‖g_L‖: 5×10⁻¹⁰ → ~10⁻⁶ (4 OOM rescue, 3 seeds)
- Diagnostic (b) passes after the penalty

**Mode 2 alleviation** (deep credit alignment):
- Vanilla deep-layer cos (l1-l4): essentially zero
- Penalized deep-layer cos (l1-l4) 3-seed mean: **+0.155 ± 0.025**
- Null calibration with 20 fresh random Bs: deep cos = +0.002 ± 0.022 — confirms the +0.155 is real signal that the network adapted to its specific training Bs
- Diagnostic measurement of mode 2 is now in a partially-alleviated regime, but +0.155 is still much less than BP's self-cos of 1.0
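
The null calibration in the list above can be sketched as follows. This is a toy version under stated assumptions: the fresh feedback matrices are drawn i.i.d. standard normal, the error vector `e` and reference gradient `g` are random stand-ins rather than checkpoint data, and all function names are hypothetical.

```python
import math
import random
import statistics


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def feedback_signal(e, B):
    # e has length c, B has d rows of length c; returns e B^T (length d).
    return [sum(ei * bi for ei, bi in zip(e, row)) for row in B]


def null_calibration(e, g, n_draws=20, seed=0):
    rng = random.Random(seed)
    d, c = len(g), len(e)
    cosines = []
    for _ in range(n_draws):
        # Assumption: fresh feedback matrices are i.i.d. standard normal.
        B = [[rng.gauss(0.0, 1.0) for _ in range(c)] for _ in range(d)]
        cosines.append(cosine(feedback_signal(e, B), g))
    return statistics.mean(cosines), statistics.stdev(cosines)


rng = random.Random(123)
e = [rng.gauss(0.0, 1.0) for _ in range(10)]     # toy output error
g = [rng.gauss(0.0, 1.0) for _ in range(256)]    # toy reference gradient
null_mean, null_std = null_calibration(e, g)
print(f"null deep cos = {null_mean:+.3f} ± {null_std:.3f}")
```

A measured cosine is then read against this null band: a value several null standard deviations above the null mean is unlikely to come from an arbitrary random feedback matrix.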

**Both modes are partially alleviated, neither fully** (matched 30-ep 3-seed values, paper v2.32):
- DFA+pen acc lam=1e-2 30ep (3-seed): 0.360 ± 0.001
- DFA-vanilla 30ep (3-seed): 0.301 ± 0.005 (was 0.308 in old unsourced reporting)
- DFA-shallow baseline (frozen blocks): 0.349 ± 0.002
- BP-trainable no-pen 30ep (3-seed): 0.585 ± 0.001 (was 0.609 in old unsourced reporting)

Penalty rescue is +5.9 pp over matched 30-ep vanilla (mode 1 alleviated) and +1.1 pp over shallow (mode 2 partially alleviated). The remaining gap to BP reflects that mode 2 is only partially fixed: cos +0.155 is real but well below BP's 1.0.

### 4.3 Disambiguating "penalty revealed" vs "penalty created" the alignment

Round 19 disambiguation experiment: trained vanilla DFA s42 for 5 epochs and saved checkpoints at each. Measured deep-layer cos at every checkpoint:

| epoch | ‖g_L‖ | meaningful? | l1 cos | l2 cos | l3 cos | l4 cos |
|---:|---:|:---:|---:|---:|---:|---:|
| 1 | 1.4e-6 | yes | +0.005 | -0.028 | -0.039 | -0.038 |
| 2 | 3.2e-7 | yes | -0.002 | -0.040 | -0.055 | -0.054 |
| 3 | 1.3e-7 | borderline | +0.007 | -0.039 | -0.054 | -0.054 |
| 4 | 6.8e-8 | no | +0.013 | -0.034 | -0.052 | -0.052 |
| 5 | 4.3e-8 | no | +0.016 | -0.036 | -0.055 | -0.055 |

**Even at epoch 1, where ‖g_L‖ is well above the floor (in the meaningful measurement regime), the deep cosines are essentially zero.** Compare to penalized DFA at ep 30: deep cos ~+0.17.

This means the penalty intervention **created** the +0.17 alignment, it did not just make a previously-hidden alignment measurable. The mechanism is plausibly: with no penalty, the inner product `<f_l, e_T B_l^T>` can be increased indefinitely by inflating ‖f_l‖, so the optimizer pushes parameters in directions uncorrelated with BP grad. With the penalty, ‖f_l‖ is constrained, so the optimizer must instead orient the direction of `f_l` more carefully, which incidentally yields better (partial) alignment with BP grad.
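
The norm-inflation argument can be illustrated with a toy objective. The form `-<f, t> + λ‖f‖²` is an assumption made for illustration (the actual DFA local block loss is not specified here), `t` stands in for `e_T B_l^T`, and all names are hypothetical.

```python
import math


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return inner(u, v) / math.sqrt(inner(u, u) * inner(v, v))


def local_objective(f, t, lam):
    # Assumed toy form: reward correlation with t, penalise squared norm.
    return -inner(f, t) + lam * inner(f, f)


def scaled(f, k):
    return [k * x for x in f]


f = [1.0, 0.0]   # toy block output direction
t = [0.6, 0.8]   # toy feedback target, standing in for e_T B_l^T

# Without a penalty, inflating the norm keeps lowering the objective
# while the cosine to t never changes.
unpenalised = [local_objective(scaled(f, k), t, lam=0.0) for k in (1, 10, 100)]
cosines = [cosine(scaled(f, k), t) for k in (1, 10, 100)]

# With lam > 0 the best scale along a fixed direction is finite:
# d/ds [-s<f,t> + lam s^2 <f,f>] = 0  =>  s* = <f,t> / (2 lam <f,f>)
lam = 1e-2
s_star = inner(f, t) / (2.0 * lam * inner(f, f))
at_optimum = local_objective(scaled(f, s_star), t, lam)
print(unpenalised, cosines, s_star, at_optimum)
```

Once the scale is pinned near `s_star`, the only remaining way to lower this toy objective is to rotate `f` toward `t`, which is the direction-orienting pressure described in the paragraph above.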

This is the strongest causal evidence we have: **the two modes are mechanistically distinct (round 20 wording: "strongly supports" rather than "confirmed"), and the penalty's role is not just numerical (preventing collapse) but training-trajectory-altering (creating partial alignment that wasn't there in vanilla)**. Multi-seed lock-in (3 seeds × {ep 1, ep 2}, 24 measurements total) gives deep-layer cosines all in [-0.04, +0.02], 3-seed mean -0.008 ± 0.013 at ep 1 — closing the single-seed-fluke objection.

### 4.4 Capacity-cost control: BP+penalty 2×2

To distinguish the residual depth-utilization gap from "the penalty's intrinsic capacity-regularization cost", we ran end-to-end BP with the same `λ ‖f_l(h_l)‖²` penalty for 30 epochs:

| | no penalty | with penalty |
|---|---:|---:|
| BP | 0.585 ± 0.001 | **0.532 ± 0.006** |
| DFA | 0.301 ± 0.005 | 0.360 ± 0.001 |

(All same architecture, same data, same optimizer family. 3-seed matched 30-epoch values, paper v2.32.)
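
The differences implied by the four cells of this table, together with the DFA-shallow baseline of 0.349 from §4.2, can be checked directly. The snippet is a plain arithmetic sketch; the variable names are hypothetical.

```python
# Accuracies from the 2x2 table, stored as integers in units of 0.1 pp
# so the differences are exact.
ACC = {
    ("BP", "none"): 585, ("BP", "pen"): 532,
    ("DFA", "none"): 301, ("DFA", "pen"): 360,
}
SHALLOW = 349  # DFA-shallow baseline, 0.349


def pp(tenths: int) -> float:
    # Convert tenths of a percentage point to percentage points.
    return tenths / 10.0


penalty_effect_bp = ACC["BP", "pen"] - ACC["BP", "none"]       # -53
penalty_effect_dfa = ACC["DFA", "pen"] - ACC["DFA", "none"]    # +59
margin_bp_pen = ACC["BP", "pen"] - SHALLOW                     # 183
margin_dfa_pen = ACC["DFA", "pen"] - SHALLOW                   # 11
gap_to_bp = ACC["BP", "none"] - ACC["DFA", "pen"]              # 225
residual_after_cost = ACC["BP", "pen"] - ACC["DFA", "pen"]     # 172

print(pp(penalty_effect_bp), pp(penalty_effect_dfa))
print(pp(margin_bp_pen), pp(margin_dfa_pen))
print(pp(gap_to_bp), pp(residual_after_cost))
```

The gap from penalized DFA to unpenalized BP splits exactly into the penalty's cost on BP plus the matched-penalty residual (22.5 = 5.3 + 17.2 pp).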

The penalty has **opposite effects** on BP and DFA: **−5.3 pp** capacity cost on BP (0.585 → 0.532), **+5.9 pp** rescue on DFA (0.301 → 0.360). BP+penalty still clears the DFA-shallow baseline (0.349) by **+18.3 pp**, while DFA+penalty clears it by only +1.1 pp.

**[Round 20 phrasing]**: this is *not* a clean isolation of "credit quality" in a vacuum; it identifies a **lower bound on the residual performance gap under matched architecture, data, optimizer family, and matched penalty, after accounting for the penalty's direct capacity cost in BP**. Stated more cautiously: *"matched penalty controls show that only part of DFA's deficit is attributable to the representational/optimization cost of the penalty itself; a substantial residual remains and is consistent with poorer credit assignment"*.

A counterargument would be that the penalty places BP into a fundamentally better optimization regime unrelated to capacity, but this is unlikely because the penalty *hurts* BP by 5.3 pp while *helping* DFA by 5.9 pp, the opposite pattern expected from a generally beneficial regime shift.

### 4.5 Partial alleviation explains the residual gap

The remaining 22.5 pp gap from penalized DFA (0.360) to BP-trainable (0.585) is dominantly explained by the partial credit-quality cost identified above (~17 pp of the ~22.5 pp residual; the other ~5 pp matches the penalty's capacity cost on BP). The (d) diagnostic margin (penalty +1.1 pp over shallow) is consistent: the deep blocks contribute *some* useful signal (because cos > 0), but the magnitude is small.

### 4.6 Why this framing is paper-cleaner

The new framing has several improvements over the original "scale + direction" claim:
- **Empirically grounded**: both modes are directly measured (not inferred from observable proxies)
- **Honest about measurement**: mode 2 is only measurable in the meaningful regime (i.e., after mode 1 is alleviated), and we say so explicitly
- **Causal control**: the vanilla early-epoch checkpoint sweep disambiguates "penalty revealed" vs "penalty created"
- **Null calibration**: fresh-Bs control rules out measurement artifacts
- **Capacity-cost control**: BP+penalty 2×2 disambiguates capacity loss from credit quality
- **Multi-seed lock-in**: 3 seeds × {ep 1, ep 2} vanilla cosine closes the single-seed-fluke objection
- **Avoids the "two failure modes via (d)" claim**: (d) is now reframed as a depth-utilization measure, not a credit-quality test

### Six independent validations supporting the framing

1. Direct deep-layer cosine measurement on penalized DFA: 3-seed mean +0.155 ± 0.025
2. Null calibration with 20 fresh random Bs: deep cos +0.002 ± 0.022, which confirms the training-Bs +0.155 is real
3. Hypothesis-B disambiguation (vanilla early-epoch ep 1, ‖g‖ in meaningful regime): deep cos -0.008 ± 0.013 across 3 seeds — confirms penalty creates (not just reveals) the alignment
4. BP+penalty capacity-cost control: penalty has only -5.3 pp BP cost; the 17 pp residual gap is consistent with credit-quality cost
5. Multi-seed lock-in: 24 measurements (3 seeds × 2 ep × 4 deep layers) all in [-0.04, +0.02] — single-seed-fluke objection closed
6. **Perturbation correlation cross-metric triangulation**: vanilla deep ρ -0.003 ± 0.005, penalized deep ρ +0.080 ± 0.011 (3 seeds). Different metric (loss-change-based, no eps clamp, sample-level) confirms the same vanilla→penalized transition. Rules out single-metric artifacts on either cos or ρ side.

## §5 Pipeline pitfalls catalog (appendix)

7 evaluation-pipeline bugs we found in our own dogfood codebase. Each has a reproducer in `protocol/examples/verify_pitfalls*.py`.

1. `tensor.norm(-1)` is L_{-1} of the whole tensor, not L_2 along dim=-1
2. `F.cosine_similarity` clamps the divisor by `eps=1e-8`, distorting cosines by 10-100× when ‖g‖ is at the floor
3. fp16 mixed precision underflows BP grads at hidden layers (~5×10⁻¹⁰ → 0)
4. Random feedback `Bs` are training-specific; Γ is not invariant to reseeding them
5. Aggregation strategy across layers/samples/batches gives 2.5-3.5× spread on the same data; sign can flip
6. **[Round 18 reframing]** Per-layer Γ structure is hidden by aggregation. Layer-0 dominance is one specific manifestation (ResMLP DFA: l0 +0.42, l1-4 ≈ 0). On ViT-Mini all layers are uniformly near zero. The general lesson is to always report per-layer.
7. **[Round 18 disclosure]** Auxiliary networks (`Bs`, bridge models) not persisted alongside checkpoints can cause post-hoc Γ scripts to silently fall back to `cos(BP, BP) = 1.0` and report perfect alignment. We discovered this in our own pipeline.
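
Pitfalls 1 and 2 can be illustrated without any tensor library. The sketch below is pure Python and is not one of the `verify_pitfalls*.py` reproducers: `p_norm` shows what a norm of order −1 computes, and `cosine_clamped` is a stand-in that floors the product of norms at `eps`. The exact clamping form used by a given PyTorch version is an assumption here, and all function names are hypothetical.

```python
import math


def p_norm(v, p):
    # Generic p-norm: (sum |x|^p)^(1/p). With p = -1 this is what a call
    # that passes -1 as the norm order computes, not an L2 norm over an axis.
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def cosine_clamped(u, v, eps=1e-8):
    # Stand-in for an eps-clamped cosine: the divisor is floored at eps.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / max(nu * nv, eps)


def cosine_rescaled(u, v):
    # Normalise each vector by its largest magnitude first, so no clamp
    # is needed even when a norm sits near 1e-10.
    su, sv = max(abs(a) for a in u), max(abs(b) for b in v)
    uu, vv = [a / su for a in u], [b / sv for b in v]
    dot = sum(a * b for a, b in zip(uu, vv))
    return dot / math.sqrt(sum(a * a for a in uu) * sum(b * b for b in vv))


row = [3.0, 4.0]
print(p_norm(row, -1), p_norm(row, 2))

g = [3e-10, 4e-10]   # toy gradient with norm 5e-10
b = [0.6, 0.8]       # same direction, unit norm
print(cosine_clamped(g, b), cosine_rescaled(g, b))
```

In this toy case the clamped cosine comes out 20× too small for two perfectly aligned vectors, which falls inside the 10-100× distortion range quoted in item 2.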

## §6 Reference implementation

`protocol/` package: 4 diagnostics, DiagnosticReport class, smoke test, audit table, decision-utility ablation, temporal evolution, sensitivity analyses, reporting template, checklist. ~20 Python files, all syntax-clean. Smoke test passes BP/DFA/EP checkpoint verdicts.

Reproducibility: every result in this paper reproduces from `protocol/examples/*.py` against checkpoints in `results/`.

## Limitations and walk-backs (appendix or §7)

We walked back our own claims multiple times during this work. Reporting these explicitly as part of the paper:

1. "DFA trains ViT-Mini to 24% accuracy" → walked back to "DFA-frozen-random-blocks ViT also gets 25.7%; the deep blocks are passengers" (codex round 6)
2. "DFA trains ResMLP to 31%" → walked back to "DFA-trainable is 4 pp BELOW DFA-shallow on ResMLP; DFA training the deep blocks actively destroys value" (codex round 8)
3. "Penalty rescues to 36.3% above shallow baseline → second failure mode established by (d)" → softened to "the (d) verdict on penalized DFA depends on the intervention strength" (codex round 18)
4. "Layer 0 always dominates Γ" → softened to "aggregation hides per-layer structure that depends on the architecture; on ResMLP layer 0 dominates, on ViT-Mini all layers are uniformly near zero" (round 18 follow-up)
5. "(b) is causally specific to terminal LN" → softened to "(b) appears restricted to terminal-normalized architectures we audited" (round 18 follow-up)
6. "Two failure modes (scale + direction)" → "one unified failure mode (scale → LN → measurement degeneracy)" (cos +0.17 walk-back, 2026-04-08, before disambiguation experiment)
7. **"One unified failure mode" → "two distinct failure modes (measurement degeneracy + low intrinsic random-feedback alignment quality)"** (round 19 disambiguation, 2026-04-08, after vanilla early-epoch checkpoint sweep showed deep cos ~0 even in meaningful regime)

The walk-back chain converges on a more honest framing each time. Walk-backs 6 and 7 happened on the same day; both are documented in memory and reflected here. The final two-distinct-modes framing is empirically grounded with direct measurement, null calibration, and a causal disambiguation control.

## Open experimental questions

- ViT-Mini 5-method audit (waiting on checkpoint training)
- Multi-seed lam=1e-3 (running)
- Direct deep-block credit measurement on penalized DFA (rerunning at --epochs 30 with checkpoint save)
- Threshold robustness sweep on additional architectures
- BatchNorm ResNet as a 4th non-LN architecture (deferred — CNN already covers this case)