protocol/PAPER_OUTLINE.md



# Paper draft outline

NeurIPS 2026 Evaluations & Datasets Track. Round 17 + 18 verdict locked-in.
This outline reflects ALL the evidence collected as of 2026-04-08, with
round 18 language softening applied throughout.

**Working title** (round 17): *Beyond Accuracy and Alignment: A Diagnostic
Evaluation Protocol for Feedback Alignment*

**Top-line claim**: standard FA evaluation on modern residual networks
(headline accuracy + Γ alignment) is unreliable; we identify two distinct
failure modes and provide a 4-diagnostic protocol that detects them.

## §1 Introduction

**Discovery-first hook** (round 16 narrative arc, kept under round 17 E&D framing):
- Modern Feedback Alignment evaluation reports headline accuracy and Γ alignment as evidence that the method is training successfully on a given architecture.
- We show, on standard 4-block pre-LayerNorm ResMLP and ViT-Mini at d=256/d=512 on CIFAR-10, that **both** signals silently fail. DFA "31% accuracy" walks back to "below the architecture-matched random-blocks baseline by 4 pp" (active harm). DFA Γ ≈ 0.10 walks back to "measured against a BP gradient at the numerical floor (~10⁻¹⁰), with the headline driven entirely by the embedding layer".
- The evaluation failure is not unique to DFA: State Bridge and Credit Bridge show the same pattern. The protocol catches all three on every seed we tested.
- We present the first systematic diagnostic protocol that catches these failures, with a reference implementation, validation across 3 architecture families (ResMLP, ViT-Mini, CNN), and an ablation in which field-standard reporting walks back 0/5 methods versus 3/5 under our protocol.

## §2 Audit findings

### 2.1 Audit on 4-block d=256 ResMLP CIFAR-10 (5 methods × 3 seeds)

- BP, EP: trustworthy on all 6 method×seed combinations (2 methods × 3 seeds)
- DFA, State Bridge, Credit Bridge: walked back on all 9 combinations
- Headline numbers (3-seed mean ± std): table from `audit_table_s42_s123_s456.json`

### 2.2 Walk-back details

**DFA** (3-seed values from paper v2.32):
- 100ep test acc: 0.306 ± 0.006 (Table 1, audit). Field-standard reporting reads this as "DFA trains the deep blocks". The diagnostics show otherwise:
- ‖h_L‖ ~ 5×10⁸ (3-seed mean) (vs BP ~200, vs random init ~9): catastrophic residual stream growth, roughly 6 OOM above BP
- ‖g_L‖ ~ 4×10⁻¹⁰ (3-seed mean) (vs BP ~4×10⁻⁴, vs random init ~10⁻³): BP gradient at numerical floor, 5+ OOM below healthy
- DFA-shallow (no deep blocks): 0.349 ± 0.002 — **DFA-trainable is 4.3 pp WORSE than DFA-shallow**
- Cosine of `e_T B_l^T` to BP grad, layer 0 across seeds: +0.42, +0.44, +0.42; layers 1-4: ≈ 0 (3-seed deep mean -0.008 ± 0.013 at ep 1)
- Headline Γ = 0.10 is entirely the embedding layer

**State Bridge** (3-seed): 0.205 ± 0.032 acc (Table 1). Same pattern. Cross-batch direction stability of BP grad: s42=0.992 (cleanest "drift-dominated reference vector" measurement); 3-seed mean 0.529.

**Credit Bridge** (3-seed): 0.289 ± 0.026 acc (Table 1). Same pattern. Cross-batch stability 3-seed mean 0.373.

### 2.3 Internal control (EP)

EP on the same architecture: 0.316 ± 0.038 acc. ‖h_L‖ ~ 5×10³ (bounded), ‖g_L‖ ~ 1×10⁻⁴ (healthy), per-layer cosines small-but-meaningful. **The protocol passes EP** even though EP's headline accuracy is also low. This isolates "the network is in a degenerate measurement regime" from "the method underperforms" — they are different failure modes with different evidence. EP is the §2 internal control that validates the protocol does not over-flag.

## §3 The diagnostic protocol

### 3.1 Four diagnostics

(a) **per-layer ‖h_l‖**: max-per-block growth `max_l (‖h_{l+1}‖ / ‖h_l‖)`. Threshold: 50×.
(b) **per-layer ‖g_l‖**: deepest-layer BP gradient norm. Floor: 10⁻⁷.
(c) **cross-batch direction stability**: mean pairwise cosine of normalized BP-grad direction across disjoint minibatches. Drift ceiling: 0.30.
(d) **frozen-blocks baseline**: trained-blocks acc must clear an architecture-matched random-untrained-blocks baseline by ≥ 2 pp.

(a)/(b)/(d) are independently sufficient for binary detection on the cases we audited. (c) is for sub-mode discrimination (drift-dominated vs noise-dominated reference vector); it does not catch new positives.
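
The three binary checks can be sketched as plain threshold tests. This is a minimal illustration, not the `protocol/` package's API: the names `Verdict`, `max_block_growth`, and `diagnose` are hypothetical, and only the thresholds (50×, 10⁻⁷, 2 pp) come from the list above.

```python
from dataclasses import dataclass

# Thresholds as stated in §3.1; everything else in this sketch is illustrative.
GROWTH_MAX = 50.0   # (a) maximum allowed per-block growth ratio
GRAD_FLOOR = 1e-7   # (b) floor on the deepest-layer BP gradient norm
MARGIN_PP = 2.0     # (d) required margin over the frozen-blocks baseline


@dataclass
class Verdict:
    a_fires: bool
    b_fires: bool
    d_fires: bool

    @property
    def walked_back(self) -> bool:
        # Any single diagnostic firing is enough to walk back the headline.
        return self.a_fires or self.b_fires or self.d_fires


def max_block_growth(h_norms):
    """max_l ||h_{l+1}|| / ||h_l|| over consecutive layers (needs >= 2 entries)."""
    return max(nxt / cur for cur, nxt in zip(h_norms, h_norms[1:]))


def diagnose(h_norms, g_deepest, acc_trained, acc_frozen):
    """Apply diagnostics (a), (b), and (d) to one checkpoint summary."""
    margin_pp = 100.0 * (acc_trained - acc_frozen)
    return Verdict(
        a_fires=max_block_growth(h_norms) > GROWTH_MAX,
        b_fires=g_deepest < GRAD_FLOOR,
        d_fires=margin_pp < MARGIN_PP,
    )


# Hypothetical per-layer norms; the gradient norm and accuracies are the
# DFA values quoted in §2.2 (4e-10, 0.306 trained vs 0.349 shallow).
dfa_like = diagnose([9.0, 9e2, 9e4, 9e6, 5e8], 4e-10, 0.306, 0.349)
print(dfa_like.walked_back)
```

Diagnostic (c) is left out on purpose, since the text treats it as an interpretation aid rather than a detector.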

### 3.2 Decision-utility ablation

S0 (acc only) walks back 0/5 methods. S1 (acc + Γ field-standard) walks back 0/5 methods. S_full (full protocol) walks back 3/5 (DFA, SB, CB). The protocol's added value is 3 silent failures detected.

### 3.3 Threshold sensitivity

- (a) has a **63× separation gap** between healthy max (EP s123 = 11×) and degenerate min (CB s123 = 694×). Threshold 50× sits cleanly in the gap; verdicts robust to ±50% perturbation.
- (b) has a **24,338× separation gap**. Threshold 10⁻⁷ similarly robust.
- (d) is **NOT robust**: penalized DFA at λ=1e-2 fires at 2 pp, passes at 1 pp; at λ=1e-3 the margin is +2.3 pp (single seed) which passes at 2 pp. The (d) verdict on penalized conditions depends on both threshold and intervention strength.
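
The arithmetic behind these sensitivity claims can be reproduced from the numbers quoted in this outline. The sketch below is illustrative only; the helper names are hypothetical, and the (d) margin uses the λ=1e-2 accuracies from §4.2 (0.360 penalized, 0.349 shallow).

```python
def separation_gap(healthy_max, degenerate_min):
    """Ratio between the worst healthy value and the mildest degenerate one."""
    return degenerate_min / healthy_max


def robust_to_perturbation(threshold, healthy_max, degenerate_min, rel=0.5):
    """True if the threshold stays inside the gap when scaled by (1 ± rel)."""
    low, high = threshold * (1.0 - rel), threshold * (1.0 + rel)
    return healthy_max < low and high < degenerate_min


# (a): EP s123 = 11x (healthy max), CB s123 = 694x (degenerate min).
print(round(separation_gap(11.0, 694.0)))          # 63
print(robust_to_perturbation(50.0, 11.0, 694.0))   # True: 25x..75x stays in the gap

# (d): margin of penalized DFA over DFA-shallow at lambda = 1e-2.
margin_pp = 100.0 * (0.360 - 0.349)                # about 1.1 pp
print(margin_pp < 2.0, margin_pp < 1.0)            # True False: fires at 2 pp, passes at 1 pp
```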

### 3.4 Temporal validation (when does it fire during training?)

Replayed per-epoch DFA training data through the thresholds. On 4-block d=256 ResMLP across 3 seeds:
- diagnostic (b) first fires at epoch 3-4 of 100
- diagnostic (a) first fires at epoch 8-11
- DFA accuracy at the moment of firing is essentially identical to its final value (the "training" is doing nothing useful for the headline number after epoch 1)
- 96% of compute could be saved by using the protocol as an early-stop criterion

On ViT-Mini (3 seeds): (a) fires at ep 1, (b) fires at ep 2-3. Even faster.

BP never fires on any of 9 BP×architecture×seed conditions.
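
Replaying a per-epoch trace through a threshold amounts to finding the first epoch where the check fires. A minimal sketch, with hypothetical helper names, reuses the five s42 ‖g_L‖ values from the §4.3 checkpoint table as the trace for diagnostic (b):

```python
def first_fire_epoch(values, fires):
    """Return the 1-indexed epoch where `fires(value)` is first true, else None."""
    for epoch, value in enumerate(values, start=1):
        if fires(value):
            return epoch
    return None


def compute_saved(fire_epoch, total_epochs):
    """Fraction of the training budget left unused if training stops at `fire_epoch`."""
    return 1.0 - fire_epoch / total_epochs


# Vanilla DFA s42, deepest-layer BP gradient norm at epochs 1..5 (§4.3 table).
g_trace = [1.4e-6, 3.2e-7, 1.3e-7, 6.8e-8, 4.3e-8]

epoch_b = first_fire_epoch(g_trace, lambda g: g < 1e-7)
print(epoch_b)                      # 4
print(compute_saved(epoch_b, 100))  # about 0.96
```

The same helper covers diagnostic (a) by passing a growth-ratio trace and `lambda r: r > 50.0`.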

### 3.5 Cross-architecture validation

3 architecture families × 3 seeds × 2 methods (BP/DFA) = 18 conditions, plus 5-method × 3-seed audit on CNN.

| arch | terminal LN | BP fires? | DFA (a) fires? | DFA (b) fires? |
|---|:---:|:---:|:---:|:---:|
| ResMLP (d=256) | yes | never | yes (ep 8-11) | yes (ep 3-4) |
| ResMLP (d=512) | yes | never | yes (3-seed mean) | yes (3-seed mean) |
| ViT-Mini | yes | never | yes (ep 1) | yes (ep 2-3) |
| StudentNet (synthetic) | **no** | never | yes (ep 14-25) | **NEVER** |
| CNN (BatchNorm) | **no** | never | yes (5-method audit) | **NEVER** |

**[Round 18 softening]** Across the architectures we audited, **diagnostic (b) appears restricted to terminal-LayerNorm architectures**. In the absence of terminal LN, the BP gradient at hidden layers does not collapse to the numerical floor, even on DFA. This is observational association, not causal identification of LayerNorm — a future architecture without terminal LN where (b) fires would refine the claim.

### 3.6 Sub-mode discrimination via (c)

(c) splits the (a)+(b) failures into two interpretable sub-modes:
- **drift-dominated**: stability ≈ 0.5-1.0. Reference BP grad direction is sample-invariant. (Example: SB s42 stability = 0.992.)
- **noise-dominated**: stability ≈ 0. Reference direction is random per batch. (Example: DFA s42 stability = 0.047.)

The same method falls into different sub-modes on different seeds. (c) is for interpretation; (b) is the primary detector regardless of sub-mode.
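
Diagnostic (c) is a mean pairwise cosine over per-batch BP gradient directions. The sketch below shows one way to compute it on plain Python lists. The function names are hypothetical, and using the 0.30 drift ceiling from §3.1 as the cut between the two sub-modes is an assumption made for illustration.

```python
import math
from itertools import combinations

DRIFT_CEILING = 0.30  # from §3.1; used here as the sub-mode cut (assumption)


def cosine(u, v):
    """Plain cosine with no eps clamp; raises ZeroDivisionError on a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def direction_stability(batch_grads):
    """Mean pairwise cosine across gradients from disjoint minibatches."""
    pairs = list(combinations(batch_grads, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def sub_mode(stability):
    return "drift-dominated" if stability > DRIFT_CEILING else "noise-dominated"


# Toy gradients: same direction at different scales -> stability near 1.
aligned = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.5, 1.0, 1.5]]
print(sub_mode(direction_stability(aligned)))

# Stability values quoted above: SB s42 = 0.992, DFA s42 = 0.047.
print(sub_mode(0.992), sub_mode(0.047))
```

Because the cosine is scale-free, a gradient at the numerical floor can still score near 1 here, which is why (c) is read alongside (b) rather than on its own.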

## §4 Two distinct failure modes (round 19 final framing)

### 4.1 The two modes

**Mode 1 — measurement degeneracy via terminal LayerNorm gradient cancellation**
- DFA's local block losses have no global scale constraint
- Block parameters grow ~95× rel-delta on ResMLP (vs BP ~2.7×)
- Residual stream ‖h_L‖ ~ 10⁸ on ResMLP / ViT (4 OOM)
- Terminal LayerNorm Jacobian rescaling drives ‖g_L‖ to ~10⁻¹⁰ (below the F.cosine_similarity eps clamp and well below the 10⁻⁷ floor)
- The cosine alignment metric Γ is then computed against a numerical-floor reference vector — the value is mathematically defined but uninterpretable

**This is caught by diagnostic (b)**: ‖g_L‖ floor check.

**Mode 2 — low intrinsic credit-direction quality of random feedback**
- Even when the BP gradient at hidden layers is in the meaningful regime (vanilla DFA at epoch 1, ‖g_L‖ ≈ 6×10⁻⁷), the deep-layer cosine of `e_T B_l^T` to BP grad is **essentially zero** (l1-l4 in [-0.05, +0.02] across vanilla ep 1-5)
- This is not measurement noise: the same checkpoint shows l0 cos = +0.42, well above noise. The deep blocks specifically have zero alignment.
- Random feedback projects the error signal in directions that are largely uncorrelated with the per-layer BP gradient, on the deep blocks of pre-LayerNorm residual networks

**This is caught by direct per-layer cosine measurement** (in the meaningful regime).

### 4.2 Causal validation: penalty rescue partially alleviates BOTH modes

On 4-block d=256 ResMLP, adding `λ ‖f_l(h_l)‖²` to each DFA local block loss (3 seeds at λ=1e-2, 30 epochs):

**Mode 1 alleviation** (residual stream + BP grad):
- ‖h_L‖: 4×10⁸ → ~4×10⁴ (4 OOM rescue, 3 seeds)
- ‖g_L‖: 5×10⁻¹⁰ → ~10⁻⁶ (4 OOM rescue, 3 seeds)
- Diagnostic (b) passes after the penalty

**Mode 2 alleviation** (deep credit alignment):
- Vanilla deep-layer cos (l1-l4): essentially zero
- Penalized deep-layer cos (l1-l4) 3-seed mean: **+0.155 ± 0.025**
- Null calibration with 20 fresh random Bs: deep cos = +0.002 ± 0.022 — confirms the +0.155 is real signal that the network adapted to its specific training Bs
- Diagnostic measurement of mode 2 is now in a partially-alleviated regime, but +0.155 is still much less than BP's self-cos of 1.0
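
The null calibration can be read as a resampling check: redraw the feedback matrices, remeasure the cosine, and compare the observed value to that null spread. The sketch below is a toy stand-in, not the actual pipeline. `null_calibrate`, the 2048-dimensional reference vector, and the Gaussian sampler are all hypothetical; only the final z-score uses numbers from this section.

```python
import math
import random
import statistics


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def null_calibrate(measure, sample_feedback, n_draws=20, seed=0):
    """Mean and sample std of `measure` over freshly drawn feedback objects."""
    rng = random.Random(seed)
    values = [measure(sample_feedback(rng)) for _ in range(n_draws)]
    return statistics.mean(values), statistics.stdev(values)


def z_score(observed, null_mean, null_std):
    return (observed - null_mean) / null_std


# Toy null: cosine between a fixed reference and fresh random directions.
DIM = 2048
reference = [1.0] * DIM
toy_mean, toy_std = null_calibrate(
    measure=lambda b: cosine(reference, b),
    sample_feedback=lambda rng: [rng.gauss(0.0, 1.0) for _ in range(DIM)],
)
print(toy_mean, toy_std)  # mean near 0, spread on the order of 1/sqrt(DIM)

# Reported values: penalized deep cos +0.155 against a null of +0.002 ± 0.022.
print(z_score(0.155, 0.002, 0.022))  # roughly 7 null standard deviations
```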

**Both modes are partially alleviated, neither fully** (matched 30-ep 3-seed values, paper v2.32):
- DFA+pen acc lam=1e-2 30ep (3-seed): 0.360 ± 0.001
- DFA-vanilla 30ep (3-seed): 0.301 ± 0.005 (was 0.308 in old unsourced reporting)
- DFA-shallow baseline (frozen blocks): 0.349 ± 0.002
- BP-trainable no-pen 30ep (3-seed): 0.585 ± 0.001 (was 0.609 in old unsourced reporting)

Penalty rescue is +5.9 pp over matched 30-ep vanilla (mode 1 alleviated) and +1.1 pp over shallow (mode 2 partially alleviated). The remaining gap to BP reflects that mode 2 is only partially fixed: cos +0.155 is real but well below BP's 1.0.

### 4.3 Disambiguating "penalty revealed" vs "penalty created" the alignment

Round 19 disambiguation experiment: trained vanilla DFA s42 for 5 epochs and saved checkpoints at each. Measured deep-layer cos at every checkpoint:

| epoch | ‖g_L‖ | meaningful? | l1 cos | l2 cos | l3 cos | l4 cos |
|---:|---:|:---:|---:|---:|---:|---:|
| 1 | 1.4e-6 | yes | +0.005 | -0.028 | -0.039 | -0.038 |
| 2 | 3.2e-7 | yes | -0.002 | -0.040 | -0.055 | -0.054 |
| 3 | 1.3e-7 | borderline | +0.007 | -0.039 | -0.054 | -0.054 |
| 4 | 6.8e-8 | no | +0.013 | -0.034 | -0.052 | -0.052 |
| 5 | 4.3e-8 | no | +0.016 | -0.036 | -0.055 | -0.055 |

**Even at epoch 1, where ‖g_L‖ is well above the floor (in the meaningful measurement regime), the deep cosines are essentially zero.** Compare to penalized DFA at ep 30: deep cos ~+0.17.

This means the penalty intervention **created** the +0.17 alignment, it did not just make a previously-hidden alignment measurable. The mechanism is plausibly: with no penalty, the inner product `<f_l, e_T B_l^T>` can be increased indefinitely by inflating ‖f_l‖, so the optimizer pushes parameters in directions uncorrelated with BP grad. With the penalty, ‖f_l‖ is constrained, so the optimizer must instead orient the direction of `f_l` more carefully, which incidentally yields better (partial) alignment with BP grad.
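
A one-dimensional toy makes the proposed mechanism concrete. Assume, purely for illustration, that the local objective for a block output of norm `r` along a fixed unit direction reduces to `r * a - lam * r**2`, where `a` is the cosine-weighted inner product of that direction with `e_T B_l^T`. This reduction and the function names are assumptions of the sketch, not the actual DFA loss.

```python
import math


def local_gain(r, a, lam):
    """Toy objective: linear reward for norm r along a direction with overlap a,
    minus the quadratic penalty lam * r^2."""
    return r * a - lam * r * r


def best_norm(a, lam):
    """Norm that maximizes local_gain for fixed overlap a (unbounded if lam <= 0)."""
    if lam <= 0.0:
        return math.inf
    return max(a, 0.0) / (2.0 * lam)


# Without a penalty, any positive overlap is rewarded by growing the norm.
print(local_gain(10.0, 0.1, 0.0) < local_gain(1000.0, 0.1, 0.0))  # True

# With lam = 1e-2 the optimal norm is finite, and the best attainable gain
# a^2 / (4 * lam) now depends on how well the direction overlaps.
lam = 1e-2
for a in (0.1, 0.5):
    r_star = best_norm(a, lam)
    print(a, r_star, local_gain(r_star, a, lam))
```

In this toy the penalty rewards overlap with the feedback signal, not with the BP gradient itself, which matches the "incidentally" in the paragraph above.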

This is the strongest causal evidence we have: **the two modes are mechanistically distinct (round 20 wording: "strongly supports" rather than "confirmed"), and the penalty's role is not just numerical (preventing collapse) but training-trajectory-altering (creating partial alignment that wasn't there in vanilla)**. Multi-seed lock-in (3 seeds × {ep 1, ep 2}, 24 measurements total) gives deep-layer cosines all in [-0.04, +0.02], 3-seed mean -0.008 ± 0.013 at ep 1 — closing the single-seed-fluke objection.

### 4.4 Capacity-cost control: BP+penalty 2×2

To distinguish the residual depth-utilization gap from "the penalty's intrinsic capacity-regularization cost", we ran end-to-end BP with the same `λ ‖f_l(h_l)‖²` penalty for 30 epochs:

| | no penalty | with penalty |
|---|---:|---:|
| BP | 0.585 ± 0.001 | **0.532 ± 0.006** |
| DFA | 0.301 ± 0.005 | 0.360 ± 0.001 |

(All same architecture, same data, same optimizer family. 3-seed matched 30-epoch values, paper v2.32.)

The penalty has **opposite effects** on BP and DFA: **−5.3 pp** capacity cost on BP, **+5.9 pp** rescue on DFA. BP+penalty still clears the DFA-shallow baseline (0.349) by **+18.3 pp**, while DFA+penalty clears it by only +1.1 pp.

**[Round 20 phrasing]**: this is *not* a clean isolation of "credit quality" in a vacuum — it identifies a **lower bound on the residual performance gap under matched architecture, data, optimizer family, and matched penalty, after accounting for the penalty's direct capacity cost in BP**. Stated more cautiously: *"matched penalty controls show that only part of DFA's deficit is attributable to the representational/optimization cost of the penalty itself; a substantial residual remains and is consistent with poorer credit assignment"*.

A counterargument would be that the penalty places BP into a fundamentally better optimization regime unrelated to capacity, but this is unlikely because the penalty *hurts* BP by 5.3 pp while *helping* DFA by 5.9 pp, the opposite of the pattern expected from a generally beneficial regime shift.

### 4.5 Partial alleviation explains the residual gap

The remaining 22.5 pp gap from penalized DFA (0.360) to BP-trainable (0.585) is dominantly explained by the partial credit-quality cost identified above: ~17 pp remains after subtracting the penalty's 5.3 pp capacity cost on BP. The (d) diagnostic margin (penalized DFA +1.1 pp over shallow) is consistent: the deep blocks contribute *some* useful signal (because cos > 0), but the magnitude is small.
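
The margins implied by the matched 30-epoch values can be recomputed directly. The sketch below uses only the four cells of the §4.4 table and the DFA-shallow baseline from §4.2; the variable names are illustrative.

```python
# 3-seed matched 30-epoch accuracies from the §4.4 table.
acc = {
    ("BP", "none"): 0.585,
    ("BP", "pen"): 0.532,
    ("DFA", "none"): 0.301,
    ("DFA", "pen"): 0.360,
}
SHALLOW = 0.349  # DFA-shallow (frozen blocks) baseline from §4.2


def pp(delta):
    """Accuracy difference in percentage points, rounded to one decimal."""
    return round(100.0 * delta, 1)


penalty_on_bp = pp(acc["BP", "pen"] - acc["BP", "none"])      # -5.3
penalty_on_dfa = pp(acc["DFA", "pen"] - acc["DFA", "none"])   # +5.9
gap_to_bp = pp(acc["BP", "none"] - acc["DFA", "pen"])         # 22.5
gap_matched = pp(acc["BP", "pen"] - acc["DFA", "pen"])        # 17.2
bp_pen_margin = pp(acc["BP", "pen"] - SHALLOW)                # 18.3
dfa_pen_margin = pp(acc["DFA", "pen"] - SHALLOW)              # 1.1

print(penalty_on_bp, penalty_on_dfa, gap_to_bp, gap_matched)
print(bp_pen_margin, dfa_pen_margin)
```

Under this bookkeeping, the gap to unpenalized BP splits into the penalty's cost on BP plus the matched-penalty residual (5.3 + 17.2 = 22.5 pp).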

### 4.6 Why this framing is paper-cleaner

The new framing has several improvements over the original "scale + direction" claim:
- **Empirically grounded**: both modes are directly measured (not inferred from observable proxies)
- **Honest about measurement**: mode 2 is only measurable in the meaningful regime (i.e., after mode 1 is alleviated), and we say so explicitly
- **Causal control**: the vanilla early-epoch checkpoint sweep disambiguates "penalty revealed" vs "penalty created"
- **Null calibration**: fresh-Bs control rules out measurement artifacts
- **Capacity-cost control**: BP+penalty 2×2 disambiguates capacity loss from credit quality
- **Multi-seed lock-in**: 3 seeds × {ep 1, ep 2} vanilla cosine closes the single-seed-fluke objection
- **Avoids the "two failure modes via (d)" claim**: (d) is now reframed as a depth-utilization measure, not a credit-quality test

### 4.7 Six independent validations supporting the framing

1. Direct deep-layer cosine measurement on penalized DFA: 3-seed mean +0.155 ± 0.025
2. Null calibration with 20 fresh random Bs: deep cos +0.002 ± 0.022 — confirms training-Bs +0.16 is real
3. Hypothesis-B disambiguation (vanilla early-epoch ep 1, ‖g‖ in meaningful regime): deep cos -0.008 ± 0.013 across 3 seeds — confirms penalty creates (not just reveals) the alignment
4. BP+penalty capacity-cost control: penalty has only -5.3 pp BP cost; the ~17 pp residual gap is consistent with credit-quality cost
5. Multi-seed lock-in: 24 measurements (3 seeds × 2 ep × 4 deep layers) all in [-0.04, +0.02] — single-seed-fluke objection closed
6. **Perturbation correlation cross-metric triangulation**: vanilla deep ρ -0.003 ± 0.005, penalized deep ρ +0.080 ± 0.011 (3 seeds). Different metric (loss-change-based, no eps clamp, sample-level) confirms the same vanilla→penalized transition. Rules out single-metric artifacts on either cos or ρ side.

## §5 Pipeline pitfalls catalog (appendix)

7 evaluation-pipeline bugs we found in our own dogfood codebase. Each has a reproducer in `protocol/examples/verify_pitfalls*.py`.

1. `tensor.norm(-1)` is L_{-1} of the whole tensor, not L_2 along dim=-1
2. `F.cosine_similarity` clamps the divisor by `eps=1e-8`, distorting cosines by 10-100× when ‖g‖ is at the floor
3. fp16 mixed precision underflows BP grads at hidden layers (~5×10⁻¹⁰ → 0)
4. Random feedback `Bs` are training-specific; Γ is not invariant to reseeding them
5. Aggregation strategy across layers/samples/batches gives 2.5-3.5× spread on the same data; sign can flip
6. **[Round 18 reframing]** Per-layer Γ structure is hidden by aggregation. Layer-0 dominance is one specific manifestation (ResMLP DFA: l0 +0.42, l1-4 ≈ 0). On ViT-Mini all layers are uniformly near zero. The general lesson is to always report per-layer.
7. **[Round 18 disclosure]** Auxiliary networks (`Bs`, bridge models) not persisted alongside checkpoints can cause post-hoc Γ scripts to silently fall back to `cos(BP, BP) = 1.0` and report perfect alignment. We discovered this in our own pipeline.
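
Pitfalls 1 to 3 can be illustrated without PyTorch. The sketch below is a pure-Python stand-in, not the reproducers in `protocol/examples/`, and its helper names are hypothetical. The clamped cosine assumes each norm is clamped from below by `eps`, as in the current PyTorch documentation; older releases clamped the product of the norms, which gives the same result in this example.

```python
import math
import struct


def p_norm(xs, p):
    """Generic p-norm over all elements; p = -1 is what a positional -1 selects."""
    return sum(abs(x) ** p for x in xs) ** (1.0 / p)


def l2(xs):
    return math.sqrt(sum(x * x for x in xs))


def clamped_cosine(u, v, eps=1e-8):
    """Cosine with each norm clamped from below by eps (assumed clamp form)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (max(l2(u), eps) * max(l2(v), eps))


def fp16_round_trip(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Pitfall 1: the "-1 norm" of [3, 4] is 12/7, not the L2 norm 5.
print(p_norm([3.0, 4.0], -1), p_norm([3.0, 4.0], 2))

# Pitfall 2: a perfectly aligned gradient of norm 4e-10 reads as cos = 0.04,
# a 25x distortion, because its norm is clamped up to eps = 1e-8.
print(clamped_cosine([4e-10, 0.0], [1.0, 0.0]))

# Pitfall 3: 5e-10 is below the smallest fp16 subnormal and flushes to zero.
print(fp16_round_trip(5e-10), fp16_round_trip(4e-4))
```

A practical guard for pitfall 2 is to compare norms against `eps` before trusting any reported cosine.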

## §6 Reference implementation

`protocol/` package: 4 diagnostics, DiagnosticReport class, smoke test, audit table, decision-utility ablation, temporal evolution, sensitivity analyses, reporting template, checklist. ~20 Python files, all syntax-clean. Smoke test passes BP/DFA/EP checkpoint verdicts.

Reproducibility: every result in this paper reproduces from `protocol/examples/*.py` against checkpoints in `results/`.

## Limitations and walk-backs (appendix or §7)

We walked back our own claims multiple times during this work. Reporting these explicitly as part of the paper:

1. "DFA trains ViT-Mini to 24% accuracy" → walked back to "DFA-frozen-random-blocks ViT also gets 25.7%; the deep blocks are passengers" (codex round 6)
2. "DFA trains ResMLP to 31%" → walked back to "DFA-trainable is 4 pp BELOW DFA-shallow on ResMLP; DFA training the deep blocks actively destroys value" (codex round 8)
3. "Penalty rescues to 36.3% above shallow baseline → second failure mode established by (d)" → softened to "the (d) verdict on penalized DFA depends on the intervention strength" (codex round 18)
4. "Layer 0 always dominates Γ" → softened to "aggregation hides per-layer structure that depends on the architecture; on ResMLP layer 0 dominates, on ViT-Mini all layers are uniformly near zero" (round 18 follow-up)
5. "(b) is causally specific to terminal LN" → softened to "(b) appears restricted to terminal-normalized architectures we audited" (round 18 follow-up)
6. "Two failure modes (scale + direction)" → "one unified failure mode (scale → LN → measurement degeneracy)" (cos +0.17 walk-back, 2026-04-08, before disambiguation experiment)
7. **"One unified failure mode" → "two distinct failure modes (measurement degeneracy + low intrinsic random-feedback alignment quality)"** (round 19 disambiguation, 2026-04-08, after vanilla early-epoch checkpoint sweep showed deep cos ~0 even in meaningful regime)

The walk-back chain converges on a more honest framing each time. Walk-backs 6 and 7 happened on the same day; both are documented in memory and reflected here. The final two-distinct-modes framing is empirically grounded with direct measurement, null calibration, and a causal disambiguation control.

## Open experimental questions

- ViT-Mini 5-method audit (waiting on checkpoint training)
- Multi-seed lam=1e-3 (running)
- Direct deep-block credit measurement on penalized DFA (rerunning at --epochs 30 with checkpoint save)
- Threshold robustness sweep on additional architectures
- BatchNorm ResNet as a 4th non-LN architecture (deferred — CNN already covers this case)