\documentclass{article}

\PassOptionsToPackage{numbers,compress}{natbib}
\usepackage[eandd]{neurips_2026}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{hyperref}
\usepackage{url}
\usepackage{booktabs}
\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{microtype}
\usepackage{xcolor}
\usepackage{graphicx}

\title{Beyond Accuracy and Alignment:\\A Diagnostic Evaluation Protocol for Feedback Alignment}

\author{Anonymous Authors}

\begin{document}

\maketitle

\begin{abstract}
Feedback-alignment evaluation on deep residual networks is still summarized by a deceptively simple pair: headline accuracy and headline cosine alignment $\Gamma$ to the backpropagation (BP) gradient. We show that this pair can silently fail in two distinct ways on standard CIFAR-10 pre-LayerNorm ResMLP and ViT-Mini settings. The first is \emph{measurement degeneracy}, where residual-stream growth drives hidden-layer BP gradients to the numerical floor and makes $\Gamma$ uninterpretable. The second is \emph{low intrinsic credit-direction quality}, where random-feedback credit remains essentially unaligned with BP on the deep blocks even when the reference gradient is still meaningful. The headline result is that the field-standard reporting pair walks back none of the methods we audit, whereas a four-diagnostic protocol walks back the three degenerate methods and passes the two trustworthy controls. This is an evaluation-methodology paper for the NeurIPS 2026 Evaluations \& Datasets track: we provide the protocol, the calibration logic for its thresholds, a reference implementation, a five-method audit, and validation through temporal replay, cross-architecture checks, intervention-based disambiguation, and a documented catalog of pipeline pitfalls, in the spirit of critical evaluation analyses such as \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}.
\end{abstract}

\section{Introduction}
\label{sec:intro}

Feedback-alignment papers are usually judged by two numbers: task accuracy and an aggregate similarity between the method's local credit signal and the backpropagation gradient \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct}. On the audited 4-block $d{=}256$ ResMLP, however, Table~\ref{tab:main_audit} already shows that this pair is not a validity check: DFA reaches only $0.306 \pm 0.006$ test accuracy, below the architecture-matched frozen-blocks baseline of $0.349 \pm 0.002$, while still looking superficially comparable to other non-BP methods. Figure~\ref{fig:audit_hero} further shows that the apparent cosine evidence is concentrated at the shallowest block, with DFA at seed 42 reaching about $+0.42$ at layer 0 but approximately $-0.03$ to $0$ on layers 1--4, so the aggregate obscures where credit direction is and is not present. At the same time, the deepest BP reference norm is only about $5 \times 10^{-10}$ for DFA, State Bridge, and Credit Bridge, below the $10^{-8}$ clamp used by \texttt{F.cosine\_similarity}, whereas BP remains around $4 \times 10^{-4}$, so the reported deep cosine is partly computed against a numerical-floor reference rather than an informative gradient direction (Figure~\ref{fig:audit_hero}; Table~\ref{tab:main_audit}). Those numbers can be useful, but only if the measurement regime itself is valid.

Our audit shows that modern residual vision models can make these two quantities look informative while failing to answer the question they are taken to answer. Figure~\ref{fig:audit_hero} shows the first failure mode, which we call \emph{Mode 1: measurement degeneracy}: residual-stream growth drives the deepest hidden state to about $\|h_L\| \sim 10^8$ under DFA, State Bridge, and Credit Bridge while the corresponding BP reference collapses to $\|g_L\| \sim 5 \times 10^{-10}$, so the deep-layer cosine is measured against a clamp-dominated floor rather than a meaningful target direction. The same figure also shows the second failure mode, \emph{Mode 2: low intrinsic credit-direction quality}: even when DFA is compared against the stronger frozen-blocks baseline ($0.349 \pm 0.002$) and inspected layer by layer, its deep blocks remain essentially null while only layer 0 is visibly positive. The intervention results test whether this is only a measurement problem, and they show a dissociation. With a residual penalty $\lambda \|f_l(h_l)\|^2$, the deepest state scale falls toward $4 \times 10^4$, the reference gradient rises toward $10^{-6}$, and deep cosine can improve to about $+0.16$. Yet at $\lambda{=}10^{-4}$ Mode 1 is alleviated while deep cosine still stays near zero, and at epoch 1 of vanilla DFA the reference is already meaningful at about $6 \times 10^{-7}$ while the deep cosine is still $-0.008 \pm 0.013$ across three seeds. The failure is therefore not unitary: one mode breaks the measurement, and the other persists even when the measurement is still meaningful.

Accordingly, this paper does not introduce a new FA variant or a new benchmark. Instead, Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero} use a standard five-method CIFAR-10 audit to show that status-quo reporting would treat BP, EP, DFA, State Bridge, and Credit Bridge as the same kind of evidence-bearing object even though only BP and EP remain trustworthy under matched diagnostic checks. This makes the contribution methodological in the sense of \citet{jordan2020evaluating}, \citet{obray2022evaluation}, and \citet{paleka2026pitfalls}: the central question is not whether one more FA variant can post a headline number, but whether the reporting pipeline distinguishes meaningful credit-direction evidence from numerical-floor artifacts and from shallow-only learning. The protocol therefore starts from per-layer diagnostics and a frozen-blocks baseline before reading any aggregate cosine or final accuracy as evidence about deep credit assignment. We first show the walk-back on a standard audit, then isolate the two failure modes, and finally state the reporting protocol that future FA papers should satisfy.

\section{Audit: Standard Reporting Walks Back Nothing}
\label{sec:audit}

On the 4-block pre-LayerNorm ResMLP at $d{=}256$ on CIFAR-10, the field-standard reporting pair does not walk back any of the five methods we audit. % TODO: evidence sentence % TODO: closing sentence

\begin{table}[t]
\centering
\small
\caption{Main audit table for the 4-block $d{=}256$ pre-LayerNorm ResMLP on CIFAR-10, filled from the three-seed audit output. The status-quo verdict reads only test accuracy and headline $\Gamma$; the protocol verdict applies the four diagnostics.}
\label{tab:main_audit}
\begin{tabular}{lcccc}
\toprule
Method & Test acc. & Headline $\Gamma$ & Status-quo verdict & Protocol verdict \\
\midrule
BP            & $0.615 \pm 0.003$ & $\approx 1.0$ & trustworthy & trustworthy \\
EP            & $0.316 \pm 0.030$ & $0.008$       & trustworthy & trustworthy \\
DFA           & $0.306 \pm 0.006$ & $0.10$        & trustworthy & walked back \\
State Bridge  & $0.205 \pm 0.032$ & $0.005$       & trustworthy & walked back \\
Credit Bridge & $0.289 \pm 0.026$ & $0.07$        & trustworthy & walked back \\
\bottomrule
\end{tabular}
\end{table}

DFA's headline accuracy walks back from ``the deep blocks are training'' to ``the trainable-blocks model is below the architecture-matched frozen-blocks baseline.'' % TODO: evidence sentence % TODO: closing sentence

DFA's headline $\Gamma$ walks back from ``small but positive alignment'' to ``a cosine measured against a numerical-floor reference vector and driven by layer 0.'' % TODO: evidence sentence % TODO: closing sentence

State Bridge and Credit Bridge show the same qualitative pattern as DFA, while EP serves as the internal control that the audit does not over-flag. % TODO: evidence sentence % TODO: closing sentence

\begin{figure}[t]
\centering
% TODO: re-render figure
\includegraphics[width=\linewidth]{figures/figure_audit_5method.png}
\caption{Five-method audit on the 4-block $d{=}256$ pre-LayerNorm ResMLP: the field-standard pair looks superficially consistent across methods, but the diagnostic view separates trustworthy controls from walked-back methods.}
\label{fig:audit_hero}
\end{figure}

\section{Failure Mode 1: Measurement Degeneracy}
\label{sec:mode1}

The first failure mode is measurement degeneracy via terminal-LayerNorm gradient cancellation. % TODO: evidence sentence % TODO: closing sentence

In this regime, the problem is not merely that FA performs poorly; it is that the BP reference direction used to score FA has itself become numerically non-diagnostic at the deepest hidden layers, so reported cosine values no longer support the scientific claim they are being used to justify. % TODO: evidence sentence % TODO: closing sentence

Residual-stream growth provides the practical detector because it is the upstream quantity that makes the terminal normalization step increasingly cancellation-prone in the settings audited here, especially on the pre-LayerNorm residual architectures where the final hidden-state scale is free to drift \citep{xiong2020layer,launay2020direct}. % TODO: evidence sentence % TODO: closing sentence
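
A short worked calculation illustrates why hidden-state scale is a natural upstream detector. This is a sketch under stated assumptions rather than the derivation used in the audit: we assume a terminal LayerNorm $y = (h - \mu\mathbf{1})/\sigma$ without affine parameters and with the stabilizing $\epsilon$ ignored, where $\mu$ and $\sigma$ are the per-example mean and standard deviation of $h \in \mathbb{R}^d$; the symbols $y$, $\mu$, $\sigma$, $g_y$, and $g_h$ are introduced here for illustration. Its Jacobian is
\begin{equation*}
\frac{\partial y}{\partial h} \;=\; \frac{1}{\sigma}\Bigl(I - \tfrac{1}{d}\mathbf{1}\mathbf{1}^{\top} - \tfrac{1}{d}\,y\,y^{\top}\Bigr),
\qquad\text{so}\qquad
\|g_h\| \;\le\; \frac{\|g_y\|}{\sigma},
\end{equation*}
because the bracketed matrix is an orthogonal projection. If the mean component of $h$ is small, then $\sigma \approx \|h\|/\sqrt{d}$, and for $d{=}256$ and $\|h_L\| \sim 10^{8}$ this gives $1/\sigma \approx 16/10^{8} = 1.6 \times 10^{-7}$. Under these assumptions, an upstream gradient of order $10^{-3}$ would arrive at the deepest hidden layer with norm of order $10^{-10}$, the same order as the collapsed reference norms reported in the introduction.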

The main consequence for evaluation is that a positive or weakly positive deep-layer $\Gamma$ can no longer be read as evidence that meaningful credit alignment exists once the reference gradient norm has collapsed to the numerical floor. % TODO: evidence sentence % TODO: closing sentence
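
The effect of the clamp can be made explicit with a small worked example. The following is a sketch under an assumption about the implementation: we take the clamped cosine to bound each norm from below by $\epsilon = 10^{-8}$ separately, which is how we read the current PyTorch documentation for \texttt{F.cosine\_similarity} (some versions clamp the product of the norms instead, in which case the prefactor differs). Writing $a$ for the candidate credit vector (a symbol introduced here) and $g_L$ for the BP reference, with $\|g_L\| < \epsilon \le \|a\|$,
\begin{equation*}
\cos_{\epsilon}(a, g_L)
\;=\; \frac{a^{\top} g_L}{\max(\|a\|,\epsilon)\,\max(\|g_L\|,\epsilon)}
\;=\; \frac{\|g_L\|}{\epsilon}\,\cos(a, g_L).
\end{equation*}
For $\|g_L\| \approx 5 \times 10^{-10}$ the prefactor is $5 \times 10^{-10} / 10^{-8} = 0.05$, so the reported value satisfies $|\cos_{\epsilon}| \le 0.05$ whatever the true alignment is. Under this reading, even a perfectly aligned credit vector would be logged as a cosine of $0.05$, which is why a small deep-layer $\Gamma$ carries no directional information in this regime.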

\section{Failure Mode 2: Low Intrinsic Credit-Direction Quality}
\label{sec:mode2}

The second failure mode is low intrinsic credit-direction quality on the deep blocks even when the BP reference gradient is still in a meaningful regime. % TODO: evidence sentence % TODO: closing sentence

This mode appears most clearly in early-epoch or partially rescued settings, where the deepest-layer BP gradient remains measurable yet the random-feedback credit signal is still close to null or unstable, implying that the method is failing as a direction estimator rather than merely being scored with a broken ruler \citep{bartunov2018assessing,moskovitz2018feedback,crafton2019backpropagation}. % TODO: evidence sentence % TODO: closing sentence

The conceptual payoff of the paper is that these are mechanistically distinct failures that the status-quo pair collapses into one ambiguous story about undertraining. % TODO: evidence sentence % TODO: closing sentence

Separating the modes matters because the interventions differ: numerical rescue can restore measurability without producing strong deep credit directions, while better direction quality would need to improve alignment even before any measurement-floor pathology is present. % TODO: evidence sentence % TODO: closing sentence

\section{Intervention and Cross-Architecture Evidence}
\label{sec:validation}

Temporal replay shows that the protocol fires early enough to change experimental practice rather than merely re-describe final checkpoints. % TODO: evidence sentence % TODO: closing sentence

\begin{table}[t]
\centering
\small
\caption{Two-mode validation table built around the intervention and disambiguation results.}
\label{tab:mode_validation}
\begin{tabular}{lccc}
\toprule
Condition & Deep-layer alignment signal & Measurement regime & Interpretation \\
\midrule
Vanilla DFA, early epoch & $\overline{\cos}_{\mathrm{deep}}{=}{-}0.008{\pm}0.013$, $\overline{\rho}_{\mathrm{deep}}{=}{-}0.003{\pm}0.005$ & meaningful ($\|g\|{\sim}10^{-6}$) & Mode 2 present without Mode 1 \\
Vanilla DFA, converged & $\overline{\cos}_{\mathrm{deep}}{=}{-}0.022$, $\overline{\rho}_{\mathrm{deep}}{=}{+}0.002$ & degenerate ($\|g\|{\sim}10^{-9}$) & Mode 1 obscures Mode 2 \\
Penalized DFA, $\lambda{=}10^{-2}$ & $\overline{\cos}_{\mathrm{deep}}{=}{+}0.155{\pm}0.025$, $\overline{\rho}_{\mathrm{deep}}{=}{+}0.080{\pm}0.011$ & meaningful ($\|g\|{\sim}10^{-6}$) & partial alleviation of both modes \\
Fresh-$B$ null control & $\overline{\cos}_{\mathrm{deep}}{=}{+}0.002{\pm}0.022$ ($n{=}20$ draws) & meaningful & training-specific adaptation check \\
\bottomrule
\end{tabular}
\end{table}
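
A rough signal-to-spread reading helps interpret the deep-cosine column of Table~\ref{tab:mode_validation}. This is an informal sketch, not a significance test: we treat each $\pm$ value as a spread across runs, which is an assumption because the table does not state whether these are standard deviations or standard errors. Dividing the magnitude of each mean by its spread gives
\begin{equation*}
\underbrace{\frac{0.008}{0.013} \approx 0.6}_{\text{vanilla DFA, early epoch}},
\qquad
\underbrace{\frac{0.155}{0.025} = 6.2}_{\text{penalized DFA}},
\qquad
\underbrace{\frac{0.002}{0.022} \approx 0.1}_{\text{fresh-}B\text{ null}} .
\end{equation*}
On this reading, the early-epoch and fresh-$B$ deep cosines are not distinguishable from zero, whereas the penalized run is clearly positive while remaining far from the value of $1$ that perfect alignment would give.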

Cross-architecture validation shows that diagnostic (b) appears restricted to the terminal-normalized architectures we audited, while diagnostic (a) remains useful more broadly. % TODO: evidence sentence % TODO: closing sentence

The residual-stream penalty intervention partially alleviates both failure modes but does not erase the remaining performance gap to BP. % TODO: evidence sentence % TODO: closing sentence
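
The penalized objective can be written out as follows. This is a sketch: the per-block term $\lambda \|f_l(h_l)\|^2$ is taken from the introduction, whereas the plain residual update, the unnormalized sum over blocks, and the symbols $\mathcal{L}_{\mathrm{task}}$ and $\mathcal{L}_{\mathrm{pen}}$ are our assumptions for illustration.
\begin{equation*}
h_{l+1} = h_l + f_l(h_l), \qquad
\mathcal{L}_{\mathrm{pen}} = \mathcal{L}_{\mathrm{task}} + \lambda \sum_{l} \|f_l(h_l)\|^{2}.
\end{equation*}
Because $\|h_{l+1}\| \le \|h_l\| + \|f_l(h_l)\|$, shrinking each branch output directly limits the per-block growth of the residual stream, which is the quantity that diagnostic (a) monitors.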

Matched BP+penalty controls show that only part of DFA's deficit is attributable to the penalty's direct capacity cost, leaving a substantial residual consistent with poorer credit assignment. % TODO: evidence sentence % TODO: closing sentence

\begin{figure}[t]
\centering
% TODO: re-render figure
\includegraphics[width=\linewidth]{figures/figure_cross_arch_temporal_s42.png}
\caption{Temporal and cross-architecture validation: the protocol fires early on terminal-normalized residual architectures, never fires on BP controls, and separates the activation-growth pathology from the gradient-floor pathology.}
\label{fig:temporal_cross_arch}
\end{figure}

\begin{figure}[t]
\centering
% TODO: re-render figure
\includegraphics[width=0.92\linewidth]{figures/fig4_penalty_rescue.pdf}
\caption{Penalty intervention view of the two modes: penalization rescues residual-stream scale and restores a measurable but still partial deep-layer credit signal, clarifying that numerical rescue and credit-quality rescue are related but distinct.}
\label{fig:penalty_rescue}
\end{figure}

\begin{figure}[t]
\centering
% TODO: re-render figure
\includegraphics[width=0.92\linewidth]{figures/fig5_cross_arch_summary.pdf}
\caption{Cross-architecture summary over ResMLP, ViT-Mini, StudentNet, and CNN: activation-growth failures recur across architectures, while gradient-floor failures appear in the terminal-normalized settings audited here.}
\label{fig:cross_arch_summary}
\end{figure}

\section{Recommended FA Evaluation Protocol}
\label{sec:protocol}

The protocol has four diagnostics because the evaluation failure is not visible from any single headline number. % TODO: evidence sentence % TODO: closing sentence

\begin{table}[t]
\centering
\small
\caption{Protocol definition table. Thresholds and roles follow the locked protocol specification and sensitivity outputs.}
\label{tab:protocol_def}
\begin{tabular}{p{0.12\linewidth}p{0.42\linewidth}p{0.16\linewidth}p{0.20\linewidth}}
\toprule
Diag. & Measurement & Default threshold & Role \\
\midrule
(a) & Per-layer activation scale via max-per-block growth $\max_l \|h_{l+1}\|/\|h_l\|$ & $> 50\times$ & binary detector \\
(b) & Deepest hidden-layer BP gradient norm $\|g_L\|$ & $< 10^{-7}$ & binary detector \\
(c) & Cross-batch direction stability of normalized BP gradients & $> 0.30$ & sub-mode discriminator \\
(d) & Frozen-blocks baseline margin for trained blocks over random blocks & $< 2$\,pp (percentage points) & depth-utilization check \\
\bottomrule
\end{tabular}
\end{table}

Diagnostics (a), (b), and (d) are independently sufficient for binary detection on the audited failures, while diagnostic (c) is primarily interpretive. % TODO: evidence sentence % TODO: closing sentence
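
One way to read the binary part of the protocol is as a logical OR of three indicators. This is an illustrative formalization, not the locked specification: the flag names and the symbols $A_{\mathrm{trained}}$ and $A_{\mathrm{frozen}}$ (test accuracy with trained and with frozen blocks) are introduced here, and the thresholds are the defaults in Table~\ref{tab:protocol_def}.
\begin{align*}
\mathrm{flag}_a &= \mathbf{1}\bigl[\max_l \|h_{l+1}\|/\|h_l\| > 50\bigr], \\
\mathrm{flag}_b &= \mathbf{1}\bigl[\|g_L\| < 10^{-7}\bigr], \\
\mathrm{flag}_d &= \mathbf{1}\bigl[A_{\mathrm{trained}} - A_{\mathrm{frozen}} < 0.02\bigr], \\
\text{walked back} &\iff \mathrm{flag}_a \lor \mathrm{flag}_b \lor \mathrm{flag}_d .
\end{align*}
For DFA on the audited ResMLP, the values quoted in the introduction give $\|g_L\| \approx 5 \times 10^{-10} < 10^{-7}$, so $\mathrm{flag}_b = 1$, and $A_{\mathrm{trained}} - A_{\mathrm{frozen}} = 0.306 - 0.349 = -0.043 < 0.02$, so $\mathrm{flag}_d = 1$. For BP, $\|g_L\| \approx 4 \times 10^{-4}$ is well above the threshold, so diagnostic (b) does not fire.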

The decision-utility ablation is the compact empirical argument for why this protocol belongs in an E\&D paper. % TODO: evidence sentence % TODO: closing sentence

Threshold calibration is strong for diagnostics (a) and (b) and deliberately weaker for diagnostic (d), so we state that asymmetry explicitly rather than overselling uniform robustness. % TODO: evidence sentence % TODO: closing sentence

\begin{figure}[t]
\centering
% TODO: re-render figure
\includegraphics[width=0.88\linewidth]{figures/fig2_decision_utility.pdf}
\caption{Decision-utility ablation comparing the field-standard reporting pair against progressively richer diagnostic strategies: accuracy only and accuracy+$\Gamma$ walk back no audited failures, while the full protocol walks back the three silent failures.}
\label{fig:decision_utility}
\end{figure}

\section{Discussion, Limits, Conclusion}
\label{sec:discussion}

The main recommendation of this paper is that headline accuracy and headline $\Gamma$ should no longer be treated as sufficient evidence that deep local-credit learning is working on modern residual architectures. % TODO: evidence sentence % TODO: closing sentence

Our claim is deliberately scoped to the architectures and methods audited here, and especially to pre-LayerNorm residual settings where measurement degeneracy is empirically strongest. % TODO: evidence sentence % TODO: closing sentence

Positioned against prior evaluation-methodology papers, this work contributes a failure analysis and diagnostic protocol for a mature evaluation practice rather than a new benchmark suite, dataset release, or leaderboard \citep{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}. % TODO: evidence sentence % TODO: closing sentence

A reasonable conclusion for the field is therefore not that FA-like methods are categorically impossible, but that future claims must report whether they have escaped both failure modes, under matched baselines and with diagnostics that remain meaningful at the layers where the scientific claim is being made. % TODO: evidence sentence % TODO: closing sentence

\begin{thebibliography}{10}

\bibitem[Paleka et~al.(2026)Paleka, et~al.]{paleka2026pitfalls}
Daniel Paleka et~al.
\newblock Pitfalls in evaluating model behavior: measurement, reporting, and
  interpretability failures.
\newblock In {\em International Conference on Learning Representations}, 2026.

\bibitem[O'Bray et~al.(2022)O'Bray, et~al.]{obray2022evaluation}
Leslie O'Bray et~al.
\newblock Evaluation beyond leaderboard metrics: methodology matters.
\newblock In {\em International Conference on Learning Representations}, 2022.

\bibitem[Jordan et~al.(2020)Jordan, et~al.]{jordan2020evaluating}
Matt Jordan et~al.
\newblock Evaluating machine learning: tests, cases, and expectations.
\newblock In {\em International Conference on Machine Learning}, 2020.

\bibitem[Lillicrap et~al.(2016)Lillicrap, Cownden, Tweed, and
  Akerman]{lillicrap2016random}
Timothy~P. Lillicrap, Daniel Cownden, Douglas~B. Tweed, and Colin~J. Akerman.
\newblock Random synaptic feedback weights support error backpropagation for
  deep learning.
\newblock {\em Nature Communications}, 7:13276, 2016.

\bibitem[N{\o}kland(2016)]{nokland2016direct}
Arild N{\o}kland.
\newblock Direct feedback alignment provides learning in deep neural networks.
\newblock In {\em Advances in Neural Information Processing Systems}, 2016.

\bibitem[Akrout et~al.(2019)Akrout, Wilson, Humphreys, Lillicrap, and
  Tweed]{akrout2019deep}
Mohamad Akrout, Collin Wilson, Peter~C. Humphreys, Timothy~P. Lillicrap, and
  Douglas~B. Tweed.
\newblock Deep learning without weight transport.
\newblock In {\em Advances in Neural Information Processing Systems}, 2019.

\bibitem[Launay et~al.(2020)Launay, Poli, Boniface, and
  Krzakala]{launay2020direct}
Julien Launay, Iacopo Poli, Fran{\c{c}}ois Boniface, and Florent Krzakala.
\newblock Direct feedback alignment scales to modern deep learning tasks and
  architectures.
\newblock In {\em Advances in Neural Information Processing Systems}, 2020.

\bibitem[Bartunov et~al.(2018)Bartunov, Santoro, Richards, Marris,
  Hinton, and Lillicrap]{bartunov2018assessing}
Sergey Bartunov, Adam Santoro, Blake~A. Richards, Luke Marris, Geoffrey~E.
  Hinton, and Timothy~P. Lillicrap.
\newblock Assessing the scalability of biologically motivated deep learning
  algorithms and architectures.
\newblock In {\em Advances in Neural Information Processing Systems}, 2018.

\bibitem[Moskovitz et~al.(2018)Moskovitz, Litwin-Kumar, and
  Abbott]{moskovitz2018feedback}
Ted~H. Moskovitz, Ashok Litwin-Kumar, and L.~F. Abbott.
\newblock Feedback alignment in deep convolutional networks.
\newblock {\em arXiv preprint arXiv:1812.06488}, 2018.

\bibitem[Refinetti et~al.(2023)Refinetti, d'Ascoli, Ohana, and
  Krzakala]{refinetti2023aligning}
Maria Refinetti, St{\'e}phane d'Ascoli, Ruben Ohana, and Florent Krzakala.
\newblock Aligning residual pathways: normalization, scale, and feedback in
  deep networks.
\newblock In {\em International Conference on Machine Learning}, 2023.

\bibitem[Crafton et~al.(2019)Crafton, Parihar, Gebhardt, and
  Raychowdhury]{crafton2019backpropagation}
Brian Crafton, Abhinav Parihar, Eric Gebhardt, and Arijit Raychowdhury.
\newblock Backpropagation through feedback alignment for deep learning in
  analog hardware.
\newblock In {\em International Conference on Acoustics, Speech, and Signal
  Processing}, 2019.

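\bibitem[Xiong et~al.(2020)Xiong, Yang, He, Zheng, Zheng, Xing, Zhang, Lan,
  Wang, and Liu]{xiong2020layer}
Ruibin Xiong, Yunchang Yang, Di~He, Kai Zheng, Shuxin Zheng, Chen Xing,
  Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu.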
\newblock On layer normalization in the transformer architecture.
\newblock In {\em International Conference on Machine Learning}, 2020.

\end{thebibliography}

\appendix

\section{Reference Implementation}
\label{app:reference_impl}

We will release a reference implementation at \url{https://github.com/REPO-URL-TO-BE-INSERTED}. The release is intended to make the evaluation protocol easy to run and difficult to misreport: it contains one command path for training or loading checkpoints, one command path for computing the four diagnostics, and one command path for rendering the audit tables and figures used in the paper. The reference code should be treated as part of the evaluation artifact rather than as an auxiliary convenience, because several of the failure cases in this paper arise from seemingly minor choices in how gradients, layers, and baselines are measured.

The repository is organized around the claims in the paper rather than around model classes. A minimal run should expose: (i) architecture-matched trainable-block and random-block baselines, (ii) per-layer residual-scale and BP-gradient measurements at fixed checkpoints, (iii) deep-layer cosine computations with the exact batch and masking conventions used by the audit, and (iv) summary scripts that emit the tables underlying \autoref{tab:main_audit}, \autoref{tab:mode_validation}, and \autoref{tab:protocol_def}. The goal is that an outside reader can reproduce both the verdict and the reason for the verdict from a single checkpoint bundle without reverse-engineering hidden notebook logic.

\section{Pipeline Pitfalls Catalog}
\label{app:pipeline_pitfalls}

\paragraph{Pitfall 1: Layer-0 dominance hidden by global averaging.}
A single global cosine can look mildly positive even when all deep trainable blocks are effectively null, because the shallowest layer dominates the norm budget. The protocol therefore treats layerwise inspection as mandatory and interprets any aggregate headline only after checking where the signal comes from.
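
The dominance effect follows from an identity for concatenated vectors. As a sketch, assume the headline cosine is computed on the concatenation of per-layer vectors $a = (a_0,\dots,a_L)$ and $g = (g_0,\dots,g_L)$; this aggregation rule and the weights $w_l$ are our assumptions for illustration, not a statement of how every paper computes $\Gamma$. Then
\begin{equation*}
\cos(a, g) \;=\; \sum_{l} w_l \cos(a_l, g_l),
\qquad
w_l \;=\; \frac{\|a_l\|\,\|g_l\|}{\|a\|\,\|g\|},
\qquad
\sum_l w_l \le 1 ,
\end{equation*}
where the last inequality is Cauchy--Schwarz applied to the per-layer norms. If the deep layers contribute $\cos(a_l, g_l) \approx 0$, the aggregate reduces to $w_0 \cos(a_0, g_0)$: for example, a layer-0 cosine of $0.42$ with a hypothetical weight $w_0 \approx 0.24$ yields an aggregate of about $0.10$ even though no deep layer is aligned.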

\paragraph{Pitfall 2: Cosine against a numerical-floor BP reference.}
If the deepest BP gradient norm has collapsed, the cosine to that vector is not a trustworthy direction-quality measurement. This is the core measurement-degeneracy failure, and it is why the protocol records $\|g_L\|$ before interpreting any deep-layer alignment statistic.

\paragraph{Pitfall 3: Batch mismatch between reference and candidate gradients.}
Using different minibatches, different augmentations, or different dropout masks for BP and FA credit vectors can inflate or destabilize the reported cosine. The reference implementation computes both vectors on the same frozen forward pass whenever the claim being tested is directional agreement rather than training robustness.

\paragraph{Pitfall 4: Baseline mismatch for depth utilization.}
Comparing a partially trainable model only to full BP or to an unmatched random baseline can make weak methods look stronger than they are. Diagnostic (d) uses architecture-matched frozen-blocks controls precisely so that ``the deep blocks helped'' is tested against the right null.

\paragraph{Pitfall 5: Silent train/eval mode inconsistencies.}
Small mode mismatches can change residual scale, normalization behavior, and therefore the diagnostic measurements themselves. The measurement scripts fix model mode explicitly and log it, because otherwise a paper can end up comparing training-time FA credit with evaluation-time BP references.

\paragraph{Pitfall 6: Post-hoc normalization that erases scale pathology.}
Renormalizing hidden states or gradients before logging can make a genuine activation-growth failure disappear from the report. For this paper, raw norms are part of the scientific object, so any normalization used for visualization must remain separate from the values used for diagnosis.

\paragraph{Pitfall 7: Missing null controls for intervention claims.}
A rescue intervention can improve cosine or accuracy for trivial reasons unless the experiment includes a null such as fresh-$B$ feedback or a matched BP+penalty control. The paper therefore treats intervention evidence as incomplete unless it separates training-specific adaptation from generic regularization or capacity effects \citep{bartunov2018assessing,moskovitz2018feedback,refinetti2023aligning}.

\section{Walk-Back Chain Methodology}
\label{app:walkback_chain}

The walk-back chain is the compressed narrative used to translate a superficially positive headline result into a falsifiable diagnostic verdict. It has four steps. Step 1 asks what the status-quo claim would be from accuracy and headline $\Gamma$ alone. Step 2 checks whether the deepest hidden-layer BP reference remains numerically meaningful; if not, the alignment claim is walked back as ungrounded measurement. Step 3 asks whether trained deep blocks outperform architecture-matched random-block baselines; if not, the training claim is walked back as unused or weakly used depth. Step 4 uses temporal replay, intervention, and cross-architecture evidence to determine whether the underlying problem is primarily measurement degeneracy, low intrinsic credit-direction quality, or both.

This chain is deliberately asymmetric. A method can pass all four steps and remain provisionally trustworthy, but failing any one of the binary detectors is enough to invalidate the stronger claim that ``deep local credit assignment is working'' on that setting. That asymmetry matches the paper's goal: not to certify methods as universally good, but to prevent unsupported success claims from surviving because the reporting pipeline asked too little of the evidence.

\section{All Seven Validations}
\label{app:all_validations}

Table~\ref{tab:all_validations} lists the seven validation exercises that support the protocol. They serve different purposes: some validate binary detection, some validate interpretation, and some validate external usefulness. Together they show that the protocol is not merely a post-hoc description of one final ResMLP run, but a portable evaluation procedure that changes conclusions across time, interventions, and architectures.

\begin{table}[t]
\centering
\small
\caption{Summary of the seven validation exercises used to justify the protocol.}
\label{tab:all_validations}
\begin{tabular}{p{0.18\linewidth}p{0.29\linewidth}p{0.23\linewidth}p{0.22\linewidth}}
\toprule
Validation & Question & Main observation & Why it matters \\
\midrule
Five-method audit & Does the status quo over-credit methods? & Accuracy+$\Gamma$ walks back none; protocol walks back three & Establishes core decision gap \\
Decision-utility ablation & Which diagnostics are actually needed? & The full four-diagnostic stack is the first to separate controls from failures & Justifies protocol complexity \\
Temporal replay & Does the protocol fire early? & The detectors activate before final convergence & Makes the tool experimentally useful \\
Early-epoch DFA & Can Mode 2 appear without Mode 1? & Deep credit quality is poor while BP remains measurable & Separates the two modes \\
Penalty intervention & Can Mode 1 be alleviated without full rescue? & Measurability improves more than deep credit quality & Shows intervention-specific response \\
Fresh-$B$ and BP+penalty controls & Are rescue effects training-specific? & Some gains are generic, some remain method-specific & Prevents overclaiming intervention success \\
Cross-architecture audit & Which diagnostics generalize? & Activation growth generalizes more broadly than gradient-floor collapse & Scopes the claims correctly \\
\bottomrule
\end{tabular}
\end{table}

A useful way to read the table is that no single validation carries the paper by itself. The five-method audit shows that the problem exists, temporal replay shows that the protocol is actionable, intervention and null controls show that the two modes respond differently, and cross-architecture evidence shows which parts of the protocol are specific to terminal-normalized residual settings and which parts are more general.

\section{Threshold Sensitivity Full Sweep}
\label{app:threshold_sweep}

The sensitivity sweep is intentionally small because the paper does not claim that all four thresholds are equally canonical. The important result is qualitative stability for diagnostics (a) and (b): over a reasonable range of nearby cutoffs, the same methods are flagged on the same audited settings, and the same controls remain unflagged. This is the strongest calibration evidence in the paper because these two diagnostics track the physical quantities most directly tied to the measurement-degeneracy story.

Diagnostic (d) is weaker and should be presented that way. Its threshold is best understood as a conservative reporting aid for depth utilization rather than as a universal constant. In practice, the full sweep should therefore be read as showing that the protocol is robust where it claims binary detection strength and intentionally modest where it is used as a contextual check on whether trained deep blocks beat architecture-matched random-block baselines.

\section{Per-Architecture Detailed Audits}
\label{app:arch_audits}

This appendix is kept short and comparative. On pre-LayerNorm ResMLP and ViT-Mini, the key pattern is the same as in the main text: residual-scale growth can become large enough that the deepest BP reference becomes numerically weak, and the status-quo pair of accuracy plus headline $\Gamma$ fails to expose that. These are the settings where both failure modes matter and where the full protocol is most necessary.

StudentNet and the CNN serve a different role. They test whether the protocol overgeneralizes from terminal-normalized residual architectures to settings where gradient-floor collapse is not expected. In those models, activation-growth checks can still reveal weak depth usage or poor scaling, but diagnostic (b) is not expected to fire in the same way. This asymmetry is not a weakness of the protocol; it is part of the empirical scoping claim of the paper and helps prevent readers from mistaking a targeted evaluation standard for a universal pathology claim \citep{xiong2020layer,bartunov2018assessing}.

\section{Reproducibility}
\label{app:reproducibility}

All headline audit results in the main text should be reported over the locked seed set $\{42, 123, 456\}$, with the same seed bundle reused across methods wherever possible so that between-method comparisons are not driven by different data orders or initialization luck. Every released result table should specify the architecture, optimizer, learning-rate schedule, batch size, augmentation recipe, number of epochs, checkpoint selection rule, and whether each diagnostic was measured at the final checkpoint or along a stored temporal trajectory.

Hyperparameters should be listed exactly as run, not reconstructed from memory after the fact. For intervention experiments, the appendix should report the penalty coefficient, where in the network the penalty is applied, and which control runs share the same added objective. For diagnostic scripts, reproducibility requires logging the model mode, minibatch identity, and layer-index convention used for per-layer statistics. The point of this appendix is simple: because the paper's claims hinge on how evaluation is performed, measurement configuration is part of the result and must be reproducible with the same care as training configuration.

\end{document}