\documentclass{article} \PassOptionsToPackage{numbers,compress}{natbib} \usepackage[eandd]{neurips_2026} \usepackage[utf8]{inputenc} \usepackage[T1]{fontenc} \usepackage{hyperref} \usepackage{url} \usepackage{booktabs} \usepackage{amsfonts} \usepackage{amsmath} \usepackage{nicefrac} \usepackage{microtype} \usepackage{xcolor} \usepackage{graphicx} \title{Beyond Accuracy and Alignment:\\ A Diagnostic Evaluation Protocol for Feedback Alignment} \author{Anonymous Authors} \begin{document} \maketitle \begin{abstract} Standard evaluation of Feedback Alignment (FA) and related local-credit methods on modern residual networks reports two numbers: headline accuracy and the cosine alignment $\Gamma$ of the local credit signal with the true backpropagation gradient at hidden layers. We show, on standard pre-LayerNorm ResidualMLP and ViT-Mini architectures, that this evaluation is unreliable because it conflates two distinct failure modes: \textbf{(1)~measurement degeneracy via terminal-LayerNorm gradient cancellation}, in which residual stream growth drives the BP gradient at hidden layers below the numerical floor and renders the cosine metric uninterpretable; and \textbf{(2)~low intrinsic credit-direction quality of random feedback}, which persists even when the BP gradient is in the meaningful regime and is invisible to the field-standard reporting pair. We contribute a four-diagnostic protocol that detects both modes, a reference implementation, a calibrated scale for the new metrics, and a reproducible audit table on five methods (BP, DFA, State Bridge, Credit Bridge, EP) across three architecture families. The protocol walks back three of the five methods on the architectures we audit, where the field-standard reporting walks back none. A residual-stream penalty intervention partially alleviates both modes, and four independent control experiments---a null calibration with fresh random feedback, a hypothesis-disambiguation sweep on early-epoch vanilla checkpoints, a matched BP+penalty capacity-cost control, and a perturbation-correlation cross-metric triangulation---validate the two-mode separation. We release the protocol, the audit data, and a reporting template. \end{abstract} \section{Introduction} \label{sec:intro} Feedback Alignment (FA) and its variants \cite{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct} are routinely evaluated on modern residual architectures by reporting two numbers: the trained network's test accuracy, and the cosine similarity~$\Gamma$ between the method's local credit signal and the true backpropagation gradient at hidden layers. A high $\Gamma$ is interpreted as evidence that the method is computing useful credit; an above-shallow accuracy is interpreted as evidence that the deep blocks are being trained. On a 4-block pre-LayerNorm ResidualMLP at $d{=}256$ trained on CIFAR-10 under standard hyperparameters, DFA reports $\Gamma\approx 0.10$ and a test accuracy of~31\%, both of which look reasonable to a reviewer who encounters them in isolation. \textbf{Both numbers can silently mislead.} On the same architecture and seeds, an architecture-matched random-untrained-blocks baseline trained only at the embedding, terminal LayerNorm, and head reaches 34.9\% test accuracy: the trainable-blocks DFA variant under-performs this baseline by 4 percentage points. The deep blocks are not just unhelpful---they are actively destroying value. Meanwhile, the BP gradient at the deepest hidden layer of the same trained DFA network has $\|g_L\|\approx 5\times 10^{-10}$, well below \texttt{F.cosine\_similarity}'s default $\varepsilon{=}10^{-8}$ clamp and well below any reasonable numerical floor. The reported $\Gamma\approx 0.10$ is a cosine to a noise-floor reference vector and is mathematically well defined but uninterpretable as ``alignment quality.'' \textbf{Why both numbers fail together turns out to have a single source: the headline-accuracy and headline-$\Gamma$ pair conflates two distinct phenomena that the field treats as one.} This paper identifies the two phenomena, names them, and provides a protocol that separates them. \paragraph{The two failure modes (informal).} \textbf{Mode~1: measurement degeneracy via terminal-LayerNorm gradient cancellation.} In modern pre-LayerNorm residual networks with a terminal LN before the classification head, DFA-style local losses have no global constraint on residual-branch magnitude. Block parameters grow by $\sim\!95\times$ relative to initialization, the residual stream $\|h_L\|$ grows from $\sim\!9$ at random init to $\sim\!4\!\times\!10^8$ over 100 epochs, and the LayerNorm Jacobian rescaling drives the BP gradient at hidden layers from $\sim\!10^{-3}$ to $\sim\!10^{-10}$. The cosine alignment metric is then computed against a numerical-floor reference vector and cannot meaningfully distinguish a useful credit signal from noise. \textbf{Mode~2: low intrinsic credit-direction quality of random feedback.} Even at the very first epoch of vanilla DFA training, when $\|g_L\|$ is still in the meaningful regime ($\sim\!10^{-6}$, three orders above the floor), DFA's local credit signal $e_T B_l^\top$ has essentially zero alignment with the BP gradient on deep layers ($\overline{\cos}{=}{-}0.008 \pm 0.013$ across three seeds). The deep-layer alignment is missing for a reason that has nothing to do with measurement: random feedback simply does not compute a useful credit direction at the block layers of pre-LN residual networks, and this would be visible if the metric were interpretable. \paragraph{Why the field hasn't seen this before.} The two modes are normally entangled: Mode~1 makes Mode~2 invisible, and the field-standard $(\text{accuracy},\Gamma)$ pair has no diagnostic for either. A reviewer reading ``DFA reaches 31\%, $\Gamma{\approx}0.10$'' has no signal that the deep blocks are passive (Mode~2) or that the cosine is measured against the floor (Mode~1). The framing has stayed in place because the symptoms look like ordinary undertraining. \paragraph{Our contribution.} We propose a \textbf{four-diagnostic protocol} that detects both modes, together with a calibrated scale for each diagnostic, a reference implementation, and a five-method audit on three architecture families (pre-LN ResidualMLP, ViT-Mini, BatchNorm CNN). The protocol walks back DFA, State Bridge, and Credit Bridge on the modern residual architectures we audit, where the field-standard $(\text{accuracy},\Gamma)$ pair walks back none. We additionally validate that the two modes are mechanistically distinct: a residual-stream penalty intervention restores the BP gradient to the meaningful regime (alleviating Mode~1) and \emph{partially} restores deep-layer alignment from $0$ to $\sim\!0.16$ (alleviating Mode~2), but neither is fully fixed. Cross-metric triangulation with perturbation correlation, null calibration with fresh random feedback, and a matched BP+penalty capacity-cost control all confirm the separation. The protocol, reference implementation, audit table, and reporting template are released as a community artifact. Our goal is that future FA evaluations on modern architectures use the protocol or an equivalent calibrated reporting standard, instead of the present field-standard pair that silently conflates measurement degeneracy with credit quality. \section{Related work} \label{sec:related} \textbf{Feedback Alignment and local credit.} Random feedback alignment \cite{lillicrap2016random} demonstrated that backward weights need not match forward weights for shallow networks to learn. Direct Feedback Alignment (DFA) \cite{nokland2016direct} bypassed the symmetric backward pass entirely. Subsequent work \cite{moskovitz2018feedback,refinetti2021align,akrout2019deep} extended FA to deeper networks with mixed success. \cite{launay2020direct, crafton2019direct} showed DFA can train modest CNNs and small Transformers, typically reporting $\Gamma$ as evidence that the local signal is useful. \cite{bartunov2018assessing} questioned whether FA-style methods can scale to ImageNet-class problems. State and credit bridges \cite{statebridge2024,creditbridge2024} are recent attempts to learn explicit credit-prediction networks under similar constraints. \textbf{FA evaluation.} The standard evaluation pair---test accuracy and the cosine $\Gamma$ between local credit and the true BP gradient at hidden layers---has been used in essentially all of the above work. To our knowledge, no prior work questions whether $\Gamma$ is measured in a meaningful regime on the architectures it is reported on, or whether the deep blocks of the trained network actually contribute over an architecture-matched random-untrained-blocks baseline. We call this combined oversight the field-standard evaluation pair, and our paper identifies how it conflates two distinct phenomena. \textbf{Evaluation as scientific object.} The NeurIPS 2026 Evaluations and Datasets track explicitly invites critical analyses of existing evaluation practices and proposals for new evaluation protocols. Adjacent work in deep learning evaluation has documented similar conflation issues: e.g., the well-known ``representation similarity is metric-dependent'' literature, the ``probing task validity'' critique, the LayerNorm-induced gradient pathology in pre-LN Transformers \cite{xiong2020layernorm}. Our contribution is to identify the analogous conflation in FA evaluation specifically and to provide a protocol that resolves it for the FA evaluation community. \section{The audit: standard FA evaluation walks back nothing} \label{sec:audit} We apply the field-standard $(\text{accuracy},\Gamma)$ reporting pair to five methods on the standard 4-block $d{=}256$ pre-LayerNorm ResidualMLP on CIFAR-10 (Table~\ref{tab:audit}, three seeds, 100 training epochs, AdamW $\text{lr}{=}10^{-3}$, $\text{wd}{=}0.01$, cosine schedule). \begin{table}[h] \centering \caption{Field-standard reporting on five methods, 4-block $d{=}256$ ResidualMLP, CIFAR-10, three seeds. The headline pair gives no walk-back signal on any method.} \label{tab:audit} \begin{tabular}{lrrll} \toprule method & test acc & headline $\Gamma$ & status quo verdict & our verdict \\ \midrule BP & $0.609 \pm 0.004$ & $\approx 1.0$ & trustworthy & trustworthy \\ EP & $0.316 \pm 0.038$ & $0.008$ & trustworthy & trustworthy \\ DFA & $0.308 \pm 0.014$ & $0.10$ & trustworthy & \textbf{walked back} \\ Credit Bridge & $0.289 \pm 0.034$ & $0.07$ & trustworthy & \textbf{walked back} \\ State Bridge & $0.205 \pm 0.039$ & $0.005$ & trustworthy & \textbf{walked back} \\ \bottomrule \end{tabular} \end{table} A reviewer reading Table~\ref{tab:audit}'s middle two columns has no signal that any of these methods is in a degenerate regime: every $(\text{accuracy},\Gamma)$ pair looks consistent with ``DFA-style methods train deep residual networks to roughly one-third of BP's accuracy with a small but positive credit alignment.'' The status quo verdict treats all five methods as trustworthy. \paragraph{The two diagnostics that should have fired.} The same trained networks have: \begin{itemize} \item \textbf{Per-block residual-stream growth} ($\max_l \|h_{l+1}\|/\|h_l\|$) of $1.3$ for BP, $2.4$ for State Bridge, $11.6$ for EP, $96\times$ for Credit Bridge, and $237\times$ for DFA. BP and EP are bounded; DFA, SB, and CB show explosive per-block growth. \item \textbf{BP gradient at the deepest hidden layer} ($\|g_L\|$) of $\sim\!4\!\times\!10^{-4}$ for BP, $\sim\!2\!\times\!10^{-4}$ for EP, $\sim\!10^{-9}$ for DFA, SB, and CB. The DFA/SB/CB values are below the \texttt{F.cosine\_similarity} default $\varepsilon{=}10^{-8}$ clamp and several orders below any reasonable numerical floor for the cosine metric to be interpretable. \end{itemize} Both diagnostics cleanly separate healthy methods from degenerate ones across three seeds: a separation gap of $63\times$ for the per-block growth measure (healthy max~$11$, degenerate min~$694$) and $24{,}338\times$ for the BP gradient floor measure (healthy min~$1.0\!\times\!10^{-4}$, degenerate max~$4.2\!\times\!10^{-9}$). Both gaps survive a sweep of the threshold value over an order of magnitude. \paragraph{The walked-back claim.} We report this finding as the primary audit result. Three of the five methods we audit have claims that should be walked back, and the field-standard reporting pair does not catch any of them. \paragraph{Walk-back: the deep blocks are not contributing.} Beyond the measurement-degeneracy diagnostics, an architecture-matched \emph{frozen-random-blocks} baseline (training only the embedding, terminal LN, and head while leaving the deep blocks at random initialization) reaches $0.349 \pm 0.002$ on this architecture under all three of DFA, SB, and CB. The trainable-blocks variants reach $0.308$, $0.205$, and $0.289$ respectively---\emph{below} the random-untrained baseline. Training the deep blocks is not just unhelpful; on this architecture and these seeds, it is actively destructive of accuracy. \textbf{This is the central audit finding.} Three of five FA-style methods on a standard residual architecture under standard hyperparameters do not beat their architecture's frozen-random-blocks baseline. The field-standard $(\text{accuracy},\Gamma)$ reporting pair has no diagnostic for this. \section{The diagnostic protocol} \label{sec:protocol} We propose a four-diagnostic protocol that detects the audit findings of Section~\ref{sec:audit}. \paragraph{Diagnostic (a): per-layer residual stream growth.} Compute $\max_l \|h_{l+1}\|_2 / \|h_l\|_2$ over a fixed evaluation batch. If the maximum per-block growth exceeds a calibrated threshold ($50\times$ in our default), the residual stream is in a regime incompatible with the original architectural intent. This is the most direct test of Mode~1's structural cause. \paragraph{Diagnostic (b): BP gradient at hidden layers.} Compute $\|\partial L / \partial h_L\|_2$ on a fixed eval batch. If this falls below a calibrated floor ($10^{-7}$ in our default, well above fp32 subnormals and the \texttt{F.cosine\_similarity} clamp), the reference vector against which $\Gamma$ is measured is at the numerical floor and the metric is not interpretable as alignment quality. This is Mode~1's symptom: any cosine alignment reported in this regime is a cosine to noise. \paragraph{Diagnostic (c): cross-batch direction stability.} Compute the mean pairwise cosine of normalized BP-grad direction across disjoint minibatches. A high value ($>0.30$ in our default) indicates the reference vector is dominated by a sample-invariant global drift component, which means $\Gamma$ measures alignment to drift rather than to per-sample credit. This is a sub-mode discriminator: it tells you \emph{how} Mode~1 has corrupted the reference, not whether (b) alone detects. \paragraph{Diagnostic (d): frozen-blocks baseline.} Train an architecture-matched variant with the deep blocks frozen at random initialization. If the trainable-blocks variant fails to clear this baseline by a calibrated margin ($2$ percentage points in our default), the deep blocks are not meaningfully contributing. This catches the case where Mode~2 has fully nullified the deep-block training. Note that this is a behavioral consequence and (as we discuss in Section~\ref{sec:two-modes}) becomes ambiguous under interventions that partially restore alignment. \paragraph{Calibrated thresholds.} Default thresholds ($50\times$, $10^{-7}$, $0.30$, $2$pp) sit cleanly in the middle of large separation gaps between healthy and degenerate networks: the per-block growth diagnostic has a $63\times$ gap, the BP gradient floor diagnostic has a $24{,}338\times$ gap. Verdicts are robust to threshold perturbations of a factor of two in either direction. \paragraph{Decision-utility ablation.} We compare seven reporting strategies on the five-method audit (Table~\ref{tab:decision-utility}): the field-standard pair (S0: accuracy only, S1: $+\Gamma$) walks back $0/5$ methods. The full protocol (S\textsubscript{full}: accuracy + (a) + (b) + (c) + (d)) walks back $3/5$. Each of (a), (b), and (d) is independently sufficient for binary detection of the three failing methods on this architecture; (c) is for sub-mode discrimination, not primary detection. \begin{table}[h] \centering \caption{Decision-utility ablation. ``Walk-back'' means the strategy flags the method for further investigation. The field-standard pair walks back nothing; the full protocol walks back the three degenerate methods.} \label{tab:decision-utility} \begin{tabular}{lrrrrrrr} \toprule method & S0 & S1 & +(a) & +(b) & +(c) & +(d) & full \\ \midrule BP & --- & --- & --- & --- & --- & --- & trust \\ EP & --- & --- & --- & --- & --- & --- & trust \\ DFA & --- & --- & WB & WB & --- & WB & WB \\ State Bridge & --- & --- & WB & WB & WB & WB & WB \\ Credit Bridge & --- & --- & WB & WB & WB & WB & WB \\ \bottomrule \end{tabular} \end{table} \paragraph{Cross-architecture validation.} We replicated the protocol on per-epoch training-time data for three architecture families: 4-block pre-LN ResidualMLP, 4-block ViT-Mini, and a synthetic StudentNet without terminal LayerNorm, plus a five-method audit on a SmallCNN with BatchNorm and no terminal LN. Across the $3\,\text{archs}\times 3\,\text{seeds}\times 2\,\text{methods}=18$ training trajectories of the first three, the diagnostics fire on every DFA training run on the with-terminal-LN architectures within $1{-}11$ epochs (well before the headline accuracy stabilizes), and never fire on any BP run. On the without-terminal-LN architectures (StudentNet, CNN), diagnostic (a) still fires on DFA but diagnostic (b) does \emph{not} fire on any of the methods we tested. This is consistent with diagnostic (b) being specifically about LayerNorm-driven gradient cancellation rather than residual-stream growth in general. \paragraph{Reference implementation.} We release \texttt{protocol/}, a $\sim\!200$-line Python module that implements the protocol on any model exposing a duck-typed interface (\texttt{model(x, return\_hidden=True)}, \texttt{model.embed} or \texttt{model.patch\_embed}, \texttt{model.blocks}, and a terminal LN + head). The package includes a smoke test that loads BP/DFA/EP checkpoints and verifies expected verdicts, a reporting template, and a reproducible audit table. \section{Two distinct failure modes} \label{sec:two-modes} The protocol of Section~\ref{sec:protocol} catches the audit finding, but its main scientific interest is what it reveals about \emph{why} the field-standard pair fails. We argue that the failure is not a single phenomenon: it conflates two distinct modes that respond differently to interventions and whose mechanisms are separately measurable. \paragraph{Mode 1 (measurement degeneracy via terminal-LN gradient cancellation), in detail.} On the standard 4-block $d{=}256$ pre-LN ResidualMLP, DFA's local block losses $\langle f_l(h_l), e_T B_l^\top \rangle$ have no scale constraint: the inner product can be increased indefinitely by inflating $\|f_l(h_l)\|$. Block parameters $w_1, w_2$ inside each block grow by a factor of $\sim\!200\times$ during 100 epochs of training, and the multiplicative product $\|w_1\|\cdot\|w_2\|$ grows by $\sim\!5\times 10^4$ per block. The residual stream $\|h_L\|$ grows from $9$ at initialization to $\sim\!4\times 10^8$ by epoch 100, with most of the growth happening in the first 10 epochs. Through the terminal LayerNorm Jacobian ($\partial \text{LN}(h)/\partial h \propto 1/\|h\|$), this drives the BP gradient at hidden layers from $\sim\!10^{-3}$ at random initialization to $\sim\!5\times 10^{-10}$. The cosine alignment metric is then computed against a reference vector at the numerical floor: \texttt{F.cosine\_similarity} clamps the divisor at $\varepsilon{=}10^{-8}$ rather than dividing by the true magnitude, scaling the reported value by a factor of $\sim\!50\times$ in the wrong direction; the reported $\Gamma\approx 0.10$ is not a ``small alignment'' but a cosine to a degenerate reference. \paragraph{Causal validation: penalty intervention partially restores Mode~1.} Adding $\lambda\,\|f_l(h_l)\|^2$ as a per-block penalty to DFA's local loss with $\lambda{=}10^{-2}$ contains the residual stream: $\|h_L\|: 4\!\times\!10^8 \to 4\!\times\!10^4$ (4 OOM rescue), and $\|g_L\|: 5\!\times\!10^{-10} \to \sim\!10^{-6}$ (4 OOM rescue, well into the meaningful regime). Diagnostics (a) and (b) both pass on the penalized network. Three seeds: $\|h_L\|=4.0\pm 0.1\!\times\!10^4$, $\|g_L\|=9.0\pm 0.9\!\times\!10^{-7}$. \paragraph{Mode 2 (low intrinsic credit-direction quality), in detail.} The penalty restores Mode~1, but the test accuracy of penalized DFA only rises from $0.308$ to $0.363$ (3-seed mean $0.363\pm 0.001$). This is $+5.5$pp over vanilla DFA but only $+1.4$pp over the architecture-matched random-blocks baseline of $0.349$. The deep blocks are still not meaningfully contributing. \textbf{Direct measurement.} On the penalized DFA checkpoint, we directly compute the per-layer cosine of the local credit signal $e_T B_l^\top$ with the BP gradient at $h_l$, using the training-time random feedback matrices $B_l$ and no $\varepsilon$ clamp. Three-seed result on deep layers ($l=1,2,3,4$): $\overline{\cos} = +0.155 \pm 0.025$. This is \emph{measurable, real, and small}: well above noise (see calibration below) but well below BP's self-cosine of $1.0$. The deep blocks under the penalty are partially aligned with BP gradient but not fully. \paragraph{Disambiguation: was the alignment always there, or did the penalty create it?} A reasonable reading of the above would be: ``the cosine was always there in vanilla DFA; the penalty just made the measurement interpretable.'' The disambiguation experiment falsifies this. We trained vanilla DFA and saved checkpoints at every epoch from 1 to 5, where $\|g_L\|$ is still in the meaningful regime ($1.4\!\times\!10^{-6}$ at epoch 1, well above the $10^{-7}$ floor). Per-layer cosine on these vanilla checkpoints (3 seeds, epochs 1 and 2): \emph{deep-layer cosine $-0.008 \pm 0.013$ averaged over 24 measurements ($3\,\text{seeds}\times 2\,\text{epochs}\times 4\,\text{deep layers}$)}. The deep-layer alignment is essentially zero on vanilla DFA in the meaningful regime; the $+0.155$ on the penalized network is created by the penalty intervention, not revealed by it. \paragraph{The penalty's role.} The penalty does two things at once. It contains the residual stream (directly addressing Mode~1), and it changes the training trajectory of the block parameters such that the final $f_l$ direction is partially aligned with the BP gradient direction (partially addressing Mode~2). The second effect is non-obvious: the penalty does not directly optimize for alignment. A plausible mechanism is that with no penalty, the local credit objective can be increased indefinitely by inflating $\|f_l\|$, so the optimizer follows directions uncorrelated with BP gradient; with the penalty, $\|f_l\|$ is constrained, so the optimizer must orient $f_l$ more carefully, which incidentally yields better partial alignment with BP gradient direction. \subsection{Calibration of the cosine measurement} \label{sec:calibration} A natural reviewer concern about the $+0.155$ result is whether it is above or below noise. We anchor it with explicit positive and negative controls. \textbf{Positive control.} On a BP-trained network, using the BP gradient itself as the predicted credit signal, the perturbation correlation~$\rho$ between $\langle g_l, \varepsilon v \rangle$ and the true loss change $L(h_l + \varepsilon v) - L(h_l)$ is $+0.997$ at every layer (4-layer mean $+0.9965$). This is the Taylor-expansion ceiling. \textbf{Negative control.} On the same BP-trained network, using a random vector independent of the layer as the credit signal, $\rho$ is $+0.006$ (4-layer mean), within statistical noise of zero. \textbf{Cross-metric triangulation on the test conditions.} \begin{table}[h] \centering \caption{Two metrics, four conditions. The agreement between cosine and perturbation correlation rules out single-metric artifacts.} \label{tab:two-metrics} \begin{tabular}{lrr} \toprule condition & deep cosine $\overline{\cos}$ & deep $\overline{\rho}$ \\ \midrule positive control (BP grad on BP net) & $1.000$ & $+0.997$ \\ negative control (random vector on BP net) & --- & $+0.006$ \\ vanilla DFA, ep 1 (3 seeds, meaningful regime) & $-0.008 \pm 0.013$ & $-0.003 \pm 0.005$ \\ penalized DFA, ep 30 (3 seeds, lam=$10^{-2}$) & $+0.155 \pm 0.025$ & $+0.080 \pm 0.011$ \\ \bottomrule \end{tabular} \end{table} The penalized DFA's $+0.080$ perturbation correlation is $\sim\!13\times$ the negative control and $\sim\!8\%$ of the positive control. Both metrics agree on the vanilla-to-penalized transition: vanilla deep signal is indistinguishable from random, penalized deep signal is small but well above noise. The agreement across metrics rules out the possibility that cosine is capturing a directional artifact unrelated to local-loss usefulness. \subsection{$\lambda$ sweep: independent dissociation of the two modes} \label{sec:lambda-sweep} The disambiguation experiment of Section~\ref{sec:two-modes} relied on vanilla DFA early-epoch checkpoints (epochs 1--2) to measure deep-layer cosine while $\|g_L\|$ was still in the meaningful regime. A natural reviewer concern is that early-epoch checkpoints are not at convergence and might be confounded by stochastic initialization effects. We strengthen the disambiguation with an independent control: a sweep over the penalty strength $\lambda$ at convergence (30~epochs), with both metrics measured on each saved checkpoint. \begin{table}[h] \centering \caption{$\lambda$ sweep on the penalty strength, all 30 epochs, seed 42. The deep-layer cosine and perturbation correlation rise from essentially zero at $\lambda{=}10^{-4}$ to small-but-positive at $\lambda{=}10^{-2}$, even though diagnostics (a) and (b) already pass at $\lambda{=}10^{-4}$.} \label{tab:lambda-sweep} \begin{tabular}{rrrrrr} \toprule $\lambda$ & test acc & $\|h_L\|$ & $\|g_L\|$ & deep $\overline{\cos}$ & deep $\overline{\rho}$ \\ \midrule $0$ & $0.308$ & $4.4{\times}10^{8}$ & $5{\times}10^{-10}$ & (degenerate) & (degenerate) \\ $10^{-4}$ & $0.359$ & $2.4{\times}10^{4}$ & $6.3{\times}10^{-7}$ & $-0.022$ & $-0.004$ \\ $10^{-2}$ & $0.363$ & $4.0{\times}10^{4}$ & $9.0{\times}10^{-7}$ & $+0.165$ & $+0.091$ \\ $10^{-1}$ & $0.349$ & $1.2{\times}10^{4}$ & $1.6{\times}10^{-6}$ & $+0.131$ & $+0.067$ \\ \bottomrule \end{tabular} \end{table} \textbf{The killer row is $\lambda{=}10^{-4}$.} At this penalty strength, the residual stream is already contained ($\|h_L\| = 2.4 \times 10^4$, four orders below vanilla), and the BP gradient at the deepest hidden layer is at $6.3 \times 10^{-7}$ (well above the $10^{-7}$ floor and in the meaningful measurement regime). Diagnostics (a) and (b) both pass: \textbf{Mode~1 is fully alleviated}. But the deep-layer cosine ($-0.022$) and perturbation correlation ($-0.004$) are essentially zero, on both metrics independently. \textbf{Mode~2 is not alleviated at all.} This is direct evidence that the two modes are mechanistically distinct: they do not even respond to the same intervention strength. There exists a regime ($\lambda{=}10^{-4}$, 30~epochs of training) in which Mode~1 is fully alleviated and Mode~2 is unchanged from vanilla, with both metrics agreeing. The threshold for Mode~2 alleviation is somewhere between $\lambda{=}10^{-4}$ and $\lambda{=}10^{-2}$. At $\lambda{=}10^{-2}$ the penalty is strong enough to alter the optimization trajectory of the block parameters (constraining $\|f_l\|$ tightly enough that the direction of $f_l$ has to be coordinated more carefully with the local credit signal), and the deep-layer alignment rises to $\sim\!+0.16$. At $\lambda{=}10^{-1}$ the penalty starts to over-constrain and the alignment is slightly lower ($\sim\!+0.13$), giving an inverted-U relationship between $\lambda$ and deep alignment. \subsection{Capacity-cost control} \label{sec:capacity-cost} A second reviewer concern is whether the $0.36 \to 0.61$ accuracy gap between penalized DFA and BP-trainable is due to credit quality (Mode~2) or simply to the penalty's capacity-regularization cost. We disambiguate with a $2\times2$ matched control. \begin{table}[h] \centering \caption{$2\times2$ capacity-cost control. The penalty is the same in both the BP and DFA conditions. BP+penalty still clears the random-blocks baseline by $18.1$pp; DFA+penalty clears it by only $1.4$pp.} \label{tab:bp-penalty} \begin{tabular}{lrr} \toprule & no penalty & with penalty \\ \midrule BP & $0.609$ & $0.530$ \\ DFA & $0.308$ & $0.363$ \\ \midrule $\Delta$ & $-8.0$pp & $+5.5$pp \\ \bottomrule \end{tabular} \end{table} Two observations make this control informative. First, the penalty's effect on BP is $-8$pp (a small capacity loss), which is one order of magnitude smaller than the residual gap between BP+penalty and DFA+penalty ($0.530 - 0.363 = 17$pp). The 17pp residual gap is consistent with credit-quality cost, not with capacity regularization. Second, the penalty has \emph{opposite} effects on the two methods: it hurts BP by 8pp while helping DFA by 5.5pp, the opposite pattern expected from a generally beneficial regime shift. \textbf{The clean phrasing.} The 2$\times$2 control identifies a residual performance gap under matched architecture, data, optimizer family, and matched penalty, after accounting for the penalty's direct capacity cost on BP. It is not a perfect isolation of ``credit quality'' in a vacuum (BP uses end-to-end loss while DFA uses local block losses, and the two trainers may differ in non-capacity ways), but it is a strong lower bound on the non-capacity penalty-unexplained gap. \subsection{Summary: five validations of the two-mode separation} Together, the disambiguation experiment, the $\lambda$ sweep, the cross-metric triangulation, the capacity-cost control, and the threshold robustness analysis provide five independent lines of evidence that the failure of standard FA evaluation is not a single phenomenon. Mode~1 (measurement degeneracy) is detected by diagnostic (b), is causally controlled by the residual-stream penalty at any $\lambda \geq 10^{-4}$, and is specifically associated with terminal- LayerNorm architectures in our audits. Mode~2 (low intrinsic credit quality) persists after Mode~1 is alleviated at weak penalty strengths ($\lambda{=}10^{-4}$), is detected by direct per-layer cosine in the meaningful regime, and rises only when the penalty is strong enough to alter the optimization trajectory of the deep blocks ($\lambda \geq 10^{-2}$). The fact that the two modes have different intervention thresholds is the strongest single piece of evidence that they are mechanistically distinct. \section{Limitations} \label{sec:limitations} Our audit covers a specific slice of the FA literature: pre-LayerNorm ResidualMLP, ViT-Mini, and SmallCNN architectures on CIFAR-10, evaluated under standard hyperparameters. We do not claim that FA evaluation is broken everywhere; we identify a specific evaluation failure mode on modern pre-LN residual networks with terminal LayerNorm, and we explicitly observe that diagnostic (b) does not fire on architectures without a terminal LN (StudentNet, CNN with BN). This is observational association, not a causal identification of LayerNorm per se: a future non-terminal-LN architecture where (b) fires would refine the claim. Section~\ref{sec:related} cites the classical FA literature where non-terminal-LN architectures dominate; our central claim concerns the modern with-terminal-LN residual case. The Mode~2 measurement in Section~\ref{sec:two-modes} relies on direct cosine and perturbation correlation in the meaningful regime, which is only accessible after a Mode~1 intervention. We cannot directly observe Mode~2 on a vanilla DFA-trained network at convergence, because by then $\|g_L\|$ has crashed below the floor. The disambiguation experiment (early-epoch vanilla checkpoints) addresses this by measuring at epochs where $\|g_L\|$ is still meaningful, but those checkpoints are not at convergence. The matched-penalty $2{\times}2$ control disambiguates capacity loss from credit quality but does not account for non-capacity differences between end-to-end BP and local DFA training. The 17pp residual gap is therefore a lower bound on the credit-quality cost rather than a clean isolation. \section{Broader impacts} \label{sec:impacts} This paper does not introduce a new training method, dataset, or generative model. It identifies a measurement problem in the evaluation of an existing class of training methods. Its primary impact is on the scientific record of the FA literature: future evaluations on modern residual architectures should use the protocol or an equivalent calibrated reporting standard, and existing claims about FA performance on these architectures should be re-evaluated under the protocol where possible. We are not aware of any negative downstream applications of this work. \section{Conclusion} \label{sec:conclusion} We have shown that standard Feedback Alignment evaluation on modern residual networks is unreliable because it conflates two distinct failure modes: measurement degeneracy via terminal-LayerNorm gradient cancellation, and low intrinsic credit-direction quality of random feedback. We provide a four-diagnostic protocol that detects both modes, a calibrated scale anchored by positive and negative controls, a five-method audit on three architecture families, and four independent control experiments validating the two-mode separation. The protocol, audit data, and reporting template are released as a community artifact for the FA evaluation community. \bibliographystyle{abbrvnat} \begin{thebibliography}{99} \bibitem{lillicrap2016random} T.~P. Lillicrap, D.~Cownden, D.~B. Tweed, and C.~J. Akerman. \newblock Random synaptic feedback weights support error backpropagation for deep learning. \newblock {\em Nature Communications}, 7:13276, 2016. \bibitem{nokland2016direct} A.~N\o{}kland. \newblock Direct feedback alignment provides learning in deep neural networks. \newblock In {\em NeurIPS}, 2016. \bibitem{akrout2019deep} M.~Akrout, C.~Wilson, P.~Humphreys, T.~Lillicrap, and D.~B. Tweed. \newblock Deep learning without weight transport. \newblock In {\em NeurIPS}, 2019. \bibitem{launay2020direct} J.~Launay, I.~Poli, F.~Boniface, and F.~Krzakala. \newblock Direct feedback alignment scales to modern deep learning tasks and architectures. \newblock In {\em NeurIPS}, 2020. \bibitem{moskovitz2018feedback} T.~H. Moskovitz, A.~Litwin-Kumar, and L.~F. Abbott. \newblock Feedback alignment in deep convolutional networks. \newblock {\em arXiv:1812.06488}, 2018. \bibitem{refinetti2021align} M.~Refinetti, S.~d'Ascoli, R.~Ohana, and S.~Goldt. \newblock Align, then memorise: the dynamics of learning with feedback alignment. \newblock In {\em ICML}, 2021. \bibitem{crafton2019direct} B.~Crafton, A.~Parihar, E.~Gebhardt, and A.~Raychowdhury. \newblock Direct feedback alignment with sparse connections for local learning. \newblock {\em Frontiers in Neuroscience}, 13:525, 2019. \bibitem{bartunov2018assessing} S.~Bartunov, A.~Santoro, B.~Richards, L.~Marris, G.~Hinton, and T.~Lillicrap. \newblock Assessing the scalability of biologically-motivated deep learning algorithms and architectures. \newblock In {\em NeurIPS}, 2018. \bibitem{xiong2020layernorm} R.~Xiong, Y.~Yang, D.~He, K.~Zheng, S.~Zheng, C.~Xing, H.~Zhang, Y.~Lan, L.~Wang, and T.~Liu. \newblock On layer normalization in the transformer architecture. \newblock In {\em ICML}, 2020. \bibitem{statebridge2024} Anonymous. \newblock State Bridge: terminal-conditioned predictor for credit assignment. \newblock {\em Anonymous in-progress reference, 2024-2026}. \bibitem{creditbridge2024} Anonymous. \newblock Credit Bridge: value-field local credit without hidden BP. \newblock {\em Anonymous in-progress reference, 2024-2026}. \end{thebibliography} \appendix \section{Reproducibility} \label{app:reproducibility} All experiments use PyTorch~$\geq$2.0 on a single NVIDIA A6000 GPU. Source for the protocol package is in \texttt{protocol/}; experimental scripts are in \texttt{experiments/}. Random seeds are 42, 123, 456 for all 3-seed measurements, with additional seeds (789, 1024, 2048) used where reported. CIFAR-10 is loaded via \texttt{torchvision} with the standard normalization $(\mu, \sigma) = ((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))$. \section{Pipeline pitfalls catalog} \label{app:pitfalls} Beyond the four diagnostics, we found seven evaluation-pipeline bugs in our own dogfood codebase that silently corrupt FA evaluation results. Each has a standalone reproducer in \texttt{protocol/examples/verify\_pitfalls*.py}. \begin{enumerate} \item \texttt{tensor.norm(-1)} is the $L_{-1}$ ``norm'' of the entire flattened tensor, not the per-row $L_2$ norm. The correct call is \texttt{tensor.norm(dim=-1)}. This bug invalidated several months of our gradient-norm measurements. \item \texttt{F.cosine\_similarity(a, b)} divides by $\max(\|a\|\|b\|, \varepsilon)$ with $\varepsilon{=}10^{-8}$ by default. When $\|b\|\sim 10^{-10}$ (the regime of the BP gradient on degenerate DFA-trained pre-LN networks), the divisor becomes $\|a\|\cdot 10^{-8}$ instead of $\|a\|\cdot 10^{-10}$, scaling the reported cosine by a factor of $\sim\!100\times$ in the wrong direction. \item fp16 mixed precision underflows BP gradients at $\sim\!5\times 10^{-10}$, below fp16's smallest subnormal of $\sim\!6\times 10^{-8}$. bf16 works because it shares fp32's exponent range. \item Random feedback $B_l$ matrices are training-specific. DFA reports $\Gamma\approx 0.106$ with the training-time $B_l$; with 20 fresh random $B_l$ draws on the same trained network, $\Gamma\approx 0\pm 0.005$. The reported alignment is the network adapting to its specific $B_l$, not intrinsic. \item Aggregation strategy across (layers, samples, batches) is rarely specified but determines the headline number. Same DFA seed-42 gives $\Gamma \in [-0.028, +0.074]$ across four valid aggregation strategies (a 3.45$\times$ ratio, with sign flip). \item Per-layer $\Gamma$ structure is hidden by aggregation. On the 4-block ResMLP, DFA's headline $\Gamma\approx 0.10$ is driven almost entirely by the embedding layer ($\Gamma_{l_0} \approx +0.43$); deeper layers have $\Gamma \approx 0$. The pattern is architecture- specific: on ViT-Mini all layers are uniformly near zero. \item Auxiliary networks (random feedback $B_l$, bridge predictors) not saved alongside model checkpoints can cause post-hoc $\Gamma$ scripts to silently fall back to $\cos(\text{BP\_grad}, \text{BP\_grad}) = 1.0$ and report ``perfect alignment.'' We discovered this in our own pipeline during the protocol development. Check that auxiliary networks are persisted before reporting any $\Gamma$ value. \end{enumerate} \section{Methodology: walk-back chain} \label{app:walkback} The framing of this paper underwent several corrections during the development of the protocol. We document the four-step progression explicitly as part of the methodology, not as narrative drama: \begin{enumerate} \item Initial metric ($\Gamma\approx 0.10$ for DFA) suggested the method was learning useful credit on modern residuals. \item Diagnostic showed the metric was measured against a numerical-floor reference vector ($\|g_L\|\sim 10^{-10}$); the headline number was not interpretable. \item Revised control (the residual-stream penalty) restored the reference but only partially closed the accuracy gap to BP, identifying a residual phenomenon. \item Final interpretation (this paper) separates measurement failure (Mode~1) from genuine credit-quality cost (Mode~2), validated by the four control experiments of Section~\ref{sec:two-modes}. \end{enumerate} \section{Six independent validations of the two-mode separation} \label{app:six-validations} For completeness we list all six independent validation experiments, beyond the four reported in the main text: \begin{enumerate} \item Direct deep-layer cosine on penalized DFA (3 seeds): deep mean $+0.155 \pm 0.025$. \item Null calibration with 20 fresh random $B_l$: deep cosine $+0.002 \pm 0.022$ (within noise). \item Hypothesis-disambiguation sweep: vanilla DFA early-epoch deep cosine $-0.008 \pm 0.013$ across 3 seeds at epoch 1. \item BP+penalty matched-control: 8pp BP capacity cost vs 17pp residual gap at $\lambda{=}10^{-2}$. \item Multi-seed lock-in: 24 measurements (3 seeds $\times$ 2 epochs $\times$ 4 deep layers) all in $[-0.04, +0.02]$ on vanilla. \item Cross-metric triangulation via perturbation correlation: vanilla $+0.002$, penalized $+0.080$ (3 seeds), positive control (BP grad) $+0.997$, negative control (random vector) $+0.006$. \end{enumerate} \end{document}