From 33bc08f55b45bea9063c16c902f0f29f13cab9aa Mon Sep 17 00:00:00 2001 From: YurenHao0426 Date: Wed, 8 Apr 2026 04:30:20 -0500 Subject: Archive failed v1 draft as v1_rejected.tex; remove main.tex/main.pdf MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit User rejected the v1 draft as '流水账实验报告' (sequential experiment report). Round 22 + 23 redid the outline with E&D-genre prescription. Saving v1 as v1_rejected.tex for reference. New main.tex will be written from round 24 LaTeX skeleton (codex offered to provide it), section by section, with codex check on each section's prose. --- paper/main.tex | 823 --------------------------------------------------------- 1 file changed, 823 deletions(-) delete mode 100644 paper/main.tex (limited to 'paper/main.tex') diff --git a/paper/main.tex b/paper/main.tex deleted file mode 100644 index f6295ff..0000000 --- a/paper/main.tex +++ /dev/null @@ -1,823 +0,0 @@ -\documentclass{article} - -\PassOptionsToPackage{numbers,compress}{natbib} -\usepackage[eandd]{neurips_2026} - -\usepackage[utf8]{inputenc} -\usepackage[T1]{fontenc} -\usepackage{hyperref} -\usepackage{url} -\usepackage{booktabs} -\usepackage{amsfonts} -\usepackage{amsmath} -\usepackage{nicefrac} -\usepackage{microtype} -\usepackage{xcolor} -\usepackage{graphicx} - -\title{Beyond Accuracy and Alignment:\\ A Diagnostic Evaluation Protocol for Feedback Alignment} - -\author{Anonymous Authors} - -\begin{document} - -\maketitle - -\begin{abstract} -Standard evaluation of Feedback Alignment (FA) and related local-credit -methods on modern residual networks reports two numbers: headline accuracy -and the cosine alignment $\Gamma$ of the local credit signal with the true -backpropagation gradient at hidden layers. 
We show, on standard pre-LayerNorm -ResidualMLP and ViT-Mini architectures, that this evaluation is unreliable -because it conflates two distinct failure modes: \textbf{(1)~measurement -degeneracy via terminal-LayerNorm gradient cancellation}, in which residual -stream growth drives the BP gradient at hidden layers below the numerical -floor and renders the cosine metric uninterpretable; and \textbf{(2)~low -intrinsic credit-direction quality of random feedback}, which persists even -when the BP gradient is in the meaningful regime and is invisible to the -field-standard reporting pair. - -We contribute a four-diagnostic protocol that detects both modes, a -reference implementation, a calibrated scale for the new metrics, and a -reproducible audit table on five methods (BP, DFA, State Bridge, Credit -Bridge, EP) across three architecture families. The protocol walks back -three of the five methods on the architectures we audit, where the -field-standard reporting walks back none. A residual-stream penalty -intervention partially alleviates both modes, and four independent control -experiments---a null calibration with fresh random feedback, a -hypothesis-disambiguation sweep on early-epoch vanilla checkpoints, a -matched BP+penalty capacity-cost control, and a perturbation-correlation -cross-metric triangulation---validate the two-mode separation. We release -the protocol, the audit data, and a reporting template. -\end{abstract} - -\section{Introduction} -\label{sec:intro} - -Feedback Alignment (FA) and its variants -\cite{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct} -are routinely evaluated on modern residual architectures by reporting two -numbers: the trained network's test accuracy, and the cosine -similarity~$\Gamma$ between the method's local credit signal and the true -backpropagation gradient at hidden layers. 
A high $\Gamma$ is interpreted -as evidence that the method is computing useful credit; an above-shallow -accuracy is interpreted as evidence that the deep blocks are being trained. -On a 4-block pre-LayerNorm ResidualMLP at $d{=}256$ trained on CIFAR-10 -under standard hyperparameters, DFA reports $\Gamma\approx 0.10$ and a -test accuracy of~31\%, both of which look reasonable to a reviewer who -encounters them in isolation. - -\textbf{Both numbers can silently mislead.} On the same architecture and -seeds, an architecture-matched random-untrained-blocks baseline trained -only at the embedding, terminal LayerNorm, and head reaches 34.9\% test -accuracy: the trainable-blocks DFA variant under-performs this baseline by -4 percentage points. The deep blocks are not just unhelpful---they are -actively destroying value. Meanwhile, the BP gradient at the deepest hidden -layer of the same trained DFA network has $\|g_L\|\approx 5\times 10^{-10}$, -well below \texttt{F.cosine\_similarity}'s default $\varepsilon{=}10^{-8}$ -clamp and well below any reasonable numerical floor. The reported -$\Gamma\approx 0.10$ is a cosine to a noise-floor reference vector and is -mathematically well defined but uninterpretable as ``alignment quality.'' - -\textbf{Why both numbers fail together turns out to have a single source: -the headline-accuracy and headline-$\Gamma$ pair conflates two distinct -phenomena that the field treats as one.} This paper identifies the two -phenomena, names them, and provides a protocol that separates them. - -\paragraph{The two failure modes (informal).} -\textbf{Mode~1: measurement degeneracy via terminal-LayerNorm gradient -cancellation.} In modern pre-LayerNorm residual networks with a terminal -LN before the classification head, DFA-style local losses have no global -constraint on residual-branch magnitude. 
Block parameters grow by -$\sim\!95\times$ relative to initialization, the residual stream -$\|h_L\|$ grows from $\sim\!9$ at random init to $\sim\!4\!\times\!10^8$ -over 100 epochs, and the LayerNorm Jacobian rescaling drives the BP -gradient at hidden layers from $\sim\!10^{-3}$ to $\sim\!10^{-10}$. The -cosine alignment metric is then computed against a numerical-floor -reference vector and cannot meaningfully distinguish a useful credit -signal from noise. - -\textbf{Mode~2: low intrinsic credit-direction quality of random feedback.} -Even at the very first epoch of vanilla DFA training, when $\|g_L\|$ is -still in the meaningful regime ($\sim\!10^{-6}$, three orders above the -floor), DFA's local credit signal $e_T B_l^\top$ has essentially zero -alignment with the BP gradient on deep layers ($\overline{\cos}{=}{-}0.008 -\pm 0.013$ across three seeds). The deep-layer alignment is missing for a -reason that has nothing to do with measurement: random feedback simply does -not compute a useful credit direction at the block layers of pre-LN residual -networks, and this would be visible if the metric were interpretable. - -\paragraph{Why the field hasn't seen this before.} -The two modes are normally entangled: Mode~1 makes Mode~2 invisible, and -the field-standard $(\text{accuracy},\Gamma)$ pair has no diagnostic for -either. A reviewer reading ``DFA reaches 31\%, $\Gamma{\approx}0.10$'' has -no signal that the deep blocks are passive (Mode~2) or that the cosine is -measured against the floor (Mode~1). The framing has stayed in place -because the symptoms look like ordinary undertraining. - -\paragraph{Our contribution.} -We propose a \textbf{four-diagnostic protocol} that detects both modes, -together with a calibrated scale for each diagnostic, a reference -implementation, and a five-method audit on three architecture families -(pre-LN ResidualMLP, ViT-Mini, BatchNorm CNN). 
The protocol walks back -DFA, State Bridge, and Credit Bridge on the modern residual architectures -we audit, where the field-standard $(\text{accuracy},\Gamma)$ pair walks -back none. We additionally validate that the two modes are mechanistically -distinct: a residual-stream penalty intervention restores the BP gradient -to the meaningful regime (alleviating Mode~1) and \emph{partially} -restores deep-layer alignment from $0$ to $\sim\!0.16$ (alleviating -Mode~2), but neither is fully fixed. Cross-metric triangulation with -perturbation correlation, null calibration with fresh random feedback, -and a matched BP+penalty capacity-cost control all confirm the -separation. - -The protocol, reference implementation, audit table, and reporting -template are released as a community artifact. Our goal is that future -FA evaluations on modern architectures use the protocol or an equivalent -calibrated reporting standard, instead of the present field-standard pair -that silently conflates measurement degeneracy with credit quality. - -\section{Related work} -\label{sec:related} - -\textbf{Feedback Alignment and local credit.} Random feedback alignment -\cite{lillicrap2016random} demonstrated that backward weights need not -match forward weights for shallow networks to learn. Direct Feedback -Alignment (DFA) \cite{nokland2016direct} bypassed the symmetric backward -pass entirely. Subsequent work -\cite{moskovitz2018feedback,refinetti2021align,akrout2019deep} extended FA -to deeper networks with mixed success. \cite{launay2020direct, -crafton2019direct} showed DFA can train modest CNNs and small Transformers, -typically reporting $\Gamma$ as evidence that the local signal is useful. -\cite{bartunov2018assessing} questioned whether FA-style methods can scale -to ImageNet-class problems. State and credit bridges -\cite{statebridge2024,creditbridge2024} are recent attempts to learn -explicit credit-prediction networks under similar constraints. 
- -\textbf{FA evaluation.} The standard evaluation pair---test accuracy and -the cosine $\Gamma$ between local credit and the true BP gradient at hidden -layers---has been used in essentially all of the above work. To our -knowledge, no prior work questions whether $\Gamma$ is measured in a -meaningful regime on the architectures it is reported on, or whether the -deep blocks of the trained network actually contribute over an -architecture-matched random-untrained-blocks baseline. We refer to accuracy -and $\Gamma$ reported without these two checks as the field-standard -evaluation pair, and this paper identifies how it conflates two distinct -phenomena. - -\textbf{Evaluation as scientific object.} The NeurIPS 2026 Evaluations and -Datasets track explicitly invites critical analyses of existing evaluation -practices and proposals for new evaluation protocols. Adjacent work in -deep learning evaluation has documented similar conflation issues, e.g., -the well-known ``representation similarity is metric-dependent'' -literature, the ``probing task validity'' -critique, and the LayerNorm-induced gradient pathology in pre-LN -Transformers \cite{xiong2020layernorm}. Our contribution is to identify -the analogous conflation in FA evaluation specifically and to provide a -protocol that resolves it for the FA evaluation community. - -\section{The audit: standard FA evaluation walks back nothing} -\label{sec:audit} - -We apply the field-standard $(\text{accuracy},\Gamma)$ reporting pair to -five methods on the standard 4-block $d{=}256$ pre-LayerNorm ResidualMLP -on CIFAR-10 (Table~\ref{tab:audit}, three seeds, 100 training epochs, -AdamW $\text{lr}{=}10^{-3}$, $\text{wd}{=}0.01$, cosine schedule). - -\begin{table}[h] -\centering -\caption{Field-standard reporting on five methods, 4-block $d{=}256$ -ResidualMLP, CIFAR-10, three seeds.
The headline pair gives no walk-back -signal on any method.} -\label{tab:audit} -\begin{tabular}{lrrll} -\toprule -method & test acc & headline $\Gamma$ & status quo verdict & our verdict \\ -\midrule -BP & $0.609 \pm 0.004$ & $\approx 1.0$ & trustworthy & trustworthy \\ -EP & $0.316 \pm 0.038$ & $0.008$ & trustworthy & trustworthy \\ -DFA & $0.308 \pm 0.014$ & $0.10$ & trustworthy & \textbf{walked back} \\ -Credit Bridge & $0.289 \pm 0.034$ & $0.07$ & trustworthy & \textbf{walked back} \\ -State Bridge & $0.205 \pm 0.039$ & $0.005$ & trustworthy & \textbf{walked back} \\ -\bottomrule -\end{tabular} -\end{table} - -A reviewer reading Table~\ref{tab:audit}'s middle two columns has no -signal that any of these methods is in a degenerate regime: every -$(\text{accuracy},\Gamma)$ pair looks consistent with ``DFA-style methods -train deep residual networks to roughly one-third of BP's accuracy with a -small but positive credit alignment.'' The status quo verdict treats all -five methods as trustworthy. - -\paragraph{The two diagnostics that should have fired.} -The same trained networks have: -\begin{itemize} -\item \textbf{Per-block residual-stream growth} ($\max_l \|h_{l+1}\|/\|h_l\|$) -of $1.3$ for BP, $2.4$ for State Bridge, $11.6$ for EP, $96\times$ for -Credit Bridge, and $237\times$ for DFA. BP and EP are bounded; DFA, SB, and -CB show explosive per-block growth. -\item \textbf{BP gradient at the deepest hidden layer} ($\|g_L\|$) of -$\sim\!4\!\times\!10^{-4}$ for BP, $\sim\!2\!\times\!10^{-4}$ for EP, -$\sim\!10^{-9}$ for DFA, SB, and CB. The DFA/SB/CB values are below the -\texttt{F.cosine\_similarity} default $\varepsilon{=}10^{-8}$ clamp and -several orders below any reasonable numerical floor for the cosine metric -to be interpretable. 
-\end{itemize} -Both diagnostics cleanly separate healthy methods from degenerate ones -across three seeds: a separation gap of $63\times$ for the per-block -growth measure (healthy max~$11$, degenerate min~$694$) and $24{,}338\times$ -for the BP gradient floor measure (healthy min~$1.0\!\times\!10^{-4}$, -degenerate max~$4.2\!\times\!10^{-9}$). Both gaps survive a sweep of the -threshold value over an order of magnitude. - -\paragraph{The walked-back claim.} -We report this finding as the primary audit result. Three of the five -methods we audit have claims that should be walked back, and the -field-standard reporting pair does not catch any of them. - -\paragraph{Walk-back: the deep blocks are not contributing.} -Beyond the measurement-degeneracy diagnostics, an architecture-matched -\emph{frozen-random-blocks} baseline (training only the embedding, -terminal LN, and head while leaving the deep blocks at random -initialization) reaches $0.349 \pm 0.002$ on this architecture under all -three of DFA, SB, and CB. The trainable-blocks variants reach $0.308$, -$0.205$, and $0.289$ respectively---\emph{below} the random-untrained -baseline. Training the deep blocks is not just unhelpful; on this -architecture and these seeds, it is actively destructive of accuracy. - -\textbf{This is the central audit finding.} Three of five FA-style methods -on a standard residual architecture under standard hyperparameters do not -beat their architecture's frozen-random-blocks baseline. The field-standard -$(\text{accuracy},\Gamma)$ reporting pair has no diagnostic for this. - -\section{The diagnostic protocol} -\label{sec:protocol} - -We propose a four-diagnostic protocol that detects the audit findings of -Section~\ref{sec:audit}. - -\paragraph{Diagnostic (a): per-layer residual stream growth.} -Compute $\max_l \|h_{l+1}\|_2 / \|h_l\|_2$ over a fixed evaluation -batch. 
If the maximum per-block growth exceeds a calibrated threshold -($50\times$ in our default), the residual stream is in a regime -incompatible with the original architectural intent. This is the most -direct test of Mode~1's structural cause. - -\paragraph{Diagnostic (b): BP gradient at hidden layers.} -Compute $\|\partial L / \partial h_L\|_2$ on a fixed evaluation batch. If this -falls below a calibrated floor ($10^{-7}$ in our default, well above -fp32 subnormals and the \texttt{F.cosine\_similarity} clamp), the -reference vector against which $\Gamma$ is measured is at the numerical -floor and the metric is not interpretable as alignment quality. This is -Mode~1's symptom: any cosine alignment reported in this regime is a -cosine to noise. - -\paragraph{Diagnostic (c): cross-batch direction stability.} -Compute the mean pairwise cosine of the normalized BP-gradient directions across -disjoint minibatches. A high value ($>0.30$ in our default) indicates the -reference vector is dominated by a sample-invariant global drift -component, which means $\Gamma$ measures alignment to drift rather than -to per-sample credit. This is a sub-mode discriminator: it tells you -\emph{how} Mode~1 has corrupted the reference, not \emph{whether} it has; -detection itself is left to diagnostic (b). - -\paragraph{Diagnostic (d): frozen-blocks baseline.} -Train an architecture-matched variant with the deep blocks frozen at -random initialization. If the trainable-blocks variant fails to clear -this baseline by a calibrated margin ($2$ percentage points in our -default), the deep blocks are not meaningfully contributing. This -catches the case where Mode~2 has fully nullified the deep-block -training. Note that this is a behavioral consequence of Mode~2 rather than -a direct measurement of it, and (as we discuss -in Section~\ref{sec:two-modes}) it becomes ambiguous under interventions -that partially restore alignment.
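The four checks above reduce to a handful of threshold comparisons. The following Python sketch illustrates that logic using only the default thresholds quoted in the text. It is not the released \texttt{protocol/} module, and every function and argument name (\texttt{run\_diagnostics}, \texttt{hidden\_norms}, and so on) is a hypothetical stand-in. It assumes that the per-layer norms $\|h_l\|$, the gradient norm $\|\partial L/\partial h_L\|$, and the per-minibatch gradient vectors have already been extracted from the model, and that a method is flagged when any of (a), (b), or (d) fires, with (c) reported separately.

```python
import math

# Default thresholds quoted in the text. All names below are hypothetical
# and are not taken from the released protocol/ module.
GROWTH_MAX = 50.0     # (a) max per-block growth ||h_{l+1}|| / ||h_l||
GRAD_FLOOR = 1e-7     # (b) floor on ||dL/dh_L||
STABILITY_MAX = 0.30  # (c) mean pairwise cosine across disjoint minibatches
MARGIN_PP = 2.0       # (d) required margin over the frozen-blocks baseline


def l2(v):
    """Euclidean norm of a plain Python sequence."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine with no epsilon clamp; NaN if either vector is exactly zero."""
    nu, nv = l2(u), l2(v)
    if nu == 0.0 or nv == 0.0:
        return float("nan")
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def max_block_growth(hidden_norms):
    """Largest ratio between consecutive norms (assumes positive norms)."""
    return max(b / a for a, b in zip(hidden_norms, hidden_norms[1:]))


def mean_pairwise_cosine(grads):
    """Mean cosine over all unordered pairs (needs at least two vectors)."""
    n = len(grads)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(grads[i], grads[j]) for i, j in pairs) / len(pairs)


def run_diagnostics(hidden_norms, grad_norm, batch_grads, acc, frozen_acc):
    """Return one boolean flag per diagnostic plus an overall walk-back flag."""
    flags = {
        "a_growth": max_block_growth(hidden_norms) > GROWTH_MAX,
        "b_grad_floor": grad_norm < GRAD_FLOOR,
        "c_drift": mean_pairwise_cosine(batch_grads) > STABILITY_MAX,
        "d_frozen_baseline": (acc - frozen_acc) * 100.0 < MARGIN_PP,
    }
    # Assumption: (a), (b), (d) are each sufficient for a walk-back;
    # (c) only discriminates sub-modes and does not trigger one by itself.
    flags["walk_back"] = (
        flags["a_growth"] or flags["b_grad_floor"] or flags["d_frozen_baseline"]
    )
    return flags
```

As a usage illustration, calling \texttt{run\_diagnostics([9.0, 12.0, 900.0, 1000.0], 5e-10, [[1.0, 0.0], [0.0, 1.0]], 0.308, 0.349)} flags (a), (b), and (d). The hidden norms and gradient vectors in that call are made up for illustration and are not measurements from the audit.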
- -\paragraph{Calibrated thresholds.} Default thresholds ($50\times$, $10^{-7}$, -$0.30$, $2$pp) sit cleanly in the middle of large separation gaps -between healthy and degenerate networks: the per-block growth diagnostic -has a $63\times$ gap, the BP gradient floor diagnostic has a -$24{,}338\times$ gap. Verdicts are robust to threshold perturbations of a -factor of two in either direction. - -\paragraph{Decision-utility ablation.} -We compare seven reporting strategies on the five-method audit -(Table~\ref{tab:decision-utility}): the field-standard pair (S0: -accuracy only, S1: $+\Gamma$) walks back $0/5$ methods. The full -protocol (S\textsubscript{full}: accuracy + (a) + (b) + (c) + (d)) walks -back $3/5$. Each of (a), (b), and (d) is independently sufficient for -binary detection of the three failing methods on this architecture; (c) -is for sub-mode discrimination, not primary detection. - -\begin{table}[h] -\centering -\caption{Decision-utility ablation. ``Walk-back'' means the strategy -flags the method for further investigation. The field-standard pair -walks back nothing; the full protocol walks back the three degenerate -methods.} -\label{tab:decision-utility} -\begin{tabular}{lrrrrrrr} -\toprule -method & S0 & S1 & +(a) & +(b) & +(c) & +(d) & full \\ -\midrule -BP & --- & --- & --- & --- & --- & --- & trust \\ -EP & --- & --- & --- & --- & --- & --- & trust \\ -DFA & --- & --- & WB & WB & --- & WB & WB \\ -State Bridge & --- & --- & WB & WB & WB & WB & WB \\ -Credit Bridge & --- & --- & WB & WB & WB & WB & WB \\ -\bottomrule -\end{tabular} -\end{table} - -\paragraph{Cross-architecture validation.} -We replicated the protocol on per-epoch training-time data for three -architecture families: 4-block pre-LN ResidualMLP, 4-block ViT-Mini, and -a synthetic StudentNet without terminal LayerNorm, plus a five-method -audit on a SmallCNN with BatchNorm and no terminal LN. 
Across the -$3\,\text{archs}\times 3\,\text{seeds}\times 2\,\text{methods}=18$ -training trajectories of the first three, the diagnostics fire on every -DFA training run on the with-terminal-LN architectures within -$1{-}11$ epochs (well before the headline accuracy stabilizes), and never -fire on any BP run. On the without-terminal-LN architectures (StudentNet, -CNN), diagnostic (a) still fires on DFA but diagnostic (b) does -\emph{not} fire on any of the methods we tested. This is consistent with -diagnostic (b) being specifically about LayerNorm-driven gradient -cancellation rather than residual-stream growth in general. - -\paragraph{Reference implementation.} -We release \texttt{protocol/}, a $\sim\!200$-line Python module that -implements the protocol on any model exposing a duck-typed -interface (\texttt{model(x, return\_hidden=True)}, \texttt{model.embed} or -\texttt{model.patch\_embed}, \texttt{model.blocks}, and a terminal LN + -head). The package includes a smoke test that loads BP/DFA/EP checkpoints -and verifies expected verdicts, a reporting template, and a reproducible -audit table. - -\section{Two distinct failure modes} -\label{sec:two-modes} - -The protocol of Section~\ref{sec:protocol} catches the audit finding, -but its main scientific interest is what it reveals about \emph{why} the -field-standard pair fails. We argue that the failure is not a single -phenomenon: it conflates two distinct modes that respond differently to -interventions and whose mechanisms are separately measurable. - -\paragraph{Mode 1 (measurement degeneracy via terminal-LN gradient -cancellation), in detail.} -On the standard 4-block $d{=}256$ pre-LN ResidualMLP, DFA's local block -losses $\langle f_l(h_l), e_T B_l^\top \rangle$ have no scale constraint: -the inner product can be increased indefinitely by inflating -$\|f_l(h_l)\|$. 
Block parameters $w_1, w_2$ inside each block grow by a -factor of $\sim\!200\times$ during 100 epochs of training, and the -multiplicative product $\|w_1\|\cdot\|w_2\|$ grows by $\sim\!5\times 10^4$ -per block. The residual stream $\|h_L\|$ grows from $9$ at initialization -to $\sim\!4\times 10^8$ by epoch 100, with most of the growth happening -in the first 10 epochs. Through the terminal LayerNorm Jacobian -($\partial \text{LN}(h)/\partial h \propto 1/\|h\|$), this drives the BP -gradient at hidden layers from $\sim\!10^{-3}$ at random initialization -to $\sim\!5\times 10^{-10}$. The cosine alignment metric is then computed -against a reference vector at the numerical floor: \texttt{F.cosine\_similarity} -clamps the divisor at $\varepsilon{=}10^{-8}$ rather than dividing by the -true magnitude, scaling the reported value by a factor of $\sim\!50\times$ -in the wrong direction; the reported $\Gamma\approx 0.10$ is not a -``small alignment'' but a cosine to a degenerate reference. - -\paragraph{Causal validation: penalty intervention partially alleviates Mode~1.} -Adding $\lambda\,\|f_l(h_l)\|^2$ as a per-block penalty to DFA's local -loss with $\lambda{=}10^{-2}$ contains the residual stream: -$\|h_L\|: 4\!\times\!10^8 \to 4\!\times\!10^4$ (4 OOM rescue), and -$\|g_L\|: 5\!\times\!10^{-10} \to \sim\!10^{-6}$ (more than 3 OOM rescue, well into -the meaningful regime). Diagnostics (a) and (b) both pass on the -penalized network. Three seeds: $\|h_L\|=(4.0\pm 0.1)\times 10^4$, -$\|g_L\|=(9.0\pm 0.9)\times 10^{-7}$. - -\paragraph{Mode 2 (low intrinsic credit-direction quality), in detail.} -The penalty alleviates Mode~1, but the test accuracy of penalized DFA only -rises from $0.308$ to $0.363$ (3-seed mean $0.363\pm 0.001$). This is -$+5.5$pp over vanilla DFA but only $+1.4$pp over the architecture-matched -random-blocks baseline of $0.349$. The deep blocks are still not -meaningfully contributing.
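Why a quadratic penalty contains the residual stream can be seen in a short calculation. The sketch below is an idealized, single-sample illustration and not the authors' derivation. It assumes the local objective is maximized with the penalty subtracted from it (the sign convention is an assumption; the text states only that the inner product can be increased indefinitely), and it writes $f_l(h_l) = c\,u$ with $c = \|f_l(h_l)\| \ge 0$, $\|u\| = 1$, and $s = e_T B_l^\top$.

```latex
\begin{align*}
J(c) &= \langle c\,u,\, s\rangle = c\,\langle u, s\rangle
  && \text{(unbounded in $c$ whenever $\langle u, s\rangle > 0$)} \\
J_\lambda(c) &= c\,\langle u, s\rangle - \lambda c^2
  && \text{(objective with the per-block penalty)} \\
\frac{dJ_\lambda}{dc} = 0 \;&\Rightarrow\;
  c^\star = \frac{\langle u, s\rangle}{2\lambda},
  \qquad
  J_\lambda(c^\star) = \frac{\langle u, s\rangle^2}{4\lambda}
  && \text{(for $\langle u, s\rangle > 0$)}
\end{align*}
```

Under these assumptions the optimal block-output norm satisfies $c^\star \le \|s\|/(2\lambda)$, so the magnitude can no longer grow without limit and the attainable objective depends only on how well the direction $u$ matches $s$.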
- -\textbf{Direct measurement.} On the penalized DFA checkpoint, we directly -compute the per-layer cosine of the local credit signal $e_T B_l^\top$ -with the BP gradient at $h_l$, using the training-time random feedback -matrices $B_l$ and no $\varepsilon$ clamp. Three-seed result on deep -layers ($l=1,2,3,4$): $\overline{\cos} = +0.155 \pm 0.025$. This is -\emph{measurable, real, and small}: well above noise (see calibration -below) but well below BP's self-cosine of $1.0$. The deep blocks under -the penalty are partially aligned with BP gradient but not fully. - -\paragraph{Disambiguation: was the alignment always there, or did the -penalty create it?} -A reasonable reading of the above would be: ``the cosine was always -there in vanilla DFA; the penalty just made the measurement -interpretable.'' The disambiguation experiment falsifies this. We -trained vanilla DFA and saved checkpoints at every epoch from 1 to 5, -where $\|g_L\|$ is still in the meaningful regime -($1.4\!\times\!10^{-6}$ at epoch 1, well above the $10^{-7}$ floor). -Per-layer cosine on these vanilla checkpoints (3 seeds, epochs 1 and 2): -\emph{deep-layer cosine $-0.008 \pm 0.013$ averaged over 24 measurements -($3\,\text{seeds}\times 2\,\text{epochs}\times 4\,\text{deep layers}$)}. -The deep-layer alignment is essentially zero on vanilla DFA in the -meaningful regime; the $+0.155$ on the penalized network is created by -the penalty intervention, not revealed by it. - -\paragraph{The penalty's role.} -The penalty does two things at once. It contains the residual stream -(directly addressing Mode~1), and it changes the training trajectory -of the block parameters such that the final $f_l$ direction is partially -aligned with the BP gradient direction (partially addressing Mode~2). -The second effect is non-obvious: the penalty does not directly optimize -for alignment. 
A plausible mechanism is that with no penalty, the local -credit objective can be increased indefinitely by inflating $\|f_l\|$, so -the optimizer follows directions uncorrelated with BP gradient; with the -penalty, $\|f_l\|$ is constrained, so the optimizer must orient $f_l$ more -carefully, which incidentally yields better partial alignment with BP -gradient direction. - -\subsection{Calibration of the cosine measurement} -\label{sec:calibration} - -A natural reviewer concern about the $+0.155$ result is whether it is -above or below noise. We anchor it with explicit positive and negative -controls. - -\textbf{Positive control.} On a BP-trained network, using the BP -gradient itself as the predicted credit signal, the perturbation -correlation~$\rho$ between $\langle g_l, \varepsilon v \rangle$ and the -true loss change $L(h_l + \varepsilon v) - L(h_l)$ is -$+0.997$ at every layer (4-layer mean $+0.9965$). This is the -Taylor-expansion ceiling. - -\textbf{Negative control.} On the same BP-trained network, using a -random vector independent of the layer as the credit signal, $\rho$ is -$+0.006$ (4-layer mean), within statistical noise of zero. - -\textbf{Cross-metric triangulation on the test conditions.} - -\begin{table}[h] -\centering -\caption{Two metrics, four conditions. 
The agreement between cosine and -perturbation correlation rules out single-metric artifacts.} -\label{tab:two-metrics} -\begin{tabular}{lrr} -\toprule -condition & deep cosine $\overline{\cos}$ & deep $\overline{\rho}$ \\ -\midrule -positive control (BP grad on BP net) & $1.000$ & $+0.997$ \\ -negative control (random vector on BP net) & --- & $+0.006$ \\ -vanilla DFA, ep 1 (3 seeds, meaningful regime) & $-0.008 \pm 0.013$ & $-0.003 \pm 0.005$ \\ -penalized DFA, ep 30 (3 seeds, $\lambda{=}10^{-2}$) & $+0.155 \pm 0.025$ & $+0.080 \pm 0.011$ \\ -\bottomrule -\end{tabular} -\end{table} - -The penalized DFA's $+0.080$ perturbation correlation is $\sim\!13\times$ -the negative control and $\sim\!8\%$ of the positive control. Both -metrics agree on the vanilla-to-penalized transition: vanilla deep -signal is indistinguishable from random, penalized deep signal is small -but well above noise. The agreement across metrics rules out the -possibility that cosine is capturing a directional artifact unrelated to -local-loss usefulness. - -\subsection{$\lambda$ sweep: independent dissociation of the two modes} -\label{sec:lambda-sweep} - -The disambiguation experiment of Section~\ref{sec:two-modes} relied on -vanilla DFA early-epoch checkpoints (epochs 1--2) to measure deep-layer -cosine while $\|g_L\|$ was still in the meaningful regime. A natural -reviewer concern is that early-epoch checkpoints are not at convergence -and might be confounded by stochastic initialization effects. We -strengthen the disambiguation with an independent control: a sweep over -the penalty strength $\lambda$ at convergence (30~epochs), with both -metrics measured on each saved checkpoint. - -\begin{table}[h] -\centering -\caption{$\lambda$ sweep on the penalty strength, all 30 epochs, seed -42.
The deep-layer cosine and perturbation correlation rise from -essentially zero at $\lambda{=}10^{-4}$ to small-but-positive at -$\lambda{=}10^{-2}$, even though diagnostics (a) and (b) already pass -at $\lambda{=}10^{-4}$.} -\label{tab:lambda-sweep} -\begin{tabular}{rrrrrr} -\toprule -$\lambda$ & test acc & $\|h_L\|$ & $\|g_L\|$ & deep $\overline{\cos}$ & deep $\overline{\rho}$ \\ -\midrule -$0$ & $0.308$ & $4.4{\times}10^{8}$ & $5{\times}10^{-10}$ & (degenerate) & (degenerate) \\ -$10^{-4}$ & $0.359$ & $2.4{\times}10^{4}$ & $6.3{\times}10^{-7}$ & $-0.022$ & $-0.004$ \\ -$10^{-2}$ & $0.363$ & $4.0{\times}10^{4}$ & $9.0{\times}10^{-7}$ & $+0.165$ & $+0.091$ \\ -$10^{-1}$ & $0.349$ & $1.2{\times}10^{4}$ & $1.6{\times}10^{-6}$ & $+0.131$ & $+0.067$ \\ -\bottomrule -\end{tabular} -\end{table} - -\textbf{The killer row is $\lambda{=}10^{-4}$.} At this penalty -strength, the residual stream is already contained ($\|h_L\| = 2.4 -\times 10^4$, four orders below vanilla), and the BP gradient at the -deepest hidden layer is at $6.3 \times 10^{-7}$ (well above the -$10^{-7}$ floor and in the meaningful measurement regime). Diagnostics -(a) and (b) both pass: \textbf{Mode~1 is fully alleviated}. But the -deep-layer cosine ($-0.022$) and perturbation correlation ($-0.004$) -are essentially zero, on both metrics independently. \textbf{Mode~2 is -not alleviated at all.} - -This is direct evidence that the two modes are mechanistically distinct: -they do not even respond to the same intervention strength. There exists -a regime ($\lambda{=}10^{-4}$, 30~epochs of training) in which -Mode~1 is fully alleviated and Mode~2 is unchanged from vanilla, with -both metrics agreeing. - -The threshold for Mode~2 alleviation is somewhere between -$\lambda{=}10^{-4}$ and $\lambda{=}10^{-2}$. 
At $\lambda{=}10^{-2}$ the -penalty is strong enough to alter the optimization trajectory of the -block parameters (constraining $\|f_l\|$ tightly enough that the -direction of $f_l$ has to be coordinated more carefully with the local -credit signal), and the deep-layer alignment rises to $\sim\!+0.16$. -At $\lambda{=}10^{-1}$ the penalty starts to over-constrain and the -alignment is slightly lower ($\sim\!+0.13$), giving an inverted-U -relationship between $\lambda$ and deep alignment. - -\subsection{Capacity-cost control} -\label{sec:capacity-cost} - -A second reviewer concern is whether the accuracy gap between penalized -DFA ($0.36$) and BP ($0.61$) is due to credit quality (Mode~2) -or simply to the penalty's capacity-regularization cost. We disambiguate -with a $2\times2$ matched control. - -\begin{table}[h] -\centering -\caption{$2\times2$ capacity-cost control. The penalty is the same in both -the BP and DFA conditions. BP+penalty still clears the random-blocks -baseline by $18.1$pp; DFA+penalty clears it by only $1.4$pp.} -\label{tab:bp-penalty} -\begin{tabular}{lrrr} -\toprule - & no penalty & with penalty & $\Delta$ (penalty) \\ -\midrule -BP & $0.609$ & $0.530$ & $-8.0$pp \\ -DFA & $0.308$ & $0.363$ & $+5.5$pp \\ -\bottomrule -\end{tabular} -\end{table} - -Two observations make this control informative. First, the penalty's -effect on BP is $-8$pp (a moderate capacity loss), which is roughly half -the size of the residual gap between BP+penalty and -DFA+penalty ($0.530 - 0.363 \approx 17$pp). Because the same penalty is applied -in both conditions, the 17pp residual gap cannot be attributed to capacity -regularization and is consistent with a credit-quality cost. -Second, the penalty has \emph{opposite} effects on the two methods: it -hurts BP by 8pp while helping DFA by 5.5pp, which is not the pattern expected -if the penalty were a generally beneficial regime shift.
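The percentage-point figures quoted in this control follow from the four table entries and the $0.349$ frozen-random-blocks baseline. The short Python check below recomputes them. The variable names are illustrative, and because the inputs are the three-decimal values printed in the table, a recomputed delta can differ from the quoted one in the last digit (for example $-7.9$ versus the quoted $-8.0$ for BP).

```python
# Accuracies copied from the 2x2 capacity-cost table; the baseline is the
# frozen-random-blocks accuracy quoted in the audit section.
ACC = {
    ("BP", "none"): 0.609,
    ("BP", "penalty"): 0.530,
    ("DFA", "none"): 0.308,
    ("DFA", "penalty"): 0.363,
}
FROZEN_BASELINE = 0.349
METHODS = ("BP", "DFA")


def pp(delta):
    """Convert an accuracy difference to percentage points (one decimal)."""
    return round(100.0 * delta, 1)


# Effect of adding the penalty, per method.
penalty_effect = {m: pp(ACC[(m, "penalty")] - ACC[(m, "none")]) for m in METHODS}

# Margin of each penalized method over the frozen-random-blocks baseline.
margin_over_baseline = {m: pp(ACC[(m, "penalty")] - FROZEN_BASELINE) for m in METHODS}

# Residual gap between the two penalized conditions.
residual_gap = pp(ACC[("BP", "penalty")] - ACC[("DFA", "penalty")])

print("penalty effect (pp):", penalty_effect)
print("margin over frozen baseline (pp):", margin_over_baseline)
print("BP+penalty minus DFA+penalty (pp):", residual_gap)
```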
- -\textbf{The clean phrasing.} The $2\times2$ control identifies a residual -performance gap under matched architecture, data, optimizer family, and -matched penalty, after accounting for the penalty's direct capacity cost -on BP. It is not a perfect isolation of ``credit quality'' in a vacuum -(BP uses end-to-end loss while DFA uses local block losses, and the two -trainers may differ in non-capacity ways), but it gives a lower bound -on the portion of the gap that the penalty's capacity cost cannot explain. - -\subsection{Summary: five validations of the two-mode separation} - -Together, the disambiguation experiment, the $\lambda$ sweep, the -cross-metric triangulation, the capacity-cost control, and the -threshold robustness analysis provide five independent lines of -evidence that the failure of standard FA evaluation is not a single -phenomenon. Mode~1 (measurement degeneracy) is detected by diagnostic -(b), is causally controlled by the residual-stream penalty at every -tested $\lambda \geq 10^{-4}$, and is specifically associated with -terminal-LayerNorm architectures in our audits. Mode~2 (low intrinsic credit -quality) persists after Mode~1 is alleviated at weak penalty -strengths ($\lambda{=}10^{-4}$), is detected by direct per-layer -cosine in the meaningful regime, and is alleviated only when the penalty is -strong enough to alter the optimization trajectory of the deep -blocks ($\lambda \geq 10^{-2}$). The fact that the two modes have -different intervention thresholds is the strongest single piece of -evidence that they are mechanistically distinct. - -\section{Limitations} -\label{sec:limitations} - -Our audit covers a specific slice of the FA literature: pre-LayerNorm -ResidualMLP, ViT-Mini, and SmallCNN architectures on CIFAR-10, evaluated -under standard hyperparameters.
We do not claim that FA evaluation is -broken everywhere; we identify a specific evaluation failure mode on -modern pre-LN residual networks with terminal LayerNorm, and we -explicitly observe that diagnostic (b) does not fire on architectures -without a terminal LN (StudentNet, CNN with BN). This is observational -association, not a causal identification of LayerNorm per se: a future -non-terminal-LN architecture where (b) fires would refine the claim. -Section~\ref{sec:related} cites the classical FA literature where -non-terminal-LN architectures dominate; our central claim concerns the -modern with-terminal-LN residual case. - -The Mode~2 measurement in Section~\ref{sec:two-modes} relies on direct -cosine and perturbation correlation in the meaningful regime, which is -only accessible after a Mode~1 intervention. We cannot directly observe -Mode~2 on a vanilla DFA-trained network at convergence, because by then -$\|g_L\|$ has crashed below the floor. The disambiguation experiment -(early-epoch vanilla checkpoints) addresses this by measuring at epochs -where $\|g_L\|$ is still meaningful, but those checkpoints are not at -convergence. - -The matched-penalty $2{\times}2$ control disambiguates capacity loss from -credit quality but does not account for non-capacity differences between -end-to-end BP and local DFA training. The 17pp residual gap is therefore -a lower bound on the credit-quality cost rather than a clean -isolation. - -\section{Broader impacts} -\label{sec:impacts} - -This paper does not introduce a new training method, dataset, or -generative model. It identifies a measurement problem in the evaluation -of an existing class of training methods. 
Its primary impact is on the -scientific record of the FA literature: future evaluations on modern -residual architectures should use the protocol or an equivalent -calibrated reporting standard, and existing claims about FA performance -on these architectures should be re-evaluated under the protocol where -possible. We are not aware of any negative downstream applications of -this work. - -\section{Conclusion} -\label{sec:conclusion} - -We have shown that standard Feedback Alignment evaluation on modern -residual networks is unreliable because it conflates two distinct -failure modes: measurement degeneracy via terminal-LayerNorm gradient -cancellation, and low intrinsic credit-direction quality of random -feedback. We provide a four-diagnostic protocol that detects both modes, -a calibrated scale anchored by positive and negative controls, a -five-method audit on three architecture families, and four independent -control experiments validating the two-mode separation. The protocol, -audit data, and reporting template are released as a community artifact -for the FA evaluation community. - -\bibliographystyle{abbrvnat} -\begin{thebibliography}{99} -\bibitem{lillicrap2016random} -T.~P. Lillicrap, D.~Cownden, D.~B. Tweed, and C.~J. Akerman. -\newblock Random synaptic feedback weights support error backpropagation for deep learning. -\newblock {\em Nature Communications}, 7:13276, 2016. - -\bibitem{nokland2016direct} -A.~N\o{}kland. -\newblock Direct feedback alignment provides learning in deep neural networks. -\newblock In {\em NeurIPS}, 2016. - -\bibitem{akrout2019deep} -M.~Akrout, C.~Wilson, P.~Humphreys, T.~Lillicrap, and D.~B. Tweed. -\newblock Deep learning without weight transport. -\newblock In {\em NeurIPS}, 2019. - -\bibitem{launay2020direct} -J.~Launay, I.~Poli, F.~Boniface, and F.~Krzakala. -\newblock Direct feedback alignment scales to modern deep learning tasks and architectures. -\newblock In {\em NeurIPS}, 2020. 
- -\bibitem{moskovitz2018feedback} -T.~H. Moskovitz, A.~Litwin-Kumar, and L.~F. Abbott. -\newblock Feedback alignment in deep convolutional networks. -\newblock {\em arXiv:1812.06488}, 2018. - -\bibitem{refinetti2021align} -M.~Refinetti, S.~d'Ascoli, R.~Ohana, and S.~Goldt. -\newblock Align, then memorise: the dynamics of learning with feedback alignment. -\newblock In {\em ICML}, 2021. - -\bibitem{crafton2019direct} -B.~Crafton, A.~Parihar, E.~Gebhardt, and A.~Raychowdhury. -\newblock Direct feedback alignment with sparse connections for local learning. -\newblock {\em Frontiers in Neuroscience}, 13:525, 2019. - -\bibitem{bartunov2018assessing} -S.~Bartunov, A.~Santoro, B.~Richards, L.~Marris, G.~Hinton, and T.~Lillicrap. -\newblock Assessing the scalability of biologically-motivated deep learning algorithms and architectures. -\newblock In {\em NeurIPS}, 2018. - -\bibitem{xiong2020layernorm} -R.~Xiong, Y.~Yang, D.~He, K.~Zheng, S.~Zheng, C.~Xing, H.~Zhang, Y.~Lan, L.~Wang, and T.~Liu. -\newblock On layer normalization in the transformer architecture. -\newblock In {\em ICML}, 2020. - -\bibitem{statebridge2024} -Anonymous. -\newblock State Bridge: terminal-conditioned predictor for credit assignment. -\newblock {\em Anonymous in-progress reference, 2024-2026}. - -\bibitem{creditbridge2024} -Anonymous. -\newblock Credit Bridge: value-field local credit without hidden BP. -\newblock {\em Anonymous in-progress reference, 2024-2026}. - -\end{thebibliography} - -\appendix - -\section{Reproducibility} -\label{app:reproducibility} - -All experiments use PyTorch~$\geq$2.0 on a single NVIDIA A6000 GPU. -Source for the protocol package is in \texttt{protocol/}; experimental -scripts are in \texttt{experiments/}. Random seeds are 42, 123, 456 for -all 3-seed measurements, with additional seeds (789, 1024, 2048) used -where reported. 
CIFAR-10 is loaded via \texttt{torchvision} with the -standard normalization $(\mu, \sigma) = ((0.4914, 0.4822, 0.4465), -(0.2470, 0.2435, 0.2616))$. - -\section{Pipeline pitfalls catalog} -\label{app:pitfalls} - -Beyond the four diagnostics, we found seven evaluation-pipeline bugs in -our own codebase that silently corrupt FA evaluation results. -Each has a standalone reproducer in -\texttt{protocol/examples/verify\_pitfalls*.py}. - -\begin{enumerate} -\item \texttt{tensor.norm(-1)} is the $L_{-1}$ ``norm'' of the entire -flattened tensor, not the per-row $L_2$ norm, because the first -positional argument is the norm order $p$. The correct call is -\texttt{tensor.norm(dim=-1)}. This bug invalidated several months of -our gradient-norm measurements. - -\item \texttt{F.cosine\_similarity(a, b)} clamps each norm from below, -dividing by $\max(\|a\|, \varepsilon)\cdot\max(\|b\|, \varepsilon)$ with -$\varepsilon{=}10^{-8}$ by default. -When $\|b\|\sim 10^{-10}$ (the regime of the BP gradient on degenerate -DFA-trained pre-LN networks), the divisor becomes $\|a\|\cdot 10^{-8}$ -instead of $\|a\|\cdot 10^{-10}$, shrinking the reported cosine toward -zero by a factor of $\sim\!100$. - -\item fp16 mixed precision underflows BP gradients at $\sim\!5\times -10^{-10}$, below fp16's smallest subnormal of $\sim\!6\times 10^{-8}$. -bf16 works because it shares fp32's exponent range. - -\item Random feedback $B_l$ matrices are training-specific. DFA reports -$\Gamma\approx 0.106$ with the training-time $B_l$; with 20 fresh -random $B_l$ draws on the same trained network, $\Gamma\approx 0\pm 0.005$. -The reported alignment therefore reflects the network adapting to its -specific $B_l$ rather than an intrinsic property. - -\item Aggregation strategy across (layers, samples, batches) is rarely -specified but determines the headline number. The same DFA seed-42 run -gives $\Gamma \in [-0.028, +0.074]$ across four valid aggregation strategies -(a 3.45$\times$ ratio, with sign flip). - -\item Per-layer $\Gamma$ structure is hidden by aggregation.
On the -4-block ResMLP, DFA's headline $\Gamma\approx 0.10$ is driven almost -entirely by the embedding layer ($\Gamma_{l_0} \approx +0.43$); -deeper layers have $\Gamma \approx 0$. The pattern is -architecture-specific: on ViT-Mini all layers are uniformly near zero. - -\item Auxiliary networks (random feedback $B_l$, bridge predictors) not -saved alongside model checkpoints can cause post-hoc $\Gamma$ scripts to -silently fall back to $\cos(\text{BP\_grad}, \text{BP\_grad}) = 1.0$ and -report ``perfect alignment.'' We discovered this in our own pipeline -during protocol development. Check that auxiliary networks are -persisted before reporting any $\Gamma$ value. -\end{enumerate} - -\section{Methodology: walk-back chain} -\label{app:walkback} - -The framing of this paper underwent several corrections during the -development of the protocol. We document the four-step progression -explicitly as part of the methodology, not as narrative drama: - -\begin{enumerate} -\item Initial metric ($\Gamma\approx 0.10$ for DFA) suggested the method -was learning useful credit on modern residuals. -\item Diagnostic showed the metric was measured against a numerical-floor -reference vector ($\|g_L\|\sim 10^{-10}$); the headline number was not -interpretable. -\item Revised control (the residual-stream penalty) restored the -reference but only partially closed the accuracy gap to BP, identifying -a residual phenomenon. -\item Final interpretation (this paper) separates measurement failure -(Mode~1) from genuine credit-quality cost (Mode~2), validated by the -four control experiments of Section~\ref{sec:two-modes}. -\end{enumerate} - -\section{Six independent validations of the two-mode separation} -\label{app:six-validations} - -For completeness we list all six independent validation experiments, -which include the four control experiments reported in the main text: - -\begin{enumerate} -\item Direct deep-layer cosine on penalized DFA (3 seeds): deep mean -$+0.155 \pm 0.025$.
-\item Null calibration with 20 fresh random $B_l$: deep cosine -$+0.002 \pm 0.022$ (within noise). -\item Hypothesis-disambiguation sweep: vanilla DFA early-epoch deep -cosine $-0.008 \pm 0.013$ across 3 seeds at epoch 1. -\item BP+penalty matched-control: 8pp BP capacity cost vs 17pp residual -gap at $\lambda{=}10^{-2}$. -\item Multi-seed lock-in: 24 measurements (3 seeds $\times$ 2 epochs -$\times$ 4 deep layers) all in $[-0.04, +0.02]$ on vanilla. -\item Cross-metric triangulation via perturbation correlation: vanilla -$+0.002$, penalized $+0.080$ (3 seeds), positive control (BP grad) -$+0.997$, negative control (random vector) $+0.006$. -\end{enumerate} - -\end{document}