author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 20:23:54 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 20:23:54 -0500
commit: 29c2396ee6480e94d4543cb603587a4cc7b640cd
tree: 2454a3a5764a1c6fd3af4ea339e222e18871bc9e
parent: 72344e953b9e933a9961c3065a92c5518dd1cb93
paper v2.36.1: extend \paragraph{} headers to §1, §2, §7
v2.36 added paragraph headers to §3-§6. Extending the same treatment to §1, §2, and §7 for consistent skim-readability across the whole main content.

§1 (Introduction):
- Feedback alignment and the standard reporting pair
- The standard pair fails to validate
- Two failure modes and their separability
- Contribution: a methodology paper, not a new FA variant

§2 (Audit):
- Setup: 5-method audit on a 4-block pre-LayerNorm ResMLP
- State Bridge and Credit Bridge: diagnostic probes constructed for this paper
- Status-quo reading: every method looks acceptable
- EP as the internal control: low accuracy without invalid measurement
- Frozen-blocks baseline overturns the status-quo reading

§7 (Discussion):
- Scope, limits, and reporting recommendation

All §1-§7 main content paragraphs now have inline bold headers naming what each paragraph argues for.

Page count: 20 (unchanged from v2.36); paragraph headers don't take much extra space, and the layout already had ~1 line of slack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
-rw-r--r--  paper/main.pdf  bin 534654 -> 535423 bytes
-rw-r--r--  paper/main.tex  20
2 files changed, 10 insertions, 10 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index 6da8a22..b65701f 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index 3f50e82..4650898 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -33,18 +33,18 @@ Modern feedback-alignment evaluation on deep residual networks is still summariz
\section{Introduction}
\label{sec:intro}
-Backpropagation (BP) is the de facto training method for deep neural networks, but its requirement that each feedback connection carry a weight identical to the corresponding forward connection -- the weight-transport problem -- has long been considered biologically implausible \citep{lillicrap2016random,bartunov2018assessing}. \emph{Feedback alignment} (FA) \citep{lillicrap2016random} side-steps weight transport by delivering per-layer credit through fixed random feedback matrices, and its direct variant (DFA) \citep{nokland2016direct} projects the output error to every hidden layer through an independent random matrix; parallel lines include target propagation \citep{lee2015difference} and equilibrium propagation \citep{scellier2017equilibrium}. These rules are studied both as biologically-plausible alternatives to BP and as scalable, asynchronous training schemes, with recent work scaling DFA to transformer-scale architectures on language, recommendation, and view-synthesis tasks \citep{launay2020direct,akrout2019deep}. Evaluation in this line of work has converged on a two-number summary: final task accuracy, and an aggregate cosine alignment $\Gamma$ between the method's per-layer credit and the BP gradient on the trained network \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct,bartunov2018assessing}.
+\paragraph{Feedback alignment and the standard reporting pair.} Backpropagation (BP) is the de facto training method for deep neural networks, but its requirement that each feedback connection carry a weight identical to the corresponding forward connection -- the weight-transport problem -- has long been considered biologically implausible \citep{lillicrap2016random,bartunov2018assessing}. \emph{Feedback alignment} (FA) \citep{lillicrap2016random} side-steps weight transport by delivering per-layer credit through fixed random feedback matrices, and its direct variant (DFA) \citep{nokland2016direct} projects the output error to every hidden layer through an independent random matrix; parallel lines include target propagation \citep{lee2015difference} and equilibrium propagation \citep{scellier2017equilibrium}. These rules are studied both as biologically-plausible alternatives to BP and as scalable, asynchronous training schemes, with recent work scaling DFA to transformer-scale architectures on language, recommendation, and view-synthesis tasks \citep{launay2020direct,akrout2019deep}. Evaluation in this line of work has converged on a two-number summary: final task accuracy, and an aggregate cosine alignment $\Gamma$ between the method's per-layer credit and the BP gradient on the trained network \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct,bartunov2018assessing}.
-On the audited 4-block $d{=}256$ ResMLP, however, Table~\ref{tab:main_audit} already shows that this accuracy-plus-$\Gamma$ pair is not a validity check: DFA reaches only $0.306 \pm 0.006$ test accuracy, below the architecture-matched frozen-blocks baseline of $0.349 \pm 0.002$, while still looking superficially comparable to other non-BP methods. Figure~\ref{fig:audit_hero} further shows that the apparent cosine evidence is concentrated at the shallowest block, with DFA at seed 42 reaching about $+0.42$ at layer 0 but approximately $-0.03$ to $0$ on layers 1--4, so the aggregate obscures where credit direction is and is not present. At the same time, the deepest BP reference norm is only about $4 \times 10^{-10}$ for DFA (three-seed mean) and a few $\times 10^{-9}$ for State Bridge and Credit Bridge, all below the $10^{-8}$ clamp used by \texttt{F.cosine\_similarity}, whereas BP remains around $4 \times 10^{-4}$, so the reported deep cosine is partly computed against a numerical-floor reference rather than an informative gradient direction (Figure~\ref{fig:audit_hero}; Table~\ref{tab:main_audit}). Those numbers can be useful, but only if the measurement regime itself is valid.
+\paragraph{The standard pair fails to validate.} On the audited 4-block $d{=}256$ ResMLP, however, Table~\ref{tab:main_audit} already shows that this accuracy-plus-$\Gamma$ pair is not a validity check: DFA reaches only $0.306 \pm 0.006$ test accuracy, below the architecture-matched frozen-blocks baseline of $0.349 \pm 0.002$, while still looking superficially comparable to other non-BP methods. Figure~\ref{fig:audit_hero} further shows that the apparent cosine evidence is concentrated at the shallowest block, with DFA at seed 42 reaching about $+0.42$ at layer 0 but approximately $-0.03$ to $0$ on layers 1--4, so the aggregate obscures where credit direction is and is not present. At the same time, the deepest BP reference norm is only about $4 \times 10^{-10}$ for DFA (three-seed mean) and a few $\times 10^{-9}$ for State Bridge and Credit Bridge, all below the $10^{-8}$ clamp used by \texttt{F.cosine\_similarity}, whereas BP remains around $4 \times 10^{-4}$, so the reported deep cosine is partly computed against a numerical-floor reference rather than an informative gradient direction (Figure~\ref{fig:audit_hero}; Table~\ref{tab:main_audit}). Those numbers can be useful, but only if the measurement regime itself is valid.
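The numerical-floor point in the paragraph above can be illustrated with a small sketch. It is not the paper's measurement code and not PyTorch's implementation: `clamped_cosine` is a hypothetical pure-Python helper that assumes each vector norm is clamped from below by `eps` (the exact clamping form may differ between library versions), and the two reference vectors are made-up values scaled to the norms quoted in the text (about $4 \times 10^{-4}$ and about $4 \times 10^{-10}$).

```python
import math

EPS = 1e-8  # clamp value the text attributes to F.cosine_similarity


def clamped_cosine(u, v, eps=EPS):
    """Cosine similarity with each norm clamped from below by eps.

    Assumption: a per-norm clamp; this is an illustrative stand-in,
    not the library routine itself.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (max(norm_u, eps) * max(norm_v, eps))


# Hypothetical unit-norm credit direction.
credit = [0.6, 0.8]

# Two references that point in exactly the same direction as `credit`
# and differ only in scale.
healthy_ref = [c * 4e-4 for c in credit]   # norm about 4e-4 (BP-like scale)
floor_ref = [c * 4e-10 for c in credit]    # norm about 4e-10 (below the clamp)

cos_healthy = clamped_cosine(credit, healthy_ref)  # close to 1.0
cos_floor = clamped_cosine(credit, floor_ref)      # roughly 0.04

print(f"healthy reference: {cos_healthy:.4f}")
print(f"floor reference:   {cos_floor:.4f}")
```

Under these assumptions, a perfectly aligned pair reports a cosine near 0.04 instead of 1 once the reference norm drops below the clamp, so the reported value reflects the clamp at least as much as the direction.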
-Our audit shows that modern residual vision models can make these two quantities look informative while failing to answer the question they are taken to answer. Figure~\ref{fig:audit_hero} shows the first failure mode, which we call \emph{Mode 1: measurement degeneracy}, where residual-stream growth drives the deepest hidden state to about $\|h_L\| \sim 10^8$ under DFA/SB/CB while the corresponding BP reference collapses to $\|g_L\| \sim 4 \times 10^{-10}$ for DFA (three-seed mean), so the deep-layer cosine is measured against a clamp-dominated floor rather than a meaningful target direction. The same figure also shows the second failure mode, \emph{Mode 2: low intrinsic credit-direction quality}, because even after comparing against the stronger frozen-blocks baseline ($0.349 \pm 0.002$) and looking layer-by-layer, DFA's deep blocks remain essentially null while only layer 0 is visibly positive. Intervention sharpens both modes. Adding a per-block residual penalty $\lambda \|f_l(h_l)\|^2$ to DFA at $\lambda{=}10^{-2}$ contains $\|h_L\|$ to about $4\times 10^4$ and lifts the deep BP reference to about $10^{-6}$, but DFA's rescued deep cosine is only about $+0.15$; State Bridge under the same intervention reaches a three-seed deep cosine of $+0.32$ and, unlike DFA, exceeds the frozen-blocks baseline by $+10$ points in final accuracy; Credit Bridge reaches a deep cosine near $+0.68$ yet matches only the DFA accuracy, so Mode~2 has method-dependent severity and deep cosine is not a sufficient predictor of final accuracy across methods. At the same time, at $\lambda{=}10^{-4}$ Mode~1 is alleviated while the DFA deep cosine still stays near zero, and at vanilla DFA epoch 1 the reference is already meaningful at about $6 \times 10^{-7}$ but the deep cosine is still $-0.008 \pm 0.013$ across three seeds. 
The failure is therefore neither unitary nor uniform: Mode~1 and Mode~2 are observationally separable, and within the audited fixed-feedback family, the severity of each mode varies by method.
+\paragraph{Two failure modes and their separability.} Our audit shows that modern residual vision models can make these two quantities look informative while failing to answer the question they are taken to answer. Figure~\ref{fig:audit_hero} shows the first failure mode, which we call \emph{Mode 1: measurement degeneracy}, where residual-stream growth drives the deepest hidden state to about $\|h_L\| \sim 10^8$ under DFA/SB/CB while the corresponding BP reference collapses to $\|g_L\| \sim 4 \times 10^{-10}$ for DFA (three-seed mean), so the deep-layer cosine is measured against a clamp-dominated floor rather than a meaningful target direction. The same figure also shows the second failure mode, \emph{Mode 2: low intrinsic credit-direction quality}, because even after comparing against the stronger frozen-blocks baseline ($0.349 \pm 0.002$) and looking layer-by-layer, DFA's deep blocks remain essentially null while only layer 0 is visibly positive. Intervention sharpens both modes. Adding a per-block residual penalty $\lambda \|f_l(h_l)\|^2$ to DFA at $\lambda{=}10^{-2}$ contains $\|h_L\|$ to about $4\times 10^4$ and lifts the deep BP reference to about $10^{-6}$, but DFA's rescued deep cosine is only about $+0.15$; State Bridge under the same intervention reaches a three-seed deep cosine of $+0.32$ and, unlike DFA, exceeds the frozen-blocks baseline by $+10$ points in final accuracy; Credit Bridge reaches a deep cosine near $+0.68$ yet matches only the DFA accuracy, so Mode~2 has method-dependent severity and deep cosine is not a sufficient predictor of final accuracy across methods. At the same time, at $\lambda{=}10^{-4}$ Mode~1 is alleviated while the DFA deep cosine still stays near zero, and at vanilla DFA epoch 1 the reference is already meaningful at about $6 \times 10^{-7}$ but the deep cosine is still $-0.008 \pm 0.013$ across three seeds. 
The failure is therefore neither unitary nor uniform: Mode~1 and Mode~2 are observationally separable, and within the audited fixed-feedback family, the severity of each mode varies by method.
-Accordingly, this paper does not introduce a new FA variant or a new benchmark. Of the five methods we audit, BP, EP, and DFA are established baselines from the published literature; the remaining two, which we call \emph{State Bridge} and \emph{Credit Bridge}, are diagnostic probes we construct in this paper to directly learn the two targets that different strands of the BP-free literature argue should produce good per-layer credit (formal definitions and citations in Section~\ref{sec:audit}). Instead, Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero} use a standard five-method CIFAR-10 audit to show that status-quo reporting would treat BP, EP, DFA, State Bridge, and Credit Bridge as the same kind of evidence-bearing object even though only BP and EP remain trustworthy under matched diagnostic checks. This makes the contribution methodological in the sense of \citet{jordan2020evaluating}, \citet{obray2022evaluation}, and \citet{paleka2026pitfalls}: the central question is not whether one more FA variant can post a headline number, but whether the reporting pipeline distinguishes meaningful credit-direction evidence from numerical-floor artifacts and from shallow-only learning. The protocol therefore starts from per-layer diagnostics and a frozen-blocks baseline before reading any aggregate cosine or final accuracy as evidence about deep credit assignment. We first show the walk-back on a standard audit, then isolate the two failure modes, and finally state the reporting protocol that future FA papers should satisfy.
+\paragraph{Contribution: a methodology paper, not a new FA variant.} Accordingly, this paper does not introduce a new FA variant or a new benchmark. Of the five methods we audit, BP, EP, and DFA are established baselines from the published literature; the remaining two, which we call \emph{State Bridge} and \emph{Credit Bridge}, are diagnostic probes we construct in this paper to directly learn the two targets that different strands of the BP-free literature argue should produce good per-layer credit (formal definitions and citations in Section~\ref{sec:audit}). Instead, Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero} use a standard five-method CIFAR-10 audit to show that status-quo reporting would treat BP, EP, DFA, State Bridge, and Credit Bridge as the same kind of evidence-bearing object even though only BP and EP remain trustworthy under matched diagnostic checks. This makes the contribution methodological in the sense of \citet{jordan2020evaluating}, \citet{obray2022evaluation}, and \citet{paleka2026pitfalls}: the central question is not whether one more FA variant can post a headline number, but whether the reporting pipeline distinguishes meaningful credit-direction evidence from numerical-floor artifacts and from shallow-only learning. The protocol therefore starts from per-layer diagnostics and a frozen-blocks baseline before reading any aggregate cosine or final accuracy as evidence about deep credit assignment. We first show the walk-back on a standard audit, then isolate the two failure modes, and finally state the reporting protocol that future FA papers should satisfy.
\section{Audit: Standard Reporting Walks Back Nothing}
\label{sec:audit}
-Table~\ref{tab:main_audit} fixes the canonical audit to a 4-block pre-LayerNorm ResMLP with width $d{=}256$ on CIFAR-10, trained for 100 epochs with AdamW (learning rate $10^{-3}$, weight decay $0.01$), a cosine schedule, batch size $128$, and three seeds (42, 123, 456); all five methods are read against the identical architecture, optimizer, schedule, and training budget without method-specific tuning, and Figure~\ref{fig:audit_hero} summarizes the corresponding per-block growth, deepest-layer BP reference norm, cross-batch stability, and frozen-baseline comparison.
+\paragraph{Setup: 5-method audit on a 4-block pre-LayerNorm ResMLP.} Table~\ref{tab:main_audit} fixes the canonical audit to a 4-block pre-LayerNorm ResMLP with width $d{=}256$ on CIFAR-10, trained for 100 epochs with AdamW (learning rate $10^{-3}$, weight decay $0.01$), a cosine schedule, batch size $128$, and three seeds (42, 123, 456); all five methods are read against the identical architecture, optimizer, schedule, and training budget without method-specific tuning, and Figure~\ref{fig:audit_hero} summarizes the corresponding per-block growth, deepest-layer BP reference norm, cross-batch stability, and frozen-baseline comparison.
\begin{table}[t]
\centering
@@ -65,13 +65,13 @@ Credit Bridge & $0.289 \pm 0.026$ & $0.07$ & trustworthy & walked back \\
\end{tabular}}
\end{table}
-Two rows in Table~\ref{tab:main_audit}, \emph{State Bridge} (SB) and \emph{Credit Bridge} (CB), are diagnostic probes we construct in this paper, not prior FA variants. Each directly learns a target that a different strand of the BP-free literature argues should produce good per-layer credit, and each uses the same block local loss $-\langle f_l(h_l),\, a_l\rangle$ as DFA but with a different $a_l$. SB instantiates the target-propagation view that accurate prediction of a downstream hidden state yields a usable credit signal \citep{bengio2014autoencoders,lee2015difference}: an auxiliary $G_\psi(h_l, t_l, s)$ is fit by MSE to predict $h_L$ from $(h_l, t_l{=}l/L, s{=}e_T)$, and $a_l^{\mathrm{SB}} = \nabla_{h_l}\,\mathrm{CE}(W_{\mathrm{out}}\,\mathrm{LN}(G_\psi(h_l, t_l, s)), y)$. CB instantiates the synthetic-gradient view that a learned value network, if its input-gradient approximates the BP gradient, can stand in for it \citep{jaderberg2017decoupled}: $V_\phi(h_l, t_l, s)$ is fit via a bridge residual against an EMA target, and $a_l^{\mathrm{CB}} = \nabla_{h_l} V_\phi(h_l, t_l, s)$. Both auxiliaries are trained on detached hidden states. We use SB and CB as controls that populate different points in the (angular agreement with BP, functional usefulness) plane; that is what makes the cross-method cosine-versus-accuracy dissociation in Section~\ref{sec:mode2} visible.
+\paragraph{State Bridge and Credit Bridge: diagnostic probes constructed for this paper.} Two rows in Table~\ref{tab:main_audit}, \emph{State Bridge} (SB) and \emph{Credit Bridge} (CB), are diagnostic probes we construct in this paper, not prior FA variants. Each directly learns a target that a different strand of the BP-free literature argues should produce good per-layer credit, and each uses the same block local loss $-\langle f_l(h_l),\, a_l\rangle$ as DFA but with a different $a_l$. SB instantiates the target-propagation view that accurate prediction of a downstream hidden state yields a usable credit signal \citep{bengio2014autoencoders,lee2015difference}: an auxiliary $G_\psi(h_l, t_l, s)$ is fit by MSE to predict $h_L$ from $(h_l, t_l{=}l/L, s{=}e_T)$, and $a_l^{\mathrm{SB}} = \nabla_{h_l}\,\mathrm{CE}(W_{\mathrm{out}}\,\mathrm{LN}(G_\psi(h_l, t_l, s)), y)$. CB instantiates the synthetic-gradient view that a learned value network, if its input-gradient approximates the BP gradient, can stand in for it \citep{jaderberg2017decoupled}: $V_\phi(h_l, t_l, s)$ is fit via a bridge residual against an EMA target, and $a_l^{\mathrm{CB}} = \nabla_{h_l} V_\phi(h_l, t_l, s)$. Both auxiliaries are trained on detached hidden states. We use SB and CB as controls that populate different points in the (angular agreement with BP, functional usefulness) plane; that is what makes the cross-method cosine-versus-accuracy dissociation in Section~\ref{sec:mode2} visible.
-By the field's usual criteria, the non-BP methods appear to train to nontrivial accuracy and report nonzero alignment. In Table~\ref{tab:main_audit}, DFA reaches $0.306 \pm 0.006$ test accuracy with headline $\Gamma{=}0.10$, State Bridge reaches $0.205 \pm 0.032$ with $\Gamma{=}0.005$, and Credit Bridge reaches $0.289 \pm 0.026$ with $\Gamma{=}0.07$; none of these rows looks like an obvious invalidation if one is reading the usual pair of final accuracy and aggregate alignment in the style of prior FA reporting \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct}. Even the absolute scale does not itself force a walk-back, because all three methods are plainly above chance and all three report positive headline alignment rather than a visibly broken or undefined quantity. That reading is exactly what the rest of the paper overturns.
+\paragraph{Status-quo reading: every method looks acceptable.} By the field's usual criteria, the non-BP methods appear to train to nontrivial accuracy and report nonzero alignment. In Table~\ref{tab:main_audit}, DFA reaches $0.306 \pm 0.006$ test accuracy with headline $\Gamma{=}0.10$, State Bridge reaches $0.205 \pm 0.032$ with $\Gamma{=}0.005$, and Credit Bridge reaches $0.289 \pm 0.026$ with $\Gamma{=}0.07$; none of these rows looks like an obvious invalidation if one is reading the usual pair of final accuracy and aggregate alignment in the style of prior FA reporting \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct}. Even the absolute scale does not itself force a walk-back, because all three methods are plainly above chance and all three report positive headline alignment rather than a visibly broken or undefined quantity. That reading is exactly what the rest of the paper overturns.
-Low accuracy by itself is not the pathology. Equilibrium Propagation (EP), a contrastive energy-based alternative to BP that updates weights from the difference between a free-phase and a nudged-phase hidden trajectory, is the key internal comparison in Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero}: it achieves only $0.316 \pm 0.030$ accuracy and a very small headline $\Gamma{=}0.008$, yet its three-seed mean max-per-block growth is only $6.6\times$ (highest single-seed value $11.0\times$), its deepest BP reference norm remains around $1.3\times 10^{-4}$ rather than collapsing to the numerical floor, and its cross-batch direction-stability score is $0.02$ rather than the much higher drift-dominated values seen for DFA-family methods. At the same time, EP is not a positive result for depth usage in the stronger sense, because its trainable-model accuracy is still $3.3$ percentage points below the frozen-blocks baseline of $0.349 \pm 0.002$. The distinction matters because it separates underperformance from invalid evaluation.
+\paragraph{EP as the internal control: low accuracy without invalid measurement.} Low accuracy by itself is not the pathology. Equilibrium Propagation (EP), a contrastive energy-based alternative to BP that updates weights from the difference between a free-phase and a nudged-phase hidden trajectory, is the key internal comparison in Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero}: it achieves only $0.316 \pm 0.030$ accuracy and a very small headline $\Gamma{=}0.008$, yet its three-seed mean max-per-block growth is only $6.6\times$ (highest single-seed value $11.0\times$), its deepest BP reference norm remains around $1.3\times 10^{-4}$ rather than collapsing to the numerical floor, and its cross-batch direction-stability score is $0.02$ rather than the much higher drift-dominated values seen for DFA-family methods. At the same time, EP is not a positive result for depth usage in the stronger sense, because its trainable-model accuracy is still $3.3$ percentage points below the frozen-blocks baseline of $0.349 \pm 0.002$. The distinction matters because it separates underperformance from invalid evaluation.
-When we compare each method to a frozen-blocks baseline matched to the same architecture, the headline interpretation changes immediately. The frozen-blocks model, which trains only the embedding, LayerNorm, and head while holding the residual blocks fixed, reaches $0.349 \pm 0.002$ across the same three seeds; against that baseline, BP is higher by $26.6$ points, but DFA is lower by $4.3$ points, State Bridge by $14.4$ points, Credit Bridge by $6.0$ points, and even EP by $3.3$ points. Figure~\ref{fig:audit_hero} shows that this accuracy comparison lines up with the diagnostic split: DFA, State Bridge, and Credit Bridge also combine extreme per-block growth (three-seed mean max ratios $\sim\!1.9\times 10^3$, $\sim\!1.6\times 10^4$, and $\sim\!1.2\times 10^3$ respectively), deepest-layer BP norms around $10^{-9}$, and high cross-batch instability ($0.16$, $0.53$, and $0.37$), so their deep blocks are at best passengers and in practice often harmful. This establishes the audit question the rest of the paper must answer: why do the standard signals fail so badly?
+\paragraph{Frozen-blocks baseline overturns the status-quo reading.} When we compare each method to a frozen-blocks baseline matched to the same architecture, the headline interpretation changes immediately. The frozen-blocks model, which trains only the embedding, LayerNorm, and head while holding the residual blocks fixed, reaches $0.349 \pm 0.002$ across the same three seeds; against that baseline, BP is higher by $26.6$ points, but DFA is lower by $4.3$ points, State Bridge by $14.4$ points, Credit Bridge by $6.0$ points, and even EP by $3.3$ points. Figure~\ref{fig:audit_hero} shows that this accuracy comparison lines up with the diagnostic split: DFA, State Bridge, and Credit Bridge also combine extreme per-block growth (three-seed mean max ratios $\sim\!1.9\times 10^3$, $\sim\!1.6\times 10^4$, and $\sim\!1.2\times 10^3$ respectively), deepest-layer BP norms around $10^{-9}$, and high cross-batch instability ($0.16$, $0.53$, and $0.37$), so their deep blocks are at best passengers and in practice often harmful. This establishes the audit question the rest of the paper must answer: why do the standard signals fail so badly?
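The frozen-blocks gaps quoted in the paragraph above follow directly from the three-seed mean accuracies given in the text. The sketch below only redoes that arithmetic; the names `FROZEN_BASELINE`, `mean_accuracy`, and `gap_points` are illustrative, and BP is left out because its accuracy is not stated in the visible text.

```python
# Three-seed mean test accuracies as quoted in the text.
FROZEN_BASELINE = 0.349

mean_accuracy = {
    "DFA": 0.306,
    "State Bridge": 0.205,
    "Credit Bridge": 0.289,
    "EP": 0.316,
}


def gap_points(acc, baseline=FROZEN_BASELINE):
    """Accuracy gap to the frozen-blocks baseline, in percentage points."""
    return round((acc - baseline) * 100, 1)


gaps = {name: gap_points(acc) for name, acc in mean_accuracy.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} points vs. frozen blocks")
```

A negative gap means the method with trainable blocks ends up below a model whose residual blocks were never trained, which is the comparison the paragraph uses to overturn the status-quo reading.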
\begin{figure}[t]
\centering
@@ -195,7 +195,7 @@ Diag. & Measurement & Default threshold & Role \\
\section{Discussion, Limits, Conclusion}
\label{sec:discussion}
-\looseness=-2 Our claim is about evidence, not impossibility: we show that current FA evaluation practice can misread what happened, not that FA cannot work in deep networks. DFA, SB, and CB all pass status-quo reporting (Table~\ref{tab:main_audit}) but fail the protocol's deep checks, and the Figure~\ref{fig:penalty_rescue} penalty partially rescues credit signal rather than validating headlines. Our strongest claim is scoped to $d{=}256/512$ pre-LayerNorm ResMLPs and ViT-Mini, where both Mode~1 diagnostics fire; the no-terminal-LN ResMLP ablation establishes terminal LayerNorm as causally necessary for diagnostic~(b) on residual ResMLP and (with the BatchNorm CNN) shows that activation growth can persist without gradient-floor collapse; the dataset is CIFAR-10; and the BP-plus-penalty comparison is a lower bound, not a full decomposition. In the evaluation-methodology line of \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}, FA papers should report BP-reference validity, layerwise credit quality, and a frozen-blocks depth-utilization baseline as separate axes, not a single headline.
+\paragraph{Scope, limits, and reporting recommendation.} \looseness=-2 Our claim is about evidence, not impossibility: we show that current FA evaluation practice can misread what happened, not that FA cannot work in deep networks. DFA, SB, and CB all pass status-quo reporting (Table~\ref{tab:main_audit}) but fail the protocol's deep checks, and the Figure~\ref{fig:penalty_rescue} penalty partially rescues credit signal rather than validating headlines. Our strongest claim is scoped to $d{=}256/512$ pre-LayerNorm ResMLPs and ViT-Mini, where both Mode~1 diagnostics fire; the no-terminal-LN ResMLP ablation establishes terminal LayerNorm as causally necessary for diagnostic~(b) on residual ResMLP and (with the BatchNorm CNN) shows that activation growth can persist without gradient-floor collapse; the dataset is CIFAR-10; and the BP-plus-penalty comparison is a lower bound, not a full decomposition. In the evaluation-methodology line of \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}, FA papers should report BP-reference validity, layerwise credit quality, and a frozen-blocks depth-utilization baseline as separate axes, not a single headline.
\begin{thebibliography}{10}