author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 15:13:22 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 15:13:22 -0500
commit: da988679e29d99e4bc6b2788b2bd873d9cd5cde0
tree: 5724b21efa111e6075247c2b154b67dde117a3ac /paper
parent: 0abfa8e96afffd75cdb6a985603dceb55a284427
SB/CB probe reframe + compression + Figure 5 to appendix (user-approved)
User pushed back on SB/CB being treated as 'audited FA methods' because they're our own constructions. Reframe them as diagnostic probes built on two prior-literature assumptions (state=credit and credit=performance).

§1 intro: add 1 sentence clarifying BP/EP/DFA are established baselines and SB/CB are probes constructed in this paper.

§2 ¶2 new opening paragraph (before 'By the field's usual criteria'):
- SB/CB are probes, not prior FA variants
- Each directly learns a target from a prior-literature view
- SB: target-propagation view (Bengio 2014, Lee 2015); auxiliary G_ψ(h_l,t_l,s) predicts h_L via MSE; a_l^SB = ∇_{h_l} CE(W_out LN(G_ψ(h_l,t_l,s)), y)
- CB: synthetic-gradient view (Jaderberg 2017); auxiliary V_φ(h_l,t_l,s) trained via bridge residual; a_l^CB = ∇_{h_l} V_φ(h_l,t_l,s)
- Both auxiliaries trained on detached hidden states
- Role: populate different points in the (angular alignment, functional usefulness) plane, making the §4 cos-vs-acc dissociation visible

Bibliography: added Bengio 2014 (arXiv 1407.7906), Lee et al. 2015 (ECML PKDD), Jaderberg et al. 2017 (ICML); all verified via WebSearch.

Page budget: the ~180-word §2 addition pushed §7 onto p10. Recovered space by:
(a) compressing §2 ¶1 opening
(b) compressing §3 ¶2 falsification chain (tighter number formatting)
(c) compressing §6 ¶3 asymmetry paragraph
(d) merging §7 into a single paragraph (was 3)
(e) moving Figure 5 (decision_utility) from §6 main text to a floated appendix figure in Appendix D (the 'all seven validations' appendix, which is conceptually related). The decision-utility ablation's headline ('accuracy+Γ walks back 0/5, full protocol walks back 3/5') is already in §6 prose so the figure functions as supporting backup.

Result: main content is strictly 9 pages (§1-§7 on p1-p9). References and appendices on p10+. Total 18 pages, 0 overfull hbox.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diffstat (limited to 'paper')
-rw-r--r-- paper/main.pdf | bin 486667 -> 491099 bytes
-rw-r--r-- paper/main.tex | 50
2 files changed, 34 insertions, 16 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index da3099a..c336cb8 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index 4dc3f1e..279ef82 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -8,6 +8,7 @@
\usepackage{hyperref}
\usepackage{url}
\usepackage{booktabs}
+\usepackage{float}
\usepackage{tabularx}
\newcolumntype{L}{>{\raggedright\arraybackslash}X}
\usepackage{amsfonts}
@@ -36,12 +37,12 @@ Feedback-alignment papers are usually judged by two numbers: task accuracy and a
Our audit shows that modern residual vision models can make these two quantities look informative while failing to answer the question they are taken to answer. Figure~\ref{fig:audit_hero} shows the first failure mode, which we call \emph{Mode 1: measurement degeneracy}, where residual-stream growth drives the deepest hidden state to about $\|h_L\| \sim 10^8$ under DFA/SB/CB while the corresponding BP reference collapses to $\|g_L\| \sim 5 \times 10^{-10}$, so the deep-layer cosine is measured against a clamp-dominated floor rather than a meaningful target direction. The same figure also shows the second failure mode, \emph{Mode 2: low intrinsic credit-direction quality}, because even after comparing against the stronger frozen-blocks baseline ($0.349 \pm 0.002$) and looking layer-by-layer, DFA's deep blocks remain essentially null while only layer 0 is visibly positive. Intervention sharpens both modes. Adding a per-block residual penalty $\lambda \|f_l(h_l)\|^2$ to DFA at $\lambda{=}10^{-2}$ contains $\|h_L\|$ to about $4\times 10^4$ and lifts the deep BP reference to about $10^{-6}$, but DFA's rescued deep cosine is only about $+0.16$; State Bridge under the same intervention reaches a three-seed deep cosine of $+0.32$ and, unlike DFA, exceeds the frozen-blocks baseline by $+10$ points in final accuracy; Credit Bridge reaches a deep cosine near $+0.68$ yet matches only the DFA accuracy, so Mode~2 has method-dependent severity and deep cosine is not a sufficient predictor of final accuracy across methods. At the same time, at $\lambda{=}10^{-4}$ Mode~1 is alleviated while the DFA deep cosine still stays near zero, and at vanilla DFA epoch 1 the reference is already meaningful at about $6 \times 10^{-7}$ but the deep cosine is still $-0.008 \pm 0.013$ across three seeds. The failure is therefore neither unitary nor uniform: Mode~1 and Mode~2 are observationally separable, and within the audited fixed-feedback family, the severity of each mode varies by method.
-Accordingly, this paper does not introduce a new FA variant or a new benchmark. Instead, Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero} use a standard five-method CIFAR-10 audit to show that status-quo reporting would treat BP, EP, DFA, State Bridge, and Credit Bridge as the same kind of evidence-bearing object even though only BP and EP remain trustworthy under matched diagnostic checks. This makes the contribution methodological in the sense of \citet{jordan2020evaluating}, \citet{obray2022evaluation}, and \citet{paleka2026pitfalls}: the central question is not whether one more FA variant can post a headline number, but whether the reporting pipeline distinguishes meaningful credit-direction evidence from numerical-floor artifacts and from shallow-only learning. The protocol therefore starts from per-layer diagnostics and a frozen-blocks baseline before reading any aggregate cosine or final accuracy as evidence about deep credit assignment. We first show the walk-back on a standard audit, then isolate the two failure modes, and finally state the reporting protocol that future FA papers should satisfy.
+Accordingly, this paper does not introduce a new FA variant or a new benchmark. Of the five methods we audit, BP, EP, and DFA are established baselines from the published literature; the remaining two, which we call \emph{State Bridge} and \emph{Credit Bridge}, are diagnostic probes we construct in this paper to directly learn the two targets that different strands of the BP-free literature argue should produce good per-layer credit (formal definitions and citations in Section~\ref{sec:audit}). Rather than adding a method, Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero} use a standard five-method CIFAR-10 audit to show that status-quo reporting would treat BP, EP, DFA, State Bridge, and Credit Bridge as the same kind of evidence-bearing object even though only BP and EP remain trustworthy under matched diagnostic checks. This makes the contribution methodological in the sense of \citet{jordan2020evaluating}, \citet{obray2022evaluation}, and \citet{paleka2026pitfalls}: the central question is not whether one more FA variant can post a headline number, but whether the reporting pipeline distinguishes meaningful credit-direction evidence from numerical-floor artifacts and from shallow-only learning. The protocol therefore starts from per-layer diagnostics and a frozen-blocks baseline before reading any aggregate cosine or final accuracy as evidence about deep credit assignment. We first show the walk-back on a standard audit, then isolate the two failure modes, and finally state the reporting protocol that future FA papers should satisfy.
\section{Audit: Standard Reporting Walks Back Nothing}
\label{sec:audit}
-We begin with the smallest setting in which all methods can be compared head-to-head under identical architecture, optimizer family, and data. Table~\ref{tab:main_audit} fixes that canonical audit to a 4-block pre-LayerNorm ResMLP with width $d{=}256$ on CIFAR-10, trained for 100 epochs with AdamW (learning rate $10^{-3}$, weight decay $0.01$), a cosine schedule, and three seeds (42, 123, 456). Within that single setting, BP, EP, DFA, State Bridge, and Credit Bridge can be read against the same architecture and the same training budget, while Figure~\ref{fig:audit_hero} summarizes the corresponding per-block growth, deepest-layer BP reference norm, cross-batch stability, and frozen-baseline comparison. This is the table a reader would normally use to decide whether the methods trained the deep network.
+Table~\ref{tab:main_audit} fixes the canonical audit to a 4-block pre-LayerNorm ResMLP with width $d{=}256$ on CIFAR-10, trained for 100 epochs with AdamW (learning rate $10^{-3}$, weight decay $0.01$), a cosine schedule, and three seeds (42, 123, 456); all five methods are read against the same architecture, optimizer, and training budget, and Figure~\ref{fig:audit_hero} summarizes the corresponding per-block growth, deepest-layer BP reference norm, cross-batch stability, and frozen-baseline comparison.
\begin{table}[t]
\centering
@@ -62,6 +63,8 @@ Credit Bridge & $0.289 \pm 0.026$ & $0.07$ & trustworthy & walked back \\
\end{tabular}}
\end{table}
+Two rows in Table~\ref{tab:main_audit}, \emph{State Bridge} (SB) and \emph{Credit Bridge} (CB), are diagnostic probes we construct in this paper, not prior FA variants. Each directly learns a target that a different strand of the BP-free literature argues should produce good per-layer credit, and each uses the same block local loss $-\langle f_l(h_l),\, a_l\rangle$ as DFA but with a different $a_l$. SB instantiates the target-propagation view that accurate prediction of a downstream hidden state yields a usable credit signal \citep{bengio2014autoencoders,lee2015difference}: an auxiliary $G_\psi(h_l, t_l, s)$ is fit by MSE to predict $h_L$ from $(h_l, t_l{=}l/L, s{=}e_T)$, and $a_l^{\mathrm{SB}} = \nabla_{h_l}\,\mathrm{CE}(W_{\mathrm{out}}\,\mathrm{LN}(G_\psi(h_l, t_l, s)), y)$. CB instantiates the synthetic-gradient view that a learned value network, if its input-gradient approximates the BP gradient, can stand in for it \citep{jaderberg2017decoupled}: $V_\phi(h_l, t_l, s)$ is fit via a bridge residual against an EMA target, and $a_l^{\mathrm{CB}} = \nabla_{h_l} V_\phi(h_l, t_l, s)$. Both auxiliaries are trained on detached hidden states. We use SB and CB as controls that populate different points in the (angular agreement with BP, functional usefulness) plane; that is what makes the cross-method cosine-versus-accuracy dissociation in Section~\ref{sec:mode2} visible.
+
By the field's usual criteria, the non-BP methods appear to train to nontrivial accuracy and report nonzero alignment. In Table~\ref{tab:main_audit}, DFA reaches $0.306 \pm 0.006$ test accuracy with headline $\Gamma{=}0.10$, State Bridge reaches $0.205 \pm 0.032$ with $\Gamma{=}0.005$, and Credit Bridge reaches $0.289 \pm 0.026$ with $\Gamma{=}0.07$; none of these rows looks like an obvious invalidation if one is reading the usual pair of final accuracy and aggregate alignment in the style of prior FA reporting \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct}. Even the absolute scale does not itself force a walk-back, because all three methods are plainly above chance and all three report positive headline alignment rather than a visibly broken or undefined quantity. That reading is exactly what the rest of the paper overturns.
Low accuracy by itself is not the pathology. Equilibrium Propagation (EP), a contrastive energy-based alternative to BP that updates weights from the difference between a free-phase and a nudged-phase hidden trajectory, is the key internal comparison in Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero}: it achieves only $0.316 \pm 0.030$ accuracy and a very small headline $\Gamma{=}0.008$, yet its per-block growth is only $11.6\times$, its deepest BP reference norm remains around $1.3\times 10^{-4}$ rather than collapsing to the numerical floor, and its cross-batch direction-stability score is $0.02$ rather than the much higher drift-dominated values seen for DFA-family methods. At the same time, EP is not a positive result for depth usage in the stronger sense, because its trainable-model accuracy is still $3.3$ percentage points below the frozen-blocks baseline of $0.349 \pm 0.002$. The distinction matters because it separates underperformance from invalid evaluation.
@@ -80,7 +83,7 @@ When we compare each method to a frozen-blocks baseline matched to the same arch
Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge each update $f_l$ by reducing a local loss of the form $-\langle f_l(h_l),\, a_l\rangle$, where the per-layer credit vector $a_l$ is the method-specific projection of the output error (for DFA, $a_l = B_l^\top e_T$ with a fixed random $B_l$; for State Bridge, $a_l$ is the gradient of a cross-entropy loss measured through a learned state predictor $G_\psi(h_l,t_l,s)$ that estimates $h_L$; for Credit Bridge, $a_l$ is the gradient of a learned value network $V(h_l,t_l,s)$). None of these three local losses contains a penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the method's fixed or learned credit target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian: in terminal-LN architectures $\partial\mathrm{LN}(h)/\partial h \propto 1/\|h\|$ in expectation, so the same residual-stream inflation is accompanied by collapse of the hidden-layer BP reference norm. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives $\|h_L\|$ from about $9$ at initialization to about $4\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $5\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). 
At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to a meaningful BP direction.
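The effect of the clamp can be illustrated with a minimal sketch. It implements the expression quoted above directly in plain Python, not any particular library routine, and the vectors `a`, `g_dir`, `healthy`, and `floored` are hypothetical values chosen so that the credit vector is perfectly aligned with the reference in both cases.

```python
import math

def clamped_cosine(a, b, eps=1e-8):
    """(a . b) / (||a|| * max(||b||, eps)), the clamped form quoted in the text."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * max(norm_b, eps))

# Hypothetical credit vector and a unit reference direction aligned with it.
a = [3.0, 4.0]
g_dir = [0.6, 0.8]

# Reference norm well above the clamp: the cosine is the true value.
healthy = [1e-4 * x for x in g_dir]
# Reference norm about 20x below the clamp, as at the DFA endpoint.
floored = [5e-10 * x for x in g_dir]

cos_healthy = clamped_cosine(a, healthy)
cos_floored = clamped_cosine(a, floored)
print(cos_healthy, cos_floored)
```

In this toy case the true cosine is 1 in both settings, yet the floored reference reports roughly $0.05$, the ratio of the reference norm to the clamp. Under these assumptions the reported number reflects the distance to the floor rather than the direction.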
-We tested this mechanism story against four natural alternative attributions, all of which it survives. \emph{Not residual-skip-driven:} on the same ResMLP-d256 with terminal LN kept and the additive skip removed ($h_{l+1}{=}F_l(h_l)$), DFA still inflates $\|h_L\|$ from $\sim\!5$ to $\sim\!2.2{\times}10^{4}$ in three epochs and converges to $\|h_L\|{\approx}1.06{\times}10^{8}$ and $\|g_L\|{\approx}1.09{\times}10^{-10}$ at $100$ epochs, both already at the diagnostic floor (Appendix~\ref{app:no_residual}). \emph{Not task-signal-driven:} replacing labels by i.i.d.\ random class targets refreshed every minibatch on the same backbone, DFA still reaches $\|h_L\|{\approx}1.67{\times}10^{8}$ and $\|g_L\|{\approx}8{\times}10^{-12}$ at $100$ epochs while accuracy stays at chance (Appendix~\ref{app:random_targets}). \emph{Not DFA-specific:} the same random-target ablation also drives $\|h_L\|$ from $9$ to $6.2{\times}10^{3}$ for State Bridge and $2.0{\times}10^{4}$ for Credit Bridge in three epochs, again at chance accuracy, so all three audited fixed-feedback methods exhibit data-agnostic activation growth (Appendix~\ref{app:random_targets}). \emph{Not shared by EP:} under the same random-target protocol, EP keeps $\|h_L\|{\approx}586$ at five epochs of training, $25\times$ smaller than DFA's three-epoch value on the same architecture, consistent with EP's bounded behavior on real labels and confirming that the random-target assay separates the explosion-prone fixed-feedback class from EP's energy-based local objective.
+We tested this mechanism story against four natural alternative attributions, all of which it survives. \emph{Not residual-skip-driven:} with terminal LN kept and the additive skip removed ($h_{l+1}{=}F_l(h_l)$), DFA still converges to $\|h_L\|{\approx}1.06{\times}10^{8}$ and $\|g_L\|{\approx}1.09{\times}10^{-10}$ at $100$ epochs, both at the diagnostic floor (Appendix~\ref{app:no_residual}). \emph{Not task-signal-driven:} under i.i.d.\ random class targets per minibatch, DFA still reaches $\|h_L\|{\approx}1.67{\times}10^{8}$ and $\|g_L\|{\approx}8{\times}10^{-12}$ while accuracy stays at chance (Appendix~\ref{app:random_targets}). \emph{Not DFA-specific:} the same random-target ablation drives $\|h_L\|$ to $6.2{\times}10^{3}$ for SB and $2.0{\times}10^{4}$ for CB in three epochs, so all three audited fixed-feedback methods exhibit data-agnostic activation growth. \emph{Not shared by EP:} under the same protocol, EP keeps $\|h_L\|{\approx}586$ at five epochs, $25\times$ smaller than DFA's three-epoch value, confirming that the random-target assay separates the explosion-prone fixed-feedback class from EP's energy-based objective.
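The growth incentive in part~(a) can be made concrete with a small sketch, assuming a block output of the form $f = c\,u$ for a fixed direction $u$ and a fixed credit vector $a$. The names `local_loss`, `u`, `a`, and `best_scale` are hypothetical and used only for illustration; the penalty weight $10^{-2}$ matches the value used in the text.

```python
def local_loss(scale, u, a, lam=0.0):
    """Local loss -<f, a> + lam * ||f||^2 for a block output f = scale * u."""
    f = [scale * x for x in u]
    inner = sum(fi * ai for fi, ai in zip(f, a))
    sq_norm = sum(fi * fi for fi in f)
    return -inner + lam * sq_norm

# Hypothetical direction and credit vector with positive inner product.
u = [1.0, 0.0]
a = [0.5, 0.5]

# Without a penalty the loss keeps falling as the output scale grows.
unpenalized = [local_loss(s, u, a) for s in (1.0, 10.0, 100.0)]

# With a penalty the loss along u is a parabola in the scale, minimized at
# <u, a> / (2 * lam * ||u||^2).
lam = 1e-2
inner_ua = sum(x * y for x, y in zip(u, a))
sq_norm_u = sum(x * x for x in u)
best_scale = inner_ua / (2 * lam * sq_norm_u)
penalized = [local_loss(s, u, a, lam)
             for s in (best_scale / 2, best_scale, best_scale * 2)]
print(unpenalized, best_scale, penalized)
```

The unpenalized loss is linear in the scale and therefore unbounded below, which is the sense in which larger block outputs are rewarded. Adding $\lambda\|f\|^2$ gives a finite optimal scale along any fixed direction.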
The matched same-backbone causal control for diagnostic~(b) is removing terminal LayerNorm. On the same ResMLP-d256 with the residual skip intact, $100$ epochs of DFA, three seeds, the residual stream still inflates to $\|h_L\|\!\approx\!1.21\times 10^7$, but the deepest hidden-layer BP gradient remains at $\|g_L\|\!\approx\!7.2\times 10^{-4}$ (four orders of magnitude above the diagnostic~(b) floor), and the final test accuracy is $0.327\pm 0.012$, statistically indistinguishable from vanilla DFA's $0.306\pm 0.006$ on the same backbone with terminal LayerNorm intact. Removing terminal LayerNorm therefore preserves Mode~1~(a) but cleanly eliminates Mode~1~(b) on the same architecture, while leaving final task accuracy essentially unchanged. Combined with the broader cross-architecture pattern (StudentNet and the BatchNorm CNN, which lack terminal LayerNorm, never trigger diagnostic~(b); ViT-Mini with a terminal LN does, by epochs 2--3 (Figure~\ref{fig:temporal_cross_arch})), terminal LayerNorm is necessary for Mode~1~(b) in the audited residual ResMLP and ViT-Mini setting. The collapse is also not a late-epoch curiosity: $\|g_L\|$ drops from $9.8\times 10^{-4}$ at epoch~0 to $6.7\times 10^{-8}$ by epoch~4 in the temporal replay across three seeds, so the protocol fires within the first $11$ epochs of a $100$-epoch run and is actionable as an early-stop criterion rather than a post hoc explanation. Once measurement degeneracy is identified, the next question is whether poor deep credit remains even before collapse.
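A short derivation makes the $1/\|h\|$ scaling behind diagnostic~(b) explicit, assuming a LayerNorm without learned affine parameters and neglecting the variance-stabilizing constant. The symbols $\mu$, $\sigma$, $z$, and $J$ are introduced here for illustration only and are not part of the paper's notation.

```latex
\begin{align*}
z &= \mathrm{LN}(h) = \frac{h - \mu\mathbf{1}}{\sigma}, \qquad
\mu = \tfrac{1}{d}\,\mathbf{1}^\top h, \qquad
\sigma = \tfrac{1}{\sqrt{d}}\,\|h - \mu\mathbf{1}\|, \\
J &= \frac{\partial\,\mathrm{LN}(h)}{\partial h}
  = \frac{1}{\sigma}\Bigl(I - \tfrac{1}{d}\,\mathbf{1}\mathbf{1}^\top
    - \tfrac{1}{d}\,z z^\top\Bigr), \\
\|J\|_{\mathrm{op}} &= \frac{1}{\sigma}
  = \frac{\sqrt{d}}{\|h - \mu\mathbf{1}\|}.
\end{align*}
```

Because $z \perp \mathbf{1}$ and $\|z\|^2 = d$, the bracketed matrix is an orthogonal projector, so its operator norm is $1$ for $d > 2$. Rescaling $h$ by a factor $c > 0$ leaves $z$ unchanged and divides $J$ by $c$, so under these assumptions residual-stream inflation shrinks the gradient passed back through a terminal LayerNorm by the same factor, all else equal.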
@@ -170,23 +173,12 @@ Diag. & Measurement & Default threshold & Role \\
The point of the protocol is not to add plots; it is to prevent a specific class of false conclusions. For this paper, the minimal protocol is four checks: per-layer activation scale via max-per-block growth, deepest hidden BP gradient floor, meaningful-regime per-layer credit quality, and an architecture-matched frozen-blocks baseline (Table~\ref{tab:protocol_def}). The first two ask whether the reference quantity is still valid; the third asks whether, once validity is restored, the deep blocks receive useful directions; and the fourth asks whether the trained depth is doing better than a model whose residual blocks were never trained at all. Figure~\ref{fig:decision_utility} makes the decision value explicit: accuracy alone walks back $0/5$ audited methods, accuracy plus headline $\Gamma$ still walks back $0/5$, and the full protocol walks back $3/5$ by flagging DFA, State Bridge, and Credit Bridge, with diagnostics (a), (b), and (d) each independently sufficient for binary detection on those failures. On our audit, these checks catch failures that accuracy plus aggregate alignment miss completely.
-A useful evaluation rule should reject the bad cases without collapsing everything into a negative result. The protocol is conservative in exactly that sense: it preserves BP and EP as evidence-bearing controls, and it walks back only those claims that fail measurement-validity or depth-utilization checks in Table~\ref{tab:main_audit}. That asymmetry is important because the thresholds are not equally strong in the same way. Diagnostics (a) and (b) have sharp empirical calibration gaps in the audited regime, diagnostic (c) is explicitly a sub-mode discriminator rather than a primary detector, and diagnostic (d) uses a deliberately weak $2$pp margin as a context check rather than a theorem about useful depth. The rule therefore does not say that low accuracy, low aggregate alignment, or any non-BP method is automatically invalid; it says only that claims unsupported by measurement-valid evidence should be withdrawn, while trustworthy controls should remain standing. The Section~\ref{sec:mode2} cross-method cosine-versus-accuracy dissociation reinforces the necessity of keeping all four diagnostics separate: Credit Bridge, State Bridge, and DFA differ by more than a factor of four in deep-layer alignment under the same penalty rescue without tracking final accuracy in the same direction, so aligning an alternative credit rule with the BP gradient is not a substitute for checking depth utilization against a matched shallow baseline. That conservative asymmetry is why the protocol belongs in the main paper rather than the appendix.
-
-\begin{figure}[t]
-\centering
-\includegraphics[width=0.88\linewidth]{figures/fig2_decision_utility.pdf}
-\caption{Decision-utility ablation comparing the field-standard reporting pair against progressively richer diagnostic strategies: accuracy only and accuracy+$\Gamma$ walk back no audited failures, while the full protocol walks back the three silent failures.}
-\label{fig:decision_utility}
-\end{figure}
+The protocol is conservative in a specific sense: it preserves BP and EP as evidence-bearing controls and walks back only claims that fail measurement-validity or depth-utilization checks. Diagnostics (a) and (b) have sharp empirical calibration gaps in the audited regime, diagnostic (c) is a sub-mode discriminator rather than a primary detector, and diagnostic (d) uses a deliberately weak $2$pp margin as a context check rather than a theorem about useful depth. The Section~\ref{sec:mode2} cross-method cosine-versus-accuracy dissociation reinforces the necessity of keeping all four diagnostics separate: Credit Bridge, State Bridge, and DFA differ by more than $4\times$ in deep-layer alignment under the same penalty rescue without tracking final accuracy in the same direction, so aligning an alternative credit rule with the BP gradient is not a substitute for checking depth utilization against a matched shallow baseline.
\section{Discussion, Limits, Conclusion}
\label{sec:discussion}
-Our claim is about what existing evidence licenses, not about impossibility. This paper does not show that FA cannot work in deep networks; it shows that current evaluation practice can misread what happened by letting headline accuracy and aggregate alignment stand in for measurement validity and layerwise credit quality. The strongest examples are precisely the cases where the field-standard summary would sound mildly positive while the audited deep evidence has already collapsed or is already null: DFA, State Bridge, and Credit Bridge all survive status-quo reporting in Table~\ref{tab:main_audit}, yet the protocol shows that their deep claims are unsupported. The intervention results in Figure~\ref{fig:penalty_rescue} reinforce the same distinction, because restoring a measurable regime partially rescues deep credit signal rather than proving that the original headline had been trustworthy all along. That distinction is important because evaluation failure and algorithmic impossibility are different statements.
-
-The right level of generality is the audited regime. Our strongest claim is scoped to modern residual vision architectures, especially the pre-LayerNorm and terminal-LayerNorm settings where we directly observed Mode~1: the 4-block ResMLP at $d{=}256$, its $d{=}512$ extension, and ViT-Mini all show the same basic pattern, whereas StudentNet and the BatchNorm CNN refine the scope by showing that activation-growth failures can persist without the hidden-gradient-floor collapse (Figure~\ref{fig:cross_arch_summary}; Figure~\ref{fig:penalty_rescue}). That leaves clear limits. The dataset is only CIFAR-10, the models are small to medium rather than frontier-scale, the terminal-LayerNorm-necessity claim for diagnostic~(b) is established causally on the audited residual ResMLP via the matched same-backbone no-terminal-LN control but not proven to extend beyond that architecture family, and the BP-plus-penalty comparison is only a lower-bound control on penalty cost rather than a perfect decomposition. Those limitations narrow what is claimed, but they do not weaken the core methodological point that the audited measurement regime can fail silently in exactly the architectures that now dominate this genre of experiment. Future positive or negative examples outside this regime would refine the scope of the protocol, not invalidate the critique.
-
-The main lesson is to decompose the evaluation question before interpreting the answer. Future FA papers should report, separately, whether the BP reference is still meaningful, whether the deep layers receive useful credit in that meaningful regime, and whether trained depth beats an architecture-matched frozen-blocks baseline, instead of compressing those distinct questions into a single headline accuracy or headline $\Gamma$. That is the sense in which this paper fits the evaluation-methodology line of \citet{jordan2020evaluating}, \citet{obray2022evaluation}, and \citet{paleka2026pitfalls}: the contribution is a reporting rule for preventing a repeatable interpretive error, not a new benchmark artifact.
+Our claim is about what existing evidence licenses, not about impossibility: this paper does not show that FA cannot work in deep networks, only that current evaluation practice can misread what happened. DFA, State Bridge, and Credit Bridge all survive status-quo reporting in Table~\ref{tab:main_audit}, yet the protocol shows that their deep claims are unsupported, while the intervention in Figure~\ref{fig:penalty_rescue} partially rescues deep credit signal rather than validating the original headline. Our strongest claim is scoped to the 4-block $d{=}256$ and $d{=}512$ pre-LayerNorm ResMLPs and to ViT-Mini, where parts (a) and (b) of Mode~1 both fire; StudentNet and the BatchNorm CNN refine the scope by showing that activation growth can persist without the gradient-floor collapse. The limits are these: the no-terminal-LN same-backbone control establishes terminal LayerNorm as causally necessary for diagnostic~(b) on the residual ResMLP but does not prove it beyond that family, the dataset is only CIFAR-10, and the BP-plus-penalty comparison is a lower-bound control rather than a full decomposition. The main lesson is to decompose the evaluation question before interpreting the answer: FA papers should report BP-reference validity, layerwise credit quality in that meaningful regime, and the frozen-blocks depth-utilization baseline as three separate axes, rather than as a single headline accuracy or headline $\Gamma$. The contribution is a reporting rule in the evaluation-methodology line of \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}, not a new benchmark artifact.
\begin{thebibliography}{10}
@@ -265,6 +257,25 @@ Ruibin Xiong, Yunchang Yu, et~al.
\newblock On layer normalization in the transformer architecture.
\newblock In {\em International Conference on Machine Learning}, 2020.
+\bibitem[Bengio(2014)]{bengio2014autoencoders}
+Yoshua Bengio.
+\newblock How auto-encoders could provide credit assignment in deep networks
+ via target propagation.
+\newblock {\em arXiv preprint arXiv:1407.7906}, 2014.
+
+\bibitem[Lee et~al.(2015)Lee, Zhang, Fischer, and Bengio]{lee2015difference}
+Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio.
+\newblock Difference target propagation.
+\newblock In {\em European Conference on Machine Learning and Principles and
+ Practice of Knowledge Discovery in Databases (ECML PKDD)}, 2015.
+
+\bibitem[Jaderberg et~al.(2017)Jaderberg, Czarnecki, Osindero, Vinyals, Graves,
+ Silver, and Kavukcuoglu]{jaderberg2017decoupled}
+Max Jaderberg, Wojciech~M. Czarnecki, Simon Osindero, Oriol Vinyals, Alex
+ Graves, David Silver, and Koray Kavukcuoglu.
+\newblock Decoupled neural interfaces using synthetic gradients.
+\newblock In {\em International Conference on Machine Learning}, 2017.
+
\end{thebibliography}
\appendix
@@ -334,6 +345,13 @@ Cross-architecture audit & Which diagnostics generalize? & Activation growth gen
A useful way to read the table is that no single validation carries the paper by itself. The five-method audit shows that the problem exists, temporal replay shows that the protocol is actionable, intervention and null controls show that the two modes respond differently, and cross-architecture evidence shows which parts of the protocol are specific to terminal-normalized residual settings and which parts are more general.
+\begin{figure}[h]
+\centering
+\includegraphics[width=0.82\linewidth]{figures/fig2_decision_utility.pdf}
+\caption{Decision-utility ablation (seven reporting strategies $\times$ five methods) supporting Section~\ref{sec:protocol}: accuracy alone and accuracy$+\Gamma$ walk back $0/5$ audited methods, while any one of the diagnostics (a), (b), or (d) already walks back the three silent failures; the full four-diagnostic protocol also walks back $3/5$. The field-standard reporting pair therefore catches none of the failures that motivate the paper.}
+\label{fig:decision_utility}
+\end{figure}
+
\section{Threshold Sensitivity Full Sweep}
\label{app:threshold_sweep}