author YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 22:43:52 -0500
committer YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 22:43:52 -0500
commit ac94c005cc7d5b45605bcefda448996bc01b7d0d
tree f8a7c164bf479117d7918108e56e3817e23d87f4 /paper
parent f0941c65cbb815385bffdf1fe99405cf9382a420
paper v2.38: ddof=1 statistical convention sweep (sample std with Bessel)
User picked option B: convert all 3-seed std values across the paper from ddof=0 (population) to ddof=1 (Bessel-corrected sample std). Bessel correction for n=3: ddof=1 std is √(3/2) ≈ 1.22× larger than ddof=0. About 22% inflation per value.

49 ± value replacements across §1, §2, §4, §5, Tables 1+2+9, Appendix H, Appendix L, and intermediate prose.

Major updates:

Table 1 (5-method audit accuracies):
  BP 0.615 ± 0.003 → 0.615 ± 0.004
  EP 0.316 ± 0.030 → 0.316 ± 0.037
  DFA 0.306 ± 0.006 → 0.306 ± 0.008
  SB 0.205 ± 0.032 → 0.205 ± 0.039
  CB 0.289 ± 0.026 → 0.289 ± 0.031

Frozen baseline: 0.349 ± 0.002 → 0.349 ± 0.003 (4 occurrences)

§5 matched 30-ep controls:
  BP no-pen 0.585 ± 0.001 (already 0.001, ddof=1 = 0.0009 stays)
  BP+pen 0.532 ± 0.006 → 0.532 ± 0.007
  DFA no-pen 0.301 ± 0.005 → 0.301 ± 0.006
  DFA+pen 0.360 ± 0.001 → 0.360 ± 0.002
  SB+pen 0.453 ± 0.003 stays (0.0030 → 0.003)
  CB+pen 0.360 ± 0.003 → 0.360 ± 0.004

§4 ¶4 cosines:
  SB+pen cos +0.322 ± 0.007 → +0.322 ± 0.008
  CB+pen cos +0.679 ± 0.008 → +0.679 ± 0.010
  DFA+pen cos +0.151 ± 0.025 stays (pooled n=12 ddof=1=0.0247)

§4 ¶4 perturbation rho:
  SB+pen rho +0.402 ± 0.015 → +0.402 ± 0.019
  CB+pen rho +0.464 ± 0.025 → +0.464 ± 0.030
  DFA+pen rho +0.080 ± 0.011 → +0.080 ± 0.012

§4 ¶4 nudging test:
  SB -1.93 ± 0.11×10^-3 → -1.93 ± 0.14×10^-3
  CB -4.26 ± 0.24×10^-4 → -4.26 ± 0.29×10^-4
  DFA -4.98 ± 0.44×10^-5 → -4.98 ± 0.53×10^-5

§4 ¶4 training loss decrease:
  SB -0.447 ± 0.008 → -0.447 ± 0.010
  CB -0.121 ± 0.003 stays
  DFA -0.095 ± 0.007 → -0.095 ± 0.008

§4 ¶1 vanilla DFA early-epoch deep cos / rho:
  -0.008 ± 0.013 → -0.008 ± 0.016
  -0.018 ± 0.018 → -0.018 ± 0.017
  -0.003 ± 0.005 → -0.003 ± 0.004

Appendix H L=4 d=512 3-seed:
  DFA layer-0 +0.412 ± 0.011 → +0.412 ± 0.013
  DFA deep -0.0004 ± 0.0008 → -0.0004 ± 0.0009
  CB deep +0.039 ± 0.010 → +0.039 ± 0.012

Appendix J Table 9 std rows updated to ddof=1 means.

Appendix L drift values:
  DFA+pen w2 18.6 ± 0.5 → 18.6 ± 0.6
  DFA+pen embed 94.6 ± 1.4 → 94.6 ± 1.8
  (others unchanged within rounding)

The §3 ¶3 no-outln value 0.327 ± 0.012 was already ddof=1 (was the single inconsistent place that motivated this sweep). Now the entire paper uses ddof=1 consistently.

All means and per-seed values are unchanged (only the std reported changes). Headline science conclusions all unchanged. Page count 20 (unchanged).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
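Illustrative note on the ddof convention: the sketch below shows how the ddof=0 and ddof=1 standard deviations relate for n=3, and that their ratio is √(3/2). It is a minimal pure-Python sketch, not the script used for this sweep. The helper names `std` and `bessel_factor` and the three per-seed values are hypothetical and are not taken from the paper; the `ddof` argument is only meant to mirror the NumPy-style convention named in the message.

```python
import math


def std(values, ddof=0):
    """Standard deviation with a NumPy-style ddof (delta degrees of freedom).

    ddof=0 divides by n (population std); ddof=1 divides by n - 1 (Bessel).
    """
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((v - mean) ** 2 for v in values)
    return math.sqrt(sum_sq / (n - ddof))


def bessel_factor(n):
    """Ratio of the ddof=1 std to the ddof=0 std for n samples."""
    return math.sqrt(n / (n - 1))


# Hypothetical per-seed accuracies for illustration only (NOT from the paper).
seeds = [0.300, 0.306, 0.312]

pop = std(seeds, ddof=0)   # population std
samp = std(seeds, ddof=1)  # Bessel-corrected sample std
ratio = samp / pop         # equals bessel_factor(3) up to float error

print(f"ddof=0: {pop:.4f}  ddof=1: {samp:.4f}  ratio: {ratio:.4f}")
```

Because the reported values are rounded to three decimals, multiplying an already-rounded ddof=0 value by √(3/2) will not always reproduce the new entry; recomputing from the unrounded per-seed values is the safer route, and it is presumably why some entries above move by more or less than 22%.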
Diffstat (limited to 'paper')
-rw-r--r-- paper/main.pdf bin 537402 -> 537407 bytes
-rw-r--r-- paper/main.tex 48
2 files changed, 24 insertions, 24 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index b149958..17c578c 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index 184d054..4148b7e 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -35,9 +35,9 @@ Modern feedback-alignment evaluation on deep residual networks is still summariz
\paragraph{Feedback alignment and the standard reporting pair.} Backpropagation (BP) is the de facto training method for deep neural networks, but its requirement that each feedback connection carry a weight identical to the corresponding forward connection -- the weight-transport problem -- has long been considered biologically implausible \citep{lillicrap2016random,bartunov2018assessing}. \emph{Feedback alignment} (FA) \citep{lillicrap2016random} side-steps weight transport by delivering per-layer credit through fixed random feedback matrices, and its direct variant (DFA) \citep{nokland2016direct} projects the output error to every hidden layer through an independent random matrix; parallel lines include target propagation \citep{lee2015difference} and equilibrium propagation \citep{scellier2017equilibrium}. These rules are studied both as biologically-plausible alternatives to BP and as scalable, asynchronous training schemes, with recent work scaling DFA to transformer-scale architectures on language, recommendation, and view-synthesis tasks \citep{launay2020direct,akrout2019deep}. Evaluation in this line of work has converged on a two-number summary: final task accuracy, and an aggregate cosine alignment $\Gamma$ between the method's per-layer credit and the BP gradient on the trained network \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct,bartunov2018assessing}.
-\paragraph{The standard pair fails to validate.} On the audited 4-block $d{=}256$ ResMLP, however, Table~\ref{tab:main_audit} already shows that this accuracy-plus-$\Gamma$ pair is not a validity check: DFA reaches only $0.306 \pm 0.006$ test accuracy, below the architecture-matched frozen-blocks baseline of $0.349 \pm 0.002$, while still looking superficially comparable to other non-BP methods. Figure~\ref{fig:audit_hero} further shows that the apparent cosine evidence is concentrated at the shallowest block, with DFA at seed 42 reaching about $+0.42$ at layer 0 but approximately $-0.03$ to $0$ on layers 1--4, so the aggregate obscures where credit direction is and is not present. At the same time, the deepest BP reference norm is only about $4 \times 10^{-10}$ for DFA (three-seed mean) and a few $\times 10^{-9}$ for State Bridge and Credit Bridge, all below the $10^{-8}$ clamp used by \texttt{F.cosine\_similarity}, whereas BP remains around $4 \times 10^{-4}$, so the reported deep cosine is partly computed against a numerical-floor reference rather than an informative gradient direction (Figure~\ref{fig:audit_hero}; Table~\ref{tab:main_audit}). Those numbers can be useful, but only if the measurement regime itself is valid.
+\paragraph{The standard pair fails to validate.} On the audited 4-block $d{=}256$ ResMLP, however, Table~\ref{tab:main_audit} already shows that this accuracy-plus-$\Gamma$ pair is not a validity check: DFA reaches only $0.306 \pm 0.008$ test accuracy, below the architecture-matched frozen-blocks baseline of $0.349 \pm 0.003$, while still looking superficially comparable to other non-BP methods. Figure~\ref{fig:audit_hero} further shows that the apparent cosine evidence is concentrated at the shallowest block, with DFA at seed 42 reaching about $+0.42$ at layer 0 but approximately $-0.03$ to $0$ on layers 1--4, so the aggregate obscures where credit direction is and is not present. At the same time, the deepest BP reference norm is only about $4 \times 10^{-10}$ for DFA (three-seed mean) and a few $\times 10^{-9}$ for State Bridge and Credit Bridge, all below the $10^{-8}$ clamp used by \texttt{F.cosine\_similarity}, whereas BP remains around $4 \times 10^{-4}$, so the reported deep cosine is partly computed against a numerical-floor reference rather than an informative gradient direction (Figure~\ref{fig:audit_hero}; Table~\ref{tab:main_audit}). Those numbers can be useful, but only if the measurement regime itself is valid.
-\paragraph{Two failure modes and their separability.} Our audit shows that modern residual vision models can make these two quantities look informative while failing to answer the question they are taken to answer. Figure~\ref{fig:audit_hero} shows the first failure mode, which we call \emph{Mode 1: measurement degeneracy}, where residual-stream growth drives the deepest hidden state to about $\|h_L\| \sim 10^8$ under DFA/SB/CB while the corresponding BP reference collapses to $\|g_L\| \sim 4 \times 10^{-10}$ for DFA (three-seed mean), so the deep-layer cosine is measured against a clamp-dominated floor rather than a meaningful target direction. The same figure also shows the second failure mode, \emph{Mode 2: low intrinsic credit-direction quality}, because even after comparing against the stronger frozen-blocks baseline ($0.349 \pm 0.002$) and looking layer-by-layer, DFA's deep blocks remain essentially null while only layer 0 is visibly positive. Intervention sharpens both modes. Adding a per-block residual penalty $\lambda \|f_l(h_l)\|^2$ to DFA at $\lambda{=}10^{-2}$ contains $\|h_L\|$ to about $4\times 10^4$ and lifts the deep BP reference to about $10^{-6}$, but DFA's rescued deep cosine is only about $+0.15$; State Bridge under the same intervention reaches a three-seed deep cosine of $+0.32$ and, unlike DFA, exceeds the frozen-blocks baseline by $+10$ points in final accuracy; Credit Bridge reaches a deep cosine near $+0.68$ yet matches only the DFA accuracy, so Mode~2 has method-dependent severity and deep cosine is not a sufficient predictor of final accuracy across methods. At the same time, at $\lambda{=}10^{-4}$ Mode~1 is alleviated while the DFA deep cosine still stays near zero, and at vanilla DFA epoch 1 the reference is already meaningful at about $6 \times 10^{-7}$ but the deep cosine is still $-0.008 \pm 0.013$ across three seeds. 
The failure is therefore neither unitary nor uniform: Mode~1 and Mode~2 are observationally separable, and within the audited fixed-feedback family, the severity of each mode varies by method.
+\paragraph{Two failure modes and their separability.} Our audit shows that modern residual vision models can make these two quantities look informative while failing to answer the question they are taken to answer. Figure~\ref{fig:audit_hero} shows the first failure mode, which we call \emph{Mode 1: measurement degeneracy}, where residual-stream growth drives the deepest hidden state to about $\|h_L\| \sim 10^8$ under DFA/SB/CB while the corresponding BP reference collapses to $\|g_L\| \sim 4 \times 10^{-10}$ for DFA (three-seed mean), so the deep-layer cosine is measured against a clamp-dominated floor rather than a meaningful target direction. The same figure also shows the second failure mode, \emph{Mode 2: low intrinsic credit-direction quality}, because even after comparing against the stronger frozen-blocks baseline ($0.349 \pm 0.003$) and looking layer-by-layer, DFA's deep blocks remain essentially null while only layer 0 is visibly positive. Intervention sharpens both modes. Adding a per-block residual penalty $\lambda \|f_l(h_l)\|^2$ to DFA at $\lambda{=}10^{-2}$ contains $\|h_L\|$ to about $4\times 10^4$ and lifts the deep BP reference to about $10^{-6}$, but DFA's rescued deep cosine is only about $+0.15$; State Bridge under the same intervention reaches a three-seed deep cosine of $+0.32$ and, unlike DFA, exceeds the frozen-blocks baseline by $+10$ points in final accuracy; Credit Bridge reaches a deep cosine near $+0.68$ yet matches only the DFA accuracy, so Mode~2 has method-dependent severity and deep cosine is not a sufficient predictor of final accuracy across methods. At the same time, at $\lambda{=}10^{-4}$ Mode~1 is alleviated while the DFA deep cosine still stays near zero, and at vanilla DFA epoch 1 the reference is already meaningful at about $6 \times 10^{-7}$ but the deep cosine is still $-0.008 \pm 0.016$ across three seeds. 
The failure is therefore neither unitary nor uniform: Mode~1 and Mode~2 are observationally separable, and within the audited fixed-feedback family, the severity of each mode varies by method.
\paragraph{Contribution: a methodology paper, not a new FA variant.} Accordingly, this paper does not introduce a new FA variant or a new benchmark. Of the five methods we audit, BP, EP, and DFA are established baselines from the published literature; the remaining two, which we call \emph{State Bridge} and \emph{Credit Bridge}, are diagnostic probes we construct in this paper to directly learn the two targets that different strands of the BP-free literature argue should produce good per-layer credit (formal definitions and citations in Section~\ref{sec:audit}). Instead, Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero} use a standard five-method CIFAR-10 audit to show that status-quo reporting would treat BP, EP, DFA, State Bridge, and Credit Bridge as the same kind of evidence-bearing object even though only BP and EP remain trustworthy under matched diagnostic checks. This makes the contribution methodological in the sense of \citet{jordan2020evaluating}, \citet{obray2022evaluation}, and \citet{paleka2026pitfalls}: the central question is not whether one more FA variant can post a headline number, but whether the reporting pipeline distinguishes meaningful credit-direction evidence from numerical-floor artifacts and from shallow-only learning. The protocol therefore starts from per-layer diagnostics and a frozen-blocks baseline before reading any aggregate cosine or final accuracy as evidence about deep credit assignment. We first show the walk-back on a standard audit, then isolate the two failure modes, and finally state the reporting protocol that future FA papers should satisfy.
@@ -56,22 +56,22 @@ Modern feedback-alignment evaluation on deep residual networks is still summariz
\toprule
Method & Test acc. & Headline $\Gamma$ & Status-quo verdict & Protocol verdict \\
\midrule
-BP & $0.615 \pm 0.003$ & $\approx 1.0$ & trustworthy & trustworthy \\
-EP & $0.316 \pm 0.030$ & $0.008$ & trustworthy & trustworthy \\
-DFA & $0.306 \pm 0.006$ & $0.10$ & trustworthy & walked back \\
-State Bridge & $0.205 \pm 0.032$ & $0.005$ & trustworthy & walked back \\
-Credit Bridge & $0.289 \pm 0.026$ & $0.07$ & trustworthy & walked back \\
+BP & $0.615 \pm 0.004$ & $\approx 1.0$ & trustworthy & trustworthy \\
+EP & $0.316 \pm 0.037$ & $0.008$ & trustworthy & trustworthy \\
+DFA & $0.306 \pm 0.008$ & $0.10$ & trustworthy & walked back \\
+State Bridge & $0.205 \pm 0.039$ & $0.005$ & trustworthy & walked back \\
+Credit Bridge & $0.289 \pm 0.031$ & $0.07$ & trustworthy & walked back \\
\bottomrule
\end{tabular}}
\end{table}
\paragraph{State Bridge and Credit Bridge: diagnostic probes constructed for this paper.} Two rows in Table~\ref{tab:main_audit}, \emph{State Bridge} (SB) and \emph{Credit Bridge} (CB), are diagnostic probes we construct in this paper, not prior FA variants. Each directly learns a target that a different strand of the BP-free literature argues should produce good per-layer credit, and each uses the same block local loss $-\langle f_l(h_l),\, a_l\rangle$ as DFA but with a different $a_l$. SB instantiates the target-propagation view that accurate prediction of a downstream hidden state yields a usable credit signal \citep{bengio2014autoencoders,lee2015difference}: an auxiliary $G_\psi(h_l, t_l, s)$ is fit by MSE to predict $h_L$ from $(h_l, t_l{=}l/L, s{=}e_T)$, and $a_l^{\mathrm{SB}} = \nabla_{h_l}\,\mathrm{CE}(W_{\mathrm{out}}\,\mathrm{LN}(G_\psi(h_l, t_l, s)), y)$. CB instantiates the synthetic-gradient view that a learned value network, if its input-gradient approximates the BP gradient, can stand in for it \citep{jaderberg2017decoupled}: $V_\phi(h_l, t_l, s)$ is fit via a bridge residual against an EMA target, and $a_l^{\mathrm{CB}} = \nabla_{h_l} V_\phi(h_l, t_l, s)$. Both auxiliaries are trained on detached hidden states. We use SB and CB as controls that populate different points in the (angular agreement with BP, functional usefulness) plane; that is what makes the cross-method cosine-versus-accuracy dissociation in Section~\ref{sec:mode2} visible.
-\paragraph{Status-quo reading: every method looks acceptable.} By the field's usual criteria, the non-BP methods appear to train to nontrivial accuracy and report nonzero alignment. In Table~\ref{tab:main_audit}, DFA reaches $0.306 \pm 0.006$ test accuracy with headline $\Gamma{=}0.10$, State Bridge reaches $0.205 \pm 0.032$ with $\Gamma{=}0.005$, and Credit Bridge reaches $0.289 \pm 0.026$ with $\Gamma{=}0.07$; none of these rows looks like an obvious invalidation if one is reading the usual pair of final accuracy and aggregate alignment in the style of prior FA reporting \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct}. Even the absolute scale does not itself force a walk-back, because all three methods are plainly above chance and all three report positive headline alignment rather than a visibly broken or undefined quantity. That reading is exactly what the rest of the paper overturns.
+\paragraph{Status-quo reading: every method looks acceptable.} By the field's usual criteria, the non-BP methods appear to train to nontrivial accuracy and report nonzero alignment. In Table~\ref{tab:main_audit}, DFA reaches $0.306 \pm 0.008$ test accuracy with headline $\Gamma{=}0.10$, State Bridge reaches $0.205 \pm 0.039$ with $\Gamma{=}0.005$, and Credit Bridge reaches $0.289 \pm 0.031$ with $\Gamma{=}0.07$; none of these rows looks like an obvious invalidation if one is reading the usual pair of final accuracy and aggregate alignment in the style of prior FA reporting \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct}. Even the absolute scale does not itself force a walk-back, because all three methods are plainly above chance and all three report positive headline alignment rather than a visibly broken or undefined quantity. That reading is exactly what the rest of the paper overturns.
-\paragraph{EP as the internal control: low accuracy without invalid measurement.} Low accuracy by itself is not the pathology. Equilibrium Propagation (EP), a contrastive energy-based alternative to BP that updates weights from the difference between a free-phase and a nudged-phase hidden trajectory, is the key internal comparison in Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero}: it achieves only $0.316 \pm 0.030$ accuracy and a very small headline $\Gamma{=}0.008$, yet its three-seed mean max-per-block growth is only $6.6\times$ (highest single-seed value $11.0\times$), its deepest BP reference norm remains around $1.3\times 10^{-4}$ rather than collapsing to the numerical floor, and its cross-batch direction-stability score is $0.02$ rather than the much higher drift-dominated values seen for DFA-family methods. At the same time, EP is not a positive result for depth usage in the stronger sense, because its trainable-model accuracy is still $3.3$ percentage points below the frozen-blocks baseline of $0.349 \pm 0.002$. The distinction matters because it separates underperformance from invalid evaluation.
+\paragraph{EP as the internal control: low accuracy without invalid measurement.} Low accuracy by itself is not the pathology. Equilibrium Propagation (EP), a contrastive energy-based alternative to BP that updates weights from the difference between a free-phase and a nudged-phase hidden trajectory, is the key internal comparison in Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero}: it achieves only $0.316 \pm 0.037$ accuracy and a very small headline $\Gamma{=}0.008$, yet its three-seed mean max-per-block growth is only $6.6\times$ (highest single-seed value $11.0\times$), its deepest BP reference norm remains around $1.3\times 10^{-4}$ rather than collapsing to the numerical floor, and its cross-batch direction-stability score is $0.02$ rather than the much higher drift-dominated values seen for DFA-family methods. At the same time, EP is not a positive result for depth usage in the stronger sense, because its trainable-model accuracy is still $3.3$ percentage points below the frozen-blocks baseline of $0.349 \pm 0.003$. The distinction matters because it separates underperformance from invalid evaluation.
-\paragraph{Frozen-blocks baseline overturns the status-quo reading.} When we compare each method to a frozen-blocks baseline matched to the same architecture, the headline interpretation changes immediately. The frozen-blocks model, which trains only the embedding, LayerNorm, and head while holding the residual blocks fixed, reaches $0.349 \pm 0.002$ across the same three seeds; against that baseline, BP is higher by $26.6$ points, but DFA is lower by $4.3$ points, State Bridge by $14.4$ points, Credit Bridge by $6.0$ points, and even EP by $3.3$ points. Figure~\ref{fig:audit_hero} shows that this accuracy comparison lines up with the diagnostic split: DFA, State Bridge, and Credit Bridge also combine extreme per-block growth (three-seed mean max ratios $\sim\!1.9\times 10^3$, $\sim\!1.6\times 10^4$, and $\sim\!1.2\times 10^3$ respectively), deepest-layer BP norms around $10^{-9}$, and high cross-batch instability ($0.16$, $0.53$, and $0.37$), so their deep blocks are at best passengers and in practice often harmful. This establishes the audit question the rest of the paper must answer: why do the standard signals fail so badly?
+\paragraph{Frozen-blocks baseline overturns the status-quo reading.} When we compare each method to a frozen-blocks baseline matched to the same architecture, the headline interpretation changes immediately. The frozen-blocks model, which trains only the embedding, LayerNorm, and head while holding the residual blocks fixed, reaches $0.349 \pm 0.003$ across the same three seeds; against that baseline, BP is higher by $26.6$ points, but DFA is lower by $4.3$ points, State Bridge by $14.4$ points, Credit Bridge by $6.0$ points, and even EP by $3.3$ points. Figure~\ref{fig:audit_hero} shows that this accuracy comparison lines up with the diagnostic split: DFA, State Bridge, and Credit Bridge also combine extreme per-block growth (three-seed mean max ratios $\sim\!1.9\times 10^3$, $\sim\!1.6\times 10^4$, and $\sim\!1.2\times 10^3$ respectively), deepest-layer BP norms around $10^{-9}$, and high cross-batch instability ($0.16$, $0.53$, and $0.37$), so their deep blocks are at best passengers and in practice often harmful. This establishes the audit question the rest of the paper must answer: why do the standard signals fail so badly?
\begin{figure}[t]
\centering
@@ -92,15 +92,15 @@ Credit Bridge & $0.289 \pm 0.026$ & $0.07$ & trustworthy & walked back \\
\section{Failure Mode 2: Low Intrinsic Credit-Direction Quality}
\label{sec:mode2}
-\paragraph{Mode~2 is present even when measurement is meaningful.} The second failure mode appears even in the meaningful-measurement regime. At the earliest vanilla DFA checkpoints on ResMLP, the hidden backpropagated gradient at the first deep block remains above the numerical floor: at epoch 1, $\|g_2\|$ is $6.8\times 10^{-7}$, $6.6\times 10^{-7}$, and $3.8\times 10^{-7}$ across the three seeds, all above the $10^{-7}$ threshold used to distinguish measurable from collapsed gradients. Yet the corresponding deep-layer cosine values are already essentially null: across layers $1$--$4$, all seed-level measurements at epoch 1 lie in $[-0.04,+0.02]$, with a three-seed mean of $-0.008 \pm 0.013$, and by epoch 2 the deep mean is still only $-0.018 \pm 0.018$ (Table~\ref{tab:mode_validation}). This is the observational pattern predicted by low credit-direction quality rather than mere disappearance of signal: the gradient is still present enough to measure, but the directions delivered to the deep network carry little agreement with backpropagation, consistent with prior concerns that alternative feedback rules can fail by supplying poor credit assignments even before full collapse \citep{bartunov2018assessing,moskovitz2018feedback,crafton2019backpropagation,refinetti2023aligning}. This rules out the simplest objection that the deep-layer null result is merely a byproduct of collapse.
+\paragraph{Mode~2 is present even when measurement is meaningful.} The second failure mode appears even in the meaningful-measurement regime. At the earliest vanilla DFA checkpoints on ResMLP, the hidden backpropagated gradient at the first deep block remains above the numerical floor: at epoch 1, $\|g_2\|$ is $6.8\times 10^{-7}$, $6.6\times 10^{-7}$, and $3.8\times 10^{-7}$ across the three seeds, all above the $10^{-7}$ threshold used to distinguish measurable from collapsed gradients. Yet the corresponding deep-layer cosine values are already essentially null: across layers $1$--$4$, all seed-level measurements at epoch 1 lie in $[-0.04,+0.02]$, with a three-seed mean of $-0.008 \pm 0.016$, and by epoch 2 the deep mean is still only $-0.018 \pm 0.017$ (Table~\ref{tab:mode_validation}). This is the observational pattern predicted by low credit-direction quality rather than mere disappearance of signal: the gradient is still present enough to measure, but the directions delivered to the deep network carry little agreement with backpropagation, consistent with prior concerns that alternative feedback rules can fail by supplying poor credit assignments even before full collapse \citep{bartunov2018assessing,moskovitz2018feedback,crafton2019backpropagation,refinetti2023aligning}. This rules out the simplest objection that the deep-layer null result is merely a byproduct of collapse.
-\paragraph{A second metric with different failure modes agrees.} A second metric with different numerical failure modes tells the same story. Cosine measures directional agreement with the BP gradient, whereas the per-layer perturbation correlation $\rho_l$ measures whether the proposed credit predicts the actual loss response: for $M{=}32$ unit-norm random directions $v_m$ and step $\varepsilon{=}10^{-3}$, $\rho_l \;{=}\; \mathrm{Pearson}_m\!\left(\langle a_l,\, \varepsilon v_m\rangle,\;\, \ell(h_l + \varepsilon v_m) - \ell(h_l)\right)$, evaluated per sample on a fixed eval batch and then averaged. Cosine and $\rho$ have different failure modes, especially with respect to normalization and small-denominator effects. In our controls, $\rho$ behaves as expected, with a Taylor-ceiling positive control near $+0.997$ and a random-vector negative control near $+0.006$ (Figure~\ref{fig:penalty_rescue}, Table~\ref{tab:mode_validation}). On vanilla DFA, deep $\rho$ is likewise null: for the early checkpoints where the gradients remain measurable, the deep average is $-0.003 \pm 0.005$ across seeds and epochs, and in a floor-level checkpoint it is $+0.002$, again indistinguishable from noise. The agreement between cosine and $\rho$ therefore rules out the interpretation that the null deep result is an artifact of cosine's $\varepsilon$-clamp or vector normalization. The deep blocks are not just hard to measure; they are receiving weakly useful directions.
+\paragraph{A second metric with different failure modes agrees.} A second metric with different numerical failure modes tells the same story. Cosine measures directional agreement with the BP gradient, whereas the per-layer perturbation correlation $\rho_l$ measures whether the proposed credit predicts the actual loss response: for $M{=}32$ unit-norm random directions $v_m$ and step $\varepsilon{=}10^{-3}$, $\rho_l \;{=}\; \mathrm{Pearson}_m\!\left(\langle a_l,\, \varepsilon v_m\rangle,\;\, \ell(h_l + \varepsilon v_m) - \ell(h_l)\right)$, evaluated per sample on a fixed eval batch and then averaged. Cosine and $\rho$ have different failure modes, especially with respect to normalization and small-denominator effects. In our controls, $\rho$ behaves as expected, with a Taylor-ceiling positive control near $+0.997$ and a random-vector negative control near $+0.006$ (Figure~\ref{fig:penalty_rescue}, Table~\ref{tab:mode_validation}). On vanilla DFA, deep $\rho$ is likewise null: for the early checkpoints where the gradients remain measurable, the deep average is $-0.003 \pm 0.004$ across seeds and epochs, and in a floor-level checkpoint it is $+0.002$, again indistinguishable from noise. The agreement between cosine and $\rho$ therefore rules out the interpretation that the null deep result is an artifact of cosine's $\varepsilon$-clamp or vector normalization. The deep blocks are not just hard to measure; they are receiving weakly useful directions.
\paragraph{Per-layer reporting is mandatory: layer-0 dominance.} Per-layer reporting is therefore not cosmetic. In ResMLP under vanilla DFA, the headline aggregate alignment $\Gamma \approx 0.07$--$0.10$ can look mildly positive only because layer $0$ remains strongly aligned while the deep network is not: at the same epoch-1 checkpoints where layers $1$--$4$ are essentially zero, layer $0$ has cosine $+0.42$, $+0.44$, and $+0.42$ across seeds (Table~\ref{tab:mode_validation}; per-seed values in Appendix~\ref{app:layer0_dominance}). The resulting average can therefore be driven by the embedding layer even when the interior blocks are effectively unaligned, so aggregate reporting obscures the very distinction needed to separate ``measurement collapse'' from ``poor credit direction.'' This layer-$0$ dominance is specific to the ResMLP DFA setting; on ViT-Mini DFA, all layers are near zero, which strengthens the broader methodological point that alignment should be reported per layer rather than only in aggregate. With the two modes separated observationally, the remaining question is whether intervention can move them independently.
-\paragraph{Method-dependent severity once Mode~1 is alleviated.} Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same $\lambda{=}10^{-2}$ scale-control penalty to SB, CB, and DFA on the audited 4-block $d{=}256$ ResMLP for $30$ epochs (three seeds) gives, in order, test accuracies $0.453 \pm 0.003$, $0.360 \pm 0.003$, $0.360 \pm 0.001$ and deep mean cosines $+0.322 \pm 0.007$, $+0.679 \pm 0.008$, $+0.151 \pm 0.025$ (deep mean $\rho$ $+0.402$, $+0.464$, $+0.080$ and full $\|h_L\|/\|g_L\|$ in Appendix~\ref{app:sb_penalty}), all in the meaningful-measurement regime. SB+penalty is the first audited non-BP method whose trained deep blocks beat the frozen-blocks baseline ($0.349$), by $+10.4$ pp---comparable to BP+penalty's $+18.3$ pp.
+\paragraph{Method-dependent severity once Mode~1 is alleviated.} Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same $\lambda{=}10^{-2}$ scale-control penalty to SB, CB, and DFA on the audited 4-block $d{=}256$ ResMLP for $30$ epochs (three seeds) gives, in order, test accuracies $0.453 \pm 0.003$, $0.360 \pm 0.004$, $0.360 \pm 0.002$ and deep mean cosines $+0.322 \pm 0.008$, $+0.679 \pm 0.010$, $+0.151 \pm 0.025$ (deep mean $\rho$ $+0.402$, $+0.464$, $+0.080$ and full $\|h_L\|/\|g_L\|$ in Appendix~\ref{app:sb_penalty}), all in the meaningful-measurement regime. SB+penalty is the first audited non-BP method whose trained deep blocks beat the frozen-blocks baseline ($0.349$), by $+10.4$ pp---comparable to BP+penalty's $+18.3$ pp.
-\paragraph{Three functional metrics rank the methods consistently; cosine disagrees.} Within this rescued regime the three methods reveal a clean cosine-versus-accuracy dissociation, and two independent functional measurements rule out the interpretation that cosine is just noisy. \emph{Nudging:} a single step $\eta{=}0.01$ along each method's per-layer credit $a_l$ at the converged checkpoint changes the deep-block test loss by $-1.93 \pm 0.11 \times 10^{-3}$ (SB+pen), $-4.26 \pm 0.24 \times 10^{-4}$ (CB+pen), and $-4.98 \pm 0.44 \times 10^{-5}$ (DFA+pen) across three seeds (per-seed values in Appendix~\ref{app:sb_penalty}): SB moves the loss $\approx\!4.5\times$ more than CB and $\approx\!39\times$ more than DFA, even though CB has the highest deep cosine with BP. \emph{Training-loss trajectory:} the integrated 30-epoch training loss decrease across three seeds ranks SB ($-0.447 \pm 0.008$) $\gg$ CB ($-0.121 \pm 0.003$) $\approx$ DFA ($-0.095 \pm 0.007$). All three functional metrics (accuracy, nudging, training-loss trajectory) agree on SB $\gg$ CB $\approx$ DFA; the deep-cosine ordering CB $>$ SB $>$ DFA is the only one that disagrees (Figure~\ref{fig:cos_acc_dissoc}).
+\paragraph{Three functional metrics rank the methods consistently; cosine disagrees.} Within this rescued regime the three methods reveal a clean cosine-versus-accuracy dissociation, and two independent functional measurements rule out the interpretation that cosine is just noisy. \emph{Nudging:} a single step $\eta{=}0.01$ along each method's per-layer credit $a_l$ at the converged checkpoint changes the deep-block test loss by $-1.93 \pm 0.14 \times 10^{-3}$ (SB+pen), $-4.26 \pm 0.29 \times 10^{-4}$ (CB+pen), and $-4.98 \pm 0.53 \times 10^{-5}$ (DFA+pen) across three seeds (per-seed values in Appendix~\ref{app:sb_penalty}): SB moves the loss $\approx\!4.5\times$ more than CB and $\approx\!39\times$ more than DFA, even though CB has the highest deep cosine with BP. \emph{Training-loss trajectory:} the integrated 30-epoch training loss decrease across three seeds ranks SB ($-0.447 \pm 0.010$) $\gg$ CB ($-0.121 \pm 0.003$) $\approx$ DFA ($-0.095 \pm 0.008$). All three functional metrics (accuracy, nudging, training-loss trajectory) agree on SB $\gg$ CB $\approx$ DFA; the deep-cosine ordering CB $>$ SB $>$ DFA is the only one that disagrees (Figure~\ref{fig:cos_acc_dissoc}).
\paragraph{A three-part proposition: observation, inference, mechanism hypothesis.} We frame the Mode~2 reading as a three-part proposition. \emph{Observation}: under the same intervention and budget, CB has $4\times$ DFA's deep cosine yet matches DFA's accuracy, while SB attains the best accuracy with intermediate cosine; the same SB $\gg$ CB $\approx$ DFA ranking is reproduced by single-step nudging and 30-epoch training-loss decrease. \emph{Inference}: layerwise cosine is necessary to rule out grossly wrong credit signals---it cleanly distinguishes the rescued regime from the clamp-dominated vanilla regime where deep cos is essentially zero---but it is not sufficient to certify that the supplied signal is useful credit for depth, because three independent functional metrics rank the same three methods in the opposite order from cosine. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely on the angle between the local credit direction and the BP gradient. Under this reading, CB supplies a gradient-direction surrogate that aligns in angle without translating into coordinated forward-state improvement, while SB supplies a state-level teaching signal that preserves aspects of useful credit which layerwise cosine does not measure. The single-step nudging test and the integrated training-loss decrease are direct functional probes of exactly this distinction: they measure what an actual descent step in the proposed credit direction does to the loss, rather than how the direction angle compares to the BP gradient at one frozen point.
@@ -118,7 +118,7 @@ Credit Bridge & $0.289 \pm 0.026$ & $0.07$ & trustworthy & walked back \\
\section{Intervention and Cross-Architecture Evidence}
\label{sec:validation}
-\paragraph{The penalty rescues the measurement regime.} The penalty intervention first matters as a rescue of the measurement regime. When we add a per-block penalty $\lambda \,\mathrm{mean}(\|f_l(h_l)\|^2)$ to DFA's local loss and train the 4-block $d{=}256$ ResMLP for 30 epochs on CIFAR-10, the $\lambda{=}10^{-2}$ setting contains the terminal hidden-state scale from $\|h_L\| \sim 4.4\times 10^8$ under vanilla DFA to $\sim 4.0\times 10^4$, while lifting the deepest BP reference norm from $\|g_L\| \sim 5\times 10^{-10}$ to $\sim 9.0\times 10^{-7}$, a roughly four-order-of-magnitude rescue on both quantities (Figure~\ref{fig:penalty_rescue}; Table~\ref{tab:mode_validation}). At that setting, both diagnostic~(a) and diagnostic~(b) pass on penalized DFA, and test accuracy rises to $0.360 \pm 0.001$ from $0.301 \pm 0.005$ for matched 30-epoch vanilla DFA. The key point is not yet that the recovered network has good deep credit, but that the deep reference vector is again large enough to function as a meaningful target direction rather than a clamp-dominated artifact. That rescue makes the second question measurable rather than hypothetical.
+\paragraph{The penalty rescues the measurement regime.} The penalty intervention first matters as a rescue of the measurement regime. When we add a per-block penalty $\lambda \,\mathrm{mean}(\|f_l(h_l)\|^2)$ to DFA's local loss and train the 4-block $d{=}256$ ResMLP for 30 epochs on CIFAR-10, the $\lambda{=}10^{-2}$ setting contains the terminal hidden-state scale from $\|h_L\| \sim 4.4\times 10^8$ under vanilla DFA to $\sim 4.0\times 10^4$, while lifting the deepest BP reference norm from $\|g_L\| \sim 5\times 10^{-10}$ to $\sim 9.0\times 10^{-7}$, a roughly four-order-of-magnitude rescue on both quantities (Figure~\ref{fig:penalty_rescue}; Table~\ref{tab:mode_validation}). At that setting, both diagnostic~(a) and diagnostic~(b) pass on penalized DFA, and test accuracy rises to $0.360 \pm 0.002$ from $0.301 \pm 0.006$ for matched 30-epoch vanilla DFA. The key point is not yet that the recovered network has good deep credit, but that the deep reference vector is again large enough to function as a meaningful target direction rather than a clamp-dominated artifact. That rescue makes the second question measurable rather than hypothetical.
\begin{table}[t]
\centering
@@ -138,9 +138,9 @@ Fresh-$B$ null control & $\overline{\cos}_{deep}{=}+0.002{\pm}0.022$ ($n{=}20$ d
\end{tabular}}
\end{table}
-\paragraph{Penalty alleviates Mode~2 only partially; the $\lambda$ sweep separates the modes.} Once the reference vector is meaningful again, the deep layers no longer sit exactly at null. At $\lambda{=}10^{-2}$, penalized DFA reaches a three-seed deep-layer mean cosine of $+0.151 \pm 0.025$ and deep perturbation correlation of $+0.080 \pm 0.011$, whereas vanilla DFA is essentially zero on both metrics in the deep blocks, consistent with prior concerns that alternative feedback can fail by supplying poor credit directions even before full collapse \citep{bartunov2018assessing,moskovitz2018feedback,crafton2019backpropagation,refinetti2023aligning}. The null calibration rules out the interpretation that this recovered signal is merely measurement noise: on the same penalized checkpoint, replacing the training-time feedback matrices with 20 fresh random $B_l$ draws gives a deep cosine of only $+0.002 \pm 0.022$, with per-layer standard deviations of $0.013$--$0.023$, all within noise of zero (Table~\ref{tab:mode_validation}). The $\lambda$ sweep sharpens the dissociation further: at $\lambda{=}10^{-4}$, Mode~1 is already alleviated, with three-seed mean $\|h_L\|{\approx}2.2\times 10^4$ and $\|g_L\|{\approx}7.0\times 10^{-7}$, but the three-seed deep cosine remains $-0.020$, while $\lambda{=}10^{-2}$ delivers the $+0.151$ and $+0.080$ above (Figure~\ref{fig:penalty_rescue}). The improvement is real, but it is only partial.
+\paragraph{Penalty alleviates Mode~2 only partially; the $\lambda$ sweep separates the modes.} Once the reference vector is meaningful again, the deep layers no longer sit exactly at null. At $\lambda{=}10^{-2}$, penalized DFA reaches a three-seed deep-layer mean cosine of $+0.151 \pm 0.025$ and deep perturbation correlation of $+0.080 \pm 0.012$, whereas vanilla DFA is essentially zero on both metrics in the deep blocks, consistent with prior concerns that alternative feedback can fail by supplying poor credit directions even before full collapse \citep{bartunov2018assessing,moskovitz2018feedback,crafton2019backpropagation,refinetti2023aligning}. The null calibration rules out the interpretation that this recovered signal is merely measurement noise: on the same penalized checkpoint, replacing the training-time feedback matrices with 20 fresh random $B_l$ draws gives a deep cosine of only $+0.002 \pm 0.022$, with per-layer standard deviations of $0.013$--$0.023$, all within noise of zero (Table~\ref{tab:mode_validation}). The $\lambda$ sweep sharpens the dissociation further: at $\lambda{=}10^{-4}$, Mode~1 is already alleviated, with three-seed mean $\|h_L\|{\approx}2.2\times 10^4$ and $\|g_L\|{\approx}7.0\times 10^{-7}$, but the three-seed deep cosine remains $-0.020$, while $\lambda{=}10^{-2}$ delivers the $+0.151$ and $+0.080$ above (Figure~\ref{fig:penalty_rescue}). The improvement is real, but it is only partial.
-\paragraph{Capacity-cost control: BP under the same penalty.} A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty for the same matched $30$-epoch budget: across three seeds, BP falls from $0.585 \pm 0.001$ without the penalty to $0.532 \pm 0.006$ with $\lambda{=}10^{-2}$, so the penalty has a direct cost of about $5.3$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.301 \pm 0.005$ to $0.360 \pm 0.001$, and State Bridge moves further still, from $0.213$ to $0.453 \pm 0.003$, all under the same $30$-epoch intervention (Figure~\ref{fig:penalty_rescue}; Appendix~\ref{app:sb_penalty}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty retains a margin of $+18.3$ points, State Bridge+penalty retains $+10.4$ points, and DFA+penalty retains only $+1.1$ points. The remaining BP-to-DFA gap of $17.2$ points is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The substantially smaller BP-to-State-Bridge gap of $0.532 - 0.453 = 7.9$ points shows that the cross-method differences in penalty-rescued accuracy are not all attributable to a uniform ``random-feedback ceiling'': the bridge construction in State Bridge can recover much more of the BP-with-penalty performance than DFA can, on the same architecture and the same intervention. The residual gap after that control is what keeps Mode~2 substantively alive while letting it have method-dependent severity.
+\paragraph{Capacity-cost control: BP under the same penalty.} A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty for the same matched $30$-epoch budget: across three seeds, BP falls from $0.585 \pm 0.001$ without the penalty to $0.532 \pm 0.007$ with $\lambda{=}10^{-2}$, so the penalty has a direct cost of about $5.3$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.301 \pm 0.006$ to $0.360 \pm 0.002$, and State Bridge moves further still, from $0.213$ to $0.453 \pm 0.003$, all under the same $30$-epoch intervention (Figure~\ref{fig:penalty_rescue}; Appendix~\ref{app:sb_penalty}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty retains a margin of $+18.3$ points, State Bridge+penalty retains $+10.4$ points, and DFA+penalty retains only $+1.1$ points. The remaining BP-to-DFA gap of $17.2$ points is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The substantially smaller BP-to-State-Bridge gap of $0.532 - 0.453 = 7.9$ points shows that the cross-method differences in penalty-rescued accuracy are not all attributable to a uniform ``random-feedback ceiling'': the bridge construction in State Bridge can recover much more of the BP-with-penalty performance than DFA can, on the same architecture and the same intervention. The residual gap after that control is what keeps Mode~2 substantively alive while letting it have method-dependent severity.
\paragraph{Cross-architecture and depth-sweep evidence.} The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (DFA three-seed mean about $7\times 10^3$ vs $\sim\!1.9\times 10^3$ at $d{=}256$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). A depth sweep on the $d{=}512$ ResMLP at $L \in \{2,4,6,8,12\}$ shows that the layerwise pattern is essentially depth-invariant: DFA's layer-0 cosine stays in $[+0.38,+0.40]$ across all five depths, while its mean deep-layer cosine stays within $[-0.005,+0.000]$ and its deep perturbation correlation collapses to $0.000$ in every depth tested, even though BP retains a deep-layer cosine of $+0.94$ at $L{=}12$ (Appendix~\ref{app:depth_scan}). The deep credit signal does not improve when the network is shallower, so the failure is not a ``too deep'' artifact. In the non-terminal-LN controls, the pattern is different: the no-terminal-LN ResMLP-d256 ablation shows diagnostic~(a) firing across three seeds at epochs $\{18, 14, 25\}$ but diagnostic~(b) never fires across $100$ epochs and the same three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. The matched same-backbone ResMLP-d256 ablation in Section~\ref{sec:mode1} supplies the cleanest causal control: removing terminal LayerNorm from the same architecture preserves activation growth but eliminates the gradient floor, so diagnostic~(b) is necessary on terminal-LN ResMLP and is not just an architecture-class coincidence.
The broader claim therefore holds at full strength inside the audited residual ResMLP and ViT-Mini regime, while diagnostic~(a) remains useful more broadly. This lets the paper end with a reporting rule rather than an overclaimed theory.
@@ -425,7 +425,7 @@ $12$ & Credit Bridge & $0.239$ & $+0.208$ & $+0.016$ & $+0.000$ \\
\end{tabularx}
\end{table}
-The layerwise pattern is essentially depth-invariant. DFA's layer-0 cosine stays in $[+0.38,+0.40]$ across all five depths, while its mean deep cosine sits within $[-0.005,+0.000]$ and its deep $\rho$ collapses to numerical zero in every condition. Credit Bridge shows a slightly milder version of the same shape, with a small positive deep cosine that does not improve as depth shrinks. BP, by contrast, maintains a deep cosine of $+0.94$ even at $L{=}12$, so the BP reference is still measurably non-degenerate where DFA and Credit Bridge are flat. The $L{=}4$ row, which matches the main audit's architecture, has also been replicated across three seeds (42, 123, 456): 3-seed DFA layer-0 cosine is $+0.412 \pm 0.011$, 3-seed DFA deep cosine is $-0.0004 \pm 0.0008$, and 3-seed CB deep cosine is $+0.039 \pm 0.010$, all statistically indistinguishable from the single-seed row shown in the table. This rules out the explanation that DFA's deep blocks are merely too far from the loss to receive useful credit: making the network shallower does not reach the deep blocks any better. The failure is structural to the credit signal rather than an artifact of depth.
+The layerwise pattern is essentially depth-invariant. DFA's layer-0 cosine stays in $[+0.38,+0.40]$ across all five depths, while its mean deep cosine sits within $[-0.005,+0.000]$ and its deep $\rho$ collapses to numerical zero in every condition. Credit Bridge shows a slightly milder version of the same shape, with a small positive deep cosine that does not improve as depth shrinks. BP, by contrast, maintains a deep cosine of $+0.94$ even at $L{=}12$, so the BP reference is still measurably non-degenerate where DFA and Credit Bridge are flat. The $L{=}4$ row, which matches the main audit's architecture, has also been replicated across three seeds (42, 123, 456): 3-seed DFA layer-0 cosine is $+0.412 \pm 0.013$, 3-seed DFA deep cosine is $-0.0004 \pm 0.0009$, and 3-seed CB deep cosine is $+0.039 \pm 0.012$, all statistically indistinguishable from the single-seed row shown in the table. This rules out the explanation that DFA's deep blocks are merely too far from the loss to receive useful credit: making the network shallower does not reach the deep blocks any better. The failure is structural to the credit signal rather than an artifact of depth.
\section{No-Residual Ablation: Skip Path Is Not the Proximate Trigger}
\label{app:no_residual}
@@ -511,7 +511,7 @@ To test whether the per-block scale-control penalty $\lambda \,\mathrm{mean}(\|f
\begin{table}[h]
\centering
\small
-\caption{State Bridge with the same per-block scale-control penalty $\lambda{=}10^{-2}$ that rescues DFA in Section~\ref{sec:validation}, on the 4-block $d{=}256$ pre-LayerNorm ResMLP, 30 epochs, three seeds. SB+penalty reaches a converged test accuracy of $0.453 \pm 0.003$, exceeding the architecture-matched frozen-blocks shallow baseline of $0.349$ by $+10.4$ percentage points and the matched 30-epoch DFA+penalty value of $0.360 \pm 0.001$ by $+9.3$ percentage points. The deep mean cosine and deep mean perturbation correlation are roughly $2\times$ and $5\times$ the corresponding DFA+penalty values respectively, while the residual stream is contained but not silenced ($\|h_L\|\!\approx\!302$, $\|g_L\|\!\approx\!1.8\times 10^{-4}$). Vanilla SB on the same architecture and seed reaches only $0.213$, with $\|h_L\|\!\approx\!9.85\times 10^6$ and $\|g_L\|$ at the diagnostic-(b) floor.}
+\caption{State Bridge with the same per-block scale-control penalty $\lambda{=}10^{-2}$ that rescues DFA in Section~\ref{sec:validation}, on the 4-block $d{=}256$ pre-LayerNorm ResMLP, 30 epochs, three seeds. SB+penalty reaches a converged test accuracy of $0.453 \pm 0.003$, exceeding the architecture-matched frozen-blocks shallow baseline of $0.349$ by $+10.4$ percentage points and the matched 30-epoch DFA+penalty value of $0.360 \pm 0.002$ by $+9.3$ percentage points. The deep mean cosine and deep mean perturbation correlation are roughly $2\times$ and $5\times$ the corresponding DFA+penalty values respectively, while the residual stream is contained but not silenced ($\|h_L\|\!\approx\!302$, $\|g_L\|\!\approx\!1.8\times 10^{-4}$). Vanilla SB on the same architecture and seed reaches only $0.213$, with $\|h_L\|\!\approx\!9.85\times 10^6$ and $\|g_L\|$ at the diagnostic-(b) floor.}
\label{tab:sb_penalty}
\begin{tabularx}{\linewidth}{@{\extracolsep{\fill}}lrrrrr@{}}
\toprule
@@ -520,21 +520,21 @@ seed & test acc & $\|h_L\|$ & $\|g_L\|$ & deep cos & deep $\rho$ \\
SB+pen $42$ & $0.4564$ & $302$ & $1.75\times 10^{-4}$ & $+0.312$ & $+0.392$ \\
SB+pen $123$ & $0.4514$ & $311$ & $1.74\times 10^{-4}$ & $+0.327$ & $+0.424$ \\
SB+pen $456$ & $0.4509$ & $292$ & $1.92\times 10^{-4}$ & $+0.326$ & $+0.391$ \\
-SB+pen mean & $0.453 \pm 0.003$ & $302 \pm 8$ & $1.80\times 10^{-4}$ & $+0.322 \pm 0.007$ & $+0.402 \pm 0.015$ \\
+SB+pen mean & $0.453 \pm 0.003$ & $302 \pm 10$ & $1.80\times 10^{-4}$ & $+0.322 \pm 0.008$ & $+0.402 \pm 0.019$ \\
\midrule
CB+pen $42$ & $0.3596$ & $5431$ & $1.88\times 10^{-5}$ & $+0.684$ & $+0.498$ \\
CB+pen $123$ & $0.3642$ & $5834$ & $1.81\times 10^{-5}$ & $+0.667$ & $+0.452$ \\
CB+pen $456$ & $0.3562$ & $5775$ & $2.01\times 10^{-5}$ & $+0.685$ & $+0.442$ \\
-CB+pen mean & $0.360 \pm 0.003$ & $5680 \pm 178$ & $1.90\times 10^{-5}$ & $+0.679 \pm 0.008$ & $+0.464 \pm 0.025$ \\
+CB+pen mean & $0.360 \pm 0.004$ & $5680 \pm 218$ & $1.90\times 10^{-5}$ & $+0.679 \pm 0.010$ & $+0.464 \pm 0.030$ \\
\midrule
vanilla SB $42$ & $0.213$ & $9.85\times 10^6$ & $1\times 10^{-8}$ & --- & --- \\
vanilla CB $42$ & $0.211$ & $6.7\times 10^7$ & $\sim 0$ & --- & --- \\
-DFA+pen mean & $0.360 \pm 0.001$ & $1.3\times 10^4$ & $1.6\times 10^{-6}$ & $+0.151 \pm 0.025$ & $+0.080 \pm 0.011$ \\
+DFA+pen mean & $0.360 \pm 0.002$ & $1.3\times 10^4$ & $1.6\times 10^{-6}$ & $+0.151 \pm 0.025$ & $+0.080 \pm 0.012$ \\
\bottomrule
\end{tabularx}
\end{table}
-The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.9$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.151$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs averages $14.8\times \pm 0.5$ for SB+penalty, $18.6\times \pm 0.5$ for DFA+penalty, and $19.1\times \pm 0.6$ for CB+penalty (three seeds each), and the embedding layer's relative drift is $7.0\times \pm 0.1$ for SB versus $46.3\times \pm 1.5$ for CB and $94.6\times \pm 1.4$ for DFA, so none of the three methods' per-block updates are silenced under penalty and CB's are in fact larger in magnitude than SB's while DFA's embedding updates are the largest of all, yet CB's and DFA's final accuracies are both $9.3$ percentage points below State Bridge's. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. 
The nudging test at the same checkpoints provides the direct functional measurement: taking a single step of size $\eta{=}0.01$ in the direction of each method's per-layer credit $a_l$ at the converged checkpoint and measuring the resulting test-loss change averaged over the deep blocks (l1--l3 of the 4-block model) gives, across three seeds (42, 123, 456), $-1.93 \pm 0.11 \times 10^{-3}$ for SB+penalty (per-seed deep means $\{-1.78, -1.96, -2.05\}\times 10^{-3}$), $-4.26 \pm 0.24 \times 10^{-4}$ for CB+penalty (per-seed $\{-4.45, -3.93, -4.42\}\times 10^{-4}$), and $-4.98 \pm 0.44 \times 10^{-5}$ for DFA+penalty (per-seed $\{-5.53, -4.46, -4.95\}\times 10^{-5}$). At the same per-layer credit direction, a step in SB's direction moves the loss about $4.5\times$ more than a step in CB's direction and about $39\times$ more than a step in DFA's direction, even though CB's direction is more aligned with the BP gradient in angle than either. The full per-seed per-block nudging values are saved in \texttt{results/nudging\_test\_3seed\_summary.json}. The $30$-epoch training trajectories give a third independent confirmation: across three seeds, SB+penalty's training loss decreases by $0.447 \pm 0.008$ over the run (per seed $\{0.457, 0.444, 0.439\}$), whereas CB+penalty's decreases by only $0.121 \pm 0.003$ (per seed $\{0.123, 0.118, 0.124\}$) and DFA+penalty's by only $0.095 \pm 0.007$ (per seed $\{0.104, 0.088, 0.093\}$). Deep cosine ranks the three methods CB $>$ SB $>$ DFA, but every functional metric (nudging, integrated training-loss decrease, headline accuracy) ranks them SB $\gg$ CB $\approx$ DFA: the ordering produced by deep cosine is the only one that does not predict accuracy correctly. 
This is the strongest form of the cos-versus-accuracy dissociation: across three audited fixed-feedback methods under the same penalty intervention, the ranking implied by angular agreement with the BP gradient is contradicted by three independent functional measurements that do predict accuracy. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline.
+The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.9$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.151$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs averages $14.8\times \pm 0.5$ for SB+penalty, $18.6\times \pm 0.6$ for DFA+penalty, and $19.1\times \pm 0.6$ for CB+penalty (three seeds each), and the embedding layer's relative drift is $7.0\times \pm 0.1$ for SB versus $46.3\times \pm 1.5$ for CB and $94.6\times \pm 1.8$ for DFA, so none of the three methods' per-block updates are silenced under penalty and CB's are in fact larger in magnitude than SB's while DFA's embedding updates are the largest of all, yet CB's and DFA's final accuracies are both $9.3$ percentage points below State Bridge's. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. 
The nudging test at the same checkpoints provides the direct functional measurement: taking a single step of size $\eta{=}0.01$ in the direction of each method's per-layer credit $a_l$ at the converged checkpoint and measuring the resulting test-loss change averaged over the deep blocks (l1--l3 of the 4-block model) gives, across three seeds (42, 123, 456), $-1.93 \pm 0.14 \times 10^{-3}$ for SB+penalty (per-seed deep means $\{-1.78, -1.96, -2.05\}\times 10^{-3}$), $-4.26 \pm 0.29 \times 10^{-4}$ for CB+penalty (per-seed $\{-4.45, -3.93, -4.42\}\times 10^{-4}$), and $-4.98 \pm 0.53 \times 10^{-5}$ for DFA+penalty (per-seed $\{-5.53, -4.46, -4.95\}\times 10^{-5}$). At the same per-layer credit direction, a step in SB's direction moves the loss about $4.5\times$ more than a step in CB's direction and about $39\times$ more than a step in DFA's direction, even though CB's direction is more aligned with the BP gradient in angle than either. The full per-seed per-block nudging values are saved in \texttt{results/nudging\_test\_3seed\_summary.json}. The $30$-epoch training trajectories give a third independent confirmation: across three seeds, SB+penalty's training loss decreases by $0.447 \pm 0.008$ over the run (per seed $\{0.457, 0.444, 0.439\}$), whereas CB+penalty's decreases by only $0.121 \pm 0.003$ (per seed $\{0.123, 0.118, 0.124\}$) and DFA+penalty's by only $0.095 \pm 0.008$ (per seed $\{0.104, 0.088, 0.093\}$). Deep cosine ranks the three methods CB $>$ SB $>$ DFA, but every functional metric (nudging, integrated training-loss decrease, headline accuracy) ranks them SB $\gg$ CB $\approx$ DFA: the ordering produced by deep cosine is the only one that does not predict accuracy correctly. 
This is the strongest form of the cos-versus-accuracy dissociation: across three audited fixed-feedback methods under the same penalty intervention, the ranking implied by angular agreement with the BP gradient is contradicted by three independent functional measurements that do predict accuracy. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.004$, a three-seed deep mean cosine of $+0.679 \pm 0.010$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.030$, with $\|h_L\|\approx 5680 \pm 218$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline.
\section{Layer-0 Dominance: Per-Seed Vanilla DFA Early-Epoch Cosines}
\label{app:layer0_dominance}
@@ -560,7 +560,7 @@ $456$ & $2$ & $+0.409$ & $+0.003$ & $-0.039$ & $+0.001$ & $+0.000$ & $8.5\times
\end{tabular}
\end{table}
-The deep-layer mean across the three seeds at epoch 1 is $-0.008 \pm 0.013$ (matching Table~\ref{tab:mode_validation}), and at epoch 2 is $-0.018 \pm 0.018$. Layer 0 stays at $+0.42 \pm 0.02$ across all six measurements, so the layer-0-dominance pattern is not a single-seed coincidence: it is consistent across seeds and across the early epochs in which $\|g_2\|$ remains above the $10^{-7}$ diagnostic-(b) floor. This is the per-seed evidence behind the Section~\ref{sec:mode2} claim that aggregate cosine on vanilla DFA can look mildly positive only because layer 0 carries the entire alignment budget.
+The deep-layer mean across the three seeds at epoch 1 is $-0.008 \pm 0.016$ (matching Table~\ref{tab:mode_validation}), and at epoch 2 is $-0.018 \pm 0.017$. Layer 0 stays at $+0.42 \pm 0.02$ across all six measurements, so the layer-0-dominance pattern is not a single-seed coincidence: it is consistent across seeds and across the early epochs in which $\|g_2\|$ remains above the $10^{-7}$ diagnostic-(b) floor. This is the per-seed evidence behind the Section~\ref{sec:mode2} claim that aggregate cosine on vanilla DFA can look mildly positive only because layer 0 carries the entire alignment budget.
\section{Reproducibility}
\label{app:reproducibility}
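The three-seed $\pm$ values in this revision use the sample standard deviation with Bessel's correction (ddof=1) rather than the population standard deviation (ddof=0). As a minimal sketch of that convention, the Python snippet below recomputes both versions from the rounded per-seed nudging values quoted in Appendix~\ref{app:sb_penalty}. The helper name \texttt{summarize} and the variable names are illustrative assumptions, not part of the released code, and because the inputs are the rounded per-seed values, the last digit may differ from numbers computed on unrounded results.

```python
import math
import statistics

# Rounded per-seed deep-block nudging values (seeds 42, 123, 456) as quoted
# in the State Bridge penalty appendix. Units: 1e-3 for SB+penalty and
# 1e-4 for CB+penalty.
sb_nudging = [-1.78, -1.96, -2.05]
cb_nudging = [-4.45, -3.93, -4.42]


def summarize(values):
    """Return (mean, population std with ddof=0, sample std with ddof=1)."""
    mean = statistics.fmean(values)
    std_pop = statistics.pstdev(values)    # divides by n
    std_sample = statistics.stdev(values)  # divides by n - 1 (Bessel)
    return mean, std_pop, std_sample


sb_mean, sb_pop, sb_sample = summarize(sb_nudging)
cb_mean, cb_pop, cb_sample = summarize(cb_nudging)

# For n = 3 the two conventions differ by a constant factor sqrt(n / (n - 1)).
bessel_factor = math.sqrt(3 / 2)

print(f"SB: mean={sb_mean:.2f}  ddof=0 std={sb_pop:.2f}  ddof=1 std={sb_sample:.2f}")
print(f"CB: ddof=0 std={cb_pop:.2f}  ddof=1 std={cb_sample:.2f}")
print(f"ratio ddof=1 / ddof=0 = {sb_sample / sb_pop:.4f} (sqrt(3/2) = {bessel_factor:.4f})")
```

On these inputs the ddof=0 values round to $0.11$ and $0.24$ and the ddof=1 values to $0.14$ and $0.29$, matching the before and after entries for SB+penalty and CB+penalty. The constant ratio $\sqrt{3/2} \approx 1.22$ is why every three-seed standard deviation grows by about $22\%$ under the new convention.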