author YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 18:03:16 -0500
committer YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 18:03:16 -0500
commit 752dfb833b06a6fb974df892de560caf328ed1dd
tree 4b29ea7737d7066c9ca33675f7061b89da2c55f2 /paper
parent 4aa9d89a28ad9d5a9cfe1ef685a5e100a648b4ed
paper v2.31: matched 30-epoch BP/DFA controls (was unsourced 0.609/0.308)
The §5 ¶3 BP-no-penalty value of 0.609 ± 0.004 and DFA-no-penalty value of 0.308 ± 0.014 turned out to be unsourced: they were carried over from a hardcoded comment in experiments/bp_with_penalty_control.py ("BP-trainable (3-seed mean): 0.609") that nobody had actually measured with a matched 30-epoch run.

Ran the missing matched controls under the same recipe as BP+pen (lam=0, 30 epochs, AdamW 1e-3, wd 0.01, cosine schedule, batch 128, 3 seeds 42/123/456):

  BP no-pen 30ep:  per-seed 0.5851, 0.5845, 0.5863 → 0.585 ± 0.001 (paper said 0.609 ± 0.004, off by 0.024)
  DFA no-pen 30ep: per-seed 0.3070, 0.2985, 0.2966 → 0.301 ± 0.005 (paper said 0.308 ± 0.014)

Also re-grounded DFA+penalty 30ep using the dfa_pen_short 3-seed run (0.3593, 0.3610, 0.3604 → 0.360 ± 0.001), which is what the deep-cosine +0.155 figure was computed on. The paper had 0.363 ± 0.001; that came from the 100-epoch run, not the 30-epoch run, so it was an apples-to-oranges comparison with BP+pen 30-ep.

Paper changes (§5 ¶3):

  BP penalty cost:           -8 pp → -5.5 pp
  DFA pen rescue:            +5.5 → +5.9 pp
  DFA+pen margin vs frozen:  +1.4 → +1.1 pp
  BP-to-DFA gap:             17 → 17.0 pp (unchanged)
  BP-to-SB gap:              7.7 → 7.7 pp (unchanged)

BP-to-DFA gap is still the lower-bound credit-quality cost claim; 17 pp gap is unchanged in magnitude.

Also updated:
- §5 ¶1 prose: 0.363 → 0.360, 0.308 → 0.301
- §4 ¶4 prose: DFA+pen 0.363 → 0.360
- Appendix J Table 9 caption: 0.363 → 0.360, +9.0 → +9.3 pp gap to SB
- Appendix L paragraph: +5.5 → +5.9 pp DFA penalty rescue
- Figure 3 panel C bar values + title pen-cost annotation
- New results/matched_30ep_control_summary.json as auditable record

Page layout preserved: 9 main pages + refs p10, 18 total, 0 overfull.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
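
The arithmetic in the message above can be rechecked from the quoted per-seed values. The following is a minimal sketch, not the repository's own tooling: the dictionary keys and the `pp` helper are hypothetical names, and two conventions are assumptions inferred from the reported numbers. First, the spread is taken as the population standard deviation (ddof=0), because that reproduces ± 0.005 for DFA no-pen, whereas the sample standard deviation would round to ± 0.006. Second, gaps are computed from means already rounded to three decimals, as the paper text appears to do.

```python
from statistics import mean, pstdev

# Per-seed test accuracies quoted in the commit message (seeds 42/123/456).
per_seed = {
    "bp_no_pen": [0.5851, 0.5845, 0.5863],
    "dfa_no_pen": [0.3070, 0.2985, 0.2966],
    "dfa_pen": [0.3593, 0.3610, 0.3604],
}

# Assumption: population std (ddof=0); see the lead-in for why.
summary = {
    name: (round(mean(vals), 3), round(pstdev(vals), 3))
    for name, vals in per_seed.items()
}

# Values taken as given from the paper text, not recomputed here.
BP_PEN = 0.530   # BP + penalty
SB_PEN = 0.453   # State Bridge + penalty
FROZEN = 0.349   # frozen-blocks shallow baseline


def pp(a: float, b: float) -> float:
    """Difference a - b in percentage points, rounded to one decimal."""
    return round((a - b) * 100, 1)


bp_no_pen = summary["bp_no_pen"][0]
dfa_no_pen = summary["dfa_no_pen"][0]
dfa_pen = summary["dfa_pen"][0]

gaps = {
    "bp_penalty_cost": pp(BP_PEN, bp_no_pen),
    "dfa_pen_rescue": pp(dfa_pen, dfa_no_pen),
    "dfa_pen_vs_frozen": pp(dfa_pen, FROZEN),
    "bp_to_dfa_gap": pp(BP_PEN, dfa_pen),
    "bp_to_sb_gap": pp(BP_PEN, SB_PEN),
    "sb_to_dfa_gap": pp(SB_PEN, dfa_pen),
}

for name, (m, s) in summary.items():
    print(f"{name}: {m:.3f} ± {s:.3f}")
for name, value in gaps.items():
    print(f"{name}: {value:+.1f} pp")
```

Rounding before differencing matters here: with unrounded means, the DFA rescue would come out nearer +6.0 pp than the reported +5.9 pp.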
Diffstat (limited to 'paper')
-rw-r--r-- paper/figures/fig4_penalty_rescue.pdf bin 34190 -> 34199 bytes
-rw-r--r-- paper/figures/render_fig4_penalty_rescue.py 6
-rw-r--r-- paper/main.pdf bin 500508 -> 500649 bytes
-rw-r--r-- paper/main.tex 12
4 files changed, 9 insertions, 9 deletions
diff --git a/paper/figures/fig4_penalty_rescue.pdf b/paper/figures/fig4_penalty_rescue.pdf
index e703f08..deef685 100644
--- a/paper/figures/fig4_penalty_rescue.pdf
+++ b/paper/figures/fig4_penalty_rescue.pdf
Binary files differ
diff --git a/paper/figures/render_fig4_penalty_rescue.py b/paper/figures/render_fig4_penalty_rescue.py
index ad21a12..614698c 100644
--- a/paper/figures/render_fig4_penalty_rescue.py
+++ b/paper/figures/render_fig4_penalty_rescue.py
@@ -32,8 +32,8 @@ rho_err = [0.005, 0.0, 0.011, 0.0, 0.0]
# Panel C: 2x2 capacity-cost control
methods = ["BP", "DFA"]
-no_pen = [0.609, 0.308]
-with_pen = [0.530, 0.363]
+no_pen = [0.585, 0.301]
+with_pen = [0.530, 0.360]
shallow = 0.349
fig, axes = plt.subplots(1, 3, figsize=(13, 6.0))
@@ -81,7 +81,7 @@ ax.axhline(shallow, color="black", ls="--", lw=1, label=f"frozen baseline {shall
ax.set_xticks(xpos)
ax.set_xticklabels(methods, fontsize=10)
ax.set_ylabel("test accuracy", fontsize=10)
-ax.set_title("(c) BP+penalty 2$\\times$2 control\n(BP-pen-cost $-8$pp; gap $17$pp $=$ credit quality)", fontsize=10)
+ax.set_title("(c) BP+penalty 2$\\times$2 control\n(BP-pen-cost $-5.5$pp; gap $17$pp $=$ credit quality)", fontsize=10)
ax.legend(loc="upper right", fontsize=8)
ax.grid(True, axis="y", alpha=0.3)
ax.set_ylim(0, 0.7)
diff --git a/paper/main.pdf b/paper/main.pdf
index 3dd8aed..ae1d43a 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index 254e1f6..5c4eb93 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -98,12 +98,12 @@ A second metric with different numerical failure modes tells the same story. Cos
Per-layer reporting is therefore not cosmetic. In ResMLP under vanilla DFA, the headline aggregate alignment $\Gamma \approx 0.07$--$0.10$ can look mildly positive only because layer $0$ remains strongly aligned while the deep network is not: at the same epoch-1 checkpoints where layers $1$--$4$ are essentially zero, layer $0$ has cosine $+0.42$, $+0.44$, and $+0.42$ across seeds (Table~\ref{tab:mode_validation}; per-seed values in Appendix~\ref{app:layer0_dominance}). The resulting average can therefore be driven by the embedding layer even when the interior blocks are effectively unaligned, so aggregate reporting obscures the very distinction needed to separate ``measurement collapse'' from ``poor credit direction.'' This layer-$0$ dominance is specific to the ResMLP DFA setting; on ViT-Mini DFA, all layers are near zero, which strengthens the broader methodological point that alignment should be reported per layer rather than only in aggregate. With the two modes separated observationally, the remaining question is whether intervention can move them independently.
-Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same per-block scale-control penalty $\lambda{=}10^{-2}$ that rescued DFA to State Bridge and to Credit Bridge on the same 4-block $d{=}256$ ResMLP backbone over $30$ epochs and three seeds gives converged test accuracies of $0.453 \pm 0.003$ (SB) and $0.360 \pm 0.003$ (CB), with deep mean cosines of $+0.322 \pm 0.007$ (SB) and $+0.679 \pm 0.008$ (CB) and deep mean $\rho$ of $+0.402 \pm 0.015$ (SB) and $+0.464 \pm 0.025$ (CB), while DFA under the same intervention reaches $0.363 \pm 0.001$ with deep cosine $+0.155 \pm 0.025$ and deep $\rho$ $+0.080 \pm 0.011$ (Table~\ref{tab:mode_validation}; Appendix~\ref{app:sb_penalty}). The State Bridge penalty rescue is roughly $24$ percentage points above the vanilla State Bridge baseline of $0.213$ on the same architecture and, more importantly for the paper's central walk-back, exceeds the architecture-matched frozen-blocks shallow baseline of $0.349$ by $+10.4$ percentage points. State Bridge with the penalty intervention is therefore the first audited non-BP method whose trained deep blocks substantively improve over an architecture-matched random-block baseline; the headline accuracy gap is comparable to BP+penalty's $+18.1$ pp over the same shallow baseline. Neither the activation scale nor the deep BP gradient magnitude is silenced under the penalty: $\|h_L\|$ stays at $302 \pm 8$ for SB and $5680 \pm 178$ for CB, with $\|g_L\|$ at $\sim\!1.8\times 10^{-4}$ and $\sim\!1.9\times 10^{-5}$ respectively, both well within the meaningful-measurement regime, so the recovered deep cosines are computed against an informative reference and not against a numerical floor. Within this rescued regime, the three methods reveal a clean cosine-versus-accuracy dissociation. 
Credit Bridge achieves roughly $4\times$ the deep cosine of DFA and $2\times$ that of State Bridge, yet its final accuracy matches DFA's and is $9$ percentage points below State Bridge's. We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: under the same intervention and matched training budget, CB and DFA reach the same accuracy despite a $4\times$ deep-cosine gap, while SB is the best accuracy with intermediate cosine. \emph{Inference}: layerwise cosine to the BP gradient is necessary to rule out grossly wrong credit signals (it distinguishes the rescued regime from the clamp-dominated vanilla regime), but it is not sufficient to certify that the supplied signal is useful credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely whether its direction is close to the BP gradient in angle. Under this reading, CB supplies a gradient-direction surrogate that aligns with BP in angle but does not translate to a coordinated forward-state improvement, while State Bridge supplies a state-level downstream teaching signal that preserves aspects of useful credit which layerwise cosine does not measure. We state this as a mechanism hypothesis rather than a theorem because we have measured the angle-to-accuracy gap but not the full functional-credit content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number.
+Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same per-block scale-control penalty $\lambda{=}10^{-2}$ that rescued DFA to State Bridge and to Credit Bridge on the same 4-block $d{=}256$ ResMLP backbone over $30$ epochs and three seeds gives converged test accuracies of $0.453 \pm 0.003$ (SB) and $0.360 \pm 0.003$ (CB), with deep mean cosines of $+0.322 \pm 0.007$ (SB) and $+0.679 \pm 0.008$ (CB) and deep mean $\rho$ of $+0.402 \pm 0.015$ (SB) and $+0.464 \pm 0.025$ (CB), while DFA under the same intervention reaches $0.360 \pm 0.001$ with deep cosine $+0.155 \pm 0.025$ and deep $\rho$ $+0.080 \pm 0.011$ (Table~\ref{tab:mode_validation}; Appendix~\ref{app:sb_penalty}). The State Bridge penalty rescue is roughly $24$ percentage points above the vanilla State Bridge baseline of $0.213$ on the same architecture and, more importantly for the paper's central walk-back, exceeds the architecture-matched frozen-blocks shallow baseline of $0.349$ by $+10.4$ percentage points. State Bridge with the penalty intervention is therefore the first audited non-BP method whose trained deep blocks substantively improve over an architecture-matched random-block baseline; the headline accuracy gap is comparable to BP+penalty's $+18.1$ pp over the same shallow baseline. Neither the activation scale nor the deep BP gradient magnitude is silenced under the penalty: $\|h_L\|$ stays at $302 \pm 8$ for SB and $5680 \pm 178$ for CB, with $\|g_L\|$ at $\sim\!1.8\times 10^{-4}$ and $\sim\!1.9\times 10^{-5}$ respectively, both well within the meaningful-measurement regime, so the recovered deep cosines are computed against an informative reference and not against a numerical floor. Within this rescued regime, the three methods reveal a clean cosine-versus-accuracy dissociation. 
Credit Bridge achieves roughly $4\times$ the deep cosine of DFA and $2\times$ that of State Bridge, yet its final accuracy matches DFA's and is $9$ percentage points below State Bridge's. We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: under the same intervention and matched training budget, CB and DFA reach the same accuracy despite a $4\times$ deep-cosine gap, while SB is the best accuracy with intermediate cosine. \emph{Inference}: layerwise cosine to the BP gradient is necessary to rule out grossly wrong credit signals (it distinguishes the rescued regime from the clamp-dominated vanilla regime), but it is not sufficient to certify that the supplied signal is useful credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely whether its direction is close to the BP gradient in angle. Under this reading, CB supplies a gradient-direction surrogate that aligns with BP in angle but does not translate to a coordinated forward-state improvement, while State Bridge supplies a state-level downstream teaching signal that preserves aspects of useful credit which layerwise cosine does not measure. We state this as a mechanism hypothesis rather than a theorem because we have measured the angle-to-accuracy gap but not the full functional-credit content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number.
\section{Intervention and Cross-Architecture Evidence}
\label{sec:validation}
-The penalty intervention first matters as a rescue of the measurement regime. When we add a per-block penalty $\lambda \,\mathrm{mean}(\|f_l(h_l)\|^2)$ to DFA's local loss and train the 4-block $d{=}256$ ResMLP for 30 epochs on CIFAR-10, the $\lambda{=}10^{-2}$ setting contains the terminal hidden-state scale from $\|h_L\| \sim 4.4\times 10^8$ under vanilla DFA to $\sim 4.0\times 10^4$, while lifting the deepest BP reference norm from $\|g_L\| \sim 5\times 10^{-10}$ to $\sim 9.0\times 10^{-7}$, a roughly four-order-of-magnitude rescue on both quantities (Figure~\ref{fig:penalty_rescue}; Table~\ref{tab:mode_validation}). At that setting, both diagnostic~(a) and diagnostic~(b) pass on penalized DFA, and test accuracy rises to $0.363 \pm 0.001$ from $0.308 \pm 0.014$ for vanilla DFA. The key point is not yet that the recovered network has good deep credit, but that the deep reference vector is again large enough to function as a meaningful target direction rather than a clamp-dominated artifact. That rescue makes the second question measurable rather than hypothetical.
+The penalty intervention first matters as a rescue of the measurement regime. When we add a per-block penalty $\lambda \,\mathrm{mean}(\|f_l(h_l)\|^2)$ to DFA's local loss and train the 4-block $d{=}256$ ResMLP for 30 epochs on CIFAR-10, the $\lambda{=}10^{-2}$ setting contains the terminal hidden-state scale from $\|h_L\| \sim 4.4\times 10^8$ under vanilla DFA to $\sim 4.0\times 10^4$, while lifting the deepest BP reference norm from $\|g_L\| \sim 5\times 10^{-10}$ to $\sim 9.0\times 10^{-7}$, a roughly four-order-of-magnitude rescue on both quantities (Figure~\ref{fig:penalty_rescue}; Table~\ref{tab:mode_validation}). At that setting, both diagnostic~(a) and diagnostic~(b) pass on penalized DFA, and test accuracy rises to $0.360 \pm 0.001$ from $0.301 \pm 0.005$ for matched 30-epoch vanilla DFA. The key point is not yet that the recovered network has good deep credit, but that the deep reference vector is again large enough to function as a meaningful target direction rather than a clamp-dominated artifact. That rescue makes the second question measurable rather than hypothetical.
\begin{table}[t]
\centering
@@ -125,7 +125,7 @@ Fresh-$B$ null control & $\overline{\cos}_{deep}{=}+0.002{\pm}0.022$ ($n{=}20$ d
Once the reference vector is meaningful again, the deep layers no longer sit exactly at null. At $\lambda{=}10^{-2}$, penalized DFA reaches a three-seed deep-layer mean cosine of $+0.155 \pm 0.025$ and deep perturbation correlation of $+0.080 \pm 0.011$, whereas vanilla DFA is essentially zero on both metrics in the deep blocks, consistent with prior concerns that alternative feedback can fail by supplying poor credit directions even before full collapse \citep{bartunov2018assessing,moskovitz2018feedback,crafton2019backpropagation,refinetti2023aligning}. The null calibration rules out the interpretation that this recovered signal is merely measurement noise: on the same penalized checkpoint, replacing the training-time feedback matrices with 20 fresh random $B_l$ draws gives a deep cosine of only $+0.002 \pm 0.022$, with per-layer standard deviations of $0.013$--$0.023$, all within noise of zero (Table~\ref{tab:mode_validation}). The $\lambda$ sweep sharpens the dissociation further: at $\lambda{=}10^{-4}$, Mode~1 is already alleviated, with $\|h_L\|{=}2.4\times 10^4$ and $\|g_L\|{=}6.3\times 10^{-7}$, but deep cosine remains $-0.022$, while at $\lambda{=}10^{-2}$ it rises to $+0.165$ and deep $\rho$ to $+0.091$ (Figure~\ref{fig:penalty_rescue}). The improvement is real, but it is only partial.
-A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty: BP falls from $0.609 \pm 0.004$ without the penalty to $0.530$ with $\lambda{=}10^{-2}$, so the penalty has a direct cost of about $8$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.308 \pm 0.014$ to $0.363 \pm 0.001$, and State Bridge moves further still, from $0.213$ to $0.453 \pm 0.003$ (three seeds), under the same intervention (Figure~\ref{fig:penalty_rescue}; Appendix~\ref{app:sb_penalty}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty retains a margin of $+18.1$ points, State Bridge+penalty retains $+10.4$ points, and DFA+penalty retains only $+1.4$ points. The remaining BP-to-DFA gap of $17$ points is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The substantially smaller BP-to-State-Bridge gap of $0.530 - 0.453 = 7.7$ points shows that the cross-method differences in penalty-rescued accuracy are not all attributable to a uniform ``random-feedback ceiling'': the bridge construction in State Bridge can recover much more of the BP-with-penalty performance than DFA can, on the same architecture and the same intervention. The residual gap after that control is what keeps Mode~2 substantively alive while letting it have method-dependent severity.
+A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty for the same matched $30$-epoch budget: across three seeds, BP falls from $0.585 \pm 0.001$ without the penalty to $0.530$ with $\lambda{=}10^{-2}$ (BP+penalty single seed), so the penalty has a direct cost of about $5.5$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.301 \pm 0.005$ to $0.360 \pm 0.001$, and State Bridge moves further still, from $0.213$ to $0.453 \pm 0.003$, all under the same $30$-epoch intervention (Figure~\ref{fig:penalty_rescue}; Appendix~\ref{app:sb_penalty}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty retains a margin of $+18.1$ points, State Bridge+penalty retains $+10.4$ points, and DFA+penalty retains only $+1.1$ points. The remaining BP-to-DFA gap of $17.0$ points is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The substantially smaller BP-to-State-Bridge gap of $0.530 - 0.453 = 7.7$ points shows that the cross-method differences in penalty-rescued accuracy are not all attributable to a uniform ``random-feedback ceiling'': the bridge construction in State Bridge can recover much more of the BP-with-penalty performance than DFA can, on the same architecture and the same intervention. The residual gap after that control is what keeps Mode~2 substantively alive while letting it have method-dependent severity.
The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (about $1.5\times 10^4$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). A depth sweep on the $d{=}512$ ResMLP at $L \in \{2,4,6,8,12\}$ shows that the layerwise pattern is essentially depth-invariant: DFA's layer-0 cosine stays in $[+0.39,+0.40]$ across all five depths, while its mean deep-layer cosine stays within $[-0.005,+0.000]$ and its deep perturbation correlation collapses to $0.000$ in every depth tested, even though BP retains a deep-layer cosine of $+0.94$ at $L{=}12$ (Appendix~\ref{app:depth_scan}). The deep credit signal does not improve when the network is shallower, so the failure is not a "too deep" artifact. In the non-terminal-LN controls, the pattern is different: StudentNet shows diagnostic~(a) only at epochs~14--25 while diagnostic~(b) never fires across $100$ epochs and three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. The matched same-backbone ResMLP-d256 ablation in Section~\ref{sec:mode1} supplies the cleanest causal control: removing terminal LayerNorm from the same architecture preserves activation growth but eliminates the gradient floor, so diagnostic~(b) is necessary on terminal-LN ResMLP and is not just an architecture-class coincidence. The broader claim therefore holds at full strength inside the audited residual ResMLP and ViT-Mini regime, while diagnostic~(a) remains useful more broadly. 
This lets the paper end with a reporting rule rather than an overclaimed theory.
@@ -494,7 +494,7 @@ To test whether the per-block scale-control penalty $\lambda \,\mathrm{mean}(\|f
\begin{table}[h]
\centering
\small
-\caption{State Bridge with the same per-block scale-control penalty $\lambda{=}10^{-2}$ that rescues DFA in Section~\ref{sec:validation}, on the 4-block $d{=}256$ pre-LayerNorm ResMLP, 30 epochs, three seeds. SB+penalty reaches a converged test accuracy of $0.453 \pm 0.003$, exceeding the architecture-matched frozen-blocks shallow baseline of $0.349$ by $+10.4$ percentage points and the DFA+penalty value of $0.363 \pm 0.001$ by $+9.0$ percentage points. The deep mean cosine and deep mean perturbation correlation are roughly $2\times$ and $5\times$ the corresponding DFA+penalty values respectively, while the residual stream is contained but not silenced ($\|h_L\|\!\approx\!302$, $\|g_L\|\!\approx\!1.8\times 10^{-4}$). Vanilla SB on the same architecture and seed reaches only $0.213$, with $\|h_L\|\!\approx\!9.85\times 10^6$ and $\|g_L\|$ at the diagnostic-(b) floor.}
+\caption{State Bridge with the same per-block scale-control penalty $\lambda{=}10^{-2}$ that rescues DFA in Section~\ref{sec:validation}, on the 4-block $d{=}256$ pre-LayerNorm ResMLP, 30 epochs, three seeds. SB+penalty reaches a converged test accuracy of $0.453 \pm 0.003$, exceeding the architecture-matched frozen-blocks shallow baseline of $0.349$ by $+10.4$ percentage points and the matched 30-epoch DFA+penalty value of $0.360 \pm 0.001$ by $+9.3$ percentage points. The deep mean cosine and deep mean perturbation correlation are roughly $2\times$ and $5\times$ the corresponding DFA+penalty values respectively, while the residual stream is contained but not silenced ($\|h_L\|\!\approx\!302$, $\|g_L\|\!\approx\!1.8\times 10^{-4}$). Vanilla SB on the same architecture and seed reaches only $0.213$, with $\|h_L\|\!\approx\!9.85\times 10^6$ and $\|g_L\|$ at the diagnostic-(b) floor.}
\label{tab:sb_penalty}
\begin{tabularx}{\linewidth}{@{\extracolsep{\fill}}lrrrrr@{}}
\toprule
@@ -512,12 +512,12 @@ CB+pen mean & $0.360 \pm 0.003$ & $5680 \pm 178$ & $1.90\times 10^{-5}$ & $+0.67
\midrule
vanilla SB $42$ & $0.213$ & $9.85\times 10^6$ & $1\times 10^{-8}$ & --- & --- \\
vanilla CB $42$ & $0.211$ & $6.7\times 10^7$ & $\sim 0$ & --- & --- \\
-DFA+pen mean & $0.363 \pm 0.001$ & $4.0\times 10^4$ & $9.0\times 10^{-7}$ & $+0.155 \pm 0.025$ & $+0.080 \pm 0.011$ \\
+DFA+pen mean & $0.360 \pm 0.001$ & $4.0\times 10^4$ & $9.0\times 10^{-7}$ & $+0.155 \pm 0.025$ & $+0.080 \pm 0.011$ \\
\bottomrule
\end{tabularx}
\end{table}
-The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.5$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.155$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs averages $14.3\times$ for SB+penalty, $18.6\times \pm 0.5$ for DFA+penalty, and $19.3\times$ for CB+penalty (three seeds each), and the embedding layer's relative drift is $7.1\times$ for SB versus $44.6\times$ for CB and $94.6\times \pm 1.4$ for DFA, so none of the three methods' per-block updates are silenced under penalty and CB's are in fact larger in magnitude than SB's while DFA's embedding updates are the largest of all, yet CB's and DFA's final accuracies are both $9.3$ percentage points below State Bridge's. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. 
The nudging test at the same checkpoints provides the direct functional measurement: taking a small step of size $\eta{=}0.01$ in the direction of each method's per-layer credit $a_l$ decreases the test loss by $-1.78\times 10^{-3}$ on average over the deep blocks for SB+penalty, by $-0.45\times 10^{-3}$ for CB+penalty, and by only $-5\times 10^{-5}$ for DFA+penalty (three seeds each, $30$-epoch runs via the same training script). At the same per-layer credit direction, a step in SB's direction moves the loss about four times more than a step in CB's direction and about thirty-five times more than a step in DFA's direction, even though CB's direction is more aligned with the BP gradient in angle than either. The $30$-epoch training trajectories give a third independent confirmation: SB+penalty's training loss falls from $2.047$ at epoch $1$ to $1.589$ at epoch $30$, a decrease of $0.458$, whereas CB+penalty's training loss falls by only $0.122$ and DFA+penalty's by only $0.095 \pm 0.007$ over the same $30$ epochs (three seeds). Deep cosine ranks the three methods CB $>$ SB $>$ DFA, but every functional metric (nudging, integrated training-loss decrease, headline accuracy) ranks them SB $\gg$ CB $\approx$ DFA: the ordering produced by deep cosine is the only one that does not predict accuracy correctly. This is the strongest form of the cos-versus-accuracy dissociation: across three audited fixed-feedback methods under the same penalty intervention, the ranking implied by angular agreement with the BP gradient is contradicted by three independent functional measurements that do predict accuracy. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. 
Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline.
+The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.9$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.155$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs averages $14.3\times$ for SB+penalty, $18.6\times \pm 0.5$ for DFA+penalty, and $19.3\times$ for CB+penalty (three seeds each), and the embedding layer's relative drift is $7.1\times$ for SB versus $44.6\times$ for CB and $94.6\times \pm 1.4$ for DFA, so none of the three methods' per-block updates are silenced under penalty and CB's are in fact larger in magnitude than SB's while DFA's embedding updates are the largest of all, yet CB's and DFA's final accuracies are both $9.3$ percentage points below State Bridge's. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. 
The nudging test at the same checkpoints provides the direct functional measurement: taking a small step of size $\eta{=}0.01$ in the direction of each method's per-layer credit $a_l$ decreases the test loss by $-1.78\times 10^{-3}$ on average over the deep blocks for SB+penalty, by $-0.45\times 10^{-3}$ for CB+penalty, and by only $-5\times 10^{-5}$ for DFA+penalty (three seeds each, $30$-epoch runs via the same training script). At the same per-layer credit direction, a step in SB's direction moves the loss about four times more than a step in CB's direction and about thirty-five times more than a step in DFA's direction, even though CB's direction is more aligned with the BP gradient in angle than either. The $30$-epoch training trajectories give a third independent confirmation: SB+penalty's training loss falls from $2.047$ at epoch $1$ to $1.589$ at epoch $30$, a decrease of $0.458$, whereas CB+penalty's training loss falls by only $0.122$ and DFA+penalty's by only $0.095 \pm 0.007$ over the same $30$ epochs (three seeds). Deep cosine ranks the three methods CB $>$ SB $>$ DFA, but every functional metric (nudging, integrated training-loss decrease, headline accuracy) ranks them SB $\gg$ CB $\approx$ DFA: the ordering produced by deep cosine is the only one that does not predict accuracy correctly. This is the strongest form of the cos-versus-accuracy dissociation: across three audited fixed-feedback methods under the same penalty intervention, the ranking implied by angular agreement with the BP gradient is contradicted by three independent functional measurements that do predict accuracy. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. 
Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline.
\section{Layer-0 Dominance: Per-Seed Vanilla DFA Early-Epoch Cosines}
\label{app:layer0_dominance}