author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 19:13:57 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 19:13:57 -0500
commit: 7191dba5505253f32a7bc576031768e72eef9441
tree: 49d0185d17ea99af1a080206ba72c913db8b3657 /paper
parent: d022688ea9fcfcb81f900751ee92e35597ef19b8
paper v2.32.1: §5 ¶3 BP-to-SB gap 7.7 → 7.9 (cascade from BP+pen 0.530 → 0.532)
The §5 ¶3 explicit math "0.530 - 0.453 = 7.7 points" was a leftover from the BP+pen single-seed era. With v2.32's BP+pen 3-seed value 0.532, the math becomes 0.532 - 0.453 = 0.079 = 7.9 points. Updated to "0.532 - 0.453 = 7.9 points". This is the last cascade fix from the BP+pen multi-seed promotion. The qualitative point ("BP-to-SB gap is substantially smaller than the BP-to-DFA gap of 17.2 pp, so cross-method differences are not just a random-feedback ceiling") is unchanged: 7.9 pp << 17.2 pp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
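The cascade described above can be sanity-checked mechanically. The following is a minimal sketch, not part of the repository: the names `ACC`, `points`, and `derived` are hypothetical, and the inputs are simply the three-seed means quoted in the hunk below. It recomputes every derived gap and margin in the paragraph from those means, so a stale number such as the old 7.7 would show up as a mismatch.

```python
from decimal import Decimal

# Accuracies quoted in the section 5 paragraph (hypothetical container name).
# Decimal avoids binary floating-point noise in three-digit subtractions.
ACC = {
    "bp": Decimal("0.585"),       # BP without penalty
    "bp_pen": Decimal("0.532"),   # BP with lambda = 1e-2 (three-seed mean)
    "dfa_pen": Decimal("0.360"),  # DFA with penalty
    "sb_pen": Decimal("0.453"),   # State Bridge with penalty
    "frozen": Decimal("0.349"),   # frozen-blocks baseline
}


def points(a: str, b: str) -> Decimal:
    """Return ACC[a] - ACC[b] expressed in percentage points."""
    return (ACC[a] - ACC[b]) * 100


# Every derived quantity the paragraph states, recomputed from the means.
derived = {
    "penalty_cost_bp": points("bp", "bp_pen"),
    "bp_to_dfa_gap": points("bp_pen", "dfa_pen"),
    "bp_to_sb_gap": points("bp_pen", "sb_pen"),
    "bp_margin_over_frozen": points("bp_pen", "frozen"),
    "sb_margin_over_frozen": points("sb_pen", "frozen"),
    "dfa_margin_over_frozen": points("dfa_pen", "frozen"),
}

if __name__ == "__main__":
    for name, value in derived.items():
        print(f"{name}: {value.normalize()} pp")
```

Running the script prints one line per derived quantity; comparing that output against the numbers in the prose is enough to catch a leftover value after a table update.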
Diffstat (limited to 'paper')
-rw-r--r-- paper/main.pdf bin 501197 -> 501187 bytes
-rw-r--r-- paper/main.tex 2
2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index 48a39d9..642e971 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index 7f3b79d..25d5853 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -125,7 +125,7 @@ Fresh-$B$ null control & $\overline{\cos}_{deep}{=}+0.002{\pm}0.022$ ($n{=}20$ d
Once the reference vector is meaningful again, the deep layers no longer sit exactly at null. At $\lambda{=}10^{-2}$, penalized DFA reaches a three-seed deep-layer mean cosine of $+0.151 \pm 0.025$ and deep perturbation correlation of $+0.080 \pm 0.011$, whereas vanilla DFA is essentially zero on both metrics in the deep blocks, consistent with prior concerns that alternative feedback can fail by supplying poor credit directions even before full collapse \citep{bartunov2018assessing,moskovitz2018feedback,crafton2019backpropagation,refinetti2023aligning}. The null calibration rules out the interpretation that this recovered signal is merely measurement noise: on the same penalized checkpoint, replacing the training-time feedback matrices with 20 fresh random $B_l$ draws gives a deep cosine of only $+0.002 \pm 0.022$, with per-layer standard deviations of $0.013$--$0.023$, all within noise of zero (Table~\ref{tab:mode_validation}). The $\lambda$ sweep sharpens the dissociation further: at $\lambda{=}10^{-4}$, Mode~1 is already alleviated, with three-seed mean $\|h_L\|{\approx}2.2\times 10^4$ and $\|g_L\|{\approx}7.0\times 10^{-7}$, but the three-seed deep cosine remains $-0.020$, while $\lambda{=}10^{-2}$ delivers the $+0.151$ and $+0.080$ above (Figure~\ref{fig:penalty_rescue}). The improvement is real, but it is only partial.
-A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty for the same matched $30$-epoch budget: across three seeds, BP falls from $0.585 \pm 0.001$ without the penalty to $0.532 \pm 0.006$ with $\lambda{=}10^{-2}$, so the penalty has a direct cost of about $5.3$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.301 \pm 0.005$ to $0.360 \pm 0.001$, and State Bridge moves further still, from $0.213$ to $0.453 \pm 0.003$, all under the same $30$-epoch intervention (Figure~\ref{fig:penalty_rescue}; Appendix~\ref{app:sb_penalty}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty retains a margin of $+18.3$ points, State Bridge+penalty retains $+10.4$ points, and DFA+penalty retains only $+1.1$ points. The remaining BP-to-DFA gap of $17.2$ points is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The substantially smaller BP-to-State-Bridge gap of $0.530 - 0.453 = 7.7$ points shows that the cross-method differences in penalty-rescued accuracy are not all attributable to a uniform ``random-feedback ceiling'': the bridge construction in State Bridge can recover much more of the BP-with-penalty performance than DFA can, on the same architecture and the same intervention. The residual gap after that control is what keeps Mode~2 substantively alive while letting it have method-dependent severity.
+A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty for the same matched $30$-epoch budget: across three seeds, BP falls from $0.585 \pm 0.001$ without the penalty to $0.532 \pm 0.006$ with $\lambda{=}10^{-2}$, so the penalty has a direct cost of about $5.3$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.301 \pm 0.005$ to $0.360 \pm 0.001$, and State Bridge moves further still, from $0.213$ to $0.453 \pm 0.003$, all under the same $30$-epoch intervention (Figure~\ref{fig:penalty_rescue}; Appendix~\ref{app:sb_penalty}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty retains a margin of $+18.3$ points, State Bridge+penalty retains $+10.4$ points, and DFA+penalty retains only $+1.1$ points. The remaining BP-to-DFA gap of $17.2$ points is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The substantially smaller BP-to-State-Bridge gap of $0.532 - 0.453 = 7.9$ points shows that the cross-method differences in penalty-rescued accuracy are not all attributable to a uniform ``random-feedback ceiling'': the bridge construction in State Bridge can recover much more of the BP-with-penalty performance than DFA can, on the same architecture and the same intervention. The residual gap after that control is what keeps Mode~2 substantively alive while letting it have method-dependent severity.
The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (DFA three-seed mean about $7\times 10^3$ vs $\sim\!1.9\times 10^3$ at $d{=}256$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). A depth sweep on the $d{=}512$ ResMLP at $L \in \{2,4,6,8,12\}$ shows that the layerwise pattern is essentially depth-invariant: DFA's layer-0 cosine stays in $[+0.38,+0.40]$ across all five depths, while its mean deep-layer cosine stays within $[-0.005,+0.000]$ and its deep perturbation correlation collapses to $0.000$ in every depth tested, even though BP retains a deep-layer cosine of $+0.94$ at $L{=}12$ (Appendix~\ref{app:depth_scan}). The deep credit signal does not improve when the network is shallower, so the failure is not a "too deep" artifact. In the non-terminal-LN controls, the pattern is different: the no-terminal-LN ResMLP-d256 ablation shows diagnostic~(a) firing across three seeds at epochs $\{18, 14, 25\}$ but diagnostic~(b) never fires across $100$ epochs and the same three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. The matched same-backbone ResMLP-d256 ablation in Section~\ref{sec:mode1} supplies the cleanest causal control: removing terminal LayerNorm from the same architecture preserves activation growth but eliminates the gradient floor, so diagnostic~(b) is necessary on terminal-LN ResMLP and is not just an architecture-class coincidence. 
The broader claim therefore holds at full strength inside the audited residual ResMLP and ViT-Mini regime, while diagnostic~(a) remains useful more broadly. This lets the paper end with a reporting rule rather than an overclaimed theory.