diff options
| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 05:18:38 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 05:18:38 -0500 |
| commit | 68cfa13af2f026b7ff388aae4420eba0f0db804a (patch) | |
| tree | 6f02841f579b83d2bdfd17badb3287b7db10603d | |
| parent | 3e49631d6dbf6b66683cb7dac5ecf71df2ad7de0 (diff) | |
Add depth-sweep evidence to §5 + Appendix H from existing d=512 L=2,4,6,8,12 data
The cifar_depth_scan_s42 results were already on disk but not surfaced in
the paper. Across L in {2,4,6,8,12} on the d=512 ResMLP, DFA's layer-0
cosine stays in [+0.39,+0.40] and its mean deep cosine stays within
[-0.005,+0.000], while BP retains a deep cosine of +0.94 even at L=12.
This rules out the 'too deep to receive useful credit' explanation:
making the network shallower does not reach the deep blocks any better.
- §5 paragraph 4: one-sentence depth-invariance summary citing the new appendix
- New Appendix H: Depth-Sweep Layerwise Profiles, with full table
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| -rw-r--r-- | paper/main.pdf | bin | 448286 -> 452349 bytes | |||
| -rw-r--r-- | paper/main.tex | 37 |
2 files changed, 36 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf Binary files differindex b13f9af..6cdc1d4 100644 --- a/paper/main.pdf +++ b/paper/main.pdf diff --git a/paper/main.tex b/paper/main.tex index 1a6f189..de6b546 100644 --- a/paper/main.tex +++ b/paper/main.tex @@ -119,7 +119,7 @@ Once the reference vector is meaningful again, the deep layers no longer sit exa A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty: BP falls from $0.609 \pm 0.004$ without the penalty to $0.530$ with $\lambda{=}10^{-2}$, so the penalty has a direct cost of about $8$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.308 \pm 0.014$ to $0.363 \pm 0.001$ under the same intervention (Figure~\ref{fig:penalty_rescue}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty still retains a margin of $+18.1$ points, while DFA+penalty retains only $+1.4$ points. The remaining gap, $0.530 - 0.363 = 17$ points, is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The residual gap after that control is what keeps Mode~2 substantively alive. -The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (about $1.5\times 10^4$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). In the non-terminal-LN controls, the pattern is different: StudentNet shows diagnostic~(a) only at epochs~14--25 while diagnostic~(b) never fires across $100$ epochs and three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. This is an observational association rather than a causal identification of terminal LayerNorm as the unique mechanism, but it is enough to support a narrower claim: diagnostic~(b) appears tied to the terminal-LN architectures audited here, while diagnostic~(a) remains useful more broadly. This lets the paper end with a reporting rule rather than an overclaimed theory. +The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (about $1.5\times 10^4$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). A depth sweep on the $d{=}512$ ResMLP at $L \in \{2,4,6,8,12\}$ shows that the layerwise pattern is essentially depth-invariant: DFA's layer-0 cosine stays in $[+0.39,+0.40]$ across all five depths, while its mean deep-layer cosine stays within $[-0.005,+0.000]$ and its deep perturbation correlation collapses to $0.000$ in every depth tested, even though BP retains a deep-layer cosine of $+0.94$ at $L{=}12$ (Appendix~\ref{app:depth_scan}). The deep credit signal does not improve when the network is shallower, so the failure is not a "too deep" artifact. In the non-terminal-LN controls, the pattern is different: StudentNet shows diagnostic~(a) only at epochs~14--25 while diagnostic~(b) never fires across $100$ epochs and three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. This is an observational association rather than a causal identification of terminal LayerNorm as the unique mechanism, but it is enough to support a narrower claim: diagnostic~(b) appears tied to the terminal-LN architectures audited here, while diagnostic~(a) remains useful more broadly. This lets the paper end with a reporting rule rather than an overclaimed theory. \begin{figure}[t] \centering @@ -349,6 +349,41 @@ The per-architecture appendix should be short and comparative. On pre-LayerNorm StudentNet and the CNN serve a different role. They test whether the protocol overgeneralizes from terminal-normalized residual architectures to settings where gradient-floor collapse is not expected. In those models, activation-growth checks can still reveal weak depth usage or poor scaling, but diagnostic (b) is not expected to fire in the same way. This asymmetry is not a weakness of the protocol; it is part of the empirical scoping claim of the paper and helps prevent readers from mistaking a targeted evaluation standard for a universal pathology claim \citep{xiong2020layer,bartunov2018assessing}. +\section{Depth-Sweep Layerwise Profiles} +\label{app:depth_scan} + +To check whether the layerwise pattern in Figure~\ref{fig:audit_hero} is an artifact of the specific four-block depth used in the main audit, we ran the same architecture on $d{=}512$ pre-LayerNorm ResMLPs at five depths $L \in \{2,4,6,8,12\}$ on CIFAR-10 (single seed $42$, otherwise matched configuration). Table~\ref{tab:depth_scan} reports the layer-0 cosine, the mean cosine over all deeper layers, and the deep mean perturbation correlation $\rho$ for each depth. + +\begin{table}[h] +\centering +\small +\caption{Depth sweep on $d{=}512$ ResMLP, seed 42, 100 epochs CIFAR-10. \emph{layer-0 cos} is the embedding-block BP cosine, \emph{deep cos} is the mean BP cosine over the remaining $L{-}1$ blocks, and \emph{deep $\rho$} is the corresponding mean perturbation correlation. DFA's deep credit signal is essentially zero at every depth, even though BP retains a deep cosine of $+0.94$ at $L{=}12$.} +\label{tab:depth_scan} +\begin{tabular}{rlrrrr} +\toprule +$L$ & method & test acc & layer-0 cos & deep cos & deep $\rho$ \\ +\midrule +$2$ & BP & $0.599$ & $+1.000$ & $+1.000$ & $+0.983$ \\ +$2$ & DFA & $0.312$ & $+0.396$ & $-0.005$ & $+0.000$ \\ +$2$ & Credit Bridge & $0.310$ & $+0.330$ & $+0.020$ & $+0.000$ \\ +$4$ & BP & $0.603$ & $+1.000$ & $+1.000$ & $+0.988$ \\ +$4$ & DFA & $0.314$ & $+0.400$ & $-0.000$ & $+0.000$ \\ +$4$ & Credit Bridge & $0.298$ & $+0.402$ & $+0.030$ & $+0.000$ \\ +$6$ & BP & $0.602$ & $+0.993$ & $+0.993$ & $+0.991$ \\ +$6$ & DFA & $0.310$ & $+0.387$ & $-0.000$ & $+0.000$ \\ +$6$ & Credit Bridge & $0.299$ & $+0.304$ & $+0.054$ & $+0.000$ \\ +$8$ & BP & $0.589$ & $+0.965$ & $+0.965$ & $+0.992$ \\ +$8$ & DFA & $0.306$ & $+0.377$ & $-0.000$ & $+0.000$ \\ +$8$ & Credit Bridge & $0.288$ & $+0.205$ & $+0.022$ & $+0.000$ \\ +$12$ & BP & $0.594$ & $+0.942$ & $+0.940$ & $+0.990$ \\ +$12$ & DFA & $0.309$ & $+0.388$ & $-0.000$ & $+0.000$ \\ +$12$ & Credit Bridge & $0.239$ & $+0.208$ & $+0.016$ & $+0.000$ \\ +\bottomrule +\end{tabular} +\end{table} + +The layerwise pattern is essentially depth-invariant. DFA's layer-0 cosine stays in $[+0.39,+0.40]$ across all five depths, while its mean deep cosine sits within $[-0.005,+0.000]$ and its deep $\rho$ collapses to numerical zero in every condition. Credit Bridge shows a slightly milder version of the same shape, with a small positive deep cosine that does not improve as depth shrinks. BP, by contrast, maintains a deep cosine of $+0.94$ even at $L{=}12$, so the BP reference is still measurably non-degenerate where DFA and Credit Bridge are flat. This rules out the explanation that DFA's deep blocks are merely too far from the loss to receive useful credit: making the network shallower does not reach the deep blocks any better. The failure is structural to the credit signal rather than an artifact of depth. + \section{Reproducibility} \label{app:reproducibility} |
