Add depth-sweep evidence to §5 + Appendix H from existing d=512 L=2,4,6,8,12 data

The cifar_depth_scan_s42 results were already on disk but not surfaced in the paper. Across L in {2,4,6,8,12} on the d=512 ResMLP, DFA's layer-0 cosine stays in [+0.38,+0.40] and its mean deep cosine stays within [-0.005,+0.000], while BP retains a deep cosine of +0.94 even at L=12. This rules out the 'too deep to receive useful credit' explanation: making the network shallower does not deliver a better credit signal to the deep blocks.

- §5 paragraph 4: one-sentence depth-invariance summary citing the new appendix
- New Appendix H: Depth-Sweep Layerwise Profiles, with full table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 05:18:38 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 05:18:38 -0500
commit: 68cfa13af2f026b7ff388aae4420eba0f0db804a (patch)
tree: 6f02841f579b83d2bdfd17badb3287b7db10603d
parent: 3e49631d6dbf6b66683cb7dac5ecf71df2ad7de0 (diff)
2 files changed, 36 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index b13f9af..6cdc1d4 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
diff --git a/paper/main.tex b/paper/main.tex
index 1a6f189..de6b546 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -119,7 +119,7 @@ Once the reference vector is meaningful again, the deep layers no longer sit exa
 
 A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty: BP falls from $0.609 \pm 0.004$ without the penalty to $0.530$ with $\lambda{=}10^{-2}$, so the penalty has a direct cost of about $8$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.308 \pm 0.014$ to $0.363 \pm 0.001$ under the same intervention (Figure~\ref{fig:penalty_rescue}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty still retains a margin of $+18.1$ points, while DFA+penalty retains only $+1.4$ points. The remaining gap, $0.530 - 0.363 = 0.167$ (about $17$ points), is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The residual gap after that control is what keeps Mode~2 substantively alive.
 
-The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (about $1.5\times 10^4$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). In the non-terminal-LN controls, the pattern is different: StudentNet shows diagnostic~(a) only at epochs~14--25 while diagnostic~(b) never fires across $100$ epochs and three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. This is an observational association rather than a causal identification of terminal LayerNorm as the unique mechanism, but it is enough to support a narrower claim: diagnostic~(b) appears tied to the terminal-LN architectures audited here, while diagnostic~(a) remains useful more broadly. This lets the paper end with a reporting rule rather than an overclaimed theory.
+The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (about $1.5\times 10^4$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). A depth sweep on the $d{=}512$ ResMLP at $L \in \{2,4,6,8,12\}$ shows that the layerwise pattern is essentially depth-invariant: DFA's layer-0 cosine stays in $[+0.38,+0.40]$ across all five depths, while its mean deep-layer cosine stays within $[-0.005,+0.000]$ and its deep perturbation correlation collapses to $0.000$ at every depth tested, even though BP retains a deep-layer cosine of $+0.94$ at $L{=}12$ (Appendix~\ref{app:depth_scan}). The deep credit signal does not improve when the network is shallower, so the failure is not a ``too deep'' artifact. In the non-terminal-LN controls, the pattern is different: StudentNet shows diagnostic~(a) only at epochs~14--25 while diagnostic~(b) never fires across $100$ epochs and three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. This is an observational association rather than a causal identification of terminal LayerNorm as the unique mechanism, but it is enough to support a narrower claim: diagnostic~(b) appears tied to the terminal-LN architectures audited here, while diagnostic~(a) remains useful more broadly. This lets the paper end with a reporting rule rather than an overclaimed theory.
 
 \begin{figure}[t]
 \centering
@@ -349,6 +349,41 @@ The per-architecture appendix should be short and comparative. On pre-LayerNorm
 
 StudentNet and the CNN serve a different role. They test whether the protocol overgeneralizes from terminal-normalized residual architectures to settings where gradient-floor collapse is not expected. In those models, activation-growth checks can still reveal weak depth usage or poor scaling, but diagnostic (b) is not expected to fire in the same way. This asymmetry is not a weakness of the protocol; it is part of the empirical scoping claim of the paper and helps prevent readers from mistaking a targeted evaluation standard for a universal pathology claim \citep{xiong2020layer,bartunov2018assessing}.
 
+\section{Depth-Sweep Layerwise Profiles}
+\label{app:depth_scan}
+
+To check whether the layerwise pattern in Figure~\ref{fig:audit_hero} is an artifact of the specific four-block depth used in the main audit, we repeated the same audit on $d{=}512$ pre-LayerNorm ResMLPs at five depths $L \in \{2,4,6,8,12\}$ on CIFAR-10 (single seed $42$, otherwise matched configuration). Table~\ref{tab:depth_scan} reports the layer-0 cosine, the mean cosine over all deeper layers, and the deep mean perturbation correlation $\rho$ for each depth.
+
+\begin{table}[h]
+\centering
+\small
+\caption{Depth sweep on $d{=}512$ ResMLP, seed 42, 100 epochs on CIFAR-10. \emph{layer-0 cos} is the embedding-block BP cosine, \emph{deep cos} is the mean BP cosine over the remaining $L{-}1$ blocks, and \emph{deep $\rho$} is the corresponding mean perturbation correlation. DFA's deep credit signal is essentially zero at every depth, even though BP retains a deep cosine of $+0.94$ at $L{=}12$.}
+\label{tab:depth_scan}
+\begin{tabular}{rlrrrr}
+\toprule
+$L$ & method & test acc & layer-0 cos & deep cos & deep $\rho$ \\
+\midrule
+$2$ & BP & $0.599$ & $+1.000$ & $+1.000$ & $+0.983$ \\
+$2$ & DFA & $0.312$ & $+0.396$ & $-0.005$ & $+0.000$ \\
+$2$ & Credit Bridge & $0.310$ & $+0.330$ & $+0.020$ & $+0.000$ \\
+$4$ & BP & $0.603$ & $+1.000$ & $+1.000$ & $+0.988$ \\
+$4$ & DFA & $0.314$ & $+0.400$ & $-0.000$ & $+0.000$ \\
+$4$ & Credit Bridge & $0.298$ & $+0.402$ & $+0.030$ & $+0.000$ \\
+$6$ & BP & $0.602$ & $+0.993$ & $+0.993$ & $+0.991$ \\
+$6$ & DFA & $0.310$ & $+0.387$ & $-0.000$ & $+0.000$ \\
+$6$ & Credit Bridge & $0.299$ & $+0.304$ & $+0.054$ & $+0.000$ \\
+$8$ & BP & $0.589$ & $+0.965$ & $+0.965$ & $+0.992$ \\
+$8$ & DFA & $0.306$ & $+0.377$ & $-0.000$ & $+0.000$ \\
+$8$ & Credit Bridge & $0.288$ & $+0.205$ & $+0.022$ & $+0.000$ \\
+$12$ & BP & $0.594$ & $+0.942$ & $+0.940$ & $+0.990$ \\
+$12$ & DFA & $0.309$ & $+0.388$ & $-0.000$ & $+0.000$ \\
+$12$ & Credit Bridge & $0.239$ & $+0.208$ & $+0.016$ & $+0.000$ \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+The layerwise pattern is essentially depth-invariant. DFA's layer-0 cosine stays in $[+0.38,+0.40]$ across all five depths, while its mean deep cosine sits within $[-0.005,+0.000]$ and its deep $\rho$ collapses to numerical zero at every depth. Credit Bridge shows a slightly milder version of the same shape, with a small positive deep cosine that does not improve as depth shrinks. BP, by contrast, maintains a deep cosine of $+0.94$ even at $L{=}12$, so the BP reference is still measurably non-degenerate where DFA and Credit Bridge are flat. This rules out the explanation that DFA's deep blocks are merely too far from the loss to receive useful credit: making the network shallower does not deliver a better credit signal to the deep blocks. The failure is structural to the credit signal rather than an artifact of depth.
+
 \section{Reproducibility}
 \label{app:reproducibility}
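
The ranges quoted in the appendix prose can be cross-checked against the table rows with a few lines of Python. This is a minimal sketch and not part of the patch: the dictionary names and the `column_range` helper are hypothetical, and the numbers are transcribed by hand from the BP and DFA rows of Table `tab:depth_scan` above (Credit Bridge is omitted for brevity).

```python
# Hypothetical cross-check, not part of the commit.
# Values are transcribed from the depth-sweep table (tab:depth_scan):
# depth L -> (layer-0 cos, deep cos).
DFA = {
    2: (0.396, -0.005),
    4: (0.400, -0.000),
    6: (0.387, -0.000),
    8: (0.377, -0.000),
    12: (0.388, -0.000),
}
BP = {
    2: (1.000, 1.000),
    4: (1.000, 1.000),
    6: (0.993, 0.993),
    8: (0.965, 0.965),
    12: (0.942, 0.940),
}


def column_range(rows, col):
    """Return (min, max) of one table column across all depths."""
    values = [row[col] for row in rows.values()]
    return min(values), max(values)


dfa_layer0_range = column_range(DFA, 0)  # spread of DFA layer-0 cosine
dfa_deep_range = column_range(DFA, 1)    # spread of DFA mean deep cosine
bp_deep_range = column_range(BP, 1)      # spread of BP mean deep cosine

if __name__ == "__main__":
    for name, (lo, hi) in [
        ("DFA layer-0 cos", dfa_layer0_range),
        ("DFA deep cos", dfa_deep_range),
        ("BP deep cos", bp_deep_range),
    ]:
        print(f"{name}: [{lo:+.3f}, {hi:+.3f}]")
```

Keeping the check at the level of table columns means it only confirms that the prose summary agrees with the reported rows; it says nothing about how the cosines themselves were computed.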