| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 18:32:23 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 18:32:23 -0500 |
| commit | 0c1d102c57d86d914eb1122dd59f329667db60d8 (patch) | |
| tree | 04d8e676817fa9f243d466686efdfcb2883bff99 /paper | |
| parent | 2b4581723d0c5ed562528fac6b0a789adf95e3c5 (diff) | |
paper v2.31.9: relabel "StudentNet" → "no-terminal-LN ResMLP"
The §3 ¶3 / §5 ¶3 / Figure 5 / §7 mentions of "StudentNet" as a
cross-architecture validation case were a misleading rebrand of the
no-terminal-LN ResMLP-d256 ablation. Verified by tracing the data:
results/protocol_audit/temporal_evolution_s{42,123,456}.json
final_acc 0.332/0.313/0.336 (matches no-outln 3-seed 0.327±0.012)
first_fire_a {18, 14, 25}
first_fire_b None / None / None
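
The match between the per-seed accuracies and the reported 0.327±0.012 can be checked by hand. The sketch below uses only the three final_acc values quoted above; treating the ± figure as a sample standard deviation (n - 1 denominator) is an assumption, chosen because it reproduces the quoted number.

```python
import statistics

# Per-seed final accuracies quoted in the commit message (seeds 42, 123, 456).
final_acc = {42: 0.332, 123: 0.313, 456: 0.336}

values = list(final_acc.values())
mean_acc = statistics.mean(values)
# Sample standard deviation (n - 1 denominator); this convention is an assumption.
std_acc = statistics.stdev(values)

print(f"{mean_acc:.3f} ± {std_acc:.3f}")  # 0.327 ± 0.012
```

With a population standard deviation (n denominator) the spread would round to 0.010 instead, so the sample convention appears to be the one behind the 3-seed figure.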
The actual synth StudentNet (results/snapshot_synth_v1, d=128 alpha=1.0)
has max-per-block growth ~6.88 over 80 epochs and never reaches the
50× threshold, so diagnostic (a) does NOT fire on the real synth
StudentNet at all. Calling the no-outln data "StudentNet"
double-counted the same architecture under two names (the same-backbone
causal control AND the cross-arch generalization test).
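
A minimal sketch of how a "first fire" epoch for diagnostic (a) could be derived from a per-epoch growth series, assuming the diagnostic fires at the first epoch whose max-per-block growth reaches the 50× threshold. The function name, the `>=` comparison, and both example series are hypothetical; they are not taken from the repository.

```python
from typing import Optional, Sequence

GROWTH_THRESHOLD = 50.0  # the 50x threshold named in the commit message


def first_fire_epoch(
    growth_per_epoch: Sequence[float],
    threshold: float = GROWTH_THRESHOLD,
) -> Optional[int]:
    """Return the first epoch index whose growth reaches the threshold, else None."""
    for epoch, growth in enumerate(growth_per_epoch):
        if growth >= threshold:  # assumption: "fires" means reaching the threshold
            return epoch
    return None


# Made-up illustrative series, not data from the repository.
never_fires = [1.0, 3.2, 5.1, 6.88]   # peaks well below 50x, so nothing fires
fires_late = [1.0, 12.0, 49.0, 63.0]  # crosses 50x at index 3

print(first_fire_epoch(never_fires))  # None
print(first_fire_epoch(fires_late))   # 3
```

Under this reading, a run whose growth peaks near 6.88 would report None for diagnostic (a), which is the behaviour described for the real synth StudentNet.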
Relabeled to "no-terminal-LN ResMLP" everywhere it appeared:
- §3 ¶3 paragraph 1 cross-arch list
- §3 ¶3 paragraph 2 (now with explicit per-seed first-fire epochs {18,14,25})
- §5 paragraph (the conclusion)
- §7 conclusion (cross-arch list)
- Figure 5 caption
- Figure 5 row label (with re-rendered PDF)
The remaining cross-arch generalization claim is now: ViT-Mini fires
both diagnostics, ResMLP at d=256/d=512 fires both, no-terminal-LN
ResMLP and BatchNorm CNN fire only (a) — three real architecture
classes, with the no-LN ablation being the same-backbone control rather
than a separate architecture. The cross-arch story is slightly weaker
("3 architecture classes" not "4") but truthful and self-consistent.
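
The grouping stated above can be read off the (a) and (b) columns of the verdict matrix in render_fig5_cross_arch.py. The sketch below copies those two columns from the diff; reading 1 as "diagnostic fires" follows the per-row comments in that script and is an assumption about its encoding.

```python
# (a), (b) columns of the DFA verdict matrix, copied from the diff below.
# Assumption: 1 means the diagnostic fires, 0 means it does not.
verdicts = {
    "ResMLP d=256 (terminal LN)": (1, 1),
    "ResMLP d=512 (terminal LN)": (1, 1),
    "ViT-Mini (cls + LN)": (1, 1),
    "ResMLP d=256 (no terminal LN)": (1, 0),
    "CNN BatchNorm (no terminal LN)": (1, 0),
}

fires_both = [name for name, (a, b) in verdicts.items() if a and b]
fires_only_a = [name for name, (a, b) in verdicts.items() if a and not b]

print("both (a) and (b):", fires_both)
print("only (a):", fires_only_a)
```

Every row with a terminal LN lands in the first group and every row without one lands in the second, which is the pattern the relabeled claim rests on.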
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diffstat (limited to 'paper')
| -rw-r--r-- | paper/figures/fig5_cross_arch_summary.pdf | bin | 32029 -> 31577 bytes | |||
| -rw-r--r-- | paper/figures/render_fig5_cross_arch.py | 6 | ||||
| -rw-r--r-- | paper/main.pdf | bin | 501174 -> 500784 bytes | |||
| -rw-r--r-- | paper/main.tex | 10 |
4 files changed, 8 insertions, 8 deletions
diff --git a/paper/figures/fig5_cross_arch_summary.pdf b/paper/figures/fig5_cross_arch_summary.pdf
Binary files differ
index 00267a7..efab2b8 100644
--- a/paper/figures/fig5_cross_arch_summary.pdf
+++ b/paper/figures/fig5_cross_arch_summary.pdf
diff --git a/paper/figures/render_fig5_cross_arch.py b/paper/figures/render_fig5_cross_arch.py
index 9ad9ce2..9d52e09 100644
--- a/paper/figures/render_fig5_cross_arch.py
+++ b/paper/figures/render_fig5_cross_arch.py
@@ -10,8 +10,8 @@ REPO_ROOT = "/home/yurenh2/fa"
 # Verdict matrix: arch x diagnostic
 # 0 = ok (BP), 1 = ok-non-LN-arch, 2 = walk-back
 # Columns: (a) per-block growth, (b) ||g_L|| floor, (c) drift stability, (d) frozen baseline
-# Rows: ResMLP-d256, ResMLP-d512, ViT-Mini, StudentNet (no LN), CNN (BN, no LN)
-arches = ["ResMLP $d{=}256$\n(terminal LN)", "ResMLP $d{=}512$\n(terminal LN)", "ViT-Mini\n(cls + LN)", "StudentNet\n(no terminal LN)", "CNN BatchNorm\n(no terminal LN)"]
+# Rows: ResMLP-d256, ResMLP-d512, ViT-Mini, no-terminal-LN ResMLP-d256, CNN (BN, no LN)
+arches = ["ResMLP $d{=}256$\n(terminal LN)", "ResMLP $d{=}512$\n(terminal LN)", "ViT-Mini\n(cls + LN)", "ResMLP $d{=}256$\n(no terminal LN)", "CNN BatchNorm\n(no terminal LN)"]
 diags = ["(a) scale", "(b) ${\\|g\\|}$ floor", "(c) drift", "(d) frozen"]

 # DFA verdicts on each
@@ -20,7 +20,7 @@ dfa = np.array([
 [1, 1, 0, 1], # ResMLP d256: (a) fires, (b) fires, (c) noise sub-mode, (d) fires
 [1, 1, 0, 1], # ResMLP d512: same pattern
 [1, 1, 0, 1], # ViT-Mini: same pattern
-[1, 0, 0, 0], # StudentNet: only (a) fires; (b) NEVER
+[1, 0, 0, 0], # ResMLP no-LN: only (a) fires; (b) NEVER
 [1, 0, 0, 0], # CNN BN: only (a) fires; (b) NEVER (the killer (b)-is-LN-specific finding)
 ])

diff --git a/paper/main.pdf b/paper/main.pdf
Binary files differ
index 422ff34..e82f7dd 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
diff --git a/paper/main.tex b/paper/main.tex
index f40cab8..b7547a0 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -87,7 +87,7 @@ Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fix

 We tested this mechanism story against four natural alternative attributions, all of which it survives. \emph{Not residual-skip-driven:} with terminal LN kept and the additive skip removed ($h_{l+1}{=}F_l(h_l)$), DFA still converges across three seeds to mean $\|h_L\|{\approx}8.2{\times}10^{7}$ and mean $\|g_L\|{\approx}1.9{\times}10^{-10}$ at $100$ epochs, both at the diagnostic floor (Appendix~\ref{app:no_residual}). \emph{Not task-signal-driven:} under i.i.d.\ random class targets per minibatch, DFA still reaches $\|h_L\|{\approx}1.67{\times}10^{8}$ and $\|g_L\|{\approx}8{\times}10^{-12}$ while accuracy stays at chance (Appendix~\ref{app:random_targets}). \emph{Not DFA-specific:} the same random-target ablation drives $\|h_L\|$ to $6.2{\times}10^{3}$ for SB and $2.0{\times}10^{4}$ for CB in three epochs, so all three audited fixed-feedback methods exhibit data-agnostic activation growth. \emph{Not shared by EP:} under the same protocol, EP keeps $\|h_L\|{\approx}586$ at five epochs, $25\times$ smaller than DFA's three-epoch value, confirming that the random-target assay separates the explosion-prone fixed-feedback class from EP's energy-based objective.

-The matched same-backbone causal control for diagnostic~(b) is removing terminal LayerNorm. On the same ResMLP-d256 with the residual skip intact, $100$ epochs of DFA, three seeds, the residual stream still inflates to $\|h_L\|\!\approx\!1.21\times 10^7$, but the deepest hidden-layer BP gradient remains at $\|g_L\|\!\approx\!7.2\times 10^{-4}$ (four orders of magnitude above the diagnostic~(b) floor), and the final test accuracy is $0.327\pm 0.012$, statistically indistinguishable from vanilla DFA's $0.306\pm 0.006$ on the same backbone with terminal LayerNorm intact. Removing terminal LayerNorm therefore preserves Mode~1~(a) but cleanly eliminates Mode~1~(b) on the same architecture, while leaving final task accuracy essentially unchanged. Combined with the broader cross-architecture pattern (StudentNet and the BatchNorm CNN, which lack terminal LayerNorm, never trigger diagnostic~(b); ViT-Mini with a terminal LN does, by epochs 2--3 (Figure~\ref{fig:temporal_cross_arch})), terminal LayerNorm is necessary for Mode~1~(b) in the audited residual ResMLP and ViT-Mini setting. The collapse is also not a late-epoch curiosity: $\|g_L\|$ drops from $9.8\times 10^{-4}$ at epoch~0 to $5.8\times 10^{-8}$ by epoch~4 in the three-seed temporal replay (per seed: $6.8$, $6.4$, $4.1\times 10^{-8}$), so the protocol fires within the first $11$ epochs of a $100$-epoch run and is actionable as an early-stop criterion rather than a post hoc explanation. Once measurement degeneracy is identified, the next question is whether poor deep credit remains even before collapse.
+The matched same-backbone causal control for diagnostic~(b) is removing terminal LayerNorm. On the same ResMLP-d256 with the residual skip intact, $100$ epochs of DFA, three seeds, the residual stream still inflates to $\|h_L\|\!\approx\!1.21\times 10^7$, but the deepest hidden-layer BP gradient remains at $\|g_L\|\!\approx\!7.2\times 10^{-4}$ (four orders of magnitude above the diagnostic~(b) floor), and the final test accuracy is $0.327\pm 0.012$, statistically indistinguishable from vanilla DFA's $0.306\pm 0.006$ on the same backbone with terminal LayerNorm intact. Removing terminal LayerNorm therefore preserves Mode~1~(a) but cleanly eliminates Mode~1~(b) on the same architecture, while leaving final task accuracy essentially unchanged. Combined with the broader cross-architecture pattern (the no-terminal-LN ResMLP-d256 ablation and the BatchNorm CNN, which lack terminal LayerNorm, never trigger diagnostic~(b); ViT-Mini with a terminal LN does, by epochs 2--3 (Figure~\ref{fig:temporal_cross_arch})), terminal LayerNorm is necessary for Mode~1~(b) in the audited residual ResMLP and ViT-Mini setting. The collapse is also not a late-epoch curiosity: $\|g_L\|$ drops from $9.8\times 10^{-4}$ at epoch~0 to $5.8\times 10^{-8}$ by epoch~4 in the three-seed temporal replay (per seed: $6.8$, $6.4$, $4.1\times 10^{-8}$), so the protocol fires within the first $11$ epochs of a $100$-epoch run and is actionable as an early-stop criterion rather than a post hoc explanation. Once measurement degeneracy is identified, the next question is whether poor deep credit remains even before collapse.

 \section{Failure Mode 2: Low Intrinsic Credit-Direction Quality}
 \label{sec:mode2}
@@ -127,7 +127,7 @@ Once the reference vector is meaningful again, the deep layers no longer sit exa

 A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty for the same matched $30$-epoch budget: across three seeds, BP falls from $0.585 \pm 0.001$ without the penalty to $0.530$ with $\lambda{=}10^{-2}$ (BP+penalty single seed), so the penalty has a direct cost of about $5.5$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.301 \pm 0.005$ to $0.360 \pm 0.001$, and State Bridge moves further still, from $0.213$ to $0.453 \pm 0.003$, all under the same $30$-epoch intervention (Figure~\ref{fig:penalty_rescue}; Appendix~\ref{app:sb_penalty}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty retains a margin of $+18.1$ points, State Bridge+penalty retains $+10.4$ points, and DFA+penalty retains only $+1.1$ points. The remaining BP-to-DFA gap of $17.0$ points is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The substantially smaller BP-to-State-Bridge gap of $0.530 - 0.453 = 7.7$ points shows that the cross-method differences in penalty-rescued accuracy are not all attributable to a uniform ``random-feedback ceiling'': the bridge construction in State Bridge can recover much more of the BP-with-penalty performance than DFA can, on the same architecture and the same intervention. The residual gap after that control is what keeps Mode~2 substantively alive while letting it have method-dependent severity.

-The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (about $1.5\times 10^4$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). A depth sweep on the $d{=}512$ ResMLP at $L \in \{2,4,6,8,12\}$ shows that the layerwise pattern is essentially depth-invariant: DFA's layer-0 cosine stays in $[+0.38,+0.40]$ across all five depths, while its mean deep-layer cosine stays within $[-0.005,+0.000]$ and its deep perturbation correlation collapses to $0.000$ in every depth tested, even though BP retains a deep-layer cosine of $+0.94$ at $L{=}12$ (Appendix~\ref{app:depth_scan}). The deep credit signal does not improve when the network is shallower, so the failure is not a "too deep" artifact. In the non-terminal-LN controls, the pattern is different: StudentNet shows diagnostic~(a) only at epochs~14--25 while diagnostic~(b) never fires across $100$ epochs and three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. The matched same-backbone ResMLP-d256 ablation in Section~\ref{sec:mode1} supplies the cleanest causal control: removing terminal LayerNorm from the same architecture preserves activation growth but eliminates the gradient floor, so diagnostic~(b) is necessary on terminal-LN ResMLP and is not just an architecture-class coincidence. The broader claim therefore holds at full strength inside the audited residual ResMLP and ViT-Mini regime, while diagnostic~(a) remains useful more broadly. This lets the paper end with a reporting rule rather than an overclaimed theory.
+The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (about $1.5\times 10^4$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). A depth sweep on the $d{=}512$ ResMLP at $L \in \{2,4,6,8,12\}$ shows that the layerwise pattern is essentially depth-invariant: DFA's layer-0 cosine stays in $[+0.38,+0.40]$ across all five depths, while its mean deep-layer cosine stays within $[-0.005,+0.000]$ and its deep perturbation correlation collapses to $0.000$ in every depth tested, even though BP retains a deep-layer cosine of $+0.94$ at $L{=}12$ (Appendix~\ref{app:depth_scan}). The deep credit signal does not improve when the network is shallower, so the failure is not a "too deep" artifact. In the non-terminal-LN controls, the pattern is different: the no-terminal-LN ResMLP-d256 ablation shows diagnostic~(a) firing across three seeds at epochs $\{18, 14, 25\}$ but diagnostic~(b) never fires across $100$ epochs and the same three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. The matched same-backbone ResMLP-d256 ablation in Section~\ref{sec:mode1} supplies the cleanest causal control: removing terminal LayerNorm from the same architecture preserves activation growth but eliminates the gradient floor, so diagnostic~(b) is necessary on terminal-LN ResMLP and is not just an architecture-class coincidence. The broader claim therefore holds at full strength inside the audited residual ResMLP and ViT-Mini regime, while diagnostic~(a) remains useful more broadly. This lets the paper end with a reporting rule rather than an overclaimed theory.

 \begin{figure}[t]
 \centering
@@ -146,7 +146,7 @@ The architecture comparison sharpens the scope of the critique. In the terminal-
 \begin{figure}[t]
 \centering
 \includegraphics[width=0.7\linewidth]{figures/fig5_cross_arch_summary.pdf}
-\caption{Cross-architecture summary over ResMLP, ViT-Mini, StudentNet, and CNN: activation-growth failures recur across architectures, while gradient-floor failures appear in the terminal-normalized settings audited here.}
+\caption{Cross-architecture summary over ResMLP, ViT-Mini, no-terminal-LN ResMLP, and CNN: activation-growth failures recur across architectures, while gradient-floor failures appear in the terminal-normalized settings audited here.}
 \label{fig:cross_arch_summary}
 \end{figure}

@@ -180,7 +180,7 @@ The protocol is conservative in a specific sense: it preserves BP and EP as evid
 \section{Discussion, Limits, Conclusion}
 \label{sec:discussion}

-\looseness=-2 Our claim is about evidence, not impossibility: we show that current FA evaluation practice can misread what happened, not that FA cannot work in deep networks. DFA, SB, and CB all pass status-quo reporting (Table~\ref{tab:main_audit}) but fail the protocol's deep checks, and the Figure~\ref{fig:penalty_rescue} penalty partially rescues credit signal rather than validating headlines. Our strongest claim is scoped to $d{=}256/512$ pre-LayerNorm ResMLPs and ViT-Mini, where both Mode~1 diagnostics fire; StudentNet and the BatchNorm CNN show that activation growth can persist without gradient-floor collapse; the no-terminal-LN control establishes terminal LayerNorm as causally necessary for diagnostic~(b) on residual ResMLP; the dataset is CIFAR-10; and the BP-plus-penalty comparison is a lower bound, not a full decomposition. In the evaluation-methodology line of \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}, FA papers should report BP-reference validity, layerwise credit quality, and a frozen-blocks depth-utilization baseline as separate axes, not a single headline.
+\looseness=-2 Our claim is about evidence, not impossibility: we show that current FA evaluation practice can misread what happened, not that FA cannot work in deep networks. DFA, SB, and CB all pass status-quo reporting (Table~\ref{tab:main_audit}) but fail the protocol's deep checks, and the Figure~\ref{fig:penalty_rescue} penalty partially rescues credit signal rather than validating headlines. Our strongest claim is scoped to $d{=}256/512$ pre-LayerNorm ResMLPs and ViT-Mini, where both Mode~1 diagnostics fire; the no-terminal-LN ResMLP ablation and the BatchNorm CNN show that activation growth can persist without gradient-floor collapse; the no-terminal-LN control establishes terminal LayerNorm as causally necessary for diagnostic~(b) on residual ResMLP; the dataset is CIFAR-10; and the BP-plus-penalty comparison is a lower bound, not a full decomposition. In the evaluation-methodology line of \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}, FA papers should report BP-reference validity, layerwise credit quality, and a frozen-blocks depth-utilization baseline as separate axes, not a single headline.

 \begin{thebibliography}{10}

@@ -373,7 +373,7 @@ Diagnostic (d) is weaker and should be presented that way. Its threshold is best

 The per-architecture appendix should be short and comparative. On pre-LayerNorm ResMLP and ViT-Mini, the key pattern is the same as in the main text: residual-scale growth can become large enough that the deepest BP reference becomes numerically weak, and the status-quo pair of accuracy plus headline $\Gamma$ fails to expose that. These are the settings where both failure modes matter and where the full protocol is most necessary.

-StudentNet and the CNN serve a different role. They test whether the protocol overgeneralizes from terminal-normalized residual architectures to settings where gradient-floor collapse is not expected. In those models, activation-growth checks can still reveal weak depth usage or poor scaling, but diagnostic (b) is not expected to fire in the same way. This asymmetry is not a weakness of the protocol; it is part of the empirical scoping claim of the paper and helps prevent readers from mistaking a targeted evaluation standard for a universal pathology claim \citep{xiong2020layer,bartunov2018assessing}.
+The no-terminal-LN ResMLP ablation and the CNN serve a different role. They test whether the protocol overgeneralizes from terminal-normalized residual architectures to settings where gradient-floor collapse is not expected. In those models, activation-growth checks can still reveal weak depth usage or poor scaling, but diagnostic (b) is not expected to fire in the same way. This asymmetry is not a weakness of the protocol; it is part of the empirical scoping claim of the paper and helps prevent readers from mistaking a targeted evaluation standard for a universal pathology claim \citep{xiong2020layer,bartunov2018assessing}.

 \section{Depth-Sweep Layerwise Profiles}
 \label{app:depth_scan}
