-rw-r--r--  paper/main.pdf  bin 502005 -> 502309 bytes
-rw-r--r--  paper/main.tex  4
2 files changed, 2 insertions, 2 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index 033851c..ea62a49 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index 7dc3abe..013c7fa 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -98,7 +98,7 @@ A second metric with different numerical failure modes tells the same story. Cos
Per-layer reporting is therefore not cosmetic. In ResMLP under vanilla DFA, the headline aggregate alignment $\Gamma \approx 0.07$--$0.10$ can look mildly positive only because layer $0$ remains strongly aligned while the deep network is not: at the same epoch-1 checkpoints where layers $1$--$4$ are essentially zero, layer $0$ has cosine $+0.42$, $+0.44$, and $+0.42$ across seeds (Table~\ref{tab:mode_validation}; per-seed values in Appendix~\ref{app:layer0_dominance}). The resulting average can therefore be driven by the embedding layer even when the interior blocks are effectively unaligned, so aggregate reporting obscures the very distinction needed to separate ``measurement collapse'' from ``poor credit direction.'' This layer-$0$ dominance is specific to the ResMLP DFA setting; on ViT-Mini DFA, all layers are near zero, which strengthens the broader methodological point that alignment should be reported per layer rather than only in aggregate. With the two modes separated observationally, the remaining question is whether intervention can move them independently.
-Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same $\lambda{=}10^{-2}$ scale-control penalty to State Bridge and to Credit Bridge on the audited 4-block $d{=}256$ ResMLP backbone over $30$ epochs and three seeds gives converged test accuracies of $0.453 \pm 0.003$ (SB), $0.360 \pm 0.003$ (CB), and $0.360 \pm 0.001$ (DFA), with deep mean cosines of $+0.322 \pm 0.007$ (SB), $+0.679 \pm 0.008$ (CB), and $+0.151 \pm 0.025$ (DFA), and deep mean $\rho$ of $+0.402$, $+0.464$, and $+0.080$ respectively (Appendix~\ref{app:sb_penalty}). State Bridge+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched frozen-blocks baseline ($0.349$), exceeding it by $+10.4$ pp---comparable to BP+penalty's $+18.3$ pp---while $\|h_L\|$ and $\|g_L\|$ stay in the meaningful-measurement regime ($\|h_L\|{\approx}302$ for SB, $5680$ for CB; $\|g_L\|{\approx}1.8{\times}10^{-4}$ and $1.9{\times}10^{-5}$). Within this rescued regime, the three methods reveal a clean cosine-versus-accuracy dissociation, and two independent functional measurements rule out the interpretation that this is just cosine being a noisy metric. First, taking a single step of size $\eta{=}0.01$ along each method's per-layer credit $a_l$ at the converged checkpoint and measuring the resulting test-loss change averaged over the deep blocks gives, across three seeds, $-1.93 \pm 0.11 \times 10^{-3}$ for SB+penalty, $-4.26 \pm 0.24 \times 10^{-4}$ for CB+penalty, and $-4.98 \pm 0.44 \times 10^{-5}$ for DFA+penalty (per-seed and per-block values in Appendix~\ref{app:sb_penalty}): SB's credit direction moves the loss about $4.5\times$ more than CB's and about $39\times$ more than DFA's, even though CB's direction has the highest cos with the BP gradient. 
-Second, the integrated training loss decrease over the same $30$-epoch trajectories ranks SB ($-0.458$) $\gg$ CB ($-0.122$) $\approx$ DFA ($-0.095 \pm 0.007$, three seeds). All three functional metrics---headline accuracy, single-step nudging, and training-loss trajectory---agree on the ordering SB $\gg$ CB $\approx$ DFA, while the deep-cosine ordering CB $>$ SB $>$ DFA is the only one that disagrees. We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: CB has $4\times$ DFA's deep cosine yet matches DFA's accuracy, while SB attains the best accuracy with intermediate cosine. \emph{Inference}: layerwise cosine is necessary to rule out grossly wrong credit signals (distinguishing the rescued regime from the clamp-dominated vanilla one) but not sufficient to certify usable credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely on the angle to the BP gradient---CB supplies a gradient-direction surrogate that aligns in angle without translating to coordinated forward-state improvement, while SB supplies a state-level teaching signal that preserves aspects of useful credit which cosine does not measure. We state this as a hypothesis because we have measured the angle-to-accuracy gap and two functional proxies but not the full per-block forward-state-change content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number.
+Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same $\lambda{=}10^{-2}$ scale-control penalty to SB, CB, and DFA on the audited 4-block $d{=}256$ ResMLP for $30$ epochs (three seeds) gives, in order, test accuracies $0.453 \pm 0.003$, $0.360 \pm 0.003$, $0.360 \pm 0.001$ and deep mean cosines $+0.322 \pm 0.007$, $+0.679 \pm 0.008$, $+0.151 \pm 0.025$ (deep mean $\rho$ $+0.402$, $+0.464$, $+0.080$ and full $\|h_L\|/\|g_L\|$ in Appendix~\ref{app:sb_penalty}), all in the meaningful-measurement regime. SB+penalty is the first audited non-BP method whose trained deep blocks beat the frozen-blocks baseline ($0.349$), by $+10.4$ pp---comparable to BP+penalty's $+18.3$ pp. Within this rescued regime the three methods reveal a clean cosine-versus-accuracy dissociation, and two independent functional measurements rule out the interpretation that cosine is just noisy. \emph{Nudging:} a single step $\eta{=}0.01$ along each method's per-layer credit $a_l$ at the converged checkpoint changes the deep-block test loss by $-1.93 \pm 0.11 \times 10^{-3}$ (SB+pen), $-4.26 \pm 0.24 \times 10^{-4}$ (CB+pen), and $-4.98 \pm 0.44 \times 10^{-5}$ (DFA+pen) across three seeds (per-seed values in Appendix~\ref{app:sb_penalty}): SB moves the loss $\approx\!4.5\times$ more than CB and $\approx\!39\times$ more than DFA, even though CB has the highest deep cosine with BP. \emph{Training-loss trajectory:} the integrated 30-epoch training loss decrease ranks SB ($-0.458$) $\gg$ CB ($-0.122$) $\approx$ DFA ($-0.095 \pm 0.007$). All three functional metrics (accuracy, nudging, training-loss trajectory) agree on SB $\gg$ CB $\approx$ DFA; the deep-cosine ordering CB $>$ SB $>$ DFA is the only one that disagrees. We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: CB has $4\times$ DFA's deep cosine yet matches DFA's accuracy, while SB attains the best accuracy with intermediate cosine. 
+\emph{Inference}: layerwise cosine is necessary to rule out grossly wrong credit signals (distinguishing the rescued regime from the clamp-dominated vanilla one) but not sufficient to certify usable credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely on the angle to the BP gradient---CB supplies a gradient-direction surrogate that aligns in angle without translating to coordinated forward-state improvement, while SB supplies a state-level teaching signal that preserves aspects of useful credit which cosine does not measure. The same hypothesis casts Mode~1 as a downstream symptom of Mode~2 rather than a parallel failure: when $a_l$ cannot drive useful per-block forward-state change, the only easy way to increase $\langle f_l, a_l\rangle$ is to inflate $\|f_l\|$ along the cheap random direction set by $a_l$, producing Mode~1(a) growth and (via terminal LN) Mode~1(b) collapse; the per-block penalty then breaks the chain by capping $\|f_l\|$ without fixing credit quality, consistent with the observed asymmetry that it alleviates Mode~1 fully but only partially fixes Mode~2. We state this as a hypothesis because we have measured the angle-to-accuracy gap and two functional proxies but not the full per-block forward-state-change content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number.
\section{Intervention and Cross-Architecture Evidence}
\label{sec:validation}
@@ -180,7 +180,7 @@ The protocol is conservative in a specific sense: it preserves BP and EP as evid
\section{Discussion, Limits, Conclusion}
\label{sec:discussion}
-\looseness=-2 Our claim is about evidence, not impossibility: we show that current FA evaluation practice can misread what happened, not that FA cannot work in deep networks. DFA, SB, and CB all pass status-quo reporting (Table~\ref{tab:main_audit}) but fail the protocol's deep checks, and the Figure~\ref{fig:penalty_rescue} penalty partially rescues credit signal rather than validating headlines. Our strongest claim is scoped to $d{=}256/512$ pre-LayerNorm ResMLPs and ViT-Mini, where both Mode~1 diagnostics fire; the no-terminal-LN ResMLP ablation and the BatchNorm CNN show that activation growth can persist without gradient-floor collapse; the no-terminal-LN control establishes terminal LayerNorm as causally necessary for diagnostic~(b) on residual ResMLP; the dataset is CIFAR-10; and the BP-plus-penalty comparison is a lower bound, not a full decomposition. In the evaluation-methodology line of \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}, FA papers should report BP-reference validity, layerwise credit quality, and a frozen-blocks depth-utilization baseline as separate axes, not a single headline.
+\looseness=-2 Our claim is about evidence, not impossibility: we show that current FA evaluation practice can misread what happened, not that FA cannot work in deep networks. DFA, SB, and CB all pass status-quo reporting (Table~\ref{tab:main_audit}) but fail the protocol's deep checks, and the Figure~\ref{fig:penalty_rescue} penalty partially rescues credit signal rather than validating headlines. Our strongest claim is scoped to $d{=}256/512$ pre-LayerNorm ResMLPs and ViT-Mini, where both Mode~1 diagnostics fire; the no-terminal-LN ResMLP ablation establishes terminal LayerNorm as causally necessary for diagnostic~(b) on residual ResMLP and (with the BatchNorm CNN) shows that activation growth can persist without gradient-floor collapse; the dataset is CIFAR-10; and the BP-plus-penalty comparison is a lower bound, not a full decomposition. In the evaluation-methodology line of \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}, FA papers should report BP-reference validity, layerwise credit quality, and a frozen-blocks depth-utilization baseline as separate axes, not a single headline.
\begin{thebibliography}{10}