summaryrefslogtreecommitdiff
path: root/paper/main.tex
diff options
context:
space:
mode:
Diffstat (limited to 'paper/main.tex')
-rw-r--r--paper/main.tex9
1 files changed, 8 insertions, 1 deletions
diff --git a/paper/main.tex b/paper/main.tex
index 7b125e4..90a06b3 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -98,7 +98,14 @@ A second metric with different numerical failure modes tells the same story. Cos
Per-layer reporting is therefore not cosmetic. In ResMLP under vanilla DFA, the headline aggregate alignment $\Gamma \approx 0.07$--$0.10$ can look mildly positive only because layer $0$ remains strongly aligned while the deep network is not: at the same epoch-1 checkpoints where layers $1$--$4$ are essentially zero, layer $0$ has cosine $+0.42$, $+0.44$, and $+0.42$ across seeds (Table~\ref{tab:mode_validation}; per-seed values in Appendix~\ref{app:layer0_dominance}). The resulting average can therefore be driven by the embedding layer even when the interior blocks are effectively unaligned, so aggregate reporting obscures the very distinction needed to separate ``measurement collapse'' from ``poor credit direction.'' This layer-$0$ dominance is specific to the ResMLP DFA setting; on ViT-Mini DFA, all layers are near zero, which strengthens the broader methodological point that alignment should be reported per layer rather than only in aggregate. With the two modes separated observationally, the remaining question is whether intervention can move them independently.
-Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same $\lambda{=}10^{-2}$ scale-control penalty to SB, CB, and DFA on the audited 4-block $d{=}256$ ResMLP for $30$ epochs (three seeds) gives, in order, test accuracies $0.453 \pm 0.003$, $0.360 \pm 0.003$, $0.360 \pm 0.001$ and deep mean cosines $+0.322 \pm 0.007$, $+0.679 \pm 0.008$, $+0.151 \pm 0.025$ (deep mean $\rho$ $+0.402$, $+0.464$, $+0.080$ and full $\|h_L\|/\|g_L\|$ in Appendix~\ref{app:sb_penalty}), all in the meaningful-measurement regime. SB+penalty is the first audited non-BP method whose trained deep blocks beat the frozen-blocks baseline ($0.349$), by $+10.4$ pp---comparable to BP+penalty's $+18.3$ pp. Within this rescued regime the three methods reveal a clean cosine-versus-accuracy dissociation, and two independent functional measurements rule out the interpretation that cosine is just noisy. \emph{Nudging:} a single step $\eta{=}0.01$ along each method's per-layer credit $a_l$ at the converged checkpoint changes the deep-block test loss by $-1.93 \pm 0.11 \times 10^{-3}$ (SB+pen), $-4.26 \pm 0.24 \times 10^{-4}$ (CB+pen), and $-4.98 \pm 0.44 \times 10^{-5}$ (DFA+pen) across three seeds (per-seed values in Appendix~\ref{app:sb_penalty}): SB moves the loss $\approx\!4.5\times$ more than CB and $\approx\!39\times$ more than DFA, even though CB has the highest deep cosine with BP. \emph{Training-loss trajectory:} the integrated 30-epoch training loss decrease across three seeds ranks SB ($-0.447 \pm 0.008$) $\gg$ CB ($-0.121 \pm 0.003$) $\approx$ DFA ($-0.095 \pm 0.007$). All three functional metrics (accuracy, nudging, training-loss trajectory) agree on SB $\gg$ CB $\approx$ DFA; the deep-cosine ordering CB $>$ SB $>$ DFA is the only one that disagrees. We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: CB has $4\times$ DFA's deep cosine yet matches DFA's accuracy, while SB attains the best accuracy with intermediate cosine. \emph{Inference}: layerwise cosine is necessary to rule out grossly wrong credit signals (distinguishing the rescued regime from the clamp-dominated vanilla one) but not sufficient to certify usable credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely on the angle to the BP gradient---CB supplies a gradient-direction surrogate that aligns in angle without translating to coordinated forward-state improvement, while SB supplies a state-level teaching signal that preserves aspects of useful credit which cosine does not measure. The same hypothesis casts Mode~1 as a downstream symptom of Mode~2 rather than a parallel failure: when $a_l$ cannot drive useful per-block forward-state change, the only easy way to increase $\langle f_l, a_l\rangle$ is to inflate $\|f_l\|$ along the cheap random direction set by $a_l$, producing Mode~1(a) growth and (via terminal LN) Mode~1(b) collapse; the per-block penalty then breaks the chain by capping $\|f_l\|$ without fixing credit quality, consistent with the observed asymmetry that it alleviates Mode~1 fully but only partially fixes Mode~2. We state this as a hypothesis because we have measured the angle-to-accuracy gap and two functional proxies but not the full per-block forward-state-change content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number.
+Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same $\lambda{=}10^{-2}$ scale-control penalty to SB, CB, and DFA on the audited 4-block $d{=}256$ ResMLP for $30$ epochs (three seeds) gives, in order, test accuracies $0.453 \pm 0.003$, $0.360 \pm 0.003$, $0.360 \pm 0.001$ and deep mean cosines $+0.322 \pm 0.007$, $+0.679 \pm 0.008$, $+0.151 \pm 0.025$ (deep mean $\rho$ $+0.402$, $+0.464$, $+0.080$ and full $\|h_L\|/\|g_L\|$ in Appendix~\ref{app:sb_penalty}), all in the meaningful-measurement regime. SB+penalty is the first audited non-BP method whose trained deep blocks beat the frozen-blocks baseline ($0.349$), by $+10.4$ pp---comparable to BP+penalty's $+18.3$ pp. Within this rescued regime the three methods reveal a clean cosine-versus-accuracy dissociation, and two independent functional measurements rule out the interpretation that cosine is just noisy. \emph{Nudging:} a single step $\eta{=}0.01$ along each method's per-layer credit $a_l$ at the converged checkpoint changes the deep-block test loss by $-1.93 \pm 0.11 \times 10^{-3}$ (SB+pen), $-4.26 \pm 0.24 \times 10^{-4}$ (CB+pen), and $-4.98 \pm 0.44 \times 10^{-5}$ (DFA+pen) across three seeds (per-seed values in Appendix~\ref{app:sb_penalty}): SB moves the loss $\approx\!4.5\times$ more than CB and $\approx\!39\times$ more than DFA, even though CB has the highest deep cosine with BP. \emph{Training-loss trajectory:} the integrated 30-epoch training loss decrease across three seeds ranks SB ($-0.447 \pm 0.008$) $\gg$ CB ($-0.121 \pm 0.003$) $\approx$ DFA ($-0.095 \pm 0.007$). All three functional metrics (accuracy, nudging, training-loss trajectory) agree on SB $\gg$ CB $\approx$ DFA; the deep-cosine ordering CB $>$ SB $>$ DFA is the only one that disagrees (Figure~\ref{fig:cos_acc_dissoc}). We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: CB has $4\times$ DFA's deep cosine yet matches DFA's accuracy, while SB attains the best accuracy with intermediate cosine. \emph{Inference}: layerwise cosine is necessary to rule out grossly wrong credit signals (distinguishing the rescued regime from the clamp-dominated vanilla one) but not sufficient to certify usable credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely on the angle to the BP gradient---CB supplies a gradient-direction surrogate that aligns in angle without translating to coordinated forward-state improvement, while SB supplies a state-level teaching signal that preserves aspects of useful credit which cosine does not measure. The same hypothesis casts Mode~1 as a downstream symptom of Mode~2 rather than a parallel failure: when $a_l$ cannot drive useful per-block forward-state change, the only easy way to increase $\langle f_l, a_l\rangle$ is to inflate $\|f_l\|$ along the cheap random direction set by $a_l$, producing Mode~1(a) growth and (via terminal LN) Mode~1(b) collapse; the per-block penalty then breaks the chain by capping $\|f_l\|$ without fixing credit quality, consistent with the observed asymmetry that it alleviates Mode~1 fully but only partially fixes Mode~2. We state this as a hypothesis because we have measured the angle-to-accuracy gap and two functional proxies but not the full per-block forward-state-change content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number.
+
+\begin{figure}[t]
+\centering
+\includegraphics[width=0.85\linewidth]{figures/fig_cos_acc_dissociation.pdf}
+\caption{Cross-method functional dissociation under matched penalty rescue ($\lambda{=}10^{-2}$, $30$ epochs, $3$ seeds, 4-block $d{=}256$ pre-LayerNorm ResMLP). Each line tracks one method across four metrics, normalized so that the maximum across methods equals $1.0$ in each column; raw values are annotated. Deep cosine to the BP gradient ranks the three methods $\mathrm{CB}{>}\mathrm{SB}{>}\mathrm{DFA}$, but the three functional metrics (test accuracy, single-step nudging-test loss decrease, and integrated 30-epoch training-loss decrease) all rank them $\mathrm{SB}{\gg}\mathrm{CB}{\approx}\mathrm{DFA}$. The X-pattern between deep cos and accuracy is the cross-method cos-versus-accuracy dissociation: SB rises from middle (cos) to top (functional), CB drops from top (cos) to tied with DFA (functional). Deep cosine is the only one of the four metrics that does not predict accuracy.}
+\label{fig:cos_acc_dissoc}
+\end{figure}
\section{Intervention and Cross-Architecture Evidence}
\label{sec:validation}