summaryrefslogtreecommitdiff
path: root/paper/main.tex
diff options
context:
space:
mode:
Diffstat (limited to 'paper/main.tex')
-rw-r--r--paper/main.tex88
1 files changed, 75 insertions, 13 deletions
diff --git a/paper/main.tex b/paper/main.tex
index 9b32157..eaea3f7 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -477,6 +477,64 @@ but well above noise. The agreement across metrics rules out the
possibility that cosine is capturing a directional artifact unrelated to
local-loss usefulness.
+\subsection{$\lambda$ sweep: independent dissociation of the two modes}
+\label{sec:lambda-sweep}
+
+The disambiguation experiment of Section~\ref{sec:two-modes} relied on
+vanilla DFA early-epoch checkpoints (epochs 1--2) to measure deep-layer
+cosine while $\|g_L\|$ was still in the meaningful regime. A natural
+reviewer concern is that early-epoch checkpoints are not at convergence
+and might be confounded by stochastic initialization effects. We
+strengthen the disambiguation with an independent control: a sweep over
+the penalty strength $\lambda$ at convergence (30~epochs), with both
+metrics measured on each saved checkpoint.
+
+\begin{table}[h]
+\centering
+\caption{$\lambda$ sweep on the penalty strength, all 30 epochs, seed
+42. The deep-layer cosine and perturbation correlation rise from
+essentially zero at $\lambda{=}10^{-4}$ to small-but-positive at
+$\lambda{=}10^{-2}$, even though diagnostics (a) and (b) already pass
+at $\lambda{=}10^{-4}$.}
+\label{tab:lambda-sweep}
+\begin{tabular}{rrrrrr}
+\toprule
+$\lambda$ & test acc & $\|h_L\|$ & $\|g_L\|$ & deep $\overline{\cos}$ & deep $\overline{\rho}$ \\
+\midrule
+$0$ & $0.308$ & $4.4{\times}10^{8}$ & $5{\times}10^{-10}$ & (degenerate) & (degenerate) \\
+$10^{-4}$ & $0.359$ & $2.4{\times}10^{4}$ & $6.3{\times}10^{-7}$ & $-0.022$ & $-0.004$ \\
+$10^{-2}$ & $0.363$ & $4.0{\times}10^{4}$ & $9.0{\times}10^{-7}$ & $+0.165$ & $+0.091$ \\
+$10^{-1}$ & $0.349$ & $1.2{\times}10^{4}$ & $1.6{\times}10^{-6}$ & $+0.131$ & $+0.067$ \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+\textbf{The killer row is $\lambda{=}10^{-4}$.} At this penalty
+strength, the residual stream is already contained ($\|h_L\| = 2.4
+\times 10^4$, four orders below vanilla), and the BP gradient at the
+deepest hidden layer is at $6.3 \times 10^{-7}$ (well above the
+$10^{-7}$ floor and in the meaningful measurement regime). Diagnostics
+(a) and (b) both pass: \textbf{Mode~1 is fully alleviated}. But the
+deep-layer cosine ($-0.022$) and perturbation correlation ($-0.004$)
+are essentially zero, on both metrics independently. \textbf{Mode~2 is
+not alleviated at all.}
+
+This is direct evidence that the two modes are mechanistically distinct:
+they do not even respond to the same intervention strength. There exists
+a regime ($\lambda{=}10^{-4}$, 30~epochs of training) in which
+Mode~1 is fully alleviated and Mode~2 is unchanged from vanilla, with
+both metrics agreeing.
+
+The threshold for Mode~2 alleviation is somewhere between
+$\lambda{=}10^{-4}$ and $\lambda{=}10^{-2}$. At $\lambda{=}10^{-2}$ the
+penalty is strong enough to alter the optimization trajectory of the
+block parameters (constraining $\|f_l\|$ tightly enough that the
+direction of $f_l$ has to be coordinated more carefully with the local
+credit signal), and the deep-layer alignment rises to $\sim\!+0.16$.
+At $\lambda{=}10^{-1}$ the penalty starts to over-constrain and the
+alignment is slightly lower ($\sim\!+0.13$), giving an inverted-U
+relationship between $\lambda$ and deep alignment.
+
\subsection{Capacity-cost control}
\label{sec:capacity-cost}
@@ -520,19 +578,23 @@ on BP. It is not a perfect isolation of ``credit quality'' in a vacuum
trainers may differ in non-capacity ways), but it is a strong lower bound
on the non-capacity penalty-unexplained gap.
-\subsection{Summary: four validations of the two-mode separation}
-
-Together, the disambiguation experiment, the cross-metric triangulation,
-the capacity-cost control, and the threshold robustness analysis provide
-four independent lines of evidence that the failure of standard FA
-evaluation is not a single phenomenon. Mode~1 (measurement degeneracy)
-is detected by diagnostic (b), is causally controlled by the residual-
-stream penalty, and is specifically associated with terminal-LayerNorm
-architectures in our audits. Mode~2 (low intrinsic credit quality)
-persists after Mode~1 is alleviated, is invisible in vanilla DFA at any
-epoch (because the measurement is degenerate), and is detected by direct
-per-layer cosine in the meaningful regime, with the perturbation
-correlation triangulating the same finding via a different metric.
+\subsection{Summary: five validations of the two-mode separation}
+
+Together, the disambiguation experiment, the $\lambda$ sweep, the
+cross-metric triangulation, the capacity-cost control, and the
+threshold robustness analysis provide five independent lines of
+evidence that the failure of standard FA evaluation is not a single
+phenomenon. Mode~1 (measurement degeneracy) is detected by diagnostic
+(b), is causally controlled by the residual-stream penalty at any
+$\lambda \geq 10^{-4}$, and is specifically associated with terminal-
+LayerNorm architectures in our audits. Mode~2 (low intrinsic credit
+quality) persists after Mode~1 is alleviated at weak penalty
+strengths ($\lambda{=}10^{-4}$), is detected by direct per-layer
+cosine in the meaningful regime, and rises only when the penalty is
+strong enough to alter the optimization trajectory of the deep
+blocks ($\lambda \geq 10^{-2}$). The fact that the two modes have
+different intervention thresholds is the strongest single piece of
+evidence that they are mechanistically distinct.
\section{Limitations}
\label{sec:limitations}