diff options
| -rw-r--r-- | paper/main.tex | 88 |
1 files changed, 75 insertions, 13 deletions
diff --git a/paper/main.tex b/paper/main.tex index 9b32157..eaea3f7 100644 --- a/paper/main.tex +++ b/paper/main.tex @@ -477,6 +477,64 @@ but well above noise. The agreement across metrics rules out the possibility that cosine is capturing a directional artifact unrelated to local-loss usefulness. +\subsection{$\lambda$ sweep: independent dissociation of the two modes} +\label{sec:lambda-sweep} + +The disambiguation experiment of Section~\ref{sec:two-modes} relied on +vanilla DFA early-epoch checkpoints (epochs 1--2) to measure deep-layer +cosine while $\|g_L\|$ was still in the meaningful regime. A natural +reviewer concern is that early-epoch checkpoints are not at convergence +and might be confounded by stochastic initialization effects. We +strengthen the disambiguation with an independent control: a sweep over +the penalty strength $\lambda$ at convergence (30~epochs), with both +metrics measured on each saved checkpoint. + +\begin{table}[h] +\centering +\caption{$\lambda$ sweep on the penalty strength, all 30 epochs, seed +42. The deep-layer cosine and perturbation correlation rise from +essentially zero at $\lambda{=}10^{-4}$ to small-but-positive at +$\lambda{=}10^{-2}$, even though diagnostics (a) and (b) already pass +at $\lambda{=}10^{-4}$.} +\label{tab:lambda-sweep} +\begin{tabular}{rrrrrr} +\toprule +$\lambda$ & test acc & $\|h_L\|$ & $\|g_L\|$ & deep $\overline{\cos}$ & deep $\overline{\rho}$ \\ +\midrule +$0$ & $0.308$ & $4.4{\times}10^{8}$ & $5{\times}10^{-10}$ & (degenerate) & (degenerate) \\ +$10^{-4}$ & $0.359$ & $2.4{\times}10^{4}$ & $6.3{\times}10^{-7}$ & $-0.022$ & $-0.004$ \\ +$10^{-2}$ & $0.363$ & $4.0{\times}10^{4}$ & $9.0{\times}10^{-7}$ & $+0.165$ & $+0.091$ \\ +$10^{-1}$ & $0.349$ & $1.2{\times}10^{4}$ & $1.6{\times}10^{-6}$ & $+0.131$ & $+0.067$ \\ +\bottomrule +\end{tabular} +\end{table} + +\textbf{The killer row is $\lambda{=}10^{-4}$.} At this penalty +strength, the residual stream is already contained ($\|h_L\| = 2.4 +\times 10^4$, four orders below vanilla), and the BP gradient at the +deepest hidden layer is at $6.3 \times 10^{-7}$ (well above the +$10^{-7}$ floor and in the meaningful measurement regime). Diagnostics +(a) and (b) both pass: \textbf{Mode~1 is fully alleviated}. But the +deep-layer cosine ($-0.022$) and perturbation correlation ($-0.004$) +are essentially zero, on both metrics independently. \textbf{Mode~2 is +not alleviated at all.} + +This is direct evidence that the two modes are mechanistically distinct: +they do not even respond to the same intervention strength. There exists +a regime ($\lambda{=}10^{-4}$, 30~epochs of training) in which +Mode~1 is fully alleviated and Mode~2 is unchanged from vanilla, with +both metrics agreeing. + +The threshold for Mode~2 alleviation is somewhere between +$\lambda{=}10^{-4}$ and $\lambda{=}10^{-2}$. At $\lambda{=}10^{-2}$ the +penalty is strong enough to alter the optimization trajectory of the +block parameters (constraining $\|f_l\|$ tightly enough that the +direction of $f_l$ has to be coordinated more carefully with the local +credit signal), and the deep-layer alignment rises to $\sim\!+0.16$. +At $\lambda{=}10^{-1}$ the penalty starts to over-constrain and the +alignment is slightly lower ($\sim\!+0.13$), giving an inverted-U +relationship between $\lambda$ and deep alignment. + \subsection{Capacity-cost control} \label{sec:capacity-cost} @@ -520,19 +578,23 @@ on BP. It is not a perfect isolation of ``credit quality'' in a vacuum trainers may differ in non-capacity ways), but it is a strong lower bound on the non-capacity penalty-unexplained gap. -\subsection{Summary: four validations of the two-mode separation} - -Together, the disambiguation experiment, the cross-metric triangulation, -the capacity-cost control, and the threshold robustness analysis provide -four independent lines of evidence that the failure of standard FA -evaluation is not a single phenomenon. Mode~1 (measurement degeneracy) -is detected by diagnostic (b), is causally controlled by the residual- -stream penalty, and is specifically associated with terminal-LayerNorm -architectures in our audits. Mode~2 (low intrinsic credit quality) -persists after Mode~1 is alleviated, is invisible in vanilla DFA at any -epoch (because the measurement is degenerate), and is detected by direct -per-layer cosine in the meaningful regime, with the perturbation -correlation triangulating the same finding via a different metric. +\subsection{Summary: five validations of the two-mode separation} + +Together, the disambiguation experiment, the $\lambda$ sweep, the +cross-metric triangulation, the capacity-cost control, and the +threshold robustness analysis provide five independent lines of +evidence that the failure of standard FA evaluation is not a single +phenomenon. Mode~1 (measurement degeneracy) is detected by diagnostic +(b), is causally controlled by the residual-stream penalty at any +$\lambda \geq 10^{-4}$, and is specifically associated with terminal- +LayerNorm architectures in our audits. Mode~2 (low intrinsic credit +quality) persists after Mode~1 is alleviated at weak penalty +strengths ($\lambda{=}10^{-4}$), is detected by direct per-layer +cosine in the meaningful regime, and rises only when the penalty is +strong enough to alter the optimization trajectory of the deep +blocks ($\lambda \geq 10^{-2}$). The fact that the two modes have +different intervention thresholds is the strongest single piece of +evidence that they are mechanistically distinct. \section{Limitations} \label{sec:limitations} |
