From 8f52b493dbf1a21a762dd2c9e924bbbbebeb911d Mon Sep 17 00:00:00 2001
From: YurenHao0426
Date: Wed, 8 Apr 2026 04:55:52 -0500
Subject: =?UTF-8?q?Round=2028:=20fill=20in=20=C2=A73=20Failure=20Mode=201?= =?UTF-8?q?=20prose=20(4=20paragraphs)=20via=20codex?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 paper/main.tex | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

(limited to 'paper/main.tex')

diff --git a/paper/main.tex b/paper/main.tex
index 3cba99d..19324ec 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -76,13 +76,13 @@ When we compare each method to a frozen-blocks baseline matched to the same arch
 \section{Failure Mode 1: Measurement Degeneracy}
 \label{sec:mode1}

-The first failure mode is measurement degeneracy via terminal-LayerNorm gradient cancellation. % TODO: evidence sentence % TODO: closing sentence
+The first failure mode is a scale pathology, not yet an alignment pathology. On the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA optimizes block-local objectives of the form $\langle f_l(h_l),\, e_T B_l^\top\rangle$ with no explicit scale constraint on $f_l$, so the residual stream is free to inflate while still reducing the local loss \citep{launay2020direct}. In the same runs, each block's $w_1$ and $w_2$ grows by roughly $200\times$ in relative delta, their norm product reaches about $5\times 10^4$ per block, and the terminal hidden-state norm $\|h_L\|$ rises monotonically from about $9$ at random initialization to about $4\times 10^8$ by epoch 100 (Figure~\ref{fig:temporal_cross_arch}). Most of that growth appears immediately: $\|h_L\|$ already reaches about $10^6$ by epoch 5. Once the residual stream reaches this regime, the backpropagation reference vector no longer behaves like a healthy target.

-In this regime, the problem is not merely that FA performs poorly; it is that the BP reference direction used to score FA has itself become numerically non-diagnostic at the deepest hidden layers, so reported cosine values no longer support the scientific claim they are being used to justify. % TODO: evidence sentence % TODO: closing sentence
+The measurement failure occurs at the point where the hidden-layer BP gradient ceases to be a meaningful reference direction. In terminal-LayerNorm architectures, the LayerNorm Jacobian scales as $\partial \mathrm{LN}(h)/\partial h \propto 1/\|h\|$ in expectation, so the same residual-stream inflation is accompanied by collapse of the hidden-layer BP reference norm: on DFA-trained ResMLP, $\|g_L\|$ falls from about $9.8\times 10^{-4}$ at random initialization to about $5\times 10^{-10}$ by epoch 100, a six-order-of-magnitude drop, while the reported cosine remains mathematically defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to an informative BP direction. At that point, reporting a cosine is no longer evidence about credit quality.

-Residual-stream growth provides the practical detector because it is the upstream quantity that makes the terminal normalization step increasingly cancellation-prone in the settings audited here, especially on the pre-LayerNorm residual architectures where the final hidden-state scale is free to drift \citep{xiong2020layer,launay2020direct}. % TODO: evidence sentence % TODO: closing sentence
+The simplest control is architectural, not theoretical. On the same ResMLP backbone, BP keeps $\|h_L\|$ near $200$ and $\|g_L\|$ near $4\times 10^{-4}$ throughout training, while EP keeps $\|h_L\|$ around $5\times 10^3$ and $\|g_L\|$ around $1.3\times 10^{-4}$, so hard optimization on CIFAR-10 by itself does not force hidden-layer gradients to the numerical floor (Table~\ref{tab:main_audit}; Figure~\ref{fig:temporal_cross_arch}). The broader cross-architecture pattern is consistent with the same interpretation: StudentNet and the BatchNorm CNN, which lack terminal LayerNorm, keep deepest BP gradients around $10^{-4}$ and never trigger diagnostic (b), whereas ViT-Mini with a terminal LN shows the same collapse pattern and triggers diagnostic (b) by epochs 2--3 (Figure~\ref{fig:temporal_cross_arch}). The pathology therefore belongs to the evaluated FA regime, not to CIFAR-10 or the backbone alone.

-The main consequence for evaluation is that a positive or weakly positive deep-layer $\Gamma$ can no longer be read as evidence that meaningful credit alignment exists once the reference gradient norm has collapsed to the numerical floor. % TODO: evidence sentence % TODO: closing sentence
+The collapse is not a late-epoch curiosity. For vanilla DFA on the ResMLP temporal replay, $\|g_L\|$ drops from $9.8\times 10^{-4}$ at epoch 0 to $1.4\times 10^{-6}$ at epoch 1, $3.1\times 10^{-7}$ at epoch 2, $1.3\times 10^{-7}$ at epoch 3, and $6.7\times 10^{-8}$ at epoch 4, so diagnostic (b) fires at epoch 3--4 across all three seeds, while the max-per-block growth detector fires slightly later at epochs 8--11 (Figure~\ref{fig:temporal_cross_arch}). Both detectors therefore fire in the first 11 epochs of a 100-epoch run, making the protocol actionable as an early-stop criterion rather than a post hoc explanation. The practical point is reinforced by accuracy: DFA is at $0.308$ already at epoch 4 and ends at $0.306$ by epoch 100, so the remaining training budget adds essentially nothing to the headline result once the measurement has already degenerated. Once measurement degeneracy is identified, the next question is whether poor deep credit remains even before collapse.

 \section{Failure Mode 2: Low Intrinsic Credit-Direction Quality}
 \label{sec:mode2}
-- 
cgit v1.2.3
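
Note on the clamp arithmetic in the second added paragraph (not part of the patch): the claim that a reference norm about $20\times$ below the clamp makes the reported cosine uninformative can be checked with a small sketch. It is pure Python and only emulates the clamped formula quoted in the paragraph, with each norm clamped from below at $\varepsilon = 10^{-8}$. Whether a given PyTorch release clamps per norm or on the product is an assumption here, and the function name, helper, and example vectors are hypothetical.

```python
import math

EPS = 1e-8  # clamp value quoted in the patch text


def clamped_cosine(a, b, eps=EPS):
    """Cosine similarity with each norm clamped from below at eps.

    Assumed form, following the formula quoted in the paragraph; this is
    an emulation, not a call into any library.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (max(norm_a, eps) * max(norm_b, eps))


def rescale(v, target_norm):
    """Return v rescaled to have Euclidean norm target_norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x * target_norm / norm for x in v]


# Hypothetical FA-side vector; the reference below points the same way,
# so an unclamped cosine would be 1 at any scale.
a = [3.0, 4.0]

# Reference norms taken from the paragraph: initialization vs. epoch 100.
b_healthy = rescale(a, 9.8e-4)
b_collapsed = rescale(a, 5e-10)

healthy = clamped_cosine(a, b_healthy)      # clamp inactive
collapsed = clamped_cosine(a, b_collapsed)  # clamp active on the reference norm

print(f"healthy reference:   {healthy:.4f}")
print(f"collapsed reference: {collapsed:.4f}")
```

Under these assumptions the perfectly aligned pair reports roughly $1$ with the healthy reference and roughly $0.05$ with the collapsed one, i.e. the true cosine scaled by $\|b\|/\varepsilon$. The reported value then tracks the reference norm more than the angle.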