diff options
| -rw-r--r-- | paper/main.pdf | bin | 500969 -> 500979 bytes | |||
| -rw-r--r-- | paper/main.tex | 2 |
2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf Binary files differindex 84667b1..157ab24 100644 --- a/paper/main.pdf +++ b/paper/main.pdf diff --git a/paper/main.tex b/paper/main.tex index ada802d..7359634 100644 --- a/paper/main.tex +++ b/paper/main.tex @@ -85,7 +85,7 @@ When we compare each method to a frozen-blocks baseline matched to the same arch Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge each update $f_l$ by reducing a local loss of the form $-\langle f_l(h_l),\, a_l\rangle$, where the per-layer credit vector $a_l$ is the method-specific projection of the output error (for DFA, $a_l = B_l^\top e_T$ with a fixed random $B_l$; for State Bridge, $a_l$ is the gradient of a cross-entropy loss measured through a learned state predictor $G_\psi(h_l,t_l,s)$ that estimates $h_L$; for Credit Bridge, $a_l$ is the gradient of a learned value network $V(h_l,t_l,s)$). None of these three local losses contains a penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the method's fixed or learned credit target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian. For $y = \mathrm{LN}(h) = (h - \mu(h))/\sigma(h)$ with $\sigma(h) = \big(\tfrac{1}{d}\sum_i (h_i - \mu(h))^2\big)^{1/2}$ proportional to $\|h\|/\sqrt{d}$, the spectral norm of $\partial y/\partial h$ is $\Theta(1/\sigma(h))$, so back-propagating through terminal LayerNorm scales the deepest hidden BP gradient as $\|g_L\| = \Theta(1/\|h_L\|)$, and the same residual-stream inflation that drives diagnostic~(a) drives a proportional collapse of the diagnostic~(b) reference. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives the three-seed mean $\|h_L\|$ from about $9$ at initialization to about $5\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $4\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to a meaningful BP direction. -We tested this mechanism story against four natural alternative attributions, all of which it survives. \emph{Not residual-skip-driven:} with terminal LN kept and the additive skip removed ($h_{l+1}{=}F_l(h_l)$), DFA still converges to $\|h_L\|{\approx}1.06{\times}10^{8}$ and $\|g_L\|{\approx}1.09{\times}10^{-10}$ at $100$ epochs, both at the diagnostic floor (Appendix~\ref{app:no_residual}). \emph{Not task-signal-driven:} under i.i.d.\ random class targets per minibatch, DFA still reaches $\|h_L\|{\approx}1.67{\times}10^{8}$ and $\|g_L\|{\approx}8{\times}10^{-12}$ while accuracy stays at chance (Appendix~\ref{app:random_targets}). \emph{Not DFA-specific:} the same random-target ablation drives $\|h_L\|$ to $6.2{\times}10^{3}$ for SB and $2.0{\times}10^{4}$ for CB in three epochs, so all three audited fixed-feedback methods exhibit data-agnostic activation growth. \emph{Not shared by EP:} under the same protocol, EP keeps $\|h_L\|{\approx}586$ at five epochs, $25\times$ smaller than DFA's three-epoch value, confirming that the random-target assay separates the explosion-prone fixed-feedback class from EP's energy-based objective. +We tested this mechanism story against four natural alternative attributions, all of which it survives. \emph{Not residual-skip-driven:} with terminal LN kept and the additive skip removed ($h_{l+1}{=}F_l(h_l)$), DFA still converges across three seeds to mean $\|h_L\|{\approx}8.2{\times}10^{7}$ and mean $\|g_L\|{\approx}1.9{\times}10^{-10}$ at $100$ epochs, both at the diagnostic floor (Appendix~\ref{app:no_residual}). \emph{Not task-signal-driven:} under i.i.d.\ random class targets per minibatch, DFA still reaches $\|h_L\|{\approx}1.67{\times}10^{8}$ and $\|g_L\|{\approx}8{\times}10^{-12}$ while accuracy stays at chance (Appendix~\ref{app:random_targets}). \emph{Not DFA-specific:} the same random-target ablation drives $\|h_L\|$ to $6.2{\times}10^{3}$ for SB and $2.0{\times}10^{4}$ for CB in three epochs, so all three audited fixed-feedback methods exhibit data-agnostic activation growth. \emph{Not shared by EP:} under the same protocol, EP keeps $\|h_L\|{\approx}586$ at five epochs, $25\times$ smaller than DFA's three-epoch value, confirming that the random-target assay separates the explosion-prone fixed-feedback class from EP's energy-based objective. The matched same-backbone causal control for diagnostic~(b) is removing terminal LayerNorm. On the same ResMLP-d256 with the residual skip intact, $100$ epochs of DFA, three seeds, the residual stream still inflates to $\|h_L\|\!\approx\!1.21\times 10^7$, but the deepest hidden-layer BP gradient remains at $\|g_L\|\!\approx\!7.2\times 10^{-4}$ (four orders of magnitude above the diagnostic~(b) floor), and the final test accuracy is $0.327\pm 0.012$, statistically indistinguishable from vanilla DFA's $0.306\pm 0.006$ on the same backbone with terminal LayerNorm intact. Removing terminal LayerNorm therefore preserves Mode~1~(a) but cleanly eliminates Mode~1~(b) on the same architecture, while leaving final task accuracy essentially unchanged. Combined with the broader cross-architecture pattern (StudentNet and the BatchNorm CNN, which lack terminal LayerNorm, never trigger diagnostic~(b); ViT-Mini with a terminal LN does, by epochs 2--3 (Figure~\ref{fig:temporal_cross_arch})), terminal LayerNorm is necessary for Mode~1~(b) in the audited residual ResMLP and ViT-Mini setting. The collapse is also not a late-epoch curiosity: $\|g_L\|$ drops from $9.8\times 10^{-4}$ at epoch~0 to $5.8\times 10^{-8}$ by epoch~4 in the three-seed temporal replay (per seed: $6.8$, $6.4$, $4.1\times 10^{-8}$), so the protocol fires within the first $11$ epochs of a $100$-epoch run and is actionable as an early-stop criterion rather than a post hoc explanation. Once measurement degeneracy is identified, the next question is whether poor deep credit remains even before collapse. |
