author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 09:55:27 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 09:55:27 -0500
commit: 921d3dec7daa67f16194d7eca7712c4903ce6f1d (patch)
tree: 62601790621652412c4030a3b463d83ed777ebb4
parent: ef2dbf9e90ffa1740f6bd3a6eebc47c124a36c2f (diff)
Polish: fix §3 typo 'to an meaningful' -> 'to a meaningful'
-rw-r--r-- paper/main.pdf bin 480228 -> 480230 bytes
-rw-r--r-- paper/main.tex 2
2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index a264800..8238583 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index 9da868e..e7aa843 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -76,7 +76,7 @@ When we compare each method to a frozen-blocks baseline matched to the same arch
\section{Failure Mode 1: Measurement Degeneracy}
\label{sec:mode1}
-Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge update $f_l$ by reducing a local loss of the form $-\langle f_l(h_l),\, B_l^\top e_T\rangle$, which contains no penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the fixed feedback target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian: in terminal-LN architectures $\partial\mathrm{LN}(h)/\partial h \propto 1/\|h\|$ in expectation, so the same residual-stream inflation is accompanied by collapse of the hidden-layer BP reference norm. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives $\|h_L\|$ from about $9$ at initialization to about $4\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $5\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to an meaningful BP direction.
+Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge update $f_l$ by reducing a local loss of the form $-\langle f_l(h_l),\, B_l^\top e_T\rangle$, which contains no penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the fixed feedback target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian: in terminal-LN architectures $\partial\mathrm{LN}(h)/\partial h \propto 1/\|h\|$ in expectation, so the same residual-stream inflation is accompanied by collapse of the hidden-layer BP reference norm. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives $\|h_L\|$ from about $9$ at initialization to about $4\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $5\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to a meaningful BP direction.
We tested this mechanism story against four natural alternative attributions, all of which it survives. \emph{Not residual-skip-driven:} on the same ResMLP-d256 with terminal LN kept and the additive skip removed ($h_{l+1}{=}F_l(h_l)$), DFA still inflates $\|h_L\|$ from $\sim\!5$ to $\sim\!2.2{\times}10^{4}$ in three epochs and converges to $\|h_L\|{\approx}1.06{\times}10^{8}$ and $\|g_L\|{\approx}1.09{\times}10^{-10}$ at $100$ epochs, both already at the diagnostic floor (Appendix~\ref{app:no_residual}). \emph{Not task-signal-driven:} replacing labels by i.i.d.\ random class targets refreshed every minibatch on the same backbone, DFA still reaches $\|h_L\|{\approx}1.67{\times}10^{8}$ and $\|g_L\|{\approx}8{\times}10^{-12}$ at $100$ epochs while accuracy stays at chance (Appendix~\ref{app:random_targets}). \emph{Not DFA-specific:} the same random-target ablation also drives $\|h_L\|$ from $9$ to $6.2{\times}10^{3}$ for State Bridge and $2.0{\times}10^{4}$ for Credit Bridge in three epochs, again at chance accuracy, so all three audited fixed-feedback methods exhibit data-agnostic activation growth (Appendix~\ref{app:random_targets}). \emph{Not shared by EP:} under the same random-target protocol, EP keeps $\|h_L\|{\approx}586$ at five epochs of training, $25\times$ smaller than DFA's three-epoch value on the same architecture, consistent with EP's bounded behavior on real labels and confirming that the random-target assay separates the explosion-prone fixed-feedback class from EP's energy-based local objective.