author YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 10:22:20 -0500
committer YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 10:22:20 -0500
commit 15845f0226fe5e1f64ec2ab6bd0253d59ae813ce (patch)
tree f4a95d5a1a9f51889a17c6d274a629b32b97ba2d
parent 921d3dec7daa67f16194d7eca7712c4903ce6f1d (diff)
§3 fix: correctly distinguish DFA/SB/CB local credit vectors
Previous §3 ¶1 wrote the local loss as -<f_l, B_l^T e_T> as if it applied to all three of DFA, SB, and CB, but that is only DFA's form. SB and CB use learned bridge networks to derive credit:
- DFA: a_l = B_l^T e_T (fixed random projection)
- State Bridge: a_l = gradient of CE(head(LN(G_psi(h_l, t_l, s))), y), where G_psi is a learned state predictor of h_L
- Credit Bridge: a_l = gradient of learned value net V(h_l, t_l, s)

The fix writes the shared local loss form -<f_l, a_l> and defines a_l for each method in-line. This also serves as the first definition of SB and CB in the paper (previously they were named in Table 1 without being defined).

Main content still ends at p9 (just slightly before the bottom margin now); references span p9-p10 but are not counted against the 9-page content budget. Total 17 pages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
-rw-r--r-- paper/main.pdf bin 480230 -> 481568 bytes
-rw-r--r-- paper/main.tex 2
2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index 8238583..817491d 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index e7aa843..3284e05 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -76,7 +76,7 @@ When we compare each method to a frozen-blocks baseline matched to the same arch
\section{Failure Mode 1: Measurement Degeneracy}
\label{sec:mode1}
-Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge update $f_l$ by reducing a local loss of the form $-\langle f_l(h_l),\, B_l^\top e_T\rangle$, which contains no penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the fixed feedback target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian: in terminal-LN architectures $\partial\mathrm{LN}(h)/\partial h \propto 1/\|h\|$ in expectation, so the same residual-stream inflation is accompanied by collapse of the hidden-layer BP reference norm. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives $\|h_L\|$ from about $9$ at initialization to about $4\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $5\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to a meaningful BP direction.
+Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge each update $f_l$ by reducing a local loss of the form $-\langle f_l(h_l),\, a_l\rangle$, where the per-layer credit vector $a_l$ is the method-specific projection of the output error (for DFA, $a_l = B_l^\top e_T$ with a fixed random $B_l$; for State Bridge, $a_l$ is the gradient of a cross-entropy loss measured through a learned state predictor $G_\psi(h_l,t_l,s)$ that estimates $h_L$; for Credit Bridge, $a_l$ is the gradient of a learned value network $V(h_l,t_l,s)$). None of these three local losses contains a penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the method's fixed or learned credit target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian: in terminal-LN architectures $\partial\mathrm{LN}(h)/\partial h \propto 1/\|h\|$ in expectation, so the same residual-stream inflation is accompanied by collapse of the hidden-layer BP reference norm. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives $\|h_L\|$ from about $9$ at initialization to about $4\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $5\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to a meaningful BP direction.
We tested this mechanism story against four natural alternative attributions, all of which it survives. \emph{Not residual-skip-driven:} on the same ResMLP-d256 with terminal LN kept and the additive skip removed ($h_{l+1}{=}F_l(h_l)$), DFA still inflates $\|h_L\|$ from $\sim\!5$ to $\sim\!2.2{\times}10^{4}$ in three epochs and converges to $\|h_L\|{\approx}1.06{\times}10^{8}$ and $\|g_L\|{\approx}1.09{\times}10^{-10}$ at $100$ epochs, both already at the diagnostic floor (Appendix~\ref{app:no_residual}). \emph{Not task-signal-driven:} replacing labels by i.i.d.\ random class targets refreshed every minibatch on the same backbone, DFA still reaches $\|h_L\|{\approx}1.67{\times}10^{8}$ and $\|g_L\|{\approx}8{\times}10^{-12}$ at $100$ epochs while accuracy stays at chance (Appendix~\ref{app:random_targets}). \emph{Not DFA-specific:} the same random-target ablation also drives $\|h_L\|$ from $9$ to $6.2{\times}10^{3}$ for State Bridge and $2.0{\times}10^{4}$ for Credit Bridge in three epochs, again at chance accuracy, so all three audited fixed-feedback methods exhibit data-agnostic activation growth (Appendix~\ref{app:random_targets}). \emph{Not shared by EP:} under the same random-target protocol, EP keeps $\|h_L\|{\approx}586$ at five epochs of training, $25\times$ smaller than DFA's three-epoch value on the same architecture, consistent with EP's bounded behavior on real labels and confirming that the random-target assay separates the explosion-prone fixed-feedback class from EP's energy-based local objective.
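
Reviewer note: the clamp arithmetic in the changed paragraph can be checked with a small sketch. The code below is not taken from the repository. It implements only the expression quoted in the text, $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$, in plain Python, and makes no claim about how any particular PyTorch version applies its clamp internally. The vectors `a`, `b`, and `b_healthy` and the helper names are hypothetical, chosen so that the reference norm is $5\times 10^{-10}$ as in the reported endpoint.

```python
import math

EPS = 1e-8  # clamp value quoted in the text


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(u, v))


def clamped_cosine(a, b, eps=EPS):
    """The expression quoted in the text: (a . b) / (||a|| * max(||b||, eps))."""
    return dot(a, b) / (norm(a) * max(norm(b), eps))


def true_cosine(a, b):
    """Unclamped cosine similarity, for comparison."""
    return dot(a, b) / (norm(a) * norm(b))


# Hypothetical vectors: b points exactly along a, but its norm is 5e-10,
# which is 20x below the 1e-8 clamp.
a = [3.0, 4.0]
b = [x * 1e-10 for x in a]

# Hypothetical reference with a norm well above the clamp (5e-3).
b_healthy = [x * 1e-3 for x in a]

if __name__ == "__main__":
    print("true cosine (tiny reference):   ", true_cosine(a, b))
    print("clamped cosine (tiny reference):", clamped_cosine(a, b))
    print("clamped cosine (healthy ref):   ", clamped_cosine(a, b_healthy))
```

Under these assumptions, a perfectly aligned pair reports a value scaled down by $\|b\|/\varepsilon = 1/20$, so the reported number tracks the reference norm rather than the direction. When the reference norm is above the clamp, the two functions agree.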