| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 10:22:20 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 10:22:20 -0500 |
| commit | 15845f0226fe5e1f64ec2ab6bd0253d59ae813ce (patch) | |
| tree | f4a95d5a1a9f51889a17c6d274a629b32b97ba2d | |
| parent | 921d3dec7daa67f16194d7eca7712c4903ce6f1d (diff) | |
§3 fix: correctly distinguish DFA/SB/CB local credit vectors
Previous §3 ¶1 wrote the local loss as -<f_l, B_l^T e_T> as if it applied
to all three of DFA, SB, and CB, but that is only DFA's form. SB and CB use
learned bridge networks to derive credit:
- DFA: a_l = B_l^T e_T (fixed random projection)
- State Bridge: a_l = gradient of CE(head(LN(G_psi(h_l, t_l, s))), y)
where G_psi is a learned state predictor of h_L
- Credit Bridge: a_l = gradient of learned value net V(h_l, t_l, s)
The fix correctly writes the shared local loss form -<f_l, a_l> and
defines a_l for each method in-line. This also serves as the first
definition of SB and CB in the paper (previously they were named in
Table 1 without being defined).
Main content still ends at p9 (just slightly before the bottom margin
now); references span p9-p10 but are not counted against the 9-page
content budget. Total 17 pages.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
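The scale argument behind the shared local loss form -<f_l, a_l> can be illustrated with a minimal pure-Python sketch. The vectors and the helper names `local_loss` and `scaled` are hypothetical and chosen only for illustration; they are not taken from the paper's code. The sketch only shows that the loss is linear in the scale of the block output, so with no norm penalty it keeps decreasing as a positively aligned output grows.

```python
def local_loss(f, a):
    """Shared local loss form -<f, a> for a block output f and credit vector a."""
    return -sum(fi * ai for fi, ai in zip(f, a))


def scaled(f, c):
    """Return the block output f multiplied by the scalar c."""
    return [c * fi for fi in f]


# Hypothetical credit vector and block output (not from the paper).
# Here <f, a> = 0.25 + 0.25 + 0.5 = 1.0, so f is positively aligned with a.
a = [0.5, -1.0, 2.0]
f = [0.5, -0.25, 0.25]

# With no penalty on ||f||, growing the output by a factor c multiplies
# the loss by c, so a larger aligned output always scores better.
losses = [local_loss(scaled(f, c), a) for c in (1.0, 10.0, 100.0)]
for c, loss in zip((1.0, 10.0, 100.0), losses):
    print(f"scale={c:>6}: loss={loss}")
```

The same linearity holds whichever method supplies a_l, which is why the argument does not depend on how the credit vector is derived.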
| -rw-r--r-- | paper/main.pdf | bin | 480230 -> 481568 bytes | |||
| -rw-r--r-- | paper/main.tex | 2 |
2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
Binary files differ
index 8238583..817491d 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
diff --git a/paper/main.tex b/paper/main.tex
index e7aa843..3284e05 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -76,7 +76,7 @@ When we compare each method to a frozen-blocks baseline matched to the same arch
 \section{Failure Mode 1: Measurement Degeneracy}
 \label{sec:mode1}
-Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge update $f_l$ by reducing a local loss of the form $-\langle f_l(h_l),\, B_l^\top e_T\rangle$, which contains no penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the fixed feedback target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian: in terminal-LN architectures $\partial\mathrm{LN}(h)/\partial h \propto 1/\|h\|$ in expectation, so the same residual-stream inflation is accompanied by collapse of the hidden-layer BP reference norm. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives $\|h_L\|$ from about $9$ at initialization to about $4\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $5\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to a meaningful BP direction.
+Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge each update $f_l$ by reducing a local loss of the form $-\langle f_l(h_l),\, a_l\rangle$, where the per-layer credit vector $a_l$ is the method-specific projection of the output error (for DFA, $a_l = B_l^\top e_T$ with a fixed random $B_l$; for State Bridge, $a_l$ is the gradient of a cross-entropy loss measured through a learned state predictor $G_\psi(h_l,t_l,s)$ that estimates $h_L$; for Credit Bridge, $a_l$ is the gradient of a learned value network $V(h_l,t_l,s)$). None of these three local losses contains a penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the method's fixed or learned credit target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian: in terminal-LN architectures $\partial\mathrm{LN}(h)/\partial h \propto 1/\|h\|$ in expectation, so the same residual-stream inflation is accompanied by collapse of the hidden-layer BP reference norm. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives $\|h_L\|$ from about $9$ at initialization to about $4\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $5\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to a meaningful BP direction.
 We tested this mechanism story against four natural alternative attributions, all of which it survives. \emph{Not residual-skip-driven:} on the same ResMLP-d256 with terminal LN kept and the additive skip removed ($h_{l+1}{=}F_l(h_l)$), DFA still inflates $\|h_L\|$ from $\sim\!5$ to $\sim\!2.2{\times}10^{4}$ in three epochs and converges to $\|h_L\|{\approx}1.06{\times}10^{8}$ and $\|g_L\|{\approx}1.09{\times}10^{-10}$ at $100$ epochs, both already at the diagnostic floor (Appendix~\ref{app:no_residual}). \emph{Not task-signal-driven:} replacing labels by i.i.d.\ random class targets refreshed every minibatch on the same backbone, DFA still reaches $\|h_L\|{\approx}1.67{\times}10^{8}$ and $\|g_L\|{\approx}8{\times}10^{-12}$ at $100$ epochs while accuracy stays at chance (Appendix~\ref{app:random_targets}). \emph{Not DFA-specific:} the same random-target ablation also drives $\|h_L\|$ from $9$ to $6.2{\times}10^{3}$ for State Bridge and $2.0{\times}10^{4}$ for Credit Bridge in three epochs, again at chance accuracy, so all three audited fixed-feedback methods exhibit data-agnostic activation growth (Appendix~\ref{app:random_targets}). \emph{Not shared by EP:} under the same random-target protocol, EP keeps $\|h_L\|{\approx}586$ at five epochs of training, $25\times$ smaller than DFA's three-epoch value on the same architecture, consistent with EP's bounded behavior on real labels and confirming that the random-target assay separates the explosion-prone fixed-feedback class from EP's energy-based local objective.
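The clamp effect described in the changed paragraph can be reproduced with a small pure-Python sketch. It implements the formula quoted in the text, (a·b)/(‖a‖ max(‖b‖, 10^{-8})), with the clamp applied to each norm. It is not PyTorch's own implementation, whose exact clamping has differed between versions. The function name `clamped_cosine` and the example vectors are hypothetical. The reference vector is given a norm of 5e-10 to match the reported endpoint value of ‖g_L‖.

```python
import math


def clamped_cosine(a, b, eps=1e-8):
    """Cosine similarity with each norm clamped from below at eps.

    Illustrative stand-in for the formula quoted in the text, not a
    copy of any library's implementation.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (max(norm_a, eps) * max(norm_b, eps))


# Hypothetical, perfectly aligned vectors: the true cosine is 1.
estimate = [3.0, 0.0, 0.0]
healthy_reference = [9.8e-4, 0.0, 0.0]    # norm well above the clamp
collapsed_reference = [5e-10, 0.0, 0.0]   # norm about 20x below the clamp

healthy = clamped_cosine(estimate, healthy_reference)
collapsed = clamped_cosine(estimate, collapsed_reference)

print(f"healthy reference:   {healthy:.4f}")    # close to 1
print(f"collapsed reference: {collapsed:.4f}")  # close to 0.05, i.e. 1/20
```

Under these assumptions the collapsed case reports about 0.05 for two perfectly aligned vectors. The reported value therefore tracks the ratio of the reference norm to the clamp instead of the angle between the vectors.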
