diff options
| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 18:59:33 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 18:59:33 -0500 |
| commit | 6a057a379e58dc464f04e5208861699b01b5d477 (patch) | |
| tree | f91ca912d69c3b887aabc0cb93c22dc10eafb65a /paper | |
| parent | 7c3b3393ca2264c463296144461d2ce6c62e3657 (diff) | |
paper v2.31.14: §3 ¶1 local-loss formula sign-consistency
Original §3 ¶1: "reducing a local loss of the form -<f_l, a_l>"
= minimizing -<f, a>
= maximizing <f, a>
= "alignment with credit target is rewarded" (matches prose)
But "reducing a local loss of the form -X" reads awkwardly because the
"-" sign in front of an inner product looks like a typo or extra
negative. Replaced with the equivalent positive form:
"maximizing a local objective of the form <f_l(h_l), a_l>"
This is mathematically equivalent (both give optimizer direction
∇W in the direction that increases <f, a>) but reads more cleanly.
The §3 ¶1 prose "any direction in which a larger block output
improves inner-product alignment with the method's fixed or learned
credit target is rewarded" now matches the formula directly:
maximizing <f, a> = rewarding alignment.
This relies on the convention e_T = one_hot - softmax (paper a_l is the
negative of the script's a_dfa). The script's positive-sign minimization
of local_loss = (f * a_dfa).sum() is the same operation under that
convention. If the convention is unfamiliar to a reader, the actual
direction (toward larger inner product with a fixed random direction)
is unambiguously stated by the prose.
Page layout preserved: 9 pages main, refs p10, 0 overfull boxes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diffstat (limited to 'paper')
| -rw-r--r-- | paper/main.pdf | bin | 501154 -> 501164 bytes | |||
| -rw-r--r-- | paper/main.tex | 2 |
2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf Binary files differindex 229f118..bd6fa77 100644 --- a/paper/main.pdf +++ b/paper/main.pdf diff --git a/paper/main.tex b/paper/main.tex index 6ba45ce..981d511 100644 --- a/paper/main.tex +++ b/paper/main.tex @@ -83,7 +83,7 @@ When we compare each method to a frozen-blocks baseline matched to the same arch \section{Failure Mode 1: Measurement Degeneracy} \label{sec:mode1} -Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge each update $f_l$ by reducing a local loss of the form $-\langle f_l(h_l),\, a_l\rangle$, where the per-layer credit vector $a_l$ is the method-specific projection of the output error (for DFA, $a_l = B_l^\top e_T$ with a fixed random $B_l$; for State Bridge, $a_l$ is the gradient of a cross-entropy loss measured through a learned state predictor $G_\psi(h_l,t_l,s)$ that estimates $h_L$; for Credit Bridge, $a_l$ is the gradient of a learned value network $V(h_l,t_l,s)$). None of these three local losses contains a penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the method's fixed or learned credit target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian. For $y = \mathrm{LN}(h) = (h - \mu(h))/\sigma(h)$ with $\sigma(h) = \big(\tfrac{1}{d}\sum_i (h_i - \mu(h))^2\big)^{1/2}$ proportional to $\|h\|/\sqrt{d}$, the spectral norm of $\partial y/\partial h$ is $\Theta(1/\sigma(h))$, so back-propagating through terminal LayerNorm scales the deepest hidden BP gradient as $\|g_L\| = \Theta(1/\|h_L\|)$, and the same residual-stream inflation that drives diagnostic~(a) drives a proportional collapse of the diagnostic~(b) reference. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives the three-seed mean $\|h_L\|$ from about $9$ at initialization to about $5\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $4\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to a meaningful BP direction. +Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge each update $f_l$ by maximizing a local objective of the form $\langle f_l(h_l),\, a_l\rangle$, where the per-layer credit vector $a_l$ is the method-specific projection of the output error (for DFA, $a_l = B_l^\top e_T$ with a fixed random $B_l$; for State Bridge, $a_l$ is the gradient of a cross-entropy loss measured through a learned state predictor $G_\psi(h_l,t_l,s)$ that estimates $h_L$; for Credit Bridge, $a_l$ is the gradient of a learned value network $V(h_l,t_l,s)$). None of these three local losses contains a penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the method's fixed or learned credit target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian. For $y = \mathrm{LN}(h) = (h - \mu(h))/\sigma(h)$ with $\sigma(h) = \big(\tfrac{1}{d}\sum_i (h_i - \mu(h))^2\big)^{1/2}$ proportional to $\|h\|/\sqrt{d}$, the spectral norm of $\partial y/\partial h$ is $\Theta(1/\sigma(h))$, so back-propagating through terminal LayerNorm scales the deepest hidden BP gradient as $\|g_L\| = \Theta(1/\|h_L\|)$, and the same residual-stream inflation that drives diagnostic~(a) drives a proportional collapse of the diagnostic~(b) reference. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives the three-seed mean $\|h_L\|$ from about $9$ at initialization to about $5\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $4\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to a meaningful BP direction. We tested this mechanism story against four natural alternative attributions, all of which it survives. \emph{Not residual-skip-driven:} with terminal LN kept and the additive skip removed ($h_{l+1}{=}F_l(h_l)$), DFA still converges across three seeds to mean $\|h_L\|{\approx}8.2{\times}10^{7}$ and mean $\|g_L\|{\approx}1.9{\times}10^{-10}$ at $100$ epochs, both at the diagnostic floor (Appendix~\ref{app:no_residual}). \emph{Not task-signal-driven:} under i.i.d.\ random class targets per minibatch, DFA still reaches $\|h_L\|{\approx}1.67{\times}10^{8}$ and $\|g_L\|{\approx}8{\times}10^{-12}$ while accuracy stays at chance (Appendix~\ref{app:random_targets}). \emph{Not DFA-specific:} the same random-target ablation drives $\|h_L\|$ to $6.2{\times}10^{3}$ for SB and $2.0{\times}10^{4}$ for CB in three epochs, so all three audited fixed-feedback methods exhibit data-agnostic activation growth. \emph{Not shared by EP:} under the same protocol, EP keeps $\|h_L\|{\approx}586$ at five epochs, $25\times$ smaller than DFA's three-epoch value, confirming that the random-target assay separates the explosion-prone fixed-feedback class from EP's energy-based objective. |
