author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 18:19:52 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 18:19:52 -0500
commit: c2228c43809fd1d634d98dfcecc1ee72881af639
tree: 5183e548f0079aedb01d34a275dd7d24b63fb32d /paper
parent: 60947156c4e66d801d043b484ce8bda5314deab0
paper v2.31.5: §3 ¶3 ep-4 g_L from 3-seed (was single-seed g_2)

Paper claimed "‖g_L‖ drops from 9.8×10^-4 at ep 0 to 6.7×10^-8 by ep 4 in the temporal replay across three seeds". The 9.8×10^-4 is the 3-seed mean of g_L at ep 0 (correct). But the 6.7×10^-8 was the s42 single-seed g_2 value (6.73×10^-8) at ep 4, not g_L and not 3-seed.

The actual 3-seed g_L means from results/snapshot_evolution_v2/:

  ep 0: 9.83, 9.74, 9.87 × 10^-4 → mean 9.81 ≈ 9.8e-4 ✓
  ep 4: 6.82, 6.37, 4.12 × 10^-8 → mean 5.77 ≈ 5.8e-8 (was 6.7e-8)

Updated to 5.8×10^-8 with the per-seed values shown for transparency. The "fires within 11 epochs" actionable-early-stop conclusion is unchanged: all three seeds are well below the 1e-7 floor by ep 4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

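The arithmetic behind the corrected figure can be re-checked with a short script. This is only a sketch: it uses the per-seed values and the 1e-7 floor exactly as quoted in the message above, does not read anything from results/snapshot_evolution_v2/, and the variable names (ep0, ep4, FLOOR) are illustrative rather than taken from the repository.

```python
# Per-seed ||g_L|| values as quoted in the commit message (hypothetical names).
ep0 = [9.83e-4, 9.74e-4, 9.87e-4]   # epoch 0, three seeds
ep4 = [6.82e-8, 6.37e-8, 4.12e-8]   # epoch 4, three seeds
FLOOR = 1e-7                        # diagnostic floor quoted in the message


def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


mean_ep0 = mean(ep0)
mean_ep4 = mean(ep4)

# The early-stop conclusion relies on every seed being below the floor,
# not just the mean, so check the seeds individually.
all_below_floor = all(v < FLOOR for v in ep4)

print(f"ep 0 mean: {mean_ep0:.2e}")
print(f"ep 4 mean: {mean_ep4:.2e}")
print(f"all seeds below floor at ep 4: {all_below_floor}")
```

Rounded to two significant figures, the two means correspond to the 9.8×10^-4 and 5.8×10^-8 values that appear in the updated paragraph of main.tex.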
Diffstat (limited to 'paper')
-rw-r--r--  paper/main.pdf  bin 500922 -> 500969 bytes
-rw-r--r--  paper/main.tex  2
2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index 1174ab6..84667b1 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index 1d835c0..ada802d 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -87,7 +87,7 @@ Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fix
We tested this mechanism story against four natural alternative attributions, all of which it survives. \emph{Not residual-skip-driven:} with terminal LN kept and the additive skip removed ($h_{l+1}{=}F_l(h_l)$), DFA still converges to $\|h_L\|{\approx}1.06{\times}10^{8}$ and $\|g_L\|{\approx}1.09{\times}10^{-10}$ at $100$ epochs, both at the diagnostic floor (Appendix~\ref{app:no_residual}). \emph{Not task-signal-driven:} under i.i.d.\ random class targets per minibatch, DFA still reaches $\|h_L\|{\approx}1.67{\times}10^{8}$ and $\|g_L\|{\approx}8{\times}10^{-12}$ while accuracy stays at chance (Appendix~\ref{app:random_targets}). \emph{Not DFA-specific:} the same random-target ablation drives $\|h_L\|$ to $6.2{\times}10^{3}$ for SB and $2.0{\times}10^{4}$ for CB in three epochs, so all three audited fixed-feedback methods exhibit data-agnostic activation growth. \emph{Not shared by EP:} under the same protocol, EP keeps $\|h_L\|{\approx}586$ at five epochs, $25\times$ smaller than DFA's three-epoch value, confirming that the random-target assay separates the explosion-prone fixed-feedback class from EP's energy-based objective.
-The matched same-backbone causal control for diagnostic~(b) is removing terminal LayerNorm. On the same ResMLP-d256 with the residual skip intact, $100$ epochs of DFA, three seeds, the residual stream still inflates to $\|h_L\|\!\approx\!1.21\times 10^7$, but the deepest hidden-layer BP gradient remains at $\|g_L\|\!\approx\!7.2\times 10^{-4}$ (four orders of magnitude above the diagnostic~(b) floor), and the final test accuracy is $0.327\pm 0.012$, statistically indistinguishable from vanilla DFA's $0.306\pm 0.006$ on the same backbone with terminal LayerNorm intact. Removing terminal LayerNorm therefore preserves Mode~1~(a) but cleanly eliminates Mode~1~(b) on the same architecture, while leaving final task accuracy essentially unchanged. Combined with the broader cross-architecture pattern (StudentNet and the BatchNorm CNN, which lack terminal LayerNorm, never trigger diagnostic~(b); ViT-Mini with a terminal LN does, by epochs 2--3 (Figure~\ref{fig:temporal_cross_arch})), terminal LayerNorm is necessary for Mode~1~(b) in the audited residual ResMLP and ViT-Mini setting. The collapse is also not a late-epoch curiosity: $\|g_L\|$ drops from $9.8\times 10^{-4}$ at epoch~0 to $6.7\times 10^{-8}$ by epoch~4 in the temporal replay across three seeds, so the protocol fires within the first $11$ epochs of a $100$-epoch run and is actionable as an early-stop criterion rather than a post hoc explanation. Once measurement degeneracy is identified, the next question is whether poor deep credit remains even before collapse.
+The matched same-backbone causal control for diagnostic~(b) is removing terminal LayerNorm. On the same ResMLP-d256 with the residual skip intact, $100$ epochs of DFA, three seeds, the residual stream still inflates to $\|h_L\|\!\approx\!1.21\times 10^7$, but the deepest hidden-layer BP gradient remains at $\|g_L\|\!\approx\!7.2\times 10^{-4}$ (four orders of magnitude above the diagnostic~(b) floor), and the final test accuracy is $0.327\pm 0.012$, statistically indistinguishable from vanilla DFA's $0.306\pm 0.006$ on the same backbone with terminal LayerNorm intact. Removing terminal LayerNorm therefore preserves Mode~1~(a) but cleanly eliminates Mode~1~(b) on the same architecture, while leaving final task accuracy essentially unchanged. Combined with the broader cross-architecture pattern (StudentNet and the BatchNorm CNN, which lack terminal LayerNorm, never trigger diagnostic~(b); ViT-Mini with a terminal LN does, by epochs 2--3 (Figure~\ref{fig:temporal_cross_arch})), terminal LayerNorm is necessary for Mode~1~(b) in the audited residual ResMLP and ViT-Mini setting. The collapse is also not a late-epoch curiosity: $\|g_L\|$ drops from $9.8\times 10^{-4}$ at epoch~0 to $5.8\times 10^{-8}$ by epoch~4 in the three-seed temporal replay (per seed: $6.8$, $6.4$, $4.1\times 10^{-8}$), so the protocol fires within the first $11$ epochs of a $100$-epoch run and is actionable as an early-stop criterion rather than a post hoc explanation. Once measurement degeneracy is identified, the next question is whether poor deep credit remains even before collapse.
\section{Failure Mode 2: Low Intrinsic Credit-Direction Quality}
\label{sec:mode2}