From aa12974e22de1887b636219096a02c44c595dcf7 Mon Sep 17 00:00:00 2001
From: YurenHao0426
Date: Wed, 8 Apr 2026 07:51:21 -0500
Subject: =?UTF-8?q?Appendix=20H:=20H2=20ablation=20now=203-seed=20(s42,=20?=
 =?UTF-8?q?s123,=20s456)=20=E2=80=94=20multi-seed=20validates=20falsificat?=
 =?UTF-8?q?ion?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

3-seed mean: ||h_L||=8.2e7, ||g_L||=1.9e-10
Per-seed: ||h_L|| in {1.06e8, 3.15e7, 1.09e8}, ||g_L|| in {1.08, 2.94, 1.77}e-10
All deeply below the (b) floor, all confirm Mode 1 (a)+(b) fire on no-residual ResMLP+terminal-LN.
Multi-seed H2 falsification of 'residual skip causes Mode 1' is now robust.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 paper/main.pdf | Bin 466460 -> 466667 bytes
 paper/main.tex | 2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)

(limited to 'paper')

diff --git a/paper/main.pdf b/paper/main.pdf
index bd881ce..3be679b 100644
Binary files a/paper/main.pdf and b/paper/main.pdf differ
diff --git a/paper/main.tex b/paper/main.tex
index d58e53a..34efb5b 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -408,7 +408,7 @@ DFA & $0.5$ & $3$ & $22{,}050$ & $1.6\times 10^{-7}$ & $0.148$ & $0.039$ \\
 \end{tabular}
 \end{table}
-The qualitative shape matches what we see in vanilla residual DFA, only with a slower onset because the architecture itself is harder to train. Diagnostic~(a) clearly fires within three epochs, and diagnostic~(b) is already on the floor side of $10^{-7}$. Across $w_2$ std values $\{0.1,0.2,0.5\}$ that we tried in the same smoke sweep, the qualitative outcome is the same: residual stream grows by three to four orders of magnitude, $\|g_L\|$ drops by three to four orders of magnitude, and BP itself never reaches a healthy training regime. We retain $w_2{=}0.5$ here because that is the only value where BP is at least beginning to learn. 
The full $100$-epoch trajectory of the same configuration converges to $\|h_L\| \approx 1.06\times 10^8$ and $\|g_L\| \approx 1.09\times 10^{-10}$, deeply below the diagnostic~(b) floor and within an order of magnitude of vanilla residual DFA's $\|h_L\| \approx 4\times 10^8$ and $\|g_L\| \approx 5\times 10^{-10}$ on the same backbone, confirming that the smoke-test trend is the converged behavior rather than an early-training artifact.
+The qualitative shape matches what we see in vanilla residual DFA, only with a slower onset because the architecture itself is harder to train. Diagnostic~(a) clearly fires within three epochs, and diagnostic~(b) is already on the floor side of $10^{-7}$. Across $w_2$ std values $\{0.1,0.2,0.5\}$ that we tried in the same smoke sweep, the qualitative outcome is the same: residual stream grows by three to four orders of magnitude, $\|g_L\|$ drops by three to four orders of magnitude, and BP itself never reaches a healthy training regime. We retain $w_2{=}0.5$ here because that is the only value where BP is at least beginning to learn. The full $100$-epoch trajectory of the same configuration, replicated across three seeds (42, 123, 456), converges to a mean $\|h_L\| \approx 8.2\times 10^7$ and mean $\|g_L\| \approx 1.9\times 10^{-10}$ (per-seed values $\|h_L\| \in \{1.06\times 10^8, 3.15\times 10^7, 1.09\times 10^8\}$ and $\|g_L\| \in \{1.08, 2.94, 1.77\}\times 10^{-10}$), all deeply below the diagnostic~(b) floor and within an order of magnitude of vanilla residual DFA's $\|h_L\| \approx 4\times 10^8$ and $\|g_L\| \approx 5\times 10^{-10}$ on the same backbone, confirming that the smoke-test trend is the converged behavior rather than an early-training artifact.
 We treat this ablation as evidence about \emph{necessity}, not about clean algorithm separation. 
Specifically, the evidence supports: the additive residual skip is not necessary for Mode~1 activation growth or for the gradient-floor trend; Mode~1~(a) appears to be a generic deep-DFA instability on these stacks, modulated but not gated by skip presence; and the catastrophic, well-defined $\|g_L\|$ collapse remains most tightly associated with terminal LayerNorm in our audited settings, where the no-out\_ln control already showed activation growth without the same severity of collapse. The full $100$-epoch trajectory of this no-residual run is reported as a confirmatory check rather than as a primary claim.
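
The 3-seed means quoted in the commit message and in the added paragraph can be checked directly from the per-seed values. The sketch below is a reviewer-side sanity check, not part of the patch. The names `h_L`, `g_L`, `H_REF`, `G_REF`, and `decades_apart` are hypothetical and introduced only for illustration. The numbers are copied from the text above.

```python
import math
from statistics import mean

# Per-seed converged values quoted in the commit message (seeds 42, 123, 456).
h_L = [1.06e8, 3.15e7, 1.09e8]
g_L = [1.08e-10, 2.94e-10, 1.77e-10]

# Vanilla residual DFA reference values quoted in the patched paragraph.
H_REF, G_REF = 4e8, 5e-10


def decades_apart(a, b):
    """Absolute base-10 log ratio; below 1 means within one order of magnitude."""
    return abs(math.log10(a / b))


h_mean, g_mean = mean(h_L), mean(g_L)
print(f"mean ||h_L|| = {h_mean:.1e}, {decades_apart(h_mean, H_REF):.2f} decades from reference")
print(f"mean ||g_L|| = {g_mean:.1e}, {decades_apart(g_mean, G_REF):.2f} decades from reference")

# Per-seed distance from the reference, to see how far the "within an order
# of magnitude" statement extends beyond the means.
for h in h_L:
    print(f"||h_L|| = {h:.2e}: {decades_apart(h, H_REF):.2f} decades")
for g in g_L:
    print(f"||g_L|| = {g:.2e}: {decades_apart(g, G_REF):.2f} decades")
```

The means round to the quoted $8.2\times 10^7$ and $1.9\times 10^{-10}$, and both sit within one decade of the vanilla reference, as do all three per-seed $\|g_L\|$ values. The lowest per-seed $\|h_L\|$ ($3.15\times 10^7$) is about 1.1 decades from $4\times 10^8$, so the "within an order of magnitude" wording is safest read as a statement about the means.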