paper v2.31.7: Appendix H vanilla residual DFA endpoint values

Make Appendix H consistent with §3 ¶1 (which v2.30.2 already updated to 3-seed means): vanilla residual DFA's endpoint ‖h_L‖ ≈ 5×10^8 and ‖g_L‖ ≈ 4×10^-10 (three-seed mean), not the s42 single-seed values.

The s42 numbers are 4.39e8 and 4.86e-10, which were rounded as "≈4e8" and "≈5e-10" in the appendix. The 3-seed means are 5.18e8 and 3.76e-10, which round to "≈5e8" and "≈4e-10". Now §3 ¶1 (3-seed) and Appendix H (3-seed) report consistently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 18:24:43 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 18:24:43 -0500
commit: daa0cb53cad483903f1f89a3d7d692bdc00d9827 (patch)
tree: 17048a502b40165ebd971de723f55c95074a01dd /paper
parent: f232b790ed94b6fb76939e80201c29b3419a5ecc (diff)
2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index 157ab24..e744481 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
diff --git a/paper/main.tex b/paper/main.tex
index 7359634..0ff0e13 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -436,7 +436,7 @@ DFA & $0.5$ & $3$ & $22{,}050$ & $1.6\times 10^{-7}$ & $0.148$ & $0.039$ \\
 \end{tabularx}
 \end{table}
 
-The qualitative shape matches what we see in vanilla residual DFA, only with a slower onset because the architecture itself is harder to train. Diagnostic~(a) clearly fires within three epochs, and diagnostic~(b) is already on the floor side of $10^{-7}$. Across $w_2$ std values $\{0.1,0.2,0.5\}$ that we tried in the same smoke sweep, the qualitative outcome is the same: residual stream grows by three to four orders of magnitude, $\|g_L\|$ drops by three to four orders of magnitude, and BP itself never reaches a healthy training regime. We retain $w_2{=}0.5$ here because that is the only value where BP is at least beginning to learn. The full $100$-epoch trajectory of the same configuration, replicated across three seeds (42, 123, 456), converges to a mean $\|h_L\| \approx 8.2\times 10^7$ and mean $\|g_L\| \approx 1.9\times 10^{-10}$ (per-seed values $\|h_L\| \in \{1.06\times 10^8, 3.15\times 10^7, 1.09\times 10^8\}$ and $\|g_L\| \in \{1.08, 2.94, 1.77\}\times 10^{-10}$), all deeply below the diagnostic~(b) floor and within an order of magnitude of vanilla residual DFA's $\|h_L\| \approx 4\times 10^8$ and $\|g_L\| \approx 5\times 10^{-10}$ on the same backbone, confirming that the smoke-test trend is the converged behavior rather than an early-training artifact.
+The qualitative shape matches what we see in vanilla residual DFA, only with a slower onset because the architecture itself is harder to train. Diagnostic~(a) clearly fires within three epochs, and diagnostic~(b) is already on the floor side of $10^{-7}$. Across $w_2$ std values $\{0.1,0.2,0.5\}$ that we tried in the same smoke sweep, the qualitative outcome is the same: residual stream grows by three to four orders of magnitude, $\|g_L\|$ drops by three to four orders of magnitude, and BP itself never reaches a healthy training regime. We retain $w_2{=}0.5$ here because that is the only value where BP is at least beginning to learn. The full $100$-epoch trajectory of the same configuration, replicated across three seeds (42, 123, 456), converges to a mean $\|h_L\| \approx 8.2\times 10^7$ and mean $\|g_L\| \approx 1.9\times 10^{-10}$ (per-seed values $\|h_L\| \in \{1.06\times 10^8, 3.15\times 10^7, 1.09\times 10^8\}$ and $\|g_L\| \in \{1.08, 2.94, 1.77\}\times 10^{-10}$), all deeply below the diagnostic~(b) floor and within an order of magnitude of vanilla residual DFA's three-seed mean $\|h_L\| \approx 5\times 10^8$ and mean $\|g_L\| \approx 4\times 10^{-10}$ on the same backbone, confirming that the smoke-test trend is the converged behavior rather than an early-training artifact.
 
 We treat this ablation as evidence about \emph{necessity}, not about clean algorithm separation. Specifically, the evidence supports: the additive residual skip is not necessary for Mode~1 activation growth or for the gradient-floor trend; Mode~1~(a) appears to be a generic deep-DFA instability on these stacks, modulated but not gated by skip presence; and the catastrophic, well-defined $\|g_L\|$ collapse remains most tightly associated with terminal LayerNorm in our audited settings, where the no-out\_ln control already showed activation growth without the same severity of collapse. The full $100$-epoch trajectory of this no-residual run is reported as a confirmatory check rather than as a primary claim.
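
Reviewer note: the rounding in this change can be sanity-checked with a short script. The sketch below is not part of the repository; the helper name `one_sig` and the variable names are hypothetical, and the inputs are only the figures quoted in the commit message and in the changed paragraph. The vanilla residual DFA per-seed values for seeds 123 and 456 are not given here, so the 3-seed means 5.18e8 and 3.76e-10 are taken as stated rather than recomputed.

```python
import math


def one_sig(x):
    """Round x to one significant figure; return (mantissa, exponent)."""
    exponent = math.floor(math.log10(abs(x)))
    mantissa = round(x / 10 ** exponent)
    if mantissa == 10:  # e.g. 9.6e8 rounds up to 1e9
        mantissa, exponent = 1, exponent + 1
    return mantissa, exponent


# Vanilla residual DFA endpoints as quoted in the commit message.
s42_h, s42_g = 4.39e8, 4.86e-10      # single seed (42)
mean_h, mean_g = 5.18e8, 3.76e-10    # stated 3-seed means (not recomputed)

# No-residual ablation per-seed endpoints as quoted in the Appendix H paragraph.
ablation_h = [1.06e8, 3.15e7, 1.09e8]
ablation_g = [1.08e-10, 2.94e-10, 1.77e-10]
ablation_mean_h = sum(ablation_h) / len(ablation_h)
ablation_mean_g = sum(ablation_g) / len(ablation_g)

# "Within an order of magnitude" claim: both ratios should lie below 10.
ratio_h = mean_h / ablation_mean_h
ratio_g = mean_g / ablation_mean_g

if __name__ == "__main__":
    print("s42 rounded:   ", one_sig(s42_h), one_sig(s42_g))
    print("3-seed rounded:", one_sig(mean_h), one_sig(mean_g))
    print(f"ablation means: {ablation_mean_h:.1e}, {ablation_mean_g:.1e}")
    print(f"ratios: {ratio_h:.2f}, {ratio_g:.2f}")
```

Under these inputs the old appendix figures correspond to the s42 values, the new figures correspond to the stated 3-seed means, and the ablation means in the paragraph follow from its own per-seed values.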