diff options
| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 06:00:37 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 06:00:37 -0500 |
| commit | 47f833be0f2abacd0ce53bbe32c7ac7b60fd59d6 (patch) | |
| tree | 63a69152696d969a0f77ccc3c3eb29f6254ef888 | |
| parent | be39c2b5ebec37f993b1a862459455a98cf39eb2 (diff) | |
Add 100ep trajectory confirmations to Appendix I (H2) and Appendix J (random targets)
H2 100ep: ||h_L||=1.06e8, ||g_L||=1.09e-10 (below (b) floor)
Random-target DFA 100ep: ||h_L||=1.67e8, ||g_L||=8e-12 (worse than vanilla)
Both fully confirm the smoke-test trends at converged training horizons.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| -rw-r--r-- | paper/main.pdf | bin | 464384 -> 465275 bytes | |||
| -rw-r--r-- | paper/main.tex | 4 |
2 files changed, 2 insertions, 2 deletions
diff --git a/paper/main.pdf b/paper/main.pdf Binary files differindex a62416c..c8f6b89 100644 --- a/paper/main.pdf +++ b/paper/main.pdf diff --git a/paper/main.tex b/paper/main.tex index 49cefc8..d3aad6b 100644 --- a/paper/main.tex +++ b/paper/main.tex @@ -410,7 +410,7 @@ DFA & $0.5$ & $3$ & $22{,}050$ & $1.6\times 10^{-7}$ & $0.148$ & $0.039$ \\ \end{tabular} \end{table} -The qualitative shape matches what we see in vanilla residual DFA, only with a slower onset because the architecture itself is harder to train. Diagnostic~(a) clearly fires within three epochs, and diagnostic~(b) is already on the floor side of $10^{-7}$. Across $w_2$ std values $\{0.1,0.2,0.5\}$ that we tried in the same smoke sweep, the qualitative outcome is the same: residual stream grows by three to four orders of magnitude, $\|g_L\|$ drops by three to four orders of magnitude, and BP itself never reaches a healthy training regime. We retain $w_2{=}0.5$ here because that is the only value where BP is at least beginning to learn. +The qualitative shape matches what we see in vanilla residual DFA, only with a slower onset because the architecture itself is harder to train. Diagnostic~(a) clearly fires within three epochs, and diagnostic~(b) is already on the floor side of $10^{-7}$. Across $w_2$ std values $\{0.1,0.2,0.5\}$ that we tried in the same smoke sweep, the qualitative outcome is the same: residual stream grows by three to four orders of magnitude, $\|g_L\|$ drops by three to four orders of magnitude, and BP itself never reaches a healthy training regime. We retain $w_2{=}0.5$ here because that is the only value where BP is at least beginning to learn. The full $100$-epoch trajectory of the same configuration converges to $\|h_L\| \approx 1.06\times 10^8$ and $\|g_L\| \approx 1.09\times 10^{-10}$, deeply below the diagnostic~(b) floor and within an order of magnitude of vanilla residual DFA's $\|h_L\| \approx 4\times 10^8$ and $\|g_L\| \approx 5\times 10^{-10}$ on the same backbone, confirming that the smoke-test trend is the converged behavior rather than an early-training artifact. We treat this ablation as evidence about \emph{necessity}, not about clean algorithm separation. Specifically, the evidence supports: the additive residual skip is not necessary for Mode~1 activation growth or for the gradient-floor trend; Mode~1~(a) appears to be a generic deep-DFA instability on these stacks, modulated but not gated by skip presence; and the catastrophic, well-defined $\|g_L\|$ collapse remains most tightly associated with terminal LayerNorm in our audited settings, where the no-out\_ln control already showed activation growth without the same severity of collapse. The full $100$-epoch trajectory of this no-residual run is reported as a confirmatory check rather than as a primary claim. @@ -436,7 +436,7 @@ $3$ & $14{,}510$ & $5.62\times 10^{-7}$ & $0.071$ & $-0.025$ \\ \end{tabular} \end{table} -This ablation answers the natural counterargument that DFA's residual-stream growth might be a side-effect of the network adapting to genuine task signal in a particularly bad local minimum: it is not. With no task signal at all, DFA on this architecture still inflates the residual stream by more than three orders of magnitude in the first three epochs and pushes the deepest BP reference gradient to the floor of $10^{-7}$ in the same window. The local DFA objective $\langle f_l(h_l),\, e_T B_l^\top\rangle$ contains no penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output increases inner-product alignment with the fixed feedback target is rewarded; the random-target run isolates exactly this geometric incentive, free of any task-driven feature pressure. The full $100$-epoch trajectory of this random-target run is reported as a confirmatory check rather than a primary claim. +This ablation answers the natural counterargument that DFA's residual-stream growth might be a side-effect of the network adapting to genuine task signal in a particularly bad local minimum: it is not. With no task signal at all, DFA on this architecture still inflates the residual stream by more than three orders of magnitude in the first three epochs and pushes the deepest BP reference gradient to the floor of $10^{-7}$ in the same window. The full $100$-epoch trajectory of the same DFA random-target run converges to $\|h_L\| \approx 1.67\times 10^8$ and $\|g_L\| \approx 8.0\times 10^{-12}$, both more extreme than the corresponding endpoints of vanilla DFA on the same backbone with real labels (about $4\times 10^8$ and $5\times 10^{-10}$ respectively), so the data-agnostic trajectory does not just reach Mode~1 but in fact passes through the same regime even without any per-sample task pressure. The local DFA objective $\langle f_l(h_l),\, e_T B_l^\top\rangle$ contains no penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output increases inner-product alignment with the fixed feedback target is rewarded; the random-target run isolates exactly this geometric incentive, free of any task-driven feature pressure. The full $100$-epoch trajectory of this random-target run is reported as a confirmatory check rather than a primary claim. We then asked whether this data-agnostic growth is specific to DFA or generalizes to other fixed-feedback local-credit methods, by repeating the random-target ablation under State Bridge and Credit Bridge with the same architecture, hyperparameters, and seed. Both methods also exhibit data-agnostic activation growth in the same three-epoch window, with $\|h_L\|$ rising from about $9$ to about $6.2\times 10^3$ (State Bridge) and about $2.0\times 10^4$ (Credit Bridge), while their test accuracies remain at chance ($0.10$ and $0.09$, respectively): |
