Round 37 OPTION C: §3 compression — preserve causal structure, drop per-round narration

Codex round 37 verdict: 'Page budget is the bottleneck, not mechanism uncertainty.' Mode 1 is mechanism-complete after rounds 32-36. Strict compression rule: 'one claim sentence per falsified alternative, one for the positive mechanism, everything numeric goes to appendix tables.'

§3 rewrite (4 -> 3 paragraphs):
1. Phenomenon class: 6-line geometric argument inlined as one sentence; LN Jacobian derivation for (b); empirical anchors for vanilla DFA.
2. Falsification chain: 4 alternative attributions, each in one sentence: not residual-skip-driven (App H), not task-signal-driven (App I), not DFA-specific (App I), not shared by EP.
3. Positive necessity for (b): same-backbone no_outln control with full numbers; cross-architecture support; temporal early-fire result.

Result: main content 9 -> 8 pages (1 page of slack restored). Total 15 -> 14 pages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 06:17:51 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 06:17:51 -0500
commit: ddf1a6807bd849e91b6d2615e7c2e83719886294 (patch)
tree: da9074b8950dd9d7b17cc80157603fec6106f466
parent: 02c3d2c80805daedb2b6c8e9d6e5f36c52d361a1 (diff)
2 files changed, 3 insertions, 5 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index b0a2973..12ef514 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
diff --git a/paper/main.tex b/paper/main.tex
index 001a02e..f331b2c 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -76,13 +76,11 @@ When we compare each method to a frozen-blocks baseline matched to the same arch
 \section{Failure Mode 1: Measurement Degeneracy}
 \label{sec:mode1}
 
-The first failure mode is a scale pathology, not yet an alignment pathology. On the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA optimizes block-local objectives of the form $\langle f_l(h_l),\, e_T B_l^\top\rangle$ with no explicit scale constraint on $f_l$, so for any direction in which increasing $\|f_l(h_l)\|$ improves alignment with the fixed feedback target $B_l^\top e_T$, the local objective rewards larger output magnitude. In a pre-LN residual stack, larger block outputs directly increase residual-stream scale; terminal LayerNorm then removes task-loss sensitivity to that scale at the output, so the architecture provides no global restraint on the local growth incentive \citep{launay2020direct}. In the same runs, each block's $w_1$ and $w_2$ grows by roughly $200\times$ in relative delta, their norm product reaches about $5\times 10^4$ per block, and the terminal hidden-state norm $\|h_L\|$ rises monotonically from about $9$ at random initialization to about $4\times 10^8$ by epoch 100 (Figure~\ref{fig:temporal_cross_arch}). Most of that growth appears immediately: $\|h_L\|$ already reaches about $10^6$ by epoch 5. As a direct test of whether this growth needs task signal at all, we re-ran DFA, State Bridge, and Credit Bridge on the same backbone with i.i.d.\ random class targets refreshed every minibatch, so the labels carry no information; under random targets all three methods stay at chance accuracy, yet $\|h_L\|$ still grows from about $9$ to about $1.45\times 10^4$ for DFA, $6.2\times 10^3$ for State Bridge, and $2.0\times 10^4$ for Credit Bridge within three epochs, and DFA's $\|g_L\|$ already drops to about $5.6\times 10^{-7}$, so Mode~1 is essentially data-agnostic on this architecture across the three audited fixed-feedback local-credit methods (Appendix~\ref{app:random_targets}). Once the residual stream reaches this regime, the backpropagation reference vector no longer behaves like a healthy target.
+Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge update $f_l$ by reducing a local loss of the form $-\langle f_l(h_l),\, B_l^\top e_T\rangle$, which contains no penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the fixed feedback target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian: in terminal-LN architectures $\partial\mathrm{LN}(h)/\partial h \propto 1/\|h\|$ in expectation, so the same residual-stream inflation is accompanied by collapse of the hidden-layer BP reference norm. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives $\|h_L\|$ from about $9$ at initialization to about $4\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $5\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to a meaningful BP direction.
 
-The measurement failure occurs at the point where the hidden-layer BP gradient ceases to be a meaningful reference direction. In terminal-LayerNorm architectures, the LayerNorm Jacobian scales as $\partial \mathrm{LN}(h)/\partial h \propto 1/\|h\|$ in expectation, so the same residual-stream inflation is accompanied by collapse of the hidden-layer BP reference norm: on DFA-trained ResMLP, $\|g_L\|$ falls from about $9.8\times 10^{-4}$ at random initialization to about $5\times 10^{-10}$ by epoch 100, a six-order-of-magnitude drop, while the reported cosine remains mathematically defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to an informative BP direction. At that point, reporting a cosine is no longer evidence about credit quality.
+We tested this mechanism story against four natural alternative attributions, all of which it survives. \emph{Not residual-skip-driven:} on the same ResMLP-d256 with terminal LN kept and the additive skip removed ($h_{l+1}{=}F_l(h_l)$), DFA still inflates $\|h_L\|$ from $\sim\!5$ to $\sim\!2.2{\times}10^{4}$ in three epochs and converges to $\|h_L\|{\approx}1.06{\times}10^{8}$ and $\|g_L\|{\approx}1.09{\times}10^{-10}$ at $100$ epochs, both already at the diagnostic floor (Appendix~\ref{app:no_residual}). \emph{Not task-signal-driven:} replacing labels by i.i.d.\ random class targets refreshed every minibatch on the same backbone, DFA still reaches $\|h_L\|{\approx}1.67{\times}10^{8}$ and $\|g_L\|{\approx}8{\times}10^{-12}$ at $100$ epochs while accuracy stays at chance (Appendix~\ref{app:random_targets}). \emph{Not DFA-specific:} the same random-target ablation also drives $\|h_L\|$ from $9$ to $6.2{\times}10^{3}$ for State Bridge and $2.0{\times}10^{4}$ for Credit Bridge in three epochs, again at chance accuracy, so all three audited fixed-feedback methods exhibit data-agnostic activation growth (Appendix~\ref{app:random_targets}). \emph{Not shared by EP:} under the same random-target protocol, EP keeps $\|h_L\|{\approx}586$ at five epochs of training, $25\times$ smaller than DFA's three-epoch value on the same architecture, consistent with EP's bounded behavior on real labels and confirming that the random-target assay separates the explosion-prone fixed-feedback class from EP's energy-based local objective.
 
-The simplest control is architectural, not theoretical. On the same ResMLP backbone, BP keeps $\|h_L\|$ near $200$ and $\|g_L\|$ near $4\times 10^{-4}$ throughout training, while EP keeps $\|h_L\|$ around $5\times 10^3$ and $\|g_L\|$ around $1.3\times 10^{-4}$, so hard optimization on CIFAR-10 by itself does not force hidden-layer gradients to the numerical floor (Table~\ref{tab:main_audit}; Figure~\ref{fig:temporal_cross_arch}). The matched same-backbone control for terminal LayerNorm itself is the cleanest test: when we strip out the terminal LayerNorm from the same ResMLP-d256 with the residual skip intact, train DFA to convergence over 100 epochs and three seeds, the residual stream still inflates to $\|h_L\| \approx 1.2 \times 10^7 \pm 0.1$, but the deepest hidden-layer BP gradient remains at $\|g_L\| \approx 7.4\times 10^{-4}$ (mean over three seeds), four orders of magnitude above the diagnostic~(b) floor, with final test accuracy $0.327\pm 0.013$ which is statistically indistinguishable from vanilla DFA's $0.308\pm 0.014$ on the same backbone. So removing terminal LayerNorm preserves Mode~1~(a) but cleanly eliminates Mode~1~(b) on the same architecture, while leaving final task accuracy essentially unchanged. Combined with the broader cross-architecture pattern, where StudentNet and the BatchNorm CNN, which lack terminal LayerNorm, also keep deepest BP gradients around $10^{-4}$ and never trigger diagnostic (b) while ViT-Mini with a terminal LN shows the same collapse pattern and triggers diagnostic (b) by epochs 2--3 (Figure~\ref{fig:temporal_cross_arch}), the picture is that terminal LayerNorm is necessary for Mode~1~(b) in the audited residual ResMLP and ViT-Mini setting. To check whether the additive residual skip itself is the proximate trigger, we ran a matched ResMLP-d256 ablation that replaces $h_{l+1} = h_l + F_l(h_l)$ with $h_{l+1} = F_l(h_l)$ while keeping terminal LN and all other hyperparameters fixed; in that ablation DFA's $\|h_L\|$ still grows from $\sim\!5$ to $\sim\!2.2\times 10^4$ within three epochs and $\|g_L\|$ already drops to $\sim\!1.6\times 10^{-7}$, so the additive skip is \emph{not} necessary for Mode~1 either, even though the no-residual stack is partially degenerate for both BP and DFA (Appendix~\ref{app:no_residual}). The pathology therefore belongs to the evaluated FA regime, not to CIFAR-10, the backbone, or the residual skip alone.
-
-The collapse is not a late-epoch curiosity. For vanilla DFA on the ResMLP temporal replay, $\|g_L\|$ drops from $9.8\times 10^{-4}$ at epoch 0 to $1.4\times 10^{-6}$ at epoch 1, $3.1\times 10^{-7}$ at epoch 2, $1.3\times 10^{-7}$ at epoch 3, and $6.7\times 10^{-8}$ at epoch 4, so diagnostic (b) fires at epoch 3--4 across all three seeds, while the max-per-block growth detector fires slightly later at epochs 8--11 (Figure~\ref{fig:temporal_cross_arch}). Both detectors therefore fire in the first 11 epochs of a 100-epoch run, making the protocol actionable as an early-stop criterion rather than a post hoc explanation. The practical point is reinforced by accuracy: DFA is at $0.308$ already at epoch 4 and ends at $0.306$ by epoch 100, so the remaining training budget adds essentially nothing to the headline result once the measurement has already degenerated. Once measurement degeneracy is identified, the next question is whether poor deep credit remains even before collapse.
+The matched same-backbone causal control for diagnostic~(b) is removing terminal LayerNorm. On the same ResMLP-d256 with the residual skip intact, $100$ epochs of DFA, three seeds, the residual stream still inflates to $\|h_L\|\!\approx\!1.21\times 10^7$, but the deepest hidden-layer BP gradient remains at $\|g_L\|\!\approx\!7.4\times 10^{-4}$ (four orders of magnitude above the diagnostic~(b) floor), and the final test accuracy is $0.327\pm 0.013$, statistically indistinguishable from vanilla DFA's $0.308\pm 0.014$. Removing terminal LayerNorm therefore preserves Mode~1~(a) but cleanly eliminates Mode~1~(b) on the same architecture, while leaving final task accuracy essentially unchanged. Combined with the broader cross-architecture pattern (StudentNet and the BatchNorm CNN, which lack terminal LayerNorm, never trigger diagnostic~(b); ViT-Mini with a terminal LN does, by epochs 2--3 (Figure~\ref{fig:temporal_cross_arch})), terminal LayerNorm is necessary for Mode~1~(b) in the audited residual ResMLP and ViT-Mini setting. The collapse is also not a late-epoch curiosity: $\|g_L\|$ drops from $9.8\times 10^{-4}$ at epoch~0 to $6.7\times 10^{-8}$ by epoch~4 in the temporal replay across three seeds, so the protocol fires within the first $11$ epochs of a $100$-epoch run and is actionable as an early-stop criterion rather than a post hoc explanation. Once measurement degeneracy is identified, the next question is whether poor deep credit remains even before collapse.
 
 \section{Failure Mode 2: Low Intrinsic Credit-Direction Quality}
 \label{sec:mode2}
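
Reviewer note on the rewritten Mode 1 paragraph: the two quantitative claims it relies on, the $1/\|h\|$ scaling of the LayerNorm backward pass and the roughly $20\times$ shrinkage once the reference norm falls below the clamp, can be checked with a small standalone sketch. This is not the paper's code. It assumes a plain LayerNorm with no affine parameters and no epsilon, and it implements the clamped cosine exactly in the form written in the paragraph, $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$, rather than calling PyTorch. All function names and the toy vectors are hypothetical.

```python
import math

EPS = 1e-8  # clamp value quoted in the section for F.cosine_similarity


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def layer_norm(h):
    """Plain LayerNorm: no affine parameters, no epsilon (an assumption)."""
    d = len(h)
    mu = sum(h) / d
    sigma = math.sqrt(sum((x - mu) ** 2 for x in h) / d)
    return [(x - mu) / sigma for x in h], sigma


def layer_norm_backward(h, g_out):
    """Backpropagate an upstream gradient g_out through LayerNorm at h."""
    y, sigma = layer_norm(h)
    d = len(h)
    mean_g = sum(g_out) / d
    mean_gy = sum(g * yi for g, yi in zip(g_out, y)) / d
    # Standard LN backward: remove the mean and the component along y,
    # then divide by sigma, which is where the 1/||h|| factor comes from.
    return [(g - mean_g - yi * mean_gy) / sigma for g, yi in zip(g_out, y)]


def clamped_cosine(a, b, eps=EPS):
    """(a.b) / (||a|| * max(||b||, eps)), the form written in the section."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * max(norm(b), eps))


# Toy hidden state and upstream gradient (hypothetical values).
h = [0.3, -1.2, 0.8, 2.0]
g_out = [0.5, -0.1, 0.2, 0.7]

# Scaling h by 1e6 leaves LN(h) unchanged but shrinks the input gradient.
g_small_h = layer_norm_backward(h, g_out)
g_large_h = layer_norm_backward([1e6 * x for x in h], g_out)
ratio = norm(g_small_h) / norm(g_large_h)
print(f"gradient-norm ratio after scaling h by 1e6: {ratio:.6g}")

# A reference vector perfectly aligned with a, but with norm 5e-10.
a = [1.0, 0.0]
b = [5e-10, 0.0]
reported = clamped_cosine(a, b)
print(f"true cosine is 1.0, clamped value: {reported:.6g}")
```

With a perfectly aligned reference of norm $5\times 10^{-10}$, the clamped expression returns about $0.05$ instead of $1$, which is the $20\times$ gap the paragraph describes. The gradient-norm ratio tracks the factor by which $h$ was scaled.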