| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 17:31:09 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 17:31:09 -0500 |
| commit | 61d483e429f4fead0e44805eba0f643b7464c6e9 (patch) | |
| tree | 41c207ba3412928d564da7418b113831696478ef | |
| parent | 4b731f824e4a4ee8606aa472a9e5adc4021991b8 (diff) | |
paper v2.30: fix layer-0 cosine numbers + add per-seed appendix M
Found a numerical error in §4 ¶3: the layer-0 vanilla DFA cosines were
listed as +0.42, +0.45, +0.39 across seeds 42/123/456 but the actual
re-measurement on the saved early-epoch checkpoints gives +0.421, +0.436,
+0.418 (the s456 value was off by 0.03 and the s123 value by about 0.01).
The deep-mean numbers in Table 2 (-0.008 ± 0.013) were already correct.
Changes:
- §4 ¶3: layer-0 trio updated to +0.42, +0.44, +0.42 across seeds and
  cite now points to a new per-seed appendix.
- New Appendix M (Layer-0 Dominance): 6-row table of per-seed per-layer
  cosines on vanilla DFA early checkpoints (3 seeds × ep 1, 2), with
  per-layer ||g||. Documents the layer-0 dominance pattern that drives
  the headline aggregate Γ on these checkpoints.
- results/vanilla_dfa_early_ckpts/per_layer_cos_3seed.json:
  machine-readable dump of all 6 measurements for future audit.
- §7 compressed (~30 words trimmed across the closing paragraph) and
  Figure 3 width 0.92 → 0.82 to keep main content at exactly 9 pages
  after the appendix addition.
Verified: 9 pages main + refs on p10, 18 total, 0 overfull boxes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
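The epoch-1 figures quoted in the commit message can be re-derived from the JSON dump this commit adds. The following is a minimal sketch, not the repository's own audit script: it inlines the three epoch-1 `per_layer_cos` arrays from the dump instead of reading the file from disk, and the variable names are illustrative. It checks only the means; the convention behind the ± spread is not stated in the commit, so the spread is not recomputed here.

```python
import json
from statistics import mean

# Epoch-1 "per_layer_cos" arrays copied from
# results/vanilla_dfa_early_ckpts/per_layer_cos_3seed.json (index 0 is layer 0).
EP1_JSON = """
{
  "s42_ep1": {"per_layer_cos": [0.4208017587661743, 0.004838745109736919,
      -0.027907393872737885, -0.038808517158031464, -0.03807452693581581]},
  "s123_ep1": {"per_layer_cos": [0.43564510345458984, 0.008119785226881504,
      -0.032616909593343735, 0.015984557569026947, 0.0165080688893795]},
  "s456_ep1": {"per_layer_cos": [0.4175308644771576, 0.010558882728219032,
      -0.026187893003225327, 0.007104482501745224, 0.006094767712056637]}
}
"""

records = json.loads(EP1_JSON)

# Layer-0 cosine per seed.
layer0 = {key: rec["per_layer_cos"][0] for key, rec in records.items()}

# Deep-layer mean per seed (layers 1..4), then averaged across the three seeds.
deep_mean_per_seed = {key: mean(rec["per_layer_cos"][1:]) for key, rec in records.items()}
deep_mean_ep1 = mean(deep_mean_per_seed.values())

for key in records:
    print(f"{key}: layer0={layer0[key]:+.3f}  deep_mean={deep_mean_per_seed[key]:+.4f}")
print(f"deep mean across seeds (ep 1): {deep_mean_ep1:+.3f}")
```

To audit the epoch-2 entries as well, the same logic can be pointed at the full file with `json.load` in place of the inlined string.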
| -rw-r--r-- | paper/main.pdf | bin | 496499 -> 500476 bytes |
| -rw-r--r-- | paper/main.tex | 32 | |
| -rw-r--r-- | results/vanilla_dfa_early_ckpts/per_layer_cos_3seed.json | 98 | |
3 files changed, 127 insertions, 3 deletions
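The `paper/main.tex` context in the diff below defines the per-layer perturbation correlation $\rho_l$ as a Pearson correlation, over $M{=}32$ unit-norm random directions with step $\varepsilon{=}10^{-3}$, between the predicted change $\langle a_l, \varepsilon v_m\rangle$ and the actual loss change. A minimal NumPy sketch of that definition follows. The toy quadratic loss, the dimensions, and the function name `perturbation_correlation` are assumptions made for illustration; this is not the paper's measurement script. It shows the two controls the text describes qualitatively: using the true gradient as the credit gives a correlation near the Taylor ceiling, and a random credit vector gives a value near zero once averaged over samples.

```python
import numpy as np

def perturbation_correlation(loss_fn, h, a, num_dirs=32, eps=1e-3, rng=None):
    """Pearson correlation between <a, eps*v_m> and loss(h + eps*v_m) - loss(h)."""
    rng = np.random.default_rng() if rng is None else rng
    # Draw num_dirs random directions and normalise each to unit norm.
    v = rng.standard_normal((num_dirs, h.shape[0]))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    predicted = eps * (v @ a)                      # first-order prediction from credit a
    base = loss_fn(h)
    actual = np.array([loss_fn(h + eps * v_m) - base for v_m in v])
    return np.corrcoef(predicted, actual)[0, 1]

# Toy setting (hypothetical): quadratic loss 0.5 * ||W h - y||^2 on random data.
rng = np.random.default_rng(0)
d, k, n_samples = 16, 8, 64
W = rng.standard_normal((k, d)) / np.sqrt(d)

rho_pos, rho_neg = [], []
for _ in range(n_samples):
    h = rng.standard_normal(d)
    y = rng.standard_normal(k)
    loss_fn = lambda x, y=y: 0.5 * np.sum((W @ x - y) ** 2)
    grad = W.T @ (W @ h - y)                       # exact gradient of the toy loss at h
    # Positive control: the credit signal is the true gradient.
    rho_pos.append(perturbation_correlation(loss_fn, h, grad, rng=rng))
    # Negative control: the credit signal is an unrelated random vector.
    rho_neg.append(perturbation_correlation(loss_fn, h, rng.standard_normal(d), rng=rng))

rho_pos_mean = float(np.mean(rho_pos))
rho_neg_mean = float(np.mean(rho_neg))
print(f"positive control (true gradient): {rho_pos_mean:+.4f}")
print(f"negative control (random vector): {rho_neg_mean:+.4f}")
```

As in the text, the correlation is computed per sample and then averaged. With only 32 directions a single-sample value for the random control is noisy, which is why the average over samples is the quantity to read.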
diff --git a/paper/main.pdf b/paper/main.pdf
index e681ca9..b7bb844 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index bf9b73c..26e4eb2 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -96,7 +96,7 @@ The second failure mode appears even in the meaningful-measurement regime. At th
 
 A second metric with different numerical failure modes tells the same story. Cosine measures directional agreement with the BP gradient, whereas the per-layer perturbation correlation $\rho_l$ measures whether the proposed credit predicts the actual loss response: for $M{=}32$ unit-norm random directions $v_m$ and step $\varepsilon{=}10^{-3}$, $\rho_l \;{=}\; \mathrm{Pearson}_m\!\left(\langle a_l,\, \varepsilon v_m\rangle,\;\, \ell(h_l + \varepsilon v_m) - \ell(h_l)\right)$, evaluated per sample on a fixed eval batch and then averaged. Cosine and $\rho$ have different failure modes, especially with respect to normalization and small-denominator effects. In our controls, $\rho$ behaves as expected, with a Taylor-ceiling positive control near $+0.997$ and a random-vector negative control near $+0.006$ (Figure~\ref{fig:penalty_rescue}, Table~\ref{tab:mode_validation}). On vanilla DFA, deep $\rho$ is likewise null: for the early checkpoints where the gradients remain measurable, the deep average is $-0.003 \pm 0.005$ across seeds and epochs, and in a floor-level checkpoint it is $+0.002$, again indistinguishable from noise. The agreement between cosine and $\rho$ therefore rules out the interpretation that the null deep result is an artifact of cosine's $\varepsilon$-clamp or vector normalization. The deep blocks are not just hard to measure; they are receiving weakly useful directions.
 
-Per-layer reporting is therefore not cosmetic. In ResMLP under vanilla DFA, the headline aggregate alignment $\Gamma \approx 0.07$--$0.10$ can look mildly positive only because layer $0$ remains strongly aligned while the deep network is not: at the same early checkpoints where layers $1$--$4$ are essentially zero, layer $0$ has cosine $+0.42$, $+0.45$, and $+0.39$ across seeds (Table~\ref{tab:mode_validation}). The resulting average can therefore be driven by the embedding layer even when the interior blocks are effectively unaligned, so aggregate reporting obscures the very distinction needed to separate ``measurement collapse'' from ``poor credit direction.'' This layer-$0$ dominance is specific to the ResMLP DFA setting; on ViT-Mini DFA, all layers are near zero, which strengthens the broader methodological point that alignment should be reported per layer rather than only in aggregate. With the two modes separated observationally, the remaining question is whether intervention can move them independently.
+Per-layer reporting is therefore not cosmetic. In ResMLP under vanilla DFA, the headline aggregate alignment $\Gamma \approx 0.07$--$0.10$ can look mildly positive only because layer $0$ remains strongly aligned while the deep network is not: at the same epoch-1 checkpoints where layers $1$--$4$ are essentially zero, layer $0$ has cosine $+0.42$, $+0.44$, and $+0.42$ across seeds (Table~\ref{tab:mode_validation}; per-seed values in Appendix~\ref{app:layer0_dominance}). The resulting average can therefore be driven by the embedding layer even when the interior blocks are effectively unaligned, so aggregate reporting obscures the very distinction needed to separate ``measurement collapse'' from ``poor credit direction.'' This layer-$0$ dominance is specific to the ResMLP DFA setting; on ViT-Mini DFA, all layers are near zero, which strengthens the broader methodological point that alignment should be reported per layer rather than only in aggregate. With the two modes separated observationally, the remaining question is whether intervention can move them independently.
 
 Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same per-block scale-control penalty $\lambda{=}10^{-2}$ that rescued DFA to State Bridge and to Credit Bridge on the same 4-block $d{=}256$ ResMLP backbone over $30$ epochs and three seeds gives converged test accuracies of $0.453 \pm 0.003$ (SB) and $0.360 \pm 0.003$ (CB), with deep mean cosines of $+0.322 \pm 0.007$ (SB) and $+0.679 \pm 0.008$ (CB) and deep mean $\rho$ of $+0.402 \pm 0.015$ (SB) and $+0.464 \pm 0.025$ (CB), while DFA under the same intervention reaches $0.363 \pm 0.001$ with deep cosine $+0.155 \pm 0.025$ and deep $\rho$ $+0.080 \pm 0.011$ (Table~\ref{tab:mode_validation}; Appendix~\ref{app:sb_penalty}). The State Bridge penalty rescue is roughly $24$ percentage points above the vanilla State Bridge baseline of $0.213$ on the same architecture and, more importantly for the paper's central walk-back, exceeds the architecture-matched frozen-blocks shallow baseline of $0.349$ by $+10.4$ percentage points. State Bridge with the penalty intervention is therefore the first audited non-BP method whose trained deep blocks substantively improve over an architecture-matched random-block baseline; the headline accuracy gap is comparable to BP+penalty's $+18.1$ pp over the same shallow baseline. Neither the activation scale nor the deep BP gradient magnitude is silenced under the penalty: $\|h_L\|$ stays at $302 \pm 8$ for SB and $5680 \pm 178$ for CB, with $\|g_L\|$ at $\sim\!1.8\times 10^{-4}$ and $\sim\!1.9\times 10^{-5}$ respectively, both well within the meaningful-measurement regime, so the recovered deep cosines are computed against an informative reference and not against a numerical floor. Within this rescued regime, the three methods reveal a clean cosine-versus-accuracy dissociation. Credit Bridge achieves roughly $4\times$ the deep cosine of DFA and $2\times$ that of State Bridge, yet its final accuracy matches DFA's and is $9$ percentage points below State Bridge's. We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: under the same intervention and matched training budget, CB and DFA reach the same accuracy despite a $4\times$ deep-cosine gap, while SB is the best accuracy with intermediate cosine. \emph{Inference}: layerwise cosine to the BP gradient is necessary to rule out grossly wrong credit signals (it distinguishes the rescued regime from the clamp-dominated vanilla regime), but it is not sufficient to certify that the supplied signal is useful credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely whether its direction is close to the BP gradient in angle. Under this reading, CB supplies a gradient-direction surrogate that aligns with BP in angle but does not translate to a coordinated forward-state improvement, while State Bridge supplies a state-level downstream teaching signal that preserves aspects of useful credit which layerwise cosine does not measure. We state this as a mechanism hypothesis rather than a theorem because we have measured the angle-to-accuracy gap but not the full functional-credit content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number.
 
@@ -138,7 +138,7 @@ The architecture comparison sharpens the scope of the critique. In the terminal-
 
 \begin{figure}[t]
 \centering
-\includegraphics[width=0.92\linewidth]{figures/fig4_penalty_rescue.pdf}
+\includegraphics[width=0.82\linewidth]{figures/fig4_penalty_rescue.pdf}
 \caption{Penalty intervention view of the two modes: penalization rescues residual-stream scale and restores a measurable but still partial deep-layer credit signal, clarifying that numerical rescue and credit-quality rescue are related but distinct.}
 \label{fig:penalty_rescue}
 \end{figure}
@@ -180,7 +180,7 @@ The protocol is conservative in a specific sense: it preserves BP and EP as evid
 \section{Discussion, Limits, Conclusion}
 \label{sec:discussion}
 
-Our claim is about what existing evidence licenses, not about impossibility: this paper does not show that FA cannot work in deep networks, only that current evaluation practice can misread what happened. DFA, State Bridge, and Credit Bridge all survive status-quo reporting in Table~\ref{tab:main_audit}, yet the protocol shows that their deep claims are unsupported, while the intervention in Figure~\ref{fig:penalty_rescue} partially rescues deep credit signal rather than validating the original headline. Our strongest claim is scoped to the 4-block $d{=}256$ and $d{=}512$ pre-LayerNorm ResMLPs and to ViT-Mini, where Mode~1 (a)+(b) both fire; StudentNet and the BatchNorm CNN refine the scope by showing that activation growth can persist without the gradient-floor collapse, the no-terminal-LN same-backbone control establishes terminal LayerNorm as causally necessary for diagnostic~(b) on residual ResMLP but not proven beyond that family, the dataset is only CIFAR-10, and the BP-plus-penalty comparison is a lower-bound control rather than a full decomposition. In the evaluation-methodology line of \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}, FA papers should therefore report BP-reference validity, layerwise credit quality, and a frozen-blocks depth-utilization baseline as three separate axes rather than a single headline.
+\looseness=-2 Our claim is about evidence, not impossibility: we show that current FA evaluation practice can misread what happened, not that FA cannot work in deep networks. DFA, SB, and CB all pass status-quo reporting (Table~\ref{tab:main_audit}) but fail the protocol's deep checks, and the Figure~\ref{fig:penalty_rescue} penalty partially rescues credit signal rather than validating headlines. Our strongest claim is scoped to $d{=}256/512$ pre-LayerNorm ResMLPs and ViT-Mini, where both Mode~1 diagnostics fire; StudentNet and the BatchNorm CNN show that activation growth can persist without gradient-floor collapse; the no-terminal-LN control establishes terminal LayerNorm as causally necessary for diagnostic~(b) on residual ResMLP; the dataset is CIFAR-10; and the BP-plus-penalty comparison is a lower bound, not a full decomposition. In the evaluation-methodology line of \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}, FA papers should report BP-reference validity, layerwise credit quality, and a frozen-blocks depth-utilization baseline as separate axes, not a single headline.
 
 \begin{thebibliography}{10}
 
@@ -519,6 +519,32 @@ DFA+pen mean & $0.363 \pm 0.001$ & $4.0\times 10^4$ & $9.0\times 10^{-7}$ & $+0.
 
 The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.5$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.155$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs averages $14.3\times$ for SB+penalty, $18.6\times \pm 0.5$ for DFA+penalty, and $19.3\times$ for CB+penalty (three seeds each), and the embedding layer's relative drift is $7.1\times$ for SB versus $44.6\times$ for CB and $94.6\times \pm 1.4$ for DFA, so none of the three methods' per-block updates are silenced under penalty and CB's are in fact larger in magnitude than SB's while DFA's embedding updates are the largest of all, yet CB's and DFA's final accuracies are both $9.3$ percentage points below State Bridge's. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. The nudging test at the same checkpoints provides the direct functional measurement: taking a small step of size $\eta{=}0.01$ in the direction of each method's per-layer credit $a_l$ decreases the test loss by $-1.78\times 10^{-3}$ on average over the deep blocks for SB+penalty, by $-0.45\times 10^{-3}$ for CB+penalty, and by only $-5\times 10^{-5}$ for DFA+penalty (three seeds each, $30$-epoch runs via the same training script). At the same per-layer credit direction, a step in SB's direction moves the loss about four times more than a step in CB's direction and about thirty-five times more than a step in DFA's direction, even though CB's direction is more aligned with the BP gradient in angle than either. The $30$-epoch training trajectories give a third independent confirmation: SB+penalty's training loss falls from $2.047$ at epoch $1$ to $1.589$ at epoch $30$, a decrease of $0.458$, whereas CB+penalty's training loss falls by only $0.122$ and DFA+penalty's by only $0.095 \pm 0.007$ over the same $30$ epochs (three seeds). Deep cosine ranks the three methods CB $>$ SB $>$ DFA, but every functional metric (nudging, integrated training-loss decrease, headline accuracy) ranks them SB $\gg$ CB $\approx$ DFA: the ordering produced by deep cosine is the only one that does not predict accuracy correctly. This is the strongest form of the cos-versus-accuracy dissociation: across three audited fixed-feedback methods under the same penalty intervention, the ranking implied by angular agreement with the BP gradient is contradicted by three independent functional measurements that do predict accuracy. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline.
 
+\section{Layer-0 Dominance: Per-Seed Vanilla DFA Early-Epoch Cosines}
+\label{app:layer0_dominance}
+
+For the layer-0-dominance claim in Section~\ref{sec:mode2}, the per-layer cosines between DFA's local credit signal $a_l = e_T B_l^\top$ and the BP gradient at the corresponding hidden state were measured on the saved vanilla DFA early-epoch checkpoints (Section~\ref{sec:mode2}, Table~\ref{tab:mode_validation}). All measurements use the script's default eval batch ($n{=}2048$ CIFAR-10 test samples) and the training-time $B_l$ matrices reconstructed from the original training RNG. Layer indices follow the convention used elsewhere in the paper: $l{=}0$ is the first residual block (which sees the embedding output) and $l{=}1\!\ldots\!4$ are the deeper residual blocks. The full per-seed values are dumped to \texttt{results/vanilla\_dfa\_early\_ckpts/per\_layer\_cos\_3seed.json}.
+
+\begin{table}[h]
+\centering
+\small
+\caption{Per-layer cosines on vanilla DFA early-epoch checkpoints (3 seeds, ep 1 and ep 2). Layer 0 is consistently $\approx +0.42$ across all six measurements while every deep layer (1--4) lies in $[-0.06,+0.02]$, so the headline aggregate $\Gamma$ on these checkpoints is driven almost entirely by layer 0 even though the deep blocks carry essentially no alignment with the BP gradient.}
+\label{tab:layer0_per_seed}
+\begin{tabular}{llrrrrrr}
+\toprule
+seed & ep & $l{=}0$ & $l{=}1$ & $l{=}2$ & $l{=}3$ & $l{=}4$ & $\|g_2\|$ \\
+\midrule
+$42$ & $1$ & $+0.421$ & $+0.005$ & $-0.028$ & $-0.039$ & $-0.038$ & $6.8\times 10^{-7}$ \\
+$42$ & $2$ & $+0.437$ & $-0.002$ & $-0.040$ & $-0.055$ & $-0.054$ & $1.6\times 10^{-7}$ \\
+$123$ & $1$ & $+0.436$ & $+0.008$ & $-0.033$ & $+0.016$ & $+0.017$ & $6.6\times 10^{-7}$ \\
+$123$ & $2$ & $+0.460$ & $+0.005$ & $-0.037$ & $+0.003$ & $+0.003$ & $1.4\times 10^{-7}$ \\
+$456$ & $1$ & $+0.418$ & $+0.011$ & $-0.026$ & $+0.007$ & $+0.006$ & $3.8\times 10^{-7}$ \\
+$456$ & $2$ & $+0.409$ & $+0.003$ & $-0.039$ & $+0.001$ & $+0.000$ & $8.5\times 10^{-8}$ \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+The deep-layer mean across the three seeds at epoch 1 is $-0.008 \pm 0.013$ (matching Table~\ref{tab:mode_validation}), and at epoch 2 is $-0.018 \pm 0.018$. Layer 0 stays at $+0.42 \pm 0.02$ across all six measurements, so the layer-0-dominance pattern is not a single-seed coincidence: it is consistent across seeds and across the early epochs in which $\|g_2\|$ remains above the $10^{-7}$ diagnostic-(b) floor. This is the per-seed evidence behind the Section~\ref{sec:mode2} claim that aggregate cosine on vanilla DFA can look mildly positive only because layer 0 carries the entire alignment budget.
+
 \section{Reproducibility}
 \label{app:reproducibility}
 
diff --git a/results/vanilla_dfa_early_ckpts/per_layer_cos_3seed.json b/results/vanilla_dfa_early_ckpts/per_layer_cos_3seed.json
new file mode 100644
index 0000000..b1c9de6
--- /dev/null
+++ b/results/vanilla_dfa_early_ckpts/per_layer_cos_3seed.json
@@ -0,0 +1,98 @@
+{
+  "s42_ep1": {
+    "per_layer_cos": [
+      0.4208017587661743,
+      0.004838745109736919,
+      -0.027907393872737885,
+      -0.038808517158031464,
+      -0.03807452693581581
+    ],
+    "per_layer_g_norm_median": [
+      1.2788983667633147e-06,
+      6.714879532410123e-07,
+      6.794414844080165e-07,
+      6.840860464762955e-07,
+      6.860727808088996e-07
+    ]
+  },
+  "s42_ep2": {
+    "per_layer_cos": [
+      0.4369885325431824,
+      -0.001831148169003427,
+      -0.039548110216856,
+      -0.05499383062124252,
+      -0.05395658314228058
+    ],
+    "per_layer_g_norm_median": [
+      5.726975587094785e-07,
+      1.5392966190574953e-07,
+      1.5485885285215772e-07,
+      1.5658912388971657e-07,
+      1.576423045435149e-07
+    ]
+  },
+  "s123_ep1": {
+    "per_layer_cos": [
+      0.43564510345458984,
+      0.008119785226881504,
+      -0.032616909593343735,
+      0.015984557569026947,
+      0.0165080688893795
+    ],
+    "per_layer_g_norm_median": [
+      1.1976044334005564e-06,
+      6.545930091306218e-07,
+      6.567059358530969e-07,
+      6.61169508475723e-07,
+      6.618395786972542e-07
+    ]
+  },
+  "s123_ep2": {
+    "per_layer_cos": [
+      0.4596974849700928,
+      0.005399191286414862,
+      -0.03668719530105591,
+      0.0029763614293187857,
+      0.0030912011861801147
+    ],
+    "per_layer_g_norm_median": [
+      5.662583362209261e-07,
+      1.351014162764841e-07,
+      1.350312288650457e-07,
+      1.36037257902899e-07,
+      1.3582508984200103e-07
+    ]
+  },
+  "s456_ep1": {
+    "per_layer_cos": [
+      0.4175308644771576,
+      0.010558882728219032,
+      -0.026187893003225327,
+      0.007104482501745224,
+      0.006094767712056637
+    ],
+    "per_layer_g_norm_median": [
+      1.0160714509765967e-06,
+      3.9226171111295116e-07,
+      3.8446077610387874e-07,
+      3.8279964087450935e-07,
+      3.864832649469463e-07
+    ]
+  },
+  "s456_ep2": {
+    "per_layer_cos": [
+      0.40920183062553406,
+      0.002915931399911642,
+      -0.038979340344667435,
+      0.0008528914768248796,
+      6.853311788290739e-06
+    ],
+    "per_layer_g_norm_median": [
+      4.5268734538694844e-07,
+      8.855825228692993e-08,
+      8.533425699397412e-08,
+      8.582557597947016e-08,
+      8.653171335026855e-08
+    ]
+  }
+}
\ No newline at end of file
