| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 05:01:39 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 05:01:39 -0500 |
| commit | 2749679c46ab6d823aefa94e24162dc664a322f0 (patch) | |
| tree | e3408e091ddf8f74a6d9bf782bb025afa3585b75 | |
| parent | 1685abecc234d63daede821bc41d90a98c576528 (diff) | |
Round 30: fill in §5 Intervention and Cross-Arch prose (4 paragraphs) via codex
| -rw-r--r-- | paper/main.pdf | bin 437513 -> 442932 bytes |
| -rw-r--r-- | paper/main.tex | 8 |
2 files changed, 4 insertions, 4 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
Binary files differ
index 6fa7a0e..e4e233a 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
diff --git a/paper/main.tex b/paper/main.tex
index dc1ba7c..0f3b474 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -96,7 +96,7 @@ Per-layer reporting is therefore not cosmetic. In ResMLP under vanilla DFA, the
 \section{Intervention and Cross-Architecture Evidence}
 \label{sec:validation}

-Temporal replay shows that the protocol fires early enough to change experimental practice rather than merely re-describe final checkpoints. % TODO: evidence sentence % TODO: closing sentence
+The penalty intervention first matters as a rescue of the measurement regime. When we add a per-block penalty $\lambda \,\mathrm{mean}(\|f_l(h_l)\|^2)$ to DFA's local loss and train the 4-block $d{=}256$ ResMLP for 30 epochs on CIFAR-10, the $\lambda{=}10^{-2}$ setting brings the terminal hidden-state scale down from $\|h_L\| \sim 4.4\times 10^8$ under vanilla DFA to $\sim 4.0\times 10^4$, while lifting the deepest BP reference norm from $\|g_L\| \sim 5\times 10^{-10}$ to $\sim 9.0\times 10^{-7}$, a rescue of roughly four orders of magnitude in $\|h_L\|$ and more than three in $\|g_L\|$ (Figure~\ref{fig:penalty_rescue}; Table~\ref{tab:mode_validation}). At that setting, both diagnostic~(a) and diagnostic~(b) pass on penalized DFA, and test accuracy rises to $0.363 \pm 0.001$ from $0.308 \pm 0.014$ for vanilla DFA. The key point is not yet that the recovered network has good deep credit, but that the deep reference vector is again large enough to function as a meaningful target direction rather than a clamp-dominated artifact. That rescue makes the second question measurable rather than hypothetical.

 \begin{table}[t]
 \centering
@@ -115,11 +115,11 @@ Fresh-$B$ null control & $\overline{\cos}_{deep}{=}+0.002{\pm}0.022$ ($n{=}20$ d
 \end{tabular}
 \end{table}

-Cross-architecture validation shows that diagnostic (b) appears restricted to the terminal-normalized architectures we audited, while diagnostic (a) remains useful more broadly. % TODO: evidence sentence % TODO: closing sentence
+Once the reference vector is meaningful again, the deep layers no longer sit exactly at null. At $\lambda{=}10^{-2}$, penalized DFA reaches a three-seed deep-layer mean cosine of $+0.155 \pm 0.025$ and deep perturbation correlation of $+0.080 \pm 0.011$, whereas vanilla DFA is essentially zero on both metrics in the deep blocks, consistent with prior concerns that alternative feedback can fail by supplying poor credit directions even before full collapse \citep{bartunov2018assessing,moskovitz2018feedback,crafton2019backpropagation,refinetti2023align}. The null calibration rules out the interpretation that this recovered signal is merely measurement noise: on the same penalized checkpoint, replacing the training-time feedback matrices with 20 fresh random $B_l$ draws gives a deep cosine of only $+0.002 \pm 0.022$, with per-layer standard deviations of $0.013$--$0.023$, all within noise of zero (Table~\ref{tab:mode_validation}). The $\lambda$ sweep sharpens the dissociation further: at $\lambda{=}10^{-4}$, Mode~1 is already alleviated, with $\|h_L\|{=}2.4\times 10^4$ and $\|g_L\|{=}6.3\times 10^{-7}$, but deep cosine remains $-0.022$, while at $\lambda{=}10^{-2}$ it rises to $+0.165$ and deep $\rho$ to $+0.091$ (Figure~\ref{fig:penalty_rescue}). The improvement is real, but it is only partial.

-The residual-stream penalty intervention partially alleviates both failure modes but does not erase the remaining performance gap to BP. % TODO: evidence sentence % TODO: closing sentence
+A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty: BP falls from $0.609 \pm 0.004$ without the penalty to $0.530$ with $\lambda{=}10^{-2}$, so the penalty has a direct cost of about $8$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.308 \pm 0.014$ to $0.363 \pm 0.001$ under the same intervention (Figure~\ref{fig:penalty_rescue}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty still retains a margin of $+18.1$ points, while DFA+penalty retains only $+1.4$ points. The remaining gap, $0.530 - 0.363 = 0.167$ (about $17$ points), is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The residual gap after that control is what keeps Mode~2 substantively alive.

-Matched BP+penalty controls show that only part of DFA's deficit is attributable to the penalty's direct capacity cost, leaving a substantial residual consistent with poorer credit assignment. % TODO: evidence sentence % TODO: closing sentence
+The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (about $1.5\times 10^4$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). In the non-terminal-LN controls, the pattern is different: StudentNet shows diagnostic~(a) only at epochs~14--25 while diagnostic~(b) never fires across $100$ epochs and three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. This is an observational association rather than a causal identification of terminal LayerNorm as the unique mechanism, but it is enough to support a narrower claim: diagnostic~(b) appears tied to the terminal-LN architectures audited here, while diagnostic~(a) remains useful more broadly. This lets the paper end with a reporting rule rather than an overclaimed theory.

 \begin{figure}[t]
 \centering
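Reviewer note: the accuracy margins quoted in the added BP+penalty paragraph can be re-derived from the accuracies stated in the same paragraphs. The snippet below is only an arithmetic check, not part of the paper's code; the dictionary keys and the `points` helper are hypothetical names introduced here, and the inputs are the reported means with their uncertainties ignored.

```python
# Mean test accuracies as quoted in the added paragraphs (fractions of 1.0).
# Key names are illustrative only; they do not come from the repository.
ACC = {
    "bp": 0.609,
    "bp_penalty": 0.530,
    "dfa": 0.308,
    "dfa_penalty": 0.363,
    "frozen_blocks": 0.349,
}


def points(a: float, b: float) -> float:
    """Return a - b expressed in percentage points, rounded to one decimal."""
    return round((a - b) * 100, 1)


# Direct cost of the penalty when credit assignment is exact (BP).
penalty_cost_bp = points(ACC["bp"], ACC["bp_penalty"])
# Change in DFA accuracy under the same penalty.
penalty_gain_dfa = points(ACC["dfa_penalty"], ACC["dfa"])
# Margins over the frozen-blocks baseline.
margin_bp = points(ACC["bp_penalty"], ACC["frozen_blocks"])
margin_dfa = points(ACC["dfa_penalty"], ACC["frozen_blocks"])
# Residual gap between the two penalized runs.
residual_gap = points(ACC["bp_penalty"], ACC["dfa_penalty"])

for name, value in [
    ("BP penalty cost", penalty_cost_bp),
    ("DFA penalty gain", penalty_gain_dfa),
    ("BP+penalty margin over frozen blocks", margin_bp),
    ("DFA+penalty margin over frozen blocks", margin_dfa),
    ("BP+penalty minus DFA+penalty", residual_gap),
]:
    print(f"{name}: {value:+.1f} points")
```

Under these inputs the residual gap comes out at 16.7 points, which the prose rounds to 17, and the BP penalty cost at 7.9 points, which the prose rounds to 8.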
