| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 12:22:58 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 12:22:58 -0500 |
| commit | c201cb31018b35bf88482f7dc768b8f7a057703b (patch) | |
| tree | d12e4640b4d0abef34c73f0f667f8a0eb026f794 | |
| parent | 35be969067396306c19a3caac2d887bcde48c5d0 (diff) | |
Round 41 (Appendix L): add per-block drift diagnostic reinforcing cos-vs-acc hypothesis
Extracted from existing round 38 JSON data without running new compute.
The drift field (||W_final - W_init||_F / ||W_init||_F) is produced by
cifar_resmlp.py's feature_drift() and was already saved but not used in
the paper.
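
As an illustration of the drift field defined above, here is a minimal sketch of a relative Frobenius drift computation. It is not the actual feature_drift() code from cifar_resmlp.py, which is not shown in this commit; the function name relative_drift and the list-of-rows matrix format are assumptions made for the example.

```python
import math


def relative_drift(w_init, w_final):
    """Relative Frobenius drift ||W_final - W_init||_F / ||W_init||_F.

    Hypothetical helper: both matrices are plain lists of rows with
    matching shapes. The real feature_drift() may differ.
    """
    diff_sq = 0.0  # accumulates ||W_final - W_init||_F^2
    init_sq = 0.0  # accumulates ||W_init||_F^2
    for row_init, row_final in zip(w_init, w_final):
        for a, b in zip(row_init, row_final):
            diff_sq += (b - a) ** 2
            init_sq += a ** 2
    if init_sq == 0.0:
        raise ValueError("initial weights have zero norm; drift is undefined")
    return math.sqrt(diff_sq / init_sq)


# Toy example: every weight moves from 1.0 to 3.0 on the diagonal.
w0 = [[1.0, 0.0], [0.0, 1.0]]
w1 = [[3.0, 0.0], [0.0, 3.0]]
drift = relative_drift(w0, w1)
```

Under this definition a drift of 14.3x means the weights moved 14.3 times their initial Frobenius norm away from initialization.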
Key finding: CB+penalty has LARGER block updates than SB+penalty
(per-block w2 drift 19.3x vs 14.3x; embed drift 44.6x vs 7.1x) yet
9.3 pp LOWER accuracy. This rules out 'CB just has smaller updates'
as an alternative explanation for the cos-vs-acc dissociation.
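
The gaps quoted above can be checked with a few lines of arithmetic. This is a sketch that uses only the drift values stated in this message; the dictionary layout is an assumption and does not reflect the round 38 JSON schema.

```python
# Drift multiples quoted in this commit message (relative to ||W_init||_F).
sb_penalty = {"w2": 14.3, "embed": 7.1}
cb_penalty = {"w2": 19.3, "embed": 44.6}

# Ratio of CB drift to SB drift for each parameter group.
ratios = {name: cb_penalty[name] / sb_penalty[name] for name in sb_penalty}

# Relative gap in per-block w2 drift, as a percentage.
w2_gap_percent = (ratios["w2"] - 1.0) * 100.0
```

The w2 ratio comes out near 1.35 (a gap of about 35%) and the embed ratio a little above 6, which matches the gaps quoted in the paper text in the diff below.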
Added 2 sentences to Appendix L paragraph 2 noting this supporting
evidence for the 'angular agreement does not certify functional
forward-state content' mechanism hypothesis in §4.
Main content still 9 pages exactly within E&D budget.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| -rw-r--r-- | paper/main.pdf | bin | 482361 -> 483268 bytes | |||
| -rw-r--r-- | paper/main.tex | 2 |
2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
Binary files differ
index 0414d31..0626715 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
diff --git a/paper/main.tex b/paper/main.tex
index faa22df..b71d023 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -485,7 +485,7 @@ DFA+pen mean (3 seeds) & $0.363 \pm 0.001$ & $4.0\times 10^4$ & $9.0\times 10^{-
 \end{tabular}
 \end{table}
 
-The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.5$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.155$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline.
+The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.5$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.155$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs is $14.3\times$ for SB+penalty and $19.3\times$ for CB+penalty (a $35\%$ gap), and the embedding layer's relative drift is $7.1\times$ for SB versus $44.6\times$ for CB (a $6\times$ gap), so CB's per-block updates are not silenced under penalty and are in fact larger in magnitude than SB's, yet CB's final accuracy is $9.3$ percentage points lower. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline.
 
 \section{Reproducibility}
 \label{app:reproducibility}
