diff options
| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 14:51:58 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 14:51:58 -0500 |
| commit | 0abfa8e96afffd75cdb6a985603dceb55a284427 (patch) | |
| tree | 5543e407b39c0663eb4b3b6679ec392ae9c9b46a | |
| parent | 1c4672e465b9145d71cfddeae4e2ebaeeef2964f (diff) | |
Appendix L: upgrade DFA+pen nudging/drift/trajectory to 3-seed (matches SB/CB)
Ran DFA+pen via cifar_resmlp.py on s123, s456. 3-seed:
- acc 0.3601 ± 0.0014 (matches existing 0.363 from round 11)
- deep cos +0.1518 ± 0.0110 (matches existing +0.155 ± 0.025)
- deep nudge -5e-5 (was single-seed -6e-5)
- train Δloss 0.095 ± 0.007 (was single-seed 0.104)
- w2 drift 18.6 ± 0.5 (matches single-seed 18.8)
- embed drift 94.6 ± 1.4 (matches single-seed 92.7)
Single-seed s42 was within noise of the 3-seed mean across all 4 functional
metrics, so the cos-vs-acc dissociation story is unchanged — Appendix L
now just reports 3-seed for all 3 methods consistently.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| -rw-r--r-- | paper/main.pdf | bin | 486620 -> 486667 bytes | |||
| -rw-r--r-- | paper/main.tex | 2 |
2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf Binary files differindex 7d5dfea..da3099a 100644 --- a/paper/main.pdf +++ b/paper/main.pdf diff --git a/paper/main.tex b/paper/main.tex index 8486315..4dc3f1e 100644 --- a/paper/main.tex +++ b/paper/main.tex @@ -490,7 +490,7 @@ DFA+pen mean & $0.363 \pm 0.001$ & $4.0\times 10^4$ & $9.0\times 10^{-7}$ & $+0. \end{tabularx} \end{table} -The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.5$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.155$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs is $14.3\times$ for SB+penalty and $19.3\times$ for CB+penalty (a $35\%$ gap), and the embedding layer's relative drift is $7.1\times$ for SB versus $44.6\times$ for CB (a $6\times$ gap), so CB's per-block updates are not silenced under penalty and are in fact larger in magnitude than SB's, yet CB's final accuracy is $9.3$ percentage points lower. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. The nudging test at the same checkpoints provides the direct functional measurement: taking a small step of size $\eta{=}0.01$ in the direction of each method's per-layer credit $a_l$ decreases the test loss by $-1.78\times 10^{-3}$ on average over the deep blocks for SB+penalty, by $-0.45\times 10^{-3}$ for CB+penalty, and by only $-6\times 10^{-5}$ for DFA+penalty (single seed $42$, $30$-epoch run via the same training script). At the same per-layer credit direction, a step in SB's direction moves the loss about four times more than a step in CB's direction and about $30$ times more than a step in DFA's direction, even though CB's direction is more aligned with the BP gradient in angle than either. The $30$-epoch training trajectories give a third independent confirmation: SB+penalty's training loss falls from $2.047$ at epoch $1$ to $1.589$ at epoch $30$, a decrease of $0.458$, whereas CB+penalty's training loss falls by only $0.122$ and DFA+penalty's by only $0.104$ over the same $30$ epochs. Deep cosine ranks the three methods CB $>$ SB $>$ DFA, but every functional metric (nudging, integrated training-loss decrease, headline accuracy) ranks them SB $\gg$ CB $\approx$ DFA: the ordering produced by deep cosine is the only one that does not predict accuracy correctly. This is the strongest form of the cos-versus-accuracy dissociation: across three audited fixed-feedback methods under the same penalty intervention, the ranking implied by angular agreement with the BP gradient is contradicted by three independent functional measurements that do predict accuracy. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline. +The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.5$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.155$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs averages $14.3\times$ for SB+penalty, $18.6\times \pm 0.5$ for DFA+penalty, and $19.3\times$ for CB+penalty (three seeds each), and the embedding layer's relative drift is $7.1\times$ for SB versus $44.6\times$ for CB and $94.6\times \pm 1.4$ for DFA, so none of the three methods' per-block updates are silenced under penalty and CB's are in fact larger in magnitude than SB's while DFA's embedding updates are the largest of all, yet CB's and DFA's final accuracies are both $9.3$ percentage points below State Bridge's. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. The nudging test at the same checkpoints provides the direct functional measurement: taking a small step of size $\eta{=}0.01$ in the direction of each method's per-layer credit $a_l$ decreases the test loss by $-1.78\times 10^{-3}$ on average over the deep blocks for SB+penalty, by $-0.45\times 10^{-3}$ for CB+penalty, and by only $-5\times 10^{-5}$ for DFA+penalty (three seeds each, $30$-epoch runs via the same training script). At the same per-layer credit direction, a step in SB's direction moves the loss about four times more than a step in CB's direction and about thirty-five times more than a step in DFA's direction, even though CB's direction is more aligned with the BP gradient in angle than either. The $30$-epoch training trajectories give a third independent confirmation: SB+penalty's training loss falls from $2.047$ at epoch $1$ to $1.589$ at epoch $30$, a decrease of $0.458$, whereas CB+penalty's training loss falls by only $0.122$ and DFA+penalty's by only $0.095 \pm 0.007$ over the same $30$ epochs (three seeds). Deep cosine ranks the three methods CB $>$ SB $>$ DFA, but every functional metric (nudging, integrated training-loss decrease, headline accuracy) ranks them SB $\gg$ CB $\approx$ DFA: the ordering produced by deep cosine is the only one that does not predict accuracy correctly. This is the strongest form of the cos-versus-accuracy dissociation: across three audited fixed-feedback methods under the same penalty intervention, the ranking implied by angular agreement with the BP gradient is contradicted by three independent functional measurements that do predict accuracy. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline. \section{Reproducibility} \label{app:reproducibility} |
