From 9ebaa25377996c8ad437856d68f515b6d0d64a36 Mon Sep 17 00:00:00 2001
From: YurenHao0426
Date: Wed, 8 Apr 2026 19:54:11 -0500
Subject: =?UTF-8?q?paper=20v2.33:=20promote=20nudging=20test=20to=20=C2=A7?=
 =?UTF-8?q?4=20main=20text=20+=20correct=20to=203-seed=20values?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

User flagged that the cos-vs-accuracy cross-method dissociation is the
paper's strongest new observation and the nudging-test functional
triangulation should be in §4 main text, not buried in Appendix L.

Also flagged that Appendix L's "three seeds each" claim was correct in
*labeling* but the cited values (-1.78e-3, -0.45e-3, -5e-5) were s42
single-seed. Re-aggregating from existing per-seed JSONs (no new compute
needed):

  results/round38_sbcb_penalty_30ep/results_cifar10.json (s42)
  results/round38_{sb,cb}_penalty_30ep_s{123,456}/results_cifar10.json
  results/round41_dfa_penalty_30ep{,_s{123,456}}/results_cifar10.json

3-seed deep-block nudging means (eta=0.01):
  SB+pen:  (-1.93 ± 0.11) × 10^-3  (was -1.78 × 10^-3, single seed)
  CB+pen:  (-4.26 ± 0.24) × 10^-4  (was -0.45 × 10^-3, single seed)
  DFA+pen: (-4.98 ± 0.44) × 10^-5  (was -5 × 10^-5, single seed)

Ratios (essentially unchanged):
  SB / CB:  4.5× (was ~4×)
  SB / DFA: 39×  (was ~35×)

Changes:
- §4 ¶4 NEW prose block: promotes the nudging test + training-loss
  decrease as two independent functional measurements that confirm the
  ordering SB ≫ CB ≈ DFA. Three functional metrics (accuracy, nudging,
  loss-trajectory) all agree; deep cosine is the only one that doesn't.
- §4 ¶4 setup compressed (drops some redundant per-method recital,
  references Appendix J for full numerics) to make room.
- Appendix L paragraph: nudging values updated to true 3-seed (with
  per-seed values listed), points at saved JSON.
- New auditable file: results/nudging_test_3seed_summary.json.

Page layout: main content still 9 pages exactly (§7 ends p9, refs p10).
Total now 19 pages (was 18): one extra appendix page from per-seed
nudging values.
9-page main content budget preserved.

This responds to user message: "the cos-vs-accuracy cross-method
dissociation is this paper's most substantial new observation... the
nudging numbers should go in Section 4 rather than being buried in the
appendix" and "the nudging test only has single seed 42... running three
seeds costs almost nothing" (turned out to be zero compute: the data was
already in the saved JSONs).

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 paper/main.pdf                          | Bin 501189 -> 502005 bytes
 paper/main.tex                          |   4 +-
 results/nudging_test_3seed_summary.json | 122 ++++++++++++++++++++++++++++++++
 3 files changed, 124 insertions(+), 2 deletions(-)
 create mode 100644 results/nudging_test_3seed_summary.json

diff --git a/paper/main.pdf b/paper/main.pdf
index 733d229..033851c 100644
Binary files a/paper/main.pdf and b/paper/main.pdf differ
diff --git a/paper/main.tex b/paper/main.tex
index 5543d6d..7dc3abe 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -98,7 +98,7 @@ A second metric with different numerical failure modes tells the same story. Cos
 Per-layer reporting is therefore not cosmetic. In ResMLP under vanilla DFA, the headline aggregate alignment $\Gamma \approx 0.07$--$0.10$ can look mildly positive only because layer $0$ remains strongly aligned while the deep network is not: at the same epoch-1 checkpoints where layers $1$--$4$ are essentially zero, layer $0$ has cosine $+0.42$, $+0.44$, and $+0.42$ across seeds (Table~\ref{tab:mode_validation}; per-seed values in Appendix~\ref{app:layer0_dominance}). The resulting average can therefore be driven by the embedding layer even when the interior blocks are effectively unaligned, so aggregate reporting obscures the very distinction needed to separate ``measurement collapse'' from ``poor credit direction.'' This layer-$0$ dominance is specific to the ResMLP DFA setting; on ViT-Mini DFA, all layers are near zero, which strengthens the broader methodological point that alignment should be reported per layer rather than only in aggregate.
With the two modes separated observationally, the remaining question is whether intervention can move them independently. -Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same per-block scale-control penalty $\lambda{=}10^{-2}$ that rescued DFA to State Bridge and to Credit Bridge on the same 4-block $d{=}256$ ResMLP backbone over $30$ epochs and three seeds gives converged test accuracies of $0.453 \pm 0.003$ (SB) and $0.360 \pm 0.003$ (CB), with deep mean cosines of $+0.322 \pm 0.007$ (SB) and $+0.679 \pm 0.008$ (CB) and deep mean $\rho$ of $+0.402 \pm 0.015$ (SB) and $+0.464 \pm 0.025$ (CB), while DFA under the same intervention reaches $0.360 \pm 0.001$ with deep cosine $+0.151 \pm 0.025$ and deep $\rho$ $+0.080 \pm 0.011$ (Table~\ref{tab:mode_validation}; Appendix~\ref{app:sb_penalty}). The State Bridge penalty rescue is roughly $24$ percentage points above the vanilla State Bridge baseline of $0.213$ on the same architecture and, more importantly for the paper's central walk-back, exceeds the architecture-matched frozen-blocks shallow baseline of $0.349$ by $+10.4$ percentage points. State Bridge with the penalty intervention is therefore the first audited non-BP method whose trained deep blocks substantively improve over an architecture-matched random-block baseline; the headline accuracy gap is comparable to BP+penalty's $+18.3$ pp over the same shallow baseline. Neither the activation scale nor the deep BP gradient magnitude is silenced under the penalty: $\|h_L\|$ stays at $302 \pm 8$ for SB and $5680 \pm 178$ for CB, with $\|g_L\|$ at $\sim\!1.8\times 10^{-4}$ and $\sim\!1.9\times 10^{-5}$ respectively, both well within the meaningful-measurement regime, so the recovered deep cosines are computed against an informative reference and not against a numerical floor. Within this rescued regime, the three methods reveal a clean cosine-versus-accuracy dissociation. 
Credit Bridge achieves roughly $4\times$ the deep cosine of DFA and $2\times$ that of State Bridge, yet its final accuracy matches DFA's and is $9$ percentage points below State Bridge's. We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: under the same intervention and matched training budget, CB and DFA reach the same accuracy despite a $4\times$ deep-cosine gap, while SB is the best accuracy with intermediate cosine. \emph{Inference}: layerwise cosine to the BP gradient is necessary to rule out grossly wrong credit signals (it distinguishes the rescued regime from the clamp-dominated vanilla regime), but it is not sufficient to certify that the supplied signal is useful credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely whether its direction is close to the BP gradient in angle. Under this reading, CB supplies a gradient-direction surrogate that aligns with BP in angle but does not translate to a coordinated forward-state improvement, while State Bridge supplies a state-level downstream teaching signal that preserves aspects of useful credit which layerwise cosine does not measure. We state this as a mechanism hypothesis rather than a theorem because we have measured the angle-to-accuracy gap but not the full functional-credit content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number. +Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. 
Applying the same $\lambda{=}10^{-2}$ scale-control penalty to State Bridge and to Credit Bridge on the audited 4-block $d{=}256$ ResMLP backbone over $30$ epochs and three seeds gives converged test accuracies of $0.453 \pm 0.003$ (SB), $0.360 \pm 0.003$ (CB), and $0.360 \pm 0.001$ (DFA), with deep mean cosines of $+0.322 \pm 0.007$ (SB), $+0.679 \pm 0.008$ (CB), and $+0.151 \pm 0.025$ (DFA), and deep mean $\rho$ of $+0.402$, $+0.464$, and $+0.080$ respectively (Appendix~\ref{app:sb_penalty}). State Bridge+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched frozen-blocks baseline ($0.349$), exceeding it by $+10.4$ pp---comparable to BP+penalty's $+18.3$ pp---while $\|h_L\|$ and $\|g_L\|$ stay in the meaningful-measurement regime ($\|h_L\|{\approx}302$ for SB, $5680$ for CB; $\|g_L\|{\approx}1.8{\times}10^{-4}$ and $1.9{\times}10^{-5}$). Within this rescued regime, the three methods reveal a clean cosine-versus-accuracy dissociation, and two independent functional measurements rule out the interpretation that this is just cosine being a noisy metric. First, taking a single step of size $\eta{=}0.01$ along each method's per-layer credit $a_l$ at the converged checkpoint and measuring the resulting test-loss change averaged over the deep blocks gives, across three seeds, $-1.93 \pm 0.11 \times 10^{-3}$ for SB+penalty, $-4.26 \pm 0.24 \times 10^{-4}$ for CB+penalty, and $-4.98 \pm 0.44 \times 10^{-5}$ for DFA+penalty (per-seed and per-block values in Appendix~\ref{app:sb_penalty}): SB's credit direction moves the loss about $4.5\times$ more than CB's and about $39\times$ more than DFA's, even though CB's direction has the highest cos with the BP gradient. Second, the integrated training loss decrease over the same $30$-epoch trajectories ranks SB ($-0.458$) $\gg$ CB ($-0.122$) $\approx$ DFA ($-0.095 \pm 0.007$, three seeds). 
All three functional metrics---headline accuracy, single-step nudging, and training-loss trajectory---agree on the ordering SB $\gg$ CB $\approx$ DFA, while the deep-cosine ordering CB $>$ SB $>$ DFA is the only one that disagrees. We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: CB has $4\times$ DFA's deep cosine yet matches DFA's accuracy, while SB attains the best accuracy with intermediate cosine. \emph{Inference}: layerwise cosine is necessary to rule out grossly wrong credit signals (distinguishing the rescued regime from the clamp-dominated vanilla one) but not sufficient to certify usable credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely on the angle to the BP gradient---CB supplies a gradient-direction surrogate that aligns in angle without translating to coordinated forward-state improvement, while SB supplies a state-level teaching signal that preserves aspects of useful credit which cosine does not measure. We state this as a hypothesis because we have measured the angle-to-accuracy gap and two functional proxies but not the full per-block forward-state-change content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number. \section{Intervention and Cross-Architecture Evidence} \label{sec:validation} @@ -517,7 +517,7 @@ DFA+pen mean & $0.360 \pm 0.001$ & $1.3\times 10^4$ & $1.6\times 10^{-6}$ & $+0. \end{tabularx} \end{table} -The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.9$ percentage points for DFA on the same architecture and intervention. 
SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.151$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs averages $14.8\times \pm 0.5$ for SB+penalty, $18.6\times \pm 0.5$ for DFA+penalty, and $19.1\times \pm 0.6$ for CB+penalty (three seeds each), and the embedding layer's relative drift is $7.0\times \pm 0.1$ for SB versus $46.3\times \pm 1.5$ for CB and $94.6\times \pm 1.4$ for DFA, so none of the three methods' per-block updates are silenced under penalty and CB's are in fact larger in magnitude than SB's while DFA's embedding updates are the largest of all, yet CB's and DFA's final accuracies are both $9.3$ percentage points below State Bridge's. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. 
The nudging test at the same checkpoints provides the direct functional measurement: taking a small step of size $\eta{=}0.01$ in the direction of each method's per-layer credit $a_l$ decreases the test loss by $-1.78\times 10^{-3}$ on average over the deep blocks for SB+penalty, by $-0.45\times 10^{-3}$ for CB+penalty, and by only $-5\times 10^{-5}$ for DFA+penalty (three seeds each, $30$-epoch runs via the same training script). At the same per-layer credit direction, a step in SB's direction moves the loss about four times more than a step in CB's direction and about thirty-five times more than a step in DFA's direction, even though CB's direction is more aligned with the BP gradient in angle than either. The $30$-epoch training trajectories give a third independent confirmation: SB+penalty's training loss falls from $2.047$ at epoch $1$ to $1.589$ at epoch $30$, a decrease of $0.458$, whereas CB+penalty's training loss falls by only $0.122$ and DFA+penalty's by only $0.095 \pm 0.007$ over the same $30$ epochs (three seeds). Deep cosine ranks the three methods CB $>$ SB $>$ DFA, but every functional metric (nudging, integrated training-loss decrease, headline accuracy) ranks them SB $\gg$ CB $\approx$ DFA: the ordering produced by deep cosine is the only one that does not predict accuracy correctly. This is the strongest form of the cos-versus-accuracy dissociation: across three audited fixed-feedback methods under the same penalty intervention, the ranking implied by angular agreement with the BP gradient is contradicted by three independent functional measurements that do predict accuracy. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. 
Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline. +The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.9$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.151$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. 
The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs averages $14.8\times \pm 0.5$ for SB+penalty, $18.6\times \pm 0.5$ for DFA+penalty, and $19.1\times \pm 0.6$ for CB+penalty (three seeds each), and the embedding layer's relative drift is $7.0\times \pm 0.1$ for SB versus $46.3\times \pm 1.5$ for CB and $94.6\times \pm 1.4$ for DFA, so none of the three methods' per-block updates are silenced under penalty and CB's are in fact larger in magnitude than SB's while DFA's embedding updates are the largest of all, yet CB's and DFA's final accuracies are both $9.3$ percentage points below State Bridge's. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. The nudging test at the same checkpoints provides the direct functional measurement: taking a single step of size $\eta{=}0.01$ in the direction of each method's per-layer credit $a_l$ at the converged checkpoint and measuring the resulting test-loss change averaged over the deep blocks (l1--l3 of the 4-block model) gives, across three seeds (42, 123, 456), $-1.93 \pm 0.11 \times 10^{-3}$ for SB+penalty (per-seed deep means $\{-1.78, -1.96, -2.05\}\times 10^{-3}$), $-4.26 \pm 0.24 \times 10^{-4}$ for CB+penalty (per-seed $\{-4.45, -3.93, -4.42\}\times 10^{-4}$), and $-4.98 \pm 0.44 \times 10^{-5}$ for DFA+penalty (per-seed $\{-5.53, -4.46, -4.95\}\times 10^{-5}$). At the same per-layer credit direction, a step in SB's direction moves the loss about $4.5\times$ more than a step in CB's direction and about $39\times$ more than a step in DFA's direction, even though CB's direction is more aligned with the BP gradient in angle than either. The full per-seed per-block nudging values are saved in \texttt{results/nudging\_test\_3seed\_summary.json}. 
The $30$-epoch training trajectories give a third independent confirmation: SB+penalty's training loss falls from $2.047$ at epoch $1$ to $1.589$ at epoch $30$, a decrease of $0.458$, whereas CB+penalty's training loss falls by only $0.122$ and DFA+penalty's by only $0.095 \pm 0.007$ over the same $30$ epochs (three seeds). Deep cosine ranks the three methods CB $>$ SB $>$ DFA, but every functional metric (nudging, integrated training-loss decrease, headline accuracy) ranks them SB $\gg$ CB $\approx$ DFA: the ordering produced by deep cosine is the only one that does not predict accuracy correctly. This is the strongest form of the cos-versus-accuracy dissociation: across three audited fixed-feedback methods under the same penalty intervention, the ranking implied by angular agreement with the BP gradient is contradicted by three independent functional measurements that do predict accuracy. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. 
That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline.
 \section{Layer-0 Dominance: Per-Seed Vanilla DFA Early-Epoch Cosines}
 \label{app:layer0_dominance}
diff --git a/results/nudging_test_3seed_summary.json b/results/nudging_test_3seed_summary.json
new file mode 100644
index 0000000..b970c59
--- /dev/null
+++ b/results/nudging_test_3seed_summary.json
@@ -0,0 +1,122 @@
+{
+  "description": "Nudging test 3-seed values for \u00a74 \u00b64 cross-method functional dissociation. Each value is the test-loss change after a single step of size eta=0.01 along the per-layer credit direction at the converged checkpoint.",
+  "eta": 0.01,
+  "epochs": 30,
+  "arch": "4-block d=256 pre-LN ResMLP",
+  "penalty_lam": 0.01,
+  "methods": {
+    "state_bridge": {
+      "per_seed": {
+        "42": {
+          "per_block": [
+            -0.0035966814029961824,
+            -0.0023783869110047817,
+            -0.0017011994495987892,
+            -0.0012592736165970564
+          ],
+          "deep_mean": -0.0017796199924002092,
+          "note": "deep = blocks l1, l2, l3 (excluding l0)"
+        },
+        "123": {
+          "per_block": [
+            -0.003914414905011654,
+            -0.002520129084587097,
+            -0.0018878313712775707,
+            -0.0014582867734134197
+          ],
+          "deep_mean": -0.0019554157430926957,
+          "note": "deep = blocks l1, l2, l3 (excluding l0)"
+        },
+        "456": {
+          "per_block": [
+            -0.004026009701192379,
+            -0.0028109778650105,
+            -0.001904117758385837,
+            -0.0014447440626099706
+          ],
+          "deep_mean": -0.002053279895335436,
+          "note": "deep = blocks l1, l2, l3 (excluding l0)"
+        }
+      },
+      "three_seed_deep_mean": -0.0019294385436094469,
+      "three_seed_deep_std_ddof0": 0.00011322116053216024
+    },
+    "credit_bridge": {
+      "per_seed": {
+        "42": {
+          "per_block": [
+            -0.0005118446424603462,
+            -0.0004658599500544369,
+            -0.00044331286335363984,
+            -0.00042594311526045203
+          ],
+          "deep_mean": -0.0004450386428895096,
+          "note": "deep = blocks l1, l2, l3 (excluding l0)"
+        },
+        "123": {
+          "per_block": [
+            -0.00045391899766400456,
+            -0.00041642854921519756,
+            -0.00038977732765488327,
+            -0.00037192515446804464
+          ],
+          "deep_mean": -0.00039271034377937514,
+          "note": "deep = blocks l1, l2, l3 (excluding l0)"
+        },
+        "456": {
+          "per_block": [
+            -0.00048537791008129716,
+            -0.00046242878306657076,
+            -0.00043993344297632575,
+            -0.00042223266791552305
+          ],
+          "deep_mean": -0.0004415316313194732,
+          "note": "deep = blocks l1, l2, l3 (excluding l0)"
+        }
+      },
+      "three_seed_deep_mean": -0.000426426872662786,
+      "three_seed_deep_std_ddof0": 2.3884137309066322e-05
+    },
+    "dfa": {
+      "per_seed": {
+        "42": {
+          "per_block": [
+            -0.00013115769252181053,
+            -5.455967038869858e-05,
+            -5.524198058992624e-05,
+            -5.596294067800045e-05
+          ],
+          "deep_mean": -5.525486388554176e-05,
+          "note": "deep = blocks l1, l2, l3 (excluding l0)"
+        },
+        "123": {
+          "per_block": [
+            -0.00010294892126694322,
+            -3.057112917304039e-05,
+            -4.6447094064205885e-05,
+            -5.68098621442914e-05
+          ],
+          "deep_mean": -4.460936179384589e-05,
+          "note": "deep = blocks l1, l2, l3 (excluding l0)"
+        },
+        "456": {
+          "per_block": [
+            -0.00014482333790510893,
+            -4.394981078803539e-05,
+            -4.4448592234402895e-05,
+            -5.99866034463048e-05
+          ],
+          "deep_mean": -4.946166882291436e-05,
+          "note": "deep = blocks l1, l2, l3 (excluding l0)"
+        }
+      },
+      "three_seed_deep_mean": -4.9775298167434004e-05,
+      "three_seed_deep_std_ddof0": 4.3516626110321815e-06
+    }
+  },
+  "ratios_3seed_means": {
+    "SB / CB": 4.5246645258569975,
+    "SB / DFA": 38.76297309398745,
+    "CB / DFA": 8.56703803618358
+  }
+}
\ No newline at end of file
-- 
cgit v1.2.3
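
The 3-seed aggregation described in the commit message can be sanity-checked with a short script. The sketch below is not part of the commit: the names `DEEP_MEANS` and `summarize` are hypothetical, and it assumes that the reported spread is the population standard deviation (`ddof=0`, as the JSON key `three_seed_deep_std_ddof0` suggests) taken over the three per-seed `deep_mean` values copied from `results/nudging_test_3seed_summary.json`.

```python
from statistics import fmean, pstdev

# Per-seed deep-block means (seeds 42, 123, 456), copied from the
# "deep_mean" fields of results/nudging_test_3seed_summary.json.
DEEP_MEANS = {
    "state_bridge": [
        -0.0017796199924002092,
        -0.0019554157430926957,
        -0.002053279895335436,
    ],
    "credit_bridge": [
        -0.0004450386428895096,
        -0.00039271034377937514,
        -0.0004415316313194732,
    ],
    "dfa": [
        -5.525486388554176e-05,
        -4.460936179384589e-05,
        -4.946166882291436e-05,
    ],
}


def summarize(per_seed):
    """Return the across-seed mean and population std (ddof=0, assumed)."""
    return {"mean": fmean(per_seed), "std_ddof0": pstdev(per_seed)}


summary = {method: summarize(vals) for method, vals in DEEP_MEANS.items()}

# Ratios of 3-seed means, mirroring the "ratios_3seed_means" block.
ratios = {
    "SB / CB": summary["state_bridge"]["mean"] / summary["credit_bridge"]["mean"],
    "SB / DFA": summary["state_bridge"]["mean"] / summary["dfa"]["mean"],
    "CB / DFA": summary["credit_bridge"]["mean"] / summary["dfa"]["mean"],
}

for method, stats in summary.items():
    print(f"{method}: {stats['mean']:.3e} +/- {stats['std_ddof0']:.2e}")
for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
```

With only three seeds, a sample standard deviation (`ddof=1`) would be about 22% larger than the population value used here, so the choice is worth stating wherever the ± figures are quoted.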