From 04011a880cbd59ee022d566220bf7fd4818205e2 Mon Sep 17 00:00:00 2001
From: YurenHao0426
Date: Wed, 8 Apr 2026 20:00:22 -0500
Subject: paper v2.34.1: SB/CB training loss decrease values from 3-seed (were s42)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Same bug pattern as v2.33's nudging test: the §4 ¶4 training loss
decrease values for SB+pen (-0.458) and CB+pen (-0.122) were s42
single-seed numbers labeled as part of the "three seeds" framing.
DFA+pen (-0.095 ± 0.007) was actually 3-seed.

Re-aggregated from existing JSONs (no new compute):
  SB+pen:  per-seed {0.457, 0.444, 0.439} → 0.447 ± 0.008 (was 0.458)
  CB+pen:  per-seed {0.123, 0.118, 0.124} → 0.121 ± 0.003 (was 0.122)
  DFA+pen: per-seed {0.104, 0.088, 0.093} → 0.095 ± 0.007 ✓ (unchanged)

Changes:
- §4 ¶4 training-loss trajectory line now uses 3-seed mean ± std for
  all three methods
- Appendix L paragraph now lists per-seed decreases for all three
- New auditable file: results/training_loss_decrease_3seed.json

Ratios SB ≫ CB ≈ DFA unchanged. The "all three functional metrics
agree on the SB ≫ CB ≈ DFA ordering" claim is unchanged.

Page layout: §1-§7 still 9 pages, refs p10, total 19 pages. 0 overfull.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 paper/main.pdf | Bin 502309 -> 502399 bytes
 paper/main.tex | 4 +-
 results/training_loss_decrease_3seed.json | 68 ++++++++++++++++++++++++++++++
 3 files changed, 70 insertions(+), 2 deletions(-)
 create mode 100644 results/training_loss_decrease_3seed.json

diff --git a/paper/main.pdf b/paper/main.pdf
index ea62a49..975377b 100644
Binary files a/paper/main.pdf and b/paper/main.pdf differ
diff --git a/paper/main.tex b/paper/main.tex
index 013c7fa..8147436 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -98,7 +98,7 @@ A second metric with different numerical failure modes tells the same story. Cos
 Per-layer reporting is therefore not cosmetic.
In ResMLP under vanilla DFA, the headline aggregate alignment $\Gamma \approx 0.07$--$0.10$ can look mildly positive only because layer $0$ remains strongly aligned while the deep network is not: at the same epoch-1 checkpoints where layers $1$--$4$ are essentially zero, layer $0$ has cosine $+0.42$, $+0.44$, and $+0.42$ across seeds (Table~\ref{tab:mode_validation}; per-seed values in Appendix~\ref{app:layer0_dominance}). The resulting average can therefore be driven by the embedding layer even when the interior blocks are effectively unaligned, so aggregate reporting obscures the very distinction needed to separate ``measurement collapse'' from ``poor credit direction.'' This layer-$0$ dominance is specific to the ResMLP DFA setting; on ViT-Mini DFA, all layers are near zero, which strengthens the broader methodological point that alignment should be reported per layer rather than only in aggregate. With the two modes separated observationally, the remaining question is whether intervention can move them independently.
-Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same $\lambda{=}10^{-2}$ scale-control penalty to SB, CB, and DFA on the audited 4-block $d{=}256$ ResMLP for $30$ epochs (three seeds) gives, in order, test accuracies $0.453 \pm 0.003$, $0.360 \pm 0.003$, $0.360 \pm 0.001$ and deep mean cosines $+0.322 \pm 0.007$, $+0.679 \pm 0.008$, $+0.151 \pm 0.025$ (deep mean $\rho$ $+0.402$, $+0.464$, $+0.080$ and full $\|h_L\|/\|g_L\|$ in Appendix~\ref{app:sb_penalty}), all in the meaningful-measurement regime. SB+penalty is the first audited non-BP method whose trained deep blocks beat the frozen-blocks baseline ($0.349$), by $+10.4$ pp---comparable to BP+penalty's $+18.3$ pp. Within this rescued regime the three methods reveal a clean cosine-versus-accuracy dissociation, and two independent functional measurements rule out the interpretation that cosine is just noisy.
\emph{Nudging:} a single step $\eta{=}0.01$ along each method's per-layer credit $a_l$ at the converged checkpoint changes the deep-block test loss by $-1.93 \pm 0.11 \times 10^{-3}$ (SB+pen), $-4.26 \pm 0.24 \times 10^{-4}$ (CB+pen), and $-4.98 \pm 0.44 \times 10^{-5}$ (DFA+pen) across three seeds (per-seed values in Appendix~\ref{app:sb_penalty}): SB moves the loss $\approx\!4.5\times$ more than CB and $\approx\!39\times$ more than DFA, even though CB has the highest deep cosine with BP. \emph{Training-loss trajectory:} the integrated 30-epoch training loss decrease ranks SB ($-0.458$) $\gg$ CB ($-0.122$) $\approx$ DFA ($-0.095 \pm 0.007$). All three functional metrics (accuracy, nudging, training-loss trajectory) agree on SB $\gg$ CB $\approx$ DFA; the deep-cosine ordering CB $>$ SB $>$ DFA is the only one that disagrees. We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: CB has $4\times$ DFA's deep cosine yet matches DFA's accuracy, while SB attains the best accuracy with intermediate cosine. \emph{Inference}: layerwise cosine is necessary to rule out grossly wrong credit signals (distinguishing the rescued regime from the clamp-dominated vanilla one) but not sufficient to certify usable credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely on the angle to the BP gradient---CB supplies a gradient-direction surrogate that aligns in angle without translating to coordinated forward-state improvement, while SB supplies a state-level teaching signal that preserves aspects of useful credit which cosine does not measure. 
The same hypothesis casts Mode~1 as a downstream symptom of Mode~2 rather than a parallel failure: when $a_l$ cannot drive useful per-block forward-state change, the only easy way to increase $\langle f_l, a_l\rangle$ is to inflate $\|f_l\|$ along the cheap random direction set by $a_l$, producing Mode~1(a) growth and (via terminal LN) Mode~1(b) collapse; the per-block penalty then breaks the chain by capping $\|f_l\|$ without fixing credit quality, consistent with the observed asymmetry that it alleviates Mode~1 fully but only partially fixes Mode~2. We state this as a hypothesis because we have measured the angle-to-accuracy gap and two functional proxies but not the full per-block forward-state-change content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number.
+Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same $\lambda{=}10^{-2}$ scale-control penalty to SB, CB, and DFA on the audited 4-block $d{=}256$ ResMLP for $30$ epochs (three seeds) gives, in order, test accuracies $0.453 \pm 0.003$, $0.360 \pm 0.003$, $0.360 \pm 0.001$ and deep mean cosines $+0.322 \pm 0.007$, $+0.679 \pm 0.008$, $+0.151 \pm 0.025$ (deep mean $\rho$ $+0.402$, $+0.464$, $+0.080$ and full $\|h_L\|/\|g_L\|$ in Appendix~\ref{app:sb_penalty}), all in the meaningful-measurement regime. SB+penalty is the first audited non-BP method whose trained deep blocks beat the frozen-blocks baseline ($0.349$), by $+10.4$ pp---comparable to BP+penalty's $+18.3$ pp. Within this rescued regime the three methods reveal a clean cosine-versus-accuracy dissociation, and two independent functional measurements rule out the interpretation that cosine is just noisy.
\emph{Nudging:} a single step $\eta{=}0.01$ along each method's per-layer credit $a_l$ at the converged checkpoint changes the deep-block test loss by $-1.93 \pm 0.11 \times 10^{-3}$ (SB+pen), $-4.26 \pm 0.24 \times 10^{-4}$ (CB+pen), and $-4.98 \pm 0.44 \times 10^{-5}$ (DFA+pen) across three seeds (per-seed values in Appendix~\ref{app:sb_penalty}): SB moves the loss $\approx\!4.5\times$ more than CB and $\approx\!39\times$ more than DFA, even though CB has the highest deep cosine with BP. \emph{Training-loss trajectory:} the integrated 30-epoch training loss decrease across three seeds ranks SB ($-0.447 \pm 0.008$) $\gg$ CB ($-0.121 \pm 0.003$) $\approx$ DFA ($-0.095 \pm 0.007$). All three functional metrics (accuracy, nudging, training-loss trajectory) agree on SB $\gg$ CB $\approx$ DFA; the deep-cosine ordering CB $>$ SB $>$ DFA is the only one that disagrees. We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: CB has $4\times$ DFA's deep cosine yet matches DFA's accuracy, while SB attains the best accuracy with intermediate cosine. \emph{Inference}: layerwise cosine is necessary to rule out grossly wrong credit signals (distinguishing the rescued regime from the clamp-dominated vanilla one) but not sufficient to certify usable credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely on the angle to the BP gradient---CB supplies a gradient-direction surrogate that aligns in angle without translating to coordinated forward-state improvement, while SB supplies a state-level teaching signal that preserves aspects of useful credit which cosine does not measure. 
The same hypothesis casts Mode~1 as a downstream symptom of Mode~2 rather than a parallel failure: when $a_l$ cannot drive useful per-block forward-state change, the only easy way to increase $\langle f_l, a_l\rangle$ is to inflate $\|f_l\|$ along the cheap random direction set by $a_l$, producing Mode~1(a) growth and (via terminal LN) Mode~1(b) collapse; the per-block penalty then breaks the chain by capping $\|f_l\|$ without fixing credit quality, consistent with the observed asymmetry that it alleviates Mode~1 fully but only partially fixes Mode~2. We state this as a hypothesis because we have measured the angle-to-accuracy gap and two functional proxies but not the full per-block forward-state-change content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number.
 \section{Intervention and Cross-Architecture Evidence}
 \label{sec:validation}
@@ -517,7 +517,7 @@ DFA+pen mean & $0.360 \pm 0.001$ & $1.3\times 10^4$ & $1.6\times 10^{-6}$ & $+0.
 \end{tabularx}
 \end{table}
-The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.9$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives.
Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.151$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs averages $14.8\times \pm 0.5$ for SB+penalty, $18.6\times \pm 0.5$ for DFA+penalty, and $19.1\times \pm 0.6$ for CB+penalty (three seeds each), and the embedding layer's relative drift is $7.0\times \pm 0.1$ for SB versus $46.3\times \pm 1.5$ for CB and $94.6\times \pm 1.4$ for DFA, so none of the three methods' per-block updates are silenced under penalty and CB's are in fact larger in magnitude than SB's while DFA's embedding updates are the largest of all, yet CB's and DFA's final accuracies are both $9.3$ percentage points below State Bridge's. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. The nudging test at the same checkpoints provides the direct functional measurement: taking a single step of size $\eta{=}0.01$ in the direction of each method's per-layer credit $a_l$ at the converged checkpoint and measuring the resulting test-loss change averaged over the deep blocks (l1--l3 of the 4-block model) gives, across three seeds (42, 123, 456), $-1.93 \pm 0.11 \times 10^{-3}$ for SB+penalty (per-seed deep means $\{-1.78, -1.96, -2.05\}\times 10^{-3}$), $-4.26 \pm 0.24 \times 10^{-4}$ for CB+penalty (per-seed $\{-4.45, -3.93, -4.42\}\times 10^{-4}$), and $-4.98 \pm 0.44 \times 10^{-5}$ for DFA+penalty (per-seed $\{-5.53, -4.46, -4.95\}\times 10^{-5}$). 
At the same per-layer credit direction, a step in SB's direction moves the loss about $4.5\times$ more than a step in CB's direction and about $39\times$ more than a step in DFA's direction, even though CB's direction is more aligned with the BP gradient in angle than either. The full per-seed per-block nudging values are saved in \texttt{results/nudging\_test\_3seed\_summary.json}. The $30$-epoch training trajectories give a third independent confirmation: SB+penalty's training loss falls from $2.047$ at epoch $1$ to $1.589$ at epoch $30$, a decrease of $0.458$, whereas CB+penalty's training loss falls by only $0.122$ and DFA+penalty's by only $0.095 \pm 0.007$ over the same $30$ epochs (three seeds). Deep cosine ranks the three methods CB $>$ SB $>$ DFA, but every functional metric (nudging, integrated training-loss decrease, headline accuracy) ranks them SB $\gg$ CB $\approx$ DFA: the ordering produced by deep cosine is the only one that does not predict accuracy correctly. This is the strongest form of the cos-versus-accuracy dissociation: across three audited fixed-feedback methods under the same penalty intervention, the ranking implied by angular agreement with the BP gradient is contradicted by three independent functional measurements that do predict accuracy. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. 
This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline.
+The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.9$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.151$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2.
The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs averages $14.8\times \pm 0.5$ for SB+penalty, $18.6\times \pm 0.5$ for DFA+penalty, and $19.1\times \pm 0.6$ for CB+penalty (three seeds each), and the embedding layer's relative drift is $7.0\times \pm 0.1$ for SB versus $46.3\times \pm 1.5$ for CB and $94.6\times \pm 1.4$ for DFA, so none of the three methods' per-block updates are silenced under penalty and CB's are in fact larger in magnitude than SB's while DFA's embedding updates are the largest of all, yet CB's and DFA's final accuracies are both $9.3$ percentage points below State Bridge's. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. The nudging test at the same checkpoints provides the direct functional measurement: taking a single step of size $\eta{=}0.01$ in the direction of each method's per-layer credit $a_l$ at the converged checkpoint and measuring the resulting test-loss change averaged over the deep blocks (l1--l3 of the 4-block model) gives, across three seeds (42, 123, 456), $-1.93 \pm 0.11 \times 10^{-3}$ for SB+penalty (per-seed deep means $\{-1.78, -1.96, -2.05\}\times 10^{-3}$), $-4.26 \pm 0.24 \times 10^{-4}$ for CB+penalty (per-seed $\{-4.45, -3.93, -4.42\}\times 10^{-4}$), and $-4.98 \pm 0.44 \times 10^{-5}$ for DFA+penalty (per-seed $\{-5.53, -4.46, -4.95\}\times 10^{-5}$). At the same per-layer credit direction, a step in SB's direction moves the loss about $4.5\times$ more than a step in CB's direction and about $39\times$ more than a step in DFA's direction, even though CB's direction is more aligned with the BP gradient in angle than either. The full per-seed per-block nudging values are saved in \texttt{results/nudging\_test\_3seed\_summary.json}. 
The $30$-epoch training trajectories give a third independent confirmation: across three seeds, SB+penalty's training loss decreases by $0.447 \pm 0.008$ over the run (per seed $\{0.457, 0.444, 0.439\}$), whereas CB+penalty's decreases by only $0.121 \pm 0.003$ (per seed $\{0.123, 0.118, 0.124\}$) and DFA+penalty's by only $0.095 \pm 0.007$ (per seed $\{0.104, 0.088, 0.093\}$). Deep cosine ranks the three methods CB $>$ SB $>$ DFA, but every functional metric (nudging, integrated training-loss decrease, headline accuracy) ranks them SB $\gg$ CB $\approx$ DFA: the ordering produced by deep cosine is the only one that does not predict accuracy correctly. This is the strongest form of the cos-versus-accuracy dissociation: across three audited fixed-feedback methods under the same penalty intervention, the ranking implied by angular agreement with the BP gradient is contradicted by three independent functional measurements that do predict accuracy. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. 
That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline.
 \section{Layer-0 Dominance: Per-Seed Vanilla DFA Early-Epoch Cosines}
 \label{app:layer0_dominance}
diff --git a/results/training_loss_decrease_3seed.json b/results/training_loss_decrease_3seed.json
new file mode 100644
index 0000000..43202b9
--- /dev/null
+++ b/results/training_loss_decrease_3seed.json
@@ -0,0 +1,68 @@
+{
+  "description": "Training loss decrease (loss[ep1] - loss[ep30]) for SB+pen, CB+pen, DFA+pen 3-seed. Used in \u00a74 \u00b64 functional triangulation alongside nudging test.",
+  "per_method": {
+    "state_bridge": {
+      "per_seed": {
+        "42": {
+          "ep1_loss": 2.0474753201293945,
+          "ep30_loss": 1.5901478890609742,
+          "decrease": 0.45732743106842033
+        },
+        "123": {
+          "ep1_loss": 2.0394552979278564,
+          "ep30_loss": 1.5955050112533569,
+          "decrease": 0.44395028667449954
+        },
+        "456": {
+          "ep1_loss": 2.0369125997924806,
+          "ep30_loss": 1.5980215132904052,
+          "decrease": 0.4388910865020754
+        }
+      },
+      "three_seed_mean": 0.44672293474833175,
+      "three_seed_std_ddof0": 0.007777762456655643
+    },
+    "credit_bridge": {
+      "per_seed": {
+        "42": {
+          "ep1_loss": 1.9963660776519776,
+          "ep30_loss": 1.8736286736679078,
+          "decrease": 0.12273740398406985
+        },
+        "123": {
+          "ep1_loss": 1.9903950632476806,
+          "ep30_loss": 1.8727011511993408,
+          "decrease": 0.11769391204833979
+        },
+        "456": {
+          "ep1_loss": 1.9934451489257812,
+          "ep30_loss": 1.8695441330337523,
+          "decrease": 0.12390101589202884
+        }
+      },
+      "three_seed_mean": 0.12144411064147949,
+      "three_seed_std_ddof0": 0.0026940044894884965
+    },
+    "dfa": {
+      "per_seed": {
+        "42": {
+          "ep1_loss": 1.9962164908218383,
+          "ep30_loss": 1.8923273934555054,
+          "decrease": 0.1038890973663329
+        },
+        "123": {
+          "ep1_loss": 1.9917970180892943,
+          "ep30_loss": 1.9041575216674804,
+          "decrease": 0.0876394964218139
+        },
+        "456": {
+          "ep1_loss": 1.9882178707504272,
+          "ep30_loss": 1.8952873357391358,
+          "decrease": 0.09293053501129145
+        }
+      },
+      "three_seed_mean": 0.09481970959981274,
+      "three_seed_std_ddof0": 0.006767033726504009
+    }
+  }
+}
\ No newline at end of file
-- 
cgit v1.2.3
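Reviewer note: the aggregates in the new JSON can be re-derived from its own per-seed epoch-1 and epoch-30 losses. The sketch below is not the script that produced the file (that script is not part of this patch); the names `LOSSES`, `summarize`, and `SUMMARY` are hypothetical, and the loss values are copied by hand from the hunk above. It assumes the population standard deviation (ddof=0), as the key `three_seed_std_ddof0` suggests.

```python
import math

# Per-seed (epoch-1 loss, epoch-30 loss), copied from
# results/training_loss_decrease_3seed.json as added by this patch.
LOSSES = {
    "state_bridge": {
        42: (2.0474753201293945, 1.5901478890609742),
        123: (2.0394552979278564, 1.5955050112533569),
        456: (2.0369125997924806, 1.5980215132904052),
    },
    "credit_bridge": {
        42: (1.9963660776519776, 1.8736286736679078),
        123: (1.9903950632476806, 1.8727011511993408),
        456: (1.9934451489257812, 1.8695441330337523),
    },
    "dfa": {
        42: (1.9962164908218383, 1.8923273934555054),
        123: (1.9917970180892943, 1.9041575216674804),
        456: (1.9882178707504272, 1.8952873357391358),
    },
}


def summarize(per_seed):
    """Return (mean, population std) of loss[ep1] - loss[ep30] over seeds."""
    decreases = [ep1 - ep30 for ep1, ep30 in per_seed.values()]
    mean = sum(decreases) / len(decreases)
    # ddof=0: divide by n, not n - 1 (assumed from the JSON key name).
    variance = sum((d - mean) ** 2 for d in decreases) / len(decreases)
    return mean, math.sqrt(variance)


SUMMARY = {method: summarize(seeds) for method, seeds in LOSSES.items()}

for method, (mean, std) in SUMMARY.items():
    print(f"{method}: {mean:.3f} ± {std:.3f}")
```

With only three seeds, the sample standard deviation (ddof=1) would be larger by a factor of sqrt(3/2), so the paper's "±" values should be read as the ddof=0 convention.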