| field | value | date |
|---|---|---|
| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 14:23:35 -0500 |
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 14:23:35 -0500 |
| commit | 3f4f9937829b2f7a0756ce870a0cc9f7b5e4a9f9 | |
| tree | 67dc324f293a292c4bd2caf84558697883e7f716 | |
| parent | 02d339f897eef1344f215f3035e78864688e6c6f | |
Polish: convert all 9 tables to tabularx for robust \linewidth fitting

User flagged Table 2 (mode validation) as overflowing. Root cause: the 'Deep-layer
alignment signal' column had long multi-term cosine+rho expressions under
plain 'lccc' column spec with no width constraint.

Fix:
- Added \usepackage{tabularx} and a raggedright L column type
- Converted all 9 tables from tabular to tabularx{\linewidth}{...}
- Table 1 (main audit): plain lcccc inside tabularx, fits width
- Table 2 (mode validation): first column p{0.18\linewidth}, three wrapping L columns
- Table 3 (protocol def): tight left p{}, wrapping L measurement column, two right p{} columns
- Table 4 (all validations): p{0.18\linewidth} + three L columns
- Tables 5-9 (appendices, numeric): @{\extracolsep{\fill}} with existing lrrr... specs

Also shortened 'DFA+pen mean (3 seeds)' label to 'DFA+pen mean' in Appendix L
table to eliminate a 19.5pt overfull on that row.

Result: 0 overfull hbox warnings (was several), main content still 9 pages
exactly within E&D budget, total 17 pages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
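
The two column patterns named in the commit message can be tried in isolation with a sketch like the one below. It is a minimal standalone document, not an excerpt from paper/main.tex: the `L` column type and the two column specs are copied from the diff, while the document class, captions, and every cell value are placeholders invented for illustration.

```latex
\documentclass{article}
\usepackage{booktabs}
\usepackage{tabularx}
% Ragged-right X column, as defined in the commit's preamble change.
\newcolumntype{L}{>{\raggedright\arraybackslash}X}

\begin{document}

% Text-heavy pattern (Tables 2 and 4 in the commit message):
% fixed-width first column, three wrapping L columns.
\begin{table}[t]
\small
\caption{Placeholder text table (illustrative content only).}
\begin{tabularx}{\linewidth}{@{}>{\raggedright\arraybackslash}p{0.18\linewidth}LLL@{}}
\toprule
Condition & Signal & Regime & Interpretation \\
\midrule
Example row & a long cell that wraps inside its column instead of overflowing
            & another long cell that also wraps & short \\
\bottomrule
\end{tabularx}
\end{table}

% Numeric pattern (Tables 5-9 in the commit message):
% natural-width columns, with the gaps between them stretched to \linewidth.
\begin{table}[t]
\small
\caption{Placeholder numeric table (illustrative content only).}
\begin{tabularx}{\linewidth}{@{\extracolsep{\fill}}lrrr@{}}
\toprule
method & a & b & c \\
\midrule
example & $1.0$ & $2.0$ & $3.0$ \\
\bottomrule
\end{tabularx}
\end{table}

\end{document}
```

In the first pattern the `L` columns share whatever width remains after the `p{0.18\linewidth}` column, so long cells wrap rather than push the table past the margin. The second pattern has no `X` column at all, so the column widths stay natural and only the inter-column space grows.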
| -rw-r--r-- | paper/main.pdf | bin | 485406 -> 485001 bytes |
| -rw-r--r-- | paper/main.tex | 40 | |

2 files changed, 21 insertions, 19 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
Binary files differ
index 7d64258..93dc6cb 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
diff --git a/paper/main.tex b/paper/main.tex
index bea72e2..5807ca8 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -8,6 +8,8 @@
 \usepackage{hyperref}
 \usepackage{url}
 \usepackage{booktabs}
+\usepackage{tabularx}
+\newcolumntype{L}{>{\raggedright\arraybackslash}X}
 \usepackage{amsfonts}
 \usepackage{amsmath}
 \usepackage{amssymb}
@@ -46,7 +48,7 @@ We begin with the smallest setting in which all methods can be compared head-to-
 \small
 \caption{Main audit table for the 4-block $d{=}256$ pre-LayerNorm ResMLP on CIFAR-10. The row and column structure is fixed here; fill from the three-seed audit output.}
 \label{tab:main_audit}
-\begin{tabular}{lcccc}
+\begin{tabularx}{\linewidth}{@{}lcccc@{}}
 \toprule
 Method & Test acc. & Headline $\Gamma$ & Status-quo verdict & Protocol verdict \\
 \midrule
@@ -56,7 +58,7 @@ DFA & $0.306 \pm 0.006$ & $0.10$ & trustworthy & walked back \\
 State Bridge & $0.205 \pm 0.032$ & $0.005$ & trustworthy & walked back \\
 Credit Bridge & $0.289 \pm 0.026$ & $0.07$ & trustworthy & walked back \\
 \bottomrule
-\end{tabular}
+\end{tabularx}
 \end{table}
 
 By the field's usual criteria, the non-BP methods appear to train to nontrivial accuracy and report nonzero alignment. In Table~\ref{tab:main_audit}, DFA reaches $0.306 \pm 0.006$ test accuracy with headline $\Gamma{=}0.10$, State Bridge reaches $0.205 \pm 0.032$ with $\Gamma{=}0.005$, and Credit Bridge reaches $0.289 \pm 0.026$ with $\Gamma{=}0.07$; none of these rows looks like an obvious invalidation if one is reading the usual pair of final accuracy and aggregate alignment in the style of prior FA reporting \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct}. Even the absolute scale does not itself force a walk-back, because all three methods are plainly above chance and all three report positive headline alignment rather than a visibly broken or undefined quantity. That reading is exactly what the rest of the paper overturns.
@@ -102,7 +104,7 @@ The penalty intervention first matters as a rescue of the measurement regime. Wh
 \small
 \caption{Two-mode validation table built around the intervention and disambiguation results.}
 \label{tab:mode_validation}
-\begin{tabular}{lccc}
+\begin{tabularx}{\linewidth}{@{}>{\raggedright\arraybackslash}p{0.18\linewidth}LLL@{}}
 \toprule
 Condition & Deep-layer alignment signal & Measurement regime & Interpretation \\
 \midrule
@@ -111,7 +113,7 @@ Vanilla DFA, converged & $\overline{\cos}_{deep}{=}{-}0.022$, $\overline{\rho}_{
 Penalized DFA, $\lambda{=}10^{-2}$ & $\overline{\cos}_{deep}{=}+0.155{\pm}0.025$, $\overline{\rho}_{deep}{=}+0.080{\pm}0.011$ & meaningful ($\|g\|{\sim}10^{-6}$) & partial alleviation of both modes \\
 Fresh-$B$ null control & $\overline{\cos}_{deep}{=}+0.002{\pm}0.022$ ($n{=}20$ draws) & meaningful & training-specific adaptation check \\
 \bottomrule
-\end{tabular}
+\end{tabularx}
 \end{table}
 
 Once the reference vector is meaningful again, the deep layers no longer sit exactly at null. At $\lambda{=}10^{-2}$, penalized DFA reaches a three-seed deep-layer mean cosine of $+0.155 \pm 0.025$ and deep perturbation correlation of $+0.080 \pm 0.011$, whereas vanilla DFA is essentially zero on both metrics in the deep blocks, consistent with prior concerns that alternative feedback can fail by supplying poor credit directions even before full collapse \citep{bartunov2018assessing,moskovitz2018feedback,crafton2019backpropagation,refinetti2023aligning}. The null calibration rules out the interpretation that this recovered signal is merely measurement noise: on the same penalized checkpoint, replacing the training-time feedback matrices with 20 fresh random $B_l$ draws gives a deep cosine of only $+0.002 \pm 0.022$, with per-layer standard deviations of $0.013$--$0.023$, all within noise of zero (Table~\ref{tab:mode_validation}). The $\lambda$ sweep sharpens the dissociation further: at $\lambda{=}10^{-4}$, Mode~1 is already alleviated, with $\|h_L\|{=}2.4\times 10^4$ and $\|g_L\|{=}6.3\times 10^{-7}$, but deep cosine remains $-0.022$, while at $\lambda{=}10^{-2}$ it rises to $+0.165$ and deep $\rho$ to $+0.091$ (Figure~\ref{fig:penalty_rescue}). The improvement is real, but it is only partial.
@@ -151,7 +153,7 @@ The reporting protocol begins with measurement validity. Before any FA paper rep
 \small
 \caption{Protocol definition table. Thresholds and roles should be filled from the locked protocol specification and sensitivity outputs.}
 \label{tab:protocol_def}
-\begin{tabular}{p{0.12\linewidth}p{0.42\linewidth}p{0.16\linewidth}p{0.20\linewidth}}
+\begin{tabularx}{\linewidth}{@{}>{\raggedright\arraybackslash}p{0.06\linewidth}L>{\raggedright\arraybackslash}p{0.16\linewidth}>{\raggedright\arraybackslash}p{0.22\linewidth}@{}}
 \toprule
 Diag. & Measurement & Default threshold & Role \\
 \midrule
@@ -160,7 +162,7 @@ Diag. & Measurement & Default threshold & Role \\
 (c) & Cross-batch direction stability of normalized BP gradients & $> 0.30$ & sub-mode discriminator \\
 (d) & Frozen-blocks baseline margin for trained blocks over random blocks & $< 2$pp & depth-utilization check \\
 \bottomrule
-\end{tabular}
+\end{tabularx}
 \end{table}
 
 The point of the protocol is not to add plots; it is to prevent a specific class of false conclusions. For this paper, the minimal protocol is four checks: per-layer activation scale via max-per-block growth, deepest hidden BP gradient floor, meaningful-regime per-layer credit quality, and an architecture-matched frozen-blocks baseline (Table~\ref{tab:protocol_def}). The first two ask whether the reference quantity is still valid; the third asks whether, once validity is restored, the deep blocks receive useful directions; and the fourth asks whether the trained depth is doing better than a model whose residual blocks were never trained at all. Figure~\ref{fig:decision_utility} makes the decision value explicit: accuracy alone walks back $0/5$ audited methods, accuracy plus headline $\Gamma$ still walks back $0/5$, and the full protocol walks back $3/5$ by flagging DFA, State Bridge, and Credit Bridge, with diagnostics (a), (b), and (d) each independently sufficient for binary detection on those failures. On our audit, these checks catch failures that accuracy plus aggregate alignment miss completely.
@@ -312,7 +314,7 @@ Table~\ref{tab:all_validations} lists the seven validation exercises that suppor
 \small
 \caption{Summary of the seven validation exercises used to justify the protocol.}
 \label{tab:all_validations}
-\begin{tabular}{p{0.18\linewidth}p{0.29\linewidth}p{0.23\linewidth}p{0.22\linewidth}}
+\begin{tabularx}{\linewidth}{@{}>{\raggedright\arraybackslash}p{0.18\linewidth}LLL@{}}
 \toprule
 Validation & Question & Main observation & Why it matters \\
 \midrule
@@ -324,7 +326,7 @@ Penalty intervention & Can mode 1 be alleviated without full rescue? & Measurabi
 Fresh-$B$ and BP+penalty controls & Are rescue effects training-specific? & Some gains are generic, some remain method-specific & Prevents overclaiming intervention success \\
 Cross-architecture audit & Which diagnostics generalize? & Activation growth generalizes more broadly than gradient-floor collapse & Scopes the claims correctly \\
 \bottomrule
-\end{tabular}
+\end{tabularx}
 \end{table}
 
 A useful way to read the table is that no single validation carries the paper by itself. The five-method audit shows that the problem exists, temporal replay shows that the protocol is actionable, intervention and null controls show that the two modes respond differently, and cross-architecture evidence shows which parts of the protocol are specific to terminal-normalized residual settings and which parts are more general.
@@ -353,7 +355,7 @@ To check whether the layerwise pattern in Figure~\ref{fig:audit_hero} is an arti
 \small
 \caption{Depth sweep on $d{=}512$ ResMLP, seed 42, 100 epochs CIFAR-10. \emph{layer-0 cos} is the embedding-block BP cosine, \emph{deep cos} is the mean BP cosine over the remaining $L{-}1$ blocks, and \emph{deep $\rho$} is the corresponding mean perturbation correlation. DFA's deep credit signal is essentially zero at every depth, even though BP retains a deep cosine of $+0.94$ at $L{=}12$.}
 \label{tab:depth_scan}
-\begin{tabular}{rlrrrr}
+\begin{tabularx}{\linewidth}{@{\extracolsep{\fill}}rlrrrr@{}}
 \toprule
 $L$ & method & test acc & layer-0 cos & deep cos & deep $\rho$ \\
 \midrule
@@ -373,7 +375,7 @@ $12$ & BP & $0.594$ & $+0.942$ & $+0.940$ & $+0.990$ \\
 $12$ & DFA & $0.309$ & $+0.388$ & $-0.000$ & $+0.000$ \\
 $12$ & Credit Bridge & $0.239$ & $+0.208$ & $+0.016$ & $+0.000$ \\
 \bottomrule
-\end{tabular}
+\end{tabularx}
 \end{table}
 
 The layerwise pattern is essentially depth-invariant. DFA's layer-0 cosine stays in $[+0.39,+0.40]$ across all five depths, while its mean deep cosine sits within $[-0.005,+0.000]$ and its deep $\rho$ collapses to numerical zero in every condition. Credit Bridge shows a slightly milder version of the same shape, with a small positive deep cosine that does not improve as depth shrinks. BP, by contrast, maintains a deep cosine of $+0.94$ even at $L{=}12$, so the BP reference is still measurably non-degenerate where DFA and Credit Bridge are flat. The $L{=}4$ row, which matches the main audit's architecture, has also been replicated across three seeds (42, 123, 456): 3-seed DFA layer-0 cosine is $+0.412 \pm 0.011$, 3-seed DFA deep cosine is $-0.0004 \pm 0.0008$, and 3-seed CB deep cosine is $+0.039 \pm 0.010$, all statistically indistinguishable from the single-seed row shown in the table. This rules out the explanation that DFA's deep blocks are merely too far from the loss to receive useful credit: making the network shallower does not reach the deep blocks any better. The failure is structural to the credit signal rather than an artifact of depth.
@@ -388,7 +390,7 @@ To test whether Mode~1 is specifically a property of the additive residual skip
 \small
 \caption{No-residual ResMLP-d256 ablation, seed 42, 3 epochs each. Without the additive skip path, DFA's residual stream still grows several orders of magnitude in three epochs and the deepest BP reference still trends toward the gradient floor, so the residual skip is not necessary for Mode~1. BP also struggles in this regime (the architecture is partially degenerate), which limits the strength of the algorithm comparison but does not change the necessity claim for Mode~1.}
 \label{tab:no_residual_smoke}
-\begin{tabular}{lrrrrrr}
+\begin{tabularx}{\linewidth}{@{\extracolsep{\fill}}lrrrrrr@{}}
 \toprule
 method & $w_2$ std & ep & $\|h_L\|$ & $\|g_L\|$ & test acc & gamma\_dfa \\
 \midrule
@@ -401,7 +403,7 @@ DFA & $0.5$ & $1$ & $5{,}295$ & $8.6\times 10^{-7}$ & $0.156$ & $0.047$ \\
 DFA & $0.5$ & $2$ & $16{,}930$ & $2.2\times 10^{-7}$ & $0.151$ & $0.040$ \\
 DFA & $0.5$ & $3$ & $22{,}050$ & $1.6\times 10^{-7}$ & $0.148$ & $0.039$ \\
 \bottomrule
-\end{tabular}
+\end{tabularx}
 \end{table}
 
 The qualitative shape matches what we see in vanilla residual DFA, only with a slower onset because the architecture itself is harder to train. Diagnostic~(a) clearly fires within three epochs, and diagnostic~(b) is already on the floor side of $10^{-7}$. Across $w_2$ std values $\{0.1,0.2,0.5\}$ that we tried in the same smoke sweep, the qualitative outcome is the same: residual stream grows by three to four orders of magnitude, $\|g_L\|$ drops by three to four orders of magnitude, and BP itself never reaches a healthy training regime. We retain $w_2{=}0.5$ here because that is the only value where BP is at least beginning to learn. The full $100$-epoch trajectory of the same configuration, replicated across three seeds (42, 123, 456), converges to a mean $\|h_L\| \approx 8.2\times 10^7$ and mean $\|g_L\| \approx 1.9\times 10^{-10}$ (per-seed values $\|h_L\| \in \{1.06\times 10^8, 3.15\times 10^7, 1.09\times 10^8\}$ and $\|g_L\| \in \{1.08, 2.94, 1.77\}\times 10^{-10}$), all deeply below the diagnostic~(b) floor and within an order of magnitude of vanilla residual DFA's $\|h_L\| \approx 4\times 10^8$ and $\|g_L\| \approx 5\times 10^{-10}$ on the same backbone, confirming that the smoke-test trend is the converged behavior rather than an early-training artifact.
@@ -418,7 +420,7 @@ To test whether Mode~1 activation growth requires any task signal at all, we re-
 \small
 \caption{Random-target ablation, DFA on the standard residual ResMLP-d256, seed 42, three epochs of training with i.i.d.\ random class targets refreshed every minibatch. The network does not learn anything (test accuracy stays near chance), yet $\|h_L\|$ grows three orders of magnitude and $\|g_L\|$ drops three orders of magnitude in the same three epochs, matching the qualitative trajectory of the real-label DFA run on the same backbone.}
 \label{tab:random_targets_smoke}
-\begin{tabular}{rrrrr}
+\begin{tabularx}{\linewidth}{@{\extracolsep{\fill}}rrrrr@{}}
 \toprule
 ep & $\|h_L\|$ & $\|g_L\|$ & test acc & gamma\_dfa \\
 \midrule
@@ -427,7 +429,7 @@ $1$ & $1{,}616$ & $5.12\times 10^{-6}$ & $0.078$ & $-0.020$ \\
 $2$ & $9{,}768$ & $8.50\times 10^{-7}$ & $0.081$ & $-0.024$ \\
 $3$ & $14{,}510$ & $5.62\times 10^{-7}$ & $0.071$ & $-0.025$ \\
 \bottomrule
-\end{tabular}
+\end{tabularx}
 \end{table}
 
 This ablation answers the natural counterargument that DFA's residual-stream growth might be a side-effect of the network adapting to genuine task signal in a particularly bad local minimum: it is not. With no task signal at all, DFA on this architecture still inflates the residual stream by more than three orders of magnitude in the first three epochs and pushes the deepest BP reference gradient to the floor of $10^{-7}$ in the same window. The full $100$-epoch trajectory of the same DFA random-target run converges to $\|h_L\| \approx 1.67\times 10^8$ and $\|g_L\| \approx 8.0\times 10^{-12}$, both more extreme than the corresponding endpoints of vanilla DFA on the same backbone with real labels (about $4\times 10^8$ and $5\times 10^{-10}$ respectively), so the data-agnostic trajectory does not just reach Mode~1 but in fact passes through the same regime even without any per-sample task pressure. The local DFA objective $\langle f_l(h_l),\, e_T B_l^\top\rangle$ contains no penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output increases inner-product alignment with the fixed feedback target is rewarded; the random-target run isolates exactly this geometric incentive, free of any task-driven feature pressure. The full $100$-epoch trajectory of this random-target run is reported as a confirmatory check rather than a primary claim.
@@ -439,7 +441,7 @@ We then asked whether this data-agnostic growth is specific to DFA or generalize
 \small
 \caption{Random-target ablation across the three audited fixed-feedback local-credit methods on the standard residual ResMLP-d256, seed 42, three epochs of training with i.i.d.\ random class targets. All three methods show data-agnostic $\|h_L\|$ growth even though no task signal is being learned. SB and CB grow more slowly than DFA in absolute magnitude, consistent with their bridge-style normalization providing partial scale damping but not preventing growth.}
 \label{tab:random_targets_sbcb_smoke}
-\begin{tabular}{lrrr}
+\begin{tabularx}{\linewidth}{@{\extracolsep{\fill}}lrrr@{}}
 \toprule
 method & $\|h_L\|$ at ep 3 & $\|g_L\|$ at ep 3 & test acc \\
 \midrule
@@ -447,7 +449,7 @@ DFA & $14{,}510$ & $5.6\times 10^{-7}$ & $0.071$ \\
 State Bridge & $6{,}225$ & $1.0\times 10^{-5}$ & $0.104$ \\
 Credit Bridge & $19{,}974$ & $3.2\times 10^{-6}$ & $0.092$ \\
 \bottomrule
-\end{tabular}
+\end{tabularx}
 \end{table}
 
 The cross-method version of the test rules out the explanation that the random-target growth is specific to DFA's particular feedback projection. State Bridge and Credit Bridge use bridge constructions with target normalization and stop-gradients, so any residual-stream growth they exhibit cannot be attributed to a simple absence of normalization. Their $\|g_L\|$ values at three epochs are still well above the $10^{-7}$ floor used by diagnostic~(b), so the gradient collapse part of Mode~1 does not yet appear at this horizon for SB/CB; the activation-growth part of Mode~1 is already present. At the full $100$-epoch trajectory of the same random-target protocol, both SB and CB also reach the (b) floor: SB converges to $\|h_L\|\approx 3.6\times 10^5$ and $\|g_L\|\approx 4\times 10^{-8}$, and CB converges to $\|h_L\|\approx 1.38\times 10^8$ and $\|g_L\|\approx 0$ (below the numerical clamp), with test accuracies $0.100$ and $0.085$ respectively, consistent with DFA's $1.67\times 10^8$ and $8.0\times 10^{-12}$ at the same horizon. We treat this as evidence that the local-credit growth incentive is not unique to DFA but is shared by the audited family of fixed-feedback methods.
@@ -464,7 +466,7 @@ To test whether the per-block scale-control penalty $\lambda \,\mathrm{mean}(\|f
 \small
 \caption{State Bridge with the same per-block scale-control penalty $\lambda{=}10^{-2}$ that rescues DFA in Section~\ref{sec:validation}, on the 4-block $d{=}256$ pre-LayerNorm ResMLP, 30 epochs, three seeds. SB+penalty reaches a converged test accuracy of $0.453 \pm 0.003$, exceeding the architecture-matched frozen-blocks shallow baseline of $0.349$ by $+10.4$ percentage points and the DFA+penalty value of $0.363 \pm 0.001$ by $+9.0$ percentage points. The deep mean cosine and deep mean perturbation correlation are roughly $2\times$ and $5\times$ the corresponding DFA+penalty values respectively, while the residual stream is contained but not silenced ($\|h_L\|\!\approx\!302$, $\|g_L\|\!\approx\!1.8\times 10^{-4}$). Vanilla SB on the same architecture and seed reaches only $0.213$, with $\|h_L\|\!\approx\!9.85\times 10^6$ and $\|g_L\|$ at the diagnostic-(b) floor.}
 \label{tab:sb_penalty}
-\begin{tabular}{lrrrrr}
+\begin{tabularx}{\linewidth}{@{\extracolsep{\fill}}lrrrrr@{}}
 \toprule
 seed & test acc & $\|h_L\|$ & $\|g_L\|$ & deep cos & deep $\rho$ \\
 \midrule
@@ -480,9 +482,9 @@ CB+pen mean & $0.360 \pm 0.003$ & $5680 \pm 178$ & $1.90\times 10^{-5}$ & $+0.67
 \midrule
 vanilla SB $42$ & $0.213$ & $9.85\times 10^6$ & $1\times 10^{-8}$ & --- & --- \\
 vanilla CB $42$ & $0.211$ & $6.7\times 10^7$ & $\sim 0$ & --- & --- \\
-DFA+pen mean (3 seeds) & $0.363 \pm 0.001$ & $4.0\times 10^4$ & $9.0\times 10^{-7}$ & $+0.155 \pm 0.025$ & $+0.080 \pm 0.011$ \\
+DFA+pen mean & $0.363 \pm 0.001$ & $4.0\times 10^4$ & $9.0\times 10^{-7}$ & $+0.155 \pm 0.025$ & $+0.080 \pm 0.011$ \\
 \bottomrule
-\end{tabular}
+\end{tabularx}
 \end{table}
 
 The penalty rescue effect on State Bridge is much larger than on DFA: $+24$ percentage points for State Bridge versus $+5.5$ percentage points for DFA on the same architecture and intervention. SB+penalty is the first audited non-BP method whose trained deep blocks substantively beat the architecture-matched random-block baseline. We treat this as evidence that Mode~2 (low intrinsic credit-direction quality) has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated, rather than being a uniform property of all fixed-feedback local-credit objectives. Importantly, State Bridge's deep cosine $+0.322$ is approximately twice DFA's $+0.155$ on the same intervention, but neither approaches the BP reference value of $\approx +1.0$, so this is a within-class gradation in credit-direction quality, not a claim that bridge constructions ``solve'' Mode~2. The drift diagnostic reinforces this reading rather than contradicting it: per-block $w_2$ relative displacement after $30$ epochs is $14.3\times$ for SB+penalty and $19.3\times$ for CB+penalty (a $35\%$ gap), and the embedding layer's relative drift is $7.1\times$ for SB versus $44.6\times$ for CB (a $6\times$ gap), so CB's per-block updates are not silenced under penalty and are in fact larger in magnitude than SB's, yet CB's final accuracy is $9.3$ percentage points lower. The larger-but-less-useful parameter updates in CB are consistent with the mechanism hypothesis that angular agreement with the BP gradient does not by itself certify the functional forward-state content of the update. The nudging test at the same checkpoints provides the direct functional measurement: taking a small step of size $\eta{=}0.01$ in the direction of each method's per-layer credit $a_l$ decreases the test loss by $-1.78\times 10^{-3}$ on average over the deep blocks for SB+penalty, by $-0.45\times 10^{-3}$ for CB+penalty, and by only $-6\times 10^{-5}$ for DFA+penalty (single seed $42$, $30$-epoch run via the same training script). At the same per-layer credit direction, a step in SB's direction moves the loss about four times more than a step in CB's direction and about $30$ times more than a step in DFA's direction, even though CB's direction is more aligned with the BP gradient in angle than either. The $30$-epoch training trajectories give a third independent confirmation: SB+penalty's training loss falls from $2.047$ at epoch $1$ to $1.589$ at epoch $30$, a decrease of $0.458$, whereas CB+penalty's training loss falls by only $0.122$ and DFA+penalty's by only $0.104$ over the same $30$ epochs. Deep cosine ranks the three methods CB $>$ SB $>$ DFA, but every functional metric (nudging, integrated training-loss decrease, headline accuracy) ranks them SB $\gg$ CB $\approx$ DFA: the ordering produced by deep cosine is the only one that does not predict accuracy correctly. This is the strongest form of the cos-versus-accuracy dissociation: across three audited fixed-feedback methods under the same penalty intervention, the ranking implied by angular agreement with the BP gradient is contradicted by three independent functional measurements that do predict accuracy. Under the same intervention Credit Bridge reaches a three-seed test accuracy of $0.360 \pm 0.003$, a three-seed deep mean cosine of $+0.679 \pm 0.008$, and a three-seed deep mean $\rho$ of $+0.464 \pm 0.025$, with $\|h_L\|\approx 5680 \pm 178$ and $\|g_L\|\approx 1.9\times 10^{-5}$ well above the diagnostic floor. Credit Bridge therefore has an even higher deep cosine than State Bridge (about $4\times$ the DFA value and roughly $2\times$ the State Bridge value), but reaches the same final accuracy as DFA+penalty and $9.3$ percentage points below State Bridge+penalty. This is a clean dissociation: within the audited fixed-feedback family under the same rescue, deep cosine and deep $\rho$ differ by more than a factor of four across methods without tracking final accuracy in the same direction, so alignment to the BP gradient is a necessary but not sufficient diagnostic of usable credit for depth. That cross-method dissociation is a direct reason the protocol in Section~\ref{sec:protocol} keeps final accuracy, layerwise credit quality, and the depth-utilization baseline as three separate reporting axes rather than collapsing them into a single headline.
