diff options
| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 15:55:04 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 15:55:04 -0500 |
| commit | 9343b29f358cb963dd224d9524e7fd55e1a8b05b (patch) | |
| tree | ad4c7f56e401180d051f9db7f97cb808080f38ba | |
| parent | 8edd4505568ef327eb72be2c5c57d24439b36986 (diff) | |
Reviewer-concern batch: ρ formula + LN Jacobian derivation + diagnostic (c) formula + threshold pointer + hyperparameter fairness clause
Addressed 4 secondary reviewer concerns from the user's earlier list, all
small inline additions:
1. §3 ¶1 LN Jacobian: extended the 1-line claim into a 2-line derivation.
For y = LN(h) = (h-μ)/σ with σ ∝ ||h||/√d, ||∂y/∂h|| = Θ(1/σ),
so ||g_L|| = Θ(1/||h_L||). Connects the (a) growth and (b) collapse
formally.
2. §4 ¶2 ρ formal definition: added the inline formula
ρ_l = Pearson(<a_l, εv>, ℓ(h_l + εv) - ℓ(h_l)) over M=32 random
unit-norm directions v with ε=1e-3, evaluated per sample on a fixed
eval batch and averaged. Previously this was narrative-only.
3. §6 ¶3 diagnostic (c) cross-batch stability: added inline definition as
the mean pairwise cosine of per-batch-averaged BP-grad direction at
the chosen layer across K≥8 disjoint 128-sample minibatches, with
the empirical separation (drift 0.5-0.99 vs healthy 0.05-0.18).
4. §6 ¶3 threshold sensitivity pointer: added (Appendix~\ref{app:threshold_sweep})
pointer next to the (a)/(b) calibration claim.
5. §2 ¶1 hyperparameter fairness: changed 'against the same architecture,
optimizer, and training budget' to 'against the identical architecture,
optimizer, schedule, and training budget without method-specific tuning'
and added 'batch size 128'. Closes the 'fairness asserted but not
evidenced' reviewer concern.
Page budget: each addition ate ~1-2 lines. Net push was ~9 lines, which
spilled §7 onto p10. Recovered by:
- Shrinking Figure 3 (penalty rescue) from \linewidth to 0.92\linewidth
- Shrinking Figure 4 (cross_arch_summary) from 0.78\linewidth to 0.7\linewidth
- Compressing diagnostic (c) clause (kept the formula intent without all
the LaTeX math symbols inline)
- Trimming §7 closing sentence: dropped 'main lesson is to decompose'
preamble; merged 'a reporting rule' phrase into the same sentence as
the methodology-line citations
Result: §1-§7 + all figures fit strictly in 9 pages (verified via pdftotext;
p9 ends with §7 closing sentence + page number '9'; p10 starts with
References). Total 18 pages, 0 overfull hbox.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| -rw-r--r-- | paper/main.pdf | bin | 492424 -> 496433 bytes | |||
| -rw-r--r-- | paper/main.tex | 14 |
2 files changed, 7 insertions, 7 deletions
diff --git a/paper/main.pdf b/paper/main.pdf Binary files differindex bd5018c..af404f1 100644 --- a/paper/main.pdf +++ b/paper/main.pdf diff --git a/paper/main.tex b/paper/main.tex index cb27a47..1e561c4 100644 --- a/paper/main.tex +++ b/paper/main.tex @@ -44,7 +44,7 @@ Accordingly, this paper does not introduce a new FA variant or a new benchmark. \section{Audit: Standard Reporting Walks Back Nothing} \label{sec:audit} -Table~\ref{tab:main_audit} fixes the canonical audit to a 4-block pre-LayerNorm ResMLP with width $d{=}256$ on CIFAR-10, trained for 100 epochs with AdamW (learning rate $10^{-3}$, weight decay $0.01$), a cosine schedule, and three seeds (42, 123, 456); all five methods are read against the same architecture, optimizer, and training budget, and Figure~\ref{fig:audit_hero} summarizes the corresponding per-block growth, deepest-layer BP reference norm, cross-batch stability, and frozen-baseline comparison. +Table~\ref{tab:main_audit} fixes the canonical audit to a 4-block pre-LayerNorm ResMLP with width $d{=}256$ on CIFAR-10, trained for 100 epochs with AdamW (learning rate $10^{-3}$, weight decay $0.01$), a cosine schedule, batch size $128$, and three seeds (42, 123, 456); all five methods are read against the identical architecture, optimizer, schedule, and training budget without method-specific tuning, and Figure~\ref{fig:audit_hero} summarizes the corresponding per-block growth, deepest-layer BP reference norm, cross-batch stability, and frozen-baseline comparison. \begin{table}[t] \centering @@ -83,7 +83,7 @@ When we compare each method to a frozen-blocks baseline matched to the same arch \section{Failure Mode 1: Measurement Degeneracy} \label{sec:mode1} -Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge each update $f_l$ by reducing a local loss of the form $-\langle f_l(h_l),\, a_l\rangle$, where the per-layer credit vector $a_l$ is the method-specific projection of the output error (for DFA, $a_l = B_l^\top e_T$ with a fixed random $B_l$; for State Bridge, $a_l$ is the gradient of a cross-entropy loss measured through a learned state predictor $G_\psi(h_l,t_l,s)$ that estimates $h_L$; for Credit Bridge, $a_l$ is the gradient of a learned value network $V(h_l,t_l,s)$). None of these three local losses contains a penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the method's fixed or learned credit target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian: in terminal-LN architectures $\partial\mathrm{LN}(h)/\partial h \propto 1/\|h\|$ in expectation, so the same residual-stream inflation is accompanied by collapse of the hidden-layer BP reference norm. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives $\|h_L\|$ from about $9$ at initialization to about $4\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $5\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to a meaningful BP direction. +Mode~1 has two parts. The activation-growth part~(a) is a scale pathology of fixed-feedback local-credit objectives without an effective scale-control term: for block $l$, DFA, State Bridge, and Credit Bridge each update $f_l$ by reducing a local loss of the form $-\langle f_l(h_l),\, a_l\rangle$, where the per-layer credit vector $a_l$ is the method-specific projection of the output error (for DFA, $a_l = B_l^\top e_T$ with a fixed random $B_l$; for State Bridge, $a_l$ is the gradient of a cross-entropy loss measured through a learned state predictor $G_\psi(h_l,t_l,s)$ that estimates $h_L$; for Credit Bridge, $a_l$ is the gradient of a learned value network $V(h_l,t_l,s)$). None of these three local losses contains a penalty on $\|f_l(h_l)\|$, so any direction in which a larger block output improves inner-product alignment with the method's fixed or learned credit target is rewarded; in a pre-LN residual stack, larger block outputs directly increase residual-stream scale, and terminal LayerNorm at the output removes task-loss sensitivity to that scale, so the architecture supplies no global restraint on the local growth incentive. The gradient-floor part~(b) follows from the LayerNorm Jacobian. For $y = \mathrm{LN}(h) = (h - \mu(h))/\sigma(h)$ with $\sigma(h) = \big(\tfrac{1}{d}\sum_i (h_i - \mu(h))^2\big)^{1/2}$ proportional to $\|h\|/\sqrt{d}$, the spectral norm of $\partial y/\partial h$ is $\Theta(1/\sigma(h))$, so back-propagating through terminal LayerNorm scales the deepest hidden BP gradient as $\|g_L\| = \Theta(1/\|h_L\|)$, and the same residual-stream inflation that drives diagnostic~(a) drives a proportional collapse of the diagnostic~(b) reference. Empirically, on the audited 4-block pre-LayerNorm ResMLP ($d{=}256$, CIFAR-10, 100 epochs, 3 seeds), DFA training drives $\|h_L\|$ from about $9$ at initialization to about $4\times 10^8$ by epoch 100 and $\|g_L\|$ from about $9.8\times 10^{-4}$ to about $5\times 10^{-10}$, while the reported deep cosine remains defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to a meaningful BP direction. We tested this mechanism story against four natural alternative attributions, all of which it survives. \emph{Not residual-skip-driven:} with terminal LN kept and the additive skip removed ($h_{l+1}{=}F_l(h_l)$), DFA still converges to $\|h_L\|{\approx}1.06{\times}10^{8}$ and $\|g_L\|{\approx}1.09{\times}10^{-10}$ at $100$ epochs, both at the diagnostic floor (Appendix~\ref{app:no_residual}). \emph{Not task-signal-driven:} under i.i.d.\ random class targets per minibatch, DFA still reaches $\|h_L\|{\approx}1.67{\times}10^{8}$ and $\|g_L\|{\approx}8{\times}10^{-12}$ while accuracy stays at chance (Appendix~\ref{app:random_targets}). \emph{Not DFA-specific:} the same random-target ablation drives $\|h_L\|$ to $6.2{\times}10^{3}$ for SB and $2.0{\times}10^{4}$ for CB in three epochs, so all three audited fixed-feedback methods exhibit data-agnostic activation growth. \emph{Not shared by EP:} under the same protocol, EP keeps $\|h_L\|{\approx}586$ at five epochs, $25\times$ smaller than DFA's three-epoch value, confirming that the random-target assay separates the explosion-prone fixed-feedback class from EP's energy-based objective. @@ -94,7 +94,7 @@ The matched same-backbone causal control for diagnostic~(b) is removing terminal The second failure mode appears even in the meaningful-measurement regime. At the earliest vanilla DFA checkpoints on ResMLP, the hidden backpropagated gradient at the first deep block remains above the numerical floor: at epoch 1, $\|g_2\|$ is $6.7\times 10^{-7}$, $6.5\times 10^{-7}$, and $3.9\times 10^{-7}$ across the three seeds, all above the $10^{-7}$ threshold used to distinguish measurable from collapsed gradients. Yet the corresponding deep-layer cosine values are already essentially null: across layers $1$--$4$, all seed-level measurements at epoch 1 lie in $[-0.04,+0.02]$, with a three-seed mean of $-0.008 \pm 0.013$, and by epoch 2 the deep mean is still only $-0.018 \pm 0.018$ (Table~\ref{tab:mode_validation}). This is the observational pattern predicted by low credit-direction quality rather than mere disappearance of signal: the gradient is still present enough to measure, but the directions delivered to the deep network carry little agreement with backpropagation, consistent with prior concerns that alternative feedback rules can fail by supplying poor credit assignments even before full collapse \citep{bartunov2018assessing,moskovitz2018feedback,crafton2019backpropagation,refinetti2023aligning}. This rules out the simplest objection that the deep-layer null result is merely a byproduct of collapse. -A second metric with different numerical failure modes tells the same story. Cosine measures directional agreement with the BP gradient, whereas perturbation correlation $\rho$ measures whether the proposed update predicts the correct sign and relative magnitude of loss change under actual perturbations; their failure modes are therefore different, especially with respect to normalization and small-denominator effects. In our controls, $\rho$ behaves as expected, with a Taylor-ceiling positive control near $+0.997$ and a random-vector negative control near $+0.006$ (Figure~\ref{fig:penalty_rescue}, Table~\ref{tab:mode_validation}). On vanilla DFA, deep $\rho$ is likewise null: for the early checkpoints where the gradients remain measurable, the deep average is $-0.003 \pm 0.005$ across seeds and epochs, and in a floor-level checkpoint it is $+0.002$, again indistinguishable from noise. The agreement between cosine and $\rho$ therefore rules out the interpretation that the null deep result is an artifact of cosine's $\varepsilon$-clamp or vector normalization. The deep blocks are not just hard to measure; they are receiving weakly useful directions. +A second metric with different numerical failure modes tells the same story. Cosine measures directional agreement with the BP gradient, whereas the per-layer perturbation correlation $\rho_l$ measures whether the proposed credit predicts the actual loss response: for $M{=}32$ unit-norm random directions $v_m$ and step $\varepsilon{=}10^{-3}$, $\rho_l \;{=}\; \mathrm{Pearson}_m\!\left(\langle a_l,\, \varepsilon v_m\rangle,\;\, \ell(h_l + \varepsilon v_m) - \ell(h_l)\right)$, evaluated per sample on a fixed eval batch and then averaged. Cosine and $\rho$ have different failure modes, especially with respect to normalization and small-denominator effects. In our controls, $\rho$ behaves as expected, with a Taylor-ceiling positive control near $+0.997$ and a random-vector negative control near $+0.006$ (Figure~\ref{fig:penalty_rescue}, Table~\ref{tab:mode_validation}). On vanilla DFA, deep $\rho$ is likewise null: for the early checkpoints where the gradients remain measurable, the deep average is $-0.003 \pm 0.005$ across seeds and epochs, and in a floor-level checkpoint it is $+0.002$, again indistinguishable from noise. The agreement between cosine and $\rho$ therefore rules out the interpretation that the null deep result is an artifact of cosine's $\varepsilon$-clamp or vector normalization. The deep blocks are not just hard to measure; they are receiving weakly useful directions. Per-layer reporting is therefore not cosmetic. In ResMLP under vanilla DFA, the headline aggregate alignment $\Gamma \approx 0.07$--$0.10$ can look mildly positive only because layer $0$ remains strongly aligned while the deep network is not: at the same early checkpoints where layers $1$--$4$ are essentially zero, layer $0$ has cosine $+0.42$, $+0.45$, and $+0.39$ across seeds (Table~\ref{tab:mode_validation}). The resulting average can therefore be driven by the embedding layer even when the interior blocks are effectively unaligned, so aggregate reporting obscures the very distinction needed to separate ``measurement collapse'' from ``poor credit direction.'' This layer-$0$ dominance is specific to the ResMLP DFA setting; on ViT-Mini DFA, all layers are near zero, which strengthens the broader methodological point that alignment should be reported per layer rather than only in aggregate. With the two modes separated observationally, the remaining question is whether intervention can move them independently. @@ -138,14 +138,14 @@ The architecture comparison sharpens the scope of the critique. In the terminal- \begin{figure}[t] \centering -\includegraphics[width=\linewidth]{figures/fig4_penalty_rescue.pdf} +\includegraphics[width=0.92\linewidth]{figures/fig4_penalty_rescue.pdf} \caption{Penalty intervention view of the two modes: penalization rescues residual-stream scale and restores a measurable but still partial deep-layer credit signal, clarifying that numerical rescue and credit-quality rescue are related but distinct.} \label{fig:penalty_rescue} \end{figure} \begin{figure}[t] \centering -\includegraphics[width=0.78\linewidth]{figures/fig5_cross_arch_summary.pdf} +\includegraphics[width=0.7\linewidth]{figures/fig5_cross_arch_summary.pdf} \caption{Cross-architecture summary over ResMLP, ViT-Mini, StudentNet, and CNN: activation-growth failures recur across architectures, while gradient-floor failures appear in the terminal-normalized settings audited here.} \label{fig:cross_arch_summary} \end{figure} @@ -175,12 +175,12 @@ Diag. & Measurement & Default threshold & Role \\ The point of the protocol is not to add plots; it is to prevent a specific class of false conclusions. For this paper, the minimal protocol is four checks: per-layer activation scale via max-per-block growth, deepest hidden BP gradient floor, meaningful-regime per-layer credit quality, and an architecture-matched frozen-blocks baseline (Table~\ref{tab:protocol_def}). The first two ask whether the reference quantity is still valid; the third asks whether, once validity is restored, the deep blocks receive useful directions; and the fourth asks whether the trained depth is doing better than a model whose residual blocks were never trained at all. Figure~\ref{fig:decision_utility} makes the decision value explicit: accuracy alone walks back $0/5$ audited methods, accuracy plus headline $\Gamma$ still walks back $0/5$, and the full protocol walks back $3/5$ by flagging DFA, State Bridge, and Credit Bridge, with diagnostics (a), (b), and (d) each independently sufficient for binary detection on those failures. On our audit, these checks catch failures that accuracy plus aggregate alignment miss completely. -The protocol is conservative in a specific sense: it preserves BP and EP as evidence-bearing controls and walks back only claims that fail measurement-validity or depth-utilization checks. Diagnostics (a) and (b) have sharp empirical calibration gaps in the audited regime, diagnostic (c) is a sub-mode discriminator rather than a primary detector, and diagnostic (d) uses a deliberately weak $2$pp margin as a context check rather than a theorem about useful depth. The Section~\ref{sec:mode2} cross-method cosine-versus-accuracy dissociation reinforces the necessity of keeping all four diagnostics separate: Credit Bridge, State Bridge, and DFA differ by more than $4\times$ in deep-layer alignment under the same penalty rescue without tracking final accuracy in the same direction, so aligning an alternative credit rule with the BP gradient is not a substitute for checking depth utilization against a matched shallow baseline. +The protocol is conservative in a specific sense: it preserves BP and EP as evidence-bearing controls and walks back only claims that fail measurement-validity or depth-utilization checks. Diagnostics (a) and (b) have sharp empirical calibration gaps in the audited regime (Appendix~\ref{app:threshold_sweep}), diagnostic (c) is a sub-mode discriminator computed as the mean pairwise cosine of the per-batch-averaged BP-grad direction at the chosen layer across $K{\geq}8$ disjoint $128$-sample minibatches (high values, $0.5$--$0.99$, indicate drift-dominated reference vectors; healthy per-sample credit gives $0.05$--$0.18$), and diagnostic (d) uses a deliberately weak $2$pp margin as a context check rather than a theorem about useful depth. The Section~\ref{sec:mode2} cross-method cosine-versus-accuracy dissociation reinforces the necessity of keeping all four diagnostics separate: Credit Bridge, State Bridge, and DFA differ by more than $4\times$ in deep-layer alignment under the same penalty rescue without tracking final accuracy in the same direction, so aligning an alternative credit rule with the BP gradient is not a substitute for checking depth utilization against a matched shallow baseline. \section{Discussion, Limits, Conclusion} \label{sec:discussion} -Our claim is about what existing evidence licenses, not about impossibility: this paper does not show that FA cannot work in deep networks, only that current evaluation practice can misread what happened. DFA, State Bridge, and Credit Bridge all survive status-quo reporting in Table~\ref{tab:main_audit}, yet the protocol shows that their deep claims are unsupported, while the intervention in Figure~\ref{fig:penalty_rescue} partially rescues deep credit signal rather than validating the original headline. Our strongest claim is scoped to the 4-block $d{=}256$ and $d{=}512$ pre-LayerNorm ResMLPs and to ViT-Mini, where Mode~1 (a)+(b) both fire; StudentNet and the BatchNorm CNN refine the scope by showing that activation growth can persist without the gradient-floor collapse, the no-terminal-LN same-backbone control establishes terminal LayerNorm as causally necessary for diagnostic~(b) on residual ResMLP but not proven beyond that family, the dataset is only CIFAR-10, and the BP-plus-penalty comparison is a lower-bound control rather than a full decomposition. The main lesson is to decompose the evaluation question before interpreting the answer: FA papers should report the BP-reference validity, the layerwise credit quality in that meaningful regime, and the frozen-blocks depth-utilization baseline as three separate axes, rather than as a single headline accuracy or headline $\Gamma$. The contribution is a reporting rule in the evaluation-methodology line of \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}, not a new benchmark artifact. +Our claim is about what existing evidence licenses, not about impossibility: this paper does not show that FA cannot work in deep networks, only that current evaluation practice can misread what happened. DFA, State Bridge, and Credit Bridge all survive status-quo reporting in Table~\ref{tab:main_audit}, yet the protocol shows that their deep claims are unsupported, while the intervention in Figure~\ref{fig:penalty_rescue} partially rescues deep credit signal rather than validating the original headline. Our strongest claim is scoped to the 4-block $d{=}256$ and $d{=}512$ pre-LayerNorm ResMLPs and to ViT-Mini, where Mode~1 (a)+(b) both fire; StudentNet and the BatchNorm CNN refine the scope by showing that activation growth can persist without the gradient-floor collapse, the no-terminal-LN same-backbone control establishes terminal LayerNorm as causally necessary for diagnostic~(b) on residual ResMLP but not proven beyond that family, the dataset is only CIFAR-10, and the BP-plus-penalty comparison is a lower-bound control rather than a full decomposition. In the evaluation-methodology line of \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}, FA papers should therefore report BP-reference validity, layerwise credit quality, and a frozen-blocks depth-utilization baseline as three separate axes rather than a single headline. \begin{thebibliography}{10} |
