Diffstat (limited to 'paper/main.tex')
| -rw-r--r-- | paper/main.tex | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.tex b/paper/main.tex
index 039ce2a..e5e5e5a 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -197,7 +197,7 @@ Diag. & Measurement & Default threshold & Role \\
 \paragraph{Scope, limits, and reporting recommendation.} \looseness=-2 Our claim is about evidence, not impossibility: we show that current FA evaluation practice can misread what happened, not that FA cannot work in deep networks. DFA, SB, and CB all pass status-quo reporting (Table~\ref{tab:main_audit}) but fail the protocol's deep checks, and the Figure~\ref{fig:penalty_rescue} penalty partially rescues credit signal rather than validating headlines. Our strongest claim is scoped to $d{=}256/512$ pre-LayerNorm ResMLPs and ViT-Mini, where both Mode~1 diagnostics fire; the no-terminal-LN ResMLP ablation establishes terminal LayerNorm as causally necessary for diagnostic~(b) on residual ResMLP and (with the BatchNorm CNN) shows that activation growth can persist without gradient-floor collapse; the dataset is CIFAR-10; and the BP-plus-penalty comparison is a lower bound, not a full decomposition. In the evaluation-methodology line of \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}, FA papers should report BP-reference validity, layerwise credit quality, and a frozen-blocks depth-utilization baseline as separate axes, not a single headline.
-\paragraph{Open questions and concrete next experiments.} The mechanism story in Section~\ref{sec:mode2} treats Mode~1 as a plausible downstream symptom of Mode~2 rather than a parallel, independently destructive failure, but the audit data is also formally consistent with a fully parallel reading. A direct test would measure per-block forward-state-change content along the training trajectory and check whether per-block decrease in test loss tracks per-block credit usefulness (e.g.\ nudging-test loss change) more tightly than it tracks per-block angular agreement with the BP gradient; a complementary test would substitute the random feedback $B_l$ with a high-quality credit signal (sparse, learned to predict the BP gradient, or weight-transport-restored \`a la \citet{akrout2019deep}) at fixed $\|f_l\|$ and check whether activation growth still appears, which would falsify the Mode~2~$\to$~Mode~1 reading by exhibiting Mode~1 in the absence of Mode~2. Beyond the mechanism question, a wider-scope replication would extend the same audit to additional datasets (CIFAR-100, Tiny-ImageNet) and architectures outside the residual ResMLP / ViT-Mini family, which would calibrate how broadly the protocol's binary detectors generalize past the audited regime; the protocol code in Appendix~\ref{app:reference_impl} is structured to make these extensions a configuration change rather than a new experimental design.
+\paragraph{Open questions and concrete next experiments.} The mechanism story in Section~\ref{sec:mode2} treats Mode~1 as a plausible downstream symptom of Mode~2 rather than a parallel, independently destructive failure, but the audit data is also formally consistent with a fully parallel reading. A direct test would measure per-block forward-state-change content along the training trajectory and check whether per-block decrease in test loss tracks per-block credit usefulness (e.g.\ nudging-test loss change) more tightly than it tracks per-block angular agreement with the BP gradient; a complementary test would substitute the random feedback $B_l$ with a high-quality credit signal (sparse, learned to predict the BP gradient, or weight-transport-restored \`a la \citet{akrout2019deep}) at fixed $\|f_l\|$ and check whether activation growth still appears, which would falsify the Mode~2~$\to$~Mode~1 reading by exhibiting Mode~1 in the absence of Mode~2. Beyond the mechanism question, a wider-scope replication would extend the same audit to additional datasets (CIFAR-100, Tiny-ImageNet) and architectures outside the residual ResMLP / ViT-Mini family, which would calibrate how broadly the protocol's binary detectors generalize past the audited regime. The reference implementation in Appendix~\ref{app:reference_impl} is intended to support such extensions at the level of training-recipe and architecture-class configuration so the audit pipeline itself does not need to be re-derived.
 \begin{thebibliography}{10}
