| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 14:26:13 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 14:26:13 -0500 |
| commit | 3a35c6e0636aa78737847f92712b2dc36bad3618 (patch) | |
| tree | 452a365468b75e374fc310b491603f2c57964269 | |
| parent | 3f4f9937829b2f7a0756ce870a0cc9f7b5e4a9f9 (diff) | |
Polish: tighten §7 conclusion so it fully fits on page 9 (was spilling 2 lines to p10)
The two final sentences both started with 'Once...' and were redundant.
Cut both of them and reordered the preceding clause so that it states the claim
directly (a reporting rule, not a new benchmark artifact).
Before: 'That is the sense in which this paper fits the evaluation-methodology line...: the contribution is not a new benchmark artifact, but a reporting rule for preventing a repeatable interpretive error. Once the field enforces that separation between measurement validity and substantive credit quality, positive results will become more trustworthy and negative results more precise. Once that decomposition is enforced, the apparent evidence for successful deep credit assignment becomes much harder to overstate.'
After: 'That is the sense in which this paper fits the evaluation-methodology line...: the contribution is a reporting rule for preventing a repeatable interpretive error, not a new benchmark artifact.'
§7 now fully fits on p9. Main content exactly 9 pages within E&D budget.

| Mode | File | Changes |
|---|---|---|
| -rw-r--r-- | paper/main.pdf | bin, 485001 -> 484726 bytes |
| -rw-r--r-- | paper/main.tex | 2 |

2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
Binary files differ
index 93dc6cb..26f73c0 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
diff --git a/paper/main.tex b/paper/main.tex
index 5807ca8..a372af8 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -183,7 +183,7 @@ Our claim is about what existing evidence licenses, not about impossibility. Thi
 The right level of generality is the audited regime. Our strongest claim is scoped to modern residual vision architectures, especially the pre-LayerNorm and terminal-LayerNorm settings where we directly observed Mode~1: the 4-block ResMLP at $d{=}256$, its $d{=}512$ extension, and ViT-Mini all show the same basic pattern, whereas StudentNet and the BatchNorm CNN refine the scope by showing that activation-growth failures can persist without the hidden-gradient-floor collapse (Figure~\ref{fig:cross_arch_summary}; Figure~\ref{fig:penalty_rescue}). That leaves clear limits. The dataset is only CIFAR-10, the models are small to medium rather than frontier-scale, the terminal-LayerNorm-necessity claim for diagnostic~(b) is established causally on the audited residual ResMLP via the matched same-backbone no-terminal-LN control but not proven to extend beyond that architecture family, and the BP-plus-penalty comparison is only a lower-bound control on penalty cost rather than a perfect decomposition. Those limitations narrow what is claimed, but they do not weaken the core methodological point that the audited measurement regime can fail silently in exactly the architectures that now dominate this genre of experiment. Future positive or negative examples outside this regime would refine the scope of the protocol, not invalidate the critique.
-The main lesson is to decompose the evaluation question before interpreting the answer. Future FA papers should report, separately, whether the BP reference is still meaningful, whether the deep layers receive useful credit in that meaningful regime, and whether trained depth beats an architecture-matched frozen-blocks baseline, instead of compressing those distinct questions into a single headline accuracy or headline $\Gamma$. That is the sense in which this paper fits the evaluation-methodology line of \citet{jordan2020evaluating}, \citet{obray2022evaluation}, and \citet{paleka2026pitfalls}: the contribution is not a new benchmark artifact, but a reporting rule for preventing a repeatable interpretive error. Once the field enforces that separation between measurement validity and substantive credit quality, positive results will become more trustworthy and negative results more precise. Once that decomposition is enforced, the apparent evidence for successful deep credit assignment becomes much harder to overstate.
+The main lesson is to decompose the evaluation question before interpreting the answer. Future FA papers should report, separately, whether the BP reference is still meaningful, whether the deep layers receive useful credit in that meaningful regime, and whether trained depth beats an architecture-matched frozen-blocks baseline, instead of compressing those distinct questions into a single headline accuracy or headline $\Gamma$. That is the sense in which this paper fits the evaluation-methodology line of \citet{jordan2020evaluating}, \citet{obray2022evaluation}, and \citet{paleka2026pitfalls}: the contribution is a reporting rule for preventing a repeatable interpretive error, not a new benchmark artifact.
 \begin{thebibliography}{10}
