Round 31: fill §6 Recommended Protocol + §7 Discussion prose; v2 content-complete

Six paragraphs total via codex round 31:
- §6 §6.1: measurement-validity-first ordering with 63x and 24338x calibration gaps
- §6 §6.2: minimal four-check protocol; decision-utility 0/5 vs 3/5 walk-back
- §6 §6.3: conservative asymmetry (BP/EP preserved, DFA/SB/CB walked back)
- §7 §7.1: scope claim (evaluation failure, not algorithmic impossibility)
- §7 §7.2: limits (CIFAR-10 only, observational LN interpretation, lower-bound BP+penalty control)
- §7 §7.3: lesson (decompose evaluation question, position vs Jordan/O'Bray/Paleka)

Compiles to 12 pages (main content 1-8, refs+appendices 8-12), within E&D 9-page main budget.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 05:15:23 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 05:15:23 -0500
commit: 3e49631d6dbf6b66683cb7dac5ecf71df2ad7de0 (patch)
tree: ee0a6a3f4550959fb352bfdb0dd2ce82d380fd8d
parent: 2749679c46ab6d823aefa94e24162dc664a322f0 (diff)
2 files changed, 6 insertions, 10 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index e4e233a..b13f9af 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
diff --git a/paper/main.tex b/paper/main.tex
index 0f3b474..1a6f189 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -148,7 +148,7 @@ The architecture comparison sharpens the scope of the critique. In the terminal-
 \section{Recommended FA Evaluation Protocol}
 \label{sec:protocol}
 
-The protocol has four diagnostics because the evaluation failure is not visible from any single headline number. % TODO: evidence sentence % TODO: closing sentence
+The reporting protocol begins with measurement validity. Before any FA paper reports a headline alignment number, it should report per-layer state scale and the hidden BP reference-gradient scale at the layers where the scientific claim is being made. In our audited regime, those two quantities already separate healthy from invalid measurement with unusually wide margins: the maximum per-block growth stays below about $11\times$ for BP and EP but is at least $694\times$ for the degenerate methods, giving a $63\times$ calibration gap, while the deepest hidden BP norm stays above about $10^{-4}$ for BP and EP but below about $4\times 10^{-9}$ for the degenerate methods, giving a $24{,}338\times$ gap (Table~\ref{tab:protocol_def}; Table~\ref{tab:main_audit}; Figure~\ref{fig:cross_arch_summary}). These are not cosmetic diagnostics around the real result: they determine whether the reported cosine is being computed against an informative BP direction or against a floor-level reference. If the reference gradient is at floor, the evaluator should stop treating aggregate alignment as evidence.
 
 \begin{table}[t]
 \centering
@@ -167,11 +167,9 @@ Diag. & Measurement & Default threshold & Role \\
 \end{tabular}
 \end{table}
 
-Diagnostics (a), (b), and (d) are independently sufficient for binary detection on the audited failures, while diagnostic (c) is primarily interpretive. % TODO: evidence sentence % TODO: closing sentence
+The point of the protocol is not to add plots; it is to prevent a specific class of false conclusions. For this paper, the minimal protocol is four checks: per-layer activation scale via max-per-block growth, deepest hidden BP gradient floor, meaningful-regime per-layer credit quality, and an architecture-matched frozen-blocks baseline (Table~\ref{tab:protocol_def}). The first two ask whether the reference quantity is still valid; the third asks whether, once validity is restored, the deep blocks receive useful directions; and the fourth asks whether the trained depth is doing better than a model whose residual blocks were never trained at all. Figure~\ref{fig:decision_utility} makes the decision value explicit: accuracy alone walks back $0/5$ audited methods, accuracy plus headline $\Gamma$ still walks back $0/5$, and the full protocol walks back $3/5$ by flagging DFA, State Bridge, and Credit Bridge, with diagnostics (a), (b), and (d) each independently sufficient for binary detection on those failures. On our audit, these checks catch failures that accuracy plus aggregate alignment miss completely.
 
-The decision-utility ablation is the compact empirical argument for why this protocol belongs in an E\&D paper. % TODO: evidence sentence % TODO: closing sentence
-
-Threshold calibration is strong for diagnostics (a) and (b) and deliberately weaker for diagnostic (d), so the paper should state that asymmetry rather than oversell uniform robustness. % TODO: evidence sentence % TODO: closing sentence
+A useful evaluation rule should reject the bad cases without collapsing everything into a negative result. The protocol is conservative in exactly that sense: it preserves BP and EP as evidence-bearing controls, and it walks back only those claims that fail measurement-validity or depth-utilization checks in Table~\ref{tab:main_audit}. That asymmetry matters because the four thresholds are not equally well calibrated. Diagnostics (a) and (b) have sharp empirical calibration gaps in the audited regime, diagnostic (c) is explicitly a sub-mode discriminator rather than a primary detector, and diagnostic (d) uses a deliberately weak $2$pp margin as a context check rather than a theorem about useful depth. The rule therefore does not say that low accuracy, low aggregate alignment, or any non-BP method is automatically invalid; it says only that claims unsupported by measurement-valid evidence should be withdrawn, while trustworthy controls should remain standing. That conservative asymmetry is why the protocol belongs in the main paper rather than the appendix.
 
 \begin{figure}[t]
 \centering
@@ -184,13 +182,11 @@ Threshold calibration is strong for diagnostics (a) and (b) and deliberately wea
 \section{Discussion, Limits, Conclusion}
 \label{sec:discussion}
 
-The main recommendation of this paper is that headline accuracy and headline $\Gamma$ should no longer be treated as sufficient evidence that deep local-credit learning is working on modern residual architectures. % TODO: evidence sentence % TODO: closing sentence
-
-Our claim is deliberately scoped to the architectures and methods audited here, and especially to pre-LayerNorm residual settings where measurement degeneracy is empirically strongest. % TODO: evidence sentence % TODO: closing sentence
+Our claim is about what existing evidence licenses, not about impossibility. This paper does not show that FA cannot work in deep networks; it shows that current evaluation practice can misread what happened by letting headline accuracy and aggregate alignment stand in for measurement validity and layerwise credit quality. The strongest examples are precisely the cases where the field-standard summary would sound mildly positive while the audited deep evidence has already collapsed or is already null: DFA, State Bridge, and Credit Bridge all survive status-quo reporting in Table~\ref{tab:main_audit}, yet the protocol shows that their deep claims are unsupported. The intervention results in Figure~\ref{fig:penalty_rescue} reinforce the same distinction, because restoring a measurable regime partially rescues deep credit signal rather than proving that the original headline had been trustworthy all along. That distinction is important because evaluation failure and algorithmic impossibility are different statements.
 
-Positioned against prior evaluation-methodology papers, this work contributes a failure analysis and diagnostic protocol for a mature evaluation practice rather than a new benchmark suite, dataset release, or leaderboard \citep{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}. % TODO: evidence sentence % TODO: closing sentence
+The right level of generality is the audited regime. Our strongest claim is scoped to modern residual vision architectures, especially the pre-LayerNorm and terminal-LayerNorm settings where we directly observed Mode~1: the 4-block ResMLP at $d{=}256$, its $d{=}512$ extension, and ViT-Mini all show the same basic pattern, whereas StudentNet and the BatchNorm CNN refine the scope by showing that activation-growth failures can persist without the hidden-gradient-floor collapse (Figure~\ref{fig:cross_arch_summary}; Figure~\ref{fig:penalty_rescue}). That leaves clear limits. The dataset is only CIFAR-10, the models are small to medium rather than frontier-scale, the terminal-LN interpretation is observational rather than a causal identification, and the BP-plus-penalty comparison is only a lower-bound control on penalty cost rather than a perfect decomposition. Those limitations narrow what is claimed, but they do not weaken the core methodological point that the audited measurement regime can fail silently in exactly the architectures that now dominate this genre of experiment. Future positive or negative examples outside this regime would refine the scope of the protocol, not invalidate the critique.
 
-A reasonable conclusion for the field is therefore not that FA-like methods are categorically impossible, but that future claims must report whether they have escaped both failure modes, under matched baselines and with diagnostics that remain meaningful at the layers where the scientific claim is being made. % TODO: evidence sentence % TODO: closing sentence
+The main lesson is to decompose the evaluation question before interpreting the answer. Future FA papers should report, separately, whether the BP reference is still meaningful, whether the deep layers receive useful credit in that meaningful regime, and whether trained depth beats an architecture-matched frozen-blocks baseline, instead of compressing those distinct questions into a single headline accuracy or headline $\Gamma$. That is the sense in which this paper fits the evaluation-methodology line of \citet{jordan2020evaluating}, \citet{obray2022evaluation}, and \citet{paleka2026pitfalls}: the contribution is not a new benchmark artifact, but a reporting rule for preventing a repeatable interpretive error. Once the field enforces that separation between measurement validity and substantive credit quality, positive results will become more trustworthy, negative results more precise, and the apparent evidence for successful deep credit assignment much harder to overstate.
 
 \begin{thebibliography}{10}
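
The binary checks described in the added §6 prose can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the names `MethodAudit`, `max_per_block_growth`, and `failed_checks` are hypothetical, "max-per-block growth" is assumed to mean the largest ratio of consecutive per-block activation norms, and the thresholds for diagnostics (a) and (b) are placeholder values chosen to sit between the healthy and degenerate ranges quoted in the diff (about 11x vs. at least 694x, and about 1e-4 vs. about 4e-9). Only the 2pp margin for diagnostic (d) comes from the text, and how that margin is applied here is also an assumption. Diagnostic (c) is described as interpretive rather than a primary detector, so it is left out. The input values below are synthetic toy numbers, not results from the paper.

```python
from dataclasses import dataclass
from typing import List

# Placeholder thresholds (assumptions, NOT taken from the paper's protocol table).
GROWTH_MAX = 100.0       # (a) between ~11x (healthy) and >=694x (degenerate)
GRAD_FLOOR = 1e-6        # (b) between ~1e-4 (healthy) and ~4e-9 (degenerate)
FROZEN_MARGIN_PP = 2.0   # (d) the 2pp margin mentioned in the text


@dataclass
class MethodAudit:
    """Hypothetical container for one audited method's measurements."""
    name: str
    block_norms: List[float]        # activation norm at input, then after each block
    deepest_bp_grad_norm: float     # hidden BP reference-gradient norm at deepest layer
    acc_pct: float                  # accuracy of the trained model, in percent
    frozen_baseline_acc_pct: float  # accuracy with residual blocks never trained


def max_per_block_growth(norms: List[float]) -> float:
    """Largest ratio between consecutive per-block activation norms (assumed definition)."""
    return max(after / before for before, after in zip(norms, norms[1:]))


def failed_checks(m: MethodAudit) -> List[str]:
    """Return the labels of the binary diagnostics (a), (b), (d) that flag this method."""
    flags = []
    if max_per_block_growth(m.block_norms) > GROWTH_MAX:
        flags.append("a")  # activation scale blows up across a block
    if m.deepest_bp_grad_norm < GRAD_FLOOR:
        flags.append("b")  # BP reference gradient is at floor
    if m.acc_pct - m.frozen_baseline_acc_pct < FROZEN_MARGIN_PP:
        flags.append("d")  # trained depth does not beat the frozen-blocks baseline
    return flags


# Synthetic toy inputs for illustration only.
healthy = MethodAudit("healthy-toy", [1.0, 3.0, 9.0, 27.0, 54.0], 2e-4, 60.0, 50.0)
degenerate = MethodAudit("degenerate-toy", [1.0, 800.0, 1600.0, 3200.0, 6400.0], 1e-9, 51.0, 50.0)

for audit in (healthy, degenerate):
    flags = failed_checks(audit)
    verdict = "walk back" if flags else "keep"
    print(audit.name, flags, verdict)

# The 63x calibration gap quoted in the diff matches the ratio of the rounded bounds.
calibration_gap_a = 694 / 11
```

In this sketch a method is walked back as soon as any one of the three checks fires, mirroring the statement that diagnostics (a), (b), and (d) are each independently sufficient on the audited failures. The ratio of the rounded bounds for diagnostic (b), 1e-4 over 4e-9, is about 25,000, so the reported 24,338x gap presumably comes from unrounded values.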