1 files changed, 4 insertions, 4 deletions
diff --git a/paper/main.tex b/paper/main.tex
index 21dabd9..3cba99d 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -39,7 +39,7 @@ Accordingly, this paper does not introduce a new FA variant or a new benchmark.
 \section{Audit: Standard Reporting Walks Back Nothing}
 \label{sec:audit}
 
-On the 4-block pre-LayerNorm ResMLP at $d{=}256$ on CIFAR-10, the field-standard reporting pair does not walk back any of the five methods we audit. % TODO: evidence sentence % TODO: closing sentence
+We begin with the smallest setting in which all methods can be compared head-to-head under identical architecture, optimizer family, and data. Table~\ref{tab:main_audit} fixes that canonical audit to a 4-block pre-LayerNorm ResMLP with width $d{=}256$ on CIFAR-10, trained for 100 epochs with AdamW (learning rate $10^{-3}$, weight decay $0.01$), a cosine schedule, and three seeds (42, 123, 456). Within that single setting, BP, EP, DFA, State Bridge, and Credit Bridge can be read against the same architecture and the same training budget, while Figure~\ref{fig:audit_hero} summarizes the corresponding per-block growth, deepest-layer BP reference norm, cross-batch stability, and frozen-baseline comparison. This is the table a reader would normally use to decide whether the methods trained the deep network.
 
 \begin{table}[t]
 \centering
@@ -59,11 +59,11 @@ Credit Bridge & $0.289 \pm 0.026$ & $0.07$        & trustworthy & walked back \\
 \end{tabular}
 \end{table}
 
-DFA's headline accuracy walks back from ``the deep blocks are training'' to ``the trainable-blocks model is below the architecture-matched random-blocks baseline.'' % TODO: evidence sentence % TODO: closing sentence
+By the field's usual criteria, DFA, State Bridge, and Credit Bridge appear to train to nontrivial accuracy and report nonzero alignment. In Table~\ref{tab:main_audit}, DFA reaches $0.306 \pm 0.006$ test accuracy with headline $\Gamma{=}0.10$, State Bridge reaches $0.205 \pm 0.032$ with $\Gamma{=}0.005$, and Credit Bridge reaches $0.289 \pm 0.026$ with $\Gamma{=}0.07$. None of these rows looks obviously invalid to a reader who relies on the usual pair of final accuracy and aggregate alignment, in the style of prior FA reporting \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct}. Even the absolute scale does not by itself force a walk-back: all three methods are above the $0.10$ chance level of ten-class CIFAR-10, and all three report positive headline alignment rather than a visibly broken or undefined quantity. That reading is exactly what the rest of the paper overturns.
 
-DFA's headline $\Gamma$ walks back from ``small but positive alignment'' to ``a cosine measured against a numerical-floor reference vector and driven by layer 0.'' % TODO: evidence sentence % TODO: closing sentence
+Low accuracy by itself is not the pathology. EP is the key internal comparison in Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero}: it achieves only $0.316 \pm 0.030$ accuracy and a very small headline $\Gamma{=}0.008$, yet its per-block growth is a comparatively modest $11.6\times$, its deepest-layer BP reference norm remains around $1.3\times 10^{-4}$ rather than collapsing to the numerical floor, and its cross-batch direction-stability score is $0.02$ rather than the much higher, drift-dominated values seen for DFA-family methods. At the same time, EP is not a positive result for depth usage in the stronger sense, because its trainable-model accuracy is still $3.3$ percentage points below the frozen-blocks baseline of $0.349 \pm 0.002$. The distinction matters because it separates underperformance from invalid evaluation.
 
-State Bridge and Credit Bridge show the same qualitative pattern as DFA, while EP serves as the internal control that the audit does not over-flag. % TODO: evidence sentence % TODO: closing sentence
+When we compare each method to a frozen-blocks baseline matched to the same architecture, the headline interpretation changes immediately. The frozen-blocks model, which trains only the embedding, LayerNorm, and head while holding the residual blocks fixed, reaches $0.349 \pm 0.002$ across the same three seeds. Against that baseline, BP is higher by $26.6$ percentage points, but DFA is lower by $4.3$ points, State Bridge by $14.4$ points, Credit Bridge by $6.0$ points, and even EP by $3.3$ points. Figure~\ref{fig:audit_hero} shows that this accuracy comparison lines up with the diagnostic split: DFA, State Bridge, and Credit Bridge also combine extreme per-block growth ($237\times$, $12{,}000\times$, and $96\times$, respectively), deepest-layer BP reference norms around $10^{-9}$, and elevated cross-batch direction-stability scores ($0.16$, $0.53$, and $0.37$, where higher values indicate more drift), so their deep blocks are at best passengers and in practice often harmful. This establishes the audit question the rest of the paper must answer: why do the standard signals fail so badly?
 
 \begin{figure}[t]
 \centering