OPTION F polish pass: update abstract, §1 ¶2, §6 ¶3, §7 ¶2 with round 38-40 findings

Codex round 40 sequencing: A then F. OPTION F (polish) pass found 4 real issues:

1. Abstract: added narrow conditional-positive finding sentence about SB+penalty beating the shallow baseline + cos-vs-acc dissociation
2. §1 ¶2: replaced 'deep cosine can improve to about +0.16' (DFA-specific) with a fuller cross-method statement mentioning SB +0.32, CB +0.68, and the cos-vs-acc dissociation
3. §6 ¶3 (protocol asymmetry): added one sentence noting that the cross-method cos-vs-acc dissociation REINFORCES the necessity of keeping all four diagnostics separate
4. §7 ¶2 (limits): upgraded 'terminal-LN interpretation is observational rather than causal identification' to 'established causally on the audited residual ResMLP via the matched same-backbone no-terminal-LN control but not proven to extend beyond that architecture family'; reflects round 36 wording upgrade based on existing April 7 no_outln data

All four changes are prose-level updates driven by data that was already in the paper. Main content still fits at 9 pages exactly (E&D limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 09:53:29 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 09:53:29 -0500
commit: ef2dbf9e90ffa1740f6bd3a6eebc47c124a36c2f (patch)
tree: c9bd6970993884b39bb001ff07ddf3b21beaeec9
parent: 0f0bce37ddbc485768181e3f4d8bf93ba660c5db (diff)
2 files changed, 4 insertions, 4 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index dcb3fa4..a264800 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
diff --git a/paper/main.tex b/paper/main.tex
index 596f969..9da868e 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -24,7 +24,7 @@
 \maketitle
 
 \begin{abstract}
-Modern feedback-alignment evaluation on deep residual networks is still summarized by a deceptively simple pair: headline accuracy and headline cosine alignment $\Gamma$ to the backpropagation gradient. We show that this pair can silently fail in two distinct ways on standard CIFAR-10 pre-LayerNorm ResMLP and ViT-Mini settings: first, \emph{measurement degeneracy}, where residual-stream growth drives hidden-layer BP gradients to the numerical floor and makes $\Gamma$ uninterpretable; and second, \emph{low intrinsic credit-direction quality}, where random-feedback credit remains essentially unaligned with BP on the deep blocks even when the reference gradient is still meaningful. The headline result is that the field-standard reporting pair walks back none of the methods we audit, whereas a four-diagnostic protocol walks back the three degenerate methods and passes the two trustworthy controls. Our contribution is an evaluation methodology paper for the NeurIPS 2026 Evaluations \& Datasets track: we provide the protocol, the calibration logic for its thresholds, a reference implementation, a five-method audit, and validation through temporal replay, cross-architecture checks, intervention-based disambiguation, and a documented catalog of pipeline pitfalls, in the spirit of critical evaluation analyses such as \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}.
+Modern feedback-alignment evaluation on deep residual networks is still summarized by a deceptively simple pair: headline accuracy and headline cosine alignment $\Gamma$ to the backpropagation gradient. We show that this pair can silently fail in two distinct ways on standard CIFAR-10 pre-LayerNorm ResMLP and ViT-Mini settings: first, \emph{measurement degeneracy}, where residual-stream growth drives hidden-layer BP gradients to the numerical floor and makes $\Gamma$ uninterpretable; and second, \emph{low intrinsic credit-direction quality}, where random-feedback credit remains essentially unaligned with BP on the deep blocks even when the reference gradient is still meaningful. The headline result is that the field-standard reporting pair walks back none of the methods we audit, whereas a four-diagnostic protocol walks back the three degenerate methods and passes the two trustworthy controls. Intervention with a per-block scale-control penalty further reveals method-dependent severity within the audited fixed-feedback family: State Bridge then exceeds the architecture-matched frozen-blocks baseline by about $10$ percentage points, while Credit Bridge attains much higher deep BP cosine than DFA at the same final accuracy, a dissociation that motivates reporting layerwise credit quality jointly with a depth-utilization baseline. Our contribution is an evaluation methodology paper for the NeurIPS 2026 Evaluations \& Datasets track: we provide the protocol, the calibration logic for its thresholds, a reference implementation, a five-method audit, and validation through temporal replay, cross-architecture checks, intervention-based disambiguation, and a documented catalog of pipeline pitfalls, in the spirit of critical evaluation analyses such as \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}.
 \end{abstract}
 
 \section{Introduction}
@@ -32,7 +32,7 @@ Modern feedback-alignment evaluation on deep residual networks is still summariz
 
 Feedback-alignment papers are usually judged by two numbers: task accuracy and an aggregate similarity between the method's local credit signal and the backpropagation gradient \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct}. On the audited 4-block $d{=}256$ ResMLP, however, Table~\ref{tab:main_audit} already shows that this pair is not a validity check: DFA reaches only $0.306 \pm 0.006$ test accuracy, below the architecture-matched frozen-blocks baseline of $0.349 \pm 0.002$, while still looking superficially comparable to other non-BP methods. Figure~\ref{fig:audit_hero} further shows that the apparent cosine evidence is concentrated at the shallowest block, with DFA at seed 42 reaching about $+0.42$ at layer 0 but approximately $-0.03$ to $0$ on layers 1--4, so the aggregate obscures where credit direction is and is not present. At the same time, the deepest BP reference norm is only about $5 \times 10^{-10}$ for DFA, State Bridge, and Credit Bridge, below the $10^{-8}$ clamp used by \texttt{F.cosine\_similarity}, whereas BP remains around $4 \times 10^{-4}$, so the reported deep cosine is partly computed against a numerical-floor reference rather than an informative gradient direction (Figure~\ref{fig:audit_hero}; Table~\ref{tab:main_audit}). Those numbers can be useful, but only if the measurement regime itself is valid.
 
-Our audit shows that modern residual vision models can make these two quantities look informative while failing to answer the question they are taken to answer. Figure~\ref{fig:audit_hero} shows the first failure mode, which we call \emph{Mode 1: measurement degeneracy}, where residual-stream growth drives the deepest hidden state to about $\|h_L\| \sim 10^8$ under DFA/SB/CB while the corresponding BP reference collapses to $\|g_L\| \sim 5 \times 10^{-10}$, so the deep-layer cosine is measured against a clamp-dominated floor rather than a meaningful target direction. The same figure also shows the second failure mode, \emph{Mode 2: low intrinsic credit-direction quality}, because even after comparing against the stronger frozen-blocks baseline ($0.349 \pm 0.002$) and looking layer-by-layer, DFA's deep blocks remain essentially null while only layer 0 is visibly positive. To test whether this is only a measurement problem, the intervention results show a dissociation: with a residual penalty $\lambda \|f_l(h_l)\|^2$, the deepest state scale falls toward $4 \times 10^4$, the reference gradient rises toward $10^{-6}$, and deep cosine can improve to about $+0.16$, yet at $\lambda{=}10^{-4}$ Mode 1 is alleviated while deep cosine still stays near zero, and at vanilla DFA epoch 1 the reference is already meaningful at about $6 \times 10^{-7}$ but the deep cosine is still $-0.008 \pm 0.013$ across three seeds. The failure is not unitary: one mode breaks the measurement, and the other survives even when the measurement is still meaningful.
+Our audit shows that modern residual vision models can make these two quantities look informative while failing to answer the question they are taken to answer. Figure~\ref{fig:audit_hero} shows the first failure mode, which we call \emph{Mode 1: measurement degeneracy}, where residual-stream growth drives the deepest hidden state to about $\|h_L\| \sim 10^8$ under DFA/SB/CB while the corresponding BP reference collapses to $\|g_L\| \sim 5 \times 10^{-10}$, so the deep-layer cosine is measured against a clamp-dominated floor rather than a meaningful target direction. The same figure also shows the second failure mode, \emph{Mode 2: low intrinsic credit-direction quality}, because even after comparing against the stronger frozen-blocks baseline ($0.349 \pm 0.002$) and looking layer-by-layer, DFA's deep blocks remain essentially null while only layer 0 is visibly positive. Intervention sharpens both modes. Adding a per-block residual penalty $\lambda \|f_l(h_l)\|^2$ to DFA at $\lambda{=}10^{-2}$ contains $\|h_L\|$ to about $4\times 10^4$ and lifts the deep BP reference to about $10^{-6}$, but DFA's rescued deep cosine is only about $+0.16$; State Bridge under the same intervention reaches a three-seed deep cosine of $+0.32$ and, unlike DFA, exceeds the frozen-blocks baseline by $+10$ points in final accuracy; Credit Bridge reaches a deep cosine near $+0.68$ yet matches only the DFA accuracy, so Mode~2 has method-dependent severity and deep cosine is not a sufficient predictor of final accuracy across methods. At the same time, at $\lambda{=}10^{-4}$ Mode~1 is alleviated while the DFA deep cosine still stays near zero, and at vanilla DFA epoch 1 the reference is already meaningful at about $6 \times 10^{-7}$ but the deep cosine is still $-0.008 \pm 0.013$ across three seeds. The failure is therefore neither unitary nor uniform: Mode~1 and Mode~2 are observationally separable, and within the audited fixed-feedback family, the severity of each mode varies by method.
 
 Accordingly, this paper does not introduce a new FA variant or a new benchmark. Instead, Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero} use a standard five-method CIFAR-10 audit to show that status-quo reporting would treat BP, EP, DFA, State Bridge, and Credit Bridge as the same kind of evidence-bearing object even though only BP and EP remain trustworthy under matched diagnostic checks. This makes the contribution methodological in the sense of \citet{jordan2020evaluating}, \citet{obray2022evaluation}, and \citet{paleka2026pitfalls}: the central question is not whether one more FA variant can post a headline number, but whether the reporting pipeline distinguishes meaningful credit-direction evidence from numerical-floor artifacts and from shallow-only learning. The protocol therefore starts from per-layer diagnostics and a frozen-blocks baseline before reading any aggregate cosine or final accuracy as evidence about deep credit assignment. We first show the walk-back on a standard audit, then isolate the two failure modes, and finally state the reporting protocol that future FA papers should satisfy.
 
@@ -169,7 +169,7 @@ Diag. & Measurement & Default threshold & Role \\
 
 The point of the protocol is not to add plots; it is to prevent a specific class of false conclusions. For this paper, the minimal protocol is four checks: per-layer activation scale via max-per-block growth, deepest hidden BP gradient floor, meaningful-regime per-layer credit quality, and an architecture-matched frozen-blocks baseline (Table~\ref{tab:protocol_def}). The first two ask whether the reference quantity is still valid; the third asks whether, once validity is restored, the deep blocks receive useful directions; and the fourth asks whether the trained depth is doing better than a model whose residual blocks were never trained at all. Figure~\ref{fig:decision_utility} makes the decision value explicit: accuracy alone walks back $0/5$ audited methods, accuracy plus headline $\Gamma$ still walks back $0/5$, and the full protocol walks back $3/5$ by flagging DFA, State Bridge, and Credit Bridge, with diagnostics (a), (b), and (d) each independently sufficient for binary detection on those failures. On our audit, these checks catch failures that accuracy plus aggregate alignment miss completely.
 
-A useful evaluation rule should reject the bad cases without collapsing everything into a negative result. The protocol is conservative in exactly that sense: it preserves BP and EP as evidence-bearing controls, and it walks back only those claims that fail measurement-validity or depth-utilization checks in Table~\ref{tab:main_audit}. That asymmetry is important because the thresholds are not equally strong in the same way. Diagnostics (a) and (b) have sharp empirical calibration gaps in the audited regime, diagnostic (c) is explicitly a sub-mode discriminator rather than a primary detector, and diagnostic (d) uses a deliberately weak $2$pp margin as a context check rather than a theorem about useful depth. The rule therefore does not say that low accuracy, low aggregate alignment, or any non-BP method is automatically invalid; it says only that claims unsupported by measurement-valid evidence should be withdrawn, while trustworthy controls should remain standing. That conservative asymmetry is why the protocol belongs in the main paper rather than the appendix.
+A useful evaluation rule should reject the bad cases without collapsing everything into a negative result. The protocol is conservative in exactly that sense: it preserves BP and EP as evidence-bearing controls, and it walks back only those claims that fail measurement-validity or depth-utilization checks in Table~\ref{tab:main_audit}. That asymmetry is important because the thresholds are not equally strong in the same way. Diagnostics (a) and (b) have sharp empirical calibration gaps in the audited regime, diagnostic (c) is explicitly a sub-mode discriminator rather than a primary detector, and diagnostic (d) uses a deliberately weak $2$pp margin as a context check rather than a theorem about useful depth. The rule therefore does not say that low accuracy, low aggregate alignment, or any non-BP method is automatically invalid; it says only that claims unsupported by measurement-valid evidence should be withdrawn, while trustworthy controls should remain standing. The Section~\ref{sec:mode2} cross-method cosine-versus-accuracy dissociation reinforces the necessity of keeping all four diagnostics separate: Credit Bridge, State Bridge, and DFA differ by more than a factor of four in deep-layer alignment under the same penalty rescue without tracking final accuracy in the same direction, so aligning an alternative credit rule with the BP gradient is not a substitute for checking depth utilization against a matched shallow baseline. That conservative asymmetry is why the protocol belongs in the main paper rather than the appendix.
 
 \begin{figure}[t]
 \centering
@@ -184,7 +184,7 @@ A useful evaluation rule should reject the bad cases without collapsing everythi
 
 Our claim is about what existing evidence licenses, not about impossibility. This paper does not show that FA cannot work in deep networks; it shows that current evaluation practice can misread what happened by letting headline accuracy and aggregate alignment stand in for measurement validity and layerwise credit quality. The strongest examples are precisely the cases where the field-standard summary would sound mildly positive while the audited deep evidence has already collapsed or is already null: DFA, State Bridge, and Credit Bridge all survive status-quo reporting in Table~\ref{tab:main_audit}, yet the protocol shows that their deep claims are unsupported. The intervention results in Figure~\ref{fig:penalty_rescue} reinforce the same distinction, because restoring a measurable regime partially rescues deep credit signal rather than proving that the original headline had been trustworthy all along. That distinction is important because evaluation failure and algorithmic impossibility are different statements.
 
-The right level of generality is the audited regime. Our strongest claim is scoped to modern residual vision architectures, especially the pre-LayerNorm and terminal-LayerNorm settings where we directly observed Mode~1: the 4-block ResMLP at $d{=}256$, its $d{=}512$ extension, and ViT-Mini all show the same basic pattern, whereas StudentNet and the BatchNorm CNN refine the scope by showing that activation-growth failures can persist without the hidden-gradient-floor collapse (Figure~\ref{fig:cross_arch_summary}; Figure~\ref{fig:penalty_rescue}). That leaves clear limits. The dataset is only CIFAR-10, the models are small to medium rather than frontier-scale, the terminal-LN interpretation is observational rather than a causal identification, and the BP-plus-penalty comparison is only a lower-bound control on penalty cost rather than a perfect decomposition. Those limitations narrow what is claimed, but they do not weaken the core methodological point that the audited measurement regime can fail silently in exactly the architectures that now dominate this genre of experiment. Future positive or negative examples outside this regime would refine the scope of the protocol, not invalidate the critique.
+The right level of generality is the audited regime. Our strongest claim is scoped to modern residual vision architectures, especially the pre-LayerNorm and terminal-LayerNorm settings where we directly observed Mode~1: the 4-block ResMLP at $d{=}256$, its $d{=}512$ extension, and ViT-Mini all show the same basic pattern, whereas StudentNet and the BatchNorm CNN refine the scope by showing that activation-growth failures can persist without the hidden-gradient-floor collapse (Figure~\ref{fig:cross_arch_summary}; Figure~\ref{fig:penalty_rescue}). That leaves clear limits. The dataset is only CIFAR-10, the models are small to medium rather than frontier-scale, the terminal-LayerNorm-necessity claim for diagnostic~(b) is established causally on the audited residual ResMLP via the matched same-backbone no-terminal-LN control but not proven to extend beyond that architecture family, and the BP-plus-penalty comparison is only a lower-bound control on penalty cost rather than a perfect decomposition. Those limitations narrow what is claimed, but they do not weaken the core methodological point that the audited measurement regime can fail silently in exactly the architectures that now dominate this genre of experiment. Future positive or negative examples outside this regime would refine the scope of the protocol, not invalidate the critique.
 
 The main lesson is to decompose the evaluation question before interpreting the answer. Future FA papers should report, separately, whether the BP reference is still meaningful, whether the deep layers receive useful credit in that meaningful regime, and whether trained depth beats an architecture-matched frozen-blocks baseline, instead of compressing those distinct questions into a single headline accuracy or headline $\Gamma$. That is the sense in which this paper fits the evaluation-methodology line of \citet{jordan2020evaluating}, \citet{obray2022evaluation}, and \citet{paleka2026pitfalls}: the contribution is not a new benchmark artifact, but a reporting rule for preventing a repeatable interpretive error. Once the field enforces that separation between measurement validity and substantive credit quality, positive results will become more trustworthy and negative results more precise. Once that decomposition is enforced, the apparent evidence for successful deep credit assignment becomes much harder to overstate.
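
The Mode 1 argument in the patched introduction rests on the deep BP reference norm (about 5e-10) falling below the 1e-8 clamp the text attributes to `F.cosine_similarity`. The sketch below is a pure-Python illustration of that effect, not the paper's reference implementation. It assumes the clamp is applied to each norm separately; the exact form may differ between library versions. The helper names and the toy vectors are hypothetical.

```python
import math

EPS = 1e-8  # clamp value quoted in the paper text; treated here as an assumption


def norm(v):
    """Euclidean norm of a plain Python list."""
    return math.sqrt(sum(x * x for x in v))


def clamped_cosine(a, b, eps=EPS):
    """Cosine similarity with each norm clamped from below by eps (assumed form)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (max(norm(a), eps) * max(norm(b), eps))


def unclamped_cosine(a, b):
    """Textbook cosine similarity, with no numerical floor."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))


# Hypothetical toy vectors: the credit signal points exactly along the BP reference.
credit = [1.0, 0.0]
bp_floor = [5e-10, 0.0]   # norm at the floor reported for DFA/SB/CB
bp_healthy = [4e-4, 0.0]  # norm of the order reported for BP

floor_cos = clamped_cosine(credit, bp_floor)
healthy_cos = clamped_cosine(credit, bp_healthy)
true_cos = unclamped_cosine(credit, bp_floor)

print(f"clamped cosine, floor reference:   {floor_cos:.3f}")
print(f"clamped cosine, healthy reference: {healthy_cos:.3f}")
print(f"unclamped cosine, floor reference: {true_cos:.3f}")
```

In this toy case the two vectors are perfectly aligned, yet the clamped cosine against the floor-level reference comes out near 0.05 (5e-10 / 1e-8) rather than 1. That is why diagnostic (b) checks the reference norm before any deep-layer cosine is interpreted.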