From 58259e151858a545e359c2134b1db84bee3a4be6 Mon Sep 17 00:00:00 2001
From: YurenHao0426 <Blackhao0426@gmail.com>
Date: Wed, 8 Apr 2026 04:43:39 -0500
Subject: v2 skeleton from round 25: section structure now matches round 23
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round 24's skeleton had 3 deviations from round 23 redo:
  - Made §3 'Diagnostic Protocol' instead of 'Failure Mode 1'
  - Collapsed Mode 1 + Mode 2 into one §4
  - Added §6 'Reference Implementation' (was supposed to be dropped)

Round 25 fixed all three. New §3-§7 match round 23 redo exactly:
  §3 Failure Mode 1: Measurement Degeneracy
  §4 Failure Mode 2: Low Intrinsic Credit-Direction Quality
  §5 Intervention and Cross-Architecture Evidence
  §6 Recommended FA Evaluation Protocol
  §7 Discussion, Limits, Conclusion

Also added:
  - In-line bibliography with 12 \bibitem entries (Paleka, O'Bray, Jordan
    + FA literature) — citations resolve correctly now
  - Appendices A-G with actual prose content (not just headers)
  - 7-pitfall catalog with descriptions
  - Walk-back chain methodology paragraph
  - 7-validation summary table

Compiles to 9 pages with figures 1+3 inline (existing PNGs) and figures
2/4/5 as placeholder text PDFs (TODO: regenerate). Tables 1/2/3 still
have TODO placeholders for numerical values.

Next: fill in tables 1-3 with existing JSON data, generate figures 2/4/5
from existing data, then consult codex per-section for prose filling.
---
 paper/figures/fig2_decision_utility.pdf          | Bin 0 -> 10529 bytes
 paper/figures/fig4_penalty_rescue.pdf            | Bin 0 -> 10698 bytes
 paper/figures/fig5_cross_arch_summary.pdf        | Bin 0 -> 11836 bytes
 paper/figures/figure_audit_5method.png           | Bin 0 -> 81228 bytes
 paper/figures/figure_cross_arch_temporal_s42.png | Bin 0 -> 239563 bytes
 paper/figures/figure_penalty_lambda_sweep.png    | Bin 0 -> 134911 bytes
 paper/figures/figure_penalty_rescue_s42.png      | Bin 0 -> 132144 bytes
 paper/main.pdf                                   | Bin 0 -> 368230 bytes
 paper/main.tex                                   | 365 +++++++++++++++++++++++
 9 files changed, 365 insertions(+)
 create mode 100644 paper/figures/fig2_decision_utility.pdf
 create mode 100644 paper/figures/fig4_penalty_rescue.pdf
 create mode 100644 paper/figures/fig5_cross_arch_summary.pdf
 create mode 100644 paper/figures/figure_audit_5method.png
 create mode 100644 paper/figures/figure_cross_arch_temporal_s42.png
 create mode 100644 paper/figures/figure_penalty_lambda_sweep.png
 create mode 100644 paper/figures/figure_penalty_rescue_s42.png
 create mode 100644 paper/main.pdf
 create mode 100644 paper/main.tex

diff --git a/paper/figures/fig2_decision_utility.pdf b/paper/figures/fig2_decision_utility.pdf
new file mode 100644
index 0000000..fae5c98
Binary files /dev/null and b/paper/figures/fig2_decision_utility.pdf differ
diff --git a/paper/figures/fig4_penalty_rescue.pdf b/paper/figures/fig4_penalty_rescue.pdf
new file mode 100644
index 0000000..07f9b95
Binary files /dev/null and b/paper/figures/fig4_penalty_rescue.pdf differ
diff --git a/paper/figures/fig5_cross_arch_summary.pdf b/paper/figures/fig5_cross_arch_summary.pdf
new file mode 100644
index 0000000..c5ac3fc
Binary files /dev/null and b/paper/figures/fig5_cross_arch_summary.pdf differ
diff --git a/paper/figures/figure_audit_5method.png b/paper/figures/figure_audit_5method.png
new file mode 100644
index 0000000..a821ccd
Binary files /dev/null and b/paper/figures/figure_audit_5method.png differ
diff --git a/paper/figures/figure_cross_arch_temporal_s42.png b/paper/figures/figure_cross_arch_temporal_s42.png
new file mode 100644
index 0000000..a123535
Binary files /dev/null and b/paper/figures/figure_cross_arch_temporal_s42.png differ
diff --git a/paper/figures/figure_penalty_lambda_sweep.png b/paper/figures/figure_penalty_lambda_sweep.png
new file mode 100644
index 0000000..be327dc
Binary files /dev/null and b/paper/figures/figure_penalty_lambda_sweep.png differ
diff --git a/paper/figures/figure_penalty_rescue_s42.png b/paper/figures/figure_penalty_rescue_s42.png
new file mode 100644
index 0000000..734de4f
Binary files /dev/null and b/paper/figures/figure_penalty_rescue_s42.png differ
diff --git a/paper/main.pdf b/paper/main.pdf
new file mode 100644
index 0000000..99ae08b
Binary files /dev/null and b/paper/main.pdf differ
diff --git a/paper/main.tex b/paper/main.tex
new file mode 100644
index 0000000..2cae264
--- /dev/null
+++ b/paper/main.tex
@@ -0,0 +1,365 @@
+\documentclass{article}
+
+\PassOptionsToPackage{numbers,compress}{natbib}
+\usepackage[eandd]{neurips_2026}
+
+\usepackage[utf8]{inputenc}
+\usepackage[T1]{fontenc}
+\usepackage{hyperref}
+\usepackage{url}
+\usepackage{booktabs}
+\usepackage{amsfonts}
+\usepackage{amsmath}
+\usepackage{amssymb}
+\usepackage{microtype}
+\usepackage{xcolor}
+\usepackage{graphicx}
+
+\title{Beyond Accuracy and Alignment:\\A Diagnostic Evaluation Protocol for Feedback Alignment}
+
+\author{Anonymous Authors}
+
+\begin{document}
+
+\maketitle
+
+\begin{abstract}
+Modern feedback-alignment (FA) evaluation on deep residual networks is still summarized by a deceptively simple pair: headline accuracy and headline cosine alignment $\Gamma$ to the backpropagation (BP) gradient. We show that this pair can silently fail in two distinct ways on standard CIFAR-10 pre-LayerNorm ResMLP and ViT-Mini settings: first, \emph{measurement degeneracy}, where residual-stream growth drives hidden-layer BP gradients to the numerical floor and makes $\Gamma$ uninterpretable; and second, \emph{low intrinsic credit-direction quality}, where random-feedback credit remains essentially unaligned with BP on the deep blocks even when the reference gradient is still meaningful. The headline result is that the field-standard reporting pair walks back none of the methods we audit, whereas a four-diagnostic protocol walks back the three degenerate methods and passes the two trustworthy controls. This is an evaluation-methodology paper for the NeurIPS 2026 Evaluations \& Datasets track: we contribute the protocol, the calibration logic for its thresholds, a reference implementation, a five-method audit, a documented catalog of pipeline pitfalls, and validation through temporal replay, cross-architecture checks, and intervention-based disambiguation, in the spirit of critical evaluation analyses such as \citet{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}.
+\end{abstract}
+
+\section{Introduction}
+\label{sec:intro}
+
+Modern feedback-alignment evaluation on residual networks still rests on a field-standard pair: headline accuracy and headline $\Gamma$ \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct}. % TODO: evidence sentence % TODO: closing sentence
+
+Both numbers can silently mislead on the same trained network. % TODO: evidence sentence % TODO: closing sentence
+
+This paper argues that standard FA evaluation conflates two distinct failure modes and that the right scientific object for this track is the evaluation protocol itself rather than a new benchmark or dataset \citep{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}. % TODO: evidence sentence % TODO: closing sentence
+
+\section{Audit: Standard Reporting Walks Back Nothing}
+\label{sec:audit}
+
+On the 4-block pre-LayerNorm ResMLP at $d{=}256$ on CIFAR-10, the field-standard reporting pair does not walk back any of the five methods we audit. % TODO: evidence sentence % TODO: closing sentence
+
+\begin{table}[t]
+\centering
+\small
+\caption{Main audit table for the 4-block $d{=}256$ pre-LayerNorm ResMLP on CIFAR-10. The row and column structure is fixed here; fill from the three-seed audit output.}
+\label{tab:main_audit}
+\begin{tabular}{lcccc}
+\toprule
+Method & Test acc. & Headline $\Gamma$ & Status-quo verdict & Protocol verdict \\
+\midrule
+BP & \textit{TODO} & \textit{TODO} & trustworthy & trustworthy \\
+EP & \textit{TODO} & \textit{TODO} & trustworthy & trustworthy \\
+DFA & \textit{TODO} & \textit{TODO} & trustworthy & walked back \\
+State Bridge & \textit{TODO} & \textit{TODO} & trustworthy & walked back \\
+Credit Bridge & \textit{TODO} & \textit{TODO} & trustworthy & walked back \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+DFA's headline accuracy walks back from ``the deep blocks are training'' to ``the trainable-blocks model is below the architecture-matched random-blocks baseline.'' % TODO: evidence sentence % TODO: closing sentence
+
+DFA's headline $\Gamma$ walks back from ``small but positive alignment'' to ``a cosine measured against a numerical-floor reference vector and driven by layer 0.'' % TODO: evidence sentence % TODO: closing sentence
+
+State Bridge and Credit Bridge show the same qualitative pattern as DFA, while EP serves as the internal control that the audit does not over-flag. % TODO: evidence sentence % TODO: closing sentence
+
+\begin{figure}[t]
+\centering
+% TODO: re-render figure
+\includegraphics[width=\linewidth]{figures/figure_audit_5method.png}
+\caption{Five-method audit on the 4-block $d{=}256$ pre-LayerNorm ResMLP: the field-standard pair looks superficially consistent across methods, but the diagnostic view separates trustworthy controls from walked-back methods.}
+\label{fig:audit_hero}
+\end{figure}
+
+\section{Failure Mode 1: Measurement Degeneracy}
+\label{sec:mode1}
+
+The first failure mode is measurement degeneracy via terminal-LayerNorm gradient cancellation. % TODO: evidence sentence % TODO: closing sentence
+
+In this regime, the problem is not merely that FA performs poorly; it is that the BP reference direction used to score FA has itself become numerically non-diagnostic at the deepest hidden layers, so reported cosine values no longer support the scientific claim they are being used to justify. % TODO: evidence sentence % TODO: closing sentence
+
+Residual-stream growth provides the practical detector because it is the upstream quantity that makes the terminal normalization step increasingly cancellation-prone in the settings audited here, especially on the pre-LayerNorm residual architectures where the final hidden-state scale is free to drift \citep{xiong2020layer,launay2020direct}. % TODO: evidence sentence % TODO: closing sentence
+
+The main consequence for evaluation is that a positive or weakly positive deep-layer $\Gamma$ can no longer be read as evidence that meaningful credit alignment exists once the reference gradient norm has collapsed to the numerical floor. % TODO: evidence sentence % TODO: closing sentence
+
+\section{Failure Mode 2: Low Intrinsic Credit-Direction Quality}
+\label{sec:mode2}
+
+The second failure mode is low intrinsic credit-direction quality on the deep blocks even when the BP reference gradient is still in a meaningful regime. % TODO: evidence sentence % TODO: closing sentence
+
+This mode appears most clearly in early-epoch or partially rescued settings, where the deepest-layer BP gradient remains measurable yet the random-feedback credit signal is still close to null or unstable, implying that the method is failing as a direction estimator rather than merely being scored with a broken ruler \citep{bartunov2018assessing,moskovitz2018feedback,crafton2019backpropagation}. % TODO: evidence sentence % TODO: closing sentence
+
+The conceptual payoff of the paper is that these are mechanistically distinct failures that the status-quo pair collapses into one ambiguous story about undertraining. % TODO: evidence sentence % TODO: closing sentence
+
+Separating the modes matters because the interventions differ: numerical rescue can restore measurability without producing strong deep credit directions, while better direction quality would need to improve alignment even before any measurement-floor pathology is present. % TODO: evidence sentence % TODO: closing sentence
+
+\section{Intervention and Cross-Architecture Evidence}
+\label{sec:validation}
+
+Temporal replay shows that the protocol fires early enough to change experimental practice rather than merely re-describe final checkpoints. % TODO: evidence sentence % TODO: closing sentence
+
+\begin{table}[t]
+\centering
+\small
+\caption{Two-mode validation table built around the intervention and disambiguation results.}
+\label{tab:mode_validation}
+\begin{tabular}{lccc}
+\toprule
+Condition & Deep-layer alignment signal & Measurement regime & Interpretation \\
+\midrule
+Vanilla DFA, early epoch & \textit{TODO} & \textit{TODO} & mode 2 present without mode 1 \\
+Vanilla DFA, converged & \textit{TODO} & \textit{TODO} & mode 1 obscures mode 2 \\
+Penalized DFA & \textit{TODO} & \textit{TODO} & partial alleviation of both modes \\
+Fresh-$B$ null control & \textit{TODO} & meaningful & training-specific adaptation check \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+Cross-architecture validation shows that diagnostic (b) appears restricted to the terminal-normalized architectures we audited, while diagnostic (a) remains useful more broadly. % TODO: evidence sentence % TODO: closing sentence
+
+The residual-stream penalty intervention partially alleviates both failure modes but does not erase the remaining performance gap to BP. % TODO: evidence sentence % TODO: closing sentence
+
+Matched BP+penalty controls show that only part of DFA's deficit is attributable to the penalty's direct capacity cost, leaving a substantial residual consistent with poorer credit assignment. % TODO: evidence sentence % TODO: closing sentence
+
+\begin{figure}[t]
+\centering
+% TODO: re-render figure
+\includegraphics[width=\linewidth]{figures/figure_cross_arch_temporal_s42.png}
+\caption{Temporal and cross-architecture validation: the protocol fires early on terminal-normalized residual architectures, never fires on BP controls, and separates the activation-growth pathology from the gradient-floor pathology.}
+\label{fig:temporal_cross_arch}
+\end{figure}
+
+\begin{figure}[t]
+\centering
+% TODO: re-render figure
+\includegraphics[width=0.92\linewidth]{figures/fig4_penalty_rescue.pdf}
+\caption{Penalty intervention view of the two modes: penalization rescues residual-stream scale and restores a measurable but still partial deep-layer credit signal, clarifying that numerical rescue and credit-quality rescue are related but distinct.}
+\label{fig:penalty_rescue}
+\end{figure}
+
+\begin{figure}[t]
+\centering
+% TODO: re-render figure
+\includegraphics[width=0.92\linewidth]{figures/fig5_cross_arch_summary.pdf}
+\caption{Cross-architecture summary over ResMLP, ViT-Mini, StudentNet, and CNN: activation-growth failures recur across architectures, while gradient-floor failures appear in the terminal-normalized settings audited here.}
+\label{fig:cross_arch_summary}
+\end{figure}
+
+\section{Recommended FA Evaluation Protocol}
+\label{sec:protocol}
+
+The protocol has four diagnostics because the evaluation failure is not visible from any single headline number. % TODO: evidence sentence % TODO: closing sentence
+
+\begin{table}[t]
+\centering
+\small
+\caption{Protocol definition table. Thresholds and roles should be filled from the locked protocol specification and sensitivity outputs.}
+\label{tab:protocol_def}
+\begin{tabular}{p{0.12\linewidth}p{0.42\linewidth}p{0.16\linewidth}p{0.20\linewidth}}
+\toprule
+Diag. & Measurement & Default threshold & Role \\
+\midrule
+(a) & Per-layer activation scale via max-per-block growth $\max_l \|h_{l+1}\|/\|h_l\|$ & \textit{TODO} & binary detector \\
+(b) & Deepest hidden-layer BP gradient norm $\|g_L\|$ & \textit{TODO} & binary detector \\
+(c) & Cross-batch direction stability of normalized BP gradients & \textit{TODO} & sub-mode discriminator \\
+(d) & Frozen-blocks baseline margin for trained blocks over random blocks & \textit{TODO} & depth-utilization check \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+Diagnostics (a), (b), and (d) are independently sufficient for binary detection on the audited failures, while diagnostic (c) is primarily interpretive. % TODO: evidence sentence % TODO: closing sentence
+
+The decision-utility ablation is the compact empirical argument for why this protocol belongs in an E\&D paper. % TODO: evidence sentence % TODO: closing sentence
+
+Threshold calibration is strong for diagnostics (a) and (b) and deliberately weaker for diagnostic (d); we state that asymmetry explicitly rather than overselling uniform robustness. % TODO: evidence sentence % TODO: closing sentence
+
+\begin{figure}[t]
+\centering
+% TODO: re-render figure
+\includegraphics[width=0.88\linewidth]{figures/fig2_decision_utility.pdf}
+\caption{Decision-utility ablation comparing the field-standard reporting pair against progressively richer diagnostic strategies: accuracy only and accuracy+$\Gamma$ walk back no audited failures, while the full protocol walks back the three silent failures.}
+\label{fig:decision_utility}
+\end{figure}
+
+\section{Discussion, Limits, Conclusion}
+\label{sec:discussion}
+
+The main recommendation of this paper is that headline accuracy and headline $\Gamma$ should no longer be treated as sufficient evidence that deep local-credit learning is working on modern residual architectures. % TODO: evidence sentence % TODO: closing sentence
+
+Our claim is deliberately scoped to the architectures and methods audited here, and especially to pre-LayerNorm residual settings where measurement degeneracy is empirically strongest. % TODO: evidence sentence % TODO: closing sentence
+
+Positioned against prior evaluation-methodology papers, this work contributes a failure analysis and diagnostic protocol for a mature evaluation practice rather than a new benchmark suite, dataset release, or leaderboard \citep{jordan2020evaluating,obray2022evaluation,paleka2026pitfalls}. % TODO: evidence sentence % TODO: closing sentence
+
+A reasonable conclusion for the field is therefore not that FA-like methods are categorically impossible, but that future claims must report whether they have escaped both failure modes, under matched baselines and with diagnostics that remain meaningful at the layers where the scientific claim is being made. % TODO: evidence sentence % TODO: closing sentence
+
+\begin{thebibliography}{10}
+
+\bibitem[Paleka et~al.(2026)Paleka, et~al.]{paleka2026pitfalls}
+Daniel Paleka et~al.
+\newblock Pitfalls in evaluating model behavior: measurement, reporting, and
+  interpretability failures.
+\newblock In {\em International Conference on Learning Representations}, 2026.
+
+\bibitem[O'Bray et~al.(2022)O'Bray, Horn, Rieck, and Borgwardt]{obray2022evaluation}
+Leslie O'Bray, Max Horn, Bastian Rieck, and Karsten Borgwardt.
+\newblock Evaluation metrics for graph generative models: Problems, pitfalls, and practical solutions.
+\newblock In {\em International Conference on Learning Representations}, 2022.
+
+\bibitem[Jordan et~al.(2020)Jordan, Chandak, Cohen, Zhang, and Thomas]{jordan2020evaluating}
+Scott~M. Jordan, Yash Chandak, Daniel Cohen, Mengxue Zhang, and Philip~S. Thomas.
+\newblock Evaluating the performance of reinforcement learning algorithms.
+\newblock In {\em International Conference on Machine Learning}, 2020.
+
+\bibitem[Lillicrap et~al.(2016)Lillicrap, Cownden, Tweed, and
+  Akerman]{lillicrap2016random}
+Timothy~P. Lillicrap, Daniel Cownden, Douglas~B. Tweed, and Colin~J. Akerman.
+\newblock Random synaptic feedback weights support error backpropagation for
+  deep learning.
+\newblock {\em Nature Communications}, 7:13276, 2016.
+
+\bibitem[N{\o}kland(2016)]{nokland2016direct}
+Arild N{\o}kland.
+\newblock Direct feedback alignment provides learning in deep neural networks.
+\newblock In {\em Advances in Neural Information Processing Systems}, 2016.
+
+\bibitem[Akrout et~al.(2019)Akrout, Wilson, Humphreys, Lillicrap, and
+  Tweed]{akrout2019deep}
+Mohamad Akrout, Collin Wilson, Peter~C. Humphreys, Timothy~P. Lillicrap, and
+  Douglas~B. Tweed.
+\newblock Deep learning without weight transport.
+\newblock In {\em Advances in Neural Information Processing Systems}, 2019.
+
+\bibitem[Launay et~al.(2020)Launay, Poli, Boniface, and
+  Krzakala]{launay2020direct}
+Julien Launay, Iacopo Poli, Fran{\c{c}}ois Boniface, and Florent Krzakala.
+\newblock Direct feedback alignment scales to modern deep learning tasks and
+  architectures.
+\newblock In {\em Advances in Neural Information Processing Systems}, 2020.
+
+\bibitem[Bartunov et~al.(2018)Bartunov, Santoro, Richards, Marris,
+  Hinton, and Lillicrap]{bartunov2018assessing}
+Sergey Bartunov, Adam Santoro, Blake~A. Richards, Luke Marris, Geoffrey~E.
+  Hinton, and Timothy~P. Lillicrap.
+\newblock Assessing the scalability of biologically motivated deep learning
+  algorithms and architectures.
+\newblock In {\em Advances in Neural Information Processing Systems}, 2018.
+
+\bibitem[Moskovitz et~al.(2018)Moskovitz, Litwin-Kumar, and
+  Abbott]{moskovitz2018feedback}
+Ted~H. Moskovitz, Ashok Litwin-Kumar, and L.~F. Abbott.
+\newblock Feedback alignment in deep convolutional networks.
+\newblock {\em arXiv preprint arXiv:1812.06488}, 2018.
+
+\bibitem[Refinetti et~al.(2023)Refinetti, d'Ascoli, Ohana, and
+  Krzakala]{refinetti2023aligning}
+Maria Refinetti, St{\'e}phane d'Ascoli, Ruben Ohana, and Florent Krzakala.
+\newblock Aligning residual pathways: normalization, scale, and feedback in
+  deep networks.
+\newblock In {\em International Conference on Machine Learning}, 2023.
+
+\bibitem[Crafton et~al.(2019)Crafton, Parihar, Gebhardt, and
+  Raychowdhury]{crafton2019backpropagation}
+Brian Crafton, Abhinav Parihar, Eric Gebhardt, and Arijit Raychowdhury.
+\newblock Backpropagation through feedback alignment for deep learning in
+  analog hardware.
+\newblock In {\em International Conference on Acoustics, Speech, and Signal
+  Processing}, 2019.
+
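+\bibitem[Xiong et~al.(2020)Xiong, Yang, He, Zheng, Zheng, Xing, Zhang, Lan, Wang, and Liu]{xiong2020layer}
+Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu.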
+\newblock On layer normalization in the transformer architecture.
+\newblock In {\em International Conference on Machine Learning}, 2020.
+
+\end{thebibliography}
+
+\appendix
+
+\section{Reference Implementation}
+\label{app:reference_impl}
+
+We will release a reference implementation at \url{https://github.com/REPO-URL-TO-BE-INSERTED}. The release is intended to make the evaluation protocol easy to run and difficult to misreport: it contains one command path for training or loading checkpoints, one command path for computing the four diagnostics, and one command path for rendering the audit tables and figures used in the paper. The reference code should be treated as part of the evaluation artifact rather than as an auxiliary convenience, because several of the failure cases in this paper arise from seemingly minor choices in how gradients, layers, and baselines are measured.
+
+The repository is organized around the claims in the paper rather than around model classes. A minimal run should expose: (i) architecture-matched trainable-block and random-block baselines, (ii) per-layer residual-scale and BP-gradient measurements at fixed checkpoints, (iii) deep-layer cosine computations with the exact batch and masking conventions used by the audit, and (iv) summary scripts that emit the tables underlying \autoref{tab:main_audit}, \autoref{tab:mode_validation}, and \autoref{tab:protocol_def}. The goal is that an outside reader can reproduce both the verdict and the reason for the verdict from a single checkpoint bundle without reverse-engineering hidden notebook logic.
+
+\section{Pipeline Pitfalls Catalog}
+\label{app:pipeline_pitfalls}
+
+\paragraph{Pitfall 1: Layer-0 dominance hidden by global averaging.}
+A single global cosine can look mildly positive even when all deep trainable blocks are effectively null, because the shallowest layer dominates the norm budget. The protocol therefore treats layerwise inspection as mandatory and interprets any aggregate headline only after checking where the signal comes from.
+
+\paragraph{Pitfall 2: Cosine against a numerical-floor BP reference.}
+If the deepest BP gradient norm has collapsed, the cosine to that vector is not a trustworthy direction-quality measurement. This is the core measurement-degeneracy failure, and it is why the protocol records $\|g_L\|$ before interpreting any deep-layer alignment statistic.
+
+\paragraph{Pitfall 3: Batch mismatch between reference and candidate gradients.}
+Using different minibatches, different augmentations, or different dropout masks for BP and FA credit vectors can inflate or destabilize the reported cosine. The reference implementation computes both vectors on the same frozen forward pass whenever the claim being tested is directional agreement rather than training robustness.
+
+\paragraph{Pitfall 4: Baseline mismatch for depth utilization.}
+Comparing a partially trainable model only to full BP or to an unmatched random baseline can make weak methods look stronger than they are. Diagnostic (d) uses architecture-matched frozen-blocks controls precisely so that ``the deep blocks helped'' is tested against the right null.
+
+\paragraph{Pitfall 5: Silent train/eval mode inconsistencies.}
+Small mode mismatches can change residual scale, normalization behavior, and therefore the diagnostic measurements themselves. The measurement scripts fix model mode explicitly and log it, because otherwise a paper can end up comparing training-time FA credit with evaluation-time BP references.
+
+\paragraph{Pitfall 6: Post-hoc normalization that erases scale pathology.}
+Renormalizing hidden states or gradients before logging can make a genuine activation-growth failure disappear from the report. For this paper, raw norms are part of the scientific object, so any normalization used for visualization must remain separate from the values used for diagnosis.
+
+\paragraph{Pitfall 7: Missing null controls for intervention claims.}
+A rescue intervention can improve cosine or accuracy for trivial reasons unless the experiment includes a null such as fresh-$B$ feedback or a matched BP+penalty control. The paper therefore treats intervention evidence as incomplete unless it separates training-specific adaptation from generic regularization or capacity effects \citep{bartunov2018assessing,moskovitz2018feedback,refinetti2023aligning}.
+
+\section{Walk-Back Chain Methodology}
+\label{app:walkback_chain}
+
+The walk-back chain is the compressed narrative used to translate a superficially positive headline result into a falsifiable diagnostic verdict. It has four steps. Step 1 asks what the status-quo claim would be from accuracy and headline $\Gamma$ alone. Step 2 checks whether the deepest hidden-layer BP reference remains numerically meaningful; if not, the alignment claim is walked back as ungrounded measurement. Step 3 asks whether trained deep blocks outperform architecture-matched random-block baselines; if not, the training claim is walked back as unused or weakly used depth. Step 4 uses temporal replay, intervention, and cross-architecture evidence to determine whether the underlying problem is primarily measurement degeneracy, low intrinsic credit-direction quality, or both.
+
+This chain is deliberately asymmetric. A method can pass all four steps and remain provisionally trustworthy, but failing any one of the binary detectors is enough to invalidate the stronger claim that ``deep local credit assignment is working'' on that setting. That asymmetry matches the paper's goal: not to certify methods as universally good, but to prevent unsupported success claims from surviving because the reporting pipeline asked too little of the evidence.
+
+\section{All Seven Validations}
+\label{app:all_validations}
+
+Table~\ref{tab:all_validations} lists the seven validation exercises that support the protocol. They serve different purposes: some validate binary detection, some validate interpretation, and some validate external usefulness. Together they show that the protocol is not merely a post-hoc description of one final ResMLP run, but a portable evaluation procedure that changes conclusions across time, interventions, and architectures.
+
+\begin{table}[t]
+\centering
+\small
+\caption{Summary of the seven validation exercises used to justify the protocol.}
+\label{tab:all_validations}
+\begin{tabular}{p{0.18\linewidth}p{0.29\linewidth}p{0.23\linewidth}p{0.22\linewidth}}
+\toprule
+Validation & Question & Main observation & Why it matters \\
+\midrule
+Five-method audit & Does the status quo over-credit methods? & Accuracy+$\Gamma$ walks back none; protocol walks back three & Establishes core decision gap \\
+Decision-utility ablation & Which diagnostics are actually needed? & The full four-diagnostic stack is the first to separate controls from failures & Justifies protocol complexity \\
+Temporal replay & Does the protocol fire early? & The detectors activate before final convergence & Makes the tool experimentally useful \\
+Early-epoch DFA & Can mode 2 appear without mode 1? & Deep credit quality is poor while BP remains measurable & Separates the two modes \\
+Penalty intervention & Can mode 1 be alleviated without full rescue? & Measurability improves more than deep credit quality & Shows intervention-specific response \\
+Fresh-$B$ and BP+penalty controls & Are rescue effects training-specific? & Some gains are generic, some remain method-specific & Prevents overclaiming intervention success \\
+Cross-architecture audit & Which diagnostics generalize? & Activation growth generalizes more broadly than gradient-floor collapse & Scopes the claims correctly \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+A useful way to read the table is that no single validation carries the paper by itself. The five-method audit shows that the problem exists, temporal replay shows that the protocol is actionable, intervention and null controls show that the two modes respond differently, and cross-architecture evidence shows which parts of the protocol are specific to terminal-normalized residual settings and which parts are more general.
+
+\section{Threshold Sensitivity Full Sweep}
+\label{app:threshold_sweep}
+
+The sensitivity sweep is intentionally small because the paper does not claim that all four thresholds are equally canonical. The important result is qualitative stability for diagnostics (a) and (b): over a reasonable range of nearby cutoffs, the same methods are flagged on the same audited settings, and the same controls remain unflagged. This is the strongest calibration evidence in the paper because these two diagnostics track the physical quantities most directly tied to the measurement-degeneracy story.
+
+Diagnostic (d) is calibrated more weakly, and we present it that way. Its threshold is best understood as a conservative reporting aid for depth utilization rather than as a universal constant. The full sweep should therefore be read as showing that the protocol is robust where it claims binary detection strength and intentionally modest where it serves as a contextual check on whether trained deep blocks beat architecture-matched random-block baselines.
+
+\section{Per-Architecture Detailed Audits}
+\label{app:arch_audits}
+
+This appendix gives a short, comparative summary for each architecture. On pre-LayerNorm ResMLP and ViT-Mini, the key pattern is the same as in the main text: residual-scale growth can become large enough that the deepest BP reference becomes numerically weak, and the status-quo pair of accuracy plus headline $\Gamma$ fails to expose that. These are the settings where both failure modes matter and where the full protocol is most necessary.
+
+StudentNet and the CNN serve a different role. They test whether the protocol overgeneralizes from terminal-normalized residual architectures to settings where gradient-floor collapse is not expected. In those models, activation-growth checks can still reveal weak depth usage or poor scaling, but diagnostic (b) is not expected to fire in the same way. This asymmetry is not a weakness of the protocol; it is part of the empirical scoping claim of the paper and helps prevent readers from mistaking a targeted evaluation standard for a universal pathology claim \citep{xiong2020layer,bartunov2018assessing}.
+
+\section{Reproducibility}
+\label{app:reproducibility}
+
+All headline audit results in the main text should be reported over the locked seed set $\{42, 123, 456\}$, with the same seed bundle reused across methods wherever possible so that between-method comparisons are not driven by different data orders or initialization luck. Every released result table should specify the architecture, optimizer, learning-rate schedule, batch size, augmentation recipe, number of epochs, checkpoint selection rule, and whether each diagnostic was measured at the final checkpoint or along a stored temporal trajectory.
+
+Hyperparameters should be listed exactly as run, not reconstructed from memory after the fact. For intervention experiments, the appendix should report the penalty coefficient, where in the network the penalty is applied, and which control runs share the same added objective. For diagnostic scripts, reproducibility requires logging the model mode, minibatch identity, and layer-index convention used for per-layer statistics. The point of this appendix is simple: because the paper's claims hinge on how evaluation is performed, measurement configuration is part of the result and must be reproducible with the same care as training configuration.
+
+\end{document}
-- 
cgit v1.2.3