Intro: add slower FA-first opening paragraph (user feedback)

User flagged that intro entered the critique too fast without first
explaining what feedback alignment is. Based on Bartunov et al. 2018's
intro structure (their Section 1 opens with the weight-transport problem,
introduces FA as the response, and only then motivates the evaluation
question), rewrote §1 paragraph 1.

New §1 ¶1:
- BP is standard but biologically implausible (weight-transport problem)
- FA (Lillicrap 2016) side-steps via random feedback
- DFA (Nokland 2016) simplifies by direct projection per layer
- Parallel lines: target propagation (Lee 2015), equilibrium propagation
- Modern scaling: Launay 2020 (transformers), Akrout 2019
- Evaluation converged on accuracy + Gamma cosine summary

§1 ¶2 (old ¶1) then starts the audit critique against this backdrop, so a
reader who arrived without any FA context now has one paragraph of set-up
before the critique begins.

Page-budget side effect: the ~110-word addition pushed main content to
10 pages briefly. Recovered by shrinking Figure 4 (cross_arch_summary)
from width=\linewidth to width=0.78\linewidth, which freed enough p9
vertical space for §7 to fit entirely on p9.

Result: main content strictly 9 pages (§1-§7 on p1-p9, references and
appendices on p10+). Total 18 pages. 0 overfull hbox.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 15:31:03 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 15:31:03 -0500
commit: 8edd4505568ef327eb72be2c5c57d24439b36986 (patch)
tree: 66481ba16442d7eab33553280dd61d096b2f30b1
parent: 3d4e6844d56ab87a3798e010400616bc3a49da7d (diff)
2 files changed, 4 insertions, 2 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index 743c185..bd5018c 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
diff --git a/paper/main.tex b/paper/main.tex
index 5e1cbd4..cb27a47 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -33,7 +33,9 @@ Modern feedback-alignment evaluation on deep residual networks is still summariz
 \section{Introduction}
 \label{sec:intro}
 
-Feedback-alignment papers are usually judged by two numbers: task accuracy and an aggregate similarity between the method's local credit signal and the backpropagation gradient \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct}. On the audited 4-block $d{=}256$ ResMLP, however, Table~\ref{tab:main_audit} already shows that this pair is not a validity check: DFA reaches only $0.306 \pm 0.006$ test accuracy, below the architecture-matched frozen-blocks baseline of $0.349 \pm 0.002$, while still looking superficially comparable to other non-BP methods. Figure~\ref{fig:audit_hero} further shows that the apparent cosine evidence is concentrated at the shallowest block, with DFA at seed 42 reaching about $+0.42$ at layer 0 but approximately $-0.03$ to $0$ on layers 1--4, so the aggregate obscures where credit direction is and is not present. At the same time, the deepest BP reference norm is only about $5 \times 10^{-10}$ for DFA, State Bridge, and Credit Bridge, below the $10^{-8}$ clamp used by \texttt{F.cosine\_similarity}, whereas BP remains around $4 \times 10^{-4}$, so the reported deep cosine is partly computed against a numerical-floor reference rather than an informative gradient direction (Figure~\ref{fig:audit_hero}; Table~\ref{tab:main_audit}). Those numbers can be useful, but only if the measurement regime itself is valid.
+Backpropagation (BP) is the de facto training method for deep neural networks, but its requirement that each feedback connection carry a weight identical to the corresponding forward connection -- the weight-transport problem -- has long been considered biologically implausible \citep{lillicrap2016random,bartunov2018assessing}. \emph{Feedback alignment} (FA) \citep{lillicrap2016random} side-steps weight transport by delivering per-layer credit through fixed random feedback matrices, and its direct variant (DFA) \citep{nokland2016direct} projects the output error to every hidden layer through an independent random matrix; parallel lines include target propagation \citep{lee2015difference} and equilibrium propagation. These rules are studied both as biologically-plausible alternatives to BP and as scalable, asynchronous training schemes, with recent work scaling DFA to transformer-scale architectures on language, recommendation, and view-synthesis tasks \citep{launay2020direct,akrout2019deep}. Evaluation in this line of work has converged on a two-number summary: final task accuracy, and an aggregate cosine alignment $\Gamma$ between the method's per-layer credit and the BP gradient on the trained network \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct,bartunov2018assessing}.
+
+On the audited 4-block $d{=}256$ ResMLP, however, Table~\ref{tab:main_audit} already shows that this accuracy-plus-$\Gamma$ pair is not a validity check: DFA reaches only $0.306 \pm 0.006$ test accuracy, below the architecture-matched frozen-blocks baseline of $0.349 \pm 0.002$, while still looking superficially comparable to other non-BP methods. Figure~\ref{fig:audit_hero} further shows that the apparent cosine evidence is concentrated at the shallowest block, with DFA at seed 42 reaching about $+0.42$ at layer 0 but approximately $-0.03$ to $0$ on layers 1--4, so the aggregate obscures where credit direction is and is not present. At the same time, the deepest BP reference norm is only about $5 \times 10^{-10}$ for DFA, State Bridge, and Credit Bridge, below the $10^{-8}$ clamp used by \texttt{F.cosine\_similarity}, whereas BP remains around $4 \times 10^{-4}$, so the reported deep cosine is partly computed against a numerical-floor reference rather than an informative gradient direction (Figure~\ref{fig:audit_hero}; Table~\ref{tab:main_audit}). Those numbers can be useful, but only if the measurement regime itself is valid.
 
 Our audit shows that modern residual vision models can make these two quantities look informative while failing to answer the question they are taken to answer. Figure~\ref{fig:audit_hero} shows the first failure mode, which we call \emph{Mode 1: measurement degeneracy}, where residual-stream growth drives the deepest hidden state to about $\|h_L\| \sim 10^8$ under DFA/SB/CB while the corresponding BP reference collapses to $\|g_L\| \sim 5 \times 10^{-10}$, so the deep-layer cosine is measured against a clamp-dominated floor rather than a meaningful target direction. The same figure also shows the second failure mode, \emph{Mode 2: low intrinsic credit-direction quality}, because even after comparing against the stronger frozen-blocks baseline ($0.349 \pm 0.002$) and looking layer-by-layer, DFA's deep blocks remain essentially null while only layer 0 is visibly positive. Intervention sharpens both modes. Adding a per-block residual penalty $\lambda \|f_l(h_l)\|^2$ to DFA at $\lambda{=}10^{-2}$ contains $\|h_L\|$ to about $4\times 10^4$ and lifts the deep BP reference to about $10^{-6}$, but DFA's rescued deep cosine is only about $+0.16$; State Bridge under the same intervention reaches a three-seed deep cosine of $+0.32$ and, unlike DFA, exceeds the frozen-blocks baseline by $+10$ points in final accuracy; Credit Bridge reaches a deep cosine near $+0.68$ yet matches only the DFA accuracy, so Mode~2 has method-dependent severity and deep cosine is not a sufficient predictor of final accuracy across methods. At the same time, at $\lambda{=}10^{-4}$ Mode~1 is alleviated while the DFA deep cosine still stays near zero, and at vanilla DFA epoch 1 the reference is already meaningful at about $6 \times 10^{-7}$ but the deep cosine is still $-0.008 \pm 0.013$ across three seeds. The failure is therefore neither unitary nor uniform: Mode~1 and Mode~2 are observationally separable, and within the audited fixed-feedback family, the severity of each mode varies by method.
 
@@ -143,7 +145,7 @@ The architecture comparison sharpens the scope of the critique. In the terminal-
 
 \begin{figure}[t]
 \centering
-\includegraphics[width=\linewidth]{figures/fig5_cross_arch_summary.pdf}
+\includegraphics[width=0.78\linewidth]{figures/fig5_cross_arch_summary.pdf}
 \caption{Cross-architecture summary over ResMLP, ViT-Mini, StudentNet, and CNN: activation-growth failures recur across architectures, while gradient-floor failures appear in the terminal-normalized settings audited here.}
 \label{fig:cross_arch_summary}
 \end{figure}
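
Note on the clamp claim in the first hunk: the introduction states that a BP reference norm of about 5e-10 falls below the 1e-8 clamp used by F.cosine_similarity, so the deep-layer cosine is measured against a numerical floor. The sketch below illustrates that effect in plain Python. It is not PyTorch's implementation: the helper name clamped_cosine is hypothetical, and it assumes each norm is clamped separately at eps, a form that may differ between PyTorch versions. The vectors are made-up two-dimensional examples that only reuse the norm magnitudes quoted in the diff.

```python
import math


def clamped_cosine(u, v, eps=1e-8):
    """Cosine similarity with each norm clamped from below at eps.

    Assumed form, for illustration only; not taken from the PyTorch source.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (max(norm_u, eps) * max(norm_v, eps))


# Hypothetical credit signal: a unit vector.
direction = [0.6, 0.8]
credit = direction

# Two references pointing in exactly the same direction as the credit
# signal, differing only in norm (magnitudes quoted in the diff above).
healthy_ref = [4e-4 * x for x in direction]   # norm above the clamp
floor_ref = [5e-10 * x for x in direction]    # norm below the clamp

healthy = clamped_cosine(credit, healthy_ref)
floored = clamped_cosine(credit, floor_ref)

print(f"reference norm 4e-4:  cosine = {healthy:.3f}")
print(f"reference norm 5e-10: cosine = {floored:.3f}")
```

Under this assumed clamp, both references are perfectly aligned with the credit signal, yet the one with norm 5e-10 reads about 0.05 because its norm is replaced by eps in the denominator. The value then reflects the ratio of the reference norm to eps rather than the angle between the vectors.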