From 60947156c4e66d801d043b484ce8bda5314deab0 Mon Sep 17 00:00:00 2001
From: YurenHao0426 <Blackhao0426@gmail.com>
Date: Wed, 8 Apr 2026 18:18:12 -0500
Subject: =?UTF-8?q?paper=20v2.31.4:=20=C2=A72=20=C2=B63=20EP=20per-block?=
 =?UTF-8?q?=20growth=2011.6=C3=97=20=E2=86=92=206.6=C3=97=20(3-seed)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Re-aggregating from results/protocol_audit/audit_table_s42_s123_s456.json:
EP per-block max growth ratios per seed are 2.87, 10.96, 6.10 → 3-seed
mean 6.64. Single-seed max is 10.96 ≈ 11.0, not 11.6. The 11.6× value
in the prose could not be traced to any seed or aggregation; it is
replaced with "three-seed mean max-per-block growth is only 6.6×
(highest single-seed value 11.0×)" so that both the average and the
worst seed are sourced.
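
The arithmetic can be re-checked with a minimal sketch like the one
below. It hard-codes the three per-seed ratios quoted above instead of
parsing the audit JSON, because the file's schema is not shown here; the
variable names are illustrative only.

```python
from statistics import mean

# Per-seed EP max-per-block growth ratios, copied from this commit message.
# (Assumption: these are the only three seeds that enter the aggregate.)
ratios = [2.87, 10.96, 6.10]

mean_ratio = mean(ratios)   # three-seed mean of the per-seed maxima
worst_ratio = max(ratios)   # highest single-seed value

print(f"3-seed mean: {mean_ratio:.2f}x (reported as {mean_ratio:.1f}x)")
print(f"worst seed:  {worst_ratio:.2f}x (reported as {worst_ratio:.1f}x)")

# The 50x diagnostic-(a) threshold is the one named in this message.
print("worst seed below 50x:", worst_ratio < 50)
```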

The 6.6× mean keeps EP cleanly under the §6 ¶1 "below about 11×" bound
for healthy methods; the worst seed (10.96 ≈ 11.0×) sits right at that
bound but is still comfortably below the 50× diagnostic-(a) threshold.
This preserves the EP-as-internal-control story.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 paper/main.pdf | Bin 500841 -> 500922 bytes
 paper/main.tex |   2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/paper/main.pdf b/paper/main.pdf
index 1aa826e..1174ab6 100644
Binary files a/paper/main.pdf and b/paper/main.pdf differ
diff --git a/paper/main.tex b/paper/main.tex
index b31cfe8..1d835c0 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -69,7 +69,7 @@ Two rows in Table~\ref{tab:main_audit}, \emph{State Bridge} (SB) and \emph{Credi
 
 By the field's usual criteria, the non-BP methods appear to train to nontrivial accuracy and report nonzero alignment. In Table~\ref{tab:main_audit}, DFA reaches $0.306 \pm 0.006$ test accuracy with headline $\Gamma{=}0.10$, State Bridge reaches $0.205 \pm 0.032$ with $\Gamma{=}0.005$, and Credit Bridge reaches $0.289 \pm 0.026$ with $\Gamma{=}0.07$; none of these rows looks like an obvious invalidation if one is reading the usual pair of final accuracy and aggregate alignment in the style of prior FA reporting \citep{lillicrap2016random,nokland2016direct,akrout2019deep,launay2020direct}. Even the absolute scale does not itself force a walk-back, because all three methods are plainly above chance and all three report positive headline alignment rather than a visibly broken or undefined quantity. That reading is exactly what the rest of the paper overturns.
 
-Low accuracy by itself is not the pathology. Equilibrium Propagation (EP), a contrastive energy-based alternative to BP that updates weights from the difference between a free-phase and a nudged-phase hidden trajectory, is the key internal comparison in Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero}: it achieves only $0.316 \pm 0.030$ accuracy and a very small headline $\Gamma{=}0.008$, yet its per-block growth is only $11.6\times$, its deepest BP reference norm remains around $1.3\times 10^{-4}$ rather than collapsing to the numerical floor, and its cross-batch direction-stability score is $0.02$ rather than the much higher drift-dominated values seen for DFA-family methods. At the same time, EP is not a positive result for depth usage in the stronger sense, because its trainable-model accuracy is still $3.3$ percentage points below the frozen-blocks baseline of $0.349 \pm 0.002$. The distinction matters because it separates underperformance from invalid evaluation.
+Low accuracy by itself is not the pathology. Equilibrium Propagation (EP), a contrastive energy-based alternative to BP that updates weights from the difference between a free-phase and a nudged-phase hidden trajectory, is the key internal comparison in Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero}: it achieves only $0.316 \pm 0.030$ accuracy and a very small headline $\Gamma{=}0.008$, yet its three-seed mean max-per-block growth is only $6.6\times$ (highest single-seed value $11.0\times$), its deepest BP reference norm remains around $1.3\times 10^{-4}$ rather than collapsing to the numerical floor, and its cross-batch direction-stability score is $0.02$ rather than the much higher drift-dominated values seen for DFA-family methods. At the same time, EP is not a positive result for depth usage in the stronger sense, because its trainable-model accuracy is still $3.3$ percentage points below the frozen-blocks baseline of $0.349 \pm 0.002$. The distinction matters because it separates underperformance from invalid evaluation.
 
 When we compare each method to a frozen-blocks baseline matched to the same architecture, the headline interpretation changes immediately. The frozen-blocks model, which trains only the embedding, LayerNorm, and head while holding the residual blocks fixed, reaches $0.349 \pm 0.002$ across the same three seeds; against that baseline, BP is higher by $26.6$ points, but DFA is lower by $4.3$ points, State Bridge by $14.4$ points, Credit Bridge by $6.0$ points, and even EP by $3.3$ points. Figure~\ref{fig:audit_hero} shows that this accuracy comparison lines up with the diagnostic split: DFA, State Bridge, and Credit Bridge also combine extreme per-block growth (three-seed mean max ratios $\sim\!1.9\times 10^3$, $\sim\!1.6\times 10^4$, and $\sim\!1.2\times 10^3$ respectively), deepest-layer BP norms around $10^{-9}$, and high cross-batch instability ($0.16$, $0.53$, and $0.37$), so their deep blocks are at best passengers and in practice often harmful. This establishes the audit question the rest of the paper must answer: why do the standard signals fail so badly?
 
-- 
cgit v1.2.3