author    YurenHao0426 <Blackhao0426@gmail.com>  2026-04-08 18:16:55 -0500
committer YurenHao0426 <Blackhao0426@gmail.com>  2026-04-08 18:16:55 -0500
commit    96bb72683e7356719f94dab15bfe3c8c4266fd88
tree      fc783e46013e38cfaa44c676a07d84a17857d67e /paper
parent    9c6d49bf201e5a9407ea02f6e7aa0d52e55f2038
paper v2.31.3: §2 ¶3 per-block growth values were architecture mix-up
The paper claimed DFA/SB/CB had max-per-block growth of "237×, 12000×,
96×" on the 4-block d=256 ResMLP. Re-aggregating from the protocol audit
JSON (results/protocol_audit/audit_table_s42_s123_s456.json) gives:

  DFA d=256: max growth 2043, 979, 2545     → 3-seed mean ~1856  (≈1.9e3)
  SB  d=256: max growth 12781, 24126, 10467 → 3-seed mean ~15791 (≈1.6e4)
  CB  d=256: max growth 1820, 695, 1034     → 3-seed mean ~1183  (≈1.2e3)

The paper's "237" and "96" actually match the BatchNorm CNN audit
(audit_cnn_3seed.json gives DFA 214/235/263 → mean 237 and CB 108/90/91
→ mean 96), not the d=256 ResMLP. SB "12000" was close to the ResMLP s42
single-seed value (12781), but the other two values were apparently
picked from the wrong architecture.

This was an architecture mix-up that under-reported the d=256 ResMLP
per-block growth by ~8x for DFA and ~12x for CB. Updated to the actual
3-seed mean values from the matched d=256 audit. The DFA and CB numbers
are now an order of magnitude larger and more clearly "extreme" than the
original mistaken values.

The CNN per-block growth claim of "up to 237×" in §5 ¶3 (which says "the
BatchNorm CNN ... shows strong growth under DFA, with max-per-block
growth up to 237×") is correct: that 237 is the right value for the CNN
context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
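
The re-aggregation described in the message can be checked with a short script. This is a minimal sketch, not the project's actual aggregation code: the per-seed values are copied by hand from the message above rather than read from the audit JSON files (whose schema is not shown here), and the dictionary layout and the names `resmlp_d256`, `cnn_batchnorm`, and `seed_means` are hypothetical.

```python
from statistics import mean

# Per-seed max per-block growth, copied from the commit message.
# Hypothetical layout: the real audit JSON schema is not shown in the commit.
resmlp_d256 = {
    "DFA": [2043, 979, 2545],
    "SB": [12781, 24126, 10467],
    "CB": [1820, 695, 1034],
}
cnn_batchnorm = {
    "DFA": [214, 235, 263],
    "CB": [108, 90, 91],
}


def seed_means(table):
    """Return the across-seed mean of max per-block growth for each method."""
    return {method: mean(values) for method, values in table.items()}


resmlp_mean = seed_means(resmlp_d256)
cnn_mean = seed_means(cnn_batchnorm)

# Factor by which quoting the CNN value under-reports the ResMLP value.
underreport = {m: resmlp_mean[m] / cnn_mean[m] for m in cnn_mean}

for method, value in resmlp_mean.items():
    print(f"ResMLP d=256 {method}: 3-seed mean = {value:.0f}")
for method, value in cnn_mean.items():
    print(f"BatchNorm CNN {method}: 3-seed mean = {value:.0f}")
for method, factor in underreport.items():
    print(f"{method}: under-reported by about {factor:.1f}x")
```

Under these hand-copied inputs, the ResMLP means round to the values quoted in the updated sentence, and the CNN means round to the "237" and "96" that the paper had previously attributed to the ResMLP.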
Diffstat (limited to 'paper')
-rw-r--r--  paper/main.pdf  | bin 500665 -> 500841 bytes
-rw-r--r--  paper/main.tex  | 2
2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index e46a0be..1aa826e 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index 6acc223..b31cfe8 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -71,7 +71,7 @@ By the field's usual criteria, the non-BP methods appear to train to nontrivial
Low accuracy by itself is not the pathology. Equilibrium Propagation (EP), a contrastive energy-based alternative to BP that updates weights from the difference between a free-phase and a nudged-phase hidden trajectory, is the key internal comparison in Table~\ref{tab:main_audit} and Figure~\ref{fig:audit_hero}: it achieves only $0.316 \pm 0.030$ accuracy and a very small headline $\Gamma{=}0.008$, yet its per-block growth is only $11.6\times$, its deepest BP reference norm remains around $1.3\times 10^{-4}$ rather than collapsing to the numerical floor, and its cross-batch direction-stability score is $0.02$ rather than the much higher drift-dominated values seen for DFA-family methods. At the same time, EP is not a positive result for depth usage in the stronger sense, because its trainable-model accuracy is still $3.3$ percentage points below the frozen-blocks baseline of $0.349 \pm 0.002$. The distinction matters because it separates underperformance from invalid evaluation.
-When we compare each method to a frozen-blocks baseline matched to the same architecture, the headline interpretation changes immediately. The frozen-blocks model, which trains only the embedding, LayerNorm, and head while holding the residual blocks fixed, reaches $0.349 \pm 0.002$ across the same three seeds; against that baseline, BP is higher by $26.6$ points, but DFA is lower by $4.3$ points, State Bridge by $14.4$ points, Credit Bridge by $6.0$ points, and even EP by $3.3$ points. Figure~\ref{fig:audit_hero} shows that this accuracy comparison lines up with the diagnostic split: DFA, State Bridge, and Credit Bridge also combine extreme per-block growth ($237\times$, $12000\times$, and $96\times$), deepest-layer BP norms around $10^{-9}$, and high cross-batch instability ($0.16$, $0.53$, and $0.37$), so their deep blocks are at best passengers and in practice often harmful. This establishes the audit question the rest of the paper must answer: why do the standard signals fail so badly?
+When we compare each method to a frozen-blocks baseline matched to the same architecture, the headline interpretation changes immediately. The frozen-blocks model, which trains only the embedding, LayerNorm, and head while holding the residual blocks fixed, reaches $0.349 \pm 0.002$ across the same three seeds; against that baseline, BP is higher by $26.6$ points, but DFA is lower by $4.3$ points, State Bridge by $14.4$ points, Credit Bridge by $6.0$ points, and even EP by $3.3$ points. Figure~\ref{fig:audit_hero} shows that this accuracy comparison lines up with the diagnostic split: DFA, State Bridge, and Credit Bridge also combine extreme per-block growth (three-seed mean max ratios $\sim\!1.9\times 10^3$, $\sim\!1.6\times 10^4$, and $\sim\!1.2\times 10^3$ respectively), deepest-layer BP norms around $10^{-9}$, and high cross-batch instability ($0.16$, $0.53$, and $0.37$), so their deep blocks are at best passengers and in practice often harmful. This establishes the audit question the rest of the paper must answer: why do the standard signals fail so badly?
\begin{figure}[t]
\centering