author    YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 06:11:25 -0500
committer YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 06:11:25 -0500
commit    02c3d2c80805daedb2b6c8e9d6e5f36c52d361a1
tree      e1dcf1086721c1fd392d9bfc410fb179d42f3063
parent    47f833be0f2abacd0ce53bbe32c7ac7b60fd59d6

Round 36: upgrade (b) wording + add EP random-target neg control to §3

Two changes from round 36:

1. §3 paragraph 3: replace 'observational association' with full causal
   claim based on existing April 7 no-out_ln data (3 seeds,
   ResMLP-d256+terminal-LN removed, residual skip kept):
   ||h_L||=1.21e7 (Mode 1 (a) still fires) but ||g_L||=7.4e-4 (HEALTHY,
   ~10000x above floor; (b) eliminated). Final acc 0.327±0.013
   indistinguishable from vanilla DFA's 0.308±0.014. Wording upgraded to
   'terminal LayerNorm is necessary for Mode 1(b) in the audited
   residual ResMLP and ViT-Mini setting'.

2. §3 paragraph after random-target ablation: add EP under random
   targets smoke result (||h_L||=586 at ep 5 vs DFA's 14510 at ep 3,
   25x gap). Random-target assay now cleanly separates fixed-feedback
   methods (explode) from EP (bounded). Cross-method negative control
   complete.

- experiments/ep_baseline.py: add --random_targets flag + train_ep
  parameter
- v2.5 paper compiles to 15 pages, main content 1-9 (right at E&D limit)

Combined picture (rounds 32-36):
- Mode 1 (a) localized to 'fixed-feedback local-credit objectives
  without scale control on architectures absorbing scale at output'.
  Falsified: residual skip (round 33), task signal (round 34),
  DFA-specific (round 35). EP is the working negative control
  (round 36).
- Mode 1 (b) localized to terminal LayerNorm via the 1/||h|| Jacobian.
  Causally established by April 7 no_outln 3-seed data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

-rw-r--r--  experiments/ep_baseline.py | 9
-rw-r--r--  paper/main.pdf             | bin 465275 -> 467841 bytes
-rw-r--r--  paper/main.tex             | 6
3 files changed, 11 insertions, 4 deletions
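
The commit message attributes Mode 1 (b) to the 1/||h|| scaling of the terminal LayerNorm Jacobian. The sketch below is not taken from the repository; it is a minimal numerical illustration of that scaling under stated assumptions: an affine-free LayerNorm with eps=1e-5, a random 256-dimensional vector standing in for h_L, and a central finite difference in place of autograd. The names layer_norm, ln_sensitivity, and results are hypothetical.

```python
import math
import random


def layer_norm(h, eps=1e-5):
    """Affine-free LayerNorm: (h - mean) / sqrt(var + eps)."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in h]


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def ln_sensitivity(h, direction, step=1e-3):
    """Central finite-difference estimate of ||J_LN(h) @ direction||."""
    h_plus = [a + step * d for a, d in zip(h, direction)]
    h_minus = [a - step * d for a, d in zip(h, direction)]
    out_plus, out_minus = layer_norm(h_plus), layer_norm(h_minus)
    return norm([(p - m) / (2 * step) for p, m in zip(out_plus, out_minus)])


rng = random.Random(0)
dim = 256  # assumed hidden width, matching the d256 setting named in the commit
base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Inflate the same vector by increasing factors and record
# (||h||, Jacobian-vector norm) at each scale.
results = []
for scale in (1.0, 1e2, 1e4):
    h = [scale * v for v in base]
    results.append((norm(h), ln_sensitivity(h, direction)))

for h_norm, sens in results:
    # The product stays roughly constant, i.e. sensitivity scales as 1/||h||.
    print(f"||h||={h_norm:.3e}  ||J d||={sens:.3e}  product={h_norm * sens:.3e}")
```

Each 100x inflation of ||h|| reduces the Jacobian-vector norm by about 100x, which is the mechanism the message invokes for why a gradient measured upstream of a terminal LayerNorm shrinks as the residual stream grows.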
diff --git a/experiments/ep_baseline.py b/experiments/ep_baseline.py
index 7f3d004..36f97f6 100644
--- a/experiments/ep_baseline.py
+++ b/experiments/ep_baseline.py
@@ -90,7 +90,7 @@ def ep_nudged_phase(model, x, y, h_free, beta, T_nudge, alpha_nudge):
 def train_ep(model, trl, tel, dev, epochs=100, lr=1e-3, wd=0.01,
-             beta=0.5, T_nudge=20, alpha_nudge=0.1):
+             beta=0.5, T_nudge=20, alpha_nudge=0.1, random_targets: bool = False):
     L = model.num_blocks
     # Separate optimizers for different parts
@@ -104,6 +104,8 @@ def train_ep(model, trl, tel, dev, epochs=100, lr=1e-3, wd=0.01,
         model.train()
         for x, y in trl:
             x = x.view(x.size(0), -1).to(dev); y = y.to(dev)
+            if random_targets:
+                y = torch.randint(0, 10, y.shape, device=dev)
             # ---- FREE PHASE ----
             # Standard forward pass to get free fixed point
@@ -281,6 +283,8 @@ def main():
     p.add_argument('--lr', type=float, default=1e-3)
     p.add_argument('--wd', type=float, default=0.01)
     p.add_argument('--d_hidden', type=int, default=256)
+    p.add_argument('--random_targets', action='store_true',
+                   help='Replace each minibatch label with i.i.d. random class targets (codex round 36 OPTION EP).')
     args = p.parse_args()
     os.makedirs(args.output_dir, exist_ok=True)
@@ -294,7 +298,8 @@ def main():
     print(f"[{args.method} s={args.seed}] Training EP beta={args.beta} T={args.T_nudge} alpha={args.alpha_nudge}", flush=True)
     model = train_ep(model, trl, tel, dev, epochs=args.epochs, lr=args.lr, wd=args.wd,
-                     beta=args.beta, T_nudge=args.T_nudge, alpha_nudge=args.alpha_nudge)
+                     beta=args.beta, T_nudge=args.T_nudge, alpha_nudge=args.alpha_nudge,
+                     random_targets=args.random_targets)
     acc = evaluate(model, tel, dev)
     diag = compute_diagnostics(model, tel, dev, args.beta, args.T_nudge, args.alpha_nudge)
diff --git a/paper/main.pdf b/paper/main.pdf
index c8f6b89..b0a2973 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index d3aad6b..001a02e 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -80,7 +80,7 @@ The first failure mode is a scale pathology, not yet an alignment pathology. On
The measurement failure occurs at the point where the hidden-layer BP gradient ceases to be a meaningful reference direction. In terminal-LayerNorm architectures, the LayerNorm Jacobian scales as $\partial \mathrm{LN}(h)/\partial h \propto 1/\|h\|$ in expectation, so the same residual-stream inflation is accompanied by collapse of the hidden-layer BP reference norm: on DFA-trained ResMLP, $\|g_L\|$ falls from about $9.8\times 10^{-4}$ at random initialization to about $5\times 10^{-10}$ by epoch 100, a six-order-of-magnitude drop, while the reported cosine remains mathematically defined only because \texttt{F.cosine\_similarity} clamps the denominator at $\varepsilon{=}10^{-8}$ (Table~\ref{tab:main_audit}; Figure~\ref{fig:audit_hero}). At that endpoint the reference norm is about $20\times$ below the clamp, so the quantity being reported is effectively $(a\cdot b)/(\|a\|\max(\|b\|,10^{-8}))$ rather than a comparison to an informative BP direction. At that point, reporting a cosine is no longer evidence about credit quality.
-The simplest control is architectural, not theoretical. On the same ResMLP backbone, BP keeps $\|h_L\|$ near $200$ and $\|g_L\|$ near $4\times 10^{-4}$ throughout training, while EP keeps $\|h_L\|$ around $5\times 10^3$ and $\|g_L\|$ around $1.3\times 10^{-4}$, so hard optimization on CIFAR-10 by itself does not force hidden-layer gradients to the numerical floor (Table~\ref{tab:main_audit}; Figure~\ref{fig:temporal_cross_arch}). The broader cross-architecture pattern is consistent with the same interpretation: StudentNet and the BatchNorm CNN, which lack terminal LayerNorm, keep deepest BP gradients around $10^{-4}$ and never trigger diagnostic (b), whereas ViT-Mini with a terminal LN shows the same collapse pattern and triggers diagnostic (b) by epochs 2--3 (Figure~\ref{fig:temporal_cross_arch}). To check whether the additive residual skip itself is the proximate trigger, we ran a matched ResMLP-d256 ablation that replaces $h_{l+1} = h_l + F_l(h_l)$ with $h_{l+1} = F_l(h_l)$ while keeping terminal LN and all other hyperparameters fixed; in that ablation DFA's $\|h_L\|$ still grows from $\sim\!5$ to $\sim\!2.2\times 10^4$ within three epochs and $\|g_L\|$ already drops to $\sim\!1.6\times 10^{-7}$, so the additive skip is \emph{not} necessary for Mode~1 either, even though the no-residual stack is partially degenerate for both BP and DFA (Appendix~\ref{app:no_residual}). The pathology therefore belongs to the evaluated FA regime, not to CIFAR-10, the backbone, or the residual skip alone.
+The simplest control is architectural, not theoretical. On the same ResMLP backbone, BP keeps $\|h_L\|$ near $200$ and $\|g_L\|$ near $4\times 10^{-4}$ throughout training, while EP keeps $\|h_L\|$ around $5\times 10^3$ and $\|g_L\|$ around $1.3\times 10^{-4}$, so hard optimization on CIFAR-10 by itself does not force hidden-layer gradients to the numerical floor (Table~\ref{tab:main_audit}; Figure~\ref{fig:temporal_cross_arch}). The matched same-backbone control for terminal LayerNorm itself is the cleanest test: when we strip the terminal LayerNorm from the same ResMLP-d256, keep the residual skip intact, and train DFA to convergence over 100 epochs on three seeds, the residual stream still inflates to $\|h_L\| \approx (1.2 \pm 0.1) \times 10^7$, but the deepest hidden-layer BP gradient remains at $\|g_L\| \approx 7.4\times 10^{-4}$ (mean over three seeds), four orders of magnitude above the diagnostic~(b) floor, with final test accuracy $0.327\pm 0.013$, which is statistically indistinguishable from vanilla DFA's $0.308\pm 0.014$ on the same backbone. Removing terminal LayerNorm therefore preserves Mode~1~(a) but cleanly eliminates Mode~1~(b) on the same architecture, while leaving final task accuracy essentially unchanged. The broader cross-architecture pattern points the same way: StudentNet and the BatchNorm CNN, which lack terminal LayerNorm, also keep deepest BP gradients around $10^{-4}$ and never trigger diagnostic~(b), while ViT-Mini with a terminal LN shows the same collapse pattern and triggers diagnostic~(b) by epochs 2--3 (Figure~\ref{fig:temporal_cross_arch}). Taken together, the picture is that terminal LayerNorm is necessary for Mode~1~(b) in the audited residual ResMLP and ViT-Mini setting. 
To check whether the additive residual skip itself is the proximate trigger, we ran a matched ResMLP-d256 ablation that replaces $h_{l+1} = h_l + F_l(h_l)$ with $h_{l+1} = F_l(h_l)$ while keeping terminal LN and all other hyperparameters fixed; in that ablation DFA's $\|h_L\|$ still grows from $\sim\!5$ to $\sim\!2.2\times 10^4$ within three epochs and $\|g_L\|$ already drops to $\sim\!1.6\times 10^{-7}$, so the additive skip is \emph{not} necessary for Mode~1 either, even though the no-residual stack is partially degenerate for both BP and DFA (Appendix~\ref{app:no_residual}). The pathology therefore belongs to the evaluated FA regime, not to CIFAR-10, the backbone, or the residual skip alone.
The collapse is not a late-epoch curiosity. For vanilla DFA on the ResMLP temporal replay, $\|g_L\|$ drops from $9.8\times 10^{-4}$ at epoch 0 to $1.4\times 10^{-6}$ at epoch 1, $3.1\times 10^{-7}$ at epoch 2, $1.3\times 10^{-7}$ at epoch 3, and $6.7\times 10^{-8}$ at epoch 4, so diagnostic (b) fires at epoch 3--4 across all three seeds, while the max-per-block growth detector fires slightly later at epochs 8--11 (Figure~\ref{fig:temporal_cross_arch}). Both detectors therefore fire in the first 11 epochs of a 100-epoch run, making the protocol actionable as an early-stop criterion rather than a post hoc explanation. The practical point is reinforced by accuracy: DFA is at $0.308$ already at epoch 4 and ends at $0.306$ by epoch 100, so the remaining training budget adds essentially nothing to the headline result once the measurement has already degenerated. Once measurement degeneracy is identified, the next question is whether poor deep credit remains even before collapse.
@@ -119,7 +119,7 @@ Once the reference vector is meaningful again, the deep layers no longer sit exa
A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty: BP falls from $0.609 \pm 0.004$ without the penalty to $0.530$ with $\lambda{=}10^{-2}$, so the penalty has a direct cost of about $8$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.308 \pm 0.014$ to $0.363 \pm 0.001$ under the same intervention (Figure~\ref{fig:penalty_rescue}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty still retains a margin of $+18.1$ points, while DFA+penalty retains only $+1.4$ points. The remaining gap, $0.530 - 0.363 = 17$ points, is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The residual gap after that control is what keeps Mode~2 substantively alive.
-The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (about $1.5\times 10^4$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). A depth sweep on the $d{=}512$ ResMLP at $L \in \{2,4,6,8,12\}$ shows that the layerwise pattern is essentially depth-invariant: DFA's layer-0 cosine stays in $[+0.39,+0.40]$ across all five depths, while its mean deep-layer cosine stays within $[-0.005,+0.000]$ and its deep perturbation correlation collapses to $0.000$ in every depth tested, even though BP retains a deep-layer cosine of $+0.94$ at $L{=}12$ (Appendix~\ref{app:depth_scan}). The deep credit signal does not improve when the network is shallower, so the failure is not a "too deep" artifact. In the non-terminal-LN controls, the pattern is different: StudentNet shows diagnostic~(a) only at epochs~14--25 while diagnostic~(b) never fires across $100$ epochs and three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. This is an observational association rather than a causal identification of terminal LayerNorm as the unique mechanism, but it is enough to support a narrower claim: diagnostic~(b) appears tied to the terminal-LN architectures audited here, while diagnostic~(a) remains useful more broadly. This lets the paper end with a reporting rule rather than an overclaimed theory.
+The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (about $1.5\times 10^4$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). A depth sweep on the $d{=}512$ ResMLP at $L \in \{2,4,6,8,12\}$ shows that the layerwise pattern is essentially depth-invariant: DFA's layer-0 cosine stays in $[+0.39,+0.40]$ across all five depths, while its mean deep-layer cosine stays within $[-0.005,+0.000]$ and its deep perturbation correlation collapses to $0.000$ in every depth tested, even though BP retains a deep-layer cosine of $+0.94$ at $L{=}12$ (Appendix~\ref{app:depth_scan}). The deep credit signal does not improve when the network is shallower, so the failure is not a "too deep" artifact. In the non-terminal-LN controls, the pattern is different: StudentNet shows diagnostic~(a) only at epochs~14--25 while diagnostic~(b) never fires across $100$ epochs and three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. The matched same-backbone ResMLP-d256 ablation in Section~\ref{sec:mode1} supplies the cleanest causal control: removing terminal LayerNorm from the same architecture preserves activation growth but eliminates the gradient floor, so terminal LayerNorm is necessary for diagnostic~(b) on ResMLP and the association is not just an architecture-class coincidence. The broader claim therefore holds at full strength inside the audited residual ResMLP and ViT-Mini regime, while diagnostic~(a) remains useful more broadly. 
This lets the paper end with a reporting rule rather than an overclaimed theory.
\begin{figure}[t]
\centering
@@ -458,6 +458,8 @@ Credit Bridge & $19{,}974$ & $3.2\times 10^{-6}$ & $0.092$ \\
The cross-method version of the test rules out the explanation that the random-target growth is specific to DFA's particular feedback projection. State Bridge and Credit Bridge use bridge constructions with target normalization and stop-gradients, so any residual-stream growth they exhibit cannot be attributed to a simple absence of normalization. Their $\|g_L\|$ values at three epochs are still well above the $10^{-7}$ floor used by diagnostic~(b), so the gradient collapse part of Mode~1 does not yet appear at this horizon for SB/CB; the activation-growth part of Mode~1 is already present. We treat this as evidence that the local-credit growth incentive is not unique to DFA but is shared by the audited family of fixed-feedback methods.
+The cleanest negative control for the random-target assay is Equilibrium Propagation, which trains the same backbone with a contrastive nudged-vs-free local energy objective rather than a fixed feedback projection. We re-ran EP on the same ResMLP-d256 with i.i.d.\ random class targets (seed 42, identical hyperparameters): after five epochs of training, EP's $\|h_L\|$ stays at about $586$, roughly $25\times$ smaller than DFA's $14{,}510$ after three epochs and consistent with vanilla EP's bounded trajectory on real labels (Table~\ref{tab:random_targets_sbcb_smoke} extension). The random-target assay therefore cleanly separates the audited fixed-feedback methods (DFA/SB/CB) from EP: fixed-feedback objectives without an explicit scale-control term exhibit data-agnostic activation growth on this architecture, while EP's energy-based local objective does not.
+
\section{Reproducibility}
\label{app:reproducibility}