| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-22 23:46:33 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-22 23:46:33 -0500 |
| commit | 05c935ab03ee0bdb8597d19466192dfb92ee889d (patch) | |
| tree | f8411f1724ed6379c392f0bd7629c83cb4eea534 /results/confirmatory/clean_sparsity/synth_dfa_s456_a0.5_L4.json | |
| parent | 7aa7123e190cbae3f6ce55050666efcc2ce00796 (diff) | |
Add vanilla FA (Lillicrap 2016) implementation + full experiment suite
PAPER-CHANGING FINDING: FA behaves very differently from DFA on the
same architecture. FA keeps positive deep-layer credit alignment
(deep cos +0.33) where DFA's is degenerate (~0).
Implementation:
- experiments/cifar_resmlp.py: added train_fa() + FA diagnostic support
FA uses sequential backward credit propagation with d×d random matrices
(a_l = B_l @ a_{l+1}) instead of DFA's direct output-error projection
(a_l = B_l^T @ e_T). Same local loss form <f_l, a_l>.
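The two credit rules above can be sketched in a few lines of plain Python. This is only an illustration of the recursions as written in this message, not the code in train_fa(): the helper names (fa_credit, dfa_credit, local_loss), the top-layer signal a_top, and the assumed DFA feedback shape (n_out × d, so that B_l^T maps the output error back to width d) are all hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def fa_credit(B, a_top):
    """FA: propagate credit sequentially, a_l = B_l @ a_{l+1}.

    B is a list of L fixed d×d feedback matrices and a_top is the credit
    at the top layer (how it is formed is an assumption left to the caller).
    Returns [a_0, ..., a_L] with a_L == a_top.
    """
    a = [None] * (len(B) + 1)
    a[-1] = list(a_top)
    for l in reversed(range(len(B))):
        a[l] = matvec(B[l], a[l + 1])
    return a


def dfa_credit(B, e_T):
    """DFA: project the output error directly, a_l = B_l^T @ e_T.

    Each B_l is assumed to have shape n_out × d, so every layer receives
    its credit from e_T alone, independent of the layers above it.
    """
    return [matvec(transpose(B_l), e_T) for B_l in B]


def local_loss(f_l, a_l):
    """Shared local loss form <f_l, a_l> (inner product)."""
    return sum(f * a for f, a in zip(f_l, a_l))
```

For example, fa_credit([[[2, 0], [0, 2]], [[3, 0], [0, 3]]], [1, 1]) scales the top signal by 3 and then by 2 on the way down, while dfa_credit gives every layer a projection of the same e_T. The only difference between the two methods in this sketch is which vector a_l is fed into local_loss.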
Core results (A-H, 100ep 3-seed d=256 terminal-LN ResMLP):
FA main audit: 0.401 ± 0.009 (DFA: 0.306 ± 0.008) +9.5 pp
FA vs frozen: +5.2 pp ABOVE baseline (DFA: -4.3 pp below)
FA deep cos: +0.33 (DFA: ~0 degenerate)
FA ||h_L||: ~10^5 (DFA: ~5×10^8) 3-4 OOM less growth
FA ||g_L||: ~10^-6 meaningful (DFA: ~10^-10 floor)
Mode 1(b) fires: NO for FA; YES for DFA
FA+pen lam=1e-2: 0.369 ± 0.003 (DFA+pen: 0.360 ± 0.002)
FA+pen lam=1e-4: 0.377 ± 0.006 (DFA+pen lam=1e-4: 0.360)
At lam=1e-4, FA already has deep cos +0.30 while DFA has -0.02
FA random-target: acc 0.12 (chance), h_L=1.3e5 (DFA: 1.7e8)
FA early 5ep: deep cos already +0.32 (DFA ep1: -0.008)
Extension results (d=512 depth sweep, 100ep, s42):
L=2: FA 0.350, cos +0.96 (DFA: n/a)
L=4: FA 0.424, cos +0.29 (DFA: n/a)
L=6: FA 0.401, cos +0.16 (DFA: n/a)
L=8: FA 0.409, cos +0.11 (DFA: 0.306, cos -0.0001)
L=12: FA 0.404, cos +0.09 (DFA: 0.309, cos -0.0001)
FA deep cos is positive at EVERY depth tested; DFA is ~0 at both depths
where it was run (L=8 and L=12).
FA accuracy exceeds DFA by about 10 pp at L=8 (10.3 pp) and L=12 (9.5 pp).
This is the strongest empirical support for the Mode 2 → Mode 1
hypothesis: same local loss, same architecture, same optimizer —
only the credit signal differs. FA's sequential propagation produces
much better per-layer credit (cos +0.33 vs ~0), which prevents the
catastrophic activation growth that DFA exhibits.
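A minimal sketch of the alignment metric, assuming "deep cos" means the cosine similarity between a layer's feedback credit a_l and the true backprop gradient at that layer (this message does not define it, so both the definition and the helper name cosine are assumptions):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors.

    Returns 0.0 if either vector has zero norm, so a vanished signal
    reads as "no alignment" instead of raising a division error.
    """
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
```

Under that reading, a value near +1 means the feedback credit points the same way as the true gradient, and a value near 0 means it carries essentially no directional information about it.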
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diffstat (limited to 'results/confirmatory/clean_sparsity/synth_dfa_s456_a0.5_L4.json')
0 files changed, 0 insertions, 0 deletions
