faeval.git - Unnamed repository; edit this file 'description' to name the repository.

diff options

author	YurenHao0426 <Blackhao0426@gmail.com>	2026-04-22 23:46:33 -0500
committer	YurenHao0426 <Blackhao0426@gmail.com>	2026-04-22 23:46:33 -0500
commit	05c935ab03ee0bdb8597d19466192dfb92ee889d (patch)
tree	f8411f1724ed6379c392f0bd7629c83cb4eea534 /results/confirmatory/clean_sparsity/synth_dfa_s456_a0.5_L4.json
parent	7aa7123e190cbae3f6ce55050666efcc2ce00796 (diff)

Add vanilla FA (Lillicrap 2016) implementation + full experiment suite

PAPER-CHANGING FINDING: FA is dramatically different from DFA on the same architecture. FA has genuine deep credit quality where DFA has none. Implementation: - experiments/cifar_resmlp.py: added train_fa() + FA diagnostic support FA uses sequential backward credit propagation with d×d random matrices (a_l = B_l @ a_{l+1}) instead of DFA's direct output-error projection (a_l = B_l^T @ e_T). Same local loss form <f_l, a_l>. Core results (A-H, 100ep 3-seed d=256 terminal-LN ResMLP): FA main audit: 0.401 ± 0.009 (DFA: 0.306 ± 0.008) +9.5 pp FA vs frozen: +5.2 pp ABOVE baseline (DFA: -4.3 pp below) FA deep cos: +0.33 (DFA: ~0 degenerate) FA ||h_L||: ~10^5 (DFA: ~5×10^8) 3 OOM less growth FA ||g_L||: ~10^-6 meaningful (DFA: ~10^-10 floor) Mode 1(b) fires: NO for FA; YES for DFA FA+pen lam=1e-2: 0.369 ± 0.003 (DFA+pen: 0.360 ± 0.002) FA+pen lam=1e-4: 0.377 ± 0.006 (DFA+pen lam=1e-4: 0.360) At lam=1e-4, FA already has deep cos +0.30 while DFA has -0.02 FA random-target: acc 0.12 (chance), h_L=1.3e5 (DFA: 1.7e8) FA early 5ep: deep cos already +0.32 (DFA ep1: -0.008) Extension results (d=512 depth sweep, 100ep, s42): L=2: FA 0.350, cos +0.96 (DFA: n/a) L=4: FA 0.424, cos +0.29 (DFA: n/a) L=6: FA 0.401, cos +0.16 (DFA: n/a) L=8: FA 0.409, cos +0.11 (DFA: 0.306, cos -0.0001) L=12: FA 0.404, cos +0.09 (DFA: 0.309, cos -0.0001) FA deep cos is positive at EVERY depth; DFA is ~0 everywhere. FA accuracy exceeds DFA by 5-10 pp at L=8 and L=12. This is the strongest empirical support for the Mode 2 → Mode 1 hypothesis: same local loss, same architecture, same optimizer — only the credit signal differs. FA's sequential propagation produces much better per-layer credit (cos +0.33 vs ~0), which prevents the catastrophic activation growth that DFA exhibits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat (limited to 'results/confirmatory/clean_sparsity/synth_dfa_s456_a0.5_L4.json')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: