<feed xmlns='http://www.w3.org/2005/Atom'>
<title>faeval.git/results/fa_no_penalty_30ep, branch master</title>
<subtitle>Unnamed repository; edit this file 'description' to name the repository.
</subtitle>
<link rel='alternate' type='text/html' href='https://git.blackhao.com/faeval.git/'/>
<entry>
<title>Add vanilla FA (Lillicrap 2016) implementation + full experiment suite</title>
<updated>2026-04-23T04:46:33+00:00</updated>
<author>
<name>YurenHao0426</name>
<email>Blackhao0426@gmail.com</email>
</author>
<published>2026-04-23T04:46:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.blackhao.com/faeval.git/commit/?id=05c935ab03ee0bdb8597d19466192dfb92ee889d'/>
<id>05c935ab03ee0bdb8597d19466192dfb92ee889d</id>
<content type='text'>
PAPER-CHANGING FINDING: FA is dramatically different from DFA on the
same architecture. FA has genuine deep credit quality where DFA has none.

Implementation:
- experiments/cifar_resmlp.py: added train_fa() + FA diagnostic support
  FA uses sequential backward credit propagation with d×d random matrices
  (a_l = B_l @ a_{l+1}) instead of DFA's direct output-error projection
  (a_l = B_l^T @ e_T). Same local loss form &lt;f_l, a_l&gt;.

Core results (A-H, 100ep 3-seed d=256 terminal-LN ResMLP):

  FA main audit:    0.401 ± 0.009 (DFA: 0.306 ± 0.008)  +9.5 pp
  FA vs frozen:     +5.2 pp ABOVE baseline (DFA: -4.3 pp below)
  FA deep cos:      +0.33 (DFA: ~0 degenerate)
  FA ||h_L||:       ~10^5 (DFA: ~5×10^8)  3 OOM less growth
  FA ||g_L||:       ~10^-6 meaningful (DFA: ~10^-10 floor)
  Mode 1(b) fires:  NO for FA; YES for DFA

  FA+pen lam=1e-2:  0.369 ± 0.003 (DFA+pen: 0.360 ± 0.002)
  FA+pen lam=1e-4:  0.377 ± 0.006 (DFA+pen lam=1e-4: 0.360)
    At lam=1e-4, FA already has deep cos +0.30 while DFA has -0.02

  FA random-target: acc 0.12 (chance), h_L=1.3e5 (DFA: 1.7e8)
  FA early 5ep:     deep cos already +0.32 (DFA ep1: -0.008)

Extension results (d=512 depth sweep, 100ep, s42):
  L=2:  FA 0.350, cos +0.96  (DFA: n/a)
  L=4:  FA 0.424, cos +0.29  (DFA: n/a)
  L=6:  FA 0.401, cos +0.16  (DFA: n/a)
  L=8:  FA 0.409, cos +0.11  (DFA: 0.306, cos -0.0001)
  L=12: FA 0.404, cos +0.09  (DFA: 0.309, cos -0.0001)

FA deep cos is positive at EVERY depth; DFA is ~0 everywhere.
FA accuracy exceeds DFA by 5-10 pp at L=8 and L=12.

This is the strongest empirical support for the Mode 2 → Mode 1
hypothesis: same local loss, same architecture, same optimizer —
only the credit signal differs. FA's sequential propagation produces
much better per-layer credit (cos +0.33 vs ~0), which prevents the
catastrophic activation growth that DFA exhibits.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
PAPER-CHANGING FINDING: FA is dramatically different from DFA on the
same architecture. FA has genuine deep credit quality where DFA has none.

Implementation:
- experiments/cifar_resmlp.py: added train_fa() + FA diagnostic support
  FA uses sequential backward credit propagation with d×d random matrices
  (a_l = B_l @ a_{l+1}) instead of DFA's direct output-error projection
  (a_l = B_l^T @ e_T). Same local loss form &lt;f_l, a_l&gt;.

Core results (A-H, 100ep 3-seed d=256 terminal-LN ResMLP):

  FA main audit:    0.401 ± 0.009 (DFA: 0.306 ± 0.008)  +9.5 pp
  FA vs frozen:     +5.2 pp ABOVE baseline (DFA: -4.3 pp below)
  FA deep cos:      +0.33 (DFA: ~0 degenerate)
  FA ||h_L||:       ~10^5 (DFA: ~5×10^8)  3 OOM less growth
  FA ||g_L||:       ~10^-6 meaningful (DFA: ~10^-10 floor)
  Mode 1(b) fires:  NO for FA; YES for DFA

  FA+pen lam=1e-2:  0.369 ± 0.003 (DFA+pen: 0.360 ± 0.002)
  FA+pen lam=1e-4:  0.377 ± 0.006 (DFA+pen lam=1e-4: 0.360)
    At lam=1e-4, FA already has deep cos +0.30 while DFA has -0.02

  FA random-target: acc 0.12 (chance), h_L=1.3e5 (DFA: 1.7e8)
  FA early 5ep:     deep cos already +0.32 (DFA ep1: -0.008)

Extension results (d=512 depth sweep, 100ep, s42):
  L=2:  FA 0.350, cos +0.96  (DFA: n/a)
  L=4:  FA 0.424, cos +0.29  (DFA: n/a)
  L=6:  FA 0.401, cos +0.16  (DFA: n/a)
  L=8:  FA 0.409, cos +0.11  (DFA: 0.306, cos -0.0001)
  L=12: FA 0.404, cos +0.09  (DFA: 0.309, cos -0.0001)

FA deep cos is positive at EVERY depth; DFA is ~0 everywhere.
FA accuracy exceeds DFA by 5-10 pp at L=8 and L=12.

This is the strongest empirical support for the Mode 2 → Mode 1
hypothesis: same local loss, same architecture, same optimizer —
only the credit signal differs. FA's sequential propagation produces
much better per-layer credit (cos +0.33 vs ~0), which prevents the
catastrophic activation growth that DFA exhibits.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
