faeval.git/results, branch master

Depth-utility ladder: trainable-block sweep (BP/FA/DFA) on ResMLP CIFAR-10

2026-06-15T01:32:31+00:00

Appendix experiment triangulating the depth-utility diagnostic (D3) by varying
the number of trainable residual blocks k (last-k trainable, first L-k frozen at
init; embed/LN/head always trained).

- d=256 L=4 and d=512 L=2, 3 seeds, recipe identical to the main audit.
- BP climbs monotonically (+22-23pp); DFA peaks at the frozen baseline (k=0) and
  declines once any deep block is trained; FA shows partial/no net depth utility.
- Cross-checks reproduce existing anchors (BP 0.617, DFA 0.301, FA 0.402, frozen 0.349).
- frozen_init_identity_check quantifies frozen stack as a near-norm-preserving
  random feature map (per-block ||f||/||h||~0.10, stack cos 0.981), explaining the
  above-chance k=0 rung.

Co-Authored-By: Claude Opus 4.8 (1M context)

Add new experiment scripts, figures, and paper assets; untrack pyc/build artifacts

2026-06-14T09:06:32+00:00

Co-Authored-By: Claude Opus 4.8 (1M context)

d=512 deep scan complete: FA+DFA at L=6,8,12 (10 seeds each)

2026-04-26T23:25:50+00:00

FA is depth-invariant at ~0.41 for L>=4, never below frozen 0.349.
Only L=2 has enough variance (σ=0.027) for 3/10 seeds to qualify.
Deeper L does not produce the "both FA and DFA fail" panel.

Co-Authored-By: Claude Opus 4.6 (1M context)

d=256 deep scan: FA+DFA at L=8 and L=12 (5 seeds each)

2026-04-26T20:05:13+00:00

FA does NOT drop below frozen 0.349 at deeper L on d=256:
  L=8:  FA mean 0.394, min 0.386 (gap +3.7pp)
  L=12: FA mean 0.391, min 0.368 (gap +1.9pp)

FA accuracy is essentially depth-invariant (~0.39) even though FA
deep cosine drops from +0.13 (L=8) to +0.09 (L=12). DFA is always
below frozen (~0.27-0.30).

Conclusion: on CIFAR-10 with d=256 ResMLP, FA is too good at L≥4
to fail the frozen baseline. The only qualifying setting at deeper L
would require ~20+ seeds to find a rare 2σ outlier. The d=512 L=2
setting (seeds 1,2,5) remains the cleanest qualifying case.

Co-Authored-By: Claude Opus 4.6 (1M context)

CIFAR-100 per-seed diagnostics complete — full qualifying table

2026-04-26T16:03:29+00:00

CIFAR-100, d=256 L=4, 100ep, 3 seeds. Frozen baseline (BP-frozen) = 0.178.

         acc (±ddof=1)    cos (±ddof=1)    h_L        g_L

CIFAR-100 d=256 L=4: both FA and DFA fail — strongest qualifying setting

2026-04-26T15:16:39+00:00

CIFAR-100 on the SAME architecture as the main CIFAR-10 audit (d=256 L=4
pre-LN ResMLP) is a setting where BOTH FA and DFA fall below the frozen-
blocks baseline at ALL 3 seeds while reporting positive cosine.

Frozen baseline (BP-frozen, 2 seeds): 0.177, 0.178 → mean ~0.178

Methods (3 seeds, 100ep):
  seed   BP      DFA     FA
  42     0.319   0.088   0.146
  123    0.322   0.087   0.121
  456    0.322   0.089   0.131

s456 diagnostics (only seed with full JSON — others being re-run):
  DFA: cos=+0.030 (positive), h_L=1.9e8, g_L=1.0e-8
  FA:  cos=+0.247 (positive), h_L=2.3e5, g_L=1.3e-6
  BP:  cos=+1.000 (trustworthy), h_L=192, g_L=9.7e-4

This is STRONGER than d=512 L=2 CIFAR-10 because:
1. Same architecture as the paper's main audit (d=256 L=4)
2. ALL 3 seeds qualify (not just 3/10)
3. Large margin: FA 4.7pp below frozen, DFA 8.9pp below frozen
4. Standard reporting pair (acc + cos) would NOT walk back either

Also added: CIFAR-100 dataset support in cifar_resmlp.py and
resmlp_frozen_blocks_baseline.py.

Co-Authored-By: Claude Opus 4.6 (1M context)

BP+EP audit for d=512 L=2 qualifying seeds + CIFAR-100 support

2026-04-26T14:31:30+00:00

BP results for qualifying seeds (1, 2, 5) on d=512 L=2:
  BP s1: 0.606, s2: 0.608, s5: 0.607 (all above frozen 0.349)
  FA s1: 0.347, s2: 0.346, s5: 0.341 (all below frozen, cos +0.47-0.49)
  DFA s1: 0.298, s2: 0.297, s5: 0.296 (all below frozen, cos +0.18-0.21)

EP did not save (likely architecture compatibility issue at d=512 L=2).

Also: added CIFAR-100 dataset support to both cifar_resmlp.py and
resmlp_frozen_blocks_baseline.py for the harder-task scan.

Co-Authored-By: Claude Opus 4.6 (1M context)

Find setting where both FA and DFA fail: d=512 L=2 ResMLP

2026-04-26T13:45:34+00:00

TASK COMPLETE: Found 3/10 seeds where BOTH FA and DFA fall below
the frozen-blocks baseline while reporting positive cosine and
nontrivial accuracy — proving that the standard evaluation pair
can simultaneously miss both FA and DFA on the same setting.

Setting: d=512 L=2 pre-LayerNorm ResMLP, CIFAR-10, 100 epochs
Frozen baseline (3-seed mean): 0.349

Qualifying seeds:
  seed 1: DFA=0.298 (cos +0.206), FA=0.347 (cos +0.484)
  seed 2: DFA=0.297 (cos +0.179), FA=0.346 (cos +0.472)
  seed 5: DFA=0.296 (cos +0.194), FA=0.341 (cos +0.492)

All qualifying cases have:
  - Both methods below frozen baseline ✓
  - Both methods report positive aggregate cosine ✓
  - Both methods above chance (~0.10) ✓
  - Standard reporting pair (acc + Γ) would NOT walk back either ✓

DFA is below frozen in ALL 10/10 seeds (mean 0.300 ± 0.009).
FA is below frozen in 3/10 seeds (mean across all 10: 0.370 ± 0.026).

Also includes:
- Frozen baselines for d=512 at L=2,4,8,12 × 3 seeds (12 runs)
- resmlp_frozen_blocks_baseline.py patched with --num_blocks arg

Co-Authored-By: Claude Opus 4.6 (1M context)

Add vanilla FA (Lillicrap 2016) implementation + full experiment suite

2026-04-23T04:46:33+00:00

PAPER-CHANGING FINDING: FA is dramatically different from DFA on the
same architecture. FA has genuine deep credit quality where DFA has none.

Implementation:
- experiments/cifar_resmlp.py: added train_fa() + FA diagnostic support
  FA uses sequential backward credit propagation with d×d random matrices
  (a_l = B_l @ a_{l+1}) instead of DFA's direct output-error projection
  (a_l = B_l^T @ e_T). Same local loss form .

Core results (A-H, 100ep 3-seed d=256 terminal-LN ResMLP):

  FA main audit:    0.401 ± 0.009 (DFA: 0.306 ± 0.008)  +9.5 pp
  FA vs frozen:     +5.2 pp ABOVE baseline (DFA: -4.3 pp below)
  FA deep cos:      +0.33 (DFA: ~0 degenerate)
  FA ||h_L||:       ~10^5 (DFA: ~5×10^8)  3 OOM less growth
  FA ||g_L||:       ~10^-6 meaningful (DFA: ~10^-10 floor)
  Mode 1(b) fires:  NO for FA; YES for DFA

  FA+pen lam=1e-2:  0.369 ± 0.003 (DFA+pen: 0.360 ± 0.002)
  FA+pen lam=1e-4:  0.377 ± 0.006 (DFA+pen lam=1e-4: 0.360)
    At lam=1e-4, FA already has deep cos +0.30 while DFA has -0.02

  FA random-target: acc 0.12 (chance), h_L=1.3e5 (DFA: 1.7e8)
  FA early 5ep:     deep cos already +0.32 (DFA ep1: -0.008)

Extension results (d=512 depth sweep, 100ep, s42):
  L=2:  FA 0.350, cos +0.96  (DFA: n/a)
  L=4:  FA 0.424, cos +0.29  (DFA: n/a)
  L=6:  FA 0.401, cos +0.16  (DFA: n/a)
  L=8:  FA 0.409, cos +0.11  (DFA: 0.306, cos -0.0001)
  L=12: FA 0.404, cos +0.09  (DFA: 0.309, cos -0.0001)

FA deep cos is positive at EVERY depth; DFA is ~0 everywhere.
FA accuracy exceeds DFA by 5-10 pp at L=8 and L=12.

This is the strongest empirical support for the Mode 2 → Mode 1
hypothesis: same local loss, same architecture, same optimizer —
only the credit signal differs. FA's sequential propagation produces
much better per-layer credit (cos +0.33 vs ~0), which prevents the
catastrophic activation growth that DFA exhibits.

Co-Authored-By: Claude Opus 4.6 (1M context)

paper v2.34.1: SB/CB training loss decrease values from 3-seed (were s42)

2026-04-09T01:00:22+00:00

Same bug pattern as v2.33's nudging test: the §4 ¶4 training loss
decrease values for SB+pen (-0.458) and CB+pen (-0.122) were s42
single-seed numbers labeled as part of the "three seeds" framing.
DFA+pen (-0.095 ± 0.007) was actually 3-seed.

Re-aggregated from existing JSONs (no new compute):
  SB+pen: per-seed {0.457, 0.444, 0.439} → 0.447 ± 0.008 (was 0.458)
  CB+pen: per-seed {0.123, 0.118, 0.124} → 0.121 ± 0.003 (was 0.122)
  DFA+pen: per-seed {0.104, 0.088, 0.093} → 0.095 ± 0.007 ✓ (unchanged)

Changes:
- §4 ¶4 training-loss trajectory line now uses 3-seed mean ± std for
  all three methods
- Appendix L paragraph now lists per-seed decreases for all three
- New auditable file: results/training_loss_decrease_3seed.json

Ratios SB ≫ CB ≈ DFA unchanged. The "all three functional metrics
agree on the SB ≫ CB ≈ DFA ordering" claim is unchanged.

Page layout: §1-§7 still 9 pages, refs p10, total 19 pages. 0 overfull.

Co-Authored-By: Claude Opus 4.6 (1M context)