faeval.git/NOTE.md, branch master

Update NOTE.md + EVIDENCE_SUMMARY.md with FA results (2026-04-23)

2026-04-23T16:18:59+00:00

NOTE.md: added comprehensive current-status section at the top with
the full 6-method audit table (BP/FA/EP/DFA/CB/SB), FA vs DFA key
comparison, depth sweep, penalty rescue comparison, cross-method
functional triangulation, and open items. Old Phase 10A content kept
below as historical reference.

EVIDENCE_SUMMARY.md: added "Vanilla FA vs DFA" section with the
paper-changing finding (FA 0.401 ± 0.009 vs DFA 0.306 ± 0.008,
FA has genuine deep cos +0.33, no Mode 1(b) collapse) and the
d=512 depth sweep table.
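
For readers unfamiliar with the two methods, the sketch below contrasts how FA
and DFA form per-layer error signals, and includes the kind of cosine measure
commonly used to compare such a signal with a reference gradient. It is a
minimal NumPy illustration, not the code in this repository: the ReLU
nonlinearity, the feedback matrix shapes, and the names `fa_deltas`,
`dfa_deltas`, and `cosine` are all assumptions.

```python
import numpy as np

def relu_grad(pre_act):
    # Derivative of ReLU (assumed nonlinearity) at the pre-activation.
    return (pre_act > 0).astype(pre_act.dtype)

def fa_deltas(pre_acts, output_error, feedback):
    """Feedback alignment: the error is passed down layer by layer through
    fixed random matrices. feedback[l] maps the signal of the layer above
    (or the output error, for the last hidden layer) to layer l."""
    deltas = [None] * len(pre_acts)
    upstream = output_error
    for l in reversed(range(len(pre_acts))):
        deltas[l] = (upstream @ feedback[l]) * relu_grad(pre_acts[l])
        upstream = deltas[l]
    return deltas

def dfa_deltas(pre_acts, output_error, feedback):
    """Direct feedback alignment: every layer receives the output error
    through its own fixed random matrix of shape (n_out, width_l)."""
    return [
        (output_error @ feedback[l]) * relu_grad(pre_acts[l])
        for l in range(len(pre_acts))
    ]

def cosine(u, v):
    # Cosine similarity between two signals, flattened to vectors.
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```

The two rules coincide at the last hidden layer when they share a feedback
matrix and differ only below it, which is why depth-resolved alignment is the
natural place to compare them.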

Co-Authored-By: Claude Opus 4.6 (1M context)

Add Phase 10A.8C: 3-seed replication — scaffold gains are marginal

2026-03-27T23:07:58+00:00

3-seed results (mean±std):
- DFA: 0.306±0.006
- perlayer_vector α=0.75: 0.304±0.006 (-0.2%, not significant)
- random_trainable α=0.75: 0.313±0.007 (+0.7%, marginal, error bars overlap)

Single-seed gains (+1.1% perlayer_vector, +0.8% random_trainable) do not robustly replicate.
The scaffold mechanism provides at best a marginal, statistically uncertain benefit.
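
The "error bars overlap" reading can be checked directly from the reported
mean±std values. The snippet below is a small illustrative check using only
the three numbers listed above; the helper names `interval` and `overlaps` are
hypothetical, and overlapping one-std intervals are a rough heuristic rather
than a formal significance test.

```python
# Reported 3-seed results as (mean, std), copied from the list above.
results = {
    "DFA": (0.306, 0.006),
    "perlayer_vector": (0.304, 0.006),
    "random_trainable": (0.313, 0.007),
}

def interval(mean, std):
    # One-standard-deviation interval around the mean.
    return (mean - std, mean + std)

def overlaps(a, b):
    # True if closed intervals a and b share at least one point.
    return a[0] <= b[1] and b[0] <= a[1]

baseline = interval(*results["DFA"])
overlap_with_dfa = {
    name: overlaps(interval(*stats), baseline)
    for name, stats in results.items()
    if name != "DFA"
}
print(overlap_with_dfa)
```

Both scaffold variants overlap the DFA interval, consistent with the
conclusion that the gains are not robust at three seeds.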

Co-Authored-By: Claude Opus 4.6 (1M context)

Add Phase 10A.8: freeze-with-decay confirms stale aux is main freeze failure cause; alpha sweep shows perlayer_vector at alpha=0.75 matches full network

2026-03-27T21:39:17+00:00

10A.8A: freeze_decay_to_000 recovers to 28.5% (vs 14.6% with a fixed freeze), so
stale high-weight aux is the primary cause of freeze crashes. It still trails
DFA (31.2%) by ~2.7 points, confirming that continuous trainability adds
independent value.
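
One way to picture freeze-with-decay is as a schedule on the auxiliary blend
weight: held constant until the freeze, then ramped down to zero so a stale aux
signal stops dominating the update. The sketch below assumes a linear ramp and
a starting weight of 0.75; the function name `aux_weight`, the `decay_epochs`
parameter, and the linear shape are illustrative assumptions, not the schedule
actually used in the freeze_decay_to_000 run.

```python
def aux_weight(epoch, freeze_epoch, decay_epochs, alpha0=0.75):
    """Hypothetical blend weight for a frozen auxiliary signal.

    Up to and including freeze_epoch the weight stays at alpha0; afterwards
    it decays linearly and reaches 0.0 after decay_epochs more epochs.
    """
    if epoch <= freeze_epoch:
        return alpha0
    progress = min(1.0, (epoch - freeze_epoch) / decay_epochs)
    return alpha0 * (1.0 - progress)

# Example: freeze at epoch 5, decay over the following 10 epochs.
schedule = [aux_weight(e, freeze_epoch=5, decay_epochs=10) for e in range(20)]
```

A fixed freeze corresponds to never decaying, i.e. keeping the weight at
`alpha0` for all later epochs.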

10A.8B: Both perlayer_vector and random_trainable optimal at alpha=0.75.
perlayer_vector +1.1% vs random_trainable +0.8% — per-layer vector is
the minimal sufficient scaffold, no network needed.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add Phase 10A.7: minimal aux compression — continuous trainability is essential

2026-03-27T19:25:00+00:00

8-branch dissection:
- zero_target and normmatched both crash: a non-zero direction is necessary; matching the norm alone is not enough
- perlayer_vector: +0.7% (per-block trainable vector works, network not required)
- freeze_after_{1,5,10}: ALL crash to ~13-14% (continuous trainability essential)
- random_trainable: +1.0% (reference)

Minimal mechanism: continuously trainable, non-zero, depth-aware auxiliary perturbation.
Freezing at ANY point destroys the benefit entirely.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add Phase 10A.6: gain requires trainable depth-aware aux, not semantic credit

2026-03-27T03:07:35+00:00

9-branch dissection results:
- zero_target crashes (-9.1%): aux must output non-zero
- constant_input neutral (+0.0%): needs at least depth info
- time_only works (+1.0%): h_l not needed, just depth index
- shuffled/fresh_random work (+1.3-1.4%): no semantic content needed
- prefit60_trainable ≈ random_trainable: prefit adds nothing
- All frozen branches crash: trainability is essential

Mechanism: depth-aware trainable auxiliary perturbation that diversifies
block-local updates. Not semantic credit, not pure trainability.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add Phase 10A.5: blend gain is implicit regularization, not learned credit

2026-03-26T21:27:53+00:00

Dissection of 6 branches from same DFA checkpoint:
- blend_random_frozen: 12.6% (CATASTROPHIC — frozen noise destroys training)
- blend_random_trainable: 32.2% (+1.2% — trainable network helps)
- blend_shuffled_trainable: 32.5% (+1.4% — even wrong targets work!)
- blend_gaussian_noise: 30.8% (neutral)
- scaled_DFA_norm_match: 31.0% (neutral)

The gain comes from implicit regularization through a co-optimized auxiliary
network, NOT from learned credit quality. Phase 9A's +1.5% was an optimization
dynamics effect, not evidence of useful credit assignment.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add Phase 10A: no prefit threshold — even random Vec blend beats DFA by +1.3%

2026-03-26T13:37:39+00:00

E_prefit=0 (random Vec) + blend(0.75): 32.4% vs DFA 31.1% (+1.3%)
E_prefit=15: 32.3% (+1.2%)
E_prefit=60: 32.5% (+1.4%)

Frozen Gamma/rho near zero at all prefit levels. The Phase 9A success was NOT
from Vec learning useful credit — it was from the blend mechanism itself providing
regularization/diversification over pure DFA.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add Phase 9B+9C: periodic refit fails, top-down curriculum neutral

2026-03-26T05:07:01+00:00

Phase 9B (periodic refit K=5 R=1 alpha=0.75): 14.0% — Vec starts random,
periodic refits insufficient without offline pretraining.

Phase 9C (top-down curriculum): last1_vec=30.8%, last2_vec=31.1% vs DFA=31.2%.
Near-neutral. Cold-start problem persists even for single-block Vec.

Only Phase 9A's offline prefit + blend handoff (+1.5%) works.
The key ingredient is offline Vec training on frozen checkpoint features.

Co-Authored-By: Claude Opus 4.6 (1M context)

Full Phase 9A: blend(0.75) outperforms DFA by +1.5% across multiple t0

2026-03-26T04:03:32+00:00

Best configs (seed=42):
- t0=5, blend_075 (75%Vec+25%DFA): 32.6% vs DFA 31.0% (+1.5%)
- t0=10, blend_075: 32.5% vs 31.0% (+1.4%)
- t0=1, blend_05: 31.9% vs 31.0% (+0.9%)

Higher Vec fraction (0.75) consistently outperforms lower (0.25, 0.5) at t0>=5.
Pure Vec handoff still fails at all checkpoints.
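
The blend configurations above can be read as a convex combination of the two
credit signals, switched on at the handoff epoch. The sketch below is a minimal
illustration under that reading; the function name `blended_signal`, its
argument names, and the choice to use pure DFA before t0 are assumptions
rather than the repository's implementation.

```python
import numpy as np

def blended_signal(vec_signal, dfa_signal, alpha, epoch, t0):
    """Blend an auxiliary (Vec) signal with the DFA signal.

    Before the handoff epoch t0 only DFA is used (assumed). From t0 onward
    the update signal is alpha * Vec + (1 - alpha) * DFA, so alpha=0.75
    corresponds to the "75%Vec+25%DFA" configuration.
    """
    if epoch < t0:
        return dfa_signal
    return alpha * vec_signal + (1.0 - alpha) * dfa_signal

# Toy signals for one layer: batch of 2, width 3.
vec = np.ones((2, 3))
dfa = np.zeros((2, 3))
before = blended_signal(vec, dfa, alpha=0.75, epoch=4, t0=5)
after = blended_signal(vec, dfa, alpha=0.75, epoch=5, t0=5)
```

Setting `alpha=1.0` recovers the pure Vec handoff, and `alpha=0.0` recovers
pure DFA.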

Co-Authored-By: Claude Opus 4.6 (1M context)

Add Phase 9A: checkpointed handoff — blend(Vec+DFA) outperforms pure DFA

2026-03-25T21:20:53+00:00

First positive online result: 50% blend of offline-fitted Vec + DFA gives 31.7%
vs 31.1% for pure DFA (+0.55%). This is Case B: pure Vec handoff fails (-1.1%)
but blend works because DFA stabilizes trajectory while Vec adds directional credit.

Offline-fitted Vec at DFA epoch-5 checkpoint: Gamma=0.229, rho=0.262.
Cold-start confirmed as main bottleneck — Vec IS useful on DFA trajectory features.

Co-Authored-By: Claude Opus 4.6 (1M context)