faeval.git

Age	Commit message	Author
2026-04-08	Round 34 random-target ablation: Mode 1 fires under random labels too	YurenHao0426
	Codex round 34 picked OPTION A (i.i.d. random class targets per minibatch) over the analytic-only OPTION D as the most discriminating test of 'is (a) intrinsic to DFA update geometry or task-driven?'.
	Smoke test result is unambiguous:
	  ep 0: ||h_L||=8.9    ||g_L||=9.8e-4
	  ep 1: ||h_L||=1616   ||g_L||=5.1e-6
	  ep 2: ||h_L||=9768   ||g_L||=8.5e-7
	  ep 3: ||h_L||=14510  ||g_L||=5.6e-7
	  (test acc still at chance ~0.07)
	Three orders of magnitude growth in ||h_L|| in 3 epochs, three orders of magnitude collapse in ||g_L|| in the same 3 epochs, with NO task signal whatsoever: DFA's local-loss geometry is the proximate driver, not data adaptation.
	- experiments/snapshot_evolution_residual_explosion.py: add --random_targets and --skip_bp flags
	- paper/main.tex §3 ¶1: replace 'no explicit scale constraint' framing with codex round 34's 6-line geometric argument and the random-target empirical falsifier
	- paper/main.tex Appendix J: full smoke-test table + interpretation
	- v2.3: 14 pages total, main content still 8 pages
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
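	The "three orders of magnitude" claim can be checked directly from the logged norms. The sketch below only re-does that arithmetic; the numbers are copied from the smoke-test table in this commit message, and the helper name `orders_of_magnitude` is introduced here for illustration.

```python
import math

# Norms copied from the smoke-test table in the commit message (epochs 0..3).
h_norm = {0: 8.9, 1: 1616.0, 2: 9768.0, 3: 14510.0}
g_norm = {0: 9.8e-4, 1: 5.1e-6, 2: 8.5e-7, 3: 5.6e-7}


def orders_of_magnitude(small, large):
    """Return log10(large / small), i.e. how many decades separate the two values."""
    return math.log10(large / small)


# Growth of ||h_L|| from epoch 0 to epoch 3.
h_growth = orders_of_magnitude(h_norm[0], h_norm[3])
# Collapse of ||g_L|| over the same epochs (ratio taken start / end).
g_collapse = orders_of_magnitude(g_norm[3], g_norm[0])

print(f"||h_L|| grew by {h_growth:.2f} decades")
print(f"||g_L|| shrank by {g_collapse:.2f} decades")
```

	Both ratios come out slightly above three decades, which matches the wording of the message.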
2026-04-08	Round 32+33 H2 ablation: add no_residual_add flag; falsify residual-as-cause hypothesis	YurenHao0426
	- models/residual_mlp.py: add residual_add and w2_std flags (default unchanged)
	- experiments/snapshot_evolution_residual_explosion.py: add --no_residual_add and --w2_std CLI flags
	- paper/main.tex §3 ¶3: add 1-sentence reference to no-residual control showing Mode 1 still fires
	- paper/main.tex Appendix I: full smoke-test table + interpretation
	- v2.2 main content stays at 8 pages (within 9-page E&D budget); 13 pages total
	Smoke test (3 ep, w2_std=0.5, seed 42):
	- DFA no-residual: ||h_L|| 4.69 -> 22050, ||g|| 1.6e-7 (Mode 1 (a) fires; (b) at floor)
	- BP no-residual: acc only 0.16 at ep 3 (architecture is partially degenerate)
	- Conclusion: residual skip is NOT necessary for Mode 1; the proximate trigger is more general
	- Codex round 33 verdict: WALK BACK H2; demote 100ep run to confirmatory
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Add perturbation correlation metric calibration	YurenHao0426
	Anchors the rho +0.08 finding with positive and negative controls:
	  positive control (BP grad as a_l):   +0.9965 (perfect, expected ~1)
	  negative control (random vector):    +0.0056 (noise floor, expected ~0)
	  vanilla DFA s42 (||g|| at floor):    +0.0020 (within noise floor)
	  penalized DFA s42 (||g|| healthy):   +0.0937 (~48x above noise, ~9% of perfect)
	The metric is well-calibrated. BP gradient as a_l gives rho ~1 (Taylor), random vector gives rho ~0 (noise floor), random feedback in degenerate regime is indistinguishable from noise floor, random feedback in penalized regime is small-but-well-above-noise (~48x noise, ~9% perfect).
	Defensible paper claim: 'rho +0.08 is small in absolute terms but clearly above the calibrated noise floor and on the order of 10% of the perfect-signal ceiling, consistent with the 60% of BP accuracy the penalized network achieves.'
	Closes round 19's 'is rho +0.08 a meaningful number on this metric?' question with explicit calibration.
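	The calibration logic can be illustrated on a toy problem. The sketch below is not the repository's script: it assumes rho is the Pearson correlation between the predicted loss change `eps * <a, u>` and the measured loss change `L(h + eps*u) - L(h)` over random directions `u`, and it uses a hypothetical log-cosh loss in place of the network. The names `perturbation_rho`, `loss`, and `grad` are introduced here for illustration.

```python
import math
import random


def loss(h, t):
    """Toy smooth loss (log-cosh); a stand-in for the network loss at a hidden state."""
    return sum(math.log(math.cosh(hi - ti)) for hi, ti in zip(h, t))


def grad(h, t):
    """Exact gradient of the toy loss; plays the role of the BP gradient."""
    return [math.tanh(hi - ti) for hi, ti in zip(h, t)]


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def perturbation_rho(a, h, t, directions, eps=1e-3):
    """Correlate the first-order prediction eps*<a,u> with the measured loss change."""
    base = loss(h, t)
    predicted, actual = [], []
    for u in directions:
        predicted.append(eps * sum(ai * ui for ai, ui in zip(a, u)))
        actual.append(loss([hi + eps * ui for hi, ui in zip(h, u)], t) - base)
    return pearson(predicted, actual)


rng = random.Random(0)
d, M = 64, 256  # illustrative sizes, not the ones used in the repository
h = [rng.gauss(0, 1) for _ in range(d)]
t = [rng.gauss(0, 1) for _ in range(d)]
directions = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(M)]

# Positive control: the true gradient as the credit signal.
rho_pos = perturbation_rho(grad(h, t), h, t, directions)

# Negative control: random vectors as the credit signal, averaged over draws.
neg = [perturbation_rho([rng.gauss(0, 1) for _ in range(d)], h, t, directions)
       for _ in range(20)]
rho_neg = sum(neg) / len(neg)

print(f"positive control rho = {rho_pos:.4f}")
print(f"negative control rho = {rho_neg:.4f}")
```

	With the true gradient the prediction is a first-order Taylor expansion, so rho sits near 1; with random vectors it averages out near 0, which is the pattern the two controls in this commit are meant to establish.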
2026-04-08	Extend perturbation audit to vanilla early-epoch checkpoints	YurenHao0426
	Cross-metric disambiguation confirmation. Vanilla DFA at ep 1 (meaningful regime, ||g||~6e-7) deep rho across 3 seeds:
	  s42:  deep rho -0.008
	  s123: deep rho +0.000
	  s456: deep rho -0.000
	  mean: -0.003 ± 0.005
	Compare to penalized DFA 3-seed: deep rho +0.080 ± 0.011.
	The disambiguation (penalty CREATES alignment, not just reveals it) is now confirmed by TWO independent metrics:
	- cos: vanilla -0.008 ± 0.013, penalized +0.155 ± 0.025
	- rho: vanilla -0.003 ± 0.005, penalized +0.080 ± 0.011
	Both metrics agree on the vanilla→penalized transition. The l0 (embedding) rho is high (~0.25-0.29) at every vanilla checkpoint, mirroring the cos l0 +0.42: the embedding layer is genuinely useful while the deep blocks are not, by BOTH metrics. The penalty restores some deep usefulness to ~+0.08 rho / +0.16 cos. Cross-metric agreement rules out single-metric artifacts on either side.
2026-04-08	Add perturbation correlation audit (round 19's recommended alt metric)	YurenHao0426
	Codex round 19 said: 'use nudging or perturbation correlation on the penalized checkpoints. In the healthy-gradient regime, that is a more direct is-the-local-signal-useful test than cosine alone'.
	Result on existing checkpoints (eps=1e-3, M=32 random directions, n=1024):
	  vanilla DFA s42:                     deep rho +0.002
	  penalized DFA s42 lam=1e-2 30ep:     deep rho +0.094
	  penalized DFA s123 lam=1e-2 30ep:    deep rho +0.073
	  penalized DFA s456 lam=1e-2 30ep:    deep rho +0.072
	  penalized 3-seed mean:               deep rho +0.080 ± 0.011
	This INDEPENDENTLY TRIANGULATES the cos +0.17 finding via a different metric:
	- vanilla deep cos ~0 matches vanilla deep rho ~0
	- penalized deep cos +0.155 matches penalized deep rho +0.080
	The two metrics measure different things:
	- cos = directional alignment with BP grad
	- rho = correlation between predicted and true loss change under random perturbation
	Both show the same pattern: penalty creates partial usefulness from essentially zero. This is the 6th independent validation of the mode 2 'penalty creates partial alignment' framing.
	Crucially, rho doesn't use F.cosine_similarity (no eps clamp), and it measures sample-level loss change correlation rather than direction match, so it rules out 'cos is capturing some directional artifact unrelated to local usefulness'.
2026-04-08	Add BP+penalty control (round 19's #4 critical experiment)	YurenHao0426
	Trains end-to-end BP with the same lambda*||f_l(h_l)||^2 penalty used in the DFA penalty rescue. Tests whether the penalty's depth utilization loss in penalized DFA is intrinsic to DFA's random-feedback credit quality (mode 2) or due to penalty-induced capacity regularization.
	Decision rule:
	  BP+pen margin > 25 pp -> mode 2 confirmed (penalty is not the cap)
	  BP+pen margin < 5 pp  -> penalty itself caps depth (capacity loss)
	  intermediate          -> both effects present
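	The decision rule is simple enough to encode as a function, which makes the pre-registered thresholds explicit. This is a minimal sketch: the function name `bp_penalty_verdict` is hypothetical, and treating margins of exactly 5 pp or 25 pp as "intermediate" is an assumption, since the message does not say how ties are resolved.

```python
def bp_penalty_verdict(margin_pp):
    """Map the BP+penalty depth margin (in percentage points) to a verdict.

    Thresholds (25 pp and 5 pp) are taken from the commit message.
    Boundary values fall into the intermediate case (assumption).
    """
    if margin_pp > 25.0:
        return "mode 2 confirmed (penalty is not the cap)"
    if margin_pp < 5.0:
        return "penalty itself caps depth (capacity loss)"
    return "both effects present"


# Illustrative margins only; these are not measured results.
for margin in (30.0, 12.0, 2.0):
    print(f"{margin:5.1f} pp -> {bp_penalty_verdict(margin)}")
```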
2026-04-08	Add vanilla DFA early-epoch checkpoint training (round 19 disambiguation)	YurenHao0426
	Trains vanilla DFA (no penalty) for max_epoch epochs and saves checkpoints + Bs at specified early epochs (default: 1, 2, 3, 4, 5). Logs per-layer ||h_l|| and ||g_l|| at each epoch so we can see when ||g_L|| crosses the 1e-7 floor.
	Codex round 19's #3 critical experiment for disambiguating:
	  Hypothesis A: deep alignment was always there in vanilla DFA but hidden by the post-collapse measurement degeneracy
	  Hypothesis B: deep alignment was created by the penalty intervention
	Test: measure deep-layer cos at vanilla checkpoints from ep 1-3 (when ||g_L|| should still be in the meaningful regime).
	  If cos > 0 at ep 1-2 vanilla -> hypothesis A
	  If cos ~ 0 at ep 1-2 vanilla -> hypothesis B
2026-04-08	Add null calibration script: training-Bs vs fresh-Bs cos on penalized DFA	YurenHao0426
	Codex round 19's #1 critical control. Result on penalized DFA s42 (lam=1e-2, 30 ep):
	  training-Bs deep-layer cos: +0.1627
	  fresh-Bs deep-layer cos:    +0.0022 ± 0.0220 (n=20 draws)
	The +0.17 measurement is REAL signal, not artifact. The network specifically adapted to its training-time Bs during the penalized run. Fresh Bs give essentially zero cosine (within noise).
	This validates the walk-back interpretation: in the rescued regime where ||g_l|| is meaningful, DFA's local credit signal shows partial alignment with BP grad, and this alignment is specifically the network learning to align with its specific Bs.
	Round 19 caveat preserved: cannot yet distinguish whether the alignment was always present in vanilla but hidden by measurement degeneracy, OR whether it was created by the penalty intervention. The early-epoch vanilla checkpoint sweep (round 19's other proposed control) would disambiguate.
2026-04-08	Add penalty lambda 3-seed summary script + checkpoint save in penalty test	YurenHao0426
	- New script: protocol/examples/penalty_lam_3seed_summary.py
	  Loads existing penalty JSON files for lam=1e-3 and lam=1e-2 across seeds, computes 3-seed mean margin vs DFA-shallow baseline, and explicitly checks the (d) verdict at 2pp threshold per seed and in aggregate. Reports MIXED if seeds disagree.
	  Current result: lam=1e-2 has 3 seeds (margin +1.38 ± 0.05 pp, all FIRE), lam=1e-3 has 1 seed (+2.31 pp, PASSES). Awaiting s123/s456 for lam=1e-3.
	- experiments/dfa_residual_penalty_test.py: now saves model checkpoint + Bs alongside JSON log so post-hoc protocol can be applied without re-running. Closes the pitfall #6.5 self-disclosure (auxiliary nets must be saved for post-hoc Gamma to be reconstructible).
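	The per-seed and aggregate check described above might look roughly like the following. This is a sketch under stated assumptions, not the contents of penalty_lam_3seed_summary.py: it assumes (d) FIRES when the margin is below the 2 pp threshold and PASSES otherwise (inferred from the reported +1.38 pp FIRE and +2.31 pp PASS), and the function names and the three-seed example margins are hypothetical placeholders.

```python
from statistics import mean

THRESHOLD_PP = 2.0  # threshold quoted in the commit message


def seed_verdict(margin_pp, threshold=THRESHOLD_PP):
    """Assumed rule: (d) PASSES at or above the threshold, FIRES below it."""
    return "PASS" if margin_pp >= threshold else "FIRE"


def aggregate_verdict(margins_by_seed):
    """Return per-seed verdicts, the mean margin, and an overall verdict.

    The overall verdict is MIXED when seeds disagree, as the message describes.
    """
    verdicts = {seed: seed_verdict(m) for seed, m in margins_by_seed.items()}
    distinct = set(verdicts.values())
    overall = next(iter(distinct)) if len(distinct) == 1 else "MIXED"
    return verdicts, mean(list(margins_by_seed.values())), overall


# Single known data point from the message: lam=1e-3, seed 42, +2.31 pp.
print(aggregate_verdict({42: 2.31}))

# Hypothetical placeholder margins, used only to show the MIXED branch.
print(aggregate_verdict({42: 2.31, 123: 1.70, 456: 2.10}))
```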
2026-04-07	Add fast direction-quality measurement on existing DFA checkpoints	YurenHao0426
	3-seed result on the existing dfa_s{42,123,456}.pt checkpoints from results/confirmatory/checkpoints_A2/, computing per-layer cosine of DFA's local credit signal e_T@B_l^T vs the true BP gradient at h_l.
	Key findings, per-layer cos (3-seed mean):
	  l0: +0.42  (high: embedding alignment)
	  l1: +0.006 (essentially zero)
	  l2: -0.015 (essentially zero)
	  l3: -0.004 (essentially zero)
	  l4: -0.004 (essentially zero)
	  layer-mean across all 5: +0.07-0.10
	The deep blocks (l1-l4) have essentially zero alignment with BP grad in the vanilla scale-failure regime. Layer 0 dominates the headline.
	The script reconstructs the training-time random Bs by replaying the RNG sequence (torch.manual_seed + ResidualMLP construction + randn draws), since the existing checkpoints don't save Bs. For the still-running direction-quality experiment which DOES save Bs, the script auto-detects the dict format and uses the saved Bs directly.
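	Two ideas from this message can be sketched without the real model: the cosine between the DFA signal `e @ B^T` and the BP gradient, and recovering unsaved Bs by replaying the seeded draw order. The toy below uses a single linear readout and Python's `random` module in place of torch; `build`, `bp_grad`, and `dfa_signal` are hypothetical names, and the shapes are assumptions.

```python
import math
import random


def build(seed, d=32, C=10):
    """Replay the seeded draw order: 'model' weights first, then the feedback matrix."""
    rng = random.Random(seed)
    W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(C)]  # stand-in for model construction
    B = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(d)]  # feedback matrix, drawn afterwards
    return W, B


def bp_grad(e, W):
    """Gradient at h for y = W h with output error e: W^T e."""
    return [sum(e[c] * W[c][j] for c in range(len(e))) for j in range(len(W[0]))]


def dfa_signal(e, B):
    """DFA local credit signal e @ B^T, with B of shape (d, C)."""
    return [sum(e[c] * B[j][c] for c in range(len(e))) for j in range(len(B))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


W, B = build(42)
_, B_replayed = build(42)  # same seed and same draw order reproduce B exactly

e = [random.Random(7).gauss(0, 1) for _ in range(10)]  # toy output error
cos_random = cosine(dfa_signal(e, B), bp_grad(e, W))

# Sanity check: if B equals W^T the DFA signal is the BP gradient, so cos = 1.
B_aligned = [[W[c][j] for c in range(10)] for j in range(32)]
cos_aligned = cosine(dfa_signal(e, B_aligned), bp_grad(e, W))

print(f"random B: cos = {cos_random:+.3f}   aligned B: cos = {cos_aligned:+.3f}")
```

	The replay only works if every random draw between seeding and the feedback matrix is repeated in the same order, which is why saving Bs alongside the checkpoint is the more robust option.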
2026-04-07	Add ViT-Mini DFA training script that saves checkpoint + Bs	YurenHao0426
	The existing snapshot_evolution_vit.py and vit_frozen_blocks_baseline.py do not save model checkpoints; they only emit per-epoch JSON logs. This makes it impossible to apply the diagnostic protocol to a trained ViT post-hoc, since the protocol needs an actual model object.
	This script trains a 4-block d=128 ViT-Mini with block-level DFA on CIFAR-10 (same training rule as snapshot_evolution_vit.py) for 60 epochs and saves:
	- the final state_dict
	- the random feedback Bs (so the protocol can also verify bug 4 on this checkpoint)
	- test_acc and config
	Output: results/vit_dfa_checkpoints/dfa_vit_s{seed}.pt
2026-04-07	Add DFA direction-quality direct test (codex round 13 option c)	YurenHao0426
	Trains both vanilla DFA (lam=0) and penalized DFA (lam=1e-2) from the same seed, then directly measures the per-layer cosine between DFA's local credit signal e_T @ B_l^T and the BP gradient at hidden layers. Uses the training Bs (not fresh ones, per the Bs-specificity finding from earlier). The penalized run is the key measurement: in that condition the BP grad is ~10^-7 (well above the eps=1e-8 floor), so a near-zero cosine here would be the direct evidence of the second failure mode (direction-quality ceiling) that codex round 13 hypothesized. Pre-registered prediction: penalized cos(DFA, BP) ~ 0.01-0.05 -> direction quality is the second, separable failure mode. Saves the penalized checkpoint so the diagnostic protocol can be re-applied to it (where (a) and (b) should pass, (d) should still fail).
2026-04-03	Fix CNN compute_bp_grads: remove inter-layer detach so gradients flow to all layers	YurenHao0426
	Old code detached hidden states between layers, making layers 0-2 disconnected from the loss (gradient = None → 0). Fixed by keeping the forward graph connected.
	BP CNN Gamma per-layer now: [0.985, 0.990, 0.987, 0.967] (was [0, 0, 0, 0.967])
	But gradient norms are ~1e-17 (genuine numerical precision issue with CNN architecture).
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03	Add checkpoint saving to ep_synthetic.py	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03	Fix EP credit sign in cnn_baseline.py	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Fix EP credit sign: negate (h_nudge - h_free)/β to align with BP grad direction	YurenHao0426
	EP nudge moves h toward lower loss (opposite to BP grad which points toward loss increase). Without negation, Gamma is negative and rho is -0.25. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
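	The sign argument can be verified in closed form on a toy energy. This sketch is not the repository's EP code: it assumes a quadratic energy `E(h) = 0.5*||h - h0||^2` and loss `L(h) = 0.5*||h - t||^2`, for which the nudged fixed point of `E + beta*L` is `(h0 + beta*t) / (1 + beta)`. All names and values are illustrative.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


beta = 0.1
h0 = [1.0, -2.0, 0.5]  # minimum of the toy energy, so h_free = h0
t = [0.0, 1.0, 2.0]    # target of the toy loss

h_free = list(h0)
# Minimizer of E(h) + beta * L(h) for the quadratic toy problem.
h_nudge = [(a + beta * b) / (1.0 + beta) for a, b in zip(h0, t)]

# BP gradient of L at the free state points toward loss increase.
bp_grad = [a - b for a, b in zip(h_free, t)]

# Raw EP difference points toward lower loss; negating aligns it with bp_grad.
raw = [(n - f) / beta for n, f in zip(h_nudge, h_free)]
credit = [-r for r in raw]

print(f"cos(raw, bp_grad)    = {cosine(raw, bp_grad):+.3f}")
print(f"cos(credit, bp_grad) = {cosine(credit, bp_grad):+.3f}")
```

	In this toy case the un-negated difference is exactly anti-aligned with the BP gradient, which is the same sign flip the commit corrects.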
2026-04-02	Add --d_hidden arg to ep_baseline.py for d=512 support	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Fix ep_synthetic: bp dict needs L+1 entries for EP credit comparison	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add EP synthetic ladder script	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Fix CNN state bridge: use custom CNNStateBridge for variable input dims	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add SB and CB methods to cnn_baseline.py	YurenHao0426
	State bridge: per-layer StateBridgeNet predicting h3 from flattened h_l
	Credit bridge: per-layer ValueNet with terminal + bridge consistency + DFA warmup
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Fix gelu_ablation.py: compute method-specific Gamma instead of hardcoded 1.0	YurenHao0426
	DFA now uses regenerated DFA Bs for credit; SB/CB use BP as proxy (feedback nets not saved). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add GELU/ReLU ablation script for CIFAR MLP	YurenHao0426
	Note: existing ResidualMLP already uses GELU. This adds ResidualMLPReLU variant. Ablation compares ReLU vs GELU for BP/DFA/SB/CB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add CNN baseline: SmallCNN with BP/DFA/EP on CIFAR-10	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add EP baseline implementation (Scellier & Bengio 2017) for CIFAR MLP	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add d512_sparsity.py: support sparsity for d=512 checkpoints	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add cifar_d512_confirmatory.py: L=4 d=512 with checkpoint saving	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add clean_sparsity_persample.py: per-sample gradient stats	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add clean_sparsity_full.py: independent-process full sparsity analysis	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add clean gradient check: independent Python process per method, GPU 1	YurenHao0426
	Clean results (each method in fresh Python process):
	  BP:  mean_norm=2.58e-04, s(1e-6)=98%, CONFIRMED
	  DFA: layer 0 = 2.86e-07 (1.2%), layers 1-3 ≈ 2.4e-09 (0%)
	  SB:  layer 0 = 6.13e-06 (86%), layers 1-3 ≈ 1e-09 (0%)
	  CB:  layer 0 = 6.33e-07 (18%), layers 1-3 ≈ 5e-10 (0%)
	Method A (autograd.grad) and Method B (retain_grad) give identical results. Previous 1e-12 results were caused by Python process state pollution in combined scripts.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add confirmatory supplement: T1-T4 from checkpoints (no retraining)	YurenHao0426
	WARNING: All methods (including BP) show near-zero BP hidden gradients (~1e-12 to 1e-14) when computed via manual forward with detached hidden states. This is inconsistent with the earlier first-priority analysis which showed BP at 2.86e-04. Investigation needed.
	  T1: 40 rows (4 methods × 10 seeds), full metrics
	  T2: 800 rows (support sparsity, 5 thresholds × 4 methods × 10 seeds × 4 layers)
	  T3: 48 rows (gradient norm distributions, 3 seeds × 4 methods × 4 layers)
	  T4: 100 rows (active-subset Gamma, 5 thresholds × 2 methods × 10 seeds)
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add extended sparsity analysis: A4 per-layer, B1 snapshots, B2 active subset, C1/C2	YurenHao0426
	A4: Per-layer support: DFA/SB/CB layers 1-3 have 0% support at τ=1e-6. Only BP has ~95% support; only SB layer 0 has 53%
	B1: Snapshot evolution: old snapshot checkpoints have near-zero grads (data issue)
	B2: Active subset: with τ=1e-6, no active samples for non-BP methods
	C1: Active vs inactive cosine: only inactive subset exists for non-BP
	C2: Energy concentration: near-zero for non-BP methods
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add BP support sparsity analysis: threshold sweep + gradient histograms	YurenHao0426
	A1 Synthetic: all methods have >93% support at τ=1e-6 (gradients rarely zero)
	A2 CIFAR: massive gap: BP 98.4% vs DFA 0.4% vs SB 21% vs CB 3%
	DFA-trained CIFAR networks have near-zero BP gradients for 99.6% of samples
	This explains why Gamma is unreliable for CIFAR non-BP methods
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
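	A threshold sweep of this kind can be sketched as follows. The definition used here is an assumption, since the message does not spell it out: "support at τ" is taken to be the fraction of samples whose BP hidden-gradient norm exceeds τ. The function names and the example norms are hypothetical and are not measured values.

```python
def support_fraction(grad_norms, tau):
    """Fraction of samples whose gradient norm is strictly above tau (assumed definition)."""
    return sum(1 for g in grad_norms if g > tau) / len(grad_norms)


def threshold_sweep(grad_norms, taus):
    """Support fraction at each threshold, keyed by tau."""
    return {tau: support_fraction(grad_norms, tau) for tau in taus}


# Hypothetical per-sample gradient norms, for illustration only.
example_norms = [2.6e-4, 3.0e-7, 2.4e-9, 5.0e-10]
sweep = threshold_sweep(example_norms, taus=(1e-8, 1e-7, 1e-6))

for tau, frac in sweep.items():
    print(f"tau={tau:.0e}: support={frac:.0%}")
```

	When support is near zero, most per-sample gradients are numerically negligible, so any alignment ratio computed against them is dominated by noise; that is the sense in which the message calls Gamma unreliable for the non-BP methods.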
2026-03-31	Update naive StateErr v3: L2 norm ratio formula, with checkpoints saved	YurenHao0426
	Formula: ||h_{L//2} - h_L||_2 / ||h_L||_2 (scalar L2 ratio)
	A1: 240 rows (3 alpha × 2 depth × 4 methods × 10 seeds)
	A2: 40 rows (4 methods including BP × 10 seeds)
	All model checkpoints saved in checkpoints_A1/ and checkpoints_A2/
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
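	The ratio in the formula above is straightforward to compute from a list of hidden states. This is a minimal sketch; `naive_state_err` is a hypothetical name, and it assumes the list holds h_0 through h_L in order, so that L is the list length minus one. The repository may index layers differently.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def naive_state_err(hidden_states):
    """Compute ||h_{L//2} - h_L||_2 / ||h_L||_2 for hidden states h_0..h_L."""
    L = len(hidden_states) - 1  # assumption: the list includes h_0
    h_mid = hidden_states[L // 2]
    h_last = hidden_states[L]
    diff = [a - b for a, b in zip(h_mid, h_last)]
    return l2_norm(diff) / l2_norm(h_last)


# Toy hidden states for L = 4 (five entries, h_0..h_4); values are illustrative.
states = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 4.0]]
err = naive_state_err(states)
print(f"naive_StateErr = {err:.4f}")
```

	A value near 0 means the mid-depth state already predicts the final state; a value near or above 1 means it carries little information about it under this naive baseline.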
2026-03-31	Add BP supplement for A2 CIFAR: 10 seeds with acc, Gamma, rho, naive_StateErr	YurenHao0426
	BP 10-seed results: acc=0.614±0.003, Gamma=1.0, rho=0.998 Appended to A2_cifar_state_vs_credit.csv and A2_naive_state_err.csv Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31	Add naive state prediction baseline for A1 and A2	YurenHao0426
	A1: 240 rows (3 alpha × 2 depth × 4 methods × 10 seeds)
	A2: 30 rows (3 methods × 10 seeds)
	naive_StateErr = ||h_{L//2} - h_L|| / ||h_L||
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30	Add confirmatory paper experiments: A1-A4, all 10 seeds complete	YurenHao0426
	A1: Synthetic nonlinearity ladder (240 rows: 3 alpha × 2 depth × 4 methods × 10 seeds)
	A2: CIFAR state-vs-credit counterexample (30 rows: 3 methods × 10 seeds)
	A3: Frozen vs online dissociation (60 rows: 2 regimes × 3 methods × 10 seeds)
	A4: Protocol dependence panel (82 rows: assembled from existing results)
	All experiments ran on GPU 3. Total runtime: ~20 hours. CSVs in results/confirmatory/.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27	Add Phase 10A.8C: 3-seed replication — scaffold gains are marginal	YurenHao0426
	3-seed results (mean±std):
	- DFA: 0.306±0.006
	- perlayer_vector α=0.75: 0.304±0.006 (-0.2%, not significant)
	- random_trainable α=0.75: 0.313±0.007 (+0.7%, marginal, error bars overlap)
	Single-seed gains (+1.1% perlayer, +0.8% vec) do not robustly replicate. The scaffold mechanism provides at best a marginal, statistically uncertain benefit.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27	Add Phase 10A.8: freeze-with-decay confirms stale aux is main freeze failure cause; alpha sweep shows perlayer_vector at alpha=0.75 matches full network	YurenHao0426
	10A.8A: freeze_decay_to_000 recovers to 28.5% (vs 14.6% fixed freeze): stale high-weight aux is the primary cause of freeze crashes. But 28.5% < DFA 31.2% confirms continuous trainability adds ~2.7% independent value.
	10A.8B: Both perlayer_vector and random_trainable optimal at alpha=0.75. perlayer_vector +1.1% vs random_trainable +0.8%: per-layer vector is the minimal sufficient scaffold, no network needed.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27	Add Phase 10A.7: minimal aux compression - continuous trainability is essential	YurenHao0426
	8-branch dissection:
	- zero_target + normmatched both crash: non-zero direction necessary, not norm
	- perlayer_vector: +0.7% (per-block trainable vector works, network not required)
	- freeze_after_{1,5,10}: ALL crash to ~13-14% (continuous trainability essential)
	- random_trainable: +1.0% (reference)
	Minimal mechanism: continuously trainable, non-zero, depth-aware auxiliary perturbation. Freezing at ANY point destroys the benefit entirely.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26	Add Phase 10A.6: gain requires trainable depth-aware aux, not semantic credit	YurenHao0426
	9-branch dissection results:
	- zero_target crashes (-9.1%): aux must output non-zero
	- constant_input neutral (+0.0%): needs at least depth info
	- time_only works (+1.0%): h_l not needed, just depth index
	- shuffled/fresh_random work (+1.3-1.4%): no semantic content needed
	- prefit60_trainable ≈ random_trainable: prefit adds nothing
	- All frozen branches crash: trainability is essential
	Mechanism: depth-aware trainable auxiliary perturbation that diversifies block-local updates. Not semantic credit, not pure trainability.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26	Add Phase 10A.5: blend gain is implicit regularization, not learned credit	YurenHao0426
	Dissection of 6 branches from same DFA checkpoint:
	- blend_random_frozen: 12.6% (CATASTROPHIC: frozen noise destroys training)
	- blend_random_trainable: 32.2% (+1.2%: trainable network helps)
	- blend_shuffled_trainable: 32.5% (+1.4%: even wrong targets work!)
	- blend_gaussian_noise: 30.8% (neutral)
	- scaled_DFA_norm_match: 31.0% (neutral)
	The gain comes from implicit regularization through a co-optimized auxiliary network, NOT from learned credit quality. Phase 9A's +1.5% was an optimization dynamics effect, not evidence of useful credit assignment.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26	Add Phase 10A: no prefit threshold — even random Vec blend beats DFA by +1.3%	YurenHao0426
	E_prefit=0 (random Vec) + blend(0.75): 32.4% vs DFA 31.1% (+1.3%)
	E_prefit=15: 32.3% (+1.2%)
	E_prefit=60: 32.5% (+1.4%)
	Frozen Gamma/rho near zero at all prefit levels. The Phase 9A success was NOT from Vec learning useful credit; it was from the blend mechanism itself providing regularization/diversification over pure DFA.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26	Add Phase 9B+9C: periodic refit fails, top-down curriculum neutral	YurenHao0426
	Phase 9B (periodic refit K=5 R=1 alpha=0.75): 14.0%. Vec starts random, periodic refits insufficient without offline pretraining.
	Phase 9C (top-down curriculum): last1_vec=30.8%, last2_vec=31.1% vs DFA=31.2%. Near-neutral. Cold-start problem persists even for single-block Vec.
	Only Phase 9A's offline prefit + blend handoff (+1.5%) works. The key ingredient is offline Vec training on frozen checkpoint features.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25	Add Phase 9A: checkpointed handoff — blend(Vec+DFA) outperforms pure DFA	YurenHao0426
	First positive online result: 50% blend of offline-fitted Vec + DFA gives 31.7% vs 31.1% for pure DFA (+0.55%). This is Case B: pure Vec handoff fails (-1.1%) but blend works because DFA stabilizes trajectory while Vec adds directional credit.
	Offline-fitted Vec at DFA epoch-5 checkpoint: Gamma=0.229, rho=0.262. Cold-start confirmed as main bottleneck: Vec IS useful on DFA trajectory features.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25	Add Phase 8: schedule timing test - online co-learning is the remaining bottleneck	YurenHao0426
	Vec_only_from_0: 15.4% (cold-start failure, can't learn credit on random features)
	DFA_only: 31.2% (remains best non-BP method)
	DFA_then_Vec_T20: 12.9% (switching to Vec destroys DFA-built features)
	Vec_T5_then_DFA: 26.6% (partial recovery but still worse than pure DFA)
	Phase 7A's early-window finding doesn't transfer: it required offline-trained Vec on frozen features. Online Vec estimator faces cold-start paradox: it needs structured features to learn credit, but structured features need good credit to form.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25	Add Phase 7A: snapshot time sweep shows early snapshots have positive held-out transfer	YurenHao0426
	At epoch 5 (acc=49%):
	  Vec_M4 5-step:    dL_held=-0.005 (PUR=0.70)
	  Oracle BP 5-step: dL_held=-0.009 (PUR=1.05)
	  DFA 5-step:       dL_held=+0.003 (always hurts held-out)
	By epoch 20, generalization window closes. Held-out failure is late-snapshot artifact. Better credit → lower update variance (Vec=0.8 vs DFA=40), not higher.
	Key implication: DFA warmup delays credit bridge past its useful window. Credit should be used from epoch 0, not after 20% warmup.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25	Add Phase 6.5A: same-batch linesearch REVISES Phase 6A conclusion	YurenHao0426
	Phase 6A's "better credit → worse loss" was a protocol artifact caused by:
	1. Credit normalization (inflated DFA, suppressed Vec magnitude ordering)
	2. Held-out evaluation (measured generalization failure, not exploitability)
	3. Gradient clamping
	With strict same-batch evaluation:
	- Oracle BP: dL_same = -0.406 (strongest descent)
	- Vec_M4: dL_same = -0.135
	- ScalarCB: dL_same = -0.025
	- DFA: dL_same = -0.003
	Same-batch loss decrease is MONOTONIC with credit quality. But held-out loss INCREASES for all non-DFA methods (Case D: overfitting). The bottleneck is batch-level generalization, not surrogate exploitability.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24	Add Phase 6: snapshot exploitability reveals local update rule is the bottleneck	YurenHao0426
	Phase 6A: Better credit is ANTI-CORRELATED with loss decrease on fixed snapshot.
	  DFA (Gamma=0.01)      → dL=-0.0001 (only method that decreases loss)
	  Vec_M4 (Gamma=0.38)   → dL=+0.057 (increases loss most)
	  Oracle BP (Gamma=1.0) → dL=+0.011 (still increases loss)
	Phase 6C: Target-shift rule reduces damage but cannot make non-DFA credits productive. The inner-product surrogate <F_l(h), a_l> is fundamentally mismatched with directional credit.
	Conclusion: Case B. The primary bottleneck is the local update paradigm itself, not the credit estimator quality or tracking/co-adaptation.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24	Add Phase 5: vector field audit, frozen CIFAR transfer, online pilot	YurenHao0426
	Phase 5A: Audit passes: shuffle control collapses, gains are real
	Phase 5B: Transfer SUCCESS: vec_M4 beats scalar CB by +0.25 Gamma, +0.31 rho on frozen CIFAR
	Phase 5C: Online FAILURE: vec does worse than scalar CB online despite better frozen credit
	Core finding: bottleneck is in local surrogate / co-adaptation, not estimator quality
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>