path: root/experiments
Age  Commit message  Author
2026-04-02  Add SB and CB methods to cnn_baseline.py  YurenHao0426
    State bridge: per-layer StateBridgeNet predicting h3 from flattened h_l
    Credit bridge: per-layer ValueNet with terminal + bridge consistency + DFA warmup
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02  Fix gelu_ablation.py: compute method-specific Gamma instead of hardcoded 1.0  YurenHao0426
    DFA now uses regenerated DFA Bs for credit; SB/CB use BP as proxy (feedback nets not saved).
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02  Add GELU/ReLU ablation script for CIFAR MLP  YurenHao0426
    Note: the existing ResidualMLP already uses GELU. This adds a ResidualMLPReLU variant.
    Ablation compares ReLU vs GELU for BP/DFA/SB/CB.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02  Add CNN baseline: SmallCNN with BP/DFA/EP on CIFAR-10  YurenHao0426
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02  Add EP baseline implementation (Scellier & Bengio 2017) for CIFAR MLP  YurenHao0426
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01  Add d512_sparsity.py: support sparsity for d=512 checkpoints  YurenHao0426
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01  Add cifar_d512_confirmatory.py: L=4 d=512 with checkpoint saving  YurenHao0426
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01  Add clean_sparsity_persample.py: per-sample gradient stats  YurenHao0426
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01  Add clean_sparsity_full.py: independent-process full sparsity analysis  YurenHao0426
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01  Add clean gradient check: independent Python process per method, GPU 1  YurenHao0426
    Clean results (each method in fresh Python process):
    BP: mean_norm=2.58e-04, s(1e-6)=98% (CONFIRMED)
    DFA: layer 0 = 2.86e-07 (1.2%), layers 1-3 ≈ 2.4e-09 (0%)
    SB: layer 0 = 6.13e-06 (86%), layers 1-3 ≈ 1e-09 (0%)
    CB: layer 0 = 6.33e-07 (18%), layers 1-3 ≈ 5e-10 (0%)
    Method A (autograd.grad) and Method B (retain_grad) give identical results.
    Previous 1e-12 results were caused by Python process state pollution in combined scripts.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01  Add confirmatory supplement: T1-T4 from checkpoints (no retraining)  YurenHao0426
    WARNING: All methods (including BP) show near-zero BP hidden gradients (~1e-12 to 1e-14) when computed via manual forward with detached hidden states. This is inconsistent with the earlier first-priority analysis, which showed BP at 2.86e-04. Investigation needed.
    T1: 40 rows (4 methods × 10 seeds), full metrics
    T2: 800 rows (support sparsity, 5 thresholds × 4 methods × 10 seeds × 4 layers)
    T3: 48 rows (gradient norm distributions, 3 seeds × 4 methods × 4 layers)
    T4: 100 rows (active-subset Gamma, 5 thresholds × 2 methods × 10 seeds)
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01  Add extended sparsity analysis: A4 per-layer, B1 snapshots, B2 active subset, C1/C2  YurenHao0426
    A4: Per-layer support. DFA/SB/CB layers 1-3 have 0% support at τ=1e-6.
        Only BP has ~95% support; only SB layer 0 has 53%.
    B1: Snapshot evolution. Old snapshot checkpoints have near-zero grads (data issue).
    B2: Active subset. With τ=1e-6, no active samples for non-BP methods.
    C1: Active vs inactive cosine. Only inactive subset exists for non-BP.
    C2: Energy concentration. Near-zero for non-BP methods.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01  Add BP support sparsity analysis: threshold sweep + gradient histograms  YurenHao0426
    A1 Synthetic: all methods have >93% support at τ=1e-6 (gradients rarely zero)
    A2 CIFAR: massive gap (BP 98.4% vs DFA 0.4% vs SB 21% vs CB 3%)
    DFA-trained CIFAR networks have near-zero BP gradients for 99.6% of samples
    This explains why Gamma is unreliable for CIFAR non-BP methods
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
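The threshold sweep named in the commit above can be sketched as follows. This is a minimal illustration, not the repository's script: it assumes "support at τ" means the fraction of samples whose gradient norm strictly exceeds τ, and the names `support_fraction` and `threshold_sweep` are hypothetical.

```python
def support_fraction(grad_norms, tau):
    """Fraction of samples whose gradient norm exceeds tau.

    Assumption: "support" counts a sample as active when norm > tau.
    """
    if not grad_norms:
        raise ValueError("grad_norms must be non-empty")
    active = sum(1 for g in grad_norms if g > tau)
    return active / len(grad_norms)


def threshold_sweep(grad_norms, taus):
    """Map each threshold tau to its support fraction."""
    return {tau: support_fraction(grad_norms, tau) for tau in taus}


# Illustrative norms only (not measured values from the experiments).
example_norms = [1e-3, 1e-5, 1e-8, 1e-10]
sweep = threshold_sweep(example_norms, [1e-6, 1e-9, 1e-12])
```

Under this reading, a method with 0.4% support at τ=1e-6 has almost no samples with a usable gradient signal, which is consistent with the commit's remark that Gamma becomes unreliable there.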
2026-03-31  Update naive StateErr v3: L2 norm ratio formula, with checkpoints saved  YurenHao0426
    Formula: ||h_{L//2} - h_L||_2 / ||h_L||_2 (scalar L2 ratio)
    A1: 240 rows (3 alpha × 2 depth × 4 methods × 10 seeds)
    A2: 40 rows (4 methods including BP × 10 seeds)
    All model checkpoints saved in checkpoints_A1/ and checkpoints_A2/
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
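The L2 ratio in the commit above can be written out as a short sketch. It assumes hidden states are flat sequences of floats and that `hiddens[l]` holds h_l for l = 0..L; both the indexing and the function names are illustrative assumptions, not taken from the repository.

```python
import math


def naive_state_err(h_mid, h_last):
    """Scalar L2 ratio ||h_mid - h_last||_2 / ||h_last||_2."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_mid, h_last)))
    norm = math.sqrt(sum(b ** 2 for b in h_last))
    if norm == 0.0:
        raise ValueError("h_last has zero norm; ratio is undefined")
    return diff / norm


def state_err_from_hiddens(hiddens):
    """Apply the ratio to h_{L//2} and h_L.

    Assumption: hiddens[l] is h_l, so L = len(hiddens) - 1.
    """
    L = len(hiddens) - 1
    return naive_state_err(hiddens[L // 2], hiddens[L])
```

A value of 0 means the mid-depth state already equals the final state; a value near 1 means the difference is as large as the final state itself.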
2026-03-31  Add BP supplement for A2 CIFAR: 10 seeds with acc, Gamma, rho, naive_StateErr  YurenHao0426
    BP 10-seed results: acc=0.614±0.003, Gamma=1.0, rho=0.998
    Appended to A2_cifar_state_vs_credit.csv and A2_naive_state_err.csv
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31  Add naive state prediction baseline for A1 and A2  YurenHao0426
    A1: 240 rows (3 alpha × 2 depth × 4 methods × 10 seeds)
    A2: 30 rows (3 methods × 10 seeds)
    naive_StateErr = ||h_{L//2} - h_L|| / ||h_L||
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30  Add confirmatory paper experiments: A1-A4, all 10 seeds complete  YurenHao0426
    A1: Synthetic nonlinearity ladder (240 rows: 3 alpha × 2 depth × 4 methods × 10 seeds)
    A2: CIFAR state-vs-credit counterexample (30 rows: 3 methods × 10 seeds)
    A3: Frozen vs online dissociation (60 rows: 2 regimes × 3 methods × 10 seeds)
    A4: Protocol dependence panel (82 rows: assembled from existing results)
    All experiments ran on GPU 3. Total runtime: ~20 hours. CSVs in results/confirmatory/.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27  Add Phase 10A.8C: 3-seed replication - scaffold gains are marginal  YurenHao0426
    3-seed results (mean±std):
    - DFA: 0.306±0.006
    - perlayer_vector α=0.75: 0.304±0.006 (-0.2%, not significant)
    - random_trainable α=0.75: 0.313±0.007 (+0.7%, marginal, error bars overlap)
    Single-seed gains (+1.1% perlayer, +0.8% vec) do not robustly replicate.
    The scaffold mechanism provides at best a marginal, statistically uncertain benefit.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27  Add Phase 10A.8: freeze-with-decay confirms stale aux is main freeze failure cause; alpha sweep shows perlayer_vector at alpha=0.75 matches full network  YurenHao0426
    10A.8A: freeze_decay_to_000 recovers to 28.5% (vs 14.6% fixed freeze): stale high-weight aux is the primary cause of freeze crashes. But 28.5% < DFA 31.2% confirms continuous trainability adds ~2.7% independent value.
    10A.8B: Both perlayer_vector and random_trainable optimal at alpha=0.75. perlayer_vector +1.1% vs random_trainable +0.8%: per-layer vector is the minimal sufficient scaffold, no network needed.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27  Add Phase 10A.7: minimal aux compression - continuous trainability is essential  YurenHao0426
    8-branch dissection:
    - zero_target + normmatched both crash: non-zero direction necessary, not norm
    - perlayer_vector: +0.7% (per-block trainable vector works, network not required)
    - freeze_after_{1,5,10}: ALL crash to ~13-14% (continuous trainability essential)
    - random_trainable: +1.0% (reference)
    Minimal mechanism: continuously trainable, non-zero, depth-aware auxiliary perturbation.
    Freezing at ANY point destroys the benefit entirely.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26  Add Phase 10A.6: gain requires trainable depth-aware aux, not semantic credit  YurenHao0426
    9-branch dissection results:
    - zero_target crashes (-9.1%): aux must output non-zero
    - constant_input neutral (+0.0%): needs at least depth info
    - time_only works (+1.0%): h_l not needed, just depth index
    - shuffled/fresh_random work (+1.3-1.4%): no semantic content needed
    - prefit60_trainable ≈ random_trainable: prefit adds nothing
    - All frozen branches crash: trainability is essential
    Mechanism: depth-aware trainable auxiliary perturbation that diversifies block-local updates.
    Not semantic credit, not pure trainability.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26  Add Phase 10A.5: blend gain is implicit regularization, not learned credit  YurenHao0426
    Dissection of 6 branches from same DFA checkpoint:
    - blend_random_frozen: 12.6% (CATASTROPHIC: frozen noise destroys training)
    - blend_random_trainable: 32.2% (+1.2%: trainable network helps)
    - blend_shuffled_trainable: 32.5% (+1.4%: even wrong targets work!)
    - blend_gaussian_noise: 30.8% (neutral)
    - scaled_DFA_norm_match: 31.0% (neutral)
    The gain comes from implicit regularization through a co-optimized auxiliary network, NOT from learned credit quality.
    Phase 9A's +1.5% was an optimization dynamics effect, not evidence of useful credit assignment.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26  Add Phase 10A: no prefit threshold - even random Vec blend beats DFA by +1.3%  YurenHao0426
    E_prefit=0 (random Vec) + blend(0.75): 32.4% vs DFA 31.1% (+1.3%)
    E_prefit=15: 32.3% (+1.2%)
    E_prefit=60: 32.5% (+1.4%)
    Frozen Gamma/rho near zero at all prefit levels.
    The Phase 9A success was NOT from Vec learning useful credit; it was from the blend mechanism itself providing regularization/diversification over pure DFA.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26  Add Phase 9B+9C: periodic refit fails, top-down curriculum neutral  YurenHao0426
    Phase 9B (periodic refit K=5 R=1 alpha=0.75): 14.0%. Vec starts random, periodic refits insufficient without offline pretraining.
    Phase 9C (top-down curriculum): last1_vec=30.8%, last2_vec=31.1% vs DFA=31.2%. Near-neutral. Cold-start problem persists even for single-block Vec.
    Only Phase 9A's offline prefit + blend handoff (+1.5%) works. The key ingredient is offline Vec training on frozen checkpoint features.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25  Add Phase 9A: checkpointed handoff - blend(Vec+DFA) outperforms pure DFA  YurenHao0426
    First positive online result: 50% blend of offline-fitted Vec + DFA gives 31.7% vs 31.1% for pure DFA (+0.55%).
    This is Case B: pure Vec handoff fails (-1.1%) but blend works because DFA stabilizes trajectory while Vec adds directional credit.
    Offline-fitted Vec at DFA epoch-5 checkpoint: Gamma=0.229, rho=0.262.
    Cold-start confirmed as main bottleneck: Vec IS useful on DFA trajectory features.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25  Add Phase 8: schedule timing test - online co-learning is the remaining bottleneck  YurenHao0426
    Vec_only_from_0: 15.4% (cold-start failure, can't learn credit on random features)
    DFA_only: 31.2% (remains best non-BP method)
    DFA_then_Vec_T20: 12.9% (switching to Vec destroys DFA-built features)
    Vec_T5_then_DFA: 26.6% (partial recovery but still worse than pure DFA)
    Phase 7A's early-window finding doesn't transfer: it required offline-trained Vec on frozen features.
    Online Vec estimator faces cold-start paradox: needs structured features to learn credit, but structured features need good credit to form.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25  Add Phase 7A: snapshot time sweep shows early snapshots have positive held-out transfer  YurenHao0426
    At epoch 5 (acc=49%):
    - Vec_M4 5-step: dL_held=-0.005 (PUR=0.70)
    - Oracle BP 5-step: dL_held=-0.009 (PUR=1.05)
    - DFA 5-step: dL_held=+0.003 (always hurts held-out)
    By epoch 20, generalization window closes. Held-out failure is late-snapshot artifact.
    Better credit → lower update variance (Vec=0.8 vs DFA=40), not higher.
    Key implication: DFA warmup delays credit bridge past its useful window. Credit should be used from epoch 0, not after 20% warmup.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25  Add Phase 6.5A: same-batch linesearch REVISES Phase 6A conclusion  YurenHao0426
    Phase 6A's "better credit → worse loss" was a protocol artifact caused by:
    1. Credit normalization (inflated DFA, suppressed Vec magnitude ordering)
    2. Held-out evaluation (measured generalization failure, not exploitability)
    3. Gradient clamping
    With strict same-batch evaluation:
    - Oracle BP: dL_same = -0.406 (strongest descent)
    - Vec_M4: dL_same = -0.135
    - ScalarCB: dL_same = -0.025
    - DFA: dL_same = -0.003
    Same-batch loss decrease is MONOTONIC with credit quality.
    But held-out loss INCREASES for all non-DFA methods (Case D: overfitting).
    The bottleneck is batch-level generalization, not surrogate exploitability.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24  Add Phase 6: snapshot exploitability reveals local update rule is the bottleneck  YurenHao0426
    Phase 6A: Better credit is ANTI-CORRELATED with loss decrease on fixed snapshot.
    - DFA (Gamma=0.01) → dL=-0.0001 (only method that decreases loss)
    - Vec_M4 (Gamma=0.38) → dL=+0.057 (increases loss most)
    - Oracle BP (Gamma=1.0) → dL=+0.011 (still increases loss)
    Phase 6C: Target-shift rule reduces damage but cannot make non-DFA credits productive. The inner-product surrogate <F_l(h), a_l> is fundamentally mismatched with directional credit.
    Conclusion: Case B. The primary bottleneck is the local update paradigm itself, not the credit estimator quality or tracking/co-adaptation.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24  Add Phase 5: vector field audit, frozen CIFAR transfer, online pilot  YurenHao0426
    Phase 5A: Audit passes. Shuffle control collapses, gains are real.
    Phase 5B: Transfer SUCCESS. vec_M4 beats scalar CB by +0.25 Gamma, +0.31 rho on frozen CIFAR.
    Phase 5C: Online FAILURE. vec does worse than scalar CB online despite better frozen credit.
    Core finding: bottleneck is in local surrogate / co-adaptation, not estimator quality.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24  Add Phase 4 diagnostic dissection: frozen credit recovery, online shallow scan, vector field pilot  YurenHao0426
    Key findings:
    - Frozen CIFAR: estimators CAN recover credit (SB best, CB 20x > DFA)
    - Online shallow: cb_eT wr=0.2 tgw=1.0 achieves S1>0, S2 marginal
    - Vector credit field: 0.91-0.96 Gamma/rho on synthetic (vs 0.34 scalar CB)
    - Direct vector field avoids scalar V curvature problem entirely
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24  Add CIFAR deltaL test (failed) and pivot design memo  YurenHao0426
    - CIFAR deltaL: s=grad_hL CE (dim=512) -> acc=17.2%, Gamma≈0
      Confirms scalar value field has dimensionality bottleneck on CIFAR
    - Pivot memo: direct vector credit field a_phi(h,t,s) -> R^d
      Trained with perturbation-based target, avoids curvature problem
      Still satisfies no hidden BP anchor constraint
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24  Add exploration visualization: CIFAR depth scan, boundary ablation, synth vs CIFAR gap  YurenHao0426
    Three new plots:
    - cifar_depth_scan.png: acc/Gamma/rho vs depth for all methods
    - boundary_ablation.png: s_type, tgw, warmup ratio sweeps
    - synth_vs_cifar.png: dimensionality gap comparison (d=128 vs d=512)
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24  Add Phase 3 boundary-condition ablation results and combined memo  YurenHao0426
    Key findings:
    - deltaL (output-layer gradient) gives best Gamma (0.562 vs 0.452 for eT)
    - Concatenating h_L to s destroys credit quality (value net cheats)
    - Terminal gradient matching is monotonically beneficial
    - Best config: deltaL + tgw=1.0 + wr=0.05 -> Gamma=0.768, rho=0.691
    - CIFAR depth scan shows no Goldilocks regime (dimensionality bottleneck)
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23  Add Phase 2 explore experiments: synthetic nonlinearity ladder + CIFAR depth scan  YurenHao0426
    - synth_nonlinearity_ladder.py: teacher-student with phi_alpha(z) = (1-a)z + a*tanh(z)
      Sweeps alpha x depth to find where state bridge / credit bridge fail
    - cifar_depth_scan.py: CIFAR-10 with L={2,4,6,8,12}, d={256,512}
      Finds Goldilocks regime for credit bridge vs DFA
    - plot_synth_ladder.py: phase diagram visualization
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
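The interpolated activation phi_alpha(z) = (1-a)z + a*tanh(z) from the commit above can be sketched in a few lines. This is a scalar, standard-library illustration only; the actual script presumably applies it elementwise to tensors, and the function name here is an assumption.

```python
import math


def phi_alpha(z, alpha):
    """Interpolate between identity (alpha=0) and tanh (alpha=1)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected in [0, 1]")
    return (1.0 - alpha) * z + alpha * math.tanh(z)


# Sweep a few alpha values at a fixed input to see the nonlinearity grow.
ladder = [phi_alpha(2.0, a) for a in (0.0, 0.5, 1.0)]
```

At alpha=0 the network is purely linear, so the ladder isolates how much nonlinearity the bridges tolerate before failing.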
2026-03-23  Add final report, plots, experiment guide, and complete NOTE.md  YurenHao0426
    All experiments complete:
    - Toy LQ: credit bridge matches state bridge (~0.94 costate cosine)
    - CIFAR-10: credit bridge (29.6%) comparable to DFA (30.0%), both beat state bridge (18.5%)
    - State bridge confirms core hypothesis: perfect state prediction != useful credit
    - Terminal gradient matching is essential for credit bridge
2026-03-23  Sync state bridge: use normalized MSE target in both toy and CIFAR  YurenHao0426
    Reason: toy used raw MSE, CIFAR used normalized. They must be the same method for consistent reporting. Normalized MSE is more robust to varying h_L magnitudes.
2026-03-23  Initial implementation: all models, methods, toy and CIFAR experiments  YurenHao0426
    Debug phase. Toy LQ experiments (3 seeds) complete with terminal gradient matching. Credit bridge matches state bridge on linear system (~0.94 cosine). CIFAR experiments in progress.