Age | Commit message | Author
2026-04-03 | Add EP to BP gradient sparsity analysis | YurenHao0426
EP CIFAR d=256: s(1e-6)=100%, mean_norm=1.41e-04
EP produces networks where ALL samples have non-zero BP gradients, unlike DFA (0.4%), SB (21%), CB (3%). EP is closer to BP (98.7%).
Updated clean_sparsity_summary.csv (980 rows, now includes EP).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 | Add cross-method hidden state distance vs BP | YurenHao0426
Non-BP methods produce radically different representations (distances relative to BP hidden norms):
  DFA L0: 162×, L4: 2.5M×
  SB  L0: 3.2×, L4: 1.1M×
  CB  L0: 59×, L4: 1.4M×
  (BP vs itself = 0)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 | Update all EP results with sign-corrected credit | YurenHao0426
  EP Synthetic (fixed): Gamma=+0.13 to +0.20, rho=+0.25
  EP CIFAR d=256: Gamma=+0.007, rho=+0.051
  EP CIFAR d=512: Gamma=+0.000, rho=-0.002
  EP CNN: Gamma=+0.248, rho=+0.492
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 | Fix EP credit sign in cnn_baseline.py | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Recompute EP CIFAR d=256 diagnostics with sign fix | YurenHao0426
EP d=256 (5 seeds): acc=31.9%, Gamma=+0.007 (was -0.13), rho=+0.051 (was -0.037)
Sign correction: -(h_nudge - h_free)/β aligns EP credit with BP gradient direction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Fix EP credit sign: negate (h_nudge - h_free)/β to align with BP grad direction | YurenHao0426
The EP nudge moves h toward lower loss, which is opposite to the BP gradient (it points toward increasing loss). Without the negation, Gamma is negative and rho is -0.25.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
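A minimal sketch of the sign correction from this commit, assuming a toy quadratic loss so the BP gradient is known in closed form. The names `ep_credit` and `cosine`, the toy loss, and the way `h_nudge` is generated are illustrative assumptions, not code from the repository.

```python
import math

def ep_credit(h_free, h_nudge, beta):
    """Sign-corrected EP credit: -(h_nudge - h_free) / beta."""
    return [-(hn - hf) / beta for hf, hn in zip(h_free, h_nudge)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy setup (assumption): loss = 0.5 * ||h - target||^2, so dL/dh = h - target.
h_free = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]
beta = 0.1
bp_grad = [h - t for h, t in zip(h_free, target)]

# Assumption: the nudged state takes a small step toward lower loss.
h_nudge = [h - beta * g for h, g in zip(h_free, bp_grad)]

corrected = ep_credit(h_free, h_nudge, beta)
uncorrected = [(hn - hf) / beta for hf, hn in zip(h_free, h_nudge)]

cos_corrected = cosine(corrected, bp_grad)      # aligned with the BP gradient
cos_uncorrected = cosine(uncorrected, bp_grad)  # anti-aligned without the negation
```

In this toy case the uncorrected difference is anti-aligned with the BP gradient, which matches the negative Gamma and rho reported before the fix.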
2026-04-02 | Add EP synthetic results: 15 JSONs (3α × 5 seeds) | YurenHao0426
EP synthetic: accuracy is high (92-96%) but Gamma is negative (-0.13 to -0.20) and rho=-0.25.
The EP credit direction may be inverted, or the diagnostics may have an issue.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Add --d_hidden arg to ep_baseline.py for d=512 support | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Fix ep_synthetic: bp dict needs L+1 entries for EP credit comparison | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Add EP synthetic ladder script | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Add CNN SB+CB results (5 seeds each), update summary CSV | YurenHao0426
CNN CIFAR-10 (5 seeds):
  BP:  86.8%±0.3%, Gamma=0.238, rho=0.250
  DFA: 56.7%±2.0%, Gamma=0.216, rho=0.017
  SB:  63.3%±0.5%, Gamma=0.045, rho=0.298
  CB:  31.8%±6.2%, Gamma=0.013, rho=0.033
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Fix CNN state bridge: use custom CNNStateBridge for variable input dims | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Add SB and CB methods to cnn_baseline.py | YurenHao0426
State bridge: per-layer StateBridgeNet predicting h3 from flattened h_l
Credit bridge: per-layer ValueNet with terminal + bridge consistency + DFA warmup
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Fix and recompute GELU ablation Gamma from checkpoints | YurenHao0426
ReLU MLP (L=4 d=256):
  BP:  acc=61.1%, Gamma=1.000, rho=0.998
  DFA: acc=30.7%, Gamma=0.104, rho=-0.001
  SB:  acc=15.5%, Gamma=0.300, rho=0.159
  CB:  acc=28.7%, Gamma=0.298, rho=0.007
Note: SB/CB Gamma uses BP gradient as proxy (feedback nets not checkpointed).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Fix gelu_ablation.py: compute method-specific Gamma instead of hardcoded 1.0 | YurenHao0426
DFA now uses regenerated DFA Bs for credit; SB/CB use BP as proxy (feedback nets not saved).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Add summary CSVs for EP, GELU ablation, CNN | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Add EP baseline (5 seeds), GELU ablation (20 runs), CNN BP+DFA (10 runs) | YurenHao0426
EP (L=4 d=256): acc≈30%, Gamma≈0, rho≈0; EP credit signal is weak on the feedforward MLP
GELU ablation (ReLU variant): 4 methods × 5 seeds complete
CNN BP+DFA: 5 seeds each, BP + DFA on SmallCNN
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Add GELU/ReLU ablation script for CIFAR MLP | YurenHao0426
Note: the existing ResidualMLP already uses GELU. This adds a ResidualMLPReLU variant. The ablation compares ReLU vs GELU for BP/DFA/SB/CB.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Add CNN baseline: SmallCNN with BP/DFA/EP on CIFAR-10 | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 | Add EP baseline implementation (Scellier & Bengio 2017) for CIFAR MLP | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add d=512 support sparsity: 20 JSONs + summary CSV | YurenHao0426
  BP:  s(1e-6)=92.7%, norm=2.70e-04, r_inf=0.159, PR=0.300
  DFA: s(1e-6)=0.1%, norm=5.31e-08
  SB:  s(1e-6)=20.3%, norm=2.33e-06
  CB:  s(1e-6)=1.2%, norm=9.88e-08
Same pattern as d=256, confirming width-independence of the sparsity gap.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add d512_sparsity.py: support sparsity for d=512 checkpoints | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add missing bp_s456.json for CIFAR d=512 (rerun after SIGTERM) | YurenHao0426
bp s=456: acc=0.5999, rho=0.9881, nse=0.4764
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add CIFAR L=4 d=512 confirmatory: 4 methods × 5 seeds with checkpoints | YurenHao0426
  BP: 60.6%±0.3%, rho=0.989
  DFA: 30.8%±0.5%, rho=0.003
  State Bridge: 21.2%±3.7%, rho=0.119
  Credit Bridge: 30.1%±0.5%, rho=0.002
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add cifar_d512_confirmatory.py: L=4 d=512 with checkpoint saving | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add P3 protocol panel: method ranking across 5 protocol slices | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add per-sample gradient stats: 24576 rows (256 samples × 4 layers × 4 methods × 6 seeds) | YurenHao0426
Columns: method, seed, layer, sample_id, grad_norm, log10_grad_norm, r_inf, pr, hoyer, topk1, topk5
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add clean_sparsity_persample.py: per-sample gradient stats | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add clean_sparsity_summary.csv: 960 rows aggregated from 168 JSONs | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add clean sparsity results: 168 JSONs from independent processes on GPU 1 | YurenHao0426
CIFAR: 24 JSONs (4 methods × 6 seeds), BP s(1e-6)=98% confirmed
Synthetic: 144 JSONs (4 methods × 6 seeds × 3 alphas × 2 depths)
All data reliable: each method+seed ran in a separate Python process.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add run_clean_sparsity.sh: shell runner for independent-process sparsity | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add clean_sparsity_full.py: independent-process full sparsity analysis | YurenHao0426
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add clean gradient check: independent Python process per method, GPU 1 | YurenHao0426
Clean results (each method in a fresh Python process):
  BP:  mean_norm=2.58e-04, s(1e-6)=98% (CONFIRMED)
  DFA: layer 0 = 2.86e-07 (1.2%), layers 1-3 ≈ 2.4e-09 (0%)
  SB:  layer 0 = 6.13e-06 (86%), layers 1-3 ≈ 1e-09 (0%)
  CB:  layer 0 = 6.33e-07 (18%), layers 1-3 ≈ 5e-10 (0%)
Method A (autograd.grad) and Method B (retain_grad) give identical results.
Previous 1e-12 results were caused by Python process state pollution in combined scripts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add element-wise gradient concentration analysis (CPU, from checkpoints) | YurenHao0426
BP gradients are relatively uniform: top1%=7.1%, PR=0.327, eff_dim=0.632
DFA gradients are extremely concentrated: top1%=40.6%, PR=0.089, eff_dim=0.272
SB/CB are intermediate: top1%=17-21%, PR=0.14-0.17
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
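The concentration metrics in this commit are not defined in the log. The sketch below shows one plausible reading, under two assumptions: the top-1% figure is the share of squared-gradient energy held by the largest 1% of elements, and PR is a participation ratio normalized to (0, 1]. The function names `topk_energy_share` and `participation_ratio` are hypothetical.

```python
def topk_energy_share(grad, frac=0.01):
    """Share of total squared magnitude held by the top `frac` of elements (assumed definition)."""
    energy = sorted((x * x for x in grad), reverse=True)
    k = max(1, round(frac * len(energy)))
    total = sum(energy)
    return sum(energy[:k]) / total if total > 0 else 0.0

def participation_ratio(grad):
    """Normalized participation ratio (sum g^2)^2 / (n * sum g^4) (assumed definition)."""
    s2 = sum(x * x for x in grad)
    s4 = sum(x ** 4 for x in grad)
    return (s2 * s2) / (len(grad) * s4) if s4 > 0 else 0.0

# Two extreme toy gradients of 100 elements each.
uniform = [1.0] * 100           # energy spread evenly
one_hot = [1.0] + [0.0] * 99    # all energy in one element

uniform_stats = (topk_energy_share(uniform), participation_ratio(uniform))
one_hot_stats = (topk_energy_share(one_hot), participation_ratio(one_hot))
```

Under these definitions a perfectly uniform gradient gives a top-1% share of 1% and PR of 1, while a one-hot gradient gives a share of 100% and PR of 1/n, so higher top-1% and lower PR both indicate concentration.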
2026-04-01 | Add confirmatory supplement: T1-T4 from checkpoints (no retraining) | YurenHao0426
WARNING: All methods (including BP) show near-zero BP hidden gradients (~1e-12 to 1e-14) when computed via manual forward with detached hidden states. This is inconsistent with the earlier first-priority analysis, which showed BP at 2.86e-04. Investigation needed.
T1: 40 rows (4 methods × 10 seeds), full metrics
T2: 800 rows (support sparsity, 5 thresholds × 4 methods × 10 seeds × 4 layers)
T3: 48 rows (gradient norm distributions, 3 seeds × 4 methods × 4 layers)
T4: 100 rows (active-subset Gamma, 5 thresholds × 2 methods × 10 seeds)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add extended sparsity analysis: A4 per-layer, B1 snapshots, B2 active subset, C1/C2 | YurenHao0426
A4: Per-layer support: DFA/SB/CB layers 1-3 have 0% support at τ=1e-6. Only BP has ~95% support; only SB layer 0 has 53%
B1: Snapshot evolution: old snapshot checkpoints have near-zero grads (data issue)
B2: Active subset: with τ=1e-6, no active samples for non-BP methods
C1: Active vs inactive cosine: only inactive subset exists for non-BP
C2: Energy concentration: near-zero for non-BP methods
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 | Add BP support sparsity analysis: threshold sweep + gradient histograms | YurenHao0426
A1 Synthetic: all methods have >93% support at τ=1e-6 (gradients rarely zero)
A2 CIFAR: massive gap, with BP 98.4% vs DFA 0.4% vs SB 21% vs CB 3%
DFA-trained CIFAR networks have near-zero BP gradients for 99.6% of samples.
This explains why Gamma is unreliable for CIFAR non-BP methods.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
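A minimal sketch of the threshold sweep described in this commit, assuming that support at τ means the fraction of samples whose BP hidden-gradient norm exceeds τ. The function names and the toy norms are hypothetical and are not taken from the repository or its results.

```python
def support(grad_norms, tau):
    """Fraction of samples with gradient norm strictly above tau (assumed definition of s(tau))."""
    return sum(1 for n in grad_norms if n > tau) / len(grad_norms)

def threshold_sweep(grad_norms, taus):
    """Support at each threshold, keyed by tau."""
    return {tau: support(grad_norms, tau) for tau in taus}

# Toy per-sample gradient norms (illustrative values only).
norms = [2.6e-4, 3.1e-4, 2.4e-9, 5e-10, 6.1e-6]
sweep = threshold_sweep(norms, [1e-8, 1e-6, 1e-4])
```

Reporting support at several thresholds, rather than only at τ=1e-6, makes it easier to see whether a method's gradients are genuinely near zero or merely small.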
2026-04-01 | Recompute BP and DFA Gamma with near-zero gradient filtering | YurenHao0426
BP Gamma: raw~0.99, filtered=1.000 (confirms self-cosine artifact from zero grads)
DFA Gamma (synth): raw~0.01-0.16, filtered~0.01-0.17 (minimal filtering effect)
DFA Gamma (CIFAR): raw=0.107, filtered=0.466 (99.7% samples have near-zero BP grad!)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
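The filtering effect reported here can be illustrated with a small sketch. It assumes Gamma is a mean per-sample cosine between a method's credit and the BP gradient, and that filtering drops samples whose BP gradient norm is at or below a threshold; both are readings of the log, not confirmed definitions, and the names are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v, eps=1e-30):
    """Cosine similarity; a zero vector yields 0 because of the eps in the denominator."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v) + eps)

def gamma(credits, bp_grads, tau=None):
    """Mean per-sample cosine; if tau is given, keep only samples with ||bp_grad|| > tau."""
    pairs = [(c, g) for c, g in zip(credits, bp_grads) if tau is None or norm(g) > tau]
    if not pairs:
        return float("nan")
    return sum(cosine(c, g) for c, g in pairs) / len(pairs)

# BP compared with itself: one of three samples has an exactly zero gradient.
bp = [[3e-4, 4e-4], [0.0, 0.0], [1e-4, 0.0]]
raw_gamma = gamma(bp, bp)                 # pulled below 1 by the zero-gradient sample
filtered_gamma = gamma(bp, bp, tau=1e-6)  # zero-gradient sample excluded
```

In this toy case the raw self-cosine drops to about 2/3 purely because of the zero-gradient sample, which is the kind of artifact the commit attributes the raw BP value of ~0.99 to.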
2026-03-31 | Update naive StateErr v3: L2 norm ratio formula, with checkpoints saved | YurenHao0426
Formula: ||h_{L//2} - h_L||_2 / ||h_L||_2 (scalar L2 ratio)
A1: 240 rows (3 alpha × 2 depth × 4 methods × 10 seeds)
A2: 40 rows (4 methods including BP × 10 seeds)
All model checkpoints saved in checkpoints_A1/ and checkpoints_A2/
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
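The formula above can be sketched directly. The assumption here is that hidden states are stored as a list of L+1 vectors h_0..h_L; the function name `naive_state_err` and that layout are illustrative, not taken from the repository.

```python
import math

def naive_state_err(hidden_states):
    """||h_{L//2} - h_L||_2 / ||h_L||_2 for hidden_states = [h_0, ..., h_L] (assumed layout)."""
    L = len(hidden_states) - 1
    h_mid, h_last = hidden_states[L // 2], hidden_states[L]
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_mid, h_last)))
    return diff / math.sqrt(sum(b * b for b in h_last))

# Toy example with L=4: only h_2 and h_4 affect the result.
states = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [2.0, 2.0], [3.0, 4.0]]
err = naive_state_err(states)
```

The baseline uses the mid-depth state as a naive prediction of the final state, so a value near 0 means the representation barely changes in the second half of the network.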
2026-03-31 | Add BP supplement for A2 CIFAR: 10 seeds with acc, Gamma, rho, naive_StateErr | YurenHao0426
BP 10-seed results: acc=0.614±0.003, Gamma=1.0, rho=0.998
Appended to A2_cifar_state_vs_credit.csv and A2_naive_state_err.csv
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 | Add naive state prediction baseline for A1 and A2 | YurenHao0426
A1: 240 rows (3 alpha × 2 depth × 4 methods × 10 seeds)
A2: 30 rows (3 methods × 10 seeds)
naive_StateErr = ||h_{L//2} - h_L|| / ||h_L||
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 | Add confirmatory paper experiments: A1-A4, all 10 seeds complete | YurenHao0426
A1: Synthetic nonlinearity ladder (240 rows: 3 alpha × 2 depth × 4 methods × 10 seeds)
A2: CIFAR state-vs-credit counterexample (30 rows: 3 methods × 10 seeds)
A3: Frozen vs online dissociation (60 rows: 2 regimes × 3 methods × 10 seeds)
A4: Protocol dependence panel (82 rows: assembled from existing results)
All experiments ran on GPU 3. Total runtime: ~20 hours. CSVs in results/confirmatory/.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 | Add Phase 10A.8C: 3-seed replication, scaffold gains are marginal | YurenHao0426
3-seed results (mean±std):
- DFA: 0.306±0.006
- perlayer_vector α=0.75: 0.304±0.006 (-0.2%, not significant)
- random_trainable α=0.75: 0.313±0.007 (+0.7%, marginal, error bars overlap)
Single-seed gains (+1.1% perlayer, +0.8% vec) do not robustly replicate. The scaffold mechanism provides at best a marginal, statistically uncertain benefit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 | Add Phase 10A.8: freeze-with-decay confirms stale aux is main freeze failure cause; alpha sweep shows perlayer_vector at alpha=0.75 matches full network | YurenHao0426
10A.8A: freeze_decay_to_000 recovers to 28.5% (vs 14.6% fixed freeze), so stale high-weight aux is the primary cause of freeze crashes. But 28.5% < DFA 31.2% confirms continuous trainability adds ~2.7% independent value.
10A.8B: Both perlayer_vector and random_trainable are optimal at alpha=0.75. perlayer_vector +1.1% vs random_trainable +0.8%: the per-layer vector is the minimal sufficient scaffold, no network needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 | Add Phase 10A.7: minimal aux compression, continuous trainability is essential | YurenHao0426
8-branch dissection:
- zero_target + normmatched both crash: non-zero direction necessary, not norm
- perlayer_vector: +0.7% (per-block trainable vector works, network not required)
- freeze_after_{1,5,10}: ALL crash to ~13-14% (continuous trainability essential)
- random_trainable: +1.0% (reference)
Minimal mechanism: continuously trainable, non-zero, depth-aware auxiliary perturbation. Freezing at ANY point destroys the benefit entirely.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 | Add Phase 10A.6: gain requires trainable depth-aware aux, not semantic credit | YurenHao0426
9-branch dissection results:
- zero_target crashes (-9.1%): aux must output non-zero
- constant_input neutral (+0.0%): needs at least depth info
- time_only works (+1.0%): h_l not needed, just depth index
- shuffled/fresh_random work (+1.3-1.4%): no semantic content needed
- prefit60_trainable ≈ random_trainable: prefit adds nothing
- All frozen branches crash: trainability is essential
Mechanism: depth-aware trainable auxiliary perturbation that diversifies block-local updates. Not semantic credit, not pure trainability.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 | Add Phase 10A.5: blend gain is implicit regularization, not learned credit | YurenHao0426
Dissection of 6 branches from same DFA checkpoint:
- blend_random_frozen: 12.6% (CATASTROPHIC: frozen noise destroys training)
- blend_random_trainable: 32.2% (+1.2%: trainable network helps)
- blend_shuffled_trainable: 32.5% (+1.4%: even wrong targets work!)
- blend_gaussian_noise: 30.8% (neutral)
- scaled_DFA_norm_match: 31.0% (neutral)
The gain comes from implicit regularization through a co-optimized auxiliary network, NOT from learned credit quality. Phase 9A's +1.5% was an optimization dynamics effect, not evidence of useful credit assignment.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 | Add Phase 10A: no prefit threshold, even random Vec blend beats DFA by +1.3% | YurenHao0426
  E_prefit=0 (random Vec) + blend(0.75): 32.4% vs DFA 31.1% (+1.3%)
  E_prefit=15: 32.3% (+1.2%)
  E_prefit=60: 32.5% (+1.4%)
Frozen Gamma/rho near zero at all prefit levels. The Phase 9A success was NOT from Vec learning useful credit; it was from the blend mechanism itself providing regularization/diversification over pure DFA.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 | Add Phase 9B+9C: periodic refit fails, top-down curriculum neutral | YurenHao0426
Phase 9B (periodic refit K=5 R=1 alpha=0.75): 14.0%. Vec starts random; periodic refits are insufficient without offline pretraining.
Phase 9C (top-down curriculum): last1_vec=30.8%, last2_vec=31.1% vs DFA=31.2%. Near-neutral. Cold-start problem persists even for single-block Vec.
Only Phase 9A's offline prefit + blend handoff (+1.5%) works. The key ingredient is offline Vec training on frozen checkpoint features.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 | Full Phase 9A: blend(0.75) outperforms DFA by +1.5% across multiple t0 | YurenHao0426
Best configs (seed=42):
- t0=5, blend_075 (75%Vec+25%DFA): 32.6% vs DFA 31.0% (+1.5%)
- t0=10, blend_075: 32.5% vs 31.0% (+1.4%)
- t0=1, blend_05: 31.9% vs 31.0% (+0.9%)
Higher Vec fraction (0.75) consistently outperforms lower (0.25, 0.5) at t0>=5. Pure Vec handoff still fails at all checkpoints.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
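A minimal sketch of the blend described in this commit, reading blend_075 as a convex combination of 75% Vec credit and 25% DFA credit. It also assumes t0 is the epoch at which training hands off from pure DFA to the blend. The names `blend_credit` and `credit_for_epoch` are hypothetical and the vectors are toy values.

```python
def blend_credit(vec_credit, dfa_credit, alpha=0.75):
    """Convex blend: alpha * Vec + (1 - alpha) * DFA, element-wise."""
    return [alpha * v + (1.0 - alpha) * d for v, d in zip(vec_credit, dfa_credit)]

def credit_for_epoch(epoch, t0, vec_credit, dfa_credit, alpha=0.75):
    """Assumed handoff rule: pure DFA before epoch t0, blended credit from t0 onward."""
    if epoch < t0:
        return list(dfa_credit)
    return blend_credit(vec_credit, dfa_credit, alpha)

vec = [1.0, 0.0]
dfa = [0.0, 1.0]
before_handoff = credit_for_epoch(epoch=3, t0=5, vec_credit=vec, dfa_credit=dfa)
after_handoff = credit_for_epoch(epoch=5, t0=5, vec_credit=vec, dfa_credit=dfa)
```

Setting alpha=1.0 in this sketch corresponds to a pure Vec handoff, the configuration the commit reports as failing at all checkpoints.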