| Age | Commit message | Author |
|
EP (L=4 d=256): acc≈30%, Gamma≈0, rho≈0 — EP credit signal weak on feedforward MLP
GELU ablation (ReLU variant): 4 methods × 5 seeds complete
CNN BP+DFA: 5 seeds each, BP + DFA on SmallCNN
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
BP: s(1e-6)=92.7%, norm=2.70e-04, r_inf=0.159, PR=0.300
DFA: s(1e-6)=0.1%, norm=5.31e-08
SB: s(1e-6)=20.3%, norm=2.33e-06
CB: s(1e-6)=1.2%, norm=9.88e-08
Same pattern as d=256, confirming width-independence of the sparsity gap.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
bp s=456: acc=0.5999, rho=0.9881, nse=0.4764
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
BP: 60.6%±0.3%, rho=0.989
DFA: 30.8%±0.5%, rho=0.003
State Bridge: 21.2%±3.7%, rho=0.119
Credit Bridge: 30.1%±0.5%, rho=0.002
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
methods × 6 seeds)
Columns: method, seed, layer, sample_id, grad_norm, log10_grad_norm, r_inf, pr, hoyer, topk1, topk5
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
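
The `pr`, `hoyer`, `topk1` and `topk5` columns above are concentration statistics of a per-sample gradient vector. The log does not state their exact definitions, so the sketch below is an assumption: `pr` is taken as the participation ratio normalized by dimension (consistent with the reported values lying below 1), `hoyer` as the standard Hoyer sparsity, and `topk` as the share of squared-gradient energy in the top 1% or 5% of coordinates. `r_inf` is omitted because its definition cannot be inferred from the log. All function names are hypothetical.

```python
import math


def participation_ratio(g):
    """Normalized participation ratio: (sum g^2)^2 / (d * sum g^4), in (0, 1]."""
    s2 = sum(x * x for x in g)
    s4 = sum(x ** 4 for x in g)
    return (s2 * s2) / (len(g) * s4) if s4 > 0 else 0.0


def hoyer(g):
    """Hoyer sparsity: 0 for a uniform vector, 1 for a one-hot vector."""
    n = len(g)
    l1 = sum(abs(x) for x in g)
    l2 = math.sqrt(sum(x * x for x in g))
    if l2 == 0 or n < 2:
        return 0.0
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


def topk_energy(g, frac):
    """Share of squared-gradient energy held by the top `frac` of coordinates."""
    energy = sorted((x * x for x in g), reverse=True)
    k = max(1, math.ceil(frac * len(energy)))
    total = sum(energy)
    return sum(energy[:k]) / total if total > 0 else 0.0


# Two extreme cases: energy spread evenly vs. energy in a single coordinate.
uniform = [1.0] * 100
one_hot = [1.0] + [0.0] * 99
```

Under these assumed definitions the uniform vector gives `pr` = 1 and `hoyer` = 0, while the one-hot vector gives `pr` = 1/d and `hoyer` = 1, so lower `pr` and higher `topk` both indicate a more concentrated gradient.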
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
CIFAR: 24 JSONs (4 methods × 6 seeds), BP s(1e-6)=98% confirmed
Synthetic: 144 JSONs (4 methods × 6 seeds × 3 alphas × 2 depths)
All data reliable — each method+seed in separate Python process.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Clean results (each method in fresh Python process):
BP: mean_norm=2.58e-04, s(1e-6)=98% — CONFIRMED
DFA: layer 0 = 2.86e-07 (1.2%), layers 1-3 ≈ 2.4e-09 (0%)
SB: layer 0 = 6.13e-06 (86%), layers 1-3 ≈ 1e-09 (0%)
CB: layer 0 = 6.33e-07 (18%), layers 1-3 ≈ 5e-10 (0%)
Method A (autograd.grad) and Method B (retain_grad) give identical results.
Previous 1e-12 results were caused by Python process state pollution in combined scripts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
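
The fix described above (one fresh Python process per method) can be sketched with the standard library alone. This is an illustration, not the repository's actual runner: `run_isolated` and `CHILD` are hypothetical names, and the child script is a stand-in that only reports whether any state leaked in from a previous run.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for a per-method analysis script. It bumps a counter
# stored on `builtins`; in a shared process the counter would keep growing,
# in a fresh process it is always 1.
CHILD = """
import builtins, json, sys
builtins.calls = getattr(builtins, "calls", 0) + 1
print(json.dumps({"method": sys.argv[1], "calls": builtins.calls}))
"""


def run_isolated(method):
    """Run the child script for `method` in a new interpreter and parse its JSON output."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, method],
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with a non-zero status
    )
    return json.loads(proc.stdout)


results = [run_isolated(m) for m in ["bp", "dfa", "sb", "cb"]]
```

Each child starts with clean module and global state, so nothing set while analysing one method can affect the next; the cost is paying interpreter start-up and any model loading once per run.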
|
|
BP gradients are relatively uniform: top1%=7.1%, PR=0.327, eff_dim=0.632
DFA gradients extremely concentrated: top1%=40.6%, PR=0.089, eff_dim=0.272
SB/CB intermediate: top1%=17-21%, PR=0.14-0.17
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
WARNING: All methods (including BP) show near-zero BP hidden gradients (~1e-12 to 1e-14)
when computed via a manual forward pass with detached hidden states. This is inconsistent with
the earlier first-priority analysis, which showed BP at 2.86e-04. Investigation needed.
T1: 40 rows (4 methods × 10 seeds) - full metrics
T2: 800 rows (support sparsity, 5 thresholds × 4 methods × 10 seeds × 4 layers)
T3: 48 rows (gradient norm distributions, 3 seeds × 4 methods × 4 layers)
T4: 100 rows (active-subset Gamma, 5 thresholds × 2 methods × 10 seeds)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
subset, C1/C2
A4: Per-layer support — DFA/SB/CB layers 1-3 have 0% support at τ=1e-6
Only BP has ~95% support; only SB layer 0 has 53%
B1: Snapshot evolution — old snapshot checkpoints have near-zero grads (data issue)
B2: Active subset — with τ=1e-6, no active samples for non-BP methods
C1: Active vs inactive cosine — only inactive subset exists for non-BP
C2: Energy concentration — near-zero for non-BP methods
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
A1 Synthetic: all methods have >93% support at τ=1e-6 (gradients rarely zero)
A2 CIFAR: massive gap — BP 98.4% vs DFA 0.4% vs SB 21% vs CB 3%
DFA-trained CIFAR networks have near-zero BP gradients for 99.6% of samples
This explains why Gamma is unreliable for CIFAR non-BP methods
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
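
The support figures above read as the fraction of samples whose BP gradient norm exceeds the threshold τ. That reading is inferred from the log rather than stated in it, and `support_fraction` is a hypothetical name; the norms in the example are made up.

```python
def support_fraction(grad_norms, tau=1e-6):
    """Fraction of samples whose gradient norm is strictly greater than tau."""
    if not grad_norms:
        return 0.0
    return sum(1 for n in grad_norms if n > tau) / len(grad_norms)


# Made-up per-sample norms: two well above tau, two far below it.
norms = [2.7e-4, 3.1e-4, 5.0e-8, 2.0e-9]
print(support_fraction(norms))             # 0.5
print(support_fraction(norms, tau=1e-8))   # 0.75
```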
|
|
BP Gamma: raw~0.99, filtered=1.000 (confirms self-cosine artifact from zero grads)
DFA Gamma (synth): raw~0.01-0.16, filtered~0.01-0.17 (minimal filtering effect)
DFA Gamma (CIFAR): raw=0.107, filtered=0.466 (99.7% of samples have near-zero BP gradients)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
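
The raw versus filtered comparison above can be illustrated as follows, assuming Gamma is a mean per-sample cosine between a method's credit signal and the BP gradient, and that filtering drops samples whose BP gradient norm is at or below τ. Both assumptions, the `eps` stabilizer, and the names `cosine` and `gamma` are hypothetical; the log does not give the actual definition.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity with an additive stabilizer in the denominator."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + eps)


def gamma(credit, bp_grads, tau=None):
    """Mean per-sample cosine; if tau is set, skip samples with ||bp_grad|| <= tau."""
    cosines = []
    for c, g in zip(credit, bp_grads):
        if tau is not None and math.sqrt(sum(x * x for x in g)) <= tau:
            continue
        cosines.append(cosine(c, g))
    return sum(cosines) / len(cosines) if cosines else float("nan")


# Self-cosine of BP against itself, with half of the samples at exactly zero gradient.
bp = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
raw = gamma(bp, bp)                 # zero-gradient samples contribute cosine 0
filtered = gamma(bp, bp, tau=1e-6)  # only the two non-zero samples remain
```

In this toy case the zero-gradient samples pull the raw self-cosine down to about 0.5, while the filtered value is about 1. It shows one plausible way that near-zero gradients can depress a raw average; it is not a reconstruction of the reported numbers.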
|
|
Formula: ||h_{L//2} - h_L||_2 / ||h_L||_2 (scalar L2 ratio)
A1: 240 rows (3 alpha × 2 depth × 4 methods × 10 seeds)
A2: 40 rows (4 methods including BP × 10 seeds)
All model checkpoints saved in checkpoints_A1/ and checkpoints_A2/
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
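
The formula above, ||h_{L//2} - h_L||_2 / ||h_L||_2, can be computed as in this minimal sketch. It assumes `hidden_states` is a list h_0..h_L of equal-width vectors for one sample, with index 0 being the first entry; the function name and that indexing convention are assumptions, not taken from the repository.

```python
import math


def naive_state_err(hidden_states):
    """Scalar L2 ratio ||h_{L//2} - h_L||_2 / ||h_L||_2 for hidden vectors h_0..h_L."""
    L = len(hidden_states) - 1
    h_mid, h_last = hidden_states[L // 2], hidden_states[L]
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_mid, h_last)))
    norm = math.sqrt(sum(b * b for b in h_last))
    return diff / norm  # undefined (division by zero) if h_L is the zero vector


# Toy example with L = 2: h_1 = (3, 0) and h_2 = (3, 4).
states = [[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]]
print(naive_state_err(states))  # 0.8
```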
|
|
BP 10-seed results: acc=0.614±0.003, Gamma=1.0, rho=0.998
Appended to A2_cifar_state_vs_credit.csv and A2_naive_state_err.csv
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
A1: 240 rows (3 alpha × 2 depth × 4 methods × 10 seeds)
A2: 30 rows (3 methods × 10 seeds)
naive_StateErr = ||h_{L//2} - h_L|| / ||h_L||
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
A1: Synthetic nonlinearity ladder (240 rows: 3 alpha × 2 depth × 4 methods × 10 seeds)
A2: CIFAR state-vs-credit counterexample (30 rows: 3 methods × 10 seeds)
A3: Frozen vs online dissociation (60 rows: 2 regimes × 3 methods × 10 seeds)
A4: Protocol dependence panel (82 rows: assembled from existing results)
All experiments ran on GPU 3. Total runtime: ~20 hours.
CSVs in results/confirmatory/.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
All experiments complete:
- Toy LQ: credit bridge matches state bridge (~0.94 costate cosine)
- CIFAR-10: credit bridge (29.6%) comparable to DFA (30.0%), both beat state bridge (18.5%)
- State bridge confirms core hypothesis: perfect state prediction != useful credit
- Terminal gradient matching is essential for credit bridge
|
|
Debug phase. Toy LQ experiments (3 seeds) complete with terminal gradient matching.
Credit bridge matches state bridge on linear system (~0.94 cosine).
CIFAR experiments in progress.
|