| Age | Commit message | Author |
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Clean results (each method in fresh Python process):
BP: mean_norm=2.58e-04, s(1e-6)=98% — CONFIRMED
DFA: layer 0 = 2.86e-07 (1.2%), layers 1-3 ≈ 2.4e-09 (0%)
SB: layer 0 = 6.13e-06 (86%), layers 1-3 ≈ 1e-09 (0%)
CB: layer 0 = 6.33e-07 (18%), layers 1-3 ≈ 5e-10 (0%)
Method A (autograd.grad) and Method B (retain_grad) give identical results.
Previous 1e-12 results were caused by Python process state pollution in combined scripts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
WARNING: All methods (including BP) show near-zero BP hidden gradients (~1e-14 to 1e-12)
when computed via a manual forward pass with detached hidden states. This is inconsistent
with the earlier first-priority analysis, which showed BP at 2.86e-04. Investigation needed.
T1: 40 rows (4 methods × 10 seeds) - full metrics
T2: 800 rows (support sparsity, 5 thresholds × 4 methods × 10 seeds × 4 layers)
T3: 48 rows (gradient norm distributions, 3 seeds × 4 methods × 4 layers)
T4: 100 rows (active-subset Gamma, 5 thresholds × 2 methods × 10 seeds)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
subset, C1/C2
A4: Per-layer support — DFA/SB/CB layers 1-3 have 0% support at τ=1e-6
Only BP has ~95% support; among non-BP entries, only SB layer 0 reaches 53%
B1: Snapshot evolution — old snapshot checkpoints have near-zero grads (data issue)
B2: Active subset — with τ=1e-6, no active samples for non-BP methods
C1: Active vs inactive cosine — only inactive subset exists for non-BP
C2: Energy concentration — near-zero for non-BP methods
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
A1 Synthetic: all methods have >93% support at τ=1e-6 (gradients rarely zero)
A2 CIFAR: massive gap — BP 98.4% vs DFA 0.4% vs SB 21% vs CB 3%
DFA-trained CIFAR networks have near-zero BP gradients for 99.6% of samples
This explains why Gamma is unreliable for CIFAR non-BP methods
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
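The support numbers above can be read as a thresholded fraction. The sketch below assumes "support at τ" means the share of samples whose BP gradient norm is strictly greater than τ; the commits do not spell out the exact definition, so the strict inequality, the function name `support_fraction`, and the example norms are all illustrative assumptions rather than the repository's code.

```python
def support_fraction(grad_norms, tau=1e-6):
    """Fraction of samples whose gradient norm exceeds tau.

    Assumption: 'support' counts samples with norm strictly above tau.
    """
    if not grad_norms:
        raise ValueError("grad_norms must be non-empty")
    active = sum(1 for g in grad_norms if g > tau)
    return active / len(grad_norms)


# Hypothetical per-sample gradient norms, NOT taken from the experiments.
example_norms = [2.6e-4, 3.1e-7, 4.0e-9, 1.2e-6, 5.0e-10]

for tau in (1e-8, 1e-6, 1e-4):
    frac = support_fraction(example_norms, tau)
    print(f"tau={tau:g}: support={frac:.0%}")
```

Sweeping several thresholds, as in this loop, shows how strongly a support figure depends on the choice of τ when most norms sit near the threshold.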
|
|
Formula: naive_StateErr = ||h_{L//2} - h_L||_2 / ||h_L||_2 (scalar L2 ratio)
A1: 240 rows (3 alpha × 2 depth × 4 methods × 10 seeds)
A2: 40 rows (4 methods including BP × 10 seeds)
All model checkpoints saved in checkpoints_A1/ and checkpoints_A2/
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
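For reference, the L2 ratio from the formula above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: it assumes hidden states are flat lists of floats, that `hidden_states[l]` holds h_l for l = 0..L (so the list has L+1 entries), and the helper names are hypothetical.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def naive_state_err(h_mid, h_last):
    """||h_mid - h_last||_2 / ||h_last||_2 for two same-length vectors."""
    if len(h_mid) != len(h_last):
        raise ValueError("hidden states must have the same dimension")
    denom = l2_norm(h_last)
    if denom == 0.0:
        raise ValueError("||h_L|| is zero; the ratio is undefined")
    diff = [a - b for a, b in zip(h_mid, h_last)]
    return l2_norm(diff) / denom


def naive_state_err_from_states(hidden_states):
    """Apply the ratio to h_{L//2} and h_L.

    Assumption: hidden_states[l] is h_l for l = 0..L.
    """
    depth = len(hidden_states) - 1
    return naive_state_err(hidden_states[depth // 2], hidden_states[depth])


# Hypothetical 2-d states for a depth-4 network (h_0 .. h_4).
states = [[1.0, 1.0], [2.0, 3.0], [3.0, 4.0], [1.0, 4.0], [0.0, 4.0]]
print(naive_state_err_from_states(states))  # uses h_2 and h_4
```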
|
|
BP 10-seed results: acc=0.614±0.003, Gamma=1.0, rho=0.998
Appended to A2_cifar_state_vs_credit.csv and A2_naive_state_err.csv
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
A1: 240 rows (3 alpha × 2 depth × 4 methods × 10 seeds)
A2: 30 rows (3 methods × 10 seeds)
naive_StateErr = ||h_{L//2} - h_L|| / ||h_L||
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
A1: Synthetic nonlinearity ladder (240 rows: 3 alpha × 2 depth × 4 methods × 10 seeds)
A2: CIFAR state-vs-credit counterexample (30 rows: 3 methods × 10 seeds)
A3: Frozen vs online dissociation (60 rows: 2 regimes × 3 methods × 10 seeds)
A4: Protocol dependence panel (82 rows: assembled from existing results)
All experiments ran on GPU 3. Total runtime: ~20 hours.
CSVs in results/confirmatory/.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
3-seed results (mean±std):
- DFA: 0.306±0.006
- perlayer_vector α=0.75: 0.304±0.006 (-0.2%, not significant)
- random_trainable α=0.75: 0.313±0.007 (+0.7%, marginal, error bars overlap)
Single-seed gains (+1.1% perlayer_vector, +0.8% random_trainable) do not robustly replicate.
The scaffold mechanism provides at best a marginal, statistically uncertain benefit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
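The "error bars overlap" reading can be checked directly from the reported means and standard deviations. The sketch below uses the three-seed numbers from this message and a crude mean ± 1 std interval comparison; it is only a quick sanity check under that assumption, not a significance test, and the helper names are illustrative.

```python
def interval(mean, std):
    """Return the (low, high) bounds of mean +/- one standard deviation."""
    return (mean - std, mean + std)


def intervals_overlap(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# 3-seed accuracies (mean, std) as reported in the commit message.
results = {
    "DFA": (0.306, 0.006),
    "perlayer_vector": (0.304, 0.006),
    "random_trainable": (0.313, 0.007),
}

baseline_mean, baseline_std = results["DFA"]
baseline = interval(baseline_mean, baseline_std)

for name, (mean, std) in results.items():
    if name == "DFA":
        continue
    delta = mean - baseline_mean
    overlap = intervals_overlap(baseline, interval(mean, std))
    print(f"{name}: delta={delta:+.3f}, overlaps DFA interval: {overlap}")
```

With only three seeds, overlapping one-std intervals are weak evidence either way, which matches the cautious conclusion in the message.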
|
|
cause;
alpha sweep shows perlayer_vector at alpha=0.75 matches full network
10A.8A: freeze_decay_to_000 recovers to 28.5% (vs 14.6% fixed freeze) — stale
high-weight aux is the primary cause of freeze crashes. But 28.5% < DFA 31.2%
confirms continuous trainability adds ~2.7% independent value.
10A.8B: Both perlayer_vector and random_trainable optimal at alpha=0.75.
perlayer_vector +1.1% vs random_trainable +0.8% — per-layer vector is
the minimal sufficient scaffold, no network needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
essential
8-branch dissection:
- zero_target + normmatched both crash: non-zero direction necessary, not norm
- perlayer_vector: +0.7% (per-block trainable vector works, network not required)
- freeze_after_{1,5,10}: ALL crash to ~13-14% (continuous trainability essential)
- random_trainable: +1.0% (reference)
Minimal mechanism: continuously trainable, non-zero, depth-aware auxiliary perturbation.
Freezing at ANY point destroys the benefit entirely.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
9-branch dissection results:
- zero_target crashes (-9.1%): aux must output non-zero
- constant_input neutral (+0.0%): needs at least depth info
- time_only works (+1.0%): h_l not needed, just depth index
- shuffled/fresh_random work (+1.3-1.4%): no semantic content needed
- prefit60_trainable ≈ random_trainable: prefit adds nothing
- All frozen branches crash: trainability is essential
Mechanism: depth-aware trainable auxiliary perturbation that diversifies
block-local updates. Not semantic credit, not pure trainability.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Dissection of 6 branches from same DFA checkpoint:
- blend_random_frozen: 12.6% (CATASTROPHIC — frozen noise destroys training)
- blend_random_trainable: 32.2% (+1.2% — trainable network helps)
- blend_shuffled_trainable: 32.5% (+1.4% — even wrong targets work!)
- blend_gaussian_noise: 30.8% (neutral)
- scaled_DFA_norm_match: 31.0% (neutral)
The gain comes from implicit regularization through a co-optimized auxiliary
network, NOT from learned credit quality. Phase 9A's +1.5% was an optimization
dynamics effect, not evidence of useful credit assignment.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
E_prefit=0 (random Vec) + blend(0.75): 32.4% vs DFA 31.1% (+1.3%)
E_prefit=15: 32.3% (+1.2%)
E_prefit=60: 32.5% (+1.4%)
Frozen Gamma/rho near zero at all prefit levels. The Phase 9A success was NOT
from Vec learning useful credit — it was from the blend mechanism itself providing
regularization/diversification over pure DFA.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Phase 9B (periodic refit K=5 R=1 alpha=0.75): 14.0% — Vec starts random,
periodic refits insufficient without offline pretraining.
Phase 9C (top-down curriculum): last1_vec=30.8%, last2_vec=31.1% vs DFA=31.2%.
Near-neutral. Cold-start problem persists even for single-block Vec.
Only Phase 9A's offline prefit + blend handoff (+1.5%) works.
The key ingredient is offline Vec training on frozen checkpoint features.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
First positive online result: 50% blend of offline-fitted Vec + DFA gives 31.7%
vs 31.1% for pure DFA (+0.55%). This is Case B: pure Vec handoff fails (-1.1%)
but blend works because DFA stabilizes trajectory while Vec adds directional credit.
Offline-fitted Vec at DFA epoch-5 checkpoint: Gamma=0.229, rho=0.262.
Cold-start confirmed as main bottleneck — Vec IS useful on DFA trajectory features.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
bottleneck
Vec_only_from_0: 15.4% (cold-start failure, can't learn credit on random features)
DFA_only: 31.2% (remains best non-BP method)
DFA_then_Vec_T20: 12.9% (switching to Vec destroys DFA-built features)
Vec_T5_then_DFA: 26.6% (partial recovery but still worse than pure DFA)
Phase 7A's early-window finding doesn't transfer: it required offline-trained Vec
on frozen features. Online Vec estimator faces cold-start paradox — needs structured
features to learn credit, but structured features need good credit to form.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
held-out transfer
At epoch 5 (acc=49%), Vec_M4 5-step: dL_held=-0.005 (PUR=0.70)
Oracle BP 5-step: dL_held=-0.009 (PUR=1.05)
DFA 5-step: dL_held=+0.003 (always hurts held-out)
By epoch 20, generalization window closes. Held-out failure is late-snapshot artifact.
Better credit → lower update variance (Vec=0.8 vs DFA=40), not higher.
Key implication: DFA warmup delays credit bridge past its useful window.
Credit should be used from epoch 0, not after 20% warmup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Phase 6A's "better credit → worse loss" was a protocol artifact caused by:
1. Credit normalization (inflated DFA, suppressed Vec magnitude ordering)
2. Held-out evaluation (measured generalization failure, not exploitability)
3. Gradient clamping
With strict same-batch evaluation:
- Oracle BP: dL_same = -0.406 (strongest descent)
- Vec_M4: dL_same = -0.135
- ScalarCB: dL_same = -0.025
- DFA: dL_same = -0.003
Same-batch loss decrease is MONOTONIC with credit quality.
But held-out loss INCREASES for all non-DFA methods (Case D: overfitting).
The bottleneck is batch-level generalization, not surrogate exploitability.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Phase 6A: Better credit is ANTI-CORRELATED with loss decrease on fixed snapshot.
DFA (Gamma=0.01) → dL=-0.0001 (only method that decreases loss)
Vec_M4 (Gamma=0.38) → dL=+0.057 (increases loss most)
Oracle BP (Gamma=1.0) → dL=+0.011 (still increases loss)
Phase 6C: Target-shift rule reduces damage but cannot make non-DFA credits productive.
The inner-product surrogate <F_l(h), a_l> is fundamentally mismatched with directional credit.
Conclusion: Case B — the primary bottleneck is the local update paradigm itself,
not the credit estimator quality or tracking/co-adaptation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Phase 5A: Audit passes — shuffle control collapses, gains are real
Phase 5B: Transfer SUCCESS — vec_M4 beats scalar CB by +0.25 Gamma, +0.31 rho on frozen CIFAR
Phase 5C: Online FAILURE — vec does worse than scalar CB online despite better frozen credit
Core finding: bottleneck is in local surrogate / co-adaptation, not estimator quality
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
scan, vector field pilot
Key findings:
- Frozen CIFAR: estimators CAN recover credit (SB best, CB 20x > DFA)
- Online shallow: cb_eT wr=0.2 tgw=1.0 achieves S1>0, S2 marginal
- Vector credit field: 0.91-0.96 Gamma/rho on synthetic (vs 0.34 scalar CB)
- Direct vector field avoids scalar V curvature problem entirely
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
- CIFAR deltaL: s=grad_hL CE (dim=512) -> acc=17.2%, Gamma≈0
Confirms scalar value field has dimensionality bottleneck on CIFAR
- Pivot memo: direct vector credit field a_phi(h,t,s) -> R^d
Trained with perturbation-based target, avoids curvature problem
Still satisfies no hidden BP anchor constraint
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
CIFAR gap
Three new plots:
- cifar_depth_scan.png: acc/Gamma/rho vs depth for all methods
- boundary_ablation.png: s_type, tgw, warmup ratio sweeps
- synth_vs_cifar.png: dimensionality gap comparison (d=128 vs d=512)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Key findings:
- deltaL (output-layer gradient) gives best Gamma (0.562 vs 0.452 for eT)
- Concatenating h_L to s destroys credit quality (value net cheats)
- Terminal gradient matching is monotonically beneficial
- Best config: deltaL + tgw=1.0 + wr=0.05 -> Gamma=0.768, rho=0.691
- CIFAR depth scan shows no Goldilocks regime (dimensionality bottleneck)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
scan
- synth_nonlinearity_ladder.py: teacher-student with phi_alpha(z) = (1-a)z + a*tanh(z)
Sweeps alpha x depth to find where state bridge / credit bridge fail
- cifar_depth_scan.py: CIFAR-10 with L={2,4,6,8,12}, d={256,512}
Finds Goldilocks regime for credit bridge vs DFA
- plot_synth_ladder.py: phase diagram visualization
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
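The activation phi_alpha(z) = (1-a)z + a*tanh(z) named above interpolates between a linear map (a=0) and tanh (a=1). A minimal scalar sketch follows; the function name and the swept alpha values are illustrative assumptions, since the actual grid used by synth_nonlinearity_ladder.py is not listed in this message.

```python
import math


def phi_alpha(z, alpha):
    """Blend of identity and tanh: (1 - alpha) * z + alpha * tanh(z)."""
    return (1.0 - alpha) * z + alpha * math.tanh(z)


# Illustrative alpha values only; not the sweep used in the experiments.
for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha}: phi(2.0)={phi_alpha(2.0, alpha):.4f}")
```

At alpha=0 the teacher-student system is purely linear, so raising alpha is a way to dial in nonlinearity gradually.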
|
|
All experiments complete:
- Toy LQ: credit bridge matches state bridge (~0.94 costate cosine)
- CIFAR-10: credit bridge (29.6%) comparable to DFA (30.0%), both beat state bridge (18.5%)
- State bridge confirms core hypothesis: perfect state prediction != useful credit
- Terminal gradient matching is essential for credit bridge
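The "costate cosine" figure is presumably a cosine similarity between an estimated credit vector and the true costate (gradient) at a hidden layer. The message does not say how it is aggregated across samples or layers, so the sketch below only shows the per-vector cosine, with hypothetical names and example values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two same-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical estimated credit vs. true costate for one sample.
estimated = [0.9, 0.1, -0.2]
true_costate = [1.0, 0.0, -0.3]
print(f"cosine={cosine_similarity(estimated, true_costate):.3f}")
```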
|
|
Reason: toy used raw MSE, CIFAR used normalized. They must be the same method
for consistent reporting. Normalized MSE is more robust to varying h_L magnitudes.
|
|
Debug phase. Toy LQ experiments (3 seeds) complete with terminal gradient matching.
Credit bridge matches state bridge on linear system (~0.94 cosine).
CIFAR experiments in progress.
|