| Age | Commit message | Author |
|
NOTE.md: added comprehensive current-status section at the top with
the full 6-method audit table (BP/FA/EP/DFA/CB/SB), FA vs DFA key
comparison, depth sweep, penalty rescue comparison, cross-method
functional triangulation, and open items. Old Phase 10A content kept
below as historical reference.
EVIDENCE_SUMMARY.md: added "Vanilla FA vs DFA" section with the
paper-changing finding (FA 0.401 ± 0.009 vs DFA 0.306 ± 0.008,
FA has genuine deep cos +0.33, no Mode 1(b) collapse) and the
d=512 depth sweep table.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
3-seed results (mean±std):
- DFA: 0.306±0.006
- perlayer_vector α=0.75: 0.304±0.006 (-0.2%, not significant)
- random_trainable α=0.75: 0.313±0.007 (+0.7%, marginal, error bars overlap)
Single-seed gains (+1.1% perlayer_vector, +0.8% random_trainable) do not robustly replicate.
The scaffold mechanism provides at best a marginal, statistically uncertain benefit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
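The "error bars overlap" reading of the 3-seed numbers above can be checked with a minimal sketch like the following. It uses only the means and standard deviations reported in this message. The helper names are hypothetical, and treating overlapping ±1 std intervals as "not clearly separated" is a rough heuristic assumed here, not a formal significance test and not necessarily the check used in the repository.

```python
def one_sigma_interval(mean, std):
    """Return the (low, high) interval mean ± std."""
    return mean - std, mean + std


def intervals_overlap(a, b):
    """True if two (mean, std) results have overlapping ±1 std intervals."""
    a_lo, a_hi = one_sigma_interval(*a)
    b_lo, b_hi = one_sigma_interval(*b)
    return a_lo <= b_hi and b_lo <= a_hi


# Reported 3-seed results as (mean, std), copied from the commit message.
dfa = (0.306, 0.006)
perlayer_vector = (0.304, 0.006)
random_trainable = (0.313, 0.007)

for name, result in [
    ("perlayer_vector", perlayer_vector),
    ("random_trainable", random_trainable),
]:
    print(name, "overlaps DFA:", intervals_overlap(result, dfa))
```

With only three seeds per condition, a paired comparison on per-seed accuracies would be more informative than interval overlap, but the per-seed values are not listed in this log.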
|
|
cause;
alpha sweep shows perlayer_vector at alpha=0.75 matches full network
10A.8A: freeze_decay_to_000 recovers to 28.5% (vs 14.6% fixed freeze) — stale
high-weight aux is the primary cause of freeze crashes. But 28.5% < DFA 31.2%
confirms continuous trainability adds ~2.7% independent value.
10A.8B: Both perlayer_vector and random_trainable optimal at alpha=0.75.
perlayer_vector +1.1% vs random_trainable +0.8% — per-layer vector is
the minimal sufficient scaffold, no network needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
essential
8-branch dissection:
- zero_target + normmatched both crash: non-zero direction necessary, not norm
- perlayer_vector: +0.7% (per-block trainable vector works, network not required)
- freeze_after_{1,5,10}: ALL crash to ~13-14% (continuous trainability essential)
- random_trainable: +1.0% (reference)
Minimal mechanism: continuously trainable, non-zero, depth-aware auxiliary perturbation.
Freezing at ANY point destroys the benefit entirely.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
9-branch dissection results:
- zero_target crashes (-9.1%): aux must output non-zero
- constant_input neutral (+0.0%): needs at least depth info
- time_only works (+1.0%): h_l not needed, just depth index
- shuffled/fresh_random work (+1.3-1.4%): no semantic content needed
- prefit60_trainable ≈ random_trainable: prefit adds nothing
- All frozen branches crash: trainability is essential
Mechanism: depth-aware trainable auxiliary perturbation that diversifies
block-local updates. Not semantic credit, not pure trainability.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Dissection of 6 branches from same DFA checkpoint:
- blend_random_frozen: 12.6% (CATASTROPHIC — frozen noise destroys training)
- blend_random_trainable: 32.2% (+1.2% — trainable network helps)
- blend_shuffled_trainable: 32.5% (+1.4% — even wrong targets work!)
- blend_gaussian_noise: 30.8% (neutral)
- scaled_DFA_norm_match: 31.0% (neutral)
The gain comes from implicit regularization through a co-optimized auxiliary
network, NOT from learned credit quality. Phase 9A's +1.5% was an optimization
dynamics effect, not evidence of useful credit assignment.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
E_prefit=0 (random Vec) + blend(0.75): 32.4% vs DFA 31.1% (+1.3%)
E_prefit=15: 32.3% (+1.2%)
E_prefit=60: 32.5% (+1.4%)
Frozen Gamma/rho near zero at all prefit levels. The Phase 9A success was NOT
from Vec learning useful credit — it was from the blend mechanism itself providing
regularization/diversification over pure DFA.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Phase 9B (periodic refit K=5 R=1 alpha=0.75): 14.0% — Vec starts random,
periodic refits insufficient without offline pretraining.
Phase 9C (top-down curriculum): last1_vec=30.8%, last2_vec=31.1% vs DFA=31.2%.
Near-neutral. Cold-start problem persists even for single-block Vec.
Only Phase 9A's offline prefit + blend handoff (+1.5%) works.
The key ingredient is offline Vec training on frozen checkpoint features.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Best configs (seed=42):
- t0=5, blend_075 (75%Vec+25%DFA): 32.6% vs DFA 31.0% (+1.5%)
- t0=10, blend_075: 32.5% vs 31.0% (+1.4%)
- t0=1, blend_05: 31.9% vs 31.0% (+0.9%)
Higher Vec fraction (0.75) consistently outperforms lower (0.25, 0.5) at t0>=5.
Pure Vec handoff still fails at all checkpoints.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
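The blend_075 configuration above (75% Vec + 25% DFA) can be read as a convex combination of two credit signals. A minimal sketch under that assumption follows. The function name, the list-based representation, and the example values are hypothetical, and the repository's blend may additionally rescale or normalize the two signals.

```python
def blend_credit(vec_credit, dfa_credit, alpha=0.75):
    """Convex blend: alpha * Vec credit + (1 - alpha) * DFA credit, elementwise."""
    if len(vec_credit) != len(dfa_credit):
        raise ValueError("credit vectors must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * v + (1.0 - alpha) * d for v, d in zip(vec_credit, dfa_credit)]


# Hypothetical per-unit credit signals for one block (illustrative values only).
vec = [1.0, -2.0, 0.0]
dfa = [0.0, 2.0, 4.0]

blended = blend_credit(vec, dfa, alpha=0.75)  # blend_075: 75% Vec + 25% DFA
print(blended)
```

In this form, alpha=1.0 reduces to a pure Vec handoff and alpha=0.0 to pure DFA, which matches the endpoints compared in the message.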
|
|
First positive online result: 50% blend of offline-fitted Vec + DFA gives 31.7%
vs 31.1% for pure DFA (+0.55%). This is Case B: pure Vec handoff fails (-1.1%)
but blend works because DFA stabilizes trajectory while Vec adds directional credit.
Offline-fitted Vec at DFA epoch-5 checkpoint: Gamma=0.229, rho=0.262.
Cold-start confirmed as main bottleneck — Vec IS useful on DFA trajectory features.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
bottleneck
Vec_only_from_0: 15.4% (cold-start failure, can't learn credit on random features)
DFA_only: 31.2% (remains best non-BP method)
DFA_then_Vec_T20: 12.9% (switching to Vec destroys DFA-built features)
Vec_T5_then_DFA: 26.6% (partial recovery but still worse than pure DFA)
Phase 7A's early-window finding doesn't transfer: it required offline-trained Vec
on frozen features. Online Vec estimator faces cold-start paradox — needs structured
features to learn credit, but structured features need good credit to form.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
held-out transfer
At epoch 5 (acc=49%), Vec_M4 5-step: dL_held=-0.005 (PUR=0.70)
Oracle BP 5-step: dL_held=-0.009 (PUR=1.05)
DFA 5-step: dL_held=+0.003 (always hurts held-out)
By epoch 20, the generalization window closes. Held-out failure is a late-snapshot artifact.
Better credit → lower update variance (Vec=0.8 vs DFA=40), not higher.
Key implication: DFA warmup delays credit bridge past its useful window.
Credit should be used from epoch 0, not after 20% warmup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Phase 6A's "better credit → worse loss" was a protocol artifact caused by:
1. Credit normalization (inflated DFA, suppressed Vec magnitude ordering)
2. Held-out evaluation (measured generalization failure, not exploitability)
3. Gradient clamping
With strict same-batch evaluation:
- Oracle BP: dL_same = -0.406 (strongest descent)
- Vec_M4: dL_same = -0.135
- ScalarCB: dL_same = -0.025
- DFA: dL_same = -0.003
Same-batch loss decrease is MONOTONIC with credit quality.
But held-out loss INCREASES for all non-DFA methods (Case D: overfitting).
The bottleneck is batch-level generalization, not surrogate exploitability.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
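The distinction between dL_same and dL_held above can be illustrated with a toy sketch: take a few gradient steps on one batch, then measure the loss change on that same batch and on a separate held-out batch. Everything here is an assumption for illustration (a scalar linear model, mean squared error, hypothetical function names, and two batches deliberately constructed to disagree). It is not the repository's evaluation protocol.

```python
def mse(w, batch):
    """Mean squared error of the scalar model y ~ w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)


def loss_deltas(w, same_batch, held_batch, lr=0.05, steps=5):
    """Take `steps` gradient steps on same_batch; return (dL_same, dL_held)."""
    w_new = w
    for _ in range(steps):
        # Gradient of the MSE with respect to w on the training batch.
        grad = sum(2.0 * (w_new * x - y) * x for x, y in same_batch) / len(same_batch)
        w_new -= lr * grad
    dl_same = mse(w_new, same_batch) - mse(w, same_batch)
    dl_held = mse(w_new, held_batch) - mse(w, held_batch)
    return dl_same, dl_held


# Hypothetical toy data: the two batches prefer different slopes on purpose.
same_batch = [(1.0, 2.0), (2.0, 4.0)]    # best fit is w = 2
held_batch = [(1.0, -1.0), (2.0, -2.0)]  # best fit is w = -1

dl_same, dl_held = loss_deltas(0.0, same_batch, held_batch)
print("dL_same:", dl_same, "dL_held:", dl_held)
```

Because the batches disagree by construction, the update lowers the same-batch loss while raising the held-out loss, which is the qualitative pattern the message labels Case D.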
|
|
Phase 6A: Better credit is ANTI-CORRELATED with loss decrease on fixed snapshot.
DFA (Gamma=0.01) → dL=-0.0001 (only method that decreases loss)
Vec_M4 (Gamma=0.38) → dL=+0.057 (increases loss most)
Oracle BP (Gamma=1.0) → dL=+0.011 (still increases loss)
Phase 6C: Target-shift rule reduces damage but cannot make non-DFA credits productive.
The inner-product surrogate <F_l(h), a_l> is fundamentally mismatched with directional credit.
Conclusion: Case B — the primary bottleneck is the local update paradigm itself,
not the credit estimator quality or tracking/co-adaptation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Phase 5A: Audit passes — shuffle control collapses, gains are real
Phase 5B: Transfer SUCCESS — vec_M4 beats scalar CB by +0.25 Gamma, +0.31 rho on frozen CIFAR
Phase 5C: Online FAILURE — vec does worse than scalar CB online despite better frozen credit
Core finding: bottleneck is in local surrogate / co-adaptation, not estimator quality
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
scan, vector field pilot
Key findings:
- Frozen CIFAR: estimators CAN recover credit (SB best, CB 20x > DFA)
- Online shallow: cb_eT wr=0.2 tgw=1.0 achieves S1>0, S2 marginal
- Vector credit field: 0.91-0.96 Gamma/rho on synthetic (vs 0.34 scalar CB)
- Direct vector field avoids scalar V curvature problem entirely
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
- CIFAR deltaL: s=grad_hL CE (dim=512) -> acc=17.2%, Gamma≈0
Confirms scalar value field has dimensionality bottleneck on CIFAR
- Pivot memo: direct vector credit field a_phi(h,t,s) -> R^d
Trained with perturbation-based target, avoids curvature problem
Still satisfies no hidden BP anchor constraint
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Key findings:
- deltaL (output-layer gradient) gives best Gamma (0.562 vs 0.452 for eT)
- Concatenating h_L to s destroys credit quality (value net cheats)
- Terminal gradient matching is monotonically beneficial
- Best config: deltaL + tgw=1.0 + wr=0.05 -> Gamma=0.768, rho=0.691
- CIFAR depth scan shows no Goldilocks regime (dimensionality bottleneck)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Key finding: credit bridge advantage scales with nonlinearity.
At alpha=1.0 (full tanh), CB > SB > DFA on both Gamma and rho at all depths.
The crossover where CB surpasses SB happens around alpha=0.7-1.0.
Full 4x4x3 grid complete with 3 seeds each.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
All experiments complete:
- Toy LQ: credit bridge matches state bridge (~0.94 costate cosine)
- CIFAR-10: credit bridge (29.6%) comparable to DFA (30.0%), both beat state bridge (18.5%)
- State bridge confirms core hypothesis: perfect state prediction != useful credit
- Terminal gradient matching is essential for credit bridge
|
|
Track experiment phases (debug/pilot/frozen), key findings, and design decisions.
|
|
Debug phase. Toy LQ experiments (3 seeds) complete with terminal gradient matching.
Credit bridge matches state bridge on linear system (~0.94 cosine).
CIFAR experiments in progress.
|