# Experiment Notes

## Current Status (2026-04-23): Paper v2.39 + FA experiments complete

Paper: `paper/main.tex` (v2.39, commit `bcfec1d`). NeurIPS 2026 E&D track.

FA implementation + results: commit `88ff85c`.

### Paper structure (v2.39)

7 sections with 14 subsections. Key sections:

- §3 Mode 1: Measurement Degeneracy (residual-stream growth + LN gradient collapse)
- §4 Mode 2: Credit Quality (the paper's main contribution)
- §4.2 Functional triangulation: 3 metrics (accuracy, nudging, training-loss) rank SB ≫ CB ≈ DFA; only deep cos disagrees
- §4.3 Mode 2 → Mode 1 causal hypothesis: poor credit quality drives activation growth
- §5 Intervention + cross-arch evidence
- §6 4-diagnostic evaluation protocol

### 6-method audit (4-block d=256 pre-LN ResMLP, CIFAR-10, 100ep, 3 seeds)

| Method | Test Acc | Deep Cos | ‖h_L‖ | ‖g_L‖ | Mode 1(b)? |
|--------|----------|----------|--------|--------|------------|
| **BP** | 0.615 ± 0.004 | ≈ 1.0 | ~200 | ~4e-4 | no |
| **FA** (NEW) | **0.401 ± 0.009** | **+0.33** | **~1e5** | **~1e-6** | **no** |
| EP | 0.316 ± 0.037 | 0.008 | ~3e3 | ~1e-4 | no |
| DFA | 0.306 ± 0.008 | ~0 | ~5e8 | ~4e-10 | **YES** |
| CB | 0.289 ± 0.031 | 0.07 | ~7e7 | ~2e-9 | **YES** |
| SB | 0.205 ± 0.039 | 0.005 | ~2e8 | ~2e-9 | **YES** |
| Frozen baseline | 0.349 ± 0.003 | n/a | n/a | n/a | n/a |

### FA vs DFA: the key comparison

**Same local loss form** ⟨f_l, a_l⟩, same architecture, same optimizer. **Only difference**: how a_l is computed.

- DFA: a_l = B_l^T @ e_T (direct projection, d×C random matrix)
- FA: a_l = B_l @ a_{l+1} (sequential backward, d×d random matrix, starts from exact ∂CE/∂h_L)

| | FA | DFA |
|---|---|---|
| Test acc | **0.401** (+5.2 pp above frozen) | 0.306 (-4.3 pp below frozen) |
| Deep cos 100ep | **+0.33** (genuine) | ~0 (degenerate) |
| Deep cos 5ep | **+0.32** (from the start) | -0.008 (null) |
| ‖h_L‖ | ~10⁵ | ~5×10⁸ (3 OOM larger) |
| ‖g_L‖ | ~10⁻⁶ (meaningful) | ~10⁻¹⁰ (floor) |
| Mode 1(b) fires? | **NO** | YES |
| FA random-target h_L | ~1.3e5 | ~1.7e8 (DFA) |

**Interpretation**: FA's sequential propagation preserves enough credit-direction quality to prevent the catastrophic activation growth that DFA exhibits. This is the **strongest empirical support for the Mode 2 → Mode 1 causal hypothesis**: better credit → less growth → no gradient-floor collapse.

### FA depth sweep (d=512, 100ep, s42)

| L | FA acc | DFA acc | FA deep cos | DFA deep cos |
|---|--------|---------|-------------|--------------|
| 2 | 0.350 | n/a | +0.96 | n/a |
| 4 | 0.424 | n/a | +0.29 | n/a |
| 6 | 0.401 | n/a | +0.16 | n/a |
| 8 | 0.409 | 0.306 | +0.11 | -0.0001 |
| 12 | 0.404 | 0.309 | +0.09 | -0.0001 |

FA deep cos decreases with depth but stays positive. DFA is ~0 everywhere.

### Penalty rescue comparison (30ep, d=256, 3 seeds)

| | FA no-pen | FA+pen λ=1e-2 | DFA no-pen | DFA+pen λ=1e-2 |
|---|---|---|---|---|
| acc | 0.372 ± 0.007 | 0.369 ± 0.003 | 0.301 ± 0.006 | 0.360 ± 0.002 |

Penalty barely helps FA (it doesn't need rescue) but substantially helps DFA (+5.9 pp).

### Cross-method functional triangulation (§4.2, penalty-rescued d=256 30ep)

| Metric | SB+pen | CB+pen | DFA+pen | Ordering |
|--------|--------|--------|---------|----------|
| Accuracy | **0.453** | 0.360 | 0.360 | SB ≫ CB ≈ DFA |
| Nudging (loss Δ) | **-1.93e-3** | -4.26e-4 | -4.98e-5 | SB ≫ CB ≈ DFA |
| Training loss decrease | **-0.447** | -0.121 | -0.095 | SB ≫ CB ≈ DFA |
| Deep cos | +0.322 | **+0.679** | +0.151 | CB > SB > DFA |

3 functional metrics agree SB ≫ CB ≈ DFA. Deep cos is the ONLY one that disagrees.

### Auditable JSON sources

All paper numbers derive from saved files in `results/`. See `reproduce_all.ipynb` for the complete source map.
Key files:

- `results/protocol_audit/audit_table_s42_s123_s456.json`: Table 1
- `results/fa_main_audit/results_cifar10.json`: FA main results
- `results/nudging_test_3seed_summary.json`: §4.2 nudging
- `results/training_loss_decrease_3seed.json`: §4.2 loss trajectory
- `results/matched_30ep_control_summary.json`: §5.2 BP+DFA controls

All ± values use ddof=1 (sample std with Bessel correction) as of v2.38.

### Open items

1. **FA results not yet in the paper**: need to add FA as a 6th method row in Table 1 and discuss in §3-§5.
2. **FA on ViT-Mini / CNN**: experiment scripts (`snapshot_evolution_vit.py`, `cnn_baseline.py`) don't yet support FA. Need implementation.
3. **§7 falsification test**: the FA result itself is a strong empirical test of Mode 2 → Mode 1. Could frame FA as the "better credit quality at fixed ‖f‖" control proposed in §7.2.
4. **Codex restructure** (v2.39): applied codex's subsection proposal. 14 subsections across 7 sections.

---

## Historical Notes (pre-v2.xx, kept for reference)

## Experiment Phases

- **debug**: Initial implementation, rapid iteration (commits ce24e36)
- **pilot**: Controlled iteration (commits 0b9ebb2, 7baf7ae)
- **frozen**: Code at commit 0b9ebb2 for all reported results

## Status: PHASE 10A.6 - GAIN REQUIRES TRAINABLE DEPTH-AWARE AUX, NOT SEMANTIC CREDIT

---

## Final Results Summary

### Toy LQ (3 seeds, 8000 steps)

| Method | Costate Cosine | ρ | Nudging |
|--------|---------------|---|---------|
| DFA | 0.001±0.003 | 0.001±0.007 | 0.000±0.001 |
| State Bridge | 0.945±0.002 | 0.931±0.003 | -0.344±0.019 |
| Credit Bridge | 0.944±0.001 | 0.930±0.002 | -0.342±0.019 |

### CIFAR-10 (3 seeds, 100 epochs)

| Method | Test Accuracy |
|--------|:------------:|
| BP | 59.2%±0.4% |
| DFA | 30.0%±0.3% |
| Credit Bridge | 29.6%±1.0% |
| State Bridge | 18.5%±1.8% |

### CIFAR-10 Diagnostics (seed 42)

| Method | BP Cosine | ρ | Nudge |
|--------|-----------|---|-------|
| BP | 0.940 | 0.990 | -0.027 |
| Credit Bridge | 0.056 | ~0 | ~0 |
| DFA | 0.030 | 0.005 | ~0 |
| State Bridge | 0.021 | 0.004 | ~0 |

---

## Key Findings

1. **Terminal gradient matching is essential** for credit bridge. Without it, V learns correct values but uninformative gradients (cos → 0.03). With it, credit bridge matches state bridge on toy (~0.94 cosine).
2. **State bridge fails on nonlinear systems** despite near-perfect state prediction. State prediction error → 0.0000 but test accuracy = 18.5% (worst of all methods). This confirms the core hypothesis: bridging state ≠ bridging credit.
3. **Credit bridge modestly outperforms DFA in BP cosine** (0.056 vs 0.030, ~2x) but accuracy is comparable (29.6% vs 30.0%).
4. **All non-BP methods struggle** on the deep 12-block MLP architecture. The gap to BP (59.2%) is large for all methods.

---

## Changes Log

- `ce24e36`: Initial implementation
- `0b9ebb2`: Sync state bridge to use normalized MSE in both toy and CIFAR
- `7baf7ae`: Add experiment notes and .gitignore

## Experiment IDs

- `toy_lq_frozen/`: Final toy results (3 seeds, synced state bridge)
- `cifar10/`, `cifar10_seed123/`, `cifar10_seed456/`: Final CIFAR results
- `toy_lq/`: Debug-phase toy results (raw state bridge, for ablation)
- `smoke_test/`, `smoke_test2/`: FashionMNIST debug runs

## Design Decisions

1. Terminal gradient matching (term_grad_weight=1.0): output-layer-local, not hidden BP
2. DFA warmup for credit bridge (20% of epochs): prevents value net bootstrap failure
3. Normalized MSE for state bridge: numerical stability
4. Credit normalization: a_norm = a / (RMS(a) + 1e-6)

---

## Phase 2: Explore (commit 2403960+)

### Synthetic Nonlinearity Ladder (Phase 1 of explore)

**Setup**: Teacher-student with `phi_alpha(z) = (1-alpha)*z + alpha*tanh(z)`

- alpha in {0, 0.25, 0.5, 1.0}, L in {2, 4, 8, 12}
- d=128, C=10, 80 epochs, 3 seeds

**Critical Finding**: Credit bridge advantage scales with nonlinearity.
At alpha=1.0 (full tanh), credit bridge is the BEST method on Gamma and rho at ALL depths:

| L | DFA Gamma | SB Gamma | CB Gamma | DFA rho | SB rho | CB rho |
|---|-----------|----------|----------|---------|--------|--------|
| 2 | 0.03 | 0.52 | **0.53** | 0.03 | 0.47 | **0.57** |
| 4 | 0.05 | 0.34 | **0.45** | 0.06 | 0.32 | **0.51** |
| 8 | 0.06 | 0.25 | **0.36** | 0.07 | 0.23 | **0.42** |
| 12 | 0.07 | 0.22 | **0.24** | 0.07 | 0.21 | **0.32** |

At alpha=0.5 (moderate nonlinearity), SB still wins on Gamma but CB wins on rho at L=4. At alpha=0 (linear), SB dominates.

**Interpretation**: State bridge fails via Jacobian mismatch, not value prediction error. Credit bridge avoids this by learning value field gradients directly. The crossover happens around alpha=0.7-1.0.

### CIFAR-10 Depth Scan (Phase 2 of explore, in progress)

Sweep L={2,4,6,8,12}, d=512, 100 epochs on CIFAR-10.

Preliminary results (L=2,4, seed=42):

| L | Method | Acc | Gamma | rho |
|---|--------|-----|-------|-----|
| 2 | DFA | 0.312 | 0.196 | 0.001 |
| 2 | CB | 0.311 | 0.175 | **0.031** |
| 4 | DFA | 0.314 | 0.100 | 0.003 |
| 4 | CB | 0.298 | 0.123 | -0.002 |

CIFAR is much harder -- rho signal is very weak for all non-BP methods.
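
Returning to the synthetic ladder setup above: the interpolated activation and the alpha × depth grid can be sketched in a few lines. This is a minimal illustration only. The formula for `phi_alpha` and the grid values come from the Setup notes, while the helper name `ladder_grid` is hypothetical and not taken from the experiment scripts.

```python
import math
from itertools import product


def phi_alpha(z: float, alpha: float) -> float:
    """Interpolated activation: identity at alpha=0, pure tanh at alpha=1."""
    return (1.0 - alpha) * z + alpha * math.tanh(z)


def ladder_grid(alphas=(0.0, 0.25, 0.5, 1.0), depths=(2, 4, 8, 12)):
    """Hypothetical helper: enumerate the (alpha, L) cells of the ladder."""
    return list(product(alphas, depths))


if __name__ == "__main__":
    for alpha in (0.0, 0.25, 0.5, 1.0):
        # The gap to the identity grows with alpha for a fixed input.
        print(alpha, round(2.0 - phi_alpha(2.0, alpha), 4))
    print(len(ladder_grid()), "ladder cells per seed")
```

At alpha=0 the student and teacher blocks are linear, which matches the regime where SB dominates; alpha=1.0 is the full-tanh regime reported in the table above.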
### Changes Log (explore phase)

- `2403960`: Add synthetic ladder and CIFAR depth scan experiments
- Student blocks now use pre-LayerNorm for stability (fixes L>=8 blowup)
- Added gradient clipping to block updates

### Experiment IDs (explore phase)

- `synth_ladder_smoke/`: Initial 3-alpha x 2-depth smoke test
- `synth_ladder_v2_lo/`: Full alpha=0,0.25 x L=2,4,8,12 x 3 seeds
- `synth_ladder_v2_hi/`: Full alpha=0.5,1.0 x L=2,4,8,12 x 3 seeds
- `cifar_depth_scan_s42/`: CIFAR L=2,4,6,8,12 x d=512 x seed=42 (COMPLETE)
- `boundary_ablation_s_sweep/`: s_type in {eT, deltaL, eT_hL, deltaL_hL}
- `boundary_ablation_tgw_sweep/`: tgw in {0, 0.25, 1.0, 4.0}
- `boundary_ablation_wr_sweep/`: warmup ratio in {0, 0.05, 0.2, 0.5}
- `boundary_ablation_s123/`, `boundary_ablation_s456/`: s_type sweep with seeds 123, 456
- `boundary_ablation_deltaL_wr/`: deltaL with warmup ratio sweep

### Phase 3 Results: Boundary-Condition Ablation

At alpha=1.0, L=4 (best synthetic regime), 3 seeds:

**s_type (conditioning code):**

| Code | Gamma | rho | Acc |
|------|-------|-----|-----|
| eT (dim=10) | 0.452+/-0.042 | 0.509+/-0.033 | 0.523 |
| deltaL (dim=d) | **0.562+/-0.007** | **0.510+/-0.014** | 0.448 |
| eT+proj(h_L) | 0.002 | 0.016 | 0.559 |
| deltaL+proj(h_L) | 0.018 | 0.026 | 0.564 |

**deltaL gives best Gamma. Concatenating h_L destroys credit quality (value net cheats).**

**Terminal gradient matching weight:** tgw=0 -> Gamma=0.12; tgw=1 -> Gamma=0.46; tgw=4 -> Gamma=0.57 (but acc drops). Terminal gradient matching is monotonically beneficial for credit quality.

**Warmup ratio:** wr=0 -> best Gamma (0.68) but worst acc (0.46). wr=0.5 -> worst Gamma (0.23) but best acc (0.66). Clear tradeoff between credit quality and accuracy.

Best single config: deltaL + tgw=1.0 + wr=0.05 -> **Gamma=0.768, rho=0.691**

### CIFAR deltaL Test

deltaL conditioning (s=grad_{h_L} CE, dim=512) on CIFAR L=4: FAILED. Acc=17.2%, Gamma≈0, rho≈0. The 512-dim conditioning is too high-dimensional for the value net.
Confirms the scalar V approach has a dimensionality bottleneck.

### Pivot Recommendation: Direct Vector Credit Field

See `report_explore/MEMO_pivot_vector_field.md`.

Instead of V_phi -> grad_h V, learn a_phi(h_l, t_l, s) -> R^d directly. Train with perturbation-based target: match to actual loss change. Still satisfies no hidden BP anchor constraint. Minimal test: synthetic alpha=1.0, L=4 with M=4 perturbation directions.

---

## Phase 4: Diagnostic Dissection (commit TBD)

### Phase A: Frozen CIFAR Credit Recovery

**Setup**: BP-trained CIFAR-10 network (L=4, d=256, 61.7% test acc), frozen. Train credit estimators on fixed representations.

**Key Result**: On frozen BP features, credit estimators CAN recover meaningful credit.

| Method | mean Gamma | mean rho | mean nudge |
|--------|-----------|---------|-----------|
| DFA (random) | 0.006 | 0.005 | -0.000022 |
| State Bridge (eT) | **0.287** | **0.246** | **-0.000957** |
| Scalar CB (eT) | 0.115 | 0.125 | -0.000370 |
| Scalar CB (deltaL) | 0.070 | 0.062 | -0.000160 |

**Surprising**: State bridge is BEST on frozen BP features (opposite of synthetic). BP-trained features are quasi-linear, so SB's Jacobian approximation works well.

Also tested L=6 d=256 and L=4 d=512: same pattern (SB > CB_eT > CB_deltaL >> DFA).

**Implication**: Estimator is NOT the fundamental bottleneck. The online training failure is due to co-adaptation between forward net and credit estimator.

### Phase B: Online Shallow CIFAR Conditioning Scan

**Setup**: L=4, d=256, CIFAR-10, 100 epochs, seed=42 (then 3-seed on best config). Sweep: methods={DFA, SB, CB_eT, CB_deltaL}, wr={0,0.05,0.2}, tgw={1.0,4.0}.
**Found 2 positive configs with S1>0 AND S2>0:**

| Config | Acc | Gamma | rho | S1 vs DFA | S2 vs DFA |
|--------|-----|-------|-----|-----------|-----------|
| cb_eT wr=0.2 tgw=1.0 | 0.283 | 0.179 | 0.009 | **+0.079** | **+0.014** |
| cb_eT wr=0.2 tgw=4.0 | 0.285 | 0.187 | 0.002 | **+0.087** | **+0.007** |

**3-seed validation of cb_eT wr=0.2 tgw=1.0:**

- Seed 42: S1=+0.079, S2=+0.014 (both positive)
- Seed 123: S1=+0.059, S2=-0.004 (S1 positive, S2 marginal negative)
- Seed 456: S1=+0.135, S2=+0.003 (both positive, barely)

S1 is consistently positive. S2 is marginal: sometimes positive, sometimes not.

**CB_deltaL failed entirely on online CIFAR** (all configs near chance, Gamma≈0, rho≈0).

**SB_eT also failed online** (Gamma=0.025, rho=-0.013, despite being best on frozen features).

### Phase C: Direct Vector Credit Field Pilot

**Setup**: Synthetic alpha=1.0, L={4,8}, d=128, 80 epochs, 3 seeds. Compare DFA vs Scalar CB vs Vector Field (M=4, M=8).

**BREAKTHROUGH RESULT:**

| Method | L=4 Gamma | L=4 rho | L=8 Gamma | L=8 rho |
|--------|-----------|---------|-----------|---------|
| DFA | 0.01±0.01 | 0.01±0.01 | 0.08±0.04 | 0.08±0.04 |
| Scalar CB | 0.34±0.10 | 0.33±0.11 | 0.26±0.03 | 0.29±0.02 |
| **Vector M=4** | **0.91±0.05** | **0.96±0.01** | **0.96±0.01** | **0.95±0.01** |
| **Vector M=8** | **0.84±0.10** | **0.92±0.05** | **0.90±0.10** | **0.93±0.04** |

**Vector field improves over scalar CB by +0.3 to +0.7 on BOTH Gamma and rho.** This is consistent across all 6 seed x depth combinations.

The perturbation-based directional target directly trains the credit vector to predict local loss sensitivity, avoiding the scalar V curvature problem entirely.
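
To make the perturbation-based directional target concrete, here is a minimal sketch under stated assumptions. The notes only say that the credit vector is trained to match the actual loss change along M perturbation directions; the finite-difference form, the squared-error objective, the step size `eps`, and all function names below are illustrative assumptions, not the project's implementation.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def random_unit_direction(d, rng):
    """Sample a direction uniformly on the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def directional_targets(loss_fn, h, directions, eps=1e-3, one_sided=False):
    """Finite-difference loss sensitivity along each direction (assumed form)."""
    base = loss_fn(h) if one_sided else None
    targets = []
    for u in directions:
        plus = loss_fn([hi + eps * ui for hi, ui in zip(h, u)])
        if one_sided:
            targets.append((plus - base) / eps)
        else:
            minus = loss_fn([hi - eps * ui for hi, ui in zip(h, u)])
            targets.append((plus - minus) / (2.0 * eps))
    return targets


def credit_fit_loss(a, directions, targets):
    """Mean squared gap between predicted <a, u> and the measured sensitivity."""
    gaps = [(dot(a, u) - t) ** 2 for u, t in zip(directions, targets)]
    return sum(gaps) / len(gaps)


def quadratic_loss(x):
    """Toy loss with known gradient: grad of 0.5 * ||x||^2 is x itself."""
    return 0.5 * dot(x, x)


rng = random.Random(0)
h = [0.5, -1.0, 2.0, 0.25]
dirs = [random_unit_direction(len(h), rng) for _ in range(4)]  # M = 4
targets = directional_targets(quadratic_loss, h, dirs)
fit_true = credit_fit_loss(h, dirs, targets)               # credit = true gradient
fit_zero = credit_fit_loss([0.0] * len(h), dirs, targets)  # uninformative credit
```

On this toy quadratic, the true gradient fits the directional targets almost exactly while a zero credit vector does not, which is the property the objective is meant to reward. Only forward loss evaluations are used, so no gradient is propagated through hidden layers.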
### Experiment IDs (Phase 4)

- `frozen_cifar/`: Frozen CIFAR credit recovery (L=4 d=256, L=6 d=256, L=4 d=512)
- `online_shallow/`: Phase B online scan (L=4 d=256, all configs)
- `online_shallow_3seed/`: 3-seed validation of best config
- `vector_credit_pilot/`: Phase C vector field vs scalar CB on synthetic

### Answers to Phase 4 Questions

**Q1: On frozen CIFAR, can the current credit estimator recover useful credit?**
YES. Scalar CB achieves Gamma=0.115, rho=0.125, 20x better than DFA. State bridge is even better (Gamma=0.287, rho=0.246).

**Q2: If yes, why does online training still fail?**
Co-adaptation. The forward net's features change every epoch, making the value net's credit stale. DFA avoids this because its credits don't depend on learning. The wr=0.2 warmup helps by letting DFA carry the forward net to a useful feature regime before CB takes over, but the S2 signal remains marginal.

**Q3: Can a direct vector credit field improve over scalar CB?**
YES, dramatically. On synthetic alpha=1.0, vector field (M=4) achieves Gamma=0.91, rho=0.96, compared to scalar CB's Gamma=0.34, rho=0.33. The scalar V curvature problem is real and avoidable.

### Next Steps

1. Test vector credit field on CIFAR-10 (the real task)
2. Key concern: M=4 perturbation directions may not suffice for d=256/512 dimensions (signal/noise ~ sqrt(M/d) ~ sqrt(4/256) ~ 0.12)
3. May need M=32+ for CIFAR, which is expensive but tractable

---

## Phase 5: Vector Field Audit + Real-Task Transfer

### Phase 5A: Synthetic Audit (4 sanity checks)

**Setup**: alpha=1.0, L={4,8}, d=128, 3 seeds, 80 epochs

**Smoke test result (L=4, seed=42):**

| Method | Gamma | rho | nudge |
|--------|-------|-----|-------|
| scalar_cb | 0.224 | 0.210 | -0.007 |
| vec_eT_M4 | **0.847** | **0.951** | -0.026 |
| vec_shuffleCtrl | 0.051 | 0.068 | -0.001 |
| vec_noTerm | 0.955 | 0.971 | -0.027 |
| vec_onesided | 0.832 | 0.943 | -0.024 |

- **Check B (shuffle)**: PASS. Shuffled control collapses (5/6 near zero across full audit)
- **Check C (noTerm)**: Terminal NOT needed. Perturbation target alone gives 0.95+ Gamma
- **Check D (onesided)**: PASS. One-sided ≈ central difference

**Full 3-seed audit**: All 6 configs pass (delta_Gamma >= 0.49, delta_rho >= 0.55)

### Phase 5B: Frozen CIFAR Vector Transfer

**Setup**: CIFAR-10, frozen BP ref (L=4, d=256, 61.7%), 100 epochs estimators

| Method | Gamma | rho | nudge |
|--------|-------|-----|-------|
| DFA | 0.005 | 0.005 | -0.000006 |
| ScalarCB_eT | 0.115 | 0.120 | -0.000370 |
| StateBridge_eT | 0.287 | 0.264 | -0.000957 |
| **Vec_eT_M4** | **0.364** | **0.426** | **-0.001406** |

**TRANSFER SUCCESS**: Vec beats scalar CB by +0.25 Gamma, +0.31 rho. Vec also beats state bridge on rho (0.43 vs 0.26). M=4 is sufficient (M=8, M=16 give same results).

### Phase 5C: Online CIFAR Vector Pilot

**Setup**: CIFAR-10, L=4, d=256, 100 epochs, seed=42

| Config | Acc | Gamma | rho | S1 | S2 |
|--------|-----|-------|-----|----|----|
| DFA | 0.312 | 0.101 | -0.005 | 0 | 0 |
| vec wr=0.2 tw=1.0 | 0.243 | 0.001 | 0.000 | -0.100 | +0.005 |
| (for comparison: scalar CB wr=0.2 tgw=1.0 from Phase 4) | 0.283 | 0.179 | 0.009 | +0.079 | +0.014 |

**ONLINE FAILURE**: Vector field does WORSE than scalar CB online, despite being much better on frozen features. No config achieves S1>0.
### Core Finding of Phases 4-5

**The bottleneck is NOT the credit estimator.** Improving credit quality from scalar CB (Gamma=0.12) to vector field (Gamma=0.36) on frozen features does NOT help online training.

The bottleneck is in the **local surrogate update**: it cannot effectively exploit directional credit information, and co-adaptation between forward net and credit estimator rapidly degrades the signal.

### Experiment IDs (Phase 5)

- `vector_audit/`: Phase 5A smoke test
- `vector_audit_full/`: Phase 5A full 3-seed audit
- `frozen_cifar_vec/`: Phase 5B frozen CIFAR vector transfer
- `online_vec_pilot/`: Phase 5C online CIFAR vector pilot

---

## Phase 6: Exploitability Dissection

### Phase 6A: Snapshot Exploitability

**Setup**: BP-trained CIFAR snapshot (L=4, d=256, 61.9% acc). Offline-trained estimators. k-step local updates with real loss measurement.

**CRITICAL FINDING: Better credit → worse loss decrease.**

| Credit | Gamma | rho | dL_5step (inner_product) |
|--------|-------|-----|-------------------------|
| DFA | 0.009 | -0.023 | **-0.0001** (only negative!) |
| ScalarCB | 0.122 | 0.090 | +0.042 |
| Vec_M4 | 0.378 | 0.411 | +0.057 |
| Oracle BP | 1.000 | 0.998 | +0.011 |

Credit quality is ANTI-CORRELATED with loss decrease. DFA (worst credit) is the only method not increasing loss.

### Phase 6C: Local Update Rule Swap

Tested target-shift (`h_target = h_{l+1} - eta * a_norm`) at eta in {0.01, 0.1, 0.3, 1.0}. Target-shift reduces damage (Vec dL: +0.057 → +0.002 at eta=0.1) but never achieves negative DeltaLoss for any non-DFA credit. Cosine rule produces near-zero effects.

### Root Cause

The inner-product surrogate `⟨f_l, a_l⟩` is not a valid proxy for global loss minimization. The gradient of this surrogate w.r.t. block parameters ≠ gradient of global loss w.r.t. the same parameters. A BP-trained snapshot is at a minimum reachable only by full BP; local updates systematically push uphill.
DFA works because its credits are weak enough to produce near-zero updates, effectively doing nothing.

### This is Case B from the diagnostic logic tree:

Better credit does NOT lead to better snapshot loss decrease. **The primary bottleneck is the local update rule itself, not the estimator or tracking.**

### Experiment IDs (Phase 6)

- `snapshot_exploit/`: Phase 6A snapshot exploitability
- `update_swap/`: Phase 6C local update rule comparison

---

## Phase 6.5: Protocol Audit (REVISES Phase 6A conclusion)

### Phase 6.5A: Same-Batch Linesearch

**CRITICAL REVISION**: Phase 6A's "better credit → worse loss" was a protocol artifact.

- Phase 6A used: normalized credit + held-out evaluation + gradient clamping.
- Phase 6.5A uses: raw + norm credit, same-batch + held-out eval, no clamping, eta sweep.

**With same-batch evaluation, better credit DOES produce more loss decrease:**

| Method | Gamma | dL_same (norm, all, best eta) | dL_held |
|--------|-------|-------------------------------|---------|
| DFA | 0.01 | -0.003 | +0.004 |
| ScalarCB | 0.12 | -0.025 | +0.027 |
| Vec_M4 | 0.38 | **-0.135** | +0.045 |
| Oracle BP | 1.00 | **-0.406** | +0.094 |

Same-batch loss decrease is MONOTONIC with credit quality. But held-out loss INCREASES for all non-DFA methods.

**This is Case D: the local surrogate exploits credit correctly on training data, but the update overfits to the batch. Better credit = more effective overfitting.**

### Key confounds identified in Phase 6A:

1. **Normalization** inflated DFA's weak credits to same magnitude as Vec's
2. **Held-out evaluation** showed generalization failure, not exploitability failure
3. **Gradient clamping** distorted the natural credit quality ordering

### Raw vs Norm:

- Raw credit: tiny updates (BP grad RMS ≈ 0.00004). Vec raw best dL_same=-0.005
- Norm credit: amplifies to useful magnitude but also amplifies overfitting

### Revised diagnosis:

The bottleneck is NOT "surrogate can't exploit credit" (Phase 6A was wrong).
It IS "local surrogate with good credit overfits to mini-batch."

This suggests: regularization of local updates (larger batches, weight decay, gradient noise) could make better credit usable.

### Experiment IDs (Phase 6.5)

- `exploit_linesearch/`: Phase 6.5A smoke test (Oracle + Vec, last1, raw)
- `exploit_linesearch_full/`: Phase 6.5A full sweep (all methods, ranges, norm modes)

---

## Phase 7A: Snapshot Time Sweep

**Setup**: BP snapshots at epoch {5, 20, 100} (acc 0.49/0.57/0.62). Train Vec_M4 on each frozen snapshot. Test 1-step and 5-step with raw credit, last-block-only.

**KEY FINDING: Held-out failure is primarily a LATE-SNAPSHOT artifact.**

5-step DeltaLoss held-out:

| Epoch | DFA dL_held | Vec dL_held | Oracle dL_held | Vec PUR |
|-------|-------------|-------------|----------------|---------|
| **5** | +0.003 | **-0.005** | **-0.009** | **0.70** |
| 20 | +0.001 | +0.002 | +0.000 | -3.87 |
| 100 | +0.000 | +0.001 | -0.001 | -1.01 |

At epoch 5: Vec decreases held-out loss (PUR=0.70), Oracle too (PUR=1.05). DFA INCREASES held-out at all snapshots. By epoch 20 the generalization window closes.

**Better credit produces MORE consistent updates** (Vec variance=0.8 vs DFA variance=40). The problem is not batch-specificity but snapshot timing: credit is useful early, useless late.

**Implication**: The DFA warmup (which delays credit bridge to epoch ~20) is counterproductive. Credit bridge should be used from epoch 0.
### Experiment IDs (Phase 7)

- `snapshot_time/`: Phase 7A snapshot time sweep with BP checkpoints

---

## Phase 8: Schedule Timing Hypothesis Test

**Setup**: CIFAR-10, L=4, d=256, 100 epochs, seed=42

| Schedule | acc@5 | acc@20 | final |
|----------|-------|--------|-------|
| DFA_only | **0.297** | **0.308** | **0.312** |
| Vec_only_from_0 | 0.135 | 0.151 | 0.154 |
| Vec_T5_then_DFA | 0.135 | 0.213 | 0.266 |
| DFA_T20_then_Vec | 0.297 | 0.308 | 0.129 |

**Phase 7A's timing hypothesis does NOT transfer to online training.** Vec from epoch 0 gets stuck at 15% (near chance). The online Vec estimator starts from random initialization and cannot learn useful credit fast enough when the forward net is also random (cold-start paradox). DFA alone remains the best non-BP method (31.2%).

### The cold-start paradox:

Vec credit is most useful on early features, but Vec can only learn useful credit from features with structure. DFA provides structure slowly, but by the time Vec is ready, the early window is closed.

### Project conclusion at this point:

- Vec estimator WORKS (synthetic + frozen CIFAR)
- Local surrogate CAN exploit it (same-batch, Phase 6.5A)
- Early snapshots show generalization (Phase 7A, offline-trained Vec)
- But online co-learning of estimator + forward net is unsolved (cold-start)

### Experiment IDs (Phase 8)

- `schedule_timing/`: Phase 8 schedule comparison

---

## Phase 9A: Checkpointed Offline Handoff

**Setup**: DFA trajectory with checkpoint at t0=5 (acc=30.4%). Offline-fit Vec_M4 on frozen checkpoint (60 epochs). Then branch.

| Branch | final acc | diff vs DFA |
|--------|-----------|-------------|
| continue_DFA | 0.311 | baseline |
| handoff_to_Vec | 0.300 | -0.011 |
| **handoff_blend_05** | **0.317** | **+0.006** |

**FIRST POSITIVE ONLINE RESULT**: 50% blend (Vec+DFA) outperforms pure DFA by +0.55%.

This is Case B: pure Vec handoff fails, but blend works. Vec provides complementary directional credit that DFA alone lacks.
DFA stabilizes the trajectory while Vec adds learned corrections.

### Full Phase 9A Results (seed=42, all t0 × alpha)

| t0 | Branch | final acc | diff vs DFA |
|----|--------|-----------|-------------|
| 1 | continue_DFA | 0.310 | baseline |
| 1 | blend_05 | **0.319** | **+0.9%** |
| 1 | blend_075 | 0.313 | +0.3% |
| 5 | continue_DFA | 0.310 | baseline |
| 5 | blend_05 | 0.317 | +0.6% |
| 5 | **blend_075** | **0.326** | **+1.5%** |
| 10 | continue_DFA | 0.310 | baseline |
| 10 | blend_05 | 0.318 | +0.7% |
| 10 | **blend_075** | **0.325** | **+1.4%** |

**Best config: t0=5 or 10, blend alpha=0.75 (75% Vec + 25% DFA), +1.5%.** Higher Vec fraction (0.75) outperforms lower (0.25, 0.5) at t0=5,10. Pure Vec handoff still fails at all t0.

### Phase 9B: Periodic Refit - FAILED

K=5 (refit every 5 epochs), R=1 (1 epoch refit), alpha=0.75: **14.0%** (vs DFA 31.2%). Vec starts random, periodic 1-epoch refits aren't enough. Without offline pretraining (as in 9A), the cold-start problem remains unsolved.

### Phase 9C: Top-Down Curriculum - NEUTRAL

| Config | final acc | diff vs DFA |
|--------|-----------|-------------|
| DFA_only | 0.312 | baseline |
| last1_vec (blend 0.75) | 0.308 | -0.4% |
| last2_vec (blend 0.75) | 0.311 | -0.1% |

Vec for last 1-2 blocks ≈ DFA. No improvement, no degradation. Same cold-start problem: Vec can't learn useful credit online even for 1 block.

### Phase 9 Overall Conclusion

Only **offline prefit + blend handoff** (9A) works. Both online periodic refit (9B) and top-down curriculum (9C) fail because they don't pre-train Vec on frozen features. The +1.5% gain from 9A's blend(0.75) at t0=5 is the project's best online result.
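
The blend handoff used throughout Phase 9 can be sketched as a convex combination of the two credit vectors. This is a minimal illustration under assumptions: the notes state only that alpha=0.75 means 75% Vec + 25% DFA and that the handoff happens at checkpoint epoch t0. Whether credits are normalized before blending, and whether the switch applies at `epoch >= t0` exactly, are assumptions here, and the function names are hypothetical.

```python
def blend_credit(a_vec, a_dfa, alpha=0.75):
    """Convex blend: alpha * Vec credit + (1 - alpha) * DFA credit."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(a_vec) != len(a_dfa):
        raise ValueError("credit vectors must have the same length")
    return [alpha * v + (1.0 - alpha) * d for v, d in zip(a_vec, a_dfa)]


def credit_for_epoch(epoch, t0, a_vec, a_dfa, alpha=0.75):
    """Pure DFA before the handoff epoch t0 (assumed boundary), blend afterwards."""
    if epoch < t0:
        return list(a_dfa)
    return blend_credit(a_vec, a_dfa, alpha)


if __name__ == "__main__":
    vec, dfa = [1.0, 0.0], [0.0, 1.0]
    print(credit_for_epoch(3, 5, vec, dfa))   # before handoff: DFA only
    print(credit_for_epoch(5, 5, vec, dfa))   # after handoff: 75% Vec + 25% DFA
```

In this form, alpha=1.0 corresponds to the pure Vec handoff branch and alpha=0.0 to continue_DFA, so the branches in the tables above are points on a single one-parameter family.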
### Experiment IDs (Phase 9)

- `checkpointed_handoff/`: Phase 9A checkpointed handoff with branches
- `periodic_refit/`: Phase 9B periodic refit
- `topdown_curriculum/`: Phase 9C top-down curriculum

---

## Phase 10A: Prefit Threshold Curve

**Setup**: t0=5, blend_075, E_prefit in {0, 15, 60}, seed=42

| E_prefit | Gamma_frozen | rho_frozen | final acc | diff vs DFA |
|----------|-------------|-----------|-----------|-------------|
| 0 (random Vec) | -0.005 | 0.014 | **0.324** | **+1.3%** |
| 15 | 0.002 | 0.011 | 0.323 | +1.2% |
| 60 | -0.001 | -0.009 | 0.325 | +1.4% |
| continue_DFA | n/a | n/a | 0.311 | baseline |

**Case C: NO prefit threshold exists.** Even random Vec (E=0) with blend(0.75) beats DFA.

**Critical reinterpretation of Phase 9A**: the +1.5% gain was NOT from Vec learning good credit. Frozen Gamma/rho are near zero at all prefit levels. The benefit comes from the blend mechanism itself: blending DFA with any additional signal provides regularization/diversification.

### Phase 10A.5: Blend Mechanism Dissection

| Branch | final acc | diff vs DFA |
|--------|-----------|-------------|
| continue_DFA | 0.311 | baseline |
| blend_random_**frozen** | **0.126** | **-18.5%** (catastrophic) |
| blend_random_**trainable** | 0.322 | +1.2% |
| blend_shuffled_trainable | 0.325 | +1.4% |
| blend_gaussian_noise | 0.308 | -0.3% |
| scaled_DFA_norm_match | 0.310 | -0.0% |

**Mechanism identified**: The gain is from **implicit regularization through a trainable auxiliary network**, NOT from learned credit. Frozen random Vec crashes (12.6%). Trainable Vec helps even with shuffled targets. Gaussian noise and norm scaling don't help.

Phase 9A's +1.5% was not evidence of useful credit; it was an optimization dynamics effect.
### Phase 10A.6: Structured vs Semantic Auxiliary

| Branch | final | diff | Key insight |
|--------|-------|------|-------------|
| random_trainable | 0.324 | +1.2% | works |
| shuffled_trainable | 0.325 | +1.4% | no semantics needed |
| **zero_target** | **0.221** | **-9.1%** | must output non-zero |
| fresh_random_target | 0.325 | +1.3% | stable targets not needed |
| time_only | 0.321 | +1.0% | h_l not needed, just depth |
| **constant_input** | **0.312** | **+0.0%** | needs at least depth info |
| prefit60_frozen | 0.127 | -18.4% | frozen = crash |
| prefit60_trainable | 0.321 | +1.0% | prefit ≈ random init |

**Mechanism**: depth-aware trainable auxiliary perturbation that diversifies block-local updates. Not semantic credit. Not pure trainability (zero_target crashes). Not state-dependent (time_only works). Depth-awareness is the minimal requirement (constant_input fails).

### Phase 10A.7: Minimal Auxiliary Compression

| Branch | final | diff | Key insight |
|--------|-------|------|-------------|
| random_trainable | 0.321 | +1.0% | reference |
| zero_target | 0.203 | -10.8% | must output non-zero (confirmed) |
| zero_target_normmatched | 0.202 | -10.9% | norm matching doesn't save it |
| **perlayer_vector** | **0.318** | **+0.7%** | per-block trainable vector works! |
| freeze_after_1 | 0.144 | -16.7% | freeze = crash |
| freeze_after_5 | 0.143 | -16.8% | freeze = crash |
| freeze_after_10 | 0.130 | -18.1% | freeze = crash |

**Key findings**:

1. Norm-matched zero-target still crashes → non-zero direction is necessary, not just norm
2. Per-layer trainable vector works (+0.7%) → network not strictly needed, but helps
3. ALL freeze-after-k crash → **continuous trainability is absolutely necessary**

**Minimal mechanism**: continuously trainable, non-zero, depth-aware auxiliary perturbation.
### Phase 10A.8: Scaffold Dynamics

**8A: Freeze with Decay**

| Branch | final | diff | Key |
|--------|-------|------|-----|
| random_trainable_075 | 0.322 | +1.1% | reference |
| freeze1_fixed075 | 0.146 | -16.6% | stale aux at 75% kills |
| freeze1_decay_to_000 | **0.285** | **-2.7%** | decay to DFA recovers most |
| freeze5_decay_to_000 | 0.285 | -2.6% | same |

Freeze failure is MOSTLY stale high-weight aux (decay_to_000 recovers to 28.5%). But 28.5% < DFA 31.2% → continuous trainability adds ~2.7% additional value.

**8B: Alpha Sweep**

| Method | α=0.25 | α=0.50 | α=0.75 | α=0.90 |
|--------|--------|--------|--------|--------|
| perlayer_vector | +0.0% | +0.6% | **+1.1%** | -1.4% |
| random_trainable | +0.1% | +0.4% | **+0.8%** | -0.1% |

Both methods optimal at α=0.75. perlayer_vector (+1.1%) ≈ random_trainable (+0.8%). Per-layer vector is the minimal sufficient scaffold.

### Phase 10A.8C: 3-Seed Replication

| Method | final acc (3 seeds) | diff vs DFA |
|--------|---------------------|-------------|
| DFA | 0.306±0.006 | baseline |
| perlayer_vector α=0.75 | 0.304±0.006 | -0.2% (not significant) |
| random_trainable α=0.75 | 0.313±0.007 | +0.7% (marginal) |

Single-seed gains do not robustly replicate. Error bars overlap.

### Experiment IDs (Phase 10)

- `prefit_threshold/`: Phase 10A prefit threshold curve
- `blend_dissection/`: Phase 10A.5 blend mechanism dissection
- `structured_aux/`: Phase 10A.6 structured vs semantic auxiliary
- `minimal_aux_compression/`: Phase 10A.7 minimal aux compression
- `freeze_with_decay/`: Phase 10A.8A freeze with decay
- `alpha_sweep_scaffold/`: Phase 10A.8B alpha sweep
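
The "error bars overlap" reading of the Phase 10A.8C table can be checked mechanically. The sketch below is illustrative only: it uses the reported means and ± values from that table, treats ± as one sample standard deviation (ddof=1, as stated for the v2.38+ paper numbers; that this also holds for this historical table is an assumption), and uses hypothetical function names. Per-seed accuracies are not recorded in these notes, so none are invented here.

```python
import statistics


def mean_and_sample_std(values):
    """Mean and sample standard deviation (Bessel-corrected, i.e. ddof=1)."""
    return statistics.mean(values), statistics.stdev(values)


def one_sigma_intervals_overlap(mean_a, std_a, mean_b, std_b):
    """True if [mean - std, mean + std] of the two methods intersect."""
    return (mean_a - std_a) <= (mean_b + std_b) and (mean_b - std_b) <= (mean_a + std_a)


# Reported (mean, std) pairs from the 3-seed replication table.
dfa = (0.306, 0.006)
perlayer_vector = (0.304, 0.006)
random_trainable = (0.313, 0.007)

if __name__ == "__main__":
    print(one_sigma_intervals_overlap(*dfa, *perlayer_vector))
    print(one_sigma_intervals_overlap(*dfa, *random_trainable))
```

Overlapping one-sigma intervals are a coarse screen rather than a significance test; with only 3 seeds per method, any formal test would have very little power.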