# Experiment Notes

## Experiment Phases

- **debug**: Initial implementation, rapid iteration (commit ce24e36)
- **pilot**: Controlled iteration (commits 0b9ebb2, 7baf7ae)
- **frozen**: Code at commit 0b9ebb2 for all reported results

## Status: PHASE 6.5 PROTOCOL AUDIT: PHASE 6A CONCLUSION REVISED

---

## Final Results Summary

### Toy LQ (3 seeds, 8000 steps)

| Method | Costate Cosine | ρ | Nudging |
|--------|---------------|---|---------|
| DFA | 0.001±0.003 | 0.001±0.007 | 0.000±0.001 |
| State Bridge | 0.945±0.002 | 0.931±0.003 | -0.344±0.019 |
| Credit Bridge | 0.944±0.001 | 0.930±0.002 | -0.342±0.019 |

### CIFAR-10 (3 seeds, 100 epochs)

| Method | Test Accuracy |
|--------|:------------:|
| BP | 59.2%±0.4% |
| DFA | 30.0%±0.3% |
| Credit Bridge | 29.6%±1.0% |
| State Bridge | 18.5%±1.8% |

### CIFAR-10 Diagnostics (seed 42)

| Method | BP Cosine | ρ | Nudge |
|--------|-----------|---|-------|
| BP | 0.940 | 0.990 | -0.027 |
| Credit Bridge | 0.056 | ~0 | ~0 |
| DFA | 0.030 | 0.005 | ~0 |
| State Bridge | 0.021 | 0.004 | ~0 |

---

## Key Findings

1. **Terminal gradient matching is essential** for credit bridge. Without it, V learns correct values but uninformative gradients (cos → 0.03). With it, credit bridge matches state bridge on toy (~0.94 cosine).

2. **State bridge fails on nonlinear systems** despite near-perfect state prediction. State prediction error → 0.0000 but test accuracy = 18.5% (worst of all methods). This confirms the core hypothesis: bridging state ≠ bridging credit.

3. **Credit bridge modestly outperforms DFA in BP cosine** (0.056 vs 0.030, ~2x) but accuracy is comparable (29.6% vs 30.0%).

4. **All non-BP methods struggle** on the deep 12-block MLP architecture. The gap to BP (59.2%) is large for all methods.
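These notes do not define how the "BP cosine" in Key Finding 3 is computed. As a minimal sketch, assuming it is the mean per-sample cosine between an estimated hidden-layer credit and the true backprop gradient, it could look like the following. The names `bp_cosine`, `a_est`, and `g_bp` are hypothetical and not taken from the experiment code.

```python
import numpy as np

def bp_cosine(a_est, g_bp, eps=1e-12):
    """Mean per-sample cosine between estimated credit and BP gradient.

    a_est, g_bp: arrays of shape (batch, d). Hypothetical names; the
    averaging scheme is an assumption, not the notes' definition.
    """
    a_est = np.asarray(a_est, dtype=float)
    g_bp = np.asarray(g_bp, dtype=float)
    dots = np.sum(a_est * g_bp, axis=1)
    norms = np.linalg.norm(a_est, axis=1) * np.linalg.norm(g_bp, axis=1)
    return float(np.mean(dots / (norms + eps)))

rng = np.random.default_rng(0)
g = rng.standard_normal((64, 128))

# A rescaled copy of the gradient is perfectly aligned.
aligned = bp_cosine(3.0 * g, g)
# An unrelated random credit has cosine near zero in high dimension.
random_ = bp_cosine(rng.standard_normal((64, 128)), g)
print(aligned, random_)
```

Because the cosine is scale-invariant, it measures direction only; two methods with the same cosine can still produce updates of very different magnitude.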
---

## Changes Log

- `ce24e36`: Initial implementation
- `0b9ebb2`: Sync state bridge to use normalized MSE in both toy and CIFAR
- `7baf7ae`: Add experiment notes and .gitignore

## Experiment IDs

- `toy_lq_frozen/`: Final toy results (3 seeds, synced state bridge)
- `cifar10/`, `cifar10_seed123/`, `cifar10_seed456/`: Final CIFAR results
- `toy_lq/`: Debug-phase toy results (raw state bridge, for ablation)
- `smoke_test/`, `smoke_test2/`: FashionMNIST debug runs

## Design Decisions

1. Terminal gradient matching (term_grad_weight=1.0): output-layer-local, not hidden BP
2. DFA warmup for credit bridge (20% of epochs): prevents value net bootstrap failure
3. Normalized MSE for state bridge: numerical stability
4. Credit normalization: `a_norm = a / (RMS(a) + 1e-6)`

---

## Phase 2: Explore (commit 2403960+)

### Synthetic Nonlinearity Ladder (Phase 1 of explore)

**Setup**: Teacher-student with `phi_alpha(z) = (1-alpha)*z + alpha*tanh(z)`

- alpha in {0, 0.25, 0.5, 1.0}, L in {2, 4, 8, 12}
- d=128, C=10, 80 epochs, 3 seeds

**Critical Finding**: Credit bridge advantage scales with nonlinearity.

At alpha=1.0 (full tanh), credit bridge is the BEST method on Gamma and rho at ALL depths:

| L | DFA Gamma | SB Gamma | CB Gamma | DFA rho | SB rho | CB rho |
|---|-----------|----------|----------|---------|--------|--------|
| 2 | 0.03 | 0.52 | **0.53** | 0.03 | 0.47 | **0.57** |
| 4 | 0.05 | 0.34 | **0.45** | 0.06 | 0.32 | **0.51** |
| 8 | 0.06 | 0.25 | **0.36** | 0.07 | 0.23 | **0.42** |
| 12 | 0.07 | 0.22 | **0.24** | 0.07 | 0.21 | **0.32** |

At alpha=0.5 (moderate nonlinearity), SB still wins on Gamma but CB wins on rho at L=4.
At alpha=0 (linear), SB dominates.

**Interpretation**: State bridge fails via Jacobian mismatch, not value prediction error. Credit bridge avoids this by learning value field gradients directly. The crossover happens around alpha=0.7-1.0.

### CIFAR-10 Depth Scan (Phase 2 of explore, in progress)

Sweep L={2,4,6,8,12}, d=512, 100 epochs on CIFAR-10.
Preliminary results (L=2,4, seed=42):

| L | Method | Acc | Gamma | rho |
|---|--------|-----|-------|-----|
| 2 | DFA | 0.312 | 0.196 | 0.001 |
| 2 | CB | 0.311 | 0.175 | **0.031** |
| 4 | DFA | 0.314 | 0.100 | 0.003 |
| 4 | CB | 0.298 | 0.123 | -0.002 |

CIFAR is much harder -- rho signal is very weak for all non-BP methods.

### Changes Log (explore phase)

- `2403960`: Add synthetic ladder and CIFAR depth scan experiments
- Student blocks now use pre-LayerNorm for stability (fixes L>=8 blowup)
- Added gradient clipping to block updates

### Experiment IDs (explore phase)

- `synth_ladder_smoke/`: Initial 3-alpha x 2-depth smoke test
- `synth_ladder_v2_lo/`: Full alpha=0,0.25 x L=2,4,8,12 x 3 seeds
- `synth_ladder_v2_hi/`: Full alpha=0.5,1.0 x L=2,4,8,12 x 3 seeds
- `cifar_depth_scan_s42/`: CIFAR L=2,4,6,8,12 x d=512 x seed=42 (COMPLETE)
- `boundary_ablation_s_sweep/`: s_type in {eT, deltaL, eT_hL, deltaL_hL}
- `boundary_ablation_tgw_sweep/`: tgw in {0, 0.25, 1.0, 4.0}
- `boundary_ablation_wr_sweep/`: warmup ratio in {0, 0.05, 0.2, 0.5}
- `boundary_ablation_s123/`, `boundary_ablation_s456/`: s_type sweep with seeds 123, 456
- `boundary_ablation_deltaL_wr/`: deltaL with warmup ratio sweep

### Phase 3 Results: Boundary-Condition Ablation

At alpha=1.0, L=4 (best synthetic regime), 3 seeds:

**s_type (conditioning code):**

| Code | Gamma | rho | Acc |
|------|-------|-----|-----|
| eT (dim=10) | 0.452+/-0.042 | 0.509+/-0.033 | 0.523 |
| deltaL (dim=d) | **0.562+/-0.007** | **0.510+/-0.014** | 0.448 |
| eT+proj(h_L) | 0.002 | 0.016 | 0.559 |
| deltaL+proj(h_L) | 0.018 | 0.026 | 0.564 |

**deltaL gives best Gamma. Concatenating h_L destroys credit quality (value net cheats).**

**Terminal gradient matching weight:** tgw=0 -> Gamma=0.12; tgw=1 -> Gamma=0.46; tgw=4 -> Gamma=0.57 (but acc drops). Terminal gradient matching is monotonically beneficial for credit quality.

**Warmup ratio:** wr=0 -> best Gamma (0.68) but worst acc (0.46).
wr=0.5 -> worst Gamma (0.23) but best acc (0.66). Clear tradeoff between credit quality and accuracy.

Best single config: deltaL + tgw=1.0 + wr=0.05 -> **Gamma=0.768, rho=0.691**

### CIFAR deltaL Test

deltaL conditioning (s=grad_{h_L} CE, dim=512) on CIFAR L=4: FAILED. Acc=17.2%, Gamma≈0, rho≈0. The 512-dim conditioning is too high-dimensional for the value net. Confirms the scalar V approach has a dimensionality bottleneck.

### Pivot Recommendation: Direct Vector Credit Field

See `report_explore/MEMO_pivot_vector_field.md`.

- Instead of V_phi -> grad_h V, learn a_phi(h_l, t_l, s) -> R^d directly.
- Train with perturbation-based target: match to actual loss change.
- Still satisfies no hidden BP anchor constraint.
- Minimal test: synthetic alpha=1.0, L=4 with M=4 perturbation directions.

---

## Phase 4: Diagnostic Dissection (commit TBD)

### Phase A: Frozen CIFAR Credit Recovery

**Setup**: BP-trained CIFAR-10 network (L=4, d=256, 61.7% test acc), frozen. Train credit estimators on fixed representations.

**Key Result**: On frozen BP features, credit estimators CAN recover meaningful credit.

| Method | mean Gamma | mean rho | mean nudge |
|--------|-----------|---------|-----------|
| DFA (random) | 0.006 | 0.005 | -0.000022 |
| State Bridge (eT) | **0.287** | **0.246** | **-0.000957** |
| Scalar CB (eT) | 0.115 | 0.125 | -0.000370 |
| Scalar CB (deltaL) | 0.070 | 0.062 | -0.000160 |

**Surprising**: State bridge is BEST on frozen BP features (opposite of synthetic). BP-trained features are quasi-linear, so SB's Jacobian approximation works well.

Also tested L=6 d=256 and L=4 d=512: same pattern (SB > CB_eT > CB_deltaL >> DFA).

**Implication**: Estimator is NOT the fundamental bottleneck. The online training failure is due to co-adaptation between forward net and credit estimator.

### Phase B: Online Shallow CIFAR Conditioning Scan

**Setup**: L=4, d=256, CIFAR-10, 100 epochs, seed=42 (then 3-seed on best config).
Sweep: methods={DFA, SB, CB_eT, CB_deltaL}, wr={0,0.05,0.2}, tgw={1.0,4.0}.

**Found 2 positive configs with S1>0 AND S2>0:**

| Config | Acc | Gamma | rho | S1 vs DFA | S2 vs DFA |
|--------|-----|-------|-----|-----------|-----------|
| cb_eT wr=0.2 tgw=1.0 | 0.283 | 0.179 | 0.009 | **+0.079** | **+0.014** |
| cb_eT wr=0.2 tgw=4.0 | 0.285 | 0.187 | 0.002 | **+0.087** | **+0.007** |

**3-seed validation of cb_eT wr=0.2 tgw=1.0:**

- Seed 42: S1=+0.079, S2=+0.014 (both positive)
- Seed 123: S1=+0.059, S2=-0.004 (S1 positive, S2 marginal negative)
- Seed 456: S1=+0.135, S2=+0.003 (both positive, barely)

S1 is consistently positive. S2 is marginal: sometimes positive, sometimes not.

**CB_deltaL failed entirely on online CIFAR** (all configs near chance, Gamma≈0, rho≈0).

**SB_eT also failed online** (Gamma=0.025, rho=-0.013, despite being best on frozen features).

### Phase C: Direct Vector Credit Field Pilot

**Setup**: Synthetic alpha=1.0, L={4,8}, d=128, 80 epochs, 3 seeds. Compare DFA vs Scalar CB vs Vector Field (M=4, M=8).

**BREAKTHROUGH RESULT:**

| Method | L=4 Gamma | L=4 rho | L=8 Gamma | L=8 rho |
|--------|-----------|---------|-----------|---------|
| DFA | 0.01±0.01 | 0.01±0.01 | 0.08±0.04 | 0.08±0.04 |
| Scalar CB | 0.34±0.10 | 0.33±0.11 | 0.26±0.03 | 0.29±0.02 |
| **Vector M=4** | **0.91±0.05** | **0.96±0.01** | **0.96±0.01** | **0.95±0.01** |
| **Vector M=8** | **0.84±0.10** | **0.92±0.05** | **0.90±0.10** | **0.93±0.04** |

**Vector field improves over scalar CB by +0.3 to +0.7 on BOTH Gamma and rho.** This is consistent across all 6 seed x depth combinations.

The perturbation-based directional target directly trains the credit vector to predict local loss sensitivity, avoiding the scalar V curvature problem entirely.
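The perturbation-based directional target can be sketched as follows. This is an illustration under stated assumptions, not the experiment code: the credit is a plain vector `a` rather than a network a_phi, the loss is a hypothetical smooth function of the hidden state with a known gradient, and the names `directional_targets` and `credit_fit_loss` are invented for the example. The target for each of the M directions is the finite-difference loss change along that direction, and the credit is fit so that its inner product with the direction matches it.

```python
import numpy as np

def directional_targets(loss_fn, h, directions, eps=1e-3, one_sided=False):
    """Finite-difference loss change along each perturbation direction."""
    base = loss_fn(h)
    targets = []
    for u in directions:
        if one_sided:
            # One-sided difference: one extra loss evaluation per direction.
            targets.append((loss_fn(h + eps * u) - base) / eps)
        else:
            # Central difference: two extra loss evaluations per direction.
            targets.append((loss_fn(h + eps * u) - loss_fn(h - eps * u)) / (2 * eps))
    return np.array(targets)

def credit_fit_loss(a, directions, targets):
    """Squared error between predicted <a, u_m> and measured loss change."""
    return float(np.mean((directions @ a - targets) ** 2))

# Toy check with a hypothetical loss whose gradient is known in closed form.
rng = np.random.default_rng(0)
d, M = 128, 4
w = rng.standard_normal(d)
loss_fn = lambda h: float(np.sum(np.tanh(h) * w))
h = rng.standard_normal(d)
true_grad = w * (1.0 - np.tanh(h) ** 2)

U = rng.standard_normal((M, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)  # unit-norm directions

t_central = directional_targets(loss_fn, h, U)
t_onesided = directional_targets(loss_fn, h, U, one_sided=True)

err_true = credit_fit_loss(true_grad, U, t_central)    # true gradient fits the targets
err_zero = credit_fit_loss(np.zeros(d), U, t_central)  # zero credit does not
print(err_true, err_zero)
```

The fit loss only constrains `a` along the M sampled directions per example, so recovering the full d-dimensional credit relies on resampling directions across batches.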
### Experiment IDs (Phase 4)

- `frozen_cifar/`: Frozen CIFAR credit recovery (L=4 d=256, L=6 d=256, L=4 d=512)
- `online_shallow/`: Phase B online scan (L=4 d=256, all configs)
- `online_shallow_3seed/`: 3-seed validation of best config
- `vector_credit_pilot/`: Phase C vector field vs scalar CB on synthetic

### Answers to Phase 4 Questions

**Q1: On frozen CIFAR, can the current credit estimator recover useful credit?**
YES. Scalar CB achieves Gamma=0.115, rho=0.125, roughly 20x better than DFA. State bridge is even better (Gamma=0.287, rho=0.246).

**Q2: If yes, why does online training still fail?**
Co-adaptation. The forward net's features change every epoch, making the value net's credit stale. DFA avoids this because its credits don't depend on learning. The wr=0.2 warmup helps by letting DFA carry the forward net to a useful feature regime before CB takes over, but the S2 signal remains marginal.

**Q3: Can a direct vector credit field improve over scalar CB?**
YES, dramatically. On synthetic alpha=1.0, vector field (M=4) achieves Gamma=0.91, rho=0.96, compared to scalar CB's Gamma=0.34, rho=0.33. The scalar V curvature problem is real and avoidable.

### Next Steps

1. Test vector credit field on CIFAR-10 (the real task)
2. Key concern: M=4 perturbation directions may not suffice for d=256/512 dimensions (signal/noise ~ sqrt(M/d) = sqrt(4/256) = 0.125 for d=256)
3. May need M=32+ for CIFAR, which is expensive but tractable

---

## Phase 5: Vector Field Audit + Real-Task Transfer

### Phase 5A: Synthetic Audit (4 sanity checks)

**Setup**: alpha=1.0, L={4,8}, d=128, 3 seeds, 80 epochs

**Smoke test result (L=4, seed=42):**

| Method | Gamma | rho | nudge |
|--------|-------|-----|-------|
| scalar_cb | 0.224 | 0.210 | -0.007 |
| vec_eT_M4 | **0.847** | **0.951** | -0.026 |
| vec_shuffleCtrl | 0.051 | 0.068 | -0.001 |
| vec_noTerm | 0.955 | 0.971 | -0.027 |
| vec_onesided | 0.832 | 0.943 | -0.024 |

**Check B (shuffle)**: PASS: shuffled control collapses (5/6 near zero across full audit)

**Check C (noTerm)**: Terminal NOT needed: perturbation target alone gives 0.95+ Gamma

**Check D (onesided)**: PASS: one-sided ≈ central difference

**Full 3-seed audit**: All 6 configs pass (delta_Gamma >= 0.49, delta_rho >= 0.55)

### Phase 5B: Frozen CIFAR Vector Transfer

**Setup**: CIFAR-10, frozen BP ref (L=4, d=256, 61.7%), estimators trained for 100 epochs

| Method | Gamma | rho | nudge |
|--------|-------|-----|-------|
| DFA | 0.005 | 0.005 | -0.000006 |
| ScalarCB_eT | 0.115 | 0.120 | -0.000370 |
| StateBridge_eT | 0.287 | 0.264 | -0.000957 |
| **Vec_eT_M4** | **0.364** | **0.426** | **-0.001406** |

**TRANSFER SUCCESS**: Vec beats scalar CB by +0.25 Gamma, +0.31 rho. Vec also beats state bridge on rho (0.43 vs 0.26). M=4 is sufficient (M=8, M=16 give same results).

### Phase 5C: Online CIFAR Vector Pilot

**Setup**: CIFAR-10, L=4, d=256, 100 epochs, seed=42

| Config | Acc | Gamma | rho | S1 | S2 |
|--------|-----|-------|-----|----|----|
| DFA | 0.312 | 0.101 | -0.005 | 0 | 0 |
| vec wr=0.2 tw=1.0 | 0.243 | 0.001 | 0.000 | -0.100 | +0.005 |
| (for comparison: scalar CB wr=0.2 tgw=1.0 from Phase 4) | 0.283 | 0.179 | 0.009 | +0.079 | +0.014 |

**ONLINE FAILURE**: Vector field does WORSE than scalar CB online, despite being much better on frozen features. No config achieves S1>0.
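The S1 and S2 columns are not defined in these notes. The tabulated values are consistent with S1 being the Gamma margin over DFA and S2 the rho margin over DFA. A short arithmetic check under that assumption, using the Phase 5C rows (the function name `margins_vs_dfa` is hypothetical):

```python
def margins_vs_dfa(gamma, rho, dfa_gamma, dfa_rho):
    """Assumed definition: S1 = Gamma - Gamma_DFA, S2 = rho - rho_DFA."""
    return gamma - dfa_gamma, rho - dfa_rho

# DFA baseline and method rows copied from the Phase 5C table.
DFA = {"gamma": 0.101, "rho": -0.005}
rows = {
    "vec wr=0.2 tw=1.0": (0.001, 0.000),
    "scalar CB wr=0.2 tgw=1.0": (0.179, 0.009),
}

margins = {
    name: margins_vs_dfa(g, r, DFA["gamma"], DFA["rho"])
    for name, (g, r) in rows.items()
}
for name, (s1, s2) in margins.items():
    print(f"{name}: S1={s1:+.3f} S2={s2:+.3f}")
```

The vector row reproduces the tabulated S1 and S2 exactly. The scalar CB row matches S2 and agrees on S1 to within 0.001, which would be expected if its reported margin was computed against the Phase 4 DFA run rather than the one in this table.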
### Core Finding of Phases 4-5

**The bottleneck is NOT the credit estimator.** Improving credit quality from scalar CB (Gamma=0.12) to vector field (Gamma=0.36) on frozen features does NOT help online training. The bottleneck is in the **local surrogate update**: it cannot effectively exploit directional credit information, and co-adaptation between forward net and credit estimator rapidly degrades the signal.

### Experiment IDs (Phase 5)

- `vector_audit/`: Phase 5A smoke test
- `vector_audit_full/`: Phase 5A full 3-seed audit
- `frozen_cifar_vec/`: Phase 5B frozen CIFAR vector transfer
- `online_vec_pilot/`: Phase 5C online CIFAR vector pilot

---

## Phase 6: Exploitability Dissection

### Phase 6A: Snapshot Exploitability

**Setup**: BP-trained CIFAR snapshot (L=4, d=256, 61.9% acc). Offline-trained estimators. k-step local updates with real loss measurement.

**CRITICAL FINDING: Better credit → worse loss decrease.**

| Credit | Gamma | rho | dL_5step (inner_product) |
|--------|-------|-----|-------------------------|
| DFA | 0.009 | -0.023 | **-0.0001** (only negative!) |
| ScalarCB | 0.122 | 0.090 | +0.042 |
| Vec_M4 | 0.378 | 0.411 | +0.057 |
| Oracle BP | 1.000 | 0.998 | +0.011 |

Credit quality is ANTI-CORRELATED with loss decrease. DFA (worst credit) is the only method not increasing loss.

### Phase 6C: Local Update Rule Swap

Tested target-shift (`h_target = h_{l+1} - eta * a_norm`) at eta in {0.01, 0.1, 0.3, 1.0}.

Target-shift reduces damage (Vec dL: +0.057 → +0.002 at eta=0.1) but never achieves negative DeltaLoss for any non-DFA credit. Cosine rule produces near-zero effects.

### Root Cause

The inner-product surrogate is not a valid proxy for global loss minimization. The gradient of this surrogate w.r.t. block parameters is not the gradient of the global loss w.r.t. the same parameters. A BP-trained snapshot is at a minimum reachable only by full BP; local updates systematically push uphill.
DFA works because its credits are weak enough to produce near-zero updates, effectively doing nothing.

### This is Case B from the diagnostic logic tree:

Better credit does NOT lead to better snapshot loss decrease.
**The primary bottleneck is the local update rule itself, not the estimator or tracking.**

### Experiment IDs (Phase 6)

- `snapshot_exploit/`: Phase 6A snapshot exploitability
- `update_swap/`: Phase 6C local update rule comparison

---

## Phase 6.5: Protocol Audit (REVISES Phase 6A conclusion)

### Phase 6.5A: Same-Batch Linesearch

**CRITICAL REVISION**: Phase 6A's "better credit → worse loss" was a protocol artifact.

- Phase 6A used: normalized credit + held-out evaluation + gradient clamping.
- Phase 6.5A uses: raw + norm credit, same-batch + held-out eval, no clamping, eta sweep.

**With same-batch evaluation, better credit DOES produce more loss decrease:**

| Method | Gamma | dL_same (norm, all, best eta) | dL_held |
|--------|-------|-------------------------------|---------|
| DFA | 0.01 | -0.003 | +0.004 |
| ScalarCB | 0.12 | -0.025 | +0.027 |
| Vec_M4 | 0.38 | **-0.135** | +0.045 |
| Oracle BP | 1.00 | **-0.406** | +0.094 |

Same-batch loss decrease is MONOTONIC with credit quality. But held-out loss INCREASES for all methods, DFA included (+0.004), and the increase grows with credit quality.

**This is Case D: the local surrogate exploits credit correctly on training data, but the update overfits to the batch. Better credit = more effective overfitting.**

### Key confounds identified in Phase 6A:

1. **Normalization** inflated DFA's weak credits to same magnitude as Vec's
2. **Held-out evaluation** showed generalization failure, not exploitability failure
3. **Gradient clamping** distorted the natural credit quality ordering

### Raw vs Norm:

- Raw credit: tiny updates (BP grad RMS ≈ 0.00004). Vec raw best dL_same=-0.005
- Norm credit: amplifies to useful magnitude but also amplifies overfitting

### Revised diagnosis:

The bottleneck is NOT "surrogate can't exploit credit" (Phase 6A was wrong).
It IS "local surrogate with good credit overfits to mini-batch."
This suggests: regularization of local updates (larger batches, weight decay, gradient noise) could make better credit usable.

### Experiment IDs (Phase 6.5)

- `exploit_linesearch/`: Phase 6.5A smoke test (Oracle + Vec, last1, raw)
- `exploit_linesearch_full/`: Phase 6.5A full sweep (all methods, ranges, norm modes)
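The normalization confound from Phase 6.5A can be illustrated with the credit normalization formula from the Design Decisions, `a_norm = a / (RMS(a) + 1e-6)`. The sketch below assumes the RMS is taken over all entries of the credit tensor (the notes do not say whether it is per-sample or per-batch) and uses synthetic stand-ins for a weak and a strong credit signal. It shows that after normalization both have essentially the same magnitude, so a weak credit produces updates as large as a strong one.

```python
import numpy as np

def normalize_credit(a, eps=1e-6):
    """a_norm = a / (RMS(a) + eps); RMS over all entries is an assumption."""
    a = np.asarray(a, dtype=float)
    return a / (np.sqrt(np.mean(a ** 2)) + eps)

def rms(x):
    """Root-mean-square magnitude over all entries."""
    return float(np.sqrt(np.mean(np.square(x))))

rng = np.random.default_rng(0)
weak = 1e-3 * rng.standard_normal((32, 256))    # synthetic stand-in for a weak credit
strong = 1e-1 * rng.standard_normal((32, 256))  # synthetic stand-in for a strong credit

raw_ratio = rms(strong) / rms(weak)
norm_ratio = rms(normalize_credit(strong)) / rms(normalize_credit(weak))
print(f"raw magnitude ratio:        {raw_ratio:.1f}")
print(f"normalized magnitude ratio: {norm_ratio:.3f}")
```

With raw credits the update size tracks the estimator's own scale; with normalized credits it is set entirely by eta, which is why the Phase 6.5A sweep reports both modes.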