# Experiment Notes

## Experiment Phases

- **debug**: Initial implementation, rapid iteration (commits ce24e36)
- **pilot**: Controlled iteration (commits 0b9ebb2, 7baf7ae)
- **frozen**: Code at commit 0b9ebb2 for all reported results

## Status: PHASE 2 EXPLORE IN PROGRESS

---

## Final Results Summary

### Toy LQ (3 seeds, 8000 steps)

| Method | Costate Cosine | ρ | Nudging |
|--------|---------------|---|---------|
| DFA | 0.001±0.003 | 0.001±0.007 | 0.000±0.001 |
| State Bridge | 0.945±0.002 | 0.931±0.003 | -0.344±0.019 |
| Credit Bridge | 0.944±0.001 | 0.930±0.002 | -0.342±0.019 |

### CIFAR-10 (3 seeds, 100 epochs)

| Method | Test Accuracy |
|--------|:------------:|
| BP | 59.2%±0.4% |
| DFA | 30.0%±0.3% |
| Credit Bridge | 29.6%±1.0% |
| State Bridge | 18.5%±1.8% |

### CIFAR-10 Diagnostics (seed 42)

| Method | BP Cosine | ρ | Nudge |
|--------|-----------|---|-------|
| BP | 0.940 | 0.990 | -0.027 |
| Credit Bridge | 0.056 | ~0 | ~0 |
| DFA | 0.030 | 0.005 | ~0 |
| State Bridge | 0.021 | 0.004 | ~0 |

---

## Key Findings

1. **Terminal gradient matching is essential** for credit bridge. Without it, V learns correct values but uninformative gradients (cos → 0.03). With it, credit bridge matches state bridge on toy (~0.94 cosine).
2. **State bridge fails on nonlinear systems** despite near-perfect state prediction. State prediction error → 0.0000 but test accuracy = 18.5% (worst of all methods). This confirms the core hypothesis: bridging state ≠ bridging credit.
3. **Credit bridge modestly outperforms DFA in BP cosine** (0.056 vs 0.030, ~2x) but accuracy is comparable (29.6% vs 30.0%).
4. **All non-BP methods struggle** on the deep 12-block MLP architecture. The gap to BP (59.2%) is large for all methods.
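
The cosine figures quoted in these findings measure how well a method's credit signal lines up with a reference direction. A minimal sketch of that kind of alignment metric, assuming it is an ordinary cosine similarity between flattened vectors. These notes do not state the exact definition, and the function name and example vectors below are hypothetical:

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two flat vectors (sequences of floats)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps keeps the division finite if either vector is all zeros.
    return dot / (norm_u * norm_v + eps)


# Hypothetical vectors, made up purely for illustration (not experiment data).
bp_grad = [0.5, -1.0, 0.25, 0.0]
aligned = [1.0, -2.0, 0.5, 0.0]      # same direction as bp_grad, different scale
orthogonal = [0.0, 0.0, 0.0, 1.0]    # no overlap with bp_grad

print(cosine_similarity(aligned, bp_grad))
print(cosine_similarity(orthogonal, bp_grad))
```

Because cosine ignores scale, a value near 1 only says the direction matches. Under this reading, the ~0.03 to 0.06 values in the CIFAR-10 diagnostics would indicate directions that are nearly orthogonal to the BP gradient.
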
---

## Changes Log

- `ce24e36`: Initial implementation
- `0b9ebb2`: Sync state bridge to use normalized MSE in both toy and CIFAR
- `7baf7ae`: Add experiment notes and .gitignore

## Experiment IDs

- `toy_lq_frozen/`: Final toy results (3 seeds, synced state bridge)
- `cifar10/`, `cifar10_seed123/`, `cifar10_seed456/`: Final CIFAR results
- `toy_lq/`: Debug-phase toy results (raw state bridge, for ablation)
- `smoke_test/`, `smoke_test2/`: FashionMNIST debug runs

## Design Decisions

1. Terminal gradient matching (term_grad_weight=1.0): output-layer-local, not hidden BP
2. DFA warmup for credit bridge (20% of epochs): prevents value net bootstrap failure
3. Normalized MSE for state bridge: numerical stability
4. Credit normalization: `a_norm = a / (RMS(a) + 1e-6)`

---

## Phase 2: Explore (commit 2403960+)

### Synthetic Nonlinearity Ladder (Phase 1 of explore)

**Setup**: Teacher-student with `phi_alpha(z) = (1-alpha)*z + alpha*tanh(z)`

- alpha in {0, 0.25, 0.5, 1.0}, L in {2, 4, 8, 12}
- d=128, C=10, 80 epochs, 3 seeds

**Critical Finding**: Credit bridge advantage scales with nonlinearity. At alpha=1.0 (full tanh), credit bridge is the BEST method on Gamma and rho at ALL depths:

| L | DFA Gamma | SB Gamma | CB Gamma | DFA rho | SB rho | CB rho |
|---|-----------|----------|----------|---------|--------|--------|
| 2 | 0.03 | 0.52 | **0.53** | 0.03 | 0.47 | **0.57** |
| 4 | 0.05 | 0.34 | **0.45** | 0.06 | 0.32 | **0.51** |
| 8 | 0.06 | 0.25 | **0.36** | 0.07 | 0.23 | **0.42** |
| 12 | 0.07 | 0.22 | **0.24** | 0.07 | 0.21 | **0.32** |

At alpha=0.5 (moderate nonlinearity), SB still wins on Gamma but CB wins on rho at L=4. At alpha=0 (linear), SB dominates.

**Interpretation**: State bridge fails via Jacobian mismatch, not value prediction error. Credit bridge avoids this by learning value field gradients directly. The crossover happens around alpha=0.7-1.0.

### CIFAR-10 Depth Scan (Phase 2 of explore, in progress)

Sweep L={2,4,6,8,12}, d=512, 100 epochs on CIFAR-10.
Preliminary results (L=2,4, seed=42):

| L | Method | Acc | Gamma | rho |
|---|--------|-----|-------|-----|
| 2 | DFA | 0.312 | 0.196 | 0.001 |
| 2 | CB | 0.311 | 0.175 | **0.031** |
| 4 | DFA | 0.314 | 0.100 | 0.003 |
| 4 | CB | 0.298 | 0.123 | -0.002 |

CIFAR is much harder -- rho signal is very weak for all non-BP methods.

### Changes Log (explore phase)

- `2403960`: Add synthetic ladder and CIFAR depth scan experiments
- Student blocks now use pre-LayerNorm for stability (fixes L>=8 blowup)
- Added gradient clipping to block updates

### Experiment IDs (explore phase)

- `synth_ladder_smoke/`: Initial 3-alpha x 2-depth smoke test
- `synth_ladder_v2_lo/`: Full alpha=0,0.25 x L=2,4,8,12 x 3 seeds
- `synth_ladder_v2_hi/`: Full alpha=0.5,1.0 x L=2,4,8,12 x 3 seeds
- `cifar_depth_scan_s42/`: CIFAR L=2,4,6,8,12 x d=512 x seed=42 (COMPLETE)
- `boundary_ablation_s_sweep/`: s_type in {eT, deltaL, eT_hL, deltaL_hL}
- `boundary_ablation_tgw_sweep/`: tgw in {0, 0.25, 1.0, 4.0}
- `boundary_ablation_wr_sweep/`: warmup ratio in {0, 0.05, 0.2, 0.5}
- `boundary_ablation_s123/`, `boundary_ablation_s456/`: s_type sweep with seeds 123, 456
- `boundary_ablation_deltaL_wr/`: deltaL with warmup ratio sweep

### Phase 3 Results: Boundary-Condition Ablation

At alpha=1.0, L=4 (best synthetic regime), 3 seeds:

**s_type (conditioning code):**

| Code | Gamma | rho | Acc |
|------|-------|-----|-----|
| eT (dim=10) | 0.452+/-0.042 | 0.509+/-0.033 | 0.523 |
| deltaL (dim=d) | **0.562+/-0.007** | **0.510+/-0.014** | 0.448 |
| eT+proj(h_L) | 0.002 | 0.016 | 0.559 |
| deltaL+proj(h_L) | 0.018 | 0.026 | 0.564 |

**deltaL gives best Gamma. Concatenating h_L destroys credit quality (value net cheats).**

**Terminal gradient matching weight:** tgw=0 -> Gamma=0.12; tgw=1 -> Gamma=0.46; tgw=4 -> Gamma=0.57 (but acc drops). Terminal gradient matching is monotonically beneficial for credit quality.

**Warmup ratio:** wr=0 -> best Gamma (0.68) but worst acc (0.46).
wr=0.5 -> worst Gamma (0.23) but best acc (0.66). Clear tradeoff between credit quality and accuracy.

Best single config: deltaL + tgw=1.0 + wr=0.05 -> **Gamma=0.768, rho=0.691**
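
For reference, the two formulas stated in these notes can be sketched in plain Python: the activation family phi_alpha from the synthetic ladder setup and the credit normalization from the design decisions. This is an illustrative sketch only. It assumes scalar inputs for phi_alpha and a flat list of floats for the credit vector, and the function names and sample values are hypothetical rather than taken from the experiment code:

```python
import math


def phi_alpha(z, alpha):
    """(1 - alpha) * z + alpha * tanh(z): linear at alpha=0, pure tanh at alpha=1."""
    return (1.0 - alpha) * z + alpha * math.tanh(z)


def rms(a):
    """Root mean square of a non-empty flat list of floats."""
    if not a:
        raise ValueError("rms of an empty vector is undefined")
    return math.sqrt(sum(x * x for x in a) / len(a))


def normalize_credit(a, eps=1e-6):
    """a_norm = a / (RMS(a) + eps), applied elementwise."""
    scale = rms(a) + eps
    return [x / scale for x in a]


# The alpha grid used in the synthetic ladder, evaluated at an arbitrary point.
for alpha in (0.0, 0.25, 0.5, 1.0):
    print(alpha, phi_alpha(2.0, alpha))

# Hypothetical credit vector; after normalization its RMS is close to 1.
credit = [3.0, -4.0, 0.0, 0.0]
print(rms(normalize_credit(credit)))
```

The `1e-6` term only matters when the credit vector is near zero, where it prevents division by zero. For vectors of ordinary magnitude the normalized credit has an RMS of approximately 1 regardless of the original scale.
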