# Experiment Notes

## Experiment Phases
- **debug**: Initial implementation, rapid iteration (commit ce24e36)
- **pilot**: Controlled iteration (commits 0b9ebb2, 7baf7ae)
- **frozen**: Code at commit 0b9ebb2 for all reported results

## Status: PHASE 2 EXPLORE IN PROGRESS

---

## Final Results Summary

### Toy LQ (3 seeds, 8000 steps)
| Method | Costate Cosine | ρ | Nudging |
|--------|---------------|---|---------|
| DFA | 0.001±0.003 | 0.001±0.007 | 0.000±0.001 |
| State Bridge | 0.945±0.002 | 0.931±0.003 | -0.344±0.019 |
| Credit Bridge | 0.944±0.001 | 0.930±0.002 | -0.342±0.019 |

### CIFAR-10 (3 seeds, 100 epochs)
| Method | Test Accuracy |
|--------|:------------:|
| BP | 59.2%±0.4% |
| DFA | 30.0%±0.3% |
| Credit Bridge | 29.6%±1.0% |
| State Bridge | 18.5%±1.8% |

### CIFAR-10 Diagnostics (seed 42)
| Method | BP Cosine | ρ | Nudge |
|--------|-----------|---|-------|
| BP | 0.940 | 0.990 | -0.027 |
| Credit Bridge | 0.056 | ~0 | ~0 |
| DFA | 0.030 | 0.005 | ~0 |
| State Bridge | 0.021 | 0.004 | ~0 |
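
For reading the cosine columns, here is a minimal sketch of cosine similarity between two flattened vectors, for example a method's credit signal and the BP gradient at the same layer. These notes do not say how vectors are flattened or averaged across layers and batches, so that part is left out; the function name and the `eps` guard are illustrative assumptions.

```python
import math

def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two flat vectors (sequences of floats)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps (an assumption) guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + eps)
```

A value near 1 means the two directions agree, a value near 0 means they are close to orthogonal, which is where the non-BP rows above sit.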

---

## Key Findings

1. **Terminal gradient matching is essential** for credit bridge.
   Without it, V learns correct values but uninformative gradients (cos → 0.03).
   With it, credit bridge matches state bridge on toy (~0.94 cosine).

2. **State bridge fails on nonlinear systems** despite near-perfect state prediction.
   State prediction error → 0.0000 but test accuracy = 18.5% (worst of all methods).
   This confirms the core hypothesis: bridging state ≠ bridging credit.

3. **Credit bridge modestly outperforms DFA in BP cosine** (0.056 vs 0.030 on seed 42, ~2x),
   but both values are close to zero and accuracy is comparable (29.6% vs 30.0%).

4. **All non-BP methods struggle** on the deep 12-block MLP architecture.
   The best of them (DFA, 30.0%) trails BP (59.2%) by about 29 points.

---

## Changes Log
- `ce24e36`: Initial implementation
- `0b9ebb2`: Sync state bridge to use normalized MSE in both toy and CIFAR
- `7baf7ae`: Add experiment notes and .gitignore

## Experiment IDs
- `toy_lq_frozen/`: Final toy results (3 seeds, synced state bridge)
- `cifar10/`, `cifar10_seed123/`, `cifar10_seed456/`: Final CIFAR results
- `toy_lq/`: Debug-phase toy results (raw state bridge, for ablation)
- `smoke_test/`, `smoke_test2/`: FashionMNIST debug runs

## Design Decisions
1. Terminal gradient matching (term_grad_weight=1.0): output-layer-local, not hidden BP
2. DFA warmup for credit bridge (20% of epochs): prevents value net bootstrap failure
3. Normalized MSE for state bridge: numerical stability
4. Credit normalization: a_norm = a / (RMS(a) + 1e-6)
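
The credit normalization in item 4 can be sketched as follows. This is a minimal pure-Python version operating on one flat vector; whether the experiment code computes the RMS per sample, per layer, or per batch is not stated in these notes, and the function name `rms_normalize` is an assumption.

```python
import math

def rms_normalize(a, eps=1e-6):
    """Return a / (RMS(a) + eps) for a non-empty sequence of floats."""
    # RMS(a) = sqrt(mean(a_i^2)); scope (per sample vs per batch) is an assumption.
    rms = math.sqrt(sum(x * x for x in a) / len(a))
    return [x / (rms + eps) for x in a]
```

After normalization the vector has RMS close to 1 regardless of its original scale, and an all-zero input stays zero because of the `eps` term.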

---

## Phase 2: Explore (commit 2403960+)

### Synthetic Nonlinearity Ladder (Phase 1 of explore)

**Setup**: Teacher-student with phi_alpha(z) = (1-alpha)*z + alpha*tanh(z)
- alpha in {0, 0.25, 0.5, 1.0}, L in {2, 4, 8, 12}
- d=128, C=10, 80 epochs, 3 seeds
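
A minimal scalar sketch of the interpolated activation, together with its derivative. The formula is taken from the setup above; the function names and the scalar (rather than tensor) form are illustrative assumptions.

```python
import math

def phi_alpha(z, alpha):
    """Blend of identity and tanh: (1 - alpha) * z + alpha * tanh(z)."""
    return (1.0 - alpha) * z + alpha * math.tanh(z)

def phi_alpha_grad(z, alpha):
    """Derivative w.r.t. z: (1 - alpha) + alpha * (1 - tanh(z)^2)."""
    t = math.tanh(z)
    return (1.0 - alpha) + alpha * (1.0 - t * t)
```

At alpha=0 the block is linear with slope 1 everywhere. For large |z| the slope tends to 1-alpha, so only alpha=1.0 saturates fully.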

**Critical Finding**: Credit bridge advantage scales with nonlinearity.

At alpha=1.0 (full tanh), credit bridge is the BEST method on Gamma and rho at ALL depths:

| L | DFA Gamma | SB Gamma | CB Gamma | DFA rho | SB rho | CB rho |
|---|-----------|----------|----------|---------|--------|--------|
| 2 | 0.03 | 0.52 | **0.53** | 0.03 | 0.47 | **0.57** |
| 4 | 0.05 | 0.34 | **0.45** | 0.06 | 0.32 | **0.51** |
| 8 | 0.06 | 0.25 | **0.36** | 0.07 | 0.23 | **0.42** |
| 12 | 0.07 | 0.22 | **0.24** | 0.07 | 0.21 | **0.32** |

At alpha=0.5 (moderate nonlinearity), SB still wins on Gamma but CB wins on rho at L=4.
At alpha=0 (linear), SB dominates.

**Interpretation**: State bridge fails via Jacobian mismatch, not value prediction error.
Credit bridge avoids this by learning value field gradients directly.
The crossover happens around alpha=0.7-1.0 (estimated; the grid has no points between alpha=0.5 and alpha=1.0).

### CIFAR-10 Depth Scan (Phase 2 of explore, in progress)

Sweep L={2,4,6,8,12}, d=512, 100 epochs on CIFAR-10.
Preliminary results (L=2,4, seed=42):

| L | Method | Acc | Gamma | rho |
|---|--------|-----|-------|-----|
| 2 | DFA | 0.312 | 0.196 | 0.001 |
| 2 | CB | 0.311 | 0.175 | **0.031** |
| 4 | DFA | 0.314 | 0.100 | 0.003 |
| 4 | CB | 0.298 | 0.123 | -0.002 |

CIFAR is much harder: the rho signal is very weak for both non-BP methods shown (|rho| <= 0.031).

### Changes Log (explore phase)
- `2403960`: Add synthetic ladder and CIFAR depth scan experiments
- Student blocks now use pre-LayerNorm for stability (fixes L>=8 blowup)
- Added gradient clipping to block updates

### Experiment IDs (explore phase)
- `synth_ladder_smoke/`: Initial 3-alpha x 2-depth smoke test
- `synth_ladder_v2_lo/`: Full alpha=0,0.25 x L=2,4,8,12 x 3 seeds
- `synth_ladder_v2_hi/`: Full alpha=0.5,1.0 x L=2,4,8,12 x 3 seeds
- `cifar_depth_scan_s42/`: CIFAR L=2,4,6,8,12 x d=512 x seed=42 (COMPLETE)
- `boundary_ablation_s_sweep/`: s_type in {eT, deltaL, eT_hL, deltaL_hL}
- `boundary_ablation_tgw_sweep/`: tgw in {0, 0.25, 1.0, 4.0}
- `boundary_ablation_wr_sweep/`: warmup ratio in {0, 0.05, 0.2, 0.5}
- `boundary_ablation_s123/`, `boundary_ablation_s456/`: s_type sweep with seeds 123, 456
- `boundary_ablation_deltaL_wr/`: deltaL with warmup ratio sweep

### Phase 3 Results: Boundary-Condition Ablation

At alpha=1.0, L=4 (best synthetic regime), 3 seeds:

**s_type (conditioning code):**
| Code | Gamma | rho | Acc |
|------|-------|-----|-----|
| eT (dim=10) | 0.452±0.042 | 0.509±0.033 | 0.523 |
| deltaL (dim=d) | **0.562±0.007** | **0.510±0.014** | 0.448 |
| eT+proj(h_L) | 0.002 | 0.016 | 0.559 |
| deltaL+proj(h_L) | 0.018 | 0.026 | 0.564 |

**deltaL gives best Gamma. Concatenating h_L destroys credit quality (value net cheats): Gamma and rho collapse to near zero even though accuracy is highest for those two codes.**

**Terminal gradient matching weight:**
tgw=0 -> Gamma=0.12; tgw=1 -> Gamma=0.46; tgw=4 -> Gamma=0.57 (but acc drops).
Terminal gradient matching is monotonically beneficial for credit quality.

**Warmup ratio:**
wr=0 -> best Gamma (0.68) but worst acc (0.46).
wr=0.5 -> worst Gamma (0.23) but best acc (0.66).
Clear tradeoff between credit quality and accuracy.
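
To make the warmup ratio concrete, here is a minimal sketch of a hard-switch schedule: DFA for the first `warmup_ratio` fraction of epochs, credit bridge afterwards. The hard switch, the rounding rule, and both function names are assumptions; the experiment code may blend the two signals or round differently.

```python
def warmup_epochs(total_epochs, warmup_ratio):
    """Number of leading epochs trained with the DFA signal."""
    if not 0.0 <= warmup_ratio <= 1.0:
        raise ValueError("warmup_ratio must be in [0, 1]")
    # Rounding to the nearest epoch is an assumption.
    return int(round(total_epochs * warmup_ratio))

def credit_source(epoch, total_epochs, warmup_ratio):
    """Return "dfa" during warmup, "credit_bridge" afterwards (epoch is 0-indexed)."""
    if epoch < warmup_epochs(total_epochs, warmup_ratio):
        return "dfa"
    return "credit_bridge"
```

Under these assumptions, an 80-epoch run with wr=0.05 spends 4 epochs on DFA, and wr=0.5 spends 40.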

Best single config: deltaL + tgw=1.0 + wr=0.05 -> **Gamma=0.768, rho=0.691**