# Experiment Notes

## Experiment Phases
- **debug**: Initial implementation, rapid iteration (commit ce24e36)
- **pilot**: Controlled iteration (commits 0b9ebb2, 7baf7ae)
- **frozen**: Code at commit 0b9ebb2 for all reported results

## Status: PHASE 5 VECTOR FIELD AUDIT + TRANSFER COMPLETE

---

## Final Results Summary

### Toy LQ (3 seeds, 8000 steps)
| Method | Costate Cosine | ρ | Nudging |
|--------|---------------|---|---------|
| DFA | 0.001±0.003 | 0.001±0.007 | 0.000±0.001 |
| State Bridge | 0.945±0.002 | 0.931±0.003 | -0.344±0.019 |
| Credit Bridge | 0.944±0.001 | 0.930±0.002 | -0.342±0.019 |

### CIFAR-10 (3 seeds, 100 epochs)
| Method | Test Accuracy |
|--------|:------------:|
| BP | 59.2%±0.4% |
| DFA | 30.0%±0.3% |
| Credit Bridge | 29.6%±1.0% |
| State Bridge | 18.5%±1.8% |

### CIFAR-10 Diagnostics (seed 42)
| Method | BP Cosine | ρ | Nudge |
|--------|-----------|---|-------|
| BP | 0.940 | 0.990 | -0.027 |
| Credit Bridge | 0.056 | ~0 | ~0 |
| DFA | 0.030 | 0.005 | ~0 |
| State Bridge | 0.021 | 0.004 | ~0 |

---

## Key Findings

1. **Terminal gradient matching is essential** for credit bridge.
   Without it, V learns correct values but uninformative gradients (cos → 0.03).
   With it, credit bridge matches state bridge on toy (~0.94 cosine).

2. **State bridge fails on nonlinear systems** despite near-perfect state prediction.
   State prediction error → 0.0000 but test accuracy = 18.5% (worst of all methods).
   This confirms the core hypothesis: bridging state ≠ bridging credit.

3. **Credit bridge modestly outperforms DFA in BP cosine** (0.056 vs 0.030, ~2x)
   but accuracy is comparable (29.6% vs 30.0%).

4. **All non-BP methods struggle** on the deep 12-block MLP architecture.
   The gap to BP (59.2%) is large for all methods.

---

## Changes Log
- `ce24e36`: Initial implementation
- `0b9ebb2`: Sync state bridge to use normalized MSE in both toy and CIFAR
- `7baf7ae`: Add experiment notes and .gitignore

## Experiment IDs
- `toy_lq_frozen/`: Final toy results (3 seeds, synced state bridge)
- `cifar10/`, `cifar10_seed123/`, `cifar10_seed456/`: Final CIFAR results
- `toy_lq/`: Debug-phase toy results (raw state bridge, for ablation)
- `smoke_test/`, `smoke_test2/`: FashionMNIST debug runs

## Design Decisions
1. Terminal gradient matching (term_grad_weight=1.0): output-layer-local, not hidden BP
2. DFA warmup for credit bridge (20% of epochs): prevents value net bootstrap failure
3. Normalized MSE for state bridge: numerical stability
4. Credit normalization: a_norm = a / (RMS(a) + 1e-6)
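
The credit normalization in item 4 can be sketched in a few lines. This is a minimal illustration, assuming the RMS is taken over the entries of a single credit vector; the function names `rms` and `normalize_credit` are hypothetical and not taken from the codebase.

```python
import math

def rms(a):
    """Root-mean-square of a non-empty sequence of floats."""
    return math.sqrt(sum(x * x for x in a) / len(a))

def normalize_credit(a, eps=1e-6):
    """Return a / (RMS(a) + eps); eps keeps the all-zero case finite."""
    scale = rms(a) + eps
    return [x / scale for x in a]

# Assumed example vector, for illustration only.
credit = [3.0, -4.0, 0.0, 0.0]
normalized = normalize_credit(credit)   # RMS of the result is close to 1
zeros = normalize_credit([0.0, 0.0, 0.0])  # stays all zeros, no division by zero
```

After normalization the credit has roughly unit RMS regardless of its original scale, so the size of the block update is set by the learning rate rather than by the magnitude of the estimator's output.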

---

## Phase 2: Explore (commit 2403960+)

### Synthetic Nonlinearity Ladder (Phase 1 of explore)

**Setup**: Teacher-student with phi_alpha(z) = (1-alpha)*z + alpha*tanh(z)
- alpha in {0, 0.25, 0.5, 1.0}, L in {2, 4, 8, 12}
- d=128, C=10, 80 epochs, 3 seeds
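
The activation family in the setup can be written down directly. The sketch below implements phi_alpha exactly as stated and adds its derivative, which follows from d/dz tanh(z) = 1 - tanh(z)^2; the function names are hypothetical.

```python
import math

def phi_alpha(z, alpha):
    """Interpolate between identity (alpha=0) and tanh (alpha=1)."""
    return (1.0 - alpha) * z + alpha * math.tanh(z)

def phi_alpha_prime(z, alpha):
    """Derivative of phi_alpha: (1 - alpha) + alpha * (1 - tanh(z)^2)."""
    t = math.tanh(z)
    return (1.0 - alpha) + alpha * (1.0 - t * t)

# Slope at a moderately large pre-activation for each alpha on the ladder.
slopes = {alpha: phi_alpha_prime(2.0, alpha) for alpha in (0.0, 0.25, 0.5, 1.0)}
```

The slope is exactly 1 everywhere at alpha=0 and falls toward 1 - alpha as |z| grows, so the layer Jacobian becomes increasingly input-dependent as alpha goes up.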

**Critical Finding**: Credit bridge advantage scales with nonlinearity.

At alpha=1.0 (full tanh), credit bridge is the BEST method on Gamma and rho at ALL depths:

| L | DFA Gamma | SB Gamma | CB Gamma | DFA rho | SB rho | CB rho |
|---|-----------|----------|----------|---------|--------|--------|
| 2 | 0.03 | 0.52 | **0.53** | 0.03 | 0.47 | **0.57** |
| 4 | 0.05 | 0.34 | **0.45** | 0.06 | 0.32 | **0.51** |
| 8 | 0.06 | 0.25 | **0.36** | 0.07 | 0.23 | **0.42** |
| 12 | 0.07 | 0.22 | **0.24** | 0.07 | 0.21 | **0.32** |

At alpha=0.5 (moderate nonlinearity), SB still wins on Gamma but CB wins on rho at L=4.
At alpha=0 (linear), SB dominates.

**Interpretation**: State bridge fails via Jacobian mismatch, not state prediction error.
Credit bridge avoids this by learning value-field gradients directly.
The crossover lies between the tested values alpha=0.5 and alpha=1.0 (estimated around alpha=0.7-1.0).

### CIFAR-10 Depth Scan (Phase 2 of explore, in progress)

Sweep L={2,4,6,8,12}, d=512, 100 epochs on CIFAR-10.
Preliminary results (L=2,4, seed=42):

| L | Method | Acc | Gamma | rho |
|---|--------|-----|-------|-----|
| 2 | DFA | 0.312 | 0.196 | 0.001 |
| 2 | CB | 0.311 | 0.175 | **0.031** |
| 4 | DFA | 0.314 | 0.100 | 0.003 |
| 4 | CB | 0.298 | 0.123 | -0.002 |

CIFAR is much harder -- rho signal is very weak for all non-BP methods.

### Changes Log (explore phase)
- `2403960`: Add synthetic ladder and CIFAR depth scan experiments
- Student blocks now use pre-LayerNorm for stability (fixes L>=8 blowup)
- Added gradient clipping to block updates

### Experiment IDs (explore phase)
- `synth_ladder_smoke/`: Initial 3-alpha x 2-depth smoke test
- `synth_ladder_v2_lo/`: Full alpha=0,0.25 x L=2,4,8,12 x 3 seeds
- `synth_ladder_v2_hi/`: Full alpha=0.5,1.0 x L=2,4,8,12 x 3 seeds
- `cifar_depth_scan_s42/`: CIFAR L=2,4,6,8,12 x d=512 x seed=42 (COMPLETE)
- `boundary_ablation_s_sweep/`: s_type in {eT, deltaL, eT_hL, deltaL_hL}
- `boundary_ablation_tgw_sweep/`: tgw in {0, 0.25, 1.0, 4.0}
- `boundary_ablation_wr_sweep/`: warmup ratio in {0, 0.05, 0.2, 0.5}
- `boundary_ablation_s123/`, `boundary_ablation_s456/`: s_type sweep with seeds 123, 456
- `boundary_ablation_deltaL_wr/`: deltaL with warmup ratio sweep

### Phase 3 Results: Boundary-Condition Ablation

At alpha=1.0, L=4 (best synthetic regime), 3 seeds:

**s_type (conditioning code):**
| Code | Gamma | rho | Acc |
|------|-------|-----|-----|
| eT (dim=10) | 0.452+/-0.042 | 0.509+/-0.033 | 0.523 |
| deltaL (dim=d) | **0.562+/-0.007** | **0.510+/-0.014** | 0.448 |
| eT+proj(h_L) | 0.002 | 0.016 | 0.559 |
| deltaL+proj(h_L) | 0.018 | 0.026 | 0.564 |

**deltaL gives best Gamma. Concatenating h_L destroys credit quality (value net cheats).**

**Terminal gradient matching weight:**
tgw=0 -> Gamma=0.12; tgw=1 -> Gamma=0.46; tgw=4 -> Gamma=0.57 (but acc drops).
Terminal gradient matching is monotonically beneficial for credit quality.

**Warmup ratio:**
wr=0 -> best Gamma (0.68) but worst acc (0.46).
wr=0.5 -> worst Gamma (0.23) but best acc (0.66).
Clear tradeoff between credit quality and accuracy.

Best single config: deltaL + tgw=1.0 + wr=0.05 -> **Gamma=0.768, rho=0.691**

### CIFAR deltaL Test
deltaL conditioning (s=grad_{h_L} CE, dim=512) on CIFAR L=4: FAILED.
Acc=17.2%, Gamma≈0, rho≈0. The 512-dim conditioning is too high-dimensional
for the value net. Confirms the scalar V approach has a dimensionality bottleneck.

### Pivot Recommendation: Direct Vector Credit Field
See `report_explore/MEMO_pivot_vector_field.md`.
Instead of deriving credit as grad_h V from a scalar V_phi, learn a vector field a_phi(h_l, t_l, s) -> R^d directly.
Train with a perturbation-based target: match <a, v> to the actual loss change along each perturbation direction v.
This still satisfies the no-hidden-BP-anchor constraint.
Minimal test: synthetic alpha=1.0, L=4 with M=4 perturbation directions.
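
A minimal sketch of the perturbation-based target, under stated assumptions: the credit `a` is treated as a free vector rather than the output of a network a_phi, the loss change is measured with a central difference, the directions are random unit vectors, and `toy_loss` is a hypothetical quadratic chosen because its gradient is known. None of these names come from the codebase.

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def random_unit_vector(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def directional_targets(loss_fn, h, directions, eps=1e-3):
    """Measured loss change per unit step along each direction (central difference)."""
    targets = []
    for v in directions:
        h_plus = [x + eps * y for x, y in zip(h, v)]
        h_minus = [x - eps * y for x, y in zip(h, v)]
        targets.append((loss_fn(h_plus) - loss_fn(h_minus)) / (2.0 * eps))
    return targets

def perturbation_loss(a, directions, targets):
    """Mean squared mismatch between <a, v_m> and the measured loss change."""
    return sum((dot(a, v) - t) ** 2 for v, t in zip(directions, targets)) / len(directions)

rng = random.Random(0)
d, M = 8, 4                                   # illustrative sizes only
h = [rng.gauss(0.0, 1.0) for _ in range(d)]
toy_loss = lambda x: 0.5 * dot(x, x)          # hypothetical loss; its gradient is x
directions = [random_unit_vector(d, rng) for _ in range(M)]
targets = directional_targets(toy_loss, h, directions)

loss_at_true_gradient = perturbation_loss(h, directions, targets)
loss_at_zero = perturbation_loss([0.0] * d, directions, targets)
```

On this toy loss the true gradient drives the objective to numerical zero, while a zero credit vector does not. Only forward evaluations of the loss are needed to build the targets, which is why no backpropagated hidden-layer signal enters the training of `a`.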

---

## Phase 4: Diagnostic Dissection (commit TBD)

### Phase A: Frozen CIFAR Credit Recovery

**Setup**: BP-trained CIFAR-10 network (L=4, d=256, 61.7% test acc), frozen.
Train credit estimators on fixed representations.

**Key Result**: On frozen BP features, credit estimators CAN recover meaningful credit.

| Method | mean Gamma | mean rho | mean nudge |
|--------|-----------|---------|-----------|
| DFA (random) | 0.006 | 0.005 | -0.000022 |
| State Bridge (eT) | **0.287** | **0.246** | **-0.000957** |
| Scalar CB (eT) | 0.115 | 0.125 | -0.000370 |
| Scalar CB (deltaL) | 0.070 | 0.062 | -0.000160 |

**Surprising**: State bridge is BEST on frozen BP features (opposite of synthetic).
BP-trained features are quasi-linear, so SB's Jacobian approximation works well.

Also tested L=6 d=256 and L=4 d=512: same pattern (SB > CB_eT > CB_deltaL >> DFA).

**Implication**: Estimator is NOT the fundamental bottleneck. The online training
failure is due to co-adaptation between forward net and credit estimator.

### Phase B: Online Shallow CIFAR Conditioning Scan

**Setup**: L=4, d=256, CIFAR-10, 100 epochs, seed=42 (then 3-seed on best config).
Sweep: methods={DFA, SB, CB_eT, CB_deltaL}, wr={0,0.05,0.2}, tgw={1.0,4.0}.

**Found 2 positive configs with S1>0 AND S2>0:**

| Config | Acc | Gamma | rho | S1 vs DFA | S2 vs DFA |
|--------|-----|-------|-----|-----------|-----------|
| cb_eT wr=0.2 tgw=1.0 | 0.283 | 0.179 | 0.009 | **+0.079** | **+0.014** |
| cb_eT wr=0.2 tgw=4.0 | 0.285 | 0.187 | 0.002 | **+0.087** | **+0.007** |

**3-seed validation of cb_eT wr=0.2 tgw=1.0:**
- Seed 42: S1=+0.079, S2=+0.014 (both positive)
- Seed 123: S1=+0.059, S2=-0.004 (S1 positive, S2 marginally negative)
- Seed 456: S1=+0.135, S2=+0.003 (both positive, S2 barely)

S1 is consistently positive. S2 is marginal: positive in two seeds, slightly negative in one.

**CB_deltaL failed entirely on online CIFAR** (all configs near chance, Gamma≈0, rho≈0).
**SB_eT also failed online** (Gamma=0.025, rho=-0.013, despite being best on frozen features).

### Phase C: Direct Vector Credit Field Pilot

**Setup**: Synthetic alpha=1.0, L={4,8}, d=128, 80 epochs, 3 seeds.
Compare DFA vs Scalar CB vs Vector Field (M=4, M=8).

**BREAKTHROUGH RESULT:**

| Method | L=4 Gamma | L=4 rho | L=8 Gamma | L=8 rho |
|--------|-----------|---------|-----------|---------|
| DFA | 0.01±0.01 | 0.01±0.01 | 0.08±0.04 | 0.08±0.04 |
| Scalar CB | 0.34±0.10 | 0.33±0.11 | 0.26±0.03 | 0.29±0.02 |
| **Vector M=4** | **0.91±0.05** | **0.96±0.01** | **0.96±0.01** | **0.95±0.01** |
| **Vector M=8** | **0.84±0.10** | **0.92±0.05** | **0.90±0.10** | **0.93±0.04** |

**Vector field improves over scalar CB by +0.3 to +0.7 on BOTH Gamma and rho.**
This is consistent across all 6 seed x depth combinations.

The perturbation-based directional target directly trains the credit vector to
predict local loss sensitivity, avoiding the scalar V curvature problem entirely.

### Experiment IDs (Phase 4)
- `frozen_cifar/`: Frozen CIFAR credit recovery (L=4 d=256, L=6 d=256, L=4 d=512)
- `online_shallow/`: Phase B online scan (L=4 d=256, all configs)
- `online_shallow_3seed/`: 3-seed validation of best config
- `vector_credit_pilot/`: Phase C vector field vs scalar CB on synthetic

### Answers to Phase 4 Questions

**Q1: On frozen CIFAR, can the current credit estimator recover useful credit?**
YES. Scalar CB achieves Gamma=0.115, rho=0.125, roughly 20x better than DFA (Gamma=0.006, rho=0.005).
State bridge is even better (Gamma=0.287, rho=0.246).

**Q2: If yes, why does online training still fail?**
Co-adaptation. The forward net's features change every epoch, so the value net's
credit goes stale. DFA avoids this because its feedback is fixed and random, so its credit does not depend on a learned estimator.
The wr=0.2 warmup helps by letting DFA carry the forward net to a useful feature
regime before CB takes over, but the S2 signal remains marginal.

**Q3: Can a direct vector credit field improve over scalar CB?**
YES, dramatically. On synthetic alpha=1.0, vector field (M=4) achieves Gamma=0.91,
rho=0.96, compared to scalar CB's Gamma=0.34, rho=0.33.
The scalar V curvature problem is real and avoidable.

### Next Steps
1. Test vector credit field on CIFAR-10 (the real task)
2. Key concern: M=4 perturbation directions may not suffice for d=256/512 dimensions
   (signal/noise ~ sqrt(M/d) = sqrt(4/256) = 0.125 at d=256)
3. May need M=32+ for CIFAR, which is expensive but tractable
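
The sqrt(M/d) heuristic from item 2 is easy to tabulate for the sizes mentioned in these notes. This is only the rule of thumb stated above evaluated numerically, not a derived bound; `snr_estimate` is a hypothetical helper name.

```python
import math

def snr_estimate(num_directions, dim):
    """Heuristic signal/noise ratio sqrt(M/d) for M random probes in d dimensions."""
    return math.sqrt(num_directions / dim)

# (M, d) pairs that appear in the notes: synthetic d=128, CIFAR d=256 and d=512.
table = {
    (4, 128): snr_estimate(4, 128),
    (4, 256): snr_estimate(4, 256),
    (4, 512): snr_estimate(4, 512),
    (32, 256): snr_estimate(32, 256),
}

for (m, d), value in sorted(table.items()):
    print(f"M={m:>2} d={d:>3} sqrt(M/d)={value:.3f}")
```

Under this heuristic, moving from M=4 to M=32 at d=256 raises the ratio from 0.125 to about 0.354, at the cost of eight times as many perturbed forward passes.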

---

## Phase 5: Vector Field Audit + Real-Task Transfer

### Phase 5A: Synthetic Audit (4 sanity checks)

**Setup**: alpha=1.0, L={4,8}, d=128, 3 seeds, 80 epochs

**Smoke test result (L=4, seed=42):**

| Method | Gamma | rho | nudge |
|--------|-------|-----|-------|
| scalar_cb | 0.224 | 0.210 | -0.007 |
| vec_eT_M4 | **0.847** | **0.951** | -0.026 |
| vec_shuffleCtrl | 0.051 | 0.068 | -0.001 |
| vec_noTerm | 0.955 | 0.971 | -0.027 |
| vec_onesided | 0.832 | 0.943 | -0.024 |

- **Check B (shuffle)**: PASS. Shuffled control collapses (5/6 near zero across full audit).
- **Check C (noTerm)**: Terminal term NOT needed. Perturbation target alone gives 0.95+ Gamma.
- **Check D (onesided)**: PASS. One-sided ≈ central difference.

**Full 3-seed audit**: All 6 configs pass (delta_Gamma >= 0.49, delta_rho >= 0.55)
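
Check D compares two standard finite-difference estimators of a directional derivative. The sketch below contrasts them on a hypothetical smooth loss (a sum of log-cosh terms, whose gradient is tanh); the loss, the point `h`, the direction `v`, and the step size are illustrative assumptions, not values from the experiments.

```python
import math

def loss(h):
    """Toy smooth loss (hypothetical): sum of log(cosh(x)); its gradient is tanh(x)."""
    return sum(math.log(math.cosh(x)) for x in h)

def shifted(h, v, step):
    return [x + step * y for x, y in zip(h, v)]

def one_sided(loss_fn, h, v, eps):
    """Forward difference: one extra loss evaluation, error O(eps)."""
    return (loss_fn(shifted(h, v, eps)) - loss_fn(h)) / eps

def central(loss_fn, h, v, eps):
    """Central difference: two extra loss evaluations, error O(eps^2)."""
    return (loss_fn(shifted(h, v, eps)) - loss_fn(shifted(h, v, -eps))) / (2.0 * eps)

h = [0.5, -1.0, 0.25]
v = [0.6, 0.0, 0.8]  # unit-norm direction
exact = sum(math.tanh(x) * y for x, y in zip(h, v))

err_one_sided = abs(one_sided(loss, h, v, 1e-2) - exact)
err_central = abs(central(loss, h, v, 1e-2) - exact)
```

The central estimate is more accurate for the same step size, but the one-sided estimate reuses the unperturbed loss and so needs only one additional forward pass per direction, which matters when M perturbation directions are evaluated per layer.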

### Phase 5B: Frozen CIFAR Vector Transfer

**Setup**: CIFAR-10, frozen BP ref (L=4, d=256, 61.7%), 100 epochs estimators

| Method | Gamma | rho | nudge |
|--------|-------|-----|-------|
| DFA | 0.005 | 0.005 | -0.000006 |
| ScalarCB_eT | 0.115 | 0.120 | -0.000370 |
| StateBridge_eT | 0.287 | 0.264 | -0.000957 |
| **Vec_eT_M4** | **0.364** | **0.426** | **-0.001406** |

**TRANSFER SUCCESS**: Vec beats scalar CB by +0.25 Gamma, +0.31 rho.
Vec also beats state bridge on rho (0.43 vs 0.26).
M=4 is sufficient (M=8, M=16 give same results).

### Phase 5C: Online CIFAR Vector Pilot

**Setup**: CIFAR-10, L=4, d=256, 100 epochs, seed=42

| Config | Acc | Gamma | rho | S1 | S2 |
|--------|-----|-------|-----|----|----|
| DFA | 0.312 | 0.101 | -0.005 | 0 | 0 |
| vec wr=0.2 tw=1.0 | 0.243 | 0.001 | 0.000 | -0.100 | +0.005 |
| (for comparison: scalar CB wr=0.2 tgw=1.0 from Phase 4) | 0.283 | 0.179 | 0.009 | +0.079 | +0.014 |

**ONLINE FAILURE**: Vector field does WORSE than scalar CB online, despite being
much better on frozen features. No config achieves S1>0.
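
The notes do not define S1 and S2. The table rows are numerically consistent with reading them as margins over the DFA baseline, S1 = Gamma - Gamma_DFA and S2 = rho - rho_DFA, and the sketch below checks that reading against the tabulated values. This interpretation is an assumption inferred from the numbers, and `margins_vs_dfa` is a hypothetical helper.

```python
def margins_vs_dfa(gamma, rho, dfa_gamma, dfa_rho):
    """Assumed definition: (S1, S2) = (Gamma - Gamma_DFA, rho - rho_DFA)."""
    return round(gamma - dfa_gamma, 3), round(rho - dfa_rho, 3)

# DFA baseline row from the Phase 5C table.
DFA_GAMMA, DFA_RHO = 0.101, -0.005

# Vector field row: Gamma=0.001, rho=0.000 (table reports S1=-0.100, S2=+0.005).
vec_s1, vec_s2 = margins_vs_dfa(0.001, 0.000, DFA_GAMMA, DFA_RHO)

# Scalar CB row: Gamma=0.179, rho=0.009 (table reports S1=+0.079, S2=+0.014).
# The recomputed S1 can differ in the last digit because the table values are rounded.
cb_s1, cb_s2 = margins_vs_dfa(0.179, 0.009, DFA_GAMMA, DFA_RHO)
```

Under this reading, a positive S1 means the method's credit aligns better than DFA's, and a positive S2 means the same for rho.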

### Core Finding of Phases 4-5

**The bottleneck is NOT the credit estimator.** Improving credit quality from
scalar CB (Gamma=0.12) to vector field (Gamma=0.36) on frozen features does NOT
help online training. The bottleneck is in the **local surrogate update**:
<F_l(h_l), a_l> cannot effectively exploit directional credit information,
and co-adaptation between forward net and credit estimator rapidly degrades
the signal.
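
To pin down what the local surrogate computes, the sketch below uses a single linear block F(h) = W h as a stand-in for F_l and a hypothetical squared-error loss on the block output. With the credit a held constant, the gradient of <W h, a> with respect to W is the outer product of a and h, and when a is the exact loss gradient at the block output this equals the true gradient. The linear block, the loss, and all names are illustrative assumptions.

```python
def matvec(W, h):
    """Apply a linear block F(h) = W h (a stand-in for F_l)."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def surrogate_grad(a, h):
    """Gradient of <W h, a> w.r.t. W with a held constant: outer(a, h)."""
    return [[ai * hj for hj in h] for ai in a]

def true_loss(W, h, target):
    """Hypothetical downstream loss 0.5 * ||W h - target||^2."""
    y = matvec(W, h)
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))

W = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.5]]
h = [1.0, 2.0, -1.0]
target = [0.5, -0.5]

# Exact credit at the block output: dL/dy = y - target.
y = matvec(W, h)
a_exact = [yi - ti for yi, ti in zip(y, target)]
grad_local = surrogate_grad(a_exact, h)

# Finite-difference check of one entry of the true gradient dL/dW.
eps = 1e-5
W_plus = [row[:] for row in W]
W_plus[0][1] += eps
W_minus = [row[:] for row in W]
W_minus[0][1] -= eps
fd_entry = (true_loss(W_plus, h, target) - true_loss(W_minus, h, target)) / (2.0 * eps)
```

In this idealized case the surrogate update is only as good as the credit fed into it, and the credit is only valid for the features it was estimated on. This toy case does not reproduce the online failure described above; it only makes the form of the update concrete.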

### Experiment IDs (Phase 5)
- `vector_audit/`: Phase 5A smoke test
- `vector_audit_full/`: Phase 5A full 3-seed audit
- `frozen_cifar_vec/`: Phase 5B frozen CIFAR vector transfer
- `online_vec_pilot/`: Phase 5C online CIFAR vector pilot