# Experiment Notes

## Experiment Phases
- **debug**: Initial implementation, rapid iteration (commits ce24e36)
- **pilot**: Controlled iteration (commits 0b9ebb2, 7baf7ae)
- **frozen**: Code at commit 0b9ebb2 for all reported results

## Status: PHASE 10A.6 — GAIN REQUIRES TRAINABLE DEPTH-AWARE AUX, NOT SEMANTIC CREDIT

---

## Final Results Summary

### Toy LQ (3 seeds, 8000 steps)
| Method | Costate Cosine | ρ | Nudging |
|--------|---------------|---|---------|
| DFA | 0.001±0.003 | 0.001±0.007 | 0.000±0.001 |
| State Bridge | 0.945±0.002 | 0.931±0.003 | -0.344±0.019 |
| Credit Bridge | 0.944±0.001 | 0.930±0.002 | -0.342±0.019 |

### CIFAR-10 (3 seeds, 100 epochs)
| Method | Test Accuracy |
|--------|:------------:|
| BP | 59.2%±0.4% |
| DFA | 30.0%±0.3% |
| Credit Bridge | 29.6%±1.0% |
| State Bridge | 18.5%±1.8% |

### CIFAR-10 Diagnostics (seed 42)
| Method | BP Cosine | ρ | Nudge |
|--------|-----------|---|-------|
| BP | 0.940 | 0.990 | -0.027 |
| Credit Bridge | 0.056 | ~0 | ~0 |
| DFA | 0.030 | 0.005 | ~0 |
| State Bridge | 0.021 | 0.004 | ~0 |

---

## Key Findings

1. **Terminal gradient matching is essential** for credit bridge.
   Without it, V learns correct values but uninformative gradients (cos → 0.03).
   With it, credit bridge matches state bridge on toy (~0.94 cosine).

2. **State bridge fails on nonlinear systems** despite near-perfect state prediction.
   State prediction error → 0.0000 but test accuracy = 18.5% (worst of all methods).
   This confirms the core hypothesis: bridging state ≠ bridging credit.

3. **Credit bridge modestly outperforms DFA in BP cosine** (0.056 vs 0.030, ~2x)
   but accuracy is comparable (29.6% vs 30.0%).

4. **All non-BP methods struggle** on the deep 12-block MLP architecture.
   The gap to BP (59.2%) is large for all methods.
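
The BP cosine in finding 3 is not defined in these notes. A minimal sketch, assuming it is the ordinary cosine similarity between a method's credit vector and the true backprop gradient at the same layer (the function name and the `eps` guard are hypothetical):

```python
import math

def bp_cosine(credit, bp_grad, eps=1e-12):
    """Cosine similarity between an estimated credit vector and a BP gradient.

    Assumption: both inputs are flat sequences of floats for one layer.
    """
    dot = sum(c * g for c, g in zip(credit, bp_grad))
    norm_c = math.sqrt(sum(c * c for c in credit))
    norm_g = math.sqrt(sum(g * g for g in bp_grad))
    return dot / (norm_c * norm_g + eps)

# Perfectly aligned vectors score ~1, orthogonal vectors score ~0.
aligned = bp_cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = bp_cosine([1.0, 0.0], [0.0, 1.0])
```

On this scale, 0.056 and 0.030 are both close to orthogonal to the BP gradient, which fits the comparable accuracies.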

---

## Changes Log
- `ce24e36`: Initial implementation
- `0b9ebb2`: Sync state bridge to use normalized MSE in both toy and CIFAR
- `7baf7ae`: Add experiment notes and .gitignore

## Experiment IDs
- `toy_lq_frozen/`: Final toy results (3 seeds, synced state bridge)
- `cifar10/`, `cifar10_seed123/`, `cifar10_seed456/`: Final CIFAR results
- `toy_lq/`: Debug-phase toy results (raw state bridge, for ablation)
- `smoke_test/`, `smoke_test2/`: FashionMNIST debug runs

## Design Decisions
1. Terminal gradient matching (term_grad_weight=1.0): output-layer-local, not hidden BP
2. DFA warmup for credit bridge (20% of epochs): prevents value net bootstrap failure
3. Normalized MSE for state bridge: numerical stability
4. Credit normalization: a_norm = a / (RMS(a) + 1e-6)
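
Design decision 4 can be sketched as below. This assumes RMS is taken over the feature dimension of a single credit vector; the notes do not say whether it is per sample or per batch, and the function name is hypothetical.

```python
import math

def rms_normalize(a, eps=1e-6):
    """Return a / (RMS(a) + eps), with RMS(a) = sqrt(mean(a_i^2))."""
    rms = math.sqrt(sum(x * x for x in a) / len(a))
    return [x / (rms + eps) for x in a]

a_norm = rms_normalize([3.0, -4.0])
# After normalization the RMS is ~1, whatever the input scale was.
rms_after = math.sqrt(sum(x * x for x in a_norm) / len(a_norm))
# The eps term keeps an all-zero credit finite instead of dividing by zero.
zero_norm = rms_normalize([0.0, 0.0])
```

Because the output scale is fixed near 1, weak and strong credits produce updates of similar size.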

---

## Phase 2: Explore (commit 2403960+)

### Synthetic Nonlinearity Ladder (Phase 1 of explore)

**Setup**: Teacher-student with phi_alpha(z) = (1-alpha)*z + alpha*tanh(z)
- alpha in {0, 0.25, 0.5, 1.0}, L in {2, 4, 8, 12}
- d=128, C=10, 80 epochs, 3 seeds
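
The interpolated activation from the setup, written out as a scalar function. This is only a sketch of the formula; the actual teacher-student code presumably applies it elementwise to tensors.

```python
import math

def phi_alpha(z, alpha):
    """Interpolated activation: (1 - alpha) * z + alpha * tanh(z)."""
    return (1.0 - alpha) * z + alpha * math.tanh(z)

# alpha = 0 is the identity (linear regime); alpha = 1 is plain tanh.
ladder = {alpha: phi_alpha(2.0, alpha) for alpha in (0.0, 0.25, 0.5, 1.0)}
```

Intermediate alpha values keep a linear component of weight `1 - alpha`, so the function only saturates at alpha=1.0.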

**Critical Finding**: Credit bridge advantage scales with nonlinearity.

At alpha=1.0 (full tanh), credit bridge is the BEST method on Gamma and rho at ALL depths:

| L | DFA Gamma | SB Gamma | CB Gamma | DFA rho | SB rho | CB rho |
|---|-----------|----------|----------|---------|--------|--------|
| 2 | 0.03 | 0.52 | **0.53** | 0.03 | 0.47 | **0.57** |
| 4 | 0.05 | 0.34 | **0.45** | 0.06 | 0.32 | **0.51** |
| 8 | 0.06 | 0.25 | **0.36** | 0.07 | 0.23 | **0.42** |
| 12 | 0.07 | 0.22 | **0.24** | 0.07 | 0.21 | **0.32** |

At alpha=0.5 (moderate nonlinearity), SB still wins on Gamma but CB wins on rho at L=4.
At alpha=0 (linear), SB dominates.

**Interpretation**: State bridge fails via Jacobian mismatch, not value prediction error.
Credit bridge avoids this by learning value field gradients directly.
The crossover happens around alpha=0.7-1.0.

### CIFAR-10 Depth Scan (Phase 2 of explore, in progress)

Sweep L={2,4,6,8,12}, d=512, 100 epochs on CIFAR-10.
Preliminary results (L=2,4, seed=42):

| L | Method | Acc | Gamma | rho |
|---|--------|-----|-------|-----|
| 2 | DFA | 0.312 | 0.196 | 0.001 |
| 2 | CB | 0.311 | 0.175 | **0.031** |
| 4 | DFA | 0.314 | 0.100 | 0.003 |
| 4 | CB | 0.298 | 0.123 | -0.002 |

CIFAR is much harder: the rho signal is very weak for all non-BP methods.

### Changes Log (explore phase)
- `2403960`: Add synthetic ladder and CIFAR depth scan experiments
- Student blocks now use pre-LayerNorm for stability (fixes L>=8 blowup)
- Added gradient clipping to block updates

### Experiment IDs (explore phase)
- `synth_ladder_smoke/`: Initial 3-alpha x 2-depth smoke test
- `synth_ladder_v2_lo/`: Full alpha=0,0.25 x L=2,4,8,12 x 3 seeds
- `synth_ladder_v2_hi/`: Full alpha=0.5,1.0 x L=2,4,8,12 x 3 seeds
- `cifar_depth_scan_s42/`: CIFAR L=2,4,6,8,12 x d=512 x seed=42 (COMPLETE)
- `boundary_ablation_s_sweep/`: s_type in {eT, deltaL, eT_hL, deltaL_hL}
- `boundary_ablation_tgw_sweep/`: tgw in {0, 0.25, 1.0, 4.0}
- `boundary_ablation_wr_sweep/`: warmup ratio in {0, 0.05, 0.2, 0.5}
- `boundary_ablation_s123/`, `boundary_ablation_s456/`: s_type sweep with seeds 123, 456
- `boundary_ablation_deltaL_wr/`: deltaL with warmup ratio sweep

### Phase 3 Results: Boundary-Condition Ablation

At alpha=1.0, L=4 (best synthetic regime), 3 seeds:

**s_type (conditioning code):**
| Code | Gamma | rho | Acc |
|------|-------|-----|-----|
| eT (dim=10) | 0.452±0.042 | 0.509±0.033 | 0.523 |
| deltaL (dim=d) | **0.562±0.007** | **0.510±0.014** | 0.448 |
| eT+proj(h_L) | 0.002 | 0.016 | 0.559 |
| deltaL+proj(h_L) | 0.018 | 0.026 | 0.564 |

**deltaL gives best Gamma. Concatenating h_L destroys credit quality (value net cheats).**

**Terminal gradient matching weight:**
tgw=0 -> Gamma=0.12; tgw=1 -> Gamma=0.46; tgw=4 -> Gamma=0.57 (but acc drops).
Terminal gradient matching is monotonically beneficial for credit quality.

**Warmup ratio:**
wr=0 -> best Gamma (0.68) but worst acc (0.46).
wr=0.5 -> worst Gamma (0.23) but best acc (0.66).
Clear tradeoff between credit quality and accuracy.

Best single config: deltaL + tgw=1.0 + wr=0.05 -> **Gamma=0.768, rho=0.691**

### CIFAR deltaL Test
deltaL conditioning (s=grad_{h_L} CE, dim=512) on CIFAR L=4: FAILED.
Acc=17.2%, Gamma≈0, rho≈0. The 512-dim conditioning is too high-dimensional
for the value net. Confirms the scalar V approach has a dimensionality bottleneck.

### Pivot Recommendation: Direct Vector Credit Field
See `report_explore/MEMO_pivot_vector_field.md`.
Instead of V_phi -> grad_h V, learn a_phi(h_l, t_l, s) -> R^d directly.
Train with perturbation-based target: match <a, v> to actual loss change.
Still satisfies no hidden BP anchor constraint.
Minimal test: synthetic alpha=1.0, L=4 with M=4 perturbation directions.
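
A minimal sketch of the perturbation-based target, under stated assumptions: a toy quadratic stands in for the downstream loss, directions are Gaussian, and the directional change is measured by central difference. All function names are hypothetical, and the real `a_phi` would be a network rather than a fixed vector.

```python
import random

def loss_fn(h):
    """Toy stand-in for the downstream loss as a function of a hidden state."""
    return 0.5 * sum(x * x for x in h)

def directional_targets(h, directions, eps=1e-3):
    """Central-difference estimate of the loss change along each direction v."""
    targets = []
    for v in directions:
        h_plus = [x + eps * d for x, d in zip(h, v)]
        h_minus = [x - eps * d for x, d in zip(h, v)]
        targets.append((loss_fn(h_plus) - loss_fn(h_minus)) / (2.0 * eps))
    return targets

def credit_fit_loss(a, directions, targets):
    """Mean squared gap between <a, v> and the measured directional change."""
    total = 0.0
    for v, t in zip(directions, targets):
        pred = sum(ai * vi for ai, vi in zip(a, v))
        total += (pred - t) ** 2
    return total / len(directions)

rng = random.Random(0)
d, M = 8, 4
h = [rng.gauss(0.0, 1.0) for _ in range(d)]
directions = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(M)]
targets = directional_targets(h, directions)

# For this toy loss the true gradient is h itself, so a = h fits the targets.
loss_at_true_grad = credit_fit_loss(h, directions, targets)
loss_at_zero = credit_fit_loss([0.0] * d, directions, targets)
```

The targets come only from forward evaluations of the loss, so no gradient is propagated through hidden layers.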

---

## Phase 4: Diagnostic Dissection (commit TBD)

### Phase A: Frozen CIFAR Credit Recovery

**Setup**: BP-trained CIFAR-10 network (L=4, d=256, 61.7% test acc), frozen.
Train credit estimators on fixed representations.

**Key Result**: On frozen BP features, credit estimators CAN recover meaningful credit.

| Method | mean Gamma | mean rho | mean nudge |
|--------|-----------|---------|-----------|
| DFA (random) | 0.006 | 0.005 | -0.000022 |
| State Bridge (eT) | **0.287** | **0.246** | **-0.000957** |
| Scalar CB (eT) | 0.115 | 0.125 | -0.000370 |
| Scalar CB (deltaL) | 0.070 | 0.062 | -0.000160 |

**Surprising**: State bridge is BEST on frozen BP features (opposite of synthetic).
BP-trained features are quasi-linear, so SB's Jacobian approximation works well.

Also tested L=6 d=256 and L=4 d=512: same pattern (SB > CB_eT > CB_deltaL >> DFA).

**Implication**: Estimator is NOT the fundamental bottleneck. The online training
failure is due to co-adaptation between forward net and credit estimator.

### Phase B: Online Shallow CIFAR Conditioning Scan

**Setup**: L=4, d=256, CIFAR-10, 100 epochs, seed=42 (then 3-seed on best config).
Sweep: methods={DFA, SB, CB_eT, CB_deltaL}, wr={0,0.05,0.2}, tgw={1.0,4.0}.

**Found 2 positive configs with S1>0 AND S2>0:**

| Config | Acc | Gamma | rho | S1 vs DFA | S2 vs DFA |
|--------|-----|-------|-----|-----------|-----------|
| cb_eT wr=0.2 tgw=1.0 | 0.283 | 0.179 | 0.009 | **+0.079** | **+0.014** |
| cb_eT wr=0.2 tgw=4.0 | 0.285 | 0.187 | 0.002 | **+0.087** | **+0.007** |

**3-seed validation of cb_eT wr=0.2 tgw=1.0:**
- Seed 42: S1=+0.079, S2=+0.014 (both positive)
- Seed 123: S1=+0.059, S2=-0.004 (S1 positive, S2 marginal negative)
- Seed 456: S1=+0.135, S2=+0.003 (both positive, barely)

S1 is consistently positive. S2 is marginal — sometimes positive, sometimes not.

**CB_deltaL failed entirely on online CIFAR** (all configs near chance, Gamma≈0, rho≈0).
**SB_eT also failed online** (Gamma=0.025, rho=-0.013, despite being best on frozen features).

### Phase C: Direct Vector Credit Field Pilot

**Setup**: Synthetic alpha=1.0, L={4,8}, d=128, 80 epochs, 3 seeds.
Compare DFA vs Scalar CB vs Vector Field (M=4, M=8).

**BREAKTHROUGH RESULT:**

| Method | L=4 Gamma | L=4 rho | L=8 Gamma | L=8 rho |
|--------|-----------|---------|-----------|---------|
| DFA | 0.01±0.01 | 0.01±0.01 | 0.08±0.04 | 0.08±0.04 |
| Scalar CB | 0.34±0.10 | 0.33±0.11 | 0.26±0.03 | 0.29±0.02 |
| **Vector M=4** | **0.91±0.05** | **0.96±0.01** | **0.96±0.01** | **0.95±0.01** |
| **Vector M=8** | **0.84±0.10** | **0.92±0.05** | **0.90±0.10** | **0.93±0.04** |

**Vector field improves over scalar CB by +0.3 to +0.7 on BOTH Gamma and rho.**
This is consistent across all 6 seed x depth combinations.

The perturbation-based directional target directly trains the credit vector to
predict local loss sensitivity, avoiding the scalar V curvature problem entirely.

### Experiment IDs (Phase 4)
- `frozen_cifar/`: Frozen CIFAR credit recovery (L=4 d=256, L=6 d=256, L=4 d=512)
- `online_shallow/`: Phase B online scan (L=4 d=256, all configs)
- `online_shallow_3seed/`: 3-seed validation of best config
- `vector_credit_pilot/`: Phase C vector field vs scalar CB on synthetic

### Answers to Phase 4 Questions

**Q1: On frozen CIFAR, can the current credit estimator recover useful credit?**
YES. Scalar CB achieves Gamma=0.115, rho=0.125, roughly 20x DFA's values (Gamma=0.006, rho=0.005).
State bridge is even better (Gamma=0.287, rho=0.246).

**Q2: If yes, why does online training still fail?**
Co-adaptation. The forward net's features change every epoch, making the value net's
credit stale. DFA avoids this because its credits don't depend on learning.
The wr=0.2 warmup helps by letting DFA carry the forward net to a useful feature
regime before CB takes over, but the S2 signal remains marginal.

**Q3: Can a direct vector credit field improve over scalar CB?**
YES, dramatically. On synthetic alpha=1.0, vector field (M=4) achieves Gamma=0.91,
rho=0.96, compared to scalar CB's Gamma=0.34, rho=0.33.
The scalar V curvature problem is real and avoidable.

### Next Steps
1. Test vector credit field on CIFAR-10 (the real task)
2. Key concern: M=4 perturbation directions may not suffice for d=256/512 dimensions
   (signal/noise ~ sqrt(M/d) = sqrt(4/256) = 0.125 at d=256, and about 0.088 at d=512)
3. May need M=32+ for CIFAR, which is expensive but tractable
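
The sqrt(M/d) heuristic from step 2 can be tabulated for the settings mentioned in these notes. This is a back-of-envelope sketch of that heuristic only, not a measured signal-to-noise ratio.

```python
import math

def perturbation_snr(num_directions, dim):
    """Rough signal-to-noise heuristic sqrt(M / d)."""
    return math.sqrt(num_directions / dim)

# (M, d) pairs: the synthetic setting, the CIFAR widths, and the M=32 proposal.
settings = [(4, 128), (4, 256), (4, 512), (32, 256), (32, 512)]
snr_table = {(M, d): perturbation_snr(M, d) for M, d in settings}

for (M, d), snr in snr_table.items():
    print(f"M={M:>2} d={d:>3} snr~{snr:.3f}")
```

Under this heuristic, M=32 at d=512 gives the same ratio as M=4 at d=64.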

---

## Phase 5: Vector Field Audit + Real-Task Transfer

### Phase 5A: Synthetic Audit (4 sanity checks)

**Setup**: alpha=1.0, L={4,8}, d=128, 3 seeds, 80 epochs

**Smoke test result (L=4, seed=42):**

| Method | Gamma | rho | nudge |
|--------|-------|-----|-------|
| scalar_cb | 0.224 | 0.210 | -0.007 |
| vec_eT_M4 | **0.847** | **0.951** | -0.026 |
| vec_shuffleCtrl | 0.051 | 0.068 | -0.001 |
| vec_noTerm | 0.955 | 0.971 | -0.027 |
| vec_onesided | 0.832 | 0.943 | -0.024 |

- **Check B (shuffle)**: PASS. The shuffled control collapses (5/6 near zero across the full audit).
- **Check C (noTerm)**: The terminal term is NOT needed. The perturbation target alone gives 0.95+ Gamma.
- **Check D (onesided)**: PASS. One-sided ≈ central difference.

**Full 3-seed audit**: All 6 configs pass (delta_Gamma >= 0.49, delta_rho >= 0.55)

### Phase 5B: Frozen CIFAR Vector Transfer

**Setup**: CIFAR-10, frozen BP ref (L=4, d=256, 61.7%), 100 epochs estimators

| Method | Gamma | rho | nudge |
|--------|-------|-----|-------|
| DFA | 0.005 | 0.005 | -0.000006 |
| ScalarCB_eT | 0.115 | 0.120 | -0.000370 |
| StateBridge_eT | 0.287 | 0.264 | -0.000957 |
| **Vec_eT_M4** | **0.364** | **0.426** | **-0.001406** |

**TRANSFER SUCCESS**: Vec beats scalar CB by +0.25 Gamma, +0.31 rho.
Vec also beats state bridge on rho (0.43 vs 0.26).
M=4 is sufficient (M=8, M=16 give same results).

### Phase 5C: Online CIFAR Vector Pilot

**Setup**: CIFAR-10, L=4, d=256, 100 epochs, seed=42

| Config | Acc | Gamma | rho | S1 | S2 |
|--------|-----|-------|-----|----|----|
| DFA | 0.312 | 0.101 | -0.005 | 0 | 0 |
| vec wr=0.2 tw=1.0 | 0.243 | 0.001 | 0.000 | -0.100 | +0.005 |
| (for comparison: scalar CB wr=0.2 tgw=1.0 from Phase 4) | 0.283 | 0.179 | 0.009 | +0.079 | +0.014 |

**ONLINE FAILURE**: Vector field does WORSE than scalar CB online, despite being
much better on frozen features. No config achieves S1>0.
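
S1 and S2 are never defined in these notes. The values in the table above are consistent with S1 being the Gamma margin over DFA and S2 the rho margin over DFA, so the sketch below uses that inferred definition (an assumption, as is the function name).

```python
def scores_vs_dfa(gamma, rho, gamma_dfa, rho_dfa):
    """Inferred definition: S1 and S2 are the Gamma and rho margins over DFA."""
    return gamma - gamma_dfa, rho - rho_dfa

# Values copied from the Phase 5C table (seed 42).
dfa = {"gamma": 0.101, "rho": -0.005}
vec = {"gamma": 0.001, "rho": 0.000}

s1, s2 = scores_vs_dfa(vec["gamma"], vec["rho"], dfa["gamma"], dfa["rho"])
# The "S1>0 AND S2>0" criterion used to flag positive configs in Phase B.
positive = s1 > 0 and s2 > 0
```

With these inputs the vector config fails the criterion on S1, even though S2 is slightly positive.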

### Core Finding of Phases 4-5

**The bottleneck is NOT the credit estimator.** Improving credit quality from
scalar CB (Gamma=0.12) to vector field (Gamma=0.36) on frozen features does NOT
help online training. The bottleneck is in the **local surrogate update**:
<F_l(h_l), a_l> cannot effectively exploit directional credit information,
and co-adaptation between forward net and credit estimator rapidly degrades
the signal.

### Experiment IDs (Phase 5)
- `vector_audit/`: Phase 5A smoke test
- `vector_audit_full/`: Phase 5A full 3-seed audit
- `frozen_cifar_vec/`: Phase 5B frozen CIFAR vector transfer
- `online_vec_pilot/`: Phase 5C online CIFAR vector pilot

---

## Phase 6: Exploitability Dissection

### Phase 6A: Snapshot Exploitability

**Setup**: BP-trained CIFAR snapshot (L=4, d=256, 61.9% acc).
Offline-trained estimators. k-step local updates with real loss measurement.

**CRITICAL FINDING: Better credit → worse loss decrease.**

| Credit | Gamma | rho | dL_5step (inner_product) |
|--------|-------|-----|-------------------------|
| DFA | 0.009 | -0.023 | **-0.0001** (only negative!) |
| ScalarCB | 0.122 | 0.090 | +0.042 |
| Vec_M4 | 0.378 | 0.411 | +0.057 |
| Oracle BP | 1.000 | 0.998 | +0.011 |

Credit quality is ANTI-CORRELATED with loss decrease.
DFA (worst credit) is the only method not increasing loss.

### Phase 6C: Local Update Rule Swap

Tested target-shift (`h_target = h_{l+1} - eta * a_norm`) at eta in {0.01, 0.1, 0.3, 1.0}.

Target-shift reduces damage (Vec dL: +0.057 → +0.002 at eta=0.1) but never achieves
negative DeltaLoss for any non-DFA credit. Cosine rule produces near-zero effects.
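
A sketch of the target-shift rule, assuming `a_norm` is the RMS-normalized credit from the design decisions and that the block is trained toward `h_target` with a squared-error loss. The loss choice and both function names are assumptions; the notes only give the target formula.

```python
import math

def target_shift(h_next, a, eta, eps=1e-6):
    """Build h_target = h_next - eta * a_norm, with a_norm = a / (RMS(a) + eps)."""
    rms = math.sqrt(sum(x * x for x in a) / len(a))
    a_norm = [x / (rms + eps) for x in a]
    return [h - eta * an for h, an in zip(h_next, a_norm)]

def local_mse(block_out, h_target):
    """Hypothetical block-local loss: mean squared distance to the target."""
    return sum((o - t) ** 2 for o, t in zip(block_out, h_target)) / len(block_out)

h_next = [0.5, -1.0, 2.0]
credit = [0.2, 0.0, -0.2]

# Initial local loss for each eta in the sweep; it grows like eta^2.
initial_loss = {
    eta: local_mse(h_next, target_shift(h_next, credit, eta))
    for eta in (0.01, 0.1, 0.3, 1.0)
}
```

Since the credit is normalized, eta alone sets how far the target moves from the current activation.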

### Root Cause

The inner-product surrogate `<F_l(h), a>` is not a valid proxy for global loss minimization.
The gradient of this surrogate w.r.t. block parameters ≠ gradient of global loss w.r.t. same parameters.
A BP-trained snapshot is at a minimum reachable only by full BP; local updates systematically push uphill.

DFA works because its credits are weak enough to produce near-zero updates, effectively doing nothing.
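
To make the surrogate concrete, here is a sketch for a bias-free linear block `F_l(h) = W h`. This is an illustrative simplification, not the project's block, and it assumes the surrogate is minimized. The gradient of `<W h, a>` with respect to `W` is the outer product `a h^T`, which matches the true loss gradient only when `a` equals the true gradient at the block output.

```python
def surrogate_grad_linear(h, a):
    """Gradient of <W h, a> with respect to W, i.e. the outer product a h^T."""
    return [[a_i * h_j for h_j in h] for a_i in a]

def sgd_step(W, grad, lr):
    """One descent step on the surrogate."""
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, grad)]

def matvec(W, h):
    """Plain matrix-vector product for nested lists."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

h = [1.0, 2.0]
a = [0.5, -1.0]               # credit vector assigned to the block output
W = [[1.0, 0.0], [0.0, 1.0]]

before = sum(o * c for o, c in zip(matvec(W, h), a))
W = sgd_step(W, surrogate_grad_linear(h, a), lr=0.1)
after = sum(o * c for o, c in zip(matvec(W, h), a))
```

The step lowers the surrogate by construction. Whether the global loss also falls depends on how well `a` matches the true gradient.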

### This is Case B from the diagnostic logic tree:
Better credit does NOT lead to better snapshot loss decrease.
**The primary bottleneck is the local update rule itself, not the estimator or tracking.**

### Experiment IDs (Phase 6)
- `snapshot_exploit/`: Phase 6A snapshot exploitability
- `update_swap/`: Phase 6C local update rule comparison

---

## Phase 6.5: Protocol Audit (REVISES Phase 6A conclusion)

### Phase 6.5A: Same-Batch Linesearch

**CRITICAL REVISION**: Phase 6A's "better credit → worse loss" was a protocol artifact.

Phase 6A used: normalized credit + held-out evaluation + gradient clamping.
Phase 6.5A uses: raw + norm credit, same-batch + held-out eval, no clamping, eta sweep.

**With same-batch evaluation, better credit DOES produce more loss decrease:**

| Method | Gamma | dL_same (norm, all, best eta) | dL_held |
|--------|-------|-------------------------------|---------|
| DFA | 0.01 | -0.003 | +0.004 |
| ScalarCB | 0.12 | -0.025 | +0.027 |
| Vec_M4 | 0.38 | **-0.135** | +0.045 |
| Oracle BP | 1.00 | **-0.406** | +0.094 |

Same-batch loss decrease is MONOTONIC with credit quality.
But held-out loss INCREASES for all non-DFA methods.

**This is Case D: the local surrogate exploits credit correctly on training data,
but the update overfits to the batch. Better credit = more effective overfitting.**

### Key confounds identified in Phase 6A:
1. **Normalization** inflated DFA's weak credits to same magnitude as Vec's
2. **Held-out evaluation** showed generalization failure, not exploitability failure
3. **Gradient clamping** distorted the natural credit quality ordering

### Raw vs Norm:
- Raw credit: tiny updates (BP grad RMS ≈ 0.00004). Vec raw best dL_same=-0.005
- Norm credit: amplifies to useful magnitude but also amplifies overfitting

### Revised diagnosis:
The bottleneck is NOT "surrogate can't exploit credit" (Phase 6A was wrong).
It IS "local surrogate with good credit overfits to mini-batch."
This suggests: regularization of local updates (larger batches, weight decay,
gradient noise) could make better credit usable.

### Experiment IDs (Phase 6.5)
- `exploit_linesearch/`: Phase 6.5A smoke test (Oracle + Vec, last1, raw)
- `exploit_linesearch_full/`: Phase 6.5A full sweep (all methods, ranges, norm modes)

---

## Phase 7A: Snapshot Time Sweep

**Setup**: BP snapshots at epoch {5, 20, 100} (acc 0.49/0.57/0.62).
Train Vec_M4 on each frozen snapshot. Test 1-step and 5-step with raw credit, last-block-only.

**KEY FINDING: Held-out failure is primarily a LATE-SNAPSHOT artifact.**

5-step DeltaLoss held-out:

| Epoch | DFA dL_held | Vec dL_held | Oracle dL_held | Vec PUR |
|-------|-------------|-------------|----------------|---------|
| **5** | +0.003 | **-0.005** | **-0.009** | **0.70** |
| 20 | +0.001 | +0.002 | +0.000 | -3.87 |
| 100 | +0.000 | +0.001 | -0.001 | -1.01 |

At epoch 5: Vec decreases held-out loss (PUR=0.70), Oracle too (PUR=1.05).
DFA never decreases held-out loss at any snapshot (+0.003, +0.001, +0.000).

By epoch 20 the generalization window closes.

**Better credit produces MORE consistent updates** (Vec variance=0.8 vs DFA variance=40).
The problem is not batch-specificity but snapshot timing: credit is useful early, useless late.

**Implication**: The DFA warmup (which delays credit bridge to epoch ~20) is counterproductive.
Credit bridge should be used from epoch 0.

### Experiment IDs (Phase 7)
- `snapshot_time/`: Phase 7A snapshot time sweep with BP checkpoints

---

## Phase 8: Schedule Timing Hypothesis Test

**Setup**: CIFAR-10, L=4, d=256, 100 epochs, seed=42

| Schedule | acc@5 | acc@20 | final |
|----------|-------|--------|-------|
| DFA_only | **0.297** | **0.308** | **0.312** |
| Vec_only_from_0 | 0.135 | 0.151 | 0.154 |
| Vec_T5_then_DFA | 0.135 | 0.213 | 0.266 |
| DFA_T20_then_Vec | 0.297 | 0.308 | 0.129 |

**Phase 7A's timing hypothesis does NOT transfer to online training.**

Vec from epoch 0 gets stuck at 15% (near chance). The online Vec estimator
starts from random initialization and cannot learn useful credit fast enough
when the forward net is also random (cold-start paradox).

DFA alone remains the best non-BP method (31.2%).

### The cold-start paradox:
Vec credit is most useful on early features, but Vec can only learn useful credit
from features with structure. DFA provides structure slowly, but by the time Vec
is ready, the early window is closed.

### Project conclusion at this point:
- Vec estimator WORKS (synthetic + frozen CIFAR)
- Local surrogate CAN exploit it (same-batch, Phase 6.5A)
- Early snapshots show generalization (Phase 7A, offline-trained Vec)
- But online co-learning of estimator + forward net is unsolved (cold-start)

### Experiment IDs (Phase 8)
- `schedule_timing/`: Phase 8 schedule comparison

---

## Phase 9A: Checkpointed Offline Handoff

**Setup**: DFA trajectory with checkpoint at t0=5 (acc=30.4%).
Offline-fit Vec_M4 on frozen checkpoint (60 epochs). Then branch.

| Branch | final acc | diff vs DFA |
|--------|-----------|-------------|
| continue_DFA | 0.311 | baseline |
| handoff_to_Vec | 0.300 | -0.011 |
| **handoff_blend_05** | **0.317** | **+0.006** |

**FIRST POSITIVE ONLINE RESULT**: 50% blend (Vec+DFA) outperforms pure DFA by +0.55 percentage points (0.317 vs 0.311 in the table).

This is Case B: pure Vec handoff fails, but blend works.
Vec provides complementary directional credit that DFA alone lacks.
DFA stabilizes the trajectory while Vec adds learned corrections.

### Full Phase 9A Results (seed=42, all t0 × alpha)

| t0 | Branch | final acc | diff vs DFA |
|----|--------|-----------|-------------|
| 1 | continue_DFA | 0.310 | — |
| 1 | blend_05 | **0.319** | **+0.9%** |
| 1 | blend_075 | 0.313 | +0.3% |
| 5 | continue_DFA | 0.310 | — |
| 5 | blend_05 | 0.317 | +0.6% |
| 5 | **blend_075** | **0.326** | **+1.5%** |
| 10 | continue_DFA | 0.310 | — |
| 10 | blend_05 | 0.318 | +0.7% |
| 10 | **blend_075** | **0.325** | **+1.4%** |

**Best config: t0=5 or 10, blend alpha=0.75 (75% Vec + 25% DFA), +1.5%.**
Higher Vec fraction (0.75) outperforms lower (0.25, 0.5) at t0=5,10.
Pure Vec handoff still fails at all t0.
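
The blend can be sketched as a convex combination of the two credit vectors, following the "75% Vec + 25% DFA" reading of alpha. Whether the blend is applied before or after credit normalization is not stated here, so this shows only the mixing step, with a hypothetical function name.

```python
def blend_credit(vec_credit, dfa_credit, alpha):
    """Convex blend: alpha * Vec credit + (1 - alpha) * DFA credit."""
    return [alpha * v + (1.0 - alpha) * d for v, d in zip(vec_credit, dfa_credit)]

vec_credit = [1.0, 0.0, -1.0]
dfa_credit = [0.0, 2.0, 0.0]

blend_075 = blend_credit(vec_credit, dfa_credit, alpha=0.75)  # blend_075 branch
pure_dfa = blend_credit(vec_credit, dfa_credit, alpha=0.0)    # continue_DFA
pure_vec = blend_credit(vec_credit, dfa_credit, alpha=1.0)    # handoff_to_Vec
```

In this form the `continue_DFA` and `handoff_to_Vec` branches are the alpha=0 and alpha=1 endpoints.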

### Phase 9B: Periodic Refit — FAILED

K=5 (refit every 5 epochs), R=1 (1 epoch refit), alpha=0.75: **14.0%** (vs DFA 31.2%).
Vec starts random, periodic 1-epoch refits aren't enough. Without offline pretraining
(as in 9A), the cold-start problem remains unsolved.

### Phase 9C: Top-Down Curriculum — NEUTRAL

| Config | final acc | diff vs DFA |
|--------|-----------|-------------|
| DFA_only | 0.312 | — |
| last1_vec (blend 0.75) | 0.308 | -0.4% |
| last2_vec (blend 0.75) | 0.311 | -0.1% |

Vec for last 1-2 blocks ≈ DFA. No improvement, no degradation.
Same cold-start problem: Vec can't learn useful credit online even for 1 block.

### Phase 9 Overall Conclusion

Only **offline prefit + blend handoff** (9A) works. Both online periodic refit (9B)
and top-down curriculum (9C) fail because they don't pre-train Vec on frozen features.

The +1.5% gain from 9A's blend(0.75) at t0=5 is the project's best online result.

### Experiment IDs (Phase 9)
- `checkpointed_handoff/`: Phase 9A checkpointed handoff with branches
- `periodic_refit/`: Phase 9B periodic refit
- `topdown_curriculum/`: Phase 9C top-down curriculum

---

## Phase 10A: Prefit Threshold Curve

**Setup**: t0=5, blend_075, E_prefit in {0, 15, 60}, seed=42

| E_prefit | Gamma_frozen | rho_frozen | final acc | diff vs DFA |
|----------|-------------|-----------|-----------|-------------|
| 0 (random Vec) | -0.005 | 0.014 | **0.324** | **+1.3%** |
| 15 | 0.002 | 0.011 | 0.323 | +1.2% |
| 60 | -0.001 | -0.009 | 0.325 | +1.4% |
| continue_DFA | — | — | 0.311 | baseline |

**Case C: NO prefit threshold exists.** Even random Vec (E=0) with blend(0.75) beats DFA.

**Critical reinterpretation of Phase 9A**: the +1.5% gain was NOT from Vec learning good credit.
Frozen Gamma/rho are near zero at all prefit levels. The benefit comes from the blend mechanism
itself — blending DFA with any additional signal provides regularization/diversification.

### Phase 10A.5: Blend Mechanism Dissection

| Branch | final acc | diff vs DFA |
|--------|-----------|-------------|
| continue_DFA | 0.311 | baseline |
| blend_random_**frozen** | **0.126** | **-18.5%** (catastrophic) |
| blend_random_**trainable** | 0.322 | +1.2% |
| blend_shuffled_trainable | 0.325 | +1.4% |
| blend_gaussian_noise | 0.308 | -0.3% |
| scaled_DFA_norm_match | 0.310 | -0.0% |

**Mechanism identified**: The gain is from **implicit regularization through a trainable
auxiliary network**, NOT from learned credit. Frozen random Vec crashes (12.6%).
Trainable Vec helps even with shuffled targets. Gaussian noise and norm scaling don't help.

Phase 9A's +1.5% was not evidence of useful credit — it was an optimization dynamics effect.

### Phase 10A.6: Structured vs Semantic Auxiliary

| Branch | final | diff | Key insight |
|--------|-------|------|-------------|
| random_trainable | 0.324 | +1.2% | works |
| shuffled_trainable | 0.325 | +1.4% | no semantics needed |
| **zero_target** | **0.221** | **-9.1%** | must output non-zero |
| fresh_random_target | 0.325 | +1.3% | stable targets not needed |
| time_only | 0.321 | +1.0% | h_l not needed, just depth |
| **constant_input** | **0.312** | **+0.0%** | needs at least depth info |
| prefit60_frozen | 0.127 | -18.4% | frozen = crash |
| prefit60_trainable | 0.321 | +1.0% | prefit ≈ random init |

**Mechanism**: depth-aware trainable auxiliary perturbation that diversifies block-local updates.
Not semantic credit. Not pure trainability (zero_target crashes). Not state-dependent (time_only works).
Depth-awareness is the minimal requirement (constant_input fails).

### Phase 10A.7: Minimal Auxiliary Compression

| Branch | final | diff | Key insight |
|--------|-------|------|-------------|
| random_trainable | 0.321 | +1.0% | reference |
| zero_target | 0.203 | -10.8% | must output non-zero (confirmed) |
| zero_target_normmatched | 0.202 | -10.9% | norm matching doesn't save it |
| **perlayer_vector** | **0.318** | **+0.7%** | per-block trainable vector works! |
| freeze_after_1 | 0.144 | -16.7% | freeze = crash |
| freeze_after_5 | 0.143 | -16.8% | freeze = crash |
| freeze_after_10 | 0.130 | -18.1% | freeze = crash |

**Key findings**:
1. Norm-matched zero-target still crashes → non-zero direction is necessary, not just norm
2. Per-layer trainable vector works (+0.7%) → network not strictly needed, but helps
3. ALL freeze-after-k crash → **continuous trainability is absolutely necessary**

**Minimal mechanism**: continuously trainable, non-zero, depth-aware auxiliary perturbation.

### Phase 10A.8: Scaffold Dynamics

**8A: Freeze with Decay**

| Branch | final | diff | Key |
|--------|-------|------|-----|
| random_trainable_075 | 0.322 | +1.1% | reference |
| freeze1_fixed075 | 0.146 | -16.6% | stale aux at 75% kills |
| freeze1_decay_to_000 | **0.285** | **-2.7%** | decay to DFA recovers most |
| freeze5_decay_to_000 | 0.285 | -2.6% | same |

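Freeze failure is MOSTLY due to a stale aux held at high weight: decaying the aux weight to zero (pure DFA) recovers accuracy from 14.6% to 28.5%.
But 28.5% is still 2.7 points below DFA (31.2%) and 3.7 points below the trainable reference (32.2%), so continuous trainability still carries additional value.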

**8B: Alpha Sweep**

| Method | α=0.25 | α=0.50 | α=0.75 | α=0.90 |
|--------|--------|--------|--------|--------|
| perlayer_vector | +0.0% | +0.6% | **+1.1%** | -1.4% |
| random_trainable | +0.1% | +0.4% | **+0.8%** | -0.1% |

Both methods optimal at α=0.75. perlayer_vector (+1.1%) ≈ random_trainable (+0.8%).
Per-layer vector is the minimal sufficient scaffold.

### Phase 10A.8C: 3-Seed Replication

| Method | final acc (3 seeds) | diff vs DFA |
|--------|---------------------|-------------|
| DFA | 0.306±0.006 | baseline |
| perlayer_vector α=0.75 | 0.304±0.006 | -0.2% (not significant) |
| random_trainable α=0.75 | 0.313±0.007 | +0.7% (marginal) |

Single-seed gains do not robustly replicate. Error bars overlap.
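
The overlap claim can be checked directly from the table. The sketch assumes the ± values are a symmetric spread across the 3 seeds, for example one standard deviation; the notes do not say which.

```python
def interval(mean, spread):
    """Return the (low, high) band for mean ± spread."""
    return mean - spread, mean + spread

def bands_overlap(a, b):
    """True when two (low, high) bands share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Means and spreads copied from the 3-seed table.
dfa = interval(0.306, 0.006)
perlayer = interval(0.304, 0.006)
random_trainable = interval(0.313, 0.007)

perlayer_overlaps = bands_overlap(dfa, perlayer)
random_overlaps = bands_overlap(dfa, random_trainable)
```

Both bands overlap the DFA band. Overlap is only a coarse screen, and with 3 seeds a formal test would have little power.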

### Experiment IDs (Phase 10)
- `prefit_threshold/`: Phase 10A prefit threshold curve
- `blend_dissection/`: Phase 10A.5 blend mechanism dissection
- `structured_aux/`: Phase 10A.6 structured vs semantic auxiliary
- `minimal_aux_compression/`: Phase 10A.7 minimal aux compression
- `freeze_with_decay/`: Phase 10A.8A freeze with decay
- `alpha_sweep_scaffold/`: Phase 10A.8B alpha sweep