# Phase 4 Memo: Diagnostic Dissection

**Date**: 2026-03-24

## Executive Summary

We ran three tightly sequenced experiments to answer: **when credit bridge fails on CIFAR, is the problem in the estimator or the forward exploitability?**

The answer is **both, but for different reasons**:
1. The scalar value-field parameterization has a fundamental curvature problem (Phase A + C)
2. Online co-adaptation destroys even correct credit signals (Phase A + B)
3. A direct vector credit field largely solves problem 1, achieving 0.91-0.96 Gamma/rho with M=4 (0.84-0.93 with M=8) on synthetic (Phase C)

---

## Phase A: Frozen CIFAR Credit Recovery

**Setup**: BP-trained CIFAR-10 ResidualMLP (L=4, d=256, 61.7% acc), all parameters frozen. Train credit estimators on fixed representations for 100 epochs.

| Method | mean Gamma | mean rho |
|--------|-----------|---------|
| DFA | 0.006 | 0.005 |
| Scalar CB (eT) | 0.115 | 0.125 |
| Scalar CB (deltaL) | 0.070 | 0.062 |
| **State Bridge (eT)** | **0.287** | **0.246** |

**Key finding**: State bridge dominates on frozen BP features. This reverses the synthetic result (where CB > SB at alpha=1.0). Why? BP-trained features are locally quasi-linear: the Jacobians are near-identity due to small residual initialization. State bridge exploits this, so the edge CB held in the synthetic setting does not carry over here.

Both CB variants clearly beat DFA (roughly 12-25x on both metrics; State Bridge reaches about 48x), showing that the estimator CAN recover useful credit on fixed representations. The online failure is therefore NOT a pure estimator problem.
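
The ratios to DFA can be recomputed directly from the table above. The snippet below is only a bookkeeping sketch: the dictionary layout and the variable names are illustrative, and the numbers are copied from the Phase A table.

```python
# Phase A table values: (mean Gamma, mean rho) per method.
phase_a = {
    "DFA": (0.006, 0.005),
    "Scalar CB (eT)": (0.115, 0.125),
    "Scalar CB (deltaL)": (0.070, 0.062),
    "State Bridge (eT)": (0.287, 0.246),
}

dfa_gamma, dfa_rho = phase_a["DFA"]

# Ratio of each method to the DFA baseline, for both metrics.
ratios = {
    name: (gamma / dfa_gamma, rho / dfa_rho)
    for name, (gamma, rho) in phase_a.items()
    if name != "DFA"
}

for name, (gamma_ratio, rho_ratio) in ratios.items():
    print(f"{name:<20s} Gamma x{gamma_ratio:5.1f}   rho x{rho_ratio:5.1f}")
```

Because the DFA baseline is so close to zero, these ratios are sensitive to small changes in the DFA numbers; the absolute gaps in the table are the more stable comparison.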

---

## Phase B: Online Shallow CIFAR

**Setup**: L=4, d=256, CIFAR-10, 100 epochs. Methods: DFA, SB, CB_eT, CB_deltaL. Sweep warmup/tgw.

Best config: **cb_eT, wr=0.2, tgw=1.0**
- 3-seed results: S1 (Gamma advantage over DFA) = +0.059 to +0.135 (consistently positive)
- S2 (rho advantage over DFA) = -0.004 to +0.014 (marginal, inconsistent)

CB_deltaL and SB both failed online (despite SB being best on frozen features).

**Interpretation**: The warmup (wr=0.2) is essential: it lets DFA carry the forward net to useful features before CB takes over. But co-adaptation then degrades the credit signal. The rho advantage over DFA remains near zero (-0.004 to +0.014 across seeds).
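
The warmup described above can be pictured as a simple epoch-based switch. This is a minimal sketch, assuming `wr` is the fraction of total epochs during which DFA supplies the credit signal and that the handover to CB is a hard switch. The function name `credit_source` and the string labels are hypothetical, and `tgw` is not modeled here.

```python
def credit_source(epoch: int, total_epochs: int, warmup_ratio: float = 0.2) -> str:
    """Return which credit signal drives the forward update at this epoch.

    Assumption: warmup_ratio (wr) is a fraction of total_epochs, and the
    switch from DFA to CB is abrupt rather than blended.
    """
    if not 0.0 <= warmup_ratio <= 1.0:
        raise ValueError("warmup_ratio must lie in [0, 1]")
    warmup_epochs = int(round(warmup_ratio * total_epochs))
    return "dfa" if epoch < warmup_epochs else "cb"


# With 100 epochs and wr=0.2, DFA drives the first 20 epochs and CB the rest.
schedule = [credit_source(epoch, total_epochs=100) for epoch in range(100)]
print(schedule.count("dfa"), "DFA epochs,", schedule.count("cb"), "CB epochs")
```

A blended handover (a gradually increasing weight on the CB signal) would be a natural variant, but the memo does not say which form was used.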

---

## Phase C: Vector Credit Field (Breakthrough)

**Setup**: Synthetic alpha=1.0, L={4,8}, d=128. Direct vector field a_phi(h,t,s)->R^d trained with symmetric finite-difference directional targets.

| Method | L=4 Gamma (3 seeds) | L=4 rho (3 seeds) | L=8 Gamma (3 seeds) | L=8 rho (3 seeds) |
|--------|:---:|:---:|:---:|:---:|
| DFA | 0.01 | 0.01 | 0.08 | 0.08 |
| Scalar CB | 0.34 | 0.33 | 0.26 | 0.29 |
| **Vector M=4** | **0.91** | **0.96** | **0.96** | **0.95** |
| **Vector M=8** | **0.84** | **0.92** | **0.90** | **0.93** |

**The vector field is a game-changer.** It improves over scalar CB by +0.50 to +0.70 in every config (about +0.6 on average). The perturbation-based training directly optimizes for what we care about (local loss sensitivity) without going through a scalar value function.
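
The size of the improvement can be checked against the table above. The following is a small arithmetic sketch; the tuple ordering (L=4 Gamma, L=4 rho, L=8 Gamma, L=8 rho) and the variable names are illustrative.

```python
# Phase C table values, ordered as (L=4 Gamma, L=4 rho, L=8 Gamma, L=8 rho).
scalar_cb = (0.34, 0.33, 0.26, 0.29)
vector_field = {
    "Vector M=4": (0.91, 0.96, 0.96, 0.95),
    "Vector M=8": (0.84, 0.92, 0.90, 0.93),
}

# Per-column gain of each vector variant over scalar CB.
gains = {
    name: [v - s for v, s in zip(values, scalar_cb)]
    for name, values in vector_field.items()
}

all_gains = [g for per_method in gains.values() for g in per_method]
mean_gain = sum(all_gains) / len(all_gains)

for name, per_method in gains.items():
    print(name, [round(g, 2) for g in per_method])
print(f"min {min(all_gains):.2f}, max {max(all_gains):.2f}, mean {mean_gain:.3f}")
```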

M=4 surprisingly outperforms M=8 in every column of the table above. One possible explanation, not yet verified, is that lower M means less noise per direction estimate.
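
To make "symmetric finite-difference directional targets" concrete, here is a minimal NumPy sketch on a toy problem. It is not the experiment code. It assumes a quadratic toy loss, random unit directions, and a free credit vector fitted at a single state `h` instead of the network a_phi(h,t,s); the helper names, `eps`, `lr`, and the step count are all hypothetical choices.

```python
import numpy as np


def sample_unit_directions(rng, m, d):
    """Draw m random unit vectors in R^d, one per row."""
    u = rng.standard_normal((m, d))
    return u / np.linalg.norm(u, axis=1, keepdims=True)


def fd_directional_targets(loss_fn, h, directions, eps=1e-3):
    """Symmetric finite difference (L(h + eps*u) - L(h - eps*u)) / (2*eps) per direction."""
    return np.array(
        [(loss_fn(h + eps * u) - loss_fn(h - eps * u)) / (2.0 * eps) for u in directions]
    )


def fit_credit_vector(loss_fn, h, m=4, steps=300, lr=1.0, seed=0):
    """Fit a vector a so that <a, u> matches the finite-difference target along u.

    Simplification: a is a free vector for one state h, not a network output.
    """
    rng = np.random.default_rng(seed)
    a = np.zeros_like(h)
    for _ in range(steps):
        u = sample_unit_directions(rng, m, h.size)
        targets = fd_directional_targets(loss_fn, h, u)
        residual = u @ a - targets
        # Gradient step on the mean squared residual over the m directions.
        a -= lr * (2.0 / m) * (u.T @ residual)
    return a


def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Toy quadratic loss with a known gradient, used only to check the recipe.
rng = np.random.default_rng(1)
d = 16
B = rng.standard_normal((d, d))
A = B @ B.T / d + np.eye(d)  # symmetric positive definite
h = rng.standard_normal(d)


def toy_loss(x):
    return 0.5 * x @ A @ x


true_grad = A @ h
a_hat = fit_credit_vector(toy_loss, h)
alignment = cosine(a_hat, true_grad)
print(f"cosine(a_hat, true gradient) = {alignment:.4f}")
```

On a quadratic loss the symmetric difference is exact up to rounding, so the fitted vector should align closely with the true gradient. On a real network the targets also carry truncation error and minibatch noise, and the fit has to generalize across states.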

---

## Synthesis

The diagnostic dissection reveals a clear path forward:

1. **Scalar V is fundamentally limited** by the curvature-vs-value disconnect (confirmed by Phase A frozen vs Phase C vector field comparison)

2. **Online co-adaptation is a separate problem** (confirmed by Phase A frozen success vs Phase B online marginal results)

3. **The vector credit field largely solves problem 1** on synthetic tasks (Gamma/rho of 0.84-0.96, versus 0.26-0.34 for scalar CB)

4. **Problem 2 remains open** and will need to be addressed separately when scaling to CIFAR (likely through better warmup schemes, adaptive learning rates, or periodic credit recalibration)

---

## Recommended Next Step

Test vector credit field on CIFAR-10 with L=4, d=256. Key concern: d=256 with M=4 gives signal/noise ~ sqrt(4/256) = 0.125, which may be too noisy. Options:
- Increase M to 16 or 32 (4-8x more forward passes per step)
- Use importance-weighted directions (along high-variance directions of a)
- Start with d=64 or d=128 CIFAR embedding for feasibility check
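
The trade-off between these options can be tabulated with the sqrt(M/d) rule of thumb used above. This is a rough planning sketch, assuming the heuristic holds and that cost scales linearly with M relative to the current M=4 setting; the function name `direction_snr` is illustrative.

```python
import math


def direction_snr(num_directions: int, dim: int) -> float:
    """Rule-of-thumb signal/noise for M directions in d dimensions: sqrt(M / d)."""
    return math.sqrt(num_directions / dim)


baseline_m = 4
snr_table = {}
for dim in (64, 128, 256):
    for m in (4, 16, 32):
        snr_table[(m, dim)] = direction_snr(m, dim)
        cost = m / baseline_m  # forward passes relative to M=4
        print(f"d={dim:3d}  M={m:2d}  snr~{snr_table[(m, dim)]:.3f}  cost x{cost:.0f}")
```

Under this heuristic, M=16 at d=256 gives the same estimate as M=4 at d=64, so shrinking the embedding and raising M are interchangeable ways to reach a given signal/noise level.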