# Phase 6 Memo: Snapshot Exploitability + Local Update Rule Swap

**Date**: 2026-03-24

## Phase 6A: Snapshot Exploitability

**Setup**: BP-trained CIFAR-10 snapshot (L=4, d=256, 61.9% acc). Train credit estimators on frozen features, then apply k-step local updates and measure the change in the real loss.
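
The measurement loop described above can be sketched as follows. This is a minimal illustration, not the project's code: `delta_loss_after_k_steps`, `loss_fn`, and `local_step` are hypothetical names, and the sign convention (positive means the loss went up) is inferred from how the results tables are read.

```python
import copy


def delta_loss_after_k_steps(params, loss_fn, local_step, k=5):
    """Return loss(params after k local steps) - loss(params before).

    A positive value means the local updates made the real loss worse.
    `loss_fn` and `local_step` stand in for the real evaluation loss and
    the block-local update rule (both are assumptions of this sketch).
    """
    before = loss_fn(params)
    updated = copy.deepcopy(params)  # leave the snapshot untouched
    for _ in range(k):
        updated = local_step(updated)
    return loss_fn(updated) - before


# Toy usage: quadratic loss with a plain gradient step of size 0.1.
def toy_loss(p):
    return sum(x * x for x in p)


def toy_step(p):
    return [x - 0.1 * 2 * x for x in p]


print(delta_loss_after_k_steps([1.0, -2.0], toy_loss, toy_step))  # negative: loss fell
```

Copying the snapshot before updating keeps every credit method starting from the same parameters, so the DeltaLoss values are comparable.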

### Results (5-step DeltaLoss, inner-product surrogate)

| Credit | Gamma | rho | DeltaLoss (5 steps) |
|--------|-------|-----|----------|
| DFA | 0.009 | -0.023 | **-0.0001** |
| ScalarCB | 0.122 | 0.090 | +0.042 |
| Vec_M4 | 0.378 | 0.411 | +0.057 |
| Oracle BP | 1.000 | 0.998 | +0.011 |

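**Finding**: Among the learned estimators, better credit quality goes with a larger loss increase, not a larger decrease. DFA (worst credit) is the only method that does not increase loss, and Vec_M4 (best learned credit) increases loss the most. Even Oracle BP, with near-perfect credit, increases loss at 5 steps, though less than ScalarCB or Vec_M4.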
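
As a quick check on this trend, the sketch below computes a Pearson correlation between Gamma and the 5-step DeltaLoss from the table. It assumes Gamma is the credit-quality measure (the memo does not define it here), and with only three or four points the number is descriptive, not a significance test.

```python
from math import sqrt

# (Gamma, 5-step DeltaLoss) copied from the results table.
results = {
    "DFA": (0.009, -0.0001),
    "ScalarCB": (0.122, 0.042),
    "Vec_M4": (0.378, 0.057),
    "Oracle BP": (1.000, 0.011),
}


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Positive r means higher Gamma goes with a larger loss increase.
r_estimators = pearson([v for name, v in results.items() if name != "Oracle BP"])
r_all = pearson(list(results.values()))
print(f"without Oracle BP: r = {r_estimators:.2f}")
print(f"all four rows:     r = {r_all:.2f}")
```

On these numbers the correlation is strongly positive when Oracle BP is excluded and close to zero when it is included, so the Oracle row is the one that breaks the monotone trend.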

**Verdict**: This is **Case B** — the local update rule is the bottleneck.

## Phase 6C: Local Update Rule Swap

Tested a target-shift rule, `h_{l+1}^target = h_{l+1} - eta * a_norm`, at eta in {0.01, 0.1, 0.3, 1.0}. The table below reports eta = 0.1, 0.3, and 1.0.
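
A minimal sketch of the target-shift rule, under two assumptions that the memo does not spell out: `a_norm` is the credit vector scaled to unit L2 norm, and the block is trained toward the shifted target with a squared-error local loss. The function names are hypothetical.

```python
from math import sqrt


def shifted_target(h_next, credit, eta):
    """h_target = h_next - eta * credit / ||credit||_2 (assumed normalization)."""
    norm = sqrt(sum(c * c for c in credit))
    if norm == 0.0:
        return list(h_next)  # no credit signal: the target is the current output
    return [h - eta * c / norm for h, c in zip(h_next, credit)]


def local_target_loss(h_next, target):
    """Half squared distance to a fixed target (assumed choice of local loss)."""
    return 0.5 * sum((h - t) ** 2 for h, t in zip(h_next, target))


h_next = [0.5, -1.0, 2.0]
credit = [3.0, 0.0, -4.0]
for eta in (0.01, 0.1, 0.3, 1.0):
    target = shifted_target(h_next, credit, eta)
    # Under these assumptions the initial loss is 0.5 * eta**2, up to rounding.
    print(eta, local_target_loss(h_next, target))
```

With a unit-norm shift, the target always sits at distance eta from the current activation, so eta acts as a step length in activation space and the credit only sets the direction.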

### Results (5-step DeltaLoss)

| Credit | inner_prod | shift_0.1 | shift_0.3 | shift_1.0 |
|--------|:---:|:---:|:---:|:---:|
| DFA | -0.0001 | **-0.0003** | +0.0004 | +0.001 |
| Vec_M4 | +0.057 | +0.002 | +0.009 | +0.048 |
| Oracle BP | +0.011 | +0.0002 | +0.001 | +0.005 |

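Target-shift reduces the damage relative to the inner-product surrogate (for example, Vec_M4 drops from +0.057 to +0.002 at eta = 0.1) but never achieves negative DeltaLoss for non-DFA credits. A cosine rule was also tested (not tabulated here) and produces near-zero effects at all settings.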

## Root Cause Analysis

The issue is deeper than the update rule. A BP-trained snapshot sits at or near a minimum of the full-backprop loss surface. Any local update without access to the full gradient chain can push parameters in a direction that aligns locally with the credit but increases the global loss. This is because:

1. The inner-product surrogate `<F_l(h), a_l>` assumes a_l is the desired direction for the residual output. But even perfect credit (Oracle BP) doesn't produce good updates via this mechanism — the gradient of the surrogate w.r.t. block parameters is NOT the same as the gradient of the global loss.

2. Target-shift reduces the magnitude of harmful updates but doesn't fix the direction. At small eta, updates are negligible. At large eta, the target shifts too far and becomes harmful.

3. DFA "works" precisely because its random credits produce near-zero effective updates — it's approximately doing nothing, which is better than doing the wrong thing.
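
The argument that doing almost nothing beats doing the wrong thing at a minimum can be illustrated with a toy quadratic. This is not the memo's experiment: the loss, the curvature values, and the step direction are made up for illustration.

```python
def quadratic_loss(w, curvature):
    """0.5 * sum_i c_i * w_i^2, minimized at w = 0 when every c_i > 0."""
    return 0.5 * sum(c * x * x for c, x in zip(curvature, w))


def loss_increase(step_size, direction, curvature):
    """Loss change after stepping from the minimum along `direction`."""
    w_star = [0.0] * len(curvature)
    moved = [x + step_size * d for x, d in zip(w_star, direction)]
    return quadratic_loss(moved, curvature) - quadratic_loss(w_star, curvature)


curvature = [1.0, 4.0, 0.25]   # all positive, so w = 0 is a strict minimum
direction = [0.6, -0.8, 0.0]   # arbitrary unit-length direction

for step_size in (0.001, 0.1, 1.0):
    # At a strict minimum every non-zero step raises the loss,
    # and the increase grows with the square of the step size.
    print(step_size, loss_increase(step_size, direction, curvature))
```

In this toy setting no direction lowers the loss, so the method with the smallest effective step shows the smallest DeltaLoss, which is how the DFA row reads under point 3.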

## Implications

**The project's fundamental limitation is NOT in the credit estimator.** It's in the local surrogate update paradigm itself. In these snapshot experiments, the inner-product surrogate `<F(h), a>` is not a valid proxy for global loss minimization, regardless of credit quality.

**Potential directions:**
1. Use credit to set per-block learning targets rather than gradients (e.g., knowledge distillation-style objectives)
2. Use credit to modulate a more expressive local loss (e.g., local CE with projected targets)
3. Abandon block-local updates entirely and use credit to define a global but differentiable auxiliary loss