report_explore/MEMO_depth_utility_ladder.md



# MEMO — Depth-utility ladder (appendix experiment)

**Date:** 2026-06-14
**Purpose:** Reviewer asked to triangulate the depth-utility diagnostic (D3) more finely
— turn the binary *frozen-vs-fully-trained* block comparison into a **curve**. We vary the
number of trainable residual blocks `k`, training the **last `k`** blocks (output side) and
freezing the first `L−k` at random init; embedding / out_ln / out_head are **always** trained.

**Question.** As more blocks are made trainable, does test accuracy rise? Under a method that
genuinely trains depth (BP) it should climb; under a method whose deep credit is non-functional
(DFA) it should stay flat at — or below — the frozen baseline.

**Why output-side-first.** The deepest block receives the most direct credit (FA's last block
sees the exact output gradient), so the last `k` blocks are the **best case** for the method.
If even these don't help, depth is unused.
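The last-`k` selection rule can be sketched in a few lines. This is a minimal illustration, not code from `experiments/depth_utility_ladder.py`; the helper name `trainable_block_mask` is hypothetical, and blocks are assumed to be indexed from the input side.

```python
from typing import List


def trainable_block_mask(num_blocks: int, k: int) -> List[bool]:
    """Return one flag per residual block (input side first); True = trainable.

    Hypothetical helper: the last k blocks are trainable, the first
    num_blocks - k stay frozen at their random initialization.
    """
    if not 0 <= k <= num_blocks:
        raise ValueError(f"k must be in [0, {num_blocks}], got {k}")
    first_trainable = num_blocks - k  # blocks before this index stay frozen
    return [i >= first_trainable for i in range(num_blocks)]


# Ladder for the primary config (L=4): k=0 is the frozen-blocks rung, k=4 the full audit.
for k in range(5):
    print(k, trainable_block_mask(4, k))
```

In a PyTorch model, a mask like this would typically be applied by switching `requires_grad` off for the parameters of every frozen block; whether the harness does it this way is an assumption.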

---

## Setup

- Arch / task: ResMLP (CIFAR-10). Two configs: **d=256 L=4** (primary audit) and **d=512 L=2**
  (FA-failure case — vanilla FA is known to be ≈ frozen here).
- Methods: **BP** (positive control), **FA** (Lillicrap vanilla feedback alignment), **DFA**.
- `k ∈ {0,…,L}`; `k=0` = frozen-blocks baseline, `k=L` = full audit.
- Seeds {42,123,456}; mean ± std with ddof=1.
- Recipe identical to the main audit: AdamW, lr 1e-3, wd 0.01, cosine, batch 128, 100 epochs,
  per-block independent optimizers, rms-normalized local surrogate losses.
- The **full** ladder (all `k`, incl. 0 and L) was run in **one** script for internal
  consistency — `k=0` / `k=L` reproduce the external anchors (see cross-checks).

Harness: `experiments/depth_utility_ladder.py`.
Raw results: `results/depth_ladder/ladder_d256_L4_cifar10.json`, `ladder_d512_L2_cifar10.json`.
Figure: `results/depth_ladder/depth_ladder.png` (`experiments/plot_depth_ladder.py`).
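For reference, the "mean ± std with ddof=1" summary over three seeds can be reproduced as below. This is a sketch using only the standard library; the per-seed values are made up for illustration and are not taken from the raw JSON files, whose schema is not assumed here.

```python
import statistics


def summarize(accs):
    """Mean and sample standard deviation (ddof=1) over per-seed accuracies."""
    if len(accs) < 2:
        raise ValueError("need at least two seeds for a ddof=1 std")
    # statistics.stdev divides by (n - 1), i.e. it is the ddof=1 estimator.
    return statistics.mean(accs), statistics.stdev(accs)


# Hypothetical per-seed accuracies for seeds 42, 123, 456 (illustrative only).
example = [0.60, 0.62, 0.61]
mean, std = summarize(example)
print(f"{mean:.3f} ± {std:.3f}")
```

With n=3 the ddof=1 estimate is about 22% larger than the population (ddof=0) std, which is worth keeping in mind when comparing against numbers reported elsewhere.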

---

## Results (CIFAR-10 test acc, mean ± std with ddof=1, n=3)

**Primary — ResMLP d=256, L=4**

| k (last-k trainable) | BP | FA | DFA |
|---|---|---|---|
| 0 (frozen) | 0.389 ± 0.001 | 0.355 ± 0.003 | 0.349 ± 0.003 |
| 1 | 0.565 ± 0.003 | 0.382 ± 0.008 | 0.244 ± 0.015 |
| 2 | 0.598 ± 0.003 | 0.349 ± 0.016 | 0.286 ± 0.013 |
| 3 | 0.608 ± 0.001 | 0.398 ± 0.008 | 0.296 ± 0.008 |
| 4 (full) | 0.617 ± 0.002 | 0.402 ± 0.009 | 0.301 ± 0.006 |

**Secondary — ResMLP d=512, L=2 (FA-failure)**

| k | BP | FA | DFA |
|---|---|---|---|
| 0 (frozen) | 0.386 ± 0.003 | 0.359 ± 0.000 | 0.349 ± 0.005 |
| 1 | 0.583 ± 0.002 | 0.412 ± 0.004 | 0.226 ± 0.015 |
| 2 (full) | 0.603 ± 0.001 | 0.361 ± 0.003 | 0.302 ± 0.005 |

---

## Interpretation

- **BP — monotone climb.** d=256: 0.389 → 0.617 (**+23 pp**); d=512: 0.386 → 0.603 (**+22 pp**).
  Each block made trainable adds accuracy → depth is genuinely usable, so the D3 precondition
  (BP benefits from depth) holds.
- **DFA — flat-to-negative.** The frozen rung `k=0` (≈0.349) is DFA's **maximum** in both configs.
  Every trained-block configuration lands **below** it. The trough is at `k=1` (d=256: 0.244,
  −10.5 pp; d=512: 0.226, −12.3 pp), and the full audit (`k=L`) recovers only partway: d=256
  full = 0.301 (−4.8 pp vs frozen), d=512 full = 0.302 (−4.7 pp). Training deep DFA blocks does
  not just fail to help; it costs ~5 pp at full depth and 10–12 pp at `k=1`. **The D3 failure
  now holds at every granularity**, not just the two extremes.
- **FA — partial / no net depth utility.** d=256 ends at 0.402 (+4.7 pp over frozen) but
  non-monotonically; d=512 ends at 0.361 ≈ frozen 0.359 (**no net gain** — the FA-failure case
  reproduces). FA is the intermediate case: it nets some depth utility in the easier config and
  none in the harder one, even though d=512 `k=1` alone reaches 0.412 (+5.3 pp over frozen). The
  non-monotonic dips (d=256 k=2; d=512 k=2) are consistent with FA's mis-scaled sequential
  credit occasionally hurting.
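The per-method deltas and monotonicity claims above can be rechecked directly from the table means. The sketch below hard-codes the means from the two results tables; the helper names are hypothetical and nothing here reads the raw JSON.

```python
# Mean test accuracies copied from the two results tables (list index = k).
LADDERS = {
    "d256_L4": {
        "BP": [0.389, 0.565, 0.598, 0.608, 0.617],
        "FA": [0.355, 0.382, 0.349, 0.398, 0.402],
        "DFA": [0.349, 0.244, 0.286, 0.296, 0.301],
    },
    "d512_L2": {
        "BP": [0.386, 0.583, 0.603],
        "FA": [0.359, 0.412, 0.361],
        "DFA": [0.349, 0.226, 0.302],
    },
}


def depth_gain_pp(curve):
    """Full-depth accuracy minus the frozen (k=0) rung, in percentage points."""
    return round((curve[-1] - curve[0]) * 100, 1)


def is_monotone_increasing(curve):
    """True if every added trainable block strictly raises the mean accuracy."""
    return all(b > a for a, b in zip(curve, curve[1:]))


def best_k(curve):
    """Rung with the highest mean accuracy."""
    return curve.index(max(curve))


for config, methods in LADDERS.items():
    for method, curve in methods.items():
        print(config, method, depth_gain_pp(curve),
              is_monotone_increasing(curve), best_k(curve))
```

Only BP passes the monotonicity check in both configs, and `best_k` is 0 for DFA in both, which is the "frozen rung is DFA's maximum" statement in code form. The check uses means only and ignores the seed std.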

**One-line takeaway for §6.2:** *A trainable-depth ladder shows BP's accuracy climbs monotonically
with the number of trainable blocks (+22–23 pp) while DFA peaks at the frozen baseline and
declines once any deep block is trained; FA shows partial-to-no depth utility. Depth is usable
(BP), but DFA's deep credit is not.*

## Cross-checks (internal rerun reproduces external anchors)

- BP `k=4` = 0.617 ≈ existing full-audit BP 0.615.
- DFA `k=4` = 0.301 ≈ existing full-audit DFA 0.301 / 0.306.
- FA `k=4` = 0.402 ≈ existing FA 0.401.
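- Frozen `k=0`: DFA 0.349 matches the existing frozen-blocks baseline 0.349; FA is close
  (0.355–0.359), while BP's frozen rung is higher (0.386–0.389), so the match is not uniform
  across methods.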

## Footnote — why `k=0` is already well above chance

`k=0` is **not** an untrained network: embed / out_ln / out_head are trained; only the blocks are
frozen at random init. At init the residual branches are **small but non-negligible**:
per block `‖f_l(h_l)‖/‖h_l‖ ≈ 0.10`, and the full frozen 4-block stack deviates from the identity
by `‖h_L−h_0‖/‖h_0‖ = 0.196 ± 0.003` with `cos(h_L,h_0) = 0.981 ± 0.001` (3 seeds, CIFAR-10
batch). The frozen stack is therefore a fixed, **near-norm-preserving random feature map**, not a
strict identity. So `k=0` (≈0.35 for FA/DFA, ≈0.39 for BP) is the accuracy of a trained
embedding+readout composed with this fixed map, effectively a trained (near-)linear classifier
on pixels, well above the 10%
chance level. Measurement: `experiments/frozen_init_identity_check.py` →
`results/depth_ladder/frozen_init_identity.json`.
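The two identity-deviation metrics quoted above are easy to state in code. The sketch below is not `frozen_init_identity_check.py`; it uses toy 2-D vectors rather than real activations, and the function names are hypothetical. The toy perturbation is chosen orthogonal to `h_0` with relative size 0.196.

```python
import math


def rel_deviation(h0, hL):
    """||h_L - h_0|| / ||h_0|| for two equal-length vectors."""
    diff = math.sqrt(sum((b - a) ** 2 for a, b in zip(h0, hL)))
    return diff / math.sqrt(sum(a * a for a in h0))


def cosine(h0, hL):
    """cos(h_L, h_0)."""
    dot = sum(a * b for a, b in zip(h0, hL))
    norm0 = math.sqrt(sum(a * a for a in h0))
    normL = math.sqrt(sum(b * b for b in hL))
    return dot / (norm0 * normL)


# Toy example (not real activations): perturbation orthogonal to h0.
h0 = [1.0, 0.0]
hL = [1.0, 0.196]
print(f"rel_dev={rel_deviation(h0, hL):.3f}  cos={cosine(h0, hL):.3f}")
```

For a purely orthogonal perturbation of relative size r, the cosine is 1/sqrt(1 + r²), which for r = 0.196 gives about 0.981. The reported pair (0.196, 0.981) is therefore consistent with the residual branches adding a component that is roughly orthogonal to the input stream, though the memo's measurement does not establish that on its own.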

## Reproduce

```bash
# ladders (GPU2, ~7 h for both, 72 runs, incremental/resumable JSON)
CUDA_VISIBLE_DEVICES=2 python experiments/depth_utility_ladder.py \
  --d_hidden 256 --num_blocks 4 --methods bp fa dfa --k_values 0 1 2 3 4 \
  --seeds 42 123 456 --epochs 100 --gpu 0 --output_dir results/depth_ladder
CUDA_VISIBLE_DEVICES=2 python experiments/depth_utility_ladder.py \
  --d_hidden 512 --num_blocks 2 --methods bp fa dfa --k_values 0 1 2 \
  --seeds 42 123 456 --epochs 100 --gpu 0 --output_dir results/depth_ladder
# figure + identity check
python experiments/plot_depth_ladder.py
CUDA_VISIBLE_DEVICES=2 python experiments/frozen_init_identity_check.py
```
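The "72 runs" figure follows from the grid in the two commands above. A quick enumeration, assuming one run per (config, method, `k`, seed) combination:

```python
from itertools import product

METHODS = ["bp", "fa", "dfa"]
SEEDS = [42, 123, 456]
# k values per config, as passed via --k_values in the commands above.
K_VALUES = {"d256_L4": [0, 1, 2, 3, 4], "d512_L2": [0, 1, 2]}


def enumerate_runs():
    """List every (config, method, k, seed) combination in the ladder grid."""
    return [
        (config, method, k, seed)
        for config, ks in K_VALUES.items()
        for method, k, seed in product(METHODS, ks, SEEDS)
    ]


runs = enumerate_runs()
print(len(runs))  # 45 runs for d=256 L=4 plus 27 for d=512 L=2
```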

## Caveats / open items

- Parameter-matched shallow baseline (rule out "it's capacity not depth") not yet run — lower
  priority; given deep-BP beats frozen by +22–23 pp, the D3 precondition is already safe.
- FA non-monotonicity (k=1 > k=2 in both configs) is noted but not separately investigated; it
  does not affect the headline (FA full ≈ or slightly above frozen, ≪ BP).