From 1118b7457c261de36ead6103503c00c321c75f9b Mon Sep 17 00:00:00 2001
From: YurenHao0426
Date: Sun, 14 Jun 2026 20:32:31 -0500
Subject: Depth-utility ladder: trainable-block sweep (BP/FA/DFA) on ResMLP CIFAR-10

Appendix experiment triangulating the depth-utility diagnostic (D3) by varying
the number of trainable residual blocks k (last-k trainable, first L-k frozen at
init; embed/LN/head always trained).

- d=256 L=4 and d=512 L=2, 3 seeds, recipe identical to the main audit.
- BP climbs monotonically (+22-23pp); DFA peaks at the frozen baseline (k=0)
  and declines once any deep block is trained; FA shows partial/no net depth
  utility.
- Cross-checks reproduce existing anchors (BP 0.617, DFA 0.301, FA 0.402,
  frozen 0.349).
- frozen_init_identity_check quantifies frozen stack as a near-norm-preserving
  random feature map (per-block ||f||/||h||~0.10, stack cos 0.981), explaining
  the above-chance k=0 rung.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 report_explore/MEMO_depth_utility_ladder.md | 119 ++++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 report_explore/MEMO_depth_utility_ladder.md

(limited to 'report_explore/MEMO_depth_utility_ladder.md')

diff --git a/report_explore/MEMO_depth_utility_ladder.md b/report_explore/MEMO_depth_utility_ladder.md
new file mode 100644
index 0000000..d43a983
--- /dev/null
+++ b/report_explore/MEMO_depth_utility_ladder.md
@@ -0,0 +1,119 @@
+# MEMO: Depth-utility ladder (appendix experiment)
+
+**Date:** 2026-06-14
+**Purpose:** Reviewer asked to triangulate the depth-utility diagnostic (D3) more finely:
+turn the binary *frozen-vs-fully-trained* block comparison into a **curve**. We vary the
+number of trainable residual blocks `k`, training the **last `k`** blocks (output side) and
+freezing the first `L−k` at random init; embedding / out_ln / out_head are **always** trained.
+
+**Question.** As more blocks are made trainable, does test accuracy rise? Under a method that
+genuinely trains depth (BP) it should climb; under a method whose deep credit is non-functional
+(DFA) it should stay flat at, or below, the frozen baseline.
+
+**Why output-side-first.** The deepest block receives the most direct credit (FA's last block
+sees the exact output gradient), so the last `k` blocks are the **best case** for the method.
+If even these don't help, depth is unused.
+
+---
+
+## Setup
+
+- Arch / task: ResMLP (CIFAR-10). Two configs: **d=256 L=4** (primary audit) and **d=512 L=2**
+  (FA-failure case: vanilla FA is known to be ≈ frozen here).
+- Methods: **BP** (positive control), **FA** (Lillicrap vanilla feedback alignment), **DFA**.
+- `k ∈ {0,…,L}`; `k=0` = frozen-blocks baseline, `k=L` = full audit.
+- Seeds {42,123,456}; mean ± ddof-1 std.
+- Recipe identical to the main audit: AdamW, lr 1e-3, wd 0.01, cosine, batch 128, 100 epochs,
+  per-block independent optimizers, rms-normalized local surrogate losses.
+- The **full** ladder (all `k`, incl. 0 and L) was run in **one** script for internal
+  consistency; `k=0` / `k=L` reproduce the external anchors (see cross-checks).
+
+Harness: `experiments/depth_utility_ladder.py`.
+Raw results: `results/depth_ladder/ladder_d256_L4_cifar10.json`, `ladder_d512_L2_cifar10.json`.
+Figure: `results/depth_ladder/depth_ladder.png` (`experiments/plot_depth_ladder.py`).
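The bookkeeping in the Setup list can be sketched in a few lines: which blocks are trainable at rung `k`, how many runs the sweep implies, and what "mean ± ddof-1 std" computes. This is only an illustrative sketch. `trainable_mask`, `count_runs`, and `summarize` are hypothetical helper names, not functions taken from `experiments/depth_utility_ladder.py`, and the default of three methods and three seeds simply mirrors the Setup list.

```python
from statistics import mean, stdev  # stdev uses the n-1 (ddof=1) denominator


def trainable_mask(num_blocks: int, k: int) -> list[bool]:
    """True for trained blocks: the last k blocks (output side)."""
    if not 0 <= k <= num_blocks:
        raise ValueError("k must satisfy 0 <= k <= num_blocks")
    return [i >= num_blocks - k for i in range(num_blocks)]


def count_runs(num_blocks: int, n_methods: int = 3, n_seeds: int = 3) -> int:
    """Runs for one config: methods x rungs (k = 0..L) x seeds."""
    return n_methods * (num_blocks + 1) * n_seeds


def summarize(accs: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation (ddof=1) over seeds."""
    return mean(accs), stdev(accs)


print(trainable_mask(4, 1))            # → [False, False, False, True]
print(count_runs(4) + count_runs(2))   # → 72
```

The 72 runs (45 for d=256 L=4 plus 27 for d=512 L=2) match the run count quoted in the Reproduce comment.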
+
+---
+
+## Results (CIFAR-10 test acc, mean ± ddof-1 std, n=3)
+
+**Primary: ResMLP d=256, L=4**
+
+| k (last-k trainable) | BP | FA | DFA |
+|---|---|---|---|
+| 0 (frozen) | 0.389 ± 0.001 | 0.355 ± 0.003 | 0.349 ± 0.003 |
+| 1 | 0.565 ± 0.003 | 0.382 ± 0.008 | 0.244 ± 0.015 |
+| 2 | 0.598 ± 0.003 | 0.349 ± 0.016 | 0.286 ± 0.013 |
+| 3 | 0.608 ± 0.001 | 0.398 ± 0.008 | 0.296 ± 0.008 |
+| 4 (full) | 0.617 ± 0.002 | 0.402 ± 0.009 | 0.301 ± 0.006 |
+
+**Secondary: ResMLP d=512, L=2 (FA-failure)**
+
+| k | BP | FA | DFA |
+|---|---|---|---|
+| 0 (frozen) | 0.386 ± 0.003 | 0.359 ± 0.000 | 0.349 ± 0.005 |
+| 1 | 0.583 ± 0.002 | 0.412 ± 0.004 | 0.226 ± 0.015 |
+| 2 (full) | 0.603 ± 0.001 | 0.361 ± 0.003 | 0.302 ± 0.005 |
+
+---
+
+## Interpretation
+
+- **BP: monotone climb.** d=256: 0.389 → 0.617 (**+23 pp**); d=512: 0.386 → 0.603 (**+22 pp**).
+  Each block made trainable adds accuracy → depth is genuinely usable, so the D3 precondition
+  (BP benefits from depth) holds.
+- **DFA: flat-to-negative.** The frozen rung `k=0` (≈0.349) is DFA's **maximum** in both configs.
+  Every trained-block configuration lands **below** it, including the full audit (`k=L`): d=256
+  full = 0.301 (−4.8 pp vs frozen), d=512 full = 0.302 (−4.7 pp). Training deep DFA blocks does
+  not just fail to help; it costs ~5 pp relative to leaving them frozen. **The D3 failure now
+  holds at every granularity**, not just the two extremes.
+- **FA: partial / no net depth utility.** d=256 ends at 0.402 (+4.7 pp over frozen) but
+  non-monotonically; d=512 ends at 0.361 ≈ frozen 0.359 (**no net gain**; the FA-failure case
+  reproduces). FA is the intermediate case: it uses some depth in the d=256 L=4 config and none
+  in the d=512 L=2 one. The non-monotonic dips (d=256 k=2; d=512 k=2) are consistent with FA's
+  mis-scaled sequential credit occasionally hurting.
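The claims in the Interpretation list can be re-derived directly from the table means. The sketch below hard-codes those means (copied from the two tables, index = `k`) and checks monotonicity, the full-minus-frozen gain in percentage points, and the rung at which each method peaks. The helper names are hypothetical and it does not read the raw JSON results, whose schema is not shown in this memo.

```python
# Mean test accuracies copied from the two results tables (list index = k).
LADDER = {
    "d256_L4": {
        "BP": [0.389, 0.565, 0.598, 0.608, 0.617],
        "FA": [0.355, 0.382, 0.349, 0.398, 0.402],
        "DFA": [0.349, 0.244, 0.286, 0.296, 0.301],
    },
    "d512_L2": {
        "BP": [0.386, 0.583, 0.603],
        "FA": [0.359, 0.412, 0.361],
        "DFA": [0.349, 0.226, 0.302],
    },
}


def is_monotone_increasing(accs):
    """True if accuracy strictly rises with every extra trainable block."""
    return all(b > a for a, b in zip(accs, accs[1:]))


def gain_pp(accs):
    """Full audit (k=L) minus frozen (k=0), in percentage points."""
    return round((accs[-1] - accs[0]) * 100, 1)


def argmax_k(accs):
    """Rung k with the highest mean accuracy."""
    return max(range(len(accs)), key=accs.__getitem__)


for config, methods in LADDER.items():
    for method, accs in methods.items():
        print(config, method,
              "monotone" if is_monotone_increasing(accs) else "non-monotone",
              f"gain={gain_pp(accs):+.1f}pp", f"best k={argmax_k(accs)}")
```

On these means, BP is the only strictly monotone method, and DFA's best rung is `k=0` in both configs. The check uses means only; it does not account for the seed-to-seed standard deviations in the tables.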
+
+**One-line takeaway for §6.2:** *A trainable-depth ladder shows BP's accuracy climbs monotonically
+with the number of trainable blocks (+22–23 pp) while DFA peaks at the frozen baseline and
+declines once any deep block is trained; FA shows partial-to-no depth utility. Depth is usable
+(BP), but DFA's deep credit is not.*
+
+## Cross-checks (internal rerun reproduces external anchors)
+
+- BP `k=4` = 0.617 ≈ existing full-audit BP 0.615.
+- DFA `k=4` = 0.301 ≈ existing full-audit DFA 0.301 / 0.306.
+- FA `k=4` = 0.402 ≈ existing FA 0.401.
+- Frozen `k=0` under DFA = 0.349 in both configs ≈ existing frozen-blocks baseline 0.349. The
+  `k=0` rungs under FA (0.355 / 0.359) and BP (0.389 / 0.386) sit above that anchor.
+
+## Footnote: why `k=0` is already well above chance
+
+`k=0` is **not** an untrained network: embed / out_ln / out_head are trained; only the blocks are
+frozen at random init. At init the residual branches are **small but non-negligible**:
+per block `‖f_l(h_l)‖/‖h_l‖ ≈ 0.10`, and the full frozen 4-block stack deviates from the identity
+by `‖h_L−h_0‖/‖h_0‖ = 0.196 ± 0.003` with `cos(h_L,h_0) = 0.981 ± 0.001` (3 seeds, CIFAR-10
+batch). The frozen stack is therefore a fixed, **near-norm-preserving random feature map**, not a
+strict identity. So `k=0` (≈0.35–0.39 across methods) is the accuracy of a trained
+embedding+readout composed with this fixed map, effectively a trained (near-)linear classifier on
+pixels, well above the 10% chance level. Measurement:
+`experiments/frozen_init_identity_check.py` →
+`results/depth_ladder/frozen_init_identity.json`.
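The three footnote numbers (per-block ratio 0.10, stack deviation 0.196, cosine 0.981) can be checked against each other with a short calculation. The orthogonality assumptions below are an illustrative simplification and are not stated in the memo: if the four branch outputs are roughly orthogonal to each other and to `h_0`, their norms add in quadrature and the cosine follows from the relative deviation alone. The sketch also ignores the small growth of `‖h_l‖` across blocks.

```python
import math

per_block_ratio = 0.10   # ||f_l(h_l)|| / ||h_l||, from the footnote
num_blocks = 4           # frozen 4-block stack (d=256, L=4)
measured_dev = 0.196     # ||h_L - h_0|| / ||h_0||, from the footnote

# ASSUMPTION (not from the memo): branch outputs are mutually orthogonal,
# so their norms add in quadrature across the stack.
predicted_dev = per_block_ratio * math.sqrt(num_blocks)

# ASSUMPTION (not from the memo): the total deviation is orthogonal to h_0,
# so cos(h_L, h_0) = ||h_0|| / sqrt(||h_0||^2 + ||h_L - h_0||^2).
predicted_cos = 1.0 / math.sqrt(1.0 + measured_dev ** 2)

print(f"predicted stack deviation ~ {predicted_dev:.2f}")   # → 0.20
print(f"predicted cosine          ~ {predicted_cos:.3f}")   # → 0.981
```

Under these assumptions the predicted deviation (0.20) and cosine (0.981) agree with the measured 0.196 and 0.981, which is what one would expect if the frozen branches behave like small, roughly independent random perturbations of the residual stream.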
+
+## Reproduce
+
+```bash
+# ladders (GPU2, ~7 h for both, 72 runs, incremental/resumable JSON)
+CUDA_VISIBLE_DEVICES=2 python experiments/depth_utility_ladder.py \
+  --d_hidden 256 --num_blocks 4 --methods bp fa dfa --k_values 0 1 2 3 4 \
+  --seeds 42 123 456 --epochs 100 --gpu 0 --output_dir results/depth_ladder
+CUDA_VISIBLE_DEVICES=2 python experiments/depth_utility_ladder.py \
+  --d_hidden 512 --num_blocks 2 --methods bp fa dfa --k_values 0 1 2 \
+  --seeds 42 123 456 --epochs 100 --gpu 0 --output_dir results/depth_ladder
+# figure + identity check
+python experiments/plot_depth_ladder.py
+CUDA_VISIBLE_DEVICES=2 python experiments/frozen_init_identity_check.py
+```
+
+## Caveats / open items
+
+- Parameter-matched shallow baseline (rule out "it's capacity not depth") not yet run. Lower
+  priority: given deep-BP beats frozen by +22–23 pp, the D3 precondition is already safe.
+- FA non-monotonicity (k=1 > k=2 in both configs) is noted but not separately investigated; it
+  does not affect the headline (FA full ≈ or slightly above frozen, ≪ BP).
--
cgit v1.2.3