From 1118b7457c261de36ead6103503c00c321c75f9b Mon Sep 17 00:00:00 2001
From: YurenHao0426 <Blackhao0426@gmail.com>
Date: Sun, 14 Jun 2026 20:32:31 -0500
Subject: Depth-utility ladder: trainable-block sweep (BP/FA/DFA) on ResMLP
 CIFAR-10

Appendix experiment triangulating the depth-utility diagnostic (D3) by varying
the number of trainable residual blocks k (last-k trainable, first L-k frozen at
init; embed/LN/head always trained).

- d=256 L=4 and d=512 L=2, 3 seeds, recipe identical to the main audit.
- BP climbs monotonically (+22-23pp); DFA peaks at the frozen baseline (k=0) and
  stays below it once any deep block is trained; FA shows partial-to-no net depth utility.
- Cross-checks reproduce existing anchors (BP 0.617, DFA 0.301, FA 0.402, frozen 0.349).
- frozen_init_identity_check quantifies the frozen stack as a near-norm-preserving
  random feature map (per-block ||f||/||h||~0.10, stack cos 0.981), explaining the
  above-chance k=0 rung.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
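
Note (not part of the commit message): the ladder described above can be
enumerated in a few lines. The sketch below is illustrative only and is not
taken from `experiments/depth_utility_ladder.py`; `trainable_mask` and
`ladder_runs` are hypothetical helper names. It shows which blocks train for a
given `k` (last `k` trainable, first `L-k` frozen) and checks that the two
configs together give the 72 runs quoted in the memo's Reproduce section
(3 methods x (L+1) rungs x 3 seeds per config).

```python
from itertools import product


def trainable_mask(num_blocks: int, k: int) -> list[bool]:
    """Return one flag per block; True means the block is trained.

    Only the last k blocks (output side) are trainable; the first
    num_blocks - k stay frozen at their random initialization.
    """
    if not 0 <= k <= num_blocks:
        raise ValueError("k must satisfy 0 <= k <= num_blocks")
    return [i >= num_blocks - k for i in range(num_blocks)]


def ladder_runs(num_blocks, methods=("bp", "fa", "dfa"), seeds=(42, 123, 456)):
    """Enumerate every (method, k, seed) run of one ladder, k = 0..num_blocks."""
    return list(product(methods, range(num_blocks + 1), seeds))


runs_d256 = ladder_runs(4)  # d=256, L=4 config
runs_d512 = ladder_runs(2)  # d=512, L=2 config

print(trainable_mask(4, 1))             # [False, False, False, True]
print(len(runs_d256) + len(runs_d512))  # 72
```

In a real harness the mask would decide which blocks get an optimizer and
gradient updates; how the actual script does this is not shown in this patch.
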
 report_explore/MEMO_depth_utility_ladder.md | 119 ++++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 report_explore/MEMO_depth_utility_ladder.md

(limited to 'report_explore/MEMO_depth_utility_ladder.md')
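
A quick consistency check on the frozen-stack figures quoted in the commit
message and in the memo footnote, sketched under loud assumptions. It treats
the reported means (relative deviation 0.196, cosine 0.981, per-block ratio
0.10) as if they described a single pair of vectors, and the second helper
additionally assumes the four residual-branch outputs are mutually orthogonal
with roughly constant ‖h‖. Neither function comes from
`experiments/frozen_init_identity_check.py`; both names are hypothetical.
Expanding ‖h_L − h_0‖² = ‖h_L‖² + ‖h_0‖² − 2·cos·‖h_L‖·‖h_0‖ and dividing by
‖h_0‖² gives r² − 2·cos·r + (1 − rel²) = 0 for the norm ratio r = ‖h_L‖/‖h_0‖.

```python
import math


def norm_ratio_candidates(rel_dev: float, cos: float) -> tuple[float, float]:
    """Solve r^2 - 2*cos*r + (1 - rel_dev^2) = 0 for r = |h_L| / |h_0|.

    rel_dev is |h_L - h_0| / |h_0| and cos is cos(h_L, h_0).
    Both roots are returned because the two inputs alone do not say
    whether the stack slightly shrinks or slightly grows the norm.
    """
    disc = cos * cos - (1.0 - rel_dev * rel_dev)
    if disc < 0:
        raise ValueError("rel_dev and cos are not jointly realizable")
    root = math.sqrt(disc)
    return cos - root, cos + root


def orthogonal_branch_deviation(per_block_ratio: float, num_blocks: int) -> float:
    """Relative deviation if branch outputs add like orthogonal vectors.

    ASSUMPTION: each block adds a branch of norm per_block_ratio * |h|,
    the branches are mutually orthogonal, and |h| stays about constant.
    """
    return per_block_ratio * math.sqrt(num_blocks)


low, high = norm_ratio_candidates(0.196, 0.981)
print(f"norm ratio |h_L|/|h_0| is either {low:.3f} or {high:.3f}")
print(f"orthogonal-branch estimate: {orthogonal_branch_deviation(0.10, 4):.3f}")
```

With the reported values both roots lie within about 5% of 1, which is what
"near-norm-preserving" requires, and the orthogonal-branch estimate
0.10·√4 = 0.20 is close to the measured 0.196. The discriminant is small, so
the roots are sensitive to rounding in the third decimal of the cosine.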

diff --git a/report_explore/MEMO_depth_utility_ladder.md b/report_explore/MEMO_depth_utility_ladder.md
new file mode 100644
index 0000000..d43a983
--- /dev/null
+++ b/report_explore/MEMO_depth_utility_ladder.md
@@ -0,0 +1,119 @@
+# MEMO — Depth-utility ladder (appendix experiment)
+
+**Date:** 2026-06-14
+**Purpose:** Reviewer asked to triangulate the depth-utility diagnostic (D3) more finely
+— turn the binary *frozen-vs-fully-trained* block comparison into a **curve**. We vary the
+number of trainable residual blocks `k`, training the **last `k`** blocks (output side) and
+freezing the first `L−k` at random init; embedding / out_ln / out_head are **always** trained.
+
+**Question.** As more blocks are made trainable, does test accuracy rise? Under a method that
+genuinely trains depth (BP) it should climb; under a method whose deep credit is non-functional
+(DFA) it should stay flat at — or below — the frozen baseline.
+
+**Why output-side-first.** The deepest block receives the most direct credit (FA's last block
+sees the exact output gradient), so the last `k` blocks are the **best case** for the method.
+If even these don't help, depth is unused.
+
+---
+
+## Setup
+
+- Arch / task: ResMLP (CIFAR-10). Two configs: **d=256 L=4** (primary audit) and **d=512 L=2**
+  (FA-failure case — vanilla FA is known to be ≈ frozen here).
+- Methods: **BP** (positive control), **FA** (vanilla feedback alignment, Lillicrap et al.), **DFA**.
+- `k ∈ {0,…,L}`; `k=0` = frozen-blocks baseline, `k=L` = full audit.
+- Seeds {42,123,456}; results are mean ± sample std (ddof=1).
+- Recipe identical to the main audit: AdamW, lr 1e-3, wd 0.01, cosine, batch 128, 100 epochs,
+  per-block independent optimizers, rms-normalized local surrogate losses.
+- The **full** ladder (all `k`, incl. 0 and L) was run in **one** script for internal
+  consistency — `k=0` / `k=L` reproduce the external anchors (see cross-checks).
+
+Harness: `experiments/depth_utility_ladder.py`.
+Raw results: `results/depth_ladder/ladder_d256_L4_cifar10.json`, `ladder_d512_L2_cifar10.json`.
+Figure: `results/depth_ladder/depth_ladder.png` (`experiments/plot_depth_ladder.py`).
+
+---
+
+## Results (CIFAR-10 test acc, mean ± sample std with ddof=1, n=3)
+
+**Primary — ResMLP d=256, L=4**
+
+| k (last-k trainable) | BP | FA | DFA |
+|---|---|---|---|
+| 0 (frozen) | 0.389 ± 0.001 | 0.355 ± 0.003 | 0.349 ± 0.003 |
+| 1 | 0.565 ± 0.003 | 0.382 ± 0.008 | 0.244 ± 0.015 |
+| 2 | 0.598 ± 0.003 | 0.349 ± 0.016 | 0.286 ± 0.013 |
+| 3 | 0.608 ± 0.001 | 0.398 ± 0.008 | 0.296 ± 0.008 |
+| 4 (full) | 0.617 ± 0.002 | 0.402 ± 0.009 | 0.301 ± 0.006 |
+
+**Secondary — ResMLP d=512, L=2 (FA-failure)**
+
+| k | BP | FA | DFA |
+|---|---|---|---|
+| 0 (frozen) | 0.386 ± 0.003 | 0.359 ± 0.000 | 0.349 ± 0.005 |
+| 1 | 0.583 ± 0.002 | 0.412 ± 0.004 | 0.226 ± 0.015 |
+| 2 (full) | 0.603 ± 0.001 | 0.361 ± 0.003 | 0.302 ± 0.005 |
+
+---
+
+## Interpretation
+
+- **BP — monotone climb.** d=256: 0.389 → 0.617 (**+23 pp**); d=512: 0.386 → 0.603 (**+22 pp**).
+  Each block made trainable adds accuracy → depth is genuinely usable, so the D3 precondition
+  (BP benefits from depth) holds.
+- **DFA — flat-to-negative.** The frozen rung `k=0` (≈0.349) is DFA's **maximum** in both configs.
+  Every trained-block configuration lands **below** it, including the full audit (`k=L`): d=256
+  full = 0.301 (−4.8 pp vs frozen), d=512 full = 0.302 (−4.7 pp). Training deep DFA blocks does
+  not just fail to help: it costs ~5 pp at `k=L` and up to ~12 pp at `k=1`. **The D3 failure now holds at every
+  granularity**, not just the two extremes.
+- **FA — partial / no net depth utility.** d=256 ends at 0.402 (+4.7 pp over frozen) but
+  non-monotonically; d=512 ends at 0.361 ≈ frozen 0.359 (**no net gain** — the FA-failure case
+  reproduces). FA is the intermediate case: it can use some depth in the easier config and none in the
+  harder one. The non-monotonic dips (d=256 k=2; d=512 k=2) are consistent with FA's mis-scaled
+  sequential credit occasionally hurting.
+
+**One-line takeaway for §6.2:** *A trainable-depth ladder shows BP's accuracy climbs monotonically
+with the number of trainable blocks (+22–23 pp) while DFA peaks at the frozen baseline and
+stays below it once any deep block is trained; FA shows partial-to-no depth utility. Depth is usable
+(BP), but DFA's deep credit is not.*
+
+## Cross-checks (internal rerun reproduces external anchors)
+
+- BP `k=4` = 0.617 ≈ existing full-audit BP 0.615.
+- DFA `k=4` = 0.301 ≈ existing full-audit DFA 0.301 / 0.306.
+- FA `k=4` = 0.402 ≈ existing FA 0.401.
+- Frozen `k=0` (d=256): DFA 0.349 and FA 0.355 ≈ existing frozen-blocks baseline 0.349; BP is higher at 0.389.
+
+## Footnote — why `k=0` is already well above chance
+
+`k=0` is **not** an untrained network: embed / out_ln / out_head are trained; only the blocks are
+frozen at random init. At init the residual branches are **small but non-negligible**:
+per block `‖f_l(h_l)‖/‖h_l‖ ≈ 0.10`, and the full frozen 4-block stack deviates from the identity
+by `‖h_L−h_0‖/‖h_0‖ = 0.196 ± 0.003` with `cos(h_L,h_0) = 0.981 ± 0.001` (3 seeds, CIFAR-10
+batch). The frozen stack is therefore a fixed, **near-norm-preserving random feature map**, not a
+strict identity. So `k=0` (≈0.35) is the accuracy of a trained embedding+readout composed with
+this fixed map — effectively a trained (near-)linear classifier on pixels, well above the 10%
+chance level. Measurement: `experiments/frozen_init_identity_check.py` →
+`results/depth_ladder/frozen_init_identity.json`.
+
+## Reproduce
+
+```bash
+# ladders (GPU2, ~7 h for both, 72 runs, incremental/resumable JSON)
+CUDA_VISIBLE_DEVICES=2 python experiments/depth_utility_ladder.py \
+  --d_hidden 256 --num_blocks 4 --methods bp fa dfa --k_values 0 1 2 3 4 \
+  --seeds 42 123 456 --epochs 100 --gpu 0 --output_dir results/depth_ladder
+CUDA_VISIBLE_DEVICES=2 python experiments/depth_utility_ladder.py \
+  --d_hidden 512 --num_blocks 2 --methods bp fa dfa --k_values 0 1 2 \
+  --seeds 42 123 456 --epochs 100 --gpu 0 --output_dir results/depth_ladder
+# figure + identity check
+python experiments/plot_depth_ladder.py
+CUDA_VISIBLE_DEVICES=2 python experiments/frozen_init_identity_check.py
+```
+
+## Caveats / open items
+
+- Parameter-matched shallow baseline (rule out "it's capacity not depth") not yet run — lower
+  priority; given deep-BP beats frozen by +22–23 pp, the D3 precondition is already safe.
+- FA non-monotonicity (k=1 > k=2 in both configs) is noted but not separately investigated; it
+  does not affect the headline (FA full ≈ or slightly above frozen, ≪ BP).
-- 
cgit v1.2.3