| field | value | date |
|---|---|---|
| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 00:12:27 -0500 |
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 00:12:27 -0500 |
| commit | 5b7f83ae5240c78013c084cf2e24ce5a5f572c42 | |
| tree | 31e76fa07034af2d1ff4c8417c20c6ce7fae70f2 /protocol | |
| parent | 1f705408da9eb9ff0fcb6f2269dadb2ebf71a0f1 | |
Add PAPER_OUTLINE.md: §1-§6 draft reflecting round 17 + 18

Comprehensive paper draft outline for the NeurIPS 2026 E&D submission:
  §1 Discovery-first hook (round 16 narrative arc): broken eval -> evidence
     -> metrics miss -> need protocol -> validation
  §2 Audit findings: 5-method × 3-seed audit, walk-back details, EP internal
     control
  §3 The diagnostic protocol: 4 diagnostics, decision-utility ablation,
     threshold sensitivity (with (d) fragility flagged), temporal validation,
     cross-architecture validation, sub-mode discrimination
  §4 Two failure modes: mechanism story + causal penalty rescue, with the
     round 18 softening (partial dissociation rather than full separability)
  §5 Pipeline pitfalls catalog: 7 bugs (incl. new #6.5 self-cosine fallback)
  §6 Reference implementation
  + Limitations / walk-backs section listing all 5 walked-back claims explicitly

This is a working draft to make the next writing step concrete. Reflects
all evidence collected through the round 18 follow-up.
Diffstat (limited to 'protocol')
| -rw-r--r-- | protocol/PAPER_OUTLINE.md | 165 |
1 files changed, 165 insertions, 0 deletions
diff --git a/protocol/PAPER_OUTLINE.md b/protocol/PAPER_OUTLINE.md
new file mode 100644
index 0000000..f3b19d6
--- /dev/null
+++ b/protocol/PAPER_OUTLINE.md
@@ -0,0 +1,165 @@
+# Paper draft outline
+
+NeurIPS 2026 Evaluations & Datasets Track. Round 17 + 18 verdict locked-in.
+This outline reflects ALL the evidence collected as of 2026-04-08, with
+round 18 language softening applied throughout.
+
+**Working title** (round 17): *Beyond Accuracy and Alignment: A Diagnostic
+Evaluation Protocol for Feedback Alignment*
+
+**Top-line claim**: standard FA evaluation on modern residual networks
+(headline accuracy + Γ alignment) is unreliable; we identify two distinct
+failure modes and provide a 4-diagnostic protocol that detects them.
+
+## §1 Introduction
+
+**Discovery-first hook** (round 16 narrative arc, kept under round 17 E&D framing):
+- Modern Feedback Alignment evaluation reports headline accuracy and Γ alignment as evidence that the method is training successfully on a given architecture.
+- We show, on standard 4-block pre-LayerNorm ResMLP and ViT-Mini at d=256/d=512 on CIFAR-10, that **both** signals silently fail. DFA "31% accuracy" walks back to "below the architecture-matched random-blocks baseline by 4 pp" (active harm). DFA Γ ≈ 0.10 walks back to "measured against a BP gradient at the numerical floor (~10⁻⁹), with the headline driven entirely by the embedding layer".
+- The evaluation failure is not unique to DFA: State Bridge and Credit Bridge show the same pattern. The protocol catches all three on every seed we tested.
+- We present the first systematic diagnostic protocol that catches these failures, with a reference implementation, validation across 3 architecture families (ResMLP, ViT-Mini, CNN), and an ablation showing 0/5 walked back by the field-standard reporting versus 3/5 by our protocol.
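
The point in the second bullet of §1, that a headline Γ can be carried by one layer while the deeper reference gradients sit near the numerical floor, can be illustrated with a toy calculation. The sketch below is not the paper's Γ implementation: the helper names (`cosine`, `per_layer_cosines`, `concatenated_cosine`) and the two-layer toy vectors are hypothetical, and concatenating all layers before taking a single cosine is only one assumed aggregation strategy.

```python
import math


def cosine(u, v):
    """Plain cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def per_layer_cosines(bp_grads, fa_signals):
    """One cosine per layer, so no layer can hide behind another."""
    return [cosine(g, s) for g, s in zip(bp_grads, fa_signals)]


def concatenated_cosine(bp_grads, fa_signals):
    """Single cosine over all layers flattened into one long vector."""
    flat_g = [x for g in bp_grads for x in g]
    flat_s = [x for s in fa_signals for x in s]
    return cosine(flat_g, flat_s)


# Hypothetical toy values: layer 0 has a healthy-scale BP gradient,
# layer 1 has a BP gradient about six orders of magnitude smaller.
bp_grads = [[1e-3, 0.0], [0.0, 1e-9]]
fa_signals = [[1.0, 1.0], [1.0, 0.0]]

per_layer = per_layer_cosines(bp_grads, fa_signals)
headline = concatenated_cosine(bp_grads, fa_signals)
print("per-layer:", per_layer)
print("concatenated:", headline)
```

In this toy case the deep layer has zero alignment, yet the concatenated value stays well above zero because the tiny deep-layer gradient contributes almost nothing to the flattened reference vector. Reporting the per-layer list alongside any aggregate makes that visible.
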
+
+## §2 Audit findings
+
+### 2.1 Audit on 4-block d=256 ResMLP CIFAR-10 (5 methods × 3 seeds)
+
+- BP, EP: trustworthy on all 6 method×seed combinations
+- DFA, State Bridge, Credit Bridge: walked back on all 9 combinations
+- Headline numbers (3-seed mean ± std): table from `audit_table_s42_s123_s456.json`
+
+### 2.2 Walk-back details
+
+**DFA**: 0.308 ± 0.014 acc. Reported as "DFA trains the deep blocks". Reality:
+- ‖h_L‖ ~ 4×10⁸ (vs BP ~200, vs random init ~9): catastrophic residual stream growth, 4 OOM
+- ‖g_L‖ ~ 5×10⁻¹⁰ (vs BP ~5×10⁻⁵, vs random init ~8×10⁻³): BP gradient at numerical floor, 5 OOM below healthy
+- DFA-shallow (no deep blocks): 0.349 ± 0.002; **DFA-trainable is 4 pp WORSE than DFA-shallow**
+- Cosine of `e_T B_l^T` to BP grad, layer 0: +0.42; layers 1-4: ≈ 0
+- Headline Γ = 0.10 is entirely the embedding layer
+
+**State Bridge**: 0.205 ± 0.039 acc. Same pattern. Cross-batch direction stability of BP grad = 0.992, the cleanest "drift-dominated reference vector" measurement we have.
+
+**Credit Bridge**: 0.289 ± 0.034 acc. Same pattern.
+
+### 2.3 Internal control (EP)
+
+EP on the same architecture: 0.316 ± 0.038 acc. ‖h_L‖ ~ 5×10³ (bounded), ‖g_L‖ ~ 1×10⁻⁴ (healthy), per-layer cosines small-but-meaningful. **The protocol passes EP** even though EP's headline accuracy is also low. This isolates "the network is in a degenerate measurement regime" from "the method underperforms"; they are different failure modes with different evidence. EP is the §2 internal control that validates the protocol does not over-flag.
+
+## §3 The diagnostic protocol
+
+### 3.1 Four diagnostics
+
+(a) **per-layer ‖h_l‖**: max-per-block growth `max_l (‖h_{l+1}‖ / ‖h_l‖)`. Threshold: 50×.
+(b) **per-layer ‖g_l‖**: deepest-layer BP gradient norm. Floor: 10⁻⁷.
+(c) **cross-batch direction stability**: mean pairwise cosine of normalized BP-grad direction across disjoint minibatches. Drift ceiling: 0.30.
+(d) **frozen-blocks baseline**: trained-blocks acc must clear an architecture-matched random-untrained-blocks baseline by ≥ 2 pp.
+
+(a)/(b)/(d) are independently sufficient for binary detection on the cases we audited. (c) is for sub-mode discrimination (drift-dominated vs noise-dominated reference vector); it does not catch new positives.
+
+### 3.2 Decision-utility ablation
+
+S0 (acc only) walks back 0/5 methods. S1 (acc + Γ field-standard) walks back 0/5 methods. S_full (full protocol) walks back 3/5 (DFA, SB, CB). The protocol's added value is 3 silent failures detected.
+
+### 3.3 Threshold sensitivity
+
+- (a) has a **63× separation gap** between healthy max (EP s123 = 11×) and degenerate min (CB s123 = 694×). Threshold 50× sits cleanly in the gap; verdicts robust to ±50% perturbation.
+- (b) has a **24,338× separation gap**. Threshold 10⁻⁷ similarly robust.
+- (d) is **NOT robust**: penalized DFA at λ=1e-2 fires at 2 pp, passes at 1 pp; at λ=1e-3 the margin is +2.3 pp (single seed) which passes at 2 pp. The (d) verdict on penalized conditions depends on both threshold and intervention strength.
+
+### 3.4 Temporal validation (when does it fire during training?)
+
+Replayed per-epoch DFA training data through the thresholds. On 4-block d=256 ResMLP across 3 seeds:
+- diagnostic (b) first fires at epoch 3-4 of 100
+- diagnostic (a) first fires at epoch 8-11
+- DFA accuracy at the moment of firing is essentially identical to its final value (the "training" is doing nothing useful for the headline number after epoch 1)
+- 96% of compute could be saved by using the protocol as an early-stop criterion
+
+On ViT-Mini (3 seeds): (a) fires at ep 1, (b) fires at ep 2-3. Even faster.
+
+BP never fires on any of 9 BP×architecture×seed conditions.
+
+### 3.5 Cross-architecture validation
+
+3 architecture families × 3 seeds × 2 methods (BP/DFA) = 18 conditions, plus 5-method × 3-seed audit on CNN.
+
+| arch | terminal LN | BP fires? | DFA (a) fires? | DFA (b) fires? |
+|---|:---:|:---:|:---:|:---:|
+| ResMLP (d=256) | yes | never | yes (ep 8-11) | yes (ep 3-4) |
+| ResMLP (d=512) | yes | never | yes (3-seed mean) | yes (3-seed mean) |
+| ViT-Mini | yes | never | yes (ep 1) | yes (ep 2-3) |
+| StudentNet (synthetic) | **no** | never | yes (ep 14-25) | **NEVER** |
+| CNN (BatchNorm) | **no** | never | yes (5-method audit) | **NEVER** |
+
+**[Round 18 softening]** Across the architectures we audited, **diagnostic (b) appears restricted to terminal-LayerNorm architectures**. In the absence of terminal LN, the BP gradient at hidden layers does not collapse to the numerical floor, even on DFA. This is observational association, not causal identification of LayerNorm; a future architecture without terminal LN where (b) fires would refine the claim.
+
+### 3.6 Sub-mode discrimination via (c)
+
+(c) splits the (a)+(b) failures into two interpretable sub-modes:
+- **drift-dominated**: stability ≈ 0.5-1.0. Reference BP grad direction is sample-invariant. (Example: SB s42 stability = 0.992.)
+- **noise-dominated**: stability ≈ 0. Reference direction is random per batch. (Example: DFA s42 stability = 0.047.)
+
+The same method falls into different sub-modes on different seeds. (c) is for interpretation; (b) is the primary detector regardless of sub-mode.
+
+## §4 Two failure modes
+
+### 4.1 Mechanism story (the discovery layer)
+
+**Failure mode 1: residual-stream-amplified BP-grad collapse**
+- DFA's local block losses have no global scale constraint
+- Block parameters grow ~95× rel-delta on ResMLP (vs BP ~2.7×)
+- w1·w2 product per block ~5×10⁴; block outputs grow 10⁷-10⁸×
+- Residual stream ‖h_L‖ ~ 10⁸ on ResMLP / ViT (4 OOM)
+- Terminal LayerNorm Jacobian rescaling drives ‖g_L‖ to ~10⁻¹⁰ (5 OOM below BP)
+- Γ becomes a measurement of cosine to a numerical-floor reference vector
+
+This is what diagnostics (a) and (b) detect.
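
The binary checks (a), (b), and (d) from §3.1 reduce to three threshold comparisons, sketched below in plain Python. This is an illustrative sketch, not the `protocol/` package: the names `Verdict`, `max_block_growth`, and `diagnose` are hypothetical, the strict inequalities are an assumption about how the thresholds are applied, and every per-layer norm list is made up for the example. The accuracies and deepest-layer gradient norms are the figures quoted in the outline, with DFA-shallow (0.349) standing in for the (d) baseline. Diagnostic (c) is left out because the outline uses it only for sub-mode interpretation.

```python
from dataclasses import dataclass

# Thresholds quoted in §3.1 of the outline.
GROWTH_THRESHOLD = 50.0    # (a) max per-block growth of the hidden-state norm
GRAD_FLOOR = 1e-7          # (b) floor on the deepest-layer BP gradient norm
BASELINE_MARGIN_PP = 2.0   # (d) required margin over the baseline, in pp


@dataclass
class Verdict:
    a_fires: bool
    b_fires: bool
    d_fires: bool

    @property
    def walked_back(self) -> bool:
        # (a), (b), and (d) are each treated as sufficient on their own.
        return self.a_fires or self.b_fires or self.d_fires


def max_block_growth(h_norms):
    """Largest ratio between consecutive per-layer hidden-state norms."""
    return max(nxt / cur for cur, nxt in zip(h_norms, h_norms[1:]))


def diagnose(h_norms, deepest_grad_norm, trained_acc, baseline_acc,
             margin_pp=BASELINE_MARGIN_PP):
    """Apply checks (a), (b), (d); strict inequalities are an assumption."""
    gain_pp = 100.0 * (trained_acc - baseline_acc)
    return Verdict(
        a_fires=max_block_growth(h_norms) > GROWTH_THRESHOLD,
        b_fires=deepest_grad_norm < GRAD_FLOOR,
        d_fires=gain_pp < margin_pp,
    )


# Vanilla-DFA-like case: hypothetical per-layer norms ending near 4e8.
dfa_like = diagnose([9.0, 4e2, 3e4, 5e6, 4e8], 5e-10, 0.308, 0.349)

# Penalized-DFA-like case (lambda = 1e-2): hypothetical norms ending near 4e4.
penalized_norms = [9.0, 70.0, 6e2, 5e3, 4e4]
penalized_2pp = diagnose(penalized_norms, 1e-6, 0.363, 0.349)
penalized_1pp = diagnose(penalized_norms, 1e-6, 0.363, 0.349, margin_pp=1.0)

print(dfa_like, penalized_2pp, penalized_1pp, sep="\n")
```

With these inputs the vanilla case trips all three checks, while the penalized case trips only (d), and only at the 2 pp margin. That is the threshold dependence §3.3 flags for (d).
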
+
+**Failure mode 2: depth utilization**
+- Even after the scale pathology is corrected, the deep blocks may not contribute over a random-untrained-blocks baseline
+- Diagnostic (d) detects this via the frozen-blocks comparison
+- **[Round 18 caveat]** The (d) verdict on penalized DFA depends on the intervention strength (λ): at λ=1e-2 the margin is +1.4 ± 0.05 pp; at λ=1e-3 it is +2.3 pp. There is a **real tradeoff between penalty strength and depth utilization**: stronger penalty contains scale better but kills more depth contribution; weaker penalty preserves depth but keeps more scale pathology.
+
+### 4.2 Causal validation: penalty rescue
+
+On 4-block d=256 ResMLP, adding `λ ‖f_l(h_l)‖²` to each DFA local block loss:
+- λ=1e-2 (3 seeds): ‖h_L‖ 4×10⁸ → 4×10⁴ (4 OOM rescue), ‖g_L‖ 5×10⁻¹⁰ → ~10⁻⁶ (4 OOM rescue), acc 0.308 → 0.363 (+5.5 pp over vanilla, +1.4 pp over shallow)
+- λ=1e-3: similar magnitude rescue, acc 0.372 (+2.3 pp over shallow, single seed; multi-seed verification in progress)
+
+**Round 18 framing**: this **partially dissociates the two putative failure modes by intervention**. The penalty alleviates the scale-related diagnostics (a) and (b) but does not bring depth contribution in line with BP. The two failure modes expose **distinct intervention surfaces**.
+
+Full mechanistic separability requires direct deep-block credit-quality measurement on the penalized checkpoint (in progress).
+
+## §5 Pipeline pitfalls catalog (appendix)
+
+7 evaluation-pipeline bugs we found in our own dogfood codebase. Each has a reproducer in `protocol/examples/verify_pitfalls*.py`.
+
+1. `tensor.norm(-1)` is L_{-1} of the whole tensor, not L_2 along dim=-1
+2. `F.cosine_similarity` clamps the divisor by `eps=1e-8`, distorting cosines by 10-100× when ‖g‖ is at the floor
+3. fp16 mixed precision underflows BP grads at hidden layers (~5×10⁻¹⁰ → 0)
+4. Random feedback `Bs` are training-specific; Γ is not invariant to reseeding them
+5. Aggregation strategy across layers/samples/batches gives 2.5-3.5× spread on the same data; sign can flip
+6. **[Round 18 reframing]** Per-layer Γ structure is hidden by aggregation. Layer-0 dominance is one specific manifestation (ResMLP DFA: l0 +0.42, l1-4 ≈ 0). On ViT-Mini all layers are uniformly near zero. The general lesson is to always report per-layer.
+6.5. **[Round 18 disclosure]** Auxiliary networks (`Bs`, bridge models) not persisted alongside checkpoints can cause post-hoc Γ scripts to silently fall back to `cos(BP, BP) = 1.0` and report perfect alignment. We discovered this in our own pipeline.
+
+## §6 Reference implementation
+
+`protocol/` package: 4 diagnostics, DiagnosticReport class, smoke test, audit table, decision-utility ablation, temporal evolution, sensitivity analyses, reporting template, checklist. ~20 Python files, all syntax-clean. Smoke test passes BP/DFA/EP checkpoint verdicts.
+
+Reproducibility: every result in this paper reproduces from `protocol/examples/*.py` against checkpoints in `results/`.
+
+## Limitations and walk-backs (appendix or §7)
+
+We walked back our own claims multiple times during this work. We report these explicitly as part of the paper:
+
+1. "DFA trains ViT-Mini to 24% accuracy" → walked back to "DFA-frozen-random-blocks ViT also gets 25.7%; the deep blocks are passengers" (codex round 6)
+2. "DFA trains ResMLP to 31%" → walked back to "DFA-trainable is 4 pp BELOW DFA-shallow on ResMLP; DFA training the deep blocks actively destroys value" (codex round 8)
+3. "Penalty rescues to 36.3% above shallow baseline → second failure mode established by (d)" → softened to "the (d) verdict on penalized DFA depends on the intervention strength; the two failure modes are partially dissociated by intervention" (codex round 18)
+4. "Layer 0 always dominates Γ" → softened to "aggregation hides per-layer structure that depends on the architecture; on ResMLP layer 0 dominates, on ViT-Mini all layers are uniformly near zero" (round 18 follow-up)
+5. "(b) is causally specific to terminal LN" → softened to "(b) appears restricted to terminal-normalized architectures we audited" (round 18 follow-up)
+
+## Open experimental questions
+
+- ViT-Mini 5-method audit (waiting on checkpoint training)
+- Multi-seed λ=1e-3 (running)
+- Direct deep-block credit measurement on penalized DFA (rerunning at --epochs 30 with checkpoint save)
+- Threshold robustness sweep on additional architectures
+- BatchNorm ResNet as a 4th non-LN architecture (deferred; CNN already covers this case)

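
Pitfalls 1 and 2 in the §5 catalog are both arithmetic effects that can be reproduced without PyTorch. The sketch below is a plain-Python stand-in, not one of the reproducers under `protocol/examples/`: `p_norm`, `cosine_exact`, and `cosine_eps_clamped` are hypothetical helpers, the toy vectors are made up, and clamping each norm at `eps` is an assumption about how the `eps` guard is applied (the exact form may differ between library versions).

```python
import math


def p_norm(values, p):
    """Vector p-norm over all entries: (sum |x|^p)^(1/p)."""
    return sum(abs(x) ** p for x in values) ** (1.0 / p)


def l2(values):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in values))


def cosine_exact(u, v):
    """Cosine similarity with no guard on the denominator."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (l2(u) * l2(v))


def cosine_eps_clamped(u, v, eps=1e-8):
    """Cosine with each norm clamped from below at eps (assumed guard form)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (max(l2(u), eps) * max(l2(v), eps))


# Pitfall 1: passing -1 as the norm order gives the p = -1 norm of the
# entries, which is not the Euclidean norm a "last dimension" reading expects.
row = [3.0, 4.0]
euclidean = p_norm(row, 2)        # the intended L2 value
minus_one = p_norm(row, -1)       # what an order of -1 actually computes

# Pitfall 2: a gradient with norm 5e-10 sits below eps = 1e-8, so the
# clamped denominator is too large and the cosine shrinks.
floor_grad = [3e-10, 4e-10]
reference = [1.0, 0.0]
exact = cosine_exact(floor_grad, reference)
clamped = cosine_eps_clamped(floor_grad, reference)

print("L2:", euclidean, "L_-1:", minus_one)
print("exact cosine:", exact, "clamped cosine:", clamped)
```

Here the clamped cosine is roughly 20 times smaller than the exact one, inside the 10-100× range pitfall 2 reports. Normalizing each vector in higher precision before taking the dot product is one way to sidestep the guard.
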