Add PAPER_OUTLINE.md: §1-§6 draft reflecting round 17 + 18

Comprehensive paper draft outline for the NeurIPS 2026 E&D submission:

  §1 Discovery-first hook (round 16 narrative arc): broken eval -> evidence -> metrics miss -> need protocol -> validation
  §2 Audit findings: 5-method × 3-seed audit, walk-back details, EP internal control
  §3 The diagnostic protocol: 4 diagnostics, decision-utility ablation, threshold sensitivity (with (d) fragility flagged), temporal validation, cross-architecture validation, sub-mode discrimination
  §4 Two failure modes: mechanism story + causal penalty rescue, with the round 18 softening (partial dissociation rather than full separability)
  §5 Pipeline pitfalls catalog: 7 bugs (incl. new #6.5 self-cosine fallback)
  §6 Reference implementation + Limitations / walk-backs section listing all 5 walked-back claims explicitly

This is a working draft to make the next writing step concrete. Reflects all evidence collected through the round 18 follow-up.
author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 00:12:27 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 00:12:27 -0500
commit: 5b7f83ae5240c78013c084cf2e24ce5a5f572c42 (patch)
tree: 31e76fa07034af2d1ff4c8417c20c6ce7fae70f2 /protocol
parent: 1f705408da9eb9ff0fcb6f2269dadb2ebf71a0f1 (diff)
1 files changed, 165 insertions, 0 deletions
diff --git a/protocol/PAPER_OUTLINE.md b/protocol/PAPER_OUTLINE.md
new file mode 100644
index 0000000..f3b19d6
--- /dev/null
+++ b/protocol/PAPER_OUTLINE.md
@@ -0,0 +1,165 @@
+# Paper draft outline
+
+NeurIPS 2026 Evaluations & Datasets Track. Round 17 + 18 verdict locked in.
+This outline reflects ALL the evidence collected as of 2026-04-08, with
+round 18 language softening applied throughout.
+
+**Working title** (round 17): *Beyond Accuracy and Alignment: A Diagnostic
+Evaluation Protocol for Feedback Alignment*
+
+**Top-line claim**: standard FA evaluation on modern residual networks
+(headline accuracy + Γ alignment) is unreliable; we identify two distinct
+failure modes and provide a 4-diagnostic protocol that detects them.
+
+## §1 Introduction
+
+**Discovery-first hook** (round 16 narrative arc, kept under round 17 E&D framing):
+- Modern Feedback Alignment evaluation reports headline accuracy and Γ alignment as evidence that the method is training successfully on a given architecture.
+- We show, on standard 4-block pre-LayerNorm ResMLP and ViT-Mini at d=256/d=512 on CIFAR-10, that **both** signals silently fail. DFA "31% accuracy" walks back to "below the architecture-matched random-blocks baseline by 4 pp" (active harm). DFA Γ ≈ 0.10 walks back to "measured against a BP gradient at the numerical floor (~10⁻⁹), with the headline driven entirely by the embedding layer".
+- The evaluation failure is not unique to DFA: State Bridge and Credit Bridge show the same pattern. The protocol catches all three on every seed we tested.
+- We present the first systematic diagnostic protocol that catches these failures, with a reference implementation, validation across 3 architecture families (ResMLP, ViT-Mini, CNN), and an ablation in which field-standard reporting walks back 0 of 5 methods while our protocol walks back 3 of 5.
+
+## §2 Audit findings
+
+### 2.1 Audit on 4-block d=256 ResMLP CIFAR-10 (5 methods × 3 seeds)
+
+- BP, EP: trustworthy on all 6 method×seed combinations
+- DFA, State Bridge, Credit Bridge: walked back on all 9 combinations
+- Headline numbers (3-seed mean ± std): table from `audit_table_s42_s123_s456.json`
+
+### 2.2 Walk-back details
+
+**DFA**: 0.308 ± 0.014 acc. Reports as "DFA trains the deep blocks". Reality:
+- ‖h_L‖ ~ 4×10⁸ (vs BP ~200, vs random init ~9): catastrophic residual stream growth, roughly 6 OOM above BP
+- ‖g_L‖ ~ 5×10⁻¹⁰ (vs BP ~5×10⁻⁵, vs random init ~8×10⁻³): BP gradient at numerical floor, 5 OOM below healthy
+- DFA-shallow (no deep blocks): 0.349 ± 0.002 — **DFA-trainable is 4 pp WORSE than DFA-shallow**
+- Cosine of `e_T B_l^T` to BP grad, layer 0: +0.42; layers 1-4: ≈ 0
+- Headline Γ = 0.10 is entirely the embedding layer
+
+**State Bridge**: 0.205 ± 0.039 acc. Same pattern. Cross-batch direction stability of BP grad = 0.992 — the cleanest "drift-dominated reference vector" measurement we have.
+
+**Credit Bridge**: 0.289 ± 0.034 acc. Same pattern.
+
+### 2.3 Internal control (EP)
+
+EP on the same architecture: 0.316 ± 0.038 acc. ‖h_L‖ ~ 5×10³ (bounded), ‖g_L‖ ~ 1×10⁻⁴ (healthy), per-layer cosines small-but-meaningful. **The protocol passes EP** even though EP's headline accuracy is also low. This isolates "the network is in a degenerate measurement regime" from "the method underperforms" — they are different failure modes with different evidence. EP is the §2 internal control that validates the protocol does not over-flag.
+
+## §3 The diagnostic protocol
+
+### 3.1 Four diagnostics
+
+- (a) **per-layer ‖h_l‖**: max-per-block growth `max_l (‖h_{l+1}‖ / ‖h_l‖)`. Threshold: 50×.
+- (b) **per-layer ‖g_l‖**: deepest-layer BP gradient norm. Floor: 10⁻⁷.
+- (c) **cross-batch direction stability**: mean pairwise cosine of normalized BP-grad direction across disjoint minibatches. Drift ceiling: 0.30.
+- (d) **frozen-blocks baseline**: trained-blocks acc must clear an architecture-matched random-untrained-blocks baseline by ≥ 2 pp.
+
+(a)/(b)/(d) are independently sufficient for binary detection on the cases we audited. (c) is for sub-mode discrimination (drift-dominated vs noise-dominated reference vector); it does not catch new positives.
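A minimal sketch of how the four checks above could be combined into a single verdict, using the thresholds as listed. The function name `run_diagnostics`, its arguments, and the returned dictionary layout are hypothetical and are not taken from the `protocol/` package. Treating stability above the 0.30 ceiling as "drift-dominated" is an assumption based on the sub-mode description in §3.6.

```python
def run_diagnostics(h_norms, g_deepest, stability, trained_acc, baseline_acc,
                    growth_thresh=50.0, grad_floor=1e-7,
                    drift_ceiling=0.30, margin_pp=2.0):
    """Hypothetical helper: apply diagnostics (a)-(d) to pre-computed statistics.

    h_norms      -- per-layer residual-stream norms [||h_0||, ..., ||h_L||]
    g_deepest    -- BP gradient norm at the deepest layer
    stability    -- mean pairwise cosine of BP-grad direction across minibatches
    trained_acc, baseline_acc -- accuracies in [0, 1]
    """
    # (a) largest per-block growth ratio ||h_{l+1}|| / ||h_l||
    max_growth = max(nxt / cur for cur, nxt in zip(h_norms, h_norms[1:]))
    fires = {
        "a": max_growth > growth_thresh,
        "b": g_deepest < grad_floor,
        # (d) fires when the trained model fails to clear the baseline by margin_pp
        "d": (trained_acc - baseline_acc) * 100.0 < margin_pp,
    }
    # (c) is used only to label the sub-mode of an (a)/(b) failure (assumed rule).
    sub_mode = None
    if fires["a"] or fires["b"]:
        sub_mode = "drift-dominated" if stability > drift_ceiling else "noise-dominated"
    return {"fires": fires, "walked_back": any(fires.values()),
            "sub_mode": sub_mode, "max_growth": max_growth}


# DFA-like inputs: the end norm, gradient norm, stability and accuracies come
# from the outline; the intermediate h_norms values are made up for illustration.
report = run_diagnostics([9.0, 1e3, 1e5, 1e7, 4e8], 5e-10, 0.047, 0.308, 0.349)
print(report["walked_back"], report["fires"], report["sub_mode"])
```

Keeping (c) out of the `walked_back` decision mirrors the statement that it discriminates sub-modes but does not catch new positives.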
+
+### 3.2 Decision-utility ablation
+
+S0 (acc only) walks back 0/5 methods. S1 (acc + Γ field-standard) walks back 0/5 methods. S_full (full protocol) walks back 3/5 (DFA, SB, CB). The protocol's added value is 3 silent failures detected.
+
+### 3.3 Threshold sensitivity
+
+- (a) has a **63× separation gap** between healthy max (EP s123 = 11×) and degenerate min (CB s123 = 694×). Threshold 50× sits cleanly in the gap; verdicts robust to ±50% perturbation.
+- (b) has a **24,338× separation gap**. Threshold 10⁻⁷ similarly robust.
+- (d) is **NOT robust**: penalized DFA at λ=1e-2 (margin +1.4 pp) fires at the 2 pp threshold but passes at 1 pp; at λ=1e-3 the margin is +2.3 pp (single seed), which passes at 2 pp. The (d) verdict on penalized conditions therefore depends on both the threshold and the intervention strength.
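The arithmetic behind these robustness statements can be reproduced in a few lines. This is an illustrative check using only the numbers quoted above; the helper names `threshold_robust` and `d_fires` are hypothetical.

```python
def threshold_robust(healthy_max, degenerate_min, threshold, rel_perturb=0.5):
    """True if the threshold still separates the two groups after being
    scaled down by (1 - rel_perturb) and up by (1 + rel_perturb)."""
    low = threshold * (1.0 - rel_perturb)
    high = threshold * (1.0 + rel_perturb)
    return healthy_max < low and high < degenerate_min


def d_fires(margin_pp, threshold_pp):
    """Diagnostic (d) fires when the margin over the baseline is below threshold."""
    return margin_pp < threshold_pp


# (a): healthy max 11x (EP s123) vs degenerate min 694x (CB s123), threshold 50x.
gap_a = 694 / 11                       # about 63.1
robust_a = threshold_robust(11, 694, 50)

# (d): the same margin gives different verdicts at different thresholds.
verdicts = {
    ("lam=1e-2", 2.0): d_fires(1.4, 2.0),   # fires
    ("lam=1e-2", 1.0): d_fires(1.4, 1.0),   # passes
    ("lam=1e-3", 2.0): d_fires(2.3, 2.0),   # passes
}
print(round(gap_a, 1), robust_a, verdicts)
```

With a ±50% perturbation the (a) threshold ranges over 25× to 75×, which stays inside the 11× to 694× gap, whereas the (d) verdict flips within a 1 pp change of threshold.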
+
+### 3.4 Temporal validation (when does it fire during training?)
+
+Replayed per-epoch DFA training data through the thresholds. On 4-block d=256 ResMLP across 3 seeds:
+- diagnostic (b) first fires at epoch 3-4 of 100
+- diagnostic (a) first fires at epoch 8-11
+- DFA accuracy at the moment of firing is essentially identical to its final value (the "training" is doing nothing useful for the headline number after epoch 1)
+- 96% of compute could be saved by using the protocol as an early-stop criterion
+
+On ViT-Mini (3 seeds): (a) fires at ep 1, (b) fires at ep 2-3. Even faster.
+
+BP never fires on any of 9 BP×architecture×seed conditions.
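The replay described in this subsection amounts to scanning a per-epoch log for the first threshold crossing. The sketch below assumes a plain list of per-epoch values with epochs numbered from 1; the function names and the synthetic log are hypothetical and are not the project's recorded data.

```python
def first_fire_epoch(per_epoch_values, fires):
    """Return the 1-based epoch at which fires(value) first holds, else None."""
    for epoch, value in enumerate(per_epoch_values, start=1):
        if fires(value):
            return epoch
    return None


def compute_saved(first_fire, total_epochs):
    """Fraction of epochs skipped if training stops at the first firing epoch."""
    return 1.0 - first_fire / total_epochs


# Synthetic deepest-layer gradient norms, chosen to cross the 1e-7 floor at epoch 4.
g_log = [1e-3, 1e-5, 1e-6, 5e-8, 1e-9]
epoch_b = first_fire_epoch(g_log, lambda g: g < 1e-7)
saved = compute_saved(epoch_b, total_epochs=100)
print(epoch_b, saved)
```

Stopping at epoch 4 of 100 corresponds to the 96% figure quoted in the list above.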
+
+### 3.5 Cross-architecture validation
+
+3 architecture families × 3 seeds × 2 methods (BP/DFA) = 18 conditions, plus 5-method × 3-seed audit on CNN.
+
+| arch | terminal LN | BP fires? | DFA (a) fires? | DFA (b) fires? |
+|---|:---:|:---:|:---:|:---:|
+| ResMLP (d=256) | yes | never | yes (ep 8-11) | yes (ep 3-4) |
+| ResMLP (d=512) | yes | never | yes (3-seed mean) | yes (3-seed mean) |
+| ViT-Mini | yes | never | yes (ep 1) | yes (ep 2-3) |
+| StudentNet (synthetic) | **no** | never | yes (ep 14-25) | **NEVER** |
+| CNN (BatchNorm) | **no** | never | yes (5-method audit) | **NEVER** |
+
+**[Round 18 softening]** Across the architectures we audited, **diagnostic (b) appears to fire only on terminal-LayerNorm architectures**. In the absence of terminal LN, the BP gradient at hidden layers does not collapse to the numerical floor, even under DFA. This is an observational association, not a causal identification of LayerNorm: a future architecture without terminal LN on which (b) fires would refine the claim.
+
+### 3.6 Sub-mode discrimination via (c)
+
+(c) splits the (a)+(b) failures into two interpretable sub-modes:
+- **drift-dominated**: stability ≈ 0.5-1.0. Reference BP grad direction is sample-invariant. (Example: SB s42 stability = 0.992.)
+- **noise-dominated**: stability ≈ 0. Reference direction is random per batch. (Example: DFA s42 stability = 0.047.)
+
+The same method falls into different sub-modes on different seeds. (c) is for interpretation; (b) is the primary detector regardless of sub-mode.
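A small sketch of the stability statistic behind (c): the mean pairwise cosine of per-minibatch BP gradient directions. It is written in pure Python on flat lists so it runs without a tensor library; the function names and the toy vectors are illustrative assumptions, not the reference implementation.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Plain cosine similarity in float64, with no epsilon clamp."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def direction_stability(grads):
    """Mean pairwise cosine over a list of per-minibatch gradient vectors."""
    pairs = list(combinations(grads, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy drift-dominated case: tiny vectors that all point the same way.
drift = [[1.0e-10, 2.0e-10], [1.1e-10, 2.0e-10], [1.0e-10, 2.1e-10]]
# Toy noise-dominated case: mutually orthogonal directions.
noise = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(direction_stability(drift), direction_stability(noise))
```

The drift example shows that a high stability score says nothing about magnitude: the vectors sit near 10⁻¹⁰ yet agree almost perfectly in direction.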
+
+## §4 Two failure modes
+
+### 4.1 Mechanism story (the discovery layer)
+
+**Failure mode 1: residual-stream-amplified BP-grad collapse**
+- DFA's local block losses have no global scale constraint
+- Block parameters grow ~95× rel-delta on ResMLP (vs BP ~2.7×)
+- w1·w2 product per block ~5×10⁴; block outputs grow 10⁷-10⁸×
+- Residual stream ‖h_L‖ ~ 10⁸ on ResMLP / ViT (4 OOM)
+- Terminal LayerNorm Jacobian rescaling drives ‖g_L‖ to ~10⁻¹⁰ (5 OOM below BP)
+- Γ becomes a measurement of cosine to a numerical-floor reference vector
+
+This is what diagnostics (a) and (b) detect.
+
+**Failure mode 2: depth utilization**
+- Even after the scale pathology is corrected, the deep blocks may not contribute over a random-untrained-blocks baseline
+- Diagnostic (d) detects this via the frozen-blocks comparison
+- **[Round 18 caveat]** The (d) verdict on penalized DFA depends on the intervention strength (λ): at λ=1e-2 the margin is +1.4 ± 0.05 pp; at λ=1e-3 it is +2.3 pp. There is a **real tradeoff between penalty strength and depth utilization** — stronger penalty contains scale better but kills more depth contribution; weaker penalty preserves depth but keeps more scale pathology.
+
+### 4.2 Causal validation: penalty rescue
+
+On 4-block d=256 ResMLP, adding `λ ‖f_l(h_l)‖²` to each DFA local block loss:
+- λ=1e-2 (3 seeds): ‖h_L‖ 4×10⁸ → 4×10⁴ (4 OOM rescue), ‖g_L‖ 5×10⁻¹⁰ → ~10⁻⁶ (4 OOM rescue), acc 0.308 → 0.363 (+5.5 pp over vanilla, +1.4 pp over shallow)
+- λ=1e-3: similar magnitude rescue, acc 0.372 (+2.3 pp over shallow, single seed; multi-seed verification in progress)
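As a rough illustration of the penalty term `λ ‖f_l(h_l)‖²` and of the order-of-magnitude bookkeeping used in the bullets above, the sketch below works on plain Python lists. The helper names are hypothetical, and whether the squared norm is summed or averaged over the batch in the real training code is not specified here, so a plain sum over one output vector is an assumption.

```python
import math


def penalized_block_loss(local_loss, block_output, lam):
    """Return local_loss + lam * ||f_l(h_l)||^2 for one block output vector."""
    squared_norm = sum(x * x for x in block_output)
    return local_loss + lam * squared_norm


def oom_change(before, after):
    """Signed change in orders of magnitude: log10(after / before)."""
    return math.log10(after / before)


# Toy block output with squared norm 25, penalized at lam = 1e-2.
loss = penalized_block_loss(0.5, [3.0, 4.0], lam=1e-2)   # 0.5 + 0.01 * 25 = 0.75

# Residual-stream norm before and after the penalty, as quoted above.
h_change = oom_change(4e8, 4e4)                           # -4 orders of magnitude
print(loss, h_change)
```

Because the penalty scales with the squared output norm, it grows fastest exactly where the residual stream is blowing up, which is one plausible reading of why it targets diagnostics (a) and (b) first.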
+
+**Round 18 framing**: this **partially dissociates the two putative failure modes by intervention**. The penalty alleviates the scale-related diagnostics (a) and (b) but does not bring depth contribution in line with BP. The two failure modes expose **distinct intervention surfaces**.
+
+Full mechanistic separability requires direct deep-block credit-quality measurement on the penalized checkpoint (in progress).
+
+## §5 Pipeline pitfalls catalog (appendix)
+
+7 evaluation-pipeline bugs we found in our own dogfood codebase. Each has a reproducer in `protocol/examples/verify_pitfalls*.py`.
+
+1. `tensor.norm(-1)` is L_{-1} of the whole tensor, not L_2 along dim=-1
+2. `F.cosine_similarity` clamps the divisor from below at `eps=1e-8`, distorting cosines by 10-100× when ‖g‖ is at the floor
+3. fp16 mixed precision underflows BP grads at hidden layers (~5×10⁻¹⁰ → 0)
+4. Random feedback `Bs` are training-specific; Γ is not invariant to reseeding them
+5. Aggregation strategy across layers/samples/batches gives 2.5-3.5× spread on the same data; sign can flip
+6. **[Round 18 reframing]** Per-layer Γ structure is hidden by aggregation. Layer-0 dominance is one specific manifestation (ResMLP DFA: l0 +0.42, l1-4 ≈ 0). On ViT-Mini all layers are uniformly near zero. The general lesson is to always report per-layer.
+6.5. **[Round 18 disclosure]** Auxiliary networks (`Bs`, bridge models) not persisted alongside checkpoints can cause post-hoc Γ scripts to silently fall back to `cos(BP, BP) = 1.0` and report perfect alignment. We discovered this in our own pipeline.
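Pitfall 2 can be illustrated without a tensor library. The sketch below models an epsilon-clamped cosine in pure Python and compares it with an unclamped one. Clamping each norm separately at `eps` is an assumption about the exact form (PyTorch versions may differ on whether each norm or the product of norms is clamped), and the function names and toy vectors are hypothetical.

```python
import math


def cosine_clamped(u, v, eps=1e-8):
    """Cosine with each vector norm clamped from below at eps (assumed form)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = max(math.sqrt(sum(a * a for a in u)), eps)
    norm_v = max(math.sqrt(sum(b * b for b in v)), eps)
    return dot / (norm_u * norm_v)


def cosine_exact(u, v):
    """Unclamped cosine computed in float64."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# A gradient with norm 5e-10 that is perfectly aligned with the reference.
g_floor = [3e-10, 4e-10]
reference = [3.0, 4.0]
exact = cosine_exact(g_floor, reference)       # 1.0 up to rounding
clamped = cosine_clamped(g_floor, reference)   # 0.05 up to rounding
print(exact, clamped, exact / clamped)
```

In this toy case the clamp replaces a norm of 5×10⁻¹⁰ with 10⁻⁸, shrinking a perfect alignment by a factor of 20, which falls inside the 10-100× range quoted in item 2.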
+
+## §6 Reference implementation
+
+`protocol/` package: 4 diagnostics, `DiagnosticReport` class, smoke test, audit table, decision-utility ablation, temporal evolution, sensitivity analyses, reporting template, checklist. ~20 Python files, all syntax-clean. The smoke test reproduces the expected verdicts on BP, DFA, and EP checkpoints.
+
+Reproducibility: every result in this paper reproduces from `protocol/examples/*.py` against checkpoints in `results/`.
+
+## Limitations and walk-backs (appendix or §7)
+
+We walked back our own claims multiple times during this work. We report these walk-backs explicitly as part of the paper:
+
+1. "DFA trains ViT-Mini to 24% accuracy" → walked back to "DFA-frozen-random-blocks ViT also gets 25.7%; the deep blocks are passengers" (codex round 6)
+2. "DFA trains ResMLP to 31%" → walked back to "DFA-trainable is 4 pp BELOW DFA-shallow on ResMLP; DFA training the deep blocks actively destroys value" (codex round 8)
+3. "Penalty rescues to 36.3% above shallow baseline → second failure mode established by (d)" → softened to "the (d) verdict on penalized DFA depends on the intervention strength; the two failure modes are partially dissociated by intervention" (codex round 18)
+4. "Layer 0 always dominates Γ" → softened to "aggregation hides per-layer structure that depends on the architecture; on ResMLP layer 0 dominates, on ViT-Mini all layers are uniformly near zero" (round 18 follow-up)
+5. "(b) is causally specific to terminal LN" → softened to "(b) appears restricted to terminal-LayerNorm architectures among those we audited" (round 18 follow-up)
+
+## Open experimental questions
+
+- ViT-Mini 5-method audit (waiting on checkpoint training)
+- Multi-seed λ=1e-3 (running)
+- Direct deep-block credit measurement on penalized DFA (rerunning at --epochs 30 with checkpoint save)
+- Threshold robustness sweep on additional architectures
+- BatchNorm ResNet as a 4th non-LN architecture (deferred — CNN already covers this case)