Add FA diagnostic protocol reference implementation

Codex round 15 #1 priority for the E&D-track paper:
- protocol/protocol.py: 4 diagnostics (residual norms, BP grad norms, cross-batch direction stability, and a frozen-baseline comparator)
- protocol/report.py: DiagnosticReport with per-diagnostic verdicts and pretty-printer
- protocol/smoke_test.py: validates BP/DFA/EP checkpoints produce the expected verdicts (BP/EP trustworthy; DFA walked back via residual explosion + BP grad at floor)
- protocol/README.md: usage, audit cases, threshold rationale
- protocol/CHECKLIST.md: 6 evaluation pipeline pitfalls (norm(-1), cosine_similarity eps clamp, fp16 underflow, Bs reproducibility, aggregation, layer-0 dominance)
- protocol/REPORTING_TEMPLATE.md: per-method fillable form for FA papers
author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-07 22:20:48 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-07 22:20:48 -0500
commit: 7b64702ad970c16171142665365e16a8e1737190 (patch)
tree: 30df9313fc195aa037f05c0598fdc0df8b32bbcc /protocol/CHECKLIST.md
parent: 2dc8e7efb5f2ff827fbd97be87f0127aa5ab2757 (diff)
1 files changed, 96 insertions, 0 deletions
diff --git a/protocol/CHECKLIST.md b/protocol/CHECKLIST.md
new file mode 100644
index 0000000..3227642
--- /dev/null
+++ b/protocol/CHECKLIST.md
@@ -0,0 +1,96 @@
+# FA Evaluation Pitfalls Checklist
+
+Six evaluation-pipeline bugs we found in our own codebase while developing
+this protocol. Each one silently corrupted reported numbers in a way that
+would have been very hard for a reviewer to catch. Use this list as a
+final-pass review for any FA evaluation paper.
+
+## 1. `tensor.norm(-1)` is NOT "L₂ along dim=-1"
+
+`tensor.norm(-1)` computes the L₋₁ "norm" (`1/sum(1/|x|)`) of the *entire*
+flattened tensor, not the per-row L₂ norm. The correct call for "L₂ along
+the feature dimension" is `tensor.norm(dim=-1)`.
+
+This single bug invalidated several months of our own gradient-norm
+measurements; gradient sparsity tables, "support sparsity" analyses, and
+the `naive_StateErr` metric in three of our trainers were all degenerate.
+
+**Check**: grep your codebase with the extended regex `\.norm\(\s*-1\s*\)`
+(e.g. `grep -nrE`) and replace every match with `.norm(dim=-1)`.
+
+## 2. `F.cosine_similarity` clamps small magnitudes via `eps=1e-8`
+
+`torch.nn.functional.cosine_similarity(a, g)` clamps each norm from below,
+dividing by `max(||a||, eps) · max(||g||, eps)` with `eps=1e-8` by default.
+When the BP gradient `g` at hidden layers has norm ~1e-10 (which it does on
+DFA-trained pre-LN ResMLPs), the divisor becomes `||a|| · 1e-8` instead of
+`||a|| · ||g||`, scaling the reported cosine by `||g|| / eps` ≈ 0.01. The
+reported "Γ ≈ 0.10" is then off by ~100× from the true cosine.
+
+**Check**: when computing Γ, either use a hand-rolled cosine that does not
+clamp, or assert `||g|| > 1e-8` before calling `F.cosine_similarity`. If you
+report Γ alongside ‖g‖ and ‖g‖ is at the floor, your Γ is uninterpretable —
+the diagnostic protocol will catch this.
+
+## 3. fp16 mixed precision underflows BP grads at hidden layers
+
+Computing Γ in fp16 mixed precision on a DFA-trained ResMLP gives `nan`
+because BP grads (~5e-10) are below fp16's smallest subnormal (~6e-8).
+**bf16 works** because it has the same exponent range as fp32 (only the
+mantissa is reduced).
+
+**Check**: if your reported metric depends on a tiny intermediate magnitude,
+either compute it in fp32, switch to bf16, or rescale by a known constant
+before the precision-sensitive step.
+
+## 4. Random feedback `Bs` are training-specific; Γ is not invariant to reseeding them
+
+DFA at fixed seed s42 reports Γ ≈ 0.106 *with the random feedback projection
+matrices `Bs` that were used during training*. With 20 fresh random `Bs`
+draws on the same trained network, Γ ≈ 0 ± 0.005. The non-zero Γ is the
+network adapting to the *specific* `Bs` it saw during training, not an
+intrinsic property of the local credit signal.
+
+**Check**: when you report Γ for an FA-style method, specify exactly which
+`Bs` were used. If you cannot reproduce them from the seed, your numbers
+are not reproducible.
+
+## 5. Aggregation strategy across layers / samples / batches is rarely specified, but determines the headline number
+
+We computed DFA's "headline Γ" with four equally valid aggregation
+strategies: (median over layers) × (mean over samples), (mean over layers) ×
+(median over samples), and so on. The four strategies give Γ ∈ [0.085,
+0.211] on the *same* trained DFA s42 network — a 2.5× spread. For credit
+bridge it's 5× (Γ ∈ [0.057, 0.285]).
+
+**Check**: state explicitly how you aggregate Γ. Report per-layer values
+and let the reader pick. Do not collapse to a single number without showing
+the layer breakdown.
+
+## 6. Layer-0 dominates the headline Γ; deeper layers are ~0
+
+For DFA on a 4-block ResMLP, the headline Γ ≈ 0.10 is driven almost entirely
+by the embedding layer (Γ_layer0 ≈ 0.43). The block layers have Γ ≈ 0. A
+"mean over layers" summary statistic hides this. The same pattern likely
+holds for other FA-style methods.
+
+**Check**: always report per-layer Γ. A single average is misleading when
+one layer dominates.
+
+## Suggested final-pass workflow
+
+1. Run `grep -nrE '\.norm\(\s*-1\s*\)' YOUR_REPO` and fix all matches.
+2. Run the diagnostic protocol on your trained model. If any flag fires,
+   fix the underlying issue or walk back the corresponding claim before
+   submission.
+3. Inspect every place you compute Γ:
+   - Is the divisor clamped?
+   - Are you in fp16?
+   - Are you using the training `Bs` or fresh ones?
+   - How do you aggregate?
+   - Do you report per-layer?
+4. Report `‖h_l‖`, `‖g_l‖`, and `Γ_l` *together* for every layer. The
+   metrics are entangled; reporting one without the others hides the
+   pathology.
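
The underflow in pitfall 3 can likewise be reproduced with the standard library alone, since `struct` supports IEEE 754 half precision through the `e` format. This is a rough sketch under stated assumptions: the helper `to_fp16`, the gradient magnitude `grad`, and the power-of-two `SCALE` are all illustrative choices, not values taken from the protocol code.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


FP16_MIN_SUBNORMAL = 2.0 ** -24   # about 5.96e-8, the smallest positive fp16 value
SCALE = 2.0 ** 20                 # hypothetical rescaling constant (power of two)

grad = 5e-10                      # hypothetical BP gradient magnitude at a hidden layer

# Stored directly in fp16, the value is below the smallest subnormal and becomes 0.
grad_fp16 = to_fp16(grad)

# Rescaling first moves the value into fp16's normal range; dividing the scale
# back out in higher precision recovers it to within fp16 rounding error.
rescaled_fp16 = to_fp16(grad * SCALE)
recovered = rescaled_fp16 / SCALE

print(f"direct fp16: {grad_fp16}  rescaled then recovered: {recovered:.3e}")
```

The direct conversion loses the gradient entirely, while the rescaled path keeps it to within about three significant digits. A power of two is used for `SCALE` here so that the multiplication itself introduces no rounding; any known constant that lifts the value into range would serve the same purpose.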