# FA Evaluation Pitfalls Checklist

Six evaluation-pipeline bugs we found in our own codebase while developing this protocol. Each one silently corrupted reported numbers in a way that would have been very hard for a reviewer to catch. Use this list as a final-pass review for any FA evaluation paper.

## 1. `tensor.norm(-1)` is NOT "L₂ along dim=-1"

`tensor.norm(-1)` passes `-1` as the norm order `p`, not as `dim`, so it computes the L₋₁ "norm" (`1/sum(1/|x|)`) of the *entire* flattened tensor, not the per-row L₂ norm. The correct call for "L₂ along the feature dimension" is `tensor.norm(dim=-1)`. This single bug invalidated several months of our own gradient-norm measurements; gradient sparsity tables, "support sparsity" analyses, and the `naive_StateErr` metric in three of our trainers were all degenerate.

**Check**: grep your codebase for `\.norm\(-1\)` and `\.norm\(\s*-1\s*\)`. Replace with `.norm(dim=-1)`.

## 2. `F.cosine_similarity` clamps small magnitudes via `eps=1e-8`

`torch.nn.functional.cosine_similarity(a, g)` clamps each norm separately: it divides by `max(||a||, eps) · max(||g||, eps)` with `eps=1e-8` by default. When the BP gradient `g` at hidden layers has norm ~1e-10 (which it does on DFA-trained pre-LN ResMLPs), the divisor becomes `||a|| · 1e-8` instead of `||a|| · 1e-10`, scaling the reported cosine by `||g|| / eps` ≈ 0.01. The reported "Γ ≈ 0.10" is then off by ~100× from the true cosine.

**Check**: when computing Γ, either use a hand-rolled cosine that does not clamp, or assert `||g|| > 1e-8` before calling `F.cosine_similarity`. If you report Γ alongside ‖g‖ and ‖g‖ is at the floor, your Γ is uninterpretable; the diagnostic protocol will catch this.

## 3. fp16 mixed precision underflows BP grads at hidden layers

Computing Γ in fp16 mixed precision on a DFA-trained ResMLP gives `nan` because BP grads (~5e-10) are below fp16's smallest subnormal (~6e-8) and underflow to zero. **bf16 works** because it has the same exponent range as fp32 (only the mantissa is reduced).
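The underflow described above can be reproduced without a GPU. The following is a minimal sketch that emulates fp16 and bf16 rounding with Python's standard library rather than PyTorch; the helper names `round_trip_fp16` and `round_trip_bf16` are hypothetical, and the bf16 emulation assumes round-to-nearest-even on the top 16 bits of the fp32 encoding and ignores NaN/inf edge cases.

```python
import struct


def round_trip_fp16(x: float) -> float:
    """Round x to IEEE 754 binary16 (fp16) and convert it back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_trip_bf16(x: float) -> float:
    """Round x to bfloat16 and back (emulated; NaN/inf edge cases are ignored).

    bfloat16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits
    of the fp32 encoding, so we round the fp32 bit pattern to its upper 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


grad = 5e-10  # magnitude quoted above for BP grads at hidden layers

fp16_grad = round_trip_fp16(grad)
bf16_grad = round_trip_bf16(grad)
fp16_min_subnormal = round_trip_fp16(2.0 ** -24)  # smallest positive fp16 value

print("fp16 smallest subnormal:", fp16_min_subnormal)
print("grad in fp16:", fp16_grad)
print("grad in bf16:", bf16_grad)
print("bf16 relative error:", abs(bf16_grad - grad) / grad)
```

In this emulation the fp16 value is flushed to zero, so any cosine built on it is undefined, while the bf16 value keeps the right order of magnitude at reduced precision.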

**Check**: if your reported metric depends on a tiny intermediate magnitude, either compute it in fp32, switch to bf16, or rescale by a known constant before the precision-sensitive step.

## 4. Random feedback `Bs` are training-specific: Γ is not invariant to reseeding them

DFA at fixed seed s42 reports Γ ≈ 0.106 *with the random feedback projection matrices `Bs` that were used during training*. With 20 fresh random `Bs` draws on the same trained network, Γ ≈ 0 ± 0.005. The non-zero Γ is the network adapting to the *specific* `Bs` it saw during training, not an intrinsic property of the local credit signal.

**Check**: when you report Γ for an FA-style method, specify exactly which `Bs` were used. If you cannot reproduce them from the seed, your numbers are not reproducible.

## 5. Aggregation strategy across layers / samples / batches is rarely specified, but determines the headline number

We computed DFA's "headline Γ" with four equally valid aggregation strategies: (median over layers) × (mean over samples), (mean over layers) × (median over samples), and so on. The four strategies give Γ ∈ [0.085, 0.211] on the *same* trained DFA s42 network, a 2.5× spread. For credit bridge the spread is 5× (Γ ∈ [0.057, 0.285]).

**Check**: state explicitly how you aggregate Γ. Report per-layer values and let the reader pick. Do not collapse to a single number without showing the layer breakdown.

## 6. Layer-0 dominates the headline Γ; deeper layers are ~0

For DFA on a 4-block ResMLP, the headline Γ ≈ 0.10 is driven almost entirely by the embedding layer (Γ_layer0 ≈ 0.43). The block layers have Γ ≈ 0. A "mean over layers" summary statistic hides this. The same pattern likely holds for other FA-style methods.

**Check**: always report per-layer Γ. A single average is misleading when one layer dominates.

## Suggested final-pass workflow

1. Run `grep -nr '\.norm(-1)' YOUR_REPO` and fix all matches.
2. Run the diagnostic protocol on your trained model.
   If any flag fires, fix the underlying issue or walk back the corresponding claim before submission.
3. Inspect every place you compute Γ:
   - Is the divisor clamped?
   - Are you in fp16?
   - Are you using the training `Bs` or fresh ones?
   - How do you aggregate?
   - Do you report per-layer?
4. Report `‖h_l‖`, `‖g_l‖`, and `Γ_l` *together* for every layer. The metrics are entangled; reporting one without the others hides the pathology.
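
The aggregation and per-layer questions in step 3 can be made concrete with a small sketch. The values below are hypothetical toy numbers chosen to mimic a dominant layer 0; they are not measurements from our runs. The names `gamma`, `aggregate`, and `all_strategies` are illustrative, and the sketch assumes that samples are reduced first and layers second.

```python
from statistics import mean, median

# Hypothetical toy values, NOT measurements: gamma[l][n] is the cosine
# for layer l and sample n. Layer 0 dominates; the block layers sit near zero.
gamma = [
    [0.40, 0.45, 0.44],   # layer 0 (embedding)
    [0.01, -0.02, 0.00],  # block 1
    [0.00, 0.01, -0.01],  # block 2
    [-0.01, 0.00, 0.02],  # block 3
    [0.02, -0.01, 0.00],  # block 4
]

REDUCERS = {"mean": mean, "median": median}


def aggregate(values, over_layers, over_samples):
    """Reduce over samples within each layer, then reduce over layers."""
    per_layer = [REDUCERS[over_samples](row) for row in values]
    return REDUCERS[over_layers](per_layer)


def all_strategies(values):
    """Return the headline number under every (layers, samples) reducer pair."""
    return {
        (lay, samp): aggregate(values, lay, samp)
        for lay in REDUCERS
        for samp in REDUCERS
    }


headline = all_strategies(gamma)
per_layer_mean = [mean(row) for row in gamma]

for (lay, samp), value in headline.items():
    print(f"{lay} over layers, {samp} over samples: {value:+.4f}")
print("per-layer mean:", [f"{v:+.3f}" for v in per_layer_mean])
```

Even on this toy input the four headline numbers disagree substantially, and only the per-layer line shows that a single layer carries the signal.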