# FA Evaluation Pitfalls Checklist

Six evaluation-pipeline bugs we found in our own codebase while developing this protocol. Each one silently corrupted reported numbers in a way that would have been very hard for a reviewer to catch. Use this list as a final-pass review for any FA evaluation paper.

## 1. `tensor.norm(-1)` is NOT "L₂ along dim=-1"

The first positional argument of `tensor.norm` is the order `p`, not the dimension. `tensor.norm(-1)` therefore computes the L₋₁ "norm" (`1/sum(1/|x|)`) of the *entire* flattened tensor, not the per-row L₂ norm. The correct call for "L₂ along the feature dimension" is `tensor.norm(dim=-1)`. This single bug invalidated several months of our own gradient-norm measurements; gradient sparsity tables, "support sparsity" analyses, and the `naive_StateErr` metric in three of our trainers were all degenerate.

**Check**: grep your codebase for `\.norm\(-1\)` and `\.norm\(\s*-1\s*\)`. Replace with `.norm(dim=-1)`.

## 2. `F.cosine_similarity` clamps small magnitudes via `eps=1e-8`

`torch.nn.functional.cosine_similarity(a, b)` clamps the norms in its divisor with `eps=1e-8` by default. Current PyTorch documentation gives the divisor as `max(||a||, eps) · max(||b||, eps)`; older releases documented `max(||a||·||b||, eps)` instead, so check the behavior of the version you run. When the BP gradient `g` at hidden layers is ~1e-10 (which it is on DFA-trained pre-LN ResMLPs), the divisor becomes `||a|| · 1e-8` instead of `||a|| · 1e-10`, scaling the reported cosine by `||g|| / eps` ≈ 0.01. The reported "Γ ≈ 0.10" is then off by ~100× from the true cosine.

**Check**: when computing Γ, either use a hand-rolled cosine that does not clamp, or assert `||g|| > 1e-8` before calling `F.cosine_similarity`. If you report Γ alongside ‖g‖ and ‖g‖ is at the floor, your Γ is uninterpretable; the diagnostic protocol will catch this.

## 3. fp16 mixed precision underflows BP grads at hidden layers

Computing Γ in fp16 mixed precision on a DFA-trained ResMLP gives `nan` because BP grads (~5e-10) are below fp16's smallest subnormal (~6e-8) and underflow to zero. **bf16 works** because it has the same exponent range as fp32 (only the mantissa is reduced).
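
The underflow can be reproduced in isolation. The snippet below is a minimal sketch, not code from our trainers: `g` is a synthetic stand-in for a hidden-layer BP gradient at the ~5e-10 magnitude quoted above, and the cosine is a hand-rolled, unclamped one chosen only to show where `nan` can come from.

```python
import torch

# Synthetic stand-in for a hidden-layer BP gradient (assumed magnitude ~5e-10).
g = torch.full((8,), 5e-10, dtype=torch.float32)

# Round-trip through each reduced-precision format, then inspect in fp32.
g_fp16 = g.to(torch.float16).to(torch.float32)
g_bf16 = g.to(torch.bfloat16).to(torch.float32)

# fp16 cannot represent 5e-10, so every entry becomes exactly zero.
fp16_all_zero = bool((g_fp16 == 0).all())

# bf16 shares fp32's exponent range, so the magnitude survives with a small
# relative rounding error.
bf16_rel_err = float(((g_bf16 - g).abs() / g).max())

# An unclamped cosine on the fp16 round-tripped copy is 0/0.
cos_fp16 = torch.dot(g_fp16, g_fp16) / (g_fp16.norm() * g_fp16.norm())

print(f"fp16 flushed to zero: {fp16_all_zero}")
print(f"bf16 max relative error: {bf16_rel_err:.3e}")
print(f"unclamped cosine after fp16 round-trip: {cos_fp16.item()}")
```

The round-trip back to fp32 is deliberate: it isolates the representation loss from any question of which fp16 kernels are available on a given device.
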

**Check**: if your reported metric depends on a tiny intermediate magnitude, either compute it in fp32, switch to bf16, or rescale by a known constant before the precision-sensitive step.

## 4. Random feedback `Bs` are training-specific: Γ is not invariant to reseeding them

DFA at fixed seed s42 reports Γ ≈ 0.106 *with the random feedback projection matrices `Bs` that were used during training*. With 20 fresh random `Bs` draws on the same trained network, Γ ≈ 0 ± 0.005. The non-zero Γ is the network adapting to the *specific* `Bs` it saw during training, not an intrinsic property of the local credit signal.

**Check**: when you report Γ for an FA-style method, specify exactly which `Bs` were used. If you cannot reproduce them from the seed, your numbers are not reproducible.

## 5. Aggregation strategy across layers / samples / batches is rarely specified, but determines the headline number

We computed DFA's "headline Γ" with four equally valid aggregation strategies: (median over layers) × (mean over samples), (mean over layers) × (median over samples), and so on. The four strategies give Γ ∈ [0.085, 0.211] on the *same* trained DFA s42 network, a 2.5× spread. For Credit Bridge the spread is 5× (Γ ∈ [0.057, 0.285]).

**Check**: state explicitly how you aggregate Γ. Report per-layer values and let the reader pick. Do not collapse to a single number without showing the layer breakdown.

## 6.5 Auxiliary networks (Bs, bridge models) not persisted → silent self-cosine fallback

We discovered this in our own dogfood codebase: the `cnn_baseline.py` script records State Bridge Γ = 1.0 and Credit Bridge Γ = 1.0 in its JSON output. Reading the source: when the auxiliary networks (the bridge predictor for SB/CB, the random feedback `Bs` for DFA) are not saved alongside the model checkpoint, the post-hoc Γ-computation script *cannot* reconstruct the local credit signal. Instead of erroring out, the script falls back to `cos(BP_grad, BP_grad) = 1.0` and silently records that.
The JSON has a `Gamma_note` field saying "BP grad self-cosine (feedback nets not saved)", but a reader who doesn't read that field would interpret 1.0 as perfect alignment.

**Check**: when you save a model checkpoint for FA evaluation, ALSO save:

- the random feedback `Bs` for DFA
- the bridge / value network state_dict for SB / CB
- the EP feedback weights for EP

If any of these are missing, your post-hoc Γ measurement is undefined. Report `Γ = N/A` in your tables, NOT a fallback value of 1.0.

## 6. Per-layer Γ structure is hidden by aggregation (layer-0 dominance is one mode)

For DFA on a 4-block ResMLP, the headline Γ ≈ 0.10 is driven almost entirely by the embedding layer (Γ_layer0 ≈ +0.42, layers 1-4 ≈ 0). The "mean over layers" summary statistic hides this structure.

**This pattern is architecture-specific, not universal.** On ViT-Mini, the per-layer Γ is uniformly near zero across all layers (no single layer dominates). On ResMLP the dominance is real and severe. The general lesson is not "layer 0 always dominates"; it is that **the aggregation hides per-layer structure that depends on the input preprocessing and the architecture's interaction with random feedback `Bs`**.

**Check**: always report per-layer Γ. A single average can mislead in either direction: by hiding dominance (ResMLP-style) or by averaging over layers that all measure essentially the same degenerate quantity (ViT-style).

## Suggested final-pass workflow

1. Run `grep -nr '\.norm(-1)' YOUR_REPO` and fix all matches.
2. Run the diagnostic protocol on your trained model. If any flag fires, fix the underlying issue or walk back the corresponding claim before submission.
3. Inspect every place you compute Γ:
   - Is the divisor clamped?
   - Are you in fp16?
   - Are you using the training `Bs` or fresh ones?
   - How do you aggregate?
   - Do you report per-layer?
4. Report `‖h_l‖`, `‖g_l‖`, and `Γ_l` *together* for every layer. The metrics are entangled; reporting one without the others hides the pathology.
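
Step 4 of the workflow might look like the following. This is a minimal sketch under stated assumptions, not the diagnostic protocol's implementation: the helper names `unclamped_cosine` and `per_layer_report` are hypothetical, the inputs are assumed to be lists of per-layer tensors, and Γ for a layer is taken here as the cosine between the flattened BP gradient and the flattened local credit signal (one of several possible aggregations, see pitfall 5). The cosine is computed in float64 with no `eps` clamp and returns `nan` when a norm is exactly zero, so an undefined value is never silently replaced by a number.

```python
import math
import torch


def unclamped_cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine of two flattened tensors in float64, with no eps clamp."""
    a64 = a.detach().flatten().to(torch.float64)
    b64 = b.detach().flatten().to(torch.float64)
    denom = a64.norm() * b64.norm()
    if denom.item() == 0.0:
        return math.nan  # undefined: report N/A, never a fallback value
    return (torch.dot(a64, b64) / denom).item()


def per_layer_report(acts, bp_grads, local_signals):
    """Return one row per layer holding ||h_l||, ||g_l|| and Gamma_l together."""
    rows = []
    for layer, (h, g, d) in enumerate(zip(acts, bp_grads, local_signals)):
        rows.append({
            "layer": layer,
            "h_norm": h.detach().to(torch.float64).norm().item(),
            "g_norm": g.detach().to(torch.float64).norm().item(),
            "gamma": unclamped_cosine(g, d),
        })
    return rows


# Synthetic two-layer example (illustrative values only, not measured data).
torch.manual_seed(0)
acts = [torch.randn(4, 16), torch.randn(4, 16)]
g0 = torch.randn(4, 16) * 1e-10            # tiny but non-zero BP gradient
bp_grads = [g0, torch.zeros(4, 16)]        # layer 1 gradient is exactly zero
local_signals = [2.0 * g0, torch.randn(4, 16)]

report = per_layer_report(acts, bp_grads, local_signals)
for row in report:
    print(row)
```

Keeping the three quantities in one row per layer makes a floor-level `g_norm` visible next to the Γ it undermines, rather than leaving the two in separate tables.
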