# FA Evaluation Pitfalls Checklist
Six evaluation-pipeline bugs we found in our own codebase while developing
this protocol. Each one silently corrupted reported numbers in a way that
would have been very hard for a reviewer to catch. Use this list as a
final-pass review for any FA evaluation paper.
## 1. `tensor.norm(-1)` is NOT "L₂ along dim=-1"
`tensor.norm(-1)` passes `-1` as the norm order `p`, not as `dim`. It
therefore computes the L₋₁ "norm" (`1/sum(1/|x|)`) of the *entire* flattened
tensor, not the per-row L₂ norm. The correct call for "L₂ along the feature
dimension" is `tensor.norm(dim=-1)`.
This single bug invalidated several months of our own gradient-norm
measurements; gradient sparsity tables, "support sparsity" analyses, and
the `naive_StateErr` metric in three of our trainers were all degenerate.
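
To make the size of the discrepancy concrete, here is a minimal pure-Python sketch of the two formulas. It does not call PyTorch; `flat_p_norm` and `row_l2_norms` are hypothetical helpers that restate what the two calls compute, and the `grads` values are invented for easy arithmetic.

```python
import math

def flat_p_norm(rows, p):
    """p-'norm' over every element, ignoring row structure.

    Models what `tensor.norm(-1)` computes: order p = -1, no `dim`.
    """
    return sum(abs(x) ** p for row in rows for x in row) ** (1.0 / p)

def row_l2_norms(rows):
    """L2 norm of each row; models `tensor.norm(dim=-1)` on a 2-D tensor."""
    return [math.sqrt(sum(x * x for x in row)) for row in rows]

# Hypothetical 2x2 batch of gradients (values chosen for easy arithmetic).
grads = [[3.0, 4.0], [6.0, 8.0]]

buggy = flat_p_norm(grads, -1)   # 1 / (1/3 + 1/4 + 1/6 + 1/8) = 8/7
correct = row_l2_norms(grads)    # one norm per row: 5 and 10
print(buggy, correct)
```

The buggy call returns a single scalar that is smaller than every per-row norm, which is why downstream sparsity and error metrics built on it degenerate rather than crash.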
**Check**: search your codebase for the extended-syntax regex
`\.norm\(\s*-1\s*\)`, which also covers the whitespace-free form
`.norm(-1)`. Replace every match with `.norm(dim=-1)`.
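
The check above can also be scripted. The following is a rough sketch using only the standard library; `find_suspect_norm_calls` and `scan_repo` are hypothetical names, and the scan only covers `*.py` files and the method-call form.

```python
import re
from pathlib import Path

# Matches `.norm(-1)` with optional whitespace around -1.
# Keyword forms such as `.norm(dim=-1)` do not match.
SUSPECT_NORM = re.compile(r"\.norm\(\s*-1\s*\)")

def find_suspect_norm_calls(source):
    """Return (line_number, line) pairs containing a positional `.norm(-1)`."""
    return [
        (lineno, line.rstrip())
        for lineno, line in enumerate(source.splitlines(), start=1)
        if SUSPECT_NORM.search(line)
    ]

def scan_repo(root):
    """Yield (path, line_number, line) for suspect calls in *.py files under root."""
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for lineno, line in find_suspect_norm_calls(text):
            yield path, lineno, line

# Usage on an in-memory snippet (hypothetical code under review).
sample = "a = x.norm(-1)\nb = x.norm(dim=-1)\nc = y.norm( -1 )\n"
for lineno, line in find_suspect_norm_calls(sample):
    print(lineno, line)
```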
## 2. `F.cosine_similarity` clamps small magnitudes via `eps=1e-8`
`torch.nn.functional.cosine_similarity(a, b)` clamps each norm from below
before dividing: per the current PyTorch documentation the divisor is
`max(||a||, eps) · max(||b||, eps)`, with `eps=1e-8` by default. When the BP
gradient `g` at hidden layers has norm ~1e-10 (which it does on DFA-trained
pre-LN ResMLPs) and is passed as `b`, the divisor becomes `||a|| · 1e-8`
instead of `||a|| · 1e-10`, scaling the reported cosine by `||g|| / eps` ≈
0.01. The reported "Γ ≈ 0.10" is then off by ~100× from the true cosine.
**Check**: when computing Γ, either use a hand-rolled cosine that does not
clamp, or assert `||g|| > 1e-8` before calling `F.cosine_similarity`. If you
report Γ alongside ‖g‖ and ‖g‖ is at the floor, your Γ is uninterpretable —
the diagnostic protocol will catch this.
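
The effect of the clamp can be reproduced in a few lines. This is a pure-Python model, assuming each norm is clamped separately at `eps`; it is not the library implementation, and the helper names and the two vectors are hypothetical.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _l2(a):
    return math.sqrt(_dot(a, a))

def cosine_unclamped(a, b):
    """Plain cosine; raises ZeroDivisionError if either vector is exactly zero."""
    return _dot(a, b) / (_l2(a) * _l2(b))

def cosine_clamped(a, b, eps=1e-8):
    """Cosine with each norm clamped from below at eps (assumed clamp model)."""
    return _dot(a, b) / (max(_l2(a), eps) * max(_l2(b), eps))

def check_above_floor(v, eps=1e-8):
    """Guard to run before a clamped cosine: fail loudly at the floor."""
    norm = _l2(v)
    if norm <= eps:
        raise ValueError(f"norm {norm:.3e} is at or below eps={eps:.0e}")
    return norm

# Hypothetical vectors pointing in the same direction, so the true cosine is 1.
local_signal = [0.6, 0.8]        # norm 1
bp_grad = [0.6e-10, 0.8e-10]     # norm 1e-10, below eps

print(cosine_unclamped(local_signal, bp_grad))  # close to 1.0
print(cosine_clamped(local_signal, bp_grad))    # close to 0.01 = ||g|| / eps
```

Calling `check_above_floor(bp_grad)` before the clamped cosine raises here, which is the intended behaviour: the metric is refused rather than silently shrunk.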
## 3. fp16 mixed precision underflows BP grads at hidden layers
Computing Γ in fp16 mixed precision on a DFA-trained ResMLP gives `nan`
because BP grads (~5e-10) are below fp16's smallest subnormal (~6e-8): they
underflow to zero, and the cosine becomes 0/0. **bf16 works** because it has
the same exponent range as fp32 (only the mantissa is reduced).
**Check**: if your reported metric depends on a tiny intermediate magnitude,
either compute it in fp32, switch to bf16, or rescale by a known constant
before the precision-sensitive step.
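
The underflow can be checked without a GPU. The sketch below uses the standard-library `struct` module, whose `e` format is IEEE half precision. `to_bf16` is a simplified stand-in that truncates a float32 to its top 16 bits (real conversions typically round), and the scale factor `2**20` is an arbitrary illustrative choice.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Approximate bfloat16: keep the sign, 8 exponent bits, and top 7
    mantissa bits of the float32 encoding (truncation, not rounding)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

g = 5e-10  # magnitude quoted above for BP gradients

print(to_fp16(g))              # 0.0: below the smallest fp16 subnormal
print(to_bf16(g))              # non-zero, within about 1% of g
print(to_fp16(g * 2.0 ** 20))  # non-zero once rescaled by a known constant
```

Once a gradient entry has become exactly zero, no later rescaling recovers it, so the rescale has to happen before the cast to fp16.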
## 4. Random feedback `Bs` are training-specific; Γ is not invariant to reseeding them
DFA at fixed seed s42 reports Γ ≈ 0.106 *with the random feedback projection
matrices `Bs` that were used during training*. With 20 fresh random `Bs`
draws on the same trained network, Γ ≈ 0 ± 0.005. The non-zero Γ is the
network adapting to the *specific* `Bs` it saw during training, not an
intrinsic property of the local credit signal.
**Check**: when you report Γ for an FA-style method, specify exactly which
`Bs` were used. If you cannot reproduce them from the seed, your numbers
are not reproducible.
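
One way to make the `Bs` reproducible is to draw them from a dedicated, seeded generator and log a fingerprint next to every reported Γ. The sketch below is only an illustration: the Gaussian initialisation, the nested-list representation, and the names `make_feedback_matrices` and `fingerprint` are all assumptions, not the scheme used in our trainers.

```python
import hashlib
import random
import struct

def make_feedback_matrices(seed, shapes):
    """Draw one random matrix per (rows, cols) shape from a dedicated RNG.

    A private random.Random instance is used so that unrelated draws from
    the global RNG cannot shift the feedback matrices.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
        for rows, cols in shapes
    ]

def fingerprint(matrices):
    """Short SHA-256 digest of all entries, suitable for logging next to Γ."""
    digest = hashlib.sha256()
    for matrix in matrices:
        for row in matrix:
            digest.update(struct.pack(f"<{len(row)}d", *row))
    return digest.hexdigest()[:16]

shapes = [(2, 3), (3, 3)]                 # hypothetical layer shapes
training_bs = make_feedback_matrices(42, shapes)
fresh_bs = make_feedback_matrices(43, shapes)
print(fingerprint(training_bs), fingerprint(fresh_bs))
```

Reporting the fingerprint makes it unambiguous whether a given Γ was measured against the training `Bs` or against a fresh draw.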
## 5. Aggregation strategy across layers / samples / batches is rarely specified, but determines the headline number
We computed DFA's "headline Γ" with four equally valid aggregation
strategies, taking either the mean or the median over layers and either the
mean or the median over samples: (median over layers) × (mean over samples),
(mean over layers) × (median over samples), and the two remaining
combinations. The four strategies give Γ ∈ [0.085, 0.211] on the *same*
trained DFA s42 network, a 2.5× spread. For credit bridge the spread is 5×
(Γ ∈ [0.057, 0.285]).
**Check**: state explicitly how you aggregate Γ. Report per-layer values
and let the reader pick. Do not collapse to a single number without showing
the layer breakdown.
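
The sensitivity to aggregation is easy to demonstrate on a toy table. In this sketch the Γ values are invented, and the order of reduction (samples within each layer first, then layers) is an assumption about what "(X over layers) × (Y over samples)" means.

```python
from statistics import mean, median

def aggregate(gamma, over_layers, over_samples):
    """gamma[l][s] is Γ for layer l and sample s.

    Reduce over samples within each layer, then reduce the per-layer values.
    """
    per_layer = [over_samples(row) for row in gamma]
    return over_layers(per_layer)

# Hypothetical Γ values: 3 layers x 3 samples, first layer much larger.
gamma = [
    [0.50, 0.40, 0.10],
    [0.02, 0.00, 0.10],
    [0.01, 0.03, 0.02],
]

reducers = {"mean": mean, "median": median}
results = {
    (layer_name, sample_name): aggregate(gamma, layer_fn, sample_fn)
    for layer_name, layer_fn in reducers.items()
    for sample_name, sample_fn in reducers.items()
}

for (layer_name, sample_name), value in sorted(results.items()):
    print(f"{layer_name} over layers x {sample_name} over samples: {value:.3f}")
```

On this toy table the four headline numbers differ by several times, even though every one of them is a defensible summary of the same measurements.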
## 6. Layer-0 dominates the headline Γ; deeper layers are ~0
For DFA on a 4-block ResMLP, the headline Γ ≈ 0.10 is driven almost entirely
by the embedding layer (Γ_layer0 ≈ 0.43). The block layers have Γ ≈ 0. A
"mean over layers" summary statistic hides this. The same pattern likely
holds for other FA-style methods.
**Check**: always report per-layer Γ. A single average is misleading when
one layer dominates.
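
A simple dominance check can flag this situation automatically. The sketch below is hypothetical: `dominant_layer` and its 0.5 threshold are illustrative choices, and only the 0.43 for layer 0 comes from the text above, while the four small block values are invented placeholders for "≈ 0".

```python
def dominant_layer(per_layer_gamma, threshold=0.5):
    """Return (index, share) of the layer contributing more than `threshold`
    of the total |Γ|, or None if no single layer dominates."""
    total = sum(abs(g) for g in per_layer_gamma)
    if total == 0.0:
        return None
    shares = [abs(g) / total for g in per_layer_gamma]
    top = max(range(len(shares)), key=shares.__getitem__)
    return (top, shares[top]) if shares[top] > threshold else None

# Layer 0 value from the text; the block values are illustrative only.
per_layer = [0.43, 0.02, 0.01, 0.02, 0.02]

headline = sum(per_layer) / len(per_layer)
print(f"mean over layers: {headline:.2f}")
print("dominant layer:", dominant_layer(per_layer))
```

With these numbers the mean looks like a moderate alignment across the network, while the check shows that a single layer supplies most of it.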
## Suggested final-pass workflow
1. Run `grep -rnE '\.norm\([[:space:]]*-1[[:space:]]*\)' YOUR_REPO` and fix all matches.
2. Run the diagnostic protocol on your trained model. If any flag fires,
fix the underlying issue or walk back the corresponding claim before
submission.
3. Inspect every place you compute Γ:
   - Is the divisor clamped?
   - Are you in fp16?
   - Are you using the training `Bs` or fresh ones?
   - How do you aggregate?
   - Do you report per-layer?
4. Report `‖h_l‖`, `‖g_l‖`, and `Γ_l` *together* for every layer. The
metrics are entangled; reporting one without the others hides the
pathology.
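
For the joint per-layer report in the last step, a plain-text table is usually enough. The following sketch assumes the three quantities have already been measured; `format_layer_report`, the column layout, the `eps` floor marker, and the example rows are all hypothetical.

```python
def format_layer_report(rows, eps=1e-8):
    """Format (layer_name, h_norm, g_norm, gamma) tuples as an aligned table.

    Rows whose gradient norm is at or below eps are marked, because Γ
    computed with a clamped cosine is not interpretable there.
    """
    lines = [f"{'layer':<10}{'||h||':>12}{'||g||':>12}{'Gamma':>10}"]
    for name, h_norm, g_norm, gamma in rows:
        flag = "  <= eps" if g_norm <= eps else ""
        lines.append(
            f"{name:<10}{h_norm:>12.3e}{g_norm:>12.3e}{gamma:>10.3f}{flag}"
        )
    return "\n".join(lines)

# Hypothetical measurements for two layers.
rows = [
    ("embed", 12.5, 3.0e-4, 0.43),
    ("block1", 9.8, 5.0e-10, 0.00),
]
report = format_layer_report(rows)
print(report)
```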