protocol/CHECKLIST.md



# FA Evaluation Pitfalls Checklist

Seven evaluation-pipeline bugs we found in our own codebase while developing
this protocol. Each one silently corrupted reported numbers in a way that
would have been very hard for a reviewer to catch. Use this list as a
final-pass review for any FA evaluation paper.

## 1. `tensor.norm(-1)` is NOT "L₂ along dim=-1"

The first positional argument of `Tensor.norm` is the order `p`, not the
dimension. `tensor.norm(-1)` therefore computes the L₋₁ "norm"
(`1/sum(1/|x|)`) of the *entire* flattened tensor, not the per-row L₂ norm.
The correct call for "L₂ along the feature dimension" is
`tensor.norm(dim=-1)`.

This single bug invalidated several months of our own gradient-norm
measurements; gradient sparsity tables, "support sparsity" analyses, and
the `naive_StateErr` metric in three of our trainers were all degenerate.

**Check**: grep your codebase for `\.norm\(-1\)` and `\.norm\(\s*-1\s*\)`.
Replace with `.norm(dim=-1)`.
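
The difference is easy to see on a toy tensor. The sketch below is only an
illustration with made-up values (it is not taken from our trainers): the
two rows have L₂ norms 5 and 10, yet the positional call returns one scalar
for the whole tensor.

```python
import torch

# Toy batch of two feature vectors; the rows have L2 norms 5 and 10.
x = torch.tensor([[3.0, 4.0], [6.0, 8.0]])

# Positional -1 is taken as p=-1 and reduced over the whole tensor:
# (1/3 + 1/4 + 1/6 + 1/8) ** -1 = 1 / 0.875
wrong = x.norm(-1)

# Keyword dim=-1 keeps the default p=2 and reduces only the last dimension.
right = x.norm(dim=-1)

print(wrong)  # a single scalar, about 1.1429
print(right)  # one value per row: 5 and 10
```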

## 2. `F.cosine_similarity` clamps small magnitudes via `eps=1e-8`

`torch.nn.functional.cosine_similarity(a, b)` clamps the norms in its
denominator with `eps=1e-8` by default; recent PyTorch versions divide by
`max(||a||, eps) · max(||b||, eps)`. When the BP gradient `g` at hidden
layers has norm ~1e-10 (which it does on DFA-trained pre-LN ResMLPs), the
divisor becomes `||a|| · 1e-8` instead of `||a|| · 1e-10`, scaling the
reported cosine by `||g|| / eps` ≈ 0.01. The reported "Γ ≈ 0.10" is then
off by ~100× from the true cosine.

**Check**: when computing Γ, either use a hand-rolled cosine that does not
clamp, or assert `||g|| > 1e-8` before calling `F.cosine_similarity`. If you
report Γ alongside ‖g‖ and ‖g‖ is at the floor, your Γ is uninterpretable;
the diagnostic protocol will catch this.
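
A minimal sketch of the hand-rolled alternative, assuming the cosine is
taken along the last dimension. `cosine_unclamped` is a hypothetical helper
name, and the vectors are toy values chosen so that the true cosine is
exactly 1 while ‖g‖ = 1e-10.

```python
import torch
import torch.nn.functional as F

def cosine_unclamped(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine along the last dim with no eps clamp (nan for zero vectors)."""
    return (a * b).sum(dim=-1) / (a.norm(dim=-1) * b.norm(dim=-1))

# Unit-norm local signal and a perfectly aligned BP gradient of norm 1e-10.
a = torch.tensor([[0.6, 0.8]])
g = 1e-10 * a

clamped = F.cosine_similarity(a, g, dim=-1)  # default eps=1e-8
true_cos = cosine_unclamped(a, g)

print(clamped.item())   # about 0.01: shrunk by ||g|| / eps
print(true_cos.item())  # about 1.0
```

The unclamped version returns `nan` when either vector is exactly zero,
which is arguably the more honest output for an evaluation metric.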

## 3. fp16 mixed precision underflows BP grads at hidden layers

Computing Γ in fp16 mixed precision on a DFA-trained ResMLP gives `nan`
because BP grads (~5e-10) are below fp16's smallest subnormal (~6e-8): they
flush to zero, so an unclamped cosine evaluates 0/0. **bf16 works** because
it has the same exponent range as fp32 (only the mantissa is reduced).

**Check**: if your reported metric depends on a tiny intermediate magnitude,
either compute it in fp32, switch to bf16, or rescale by a known constant
before the precision-sensitive step.
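
The underflow can be reproduced without a model. This sketch casts a
gradient-sized value to each 16-bit format and back; the tensor is a
stand-in, not a real BP gradient, and the cosine is the unclamped form.

```python
import torch

# A stand-in "gradient" at the magnitude quoted above.
g32 = torch.full((4,), 5e-10, dtype=torch.float32)

# Round-trip through each 16-bit format, then return to fp32 for arithmetic.
g_fp16 = g32.to(torch.float16).to(torch.float32)
g_bf16 = g32.to(torch.bfloat16).to(torch.float32)

print(g_fp16)  # all zeros: 5e-10 is below fp16's smallest subnormal, 2**-24
print(g_bf16)  # nonzero, close to 5e-10

# An unclamped cosine of the flushed gradient with itself is 0/0.
cos_fp16 = (g_fp16 * g_fp16).sum() / (g_fp16.norm() * g_fp16.norm())
print(cos_fp16)  # nan
```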

## 4. Random feedback `Bs` are training-specific: Γ is not invariant to reseeding them

DFA at fixed seed s42 reports Γ ≈ 0.106 *with the random feedback projection
matrices `Bs` that were used during training*. With 20 fresh random `Bs`
draws on the same trained network, Γ ≈ 0 ± 0.005. The non-zero Γ is the
network adapting to the *specific* `Bs` it saw during training, not an
intrinsic property of the local credit signal.

**Check**: when you report Γ for an FA-style method, specify exactly which
`Bs` were used. If you cannot reproduce them from the seed, your numbers
are not reproducible.
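
One way to make the `Bs` reproducible is to draw them from a dedicated
seeded generator rather than the global RNG. The sketch below assumes one
Gaussian feedback matrix per hidden layer; `make_feedback`, the shapes, and
the seeds are hypothetical, not our trainer's actual code.

```python
import torch

def make_feedback(seed: int, hidden_dims: list, out_dim: int) -> list:
    """Draw one random feedback matrix per hidden layer from a seeded generator."""
    gen = torch.Generator().manual_seed(seed)
    return [torch.randn(h, out_dim, generator=gen) for h in hidden_dims]

Bs_train = make_feedback(42, [8, 8], 3)  # the matrices "seen" during training
Bs_again = make_feedback(42, [8, 8], 3)  # same seed: should match exactly
Bs_fresh = make_feedback(43, [8, 8], 3)  # fresh draw: a different quantity

same = all(torch.equal(a, b) for a, b in zip(Bs_train, Bs_again))
differs = any(not torch.equal(a, b) for a, b in zip(Bs_train, Bs_fresh))
print(same, differs)
```

Seeded regeneration is only as stable as the RNG implementation, which may
differ across PyTorch versions and devices, so saving the tensors themselves
is the safer option.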

## 5. Aggregation strategy across layers / samples / batches is rarely specified, but determines the headline number

We computed DFA's "headline Γ" with four equally valid aggregation
strategies: (median over layers) × (mean over samples), (mean over layers) ×
(median over samples), and so on. The four strategies give Γ ∈ [0.085,
0.211] on the *same* trained DFA s42 network, a 2.5× spread. For Credit
Bridge the spread is 5× (Γ ∈ [0.057, 0.285]).

**Check**: state explicitly how you aggregate Γ. Report per-layer values
and let the reader pick. Do not collapse to a single number without showing
the layer breakdown.
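
To see how much the choice matters, here is a small sketch on made-up
per-layer, per-sample values. It assumes the four strategies are the 2×2
combinations of mean and median, and that samples are reduced before
layers; both are assumptions for illustration, and the numbers are not our
measurements.

```python
from statistics import mean, median

# Hypothetical Γ values indexed as gamma[layer][sample].
gamma = [
    [0.90, 0.10, 0.20],  # layer 0
    [0.00, 0.10, 0.05],  # layer 1
    [0.00, 0.00, 0.10],  # layer 2
]

def aggregate(table, over_samples, over_layers):
    """Reduce over samples within each layer, then over layers."""
    return over_layers([over_samples(row) for row in table])

headline = {
    "mean layers x mean samples": aggregate(gamma, mean, mean),
    "median layers x mean samples": aggregate(gamma, mean, median),
    "mean layers x median samples": aggregate(gamma, median, mean),
    "median layers x median samples": aggregate(gamma, median, median),
}

for name, value in headline.items():
    print(f"{name}: {value:.3f}")

spread = max(headline.values()) / min(headline.values())
print(f"spread: {spread:.1f}x")
```

Even on this tiny table the same data yields headline values that differ by
more than 3×, which is why the aggregation has to be stated.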

## 6.5. Auxiliary networks (`Bs`, bridge models) not persisted → silent self-cosine fallback

We discovered this in our own dogfood codebase: the `cnn_baseline.py` script
records State Bridge Γ = 1.0 and Credit Bridge Γ = 1.0 in its JSON output.
Reading the source explains why. When the auxiliary networks (the bridge
predictor for SB/CB, the random feedback `Bs` for DFA) are not saved
alongside the model checkpoint, the post-hoc Γ-computation script *cannot*
reconstruct the local credit signal. Instead of raising an error, the script
falls back to `cos(BP_grad, BP_grad) = 1.0` and silently records that. The
JSON has a `Gamma_note` field saying "BP grad self-cosine (feedback nets not
saved)", but a reader who doesn't read that field would interpret 1.0 as
perfect alignment.

**Check**: when you save a model checkpoint for FA evaluation, ALSO save:
- the random feedback `Bs` for DFA
- the bridge / value network state_dict for SB / CB
- the EP feedback weights for EP

If any of these are missing, your post-hoc Γ measurement is undefined.
Report `Γ = N/A` in your tables, NOT a fallback value of 1.0.
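
A minimal sketch of a guard that reports `N/A` instead of falling back. The
checkpoint is assumed to be a plain dict; the key names, `REQUIRED_AUX`, and
`gamma_or_na` are hypothetical and would need to match your own checkpoint
layout.

```python
# Hypothetical mapping from method to the auxiliary state it needs.
REQUIRED_AUX = {
    "DFA": "feedback_Bs",
    "SB": "bridge_state_dict",
    "CB": "bridge_state_dict",
    "EP": "ep_feedback_weights",
}

def gamma_or_na(checkpoint: dict, method: str, compute_gamma):
    """Return Γ if the auxiliary state was saved, otherwise the string 'N/A'."""
    key = REQUIRED_AUX[method]
    if checkpoint.get(key) is None:
        # No way to rebuild the local credit signal: refuse to invent a value.
        return "N/A"
    return compute_gamma(checkpoint["model"], checkpoint[key])

# Usage with a stand-in Γ function that just returns a constant.
fake_gamma = lambda model, aux: 0.1
incomplete = {"model": object()}
complete = {"model": object(), "feedback_Bs": [[0.0]]}

print(gamma_or_na(incomplete, "DFA", fake_gamma))  # N/A
print(gamma_or_na(complete, "DFA", fake_gamma))    # 0.1
```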

## 6. Per-layer Γ structure is hidden by aggregation (layer-0 dominance is one mode)

For DFA on a 4-block ResMLP, the headline Γ ≈ 0.10 is driven almost entirely
by the embedding layer (Γ_layer0 ≈ +0.42, layers 1-4 ≈ 0). The "mean over
layers" summary statistic hides this structure.

**This pattern is architecture-specific, not universal.** On ViT-Mini, the
per-layer Γ is uniformly near zero across all layers (no single layer
dominates). On ResMLP the dominance is real and severe. The general lesson
is not "layer 0 always dominates" — it is that **the aggregation hides
per-layer structure that depends on the input preprocessing and the
architecture's interaction with random feedback `Bs`**.

**Check**: always report per-layer Γ. A single average can mislead in
either direction — by hiding dominance (ResMLP-style) or by averaging
over layers that all measure essentially the same degenerate quantity
(ViT-style).
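
As a quick illustration, the sketch below puts a mean-over-layers headline
next to each layer's share of the total |Γ|. It uses the rounded ResMLP
values quoted above, with layers 1-4 set to exactly 0 for simplicity; the
"share" statistic is our own illustrative choice, not part of the protocol.

```python
# Rounded per-layer values from the text; layers 1-4 set to exactly 0.
per_layer = [0.42, 0.0, 0.0, 0.0, 0.0]

headline = sum(per_layer) / len(per_layer)

# Fraction of the total |Γ| contributed by each layer.
total_abs = sum(abs(g) for g in per_layer)
share = [abs(g) / total_abs if total_abs > 0 else 0.0 for g in per_layer]

print(f"mean over layers: {headline:.3f}")
for layer, (g, s) in enumerate(zip(per_layer, share)):
    print(f"layer {layer}: gamma={g:+.2f}  share={s:.0%}")
```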

## Suggested final-pass workflow

1. Run `grep -nr '\.norm(-1)' YOUR_REPO` and fix all matches.
2. Run the diagnostic protocol on your trained model. If any flag fires,
   fix the underlying issue or walk back the corresponding claim before
   submission.
3. Inspect every place you compute Γ:
   - Is the divisor clamped?
   - Are you in fp16?
   - Are you using the training `Bs` or fresh ones?
   - Were the auxiliary networks (`Bs`, bridge models) saved with the
     checkpoint?
   - How do you aggregate?
   - Do you report per-layer?
4. Report `‖h_l‖`, `‖g_l‖`, and `Γ_l` *together* for every layer. The
   metrics are entangled; reporting one without the others hides the
   pathology.
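
Step 4 can be sketched as a small report function. This assumes you already
have, for each layer, the activations, the BP gradients, and the local
credit signals as tensors of shape `(batch, ...)`. The name `layer_report`
and the mean-over-samples reduction are illustrative choices, and the cosine
is deliberately unclamped and computed in float64.

```python
import torch

def layer_report(hs, gs, ds):
    """Per-layer mean ||h||, mean ||g|| and mean unclamped cosine(g, d).

    hs: activations, gs: BP grads, ds: local credit signals (one tensor each
    per layer, batch first). All reductions are means over samples.
    """
    rows = []
    for layer, (h, g, d) in enumerate(zip(hs, gs, ds)):
        h, g, d = (t.detach().double().flatten(1) for t in (h, g, d))
        cos = (g * d).sum(-1) / (g.norm(dim=-1) * d.norm(dim=-1))
        rows.append({
            "layer": layer,
            "h_norm": h.norm(dim=-1).mean().item(),
            "g_norm": g.norm(dim=-1).mean().item(),
            "gamma": cos.mean().item(),
        })
    return rows

# Toy usage: layer 0 is perfectly aligned, layer 1 perfectly anti-aligned.
torch.manual_seed(0)
hs = [torch.randn(4, 8), torch.randn(4, 8)]
gs = [torch.randn(4, 8), torch.randn(4, 8)]
ds = [gs[0].clone(), -gs[1]]

rows = layer_report(hs, gs, ds)
for row in rows:
    print(row)
```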