|
ResMLP (4-block d=256, with out_ln, CIFAR-10):
s42: DFA (a) ep 8, (b) ep 4, acc 0.308
s123: DFA (a) ep 11, (b) ep 4, acc 0.320
s456: DFA (a) ep 8, (b) ep 3, acc 0.300
ViT-Mini (4-block d=128, cls token + terminal LN, CIFAR-10):
s42: DFA (a) ep 1, (b) ep 3, acc 0.256
s123: DFA (a) ep 1, (b) ep 2, acc 0.202
s456: DFA (a) ep 1, (b) ep 3, acc 0.253
StudentNet (4-block d=128, NO terminal LN, synthetic alpha=1.0):
s42: DFA (a) ep 18, (b) NEVER, acc 0.332
s123: DFA (a) ep 14, (b) NEVER, acc 0.314
s456: DFA (a) ep 25, (b) NEVER, acc 0.336
BP: neither diagnostic fires on any seed or architecture (9/9 sanity checks pass).
Key cross-architecture finding: diagnostic (b) is specifically the LN-driven
failure mode. Without a terminal LN (StudentNet, no out_ln), the BP grad never
crosses the 1e-7 floor, so (b) never fires, even though (a) still fires: the
residual stream still grows, just without the LN-cancellation pathology that
drives the BP grad to the floor. This is the causal architectural control:
(b) tests 'is terminal-LN gradient cancellation active?' and (a) tests 'is
the residual stream growing without bound?'. The two are linked but separable.
This is the §3 cross-architecture validation evidence.
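As a rough illustration of how the first-firing epochs above could be read off per-epoch logs, here is a minimal sketch. It assumes (a) fires when the largest per-block residual amplification exceeds a threshold and (b) fires when the BP gradient norm drops to or below the 1e-7 floor. The helper name `first_fire_epoch`, the threshold of 50, and the toy series are hypothetical and are not taken from the actual runs.

```python
from typing import Callable, Optional, Sequence

GRAD_FLOOR = 1e-7      # floor quoted in the notes for diagnostic (b)
AMP_THRESHOLD = 50.0   # assumed threshold for diagnostic (a)


def first_fire_epoch(series: Sequence[float],
                     fires: Callable[[float], bool]) -> Optional[int]:
    """Return the first 1-based epoch at which `fires` is true, else None."""
    for epoch, value in enumerate(series, start=1):
        if fires(value):
            return epoch
    return None


# Toy per-epoch logs (made up for illustration, not measured values).
amp_with_ln = [3.0, 12.0, 40.0, 180.0, 900.0]
grad_with_ln = [1e-3, 1e-5, 5e-8, 1e-9, 1e-10]
amp_no_ln = [2.0, 8.0, 30.0, 75.0, 200.0]
grad_no_ln = [1e-3, 8e-4, 6e-4, 5e-4, 4e-4]

a_ln = first_fire_epoch(amp_with_ln, lambda v: v > AMP_THRESHOLD)
b_ln = first_fire_epoch(grad_with_ln, lambda v: v <= GRAD_FLOOR)
a_no = first_fire_epoch(amp_no_ln, lambda v: v > AMP_THRESHOLD)
b_no = first_fire_epoch(grad_no_ln, lambda v: v <= GRAD_FLOOR)

print("with terminal LN: (a)", a_ln, "(b)", b_ln)
print("no terminal LN:   (a)", a_no, "(b)", b_no if b_no is not None else "NEVER")
```

In the toy no-LN series the residual amplification still crosses the threshold while the gradient norm stays far above the floor, which is the (a)-fires, (b)-NEVER pattern reported for StudentNet.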
|
|
Old metric: max(||h||) / max(||h_0||, eps). It produces false positives on
ViT-style architectures because the cls token at layer 0 (right after
patch_embed) has an anomalously small magnitude (~0.3-1.5), which inflates
the ratio even on healthy BP-trained ViTs.
New metric: max_l(||h_{l+1}|| / ||h_l||), i.e. the largest single-block
residual amplification. Architecture-invariant.
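The two metrics can be compared side by side with a minimal sketch. It assumes `norms[l]` holds ||h_l|| for the residual stream entering block l (so `norms[0]` is the layer-0 state) and that max(||h||) is taken over layers. The `EPS` value, the function names, the eps guard in the new metric's denominators, and the toy norms are hypothetical.

```python
from typing import Sequence

EPS = 1e-8  # hypothetical guard value; the notes only name `eps`


def old_metric(norms: Sequence[float], eps: float = EPS) -> float:
    """max(||h||) / max(||h_0||, eps): total growth relative to layer 0."""
    return max(norms) / max(norms[0], eps)


def new_metric(norms: Sequence[float], eps: float = EPS) -> float:
    """max_l(||h_{l+1}|| / ||h_l||): largest single-block amplification.

    Needs at least two entries (one block); eps guards each denominator.
    """
    return max(nxt / max(cur, eps) for cur, nxt in zip(norms, norms[1:]))


# Toy norms: small layer-0 state, then moderate growth across 4 blocks.
toy_norms = [0.5, 2.0, 4.0, 8.0, 12.0]
print("old metric:", old_metric(toy_norms))
print("new metric:", new_metric(toy_norms))
```

On these toy values the old ratio compounds growth across all blocks against the small layer-0 norm, while the new metric reports only the largest single step.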
Calibration:
- BP-trained, late training: <5x per block
- BP ViT, early epochs (cls token resolving): 13-25x max
- DFA-trained ResMLP/ViT: 100-4000x per block
Threshold raised from 10 to 50 so that it sits cleanly between healthy
early training (max 25) and the failure regime (min 100).
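The threshold choice can be sanity-checked against the endpoints of the calibration ranges above. This is only a sketch: the function name `amplification_fires` and the strict `>` comparison are assumptions, and the inputs are the range endpoints quoted in the notes, not new measurements.

```python
def amplification_fires(amp: float, threshold: float = 50.0) -> bool:
    """Assumed rule: flag when per-block amplification exceeds the threshold."""
    return amp > threshold


# Endpoints of the calibration ranges quoted in the notes.
calibration = {
    "BP late training (upper bound)": 5.0,
    "BP ViT early epochs (low)": 13.0,
    "BP ViT early epochs (high)": 25.0,
    "DFA (low)": 100.0,
    "DFA (high)": 4000.0,
}

for label, amp in calibration.items():
    at_10 = amplification_fires(amp, threshold=10.0)
    at_50 = amplification_fires(amp, threshold=50.0)
    print(f"{label:32s} amp={amp:7.1f}  thr10={at_10}  thr50={at_50}")
```

With a threshold of 10 the early-epoch BP ViT values would be flagged; with 50 only the DFA range is.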
Re-verifications:
- smoke test (BP/DFA/EP): all 3 verdicts unchanged
- random init (3 seeds): trustworthy on all 3
- 5-method audit table single-seed: identical verdicts
- decision-utility ablation: identical (still 0/5 by S1, 3/5 by S_full)
- temporal evolution 3-seed: (b) now fires first at ep 3-4, (a) at ep
8-11. Both well before training ends. The 'protocol fires ~92 epochs
early' story still holds.
- ViT temporal evolution: BP no longer false-fires; DFA fires (a) ep 1,
(b) ep 3, so the protocol works on the second architecture.
|