|
3 seeds × 5 methods × 4 diagnostics = 60 measurements. Key reproducibility
findings:
- BP: trustworthy on all 3 seeds (acc 0.61-0.62, h_L ~200, g_L ~3-4e-4)
- EP: trustworthy on all 3 seeds (acc 0.29-0.36, h_L 3-8e3, g_L ~1e-4)
- DFA, SB, CB: walked back on all 3 seeds × all 3 of (a)/(b)/(d)
Diagnostic (c) is bimodal across seeds — confirms the prior memory finding:
- DFA s42=0.047 (noise), s123=0.436 (drift), s456=-0.005 (noise)
- SB s42=0.992 (drift), s123=0.561 (drift), s456=0.035 (noise)
- CB s42=0.352 (drift), s123=0.250 (~edge), s456=0.518 (drift)
(c) catches different methods on different seeds. (a)/(b)/(d) catch all 3
failing methods on all 3 seeds — robust binary detection.
|
|
5-method audit table on 4-block d=256 ResMLP CIFAR-10 seed 42:
- BP: trustworthy (acc 0.615, h_L=2e2, g_L=4e-4, stab 0.099)
- DFA: walked back via (a)+(b)+(d) — h_L=4e8, g_L=4e-9, undercuts frozen
- State Bridge: walked back via all 4 diagnostics — stability 0.992 is the
cleanest possible drift-dominated case
- Credit Bridge: walked back via all 4 — stability 0.352, also drift mode
- EP: trustworthy (acc 0.359, h_L=3e3, g_L=2e-4, stab -0.036) — paper's
internal control case
This is the §2 audit evidence for the main-track paper. Confirms that
standard headline acc + Γ silently fails on 3 of 5 methods on this
architecture, while the 4-diagnostic protocol catches all three.
|