faeval.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2026-04-07	Add §2/§3 hero figure: 5-method audit horizontal bar chart	YurenHao0426
	4-panel layout (one per diagnostic), 5 methods sorted bottom-to-top by ascending accuracy, color-coded healthy (BP/EP, blue) vs degenerate (DFA/SB/CB, red), with threshold lines drawn: (a) max per-block growth (log scale, threshold 50x) (b) \|\|g_L\|\| (log scale, floor 1e-7) (c) cross-batch stability (linear, ceiling 0.30) (d) headline acc (linear, frozen baseline 0.349) The visual layout makes it immediately obvious that: - (a) and (b) cleanly split healthy from degenerate (4-7 OOM gap) - (c) is bimodal and doesn't cleanly split — confirms it's a sub-mode discriminator, not a primary detector - (d) shows BP above the frozen baseline by ~25 pp while DFA/CB/SB are at or below it
2026-04-07	Add d=512 ResMLP audit table (3 seeds): cross-width validation	YurenHao0426
	Same protocol applied to the 4-block d=512 ResMLP variant (vs the d=256 default). 4 methods × 3 seeds = 12 conditions: BP @ d=512: trustworthy on all 3 seeds (acc 0.60-0.61) DFA @ d=512: walked back on all 3 seeds via (a)+(b) State Bridge @ d=512: walked back on all 3 seeds via (a)+(b), with drift sub-mode on s123 (stability 0.879) Credit Bridge @ d=512: walked back on all 3 seeds via (a)+(b) Width effect: max-per-block growth is HIGHER at d=512 (6e3-7e4) than at d=256 (~1e3). Larger width amplifies the explosion. The protocol verdicts are robust to this — same binary outcome, more extreme quantitative numbers. This is the cross-width validation: the protocol's findings are not d=256-specific. The §3 audit results generalize across the width dimension.
2026-04-07	Add §3 cross-architecture temporal evolution figure	YurenHao0426
	3-column 3-row plot: rows: \|\|h_L\|\|, \|\|g_L\|\|, test accuracy cols: ResMLP (with LN) \| ViT-Mini (cls + LN) \| StudentNet (no LN) BP and DFA trajectories overlaid. Floor threshold drawn on the \|\|g_L\|\| row. Visualizes the cross-architecture causal control: with-LN architectures both show \|\|g_L\|\| collapse below 1e-7 (DFA hits the floor within 5 epochs); without-LN architecture shows \|\|g_L\|\| stays in the healthy regime even though \|\|h_L\|\| still grows (catastrophic vs mild).
2026-04-07	Cross-architecture temporal validation: 3 archs x 3 seeds x 2 methods	YurenHao0426
	ResMLP (4-block d=256, with out_ln, CIFAR-10): s42: DFA (a) ep 8, (b) ep 4, acc 0.308 s123: DFA (a) ep 11, (b) ep 4, acc 0.320 s456: DFA (a) ep 8, (b) ep 3, acc 0.300 ViT-Mini (4-block d=128, cls token + terminal LN, CIFAR-10): s42: DFA (a) ep 1, (b) ep 3, acc 0.256 s123: DFA (a) ep 1, (b) ep 2, acc 0.202 s456: DFA (a) ep 1, (b) ep 3, acc 0.253 StudentNet (4-block d=128, NO terminal LN, synthetic alpha=1.0): s42: DFA (a) ep 18, (b) NEVER, acc 0.332 s123: DFA (a) ep 14, (b) NEVER, acc 0.314 s456: DFA (a) ep 25, (b) NEVER, acc 0.336 BP: never fires on any seed x any architecture (9/9 sanity passes). Key cross-architecture finding: diagnostic (b) is specifically the LN- driven failure mode. Without out_ln, the BP grad never crosses the 1e-7 floor, even though (a) still fires (the residual stream still grows, just without the LN-cancellation pathology that drives the BP grad to the floor). This is the causal architectural control: (b) specifically tests 'is terminal-LN gradient cancellation active?' and (a) tests 'is the residual stream growing without bound?'. They are linked but separable. This is the §3 cross-architecture validation evidence.
2026-04-07	Protocol diagnostic (a): use max per-block growth, not max/min ratio	YurenHao0426
	Old metric: max(\|\|h\|\|) / max(\|\|h_0\|\|, eps). False-positives on ViT-style architectures because the cls token at layer 0 (right after patch_embed) has anomalously small magnitude (~0.3-1.5), inflating the ratio even on healthy BP-trained ViTs. New metric: max_l(\|\|h_{l+1}\|\| / \|\|h_l\|\|) — the largest single-block residual amplification. Architecture-invariant. Calibration: - BP-trained, late training: <5x per block - BP ViT, early epochs (cls token resolving): 13-25x max - DFA-trained ResMLP/ViT: 100-4000x per block Threshold raised from 10 to 50 to sit cleanly between healthy-early- training (max 25) and failure-regime (min 100). Re-verifications: - smoke test (BP/DFA/EP): all 3 verdicts unchanged - random init (3 seeds): trustworthy on all 3 - 5-method audit table single-seed: identical verdicts - decision-utility ablation: identical (still 0/5 by S1, 3/5 by S_full) - temporal evolution 3-seed: (b) now fires first at ep 3-4, (a) at ep 8-11. Both well before training ends. The 'protocol fires ~92 epochs early' story still holds. - ViT temporal evolution: BP no longer false-fires; DFA fires (a) ep 1, (b) ep 3 — protocol works on the second architecture.
2026-04-07	Temporal evolution 3-seed: protocol fires at DFA epoch 3-4 on all seeds	YurenHao0426
	s42: (a)+(b) fire at epoch 4, DFA final acc 0.3076 s123: (a)+(b) fire at epoch 4, DFA final acc 0.3203 s456: (a)+(b) fire at epoch 3, DFA final acc 0.2998 BP never fires on any seed (final acc 0.61-0.63). The 'protocol catches it 96 epochs early' finding is fully reproducible across seeds.
2026-04-07	Add temporal diagnostic evolution: protocol fires at epoch 4 of DFA	YurenHao0426
	Replays per-epoch logged data from results/snapshot_evolution_v2/ through the protocol thresholds. Result: diagnostics (a) \|\|h_l\|\| explosion AND (b) \|\|g_L\|\| at floor BOTH first fire at epoch 4 of DFA training. At that point, DFA test acc is 0.308 — its final value at epoch 100 is also 0.308. The protocol could have walked back the headline 96 epochs before training finished. DFA's gamma hovers at 0.087-0.107 for all 100 epochs. A reviewer looking at acc+gamma would conclude 'DFA is hovering at 31% acc with ~0.10 alignment, both reasonable'. Wrong on both counts. BP never fires any diagnostic at any epoch. Stays bounded at \|\|h_L\|\|~200, \|\|g_L\|\|~3-5e-5, accuracy climbs to 0.61. This is the temporal validation of decision utility: the protocol catches the pathology AS IT HAPPENS, not just retrospectively.
2026-04-07	Audit table extension to 3 seeds (s42/s123/s456)	YurenHao0426
	3 seeds × 5 methods × 4 diagnostics = 60 measurements. Key reproducibility findings: - BP: trustworthy on all 3 seeds (acc 0.61-0.62, h_L ~200, g_L ~3-4e-4) - EP: trustworthy on all 3 seeds (acc 0.29-0.36, h_L 3-8e3, g_L ~1e-4) - DFA, SB, CB: walked back on all 3 seeds × all 3 of (a)/(b)/(d) Diagnostic (c) is bimodal across seeds — confirms the prior memory finding: - DFA s42=0.047 (noise), s123=0.436 (drift), s456=-0.005 (noise) - SB s42=0.992 (drift), s123=0.561 (drift), s456=0.035 (noise) - CB s42=0.352 (drift), s123=0.250 (~edge), s456=0.518 (drift) (c) catches different methods on different seeds. (a)/(b)/(d) catch all 3 failing methods on all 3 seeds — robust binary detection.
2026-04-07	Add protocol decision-utility ablation table	YurenHao0426
	Builds on the 5-method audit JSON. For each method, evaluates 7 reporting strategies (S0=acc only, S1=+Γ field standard, S2-S5=+single diagnostic, S_full=full protocol), and emits the verdict each strategy would have reached. Result: 3 of 5 methods (DFA/SB/CB) are walked back by S_full but NOT by S1. Each of (a)scale, (b)floor, (d)frozen is independently sufficient for binary detection of those 3 failures. Diagnostic (c)stability adds sub-mode discrimination (drift vs noise) but not new positive detections. This is the §3 protocol decision-utility evidence.
2026-04-07	Add audit table example: protocol applied to BP/DFA/SB/CB/EP	YurenHao0426
	5-method audit table on 4-block d=256 ResMLP CIFAR-10 seed 42: - BP: trustworthy (acc 0.615, h_L=2e2, g_L=4e-4, stab 0.099) - DFA: walked back via (a)+(b)+(d) — h_L=4e8, g_L=4e-9, undercuts frozen - State Bridge: walked back via all 4 diagnostics — stability 0.992 is the cleanest possible drift-dominated case - Credit Bridge: walked back via all 4 — stability 0.352, also drift mode - EP: trustworthy (acc 0.359, h_L=3e3, g_L=2e-4, stab -0.036) — paper's internal control case This is the §2 audit evidence for the main-track paper. Confirms that standard headline acc + Γ silently fails on 3 of 5 methods on this architecture, while the 4-diagnostic protocol catches all three.