faeval.git

Age	Commit message	Author
2026-04-07	Add EVIDENCE_SUMMARY.md: consolidated snapshot of all protocol evidence	YurenHao0426
	Single-document overview of every result the protocol package has produced so far, with reproducibility commands and the file/memory entry where each result is recorded. Organized by paper section (§1 protocol, §2 audit, §3 decision utility, §4 temporal validation, §5 pitfalls). Includes the headline tables (3-seed audit, cross-architecture, penalty sweep) ready for the paper, and an explicit status field for each ongoing experiment. This is a reading guide for anyone (codex, future-me, the user) who needs to know what evidence is ready and how to reproduce it.
2026-04-07	Add §3 cross-architecture temporal evolution figure	YurenHao0426
	3-column, 3-row plot:
	  rows: ||h_L||, ||g_L||, test accuracy
	  cols: ResMLP (with LN) | ViT-Mini (cls + LN) | StudentNet (no LN)
	BP and DFA trajectories overlaid. Floor threshold drawn on the ||g_L|| row.
	Visualizes the cross-architecture causal control: in both with-LN architectures ||g_L|| collapses below 1e-7 (DFA hits the floor within 5 epochs); in the without-LN architecture ||g_L|| stays in the healthy regime even though ||h_L|| still grows (catastrophic vs mild).
2026-04-07	Add threshold sensitivity analysis: (a) 63x gap, (b) 24338x gap	YurenHao0426
	For each diagnostic, sweeps the threshold across orders of magnitude on the 3-seed audit data and reports the verdict at each value.
	Key calibration findings (3 seeds):
	Diagnostic (a) max per-block growth:
	  Healthy max (BP/EP): 11.0
	  Degenerate min (DFA/SB/CB): 694
	  Separation gap: 63x
	  Default threshold 50 sits comfortably in the middle. Any threshold in [12, 693] gives the same verdicts.
	Diagnostic (b) ||g_L|| at floor:
	  Healthy min (BP/EP): 1.02e-4
	  Degenerate max (DFA/SB/CB): 4.18e-9
	  Separation gap: 24,338x
	  Default threshold 1e-7 sits comfortably in the middle. Any threshold in [4.2e-9, 1.0e-4] gives the same verdicts.
	Diagnostic (c) cross-batch stability:
	  NOT a clean binary discriminator across seeds. BP s456=0.114 near threshold; DFA s42=0.047 (noise sub-mode) doesn't fire; SB s456=0.035 (noise sub-mode) doesn't fire. (c) is for sub-mode interpretation, not binary detection.
	This is the calibration evidence answering the E&D reviewer question 'why these specific thresholds?'.
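The calibration claim above can be checked with a few lines of arithmetic. This is a minimal sketch, not the repository's sweep script: it uses only the rounded values quoted in the commit message, and the function names `fires_a` and `fires_b` are hypothetical.

```python
# Rounded values copied from the commit message (3-seed audit).
HEALTHY_MAX_GROWTH = 11.0      # diagnostic (a), worst healthy run (BP/EP)
DEGENERATE_MIN_GROWTH = 694.0  # diagnostic (a), mildest failing run (DFA/SB/CB)
HEALTHY_MIN_GRAD = 1.02e-4     # diagnostic (b), smallest healthy ||g_L||
DEGENERATE_MAX_GRAD = 4.18e-9  # diagnostic (b), largest failing ||g_L||


def fires_a(growth, threshold=50.0):
    """Hypothetical check: (a) fires when per-block growth exceeds the threshold."""
    return growth > threshold


def fires_b(grad_norm, floor=1e-7):
    """Hypothetical check: (b) fires when ||g_L|| falls below the floor."""
    return grad_norm < floor


# Separation gaps. The (b) gap from rounded inputs is close to, but not
# exactly, the reported 24,338x, which presumably used unrounded values.
gap_a = DEGENERATE_MIN_GROWTH / HEALTHY_MAX_GROWTH
gap_b = HEALTHY_MIN_GRAD / DEGENERATE_MAX_GRAD

# Every integer threshold in [12, 693] separates healthy from degenerate for (a).
stable_a = all(
    not fires_a(HEALTHY_MAX_GROWTH, t) and fires_a(DEGENERATE_MIN_GROWTH, t)
    for t in range(12, 694)
)

# Both endpoints of the quoted (b) window, plus the default floor, agree.
stable_b = all(
    not fires_b(HEALTHY_MIN_GRAD, f) and fires_b(DEGENERATE_MAX_GRAD, f)
    for f in (4.2e-9, 1e-7, 1.0e-4)
)

print(f"gap (a): {gap_a:.0f}x, gap (b): {gap_b:.0f}x")
print("verdicts stable:", stable_a, stable_b)
```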
2026-04-07	Add reproducers for pitfalls 4-6 (Bs reproducibility, aggregation, layer-0)	YurenHao0426
	All 3 verified on the real DFA s42 checkpoint:
	Bug 4: training Bs gives Γ=+0.068, 10 fresh Bs draws give Γ=+0.0043±0.007. The 'alignment' is the network adapting to specific Bs.
	Bug 5: 4 valid aggregation strategies give Γ in [-0.028, +0.074]. The spread is 0.10 (3.45x ratio) and the sign flips between strategies. Pick the wrong aggregation and DFA is anti-aligned; pick the right one and DFA looks aligned.
	Bug 6: Γ_layer0 = +0.429 dominates the mean +0.068. Hidden layers 1-4 are all near zero or slightly negative. Mean of hidden layers only is -0.022 (negative!). The deep blocks the paper claims to be 'training' have Γ ≈ 0 or below.
	Bugs 5 and 6 are causally linked: 'median over layers' strategies pick a negative deep layer; 'mean over layers' is dominated by the positive l0.
	The catalog under-reported bug 5 (it said 2.5x, actual is 3.45x with sign flip).
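The layer-0 dominance figures in bug 6 are consistent with each other under a simple equal-weight mean. The sketch below checks that arithmetic; it assumes five layers in total (layer 0 plus hidden layers 1-4), which the message implies but does not state outright, and all variable names are illustrative.

```python
# Values quoted in the commit message for the DFA s42 checkpoint.
gamma_mean_all = 0.068   # mean of Γ over all layers
gamma_layer0 = 0.429     # Γ at layer 0

# Assumption: equal-weight mean over layer 0 plus hidden layers 1-4.
N_LAYERS = 5

# Remove layer 0 from the sum and average what is left.
hidden_sum = N_LAYERS * gamma_mean_all - gamma_layer0
hidden_mean = hidden_sum / (N_LAYERS - 1)

# Share of the all-layer sum contributed by layer 0 alone (more than 100%,
# because the hidden layers contribute a net negative amount).
layer0_share = gamma_layer0 / (N_LAYERS * gamma_mean_all)

print(f"hidden-layer mean: {hidden_mean:+.3f}")
print(f"layer-0 share of the sum: {layer0_share:.0%}")
```

The recovered hidden-layer mean is negative, matching the reported -0.022 to within rounding.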
2026-04-07	Add training-monitor early-stop demo: 96% compute savings on DFA	YurenHao0426
	Demonstrates the practical use case of the protocol — not as a post-hoc audit but as an in-training abort condition. Walks through the existing per-epoch trace and shows when the protocol would have triggered an early stop on DFA training and what the saved compute would be. Result: DFA on 4-block d=256 ResMLP fires diagnostic (b) at epoch 4 with test acc 0.3076. The final acc at epoch 100 is also 0.3076 (identical). Stopping at epoch 4 saves 96% of compute with zero headline acc loss.
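The abort condition described above can be sketched as a scan over a per-epoch trace. This is an illustration only: the trace below is synthetic (not the logged data), epochs are assumed to be 1-based, and `first_fire_epoch` and `compute_saved` are hypothetical helper names rather than functions from the repository.

```python
def first_fire_epoch(grad_norms, floor=1e-7):
    """Return the 1-based epoch at which ||g_L|| first drops below the floor.

    Returns None if the floor is never crossed.
    """
    for epoch, g in enumerate(grad_norms, start=1):
        if g < floor:
            return epoch
    return None


def compute_saved(stop_epoch, total_epochs):
    """Fraction of the epoch budget skipped by stopping after stop_epoch."""
    return 1.0 - stop_epoch / total_epochs


# Synthetic 100-epoch trace that crosses the 1e-7 floor at epoch 4.
synthetic_trace = [1e-3, 1e-5, 5e-7, 4e-9] + [4e-9] * 96

stop = first_fire_epoch(synthetic_trace)
saved = compute_saved(stop, len(synthetic_trace))
print(f"stop at epoch {stop}, saving {saved:.0%} of epochs")
```

Counting saved compute in epochs assumes every epoch costs the same, which is the reading that makes "stop at epoch 4 of 100" equal to 96%.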
2026-04-07	Cross-architecture temporal validation: 3 archs x 3 seeds x 2 methods	YurenHao0426
	ResMLP (4-block d=256, with out_ln, CIFAR-10):
	  s42: DFA (a) ep 8, (b) ep 4, acc 0.308
	  s123: DFA (a) ep 11, (b) ep 4, acc 0.320
	  s456: DFA (a) ep 8, (b) ep 3, acc 0.300
	ViT-Mini (4-block d=128, cls token + terminal LN, CIFAR-10):
	  s42: DFA (a) ep 1, (b) ep 3, acc 0.256
	  s123: DFA (a) ep 1, (b) ep 2, acc 0.202
	  s456: DFA (a) ep 1, (b) ep 3, acc 0.253
	StudentNet (4-block d=128, NO terminal LN, synthetic alpha=1.0):
	  s42: DFA (a) ep 18, (b) NEVER, acc 0.332
	  s123: DFA (a) ep 14, (b) NEVER, acc 0.314
	  s456: DFA (a) ep 25, (b) NEVER, acc 0.336
	BP: never fires on any seed x any architecture (9/9 sanity passes).
	Key cross-architecture finding: diagnostic (b) is specifically the LN-driven failure mode. Without out_ln, the BP grad never crosses the 1e-7 floor, even though (a) still fires (the residual stream still grows, just without the LN-cancellation pathology that drives the BP grad to the floor).
	This is the causal architectural control: (b) specifically tests 'is terminal-LN gradient cancellation active?' and (a) tests 'is the residual stream growing without bound?'. They are linked but separable.
	This is the §3 cross-architecture validation evidence.
2026-04-07	Protocol diagnostic (a): use max per-block growth, not max/min ratio	YurenHao0426
	Old metric: max(||h||) / max(||h_0||, eps). False-positives on ViT-style architectures because the cls token at layer 0 (right after patch_embed) has anomalously small magnitude (~0.3-1.5), inflating the ratio even on healthy BP-trained ViTs.
	New metric: max_l(||h_{l+1}|| / ||h_l||), the largest single-block residual amplification. Architecture-invariant.
	Calibration:
	- BP-trained, late training: <5x per block
	- BP ViT, early epochs (cls token resolving): 13-25x max
	- DFA-trained ResMLP/ViT: 100-4000x per block
	Threshold raised from 10 to 50 to sit cleanly between healthy-early-training (max 25) and failure-regime (min 100).
	Re-verifications:
	- smoke test (BP/DFA/EP): all 3 verdicts unchanged
	- random init (3 seeds): trustworthy on all 3
	- 5-method audit table single-seed: identical verdicts
	- decision-utility ablation: identical (still 0/5 by S1, 3/5 by S_full)
	- temporal evolution 3-seed: (b) now fires first at ep 3-4, (a) at ep 8-11. Both well before training ends. The 'protocol fires ~92 epochs early' story still holds.
	- ViT temporal evolution: BP no longer false-fires; DFA fires (a) ep 1, (b) ep 3. The protocol works on the second architecture.
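The difference between the two metrics is easiest to see on a toy list of per-layer norms. The following is a minimal sketch, not the code in protocol/protocol.py: the norm lists are synthetic, the `eps` default is an assumption, and the function names are hypothetical.

```python
def old_ratio(norms, eps=1e-12):
    """Old metric: max(||h||) / max(||h_0||, eps). eps value is an assumption."""
    return max(norms) / max(norms[0], eps)


def max_block_growth(norms):
    """New metric: max over l of ||h_{l+1}|| / ||h_l||."""
    return max(nxt / cur for cur, nxt in zip(norms, norms[1:]))


# Synthetic healthy ViT-like profile: a small layer-0 norm (cls token),
# then mild growth from block to block.
healthy_vit = [0.5, 4.0, 6.0, 8.0, 10.0]

# Synthetic failing profile: one block amplifies the stream by 400x.
failing = [10.0, 50.0, 5000.0, 2e6, 4e8]

OLD_THRESHOLD = 10.0   # threshold quoted for the old metric
NEW_THRESHOLD = 50.0   # threshold quoted for the new metric

print("healthy, old metric fires:", old_ratio(healthy_vit) > OLD_THRESHOLD)
print("healthy, new metric fires:", max_block_growth(healthy_vit) > NEW_THRESHOLD)
print("failing, new metric fires:", max_block_growth(failing) > NEW_THRESHOLD)
```

On the healthy profile the old ratio is inflated purely by the small layer-0 norm, while the per-block metric stays well under 50; the failing profile trips the per-block metric regardless of its starting norm.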
2026-04-07	Add reproducers for pitfalls 1-3 in CHECKLIST.md	YurenHao0426
	Each bug from the catalog has a synthetic reproducer that runs in <1 sec without GPU:
	Bug 1: x.norm(-1) on a 2x2 tensor returns 1.143 (L_{-1} of whole tensor) instead of [5, 10] (per-row L_2 along dim=-1).
	Bug 2: F.cosine_similarity(a, b) with ||b||=5e-10 returns +0.000905 instead of the true +0.018101. The clamp (eps=1e-8) inflates the divisor 20x, so the cosine is underestimated 20x.
	Bug 3: 5e-10 in fp16 -> 0 (underflows below the smallest subnormal, ~6e-8). Downstream F.cosine_similarity returns NaN. bf16 works because it shares fp32's exponent range.
	Bugs 4-6 (Bs reproducibility, aggregation, layer-0 dominance) require a trained network and are demonstrated inside audit_table and ablation_decision_utility.
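The three numeric pitfalls can be reproduced without PyTorch. The sketch below is a standard-library stand-in, not the repository's reproducers: the 2x2 tensor and the vectors `a` and `b` are assumptions chosen to match the reported numbers, and `clamped_cosine` only imitates an eps clamp on the norms rather than calling F.cosine_similarity.

```python
import math
import struct

# Bug 1: p = -1 norm over the whole tensor vs per-row L2 norm.
# Assumption: this tensor is one choice consistent with the reported values.
x = [[3.0, 4.0], [6.0, 8.0]]
flat = [v for row in x for v in row]
l_minus1 = sum(abs(v) ** -1 for v in flat) ** -1   # what a p=-1 norm computes
row_l2 = [math.hypot(*row) for row in x]           # what was intended


# Bug 2: an eps clamp on the norms shrinks the cosine of tiny vectors.
def true_cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def clamped_cosine(a, b, eps=1e-8):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (max(math.hypot(*a), eps) * max(math.hypot(*b), eps))


c = 0.018101                      # target true cosine from the message
a = (1.0, 0.0)                    # unit vector (assumption)
b = (5e-10 * c, 5e-10 * math.sqrt(1.0 - c * c))   # ||b|| = 5e-10
true_cos = true_cosine(a, b)
clamped_cos = clamped_cosine(a, b)


# Bug 3: 5e-10 is below the smallest fp16 subnormal (2**-24, about 6e-8).
def to_fp16(v):
    """Round-trip a Python float through IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", v))[0]


fp16_value = to_fp16(5e-10)

print(f"bug 1: L_-1 = {l_minus1:.3f}, per-row L2 = {row_l2}")
print(f"bug 2: true = {true_cos:+.6f}, clamped = {clamped_cos:+.6f}")
print(f"bug 3: 5e-10 as fp16 = {fp16_value}")
```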
2026-04-07	Temporal evolution 3-seed: protocol fires at DFA epoch 3-4 on all seeds	YurenHao0426
	s42: (a)+(b) fire at epoch 4, DFA final acc 0.3076
	s123: (a)+(b) fire at epoch 4, DFA final acc 0.3203
	s456: (a)+(b) fire at epoch 3, DFA final acc 0.2998
	BP never fires on any seed (final acc 0.61-0.63).
	The 'protocol catches it 96 epochs early' finding is fully reproducible across seeds.
2026-04-07	Add temporal diagnostic evolution: protocol fires at epoch 4 of DFA	YurenHao0426
	Replays per-epoch logged data from results/snapshot_evolution_v2/ through the protocol thresholds.
	Result: diagnostics (a) ||h_l|| explosion AND (b) ||g_L|| at floor BOTH first fire at epoch 4 of DFA training. At that point, DFA test acc is 0.308; its final value at epoch 100 is also 0.308. The protocol could have walked back the headline 96 epochs before training finished.
	DFA's gamma hovers at 0.087-0.107 for all 100 epochs. A reviewer looking at acc+gamma would conclude 'DFA is hovering at 31% acc with ~0.10 alignment, both reasonable'. Wrong on both counts.
	BP never fires any diagnostic at any epoch. Stays bounded at ||h_L||~200, ||g_L||~3-5e-5, accuracy climbs to 0.61.
	This is the temporal validation of decision utility: the protocol catches the pathology AS IT HAPPENS, not just retrospectively.
2026-04-07	Add random-init sanity check: protocol does not flag untrained networks	YurenHao0426
	3-seed random init ResMLP gives chance accuracy (~10%) but the protocol verdict is 'trustworthy' on all 3 seeds:
	- residual norms ~8.7 across all layers (no growth, bounded)
	- BP gradient norms ~8e-3 (healthy, well above 1e-7 floor)
	- cross-batch stability 0.08-0.18 (in the BP/EP range)
	This is the answer to the likely reviewer question: 'is your protocol just flagging anything that doesn't perform well?' Answer: no. Random init is at chance and the protocol passes it. The walked-back trained methods are walked back because of the measurements, not because of the accuracy.
	Notable: random init g-norms (8e-3) are actually HIGHER than BP-trained ones (4e-4), because BP training reduces the gradient magnitude as loss decreases. So the protocol distinguishes 3 distinct regimes: (1) untrained healthy, (2) trained-and-still-healthy (BP/EP), (3) trained-into-pathology (DFA/SB/CB).
2026-04-07	Audit table extension to 3 seeds (s42/s123/s456)	YurenHao0426
	3 seeds × 5 methods × 4 diagnostics = 60 measurements.
	Key reproducibility findings:
	- BP: trustworthy on all 3 seeds (acc 0.61-0.62, h_L ~200, g_L ~3-4e-4)
	- EP: trustworthy on all 3 seeds (acc 0.29-0.36, h_L 3-8e3, g_L ~1e-4)
	- DFA, SB, CB: walked back on all 3 seeds × all 3 of (a)/(b)/(d)
	Diagnostic (c) is bimodal across seeds, which confirms the prior memory finding:
	- DFA s42=0.047 (noise), s123=0.436 (drift), s456=-0.005 (noise)
	- SB s42=0.992 (drift), s123=0.561 (drift), s456=0.035 (noise)
	- CB s42=0.352 (drift), s123=0.250 (~edge), s456=0.518 (drift)
	(c) catches different methods on different seeds. (a)/(b)/(d) catch all 3 failing methods on all 3 seeds, giving robust binary detection.
2026-04-07	Add protocol decision-utility ablation table	YurenHao0426
	Builds on the 5-method audit JSON. For each method, evaluates 7 reporting strategies (S0=acc only, S1=+Γ field standard, S2-S5=+single diagnostic, S_full=full protocol), and emits the verdict each strategy would have reached. Result: 3 of 5 methods (DFA/SB/CB) are walked back by S_full but NOT by S1. Each of (a)scale, (b)floor, (d)frozen is independently sufficient for binary detection of those 3 failures. Diagnostic (c)stability adds sub-mode discrimination (drift vs noise) but not new positive detections. This is the §3 protocol decision-utility evidence.
2026-04-07	Add audit table example: protocol applied to BP/DFA/SB/CB/EP	YurenHao0426
	5-method audit table on 4-block d=256 ResMLP CIFAR-10 seed 42:
	- BP: trustworthy (acc 0.615, h_L=2e2, g_L=4e-4, stab 0.099)
	- DFA: walked back via (a)+(b)+(d): h_L=4e8, g_L=4e-9, undercuts frozen
	- State Bridge: walked back via all 4 diagnostics; stability 0.992 is the cleanest possible drift-dominated case
	- Credit Bridge: walked back via all 4; stability 0.352, also drift mode
	- EP: trustworthy (acc 0.359, h_L=3e3, g_L=2e-4, stab -0.036); the paper's internal control case
	This is the §2 audit evidence for the main-track paper. Confirms that standard headline acc + Γ silently fails on 3 of 5 methods on this architecture, while the 4-diagnostic protocol catches all three.
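As a quick cross-check, the floor diagnostic (b) can be applied to the g_L values the table reports. This sketch covers only the three methods whose g_L is quoted above (State Bridge and Credit Bridge values are not given in the message), uses the 1e-7 floor stated elsewhere in this log, and is not the audit script itself.

```python
FLOOR = 1e-7  # diagnostic (b) threshold quoted elsewhere in this log

# g_L values quoted in the seed-42 audit table.
reported_g_L = {"BP": 4e-4, "DFA": 4e-9, "EP": 2e-4}

# A method is flagged by (b) when its gradient norm sits below the floor.
flagged = {method: g < FLOOR for method, g in reported_g_L.items()}

for method, is_flagged in flagged.items():
    status = "walked back by (b)" if is_flagged else "passes (b)"
    print(f"{method}: g_L={reported_g_L[method]:.0e} -> {status}")
```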
2026-04-07	Add FA diagnostic protocol reference implementation	YurenHao0426
	Codex round 15 #1 priority for the E&D-track paper:
	- protocol/protocol.py: 4 diagnostics (residual norms, BP grad norms, cross-batch direction stability, and a frozen-baseline comparator)
	- protocol/report.py: DiagnosticReport with per-diagnostic verdicts and pretty-printer
	- protocol/smoke_test.py: validates BP/DFA/EP checkpoints produce the expected verdicts (BP/EP trustworthy; DFA walked back via residual explosion + BP grad at floor)
	- protocol/README.md: usage, audit cases, threshold rationale
	- protocol/CHECKLIST.md: 6 evaluation pipeline pitfalls (norm(-1), cosine_similarity eps clamp, fp16 underflow, Bs reproducibility, aggregation, layer-0 dominance)
	- protocol/REPORTING_TEMPLATE.md: per-method fillable form for FA papers