faeval.git

Age	Commit message	Author
2026-04-07	Add training-monitor early-stop demo: 96% compute savings on DFA	YurenHao0426
	Demonstrates the practical use case of the protocol — not as a post-hoc audit but as an in-training abort condition. Walks through the existing per-epoch trace and shows when the protocol would have triggered an early stop on DFA training and what the saved compute would be. Result: DFA on 4-block d=256 ResMLP fires diagnostic (b) at epoch 4 with test acc 0.3076. The final acc at epoch 100 is also 0.3076 (identical). Stopping at epoch 4 saves 96% of compute with zero headline acc loss.
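	The abort condition and the savings arithmetic can be sketched as follows. This is a minimal illustration, not the repository's monitor: the function names and the gradient-norm trace are hypothetical, and only the 1e-7 floor, the epoch-4 trigger, and the 100-epoch budget come from the log.

```python
def first_fire_epoch(grad_norms, floor=1e-7):
    """Return the 1-indexed epoch where the gradient norm first falls below floor, or None."""
    for epoch, g in enumerate(grad_norms, start=1):
        if g < floor:
            return epoch
    return None


def compute_saved(stop_epoch, total_epochs):
    """Fraction of the planned epochs that is skipped by stopping after stop_epoch."""
    return (total_epochs - stop_epoch) / total_epochs


# Made-up trace for illustration only (NOT the logged DFA run):
# the norm collapses below the floor at epoch 4 and stays there.
trace = [3e-4, 2e-5, 6e-7, 4e-9] + [4e-9] * 96

stop = first_fire_epoch(trace)
print(stop, f"{compute_saved(stop, len(trace)):.0%}")  # 4 96%
```

	Stopping after epoch 4 of a 100-epoch budget skips 96 epochs, which is where the 96% figure comes from.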
2026-04-07	Cross-architecture temporal validation: 3 archs x 3 seeds x 2 methods	YurenHao0426
	ResMLP (4-block d=256, with out_ln, CIFAR-10):
	  s42: DFA (a) ep 8, (b) ep 4, acc 0.308
	  s123: DFA (a) ep 11, (b) ep 4, acc 0.320
	  s456: DFA (a) ep 8, (b) ep 3, acc 0.300
	ViT-Mini (4-block d=128, cls token + terminal LN, CIFAR-10):
	  s42: DFA (a) ep 1, (b) ep 3, acc 0.256
	  s123: DFA (a) ep 1, (b) ep 2, acc 0.202
	  s456: DFA (a) ep 1, (b) ep 3, acc 0.253
	StudentNet (4-block d=128, NO terminal LN, synthetic alpha=1.0):
	  s42: DFA (a) ep 18, (b) NEVER, acc 0.332
	  s123: DFA (a) ep 14, (b) NEVER, acc 0.314
	  s456: DFA (a) ep 25, (b) NEVER, acc 0.336
	BP: never fires on any seed x any architecture (9/9 sanity passes).
	Key cross-architecture finding: diagnostic (b) is specifically the LN-driven failure mode. Without out_ln, the BP grad never crosses the 1e-7 floor, even though (a) still fires (the residual stream still grows, just without the LN-cancellation pathology that drives the BP grad to the floor). This is the causal architectural control: (b) specifically tests 'is terminal-LN gradient cancellation active?' and (a) tests 'is the residual stream growing without bound?'. They are linked but separable.
	This is the §3 cross-architecture validation evidence.
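	One mechanism consistent with the link between (a) and (b) is that LayerNorm is scale-invariant, so the gradient reaching its input shrinks in proportion to the input's scale. The sketch below is an assumption-laden toy (no affine parameters, eps = 0, a linear readout with made-up weights), not the repository's model; it only checks that scaling the input by 1e6 shrinks the input gradient by 1e6.

```python
import math


def layernorm_input_grad(x, w):
    """Gradient of sum(w * LN(x)) w.r.t. x, for LayerNorm without affine terms and eps = 0."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    y = [(v - mu) / sigma for v in x]
    w_mean = sum(w) / n
    wy_mean = sum(wi * yi for wi, yi in zip(w, y)) / n
    # Standard LayerNorm backward: (dy - mean(dy) - y * mean(dy * y)) / sigma, with dy = w.
    return [(wi - w_mean - yi * wy_mean) / sigma for wi, yi in zip(w, y)]


def l2(v):
    return math.sqrt(sum(t * t for t in v))


# Illustrative values only.
x = [0.3, -1.2, 2.0, 0.7]
w = [1.0, -0.5, 0.25, 2.0]

g_small = l2(layernorm_input_grad(x, w))
g_large = l2(layernorm_input_grad([1e6 * v for v in x], w))
ratio = g_small / g_large
print(f"{ratio:.3e}")  # close to 1e6
```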
2026-04-07	Protocol diagnostic (a): use max per-block growth, not max/min ratio	YurenHao0426
	Old metric: max(||h||) / max(||h_0||, eps). False-positives on ViT-style architectures because the cls token at layer 0 (right after patch_embed) has anomalously small magnitude (~0.3-1.5), inflating the ratio even on healthy BP-trained ViTs.
	New metric: max_l(||h_{l+1}|| / ||h_l||), the largest single-block residual amplification. Architecture-invariant.
	Calibration:
	- BP-trained, late training: <5x per block
	- BP ViT, early epochs (cls token resolving): 13-25x max
	- DFA-trained ResMLP/ViT: 100-4000x per block
	Threshold raised from 10 to 50 to sit cleanly between healthy-early-training (max 25) and failure-regime (min 100).
	Re-verifications:
	- smoke test (BP/DFA/EP): all 3 verdicts unchanged
	- random init (3 seeds): trustworthy on all 3
	- 5-method audit table single-seed: identical verdicts
	- decision-utility ablation: identical (still 0/5 by S1, 3/5 by S_full)
	- temporal evolution 3-seed: (b) now fires first at ep 3-4, (a) at ep 8-11. Both well before training ends. The 'protocol fires ~92 epochs early' story still holds.
	- ViT temporal evolution: BP no longer false-fires; DFA fires (a) ep 1, (b) ep 3. The protocol works on the second architecture.
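	The two metrics can be compared on per-layer norms directly. This is a minimal sketch with hypothetical function names, an assumed eps, and made-up norm profiles; only the two formulas and the thresholds (10 before, 50 after) are taken from the commit message.

```python
def old_ratio(norms, eps=1e-12):
    """Old diagnostic (a): max(||h||) / max(||h_0||, eps)."""
    return max(norms) / max(norms[0], eps)


def max_block_growth(norms, eps=1e-12):
    """New diagnostic (a): max over l of ||h_{l+1}|| / ||h_l||."""
    return max(nxt / max(cur, eps) for cur, nxt in zip(norms, norms[1:]))


# Illustrative per-layer residual norms (made up, not measured).
healthy_vit = [0.5, 6.0, 9.0, 11.0, 12.0]   # small cls-token norm at layer 0
failing_dfa = [8.0, 900.0, 2.0e5, 4.0e8]    # runaway residual stream

print(old_ratio(healthy_vit))          # 24.0 -> would exceed the old threshold of 10
print(max_block_growth(healthy_vit))   # 12.0 -> below the new threshold of 50
print(max_block_growth(failing_dfa) > 50)  # True
```

	A small layer-0 norm dominates the old ratio for the whole network, whereas the per-block metric confines its effect to the first block.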
2026-04-07	Add reproducers for pitfalls 1-3 in CHECKLIST.md	YurenHao0426
	Each bug from the catalog has a synthetic reproducer that runs in <1 sec without GPU:
	Bug 1: x.norm(-1) on a 2x2 tensor returns 1.143 (L_{-1} of whole tensor) instead of [5, 10] (per-row L_2 along dim=-1).
	Bug 2: F.cosine_similarity(a, b) with ||b||=5e-10 returns +0.000905 instead of the true +0.018101. The clamp (eps=1e-8) inflates the divisor 20x, so the cosine is underestimated 20x.
	Bug 3: 5e-10 in fp16 -> 0 (underflows below the smallest subnormal, ~6e-8). Downstream F.cosine_similarity returns NaN. bf16 works because it shares fp32's exponent range.
	Bugs 4-6 (Bs reproducibility, aggregation, layer-0 dominance) require a trained network and are demonstrated inside audit_table and ablation_decision_utility.
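	The arithmetic behind bugs 1-3 can be checked without PyTorch. The sketch below is a stdlib-only approximation, not the repository's reproducers: the tensor [[3, 4], [6, 8]] and the vectors a, b are assumptions chosen because they reproduce the quoted numbers, the clamp is modelled as max(norm, eps) on each norm, and bf16 is emulated by truncating an fp32 bit pattern.

```python
import math
import struct

# Bug 1: p = -1 norm over the whole tensor vs per-row L2 norm (assumed tensor).
x = [[3.0, 4.0], [6.0, 8.0]]
l_minus1 = sum(abs(v) ** -1 for row in x for v in row) ** -1
row_l2 = [math.sqrt(sum(v * v for v in row)) for row in x]
print(round(l_minus1, 3), row_l2)  # 1.143 [5.0, 10.0]

# Bug 2: clamping a tiny norm to eps inflates the divisor (assumed vectors).
eps = 1e-8
c = 0.018101                      # intended true cosine
a = [1.0, 0.0]
b = [5e-10 * c, 5e-10 * math.sqrt(1.0 - c * c)]
dot = sum(p * q for p, q in zip(a, b))
norm_a, norm_b = math.hypot(*a), math.hypot(*b)
cos_true = dot / (norm_a * norm_b)
cos_clamped = dot / (max(norm_a, eps) * max(norm_b, eps))
print(round(cos_true, 6), round(cos_clamped, 6))  # 0.018101 0.000905

# Bug 3: 5e-10 underflows to zero in IEEE half precision but survives in bf16.
fp16_value = struct.unpack("<e", struct.pack("<e", 5e-10))[0]


def to_bf16(v):
    """Emulate bf16 by keeping the top 16 bits of the fp32 pattern (truncation, not rounding)."""
    bits = struct.unpack("<I", struct.pack("<f", v))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


bf16_value = to_bf16(5e-10)
print(fp16_value, bf16_value > 0.0)  # 0.0 True
```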
2026-04-07	Temporal evolution 3-seed: protocol fires at DFA epoch 3-4 on all seeds	YurenHao0426
	s42: (a)+(b) fire at epoch 4, DFA final acc 0.3076
	s123: (a)+(b) fire at epoch 4, DFA final acc 0.3203
	s456: (a)+(b) fire at epoch 3, DFA final acc 0.2998
	BP never fires on any seed (final acc 0.61-0.63).
	The 'protocol catches it 96 epochs early' finding is fully reproducible across seeds.
2026-04-07	Add temporal diagnostic evolution: protocol fires at epoch 4 of DFA	YurenHao0426
	Replays per-epoch logged data from results/snapshot_evolution_v2/ through the protocol thresholds. Result: diagnostics (a) ||h_l|| explosion AND (b) ||g_L|| at floor BOTH first fire at epoch 4 of DFA training. At that point, DFA test acc is 0.308; its final value at epoch 100 is also 0.308. The protocol could have walked back the headline 96 epochs before training finished. DFA's gamma hovers at 0.087-0.107 for all 100 epochs. A reviewer looking at acc+gamma would conclude 'DFA is hovering at 31% acc with ~0.10 alignment, both reasonable'. Wrong on both counts. BP never fires any diagnostic at any epoch. Stays bounded at ||h_L||~200, ||g_L||~3-5e-5, accuracy climbs to 0.61. This is the temporal validation of decision utility: the protocol catches the pathology AS IT HAPPENS, not just retrospectively.
2026-04-07	Add random-init sanity check: protocol does not flag untrained networks	YurenHao0426
	3-seed random init ResMLP gives chance accuracy (~10%) but the protocol verdict is 'trustworthy' on all 3 seeds:
	- residual norms ~8.7 across all layers (no growth, bounded)
	- BP gradient norms ~8e-3 (healthy, well above 1e-7 floor)
	- cross-batch stability 0.08-0.18 (in the BP/EP range)
	This is the answer to the likely reviewer question: 'is your protocol just flagging anything that doesn't perform well?' Answer: no. Random init is at chance and the protocol passes it. The walked-back trained methods are walked back because of the measurements, not because of the accuracy.
	Notable: random init g-norms (8e-3) are actually HIGHER than BP-trained ones (4e-4), because BP training reduces the gradient magnitude as loss decreases. So the protocol distinguishes 3 distinct regimes: (1) untrained healthy, (2) trained-and-still-healthy (BP/EP), (3) trained-into-pathology (DFA/SB/CB).
2026-04-07	Audit table extension to 3 seeds (s42/s123/s456)	YurenHao0426
	3 seeds × 5 methods × 4 diagnostics = 60 measurements.
	Key reproducibility findings:
	- BP: trustworthy on all 3 seeds (acc 0.61-0.62, h_L ~200, g_L ~3-4e-4)
	- EP: trustworthy on all 3 seeds (acc 0.29-0.36, h_L 3-8e3, g_L ~1e-4)
	- DFA, SB, CB: walked back on all 3 seeds × all 3 of (a)/(b)/(d)
	Diagnostic (c) is bimodal across seeds, which confirms the prior memory finding:
	- DFA s42=0.047 (noise), s123=0.436 (drift), s456=-0.005 (noise)
	- SB s42=0.992 (drift), s123=0.561 (drift), s456=0.035 (noise)
	- CB s42=0.352 (drift), s123=0.250 (~edge), s456=0.518 (drift)
	(c) catches different methods on different seeds. (a)/(b)/(d) catch all 3 failing methods on all 3 seeds, giving robust binary detection.
2026-04-07	Add protocol decision-utility ablation table	YurenHao0426
	Builds on the 5-method audit JSON. For each method, evaluates 7 reporting strategies (S0=acc only, S1=+Γ field standard, S2-S5=+single diagnostic, S_full=full protocol), and emits the verdict each strategy would have reached. Result: 3 of 5 methods (DFA/SB/CB) are walked back by S_full but NOT by S1. Each of (a)scale, (b)floor, (d)frozen is independently sufficient for binary detection of those 3 failures. Diagnostic (c)stability adds sub-mode discrimination (drift vs noise) but not new positive detections. This is the §3 protocol decision-utility evidence.
2026-04-07	Add DFA direction-quality direct test (codex round 13 option c)	YurenHao0426
	Trains both vanilla DFA (lam=0) and penalized DFA (lam=1e-2) from the same seed, then directly measures the per-layer cosine between DFA's local credit signal e_T @ B_l^T and the BP gradient at hidden layers. Uses the training Bs (not fresh ones, per the Bs-specificity finding from earlier). The penalized run is the key measurement: in that condition the BP grad is ~10^-7 (well above the eps=1e-8 floor), so a near-zero cosine here would be the direct evidence of the second failure mode (direction-quality ceiling) that codex round 13 hypothesized. Pre-registered prediction: penalized cos(DFA, BP) ~ 0.01-0.05 -> direction quality is the second, separable failure mode. Saves the penalized checkpoint so the diagnostic protocol can be re-applied to it (where (a) and (b) should pass, (d) should still fail).
2026-04-07	Add audit table example: protocol applied to BP/DFA/SB/CB/EP	YurenHao0426
	5-method audit table on 4-block d=256 ResMLP CIFAR-10 seed 42:
	- BP: trustworthy (acc 0.615, h_L=2e2, g_L=4e-4, stab 0.099)
	- DFA: walked back via (a)+(b)+(d): h_L=4e8, g_L=4e-9, undercuts frozen
	- State Bridge: walked back via all 4 diagnostics; stability 0.992 is the cleanest possible drift-dominated case
	- Credit Bridge: walked back via all 4; stability 0.352, also drift mode
	- EP: trustworthy (acc 0.359, h_L=3e3, g_L=2e-4, stab -0.036); paper's internal control case
	This is the §2 audit evidence for the main-track paper. Confirms that standard headline acc + Γ silently fails on 3 of 5 methods on this architecture, while the 4-diagnostic protocol catches all three.
2026-04-07	Add FA diagnostic protocol reference implementation	YurenHao0426
	Codex round 15 #1 priority for the E&D-track paper:
	- protocol/protocol.py: 4 diagnostics (residual norms, BP grad norms, cross-batch direction stability, and a frozen-baseline comparator)
	- protocol/report.py: DiagnosticReport with per-diagnostic verdicts and pretty-printer
	- protocol/smoke_test.py: validates BP/DFA/EP checkpoints produce the expected verdicts (BP/EP trustworthy; DFA walked back via residual explosion + BP grad at floor)
	- protocol/README.md: usage, audit cases, threshold rationale
	- protocol/CHECKLIST.md: 6 evaluation pipeline pitfalls (norm(-1), cosine_similarity eps clamp, fp16 underflow, Bs reproducibility, aggregation, layer-0 dominance)
	- protocol/REPORTING_TEMPLATE.md: per-method fillable form for FA papers
2026-04-03	Add 5 extra seeds to synthetic cross-state distance (now 10 seeds for all methods)	YurenHao0426
	BP/DFA/SB/CB: added seeds 2048,3000,4000,5000,6000 (L=4 only, all 3 alphas) Total: 1290 rows (was 990) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03	Recompute all CNN diagnostics with fixed BP gradient flow	YurenHao0426
	CNN CIFAR-10 (5 seeds, fixed Gamma):
	  BP: acc=86.8%, Gamma=0.970, rho=0.603
	  DFA: acc=56.7%, Gamma=0.896, rho=0.061
	  EP: acc=50.6%, Gamma=0.484, rho=0.450
	  SB: acc=63.3%, Gamma=1.000 (BP self-cos, feedback nets not saved), rho=0.601
	  CB: acc=31.8%, Gamma=1.000 (BP self-cos), rho=0.226
	DFA Gamma=0.896 is notably high: CNN DFA credit aligns well with BP gradients.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03	Fix CNN compute_bp_grads: remove inter-layer detach so gradients flow to all layers	YurenHao0426
	Old code detached hidden states between layers, making layers 0-2 disconnected from the loss (gradient = None → 0). Fixed by keeping the forward graph connected. BP CNN Gamma per-layer now: [0.985, 0.990, 0.987, 0.967] (was [0, 0, 0, 0.967]). But gradient norms are ~1e-17 (genuine numerical precision issue with CNN architecture). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03	Complete EP data: 10-seed synthetic + 6-seed CIFAR persample + cross-state	YurenHao0426
	EP synthetic: 30 JSONs + 30 checkpoints (10 seeds × 3α) EP CIFAR persample: 6 seeds × 4 layers × 256 samples = 6144 rows added Synth cross-state: 150 EP rows added (990 total) cifar_persample_all.csv: 30720 rows (was 24576, +6144 EP) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
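	The row counts quoted in this part of the log are mutually consistent, which can be confirmed with a few lines of arithmetic. The helper below is hypothetical and uses only the factors and totals stated in the commit messages.

```python
from math import prod


def rows(*factors):
    """Row count of a full grid with the given factor sizes."""
    return prod(factors)


# CIFAR per-sample table: 256 samples x 4 layers x 4 methods x 6 seeds, then EP added.
base_persample = rows(256, 4, 4, 6)
ep_persample = rows(6, 4, 256)
print(base_persample, ep_persample, base_persample + ep_persample)  # 24576 6144 30720

# EP synthetic artifacts: 10 seeds x 3 alphas.
print(rows(10, 3))  # 30

# Synthetic cross-state table: 840 existing rows + 150 EP rows.
print(840 + 150)  # 990
```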
2026-04-03	EP synthetic 10 seeds complete: 30 JSONs + 30 checkpoints + cross-state distance	YurenHao0426
	Updated synth_cross_state_distance.csv with 150 EP rows (990 total). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03	Add checkpoint saving to ep_synthetic.py	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03	Add EP cross-state distance for CIFAR + verify CNN summary	YurenHao0426
	EP CIFAR d_BP: L0=2.0×, L4=26.7× (much closer to BP than DFA=162×/2.5M×) EP synthetic: no checkpoints saved (ep_synthetic.py didn't save .pt) CNN summary: 20 rows confirmed correct Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03	Add EP synthetic per-seed CSV + synthetic cross-state distance	YurenHao0426
	EP synthetic: 15 rows (3α × 5 seeds) Synth cross-state: 840 rows (3α × 2L × 4 methods × 5 seeds × (L+1) layers) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03	Add EP to BP gradient sparsity analysis	YurenHao0426
	EP CIFAR d=256: s(1e-6)=100%, mean_norm=1.41e-04 EP produces networks where ALL samples have non-zero BP gradients, unlike DFA (0.4%), SB (21%), CB (3%). EP is closer to BP (98.7%). Updated clean_sparsity_summary.csv (980 rows, now includes EP). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03	Add cross-method hidden state distance vs BP	YurenHao0426
	Non-BP methods produce radically different representations:
	  DFA L0: 162×, L4: 2.5M× relative to BP hidden norms
	  SB L0: 3.2×, L4: 1.1M×
	  CB L0: 59×, L4: 1.4M×
	  (BP vs itself = 0)
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03	Update all EP results with sign-corrected credit	YurenHao0426
	EP Synthetic (fixed): Gamma=+0.13~0.20, rho=+0.25
	EP CIFAR d=256: Gamma=+0.007, rho=+0.051
	EP CIFAR d=512: Gamma=+0.000, rho=-0.002
	EP CNN: Gamma=+0.248, rho=+0.492
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03	Fix EP credit sign in cnn_baseline.py	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Recompute EP CIFAR d=256 diagnostics with sign fix	YurenHao0426
	EP d=256 (5 seeds): acc=31.9%, Gamma=+0.007 (was -0.13), rho=+0.051 (was -0.037) Sign correction: -(h_nudge - h_free)/β aligns EP credit with BP gradient direction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Fix EP credit sign: negate (h_nudge - h_free)/β to align with BP grad direction	YurenHao0426
	EP nudge moves h toward lower loss (opposite to BP grad which points toward loss increase). Without negation, Gamma is negative and rho is -0.25. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
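	The sign argument can be illustrated on a toy problem. This sketch is not the repository's EP code: it assumes a quadratic loss L(h) = 0.5 * ||h - target||^2 and models the nudged state as a single step of size β down the loss gradient, with made-up values for h_free, target, and β.

```python
import math


def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v)))


# Illustrative values only.
target = [1.0, -2.0, 0.5]
h_free = [0.2, -0.4, 1.0]
beta = 0.1

# BP gradient of the toy loss: dL/dh = h - target (points toward increasing loss).
bp_grad = [h - t for h, t in zip(h_free, target)]

# Toy nudge: the nudged state moves toward lower loss.
h_nudge = [h - beta * g for h, g in zip(h_free, bp_grad)]

raw_credit = [(n - f) / beta for n, f in zip(h_nudge, h_free)]
fixed_credit = [-r for r in raw_credit]  # the negation described in the commit

print(round(cosine(raw_credit, bp_grad), 6))    # -1.0
print(round(cosine(fixed_credit, bp_grad), 6))  # 1.0
```

	In this toy the raw difference is exactly anti-aligned with the BP gradient, so any alignment metric computed without the negation comes out negative.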
2026-04-02	Add EP synthetic results: 15 JSONs (3α × 5 seeds)	YurenHao0426
	EP synthetic: acc high (92-96%) but Gamma negative (-0.13 to -0.20), rho=-0.25 EP credit direction may be inverted or diagnostics have issue. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add --d_hidden arg to ep_baseline.py for d=512 support	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Fix ep_synthetic: bp dict needs L+1 entries for EP credit comparison	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add EP synthetic ladder script	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add CNN SB+CB results (5 seeds each), update summary CSV	YurenHao0426
	CNN CIFAR-10 (5 seeds):
	  BP: 86.8%±0.3%, Gamma=0.238, rho=0.250
	  DFA: 56.7%±2.0%, Gamma=0.216, rho=0.017
	  SB: 63.3%±0.5%, Gamma=0.045, rho=0.298
	  CB: 31.8%±6.2%, Gamma=0.013, rho=0.033
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Fix CNN state bridge: use custom CNNStateBridge for variable input dims	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add SB and CB methods to cnn_baseline.py	YurenHao0426
	State bridge: per-layer StateBridgeNet predicting h3 from flattened h_l Credit bridge: per-layer ValueNet with terminal + bridge consistency + DFA warmup Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Fix and recompute GELU ablation Gamma from checkpoints	YurenHao0426
	ReLU MLP (L=4 d=256):
	  BP: acc=61.1%, Gamma=1.000, rho=0.998
	  DFA: acc=30.7%, Gamma=0.104, rho=-0.001
	  SB: acc=15.5%, Gamma=0.300, rho=0.159
	  CB: acc=28.7%, Gamma=0.298, rho=0.007
	Note: SB/CB Gamma uses BP gradient as proxy (feedback nets not checkpointed).
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Fix gelu_ablation.py: compute method-specific Gamma instead of hardcoded 1.0	YurenHao0426
	DFA now uses regenerated DFA Bs for credit; SB/CB use BP as proxy (feedback nets not saved). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add summary CSVs for EP, GELU ablation, CNN	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add EP baseline (5 seeds), GELU ablation (20 runs), CNN BP+DFA (10 runs)	YurenHao0426
	EP (L=4 d=256): acc≈30%, Gamma≈0, rho≈0 — EP credit signal weak on feedforward MLP GELU ablation (ReLU variant): 4 methods × 5 seeds complete CNN BP+DFA: 5 seeds each, BP + DFA on SmallCNN Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add GELU/ReLU ablation script for CIFAR MLP	YurenHao0426
	Note: existing ResidualMLP already uses GELU. This adds ResidualMLPReLU variant. Ablation compares ReLU vs GELU for BP/DFA/SB/CB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add CNN baseline: SmallCNN with BP/DFA/EP on CIFAR-10	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02	Add EP baseline implementation (Scellier & Bengio 2017) for CIFAR MLP	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add d=512 support sparsity: 20 JSONs + summary CSV	YurenHao0426
	  BP: s(1e-6)=92.7%, norm=2.70e-04, r_inf=0.159, PR=0.300
	  DFA: s(1e-6)=0.1%, norm=5.31e-08
	  SB: s(1e-6)=20.3%, norm=2.33e-06
	  CB: s(1e-6)=1.2%, norm=9.88e-08
	Same pattern as d=256, confirming width-independence of the sparsity gap.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add d512_sparsity.py: support sparsity for d=512 checkpoints	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add missing bp_s456.json for CIFAR d=512 (rerun after SIGTERM)	YurenHao0426
	bp s=456: acc=0.5999, rho=0.9881, nse=0.4764 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add CIFAR L=4 d=512 confirmatory: 4 methods × 5 seeds with checkpoints	YurenHao0426
	  BP: 60.6%±0.3%, rho=0.989
	  DFA: 30.8%±0.5%, rho=0.003
	  State Bridge: 21.2%±3.7%, rho=0.119
	  Credit Bridge: 30.1%±0.5%, rho=0.002
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add cifar_d512_confirmatory.py: L=4 d=512 with checkpoint saving	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add P3 protocol panel: method ranking across 5 protocol slices	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add per-sample gradient stats: 24576 rows (256 samples × 4 layers × 4 methods × 6 seeds)	YurenHao0426
	Columns: method, seed, layer, sample_id, grad_norm, log10_grad_norm, r_inf, pr, hoyer, topk1, topk5 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add clean_sparsity_persample.py: per-sample gradient stats	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add clean_sparsity_summary.csv: 960 rows aggregated from 168 JSONs	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01	Add clean sparsity results: 168 JSONs from independent processes on GPU 1	YurenHao0426
	CIFAR: 24 JSONs (4 methods × 6 seeds), BP s(1e-6)=98% confirmed Synthetic: 144 JSONs (4 methods × 6 seeds × 3 alphas × 2 depths) All data reliable — each method+seed in separate Python process. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>