faeval.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2026-04-08	paper v2.34.3: fix Figure 4 cross-arch verdict matrix (data + layout)	YurenHao0426
	User flagged Figure 4 issues. Found three problems: 1. Row 4 (no-terminal-LN ResMLP) (d) frozen was encoded as 0 (passes) but the actual data is no-outln DFA acc 0.327 ± 0.012 (3-seed) vs frozen baseline 0.349 ± 0.002 → margin -2.2 pp, beyond the 2 pp threshold → (d) FIRES. Updated to 1 (WB). 2. Row 5 (CNN BN) cells (c) and (d) were encoded as 0 (passes) but the CNN audit (results/protocol_audit/audit_cnn_3seed.json) only measured (a) and (b); there is no CNN frozen baseline and no CNN stability run. Showing them as ✓ was misleading. Added a third color (gray, "—") for "not measured" and marked CNN (c)+(d) accordingly. 3. Layout had massive empty vertical space below the panels with the key-finding text floating far below. Compressed: - figsize (11, 4.2) → (11, 3.2) [tighter aspect ratio] - Key-finding text moved from axes-coordinates y=-1.55 (way below plot) to figure-coordinates y=-0.05 (directly under panels) - BP panel title clarified: "BP-trained: protocol passes" → "BP-trained: protocol passes everywhere" Also marked ViT-Mini (c) and no-LN ResMLP (c) as "not measured" since neither has a saved cross_batch_stability value (the audit_cnn, audit_d512, snapshot_vit_v1, and snapshot_no_outln_v1 files don't include this diagnostic). New verdict matrix: (a) (b) (c) (d) ResMLP-d256 LN WB WB ✓ WB ResMLP-d512 LN WB WB ✓ WB ViT-Mini WB WB — WB ResMLP-d256 no-LN WB ✓ — WB ← row 4 (d) was wrong CNN BN WB ✓ — — ← row 5 (c)+(d) were misleading Key finding "(b) only fires on terminal-LN architectures" is unchanged and now visually clearer (rows 1-3 have WB in (b), rows 4-5 have ✓). Page impact: total page count 19 → 18 (the more compact figure reclaimed an entire page). §1-§7 main content still fits on 9 pages. Updated docstring with full data sources for each row. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31.9: relabel "StudentNet" → "no-terminal-LN ResMLP"	YurenHao0426
	The §3 ¶3 / §5 ¶3 / Figure 5 / §7 mentions of "StudentNet" as a cross-architecture validation case were a misleading rebrand of the no-terminal-LN ResMLP-d256 ablation. Verified by tracing the data: results/protocol_audit/temporal_evolution_s{42,123,456}.json final_acc 0.332/0.313/0.336 (matches no-outln 3-seed 0.327±0.012) first_fire_a {18, 14, 25} first_fire_b None / None / None The actual synth StudentNet (results/snapshot_synth_v1, d=128 alpha=1.0) has max-per-block growth ~6.88 over 80 epochs and never reaches the 50× threshold, so diagnostic (a) does NOT fire on the real synth StudentNet at all. Calling the no-outln data "StudentNet" doubled- counted the same architecture under two names (the same-backbone causal control AND the cross-arch generalization test). Relabeled to "no-terminal-LN ResMLP" everywhere it appeared: - §3 ¶3 paragraph 1 cross-arch list - §3 ¶3 paragraph 2 (now with explicit per-seed first-fire epochs {18,14,25}) - §5 paragraph (the conclusion) - §7 conclusion (cross-arch list) - Figure 5 caption - Figure 5 row label (with re-rendered PDF) The remaining cross-arch generalization claim is now: ViT-Mini fires both diagnostics, ResMLP at d=256/d=512 fires both, no-terminal-LN ResMLP and BatchNorm CNN fire only (a) — three real architecture classes, with the no-LN ablation being the same-backbone control rather than a separate architecture. The cross-arch story is slightly weaker ("3 architecture classes" not "4") but truthful and self-consistent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Figures 3 and 4: fix aspect ratio (fig3 was squeezed strip) and key-finding ↵	YurenHao0426
	label overlap (fig4) Per user feedback: - fig4_penalty_rescue.pdf (Figure 3 in paper): was figsize=(13, 3.5), aspect 3.7:1, which rendered as a thin strip with squeezed subplot content. Increased height to figsize=(13, 6.0), aspect 2.2:1. Much taller panels that actually show axis labels and legends readably. - fig5_cross_arch_summary.pdf (Figure 4 in paper): the 'Key finding' italic text annotation at y=-1.0 in axes transform was overlapping with the multiline architecture y-tick labels at the bottom of the second subplot. Moved to y=-1.55 and increased figsize height from 3.5 to 4.2 so the lower annotation still fits in bbox_inches='tight' crop. - Also bumped includegraphics width from 0.92\linewidth to \linewidth for both figures so they use the full text width. Main content still exactly 9 pages within E&D budget. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Fill in tables 1-3 + generate figures 2/4/5 from existing data	YurenHao0426
	Tables filled with real values: Table 1: 5-method audit (3-seed mean ± std for acc, headline Γ, verdict) Table 2: 4-condition mode 2 validation (cos and ρ values from existing checkpoint measurements) Table 3: protocol thresholds (50×, 1e-7, 0.30, 2pp) Figures generated from existing data: fig2_decision_utility.pdf: 5×7 verdict heatmap from results/protocol_audit/ablation_decision_utility.json fig4_penalty_rescue.pdf: 3-panel — trajectory + cos/ρ bars + 2×2 acc from snapshot_evolution_v2 + dfa_residual_penalty + bp_with_penalty fig5_cross_arch_summary.pdf: 5×4 BP/DFA verdict matrix across architectures Compiles to 8 pages with all tables/figures rendered. §1-§7 main body still has only paragraph topic sentences (TODO: per-section prose filling via codex). Figure numbering is wrong (codex put figures in section order not numerical order — need fixing).
2026-04-08	v2 skeleton from round 25: section structure now matches round 23	YurenHao0426
	Round 24's skeleton had 3 deviations from round 23 redo: - Made §3 'Diagnostic Protocol' instead of 'Failure Mode 1' - Collapsed Mode 1 + Mode 2 into one §4 - Added §6 'Reference Implementation' (was supposed to be dropped) Round 25 fixed all three. New §3-§7 match round 23 redo exactly: §3 Failure Mode 1: Measurement Degeneracy §4 Failure Mode 2: Low Intrinsic Credit-Direction Quality §5 Intervention and Cross-Architecture Evidence §6 Recommended FA Evaluation Protocol §7 Discussion, Limits, Conclusion Also added: - In-line bibliography with 12 \bibitem entries (Paleka, O'Bray, Jordan + FA literature) — citations resolve correctly now - Appendices A-G with actual prose content (not just headers) - 7-pitfall catalog with descriptions - Walk-back chain methodology paragraph - 7-validation summary table Compiles to 9 pages with figures 1+3 inline (existing PNGs) and figures 2/4/5 as placeholder text PDFs (TODO: regenerate). Tables 1/2/3 still have TODO placeholders for numerical values. Next: fill in tables 1-3 with existing JSON data, generate figures 2/4/5 from existing data, then consult codex per-section for prose filling.