faeval.git

Age	Commit message	Author
2026-04-08	Round 31: fill §6 Recommended Protocol + §7 Discussion prose; v2 content-complete	YurenHao0426
	Six paragraphs total via codex round 31:
	- §6 §6.1: measurement-validity-first ordering with 63x and 24338x calibration gaps
	- §6 §6.2: minimal four-check protocol; decision-utility 0/5 vs 3/5 walk-back
	- §6 §6.3: conservative asymmetry (BP/EP preserved, DFA/SB/CB walked back)
	- §7 §7.1: scope claim (evaluation failure, not algorithmic impossibility)
	- §7 §7.2: limits (CIFAR-10 only, observational LN interpretation, lower-bound BP+penalty control)
	- §7 §7.3: lesson (decompose evaluation question, position vs Jordan/O'Bray/Paleka)
	Compiles to 12 pages (main content 1-8, refs+appendices 8-12), within E&D 9-page main budget.
	Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Round 30: fill in §5 Intervention and Cross-Arch prose (4 paragraphs) via codex	YurenHao0426

2026-04-08	Round 29: fill in §4 Failure Mode 2 prose (3 paragraphs) via codex	YurenHao0426

2026-04-08	Round 28: fill in §3 Failure Mode 1 prose (4 paragraphs) via codex	YurenHao0426

2026-04-08	Round 27: fill in §2 Audit prose (4 paragraphs) via codex	YurenHao0426
	Codex round 27 produced 4 substantive paragraphs for §2, replacing thin placeholders. Each paragraph follows round 23's prescription:
	P1: canonical setting (4-block d=256, AdamW, 100 ep, 3 seeds) + table/figure references
	P2: under field-standard reporting, all 5 methods look fine
	P3: EP internal comparison: same trustworthy measurement regime, BUT EP depth contribution is also marginally negative (-3.3 pp vs frozen baseline). Honest about EP being trustworthy-measurement but neutral-depth-contribution (per round 27 prompt's caveat).
	P4: frozen-baseline comparison gives the walk-back: BP +26.6 pp, DFA -4.3 pp, SB -14.4 pp, CB -6.0 pp. Diagnostic split lines up with acc split.
	Compiles cleanly. Next: §3 Failure Mode 1 prose via round 28.
2026-04-08	Round 26: fill in §1 Introduction prose (3 paragraphs) via codex	YurenHao0426
	Codex round 26 produced 3 substantive paragraphs for §1, replacing the 3 thin placeholder sentences. Each paragraph follows round 23's prescription:
	P1: claim sentence + numerical evidence (DFA 0.306 < frozen 0.349; layer-0 +0.42 vs deep ~0; ||g_L|| ~ 5e-10 < eps clamp 1e-8) + closing 'measurement regime must be valid'
	P2: 5-method audit shows the two modes; intervention dissociation (lambda=1e-4 alleviates Mode 1 not Mode 2; vanilla ep 1 has meaningful ||g|| but deep cos still ~0) + closing
	P3: methodological contribution framing + cite Paleka, O'Bray, Jordan + closing roadmap
	Compiles cleanly. PDF still has §2-§7 with topic sentences only (TODO next via per-section codex rounds).
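	Note on the P1 evidence (||g_L|| ~ 5e-10 < eps clamp 1e-8): a minimal sketch of why a norm below the clamp makes a cosine readout uninformative. The function name `clamped_cosine` and the per-norm clamp form are assumptions for illustration; this log does not show the cosine implementation actually used in the repository.

```python
import math

EPS = 1e-8  # clamp value quoted in the commit message


def clamped_cosine(x, y, eps=EPS):
    """Cosine similarity with each norm clamped to at least eps (assumed form)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # When a norm falls below eps, the denominator is set by eps, not by the
    # vectors, so the result shrinks toward 0 regardless of direction.
    return dot / (max(norm_x, eps) * max(norm_y, eps))


healthy = [3.0, 4.0]      # norm 5, far above the clamp
tiny = [3e-10, 4e-10]     # norm 5e-10, below the clamp

print(clamped_cosine(healthy, healthy))  # perfectly aligned, reads 1.0
print(clamped_cosine(tiny, tiny))        # perfectly aligned, reads about 0.0025
```

	Under this assumption, a vector compared with itself reads near zero once its norm is below the clamp, so a near-zero cosine in that regime does not by itself say anything about direction quality.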
2026-04-08	Fill in tables 1-3 + generate figures 2/4/5 from existing data	YurenHao0426
	Tables filled with real values:
	Table 1: 5-method audit (3-seed mean ± std for acc, headline Γ, verdict)
	Table 2: 4-condition mode 2 validation (cos and ρ values from existing checkpoint measurements)
	Table 3: protocol thresholds (50×, 1e-7, 0.30, 2pp)
	Figures generated from existing data:
	fig2_decision_utility.pdf: 5×7 verdict heatmap from results/protocol_audit/ablation_decision_utility.json
	fig4_penalty_rescue.pdf: 3-panel (trajectory + cos/ρ bars + 2×2 acc) from snapshot_evolution_v2 + dfa_residual_penalty + bp_with_penalty
	fig5_cross_arch_summary.pdf: 5×4 BP/DFA verdict matrix across architectures
	Compiles to 8 pages with all tables/figures rendered. §1-§7 main body still has only paragraph topic sentences (TODO: per-section prose filling via codex). Figure numbering is wrong (codex put figures in section order, not numerical order; needs fixing).
2026-04-08	v2 skeleton from round 25: section structure now matches round 23	YurenHao0426
	Round 24's skeleton had 3 deviations from round 23 redo:
	- Made §3 'Diagnostic Protocol' instead of 'Failure Mode 1'
	- Collapsed Mode 1 + Mode 2 into one §4
	- Added §6 'Reference Implementation' (was supposed to be dropped)
	Round 25 fixed all three. New §3-§7 match round 23 redo exactly:
	§3 Failure Mode 1: Measurement Degeneracy
	§4 Failure Mode 2: Low Intrinsic Credit-Direction Quality
	§5 Intervention and Cross-Architecture Evidence
	§6 Recommended FA Evaluation Protocol
	§7 Discussion, Limits, Conclusion
	Also added:
	- In-line bibliography with 12 \bibitem entries (Paleka, O'Bray, Jordan + FA literature); citations resolve correctly now
	- Appendices A-G with actual prose content (not just headers)
	- 7-pitfall catalog with descriptions
	- Walk-back chain methodology paragraph
	- 7-validation summary table
	Compiles to 9 pages with figures 1+3 inline (existing PNGs) and figures 2/4/5 as placeholder text PDFs (TODO: regenerate). Tables 1/2/3 still have TODO placeholders for numerical values.
	Next: fill in tables 1-3 with existing JSON data, generate figures 2/4/5 from existing data, then consult codex per-section for prose filling.
2026-04-08	Archive failed v1 draft as v1_rejected.tex; remove main.tex/main.pdf	YurenHao0426
	User rejected the v1 draft as '流水账实验报告' (sequential experiment report). Round 22 + 23 redid the outline with E&D-genre prescription. Saving v1 as v1_rejected.tex for reference. New main.tex will be written from round 24 LaTeX skeleton (codex offered to provide it), section by section, with codex check on each section's prose.
2026-04-08	Compile paper PDF + fix bibstyle for tectonic	YurenHao0426
	Compiled with tectonic (the only LaTeX engine on this server). Two fixes needed:
	1. Pass [numbers,compress] to natbib via PassOptionsToPackage so the numerical bibliography style works
	2. Use bibstyle 'abbrvnat' instead of 'plain' (compatible with natbib)
	Result: 10-page PDF, ~7.5 content pages (well under 9-page E&D limit), references on pages 8-9, appendices A-D on pages 9-10. PDF uploaded to broker as 1843506b_main.pdf for user review.
2026-04-08	Paper main.tex: add §5.4 λ sweep dissociation table	YurenHao0426
	The λ sweep is the strongest single piece of two-mode separation evidence and doesn't require the early-epoch caveat. New §5.4 with table showing:
	λ=0: vanilla, both modes broken
	λ=1e-4: mode 1 ALLEVIATED (||h_L||=2.4e4, ||g||=6.3e-7), mode 2 NOT (cos -0.022, rho -0.004)
	λ=1e-2: mode 1 alleviated, mode 2 partially (cos +0.16, rho +0.09)
	λ=1e-1: slightly over-constrained (cos +0.13, rho +0.07)
	The two modes have different intervention thresholds. §5.4 is now the killer evidence; the early-epoch disambiguation in §5.3 becomes supporting. Updated section summary to 'five validations'.
2026-04-08	First draft of NeurIPS 2026 E&D paper	YurenHao0426
	Title: 'Beyond Accuracy and Alignment: A Diagnostic Evaluation Protocol for Feedback Alignment'
	Structure (per round 21 prescription):
	Abstract: 'broken because conflated' framing, 2 distinct modes named
	§1 Introduction: discovery hook -> 2-mode framing -> contribution
	§2 Related work
	§3 Audit (the field-standard pair walks back nothing)
	§4 The diagnostic protocol (4 diagnostics, calibrated thresholds, decision-utility ablation, cross-architecture validation)
	§5 Two distinct failure modes (mechanism, penalty rescue, direct cosine measurement, hypothesis-disambiguation, capacity-cost control)
	§6 Limitations
	§7 Broader impacts
	§8 Conclusion
	Appendices: reproducibility, 7-pitfalls catalog, walk-back chain (4 step), all 6 validations of the two-mode separation
	Includes 4 result tables, ~10 references, structured as eandd track double-blind submission. 760 lines of LaTeX, balanced environments verified. Ready for compilation on a system with pdflatex.
	Template: paper/neurips_2026.{sty,tex}, downloaded from official NeurIPS 2026 source. checklist.tex also unzipped.