faeval.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2026-04-08	Save null_calibration_penalized_dfa.json for §6 ¶2 audit	YurenHao0426
	The §6 ¶2 fresh-B null control claim "deep cos +0.002 ± 0.022 (n=20 draws), per-layer stds 0.013-0.023" was verified against a fresh re-run of experiments/null_calibration_penalized_cos.py: training-Bs deep cos: +0.1627 (matches Appendix L row) fresh-Bs deep cos: +0.0022 ± 0.0220 (per-layer std avg, n=20) per-layer stds: [0.0125, 0.0221, 0.0162, 0.0229, 0.0228] (l0-l4) The "0.013-0.023" range matches the per-layer std range exactly. The "± 0.022" is the average per-layer std across deep layers (l1-l4). Saved as the auditable source. The script (experiments/null_calibration_ penalized_cos.py) can re-derive these values from the saved checkpoint in results/dfa_pen_short/dfa_pen_lam0.01_s42.pt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31.11: §3 ¶3 d=512 max-per-block growth uses sourced DFA value	YurenHao0426
	§3 ¶3 said d=512 has "even larger max-per-block growth (about 1.5×10^4)" without a clear source for 1.5e4. The actual DFA-d=512 max-per-block growth from results/protocol_audit/audit_d512_3seed.json: s42: 7788, s123: 6397, s456: 7689 → 3-seed mean ~7292 (≈7e3) Updated to "DFA three-seed mean about 7×10^3 vs ~1.9×10^3 at d=256". The "even larger" claim still holds (4× larger), and now the comparison to d=256 is explicit and sourced. Both d=256 and d=512 values now point to the same protocol_audit JSONs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31.10: Appendix L drift values use 3-seed means (were s42)	YurenHao0426
	Appendix L claimed "per-block w2 relative displacement after 30 epochs averages 14.3× for SB+penalty, 18.6×±0.5 for DFA+penalty, and 19.3× for CB+penalty (three seeds each)" but the SB and CB values were actually s42 single-seed values (14.32 and 19.27) labeled as if they were 3-seed averages. DFA was correctly 3-seed. Re-aggregating from results/round38_{sb,cb}_penalty_30ep_s{42,123,456}/ results_cifar10.json drift fields: SB+pen w2: 14.32, 15.30, 14.68 → 14.77 ± 0.50 (was 14.3) CB+pen w2: 19.27, 19.63, 18.53 → 19.14 ± 0.56 (was 19.3) SB+pen embed: 7.10, 6.87, 6.88 → 6.95 ± 0.13 (was 7.1) CB+pen embed: 44.57, 47.27, 47.18 → 46.34 ± 1.53 (was 44.6) DFA+pen w2: 18.6 ± 0.5 ✓ (correct) DFA+pen embed: 94.6 ± 1.4 ✓ (correct) The mechanism conclusion is unchanged: CB's per-block drift is still ~30% larger than SB's, embed drift is still ~7× larger; DFA still has the largest embed updates of any method. CB and DFA still ≈9.3 pp below State Bridge in final accuracy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31.9: relabel "StudentNet" → "no-terminal-LN ResMLP"	YurenHao0426
	The §3 ¶3 / §5 ¶3 / Figure 5 / §7 mentions of "StudentNet" as a cross-architecture validation case were a misleading rebrand of the no-terminal-LN ResMLP-d256 ablation. Verified by tracing the data: results/protocol_audit/temporal_evolution_s{42,123,456}.json final_acc 0.332/0.313/0.336 (matches no-outln 3-seed 0.327±0.012) first_fire_a {18, 14, 25} first_fire_b None / None / None The actual synth StudentNet (results/snapshot_synth_v1, d=128 alpha=1.0) has max-per-block growth ~6.88 over 80 epochs and never reaches the 50× threshold, so diagnostic (a) does NOT fire on the real synth StudentNet at all. Calling the no-outln data "StudentNet" doubled- counted the same architecture under two names (the same-backbone causal control AND the cross-arch generalization test). Relabeled to "no-terminal-LN ResMLP" everywhere it appeared: - §3 ¶3 paragraph 1 cross-arch list - §3 ¶3 paragraph 2 (now with explicit per-seed first-fire epochs {18,14,25}) - §5 paragraph (the conclusion) - §7 conclusion (cross-arch list) - Figure 5 caption - Figure 5 row label (with re-rendered PDF) The remaining cross-arch generalization claim is now: ViT-Mini fires both diagnostics, ResMLP at d=256/d=512 fires both, no-terminal-LN ResMLP and BatchNorm CNN fire only (a) — three real architecture classes, with the no-LN ablation being the same-backbone control rather than a separate architecture. The cross-arch story is slightly weaker ("3 architecture classes" not "4") but truthful and self-consistent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31.8: Appendix I EP random-target ‖h_L‖ values from saved JSON	YurenHao0426
	Appendix I claimed EP random-target ‖h_L‖ "≈586 at 5 ep" and "≈2,085 at 100 ep" without a saved-JSON source. Re-measured on the saved checkpoints with consistent methodology (model.eval(), n=2048 test median), giving 557 (5 ep) and 2151 (100 ep). The ~5% discrepancy is likely model.train() vs model.eval() LN-batch-stats; the new values are reproducible. Saved results/ep_random_h_L_summary.json as the source of truth. The "26× smaller than DFA's 14,510 at 3 ep" comparison still holds (was "25×"; updated to "26×" with the new EP values). The fixed-feedback vs energy-based separation conclusion is unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31.7: Appendix H vanilla residual DFA endpoint values	YurenHao0426
	Make Appendix H consistent with §3 ¶1 (which v2.30.2 already updated to 3-seed means): vanilla residual DFA's endpoint ‖h_L‖ ≈ 5×10^8 and ‖g_L‖ ≈ 4×10^-10 (three-seed mean), not the s42 single-seed values. The s42 numbers are 4.39e8 and 4.86e-10, which were rounded as "≈4e8" and "≈5e-10" in the appendix. The 3-seed means are 5.18e8 and 3.76e-10, which round to "≈5e8" and "≈4e-10". Now §3 ¶1 (3-seed) and Appendix H (3-seed) report consistently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31.6: §3 ¶2 no-residual values use 3-seed means	YurenHao0426
	§3 ¶2 said "DFA still converges to ‖h_L‖≈1.06×10^8 and ‖g_L‖≈1.09×10^-10 at 100 epochs (Appendix H)". These were the s42 single-seed values silently used as if they were generic, even though Appendix H gives the full per-seed list {1.06e8, 3.15e7, 1.09e8}. 3-seed means from results/h2_no_residual_full_s{42,123,456}/snapshot_evolution_s*.json: ‖h_L‖ per seed: 1.06e8, 3.15e7, 1.09e8 → mean 8.21e7 (was 1.06e8) ‖g_L‖ per seed: 1.08e-10, 2.94e-10, 1.76e-10 → mean 1.93e-10 (was 1.09e-10) Updated to 8.2e7 and 1.9e-10 with explicit "across three seeds" framing. Both still well past the diagnostic-(b) floor; falsification conclusion unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31.5: §3 ¶3 ep-4 g_L from 3-seed (was single-seed g_2)	YurenHao0426
	Paper claimed "‖g_L‖ drops from 9.8×10^-4 at ep 0 to 6.7×10^-8 by ep 4 in the temporal replay across three seeds". The 9.8×10^-4 is the 3-seed mean of g_L at ep 0 (correct). But the 6.7×10^-8 was the s42 single-seed g_2 value (6.73×10^-8) at ep 4, not g_L and not 3-seed. The actual 3-seed g_L means from results/snapshot_evolution_v2/: ep 0: 9.83, 9.74, 9.87 × 10^-4 → mean 9.81 ≈ 9.8e-4 ✓ ep 4: 6.82, 6.37, 4.12 × 10^-8 → mean 5.77 ≈ 5.8e-8 (was 6.7e-8) Updated to 5.8×10^-8 with the per-seed values shown for transparency. The "fires within 11 epochs" actionable-early-stop conclusion is unchanged — all three seeds are well below the 1e-7 floor by ep 4. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31.4: §2 ¶3 EP per-block growth 11.6× → 6.6× (3-seed)	YurenHao0426
	Re-aggregating from results/protocol_audit/audit_table_s42_s123_s456.json: EP per-block max growth ratios per seed are 2.87, 10.96, 6.10 → 3-seed mean 6.64. Single-seed max is 10.96 ≈ 11.0, not 11.6. The 11.6× value in the prose was untraceable to any seed or aggregation; replaced with "three-seed mean max-per-block growth is only 6.6× (highest single-seed value 11.0×)" so both the average and the worst-seed are sourced. This keeps EP cleanly under the §6 ¶1 "below about 11×" threshold for healthy methods (max single-seed is 11.0, comfortably below the 50× diagnostic-(a) threshold), preserving the EP-as-internal-control story. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31.3: §2 ¶3 per-block growth values were architecture mix-up	YurenHao0426
	The paper claimed DFA/SB/CB had max-per-block growth of "237×, 12000×, 96×" on the 4-block d=256 ResMLP. Re-aggregating from the protocol audit JSON (results/protocol_audit/audit_table_s42_s123_s456.json) gives: DFA d=256: max growth 2043, 979, 2545 → 3-seed mean ~1856 (≈1.9e3) SB d=256: max growth 12781, 24126, 10467 → mean ~15791 (≈1.6e4) CB d=256: max growth 1820, 695, 1034 → mean ~1183 (≈1.2e3) The paper's "237" and "96" actually match the BatchNorm CNN audit (audit_cnn_3seed.json gives DFA 214/235/263 → mean 237 and CB 108/90/91 → mean 96), not the d=256 ResMLP. SB "12000" was close to ResMLP s42 single-seed (12781) but the other two values were apparently picked from the wrong architecture. This was an architecture mix-up that under-reported the d=256 ResMLP per-block growth by ~8x for DFA and ~12x for CB. Updated to the actual 3-seed mean values from the matched d=256 audit. The numbers are now an order of magnitude larger and more clearly "extreme" than the original mistaken values. The CNN per-block growth claim of "up to 237×" in §5 ¶3 (which says "the BatchNorm CNN ... shows strong growth under DFA, with max-per- block growth up to 237×") is correct — that 237 is the right value for the CNN context. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31.2: depth-scan layer-0 cos range [+0.39,+0.40] → [+0.38,+0.40]	YurenHao0426
	Per Table 5, DFA layer-0 cos at d=512 across L ∈ {2,4,6,8,12} is [+0.396, +0.400, +0.387, +0.377, +0.388]. The L=8 value of +0.377 falls below the +0.39 lower bound the prose was claiming. Updated both occurrences (§5 ¶3 main text and Appendix H paragraph) to the true range [+0.38, +0.40]. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31.1: deep cos +0.155 → +0.151 (ground-truth re-measurement)	YurenHao0426
	The paper had 3-seed penalized DFA deep cos at +0.155 ± 0.025, but re-measuring on the saved checkpoints gives 0.1506 (mean of per-seed deep means is 0.1507; pooled mean over 12 deep-layer values is 0.1506). The std of 0.025 matches pooled ddof=1 ✓. Same paragraph also had inconsistent values: "+0.155 ± 0.025" 3-seed above, then "+0.165" single-seed s42 in the lambda sweep. Unified to 3-seed throughout. §5 ¶2 lambda sweep updates: lam=1e-4 \|\|h_L\|\| 2.4e4 (s42 only) → 2.2e4 (3-seed mean) lam=1e-4 \|\|g_L\|\| 6.3e-7 (s42) → 7.0e-7 (3-seed) lam=1e-4 deep cos -0.022 (s42) → -0.020 (3-seed) lam=1e-2 deep cos +0.165 (s42) → +0.151 (3-seed, same as the three-seed value used elsewhere) Other places updated: §4 ¶4 prose, Table 2, Appendix J Table 9 DFA+pen mean row (deep cos +0.155 → +0.151 and \|\|h_L\|\|/\|\|g_L\|\| columns updated to 30-ep dfa_pen_short means rather than the round-19 single-seed numbers), Appendix L paragraph. Page layout preserved: 9 main pages, refs p10, 18 total, 0 overfull. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.31: matched 30-epoch BP/DFA controls (was unsourced 0.609/0.308)	YurenHao0426
	The §5 ¶3 BP-no-penalty value of 0.609 ± 0.004 and DFA-no-penalty value of 0.308 ± 0.014 turned out to be unsourced — they were carried over from a hardcoded comment in experiments/bp_with_penalty_control.py ("BP-trainable (3-seed mean): 0.609") that nobody had actually measured with a matched 30-epoch run. Ran the missing matched controls under the same recipe as BP+pen (lam=0, 30 epochs, AdamW 1e-3, wd 0.01, cosine schedule, batch 128, 3 seeds 42/123/456): BP no-pen 30ep: per-seed 0.5851, 0.5845, 0.5863 → 0.585 ± 0.001 (paper said 0.609 ± 0.004, off by 0.024) DFA no-pen 30ep: per-seed 0.3070, 0.2985, 0.2966 → 0.301 ± 0.005 (paper said 0.308 ± 0.014) Also re-grounded DFA+penalty 30ep using the dfa_pen_short 3-seed run (0.3593, 0.3610, 0.3604 → 0.360 ± 0.001), which is what the deep-cosine +0.155 figure was computed on. The paper had 0.363 ± 0.001 — that came from the 100-epoch run, not the 30-epoch run, so it was an apples-to- oranges comparison with BP+pen 30-ep. Paper changes (§5 ¶3): BP penalty cost: -8 pp → -5.5 pp DFA pen rescue: +5.5 → +5.9 pp DFA+pen margin vs frozen: +1.4 → +1.1 pp BP-to-DFA gap: 17 → 17.0 pp (unchanged) BP-to-SB gap: 7.7 → 7.7 pp (unchanged) BP-to-DFA gap is still the lower-bound credit-quality cost claim; 17 pp gap is unchanged in magnitude. Also updated: - §5 ¶1 prose: 0.363 → 0.360, 0.308 → 0.301 - §4 ¶4 prose: DFA+pen 0.363 → 0.360 - Appendix J Table 9 caption: 0.363 → 0.360, +9.0 → +9.3 pp gap to SB - Appendix L paragraph: +5.5 → +5.9 pp DFA penalty rescue - Figure 3 panel C bar values + title pen-cost annotation - New results/matched_30ep_control_summary.json as auditable record Page layout preserved: 9 main pages + refs p10, 18 total, 0 overfull. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.30.2: §3 ¶1 endpoint values use 3-seed means	YurenHao0426
	Paragraph framing says "3 seeds" but the endpoint values were actually s42-specific. Re-aggregated from results/snapshot_evolution_v2/snapshot_evolution_s{42,123,456}.json: \|\|h_L\|\| final per seed: 4.39e8, 3.86e8, 7.30e8 → mean 5.18e8 (paper now: 5e8, was 4e8) \|\|g_L\|\| final per seed: 4.86e-10, 3.76e-10, 2.67e-10 → mean 3.76e-10 (paper now: 4e-10, was 5e-10) The §5 ¶1 intervention paragraph still says "‖h_L‖ ~4.4e8" — that one is explicitly the s42 vanilla DFA endpoint that Figure 3 panel (a) plots, so it stays single-seed. §3 is the 3-seed version. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.30.1: ground-truth \|\|g_2\|\| values in §4 ¶1	YurenHao0426
	Fresh re-measurement on the saved early checkpoints (per_layer_cos_3seed.json) gives \|\|g_2\|\| at ep 1: s42: 6.79e-7 → paper rounds 6.8e-7 s123: 6.57e-7 → paper rounds 6.6e-7 s456: 3.85e-7 → paper rounds 3.8e-7 The previous prose values (6.7, 6.5, 3.9) were carried over from ad-hoc measurements with inconsistent rounding (3.9 was an error; 3.85 rounds to 3.8). All three values are still well above the 1e-7 diagnostic-(b) threshold, so the §4 ¶1 mode-2-without-mode-1 claim is unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.30: fix layer-0 cosine numbers + add per-seed appendix M	YurenHao0426
	Found a numerical error in §4 ¶3: the layer-0 vanilla DFA cosines were listed as +0.42, +0.45, +0.39 across seeds 42/123/456 but the actual re-measurement on the saved early-epoch checkpoints gives +0.421, +0.436, +0.418 (the s456 value was off by 0.03). The deep-mean numbers in Table 2 (-0.008 ± 0.013) were already correct. Changes: - §4 ¶3: layer-0 trio updated to +0.42, +0.44, +0.42 across seeds and cite now points to a new per-seed appendix. - New Appendix M (Layer-0 Dominance): 6-row table of per-seed per-layer cosines on vanilla DFA early checkpoints (3 seeds × ep 1, 2), with per-layer \|\|g\|\|. Documents the layer-0 dominance pattern that drives the headline aggregate Γ on these checkpoints. - results/vanilla_dfa_early_ckpts/per_layer_cos_3seed.json: machine- readable dump of all 6 measurements for future audit. - §7 compressed (~30 words trimmed across the closing paragraph) and Figure 3 width 0.92 → 0.82 to keep main content at exactly 9 pages after the appendix addition. Verified: 9 pages main + refs on p10, 18 total, 0 overfull boxes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	paper v2.29: add Scellier & Bengio 2017 EP citation	YurenHao0426
	§1 ¶1 referenced "equilibrium propagation" without a bibitem despite EP being the trustworthy non-BP control throughout the paper. Added the canonical Scellier & Bengio 2017 Frontiers in Computational Neuroscience reference and cited it where EP is first named in the FA-first intro. Main content stays at 9 pages (§7 closes mid-p9, refs start p10); 0 overfull boxes; 18 pages total. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	§6 polish: clarify Figure 5 (decision_utility) is in Appendix D, not main text	YurenHao0426
	After moving Figure 5 from §6 to Appendix D in v2.23, §6 ¶2 still said 'Figure~\ref{fig:decision_utility} makes the decision value explicit' which would render as 'Figure 5 makes...' but Figure 5 is now in the appendix. Reader on p8 looking for Figure 5 nearby would not find it. Added explicit '(Appendix~\ref{app:all_validations})' parenthetical right after the figure ref so the reader knows where to look. Audit of all other figure refs (Figures 1-4 in main text): - fig:audit_hero (Figure 1, §2) → refs in §1/§2 main text ✓ - fig:temporal_cross_arch (Figure 2, §5) → refs in §3/§5 ✓ - fig:penalty_rescue (Figure 3, §5) → refs in §4/§5/§7 ✓ - fig:cross_arch_summary (Figure 4, §5) → refs in §5/§6 ✓ All clean. Main content still 9 pages.
2026-04-08	Reviewer-concern batch: ρ formula + LN Jacobian derivation + diagnostic (c) ↵	YurenHao0426
	formula + threshold pointer + hyperparameter fairness clause Addressed 4 secondary reviewer concerns from the user's earlier list, all small inline additions: 1. §3 ¶1 LN Jacobian: extended the 1-line claim into a 2-line derivation. For y = LN(h) = (h-μ)/σ with σ ∝ \|\|h\|\|/√d, \|\|∂y/∂h\|\| = Θ(1/σ), so \|\|g_L\|\| = Θ(1/\|\|h_L\|\|). Connects the (a) growth and (b) collapse formally. 2. §4 ¶2 ρ formal definition: added the inline formula ρ_l = Pearson(<a_l, εv>, ℓ(h_l + εv) - ℓ(h_l)) over M=32 random unit-norm directions v with ε=1e-3, evaluated per sample on a fixed eval batch and averaged. Previously this was narrative-only. 3. §6 ¶3 diagnostic (c) cross-batch stability: added inline definition as the mean pairwise cosine of per-batch-averaged BP-grad direction at the chosen layer across K≥8 disjoint 128-sample minibatches, with the empirical separation (drift 0.5-0.99 vs healthy 0.05-0.18). 4. §6 ¶3 threshold sensitivity pointer: added (Appendix~\ref{app:threshold_sweep}) pointer next to the (a)/(b) calibration claim. 5. §2 ¶1 hyperparameter fairness: changed 'against the same architecture, optimizer, and training budget' to 'against the identical architecture, optimizer, schedule, and training budget without method-specific tuning' and added 'batch size 128'. Closes the 'fairness asserted but not evidenced' reviewer concern. Page budget: each addition ate ~1-2 lines. Net push was ~9 lines, which spilled §7 onto p10. Recovered by: - Shrinking Figure 3 (penalty rescue) from \linewidth to 0.92\linewidth - Shrinking Figure 4 (cross_arch_summary) from 0.78\linewidth to 0.7\linewidth - Compressing diagnostic (c) clause (kept the formula intent without all the LaTeX math symbols inline) - Trimming §7 closing sentence: dropped 'main lesson is to decompose' preamble; merged 'a reporting rule' phrase into the same sentence as the methodology-line citations Result: §1-§7 + all figures fit strictly in 9 pages (verified via pdftotext; p9 ends with §7 closing sentence + page number '9'; p10 starts with References). Total 18 pages, 0 overfull hbox. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Intro: add slower FA-first opening paragraph (user feedback)	YurenHao0426
	User flagged that intro entered the critique too fast without first explaining what feedback alignment is. Based on Bartunov et al. 2018's intro structure (their Section 1 opens with the weight-transport problem, introduces FA as the response, and only then motivates the evaluation question), rewrote §1 paragraph 1. New §1 ¶1: - BP is standard but biologically implausible (weight-transport problem) - FA (Lillicrap 2016) side-steps via random feedback - DFA (Nokland 2016) simplifies by direct projection per layer - Parallel lines: target propagation (Lee 2015), equilibrium propagation - Modern scaling: Launay 2020 (transformers), Akrout 2019 - Evaluation converged on accuracy + Gamma cosine summary §1 ¶2 (old ¶1) then starts the audit critique against this backdrop, so a reader who arrived without any FA context now has one paragraph of set-up before the critique begins. Page-budget side effect: the ~110-word addition pushed main content to 10 pages briefly. Recovered by shrinking Figure 4 (cross_arch_summary) from width=\linewidth to width=0.78\linewidth, which freed enough p9 vertical space for §7 to fit entirely on p9. Result: main content strictly 9 pages (§1-§7 on p1-p9, references and appendices on p10+). Total 18 pages. 0 overfull hbox. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Bib fix: Akrout first name 'Mohamad' -> 'Mohamed' (flagged by user's ↵	YurenHao0426
	Semantic Scholar check)
2026-04-08	Bib proactive verification: Xiong 2nd author fix + expand 3 'et al.' entries	YurenHao0426
	While user runs Semantic Scholar verification, I WebSearch-verified the 4 citations I flagged as 'never independently checked' and found one real bug plus opportunity to expand others: - Xiong 2020: second author was 'Yunchang Yu' in my bibitem, but the actual author is 'Yunchang YANG'. Fixed. Also expanded to the full 10-author list. - Paleka 2026: expanded 'Daniel Paleka et al.' -> 'Daniel Paleka, Shashwat Goel, Jonas Geiping, and Florian Tramèr'. Title/venue confirmed correct. - O'Bray 2022: expanded 'Leslie O'Bray et al.' -> 'Leslie O'Bray, Max Horn, Bastian Rieck, and Karsten M. Borgwardt'. Title/venue confirmed correct. - Jordan 2020: expanded 'Scott M. Jordan et al.' -> 'Scott Jordan, Yash Chandak, Daniel Cohen, Mengxue Zhang, and Philip Thomas'. Also dropped incorrect middle initial 'M.' Title/venue confirmed correct. All 4 citations now have full verified author lists. The Yang/Yu typo was a real factual error that Semantic Scholar would have caught. Main content still 9 pages. Task list unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	SB/CB probe reframe + compression + Figure 5 to appendix (user-approved)	YurenHao0426
	User pushed back on SB/CB being treated as 'audited FA methods' because they're our own constructions. Reframe them as diagnostic probes built on two prior-literature assumptions (state=credit and credit=performance). §1 intro: add 1 sentence clarifying BP/EP/DFA are established baselines and SB/CB are probes constructed in this paper. §2 ¶2 new opening paragraph (before 'By the field's usual criteria'): - SB/CB are probes, not prior FA variants - Each directly learns a target from a prior-literature view - SB: target-propagation view (Bengio 2014, Lee 2015) — auxiliary G_ψ(h_l,t_l,s) predicts h_L via MSE; a_l^SB = ∇_{h_l} CE(W_out LN(G_ψ(h_l,t_l,s)), y) - CB: synthetic-gradient view (Jaderberg 2017) — auxiliary V_φ(h_l,t_l,s) trained via bridge residual; a_l^CB = ∇_{h_l} V_φ(h_l,t_l,s) - Both auxiliaries trained on detached hidden states - Role: populate different points in the (angular alignment, functional usefulness) plane, making the §4 cos-vs-acc dissociation visible Bibliography: added Bengio 2014 (arXiv 1407.7906), Lee et al. 2015 (ECML PKDD), Jaderberg et al. 2017 (ICML) — all verified via WebSearch. Page budget: the ~180-word §2 addition pushed §7 onto p10. Recovered space by: (a) compressing §2 ¶1 opening (b) compressing §3 ¶2 falsification chain (tighter number formatting) (c) compressing §6 ¶3 asymmetry paragraph (d) merging §7 into a single paragraph (was 3) (e) moving Figure 5 (decision_utility) from §6 main text to a floated appendix figure in Appendix D (the 'all seven validations' appendix, which is conceptually related). The decision-utility ablation's headline ('accuracy+Γ walks back 0/5, full protocol walks back 3/5') is already in §6 prose so the figure functions as supporting backup. Result: main content is strictly 9 pages (§1-§7 on p1-p9). References and appendices on p10+. Total 18 pages, 0 overfull hbox. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Appendix L: upgrade DFA+pen nudging/drift/trajectory to 3-seed (matches SB/CB)	YurenHao0426
	Ran DFA+pen via cifar_resmlp.py on s123, s456. 3-seed: - acc 0.3601 ± 0.0014 (matches existing 0.363 from round 11) - deep cos +0.1518 ± 0.0110 (matches existing +0.155 ± 0.025) - deep nudge -5e-5 (was single-seed -6e-5) - train Δloss 0.095 ± 0.007 (was single-seed 0.104) - w2 drift 18.6 ± 0.5 (matches single-seed 18.8) - embed drift 94.6 ± 1.4 (matches single-seed 92.7) Single-seed s42 was within noise of the 3-seed mean across all 4 functional metrics, so the cos-vs-acc dissociation story is unchanged — Appendix L now just reports 3-seed for all 3 methods consistently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Figures 3 and 4: fix aspect ratio (fig3 was squeezed strip) and key-finding ↵	YurenHao0426
	label overlap (fig4) Per user feedback: - fig4_penalty_rescue.pdf (Figure 3 in paper): was figsize=(13, 3.5), aspect 3.7:1, which rendered as a thin strip with squeezed subplot content. Increased height to figsize=(13, 6.0), aspect 2.2:1. Much taller panels that actually show axis labels and legends readably. - fig5_cross_arch_summary.pdf (Figure 4 in paper): the 'Key finding' italic text annotation at y=-1.0 in axes transform was overlapping with the multiline architecture y-tick labels at the bottom of the second subplot. Moved to y=-1.55 and increased figsize height from 3.5 to 4.2 so the lower annotation still fits in bbox_inches='tight' crop. - Also bumped includegraphics width from 0.92\linewidth to \linewidth for both figures so they use the full text width. Main content still exactly 9 pages within E&D budget. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Tables 1/2/3: revert to tabular + wrap in resizebox (per user feedback)	YurenHao0426
	Previous tabularx conversion caused visual problems: - Table 1: 'lcccc' inside tabularx{\linewidth} did not stretch, leaving a blank column on the right side of the page - Table 2: 'p{0.18}LLL' forced originally-single-line cells to wrap into multiple lines (e.g., 'Vanilla DFA, early epoch' and 'cos_deep=... rho_deep=...' split across 2 lines each) - Table 3: 'p{0.06}L p{0.16} p{0.22}' similarly compressed single-line rows Fix: revert all 3 tables to original plain 'tabular' (Table 1 lcccc, Table 2 lccc, Table 3 llll) and wrap each in \resizebox{\linewidth}{!}{...}. This: - Stretches Table 1 to full text width (no blank right column) - Shrinks Table 2's originally-wide content uniformly so all rows stay single-line - Shrinks Table 3 similarly so (a)/(b)/(c)/(d) rows are single-line Tables 4-9 (appendix) keep their tabularx treatment since they fit cleanly. Result: 0 overfull hbox, main content still exactly 9 pages, first 4 references now fit on p9 as well. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Polish: tighten §7 conclusion so it fully fits on page 9 (was spilling 2 ↵	YurenHao0426
	lines to p10) The two final sentences both started with 'Once...' and were redundant. Cut the last sentence entirely and tightened the preceding phrase from 'Once the field enforces that separation...' to just stating the claim. Before: 'That is the sense in which this paper fits the evaluation-methodology line...: the contribution is not a new benchmark artifact, but a reporting rule for preventing a repeatable interpretive error. Once the field enforces that separation between measurement validity and substantive credit quality, positive results will become more trustworthy, negative results more precise, and the apparent evidence for successful deep credit assignment much harder to overstate.' After: 'That is the sense in which this paper fits the evaluation-methodology line...: the contribution is a reporting rule for preventing a repeatable interpretive error, not a new benchmark artifact.' §7 now fully fits on p9. Main content exactly 9 pages within E&D budget.
2026-04-08	Polish: convert all 9 tables to tabularx for robust \linewidth fitting	YurenHao0426
	User flagged Table 2 (mode validation) as overflowing. Root cause: the 'Deep-layer alignment signal' column had long multi-term cosine+rho expressions under plain 'lccc' column spec with no width constraint. Fix: - Added \usepackage{tabularx} and a raggedright L column type - Converted all 9 tables from tabular to tabularx{\linewidth}{...} - Table 1 (main audit): plain lcccc inside tabularx, fits width - Table 2 (mode validation): first column p{0.18\linewidth}, three wrapping L columns - Table 3 (protocol def): tight left p{}, wrapping L measurement column, two right p{} columns - Table 4 (all validations): p{0.18\linewidth} + three L columns - Tables 5-9 (appendices, numeric): @{\extracolsep{\fill}} with existing lrrr... specs Also shortened 'DFA+pen mean (3 seeds)' label to 'DFA+pen mean' in Appendix L table to eliminate a 19.5pt overfull on that row. Result: 0 overfull hbox warnings (was several), main content still 9 pages exactly within E&D budget, total 17 pages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Round 41 complete: 3-method cos-vs-acc dissociation with DFA+pen added	YurenHao0426
	DFA+penalty single seed s42, 30ep via cifar_resmlp.py (not the earlier dfa_residual_penalty_test.py which doesn't save nudging): - test acc: 0.3607 (matches existing 3-seed 0.363±0.001) - deep cos: +0.166 (matches existing 3-seed 0.155±0.025) - deep nudge Δloss (eta=0.01): -6e-5 (smallest) - trajectory loss decrease: 0.104 (smallest) Full 3-method comparison at 30 epochs: DFA+pen SB+pen CB+pen test acc 0.361 0.453 0.360 deep cos +0.166 +0.322 +0.684 deep nudge -6e-5 -1.78e-3 -0.45e-3 traj Δloss 0.104 0.458 0.122 KEY INSIGHT: Deep cosine ranks methods CB > SB > DFA, but ALL functional metrics (nudge, trajectory loss decrease, accuracy) rank them SB >> CB ≈ DFA. Cos is the ONLY ordering that does not predict accuracy correctly. This is the strongest form of the cos-vs-acc dissociation: the ordering implied by angular agreement is contradicted by three independent functional measurements, all of which do predict accuracy. Appendix L ¶2 updated to report all 3 methods and note the ranking contradiction. Main content still 9 pages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Round 41 (Appendix L): add 4th piece of cos-vs-acc evidence - trajectory ↵	YurenHao0426
	loss decrease SB+penalty train loss: 2.047 -> 1.589 (Δ=0.458 over 30 epochs) CB+penalty train loss: 1.996 -> 1.874 (Δ=0.122 over 30 epochs) Ratio: 3.8x (matching the 4x nudging ratio) This is the third independent functional measurement, from per-epoch logs in the same round 38 JSONs (log['train_loss']). The cos-vs-acc dissociation now has four independent pieces of evidence, all agreeing: 1. Test accuracy: CB 0.360 vs SB 0.453 (9.3pp gap) 2. Parameter drift: CB w2 19.3x vs SB 14.3x (CB larger updates) 3. Single-step nudging Δloss: CB -0.45e-3 vs SB -1.78e-3 (4x gap, eta=0.01) 4. Trajectory loss decrease: CB 0.122 vs SB 0.458 (3.8x gap over 30 epochs) All four inversely correlated with deep cosine (CB has higher deep cos). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Round 41 (Appendix L): add nudging-test functional evidence for cos-vs-acc ↵	YurenHao0426
	dissociation The nudging test values were already in the round 38 JSON under diag['nudging']['0.01'] but never used. Extracted and added to Appendix L: SB+penalty: deep nudge delta = -1.78e-3 (per-layer, eta=0.01) CB+penalty: deep nudge delta = -0.45e-3 (per-layer, eta=0.01) A single step of size eta=0.01 in each method's credit direction decreases the test loss by 1.78e-3 (SB) vs 0.45e-3 (CB) — a 4x gap in functional loss decrease that EXACTLY INVERTS the 4x deep-cosine gap between the methods. This is the direct functional measurement for the 'angular agreement is not sufficient' claim. Combined with the drift diagnostic (larger CB updates), the cos-vs-acc mechanism hypothesis now has THREE independent pieces of support: 1. Test accuracy (headline: CB same as DFA, SB higher) 2. Parameter drift (CB larger updates than SB) 3. Nudging functional loss decrease (CB 4x smaller than SB) Zero new compute — all from existing round 38 JSON data. Main content still 9 pages exactly within E&D budget. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Round 41 (Appendix L): add per-block drift diagnostic reinforcing cos-vs-acc ↵	YurenHao0426
	hypothesis Extracted from existing round 38 JSON data without running new compute. The drift field (\|\|W_final - W_init\|\|_F / \|\|W_init\|\|_F) is produced by cifar_resmlp.py's feature_drift() and was already saved but not used in the paper. Key finding: CB+penalty has LARGER block updates than SB+penalty (per-block w2 drift 19.3x vs 14.3x; embed drift 44.6x vs 7.1x) yet 9.3 pp LOWER accuracy. This rules out 'CB just has smaller updates' as an alternative explanation for the cos-vs-acc dissociation. Added 2 sentences to Appendix L paragraph 2 noting this supporting evidence for the 'angular agreement does not certify functional forward-state content' mechanism hypothesis in §4. Main content still 9 pages exactly within E&D budget. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Polish: spell out 'Equilibrium Propagation (EP)' on first discussion in §2	YurenHao0426
	The paper uses 'EP' throughout but never spelled it out in §1 or §2. Added first-mention spell-out with a brief 1-line description ('a contrastive energy-based alternative to BP that updates weights from the difference between a free-phase and a nudged-phase hidden trajectory') so the reader has context before EP is used as a key internal comparison. Main content still 9 pages.
2026-04-08	Polish: Appendix L title + intro mention both SB and CB (was SB-only)	YurenHao0426
	Appendix L title was 'State Bridge Penalty Rescue: 3-Seed Cross-Method Test' but the table has both SB and CB rows. Updated to: 'State Bridge and Credit Bridge Penalty Rescue: 3-Seed Cross-Method Test'. Intro sentence updated to mention re-running both SB and CB, and to note both baselines were matched.
2026-04-08	Polish: remove 5 stale 'TODO: re-render figure' comments	YurenHao0426
	All 5 figures are rendered and current; the TODO comments were leftover notes from when figures were being generated. No functional change — just cleanup of invisible LaTeX source comments.
2026-04-08	Bib fix: Moskovitz + Crafton + Refinetti titles/venues corrected via ↵	YurenHao0426
	WebSearch verification Three more bibitems had hallucinated or wrong fields; verified correct versions against openreview/arxiv/proceedings URLs. - Moskovitz et al. 2018: was 'In NeurIPS, 2018' -> arXiv preprint 1812.06488 (paper is arxiv-only, never published at NeurIPS). First name Ted -> Theodore. - Crafton et al. 2019: title was 'Backpropagation through feedback alignment for deep learning in analog hardware' -> 'Direct feedback alignment with sparse connections for local learning'. Venue was 'ICASSP' -> 'Frontiers in Neuroscience, 13:525'. Third author Eric -> Evan Gebhardt. - Refinetti et al.: year was 2023 -> 2021, 4th author Krzakala -> Goldt, title was 'Aligning residual pathways: normalization, scale, and feedback in deep networks' -> 'Align, then memorise: the dynamics of learning with feedback alignment'. Venue ICML (correct year 2021, not 2023). All 12 bibitems now verified. Running total: - Lillicrap 2016, Nokland 2016, Bartunov 2018, Launay 2020 verified via WebSearch this round (were already correct). - Xiong 2020 bib sort key cleaned up earlier. - Akrout 2019 title fixed earlier (Deep feedback control -> Deep learning without weight transport). - Paleka 2026, O'Bray 2022, Jordan 2020 titles fixed earlier. - Moskovitz + Crafton + Refinetti fixed this commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Bib fix: Akrout 2019 title 'Deep feedback control' -> 'Deep learning without ↵	YurenHao0426
	weight transport'
2026-04-08	Bib fix: correct titles for 3 E&D model papers (Paleka/O'Bray/Jordan)	YurenHao0426
	Previous bibitems had paraphrased/invented titles for the 3 E&D-methodology exemplar papers cited in §1 and §7. The correct titles are: - Paleka et al. ICLR 2026: 'Pitfalls in Evaluating Language Model Forecasters' (not 'Pitfalls in evaluating model behavior: measurement, reporting, and interpretability failures') - O'Bray et al. ICLR 2022: 'Evaluation Metrics for Graph Generative Models: Problems, Pitfalls, and Practical Solutions' (not 'Evaluation beyond leaderboard metrics: methodology matters') - Jordan et al. ICML 2020: 'Evaluating the Performance of Reinforcement Learning Algorithms' (not 'Evaluating machine learning: tests, cases, and expectations'). Also corrected first author 'Matt' -> 'Scott M.' Verified against codex round 23 memory which recorded the correct titles from the OpenReview/ICML URLs. Previous bibitems were hallucinated titles from earlier rounds and would have been a factual bug in the bibliography. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Bib cleanup: fix Xiong 2020 bibitem sort key (fabricated names removed)	YurenHao0426

2026-04-08	§3 fix: correctly distinguish DFA/SB/CB local credit vectors	YurenHao0426
	Previous §3 ¶1 wrote the local loss as -<f_l, B_l^T e_T> as if it applied to DFA, SB, and CB all three. But that's only DFA's form. SB and CB use learned bridge networks to derive credit: - DFA: a_l = B_l^T e_T (fixed random projection) - State Bridge: a_l = gradient of CE(head(LN(G_psi(h_l, t_l, s))), y) where G_psi is a learned state predictor of h_L - Credit Bridge: a_l = gradient of learned value net V(h_l, t_l, s) The fix correctly writes the shared local loss form -<f_l, a_l> and defines a_l for each method in-line. This also serves as the first definition of SB and CB in the paper (previously they were named in Table 1 without being defined). Main content still ends at p9 (just slightly before the bottom margin now); references span p9-p10 but are not counted against the 9-page content budget. Total 17 pages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Polish: fix §3 typo 'to an meaningful' -> 'to a meaningful'	YurenHao0426

2026-04-08	OPTION F polish pass: update abstract, §1 ¶2, §6 ¶3, §7 ¶2 with round ↵	YurenHao0426
	38-40 findings Codex round 40 sequencing: A then F. OPTION F (polish) pass found 4 real issues: 1. Abstract: added narrow conditional-positive finding sentence about SB+penalty beating the shallow baseline + cos-vs-acc dissociation 2. §1 ¶2: replaced 'deep cosine can improve to about +0.16' (DFA-specific) with a fuller cross-method statement mentioning SB +0.32, CB +0.68, and the cos-vs-acc dissociation 3. §6 ¶3 (protocol asymmetry): added one sentence noting that the cross-method cos-vs-acc dissociation REINFORCES the necessity of keeping all four diagnostics separate 4. §7 ¶2 (limits): upgraded 'terminal-LN interpretation is observational rather than causal identification' to 'established causally on the audited residual ResMLP via the matched same-backbone no-terminal-LN control but not proven to extend beyond that architecture family' — reflects round 36 wording upgrade based on existing April 7 no_outln data All four changes are prose-level updates driven by data that was already in the paper. Main content still fits at 9 pages exactly (E&D limit). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Round 40 §4 update: cos-vs-acc 3-part proposition (Observation / Inference ↵	YurenHao0426
	/ Mechanism hypothesis) Codex round 40 recommended turning the Mode 2 dissociation from empirical curiosity into a methodological claim via a 3-part proposition: 1. Observation: CB+pen and DFA+pen reach same acc despite 4x deep-cos gap; SB+pen best acc with intermediate cos 2. Inference: layerwise BP-cosine is NECESSARY to rule out grossly wrong credit signals but NOT SUFFICIENT to certify usable credit for depth 3. Mechanism hypothesis: usefulness depends on whether local updates induce coordinated forward-state change across blocks, not just angular agreement with BP Method framing (codex-approved safer versions): - CB = 'gradient-direction surrogate' (high angular agreement, low functional credit) - SB = 'state-level downstream teaching signal' (lower angular agreement, higher functional credit) - Explicitly framed as HYPOTHESIS not theorem Main content still 9 pages exactly (within E&D limit). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Round 38 CB+penalty multi-seed: confirm clean cos-vs-acc dissociation across ↵	YurenHao0426
	3 fixed-feedback methods CB+penalty 3-seed (4-block d=256, 30ep, lam=1e-2): - acc 0.360±0.003 (same as DFA+pen, 9pp below SB+pen) - \|\|h_L\|\|=5680±178, \|\|g_L\|\|=1.9e-5 (HEALTHY) - layer-0 cos +0.652±0.005 - deep cos +0.679±0.008 (4x DFA+pen, 2x SB+pen) - deep rho +0.464±0.025 (6x DFA+pen) Final 3-method rescue comparison: DFA+pen: acc 0.363, deep cos 0.155, deep rho 0.080 SB+pen: acc 0.453, deep cos 0.322, deep rho 0.402 CB+pen: acc 0.360, deep cos 0.679, deep rho 0.464 Clean cos-vs-acc dissociation: - CB has 4x higher deep cos than DFA but SAME accuracy - SB has intermediate deep cos but HIGHEST accuracy - Alignment to BP gradient is NECESSARY but NOT SUFFICIENT for usable credit Paper updates: - §4 ¶4: now includes all 3 methods with numbers, adds 'cos is necessary but not sufficient' framing based on 3-method dissociation - Appendix K: adds CB+pen 3-seed rows + vanilla CB baseline for comparison - Main content still 9 pages exactly (within E&D limit) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Appendix G + J: 3-seed depth scan L=4 validation + SB/CB random_targets ↵	YurenHao0426
	100ep convergence - Appendix G: depth scan L=4 now 3-seed (s42, s123, s456). 3-seed DFA layer-0 cos +0.412±0.011, deep cos -0.0004±0.0008, CB deep cos +0.039±0.010. Indistinguishable from single-seed row shown in table. - Appendix J: full 100-epoch random_targets trajectory for SB and CB: SB: \|\|h_L\|\|=3.6e5, \|\|g_L\|\|=4e-8 (at floor), acc 0.100 (chance) CB: \|\|h_L\|\|=1.38e8, \|\|g_L\|\|=0 (collapsed), acc 0.085 (chance) Both reach Mode 1 (a)+(b) at 100ep, consistent with DFA's 1.67e8 / 8e-12. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Appendix J: add EP random_targets full 100ep convergence \|\|h_L\|\|=2085, acc=0.081	YurenHao0426

2026-04-08	Round 38 paper update: §4 + §5 + new Appendix K with SB+penalty 3-seed result	YurenHao0426
	CODEX ROUND 39 VERDICT: PAPER-CHANGING for SB; wait for CB multi-seed for CB claims. Round 38 3-seed SB+penalty (4-block d=256, 30ep, lam=1e-2): - acc 0.453±0.003 (BEATS shallow baseline 0.349 by +10.4pp -- FIRST non-BP method) - \|\|h_L\|\|=302±8 (contained, not silenced) - \|\|g_L\|\|=1.8e-4 (HEALTHY) - deep cos +0.322±0.007 (2x DFA+pen +0.155) - deep rho +0.402±0.015 (5x DFA+pen +0.080) Penalty rescue magnitudes (method-dependent): - DFA: +5.5 pp (0.306 -> 0.363) - SB: +24 pp (0.213 -> 0.453) - CB: +15 pp (single seed, multi-seed in flight) - BP: -8 pp (capacity cost, 0.609 -> 0.530) Paper updates: - §4 ¶4 NEW: Mode 2 has method-dependent severity within fixed-feedback family; SB+penalty is the first audited non-BP method to substantively use deep blocks via intervention; deep cos doesn't predict acc across methods (methodological obs) - §5 ¶3 EXTENDED: BP+penalty -> 3x penalty control (BP, DFA, SB) with all margins vs frozen-blocks baseline; BP-to-SB gap only 7.7 pp vs BP-to-DFA gap 17 pp - Appendix K NEW: full SB+penalty 3-seed table with vanilla SB and DFA+pen comparison Main content stays at 9 pages exactly (within E&D limit). Total 16 pages. CB multi-seed (s123, s456) launched in parallel (PIDs 576938, 576939) — claims deferred until those land. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Appendix H: H2 ablation now 3-seed (s42, s123, s456) — multi-seed ↵	YurenHao0426
	validates falsification 3-seed mean: \|\|h_L\|\|=8.2e7, \|\|g_L\|\|=1.9e-10 Per-seed: \|\|h_L\|\| in {1.06e8, 3.15e7, 1.09e8}, \|\|g_L\|\| in {1.08, 2.94, 1.77}e-10 All deeply below the (b) floor, all confirm Mode 1 (a)+(b) fire on no-residual ResMLP+terminal-LN. Multi-seed H2 falsification of 'residual skip causes Mode 1' is now robust. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Round 38: add --penalty_lam flag to cifar_resmlp.py for Mode 2 cross-method test	YurenHao0426
	Patches: - main(): add --penalty_lam (separate from CB's bridge temperature args.lam) - train_dfa block update (line 195): add penalty_lam * (f_l**2).sum(-1).mean() - train_state_bridge block update (line 326): same penalty - train_credit_bridge block update (line 533): same penalty Codex round 38 GO STAGE: keep penalty separate from CB lam, blocks-only, sanity-check that hidden_norms remain nontrivial (not silencing the blocks). 2-epoch smoke (results/round38_smoke_sbcb_pen) passes the silencing check: SB \|\|h_L\|\|=229, CB \|\|h_L\|\|=1258, both nontrivial. Deep cosines positive across all layers for SB ([0.28, 0.25, 0.23]) and rising for CB ([0.04, 0.08, 0.13, 0.15]). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08	Fix bibtex citation key: refinetti2023align -> refinetti2023aligning ↵	YurenHao0426
	(matches bibitem)