From 1b085a17237dfa8a1df49c0005ba26d7ba41ebaa Mon Sep 17 00:00:00 2001
From: YurenHao0426
Date: Wed, 8 Apr 2026 19:57:52 -0500
Subject: paper v2.34: §4 ¶4 mechanism hypothesis adds Mode 2 → Mode 1 causal chain
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

User flagged that Mode 1 may itself be a downstream consequence of
Mode 2 (rather than a parallel failure mode), and asked for this to be
added to the §4 mechanism hypothesis.

The causal chain:

1. Local credit signal a_l has poor functional usefulness (Mode 2)
2. Optimizer cannot drive useful per-block forward-state change
3. The only easy way to increase is to inflate ‖f_l‖ along the cheap
   random direction set by a_l (Mode 1(a) growth)
4. Inflated residual stream → terminal LN gradient cancellation
   (Mode 1(b) collapse)
5. Per-block penalty caps ‖f_l‖, breaking the chain at step 3 without
   fixing the underlying credit quality
   → Explains why penalty alleviates Mode 1 fully but Mode 2 only
     partially

This is more parsimonious than "two parallel failure modes" and is
consistent with the observed asymmetry that the penalty rescues Mode 1
without fully fixing Mode 2 (deep cos +0.151 vs BP's ≈1.0).

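Illustrative sketch (not the paper's code): step 5 of the chain above
can be pictured with a hinge-style penalty on each block's output norm.
The penalty form, the names block_norm_penalty, cap, and weight, and
the toy vectors are all assumptions made for illustration; this commit
does not specify the actual penalty. The sketch shows that scaling f_l
along a fixed direction raises the penalty while leaving its cosine to
a reference direction unchanged, which is one way to read "caps ‖f_l‖
without fixing the underlying credit quality".

```python
import math


def l2_norm(v):
    """Euclidean norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = l2_norm(u), l2_norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def block_norm_penalty(block_outputs, cap=1.0, weight=0.1):
    """Hypothetical per-block penalty: weight * max(0, ||f_l|| - cap)^2,
    summed over blocks. Blocks whose norm is under the cap cost nothing."""
    total = 0.0
    for f_l in block_outputs:
        excess = max(0.0, l2_norm(f_l) - cap)
        total += weight * excess ** 2
    return total


# Toy example (assumed values): inflate one block output along a fixed
# direction and compare it against a reference direction.
reference = [2.0, 1.0]   # stand-in for a "useful" update direction
f_small = [1.0, 2.0]     # block output before inflation
f_large = [10.0, 20.0]   # same direction, 10x the norm

penalty_small = block_norm_penalty([f_small], cap=3.0)
penalty_large = block_norm_penalty([f_large], cap=3.0)
cos_small = cosine(f_small, reference)
cos_large = cosine(f_large, reference)
```

Here penalty_large exceeds penalty_small while cos_small and cos_large
agree up to rounding, so the penalty discourages norm growth but gives
no signal about direction.
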
§4 ¶4 mechanism hypothesis section now contains:
- Original CB/SB descriptions (gradient-direction surrogate vs
  state-level teaching signal)
- NEW: Mode 2 → Mode 1 downstream-symptom hypothesis with the explicit
  causal chain
- Hypothesis caveat (we have measured angle-to-accuracy + functional
  proxies but not full forward-state-change content)

Page-budget compensation:
- §4 ¶4 setup recital compressed (combined SB/CB/DFA into one sentence)
- §4 ¶4 functional measurements paragraph compressed (used
  \emph{Nudging:} / \emph{Training-loss trajectory:} structure)
- §7 ¶1 closing compressed (merged the redundant no-terminal-LN
  ablation + BatchNorm CNN sentences)

Page layout: §1-§7 still 9 pages exactly (§7 ends p9 line 358, refs
p10 line 359). Total 19 pages (was 18); the one extra appendix page
comes from the v2.33 per-seed nudging values and is unchanged here.
9-page main content budget preserved.

This responds to user message: "The main logical risk is that Mode 1
may be a downstream consequence of Mode 2 (rather than a parallel
failure mode)... Suggested priorities: move the nudging test into the
main text, and present the Mode 2→Mode 1 causal chain as the mechanism
hypothesis"

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 paper/main.pdf | Bin 502005 -> 502309 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)

(limited to 'paper/main.pdf')

diff --git a/paper/main.pdf b/paper/main.pdf
index 033851c..ea62a49 100644
Binary files a/paper/main.pdf and b/paper/main.pdf differ
-- 
cgit v1.2.3