| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 19:57:52 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-04-08 19:57:52 -0500 |
| commit | 1b085a17237dfa8a1df49c0005ba26d7ba41ebaa (patch) | |
| tree | 5a1693d2592d291e7ade843449e02d1e42014fe5 /results/gelu_ablation/relu_dfa_s42.json | |
| parent | 9ebaa25377996c8ad437856d68f515b6d0d64a36 (diff) | |
paper v2.34: §4 ¶4 mechanism hypothesis adds Mode 2 → Mode 1 causal chain
User flagged that Mode 1 may itself be a downstream consequence of Mode 2
(rather than a parallel failure mode), and asked for this to be added to
the §4 mechanism hypothesis.
The causal chain:
1. Local credit signal a_l has poor functional usefulness (Mode 2)
2. Optimizer cannot drive useful per-block forward-state change
3. The only easy way to increase <f_l, a_l> is to inflate ‖f_l‖ along
the cheap random direction set by a_l (Mode 1(a) growth)
4. Inflated residual stream → terminal LN gradient cancellation
(Mode 1(b) collapse)
5. Per-block penalty caps ‖f_l‖, breaking the chain at step 3 without
fixing the underlying credit quality
→ Explains why penalty alleviates Mode 1 fully but Mode 2 only partially
This is more parsimonious than "two parallel failure modes" and is
consistent with the observed asymmetry that the penalty rescues Mode 1
without fully fixing Mode 2 (deep cos +0.151 vs BP's ≈1.0).
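A minimal numeric sketch of steps 3 and 5 of the chain, not the paper's code.
It assumes f_l and a_l are plain vectors, uses a hard norm cap as a stand-in
for the per-block penalty, and all helper names (inflate_along, cap_norm) are
hypothetical. It shows that pushing f_l along a_l raises <f_l, a_l> by exactly
alpha * ‖a_l‖ regardless of how useful a_l is, and that capping ‖f_l‖ bounds
the inner product without changing the direction a_l supplies.

```python
import math
import random


def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def norm(u):
    """Euclidean norm of u."""
    return math.sqrt(dot(u, u))


def inflate_along(f, a, alpha):
    """Move f by alpha units along the unit vector a / ‖a‖ (step 3)."""
    a_norm = norm(a)
    return [fi + alpha * ai / a_norm for fi, ai in zip(f, a)]


def cap_norm(f, max_norm):
    """Rescale f so ‖f‖ <= max_norm; a hard-cap stand-in for the penalty (step 5)."""
    n = norm(f)
    if n <= max_norm:
        return list(f)
    return [fi * max_norm / n for fi in f]


rng = random.Random(0)
dim = 64
alpha = 10.0
f = [rng.gauss(0, 1) for _ in range(dim)]  # block output f_l (assumed random)
a = [rng.gauss(0, 1) for _ in range(dim)]  # local credit signal a_l (assumed random)

inflated = inflate_along(f, a, alpha)
capped = cap_norm(inflated, max_norm=norm(f))

print("gain in <f, a> from inflation:", dot(inflated, a) - dot(f, a))
print("norm before / after inflation:", norm(f), norm(inflated))
print("norm after cap:", norm(capped))
```

The cap limits how large <f_l, a_l> can become (by Cauchy-Schwarz it cannot
exceed ‖f_l‖ ‖a_l‖), but the direction a_l is untouched, which matches the
claim that the penalty breaks the chain at step 3 without improving credit
quality.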
§4 ¶4 mechanism hypothesis section now contains:
- Original CB/SB descriptions (gradient-direction surrogate vs state-level
teaching signal)
- NEW: Mode 2 → Mode 1 downstream-symptom hypothesis with the explicit
causal chain
- Hypothesis caveat (we have measured angle-to-accuracy + functional
proxies but not full forward-state-change content)
Page-budget compensation:
- §4 ¶4 setup recital compressed (combined SB/CB/DFA into one sentence)
- §4 ¶4 functional measurements paragraph compressed (used \emph{Nudging:}
/ \emph{Training-loss trajectory:} structure)
- §7 ¶1 closing compressed (merged the redundant no-terminal-LN ablation
+ BatchNorm CNN sentences)
Page layout: §1-§7 still 9 pages exactly (§7 ends p9 line 358, refs p10
line 359). Total 19 pages (was 18) — one extra appendix page from the
v2.33 per-seed nudging values, unchanged here. 9-page main content
budget preserved.
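This responds to user message: "The main logical risk is that Mode 1 may
be a downstream consequence of Mode 2 (rather than a parallel failure
mode)... Suggested priority: move the nudging test into the main text and
present the Mode 2→Mode 1 causal chain as the mechanism hypothesis"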
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diffstat (limited to 'results/gelu_ablation/relu_dfa_s42.json')
0 files changed, 0 insertions, 0 deletions
