| author | YurenHao0426 <Blackhao0426@gmail.com> | 2026-03-25 08:22:04 -0500 |
|---|---|---|
| committer | YurenHao0426 <Blackhao0426@gmail.com> | 2026-03-25 08:22:04 -0500 |
| commit | 7e01fbc0ce871857c1e1879ed0d3559e8bfae7c7 (patch) | |
| tree | 4f0da6c6362b8ebe8109fe9a40ed28e2d7759595 /report_explore | |
| parent | 825d973428450cb24d8cccc8c2604235ef974b7c (diff) | |
Add Phase 6.5A: same-batch linesearch REVISES Phase 6A conclusion
Phase 6A's "better credit → worse loss" was a protocol artifact caused by:
1. Credit normalization (inflated DFA, suppressed Vec magnitude ordering)
2. Held-out evaluation (measured generalization failure, not exploitability)
3. Gradient clamping
With strict same-batch evaluation:
- Oracle BP: dL_same = -0.406 (strongest descent)
- Vec_M4: dL_same = -0.135
- ScalarCB: dL_same = -0.025
- DFA: dL_same = -0.003
Same-batch loss decrease is MONOTONIC with credit quality.
But held-out loss INCREASES for all non-DFA methods (Case D: overfitting).
The bottleneck is batch-level generalization, not surrogate exploitability.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
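
The same-batch protocol described in the commit message can be illustrated with a minimal sketch. It is not the project's code: the 1-D linear model, the `mse` loss, and the names `same_batch_linesearch`, `train_batch`, and `held_batch` are hypothetical stand-ins chosen for illustration. The idea is to apply one candidate update at several step sizes, record the loss change on the batch that produced the update (`dL_same`) and on a separate batch (`dL_held`), and select the step size by `dL_same` alone.

```python
import random


def mse(w, batch):
    """Mean squared error of a toy 1-D model y ~ w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)


def mse_grad(w, batch):
    """Exact gradient of mse with respect to w (stands in for a credit signal)."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def same_batch_linesearch(w, direction, train_batch, held_batch, etas):
    """Try w - eta * direction for each eta.

    Returns (best_row, rows), where each row is (eta, dL_same, dL_held)
    and best_row has the most negative dL_same.
    """
    base_same = mse(w, train_batch)
    base_held = mse(w, held_batch)
    rows = []
    for eta in etas:
        w_new = w - eta * direction
        dl_same = mse(w_new, train_batch) - base_same  # same-batch change
        dl_held = mse(w_new, held_batch) - base_held   # held-out change
        rows.append((eta, dl_same, dl_held))
    best = min(rows, key=lambda row: row[1])
    return best, rows


if __name__ == "__main__":
    rng = random.Random(0)

    def make_batch(n):
        # Hypothetical data: y = 3x plus Gaussian noise.
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        return [(x, 3.0 * x + rng.gauss(0.0, 0.5)) for x in xs]

    train_batch, held_batch = make_batch(8), make_batch(8)
    w0 = 0.0
    direction = mse_grad(w0, train_batch)
    best, rows = same_batch_linesearch(
        w0, direction, train_batch, held_batch, etas=[1e-3, 3e-3, 1e-2]
    )
    for eta, dl_same, dl_held in rows:
        print(f"eta={eta:g}  dL_same={dl_same:+.4f}  dL_held={dl_held:+.4f}")
    print("best eta by same-batch loss:", best[0])
```

Reporting `dL_held` next to `dL_same` without using it for selection keeps the two questions separate: whether the update direction can be exploited at all, and whether the resulting step generalizes beyond the batch.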
Diffstat (limited to 'report_explore')
| -rw-r--r-- | report_explore/MEMO_6.5A_samebatch_linesearch.md | 44 |
1 files changed, 44 insertions, 0 deletions
diff --git a/report_explore/MEMO_6.5A_samebatch_linesearch.md b/report_explore/MEMO_6.5A_samebatch_linesearch.md
new file mode 100644
index 0000000..733db12
--- /dev/null
+++ b/report_explore/MEMO_6.5A_samebatch_linesearch.md
@@ -0,0 +1,44 @@
+# Phase 6.5A Memo: Same-Batch Linesearch
+
+**Date**: 2026-03-25
+
+## Question
+Under strict same-batch evaluation, does better credit produce better loss decrease?
+
+## Answer: YES.
+
+Phase 6A's conclusion was wrong due to protocol confounds. With same-batch evaluation:
+
+### All blocks, normalized credit (closest to Phase 6A protocol):
+
+| Method | best eta | dL_same | dL_held |
+|--------|---------|---------|---------|
+| DFA | 1e-2 | -0.003 | +0.004 |
+| ScalarCB | 3e-3 | -0.025 | +0.027 |
+| Vec_M4 | 3e-3 | **-0.135** | +0.045 |
+| Oracle BP | 1e-2 | **-0.406** | +0.094 |
+
+**Same-batch loss decreases monotonically with credit quality**: Oracle > Vec > ScalarCB > DFA.
+
+### But held-out loss increases for all non-DFA methods.
+
+This is **Case D**: the local surrogate correctly exploits credit to decrease training loss, but the update overfits to the batch. The better the credit, the more effective the overfitting.
+
+### Key confounds in Phase 6A:
+1. **Normalization**: Phase 6A always normalized credit, which amplified DFA's weak signals to the same magnitude as Vec's strong signals, erasing the natural ordering
+2. **Held-out evaluation**: Phase 6A evaluated on held-out batches, showing the generalization failure rather than the exploitability success
+3. **Gradient clamping**: Phase 6A clamped gradients to [-1, 1], further distorting the relationship
+
+### Raw vs Normalized (all blocks):
+| Method | raw dL_same (best) | norm dL_same (best) |
+|--------|--------------------|---------------------|
+| Vec_M4 | -0.005 | -0.135 |
+| Oracle | -0.003 | -0.406 |
+
+Raw credit produces tiny updates because BP gradients have RMS ≈ 0.00004. Normalization brings all methods to comparable magnitude but introduces overfitting.
+
+## Revised Diagnosis
+
+The bottleneck is NOT "local surrogate cannot exploit good credit" (Case B from Phase 6A). It IS:
+- **Generalization/overfitting**: local surrogate with good credit decreases train loss but increases held-out loss
+- This means the project direction should be about **regularizing local updates** rather than replacing the surrogate entirely
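
The normalization and clamping confounds named in the memo can be made concrete with a small sketch. The memo does not state how Phase 6A normalized credit, so unit-RMS scaling is an assumption here, and the helper names (`rms`, `normalize_rms`, `clamp`) and the example vectors are hypothetical. Only the [-1, 1] clamp range and the RMS ≈ 0.00004 scale come from the memo.

```python
import math


def rms(v):
    """Root-mean-square magnitude of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v) / len(v))


def normalize_rms(v, eps=1e-12):
    """Rescale v to roughly unit RMS (assumed normalization scheme)."""
    r = rms(v)
    return [x / (r + eps) for x in v]


def clamp(v, lo=-1.0, hi=1.0):
    """Clamp each entry to [lo, hi]; [-1, 1] is the range named in the memo."""
    return [max(lo, min(hi, x)) for x in v]


if __name__ == "__main__":
    # Hypothetical credit signals: one weak, one a few hundred times stronger.
    weak = [4e-5, -4e-5, 4e-5, -4e-5]
    strong = [0.02, -0.01, 0.03, -0.005]

    print("raw RMS ratio (strong / weak):", rms(strong) / rms(weak))

    weak_n, strong_n = normalize_rms(weak), normalize_rms(strong)
    # After normalization both signals sit at RMS ~ 1: the ordering is gone.
    print("normalized RMS:", rms(weak_n), rms(strong_n))

    # Clamping after normalization truncates the largest entries,
    # which changes the direction of non-uniform vectors.
    print("clamped strong:", clamp(strong_n))
```

Under these assumptions the weak and strong signals become indistinguishable in magnitude once normalized, which matches the memo's point that normalization erased the natural ordering between methods.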
