path: root/protocol/examples/threshold_d_sensitivity.py
Age | Commit message | Author
2026-04-08 | Sync experiment+protocol scripts with v2.32 corrected control values | YurenHao0426
The pre-v2.31 unsourced values BP=0.609 and DFA=0.308 (which v2.31 fixed to 0.585 and 0.301 via matched 30-ep controls) were also hardcoded as "compare to" comments in 5 helper scripts:

  experiments/bp_with_penalty_control.py
  experiments/dfa_residual_penalty_test.py
  experiments/resmlp_frozen_blocks_baseline.py
  protocol/examples/threshold_d_sensitivity.py
  protocol/examples/plot_penalty_rescue.py

These are non-paper-input scripts (their output goes to stdout, not to the paper), so the stale values didn't cause numerical errors in the paper itself. But the original v2.31 BP+pen=0.609 unsourced number bug came from exactly this kind of hardcoded "for-comparison" comment that was never measured. Updating them now to remove the same trap from future runs.

Each script now references the matched 30-ep 3-seed values from results/bp_no_penalty_30ep, results/dfa_no_penalty_30ep, results/dfa_pen_short, and results/bp_with_penalty.

protocol/EVIDENCE_SUMMARY.md and PAPER_OUTLINE.md still have stale numbers; these are project scratch documents and not user-facing. Deferred to a separate sweep if needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 | Add (d) frozen-baseline threshold sensitivity - IMPORTANT new finding | YurenHao0426
Critical observation: at lambda=1e-3 (single seed), penalized DFA margin above shallow baseline is +2.3 pp, which PASSES (d) at the 2 pp default threshold. At lambda=1e-2 (3 seeds), the margin is +1.4 pp, which FIRES (d) at 2 pp.

So the (d) verdict on penalized DFA depends on BOTH the lambda choice AND the threshold choice. This is a significantly weaker claim than 'two failure modes are separable via (d)'.

The honest framing per round 18 lesson: there is a real tradeoff between penalty strength and depth utilization. Weaker penalty preserves more depth contribution but also more scale pathology. Stronger penalty kills depth contribution. The protocol surfaces this tradeoff but doesn't establish the second failure mode by itself.

Compared to (a) 63x and (b) 24338x separation gaps, (d) is the LEAST robust diagnostic and the most sensitive to threshold choice. Need to flag this prominently in the paper.
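
The threshold dependence described above can be illustrated with a minimal sketch. It is not the content of protocol/examples/threshold_d_sensitivity.py: the function names, the swept threshold values, and the strict "margin below threshold" firing rule are assumptions chosen to match the two verdicts quoted in the message (+2.3 pp passes and +1.4 pp fires at the 2 pp default). Only the two margins themselves come from the commit message.

```python
# Illustrative sketch only. The firing rule and all names are assumptions,
# not the actual logic of protocol/examples/threshold_d_sensitivity.py.

# Margins above the shallow baseline, in percentage points (from the commit message).
MARGINS_PP = {
    "lambda=1e-3 (1 seed)": 2.3,
    "lambda=1e-2 (3 seeds)": 1.4,
}


def d_fires(margin_pp: float, threshold_pp: float = 2.0) -> bool:
    """Assumed rule: diagnostic (d) fires when the margin is below the threshold."""
    return margin_pp < threshold_pp


def sweep(thresholds_pp):
    """Return, per lambda setting, the (d) verdict at each candidate threshold."""
    return {
        label: [d_fires(margin, t) for t in thresholds_pp]
        for label, margin in MARGINS_PP.items()
    }


if __name__ == "__main__":
    thresholds = [1.0, 1.5, 2.0, 2.5]  # hypothetical sweep values
    for label, verdicts in sweep(thresholds).items():
        cells = ", ".join(
            f"{t} pp: {'FIRES' if fired else 'passes'}"
            for t, fired in zip(thresholds, verdicts)
        )
        print(f"{label}: {cells}")
```

Under these assumptions the verdicts for the two lambda settings disagree at 1.5 pp and 2 pp and agree at 1 pp and 2.5 pp, which is the sensitivity the message asks to flag.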