Sync experiment+protocol scripts with v2.32 corrected control values

2026-04-09T00:24:06+00:00

The pre-v2.31 unsourced values BP=0.609 and DFA=0.308 (which v2.31 fixed
to 0.585 and 0.301 via matched 30-ep controls) were also hardcoded as
"compare to" comments in 5 helper scripts:

  experiments/bp_with_penalty_control.py
  experiments/dfa_residual_penalty_test.py
  experiments/resmlp_frozen_blocks_baseline.py
  protocol/examples/threshold_d_sensitivity.py
  protocol/examples/plot_penalty_rescue.py

These are non-paper-input scripts (their output goes to stdout, not to
the paper), so the stale values didn't cause numerical errors in the
paper itself. But the unsourced BP+pen=0.609 number that v2.31 corrected
came from exactly this kind of hardcoded "for-comparison" comment that
was never measured. Updating the comments now removes the same trap from
future runs.

Each script now references the matched 30-ep 3-seed values from:

  results/bp_no_penalty_30ep
  results/dfa_no_penalty_30ep
  results/dfa_pen_short
  results/bp_with_penalty
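
One way to keep comparison values from going stale is to compute them from
the results directories at run time instead of writing them into comments.
The sketch below is illustrative only: the seed_*.json file layout, the
"final_acc" key, and the reference_value helper are assumptions, not the
actual format used under results/.

```python
import json
from pathlib import Path
from statistics import mean


def reference_value(run_dir, key="final_acc"):
    """Return (mean, n_seeds) of `key` over every seed_*.json in run_dir.

    The file naming and the key are hypothetical; adapt them to whatever
    the results directories actually contain.
    """
    seed_files = sorted(Path(run_dir).glob("seed_*.json"))
    if not seed_files:
        # Fail loudly rather than fall back to a remembered number.
        raise FileNotFoundError(f"no seed files found in {run_dir}")
    values = [json.loads(p.read_text())[key] for p in seed_files]
    return mean(values), len(values)
```

A helper script would then print the measured mean together with the seed
count, so a missing or partial control run is visible instead of silently
replaced by an old number.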

protocol/EVIDENCE_SUMMARY.md and PAPER_OUTLINE.md still have stale
numbers — these are project scratch documents and not user-facing.
Deferred to a separate sweep if needed.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add (d) frozen-baseline threshold sensitivity — IMPORTANT new finding

2026-04-08T04:59:35+00:00

Critical observation: at lambda=1e-3 (single seed), the penalized DFA
margin above the shallow baseline is +2.3 pp, which PASSES (d) at the
2 pp default threshold. At lambda=1e-2 (3 seeds), the margin is +1.4 pp,
which FIRES (d) at 2 pp.

So the (d) verdict on penalized DFA depends on BOTH the lambda choice
AND the threshold choice. This is a significantly weaker claim than
'two failure modes are separable via (d)'.
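
The dependence on both choices can be sketched as a small sweep. This is a
minimal illustration, not the code in threshold_d_sensitivity.py: it
assumes (d) fires when the margin is strictly below the threshold, and
every threshold other than the 2 pp default is a hypothetical value chosen
for the example. The two margins are the ones reported above.

```python
def d_fires(margin_pp, threshold_pp=2.0):
    """True if (d) fires: margin over the shallow baseline is below threshold.

    Strict '<' is an assumption; the behavior at exact equality is not
    specified in this message.
    """
    return margin_pp < threshold_pp


# Margins (pp above the shallow baseline) quoted in this commit message.
margins = {"lambda=1e-3 (1 seed)": 2.3, "lambda=1e-2 (3 seeds)": 1.4}


def sweep(margins, thresholds):
    """Map each setting to {threshold: fires?} for every threshold."""
    return {
        name: {t: d_fires(m, t) for t in thresholds}
        for name, m in margins.items()
    }


# Only 2.0 is the default; the other thresholds are illustrative.
table = sweep(margins, [1.0, 1.5, 2.0, 2.5])
for name, row in table.items():
    print(name, {t: "FIRES" if fired else "passes" for t, fired in row.items()})
```

Under these assumptions the two lambda settings get different verdicts at
1.5 pp and 2.0 pp, and the same verdict at 1.0 pp and 2.5 pp.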

The honest framing, per the round 18 lesson: there is a real tradeoff between
penalty strength and depth utilization. Weaker penalty preserves more
depth contribution but also more scale pathology. Stronger penalty kills
depth contribution. The protocol surfaces this tradeoff but doesn't
establish the second failure mode by itself.

Compared to (a) 63x and (b) 24338x separation gaps, (d) is the LEAST
robust diagnostic and the most sensitive to threshold choice. Need to
flag this prominently in the paper.

faeval.git/protocol/examples/threshold_d_sensitivity.py, branch master
