faeval.git

author	YurenHao0426 <Blackhao0426@gmail.com>	2026-04-07 23:59:35 -0500
committer	YurenHao0426 <Blackhao0426@gmail.com>	2026-04-07 23:59:35 -0500
commit	503bb06ed9ab00cfa72d2d9a532b0caf10cb83b3 (patch)
tree	719fbb2e237a98d8985f54350a8d081072957993 /results/cnn_baseline/ep_s456.json
parent	ab1b783c7a4f3d586d082ba142d7c046453a310c (diff)

Add (d) frozen-baseline threshold sensitivity — IMPORTANT new finding

Critical observation: at lambda=1e-3 (single seed), the penalized DFA margin above the shallow baseline is +2.3 pp, which PASSES (d) at the default 2 pp threshold. At lambda=1e-2 (3 seeds), the margin is +1.4 pp, which FIRES (d) at 2 pp. So the (d) verdict on penalized DFA depends on BOTH the lambda choice AND the threshold choice. This is a significantly weaker claim than 'two failure modes are separable via (d)'.

The honest framing, per the round 18 lesson: there is a real tradeoff between penalty strength and depth utilization. A weaker penalty preserves more depth contribution but also more scale pathology. A stronger penalty kills depth contribution. The protocol surfaces this tradeoff but doesn't establish the second failure mode by itself.

Compared to the separation gaps of 63x for (a) and 24338x for (b), (d) is the LEAST robust diagnostic and the most sensitive to threshold choice. Need to flag this prominently in the paper.
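The threshold dependence described above can be made concrete with a small sketch. It assumes (d) fires when the margin over the shallow baseline falls below the threshold, which matches the two verdicts quoted in this message; the function names and the extra thresholds of 1 pp and 3 pp are hypothetical and serve only to illustrate the sensitivity. They are not taken from the repository.

```python
# Margins over the shallow baseline in percentage points, as quoted in the
# commit message: lambda=1e-3 (single seed) and lambda=1e-2 (3 seeds).
MARGINS_PP = {1e-3: 2.3, 1e-2: 1.4}


def d_fires(margin_pp, threshold_pp=2.0):
    """Hypothetical form of check (d): fire when the margin is below the threshold.

    Behaviour at exact equality is an assumption; the source does not specify it.
    """
    return margin_pp < threshold_pp


def verdict_table(margins, thresholds):
    """Map each (lambda, threshold) pair to True if (d) fires, False if it passes."""
    return {
        (lam, thr): d_fires(margin, thr)
        for lam, margin in margins.items()
        for thr in thresholds
    }


# 2.0 pp is the default threshold from the message; 1.0 and 3.0 are illustrative.
table = verdict_table(MARGINS_PP, [1.0, 2.0, 3.0])
for (lam, thr), fires in sorted(table.items()):
    print(f"lambda={lam:g} threshold={thr:.1f}pp -> {'FIRES' if fires else 'passes'}")
```

Under this assumed rule, the two lambda settings agree at 1 pp (both pass) and at 3 pp (both fire) and disagree only at the 2 pp default, which is the sense in which the (d) verdict depends on both choices.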

Diffstat (limited to 'results/cnn_baseline/ep_s456.json')

0 files changed, 0 insertions, 0 deletions

