1 files changed, 43 insertions, 0 deletions
diff --git a/research/flossing/paper/experiment_framework.md b/research/flossing/paper/experiment_framework.md
new file mode 100644
index 0000000..945a9a3
--- /dev/null
+++ b/research/flossing/paper/experiment_framework.md
@@ -0,0 +1,43 @@
+# Experiment framework — locked 2026-06-12
+
+Purpose: every paper claim gets its evidence gap named and the run that closes it specified,
+so all remaining GPU/CPU work is execution against this table. Targets: workshop freeze
+~Jun 25 (COLM deadline Jul 12), ICLR freeze ~Sep 5 (deadline ~Sep 16, unconfirmed).
+
+## Locked measurement protocol (uniformity rules)
+- Estimator: current `diagnose_{hrm,trm}_joint.py` (joint JVP+QR, k=8, t_ons=1); seed 0 primary;
+  n=2048 headline, n=512 series/sweeps; batch 16 (TRM) / 32 (HRM).
+- λ values never compared across estimator implementations; scale ownership stated at first use.
+- Settled = low band of late z_H drift (mean of final 4 segments), Otsu primary + percentile
+  sweep + threshold-free statement; z_L and combined-drift definitions reported as robustness.
+- idx-pairing: same seed + same n across any runs that will be joined per example.
+- All new npz under analysis_2x2/ subdirs with queue logs; nothing hand-run on GPU.
+
+## Claims → gaps → runs
+
+| ID | Closes | Run | Spec | Cost | Priority / target |
+|---|---|---|---|---|---|
+| E1 | C1/C2 formal uncertainty; C7-TRM matched pair | OFFLINE: bootstrap CIs (cells, AUCs, B-fraction rule-of-three bound); z_L/combined settling robustness; 2×2 on existing TRM official multi4 npz (@35805 best, @65100 final) vs baseline @58590 | existing npz | CPU, now | **P0 / workshop** |
+| E2 | C1/C2 run-level replication (HRM has 1 training run) | diagnose step9_E (HRM fixed-unroll baseline 50k) best.pt + final.pt + 2 mid ckpts, n=512–2048 | loader risk: step9 .pt may need key remap; fix on first failure | ~1–2 h shared | **P1 / workshop** |
+| E5 | C5 single-horizon caveat → "when does fate become legible" curve | horizon sweep h∈{2,4,6,8,10,12} segments, both models, idx-paired to full-window retest (seed 0, n=2048) | env-var horizon variant of *_short.py | ~5 h shared | **P1 / workshop headline fig** |
+| E6 | C7 matched-objective intervention probe | (a) verify provenance of diag_hrm_multi4_* run; (b) diagnose step9 pairs E vs F (HRM), G vs H (TRM) at {12500, 25000, best, final}, n=512; compare at matched ACCURACY (not matched step) + best-vs-best | | ~2–3 h shared | **P1 / ICLR (workshop if cheap)** |
+| E3 | C3 "would have halted" + C-cell fate beyond window | 32-segment unroll restricted to B∪C idx sets (≤300 examples/model) + true-ACT inference on HRM's 21 B-cell idx | | ~1 h shared | P2 / ICLR |
+| E4 | C4 difficulty proxy weakness | CPU sudoku solver with backtrack counting on the 81-token inputs; redo difficulty control binned by backtracks | | CPU only | P2 / ICLR |
+| E7 | Sec-5 implication → measured result | restart-and-select on TRM official @58590: K∈{1,2,4,8,16,32} h_init Gaussian restarts; select by q_halt vs late-drift vs λ-proxy vs oracle; n=1024; quantify B-blind-spot empirically | | ~5–10 h shared | P2 / ICLR |
+| E8 | "Sudoku-only" limitation (biggest generalization win) | build Maze dataset; train TRM official-recipe Maze; full diagnostic suite on it | | **multi-day dedicated GPU**; needs explicit go | P2-gated / ICLR |
+| E9 | transverse/readout framing | logit-space FTLE (diagnose-script modification) | | dev + ~2 h | P3 / ICLR appendix |
+| E10 | sampling-seed robustness (if bootstrap deemed insufficient) | rerun headline diags seeds 1,2 | | ~5 h shared | P3 |
+
+## Decision rules
+- Workshop version ships on E1+E2+E5 (+E6 if done): decomposition, controls, horizon curve.
+- ICLR adds E3/E4/E6/E7 (+E8 if approved in time: training must START by ~Jul 1).
+- Any FAILED queue job: fix loader/script, rerun; never substitute hand-run results.
+- New results that contradict current text → claims.md updated first, prose second.
+
+## Status board (update as runs land)
+- [x] E1 offline batch — done 2026-06-12, see offline_followups/phase1_e1.md
+- [x] E2 done — step9_E (2nd HRM run, fixed-unroll): B≈0 replicates (best 1/743=0.13%, final 3/832=0.36%). C2 STRENGTHENED (now 2 estimators × 2 archs × 2 training runs × 2 objectives).
+- [x] E5 done — horizon sweep {2,4,6,8,10,12}×2 models. C5 STRENGTHENED to a curve: dynamics never legible early (TRM AUC≈0.48–0.53 all H; HRM reversed/below 0.5 all H); only HRM q_halt predicts (0.81→0.64). See phase1/fig_E5_horizon_sweep.png. NEW: undecided-pool eventual-success drops to 4%(HRM)/19%(TRM) by H=12 yet dynamics still don't discriminate.
+- [x] E6 done — matched-objective. **C7b demoted** (effect small/inconsistent at matched step); C7a (λ₁(D) rises over training) stands. §3.4 needs rewrite.
+- [ ] E3, E4, E7 — phase-2
+- [ ] E8 Maze — GO (user has dedicated GPUs). Build dataset + package launch/diagnostic; train on user cards.
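
Reviewer note on E1 (bootstrap CIs and the rule-of-three bound on the B fraction): a minimal sketch of how those two uncertainty statements could be computed for a single cell fraction. It is an illustration under stated assumptions, not the code behind `offline_followups/phase1_e1.md`. The function names are hypothetical, the bootstrap shown is the plain percentile variant over per-example 0/1 indicators, and it covers only a fraction (not AUCs or full 2×2 cells). The rule of three applies only when zero events were observed, so the zero-count case for n=743 below is hypothetical; the 3/832 count is taken from the E2 status line.

```python
import numpy as np


def rule_of_three_upper(n: int) -> float:
    """Approximate one-sided 95% upper bound on a rate when 0 of n were events."""
    return 3.0 / n


def exact_zero_event_upper(n: int, alpha: float = 0.05) -> float:
    """Exact one-sided bound for 0 of n: the p that solves (1 - p)**n = alpha."""
    return 1.0 - alpha ** (1.0 / n)


def bootstrap_fraction_ci(hits, n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for the mean of a 0/1 indicator array."""
    rng = np.random.default_rng(seed)
    hits = np.asarray(hits, dtype=float)
    # Resample example indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, hits.size, size=(n_boot, hits.size))
    means = hits[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


if __name__ == "__main__":
    # Hypothetical zero-event cell with n = 743 examples.
    print("rule of three:", rule_of_three_upper(743))
    print("exact bound:  ", exact_zero_event_upper(743))

    # 3 events out of 832, as in the E2 status line (final checkpoint).
    indicators = np.zeros(832)
    indicators[:3] = 1.0
    print("bootstrap CI: ", bootstrap_fraction_ci(indicators))
```

With counts this small the percentile interval is coarse (its lower end can sit at exactly 0), which is one reason to report the rule-of-three or exact bound alongside it when a cell is empty or nearly empty.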