# Experiment framework — locked 2026-06-12

Purpose: every paper claim gets its evidence gap named and the run that closes it specified,
so all remaining GPU/CPU work is execution against this table. Targets: workshop freeze
~Jun 25 (COLM deadline Jul 12), ICLR freeze ~Sep 5 (deadline ~Sep 16, unconfirmed).

## Locked measurement protocol (uniformity rules)
- Estimator: current `diagnose_{hrm,trm}_joint.py` (joint JVP+QR, k=8, t_ons=1); seed 0 primary;
  n=2048 headline, n=512 series/sweeps; batch 16 (TRM) / 32 (HRM).
- λ values never compared across estimator implementations; scale ownership stated at first use.
- Settled = low band of late z_H drift (mean of final 4 segments), Otsu primary + percentile
  sweep + threshold-free statement; z_L and combined-drift definitions reported as robustness.
- idx-pairing: same seed + same n across any runs that will be joined per example.
- All new npz under analysis_2x2/ subdirs with queue logs; nothing hand-run on GPU.
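
The settling rule above (late drift = mean of the final 4 segments, Otsu threshold as primary, percentile sweep as robustness) can be sketched as follows. This is a minimal illustration, not the code in `diagnose_{hrm,trm}_joint.py`: the function names, the `(n_examples, n_segments)` drift layout, and the 256-bin histogram are all assumptions made for the example.

```python
import numpy as np

def late_drift(drift, n_late=4):
    """Mean drift over the final `n_late` segments.

    Assumes `drift` has shape (n_examples, n_segments); this layout is hypothetical.
    """
    return np.asarray(drift, dtype=float)[:, -n_late:].mean(axis=1)

def otsu_threshold(values, bins=256):
    """Otsu's method on a 1-D sample: the cut that maximizes between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)                 # examples at or below each candidate cut
    w1 = hist.sum() - w0                 # examples above it
    s0 = np.cumsum(hist * centers)       # running sum of (binned) values
    valid = (w0 > 0) & (w1 > 0)          # both classes must be non-empty
    mu0 = s0 / np.maximum(w0, 1)
    mu1 = (s0[-1] - s0) / np.maximum(w1, 1)
    between = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, -1.0)
    k = int(np.argmax(between))
    return edges[k + 1]                  # upper edge of the last "low" bin

def settled_fraction_sweep(late, percentiles=(30, 40, 50, 60, 70)):
    """Robustness check: fraction labelled settled when the cut is a fixed percentile."""
    late = np.asarray(late, dtype=float)
    return {p: float((late < np.percentile(late, p)).mean()) for p in percentiles}

if __name__ == "__main__":
    rng = np.random.default_rng(0)       # seed 0, matching the protocol's primary seed
    demo = np.concatenate([rng.normal(0.02, 0.005, (300, 16)),
                           rng.normal(1.0, 0.1, (200, 16))])
    late = late_drift(demo)
    thr = otsu_threshold(late)
    print("threshold:", thr, "settled:", int((late < thr).sum()))
```

An example counts as settled when its late drift falls below the returned threshold. The same `late_drift` helper could be applied to z_L or combined drift for the robustness variants.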

## Claims → gaps → runs

| ID | Closes | Run | Spec / notes | Cost | Priority / target |
|---|---|---|---|---|---|
| E1 | C1/C2 formal uncertainty; C7-TRM matched pair | OFFLINE: bootstrap CIs (cells, AUCs, B-fraction rule-of-three bound); z_L/combined settling robustness; 2×2 on existing TRM official multi4 npz (@35805 best, @65100 final) vs baseline @58590 | existing npz | CPU, now | **P0 / workshop** |
| E2 | C1/C2 run-level replication (HRM has 1 training run) | diagnose step9_E (HRM fixed-unroll baseline 50k) best.pt + final.pt + 2 mid ckpts, n=512–2048 | loader risk: step9 .pt may need key remap; fix on first failure | ~1–2 h shared | **P1 / workshop** |
| E5 | C5 single-horizon caveat → "when does fate become legible" curve | horizon sweep h∈{2,4,6,8,10,12} segments, both models, idx-paired to full-window retest (seed 0, n=2048) | env-var horizon variant of `*_short.py` | ~5 h shared | **P1 / workshop headline fig** |
| E6 | C7 matched-objective intervention probe | (a) verify provenance of `diag_hrm_multi4_*` run; (b) diagnose step9 pairs E vs F (HRM), G vs H (TRM) at {12500, 25000, best, final}, n=512; compare at matched ACCURACY (not matched step) + best-vs-best | | ~2–3 h shared | **P1 / ICLR (workshop if cheap)** |
| E3 | C3 "would have halted" + C-cell fate beyond window | 32-segment unroll restricted to B∪C idx sets (≤300 examples/model) + true-ACT inference on HRM's 21 B-cell idx | | ~1 h shared | P2 / ICLR |
| E4 | C4 difficulty proxy weakness | CPU sudoku solver with backtrack counting on the 81-token inputs; redo difficulty control binned by backtracks | | CPU only | P2 / ICLR |
| E7 | Sec-5 implication → measured result | restart-and-select on TRM official @58590: K∈{1,2,4,8,16,32} h_init Gaussian restarts; select by q_halt vs late-drift vs λ-proxy vs oracle; n=1024; quantify B-blind-spot empirically | | ~5–10 h shared | P2 / ICLR |
| E8 | "Sudoku-only" limitation (biggest generalization win) | build Maze dataset; train TRM official-recipe Maze; full diagnostic suite on it | | **multi-day dedicated GPU**; needs explicit go | P2-gated / ICLR |
| E9 | transverse/readout framing | logit-space FTLE (diagnose-script modification) | | dev + ~2 h | P3 / ICLR appendix |
| E10 | sampling-seed robustness (if bootstrap deemed insufficient) | rerun headline diags seeds 1,2 | | ~5 h shared | P3 |
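
For the E1 uncertainty items, a percentile bootstrap over examples and the rule-of-three bound can be sketched as below. This is a hedged illustration under stated assumptions, not the offline batch actually run: the function names are hypothetical, the AUC is the plain rank-based (Mann-Whitney) estimate, and the rule of three applies only when zero events were observed in n trials.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: P(score_pos > score_neg), with ties counted as 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        return np.nan                      # undefined when one class is absent
    diff = pos[:, None] - neg[None, :]     # O(n_pos * n_neg) memory; fine for n <= 2048
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus percentile-bootstrap CI, resampling examples with replacement."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n = len(scores)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # one resample of example indices
        stats[b] = auc(scores[idx], labels[idx])
    lo, hi = np.nanquantile(stats, [alpha / 2, 1 - alpha / 2])
    return auc(scores, labels), float(lo), float(hi)

def rule_of_three_upper(n):
    """Approximate 95% upper bound on an event rate when 0 events occurred in n trials."""
    return 3.0 / n

def exact_zero_event_upper(n, alpha=0.05):
    """Exact one-sided bound for the same zero-event case: solve (1 - p)^n = alpha."""
    return 1.0 - alpha ** (1.0 / n)
```

Because the resampling is over example indices, the same `idx` draw can be reused across idx-paired runs so that paired quantities are bootstrapped jointly. When the observed count is not zero (for example a B-cell with one or more members), the rule of three does not apply and an exact binomial interval would be the usual substitute.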

## Decision rules
- Workshop version ships on E1+E2+E5 (+E6 if done): decomposition, controls, horizon curve.
- ICLR adds E3/E4/E6/E7 (+E8 if approved in time: training must START by ~Jul 1).
- Any FAILED queue job: fix loader/script, rerun; never substitute hand-run results.
- New results that contradict current text → claims.md updated first, prose second.

## Status board (update as runs land)
- [x] E1 offline batch — done 2026-06-12, see offline_followups/phase1_e1.md
- [x] E2 done — step9_E (2nd HRM run, fixed-unroll): B≈0 replicates (best 1/743=0.13%, final 3/832=0.36%). C2 STRENGTHENED (now 2 estimators × 2 archs × 2 training runs × 2 objectives).
- [x] E5 done — horizon sweep {2,4,6,8,10,12}×2 models. C5 STRENGTHENED to a curve: dynamics never legible early (TRM AUC≈0.48–0.53 all H; HRM reversed/below 0.5 all H); only HRM q_halt predicts (0.81→0.64). See phase1/fig_E5_horizon_sweep.png. NEW: undecided-pool eventual-success drops to 4%(HRM)/19%(TRM) by H=12 yet dynamics still don't discriminate.
- [x] E6 done — matched-objective. **C7b demoted** (effect small/inconsistent at matched step); C7a (λ₁(D) rises over training) stands. §3.4 needs rewrite.
- [ ] E3, E4, E7 — phase-2
- [ ] E8 Maze — GO (user has dedicated GPUs). Build dataset + package launch/diagnostic; train on user cards.