Diffstat (limited to 'research/flossing/paper/experiment_framework.md')
| -rw-r--r-- | research/flossing/paper/experiment_framework.md | 43 |
1 files changed, 43 insertions, 0 deletions
diff --git a/research/flossing/paper/experiment_framework.md b/research/flossing/paper/experiment_framework.md
new file mode 100644
index 0000000..945a9a3
--- /dev/null
+++ b/research/flossing/paper/experiment_framework.md
@@ -0,0 +1,43 @@
+# Experiment framework — locked 2026-06-12
+
+Purpose: every paper claim gets its evidence gap named and the run that closes it specified,
+so all remaining GPU/CPU work is execution against this table. Targets: workshop freeze
+~Jun 25 (COLM deadline Jul 12), ICLR freeze ~Sep 5 (deadline ~Sep 16, unconfirmed).
+
+## Locked measurement protocol (uniformity rules)
+- Estimator: current `diagnose_{hrm,trm}_joint.py` (joint JVP+QR, k=8, t_ons=1); seed 0 primary;
+  n=2048 headline, n=512 series/sweeps; batch 16 (TRM) / 32 (HRM).
+- λ values never compared across estimator implementations; scale ownership stated at first use.
+- Settled = low band of late z_H drift (mean of final 4 segments), Otsu primary + percentile
+  sweep + threshold-free statement; z_L and combined-drift definitions reported as robustness.
+- idx-pairing: same seed + same n across any runs that will be joined per example.
+- All new npz under analysis_2x2/ subdirs with queue logs; nothing hand-run on GPU.
+
+## Claims → gaps → runs
+
+| ID | Closes | Run | Spec | Cost | Priority / target |
+|---|---|---|---|---|---|
+| E1 | C1/C2 formal uncertainty; C7-TRM matched pair | OFFLINE: bootstrap CIs (cells, AUCs, B-fraction rule-of-three bound); z_L/combined settling robustness; 2×2 on existing TRM official multi4 npz (@35805 best, @65100 final) vs baseline @58590 | existing npz | CPU, now | **P0 / workshop** |
+| E2 | C1/C2 run-level replication (HRM has 1 training run) | diagnose step9_E (HRM fixed-unroll baseline 50k) best.pt + final.pt + 2 mid ckpts, n=512–2048 | loader risk: step9 .pt may need key remap; fix on first failure | ~1–2 h shared | **P1 / workshop** |
+| E5 | C5 single-horizon caveat → "when does fate become legible" curve | horizon sweep h∈{2,4,6,8,10,12} segments, both models, idx-paired to full-window retest (seed 0, n=2048) | env-var horizon variant of *_short.py | ~5 h shared | **P1 / workshop headline fig** |
+| E6 | C7 matched-objective intervention probe | (a) verify provenance of diag_hrm_multi4_* run; (b) diagnose step9 pairs E vs F (HRM), G vs H (TRM) at {12500, 25000, best, final}, n=512; compare at matched ACCURACY (not matched step) + best-vs-best | | ~2–3 h shared | **P1 / ICLR (workshop if cheap)** |
+| E3 | C3 "would have halted" + C-cell fate beyond window | 32-segment unroll restricted to B∪C idx sets (≤300 examples/model) + true-ACT inference on HRM's 21 B-cell idx | | ~1 h shared | P2 / ICLR |
+| E4 | C4 difficulty proxy weakness | CPU sudoku solver with backtrack counting on the 81-token inputs; redo difficulty control binned by backtracks | | CPU only | P2 / ICLR |
+| E7 | Sec-5 implication → measured result | restart-and-select on TRM official @58590: K∈{1,2,4,8,16,32} h_init Gaussian restarts; select by q_halt vs late-drift vs λ-proxy vs oracle; n=1024; quantify B-blind-spot empirically | | ~5–10 h shared | P2 / ICLR |
+| E8 | "Sudoku-only" limitation (biggest generalization win) | build Maze dataset; train TRM official-recipe Maze; full diagnostic suite on it | | **multi-day dedicated GPU** — needs explicit go | P2-gated / ICLR |
+| E9 | transverse/readout framing | logit-space FTLE (diagnose-script modification) | | dev + ~2 h | P3 / ICLR appendix |
+| E10 | sampling-seed robustness (if bootstrap deemed insufficient) | rerun headline diags seeds 1,2 | | ~5 h shared | P3 |
+
+## Decision rules
+- Workshop version ships on E1+E2+E5 (+E6 if done): decomposition, controls, horizon curve.
+- ICLR adds E3/E4/E6/E7 (+E8 if approved in time: training must START by ~Jul 1).
+- Any FAILED queue job: fix loader/script, rerun; never substitute hand-run results.
+- New results that contradict current text → claims.md updated first, prose second.
+
+## Status board (update as runs land)
+- [x] E1 offline batch — done 2026-06-12, see offline_followups/phase1_e1.md
+- [x] E2 done — step9_E (2nd HRM run, fixed-unroll): B≈0 replicates (best 1/743=0.13%, final 3/832=0.36%). C2 STRENGTHENED (now 2 estimators × 2 archs × 2 training runs × 2 objectives).
+- [x] E5 done — horizon sweep {2,4,6,8,10,12}×2 models. C5 STRENGTHENED to a curve: dynamics never legible early (TRM AUC≈0.48–0.53 all H; HRM reversed/below 0.5 all H); only HRM q_halt predicts (0.81→0.64). See phase1/fig_E5_horizon_sweep.png. NEW: undecided-pool eventual-success drops to 4%(HRM)/19%(TRM) by H=12 yet dynamics still don't discriminate.
+- [x] E6 done — matched-objective. **C7b demoted** (effect small/inconsistent at matched step); C7a (λ₁(D) rises over training) stands. §3.4 needs rewrite.
+- [ ] E3, E4, E7 — phase-2
+- [ ] E8 Maze — GO (user has dedicated GPUs). Build dataset + package launch/diagnostic; train on user cards.
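
Note on the E1 uncertainty items: the table names a "B-fraction rule-of-three bound" and bootstrap CIs but does not show how they are computed. The sketch below is one plausible reading, not the project's actual offline script. It assumes the rule of three is applied only to cells with zero observed events, and that the bootstrap resamples examples with replacement. All function names are hypothetical, and the zero-event sample size in the usage lines is illustrative. The 3/832 count is the E2 "final" figure from the status board.

```python
import math
import random


def zero_event_upper_bound(n, alpha=0.05):
    """Exact one-sided upper bound on a rate when 0 events are seen in n trials.

    Solves (1 - p)^n = alpha for p.
    """
    return 1.0 - alpha ** (1.0 / n)


def rule_of_three(n):
    """Approximation to the bound above at alpha = 0.05, since -ln(0.05) is about 3."""
    return 3.0 / n


def bootstrap_fraction_ci(k, n, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for a fraction k/n, resampling examples with replacement."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    labels = [1] * k + [0] * (n - k)
    stats = sorted(sum(rng.choices(labels, k=n)) / n for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical cell with zero B events among 500 examples (illustrative n).
    print("exact bound:", zero_event_upper_bound(500))
    print("rule of three:", rule_of_three(500))
    # Count quoted in the status board for the step9_E final checkpoint.
    print("bootstrap CI for 3/832:", bootstrap_fraction_ci(3, 832))
```

With counts this small the percentile bootstrap is coarse (the lower end will often sit at zero), so an exact binomial interval may be a better headline number; that choice is left open here.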
