| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
| commit | 66e0d8b9fd4d0f7a2231d689c055e26fdf1cf04a (patch) | |
| tree | c29cba61124018755a19b02c9d33e3ad5f2e05cc /research/flossing/analysis_2x2/offline_followups/phase1_e1.md | |
Curated export for clone-and-run Maze training (2x A6000) + diagnostics.
trm/hrm pretrain.py carry trajectory-augmentation code (backward-compatible).
Heavy artifacts (checkpoints/wandb/npz) gitignored; see PROVENANCE.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'research/flossing/analysis_2x2/offline_followups/phase1_e1.md')
| -rw-r--r-- | research/flossing/analysis_2x2/offline_followups/phase1_e1.md | 27 |
1 files changed, 27 insertions, 0 deletions
diff --git a/research/flossing/analysis_2x2/offline_followups/phase1_e1.md b/research/flossing/analysis_2x2/offline_followups/phase1_e1.md
new file mode 100644
index 0000000..b3e3f40
--- /dev/null
+++ b/research/flossing/analysis_2x2/offline_followups/phase1_e1.md
@@ -0,0 +1,27 @@
+# E1 offline batch — bootstrap CIs, settling robustness, TRM multi4 pair
+
+## Bootstrap / exact CIs (TRM official @58590, n=2048)
+- settled-wrong fraction: observed 0/254; exact 95% upper bound 0.0117 (1.17% of failures)
+- AUC(-lam1->correct) = 0.9935, bootstrap 95% CI (0.9908244697676584, 0.9957330475628791)
+- lam1(wrong) median 95% CI (0.10100110620260239, 0.10556983947753906)
+- lam1(correct) median 95% CI (0.011215815320611, 0.011744528077542782)
+
+## Bootstrap CIs (HRM @26040, n=8192, strict band)
+- strict settled-wrong fraction of failures: observed 0.0054, bootstrap 95% CI (0.0032613427182413084, 0.007798538095694945)
+- AUC(-lam1->correct) = 0.9841, bootstrap 95% CI (0.9815470536412456, 0.9865145187475995)
+
+## Settling-criterion robustness (B-cell counts under alternative drift definitions)
+- TRM official n=2048 | zH: B=0/A=1724 (tau=1.36) | zL: B=0/A=1728 (tau=1.42) | combined: B=0/A=1727 (tau=1.54)
+- HRM n=8192 | zH: B=63/A=4103 (tau=0.77) | zL: B=59/A=4083 (tau=1.01) | combined: B=60/A=4087 (tau=1.07)
+
+## TRM official-pipeline multi4 vs baseline (matched objective, n=512 each)
+- baseline @58590: acc=0.875; A/B/C/D=434/0/14/64; fD=0.125; lam1(D)=+0.1034; lam1(A)=+0.0111
+- multi4 @35805 (best): acc=0.900; A/B/C/D=452/0/9/51; fD=0.100; lam1(D)=+0.1019; lam1(A)=+0.0039
+- multi4 @65100 (final): acc=0.824; A/B/C/D=408/1/14/89; fD=0.174; lam1(D)=+0.0946; lam1(A)=+0.0133
+
+## hrm_multi4 provenance (E6a)
+- diag_hrm_multi4_step_{20832,23436,26040}_512.npz step grid matches HRM pretrain numbering;
+  multi4_eval_compare/logs should contain the eval invocations — checked manually below.
+- ACTION: if the hrm_multi4 run is pretrain-pipeline (ACT-streaming + perturbation), then the
+  May-28 multi4 vs righteous baseline comparison IS matched-pipeline and Sec 3.4's caveat is
+  narrower than written; step9 E-vs-F pair (queued) covers the fixed-unroll objective regardless.
\ No newline at end of file
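
Two of the figures in the added file can be re-derived from the counts printed next to them. The sketch below is a reader-side sanity check, not code from the repository, and every function and variable name in it is hypothetical. It rests on two assumptions that the file does not state: that the "exact 95% upper bound" for 0/254 is the one-sided Clopper-Pearson bound, which for zero events reduces to 1 - alpha^(1/n), and that in the A/B/C/D rows accuracy is (A+C)/n and fD is D/n.

```python
import math


def exact_upper_bound_zero_events(n: int, alpha: float = 0.05) -> float:
    """One-sided exact (Clopper-Pearson) upper bound when 0 of n trials are events.

    With zero observed events the bound solves (1 - p)**n = alpha,
    which gives p = 1 - alpha**(1/n).
    """
    return 1.0 - alpha ** (1.0 / n)


def cell_rates(a: int, b: int, c: int, d: int) -> tuple[int, float, float]:
    """Return (n, acc, fD) under the ASSUMED reading acc=(A+C)/n and fD=D/n."""
    n = a + b + c + d
    return n, (a + c) / n, d / n


# Rows copied from the "multi4 vs baseline" section of the diff:
# (label, A, B, C, D, reported acc, reported fD)
ROWS = [
    ("baseline @58590", 434, 0, 14, 64, 0.875, 0.125),
    ("multi4 @35805 (best)", 452, 0, 9, 51, 0.900, 0.100),
    ("multi4 @65100 (final)", 408, 1, 14, 89, 0.824, 0.174),
]

if __name__ == "__main__":
    bound = exact_upper_bound_zero_events(254)
    print(f"0/254 one-sided 95% upper bound: {bound:.4f}")
    for label, a, b, c, d, acc, fd in ROWS:
        n, acc_calc, fd_calc = cell_rates(a, b, c, d)
        ok = math.isclose(acc_calc, acc, abs_tol=5e-4) and math.isclose(
            fd_calc, fd, abs_tol=5e-4
        )
        print(f"{label}: n={n} acc={acc_calc:.3f} fD={fd_calc:.3f} matches={ok}")
```

Under these assumptions the zero-event bound comes out near 0.01172, in line with the reported 0.0117, and all three rows sum to n=512 and reproduce the reported acc and fD to three decimals. A two-sided 95% interval would give a larger upper limit (about 0.0144), so the reported figure looks like the one-sided bound. That is an inference from the numbers, not something the file states.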
