author: YurenHao0426 <blackhao0426@gmail.com>, 2026-06-13 12:35:36 -0500
committer: YurenHao0426 <blackhao0426@gmail.com>, 2026-06-13 12:35:36 -0500
commit: 66e0d8b9fd4d0f7a2231d689c055e26fdf1cf04a
tree: c29cba61124018755a19b02c9d33e3ad5f2e05cc /research/flossing/analysis_2x2/offline_followups/phase1_e1.md
rrm workspace: TRM/HRM/SRM code, Maze dataset, dynamical-analysis pipeline
Curated export for clone-and-run Maze training (2x A6000) + diagnostics. trm/hrm pretrain.py carry trajectory-augmentation code (backward-compatible). Heavy artifacts (checkpoints/wandb/npz) gitignored; see PROVENANCE.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'research/flossing/analysis_2x2/offline_followups/phase1_e1.md')
-rw-r--r-- research/flossing/analysis_2x2/offline_followups/phase1_e1.md | 27
1 files changed, 27 insertions, 0 deletions
diff --git a/research/flossing/analysis_2x2/offline_followups/phase1_e1.md b/research/flossing/analysis_2x2/offline_followups/phase1_e1.md
new file mode 100644
index 0000000..b3e3f40
--- /dev/null
+++ b/research/flossing/analysis_2x2/offline_followups/phase1_e1.md
@@ -0,0 +1,27 @@
+# E1 offline batch — bootstrap CIs, settling robustness, TRM multi4 pair
+
+## Bootstrap / exact CIs (TRM official @58590, n=2048)
+- settled-wrong fraction: observed 0/254 failures; exact one-sided 95% upper bound 0.0117 (1.17% of failures)
+- AUC(-lam1->correct) = 0.9935, bootstrap 95% CI (0.9908, 0.9957)
+- lam1(wrong) median 95% CI (0.1010, 0.1056)
+- lam1(correct) median 95% CI (0.0112, 0.0117)
+
+## Bootstrap CIs (HRM @26040, n=8192, strict band)
+- strict settled-wrong fraction of failures: observed 0.0054, bootstrap 95% CI (0.0033, 0.0078)
+- AUC(-lam1->correct) = 0.9841, bootstrap 95% CI (0.9815, 0.9865)
+
+## Settling-criterion robustness (B-cell counts under alternative drift definitions)
+- TRM official n=2048 | zH: B=0/A=1724 (tau=1.36) | zL: B=0/A=1728 (tau=1.42) | combined: B=0/A=1727 (tau=1.54)
+- HRM n=8192 | zH: B=63/A=4103 (tau=0.77) | zL: B=59/A=4083 (tau=1.01) | combined: B=60/A=4087 (tau=1.07)
+
+## TRM official-pipeline multi4 vs baseline (matched objective, n=512 each)
+- baseline @58590: acc=0.875; A/B/C/D=434/0/14/64; fD=0.125; lam1(D)=+0.1034; lam1(A)=+0.0111
+- multi4 @35805 (best): acc=0.900; A/B/C/D=452/0/9/51; fD=0.100; lam1(D)=+0.1019; lam1(A)=+0.0039
+- multi4 @65100 (final): acc=0.824; A/B/C/D=408/1/14/89; fD=0.174; lam1(D)=+0.0946; lam1(A)=+0.0133
+
+## hrm_multi4 provenance (E6a)
+- diag_hrm_multi4_step_{20832,23436,26040}_512.npz step grid matches HRM pretrain numbering;
+ multi4_eval_compare/logs should contain the eval invocations — checked manually below.
+- ACTION: if the hrm_multi4 run is pretrain-pipeline (ACT-streaming + perturbation), then the
+ May-28 multi4 vs righteous baseline comparison IS matched-pipeline and Sec 3.4's caveat is
+ narrower than written; step9 E-vs-F pair (queued) covers the fixed-unroll objective regardless.
\ No newline at end of file
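
The exact bound and the 2x2 cell arithmetic in the note can be reproduced with a short stdlib-only Python sketch. It rests on two assumptions that the note does not state. First, it takes the "exact 95% upper bound" to be the one-sided Clopper-Pearson bound for zero events, i.e. the solution of (1-p)^n = 0.05, which gives 0.0117 for 0/254. Second, it takes acc = (A+C)/n and fD = D/n, which is an inference that happens to match all three multi4 rows. The function names are hypothetical and not part of the repository's pipeline, and the percentile bootstrap is a generic version that may differ from the resampling scheme used for the CIs in the note.

```python
import random
import statistics


def exact_upper_bound_zero_events(n, alpha=0.05):
    """One-sided exact upper bound on a proportion when 0 of n trials are positive.

    Assumption: with zero events the Clopper-Pearson bound reduces to solving
    (1 - p)^n = alpha for p.
    """
    return 1.0 - alpha ** (1.0 / n)


def cell_summary(a, b, c, d):
    """Summarize A/B/C/D counts.

    Assumption (inferred from the reported numbers, not defined in the note):
    A and C are the correct cells, and D is the cell whose fraction is fD.
    """
    n = a + b + c + d
    return {"n": n, "acc": (a + c) / n, "fD": d / n}


def bootstrap_median_ci(values, n_boot=2000, level=0.95, seed=0):
    """Generic percentile-bootstrap CI for the median of `values`."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = int((1.0 + level) / 2.0 * n_boot) - 1
    return medians[lo_idx], medians[hi_idx]


if __name__ == "__main__":
    # TRM official: 0 settled-wrong among 254 failures.
    print(round(exact_upper_bound_zero_events(254), 4))
    # Baseline @58590 row: A/B/C/D = 434/0/14/64.
    print(cell_summary(434, 0, 14, 64))
```

Applying `cell_summary` to each row is a quick way to check that the reported acc and fD agree with the cell counts before the numbers are quoted elsewhere.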