# E1 offline batch: bootstrap CIs, settling robustness, TRM multi4 pair

## Bootstrap / exact CIs (TRM official @58590, n=2048)
- settled-wrong fraction: observed 0/254; exact 95% upper bound 0.0117 (1.17% of failures)
- AUC(-lam1->correct) = 0.9935, bootstrap 95% CI (0.9908244697676584, 0.9957330475628791)
- lam1(wrong) median 95% CI (0.10100110620260239, 0.10556983947753906)
- lam1(correct) median 95% CI (0.011215815320611, 0.011744528077542782)

## Bootstrap CIs (HRM @26040, n=8192, strict band)
- strict settled-wrong fraction of failures: observed 0.0054, bootstrap 95% CI (0.0032613427182413084, 0.007798538095694945)
- AUC(-lam1->correct) = 0.9841, bootstrap 95% CI (0.9815470536412456, 0.9865145187475995)

## Settling-criterion robustness (B-cell counts under alternative drift definitions)
- TRM official n=2048 | zH: B=0/A=1724 (tau=1.36) | zL: B=0/A=1728 (tau=1.42) | combined: B=0/A=1727 (tau=1.54)
- HRM n=8192 | zH: B=63/A=4103 (tau=0.77) | zL: B=59/A=4083 (tau=1.01) | combined: B=60/A=4087 (tau=1.07)

## TRM official-pipeline multi4 vs baseline (matched objective, n=512 each)
- baseline @58590: acc=0.875; A/B/C/D=434/0/14/64; fD=0.125; lam1(D)=+0.1034; lam1(A)=+0.0111
- multi4 @35805 (best): acc=0.900; A/B/C/D=452/0/9/51; fD=0.100; lam1(D)=+0.1019; lam1(A)=+0.0039
- multi4 @65100 (final): acc=0.824; A/B/C/D=408/1/14/89; fD=0.174; lam1(D)=+0.0946; lam1(A)=+0.0133

## hrm_multi4 provenance (E6a)
- diag_hrm_multi4_step_{20832,23436,26040}_512.npz step grid matches HRM pretrain numbering; multi4_eval_compare/logs should contain the eval invocations (checked manually below).
- ACTION: if the hrm_multi4 run is pretrain-pipeline (ACT-streaming + perturbation), then the May-28 multi4 vs righteous baseline comparison IS matched-pipeline and Sec 3.4's caveat is narrower than written; step9 E-vs-F pair (queued) covers the fixed-unroll objective regardless.
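
The exact 95% upper bound reported for the TRM settled-wrong fraction (0/254, bound 0.0117) can be reproduced with a short check. This is a minimal sketch, assuming the bound is the one-sided Clopper-Pearson limit for zero observed events, which solves (1 - p)^n = alpha; the function name `exact_upper_bound_zero_events` is hypothetical and not taken from the analysis scripts. The two-sided 95% upper limit is printed alongside for comparison.

```python
def exact_upper_bound_zero_events(n: int, alpha: float = 0.05) -> float:
    """Exact (Clopper-Pearson) upper bound on p when 0 of n trials are events.

    With zero events the bound solves (1 - p)**n = alpha in closed form.
    Assumption: this is the definition behind the "exact 95% upper bound" above.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - alpha ** (1.0 / n)


# TRM official @58590: 0 settled-wrong cases among 254 failures.
n_failures = 254

one_sided = exact_upper_bound_zero_events(n_failures, alpha=0.05)
# Upper limit of a two-sided 95% interval puts alpha/2 in the upper tail.
two_sided = exact_upper_bound_zero_events(n_failures, alpha=0.025)

print(f"one-sided 95% upper bound: {one_sided:.4f}")  # 0.0117
print(f"two-sided 95% upper limit: {two_sided:.4f}")  # 0.0144
```

Under this assumption the reported 0.0117 matches the one-sided bound; a two-sided 95% interval would give a slightly larger upper limit of about 0.0144.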