research/flossing/analysis_2x2/offline_followups/phase1_e1.md



# E1 offline batch — bootstrap CIs, settling robustness, TRM multi4 pair

## Bootstrap / exact CIs (TRM official @58590, n=2048)
- settled-wrong fraction of failures: observed 0/254; exact (Clopper-Pearson) one-sided 95% upper bound 0.0117 (at most 1.17% of failures)
- AUC(-lam1->correct) = 0.9935, bootstrap 95% CI (0.9908, 0.9957)
- lam1(wrong) median 95% CI (0.1010, 0.1056)
- lam1(correct) median 95% CI (0.0112, 0.0117)
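
The 0.0117 figure for the 0/254 case can be reproduced in closed form. The sketch below assumes the bound is a one-sided Clopper-Pearson limit, which is what the reported value matches; the function name is illustrative and not taken from the analysis scripts.

```python
def exact_upper_bound_zero_events(n: int, confidence: float = 0.95) -> float:
    """One-sided exact upper bound on a proportion when 0 of n trials are events.

    With zero observed events the Clopper-Pearson bound p solves
    (1 - p)**n = 1 - confidence, so no beta quantile is needed.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


upper = exact_upper_bound_zero_events(254)
print(f"one-sided 95% upper bound for 0/254: {upper:.4f}")  # 0.0117
```

Passing `confidence=0.975` gives the upper limit of a two-sided 95% interval instead, which is about 0.0144 for the same counts.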

## Bootstrap CIs (HRM @26040, n=8192, strict band)
- strict settled-wrong fraction of failures: observed 0.0054, bootstrap 95% CI (0.0033, 0.0078)
- AUC(-lam1->correct) = 0.9841, bootstrap 95% CI (0.9815, 0.9865)
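
A minimal sketch of how a percentile bootstrap CI for AUC(-lam1->correct) could be computed. It assumes per-sample `lam1` values and boolean `correct` labels, and it resamples whole samples with replacement; the actual resampling scheme and replicate count behind the numbers above are not recorded here. The data in the demo is synthetic and only illustrates the call.

```python
import numpy as np


def auc_from_scores(scores, labels):
    """AUC as P(score_pos > score_neg), with ties counted as 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = np.sort(scores[~labels])
    below = np.searchsorted(neg, pos, side="left")         # negatives strictly below each positive
    at_or_below = np.searchsorted(neg, pos, side="right")  # negatives below or tied
    wins = below.sum() + 0.5 * (at_or_below - below).sum()
    return float(wins / (pos.size * neg.size))


def bootstrap_auc_ci(scores, labels, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI; n_boot and seed are illustrative defaults."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n = scores.size
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if labels[idx].all() or not labels[idx].any():
            continue  # AUC is undefined when a resample has only one class
        stats.append(auc_from_scores(scores[idx], labels[idx]))
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)


# Synthetic stand-in data (NOT the TRM/HRM diagnostics): 254 wrong out of 2048.
rng = np.random.default_rng(1)
correct = np.arange(2048) >= 254
lam1 = np.where(correct, rng.normal(0.011, 0.01, 2048), rng.normal(0.10, 0.03, 2048))
point = auc_from_scores(-lam1, correct)
lo, hi = bootstrap_auc_ci(-lam1, correct)
print(f"AUC={point:.4f}, bootstrap 95% CI ({lo:.4f}, {hi:.4f})")
```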

## Settling-criterion robustness (B-cell counts under alternative drift definitions)
- TRM official n=2048 | zH: B=0/A=1724 (tau=1.36) | zL: B=0/A=1728 (tau=1.42) | combined: B=0/A=1727 (tau=1.54)
- HRM n=8192 | zH: B=63/A=4103 (tau=0.77) | zL: B=59/A=4083 (tau=1.01) | combined: B=60/A=4087 (tau=1.07)
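
The A and B counts above can be tabulated from per-sample arrays along these lines. This is a sketch under two assumptions that these notes do not spell out: a sample counts as settled when its drift is below `tau`, and the cells are A = settled and correct, B = settled and wrong, C = unsettled and correct, D = unsettled and wrong (the labelling is consistent with the acc and fD values in the multi4 table). The array names are hypothetical.

```python
import numpy as np


def cell_counts(drift, correct, tau):
    """Count the 2x2 cells for one drift definition and threshold."""
    drift = np.asarray(drift, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    settled = drift < tau  # ASSUMPTION: settled means drift below tau
    return {
        "A": int(np.sum(settled & correct)),
        "B": int(np.sum(settled & ~correct)),
        "C": int(np.sum(~settled & correct)),
        "D": int(np.sum(~settled & ~correct)),
    }


# Toy input only; swap in the zH, zL, or combined drift array and its tau.
toy = cell_counts(drift=[0.1, 0.2, 2.0, 3.0], correct=[1, 0, 1, 0], tau=1.0)
print(toy)
```

Running the same function once per drift definition keeps the comparison honest: only `drift` and `tau` change, while `correct` stays fixed.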

## TRM official-pipeline multi4 vs baseline (matched objective, n=512 each)
- baseline @58590: acc=0.875; A/B/C/D=434/0/14/64; fD=0.125; lam1(D)=+0.1034; lam1(A)=+0.0111
- multi4 @35805 (best): acc=0.900; A/B/C/D=452/0/9/51; fD=0.100; lam1(D)=+0.1019; lam1(A)=+0.0039
- multi4 @65100 (final): acc=0.824; A/B/C/D=408/1/14/89; fD=0.174; lam1(D)=+0.0946; lam1(A)=+0.0133
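
As a consistency check, acc and fD in the three rows above follow directly from the cell counts, taking acc = (A + C) / n and fD = D / n. That reading of the cells is inferred from the numbers rather than stated in these notes.

```python
rows = {
    "baseline @58590": (434, 0, 14, 64),
    "multi4 @35805": (452, 0, 9, 51),
    "multi4 @65100": (408, 1, 14, 89),
}


def summarize(a, b, c, d):
    """Return (n, accuracy, D fraction) for one A/B/C/D row."""
    n = a + b + c + d
    return n, (a + c) / n, d / n


for name, cells in rows.items():
    n, acc, f_d = summarize(*cells)
    print(f"{name}: n={n} acc={acc:.3f} fD={f_d:.3f}")
# baseline @58590: n=512 acc=0.875 fD=0.125
# multi4 @35805: n=512 acc=0.900 fD=0.100
# multi4 @65100: n=512 acc=0.824 fD=0.174
```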

## hrm_multi4 provenance (E6a)
- diag_hrm_multi4_step_{20832,23436,26040}_512.npz step grid matches HRM pretrain numbering;
  multi4_eval_compare/logs should contain the eval invocations — checked manually below.
- ACTION: if the hrm_multi4 run is pretrain-pipeline (ACT-streaming + perturbation), then the
  May-28 multi4 vs righteous baseline comparison IS matched-pipeline and Sec 3.4's caveat is
  narrower than written; step9 E-vs-F pair (queued) covers the fixed-unroll objective regardless.
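
A quick arithmetic check on the step-grid observation: the three hrm_multi4 checkpoint steps share a spacing of 2604, the same grid as the HRM @26040 checkpoint, while the TRM steps quoted in this file sit on a 3255 grid. The sketch below only compares step numbers; a shared grid is suggestive, not proof of which pipeline produced the run, so the eval logs remain the deciding evidence.

```python
from functools import reduce
from math import gcd


def step_grid(steps):
    """Return (common spacing, checkpoint index of each step on that grid)."""
    spacing = reduce(gcd, steps)
    return spacing, [s // spacing for s in steps]


hrm_multi4 = step_grid([20832, 23436, 26040])
trm = step_grid([58590, 35805, 65100])
print(hrm_multi4)  # (2604, [8, 9, 10])
print(trm)         # (3255, [18, 11, 20])
```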