research/flossing/analysis_2x2/offline_followups/followups.md



# Offline follow-ups (no GPU) — 2026-06-11

Strict in-band thresholds: HRM pct45 of pooled log10 late-drift; TRM pct60 (band edge; B=0 regardless).
All numbers observational; within-dataset comparisons only.

## HRM @26040 (n=8192), strict tau(log10)=-0.0129
| cell | n | lam1 med | lam8 med | token_acc med | halted_at med | q_halt_final med | givens med |
|---|---|---|---|---|---|---|---|
| A | 3665 | -0.8670 | -0.9787 | 1.000 | 4 | +7.47 | 26 |
| B | 21 | -0.8421 | -0.9495 | 0.617 | 6 | +7.47 | 25 |
| C | 633 | -0.7796 | -0.8815 | 1.000 | 10 | +7.47 | 25 |
| D | 3873 | -0.5991 | -0.7140 | 0.630 | 0 | -9.62 | 25 |
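
A minimal sketch of how the strict threshold and the four cells could be derived. The cell layout is inferred from the tables in this file, not stated explicitly (A settled and correct, B settled and wrong, C unsettled and correct, D unsettled and wrong). The rule "settled = log10 late-drift at or below tau", the interpolation convention, and all names and toy values below are assumptions.

```python
import math

def percentile(values, pct):
    """Linear-interpolation percentile over a list of floats."""
    xs = sorted(values)
    if not xs:
        raise ValueError("empty input")
    k = (len(xs) - 1) * pct / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

def assign_cell(log_drift, correct, tau):
    """Assumed 2x2 layout: settled means log-drift at or below tau."""
    if log_drift <= tau:
        return "A" if correct else "B"
    return "C" if correct else "D"

# Hypothetical toy data: (late drift, fully correct?)
samples = [(0.5, True), (0.9, False), (12.0, True), (40.0, False), (0.7, True)]
log_drifts = [math.log10(d) for d, _ in samples]
tau = percentile(log_drifts, 45)  # pct45, the HRM setting named above
cells = [assign_cell(ld, ok, tau) for ld, (_, ok) in zip(log_drifts, samples)]
print(round(tau, 4), cells)
```

With a threshold defined this way, moving the percentile trades cell A/B membership against C/D, which is why the cell counts are only meaningful for the stated pct.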

### hrm26040_n8192_strict: end-of-window drift slope (log10 steps13-16 vs 9-12; <0 = still descending)
- A: n=3665, slope median -0.0023, IQR [-0.0073, +0.0012], frac still descending (<-0.01): 0.20
- B: n=21, slope median -0.0063, IQR [-0.4005, -0.0009], frac still descending (<-0.01): 0.48
- C: n=633, slope median -0.0006, IQR [-1.4084, +0.0088], frac still descending (<-0.01): 0.44
- D: n=3873, slope median -0.0031, IQR [-0.0556, +0.0459], frac still descending (<-0.01): 0.46
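
One plausible reading of the slope statistic, sketched under assumptions: the slope is taken as log10 of the mean drift over steps 13-16 minus log10 of the mean drift over steps 9-12, and "still descending" uses the -0.01 cutoff from the bullets above. The exact windowing used to produce the numbers is not given here, and the function names and toy trajectories are hypothetical.

```python
import math
from statistics import mean, median

def end_slope(drift_by_step):
    """log10(mean drift, steps 13-16) - log10(mean drift, steps 9-12).

    drift_by_step holds positive per-step drifts, with step 1 at index 0.
    """
    late = mean(drift_by_step[12:16])   # steps 13..16
    early = mean(drift_by_step[8:12])   # steps 9..12
    return math.log10(late) - math.log10(early)

def summarize(slopes, cutoff=-0.01):
    """Median slope and fraction of samples below the cutoff."""
    frac = sum(s < cutoff for s in slopes) / len(slopes)
    return median(slopes), frac

# Toy trajectories: one halving every step, one perfectly flat.
decaying = [0.5 ** t for t in range(1, 17)]
flat = [1.0] * 16
slopes = [end_slope(decaying), end_slope(flat)]
med, frac = summarize(slopes)
print(slopes, med, frac)
```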

### HRM unsettled stratum: AUC(-lam1 -> correct) per log-drift decile
| decile | drift range (log10) | n | n_correct | AUC |
|---|---|---|---|---|
| 1 | [-0.01, 0.66] | 451 | 422 | 0.966 |
| 2 | [0.66, 1.42] | 450 | 50 | 0.972 |
| 3 | [1.42, 1.52] | 451 | 4 | 0.988 |
| 4 | [1.52, 1.56] | 450 | 6 | 0.964 |
| 5 | [1.56, 1.60] | 451 | 1 | 0.984 |
| 6 | [1.60, 1.62] | 450 | 7 | 0.949 |
| 7 | [1.62, 1.65] | 451 | 5 | 0.837 |
| 8 | [1.65, 1.68] | 450 | 5 | 0.851 |
| 9 | [1.68, 1.71] | 451 | 16 | 0.804 |
| 10 | [1.71, 1.93] | 451 | 117 | 0.685 |
- weighted mean within-decile AUC = 0.879 (vs unconditioned within-unsettled AUC 0.933)
- AUC(-end_slope -> correct | unsettled) = 0.605 (cf. frac still descending: C 0.44 vs D 0.46, in the end-of-window slope list above)
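
A sketch of a stratified AUC of the kind tabulated above, assuming the score is -lam1, the label is full correctness, and strata are near-equal-size groups after sorting by log-drift. The helper names and toy values are hypothetical, and the weighting behind the reported within-decile mean is not specified here, so none is applied.

```python
def auc(scores, labels):
    """Mann-Whitney AUC: P(score_pos > score_neg), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None  # undefined when one class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_quantile(strat, scores, labels, n_bins=10):
    """Sort by the stratifying variable, cut into n_bins groups, AUC per group."""
    order = sorted(range(len(strat)), key=lambda i: strat[i])
    out = []
    for b in range(n_bins):
        lo = b * len(order) // n_bins
        hi = (b + 1) * len(order) // n_bins
        idx = order[lo:hi]
        out.append((len(idx), auc([scores[i] for i in idx],
                                  [labels[i] for i in idx])))
    return out

# Toy data: 8 samples, 2 strata.
drift = [0.1, 0.2, 0.3, 0.4, 1.5, 1.6, 1.7, 1.8]
lam1 = [-0.9, -0.6, -0.8, -0.5, -0.7, -0.6, -0.8, -0.5]
correct = [True, False, True, False, False, True, True, False]
per_bin = auc_by_quantile(drift, [-x for x in lam1], correct, n_bins=2)
print(per_bin)  # [(4, 1.0), (4, 0.75)]
```

The pairwise loop is quadratic in stratum size, which is fine at a few hundred samples per decile; a rank-based formulation would be the usual choice for larger strata.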

## HRM strict-band settled-but-wrong examples (n=21)
| idx | givens | token_acc | lam1 | drift_final | halted_at | q_halt_final |
|---|---|---|---|---|---|---|
| 342267 | 17 | 0.407 | -0.867 | 0.976 | 5 | +7.41 |
| 212705 | 17 | 0.469 | -0.838 | 0.964 | 8 | +7.44 |
| 329832 | 17 | 0.481 | -0.703 | 0.970 | 8 | +7.41 |
| 20075 | 27 | 0.519 | -0.812 | 0.966 | 5 | +7.50 |
| 198242 | 25 | 0.568 | -0.843 | 0.980 | 7 | +7.47 |
| 223591 | 24 | 0.580 | -0.939 | 0.951 | 4 | +7.47 |
| 238704 | 27 | 0.593 | -0.931 | 0.953 | 5 | +7.47 |
| 364431 | 25 | 0.593 | -0.806 | 0.956 | 6 | +7.44 |
| 274637 | 26 | 0.593 | -0.859 | 0.979 | 6 | +7.47 |
| 182424 | 24 | 0.605 | -0.985 | 0.949 | 6 | +7.47 |
| 351919 | 25 | 0.617 | -0.742 | 0.965 | 5 | +7.47 |
| 123022 | 27 | 0.617 | -0.826 | 0.951 | 7 | +7.50 |
| 150426 | 25 | 0.630 | -0.767 | 0.963 | 9 | +7.47 |
| 175427 | 26 | 0.630 | -0.843 | 0.946 | 8 | +7.50 |
| 422185 | 26 | 0.642 | -0.841 | 0.946 | 7 | +7.47 |
| 344032 | 24 | 0.654 | -0.903 | 0.965 | 4 | +7.53 |
| 30703 | 25 | 0.691 | -0.732 | 0.972 | 6 | +7.53 |
| 386549 | 23 | 0.691 | -0.842 | 0.966 | 4 | +7.47 |
| 3370 | 26 | 0.716 | -0.861 | 0.955 | 6 | +7.47 |
| 243909 | 24 | 0.753 | -0.786 | 0.969 | 8 | +7.50 |
| 258307 | 25 | 0.877 | -0.918 | 0.952 | 5 | +7.47 |

## HRM difficulty control (#givens = count of input tokens != 1)
- givens: min 17, median 25, max 36
- Spearman(lam1, givens): overall -0.350; correct-only -0.155; wrong-only -0.180
- Spearman(correct, givens) = +0.276
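
For reference, a dependency-free sketch of a Spearman correlation with average ranks for ties, which matters here because #givens is a small-range integer with many ties. This is a generic implementation, not necessarily the routine that produced the numbers above; the toy values are hypothetical.

```python
def average_ranks(xs):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    """Spearman rho = Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Toy data: more givens, more negative lam1.
givens = [17, 24, 25, 25, 30]
lam1 = [-0.60, -0.70, -0.80, -0.85, -0.95]
rho = spearman(givens, lam1)
print(round(rho, 3))
```

The correct-only and wrong-only variants would apply the same function after filtering both lists by the correctness mask.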

| givens bin | n | acc | AUC(-lam1 -> correct) |
|---|---|---|---|
| [17, 24) | 1152 | 0.321 | 0.976 |
| [24, 25) | 1795 | 0.373 | 0.980 |
| [25, 26) | 1764 | 0.503 | 0.987 |
| [26, 36] | 3481 | 0.681 | 0.983 |
- weighted mean within-bin AUC = 0.982 (overall 0.984)
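
The within-bin mean above is consistent with weighting each bin's AUC by its n. A quick check using only the rows of the table (variable names are illustrative):

```python
# (n, acc, AUC) copied from the givens-bin table above
bins = [
    (1152, 0.321, 0.976),
    (1795, 0.373, 0.980),
    (1764, 0.503, 0.987),
    (3481, 0.681, 0.983),
]
total = sum(n for n, _, _ in bins)
weighted_auc = sum(n * a for n, _, a in bins) / total
pooled_acc = sum(n * acc for n, acc, _ in bins) / total
print(total, round(weighted_auc, 3), round(pooled_acc, 3))  # 8192 0.982 0.525
```

The pooled accuracy recovered this way also agrees with the cell counts in the HRM table, (3665 + 633) / 8192 = 0.525 to three decimals.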

## TRM official @58590 (n=512), strict tau(log10)=1.0240
| cell | n | lam1 med | token_acc med | q_halt_final med | givens med |
|---|---|---|---|---|---|
| A | 307 | +0.0105 | 1.000 | +7.78 | 26 |
| B | 0 | | | | |
| C | 141 | +0.0174 | 1.000 | +7.81 | 25 |
| D | 64 | +0.1034 | 0.630 | -11.12 | 25 |

### trm_official58590_n512_strict: end-of-window drift slope (log10 steps13-16 vs 9-12; <0 = still descending)
- A: n=307, slope median -0.1471, IQR [-0.2267, -0.0641], frac still descending (<-0.01): 0.90
- B: n=0
- C: n=141, slope median -0.0080, IQR [-0.2603, +0.0808], frac still descending (<-0.01): 0.49
- D: n=64, slope median -0.0125, IQR [-0.0525, +0.0276], frac still descending (<-0.01): 0.53

## TRM difficulty control (#givens)
- Spearman(lam1, givens): overall -0.240; wrong-only -0.238
- Spearman(correct, givens) = +0.148