Diffstat (limited to 'research/flossing/analysis_2x2')
-rw-r--r-- research/flossing/analysis_2x2/OBSERVATIONS.md | 155
-rw-r--r-- research/flossing/analysis_2x2/analyze_2x2.py | 350
-rw-r--r-- research/flossing/analysis_2x2/analyze_phase1.py | 126
-rw-r--r-- research/flossing/analysis_2x2/cells_hrm26040_joint_n2048.csv | 5
-rw-r--r-- research/flossing/analysis_2x2/cells_hrm26040_n8192.csv | 5
-rw-r--r-- research/flossing/analysis_2x2/cells_trm_official58590_n2048.csv | 5
-rw-r--r-- research/flossing/analysis_2x2/cells_trm_official_gbs768_step58590_n512.csv | 5
-rw-r--r-- research/flossing/analysis_2x2/cells_trm_singleGPU_step130205_n512.csv | 5
-rw-r--r-- research/flossing/analysis_2x2/cells_trm_singleGPU_step260410_n512.csv | 5
-rw-r--r-- research/flossing/analysis_2x2/cells_trm_step13020_n512.csv | 5
-rw-r--r-- research/flossing/analysis_2x2/early_pairing_hrm26040_joint.md | 17
-rw-r--r-- research/flossing/analysis_2x2/early_pairing_trm_official58590.md | 17
-rw-r--r-- research/flossing/analysis_2x2/early_window_pairing.py | 85
-rw-r--r-- research/flossing/analysis_2x2/evolution_hrm.csv | 11
-rw-r--r-- research/flossing/analysis_2x2/evolution_trm.csv | 11
-rw-r--r-- research/flossing/analysis_2x2/offline_followups.py | 229
-rw-r--r-- research/flossing/analysis_2x2/offline_followups/followups.md | 88
-rw-r--r-- research/flossing/analysis_2x2/offline_followups/phase1_e1.md | 27
-rw-r--r-- research/flossing/analysis_2x2/offline_phase1_e1.py | 123
-rw-r--r-- research/flossing/analysis_2x2/phase1/phase1_results.md | 48
-rw-r--r-- research/flossing/analysis_2x2/results.md | 69
-rw-r--r-- research/flossing/analysis_2x2/results_hrm26040_joint_n2048.json | 124
-rw-r--r-- research/flossing/analysis_2x2/results_hrm26040_n8192.json | 124
-rw-r--r-- research/flossing/analysis_2x2/results_trm_official58590_n2048.json | 105
-rw-r--r-- research/flossing/analysis_2x2/results_trm_official_gbs768_step58590_n512.json | 105
-rw-r--r-- research/flossing/analysis_2x2/results_trm_singleGPU_step130205_n512.json | 105
-rw-r--r-- research/flossing/analysis_2x2/results_trm_singleGPU_step260410_n512.json | 105
-rw-r--r-- research/flossing/analysis_2x2/results_trm_step13020_n512.json | 124
-rwxr-xr-x research/flossing/analysis_2x2/run_phase1_queue.sh | 92
-rwxr-xr-x research/flossing/analysis_2x2/run_retest_2x2.sh | 75
-rw-r--r-- research/flossing/analysis_2x2/sweep_hrm26040_joint_n2048.csv | 20
-rw-r--r-- research/flossing/analysis_2x2/sweep_hrm26040_n8192.csv | 20
-rw-r--r-- research/flossing/analysis_2x2/sweep_trm_official58590_n2048.csv | 20
-rw-r--r-- research/flossing/analysis_2x2/sweep_trm_official_gbs768_step58590_n512.csv | 20
-rw-r--r-- research/flossing/analysis_2x2/sweep_trm_singleGPU_step130205_n512.csv | 20
-rw-r--r-- research/flossing/analysis_2x2/sweep_trm_singleGPU_step260410_n512.csv | 20
-rw-r--r-- research/flossing/analysis_2x2/sweep_trm_step13020_n512.csv | 20
37 files changed, 2490 insertions, 0 deletions
diff --git a/research/flossing/analysis_2x2/OBSERVATIONS.md b/research/flossing/analysis_2x2/OBSERVATIONS.md
new file mode 100644
index 0000000..d0b0670
--- /dev/null
+++ b/research/flossing/analysis_2x2/OBSERVATIONS.md
@@ -0,0 +1,155 @@
+# 2×2 analysis: (late-trajectory settling) × (answer correctness) — observations
+
+Date: 2026-06-11. Code: `analyze_2x2.py` (+ ad-hoc strict-threshold checks logged in session).
+All statements below are measurements on existing diagnostic npz files; no new GPU runs.
+
+## Data provenance (CRITICAL: λ scales are NOT comparable across estimator versions)
+
+| dataset | model / ckpt | n | estimator | window |
+|---|---|---|---|---|
+| `diag_8k.npz` | HRM righteous-python @ step_26040 (H=2,L=2) | 8192 | `diagnose_hrm.py` (May 22 version) | full 16 ACT steps |
+| `diag_trm_singleGPU_step*_512.npz` ×10 | TRM mlp_t singleGPU @ 26041…260410 | 512 each | `diagnose_trm_joint.py` (joint JVP+QR, per-sub-update norm) | full 16 ACT steps |
+| `diag_hrm_step_*_512.npz` ×10 | HRM @ 2604…26040 | 512 each | joint estimator (−0.15-scale, matches step7 reports) | full 16 ACT steps |
+
+Within-dataset comparisons only. "Converged"/"settled" here = late z_H drift (mean over ACT steps 13–16)
+falls in the low-drift band; threshold from Otsu on pooled log10 drift, robustness via percentile sweep.
+Settled ≠ literal fixed point: the settled band has a narrow characteristic residual velocity
+(HRM ≈0.96/step, IQR-width <0.03 in log10; TRM ≈18.5/step, q10–q90 = 15.8–21.4).
+
+## Headline numbers
+
+### HRM @26040, n=8192, exact_acc 0.525 (full-window λ, May-22 estimator scale)
+Otsu τ → cells: A(settled,correct)=4103, B(settled,wrong)=63, C(unsettled,correct)=195, D(unsettled,wrong)=3831.
+
+- Failure mass is overwhelmingly unsettled: wrong-that-settled = 1.6% at Otsu τ; at a STRICT in-band
+ threshold (45th pct, drift<0.97) it is 21/3894 = **0.54% of failures**.
+- The strict-band B examples (n=21) have λ₁ median **−0.842** vs A −0.867 (success-like contraction),
+ q_halt(final) median **+7.47 — identical to A** (ACT-confident), token_acc 0.41–0.88 (median 0.62,
+ substantially wrong, not near-misses). These are confidently-wrong settled answers — invisible to
+ both a stability-based and an ACT-confidence-based selector.
+- λ₁ as a predictor: AUC(−λ₁→correct) = 0.984 ≈ AUC(−λ₁→settled) = 0.986. At the Otsu split the
+ within-settled outcome gap (Δλ₁ −0.27, AUC 0.852) is a THRESHOLD-MIXTURE ARTIFACT: the Otsu τ sits
+ above the band gap (drift distribution has a near-empty region between pct≈45 and pct≈55,
+ τ jumps −0.013→1.39), so Otsu-"settled" includes mid-drift examples. At the strict in-band τ the
+ within-settled gap nearly closes (−0.867 vs −0.842).
+- Residual outcome signal in the unsettled stratum: C vs D Δλ₁(median) = −0.094, AUC 0.818
+ (C n=195). Note C drifts MORE than D (median 55.4 vs 40.8) while being more contracting in λ₁.
+- AUC(−log late-drift→correct) = 0.964.
+
+### TRM singleGPU @260410 (final), n=512, exact_acc 0.770 (joint estimator)
+Cells: A=383, B=**0**, C=11, D=118.
+
+- **B = 0 at every threshold**: drift distributions of settled-correct (q90 = 21.4) and wrong
+ (q10 = 47.8) are completely separated — no wrong example reaches the settled band. Robust across
+ the entire percentile sweep (nB=0, nD=118 constant). Same at step 130205. Across the 10-ckpt series,
+ fB>0 only at step 26041 (1.2%).
+- λ₁ medians: A +0.0047 (≈0, consistent with "TRM success short-window λ₁>0 is acceptable"),
+ C +0.0998, D +0.1023. Within-unsettled outcome gap ≈ 0 (Δ −0.0025; AUC 0.619, C n=11).
+ AUC(−λ₁→correct) = 0.989 ≈ AUC(−λ₁→settled) = 0.996: at the final checkpoint, λ₁'s outcome
+ signal in TRM is (almost) entirely settling-regime detection.
+- q_halt(final) median: A +7.44, C +7.41, D −11.1.
+
+## Checkpoint evolution (512/ckpt; within-series comparisons)
+
+- TRM (26041→260410): fB≈0 from 52082 on; fD shrinks 0.41→0.23; λ₁(D) rises monotonically
+ +0.036→+0.102 while λ₁(A) stays ≈0 — the success/failure λ₁ gap widens over training via the
+ failure cell becoming more expansive.
+- HRM (2604→26040, joint-estimator series): mass migration over training:
+ fB 0.89→0.008 (early ckpts: nearly all examples low-drift & wrong), fD rises to ~0.52 mid-training,
+ fA grows with accuracy; λ₁(D) rises from −0.087 to +0.023 (sign flip ~step 15–18k), λ₁(A) stays
+ −0.10…−0.20.
+
+## Direct answers to the motivating questions
+
+1. **Is the failure FTLE cluster a mixture of wandering + wrong-fixed-point modes?**
+ Measured: yes, but extremely lopsided. Wandering dominates (HRM ≥98.4% of failures at Otsu τ,
+ 99.46% at strict τ; TRM final 100%). The wrong-fixed-point mode exists in HRM (21 examples) with
+ exactly the predicted signature (success-like λ₁), and is absent in TRM at late checkpoints.
+2. **Does conditioning on settling absorb the success/failure FTLE gap?**
+ TRM final: essentially yes (within-stratum AUC 0.619 with n=11; regime AUC 0.996).
+ HRM: mostly yes after threshold correction (strict-band within-settled gap ≈0.025), with one
+ genuine residual: the unsettled stratum retains an outcome gap (AUC 0.818, C n=195).
+3. **Relation to published taxonomies** (factual cross-references):
+ Ren & Liu's four HRM modes map onto our cells; their "non-trivial failure (converged to wrong
+ fixed point)" = our strict-band B (rare: 0.54% of failures); their "trivial failure (wanders or
+ oscillates)" = our D (dominant). Efstathiou & Balwani's "failed runs plateau at stable high-loss
+ attractors" (TRM): by our state-drift criterion, TRM failures are NOT settled (B=0, drift ≥~48/step,
+ ~0.77× the early-trajectory velocity); their "stable" refers to loss plateaus/bounded regions,
+ not state convergence. Our drift+λ₁ measurement distinguishes these.
+
+## Addendum (same day, offline follow-ups — see offline_followups/followups.md)
+
+**REVISION of headline point 2.** "λ₁'s outcome signal is (almost) entirely settling-regime detection"
+was based on comparing raw AUCs and is too strong as a mediation claim. The proper control —
+AUC(−λ₁→correct) **within matched late-drift deciles** of the unsettled stratum — shows substantial
+independent signal (HRM: per-decile 0.97→0.69 from low to high drift, weighted mean 0.879;
+unconditioned within-unsettled 0.933). TRM official @58590 with a strict band τ shows the same
+qualitatively: unsettled-correct λ₁ ≈ +0.017 (n=141) vs unsettled-wrong +0.103 (n=64) at overlapping
+drift levels. Corrected statement: **λ₁ correlates strongly with settledness, but at matched drift
+level it still separates outcome** — drift and λ₁ are not redundant observables.
+
+**Difficulty control (#givens, crude proxy).** HRM n=8192: Spearman(correct, givens)=+0.28;
+Spearman(λ₁, givens)=−0.35 overall but −0.16/−0.18 within outcome. Within-givens-bin
+AUC(−λ₁→correct) = 0.976–0.987 (weighted 0.982, vs overall 0.984): at the #givens level, the
+FTLE-outcome separation is NOT a difficulty artifact. (Solver-backtrack difficulty not available
+offline; #givens is a weak proxy — flag for the writeup.)
+
+**Strict-B per-example (n=21).** All 21 have halted_at ∈ [4,9] (median 6) — under real ACT inference
+every one would have halted early, confidently (q_halt +7.4–7.5), and wrong (token_acc 0.41–0.88).
+The three lowest-token-acc cases are all 17-givens (minimum-clue) puzzles. Per-example table in
+followups.md; drift profiles indistinguishable from the A band (fig_hrm_strictB_profiles.png).
+
+**TRM official @58590 note.** 90% of its settled-correct examples are still descending at window end
+(slope median −0.147) — unlike singleGPU @260410 whose A-band is flat (~18.5/step). The "settled band"
+criterion is checkpoint-specific; cross-checkpoint comparisons must re-derive τ per dataset.
+
+## Addendum 2 (2026-06-12, n=2048 retest + early-window pairing; npz in retest/)
+
+Retest ran on GPU 0 (shared, 12h-fallback claim). Four diagnostics, seed 0, idx-paired.
+
+**1. TRM official @58590 (87.6%), full window, n=2048: B = 0 confirmed.**
+254 failures, none settled; the MINIMUM late-drift among wrong examples (log10 1.664 ≈ 46/step)
+exceeds the late-drift of 96.5% of correct examples — near-complete distribution separation,
+threshold-free. λ₁: wrong +0.103 / correct +0.012, AUC 0.993. Within-unsettled outcome AUC 0.848
+(C n=70) — residual signal confirmed at usable n.
+
+**2. HRM @26040, full window, joint estimator, n=2048: replicates diag_8k on a second estimator.**
+acc 0.526; A/B/C/D = 1020/14/57/957; λ₁(A) −0.152 vs λ₁(D) +0.032 (the email's −0.15/+0.04 scale);
+strict-band B n=5 (0.5% of failures) with λ₁ −0.141 ≈ A and q_halt +7.47 (selector-blind, replicated);
+unsettled within-decile AUC weighted 0.900 (was 0.879 on diag_8k).
+
+**3. Early-window (first 4 ACT steps) does NOT forecast eventual success among still-unsolved
+examples — and on HRM the dynamical signals point the OTHER way.**
+Unconditioned early AUCs are inflated by already-solved-at-4 examples (TRM 69.4% solved@4,
+HRM 34.5%). Restricted to not-yet-correct@4:
+
+| signal @ step 4 | TRM (n=626, 59.4% eventually correct) | HRM (n=1342, 27.6% eventually correct) |
+|---|---|---|
+| AUC(−λ₁_early → eventual correct) | 0.543 | **0.448** (reversed) |
+| AUC(−drift@4 → eventual correct) | 0.492 | **0.312** (reversed: MORE early movement ↔ eventual success, +dir AUC 0.688) |
+| AUC(q_halt@4 → eventual correct) | 0.521 | **0.734** |
+
+Observations: (i) the λ₁/outcome association is concurrent with the trajectory's fate (final window),
+not antecedent — early-window λ₁ has no forward predictive power at this granularity; (ii) on HRM the
+sign of the early association is inverted: among undecided examples, higher early drift (and
+marginally higher early λ₁) accompany eventual success; (iii) the one early signal with real forecast
+power is HRM's learned q_halt (0.734) — absent in TRM (0.521); factual architecture note: TRM removed
+HRM's Q-learning continue-head (BCE halt only). Window length 4 was chosen to match train-time; other
+horizons untested.
+
+**Consequences for writeup/claims:** "failure ↔ chaos" should be stated as an outcome-concurrent
+dynamical signature, not an early predictor; early-exit/reallocation applications are unsupported at
+this granularity (and sign-reversed on HRM); the within-stratum independence result (drift-matched
+AUC 0.88-0.90) plus the difficulty control (Addendum 1) remain the strongest positive claims.
+
+## Caveats
+
+- n(B)=21 and n(C)=11/195 are small; per-cell statements about those cells are low-precision.
+- "Settled" is a relative (band) criterion; both A-bands have nonzero characteristic residual velocity.
+- exact_correct is evaluated at ACT step 16 under fixed unroll; post-window corruption or recovery
+ (Ren & Liu's fixed-point violation) is not observable in these arrays.
+- λ window = full trajectory (336 sub-updates TRM / 16 segments HRM); the early-window
+ (length-matched) version of this analysis is NOT covered by existing npz files — the `_short`
+ diagnose scripts exist (4 ACT steps) but produced no saved npz we could find. Open follow-up.
+- diag_8k (HRM) and the HRM evolution series use different estimator normalizations; only
+ within-file comparisons are reported above.
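
The drift-matched control described in the Addendum of OBSERVATIONS.md (AUC(−λ₁→correct) within matched late-drift deciles of the unsettled stratum) can be illustrated with a minimal sketch. The real implementation is in `offline_followups.py`, whose body is not shown in this part of the diff, so everything below is an assumption made for illustration: the names `auc_rank_np` and `decile_matched_auc`, the equal-count quantile bins, the `min_per_class` cutoff, and the choice to weight each bin by its example count when averaging.

```python
import numpy as np


def auc_rank_np(score, label):
    """Rank-based AUC (Mann-Whitney U) with average ranks for ties."""
    score = np.asarray(score, dtype=float).ravel()
    label = np.asarray(label, dtype=int).ravel()
    n_pos = int((label == 1).sum())
    n_neg = int((label == 0).sum())
    if n_pos == 0 or n_neg == 0:
        return float("nan")
    # Each group of tied values occupies consecutive 1-based ranks; use their mean.
    _, inv, counts = np.unique(score, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    avg_rank = last_rank - (counts - 1) / 2.0
    ranks = avg_rank[inv.ravel()]
    r_pos = ranks[label == 1].sum()
    return float((r_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))


def decile_matched_auc(lam1, log_drift, correct, unsettled, n_bins=10, min_per_class=2):
    """AUC(-lam1 -> correct) inside each drift quantile bin of the unsettled stratum.

    Returns (rows, weighted_mean), where rows holds (bin index, n, AUC) and the
    mean is weighted by bin size (an assumed weighting, not taken from the source).
    """
    lam1 = np.asarray(lam1, dtype=float)
    log_drift = np.asarray(log_drift, dtype=float)
    correct = np.asarray(correct, dtype=int)
    idx = np.flatnonzero(unsettled)
    edges = np.quantile(log_drift[idx], np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, log_drift[idx], side="right") - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        sel = idx[bins == b]
        y = correct[sel]
        # Skip bins where one outcome class is too small for a meaningful AUC.
        if min(int(y.sum()), int((1 - y).sum())) < min_per_class:
            continue
        rows.append((b, int(len(sel)), auc_rank_np(-lam1[sel], y)))
    if not rows:
        return rows, float("nan")
    n = np.array([r[1] for r in rows], dtype=float)
    a = np.array([r[2] for r in rows], dtype=float)
    return rows, float((n * a).sum() / n.sum())


# Synthetic usage: lam1 depends on drift AND, independently, on outcome.
rng = np.random.default_rng(0)
log_drift = rng.normal(size=2000)
correct = rng.integers(0, 2, size=2000)
lam1 = 0.3 * log_drift - 1.0 * correct + rng.normal(scale=0.5, size=2000)
rows, weighted = decile_matched_auc(lam1, log_drift, correct, unsettled=log_drift > -0.5)
print(f"bins used: {len(rows)}, weighted within-decile AUC: {weighted:.3f}")
```

If λ₁ carried no outcome information beyond drift, the per-bin AUCs in such a table would sit near 0.5; values well above 0.5 at matched drift are what the corrected statement in the Addendum relies on.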
diff --git a/research/flossing/analysis_2x2/analyze_2x2.py b/research/flossing/analysis_2x2/analyze_2x2.py
new file mode 100644
index 0000000..315ffc6
--- /dev/null
+++ b/research/flossing/analysis_2x2/analyze_2x2.py
@@ -0,0 +1,350 @@
+"""2x2 analysis: (terminal convergence) x (answer correctness), per-example FTLE per cell.
+
+Inputs: existing diagnostic npz files produced by diagnose_hrm*.py / diagnose_trm_joint.py:
+ drift_zH/drift_zL (N,16) per-ACT-step state displacement norms,
+ lyap_spec (N,8) full-window joint FTLE spectrum, exact_correct (N,),
+ token_acc, halted_at, q_halt/q_continue (N,16), idx.
+
+Convergence metric (measurement choice, reported alongside robustness sweep):
+ d_late = mean(drift_zH[:, -4:]) (late-trajectory z_H velocity, ACT steps 13-16)
+ primary threshold tau = Otsu on log10(d_late) pooled per dataset;
+ sensitivity: tau swept over pooled percentiles 5..95.
+
+Cells: A = converged & correct, B = converged & wrong,
+ C = non-converged & correct, D = non-converged & wrong.
+
+Outputs (in this directory): results_<tag>.json, cells_<tag>.csv, sweep_<tag>.csv,
+ fig_<tag>_{drift_hist,lyap_by_cell,scatter,spectrum}.png, evolution_{hrm,trm}.{csv,png},
+ results.md (combined human-readable summary).
+
+Observational only: this script reports counts, distributions and rank statistics; it does
+not test mechanisms.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import matplotlib
+
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+
+HERE = Path(__file__).resolve().parent
+FLOSS = HERE.parent
+
+LATE_K = 4 # ACT steps used for late drift
+
+
+def otsu_threshold(x: np.ndarray, nbins: int = 256) -> float:
+ h, edges = np.histogram(x, bins=nbins)
+ h = h.astype(np.float64)
+ centers = 0.5 * (edges[:-1] + edges[1:])
+ w = h.sum()
+ if w == 0:
+ return float(np.median(x))
+ p = h / w
+ omega = np.cumsum(p)
+ mu = np.cumsum(p * centers)
+ mu_t = mu[-1]
+ denom = omega * (1.0 - omega)
+ denom[denom <= 0] = np.nan
+ sigma_b2 = (mu_t * omega - mu) ** 2 / denom
+ k = np.nanargmax(sigma_b2)
+ return float(centers[k])
+
+
+def auc_rank(score: np.ndarray, label: np.ndarray) -> float:
+ """AUC of `score` for predicting label==1 (rank-based, ties averaged)."""
+ pos = score[label == 1]
+ neg = score[label == 0]
+ if len(pos) == 0 or len(neg) == 0:
+ return float("nan")
+ allv = np.concatenate([pos, neg])
+ order = np.argsort(allv, kind="mergesort")
+ ranks = np.empty_like(order, dtype=np.float64)
+ ranks[order] = np.arange(1, len(allv) + 1)
+ # average ranks for ties
+ sv = allv[order]
+ i = 0
+ while i < len(sv):
+ j = i
+ while j + 1 < len(sv) and sv[j + 1] == sv[i]:
+ j += 1
+ if j > i:
+ ranks[order[i : j + 1]] = ranks[order[i : j + 1]].mean()
+ i = j + 1
+ r_pos = ranks[: len(pos)].sum()
+ return float((r_pos - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg)))
+
+
+def cell_stats(lyap: np.ndarray, tok: np.ndarray, halted: np.ndarray, mask: np.ndarray) -> dict:
+ if mask.sum() == 0:
+ return {"n": 0}
+ l1 = lyap[mask, 0]
+ return {
+ "n": int(mask.sum()),
+ "lam1_median": float(np.median(l1)),
+ "lam1_mean": float(l1.mean()),
+ "lam1_iqr": [float(np.percentile(l1, 25)), float(np.percentile(l1, 75))],
+ "lam8_median": float(np.median(lyap[mask, -1])),
+ "spectrum_median": [float(np.median(lyap[mask, i])) for i in range(lyap.shape[1])],
+ "token_acc_median": float(np.median(tok[mask])),
+ "halted_at_median": float(np.median(halted[mask])),
+ }
+
+
+def analyze(npz_path: Path, tag: str, make_figs: bool = True) -> dict:
+ d = np.load(npz_path)
+ lyap = d["lyap_spec"].astype(np.float64)
+ correct = d["exact_correct"].astype(int)
+ tok = d["token_acc"].astype(np.float64)
+ halted = d["halted_at"].astype(np.float64)
+ drift_h = d["drift_zH"].astype(np.float64)
+ drift_l = d["drift_zL"].astype(np.float64)
+
+ d_late = drift_h[:, -LATE_K:].mean(axis=1)
+ d_late_l = drift_l[:, -LATE_K:].mean(axis=1)
+ logd = np.log10(np.clip(d_late, 1e-12, None))
+
+ tau = otsu_threshold(logd)
+ conv = logd < tau
+
+ cells = {
+ "A_conv_correct": (conv) & (correct == 1),
+ "B_conv_wrong": (conv) & (correct == 0),
+ "C_nonconv_correct": (~conv) & (correct == 1),
+ "D_nonconv_wrong": (~conv) & (correct == 0),
+ }
+
+ res = {
+ "npz": str(npz_path),
+ "n": int(len(correct)),
+ "exact_acc": float(correct.mean()),
+ "late_drift_def": f"mean(drift_zH[:, -{LATE_K}:])",
+ "otsu_tau_log10": tau,
+ "frac_converged": float(conv.mean()),
+ "cells": {k: cell_stats(lyap, tok, halted, m) for k, m in cells.items()},
+ "mixture": {
+ "wrong_that_converged": float(
+ cells["B_conv_wrong"].sum() / max((correct == 0).sum(), 1)
+ ),
+ "correct_that_nonconverged": float(
+ cells["C_nonconv_correct"].sum() / max((correct == 1).sum(), 1)
+ ),
+ },
+ "contrasts": {
+ "dlam1_correct_minus_wrong_overall": float(
+ np.median(lyap[correct == 1, 0]) - np.median(lyap[correct == 0, 0])
+ ),
+ "dlam1_within_converged": float(
+ np.median(lyap[cells["A_conv_correct"], 0]) - np.median(lyap[cells["B_conv_wrong"], 0])
+ )
+ if cells["B_conv_wrong"].sum() > 0 and cells["A_conv_correct"].sum() > 0
+ else float("nan"),
+ "dlam1_within_nonconverged": float(
+ np.median(lyap[cells["C_nonconv_correct"], 0]) - np.median(lyap[cells["D_nonconv_wrong"], 0])
+ )
+ if cells["C_nonconv_correct"].sum() > 0 and cells["D_nonconv_wrong"].sum() > 0
+ else float("nan"),
+ "dlam1_wrong_conv_minus_wrong_nonconv": float(
+ np.median(lyap[cells["B_conv_wrong"], 0]) - np.median(lyap[cells["D_nonconv_wrong"], 0])
+ )
+ if cells["B_conv_wrong"].sum() > 0 and cells["D_nonconv_wrong"].sum() > 0
+ else float("nan"),
+ },
+ "auc": {
+ "neg_lam1_predicts_correct_overall": auc_rank(-lyap[:, 0], correct),
+ "neg_lam1_predicts_correct_within_conv": auc_rank(-lyap[conv, 0], correct[conv]),
+ "neg_lam1_predicts_correct_within_nonconv": auc_rank(-lyap[~conv, 0], correct[~conv]),
+ "neg_logdrift_predicts_correct": auc_rank(-logd, correct),
+ "neg_lam1_predicts_converged": auc_rank(-lyap[:, 0], conv.astype(int)),
+ },
+ }
+
+ # threshold sensitivity sweep
+ sweep_rows = []
+ for pct in range(5, 96, 5):
+ t = np.percentile(logd, pct)
+ c = logd < t
+ row = {
+ "pct": pct,
+ "tau": float(t),
+ "nA": int((c & (correct == 1)).sum()),
+ "nB": int((c & (correct == 0)).sum()),
+ "nC": int((~c & (correct == 1)).sum()),
+ "nD": int((~c & (correct == 0)).sum()),
+ }
+ for nm, m in [
+ ("lam1_med_B", c & (correct == 0)),
+ ("lam1_med_D", ~c & (correct == 0)),
+ ]:
+ row[nm] = float(np.median(lyap[m, 0])) if m.sum() > 0 else float("nan")
+ sweep_rows.append(row)
+ sweep_csv = HERE / f"sweep_{tag}.csv"
+ with sweep_csv.open("w") as f:
+ keys = list(sweep_rows[0].keys())
+ f.write(",".join(keys) + "\n")
+ for r in sweep_rows:
+ f.write(",".join(str(r[k]) for k in keys) + "\n")
+
+ # per-cell csv
+ with (HERE / f"cells_{tag}.csv").open("w") as f:
+ f.write("cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median\n")
+ for k, m in cells.items():
+ s = res["cells"][k]
+ if s["n"] == 0:
+ f.write(f"{k},0,,,,,,,\n")
+ continue
+ f.write(
+ f"{k},{s['n']},{s['lam1_median']:.6f},{s['lam1_mean']:.6f},"
+ f"{s['lam1_iqr'][0]:.6f},{s['lam1_iqr'][1]:.6f},{s['lam8_median']:.6f},"
+ f"{s['token_acc_median']:.4f},{s['halted_at_median']:.1f}\n"
+ )
+
+ if make_figs:
+ colors = {"A_conv_correct": "tab:green", "B_conv_wrong": "tab:orange",
+ "C_nonconv_correct": "tab:blue", "D_nonconv_wrong": "tab:red"}
+
+ fig, ax = plt.subplots(figsize=(6, 4))
+ bins = np.linspace(logd.min(), logd.max(), 60)
+ ax.hist(logd[correct == 1], bins=bins, alpha=0.55, label=f"correct (n={int(correct.sum())})", color="tab:green")
+ ax.hist(logd[correct == 0], bins=bins, alpha=0.55, label=f"wrong (n={int((1-correct).sum())})", color="tab:red")
+ ax.axvline(tau, color="k", ls="--", lw=1, label=f"Otsu tau={tau:.2f}")
+ ax.set_xlabel("log10 late drift_zH (steps -4:)"); ax.set_ylabel("count")
+ ax.set_title(f"{tag}: late-drift distribution by correctness"); ax.legend(fontsize=8)
+ fig.tight_layout(); fig.savefig(HERE / f"fig_{tag}_drift_hist.png", dpi=150); plt.close(fig)
+
+ fig, ax = plt.subplots(figsize=(6.5, 4))
+ for i, (k, m) in enumerate(cells.items()):
+ if m.sum() == 0:
+ continue
+ y = lyap[m, 0]
+ x = np.full(y.shape, i) + (np.random.default_rng(0).uniform(-0.18, 0.18, y.shape))
+ ax.plot(x, y, ".", ms=3, alpha=0.35, color=colors[k])
+ ax.hlines(np.median(y), i - 0.28, i + 0.28, color=colors[k], lw=2.5)
+ ax.set_xticks(range(4)); ax.set_xticklabels([f"{k}\n(n={int(m.sum())})" for k, m in cells.items()], fontsize=7)
+ ax.set_ylabel("lambda_1 (full-window FTLE)"); ax.axhline(0, color="gray", lw=0.6)
+ ax.set_title(f"{tag}: lambda_1 by 2x2 cell")
+ fig.tight_layout(); fig.savefig(HERE / f"fig_{tag}_lyap_by_cell.png", dpi=150); plt.close(fig)
+
+ fig, ax = plt.subplots(figsize=(6, 4.5))
+ ax.scatter(logd[correct == 1], lyap[correct == 1, 0], s=5, alpha=0.4, c="tab:green", label="correct")
+ ax.scatter(logd[correct == 0], lyap[correct == 0, 0], s=5, alpha=0.4, c="tab:red", label="wrong")
+ ax.axvline(tau, color="k", ls="--", lw=1); ax.axhline(0, color="gray", lw=0.6)
+ ax.set_xlabel("log10 late drift_zH"); ax.set_ylabel("lambda_1")
+ ax.set_title(f"{tag}: drift vs lambda_1"); ax.legend(fontsize=8)
+ fig.tight_layout(); fig.savefig(HERE / f"fig_{tag}_scatter.png", dpi=150); plt.close(fig)
+
+ fig, ax = plt.subplots(figsize=(6, 4))
+ for k, m in cells.items():
+ if m.sum() < 3:
+ continue
+ ax.plot(range(1, lyap.shape[1] + 1), [np.median(lyap[m, i]) for i in range(lyap.shape[1])],
+ "o-", ms=4, label=f"{k} (n={int(m.sum())})", color=colors[k])
+ ax.axhline(0, color="gray", lw=0.6)
+ ax.set_xlabel("exponent index"); ax.set_ylabel("median lambda_i")
+ ax.set_title(f"{tag}: median FTLE spectrum per cell"); ax.legend(fontsize=7)
+ fig.tight_layout(); fig.savefig(HERE / f"fig_{tag}_spectrum.png", dpi=150); plt.close(fig)
+
+ # secondary observables
+ res["aux"] = {
+ "late_drift_zL_corr_with_zH_log": float(np.corrcoef(logd, np.log10(np.clip(d_late_l, 1e-12, None)))[0, 1]),
+ "q_halt_final_median_by_cell": {
+ k: float(np.median(d["q_halt"][m, -1])) if m.sum() > 0 else float("nan") for k, m in cells.items()
+ },
+ }
+ (HERE / f"results_{tag}.json").write_text(json.dumps(res, indent=2))
+ return res
+
+
+def evolution(series: list[tuple[str, Path]], out_tag: str) -> None:
+ rows = []
+ for label, p in series:
+ if not p.exists():
+ continue
+ d = np.load(p)
+ lyap = d["lyap_spec"].astype(np.float64)
+ correct = d["exact_correct"].astype(int)
+ logd = np.log10(np.clip(d["drift_zH"][:, -LATE_K:].mean(axis=1), 1e-12, None))
+ tau = otsu_threshold(logd)
+ conv = logd < tau
+ row = dict(step=label, acc=float(correct.mean()), tau=tau,
+ fA=float(((conv) & (correct == 1)).mean()), fB=float(((conv) & (correct == 0)).mean()),
+ fC=float(((~conv) & (correct == 1)).mean()), fD=float(((~conv) & (correct == 0)).mean()))
+ for nm, m in [("l1A", (conv) & (correct == 1)), ("l1B", (conv) & (correct == 0)),
+ ("l1C", (~conv) & (correct == 1)), ("l1D", (~conv) & (correct == 0))]:
+ row[nm] = float(np.median(lyap[m, 0])) if m.sum() > 2 else float("nan")
+ rows.append(row)
+ if not rows:
+ return
+ keys = list(rows[0].keys())
+ with (HERE / f"evolution_{out_tag}.csv").open("w") as f:
+ f.write(",".join(keys) + "\n")
+ for r in rows:
+ f.write(",".join(str(r[k]) for k in keys) + "\n")
+ fig, axes = plt.subplots(1, 2, figsize=(11, 4))
+ xs = range(len(rows))
+ for nm, c in [("fA", "tab:green"), ("fB", "tab:orange"), ("fC", "tab:blue"), ("fD", "tab:red")]:
+ axes[0].plot(xs, [r[nm] for r in rows], "o-", label=nm, color=c)
+ axes[0].set_xticks(list(xs)); axes[0].set_xticklabels([r["step"] for r in rows], rotation=45, fontsize=7)
+ axes[0].set_ylabel("cell fraction"); axes[0].legend(fontsize=8); axes[0].set_title(f"{out_tag}: cell fractions")
+ for nm, c in [("l1A", "tab:green"), ("l1B", "tab:orange"), ("l1C", "tab:blue"), ("l1D", "tab:red")]:
+ axes[1].plot(xs, [r[nm] for r in rows], "o-", label=nm, color=c)
+ axes[1].axhline(0, color="gray", lw=0.6)
+ axes[1].set_xticks(list(xs)); axes[1].set_xticklabels([r["step"] for r in rows], rotation=45, fontsize=7)
+ axes[1].set_ylabel("median lambda_1"); axes[1].legend(fontsize=8); axes[1].set_title(f"{out_tag}: per-cell lambda_1")
+ fig.tight_layout(); fig.savefig(HERE / f"evolution_{out_tag}.png", dpi=150); plt.close(fig)
+
+
+def main() -> None:
+ results = {}
+ primary = [
+ ("hrm26040_n8192", FLOSS / "diag_8k.npz"),
+ ("trm_singleGPU_step260410_n512", FLOSS / "diag_trm_singleGPU_step260410_512.npz"),
+ ("trm_singleGPU_step130205_n512", FLOSS / "diag_trm_singleGPU_step130205_512.npz"),
+ ("trm_step13020_n512", FLOSS / "diag_trm_step13020_512.npz"),
+ ]
+ for tag, p in primary:
+ if p.exists():
+ results[tag] = analyze(p, tag)
+ print(f"[done] {tag}")
+
+ evolution(
+ [(f"{s}", FLOSS / f"diag_hrm_step_{s}_512.npz") for s in
+ [2604, 5208, 7812, 10416, 13020, 15624, 18228, 20832, 23436, 26040]],
+ "hrm",
+ )
+ evolution(
+ [(f"{s}", FLOSS / f"diag_trm_singleGPU_step{s}_512.npz") for s in
+ [26041, 52082, 78123, 104164, 130205, 156246, 182287, 208328, 234369, 260410]],
+ "trm",
+ )
+
+ # combined human-readable summary
+ lines = ["# 2x2 analysis (convergence x correctness) — generated " + __import__("datetime").date.today().isoformat(), ""]
+ for tag, r in results.items():
+ lines += [f"## {tag}", f"- npz: `{r['npz']}`, n={r['n']}, exact_acc={r['exact_acc']:.3f}",
+ f"- late-drift def: {r['late_drift_def']}, Otsu tau(log10)={r['otsu_tau_log10']:.3f}, frac_converged={r['frac_converged']:.3f}", ""]
+ lines.append("| cell | n | lam1 median | lam1 IQR | token_acc med |")
+ lines.append("|---|---|---|---|---|")
+ for k, s in r["cells"].items():
+ if s["n"] == 0:
+ lines.append(f"| {k} | 0 | - | - | - |")
+ else:
+ lines.append(f"| {k} | {s['n']} | {s['lam1_median']:+.4f} | [{s['lam1_iqr'][0]:+.4f}, {s['lam1_iqr'][1]:+.4f}] | {s['token_acc_median']:.3f} |")
+ c = r["contrasts"]; a = r["auc"]; m = r["mixture"]
+ lines += ["",
+ f"- mixture: wrong-that-converged = {m['wrong_that_converged']:.3f}; correct-that-nonconverged = {m['correct_that_nonconverged']:.3f}",
+ f"- dlam1(correct-wrong): overall {c['dlam1_correct_minus_wrong_overall']:+.4f}; within-conv {c['dlam1_within_converged']:+.4f}; within-nonconv {c['dlam1_within_nonconverged']:+.4f}",
+ f"- dlam1(wrong: conv - nonconv) = {c['dlam1_wrong_conv_minus_wrong_nonconv']:+.4f}",
+ f"- AUC(-lam1 -> correct): overall {a['neg_lam1_predicts_correct_overall']:.3f}; within-conv {a['neg_lam1_predicts_correct_within_conv']:.3f}; within-nonconv {a['neg_lam1_predicts_correct_within_nonconv']:.3f}",
+ f"- AUC(-log d_late -> correct) = {a['neg_logdrift_predicts_correct']:.3f}; AUC(-lam1 -> converged) = {a['neg_lam1_predicts_converged']:.3f}",
+ ""]
+ (HERE / "results.md").write_text("\n".join(lines))
+ print("wrote", HERE / "results.md")
+
+
+if __name__ == "__main__":
+ main()
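
A quick way to sanity-check the Otsu split and the 2×2 cell assignment used in `analyze_2x2.py` is to run them on synthetic data with two well-separated drift bands. The sketch below is not part of the commit: `otsu_sketch` is a compact re-statement of the same between-class-variance criterion, and the band locations (log10 drift near 0.0 and near 1.6) and the sample sizes are invented for illustration.

```python
import numpy as np


def otsu_sketch(x, nbins=256):
    """Histogram-based Otsu threshold: maximize between-class variance."""
    h, edges = np.histogram(x, bins=nbins)
    p = h / h.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    omega = np.cumsum(p)           # class-0 probability up to each bin
    mu = np.cumsum(p * centers)    # class-0 unnormalized mean up to each bin
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = np.where(denom > 0, (mu[-1] * omega - mu) ** 2 / denom, np.nan)
    return float(centers[np.nanargmax(sigma_b2)])


# Synthetic late-drift values: a low "settled" band (all correct) and a
# high "unsettled" band (all wrong), mimicking a fully separated case.
rng = np.random.default_rng(0)
logd = np.concatenate([rng.normal(0.0, 0.05, 600), rng.normal(1.6, 0.15, 400)])
correct = np.concatenate([np.ones(600, dtype=int), np.zeros(400, dtype=int)])

tau = otsu_sketch(logd)
conv = logd < tau
counts = {
    "A_conv_correct": int((conv & (correct == 1)).sum()),
    "B_conv_wrong": int((conv & (correct == 0)).sum()),
    "C_nonconv_correct": int((~conv & (correct == 1)).sum()),
    "D_nonconv_wrong": int((~conv & (correct == 0)).sum()),
}
print(tau, counts)
```

Two properties of this construction are worth keeping in mind when reading the cell counts. The returned threshold is a bin center rather than a bin edge, so a few points sharing the boundary bin can land on either side of it. And when the histogram has an empty gap, every threshold inside the gap gives the same between-class variance, so `nanargmax` returns the first such bin; the percentile sweep in `analyze` is the more reliable guide to where the band boundary actually sits.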
diff --git a/research/flossing/analysis_2x2/analyze_phase1.py b/research/flossing/analysis_2x2/analyze_phase1.py
new file mode 100644
index 0000000..2ffa58d
--- /dev/null
+++ b/research/flossing/analysis_2x2/analyze_phase1.py
@@ -0,0 +1,126 @@
+"""Analyze phase-1 results: E5 horizon sweep, E6 matched-objective pairs, E2 second-run replication.
+Outputs: analysis_2x2/phase1/phase1_results.md + figures.
+"""
+from __future__ import annotations
+from pathlib import Path
+import matplotlib; matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+
+HERE = Path(__file__).resolve().parent
+P1 = HERE / "phase1"
+RETEST = HERE / "retest"
+
+
+def auc_rank(score, label):
+ score = np.asarray(score, float); label = np.asarray(label, int)
+ pos, neg = score[label == 1], score[label == 0]
+ if len(pos) == 0 or len(neg) == 0:
+ return float("nan")
+ allv = np.concatenate([pos, neg]); order = np.argsort(allv, kind="mergesort")
+ ranks = np.empty(len(allv)); ranks[order] = np.arange(1, len(allv) + 1)
+ sv = allv[order]; i = 0
+ while i < len(sv):
+ j = i
+ while j + 1 < len(sv) and sv[j + 1] == sv[i]:
+ j += 1
+ if j > i:
+ ranks[order[i:j + 1]] = ranks[order[i:j + 1]].mean()
+ i = j + 1
+ return float((ranks[:len(pos)].sum() - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg)))
+
+
+def otsu(x, nbins=256):
+ h, e = np.histogram(x, bins=nbins); h = h.astype(float); c = (e[:-1] + e[1:]) / 2
+ p = h / h.sum(); om = np.cumsum(p); mu = np.cumsum(p * c); mt = mu[-1]
+ den = om * (1 - om); den[den <= 0] = np.nan
+ return float(c[np.nanargmax((mt * om - mu) ** 2 / den)])
+
+
+def cells_of(d):
+ ld = np.log10(np.clip(d["drift_zH"][:, -4:].mean(1), 1e-12, None))
+ c = d["exact_correct"].astype(int); tau = otsu(ld); conv = ld < tau
+ A = conv & (c == 1); B = conv & (c == 0); C = (~conv) & (c == 1); D = (~conv) & (c == 0)
+ l1 = d["lyap_spec"][:, 0]
+ return dict(acc=float(c.mean()), tau=tau,
+ nA=int(A.sum()), nB=int(B.sum()), nC=int(C.sum()), nD=int(D.sum()),
+ fD=float(D.mean()), l1A=float(np.median(l1[A])) if A.sum() else float("nan"),
+ l1D=float(np.median(l1[D])) if D.sum() else float("nan"))
+
+
+lines = ["# Phase-1 analysis (E5 horizon sweep, E6 matched-objective, E2 replication)", ""]
+
+# ---------- E5 horizon sweep ----------
+lines += ["## E5: when does fate become legible? AUC(early signal -> FINAL correct), restricted to not-yet-correct@H", ""]
+finals = {"trm": np.load(RETEST / "trm_gbs768_step58590_full_n2048.npz"),
+          "hrm": np.load(RETEST / "hrm_righteous_step26040_full_n2048.npz")}
+# include h=4 from retest short runs
+short4 = {"trm": np.load(RETEST / "trm_gbs768_step58590_short_n2048.npz"),
+          "hrm": np.load(RETEST / "hrm_righteous_step26040_short_n2048.npz")}
+sweep = {}
+for model in ["trm", "hrm"]:
+    fin = finals[model]; fi = fin["idx"]; y_final_all = fin["exact_correct"].astype(int)
+    rows = []
+    for H in [2, 4, 6, 8, 10, 12]:
+        if H == 4:
+            s = short4[model]
+        else:
+            tag = f"{'trm_official58590' if model=='trm' else 'hrm26040'}_h{H}_n2048"
+            s = np.load(P1 / f"{tag}.npz")
+        si = s["idx"]; common, fp, sp = np.intersect1d(fi, si, return_indices=True)
+        yf = y_final_all[fp]; ye = s["exact_correct"].astype(int)[sp]
+        l1 = s["lyap_spec"][sp, 0]; dr = np.log10(np.clip(s["drift_zH"][sp, -1], 1e-12, None)); q = s["q_halt"][sp, -1]
+        m = ye == 0  # not yet correct at H
+        rows.append(dict(H=H, n_und=int(m.sum()), frac_eventual=float(yf[m].mean()) if m.sum() else float("nan"),
+                         auc_l1=auc_rank(-l1[m], yf[m]), auc_drift=auc_rank(-dr[m], yf[m]), auc_q=auc_rank(q[m], yf[m]),
+                         solved_at_H=float(ye.mean())))
+    sweep[model] = rows
+    lines.append(f"### {model.upper()}")
+    lines.append("| H | solved@H | undecided n | of which eventual✓ | AUC(-λ₁) | AUC(-drift) | AUC(q_halt) |")
+    lines.append("|---|---|---|---|---|---|---|")
+    for r in rows:
+        lines.append(f"| {r['H']} | {r['solved_at_H']:.3f} | {r['n_und']} | {r['frac_eventual']:.3f} | "
+                     f"{r['auc_l1']:.3f} | {r['auc_drift']:.3f} | {r['auc_q']:.3f} |")
+    lines.append("")
+
+# E5 figure
+fig, axes = plt.subplots(1, 2, figsize=(11, 4))
+for ax, model in zip(axes, ["trm", "hrm"]):
+    rows = sweep[model]; H = [r["H"] for r in rows]
+    ax.plot(H, [r["auc_l1"] for r in rows], "o-", label="−λ₁")
+    ax.plot(H, [r["auc_drift"] for r in rows], "s-", label="−drift")
+    ax.plot(H, [r["auc_q"] for r in rows], "^-", label="q_halt")
+    ax.axhline(0.5, color="gray", lw=0.6, ls="--")
+    ax.set_xlabel("prefix length H (ACT segments)"); ax.set_ylabel("AUC → final correct (undecided@H only)")
+    ax.set_ylim(0.3, 1.0); ax.set_title(f"{model.upper()}: legibility of fate vs prefix"); ax.legend(fontsize=8)
+fig.tight_layout(); fig.savefig(P1 / "fig_E5_horizon_sweep.png", dpi=150); plt.close(fig)
+
+# ---------- E6 matched-objective ----------
+lines += ["## E6: matched-objective intervention (step9 fixed-unroll runs, n=512)", ""]
+for fam, base, mult in [("HRM (E base vs F multi4)", "step9E_hrm", "step9F_hrm"),
+                        ("TRM (G base vs H multi4)", "step9G_trm", "step9H_trm")]:
+    lines.append(f"### {fam}")
+    lines.append("| ckpt | base acc | base fD | base λ₁(D) | multi4 acc | multi4 fD | multi4 λ₁(D) |")
+    lines.append("|---|---|---|---|---|---|---|")
+    for ck in ["step_12500", "step_25000", "best", "final"]:
+        bp = P1 / f"{base}_{ck}_n512.npz"; mp = P1 / f"{mult}_{ck}_n512.npz"
+        if not bp.exists() or not mp.exists():
+            lines.append(f"| {ck} | missing | | | | | |"); continue
+        b = cells_of(np.load(bp)); m = cells_of(np.load(mp))
+        lines.append(f"| {ck} | {b['acc']:.3f} | {b['fD']:.3f} | {b['l1D']:+.4f} | "
+                     f"{m['acc']:.3f} | {m['fD']:.3f} | {m['l1D']:+.4f} |")
+    lines.append("")
+
+# ---------- E2 second-run replication ----------
+lines += ["## E2: decomposition on a SECOND HRM training run (step9_E fixed-unroll, n=2048)", ""]
+lines.append("| ckpt | acc | A | B | C | D | λ₁(A) | λ₁(D) | settled-wrong frac of failures |")
+lines.append("|---|---|---|---|---|---|---|---|---|")
+for ck in ["best", "final"]:
+    d = np.load(P1 / f"step9E_hrm_{ck}_full_n2048.npz"); cc = cells_of(d)
+    nfail = cc["nB"] + cc["nD"]
+    lines.append(f"| {ck} | {cc['acc']:.3f} | {cc['nA']} | {cc['nB']} | {cc['nC']} | {cc['nD']} | "
+                 f"{cc['l1A']:+.4f} | {cc['l1D']:+.4f} | {cc['nB']/max(nfail,1):.4f} |")
+
+(P1 / "phase1_results.md").write_text("\n".join(lines))
+print("\n".join(lines))
+print("\nwrote", P1 / "phase1_results.md", "and fig_E5_horizon_sweep.png")
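`cells_of` above thresholds the log10 late drift with `otsu` and crosses the resulting settled/unsettled flag with `exact_correct` to form the four cells A to D. A minimal sketch of that decomposition on synthetic data follows; the array values and the names `otsu_threshold`, `log_drift`, and `correct` are illustrative assumptions, not taken from the runs reported here.

```python
import numpy as np


def otsu_threshold(x, nbins=256):
    """Return the histogram bin centre that maximises between-class variance."""
    hist, edges = np.histogram(x, bins=nbins)
    centres = (edges[:-1] + edges[1:]) / 2
    p = hist.astype(float) / hist.sum()
    omega = np.cumsum(p)             # probability mass at or below each bin
    mu = np.cumsum(p * centres)      # cumulative first moment
    mu_total = mu[-1]
    denom = omega * (1 - omega)
    denom[denom <= 0] = np.nan       # the extreme bins carry no split information
    between = (mu_total * omega - mu) ** 2 / denom
    return float(centres[np.nanargmax(between)])


rng = np.random.default_rng(0)
# Hypothetical bimodal sample: a "settled" mode and an "unsettled" mode.
log_drift = np.concatenate([rng.normal(-2.0, 0.2, 500), rng.normal(1.5, 0.2, 500)])
correct = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])

tau = otsu_threshold(log_drift)
settled = log_drift < tau
cells = {
    "A": int((settled & (correct == 1)).sum()),    # settled, correct
    "B": int((settled & (correct == 0)).sum()),    # settled, wrong
    "C": int((~settled & (correct == 1)).sum()),   # unsettled, correct
    "D": int((~settled & (correct == 0)).sum()),   # unsettled, wrong
}
print(f"tau={tau:.3f}", cells)
```

Because the between-class variance is flat across an empty gap between the modes, `np.nanargmax` returns the first bin of that plateau, so the threshold sits at the upper edge of the lower mode rather than midway between the two.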
diff --git a/research/flossing/analysis_2x2/cells_hrm26040_joint_n2048.csv b/research/flossing/analysis_2x2/cells_hrm26040_joint_n2048.csv
new file mode 100644
index 0000000..9ee8208
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_hrm26040_joint_n2048.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,1020,-0.152252,-0.149175,-0.177945,-0.123537,-0.229215,1.0000,4.0
+B_conv_wrong,14,-0.035939,-0.045284,-0.100539,0.023329,-0.091743,0.6420,0.0
+C_nonconv_correct,57,-0.044267,-0.039474,-0.073953,-0.009610,-0.115885,1.0000,14.0
+D_nonconv_wrong,957,0.031881,0.037504,0.001404,0.070978,-0.064309,0.6296,0.0
diff --git a/research/flossing/analysis_2x2/cells_hrm26040_n8192.csv b/research/flossing/analysis_2x2/cells_hrm26040_n8192.csv
new file mode 100644
index 0000000..7d2c776
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_hrm26040_n8192.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,4103,-0.861743,-0.856474,-0.902015,-0.815340,-0.974624,1.0000,4.0
+B_conv_wrong,63,-0.591228,-0.631761,-0.809065,-0.506116,-0.737965,0.6296,0.0
+C_nonconv_correct,195,-0.694297,-0.685632,-0.735417,-0.646644,-0.788653,1.0000,14.0
+D_nonconv_wrong,3831,-0.599770,-0.597011,-0.647486,-0.548848,-0.714284,0.6296,0.0
diff --git a/research/flossing/analysis_2x2/cells_trm_official58590_n2048.csv b/research/flossing/analysis_2x2/cells_trm_official58590_n2048.csv
new file mode 100644
index 0000000..d805cf9
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_trm_official58590_n2048.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,1724,0.011282,0.015547,0.008467,0.016524,0.002090,1.0000,2.0
+B_conv_wrong,0,,,,,,,
+C_nonconv_correct,70,0.081975,0.077830,0.068631,0.090418,0.008649,1.0000,14.0
+D_nonconv_wrong,254,0.102872,0.102872,0.092413,0.112425,0.018793,0.6296,0.0
diff --git a/research/flossing/analysis_2x2/cells_trm_official_gbs768_step58590_n512.csv b/research/flossing/analysis_2x2/cells_trm_official_gbs768_step58590_n512.csv
new file mode 100644
index 0000000..2ff9465
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_trm_official_gbs768_step58590_n512.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,434,0.011053,0.015590,0.008054,0.017474,0.002147,1.0000,2.0
+B_conv_wrong,0,,,,,,,
+C_nonconv_correct,14,0.082749,0.080288,0.070048,0.095862,0.006181,1.0000,14.0
+D_nonconv_wrong,64,0.103398,0.102595,0.092417,0.111892,0.021354,0.6296,0.0
diff --git a/research/flossing/analysis_2x2/cells_trm_singleGPU_step130205_n512.csv b/research/flossing/analysis_2x2/cells_trm_singleGPU_step130205_n512.csv
new file mode 100644
index 0000000..1e9aec5
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_trm_singleGPU_step130205_n512.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,365,0.004312,0.009192,-0.000673,0.014061,-0.008585,1.0000,3.0
+B_conv_wrong,0,,,,,,,
+C_nonconv_correct,22,0.085199,0.084005,0.069379,0.098772,0.015529,1.0000,15.0
+D_nonconv_wrong,125,0.106698,0.106791,0.099522,0.112582,0.032808,0.6296,0.0
diff --git a/research/flossing/analysis_2x2/cells_trm_singleGPU_step260410_n512.csv b/research/flossing/analysis_2x2/cells_trm_singleGPU_step260410_n512.csv
new file mode 100644
index 0000000..4d11774
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_trm_singleGPU_step260410_n512.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,383,0.004744,0.010418,0.000744,0.013660,-0.004220,1.0000,2.0
+B_conv_wrong,0,,,,,,,
+C_nonconv_correct,11,0.099828,0.089867,0.090672,0.103510,0.013320,1.0000,15.0
+D_nonconv_wrong,118,0.102321,0.102882,0.093520,0.111216,0.021455,0.6420,0.0
diff --git a/research/flossing/analysis_2x2/cells_trm_step13020_n512.csv b/research/flossing/analysis_2x2/cells_trm_step13020_n512.csv
new file mode 100644
index 0000000..d3d1ff0
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_trm_step13020_n512.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,296,-0.004330,-0.002231,-0.011287,0.005197,-0.032494,1.0000,3.0
+B_conv_wrong,10,-0.002707,-0.003217,-0.008984,0.000914,-0.024164,0.5556,0.0
+C_nonconv_correct,9,0.041901,0.041647,0.032285,0.047971,-0.018014,1.0000,14.0
+D_nonconv_wrong,197,0.031663,0.030858,0.014350,0.046172,-0.012672,0.6420,0.0
diff --git a/research/flossing/analysis_2x2/early_pairing_hrm26040_joint.md b/research/flossing/analysis_2x2/early_pairing_hrm26040_joint.md
new file mode 100644
index 0000000..8d90e02
--- /dev/null
+++ b/research/flossing/analysis_2x2/early_pairing_hrm26040_joint.md
@@ -0,0 +1,17 @@
+# Early-window pairing — hrm26040_joint
+- paired n=2048; final acc=0.5259; already-correct@step4=0.3447
+- of final-correct, fraction already correct@4: 0.6555
+- early-window lam1: final-correct med -0.1714, final-wrong med -0.1314
+
+## Forecasting FINAL outcome from the first 4 ACT steps
+- AUC(-lam1_early -> final correct) = 0.728
+- AUC(-drift@4 -> final correct) = 0.486
+- AUC(q_halt@4 -> final correct) = 0.908
+- reference: AUC(-lam1_full -> final correct) = 0.987
+
+## Restricted to examples NOT yet correct at step 4 (the decision-relevant set)
+- n=1342, of which eventually correct: 371 (0.276)
+- AUC(-lam1_early -> eventually correct) = 0.448
+- AUC(-drift@4 -> eventually correct) = 0.312
+- AUC(q_halt@4 -> eventually correct) = 0.734
+- early lam1 med: eventually-correct -0.1225 vs never-correct -0.1314
\ No newline at end of file
diff --git a/research/flossing/analysis_2x2/early_pairing_trm_official58590.md b/research/flossing/analysis_2x2/early_pairing_trm_official58590.md
new file mode 100644
index 0000000..2b29ab5
--- /dev/null
+++ b/research/flossing/analysis_2x2/early_pairing_trm_official58590.md
@@ -0,0 +1,17 @@
+# Early-window pairing — trm_official58590
+- paired n=2048; final acc=0.8760; already-correct@step4=0.6943
+- of final-correct, fraction already correct@4: 0.7926
+- early-window lam1: final-correct med +0.0178, final-wrong med +0.1075
+
+## Forecasting FINAL outcome from the first 4 ACT steps
+- AUC(-lam1_early -> final correct) = 0.891
+- AUC(-drift@4 -> final correct) = 0.800
+- AUC(q_halt@4 -> final correct) = 0.901
+- reference: AUC(-lam1_full -> final correct) = 0.993
+
+## Restricted to examples NOT yet correct at step 4 (the decision-relevant set)
+- n=626, of which eventually correct: 372 (0.594)
+- AUC(-lam1_early -> eventually correct) = 0.543
+- AUC(-drift@4 -> eventually correct) = 0.492
+- AUC(q_halt@4 -> eventually correct) = 0.521
+- early lam1 med: eventually-correct +0.1060 vs never-correct +0.1075
\ No newline at end of file
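Both reports show a pooled AUC well above the AUC restricted to examples not yet correct at step 4 (for TRM, q_halt@4 scores 0.901 pooled against 0.521 restricted). One possible reading, sketched here on synthetic data only, is that a score can rank well overall simply by tracking which examples are already solved. The generating process below and the helper name `auc_pairwise` are assumptions made for illustration; nothing in it is fitted to the reported runs.

```python
import numpy as np


def auc_pairwise(score, label):
    """AUC as P(score_pos > score_neg) + 0.5 * P(tie), by direct pair counting."""
    pos, neg = score[label == 1], score[label == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


rng = np.random.default_rng(0)
n = 4000
early_correct = rng.random(n) < 0.7                         # solved within the early window
late_rescue = (~early_correct) & (rng.random(n) < 0.6)      # solved only later
final_correct = (early_correct | late_rescue).astype(int)

# Hypothetical score: shifted upward for already-solved examples, pure noise otherwise.
score = 3.0 * early_correct + rng.normal(size=n)

pooled_auc = auc_pairwise(score, final_correct)
undecided = ~early_correct
restricted_auc = auc_pairwise(score[undecided], final_correct[undecided])
print(f"pooled AUC={pooled_auc:.3f}, restricted AUC={restricted_auc:.3f}")
```

In this construction the score carries no information inside the undecided subset, so the restricted AUC stays near chance while the pooled AUC is high. That is the pattern the restricted section of the report is designed to expose.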
diff --git a/research/flossing/analysis_2x2/early_window_pairing.py b/research/flossing/analysis_2x2/early_window_pairing.py
new file mode 100644
index 0000000..6277df2
--- /dev/null
+++ b/research/flossing/analysis_2x2/early_window_pairing.py
@@ -0,0 +1,85 @@
+"""Early-window (first 4 ACT steps) vs full-window pairing.
+
+Question: does the early-window FTLE forecast FINAL failure before convergence?
+- join full/short npz per example via idx (same seed/n => same sampling)
+- target label = final exact_correct from the FULL-window run
+- report: AUC(-lam1_early -> final correct), same for early drift and q_halt@4,
+ and the conditional version restricted to examples NOT yet correct at step 4
+ (short run's own exact_correct == 0) — the practically relevant population for
+ early-exit / compute-reallocation decisions.
+
+Observational only. Usage: python early_window_pairing.py FULL.npz SHORT.npz TAG
+"""
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+import numpy as np
+
+HERE = Path(__file__).resolve().parent
+
+
+def auc_rank(score: np.ndarray, label: np.ndarray) -> float:
+    pos, neg = score[label == 1], score[label == 0]
+    if len(pos) == 0 or len(neg) == 0:
+        return float("nan")
+    allv = np.concatenate([pos, neg])
+    order = np.argsort(allv, kind="mergesort")
+    ranks = np.empty(len(allv)); ranks[order] = np.arange(1, len(allv) + 1)
+    sv = allv[order]; i = 0
+    while i < len(sv):
+        j = i
+        while j + 1 < len(sv) and sv[j + 1] == sv[i]:
+            j += 1
+        if j > i:
+            ranks[order[i:j + 1]] = ranks[order[i:j + 1]].mean()
+        i = j + 1
+    return float((ranks[:len(pos)].sum() - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg)))
+
+
+def main(full_path: str, short_path: str, tag: str) -> None:
+    f = np.load(full_path)
+    s = np.load(short_path)
+
+    fi, si = f["idx"], s["idx"]
+    common, f_pos, s_pos = np.intersect1d(fi, si, return_indices=True)
+    print(f"[{tag}] paired {len(common)} examples (full n={len(fi)}, short n={len(si)})")
+
+    y_final = f["exact_correct"].astype(int)[f_pos]  # FINAL outcome (step 16)
+    y_early = s["exact_correct"].astype(int)[s_pos]  # correct already at step 4
+    l1_early = s["lyap_spec"][s_pos, 0]
+    l1_full = f["lyap_spec"][f_pos, 0]
+    drift4 = np.log10(np.clip(s["drift_zH"][s_pos, -1], 1e-12, None))  # drift at ACT step 4
+    q4 = s["q_halt"][s_pos, -1]
+
+    lines = [f"# Early-window pairing — {tag}",
+             f"- paired n={len(common)}; final acc={y_final.mean():.4f}; already-correct@step4={y_early.mean():.4f}",
+             f"- of final-correct, fraction already correct@4: {y_early[y_final==1].mean():.4f}",
+             f"- early-window lam1: final-correct med {np.median(l1_early[y_final==1]):+.4f}, "
+             f"final-wrong med {np.median(l1_early[y_final==0]):+.4f}",
+             "",
+             "## Forecasting FINAL outcome from the first 4 ACT steps",
+             f"- AUC(-lam1_early -> final correct) = {auc_rank(-l1_early, y_final):.3f}",
+             f"- AUC(-drift@4 -> final correct) = {auc_rank(-drift4, y_final):.3f}",
+             f"- AUC(q_halt@4 -> final correct) = {auc_rank(q4, y_final):.3f}",
+             f"- reference: AUC(-lam1_full -> final correct) = {auc_rank(-l1_full, y_final):.3f}",
+             "",
+             "## Restricted to examples NOT yet correct at step 4 (the decision-relevant set)"]
+    m = y_early == 0
+    n_m = int(m.sum()); n_pos = int(y_final[m].sum())
+    lines += [f"- n={n_m}, of which eventually correct: {n_pos} ({n_pos/max(n_m,1):.3f})",
+              f"- AUC(-lam1_early -> eventually correct) = {auc_rank(-l1_early[m], y_final[m]):.3f}",
+              f"- AUC(-drift@4 -> eventually correct) = {auc_rank(-drift4[m], y_final[m]):.3f}",
+              f"- AUC(q_halt@4 -> eventually correct) = {auc_rank(q4[m], y_final[m]):.3f}",
+              f"- early lam1 med: eventually-correct {np.median(l1_early[m & (y_final==1)]):+.4f} vs "
+              f"never-correct {np.median(l1_early[m & (y_final==0)]):+.4f}"]
+
+    out = HERE / f"early_pairing_{tag}.md"
+    out.write_text("\n".join(lines))
+    print("\n".join(lines))
+    print("wrote", out)
+
+
+if __name__ == "__main__":
+    main(sys.argv[1], sys.argv[2], sys.argv[3])
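The pairing step in `main` relies on `np.intersect1d(..., return_indices=True)` to align the full-window and short-window runs by example id. A small sketch of what that call returns, using made-up ids and labels (every array below is a hypothetical stand-in for the `idx` and `exact_correct` fields):

```python
import numpy as np

# Hypothetical example ids and labels for a full-window and a short-window run.
full_idx = np.array([10, 42, 7, 99])
full_correct = np.array([1, 0, 1, 1])
short_idx = np.array([7, 10, 55, 42])
short_correct = np.array([0, 1, 0, 0])

# `common` is sorted; f_pos / s_pos index into the two original arrays.
common, f_pos, s_pos = np.intersect1d(full_idx, short_idx, return_indices=True)

y_final = full_correct[f_pos]    # final outcome, aligned to `common`
y_early = short_correct[s_pos]   # early outcome, aligned to `common`

undecided = y_early == 0         # not yet correct in the short window
frac_eventual = float(y_final[undecided].mean())
print(common, f_pos, s_pos, frac_eventual)
```

Ids present in only one run (99 and 55 here) are dropped. With duplicate ids the function returns the index of the first occurrence, so the alignment assumes ids are unique within each file.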
diff --git a/research/flossing/analysis_2x2/evolution_hrm.csv b/research/flossing/analysis_2x2/evolution_hrm.csv
new file mode 100644
index 0000000..c6a6b10
--- /dev/null
+++ b/research/flossing/analysis_2x2/evolution_hrm.csv
@@ -0,0 +1,11 @@
+step,acc,tau,fA,fB,fC,fD,l1A,l1B,l1C,l1D
+2604,0.015625,0.32445240020751953,0.015625,0.892578125,0.0,0.091796875,-0.11765776202082634,-0.10138508677482605,nan,-0.08651512861251831
+5208,0.048828125,0.40048032999038696,0.029296875,0.900390625,0.01953125,0.05078125,-0.17876404523849487,-0.12725482881069183,-0.18487446755170822,-0.11079395189881325
+7812,0.15234375,0.6775624752044678,0.119140625,0.7734375,0.033203125,0.07421875,-0.20277076959609985,-0.1001129113137722,-0.19649504125118256,-0.12255467846989632
+10416,0.1796875,0.7741286754608154,0.166015625,0.599609375,0.013671875,0.220703125,-0.18029353022575378,-0.06513682007789612,-0.038433440029621124,-0.0504322350025177
+13020,0.30078125,0.8470979928970337,0.283203125,0.2421875,0.017578125,0.45703125,-0.17255809903144836,-0.024668633937835693,-0.07013614475727081,-0.010587955359369516
+15624,0.333984375,0.8144785761833191,0.3125,0.1484375,0.021484375,0.517578125,-0.19566883146762848,0.0031415667617693543,-0.05802934989333153,0.009074798785150051
+18228,0.474609375,0.8437007665634155,0.439453125,0.001953125,0.03515625,0.5234375,-0.08657147735357285,nan,-0.0343703031539917,0.012321019545197487
+20832,0.45703125,0.6549708247184753,0.423828125,0.0078125,0.033203125,0.53515625,-0.09613814204931259,-0.022231140173971653,-0.04714604467153549,0.001996797218453139
+23436,0.505859375,0.7171876430511475,0.484375,0.01171875,0.021484375,0.482421875,-0.14968957751989365,-0.04964173026382923,-0.06192338466644287,-0.0044072768650949
+26040,0.5,0.7611898183822632,0.47265625,0.0078125,0.02734375,0.4921875,-0.1546676605939865,0.0052672732854261994,-0.03868532460182905,0.023184756748378277
diff --git a/research/flossing/analysis_2x2/evolution_trm.csv b/research/flossing/analysis_2x2/evolution_trm.csv
new file mode 100644
index 0000000..a0b19f5
--- /dev/null
+++ b/research/flossing/analysis_2x2/evolution_trm.csv
@@ -0,0 +1,11 @@
+step,acc,tau,fA,fB,fC,fD,l1A,l1B,l1C,l1D
+26041,0.576171875,0.7283560037612915,0.560546875,0.01171875,0.015625,0.412109375,-0.04085663706064224,-0.0017999516567215323,0.015885045286267996,0.035852354019880295
+52082,0.6484375,0.8146334886550903,0.626953125,0.0,0.021484375,0.3515625,-0.005659707821905613,nan,0.05140006169676781,0.060247575864195824
+78123,0.6875,0.9091860055923462,0.66015625,0.0,0.02734375,0.3125,-0.004423868143931031,nan,0.05615279637277126,0.08166220411658287
+104164,0.7109375,1.0002410411834717,0.681640625,0.0,0.029296875,0.2890625,-0.006436891388148069,nan,0.08611240983009338,0.09516968205571175
+130205,0.755859375,1.2130658626556396,0.712890625,0.0,0.04296875,0.244140625,0.004312093835324049,nan,0.08519911020994186,0.10669758170843124
+156246,0.7578125,1.2462413311004639,0.73046875,0.0,0.02734375,0.2421875,0.03479858115315437,nan,0.08626002073287964,0.10777830705046654
+182287,0.7421875,1.3035683631896973,0.712890625,0.0,0.029296875,0.2578125,-0.006856455933302641,nan,0.09087501466274261,0.10055964067578316
+208328,0.75,1.3776456117630005,0.720703125,0.0,0.029296875,0.25,0.03374895453453064,nan,0.08769012242555618,0.10804034024477005
+234369,0.7734375,1.36112380027771,0.7421875,0.0,0.03125,0.2265625,0.011585433036088943,nan,0.09419485554099083,0.10411781445145607
+260410,0.76953125,1.406412124633789,0.748046875,0.0,0.021484375,0.23046875,0.004744160454720259,nan,0.09982780367136002,0.10232088342308998
diff --git a/research/flossing/analysis_2x2/offline_followups.py b/research/flossing/analysis_2x2/offline_followups.py
new file mode 100644
index 0000000..8101acb
--- /dev/null
+++ b/research/flossing/analysis_2x2/offline_followups.py
@@ -0,0 +1,229 @@
+"""Offline follow-ups to the 2x2 analysis (no GPU):
+
+1. Residual outcome signal within the unsettled stratum (HRM diag_8k primary,
+ TRM official @58590 secondary): per-cell drift profiles over the 16 ACT steps,
+ end-of-window drift slope, q_halt trajectories, halted_at, lambda spectra with a
+ STRICT in-band threshold, and per-drift-decile AUC(lambda1 -> correct) within the
+ unsettled stratum (does lambda1 add signal beyond drift level?).
+2. Per-example profile of the strict-band settled-but-wrong examples (HRM, n~21).
+3. Difficulty control: #givens per puzzle (input tokens != 1) joined via idx;
+ lambda1 ~ givens rank correlation overall/within outcome, and per-givens-bin
+ AUC(-lambda1 -> correct).
+
+Observational only. Outputs to analysis_2x2/offline_followups/.
+"""
+from __future__ import annotations
+
+from pathlib import Path
+
+import matplotlib
+
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+
+HERE = Path(__file__).resolve().parent
+FLOSS = HERE.parent
+OUT = HERE / "offline_followups"
+OUT.mkdir(exist_ok=True)
+
+DATA_TEST_INPUTS = Path("/home/yurenh2/rrm/data/sudoku-extreme-1k-aug-1000/test/all__inputs.npy")
+
+CELL_COLORS = {"A": "tab:green", "B": "tab:orange", "C": "tab:blue", "D": "tab:red"}
+
+
+def auc_rank(score: np.ndarray, label: np.ndarray) -> float:
+    pos, neg = score[label == 1], score[label == 0]
+    if len(pos) == 0 or len(neg) == 0:
+        return float("nan")
+    allv = np.concatenate([pos, neg])
+    order = np.argsort(allv, kind="mergesort")
+    ranks = np.empty(len(allv)); ranks[order] = np.arange(1, len(allv) + 1)
+    sv = allv[order]; i = 0
+    while i < len(sv):
+        j = i
+        while j + 1 < len(sv) and sv[j + 1] == sv[i]:
+            j += 1
+        if j > i:
+            ranks[order[i:j + 1]] = ranks[order[i:j + 1]].mean()
+        i = j + 1
+    return float((ranks[:len(pos)].sum() - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg)))
+
+
+def spearman(a: np.ndarray, b: np.ndarray) -> float:
+    ra = np.argsort(np.argsort(a)).astype(float)
+    rb = np.argsort(np.argsort(b)).astype(float)
+    return float(np.corrcoef(ra, rb)[0, 1])
+
+
+def load(npz_path: Path, strict_pct: float):
+    d = np.load(npz_path)
+    out = {k: d[k] for k in d.files}
+    out["logd_late"] = np.log10(np.clip(out["drift_zH"][:, -4:].mean(1), 1e-12, None))
+    out["tau_strict"] = float(np.percentile(out["logd_late"], strict_pct))
+    conv = out["logd_late"] < out["tau_strict"]
+    c = out["exact_correct"].astype(int)
+    out["cells"] = {
+        "A": conv & (c == 1), "B": conv & (c == 0),
+        "C": (~conv) & (c == 1), "D": (~conv) & (c == 0),
+    }
+    return out
+
+
+def givens_for(idx: np.ndarray) -> np.ndarray:
+    inputs = np.load(DATA_TEST_INPUTS, mmap_mode="r")
+    return np.array([(inputs[i] != 1).sum() for i in idx])
+
+
+def drift_profiles_fig(ds, tag, lines):
+    fig, axes = plt.subplots(1, 2, figsize=(11, 4))
+    steps = np.arange(1, ds["drift_zH"].shape[1] + 1)
+    for nm, m in ds["cells"].items():
+        if m.sum() < 3:
+            continue
+        med = np.median(ds["drift_zH"][m], axis=0)
+        q1 = np.percentile(ds["drift_zH"][m], 25, axis=0)
+        q3 = np.percentile(ds["drift_zH"][m], 75, axis=0)
+        axes[0].plot(steps, med, "o-", ms=3, color=CELL_COLORS[nm], label=f"{nm} (n={int(m.sum())})")
+        axes[0].fill_between(steps, q1, q3, color=CELL_COLORS[nm], alpha=0.15)
+        qm = np.median(ds["q_halt"][m], axis=0)
+        axes[1].plot(steps, qm, "o-", ms=3, color=CELL_COLORS[nm], label=nm)
+    axes[0].set_yscale("log"); axes[0].set_xlabel("ACT step"); axes[0].set_ylabel("drift_zH (median, IQR)")
+    axes[0].legend(fontsize=8); axes[0].set_title(f"{tag}: drift profiles per cell")
+    axes[1].set_xlabel("ACT step"); axes[1].set_ylabel("q_halt (median)"); axes[1].axhline(0, color="gray", lw=0.6)
+    axes[1].set_title(f"{tag}: q_halt per cell"); axes[1].legend(fontsize=8)
+    fig.tight_layout(); fig.savefig(OUT / f"fig_{tag}_profiles.png", dpi=150); plt.close(fig)
+
+    # end-of-window slope: log10 mean drift over steps 13-16 minus log10 mean drift over steps 9-12 (1-indexed; slices 12:16 and 8:12)
+    slope = (np.log10(np.clip(ds["drift_zH"][:, 12:16].mean(1), 1e-12, None))
+             - np.log10(np.clip(ds["drift_zH"][:, 8:12].mean(1), 1e-12, None)))
+    lines.append(f"\n### {tag}: end-of-window drift slope (log10 steps13-16 vs 9-12; <0 = still descending)")
+    for nm, m in ds["cells"].items():
+        if m.sum() == 0:
+            lines.append(f"- {nm}: n=0")
+            continue
+        lines.append(f"- {nm}: n={int(m.sum())}, slope median {np.median(slope[m]):+.4f}, "
+                     f"IQR [{np.percentile(slope[m],25):+.4f}, {np.percentile(slope[m],75):+.4f}], "
+                     f"frac still descending (<-0.01): {float((slope[m] < -0.01).mean()):.2f}")
+    return slope
+
+
+def main() -> None:
+    lines = ["# Offline follow-ups (no GPU) — 2026-06-11",
+             "",
+             "Strict in-band thresholds: HRM pct45 of pooled log10 late-drift; TRM pct60 (band edge; B=0 regardless).",
+             "All numbers observational; within-dataset comparisons only."]
+
+    # ---------- HRM diag_8k ----------
+    hrm = load(FLOSS / "diag_8k.npz", strict_pct=45)
+    tag = "hrm26040_n8192_strict"
+    lines.append(f"\n## HRM @26040 (n=8192), strict tau(log10)={hrm['tau_strict']:.4f}")
+    lines.append("| cell | n | lam1 med | lam8 med | token_acc med | halted_at med | q_halt_final med | givens med |")
+    lines.append("|---|---|---|---|---|---|---|---|")
+    g_hrm = givens_for(hrm["idx"])
+    for nm, m in hrm["cells"].items():
+        if m.sum() == 0:
+            lines.append(f"| {nm} | 0 | | | | | | |")
+            continue
+        lines.append(
+            f"| {nm} | {int(m.sum())} | {np.median(hrm['lyap_spec'][m,0]):+.4f} | {np.median(hrm['lyap_spec'][m,-1]):+.4f} "
+            f"| {np.median(hrm['token_acc'][m]):.3f} | {np.median(hrm['halted_at'][m]):.0f} "
+            f"| {np.median(hrm['q_halt'][m,-1]):+.2f} | {np.median(g_hrm[m]):.0f} |")
+
+    slope = drift_profiles_fig(hrm, tag, lines)
+
+    # residual signal within unsettled stratum: per-drift-decile AUC
+    uns = ~(hrm["cells"]["A"] | hrm["cells"]["B"])
+    c = hrm["exact_correct"].astype(int)
+    lines.append("\n### HRM unsettled stratum: AUC(-lam1 -> correct) per log-drift decile")
+    lines.append("| decile | drift range (log10) | n | n_correct | AUC |")
+    lines.append("|---|---|---|---|---|")
+    ld_u, l1_u, c_u = hrm["logd_late"][uns], hrm["lyap_spec"][uns, 0], c[uns]
+    qs = np.percentile(ld_u, np.arange(0, 101, 10))
+    aucs, ws = [], []
+    for i in range(10):
+        m = (ld_u >= qs[i]) & (ld_u <= qs[i + 1] if i == 9 else ld_u < qs[i + 1])
+        a = auc_rank(-l1_u[m], c_u[m])
+        if not np.isnan(a) and c_u[m].sum() >= 5:
+            aucs.append(a); ws.append(m.sum())
+        lines.append(f"| {i+1} | [{qs[i]:.2f}, {qs[i+1]:.2f}] | {int(m.sum())} | {int(c_u[m].sum())} | "
+                     f"{a:.3f} |" if not np.isnan(a) else f"| {i+1} | [{qs[i]:.2f}, {qs[i+1]:.2f}] | {int(m.sum())} | {int(c_u[m].sum())} | n/a |")
+    if aucs:
+        lines.append(f"- weighted mean within-decile AUC = {np.average(aucs, weights=ws):.3f} "
+                     f"(vs unconditioned within-unsettled AUC {auc_rank(-l1_u, c_u):.3f})")
+
+    # also: does end-slope separate C from D?
+    lines.append(f"- AUC(-end_slope -> correct | unsettled) = {auc_rank(-slope[uns], c_u):.3f} "
+                 f"(C still-descending fraction vs D, see slope table above)")
+
+    # ---------- strict-B per-example table ----------
+    B = hrm["cells"]["B"]
+    lines.append(f"\n## HRM strict-band settled-but-wrong examples (n={int(B.sum())})")
+    lines.append("| idx | givens | token_acc | lam1 | drift_final | halted_at | q_halt_final |")
+    lines.append("|---|---|---|---|---|---|---|")
+    bi = np.where(B)[0]
+    order = np.argsort(hrm["token_acc"][bi])
+    for j in bi[order]:
+        lines.append(
+            f"| {int(hrm['idx'][j])} | {int(g_hrm[j])} | {hrm['token_acc'][j]:.3f} | {hrm['lyap_spec'][j,0]:+.3f} "
+            f"| {hrm['drift_zH'][j,-1]:.3f} | {int(hrm['halted_at'][j])} | {hrm['q_halt'][j,-1]:+.2f} |")
+    # B drift profiles vs A band
+    fig, ax = plt.subplots(figsize=(6.5, 4))
+    steps = np.arange(1, 17)
+    A = hrm["cells"]["A"]
+    ax.fill_between(steps, np.percentile(hrm["drift_zH"][A], 10, axis=0),
+                    np.percentile(hrm["drift_zH"][A], 90, axis=0), color="tab:green", alpha=0.2,
+                    label=f"A q10-q90 (n={int(A.sum())})")
+    for j in bi:
+        ax.plot(steps, hrm["drift_zH"][j], "-", lw=1, alpha=0.8, color="tab:orange")
+    ax.set_yscale("log"); ax.set_xlabel("ACT step"); ax.set_ylabel("drift_zH")
+    ax.set_title("HRM: strict-B drift profiles vs A band"); ax.legend(fontsize=8)
+    fig.tight_layout(); fig.savefig(OUT / "fig_hrm_strictB_profiles.png", dpi=150); plt.close(fig)
+
+    # ---------- difficulty control (HRM) ----------
+    l1 = hrm["lyap_spec"][:, 0]
+    lines.append("\n## HRM difficulty control (#givens, input tokens != 1)")
+    lines.append(f"- givens: min {g_hrm.min()}, median {np.median(g_hrm):.0f}, max {g_hrm.max()}")
+    lines.append(f"- Spearman(lam1, givens): overall {spearman(l1, g_hrm):+.3f}; "
+                 f"correct-only {spearman(l1[c==1], g_hrm[c==1]):+.3f}; "
+                 f"wrong-only {spearman(l1[c==0], g_hrm[c==0]):+.3f}")
+    lines.append(f"- Spearman(correct, givens) = {spearman(c.astype(float), g_hrm):+.3f}")
+    lines.append("\n| givens bin | n | acc | AUC(-lam1 -> correct) |")
+    lines.append("|---|---|---|---|")
+    edges = np.unique(np.percentile(g_hrm, [0, 20, 40, 60, 80, 100]))
+    bin_aucs, bin_ws = [], []
+    for i in range(len(edges) - 1):
+        m = (g_hrm >= edges[i]) & (g_hrm <= edges[i + 1] if i == len(edges) - 2 else g_hrm < edges[i + 1])
+        a = auc_rank(-l1[m], c[m])
+        lines.append(f"| [{edges[i]:.0f}, {edges[i+1]:.0f}] | {int(m.sum())} | {c[m].mean():.3f} | {a:.3f} |")
+        if not np.isnan(a):
+            bin_aucs.append(a); bin_ws.append(m.sum())
+    lines.append(f"- weighted mean within-bin AUC = {np.average(bin_aucs, weights=bin_ws):.3f} (overall {auc_rank(-l1, c):.3f})")
+
+    # ---------- TRM official @58590 (secondary, n=512) ----------
+    trm = load(FLOSS / "official_gbs768_spectrum/trm_gbs768_base_step58590_n512_k8_seed20260602.npz", strict_pct=60)
+    g_trm = givens_for(trm["idx"])
+    ct = trm["exact_correct"].astype(int)
+    lines.append(f"\n## TRM official @58590 (n=512), strict tau(log10)={trm['tau_strict']:.4f}")
+    lines.append("| cell | n | lam1 med | token_acc med | q_halt_final med | givens med |")
+    lines.append("|---|---|---|---|---|---|")
+    for nm, m in trm["cells"].items():
+        if m.sum() == 0:
+            lines.append(f"| {nm} | 0 | | | | |")
+            continue
+        lines.append(f"| {nm} | {int(m.sum())} | {np.median(trm['lyap_spec'][m,0]):+.4f} "
+                     f"| {np.median(trm['token_acc'][m]):.3f} | {np.median(trm['q_halt'][m,-1]):+.2f} "
+                     f"| {np.median(g_trm[m]):.0f} |")
+    drift_profiles_fig(trm, "trm_official58590_n512_strict", lines)
+    l1t = trm["lyap_spec"][:, 0]
+    lines.append(f"- Spearman(lam1, givens): overall {spearman(l1t, g_trm):+.3f}; "
+                 f"wrong-only {spearman(l1t[ct==0], g_trm[ct==0]):+.3f}")
+    lines.append(f"- Spearman(correct, givens) = {spearman(ct.astype(float), g_trm):+.3f}")
+
+    (OUT / "followups.md").write_text("\n".join(lines))
+    print("\n".join(lines[:6]))
+    print("wrote", OUT / "followups.md")
+
+
+if __name__ == "__main__":
+    main()
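The `spearman` helper in this file ranks with a double `argsort`, which gives tied values distinct ranks in whatever order the sort leaves them. Both `givens` (integers) and `correct` (binary) are heavily tied, so a tie-aware variant may be worth comparing against. The sketch below is one way to do that with average ranks; `average_ranks` and `spearman_tie_aware` are hypothetical names, and `kind="stable"` is used in the comparison helper only to make the illustration deterministic.

```python
import numpy as np


def average_ranks(v):
    """1-based ranks in which tied values share the mean of their ordinal ranks."""
    v = np.asarray(v).ravel()
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)                  # highest ordinal rank in each tie group
    mean_rank = last - (counts - 1) / 2.0     # mean ordinal rank of the group
    return mean_rank[inverse]


def spearman_tie_aware(a, b):
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def spearman_double_argsort(a, b):
    # Mirrors the double-argsort ranking, with a stable sort for reproducibility.
    ra = np.argsort(np.argsort(a, kind="stable"), kind="stable").astype(float)
    rb = np.argsort(np.argsort(b, kind="stable"), kind="stable").astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])


label = np.array([0, 0, 1, 1])          # binary outcome, two tie groups
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0])     # same (label, x) pairs up to order within ties

tie_aware_1 = spearman_tie_aware(label, x1)
tie_aware_2 = spearman_tie_aware(label, x2)
naive_1 = spearman_double_argsort(label, x1)
naive_2 = spearman_double_argsort(label, x2)
print(tie_aware_1, tie_aware_2, naive_1, naive_2)
```

On this toy input the tie-aware value is the same for both orderings, while the double-argsort value changes when examples are reordered within a tie group. Whether this shifts the correlations reported by the script is not checked here.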
diff --git a/research/flossing/analysis_2x2/offline_followups/followups.md b/research/flossing/analysis_2x2/offline_followups/followups.md
new file mode 100644
index 0000000..be5f519
--- /dev/null
+++ b/research/flossing/analysis_2x2/offline_followups/followups.md
@@ -0,0 +1,88 @@
+# Offline follow-ups (no GPU) — 2026-06-11
+
+Strict in-band thresholds: HRM pct45 of pooled log10 late-drift; TRM pct60 (band edge; B=0 regardless).
+All numbers observational; within-dataset comparisons only.
+
+## HRM @26040 (n=8192), strict tau(log10)=-0.0129
+| cell | n | lam1 med | lam8 med | token_acc med | halted_at med | q_halt_final med | givens med |
+|---|---|---|---|---|---|---|---|
+| A | 3665 | -0.8670 | -0.9787 | 1.000 | 4 | +7.47 | 26 |
+| B | 21 | -0.8421 | -0.9495 | 0.617 | 6 | +7.47 | 25 |
+| C | 633 | -0.7796 | -0.8815 | 1.000 | 10 | +7.47 | 25 |
+| D | 3873 | -0.5991 | -0.7140 | 0.630 | 0 | -9.62 | 25 |
+
+### hrm26040_n8192_strict: end-of-window drift slope (log10 steps13-16 vs 9-12; <0 = still descending)
+- A: n=3665, slope median -0.0023, IQR [-0.0073, +0.0012], frac still descending (<-0.01): 0.20
+- B: n=21, slope median -0.0063, IQR [-0.4005, -0.0009], frac still descending (<-0.01): 0.48
+- C: n=633, slope median -0.0006, IQR [-1.4084, +0.0088], frac still descending (<-0.01): 0.44
+- D: n=3873, slope median -0.0031, IQR [-0.0556, +0.0459], frac still descending (<-0.01): 0.46
+
+### HRM unsettled stratum: AUC(-lam1 -> correct) per log-drift decile
+| decile | drift range (log10) | n | n_correct | AUC |
+|---|---|---|---|---|
+| 1 | [-0.01, 0.66] | 451 | 422 | 0.966 |
+| 2 | [0.66, 1.42] | 450 | 50 | 0.972 |
+| 3 | [1.42, 1.52] | 451 | 4 | 0.988 |
+| 4 | [1.52, 1.56] | 450 | 6 | 0.964 |
+| 5 | [1.56, 1.60] | 451 | 1 | 0.984 |
+| 6 | [1.60, 1.62] | 450 | 7 | 0.949 |
+| 7 | [1.62, 1.65] | 451 | 5 | 0.837 |
+| 8 | [1.65, 1.68] | 450 | 5 | 0.851 |
+| 9 | [1.68, 1.71] | 451 | 16 | 0.804 |
+| 10 | [1.71, 1.93] | 451 | 117 | 0.685 |
+- weighted mean within-decile AUC = 0.879 (vs unconditioned within-unsettled AUC 0.933)
+- AUC(-end_slope -> correct | unsettled) = 0.605 (C still-descending fraction vs D, see slope table above)
+
+## HRM strict-band settled-but-wrong examples (n=21)
+| idx | givens | token_acc | lam1 | drift_final | halted_at | q_halt_final |
+|---|---|---|---|---|---|---|
+| 342267 | 17 | 0.407 | -0.867 | 0.976 | 5 | +7.41 |
+| 212705 | 17 | 0.469 | -0.838 | 0.964 | 8 | +7.44 |
+| 329832 | 17 | 0.481 | -0.703 | 0.970 | 8 | +7.41 |
+| 20075 | 27 | 0.519 | -0.812 | 0.966 | 5 | +7.50 |
+| 198242 | 25 | 0.568 | -0.843 | 0.980 | 7 | +7.47 |
+| 223591 | 24 | 0.580 | -0.939 | 0.951 | 4 | +7.47 |
+| 238704 | 27 | 0.593 | -0.931 | 0.953 | 5 | +7.47 |
+| 364431 | 25 | 0.593 | -0.806 | 0.956 | 6 | +7.44 |
+| 274637 | 26 | 0.593 | -0.859 | 0.979 | 6 | +7.47 |
+| 182424 | 24 | 0.605 | -0.985 | 0.949 | 6 | +7.47 |
+| 351919 | 25 | 0.617 | -0.742 | 0.965 | 5 | +7.47 |
+| 123022 | 27 | 0.617 | -0.826 | 0.951 | 7 | +7.50 |
+| 150426 | 25 | 0.630 | -0.767 | 0.963 | 9 | +7.47 |
+| 175427 | 26 | 0.630 | -0.843 | 0.946 | 8 | +7.50 |
+| 422185 | 26 | 0.642 | -0.841 | 0.946 | 7 | +7.47 |
+| 344032 | 24 | 0.654 | -0.903 | 0.965 | 4 | +7.53 |
+| 30703 | 25 | 0.691 | -0.732 | 0.972 | 6 | +7.53 |
+| 386549 | 23 | 0.691 | -0.842 | 0.966 | 4 | +7.47 |
+| 3370 | 26 | 0.716 | -0.861 | 0.955 | 6 | +7.47 |
+| 243909 | 24 | 0.753 | -0.786 | 0.969 | 8 | +7.50 |
+| 258307 | 25 | 0.877 | -0.918 | 0.952 | 5 | +7.47 |
+
+## HRM difficulty control (#givens, input tokens != 1)
+- givens: min 17, median 25, max 36
+- Spearman(lam1, givens): overall -0.350; correct-only -0.155; wrong-only -0.180
+- Spearman(correct, givens) = +0.276
+
+| givens bin | n | acc | AUC(-lam1 -> correct) |
+|---|---|---|---|
+| [17, 24] | 1152 | 0.321 | 0.976 |
+| [24, 25] | 1795 | 0.373 | 0.980 |
+| [25, 26] | 1764 | 0.503 | 0.987 |
+| [26, 36] | 3481 | 0.681 | 0.983 |
+- weighted mean within-bin AUC = 0.982 (overall 0.984)
+
+## TRM official @58590 (n=512), strict tau(log10)=1.0240
+| cell | n | lam1 med | token_acc med | q_halt_final med | givens med |
+|---|---|---|---|---|---|
+| A | 307 | +0.0105 | 1.000 | +7.78 | 26 |
+| B | 0 | | | | |
+| C | 141 | +0.0174 | 1.000 | +7.81 | 25 |
+| D | 64 | +0.1034 | 0.630 | -11.12 | 25 |
+
+### trm_official58590_n512_strict: end-of-window drift slope (log10 steps13-16 vs 9-12; <0 = still descending)
+- A: n=307, slope median -0.1471, IQR [-0.2267, -0.0641], frac still descending (<-0.01): 0.90
+- B: n=0
+- C: n=141, slope median -0.0080, IQR [-0.2603, +0.0808], frac still descending (<-0.01): 0.49
+- D: n=64, slope median -0.0125, IQR [-0.0525, +0.0276], frac still descending (<-0.01): 0.53
+- Spearman(lam1, givens): overall -0.240; wrong-only -0.238
+- Spearman(correct, givens) = +0.148
\ No newline at end of file
diff --git a/research/flossing/analysis_2x2/offline_followups/phase1_e1.md b/research/flossing/analysis_2x2/offline_followups/phase1_e1.md
new file mode 100644
index 0000000..b3e3f40
--- /dev/null
+++ b/research/flossing/analysis_2x2/offline_followups/phase1_e1.md
@@ -0,0 +1,27 @@
+# E1 offline batch — bootstrap CIs, settling robustness, TRM multi4 pair
+
+## Bootstrap / exact CIs (TRM official @58590, n=2048)
+- settled-wrong fraction: observed 0/254; exact one-sided 95% upper bound 0.0117 (1.17% of failures)
+- AUC(-lam1->correct) = 0.9935, bootstrap 95% CI (0.9908, 0.9957)
+- lam1(wrong) median 95% CI (0.1010, 0.1056)
+- lam1(correct) median 95% CI (0.0112, 0.0117)
+
+## Bootstrap CIs (HRM @26040, n=8192, strict band)
+- strict settled-wrong fraction of failures: observed 0.0054, bootstrap 95% CI (0.0033, 0.0078)
+- AUC(-lam1->correct) = 0.9841, bootstrap 95% CI (0.9815, 0.9865)
+
+## Settling-criterion robustness (B-cell counts under alternative drift definitions)
+- TRM official n=2048 | zH: B=0/A=1724 (tau=1.36) | zL: B=0/A=1728 (tau=1.42) | combined: B=0/A=1727 (tau=1.54)
+- HRM n=8192 | zH: B=63/A=4103 (tau=0.77) | zL: B=59/A=4083 (tau=1.01) | combined: B=60/A=4087 (tau=1.07)
+
+## TRM official-pipeline multi4 vs baseline (matched objective, n=512 each)
+- baseline @58590: acc=0.875; A/B/C/D=434/0/14/64; fD=0.125; lam1(D)=+0.1034; lam1(A)=+0.0111
+- multi4 @35805 (best): acc=0.900; A/B/C/D=452/0/9/51; fD=0.100; lam1(D)=+0.1019; lam1(A)=+0.0039
+- multi4 @65100 (final): acc=0.824; A/B/C/D=408/1/14/89; fD=0.174; lam1(D)=+0.0946; lam1(A)=+0.0133
+
+## hrm_multi4 provenance (E6a)
+- diag_hrm_multi4_step_{20832,23436,26040}_512.npz step grid matches HRM pretrain numbering;
+ multi4_eval_compare/logs should contain the eval invocations — checked manually below.
+- ACTION: if the hrm_multi4 run is pretrain-pipeline (ACT-streaming + perturbation), then the
+ May-28 multi4 vs righteous baseline comparison IS matched-pipeline and Sec 3.4's caveat is
+ narrower than written; step9 E-vs-F pair (queued) covers the fixed-unroll objective regardless.
\ No newline at end of file
diff --git a/research/flossing/analysis_2x2/offline_phase1_e1.py b/research/flossing/analysis_2x2/offline_phase1_e1.py
new file mode 100644
index 0000000..7c42825
--- /dev/null
+++ b/research/flossing/analysis_2x2/offline_phase1_e1.py
@@ -0,0 +1,123 @@
+"""E1 offline batch (experiment_framework.md):
+(1) bootstrap CIs for headline quantities; (2) settling-criterion robustness (z_L / combined);
+(3) TRM official multi4 vs baseline matched-pipeline 2x2; (4) provenance note for hrm_multi4.
+Outputs: offline_followups/phase1_e1.md
+"""
+from __future__ import annotations
+
+from pathlib import Path
+
+import numpy as np
+
+HERE = Path(__file__).resolve().parent
+FLOSS = HERE.parent
+OUT = HERE / "offline_followups"
+RNG = np.random.default_rng(0)
+
+
+def auc_rank(score, label):
+    pos, neg = score[label == 1], score[label == 0]
+    if len(pos) == 0 or len(neg) == 0:
+        return float("nan")
+    allv = np.concatenate([pos, neg])
+    order = np.argsort(allv, kind="mergesort")
+    ranks = np.empty(len(allv)); ranks[order] = np.arange(1, len(allv) + 1)
+    sv = allv[order]; i = 0
+    while i < len(sv):
+        j = i
+        while j + 1 < len(sv) and sv[j + 1] == sv[i]:
+            j += 1
+        if j > i:
+            ranks[order[i:j + 1]] = ranks[order[i:j + 1]].mean()
+        i = j + 1
+    return float((ranks[:len(pos)].sum() - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg)))
+
+
+def boot_ci(stat_fn, n, B=10000):
+    vals = []
+    for _ in range(B):
+        idx = RNG.integers(0, n, n)
+        v = stat_fn(idx)
+        if not np.isnan(v):
+            vals.append(v)
+    return float(np.percentile(vals, 2.5)), float(np.percentile(vals, 97.5))
+
+
+def late(d, key="drift_zH"):
+    return np.log10(np.clip(d[key][:, -4:].mean(1), 1e-12, None))
+
+
+def otsu(x, nbins=256):
+    h, e = np.histogram(x, bins=nbins); h = h.astype(float)
+    c = (e[:-1] + e[1:]) / 2
+    p = h / h.sum(); om = np.cumsum(p); mu = np.cumsum(p * c); mt = mu[-1]
+    den = om * (1 - om); den[den <= 0] = np.nan
+    return float(c[np.nanargmax((mt * om - mu) ** 2 / den)])
+
+
+lines = ["# E1 offline batch — bootstrap CIs, settling robustness, TRM multi4 pair", ""]
+
+# ---------- (1) bootstrap CIs ----------
+trm = np.load(FLOSS / "analysis_2x2/retest/trm_gbs768_step58590_full_n2048.npz")
+c_t = trm["exact_correct"].astype(int); l1_t = trm["lyap_spec"][:, 0]; ld_t = late(trm)
+n_wrong = int((c_t == 0).sum())
+cp_upper = 1 - 0.05 ** (1 / n_wrong)  # exact (Clopper-Pearson) one-sided 95% upper bound for 0 events
+lines += ["## Bootstrap / exact CIs (TRM official @58590, n=2048)",
+          f"- settled-wrong fraction: observed 0/{n_wrong}; exact one-sided 95% upper bound {cp_upper:.4f} "
+          f"({cp_upper*100:.2f}% of failures)",
+          f"- AUC(-lam1->correct) = {auc_rank(-l1_t, c_t):.4f}, bootstrap 95% CI "
+          f"{boot_ci(lambda i: auc_rank(-l1_t[i], c_t[i]), len(c_t))}",
+          f"- lam1(wrong) median 95% CI {boot_ci(lambda i: float(np.median(l1_t[i][c_t[i]==0])), len(c_t))}",
+          f"- lam1(correct) median 95% CI {boot_ci(lambda i: float(np.median(l1_t[i][c_t[i]==1])), len(c_t))}"]
+
+hrm = np.load(FLOSS / "diag_8k.npz")
+c_h = hrm["exact_correct"].astype(int); l1_h = hrm["lyap_spec"][:, 0]; ld_h = late(hrm)
+tau_strict = float(np.percentile(ld_h, 45))
+def strictB_frac(idx):
+    c, ld = c_h[idx], ld_h[idx]
+    w = (c == 0).sum()
+    return float(((ld < tau_strict) & (c == 0)).sum() / max(w, 1))
+lines += ["", "## Bootstrap CIs (HRM @26040, n=8192, strict band)",
+          f"- strict settled-wrong fraction of failures: observed {strictB_frac(np.arange(len(c_h))):.4f}, bootstrap 95% CI "
+          f"{boot_ci(strictB_frac, len(c_h))}",
+          f"- AUC(-lam1->correct) = {auc_rank(-l1_h, c_h):.4f}, bootstrap 95% CI "
+          f"{boot_ci(lambda i: auc_rank(-l1_h[i], c_h[i]), len(c_h))}"]
+
+# ---------- (2) settling-criterion robustness ----------
+lines += ["", "## Settling-criterion robustness (B-cell counts under alternative drift definitions)"]
+for tag, d, c in [("TRM official n=2048", trm, c_t), ("HRM n=8192", hrm, c_h)]:
+    row = [tag]
+    for key, nm in [("drift_zH", "zH"), ("drift_zL", "zL")]:
+        ld = late(d, key); tau = otsu(ld); conv = ld < tau
+        row.append(f"{nm}: B={int((conv & (c==0)).sum())}/A={int((conv & (c==1)).sum())} (tau={tau:.2f})")
+    comb = np.log10(np.clip(np.sqrt(d["drift_zH"][:, -4:].mean(1) ** 2 + d["drift_zL"][:, -4:].mean(1) ** 2), 1e-12, None))
+    tau = otsu(comb); conv = comb < tau
+    row.append(f"combined: B={int((conv & (c==0)).sum())}/A={int((conv & (c==1)).sum())} (tau={tau:.2f})")
+    lines.append("- " + " | ".join(row))
+
+# ---------- (3) TRM official multi4 vs baseline (matched pipeline) ----------
+lines += ["", "## TRM official-pipeline multi4 vs baseline (matched objective, n=512 each)"]
+spec = FLOSS / "official_gbs768_spectrum"
+for nm, f in [("baseline @58590", spec / "trm_gbs768_base_step58590_n512_k8_seed20260602.npz"),
+              ("multi4 @35805 (best)", spec / "trm_gbs768_multi4_step35805_n512_k8_seed20260602.npz"),
+              ("multi4 @65100 (final)", spec / "trm_gbs768_multi4_step65100_n512_k8_seed20260602.npz")]:
+    if not f.exists():
+        lines.append(f"- {nm}: npz missing"); continue
+    d = np.load(f); c = d["exact_correct"].astype(int); l1 = d["lyap_spec"][:, 0]
+    ld = late(d); tau = otsu(ld); conv = ld < tau
+    A = conv & (c == 1); B = conv & (c == 0); C = (~conv) & (c == 1); D = (~conv) & (c == 0)
+    lines.append(f"- {nm}: acc={c.mean():.3f}; A/B/C/D={int(A.sum())}/{int(B.sum())}/{int(C.sum())}/{int(D.sum())}; "
+                 f"fD={float(D.mean()):.3f}; lam1(D)={np.median(l1[D]) if D.sum()>0 else float('nan'):+.4f}; "
+                 f"lam1(A)={np.median(l1[A]) if A.sum()>0 else float('nan'):+.4f}")
+
+# ---------- (4) provenance note ----------
+lines += ["", "## hrm_multi4 provenance (E6a)",
+          "- diag_hrm_multi4_step_{20832,23436,26040}_512.npz step grid matches HRM pretrain numbering;",
+          " multi4_eval_compare/logs should contain the eval invocations — checked manually below.",
+          "- ACTION: if the hrm_multi4 run is pretrain-pipeline (ACT-streaming + perturbation), then the",
+          " May-28 multi4 vs righteous baseline comparison IS matched-pipeline and Sec 3.4's caveat is",
+          " narrower than written; step9 E-vs-F pair (queued) covers the fixed-unroll objective regardless."]
+
+OUT.mkdir(exist_ok=True)
+(OUT / "phase1_e1.md").write_text("\n".join(lines))
+print("\n".join(lines))
diff --git a/research/flossing/analysis_2x2/phase1/phase1_results.md b/research/flossing/analysis_2x2/phase1/phase1_results.md
new file mode 100644
index 0000000..5dde25d
--- /dev/null
+++ b/research/flossing/analysis_2x2/phase1/phase1_results.md
@@ -0,0 +1,48 @@
+# Phase-1 analysis (E5 horizon sweep, E6 matched-objective, E2 replication)
+
+## E5: when does fate become legible? AUC(early signal -> FINAL correct), restricted to not-yet-correct@H
+
+### TRM
+| H | solved@H | undecided n | of which eventual✓ | AUC(-λ₁) | AUC(-drift) | AUC(q_halt) |
+|---|---|---|---|---|---|---|
+| 2 | 0.542 | 939 | 0.729 | 0.526 | 0.411 | 0.608 |
+| 4 | 0.694 | 626 | 0.594 | 0.543 | 0.492 | 0.521 |
+| 6 | 0.766 | 479 | 0.470 | 0.523 | 0.441 | 0.531 |
+| 8 | 0.801 | 407 | 0.376 | 0.483 | 0.495 | 0.514 |
+| 10 | 0.831 | 347 | 0.268 | 0.477 | 0.536 | 0.476 |
+| 12 | 0.846 | 315 | 0.194 | 0.483 | 0.509 | 0.509 |
+
+### HRM
+| H | solved@H | undecided n | of which eventual✓ | AUC(-λ₁) | AUC(-drift) | AUC(q_halt) |
+|---|---|---|---|---|---|---|
+| 2 | 0.071 | 1903 | 0.490 | 0.375 | 0.193 | 0.810 |
+| 4 | 0.345 | 1342 | 0.276 | 0.448 | 0.312 | 0.734 |
+| 6 | 0.436 | 1155 | 0.159 | 0.438 | 0.349 | 0.676 |
+| 8 | 0.471 | 1083 | 0.103 | 0.432 | 0.371 | 0.705 |
+| 10 | 0.489 | 1046 | 0.072 | 0.478 | 0.407 | 0.705 |
+| 12 | 0.506 | 1012 | 0.041 | 0.503 | 0.348 | 0.635 |
+
+## E6: matched-objective intervention (step9 fixed-unroll runs, n=512)
+
+### HRM (E base vs F multi4)
+| ckpt | base acc | base fD | base λ₁(D) | multi4 acc | multi4 fD | multi4 λ₁(D) |
+|---|---|---|---|---|---|---|
+| step_12500 | 0.613 | 0.387 | -0.0157 | 0.631 | 0.369 | -0.0215 |
+| step_25000 | 0.615 | 0.385 | +0.0138 | 0.617 | 0.379 | +0.0101 |
+| best | 0.619 | 0.381 | -0.0214 | 0.613 | 0.387 | -0.0181 |
+| final | 0.588 | 0.410 | +0.0456 | 0.609 | 0.387 | +0.0335 |
+
+### TRM (G base vs H multi4)
+| ckpt | base acc | base fD | base λ₁(D) | multi4 acc | multi4 fD | multi4 λ₁(D) |
+|---|---|---|---|---|---|---|
+| step_12500 | 0.553 | 0.344 | +0.0219 | 0.600 | 0.305 | +0.0284 |
+| step_25000 | 0.525 | 0.396 | +0.0209 | 0.545 | 0.361 | +0.0222 |
+| best | 0.596 | 0.334 | +0.0233 | 0.580 | 0.361 | +0.0366 |
+| final | 0.477 | 0.312 | +0.0164 | 0.537 | 0.270 | +0.0249 |
+
+## E2: decomposition on a SECOND HRM training run (step9_E fixed-unroll, n=2048)
+
+| ckpt | acc | A | B | C | D | λ₁(A) | λ₁(D) | settled-wrong frac of failures |
+|---|---|---|---|---|---|---|---|---|
+| best | 0.637 | 1244 | 1 | 61 | 742 | -0.1685 | -0.0308 | 0.0013 |
+| final | 0.594 | 1124 | 3 | 92 | 829 | -0.0144 | +0.0444 | 0.0036 |
\ No newline at end of file
diff --git a/research/flossing/analysis_2x2/results.md b/research/flossing/analysis_2x2/results.md
new file mode 100644
index 0000000..a62bba2
--- /dev/null
+++ b/research/flossing/analysis_2x2/results.md
@@ -0,0 +1,69 @@
+# 2x2 analysis (convergence x correctness) — generated 2026-06-11
+
+## hrm26040_n8192
+- npz: `/home/yurenh2/rrm/research/flossing/diag_8k.npz`, n=8192, exact_acc=0.525
+- late-drift def: mean(drift_zH[:, -4:]), Otsu tau(log10)=0.766, frac_converged=0.509
+
+| cell | n | lam1 median | lam1 IQR | token_acc med |
+|---|---|---|---|---|
+| A_conv_correct | 4103 | -0.8617 | [-0.9020, -0.8153] | 1.000 |
+| B_conv_wrong | 63 | -0.5912 | [-0.8091, -0.5061] | 0.630 |
+| C_nonconv_correct | 195 | -0.6943 | [-0.7354, -0.6466] | 1.000 |
+| D_nonconv_wrong | 3831 | -0.5998 | [-0.6475, -0.5488] | 0.630 |
+
+- mixture: wrong-that-converged = 0.016; correct-that-nonconverged = 0.045
+- dlam1(correct-wrong): overall -0.2586; within-conv -0.2705; within-nonconv -0.0945
+- dlam1(wrong: conv - nonconv) = +0.0085
+- AUC(-lam1 -> correct): overall 0.984; within-conv 0.852; within-nonconv 0.818
+- AUC(-log d_late -> correct) = 0.964; AUC(-lam1 -> converged) = 0.986
+
+## trm_singleGPU_step260410_n512
+- npz: `/home/yurenh2/rrm/research/flossing/diag_trm_singleGPU_step260410_512.npz`, n=512, exact_acc=0.770
+- late-drift def: mean(drift_zH[:, -4:]), Otsu tau(log10)=1.406, frac_converged=0.748
+
+| cell | n | lam1 median | lam1 IQR | token_acc med |
+|---|---|---|---|---|
+| A_conv_correct | 383 | +0.0047 | [+0.0007, +0.0137] | 1.000 |
+| B_conv_wrong | 0 | - | - | - |
+| C_nonconv_correct | 11 | +0.0998 | [+0.0907, +0.1035] | 1.000 |
+| D_nonconv_wrong | 118 | +0.1023 | [+0.0935, +0.1112] | 0.642 |
+
+- mixture: wrong-that-converged = 0.000; correct-that-nonconverged = 0.028
+- dlam1(correct-wrong): overall -0.0973; within-conv +nan; within-nonconv -0.0025
+- dlam1(wrong: conv - nonconv) = +nan
+- AUC(-lam1 -> correct): overall 0.989; within-conv nan; within-nonconv 0.619
+- AUC(-log d_late -> correct) = 0.975; AUC(-lam1 -> converged) = 0.996
+
+## trm_singleGPU_step130205_n512
+- npz: `/home/yurenh2/rrm/research/flossing/diag_trm_singleGPU_step130205_512.npz`, n=512, exact_acc=0.756
+- late-drift def: mean(drift_zH[:, -4:]), Otsu tau(log10)=1.213, frac_converged=0.713
+
+| cell | n | lam1 median | lam1 IQR | token_acc med |
+|---|---|---|---|---|
+| A_conv_correct | 365 | +0.0043 | [-0.0007, +0.0141] | 1.000 |
+| B_conv_wrong | 0 | - | - | - |
+| C_nonconv_correct | 22 | +0.0852 | [+0.0694, +0.0988] | 1.000 |
+| D_nonconv_wrong | 125 | +0.1067 | [+0.0995, +0.1126] | 0.630 |
+
+- mixture: wrong-that-converged = 0.000; correct-that-nonconverged = 0.057
+- dlam1(correct-wrong): overall -0.1017; within-conv +nan; within-nonconv -0.0215
+- dlam1(wrong: conv - nonconv) = +nan
+- AUC(-lam1 -> correct): overall 0.989; within-conv nan; within-nonconv 0.805
+- AUC(-log d_late -> correct) = 0.957; AUC(-lam1 -> converged) = 0.996
+
+## trm_step13020_n512
+- npz: `/home/yurenh2/rrm/research/flossing/diag_trm_step13020_512.npz`, n=512, exact_acc=0.596
+- late-drift def: mean(drift_zH[:, -4:]), Otsu tau(log10)=0.730, frac_converged=0.598
+
+| cell | n | lam1 median | lam1 IQR | token_acc med |
+|---|---|---|---|---|
+| A_conv_correct | 296 | -0.0043 | [-0.0113, +0.0052] | 1.000 |
+| B_conv_wrong | 10 | -0.0027 | [-0.0090, +0.0009] | 0.556 |
+| C_nonconv_correct | 9 | +0.0419 | [+0.0323, +0.0480] | 1.000 |
+| D_nonconv_wrong | 197 | +0.0317 | [+0.0143, +0.0462] | 0.642 |
+
+- mixture: wrong-that-converged = 0.048; correct-that-nonconverged = 0.030
+- dlam1(correct-wrong): overall -0.0339; within-conv -0.0016; within-nonconv +0.0102
+- dlam1(wrong: conv - nonconv) = -0.0344
+- AUC(-lam1 -> correct): overall 0.849; within-conv 0.515; within-nonconv 0.360
+- AUC(-log d_late -> correct) = 0.976; AUC(-lam1 -> converged) = 0.887
diff --git a/research/flossing/analysis_2x2/results_hrm26040_joint_n2048.json b/research/flossing/analysis_2x2/results_hrm26040_joint_n2048.json
new file mode 100644
index 0000000..0a13647
--- /dev/null
+++ b/research/flossing/analysis_2x2/results_hrm26040_joint_n2048.json
@@ -0,0 +1,124 @@
+{
+ "npz": "retest/hrm_righteous_step26040_full_n2048.npz",
+ "n": 2048,
+ "exact_acc": 0.52587890625,
+ "late_drift_def": "mean(drift_zH[:, -4:])",
+ "otsu_tau_log10": 0.7878647951431916,
+ "frac_converged": 0.5048828125,
+ "cells": {
+ "A_conv_correct": {
+ "n": 1020,
+ "lam1_median": -0.15225201100111008,
+ "lam1_mean": -0.14917536273087356,
+ "lam1_iqr": [
+ -0.17794504761695862,
+ -0.12353727221488953
+ ],
+ "lam8_median": -0.2292146310210228,
+ "spectrum_median": [
+ -0.15225201100111008,
+ -0.16475734114646912,
+ -0.1838746815919876,
+ -0.1940322071313858,
+ -0.2046389803290367,
+ -0.2135268971323967,
+ -0.22090013325214386,
+ -0.2292146310210228
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 4.0
+ },
+ "B_conv_wrong": {
+ "n": 14,
+ "lam1_median": -0.035938894376158714,
+ "lam1_mean": -0.045284229197672436,
+ "lam1_iqr": [
+ -0.10053892433643341,
+ 0.023329108022153378
+ ],
+ "lam8_median": -0.09174276143312454,
+ "spectrum_median": [
+ -0.035938894376158714,
+ -0.042827894911170006,
+ -0.05657286383211613,
+ -0.06455842405557632,
+ -0.07352825626730919,
+ -0.07300741225481033,
+ -0.08471239358186722,
+ -0.09174276143312454
+ ],
+ "token_acc_median": 0.6419753134250641,
+ "halted_at_median": 0.0
+ },
+ "C_nonconv_correct": {
+ "n": 57,
+ "lam1_median": -0.04426668584346771,
+ "lam1_mean": -0.03947403389560222,
+ "lam1_iqr": [
+ -0.0739532858133316,
+ -0.009610416367650032
+ ],
+ "lam8_median": -0.115885429084301,
+ "spectrum_median": [
+ -0.04426668584346771,
+ -0.06513039767742157,
+ -0.07873687148094177,
+ -0.09092652052640915,
+ -0.09820108860731125,
+ -0.10941123217344284,
+ -0.10965009033679962,
+ -0.115885429084301
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 14.0
+ },
+ "D_nonconv_wrong": {
+ "n": 957,
+ "lam1_median": 0.031880877912044525,
+ "lam1_mean": 0.037503608156111645,
+ "lam1_iqr": [
+ 0.001404385082423687,
+ 0.07097765058279037
+ ],
+ "lam8_median": -0.06430909037590027,
+ "spectrum_median": [
+ 0.031880877912044525,
+ 0.002569675212725997,
+ -0.016799692064523697,
+ -0.029757805168628693,
+ -0.04117149859666824,
+ -0.049708690494298935,
+ -0.05613057315349579,
+ -0.06430909037590027
+ ],
+ "token_acc_median": 0.6296296119689941,
+ "halted_at_median": 0.0
+ }
+ },
+ "mixture": {
+ "wrong_that_converged": 0.014418125643666324,
+ "correct_that_nonconverged": 0.052924791086350974
+ },
+ "contrasts": {
+ "dlam1_correct_minus_wrong_overall": -0.18112388998270035,
+ "dlam1_within_converged": -0.11631311662495136,
+ "dlam1_within_nonconverged": -0.07614756375551224,
+ "dlam1_wrong_conv_minus_wrong_nonconv": -0.06781977228820324
+ },
+ "auc": {
+ "neg_lam1_predicts_correct_overall": 0.987268674570913,
+ "neg_lam1_predicts_correct_within_conv": 0.8455182072829132,
+ "neg_lam1_predicts_correct_within_nonconv": 0.8541128160002933,
+ "neg_logdrift_predicts_correct": 0.9627364412914158,
+ "neg_lam1_predicts_converged": 0.9916087731145015
+ },
+ "aux": {
+ "late_drift_zL_corr_with_zH_log": 0.9984897992742242,
+ "q_halt_final_median_by_cell": {
+ "A_conv_correct": 7.46875,
+ "B_conv_wrong": -6.671875,
+ "C_nonconv_correct": 7.4375,
+ "D_nonconv_wrong": -9.625
+ }
+ }
+}
\ No newline at end of file
diff --git a/research/flossing/analysis_2x2/results_hrm26040_n8192.json b/research/flossing/analysis_2x2/results_hrm26040_n8192.json
new file mode 100644
index 0000000..bf4c711
--- /dev/null
+++ b/research/flossing/analysis_2x2/results_hrm26040_n8192.json
@@ -0,0 +1,124 @@
+{
+ "npz": "/home/yurenh2/rrm/research/flossing/diag_8k.npz",
+ "n": 8192,
+ "exact_acc": 0.524658203125,
+ "late_drift_def": "mean(drift_zH[:, -4:])",
+ "otsu_tau_log10": 0.766353451713377,
+ "frac_converged": 0.508544921875,
+ "cells": {
+ "A_conv_correct": {
+ "n": 4103,
+ "lam1_median": -0.8617429137229919,
+ "lam1_mean": -0.8564737750180547,
+ "lam1_iqr": [
+ -0.9020151197910309,
+ -0.815339982509613
+ ],
+ "lam8_median": -0.9746241569519043,
+ "spectrum_median": [
+ -0.8617429137229919,
+ -0.8905690908432007,
+ -0.9112593531608582,
+ -0.9292124509811401,
+ -0.9435566663742065,
+ -0.9553980827331543,
+ -0.9662551879882812,
+ -0.9746241569519043
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 4.0
+ },
+ "B_conv_wrong": {
+ "n": 63,
+ "lam1_median": -0.5912281274795532,
+ "lam1_mean": -0.6317614662268806,
+ "lam1_iqr": [
+ -0.809065192937851,
+ -0.5061159133911133
+ ],
+ "lam8_median": -0.7379652261734009,
+ "spectrum_median": [
+ -0.5912281274795532,
+ -0.6236273050308228,
+ -0.6583524942398071,
+ -0.681266725063324,
+ -0.6938952207565308,
+ -0.7069988250732422,
+ -0.7344604730606079,
+ -0.7379652261734009
+ ],
+ "token_acc_median": 0.6296296119689941,
+ "halted_at_median": 0.0
+ },
+ "C_nonconv_correct": {
+ "n": 195,
+ "lam1_median": -0.694297194480896,
+ "lam1_mean": -0.6856319598662548,
+ "lam1_iqr": [
+ -0.7354169189929962,
+ -0.6466438174247742
+ ],
+ "lam8_median": -0.7886528372764587,
+ "spectrum_median": [
+ -0.694297194480896,
+ -0.7198933362960815,
+ -0.7372102737426758,
+ -0.7535885572433472,
+ -0.7657074928283691,
+ -0.7715225219726562,
+ -0.7826006412506104,
+ -0.7886528372764587
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 14.0
+ },
+ "D_nonconv_wrong": {
+ "n": 3831,
+ "lam1_median": -0.5997700691223145,
+ "lam1_mean": -0.5970111100099877,
+ "lam1_iqr": [
+ -0.64748615026474,
+ -0.548848420381546
+ ],
+ "lam8_median": -0.7142836451530457,
+ "spectrum_median": [
+ -0.5997700691223145,
+ -0.634728193283081,
+ -0.6558717489242554,
+ -0.6714187860488892,
+ -0.6842536330223083,
+ -0.6956309676170349,
+ -0.7056014537811279,
+ -0.7142836451530457
+ ],
+ "token_acc_median": 0.6296296119689941,
+ "halted_at_median": 0.0
+ }
+ },
+ "mixture": {
+ "wrong_that_converged": 0.016178736517719568,
+ "correct_that_nonconverged": 0.04536993950674732
+ },
+ "contrasts": {
+ "dlam1_correct_minus_wrong_overall": -0.2586202025413513,
+ "dlam1_within_converged": -0.2705147862434387,
+ "dlam1_within_nonconverged": -0.09452712535858154,
+ "dlam1_wrong_conv_minus_wrong_nonconv": 0.00854194164276123
+ },
+ "auc": {
+ "neg_lam1_predicts_correct_overall": 0.9841127835524125,
+ "neg_lam1_predicts_correct_within_conv": 0.8520014391328065,
+ "neg_lam1_predicts_correct_within_nonconv": 0.8180149790173282,
+ "neg_logdrift_predicts_correct": 0.9639893544685683,
+ "neg_lam1_predicts_converged": 0.9859867295607834
+ },
+ "aux": {
+ "late_drift_zL_corr_with_zH_log": 0.9985142270869076,
+ "q_halt_final_median_by_cell": {
+ "A_conv_correct": 7.46875,
+ "B_conv_wrong": -8.6875,
+ "C_nonconv_correct": 7.46875,
+ "D_nonconv_wrong": -9.625
+ }
+ }
+}
\ No newline at end of file
diff --git a/research/flossing/analysis_2x2/results_trm_official58590_n2048.json b/research/flossing/analysis_2x2/results_trm_official58590_n2048.json
new file mode 100644
index 0000000..bc000b2
--- /dev/null
+++ b/research/flossing/analysis_2x2/results_trm_official58590_n2048.json
@@ -0,0 +1,105 @@
+{
+ "npz": "retest/trm_gbs768_step58590_full_n2048.npz",
+ "n": 2048,
+ "exact_acc": 0.8759765625,
+ "late_drift_def": "mean(drift_zH[:, -4:])",
+ "otsu_tau_log10": 1.3628430376123835,
+ "frac_converged": 0.841796875,
+ "cells": {
+ "A_conv_correct": {
+ "n": 1724,
+ "lam1_median": 0.011281780432909727,
+ "lam1_mean": 0.015547435924631226,
+ "lam1_iqr": [
+ 0.008467044215649366,
+ 0.01652396097779274
+ ],
+ "lam8_median": 0.0020897792419418693,
+ "spectrum_median": [
+ 0.011281780432909727,
+ 0.008820333052426577,
+ 0.007248040288686752,
+ 0.0058647627010941505,
+ 0.004666940309107304,
+ 0.003820874495431781,
+ 0.0028478524181991816,
+ 0.0020897792419418693
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 2.0
+ },
+ "B_conv_wrong": {
+ "n": 0
+ },
+ "C_nonconv_correct": {
+ "n": 70,
+ "lam1_median": 0.0819745659828186,
+ "lam1_mean": 0.07783016583749226,
+ "lam1_iqr": [
+ 0.06863085739314556,
+ 0.09041773155331612
+ ],
+ "lam8_median": 0.008648575283586979,
+ "spectrum_median": [
+ 0.0819745659828186,
+ 0.052037810906767845,
+ 0.039651695638895035,
+ 0.028503953479230404,
+ 0.023648646660149097,
+ 0.016836480237543583,
+ 0.011321825440973043,
+ 0.008648575283586979
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 14.0
+ },
+ "D_nonconv_wrong": {
+ "n": 254,
+ "lam1_median": 0.10287182405591011,
+ "lam1_mean": 0.10287220525283983,
+ "lam1_iqr": [
+ 0.09241268038749695,
+ 0.11242476664483547
+ ],
+ "lam8_median": 0.01879305113106966,
+ "spectrum_median": [
+ 0.10287182405591011,
+ 0.0741146095097065,
+ 0.05705060437321663,
+ 0.045855674892663956,
+ 0.03607587888836861,
+ 0.029474849812686443,
+ 0.023949903436005116,
+ 0.01879305113106966
+ ],
+ "token_acc_median": 0.6296296119689941,
+ "halted_at_median": 0.0
+ }
+ },
+ "mixture": {
+ "wrong_that_converged": 0.0,
+ "correct_that_nonconverged": 0.03901895206243032
+ },
+ "contrasts": {
+ "dlam1_correct_minus_wrong_overall": -0.09140016930177808,
+ "dlam1_within_converged": NaN,
+ "dlam1_within_nonconverged": -0.020897258073091507,
+ "dlam1_wrong_conv_minus_wrong_nonconv": NaN
+ },
+ "auc": {
+ "neg_lam1_predicts_correct_overall": 0.9934800165029538,
+ "neg_lam1_predicts_correct_within_conv": NaN,
+ "neg_lam1_predicts_correct_within_nonconv": 0.8484814398200224,
+ "neg_logdrift_predicts_correct": 0.9719340057409211,
+ "neg_lam1_predicts_converged": 0.9923716736845121
+ },
+ "aux": {
+ "late_drift_zL_corr_with_zH_log": 0.9854614526028403,
+ "q_halt_final_median_by_cell": {
+ "A_conv_correct": 7.78125,
+ "B_conv_wrong": NaN,
+ "C_nonconv_correct": 7.71875,
+ "D_nonconv_wrong": -11.125
+ }
+ }
+}
\ No newline at end of file
diff --git a/research/flossing/analysis_2x2/results_trm_official_gbs768_step58590_n512.json b/research/flossing/analysis_2x2/results_trm_official_gbs768_step58590_n512.json
new file mode 100644
index 0000000..460e407
--- /dev/null
+++ b/research/flossing/analysis_2x2/results_trm_official_gbs768_step58590_n512.json
@@ -0,0 +1,105 @@
+{
+ "npz": "../official_gbs768_spectrum/trm_gbs768_base_step58590_n512_k8_seed20260602.npz",
+ "n": 512,
+ "exact_acc": 0.875,
+ "late_drift_def": "mean(drift_zH[:, -4:])",
+ "otsu_tau_log10": 1.364234039179148,
+ "frac_converged": 0.84765625,
+ "cells": {
+ "A_conv_correct": {
+ "n": 434,
+ "lam1_median": 0.011052772868424654,
+ "lam1_mean": 0.015589851711679268,
+ "lam1_iqr": [
+ 0.008054324658587575,
+ 0.017474188469350338
+ ],
+ "lam8_median": 0.0021472229855135083,
+ "spectrum_median": [
+ 0.011052772868424654,
+ 0.008679452817887068,
+ 0.007291494170203805,
+ 0.005999391432851553,
+ 0.004774029366672039,
+ 0.0037459908053278923,
+ 0.0029579862020909786,
+ 0.0021472229855135083
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 2.0
+ },
+ "B_conv_wrong": {
+ "n": 0
+ },
+ "C_nonconv_correct": {
+ "n": 14,
+ "lam1_median": 0.08274861052632332,
+ "lam1_mean": 0.0802882562524506,
+ "lam1_iqr": [
+ 0.07004822231829166,
+ 0.09586159884929657
+ ],
+ "lam8_median": 0.00618100818246603,
+ "spectrum_median": [
+ 0.08274861052632332,
+ 0.05459940806031227,
+ 0.039645833894610405,
+ 0.026364507153630257,
+ 0.020821912214159966,
+ 0.015413322485983372,
+ 0.010960425715893507,
+ 0.00618100818246603
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 14.0
+ },
+ "D_nonconv_wrong": {
+ "n": 64,
+ "lam1_median": 0.10339764878153801,
+ "lam1_mean": 0.10259491263423115,
+ "lam1_iqr": [
+ 0.09241705574095249,
+ 0.1118917465209961
+ ],
+ "lam8_median": 0.02135413885116577,
+ "spectrum_median": [
+ 0.10339764878153801,
+ 0.07536708191037178,
+ 0.057968027889728546,
+ 0.0463954322040081,
+ 0.03947761468589306,
+ 0.03304685465991497,
+ 0.026251215487718582,
+ 0.02135413885116577
+ ],
+ "token_acc_median": 0.6296296119689941,
+ "halted_at_median": 0.0
+ }
+ },
+ "mixture": {
+ "wrong_that_converged": 0.0,
+ "correct_that_nonconverged": 0.03125
+ },
+ "contrasts": {
+ "dlam1_correct_minus_wrong_overall": -0.09217484341934323,
+ "dlam1_within_converged": NaN,
+ "dlam1_within_nonconverged": -0.02064903825521469,
+ "dlam1_wrong_conv_minus_wrong_nonconv": NaN
+ },
+ "auc": {
+ "neg_lam1_predicts_correct_overall": 0.99267578125,
+ "neg_lam1_predicts_correct_within_conv": NaN,
+ "neg_lam1_predicts_correct_within_nonconv": 0.7667410714285714,
+ "neg_logdrift_predicts_correct": 0.9767020089285714,
+ "neg_lam1_predicts_converged": 0.9906947890818859
+ },
+ "aux": {
+ "late_drift_zL_corr_with_zH_log": 0.983883308834529,
+ "q_halt_final_median_by_cell": {
+ "A_conv_correct": 7.78125,
+ "B_conv_wrong": NaN,
+ "C_nonconv_correct": 7.71875,
+ "D_nonconv_wrong": -11.125
+ }
+ }
+}
\ No newline at end of file
diff --git a/research/flossing/analysis_2x2/results_trm_singleGPU_step130205_n512.json b/research/flossing/analysis_2x2/results_trm_singleGPU_step130205_n512.json
new file mode 100644
index 0000000..9f73dff
--- /dev/null
+++ b/research/flossing/analysis_2x2/results_trm_singleGPU_step130205_n512.json
@@ -0,0 +1,105 @@
+{
+ "npz": "/home/yurenh2/rrm/research/flossing/diag_trm_singleGPU_step130205_512.npz",
+ "n": 512,
+ "exact_acc": 0.755859375,
+ "late_drift_def": "mean(drift_zH[:, -4:])",
+ "otsu_tau_log10": 1.2130658229840834,
+ "frac_converged": 0.712890625,
+ "cells": {
+ "A_conv_correct": {
+ "n": 365,
+ "lam1_median": 0.004312093835324049,
+ "lam1_mean": 0.00919173876134739,
+ "lam1_iqr": [
+ -0.0006733923801220953,
+ 0.01406068168580532
+ ],
+ "lam8_median": -0.008585464209318161,
+ "spectrum_median": [
+ 0.004312093835324049,
+ -0.0004256218671798706,
+ -0.002575409132987261,
+ -0.0042461007833480835,
+ -0.005322618409991264,
+ -0.006433633156120777,
+ -0.007657818961888552,
+ -0.008585464209318161
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 3.0
+ },
+ "B_conv_wrong": {
+ "n": 0
+ },
+ "C_nonconv_correct": {
+ "n": 22,
+ "lam1_median": 0.08519911020994186,
+ "lam1_mean": 0.08400513193654743,
+ "lam1_iqr": [
+ 0.06937860697507858,
+ 0.09877191670238972
+ ],
+ "lam8_median": 0.015529214404523373,
+ "spectrum_median": [
+ 0.08519911020994186,
+ 0.062445785850286484,
+ 0.04946317337453365,
+ 0.03590788505971432,
+ 0.030925835482776165,
+ 0.025972959585487843,
+ 0.017265344969928265,
+ 0.015529214404523373
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 15.0
+ },
+ "D_nonconv_wrong": {
+ "n": 125,
+ "lam1_median": 0.10669758170843124,
+ "lam1_mean": 0.10679112273454666,
+ "lam1_iqr": [
+ 0.09952236711978912,
+ 0.1125824898481369
+ ],
+ "lam8_median": 0.0328078418970108,
+ "spectrum_median": [
+ 0.10669758170843124,
+ 0.0822329968214035,
+ 0.06821739673614502,
+ 0.058041222393512726,
+ 0.04859192296862602,
+ 0.04428848251700401,
+ 0.037217091768980026,
+ 0.0328078418970108
+ ],
+ "token_acc_median": 0.6296296119689941,
+ "halted_at_median": 0.0
+ }
+ },
+ "mixture": {
+ "wrong_that_converged": 0.0,
+ "correct_that_nonconverged": 0.056847545219638244
+ },
+ "contrasts": {
+ "dlam1_correct_minus_wrong_overall": -0.1017410927452147,
+ "dlam1_within_converged": NaN,
+ "dlam1_within_nonconverged": -0.02149847149848938,
+ "dlam1_wrong_conv_minus_wrong_nonconv": NaN
+ },
+ "auc": {
+ "neg_lam1_predicts_correct_overall": 0.988857881136951,
+ "neg_lam1_predicts_correct_within_conv": NaN,
+ "neg_lam1_predicts_correct_within_nonconv": 0.8047272727272727,
+ "neg_logdrift_predicts_correct": 0.957250645994832,
+ "neg_lam1_predicts_converged": 0.9961047432671699
+ },
+ "aux": {
+ "late_drift_zL_corr_with_zH_log": 0.9640733496154094,
+ "q_halt_final_median_by_cell": {
+ "A_conv_correct": 7.0,
+ "B_conv_wrong": NaN,
+ "C_nonconv_correct": 7.390625,
+ "D_nonconv_wrong": -11.125
+ }
+ }
+} \ No newline at end of file
diff --git a/research/flossing/analysis_2x2/results_trm_singleGPU_step260410_n512.json b/research/flossing/analysis_2x2/results_trm_singleGPU_step260410_n512.json
new file mode 100644
index 0000000..3b11879
--- /dev/null
+++ b/research/flossing/analysis_2x2/results_trm_singleGPU_step260410_n512.json
@@ -0,0 +1,105 @@
+{
+ "npz": "/home/yurenh2/rrm/research/flossing/diag_trm_singleGPU_step260410_512.npz",
+ "n": 512,
+ "exact_acc": 0.76953125,
+ "late_drift_def": "mean(drift_zH[:, -4:])",
+ "otsu_tau_log10": 1.406412158338842,
+ "frac_converged": 0.748046875,
+ "cells": {
+ "A_conv_correct": {
+ "n": 383,
+ "lam1_median": 0.004744160454720259,
+ "lam1_mean": 0.01041777972714804,
+ "lam1_iqr": [
+ 0.0007444327347911894,
+ 0.013659899588674307
+ ],
+ "lam8_median": -0.004219561815261841,
+ "spectrum_median": [
+ 0.004744160454720259,
+ 0.0016308835474774241,
+ 0.00021728611318394542,
+ -0.0006795059307478368,
+ -0.001709386589936912,
+ -0.002729899948462844,
+ -0.0034774676896631718,
+ -0.004219561815261841
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 2.0
+ },
+ "B_conv_wrong": {
+ "n": 0
+ },
+ "C_nonconv_correct": {
+ "n": 11,
+ "lam1_median": 0.09982780367136002,
+ "lam1_mean": 0.08986719998277047,
+ "lam1_iqr": [
+ 0.09067171066999435,
+ 0.10350960865616798
+ ],
+ "lam8_median": 0.013320312835276127,
+ "spectrum_median": [
+ 0.09982780367136002,
+ 0.06963343918323517,
+ 0.051610760390758514,
+ 0.03814869746565819,
+ 0.022425860166549683,
+ 0.01762143149971962,
+ 0.012642345391213894,
+ 0.013320312835276127
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 15.0
+ },
+ "D_nonconv_wrong": {
+ "n": 118,
+ "lam1_median": 0.10232088342308998,
+ "lam1_mean": 0.10288177159125522,
+ "lam1_iqr": [
+ 0.09352006390690804,
+ 0.11121566034853458
+ ],
+ "lam8_median": 0.021454622969031334,
+ "spectrum_median": [
+ 0.10232088342308998,
+ 0.07419029250741005,
+ 0.05829045735299587,
+ 0.047421256080269814,
+ 0.040371473878622055,
+ 0.032383667305111885,
+ 0.027303663082420826,
+ 0.021454622969031334
+ ],
+ "token_acc_median": 0.6419752836227417,
+ "halted_at_median": 0.0
+ }
+ },
+ "mixture": {
+ "wrong_that_converged": 0.0,
+ "correct_that_nonconverged": 0.027918781725888325
+ },
+ "contrasts": {
+ "dlam1_correct_minus_wrong_overall": -0.09725644066929817,
+ "dlam1_within_converged": NaN,
+ "dlam1_within_nonconverged": -0.002493079751729965,
+ "dlam1_wrong_conv_minus_wrong_nonconv": NaN
+ },
+ "auc": {
+ "neg_lam1_predicts_correct_overall": 0.9893530069689409,
+ "neg_lam1_predicts_correct_within_conv": NaN,
+ "neg_lam1_predicts_correct_within_nonconv": 0.6186440677966102,
+ "neg_logdrift_predicts_correct": 0.9747913619547449,
+ "neg_lam1_predicts_converged": 0.9955269496225231
+ },
+ "aux": {
+ "late_drift_zL_corr_with_zH_log": 0.9724686803222311,
+ "q_halt_final_median_by_cell": {
+ "A_conv_correct": 7.4375,
+ "B_conv_wrong": NaN,
+ "C_nonconv_correct": 7.40625,
+ "D_nonconv_wrong": -11.125
+ }
+ }
+} \ No newline at end of file
diff --git a/research/flossing/analysis_2x2/results_trm_step13020_n512.json b/research/flossing/analysis_2x2/results_trm_step13020_n512.json
new file mode 100644
index 0000000..ae4af7d
--- /dev/null
+++ b/research/flossing/analysis_2x2/results_trm_step13020_n512.json
@@ -0,0 +1,124 @@
+{
+ "npz": "/home/yurenh2/rrm/research/flossing/diag_trm_step13020_512.npz",
+ "n": 512,
+ "exact_acc": 0.595703125,
+ "late_drift_def": "mean(drift_zH[:, -4:])",
+ "otsu_tau_log10": 0.7298937072389344,
+ "frac_converged": 0.59765625,
+ "cells": {
+ "A_conv_correct": {
+ "n": 296,
+ "lam1_median": -0.004329816671088338,
+ "lam1_mean": -0.002230541864829885,
+ "lam1_iqr": [
+ -0.011287129484117031,
+ 0.00519718776922673
+ ],
+ "lam8_median": -0.03249448351562023,
+ "spectrum_median": [
+ -0.004329816671088338,
+ -0.016021928749978542,
+ -0.021661318838596344,
+ -0.02548989001661539,
+ -0.02795236185193062,
+ -0.03012390434741974,
+ -0.03131883405148983,
+ -0.03249448351562023
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 3.0
+ },
+ "B_conv_wrong": {
+ "n": 10,
+ "lam1_median": -0.0027067698538303375,
+ "lam1_mean": -0.003216998477000743,
+ "lam1_iqr": [
+ -0.008983954321593046,
+ 0.0009138719469774514
+ ],
+ "lam8_median": -0.024164372123777866,
+ "spectrum_median": [
+ -0.0027067698538303375,
+ -0.009424471529200673,
+ -0.014679631683975458,
+ -0.014942141249775887,
+ -0.017778784036636353,
+ -0.01962883025407791,
+ -0.020820041187107563,
+ -0.024164372123777866
+ ],
+ "token_acc_median": 0.5555555820465088,
+ "halted_at_median": 0.0
+ },
+ "C_nonconv_correct": {
+ "n": 9,
+ "lam1_median": 0.04190123826265335,
+ "lam1_mean": 0.04164694120279617,
+ "lam1_iqr": [
+ 0.03228544816374779,
+ 0.047971080988645554
+ ],
+ "lam8_median": -0.018014254048466682,
+ "spectrum_median": [
+ 0.04190123826265335,
+ 0.022236861288547516,
+ 0.008249719627201557,
+ 0.006971194874495268,
+ -0.008609719574451447,
+ -0.007754913531243801,
+ -0.010403371416032314,
+ -0.018014254048466682
+ ],
+ "token_acc_median": 1.0,
+ "halted_at_median": 14.0
+ },
+ "D_nonconv_wrong": {
+ "n": 197,
+ "lam1_median": 0.03166300058364868,
+ "lam1_mean": 0.030858048445575867,
+ "lam1_iqr": [
+ 0.01434984989464283,
+ 0.04617198556661606
+ ],
+ "lam8_median": -0.012672499753534794,
+ "spectrum_median": [
+ 0.03166300058364868,
+ 0.01676548644900322,
+ 0.007448865100741386,
+ 0.0023278011940419674,
+ -0.0027259995695203543,
+ -0.007005904335528612,
+ -0.009610905312001705,
+ -0.012672499753534794
+ ],
+ "token_acc_median": 0.6419752836227417,
+ "halted_at_median": 0.0
+ }
+ },
+ "mixture": {
+ "wrong_that_converged": 0.04830917874396135,
+ "correct_that_nonconverged": 0.029508196721311476
+ },
+ "contrasts": {
+ "dlam1_correct_minus_wrong_overall": -0.03387995809316635,
+ "dlam1_within_converged": -0.0016230468172580004,
+ "dlam1_within_nonconverged": 0.01023823767900467,
+ "dlam1_wrong_conv_minus_wrong_nonconv": -0.03436977043747902
+ },
+ "auc": {
+ "neg_lam1_predicts_correct_overall": 0.8491644887938544,
+ "neg_lam1_predicts_correct_within_conv": 0.5148648648648648,
+ "neg_lam1_predicts_correct_within_nonconv": 0.3598420755781162,
+ "neg_logdrift_predicts_correct": 0.9758295715530213,
+ "neg_lam1_predicts_converged": 0.8866996636842439
+ },
+ "aux": {
+ "late_drift_zL_corr_with_zH_log": 0.9964963268090318,
+ "q_halt_final_median_by_cell": {
+ "A_conv_correct": 7.25,
+ "B_conv_wrong": -10.71875,
+ "C_nonconv_correct": 7.25,
+ "D_nonconv_wrong": -10.4375
+ }
+ }
+} \ No newline at end of file
diff --git a/research/flossing/analysis_2x2/run_phase1_queue.sh b/research/flossing/analysis_2x2/run_phase1_queue.sh
new file mode 100755
index 0000000..989c7ee
--- /dev/null
+++ b/research/flossing/analysis_2x2/run_phase1_queue.sh
@@ -0,0 +1,92 @@
+#!/usr/bin/env bash
+# Phase-1 queue (experiment_framework.md): E5 horizon sweeps, E2 run-level replication,
+# E6 matched-objective step9 pairs. Waits for a free GPU (12h fallback), runs sequentially.
+set -o pipefail
+
+cd /home/yurenh2/rrm/research/flossing || exit 1
+source /home/yurenh2/miniconda3/etc/profile.d/conda.sh
+conda activate rrm
+
+OUTDIR=analysis_2x2/phase1
+mkdir -p "$OUTDIR"
+STATUS="$OUTDIR/queue_status.log"
+TRM_OFF="/home/yurenh2/rrm/trm/checkpoints/Sudoku-extreme-1k-aug-1000-ACT-torch/pretrain_mlp_t_sudoku_official_gbs768_repro"
+TRM_SGL="/home/yurenh2/rrm/trm/checkpoints/Sudoku-extreme-1k-aug-1000-ACT-torch/pretrain_mlp_t_sudoku_singleGPU"
+HRM_ROOT="/home/yurenh2/rrm/hrm/checkpoints/Sudoku-extreme-1k-aug-1000 ACT-torch/HierarchicalReasoningModel_ACTV1 righteous-python"
+S9=/home/yurenh2/rrm/research/flossing
+
+log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" >> "$STATUS"; }
+free_gpu() {
+ nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits \
+ | awk -F', ' '$2<30 && $3<8000 {print $1; exit}'
+}
+
+log "phase-1 queue started (E5 horizon sweeps, E2 step9_E replication, E6 step9 pairs)"
+DEADLINE=$(( $(date +%s) + 12*3600 ))
+GPU=""
+while true; do
+ g1="$(free_gpu)"
+ if [[ -n "$g1" ]]; then
+ sleep 60; g2="$(free_gpu)"
+ if [[ "$g2" == "$g1" ]]; then GPU="$g1"; break; fi
+ fi
+ if (( $(date +%s) > DEADLINE )); then
+ GPU="$(nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | sort -t, -k2 -n | head -1 | cut -d, -f1)"
+ log "12h fallback: taking GPU $GPU"
+ break
+ fi
+ sleep 300
+done
+log "claimed GPU $GPU"
+export CUDA_VISIBLE_DEVICES="$GPU"
+
+run_job() { # name horizon script args...
+ local name="$1" hor="$2"; shift 2
+ if [[ -f "$OUTDIR/${name}.npz" ]]; then log "skip $name"; return 0; fi
+ log "start $name"
+ if DIAG_HORIZON="$hor" python "$@" --out "$OUTDIR/${name}.npz" > "$OUTDIR/${name}.log" 2>&1; then
+ log "done $name"
+ else
+ log "FAILED $name"
+ fi
+}
+
+# --- E5: TRM horizon sweep (h=4 already exists in retest/) ---
+for H in 2 6 8 10 12; do
+ run_job "trm_official58590_h${H}_n2048" "$H" diagnose_trm_joint_horizon.py \
+ --ckpt-root "$TRM_OFF" --ckpt-name step_58590 --n-samples 2048 --batch-size 16 \
+ --k-lyap 8 --t-ons 1 --seed 0
+done
+
+# --- E5: HRM horizon sweep ---
+for H in 2 6 8 10 12; do
+ run_job "hrm26040_h${H}_n2048" "$H" diagnose_hrm_joint_horizon.py \
+ --ckpt-root "$HRM_ROOT" --ckpt-name step_26040 --n-samples 2048 --batch-size 32 \
+ --k-lyap 8 --t-ons 1 --seed 0
+done
+
+# --- E2: HRM second training run (step9_E fixed-unroll baseline), full window ---
+run_job "step9E_hrm_best_full_n2048" 16 diagnose_hrm_joint.py \
+ --ckpt-root "$HRM_ROOT" --ckpt-name "$S9/step9_E_hrm_baseline_parallel_fixed_26040_50k_ckpts/best.pt" \
+ --n-samples 2048 --batch-size 32 --k-lyap 8 --t-ons 1 --seed 0
+run_job "step9E_hrm_final_full_n2048" 16 diagnose_hrm_joint.py \
+ --ckpt-root "$HRM_ROOT" --ckpt-name "$S9/step9_E_hrm_baseline_parallel_fixed_26040_50k_ckpts/final.pt" \
+ --n-samples 2048 --batch-size 32 --k-lyap 8 --t-ons 1 --seed 0
+
+# --- E6: matched-objective pairs (n=512): HRM E vs F, TRM G vs H ---
+for CK in step_12500 step_25000 best final; do
+ run_job "step9E_hrm_${CK}_n512" 16 diagnose_hrm_joint.py \
+ --ckpt-root "$HRM_ROOT" --ckpt-name "$S9/step9_E_hrm_baseline_parallel_fixed_26040_50k_ckpts/${CK}.pt" \
+ --n-samples 512 --batch-size 32 --k-lyap 8 --t-ons 1 --seed 0
+ run_job "step9F_hrm_${CK}_n512" 16 diagnose_hrm_joint.py \
+ --ckpt-root "$HRM_ROOT" --ckpt-name "$S9/step9_F_hrm_multi4_loguniform_ramp_26040_50k_ckpts/${CK}.pt" \
+ --n-samples 512 --batch-size 32 --k-lyap 8 --t-ons 1 --seed 0
+ run_job "step9G_trm_${CK}_n512" 16 diagnose_trm_joint.py \
+ --ckpt-root "$TRM_SGL" --ckpt-name "$S9/step9_G_trm_baseline_parallel_fixed_26041_batch4_50k_ckpts/${CK}.pt" \
+ --n-samples 512 --batch-size 16 --k-lyap 8 --t-ons 1 --seed 0
+ run_job "step9H_trm_${CK}_n512" 16 diagnose_trm_joint.py \
+ --ckpt-root "$TRM_SGL" --ckpt-name "$S9/step9_H_trm_multi4_loguniform_ramp_26041_batch4_50k_ckpts/${CK}.pt" \
+ --n-samples 512 --batch-size 16 --k-lyap 8 --t-ons 1 --seed 0
+done
+
+log "phase-1 queue finished"
diff --git a/research/flossing/analysis_2x2/run_retest_2x2.sh b/research/flossing/analysis_2x2/run_retest_2x2.sh
new file mode 100755
index 0000000..9cdcc13
--- /dev/null
+++ b/research/flossing/analysis_2x2/run_retest_2x2.sh
@@ -0,0 +1,75 @@
+#!/usr/bin/env bash
+# Queue: wait for a free GPU, then run the 2x2 re-test diagnostics:
+# 1. TRM official_gbs768 @ step_58590 (86.9% ckpt) full-window n=2048
+# 2. TRM official_gbs768 @ step_58590 early-window n=2048 (first 4 ACT steps)
+# 3. HRM righteous-python @ step_26040 (joint est.) full-window n=2048
+# 4. HRM righteous-python @ step_26040 early-window n=2048
+# Same --seed 0 and same n across full/short pairs so idx fields pair up.
+set -o pipefail
+
+cd /home/yurenh2/rrm/research/flossing || exit 1
+source /home/yurenh2/miniconda3/etc/profile.d/conda.sh
+conda activate rrm
+
+OUTDIR=analysis_2x2/retest
+mkdir -p "$OUTDIR"
+STATUS="$OUTDIR/queue_status.log"
+TRM_ROOT="/home/yurenh2/rrm/trm/checkpoints/Sudoku-extreme-1k-aug-1000-ACT-torch/pretrain_mlp_t_sudoku_official_gbs768_repro"
+HRM_ROOT="/home/yurenh2/rrm/hrm/checkpoints/Sudoku-extreme-1k-aug-1000 ACT-torch/HierarchicalReasoningModel_ACTV1 righteous-python"
+
+log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" >> "$STATUS"; }
+
+free_gpu() {
+ # print the index of the first GPU with util<30% and memory.used<8000 MiB; print nothing if none qualifies
+ nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits \
+ | awk -F', ' '$2<30 && $3<8000 {print $1; exit}'
+}
+
+log "queue started, waiting for a free GPU (util<30%, mem<8GB, two checks 60s apart; 12h fallback)"
+DEADLINE=$(( $(date +%s) + 12*3600 ))
+GPU=""
+while true; do
+ g1="$(free_gpu)"
+ if [[ -n "$g1" ]]; then
+ sleep 60
+ g2="$(free_gpu)"
+ if [[ "$g2" == "$g1" ]]; then GPU="$g1"; break; fi
+ fi
+ if (( $(date +%s) > DEADLINE )); then
+ GPU="$(nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | sort -t, -k2 -n | head -1 | cut -d, -f1)"
+ log "12h fallback: taking GPU $GPU (least memory used) despite utilization"
+ break
+ fi
+ sleep 300
+done
+log "claimed GPU $GPU"
+export CUDA_VISIBLE_DEVICES="$GPU"
+
+run_job() {
+ local name="$1"; shift
+ if [[ -f "$OUTDIR/${name}.npz" ]]; then log "skip $name (output exists)"; return 0; fi
+ log "start $name"
+ if python "$@" --out "$OUTDIR/${name}.npz" > "$OUTDIR/${name}.log" 2>&1; then
+ log "done $name"
+ else
+ log "FAILED $name (see $OUTDIR/${name}.log)"
+ fi
+}
+
+run_job trm_gbs768_step58590_full_n2048 \
+ diagnose_trm_joint.py --ckpt-root "$TRM_ROOT" --ckpt-name step_58590 \
+ --n-samples 2048 --batch-size 16 --k-lyap 8 --t-ons 1 --seed 0
+
+run_job trm_gbs768_step58590_short_n2048 \
+ diagnose_trm_joint_short.py --ckpt-root "$TRM_ROOT" --ckpt-name step_58590 \
+ --n-samples 2048 --batch-size 16 --k-lyap 8 --t-ons 1 --seed 0
+
+run_job hrm_righteous_step26040_full_n2048 \
+ diagnose_hrm_joint.py --ckpt-root "$HRM_ROOT" --ckpt-name step_26040 \
+ --n-samples 2048 --batch-size 32 --k-lyap 8 --t-ons 1 --seed 0
+
+run_job hrm_righteous_step26040_short_n2048 \
+ diagnose_hrm_joint_short.py --ckpt-root "$HRM_ROOT" --ckpt-name step_26040 \
+ --n-samples 2048 --batch-size 32 --k-lyap 8 --t-ons 1 --seed 0
+
+log "all retest diagnostics finished"
diff --git a/research/flossing/analysis_2x2/sweep_hrm26040_joint_n2048.csv b/research/flossing/analysis_2x2/sweep_hrm26040_joint_n2048.csv
new file mode 100644
index 0000000..c4a8b6c
--- /dev/null
+++ b/research/flossing/analysis_2x2/sweep_hrm26040_joint_n2048.csv
@@ -0,0 +1,20 @@
+pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D
+5,-0.011992306107950053,102,1,975,970,-0.22750887274742126,0.03138776309788227
+10,-0.01058249634236916,203,2,874,969,-0.15784957632422447,0.0314599983394146
+15,-0.009196568905867631,306,2,771,969,-0.15784957632422447,0.0314599983394146
+20,-0.008030915298460381,408,2,669,969,-0.15784957632422447,0.0314599983394146
+25,-0.006957811036151525,509,3,568,968,-0.10465513914823532,0.03146406076848507
+30,-0.006023327926233211,612,3,465,968,-0.10465513914823532,0.03146406076848507
+35,-0.004937302122343026,714,3,363,968,-0.10465513914823532,0.03146406076848507
+40,-0.0033157474065108714,814,5,263,966,-0.14139553904533386,0.03153638541698456
+45,-0.0012688460769353994,917,5,160,966,-0.14139553904533386,0.03153638541698456
+50,0.5973517463361264,1013,11,64,960,-0.0452079139649868,0.03175541199743748
+55,1.3993508404545523,1034,92,43,879,0.033553484827280045,0.031315527856349945
+60,1.4994241881984267,1036,193,41,778,0.029089782387018204,0.03191038407385349
+65,1.547490873090265,1038,293,39,678,0.02823900803923607,0.032731860876083374
+70,1.5799635318056726,1040,393,37,578,0.029996931552886963,0.03191038407385349
+75,1.6038922156413973,1040,496,37,475,0.029781382530927658,0.03269977867603302
+80,1.6256149248592253,1040,598,37,373,0.03138776309788227,0.030956093221902847
+85,1.6518692868467673,1042,698,35,273,0.02991918846964836,0.03507234901189804
+90,1.677233478220475,1044,799,33,172,0.028961632400751114,0.044597823172807693
+95,1.708463281897988,1049,896,28,75,0.029092321172356606,0.052752863615751266
diff --git a/research/flossing/analysis_2x2/sweep_hrm26040_n8192.csv b/research/flossing/analysis_2x2/sweep_hrm26040_n8192.csv
new file mode 100644
index 0000000..b2f9f82
--- /dev/null
+++ b/research/flossing/analysis_2x2/sweep_hrm26040_n8192.csv
@@ -0,0 +1,20 @@
+pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D
+5,-0.024403165931412795,410,0,3888,3894,nan,-0.5996554493904114
+10,-0.02256924601928628,818,2,3480,3892,-0.8345716893672943,-0.5995031297206879
+15,-0.021242109610698556,1225,4,3073,3890,-0.8419434428215027,-0.5994212925434113
+20,-0.020075217197184483,1634,5,2664,3889,-0.8429288864135742,-0.599414587020874
+25,-0.019023493082334777,2039,9,2259,3885,-0.8409579992294312,-0.5993443727493286
+30,-0.017916215347934226,2444,14,1854,3880,-0.8415209650993347,-0.599242627620697
+35,-0.016770285088033448,2852,15,1446,3879,-0.8420839309692383,-0.5992319583892822
+40,-0.015257325632230992,3260,17,1038,3877,-0.8420839309692383,-0.599224328994751
+45,-0.012884453933432984,3665,21,633,3873,-0.8420839309692383,-0.5990880727767944
+50,0.371726245629762,4067,29,231,3865,-0.8250086307525635,-0.5989847183227539
+55,1.3871392863925978,4136,370,162,3524,-0.5518185496330261,-0.6034163236618042
+60,1.4972477780890994,4138,777,160,3117,-0.5671209096908569,-0.6080851554870605
+65,1.5481401201578837,4144,1181,154,2713,-0.5731313228607178,-0.6117637753486633
+70,1.5835620683527638,4147,1587,151,2307,-0.5808747410774231,-0.6136600971221924
+75,1.6110709254128324,4153,1991,145,1903,-0.584848165512085,-0.616129994392395
+80,1.6353553466380342,4156,2397,142,1497,-0.5879771709442139,-0.6185622215270996
+85,1.6574195825233073,4161,2802,137,1092,-0.5930990278720856,-0.6169184744358063
+90,1.6817109395003784,4167,3205,131,689,-0.5954653024673462,-0.619819164276123
+95,1.7160308725239535,4184,3598,114,296,-0.5972362160682678,-0.6263471245765686
diff --git a/research/flossing/analysis_2x2/sweep_trm_official58590_n2048.csv b/research/flossing/analysis_2x2/sweep_trm_official58590_n2048.csv
new file mode 100644
index 0000000..747eb40
--- /dev/null
+++ b/research/flossing/analysis_2x2/sweep_trm_official58590_n2048.csv
@@ -0,0 +1,20 @@
+pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D
+5,0.6425822518044539,103,0,1691,254,nan,0.10287182405591011
+10,0.7090247691558889,205,0,1589,254,nan,0.10287182405591011
+15,0.7579128424918095,308,0,1486,254,nan,0.10287182405591011
+20,0.7959777945095498,410,0,1384,254,nan,0.10287182405591011
+25,0.8365441996415908,512,0,1282,254,nan,0.10287182405591011
+30,0.8629687925201839,615,0,1179,254,nan,0.10287182405591011
+35,0.8888206300392528,717,0,1077,254,nan,0.10287182405591011
+40,0.9161186566022281,819,0,975,254,nan,0.10287182405591011
+45,0.9434356854634222,922,0,872,254,nan,0.10287182405591011
+50,0.9675603217529236,1024,0,770,254,nan,0.10287182405591011
+55,0.9912876432316691,1126,0,668,254,nan,0.10287182405591011
+60,1.01543024505144,1229,0,565,254,nan,0.10287182405591011
+65,1.0446947959840196,1331,0,463,254,nan,0.10287182405591011
+70,1.0695176837117564,1433,0,361,254,nan,0.10287182405591011
+75,1.100648956309921,1536,0,258,254,nan,0.10287182405591011
+80,1.2184501775276642,1638,0,156,254,nan,0.10287182405591011
+85,1.696239099229016,1733,7,61,247,0.1096716970205307,0.10244401544332504
+90,1.794792940908479,1740,103,54,151,0.10650952905416489,0.10119745135307312
+95,1.8549837815942356,1749,196,45,58,0.10514867305755615,0.09396017715334892
diff --git a/research/flossing/analysis_2x2/sweep_trm_official_gbs768_step58590_n512.csv b/research/flossing/analysis_2x2/sweep_trm_official_gbs768_step58590_n512.csv
new file mode 100644
index 0000000..011c7cc
--- /dev/null
+++ b/research/flossing/analysis_2x2/sweep_trm_official_gbs768_step58590_n512.csv
@@ -0,0 +1,20 @@
+pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D
+5,0.6520149305579591,26,0,422,64,nan,0.10339764878153801
+10,0.7055634417495923,52,0,396,64,nan,0.10339764878153801
+15,0.7503615854928217,77,0,371,64,nan,0.10339764878153801
+20,0.8005556182318934,103,0,345,64,nan,0.10339764878153801
+25,0.840565017355408,128,0,320,64,nan,0.10339764878153801
+30,0.8627818507451074,154,0,294,64,nan,0.10339764878153801
+35,0.8956748981953134,179,0,269,64,nan,0.10339764878153801
+40,0.923274365661878,205,0,243,64,nan,0.10339764878153801
+45,0.9484961694413702,230,0,218,64,nan,0.10339764878153801
+50,0.9733951084873498,256,0,192,64,nan,0.10339764878153801
+55,1.0041568660376061,282,0,166,64,nan,0.10339764878153801
+60,1.02397895505604,307,0,141,64,nan,0.10339764878153801
+65,1.04513267210118,333,0,115,64,nan,0.10339764878153801
+70,1.0684270138688392,358,0,90,64,nan,0.10339764878153801
+75,1.097585285269502,384,0,64,64,nan,0.10339764878153801
+80,1.1958246561577712,409,0,39,64,nan,0.10339764878153801
+85,1.3747578433518455,435,0,13,64,nan,0.10339764878153801
+90,1.785234311475947,437,23,11,41,0.10155700892210007,0.10475098341703415
+95,1.8427352264688373,438,48,10,16,0.10240473970770836,0.10433581843972206
diff --git a/research/flossing/analysis_2x2/sweep_trm_singleGPU_step130205_n512.csv b/research/flossing/analysis_2x2/sweep_trm_singleGPU_step130205_n512.csv
new file mode 100644
index 0000000..2c44154
--- /dev/null
+++ b/research/flossing/analysis_2x2/sweep_trm_singleGPU_step130205_n512.csv
@@ -0,0 +1,20 @@
+pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D
+5,0.3525141817739674,26,0,361,125,nan,0.10669758170843124
+10,0.5997049779303102,52,0,335,125,nan,0.10669758170843124
+15,0.7276586924161174,77,0,310,125,nan,0.10669758170843124
+20,0.745895549719167,103,0,284,125,nan,0.10669758170843124
+25,0.7574238256855987,128,0,259,125,nan,0.10669758170843124
+30,0.7756026094327713,154,0,233,125,nan,0.10669758170843124
+35,0.7948198738771284,179,0,208,125,nan,0.10669758170843124
+40,0.8150679269326413,205,0,182,125,nan,0.10669758170843124
+45,0.8328527737072317,230,0,157,125,nan,0.10669758170843124
+50,0.8680717056251492,256,0,131,125,nan,0.10669758170843124
+55,0.8878173590916981,282,0,105,125,nan,0.10669758170843124
+60,0.9114474129601072,307,0,80,125,nan,0.10669758170843124
+65,0.9357973777374493,333,0,54,125,nan,0.10669758170843124
+70,0.9914630324419641,358,0,29,125,nan,0.10669758170843124
+75,1.6265775245886065,370,14,17,111,0.10315298289060593,0.10740122199058533
+80,1.6638789305051203,370,39,17,86,0.10702066123485565,0.10642785206437111
+85,1.690579189299017,370,65,17,60,0.10719160735607147,0.10596203431487083
+90,1.722529342618924,370,90,17,35,0.10658939182758331,0.10669758170843124
+95,1.7704888842696966,372,114,15,11,0.10589195787906647,0.10992329567670822
diff --git a/research/flossing/analysis_2x2/sweep_trm_singleGPU_step260410_n512.csv b/research/flossing/analysis_2x2/sweep_trm_singleGPU_step260410_n512.csv
new file mode 100644
index 0000000..fac3a6d
--- /dev/null
+++ b/research/flossing/analysis_2x2/sweep_trm_singleGPU_step260410_n512.csv
@@ -0,0 +1,20 @@
+pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D
+5,1.1904157471082346,26,0,368,118,nan,0.10232088342308998
+10,1.205761365098233,52,0,342,118,nan,0.10232088342308998
+15,1.220964285676207,77,0,317,118,nan,0.10232088342308998
+20,1.2340382280630668,103,0,291,118,nan,0.10232088342308998
+25,1.2434017565460773,128,0,266,118,nan,0.10232088342308998
+30,1.253923051119596,154,0,240,118,nan,0.10232088342308998
+35,1.260020693170751,179,0,215,118,nan,0.10232088342308998
+40,1.2729663625028254,205,0,189,118,nan,0.10232088342308998
+45,1.286432864193943,230,0,164,118,nan,0.10232088342308998
+50,1.29425001787462,256,0,138,118,nan,0.10232088342308998
+55,1.3019951688280547,282,0,112,118,nan,0.10232088342308998
+60,1.3106991625583733,307,0,87,118,nan,0.10232088342308998
+65,1.323434760415417,333,0,61,118,nan,0.10232088342308998
+70,1.3423083391290647,358,0,36,118,nan,0.10232088342308998
+75,1.4595699782590752,384,0,10,118,nan,0.10232088342308998
+80,1.6965097149553376,384,25,10,93,0.10151073336601257,0.10370030999183655
+85,1.7416635003129175,384,51,10,67,0.1015518382191658,0.10389910638332367
+90,1.7702691318062869,384,76,10,42,0.10161199420690536,0.10456060990691185
+95,1.7933802257282094,384,102,10,16,0.10197825729846954,0.10976629704236984
diff --git a/research/flossing/analysis_2x2/sweep_trm_step13020_n512.csv b/research/flossing/analysis_2x2/sweep_trm_step13020_n512.csv
new file mode 100644
index 0000000..4dba5eb
--- /dev/null
+++ b/research/flossing/analysis_2x2/sweep_trm_step13020_n512.csv
@@ -0,0 +1,20 @@
+pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D
+5,0.1461886368890553,26,0,279,207,nan,0.02986210025846958
+10,0.1484773952154017,52,0,253,207,nan,0.02986210025846958
+15,0.14987173414316576,77,0,228,207,nan,0.02986210025846958
+20,0.15185786893932013,103,0,202,207,nan,0.02986210025846958
+25,0.15372304028625772,128,0,177,207,nan,0.02986210025846958
+30,0.15513756643739401,154,0,151,207,nan,0.02986210025846958
+35,0.15748307719056248,179,0,126,207,nan,0.02986210025846958
+40,0.16089218583944756,205,0,100,207,nan,0.02986210025846958
+45,0.16811494142658162,230,0,75,207,nan,0.02986210025846958
+50,0.21289853477178294,256,0,49,207,nan,0.02986210025846958
+55,0.31298600561046025,281,1,24,206,-0.014628570526838303,0.030006153509020805
+60,0.8658403204807424,296,11,9,196,-0.003207409055903554,0.03166834078729153
+65,1.1762548477525878,298,35,7,172,-0.0013167093275114894,0.03500013053417206
+70,1.3359059506661692,298,60,7,147,0.008530525024980307,0.03881623595952988
+75,1.4369319380372483,298,86,7,121,0.01115282904356718,0.043723251670598984
+80,1.5107383605809237,298,111,7,96,0.014448095113039017,0.045386193320155144
+85,1.5691906481172995,298,137,7,70,0.020133845508098602,0.04726334474980831
+90,1.6297611879974812,298,162,7,45,0.022078031674027443,0.04918263107538223
+95,1.7017159174665506,298,188,7,19,0.0257729422301054,0.061704523861408234