| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
| commit | 66e0d8b9fd4d0f7a2231d689c055e26fdf1cf04a (patch) | |
| tree | c29cba61124018755a19b02c9d33e3ad5f2e05cc /research/flossing/analysis_2x2 | |
Curated export for clone-and-run Maze training (2x A6000) + diagnostics.
trm/hrm pretrain.py carry trajectory-augmentation code (backward-compatible).
Heavy artifacts (checkpoints/wandb/npz) gitignored; see PROVENANCE.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'research/flossing/analysis_2x2')
37 files changed, 2490 insertions, 0 deletions
diff --git a/research/flossing/analysis_2x2/OBSERVATIONS.md b/research/flossing/analysis_2x2/OBSERVATIONS.md new file mode 100644 index 0000000..d0b0670 --- /dev/null +++ b/research/flossing/analysis_2x2/OBSERVATIONS.md @@ -0,0 +1,155 @@ +# 2×2 analysis: (late-trajectory settling) × (answer correctness) — observations + +Date: 2026-06-11. Code: `analyze_2x2.py` (+ ad-hoc strict-threshold checks logged in session). +All statements below are measurements on existing diagnostic npz files; no new GPU runs. + +## Data provenance (CRITICAL: λ scales are NOT comparable across estimator versions) + +| dataset | model / ckpt | n | estimator | window | +|---|---|---|---|---| +| `diag_8k.npz` | HRM righteous-python @ step_26040 (H=2,L=2) | 8192 | `diagnose_hrm.py` (May 22 version) | full 16 ACT steps | +| `diag_trm_singleGPU_step*_512.npz` ×10 | TRM mlp_t singleGPU @ 26041…260410 | 512 each | `diagnose_trm_joint.py` (joint JVP+QR, per-sub-update norm) | full 16 ACT steps | +| `diag_hrm_step_*_512.npz` ×10 | HRM @ 2604…26040 | 512 each | joint estimator (−0.15-scale, matches step7 reports) | full 16 ACT steps | + +Within-dataset comparisons only. "Converged"/"settled" here = late z_H drift (mean over ACT steps 13–16) +falls in the low-drift band; threshold from Otsu on pooled log10 drift, robustness via percentile sweep. +Settled ≠ literal fixed point: the settled band has a narrow characteristic residual velocity +(HRM ≈0.96/step, IQR-width <0.03 in log10; TRM ≈18.5/step, q10–q90 = 15.8–21.4). + +## Headline numbers + +### HRM @26040, n=8192, exact_acc 0.525 (full-window λ, May-22 estimator scale) +Otsu τ → cells: A(settled,correct)=4103, B(settled,wrong)=63, C(unsettled,correct)=195, D(unsettled,wrong)=3831. + +- Failure mass is overwhelmingly unsettled: wrong-that-settled = 1.6% at Otsu τ; at a STRICT in-band + threshold (45th pct, drift<0.97) it is 21/3894 = **0.55% of failures**. 
+- The strict-band B examples (n=21) have λ₁ median **−0.842** vs A −0.867 (success-like contraction), + q_halt(final) median **+7.47 — identical to A** (ACT-confident), token_acc 0.41–0.88 (median 0.62, + substantially wrong, not near-misses). These are confidently-wrong settled answers — invisible to + both a stability-based and an ACT-confidence-based selector. +- λ₁ as a predictor: AUC(−λ₁→correct) = 0.984 ≈ AUC(−λ₁→settled) = 0.986. At the Otsu split the + within-settled outcome gap (Δλ₁ −0.27, AUC 0.852) is a THRESHOLD-MIXTURE ARTIFACT: the Otsu τ sits + above the band gap (drift distribution has a near-empty region between pct≈45 and pct≈55, + τ jumps −0.013→1.39), so Otsu-"settled" includes mid-drift examples. At the strict in-band τ the + within-settled gap nearly closes (−0.867 vs −0.842). +- Residual outcome signal in the unsettled stratum: C vs D Δλ₁(median) = −0.094, AUC 0.818 + (C n=195). Note C drifts MORE than D (median 55.4 vs 40.8) while being more contracting in λ₁. +- AUC(−log late-drift→correct) = 0.964. + +### TRM singleGPU @260410 (final), n=512, exact_acc 0.770 (joint estimator) +Cells: A=383, B=**0**, C=11, D=118. + +- **B = 0 at every threshold**: drift distributions of settled-correct (q90 = 21.4) and wrong + (q10 = 47.8) are completely separated — no wrong example reaches the settled band. Robust across + the entire percentile sweep (nB=0, nD=118 constant). Same at step 130205. Across the 10-ckpt series, + fB>0 only at step 26041 (1.2%). +- λ₁ medians: A +0.0047 (≈0, consistent with "TRM success short-window λ₁>0 is acceptable"), + C +0.0998, D +0.1023. Within-unsettled outcome gap ≈ 0 (Δ −0.0025; AUC 0.619, C n=11). + AUC(−λ₁→correct) = 0.989 ≈ AUC(−λ₁→settled) = 0.996: at the final checkpoint, λ₁'s outcome + signal in TRM is (almost) entirely settling-regime detection. +- q_halt(final) median: A +7.44, C +7.41, D −11.1. 
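The cell counts and the "wrong-that-settled" fraction quoted in the headline numbers can be recomputed at any percentile threshold. The sketch below mirrors the percentile sweep in `analyze_2x2.py` (settled means log10 late drift below the chosen percentile). The function name `cells_at_percentile` and its argument names are hypothetical, and the arrays are assumed to be `mean(drift_zH[:, -4:])` and `exact_correct` from one diagnostic npz.

```python
import numpy as np


def cells_at_percentile(d_late, correct, pct):
    """Count the 2x2 cells with 'settled' = log10 late drift below the pct-th percentile.

    Hypothetical helper: d_late is the per-example late drift, correct is a 0/1 array.
    """
    logd = np.log10(np.clip(np.asarray(d_late, dtype=float), 1e-12, None))
    correct = np.asarray(correct).astype(bool)
    tau = np.percentile(logd, pct)  # threshold is derived per dataset, never reused
    settled = logd < tau
    n_b = int((settled & ~correct).sum())
    n_d = int((~settled & ~correct).sum())
    return {
        "tau_log10": float(tau),
        "A": int((settled & correct).sum()),
        "B": n_b,
        "C": int((~settled & correct).sum()),
        "D": n_d,
        # share of all failures that sit in the settled band
        "wrong_that_settled": n_b / max(n_b + n_d, 1),
    }


# Toy example: four low-drift and four high-drift trajectories.
toy = cells_at_percentile(
    d_late=[0.9, 1.0, 1.1, 0.95, 40.0, 50.0, 60.0, 45.0],
    correct=[1, 1, 1, 0, 1, 0, 0, 0],
    pct=50,
)
```

Because the threshold is a percentile of the pooled drift, the same call on a different checkpoint yields a different `tau_log10`, which is why only within-dataset comparisons are reported.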
+ +## Checkpoint evolution (512/ckpt; within-series comparisons) + +- TRM (26041→260410): fB≈0 from 52082 on; fD shrinks 0.41→0.23; λ₁(D) rises monotonically + +0.036→+0.102 while λ₁(A) stays ≈0 — the success/failure λ₁ gap widens over training via the + failure cell becoming more expansive. +- HRM (2604→26040, joint-estimator series): mass migration over training: + fB 0.89→0.008 (early ckpts: nearly all examples low-drift & wrong), fD rises to ~0.52 mid-training, + fA grows with accuracy; λ₁(D) rises from −0.087 to +0.023 (sign flip ~step 15–18k), λ₁(A) stays + −0.10…−0.20. + +## Direct answers to the motivating questions + +1. **Is the failure FTLE cluster a mixture of wandering + wrong-fixed-point modes?** + Measured: yes, but extremely lopsided. Wandering dominates (HRM ≥98.4% of failures at Otsu τ, + 99.45% at strict τ; TRM final 100%). The wrong-fixed-point mode exists in HRM (21 examples) with + exactly the predicted signature (success-like λ₁), and is absent in TRM at late checkpoints. +2. **Does conditioning on settling absorb the success/failure FTLE gap?** + TRM final: essentially yes (within-stratum AUC 0.619 with n=11; regime AUC 0.996). + HRM: mostly yes after threshold correction (strict-band within-settled gap ≈0.025), with one + genuine residual: the unsettled stratum retains an outcome gap (AUC 0.818, C n=195). +3. **Relation to published taxonomies** (factual cross-references): + Ren & Liu's four HRM modes map onto our cells; their "non-trivial failure (converged to wrong + fixed point)" = our strict-band B (rare: 0.55% of failures); their "trivial failure (wanders or + oscillates)" = our D (dominant). Efstathiou & Balwani's "failed runs plateau at stable high-loss + attractors" (TRM): by our state-drift criterion, TRM failures are NOT settled (B=0, drift ≥~48/step, + ~0.77× the early-trajectory velocity); their "stable" refers to loss plateaus/bounded regions, + not state convergence. Our drift+λ₁ measurement distinguishes these. 
+ +## Addendum (same day, offline follow-ups — see offline_followups/followups.md) + +**REVISION of headline point 2.** "λ₁'s outcome signal is (almost) entirely settling-regime detection" +was based on comparing raw AUCs and is too strong as a mediation claim. The proper control — +AUC(−λ₁→correct) **within matched late-drift deciles** of the unsettled stratum — shows substantial +independent signal (HRM: per-decile 0.97→0.69 from low to high drift, weighted mean 0.879; +unconditioned within-unsettled 0.933). TRM official @58590 with a strict band τ shows the same +qualitatively: unsettled-correct λ₁ ≈ +0.017 (n=141) vs unsettled-wrong +0.103 (n=64) at overlapping +drift levels. Corrected statement: **λ₁ correlates strongly with settledness, but at matched drift +level it still separates outcome** — drift and λ₁ are not redundant observables. + +**Difficulty control (#givens, crude proxy).** HRM n=8192: Spearman(correct, givens)=+0.28; +Spearman(λ₁, givens)=−0.35 overall but −0.16/−0.18 within outcome. Within-givens-bin +AUC(−λ₁→correct) = 0.976–0.987 (weighted 0.982, vs overall 0.984): at the #givens level, the +FTLE-outcome separation is NOT a difficulty artifact. (Solver-backtrack difficulty not available +offline; #givens is a weak proxy — flag for the writeup.) + +**Strict-B per-example (n=21).** All 21 have halted_at ∈ [4,9] (median 6) — under real ACT inference +every one would have halted early, confidently (q_halt +7.4–7.5), and wrong (token_acc 0.41–0.88). +The three lowest-token-acc cases are all 17-givens (minimum-clue) puzzles. Per-example table in +followups.md; drift profiles indistinguishable from the A band (fig_hrm_strictB_profiles.png). + +**TRM official @58590 note.** 90% of its settled-correct examples are still descending at window end +(slope median −0.147) — unlike singleGPU @260410 whose A-band is flat (~18.5/step). The "settled band" +criterion is checkpoint-specific; cross-checkpoint comparisons must re-derive τ per dataset. 
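The drift-matched control described in the addendum (AUC of −λ₁ for correctness within late-drift deciles) can be sketched as follows. This is an illustration under assumptions, not the code in `offline_followups`: the helper names `auc_pairwise` and `drift_matched_auc` are hypothetical, the inputs are assumed to be already restricted to the unsettled stratum, and weighting the per-bin AUCs by bin size is an assumption, since the notes do not state how the weighted mean was formed.

```python
import numpy as np


def auc_pairwise(score, label):
    """Rank AUC via all positive/negative pairs; ties count one half."""
    score = np.asarray(score, dtype=float)
    label = np.asarray(label).astype(bool)
    pos, neg = score[label], score[~label]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def drift_matched_auc(lam1, logd, correct, n_bins=10):
    """AUC(-lam1 -> correct) inside quantile bins of log10 late drift.

    Returns (per-bin AUCs, size-weighted mean). Bins that contain only one
    outcome class are skipped. Size weighting is an assumption of this sketch.
    """
    lam1 = np.asarray(lam1, dtype=float)
    logd = np.asarray(logd, dtype=float)
    correct = np.asarray(correct).astype(int)
    edges = np.percentile(logd, np.linspace(0, 100, n_bins + 1))
    # Map each example to its quantile bin; the maximum falls in the last bin.
    bin_id = np.clip(np.searchsorted(edges, logd, side="right") - 1, 0, n_bins - 1)
    per_bin, weights = [], []
    for b in range(n_bins):
        m = bin_id == b
        a = auc_pairwise(-lam1[m], correct[m])
        if not np.isnan(a):
            per_bin.append(a)
            weights.append(int(m.sum()))
    weighted = float(np.average(per_bin, weights=weights)) if per_bin else float("nan")
    return per_bin, weighted


# Toy check: lam1 separates outcome perfectly at every drift level.
toy_logd = np.arange(40.0)
toy_correct = np.tile([1, 0], 20)
toy_lam1 = np.where(toy_correct == 1, -1.0, 1.0)
toy_bins, toy_weighted = drift_matched_auc(toy_lam1, toy_logd, toy_correct, n_bins=4)
```

If λ₁ carried no information beyond drift, the per-bin AUCs would sit near 0.5 even when the unconditioned AUC is high; per-bin values well above 0.5 are what the corrected statement relies on.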
+ +## Addendum 2 (2026-06-12, n=2048 retest + early-window pairing; npz in retest/) + +Retest ran on GPU 0 (shared, 12h-fallback claim). Four diagnostics, seed 0, idx-paired. + +**1. TRM official @58590 (87.6%), full window, n=2048: B = 0 confirmed.** +254 failures, none settled; the MINIMUM late-drift among wrong examples (log10 1.664 ≈ 46/step) +exceeds the late-drift of 96.5% of correct examples — near-complete distribution separation, +threshold-free. λ₁: wrong +0.103 / correct +0.012, AUC 0.993. Within-unsettled outcome AUC 0.848 +(C n=70) — residual signal confirmed at usable n. + +**2. HRM @26040, full window, joint estimator, n=2048: replicates diag_8k on a second estimator.** +acc 0.526; A/B/C/D = 1020/14/57/957; λ₁(A) −0.152 vs λ₁(D) +0.032 (the email's −0.15/+0.04 scale); +strict-band B n=5 (0.5% of failures) with λ₁ −0.141 ≈ A and q_halt +7.47 (selector-blind, replicated); +unsettled within-decile AUC weighted 0.900 (was 0.879 on diag_8k). + +**3. Early-window (first 4 ACT steps) does NOT forecast eventual success among still-unsolved +examples — and on HRM the dynamical signals point the OTHER way.** +Unconditioned early AUCs are inflated by already-solved-at-4 examples (TRM 69.4% solved@4, +HRM 34.5%). 
Restricted to not-yet-correct@4: + +| signal @ step 4 | TRM (n=626, 59.4% eventually correct) | HRM (n=1342, 27.6% eventually correct) | +|---|---|---| +| AUC(−λ₁_early → eventual correct) | 0.543 | **0.448** (reversed) | +| AUC(−drift@4 → eventual correct) | 0.492 | **0.312** (reversed: MORE early movement ↔ eventual success, +dir AUC 0.688) | +| AUC(q_halt@4 → eventual correct) | 0.521 | **0.734** | + +Observations: (i) the λ₁/outcome association is concurrent with the trajectory's fate (final window), +not antecedent — early-window λ₁ has no forward predictive power at this granularity; (ii) on HRM the +sign of the early association is inverted: among undecided examples, higher early drift (and +marginally higher early λ₁) accompany eventual success; (iii) the one early signal with real forecast +power is HRM's learned q_halt (0.734) — absent in TRM (0.521); factual architecture note: TRM removed +HRM's Q-learning continue-head (BCE halt only). Window length 4 was chosen to match train-time; other +horizons untested. + +**Consequences for writeup/claims:** "failure ↔ chaos" should be stated as an outcome-concurrent +dynamical signature, not an early predictor; early-exit/reallocation applications are unsupported at +this granularity (and sign-reversed on HRM); the within-stratum independence result (drift-matched +AUC 0.88-0.90) plus the difficulty control (Addendum 1) remain the strongest positive claims. + +## Caveats + +- n(B)=21 and n(C)=11/195 are small; per-cell statements about those cells are low-precision. +- "Settled" is a relative (band) criterion; both A-bands have nonzero characteristic residual velocity. +- exact_correct is evaluated at ACT step 16 under fixed unroll; post-window corruption or recovery + (Ren & Liu's fixed-point violation) is not observable in these arrays. 
+- λ window = full trajectory (336 sub-updates TRM / 16 segments HRM); the early-window + (length-matched) version of this analysis is NOT covered by existing npz files — the `_short` + diagnose scripts exist (4 ACT steps) but produced no saved npz we could find. Open follow-up. +- diag_8k (HRM) and the HRM evolution series use different estimator normalizations; only + within-file comparisons are reported above. diff --git a/research/flossing/analysis_2x2/analyze_2x2.py b/research/flossing/analysis_2x2/analyze_2x2.py new file mode 100644 index 0000000..315ffc6 --- /dev/null +++ b/research/flossing/analysis_2x2/analyze_2x2.py @@ -0,0 +1,350 @@ +"""2x2 analysis: (terminal convergence) x (answer correctness), per-example FTLE per cell. + +Inputs: existing diagnostic npz files produced by diagnose_hrm*.py / diagnose_trm_joint.py: + drift_zH/drift_zL (N,16) per-ACT-step state displacement norms, + lyap_spec (N,8) full-window joint FTLE spectrum, exact_correct (N,), + token_acc, halted_at, q_halt/q_continue (N,16), idx. + +Convergence metric (measurement choice, reported alongside robustness sweep): + d_late = mean(drift_zH[:, -4:]) (late-trajectory z_H velocity, ACT steps 13-16) + primary threshold tau = Otsu on log10(d_late) pooled per dataset; + sensitivity: tau swept over pooled percentiles 5..95. + +Cells: A = converged & correct, B = converged & wrong, + C = non-converged & correct, D = non-converged & wrong. + +Outputs (in this directory): results_<tag>.json, cells_<tag>.csv, sweep_<tag>.csv, + fig_<tag>_{drift_hist,lyap_by_cell,scatter,spectrum}.png, evolution_{hrm,trm}.{csv,png}, + results.md (combined human-readable summary). + +Observational only: this script reports counts, distributions and rank statistics; it does +not test mechanisms. 
+""" +from __future__ import annotations + +import json +from pathlib import Path + +import matplotlib + +matplotlib.use("Agg") +import matplotlib.pyplot as plt +import numpy as np + +HERE = Path(__file__).resolve().parent +FLOSS = HERE.parent + +LATE_K = 4 # ACT steps used for late drift + + +def otsu_threshold(x: np.ndarray, nbins: int = 256) -> float: + h, edges = np.histogram(x, bins=nbins) + h = h.astype(np.float64) + centers = 0.5 * (edges[:-1] + edges[1:]) + w = h.sum() + if w == 0: + return float(np.median(x)) + p = h / w + omega = np.cumsum(p) + mu = np.cumsum(p * centers) + mu_t = mu[-1] + denom = omega * (1.0 - omega) + denom[denom <= 0] = np.nan + sigma_b2 = (mu_t * omega - mu) ** 2 / denom + k = np.nanargmax(sigma_b2) + return float(centers[k]) + + +def auc_rank(score: np.ndarray, label: np.ndarray) -> float: + """AUC of `score` for predicting label==1 (rank-based, ties averaged).""" + pos = score[label == 1] + neg = score[label == 0] + if len(pos) == 0 or len(neg) == 0: + return float("nan") + allv = np.concatenate([pos, neg]) + order = np.argsort(allv, kind="mergesort") + ranks = np.empty_like(order, dtype=np.float64) + ranks[order] = np.arange(1, len(allv) + 1) + # average ranks for ties + sv = allv[order] + i = 0 + while i < len(sv): + j = i + while j + 1 < len(sv) and sv[j + 1] == sv[i]: + j += 1 + if j > i: + ranks[order[i : j + 1]] = ranks[order[i : j + 1]].mean() + i = j + 1 + r_pos = ranks[: len(pos)].sum() + return float((r_pos - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg))) + + +def cell_stats(lyap: np.ndarray, tok: np.ndarray, halted: np.ndarray, mask: np.ndarray) -> dict: + if mask.sum() == 0: + return {"n": 0} + l1 = lyap[mask, 0] + return { + "n": int(mask.sum()), + "lam1_median": float(np.median(l1)), + "lam1_mean": float(l1.mean()), + "lam1_iqr": [float(np.percentile(l1, 25)), float(np.percentile(l1, 75))], + "lam8_median": float(np.median(lyap[mask, -1])), + "spectrum_median": [float(np.median(lyap[mask, i])) for i in 
range(lyap.shape[1])], + "token_acc_median": float(np.median(tok[mask])), + "halted_at_median": float(np.median(halted[mask])), + } + + +def analyze(npz_path: Path, tag: str, make_figs: bool = True) -> dict: + d = np.load(npz_path) + lyap = d["lyap_spec"].astype(np.float64) + correct = d["exact_correct"].astype(int) + tok = d["token_acc"].astype(np.float64) + halted = d["halted_at"].astype(np.float64) + drift_h = d["drift_zH"].astype(np.float64) + drift_l = d["drift_zL"].astype(np.float64) + + d_late = drift_h[:, -LATE_K:].mean(axis=1) + d_late_l = drift_l[:, -LATE_K:].mean(axis=1) + logd = np.log10(np.clip(d_late, 1e-12, None)) + + tau = otsu_threshold(logd) + conv = logd < tau + + cells = { + "A_conv_correct": (conv) & (correct == 1), + "B_conv_wrong": (conv) & (correct == 0), + "C_nonconv_correct": (~conv) & (correct == 1), + "D_nonconv_wrong": (~conv) & (correct == 0), + } + + res = { + "npz": str(npz_path), + "n": int(len(correct)), + "exact_acc": float(correct.mean()), + "late_drift_def": f"mean(drift_zH[:, -{LATE_K}:])", + "otsu_tau_log10": tau, + "frac_converged": float(conv.mean()), + "cells": {k: cell_stats(lyap, tok, halted, m) for k, m in cells.items()}, + "mixture": { + "wrong_that_converged": float( + cells["B_conv_wrong"].sum() / max((correct == 0).sum(), 1) + ), + "correct_that_nonconverged": float( + cells["C_nonconv_correct"].sum() / max((correct == 1).sum(), 1) + ), + }, + "contrasts": { + "dlam1_correct_minus_wrong_overall": float( + np.median(lyap[correct == 1, 0]) - np.median(lyap[correct == 0, 0]) + ), + "dlam1_within_converged": float( + np.median(lyap[cells["A_conv_correct"], 0]) - np.median(lyap[cells["B_conv_wrong"], 0]) + ) + if cells["B_conv_wrong"].sum() > 0 and cells["A_conv_correct"].sum() > 0 + else float("nan"), + "dlam1_within_nonconverged": float( + np.median(lyap[cells["C_nonconv_correct"], 0]) - np.median(lyap[cells["D_nonconv_wrong"], 0]) + ) + if cells["C_nonconv_correct"].sum() > 0 and cells["D_nonconv_wrong"].sum() > 0 + 
else float("nan"), + "dlam1_wrong_conv_minus_wrong_nonconv": float( + np.median(lyap[cells["B_conv_wrong"], 0]) - np.median(lyap[cells["D_nonconv_wrong"], 0]) + ) + if cells["B_conv_wrong"].sum() > 0 and cells["D_nonconv_wrong"].sum() > 0 + else float("nan"), + }, + "auc": { + "neg_lam1_predicts_correct_overall": auc_rank(-lyap[:, 0], correct), + "neg_lam1_predicts_correct_within_conv": auc_rank(-lyap[conv, 0], correct[conv]), + "neg_lam1_predicts_correct_within_nonconv": auc_rank(-lyap[~conv, 0], correct[~conv]), + "neg_logdrift_predicts_correct": auc_rank(-logd, correct), + "neg_lam1_predicts_converged": auc_rank(-lyap[:, 0], conv.astype(int)), + }, + } + + # threshold sensitivity sweep + sweep_rows = [] + for pct in range(5, 96, 5): + t = np.percentile(logd, pct) + c = logd < t + row = { + "pct": pct, + "tau": float(t), + "nA": int((c & (correct == 1)).sum()), + "nB": int((c & (correct == 0)).sum()), + "nC": int((~c & (correct == 1)).sum()), + "nD": int((~c & (correct == 0)).sum()), + } + for nm, m in [ + ("lam1_med_B", c & (correct == 0)), + ("lam1_med_D", ~c & (correct == 0)), + ]: + row[nm] = float(np.median(lyap[m, 0])) if m.sum() > 0 else float("nan") + sweep_rows.append(row) + sweep_csv = HERE / f"sweep_{tag}.csv" + with sweep_csv.open("w") as f: + keys = list(sweep_rows[0].keys()) + f.write(",".join(keys) + "\n") + for r in sweep_rows: + f.write(",".join(str(r[k]) for k in keys) + "\n") + + # per-cell csv + with (HERE / f"cells_{tag}.csv").open("w") as f: + f.write("cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median\n") + for k, m in cells.items(): + s = res["cells"][k] + if s["n"] == 0: + f.write(f"{k},0,,,,,,,\n") + continue + f.write( + f"{k},{s['n']},{s['lam1_median']:.6f},{s['lam1_mean']:.6f}," + f"{s['lam1_iqr'][0]:.6f},{s['lam1_iqr'][1]:.6f},{s['lam8_median']:.6f}," + f"{s['token_acc_median']:.4f},{s['halted_at_median']:.1f}\n" + ) + + if make_figs: + colors = {"A_conv_correct": "tab:green", "B_conv_wrong": 
"tab:orange", + "C_nonconv_correct": "tab:blue", "D_nonconv_wrong": "tab:red"} + + fig, ax = plt.subplots(figsize=(6, 4)) + bins = np.linspace(logd.min(), logd.max(), 60) + ax.hist(logd[correct == 1], bins=bins, alpha=0.55, label=f"correct (n={int(correct.sum())})", color="tab:green") + ax.hist(logd[correct == 0], bins=bins, alpha=0.55, label=f"wrong (n={int((1-correct).sum())})", color="tab:red") + ax.axvline(tau, color="k", ls="--", lw=1, label=f"Otsu tau={tau:.2f}") + ax.set_xlabel("log10 late drift_zH (steps -4:)"); ax.set_ylabel("count") + ax.set_title(f"{tag}: late-drift distribution by correctness"); ax.legend(fontsize=8) + fig.tight_layout(); fig.savefig(HERE / f"fig_{tag}_drift_hist.png", dpi=150); plt.close(fig) + + fig, ax = plt.subplots(figsize=(6.5, 4)) + for i, (k, m) in enumerate(cells.items()): + if m.sum() == 0: + continue + y = lyap[m, 0] + x = np.full(y.shape, i) + (np.random.default_rng(0).uniform(-0.18, 0.18, y.shape)) + ax.plot(x, y, ".", ms=3, alpha=0.35, color=colors[k]) + ax.hlines(np.median(y), i - 0.28, i + 0.28, color=colors[k], lw=2.5) + ax.set_xticks(range(4)); ax.set_xticklabels([f"{k}\n(n={int(m.sum())})" for k, m in cells.items()], fontsize=7) + ax.set_ylabel("lambda_1 (full-window FTLE)"); ax.axhline(0, color="gray", lw=0.6) + ax.set_title(f"{tag}: lambda_1 by 2x2 cell") + fig.tight_layout(); fig.savefig(HERE / f"fig_{tag}_lyap_by_cell.png", dpi=150); plt.close(fig) + + fig, ax = plt.subplots(figsize=(6, 4.5)) + ax.scatter(logd[correct == 1], lyap[correct == 1, 0], s=5, alpha=0.4, c="tab:green", label="correct") + ax.scatter(logd[correct == 0], lyap[correct == 0, 0], s=5, alpha=0.4, c="tab:red", label="wrong") + ax.axvline(tau, color="k", ls="--", lw=1); ax.axhline(0, color="gray", lw=0.6) + ax.set_xlabel("log10 late drift_zH"); ax.set_ylabel("lambda_1") + ax.set_title(f"{tag}: drift vs lambda_1"); ax.legend(fontsize=8) + fig.tight_layout(); fig.savefig(HERE / f"fig_{tag}_scatter.png", dpi=150); plt.close(fig) + + fig, ax = 
plt.subplots(figsize=(6, 4)) + for k, m in cells.items(): + if m.sum() < 3: + continue + ax.plot(range(1, lyap.shape[1] + 1), [np.median(lyap[m, i]) for i in range(lyap.shape[1])], + "o-", ms=4, label=f"{k} (n={int(m.sum())})", color=colors[k]) + ax.axhline(0, color="gray", lw=0.6) + ax.set_xlabel("exponent index"); ax.set_ylabel("median lambda_i") + ax.set_title(f"{tag}: median FTLE spectrum per cell"); ax.legend(fontsize=7) + fig.tight_layout(); fig.savefig(HERE / f"fig_{tag}_spectrum.png", dpi=150); plt.close(fig) + + # secondary observables + res["aux"] = { + "late_drift_zL_corr_with_zH_log": float(np.corrcoef(logd, np.log10(np.clip(d_late_l, 1e-12, None)))[0, 1]), + "q_halt_final_median_by_cell": { + k: float(np.median(d["q_halt"][m, -1])) if m.sum() > 0 else float("nan") for k, m in cells.items() + }, + } + (HERE / f"results_{tag}.json").write_text(json.dumps(res, indent=2)) + return res + + +def evolution(series: list[tuple[str, Path]], out_tag: str) -> None: + rows = [] + for label, p in series: + if not p.exists(): + continue + d = np.load(p) + lyap = d["lyap_spec"].astype(np.float64) + correct = d["exact_correct"].astype(int) + logd = np.log10(np.clip(d["drift_zH"][:, -LATE_K:].mean(axis=1), 1e-12, None)) + tau = otsu_threshold(logd) + conv = logd < tau + row = dict(step=label, acc=float(correct.mean()), tau=tau, + fA=float(((conv) & (correct == 1)).mean()), fB=float(((conv) & (correct == 0)).mean()), + fC=float(((~conv) & (correct == 1)).mean()), fD=float(((~conv) & (correct == 0)).mean())) + for nm, m in [("l1A", (conv) & (correct == 1)), ("l1B", (conv) & (correct == 0)), + ("l1C", (~conv) & (correct == 1)), ("l1D", (~conv) & (correct == 0))]: + row[nm] = float(np.median(lyap[m, 0])) if m.sum() > 2 else float("nan") + rows.append(row) + if not rows: + return + keys = list(rows[0].keys()) + with (HERE / f"evolution_{out_tag}.csv").open("w") as f: + f.write(",".join(keys) + "\n") + for r in rows: + f.write(",".join(str(r[k]) for k in keys) + "\n") + fig, 
axes = plt.subplots(1, 2, figsize=(11, 4)) + xs = range(len(rows)) + for nm, c in [("fA", "tab:green"), ("fB", "tab:orange"), ("fC", "tab:blue"), ("fD", "tab:red")]: + axes[0].plot(xs, [r[nm] for r in rows], "o-", label=nm, color=c) + axes[0].set_xticks(list(xs)); axes[0].set_xticklabels([r["step"] for r in rows], rotation=45, fontsize=7) + axes[0].set_ylabel("cell fraction"); axes[0].legend(fontsize=8); axes[0].set_title(f"{out_tag}: cell fractions") + for nm, c in [("l1A", "tab:green"), ("l1B", "tab:orange"), ("l1C", "tab:blue"), ("l1D", "tab:red")]: + axes[1].plot(xs, [r[nm] for r in rows], "o-", label=nm, color=c) + axes[1].axhline(0, color="gray", lw=0.6) + axes[1].set_xticks(list(xs)); axes[1].set_xticklabels([r["step"] for r in rows], rotation=45, fontsize=7) + axes[1].set_ylabel("median lambda_1"); axes[1].legend(fontsize=8); axes[1].set_title(f"{out_tag}: per-cell lambda_1") + fig.tight_layout(); fig.savefig(HERE / f"evolution_{out_tag}.png", dpi=150); plt.close(fig) + + +def main() -> None: + results = {} + primary = [ + ("hrm26040_n8192", FLOSS / "diag_8k.npz"), + ("trm_singleGPU_step260410_n512", FLOSS / "diag_trm_singleGPU_step260410_512.npz"), + ("trm_singleGPU_step130205_n512", FLOSS / "diag_trm_singleGPU_step130205_512.npz"), + ("trm_step13020_n512", FLOSS / "diag_trm_step13020_512.npz"), + ] + for tag, p in primary: + if p.exists(): + results[tag] = analyze(p, tag) + print(f"[done] {tag}") + + evolution( + [(f"{s}", FLOSS / f"diag_hrm_step_{s}_512.npz") for s in + [2604, 5208, 7812, 10416, 13020, 15624, 18228, 20832, 23436, 26040]], + "hrm", + ) + evolution( + [(f"{s}", FLOSS / f"diag_trm_singleGPU_step{s}_512.npz") for s in + [26041, 52082, 78123, 104164, 130205, 156246, 182287, 208328, 234369, 260410]], + "trm", + ) + + # combined human-readable summary + lines = ["# 2x2 analysis (convergence x correctness) — generated " + __import__("datetime").date.today().isoformat(), ""] + for tag, r in results.items(): + lines += [f"## {tag}", f"- npz: 
`{r['npz']}`, n={r['n']}, exact_acc={r['exact_acc']:.3f}", + f"- late-drift def: {r['late_drift_def']}, Otsu tau(log10)={r['otsu_tau_log10']:.3f}, frac_converged={r['frac_converged']:.3f}", ""] + lines.append("| cell | n | lam1 median | lam1 IQR | token_acc med |") + lines.append("|---|---|---|---|---|") + for k, s in r["cells"].items(): + if s["n"] == 0: + lines.append(f"| {k} | 0 | - | - | - |") + else: + lines.append(f"| {k} | {s['n']} | {s['lam1_median']:+.4f} | [{s['lam1_iqr'][0]:+.4f}, {s['lam1_iqr'][1]:+.4f}] | {s['token_acc_median']:.3f} |") + c = r["contrasts"]; a = r["auc"]; m = r["mixture"] + lines += ["", + f"- mixture: wrong-that-converged = {m['wrong_that_converged']:.3f}; correct-that-nonconverged = {m['correct_that_nonconverged']:.3f}", + f"- dlam1(correct-wrong): overall {c['dlam1_correct_minus_wrong_overall']:+.4f}; within-conv {c['dlam1_within_converged']:+.4f}; within-nonconv {c['dlam1_within_nonconverged']:+.4f}", + f"- dlam1(wrong: conv - nonconv) = {c['dlam1_wrong_conv_minus_wrong_nonconv']:+.4f}", + f"- AUC(-lam1 -> correct): overall {a['neg_lam1_predicts_correct_overall']:.3f}; within-conv {a['neg_lam1_predicts_correct_within_conv']:.3f}; within-nonconv {a['neg_lam1_predicts_correct_within_nonconv']:.3f}", + f"- AUC(-log d_late -> correct) = {a['neg_logdrift_predicts_correct']:.3f}; AUC(-lam1 -> converged) = {a['neg_lam1_predicts_converged']:.3f}", + ""] + (HERE / "results.md").write_text("\n".join(lines)) + print("wrote", HERE / "results.md") + + +if __name__ == "__main__": + main() diff --git a/research/flossing/analysis_2x2/analyze_phase1.py b/research/flossing/analysis_2x2/analyze_phase1.py new file mode 100644 index 0000000..2ffa58d --- /dev/null +++ b/research/flossing/analysis_2x2/analyze_phase1.py @@ -0,0 +1,126 @@ +"""Analyze phase-1 results: E5 horizon sweep, E6 matched-objective pairs, E2 second-run replication. +Outputs: analysis_2x2/phase1/phase1_results.md + figures. 
+""" +from __future__ import annotations +from pathlib import Path +import matplotlib; matplotlib.use("Agg") +import matplotlib.pyplot as plt +import numpy as np + +HERE = Path(__file__).resolve().parent +P1 = HERE / "phase1" +RETEST = HERE / "retest" + + +def auc_rank(score, label): + score = np.asarray(score, float); label = np.asarray(label, int) + pos, neg = score[label == 1], score[label == 0] + if len(pos) == 0 or len(neg) == 0: + return float("nan") + allv = np.concatenate([pos, neg]); order = np.argsort(allv, kind="mergesort") + ranks = np.empty(len(allv)); ranks[order] = np.arange(1, len(allv) + 1) + sv = allv[order]; i = 0 + while i < len(sv): + j = i + while j + 1 < len(sv) and sv[j + 1] == sv[i]: + j += 1 + if j > i: + ranks[order[i:j + 1]] = ranks[order[i:j + 1]].mean() + i = j + 1 + return float((ranks[:len(pos)].sum() - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg))) + + +def otsu(x, nbins=256): + h, e = np.histogram(x, bins=nbins); h = h.astype(float); c = (e[:-1] + e[1:]) / 2 + p = h / h.sum(); om = np.cumsum(p); mu = np.cumsum(p * c); mt = mu[-1] + den = om * (1 - om); den[den <= 0] = np.nan + return float(c[np.nanargmax((mt * om - mu) ** 2 / den)]) + + +def cells_of(d): + ld = np.log10(np.clip(d["drift_zH"][:, -4:].mean(1), 1e-12, None)) + c = d["exact_correct"].astype(int); tau = otsu(ld); conv = ld < tau + A = conv & (c == 1); B = conv & (c == 0); C = (~conv) & (c == 1); D = (~conv) & (c == 0) + l1 = d["lyap_spec"][:, 0] + return dict(acc=float(c.mean()), tau=tau, + nA=int(A.sum()), nB=int(B.sum()), nC=int(C.sum()), nD=int(D.sum()), + fD=float(D.mean()), l1A=float(np.median(l1[A])) if A.sum() else float("nan"), + l1D=float(np.median(l1[D])) if D.sum() else float("nan")) + + +lines = ["# Phase-1 analysis (E5 horizon sweep, E6 matched-objective, E2 replication)", ""] + +# ---------- E5 horizon sweep ---------- +lines += ["## E5: when does fate become legible? 
AUC(early signal -> FINAL correct), restricted to not-yet-correct@H", ""] +finals = {"trm": np.load(RETEST / "trm_gbs768_step58590_full_n2048.npz"), + "hrm": np.load(RETEST / "hrm_righteous_step26040_full_n2048.npz")} +# include h=4 from retest short runs +short4 = {"trm": np.load(RETEST / "trm_gbs768_step58590_short_n2048.npz"), + "hrm": np.load(RETEST / "hrm_righteous_step26040_short_n2048.npz")} +sweep = {} +for model in ["trm", "hrm"]: + fin = finals[model]; fi = fin["idx"]; y_final_all = fin["exact_correct"].astype(int) + rows = [] + for H in [2, 4, 6, 8, 10, 12]: + if H == 4: + s = short4[model] + else: + tag = f"{'trm_official58590' if model=='trm' else 'hrm26040'}_h{H}_n2048" + s = np.load(P1 / f"{tag}.npz") + si = s["idx"]; common, fp, sp = np.intersect1d(fi, si, return_indices=True) + yf = y_final_all[fp]; ye = s["exact_correct"].astype(int)[sp] + l1 = s["lyap_spec"][sp, 0]; dr = np.log10(np.clip(s["drift_zH"][sp, -1], 1e-12, None)); q = s["q_halt"][sp, -1] + m = ye == 0 # not yet correct at H + rows.append(dict(H=H, n_und=int(m.sum()), frac_eventual=float(yf[m].mean()) if m.sum() else float("nan"), + auc_l1=auc_rank(-l1[m], yf[m]), auc_drift=auc_rank(-dr[m], yf[m]), auc_q=auc_rank(q[m], yf[m]), + solved_at_H=float(ye.mean()))) + sweep[model] = rows + lines.append(f"### {model.upper()}") + lines.append("| H | solved@H | undecided n | of which eventual✓ | AUC(-λ₁) | AUC(-drift) | AUC(q_halt) |") + lines.append("|---|---|---|---|---|---|---|") + for r in rows: + lines.append(f"| {r['H']} | {r['solved_at_H']:.3f} | {r['n_und']} | {r['frac_eventual']:.3f} | " + f"{r['auc_l1']:.3f} | {r['auc_drift']:.3f} | {r['auc_q']:.3f} |") + lines.append("") + +# E5 figure +fig, axes = plt.subplots(1, 2, figsize=(11, 4)) +for ax, model in zip(axes, ["trm", "hrm"]): + rows = sweep[model]; H = [r["H"] for r in rows] + ax.plot(H, [r["auc_l1"] for r in rows], "o-", label="−λ₁") + ax.plot(H, [r["auc_drift"] for r in rows], "s-", label="−drift") + ax.plot(H, [r["auc_q"] for r in 
rows], "^-", label="q_halt")
+    ax.axhline(0.5, color="gray", lw=0.6, ls="--")
+    ax.set_xlabel("prefix length H (ACT segments)"); ax.set_ylabel("AUC → final correct (undecided@H only)")
+    ax.set_ylim(0.3, 1.0); ax.set_title(f"{model.upper()}: legibility of fate vs prefix"); ax.legend(fontsize=8)
+fig.tight_layout(); fig.savefig(P1 / "fig_E5_horizon_sweep.png", dpi=150); plt.close(fig)
+
+# ---------- E6 matched-objective ----------
+lines += ["## E6: matched-objective intervention (step9 fixed-unroll runs, n=512)", ""]
+for fam, base, mult in [("HRM (E base vs F multi4)", "step9E_hrm", "step9F_hrm"),
+                        ("TRM (G base vs H multi4)", "step9G_trm", "step9H_trm")]:
+    lines.append(f"### {fam}")
+    lines.append("| ckpt | base acc | base fD | base λ₁(D) | multi4 acc | multi4 fD | multi4 λ₁(D) |")
+    lines.append("|---|---|---|---|---|---|---|")
+    for ck in ["step_12500", "step_25000", "best", "final"]:
+        bp = P1 / f"{base}_{ck}_n512.npz"; mp = P1 / f"{mult}_{ck}_n512.npz"
+        if not bp.exists() or not mp.exists():
+            lines.append(f"| {ck} | missing | | | | | |"); continue
+        b = cells_of(np.load(bp)); m = cells_of(np.load(mp))
+        lines.append(f"| {ck} | {b['acc']:.3f} | {b['fD']:.3f} | {b['l1D']:+.4f} | "
+                     f"{m['acc']:.3f} | {m['fD']:.3f} | {m['l1D']:+.4f} |")
+    lines.append("")
+
+# ---------- E2 second-run replication ----------
+lines += ["## E2: decomposition on a SECOND HRM training run (step9_E fixed-unroll, n=2048)", ""]
+lines.append("| ckpt | acc | A | B | C | D | λ₁(A) | λ₁(D) | settled-wrong frac of failures |")
+lines.append("|---|---|---|---|---|---|---|---|---|")
+for ck in ["best", "final"]:
+    d = np.load(P1 / f"step9E_hrm_{ck}_full_n2048.npz"); cc = cells_of(d)
+    nfail = cc["nB"] + cc["nD"]
+    lines.append(f"| {ck} | {cc['acc']:.3f} | {cc['nA']} | {cc['nB']} | {cc['nC']} | {cc['nD']} | "
+                 f"{cc['l1A']:+.4f} | {cc['l1D']:+.4f} | {cc['nB']/max(nfail,1):.4f} |")
+
+(P1 / "phase1_results.md").write_text("\n".join(lines))
+print("\n".join(lines))
+print("\nwrote", P1 / "phase1_results.md", "and fig_E5_horizon_sweep.png")
diff --git a/research/flossing/analysis_2x2/cells_hrm26040_joint_n2048.csv b/research/flossing/analysis_2x2/cells_hrm26040_joint_n2048.csv
new file mode 100644
index 0000000..9ee8208
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_hrm26040_joint_n2048.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,1020,-0.152252,-0.149175,-0.177945,-0.123537,-0.229215,1.0000,4.0
+B_conv_wrong,14,-0.035939,-0.045284,-0.100539,0.023329,-0.091743,0.6420,0.0
+C_nonconv_correct,57,-0.044267,-0.039474,-0.073953,-0.009610,-0.115885,1.0000,14.0
+D_nonconv_wrong,957,0.031881,0.037504,0.001404,0.070978,-0.064309,0.6296,0.0
diff --git a/research/flossing/analysis_2x2/cells_hrm26040_n8192.csv b/research/flossing/analysis_2x2/cells_hrm26040_n8192.csv
new file mode 100644
index 0000000..7d2c776
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_hrm26040_n8192.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,4103,-0.861743,-0.856474,-0.902015,-0.815340,-0.974624,1.0000,4.0
+B_conv_wrong,63,-0.591228,-0.631761,-0.809065,-0.506116,-0.737965,0.6296,0.0
+C_nonconv_correct,195,-0.694297,-0.685632,-0.735417,-0.646644,-0.788653,1.0000,14.0
+D_nonconv_wrong,3831,-0.599770,-0.597011,-0.647486,-0.548848,-0.714284,0.6296,0.0
diff --git a/research/flossing/analysis_2x2/cells_trm_official58590_n2048.csv b/research/flossing/analysis_2x2/cells_trm_official58590_n2048.csv
new file mode 100644
index 0000000..d805cf9
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_trm_official58590_n2048.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,1724,0.011282,0.015547,0.008467,0.016524,0.002090,1.0000,2.0
+B_conv_wrong,0,,,,,,,
+C_nonconv_correct,70,0.081975,0.077830,0.068631,0.090418,0.008649,1.0000,14.0
+D_nonconv_wrong,254,0.102872,0.102872,0.092413,0.112425,0.018793,0.6296,0.0
diff --git a/research/flossing/analysis_2x2/cells_trm_official_gbs768_step58590_n512.csv b/research/flossing/analysis_2x2/cells_trm_official_gbs768_step58590_n512.csv
new file mode 100644
index 0000000..2ff9465
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_trm_official_gbs768_step58590_n512.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,434,0.011053,0.015590,0.008054,0.017474,0.002147,1.0000,2.0
+B_conv_wrong,0,,,,,,,
+C_nonconv_correct,14,0.082749,0.080288,0.070048,0.095862,0.006181,1.0000,14.0
+D_nonconv_wrong,64,0.103398,0.102595,0.092417,0.111892,0.021354,0.6296,0.0
diff --git a/research/flossing/analysis_2x2/cells_trm_singleGPU_step130205_n512.csv b/research/flossing/analysis_2x2/cells_trm_singleGPU_step130205_n512.csv
new file mode 100644
index 0000000..1e9aec5
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_trm_singleGPU_step130205_n512.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,365,0.004312,0.009192,-0.000673,0.014061,-0.008585,1.0000,3.0
+B_conv_wrong,0,,,,,,,
+C_nonconv_correct,22,0.085199,0.084005,0.069379,0.098772,0.015529,1.0000,15.0
+D_nonconv_wrong,125,0.106698,0.106791,0.099522,0.112582,0.032808,0.6296,0.0
diff --git a/research/flossing/analysis_2x2/cells_trm_singleGPU_step260410_n512.csv b/research/flossing/analysis_2x2/cells_trm_singleGPU_step260410_n512.csv
new file mode 100644
index 0000000..4d11774
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_trm_singleGPU_step260410_n512.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,383,0.004744,0.010418,0.000744,0.013660,-0.004220,1.0000,2.0
+B_conv_wrong,0,,,,,,,
+C_nonconv_correct,11,0.099828,0.089867,0.090672,0.103510,0.013320,1.0000,15.0
+D_nonconv_wrong,118,0.102321,0.102882,0.093520,0.111216,0.021455,0.6420,0.0
diff --git a/research/flossing/analysis_2x2/cells_trm_step13020_n512.csv b/research/flossing/analysis_2x2/cells_trm_step13020_n512.csv
new file mode 100644
index 0000000..d3d1ff0
--- /dev/null
+++ b/research/flossing/analysis_2x2/cells_trm_step13020_n512.csv
@@ -0,0 +1,5 @@
+cell,n,lam1_median,lam1_mean,lam1_q25,lam1_q75,lam8_median,token_acc_median,halted_at_median
+A_conv_correct,296,-0.004330,-0.002231,-0.011287,0.005197,-0.032494,1.0000,3.0
+B_conv_wrong,10,-0.002707,-0.003217,-0.008984,0.000914,-0.024164,0.5556,0.0
+C_nonconv_correct,9,0.041901,0.041647,0.032285,0.047971,-0.018014,1.0000,14.0
+D_nonconv_wrong,197,0.031663,0.030858,0.014350,0.046172,-0.012672,0.6420,0.0
diff --git a/research/flossing/analysis_2x2/early_pairing_hrm26040_joint.md b/research/flossing/analysis_2x2/early_pairing_hrm26040_joint.md
new file mode 100644
index 0000000..8d90e02
--- /dev/null
+++ b/research/flossing/analysis_2x2/early_pairing_hrm26040_joint.md
@@ -0,0 +1,17 @@
+# Early-window pairing — hrm26040_joint
+- paired n=2048; final acc=0.5259; already-correct@step4=0.3447
+- of final-correct, fraction already correct@4: 0.6555
+- early-window lam1: final-correct med -0.1714, final-wrong med -0.1314
+
+## Forecasting FINAL outcome from the first 4 ACT steps
+- AUC(-lam1_early -> final correct) = 0.728
+- AUC(-drift@4 -> final correct) = 0.486
+- AUC(q_halt@4 -> final correct) = 0.908
+- reference: AUC(-lam1_full -> final correct) = 0.987
+
+## Restricted to examples NOT yet correct at step 4 (the decision-relevant set)
+- n=1342, of which eventually correct: 371 (0.276)
+- AUC(-lam1_early -> eventually correct) = 0.448
+- AUC(-drift@4 -> eventually correct) = 0.312
+- AUC(q_halt@4 -> eventually correct) = 0.734
+- early lam1 med: eventually-correct -0.1225 vs never-correct -0.1314
\ No newline at end of file
diff --git a/research/flossing/analysis_2x2/early_pairing_trm_official58590.md b/research/flossing/analysis_2x2/early_pairing_trm_official58590.md
new file mode 100644
index 0000000..2b29ab5
--- /dev/null
+++ b/research/flossing/analysis_2x2/early_pairing_trm_official58590.md
@@ -0,0 +1,17 @@
+# Early-window pairing — trm_official58590
+- paired n=2048; final acc=0.8760; already-correct@step4=0.6943
+- of final-correct, fraction already correct@4: 0.7926
+- early-window lam1: final-correct med +0.0178, final-wrong med +0.1075
+
+## Forecasting FINAL outcome from the first 4 ACT steps
+- AUC(-lam1_early -> final correct) = 0.891
+- AUC(-drift@4 -> final correct) = 0.800
+- AUC(q_halt@4 -> final correct) = 0.901
+- reference: AUC(-lam1_full -> final correct) = 0.993
+
+## Restricted to examples NOT yet correct at step 4 (the decision-relevant set)
+- n=626, of which eventually correct: 372 (0.594)
+- AUC(-lam1_early -> eventually correct) = 0.543
+- AUC(-drift@4 -> eventually correct) = 0.492
+- AUC(q_halt@4 -> eventually correct) = 0.521
+- early lam1 med: eventually-correct +0.1060 vs never-correct +0.1075
\ No newline at end of file
diff --git a/research/flossing/analysis_2x2/early_window_pairing.py b/research/flossing/analysis_2x2/early_window_pairing.py
new file mode 100644
index 0000000..6277df2
--- /dev/null
+++ b/research/flossing/analysis_2x2/early_window_pairing.py
@@ -0,0 +1,85 @@
+"""Early-window (first 4 ACT steps) vs full-window pairing.
+
+Question: does the early-window FTLE forecast FINAL failure before convergence?
+- join full/short npz per example via idx (same seed/n => same sampling)
+- target label = final exact_correct from the FULL-window run
+- report: AUC(-lam1_early -> final correct), same for early drift and q_halt@4,
+  and the conditional version restricted to examples NOT yet correct at step 4
+  (short run's own exact_correct == 0) — the practically relevant population for
+  early-exit / compute-reallocation decisions.
+
+Observational only. Usage: python early_window_pairing.py FULL.npz SHORT.npz TAG
+"""
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+import numpy as np
+
+HERE = Path(__file__).resolve().parent
+
+
+def auc_rank(score: np.ndarray, label: np.ndarray) -> float:
+    pos, neg = score[label == 1], score[label == 0]
+    if len(pos) == 0 or len(neg) == 0:
+        return float("nan")
+    allv = np.concatenate([pos, neg])
+    order = np.argsort(allv, kind="mergesort")
+    ranks = np.empty(len(allv)); ranks[order] = np.arange(1, len(allv) + 1)
+    sv = allv[order]; i = 0
+    while i < len(sv):
+        j = i
+        while j + 1 < len(sv) and sv[j + 1] == sv[i]:
+            j += 1
+        if j > i:
+            ranks[order[i:j + 1]] = ranks[order[i:j + 1]].mean()
+        i = j + 1
+    return float((ranks[:len(pos)].sum() - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg)))
+
+
+def main(full_path: str, short_path: str, tag: str) -> None:
+    f = np.load(full_path)
+    s = np.load(short_path)
+
+    fi, si = f["idx"], s["idx"]
+    common, f_pos, s_pos = np.intersect1d(fi, si, return_indices=True)
+    print(f"[{tag}] paired {len(common)} examples (full n={len(fi)}, short n={len(si)})")
+
+    y_final = f["exact_correct"].astype(int)[f_pos]  # FINAL outcome (step 16)
+    y_early = s["exact_correct"].astype(int)[s_pos]  # correct already at step 4
+    l1_early = s["lyap_spec"][s_pos, 0]
+    l1_full = f["lyap_spec"][f_pos, 0]
+    drift4 = np.log10(np.clip(s["drift_zH"][s_pos, -1], 1e-12, None))  # drift at ACT step 4
+    q4 = s["q_halt"][s_pos, -1]
+
+    lines = [f"# Early-window pairing — {tag}",
+             f"- paired n={len(common)}; final acc={y_final.mean():.4f}; already-correct@step4={y_early.mean():.4f}",
+             f"- of final-correct, fraction already correct@4: {y_early[y_final==1].mean():.4f}",
+             f"- early-window lam1: final-correct med {np.median(l1_early[y_final==1]):+.4f}, "
+             f"final-wrong med {np.median(l1_early[y_final==0]):+.4f}",
+             "",
+             "## Forecasting FINAL outcome from the first 4 ACT steps",
+             f"- AUC(-lam1_early -> final correct) = {auc_rank(-l1_early, y_final):.3f}",
+             f"- AUC(-drift@4 -> final correct) = {auc_rank(-drift4, y_final):.3f}",
+             f"- AUC(q_halt@4 -> final correct) = {auc_rank(q4, y_final):.3f}",
+             f"- reference: AUC(-lam1_full -> final correct) = {auc_rank(-l1_full, y_final):.3f}",
+             "",
+             "## Restricted to examples NOT yet correct at step 4 (the decision-relevant set)"]
+    m = y_early == 0
+    n_m = int(m.sum()); n_pos = int(y_final[m].sum())
+    lines += [f"- n={n_m}, of which eventually correct: {n_pos} ({n_pos/max(n_m,1):.3f})",
+              f"- AUC(-lam1_early -> eventually correct) = {auc_rank(-l1_early[m], y_final[m]):.3f}",
+              f"- AUC(-drift@4 -> eventually correct) = {auc_rank(-drift4[m], y_final[m]):.3f}",
+              f"- AUC(q_halt@4 -> eventually correct) = {auc_rank(q4[m], y_final[m]):.3f}",
+              f"- early lam1 med: eventually-correct {np.median(l1_early[m & (y_final==1)]):+.4f} vs "
+              f"never-correct {np.median(l1_early[m & (y_final==0)]):+.4f}"]
+
+    out = HERE / f"early_pairing_{tag}.md"
+    out.write_text("\n".join(lines))
+    print("\n".join(lines))
+    print("wrote", out)
+
+
+if __name__ == "__main__":
+    main(sys.argv[1], sys.argv[2], sys.argv[3])
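Reviewer sketch (not part of the commit): `early_window_pairing.py` above rests on two ideas, joining the full and short runs per example with `np.intersect1d(..., return_indices=True)`, and a rank AUC that uses average ranks for ties. The toy below is a minimal illustration with made-up arrays. It shows what the pairing call returns and gives a brute-force pairwise AUC that could serve as a cross-check for `auc_rank`. The names `auc_pairwise`, `full_idx`, and `short_idx` are hypothetical and do not appear in the repository.

```python
import numpy as np


def auc_pairwise(score: np.ndarray, label: np.ndarray) -> float:
    """Brute-force AUC: P(pos > neg) + 0.5 * P(pos == neg).

    Hypothetical cross-check helper, not taken from the commit.
    """
    pos, neg = score[label == 1], score[label == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    # All positive-vs-negative score differences (n_pos x n_neg matrix).
    diff = pos[:, None] - neg[None, :]
    # Wins count fully, ties count half, which mirrors average-rank handling.
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Toy stand-ins for the "idx" arrays stored in the full and short npz files.
full_idx = np.array([10, 42, 7, 99, 3])
short_idx = np.array([3, 7, 10, 55])

# `common` is sorted; f_pos / s_pos index into the ORIGINAL arrays, so they
# can be used to align any per-example field from the two runs.
common, f_pos, s_pos = np.intersect1d(full_idx, short_idx, return_indices=True)
assert np.array_equal(full_idx[f_pos], short_idx[s_pos])

# Small tie case: one positive and one negative share the score 0.8.
toy_auc = auc_pairwise(np.array([0.9, 0.8, 0.8, 0.1]), np.array([1, 1, 0, 0]))
```

The pairwise form needs an `n_pos × n_neg` matrix, so it is only suitable for small spot checks; the rank-based form in the script scales to the full n=2048 or n=8192 sets. Note also that `np.intersect1d` with `return_indices=True` reports first occurrences, so the pairing assumes each `idx` value appears at most once per file.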
diff --git a/research/flossing/analysis_2x2/evolution_hrm.csv b/research/flossing/analysis_2x2/evolution_hrm.csv
new file mode 100644
index 0000000..c6a6b10
--- /dev/null
+++ b/research/flossing/analysis_2x2/evolution_hrm.csv
@@ -0,0 +1,11 @@
+step,acc,tau,fA,fB,fC,fD,l1A,l1B,l1C,l1D
+2604,0.015625,0.32445240020751953,0.015625,0.892578125,0.0,0.091796875,-0.11765776202082634,-0.10138508677482605,nan,-0.08651512861251831
+5208,0.048828125,0.40048032999038696,0.029296875,0.900390625,0.01953125,0.05078125,-0.17876404523849487,-0.12725482881069183,-0.18487446755170822,-0.11079395189881325
+7812,0.15234375,0.6775624752044678,0.119140625,0.7734375,0.033203125,0.07421875,-0.20277076959609985,-0.1001129113137722,-0.19649504125118256,-0.12255467846989632
+10416,0.1796875,0.7741286754608154,0.166015625,0.599609375,0.013671875,0.220703125,-0.18029353022575378,-0.06513682007789612,-0.038433440029621124,-0.0504322350025177
+13020,0.30078125,0.8470979928970337,0.283203125,0.2421875,0.017578125,0.45703125,-0.17255809903144836,-0.024668633937835693,-0.07013614475727081,-0.010587955359369516
+15624,0.333984375,0.8144785761833191,0.3125,0.1484375,0.021484375,0.517578125,-0.19566883146762848,0.0031415667617693543,-0.05802934989333153,0.009074798785150051
+18228,0.474609375,0.8437007665634155,0.439453125,0.001953125,0.03515625,0.5234375,-0.08657147735357285,nan,-0.0343703031539917,0.012321019545197487
+20832,0.45703125,0.6549708247184753,0.423828125,0.0078125,0.033203125,0.53515625,-0.09613814204931259,-0.022231140173971653,-0.04714604467153549,0.001996797218453139
+23436,0.505859375,0.7171876430511475,0.484375,0.01171875,0.021484375,0.482421875,-0.14968957751989365,-0.04964173026382923,-0.06192338466644287,-0.0044072768650949
+26040,0.5,0.7611898183822632,0.47265625,0.0078125,0.02734375,0.4921875,-0.1546676605939865,0.0052672732854261994,-0.03868532460182905,0.023184756748378277
diff --git a/research/flossing/analysis_2x2/evolution_trm.csv b/research/flossing/analysis_2x2/evolution_trm.csv
new file mode 100644
index 0000000..a0b19f5
--- /dev/null
+++ b/research/flossing/analysis_2x2/evolution_trm.csv
@@ -0,0 +1,11 @@
+step,acc,tau,fA,fB,fC,fD,l1A,l1B,l1C,l1D
+26041,0.576171875,0.7283560037612915,0.560546875,0.01171875,0.015625,0.412109375,-0.04085663706064224,-0.0017999516567215323,0.015885045286267996,0.035852354019880295
+52082,0.6484375,0.8146334886550903,0.626953125,0.0,0.021484375,0.3515625,-0.005659707821905613,nan,0.05140006169676781,0.060247575864195824
+78123,0.6875,0.9091860055923462,0.66015625,0.0,0.02734375,0.3125,-0.004423868143931031,nan,0.05615279637277126,0.08166220411658287
+104164,0.7109375,1.0002410411834717,0.681640625,0.0,0.029296875,0.2890625,-0.006436891388148069,nan,0.08611240983009338,0.09516968205571175
+130205,0.755859375,1.2130658626556396,0.712890625,0.0,0.04296875,0.244140625,0.004312093835324049,nan,0.08519911020994186,0.10669758170843124
+156246,0.7578125,1.2462413311004639,0.73046875,0.0,0.02734375,0.2421875,0.03479858115315437,nan,0.08626002073287964,0.10777830705046654
+182287,0.7421875,1.3035683631896973,0.712890625,0.0,0.029296875,0.2578125,-0.006856455933302641,nan,0.09087501466274261,0.10055964067578316
+208328,0.75,1.3776456117630005,0.720703125,0.0,0.029296875,0.25,0.03374895453453064,nan,0.08769012242555618,0.10804034024477005
+234369,0.7734375,1.36112380027771,0.7421875,0.0,0.03125,0.2265625,0.011585433036088943,nan,0.09419485554099083,0.10411781445145607
+260410,0.76953125,1.406412124633789,0.748046875,0.0,0.021484375,0.23046875,0.004744160454720259,nan,0.09982780367136002,0.10232088342308998
diff --git a/research/flossing/analysis_2x2/offline_followups.py b/research/flossing/analysis_2x2/offline_followups.py
new file mode 100644
index 0000000..8101acb
--- /dev/null
+++ b/research/flossing/analysis_2x2/offline_followups.py
@@ -0,0 +1,229 @@
+"""Offline follow-ups to the 2x2 analysis (no GPU):
+
+1. Residual outcome signal within the unsettled stratum (HRM diag_8k primary,
+   TRM official @58590 secondary): per-cell drift profiles over the 16 ACT steps,
+   end-of-window drift slope, q_halt trajectories, halted_at, lambda spectra with a
+   STRICT in-band threshold, and per-drift-decile AUC(lambda1 -> correct) within the
+   unsettled stratum (does lambda1 add signal beyond drift level?).
+2. Per-example profile of the strict-band settled-but-wrong examples (HRM, n~21).
+3. Difficulty control: #givens per puzzle (input tokens != 1) joined via idx;
+   lambda1 ~ givens rank correlation overall/within outcome, and per-givens-bin
+   AUC(-lambda1 -> correct).
+
+Observational only. Outputs to analysis_2x2/offline_followups/.
+"""
+from __future__ import annotations
+
+from pathlib import Path
+
+import matplotlib
+
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+
+HERE = Path(__file__).resolve().parent
+FLOSS = HERE.parent
+OUT = HERE / "offline_followups"
+OUT.mkdir(exist_ok=True)
+
+DATA_TEST_INPUTS = Path("/home/yurenh2/rrm/data/sudoku-extreme-1k-aug-1000/test/all__inputs.npy")
+
+CELL_COLORS = {"A": "tab:green", "B": "tab:orange", "C": "tab:blue", "D": "tab:red"}
+
+
+def auc_rank(score: np.ndarray, label: np.ndarray) -> float:
+    pos, neg = score[label == 1], score[label == 0]
+    if len(pos) == 0 or len(neg) == 0:
+        return float("nan")
+    allv = np.concatenate([pos, neg])
+    order = np.argsort(allv, kind="mergesort")
+    ranks = np.empty(len(allv)); ranks[order] = np.arange(1, len(allv) + 1)
+    sv = allv[order]; i = 0
+    while i < len(sv):
+        j = i
+        while j + 1 < len(sv) and sv[j + 1] == sv[i]:
+            j += 1
+        if j > i:
+            ranks[order[i:j + 1]] = ranks[order[i:j + 1]].mean()
+        i = j + 1
+    return float((ranks[:len(pos)].sum() - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg)))
+
+
+def spearman(a: np.ndarray, b: np.ndarray) -> float:
+    ra = np.argsort(np.argsort(a)).astype(float)
+    rb = np.argsort(np.argsort(b)).astype(float)
+    return float(np.corrcoef(ra, rb)[0, 1])
+
+
+def load(npz_path: Path, strict_pct: float):
+    d = np.load(npz_path)
+    out = {k: d[k] for k in d.files}
+    out["logd_late"] = np.log10(np.clip(out["drift_zH"][:, -4:].mean(1), 1e-12, None))
+    out["tau_strict"] = float(np.percentile(out["logd_late"], strict_pct))
+    conv = out["logd_late"] < out["tau_strict"]
+    c = out["exact_correct"].astype(int)
+    out["cells"] = {
+        "A": conv & (c == 1), "B": conv & (c == 0),
+        "C": (~conv) & (c == 1), "D": (~conv) & (c == 0),
+    }
+    return out
+
+
+def givens_for(idx: np.ndarray) -> np.ndarray:
+    inputs = np.load(DATA_TEST_INPUTS, mmap_mode="r")
+    return np.array([(inputs[i] != 1).sum() for i in idx])
+
+
+def drift_profiles_fig(ds, tag, lines):
+    fig, axes = plt.subplots(1, 2, figsize=(11, 4))
+    steps = np.arange(1, ds["drift_zH"].shape[1] + 1)
+    for nm, m in ds["cells"].items():
+        if m.sum() < 3:
+            continue
+        med = np.median(ds["drift_zH"][m], axis=0)
+        q1 = np.percentile(ds["drift_zH"][m], 25, axis=0)
+        q3 = np.percentile(ds["drift_zH"][m], 75, axis=0)
+        axes[0].plot(steps, med, "o-", ms=3, color=CELL_COLORS[nm], label=f"{nm} (n={int(m.sum())})")
+        axes[0].fill_between(steps, q1, q3, color=CELL_COLORS[nm], alpha=0.15)
+        qm = np.median(ds["q_halt"][m], axis=0)
+        axes[1].plot(steps, qm, "o-", ms=3, color=CELL_COLORS[nm], label=nm)
+    axes[0].set_yscale("log"); axes[0].set_xlabel("ACT step"); axes[0].set_ylabel("drift_zH (median, IQR)")
+    axes[0].legend(fontsize=8); axes[0].set_title(f"{tag}: drift profiles per cell")
+    axes[1].set_xlabel("ACT step"); axes[1].set_ylabel("q_halt (median)"); axes[1].axhline(0, color="gray", lw=0.6)
+    axes[1].set_title(f"{tag}: q_halt per cell"); axes[1].legend(fontsize=8)
+    fig.tight_layout(); fig.savefig(OUT / f"fig_{tag}_profiles.png", dpi=150); plt.close(fig)
+
+    # end-of-window slope: log10 mean(drift[13:16]) - log10 mean(drift[9:12])
+    slope = (np.log10(np.clip(ds["drift_zH"][:, 12:16].mean(1), 1e-12, None))
+             - np.log10(np.clip(ds["drift_zH"][:, 8:12].mean(1), 1e-12, None)))
+    lines.append(f"\n### {tag}: end-of-window drift slope (log10 steps13-16 vs 9-12; <0 = still descending)")
+    for nm, m in ds["cells"].items():
+        if m.sum() == 0:
+            lines.append(f"- {nm}: n=0")
+            continue
+        lines.append(f"- {nm}: n={int(m.sum())}, slope median {np.median(slope[m]):+.4f}, "
+                     f"IQR [{np.percentile(slope[m],25):+.4f}, {np.percentile(slope[m],75):+.4f}], "
+                     f"frac still descending (<-0.01): {float((slope[m] < -0.01).mean()):.2f}")
+    return slope
+
+
+def main() -> None:
+    lines = ["# Offline follow-ups (no GPU) — 2026-06-11",
+             "",
+             "Strict in-band thresholds: HRM pct45 of pooled log10 late-drift; TRM pct60 (band edge; B=0 regardless).",
+             "All numbers observational; within-dataset comparisons only."]
+
+    # ---------- HRM diag_8k ----------
+    hrm = load(FLOSS / "diag_8k.npz", strict_pct=45)
+    tag = "hrm26040_n8192_strict"
+    lines.append(f"\n## HRM @26040 (n=8192), strict tau(log10)={hrm['tau_strict']:.4f}")
+    lines.append("| cell | n | lam1 med | lam8 med | token_acc med | halted_at med | q_halt_final med | givens med |")
+    lines.append("|---|---|---|---|---|---|---|---|")
+    g_hrm = givens_for(hrm["idx"])
+    for nm, m in hrm["cells"].items():
+        if m.sum() == 0:
+            lines.append(f"| {nm} | 0 | | | | | | |")
+            continue
+        lines.append(
+            f"| {nm} | {int(m.sum())} | {np.median(hrm['lyap_spec'][m,0]):+.4f} | {np.median(hrm['lyap_spec'][m,-1]):+.4f} "
+            f"| {np.median(hrm['token_acc'][m]):.3f} | {np.median(hrm['halted_at'][m]):.0f} "
+            f"| {np.median(hrm['q_halt'][m,-1]):+.2f} | {np.median(g_hrm[m]):.0f} |")
+
+    slope = drift_profiles_fig(hrm, tag, lines)
+
+    # residual signal within unsettled stratum: per-drift-decile AUC
+    uns = ~(hrm["cells"]["A"] | hrm["cells"]["B"])
+    c = hrm["exact_correct"].astype(int)
+    lines.append("\n### HRM unsettled stratum: AUC(-lam1 -> correct) per log-drift decile")
+    lines.append("| decile | drift range (log10) | n | n_correct | AUC |")
+    lines.append("|---|---|---|---|---|")
+    ld_u, l1_u, c_u = hrm["logd_late"][uns], hrm["lyap_spec"][uns, 0], c[uns]
+    qs = np.percentile(ld_u, np.arange(0, 101, 10))
+    aucs, ws = [], []
+    for i in range(10):
+        m = (ld_u >= qs[i]) & (ld_u <= qs[i + 1] if i == 9 else ld_u < qs[i + 1])
+        a = auc_rank(-l1_u[m], c_u[m])
+        if not np.isnan(a) and c_u[m].sum() >= 5:
+            aucs.append(a); ws.append(m.sum())
+        lines.append(f"| {i+1} | [{qs[i]:.2f}, {qs[i+1]:.2f}] | {int(m.sum())} | {int(c_u[m].sum())} | "
+                     f"{a:.3f} |" if not np.isnan(a) else f"| {i+1} | [{qs[i]:.2f}, {qs[i+1]:.2f}] | {int(m.sum())} | {int(c_u[m].sum())} | n/a |")
+    if aucs:
+        lines.append(f"- weighted mean within-decile AUC = {np.average(aucs, weights=ws):.3f} "
+                     f"(vs unconditioned within-unsettled AUC {auc_rank(-l1_u, c_u):.3f})")
+
+    # also: does end-slope separate C from D?
+    lines.append(f"- AUC(-end_slope -> correct | unsettled) = {auc_rank(-slope[uns], c_u):.3f} "
+                 f"(C still-descending fraction vs D, see slope table above)")
+
+    # ---------- strict-B per-example table ----------
+    B = hrm["cells"]["B"]
+    lines.append(f"\n## HRM strict-band settled-but-wrong examples (n={int(B.sum())})")
+    lines.append("| idx | givens | token_acc | lam1 | drift_final | halted_at | q_halt_final |")
+    lines.append("|---|---|---|---|---|---|---|")
+    bi = np.where(B)[0]
+    order = np.argsort(hrm["token_acc"][bi])
+    for j in bi[order]:
+        lines.append(
+            f"| {int(hrm['idx'][j])} | {int(g_hrm[j])} | {hrm['token_acc'][j]:.3f} | {hrm['lyap_spec'][j,0]:+.3f} "
+            f"| {hrm['drift_zH'][j,-1]:.3f} | {int(hrm['halted_at'][j])} | {hrm['q_halt'][j,-1]:+.2f} |")
+    # B drift profiles vs A band
+    fig, ax = plt.subplots(figsize=(6.5, 4))
+    steps = np.arange(1, 17)
+    A = hrm["cells"]["A"]
+    ax.fill_between(steps, np.percentile(hrm["drift_zH"][A], 10, axis=0),
+                    np.percentile(hrm["drift_zH"][A], 90, axis=0), color="tab:green", alpha=0.2,
+                    label=f"A q10-q90 (n={int(A.sum())})")
+    for j in bi:
+        ax.plot(steps, hrm["drift_zH"][j], "-", lw=1, alpha=0.8, color="tab:orange")
+    ax.set_yscale("log"); ax.set_xlabel("ACT step"); ax.set_ylabel("drift_zH")
+    ax.set_title("HRM: strict-B drift profiles vs A band"); ax.legend(fontsize=8)
+    fig.tight_layout(); fig.savefig(OUT / "fig_hrm_strictB_profiles.png", dpi=150); plt.close(fig)
+
+    # ---------- difficulty control (HRM) ----------
+    l1 = hrm["lyap_spec"][:, 0]
+    lines.append("\n## HRM difficulty control (#givens, input tokens != 1)")
+    lines.append(f"- givens: min {g_hrm.min()}, median {np.median(g_hrm):.0f}, max {g_hrm.max()}")
+    lines.append(f"- Spearman(lam1, givens): overall {spearman(l1, g_hrm):+.3f}; "
+                 f"correct-only {spearman(l1[c==1], g_hrm[c==1]):+.3f}; "
+                 f"wrong-only {spearman(l1[c==0], g_hrm[c==0]):+.3f}")
+    lines.append(f"- Spearman(correct, givens) = {spearman(c.astype(float), g_hrm):+.3f}")
+    lines.append("\n| givens bin | n | acc | AUC(-lam1 -> correct) |")
+    lines.append("|---|---|---|---|")
+    edges = np.unique(np.percentile(g_hrm, [0, 20, 40, 60, 80, 100]))
+    bin_aucs, bin_ws = [], []
+    for i in range(len(edges) - 1):
+        m = (g_hrm >= edges[i]) & (g_hrm <= edges[i + 1] if i == len(edges) - 2 else g_hrm < edges[i + 1])
+        a = auc_rank(-l1[m], c[m])
+        lines.append(f"| [{edges[i]:.0f}, {edges[i+1]:.0f}] | {int(m.sum())} | {c[m].mean():.3f} | {a:.3f} |")
+        if not np.isnan(a):
+            bin_aucs.append(a); bin_ws.append(m.sum())
+    lines.append(f"- weighted mean within-bin AUC = {np.average(bin_aucs, weights=bin_ws):.3f} (overall 0.984)")
+
+    # ---------- TRM official @58590 (secondary, n=512) ----------
+    trm = load(FLOSS / "official_gbs768_spectrum/trm_gbs768_base_step58590_n512_k8_seed20260602.npz", strict_pct=60)
+    g_trm = givens_for(trm["idx"])
+    ct = trm["exact_correct"].astype(int)
+    lines.append(f"\n## TRM official @58590 (n=512), strict tau(log10)={trm['tau_strict']:.4f}")
+    lines.append("| cell | n | lam1 med | token_acc med | q_halt_final med | givens med |")
+    lines.append("|---|---|---|---|---|---|")
+    for nm, m in trm["cells"].items():
+        if m.sum() == 0:
+            lines.append(f"| {nm} | 0 | | | | |")
+            continue
+        lines.append(f"| {nm} | {int(m.sum())} | {np.median(trm['lyap_spec'][m,0]):+.4f} "
+                     f"| {np.median(trm['token_acc'][m]):.3f} | {np.median(trm['q_halt'][m,-1]):+.2f} "
+                     f"| {np.median(g_trm[m]):.0f} |")
+    drift_profiles_fig(trm, "trm_official58590_n512_strict", lines)
+    l1t = trm["lyap_spec"][:, 0]
+    lines.append(f"- Spearman(lam1, givens): overall {spearman(l1t, g_trm):+.3f}; "
+                 f"wrong-only {spearman(l1t[ct==0], g_trm[ct==0]):+.3f}")
+    lines.append(f"- Spearman(correct, givens) = {spearman(ct.astype(float), g_trm):+.3f}")
+
+    (OUT / "followups.md").write_text("\n".join(lines))
+    print("\n".join(lines[:6]))
+    print("wrote", OUT / "followups.md")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/research/flossing/analysis_2x2/offline_followups/followups.md b/research/flossing/analysis_2x2/offline_followups/followups.md
new file mode 100644
index 0000000..be5f519
--- /dev/null
+++ b/research/flossing/analysis_2x2/offline_followups/followups.md
@@ -0,0 +1,88 @@
+# Offline follow-ups (no GPU) — 2026-06-11
+
+Strict in-band thresholds: HRM pct45 of pooled log10 late-drift; TRM pct60 (band edge; B=0 regardless).
+All numbers observational; within-dataset comparisons only.
+
+## HRM @26040 (n=8192), strict tau(log10)=-0.0129
+| cell | n | lam1 med | lam8 med | token_acc med | halted_at med | q_halt_final med | givens med |
+|---|---|---|---|---|---|---|---|
+| A | 3665 | -0.8670 | -0.9787 | 1.000 | 4 | +7.47 | 26 |
+| B | 21 | -0.8421 | -0.9495 | 0.617 | 6 | +7.47 | 25 |
+| C | 633 | -0.7796 | -0.8815 | 1.000 | 10 | +7.47 | 25 |
+| D | 3873 | -0.5991 | -0.7140 | 0.630 | 0 | -9.62 | 25 |
+
+### hrm26040_n8192_strict: end-of-window drift slope (log10 steps13-16 vs 9-12; <0 = still descending)
+- A: n=3665, slope median -0.0023, IQR [-0.0073, +0.0012], frac still descending (<-0.01): 0.20
+- B: n=21, slope median -0.0063, IQR [-0.4005, -0.0009], frac still descending (<-0.01): 0.48
+- C: n=633, slope median -0.0006, IQR [-1.4084, +0.0088], frac still descending (<-0.01): 0.44
+- D: n=3873, slope median -0.0031, IQR [-0.0556, +0.0459], frac still descending (<-0.01): 0.46
+
+### HRM unsettled stratum: AUC(-lam1 -> correct) per log-drift decile
+| decile | drift range (log10) | n | n_correct | AUC |
+|---|---|---|---|---|
+| 1 | [-0.01, 0.66] | 451 | 422 | 0.966 |
+| 2 | [0.66, 1.42] | 450 | 50 | 0.972 |
+| 3 | [1.42, 1.52] | 451 | 4 | 0.988 |
+| 4 | [1.52, 1.56] | 450 | 6 | 0.964 |
+| 5 | [1.56, 1.60] | 451 | 1 | 0.984 |
+| 6 | [1.60, 1.62] | 450 | 7 | 0.949 |
+| 7 | [1.62, 1.65] | 451 | 5 | 0.837 |
+| 8 | [1.65, 1.68] | 450 | 5 | 0.851 |
+| 9 | [1.68, 1.71] | 451 | 16 | 0.804 |
+| 10 | [1.71, 1.93] | 451 | 117 | 0.685 |
+- weighted mean within-decile AUC = 0.879 (vs unconditioned within-unsettled AUC 0.933)
+- AUC(-end_slope -> correct | unsettled) = 0.605 (C still-descending fraction vs D, see slope table above)
+
+## HRM strict-band settled-but-wrong examples (n=21)
+| idx | givens | token_acc | lam1 | drift_final | halted_at | q_halt_final |
+|---|---|---|---|---|---|---|
+| 342267 | 17 | 0.407 | -0.867 | 0.976 | 5 | +7.41 |
+| 212705 | 17 | 0.469 | -0.838 | 0.964 | 8 | +7.44 |
+| 329832 | 17 | 0.481 | -0.703 | 0.970 | 8 | +7.41 |
+| 20075 | 27 | 0.519 | -0.812 | 0.966 | 5 | +7.50 |
+| 198242 | 25 | 0.568 | -0.843 | 0.980 | 7 | +7.47 |
+| 223591 | 24 | 0.580 | -0.939 | 0.951 | 4 | +7.47 |
+| 238704 | 27 | 0.593 | -0.931 | 0.953 | 5 | +7.47 |
+| 364431 | 25 | 0.593 | -0.806 | 0.956 | 6 | +7.44 |
+| 274637 | 26 | 0.593 | -0.859 | 0.979 | 6 | +7.47 |
+| 182424 | 24 | 0.605 | -0.985 | 0.949 | 6 | +7.47 |
+| 351919 | 25 | 0.617 | -0.742 | 0.965 | 5 | +7.47 |
+| 123022 | 27 | 0.617 | -0.826 | 0.951 | 7 | +7.50 |
+| 150426 | 25 | 0.630 | -0.767 | 0.963 | 9 | +7.47 |
+| 175427 | 26 | 0.630 | -0.843 | 0.946 | 8 | +7.50 |
+| 422185 | 26 | 0.642 | -0.841 | 0.946 | 7 | +7.47 |
+| 344032 | 24 | 0.654 | -0.903 | 0.965 | 4 | +7.53 |
+| 30703 | 25 | 0.691 | -0.732 | 0.972 | 6 | +7.53 |
+| 386549 | 23 | 0.691 | -0.842 | 0.966 | 4 | +7.47 |
+| 3370 | 26 | 0.716 | -0.861 | 0.955 | 6 | +7.47 |
+| 243909 | 24 | 0.753 | -0.786 | 0.969 | 8 | +7.50 |
+| 258307 | 25 | 0.877 | -0.918 | 0.952 | 5 | +7.47 |
+
+## HRM difficulty control (#givens, input tokens != 1)
+- givens: min 17, median 25, max 36
+- Spearman(lam1, givens): overall -0.350; correct-only -0.155; wrong-only -0.180
+- Spearman(correct, givens) = +0.276
+
+| givens bin | n | acc | AUC(-lam1 -> correct) |
+|---|---|---|---|
+| [17, 24] | 1152 | 0.321 | 0.976 |
+| [24, 25] | 1795 | 0.373 | 0.980 |
+| [25, 26] | 1764 | 0.503 | 0.987 |
+| [26, 36] | 3481 | 0.681 | 0.983 |
+- weighted mean within-bin AUC = 0.982 (overall 0.984)
+
+## TRM official @58590 (n=512), strict tau(log10)=1.0240
+| cell | n | lam1 med | token_acc med | q_halt_final med | givens med |
+|---|---|---|---|---|---|
+| A | 307 | +0.0105 | 1.000 | +7.78 | 26 |
+| B | 0 | | | | |
+| C | 141 | +0.0174 | 1.000 | +7.81 | 25 |
+| D | 64 | +0.1034 | 0.630 | -11.12 | 25 |
+
+### trm_official58590_n512_strict: end-of-window drift slope (log10 steps13-16 vs 9-12; <0 = still descending)
+- A: n=307, slope median -0.1471, IQR [-0.2267, -0.0641], frac still descending (<-0.01): 0.90
+- B: n=0
+- C: n=141, slope median -0.0080, IQR [-0.2603, +0.0808], frac still descending (<-0.01): 0.49
+- D: n=64, slope median -0.0125, IQR [-0.0525, +0.0276], frac still descending (<-0.01): 0.53
+- Spearman(lam1, givens): overall -0.240; wrong-only -0.238
+- Spearman(correct, givens) = +0.148
\ No newline at end of file diff --git a/research/flossing/analysis_2x2/offline_followups/phase1_e1.md b/research/flossing/analysis_2x2/offline_followups/phase1_e1.md new file mode 100644 index 0000000..b3e3f40 --- /dev/null +++ b/research/flossing/analysis_2x2/offline_followups/phase1_e1.md @@ -0,0 +1,27 @@ +# E1 offline batch — bootstrap CIs, settling robustness, TRM multi4 pair + +## Bootstrap / exact CIs (TRM official @58590, n=2048) +- settled-wrong fraction: observed 0/254; exact 95% upper bound 0.0117 (1.17% of failures) +- AUC(-lam1->correct) = 0.9935, bootstrap 95% CI (0.9908244697676584, 0.9957330475628791) +- lam1(wrong) median 95% CI (0.10100110620260239, 0.10556983947753906) +- lam1(correct) median 95% CI (0.011215815320611, 0.011744528077542782) + +## Bootstrap CIs (HRM @26040, n=8192, strict band) +- strict settled-wrong fraction of failures: observed 0.0054, bootstrap 95% CI (0.0032613427182413084, 0.007798538095694945) +- AUC(-lam1->correct) = 0.9841, bootstrap 95% CI (0.9815470536412456, 0.9865145187475995) + +## Settling-criterion robustness (B-cell counts under alternative drift definitions) +- TRM official n=2048 | zH: B=0/A=1724 (tau=1.36) | zL: B=0/A=1728 (tau=1.42) | combined: B=0/A=1727 (tau=1.54) +- HRM n=8192 | zH: B=63/A=4103 (tau=0.77) | zL: B=59/A=4083 (tau=1.01) | combined: B=60/A=4087 (tau=1.07) + +## TRM official-pipeline multi4 vs baseline (matched objective, n=512 each) +- baseline @58590: acc=0.875; A/B/C/D=434/0/14/64; fD=0.125; lam1(D)=+0.1034; lam1(A)=+0.0111 +- multi4 @35805 (best): acc=0.900; A/B/C/D=452/0/9/51; fD=0.100; lam1(D)=+0.1019; lam1(A)=+0.0039 +- multi4 @65100 (final): acc=0.824; A/B/C/D=408/1/14/89; fD=0.174; lam1(D)=+0.0946; lam1(A)=+0.0133 + +## hrm_multi4 provenance (E6a) +- diag_hrm_multi4_step_{20832,23436,26040}_512.npz step grid matches HRM pretrain numbering; + multi4_eval_compare/logs should contain the eval invocations — checked manually below. 
+- ACTION: if the hrm_multi4 run is pretrain-pipeline (ACT-streaming + perturbation), then the + May-28 multi4 vs righteous baseline comparison IS matched-pipeline and Sec 3.4's caveat is + narrower than written; step9 E-vs-F pair (queued) covers the fixed-unroll objective regardless.
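The "exact 95% upper bound 0.0117" reported for the 0/254 settled-wrong count is the zero-event case of the Clopper-Pearson interval, where the one-sided bound solves (1 - p)^n = alpha. A minimal sketch for checking that figure, assuming a one-sided bound at alpha = 0.05; the function name `zero_event_upper_bound` is hypothetical and not part of the repository:

```python
def zero_event_upper_bound(n_trials: int, alpha: float = 0.05) -> float:
    """One-sided exact upper bound on p when 0 events are observed in n_trials.

    With zero events the Clopper-Pearson bound has the closed form
    1 - alpha**(1/n), since (1 - p)**n = alpha at the boundary.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    return 1.0 - alpha ** (1.0 / n_trials)


# 254 failures, none of them settled (TRM official @58590, n=2048).
bound = zero_event_upper_bound(254)
rule_of_three = 3.0 / 254  # common quick approximation, slightly looser
print(f"exact: {bound:.4f}  rule-of-three: {rule_of_three:.4f}")
```

A two-sided 95% interval would use alpha = 0.025 instead and give a somewhat larger bound, so the reported number should be read as one-sided.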
\ No newline at end of file diff --git a/research/flossing/analysis_2x2/offline_phase1_e1.py b/research/flossing/analysis_2x2/offline_phase1_e1.py new file mode 100644 index 0000000..7c42825 --- /dev/null +++ b/research/flossing/analysis_2x2/offline_phase1_e1.py @@ -0,0 +1,123 @@ +"""E1 offline batch (experiment_framework.md): +(1) bootstrap CIs for headline quantities; (2) settling-criterion robustness (z_L / combined); +(3) TRM official multi4 vs baseline matched-pipeline 2x2; (4) provenance note for hrm_multi4. +Outputs: offline_followups/phase1_e1.md +""" +from __future__ import annotations + +from pathlib import Path + +import numpy as np + +HERE = Path(__file__).resolve().parent +FLOSS = HERE.parent +OUT = HERE / "offline_followups" +RNG = np.random.default_rng(0) + + +def auc_rank(score, label): + pos, neg = score[label == 1], score[label == 0] + if len(pos) == 0 or len(neg) == 0: + return float("nan") + allv = np.concatenate([pos, neg]) + order = np.argsort(allv, kind="mergesort") + ranks = np.empty(len(allv)); ranks[order] = np.arange(1, len(allv) + 1) + sv = allv[order]; i = 0 + while i < len(sv): + j = i + while j + 1 < len(sv) and sv[j + 1] == sv[i]: + j += 1 + if j > i: + ranks[order[i:j + 1]] = ranks[order[i:j + 1]].mean() + i = j + 1 + return float((ranks[:len(pos)].sum() - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg))) + + +def boot_ci(stat_fn, n, B=10000): + vals = [] + for _ in range(B): + idx = RNG.integers(0, n, n) + v = stat_fn(idx) + if not np.isnan(v): + vals.append(v) + return float(np.percentile(vals, 2.5)), float(np.percentile(vals, 97.5)) + + +def late(d, key="drift_zH"): + return np.log10(np.clip(d[key][:, -4:].mean(1), 1e-12, None)) + + +def otsu(x, nbins=256): + h, e = np.histogram(x, bins=nbins); h = h.astype(float) + c = (e[:-1] + e[1:]) / 2 + p = h / h.sum(); om = np.cumsum(p); mu = np.cumsum(p * c); mt = mu[-1] + den = om * (1 - om); den[den <= 0] = np.nan + return float(c[np.nanargmax((mt * om - mu) ** 2 / den)]) + + 
+lines = ["# E1 offline batch — bootstrap CIs, settling robustness, TRM multi4 pair", ""] + +# ---------- (1) bootstrap CIs ---------- +trm = np.load(FLOSS / "analysis_2x2/retest/trm_gbs768_step58590_full_n2048.npz") +c_t = trm["exact_correct"].astype(int); l1_t = trm["lyap_spec"][:, 0]; ld_t = late(trm) +n_wrong = int((c_t == 0).sum()) +cp_upper = 1 - 0.05 ** (1 / n_wrong) # Clopper-Pearson upper for 0 events +lines += ["## Bootstrap / exact CIs (TRM official @58590, n=2048)", + f"- settled-wrong fraction: observed 0/{n_wrong}; exact 95% upper bound {cp_upper:.4f} " + f"({cp_upper*100:.2f}% of failures)", + f"- AUC(-lam1->correct) = {auc_rank(-l1_t, c_t):.4f}, bootstrap 95% CI " + f"{boot_ci(lambda i: auc_rank(-l1_t[i], c_t[i]), len(c_t))}", + f"- lam1(wrong) median 95% CI {boot_ci(lambda i: float(np.median(l1_t[i][c_t[i]==0])), len(c_t))}", + f"- lam1(correct) median 95% CI {boot_ci(lambda i: float(np.median(l1_t[i][c_t[i]==1])), len(c_t))}"] + +hrm = np.load(FLOSS / "diag_8k.npz") +c_h = hrm["exact_correct"].astype(int); l1_h = hrm["lyap_spec"][:, 0]; ld_h = late(hrm) +tau_strict = float(np.percentile(ld_h, 45)) +def strictB_frac(idx): + c, ld = c_h[idx], ld_h[idx] + w = (c == 0).sum() + return float(((ld < tau_strict) & (c == 0)).sum() / max(w, 1)) +lines += ["", "## Bootstrap CIs (HRM @26040, n=8192, strict band)", + f"- strict settled-wrong fraction of failures: observed {21/3894:.4f}, bootstrap 95% CI " + f"{boot_ci(strictB_frac, len(c_h))}", + f"- AUC(-lam1->correct) = {auc_rank(-l1_h, c_h):.4f}, bootstrap 95% CI " + f"{boot_ci(lambda i: auc_rank(-l1_h[i], c_h[i]), len(c_h))}"] + +# ---------- (2) settling-criterion robustness ---------- +lines += ["", "## Settling-criterion robustness (B-cell counts under alternative drift definitions)"] +for tag, d, c in [("TRM official n=2048", trm, c_t), ("HRM n=8192", hrm, c_h)]: + row = [tag] + for key, nm in [("drift_zH", "zH"), ("drift_zL", "zL")]: + ld = late(d, key); tau = otsu(ld); conv = ld < tau + 
row.append(f"{nm}: B={int((conv & (c==0)).sum())}/A={int((conv & (c==1)).sum())} (tau={tau:.2f})") + comb = np.log10(np.clip(np.sqrt(d["drift_zH"][:, -4:].mean(1) ** 2 + d["drift_zL"][:, -4:].mean(1) ** 2), 1e-12, None)) + tau = otsu(comb); conv = comb < tau + row.append(f"combined: B={int((conv & (c==0)).sum())}/A={int((conv & (c==1)).sum())} (tau={tau:.2f})") + lines.append("- " + " | ".join(row)) + +# ---------- (3) TRM official multi4 vs baseline (matched pipeline) ---------- +lines += ["", "## TRM official-pipeline multi4 vs baseline (matched objective, n=512 each)"] +spec = FLOSS / "official_gbs768_spectrum" +for nm, f in [("baseline @58590", spec / "trm_gbs768_base_step58590_n512_k8_seed20260602.npz"), + ("multi4 @35805 (best)", spec / "trm_gbs768_multi4_step35805_n512_k8_seed20260602.npz"), + ("multi4 @65100 (final)", spec / "trm_gbs768_multi4_step65100_n512_k8_seed20260602.npz")]: + if not f.exists(): + lines.append(f"- {nm}: npz missing"); continue + d = np.load(f); c = d["exact_correct"].astype(int); l1 = d["lyap_spec"][:, 0] + ld = late(d); tau = otsu(ld); conv = ld < tau + A = conv & (c == 1); B = conv & (c == 0); C = (~conv) & (c == 1); D = (~conv) & (c == 0) + lines.append(f"- {nm}: acc={c.mean():.3f}; A/B/C/D={int(A.sum())}/{int(B.sum())}/{int(C.sum())}/{int(D.sum())}; " + f"fD={float(D.mean()):.3f}; lam1(D)={np.median(l1[D]) if D.sum()>0 else float('nan'):+.4f}; " + f"lam1(A)={np.median(l1[A]) if A.sum()>0 else float('nan'):+.4f}") + +# ---------- (4) provenance note ---------- +lines += ["", "## hrm_multi4 provenance (E6a)", + "- diag_hrm_multi4_step_{20832,23436,26040}_512.npz step grid matches HRM pretrain numbering;", + " multi4_eval_compare/logs should contain the eval invocations — checked manually below.", + "- ACTION: if the hrm_multi4 run is pretrain-pipeline (ACT-streaming + perturbation), then the", + " May-28 multi4 vs righteous baseline comparison IS matched-pipeline and Sec 3.4's caveat is", + " narrower than written; step9 E-vs-F 
pair (queued) covers the fixed-unroll objective regardless."] + +OUT.mkdir(exist_ok=True) +(OUT / "phase1_e1.md").write_text("\n".join(lines)) +print("\n".join(lines)) diff --git a/research/flossing/analysis_2x2/phase1/phase1_results.md b/research/flossing/analysis_2x2/phase1/phase1_results.md new file mode 100644 index 0000000..5dde25d --- /dev/null +++ b/research/flossing/analysis_2x2/phase1/phase1_results.md @@ -0,0 +1,48 @@ +# Phase-1 analysis (E5 horizon sweep, E6 matched-objective, E2 replication) + +## E5: when does fate become legible? AUC(early signal -> FINAL correct), restricted to not-yet-correct@H + +### TRM +| H | solved@H | undecided n | of which eventual✓ | AUC(-λ₁) | AUC(-drift) | AUC(q_halt) | +|---|---|---|---|---|---|---| +| 2 | 0.542 | 939 | 0.729 | 0.526 | 0.411 | 0.608 | +| 4 | 0.694 | 626 | 0.594 | 0.543 | 0.492 | 0.521 | +| 6 | 0.766 | 479 | 0.470 | 0.523 | 0.441 | 0.531 | +| 8 | 0.801 | 407 | 0.376 | 0.483 | 0.495 | 0.514 | +| 10 | 0.831 | 347 | 0.268 | 0.477 | 0.536 | 0.476 | +| 12 | 0.846 | 315 | 0.194 | 0.483 | 0.509 | 0.509 | + +### HRM +| H | solved@H | undecided n | of which eventual✓ | AUC(-λ₁) | AUC(-drift) | AUC(q_halt) | +|---|---|---|---|---|---|---| +| 2 | 0.071 | 1903 | 0.490 | 0.375 | 0.193 | 0.810 | +| 4 | 0.345 | 1342 | 0.276 | 0.448 | 0.312 | 0.734 | +| 6 | 0.436 | 1155 | 0.159 | 0.438 | 0.349 | 0.676 | +| 8 | 0.471 | 1083 | 0.103 | 0.432 | 0.371 | 0.705 | +| 10 | 0.489 | 1046 | 0.072 | 0.478 | 0.407 | 0.705 | +| 12 | 0.506 | 1012 | 0.041 | 0.503 | 0.348 | 0.635 | + +## E6: matched-objective intervention (step9 fixed-unroll runs, n=512) + +### HRM (E base vs F multi4) +| ckpt | base acc | base fD | base λ₁(D) | multi4 acc | multi4 fD | multi4 λ₁(D) | +|---|---|---|---|---|---|---| +| step_12500 | 0.613 | 0.387 | -0.0157 | 0.631 | 0.369 | -0.0215 | +| step_25000 | 0.615 | 0.385 | +0.0138 | 0.617 | 0.379 | +0.0101 | +| best | 0.619 | 0.381 | -0.0214 | 0.613 | 0.387 | -0.0181 | +| final | 0.588 | 0.410 | +0.0456 | 0.609 | 
0.387 | +0.0335 | + +### TRM (G base vs H multi4) +| ckpt | base acc | base fD | base λ₁(D) | multi4 acc | multi4 fD | multi4 λ₁(D) | +|---|---|---|---|---|---|---| +| step_12500 | 0.553 | 0.344 | +0.0219 | 0.600 | 0.305 | +0.0284 | +| step_25000 | 0.525 | 0.396 | +0.0209 | 0.545 | 0.361 | +0.0222 | +| best | 0.596 | 0.334 | +0.0233 | 0.580 | 0.361 | +0.0366 | +| final | 0.477 | 0.312 | +0.0164 | 0.537 | 0.270 | +0.0249 | + +## E2: decomposition on a SECOND HRM training run (step9_E fixed-unroll, n=2048) + +| ckpt | acc | A | B | C | D | λ₁(A) | λ₁(D) | settled-wrong frac of failures | +|---|---|---|---|---|---|---|---|---| +| best | 0.637 | 1244 | 1 | 61 | 742 | -0.1685 | -0.0308 | 0.0013 | +| final | 0.594 | 1124 | 3 | 92 | 829 | -0.0144 | +0.0444 | 0.0036 |
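The "settled-wrong frac of failures" column in the E2 table is consistent with B / (B + D) computed from the cell counts in the same row. A small sketch that recomputes it, assuming that definition (inferred from the numbers, not read from `analyze_2x2.py`); `mixture_fractions` is a hypothetical helper name:

```python
def mixture_fractions(a: int, b: int, c: int, d: int) -> dict:
    """Mixture fractions for the 2x2 (settled x correct) table.

    a: settled & correct, b: settled & wrong,
    c: unsettled & correct, d: unsettled & wrong.
    """
    wrong = b + d
    correct = a + c
    return {
        # share of failures that nevertheless settled
        "wrong_that_settled": b / wrong if wrong else float("nan"),
        # share of successes that did not settle
        "correct_that_unsettled": c / correct if correct else float("nan"),
    }


# Cell counts copied from the E2 table (step9_E fixed-unroll, n=2048).
best = mixture_fractions(1244, 1, 61, 742)
final = mixture_fractions(1124, 3, 92, 829)
print(f"best:  {best['wrong_that_settled']:.4f}")
print(f"final: {final['wrong_that_settled']:.4f}")
```

The same helper reproduces the `mixture` block of the JSON results, for example 63 / (63 + 3831) for the HRM n=8192 run.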
\ No newline at end of file diff --git a/research/flossing/analysis_2x2/results.md b/research/flossing/analysis_2x2/results.md new file mode 100644 index 0000000..a62bba2 --- /dev/null +++ b/research/flossing/analysis_2x2/results.md @@ -0,0 +1,69 @@ +# 2x2 analysis (convergence x correctness) — generated 2026-06-11 + +## hrm26040_n8192 +- npz: `/home/yurenh2/rrm/research/flossing/diag_8k.npz`, n=8192, exact_acc=0.525 +- late-drift def: mean(drift_zH[:, -4:]), Otsu tau(log10)=0.766, frac_converged=0.509 + +| cell | n | lam1 median | lam1 IQR | token_acc med | +|---|---|---|---|---| +| A_conv_correct | 4103 | -0.8617 | [-0.9020, -0.8153] | 1.000 | +| B_conv_wrong | 63 | -0.5912 | [-0.8091, -0.5061] | 0.630 | +| C_nonconv_correct | 195 | -0.6943 | [-0.7354, -0.6466] | 1.000 | +| D_nonconv_wrong | 3831 | -0.5998 | [-0.6475, -0.5488] | 0.630 | + +- mixture: wrong-that-converged = 0.016; correct-that-nonconverged = 0.045 +- dlam1(correct-wrong): overall -0.2586; within-conv -0.2705; within-nonconv -0.0945 +- dlam1(wrong: conv - nonconv) = +0.0085 +- AUC(-lam1 -> correct): overall 0.984; within-conv 0.852; within-nonconv 0.818 +- AUC(-log d_late -> correct) = 0.964; AUC(-lam1 -> converged) = 0.986 + +## trm_singleGPU_step260410_n512 +- npz: `/home/yurenh2/rrm/research/flossing/diag_trm_singleGPU_step260410_512.npz`, n=512, exact_acc=0.770 +- late-drift def: mean(drift_zH[:, -4:]), Otsu tau(log10)=1.406, frac_converged=0.748 + +| cell | n | lam1 median | lam1 IQR | token_acc med | +|---|---|---|---|---| +| A_conv_correct | 383 | +0.0047 | [+0.0007, +0.0137] | 1.000 | +| B_conv_wrong | 0 | - | - | - | +| C_nonconv_correct | 11 | +0.0998 | [+0.0907, +0.1035] | 1.000 | +| D_nonconv_wrong | 118 | +0.1023 | [+0.0935, +0.1112] | 0.642 | + +- mixture: wrong-that-converged = 0.000; correct-that-nonconverged = 0.028 +- dlam1(correct-wrong): overall -0.0973; within-conv +nan; within-nonconv -0.0025 +- dlam1(wrong: conv - nonconv) = +nan +- AUC(-lam1 -> correct): overall 0.989; 
within-conv nan; within-nonconv 0.619 +- AUC(-log d_late -> correct) = 0.975; AUC(-lam1 -> converged) = 0.996 + +## trm_singleGPU_step130205_n512 +- npz: `/home/yurenh2/rrm/research/flossing/diag_trm_singleGPU_step130205_512.npz`, n=512, exact_acc=0.756 +- late-drift def: mean(drift_zH[:, -4:]), Otsu tau(log10)=1.213, frac_converged=0.713 + +| cell | n | lam1 median | lam1 IQR | token_acc med | +|---|---|---|---|---| +| A_conv_correct | 365 | +0.0043 | [-0.0007, +0.0141] | 1.000 | +| B_conv_wrong | 0 | - | - | - | +| C_nonconv_correct | 22 | +0.0852 | [+0.0694, +0.0988] | 1.000 | +| D_nonconv_wrong | 125 | +0.1067 | [+0.0995, +0.1126] | 0.630 | + +- mixture: wrong-that-converged = 0.000; correct-that-nonconverged = 0.057 +- dlam1(correct-wrong): overall -0.1017; within-conv +nan; within-nonconv -0.0215 +- dlam1(wrong: conv - nonconv) = +nan +- AUC(-lam1 -> correct): overall 0.989; within-conv nan; within-nonconv 0.805 +- AUC(-log d_late -> correct) = 0.957; AUC(-lam1 -> converged) = 0.996 + +## trm_step13020_n512 +- npz: `/home/yurenh2/rrm/research/flossing/diag_trm_step13020_512.npz`, n=512, exact_acc=0.596 +- late-drift def: mean(drift_zH[:, -4:]), Otsu tau(log10)=0.730, frac_converged=0.598 + +| cell | n | lam1 median | lam1 IQR | token_acc med | +|---|---|---|---|---| +| A_conv_correct | 296 | -0.0043 | [-0.0113, +0.0052] | 1.000 | +| B_conv_wrong | 10 | -0.0027 | [-0.0090, +0.0009] | 0.556 | +| C_nonconv_correct | 9 | +0.0419 | [+0.0323, +0.0480] | 1.000 | +| D_nonconv_wrong | 197 | +0.0317 | [+0.0143, +0.0462] | 0.642 | + +- mixture: wrong-that-converged = 0.048; correct-that-nonconverged = 0.030 +- dlam1(correct-wrong): overall -0.0339; within-conv -0.0016; within-nonconv +0.0102 +- dlam1(wrong: conv - nonconv) = -0.0344 +- AUC(-lam1 -> correct): overall 0.849; within-conv 0.515; within-nonconv 0.360 +- AUC(-log d_late -> correct) = 0.976; AUC(-lam1 -> converged) = 0.887 diff --git a/research/flossing/analysis_2x2/results_hrm26040_joint_n2048.json 
b/research/flossing/analysis_2x2/results_hrm26040_joint_n2048.json new file mode 100644 index 0000000..0a13647 --- /dev/null +++ b/research/flossing/analysis_2x2/results_hrm26040_joint_n2048.json @@ -0,0 +1,124 @@ +{ + "npz": "retest/hrm_righteous_step26040_full_n2048.npz", + "n": 2048, + "exact_acc": 0.52587890625, + "late_drift_def": "mean(drift_zH[:, -4:])", + "otsu_tau_log10": 0.7878647951431916, + "frac_converged": 0.5048828125, + "cells": { + "A_conv_correct": { + "n": 1020, + "lam1_median": -0.15225201100111008, + "lam1_mean": -0.14917536273087356, + "lam1_iqr": [ + -0.17794504761695862, + -0.12353727221488953 + ], + "lam8_median": -0.2292146310210228, + "spectrum_median": [ + -0.15225201100111008, + -0.16475734114646912, + -0.1838746815919876, + -0.1940322071313858, + -0.2046389803290367, + -0.2135268971323967, + -0.22090013325214386, + -0.2292146310210228 + ], + "token_acc_median": 1.0, + "halted_at_median": 4.0 + }, + "B_conv_wrong": { + "n": 14, + "lam1_median": -0.035938894376158714, + "lam1_mean": -0.045284229197672436, + "lam1_iqr": [ + -0.10053892433643341, + 0.023329108022153378 + ], + "lam8_median": -0.09174276143312454, + "spectrum_median": [ + -0.035938894376158714, + -0.042827894911170006, + -0.05657286383211613, + -0.06455842405557632, + -0.07352825626730919, + -0.07300741225481033, + -0.08471239358186722, + -0.09174276143312454 + ], + "token_acc_median": 0.6419753134250641, + "halted_at_median": 0.0 + }, + "C_nonconv_correct": { + "n": 57, + "lam1_median": -0.04426668584346771, + "lam1_mean": -0.03947403389560222, + "lam1_iqr": [ + -0.0739532858133316, + -0.009610416367650032 + ], + "lam8_median": -0.115885429084301, + "spectrum_median": [ + -0.04426668584346771, + -0.06513039767742157, + -0.07873687148094177, + -0.09092652052640915, + -0.09820108860731125, + -0.10941123217344284, + -0.10965009033679962, + -0.115885429084301 + ], + "token_acc_median": 1.0, + "halted_at_median": 14.0 + }, + "D_nonconv_wrong": { + "n": 957, + "lam1_median": 
0.031880877912044525, + "lam1_mean": 0.037503608156111645, + "lam1_iqr": [ + 0.001404385082423687, + 0.07097765058279037 + ], + "lam8_median": -0.06430909037590027, + "spectrum_median": [ + 0.031880877912044525, + 0.002569675212725997, + -0.016799692064523697, + -0.029757805168628693, + -0.04117149859666824, + -0.049708690494298935, + -0.05613057315349579, + -0.06430909037590027 + ], + "token_acc_median": 0.6296296119689941, + "halted_at_median": 0.0 + } + }, + "mixture": { + "wrong_that_converged": 0.014418125643666324, + "correct_that_nonconverged": 0.052924791086350974 + }, + "contrasts": { + "dlam1_correct_minus_wrong_overall": -0.18112388998270035, + "dlam1_within_converged": -0.11631311662495136, + "dlam1_within_nonconverged": -0.07614756375551224, + "dlam1_wrong_conv_minus_wrong_nonconv": -0.06781977228820324 + }, + "auc": { + "neg_lam1_predicts_correct_overall": 0.987268674570913, + "neg_lam1_predicts_correct_within_conv": 0.8455182072829132, + "neg_lam1_predicts_correct_within_nonconv": 0.8541128160002933, + "neg_logdrift_predicts_correct": 0.9627364412914158, + "neg_lam1_predicts_converged": 0.9916087731145015 + }, + "aux": { + "late_drift_zL_corr_with_zH_log": 0.9984897992742242, + "q_halt_final_median_by_cell": { + "A_conv_correct": 7.46875, + "B_conv_wrong": -6.671875, + "C_nonconv_correct": 7.4375, + "D_nonconv_wrong": -9.625 + } + } +}
\ No newline at end of file diff --git a/research/flossing/analysis_2x2/results_hrm26040_n8192.json b/research/flossing/analysis_2x2/results_hrm26040_n8192.json new file mode 100644 index 0000000..bf4c711 --- /dev/null +++ b/research/flossing/analysis_2x2/results_hrm26040_n8192.json @@ -0,0 +1,124 @@ +{ + "npz": "/home/yurenh2/rrm/research/flossing/diag_8k.npz", + "n": 8192, + "exact_acc": 0.524658203125, + "late_drift_def": "mean(drift_zH[:, -4:])", + "otsu_tau_log10": 0.766353451713377, + "frac_converged": 0.508544921875, + "cells": { + "A_conv_correct": { + "n": 4103, + "lam1_median": -0.8617429137229919, + "lam1_mean": -0.8564737750180547, + "lam1_iqr": [ + -0.9020151197910309, + -0.815339982509613 + ], + "lam8_median": -0.9746241569519043, + "spectrum_median": [ + -0.8617429137229919, + -0.8905690908432007, + -0.9112593531608582, + -0.9292124509811401, + -0.9435566663742065, + -0.9553980827331543, + -0.9662551879882812, + -0.9746241569519043 + ], + "token_acc_median": 1.0, + "halted_at_median": 4.0 + }, + "B_conv_wrong": { + "n": 63, + "lam1_median": -0.5912281274795532, + "lam1_mean": -0.6317614662268806, + "lam1_iqr": [ + -0.809065192937851, + -0.5061159133911133 + ], + "lam8_median": -0.7379652261734009, + "spectrum_median": [ + -0.5912281274795532, + -0.6236273050308228, + -0.6583524942398071, + -0.681266725063324, + -0.6938952207565308, + -0.7069988250732422, + -0.7344604730606079, + -0.7379652261734009 + ], + "token_acc_median": 0.6296296119689941, + "halted_at_median": 0.0 + }, + "C_nonconv_correct": { + "n": 195, + "lam1_median": -0.694297194480896, + "lam1_mean": -0.6856319598662548, + "lam1_iqr": [ + -0.7354169189929962, + -0.6466438174247742 + ], + "lam8_median": -0.7886528372764587, + "spectrum_median": [ + -0.694297194480896, + -0.7198933362960815, + -0.7372102737426758, + -0.7535885572433472, + -0.7657074928283691, + -0.7715225219726562, + -0.7826006412506104, + -0.7886528372764587 + ], + "token_acc_median": 1.0, + "halted_at_median": 14.0 + }, + 
"D_nonconv_wrong": { + "n": 3831, + "lam1_median": -0.5997700691223145, + "lam1_mean": -0.5970111100099877, + "lam1_iqr": [ + -0.64748615026474, + -0.548848420381546 + ], + "lam8_median": -0.7142836451530457, + "spectrum_median": [ + -0.5997700691223145, + -0.634728193283081, + -0.6558717489242554, + -0.6714187860488892, + -0.6842536330223083, + -0.6956309676170349, + -0.7056014537811279, + -0.7142836451530457 + ], + "token_acc_median": 0.6296296119689941, + "halted_at_median": 0.0 + } + }, + "mixture": { + "wrong_that_converged": 0.016178736517719568, + "correct_that_nonconverged": 0.04536993950674732 + }, + "contrasts": { + "dlam1_correct_minus_wrong_overall": -0.2586202025413513, + "dlam1_within_converged": -0.2705147862434387, + "dlam1_within_nonconverged": -0.09452712535858154, + "dlam1_wrong_conv_minus_wrong_nonconv": 0.00854194164276123 + }, + "auc": { + "neg_lam1_predicts_correct_overall": 0.9841127835524125, + "neg_lam1_predicts_correct_within_conv": 0.8520014391328065, + "neg_lam1_predicts_correct_within_nonconv": 0.8180149790173282, + "neg_logdrift_predicts_correct": 0.9639893544685683, + "neg_lam1_predicts_converged": 0.9859867295607834 + }, + "aux": { + "late_drift_zL_corr_with_zH_log": 0.9985142270869076, + "q_halt_final_median_by_cell": { + "A_conv_correct": 7.46875, + "B_conv_wrong": -8.6875, + "C_nonconv_correct": 7.46875, + "D_nonconv_wrong": -9.625 + } + } +}
\ No newline at end of file diff --git a/research/flossing/analysis_2x2/results_trm_official58590_n2048.json b/research/flossing/analysis_2x2/results_trm_official58590_n2048.json new file mode 100644 index 0000000..bc000b2 --- /dev/null +++ b/research/flossing/analysis_2x2/results_trm_official58590_n2048.json @@ -0,0 +1,105 @@ +{ + "npz": "retest/trm_gbs768_step58590_full_n2048.npz", + "n": 2048, + "exact_acc": 0.8759765625, + "late_drift_def": "mean(drift_zH[:, -4:])", + "otsu_tau_log10": 1.3628430376123835, + "frac_converged": 0.841796875, + "cells": { + "A_conv_correct": { + "n": 1724, + "lam1_median": 0.011281780432909727, + "lam1_mean": 0.015547435924631226, + "lam1_iqr": [ + 0.008467044215649366, + 0.01652396097779274 + ], + "lam8_median": 0.0020897792419418693, + "spectrum_median": [ + 0.011281780432909727, + 0.008820333052426577, + 0.007248040288686752, + 0.0058647627010941505, + 0.004666940309107304, + 0.003820874495431781, + 0.0028478524181991816, + 0.0020897792419418693 + ], + "token_acc_median": 1.0, + "halted_at_median": 2.0 + }, + "B_conv_wrong": { + "n": 0 + }, + "C_nonconv_correct": { + "n": 70, + "lam1_median": 0.0819745659828186, + "lam1_mean": 0.07783016583749226, + "lam1_iqr": [ + 0.06863085739314556, + 0.09041773155331612 + ], + "lam8_median": 0.008648575283586979, + "spectrum_median": [ + 0.0819745659828186, + 0.052037810906767845, + 0.039651695638895035, + 0.028503953479230404, + 0.023648646660149097, + 0.016836480237543583, + 0.011321825440973043, + 0.008648575283586979 + ], + "token_acc_median": 1.0, + "halted_at_median": 14.0 + }, + "D_nonconv_wrong": { + "n": 254, + "lam1_median": 0.10287182405591011, + "lam1_mean": 0.10287220525283983, + "lam1_iqr": [ + 0.09241268038749695, + 0.11242476664483547 + ], + "lam8_median": 0.01879305113106966, + "spectrum_median": [ + 0.10287182405591011, + 0.0741146095097065, + 0.05705060437321663, + 0.045855674892663956, + 0.03607587888836861, + 0.029474849812686443, + 0.023949903436005116, + 
0.01879305113106966 + ], + "token_acc_median": 0.6296296119689941, + "halted_at_median": 0.0 + } + }, + "mixture": { + "wrong_that_converged": 0.0, + "correct_that_nonconverged": 0.03901895206243032 + }, + "contrasts": { + "dlam1_correct_minus_wrong_overall": -0.09140016930177808, + "dlam1_within_converged": NaN, + "dlam1_within_nonconverged": -0.020897258073091507, + "dlam1_wrong_conv_minus_wrong_nonconv": NaN + }, + "auc": { + "neg_lam1_predicts_correct_overall": 0.9934800165029538, + "neg_lam1_predicts_correct_within_conv": NaN, + "neg_lam1_predicts_correct_within_nonconv": 0.8484814398200224, + "neg_logdrift_predicts_correct": 0.9719340057409211, + "neg_lam1_predicts_converged": 0.9923716736845121 + }, + "aux": { + "late_drift_zL_corr_with_zH_log": 0.9854614526028403, + "q_halt_final_median_by_cell": { + "A_conv_correct": 7.78125, + "B_conv_wrong": NaN, + "C_nonconv_correct": 7.71875, + "D_nonconv_wrong": -11.125 + } + } +}
\ No newline at end of file diff --git a/research/flossing/analysis_2x2/results_trm_official_gbs768_step58590_n512.json b/research/flossing/analysis_2x2/results_trm_official_gbs768_step58590_n512.json new file mode 100644 index 0000000..460e407 --- /dev/null +++ b/research/flossing/analysis_2x2/results_trm_official_gbs768_step58590_n512.json @@ -0,0 +1,105 @@ +{ + "npz": "../official_gbs768_spectrum/trm_gbs768_base_step58590_n512_k8_seed20260602.npz", + "n": 512, + "exact_acc": 0.875, + "late_drift_def": "mean(drift_zH[:, -4:])", + "otsu_tau_log10": 1.364234039179148, + "frac_converged": 0.84765625, + "cells": { + "A_conv_correct": { + "n": 434, + "lam1_median": 0.011052772868424654, + "lam1_mean": 0.015589851711679268, + "lam1_iqr": [ + 0.008054324658587575, + 0.017474188469350338 + ], + "lam8_median": 0.0021472229855135083, + "spectrum_median": [ + 0.011052772868424654, + 0.008679452817887068, + 0.007291494170203805, + 0.005999391432851553, + 0.004774029366672039, + 0.0037459908053278923, + 0.0029579862020909786, + 0.0021472229855135083 + ], + "token_acc_median": 1.0, + "halted_at_median": 2.0 + }, + "B_conv_wrong": { + "n": 0 + }, + "C_nonconv_correct": { + "n": 14, + "lam1_median": 0.08274861052632332, + "lam1_mean": 0.0802882562524506, + "lam1_iqr": [ + 0.07004822231829166, + 0.09586159884929657 + ], + "lam8_median": 0.00618100818246603, + "spectrum_median": [ + 0.08274861052632332, + 0.05459940806031227, + 0.039645833894610405, + 0.026364507153630257, + 0.020821912214159966, + 0.015413322485983372, + 0.010960425715893507, + 0.00618100818246603 + ], + "token_acc_median": 1.0, + "halted_at_median": 14.0 + }, + "D_nonconv_wrong": { + "n": 64, + "lam1_median": 0.10339764878153801, + "lam1_mean": 0.10259491263423115, + "lam1_iqr": [ + 0.09241705574095249, + 0.1118917465209961 + ], + "lam8_median": 0.02135413885116577, + "spectrum_median": [ + 0.10339764878153801, + 0.07536708191037178, + 0.057968027889728546, + 0.0463954322040081, + 0.03947761468589306, + 
0.03304685465991497, + 0.026251215487718582, + 0.02135413885116577 + ], + "token_acc_median": 0.6296296119689941, + "halted_at_median": 0.0 + } + }, + "mixture": { + "wrong_that_converged": 0.0, + "correct_that_nonconverged": 0.03125 + }, + "contrasts": { + "dlam1_correct_minus_wrong_overall": -0.09217484341934323, + "dlam1_within_converged": NaN, + "dlam1_within_nonconverged": -0.02064903825521469, + "dlam1_wrong_conv_minus_wrong_nonconv": NaN + }, + "auc": { + "neg_lam1_predicts_correct_overall": 0.99267578125, + "neg_lam1_predicts_correct_within_conv": NaN, + "neg_lam1_predicts_correct_within_nonconv": 0.7667410714285714, + "neg_logdrift_predicts_correct": 0.9767020089285714, + "neg_lam1_predicts_converged": 0.9906947890818859 + }, + "aux": { + "late_drift_zL_corr_with_zH_log": 0.983883308834529, + "q_halt_final_median_by_cell": { + "A_conv_correct": 7.78125, + "B_conv_wrong": NaN, + "C_nonconv_correct": 7.71875, + "D_nonconv_wrong": -11.125 + } + } +}
\ No newline at end of file diff --git a/research/flossing/analysis_2x2/results_trm_singleGPU_step130205_n512.json b/research/flossing/analysis_2x2/results_trm_singleGPU_step130205_n512.json new file mode 100644 index 0000000..9f73dff --- /dev/null +++ b/research/flossing/analysis_2x2/results_trm_singleGPU_step130205_n512.json @@ -0,0 +1,105 @@ +{ + "npz": "/home/yurenh2/rrm/research/flossing/diag_trm_singleGPU_step130205_512.npz", + "n": 512, + "exact_acc": 0.755859375, + "late_drift_def": "mean(drift_zH[:, -4:])", + "otsu_tau_log10": 1.2130658229840834, + "frac_converged": 0.712890625, + "cells": { + "A_conv_correct": { + "n": 365, + "lam1_median": 0.004312093835324049, + "lam1_mean": 0.00919173876134739, + "lam1_iqr": [ + -0.0006733923801220953, + 0.01406068168580532 + ], + "lam8_median": -0.008585464209318161, + "spectrum_median": [ + 0.004312093835324049, + -0.0004256218671798706, + -0.002575409132987261, + -0.0042461007833480835, + -0.005322618409991264, + -0.006433633156120777, + -0.007657818961888552, + -0.008585464209318161 + ], + "token_acc_median": 1.0, + "halted_at_median": 3.0 + }, + "B_conv_wrong": { + "n": 0 + }, + "C_nonconv_correct": { + "n": 22, + "lam1_median": 0.08519911020994186, + "lam1_mean": 0.08400513193654743, + "lam1_iqr": [ + 0.06937860697507858, + 0.09877191670238972 + ], + "lam8_median": 0.015529214404523373, + "spectrum_median": [ + 0.08519911020994186, + 0.062445785850286484, + 0.04946317337453365, + 0.03590788505971432, + 0.030925835482776165, + 0.025972959585487843, + 0.017265344969928265, + 0.015529214404523373 + ], + "token_acc_median": 1.0, + "halted_at_median": 15.0 + }, + "D_nonconv_wrong": { + "n": 125, + "lam1_median": 0.10669758170843124, + "lam1_mean": 0.10679112273454666, + "lam1_iqr": [ + 0.09952236711978912, + 0.1125824898481369 + ], + "lam8_median": 0.0328078418970108, + "spectrum_median": [ + 0.10669758170843124, + 0.0822329968214035, + 0.06821739673614502, + 0.058041222393512726, + 0.04859192296862602, + 
0.04428848251700401, + 0.037217091768980026, + 0.0328078418970108 + ], + "token_acc_median": 0.6296296119689941, + "halted_at_median": 0.0 + } + }, + "mixture": { + "wrong_that_converged": 0.0, + "correct_that_nonconverged": 0.056847545219638244 + }, + "contrasts": { + "dlam1_correct_minus_wrong_overall": -0.1017410927452147, + "dlam1_within_converged": NaN, + "dlam1_within_nonconverged": -0.02149847149848938, + "dlam1_wrong_conv_minus_wrong_nonconv": NaN + }, + "auc": { + "neg_lam1_predicts_correct_overall": 0.988857881136951, + "neg_lam1_predicts_correct_within_conv": NaN, + "neg_lam1_predicts_correct_within_nonconv": 0.8047272727272727, + "neg_logdrift_predicts_correct": 0.957250645994832, + "neg_lam1_predicts_converged": 0.9961047432671699 + }, + "aux": { + "late_drift_zL_corr_with_zH_log": 0.9640733496154094, + "q_halt_final_median_by_cell": { + "A_conv_correct": 7.0, + "B_conv_wrong": NaN, + "C_nonconv_correct": 7.390625, + "D_nonconv_wrong": -11.125 + } + } +}
\ No newline at end of file diff --git a/research/flossing/analysis_2x2/results_trm_singleGPU_step260410_n512.json b/research/flossing/analysis_2x2/results_trm_singleGPU_step260410_n512.json new file mode 100644 index 0000000..3b11879 --- /dev/null +++ b/research/flossing/analysis_2x2/results_trm_singleGPU_step260410_n512.json @@ -0,0 +1,105 @@ +{ + "npz": "/home/yurenh2/rrm/research/flossing/diag_trm_singleGPU_step260410_512.npz", + "n": 512, + "exact_acc": 0.76953125, + "late_drift_def": "mean(drift_zH[:, -4:])", + "otsu_tau_log10": 1.406412158338842, + "frac_converged": 0.748046875, + "cells": { + "A_conv_correct": { + "n": 383, + "lam1_median": 0.004744160454720259, + "lam1_mean": 0.01041777972714804, + "lam1_iqr": [ + 0.0007444327347911894, + 0.013659899588674307 + ], + "lam8_median": -0.004219561815261841, + "spectrum_median": [ + 0.004744160454720259, + 0.0016308835474774241, + 0.00021728611318394542, + -0.0006795059307478368, + -0.001709386589936912, + -0.002729899948462844, + -0.0034774676896631718, + -0.004219561815261841 + ], + "token_acc_median": 1.0, + "halted_at_median": 2.0 + }, + "B_conv_wrong": { + "n": 0 + }, + "C_nonconv_correct": { + "n": 11, + "lam1_median": 0.09982780367136002, + "lam1_mean": 0.08986719998277047, + "lam1_iqr": [ + 0.09067171066999435, + 0.10350960865616798 + ], + "lam8_median": 0.013320312835276127, + "spectrum_median": [ + 0.09982780367136002, + 0.06963343918323517, + 0.051610760390758514, + 0.03814869746565819, + 0.022425860166549683, + 0.01762143149971962, + 0.012642345391213894, + 0.013320312835276127 + ], + "token_acc_median": 1.0, + "halted_at_median": 15.0 + }, + "D_nonconv_wrong": { + "n": 118, + "lam1_median": 0.10232088342308998, + "lam1_mean": 0.10288177159125522, + "lam1_iqr": [ + 0.09352006390690804, + 0.11121566034853458 + ], + "lam8_median": 0.021454622969031334, + "spectrum_median": [ + 0.10232088342308998, + 0.07419029250741005, + 0.05829045735299587, + 0.047421256080269814, + 0.040371473878622055, + 
0.032383667305111885, + 0.027303663082420826, + 0.021454622969031334 + ], + "token_acc_median": 0.6419752836227417, + "halted_at_median": 0.0 + } + }, + "mixture": { + "wrong_that_converged": 0.0, + "correct_that_nonconverged": 0.027918781725888325 + }, + "contrasts": { + "dlam1_correct_minus_wrong_overall": -0.09725644066929817, + "dlam1_within_converged": NaN, + "dlam1_within_nonconverged": -0.002493079751729965, + "dlam1_wrong_conv_minus_wrong_nonconv": NaN + }, + "auc": { + "neg_lam1_predicts_correct_overall": 0.9893530069689409, + "neg_lam1_predicts_correct_within_conv": NaN, + "neg_lam1_predicts_correct_within_nonconv": 0.6186440677966102, + "neg_logdrift_predicts_correct": 0.9747913619547449, + "neg_lam1_predicts_converged": 0.9955269496225231 + }, + "aux": { + "late_drift_zL_corr_with_zH_log": 0.9724686803222311, + "q_halt_final_median_by_cell": { + "A_conv_correct": 7.4375, + "B_conv_wrong": NaN, + "C_nonconv_correct": 7.40625, + "D_nonconv_wrong": -11.125 + } + } +}
\ No newline at end of file diff --git a/research/flossing/analysis_2x2/results_trm_step13020_n512.json b/research/flossing/analysis_2x2/results_trm_step13020_n512.json new file mode 100644 index 0000000..ae4af7d --- /dev/null +++ b/research/flossing/analysis_2x2/results_trm_step13020_n512.json @@ -0,0 +1,124 @@ +{ + "npz": "/home/yurenh2/rrm/research/flossing/diag_trm_step13020_512.npz", + "n": 512, + "exact_acc": 0.595703125, + "late_drift_def": "mean(drift_zH[:, -4:])", + "otsu_tau_log10": 0.7298937072389344, + "frac_converged": 0.59765625, + "cells": { + "A_conv_correct": { + "n": 296, + "lam1_median": -0.004329816671088338, + "lam1_mean": -0.002230541864829885, + "lam1_iqr": [ + -0.011287129484117031, + 0.00519718776922673 + ], + "lam8_median": -0.03249448351562023, + "spectrum_median": [ + -0.004329816671088338, + -0.016021928749978542, + -0.021661318838596344, + -0.02548989001661539, + -0.02795236185193062, + -0.03012390434741974, + -0.03131883405148983, + -0.03249448351562023 + ], + "token_acc_median": 1.0, + "halted_at_median": 3.0 + }, + "B_conv_wrong": { + "n": 10, + "lam1_median": -0.0027067698538303375, + "lam1_mean": -0.003216998477000743, + "lam1_iqr": [ + -0.008983954321593046, + 0.0009138719469774514 + ], + "lam8_median": -0.024164372123777866, + "spectrum_median": [ + -0.0027067698538303375, + -0.009424471529200673, + -0.014679631683975458, + -0.014942141249775887, + -0.017778784036636353, + -0.01962883025407791, + -0.020820041187107563, + -0.024164372123777866 + ], + "token_acc_median": 0.5555555820465088, + "halted_at_median": 0.0 + }, + "C_nonconv_correct": { + "n": 9, + "lam1_median": 0.04190123826265335, + "lam1_mean": 0.04164694120279617, + "lam1_iqr": [ + 0.03228544816374779, + 0.047971080988645554 + ], + "lam8_median": -0.018014254048466682, + "spectrum_median": [ + 0.04190123826265335, + 0.022236861288547516, + 0.008249719627201557, + 0.006971194874495268, + -0.008609719574451447, + -0.007754913531243801, + -0.010403371416032314, + 
-0.018014254048466682 + ], + "token_acc_median": 1.0, + "halted_at_median": 14.0 + }, + "D_nonconv_wrong": { + "n": 197, + "lam1_median": 0.03166300058364868, + "lam1_mean": 0.030858048445575867, + "lam1_iqr": [ + 0.01434984989464283, + 0.04617198556661606 + ], + "lam8_median": -0.012672499753534794, + "spectrum_median": [ + 0.03166300058364868, + 0.01676548644900322, + 0.007448865100741386, + 0.0023278011940419674, + -0.0027259995695203543, + -0.007005904335528612, + -0.009610905312001705, + -0.012672499753534794 + ], + "token_acc_median": 0.6419752836227417, + "halted_at_median": 0.0 + } + }, + "mixture": { + "wrong_that_converged": 0.04830917874396135, + "correct_that_nonconverged": 0.029508196721311476 + }, + "contrasts": { + "dlam1_correct_minus_wrong_overall": -0.03387995809316635, + "dlam1_within_converged": -0.0016230468172580004, + "dlam1_within_nonconverged": 0.01023823767900467, + "dlam1_wrong_conv_minus_wrong_nonconv": -0.03436977043747902 + }, + "auc": { + "neg_lam1_predicts_correct_overall": 0.8491644887938544, + "neg_lam1_predicts_correct_within_conv": 0.5148648648648648, + "neg_lam1_predicts_correct_within_nonconv": 0.3598420755781162, + "neg_logdrift_predicts_correct": 0.9758295715530213, + "neg_lam1_predicts_converged": 0.8866996636842439 + }, + "aux": { + "late_drift_zL_corr_with_zH_log": 0.9964963268090318, + "q_halt_final_median_by_cell": { + "A_conv_correct": 7.25, + "B_conv_wrong": -10.71875, + "C_nonconv_correct": 7.25, + "D_nonconv_wrong": -10.4375 + } + } +}
\ No newline at end of file diff --git a/research/flossing/analysis_2x2/run_phase1_queue.sh b/research/flossing/analysis_2x2/run_phase1_queue.sh new file mode 100755 index 0000000..989c7ee --- /dev/null +++ b/research/flossing/analysis_2x2/run_phase1_queue.sh @@ -0,0 +1,92 @@ +#!/usr/bin/env bash +# Phase-1 queue (experiment_framework.md): E5 horizon sweeps, E2 run-level replication, +# E6 matched-objective step9 pairs. Waits for a free GPU (12h fallback), runs sequentially. +set -o pipefail + +cd /home/yurenh2/rrm/research/flossing +source /home/yurenh2/miniconda3/etc/profile.d/conda.sh +conda activate rrm + +OUTDIR=analysis_2x2/phase1 +mkdir -p "$OUTDIR" +STATUS="$OUTDIR/queue_status.log" +TRM_OFF="/home/yurenh2/rrm/trm/checkpoints/Sudoku-extreme-1k-aug-1000-ACT-torch/pretrain_mlp_t_sudoku_official_gbs768_repro" +TRM_SGL="/home/yurenh2/rrm/trm/checkpoints/Sudoku-extreme-1k-aug-1000-ACT-torch/pretrain_mlp_t_sudoku_singleGPU" +HRM_ROOT="/home/yurenh2/rrm/hrm/checkpoints/Sudoku-extreme-1k-aug-1000 ACT-torch/HierarchicalReasoningModel_ACTV1 righteous-python" +S9=/home/yurenh2/rrm/research/flossing + +log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" >> "$STATUS"; } +free_gpu() { + nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits \ + | awk -F', ' '$2<30 && $3<8000 {print $1; exit}' +} + +log "phase-1 queue started (E5 horizon sweeps, E2 step9_E replication, E6 step9 pairs)" +DEADLINE=$(( $(date +%s) + 12*3600 )) +GPU="" +while true; do + g1="$(free_gpu)" + if [[ -n "$g1" ]]; then + sleep 60; g2="$(free_gpu)" + if [[ "$g2" == "$g1" ]]; then GPU="$g1"; break; fi + fi + if (( $(date +%s) > DEADLINE )); then + GPU="$(nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | sort -t, -k2 -n | head -1 | cut -d, -f1)" + log "12h fallback: taking GPU $GPU" + break + fi + sleep 300 +done +log "claimed GPU $GPU" +export CUDA_VISIBLE_DEVICES="$GPU" + +run_job() { # name horizon script args... 
+ local name="$1" hor="$2"; shift 2 + if [[ -f "$OUTDIR/${name}.npz" ]]; then log "skip $name"; return 0; fi + log "start $name" + if DIAG_HORIZON="$hor" python "$@" --out "$OUTDIR/${name}.npz" > "$OUTDIR/${name}.log" 2>&1; then + log "done $name" + else + log "FAILED $name" + fi +} + +# --- E5: TRM horizon sweep (h=4 already exists in retest/) --- +for H in 2 6 8 10 12; do + run_job "trm_official58590_h${H}_n2048" "$H" diagnose_trm_joint_horizon.py \ + --ckpt-root "$TRM_OFF" --ckpt-name step_58590 --n-samples 2048 --batch-size 16 \ + --k-lyap 8 --t-ons 1 --seed 0 +done + +# --- E5: HRM horizon sweep --- +for H in 2 6 8 10 12; do + run_job "hrm26040_h${H}_n2048" "$H" diagnose_hrm_joint_horizon.py \ + --ckpt-root "$HRM_ROOT" --ckpt-name step_26040 --n-samples 2048 --batch-size 32 \ + --k-lyap 8 --t-ons 1 --seed 0 +done + +# --- E2: HRM second training run (step9_E fixed-unroll baseline), full window --- +run_job "step9E_hrm_best_full_n2048" 16 diagnose_hrm_joint.py \ + --ckpt-root "$HRM_ROOT" --ckpt-name "$S9/step9_E_hrm_baseline_parallel_fixed_26040_50k_ckpts/best.pt" \ + --n-samples 2048 --batch-size 32 --k-lyap 8 --t-ons 1 --seed 0 +run_job "step9E_hrm_final_full_n2048" 16 diagnose_hrm_joint.py \ + --ckpt-root "$HRM_ROOT" --ckpt-name "$S9/step9_E_hrm_baseline_parallel_fixed_26040_50k_ckpts/final.pt" \ + --n-samples 2048 --batch-size 32 --k-lyap 8 --t-ons 1 --seed 0 + +# --- E6: matched-objective pairs (n=512): HRM E vs F, TRM G vs H --- +for CK in step_12500 step_25000 best final; do + run_job "step9E_hrm_${CK}_n512" 16 diagnose_hrm_joint.py \ + --ckpt-root "$HRM_ROOT" --ckpt-name "$S9/step9_E_hrm_baseline_parallel_fixed_26040_50k_ckpts/${CK}.pt" \ + --n-samples 512 --batch-size 32 --k-lyap 8 --t-ons 1 --seed 0 + run_job "step9F_hrm_${CK}_n512" 16 diagnose_hrm_joint.py \ + --ckpt-root "$HRM_ROOT" --ckpt-name "$S9/step9_F_hrm_multi4_loguniform_ramp_26040_50k_ckpts/${CK}.pt" \ + --n-samples 512 --batch-size 32 --k-lyap 8 --t-ons 1 --seed 0 + run_job 
"step9G_trm_${CK}_n512" 16 diagnose_trm_joint.py \ + --ckpt-root "$TRM_SGL" --ckpt-name "$S9/step9_G_trm_baseline_parallel_fixed_26041_batch4_50k_ckpts/${CK}.pt" \ + --n-samples 512 --batch-size 16 --k-lyap 8 --t-ons 1 --seed 0 + run_job "step9H_trm_${CK}_n512" 16 diagnose_trm_joint.py \ + --ckpt-root "$TRM_SGL" --ckpt-name "$S9/step9_H_trm_multi4_loguniform_ramp_26041_batch4_50k_ckpts/${CK}.pt" \ + --n-samples 512 --batch-size 16 --k-lyap 8 --t-ons 1 --seed 0 +done + +log "phase-1 queue finished" diff --git a/research/flossing/analysis_2x2/run_retest_2x2.sh b/research/flossing/analysis_2x2/run_retest_2x2.sh new file mode 100755 index 0000000..9cdcc13 --- /dev/null +++ b/research/flossing/analysis_2x2/run_retest_2x2.sh @@ -0,0 +1,75 @@ +#!/usr/bin/env bash +# Queue: wait for a free GPU, then run the 2x2 re-test diagnostics: +# 1. TRM official_gbs768 @ step_58590 (86.9% ckpt) full-window n=2048 +# 2. TRM official_gbs768 @ step_58590 early-window n=2048 (first 4 ACT steps) +# 3. HRM righteous-python @ step_26040 (joint est.) full-window n=2048 +# 4. HRM righteous-python @ step_26040 early-window n=2048 +# Same --seed 0 and same n across full/short pairs so idx fields pair up. 
+set -o pipefail + +cd /home/yurenh2/rrm/research/flossing +source /home/yurenh2/miniconda3/etc/profile.d/conda.sh +conda activate rrm + +OUTDIR=analysis_2x2/retest +mkdir -p "$OUTDIR" +STATUS="$OUTDIR/queue_status.log" +TRM_ROOT="/home/yurenh2/rrm/trm/checkpoints/Sudoku-extreme-1k-aug-1000-ACT-torch/pretrain_mlp_t_sudoku_official_gbs768_repro" +HRM_ROOT="/home/yurenh2/rrm/hrm/checkpoints/Sudoku-extreme-1k-aug-1000 ACT-torch/HierarchicalReasoningModel_ACTV1 righteous-python" + +log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" >> "$STATUS"; } + +free_gpu() { + # print index of a GPU with util<30% and mem<8GB, else empty + nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits \ + | awk -F', ' '$2<30 && $3<8000 {print $1; exit}' +} + +log "queue started, waiting for a free GPU (util<30%, mem<8GB, two checks 60s apart; 12h fallback)" +DEADLINE=$(( $(date +%s) + 12*3600 )) +GPU="" +while true; do + g1="$(free_gpu)" + if [[ -n "$g1" ]]; then + sleep 60 + g2="$(free_gpu)" + if [[ "$g2" == "$g1" ]]; then GPU="$g1"; break; fi + fi + if (( $(date +%s) > DEADLINE )); then + GPU="$(nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | sort -t, -k2 -n | head -1 | cut -d, -f1)" + log "12h fallback: taking GPU $GPU (most free memory) despite utilization" + break + fi + sleep 300 +done +log "claimed GPU $GPU" +export CUDA_VISIBLE_DEVICES="$GPU" + +run_job() { + local name="$1"; shift + if [[ -f "$OUTDIR/${name}.npz" ]]; then log "skip $name (output exists)"; return 0; fi + log "start $name" + if python "$@" --out "$OUTDIR/${name}.npz" > "$OUTDIR/${name}.log" 2>&1; then + log "done $name" + else + log "FAILED $name (see $OUTDIR/${name}.log)" + fi +} + +run_job trm_gbs768_step58590_full_n2048 \ + diagnose_trm_joint.py --ckpt-root "$TRM_ROOT" --ckpt-name step_58590 \ + --n-samples 2048 --batch-size 16 --k-lyap 8 --t-ons 1 --seed 0 + +run_job trm_gbs768_step58590_short_n2048 \ + diagnose_trm_joint_short.py --ckpt-root "$TRM_ROOT" 
--ckpt-name step_58590 \ + --n-samples 2048 --batch-size 16 --k-lyap 8 --t-ons 1 --seed 0 + +run_job hrm_righteous_step26040_full_n2048 \ + diagnose_hrm_joint.py --ckpt-root "$HRM_ROOT" --ckpt-name step_26040 \ + --n-samples 2048 --batch-size 32 --k-lyap 8 --t-ons 1 --seed 0 + +run_job hrm_righteous_step26040_short_n2048 \ + diagnose_hrm_joint_short.py --ckpt-root "$HRM_ROOT" --ckpt-name step_26040 \ + --n-samples 2048 --batch-size 32 --k-lyap 8 --t-ons 1 --seed 0 + +log "all retest diagnostics finished" diff --git a/research/flossing/analysis_2x2/sweep_hrm26040_joint_n2048.csv b/research/flossing/analysis_2x2/sweep_hrm26040_joint_n2048.csv new file mode 100644 index 0000000..c4a8b6c --- /dev/null +++ b/research/flossing/analysis_2x2/sweep_hrm26040_joint_n2048.csv @@ -0,0 +1,20 @@ +pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D +5,-0.011992306107950053,102,1,975,970,-0.22750887274742126,0.03138776309788227 +10,-0.01058249634236916,203,2,874,969,-0.15784957632422447,0.0314599983394146 +15,-0.009196568905867631,306,2,771,969,-0.15784957632422447,0.0314599983394146 +20,-0.008030915298460381,408,2,669,969,-0.15784957632422447,0.0314599983394146 +25,-0.006957811036151525,509,3,568,968,-0.10465513914823532,0.03146406076848507 +30,-0.006023327926233211,612,3,465,968,-0.10465513914823532,0.03146406076848507 +35,-0.004937302122343026,714,3,363,968,-0.10465513914823532,0.03146406076848507 +40,-0.0033157474065108714,814,5,263,966,-0.14139553904533386,0.03153638541698456 +45,-0.0012688460769353994,917,5,160,966,-0.14139553904533386,0.03153638541698456 +50,0.5973517463361264,1013,11,64,960,-0.0452079139649868,0.03175541199743748 +55,1.3993508404545523,1034,92,43,879,0.033553484827280045,0.031315527856349945 +60,1.4994241881984267,1036,193,41,778,0.029089782387018204,0.03191038407385349 +65,1.547490873090265,1038,293,39,678,0.02823900803923607,0.032731860876083374 +70,1.5799635318056726,1040,393,37,578,0.029996931552886963,0.03191038407385349 
+75,1.6038922156413973,1040,496,37,475,0.029781382530927658,0.03269977867603302 +80,1.6256149248592253,1040,598,37,373,0.03138776309788227,0.030956093221902847 +85,1.6518692868467673,1042,698,35,273,0.02991918846964836,0.03507234901189804 +90,1.677233478220475,1044,799,33,172,0.028961632400751114,0.044597823172807693 +95,1.708463281897988,1049,896,28,75,0.029092321172356606,0.052752863615751266 diff --git a/research/flossing/analysis_2x2/sweep_hrm26040_n8192.csv b/research/flossing/analysis_2x2/sweep_hrm26040_n8192.csv new file mode 100644 index 0000000..b2f9f82 --- /dev/null +++ b/research/flossing/analysis_2x2/sweep_hrm26040_n8192.csv @@ -0,0 +1,20 @@ +pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D +5,-0.024403165931412795,410,0,3888,3894,nan,-0.5996554493904114 +10,-0.02256924601928628,818,2,3480,3892,-0.8345716893672943,-0.5995031297206879 +15,-0.021242109610698556,1225,4,3073,3890,-0.8419434428215027,-0.5994212925434113 +20,-0.020075217197184483,1634,5,2664,3889,-0.8429288864135742,-0.599414587020874 +25,-0.019023493082334777,2039,9,2259,3885,-0.8409579992294312,-0.5993443727493286 +30,-0.017916215347934226,2444,14,1854,3880,-0.8415209650993347,-0.599242627620697 +35,-0.016770285088033448,2852,15,1446,3879,-0.8420839309692383,-0.5992319583892822 +40,-0.015257325632230992,3260,17,1038,3877,-0.8420839309692383,-0.599224328994751 +45,-0.012884453933432984,3665,21,633,3873,-0.8420839309692383,-0.5990880727767944 +50,0.371726245629762,4067,29,231,3865,-0.8250086307525635,-0.5989847183227539 +55,1.3871392863925978,4136,370,162,3524,-0.5518185496330261,-0.6034163236618042 +60,1.4972477780890994,4138,777,160,3117,-0.5671209096908569,-0.6080851554870605 +65,1.5481401201578837,4144,1181,154,2713,-0.5731313228607178,-0.6117637753486633 +70,1.5835620683527638,4147,1587,151,2307,-0.5808747410774231,-0.6136600971221924 +75,1.6110709254128324,4153,1991,145,1903,-0.584848165512085,-0.616129994392395 +80,1.6353553466380342,4156,2397,142,1497,-0.5879771709442139,-0.6185622215270996 
+85,1.6574195825233073,4161,2802,137,1092,-0.5930990278720856,-0.6169184744358063 +90,1.6817109395003784,4167,3205,131,689,-0.5954653024673462,-0.619819164276123 +95,1.7160308725239535,4184,3598,114,296,-0.5972362160682678,-0.6263471245765686 diff --git a/research/flossing/analysis_2x2/sweep_trm_official58590_n2048.csv b/research/flossing/analysis_2x2/sweep_trm_official58590_n2048.csv new file mode 100644 index 0000000..747eb40 --- /dev/null +++ b/research/flossing/analysis_2x2/sweep_trm_official58590_n2048.csv @@ -0,0 +1,20 @@ +pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D +5,0.6425822518044539,103,0,1691,254,nan,0.10287182405591011 +10,0.7090247691558889,205,0,1589,254,nan,0.10287182405591011 +15,0.7579128424918095,308,0,1486,254,nan,0.10287182405591011 +20,0.7959777945095498,410,0,1384,254,nan,0.10287182405591011 +25,0.8365441996415908,512,0,1282,254,nan,0.10287182405591011 +30,0.8629687925201839,615,0,1179,254,nan,0.10287182405591011 +35,0.8888206300392528,717,0,1077,254,nan,0.10287182405591011 +40,0.9161186566022281,819,0,975,254,nan,0.10287182405591011 +45,0.9434356854634222,922,0,872,254,nan,0.10287182405591011 +50,0.9675603217529236,1024,0,770,254,nan,0.10287182405591011 +55,0.9912876432316691,1126,0,668,254,nan,0.10287182405591011 +60,1.01543024505144,1229,0,565,254,nan,0.10287182405591011 +65,1.0446947959840196,1331,0,463,254,nan,0.10287182405591011 +70,1.0695176837117564,1433,0,361,254,nan,0.10287182405591011 +75,1.100648956309921,1536,0,258,254,nan,0.10287182405591011 +80,1.2184501775276642,1638,0,156,254,nan,0.10287182405591011 +85,1.696239099229016,1733,7,61,247,0.1096716970205307,0.10244401544332504 +90,1.794792940908479,1740,103,54,151,0.10650952905416489,0.10119745135307312 +95,1.8549837815942356,1749,196,45,58,0.10514867305755615,0.09396017715334892 diff --git a/research/flossing/analysis_2x2/sweep_trm_official_gbs768_step58590_n512.csv b/research/flossing/analysis_2x2/sweep_trm_official_gbs768_step58590_n512.csv new file mode 100644 index 
0000000..011c7cc --- /dev/null +++ b/research/flossing/analysis_2x2/sweep_trm_official_gbs768_step58590_n512.csv @@ -0,0 +1,20 @@ +pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D +5,0.6520149305579591,26,0,422,64,nan,0.10339764878153801 +10,0.7055634417495923,52,0,396,64,nan,0.10339764878153801 +15,0.7503615854928217,77,0,371,64,nan,0.10339764878153801 +20,0.8005556182318934,103,0,345,64,nan,0.10339764878153801 +25,0.840565017355408,128,0,320,64,nan,0.10339764878153801 +30,0.8627818507451074,154,0,294,64,nan,0.10339764878153801 +35,0.8956748981953134,179,0,269,64,nan,0.10339764878153801 +40,0.923274365661878,205,0,243,64,nan,0.10339764878153801 +45,0.9484961694413702,230,0,218,64,nan,0.10339764878153801 +50,0.9733951084873498,256,0,192,64,nan,0.10339764878153801 +55,1.0041568660376061,282,0,166,64,nan,0.10339764878153801 +60,1.02397895505604,307,0,141,64,nan,0.10339764878153801 +65,1.04513267210118,333,0,115,64,nan,0.10339764878153801 +70,1.0684270138688392,358,0,90,64,nan,0.10339764878153801 +75,1.097585285269502,384,0,64,64,nan,0.10339764878153801 +80,1.1958246561577712,409,0,39,64,nan,0.10339764878153801 +85,1.3747578433518455,435,0,13,64,nan,0.10339764878153801 +90,1.785234311475947,437,23,11,41,0.10155700892210007,0.10475098341703415 +95,1.8427352264688373,438,48,10,16,0.10240473970770836,0.10433581843972206 diff --git a/research/flossing/analysis_2x2/sweep_trm_singleGPU_step130205_n512.csv b/research/flossing/analysis_2x2/sweep_trm_singleGPU_step130205_n512.csv new file mode 100644 index 0000000..2c44154 --- /dev/null +++ b/research/flossing/analysis_2x2/sweep_trm_singleGPU_step130205_n512.csv @@ -0,0 +1,20 @@ +pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D +5,0.3525141817739674,26,0,361,125,nan,0.10669758170843124 +10,0.5997049779303102,52,0,335,125,nan,0.10669758170843124 +15,0.7276586924161174,77,0,310,125,nan,0.10669758170843124 +20,0.745895549719167,103,0,284,125,nan,0.10669758170843124 +25,0.7574238256855987,128,0,259,125,nan,0.10669758170843124 
+30,0.7756026094327713,154,0,233,125,nan,0.10669758170843124 +35,0.7948198738771284,179,0,208,125,nan,0.10669758170843124 +40,0.8150679269326413,205,0,182,125,nan,0.10669758170843124 +45,0.8328527737072317,230,0,157,125,nan,0.10669758170843124 +50,0.8680717056251492,256,0,131,125,nan,0.10669758170843124 +55,0.8878173590916981,282,0,105,125,nan,0.10669758170843124 +60,0.9114474129601072,307,0,80,125,nan,0.10669758170843124 +65,0.9357973777374493,333,0,54,125,nan,0.10669758170843124 +70,0.9914630324419641,358,0,29,125,nan,0.10669758170843124 +75,1.6265775245886065,370,14,17,111,0.10315298289060593,0.10740122199058533 +80,1.6638789305051203,370,39,17,86,0.10702066123485565,0.10642785206437111 +85,1.690579189299017,370,65,17,60,0.10719160735607147,0.10596203431487083 +90,1.722529342618924,370,90,17,35,0.10658939182758331,0.10669758170843124 +95,1.7704888842696966,372,114,15,11,0.10589195787906647,0.10992329567670822 diff --git a/research/flossing/analysis_2x2/sweep_trm_singleGPU_step260410_n512.csv b/research/flossing/analysis_2x2/sweep_trm_singleGPU_step260410_n512.csv new file mode 100644 index 0000000..fac3a6d --- /dev/null +++ b/research/flossing/analysis_2x2/sweep_trm_singleGPU_step260410_n512.csv @@ -0,0 +1,20 @@ +pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D +5,1.1904157471082346,26,0,368,118,nan,0.10232088342308998 +10,1.205761365098233,52,0,342,118,nan,0.10232088342308998 +15,1.220964285676207,77,0,317,118,nan,0.10232088342308998 +20,1.2340382280630668,103,0,291,118,nan,0.10232088342308998 +25,1.2434017565460773,128,0,266,118,nan,0.10232088342308998 +30,1.253923051119596,154,0,240,118,nan,0.10232088342308998 +35,1.260020693170751,179,0,215,118,nan,0.10232088342308998 +40,1.2729663625028254,205,0,189,118,nan,0.10232088342308998 +45,1.286432864193943,230,0,164,118,nan,0.10232088342308998 +50,1.29425001787462,256,0,138,118,nan,0.10232088342308998 +55,1.3019951688280547,282,0,112,118,nan,0.10232088342308998 +60,1.3106991625583733,307,0,87,118,nan,0.10232088342308998 
+65,1.323434760415417,333,0,61,118,nan,0.10232088342308998 +70,1.3423083391290647,358,0,36,118,nan,0.10232088342308998 +75,1.4595699782590752,384,0,10,118,nan,0.10232088342308998 +80,1.6965097149553376,384,25,10,93,0.10151073336601257,0.10370030999183655 +85,1.7416635003129175,384,51,10,67,0.1015518382191658,0.10389910638332367 +90,1.7702691318062869,384,76,10,42,0.10161199420690536,0.10456060990691185 +95,1.7933802257282094,384,102,10,16,0.10197825729846954,0.10976629704236984 diff --git a/research/flossing/analysis_2x2/sweep_trm_step13020_n512.csv b/research/flossing/analysis_2x2/sweep_trm_step13020_n512.csv new file mode 100644 index 0000000..4dba5eb --- /dev/null +++ b/research/flossing/analysis_2x2/sweep_trm_step13020_n512.csv @@ -0,0 +1,20 @@ +pct,tau,nA,nB,nC,nD,lam1_med_B,lam1_med_D +5,0.1461886368890553,26,0,279,207,nan,0.02986210025846958 +10,0.1484773952154017,52,0,253,207,nan,0.02986210025846958 +15,0.14987173414316576,77,0,228,207,nan,0.02986210025846958 +20,0.15185786893932013,103,0,202,207,nan,0.02986210025846958 +25,0.15372304028625772,128,0,177,207,nan,0.02986210025846958 +30,0.15513756643739401,154,0,151,207,nan,0.02986210025846958 +35,0.15748307719056248,179,0,126,207,nan,0.02986210025846958 +40,0.16089218583944756,205,0,100,207,nan,0.02986210025846958 +45,0.16811494142658162,230,0,75,207,nan,0.02986210025846958 +50,0.21289853477178294,256,0,49,207,nan,0.02986210025846958 +55,0.31298600561046025,281,1,24,206,-0.014628570526838303,0.030006153509020805 +60,0.8658403204807424,296,11,9,196,-0.003207409055903554,0.03166834078729153 +65,1.1762548477525878,298,35,7,172,-0.0013167093275114894,0.03500013053417206 +70,1.3359059506661692,298,60,7,147,0.008530525024980307,0.03881623595952988 +75,1.4369319380372483,298,86,7,121,0.01115282904356718,0.043723251670598984 +80,1.5107383605809237,298,111,7,96,0.014448095113039017,0.045386193320155144 +85,1.5691906481172995,298,137,7,70,0.020133845508098602,0.04726334474980831 
+90,1.6297611879974812,298,162,7,45,0.022078031674027443,0.04918263107538223 +95,1.7017159174665506,298,188,7,19,0.0257729422301054,0.061704523861408234
