# 2×2 analysis: (late-trajectory settling) × (answer correctness), observations

Date: 2026-06-11. Code: `analyze_2x2.py` (+ ad-hoc strict-threshold checks logged in session).
All statements below are measurements on existing diagnostic npz files; no new GPU runs.

## Data provenance (CRITICAL: λ scales are NOT comparable across estimator versions)

| dataset | model / ckpt | n | estimator | window |
|---|---|---|---|---|
| `diag_8k.npz` | HRM righteous-python @ step_26040 (H=2,L=2) | 8192 | `diagnose_hrm.py` (May 22 version) | full 16 ACT steps |
| `diag_trm_singleGPU_step*_512.npz` ×10 | TRM mlp_t singleGPU @ 26041…260410 | 512 each | `diagnose_trm_joint.py` (joint JVP+QR, per-sub-update norm) | full 16 ACT steps |
| `diag_hrm_step_*_512.npz` ×10 | HRM @ 2604…26040 | 512 each | joint estimator (−0.15-scale, matches step7 reports) | full 16 ACT steps |

Within-dataset comparisons only. "Converged"/"settled" here = late z_H drift (mean over ACT steps 13–16) falls in the low-drift band; threshold from Otsu on pooled log10 drift, robustness via percentile sweep. Settled ≠ literal fixed point: the settled band has a narrow characteristic residual velocity (HRM ≈0.96/step, IQR-width <0.03 in log10; TRM ≈18.5/step, q10–q90 = 15.8–21.4).

## Headline numbers

### HRM @26040, n=8192, exact_acc 0.525 (full-window λ, May-22 estimator scale)

Otsu τ → cells: A(settled,correct)=4103, B(settled,wrong)=63, C(unsettled,correct)=195, D(unsettled,wrong)=3831.

- Failure mass is overwhelmingly unsettled: wrong-that-settled = 1.6% at Otsu τ; at a STRICT in-band threshold (45th pct, drift<0.97) it is 21/3894 = **0.55% of failures**.
- The strict-band B examples (n=21) have λ₁ median **−0.842** vs A −0.867 (success-like contraction), q_halt(final) median **+7.47, identical to A** (ACT-confident), token_acc 0.41–0.88 (median 0.62, substantially wrong, not near-misses). These are confidently-wrong settled answers, invisible to both a stability-based and an ACT-confidence-based selector.
- λ₁ as a predictor: AUC(−λ₁→correct) = 0.984 ≈ AUC(−λ₁→settled) = 0.986. At the Otsu split the within-settled outcome gap (Δλ₁ −0.27, AUC 0.852) is a THRESHOLD-MIXTURE ARTIFACT: the Otsu τ sits above the band gap (drift distribution has a near-empty region between pct≈45 and pct≈55, τ jumps −0.013→1.39), so Otsu-"settled" includes mid-drift examples. At the strict in-band τ the within-settled gap nearly closes (−0.867 vs −0.842).
- Residual outcome signal in the unsettled stratum: C vs D Δλ₁(median) = −0.094, AUC 0.818 (C n=195). Note C drifts MORE than D (median 55.4 vs 40.8) while being more contracting in λ₁.
- AUC(−log late-drift→correct) = 0.964.

### TRM singleGPU @260410 (final), n=512, exact_acc 0.770 (joint estimator)

Cells: A=383, B=**0**, C=11, D=118.

- **B = 0 at every threshold**: drift distributions of settled-correct (q90 = 21.4) and wrong (q10 = 47.8) are completely separated; no wrong example reaches the settled band. Robust across the entire percentile sweep (nB=0, nD=118 constant). Same at step 130205. Across the 10-ckpt series, fB>0 only at step 26041 (1.2%).
- λ₁ medians: A +0.0047 (≈0, consistent with "TRM success short-window λ₁>0 is acceptable"), C +0.0998, D +0.1023. Within-unsettled outcome gap ≈ 0 (Δ −0.0025; AUC 0.619, C n=11). AUC(−λ₁→correct) = 0.989 ≈ AUC(−λ₁→settled) = 0.996: at the final checkpoint, λ₁'s outcome signal in TRM is (almost) entirely settling-regime detection.
- q_halt(final) median: A +7.44, C +7.41, D −11.1.

## Checkpoint evolution (512/ckpt; within-series comparisons)

- TRM (26041→260410): fB≈0 from 52082 on; fD shrinks 0.41→0.23; λ₁(D) rises monotonically +0.036→+0.102 while λ₁(A) stays ≈0, i.e. the success/failure λ₁ gap widens over training via the failure cell becoming more expansive.
- HRM (2604→26040, joint-estimator series): mass migration over training: fB 0.89→0.008 (early ckpts: nearly all examples low-drift & wrong), fD rises to ~0.52 mid-training, fA grows with accuracy; λ₁(D) rises from −0.087 to +0.023 (sign flip ~step 15–18k), λ₁(A) stays −0.10…−0.20.

## Direct answers to the motivating questions

1. **Is the failure FTLE cluster a mixture of wandering + wrong-fixed-point modes?** Measured: yes, but extremely lopsided. Wandering dominates (HRM ≥98.4% of failures at Otsu τ, 99.45% at strict τ; TRM final 100%). The wrong-fixed-point mode exists in HRM (21 examples) with exactly the predicted signature (success-like λ₁), and is absent in TRM at late checkpoints.
2. **Does conditioning on settling absorb the success/failure FTLE gap?** TRM final: essentially yes (within-stratum AUC 0.619 with n=11; regime AUC 0.996). HRM: mostly yes after threshold correction (strict-band within-settled gap ≈0.025), with one genuine residual: the unsettled stratum retains an outcome gap (AUC 0.818, C n=195).
3. **Relation to published taxonomies** (factual cross-references): Ren & Liu's four HRM modes map onto our cells; their "non-trivial failure (converged to wrong fixed point)" = our strict-band B (rare: 0.55% of failures); their "trivial failure (wanders or oscillates)" = our D (dominant). Efstathiou & Balwani's "failed runs plateau at stable high-loss attractors" (TRM): by our state-drift criterion, TRM failures are NOT settled (B=0, drift ≥~48/step, ~0.77× the early-trajectory velocity); their "stable" refers to loss plateaus/bounded regions, not state convergence. Our drift+λ₁ measurement distinguishes these.

## Addendum (same day, offline follow-ups; see offline_followups/followups.md)

**REVISION of headline point 2.** "λ₁'s outcome signal is (almost) entirely settling-regime detection" was based on comparing raw AUCs and is too strong as a mediation claim.
The proper control, AUC(−λ₁→correct) **within matched late-drift deciles** of the unsettled stratum, shows substantial independent signal (HRM: per-decile 0.97→0.69 from low to high drift, weighted mean 0.879; unconditioned within-unsettled 0.933). TRM official @58590 with a strict band τ shows the same qualitatively: unsettled-correct λ₁ ≈ +0.017 (n=141) vs unsettled-wrong +0.103 (n=64) at overlapping drift levels. Corrected statement: **λ₁ correlates strongly with settledness, but at matched drift level it still separates outcome**; drift and λ₁ are not redundant observables.

**Difficulty control (#givens, crude proxy).** HRM n=8192: Spearman(correct, givens)=+0.28; Spearman(λ₁, givens)=−0.35 overall but −0.16/−0.18 within outcome. Within-givens-bin AUC(−λ₁→correct) = 0.976–0.987 (weighted 0.982, vs overall 0.984): at the #givens level, the FTLE-outcome separation is NOT a difficulty artifact. (Solver-backtrack difficulty not available offline; #givens is a weak proxy, flag for the writeup.)

**Strict-B per-example (n=21).** All 21 have halted_at ∈ [4,9] (median 6): under real ACT inference every one would have halted early, confidently (q_halt +7.4–7.5), and wrong (token_acc 0.41–0.88). The three lowest-token-acc cases are all 17-givens (minimum-clue) puzzles. Per-example table in followups.md; drift profiles indistinguishable from the A band (fig_hrm_strictB_profiles.png).

**TRM official @58590 note.** 90% of its settled-correct examples are still descending at window end (slope median −0.147), unlike singleGPU @260410 whose A-band is flat (~18.5/step). The "settled band" criterion is checkpoint-specific; cross-checkpoint comparisons must re-derive τ per dataset.

## Addendum 2 (2026-06-12, n=2048 retest + early-window pairing; npz in retest/)

Retest ran on GPU 0 (shared, 12h-fallback claim). Four diagnostics, seed 0, idx-paired.

**1.
TRM official @58590 (87.6%), full window, n=2048: B = 0 confirmed.** 254 failures, none settled; the MINIMUM late-drift among wrong examples (log10 1.664 ≈ 46/step) exceeds the late-drift of 96.5% of correct examples: near-complete distribution separation, threshold-free. λ₁: wrong +0.103 / correct +0.012, AUC 0.993. Within-unsettled outcome AUC 0.848 (C n=70); residual signal confirmed at usable n.

**2. HRM @26040, full window, joint estimator, n=2048: replicates diag_8k on a second estimator.** acc 0.526; A/B/C/D = 1020/14/57/957; λ₁(A) −0.152 vs λ₁(D) +0.032 (the email's −0.15/+0.04 scale); strict-band B n=5 (0.5% of failures) with λ₁ −0.141 ≈ A and q_halt +7.47 (selector-blind, replicated); unsettled within-decile AUC weighted 0.900 (was 0.879 on diag_8k).

**3. Early-window (first 4 ACT steps) does NOT forecast eventual success among still-unsolved examples, and on HRM the dynamical signals point the OTHER way.** Unconditioned early AUCs are inflated by already-solved-at-4 examples (TRM 69.4% solved@4, HRM 34.5%). Restricted to not-yet-correct@4:

| signal @ step 4 | TRM (n=626, 59.4% eventually correct) | HRM (n=1342, 27.6% eventually correct) |
|---|---|---|
| AUC(−λ₁_early → eventual correct) | 0.543 | **0.448** (reversed) |
| AUC(−drift@4 → eventual correct) | 0.492 | **0.312** (reversed: MORE early movement ↔ eventual success, +dir AUC 0.688) |
| AUC(q_halt@4 → eventual correct) | 0.521 | **0.734** |

Observations: (i) the λ₁/outcome association is concurrent with the trajectory's fate (final window), not antecedent: early-window λ₁ has no forward predictive power at this granularity; (ii) on HRM the sign of the early association is inverted: among undecided examples, higher early drift (and marginally higher early λ₁) accompany eventual success; (iii) the one early signal with real forecast power is HRM's learned q_halt (0.734), absent in TRM (0.521); factual architecture note: TRM removed HRM's Q-learning continue-head (BCE halt only).
Window length 4 was chosen to match train-time; other horizons untested.

**Consequences for writeup/claims:** "failure ↔ chaos" should be stated as an outcome-concurrent dynamical signature, not an early predictor; early-exit/reallocation applications are unsupported at this granularity (and sign-reversed on HRM); the within-stratum independence result (drift-matched AUC 0.88–0.90) plus the difficulty control (Addendum 1) remain the strongest positive claims.

## Caveats

- n(B)=21 and n(C)=11/195 are small; per-cell statements about those cells are low-precision.
- "Settled" is a relative (band) criterion; both A-bands have nonzero characteristic residual velocity.
- exact_correct is evaluated at ACT step 16 under fixed unroll; post-window corruption or recovery (Ren & Liu's fixed-point violation) is not observable in these arrays.
- λ window = full trajectory (336 sub-updates TRM / 16 segments HRM); the early-window (length-matched) version of this analysis is NOT covered by existing npz files: the `_short` diagnose scripts exist (4 ACT steps) but produced no saved npz we could find. Open follow-up.
- diag_8k (HRM) and the HRM evolution series use different estimator normalizations; only within-file comparisons are reported above.
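
## Sketch: cell assignment and rank AUC (illustrative)

A minimal sketch of how the settled/unsettled split and the AUC(−λ₁→correct) figures in these notes could be computed. It assumes a per-example drift array and a boolean correctness array. This is not `analyze_2x2.py`: the array layout (`drift` of shape (n, 16), one column per ACT step), the function names, the 256-bin histogram Otsu, and the tie handling are all assumptions made for illustration, and the demo data are synthetic rather than read from any npz file.

```python
import numpy as np

def otsu_threshold(x, bins=256):
    """Histogram Otsu: return the bin edge maximizing between-class variance."""
    hist, edges = np.histogram(x, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist).astype(float)      # mass in bins 0..k (low class)
    w1 = w0[-1] - w0                        # mass in bins k+1.. (high class)
    s0 = np.cumsum(hist * centers)          # cumulative first moment
    valid = (w0 > 0) & (w1 > 0)
    mu0 = s0 / np.where(valid, w0, 1.0)
    mu1 = (s0[-1] - s0) / np.where(valid, w1, 1.0)
    between = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, -np.inf)
    # With an empty gap between modes the maximum is a plateau; argmax takes
    # its first bin, so tau lands at the lower edge of the gap.
    return float(edges[int(np.argmax(between)) + 1])

def rank_auc(score, label):
    """P(score of a positive > score of a negative); ties count 1/2."""
    pos, neg = score[label], score[~label]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def cells_2x2(drift, correct, late=slice(12, 16)):
    """Assign A/B/C/D from late drift (ACT steps 13-16, 0-indexed 12..15)."""
    late_log = np.log10(drift[:, late].mean(axis=1))
    tau = otsu_threshold(late_log)
    settled = late_log < tau
    counts = {
        "A": int((settled & correct).sum()),     # settled, correct
        "B": int((settled & ~correct).sum()),    # settled, wrong
        "C": int((~settled & correct).sum()),    # unsettled, correct
        "D": int((~settled & ~correct).sum()),   # unsettled, wrong
    }
    return counts, tau

# Synthetic demo (NOT the measured data): two well-separated drift bands.
n_low, n_high = 600, 400
late_log_true = np.concatenate([np.linspace(-0.03, 0.03, n_low),
                                np.linspace(1.4, 1.8, n_high)])
drift = np.tile((10.0 ** late_log_true)[:, None], (1, 16))
correct = np.zeros(n_low + n_high, dtype=bool)
correct[:590] = True       # 590 settled-correct, 10 settled-wrong
correct[600:620] = True    # 20 unsettled-correct, 380 unsettled-wrong
rng = np.random.default_rng(0)
# Hypothetical lambda_1 values: contracting in the low band, ~0 in the high band.
lam = np.where(late_log_true < 1.0, -0.85, 0.03) + rng.normal(0.0, 0.05, correct.size)

counts, tau = cells_2x2(drift, correct)
print(counts, "tau(log10) =", round(tau, 3))
print("AUC(-lam -> correct) =", round(rank_auc(-lam, correct), 3))
```

Because the synthetic bands have an empty gap between them, τ sits at the lower edge of the gap here. On the real HRM data, per the headline note above, Otsu τ sits above the band gap, which is why the strict in-band percentile threshold is reported alongside it.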