# 2×2 analysis: (late-trajectory settling) × (answer correctness) — observations

Date: 2026-06-11. Code: `analyze_2x2.py` (+ ad-hoc strict-threshold checks logged in session).
All statements below are measurements on existing diagnostic npz files with no new GPU runs, except Addendum 2, which reports a GPU retest.

## Data provenance (CRITICAL: λ scales are NOT comparable across estimator versions)

| dataset | model / ckpt | n | estimator | window |
|---|---|---|---|---|
| `diag_8k.npz` | HRM righteous-python @ step_26040 (H=2,L=2) | 8192 | `diagnose_hrm.py` (May 22 version) | full 16 ACT steps |
| `diag_trm_singleGPU_step*_512.npz` ×10 | TRM mlp_t singleGPU @ 26041…260410 | 512 each | `diagnose_trm_joint.py` (joint JVP+QR, per-sub-update norm) | full 16 ACT steps |
| `diag_hrm_step_*_512.npz` ×10 | HRM @ 2604…26040 | 512 each | joint estimator (−0.15-scale, matches step7 reports) | full 16 ACT steps |

Within-dataset comparisons only. "Converged"/"settled" here means that late z_H drift (mean over ACT steps 13–16)
falls in the low-drift band; the threshold τ comes from Otsu on pooled log10 drift, with robustness checked via a percentile sweep.
Settled ≠ literal fixed point: the settled band has a narrow characteristic residual velocity
(HRM ≈0.96/step, IQR-width <0.03 in log10; TRM ≈18.5/step, q10–q90 = 15.8–21.4).
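
A minimal sketch of the settled/unsettled split and the resulting 2×2 cells, assuming per-example late drift (mean over ACT steps 13–16) and correctness flags are already extracted from the npz. The names `otsu_threshold` and `cells_2x2` are hypothetical and are not taken from `analyze_2x2.py`; the bin count and the tie rule are illustrative choices.

```python
import math

def otsu_threshold(values, bins=64):
    """Otsu's method on a 1-D sample: pick the histogram cut that
    maximizes between-class variance. Threshold is returned in the
    same units as `values` (here: log10 late drift)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        k = min(int((v - lo) / width), bins - 1)  # clamp the max into the last bin
        counts[k] += 1
    centers = [lo + (k + 0.5) * width for k in range(bins)]
    total = len(values)
    total_sum = sum(c * x for c, x in zip(counts, centers))
    best_var, best_cut = -1.0, lo
    w0, s0 = 0, 0.0
    for k in range(bins - 1):
        w0 += counts[k]
        s0 += counts[k] * centers[k]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = s0 / w0
        m1 = (total_sum - s0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        # Ties (an empty gap between modes) resolve to the lowest cut.
        if var_between > best_var:
            best_var = var_between
            best_cut = lo + (k + 1) * width  # right edge of bin k
    return best_cut

def cells_2x2(late_drift, correct, tau_log10):
    """Count cells: A settled+correct, B settled+wrong,
    C unsettled+correct, D unsettled+wrong. Settled: log10(drift) < tau."""
    cells = {"A": 0, "B": 0, "C": 0, "D": 0}
    for d, ok in zip(late_drift, correct):
        settled = math.log10(d) < tau_log10
        key = ("A" if ok else "B") if settled else ("C" if ok else "D")
        cells[key] += 1
    return cells
```

Because the tie rule places τ at the low edge of an empty gap, a different tie rule (for example the midpoint of the maximizing range) would move τ inside the gap without changing any cell count.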

## Headline numbers

### HRM @26040, n=8192, exact_acc 0.525 (full-window λ, May-22 estimator scale)
Otsu τ → cells: A(settled,correct)=4103, B(settled,wrong)=63, C(unsettled,correct)=195, D(unsettled,wrong)=3831.

- Failure mass is overwhelmingly unsettled: wrong-that-settled = 1.6% at Otsu τ; at a STRICT in-band
  threshold (45th pct, drift<0.97) it is 21/3894 = **0.55% of failures**.
- The strict-band B examples (n=21) have λ₁ median **−0.842** vs A −0.867 (success-like contraction),
  q_halt(final) median **+7.47 — identical to A** (ACT-confident), token_acc 0.41–0.88 (median 0.62,
  substantially wrong, not near-misses). These are confidently-wrong settled answers — invisible to
  both a stability-based and an ACT-confidence-based selector.
- λ₁ as a predictor: AUC(−λ₁→correct) = 0.984 ≈ AUC(−λ₁→settled) = 0.986. At the Otsu split the
  within-settled outcome gap (Δλ₁ −0.27, AUC 0.852) is a THRESHOLD-MIXTURE ARTIFACT: the Otsu τ sits
  above the band gap (drift distribution has a near-empty region between pct≈45 and pct≈55,
  τ jumps −0.013→1.39), so Otsu-"settled" includes mid-drift examples. At the strict in-band τ the
  within-settled gap nearly closes (−0.867 vs −0.842).
- Residual outcome signal in the unsettled stratum: C vs D Δλ₁(median) = −0.094, AUC 0.818
  (C n=195). Note C drifts MORE than D (median 55.4 vs 40.8) while being more contracting in λ₁.
- AUC(−log late-drift→correct) = 0.964.
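
The AUC(−λ₁→correct) and AUC(−λ₁→settled) figures above are rank statistics. A sketch of one way to compute them (Mann-Whitney form with average ranks for ties), assuming λ₁ and the binary label are per-example lists; `auc` and `auc_neg_lambda` are hypothetical helper names, not the functions used in the session.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative
    (ties count 1/2): the Mann-Whitney form of ROC AUC."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    rank_sum_pos = 0.0
    n_pos = 0
    i = 0
    while i < n:
        # Find the run of tied scores starting at i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            if pairs[k][1]:
                rank_sum_pos += avg_rank
                n_pos += 1
        i = j
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both classes")
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

def auc_neg_lambda(lam1, label):
    """AUC(-lambda1 -> label): more negative lambda1 scores higher."""
    return auc([-x for x in lam1], label)
```

Calling `auc_neg_lambda(lam1, correct)` and `auc_neg_lambda(lam1, settled)` on the same examples gives the two numbers being compared; near-equal values say that λ₁ ranks both labels about equally well, not that the labels are interchangeable.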

### TRM singleGPU @260410 (final), n=512, exact_acc 0.770 (joint estimator)
Cells: A=383, B=**0**, C=11, D=118.

- **B = 0 at every threshold**: drift distributions of settled-correct (q90 = 21.4) and wrong
  (q10 = 47.8) are completely separated — no wrong example reaches the settled band. Robust across
  the entire percentile sweep (nB=0, nD=118 constant). Same at step 130205. Across the 10-ckpt series,
  fB>0 only at step 26041 (1.2%).
- λ₁ medians: A +0.0047 (≈0, consistent with "TRM success short-window λ₁>0 is acceptable"),
  C +0.0998, D +0.1023. Within-unsettled outcome gap ≈ 0 (Δ −0.0025; AUC 0.619, C n=11).
  AUC(−λ₁→correct) = 0.989 ≈ AUC(−λ₁→settled) = 0.996: at the final checkpoint, λ₁'s outcome
  signal in TRM is (almost) entirely settling-regime detection.
- q_halt(final) median: A +7.44, C +7.41, D −11.1.
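
A sketch of the percentile sweep behind the "B = 0 at every threshold" statement, assuming τ is taken as a percentile of the pooled late-drift distribution. The sweep grid (10th to 70th percentile in steps of 5) is an assumption, since the actual grid is not recorded here, and `sweep_wrong_settled` is a hypothetical name.

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile on a pre-sorted list, p in [0, 100]."""
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def sweep_wrong_settled(late_drift, correct, pcts=range(10, 75, 5)):
    """For each pooled-drift percentile used as tau, count wrong examples
    below tau (cell B) and at or above tau (cell D)."""
    pooled = sorted(late_drift)
    rows = []
    for p in pcts:
        tau = percentile(pooled, p)
        n_b = sum(1 for d, ok in zip(late_drift, correct) if not ok and d < tau)
        n_d = sum(1 for d, ok in zip(late_drift, correct) if not ok and d >= tau)
        rows.append((p, tau, n_b, n_d))
    return rows
```

A constant `n_d` with `n_b == 0` in every row is the pattern reported for TRM; any row with `n_b > 0` marks a τ at which wrong examples enter the settled band.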

## Checkpoint evolution (512/ckpt; within-series comparisons)

- TRM (26041→260410): fB≈0 from 52082 on; fD shrinks 0.41→0.23; λ₁(D) rises monotonically
  +0.036→+0.102 while λ₁(A) stays ≈0 — the success/failure λ₁ gap widens over training via the
  failure cell becoming more expansive.
- HRM (2604→26040, joint-estimator series): mass migration over training:
  fB 0.89→0.008 (early ckpts: nearly all examples low-drift & wrong), fD rises to ~0.52 mid-training,
  fA grows with accuracy; λ₁(D) rises from −0.087 to +0.023 (sign flip ~step 15–18k), λ₁(A) stays
  −0.10…−0.20.

## Direct answers to the motivating questions

1. **Is the failure FTLE cluster a mixture of wandering + wrong-fixed-point modes?**
   Measured: yes, but extremely lopsided. Wandering dominates (HRM ≥98.4% of failures at Otsu τ,
   99.45% at strict τ; TRM final 100%). The wrong-fixed-point mode exists in HRM (21 examples) with
   exactly the predicted signature (success-like λ₁), and is absent in TRM at late checkpoints.
2. **Does conditioning on settling absorb the success/failure FTLE gap?**
   TRM final: essentially yes (within-stratum AUC 0.619 with n=11; regime AUC 0.996).
   HRM: mostly yes after threshold correction (strict-band within-settled gap ≈0.025), with one
   genuine residual: the unsettled stratum retains an outcome gap (AUC 0.818, C n=195).
3. **Relation to published taxonomies** (factual cross-references):
   Ren & Liu's four HRM modes map onto our cells; their "non-trivial failure (converged to wrong
   fixed point)" = our strict-band B (rare: 0.55% of failures); their "trivial failure (wanders or
   oscillates)" = our D (dominant). Efstathiou & Balwani's "failed runs plateau at stable high-loss
   attractors" (TRM): by our state-drift criterion, TRM failures are NOT settled (B=0, drift ≥~48/step,
   ~0.77× the early-trajectory velocity); their "stable" refers to loss plateaus/bounded regions,
   not state convergence. Our drift+λ₁ measurement distinguishes these.

## Addendum 1 (same day, offline follow-ups; see offline_followups/followups.md)

**REVISION of headline point 2.** "λ₁'s outcome signal is (almost) entirely settling-regime detection"
was based on comparing raw AUCs and is too strong as a mediation claim. The proper control —
AUC(−λ₁→correct) **within matched late-drift deciles** of the unsettled stratum — shows substantial
independent signal (HRM: per-decile 0.97→0.69 from low to high drift, weighted mean 0.879;
unconditioned within-unsettled 0.933). TRM official @58590 with a strict band τ shows the same
qualitatively: unsettled-correct λ₁ ≈ +0.017 (n=141) vs unsettled-wrong +0.103 (n=64) at overlapping
drift levels. Corrected statement: **λ₁ correlates strongly with settledness, but at matched drift
level it still separates outcome** — drift and λ₁ are not redundant observables.
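
A sketch of the drift-matched control, assuming it is run on the unsettled stratum only and that the "weighted mean" weights each decile by its example count (the weighting is an assumption; the note does not state it). `pairwise_auc` and `drift_matched_auc` are hypothetical names.

```python
def pairwise_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked
    correctly, ties counting 1/2. Returns None if a class is missing."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def drift_matched_auc(lam1, drift, correct, n_bins=10):
    """AUC(-lambda1 -> correct) inside equal-count drift bins, plus the
    bin-size-weighted mean over bins where both outcomes occur."""
    order = sorted(range(len(drift)), key=lambda i: drift[i])
    n = len(order)
    per_bin, weight_sum, weighted = [], 0, 0.0
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        a = pairwise_auc([-lam1[i] for i in idx], [correct[i] for i in idx])
        per_bin.append(a)
        if a is not None:
            weighted += a * len(idx)
            weight_sum += len(idx)
    mean = weighted / weight_sum if weight_sum else None
    return per_bin, mean
```

If λ₁ carried no outcome information beyond drift, the per-bin values would sit near 0.5; values well above 0.5 inside narrow drift bins are what supports the corrected statement.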

**Difficulty control (#givens, crude proxy).** HRM n=8192: Spearman(correct, givens)=+0.28;
Spearman(λ₁, givens)=−0.35 overall but −0.16/−0.18 within outcome. Within-givens-bin
AUC(−λ₁→correct) = 0.976–0.987 (weighted 0.982, vs overall 0.984): at the #givens level, the
FTLE-outcome separation is NOT a difficulty artifact. (Solver-backtrack difficulty not available
offline; #givens is a weak proxy — flag for the writeup.)

**Strict-B per-example (n=21).** All 21 have halted_at ∈ [4,9] (median 6) — under real ACT inference
every one would have halted early, confidently (q_halt +7.4–7.5), and wrong (token_acc 0.41–0.88).
The three lowest-token-acc cases are all 17-givens (minimum-clue) puzzles. Per-example table in
followups.md; drift profiles indistinguishable from the A band (fig_hrm_strictB_profiles.png).

**TRM official @58590 note.** 90% of its settled-correct examples are still descending at window end
(slope median −0.147) — unlike singleGPU @260410 whose A-band is flat (~18.5/step). The "settled band"
criterion is checkpoint-specific; cross-checkpoint comparisons must re-derive τ per dataset.

## Addendum 2 (2026-06-12, n=2048 retest + early-window pairing; npz in retest/)

Retest ran on GPU 0 (shared, 12h-fallback claim). Four diagnostics, seed 0, idx-paired.

**1. TRM official @58590 (87.6%), full window, n=2048: B = 0 confirmed.**
254 failures, none settled; the MINIMUM late-drift among wrong examples (log10 1.664 ≈ 46/step)
exceeds the late-drift of 96.5% of correct examples — near-complete distribution separation,
threshold-free. λ₁: wrong +0.103 / correct +0.012, AUC 0.993. Within-unsettled outcome AUC 0.848
(C n=70) — residual signal confirmed at usable n.
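
The threshold-free separation check in item 1 can be sketched as follows, assuming per-example late drift in raw units (per step) and correctness flags; `separation_summary` is a hypothetical name.

```python
import math

def separation_summary(late_drift, correct):
    """Compare the smallest late drift among wrong examples with the
    late-drift distribution of correct examples; no threshold involved."""
    wrong = [d for d, ok in zip(late_drift, correct) if not ok]
    right = [d for d, ok in zip(late_drift, correct) if ok]
    if not wrong or not right:
        raise ValueError("need both correct and wrong examples")
    min_wrong = min(wrong)
    frac_below = sum(1 for d in right if d < min_wrong) / len(right)
    return {
        "min_wrong_drift": min_wrong,
        "min_wrong_log10": math.log10(min_wrong),
        "frac_correct_below": frac_below,  # 1.0 means complete separation
    }
```

For reference, the conversion quoted above is `10 ** 1.664`, which is about 46 in drift-per-step units.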

**2. HRM @26040, full window, joint estimator, n=2048: replicates diag_8k on a second estimator.**
acc 0.526; A/B/C/D = 1020/14/57/957; λ₁(A) −0.152 vs λ₁(D) +0.032 (the email's −0.15/+0.04 scale);
strict-band B n=5 (0.5% of failures) with λ₁ −0.141 ≈ A and q_halt +7.47 (selector-blind, replicated);
unsettled within-decile AUC weighted 0.900 (was 0.879 on diag_8k).

**3. Early-window (first 4 ACT steps) does NOT forecast eventual success among still-unsolved
examples — and on HRM the dynamical signals point the OTHER way.**
Unconditioned early AUCs are inflated by already-solved-at-4 examples (TRM 69.4% solved@4,
HRM 34.5%). Restricted to not-yet-correct@4:

| signal @ step 4 | TRM (n=626, 59.4% eventually correct) | HRM (n=1342, 27.6% eventually correct) |
|---|---|---|
| AUC(−λ₁_early → eventual correct) | 0.543 | **0.448** (reversed) |
| AUC(−drift@4 → eventual correct) | 0.492 | **0.312** (reversed: MORE early movement ↔ eventual success, +dir AUC 0.688) |
| AUC(q_halt@4 → eventual correct) | 0.521 | **0.734** |

Observations: (i) the λ₁/outcome association is concurrent with the trajectory's fate (final window),
not antecedent — early-window λ₁ has no forward predictive power at this granularity; (ii) on HRM the
sign of the early association is inverted: among undecided examples, higher early drift (and
marginally higher early λ₁) accompany eventual success; (iii) the one early signal with real forecast
power is HRM's learned q_halt (0.734) — absent in TRM (0.521); factual architecture note: TRM removed
HRM's Q-learning continue-head (BCE halt only). Window length 4 was chosen to match train-time; other
horizons untested.
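
A sketch of the restriction used in the table: the same AUC computed once on all examples and once on those not yet correct at step 4. It assumes per-example arrays for the step-4 signal, correctness at step 4, and final correctness; `early_forecast_auc` is a hypothetical name, and signals where lower means "better" (λ₁, drift) are passed negated.

```python
def pairwise_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked
    correctly, ties counting 1/2. O(n_pos * n_neg), fine for n ~ 2048."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def early_forecast_auc(signal_at_4, correct_at_4, correct_final):
    """Return (unconditioned AUC, AUC restricted to not-yet-correct@4)."""
    full = pairwise_auc(signal_at_4, correct_final)
    keep = [i for i, done in enumerate(correct_at_4) if not done]
    restricted = pairwise_auc([signal_at_4[i] for i in keep],
                              [correct_final[i] for i in keep])
    return full, restricted
```

In a toy case where already-solved examples all carry a high signal, the unconditioned AUC is inflated while the restricted AUC drops to chance, which is the inflation mechanism described for the unconditioned early AUCs.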

**Consequences for writeup/claims:** "failure ↔ chaos" should be stated as an outcome-concurrent
dynamical signature, not an early predictor; early-exit/reallocation applications are unsupported at
this granularity (and sign-reversed on HRM); the within-stratum independence result (drift-matched
AUC 0.88-0.90) plus the difficulty control (Addendum 1) remain the strongest positive claims.

## Caveats

- n(B)=21 (HRM strict band) and n(C)=11 (TRM) / 195 (HRM) are small; per-cell statements about those cells are low-precision.
- "Settled" is a relative (band) criterion; both A-bands have nonzero characteristic residual velocity.
- exact_correct is evaluated at ACT step 16 under fixed unroll; post-window corruption or recovery
  (Ren & Liu's fixed-point violation) is not observable in these arrays.
- λ window = full trajectory (336 sub-updates TRM / 16 segments HRM). In the original analysis the
  early-window (length-matched) version was NOT covered by existing npz files: the `_short`
  diagnose scripts exist (4 ACT steps) but produced no saved npz we could find. Addendum 2, item 3
  partly addresses this with the early-window (first 4 ACT steps) retest.
- diag_8k (HRM) and the HRM evolution series use different estimator normalizations; only
  within-file comparisons are reported above.