# 2×2 analysis: (late-trajectory settling) × (answer correctness) — observations

Date: 2026-06-11. Code: `analyze_2x2.py` (+ ad-hoc strict-threshold checks logged in session).
All statements in the main sections and the first Addendum are measurements on existing diagnostic
npz files with no new GPU runs; only the Addendum 2 retest ran new diagnostics.

## Data provenance (CRITICAL: λ scales are NOT comparable across estimator versions)

| dataset | model / ckpt | n | estimator | window |
|---|---|---|---|---|
| `diag_8k.npz` | HRM righteous-python @ step_26040 (H=2,L=2) | 8192 | `diagnose_hrm.py` (May 22 version) | full 16 ACT steps |
| `diag_trm_singleGPU_step*_512.npz` ×10 | TRM mlp_t singleGPU @ 26041…260410 | 512 each | `diagnose_trm_joint.py` (joint JVP+QR, per-sub-update norm) | full 16 ACT steps |
| `diag_hrm_step_*_512.npz` ×10 | HRM @ 2604…26040 | 512 each | joint estimator (−0.15-scale, matches step7 reports) | full 16 ACT steps |

Within-dataset comparisons only. "Converged"/"settled" here = late z_H drift (mean over ACT steps 13–16)
falls in the low-drift band; threshold from Otsu on pooled log10 drift, robustness via percentile sweep.
Settled ≠ literal fixed point: the settled band has a narrow characteristic residual velocity
(HRM ≈0.96/step, IQR-width <0.03 in log10; TRM ≈18.5/step, q10–q90 = 15.8–21.4).
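
A minimal sketch of the settling criterion and the 2×2 cell assignment described above, assuming per-example late drift (mean over ACT steps 13–16) and a boolean correctness flag are already extracted as plain lists. The function names, the 64-bin histogram, and the toy data are illustrative assumptions, not taken from `analyze_2x2.py`.

```python
import math

def otsu_threshold(values, bins=64):
    """Otsu's method on a 1-D sample: return the histogram cut that
    maximizes between-class variance (first maximizer on plateaus)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    total_sum = sum(h * c for h, c in zip(hist, centers))
    best_var, best_cut = -1.0, lo
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += hist[i] * centers[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (total_sum - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_cut = var_between, lo + (i + 1) * width
    return best_cut

def cells_2x2(late_drift, correct, tau_log10):
    """Count cells A/B/C/D; 'settled' means log10(late drift) < tau_log10."""
    counts = {"A": 0, "B": 0, "C": 0, "D": 0}
    for d, ok in zip(late_drift, correct):
        settled = math.log10(d) < tau_log10
        counts[("A" if ok else "B") if settled else ("C" if ok else "D")] += 1
    return counts

# Toy data (made up): three low-drift and three high-drift examples.
drift = [0.9, 1.0, 0.95, 40.0, 50.0, 60.0]
correct = [True, True, False, True, False, False]
tau = otsu_threshold([math.log10(d) for d in drift])
print(tau, cells_2x2(drift, correct, tau))
```

On a strongly bimodal sample the between-class variance is flat across the empty region, so where the cut lands inside the gap depends on the tie-breaking rule; that is one reason to cross-check Otsu against a percentile sweep.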

## Headline numbers

### HRM @26040, n=8192, exact_acc 0.525 (full-window λ, May-22 estimator scale)
Otsu τ → cells: A(settled,correct)=4103, B(settled,wrong)=63, C(unsettled,correct)=195, D(unsettled,wrong)=3831.

- Failure mass is overwhelmingly unsettled: settled-but-wrong (B) is 63/3894 = 1.6% of failures at
  Otsu τ; at a STRICT in-band threshold (45th pct, drift<0.97) it is 21/3894 = **0.54% of failures**.
- The strict-band B examples (n=21) have λ₁ median **−0.842** vs A −0.867 (success-like contraction),
  q_halt(final) median **+7.47 — identical to A** (ACT-confident), token_acc 0.41–0.88 (median 0.62,
  substantially wrong, not near-misses). These are confidently-wrong settled answers — invisible to
  both a stability-based and an ACT-confidence-based selector.
- λ₁ as a predictor: AUC(−λ₁→correct) = 0.984 ≈ AUC(−λ₁→settled) = 0.986. At the Otsu split the
  within-settled outcome gap (Δλ₁ −0.27, AUC 0.852) is a THRESHOLD-MIXTURE ARTIFACT: the drift
  distribution is near-empty between pct≈45 and pct≈55 (the percentile threshold jumps from −0.013
  to 1.39 in log10 drift across that range), and the Otsu τ sits above this gap, so Otsu-"settled"
  includes mid-drift examples. At the strict in-band τ the within-settled gap nearly closes
  (λ₁ median −0.867 for A vs −0.842 for B).
- Residual outcome signal in the unsettled stratum: C vs D Δλ₁(median) = −0.094, AUC 0.818
  (C n=195). Note C drifts MORE than D (median 55.4 vs 40.8) while being more contracting in λ₁.
- AUC(−log late-drift→correct) = 0.964.
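
For reference, AUC(score→label) as used in these bullets is the probability that a randomly chosen positive example outranks a randomly chosen negative one, with ties counted as one half. A dependency-free sketch via the Mann-Whitney rank sum follows; the function name and toy values are illustrative, and the analysis script may well use a library routine instead.

```python
def auc(scores, labels):
    """Mann-Whitney AUC: P(score_pos > score_neg) + 0.5 * P(tie)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy data (made up): more negative lambda_1 goes with correct answers,
# so the score passed in is -lambda_1.
lam1 = [-0.9, -0.8, 0.1, 0.2]
correct = [True, True, False, False]
print(auc([-x for x in lam1], correct))
```

Negating the score maps an AUC of a to 1 − a, which is how a "reversed" association is read off the same statistic.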

### TRM singleGPU @260410 (final), n=512, exact_acc 0.770 (joint estimator)
Cells: A=383, B=**0**, C=11, D=118.

- **B = 0 at every threshold**: drift distributions of settled-correct (q90 = 21.4) and wrong
  (q10 = 47.8) are completely separated — no wrong example reaches the settled band. Robust across
  the entire percentile sweep (nB=0, nD=118 constant). Same at step 130205. Across the 10-ckpt series,
  fB>0 only at step 26041 (1.2%).
- λ₁ medians: A +0.0047 (≈0, consistent with "TRM success short-window λ₁>0 is acceptable"),
  C +0.0998, D +0.1023. Within-unsettled outcome gap ≈ 0 (Δ −0.0025; AUC 0.619, C n=11).
  AUC(−λ₁→correct) = 0.989 ≈ AUC(−λ₁→settled) = 0.996: at the final checkpoint, λ₁'s outcome
  signal in TRM is (almost) entirely settling-regime detection.
- q_halt(final) median: A +7.44, C +7.41, D −11.1.
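
The percentile sweep behind "B = 0 at every threshold" can be sketched as below. The sweep range (10th to 60th percentile of pooled late drift), the function names, and the toy data are assumptions for illustration; the range actually swept is not stated in these notes.

```python
def percentile(sorted_vals, p):
    """Linearly interpolated percentile (p in [0, 100]) of a pre-sorted list."""
    k = (len(sorted_vals) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def sweep_wrong_cells(late_drift, correct, percentiles):
    """For each percentile cut tau, count wrong examples that are
    settled (cell B, drift < tau) and unsettled (cell D, drift >= tau)."""
    ordered = sorted(late_drift)
    rows = []
    for p in percentiles:
        tau = percentile(ordered, p)
        n_b = sum(1 for d, ok in zip(late_drift, correct) if not ok and d < tau)
        n_d = sum(1 for d, ok in zip(late_drift, correct) if not ok and d >= tau)
        rows.append((p, tau, n_b, n_d))
    return rows

# Toy data (made up): a low-drift correct band, then wrong examples and
# one unsettled-correct example far above it.
drift = [16.0, 18.0, 19.0, 21.0, 17.0, 20.0, 48.0, 55.0, 60.0, 52.0]
correct = [True, True, True, True, True, True, False, False, False, True]
rows = sweep_wrong_cells(drift, correct, range(10, 70, 10))
for p, tau, n_b, n_d in rows:
    print(p, round(tau, 2), n_b, n_d)
```

Constant nB and nD across the sweep indicate that every tested τ falls inside the gap between the two drift populations.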

## Checkpoint evolution (512/ckpt; within-series comparisons)

- TRM (26041→260410): fB≈0 from 52082 on; fD shrinks 0.41→0.23; λ₁(D) rises monotonically
  +0.036→+0.102 while λ₁(A) stays ≈0 — the success/failure λ₁ gap widens over training via the
  failure cell becoming more expansive.
- HRM (2604→26040, joint-estimator series): mass migration over training:
  fB 0.89→0.008 (early ckpts: nearly all examples low-drift & wrong), fD rises to ~0.52 mid-training,
  fA grows with accuracy; λ₁(D) rises from −0.087 to +0.023 (sign flip ~step 15–18k), λ₁(A) stays
  −0.10…−0.20.

## Direct answers to the motivating questions

1. **Is the failure FTLE cluster a mixture of wandering + wrong-fixed-point modes?**
   Measured: yes, but extremely lopsided. Wandering dominates (HRM 98.4% of failures at Otsu τ,
   99.46% at strict τ; TRM final 100%). The wrong-fixed-point mode exists in HRM (21 strict-band
   examples) with exactly the predicted signature (success-like λ₁), and is absent in TRM at late
   checkpoints.
2. **Does conditioning on settling absorb the success/failure FTLE gap?**
   TRM final: essentially yes (within-stratum AUC 0.619 with n=11; regime AUC 0.996).
   HRM: mostly yes after threshold correction (strict-band within-settled gap ≈0.025), with one
   genuine residual: the unsettled stratum retains an outcome gap (AUC 0.818, C n=195).
3. **Relation to published taxonomies** (factual cross-references):
   Ren & Liu's four HRM modes map onto our cells; their "non-trivial failure (converged to wrong
   fixed point)" = our strict-band B (rare: 21/3894 = 0.54% of failures); their "trivial failure (wanders or
   oscillates)" = our D (dominant). Efstathiou & Balwani's "failed runs plateau at stable high-loss
   attractors" (TRM): by our state-drift criterion, TRM failures are NOT settled (B=0, drift ≥~48/step,
   ~0.77× the early-trajectory velocity); their "stable" refers to loss plateaus/bounded regions,
   not state convergence. Our drift+λ₁ measurement distinguishes these.

## Addendum (same day, offline follow-ups — see offline_followups/followups.md)

**REVISION of headline point 2.** "λ₁'s outcome signal is (almost) entirely settling-regime detection"
was based on comparing raw AUCs and is too strong as a mediation claim. The proper control —
AUC(−λ₁→correct) **within matched late-drift deciles** of the unsettled stratum — shows substantial
independent signal (HRM: per-decile 0.97→0.69 from low to high drift, weighted mean 0.879;
unconditioned within-unsettled 0.933). TRM official @58590 with a strict band τ shows the same
qualitatively: unsettled-correct λ₁ ≈ +0.017 (n=141) vs unsettled-wrong +0.103 (n=64) at overlapping
drift levels. Corrected statement: **λ₁ correlates strongly with settledness, but at matched drift
level it still separates outcome** — drift and λ₁ are not redundant observables.
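
A sketch of the drift-matched control, under stated assumptions: the unsettled stratum is split into equal-count late-drift bins, AUC(−λ₁→correct) is computed inside each bin, and bins are averaged with weights equal to their example counts. The weighting scheme and all names are assumptions; the notes do not say how the weighted mean was formed.

```python
def pairwise_auc(scores, labels):
    """AUC by direct pair counting; returns None if a class is empty."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def drift_matched_auc(drift, lam1, correct, n_bins=10):
    """AUC(-lam1 -> correct) inside equal-count drift bins, plus the
    size-weighted mean over bins that contain both outcomes."""
    order = sorted(range(len(drift)), key=lambda i: drift[i])
    n = len(order)
    per_bin = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        a = pairwise_auc([-lam1[i] for i in idx], [correct[i] for i in idx])
        if a is not None:
            per_bin.append((len(idx), a))
    total = sum(w for w, _ in per_bin)
    if total == 0:
        return [], None
    weighted = sum(w * a for w, a in per_bin) / total
    return [a for _, a in per_bin], weighted

# Toy data (made up): perfect separation in the low-drift bin,
# chance-level separation in the high-drift bin.
drift = [1, 2, 3, 4, 10, 11, 12, 13]
lam1 = [-0.5, 0.1, -0.4, 0.2, -0.1, 0.3, 0.5, 0.0]
correct = [True, False, True, False, True, False, True, False]
per_bin, weighted = drift_matched_auc(drift, lam1, correct, n_bins=2)
print(per_bin, weighted)
```

If λ₁ carried no information beyond drift level, the per-bin AUCs would sit near 0.5; values well above that are what the corrected statement rests on.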

**Difficulty control (#givens, crude proxy).** HRM n=8192: Spearman(correct, givens)=+0.28;
Spearman(λ₁, givens)=−0.35 overall but −0.16/−0.18 within outcome. Within-givens-bin
AUC(−λ₁→correct) = 0.976–0.987 (weighted 0.982, vs overall 0.984): at the #givens level, the
FTLE-outcome separation is NOT a difficulty artifact. (Solver-backtrack difficulty not available
offline; #givens is a weak proxy — flag for the writeup.)

**Strict-B per-example (n=21).** All 21 have halted_at ∈ [4,9] (median 6) — under real ACT inference
every one would have halted early, confidently (q_halt +7.4–7.5), and wrong (token_acc 0.41–0.88).
The three lowest-token-acc cases are all 17-givens (minimum-clue) puzzles. Per-example table in
followups.md; drift profiles indistinguishable from the A band (fig_hrm_strictB_profiles.png).

**TRM official @58590 note.** 90% of its settled-correct examples are still descending at window end
(slope median −0.147) — unlike singleGPU @260410 whose A-band is flat (~18.5/step). The "settled band"
criterion is checkpoint-specific; cross-checkpoint comparisons must re-derive τ per dataset.
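
One way to quantify "still descending at window end" is a least-squares slope of each example's drift profile over the last few ACT steps. The sketch below assumes a 4-step late window and "descending" meaning slope < 0; both choices, and the names, are assumptions rather than the definition used for the −0.147 figure.

```python
from statistics import median

def ls_slope(ys):
    """Least-squares slope of ys against step index 0..len(ys)-1."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

def late_slope_summary(drift_profiles, late_steps=4):
    """Return (fraction of profiles with negative late slope, median slope)."""
    slopes = [ls_slope(p[-late_steps:]) for p in drift_profiles]
    frac_descending = sum(s < 0 for s in slopes) / len(slopes)
    return frac_descending, median(slopes)

# Toy per-step drift profiles (made up): two descending, one flat.
profiles = [
    [30.0, 20.0, 10.0, 8.0, 6.0, 4.0],
    [18.0, 18.0, 5.0, 5.0, 5.0, 5.0],
    [25.0, 12.0, 9.0, 8.0, 7.0, 6.0],
]
frac_desc, med_slope = late_slope_summary(profiles)
print(frac_desc, med_slope)
```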

## Addendum 2 (2026-06-12, n=2048 retest + early-window pairing; npz in retest/)

Retest ran on GPU 0 (shared, 12h-fallback claim). Four diagnostics, seed 0, idx-paired.

**1. TRM official @58590 (87.6%), full window, n=2048: B = 0 confirmed.**
254 failures, none settled; the MINIMUM late-drift among wrong examples (log10 1.664 ≈ 46/step)
exceeds the late-drift of 96.5% of correct examples — near-complete distribution separation,
threshold-free. λ₁: wrong +0.103 / correct +0.012, AUC 0.993. Within-unsettled outcome AUC 0.848
(C n=70) — residual signal confirmed at usable n.
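
The threshold-free separation statistic quoted in item 1 (the share of correct examples whose late drift lies below the minimum late drift of any wrong example) can be computed as in this sketch. Names and toy data are illustrative assumptions.

```python
def separation(late_drift, correct):
    """Return (min late drift among wrong examples,
    fraction of correct examples strictly below that minimum)."""
    wrong = [d for d, ok in zip(late_drift, correct) if not ok]
    right = [d for d, ok in zip(late_drift, correct) if ok]
    min_wrong = min(wrong)
    frac_below = sum(d < min_wrong for d in right) / len(right)
    return min_wrong, frac_below

# Toy data (made up): 4 of 5 correct examples sit below every wrong one.
drift = [16.0, 18.0, 21.0, 19.0, 52.0, 46.0, 60.0]
correct = [True, True, True, True, True, False, False]
min_wrong, frac_below = separation(drift, correct)
print(min_wrong, frac_below)

# Unit conversion used in the text: log10 drift 1.664 is about 46 per step.
drift_per_step = 10 ** 1.664
```

Because the statistic depends only on the ordering of drift values, it needs no τ and is unaffected by the Otsu-versus-strict threshold choice.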

**2. HRM @26040, full window, joint estimator, n=2048: replicates diag_8k on a second estimator.**
acc 0.526; A/B/C/D = 1020/14/57/957; λ₁(A) −0.152 vs λ₁(D) +0.032 (the email's −0.15/+0.04 scale);
strict-band B n=5 (0.5% of failures) with λ₁ −0.141 ≈ A and q_halt +7.47 (selector-blind, replicated);
unsettled within-decile AUC weighted 0.900 (was 0.879 on diag_8k).

**3. Early-window (first 4 ACT steps) does NOT forecast eventual success among still-unsolved
examples — and on HRM the dynamical signals point the OTHER way.**
Unconditioned early AUCs are inflated by already-solved-at-4 examples (TRM 69.4% solved@4,
HRM 34.5%). Restricted to not-yet-correct@4:

| signal @ step 4 | TRM (n=626, 59.4% eventually correct) | HRM (n=1342, 27.6% eventually correct) |
|---|---|---|
| AUC(−λ₁_early → eventual correct) | 0.543 | **0.448** (reversed) |
| AUC(−drift@4 → eventual correct) | 0.492 | **0.312** (reversed: MORE early movement ↔ eventual success, +dir AUC 0.688) |
| AUC(q_halt@4 → eventual correct) | 0.521 | **0.734** |

Observations: (i) the λ₁/outcome association is concurrent with the trajectory's fate (final window),
not antecedent — early-window λ₁ has no forward predictive power at this granularity; (ii) on HRM the
sign of the early association is inverted: among undecided examples, higher early drift (and
marginally higher early λ₁) accompany eventual success; (iii) the one early signal with real forecast
power is HRM's learned q_halt (0.734) — absent in TRM (0.521); factual architecture note: TRM removed
HRM's Q-learning continue-head (BCE halt only). Window length 4 was chosen to match train-time; other
horizons untested.

**Consequences for writeup/claims:** "failure ↔ chaos" should be stated as an outcome-concurrent
dynamical signature, not an early predictor; early-exit/reallocation applications are unsupported at
this granularity (and sign-reversed on HRM); the within-stratum independence result (drift-matched
AUC 0.88–0.90) plus the difficulty control (Addendum 1) remain the strongest positive claims.

## Caveats

- n(B)=21 (HRM strict band) and n(C)=11 (TRM final) / 195 (HRM) are small; per-cell statements
  about those cells are low-precision.
- "Settled" is a relative (band) criterion; both A-bands have nonzero characteristic residual velocity.
- exact_correct is evaluated at ACT step 16 under fixed unroll; post-window corruption or recovery
  (Ren & Liu's fixed-point violation) is not observable in these arrays.
- λ window = full trajectory (336 sub-updates TRM / 16 segments HRM) for everything outside
  Addendum 2. When the main analysis was written, no early-window (length-matched) npz existed:
  the `_short` diagnose scripts (4 ACT steps) had produced no saved npz we could find. Addendum 2
  has since covered the first-4-ACT-step window (npz in retest/); other horizons remain an open
  follow-up.
- diag_8k (HRM) and the HRM evolution series use different estimator normalizations; only
  within-file comparisons are reported above.