| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
| commit | 66e0d8b9fd4d0f7a2231d689c055e26fdf1cf04a (patch) | |
| tree | c29cba61124018755a19b02c9d33e3ad5f2e05cc /research/flossing/analysis_2x2/OBSERVATIONS.md | |
Curated export for clone-and-run Maze training (2x A6000) + diagnostics.
trm/hrm pretrain.py carry trajectory-augmentation code (backward-compatible).
Heavy artifacts (checkpoints/wandb/npz) gitignored; see PROVENANCE.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'research/flossing/analysis_2x2/OBSERVATIONS.md')
| -rw-r--r-- | research/flossing/analysis_2x2/OBSERVATIONS.md | 155 |
1 files changed, 155 insertions, 0 deletions
diff --git a/research/flossing/analysis_2x2/OBSERVATIONS.md b/research/flossing/analysis_2x2/OBSERVATIONS.md
new file mode 100644
index 0000000..d0b0670
--- /dev/null
+++ b/research/flossing/analysis_2x2/OBSERVATIONS.md
@@ -0,0 +1,155 @@
+# 2×2 analysis: (late-trajectory settling) × (answer correctness) — observations
+
+Date: 2026-06-11. Code: `analyze_2x2.py` (+ ad-hoc strict-threshold checks logged in session).
+All statements below are measurements on existing diagnostic npz files; no new GPU runs.
+
+## Data provenance (CRITICAL: λ scales are NOT comparable across estimator versions)
+
+| dataset | model / ckpt | n | estimator | window |
+|---|---|---|---|---|
+| `diag_8k.npz` | HRM righteous-python @ step_26040 (H=2,L=2) | 8192 | `diagnose_hrm.py` (May 22 version) | full 16 ACT steps |
+| `diag_trm_singleGPU_step*_512.npz` ×10 | TRM mlp_t singleGPU @ 26041…260410 | 512 each | `diagnose_trm_joint.py` (joint JVP+QR, per-sub-update norm) | full 16 ACT steps |
+| `diag_hrm_step_*_512.npz` ×10 | HRM @ 2604…26040 | 512 each | joint estimator (−0.15-scale, matches step7 reports) | full 16 ACT steps |
+
+Within-dataset comparisons only. "Converged"/"settled" here = late z_H drift (mean over ACT steps 13–16)
+falls in the low-drift band; threshold from Otsu on pooled log10 drift, robustness via percentile sweep.
+Settled ≠ literal fixed point: the settled band has a narrow characteristic residual velocity
+(HRM ≈0.96/step, IQR-width <0.03 in log10; TRM ≈18.5/step, q10–q90 = 15.8–21.4).
+
+## Headline numbers
+
+### HRM @26040, n=8192, exact_acc 0.525 (full-window λ, May-22 estimator scale)
+Otsu τ → cells: A(settled,correct)=4103, B(settled,wrong)=63, C(unsettled,correct)=195, D(unsettled,wrong)=3831.
+
+- Failure mass is overwhelmingly unsettled: wrong-that-settled = 1.6% at Otsu τ; at a STRICT in-band
+  threshold (45th pct, drift<0.97) it is 21/3894 = **0.55% of failures**.
+- The strict-band B examples (n=21) have λ₁ median **−0.842** vs A −0.867 (success-like contraction),
+  q_halt(final) median **+7.47 — identical to A** (ACT-confident), token_acc 0.41–0.88 (median 0.62,
+  substantially wrong, not near-misses). These are confidently-wrong settled answers — invisible to
+  both a stability-based and an ACT-confidence-based selector.
+- λ₁ as a predictor: AUC(−λ₁→correct) = 0.984 ≈ AUC(−λ₁→settled) = 0.986. At the Otsu split the
+  within-settled outcome gap (Δλ₁ −0.27, AUC 0.852) is a THRESHOLD-MIXTURE ARTIFACT: the Otsu τ sits
+  above the band gap (drift distribution has a near-empty region between pct≈45 and pct≈55,
+  τ jumps −0.013→1.39), so Otsu-"settled" includes mid-drift examples. At the strict in-band τ the
+  within-settled gap nearly closes (−0.867 vs −0.842).
+- Residual outcome signal in the unsettled stratum: C vs D Δλ₁(median) = −0.094, AUC 0.818
+  (C n=195). Note C drifts MORE than D (median 55.4 vs 40.8) while being more contracting in λ₁.
+- AUC(−log late-drift→correct) = 0.964.
+
+### TRM singleGPU @260410 (final), n=512, exact_acc 0.770 (joint estimator)
+Cells: A=383, B=**0**, C=11, D=118.
+
+- **B = 0 at every threshold**: drift distributions of settled-correct (q90 = 21.4) and wrong
+  (q10 = 47.8) are completely separated — no wrong example reaches the settled band. Robust across
+  the entire percentile sweep (nB=0, nD=118 constant). Same at step 130205. Across the 10-ckpt series,
+  fB>0 only at step 26041 (1.2%).
+- λ₁ medians: A +0.0047 (≈0, consistent with "TRM success short-window λ₁>0 is acceptable"),
+  C +0.0998, D +0.1023. Within-unsettled outcome gap ≈ 0 (Δ −0.0025; AUC 0.619, C n=11).
+  AUC(−λ₁→correct) = 0.989 ≈ AUC(−λ₁→settled) = 0.996: at the final checkpoint, λ₁'s outcome
+  signal in TRM is (almost) entirely settling-regime detection.
+- q_halt(final) median: A +7.44, C +7.41, D −11.1.
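Editorial sketch (not part of the commit): the cell counts under "Headline numbers" come from an Otsu threshold on pooled log10 late drift, crossed with exact correctness. `analyze_2x2.py` is not shown in this diff, so the code below is only a minimal stdlib illustration of that kind of split. The function names, the 64-bin histogram, and the toy data are all assumptions.

```python
import math
from collections import Counter


def otsu_threshold(values, bins=64):
    """Return the histogram cut that maximises between-class variance.

    Assumption: a plain fixed-width histogram; the real script may bin differently.
    When several cuts tie (an empty gap in the histogram), the lowest cut wins.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    total_sum = sum(h * c for h, c in zip(hist, centers))
    best_var, best_cut = -1.0, lo
    w0 = s0 = 0.0
    for i in range(bins - 1):
        w0 += hist[i]
        s0 += hist[i] * centers[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = s0 / w0, (total_sum - s0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance up to a constant
        if var > best_var:
            best_var, best_cut = var, lo + (i + 1) * width
    return best_cut


def cells_2x2(late_drift, correct, tau_log10):
    """Count A/B/C/D cells: settled means log10(late drift) < tau_log10."""
    names = {(True, True): "A", (True, False): "B",
             (False, True): "C", (False, False): "D"}
    counts = Counter()
    for d, ok in zip(late_drift, correct):
        counts[names[(math.log10(d) < tau_log10, bool(ok))]] += 1
    return counts


# Toy data (hypothetical): four low-drift and four high-drift examples.
drift = [0.9, 1.0, 0.95, 1.05, 40.0, 50.0, 60.0, 45.0]
correct = [1, 1, 1, 0, 0, 0, 1, 0]
tau = otsu_threshold([math.log10(d) for d in drift])
cells = cells_2x2(drift, correct, tau)
```

Because tied cuts resolve to the lowest one, this sketch places τ at the bottom of an empty gap. The notes report that the actual Otsu τ sits above the band gap, which is why a strict in-band percentile threshold was checked separately.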
+
+## Checkpoint evolution (512/ckpt; within-series comparisons)
+
+- TRM (26041→260410): fB≈0 from 52082 on; fD shrinks 0.41→0.23; λ₁(D) rises monotonically
+  +0.036→+0.102 while λ₁(A) stays ≈0 — the success/failure λ₁ gap widens over training via the
+  failure cell becoming more expansive.
+- HRM (2604→26040, joint-estimator series): mass migration over training:
+  fB 0.89→0.008 (early ckpts: nearly all examples low-drift & wrong), fD rises to ~0.52 mid-training,
+  fA grows with accuracy; λ₁(D) rises from −0.087 to +0.023 (sign flip ~step 15–18k), λ₁(A) stays
+  −0.10…−0.20.
+
+## Direct answers to the motivating questions
+
+1. **Is the failure FTLE cluster a mixture of wandering + wrong-fixed-point modes?**
+   Measured: yes, but extremely lopsided. Wandering dominates (HRM ≥98.4% of failures at Otsu τ,
+   99.45% at strict τ; TRM final 100%). The wrong-fixed-point mode exists in HRM (21 examples) with
+   exactly the predicted signature (success-like λ₁), and is absent in TRM at late checkpoints.
+2. **Does conditioning on settling absorb the success/failure FTLE gap?**
+   TRM final: essentially yes (within-stratum AUC 0.619 with n=11; regime AUC 0.996).
+   HRM: mostly yes after threshold correction (strict-band within-settled gap ≈0.025), with one
+   genuine residual: the unsettled stratum retains an outcome gap (AUC 0.818, C n=195).
+3. **Relation to published taxonomies** (factual cross-references):
+   Ren & Liu's four HRM modes map onto our cells; their "non-trivial failure (converged to wrong
+   fixed point)" = our strict-band B (rare: 0.55% of failures); their "trivial failure (wanders or
+   oscillates)" = our D (dominant). Efstathiou & Balwani's "failed runs plateau at stable high-loss
+   attractors" (TRM): by our state-drift criterion, TRM failures are NOT settled (B=0, drift ≥~48/step,
+   ~0.77× the early-trajectory velocity); their "stable" refers to loss plateaus/bounded regions,
+   not state convergence. Our drift+λ₁ measurement distinguishes these.
+
+## Addendum (same day, offline follow-ups — see offline_followups/followups.md)
+
+**REVISION of headline point 2.** "λ₁'s outcome signal is (almost) entirely settling-regime detection"
+was based on comparing raw AUCs and is too strong as a mediation claim. The proper control —
+AUC(−λ₁→correct) **within matched late-drift deciles** of the unsettled stratum — shows substantial
+independent signal (HRM: per-decile 0.97→0.69 from low to high drift, weighted mean 0.879;
+unconditioned within-unsettled 0.933). TRM official @58590 with a strict band τ shows the same
+qualitatively: unsettled-correct λ₁ ≈ +0.017 (n=141) vs unsettled-wrong +0.103 (n=64) at overlapping
+drift levels. Corrected statement: **λ₁ correlates strongly with settledness, but at matched drift
+level it still separates outcome** — drift and λ₁ are not redundant observables.
+
+**Difficulty control (#givens, crude proxy).** HRM n=8192: Spearman(correct, givens)=+0.28;
+Spearman(λ₁, givens)=−0.35 overall but −0.16/−0.18 within outcome. Within-givens-bin
+AUC(−λ₁→correct) = 0.976–0.987 (weighted 0.982, vs overall 0.984): at the #givens level, the
+FTLE-outcome separation is NOT a difficulty artifact. (Solver-backtrack difficulty not available
+offline; #givens is a weak proxy — flag for the writeup.)
+
+**Strict-B per-example (n=21).** All 21 have halted_at ∈ [4,9] (median 6) — under real ACT inference
+every one would have halted early, confidently (q_halt +7.4–7.5), and wrong (token_acc 0.41–0.88).
+The three lowest-token-acc cases are all 17-givens (minimum-clue) puzzles. Per-example table in
+followups.md; drift profiles indistinguishable from the A band (fig_hrm_strictB_profiles.png).
+
+**TRM official @58590 note.** 90% of its settled-correct examples are still descending at window end
+(slope median −0.147) — unlike singleGPU @260410 whose A-band is flat (~18.5/step). The "settled band"
+criterion is checkpoint-specific; cross-checkpoint comparisons must re-derive τ per dataset.
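Editorial sketch (not part of the commit): the "matched late-drift deciles" control in the Addendum can be illustrated as an AUC computed inside each quantile bin of a stratifying variable and then averaged. The version below is a minimal stdlib sketch. Weighting each bin by its size is an assumption (the notes do not state the weighting), and the function names are hypothetical.

```python
def auc(scores, labels):
    """Mann-Whitney AUC: P(positive score > negative score), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None  # undefined when a class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auc(scores, labels, strat, n_bins=10):
    """Per-bin AUCs over equal-count bins of `strat`, plus a size-weighted mean.

    Assumption: bins with a single class are skipped in the weighted mean.
    """
    order = sorted(range(len(strat)), key=lambda i: strat[i])
    per_bin, weighted, weight = [], 0.0, 0
    for b in range(n_bins):
        idx = order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        a = auc([scores[i] for i in idx], [labels[i] for i in idx])
        per_bin.append(a)
        if a is not None:
            weighted += a * len(idx)
            weight += len(idx)
    return per_bin, (weighted / weight if weight else None)


# Toy usage (hypothetical numbers): score = -lambda_1, strat = late drift.
neg_lambda = [0.9, 0.8, 0.7, 0.1, 0.5, 0.2, 0.3, 0.0]
is_correct = [1, 1, 1, 0, 1, 0, 0, 0]
late_drift = [1, 2, 3, 4, 10, 11, 12, 13]
per_bin, mean_auc = stratified_auc(neg_lambda, is_correct, late_drift, n_bins=2)
```

If the per-bin AUCs stay well above 0.5 at matched drift, the score carries outcome information beyond the stratifying variable. That is the sense in which the Addendum calls drift and λ₁ non-redundant.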
+
+## Addendum 2 (2026-06-12, n=2048 retest + early-window pairing; npz in retest/)
+
+Retest ran on GPU 0 (shared, 12h-fallback claim). Four diagnostics, seed 0, idx-paired.
+
+**1. TRM official @58590 (87.6%), full window, n=2048: B = 0 confirmed.**
+254 failures, none settled; the MINIMUM late-drift among wrong examples (log10 1.664 ≈ 46/step)
+exceeds the late-drift of 96.5% of correct examples — near-complete distribution separation,
+threshold-free. λ₁: wrong +0.103 / correct +0.012, AUC 0.993. Within-unsettled outcome AUC 0.848
+(C n=70) — residual signal confirmed at usable n.
+
+**2. HRM @26040, full window, joint estimator, n=2048: replicates diag_8k on a second estimator.**
+acc 0.526; A/B/C/D = 1020/14/57/957; λ₁(A) −0.152 vs λ₁(D) +0.032 (the email's −0.15/+0.04 scale);
+strict-band B n=5 (0.5% of failures) with λ₁ −0.141 ≈ A and q_halt +7.47 (selector-blind, replicated);
+unsettled within-decile AUC weighted 0.900 (was 0.879 on diag_8k).
+
+**3. Early-window (first 4 ACT steps) does NOT forecast eventual success among still-unsolved
+examples — and on HRM the dynamical signals point the OTHER way.**
+Unconditioned early AUCs are inflated by already-solved-at-4 examples (TRM 69.4% solved@4,
+HRM 34.5%). Restricted to not-yet-correct@4:
+
+| signal @ step 4 | TRM (n=626, 59.4% eventually correct) | HRM (n=1342, 27.6% eventually correct) |
+|---|---|---|
+| AUC(−λ₁_early → eventual correct) | 0.543 | **0.448** (reversed) |
+| AUC(−drift@4 → eventual correct) | 0.492 | **0.312** (reversed: MORE early movement ↔ eventual success, +dir AUC 0.688) |
+| AUC(q_halt@4 → eventual correct) | 0.521 | **0.734** |
+
+Observations: (i) the λ₁/outcome association is concurrent with the trajectory's fate (final window),
+not antecedent — early-window λ₁ has no forward predictive power at this granularity; (ii) on HRM the
+sign of the early association is inverted: among undecided examples, higher early drift (and
+marginally higher early λ₁) accompany eventual success; (iii) the one early signal with real forecast
+power is HRM's learned q_halt (0.734) — absent in TRM (0.521); factual architecture note: TRM removed
+HRM's Q-learning continue-head (BCE halt only). Window length 4 was chosen to match train-time; other
+horizons untested.
+
+**Consequences for writeup/claims:** "failure ↔ chaos" should be stated as an outcome-concurrent
+dynamical signature, not an early predictor; early-exit/reallocation applications are unsupported at
+this granularity (and sign-reversed on HRM); the within-stratum independence result (drift-matched
+AUC 0.88-0.90) plus the difficulty control (Addendum 1) remain the strongest positive claims.
+
+## Caveats
+
+- n(B)=21 and n(C)=11/195 are small; per-cell statements about those cells are low-precision.
+- "Settled" is a relative (band) criterion; both A-bands have nonzero characteristic residual velocity.
+- exact_correct is evaluated at ACT step 16 under fixed unroll; post-window corruption or recovery
+  (Ren & Liu's fixed-point violation) is not observable in these arrays.
+- λ window = full trajectory (336 sub-updates TRM / 16 segments HRM); the early-window
+  (length-matched) version of this analysis is NOT covered by existing npz files — the `_short`
+  diagnose scripts exist (4 ACT steps) but produced no saved npz we could find. Open follow-up.
+- diag_8k (HRM) and the HRM evolution series use different estimator normalizations; only
+  within-file comparisons are reported above.
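Editorial sketch (not part of the commit): Addendum 2, item 3 restricts the early-window AUCs to examples that are not yet correct at step 4, and reads a "reversed" AUC of 0.312 as 0.688 in the opposite direction. The stdlib sketch below illustrates both points on toy data. The function names and numbers are hypothetical, and the repository's retest code is not shown in this diff.

```python
def auc(scores, labels):
    """Mann-Whitney AUC: P(positive score > negative score), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def restricted_auc(signal, solved_early, solved_final):
    """AUC(signal -> eventual correctness) among examples unsolved at the early step."""
    keep = [i for i, s in enumerate(solved_early) if not s]
    return auc([signal[i] for i in keep], [solved_final[i] for i in keep])


# Toy data: example 0 is already solved at step 4 and has a large signal value.
signal = [9.0, 1.0, 4.0, 2.0, 3.0]
solved_at_4 = [1, 0, 0, 0, 0]
solved_final = [1, 1, 0, 0, 1]

unconditioned = auc(signal, solved_final)                      # includes example 0
restricted = restricted_auc(signal, solved_at_4, solved_final)
flipped = restricted_auc([-s for s in signal], solved_at_4, solved_final)
```

Negating the signal maps an AUC of a to 1 − a, so a restricted AUC below 0.5 indicates an association in the opposite direction, not an absence of signal.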
