# TRM multi4 Dynamics Report

Date: 2026-06-02

## Main Result

Official GBS768 Sudoku-Extreme runs show that trajectory perturbation training (multi4) improves the best checkpoint: full-test exact accuracy rises from 0.8686 (baseline best) to 0.8965 (multi4 best).

| model | step | full exact | full token acc | dynamics sample exact | lambda1 mean | top-8 mean | positive count / 8 |
|---|---:|---:|---:|---:|---:|---:|---:|
| TRM baseline best | 58590 | 0.8686 | 0.9508 | 0.8750 | +0.0282 | +0.0135 | 7.84 |
| TRM multi4 best | 35805 | 0.8965 | 0.9604 | 0.9004 | +0.0204 | +0.0066 | 3.84 |
| TRM multi4 final | 65100 | 0.8351 | 0.9350 | 0.8242 | +0.0323 | +0.0185 | 8.00 |

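Interpretation: multi4 best improves full-test exact accuracy by about +2.8pp over baseline best (0.8686 to 0.8965) while shrinking the positive part of the Lyapunov spectrum: the top-8 mean falls from +0.0135 to +0.0066, and the mean number of positive exponents falls from 7.84 to 3.84 of 8. The final checkpoint collapses below baseline (0.8351 exact), and its spectrum becomes broadly positive again (8.00 of 8).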
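
To make the three spectrum columns concrete, here is a minimal sketch of how they could be derived from per-trajectory exponents. The column definitions are assumptions (lambda1 as the largest exponent per trajectory, top-8 mean as the average over the tracked exponents, positive count as the number of strictly positive exponents), and `summarize_spectrum` and the toy values are hypothetical, not taken from the report's code.

```python
from statistics import mean


def summarize_spectrum(exponents):
    """Summarize per-trajectory Lyapunov exponents.

    `exponents` holds one list per trajectory, each containing that
    trajectory's leading exponents (8 per trajectory in the tables above).
    The column definitions below are assumptions, not the report's code.
    """
    return {
        # Largest exponent of each trajectory, averaged over trajectories.
        "lambda1_mean": mean(max(sample) for sample in exponents),
        # Mean of all tracked exponents, averaged over trajectories.
        "topk_mean": mean(mean(sample) for sample in exponents),
        # Number of strictly positive exponents, averaged over trajectories.
        "positive_count": mean(sum(x > 0 for x in sample) for sample in exponents),
    }


# Toy input with 4 exponents per trajectory (illustrative values only).
toy = [
    [0.02, 0.01, -0.01, -0.02],  # mildly expanding: 2 positive exponents
    [0.10, 0.05, 0.02, 0.01],    # strongly expanding: all 4 positive
]
summary = summarize_spectrum(toy)
print(summary)
```

Under these definitions, a lower positive count means fewer expanding directions along the trajectory, which is the sense in which the multi4 best checkpoint is less chaotic.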

## Success vs Failure Separation

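The N=512 spectrum sample confirms the original observation: failures have much larger positive exponents than successes. Failure lambda1 is roughly +0.09 to +0.10 in every row, against +0.01 to +0.02 for successes.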

| model | success lambda1 | failure lambda1 | success top-8 mean | failure top-8 mean | success positive count | failure positive count |
|---|---:|---:|---:|---:|---:|---:|
| TRM baseline best | +0.0176 | +0.1026 | +0.0080 | +0.0516 | 7.82 | 8.00 |
| TRM multi4 best | +0.0112 | +0.1035 | +0.0020 | +0.0485 | 3.38 | 8.00 |
| TRM multi4 final | +0.0191 | +0.0942 | +0.0129 | +0.0449 | 8.00 | 8.00 |

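Interpretation: multi4 does not make failures stable (failure lambda1 stays near +0.10 and all 8 exponents remain positive); it mainly shifts successful solutions into less chaotic basins (the success positive count drops from 7.82 to 3.38 at the best checkpoint). This is consistent with the stable-attractor training hypothesis: success becomes less volume-expanding, while bad basins remain chaotic.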
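
As a consistency check between the two tables, the pooled values in the Main Result table should equal the success and failure values weighted by the success rate. The sketch below assumes that the "dynamics sample exact" column is the success fraction of the same N=512 sample; the numbers are copied from the tables, while the dictionary layout and the `pooled` helper are illustrative.

```python
# Values copied from the two tables above: sample exact rate, then
# (success, failure, reported pooled) for lambda1 and for positive count.
rows = {
    "TRM baseline best": (0.8750, (0.0176, 0.1026, 0.0282), (7.82, 8.00, 7.84)),
    "TRM multi4 best": (0.9004, (0.0112, 0.1035, 0.0204), (3.38, 8.00, 3.84)),
    "TRM multi4 final": (0.8242, (0.0191, 0.0942, 0.0323), (8.00, 8.00, 8.00)),
}


def pooled(exact, success_value, failure_value):
    """Success-rate-weighted mean of a per-group statistic."""
    return exact * success_value + (1.0 - exact) * failure_value


reconstructed = {}
for name, (exact, lam, pos) in rows.items():
    lam_hat = pooled(exact, lam[0], lam[1])
    pos_hat = pooled(exact, pos[0], pos[1])
    reconstructed[name] = (lam_hat, pos_hat)
    print(f"{name}: lambda1 {lam_hat:+.4f} (reported {lam[2]:+.4f}), "
          f"positive count {pos_hat:.2f} (reported {pos[2]:.2f})")
```

All three rows reconstruct the reported pooled values to within rounding, so the drop in the pooled multi4 best spectrum comes from the success group rather than from any change in the failures.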

## PTRM Same-Subset Result

The existing paired PTRM files already evaluate both models on identical `idx` values, with `n=1000`, `K=100`, `D=64`, `sigma=0.3`, and L-only perturbation.

| metric | baseline best | multi4 best | delta |
|---|---:|---:|---:|
| deterministic exact on subset | 0.8870 | 0.9110 | +0.0240 |
| mean rollout exact | 0.9419 | 0.9542 | +0.0123 |
| Q-selected exact | 0.9840 | 0.9880 | +0.0040 |
| oracle pass@100 | 0.9850 | 0.9880 | +0.0030 |
| correct rollout count / 100 | 94.19 | 95.42 | +1.23 |

Interpretation: PTRM is nearly saturated on Sudoku (Q-selected and oracle exact are both above 0.98 for both models), so the Q-selected and oracle improvements are small. The stronger signal is that multi4 increases deterministic accuracy and the density of correct stochastic rollouts on the same test subset.
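
The relationship between the rollout-level metrics in the table can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code: `correct[i][k]` is taken to mean that rollout `k` solves puzzle `i`, and "Q-selected" is assumed to pick the rollout with the highest score from a hypothetical `q_scores` array. The function name and toy data are made up.

```python
def ptrm_metrics(correct, q_scores):
    """Return (mean rollout exact, oracle pass@K, Q-selected exact).

    correct[i][k]  -- True if rollout k solves puzzle i.
    q_scores[i][k] -- assumed selection score for rollout k of puzzle i.
    """
    n = len(correct)
    # Fraction of correct rollouts per puzzle, averaged over puzzles.
    mean_rollout = sum(sum(row) / len(row) for row in correct) / n
    # A puzzle counts for the oracle if any of its K rollouts is correct.
    oracle = sum(any(row) for row in correct) / n
    # Q-selection keeps only the rollout with the highest score.
    picked = []
    for row, scores in zip(correct, q_scores):
        best = max(range(len(row)), key=lambda k: scores[k])
        picked.append(row[best])
    q_selected = sum(picked) / n
    return mean_rollout, oracle, q_selected


# Toy data: 3 puzzles, K=4 rollouts each (illustrative values only).
correct = [
    [True, True, True, True],      # solved by every rollout
    [False, True, False, False],   # solved by one rollout only
    [False, False, False, False],  # never solved
]
q_scores = [
    [0.9, 0.8, 0.7, 0.6],
    [0.9, 0.2, 0.1, 0.1],  # the top-scored rollout here is a wrong one
    [0.5, 0.4, 0.3, 0.2],
]
mean_rollout, oracle, q_selected = ptrm_metrics(correct, q_scores)
print(mean_rollout, oracle, q_selected)
```

Under these definitions Q-selected exact can never exceed the oracle, and "correct rollout count / 100" is mean rollout exact multiplied by K=100, which matches the table (0.9419 and 94.19).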

## Caveats

- Full TRM eval uses all 422,786 Sudoku-Extreme test puzzles. PTRM runs above use a fixed 1,000-sample subset, so deterministic PTRM-subset numbers should not be mixed with full-test W&B exact accuracy.
- The N=512 Lyapunov spectrum is a diagnostic sample, not full-test evaluation.
- The final checkpoint is not the result to report as the method's performance; it is a diagnostic of overtraining and collapse.
- A larger paired PTRM run is currently queued: `n=10000`, `K=25`, `D=64`, `sigma=0.3`, L-only, same seed. It is slow and should be treated as a robustness check, not a blocker.
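
For a rough sense of what the larger subset buys, the sketch below computes an unpaired binomial standard error for the deterministic exact rate at both subset sizes. This is a back-of-the-envelope assumption, not an analysis from the runs: a proper paired comparison would need the per-puzzle outcomes, which are not reproduced here, and `binomial_se` is an illustrative helper.

```python
import math


def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


# Baseline deterministic exact on the PTRM subset, from the table above.
p_baseline = 0.8870

se_by_n = {n: binomial_se(p_baseline, n) for n in (1000, 10000)}
for n, se in se_by_n.items():
    print(f"n={n}: standard error ~ {se * 100:.2f}pp")
```

At `n=1000` the standard error is about 1pp per model, so the +2.4pp deterministic gap is suggestive rather than conclusive when treated as unpaired; at `n=10000` the error shrinks by a factor of sqrt(10), to roughly 0.3pp.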

## Artifacts

- Headline CSV: `research/flossing/official_gbs768_spectrum/headline_trm_multi4_dynamics_table.csv`
- Spectrum summary CSV: `research/flossing/official_gbs768_spectrum/summary_n512_k8_seed20260602.csv`
- PTRM paired CSV: `research/flossing/ptrm_same_subset/paired_ptrm_k100_n1000_seed0_summary.csv`
- Spectrum plot: `research/flossing/meeting_artifacts/trm_gbs768_spectrum_success_failure_n512.png`
- Headline bar plot: `research/flossing/meeting_artifacts/trm_gbs768_headline_dynamics_bars.png`
- PTRM plot: `research/flossing/meeting_artifacts/ptrm_same_subset_k100_n1000_comparison.png`