| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
| commit | 66e0d8b9fd4d0f7a2231d689c055e26fdf1cf04a (patch) | |
| tree | c29cba61124018755a19b02c9d33e3ad5f2e05cc /research/flossing/meeting_artifacts | |
Curated export for clone-and-run Maze training (2x A6000) + diagnostics.
trm/hrm pretrain.py carry trajectory-augmentation code (backward-compatible).
Heavy artifacts (checkpoints/wandb/npz) gitignored; see PROVENANCE.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'research/flossing/meeting_artifacts')
| -rw-r--r-- | research/flossing/meeting_artifacts/trm_multi4_dynamics_report.md | 57 |
1 files changed, 57 insertions, 0 deletions
diff --git a/research/flossing/meeting_artifacts/trm_multi4_dynamics_report.md b/research/flossing/meeting_artifacts/trm_multi4_dynamics_report.md
new file mode 100644
index 0000000..184656c
--- /dev/null
+++ b/research/flossing/meeting_artifacts/trm_multi4_dynamics_report.md
@@ -0,0 +1,57 @@
+# TRM multi4 Dynamics Report
+
+Date: 2026-06-02
+
+## Main Result
+
+Official GBS768 Sudoku-Extreme runs show a real best-checkpoint gain from trajectory perturbation training.
+
+| model | step | full exact | full token acc | dynamics sample exact | lambda1 mean | top-8 mean | positive count / 8 |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| TRM baseline best | 58590 | 0.8686 | 0.9508 | 0.8750 | +0.0282 | +0.0135 | 7.84 |
+| TRM multi4 best | 35805 | 0.8965 | 0.9604 | 0.9004 | +0.0204 | +0.0066 | 3.84 |
+| TRM multi4 final | 65100 | 0.8351 | 0.9350 | 0.8242 | +0.0323 | +0.0185 | 8.00 |
+
+Interpretation: multi4 best improves full-test exact accuracy by +2.78pp over baseline best, while suppressing the Lyapunov spectrum volume on successful trajectories. The final checkpoint collapses below baseline and its spectrum becomes broadly positive again.
+
+## Success vs Failure Separation
+
+The N=512 spectrum sample keeps the original observation: failures have much larger positive exponents.
+
+| model | success lambda1 | failure lambda1 | success top-8 mean | failure top-8 mean | success positive count | failure positive count |
+|---|---:|---:|---:|---:|---:|---:|
+| TRM baseline best | +0.0176 | +0.1026 | +0.0080 | +0.0516 | 7.82 | 8.00 |
+| TRM multi4 best | +0.0112 | +0.1035 | +0.0020 | +0.0485 | 3.38 | 8.00 |
+| TRM multi4 final | +0.0191 | +0.0942 | +0.0129 | +0.0449 | 8.00 | 8.00 |
+
+Interpretation: multi4 does not make failures stable; it mainly shifts successful solutions into less chaotic basins. This is the right story for stable attractor training: success becomes less volume-expanding, while bad basins remain chaotic.
+
+## PTRM Same-Subset Result
+
+Existing PTRM paired files already use identical `idx` with `n=1000`, `K=100`, `D=64`, `sigma=0.3`, L-only perturbation.
+
+| metric | baseline best | multi4 best | delta |
+|---|---:|---:|---:|
+| deterministic exact on subset | 0.8870 | 0.9110 | +0.0240 |
+| mean rollout exact | 0.9419 | 0.9542 | +0.0123 |
+| Q-selected exact | 0.9840 | 0.9880 | +0.0040 |
+| oracle pass@100 | 0.9850 | 0.9880 | +0.0030 |
+| correct rollout count / 100 | 94.19 | 95.42 | +1.23 |
+
+Interpretation: PTRM is near saturated on Sudoku, so Q-selected/oracle improvements are small. The stronger signal is that multi4 increases deterministic accuracy and the density of correct stochastic rollouts on the same test subset.
+
+## Caveats
+
+- Full TRM eval uses all 422,786 Sudoku-Extreme test puzzles. PTRM runs above use a fixed 1,000-sample subset, so deterministic PTRM-subset numbers should not be mixed with full-test W&B exact accuracy.
+- The N=512 Lyapunov spectrum is a diagnostic sample, not full-test evaluation.
+- Final checkpoint is not the result to report as the method's performance; it is an overtraining/collapse diagnostic.
+- A larger paired PTRM run is currently queued: `n=10000`, `K=25`, `D=64`, `sigma=0.3`, L-only, same seed. It is slow and should be treated as a robustness check, not a blocker.
+
+## Artifacts
+
+- Headline CSV: `research/flossing/official_gbs768_spectrum/headline_trm_multi4_dynamics_table.csv`
+- Spectrum summary CSV: `research/flossing/official_gbs768_spectrum/summary_n512_k8_seed20260602.csv`
+- PTRM paired CSV: `research/flossing/ptrm_same_subset/paired_ptrm_k100_n1000_seed0_summary.csv`
+- Spectrum plot: `research/flossing/meeting_artifacts/trm_gbs768_spectrum_success_failure_n512.png`
+- Headline bar plot: `research/flossing/meeting_artifacts/trm_gbs768_headline_dynamics_bars.png`
+- PTRM plot: `research/flossing/meeting_artifacts/ptrm_same_subset_k100_n1000_comparison.png`
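The report's two spectrum tables can be cross-checked against each other: if both were computed on the same N=512 dynamics sample (an assumption; the report implies it but does not say so explicitly), each overall mean in the Main Result table should equal the success/failure means from the Success vs Failure Separation table, weighted by the "dynamics sample exact" rate. The following is a minimal sketch of that check, not part of the commit; the dictionary layout and the function names `mixture` and `check` are illustrative.

```python
# Values transcribed from the report's two spectrum tables (four-decimal rounding).
# Tuple layout per metric: (overall, success-only, failure-only).
N = 512  # dynamics sample size stated in the report

REPORT = {
    "TRM baseline best": {
        "exact": 0.8750,
        "lambda1": (0.0282, 0.0176, 0.1026),
        "top8": (0.0135, 0.0080, 0.0516),
        "positive": (7.84, 7.82, 8.00),
    },
    "TRM multi4 best": {
        "exact": 0.9004,
        "lambda1": (0.0204, 0.0112, 0.1035),
        "top8": (0.0066, 0.0020, 0.0485),
        "positive": (3.84, 3.38, 8.00),
    },
    "TRM multi4 final": {
        "exact": 0.8242,
        "lambda1": (0.0323, 0.0191, 0.0942),
        "top8": (0.0185, 0.0129, 0.0449),
        "positive": (8.00, 8.00, 8.00),
    },
}


def mixture(p_success, success_mean, failure_mean):
    """Overall mean implied by per-group means and the success fraction."""
    return p_success * success_mean + (1.0 - p_success) * failure_mean


def check(report, n=N):
    """Return {model: {metric: (reported, implied), 'n_success': int}}."""
    out = {}
    for model, row in report.items():
        p = row["exact"]
        entry = {
            key: (row[key][0], mixture(p, row[key][1], row[key][2]))
            for key in ("lambda1", "top8", "positive")
        }
        # Success count implied by the rounded exact rate on the N-sample.
        entry["n_success"] = round(p * n)
        out[model] = entry
    return out


results = check(REPORT)
for model, entry in results.items():
    print(model, entry)

# Headline gain recomputed from the rounded "full exact" column, in percentage points.
headline_gain_pp = (0.8965 - 0.8686) * 100
print(f"headline gain from rounded table values: {headline_gain_pp:.2f}pp")
```

On the rounded table entries the implied and reported means agree to within rounding for all three checkpoints, which supports reading the overall spectrum shift of multi4 best as driven by its successful trajectories. The headline gain recomputed from the rounded "full exact" column is 2.79pp, one digit off the quoted +2.78pp; that is consistent with the prose figure having been computed from unrounded accuracies.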
