rrm workspace: TRM/HRM/SRM code, Maze dataset, dynamical-analysis pipeline

Curated export for clone-and-run Maze training (2x A6000) + diagnostics. trm/hrm pretrain.py carry trajectory-augmentation code (backward-compatible). Heavy artifacts (checkpoints/wandb/npz) gitignored; see PROVENANCE.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
author: YurenHao0426 <blackhao0426@gmail.com> 2026-06-13 12:35:36 -0500
committer: YurenHao0426 <blackhao0426@gmail.com> 2026-06-13 12:35:36 -0500
commit: 66e0d8b9fd4d0f7a2231d689c055e26fdf1cf04a (patch)
tree: c29cba61124018755a19b02c9d33e3ad5f2e05cc /research/flossing/meeting_artifacts
1 files changed, 57 insertions, 0 deletions
diff --git a/research/flossing/meeting_artifacts/trm_multi4_dynamics_report.md b/research/flossing/meeting_artifacts/trm_multi4_dynamics_report.md
new file mode 100644
index 0000000..184656c
--- /dev/null
+++ b/research/flossing/meeting_artifacts/trm_multi4_dynamics_report.md
@@ -0,0 +1,57 @@
+# TRM multi4 Dynamics Report
+
+Date: 2026-06-02
+
+## Main Result
+
+Official GBS768 Sudoku-Extreme runs show a best-checkpoint accuracy gain from trajectory perturbation training.
+
+| model | step | full exact | full token acc | dynamics sample exact | lambda1 mean | top-8 mean | positive count / 8 |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| TRM baseline best | 58590 | 0.8686 | 0.9508 | 0.8750 | +0.0282 | +0.0135 | 7.84 |
+| TRM multi4 best | 35805 | 0.8965 | 0.9604 | 0.9004 | +0.0204 | +0.0066 | 3.84 |
+| TRM multi4 final | 65100 | 0.8351 | 0.9350 | 0.8242 | +0.0323 | +0.0185 | 8.00 |
+
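+Interpretation: multi4 best improves full-test exact accuracy by about +2.8pp over baseline best (0.8686 to 0.8965), while lowering the top-8 mean Lyapunov exponent on the dynamics sample (+0.0135 to +0.0066) and the mean number of positive exponents (7.84 to 3.84 of 8). The final checkpoint collapses below baseline (0.8351 full exact) and its spectrum becomes broadly positive again (8.00 of 8).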
+
+## Success vs Failure Separation
+
+The N=512 spectrum sample reproduces the original observation: failed trajectories have much larger positive exponents than successful ones (lambda1 of roughly +0.09 to +0.10 versus +0.01 to +0.02).
+
+| model | success lambda1 | failure lambda1 | success top-8 mean | failure top-8 mean | success positive count | failure positive count |
+|---|---:|---:|---:|---:|---:|---:|
+| TRM baseline best | +0.0176 | +0.1026 | +0.0080 | +0.0516 | 7.82 | 8.00 |
+| TRM multi4 best | +0.0112 | +0.1035 | +0.0020 | +0.0485 | 3.38 | 8.00 |
+| TRM multi4 final | +0.0191 | +0.0942 | +0.0129 | +0.0449 | 8.00 | 8.00 |
+
+Interpretation: multi4 does not make failures stable; it mainly shifts successful solutions into less chaotic basins. This is consistent with the stable-attractor training hypothesis: successful trajectories become less volume-expanding, while failure basins remain chaotic.
+
+## PTRM Same-Subset Result
+
+Existing PTRM paired files already use identical `idx` with `n=1000`, `K=100`, `D=64`, `sigma=0.3`, L-only perturbation.
+
+| metric | baseline best | multi4 best | delta |
+|---|---:|---:|---:|
+| deterministic exact on subset | 0.8870 | 0.9110 | +0.0240 |
+| mean rollout exact | 0.9419 | 0.9542 | +0.0123 |
+| Q-selected exact | 0.9840 | 0.9880 | +0.0040 |
+| oracle pass@100 | 0.9850 | 0.9880 | +0.0030 |
+| correct rollout count / 100 | 94.19 | 95.42 | +1.23 |
+
+Interpretation: PTRM is nearly saturated on Sudoku, so the Q-selected and oracle improvements are small. The stronger signal is that multi4 increases both deterministic accuracy and the density of correct stochastic rollouts on the same test subset.
+
+## Caveats
+
+- Full TRM eval uses all 422,786 Sudoku-Extreme test puzzles. PTRM runs above use a fixed 1,000-sample subset, so deterministic PTRM-subset numbers should not be mixed with full-test W&B exact accuracy.
+- The N=512 Lyapunov spectrum is a diagnostic sample, not full-test evaluation.
+- The final checkpoint should not be reported as the method's performance; it serves as an overtraining/collapse diagnostic.
+- A larger paired PTRM run is currently queued: `n=10000`, `K=25`, `D=64`, `sigma=0.3`, L-only, same seed. It is slow and should be treated as a robustness check, not a blocker.
+
+## Artifacts
+
+- Headline CSV: `research/flossing/official_gbs768_spectrum/headline_trm_multi4_dynamics_table.csv`
+- Spectrum summary CSV: `research/flossing/official_gbs768_spectrum/summary_n512_k8_seed20260602.csv`
+- PTRM paired CSV: `research/flossing/ptrm_same_subset/paired_ptrm_k100_n1000_seed0_summary.csv`
+- Spectrum plot: `research/flossing/meeting_artifacts/trm_gbs768_spectrum_success_failure_n512.png`
+- Headline bar plot: `research/flossing/meeting_artifacts/trm_gbs768_headline_dynamics_bars.png`
+- PTRM plot: `research/flossing/meeting_artifacts/ptrm_same_subset_k100_n1000_comparison.png`