author: YurenHao0426 <blackhao0426@gmail.com>, 2026-06-13 12:35:36 -0500
committer: YurenHao0426 <blackhao0426@gmail.com>, 2026-06-13 12:35:36 -0500
commit: 66e0d8b9fd4d0f7a2231d689c055e26fdf1cf04a
tree: c29cba61124018755a19b02c9d33e3ad5f2e05cc /research/flossing/maze_package/TRANSFER_README.md
rrm workspace: TRM/HRM/SRM code, Maze dataset, dynamical-analysis pipeline (HEAD, main)
Curated export for clone-and-run Maze training (2x A6000) + diagnostics. trm/hrm pretrain.py carry trajectory-augmentation code (backward-compatible). Heavy artifacts (checkpoints/wandb/npz) gitignored; see PROVENANCE.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'research/flossing/maze_package/TRANSFER_README.md')
-rw-r--r-- research/flossing/maze_package/TRANSFER_README.md 36
1 files changed, 36 insertions, 0 deletions
diff --git a/research/flossing/maze_package/TRANSFER_README.md b/research/flossing/maze_package/TRANSFER_README.md
new file mode 100644
index 0000000..6e11076
--- /dev/null
+++ b/research/flossing/maze_package/TRANSFER_README.md
@@ -0,0 +1,36 @@
+# Maze training bundle — transfer to your training machine
+
+## What's in this bundle
+- `maze-30x30-hard-1k/` — the built dataset (seq_len 900, vocab 6, 1000 puzzles ×8 augments).
+- `launch_maze_trm_portable.sh` — path-configurable launcher.
+- `diagnose_trm_joint.py`, `step7_interfloss.py` — diagnostic scripts (only if you run
+ diagnostics on the training machine; otherwise rsync checkpoints back to the lab box).
+
+## On the training machine
+1. Clone the TinyRecursiveModels repo and set up the `rrm` conda env (torch 2.7 cu126,
+ flash-attn 2 for Ampere). If the env doesn't exist, recreate it from the lab box's
+ `env/requirements.txt` / `pip-freeze.txt`.
+2. Put the dataset somewhere, e.g. `~/data/maze-30x30-hard-1k`.
+3. Launch:
+ ```bash
+ TRM_DIR=~/TinyRecursiveModels DATA_DIR=~/data/maze-30x30-hard-1k \
+ bash launch_maze_trm_portable.sh 2 384 # 2x A6000
+ # or: 2 192 # 2x A5000 (->128 if OOM)
+ ```
+ Target ~75% exact accuracy (official). ~18-28h on 2x A6000, ~24-36h on 2x A5000.
+ Saves one checkpoint per 5000 epochs (10 total) — keep all, the evolution analysis needs them.
+
+## After training
+Preferred: `rsync` the whole run checkpoint dir (`checkpoints/maze-.../pretrain_att_maze30x30_*/`)
+back to the lab box and run the existing `analysis_2x2` queue there. The dir must include
+`all_config.yaml` plus the `step_*` files.
+
+If diagnosing on the training machine, two caveats vs the Sudoku runs:
+1. Maze uses the ATTENTION arch (not `mlp_t`). Verify that the `f_L`/`f_H` JVP closures in
+ `diagnose_trm_joint.py` call the attention `L_level` correctly; patch if the signature differs.
+2. seq_len 900 (vs 97) makes per-sample JVP+QR ~9-10x slower. Use n=512 for the headline 2x2,
+ n=256 for the horizon sweep, k_lyap=8.
+
+## Sanity check before the long run
+A 200-step smoke run (epochs=200 eval_interval=200) should complete in minutes and confirm that
+the attention model, flash-attn, and the dataset all load without OOM before you commit to 50k epochs.
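
Note on the "After training" section: the README requires the run directory to contain `all_config.yaml` plus the `step_*` files, and expects 10 checkpoints in total. A minimal sketch of a check to run before `rsync` follows. The helper name `check_run_dir` and the example `step_*` file names are hypothetical and not part of the bundle; the count of 10 is taken from the README text.

```python
from pathlib import Path
import tempfile


def check_run_dir(run_dir, expected_checkpoints=10):
    """Return a list of problems found in a run checkpoint directory.

    Hypothetical helper: it only checks the two requirements the README
    states (all_config.yaml present, the expected number of step_* files).
    """
    run_dir = Path(run_dir)
    problems = []
    if not (run_dir / "all_config.yaml").is_file():
        problems.append("missing all_config.yaml")
    steps = sorted(p.name for p in run_dir.glob("step_*"))
    if len(steps) != expected_checkpoints:
        problems.append(
            f"expected {expected_checkpoints} step_* files, found {len(steps)}"
        )
    return problems


if __name__ == "__main__":
    # Demo on a throwaway directory; the step_* names here are made up.
    with tempfile.TemporaryDirectory() as tmp:
        run = Path(tmp)
        (run / "all_config.yaml").write_text("epochs: 50000\n")
        for i in range(1, 11):
            (run / f"step_{i}").touch()
        print(check_run_dir(run) or "run dir looks complete")
```

An empty list means both requirements are met. The check does not open the checkpoint files, so it says nothing about whether they load.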