# Maze training bundle — transfer to your training machine
## What's in this bundle
- `maze-30x30-hard-1k/` — the built dataset (seq_len 900, vocab 6, 1000 puzzles ×8 augments).
- `launch_maze_trm_portable.sh` — path-configurable launcher.
- `diagnose_trm_joint.py`, `step7_interfloss.py` — diagnostic scripts (only if you run
diagnostics on the training machine; otherwise rsync checkpoints back to the lab box).
## On the training machine
1. Make sure the TinyRecursiveModels repo is cloned and the `rrm` conda env is available
(torch 2.7 cu126, flash-attn 2 for Ampere). If the env doesn't exist, recreate it from the
lab box's `env/requirements.txt` / `pip-freeze.txt`.
2. Put the dataset somewhere, e.g. `~/data/maze-30x30-hard-1k`.
3. Launch:
```bash
TRM_DIR=~/TinyRecursiveModels DATA_DIR=~/data/maze-30x30-hard-1k \
  bash launch_maze_trm_portable.sh 2 384   # 2x A6000
# For 2x A5000, pass `2 192` instead (drop 192 to 128 on OOM).
```
Target: ~75% exact accuracy (official). Expect ~18-28h on 2x A6000 or ~24-36h on 2x A5000.
The run saves one checkpoint every 5000 epochs (10 total over the 50k-epoch run). Keep all of
them; the evolution analysis needs them.
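
Before starting a run of this length, it may be worth confirming that both paths passed to the launcher exist. The sketch below is a hypothetical pre-flight helper (`preflight` is not part of the bundle). It assumes only that `TRM_DIR` and `DATA_DIR` should both be existing directories, as in the launch command above.

```bash
# Hypothetical pre-flight check; not part of the bundle.
# Assumes only that TRM_DIR and DATA_DIR must both be existing directories.
preflight() {
  local trm_dir="$1" data_dir="$2" status=0
  if [ ! -d "$trm_dir" ]; then
    echo "TRM_DIR not found: $trm_dir" >&2
    status=1
  fi
  if [ ! -d "$data_dir" ]; then
    echo "DATA_DIR not found: $data_dir" >&2
    status=1
  fi
  # 0 if both directories exist, 1 otherwise.
  return "$status"
}

# Usage (paths taken from the launch example):
#   preflight ~/TinyRecursiveModels ~/data/maze-30x30-hard-1k && echo "paths OK"
```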
## After training
Preferred: `rsync` the whole run checkpoint dir (`checkpoints/maze-.../pretrain_att_maze30x30_*/`)
back to the lab box and run the existing `analysis_2x2` queue there. The dir must include
`all_config.yaml` plus the `step_*` files.
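
The requirement above can be checked before syncing. This is a minimal sketch, not a script from the bundle: `check_run_dir` is a hypothetical helper that fails if `all_config.yaml` is missing or no `step_*` entries are present, and otherwise prints how many it found.

```bash
# Hypothetical helper; not part of the bundle.
# Fails unless the run dir holds all_config.yaml and at least one step_* entry.
check_run_dir() {
  local run_dir="$1" count=0 entry
  if [ ! -f "$run_dir/all_config.yaml" ]; then
    echo "missing all_config.yaml in $run_dir" >&2
    return 1
  fi
  # Count entries matching step_*; an unmatched glob stays literal, so test existence.
  for entry in "$run_dir"/step_*; do
    if [ -e "$entry" ]; then
      count=$((count + 1))
    fi
  done
  if [ "$count" -eq 0 ]; then
    echo "no step_* files in $run_dir" >&2
    return 1
  fi
  echo "$count"
}

# Usage: pass the actual run directory, e.g.
#   check_run_dir /path/to/run_dir
```

A full run should report 10, matching the checkpoint count given earlier.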
If you diagnose on the training machine instead, note two differences from the Sudoku runs:
1. Maze uses the attention arch (not `mlp_t`). Verify that the `f_L`/`f_H` JVP closures in
`diagnose_trm_joint.py` call the attention `L_level` correctly; patch them if the signature differs.
2. seq_len 900 (vs 97) makes per-sample JVP+QR ~9-10x slower. Use n=512 for the headline 2x2,
n=256 for the horizon sweep, and k_lyap=8.
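
The quoted slowdown is close to the plain ratio of the two sequence lengths, about 9.28. A small sketch of that arithmetic follows; `seq_len_ratio` is a hypothetical helper, and treating cost as roughly linear in seq_len is an assumption, not a measurement.

```bash
# Hypothetical helper: ratio of two sequence lengths, rounded to two decimals.
# Assumption: per-sample JVP+QR cost grows roughly linearly with seq_len.
seq_len_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

seq_len_ratio 900 97   # maze seq_len vs Sudoku seq_len
```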
## Sanity check before the long run
A 200-step smoke run (`epochs=200 eval_interval=200`) should complete in minutes. It confirms
that the attention model, flash-attn, and the dataset all load without OOM before you commit to
50k epochs.