# Maze-Hard package (E8) — train on dedicated cards, diagnose after

## Contents
- `launch_maze_trm.sh`: official TRM Maze recipe (att variant, 50k epochs), for 1–2 GPUs.
- Dataset: already at `/home/yurenh2/rrm/data/maze-30x30-hard-1k` (built 2026-06-13;
  seq_len 900 = 30×30, vocab 6, 1000 puzzles ×8 dihedral augments).

## Run
```bash
bash launch_maze_trm.sh 2 384   # 2x A6000
bash launch_maze_trm.sh 2 192   # 2x A5000 (->128 if OOM)
```
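
The "->128 if OOM" fallback above can be scripted. The following is a minimal sketch, not part of the package: `launch_with_fallback` and `USED_BS` are hypothetical names, and it assumes that the launcher's second argument is the batch size and that an OOM makes the script exit non-zero. Neither assumption is confirmed here.

```bash
#!/usr/bin/env bash
# Sketch only. Assumptions (not confirmed by this README):
#   - arg 1 of the launcher is the GPU count, arg 2 the batch size;
#   - an OOM makes the launcher exit non-zero.
# Note: ANY non-zero exit triggers the next, smaller size, not just OOM.
USED_BS=""

launch_with_fallback() {
  local script=$1 gpus=$2
  shift 2
  local bs
  for bs in "$@"; do
    echo "launching with second argument ${bs}" >&2
    if bash "$script" "$gpus" "$bs"; then
      USED_BS=$bs   # remember which size completed
      return 0
    fi
  done
  return 1          # every size failed
}
```

Usage would be `launch_with_fallback launch_maze_trm.sh 2 192 128`. Because the loop cannot tell an OOM from any other failure, check the log before trusting a run that fell through to the smaller size.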
Target: ~75% exact accuracy (the official figure). The run saves a checkpoint every 5000 epochs
(10 checkpoints over the 50k epochs); these are needed for the evolution analysis.
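
Before starting the analysis it may be worth confirming that all 10 checkpoints were written. A minimal sketch, assuming the layout this README gives for the diagnostics pipeline (one `all_config.yaml` plus entries named `step_<N>` directly inside the checkpoint dir); `check_ckpt_dir` is a hypothetical helper, not part of the package.

```bash
#!/usr/bin/env bash
# Sketch only. Prints the number of step_* entries in a checkpoint dir and
# returns non-zero if all_config.yaml is missing or the count differs from
# the expected one (default 10).
check_ckpt_dir() {
  local dir=$1 expected=${2:-10}
  [ -f "$dir/all_config.yaml" ] || {
    echo "missing all_config.yaml in $dir" >&2
    return 1
  }
  local n
  n=$(find "$dir" -maxdepth 1 -name 'step_*' | wc -l)
  n=$((n))   # normalise: some wc implementations pad with spaces
  echo "$n"
  [ "$n" -eq "$expected" ]
}
```

Call it as `check_ckpt_dir /path/to/checkpoint_dir 10`, where the path is whatever directory the launcher writes to.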

## After training: diagnostics
The 2x2 / FTLE pipeline reads any TRM checkpoint dir (`all_config.yaml` + `step_N`). Two caveats
vs Sudoku, to verify on first run:
1. ATTENTION arch (not `mlp_t`): confirm that the JVP path in `diagnose_trm_joint.py` runs on att
   blocks (Sudoku used `mlp_t`). If the `L_level` call signature differs, patch the `f_L`/`f_H` closures.
2. seq_len 900 vs 97 → per-sample JVP+QR cost ~9-10x Sudoku. Use n=512 for the headline 2x2
   and n=256 for the horizon sweep; `k_lyap=8` unchanged. Budget ~0.5-1 day on one card, or
   rsync checkpoints back to the lab box and run via the `analysis_2x2` queue.
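
The ~9-10x figure in caveat 2 matches the ratio of sequence lengths if per-sample JVP+QR cost scales roughly linearly in seq_len. That linear scaling is an assumption here; the quadratic term in attention could push the real ratio higher, which is one more reason to time the first run.

$$
\frac{\text{cost}_{\text{Maze}}}{\text{cost}_{\text{Sudoku}}} \approx \frac{900}{97} \approx 9.3
$$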

## What Maze closes
Kills the "Sudoku-only" limitation. Pre-registered prediction (write BEFORE looking, for the
paper's credibility): if the wandering-not-settling decomposition is architecture/task-general,
Maze should show B≈0 (failures don't settle) and the same concurrent-not-antecedent horizon
profile. A DIFFERENT result (e.g. Maze failures do settle) is also publishable — it bounds the
claim's scope. Either way the decomposition gets a second task.
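
One lightweight way to show the prediction was written before looking is to log a hash of the prediction file with a timestamp. This is a sketch under stated assumptions: `record_prediction` and the file names are hypothetical, and `sha256sum` (GNU coreutils) is assumed to be available.

```bash
#!/usr/bin/env bash
# Sketch only; file names are hypothetical. Appends a UTC timestamp and the
# SHA-256 of a prediction file to a log, so the text can later be shown to be
# unchanged. Requires sha256sum (GNU coreutils).
record_prediction() {
  local file=$1 log=${2:-prereg.log}
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  local hash
  hash=$(sha256sum "$file" | cut -d' ' -f1)
  printf '%s  %s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$hash" "$file" >> "$log"
}
```

A local log can be edited after the fact, so committing the prediction file and log to the repository before running the diagnostics gives a stronger record.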