research/flossing/trajectory_augmentation_notes.md



# Trajectory Augmentation Notes

## Current Hypothesis

Use Lyapunov-style tiny perturbations of the initial recurrent latent state as
hidden-trajectory augmentation:

```text
x, y unchanged
z0 -> z0 + epsilon
loss = original supervised ACT/QA loss against y
```

This tests task-level attractor stability directly: if the dynamics are
chaotic, even a tiny perturbation of the initial latent state can push the
trajectory out of the correct answer basin. Training on perturbed trajectories
for the same ground-truth pair forces them to still reach `y`.
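
A minimal sketch of the perturbation itself, assuming the latent state can be
treated as a flat list of floats. `perturb_initial_state` and the toy `z0` are
hypothetical names for illustration, not the training script's API; the real
states are the `z_H` and `z_L` tensors.

```python
import random

def perturb_initial_state(z0, sigma, rng):
    """Return a copy of z0 with i.i.d. Normal(0, sigma^2) noise added per element."""
    return [z + rng.gauss(0.0, sigma) for z in z0]

rng = random.Random(0)
z0 = [0.0, 1.0, -2.0, 0.5]  # stand-in for a flattened latent state
z0_perturbed = perturb_initial_state(z0, sigma=1e-3, rng=rng)

# x and y are untouched; only the starting point of the hidden trajectory moves.
max_shift = max(abs(a - b) for a, b in zip(z0, z0_perturbed))
```

The supervised loss is then computed against the same `y` as for the clean
trajectory, so no extra loss term is needed.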

## Running First

Initial queued runs use:

```text
sigma = 1e-3
perturb = z_H and z_L
single_perturbed_ce: one perturbed trajectory, no clean branch
multi_perturbed_ce: clean plus three perturbed trajectories, averaged CE
```

No KL, no Lyapunov loss, no JVP/flossing, no data/input augmentation.

## Backup Experiments

Do not conclude from one noise scale. If the current runs fail or are
ambiguous, test a noise curriculum:

```text
clean CE warmup for N steps
then enable trajectory augmentation
sigma ramp: 0 -> target_sigma
```
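
One way to express this curriculum as a schedule, sketched under the assumption
that the ramp is linear (the notes do not fix its shape). `sigma_at_step`,
`warmup_steps`, and `ramp_steps` are hypothetical names.

```python
def sigma_at_step(step, warmup_steps, ramp_steps, target_sigma):
    """Clean warmup (sigma = 0), then a linear ramp from 0 up to target_sigma."""
    if step < warmup_steps:
        return 0.0  # clean CE only
    if ramp_steps <= 0:
        return target_sigma  # no ramp requested: jump straight to the target
    frac = min(1.0, (step - warmup_steps) / ramp_steps)
    return frac * target_sigma

# Example: 1000 clean steps, then ramp to 1e-3 over the next 4000 steps.
schedule = [sigma_at_step(s, 1000, 4000, 1e-3) for s in (0, 1000, 3000, 5000, 9000)]
```

A sigma of exactly 0 during warmup means the perturbed branch coincides with
the clean one, so the same code path can serve both phases.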

Candidate fixed/ramp target scales:

```text
1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2
```

Instead of treating `sigma` as one fixed value, also test centered noise
distributions around the no-perturbation trajectory:

```text
epsilon ~ Normal(0, sigma^2 I)
sigma sampled per trajectory from LogUniform(sigma_min, sigma_max)
sigma ramped over training, then sampled in a band around the target
mixture: p(clean) delta_0 + (1 - p(clean)) Normal(0, sigma^2 I)
```

The distribution should remain centered at zero so the clean trajectory is the
mean trajectory. The goal is to cover a small ball/shell around `z0`, not to
move the model to a new deterministic offset.
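
The sampling variants above can be sketched as follows, assuming a flat latent
of dimension `dim`. `sample_sigma_loguniform`, `sample_epsilon`, and `p_clean`
are hypothetical names; the interval and mixture weight are illustrative.

```python
import math
import random

def sample_sigma_loguniform(sigma_min, sigma_max, rng):
    """Draw sigma so that log(sigma) is uniform on [log(sigma_min), log(sigma_max)]."""
    return math.exp(rng.uniform(math.log(sigma_min), math.log(sigma_max)))

def sample_epsilon(dim, sigma_min, sigma_max, p_clean, rng):
    """Mixture: exactly zero with probability p_clean, else centered Gaussian noise."""
    if rng.random() < p_clean:
        return [0.0] * dim
    sigma = sample_sigma_loguniform(sigma_min, sigma_max, rng)
    return [rng.gauss(0.0, sigma) for _ in range(dim)]

rng = random.Random(0)
draws = [sample_epsilon(8, 3e-5, 3e-3, p_clean=0.25, rng=rng) for _ in range(2000)]

# Empirical checks: share of clean draws and the mean noise component.
clean_fraction = sum(all(v == 0.0 for v in d) for d in draws) / len(draws)
mean_component = sum(sum(d) for d in draws) / (len(draws) * 8)
```

Every component of the mixture has zero mean, so the clean trajectory stays the
center of the sampled neighborhood regardless of how `sigma` is drawn.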

Also compare perturb locations:

```text
z_H only
z_L only
z_H and z_L
```

The expected success signature is not necessarily `lambda1 < 0` everywhere;
more informative signals are a higher perturbed-rollout success rate, fewer
broad positive modes, and no regression in deterministic clean accuracy.
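
To illustrate why perturbed-rollout success is tracked separately from clean
accuracy, here is a toy sketch. `toy_rollout` is a hypothetical 1-D stand-in
for a model rollout, not HRM or TRM, and `perturbed_success_rate` is an assumed
helper name.

```python
import math
import random

def toy_rollout(x, z0, n_steps=8):
    """Iterate a 1-D bistable map from z0, then decode a binary label."""
    z = z0
    for _ in range(n_steps):
        z = math.tanh(2.0 * z + x)
    return 1 if z >= 0 else 0

def perturbed_success_rate(rollout, x, y, z0, sigma, n_trials, rng):
    """Fraction of rollouts started from z0 + epsilon that still decode to y."""
    hits = 0
    for _ in range(n_trials):
        eps = rng.gauss(0.0, sigma)
        hits += int(rollout(x, z0 + eps) == y)
    return hits / n_trials

rng = random.Random(0)
# Input that pushes the state well inside the basin of label 1.
stable = perturbed_success_rate(toy_rollout, 0.5, 1, 0.0, 1e-3, 200, rng)
# Input that leaves z0 exactly on the basin boundary.
fragile = perturbed_success_rate(toy_rollout, 0.0, 1, 0.0, 1e-3, 200, rng)
clean_correct = toy_rollout(0.0, 0.0) == 1
```

In the fragile case the clean rollout is correct, yet roughly half of the
perturbed rollouts land in the other basin, which a clean-only evaluation would
not reveal.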

## Long-Train Engineering Note

`step9_trajectory_perturb_train.py` now supports two rollout implementations:

```text
serial_act:
  old path; K trajectories are rolled out one by one through the ACT wrapper

parallel_fixed:
  B -> B*K
  first rollout is clean for multi_perturbed_ce
  remaining rollouts sample centered perturbations
  run exactly halt_max_steps without ACT streaming reset
  average supervised loss across B*K trajectories
```

`parallel_fixed` is the default for new runs. It deliberately avoids the ACT
wrapper's halted-sample reset, because reset would make early-halted copies of
the same `(x,y)` repeat inside one optimizer step.

Preferred options:

```text
1. Fixed-unroll multiK:
   B -> B*K
   repeat x,y K times
   initialize clean/noisy z0 variants
   run exactly halt_max_steps for all trajectories
   compute supervised loss on fixed rollout outputs
```

This is simplest and matches the stability question: all initial-neighborhood
trajectories should reach `y` after the same reasoning budget.
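
A framework-agnostic sketch of option 1, using plain lists in place of tensors.
`expand_batch`, `fixed_unroll`, and `toy_step` are hypothetical names, and the
squared error is only a stand-in for the supervised ACT/QA loss.

```python
import random

def expand_batch(xs, ys, z0s, K, sigma, rng, first_clean=True):
    """Repeat each (x, y) K times and build clean/noisy z0 variants (B -> B*K)."""
    out_x, out_y, out_z = [], [], []
    for x, y, z0 in zip(xs, ys, z0s):
        for k in range(K):
            clean = first_clean and k == 0
            z = list(z0) if clean else [v + rng.gauss(0.0, sigma) for v in z0]
            out_x.append(x)
            out_y.append(y)
            out_z.append(z)
    return out_x, out_y, out_z

def fixed_unroll(step_fn, x, z, n_steps):
    """Run exactly n_steps: no halting and no mid-rollout reset."""
    for _ in range(n_steps):
        z = step_fn(x, z)
    return z

def toy_step(x, z):
    """Toy contraction toward x, standing in for one recurrent update."""
    return [0.5 * v + 0.5 * x for v in z]

rng = random.Random(0)
xs, ys = [1.0, -1.0], [1.0, -1.0]
z0s = [[0.0, 0.0], [0.0, 0.0]]
bx, by, bz = expand_batch(xs, ys, z0s, K=4, sigma=1e-3, rng=rng)
finals = [fixed_unroll(toy_step, x, z, n_steps=16) for x, z in zip(bx, bz)]

# Average the per-trajectory loss over all B*K rollouts.
losses = [sum((v - y) ** 2 for v in zf) / len(zf) for zf, y in zip(finals, by)]
mean_loss = sum(losses) / len(losses)
```

Because every copy runs for the same number of steps, the K copies of one
`(x,y)` stay aligned in the batch from the first step to the last.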

```text
2. ACT-mask multiK:
   B -> B*K
   repeat x,y K times
   run ACT in parallel
   maintain active_mask
   after a trajectory halts, zero/mask later loss contributions
   normalize per trajectory or by valid trajectory-steps
```

Do not naively concatenate `B*K` and use the unmodified ACT streaming semantics
without masking. The wrapper resets halted samples and reloads data, which is
correct for ordinary streaming training but would make early-halted copies of
the same `(x,y)` repeat inside one optimizer step.
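
The masking and the two normalization choices in option 2 can be sketched as
below. `masked_loss` and its arguments are hypothetical, and the sketch assumes
every trajectory is active for at least one step.

```python
def masked_loss(step_losses, halt_steps, normalize="per_trajectory"):
    """Average losses over active steps only.

    step_losses[i][t] is the loss of trajectory i at step t;
    halt_steps[i] is the number of steps trajectory i was active.
    """
    sums, counts = [], []
    for losses, n_active in zip(step_losses, halt_steps):
        mask = [1.0 if t < n_active else 0.0 for t in range(len(losses))]
        sums.append(sum(l * m for l, m in zip(losses, mask)))
        counts.append(sum(mask))
    if normalize == "per_trajectory":
        # Each trajectory contributes equally, however early it halted.
        return sum(s / c for s, c in zip(sums, counts)) / len(sums)
    if normalize == "per_step":
        # Each valid trajectory-step contributes equally.
        return sum(sums) / sum(counts)
    raise ValueError(f"unknown normalize mode: {normalize}")

step_losses = [[1.0, 1.0, 9.0, 9.0],   # halts after 2 steps; later values are masked
               [2.0, 2.0, 2.0, 2.0]]   # runs all 4 steps
halt_steps = [2, 4]
per_traj = masked_loss(step_losses, halt_steps, "per_trajectory")
per_step = masked_loss(step_losses, halt_steps, "per_step")
```

Per-step normalization weights long-running trajectories more heavily, so the
two modes differ whenever halting times vary across the K copies.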

## Current Long Runs

Started 2026-05-27:

```text
step9_E_hrm_baseline_parallel_fixed_26040_50k
step9_F_hrm_multi4_loguniform_ramp_26040_50k
step9_G_trm_baseline_parallel_fixed_26041_batch4_50k
step9_H_trm_multi4_loguniform_ramp_26041_batch4_50k
```

Perturb runs use:

```text
K = 4
noise_sampling = loguniform
sigma interval final = [3e-5, 3e-3]
sigma ramp = 0 -> final interval over 5000 steps
perturb = z_H and z_L
eval_every = 2500
eval_n = 1024
save_every_eval + save_best + save_final
```
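
A sketch of how the ramped sigma interval could be computed, assuming both
endpoints scale linearly from 0 to their final values (the exact ramp used by
the script is not stated here). `sigma_interval_at_step` and `sample_sigma`
are hypothetical names.

```python
import math
import random

def sigma_interval_at_step(step, ramp_steps=5000, final_interval=(3e-5, 3e-3)):
    """Scale both interval endpoints linearly from 0 to their final values."""
    frac = min(1.0, max(0.0, step / ramp_steps))
    lo, hi = final_interval
    return frac * lo, frac * hi

def sample_sigma(step, rng):
    """LogUniform draw from the ramped interval; step 0 degenerates to sigma = 0."""
    lo, hi = sigma_interval_at_step(step)
    if hi <= 0.0:
        return 0.0  # log(0) is undefined, so treat the start of the ramp as clean
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

rng = random.Random(0)
mid_interval = sigma_interval_at_step(2500)
late_sigma = sample_sigma(50000, rng)
```

Scaling both endpoints by the same factor keeps the interval two decades wide
throughout the ramp.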

These runs include fixed-unroll baselines because the fixed-unroll objective is
not identical to the old ACT-streaming baseline.

## Long-Run Result Snapshot

Completed 2026-05-27:

```text
HRM fixed baseline:
  initial 0.5176, best 0.6328 @ 5000, final 0.5801

HRM multi4 loguniform:
  initial 0.5176, best 0.6250 @ 7500, final 0.5889

TRM fixed baseline:
  initial 0.5615, best 0.5947 @ 22500, final 0.4971

TRM multi4 loguniform:
  initial 0.5615, best 0.6084 @ 42500, final 0.5508
```

Interpretation: HRM got no best-accuracy gain from multi4 (0.6250 vs 0.6328),
though its final accuracy decayed slightly less from the best (0.0361 vs
0.0527). TRM got both a higher best (0.6084 vs 0.5947) and a much smaller
final decay (0.0576 vs 0.0976) under multi4, matching the earlier 10k signal
that trajectory augmentation may raise the ceiling or reduce regression, but is
still unstable.
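
The comparison can be recomputed from the snapshot numbers. The metric names
(`gain`, `decay`, `net`) are labels chosen here for illustration: best minus
initial, best minus final, and final minus initial.

```python
results = {
    "HRM fixed baseline":    {"initial": 0.5176, "best": 0.6328, "final": 0.5801},
    "HRM multi4 loguniform": {"initial": 0.5176, "best": 0.6250, "final": 0.5889},
    "TRM fixed baseline":    {"initial": 0.5615, "best": 0.5947, "final": 0.4971},
    "TRM multi4 loguniform": {"initial": 0.5615, "best": 0.6084, "final": 0.5508},
}

summary = {}
for name, r in results.items():
    summary[name] = {
        "gain": r["best"] - r["initial"],   # how far the run climbed
        "decay": r["best"] - r["final"],    # how much it lost after its peak
        "net": r["final"] - r["initial"],   # where it ended relative to the start
    }

for name, s in summary.items():
    print(f"{name:24s} gain={s['gain']:+.4f} decay={s['decay']:.4f} net={s['net']:+.4f}")
```

Both TRM runs finish below their initial accuracy (net -0.0644 for the baseline
and -0.0107 for multi4), while both HRM runs finish above it.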

## Resume Support

Model-only resume has always worked if `--ckpt-root` points to the original
config directory and `--ckpt-name` is an absolute path to a saved `best.pt` or
`final.pt`.

Exact training-state resume is now supported:

```text
--save-train-state
  writes latest_state.pt, best_state.pt, final_state.pt

--resume-state path/to/latest_state.pt
  restores model weights, optimizer state, current train_step, best_acc, and RNG
```

The launcher now enables `--save-train-state` for future long runs.
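
Restoring the RNG matters because the perturbations are drawn from it: without
it, a resumed run would see a different noise stream than an uninterrupted one.
The sketch below shows the idea with the standard library only. It is not the
script's checkpoint format; `save_state_sketch` and `load_state_sketch` are
hypothetical, model weights and optimizer state are omitted, and the step and
accuracy values are placeholders.

```python
import os
import pickle
import random
import tempfile

def save_state_sketch(path, train_step, best_acc, rng):
    """Persist the step counter, best accuracy, and the RNG state."""
    state = {"train_step": train_step, "best_acc": best_acc,
             "rng_state": rng.getstate()}
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_state_sketch(path, rng):
    """Restore the RNG in place and return (train_step, best_acc)."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    rng.setstate(state["rng_state"])
    return state["train_step"], state["best_acc"]

rng = random.Random(0)
_ = [rng.gauss(0.0, 1e-3) for _ in range(10)]  # noise consumed before the checkpoint

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "latest_state.pkl")
    save_state_sketch(path, train_step=2500, best_acc=0.5, rng=rng)
    expected_next = [rng.gauss(0.0, 1e-3) for _ in range(3)]  # uninterrupted run

    resumed_rng = random.Random(123)  # fresh process with a different seed
    step, best = load_state_sketch(path, resumed_rng)
    resumed_next = [resumed_rng.gauss(0.0, 1e-3) for _ in range(3)]

same_noise = resumed_next == expected_next
```

After loading, the resumed generator produces the same draws as the original
would have, so the perturbation sequence continues as if training had not
stopped.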