ep_run/GPT55_BUG_HUNT.md



# GPT55 EP/AEP Correctness Bug Hunt

Scope: static review of `lt_ep_train.py`, `holo_ep.py`, `asym_probe.py`, with context from `../EP_BELOW210_DIAGNOSIS_FIX.md`. I did not modify training code and did not run training or GPU jobs. I used only tiny CPU checks for formula consistency.

## Executive Summary

The main suspect for the reported `cos(g_EP, g_transpose) ~= 0.94` plateau, which is flat over `hr`, is not beta noise but the absence of any nudged-phase/adjoint convergence check in the code. A tight free-phase residual does not imply that the two-phase AEP contrast has converged to the fixed-point adjoint. In the linearized corrected dynamics, the finite-`T2` error is independent of `hr`, which matches the clue exactly.

I did not find a sign flip in the deterministic AEP correction. The real two-phase contrast, holomorphic two-phase contrast, and exact-adjoint probe sign conventions are mutually consistent. I did find several real correctness hazards: the plain-EP AEP correction clip invalidates the transpose correction, stochastic `fnoise` makes JVP/VJP not derivatives of the same force realization, `t1max` trains at a refined fixed point while evaluation/BPTT use the finite `T1` state, and `asym_probe.py` has probe-specific footguns that can mislead the diagnosis.

## Ranked Findings

### 1. Finite nudged/adjoint relaxation is unchecked; this explains an `hr`-flat cosine plateau

Files/lines:
- `lt_ep_train.py:163-180` runs plain nudges for exactly `T2` steps.
- `lt_ep_train.py:181-197` computes holomorphic/AEP `a` through `holo_a_track`, `holo_a_select2`, `holo_a_select`, or `holo_a`, but ignores whether the nudged contrast actually converged.
- `holo_ep.py:179-185` updates the two real phases for one finite step at a time.
- `holo_ep.py:193-211` selects a snapshot by smallest inter-snapshot increment, not by an adjoint residual.
- `holo_ep.py:229-254` does the same for common-mode tracking.
- `asym_probe.py:727-743` changes `beta/hr` in the diagnostic sweep but keeps the nudged relaxation budget fixed.
- `asym_probe.py:582-595` interprets beta-sweep behavior without a finite-`T2` branch.

What is wrong:
The estimator validity logic checks/refines the free phase, but not the nudged phase. For the corrected local dynamics, with `M = I + eps * J^T` and `ell = dL/dz`, the two-phase contrast approximately obeys:

```text
a_{t+1} = M a_t + eps * ell
fixed point: J^T a = -ell
```

The finite-`T2` error is `M^T2` times the initial adjoint error and is independent of the nudge radius `r/hr` in the linear regime. Therefore a cosine plateau that is flat across `hr=0.04..0.8` is a signature of under-converged nudged/adjoint relaxation, not evidence that beta noise or finite-beta bias has been ruled out.
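
The recurrence above can be checked on a toy problem. The following is a minimal NumPy sketch, assuming a hand-picked 2x2 Jacobian `J`, step size `eps`, and loss gradient `ell` (all illustrative values, none taken from the training code). It relaxes both nudged phases explicitly in the linear regime and compares the contrast against a direct solve of `J^T a = -ell`:

```python
import numpy as np

# Illustrative values only; not taken from lt_ep_train.py.
J = np.array([[-1.0, 0.8], [-0.8, -0.2]])   # asymmetric, stable toy Jacobian
ell = np.array([1.0, -0.5])                  # toy dL/dz at the free fixed point
eps = 0.1

def two_phase_contrast(beta, T2):
    """Relax both nudged phases for T2 steps under the corrected (J^T) dynamics."""
    zp = np.zeros(2)   # displacement of the +beta phase from the free fixed point
    zm = np.zeros(2)   # displacement of the -beta phase
    for _ in range(T2):
        zp = zp + eps * (J.T @ zp - beta * ell)
        zm = zm + eps * (J.T @ zm + beta * ell)
    return (zm - zp) / (2.0 * beta)

a_star = np.linalg.solve(J.T, -ell)   # fixed point: J^T a = -ell

def rel_err(beta, T2):
    """Relative distance of the truncated contrast from the exact adjoint."""
    a = two_phase_contrast(beta, T2)
    return np.linalg.norm(a - a_star) / np.linalg.norm(a_star)

# Rows sweep the relaxation budget, columns sweep the nudge radius.
for T2 in (5, 50, 200):
    print(T2, [f"{rel_err(b, T2):.3e}" for b in (0.04, 0.2, 0.8)])
```

In this linear toy each row is constant across the nudge radius up to rounding, while the error shrinks as `T2` grows. The real force is nonlinear, so the radius independence only holds approximately there.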

Why it matters:
This corrupts training gradients directly. It also corrupts the diagnosis, because the probe can report a tight free-phase residual while the AEP contrast is still a truncated adjoint solve. As the operator hardens once CE drops below about 2.1, the adjoint relaxation can slow down even if the free phase is tight.

Severity: corrupts TRAINING and DIAGNOSIS/probe.

Confidence: high. The line-level behavior matches the `hr`-insensitive clue exactly.

Minimal fix:
Add a nudged/adjoint convergence criterion. At minimum, return and log `inc_min / (||a_best|| + eps)` from `holo_a_select2` and `holo_a_track`, and sweep `t2sel` in `asym_probe.py` at fixed `hr`. Better: compute an adjoint residual proxy at the selected `a`, e.g. `||J^T a + ell|| / (||ell|| + eps)` using VJP of the full thick force at the free state, and keep nudging until it is below tolerance. Treat `hr` sweeps as inconclusive unless `T2/t2sel` convergence is also demonstrated.
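
The residual-based stopping rule could look roughly like the sketch below. It is not the training code: `vjp` is a hypothetical callable returning `J^T a` (in practice a VJP of the full thick force at the free state), the direct adjoint iteration stands in for the nudged loop, and the matrix and tolerances are illustrative assumptions.

```python
import numpy as np

def adjoint_residual(vjp, a, ell, tiny=1e-12):
    """Relative residual ||J^T a + ell|| / (||ell|| + tiny) of the adjoint equation."""
    return np.linalg.norm(vjp(a) + ell) / (np.linalg.norm(ell) + tiny)

def relax_adjoint(vjp, ell, eps, tol=1e-4, max_steps=10_000):
    """Iterate a <- a + eps * (J^T a + ell) until the residual drops below tol."""
    a = np.zeros_like(ell)
    for step in range(max_steps):
        res = adjoint_residual(vjp, a, ell)
        if res < tol:
            return a, res, step
        a = a + eps * (vjp(a) + ell)
    # Budget exhausted: return the last iterate so the caller can log the failure.
    return a, adjoint_residual(vjp, a, ell), max_steps

# Toy usage: an explicit matrix stands in for the VJP of the force.
J = np.array([[-1.0, 0.8], [-0.8, -0.2]])
ell = np.array([1.0, -0.5])
a, res, steps = relax_adjoint(lambda v: J.T @ v, ell, eps=0.1)
print(f"steps={steps} residual={res:.2e}")
```

Returning the residual and the step count alongside `a` lets the caller log both, so a run that hits `max_steps` is visible instead of silently truncated.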

### 2. Plain-EP AEP correction is norm-clipped, which invalidates the transpose correction

Files/lines:
- `lt_ep_train.py:172-178` computes `corr = Jv - JTv`, then replaces it by `corr * (fn / cn)` when `||corr|| > ||f||`.

What is wrong:
The AsymEP correction is algebraic: subtract `Jv - J^T v` so the local Jacobian becomes `J^T`. Scaling that correction by a state-dependent factor changes the corrected dynamics to something like `J - alpha(J - J^T)`, with `alpha < 1` exactly when asymmetry is large. That is no longer the transpose dynamics.
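
A small NumPy check of this algebra, using an illustrative 2x2 Jacobian and treating the clip factor as a fixed scalar `alpha` (in the real code it is state-dependent):

```python
import numpy as np

J = np.array([[-1.0, 0.8], [-0.8, -0.2]])   # illustrative asymmetric Jacobian
v = np.array([0.3, -1.1])                    # arbitrary displacement

def corrected_response(J, v, alpha):
    """Linear response J v minus alpha times the AEP correction (J v - J^T v)."""
    corr = J @ v - J.T @ v
    return J @ v - alpha * corr

def effective_jacobian(J, alpha):
    """Jacobian that the scaled correction actually implements."""
    return J - alpha * (J - J.T)

for alpha in (1.0, 0.5):
    J_eff = effective_jacobian(J, alpha)
    # The scaled correction is exactly the linear map J_eff.
    assert np.allclose(corrected_response(J, v, alpha), J_eff @ v)
    print(f"alpha={alpha}: effective Jacobian equals J^T? {np.allclose(J_eff, J.T)}")
```

With `alpha = 1` the effective Jacobian is `J^T`. With `alpha = 0.5` it is the symmetric part `(J + J^T) / 2`, which is neither `J` nor `J^T` for an asymmetric `J`.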

Why it matters:
This can create a systematic estimator bias in plain EP (`holo=0`). It is especially dangerous near the regime where the antisymmetric correction is large, which is the regime this correction is supposed to fix.

Severity: corrupts TRAINING for the plain real two-phase path. It does not affect the current `holo_a_select2`/`holo_a_track` path, which does not use this clip.

Confidence: high.

Minimal fix:
Remove the correction clip. If stability is needed, clip the total update, reduce `eps`, reduce `beta/hr`, or reject/halve the nudged step while logging that the estimator left its validity region. Do not scale only the antisymmetric correction.

### 3. With `fnoise > 0`, JVP/VJP are not derivatives of the same force realization

Files/lines:
- `lt_ep_train.py:87-90` samples fresh multiplicative noise inside `_noisy`.
- `lt_ep_train.py:92-97` applies `_noisy` inside `nc_force`.
- `lt_ep_train.py:171-178` evaluates the noisy force and then separately calls JVP/VJP through `nc_force`.
- `holo_ep.py:150-151` injects fresh noise into `rforce`.
- `holo_ep.py:176-185` and `holo_ep.py:224-239` use `blk.nc_force` for correction JVP/VJP, which can sample different noise again.

What is wrong:
When `fnoise > 0`, the forward force, JVP force, and VJP force are separate random functions. The correction is no longer `Jv - J^T v` for the same operator used in the state update, and JVP and VJP are not even transposes of the same sampled Jacobian.

Why it matters:
This corrupts AEP training in the noisy hardware simulation path and makes `navg` average a mixture of stochastic bias and stochastic noise. With `fnoise=0`, this issue is inactive.

Severity: corrupts TRAINING when `--fnoise > 0`; otherwise inactive.

Confidence: high.

Minimal fix:
Do not sample inside differentiable force functions. Sample a fixed noise mask/device realization outside and pass it into both the forward force and the JVP/VJP force, or keep dynamic per-pass noise out of AEP correction and use deterministic mismatch for differentiable hardware probes.
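
The effect of the fix can be illustrated with a linear stand-in for the force. This is a sketch under stated assumptions: the force is `F(z) = (W * mask) @ z`, `sigma` plays the role of `fnoise`, and `sample_mask`, `jvp`, and `vjp` are hypothetical helpers, not functions from the codebase. The transpose identity `u . (J v) = v . (J^T u)` holds when both products share one noise realization and fails when each call samples its own.

```python
import numpy as np

W = np.array([[-1.0, 0.8], [-0.8, -0.2]])   # illustrative weights
sigma = 0.1                                  # stand-in for fnoise

def sample_mask(rng):
    """One multiplicative-noise realization, drawn outside the force."""
    return 1.0 + sigma * rng.standard_normal(W.shape)

def jvp(mask, v):
    """J v for the linear force F(z) = (W * mask) @ z."""
    return (W * mask) @ v

def vjp(mask, u):
    """J^T u for the same force and the same mask."""
    return (W * mask).T @ u

rng = np.random.default_rng(0)
u, v = np.array([1.0, 2.0]), np.array([-0.5, 0.7])

# One mask shared by both products: they are transposes of the same Jacobian.
shared = sample_mask(rng)
gap_shared = abs(u @ jvp(shared, v) - v @ vjp(shared, u))

# Fresh noise per call, mimicking sampling inside the differentiable force.
gap_fresh = abs(u @ jvp(sample_mask(rng), v) - v @ vjp(sample_mask(rng), u))

print(f"shared mask gap: {gap_shared:.2e}, fresh noise gap: {gap_fresh:.2e}")
```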

### 4. `t1max` trains the EP task gradient at a refined state while eval/BPTT use the finite-`T1` state

Files/lines:
- `lt_ep_train.py:143-152` first computes `zT` at `T1`, then optionally refines `zs` up to `t1max`.
- `lt_ep_train.py:203-210` computes the EP block gradient and readout gradient at refined `zs`.
- `lt_ep_train.py:260-265` BPTT differentiates exactly `T1` unrolled steps.
- `lt_ep_train.py:279-286` validation evaluates exactly `T1` relaxed steps.

What is wrong:
With `t1max > T1`, EP optimizes the refined fixed-point state, while the reported validation objective and BPTT reference use the finite-`T1` state. That is a real objective mismatch.

Why it matters:
If `z_T1` drifts away from the refined fixed point, EP can improve the wrong state while validation and the practical finite-time model degrade. This was also identified in `../EP_BELOW210_DIAGNOSIS_FIX.md:11-20`.

Severity: corrupts TRAINING when `t1max > T1` and the finite-`T1` state is the real objective.

Confidence: high.

Minimal fix:
Choose one objective and make all paths use it. If the objective is finite `T1`, compute the EP gradient/readout gradient at `zT` or add a principled finite-time/contraction term. If the objective is the fixed point, evaluate and compare BPTT against the same refined state.
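
A scalar toy shows how far the two objectives can diverge. It is not the model: the relaxation `z <- z + eps * (-k * z + x)`, the quadratic loss, and every constant below are illustrative assumptions chosen so that `T1` steps leave the state far from its fixed point.

```python
def relax(k, x, eps, steps):
    """Forward-Euler relaxation z <- z + eps * (-k * z + x), starting from z = 0."""
    z = 0.0
    for _ in range(steps):
        z = z + eps * (-k * z + x)
    return z

def loss(z, y):
    return 0.5 * (z - y) ** 2

def dloss_dk(k, steps, x=1.0, y=3.0, eps=0.1, h=1e-6):
    """Central finite difference of the loss w.r.t. k for a given relaxation budget."""
    lp = loss(relax(k + h, x, eps, steps), y)
    lm = loss(relax(k - h, x, eps, steps), y)
    return (lp - lm) / (2.0 * h)

k, T1, t1max = 0.2, 20, 5000
g_finite = dloss_dk(k, T1)        # gradient of the finite-T1 objective (eval/BPTT view)
g_refined = dloss_dk(k, t1max)    # gradient of the refined fixed-point objective
print(f"finite-T1 gradient: {g_finite:+.3f}, refined gradient: {g_refined:+.3f}")
```

In this toy the two gradients have opposite signs, so a step that improves the refined state makes the finite-`T1` state worse. Whether the real model behaves this badly depends on how far `z_T1` sits from the refined fixed point.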

### 5. `resreg` is hard-wired to thick `tforce` and its scaling includes already-added non-task gradients

Files/lines:
- `lt_ep_train.py:220-224` computes the residual penalty with `blk.tforce(zT, xin0)`.
- `lt_ep_train.py:225-228` scales by `gtask` after prior task and `jacreg` gradients may already be in `grads`.

What is wrong:
`blk.tforce` is the thick-block force only (`lt_ep_train.py:81-85`). If `resreg` is used with `attn_mode` other than `thick`, the residual penalty is for the wrong dynamics. Also, `gtask` is described as task-gradient norm but includes any gradients already added to `grads`, including `jacreg` from `lt_ep_train.py:211-219`.

Why it matters:
This can apply a residual penalty in the wrong direction for non-thick modes and makes the `resreg` ratio slightly different from its stated meaning.

Severity: corrupts TRAINING conditionally: non-thick `resreg` is high risk; thick with `jacreg` is a smaller scaling bug.

Confidence: high.

Minimal fix:
Guard `resreg` with `assert blk.attn_mode == 'thick'` or compute the residual through `blk.force(..., cg=True)` for the active mode. Capture the pure task-gradient norm before adding `jacreg` and `resreg`.
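
The ordering fix for `gtask` can be sketched as follows. The exact scaling rule in `lt_ep_train.py` is not reproduced here; this sketch assumes, purely for illustration, that the `resreg` gradient is rescaled to `ratio` times the task-gradient norm, and `combine`, `accumulate`, and `norm` are hypothetical helpers.

```python
import math

def norm(grads):
    """Global L2 norm over a dict of gradient vectors."""
    return math.sqrt(sum(g * g for vec in grads.values() for g in vec))

def accumulate(grads, extra, scale=1.0):
    """Add scale * extra into grads in place."""
    for name, vec in extra.items():
        grads[name] = [g + scale * e for g, e in zip(grads[name], vec)]

def combine(task, jacreg, resreg, ratio):
    """Hypothetical combiner: resreg is rescaled to ratio * ||task gradient||."""
    grads = {name: list(vec) for name, vec in task.items()}
    gtask = norm(grads)                 # captured before any regularizer is added
    accumulate(grads, jacreg)
    accumulate(grads, resreg, scale=ratio * gtask / (norm(resreg) + 1e-12))
    return grads, gtask

task = {"w": [3.0, 4.0]}
jacreg = {"w": [10.0, 0.0]}
resreg = {"w": [0.0, 2.0]}
grads, gtask = combine(task, jacreg, resreg, ratio=0.1)
print(gtask, grads)
```

Computing `gtask` after `accumulate(grads, jacreg)` would instead fold the `jacreg` gradient into the reference norm and inflate the `resreg` scale.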

### 6. Complex masked softmax is numerically unstable because masked logits affect the row shift

Files/lines:
- `holo_ep.py:26-29` computes `c = a.real.amax(...)` before masking, then multiplies `exp(a - c)` by `mask`.
- `holo_ep.py:48` passes the causal mask as a complex tensor into this helper.
- Real attention masks before softmax at `lt_ep_train.py:66-68`.

What is wrong:
Mathematically, the row shift cancels if arithmetic is exact. Numerically, a large masked future logit can dominate `c`, causing all valid entries to underflow or lose precision. The real path masks before softmax and does not have this issue.

Why it matters:
With `qknorm` enabled this is mitigated because the logits are bounded, but without `qknorm` it can bias the holomorphic phases or produce NaNs in them.

Severity: corrupts TRAINING conditionally in complex holomorphic paths, especially without `--qknorm`.

Confidence: medium.

Minimal fix:
Keep the mask boolean and compute the shift over valid entries only:

```python
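# Row shift taken over valid (unmasked) entries only, so a masked logit cannot set it.
c = a.real.masked_fill(~mask, -float("inf")).amax(-1, keepdim=True)
# Masked positions are zeroed after exponentiation; `mask` must be a boolean tensor.
w = torch.exp(a - c).masked_fill(~mask, 0)
```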
```
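
The failure mode can be reproduced without the model. The NumPy sketch below uses one illustrative row of complex logits whose masked entry is very large. The function names are hypothetical, and the stable variant replaces masked logits with 0 before `exp` so that no overflow occurs at masked positions.

```python
import numpy as np

# Last logit is a masked "future" entry with a very large real part.
a = np.array([1.0 + 0.5j, 2.0 - 0.3j, 1000.0 + 0.0j])
mask = np.array([True, True, False])

def weights_shift_before_mask(a, mask):
    """Shift by the max over all entries, then zero the masked ones."""
    c = a.real.max(-1, keepdims=True)
    return np.exp(a - c) * mask

def weights_shift_over_valid(a, mask):
    """Shift by the max over valid entries only; masked logits are replaced by 0 before exp."""
    c = np.where(mask, a.real, -np.inf).max(-1, keepdims=True)
    shifted = np.where(mask, a - c, 0.0)
    return np.where(mask, np.exp(shifted), 0.0)

naive = weights_shift_before_mask(a, mask)
stable = weights_shift_over_valid(a, mask)
print("naive row sum :", naive.sum())             # valid weights have underflowed
print("stable weights:", stable / stable.sum())   # normalized over valid entries
```

In the first variant every valid weight underflows to zero, so normalizing the row would divide zero by zero. The second variant keeps the valid weights at order one.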

### 7. Holomorphic EP helpers silently implement only the thick force, but `ep_step` allows them for any mode

Files/lines:
- `lt_ep_train.py:181-197` calls `holo_ep` whenever `holo > 0`, without checking `blk.attn_mode`.
- `holo_ep.py:36-51` implements `cforce` as thick LN + attention + FFN.
- `holo_ep.py:134-152` implements `rforce` as the same thick real-axis force.
- `lt_ep_train.py:349` default `--attn_mode` is `real`, while `lt_ep_train.py:358-359` allow `--holo`.

What is wrong:
If a user runs `--holo` with `attn_mode=real`, `energy`, or `mono`, the nudged force used to estimate `a` is not the model force.

Why it matters:
This silently corrupts training for a legal CLI flag combination.

Severity: corrupts TRAINING for non-thick `--holo` runs.

Confidence: high.

Minimal fix:
Add a hard guard in `ep_step`: `if holo > 0 and blk.attn_mode != 'thick': raise ValueError(...)`, or implement holomorphic force extensions for the other modes.
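
A minimal sketch of such a guard, written as a standalone function so that it can be unit-tested. `check_holo_mode` and `SUPPORTED_HOLO_MODES` are hypothetical names; in the training code the check would sit at the top of `ep_step` and read `blk.attn_mode`.

```python
# Modes the holomorphic helpers implement, per the finding above (assumed to be thick only).
SUPPORTED_HOLO_MODES = ("thick",)

def check_holo_mode(holo, attn_mode):
    """Raise early if holomorphic EP is requested for a force it does not implement."""
    if holo > 0 and attn_mode not in SUPPORTED_HOLO_MODES:
        raise ValueError(
            f"holo={holo} requires attn_mode in {SUPPORTED_HOLO_MODES}, got {attn_mode!r}"
        )

check_holo_mode(holo=0, attn_mode="real")    # plain EP: any mode passes
check_holo_mode(holo=2, attn_mode="thick")   # holomorphic EP on the thick force passes
try:
    check_holo_mode(holo=2, attn_mode="real")
except ValueError as err:
    print("rejected:", err)
```

Keeping the supported modes in one tuple means a later holomorphic extension for another mode only needs to add its name there.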

### 8. `asym_probe.py` hard-codes model construction choices that may not match the checkpoint

Files/lines:
- `asym_probe.py:31-50` exposes `--gelu`, `--T1`, `--T2`, `--hr`, etc.
- `asym_probe.py:105-119` forces `attn_mode="thick"`, `c=1.0`, `qknorm=True`, `fnoise=0.0`, `track=True`, and assigns `blk.gelu = cfg.gelu`.
- `lt_ep_train.py:81-120` never reads `blk.gelu`; GELU is hard-coded to tanh-form in the active thick force.

What is wrong:
The probe can analyze a different model than the checkpoint was trained with. The `--gelu` flag is especially misleading because assigning `blk.gelu` has no effect in the current `EQBlock`.

Why it matters:
For the current qknorm/thick/c=1/tanh runs this is probably harmless. For c-bump, non-qknorm, non-thick, or historical erf/tanh comparisons, it can make `g_transpose`, `g_BPTT`, and `g_EP` refer to the wrong dynamics.

Severity: DIAGNOSIS/probe only, unless probe conclusions are used to choose training changes.

Confidence: high.

Minimal fix:
Save the training config in checkpoints and load `attn_mode`, `c`, `qknorm`, GELU mode, and relevant flags from it. Remove `--gelu` or implement it in `EQBlock`.

### 9. `asym_probe.py` labels `ep_step`'s returned residual as estimator/free-phase convergence, but it is the pre-refinement `T1` residual

Files/lines:
- `lt_ep_train.py:143-152` computes `res` at `T1`, then may refine `zs` and store `res_used`.
- `lt_ep_train.py:232` returns `res`, not `res_used`.
- `asym_probe.py:840` prints that value as `EP estimator free-phase residual from ep_step`.
- `asym_probe.py:505-522` separately computes and prints the refined exact-reference residual.

What is wrong:
The probe can conflate three different residuals: `T1` residual, refined free-phase residual, and nudged/adjoint residual. Only the last one diagnoses whether the EP contrast has converged.

Why it matters:
This can make a run look "tightly converged" or "not tightly converged" depending on which print line the reader tracks. It also reinforces the wrong conclusion that free-phase convergence alone validates the estimator.

Severity: DIAGNOSIS/probe only.

Confidence: high.

Minimal fix:
Return both `res_T1` and `res_refined` from `ep_step`, print both in the probe, and add a separate nudged/adjoint residual for `a`.

### 10. `holo_ep.py` self-test/debug main is broken by unreachable code and an undefined function

Files/lines:
- `holo_ep.py:257-280` defines `holo_a_lockin` and returns.
- `holo_ep.py:281-290` contains unreachable code that looks like a missing `holo_grads` function body.
- `holo_ep.py:329-332` calls `holo_grads`, which is not defined.

What is wrong:
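Running `python holo_ep.py` as a diagnostic script will fail: the main block calls `holo_grads`, which is never defined, so Python raises a `NameError` when it reaches that call.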

Why it matters:
This does not affect `lt_ep_train.py` imports of `holo_a`, `holo_a_select2`, or `holo_a_track`, but it can break or mislead standalone estimator checks.

Severity: DIAGNOSIS/probe only.

Confidence: high.

Minimal fix:
Move `holo_ep.py:281-290` into a real `def holo_grads(...)` or delete the stale self-test.

## Checked And Found Correct

- AEP correction sign: `lt_ep_train.py:171-178`, `holo_ep.py:181-185`, and `holo_ep.py:233-239` subtract `Jv - J^T v`, which is the correct sign for making the local differential dynamics use `J^T`.
- Two-phase contrast sign: `lt_ep_train.py:199-200` and `holo_ep.py:193-195` compute `(z_- - z_+) / (2 beta/r)`, which matches `lambda` solving `J^T lambda = -dL/dz`.
- Exact-adjoint probe sign: `asym_probe.py:443-445` solves `J^T lambda = -ell`, and `asym_probe.py:457-465` computes `lambda^T F_theta`. That is the correct implicit fixed-point gradient.
- Deterministic force consistency for thick mode: `lt_ep_train.py:81-85`, `lt_ep_train.py:102-106`, `holo_ep.py:134-152`, and `holo_ep.py:36-51` match on the real axis. Tiny CPU check with qknorm gave max `|tforce-rforce| = 1.19e-7` and max `|tforce-cforce.real| = 2.09e-7`.
- GELU consistency in current code: `lt_ep_train.py:84`, `lt_ep_train.py:96`, `lt_ep_train.py:105`, `holo_ep.py:32-33`, and `holo_ep.py:148` all use tanh-form GELU. Tiny CPU check found max difference from `F.gelu(..., approximate='tanh')` of `2.38e-7`.
- qknorm consistency in current thick paths: real attention uses qknorm at `lt_ep_train.py:63-65`, complex force at `holo_ep.py:44-46`, and real nudged force at `holo_ep.py:142-144`.
- Common-mode AEP anchor: `holo_ep.py:231-239` correctly computes `zbar`, duplicates it as the anchor, and applies the antisymmetric correction to `Z - zbar`.
- Block-parameter gradient scope: `lt_ep_train.py:203-205` computes `grad((a * f).sum(), blk.block)`, and `asym_probe.py:457-465` uses the same clamp-gradient path for token/position parameters. This is the right scope for force parameters.
- Readout-head gradient: `lt_ep_train.py:208-210` computes only the direct CE gradient for `Wh`. Since `Wh` is not in the force, there is no missing implicit force term and no double-counting.
- BPTT probe unroll: `asym_probe.py:819` uses `bptt_step`, which unrolls the same `blk.force` update as training at `lt_ep_train.py:260-265`.
- `no_grad` around JVP/VJP is not itself a bug. A tiny CPU check confirmed both `torch.func.jvp/vjp` and `torch.autograd.functional.jvp/vjp` still return derivatives inside surrounding `torch.no_grad()` blocks.

## Highest-Value Next Static/CPU Checks

1. Add a CPU-sized linearized test that compares `a_T` from `holo_a_track` against a direct solve of `J^T a = -ell` while sweeping `T2` and `hr`. Prediction: the 0.94-style error should move with `T2`, not `hr`.
2. In `asym_probe.py`, add a `--t2-sweep` diagnostic at fixed `hr`, and print `||a_t - a_{t-K}|| / ||a_t||` plus, if affordable, `||J^T a + ell|| / ||ell||`.
3. Re-run the existing beta/hr sweep only after proving the selected `a` is converged for each point.