| author | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
|---|---|---|
| committer | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
| commit | b83947778e2c776f757a07d4719b7ce961d7ed55 (patch) | |
| tree | b9cc01d7adda691d9156d9d04f4fb2f644674e96 /ep_run/GPT55_BUG_HUNT.md | |
Initial commit: ept — backprop-free equilibrium transformer (EP)
Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
Diffstat (limited to 'ep_run/GPT55_BUG_HUNT.md')
| -rw-r--r-- | ep_run/GPT55_BUG_HUNT.md | 249 |
1 files changed, 249 insertions, 0 deletions
diff --git a/ep_run/GPT55_BUG_HUNT.md b/ep_run/GPT55_BUG_HUNT.md
new file mode 100644
index 0000000..241adba
--- /dev/null
+++ b/ep_run/GPT55_BUG_HUNT.md
@@ -0,0 +1,249 @@
+# GPT55 EP/AEP Correctness Bug Hunt
+
+Scope: static review of `lt_ep_train.py`, `holo_ep.py`, `asym_probe.py`, with context from `../EP_BELOW210_DIAGNOSIS_FIX.md`. I did not modify training code and did not run training or GPU jobs. I used only tiny CPU checks for formula consistency.
+
+## Executive Summary
+
+The main suspect for the reported `cos(g_EP, g_transpose) ~= 0.94` plateau that is flat over `hr` is not beta noise. The code has no nudged-phase/adjoint convergence check. A tight free-phase residual does not imply the two-phase AEP contrast has converged to the fixed-point adjoint. In the linearized corrected dynamics, finite-`T2` error is independent of `hr`, exactly matching the clue.
+
+I did not find a sign flip in the deterministic AEP correction. The real two-phase contrast, holomorphic two-phase contrast, and exact-adjoint probe sign conventions are mutually consistent. I did find several real correctness hazards: the plain-EP AEP correction clip invalidates the transpose correction, stochastic `fnoise` makes JVP/VJP not derivatives of the same force realization, `t1max` trains at a refined fixed point while evaluation/BPTT use the finite `T1` state, and `asym_probe.py` has probe-specific footguns that can mislead the diagnosis.
+
+## Ranked Findings
+
+### 1. Finite nudged/adjoint relaxation is unchecked; this explains an `hr`-flat cosine plateau
+
+Files/lines:
+- `lt_ep_train.py:163-180` runs plain nudges for exactly `T2` steps.
+- `lt_ep_train.py:181-197` computes holomorphic/AEP `a` through `holo_a_track`, `holo_a_select2`, `holo_a_select`, or `holo_a`, but ignores whether the nudged contrast actually converged.
+- `holo_ep.py:179-185` updates the two real phases for one finite step at a time.
+- `holo_ep.py:193-211` selects a snapshot by smallest inter-snapshot increment, not by an adjoint residual.
+- `holo_ep.py:229-254` does the same for common-mode tracking.
+- `asym_probe.py:727-743` changes `beta/hr` in the diagnostic sweep but keeps the nudged relaxation budget fixed.
+- `asym_probe.py:582-595` interprets beta-sweep behavior without a finite-`T2` branch.
+
+What is wrong:
+The estimator validity logic checks/refines the free phase, but not the nudged phase. For the corrected local dynamics, with `M = I + eps * J^T` and `ell = dL/dz`, the two-phase contrast approximately obeys:
+
+```text
+a_{t+1} = M a_t + eps * ell
+fixed point: J^T a = -ell
+```
+
+The finite-`T2` error is `M^T2` times the initial adjoint error and is independent of the nudge radius `r/hr` in the linear regime. Therefore a cosine plateau that is flat across `hr=0.04..0.8` is a signature of under-converged nudged/adjoint relaxation, not evidence that beta noise or finite-beta bias has been ruled out.
+
+Why it matters:
+This corrupts training gradients directly. It also corrupts the diagnosis because the probe can report a tight free-phase residual while the AEP contrast is still a truncated adjoint solve. As the operator hardens below CE about 2.1, the adjoint relaxation can slow down even if the free phase is tight.
+
+Severity: corrupts TRAINING and DIAGNOSIS/probe.
+
+Confidence: high. The line-level behavior matches the `hr`-insensitive clue exactly.
+
+Minimal fix:
+Add a nudged/adjoint convergence criterion. At minimum, return and log `inc_min / (||a_best|| + eps)` from `holo_a_select2` and `holo_a_track`, and sweep `t2sel` in `asym_probe.py` at fixed `hr`. Better: compute an adjoint residual proxy at the selected `a`, e.g. `||J^T a + ell|| / (||ell|| + eps)` using VJP of the full thick force at the free state, and keep nudging until it is below tolerance. Treat `hr` sweeps as inconclusive unless `T2/t2sel` convergence is also demonstrated.
+
+### 2. Plain-EP AEP correction is norm-clipped, which invalidates the transpose correction
+
+Files/lines:
+- `lt_ep_train.py:172-178` computes `corr = Jv - JTv`, then replaces it by `corr * (fn / cn)` when `||corr|| > ||f||`.
+
+What is wrong:
+The AsymEP correction is algebraic: subtract `Jv - J^T v` so the local Jacobian becomes `J^T`. Scaling that correction by a state-dependent factor changes the corrected dynamics to something like `J - alpha(J - J^T)`, with `alpha < 1` exactly when asymmetry is large. That is no longer the transpose dynamics.
+
+Why it matters:
+This can create a systematic estimator bias in plain EP (`holo=0`). It is especially dangerous near the regime where the antisymmetric correction is large, which is the regime this correction is supposed to fix.
+
+Severity: corrupts TRAINING for the plain real two-phase path. It does not affect the current `holo_a_select2`/`holo_a_track` path, which does not use this clip.
+
+Confidence: high.
+
+Minimal fix:
+Remove the correction clip. If stability is needed, clip the total update, reduce `eps`, reduce `beta/hr`, or reject/halve the nudged step while logging that the estimator left its validity region. Do not scale only the antisymmetric correction.
+
+### 3. With `fnoise > 0`, JVP/VJP are not derivatives of the same force realization
+
+Files/lines:
+- `lt_ep_train.py:87-90` samples fresh multiplicative noise inside `_noisy`.
+- `lt_ep_train.py:92-97` applies `_noisy` inside `nc_force`.
+- `lt_ep_train.py:171-178` evaluates the noisy force and then separately calls JVP/VJP through `nc_force`.
+- `holo_ep.py:150-151` injects fresh noise into `rforce`.
+- `holo_ep.py:176-185` and `holo_ep.py:224-239` use `blk.nc_force` for correction JVP/VJP, which can sample different noise again.
+
+What is wrong:
+When `fnoise > 0`, the forward force, JVP force, and VJP force are separate random functions. The correction is no longer `Jv - J^T v` for the same operator used in the state update, and JVP and VJP are not even transposes of the same sampled Jacobian.
+
+Why it matters:
+This corrupts AEP training in the noisy hardware simulation path and makes `navg` average a mixture of stochastic bias and stochastic noise. With `fnoise=0`, this issue is inactive.
+
+Severity: corrupts TRAINING when `--fnoise > 0`; otherwise inactive.
+
+Confidence: high.
+
+Minimal fix:
+Do not sample inside differentiable force functions. Sample a fixed noise mask/device realization outside and pass it into both the forward force and the JVP/VJP force, or keep dynamic per-pass noise out of AEP correction and use deterministic mismatch for differentiable hardware probes.
+
+### 4. `t1max` trains the EP task gradient at a refined state while eval/BPTT use the finite-`T1` state
+
+Files/lines:
+- `lt_ep_train.py:143-152` first computes `zT` at `T1`, then optionally refines `zs` up to `t1max`.
+- `lt_ep_train.py:203-210` computes the EP block gradient and readout gradient at refined `zs`.
+- `lt_ep_train.py:260-265` BPTT differentiates exactly `T1` unrolled steps.
+- `lt_ep_train.py:279-286` validation evaluates exactly `T1` relaxed steps.
+
+What is wrong:
+With `t1max > T1`, EP optimizes the refined fixed-point state, while the reported validation objective and BPTT reference use the finite-`T1` state. That is a real objective mismatch.
+
+Why it matters:
+If `z_T1` drifts away from the refined fixed point, EP can improve the wrong state while validation and the practical finite-time model degrade. This was also identified in `../EP_BELOW210_DIAGNOSIS_FIX.md:11-20`.
+
+Severity: corrupts TRAINING when `t1max > T1` and the finite-`T1` state is the real objective.
+
+Confidence: high.
+
+Minimal fix:
+Choose one objective and make all paths use it. If the objective is finite `T1`, compute the EP gradient/readout gradient at `zT` or add a principled finite-time/contraction term. If the objective is the fixed point, evaluate and compare BPTT against the same refined state.
+
+### 5. `resreg` is hard-wired to thick `tforce` and its scaling includes already-added non-task gradients
+
+Files/lines:
+- `lt_ep_train.py:220-224` computes the residual penalty with `blk.tforce(zT, xin0)`.
+- `lt_ep_train.py:225-228` scales by `gtask` after prior task and `jacreg` gradients may already be in `grads`.
+
+What is wrong:
+`blk.tforce` is the thick-block force only (`lt_ep_train.py:81-85`). If `resreg` is used with `attn_mode` other than `thick`, the residual penalty is for the wrong dynamics. Also, `gtask` is described as task-gradient norm but includes any gradients already added to `grads`, including `jacreg` from `lt_ep_train.py:211-219`.
+
+Why it matters:
+This can apply a residual penalty in the wrong direction for non-thick modes and makes the `resreg` ratio slightly different from its stated meaning.
+
+Severity: corrupts TRAINING conditionally: non-thick `resreg` is high risk; thick with `jacreg` is a smaller scaling bug.
+
+Confidence: high.
+
+Minimal fix:
+Guard `resreg` with `assert blk.attn_mode == 'thick'` or compute the residual through `blk.force(..., cg=True)` for the active mode. Capture the pure task-gradient norm before adding `jacreg` and `resreg`.
+
+### 6. Complex masked softmax is numerically unstable because masked logits affect the row shift
+
+Files/lines:
+- `holo_ep.py:26-29` computes `c = a.real.amax(...)` before masking, then multiplies `exp(a - c)` by `mask`.
+- `holo_ep.py:48` passes the causal mask as a complex tensor into this helper.
+- Real attention masks before softmax at `lt_ep_train.py:66-68`.
+
+What is wrong:
+Mathematically, the row shift cancels if arithmetic is exact. Numerically, a large masked future logit can dominate `c`, causing all valid entries to underflow or lose precision. The real path masks before softmax and does not have this issue.
+
+Why it matters:
+With `qknorm` enabled this is mitigated because logits are bounded, but without `qknorm` it can bias or NaN holomorphic phases.
+
+Severity: corrupts TRAINING conditionally in complex holomorphic paths, especially without `--qknorm`.
+
+Confidence: medium.
+
+Minimal fix:
+Keep the mask boolean and compute the shift over valid entries only:
+
+```python
+c = a.real.masked_fill(~mask, -float("inf")).amax(-1, keepdim=True)
+w = torch.exp(a - c).masked_fill(~mask, 0)
+```
+
+### 7. Holomorphic EP helpers silently implement only the thick force, but `ep_step` allows them for any mode
+
+Files/lines:
+- `lt_ep_train.py:181-197` calls `holo_ep` whenever `holo > 0`, without checking `blk.attn_mode`.
+- `holo_ep.py:36-51` implements `cforce` as thick LN + attention + FFN.
+- `holo_ep.py:134-152` implements `rforce` as the same thick real-axis force.
+- `lt_ep_train.py:349` default `--attn_mode` is `real`, while `lt_ep_train.py:358-359` allow `--holo`.
+
+What is wrong:
+If a user runs `--holo` with `attn_mode=real`, `energy`, or `mono`, the nudged force used to estimate `a` is not the model force.
+
+Why it matters:
+This silently corrupts training for a legal CLI flag combination.
+
+Severity: corrupts TRAINING for non-thick `--holo` runs.
+
+Confidence: high.
+
+Minimal fix:
+Add a hard guard in `ep_step`: `if holo > 0 and blk.attn_mode != 'thick': raise ValueError(...)`, or implement holomorphic force extensions for the other modes.
+
+### 8. `asym_probe.py` hard-codes model construction choices that may not match the checkpoint
+
+Files/lines:
+- `asym_probe.py:31-50` exposes `--gelu`, `--T1`, `--T2`, `--hr`, etc.
+- `asym_probe.py:105-119` forces `attn_mode="thick"`, `c=1.0`, `qknorm=True`, `fnoise=0.0`, `track=True`, and assigns `blk.gelu = cfg.gelu`.
+- `lt_ep_train.py:81-120` never reads `blk.gelu`; GELU is hard-coded to tanh-form in the active thick force.
+
+What is wrong:
+The probe can analyze a different model than the checkpoint was trained with. The `--gelu` flag is especially misleading because assigning `blk.gelu` has no effect in the current `EQBlock`.
+
+Why it matters:
+For the current qknorm/thick/c=1/tanh runs this is probably harmless. For c-bump, non-qknorm, non-thick, or historical erf/tanh comparisons, it can make `g_transpose`, `g_BPTT`, and `g_EP` refer to the wrong dynamics.
+
+Severity: DIAGNOSIS/probe only, unless probe conclusions are used to choose training changes.
+
+Confidence: high.
+
+Minimal fix:
+Save the training config in checkpoints and load `attn_mode`, `c`, `qknorm`, GELU mode, and relevant flags from it. Remove `--gelu` or implement it in `EQBlock`.
+
+### 9. `asym_probe.py` labels `ep_step`'s returned residual as estimator/free-phase convergence, but it is the pre-refinement `T1` residual
+
+Files/lines:
+- `lt_ep_train.py:143-152` computes `res` at `T1`, then may refine `zs` and store `res_used`.
+- `lt_ep_train.py:232` returns `res`, not `res_used`.
+- `asym_probe.py:840` prints that value as `EP estimator free-phase residual from ep_step`.
+- `asym_probe.py:505-522` separately computes and prints the refined exact-reference residual.
+
+What is wrong:
+The probe can conflate three different residuals: `T1` residual, refined free-phase residual, and nudged/adjoint residual. Only the last one diagnoses whether the EP contrast has converged.
+
+Why it matters:
+This can make a run look "tightly converged" or "not tightly converged" depending on which print line the reader tracks. It also reinforces the wrong conclusion that free-phase convergence alone validates the estimator.
+
+Severity: DIAGNOSIS/probe only.
+
+Confidence: high.
+
+Minimal fix:
+Return both `res_T1` and `res_refined` from `ep_step`, print both in the probe, and add a separate nudged/adjoint residual for `a`.
+
+### 10. `holo_ep.py` self-test/debug main is broken by unreachable code and an undefined function
+
+Files/lines:
+- `holo_ep.py:257-280` defines `holo_a_lockin` and returns.
+- `holo_ep.py:281-290` contains unreachable code that looks like a missing `holo_grads` function body.
+- `holo_ep.py:329-332` calls `holo_grads`, which is not defined.
+
+What is wrong:
+Running `python holo_ep.py` as a diagnostic script will fail.
+
+Why it matters:
+This does not affect `lt_ep_train.py` imports of `holo_a`, `holo_a_select2`, or `holo_a_track`, but it can break or mislead standalone estimator checks.
+
+Severity: DIAGNOSIS/probe only.
+
+Confidence: high.
+
+Minimal fix:
+Move `holo_ep.py:281-290` into a real `def holo_grads(...)` or delete the stale self-test.
+
+## Checked And Found Correct
+
+- AEP correction sign: `lt_ep_train.py:171-178`, `holo_ep.py:181-185`, and `holo_ep.py:233-239` subtract `Jv - J^T v`, which is the correct sign for making the local differential dynamics use `J^T`.
+- Two-phase contrast sign: `lt_ep_train.py:199-200` and `holo_ep.py:193-195` compute `(z_- - z_+) / (2 beta/r)`, which matches `lambda` solving `J^T lambda = -dL/dz`.
+- Exact-adjoint probe sign: `asym_probe.py:443-445` solves `J^T lambda = -ell`, and `asym_probe.py:457-465` computes `lambda^T F_theta`. That is the correct implicit fixed-point gradient.
+- Deterministic force consistency for thick mode: `lt_ep_train.py:81-85`, `lt_ep_train.py:102-106`, `holo_ep.py:134-152`, and `holo_ep.py:36-51` match on the real axis. Tiny CPU check with qknorm gave max `|tforce-rforce| = 1.19e-7` and max `|tforce-cforce.real| = 2.09e-7`.
+- GELU consistency in current code: `lt_ep_train.py:84`, `lt_ep_train.py:96`, `lt_ep_train.py:105`, `holo_ep.py:32-33`, and `holo_ep.py:148` all use tanh-form GELU. Tiny CPU check found max difference from `F.gelu(..., approximate='tanh')` of `2.38e-7`.
+- qknorm consistency in current thick paths: real attention uses qknorm at `lt_ep_train.py:63-65`, complex force at `holo_ep.py:44-46`, and real nudged force at `holo_ep.py:142-144`.
+- Common-mode AEP anchor: `holo_ep.py:231-239` correctly computes `zbar`, duplicates it as the anchor, and applies the antisymmetric correction to `Z - zbar`.
+- Block-parameter gradient scope: `lt_ep_train.py:203-205` computes `grad((a * f).sum(), blk.block)`, and `asym_probe.py:457-465` uses the same clamp-gradient path for token/position parameters. This is the right scope for force parameters.
+- Readout-head gradient: `lt_ep_train.py:208-210` computes only the direct CE gradient for `Wh`. Since `Wh` is not in the force, there is no missing implicit force term and no double-counting.
+- BPTT probe unroll: `asym_probe.py:819` uses `bptt_step`, which unrolls the same `blk.force` update as training at `lt_ep_train.py:260-265`.
+- `no_grad` around JVP/VJP is not itself a bug. A tiny CPU check confirmed both `torch.func.jvp/vjp` and `torch.autograd.functional.jvp/vjp` still return derivatives inside surrounding `torch.no_grad()` blocks.
+
+## Highest-Value Next Static/CPU Checks
+
+1. Add a CPU-sized linearized test that compares `a_T` from `holo_a_track` against a direct solve of `J^T a = -ell` while sweeping `T2` and `hr`. Prediction: the 0.94-style error should move with `T2`, not `hr`.
+2. In `asym_probe.py`, add a `--t2-sweep` diagnostic at fixed `hr`, and print `||a_t - a_{t-K}|| / ||a_t||` plus, if affordable, `||J^T a + ell|| / ||ell||`.
+3. Re-run the existing beta/hr sweep only after proving the selected `a` is converged for each point.
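
The first of the next checks listed in the report (a CPU-sized linearized test swept over `T2` and the nudge radius) might look roughly like the sketch below. It is not part of the commit and does not call `holo_a_track`: it uses NumPy on a hand-built toy Jacobian, and every name in it (`make_toy_problem`, `two_phase_contrast`, `adjoint_residual`, `sweep`) is hypothetical. The nudge sign convention is the sketch's own, chosen so that the contrast obeys the report's recursion `a_{t+1} = M a_t + eps * ell` with `M = I + eps * J^T`. Because the toy is exactly linear, independence from `beta` holds by construction; the sketch only illustrates what a `T2`-limited plateau looks like next to the adjoint residual `||J^T a + ell|| / ||ell||`.

```python
import numpy as np


def make_toy_problem(n=8):
    """Hypothetical stable, non-symmetric Jacobian J and loss gradient ell."""
    decay = np.linspace(0.1, 1.0, n)                   # slow modes make truncation visible
    skew = 0.05 * (np.eye(n, k=1) - np.eye(n, k=-1))   # antisymmetric part, so J != J^T
    J = -np.diag(decay) + skew
    ell = np.ones(n)
    return J, ell


def two_phase_contrast(J, ell, eps, beta, T2):
    """Run both nudged phases for T2 steps under the transpose dynamics.

    d_plus / d_minus are displacements from the free fixed point, driven by
    +beta*ell and -beta*ell (sign convention assumed for this sketch only).
    """
    d_plus = np.zeros_like(ell)
    d_minus = np.zeros_like(ell)
    for _ in range(T2):
        d_plus = d_plus + eps * (J.T @ d_plus + beta * ell)
        d_minus = d_minus + eps * (J.T @ d_minus - beta * ell)
    return (d_plus - d_minus) / (2.0 * beta)


def adjoint_residual(J, ell, a):
    """Relative residual of the adjoint equation J^T a = -ell."""
    return float(np.linalg.norm(J.T @ a + ell) / np.linalg.norm(ell))


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def sweep(eps=0.1, T2_values=(20, 200, 2000), betas=(0.04, 0.8)):
    """Return (T2, beta, cos(a_T, a_star), adjoint residual) for each setting."""
    J, ell = make_toy_problem()
    a_star = np.linalg.solve(J.T, -ell)                # direct solve of J^T a = -ell
    rows = []
    for T2 in T2_values:
        for beta in betas:
            a = two_phase_contrast(J, ell, eps, beta, T2)
            rows.append((T2, beta, cosine(a, a_star), adjoint_residual(J, ell, a)))
    return rows


if __name__ == "__main__":
    for T2, beta, cos, res in sweep():
        print(f"T2={T2:5d} beta={beta:4.2f} cos={cos:.6f} adj_res={res:.3e}")
```

In this toy, rows that share a `T2` should agree across `beta`, while the cosine and residual change with `T2`. A real version of the check would replace `two_phase_contrast` with the repository's nudged-phase routine and obtain `J^T a` from a VJP of the force at the free state, at which point nonlinearity can reintroduce some dependence on the nudge radius.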
