Diffstat (limited to 'ep_run/GPT55_BUG_HUNT.md')
-rw-r--r-- ep_run/GPT55_BUG_HUNT.md 249
1 files changed, 249 insertions, 0 deletions
diff --git a/ep_run/GPT55_BUG_HUNT.md b/ep_run/GPT55_BUG_HUNT.md
new file mode 100644
index 0000000..241adba
--- /dev/null
+++ b/ep_run/GPT55_BUG_HUNT.md
@@ -0,0 +1,249 @@
+# GPT55 EP/AEP Correctness Bug Hunt
+
+Scope: static review of `lt_ep_train.py`, `holo_ep.py`, `asym_probe.py`, with context from `../EP_BELOW210_DIAGNOSIS_FIX.md`. I did not modify training code and did not run training or GPU jobs. I used only tiny CPU checks for formula consistency.
+
+## Executive Summary
+
+The main suspect for the reported `cos(g_EP, g_transpose) ~= 0.94` plateau, which stays flat as `hr` varies, is finite nudged-phase relaxation rather than beta noise. The code has no nudged-phase/adjoint convergence check, and a tight free-phase residual does not imply that the two-phase AEP contrast has converged to the fixed-point adjoint. In the linearized corrected dynamics, the finite-`T2` error is independent of `hr`, which matches the clue exactly.
+
+I did not find a sign flip in the deterministic AEP correction. The real two-phase contrast, holomorphic two-phase contrast, and exact-adjoint probe sign conventions are mutually consistent. I did find several real correctness hazards: the plain-EP AEP correction clip invalidates the transpose correction, stochastic `fnoise` makes JVP/VJP not derivatives of the same force realization, `t1max` trains at a refined fixed point while evaluation/BPTT use the finite `T1` state, and `asym_probe.py` has probe-specific footguns that can mislead the diagnosis.
+
+## Ranked Findings
+
+### 1. Finite nudged/adjoint relaxation is unchecked; this explains an `hr`-flat cosine plateau
+
+Files/lines:
+- `lt_ep_train.py:163-180` runs plain nudges for exactly `T2` steps.
+- `lt_ep_train.py:181-197` computes holomorphic/AEP `a` through `holo_a_track`, `holo_a_select2`, `holo_a_select`, or `holo_a`, but ignores whether the nudged contrast actually converged.
+- `holo_ep.py:179-185` updates the two real phases for one finite step at a time.
+- `holo_ep.py:193-211` selects a snapshot by smallest inter-snapshot increment, not by an adjoint residual.
+- `holo_ep.py:229-254` does the same for common-mode tracking.
+- `asym_probe.py:727-743` changes `beta/hr` in the diagnostic sweep but keeps the nudged relaxation budget fixed.
+- `asym_probe.py:582-595` interprets beta-sweep behavior without a finite-`T2` branch.
+
+What is wrong:
+The estimator validity logic checks/refines the free phase, but not the nudged phase. For the corrected local dynamics, with `M = I + eps * J^T` and `ell = dL/dz`, the two-phase contrast approximately obeys:
+
+```text
+a_{t+1} = M a_t + eps * ell
+fixed point: J^T a = -ell
+```
+
+The finite-`T2` error is `M` raised to the power `T2` (`M^{T2}`, a matrix power, not a transpose) applied to the initial adjoint error, and it is independent of the nudge radius `r/hr` in the linear regime. Therefore a cosine plateau that is flat across `hr=0.04..0.8` is a signature of under-converged nudged/adjoint relaxation; it should not be read as evidence that beta noise or finite-beta bias has been ruled out.
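+
+To make the `hr`-independence step explicit, here is a short derivation under the linearized model above. The symbols `a^*` (the exact adjoint) and `e_t` (the error after `t` nudged steps) are notation introduced here, and the starting point `a_0 = 0` assumes both nudged phases are initialized at the free state; the actual code may use different normalizations.
+
+```latex
+\begin{aligned}
+a^{*} &= M a^{*} + \epsilon\,\ell \iff J^{\top} a^{*} = -\ell, \\
+e_t &:= a_t - a^{*}, \\
+e_{t+1} &= (M a_t + \epsilon\,\ell) - (M a^{*} + \epsilon\,\ell) = M e_t, \\
+e_{T_2} &= M^{T_2} e_0, \\
+a_0 = 0 \;&\Rightarrow\; a_{T_2} = (I - M^{T_2})\, a^{*}.
+\end{aligned}
+```
+
+No term depends on `hr`, so in this regime the cosine between `a_{T2}` and `a^*` is set by `T2`, `eps`, and the spectrum of `J` alone.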
+
+Why it matters:
+This corrupts training gradients directly. It also corrupts the diagnosis because the probe can report a tight free-phase residual while the AEP contrast is still a truncated adjoint solve. As the operator hardens once CE falls below about 2.1, the adjoint relaxation can slow down even if the free phase is tight.
+
+Severity: corrupts TRAINING and DIAGNOSIS/probe.
+
+Confidence: high. The line-level behavior matches the `hr`-insensitive clue exactly.
+
+Minimal fix:
+Add a nudged/adjoint convergence criterion. At minimum, return and log `inc_min / (||a_best|| + eps)` from `holo_a_select2` and `holo_a_track`, and sweep `t2sel` in `asym_probe.py` at fixed `hr`. Better: compute an adjoint residual proxy at the selected `a`, e.g. `||J^T a + ell|| / (||ell|| + eps)` using VJP of the full thick force at the free state, and keep nudging until it is below tolerance. In these two ratios `eps` denotes a small positive constant that guards against division by zero, not the step size `eps` in `M`. Treat `hr` sweeps as inconclusive unless `T2/t2sel` convergence is also demonstrated.
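+
+One way to realize the residual-based stopping rule is sketched below on a toy linear system. Everything here is hypothetical: the helper names (`adjoint_residual`, `relax_adjoint`), the 2x2 `J`, the step size, and the tolerance are illustrative only, and a real version would obtain `J^T a` from a VJP of the thick force rather than from an explicit matrix.
+
+```python
+import math
+
+def matvec(A, x):
+    """Matrix-vector product for small nested-list matrices."""
+    return [sum(a * b for a, b in zip(row, x)) for row in A]
+
+def transpose(A):
+    return [list(col) for col in zip(*A)]
+
+def norm(x):
+    return math.sqrt(sum(v * v for v in x))
+
+def adjoint_residual(J, a, ell, tiny=1e-12):
+    """Relative residual ||J^T a + ell|| / (||ell|| + tiny)."""
+    r = [u + v for u, v in zip(matvec(transpose(J), a), ell)]
+    return norm(r) / (norm(ell) + tiny)
+
+def relax_adjoint(J, ell, eps, tol, max_steps):
+    """Iterate a <- a + eps * (J^T a + ell), stopping once the residual < tol."""
+    a = [0.0] * len(ell)
+    JT = transpose(J)
+    for step in range(max_steps):
+        res = adjoint_residual(J, a, ell)
+        if res < tol:
+            return a, step, res
+        a = [ai + eps * (gi + li) for ai, gi, li in zip(a, matvec(JT, a), ell)]
+    return a, max_steps, adjoint_residual(J, a, ell)
+
+# Toy contracting, non-symmetric Jacobian with one slow mode (illustrative values).
+J = [[-1.0, 0.8], [0.0, -0.02]]
+ell = [1.0, 1.0]
+
+# Fixed budget: tol=0.0 never triggers, so exactly 50 steps are taken.
+a_fixed, _, res_fixed = relax_adjoint(J, ell, eps=0.5, tol=0.0, max_steps=50)
+# Residual-based stopping: keep nudging until the adjoint equation is satisfied.
+a_conv, steps, res_conv = relax_adjoint(J, ell, eps=0.5, tol=1e-6, max_steps=100_000)
+
+print(f"fixed T2=50: residual {res_fixed:.3e}")
+print(f"tolerance  : residual {res_conv:.3e} after {steps} steps")
+```
+
+In this toy the 50-step budget leaves a relative residual of order one because of the slow mode, while the tolerance-based loop needs roughly a thousand steps or more.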
+
+### 2. Plain-EP AEP correction is norm-clipped, which invalidates the transpose correction
+
+Files/lines:
+- `lt_ep_train.py:172-178` computes `corr = Jv - JTv`, then replaces it by `corr * (fn / cn)` when `||corr|| > ||f||`.
+
+What is wrong:
+The AsymEP correction is algebraic: subtract `Jv - J^T v` so the local Jacobian becomes `J^T`. Scaling that correction by a state-dependent factor changes the corrected dynamics to something like `J - alpha(J - J^T) = (1 - alpha) J + alpha J^T`, with `alpha < 1` precisely when the correction norm exceeds the force norm, i.e. when asymmetry is large. That is no longer the transpose dynamics.
+
+Why it matters:
+This can create a systematic estimator bias in plain EP (`holo=0`). It is especially dangerous near the regime where the antisymmetric correction is large, which is the regime this correction is supposed to fix.
+
+Severity: corrupts TRAINING for the plain real two-phase path. It does not affect the current `holo_a_select2`/`holo_a_track` path, which does not use this clip.
+
+Confidence: high.
+
+Minimal fix:
+Remove the correction clip. If stability is needed, clip the total update, reduce `eps`, reduce `beta/hr`, or reject/halve the nudged step while logging that the estimator left its validity region. Do not scale only the antisymmetric correction.
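+
+The difference between clipping only the correction and clipping the whole update can be seen on a small example. This is a sketch with a hypothetical 2x2 Jacobian and hypothetical helper names (`corrected_force`, `cross2`); it is not the `ep_step` code path, only an illustration of the algebra.
+
+```python
+def matvec(A, x):
+    """Matrix-vector product for small nested-list matrices."""
+    return [sum(a * b for a, b in zip(row, x)) for row in A]
+
+def transpose(A):
+    return [list(col) for col in zip(*A)]
+
+def corrected_force(J, v, alpha):
+    """Return J v - alpha * (J v - J^T v); alpha = 1 is the unclipped correction."""
+    Jv, JTv = matvec(J, v), matvec(transpose(J), v)
+    return [jv - alpha * (jv - jtv) for jv, jtv in zip(Jv, JTv)]
+
+def cross2(p, q):
+    """2D cross product; zero means p and q are parallel."""
+    return p[0] * q[1] - p[1] * q[0]
+
+J = [[-1.0, 2.0], [0.0, -1.0]]          # illustrative non-symmetric Jacobian
+v = [1.0, 0.0]
+target = matvec(transpose(J), v)        # transpose dynamics J^T v
+full = corrected_force(J, v, 1.0)       # unclipped correction: equals J^T v
+clipped = corrected_force(J, v, 0.5)    # clip applied to the correction only
+scaled_total = [0.5 * x for x in full]  # clip applied to the whole update instead
+
+print("target      ", target)
+print("clipped corr", clipped, "cross with target:", cross2(clipped, target))
+print("scaled total", scaled_total, "cross with target:", cross2(scaled_total, target))
+```
+
+Scaling the whole update shortens the step but keeps it parallel to `J^T v`, whereas scaling only the correction rotates it toward `J v`.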
+
+### 3. With `fnoise > 0`, JVP/VJP are not derivatives of the same force realization
+
+Files/lines:
+- `lt_ep_train.py:87-90` samples fresh multiplicative noise inside `_noisy`.
+- `lt_ep_train.py:92-97` applies `_noisy` inside `nc_force`.
+- `lt_ep_train.py:171-178` evaluates the noisy force and then separately calls JVP/VJP through `nc_force`.
+- `holo_ep.py:150-151` injects fresh noise into `rforce`.
+- `holo_ep.py:176-185` and `holo_ep.py:224-239` use `blk.nc_force` for correction JVP/VJP, which can sample different noise again.
+
+What is wrong:
+When `fnoise > 0`, the forward force, JVP force, and VJP force are separate random functions. The correction is no longer `Jv - J^T v` for the same operator used in the state update, and JVP and VJP are not even transposes of the same sampled Jacobian.
+
+Why it matters:
+This corrupts AEP training in the noisy hardware simulation path and makes `navg` average a mixture of stochastic bias and stochastic noise. With `fnoise=0`, this issue is inactive.
+
+Severity: corrupts TRAINING when `--fnoise > 0`; otherwise inactive.
+
+Confidence: high.
+
+Minimal fix:
+Do not sample inside differentiable force functions. Sample a fixed noise mask/device realization outside and pass it into both the forward force and the JVP/VJP force, or keep dynamic per-pass noise out of AEP correction and use deterministic mismatch for differentiable hardware probes.
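+
+A minimal sketch of the shared-realization idea, assuming a toy linear force with multiplicative weight noise. The names `sample_mask` and `noisy_jacobian` are hypothetical and do not correspond to `_noisy` or `nc_force`; the point is only that JVP and VJP are transposes of each other when they see the same sampled operator.
+
+```python
+import random
+
+def matvec(A, x):
+    """Matrix-vector product for small nested-list matrices."""
+    return [sum(a * b for a, b in zip(row, x)) for row in A]
+
+def transpose(A):
+    return [list(col) for col in zip(*A)]
+
+def dot(x, y):
+    return sum(a * b for a, b in zip(x, y))
+
+def sample_mask(n, sigma, rng):
+    """One multiplicative noise realization with entries 1 + sigma * N(0, 1)."""
+    return [[1.0 + sigma * rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
+
+def noisy_jacobian(W, mask):
+    """Jacobian of the toy force f(z) = (W * mask) z, with * elementwise."""
+    return [[w * m for w, m in zip(wr, mr)] for wr, mr in zip(W, mask)]
+
+rng = random.Random(0)
+W = [[-1.0, 0.7, 0.0], [0.2, -1.0, 0.5], [0.0, -0.3, -1.0]]
+u, v = [1.0, -2.0, 0.5], [0.3, 1.0, -1.0]
+
+# Shared realization: sample once outside, reuse for both JVP and VJP.
+Jn = noisy_jacobian(W, sample_mask(3, 0.1, rng))
+gap_shared = dot(u, matvec(Jn, v)) - dot(matvec(transpose(Jn), u), v)
+
+# Fresh realization per call: the two products use different operators.
+J1 = noisy_jacobian(W, sample_mask(3, 0.1, rng))
+J2 = noisy_jacobian(W, sample_mask(3, 0.1, rng))
+gap_fresh = dot(u, matvec(J1, v)) - dot(matvec(transpose(J2), u), v)
+
+print("adjoint-identity gap, shared mask:", gap_shared)
+print("adjoint-identity gap, fresh masks:", gap_fresh)
+```
+
+With a shared mask the identity `u . (J v) = (J^T u) . v` holds to rounding error; with fresh samples per call it generally does not.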
+
+### 4. `t1max` trains the EP task gradient at a refined state while eval/BPTT use the finite-`T1` state
+
+Files/lines:
+- `lt_ep_train.py:143-152` first computes `zT` at `T1`, then optionally refines `zs` up to `t1max`.
+- `lt_ep_train.py:203-210` computes the EP block gradient and readout gradient at refined `zs`.
+- `lt_ep_train.py:260-265` BPTT differentiates exactly `T1` unrolled steps.
+- `lt_ep_train.py:279-286` validation evaluates exactly `T1` relaxed steps.
+
+What is wrong:
+With `t1max > T1`, EP optimizes the refined fixed-point state, while the reported validation objective and BPTT reference use the finite-`T1` state. That is a real objective mismatch.
+
+Why it matters:
+If `z_T1` drifts away from the refined fixed point, EP can improve the wrong state while validation and the practical finite-time model degrade. This was also identified in `../EP_BELOW210_DIAGNOSIS_FIX.md:11-20`.
+
+Severity: corrupts TRAINING when `t1max > T1` and the finite-`T1` state is the real objective.
+
+Confidence: high.
+
+Minimal fix:
+Choose one objective and make all paths use it. If the objective is finite `T1`, compute the EP gradient/readout gradient at `zT` or add a principled finite-time/contraction term. If the objective is the fixed point, evaluate and compare BPTT against the same refined state.
+
+### 5. `resreg` is hard-wired to thick `tforce` and its scaling includes already-added non-task gradients
+
+Files/lines:
+- `lt_ep_train.py:220-224` computes the residual penalty with `blk.tforce(zT, xin0)`.
+- `lt_ep_train.py:225-228` scales by `gtask` after prior task and `jacreg` gradients may already be in `grads`.
+
+What is wrong:
+`blk.tforce` is the thick-block force only (`lt_ep_train.py:81-85`). If `resreg` is used with `attn_mode` other than `thick`, the residual penalty is for the wrong dynamics. Also, `gtask` is described as task-gradient norm but includes any gradients already added to `grads`, including `jacreg` from `lt_ep_train.py:211-219`.
+
+Why it matters:
+This can apply a residual penalty in the wrong direction for non-thick modes and makes the `resreg` ratio slightly different from its stated meaning.
+
+Severity: corrupts TRAINING conditionally: non-thick `resreg` is high risk; thick with `jacreg` is a smaller scaling bug.
+
+Confidence: high.
+
+Minimal fix:
+Guard `resreg` with `assert blk.attn_mode == 'thick'` or compute the residual through `blk.force(..., cg=True)` for the active mode. Capture the pure task-gradient norm before adding `jacreg` and `resreg`.
+
+### 6. Complex masked softmax is numerically unstable because masked logits affect the row shift
+
+Files/lines:
+- `holo_ep.py:26-29` computes `c = a.real.amax(...)` before masking, then multiplies `exp(a - c)` by `mask`.
+- `holo_ep.py:48` passes the causal mask as a complex tensor into this helper.
+- Real attention masks before softmax at `lt_ep_train.py:66-68`.
+
+What is wrong:
+Mathematically, the row shift cancels if arithmetic is exact. Numerically, a large masked future logit can dominate `c`, causing all valid entries to underflow or lose precision. The real path masks before softmax and does not have this issue.
+
+Why it matters:
+With `qknorm` enabled this is mitigated because logits are bounded, but without `qknorm` it can bias the holomorphic phases or produce NaNs in them.
+
+Severity: corrupts TRAINING conditionally in complex holomorphic paths, especially without `--qknorm`.
+
+Confidence: medium.
+
+Minimal fix:
+Keep the mask boolean and compute the shift over valid entries only:
+
+```python
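+# `mask` is a boolean tensor (True = valid); shift by the max over valid entries only.
+c = a.real.masked_fill(~mask, -float("inf")).amax(-1, keepdim=True)
+# Zero the masked entries after exponentiation.
+w = torch.exp(a - c).masked_fill(~mask, 0)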
+```
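+
+The underflow can be reproduced with plain floats. The sketch below uses real-valued logits and a hypothetical helper `masked_weights`; it mirrors only the choice of row shift, not the complex attention in `holo_ep.py`.
+
+```python
+import math
+
+def masked_weights(logits, mask, shift_over_valid_only):
+    """Unnormalized softmax weights with masked entries set to zero."""
+    valid = [x for x, m in zip(logits, mask) if m]
+    c = max(valid) if shift_over_valid_only else max(logits)
+    # Masked entries are skipped rather than exponentiated, so a huge masked
+    # logit cannot overflow when the shift ignores it.
+    return [math.exp(x - c) if m else 0.0 for x, m in zip(logits, mask)]
+
+logits = [1.0, 2.0, 900.0]      # the last (future) position holds a huge logit
+mask = [True, True, False]      # and is masked out
+
+bad = masked_weights(logits, mask, shift_over_valid_only=False)
+good = masked_weights(logits, mask, shift_over_valid_only=True)
+good_probs = [w / sum(good) for w in good]
+
+print("shift over all entries  :", bad, "sum =", sum(bad))
+print("shift over valid entries:", good_probs)
+```
+
+When the shift includes the masked logit, every valid weight underflows to zero and the row can no longer be normalized; shifting over valid entries keeps the largest valid weight at exactly one.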
+
+### 7. Holomorphic EP helpers silently implement only the thick force, but `ep_step` allows them for any mode
+
+Files/lines:
+- `lt_ep_train.py:181-197` calls `holo_ep` whenever `holo > 0`, without checking `blk.attn_mode`.
+- `holo_ep.py:36-51` implements `cforce` as thick LN + attention + FFN.
+- `holo_ep.py:134-152` implements `rforce` as the same thick real-axis force.
+- `lt_ep_train.py:349` default `--attn_mode` is `real`, while `lt_ep_train.py:358-359` allow `--holo`.
+
+What is wrong:
+If a user runs `--holo` with `attn_mode=real`, `energy`, or `mono`, the nudged force used to estimate `a` is not the model force.
+
+Why it matters:
+This silently corrupts training for a legal CLI flag combination.
+
+Severity: corrupts TRAINING for non-thick `--holo` runs.
+
+Confidence: high.
+
+Minimal fix:
+Add a hard guard in `ep_step`: `if holo > 0 and blk.attn_mode != 'thick': raise ValueError(...)`, or implement holomorphic force extensions for the other modes.
+
+### 8. `asym_probe.py` hard-codes model construction choices that may not match the checkpoint
+
+Files/lines:
+- `asym_probe.py:31-50` exposes `--gelu`, `--T1`, `--T2`, `--hr`, etc.
+- `asym_probe.py:105-119` forces `attn_mode="thick"`, `c=1.0`, `qknorm=True`, `fnoise=0.0`, `track=True`, and assigns `blk.gelu = cfg.gelu`.
+- `lt_ep_train.py:81-120` never reads `blk.gelu`; GELU is hard-coded to tanh-form in the active thick force.
+
+What is wrong:
+The probe can analyze a different model than the checkpoint was trained with. The `--gelu` flag is especially misleading because assigning `blk.gelu` has no effect in the current `EQBlock`.
+
+Why it matters:
+For the current qknorm/thick/c=1/tanh runs this is probably harmless. For c-bump, non-qknorm, non-thick, or historical erf/tanh comparisons, it can make `g_transpose`, `g_BPTT`, and `g_EP` refer to the wrong dynamics.
+
+Severity: DIAGNOSIS/probe only, unless probe conclusions are used to choose training changes.
+
+Confidence: high.
+
+Minimal fix:
+Save the training config in checkpoints and load `attn_mode`, `c`, `qknorm`, GELU mode, and relevant flags from it. Remove `--gelu` or implement it in `EQBlock`.
+
+### 9. `asym_probe.py` labels `ep_step`'s returned residual as estimator/free-phase convergence, but it is the pre-refinement `T1` residual
+
+Files/lines:
+- `lt_ep_train.py:143-152` computes `res` at `T1`, then may refine `zs` and store `res_used`.
+- `lt_ep_train.py:232` returns `res`, not `res_used`.
+- `asym_probe.py:840` prints that value as `EP estimator free-phase residual from ep_step`.
+- `asym_probe.py:505-522` separately computes and prints the refined exact-reference residual.
+
+What is wrong:
+The probe can conflate three different residuals: `T1` residual, refined free-phase residual, and nudged/adjoint residual. Only the last one diagnoses whether the EP contrast has converged.
+
+Why it matters:
+This can make a run look "tightly converged" or "not tightly converged" depending on which print line the reader tracks. It also reinforces the wrong conclusion that free-phase convergence alone validates the estimator.
+
+Severity: DIAGNOSIS/probe only.
+
+Confidence: high.
+
+Minimal fix:
+Return both `res_T1` and `res_refined` from `ep_step`, print both in the probe, and add a separate nudged/adjoint residual for `a`.
+
+### 10. `holo_ep.py` self-test/debug main is broken by unreachable code and an undefined function
+
+Files/lines:
+- `holo_ep.py:257-280` defines `holo_a_lockin` and returns.
+- `holo_ep.py:281-290` contains unreachable code that looks like a missing `holo_grads` function body.
+- `holo_ep.py:329-332` calls `holo_grads`, which is not defined.
+
+What is wrong:
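+Running `python holo_ep.py` as a diagnostic script will fail: the main block calls `holo_grads`, which is never defined, so Python raises a `NameError` when execution reaches that call.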
+
+Why it matters:
+This does not affect `lt_ep_train.py` imports of `holo_a`, `holo_a_select2`, or `holo_a_track`, but it can break or mislead standalone estimator checks.
+
+Severity: DIAGNOSIS/probe only.
+
+Confidence: high.
+
+Minimal fix:
+Move `holo_ep.py:281-290` into a real `def holo_grads(...)` or delete the stale self-test.
+
+## Checked And Found Correct
+
+- AEP correction sign: `lt_ep_train.py:171-178`, `holo_ep.py:181-185`, and `holo_ep.py:233-239` subtract `Jv - J^T v`, which is the correct sign for making the local differential dynamics use `J^T`.
+- Two-phase contrast sign: `lt_ep_train.py:199-200` and `holo_ep.py:193-195` compute `(z_- - z_+) / (2 beta/r)`, which matches `lambda` solving `J^T lambda = -dL/dz`.
+- Exact-adjoint probe sign: `asym_probe.py:443-445` solves `J^T lambda = -ell`, and `asym_probe.py:457-465` computes `lambda^T F_theta`. That is the correct implicit fixed-point gradient.
+- Deterministic force consistency for thick mode: `lt_ep_train.py:81-85`, `lt_ep_train.py:102-106`, `holo_ep.py:134-152`, and `holo_ep.py:36-51` match on the real axis. Tiny CPU check with qknorm gave max `|tforce-rforce| = 1.19e-7` and max `|tforce-cforce.real| = 2.09e-7`.
+- GELU consistency in current code: `lt_ep_train.py:84`, `lt_ep_train.py:96`, `lt_ep_train.py:105`, `holo_ep.py:32-33`, and `holo_ep.py:148` all use tanh-form GELU. Tiny CPU check found max difference from `F.gelu(..., approximate='tanh')` of `2.38e-7`.
+- qknorm consistency in current thick paths: real attention uses qknorm at `lt_ep_train.py:63-65`, complex force at `holo_ep.py:44-46`, and real nudged force at `holo_ep.py:142-144`.
+- Common-mode AEP anchor: `holo_ep.py:231-239` correctly computes `zbar`, duplicates it as the anchor, and applies the antisymmetric correction to `Z - zbar`.
+- Block-parameter gradient scope: `lt_ep_train.py:203-205` computes `grad((a * f).sum(), blk.block)`, and `asym_probe.py:457-465` uses the same clamp-gradient path for token/position parameters. This is the right scope for force parameters.
+- Readout-head gradient: `lt_ep_train.py:208-210` computes only the direct CE gradient for `Wh`. Since `Wh` is not in the force, there is no missing implicit force term and no double-counting.
+- BPTT probe unroll: `asym_probe.py:819` uses `bptt_step`, which unrolls the same `blk.force` update as training at `lt_ep_train.py:260-265`.
+- `no_grad` around JVP/VJP is not itself a bug. A tiny CPU check confirmed both `torch.func.jvp/vjp` and `torch.autograd.functional.jvp/vjp` still return derivatives inside surrounding `torch.no_grad()` blocks.
+
+## Highest-Value Next Static/CPU Checks
+
+1. Add a CPU-sized linearized test that compares `a_T` from `holo_a_track` against a direct solve of `J^T a = -ell` while sweeping `T2` and `hr`. Prediction: the 0.94-style error should move with `T2`, not `hr`.
+2. In `asym_probe.py`, add a `--t2-sweep` diagnostic at fixed `hr`, and print `||a_t - a_{t-K}|| / ||a_t||` plus, if affordable, `||J^T a + ell|| / ||ell||`.
+3. Re-run the existing beta/hr sweep only after proving the selected `a` is converged for each point.