Initial commit: ept — backprop-free equilibrium transformer (EP)

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
author: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
committer: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
commit: b83947778e2c776f757a07d4719b7ce961d7ed55 (patch)
tree: b9cc01d7adda691d9156d9d04f4fb2f644674e96 /ep_run/CODEX_VERDICT.md
1 files changed, 151 insertions, 0 deletions
diff --git a/ep_run/CODEX_VERDICT.md b/ep_run/CODEX_VERDICT.md
new file mode 100644
index 0000000..7d29c39
--- /dev/null
+++ b/ep_run/CODEX_VERDICT.md
@@ -0,0 +1,151 @@
+# CODEX VERDICT: EP below-2.10 divergence
+
+## Ruling
+
+Verdict: (b) STRUCTURAL.
+
+Converging the EP adjoint phase is necessary for a correct equilibrium-gradient estimate, but it is not sufficient to make this training problem behave like BPTT. The exact equilibrium gradient differentiates `L(z*)`. BPTT differentiates the actual deployed computation `L(z_T)` with `T=150`. Those are different objectives whenever convergence is not effectively complete. The missing term is the finite-horizon residual/contraction term. No `t2sel` or `hr` knob can add that term to the exact fixed-point gradient.
+
+## Fact-set check
+
+1. Correct in substance. `runs/bptt_clean.log` reaches best val CE `1.8277` and keeps the finite-`T1` residual small, around `4e-4` to `1e-3` late in training. The failure is EP-specific in the comparable EP logs. The `rho~0.982` value is referenced by the rho prober header and scripts, but the requested log set does not contain the full `spec_bifurcation.py` output. The CE/residual part is therefore directly verified, while the exact rho number is not independently confirmed by the available logs.
+
+2. Correct. `runs/t2_sweep.log` shows `cos(g_EP,g_transpose)` rising from `0.742413` at `t2sel=10` to `0.998194` at `t2sel=160`. `runs/hr_ceiling_sweep.log` shows the remaining gap flat across `hr=0.04..0.8` at about `0.94..0.946`. That is adjoint-phase truncation, not beta-radius tuning. The code agrees: `lt_ep_train.py` calls `holo_a_track`/`holo_a_select2` with a fixed `t2sel`, and `holo_ep.py` selects a finite nudged snapshot rather than enforcing an adjoint residual.
+
+3. Correct. The same sweep reports `cos(g_transpose,g_BPTT)=0.974976` at free-phase step residual `2.17e-5`. `diag_probe.log` shows the exact fixed-point regime, residual around `1e-9`, where `cos(g_transpose,g_BPTT)=1.000000`. This is the finite-horizon/free-residual gap.
+
+4. Correct in causal direction, with one evidence caveat. `ep_redx.log` shows the sudden event: CE goes from `2.7417` at step 3200 with residual `2.5e-2` to CE `41` with residual `1.1e-1` at step 3300. `redx_traj.log` shows EP gradient quality degrading and residuals growing along the approach. The rho drift and damping-resistant `c` sweep are encoded in `spec_bifurcation.py`, `spec_rho_vs_c.py`, and cited in `t2fix_rho.log`, but the actual c-sweep output is not present in the requested logs. The important point remains: the finite-`T1` residual becomes hypersensitive near marginality.
+
+5. Correct as the root cause, but not yet empirically closed for `t2sel=160`. `runs/t2fix_rho.log` only has steps 100, 200, 300 at CE about `6`, with `rho~0.794`; it does not prove that `t2sel=160` will pass or fail near CE 2.x. The math below decides the open question: exact equilibrium gradients still optimize the wrong objective for finite-time deployment.
+
+## Why the exact equilibrium gradient lacks contraction defense
+
+Let the relaxation map be
+
+```text
+Phi_theta(z) = z + eps F_theta(z)
+```
+
+and let `z*` satisfy `F_theta(z*) = 0`. The equilibrium objective is
+
+```text
+J_inf(theta) = L(z*(theta)).
+```
+
+Differentiating the fixed-point equation gives
+
+```text
+F_z dz*/dtheta + F_theta = 0
+dz*/dtheta = -F_z^{-1} F_theta
+```
+
+Equivalently, solve the equilibrium adjoint
+
+```text
+F_z^T lambda = -L_z(z*)
+grad_theta J_inf = L_theta + lambda^T F_theta.
+```
+
+That is exactly what the EP/AEP estimator is trying to approximate. It contains `F_z^{-1}`, so slow modes can amplify loss sensitivity. But it contains no term for the finite relaxation length, no `T`, no initial residual, no `Phi^T`, and no derivative of `rho(Phi_z)` unless changing that contraction also changes `z*` or the equilibrium loss. A parameter that changes the convergence rate while leaving the fixed point and readout loss unchanged has zero exact equilibrium gradient.
+
+Scalar counterexample:
+
+```text
+F_k(z) = -k (z - z*)
+Phi_k(z) = z + eps F_k(z)
+L = L(z*)
+```
+
+For any positive `k`, the fixed point is the same. Therefore
+
+```text
+d L(z*) / d k = 0.
+```
+
+But the finite state is
+
+```text
+z_T - z* = (1 - eps k)^T (z_0 - z*),
+```
+
+so
+
+```text
+d L(z_T) / d k
+```
+
+contains a term proportional to
+
+```text
+T (1 - eps k)^(T-1).
+```
+
+That is exactly the contraction-defense term. It is large near `rho=1`, precisely where `rho^150` stops decaying and, once `rho` exceeds 1, grows explosively. It vanishes only in the true infinite-time limit when `rho<1` with enough margin.
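+
+As a rough worked illustration, write `rho = 1 - eps k` for the scalar contraction factor and expand the derivative above by the chain rule. The `rho` values below are illustrative stand-ins (two of them borrowed from the rho figures quoted in the fact-set check); identifying them with this scalar model is an assumption, and all numbers are rounded:
+
+```text
+d z_T / d k    = -eps T rho^(T-1) (z_0 - z*)
+d L(z_T) / d k = -eps T rho^(T-1) (z_0 - z*) L_z(z_T)
+
+T = 150:
+  rho = 0.794  ->  rho^T ~ 1e-15,  T rho^(T-1) ~ 2e-13
+  rho = 0.982  ->  rho^T ~ 0.066,  T rho^(T-1) ~ 10
+  rho = 0.990  ->  rho^T ~ 0.22,   T rho^(T-1) ~ 34
+```
+
+Under these assumptions the finite-horizon sensitivity is numerically invisible at `rho~0.79` and grows by more than thirteen orders of magnitude by `rho~0.98`, while `d L(z*) / d k` stays exactly zero throughout.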
+
+For the full model, BPTT differentiates
+
+```text
+z_{t+1} = Phi_theta(z_t)
+grad_theta L(z_T)
+  = L_z(z_T)^T sum_{k=0}^{T-1}
+      (prod_{s=k+1}^{T-1} Phi_z(z_s)) eps F_theta(z_k)
+    + direct terms.
+```
+
+Those products are the same objects that determine finite-time contraction. When they decay slowly, the finite-horizon gradient feels it. Equilibrium EP replaces this whole finite product chain with the fixed-point inverse at `z*` and takes `T=infinity`; the transient residual term is gone.
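+
+A hedged way to see the size of the discarded piece is to freeze the linearization, an assumption made only for illustration: take `Phi_z(z_s) ~ A = I + eps F_z(z*)` and `F_theta(z_k) ~ F_theta(z*)` along the whole trajectory, and let `A^T` denote the `T`-th matrix power (not a transpose). The state sensitivity then collapses to a geometric series:
+
+```text
+d z_T / d theta ~ sum_{j=0}^{T-1} A^j eps F_theta
+                = (I - A^T) (I - A)^{-1} eps F_theta
+                = -(I - A^T) F_z^{-1} F_theta        [since I - A = -eps F_z]
+                = (I - A^T) dz*/dtheta
+```
+
+In this frozen picture the finite-horizon sensitivity equals the equilibrium sensitivity minus `A^T dz*/dtheta`, and that remainder only becomes negligible when `rho(A)^T` is small. The true unroll also lets `theta` move `Phi_z` itself, as in the scalar counterexample, which this approximation leaves out.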
+
+The code implements this split exactly. In `lt_ep_train.py`, `ep_step` relaxes to `T1`, optionally refines beyond `T1`, and computes the task gradient at `zs` through `(a * f).sum()`. `bptt_step` unrolls exactly `T1` steps and differentiates `ce(blk, z, y)` at the final unrolled state. `evaluate()` also uses exactly `T1` relaxation steps. Therefore BPTT is optimizing the evaluated computation and EP is optimizing the refined fixed-point computation.
+
+## Consequence
+
+The exact equilibrium adjoint can be correct and still push into a marginal operator, because the equilibrium objective is indifferent to settling time except through its effect on `z*`. The evidence that EP can reach `cos(g_EP,g_transpose)=0.998` only proves that EP can compute the fixed-point gradient. It does not prove that the fixed-point gradient contains BPTT's finite-horizon stabilizer. It does not.
+
+So the fix is not "set `t2sel=160` and call the adjoint converged." That removes one estimator error. It does not change the target objective. If the deployed model is `T1=150`, the training signal must include finite-horizon dynamics or an explicit contraction objective.
+
+## Local forward-only fix
+
+The limitation is fundamental for pure equilibrium EP on `L(z*)`, but it is not fundamental for local forward-only learning if the objective is changed.
+
+Concrete construction: finite-horizon forward-mode/RTRL eligibility training for `L(z_T)` plus, if needed, a local contraction penalty.
+
+Run the physical relaxation forward for `T=150`. Alongside the state, propagate local eligibility traces:
+
+```text
+e_{t+1}^{(p)} = Phi_z(z_t) e_t^{(p)} + eps dF_theta(z_t)/dp
+```
+
+At `T`, form the local three-factor update
+
+```text
+Delta p proportional to - L_z(z_T)^T e_T^{(p)}.
+```
+
+This is forward-mode differentiation of the actual finite unroll. It is not reverse BPTT, and it is not a digital root finder. Exact per-parameter RTRL is expensive; practical versions use blockwise, low-rank, or random-direction eligibility traces. But this is the correct class of construction because it preserves the finite product terms that defend contraction.
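+
+As a quick consistency check, unrolling the trace recursion shows where those product terms come from. The only added assumption is the initial condition `e_0^{(p)} = 0`, i.e. that `z_0` does not depend on `p`:
+
+```text
+e_1^{(p)} = eps dF_theta(z_0)/dp
+e_2^{(p)} = Phi_z(z_1) eps dF_theta(z_0)/dp + eps dF_theta(z_1)/dp
+...
+e_T^{(p)} = sum_{k=0}^{T-1} (prod_{s=k+1}^{T-1} Phi_z(z_s)) eps dF_theta(z_k)/dp
+```
+
+Under that assumption `L_z(z_T)^T e_T^{(p)}` matches the state-dependent part of the BPTT expression for `grad_theta L(z_T)`, component by component in `p`, but accumulated in the forward direction.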
+
+If hardware or cost makes forward-mode eligibility too expensive, the alternative local objective is an explicit contraction homeostat:
+
+```text
+R_contr = E_v sum_t log( ||Phi_z(z_t) v_t|| / ||v_t|| )
+```
+
+This can be estimated with two nearby physical trajectories or JVP hardware. A further option is a hard monotone/contractive parameterization that enforces a negative log norm. Either choice changes the objective or the architecture. It is a valid local fix, but it is not "better EP gradient quality."
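+
+For intuition, the scalar counterexample gives a closed form. This is a sketch that assumes `0 < eps k < 1` and a sum over `t = 0..T-1`, neither of which is stated for the full model:
+
+```text
+Phi_z = 1 - eps k
+R_contr = T log(1 - eps k)
+d R_contr / d k = -T eps / (1 - eps k)  <  0
+```
+
+So minimizing `R_contr` pushes `k` up and the contraction factor down, supplying a nonzero signal on exactly the parameter for which `d L(z*) / d k = 0`.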
+
+## Single decisive experiment
+
+Run oracle exact-equilibrium-adjoint training, not merely deeper EP, from the same pre-drift checkpoint and with `resreg=0`.
+
+At every update:
+
+1. Relax/refine to `z*`.
+2. Solve the exact adjoint `F_z(z*)^T lambda = -L_z(z*)` by GMRES or an equivalent oracle.
+3. Apply `grad_theta = L_theta + lambda^T F_theta`.
+4. Track `cos(oracle,g_transpose)`, finite-`T1` residual, and free-phase rho on the fixed validation batch every 100 steps.
+
+Decision rule:
+
+```text
+If oracle equilibrium-adjoint training keeps rho near the BPTT value and clears the wall, (a).
+If it still drifts rho toward 1 and blows up while cos(oracle,g_transpose) is near 1, (b).
+```
+
+My ruling is that the second outcome will occur. The exact equilibrium gradient is the wrong gradient for the finite-150-step computation; it cannot contain the missing finite-horizon contraction-defense term by construction.