# CODEX VERDICT: EP below-2.10 divergence

## Ruling

Verdict: (b) STRUCTURAL.

Converging the EP adjoint phase is necessary for a correct equilibrium-gradient estimate, but it is not sufficient to make this training problem behave like BPTT.

The exact equilibrium gradient differentiates `L(z*)`. BPTT differentiates the actual deployed computation `L(z_T)` with `T=150`. Those are different objectives whenever convergence is not effectively complete. The missing term is the finite-horizon residual/contraction term. No `t2sel` or `hr` knob can add that term to the exact fixed-point gradient.

## Fact-set check

1. Correct in substance. `runs/bptt_clean.log` reaches best val CE `1.8277` and keeps the finite-`T1` residual small, around `4e-4` to `1e-3` late in training. The failure is EP-specific in the comparable EP logs. The `rho~0.982` value is referenced by the rho prober header and scripts, but the requested log set does not contain the full `spec_bifurcation.py` output, so the CE/residual part is directly verified and the exact rho number is not independently reprinted in the available logs.
2. Correct. `runs/t2_sweep.log` shows `cos(g_EP,g_transpose)` rising from `0.742413` at `t2sel=10` to `0.998194` at `t2sel=160`. `runs/hr_ceiling_sweep.log` shows the remaining gap flat across `hr=0.04..0.8` at about `0.94..0.946`. That is adjoint-phase truncation, not beta-radius tuning. The code agrees: `lt_ep_train.py` calls `holo_a_track`/`holo_a_select2` with a fixed `t2sel`, and `holo_ep.py` selects a finite nudged snapshot rather than enforcing an adjoint residual.
3. Correct. The same sweep reports `cos(g_transpose,g_BPTT)=0.974976` at free-phase step residual `2.17e-5`. `diag_probe.log` shows the exact fixed-point regime, residual around `1e-9`, where `cos(g_transpose,g_BPTT)=1.000000`. This is the finite-horizon/free-residual gap.
4. Correct in causal direction, with one evidence caveat.
   `ep_redx.log` shows the sudden event: CE goes from `2.7417` at step 3200 with residual `2.5e-2` to CE `41` with residual `1.1e-1` at step 3300. `redx_traj.log` shows EP gradient quality degrading and residuals growing along the approach. The rho drift and the damping-resistant `c` sweep are encoded in `spec_bifurcation.py` and `spec_rho_vs_c.py` and cited in `t2fix_rho.log`, but the actual `c`-sweep output is not present in the requested logs. The important point remains: the finite-`T1` residual becomes hypersensitive near marginality.
5. Correct as the root cause, but not yet empirically closed for `t2sel=160`. `runs/t2fix_rho.log` only has steps 100, 200, 300 at CE about `6`, with `rho~0.794`; it does not prove that `t2sel=160` will pass or fail near CE 2.x. The math below decides the open question: exact equilibrium gradients still optimize the wrong objective for finite-time deployment.

## Why the exact equilibrium gradient lacks contraction defense

Let the relaxation map be

```text
Phi_theta(z) = z + eps F_theta(z)
```

and let `z*` satisfy `F_theta(z*) = 0`. The equilibrium objective is

```text
J_inf(theta) = L(z*(theta)).
```

Differentiating the fixed-point equation gives

```text
F_z dz*/dtheta + F_theta = 0
dz*/dtheta = -F_z^{-1} F_theta
```

Equivalently, solve the equilibrium adjoint

```text
F_z^T lambda = -L_z(z*)
grad_theta J_inf = L_theta + lambda^T F_theta.
```

That is exactly what the EP/AEP estimator is trying to approximate. It contains `F_z^{-1}`, so slow modes can amplify loss sensitivity. But it contains no term for the finite relaxation length: no `T`, no initial residual, no `Phi^T`, and no derivative of `rho(Phi_z)` unless changing that contraction also changes `z*` or the equilibrium loss. A parameter that changes the convergence rate while leaving the fixed point and readout loss unchanged has zero exact equilibrium gradient.
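
As a sanity check on the adjoint formula, here is a minimal sketch on a toy 2-D linear relaxation. Everything in it is an illustrative assumption rather than anything from `lt_ep_train.py`: the model `F(z) = -k (A z - b)`, the quadratic loss, the matrix `A`, the target `y`, and the helper names. The parameter `b` moves the fixed point, while `k` only rescales the relaxation rate. The sketch compares `lambda^T F_theta` for `b` against a finite difference of `L(z*)`, and reports the adjoint gradient for `k`.

```python
# Toy model (an assumption for illustration, not the model in lt_ep_train.py):
#   F(z) = -k * (A z - b),   L(z) = 0.5 * ||z - y||^2
# k only rescales the relaxation rate; b moves the fixed point z* = A^{-1} b.

A = [[2.0, 0.5], [0.3, 1.5]]
y = [1.0, -1.0]

def solve2(M, r):
    """Solve the 2x2 linear system M x = r by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]

def fixed_point(b):
    """F(z*) = 0 is equivalent to A z* = b for any k > 0."""
    return solve2(A, b)

def loss(z):
    return 0.5 * sum((zi - yi) ** 2 for zi, yi in zip(z, y))

def equilibrium_grads(k, b):
    """Return (dJ_inf/dk, dJ_inf/db) from the equilibrium adjoint."""
    z = fixed_point(b)
    L_z = [zi - yi for zi, yi in zip(z, y)]
    # F_z = -k A, so F_z^T lambda = -L_z reads (k A^T) lambda = L_z.
    kAT = [[k * A[0][0], k * A[1][0]],
           [k * A[0][1], k * A[1][1]]]
    lam = solve2(kAT, L_z)
    # F_k = -(A z* - b), which is zero at the fixed point up to rounding.
    F_k = [-(A[i][0] * z[0] + A[i][1] * z[1] - b[i]) for i in range(2)]
    grad_k = sum(l * f for l, f in zip(lam, F_k))
    # F_b = k I, so lambda^T F_b = k * lambda.
    grad_b = [k * l for l in lam]
    return grad_k, grad_b

def finite_difference_b(b, h=1e-6):
    """Central difference of L(z*(b)) with respect to each component of b."""
    out = []
    for i in range(2):
        bp = list(b)
        bm = list(b)
        bp[i] += h
        bm[i] -= h
        out.append((loss(fixed_point(bp)) - loss(fixed_point(bm))) / (2 * h))
    return out

if __name__ == "__main__":
    grad_k, grad_b = equilibrium_grads(k=0.7, b=[1.0, 2.0])
    print("dJ_inf/dk (adjoint):", grad_k)
    print("dJ_inf/db (adjoint):", grad_b)
    print("dJ_inf/db (finite difference):", finite_difference_b([1.0, 2.0]))
```

In this toy setting the `b` gradient agrees with the finite difference and does not depend on `k`, while the `k` gradient is zero up to rounding: the rate parameter is invisible to the equilibrium objective.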

Scalar counterexample:

```text
F_k(z) = -k (z - z*)
Phi_k(z) = z + eps F_k(z)
L = L(z*)
```

For any positive `k`, the fixed point is the same. Therefore

```text
d L(z*) / d k = 0.
```

But the finite state is

```text
z_T - z* = (1 - eps k)^T (z_0 - z*),
```

so

```text
d L(z_T) / d k
```

contains a term proportional to

```text
T (1 - eps k)^(T-1).
```

That is exactly the contraction-defense term. It is large near `rho=1`, precisely where `rho^150` stops decaying and, past `rho=1`, becomes explosive. It vanishes only in the true infinite-time limit when `rho<1` with enough margin.

For the full model, BPTT differentiates

```text
z_{t+1} = Phi_theta(z_t)
grad_theta L(z_T) = L_z(z_T)^T sum_{k=0}^{T-1} (prod_{s=k+1}^{T-1} Phi_z(z_s)) eps F_theta(z_k) + direct terms.
```

Those products are the same objects that determine finite-time contraction. When they decay slowly, the finite-horizon gradient feels it. Equilibrium EP replaces this whole finite product chain with the fixed-point inverse at `z*` and takes `T=infinity`; the transient residual term is gone.

The code implements this split exactly. In `lt_ep_train.py`, `ep_step` relaxes to `T1`, optionally refines beyond `T1`, and computes the task gradient at `zs` through `(a * f).sum()`. `bptt_step` unrolls exactly `T1` steps and differentiates `ce(blk, z, y)` at the final unrolled state. `evaluate()` also uses exactly `T1` relaxation steps. Therefore BPTT is optimizing the evaluated computation and EP is optimizing the refined fixed-point computation.

## Consequence

The exact equilibrium adjoint can be correct and still push into a marginal operator, because the equilibrium objective is indifferent to settling time except through its effect on `z*`.

The evidence that EP can reach `cos(g_EP,g_transpose)=0.998` only proves that EP can compute the fixed-point gradient. It does not prove that the fixed-point gradient contains BPTT's finite-horizon stabilizer. It does not.

So the fix is not "set `t2sel=160` and call the adjoint converged."
That removes one estimator error. It does not change the target objective. If the deployed model is `T1=150`, the training signal must include finite-horizon dynamics or an explicit contraction objective.

## Local forward-only fix

This is fundamental for pure equilibrium EP on `L(z*)`, but not fundamental for local forward-only learning if the objective is changed.

Concrete construction: finite-horizon forward-mode/RTRL eligibility training for `L(z_T)` plus, if needed, a local contraction penalty.

Run the physical relaxation forward for `T=150`. Alongside the state, propagate local eligibility traces:

```text
e_{t+1}^{(p)} = Phi_z(z_t) e_t^{(p)} + eps dF_theta(z_t)/dp
```

At `T`, form the local three-factor update

```text
Delta p proportional to - L_z(z_T)^T e_T^{(p)}.
```

This is forward-mode differentiation of the actual finite unroll. It is not reverse BPTT, and it is not a digital root finder. Exact per-parameter RTRL is expensive; practical versions use blockwise, low-rank, or random-direction eligibility traces. But this is the correct class of construction because it preserves the finite product terms that defend contraction.

If hardware or cost makes forward-mode eligibility too expensive, the alternative local objective is an explicit contraction homeostat:

```text
R_contr = E_v sum_t log( ||Phi_z(z_t) v_t|| / ||v_t|| )
```

estimated with two nearby physical trajectories or JVP hardware, or a hard monotone/contractive parameterization enforcing a negative log norm. This changes the objective/architecture. It is a valid local fix, but it is not "better EP gradient quality."

## Single decisive experiment

Run oracle exact-equilibrium-adjoint training, not merely deeper EP, from the same pre-drift checkpoint and with `resreg=0`.

At every update:

1. Relax/refine to `z*`.
2. Solve the exact adjoint `F_z(z*)^T lambda = -L_z(z*)` by GMRES or an equivalent oracle.
3. Apply `grad_theta = L_theta + lambda^T F_theta`.
4. Track `cos(oracle,g_transpose)`, finite-`T1` residual, and free-phase rho on the fixed validation batch every 100 steps.

Decision rule:

```text
If oracle equilibrium-adjoint training keeps rho near the BPTT value and clears the wall, (a).
If it still drifts rho toward 1 and blows while cos(oracle,g_transpose) is near 1, (b).
```

My ruling is that the second outcome will occur. The exact equilibrium gradient is the wrong gradient for the finite-150-step computation; it cannot contain the missing finite-horizon contraction-defense term by construction.
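
The missing term can be made concrete on the scalar counterexample. The following is a minimal sketch, not the training code: it carries the eligibility trace `e_t = dz_t/dk` forward alongside the relaxation and compares it with the closed-form `-eps T (1 - eps k)^(T-1) (z_0 - z*)`. The function names and the values of `eps`, `z0`, and `z_star` are assumptions chosen for illustration; only `T=150` comes from the setup above.

```python
# Scalar toy model from the counterexample: F_k(z) = -k * (z - z_star).
# eps, z0 and z_star are illustrative assumptions, not values from the logs.

def finite_horizon_sensitivity(k, eps=0.1, T=150, z0=1.0, z_star=0.0):
    """Relax for T steps while propagating the forward-mode trace e_t = dz_t/dk."""
    z, e = z0, 0.0
    for _ in range(T):
        phi_z = 1.0 - eps * k                # Phi_z(z_t) for this linear map
        dF_dk = -(z - z_star)                # dF/dk evaluated at z_t
        e = phi_z * e + eps * dF_dk          # eligibility update (uses z_t)
        z = z + eps * (-k * (z - z_star))    # relaxation step to z_{t+1}
    return z, e

def closed_form_sensitivity(k, eps=0.1, T=150, z0=1.0, z_star=0.0):
    """d z_T / d k obtained by differentiating (1 - eps k)^T (z0 - z_star)."""
    return -eps * T * (1.0 - eps * k) ** (T - 1) * (z0 - z_star)

if __name__ == "__main__":
    for k in (0.1, 1.0, 5.0):  # rho = 1 - eps*k = 0.99, 0.9, 0.5
        z_T, e_T = finite_horizon_sensitivity(k)
        print(f"k={k}: z_T={z_T:.3e}, trace={e_T:.3e}, "
              f"closed form={closed_form_sensitivity(k):.3e}")
```

The finite-horizon gradient with respect to `k` is `L_z(z_T) * e_T`. In this sketch the trace is sizeable when `1 - eps k` is close to 1 and negligible when contraction is strong, whereas `dz*/dk` is zero for every `k`.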