Diffstat (limited to 'ep_run/CODEX_VERDICT.md')
| -rw-r--r-- | ep_run/CODEX_VERDICT.md | 151 |
1 files changed, 151 insertions, 0 deletions
diff --git a/ep_run/CODEX_VERDICT.md b/ep_run/CODEX_VERDICT.md
new file mode 100644
index 0000000..7d29c39
--- /dev/null
+++ b/ep_run/CODEX_VERDICT.md
@@ -0,0 +1,151 @@
+# CODEX VERDICT: EP below-2.10 divergence
+
+## Ruling
+
+Verdict: (b) STRUCTURAL.
+
+Converging the EP adjoint phase is necessary for a correct equilibrium-gradient estimate, but it is not sufficient to make this training problem behave like BPTT. The exact equilibrium gradient differentiates `L(z*)`. BPTT differentiates the actual deployed computation `L(z_T)` with `T=150`. Those are different objectives whenever convergence is not effectively complete. The missing term is the finite-horizon residual/contraction term. No `t2sel` or `hr` knob can add that term to the exact fixed-point gradient.
+
+## Fact-set check
+
+1. Correct in substance. `runs/bptt_clean.log` reaches best val CE `1.8277` and keeps the finite-`T1` residual small, around `4e-4` to `1e-3` late in training. The failure is EP-specific in the comparable EP logs. The `rho~0.982` value is referenced by the rho prober header and scripts, but the requested log set does not contain the full `spec_bifurcation.py` output, so the CE/residual part is directly verified and the exact rho number is not independently reprinted in the available logs.
+
+2. Correct. `runs/t2_sweep.log` shows `cos(g_EP,g_transpose)` rising from `0.742413` at `t2sel=10` to `0.998194` at `t2sel=160`. `runs/hr_ceiling_sweep.log` shows the remaining gap flat across `hr=0.04..0.8` at about `0.94..0.946`. That is adjoint-phase truncation, not beta-radius tuning. The code agrees: `lt_ep_train.py` calls `holo_a_track`/`holo_a_select2` with a fixed `t2sel`, and `holo_ep.py` selects a finite nudged snapshot rather than enforcing an adjoint residual.
+
+3. Correct. The same sweep reports `cos(g_transpose,g_BPTT)=0.974976` at free-phase step residual `2.17e-5`.
`diag_probe.log` shows the exact fixed-point regime, residual around `1e-9`, where `cos(g_transpose,g_BPTT)=1.000000`. This is the finite-horizon/free-residual gap.
+
+4. Correct in causal direction, with one evidence caveat. `ep_redx.log` shows the sudden event: CE goes from `2.7417` at step 3200 with residual `2.5e-2` to CE `41` with residual `1.1e-1` at step 3300. `redx_traj.log` shows EP gradient quality degrading and residuals growing along the approach. The rho drift and damping-resistant `c` sweep are encoded in `spec_bifurcation.py`, `spec_rho_vs_c.py`, and cited in `t2fix_rho.log`, but the actual c-sweep output is not present in the requested logs. The important point remains: the finite-`T1` residual becomes hypersensitive near marginality.
+
+5. Correct as the root cause, but not yet empirically closed for `t2sel=160`. `runs/t2fix_rho.log` only has steps 100, 200, 300 at CE about `6`, with `rho~0.794`; it does not prove that `t2sel=160` will pass or fail near CE 2.x. The math below decides the open question: exact equilibrium gradients still optimize the wrong objective for finite-time deployment.
+
+## Why the exact equilibrium gradient lacks contraction defense
+
+Let the relaxation map be
+
+```text
+Phi_theta(z) = z + eps F_theta(z)
+```
+
+and let `z*` satisfy `F_theta(z*) = 0`. The equilibrium objective is
+
+```text
+J_inf(theta) = L(z*(theta)).
+```
+
+Differentiating the fixed-point equation gives
+
+```text
+F_z dz*/dtheta + F_theta = 0
+dz*/dtheta = -F_z^{-1} F_theta
+```
+
+Equivalently, solve the equilibrium adjoint
+
+```text
+F_z^T lambda = -L_z(z*)
+grad_theta J_inf = L_theta + lambda^T F_theta.
+```
+
+That is exactly what the EP/AEP estimator is trying to approximate. It contains `F_z^{-1}`, so slow modes can amplify loss sensitivity.
But it contains no term for the finite relaxation length, no `T`, no initial residual, no `Phi^T`, and no derivative of `rho(Phi_z)` unless changing that contraction also changes `z*` or the equilibrium loss. A parameter that changes the convergence rate while leaving the fixed point and readout loss unchanged has zero exact equilibrium gradient.
+
+Scalar counterexample:
+
+```text
+F_k(z) = -k (z - z*)
+Phi_k(z) = z + eps F_k(z)
+L = L(z*)
+```
+
+For any positive `k`, the fixed point is the same. Therefore
+
+```text
+d L(z*) / d k = 0.
+```
+
+But the finite state is
+
+```text
+z_T - z* = (1 - eps k)^T (z_0 - z*),
+```
+
+so
+
+```text
+d L(z_T) / d k
+```
+
+contains a term proportional to
+
+```text
+T (1 - eps k)^(T-1).
+```
+
+That is exactly the contraction-defense term. It is large near `rho=1`, precisely where `rho^150` becomes explosive. It vanishes only in the true infinite-time limit when `rho<1` with enough margin.
+
+For the full model, BPTT differentiates
+
+```text
+z_{t+1} = Phi_theta(z_t)
+grad_theta L(z_T)
+  = L_z(z_T)^T sum_{k=0}^{T-1}
+      (prod_{s=k+1}^{T-1} Phi_z(z_s)) eps F_theta(z_k)
+  + direct terms.
+```
+
+Those products are the same objects that determine finite-time contraction. When they decay slowly, the finite-horizon gradient feels it. Equilibrium EP replaces this whole finite product chain with the fixed-point inverse at `z*` and takes `T=infinity`; the transient residual term is gone.
+
+The code implements this split exactly. In `lt_ep_train.py`, `ep_step` relaxes to `T1`, optionally refines beyond `T1`, and computes the task gradient at `zs` through `(a * f).sum()`. `bptt_step` unrolls exactly `T1` steps and differentiates `ce(blk, z, y)` at the final unrolled state. `evaluate()` also uses exactly `T1` relaxation steps. Therefore BPTT is optimizing the evaluated computation and EP is optimizing the refined fixed-point computation.
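The scalar counterexample can be checked numerically. The sketch below is not taken from the repository: the quadratic loss, the target `y`, and the values of `eps`, `T`, `z0`, and `z_star` are illustrative assumptions. It compares the analytic `d L(z_T) / d k` with the equilibrium value, which is zero because `z*` does not depend on `k`.

```python
# Toy check of the scalar counterexample. The quadratic loss and the values
# of eps, T, z0, z_star, and y are illustrative assumptions, not log values.

def relax(k, eps=0.1, T=150, z0=0.0, z_star=1.0):
    """Run T steps of z <- z + eps * F_k(z), with F_k(z) = -k (z - z_star)."""
    z = z0
    for _ in range(T):
        z = z + eps * (-k * (z - z_star))
    return z


def loss(z, y=2.0):
    """Assumed readout loss; any smooth L(z) would do."""
    return 0.5 * (z - y) ** 2


def finite_horizon_grad(k, eps=0.1, T=150, z0=0.0, z_star=1.0, y=2.0):
    """Analytic d L(z_T) / d k for the deployed T-step computation."""
    z_T = relax(k, eps, T, z0, z_star)
    # z_T - z* = (1 - eps k)^T (z0 - z*), so the derivative carries the
    # factor T (1 - eps k)^(T-1) named in the text.
    dzT_dk = -eps * T * (1.0 - eps * k) ** (T - 1) * (z0 - z_star)
    return (z_T - y) * dzT_dk


def equilibrium_grad(k):
    """d L(z*) / d k: the fixed point is independent of k, so this is zero."""
    return 0.0


if __name__ == "__main__":
    for k in (0.1, 1.0, 5.0):
        rho = 1.0 - 0.1 * k  # contraction factor of Phi_k at eps = 0.1
        print(f"k={k} rho={rho:.2f} "
              f"finite={finite_horizon_grad(k):.3e} "
              f"equilibrium={equilibrium_grad(k)}")
```

With these assumed values, `k=0.1` gives `rho=0.99`, where the finite-horizon derivative is of order one while the equilibrium derivative is exactly zero; at `k=5.0` (`rho=0.5`) the finite-horizon term is negligible.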
+
+## Consequence
+
+The exact equilibrium adjoint can be correct and still push into a marginal operator, because the equilibrium objective is indifferent to settling time except through its effect on `z*`. The evidence that EP can reach `cos(g_EP,g_transpose)=0.998` only proves that EP can compute the fixed-point gradient. It does not prove that the fixed-point gradient contains BPTT's finite-horizon stabilizer. It does not.
+
+So the fix is not "set `t2sel=160` and call the adjoint converged." That removes one estimator error. It does not change the target objective. If the deployed model is `T1=150`, the training signal must include finite-horizon dynamics or an explicit contraction objective.
+
+## Local forward-only fix
+
+This is fundamental for pure equilibrium EP on `L(z*)`, but not fundamental for local forward-only learning if the objective is changed.
+
+Concrete construction: finite-horizon forward-mode/RTRL eligibility training for `L(z_T)` plus, if needed, a local contraction penalty.
+
+Run the physical relaxation forward for `T=150`. Alongside the state, propagate local eligibility traces:
+
+```text
+e_{t+1}^{(p)} = Phi_z(z_t) e_t^{(p)} + eps dF_theta(z_t)/dp
+```
+
+At `T`, form the local three-factor update
+
+```text
+Delta p proportional to - L_z(z_T)^T e_T^{(p)}.
+```
+
+This is forward-mode differentiation of the actual finite unroll. It is not reverse BPTT, and it is not a digital root finder. Exact per-parameter RTRL is expensive; practical versions use blockwise, low-rank, or random-direction eligibility traces. But this is the correct class of construction because it preserves the finite product terms that defend contraction.
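The eligibility-trace recursion can be sketched on a one-dimensional toy model. Everything here is an assumption for illustration: the dynamics `F(z) = -z + tanh(w*z + b)`, the quadratic loss, and the function names do not come from `lt_ep_train.py` or `holo_ep.py`. The point is only that the traces are carried forward with the state and yield the gradient of `L(z_T)` without a backward pass.

```python
import math

# Hypothetical scalar model (an assumption): F(z) = -z + tanh(w*z + b),
# relaxed with Phi(z) = z + eps * F(z) for exactly T steps.


def forward_with_traces(w, b, eps=0.1, T=150, z0=0.0):
    """Relax for T steps while propagating eligibility traces for w and b."""
    z, e_w, e_b = z0, 0.0, 0.0
    for _ in range(T):
        h = math.tanh(w * z + b)
        s = 1.0 - h * h                      # derivative of tanh at w*z + b
        phi_z = 1.0 + eps * (-1.0 + w * s)   # Phi_z(z_t)
        # e_{t+1} = Phi_z(z_t) e_t + eps dF/dp, evaluated at z_t
        e_w = phi_z * e_w + eps * z * s      # dF/dw = z * s
        e_b = phi_z * e_b + eps * s          # dF/db = s
        z = z + eps * (-z + h)               # state update comes last
    return z, e_w, e_b


def trace_gradient(w, b, y=1.0, eps=0.1, T=150, z0=0.0):
    """Gradient of L(z_T) = 0.5 (z_T - y)^2 formed as L_z(z_T) * e_T."""
    z_T, e_w, e_b = forward_with_traces(w, b, eps, T, z0)
    dL_dz = z_T - y
    return dL_dz * e_w, dL_dz * e_b


def finite_loss(w, b, y=1.0, eps=0.1, T=150, z0=0.0):
    """Loss of the deployed T-step computation, used for checking."""
    z_T, _, _ = forward_with_traces(w, b, eps, T, z0)
    return 0.5 * (z_T - y) ** 2
```

In this scalar case the trace is exact RTRL; the blockwise, low-rank, or random-direction variants mentioned above would replace `e_w` and `e_b` with cheaper approximations. A finite-difference comparison against `finite_loss` is a quick sanity check.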
+
+If hardware or cost makes forward-mode eligibility too expensive, the alternative local objective is an explicit contraction homeostat:
+
+```text
+R_contr = E_v sum_t log( ||Phi_z(z_t) v_t|| / ||v_t|| )
+```
+
+estimated with two nearby physical trajectories or JVP hardware, or a hard monotone/contractive parameterization enforcing a negative log norm. This changes the objective/architecture. It is a valid local fix, but it is not "better EP gradient quality."
+
+## Single decisive experiment
+
+Run oracle exact-equilibrium-adjoint training, not merely deeper EP, from the same pre-drift checkpoint and with `resreg=0`.
+
+At every update:
+
+1. Relax/refine to `z*`.
+2. Solve the exact adjoint `F_z(z*)^T lambda = -L_z(z*)` by GMRES or an equivalent oracle.
+3. Apply `grad_theta = L_theta + lambda^T F_theta`.
+4. Track `cos(oracle,g_transpose)`, finite-`T1` residual, and free-phase rho on the fixed validation batch every 100 steps.
+
+Decision rule:
+
+```text
+If oracle equilibrium-adjoint training keeps rho near the BPTT value and clears the wall, (a).
+If it still drifts rho toward 1 and blows while cos(oracle,g_transpose) is near 1, (b).
+```
+
+My ruling is that the second outcome will occur. The exact equilibrium gradient is the wrong gradient for the finite-150-step computation; it cannot contain the missing finite-horizon contraction-defense term by construction.
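Steps 1 to 3 of the experiment can be sketched on a scalar toy model, where the adjoint solve reduces to a division and stands in for GMRES. The dynamics `F(z) = -z + tanh(w*z + b)`, the quadratic loss, and all names below are assumptions for illustration, not the repository's code; in this toy the loss has no direct parameter dependence, so `L_theta = 0`.

```python
import math

# Hypothetical scalar model (an assumption): F(z) = -z + tanh(w*z + b).


def F(z, w, b):
    return -z + math.tanh(w * z + b)


def relax_to_fixed_point(w, b, eps=0.5, steps=500, z0=0.0):
    """Step 1: relax z <- z + eps * F(z) until it has settled at z*."""
    z = z0
    for _ in range(steps):
        z = z + eps * F(z, w, b)
    return z


def equilibrium_adjoint_grad(w, b, y=1.0):
    """Steps 2 and 3 for L(z*) = 0.5 (z* - y)^2 in the scalar case."""
    z_star = relax_to_fixed_point(w, b)
    h = math.tanh(w * z_star + b)
    s = 1.0 - h * h
    F_z = -1.0 + w * s
    # Step 2: F_z^T lambda = -L_z(z*); scalar, so a division is the oracle.
    lam = -(z_star - y) / F_z
    # Step 3: grad_theta = L_theta + lambda^T F_theta, with L_theta = 0 here.
    grad_w = lam * z_star * s   # F_w = z* * s
    grad_b = lam * s            # F_b = s
    return grad_w, grad_b


def equilibrium_loss(w, b, y=1.0):
    """J_inf(theta) = L(z*(theta)), used to check the adjoint gradient."""
    z_star = relax_to_fixed_point(w, b)
    return 0.5 * (z_star - y) ** 2
```

This gradient depends only on `z*` and the local derivatives there; the relaxation length never enters it, which is the property the decision rule is designed to test at scale.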
