summaryrefslogtreecommitdiff
path: root/ep_run/CODEX_VERDICT.md
blob: 7d29c3984163c434ff9c997e61b176adbe55f2ad (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
# CODEX VERDICT: EP below-2.10 divergence

## Ruling

Verdict: (b) STRUCTURAL.

Converging the EP adjoint phase is necessary for a correct equilibrium-gradient estimate, but it is not sufficient to make this training problem behave like BPTT. The exact equilibrium gradient differentiates `L(z*)`. BPTT differentiates the actual deployed computation `L(z_T)` with `T=150`. Those are different objectives whenever convergence is not effectively complete. The missing term is the finite-horizon residual/contraction term. No `t2sel` or `hr` knob can add that term to the exact fixed-point gradient.

## Fact-set check

1. Correct in substance. `runs/bptt_clean.log` reaches best val CE `1.8277` and keeps the finite-`T1` residual small, around `4e-4` to `1e-3` late in training. The failure is EP-specific in the comparable EP logs. The `rho~0.982` value is referenced by the rho prober header and scripts, but the requested log set does not contain the full `spec_bifurcation.py` output, so the CE/residual part is directly verified and the exact rho number is not independently reprinted in the available logs.

2. Correct. `runs/t2_sweep.log` shows `cos(g_EP,g_transpose)` rising from `0.742413` at `t2sel=10` to `0.998194` at `t2sel=160`. `runs/hr_ceiling_sweep.log` shows the remaining gap flat across `hr=0.04..0.8` at about `0.94..0.946`. That is adjoint-phase truncation, not beta-radius tuning. The code agrees: `lt_ep_train.py` calls `holo_a_track`/`holo_a_select2` with a fixed `t2sel`, and `holo_ep.py` selects a finite nudged snapshot rather than enforcing an adjoint residual.

3. Correct. The same sweep reports `cos(g_transpose,g_BPTT)=0.974976` at free-phase step residual `2.17e-5`. `diag_probe.log` shows the exact fixed-point regime, residual around `1e-9`, where `cos(g_transpose,g_BPTT)=1.000000`. This is the finite-horizon/free-residual gap.

4. Correct in causal direction, with one evidence caveat. `ep_redx.log` shows the sudden event: CE goes from `2.7417` at step 3200 with residual `2.5e-2` to CE `41` with residual `1.1e-1` at step 3300. `redx_traj.log` shows EP gradient quality degrading and residuals growing along the approach. The rho drift and damping-resistant `c` sweep are encoded in `spec_bifurcation.py`, `spec_rho_vs_c.py`, and cited in `t2fix_rho.log`, but the actual c-sweep output is not present in the requested logs. The important point remains: the finite-`T1` residual becomes hypersensitive near marginality.

5. Correct as the root cause, but not yet empirically closed for `t2sel=160`. `runs/t2fix_rho.log` only has steps 100, 200, 300 at CE about `6`, with `rho~0.794`; it does not prove that `t2sel=160` will pass or fail near CE 2.x. The math below decides the open question: exact equilibrium gradients still optimize the wrong objective for finite-time deployment.

## Why the exact equilibrium gradient lacks contraction defense

Let the relaxation map be

```text
Phi_theta(z) = z + eps F_theta(z)
```

and let `z*` satisfy `F_theta(z*) = 0`. The equilibrium objective is

```text
J_inf(theta) = L(z*(theta)).
```

Differentiating the fixed-point equation gives

```text
F_z dz*/dtheta + F_theta = 0
dz*/dtheta = -F_z^{-1} F_theta
```

Equivalently, solve the equilibrium adjoint

```text
F_z^T lambda = -L_z(z*)
grad_theta J_inf = L_theta + lambda^T F_theta.
```

That is exactly what the EP/AEP estimator is trying to approximate. It contains `F_z^{-1}`, so slow modes can amplify loss sensitivity. But it contains no term for the finite relaxation length, no `T`, no initial residual, no `Phi^T`, and no derivative of `rho(Phi_z)` unless changing that contraction also changes `z*` or the equilibrium loss. A parameter that changes the convergence rate while leaving the fixed point and readout loss unchanged has zero exact equilibrium gradient.

Scalar counterexample:

```text
F_k(z) = -k (z - z*)
Phi_k(z) = z + eps F_k(z)
L = L(z*)
```

For any positive `k`, the fixed point is the same. Therefore

```text
d L(z*) / d k = 0.
```

But the finite state is

```text
z_T - z* = (1 - eps k)^T (z_0 - z*),
```

so

```text
d L(z_T) / d k
```

contains a term proportional to

```text
T (1 - eps k)^(T-1).
```

That is exactly the contraction-defense term. It is large near `rho=1`, precisely where `rho^150` becomes explosive. It vanishes only in the true infinite-time limit when `rho<1` with enough margin.

For the full model, BPTT differentiates

```text
z_{t+1} = Phi_theta(z_t)
grad_theta L(z_T)
  = L_z(z_T)^T sum_{k=0}^{T-1}
      (prod_{s=k+1}^{T-1} Phi_z(z_s)) eps F_theta(z_k)
    + direct terms.
```

Those products are the same objects that determine finite-time contraction. When they decay slowly, the finite-horizon gradient feels it. Equilibrium EP replaces this whole finite product chain with the fixed-point inverse at `z*` and takes `T=infinity`; the transient residual term is gone.

The code implements this split exactly. In `lt_ep_train.py`, `ep_step` relaxes to `T1`, optionally refines beyond `T1`, and computes the task gradient at `zs` through `(a * f).sum()`. `bptt_step` unrolls exactly `T1` steps and differentiates `ce(blk, z, y)` at the final unrolled state. `evaluate()` also uses exactly `T1` relaxation steps. Therefore BPTT is optimizing the evaluated computation and EP is optimizing the refined fixed-point computation.

## Consequence

The exact equilibrium adjoint can be correct and still push into a marginal operator, because the equilibrium objective is indifferent to settling time except through its effect on `z*`. The evidence that EP can reach `cos(g_EP,g_transpose)=0.998` only proves that EP can compute the fixed-point gradient. It does not prove that the fixed-point gradient contains BPTT's finite-horizon stabilizer. It does not.

So the fix is not "set `t2sel=160` and call the adjoint converged." That removes one estimator error. It does not change the target objective. If the deployed model is `T1=150`, the training signal must include finite-horizon dynamics or an explicit contraction objective.

## Local forward-only fix

This is fundamental for pure equilibrium EP on `L(z*)`, but not fundamental for local forward-only learning if the objective is changed.

Concrete construction: finite-horizon forward-mode/RTRL eligibility training for `L(z_T)` plus, if needed, a local contraction penalty.

Run the physical relaxation forward for `T=150`. Alongside the state, propagate local eligibility traces:

```text
e_{t+1}^{(p)} = Phi_z(z_t) e_t^{(p)} + eps dF_theta(z_t)/dp
```

At `T`, form the local three-factor update

```text
Delta p proportional to - L_z(z_T)^T e_T^{(p)}.
```

This is forward-mode differentiation of the actual finite unroll. It is not reverse BPTT, and it is not a digital root finder. Exact per-parameter RTRL is expensive; practical versions use blockwise, low-rank, or random-direction eligibility traces. But this is the correct class of construction because it preserves the finite product terms that defend contraction.

If hardware or cost makes forward-mode eligibility too expensive, the alternative local objective is an explicit contraction homeostat:

```text
R_contr = E_v sum_t log( ||Phi_z(z_t) v_t|| / ||v_t|| )
```

estimated with two nearby physical trajectories or JVP hardware, or a hard monotone/contractive parameterization enforcing a negative log norm. This changes the objective/architecture. It is a valid local fix, but it is not "better EP gradient quality."

## Single decisive experiment

Run oracle exact-equilibrium-adjoint training, not merely deeper EP, from the same pre-drift checkpoint and with `resreg=0`.

At every update:

1. Relax/refine to `z*`.
2. Solve the exact adjoint `F_z(z*)^T lambda = -L_z(z*)` by GMRES or an equivalent oracle.
3. Apply `grad_theta = L_theta + lambda^T F_theta`.
4. Track `cos(oracle,g_transpose)`, finite-`T1` residual, and free-phase rho on the fixed validation batch every 100 steps.

Decision rule:

```text
If oracle equilibrium-adjoint training keeps rho near the BPTT value and clears the wall, (a).
If it still drifts rho toward 1 and blows while cos(oracle,g_transpose) is near 1, (b).
```

My ruling is that the second outcome will occur. The exact equilibrium gradient is the wrong gradient for the finite-150-step computation; it cannot contain the missing finite-horizon contraction-defense term by construction.