| author | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
|---|---|---|
| committer | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
| commit | b83947778e2c776f757a07d4719b7ce961d7ed55 (patch) | |
| tree | b9cc01d7adda691d9156d9d04f4fb2f644674e96 /docs/hardware/GPT_PRO_PHYSICS_ANSWER.md | |
Initial commit: ept — backprop-free equilibrium transformer (EP)
Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
Diffstat (limited to 'docs/hardware/GPT_PRO_PHYSICS_ANSWER.md')
| -rw-r--r-- | docs/hardware/GPT_PRO_PHYSICS_ANSWER.md | 46 |
1 files changed, 46 insertions, 0 deletions
diff --git a/docs/hardware/GPT_PRO_PHYSICS_ANSWER.md b/docs/hardware/GPT_PRO_PHYSICS_ANSWER.md
new file mode 100644
index 0000000..6dfbde2
--- /dev/null
+++ b/docs/hardware/GPT_PRO_PHYSICS_ANSWER.md
@@ -0,0 +1,46 @@
+# GPT-Pro physics consult — actionable synthesis (2026-06-21)
+Full answer pasted by user (in session). This file = the load-bearing extraction + the prescribed experiment.
+
+## The decisive reframe: the no-go is REAL but NARROW
+- **Same-dynamics, settled-state-only EP computes `J⁻¹ℓ`; the true implicit gradient needs `J⁻ᵀℓ`.**
+  True adjoint `λ = −J⁻ᵀℓ` (ℓ=∂L/∂z*); EP's infinitesimal nudge contrast solves `J s = −ℓ` ⇒ `s = −J⁻¹ℓ`.
+  Exactness for all ℓ ⇒ requires `J = Jᵀ` (self-adjoint in the physical metric P: `PJ=JᵀP`). **That is the precise no-go.**
+- **But the BROADER no-go is FALSE.** `ρ(S⁻¹A)<1` is ONLY the convergence condition of the *first-order* (Neumann) AEP
+  expansion — NOT a universal impossibility. Exact non-conservative learning EXISTS if you alter the learning dynamics:
+  - **AsymEP** (Scurria 2026): add a local antisymmetric correction in the nudged phase so the modified nudged Jacobian = `Jᵀ`.
+  - **Dyadic EP** (Scurria 2026): double the state space; the auxiliary variable becomes a physical error signal ∝ cost gradient. EXACT for arbitrary non-conservative stationary systems.
+- ⇒ **The obstruction MOVES from "gradient estimator impossible" to "can the HARDWARE embody the adjoint (transpose couplings / doubled state) without becoming a hidden digital backprop engine?"** Math exists; physical realizability is the open problem.
+- Exact error operator (better than the ρ proxy): `E = J⁻¹Jᵀ − I = −2 J⁻¹A`. `‖S⁻¹A‖` is a diagnostic only when S is well-conditioned.
+
+## THE PRESCRIBED PRIORITY EXPERIMENT (GPT-Pro's stated #1: "oracle adjoint physics ablation")
+At the SAME checkpoints, implement THREE gradient estimators + sweep an asymmetry knob ε (scale of the antisymmetric attention component):
+1. **vanilla/our EP** (`g_EP`) — gives `J⁻¹ℓ`-flavored.
+2. **exact transpose relaxation** (`g_transpose`): solve `Jᵀ λ = −ℓ` (GMRES; Jᵀ via vjp), then `g = (∂F/∂θ)ᵀ λ`. = the EXACT implicit gradient.
+3. **Dyadic/AsymEP** with exact local `A` from autograd as a simulator ORACLE.
+Reference: `g_BPTT` (finite-T).
+**DECISION:** if `g_transpose` (and/or dyadic) stays aligned + correctly scaled vs `g_BPTT` even at `ρ(S⁻¹A)>1` (closing our measured 84% gap) ⇒ **the collapse is NOT a mathematical no-go — it's a hardware-realization / convergence problem** ⇒ the research problem becomes "build a local transpose/dyadic substrate." If `g_transpose` ALSO fails ⇒ culprit is finite residual / nonlinearity / saddle instability / loss of contraction, NOT the conservativity assumption.
+- MUST run at a HEALTHY tight fixed point (the BPTT good solution, or a converged specnorm ckpt) — at the divergence edge (no fixed point) the solve is ill-conditioned (we saw it: MINRES resid 0.2).
+
+## Q2 (damping) — our suspicion CONFIRMED as plausible
+`‖S⁻¹A‖ ≤ ‖A_G‖ / ((1+c) − ‖S_G‖)`. Two routes to high ρ: numerator `‖A_G‖` grows, OR contraction margin `(1+c)−‖S_G‖` collapses (S→singular). Weak-contraction good solutions fit the SECOND. Raising `c` buys EP-validity ONLY for *gratuitous* asymmetry; if you rescale attn/FFN by `(1+c)` to preserve the map, A_G & S_G scale too and ρ may not improve. **Damping sweep is high-value.** Decompose ρ into `‖A‖`, log-norm margin `μ_P(J)`, transient gain `sup_t‖e^{Jt}‖`.
+
+## Q3 (physical adjoint) — buildable, not free
+- Transpose relaxation `λ̇ = Jᵀλ + ℓ` = recurrent-backprop/adjoint relaxation (NOT BPTT-through-time), physically allowed IF substrate has transpose couplings. Reciprocal crossbar: `Wᵀ` = swap drive/sense ports. Nonreciprocal attention: transpose includes state-dependent transposes of LN/softmax/QK/value/output ⇒ a "transpose twin" = duplicate routing + calibration + derivative circuits + extra phase.
+- **Hamiltonian Echo (HEB/HEL)**: physical adjoint via time-reversal — but does NOT directly extend to DAMPED dissipative dynamics (dissipation destroys time-reversal info). Need bath/ancilla embedding or dyadic doubling. (Pourcel 2026: Lagrangian EP ⊇ HEL, forward-only, passes-independent-of-size, local.)
+- Passive RC/resistive networks are energy-based (Kirchhoff ⇒ energy) — that's WHY local EP works there; they will NOT natively give `Jᵀ` for directed attention.
+- Falsify: simulate transpose mismatch `Jᵀ+Δ` (sparsity/calibration/quantization error) → measure cos/norm-error/stability vs `‖Δ‖/‖J‖` = the HARDWARE TOLERANCE SPEC.
+
+## Q5/Q6 — non-normality & conservative cost
+- Non-normality is NOT purely bad (nnRNN uses it for expressivity w/ stable spectra). Right objective: bound transient gain + adjoint mismatch, not kill A. Physical regularizer = contraction in a metric (`μ_P(J)<−m`), NOT Euclidean spectral radius.
+- Conservative-attention's 0.15–0.2 CE gap is likely **parameterization/efficiency cost, NOT an information-theoretic wall** (conservative systems can encode directed maps via latent/paired compartments). Practical path = partially-conservative `F = F_energy + ε F_skew`, controlled ε per head/layer, keep `ρ<1` if using vanilla EP.
+
+## Q7 — make-or-break (staged, not monolithic)
+Train 4× (10–50M, same corpus): (a) free non-conservative attn + BPTT [oracle]; (b) conservative Energy/Hopfield/CET + exact digital grad [architecture upper bound]; (c) same conservative + EP at strict residual [EP-estimator test]; (d) partially-conservative ε-skew + exact transpose/Dyadic EP. The result names the WALL: architecture (a bad) / EP-estimator (b good, c bad) / physical-adjoint (transpose works, local impl fails) / non-equilibrium (only finite-T BPTT readable).
+
+## Refs surfaced (pursue)
+Dyadic EP + AsymEP = Scurria 2026 (arXiv:2602.03670, the EXACT non-conservative method). Lagrangian EP / Hamiltonian Echo Learning = Pourcel 2026 (arXiv:2506.06248) + RHEL (arXiv:2506.05259, forward-only BPTT proxy). Energy Transformer = Hoover/Krotov (arXiv:2302.07253); Convergent Energy Transformer (EP-trainable, OpenReview Qrfml76eWJ — already in our HW collab docs). Monotone-operator equ. nets = Winston-Kolter (arXiv:2006.08591). Resistive-net universal approx = arXiv:2312.15063. nnRNN = arXiv:1905.12080.
+
+## NEXT (our plan)
+1. BUILD the oracle-adjoint probe (extend asym_probe: add `g_transpose` via GMRES solve of `Jᵀλ=−ℓ` + `(∂F/∂θ)ᵀλ`; + synthetic ε-asymmetry sweep on the J operator). [codex]
+2. RUN at the BPTT good solution (watcher pings) + the ε-sweep on a current converged ckpt. DECIDES fundamental-vs-hardware.
+3. Secondary: ρ(S⁻¹A) decomposition (‖A‖ / μ_P margin / transient gain) at BPTT ckpts; damping-c continuation.
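Reviewer's illustration (not part of the commit): the note's central claim is that the EP nudge contrast solves `J s = −ℓ` while the true adjoint solves `Jᵀ λ = −ℓ`, with exact error operator `E = J⁻¹Jᵀ − I = −2 J⁻¹A`. The toy below is a minimal sketch of that comparison on a random dense linear system. Everything in it is an assumption made for illustration: the helper name `adjoint_vs_ep`, the random construction `J = S + εA`, and the use of dense `np.linalg.solve` in place of the GMRES/vjp solve the note prescribes. It does not touch `ep_run/` or `asym_probe`.

```python
import numpy as np


def adjoint_vs_ep(eps, n=16, seed=0):
    """Compare the EP-style solve with the true adjoint solve on a toy J.

    Hypothetical helper for illustration only; J is a random dense matrix,
    not a Jacobian taken from any model checkpoint.
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((n, n))
    C = rng.standard_normal((n, n))

    # Symmetric part: negative definite, so J is invertible for every eps.
    S = -(np.eye(n) + B @ B.T / n)
    # Antisymmetric part, scaled by the asymmetry knob eps.
    A = eps * (C - C.T) / (2.0 * np.sqrt(n))
    J = S + A
    ell = rng.standard_normal(n)  # stands in for dL/dz*

    s = np.linalg.solve(J, -ell)      # EP nudge contrast: J s = -ell
    lam = np.linalg.solve(J.T, -ell)  # true adjoint:      J^T lam = -ell

    # Error operator E = J^{-1} J^T - I and its closed form -2 J^{-1} A.
    E = np.linalg.solve(J, J.T) - np.eye(n)
    E_closed = -2.0 * np.linalg.solve(J, A)

    cos = float(s @ lam / (np.linalg.norm(s) * np.linalg.norm(lam)))
    return {
        "cos": cos,
        "norm_ratio": float(np.linalg.norm(s) / np.linalg.norm(lam)),
        # Both gaps should sit at round-off level if the identities hold.
        "identity_gap": float(np.linalg.norm(E - E_closed)),
        "residual_gap": float(np.linalg.norm((s - lam) - E @ lam)),
    }


if __name__ == "__main__":
    for eps in (0.0, 0.5, 1.0, 2.0):
        r = adjoint_vs_ep(eps)
        print(f"eps={eps:.1f}  cos={r['cos']:.4f}  |s|/|lam|={r['norm_ratio']:.4f}")
```

At `eps = 0` the two solves coincide because `J = Jᵀ`. For `eps > 0` they separate, and the gap is `s − λ = E λ`, which follows from `J s = Jᵀ λ`. A real probe would presumably replace the dense solves with a matrix-free solver driven by vector-Jacobian products at a converged fixed point, as the note describes.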
