# Deep-reasoning brief: the physics of backprop-free equilibrium learning on non-conservative systems
*Self-contained briefing for a deep-reasoning consult. Goal: identify whether the central obstruction is fundamental, and what the physically-realizable escape routes are. 2026-06-21.*

---

## 0. One-paragraph setup
We train a **transformer as a physical equilibrium system** — a fixed point `z*` of a damped relaxation `ż = F(z)`, `F(z) = x_in − (1+c)z + Attn(LN z) + FFN(LN z)` (`c>0` damping) — **without backpropagation**, using **Equilibrium Propagation (EP)**: a free relaxation to `z*`, a nudged relaxation with the output perturbed by `β ∂L/∂z`, and a **local** weight update from the contrast of the two settled states. The motivation is **analog hardware**: the relaxation is physical settling (≈ free), and the update is local (no backward pass). On such hardware **BPTT is physically impossible** (no backward-through-time). The problem: EP trains cleanly until attention becomes expressive, then the gradient estimate collapses and training diverges — while exact BPTT with the identical model sails to a good solution.
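
A minimal numerical sketch of the two-phase procedure, assuming a toy *linear, symmetric* stand-in `F(z) = x − (1+c)z + Wz` for the attention/FFN terms and a squared-error loss `L = ½‖z − y‖²`. Both are illustration-only assumptions, as are the names `relax`, `grad_ep` and `grad_exact`. It uses the vector-field form of the EP contrast and compares it with the implicit-function-theorem gradient; with symmetric `W` the two should agree up to `O(β)`.

```python
import numpy as np

def relax(J, b, dt=0.05, steps=4000):
    """Euler-integrate z' = J z + b from z = 0 until it has settled."""
    z = np.zeros(len(b))
    for _ in range(steps):
        z = z + dt * (J @ z + b)
    return z

rng = np.random.default_rng(0)
n, c, beta = 6, 1.0, 1e-3

# Toy assumption: a small symmetric coupling W replaces Attn + FFN, so H1 holds.
W = 0.2 * rng.standard_normal((n, n))
W = 0.5 * (W + W.T)
J = W - (1.0 + c) * np.eye(n)          # Jacobian of F(z) = x - (1+c) z + W z
x = rng.standard_normal(n)             # input drive x_in
y = rng.standard_normal(n)             # target for L = 0.5 * ||z - y||^2

z_free = relax(J, x)                                  # free phase
z_nudged = relax(J - beta * np.eye(n), x + beta * y)  # z' = F(z) - beta * dL/dz

# EP estimate of dL/dW from the two settled states (vector-field form):
# an outer product of the state contrast with the free state, hence local.
grad_ep = -np.outer((z_nudged - z_free) / beta, z_free)

# Reference: implicit-function-theorem gradient dL/dW = lam z*^T, with J^T lam = -dL/dz.
lam = -np.linalg.solve(J.T, z_free - y)
grad_exact = np.outer(lam, z_free)

cos = np.sum(grad_ep * grad_exact) / (
    np.linalg.norm(grad_ep) * np.linalg.norm(grad_exact))
rel_err = np.linalg.norm(grad_ep - grad_exact) / np.linalg.norm(grad_exact)
print(f"cos = {cos:.6f}, relative error = {rel_err:.2e}")
```

Dropping the symmetrization line `W = 0.5 * (W + W.T)` makes the same script exhibit the mismatch discussed below, since the nudged phase then settles to a `J⁻¹` response while the reference uses `J⁻ᵀ`.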

## 1. What is rigorously established (the constraints any answer must respect)
- **EP ≡ BPTT only under three conditions** (Ernoult et al. 2019): **H1** symmetric Jacobian `∂F/∂z = (∂F/∂z)ᵀ` (i.e. F is the gradient of a scalar, hence *conservative*); **H2** free phase *exactly* at the fixed point; **H3** `β→0`. BPTT needs **none** of these: it differentiates the actual finite computation graph.
- **Softmax self-attention is non-conservative**: independent Q/K (and untied FFN) ⇒ `J ≠ Jᵀ`. We measure the asymmetry (our non-normality proxy) at the `‖A‖/‖S‖` scale to be ≈ **1.4**, where S and A are the symmetric and antisymmetric parts of J.
- The standard fix, the **AEP correction** (inject `−2A(z−z*)` in the nudged phase to make the nudged Jacobian `J→Jᵀ`), is **first-order in `S⁻¹A`**: its error is a Neumann series `Σ (S⁻¹A)^{2n+1}` that **converges only if `ρ(S⁻¹A) < 1`**. At our `‖S⁻¹A‖ ≈ 1.4` it is *out of warranty*. (Since `ρ ≤ ‖·‖`, a norm above 1 voids the convergence guarantee but does not by itself prove divergence.)
- The **good solutions are weakly contractive**: a BPTT model at good loss has a finite-time Lyapunov exponent ≈ **−0.0347** (barely on the stable side of zero). The instability is **non-normal transient growth**, not an eigenvalue crossing (the spectral radius is a misleading stability signal here).
- Empirically the norm of the EP↔BPTT gradient gap `M = g_BPTT − g_EP` reaches **~84 % of `‖g_BPTT‖`** just before divergence; EP retains the *direction* (cos 0.91) but loses the *magnitude/structure*. The EP gradient ≈ BPTT only when the free-phase residual `res ≲ 1e-4`; at `res ≈ 1e-2` it is noise.
- The damping `−(1+c)z` is what makes `S` well-conditioned at a *healthy* fixed point (`S ≈ −(1+c)I` dominant ⇒ `S⁻¹A ≈ −A/(1+c)`, small); `ρ(S⁻¹A)` blows up only *as contraction is lost* (S → singular). So large `ρ(S⁻¹A)` is partly a *symptom* of lost contraction, not purely intrinsic asymmetry. **(This is unresolved — see Q2.)**
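
The role of `ρ(S⁻¹A)` in the bullets above can be checked numerically. Writing `J = S(I + S⁻¹A)` and `Jᵀ = S(I − S⁻¹A)` and expanding both inverses gives `J⁻¹ − J⁻ᵀ = −2 Σ_{k odd} (S⁻¹A)^k S⁻¹`, an odd-power series of the same shape as the one quoted for the AEP error. The sketch below is a toy under stated assumptions: `S = −(1+c)I` exactly, a random antisymmetric `A` rescaled so that `ρ(S⁻¹A)` hits a chosen target, and hypothetical helper names (`sym_skew`, `rho_SinvA`, `odd_series`, `series_error`). It is not the AEP estimator itself.

```python
import numpy as np

def sym_skew(J):
    """Split J into its symmetric part S and antisymmetric part A."""
    return 0.5 * (J + J.T), 0.5 * (J - J.T)

def rho_SinvA(J):
    """Spectral radius of S^-1 A."""
    S, A = sym_skew(J)
    return np.max(np.abs(np.linalg.eigvals(np.linalg.solve(S, A))))

def odd_series(J, terms):
    """Partial sum of -2 * sum_{k odd} (S^-1 A)^k S^-1 (equals J^-1 - J^-T if convergent)."""
    S, A = sym_skew(J)
    B = np.linalg.solve(S, A)
    S_inv = np.linalg.inv(S)
    total = np.zeros_like(J)
    power = B.copy()                      # starts at B^1
    for _ in range(terms):
        total += -2.0 * power @ S_inv
        power = power @ B @ B             # advance to the next odd power
    return total

def series_error(J, terms):
    J_inv = np.linalg.inv(J)
    return np.linalg.norm(odd_series(J, terms) - (J_inv - J_inv.T))

rng = np.random.default_rng(1)
n, c = 5, 1.0
G = rng.standard_normal((n, n))
K = G - G.T
K /= np.linalg.norm(K, 2)                 # antisymmetric, unit spectral norm

results = {}
for target in (0.5, 1.4):
    # S = -(1+c) I and A = target * (1+c) * K, so rho(S^-1 A) equals target.
    J = -(1.0 + c) * np.eye(n) + target * (1.0 + c) * K
    results[target] = (rho_SinvA(J), series_error(J, 5), series_error(J, 30))
    rho, e5, e30 = results[target]
    print(f"rho = {rho:.2f}: error after 5 terms = {e5:.2e}, after 30 terms = {e30:.2e}")
```

In this toy the partial sums tighten with more terms at `ρ = 0.5` and move away from the exact gap at `ρ = 1.4`, which is the behaviour the "out of warranty" remark refers to.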

## 2. The central tension (the thing to crack)
**EP's gradient fidelity requires near-conservativity (`‖S⁻¹A‖ ≪ 1`) and tight convergence; the expressive power of attention comes precisely from its non-conservativity (`‖S⁻¹A‖ ≈ 1.4`).** BPTT escapes because it needs neither H1 nor H2 — but BPTT is not physically realizable on analog relaxation hardware. So the question is whether a **local, physically-realizable, backprop-free** rule can get the correct gradient for a **non-conservative equilibrium**, or whether there is a fundamental obstruction.

---

## QUESTIONS FOR DEEP REASONING

### Q1 — Is the obstruction FUNDAMENTAL or an estimator artifact? (the no-go question)
Is there a **theorem** of the form *"any local learning rule that reads only the settled states of a physical relaxation can recover the true loss-gradient of a non-conservative equilibrium system only to fidelity bounded by a function of the operator's non-normality `‖S⁻¹A‖`"*? Or is the `ρ(S⁻¹A)<1` limit specific to the *first-order* AEP construction, with a different (possibly non-perturbative) local estimator that is exact at any asymmetry?
- Concretely: the true adjoint needs `J⁻ᵀ`; EP's nudged relaxation naturally produces `J⁻¹`. For **symmetric** J these coincide (why EP works on energy-based systems). Is the gap `J⁻¹ − J⁻ᵀ` *fundamentally* inaccessible to any forward-only physical relaxation, or can a cleverly-coupled relaxation produce `J⁻ᵀ`?
- Relevant: implicit-function-theorem / adjoint-method structure; whether the adjoint of a *dissipative non-reciprocal* system is itself the steady state of *some* physically-runnable dynamics.
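
A small linearized sketch of the `J⁻¹` versus `J⁻ᵀ` gap. It assumes a random untied coupling `W` as a stand-in for the directed attention Jacobian and a random vector `g` in place of `∂L/∂z`; `settle`, `forward_response` and `transpose_twin` are hypothetical names. It only illustrates the linear algebra: because `J` and `Jᵀ` share eigenvalues, the relaxation `λ̇ = Jᵀλ + g` is stable whenever the forward one is, and it settles to the adjoint. Whether a substrate can physically realize that transposed coupling is exactly what the question leaves open.

```python
import numpy as np

def settle(M, b, dt=0.02, steps=5000):
    """Euler-integrate v' = M v + b from v = 0 to its steady state -M^-1 b."""
    v = np.zeros(len(b))
    for _ in range(steps):
        v = v + dt * (M @ v + b)
    return v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
n, c = 6, 1.0
W = 0.3 * rng.standard_normal((n, n))   # untied (directed) coupling, so J != J.T
J = W - (1.0 + c) * np.eye(n)
g = rng.standard_normal(n)              # stands in for dL/dz at the fixed point

adjoint = -np.linalg.solve(J.T, g)      # what the true gradient needs: -J^-T g
forward_response = settle(J, g)         # forward relaxation settles to -J^-1 g
transpose_twin = settle(J.T, g)         # same integrator with transposed coupling

print("cos(forward response, adjoint) =", cosine(forward_response, adjoint))
print("cos(transpose twin,   adjoint) =", cosine(transpose_twin, adjoint))
```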

### Q2 — Decouple non-normality from lost-contraction. Is the GOOD solution actually high-`ρ(S⁻¹A)`?
We suspect `ρ(S⁻¹A)` is large at the *divergence edge* mostly because contraction is lost (`S` near-singular), not because the *good* solution is intrinsically that asymmetric. At a healthy fixed point the damping makes `S ≈ −(1+c)I`, so `ρ(S⁻¹A) ≈ ‖A‖/(1+c)`.
- Does the **expressivity of attention** require large **`‖A‖`** (antisymmetric coupling) per se, or only large **`‖J‖`** (which the damping + a larger `c` could offset while keeping `‖S⁻¹A‖<1`)? I.e. **can one have an expressive attention operator that is strongly damped (large `1+c`) so that `S⁻¹A` stays sub-1 — buying EP-validity by *scaling*, not by killing asymmetry?** What is the trade between damping `c`, expressivity, and `ρ(S⁻¹A)`?
- Is there a reparameterization (e.g. normalizing the value/output path, or `c` per-channel) that holds `ρ(S⁻¹A)<1` *uniformly over training* without an explicit penalty?
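
The damping trade-off can be probed with a sweep like the following. It is a sketch only: it assumes a *fixed* random untied coupling `W` (in a trained model `W` would co-adapt with `c`), takes `J = W − (1+c)I`, and parameterizes `c` by the margin between `1+c` and the top eigenvalue of the symmetric part of `W`, where `S` becomes singular. The names `rho_SinvA`, `margins` and `rhos` are illustrative.

```python
import numpy as np

def rho_SinvA(W, c):
    """Spectral radius of S^-1 A for J = W - (1 + c) I."""
    J = W - (1.0 + c) * np.eye(len(W))
    S, A = 0.5 * (J + J.T), 0.5 * (J - J.T)
    return np.max(np.abs(np.linalg.eigvals(np.linalg.solve(S, A))))

rng = np.random.default_rng(3)
W = rng.standard_normal((8, 8))                       # fixed, untied toy coupling
A_norm = np.linalg.norm(0.5 * (W - W.T), 2)           # spectral norm of A
lam_max = np.linalg.eigvalsh(0.5 * (W + W.T)).max()   # S is singular at 1 + c == lam_max

margins = [1e-3, 1e-1, 1.0, 10.0, 100.0]
rhos = []
for m in margins:
    c = lam_max - 1.0 + m                             # damping that leaves margin m
    rhos.append(rho_SinvA(W, c))
    print(f"margin = {m:7.3f}  c = {c:8.3f}  rho(S^-1 A) = {rhos[-1]:9.3f}  "
          f"||A||/(1+c) = {A_norm / (1.0 + c):6.3f}")
```

With `A` held fixed, `ρ(S⁻¹A)` grows without bound as the margin closes and approaches `‖A‖/(1+c)` once damping dominates, so the same antisymmetric coupling can sit on either side of 1 depending on `c` alone. What the sweep cannot say is how much expressivity the larger `c` costs.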

### Q3 — Can the PHYSICAL hardware compute its own adjoint (transpose dynamics)?
EP works on conservative systems because a **reciprocal** (symmetric-coupling) physical network has `J=Jᵀ`, so the same physical relaxation computes both the forward state and the adjoint. Our attention is **non-reciprocal** (independent Q/K = directed coupling).
- Is there a **physical mechanism** to run the *transpose/adjoint* dynamics of a non-reciprocal dissipative network on the same substrate, e.g. **time-reversal / Hamiltonian-echo backpropagation (HEB)** (López-Pastor & Marquardt 2023, for Hamiltonian systems), a physically-coupled "transpose twin" network, or an Onsager-reciprocity-breaking trick? Does HEB extend from *Hamiltonian* (lossless) to our *damped dissipative* setting, and at what cost?
- If a substrate can physically realize **both** `J` and `Jᵀ` relaxations, EP/AEP becomes *exact* for non-conservative systems with **no** perturbative `ρ(S⁻¹A)<1` restriction. Is that physically buildable (analog crossbar / photonic / spintronic), and what does the transpose operation cost?

### Q4 — Is "equilibrium" even the right primitive? Finite-time vs fixed-point computation
BPTT's stabilizer is exactly the **finite-horizon truncation gradient** (it differentiates the *actual* 150-step settling, not the `T→∞` fixed point). The good BPTT solution rides a **weakly-contractive / not-fully-settled** state — arguably a *finite-time transient*, not a true equilibrium.
- Should the physical computation be defined at a **fixed finite settling time** `T` (a "finite-time fixed point") rather than `T→∞`? Is there a **local** learning rule whose target is the finite-time output — i.e. a physically-realizable analog of BPTT's finite-horizon term — without an actual backward pass? (e.g. forward-mode/RTRL-like physical accumulation, two-timescale dynamics, or an auxiliary physical variable that accumulates the transient sensitivity.)
- More radically: should we **embrace non-equilibrium computation** (transient / limit-cycle / wave dynamics) as the primitive and seek a local rule for *that* (reservoir-computing, oscillatory/Ising-machine, physical-learning lineages), rather than forcing a fixed-point interpretation onto a computation that is natively transient?
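
To make the finite-horizon versus fixed-point distinction concrete, here is a forward-mode (RTRL-like) sketch that carries the sensitivity `P = ∂z/∂x` alongside the state during settling. Assumptions: a toy field `F(z) = x − (1+c)z + W tanh(z)` replaces the transformer block, `‖W‖ < 1+c` keeps it contractive, and the step size is chosen so that 150 steps leave the state visibly unsettled. `settle_with_sensitivity` is a hypothetical helper, and carrying the full `n×n` matrix `P` is *not* a local rule; the sketch only shows what a physical accumulator would have to reproduce.

```python
import numpy as np

def settle_with_sensitivity(W, x, c, dt, steps):
    """Euler-settle z' = x - (1+c) z + W tanh(z), carrying P = dz/dx alongside z."""
    n = len(x)
    I = np.eye(n)
    z = np.zeros(n)
    P = np.zeros((n, n))                                   # P[i, j] = d z_i / d x_j
    for _ in range(steps):
        Jz = W * (1.0 - np.tanh(z) ** 2) - (1.0 + c) * I   # dF/dz at the current state
        P = P + dt * (Jz @ P + I)                          # forward (tangent) recursion
        z = z + dt * (x - (1.0 + c) * z + W @ np.tanh(z))
    return z, P

rng = np.random.default_rng(4)
n, c, dt = 5, 1.0, 0.01
G = rng.standard_normal((n, n))
W = 1.5 * G / np.linalg.norm(G, 2)      # ||W||_2 = 1.5 < 1 + c keeps the toy contractive
x = rng.standard_normal(n)

z_T, P_T = settle_with_sensitivity(W, x, c, dt, steps=150)       # finite horizon
z_inf, P_inf = settle_with_sensitivity(W, x, c, dt, steps=5000)  # effectively settled

# The settled sensitivity is the implicit-function-theorem one, -J(z*)^-1.
J_star = W * (1.0 - np.tanh(z_inf) ** 2) - (1.0 + c) * np.eye(n)
P_ift = -np.linalg.inv(J_star)

# Central finite differences confirm P_T is the derivative of the actual 150-step map.
eps = 1e-6
P_fd = np.zeros((n, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    z_plus, _ = settle_with_sensitivity(W, x + e, c, dt, 150)
    z_minus, _ = settle_with_sensitivity(W, x - e, c, dt, 150)
    P_fd[:, j] = (z_plus - z_minus) / (2.0 * eps)

gap = np.linalg.norm(P_T - P_inf) / np.linalg.norm(P_inf)
print(f"relative gap between finite-horizon and fixed-point sensitivity: {gap:.3f}")
```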

### Q5 — Non-normality as physics: does it have to hurt, and can it be bounded for free?
The failure is **non-normal transient amplification** — a stable operator whose perturbations grow transiently before decaying. In a real relaxation circuit this should be a *physical* phenomenon (ringing / overshoot before settling).
- What is the **physically natural regularizer** of non-normality? Is there a passivity / dissipativity / **contraction-in-a-metric (log-norm `μ_P(J)<0`)** structure that a physical substrate enforces *automatically* (e.g. a passive RC network is contractive by construction), and does building attention as a *passive* element bound `‖S⁻¹A‖` without a learned penalty?
- Is there an expressivity argument that **bounded non-normality is sufficient** for language modeling — i.e. that the useful computations of attention live at modest `‖S⁻¹A‖` and the divergence-causing high non-normality is *gratuitous* (an artifact of unconstrained training), not *functional*?
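
A two-dimensional sketch of transient amplification and of contraction in a metric. The matrix `J` is a textbook-style toy, not a measured Jacobian, and the helper names are hypothetical. It has both eigenvalues in the left half-plane, yet its Euclidean log-norm `μ₂(J) = λ_max((J+Jᵀ)/2)` is positive, so `‖exp(tJ)‖` overshoots before decaying. Solving the Lyapunov equation `JᵀP + PJ = −I` then gives a metric `P` in which the same `J` has negative log-norm.

```python
import numpy as np

def expm_eig(J, t):
    """exp(t J) via eigendecomposition (assumes J is diagonalizable)."""
    lam, V = np.linalg.eig(J)
    return ((V * np.exp(lam * t)) @ np.linalg.inv(V)).real

def log_norm_2(J):
    """Euclidean log-norm: largest eigenvalue of the symmetric part."""
    return np.linalg.eigvalsh(0.5 * (J + J.T)).max()

def lyapunov_metric(J):
    """Solve J^T P + P J = -I with a Kronecker-product linear system."""
    n = len(J)
    I = np.eye(n)
    op = np.kron(J.T, I) + np.kron(I, J.T)
    P = np.linalg.solve(op, -I.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)

def log_norm_P(J, P):
    """Log-norm in the metric ||v||_P = sqrt(v^T P v), via the Cholesky factor of P."""
    L = np.linalg.cholesky(P)                  # P = L L^T
    J_tilde = L.T @ J @ np.linalg.inv(L.T)     # dynamics in coordinates y = L^T v
    return log_norm_2(J_tilde)

J = np.array([[-1.0, 10.0],
              [0.0, -2.0]])                    # stable eigenvalues, strong one-way coupling

ts = np.linspace(0.0, 5.0, 501)
gains = [np.linalg.norm(expm_eig(J, t), 2) for t in ts]
abscissa = np.linalg.eigvals(J).real.max()
mu_2 = log_norm_2(J)
mu_P = log_norm_P(J, lyapunov_metric(J))

print(f"spectral abscissa = {abscissa:.3f}")
print(f"peak transient gain max_t ||exp(tJ)||_2 = {max(gains):.3f}")
print(f"mu_2(J) = {mu_2:.3f}, mu_P(J) = {mu_P:.3f}")
```

Any stable linear `J` admits such a `P`; the open point in the bullets above is whether a substrate fixes a *useful* `P` by construction, and whether attention built inside it stays expressive.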

### Q6 — The conservative-attention escape: fundamental cost or engineering cost?
Energy-based attention exists: **modern Hopfield networks / Energy Transformer (Krotov), tied-value LSE-energy attention**. These are conservative by construction (`H1` holds, EP is *exact*, and they run natively on analog equilibrium hardware). We measure a cost of ~0.15–0.2 CE (cross-entropy loss) relative to free (non-conservative) attention.
- Is the expressivity loss of conservative/energy attention **fundamental** (information-theoretic: a symmetric-coupling associative memory provably cannot represent what directed Q≠K attention can) or **engineering** (current energy-attention forms are just suboptimal, and a better conservative parameterization closes the gap)?
- Is there a **partially-conservative** sweet spot: an attention with a *controlled, small* antisymmetric part (`‖S⁻¹A‖<1`, inside the first-order AEP warranty) that captures most of the expressivity — and what is the **Pareto frontier of [gradient fidelity ↔ asymmetry ↔ task loss]**?
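
The conservative/free contrast can be seen in a single-query toy. This is a simplified stand-in, not the Energy Transformer: one state vector `z` attends over a fixed memory, and the field is `F(z) = −z + Vᵀ softmax(β K z)`. With tied keys and values (`V = K`) this is minus the gradient of the log-sum-exp energy `E(z) = ½‖z‖² − (1/β) log Σᵢ exp(β kᵢ·z)`, so its Jacobian is symmetric; with untied `V ≠ K` it is not. All names (`field`, `jacobian_fd`, `asym_ratio`) are illustrative.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def field(z, K, V, beta):
    """Single-query relaxation field F(z) = -z + V^T softmax(beta K z)."""
    return -z + V.T @ softmax(beta * (K @ z))

def jacobian_fd(f, z, eps=1e-6):
    """Central finite-difference Jacobian of f at z."""
    n = len(z)
    J = np.zeros((n, n))
    for j in range(n):
        dz = np.zeros(n)
        dz[j] = eps
        J[:, j] = (f(z + dz) - f(z - dz)) / (2.0 * eps)
    return J

def asym_ratio(J):
    """Frobenius-norm ratio ||A|| / ||S|| of antisymmetric to symmetric part."""
    S, A = 0.5 * (J + J.T), 0.5 * (J - J.T)
    return np.linalg.norm(A) / np.linalg.norm(S)

rng = np.random.default_rng(5)
m, d, beta = 7, 4, 0.5
K = rng.standard_normal((m, d))         # memory keys (rows are patterns)
V = rng.standard_normal((m, d))         # independent values for the untied case
z = 0.5 * rng.standard_normal(d)

J_tied = jacobian_fd(lambda u: field(u, K, K, beta), z)   # energy-based: V = K
J_free = jacobian_fd(lambda u: field(u, K, V, beta), z)   # untied: V != K

ratio_tied = asym_ratio(J_tied)
ratio_free = asym_ratio(J_free)
print(f"||A||/||S|| tied = {ratio_tied:.2e}, untied = {ratio_free:.2e}")
```

Interpolating `V = (1 − α) K + α V_free` in this toy gives one hypothetical asymmetry knob `α` for tracing gradient fidelity against asymmetry.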

### Q7 — THE make-or-break for the hardware thesis
Granting that analog relaxation hardware is **structurally confined to (near-)conservative, contractive, equilibrium-based local learning**: **is a "readable" small language model (≈10–50M params, coherent text) achievable inside that regime?** Or does coherent language *require* the high non-normality / finite-time-transient expressivity that provably breaks local equilibrium learning?
- If achievable: what is the *minimum* non-normality / contraction margin a coherent-language equilibrium model needs, and does it fit inside `‖S⁻¹A‖<1` (+ comfortable contraction)?
- If not: what is the realistic *ceiling* of analog-EP-trainable models, and is there a hybrid (mostly-conservative core + a thin digitally-assisted non-conservative correction) that lifts it?

---

## What an actionable answer looks like
For each question:
- (a) Is it a **fundamental obstruction** (cite/sketch the no-go) or an **engineering gap** (sketch the construction)?
- (b) The **physical realizability** verdict: can it run on a relaxation substrate with local updates and no backward pass?
- (c) The **cheapest experiment** that would falsify the proposed mechanism on our simulator. We can measure `ρ(S⁻¹A)`, contraction margin, gradient cosine against a non-conservativity knob, energy-vs-free-attention loss, and damping-`c` sweeps.

We are not looking for "use BPTT" or "use a digital solver"; both are physically excluded. We want the **physics of a local, forward-only, non-conservative equilibrium learning rule**, or the proof that none can exist beyond bounded asymmetry.

## Key references the reasoner should engage
- Scellier–Bengio 2017 (EP / energy)
- Ernoult et al. 2019 (EP≡BPTT conditions)
- Scurria et al. 2026 (AsymEP, the `J→Jᵀ` first-order correction)
- Laborieux–Zenke 2023 (Jacobian homeostasis, the `S⁻¹A` bias + functional-symmetry fix)
- López-Pastor–Marquardt 2023 (Hamiltonian echo backprop = physical adjoint via time-reversal)
- Krotov / Hoover et al. 2023 (Energy Transformer)
- Winston–Kolter 2020 (monotone-operator equilibrium nets = contraction by construction)
- Stern–Murugan 2023 (physical learning review)
- Kendall et al. 2020 (EP on analog resistor networks)