Diffstat (limited to 'docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md')
| -rw-r--r-- | docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md | 65 |
1 files changed, 65 insertions, 0 deletions
diff --git a/docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md b/docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md
new file mode 100644
index 0000000..46f79e3
--- /dev/null
+++ b/docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md
@@ -0,0 +1,65 @@
+# Deep-reasoning brief: the physics of backprop-free equilibrium learning on non-conservative systems
+*Self-contained briefing for a deep-reasoning consult. Goal: identify whether the central obstruction is fundamental, and what the physically-realizable escape routes are. 2026-06-21.*
+
+---
+
+## 0. One-paragraph setup
+We train a **transformer as a physical equilibrium system** — a fixed point `z*` of a damped relaxation `ż = F(z)`, `F(z) = x_in − (1+c)z + Attn(LN z) + FFN(LN z)` (`c>0` damping) — **without backpropagation**, using **Equilibrium Propagation (EP)**: a free relaxation to `z*`, a nudged relaxation with the output perturbed by `β ∂L/∂z`, and a **local** weight update from the contrast of the two settled states. The motivation is **analog hardware**: the relaxation is physical settling (≈ free), and the update is local (no backward pass). On such hardware **BPTT is physically impossible** (no backward-through-time). The problem: EP trains cleanly until attention becomes expressive, then the gradient estimate collapses and training diverges — while exact BPTT with the identical model sails to a good solution.
+
+## 1. What is rigorously established (the constraints any answer must respect)
+- **EP ≡ BPTT only under** (Ernoult 2019): **H1** symmetric Jacobian `∂F/∂z = (∂F/∂z)ᵀ` (i.e. F is a gradient of a scalar — *conservative*); **H2** free phase *exactly* at the fixed point; **H3** `β→0`. BPTT needs **none** of these — it differentiates the actual finite computation graph.
+- **Softmax self-attention is non-conservative**: independent Q/K (and untied FFN) ⇒ `J ≠ Jᵀ`. We measure the non-normality `‖A‖/‖S‖`-scale ≈ **1.4** (S, A = symmetric/antisymmetric parts of J).
+- The standard fix, the **AEP correction** (inject `−2A(x−x*)` in the nudged phase to make the nudged Jacobian `J→Jᵀ`), is **first-order in `S⁻¹A`** — its error is a Neumann series `Σ (S⁻¹A)^{2n+1}` that **converges only if `ρ(S⁻¹A) < 1`**. At our `‖S⁻¹A‖ ≈ 1.4` it is *out of warranty*.
+- The **good solutions are weakly contractive** — a BPTT model at good loss has finite-time Lyapunov ≈ **−0.0347** (barely inside the stable disk). The instability is **non-normal transient growth**, not an eigenvalue crossing (spectral radius is a misleading stability signal here).
+- Empirically the EP↔BPTT gradient gap `M = g_BPTT − g_EP` reaches **~84 % of `‖g_BPTT‖`** just before divergence; EP retains the *direction* (cos 0.91) but loses the *magnitude/structure*. EP gradient ≈ BPTT only when the free-phase residual `res ≲ 1e-4`; at `res ≈ 1e-2` it is noise.
+- The damping `−(1+c)z` is what makes `S` well-conditioned at a *healthy* fixed point (`S ≈ −(1+c)I` dominant ⇒ `S⁻¹A ≈ −A/(1+c)`, small); `ρ(S⁻¹A)` blows up only *as contraction is lost* (S → singular). So large `ρ(S⁻¹A)` is partly a *symptom* of lost contraction, not purely intrinsic asymmetry. **(This is unresolved — see Q2.)**
+
+## 2. The central tension (the thing to crack)
+**EP's gradient fidelity requires near-conservativity (`‖S⁻¹A‖ ≪ 1`) and tight convergence; the expressive power of attention comes precisely from its non-conservativity (`‖S⁻¹A‖ ≈ 1.4`).** BPTT escapes because it needs neither H1 nor H2 — but BPTT is not physically realizable on analog relaxation hardware. So the question is whether a **local, physically-realizable, backprop-free** rule can get the correct gradient for a **non-conservative equilibrium**, or whether there is a fundamental obstruction.
+
+---
+
+## QUESTIONS FOR DEEP REASONING
+
+### Q1 — Is the obstruction FUNDAMENTAL or an estimator artifact? (the no-go question)
+Is there a **theorem** of the form *"any local learning rule that reads only the settled states of a physical relaxation can recover the true loss-gradient of a non-conservative equilibrium system only to fidelity bounded by a function of the operator's non-normality `‖S⁻¹A‖`"*? Or is the `ρ(S⁻¹A)<1` limit specific to the *first-order* AEP construction, with a different (possibly non-perturbative) local estimator that is exact at any asymmetry?
+- Concretely: the true adjoint needs `J⁻ᵀ`; EP's nudged relaxation naturally produces `J⁻¹`. For **symmetric** J these coincide (why EP works on energy-based systems). Is the gap `J⁻¹ − J⁻ᵀ` *fundamentally* inaccessible to any forward-only physical relaxation, or can a cleverly-coupled relaxation produce `J⁻ᵀ`?
+- Relevant: implicit-function-theorem / adjoint-method structure; whether the adjoint of a *dissipative non-reciprocal* system is itself the steady state of *some* physically-runnable dynamics.
+
+### Q2 — Decouple non-normality from lost-contraction. Is the GOOD solution actually high-`ρ(S⁻¹A)`?
+We suspect `ρ(S⁻¹A)` is large at the *divergence edge* mostly because contraction is lost (`S` near-singular), not because the *good* solution is intrinsically that asymmetric. At a healthy fixed point the damping makes `S ≈ −(1+c)I`, so `ρ(S⁻¹A) ≈ ‖A‖/(1+c)`.
+- Does the **expressivity of attention** require large **`‖A‖`** (antisymmetric coupling) per se, or only large **`‖J‖`** (which the damping + a larger `c` could offset while keeping `‖S⁻¹A‖<1`)? I.e. **can one have an expressive attention operator that is strongly damped (large `1+c`) so that `S⁻¹A` stays sub-1 — buying EP-validity by *scaling*, not by killing asymmetry?** What is the trade between damping `c`, expressivity, and `ρ(S⁻¹A)`?
+- Is there a reparameterization (e.g. normalizing the value/output path, or `c` per-channel) that holds `ρ(S⁻¹A)<1` *uniformly over training* without an explicit penalty?
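The quantities in Q1 and Q2 can be made concrete on a toy linear relaxation. The sketch below is only an illustration under stated assumptions, not the brief's simulator: it assumes a linearized Jacobian `J = −(1+c)I + W`, where a random non-symmetric `W` stands in for the attention/FFN coupling. The helper names (`toy_jacobian`, `rho_SinvA`, `ep_vs_adjoint_cosine`), the size `n = 16`, and the scaling `‖W‖₂ = 0.9` are hypothetical choices. It reports `ρ(S⁻¹A)` over a damping sweep and the cosine between `J⁻¹g` (what a forward nudged relaxation yields in this linear toy) and `J⁻ᵀg` (what the adjoint needs).

```python
import numpy as np

def toy_jacobian(c, W):
    """Assumed toy linearization: damping -(1+c)I plus a fixed coupling W."""
    n = W.shape[0]
    return -(1.0 + c) * np.eye(n) + W

def sym_antisym(J):
    """Split J into its symmetric part S and antisymmetric part A (J = S + A)."""
    return 0.5 * (J + J.T), 0.5 * (J - J.T)

def rho_SinvA(J):
    """Spectral radius of S^{-1} A."""
    S, A = sym_antisym(J)
    return float(np.max(np.abs(np.linalg.eigvals(np.linalg.solve(S, A)))))

def ep_vs_adjoint_cosine(J, g):
    """Cosine between J^{-1} g (forward nudged response in the linear toy)
    and J^{-T} g (the vector the true adjoint requires)."""
    u = np.linalg.solve(J, g)
    v = np.linalg.solve(J.T, g)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
n = 16                                  # hypothetical state dimension
W = rng.standard_normal((n, n))
W *= 0.9 / np.linalg.norm(W, 2)         # spectral norm 0.9, so S is negative definite for c >= 0
g = rng.standard_normal(n)              # stand-in for dL/dz at the output

dampings = [0.0, 0.5, 1.0, 2.0, 4.0, 9.0]
rhos = [rho_SinvA(toy_jacobian(c, W)) for c in dampings]
cosines = [ep_vs_adjoint_cosine(toy_jacobian(c, W), g) for c in dampings]

for c, r, cs in zip(dampings, rhos, cosines):
    print(f"c={c:4.1f}  rho(S^-1 A)={r:.3f}  cos(J^-1 g, J^-T g)={cs:.3f}")

# Symmetric control: replacing W by its symmetric part gives J = J^T,
# so the forward response and the adjoint coincide.
J_sym = toy_jacobian(0.0, 0.5 * (W + W.T))
cos_sym = ep_vs_adjoint_cosine(J_sym, g)
print(f"symmetric control: cos={cos_sym:.6f}")
```

In this toy, raising `c` shrinks `ρ(S⁻¹A)` because `W` is held fixed while the damping grows. Whether a trained attention operator keeps its expressivity under the same scaling is exactly what Q2 asks, and the sketch does not address it.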
+
+### Q3 — Can the PHYSICAL hardware compute its own adjoint (transpose dynamics)?
+EP works on conservative systems because a **reciprocal** (symmetric-coupling) physical network has `J=Jᵀ`, so the same physical relaxation computes both the forward state and the adjoint. Our attention is **non-reciprocal** (independent Q/K = directed coupling).
+- Is there a **physical mechanism** to run the *transpose/adjoint* dynamics of a non-reciprocal dissipative network on the same substrate — e.g. **time-reversal / Hamiltonian-echo backpropagation** (López-Pastor & Marquardt 2023, for Hamiltonian systems), a physically-coupled "transpose twin" network, or an Onsager-reciprocity-breaking trick? Does HEB extend from *Hamiltonian* (lossless) to our *damped dissipative* setting, and at what cost?
+- If a substrate can physically realize **both** `J` and `Jᵀ` relaxations, EP/AEP becomes *exact* for non-conservative systems with **no** perturbative `ρ(S⁻¹A)<1` restriction. Is that physically buildable (analog crossbar / photonic / spintronic), and what does the transpose operation cost?
+
+### Q4 — Is "equilibrium" even the right primitive? Finite-time vs fixed-point computation
+BPTT's stabilizer is exactly the **finite-horizon truncation gradient** (it differentiates the *actual* 150-step settling, not the `T→∞` fixed point). The good BPTT solution rides a **weakly-contractive / not-fully-settled** state — arguably a *finite-time transient*, not a true equilibrium.
+- Should the physical computation be defined at a **fixed finite settling time** `T` (a "finite-time fixed point") rather than `T→∞`? Is there a **local** learning rule whose target is the finite-time output — i.e. a physically-realizable analog of BPTT's finite-horizon term — without an actual backward pass? (e.g. forward-mode/RTRL-like physical accumulation, two-timescale dynamics, or an auxiliary physical variable that accumulates the transient sensitivity.)
+- More radically: should we **embrace non-equilibrium computation** (transient / limit-cycle / wave dynamics) as the primitive and seek a local rule for *that* (reservoir-computing, oscillatory/Ising-machine, physical-learning lineages), rather than forcing a fixed-point interpretation onto a computation that is natively transient?
+
+### Q5 — Non-normality as physics: does it have to hurt, and can it be bounded for free?
+The failure is **non-normal transient amplification** — a stable operator whose perturbations grow transiently before decaying. In a real relaxation circuit this should be a *physical* phenomenon (ringing / overshoot before settling).
+- What is the **physically natural regularizer** of non-normality? Is there a passivity / dissipativity / **contraction-in-a-metric (log-norm `μ_P(J)<0`)** structure that a physical substrate enforces *automatically* (e.g. a passive RC network is contractive by construction), and does building attention as a *passive* element bound `‖S⁻¹A‖` without a learned penalty?
+- Is there an expressivity argument that **bounded non-normality is sufficient** for language modeling — i.e. that the useful computations of attention live at modest `‖S⁻¹A‖` and the divergence-causing high non-normality is *gratuitous* (an artifact of unconstrained training), not *functional*?
+
+### Q6 — The conservative-attention escape: fundamental cost or engineering cost?
+Energy-based attention exists — **modern Hopfield networks / Energy Transformer (Krotov), tied-value LSE-energy attention** — which are conservative by construction (`H1` holds, EP is *exact*, runs natively on analog equilibrium hardware). We measure it costs ~0.15–0.2 CE vs free (non-conservative) attention.
+- Is the expressivity loss of conservative/energy attention **fundamental** (information-theoretic: a symmetric-coupling associative memory provably cannot represent what directed Q≠K attention can) or **engineering** (current energy-attention forms are just suboptimal, and a better conservative parameterization closes the gap)?
+- Is there a **partially-conservative** sweet spot: an attention with a *controlled, small* antisymmetric part (`‖S⁻¹A‖<1`, inside the first-order AEP warranty) that captures most of the expressivity — and what is the **Pareto frontier of [gradient fidelity ↔ asymmetry ↔ task loss]**?
+
+### Q7 — THE make-or-break for the hardware thesis
+Granting that analog relaxation hardware is **structurally confined to (near-)conservative, contractive, equilibrium-based local learning**: **is a "readable" small language model (≈10–50M params, coherent text) achievable inside that regime?** Or does coherent language *require* the high non-normality / finite-time-transient expressivity that provably breaks local equilibrium learning?
+- If achievable: what is the *minimum* non-normality / contraction margin a coherent-language equilibrium model needs, and does it fit inside `‖S⁻¹A‖<1` (+ comfortable contraction)?
+- If not: what is the realistic *ceiling* of analog-EP-trainable models, and is there a hybrid (mostly-conservative core + a thin digitally-assisted non-conservative correction) that lifts it?
+
+---
+
+## What an actionable answer looks like
+For each: (a) is it a **fundamental obstruction** (cite/sketch the no-go) or an **engineering gap** (sketch the construction)? (b) the **physical realizability** verdict (can it run on a relaxation substrate with local updates, no backward pass?) (c) the **cheapest experiment** that would falsify the proposed mechanism on our simulator (we can measure `ρ(S⁻¹A)`, contraction margin, gradient cos vs a non-conservative knob, energy-vs-free-attention loss, damping-`c` sweeps). We are not looking for "use BPTT" or "use a digital solver" — both are physically excluded. We want the **physics of a local, forward-only, non-conservative equilibrium learning rule**, or the proof that none can exist beyond bounded asymmetry.
+
+## Key references the reasoner should engage
+Scellier–Bengio 2017 (EP / energy); Ernoult et al. 2019 (EP≡BPTT conditions); Scurria et al. 2026 (AsymEP, the `J→Jᵀ` first-order correction); Laborieux–Zenke 2023 (Jacobian homeostasis, the `S⁻¹A` bias + functional-symmetry fix); López-Pastor–Marquardt 2023 (Hamiltonian echo backprop = physical adjoint via time-reversal); Krotov / Hoover et al. 2023 (Energy Transformer); Winston–Kolter 2020 (monotone-operator equilibrium nets = contraction by construction); Stern–Murugan 2023 (physical learning review); Kendall et al. 2020 (EP on analog resistor networks).
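As a reference point for the contraction-margin measurement mentioned in the brief, the sketch below illustrates why spectral radius alone can mislead: a stable but non-normal operator can amplify perturbations transiently even though every eigenvalue has negative real part. It assumes a hypothetical 2×2 Jacobian that is not taken from the brief's model, uses the 2-norm logarithmic norm `μ₂(J) = λ_max((J+Jᵀ)/2)` as the identity-metric case of `μ_P(J)`, and computes `exp(Jt)` by eigendecomposition, which is valid only because these toy matrices are diagonalizable.

```python
import numpy as np

def expm_diagonalizable(J, t):
    """exp(J t) via eigendecomposition; valid only when J is diagonalizable."""
    w, V = np.linalg.eig(J)
    # V * exp(w t) scales column j of V by exp(w_j t), i.e. V @ diag(exp(w t)).
    return ((V * np.exp(w * t)) @ np.linalg.inv(V)).real

def spectral_abscissa(J):
    """Largest real part of the eigenvalues (asymptotic decay rate)."""
    return float(np.max(np.linalg.eigvals(J).real))

def log_norm_2(J):
    """2-norm logarithmic norm: largest eigenvalue of the symmetric part of J."""
    return float(np.max(np.linalg.eigvalsh(0.5 * (J + J.T))))

def peak_gain(J, times):
    """Maximum over the sampled times of ||exp(J t)||_2."""
    return max(float(np.linalg.norm(expm_diagonalizable(J, t), 2)) for t in times)

times = np.linspace(0.0, 10.0, 1001)

# Hypothetical toy operators: same eigenvalues (-1, -2), different normality.
J_nonnormal = np.array([[-1.0, 10.0],
                        [0.0, -2.0]])
J_normal = np.array([[-1.0, 0.0],
                     [0.0, -2.0]])

for name, J in [("non-normal", J_nonnormal), ("normal", J_normal)]:
    print(
        f"{name:>10}: spectral abscissa={spectral_abscissa(J):+.3f}  "
        f"log-norm mu_2={log_norm_2(J):+.3f}  "
        f"peak ||exp(Jt)||={peak_gain(J, times):.3f}"
    )
```

Both operators settle at the same asymptotic rate, but only the one with `μ₂(J) < 0` is contractive in the Euclidean metric from `t = 0`. A positive `μ₂` does not rule out contraction in some other metric `P`, which is the structure Q5 asks about.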
