diff options
| author | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
|---|---|---|
| committer | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
| commit | b83947778e2c776f757a07d4719b7ce961d7ed55 (patch) | |
| tree | b9cc01d7adda691d9156d9d04f4fb2f644674e96 /ep_run/scurria_nonconservative.txt | |
Initial commit: ept — backprop-free equilibrium transformer (EP)
Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
Diffstat (limited to 'ep_run/scurria_nonconservative.txt')
| -rw-r--r-- | ep_run/scurria_nonconservative.txt | 1708 |
1 files changed, 1708 insertions, 0 deletions
diff --git a/ep_run/scurria_nonconservative.txt b/ep_run/scurria_nonconservative.txt new file mode 100644 index 0000000..fa6d620 --- /dev/null +++ b/ep_run/scurria_nonconservative.txt @@ -0,0 +1,1708 @@ + Equilibrium Propagation for Non-Conservative Systems + + + Antonino Emanuele Scurria 1 Dimitri Vanden Abeele 1 Bortolo Matteo Mognetti 2 Serge Massar 1 + + + Abstract from inference, the transmission of nonlocal error signals, + and synchronous layer-wise computations with explicit gra- + Equilibrium Propagation (EP) is a physics- + dient storage. These constraints have no clear analog in + inspired learning algorithm that uses stationary + physical systems, making backpropagation challenging to + states of a dynamical system both for inference + implement in neuromorphic or analog hardware. Conse- +arXiv:2602.03670v2 [cs.LG] 1 Jun 2026 + + + + + and learning. In its original formulation it is + quently, understanding how credit assignment can instead + limited to conservative systems, i.e. to dynam- + emerge from intrinsic system dynamics, through local inter- + ics which derive from an energy function. Given + actions and continuous relaxation, is a central question in + their applications, it is important to extend EP + neuroscience and machine learning. + to non-conservative systems, i.e. systems with + non-reciprocal interactions. Previous attempts to Equilibrium Propagation (EP) (Scellier &Bengio, 2017) + generalize EP to such systems failed to compute represents one of the most promising advances in this direc- + the exact gradient of the cost function. Here we tion. It formulates supervised learning as a contrast between + propose a framework that extends EP to arbitrary two stationary states of a dynamical system: a ‘free’ phase + non-conservative systems, including feedforward where the system evolves autonomously, and a ‘nudged’ + networks. We keep the key property of equilib- phase where outputs are weakly pushed toward their targets. + rium propagation, namely the use of stationary The local change in neural states between these phases re- + states both for inference and learning. However, covers the exact gradient of the cost function with respect to + we modify the dynamics in the learning phase by parameters. This enables spatially local learning exploiting + a term proportional to the non-reciprocal part of the continuous relaxation of the system without a distinct + the interaction so as to obtain the exact gradient backward circuit or explicit weight transport. + of the cost function. This algorithm can also be + Since its introduction, several works have sought to im- + derived using a variational formulation that gen- + prove the practicality and biological realism of EP. Algo- + erates the learning dynamics through an energy + rithmic adaptations include enforcing temporal locality to + function defined over an augmented state space. + avoid state storage (Ernoult et al., 2020; Falk et al., 2025), + Numerical experiments show that this algorithm + deriving agnostic updates for black-box energies (Scel- + achieves better performance and learns faster than + lier et al., 2022), and substituting nudging with clamping + previous proposals. + (Stern et al., 2021). Theoretically, the framework has been + extended to stochastic systems (Scellier &Bengio, 2017; + Massar &Mognetti, 2025) and Lagrangian dynamics for + 1. Introduction time-varying inputs (Massar, 2025; Pourcel et al., 2025; + Standard neural network optimization relies on error back- Berneman &Hexner, 2025). In parallel, simulations have + propagation, an algorithm whose computational mechanism explored suitable substrates, ranging from spiking (Mar- + is difficult to reconcile with biological (Crick, 1989) and tin et al., 2021; O’Connor et al., 2019) and resistive net- + physical implementations (Indiveri &Liu, 2015). Specif- works (Kendall et al., 2020) to coupled oscillators (Wang + ically, backpropagation requires a backward pass distinct et al., 2024; Rageau &Grollier, 2025), as well as quantum + systems (Wanjura &Marquardt, 2025; Massar &Mognetti, + 1 + Laboratoire d’Information Quantique (LIQ) CP224, Université 2025; Scellier, 2024). Experimental realizations have been + libre de Bruxelles (ULB), Av. F. D. Roosevelt 50, 1050 Bruxelles, demonstrated in memristor crossbars (Yi et al., 2023), self- + Belgium 2 Interdisciplinary Center for Nonlinear Phenomena and + Complex Systems CP231, Université libre de Bruxelles (ULB), Av. adjusting electrical circuits (Dillavou et al., 2022; 2024), + F. D. Roosevelt 50, 1050 Bruxelles, Belgium. Correspondence to: elastic networks (Altman et al., 2024), and classical Ising + Antonino Emanuele Scurria <antonino.scurria@ulb.be>. models trained on quantum annealers (Laydevant et al., + 2024). + Proceedings of the 43 rd International Conference on Machine + Learning, Seoul, South Korea. PMLR 306, 2026. Copyright 2026 Despite these recent developments and the theoretical el- + by the author(s). + + 1 + Equilibrium Propagation for Non-Conservative Systems + +egance of EP, its standard formulation remains restricted a framework where the original dynamics serve for infer- +to conservative systems. In these systems, dynamics are ence, while a new augmented dynamic is used to compute +derived from an energy function, which inherently enforces gradients of the cost Eq. (2). In this augmented phase, the +symmetry (e.g., symmetric synaptic connections Jij = Jji ) output neurons are nudged towards their targets (as in stan- +through the action-reaction principle. This constraint pre- dard EP), while a local corrective term – proportional to the +cludes the use of EP in a broad class of models characterized antisymmetric part of the Jacobian at the free equilibrium + ∂ +by non-conservative forces. This includes the feedforward JF (x0 , θ, u) = ∂x F (x0 , θ, u) – is added to the forces. The +architectures dominant in modern AI, biological circuits, exact gradients of the cost with respect to parameters are +as well as physical systems that reach stationary states far then obtained by contrasting stationary states of the aug- +from thermodynamic equilibrium, such as nonlinear optical mented system. +systems driven by external lasers (Cin et al., 2025), opto- + Second, we introduce Dyadic EP, a ‘variational’ approach +electronic systems (Kalinin et al., 2025), exciton-polariton + to learning in non-conservative systems. This method in- +condensates (Sajnok &Matuszewski, 2025), active meta- + volves doubling the number of variables in the system’s +materials (Brandenbourger et al., 2019) and active colloids + state space and subsequently introducing a new energy func- +(Bishop et al., 2023; Osat &Golestanian, 2023) (see (Bowick + tion in this extended space. This approach takes advantage +et al., 2022) for a review). + of the extended space to execute the positive and negative +Formally, we consider a dynamical system governed by nudging phases in parallel, recovering the same computa- +a non-reciprocal force field F (x, θ, u), which relaxes to a tional cost as AsymEP. Derived from first principles, this +stationary configuration x0 satisfying: approach is inspired by established methods for mapping + dissipative dynamical systems onto conservative ones by + F (x0 , θ, u) = 0, (1) doubling the degrees of freedom (Bateman, 1931; Galley, +where x represents the state variables, θ the learnable param- 2013; Aykroyd et al., 2025). A more comprehensive study +eters and u the static input. Our goal, given a target y(u), is of the theoretical framework and its application to feedfor- +to compute the gradient of the cost function C(x0 , y) at this ward networks can be found in (Scurria, 2026). Our method +equilibrium, is related to the Dual Propagation algorithm (Høier et al., + dC 0 2023; Høier &Zach, 2023; 2024) and constitutes an inde- + (x , y), (2) pendent, first-principles generalization of Dyadic Learning + dθ +and update θ to minimize the cost. (Nest &Høier; Høier et al., 2024)—previously limited to + Hopfield networks—to arbitrary force fields. +Previous attempts to extend EP to non-conservative dynam- +ics include the Vector Field (VF) algorithm (Scellier et al., Third, we validate our framework on MNIST (LeCun, 1998), +2018). However, as noted by the authors, this method pro- Fashion-MNIST, and CIFAR-10. In continuous Hopfield +vides an unbiased gradient of the cost Eq. (2) only in the networks initialized with symmetric connection matrices, +conservative case. To mitigate this, (Laborieux &Zenke, AsymEP achieves better accuracy and learns faster than +2024) proposed adding a penalty to keep the Jacobian close EP and VF. Additionally, when we constrain the network +to symmetry, essentially forcing the system to be as con- to have a strong degree of structural asymmetry, in which +servative as possible. Alternative methods related to VF, case EP is inapplicable, AsymEP outperforms VF. Finally, +which similarly do not compute the exact gradient, were when we restrict connections to a feedforward structure, our +proposed in (Farinha et al., 2020; Costa &Santos, 2025) and algorithm effectively trains all parameters; in contrast, VF +for specific systems in simulation (Cin et al., 2025; Sajnok is limited to training the last layer, acting essentially as an +&Matuszewski, 2025). Extreme Learning Machine (Huang et al., 2006; Wang et al., + 2022) with poor performance. +Conversely, generalizations of backpropagation can handle +non-reciprocal forces and compute the exact gradient of In summary, this theoretical work proposes two generaliza- +the cost Eq. (2) but inherit the same challenges in physical tions of EP beyond conservative systems to arbitrary differ- +implementations. For instance, Backpropagation Through entiable dynamics that compute in their stationary states. +Time (Werbos, 1990) unfolds the network in time to ap- +ply standard backpropagation, Recurrent Backpropagation 2. Equilibrium Propagation Overview +(Almeida, 1990; Pineda, 1987) avoids this memory require- +ment but still requires a specific circuit to propagate errors, 2.1. Conservative Systems +and the continuous Adjoint Method (Chen et al., 2018) addi- We first review standard Equilibrium Propagation (EP) +tionally requires integrating the dynamics backward in time (Scellier &Bengio, 2017). We consider a network described +which is not physically possible for a dissipative system. by an energy function E(x, θ, u), such that the force field is +In this paper, we first propose Asymmetric EP (AsymEP), + + 2 + Equilibrium Propagation for Non-Conservative Systems + +derived from the potential E: stationary point, i.e., that Eq. (7) holds. Second, EP implic- + ∂ + itly assumes that the Jacobian JE (x0 , u) = ∂x FE (x0 , u) is + ∂ + FE (x, θ, u) = − E(x, θ, u). (3) invertible. In this work, we assume this condition holds and + ∂x will not state it explicitly hereafter. Third, for simplicity, +The objective is to compute the total gradient dC 0 we omit the dependency on the input u and target y in the + dθ (x , y) of a +(quadratic) cost function C(x, y) evaluated at the minimum following equations. +energy configuration of the system. This free equilibrium +denoted x0 (which depend implicitly in θ and u), satisfies 2.2. Vector Field +the stationarity condition: + The Vector Field (VF) algorithm, introduced in (Scellier + ∂ et al., 2018), is an early attempt to adapt EP to non- + − E(x0 , θ, u) = 0. (4) reciprocal forces. This method relies on the observation + ∂x + that, for conservative systems, linearizing the right-hand +To compute gradients, we introduce the augmented energy side of Eq. (9) around the equilibrium point x0 yields +functional: + ∂E(xβ , θ) ∂E(x−β , θ) + + ET (x, θ, β, u, y) = E(x, θ, u) + βC(x, y), (5) 1 + lim − + β→0 2β ∂θ ∂θ +where β is a scalar nudging parameter. The stationary config- ⊤ β (10) + x − x−β + + ∂FE 0 +uration of this augmented system is obtained by integrating = lim − (x , θ) , + β→0 ∂θ 2β +the dynamics + dx ∂ET (x, θ, β, u) where FE = −∂x E(x, θ) is the conservative force. It is + =− , (6) therefore tempting to use the right-hand side of Eq. (10) for + dt ∂x + parameter updates of non-conservative systems, for which +until the energy minimum is reached. This new fixed point no energy function E exists. +xβ , called nudged equilibrium, satisfies: + The VF algorithm adopts precisely this approach. It uses + ∂E(xβ , θ, u) ∂C(xβ , y) the nudged counterpart of Eq. (7), + +β = 0. (7) + ∂x ∂x + ∂C β + F (xβ , θ) − β (x ) = 0, (11) +The training procedure, as improved in (Laborieux et al., ∂x +2021), uses two nudged phases with factors ±β (with +β ̸= 0). Starting from x0 , the system relaxes to two in conjunction with the learning rule Eq. (10): +nearby perturbed equilibria, x+β and x−β . The displace- ⊤ β + x − x−β + +ment x+β − x−β is then used to compute the parameter ∂F 0 + ∆θ = ϵ (x , θ) . (12) +update in the learning rule: ∂θ 2β + + 1 ∂E(xβ , θ, u) ∂E(x−β , θ, u) + + ∆θ = −ϵ − , (8) However, as noted in (Scellier et al., 2018), Eq. (12) does + 2β ∂θ ∂θ not align with the true gradient dC 0 + dθ (x ) and is exact only if +where ϵ > 0 is the learning rate. The theoretical foundation the force is conservative. To see this, let JF (x, θ) denote +of EP is the result that, in the limβ→0 of Eq. (8), we get: the Jacobian of the vector field F (x, θ) (in components + (JF (x, θ))ij = ∂F∂xi (x,θ) + j + ). Differentiating the equilibrium + dC(x0 , y) d ∂E(xβ , θ, u) 0 + condition F (x , θ) = 0 with respect to θ gives + = , (9) + dθ dβ ∂θ + dx0 ∂F 0 +see Appendix D.1. The error of the above method is O(β 2 ). JF (x0 , θ) + (x , θ) = 0. (13) +This error can be further reduced using holomorphic equi- dθ ∂θ +librium propagation (Laborieux &Zenke, 2022). Consequently, the exact gradient of the cost is +Thus, EP recovers the exact gradient of the cost function ⊤ +using only local computations. In this manner, learning dC 0 dx0 ∂C 0 + (x ) = (x ) +implements gradient descent without an explicit backward dθ dθ ∂x + ⊤ +pass, and credit assignment is realized through the system’s + + ∂F 0 ⊤ 0 + −1 ∂C 0 +intrinsic relaxation dynamics. =− (x , θ) JF (x , θ) (x ) . + ∂θ ∂x + | {z }| {z } +Three remarks can be made at this point. First, EP does not pre-synaptic post-synaptic +require the system to be at an energy minimum, but only at a (14) + + 3 + Equilibrium Propagation for Non-Conservative Systems + +The terms ’pre-synaptic’ and ’post-synaptic’ in Eq. (14) Algorithm 1 Asymmetric EP (AsymEP) +are used by analogy with neuronal transmission: the pre- 1: Inputs: Force field F (x, θ), cost function C(x), nudg- +synaptic factor captures the local influence of θ on the force ing parameter β, learning rate ϵ. +F , while the post-synaptic factor is the sensitivity of the 2: repeat +cost to state perturbations. 3: 1. Free Phase: Evolve to stationary state +If instead we differentiate the nudged equilibrium condition 4: Evolve the system dynamics +in Eq. (11) with respect to β and evaluate at β = 0, we 5: + dx +obtain = F (x, θ), (17) + β + dt + dx ∂C 0 + JF (x0 , θ) − (x ) = 0, (15) 6: until convergence to the stationary state x0 . + dβ β=0 ∂x 7: 2. Jacobian Decomposition +which gives 8: Compute the Jacobian at equilibrium: + 9: + dxβ −1 ∂C 0 ∂F 0 + = JF (x0 , θ) (x , y). (16) JF (x0 , θ) = (x , θ), (18) + dβ β=0 ∂x ∂x +The right-hand side of Eq. (16) represents the effective post- 10: and decompose it in its antisymmetric part: + 11: +synaptic term used by the VF algorithm (Eq. 12). Compar- +ing this with the exact post-synaptic term derived in Eq. (14), + AJ (x0 , θ) = 12 (JF (x0 , θ) − JF (x0 , θ)⊤ ). (19) +we see that they coincide only if JF = JF⊤ , i.e., only if the +system is conservative. 12: 3. Nudged Phase: Augmented Dynamics +Now, let SJ (x0 , θ) and AJ (x0 , θ) denote the symmetric 13: Integrate the dynamics twice starting from x0 +and antisymmetric parts of the Jacobian at the free (un- 14: +nudged) equilibrium, respectively. Then, we show in Ap- dx ∂C +pendix A that the gradient error increases with the spectral = F (x, θ) − β (x) − 2AJ (x0 , θ) (x − x0 ), + −1 dt ∂x +radius of SJ (x0 , θ) AJ (x0 , θ). Consequently, large (20) +antisymmetric contributions degrade the gradient estima- 15: until convergence to two new stationary states +tion, confirming empirical observations in the Appendix of x±β + A . +(Ernoult et al., 2020). In fact, in the pathological limit where 16: 4. Parameter Update +the Jacobian would be purely antisymmetric SJ (x0 , θ) = 0, 17: Update the parameters according to: +the update of VF gives the negative of the true gradient, 18: +maximizing the cost rather than minimizing it. ⊤ ! + xβA − x−β + + ∂F 0 A + ∆θ = ϵ (x , θ) . (21) + ∂θ 2β +3. Asymmetric EP +Here, we introduce Asymmetric EP (AsymEP), see Algo- 19: until convergence of θ +rithm 1, which removes the gradient estimate error inherent 20: Output: Optimized parameters θ. +to VF by adding a local correction term to the augmented +inference dynamics. The new nudged equilibrium xβA satis- +fies: where JFA (x, θ) is the Jacobian of the modified dynamical + ∂C β system Eq. (20). At the equilibrium point x0 , JFA is equal + F (xβA , θ) − β (x ) − 2AJ (x0 , θ) (xβA − x0 ) = 0, (22) to the transpose of the original Jacobian: + ∂x A +As in VF, we then obtain two perturbed states x±β + A for op- + JFA (x0 , θ) = JF (x0 , θ) − 2AJ (x0 , θ) +posite nudging ±β and apply the contrastive learning rule = SJ (x0 , θ) − AJ (x0 , θ) +of Eq. (12). = JF⊤ (x0 , θ). (24) +We now show that AsymEP gives rise to the correct learning +rule, i.e. that right-hand side of Eq. (21) is proportional to where we have used the decomposition Eq. (44) of the orig- +the gradient of the cost function dC 0 inal Jacobian J into its symmetric and antisymmetric com- + dθ (x ) at the equilibrium + 0 +point x (Eq. 14). To this end, note that the same reasoning ponents. Therefore, the left hand side of Eq. (23) is equal to +leading to Eq. (16) leads to the true post-synaptic term + + dxβA −1 ∂C 0 dxβA −1 ∂C 0 + = JFA (x0 , θ) (x ). (23) = JF⊤ (x0 , θ) (x ), (25) + dβ β=0 ∂x dβ β=0 ∂x + + 4 + Equilibrium Propagation for Non-Conservative Systems + +which, using Eq. (14), proves the result. Additionally, al- until a stationary point (z β , z ′β ) is reached. Upon conver- +though implied by the equality with the true gradient, we gence, we follow the standard EP paradigm in using the +explicitly demonstrate the equivalence of the gradient esti- difference z β − z ′β to compute the post-synaptic term. Un- + ′ + ′ +mates obtained by AsymEP and Backpropagation Through der the change of variables m = z+z 2 and d = z − z , we +Time in Appendix B following (Ernoult et al., 2019). prove in Appendix D that m follows the original dynamics + F (ensuring valid inference), while d relaxes to a "physical" +Note that the corrective term −2AJ (x0 , θ)(x − x0 ) in + error signal proportional to the cost gradient. +Eq. (20) is spatially local: AJ vanishes for unconnected +neurons, and (x − x0 ) is available at the synapse given the It is important to notice that while Dyadic EP introduces a +memory mechanism already required by Eq. (12). This distinct formulation, it remains consistent with the general +correction can create backward connections (Section 5.3). theoretical setting of EP and matches the computational +However, in physical realizations, both feedforward and cost of AsymEP. Note also that we start the evolution of +feedback connections must be physically present, though the free phase (β = 0) with the identical initial condition +feedback may be deactivated during inference. for z and z ′ , (i.e., d = 0). This guarantees that integrat- + ing Eq. (32) leads to a symmetric stationary point where +4. Dyadic EP z 0 = z ′0 . Finally, we underline that the modified varia- + tional update rule in Eq. (34) is equivalent to the standard +We now introduce Dyadic EP (Algorithm 2), a variational symmetric EP update rule in Eq. (8) (see Appendix D). +algorithm that computes the exact cost gradient in the limit + Now, to make this concrete, consider a continuous Hopfield +of infinitesimal nudging. It maps the original n-variable + network (see also Eq. (35)) with an asymmetric connection +dynamics F (x, θ) onto a 2n-variable system (z, z ′ ) defined + matrix J. After some calculations (see Appendix F), the +by an energy H(z, z ′ , θ) and cost D(z, z ′ ). We show in + augmented energy of the system can be re-expressed as: +Appendix E that AsymEP can be seen as the first-order +projection of Dyadic EP onto the original n-dimensional 1 1 + HT = − ρ(z)⊤ Sρ(z) + ρ(z ′ )⊤ Sρ(z ′ ) − ρ(z)⊤ Aρ(z ′ ) +state space. 2 2 + 1 β +The new system is defined by the energy H and cost function + (∥z∥ − ∥z ∥ ) + (C(z, y) + C(z ′ , y)) , + 2 ′ 2 + 2 2 +D, given in terms of F and C by: (29) + + z + z′ + where S and A are the symmetric and antisymmetric parts + H(z, z ′ , θ) = −(z − z ′ )⊤ F ,θ , of J, respectively and ρ is an element-wise non-linearity. + 2 + ′ + An interesting analogy can be drawn with standard learning + z+z + D(z, z ′ ) = C , (26) rules in discrete Hopfield networks (Hopfield, 1982). For + 2 a sequence of binary memories {ξ 1 , . . . , ξ m } where ξ µ ∈ +where z, z ′ ∈ Rn . In order to learn, we introduce the aug- {−1, 1}n , S P corresponds to the standard autoassociative +mented energy Hebbian rule µ ξ µ (ξ µ )⊤ , creating stable attractors at each + pattern. In contrast, A corresponds to the heteroassociative + HT (z, z ′ , θ, β) = H(z, z ′ , θ) + βD(z, z ′ ). (27) + rule (e.g., a cycle between ξ µ and ξ ν given by ξ ν (ξ µ )⊤ − +The equilibrium configuration corresponds to a saddle point ξ µ (ξ ν )⊤ ), encoding transitions between patterns. +of HT , where z minimizes and z ′ maximizes the energy. + For this specific energy, the update rule given by Eq. (34) +This poses no issue for EP, which requires only that the + can be re-expressed as: +joint state (z, z ′ ) reaches a stationary state. Although this +min-maximization can be interpreted as z evolving forward 1 ⊤ + ρ(z ′β ) − ρ(z β ) ρ(z ′β ) + ρ(z β ) . (30) + + ∆J ∝ − +and z ′ backward in time, in practice they evolve forward 2β +simultaneously, as we integrate the coupled equations: In the limit β → 0, this gives: + z + z′ + + dz ∂HT β + ! + =− =F ,θ d + dt ∆J ∝ ⊙ ρ′ (m)ρ(m)⊤ . (31) + ∂z ⊤ 2 + β + z − z′ β ∂C z + z ′ + + ∂F + + − , + 2 ∂z z+z′ 2 ∂z 2 + 2 matching the learning rule in (Pineda, 1987), with + β + + dz ′ ∂HT + + z + z′ + limβ→0 dβ being the error signal. + =+ ′ + =F ,θ + dt ∂z 2 + ′ ⊤ + β ∂C z + z ′ + + z−z ∂F 5. Numerical Experiments + − + , + 2 ∂z ′ z+z′ 2 ∂z ′ 2 + 2 In this section, we numerically validate AsymEP (Algo- + (28) rithm 1). The neuronal dynamics follows the one introduced + + 5 + Equilibrium Propagation for Non-Conservative Systems + +Algorithm 2 Dyadic EP where ∥ · ∥F denotes the Frobenius norm. Note that this + 1: Inputs: Force field F (x, θ), cost function C(x, y), metric does not capture the asymmetry of the Jacobian, + nudging parameter β, learning rate ϵ which depends on the state x. + 2: repeat For numerical experiments, we restricted the network to a + 3: 1. Free Phase: Evolve to stationary state layered architecture with a single hidden layer to facilitate + 4: Evolve the system dynamics, starting from identi- comparison with prior work. Accordingly, J in contains + cal initial conditions z(0) = z ′ (0) = z0 , only input-to-hidden connections, while J dyn is block off- + 5: diagonal, encoding bidirectional interactions between the + dz ∂H dz ′ ∂H + =− , =+ ′, (32) hidden and output layers. Both J in and J dyn are trained. + dt ∂z dt ∂z + We first use MNIST (LeCun, 1998) (60k train, 10k test) + 6: until stationary states z 0 , z ′0 are reached. + followed by Fashion-MNIST to validate AsymEP, and then + 7: 2. Nudged Equilibrium + we further validate AsymEP and Dyadic EP by comparing + 8: Evolve the system dynamics, starting from the + them to Backpropagation on a convolutional feedforward, + solution of the free phase z 0 = z ′0 : + with CIFAR-10. Inputs are normalized using min-max to + 9: + dz ∂HT dz ′ ∂HT [−1, 1] and targets are one-hot encoded in {−1, 1}. All + =− , =+ , (33) hyperparameters are detailed in Appendix G, along with + dt ∂z dt ∂z ′ + additional details and numerical results. +10: until two nudged stationary states z β , z ′β are + reached. + 5.1. Symmetric Initialization +11: 3. Parameter Update +12: Update the parameters according to: We start by comparing AsymEP with standard EP and +13: VF. All algorithms are initialized with an identical sym- + 1 ∂H(z β , z ′β , θ) + + ∆θ = −ϵ (34) metric matrix J dyn . EP maintains this symmetry through- + β ∂θ out training, while VF and AsymEP induce asymmetry in +14: until convergence of θ J dyn . Since EP and VF already achieve strong performance +15: Output: Optimized parameters θ. on MNIST, the purpose of this experiment is to validate + AsymEP and compare it against EP and VF rather than + outperform the state of the art. +in (Scellier &Bengio, 2017), and is generalized to allow Figure 1 compares the three algorithms as a function of +for non-reciprocal forces as in (Scellier et al., 2018). For hidden-layer dimension after 1 and 20 epochs. AsymEP +clarity, we express the forces in a form that explicitly sepa- consistently outperforms the baselines, suggesting it learns +rates the contributions of the external input and the recurrent faster and better. +interactions: + Figure 2 studies the evolution of the asymmetry ratio rstr . + F (x) = ρ′ (x) ⊙ J in u + J dyn ρ(x) − x, + + (35) The results are reported for 50 hidden neurons. As expected, + EP preserves the initial weight symmetry. In contrast, VF +where u ∈ RNin denotes the input and x ∈ RNdyn the neu- and AsymEP induce non-trivial evolution of rstr following +ronal state, comprising both hidden and output units. The two distinct patterns, resulting in three distinct network +matrices J in ∈ RNdyn ×Nin and J dyn ∈ RNdyn ×Ndyn define the configurations. A complementary figure is available in Ap- +input and recurrent connectivity, respectively. The activation pendix G.1. +function ρ(·) is taken to be the hyperbolic tangent, applied +element-wise. 5.2. Fixed Asymmetry Ratio +If J dyn is symmetric, we can define the energy: While the previous section focused on networks compatible + 1 1 with all three algorithms (EP, VF, AsymEP), we now turn + E(x) = ∥x∥2 − ρ(x)⊤ J dyn ρ(x) − ρ(x)⊤ J in u, (36) to architectures with strong structural asymmetry. In this + 2 2 + regime, EP is inapplicable by construction, and, as we show, +which is identical to that of (Scellier &Bengio, 2017), pro- VF performs poorly, contrary to AsymEP which remains +vided that the input neurons are activated as ρ(u). effective. +Equation (35) naturally motivates a quantitative measure of To this end, we consider a class of networks where the +structural asymmetry rstr , defined as: asymmetry ratio rstr defined in Eq. (37) is kept fixed. Let S̃ + ⊤ and à be arbitrary symmetric and antisymmetric matrices + ∥(J dyn − J dyn )/2∥F in RNdyn ×Ndyn respectively. We enforce a fixed rstr via the + rstr = , (37) + ∥J dyn ∥F + + 6 + Equilibrium Propagation for Non-Conservative Systems + + where γ ∈ R is a learnable global scale. + Using VF and AsymEP, we train a layered network with one + hidden layer of 50 neurons (in which case S̃ and à are block + off-diagonal) for different values of rstr to investigate the + impact of structural asymmetry. We compare two training + regimes: training only the input weights J in (and the scale + γ), versus training all parameters including J dyn . The first + regime trains only the external forces from the input ρ′ (x) ⊙ + J in u (which correspond to a symmetric contribution in the + Jacobian) applied to our non-conservative system, while + the second additionally trains J dyn and therefore the non- + (a) Results after one epoch. symmetric part of the Jacobian directly. + Figure 3 summarizes the results. We find that AsymEP + maintains robust performance across all asymmetry levels + (e.g., achieving an accuracy of 93.8 ± 0.4% at rstr = 0 and + 94.9 ± 0.2% at rstr = 0.875 when training all parameters) + and can even learn when the recurrent connection matrix + J dyn is completely antisymmetric (rstr = 1). Additionally, + training all parameters shows significant improvement over + training only J in . + In contrast, VF performs well at low asymmetry ratios + but degrades as asymmetry increases, eventually dropping + to chance levels (e.g., accuracies of 5 ± 3% and 8 ± 4% + (b) Results after 20 epochs. at rstr = 1 for input-only and all-parameter training, re- +Figure 1. Comparison of algorithm performance on MNIST using spectively). When only J in is trained, VF accuracy col- +a layered architecture with one hidden layer and symmetric ini- lapses around rstr ≈ 0.5, whereas training all parameters +tialization. Squares denote AsymEP, circles EP, and triangles VF. delays this collapse until rstr ≈ 0.8. Our analysis in Ap- +Test accuracy (averaged over 10 runs) is shown after one epoch + pendix G.2.1 reveals that VF adjusts the dynamics such that +(Fig. 1a) and 20 epochs (Fig. 1b). + the asymmetry of the Jacobian’s off-diagonal terms remains + strictly lower than the structural asymmetry ratio. The train- + ing appears to adjust the neuronal state such that neurons + connected by strongly asymmetric weights have low activa- + tion. As shown in Appendix G.2.1, AsymEP learns faster + than VF across all levels of asymmetry. + Finally, Appendix G.3 opens with a brief theoretical dis- + cussion of the stability of these non-conservative dynamics, + followed by simulations on all-to-all topologies with con- + strained rstr and input projections J in . Even in this worst- + case setting, AsymEP reduces oscillations and improves + stability. + + 5.3. Feedforward Architectures +Figure 2. Evolution of the asymmetry ratio rstr (defined in Eq. (37)) We now consider a purely feedforward architecture. Here +during training on MNIST for AsymEP, EP and VF, initialized +from a symmetric configuration. The models use 50 hidden neu- VF trains only the last layer: with no backward connections, +rons. the output nudging signal cannot reach earlier layers, so for + every layer but the last the nudged stationary states coincide + with the free states, giving zero weight updates. As only +following parameterization of the recurrent parameters: the output layer is trained, the system essentially becomes + "q # an Extreme Learning Machine (Huang et al., 2006; Wang + dyn 2 S̃ à et al., 2022). In contrast, AsymEP introduces a correction + J =γ 1 − rstr + rstr , (38) that generates effective backward connections, allowing the + ∥S̃∥F ∥Ã∥F + + 7 + Equilibrium Propagation for Non-Conservative Systems + + tivity structures inspired by (Millidge et al., 2023), while + keeping the number of trainable parameters fixed. + Experiments are conducted on Fashion-MNIST using a two- + hidden-layer network with hidden dimensions 500 and 200. + Network states are denoted (x0 , x1 , x2 , x3 ), where x0 is + the input and x3 = xL the output. Forward and backward + connections are denoted by Wk and Bk , respectively, with + W1 = J in . + We consider three classes of dynamics. First, the Continuous + Hopfield (CH) dynamics introduced previously: + dxk + = −xk +ρ′ (xk )⊙ Wk ρ(xk−1 )+(1−δk,L )Bk ρ(xk+1 ) . + dt + (40) + Second, Predictive Coding (PC) dynamics, defined through + the prediction errors ek = xk − Wk ρ(xk−1 ), whose fixed +Figure 3. Impact of the structural asymmetry ratio rstr on accuracy +(top) and standard deviation over 10 runs (bottom) on MNIST. point ek = 0 corresponds to a standard feedforward net- +We compare VF (orange) and AsymEP (blue) under two training work: +regimes: training only J in (dashed) or all parameters (solid). + dxk + = −ek + (1 − δk,L ) (ρ′ (xk ) ⊙ (Bk ek+1 )) . (41) + dt +nudging signal to influence all layers. We make this explicit + Third, a standard dynamics chosen for direct comparison +for a network with one hidden layer. + with backpropagation: +Let the state x be partitioned in hidden h and output o + dxk +layers. The recurrent connection matrix is then J dyn = = −xk + Wk ρ(xk−1 ) + (1 − δk,L )Bk ρ(xk+1 ). (42) + dt + + 0 0 + . The forces of the system are: + Wh→o 0 For each dynamics, we examine three connectivity scenar- + β ios. + Fh = ρ′ (h) ⊙ J in u + λ(Wh→o )⊤ (o − o0 ) − h + + + + + 0 + + ⊤ + Fo = ρ′ (o) ⊙ Wh→o ρ(h) − λWh→o (h − h ) • In the asymmetric case (Bk ̸= Wk+1 + β + ), the backward + (39) + weights Bk are randomly initialized and kept fixed + + ∂C while only the forward weights are trained, ensuring a + + λβ −o + + + ∂o fair comparison (i.e., identical number of parameters); +where λ is 0 during the free inference and 1 during the in PC, the learning rule for Bk is zero when only inputs +nudged phase (Eq. 20). The force on the hidden layer Fhβ are clamped. +now depends on the output layer through the term ρ′ (h) ⊙ ⊤ + ⊤ • In the symmetric / conservative case (Bk = Wk+1 ), the +(Wh→o ) (o − o0 ), enabling the nudge (the term β ∂C ∂o ) to CH and PC dynamics derive from an energy functional, +influence the hidden layer. This implicitly assumes that the while the standard dynamics remains non-conservative +hardware implementation supports the physical activation due to its non-symmetric Jacobian. +of these backward connections. + • In the feedforward case (Bk = 0), the PC and stan- +We validate this using a single hidden layer of only 20 neu- + dard dynamics coincide; for the standard dynamics, the +rons on MNIST. After training, VF saturates with 64.3 ± + AsymEP learning rule mirrors backpropagation, with +2.0% accuracy, whereas AsymEP reaches 92.7 ± 0.5% ac- + ∆xβk = 2β1 + (xβ − x−β ) acting as the propagated error +curacy. We expect this discrepancy to increase with network + signal. +depth, since this increases the number of layers unable to +learn under VF. A figure with the accuracy during training +can be found in Appendix G.4.2. Table 1 shows that AsymEP consistently outperforms VF + in both asymmetric and feedforward settings, in final ac- + curacy, learning speed, and stability. After a single epoch +5.4. Advantages of Non-Conservative Dynamics + it already provides on average a 15% accuracy gain with +AsymEP is not tied to a specific neural dynamics. To further an order-of-magnitude reduction in variance. Remarkably, +assess the benefits of training non-conservative dynamics AsymEP with asymmetric connectivity also surpasses EP +using AsymEP, we compare several dynamics and connec- on symmetric networks despite training only the forward + + 8 + Equilibrium Propagation for Non-Conservative Systems + +weights, suggesting that relaxing symmetry constraints may 6. Discussion and Conclusion +improve expressivity. Supplementary results are provided +in Appendix G.5. In this work, we extended Equilibrium Propagation (EP) + to non-conservative systems that reach stationary states by + deriving two mathematically equivalent algorithms that re- + cover the exact gradient of the cost function in the limit of +Table 1. Test accuracy on Fashion-MNIST (%) at Epoch 50 (mean +± std 10 runs). BP on a standard feedforward architecture using + infinitesimal nudging. +MSE and SGD achieve 87.37 ± 0.29%. The first approach, Asymmetric EP, preserves the original + inference dynamics. It introduces a corrective force during + EP AsymEP VF + the nudged phase that remains spatially local, as the anti- + Asym - 86.78 ± 0.14 85.20 ± 0.12 symmetric Jacobian is null for unconnected neurons and the + CH Feedfor - 86.05 ± 0.12 77.76 ± 0.37 + Sym 84.30 ± 0.13 - - + perturbation from equilibrium is available at the synapse + Asym - 86.20 ± 0.17 80.71 ± 6.17 level. Unlike standard methods like Recurrent Backpropa- + PC gation (Almeida, 1990; Pineda, 1987), this avoids explicit + Sym 84.78 ± 0.14 - - + Asym - 82.91 ± 0.48 75.52 ± 1.69 digital weight transposition. However, a physical mech- + Standard + Feedfor - 86.25 ± 0.16 78.58 ± 0.28 anism to obtain the local corrective force at the synapse + level remains a subject for future work. We also note that + AsymEP shares the temporal non-locality of standard EP. +Finally, to investigate how AsymEP scales with depth, we The second approach, Dyadic EP, doubles the state space +trained deeper fully connected networks with two and three to map non-reciprocal dynamics onto an energy land- +hidden layers of 500 neurons on Fashion-MNIST, reaching scape—conceptually reminiscent of multi-compartment cor- +86.41 ± 0.22% and 87.8 ± 0.15% test accuracy respectively. tical neurons, where apical dendrites integrate feedback + (analogous to z − z ′ ) separately from basal feedforward +5.5. Feedforward Training on CIFAR-10: BP vs. Dyadic input (analogous to z + z ′ ) (Guerguiev et al., 2017). Addi- + EP vs. AsymEP tionally, this expanded space also enables the positive and + negative nudging phases to run in parallel. This offers a +To test whether our framework scales beyond shallow net- pathway to implement a version of EP that is local in time, +works, we conclude with a deep, purely feedforward CNN but would require a doubling of the degrees of freedom +architecture trained on CIFAR-10. We compare backprop- on the physical hardware. More fundamentally, the energy +agation (BP), VF, AsymEP and Dyadic EP in a controlled defined on the extended state shows that the tools and the- +setting where the gradient estimator is the only difference oretical guarantees obtained for EP should also apply to +between runs: all methods share the same configuration, the case of non-reciprocal forces, and that the variational +with the BP gradient replaced by the contrast of stationary principle behind EP is universal in the sense that it can be +states for the EP-based methods (see App. G.6 for details). applied to all networks which operate in a stationary state. +Each configuration is trained for 40 epochs over 5 seeds. + Furthermore, Dyadic EP is not restricted to the EP com- +Table 2 reports the final test accuracy. Both of our algo- munity and could suggest a more physically plausible al- +rithms scale to this regime, closely tracking the BP baseline ternative to the stationary-state Adjoint Method (for fixed +throughout training and matching its final accuracy: a paired inputs) (Chen et al., 2018): by solving the forward and ad- +t-test finds no significant difference between Dyadic EP and joint equations simultaneously via relaxation, it circumvents +BP (p = 0.75), and only a sub-percent gap for AsymEP. a separate backward-in-time pass. +In contrast, VF makes slight initial progress (peaking near +30%) before collapsing to chance level (10%). Additional Finally, our experiments on MNIST, Fashion-MNIST, and +details can be found in Appendix G.6 CIFAR-10 confirm that AsymEP and Dyadic EP consis- + tently outperform EP and VF, and notably enables effective + training of feedforward networks. + Our work thus opens new avenues for learning in neuro- +Table 2. Test accuracy on CIFAR-10 (%) at epoch 40 (mean ± std +over 5 seeds). morphic hardware, dissipative physical systems, and neural + architectures where asymmetry is intrinsic rather than inci- + Method Test Acc. (%) dental. + Backpropagation 90.66 ± 0.25 + Dyadic EP 90.69 ± 0.14 + AsymEP 89.74 ± 0.14 + VF 10.00 ± 0.00 + + + 9 + Equilibrium Propagation for Non-Conservative Systems + +Impact Statement References +This paper presents results that advance the field of machine Almeida, L. B. A learning rule for asynchronous percep- +learning. There are many potential societal consequences trons with feedback in a combinatorial environment. In +of our work, none of which we feel must be specifically Artificial neural networks: concept learning, pp. 102–111. +highlighted here. 1990. + Altman, L. E., Stern, M., Liu, A. J., and Durian, D. J. Ex- +Acknowledgments perimental demonstration of coupled learning in elastic + networks. Physical Review Applied, 22(2):024053, 2024. +AES is fully funded by the Horizon Europe Marie +Skłodowska-Curie Doctoral Network ’Postdigital Plus’ Aykroyd, C., Bourgoin, A., and Poncin-Lafitte, C. L. Hamil- +(Grant 101169118). DVA acknowledges the support of tonian treatment of non-conservative systems. arXiv +the French Community of Belgium through a FRIA fellow- preprint arXiv:2507.18658, 2025. +ship. SM acknowledges financial support by the Fonds de la + Bateman, H. On dissipative systems and related variational +Recherche Scientifique–FNRS, Belgium under EOS Project + principles. Physical Review, 38(4):815, 1931. +No. 40007536. Computational resources have been pro- +vided by the Consortium des Équipements de Calcul Intensif Berneman, M. and Hexner, D. Equilibrium propagation for +(CÉCI), funded by the Fonds de la Recherche Scientifique dissipative dynamics. Advanced Intelligent Systems, pp. +de Belgique (F.R.S.-FNRS) under Grant No. 2.5020.11 and e202501310, 2025. +by the Walloon Region. + Bishop, K. J., Biswal, S. L., and Bharti, B. Active colloids + as models, materials, and machines. Annual Review of + Chemical and Biomolecular Engineering, 14(1):1–30, + “ἁρμονίη ἀφανὴς φανερῆς κρείττων” + 2023. + Bowick, M. J., Fakhri, N., Marchetti, M. C., and Ra- + maswamy, S. Symmetry, thermodynamics, and topology + in active matter. Physical Review X, 12(1):010501, 2022. + Brandenbourger, M., Locsin, X., Lerner, E., and Coulais, C. + Non-reciprocal robotic metamaterials. Nature communi- + cations, 10(1):4608, 2019. + Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and + games. Cambridge university press, 2006. + Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, + D. K. Neural ordinary differential equations. Advances + in neural information processing systems, 31, 2018. + Cin, N. D., Marquardt, F., and Wanjura, C. C. Training + nonlinear optical neural networks with scattering back- + propagation. arXiv preprint arXiv:2508.11750, 2025. + Costa, P. and Santos, P. A. Directed equilibrium propagation + revisited. Mathematics, 13(11), 2025. ISSN 2227-7390. + Crick, F. The recent excitement about neural networks. + Nature, 337, 1989. + Dillavou, S., Stern, M., Liu, A. J., and Durian, D. J. Demon- + stration of decentralized physics-driven learning. Physi- + cal Review Applied, 18(1):014040, 2022. + Dillavou, S., Beyer, B. D., Stern, M., Liu, A. J., Miskin, + M. Z., and Durian, D. J. Machine learning without a pro- + cessor: Emergent learning in a nonlinear analog network. + Proceedings of the National Academy of Sciences, 121 + (28):e2319718121, 2024. + + 10 + Equilibrium Propagation for Non-Conservative Systems + +Ernoult, M., Grollier, J., Querlioz, D., Bengio, Y., and Scel- Indiveri, G. and Liu, S.-C. Memory and information pro- + lier, B. Updates of equilibrium prop match gradients cessing in neuromorphic systems. Proceedings of the + of backprop through time in an rnn with static input. IEEE, 103(8):1379–1397, 2015. + Advances in neural information processing systems, 32, + 2019. Kalinin, K. P., Gladrow, J., Chu, J., Clegg, J. H., Cletheroe, + D., Kelly, D. J., Rahmani, B., Brennan, G., Canakci, B., +Ernoult, M., Grollier, J., Querlioz, D., Bengio, Y., and Scel- Falck, F., et al. Analog optical computer for ai inference + lier, B. Equilibrium propagation with continual weight and combinatorial optimization. Nature, 645(8080):354– + updates. arXiv preprint arXiv:2005.04168, 2020. 361, 2025. +Falk, M. J., Strupp, A. T., Scellier, B., and Murugan, Kendall, J., Pantone, R., Manickavasagam, K., Bengio, + A. Temporal contrastive learning through implicit non- Y., and Scellier, B. Training end-to-end analog neural + equilibrium memory. Nature Communications, (16), networks with equilibrium propagation. arXiv preprint + 2025. arXiv:2006.01981, 2020. +Farinha, M. T., Pequito, S., Santos, P. A., and Figueiredo, + Laborieux, A. and Zenke, F. Holomorphic equilibrium + M. A. T. Equilibrium propagation for complete directed + propagation computes exact gradients through finite size + neural networks. In Proceedings of the 28th European + oscillations. Advances in Neural Information Processing + Symposium on Artificial Neural Networks, Computational + Systems, 35:12950–12963, 2022. + Intelligence and Machine Learning (ESANN 2020), 2020. +Galley, C. R. Classical mechanics of nonconservative sys- Laborieux, A. and Zenke, F. Improving equilibrium propa- + tems. Physical review letters, 110(17):174301, 2013. gation without weight symmetry through jacobian home- + ostasis. In Proceedings of the International Confer- +Guerguiev, J., Lillicrap, T. P., and Richards, B. A. Towards ence on Learning Representations (ICLR) 2024, Virtual + deep learning with segregated dendrites. elife, 6:e22901, (ICLR), May 2024. + 2017. + Laborieux, A., Ernoult, M., Scellier, B., Bengio, Y., Grollier, +Høier, R. and Zach, C. A lagrangian perspective on dual J., and Querlioz, D. Scaling equilibrium propagation to + propagation. In Proceedings of the First Workshop on Ma- deep convnets by drastically reducing its gradient estima- + chine Learning with New Compute Paradigms at NeurIPS tor bias. Frontiers in neuroscience, 15:633674, 2021. + 2023, New Orleans, LA, USA, Dec 2023. + Laydevant, J., Marković, D., and Grollier, J. Training an +Høier, R. and Zach, C. Two tales of single-phase contrastive + ising machine with equilibrium propagation. Nature Com- + hebbian learning. In Salakhutdinov, R., Kolter, Z., Heller, + munications, 15(1):3671, 2024. + K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. + (eds.), Proceedings of the 41st International Conference LeCun, Y. The mnist database of handwritten digits. + on Machine Learning, volume 235 of Proceedings of http://yann. lecun. com/exdb/mnist/, 1998. + Machine Learning Research, pp. 18470–18488. PMLR, + 21–27 Jul 2024. Martin, E., Ernoult, M., Laydevant, J., Li, S., Querlioz, D., + Petrisor, T., and Grollier, J. Eqspike: spike-driven equi- +Høier, R., Staudt, D., and Zach, C. Dual propagation: accel- + librium propagation for neuromorphic implementations. + erating contrastive hebbian learning with dyadic neurons. + Iscience, 24(3), 2021. + In Proceedings of the 40th International Conference on + Machine Learning, ICML’23. JMLR.org, 2023. Massar, S. Equilibrium propagation for learning in la- +Høier, R., Kalinin, K., Ernoult, M., and Zach, C. Dyadic grangian dynamical systems. Physical Review E, 112 + learning in recurrent and feedforward models. In NeurIPS (3):035304, 2025. + 2024 Workshop Machine Learning with new Compute Massar, S. and Mognetti, B. M. Equilibrium propagation: + Paradigms, 2024. the quantum and the thermal cases. Quantum Studies: +Hopfield, J. J. Neural networks and physical systems with Mathematics and Foundations, 12(1):6, 2025. + emergent collective computational abilities. Proceedings + Millidge, B., Song, Y., Salvatori, T., Lukasiewicz, T., and + of the national academy of sciences, 79(8):2554–2558, + Bogacz, R. Backpropagation at the infinitesimal infer- + 1982. + ence limit of energy-based models: Unifying predictive +Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. Extreme learning coding, equilibrium propagation, and contrastive hebbian + machine: theory and applications. Neurocomputing, 70 learning. In International Conference on Learning Rep- + (1-3):489–501, 2006. resentations (ICLR), 2023. + + 11 + Equilibrium Propagation for Non-Conservative Systems + +Nest, T. and Høier, R. Dyadic learning in asymmetric Wang, Q., Wanjura, C. C., and Marquardt, F. Training + convnets. In New Frontiers in Associative Memories- coupled phase oscillators as a neuromorphic platform + Workshop at ICLR 2026. using equilibrium propagation. Neuromorphic Computing + and Engineering, 4(3):034014, 2024. +Osat, S. and Golestanian, R. Non-reciprocal multifarious + self-organization. Nature Nanotechnology, 18(1):79–85, Wanjura, C. C. and Marquardt, F. Quantum equilibrium + 2023. propagation for efficient training of quantum systems + based on onsager reciprocity. Nature Communications, +O’Connor, P., Gavves, E., and Welling, M. Training a spik- 16(1):6595, 2025. + ing neural network with equilibrium propagation. In The + 22nd international conference on artificial intelligence Werbos, P. J. Backpropagation through time: what it does + and statistics, pp. 1516–1523. PMLR, 2019. and how to do it. Proceedings of the IEEE, 78(10):1550– + 1560, 1990. +Pineda, F. Generalization of back propagation to recurrent + and higher order neural networks. In Neural information Yi, S.-i., Kendall, J. D., Williams, R. S., and Kumar, S. + processing systems, 1987. Activity-difference training of deep neural networks using + memristor crossbars. Nature Electronics, 6(1):45–51, +Pourcel, G., Basu, D., Ernoult, M., and Gilra, A. Lagrangian- 2023. + based equilibrium propagation: generalisation to arbi- + trary boundary conditions & equivalence with hamilto- + nian echo learning. arXiv preprint arXiv:2506.06248, + 2025. + +Rageau, T. and Grollier, J. Training and synchronizing + oscillator networks with equilibrium propagation. Neuro- + morphic Computing and Engineering, 2025. + +Sajnok, K. and Matuszewski, M. Near-equilibrium propaga- + tion training in nonlinear wave systems. arXiv preprint + arXiv:2510.16084, 2025. + +Scellier, B. Quantum equilibrium propagation: Gradient- + descent training of quantum systems. arXiv preprint + arXiv:2406.00879, 2024. + +Scellier, B. and Bengio, Y. Equilibrium propagation: Bridg- + ing the gap between energy-based models and backprop- + agation. Frontiers in computational neuroscience, 11:24, + 2017. + +Scellier, B., Goyal, A., Binas, J., Mesnard, T., and Bengio, + Y. Generalization of equilibrium propagation to vector + field dynamics. arXiv preprint arXiv:1808.04873, 2018. + +Scellier, B., Mishra, S., Bengio, Y., and Ollivier, Y. Agnostic + physics-driven deep learning. arXiv:2205.15021v1, 2022. + +Scurria, A. E. A physical theory of backpropagation: Exact + gradients from the least-action principle. 2026. + +Stern, M., Hexner, D., Rocks, J. W., and Liu, A. J. Su- + pervised learning in physical networks: From machine + learning to learning machines. Physical Review X, 11(2): + 021045, 2021. + +Wang, J., Lu, S., Wang, S.-H., and Zhang, Y.-D. A review + on extreme learning machine. Multimedia Tools and + Applications, 81(29):41611–41660, 2022. + + 12 + Equilibrium Propagation for Non-Conservative Systems + +A. Gradient Estimation Error in VF where s denotes the dynamical state of the system. This + symmetry is the linchpin of the equivalence proof, as the +In this appendix, we quantify the gradient estimation error gradient expressions derived for BPTT and standard EP +introduced by VF in the limit where the Jacobian asymmetry differ precisely by a transpose operation applied to ∂F + ∂s . +is small. + This observation aligns with our analysis in the main text: +Comparing the post-synaptic update terms in Eqs. (12) and VF fails in non-conservative systems due to the missing +(14) gives the following error in the gradient of the cost: transpose in the post-synaptic term (see Eq. (16)). Following + ⊤ the derivation in Ernoult et al. (2019) (viz., Appendix A, Eqs. + ∂F 0 (31–33)), the recursive relations for the gradients in BPTT + Error = − (x , θ) + ∂θ are given by: + −1 −1 ∂C 0 +× JF (x0 , θ) − JF⊤ (x0 , θ) (x , y), (43) ∂ℓ + ∂x ∇BPTT + s (0) = (s⋆ , y), (49) + ∂s +To quantify this error, we decompose the Jacobian JF (x, θ) and for all t = 1, . . . , K, +into its symmetric part SJ (x, θ) and antisymmetric part + ⊤ + ∂F + SJ (x, θ) = 12 JF (x, θ) + JF⊤ (x, θ) , ∇BPTT ∇BPTT (t − 1), + + s (t) = (x, s⋆ , θ) s (50) + (44) ∂s + AJ (x, θ) = 12 JF (x, θ) − JF⊤ (x, θ) . + + ⊤ + ∂F + ∇BPTT + θ (t) = (x, s⋆ , θ) ∇BPTT + s (t − 1), (51) +Assuming the asymmetry AJ (x, θ) is small, we can make ∂θ +a series expansion in SJ−1 AJ (omitting the dependencies where θ represents the optimization parameters, ℓ is the +for clarity). Applying the Neumann expansion for small cost function, s⋆ is the free equilibrium state (satisfying +∥SJ−1 AJ ∥ gives F (s⋆ ) = 0), y is the target, and x is the input. The index t + ∞ + ! denotes the unrolled time steps, initialized at s(0) = s⋆ . + X + (JF ) −1 + = (−1) n + (SJ−1 AJ )n SJ−1 , (45) In contrast, the gradients computed by VF follow the recur- + n=0 sion (viz., Ernoult et al. (2019), Appendix A, Eqs. (24–26)): + ∞ + ! + X + (JF⊤ )−1 = (SJ−1 AJ )n SJ−1 . (46) ∂ℓ + ∆EP + s (0) = − (s⋆ , y), (52) + n=0 ∂s +Subtracting the two series and assuming convergence, we and for all t ≥ 0, +finally obtain + ∂F + ∆EP + s (t + 1) = (x, s⋆ , θ) ∆EPs (t), (53) + ∞ ∂s + ! + X + −1 + 2n+1 + −1 ⊤ −1 + (JF ) − (JF ) = −2 SJ AJ SJ−1 . + ∂F + ⊤ + n=0 ∆EP + θ (t + 1) = (x, s ⋆ , θ) ∆EP + s (t). (54) + (47) ∂θ + Comparing these two sets of equations confirms that the only +B. Equivalence between AsymEP and BPTT difference are Eqs. (50) and (53), specifically the transpose + of the Jacobian ∂F + ∂s (ignoring the global sign difference in +In this appendix, we sketch the equivalence between the Eqs. (49) and (52)). +gradient estimate computed by AsymEP and Backpropaga- +tion Through Time (BPTT) (Werbos, 1990) for a Recurrent In AsymEP, we modify the dynamics by adding a correction +Neural Network with fixed inputs. Our derivation relies on term dependent on the antisymmetric part of the Jacobian. +the proof provided by Ernoult et al. (2019), which estab- Denoting the force of this augmented system by F A , the +lished that standard (conservative) EP computes gradients Jacobian at the free equilibrium satisfies: +identical to those of BPTT. To facilitate direct comparison, ⊤ + ∂F A + +we adopt their notation for this section. ∂F + (x, s⋆ , θ) = (x, s⋆ , θ) . (55) + ∂s ∂s +The proof provided by Ernoult et al. (2019) relies on the +assumption that the vector field F (i.e., transition function) By substituting this corrected Jacobian into the recursive +is derived from a scalar potential function, which implies relations, AsymEP recovers the exact transpose required +that by BPTT. Consequently, our method extends the equiva- + ⊤ + ∂F ∂F lence between EP and BPTT to the general case of non- + = , (48) conservative force. + ∂s ∂s + + 13 + Equilibrium Propagation for Non-Conservative Systems + +C. Out-of-Equilibrium Mechanics C.3. Symmetry Breaking as Credit Assignment +Here we sketch the physical picture behind the doubled- On the diagonal manifold z = z ′ the doubled system enjoys +energy construction of Eq. (26). The full derivation from a gauge symmetry: the auxiliary variable z ′ is redundant and +Hamilton’s least-action principle, together with its connec- the difference d is identically zero. Credit assignment is im- +tion to the Bateman–Galley formalism for non-conservative plemented by deliberately breaking this symmetry through +classical mechanics (Bateman, 1931; Galley, 2013; Aykroyd the task cost. Adding βD(z, z ′ ) = β C(m) to H exerts +et al., 2025), can be found in (Scurria, 2026). opposite forces on z and z ′ and drives them apart, so that + the difference d ceases to be redundant and begins to carry +C.1. The Helmholtz Obstruction information about the loss landscape. + +The natural physical route to a variational principle for a +dynamical system ẋ = F (x, θ) is to seek a scalar potential D. Proofs for Dyadic EP +E such that F = −∂x E. The classical Helmholtz integra- We now demonstrate that Dyadic EP correctly trains the +bility condition states that such an E exists if and only if the parameters θ of the original force field F (x, θ), giving the +Jacobian JF is symmetric everywhere. Whenever the inter- 0 + exact gradient dC(x̄ + dθ + ) + in the limit of infinitesimal nudging. +actions are non-reciprocal — as in feedforward networks, +active matter, or driven optical systems — JF acquires +a non-zero antisymmetric part and the Helmholtz condi- D.1. Proof of EP +tion fails identically. No scalar potential on the original First, recall that standard EP does not strictly require the +n-dimensional state space can then generate the dynamics, system to settle at an energy minimum; it requires only that +and the “energy minimisation” route at the heart of standard the system reaches a stationary state (a fixed point of the +EP is blocked at the structural level. The obstruction is not dynamics). Indeed, using the notation of Section 2.1, EP +a matter of computational convenience: it reflects the fact relies on the key identity: +that the rotational component of F carries information that +no scalar function of x alone can record. d2 d2 + ET (xβ , θ) = ET (xβ , θ). (57) + dθdβ dβdθ +C.2. Variational Reconstruction on a Doubled Space Expanding the total derivative with respect to β gives: +Applying the Bateman–Galley formalism circumvents this ⊤ + ∂ET (xβ , θ) dxβ ∂ET (xβ , θ) + +obstruction by enlarging the configuration space. The single d + ET (xβ , θ) = + +state x ∈ Rn is replaced by a conjugate pair (z, z ′ ) ∈ R2n , dβ ∂x dβ ∂β +and the rotational component of F — which has no scalar = C(xβ ). (58) +generator on the original n-dimensional space — is ab- +sorbed into a bilinear coupling between z and z ′ on the Where the first term vanishes because the system is at a + ∂ +doubled space, where it does admit a variational descrip- stationary state, i.e., ∂x ET (xβ , θ) = 0; this holds even if +tion. The physical motion is recovered on the diagonal the system is not at a minimum of ET . Similarly, for the +submanifold z = z ′ (the so called ’physical limit’), while derivative with respect to θ: +the off-diagonal direction d = z − z ′ supplies the additional + d ∂ET (xβ , θ) +degree of freedom needed to encode non-reciprocity. ET (xβ , θ) = , (59) + dθ ∂θ +Specializing this reconstruction to the overdamped (first- + where we additionally assume that the cost function does +order) regime relevant to relaxational neural dynamics yields + not depend explicitly on the parameters θ. Substituting these +the bilinear energy + results into Eq. (57) in the limit of infinitesimal nudging + (β → 0) recovers the fundamental relation given by Eq. (9). + z + z′ + + H(z, z ′ , θ) = −(z − z ′ )⊤ F ,θ , (56) + 2 + D.2. Proof of Dyadic EP +which is precisely Eq. (26). The symmetric midpoint m = We analyze now the stationary states of Dyadic EP by intro- +(z + z ′ )/2 plays the role of the physical coordinate of the ducing the change of variables: +doubled system, while d is the auxiliary direction along +which non-reciprocity is stored. On the submanifold z = z ′ z + z′ + m= , d = z − z′. (60) +the coupling proportional to (z − z ′ ) vanishes identically 2 +and both states evolve under the original field F , so the In these coordinates, the augmented energy HT becomes +doubling leaves the on-shell physics unchanged. We refer +the reader to (Scurria, 2026) for the full construction. HT (m, d, θ, β) = −d⊤ F (m, θ) + βC(m) (61) + + 14 + Equilibrium Propagation for Non-Conservative Systems + +and the dynamics in Eq. (28) can be rewritten as: In Dyadic EP, we instead employ the single-phase update: + + 1 ∂H(z β , z ′β , θ) + + dm ∂HT + =− = F (m, θ), (62) ∆θ ∝ − (70) + dt ∂d β ∂θ + dd ∂HT ∂ + =− = dT JF (m, θ) − β C(m). (63) This choice avoids the overhead of evolving two coupled + dt ∂m ∂m + equations in the extended space, which would be computa- + β +The stationary states (mβ , d ) are the solutions to: tionally equivalent to evolving four equations in the original + space (two for +β and two for −β). Using Eq. (70), we + F (mβ , θ) = 0, (64) evolve only one coupled equation for +β in the extended + space; this corresponds to two equations in the original + βT ∂ + d JF (mβ , θ) − β C(mβ ) = 0. (65) space, thereby achieving the same computational complex- + ∂m ity as AsymEP. Furthermore, this single-phase formulation + suggests a pathway toward making the update local in time, +This leads to the following observations: provided appropriate hardware is used to implement the +1) The stationary state of m is independent of β and coin- augmented phase. +cides with the stationary state of the original system: Mathematically, these two approaches yield the same gradi- + ent estimate because the equations for dβ are linear. Explic- + z β + z ′β + = mβ = m0 = x0 . (66) itly we have : + 2 + ∂H(z β , z ′β , θ) ∂F z β + z ′β + + = −(z β − z ′β )⊤ ,θ +2) The Jacobian of the extended system defined in Eq. (26) ∂θ ∂θ 2 +is invertible, provided JF is invertible. This is most evident ⊤ + ∂F 0 −1 +from Eq. (63). = −β z ,θ JF⊤ (z 0 , θ) + ∂θ +3) The stationary state value of d is given by: + ∂C 0 + + × (z ) , (71) + β + + −1 ∂C 0 + ∂x + d = β JF⊤ (m0 , θ) (x ) (67) + ∂x where we have used Eqs. (66) and (67). Inspection of + Eq. (71) confirms that, up to corrections of order β 2 , we + 0 +In particular, when β = 0, we have d = 0, which implies obtain exactly the same gradient as in AsymEP. +that the free stationary states coincide: z 0 = z ′0 . +4) The cost at the stationary state of the extended system E. AsymEP versus Dyadic EP +is equal to the cost at the stationary state of the original + In this appendix, we demonstrate that Asymmetric Equilib- +system: + rium Propagation (AsymEP) emerges naturally as the first- + D(m0 ) = C(x0 ). (68) + order projection of the 2N -dimensional Dyadic Equilibrium +Consequently, the gradients of the cost with respect to the Propagation onto a single N -dimensional state space. We +parameters are identical. then formalize the physical trade-offs between the two ar- + chitectures. +Since both the original and extended systems, given respec- +tively in Eq. (28) and Eq. (1-2), share the same cost at their + E.1. AsymEP as the Linear Projection of Dyadic EP +respective stationary states, and because the Jacobians of +both models are invertible, applying EP update rule to the As established in Appendix D.2, transforming the 2N - +extended system give the correct gradient estimate for the dimensional extended space (z, z ′ ) into the mean state + ′ +parameters θ of the original system. m = z+z 2 and the difference state d = z − z ′ exactly +The final step of the proof is to establish the equivalence decouples the stationary dynamics. Because the stationary +between the standard parameter update rule in Eq. (8) and state of m is the free state of the original system (mβ = x0 ), +the modified rule used by Dyadic EP in Eq. (34). Indeed, if the cost function drives the difference variable to a stationary + β +we were to apply the standard update rule in the extended state d satisfying: +space, the update would be: β ∂C 0 + JF⊤ (x0 , θ)d = β (x ) (72) + 1 + β ′β + ∂H(z , z , θ) ∂H(z , z −β ′−β + , θ) + ∂x + ∆θ ∝ − − . + 2β ∂θ ∂θ To recover this exact error signal in an N -dimensional space, + (69) we postulate a modified dynamical system FA (x) compris- + + 15 + Equilibrium Propagation for Non-Conservative Systems + +ing the standard EP dynamics and a spatial correction Γ(x): F. Derivation of the Hopfield-like Energy + ∂C In this section, we derive the explicit energy functional for + FA (x) = F (x) − β (x) + Γ(x) (73) the Continuous Asymmetric Hopfield dynamics defined in + ∂x + Eq. (35). The force field is given by: +Let ∆x = xβA − x0 denote the displacement from the +free equilibrium. Expanding the stationarity condition F (x) = ρ′ (x) ⊙ (Jρ(x)) − x. (78) +FA (xβA ) = 0 to first order around x0 yields: + We omit external inputs J in for brevity, as they appear sym- + ∂C 0 metrically in the Jacobian. The variational Hamiltonian is + JF (x0 , θ)∆x − β (x ) + Γ(xβA ) ≈ 0 (74) defined as: + ∂x + z + z′ z + z′ + +To ensure the first-order displacement matches the Dyadic H(z, z ′ ) = −(z − z ′ )⊤ F + βC . + β 2 2 +EP error signal (i.e., ∆x ≈ d ), we substitute Eq. (72) into + (79) +the expansion: + To analyze this expression, we introduce the midpoint m = + z+z ′ + Γ(xβA ) = JF⊤ (x0 , θ) − JF (x0 , θ) ∆x and the difference d = z − z ′ . Since the separation + + (75) 2 + between z and z ′ is induced solely by the nudging parameter + = −2AJ (x0 , θ)(xβA − x0 ) (76) + β, the difference scales as ∥d∥ ∼ O(β). We therefore +This uniquely recovers the AsymEP augmented dynamics. neglect terms of order O(∥d∥3 ) (i.e., or equivalently O(β 3 )) +Finally, to eliminate the O(β 2 ) error, AsymEP evaluates the as they do not contribute to the gradient of the cost. +centered difference of two opposite nudges: The activation at the midpoint can be approximated as: + + dxA ρ(z) + ρ(z ′ ) + x±β 0 + A =x ±β + O(β 2 ) (77) ρ(m) = + O(∥d∥2 ). (80) + dβ β=0 2 + Similarly, the difference in activations is: +Subtracting these states cancels the O(β 2 ) error, yielding +1 +β −β β 3 +2 (xA − xA ) = d + O(β ), successfully recovering the ρ(z) − ρ(z ′ ) = ρ′ (m) ⊙ d + O(∥d∥3 ). (81) +exact post-synaptic update term. + Inverting this relation, we express the state difference as: +E.2. Physical Trade-offs and the Extended Space + z − z ′ = (ρ(z) − ρ(z ′ )) ⊙ ρ′ (m) + O(∥d∥3 ). (82) +We can view AsymEP and Dyadic EP as a space-time trade- +off of the same underlying physical optimization problem. + We substitute these expansions into the interaction term +AsymEP preserves the original N -dimensional state space of the Hamiltonian, Hint = −(z − z ′ )⊤ (ρ′ (m) ⊙ Jρ(m)). +of the network at the cost of temporal non-locality. The sys- Applying the identity a⊤ (b ⊙ c) = (a ⊙ b)⊤ c, we obtain: +tem must evolve sequentially, requiring physical memory + ⊤ +not only to store the free equilibrium x0 for the asymmet- Hint = − ((z − z ′ ) ⊙ ρ′ (m)) Jρ(m) +ric correction, but also to store the successive stationary + ρ(z) + ρ(z ′ ) + +states required to evaluate the contrastive gradient update. ≈ −(ρ(z) − ρ(z ′ ))⊤ J . (83) + 2 +AsymEP thus serves as the direct, spatially minimal exten- +sion of EP. Expanding the product gives: +Dyadic EP provide a learning signal that is local in both + 1h +space (where z − z ′ encodes the gradient) and time (allow- Hint = − ρ(z)⊤ Jρ(z) + ρ(z)⊤ Jρ(z ′ ) + 2 +ing the nudged phases to execute in parallel) at the cost i +of doubling the state space. In particular, capturing non- − ρ(z ′ )⊤ Jρ(z) − ρ(z ′ )⊤ Jρ(z ′ ) . (84) +conservative forces in this extended space requires a spe- +cific bilinear coupling, rather than a trivial superposition We decompose the connectivity matrix J into its symmetric +of uncoupled subsystems. It can be seen as a blueprint for part S and antisymmetric part A. The first and last terms +future neuromorphic hardware. simplify to ρ(z)⊤ Sρ(z). The cross terms satisfy: +Ultimately, the reduction of Dyadic EP to AsymEP via the ρ(z)⊤ Jρ(z ′ ) − ρ(z ′ )⊤ Jρ(z) = ρ(z)⊤ (J − J ⊤ )ρ(z ′ ) +variables m and d proves the universality of EP’s variational +principle. = ρ(z)⊤ (2A)ρ(z ′ ). (85) + + 16 + Equilibrium Propagation for Non-Conservative Systems + +Thus, the interaction term reduces to: The input parameters are then updated using the standard + learning rule (21). In particular, the presynaptic term associ- + 1 1 + Hint = − ρ(z)⊤ Sρ(z) + ρ(z ′ )⊤ Sρ(z ′ ) ated with the input weights is given by, + 2 2 + − ρ(z)⊤ Aρ(z ′ ) + O(∥d∥3 ). (86) ∂Fi + in + = δik ρ′ (xi )ul . (93) + ∂Jkl +Finally, for the nudging term, we expand the cost function The presynaptic terms associated with the dynamical param- + dyn +around the midpoint: eters Jij depend on the experiment. + + 1 + C(m) = (C(z) + C(z ′ )) + O(∥d∥2 ). (87) G.1. Symmetric Initialization + 2 + G.1.1. L EARNING RULES +When multiplying by β, the remainder term becomes β · +O(∥d∥2 ). Since ∥d∥ ∼ O(β), this remainder is of order For clarity, we write the learning rules for VF and AsymEP. +O(β 3 ) and can be consistently discarded alongside the third- For the input weights, using (93), we have: +order terms from the interaction expansion. 1 h +β i + in + ∆Jik ∝ (xi − x−β ′ 0 + i )ρ (xi )uk , (94) +Combining all these components, the final Hamiltonian is: 2β + + 1 1 while for the recurrent weight, we get: + H(z, z ′ ) = − ρ(z)⊤ Sρ(z) + ρ(z ′ )⊤ Sρ(z ′ ) + 2 2 1 h +β i + 1 ∆Jijdyn + ∝ (xi − x−β i )ρ′ 0 + (xi )ρ(x 0 + j ) . (95) + − ρ(z) Aρ(z ) + (∥z∥2 − ∥z ′ ∥2 ) + ⊤ ′ 2β + 2 + β ′ + For EP, we have: + + (C(z) + C(z )). (88) + 2 1 h +β i + in + ∆Jik ∝ ρ(xi ) − ρ(x−β + i ) uk , (96) +The saddle-point dynamics, given by Eq. 32, generated by 2β +this Hamiltonian are: and for the recurrent weights: + dz β ∂C 1 h +β + = ρ′ (z) ⊙ (Sρ(z) + Aρ(z ′ )) − z − + i + , dyn −β −β + dt 2 ∂z + (89) ∆Jij ∝ ρ(xi )ρ(x+βj ) − ρ(xi )ρ(xj ) . (97) + 2β + ′ + dz β ∂C + = ρ′ (z ′ ) ⊙ (Sρ(z ′ ) + Aρ(z)) − z ′ + . (90) + dt 2 ∂z ′ G.1.2. S UPPLEMENTARY N UMERICAL R ESULTS +This system recovers the original continuous Hopfield dy- To complement Fig. 2, we report the evolution of the accu- +namics when z = z ′ (assuming β = 0). racy of the three methods in Fig. 4. We consider a layered + network with 50 hidden neurons. While this capacity is +G. Experimental Details insufficient for state-of-the-art performance, it amplifies the + difference in accuracy between models to aid visualization. +As in the main text, the neuronal dynamics are governed by Models are trained for 20 epochs starting from a symmetric +the vector field: configuration, the natural setting for both VF and EP. With + this initialization, AsymEP consistently outperforms the + X dyn other methods and learns faster by exploiting the additional + Fi = ρ′ (xi ) Jij ρ(xj ) + bi (u) − xi , (91) + degrees of freedom of the asymmetric network. + j + + +where the input-dependent bias bi (u) is precomputed for G.2. Fixed Asymmetry Ratio +each MNIST input u as: This section details the implementation for the fixed asym- + X metry ratio experiments presented in Section 5.2, followed + bi (u) = Jilin ul . (92) by complementary numerical results regarding learning + l∈in speed and induced Jacobian asymmetry. +This term projects the input space into the recurrent sub- + G.2.1. L EARNING RULES +space. The bias yields a diagonal contribution to the Jaco- +bian JF = ∂F ∂x , and therefore does not contribute to the Parametrization and notation. To enforce a fixed asym- +antisymmetric correction used in the augmented dynamics metry ratio, we explicitly parameterize the independent ele- +Eq. (20) of AsymEP. ments of Eq. (38). We introduce two parameter vectors θS + + 17 + Equilibrium Propagation for Non-Conservative Systems + + Parameter Sym. Init. / Feedforward Fixed rstr Fixed rstr & rin + sec. 5.1 & 5.3 sec. 5.2 app. G.3 + Learning Rate (Input-Hidden) 0.05 0.05 0.0125 + Learning Rate (Hidden-Output) 0.01 0.01 0.0025 + Time Step (Dynamics Integration) 0.5 0.3 0.3 + Nudging Parameter (β) 0.5 0.5 0.5 + Free-phase Steps (nfree ) 20 30 40 + Nudged-phase Steps (nnudge ) 10 10 10 + Number of Epochs 40 / 20 30 40 + Batch Size 64 √64 √64 + Scaling Parameter γ n.a. 60 60 + Structure 784 - n.a. -10 784-50-10 all-to-all, 500 hid + Activation function ρ tanh tanh tanh + Initial Recurrent State s s ∼ U (−1, 1) s ∼ U(−1, 1) s ∼ U(−1, 1) + Initial Parameters θ θ ∼ N (0, N1 ) θ ∼ N (0, N1 ) θ ∼ N (0, N1 ) + Number of Runs (training + inference) 10 10 10 +Table 3. Trained Model Hyperparameters on MNIST. N is the total number of neurons, U(−1, 1) is a uniform distribution, and N (µ, σ 2 ) +is a Gaussian distribution. For the rstr parametrization, we choose more cautious hyperparameters for training and inference compared to +the symmetric initialization, due to increasingly non-conservative and potentially oscillatory dynamics. + + + + elements of S̃, the full matrices are constructed as: + S + S̃ij = δij ξi + (1 − δij )θk(max(i,j),min(i,j)) , (99) + A + Ãij = ϵij θk(max(i,j),min(i,j)) , (100) + where ϵij is the Levi-Civita symbol. The dynamical param- + eters are then given by: + dyn + Jij = γ(cS S̃ij + cA Ãij ), (101) + with normalization coefficients + p + 2 + 1 − rstr rstr + cS = , cA = , (102) + FS FA + defined in terms of the Frobenius norms: + v + uN M + uX X 2 +Figure 4. Evolution of the mean accuracy and standard deviation F =t + S ξ2 + 2i θS , k (103) +(over 10 runs) during training on MNIST for AsymEP, EP, and VF. i=1 k=1 +Models use 50 hidden neurons. v + u M + u X 2 + FA = t2 θkA . (104) + k=1 +and θA of size M = Ndyn (Ndyn − 1)/2, which encode the +off-diagonal elements of the symmetric and antisymmetric Presynaptic computation. The dependence of the nor- +components S̃ and Ã, respectively. The correspondence malization coefficients on the parameters introduces addi- +between matrix and vector indices is given by: tional regularization terms in the learning rule compared + to the parameterization of (Scellier &Bengio, 2017). The + (i − 1)(i − 2) gradients of the normalization coefficients are: + k(i, j) = + j, (1 ≤ j < i ≤ Ndyn ) + 2 ∂cS θkS ∂cS ξm + (98) = −2cS 2, = −cS 2, (105) + ∂θkS (FS ) ∂ξm (FS ) + ∂cA θA +where the condition j < i selects the strictly lower triangular A + = −2cA k 2 . (106) +elements. Introducing an additional vector ξ for the diagonal ∂θk (FA ) + + 18 + Equilibrium Propagation for Non-Conservative Systems + + Parameter Comparison Dyn. 2 hidden layers 3 hidden layers + sec. 5.4 sec. 5.4 sec. 5.4 + Learning Rate (Input-Hidden) 0.0016 0.0013 0.6 + Learning Rate (Hidden-Hidden) 0.0016 0.0013 0.6 + Learning Rate (Hidden-Output) 0.0016 0.0013 0.6 + Time Step (Dynamics Integration) 0.4 0.3 0.0075 + Nudging Parameter (β) 0.3 0.5 0.20 + Free-phase Steps (nfree ) 40 40 60 + Nudged-phase Steps (nnudge ) 20 20 30 + Number of Epochs 50 40 40 + Batch Size 64 64 64 + Layer Structure 784-500-200-10 784-500-500-10 784-500-500-500-10 + Activation function ρ tanh tanh tanh + Initial Recurrent State s s ∼ U(−1, 1) s ∼ U(−1, 1) s ∼ U (−1, 1) + Initial Parameters θ θ ∼ N (0, N1 ) θ ∼ N (0, N1 ) θ ∼ N (0, N1 ) + Number of Runs (training + inference) 10 10 10 +Table 4. Trained Model Hyperparameters on Fashion-MNIST. N is the total number of neurons, U(−1, 1) is a uniform distribution, and +N (µ, σ 2 ) is a Gaussian distribution. For the rstr parametrization, we choose more cautious hyperparameters for training and inference +compared to the symmetric initialization, due to increasingly non-conservative and potentially oscillatory dynamics. + + + +Combining these with the derivatives of the matrices S̃ and (where p > q): +Ã, we have: + N + θkA X + + ∂Fi ′ + ∂ S̃ij ∂ S̃ij = γc A ρ (xi ) −2 Ãij ρ(xj ) + = δip δjq + δiq δjp , = δij δkj (107) ∂θkA (FA )2 j=1 + ∂θkS ∂ξk + ∂ Ãij + δip ρ(xq ) − δiq ρ(xp ) . + = δip δjq − δiq δjp , (108) + ∂θkA (111) +where k corresponds to the index pair (p, q) with p > q, as +defined in Eq. (98). The full presynaptic terms are then: Initialization. To ensure the stability of the system, we + initialize our parameters suchhthat the + i variance of dynam- + dyn + • For the diagonal parameters ξm : ical parameters scales as Var Jij ∝ 1/Ndyn . This is a + conservative choice for the layered architectures used in our + dyn + ∂Fi + + ξm X + N experiments, where many entries of Jij are zero. + = γcS ρ′ (xi ) − S̃ij ρ(xj ) + ∂ξm (FS )2 j=1 In practice, we initialize the parameter vectors θS , θA , and + (109) + ξ with identical variances σ 2 . For large Ndyn , the expected + + δim ρ(xm ) . Frobenius norms approximate to E[FS,A ] ≈ Ndyn σ. Conse- + quently, the normalization coefficients become: + p + 2 + • For the off-diagonal symmetric parameters θkS (where 1 − rstr rstr + cS ≈ , cA ≈ . (112) + p > q): Ndyn σ Ndyn σ + + N Since the symmetric and antisymmetric components are sta- + θkS X + + ∂Fi ′ + = γc S ρ (x i ) −2 S̃ij ρ(xj ) tistically independent, the variance of the weights is derived + ∂θkS (FS )2 j=1 + as follows: + + δip ρ(xq ) + δiq ρ(xp ) . + • Diagonal elements (i = j): + (110) + 2 + h i 1 − rstr + Var Jiidyn = γ 2 c2S σ 2 ≈ γ 2 2 . (113) + • For the off-diagonal antisymmetric parameters θkA Ndyn + + 19 + Equilibrium Propagation for Non-Conservative Systems + + • Off-diagonal elements (i ̸= j): a zero-cost baseline (perfect prediction) during learning. + Specifically, for each method and value of rstr , we calcu- + h + dyn + i γ2 late the cumulative loss by summing the batch-averaged + = γ 2 c2S + c2A σ 2 ≈ 2 , + + Var Jij (114) + Ndyn costs of the first 5 epochs (out of 30, to avoid saturation + h i effects), and reporting the mean and standard deviation over + dyn +To satisfy Var Jij ∝ 1/Ndyn , we set: 10 independent training runs. Mathematically, for each run: + + p + γ= Ndyn (115) + 5 NX + X batches X C(x0 , u) +Note that by random matrix theory, diagonal elements do Cumul. Loss = , + |Bk | +not affect stability in the large Ndyn limit. epoch=1 k=1 (x0 ,u)∈Bk + (120) +Potential Simplification. Although the parameterization where Bk represents the k-th batch, and |Bk | denotes the +above is fully general, a simpler construction is possible number of examples in the batch. The parameters are up- +by removing self-connections (ξ = 0) and enforcing identi- dated after each batch step; consequently, the free equilib- +cal parameterization for the symmetric and antisymmetric rium x0 is inferred using the updated parameters and the +components, i.e., θS = θA = θ. The matrix elements then current example u. +become: + + S̃ij = (1 − δij )θk(max(i,j),min(i,j)) , (116) + Ãij = ϵij θk(max(i,j),min(i,j)) . (117) + +In this case, the Frobenius norms are equal (FS = FA ), and +we can omit the explicit normalization: + q + dyn 2 S̃ + r à . + Jij = 1 − rstr ij str ij (118) + +For a parameter θk corresponding to indices (p, q) with +p > q, the presynaptic term is given by: + q + ∂Fi + = ρ′ (xi ) 2 +r + 1 − rstr str δip ρ(xq ) + ∂θk + q (119) + + 2 −r + 1 − rstr δ ρ(x + str iq p . + ) + +While this parameterization works in simulations and keeps Figure 5. Cumulative loss as defined by (120) over the first 5 +the number of parameters constant for all rstr , it constrains epochs of training, for different asymmetry ratios rstr . We compare +the asymmetry to be “homogeneous”, by which we mean VF (orange) and AsymEP (blue), under two training regimes: +that the asymmetry ratio is identical for every pair of neu- training only J in (dashed) or all parameters (solid). +rons; hence, the network cannot learn to be symmetric in one +region and antisymmetric in another. Therefore, we choose +to explore the more general case of (38) in our experiments. In Fig 5, we observe that learning slows down for both al- + gorithms when rstr ≳ 0.6. This behavior likely results from +G.2.2. S UPPLEMENTARY N UMERICAL R ESULTS the increased difficulty of reaching a stationary state as the + dynamics become strongly asymmetric. With a fixed num- +To complement the results of Fig 3, we analyze the training + ber of inference steps, incomplete convergence degrades the +efficiency as a function of the asymmetry ratio rstr and in- + accuracy of the gradient estimates, thereby slowing down +vestigate the robustness of VF by monitoring the Jacobian + the learning. Fig 5 shows that while VF can eventually +asymmetry. + achieve competitive accuracy, it is consistently slower than + AsymEP as soon as asymmetry is introduced. +Training efficiency. We first study the training efficiency +of the two algorithms as a function of the asymmetry ra- +tio rstr . Inspired by the related concept in (Cesa-Bianchi +&Lugosi, 2006), we define the cumulative loss as the accu- Jacobian asymmetry. We next examine how the struc- +mulated difference between the free equilibrium cost and tural asymmetry rstr is reflected in the Jacobian of the dy- + + 20 + Equilibrium Propagation for Non-Conservative Systems + +namics (35), given by: + ∂Fi dyn ′ + = (1 − δij )ρ′ (xi )Jij ρ (xj ) + ∂sj + h i + + δij ρ′ (xi )(Jiidyn ρ′ (xi )) + ρ′′ (xi )bi − 1 . + (121) + +In our layered architecture, the self-connections are zero +(Jiidyn = 0). For the following analysis, we neglect all diag- +onal terms in the Jacobian (including external inputs and +potential), since they do not contribute to the antisymmetric +correction (20) and thus to the discrepancy between the per- +formance of VF and AsymEP. Consequently, we define the +following asymmetry ratio based solely on the off-diagonal +Jacobian JF,off : Figure 6. Asymmetry ratio of the Jacobian rjac defined in equation + ⊤ + (122) after training for different asymmetry ratios rstr . We compare + ∥JF,off − JF,off ∥F VF (orange) and AsymEP (blue), under two training regimes: + rjac = , (122) training only J in (dashed) or all parameters (solid). + ∥JF,off ∥F + +The results are presented in Fig 6. For each trained model Consequently, local stability requires the largest real eigen- +and ratio rstr , we compute rjac averaged over the stationary value of the effective weight matrix to be strictly less than 1. +states of the first batch (64 images) across 10 independent Assuming weights are initialized independently with vari- +runs. We observe that when structural asymmetry is strong ance σ 2 , Girko’s circular law dictates that the eigenvalues +and all parameters are trained, VF partially compensates for of√an asymmetric matrix uniformly populate a disk of radius +the asymmetry by adjusting the neuronal states. This can be σ n in the complex plane. In contrast, imposing symmetry +understood by rewriting the ratio as: forces the eigenvalues √ onto the real line, broadening the + spectral radius to 2σ n according to Wigner’s semicircle + dyn dyn ⊤ + ρ′ (xi ) Jij − (Jji ) ρ′ (xj ) law. As a result, asymmetric networks can stably accommo- + F + rjac = . (123) date larger variance in the weight initializations than their + dyn ′ + ρ′ (xi )Jij ρ (xj ) symmetric counterparts. + F + +Compared to the structural asymmetry ratio in Eq. (37), Asymmetry nevertheless introduces imaginary eigenvalues +a value of rjac < rstr indicates that the neuronal states ef- and, consequently, damped oscillations. To study this effect +fectively dampen the structural asymmetry, rendering the experimentally in a controlled setting, we constrain the input +dynamics more symmetric. This symmetrization of the Ja- projections J in . In the experiments of the main text, fixing +cobian appears without imposing an additional symmetriza- the structural asymmetry ratio rstr still allowed AsymEP +tion penalty and could be enhanced using the method of to reduce oscillations by aligning and increasing the input +(Laborieux &Zenke, 2022). This mechanism likely explains projections J in , thereby adding stabilizing diagonal contri- +the superior performance of ‘All (VF)’ compared to ‘J in butions to the Jacobian. To isolate the network’s ability to +(VF)’ in Fig 3, as the former is able to use the additional suppress oscillations independently of the magnitude of the +degrees of freedom to reduce the effective asymmetry at input drive, we further constrain the relative scale of J in and +high rstr . J dyn by imposing + ∥J in ∥F ∥J in ∥F +G.3. Stability analysis with Fixed Asymmetry Ratio & rin = = , (124) + ∥J dyn ∥F γ + Constrained Inputs Projection + where ∥J dyn ∥F = γ following Eq. (101). Defining unscaled +A complete stability analysis of the non-conservative dy- input projections J˜in , we set +namics trainable with AsymEP is beyond the scope of this +work. Nevertheless, for the class of continuous Hopfield J˜in + J in = rin γ (125) +networks considered here, standard arguments from random ∥J˜in ∥F +matrix theory suggest that asymmetry inherently improves +asymptotic stability. G.3.1. L EARNING RULES +In the dynamics defined by Eq. (91), the linear leak term Reusing the notations of the previous section, we write +−xi shifts the spectrum of the system’s Jacobian by −1. Jilin = γcin J˜ilin with the normalization cin = rin /Fin , where + + 21 + Equilibrium Propagation for Non-Conservative Systems + +Fin = ∥J˜in ∥F . Applying the chain rule yields: + " # + ∂Fi ′ J˜kl + in X + ˜in + = γcin ρ (xi ) δik ul − 2 J um . (126) + ∂ J˜kl + in Fin m im + +And for γ we have: + + ∂Fi 1 + = (Fi + xi ). (127) + ∂γ γ + +G.3.2. S UPPLEMENTARY N UMERICAL R ESULTS + Figure 7. Comparison of AsymEP and VF on a feedforward net- +Table 5 reports a worst-case control experiment where the + work. Test accuracy on MNIST is shown as a function of training +structural asymmetry is fixed at rstr = 0.7 while the input epochs for a single-hidden-layer network with 20 neurons. Curves +scale ratio rin is varied. The experiment uses an all-to-all report the mean and standard deviation over 10 runs. Best accura- +architecture on MNIST (excluding direct input-to-output cies are 92.7% ± 0.5% (AsymEP) and 64.3% ± 2.0% (VF). +connections). The output variance during extended infer- +ence (steps 30-50) confirms that the system successfully +learns to suppress oscillations even when rin is severely re- G.5. Advantages of Non-Conservatives Dynamics +stricted. Any small residual oscillations can be mitigated by In Section 5.4, we compare three (non-)conservative dynam- +time-averaging over the inference steps. ics under varying constraints. To further evaluate learning +Finally, rin can be interpreted as a measure of the external speed, Table 6 reports network performance after a sin- +signal magnitude relative to the internal recurrent dynamics. gle epoch. These results confirm our earlier observation: +These results indicate that the system remains capable of AsymEP learns faster than VF. +learning and stabilizing even under a low external input +drive. Even when the input projection ∥J in ∥F is 100 times G.6. Feedforward CIFAR-10 Experiments +smaller than the recurrent connections ∥J dyn ∥F , the network This appendix details the architecture and hyperparameters +still achieves 36.34 ± 6.25% accuracy, which is well above of the deep feedforward experiments comparing backprop- +chance level (∼ 10%). agation (BP), VF, AsymEP and Dyadic EP on CIFAR-10 + (see subsection 5.5) +G.4. Feedforward Network +G.4.1. L EARNING RULES Architecture. We use a nine-layer convolutional network + (denoted CNN9). The first eight layers are convolutional +For clarity, we write the learning rules for VF and AsymEP with 3 × 3 kernels and zero-padding; spatial downsampling +in a feedforward architecture with one hidden layer using is performed by strided convolutions (stride 2 on layers 2, 4, +the notation of Section 5.3. For the input weights connecting 6, 8 and stride 1 otherwise), so no pooling is used. The chan- +to the hidden layer, we get the usual formula: nel widths follow the sequence 3 → 64 → 64 → 128 → + 128 → 256 → 256 → 512 → 512, reducing the spatial + 1 h +β −β 0 + i + resolution from 32 × 32 to 2 × 2. The last layer is a fully + in + ∆Jik ∝ (hi − hi )ρ′ (hi )uk , (128) + 2β connected readout mapping the 512 × 2 × 2 feature map + to the 10 class logits. All hidden units use a ReLU non- +while for the feedforward weights connecting the hidden to linearity. +the output layer, we get: p Weights are initialized with the Kaiming scheme + (σ = 2/fan-in) and biases at zero. + 1 h +β 0 + i + ∆(Wh→o )ji ∝ (oj − o−β ′ 0 + j )ρ (oj )ρ(hi ) . (129) Training setup. All methods are trained for 40 epochs + 2β with batch size 64 and repeated over 5 seeds. Inputs are + normalized per channel and augmented with random 32 × +Note that EP is not applicable in this case. + 32 crops (padding 4), random horizontal flips and Cutout + (one 8 × 8 patch). Parameters are updated with SGD with +G.4.2. S UPPLEMENTARY N UMERICAL R ESULTS + momentum 0.9, weight decay 5 × 10−4 and gradient-norm +In addition to the final accuracy reported in Sec. 5.3, we clipping at 1, under a cosine learning-rate schedule decaying +show in Fig. 7 the evolution of the accuracy over 20 epochs from 3.5 × 10−2 to 2 × 10−4 . Test accuracy is reported +for AsymEP and VF. on an exponential moving average of the weights (decay + + 22 + Equilibrium Propagation for Non-Conservative Systems + +Table 5. Output variance and final test accuracy on MNIST (%) across different values of rin with rstr = 0.7. (mean ± std over 10 runs) +(500 hiddens, all-to-all, no input-output). + + Output variance Test Acc. (%) + rin Untrained Epoch 80 Epoch 80 + 0.01 (3.38 ± 0.90) × 10−4 (5.22 ± 2.34) × 10−5 36.34 ± 6.25 + 0.10 (2.33 ± 0.48) × 10−4 (1.39 ± 0.17) × 10−4 90.54 ± 0.19 + 0.50 (1.34 ± 0.32) × 10−5 (1.06 ± 0.25) × 10−6 94.96 ± 0.10 + 1.00 (6.27 ± 1.24) × 10−7 (1.75 ± 0.50) × 10−8 96.30 ± 0.09 + + +Table 6. Test accuracy on Fashion-MNIST (%) at Epoch 1 (mean +± std 10 runs). The table compares three classes of network +dynamics: Continuous Hopfield (CH), Predictive Coding (PC), +and Standard dynamics. Each is evaluated under three connec- + ⊤ +tivity structures: Asymmetric (Asym, Bk ̸= Wk+1 ), Symmet- + ⊤ +ric/conservative (Sym, Bk = Wk+1 ), and Feedforward (Feedfor, +Bk = 0). + + EP AsymEP VF + Asym - 74.91 ± 0.45 48.98 ± 4.09 + CH Feedfor - 74.36 ± 0.29 48.84 ± 3.46 + Sym 74.57 ± 0.43 - - + Asym - 77.83 ± 0.47 57.75 ± 3.37 + PC + Sym 76.23 ± 0.39 - - + Asym - 76.87 ± 0.51 61.50 ± 4.06 + Standard + Feedfor - 77.92 ± 0.51 63.98 ± 0.73 + + + +0.9995). The targets are smoothed (ε = 0.1), which for +the EP methods amounts to nudging toward the smoothed +one-hot vector. + +Relaxation hyperparameters. The four methods differ +only in the gradient estimator: BP uses automatic differ- +entiation, while the EP-based methods contrast stationary +states of the corresponding relaxation dynamics. VF uses +an integration step η = 1.0, nudging β = 0.1, and at most +K = 1000 relaxation steps with an early-stopping threshold +of 9 × 10−6 on the mean state update. Dyadic EP uses +the same settings except for a nudging strength β = 0.1. +AsymEP uses a smaller step η = 0.5, nudging β = 0.1, +and up to K = 250 relaxation steps with a threshold of +1 × 10−4 . + + + + + 23 +
\ No newline at end of file |
