Initial commit: ept — backprop-free equilibrium transformer (EP)

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
author: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
committer: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
commit: b83947778e2c776f757a07d4719b7ce961d7ed55 (patch)
tree: b9cc01d7adda691d9156d9d04f4fb2f644674e96 /ep_run/scurria_nonconservative.txt
1 files changed, 1708 insertions, 0 deletions
diff --git a/ep_run/scurria_nonconservative.txt b/ep_run/scurria_nonconservative.txt
new file mode 100644
index 0000000..fa6d620
--- /dev/null
+++ b/ep_run/scurria_nonconservative.txt
@@ -0,0 +1,1708 @@
+Equilibrium Propagation for Non-Conservative Systems
+
+Antonino Emanuele Scurria^1, Dimitri Vanden Abeele^1, Bortolo Matteo Mognetti^2, Serge Massar^1
+
+^1 Laboratoire d’Information Quantique (LIQ) CP224, Université libre de Bruxelles (ULB),
+Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium.
+^2 Interdisciplinary Center for Nonlinear Phenomena and Complex Systems CP231, Université
+libre de Bruxelles (ULB), Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium.
+Correspondence to: Antonino Emanuele Scurria <antonino.scurria@ulb.be>.
+
+Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea.
+PMLR 306, 2026. Copyright 2026 by the author(s).
+
+arXiv:2602.03670v2 [cs.LG] 1 Jun 2026
+
+Abstract
+
+Equilibrium Propagation (EP) is a physics-inspired learning algorithm that uses stationary
+states of a dynamical system both for inference and learning. In its original formulation it is
+limited to conservative systems, i.e. to dynamics which derive from an energy function. Given
+their applications, it is important to extend EP to non-conservative systems, i.e. systems with
+non-reciprocal interactions. Previous attempts to generalize EP to such systems failed to compute
+the exact gradient of the cost function. Here we propose a framework that extends EP to arbitrary
+non-conservative systems, including feedforward networks. We keep the key property of
+equilibrium propagation, namely the use of stationary states both for inference and learning.
+However, we modify the dynamics in the learning phase by a term proportional to the
+non-reciprocal part of the interaction so as to obtain the exact gradient of the cost function.
+This algorithm can also be derived using a variational formulation that generates the learning
+dynamics through an energy function defined over an augmented state space. Numerical
+experiments show that this algorithm achieves better performance and learns faster than
+previous proposals.
+
+1. Introduction
+
+Standard neural network optimization relies on error backpropagation, an algorithm whose
+computational mechanism is difficult to reconcile with biological (Crick, 1989) and physical
+implementations (Indiveri & Liu, 2015). Specifically, backpropagation requires a backward pass
+distinct from inference, the transmission of nonlocal error signals, and synchronous layer-wise
+computations with explicit gradient storage. These constraints have no clear analog in physical
+systems, making backpropagation challenging to implement in neuromorphic or analog hardware.
+Consequently, understanding how credit assignment can instead emerge from intrinsic system
+dynamics, through local interactions and continuous relaxation, is a central question in
+neuroscience and machine learning.
+
+Equilibrium Propagation (EP) (Scellier & Bengio, 2017) represents one of the most promising
+advances in this direction. It formulates supervised learning as a contrast between two
+stationary states of a dynamical system: a ‘free’ phase where the system evolves autonomously,
+and a ‘nudged’ phase where outputs are weakly pushed toward their targets. The local change in
+neural states between these phases recovers the exact gradient of the cost function with respect
+to parameters. This enables spatially local learning exploiting the continuous relaxation of the
+system without a distinct backward circuit or explicit weight transport.
+
+Since its introduction, several works have sought to improve the practicality and biological
+realism of EP. Algorithmic adaptations include enforcing temporal locality to avoid state storage
+(Ernoult et al., 2020; Falk et al., 2025), deriving agnostic updates for black-box energies
+(Scellier et al., 2022), and substituting nudging with clamping (Stern et al., 2021).
+Theoretically, the framework has been extended to stochastic systems (Scellier & Bengio, 2017;
+Massar & Mognetti, 2025) and Lagrangian dynamics for time-varying inputs (Massar, 2025; Pourcel
+et al., 2025; Berneman & Hexner, 2025). In parallel, simulations have explored suitable
+substrates, ranging from spiking (Martin et al., 2021; O’Connor et al., 2019) and resistive
+networks (Kendall et al., 2020) to coupled oscillators (Wang et al., 2024; Rageau & Grollier,
+2025), as well as quantum systems (Wanjura & Marquardt, 2025; Massar & Mognetti, 2025; Scellier,
+2024). Experimental realizations have been demonstrated in memristor crossbars (Yi et al., 2023),
+self-adjusting electrical circuits (Dillavou et al., 2022; 2024), elastic networks (Altman et
+al., 2024), and classical Ising models trained on quantum annealers (Laydevant et al., 2024).
+
+Despite these recent developments and the theoretical elegance of EP, its standard formulation
+remains restricted to conservative systems. In these systems, dynamics are derived from an energy
+function, which inherently enforces symmetry (e.g., symmetric synaptic connections J_ij = J_ji)
+through the action-reaction principle. This constraint precludes the use of EP in a broad class
+of models characterized by non-conservative forces. This includes the feedforward architectures
+dominant in modern AI, biological circuits, as well as physical systems that reach stationary
+states far from thermodynamic equilibrium, such as nonlinear optical systems driven by external
+lasers (Cin et al., 2025), opto-electronic systems (Kalinin et al., 2025), exciton-polariton
+condensates (Sajnok & Matuszewski, 2025), active metamaterials (Brandenbourger et al., 2019) and
+active colloids (Bishop et al., 2023; Osat & Golestanian, 2023) (see (Bowick et al., 2022) for a
+review).
+
+Formally, we consider a dynamical system governed by a non-reciprocal force field F(x, θ, u),
+which relaxes to a stationary configuration x^0 satisfying:
+
+    F(x^0, \theta, u) = 0,    (1)
+
+where x represents the state variables, θ the learnable parameters and u the static input. Our
+goal, given a target y(u), is to compute the gradient of the cost function C(x^0, y) at this
+equilibrium,
+
+    \frac{dC}{d\theta}(x^0, y),    (2)
+
+and update θ to minimize the cost.
+
+Previous attempts to extend EP to non-conservative dynamics include the Vector Field (VF)
+algorithm (Scellier et al., 2018). However, as noted by the authors, this method provides an
+unbiased gradient of the cost Eq. (2) only in the conservative case. To mitigate this,
+(Laborieux & Zenke, 2024) proposed adding a penalty to keep the Jacobian close to symmetry,
+essentially forcing the system to be as conservative as possible. Alternative methods related to
+VF, which similarly do not compute the exact gradient, were proposed in (Farinha et al., 2020;
+Costa & Santos, 2025) and for specific systems in simulation (Cin et al., 2025; Sajnok &
+Matuszewski, 2025).
+
+Conversely, generalizations of backpropagation can handle non-reciprocal forces and compute the
+exact gradient of the cost Eq. (2) but inherit the same challenges in physical implementations.
+For instance, Backpropagation Through Time (Werbos, 1990) unfolds the network in time to apply
+standard backpropagation, Recurrent Backpropagation (Almeida, 1990; Pineda, 1987) avoids this
+memory requirement but still requires a specific circuit to propagate errors, and the continuous
+Adjoint Method (Chen et al., 2018) additionally requires integrating the dynamics backward in
+time, which is not physically possible for a dissipative system.
+
+In this paper, we first propose Asymmetric EP (AsymEP), a framework where the original dynamics
+serve for inference, while a new augmented dynamic is used to compute gradients of the cost
+Eq. (2). In this augmented phase, the output neurons are nudged towards their targets (as in
+standard EP), while a local corrective term, proportional to the antisymmetric part of the
+Jacobian at the free equilibrium J_F(x^0, θ, u) = ∂F/∂x (x^0, θ, u), is added to the forces. The
+exact gradients of the cost with respect to parameters are then obtained by contrasting
+stationary states of the augmented system.
+
+Second, we introduce Dyadic EP, a ‘variational’ approach to learning in non-conservative systems.
+This method involves doubling the number of variables in the system’s state space and
+subsequently introducing a new energy function in this extended space. This approach takes
+advantage of the extended space to execute the positive and negative nudging phases in parallel,
+recovering the same computational cost as AsymEP. Derived from first principles, this approach is
+inspired by established methods for mapping dissipative dynamical systems onto conservative ones
+by doubling the degrees of freedom (Bateman, 1931; Galley, 2013; Aykroyd et al., 2025). A more
+comprehensive study of the theoretical framework and its application to feedforward networks can
+be found in (Scurria, 2026). Our method is related to the Dual Propagation algorithm (Høier et
+al., 2023; Høier & Zach, 2023; 2024) and constitutes an independent, first-principles
+generalization of Dyadic Learning (Nest & Høier; Høier et al., 2024), previously limited to
+Hopfield networks, to arbitrary force fields.
+
+Third, we validate our framework on MNIST (LeCun, 1998), Fashion-MNIST, and CIFAR-10. In
+continuous Hopfield networks initialized with symmetric connection matrices, AsymEP achieves
+better accuracy and learns faster than EP and VF. Additionally, when we constrain the network to
+have a strong degree of structural asymmetry, in which case EP is inapplicable, AsymEP
+outperforms VF. Finally, when we restrict connections to a feedforward structure, our algorithm
+effectively trains all parameters; in contrast, VF is limited to training the last layer, acting
+essentially as an Extreme Learning Machine (Huang et al., 2006; Wang et al., 2022) with poor
+performance.
+
+In summary, this theoretical work proposes two generalizations of EP beyond conservative systems
+to arbitrary differentiable dynamics that compute in their stationary states.
+
+2. Equilibrium Propagation Overview
+
+2.1. Conservative Systems
+
+We first review standard Equilibrium Propagation (EP) (Scellier & Bengio, 2017). We consider a
+network described by an energy function E(x, θ, u), such that the force field is
+derived from the potential E:
+
+    F_E(x, \theta, u) = -\frac{\partial}{\partial x} E(x, \theta, u).    (3)
+
+The objective is to compute the total gradient dC/dθ (x^0, y) of a (quadratic) cost function
+C(x, y) evaluated at the minimum energy configuration of the system. This free equilibrium,
+denoted x^0 (which depends implicitly on θ and u), satisfies the stationarity condition:
+
+    -\frac{\partial}{\partial x} E(x^0, \theta, u) = 0.    (4)
+
+To compute gradients, we introduce the augmented energy functional:
+
+    E_T(x, \theta, \beta, u, y) = E(x, \theta, u) + \beta C(x, y),    (5)
+
+where β is a scalar nudging parameter. The stationary configuration of this augmented system is
+obtained by integrating the dynamics
+
+    \frac{dx}{dt} = -\frac{\partial E_T(x, \theta, \beta, u)}{\partial x},    (6)
+
+until the energy minimum is reached. This new fixed point x^β, called nudged equilibrium,
+satisfies:
+
+    \frac{\partial E(x^{\beta}, \theta, u)}{\partial x} + \beta \frac{\partial C(x^{\beta}, y)}{\partial x} = 0.    (7)
+
+The training procedure, as improved in (Laborieux et al., 2021), uses two nudged phases with
+factors ±β (with β ≠ 0). Starting from x^0, the system relaxes to two nearby perturbed
+equilibria, x^{+β} and x^{−β}. The displacement x^{+β} − x^{−β} is then used to compute the
+parameter update in the learning rule:
+
+    \Delta\theta = -\epsilon \frac{1}{2\beta} \left[ \frac{\partial E(x^{\beta}, \theta, u)}{\partial \theta} - \frac{\partial E(x^{-\beta}, \theta, u)}{\partial \theta} \right],    (8)
+
+where ϵ > 0 is the learning rate. The theoretical foundation of EP is the result that, in the
+limit β → 0 of Eq. (8), we get:
+
+    \frac{dC(x^0, y)}{d\theta} = \frac{d}{d\beta} \frac{\partial E(x^{\beta}, \theta, u)}{\partial \theta},    (9)
+
+see Appendix D.1. The error of the above method is O(β^2). This error can be further reduced
+using holomorphic equilibrium propagation (Laborieux & Zenke, 2022).
+
+Thus, EP recovers the exact gradient of the cost function using only local computations. In this
+manner, learning implements gradient descent without an explicit backward pass, and credit
+assignment is realized through the system’s intrinsic relaxation dynamics.
+
+Three remarks can be made at this point. First, EP does not require the system to be at an energy
+minimum, but only at a stationary point, i.e., that Eq. (7) holds. Second, EP implicitly assumes
+that the Jacobian J_E(x^0, u) = ∂F_E/∂x (x^0, u) is invertible. In this work, we assume this
+condition holds and will not state it explicitly hereafter. Third, for simplicity, we omit the
+dependency on the input u and target y in the following equations.
+
+2.2. Vector Field
+
+The Vector Field (VF) algorithm, introduced in (Scellier et al., 2018), is an early attempt to
+adapt EP to non-reciprocal forces. This method relies on the observation that, for conservative
+systems, linearizing the right-hand side of Eq. (9) around the equilibrium point x^0 yields
+
+    \lim_{\beta \to 0} \frac{1}{2\beta} \left[ \frac{\partial E(x^{\beta}, \theta)}{\partial \theta} - \frac{\partial E(x^{-\beta}, \theta)}{\partial \theta} \right]
+        = \lim_{\beta \to 0} -\left[ \frac{\partial F_E}{\partial \theta}(x^0, \theta) \right]^{\top} \frac{x^{\beta} - x^{-\beta}}{2\beta},    (10)
+
+where F_E = −∂_x E(x, θ) is the conservative force. It is therefore tempting to use the
+right-hand side of Eq. (10) for parameter updates of non-conservative systems, for which no
+energy function E exists.
+
+The VF algorithm adopts precisely this approach. It uses the nudged counterpart of Eq. (7),
+
+    F(x^{\beta}, \theta) - \beta \frac{\partial C}{\partial x}(x^{\beta}) = 0,    (11)
+
+in conjunction with the learning rule Eq. (10):
+
+    \Delta\theta = \epsilon \left[ \frac{\partial F}{\partial \theta}(x^0, \theta) \right]^{\top} \frac{x^{\beta} - x^{-\beta}}{2\beta}.    (12)
+
+However, as noted in (Scellier et al., 2018), Eq. (12) does not align with the true gradient
+dC/dθ (x^0) and is exact only if the force is conservative. To see this, let J_F(x, θ) denote the
+Jacobian of the vector field F(x, θ) (in components (J_F(x, θ))_ij = ∂F_i(x, θ)/∂x_j).
+Differentiating the equilibrium condition F(x^0, θ) = 0 with respect to θ gives
+
+    J_F(x^0, \theta) \frac{dx^0}{d\theta} + \frac{\partial F}{\partial \theta}(x^0, \theta) = 0.    (13)
+
+Consequently, the exact gradient of the cost is
+
+    \frac{dC}{d\theta}(x^0) = \left( \frac{dx^0}{d\theta} \right)^{\top} \frac{\partial C}{\partial x}(x^0)
+        = -\underbrace{\left[ \frac{\partial F}{\partial \theta}(x^0, \theta) \right]^{\top}}_{\text{pre-synaptic}} \underbrace{\left[ J_F^{\top}(x^0, \theta) \right]^{-1} \frac{\partial C}{\partial x}(x^0)}_{\text{post-synaptic}}.    (14)
+
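The gap between the VF rule of Eq. (12) and the exact gradient of Eq. (14) can be checked numerically on a toy problem. The sketch below is not taken from the paper: it assumes a linear force field F(x, θ) = A x + θ with a bias-like parameter vector θ (so that ∂F/∂θ is the identity and J_F = A) and the quadratic cost C(x) = ½‖x − y‖². The matrices, vectors and function names are hypothetical choices made only for illustration.

```python
import numpy as np

def free_equilibrium(A, theta):
    """Solve F(x, theta) = A @ x + theta = 0 for x (Eq. 1)."""
    return np.linalg.solve(A, -theta)

def nudged_equilibrium(A, theta, y, beta):
    """Solve A @ x + theta - beta * (x - y) = 0 (Eq. 11 with quadratic cost)."""
    n = theta.shape[0]
    return np.linalg.solve(A - beta * np.eye(n), -(theta + beta * y))

def cost(A, theta, y):
    """C(x0, y) = 0.5 * ||x0 - y||^2 evaluated at the free equilibrium."""
    x0 = free_equilibrium(A, theta)
    return 0.5 * np.sum((x0 - y) ** 2)

def exact_gradient(A, theta, y):
    """Eq. (14) with dF/dtheta = I: dC/dtheta = -(J_F^T)^{-1} dC/dx."""
    x0 = free_equilibrium(A, theta)
    return -np.linalg.solve(A.T, x0 - y)

def vf_gradient_estimate(A, theta, y, beta=1e-4):
    """Gradient implied by Eq. (12): the update is eps * (x+ - x-) / (2 beta),
    so the quantity being descended is its negative."""
    x_plus = nudged_equilibrium(A, theta, y, beta)
    x_minus = nudged_equilibrium(A, theta, y, -beta)
    return -(x_plus - x_minus) / (2.0 * beta)

def finite_difference_gradient(A, theta, y, h=1e-6):
    """Central finite differences of the cost, used as an independent check."""
    grad = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (cost(A, theta + step, y) - cost(A, theta - step, y)) / (2.0 * h)
    return grad

# Hypothetical test matrices: S is symmetric negative definite (conservative
# case), K is antisymmetric, and S + K is a non-reciprocal Jacobian.
S = np.array([[-2.0, 0.5, 0.0], [0.5, -2.0, 0.3], [0.0, 0.3, -2.0]])
K = np.array([[0.0, 1.0, 0.0], [-1.0, 0.0, 1.0], [0.0, -1.0, 0.0]])
A = S + K
theta = np.array([1.0, -0.5, 0.2])
y = np.array([0.3, 0.1, -0.4])

err_sym = np.linalg.norm(vf_gradient_estimate(S, theta, y) - exact_gradient(S, theta, y))
err_nonsym = np.linalg.norm(vf_gradient_estimate(A, theta, y) - exact_gradient(A, theta, y))
err_fd = np.linalg.norm(finite_difference_gradient(A, theta, y) - exact_gradient(A, theta, y))

print("VF error, symmetric Jacobian:     ", err_sym)
print("VF error, non-symmetric Jacobian: ", err_nonsym)
print("Eq. (14) vs finite differences:   ", err_fd)
```

Under these assumptions the VF estimate should agree with Eq. (14) up to O(β²) when the Jacobian is symmetric and deviate visibly once the antisymmetric part K is added, while the finite-difference check confirms Eq. (14) in the non-symmetric case.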
+
+The terms ‘pre-synaptic’ and ‘post-synaptic’ in Eq. (14) are used by analogy with neuronal
+transmission: the pre-synaptic factor captures the local influence of θ on the force F, while the
+post-synaptic factor is the sensitivity of the cost to state perturbations.
+
+If instead we differentiate the nudged equilibrium condition in Eq. (11) with respect to β and
+evaluate at β = 0, we obtain
+
+    J_F(x^0, \theta) \left. \frac{dx^{\beta}}{d\beta} \right|_{\beta=0} - \frac{\partial C}{\partial x}(x^0) = 0,    (15)
+
+which gives
+
+    \left. \frac{dx^{\beta}}{d\beta} \right|_{\beta=0} = J_F(x^0, \theta)^{-1} \frac{\partial C}{\partial x}(x^0, y).    (16)
+
+The right-hand side of Eq. (16) represents the effective post-synaptic term used by the VF
+algorithm (Eq. 12). Comparing this with the exact post-synaptic term derived in Eq. (14), we see
+that they coincide only if J_F = J_F^⊤, i.e., only if the system is conservative.
+
+Now, let S_J(x^0, θ) and A_J(x^0, θ) denote the symmetric and antisymmetric parts of the Jacobian
+at the free (un-nudged) equilibrium, respectively. Then, we show in Appendix A that the gradient
+error increases with the spectral radius of S_J(x^0, θ)^{-1} A_J(x^0, θ). Consequently, large
+antisymmetric contributions degrade the gradient estimation, confirming empirical observations in
+the Appendix of (Ernoult et al., 2020). In fact, in the pathological limit where the Jacobian
+would be purely antisymmetric, S_J(x^0, θ) = 0, the update of VF gives the negative of the true
+gradient, maximizing the cost rather than minimizing it.
+
+3. Asymmetric EP
+
+Algorithm 1 Asymmetric EP (AsymEP)
+ 1: Inputs: Force field F(x, θ), cost function C(x), nudging parameter β, learning rate ϵ.
+ 2: repeat
+ 3:   1. Free Phase: Evolve to stationary state
+ 4:      Evolve the system dynamics
+ 5:          \frac{dx}{dt} = F(x, \theta),    (17)
+ 6:      until convergence to the stationary state x^0.
+ 7:   2. Jacobian Decomposition
+ 8:      Compute the Jacobian at equilibrium:
+ 9:          J_F(x^0, \theta) = \frac{\partial F}{\partial x}(x^0, \theta),    (18)
+10:      and extract its antisymmetric part:
+11:          A_J(x^0, \theta) = \frac{1}{2} \left( J_F(x^0, \theta) - J_F(x^0, \theta)^{\top} \right).    (19)
+12:   3. Nudged Phase: Augmented Dynamics
+13:      Integrate the dynamics twice starting from x^0
+14:          \frac{dx}{dt} = F(x, \theta) - \beta \frac{\partial C}{\partial x}(x) - 2 A_J(x^0, \theta) (x - x^0),    (20)
+15:      until convergence to two new stationary states x_A^{±β}.
+16:   4. Parameter Update
+17:      Update the parameters according to:
+18:          \Delta\theta = \epsilon \left[ \frac{\partial F}{\partial \theta}(x^0, \theta) \right]^{\top} \frac{x_A^{\beta} - x_A^{-\beta}}{2\beta}.    (21)
+19: until convergence of θ
+20: Output: Optimized parameters θ.
+
+Here, we introduce Asymmetric EP (AsymEP), see Algorithm 1, which removes the gradient estimate
+error inherent to VF by adding a local correction term to the augmented inference dynamics. The
+new nudged equilibrium x_A^β satisfies:
+
+    F(x_A^{\beta}, \theta) - \beta \frac{\partial C}{\partial x}(x_A^{\beta}) - 2 A_J(x^0, \theta) (x_A^{\beta} - x^0) = 0.    (22)
+
+As in VF, we then obtain two perturbed states x_A^{±β} for opposite nudging ±β and apply the
+contrastive learning rule of Eq. (12).
+
+We now show that AsymEP gives rise to the correct learning rule, i.e. that the right-hand side
+of Eq. (21) is proportional to the gradient of the cost function dC/dθ (x^0) at the equilibrium
+point x^0 (Eq. 14). To this end, note that the same reasoning leading to Eq. (16) leads to
+
+    \left. \frac{dx_A^{\beta}}{d\beta} \right|_{\beta=0} = J_{F_A}(x^0, \theta)^{-1} \frac{\partial C}{\partial x}(x^0),    (23)
+
+where J_{F_A}(x, θ) is the Jacobian of the modified dynamical system Eq. (20). At the equilibrium
+point x^0, J_{F_A} is equal to the transpose of the original Jacobian:
+
+    J_{F_A}(x^0, \theta) = J_F(x^0, \theta) - 2 A_J(x^0, \theta)
+                         = S_J(x^0, \theta) - A_J(x^0, \theta)
+                         = J_F^{\top}(x^0, \theta),    (24)
+
+where we have used the decomposition Eq. (44) of the original Jacobian J into its symmetric and
+antisymmetric components. Therefore, the left-hand side of Eq. (23) is equal to the true
+post-synaptic term
+
+    \left. \frac{dx_A^{\beta}}{d\beta} \right|_{\beta=0} = \left[ J_F^{\top}(x^0, \theta) \right]^{-1} \frac{\partial C}{\partial x}(x^0),    (25)
+
+
+which, using Eq. (14), proves the result. Additionally, al-            until a stationary point (z^β, z′^β) is reached. Upon conver-
+though implied by the equality with the true gradient, we              gence, we follow the standard EP paradigm in using the
+explicitly demonstrate the equivalence of the gradient esti-           difference z^β − z′^β to compute the post-synaptic term. Un-
+mates obtained by AsymEP and Backpropagation Through                   der the change of variables m = (z + z′)/2 and d = z − z′, we
+Time in Appendix B following (Ernoult et al., 2019).                   prove in Appendix D that m follows the original dynamics
+                                                                       F (ensuring valid inference), while d relaxes to a "physical"
+Note that the corrective term −2AJ (x0 , θ)(x − x0 ) in
+                                                                       error signal proportional to the cost gradient.
+Eq. (20) is spatially local: AJ vanishes for unconnected
+neurons, and (x − x0 ) is available at the synapse given the           It is important to notice that while Dyadic EP introduces a
+memory mechanism already required by Eq. (12). This                    distinct formulation, it remains consistent with the general
+correction can create backward connections (Section 5.3).              theoretical setting of EP and matches the computational
+However, in physical realizations, both feedforward and                cost of AsymEP. Note also that we start the evolution of
+feedback connections must be physically present, though                the free phase (β = 0) with identical initial conditions
+feedback may be deactivated during inference.                          for z and z′ (i.e., d = 0). This guarantees that integrat-
+                                                                       ing Eq. (32) leads to a symmetric stationary point where
+4. Dyadic EP                                                           z^0 = z′^0. Finally, we underline that the modified varia-
+                                                                       tional update rule in Eq. (34) is equivalent to the standard
+We now introduce Dyadic EP (Algorithm 2), a variational                symmetric EP update rule in Eq. (8) (see Appendix D).
+algorithm that computes the exact cost gradient in the limit
+                                                                       Now, to make this concrete, consider a continuous Hopfield
+of infinitesimal nudging. It maps the original n-variable
+                                                                       network (see also Eq. (35)) with an asymmetric connection
+dynamics F (x, θ) onto a 2n-variable system (z, z ′ ) defined
+                                                                       matrix J. After some calculations (see Appendix F), the
+by an energy H(z, z ′ , θ) and cost D(z, z ′ ). We show in
+                                                                       augmented energy of the system can be re-expressed as:
+Appendix E that AsymEP can be seen as the first-order
+projection of Dyadic EP onto the original n-dimensional                H_T = -\frac{1}{2}\rho(z)^\top S\rho(z) + \frac{1}{2}\rho(z')^\top S\rho(z')
+state space.                                                                 - \rho(z)^\top A\rho(z') + \frac{1}{2}\left(\|z\|^2 - \|z'\|^2\right)
+The new system is defined by the energy H and cost function                  + \frac{\beta}{2}\left(C(z,y) + C(z',y)\right),   (29)
+D, given in terms of F and C by:
+H(z,z',\theta) = -(z-z')^\top F\left(\frac{z+z'}{2},\theta\right),     where S and A are the symmetric and antisymmetric parts
+D(z,z') = C\left(\frac{z+z'}{2}\right),   (26)                         of J, respectively, and ρ is an element-wise non-linearity.
+where z, z′ ∈ R^n. In order to learn, we introduce the aug-            An interesting analogy can be drawn with standard learning
+mented energy                                                          rules in discrete Hopfield networks (Hopfield, 1982). For
+                                                                       a sequence of binary memories {ξ^1, ..., ξ^m} where ξ^µ ∈
+H_T(z,z',\theta,\beta) = H(z,z',\theta) + \beta D(z,z').   (27)        {−1, 1}^n, S corresponds to the standard autoassociative
+                                                                       Hebbian rule Σ_µ ξ^µ (ξ^µ)^⊤, creating stable attractors at each
+                                                                       pattern. In contrast, A corresponds to the heteroassociative
+                                                                       rule (e.g., a cycle between ξ^µ and ξ^ν given by ξ^ν (ξ^µ)^⊤ −
+The equilibrium configuration corresponds to a saddle point            ξ^µ (ξ^ν)^⊤), encoding transitions between patterns.
+of HT , where z minimizes and z ′ maximizes the energy.
+                                                                       For this specific energy, the update rule given by Eq. (34)
+This poses no issue for EP, which requires only that the
+                                                                       can be re-expressed as:
+joint state (z, z ′ ) reaches a stationary state. Although this
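+min-maximization can be interpreted as z evolving forward              \Delta J \propto -\frac{1}{2\beta}\left(\rho(z'^{\beta}) - \rho(z^{\beta})\right)
+and z′ backward in time, in practice they evolve forward                   \left(\rho(z'^{\beta}) + \rho(z^{\beta})\right)^\top.   (30)
+simultaneously, as we integrate the coupled equations:                 In the limit β → 0, this gives:
+
+\frac{dz}{dt} = -\frac{\partial H_T}{\partial z}                       \Delta J \propto \left(\frac{d^\beta}{\beta}\right) \odot \rho'(m)\,\rho(m)^\top,   (31)
+  = F\left(\frac{z+z'}{2},\theta\right)
+  + \left(\frac{z-z'}{2}\right)^\top                                   matching the learning rule in (Pineda, 1987), with
+    \frac{\partial F}{\partial z}\Big|_{\frac{z+z'}{2}}                lim_{β→0} d^β/β being the error signal.
+  - \frac{\beta}{2}\,\frac{\partial C}{\partial z}
+    \left(\frac{z+z'}{2}\right),
+                                                                       5. Numerical Experiments
+\frac{dz'}{dt} = +\frac{\partial H_T}{\partial z'}
+  = F\left(\frac{z+z'}{2},\theta\right)                                In this section, we numerically validate AsymEP (Algo-
+  - \left(\frac{z-z'}{2}\right)^\top                                   rithm 1). The neuronal dynamics follows the one introduced
+    \frac{\partial F}{\partial z'}\Big|_{\frac{z+z'}{2}}
+  + \frac{\beta}{2}\,\frac{\partial C}{\partial z'}
+    \left(\frac{z+z'}{2}\right),   (28)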
+
+
+Algorithm 2 Dyadic EP                                                 where ∥·∥_F denotes the Frobenius norm. Note that this
+ 1: Inputs: Force field F(x, θ), cost function C(x, y),               metric does not capture the asymmetry of the Jacobian,
+      nudging parameter β, learning rate ϵ                            which depends on the state x.
+ 2: repeat                                                            For numerical experiments, we restricted the network to a
+ 3:   1. Free Phase: Evolve to stationary state                       layered architecture with a single hidden layer to facilitate
+ 4:      Evolve the system dynamics, starting from identi-            comparison with prior work. Accordingly, J^in contains
+         cal initial conditions z(0) = z′(0) = z_0,                   only input-to-hidden connections, while J^dyn is block off-
+ 5:      \frac{dz}{dt} = -\frac{\partial H}{\partial z},              diagonal, encoding bidirectional interactions between the
+         \frac{dz'}{dt} = +\frac{\partial H}{\partial z'},   (32)     hidden and output layers. Both J^in and J^dyn are trained.
+ 6:      until stationary states z^0, z′^0 are reached.               We first use MNIST (LeCun, 1998) (60k train, 10k test)
+ 7:   2. Nudged Equilibrium                                           followed by Fashion-MNIST to validate AsymEP, and then
+ 8:      Evolve the system dynamics, starting from the                we further validate AsymEP and Dyadic EP by comparing
+         solution of the free phase z^0 = z′^0:                       them to Backpropagation on a convolutional feedforward
+ 9:      \frac{dz}{dt} = -\frac{\partial H_T}{\partial z},            network with CIFAR-10. Inputs are normalized using min-max
+         \frac{dz'}{dt} = +\frac{\partial H_T}{\partial z'},   (33)   to [−1, 1] and targets are one-hot encoded in {−1, 1}. All
+10:      until two nudged stationary states z^β, z′^β are             hyperparameters are detailed in Appendix G, along with
+         reached.                                                     additional details and numerical results.
+11:   3. Parameter Update
+12:      Update the parameters according to:                          5.1. Symmetric Initialization
+13:      \Delta\theta = -\epsilon\,\frac{1}{\beta}
+         \frac{\partial H(z^\beta,z'^\beta,\theta)}{\partial\theta}   (34)
+14: until convergence of θ                                            We start by comparing AsymEP with standard EP and
+15: Output: Optimized parameters θ.                                   VF. All algorithms are initialized with an identical sym-
+                                                                      metric matrix J^dyn. EP maintains this symmetry through-
+                                                                      out training, while VF and AsymEP induce asymmetry in
+                                                                      J^dyn. Since EP and VF already achieve strong performance
+                                                                      on MNIST, the purpose of this experiment is to validate
+                                                                      AsymEP and compare it against EP and VF rather than
+                                                                      outperform the state of the art.
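The two phases of Algorithm 2 can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy force field F(x, b) = tanh(Wx + b) − x, the quadratic cost C(x) = ½‖x − y‖², the explicit Euler integrator, and all names (`dyadic_relax`, `jac`, `update_direction`) are hypothetical choices made here. The sketch integrates the coupled equations of Eqs. (32)-(33), forms the update direction of Eq. (34) for θ = b, and compares it with the exact gradient obtained by implicit differentiation of the equilibrium condition.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
W = 0.3 * rng.normal(size=(n, n)) / np.sqrt(n)  # non-symmetric couplings (toy choice)
b = rng.normal(size=n)                           # trainable parameter theta (assumption)
y = rng.normal(size=n)                           # target of the quadratic cost

def F(x):
    """Toy non-conservative force field F(x, b) = tanh(W x + b) - x."""
    return np.tanh(W @ x + b) - x

def jac(x):
    """Jacobian dF/dx evaluated at x."""
    s = 1.0 - np.tanh(W @ x + b) ** 2
    return s[:, None] * W - np.eye(n)

def dC_dx(x):
    """Gradient of C(x) = 0.5 * ||x - y||^2."""
    return x - y

def dyadic_relax(z, zp, beta, dt=0.1, steps=3000):
    """Euler integration of dz/dt = -dH_T/dz and dz'/dt = +dH_T/dz'."""
    for _ in range(steps):
        m, d = 0.5 * (z + zp), z - zp
        common = F(m)
        coupling = 0.5 * jac(m).T @ d - 0.5 * beta * dC_dx(m)
        z, zp = z + dt * (common + coupling), zp + dt * (common - coupling)
    return z, zp

beta = 0.01
z0, zp0 = dyadic_relax(np.zeros(n), np.zeros(n), beta=0.0)  # free phase, d stays 0
zb, zpb = dyadic_relax(z0, zp0, beta=beta)                   # nudged phase

m, d = 0.5 * (zb + zpb), zb - zpb
s = 1.0 - np.tanh(W @ m + b) ** 2        # dF/db is diag(s)
dH_db = -s * d                           # from H = -(z - z')^T F(m, b)
update_direction = -dH_db / beta         # Eq. (34) without the learning rate

# Exact gradient dC/db at the free equilibrium via implicit differentiation.
s0 = 1.0 - np.tanh(W @ z0 + b) ** 2
grad_true = -s0 * np.linalg.solve(jac(z0).T, dC_dx(z0))

print("max |update + dC/db| =", np.max(np.abs(update_direction + grad_true)))
```

In this toy setting the mean m stays at the free equilibrium during the nudged phase while the difference d relaxes to β J_F^{-⊤} ∂C/∂x, so the update direction should agree with the negative gradient up to integration error.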
+in (Scellier & Bengio, 2017), and is generalized to allow             Figure 1 compares the three algorithms as a function of
+for non-reciprocal forces as in (Scellier et al., 2018). For          hidden-layer dimension after 1 and 20 epochs. AsymEP
+clarity, we express the forces in a form that explicitly sepa-        consistently outperforms the baselines, suggesting that it
+rates the contributions of the external input and the recurrent       learns faster and reaches higher accuracy.
+interactions:
+                                                                      Figure 2 studies the evolution of the asymmetry ratio r_str.
+F(x) = \rho'(x) \odot (J^{in} u + J^{dyn}\rho(x)) - x,   (35)         The results are reported for 50 hidden neurons. As expected,
+                                                                      EP preserves the initial weight symmetry. In contrast, VF
+where u ∈ R^{N_in} denotes the input and x ∈ R^{N_dyn} the neu-       and AsymEP induce non-trivial evolution of r_str following
+ronal state, comprising both hidden and output units. The             two distinct patterns, resulting in three distinct network
+matrices J^in ∈ R^{N_dyn×N_in} and J^dyn ∈ R^{N_dyn×N_dyn} define     configurations. A complementary figure is available in Ap-
+the input and recurrent connectivity, respectively. The acti-         pendix G.1.
+vation function ρ(·) is taken to be the hyperbolic tangent,
+applied element-wise.                                                 5.2. Fixed Asymmetry Ratio
+If J^dyn is symmetric, we can define the energy:
+                                                                      While the previous section focused on networks compatible
+E(x) = \frac{1}{2}\|x\|^2 - \frac{1}{2}\rho(x)^\top J^{dyn}\rho(x)    with all three algorithms (EP, VF, AsymEP), we now turn
+       - \rho(x)^\top J^{in} u,   (36)                                to architectures with strong structural asymmetry. In this
+                                                                      regime, EP is inapplicable by construction, and, as we show,
+which is identical to that of (Scellier & Bengio, 2017), pro-         VF performs poorly, whereas AsymEP remains
+vided that the input neurons are activated as ρ(u).                   effective.
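The relation between the force of Eq. (35) and the energy of Eq. (36) can be checked numerically. The sketch below is a minimal illustration, assuming NumPy, ρ = tanh, and arbitrary small random matrices; the helper names (`force`, `energy`, `numerical_grad`, `mismatch`) are hypothetical and not taken from the paper. It compares F(x) with −∂E/∂x obtained by central finite differences, once for a symmetric J^dyn and once for a generic non-symmetric one.

```python
import numpy as np

def force(x, u, J_in, J_dyn):
    """Eq. (35) with rho = tanh: F(x) = rho'(x) * (J_in u + J_dyn rho(x)) - x."""
    rho = np.tanh(x)
    return (1.0 - rho**2) * (J_in @ u + J_dyn @ rho) - x

def energy(x, u, J_in, J_dyn):
    """Eq. (36); it generates the force only when J_dyn is symmetric."""
    rho = np.tanh(x)
    return 0.5 * x @ x - 0.5 * rho @ J_dyn @ rho - rho @ (J_in @ u)

def numerical_grad(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
n_dyn, n_in = 6, 4
u = rng.normal(size=n_in)
x = rng.normal(size=n_dyn)
J_in = rng.normal(size=(n_dyn, n_in))
J_raw = rng.normal(size=(n_dyn, n_dyn))   # generic, non-symmetric
J_sym = 0.5 * (J_raw + J_raw.T)           # its symmetric part

def mismatch(J_dyn):
    """Largest entry of |F(x) + dE/dx|; close to zero when F derives from E."""
    grad_E = numerical_grad(lambda v: energy(v, u, J_in, J_dyn), x)
    return np.max(np.abs(force(x, u, J_in, J_dyn) + grad_E))

print("symmetric J_dyn:    ", mismatch(J_sym))
print("non-symmetric J_dyn:", mismatch(J_raw))
```

For the non-symmetric matrix the residual equals ρ′(x) ⊙ (Aρ(x)), with A the antisymmetric part of J^dyn, which is the non-conservative contribution that no energy of the form of Eq. (36) can produce.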
+Equation (35) naturally motivates a quantitative measure of           To this end, we consider a class of networks where the
+structural asymmetry r_str, defined as:                               asymmetry ratio r_str defined in Eq. (37) is kept fixed. Let S̃
+r_{str} = \frac{\|(J^{dyn} - (J^{dyn})^\top)/2\|_F}                   and Ã be arbitrary symmetric and antisymmetric matrices
+               {\|J^{dyn}\|_F},   (37)                                in R^{N_dyn×N_dyn}, respectively. We enforce a fixed r_str via the
+
+
+                                                                            where γ ∈ R is a learnable global scale.
+                                                                            Using VF and AsymEP, we train a layered network with one
+                                                                            hidden layer of 50 neurons (in which case S̃ and Ã are block
+                                                                            off-diagonal) for different values of r_str to investigate the
+                                                                            impact of structural asymmetry. We compare two training
+                                                                            regimes: training only the input weights J^in (and the scale
+                                                                            γ), versus training all parameters including J^dyn. The first
+                                                                            regime trains only the external forces from the input ρ′(x) ⊙
+                                                                            J^in u (which correspond to a symmetric contribution in the
+                                                                            Jacobian) applied to our non-conservative system, while
+                                                                            the second additionally trains J^dyn and therefore the non-
+                   (a) Results after one epoch.                             symmetric part of the Jacobian directly.
+                                                                            Figure 3 summarizes the results. We find that AsymEP
+                                                                            maintains robust performance across all asymmetry levels
+                                                                            (e.g., achieving an accuracy of 93.8 ± 0.4% at r_str = 0 and
+                                                                            94.9 ± 0.2% at r_str = 0.875 when training all parameters)
+                                                                            and can even learn when the recurrent connection matrix
+                                                                            J^dyn is completely antisymmetric (r_str = 1). Additionally,
+                                                                            training all parameters shows significant improvement over
+                                                                            training only J^in.
+                                                                            In contrast, VF performs well at low asymmetry ratios
+                                                                            but degrades as asymmetry increases, eventually dropping
+                                                                            to chance levels (e.g., accuracies of 5 ± 3% and 8 ± 4%
+                   (b) Results after 20 epochs.                             at r_str = 1 for input-only and all-parameter training, re-
+Figure 1. Comparison of algorithm performance on MNIST using                spectively). When only J^in is trained, VF accuracy col-
+a layered architecture with one hidden layer and symmetric ini-             lapses around r_str ≈ 0.5, whereas training all parameters
+tialization. Squares denote AsymEP, circles EP, and triangles VF.           delays this collapse until r_str ≈ 0.8. Our analysis in Ap-
+Test accuracy (averaged over 10 runs) is shown after one epoch              pendix G.2.1 reveals that VF adjusts the dynamics such that
+(Fig. 1a) and 20 epochs (Fig. 1b).
+                                                                            the asymmetry of the Jacobian’s off-diagonal terms remains
+                                                                            strictly lower than the structural asymmetry ratio. The train-
+                                                                            ing appears to adjust the neuronal state such that neurons
+                                                                            connected by strongly asymmetric weights have low activa-
+                                                                            tion. As shown in Appendix G.2.1, AsymEP learns faster
+                                                                            than VF across all levels of asymmetry.
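The fixed-asymmetry construction used in this section (the ratio of Eq. (37) and the parameterization of Eq. (38)) can be sketched as follows. This is a minimal illustration, assuming NumPy and random matrices of arbitrary size; the function names `asymmetry_ratio` and `fixed_ratio_matrix` are hypothetical. Because symmetric and antisymmetric matrices are orthogonal under the Frobenius inner product, the measured ratio should equal the prescribed r_str for any scale γ.

```python
import numpy as np

def asymmetry_ratio(J):
    """Eq. (37): Frobenius norm of the antisymmetric part over that of J."""
    return np.linalg.norm(0.5 * (J - J.T)) / np.linalg.norm(J)

def fixed_ratio_matrix(S_raw, A_raw, r_str, gamma=1.0):
    """Eq. (38): mix normalised symmetric and antisymmetric parts."""
    S = 0.5 * (S_raw + S_raw.T)   # symmetrise the raw parameters
    A = 0.5 * (A_raw - A_raw.T)   # antisymmetrise the raw parameters
    return gamma * (np.sqrt(1.0 - r_str**2) * S / np.linalg.norm(S)
                    + r_str * A / np.linalg.norm(A))

rng = np.random.default_rng(0)
n = 8
S_raw = rng.normal(size=(n, n))
A_raw = rng.normal(size=(n, n))

ratios = {}
for r in (0.0, 0.5, 0.875, 1.0):
    J = fixed_ratio_matrix(S_raw, A_raw, r, gamma=2.0)
    ratios[r] = asymmetry_ratio(J)
    print(f"prescribed r_str = {r:5.3f}, measured = {ratios[r]:.6f}")
```

`np.linalg.norm` applied to a 2-D array without further arguments returns the Frobenius norm, which is the norm used in Eq. (37).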
+                                                                            Finally, Appendix G.3 opens with a brief theoretical dis-
+                                                                            cussion of the stability of these non-conservative dynamics,
+                                                                            followed by simulations on all-to-all topologies with con-
+                                                                            strained rstr and input projections J in . Even in this worst-
+                                                                            case setting, AsymEP reduces oscillations and improves
+                                                                            stability.
+
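The contrast between VF and AsymEP on asymmetric systems can be illustrated on a toy problem. The sketch below is a minimal illustration under stated assumptions, not the experimental setup of this section: the force field F(x, b) = tanh(Wx + b) − x, the cost C(x) = ½‖x − y‖², the Euler integrator, and the names `relax` and `contrastive_estimate` are all hypothetical. Both estimators use the two-sided contrastive rule of Eq. (21) for θ = b; AsymEP adds the correction −2A_J(x^0, θ)(x − x^0) to the nudged dynamics. With the sign conventions of Eq. (22), the contrastive estimate should approximate minus the gradient dC/dθ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = 0.3 * rng.normal(size=(n, n)) / np.sqrt(n)  # non-symmetric couplings (toy choice)
b = rng.normal(size=n)                           # trainable parameter theta (assumption)
y = rng.normal(size=n)                           # target of the quadratic cost

def F(x):
    """Toy non-conservative force field."""
    return np.tanh(W @ x + b) - x

def dC_dx(x):
    """Gradient of C(x) = 0.5 * ||x - y||^2."""
    return x - y

def relax(field, x_init, dt=0.1, steps=3000):
    """Explicit Euler integration of dx/dt = field(x) to a stationary state."""
    x = x_init.copy()
    for _ in range(steps):
        x = x + dt * field(x)
    return x

x0 = relax(F, np.zeros(n))                # free equilibrium
s = 1.0 - np.tanh(W @ x0 + b) ** 2        # dF/db at x0 is diag(s)
J = s[:, None] * W - np.eye(n)            # Jacobian of F at x0
A = 0.5 * (J - J.T)                       # its antisymmetric part

# Exact gradient dC/db by implicit differentiation of F(x*, b) = 0.
grad_true = -s * np.linalg.solve(J.T, dC_dx(x0))

def contrastive_estimate(beta, asymmetric_correction):
    def nudged_state(signed_beta):
        def field(x):
            f = F(x) - signed_beta * dC_dx(x)
            if asymmetric_correction:      # AsymEP term, absent in VF
                f = f - 2.0 * A @ (x - x0)
            return f
        return relax(field, x0)
    return s * (nudged_state(beta) - nudged_state(-beta)) / (2.0 * beta)

def relative_error(estimate):
    """Distance between the estimate and minus the true gradient."""
    return np.linalg.norm(estimate + grad_true) / np.linalg.norm(grad_true)

beta = 1e-3
err_vf = relative_error(contrastive_estimate(beta, asymmetric_correction=False))
err_asym = relative_error(contrastive_estimate(beta, asymmetric_correction=True))
print(f"VF relative error:     {err_vf:.3e}")
print(f"AsymEP relative error: {err_asym:.3e}")
```

In this toy system the VF error is set by the antisymmetric part of the Jacobian and does not shrink with β, whereas the AsymEP error is limited only by the finite nudging strength and the integration accuracy.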
+                                                                            5.3. Feedforward Architectures
+Figure 2. Evolution of the asymmetry ratio rstr (defined in Eq. (37))       We now consider a purely feedforward architecture. Here
+during training on MNIST for AsymEP, EP and VF, initialized
+from a symmetric configuration. The models use 50 hidden neu-               VF trains only the last layer: with no backward connections,
+rons.                                                                       the output nudging signal cannot reach earlier layers, so for
+                                                                            every layer but the last the nudged stationary states coincide
+                                                                            with the free states, giving zero weight updates. As only
+following parameterization of the recurrent parameters:                     the output layer is trained, the system essentially becomes
+J^{dyn} = \gamma\left[\sqrt{1-r_{str}^2}\frac{\tilde S}{\|\tilde S\|_F}     an Extreme Learning Machine (Huang et al., 2006; Wang
+          + r_{str}\frac{\tilde A}{\|\tilde A\|_F}\right],   (38)           et al., 2022). In contrast, AsymEP introduces a correction
+                                                                            that generates effective backward connections, allowing the
+
+
+                                                                          tivity structures inspired by (Millidge et al., 2023), while
+                                                                          keeping the number of trainable parameters fixed.
+                                                                          Experiments are conducted on Fashion-MNIST using a two-
+                                                                          hidden-layer network with hidden dimensions 500 and 200.
+                                                                          Network states are denoted (x_0, x_1, x_2, x_3), where x_0 is
+                                                                          the input and x_3 = x_L the output. Forward and backward
+                                                                          connections are denoted by W_k and B_k, respectively, with
+                                                                          W_1 = J^in.
+                                                                          We consider three classes of dynamics. First, the Continuous
+                                                                          Hopfield (CH) dynamics introduced previously:
+
+                                                                          \frac{dx_k}{dt} = -x_k + \rho'(x_k) \odot \left(W_k\rho(x_{k-1}) + (1-\delta_{k,L})B_k\rho(x_{k+1})\right).   (40)
+
+                                                                          Second, Predictive Coding (PC) dynamics, defined through
+                                                                          the prediction errors e_k = x_k − W_k ρ(x_{k−1}), whose fixed
+Figure 3. Impact of the structural asymmetry ratio r_str on accuracy      point e_k = 0 corresponds to a standard feedforward net-
+(top) and standard deviation over 10 runs (bottom) on MNIST.              work:
+We compare VF (orange) and AsymEP (blue) under two training
+regimes: training only J^in (dashed) or all parameters (solid).           \frac{dx_k}{dt} = -e_k + (1-\delta_{k,L})\left(\rho'(x_k) \odot (B_k e_{k+1})\right).   (41)
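The fixed-point property of the PC dynamics in Eq. (41) can be checked numerically. The sketch below is a minimal illustration, assuming NumPy, ρ = tanh, small toy layer sizes, weakly scaled random backward weights, and an explicit Euler integrator; the names `feedforward` and `pc_relax` are hypothetical. It relaxes the free dynamics from zero hidden states and compares the result with the feedforward pass x_k = W_k ρ(x_{k−1}), for which every prediction error e_k vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 6, 5, 3]            # x_0 (input), x_1, x_2, x_3 = x_L (toy sizes)
L = len(sizes) - 1

# W[k] maps layer k-1 to k; B[k] maps layer k+1 back to k.
# B is random and small here (an assumption that keeps the relaxation stable).
W = [None] + [0.5 * rng.normal(size=(sizes[k], sizes[k - 1])) / np.sqrt(sizes[k - 1])
              for k in range(1, L + 1)]
B = [None] + [0.05 * rng.normal(size=(sizes[k], sizes[k + 1])) / np.sqrt(sizes[k + 1])
              for k in range(1, L)]
x_in = rng.normal(size=sizes[0])

def feedforward(x_in):
    """Layer-by-layer pass x_k = W_k rho(x_{k-1})."""
    xs = [x_in]
    for k in range(1, L + 1):
        xs.append(W[k] @ np.tanh(xs[k - 1]))
    return xs

def pc_relax(x_in, dt=0.05, steps=4000):
    """Euler integration of Eq. (41) with the input layer clamped."""
    xs = [x_in] + [np.zeros(sizes[k]) for k in range(1, L + 1)]
    for _ in range(steps):
        e = [None] + [xs[k] - W[k] @ np.tanh(xs[k - 1]) for k in range(1, L + 1)]
        new_xs = [x_in]
        for k in range(1, L + 1):
            dx = -e[k]
            if k < L:  # the (1 - delta_{k,L}) factor removes feedback at the output
                dx = dx + (1.0 - np.tanh(xs[k]) ** 2) * (B[k] @ e[k + 1])
            new_xs.append(xs[k] + dt * dx)
        xs = new_xs
    return xs

x_pc = pc_relax(x_in)
x_ff = feedforward(x_in)
max_gap = max(np.max(np.abs(p - f)) for p, f in zip(x_pc[1:], x_ff[1:]))
print("largest deviation from the feedforward pass:", max_gap)
```

The backward weights only act on nonzero prediction errors, so they shape the transient but leave the free fixed point unchanged in this sketch.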
+nudging signal to influence all layers. We make this explicit             Third, a standard dynamics chosen for direct comparison
+for a network with one hidden layer.                                      with backpropagation:
+Let the state x be partitioned in hidden h and output o
+layers. The recurrent connection matrix is then                           \frac{dx_k}{dt} = -x_k + W_k\rho(x_{k-1}) + (1-\delta_{k,L})B_k\rho(x_{k+1}).   (42)
+J^{dyn} = \begin{pmatrix} 0 & 0 \\ W_{h\to o} & 0 \end{pmatrix}.
+The forces of the system are:                                             For each dynamics, we examine three connectivity scenar-
+                                                                          ios.
+F_h^\beta = \rho'(h) \odot \left(J^{in} u
+    + \lambda (W_{h\to o})^\top (o - o^0)\right) - h,                        • In the asymmetric case (B_k ≠ W_{k+1}^⊤), the backward
+F_o^\beta = \rho'(o) \odot \left(W_{h\to o}\rho(h)                             weights B_k are randomly initialized and kept fixed
+    - \lambda W_{h\to o}(h - h^0)\right)                                       while only the forward weights are trained, ensuring a
+    + \lambda\beta\frac{\partial C}{\partial o} - o,   (39)                    fair comparison (i.e., identical number of parameters);
+                                                                               in PC, the learning rule for B_k is zero when only inputs
+where λ is 0 during the free inference and 1 during the                        are clamped.
+nudged phase (Eq. 20). The force on the hidden layer F_h^β
+now depends on the output layer through the term ρ′(h) ⊙                     • In the symmetric / conservative case (B_k = W_{k+1}^⊤), the
+(W_{h→o})^⊤ (o − o^0), enabling the nudge (the term β ∂C/∂o) to                CH and PC dynamics derive from an energy functional,
+influence the hidden layer. This implicitly assumes that the                   while the standard dynamics remains non-conservative
+hardware implementation supports the physical activation                       due to its non-symmetric Jacobian.
+of these backward connections.
+We validate this using a single hidden layer of only 20 neu-                 • In the feedforward case (B_k = 0), the PC and stan-
+rons on MNIST. After training, VF saturates with 64.3 ±                        dard dynamics coincide; for the standard dynamics, the
+2.0% accuracy, whereas AsymEP reaches 92.7 ± 0.5% ac-                          AsymEP learning rule mirrors backpropagation, with
+curacy. We expect this discrepancy to increase with network                    ∆x_k^β = (1/(2β)) (x^β − x^{−β}) acting as the propagated error
+                                                                               signal.
+depth, since this increases the number of layers unable to
+learn under VF. A figure with the accuracy during training
+can be found in Appendix G.4.2.                                           Table 1 shows that AsymEP consistently outperforms VF
+                                                                          in both asymmetric and feedforward settings, in final ac-
+                                                                          curacy, learning speed, and stability. After a single epoch
+5.4. Advantages of Non-Conservative Dynamics
+                                                                          it already provides on average a 15% accuracy gain with
+AsymEP is not tied to a specific neural dynamics. To further              an order-of-magnitude reduction in variance. Remarkably,
+assess the benefits of training non-conservative dynamics                 AsymEP with asymmetric connectivity also surpasses EP
+using AsymEP, we compare several dynamics and connec-                     on symmetric networks despite training only the forward
+
+
+weights, suggesting that relaxing symmetry constraints may improve expressivity. Supplementary results
+are provided in Appendix G.5.
+
+Table 1. Test accuracy on Fashion-MNIST (%) at epoch 50 (mean ± std over 10 runs). BP on a standard
+feedforward architecture using MSE and SGD achieves 87.37 ± 0.29%.
+
+ Dynamics   Connectivity   EP              AsymEP          VF
+ CH         Asym           -               86.78 ± 0.14    85.20 ± 0.12
+ CH         Feedfor        -               86.05 ± 0.12    77.76 ± 0.37
+ CH         Sym            84.30 ± 0.13    -               -
+ PC         Asym           -               86.20 ± 0.17    80.71 ± 6.17
+ PC         Sym            84.78 ± 0.14    -               -
+ Standard   Asym           -               82.91 ± 0.48    75.52 ± 1.69
+ Standard   Feedfor        -               86.25 ± 0.16    78.58 ± 0.28
+
+Finally, to investigate how AsymEP scales with depth, we trained deeper fully connected networks with
+two and three hidden layers of 500 neurons on Fashion-MNIST, reaching 86.41 ± 0.22% and 87.8 ± 0.15%
+test accuracy, respectively.
+
+5.5. Feedforward Training on CIFAR-10: BP vs. Dyadic EP vs. AsymEP
+To test whether our framework scales beyond shallow networks, we conclude with a deep, purely
+feedforward CNN architecture trained on CIFAR-10. We compare backpropagation (BP), VF, AsymEP and
+Dyadic EP in a controlled setting where the gradient estimator is the only difference between runs:
+all methods share the same configuration, with the BP gradient replaced by the contrast of stationary
+states for the EP-based methods (see App. G.6 for details). Each configuration is trained for 40
+epochs over 5 seeds.
+
+Table 2 reports the final test accuracy. Both of our algorithms scale to this regime, closely tracking
+the BP baseline throughout training and matching its final accuracy: a paired t-test finds no
+significant difference between Dyadic EP and BP (p = 0.75), while AsymEP trails BP by less than one
+percentage point. In contrast, VF makes slight initial progress (peaking near 30%) before collapsing
+to chance level (10%). Additional details can be found in Appendix G.6.
+
+Table 2. Test accuracy on CIFAR-10 (%) at epoch 40 (mean ± std over 5 seeds).
+
+ Method             Test Acc. (%)
+ Backpropagation    90.66 ± 0.25
+ Dyadic EP          90.69 ± 0.14
+ AsymEP             89.74 ± 0.14
+ VF                 10.00 ± 0.00
+
+6. Discussion and Conclusion
+In this work, we extended Equilibrium Propagation (EP) to non-conservative systems that reach
+stationary states by deriving two mathematically equivalent algorithms that recover the exact gradient
+of the cost function in the limit of infinitesimal nudging.
+
+The first approach, Asymmetric EP, preserves the original inference dynamics. It introduces a
+corrective force during the nudged phase that remains spatially local, as the antisymmetric Jacobian
+is null for unconnected neurons and the perturbation from equilibrium is available at the synapse
+level. Unlike standard methods such as Recurrent Backpropagation (Almeida, 1990; Pineda, 1987), this
+avoids explicit digital weight transposition. However, a physical mechanism to obtain the local
+corrective force at the synapse level remains a subject for future work. We also note that AsymEP
+shares the temporal non-locality of standard EP.
+
+The second approach, Dyadic EP, doubles the state space to map non-reciprocal dynamics onto an energy
+landscape. This construction is conceptually reminiscent of multi-compartment cortical neurons, where
+apical dendrites integrate feedback (analogous to z − z′) separately from basal feedforward input
+(analogous to z + z′) (Guerguiev et al., 2017). The expanded space also enables the positive and
+negative nudging phases to run in parallel. This offers a pathway to implement a version of EP that is
+local in time, but would require a doubling of the degrees of freedom on the physical hardware. More
+fundamentally, the energy defined on the extended state shows that the tools and theoretical
+guarantees obtained for EP should also apply to the case of non-reciprocal forces, and that the
+variational principle behind EP is universal in the sense that it can be applied to all networks
+which operate in a stationary state.
+
+Furthermore, Dyadic EP is not restricted to the EP community and could suggest a more physically
+plausible alternative to the stationary-state Adjoint Method (for fixed inputs) (Chen et al., 2018):
+by solving the forward and adjoint equations simultaneously via relaxation, it circumvents a separate
+backward-in-time pass.
+
+Finally, our experiments on MNIST, Fashion-MNIST, and CIFAR-10 confirm that AsymEP and Dyadic EP
+consistently outperform EP and VF, and notably enable effective training of feedforward networks.
+
+Our work thus opens new avenues for learning in neuromorphic hardware, dissipative physical systems,
+and neural architectures where asymmetry is intrinsic rather than incidental.
+
+
+
+Impact Statement
+This paper presents results that advance the field of machine learning. There are many potential
+societal consequences of our work, none of which we feel must be specifically highlighted here.
+
+Acknowledgments
+AES is fully funded by the Horizon Europe Marie Skłodowska-Curie Doctoral Network 'Postdigital Plus'
+(Grant 101169118). DVA acknowledges the support of the French Community of Belgium through a FRIA
+fellowship. SM acknowledges financial support by the Fonds de la Recherche Scientifique–FNRS, Belgium
+under EOS Project No. 40007536. Computational resources have been provided by the Consortium des
+Équipements de Calcul Intensif (CÉCI), funded by the Fonds de la Recherche Scientifique de Belgique
+(F.R.S.-FNRS) under Grant No. 2.5020.11 and by the Walloon Region.
+
+                     “ἁρμονίη ἀφανὴς φανερῆς κρείττων”
+
+References
+Almeida, L. B. A learning rule for asynchronous perceptrons with feedback in a combinatorial
+  environment. In Artificial neural networks: concept learning, pp. 102–111. 1990.
+
+Altman, L. E., Stern, M., Liu, A. J., and Durian, D. J. Experimental demonstration of coupled learning
+  in elastic networks. Physical Review Applied, 22(2):024053, 2024.
+
+Aykroyd, C., Bourgoin, A., and Poncin-Lafitte, C. L. Hamiltonian treatment of non-conservative
+  systems. arXiv preprint arXiv:2507.18658, 2025.
+
+Bateman, H. On dissipative systems and related variational principles. Physical Review, 38(4):815,
+  1931.
+
+Berneman, M. and Hexner, D. Equilibrium propagation for dissipative dynamics. Advanced Intelligent
+  Systems, pp. e202501310, 2025.
+
+Bishop, K. J., Biswal, S. L., and Bharti, B. Active colloids as models, materials, and machines.
+  Annual Review of Chemical and Biomolecular Engineering, 14(1):1–30, 2023.
+
+Bowick, M. J., Fakhri, N., Marchetti, M. C., and Ramaswamy, S. Symmetry, thermodynamics, and topology
+  in active matter. Physical Review X, 12(1):010501, 2022.
+
+Brandenbourger, M., Locsin, X., Lerner, E., and Coulais, C. Non-reciprocal robotic metamaterials.
+  Nature Communications, 10(1):4608, 2019.
+
+Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press, 2006.
+
+Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential
+  equations. Advances in Neural Information Processing Systems, 31, 2018.
+
+Cin, N. D., Marquardt, F., and Wanjura, C. C. Training nonlinear optical neural networks with
+  scattering backpropagation. arXiv preprint arXiv:2508.11750, 2025.
+
+Costa, P. and Santos, P. A. Directed equilibrium propagation revisited. Mathematics, 13(11), 2025.
+  ISSN 2227-7390.
+
+Crick, F. The recent excitement about neural networks. Nature, 337, 1989.
+
+Dillavou, S., Stern, M., Liu, A. J., and Durian, D. J. Demonstration of decentralized physics-driven
+  learning. Physical Review Applied, 18(1):014040, 2022.
+
+Dillavou, S., Beyer, B. D., Stern, M., Liu, A. J., Miskin, M. Z., and Durian, D. J. Machine learning
+  without a processor: Emergent learning in a nonlinear analog network. Proceedings of the National
+  Academy of Sciences, 121(28):e2319718121, 2024.
+
+
+Ernoult, M., Grollier, J., Querlioz, D., Bengio, Y., and Scellier, B. Updates of equilibrium prop
+  match gradients of backprop through time in an RNN with static input. Advances in Neural
+  Information Processing Systems, 32, 2019.
+
+Ernoult, M., Grollier, J., Querlioz, D., Bengio, Y., and Scellier, B. Equilibrium propagation with
+  continual weight updates. arXiv preprint arXiv:2005.04168, 2020.
+
+Falk, M. J., Strupp, A. T., Scellier, B., and Murugan, A. Temporal contrastive learning through
+  implicit non-equilibrium memory. Nature Communications, (16), 2025.
+
+Farinha, M. T., Pequito, S., Santos, P. A., and Figueiredo, M. A. T. Equilibrium propagation for
+  complete directed neural networks. In Proceedings of the 28th European Symposium on Artificial
+  Neural Networks, Computational Intelligence and Machine Learning (ESANN 2020), 2020.
+
+Galley, C. R. Classical mechanics of nonconservative systems. Physical Review Letters,
+  110(17):174301, 2013.
+
+Guerguiev, J., Lillicrap, T. P., and Richards, B. A. Towards deep learning with segregated dendrites.
+  eLife, 6:e22901, 2017.
+
+Høier, R. and Zach, C. A Lagrangian perspective on dual propagation. In Proceedings of the First
+  Workshop on Machine Learning with New Compute Paradigms at NeurIPS 2023, New Orleans, LA, USA,
+  Dec 2023.
+
+Høier, R. and Zach, C. Two tales of single-phase contrastive Hebbian learning. In Salakhutdinov, R.,
+  Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings
+  of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine
+  Learning Research, pp. 18470–18488. PMLR, 21–27 Jul 2024.
+
+Høier, R., Staudt, D., and Zach, C. Dual propagation: accelerating contrastive Hebbian learning with
+  dyadic neurons. In Proceedings of the 40th International Conference on Machine Learning, ICML’23.
+  JMLR.org, 2023.
+
+Høier, R., Kalinin, K., Ernoult, M., and Zach, C. Dyadic learning in recurrent and feedforward models.
+  In NeurIPS 2024 Workshop Machine Learning with new Compute Paradigms, 2024.
+
+Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities.
+  Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.
+
+Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. Extreme learning machine: theory and applications.
+  Neurocomputing, 70(1-3):489–501, 2006.
+
+Indiveri, G. and Liu, S.-C. Memory and information processing in neuromorphic systems. Proceedings of
+  the IEEE, 103(8):1379–1397, 2015.
+
+Kalinin, K. P., Gladrow, J., Chu, J., Clegg, J. H., Cletheroe, D., Kelly, D. J., Rahmani, B.,
+  Brennan, G., Canakci, B., Falck, F., et al. Analog optical computer for AI inference and
+  combinatorial optimization. Nature, 645(8080):354–361, 2025.
+
+Kendall, J., Pantone, R., Manickavasagam, K., Bengio, Y., and Scellier, B. Training end-to-end analog
+  neural networks with equilibrium propagation. arXiv preprint arXiv:2006.01981, 2020.
+
+Laborieux, A. and Zenke, F. Holomorphic equilibrium propagation computes exact gradients through
+  finite size oscillations. Advances in Neural Information Processing Systems, 35:12950–12963, 2022.
+
+Laborieux, A. and Zenke, F. Improving equilibrium propagation without weight symmetry through
+  Jacobian homeostasis. In Proceedings of the International Conference on Learning Representations
+  (ICLR) 2024, Virtual (ICLR), May 2024.
+
+Laborieux, A., Ernoult, M., Scellier, B., Bengio, Y., Grollier, J., and Querlioz, D. Scaling
+  equilibrium propagation to deep convnets by drastically reducing its gradient estimator bias.
+  Frontiers in Neuroscience, 15:633674, 2021.
+
+Laydevant, J., Marković, D., and Grollier, J. Training an Ising machine with equilibrium propagation.
+  Nature Communications, 15(1):3671, 2024.
+
+LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
+
+Martin, E., Ernoult, M., Laydevant, J., Li, S., Querlioz, D., Petrisor, T., and Grollier, J. EqSpike:
+  spike-driven equilibrium propagation for neuromorphic implementations. iScience, 24(3), 2021.
+
+Massar, S. Equilibrium propagation for learning in Lagrangian dynamical systems. Physical Review E,
+  112(3):035304, 2025.
+
+Massar, S. and Mognetti, B. M. Equilibrium propagation: the quantum and the thermal cases. Quantum
+  Studies: Mathematics and Foundations, 12(1):6, 2025.
+
+Millidge, B., Song, Y., Salvatori, T., Lukasiewicz, T., and Bogacz, R. Backpropagation at the
+  infinitesimal inference limit of energy-based models: Unifying predictive coding, equilibrium
+  propagation, and contrastive Hebbian learning. In International Conference on Learning
+  Representations (ICLR), 2023.
+
+
+Nest, T. and Høier, R. Dyadic learning in asymmetric convnets. In New Frontiers in Associative
+  Memories-Workshop at ICLR 2026.
+
+Osat, S. and Golestanian, R. Non-reciprocal multifarious self-organization. Nature Nanotechnology,
+  18(1):79–85, 2023.
+
+O’Connor, P., Gavves, E., and Welling, M. Training a spiking neural network with equilibrium
+  propagation. In The 22nd International Conference on Artificial Intelligence and Statistics,
+  pp. 1516–1523. PMLR, 2019.
+
+Pineda, F. Generalization of back propagation to recurrent and higher order neural networks. In
+  Neural Information Processing Systems, 1987.
+
+Pourcel, G., Basu, D., Ernoult, M., and Gilra, A. Lagrangian-based equilibrium propagation:
+  generalisation to arbitrary boundary conditions & equivalence with Hamiltonian echo learning.
+  arXiv preprint arXiv:2506.06248, 2025.
+
+Rageau, T. and Grollier, J. Training and synchronizing oscillator networks with equilibrium
+  propagation. Neuromorphic Computing and Engineering, 2025.
+
+Sajnok, K. and Matuszewski, M. Near-equilibrium propagation training in nonlinear wave systems.
+  arXiv preprint arXiv:2510.16084, 2025.
+
+Scellier, B. Quantum equilibrium propagation: Gradient-descent training of quantum systems. arXiv
+  preprint arXiv:2406.00879, 2024.
+
+Scellier, B. and Bengio, Y. Equilibrium propagation: Bridging the gap between energy-based models and
+  backpropagation. Frontiers in Computational Neuroscience, 11:24, 2017.
+
+Scellier, B., Goyal, A., Binas, J., Mesnard, T., and Bengio, Y. Generalization of equilibrium
+  propagation to vector field dynamics. arXiv preprint arXiv:1808.04873, 2018.
+
+Scellier, B., Mishra, S., Bengio, Y., and Ollivier, Y. Agnostic physics-driven deep learning.
+  arXiv:2205.15021v1, 2022.
+
+Scurria, A. E. A physical theory of backpropagation: Exact gradients from the least-action principle.
+  2026.
+
+Stern, M., Hexner, D., Rocks, J. W., and Liu, A. J. Supervised learning in physical networks: From
+  machine learning to learning machines. Physical Review X, 11(2):021045, 2021.
+
+Wang, J., Lu, S., Wang, S.-H., and Zhang, Y.-D. A review on extreme learning machine. Multimedia
+  Tools and Applications, 81(29):41611–41660, 2022.
+
+Wang, Q., Wanjura, C. C., and Marquardt, F. Training coupled phase oscillators as a neuromorphic
+  platform using equilibrium propagation. Neuromorphic Computing and Engineering, 4(3):034014, 2024.
+
+Wanjura, C. C. and Marquardt, F. Quantum equilibrium propagation for efficient training of quantum
+  systems based on Onsager reciprocity. Nature Communications, 16(1):6595, 2025.
+
+Werbos, P. J. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE,
+  78(10):1550–1560, 1990.
+
+Yi, S.-i., Kendall, J. D., Williams, R. S., and Kumar, S. Activity-difference training of deep neural
+  networks using memristor crossbars. Nature Electronics, 6(1):45–51, 2023.
+
+
+A. Gradient Estimation Error in VF                                           where s denotes the dynamical state of the system. This
+                                                                             symmetry is the linchpin of the equivalence proof, as the
+In this appendix, we quantify the gradient estimation error                  gradient expressions derived for BPTT and standard EP
+introduced by VF in the limit where the Jacobian asymmetry                   differ precisely by a transpose operation applied to ∂F
+                                                                                                                                  ∂s .
+is small.
+                                                                             This observation aligns with our analysis in the main text:
+Comparing the post-synaptic update terms in Eqs. (12) and                    VF fails in non-conservative systems due to the missing
+(14) gives the following error in the gradient of the cost:                  transpose in the post-synaptic term (see Eq. (16)). Following
+                       ⊤                                                   the derivation in Ernoult et al. (2019) (viz., Appendix A, Eqs.
+            ∂F 0                                                             (31–33)), the recursive relations for the gradients in BPTT
+  Error = −     (x , θ)
+             ∂θ                                                              are given by:
+             −1                −1  ∂C 0
+× JF (x0 , θ)     − JF⊤ (x0 , θ)          (x , y), (43)                                                           ∂ℓ
+                                       ∂x                                                         ∇BPTT
+                                                                                                   s    (0) =        (s⋆ , y),              (49)
+                                                                                                                  ∂s
+To quantify this error, we decompose the Jacobian JF (x, θ)                  and for all t = 1, . . . , K,
+into its symmetric part SJ (x, θ) and antisymmetric part
+                                                                                                                    ⊤
+                                                                                                    ∂F
+           SJ (x, θ) = 12 JF (x, θ) + JF⊤ (x, θ) ,                               ∇BPTT                                    ∇BPTT (t − 1),
+                                                
+                                                                                  s    (t) =           (x, s⋆ , θ)         s                (50)
+                                                      (44)                                          ∂s
+           AJ (x, θ) = 12 JF (x, θ) − JF⊤ (x, θ) .
+                                                
+                                                                                                                    ⊤
+                                                                                                    ∂F
+                                                                                 ∇BPTT
+                                                                                  θ    (t) =           (x, s⋆ , θ)        ∇BPTT
+                                                                                                                           s    (t − 1),    (51)
+Assuming the asymmetry AJ (x, θ) is small, we can make                                              ∂θ
+a series expansion in SJ−1 AJ (omitting the dependencies                     where θ represents the optimization parameters, ℓ is the
+for clarity). Applying the Neumann expansion for small                       cost function, s⋆ is the free equilibrium state (satisfying
+∥SJ−1 AJ ∥ gives                                                             F (s⋆ ) = 0), y is the target, and x is the input. The index t
+                       ∞
+                                                   !                         denotes the unrolled time steps, initialized at s(0) = s⋆ .
+                       X
+        (JF ) −1
+                   =       (−1)  n
+                                     (SJ−1 AJ )n        SJ−1 ,   (45)        In contrast, the gradients computed by VF follow the recur-
+                       n=0                                                   sion (viz., Ernoult et al. (2019), Appendix A, Eqs. (24–26)):
+                        ∞
+                                           !
+                       X
+        (JF⊤ )−1 =           (SJ−1 AJ )n       SJ−1 .            (46)                                             ∂ℓ
+                                                                                                  ∆EP
+                                                                                                   s (0) = −         (s⋆ , y),              (52)
+                       n=0                                                                                        ∂s
+Subtracting the two series and assuming convergence, we                      and for all t ≥ 0,
+finally obtain
+                                                                                                     ∂F
+                                                                                       ∆EP
+                                                                                        s (t + 1) =     (x, s⋆ , θ) ∆EPs (t),               (53)
+                              ∞                                                                      ∂s
+                                                !
+                              X
+                                    −1
+                                          2n+1
+        −1      ⊤ −1
+  (JF ) − (JF ) = −2               SJ AJ          SJ−1 .                                             
+                                                                                                       ∂F
+                                                                                                                       ⊤
+                                 n=0                                                   ∆EP
+                                                                                        θ  (t + 1) =      (x, s ⋆ , θ)    ∆EP
+                                                                                                                            s (t).          (54)
+                                                                 (47)                                  ∂θ
+                                                                             Comparing these two sets of equations confirms that the only
+B. Equivalence between AsymEP and BPTT                                       difference are Eqs. (50) and (53), specifically the transpose
+                                                                             of the Jacobian ∂F
+                                                                                              ∂s (ignoring the global sign difference in
+In this appendix, we sketch the equivalence between the                      Eqs. (49) and (52)).
+gradient estimate computed by AsymEP and Backpropaga-
+tion Through Time (BPTT) (Werbos, 1990) for a Recurrent                      In AsymEP, we modify the dynamics by adding a correction
+Neural Network with fixed inputs. Our derivation relies on                   term dependent on the antisymmetric part of the Jacobian.
+the proof provided by Ernoult et al. (2019), which estab-                    Denoting the force of this augmented system by F A , the
+lished that standard (conservative) EP computes gradients                    Jacobian at the free equilibrium satisfies:
+identical to those of BPTT. To facilitate direct comparison,                                                                       ⊤
+                                                                                         ∂F A
+                                                                                                              
+we adopt their notation for this section.                                                                         ∂F
+                                                                                              (x, s⋆ , θ) =          (x, s⋆ , θ)        .   (55)
+                                                                                          ∂s                      ∂s
+The proof provided by Ernoult et al. (2019) relies on the
+assumption that the vector field F (i.e., transition function)               By substituting this corrected Jacobian into the recursive
+is derived from a scalar potential function, which implies                   relations, AsymEP recovers the exact transpose required
+that                                                                         by BPTT. Consequently, our method extends the equivalence
+                                                                             between EP and BPTT to the general case of non-conservative
+                                                                             forces.
+    \frac{\partial F}{\partial s} = \left( \frac{\partial F}{\partial s} \right)^{\top},    (48)
+
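As a sanity check on Eq. (55), the following minimal Python sketch builds a small non-symmetric recurrent map, adds the antisymmetric correction (J^T - J)(s - s_star) to its force, and compares the finite-difference Jacobian of the augmented force with the transpose of the original Jacobian. The 2-neuron network, its weights `W`, bias `b`, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math

# Hypothetical 2-neuron system (an assumption for illustration):
# F(s) = tanh(W s + b) - s with a deliberately non-symmetric W.
W = [[0.2, 0.5], [-0.4, 0.1]]
b = [0.3, -0.2]
N = 2

def force(s):
    return [math.tanh(sum(W[i][j] * s[j] for j in range(N)) + b[i]) - s[i]
            for i in range(N)]

def jacobian(f, s, eps=1e-6):
    """Central finite-difference Jacobian, J[i][j] = d f_i / d s_j."""
    J = [[0.0] * N for _ in range(N)]
    for j in range(N):
        sp, sm = list(s), list(s)
        sp[j] += eps
        sm[j] -= eps
        fp, fm = f(sp), f(sm)
        for i in range(N):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J

# Free equilibrium s_star via the fixed-point iteration s <- tanh(W s + b).
s_star = [0.0] * N
for _ in range(200):
    s_star = [fi + si for fi, si in zip(force(s_star), s_star)]

J = jacobian(force, s_star)

def augmented_force(s):
    """F_A(s) = F(s) + (J^T - J)(s - s_star), i.e. F(s) - 2 A_J (s - s_star)."""
    ds = [s[k] - s_star[k] for k in range(N)]
    F = force(s)
    return [F[i] + sum((J[j][i] - J[i][j]) * ds[j] for j in range(N))
            for i in range(N)]

J_A = jacobian(augmented_force, s_star)

# Largest entrywise gap between the Jacobian of F_A and the transpose of J.
max_gap = max(abs(J_A[i][j] - J[j][i]) for i in range(N) for j in range(N))
asymmetry = abs(J[0][1] - J[1][0])
print(max_gap, asymmetry)
```

Under these assumptions `max_gap` should sit at finite-difference noise level while `asymmetry` stays of order one, which is the content of Eq. (55) for this toy system.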
+
+C. Out-of-Equilibrium Mechanics                                         C.3. Symmetry Breaking as Credit Assignment
+Here we sketch the physical picture behind the doubled-                 On the diagonal manifold z = z ′ the doubled system enjoys
+energy construction of Eq. (26). The full derivation from               a gauge symmetry: the auxiliary variable z ′ is redundant and
+Hamilton’s least-action principle, together with its connec-            the difference d is identically zero. Credit assignment is im-
+tion to the Bateman–Galley formalism for non-conservative               plemented by deliberately breaking this symmetry through
+classical mechanics (Bateman, 1931; Galley, 2013; Aykroyd               the task cost. Adding βD(z, z ′ ) = β C(m) to H exerts
+et al., 2025), can be found in (Scurria, 2026).                         opposite forces on z and z ′ and drives them apart, so that
+                                                                        the difference d ceases to be redundant and begins to carry
+C.1. The Helmholtz Obstruction                                          information about the loss landscape.
+
+The natural physical route to a variational principle for a
+dynamical system ẋ = F (x, θ) is to seek a scalar potential            D. Proofs for Dyadic EP
+E such that F = −∂x E. The classical Helmholtz integra-                 We now demonstrate that Dyadic EP correctly trains the
+bility condition states that such an E exists if and only if the        parameters θ of the original force field F (x, θ), giving the
+Jacobian J_F is symmetric everywhere. Whenever the                      exact gradient dC(x̄^0)/dθ in the limit of infinitesimal nudging.
+interactions are non-reciprocal (as in feedforward networks,
+active matter, or driven optical systems), J_F acquires a
+non-zero antisymmetric part and the Helmholtz condition                 D.1. Proof of EP
+fails identically. No scalar potential on the original                  First, recall that standard EP does not strictly require the
+n-dimensional state space can then generate the dynamics,               system to settle at an energy minimum; it requires only that
+and the “energy minimisation” route at the heart of standard            the system reaches a stationary state (a fixed point of the
+EP is blocked at the structural level. The obstruction is not           dynamics). Indeed, using the notation of Section 2.1, EP
+a matter of computational convenience: it reflects the fact             relies on the key identity:
+that the rotational component of F carries information that
+no scalar function of x alone can record.
+                                                                            \frac{d^2}{d\theta\, d\beta} E_T(x^\beta, \theta) = \frac{d^2}{d\beta\, d\theta} E_T(x^\beta, \theta).    (57)
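The Helmholtz condition of C.1 can be probed numerically. The sketch below, in plain Python, estimates the Jacobian of two hand-made 2D force fields by central differences at a sample point and reports the size of its antisymmetric part. Both fields, the sample point, and the helper names are assumptions chosen for illustration; a symmetric Jacobian at one point is only a necessary check, not a proof that a potential exists.

```python
import math

def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian, J[i][j] = d f_i / d x_j."""
    n = len(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        fp, fm = f(xp), f(xm)
        for i in range(n):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J

def antisymmetric_norm(J):
    """Largest entry (in absolute value) of the antisymmetric part (J - J^T) / 2."""
    n = len(J)
    return max(abs(J[i][j] - J[j][i]) / 2 for i in range(n) for j in range(n))

def conservative_force(x):
    # F = -grad E for E(x) = 0.5*(x0^2 + x1^2) + 0.3*x0*x1 + 0.1*x0^4
    return [-x[0] - 0.3 * x[1] - 0.4 * x[0] ** 3,
            -x[1] - 0.3 * x[0]]

def nonreciprocal_force(x):
    # Neuron 1 excites neuron 0 while neuron 0 inhibits neuron 1.
    return [-x[0] + 0.8 * math.tanh(x[1]),
            -x[1] - 0.3 * math.tanh(x[0])]

x = [0.3, -0.5]  # arbitrary sample point
asym_conservative = antisymmetric_norm(jacobian(conservative_force, x))
asym_nonreciprocal = antisymmetric_norm(jacobian(nonreciprocal_force, x))
print(asym_conservative, asym_nonreciprocal)
```

For the gradient field the antisymmetric part vanishes up to numerical noise, whereas the non-reciprocal field has a clearly non-zero antisymmetric part, so no scalar potential can generate it.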
+C.2. Variational Reconstruction on a Doubled Space                      Expanding the total derivative with respect to β gives:
+Applying the Bateman–Galley formalism circumvents this                      \frac{d}{d\beta} E_T(x^\beta, \theta) = \left( \frac{\partial E_T(x^\beta, \theta)}{\partial x} \right)^{\top} \frac{d x^\beta}{d\beta} + \frac{\partial E_T(x^\beta, \theta)}{\partial \beta}
+obstruction by enlarging the configuration space. The single                                                      = C(x^\beta).    (58)
+state x ∈ R^n is replaced by a conjugate pair (z, z′) ∈ R^{2n},
+and the rotational component of F (which has no scalar                  where the first term vanishes because the system is at a
+generator on the original n-dimensional space) is absorbed              stationary state, i.e., ∂E_T(x^β, θ)/∂x = 0; this holds even if
+into a bilinear coupling between z and z′ on the doubled                the system is not at a minimum of E_T. Similarly, for the
+space, where it does admit a variational description. The               derivative with respect to θ:
+physical motion is recovered on the diagonal submanifold
+z = z′ (the so-called 'physical limit'), while the off-diagonal             \frac{d}{d\theta} E_T(x^\beta, \theta) = \frac{\partial E_T(x^\beta, \theta)}{\partial \theta},    (59)
+direction d = z − z′ supplies the additional degree of
+freedom needed to encode non-reciprocity.                               where we additionally assume that the cost function does
+Specializing this reconstruction to the overdamped                      not depend explicitly on the parameters θ. Substituting these
+(first-order) regime relevant to relaxational neural dynamics           results into Eq. (57) in the limit of infinitesimal nudging
+yields the bilinear energy                                              (β → 0) recovers the fundamental relation given by Eq. (9).
+
+    H(z, z', \theta) = -(z - z')^{\top} F\left( \frac{z + z'}{2}, \theta \right),    (56)
+
+                                                                        D.2. Proof of Dyadic EP
+which is precisely Eq. (26). The symmetric midpoint m =                 We now analyze the stationary states of Dyadic EP by
+(z + z′)/2 plays the role of the physical coordinate of the             introducing the change of variables:
+doubled system, while d is the auxiliary direction along
+which non-reciprocity is stored. On the submanifold z = z′                  m = \frac{z + z'}{2}, \qquad d = z - z'.    (60)
+the coupling proportional to (z − z′) vanishes identically
+and both states evolve under the original field F, so the               In these coordinates, the augmented energy H_T becomes
+doubling leaves the on-shell physics unchanged. We refer
+the reader to (Scurria, 2026) for the full construction.                    H_T(m, d, \theta, \beta) = -d^{\top} F(m, \theta) + \beta C(m)    (61)
+
+
+and the dynamics in Eq. (28) can be rewritten as:                          In Dyadic EP, we instead employ the single-phase update:
+
+    \frac{dm}{dt} = -\frac{\partial H_T}{\partial d} = F(m, \theta),    (62)
+    \frac{dd}{dt} = -\frac{\partial H_T}{\partial m} = d^{\top} J_F(m, \theta) - \beta \frac{\partial}{\partial m} C(m).    (63)
+                                                                               \Delta\theta \propto -\frac{1}{\beta} \frac{\partial H(z^\beta, z'^\beta, \theta)}{\partial \theta}    (70)
+
+The stationary states (m^β, d^β) are the solutions to:                     This choice avoids the overhead of evolving two coupled
+                                                                           equations in the extended space, which would be
+    F(m^\beta, \theta) = 0,    (64)                                        computationally equivalent to evolving four equations in the
+    (d^\beta)^{\top} J_F(m^\beta, \theta) - \beta \frac{\partial}{\partial m} C(m^\beta) = 0.    (65)
+                                                                           original space (two for +β and two for −β). Using Eq. (70), we
+This leads to the following observations:                                  evolve only one coupled equation for +β in the extended
+1) The stationary state of m is independent of β and                       space; this corresponds to two equations in the original
+coincides with the stationary state of the original system:                space, thereby achieving the same computational complexity
+                                                                           as AsymEP. Furthermore, this single-phase formulation
+    \frac{z^\beta + z'^\beta}{2} = m^\beta = m^0 = x^0.    (66)            suggests a pathway toward making the update local in time,
+                                                                           provided appropriate hardware is used to implement the
+2) The Jacobian of the extended system defined in Eq. (26)                 augmented phase.
+is invertible, provided J_F is invertible. This is most evident            Mathematically, these two approaches yield the same gradient
+from Eq. (63).                                                             estimate because the equations for d^β are linear.
+                                                                           Explicitly, we have:
+3) The stationary state value of d is given by:
+    d^\beta = \beta \left( J_F^{\top}(m^0, \theta) \right)^{-1} \frac{\partial C}{\partial x}(x^0)    (67)
+                                                                               \frac{\partial H(z^\beta, z'^\beta, \theta)}{\partial \theta} = -(z^\beta - z'^\beta)^{\top} \frac{\partial F}{\partial \theta}\left( \frac{z^\beta + z'^\beta}{2}, \theta \right)
+                                                                                   = -\beta \left( \frac{\partial F}{\partial \theta}(z^0, \theta) \right)^{\top} \left( J_F^{\top}(z^0, \theta) \right)^{-1} \left( \frac{\partial C}{\partial x}(z^0) \right),    (71)
+
+In particular, when β = 0, we have d^0 = 0, which implies                  where we have used Eqs. (66) and (67). Inspection of
+that the free stationary states coincide: z^0 = z′^0.                      Eq. (71) confirms that, up to corrections of order β^2, we
+                                                                           obtain exactly the same gradient as in AsymEP.
+
+4) The cost at the stationary state of the extended system                 E. AsymEP versus Dyadic EP
+is equal to the cost at the stationary state of the original
+                                                                           In this appendix, we demonstrate that Asymmetric Equilib-
+system:
+                                                                           rium Propagation (AsymEP) emerges naturally as the first-
+                     D(m^0) = C(x^0).                   (68)
+                                                                           order projection of the 2N -dimensional Dyadic Equilibrium
+Consequently, the gradients of the cost with respect to the                Propagation onto a single N -dimensional state space. We
+parameters are identical.                                                  then formalize the physical trade-offs between the two ar-
+                                                                           chitectures.
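The single-phase rule of Eq. (70) can be tried out on a toy problem. The sketch below integrates the (m, d) dynamics of Eqs. (62)-(63) with explicit Euler steps for a linear, non-conservative force and compares the resulting update with a finite-difference gradient of the cost at the free equilibrium. The force F(x, θ) = A x + θ, the quadratic cost, the matrix `A`, the target `y`, and the step sizes are all assumptions made for illustration; because this toy force is linear, the O(β^2) corrections mentioned above vanish.

```python
# Hypothetical linear, non-conservative system (illustrative assumption):
# F(x, theta) = A x + theta, with cost C(x) = 0.5 * ||x - y||^2.
A = [[-1.0, 0.8], [-0.3, -1.0]]   # non-symmetric Jacobian J_F = A, stable
theta = [0.5, -0.2]
y = [1.0, 0.0]
beta, dt, steps = 0.1, 0.05, 4000

def free_state(th):
    """Solve F(x, th) = A x + th = 0 for x (2x2 Cramer's rule)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(-th[0] * A[1][1] + A[0][1] * th[1]) / det,
            (-A[0][0] * th[1] + A[1][0] * th[0]) / det]

def cost(x):
    return 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))

# Euler integration of Eqs. (62)-(63):
#   dm/dt = F(m, theta),   dd/dt = J_F^T d - beta * dC/dm.
m, d = [0.0, 0.0], [0.0, 0.0]
for _ in range(steps):
    F = [A[i][0] * m[0] + A[i][1] * m[1] + theta[i] for i in range(2)]
    grad_C = [m[i] - y[i] for i in range(2)]
    d_dot = [A[0][i] * d[0] + A[1][i] * d[1] - beta * grad_C[i] for i in range(2)]
    m = [m[i] + dt * F[i] for i in range(2)]
    d = [d[i] + dt * d_dot[i] for i in range(2)]

# Single-phase update of Eq. (70): -(1/beta) dH/dtheta with H = -d^T F(m, theta).
# Here dF/dtheta is the identity, so the update is simply d / beta.
update = [di / beta for di in d]

# Reference: central finite differences of C(x^0(theta)) with respect to theta.
eps = 1e-6
grad = []
for k in range(2):
    tp, tm = list(theta), list(theta)
    tp[k] += eps
    tm[k] -= eps
    grad.append((cost(free_state(tp)) - cost(free_state(tm))) / (2 * eps))

x0 = free_state(theta)
print(update, grad)
```

In this setting `m` relaxes to the free state `x0` regardless of β, as in Eq. (66), and `update` matches the negative of `grad`, i.e. a descent direction on the cost.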
+Since both the original and extended systems, given
+respectively in Eqs. (1)–(2) and Eq. (28), share the same cost
+at their respective stationary states, and because the                     E.1. AsymEP as the Linear Projection of Dyadic EP
+Jacobians of both models are invertible, applying the EP
+update rule to the extended system gives the correct gradient              As established in Appendix D.2, transforming the
+estimate for the parameters θ of the original system.                      2N-dimensional extended space (z, z′) into the mean state
+The final step of the proof is to establish the equivalence                m = (z + z′)/2 and the difference state d = z − z′ exactly
+between the standard parameter update rule in Eq. (8) and                  decouples the stationary dynamics. Because the stationary
+the modified rule used by Dyadic EP in Eq. (34). Indeed, if                state of m is the free state of the original system (m^β = x^0),
+we were to apply the standard update rule in the extended                  the cost function drives the difference variable to a stationary
+space, the update would be:                                                state d^β satisfying:
+
+                                                                               J_F^{\top}(x^0, \theta)\, d^\beta = \beta \frac{\partial C}{\partial x}(x^0)    (72)
+    \Delta\theta \propto -\frac{1}{2\beta} \left[ \frac{\partial H(z^\beta, z'^\beta, \theta)}{\partial \theta} - \frac{\partial H(z^{-\beta}, z'^{-\beta}, \theta)}{\partial \theta} \right].    (69)
+
+                                                                           To recover this exact error signal in an N-dimensional space,
+                                                                           we postulate a modified dynamical system F_A(x) compris-
+
+
+ing the standard EP dynamics and a spatial correction Γ(x):          F. Derivation of the Hopfield-like Energy
+    F_A(x) = F(x) - \beta \frac{\partial C}{\partial x}(x) + \Gamma(x)    (73)
+Let Δx = x_A^β − x^0 denote the displacement from the                In this section, we derive the explicit energy functional for
+free equilibrium. Expanding the stationarity condition               the Continuous Asymmetric Hopfield dynamics defined in
+F_A(x_A^β) = 0 to first order around x^0 yields:                     Eq. (35). The force field is given by:
+    J_F(x^0, \theta)\, \Delta x - \beta \frac{\partial C}{\partial x}(x^0) + \Gamma(x_A^\beta) \approx 0    (74)
+                                                                         F(x) = \rho'(x) \odot (J \rho(x)) - x.    (78)
+To ensure the first-order displacement matches the Dyadic
+EP error signal (i.e., Δx ≈ d^β), we substitute Eq. (72) into        We omit external inputs J^in for brevity, as they appear
+the expansion:                                                       symmetrically in the Jacobian. The variational Hamiltonian is
+                                                                     defined as:
+    \Gamma(x_A^\beta) = \left( J_F^{\top}(x^0, \theta) - J_F(x^0, \theta) \right) \Delta x    (75)
+                      = -2 A_J(x^0, \theta)\, (x_A^\beta - x^0)    (76)
+                                                                         H(z, z') = -(z - z')^{\top} F\left( \frac{z + z'}{2} \right) + \beta C\left( \frac{z + z'}{2} \right).    (79)
+This uniquely recovers the AsymEP augmented dynamics.
+Finally, to eliminate the O(β^2) error, AsymEP evaluates the         To analyze this expression, we introduce the midpoint
+centered difference of two opposite nudges:                          m = (z + z′)/2 and the difference d = z − z′. Since the
+    x_A^{\pm\beta} = x^0 \pm \beta \left. \frac{d x_A}{d\beta} \right|_{\beta = 0} + O(\beta^2)    (77)
+                                                                     separation between z and z′ is induced solely by the nudging
+Subtracting these states cancels the O(β^2) error, yielding          parameter β, the difference scales as ∥d∥ ∼ O(β). We therefore
+(1/2)(x_A^{+β} − x_A^{−β}) = d^β + O(β^3), successfully              neglect terms of order O(∥d∥^3) (equivalently, O(β^3)),
+recovering the exact post-synaptic update term.                      as they do not contribute to the gradient of the cost.
+                                                                     The activation at the midpoint can be approximated as:
+                                                                         \rho(m) = \frac{\rho(z) + \rho(z')}{2} + O(\|d\|^2).    (80)
+                                                                     Similarly, the difference in activations is:
+                                                                         \rho(z) - \rho(z') = \rho'(m) \odot d + O(\|d\|^3).    (81)
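The projection argument of E.1 can be illustrated on a linear toy model, where every stationary state is available in closed form. The sketch below computes the Dyadic signal d^β from Eq. (72), the AsymEP displacement obtained with the correction of Eqs. (75)-(76), and compares a one-sided nudge with the centered difference. The matrix `A`, the parameters `theta`, the target `y`, and the quadratic cost are assumptions made for illustration only.

```python
# Hypothetical linear, non-conservative system (illustrative assumption):
# F(x) = A x + theta, free equilibrium x0, cost C(x) = 0.5 * ||x - y||^2.
A = [[-1.0, 0.8], [-0.3, -1.0]]   # Jacobian J_F, deliberately non-symmetric
theta = [0.5, -0.2]
y = [1.0, 0.0]

def solve2(M, r):
    """Solve the 2x2 linear system M v = r by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]

AT = [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]   # J_F^T
x0 = solve2(A, [-theta[0], -theta[1]])          # free equilibrium: A x0 + theta = 0
g = [x0[i] - y[i] for i in range(2)]            # dC/dx evaluated at x0

def dyadic_signal(beta):
    """d^beta from Eq. (72): J_F^T d = beta * dC/dx(x0)."""
    return solve2(AT, [beta * g[0], beta * g[1]])

def asymep_displacement(beta):
    """x_A^beta - x0 for F_A = F - beta*dC/dx + (J_F^T - J_F)(x - x0).

    For this linear model the stationarity condition F_A = 0 reads
    (J_F^T - beta*I) dx = beta * g, because dC/dx(x) = dx + g.
    """
    M = [[AT[0][0] - beta, AT[0][1]], [AT[1][0], AT[1][1] - beta]]
    return solve2(M, [beta * g[0], beta * g[1]])

beta = 0.01
d = dyadic_signal(beta)
plus, minus = asymep_displacement(beta), asymep_displacement(-beta)
one_sided_err = max(abs(plus[i] - d[i]) for i in range(2))
centered_err = max(abs(0.5 * (plus[i] - minus[i]) - d[i]) for i in range(2))
print(one_sided_err, centered_err)
```

With these numbers the one-sided displacement differs from d^β at order β^2, while the centered difference agrees at order β^3, in line with Eq. (77) and the sentence that follows it.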
+                                                                     Inverting this relation, we express the state difference as:
+E.2. Physical Trade-offs and the Extended Space
+                                                                         z - z' = (\rho(z) - \rho(z')) \oslash \rho'(m) + O(\|d\|^3),    (82)
+We can view AsymEP and Dyadic EP as a space-time trade-off           where \oslash denotes elementwise division.
+of the same underlying physical optimization problem.                We substitute these expansions into the interaction term
+AsymEP preserves the original N-dimensional state space              of the Hamiltonian, H_int = −(z − z′)⊤ (ρ′(m) ⊙ Jρ(m)).
+of the network at the cost of temporal non-locality. The             Applying the identity a⊤(b ⊙ c) = (a ⊙ b)⊤ c, we obtain:
+system must evolve sequentially, requiring physical memory
+not only to store the free equilibrium x^0 for the                       H_{\mathrm{int}} = -\left( (z - z') \odot \rho'(m) \right)^{\top} J \rho(m)
+asymmetric correction, but also to store the successive                                   \approx -(\rho(z) - \rho(z'))^{\top} J \left( \frac{\rho(z) + \rho(z')}{2} \right).    (83)
+stationary states required to evaluate the contrastive
+gradient update. AsymEP thus serves as the direct, spatially         Expanding the product gives:
+minimal extension of EP.
+Dyadic EP provides a learning signal that is local in both               H_{\mathrm{int}} = -\frac{1}{2} \left[ \rho(z)^{\top} J \rho(z) + \rho(z)^{\top} J \rho(z') \right.
+space (where z − z′ encodes the gradient) and time (allowing                                \left. - \rho(z')^{\top} J \rho(z) - \rho(z')^{\top} J \rho(z') \right].    (84)
+the nudged phases to execute in parallel) at the cost of
+doubling the state space. In particular, capturing                   We decompose the connectivity matrix J into its symmetric
+non-conservative forces in this extended space requires a            part S and antisymmetric part A. The first and last terms
+specific bilinear coupling, rather than a trivial                    simplify to ρ(z)⊤ Sρ(z) and ρ(z′)⊤ Sρ(z′), respectively.
+superposition of uncoupled subsystems. It can be seen as a           The cross terms satisfy:
+blueprint for future neuromorphic hardware.
+Ultimately, the reduction of Dyadic EP to AsymEP via the                 \rho(z)^{\top} J \rho(z') - \rho(z')^{\top} J \rho(z) = \rho(z)^{\top} (J - J^{\top}) \rho(z')
+variables m and d proves the universality of EP’s variational                                                                 = \rho(z)^{\top} (2A) \rho(z').    (85)
+principle.
+
+
+Thus, the interaction term reduces to:                                  The input parameters are then updated using the standard
+                                                                        learning rule (21). In particular, the presynaptic term
+                                                                        associated with the input weights is given by
+    H_{\mathrm{int}} = -\frac{1}{2} \rho(z)^{\top} S \rho(z) + \frac{1}{2} \rho(z')^{\top} S \rho(z')
+                       - \rho(z)^{\top} A \rho(z') + O(\|d\|^3).    (86)
+                                                                            \frac{\partial F_i}{\partial J^{\mathrm{in}}_{kl}} = \delta_{ik}\, \rho'(x_i)\, u_l.    (93)
+Finally, for the nudging term, we expand the cost function
+around the midpoint:                                                    The presynaptic terms associated with the dynamical
+                                                                        parameters J^dyn_ij depend on the experiment.
+    C(m) = \frac{1}{2} \left( C(z) + C(z') \right) + O(\|d\|^2).    (87)
+                                                                        G.1. Symmetric Initialization
+When multiplying by β, the remainder term becomes β ·
+O(∥d∥^2). Since ∥d∥ ∼ O(β), this remainder is of order                  G.1.1. Learning Rules
+O(β^3) and can be consistently discarded alongside the
+third-order terms from the interaction expansion.                       For clarity, we write the learning rules for VF and AsymEP.
+                                                                        For the input weights, using (93), we have:
+Combining all these components, the final Hamiltonian is:
+                                                                            \Delta J^{\mathrm{in}}_{ik} \propto \frac{1}{2\beta} \left[ (x_i^{+\beta} - x_i^{-\beta})\, \rho'(x_i^0)\, u_k \right],    (94)
+    H(z, z') = -\frac{1}{2} \rho(z)^{\top} S \rho(z) + \frac{1}{2} \rho(z')^{\top} S \rho(z')
+               - \rho(z)^{\top} A \rho(z') + \frac{1}{2} \left( \|z\|^2 - \|z'\|^2 \right)
+               + \frac{\beta}{2} \left( C(z) + C(z') \right).    (88)
+                                                                        while for the recurrent weights, we get:
+                                                                            \Delta J^{\mathrm{dyn}}_{ij} \propto \frac{1}{2\beta} \left[ (x_i^{+\beta} - x_i^{-\beta})\, \rho'(x_i^0)\, \rho(x_j^0) \right].    (95)
+The saddle-point dynamics, given by Eq. (32), generated by
+this Hamiltonian are:                                                   For EP, we have:
+                                                                            \Delta J^{\mathrm{in}}_{ik} \propto \frac{1}{2\beta} \left[ \rho(x_i^{+\beta}) - \rho(x_i^{-\beta}) \right] u_k,    (96)
+    \frac{dz}{dt} = \rho'(z) \odot \left( S \rho(z) + A \rho(z') \right) - z - \frac{\beta}{2} \frac{\partial C}{\partial z},    (89)
+    \frac{dz'}{dt} = \rho'(z') \odot \left( S \rho(z') + A \rho(z) \right) - z' + \frac{\beta}{2} \frac{\partial C}{\partial z'}.    (90)
+                                                                        and for the recurrent weights:
+                                                                            \Delta J^{\mathrm{dyn}}_{ij} \propto \frac{1}{2\beta} \left[ \rho(x_i^{+\beta}) \rho(x_j^{+\beta}) - \rho(x_i^{-\beta}) \rho(x_j^{-\beta}) \right].    (97)
+
+                                                                        G.1.2. Supplementary Numerical Results
+This system recovers the original continuous Hopfield dy-               To complement Fig. 2, we report the evolution of the accu-
+namics when z = z ′ (assuming β = 0).                                   racy of the three methods in Fig. 4. We consider a layered
+                                                                        network with 50 hidden neurons. While this capacity is
+G. Experimental Details                                                 insufficient for state-of-the-art performance, it amplifies the
+                                                                        difference in accuracy between models to aid visualization.
+As in the main text, the neuronal dynamics are governed by              Models are trained for 20 epochs starting from a symmetric
+the vector field:                                                       configuration, the natural setting for both VF and EP. With
+                                                                        this initialization, AsymEP consistently outperforms the
+                                                                        other methods and learns faster by exploiting the additional
+                                                                        degrees of freedom of the asymmetric network.
+    F_i = \rho'(x_i) \left( \sum_j J^{\mathrm{dyn}}_{ij} \rho(x_j) + b_i(u) \right) - x_i,    (91)
+
+where the input-dependent bias b_i(u) is precomputed for                G.2. Fixed Asymmetry Ratio
+each MNIST input u as:                                                  This section details the implementation for the fixed
+                                                                        asymmetry ratio experiments presented in Section 5.2, followed
+                                                                        by complementary numerical results regarding learning
+                                                                        speed and induced Jacobian asymmetry.
+    b_i(u) = \sum_{l \in \mathrm{in}} J^{\mathrm{in}}_{il} u_l.    (92)
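The vector field of Eqs. (91)-(92) can be sketched in a few lines of plain Python. The example below precomputes the input-dependent bias once, evaluates the force, and relaxes the state with explicit Euler steps using the time step (0.5) and free-phase step count (20) listed for the symmetric-initialization column of Table 3. The network size (2 recurrent neurons, 3 inputs), the weight values, the input `u`, and the function names are illustrative assumptions, not the experimental configuration.

```python
import math

# Illustrative assumptions: 2 recurrent neurons, 3 inputs, hand-picked weights.
J_dyn = [[0.0, 0.3], [-0.2, 0.0]]
J_in = [[0.5, -0.4, 0.1], [0.2, 0.3, -0.6]]
u = [1.0, 0.5, 0.0]

def rho(v):
    return math.tanh(v)

def rho_prime(v):
    return 1.0 - math.tanh(v) ** 2

# Eq. (92): the input-dependent bias is computed once per input u.
bias = [sum(J_in[i][l] * u[l] for l in range(len(u))) for i in range(len(J_in))]

def force(x):
    """Eq. (91): F_i = rho'(x_i) * (sum_j J_dyn[i][j] * rho(x_j) + b_i(u)) - x_i."""
    n = len(x)
    out = []
    for i in range(n):
        drive = sum(J_dyn[i][j] * rho(x[j]) for j in range(n)) + bias[i]
        out.append(rho_prime(x[i]) * drive - x[i])
    return out

def relax(x, dt=0.5, n_steps=20):
    """Explicit Euler relaxation x <- x + dt * F(x), as in a free phase."""
    for _ in range(n_steps):
        F = force(x)
        x = [xi + dt * fi for xi, fi in zip(x, F)]
    return x

x_free = relax([0.0, 0.0])
residual = max(abs(f) for f in force(x_free))
print(bias, x_free, residual)
```

For these small weights the residual force after 20 steps is far below the scale of the state itself, so `x_free` can serve as an approximate free equilibrium; larger or more asymmetric weights may need more steps or a smaller time step.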
+This term projects the input space into the recurrent
+subspace. The bias yields a diagonal contribution to the                G.2.1. Learning Rules
+Jacobian J_F = ∂F/∂x, and therefore does not contribute to the
+antisymmetric correction used in the augmented dynamics                 Parametrization and notation. To enforce a fixed asymmetry
+Eq. (20) of AsymEP.                                                     ratio, we explicitly parameterize the independent elements
+                                                                        of Eq. (38). We introduce two parameter vectors θ^S
+
+
+         Parameter                                     Sym. Init. / Feedforward             Fixed rstr          Fixed rstr & rin
+                                                            sec. 5.1 & 5.3                   sec. 5.2              app. G.3
+         Learning Rate (Input-Hidden)                             0.05                       0.05                    0.0125
+         Learning Rate (Hidden-Output)                            0.01                       0.01                    0.0025
+         Time Step (Dynamics Integration)                          0.5                        0.3                      0.3
+         Nudging Parameter (β)                                     0.5                        0.5                      0.5
+         Free-phase Steps (nfree )                                  20                        30                       40
+         Nudged-phase Steps (nnudge )                               10                        10                       10
+         Number of Epochs                                        40 / 20                      30                       40
+         Batch Size                                                 64                        64                       64
+         Scaling Parameter γ                                       n.a.                       √60                      √60
+         Structure                                            784 - n.a. -10               784-50-10           all-to-all, 500 hid
+         Activation function ρ                                    tanh                       tanh                     tanh
+         Initial Recurrent State s                            s ∼ U(−1, 1)               s ∼ U(−1, 1)            s ∼ U(−1, 1)
+         Initial Parameters θ                                 θ ∼ N(0, 1/N)              θ ∼ N(0, 1/N)           θ ∼ N(0, 1/N)
+         Number of Runs (training + inference)                      10                        10                       10
+Table 3. Trained Model Hyperparameters on MNIST. N is the total number of neurons, U(−1, 1) is a uniform distribution, and N(µ, σ²)
+is a Gaussian distribution. For the r_str parametrization, we choose more cautious hyperparameters for training and inference compared to
+the symmetric initialization, due to increasingly non-conservative and potentially oscillatory dynamics.
+
+
+
+Figure 4. Evolution of the mean accuracy and standard deviation (over 10 runs) during training
+on MNIST for AsymEP, EP, and VF. Models use 50 hidden neurons.
+
+and θ^A of size M = N_dyn (N_dyn − 1)/2, which encode the off-diagonal elements of the symmetric
+and antisymmetric components S̃ and Ã, respectively. The correspondence between matrix and
+vector indices is given by:
+
+    k(i, j) = \frac{(i - 1)(i - 2)}{2} + j , \qquad (1 \le j < i \le N_{dyn})    (98)
+
+where the condition j < i selects the strictly lower triangular elements. Introducing an
+additional vector ξ for the diagonal elements of S̃, the full matrices are constructed as:
+
+    \tilde S_{ij} = \delta_{ij} \xi_i + (1 - \delta_{ij}) \theta^S_{k(\max(i,j), \min(i,j))} ,    (99)
+    \tilde A_{ij} = \epsilon_{ij} \theta^A_{k(\max(i,j), \min(i,j))} ,    (100)
+
+where ϵ_ij is the Levi-Civita symbol. The dynamical parameters are then given by:
+
+    J^{dyn}_{ij} = \gamma (c_S \tilde S_{ij} + c_A \tilde A_{ij}) ,    (101)
+
+with normalization coefficients
+
+    c_S = \frac{\sqrt{1 - r_{str}^2}}{F_S} , \qquad c_A = \frac{r_{str}}{F_A} ,    (102)
+
+defined in terms of the Frobenius norms:
+
+    F_S = \sqrt{\sum_{i=1}^{N} \xi_i^2 + 2 \sum_{k=1}^{M} (\theta^S_k)^2} ,    (103)
+    F_A = \sqrt{2 \sum_{k=1}^{M} (\theta^A_k)^2} .    (104)
+
+Presynaptic computation. The dependence of the normalization coefficients on the parameters
+introduces additional regularization terms in the learning rule compared to the
+parameterization of (Scellier & Bengio, 2017). The gradients of the normalization coefficients
+are:
+
+    \frac{\partial c_S}{\partial \theta^S_k} = -2 c_S \frac{\theta^S_k}{(F_S)^2} , \qquad
+    \frac{\partial c_S}{\partial \xi_m} = -c_S \frac{\xi_m}{(F_S)^2} ,    (105)
+
+    \frac{\partial c_A}{\partial \theta^A_k} = -2 c_A \frac{\theta^A_k}{(F_A)^2} .    (106)
+
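+As a consistency check on Eqs. (98)-(106), the short derivation below (an added sketch, not part
+of the original text) verifies that the normalization fixes the Frobenius norm of J^dyn to γ and
+recovers the first gradient in Eq. (105). It assumes only that S̃ is symmetric and Ã is
+antisymmetric, so that the two are orthogonal under the Frobenius inner product; the index
+example takes N_dyn = 3 purely for illustration.
+
+```latex
+% Index map of Eq. (98) for N_dyn = 3, so M = 3: the strictly lower triangle is enumerated row by row.
+k(2,1) = 1 , \qquad k(3,1) = 2 , \qquad k(3,2) = 3 .
+
+% Norms of the unnormalized matrices: each off-diagonal parameter appears in two entries.
+\|\tilde S\|_F^2 = \sum_i \xi_i^2 + 2 \sum_k (\theta^S_k)^2 = F_S^2 , \qquad
+\|\tilde A\|_F^2 = 2 \sum_k (\theta^A_k)^2 = F_A^2 , \qquad
+\sum_{i,j} \tilde S_{ij} \tilde A_{ij} = 0 .
+
+% Hence, with c_S and c_A from Eq. (102):
+\|J^{dyn}\|_F^2 = \gamma^2 \left( c_S^2 F_S^2 + c_A^2 F_A^2 \right)
+               = \gamma^2 \left( (1 - r_{str}^2) + r_{str}^2 \right) = \gamma^2 ,
+\qquad \|\gamma c_A \tilde A\|_F = \gamma \, r_{str} .
+
+% Gradient of c_S, using \partial F_S / \partial \theta^S_k = 2 \theta^S_k / F_S :
+\frac{\partial c_S}{\partial \theta^S_k}
+  = -\frac{\sqrt{1 - r_{str}^2}}{F_S^2} \cdot \frac{2 \theta^S_k}{F_S}
+  = -2 c_S \frac{\theta^S_k}{F_S^2} .
+```
+
+The antisymmetric part therefore always carries a fraction r_str of the total Frobenius norm,
+whatever values θ^S, θ^A, and ξ take during training.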
+
+          Parameter                                        Comparison Dyn.         2 hidden layers         3 hidden layers
+                                                              sec. 5.4                 sec. 5.4                sec. 5.4
+          Learning Rate (Input-Hidden)                          0.0016                 0.0013                   0.6
+          Learning Rate (Hidden-Hidden)                         0.0016                 0.0013                   0.6
+          Learning Rate (Hidden-Output)                         0.0016                 0.0013                   0.6
+          Time Step (Dynamics Integration)                        0.4                    0.3                  0.0075
+          Nudging Parameter (β)                                   0.3                    0.5                   0.20
+          Free-phase Steps (nfree )                               40                     40                     60
+          Nudged-phase Steps (nnudge )                            20                     20                     30
+          Number of Epochs                                        50                     40                     40
+          Batch Size                                              64                     64                     64
+          Layer Structure                                   784-500-200-10         784-500-500-10       784-500-500-500-10
+          Activation function ρ                                  tanh                   tanh                   tanh
+          Initial Recurrent State s                          s ∼ U(−1, 1)           s ∼ U(−1, 1)           s ∼ U(−1, 1)
+          Initial Parameters θ                               θ ∼ N(0, 1/N)          θ ∼ N(0, 1/N)          θ ∼ N(0, 1/N)
+          Number of Runs (training + inference)                   10                     10                     10
+Table 4. Trained Model Hyperparameters on Fashion-MNIST. N is the total number of neurons, U(−1, 1) is a uniform distribution, and
+N(µ, σ²) is a Gaussian distribution. For the r_str parametrization, we choose more cautious hyperparameters for training and inference
+compared to the symmetric initialization, due to increasingly non-conservative and potentially oscillatory dynamics.
+
+
+
+Combining these with the derivatives of the matrices S̃ and Ã, we have:
+
+    \frac{\partial \tilde S_{ij}}{\partial \theta^S_k} = \delta_{ip} \delta_{jq} + \delta_{iq} \delta_{jp} , \qquad
+    \frac{\partial \tilde S_{ij}}{\partial \xi_k} = \delta_{ij} \delta_{kj}    (107)
+
+    \frac{\partial \tilde A_{ij}}{\partial \theta^A_k} = \delta_{ip} \delta_{jq} - \delta_{iq} \delta_{jp} ,    (108)
+
+where k corresponds to the index pair (p, q) with p > q, as defined in Eq. (98). The full
+presynaptic terms are then:
+
+   • For the diagonal parameters ξ_m:
+
+         \frac{\partial F_i}{\partial \xi_m} = \gamma c_S \rho'(x_i) \left[ -\frac{\xi_m}{(F_S)^2}
+             \sum_{j=1}^{N} \tilde S_{ij} \rho(x_j) + \delta_{im} \rho(x_m) \right] .    (109)
+
+   • For the off-diagonal symmetric parameters θ^S_k (where p > q):
+
+         \frac{\partial F_i}{\partial \theta^S_k} = \gamma c_S \rho'(x_i) \left[ -2 \frac{\theta^S_k}{(F_S)^2}
+             \sum_{j=1}^{N} \tilde S_{ij} \rho(x_j) + \delta_{ip} \rho(x_q) + \delta_{iq} \rho(x_p) \right] .    (110)
+
+   • For the off-diagonal antisymmetric parameters θ^A_k (where p > q):
+
+         \frac{\partial F_i}{\partial \theta^A_k} = \gamma c_A \rho'(x_i) \left[ -2 \frac{\theta^A_k}{(F_A)^2}
+             \sum_{j=1}^{N} \tilde A_{ij} \rho(x_j) + \delta_{ip} \rho(x_q) - \delta_{iq} \rho(x_p) \right] .    (111)
+
+Initialization. To ensure the stability of the system, we initialize our parameters such that
+the variance of the dynamical parameters scales as Var[J^dyn_ij] ∝ 1/N_dyn. This is a
+conservative choice for the layered architectures used in our experiments, where many entries
+of J^dyn_ij are zero.
+
+In practice, we initialize the parameter vectors θ^S, θ^A, and ξ with identical variances σ².
+For large N_dyn, the expected Frobenius norms are approximately E[F_{S,A}] ≈ N_dyn σ: taking the
+parameters to have zero mean, E[F_S²] = N_dyn σ² + 2Mσ² = N_dyn² σ² and
+E[F_A²] = 2Mσ² = N_dyn (N_dyn − 1) σ². Consequently, the normalization coefficients become:
+
+    c_S \approx \frac{\sqrt{1 - r_{str}^2}}{N_{dyn} \sigma} , \qquad c_A \approx \frac{r_{str}}{N_{dyn} \sigma} .    (112)
+
+Since the symmetric and antisymmetric components are statistically independent, the variance
+of the weights is derived as follows:
+
+   • Diagonal elements (i = j):
+
+         \mathrm{Var}[J^{dyn}_{ii}] = \gamma^2 c_S^2 \sigma^2 \approx \gamma^2 \frac{1 - r_{str}^2}{N_{dyn}^2} .    (113)
+
+
+   • Off-diagonal elements (i ≠ j):
+
+         \mathrm{Var}[J^{dyn}_{ij}] = \gamma^2 (c_S^2 + c_A^2) \sigma^2 \approx \frac{\gamma^2}{N_{dyn}^2} ,    (114)
+
+To satisfy Var[J^dyn_ij] ∝ 1/N_dyn, we set:
+
+         \gamma = \sqrt{N_{dyn}}    (115)
+
+Note that by random matrix theory, diagonal elements do not affect stability in the large
+N_dyn limit.
+
+Potential Simplification. Although the parameterization above is fully general, a simpler
+construction is possible by removing self-connections (ξ = 0) and enforcing identical
+parameterization for the symmetric and antisymmetric components, i.e., θ^S = θ^A = θ. The
+matrix elements then become:
+
+    \tilde S_{ij} = (1 - \delta_{ij}) \theta_{k(\max(i,j), \min(i,j))} ,    (116)
+    \tilde A_{ij} = \epsilon_{ij} \theta_{k(\max(i,j), \min(i,j))} .    (117)
+
+In this case, the Frobenius norms are equal (F_S = F_A), and we can omit the explicit
+normalization:
+
+    J^{dyn}_{ij} = \sqrt{1 - r_{str}^2} \, \tilde S_{ij} + r_{str} \tilde A_{ij} .    (118)
+
+For a parameter θ_k corresponding to indices (p, q) with p > q, the presynaptic term is given
+by:
+
+    \frac{\partial F_i}{\partial \theta_k} = \rho'(x_i) \left[ \left( \sqrt{1 - r_{str}^2} + r_{str} \right) \delta_{ip} \rho(x_q)
+        + \left( \sqrt{1 - r_{str}^2} - r_{str} \right) \delta_{iq} \rho(x_p) \right] .    (119)
+
+While this parameterization works in simulations and keeps the number of parameters constant
+for all r_str, it constrains the asymmetry to be “homogeneous”, by which we mean that the
+asymmetry ratio is identical for every pair of neurons; hence, the network cannot learn to be
+symmetric in one region and antisymmetric in another. Therefore, we choose to explore the more
+general case of (38) in our experiments.
+
+G.2.2. SUPPLEMENTARY NUMERICAL RESULTS
+
+To complement the results of Fig. 3, we analyze the training efficiency as a function of the
+asymmetry ratio r_str and investigate the robustness of VF by monitoring the Jacobian
+asymmetry.
+
+Training efficiency. We first study the training efficiency of the two algorithms as a
+function of the asymmetry ratio r_str. Inspired by the related concept in (Cesa-Bianchi &
+Lugosi, 2006), we define the cumulative loss as the accumulated difference between the free
+equilibrium cost and a zero-cost baseline (perfect prediction) during learning. Specifically,
+for each method and value of r_str, we calculate the cumulative loss by summing the
+batch-averaged costs of the first 5 epochs (out of 30, to avoid saturation effects), and report
+the mean and standard deviation over 10 independent training runs. Mathematically, for each
+run:
+
+    \text{Cumul. Loss} = \sum_{\text{epoch}=1}^{5} \sum_{k=1}^{N_{batches}}
+        \left( \sum_{(x_0, u) \in B_k} \frac{C(x_0, u)}{|B_k|} \right) ,    (120)
+
+where B_k represents the k-th batch, and |B_k| denotes the number of examples in the batch.
+The parameters are updated after each batch step; consequently, the free equilibrium x_0 is
+inferred using the updated parameters and the current example u.
+
+Figure 5. Cumulative loss as defined by (120) over the first 5 epochs of training, for
+different asymmetry ratios r_str. We compare VF (orange) and AsymEP (blue), under two training
+regimes: training only J^in (dashed) or all parameters (solid).
+
+In Fig. 5, we observe that learning slows down for both algorithms when r_str ≳ 0.6. This
+behavior likely results from the increased difficulty of reaching a stationary state as the
+dynamics become strongly asymmetric. With a fixed number of inference steps, incomplete
+convergence degrades the accuracy of the gradient estimates, thereby slowing down the
+learning. Fig. 5 shows that while VF can eventually achieve competitive accuracy, it is
+consistently slower than AsymEP as soon as asymmetry is introduced.
+
+Jacobian asymmetry. We next examine how the structural asymmetry r_str is reflected in the
+Jacobian of the dynamics (35), given by:
+
+    \frac{\partial F_i}{\partial s_j} = (1 - \delta_{ij}) \rho'(x_i) J^{dyn}_{ij} \rho'(x_j)
+        + \delta_{ij} \left[ \rho'(x_i) \left( J^{dyn}_{ii} \rho'(x_i) \right) + \rho''(x_i) b_i - 1 \right] .    (121)
+
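+The split of Eq. (121) into diagonal and off-diagonal parts can be made explicit with a short
+added derivation (a sketch, not part of the original text). It assumes the parametrization of
+Eq. (101), J^dyn = γ (c_S S̃ + c_A Ã); the symbol D is introduced here only to name the
+diagonal bracket of Eq. (121).
+
+```latex
+% Write Eq. (121) as J_F = J_{F,off} + D, where D is the diagonal matrix given by the
+% bracket in Eq. (121). A diagonal matrix is symmetric, so it cancels in J_F - J_F^\top :
+(J_F - J_F^\top)_{ij} = \rho'(x_i) \left( J^{dyn}_{ij} - J^{dyn}_{ji} \right) \rho'(x_j) .
+
+% With Eq. (101), the symmetric part cancels and only the antisymmetric part survives:
+J^{dyn}_{ij} - J^{dyn}_{ji} = 2 \gamma c_A \tilde A_{ij}
+\quad \Longrightarrow \quad
+(J_F - J_F^\top)_{ij} = 2 \gamma c_A \, \rho'(x_i) \tilde A_{ij} \rho'(x_j) .
+```
+
+Under these assumptions the structural antisymmetry enters the Jacobian only through the
+products ρ′(x_i) ρ′(x_j), so the neuronal states can rescale it entry by entry but cannot
+create asymmetry where Ã_ij = 0.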
+In our layered architecture, the self-connections are zero (J^dyn_ii = 0). For the following
+analysis, we neglect all diagonal terms in the Jacobian (including external inputs and
+potential), since they do not contribute to the antisymmetric correction (20) and thus to the
+discrepancy between the performance of VF and AsymEP. Consequently, we define the following
+asymmetry ratio based solely on the off-diagonal Jacobian J_{F,off}:
+
+    r_{jac} = \frac{\| J_{F,off} - J_{F,off}^{\top} \|_F}{\| J_{F,off} \|_F} ,    (122)
+
+Figure 6. Asymmetry ratio of the Jacobian r_jac defined in equation (122) after training for
+different asymmetry ratios r_str. We compare VF (orange) and AsymEP (blue), under two training
+regimes: training only J^in (dashed) or all parameters (solid).
+
+The results are presented in Fig. 6. For each trained model and ratio r_str, we compute r_jac
+averaged over the stationary states of the first batch (64 images) across 10 independent runs.
+We observe that when structural asymmetry is strong and all parameters are trained, VF
+partially compensates for the asymmetry by adjusting the neuronal states. This can be
+understood by rewriting the ratio as:
+
+    r_{jac} = \frac{\left\| \rho'(x_i) \left( J^{dyn}_{ij} - (J^{dyn}_{ji})^{\top} \right) \rho'(x_j) \right\|_F}
+                   {\left\| \rho'(x_i) J^{dyn}_{ij} \rho'(x_j) \right\|_F} .    (123)
+
+Compared to the structural asymmetry ratio in Eq. (37), a value of r_jac < r_str indicates
+that the neuronal states effectively dampen the structural asymmetry, rendering the dynamics
+more symmetric. This symmetrization of the Jacobian appears without imposing an additional
+symmetrization penalty and could be enhanced using the method of (Laborieux & Zenke, 2022).
+This mechanism likely explains the superior performance of ‘All (VF)’ compared to ‘J^in (VF)’
+in Fig. 3, as the former is able to use the additional degrees of freedom to reduce the
+effective asymmetry at high r_str.
+
+G.3. Stability analysis with Fixed Asymmetry Ratio & Constrained Input Projections
+
+A complete stability analysis of the non-conservative dynamics trainable with AsymEP is beyond
+the scope of this work. Nevertheless, for the class of continuous Hopfield networks considered
+here, standard arguments from random matrix theory suggest that asymmetry inherently improves
+asymptotic stability.
+
+In the dynamics defined by Eq. (91), the linear leak term −x_i shifts the spectrum of the
+system’s Jacobian by −1. Consequently, local stability requires the largest real part among
+the eigenvalues of the effective weight matrix to be strictly less than 1. Assuming weights
+are initialized independently with variance σ², Girko’s circular law dictates that the
+eigenvalues of an asymmetric matrix uniformly populate a disk of radius σ√n in the complex
+plane. In contrast, imposing symmetry forces the eigenvalues onto the real line, broadening
+the spectral radius to 2σ√n according to Wigner’s semicircle law. As a result, asymmetric
+networks can stably accommodate larger variance in the weight initializations than their
+symmetric counterparts: in this idealized picture the bound σ√n < 1 becomes 2σ√n < 1 under
+symmetry, so the admissible variance σ² is four times larger in the asymmetric case.
+
+Asymmetry nevertheless introduces eigenvalues with nonzero imaginary parts and, consequently,
+damped oscillations. To study this effect experimentally in a controlled setting, we constrain
+the input projections J^in. In the experiments of the main text, fixing the structural
+asymmetry ratio r_str still allowed AsymEP to reduce oscillations by aligning and increasing
+the input projections J^in, thereby adding stabilizing diagonal contributions to the Jacobian.
+To isolate the network’s ability to suppress oscillations independently of the magnitude of
+the input drive, we further constrain the relative scale of J^in and J^dyn by imposing
+
+    r_{in} = \frac{\| J^{in} \|_F}{\| J^{dyn} \|_F} = \frac{\| J^{in} \|_F}{\gamma} ,    (124)
+
+where ∥J^dyn∥_F = γ following Eq. (101). Defining unscaled input projections J̃^in, we set
+
+    J^{in} = r_{in} \gamma \frac{\tilde J^{in}}{\| \tilde J^{in} \|_F}    (125)
+
+G.3.1. LEARNING RULES
+
+Reusing the notations of the previous section, we write J^in_il = γ c_in J̃^in_il with the
+normalization c_in = r_in / F_in, where F_in = ∥J̃^in∥_F. Applying the chain rule yields:
+
+    \frac{\partial F_i}{\partial \tilde J^{in}_{kl}} = \gamma c_{in} \rho'(x_i) \left[ \delta_{ik} u_l
+        - \frac{\tilde J^{in}_{kl}}{F_{in}^2} \sum_m \tilde J^{in}_{im} u_m \right] .    (126)
+
+And for γ we have:
+
+    \frac{\partial F_i}{\partial \gamma} = \frac{1}{\gamma} (F_i + x_i) .    (127)
+
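+The derivatives with respect to the unscaled input weights and to γ follow from a short
+chain-rule computation, sketched here as an added illustration. It assumes that F_i depends on
+the input weights only through a term ρ′(x_i) b_i(u), with b_i(u) from Eq. (92), and that
+F_i + x_i is proportional to γ (both J^dyn and J^in carry one factor of γ). These assumptions
+are inferred from the prefactors of Eqs. (126) and (127) rather than quoted from Eq. (91), and
+the symbol G_i is introduced here for illustration only.
+
+```latex
+% Normalization: c_{in} = r_{in} / F_{in} with F_{in}^2 = \sum_{k,l} (\tilde J^{in}_{kl})^2 .
+\frac{\partial F_{in}}{\partial \tilde J^{in}_{kl}} = \frac{\tilde J^{in}_{kl}}{F_{in}} ,
+\qquad
+\frac{\partial c_{in}}{\partial \tilde J^{in}_{kl}} = -c_{in} \frac{\tilde J^{in}_{kl}}{F_{in}^2} .
+
+% Assumed input term: \rho'(x_i) b_i = \gamma c_{in} \rho'(x_i) \sum_m \tilde J^{in}_{im} u_m .
+\frac{\partial F_i}{\partial \tilde J^{in}_{kl}}
+  = \gamma \rho'(x_i) \left[ c_{in} \delta_{ik} u_l
+    + \frac{\partial c_{in}}{\partial \tilde J^{in}_{kl}} \sum_m \tilde J^{in}_{im} u_m \right]
+  = \gamma c_{in} \rho'(x_i) \left[ \delta_{ik} u_l
+    - \frac{\tilde J^{in}_{kl}}{F_{in}^2} \sum_m \tilde J^{in}_{im} u_m \right] .
+
+% Scale parameter: if F_i + x_i = \gamma G_i with G_i independent of \gamma, then
+\frac{\partial F_i}{\partial \gamma} = G_i = \frac{1}{\gamma} (F_i + x_i) .
+```
+
+Unlike the symmetric case of Eq. (110), no factor of 2 appears in the normalization term,
+because each entry of J̃^in contributes to F_in² only once.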
+G.3.2. S UPPLEMENTARY N UMERICAL R ESULTS
+                                                                      Figure 7. Comparison of AsymEP and VF on a feedforward net-
+Table 5 reports a worst-case control experiment where the
+                                                                      work. Test accuracy on MNIST is shown as a function of training
+structural asymmetry is fixed at rstr = 0.7 while the input           epochs for a single-hidden-layer network with 20 neurons. Curves
+scale ratio rin is varied. The experiment uses an all-to-all          report the mean and standard deviation over 10 runs. Best accura-
+architecture on MNIST (excluding direct input-to-output               cies are 92.7% ± 0.5% (AsymEP) and 64.3% ± 2.0% (VF).
+connections). The output variance during extended infer-
+ence (steps 30-50) confirms that the system successfully
+learns to suppress oscillations even when rin is severely re-         G.5. Advantages of Non-Conservatives Dynamics
+stricted. Any small residual oscillations can be mitigated by         In Section 5.4, we compare three (non-)conservative dynam-
+time-averaging over the inference steps.                              ics under varying constraints. To further evaluate learning
+Finally, rin can be interpreted as a measure of the external          speed, Table 6 reports network performance after a sin-
+signal magnitude relative to the internal recurrent dynamics.         gle epoch. These results confirm our earlier observation:
+These results indicate that the system remains capable of             AsymEP learns faster than VF.
+learning and stabilizing even under a low external input
+drive. Even when the input projection ∥J in ∥F is 100 times           G.6. Feedforward CIFAR-10 Experiments
+smaller than the recurrent connections ∥J dyn ∥F , the network        This appendix details the architecture and hyperparameters
+still achieves 36.34 ± 6.25% accuracy, which is well above            of the deep feedforward experiments comparing backprop-
+chance level (∼ 10%).                                                 agation (BP), VF, AsymEP and Dyadic EP on CIFAR-10
+                                                                      (see subsection 5.5)
+G.4. Feedforward Network
+G.4.1. L EARNING RULES                                                Architecture. We use a nine-layer convolutional network
+                                                                      (denoted CNN9). The first eight layers are convolutional
+For clarity, we write the learning rules for VF and AsymEP            with 3 × 3 kernels and zero-padding; spatial downsampling
+in a feedforward architecture with one hidden layer using             is performed by strided convolutions (stride 2 on layers 2, 4,
+the notation of Section 5.3. For the input weights connecting         6, 8 and stride 1 otherwise), so no pooling is used. The chan-
+to the hidden layer, we get the usual formula:                        nel widths follow the sequence 3 → 64 → 64 → 128 →
+                                                                      128 → 256 → 256 → 512 → 512, reducing the spatial
+                  1 h +β    −β      0
+                                         i
+                                                                      resolution from 32 × 32 to 2 × 2. The last layer is a fully
+         in
+       ∆Jik ∝        (hi − hi )ρ′ (hi )uk ,             (128)
+                 2β
+
+while for the feedforward weights connecting the hidden to the output layer, we get:
+
+  \Delta (W_{h \to o})_{ji} \propto \frac{1}{2\beta} \left[ \left( o_j^{+\beta} - o_j^{-\beta} \right) \rho'(o_j^{0}) \, \rho(h_i^{0}) \right].        (129)
+
+Note that EP is not applicable in this case.
+
+G.4.2. Supplementary Numerical Results
+
+In addition to the final accuracy reported in Sec. 5.3, we show in Fig. 7 the evolution of the
+accuracy over 20 epochs for AsymEP and VF.
+
+Table 5. Output variance and final test accuracy (%) on MNIST across different values of r_in,
+with r_str = 0.7 (mean ± std over 10 runs; 500 hidden units, all-to-all, no input-output).
+
+             Output variance                                      Test acc. (%)
+  r_in       Untrained                  Epoch 80                  Epoch 80
+  0.01       (3.38 ± 0.90) × 10^{-4}    (5.22 ± 2.34) × 10^{-5}   36.34 ± 6.25
+  0.10       (2.33 ± 0.48) × 10^{-4}    (1.39 ± 0.17) × 10^{-4}   90.54 ± 0.19
+  0.50       (1.34 ± 0.32) × 10^{-5}    (1.06 ± 0.25) × 10^{-6}   94.96 ± 0.10
+  1.00       (6.27 ± 1.24) × 10^{-7}    (1.75 ± 0.50) × 10^{-8}   96.30 ± 0.09
+
+Table 6. Test accuracy (%) on Fashion-MNIST at Epoch 1 (mean ± std over 10 runs). The table
+compares three classes of network dynamics: Continuous Hopfield (CH), Predictive Coding (PC),
+and Standard dynamics. Each is evaluated under three connectivity structures: Asymmetric
+(Asym, B_k ≠ W_{k+1}^⊤), Symmetric/conservative (Sym, B_k = W_{k+1}^⊤), and Feedforward
+(Feedfor, B_k = 0).
+
+  Dynamics   Connectivity   EP              AsymEP          VF
+  CH         Asym           -               74.91 ± 0.45    48.98 ± 4.09
+  CH         Feedfor        -               74.36 ± 0.29    48.84 ± 3.46
+  CH         Sym            74.57 ± 0.43    -               -
+  PC         Asym           -               77.83 ± 0.47    57.75 ± 3.37
+  PC         Sym            76.23 ± 0.39    -               -
+  Standard   Asym           -               76.87 ± 0.51    61.50 ± 4.06
+  Standard   Feedfor        -               77.92 ± 0.51    63.98 ± 0.73
+
+connected readout mapping the 512 × 2 × 2 feature map to the 10 class logits. All hidden units
+use a ReLU nonlinearity. Weights are initialized with the Kaiming scheme
+(σ = \sqrt{2/\text{fan-in}}) and biases at zero.
+
+Training setup. All methods are trained for 40 epochs with batch size 64 and repeated over
+5 seeds. Inputs are normalized per channel and augmented with random 32 × 32 crops (padding 4),
+random horizontal flips and Cutout (one 8 × 8 patch). Parameters are updated with SGD with
+momentum 0.9, weight decay 5 × 10^{-4} and gradient-norm clipping at 1, under a cosine
+learning-rate schedule decaying from 3.5 × 10^{-2} to 2 × 10^{-4}. Test accuracy is reported on
+an exponential moving average of the weights (decay 0.9995). The targets are smoothed
+(ε = 0.1), which for the EP methods amounts to nudging toward the smoothed one-hot vector.
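The training setup above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the per-step form of the cosine schedule, and the smoothing convention (1 - ε) · one-hot + ε / num_classes are assumptions chosen for illustration. Only the numerical values (3.5 × 10^{-2}, 2 × 10^{-4}, ε = 0.1, decay 0.9995, 10 classes) come from the text.

```python
import math


def cosine_lr(step, total_steps, lr_max=3.5e-2, lr_min=2e-4):
    """Cosine decay from lr_max at step 0 to lr_min at step total_steps.

    Assumption: the schedule is evaluated per step; the text does not say
    whether it is stepped per batch or per epoch.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def smooth_targets(label, num_classes=10, eps=0.1):
    """Smoothed one-hot target: (1 - eps) * one_hot + eps / num_classes.

    Assumption: this is one common smoothing convention; the text only gives eps.
    """
    base = eps / num_classes
    return [base + (1.0 - eps) * (1.0 if c == label else 0.0) for c in range(num_classes)]


def ema_update(ema_weights, weights, decay=0.9995):
    """One exponential-moving-average step: ema <- decay * ema + (1 - decay) * w."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


if __name__ == "__main__":
    print(cosine_lr(0, 100), cosine_lr(50, 100), cosine_lr(100, 100))
    print(smooth_targets(3))
    print(ema_update([1.0, 0.0], [0.0, 1.0]))
```

With these conventions, the EP-based methods would nudge the output toward the vector returned by `smooth_targets` rather than toward a hard one-hot label, and test accuracy would be evaluated with the averaged weights produced by repeated calls to `ema_update`.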
+
+Relaxation hyperparameters. The four methods differ only in the gradient estimator: BP uses
+automatic differentiation, while the EP-based methods contrast stationary states of the
+corresponding relaxation dynamics. VF uses an integration step η = 1.0, nudging β = 0.1, and at
+most K = 1000 relaxation steps with an early-stopping threshold of 9 × 10^{-6} on the mean
+state update. Dyadic EP uses the same settings except for a nudging strength β = 0.1. AsymEP
+uses a smaller step η = 0.5, nudging β = 0.1, and up to K = 250 relaxation steps with a
+threshold of 1 × 10^{-4}.
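The relaxation procedure described above can be sketched as a fixed-step Euler iteration with early stopping. This is a hedged illustration, not the authors' implementation: `relax`, `toy_field` and `SETTINGS` are hypothetical names, the toy vector field is a two-unit linear system with an antisymmetric (hence non-conservative) coupling chosen only so the loop converges, and "mean state update" is assumed to mean the mean absolute change per unit. The values of η, K and the thresholds are the ones quoted in the text for VF and AsymEP; the nudging strength β does not enter this free-phase sketch.

```python
# Hypothetical sketch: relax a state to a stationary point of ds/dt = field(s).

SETTINGS = {
    "VF": {"eta": 1.0, "max_steps": 1000, "tol": 9e-6},
    "AsymEP": {"eta": 0.5, "max_steps": 250, "tol": 1e-4},
}


def relax(field, state, eta, max_steps, tol):
    """Iterate s <- s + eta * field(s) for at most max_steps steps.

    Stops early once the mean absolute state update drops below tol.
    Returns the final state and the number of steps taken.
    """
    s = list(state)
    for step in range(1, max_steps + 1):
        delta = [eta * v for v in field(s)]
        s = [si + di for si, di in zip(s, delta)]
        if sum(abs(d) for d in delta) / len(delta) < tol:
            return s, step
    return s, max_steps


def toy_field(s):
    """Toy field f(s) = -s + A s + b with antisymmetric A (an assumption for illustration)."""
    a = 0.3
    return [-s[0] + a * s[1] + 1.0, -s[1] - a * s[0] + 0.5]


if __name__ == "__main__":
    for name, cfg in SETTINGS.items():
        final_state, steps = relax(toy_field, [0.0, 0.0], **cfg)
        print(name, steps, final_state)
```

In an EP-style estimator the same loop would be run once per nudged phase (for example at +β and -β), and the resulting stationary states contrasted; that step is omitted here because it depends on the specific dynamics of each method.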
+
+\ No newline at end of file