Equilibrium Propagation for Non-Conservative Systems


                                                Antonino Emanuele Scurria 1 Dimitri Vanden Abeele 1 Bortolo Matteo Mognetti 2 Serge Massar 1


                                                                  Abstract                                      from inference, the transmission of nonlocal error signals,
                                                                                                                and synchronous layer-wise computations with explicit gra-
                                             Equilibrium Propagation (EP) is a physics-
                                                                                                                dient storage. These constraints have no clear analog in
                                             inspired learning algorithm that uses stationary
                                                                                                                physical systems, making backpropagation challenging to
                                             states of a dynamical system both for inference
                                                                                                                implement in neuromorphic or analog hardware. Conse-
arXiv:2602.03670v2 [cs.LG] 1 Jun 2026


                                             and learning. In its original formulation it is
                                                                                                                quently, understanding how credit assignment can instead
                                             limited to conservative systems, i.e. to dynam-
                                                                                                                emerge from intrinsic system dynamics, through local inter-
                                             ics which derive from an energy function. Given
                                                                                                                actions and continuous relaxation, is a central question in
                                             their applications, it is important to extend EP
                                                                                                                neuroscience and machine learning.
                                             to non-conservative systems, i.e. systems with
                                             non-reciprocal interactions. Previous attempts to                  Equilibrium Propagation (EP) (Scellier &Bengio, 2017)
                                             generalize EP to such systems failed to compute                    represents one of the most promising advances in this direc-
                                             the exact gradient of the cost function. Here we                   tion. It formulates supervised learning as a contrast between
                                             propose a framework that extends EP to arbitrary                   two stationary states of a dynamical system: a ‘free’ phase
                                             non-conservative systems, including feedforward                    where the system evolves autonomously, and a ‘nudged’
                                             networks. We keep the key property of equilib-                     phase where outputs are weakly pushed toward their targets.
                                             rium propagation, namely the use of stationary                     The local change in neural states between these phases re-
                                             states both for inference and learning. However,                   covers the exact gradient of the cost function with respect to
                                             we modify the dynamics in the learning phase by                    parameters. This enables spatially local learning exploiting
                                             a term proportional to the non-reciprocal part of                  the continuous relaxation of the system without a distinct
                                             the interaction so as to obtain the exact gradient                 backward circuit or explicit weight transport.
                                             of the cost function. This algorithm can also be
                                                                                                                Since its introduction, several works have sought to im-
                                             derived using a variational formulation that gen-
                                                                                                                prove the practicality and biological realism of EP. Algo-
                                             erates the learning dynamics through an energy
                                                                                                                rithmic adaptations include enforcing temporal locality to
                                             function defined over an augmented state space.
                                                                                                                avoid state storage (Ernoult et al., 2020; Falk et al., 2025),
                                             Numerical experiments show that this algorithm
                                                                                                                deriving agnostic updates for black-box energies (Scel-
                                             achieves better performance and learns faster than
                                                                                                                lier et al., 2022), and substituting nudging with clamping
                                             previous proposals.
                                                                                                                (Stern et al., 2021). Theoretically, the framework has been
                                                                                                                extended to stochastic systems (Scellier &Bengio, 2017;
                                                                                                                Massar &Mognetti, 2025) and Lagrangian dynamics for
                                        1. Introduction                                                         time-varying inputs (Massar, 2025; Pourcel et al., 2025;
                                        Standard neural network optimization relies on error back-              Berneman &Hexner, 2025). In parallel, simulations have
                                        propagation, an algorithm whose computational mechanism                 explored suitable substrates, ranging from spiking (Mar-
                                        is difficult to reconcile with biological (Crick, 1989) and             tin et al., 2021; O’Connor et al., 2019) and resistive net-
                                        physical implementations (Indiveri &Liu, 2015). Specif-                 works (Kendall et al., 2020) to coupled oscillators (Wang
                                        ically, backpropagation requires a backward pass distinct               et al., 2024; Rageau &Grollier, 2025), as well as quantum
                                                                                                                systems (Wanjura &Marquardt, 2025; Massar &Mognetti,
                                            1
                                              Laboratoire d’Information Quantique (LIQ) CP224, Université       2025; Scellier, 2024). Experimental realizations have been
                                        libre de Bruxelles (ULB), Av. F. D. Roosevelt 50, 1050 Bruxelles,       demonstrated in memristor crossbars (Yi et al., 2023), self-
                                        Belgium 2 Interdisciplinary Center for Nonlinear Phenomena and
                                        Complex Systems CP231, Université libre de Bruxelles (ULB), Av.         adjusting electrical circuits (Dillavou et al., 2022; 2024),
                                        F. D. Roosevelt 50, 1050 Bruxelles, Belgium. Correspondence to:         elastic networks (Altman et al., 2024), and classical Ising
                                        Antonino Emanuele Scurria <antonino.scurria@ulb.be>.                    models trained on quantum annealers (Laydevant et al.,
                                                                                                                2024).
                                        Proceedings of the 43 rd International Conference on Machine
                                        Learning, Seoul, South Korea. PMLR 306, 2026. Copyright 2026            Despite these recent developments and the theoretical el-
                                        by the author(s).

                                                                                                            1
                                    Equilibrium Propagation for Non-Conservative Systems

egance of EP, its standard formulation remains restricted             a framework where the original dynamics serve for infer-
to conservative systems. In these systems, dynamics are               ence, while a new augmented dynamic is used to compute
derived from an energy function, which inherently enforces            gradients of the cost Eq. (2). In this augmented phase, the
symmetry (e.g., symmetric synaptic connections Jij = Jji )            output neurons are nudged towards their targets (as in stan-
through the action-reaction principle. This constraint pre-           dard EP), while a local corrective term – proportional to the
cludes the use of EP in a broad class of models characterized         antisymmetric part of the Jacobian at the free equilibrium
                                                                                       ∂
by non-conservative forces. This includes the feedforward             JF (x0 , θ, u) = ∂x F (x0 , θ, u) – is added to the forces. The
architectures dominant in modern AI, biological circuits,             exact gradients of the cost with respect to parameters are
as well as physical systems that reach stationary states far          then obtained by contrasting stationary states of the aug-
from thermodynamic equilibrium, such as nonlinear optical             mented system.
systems driven by external lasers (Cin et al., 2025), opto-
                                                                      Second, we introduce Dyadic EP, a ‘variational’ approach
electronic systems (Kalinin et al., 2025), exciton-polariton
                                                                      to learning in non-conservative systems. This method in-
condensates (Sajnok &Matuszewski, 2025), active meta-
                                                                      volves doubling the number of variables in the system’s
materials (Brandenbourger et al., 2019) and active colloids
                                                                      state space and subsequently introducing a new energy func-
(Bishop et al., 2023; Osat &Golestanian, 2023) (see (Bowick
                                                                      tion in this extended space. This approach takes advantage
et al., 2022) for a review).
                                                                      of the extended space to execute the positive and negative
Formally, we consider a dynamical system governed by                  nudging phases in parallel, recovering the same computa-
a non-reciprocal force field F (x, θ, u), which relaxes to a          tional cost as AsymEP. Derived from first principles, this
stationary configuration x0 satisfying:                               approach is inspired by established methods for mapping
                                                                      dissipative dynamical systems onto conservative ones by
                      F (x0 , θ, u) = 0,                   (1)        doubling the degrees of freedom (Bateman, 1931; Galley,
where x represents the state variables, θ the learnable param-        2013; Aykroyd et al., 2025). A more comprehensive study
eters and u the static input. Our goal, given a target y(u), is       of the theoretical framework and its application to feedfor-
to compute the gradient of the cost function C(x0 , y) at this        ward networks can be found in (Scurria, 2026). Our method
equilibrium,                                                          is related to the Dual Propagation algorithm (Høier et al.,
                          dC 0                                        2023; Høier &Zach, 2023; 2024) and constitutes an inde-
                              (x , y),                      (2)       pendent, first-principles generalization of Dyadic Learning
                          dθ
and update θ to minimize the cost.                                    (Nest &Høier; Høier et al., 2024)—previously limited to
                                                                      Hopfield networks—to arbitrary force fields.
Previous attempts to extend EP to non-conservative dynam-
ics include the Vector Field (VF) algorithm (Scellier et al.,         Third, we validate our framework on MNIST (LeCun, 1998),
2018). However, as noted by the authors, this method pro-             Fashion-MNIST, and CIFAR-10. In continuous Hopfield
vides an unbiased gradient of the cost Eq. (2) only in the            networks initialized with symmetric connection matrices,
conservative case. To mitigate this, (Laborieux &Zenke,               AsymEP achieves better accuracy and learns faster than
2024) proposed adding a penalty to keep the Jacobian close            EP and VF. Additionally, when we constrain the network
to symmetry, essentially forcing the system to be as con-             to have a strong degree of structural asymmetry, in which
servative as possible. Alternative methods related to VF,             case EP is inapplicable, AsymEP outperforms VF. Finally,
which similarly do not compute the exact gradient, were               when we restrict connections to a feedforward structure, our
proposed in (Farinha et al., 2020; Costa &Santos, 2025) and           algorithm effectively trains all parameters; in contrast, VF
for specific systems in simulation (Cin et al., 2025; Sajnok          is limited to training the last layer, acting essentially as an
&Matuszewski, 2025).                                                  Extreme Learning Machine (Huang et al., 2006; Wang et al.,
                                                                      2022) with poor performance.
Conversely, generalizations of backpropagation can handle
non-reciprocal forces and compute the exact gradient of               In summary, this theoretical work proposes two generaliza-
the cost Eq. (2) but inherit the same challenges in physical          tions of EP beyond conservative systems to arbitrary differ-
implementations. For instance, Backpropagation Through                entiable dynamics that compute in their stationary states.
Time (Werbos, 1990) unfolds the network in time to ap-
ply standard backpropagation, Recurrent Backpropagation               2. Equilibrium Propagation Overview
(Almeida, 1990; Pineda, 1987) avoids this memory require-
ment but still requires a specific circuit to propagate errors,       2.1. Conservative Systems
and the continuous Adjoint Method (Chen et al., 2018) addi-           We first review standard Equilibrium Propagation (EP)
tionally requires integrating the dynamics backward in time           (Scellier &Bengio, 2017). We consider a network described
which is not physically possible for a dissipative system.            by an energy function E(x, θ, u), such that the force field is
In this paper, we first propose Asymmetric EP (AsymEP),

                                                                  2
                                    Equilibrium Propagation for Non-Conservative Systems

derived from the potential E:                                        stationary point, i.e., that Eq. (7) holds. Second, EP implic-
                                                                                                                     ∂
                                                                     itly assumes that the Jacobian JE (x0 , u) = ∂x   FE (x0 , u) is
                                 ∂
              FE (x, θ, u) = −      E(x, θ, u).            (3)       invertible. In this work, we assume this condition holds and
                                 ∂x                                  will not state it explicitly hereafter. Third, for simplicity,
The objective is to compute the total gradient dC   0                we omit the dependency on the input u and target y in the
                                               dθ (x , y) of a
(quadratic) cost function C(x, y) evaluated at the minimum           following equations.
energy configuration of the system. This free equilibrium
denoted x0 (which depend implicitly in θ and u), satisfies           2.2. Vector Field
the stationarity condition:
                                                                     The Vector Field (VF) algorithm, introduced in (Scellier
                    ∂                                                et al., 2018), is an early attempt to adapt EP to non-
                   − E(x0 , θ, u) = 0.                     (4)       reciprocal forces. This method relies on the observation
                    ∂x
                                                                     that, for conservative systems, linearizing the right-hand
To compute gradients, we introduce the augmented energy              side of Eq. (9) around the equilibrium point x0 yields
functional:
                                                                                 ∂E(xβ , θ) ∂E(x−β , θ)
                                                                                                         
        ET (x, θ, β, u, y) = E(x, θ, u) + βC(x, y),        (5)              1
                                                                       lim                  −
                                                                       β→0 2β        ∂θ           ∂θ
where β is a scalar nudging parameter. The stationary config-                                         ⊤  β       (10)
                                                                                                          x − x−β
                                                                                      
                                                                                          ∂FE 0
uration of this augmented system is obtained by integrating                     = lim −       (x , θ)              ,
                                                                                  β→0      ∂θ                2β
the dynamics
                 dx    ∂ET (x, θ, β, u)                              where FE = −∂x E(x, θ) is the conservative force. It is
                    =−                  ,                  (6)       therefore tempting to use the right-hand side of Eq. (10) for
                 dt         ∂x
                                                                     parameter updates of non-conservative systems, for which
until the energy minimum is reached. This new fixed point            no energy function E exists.
xβ , called nudged equilibrium, satisfies:
                                                                     The VF algorithm adopts precisely this approach. It uses
             ∂E(xβ , θ, u)    ∂C(xβ , y)                             the nudged counterpart of Eq. (7),
                           +β            = 0.              (7)
                ∂x              ∂x
                                                                                                        ∂C β
                                                                                       F (xβ , θ) − β      (x ) = 0,             (11)
The training procedure, as improved in (Laborieux et al.,                                               ∂x
2021), uses two nudged phases with factors ±β (with
β ̸= 0). Starting from x0 , the system relaxes to two                in conjunction with the learning rule Eq. (10):
nearby perturbed equilibria, x+β and x−β . The displace-                                                  ⊤  β
                                                                                                              x − x−β
                                                                                                                     
ment x+β − x−β is then used to compute the parameter                                         ∂F 0
                                                                               ∆θ = ϵ           (x , θ)                 .        (12)
update in the learning rule:                                                                 ∂θ                  2β

              1 ∂E(xβ , θ, u) ∂E(x−β , θ, u)
                                              
  ∆θ = −ϵ                      −                  , (8)              However, as noted in (Scellier et al., 2018), Eq. (12) does
             2β         ∂θ             ∂θ                            not align with the true gradient dC      0
                                                                                                         dθ (x ) and is exact only if
where ϵ > 0 is the learning rate. The theoretical foundation         the force is conservative. To see this, let JF (x, θ) denote
of EP is the result that, in the limβ→0 of Eq. (8), we get:          the Jacobian of the vector field F (x, θ) (in components
                                                                     (JF (x, θ))ij = ∂F∂xi (x,θ)
                                                                                            j
                                                                                                 ). Differentiating the equilibrium
              dC(x0 , y)    d ∂E(xβ , θ, u)                                         0
                                                                     condition F (x , θ) = 0 with respect to θ gives
                         =                  ,              (9)
                dθ         dβ     ∂θ
                                                                                                  dx0   ∂F 0
see Appendix D.1. The error of the above method is O(β 2 ).                         JF (x0 , θ)       +    (x , θ) = 0.          (13)
This error can be further reduced using holomorphic equi-                                         dθ    ∂θ
librium propagation (Laborieux &Zenke, 2022).                        Consequently, the exact gradient of the cost is
Thus, EP recovers the exact gradient of the cost function                               ⊤
using only local computations. In this manner, learning              dC 0      dx0 ∂C 0
                                                                        (x ) =         (x )
implements gradient descent without an explicit backward             dθ        dθ ∂x
                                                                                             ⊤ 
pass, and credit assignment is realized through the system’s
                                                                                                                       
                                                                                  ∂F 0             ⊤ 0
                                                                                                             −1 ∂C 0
intrinsic relaxation dynamics.                                               =−      (x , θ)      JF (x , θ)        (x ) .
                                                                                  ∂θ                             ∂x
                                                                                |     {z      }|           {z           }
Three remarks can be made at this point. First, EP does not                              pre-synaptic            post-synaptic
require the system to be at an energy minimum, but only at a                                                                     (14)

                                                                 3
                                         Equilibrium Propagation for Non-Conservative Systems

The terms ’pre-synaptic’ and ’post-synaptic’ in Eq. (14)               Algorithm 1 Asymmetric EP (AsymEP)
are used by analogy with neuronal transmission: the pre-                1: Inputs: Force field F (x, θ), cost function C(x), nudg-
synaptic factor captures the local influence of θ on the force               ing parameter β, learning rate ϵ.
F , while the post-synaptic factor is the sensitivity of the            2: repeat
cost to state perturbations.                                            3:   1. Free Phase: Evolve to stationary state
If instead we differentiate the nudged equilibrium condition            4:      Evolve the system dynamics
in Eq. (11) with respect to β and evaluate at β = 0, we                 5:
                                                                                                    dx
obtain                                                                                                 = F (x, θ),                   (17)
                           β
                                                                                                    dt
                          dx             ∂C 0
            JF (x0 , θ)              −      (x ) = 0,       (15)        6:        until convergence to the stationary state x0 .
                          dβ   β=0       ∂x                             7:     2. Jacobian Decomposition
which gives                                                             8:        Compute the Jacobian at equilibrium:
                                                                        9:
          dxβ                 −1 ∂C 0                                                                     ∂F 0
                 = JF (x0 , θ)       (x , y).               (16)                               JF (x0 , θ) =   (x , θ),       (18)
          dβ β=0                  ∂x                                                                       ∂x
The right-hand side of Eq. (16) represents the effective post-         10:        and decompose it in its antisymmetric part:
                                                                       11:
synaptic term used by the VF algorithm (Eq. 12). Compar-
ing this with the exact post-synaptic term derived in Eq. (14),
                                                                                  AJ (x0 , θ) = 12 (JF (x0 , θ) − JF (x0 , θ)⊤ ). (19)
we see that they coincide only if JF = JF⊤ , i.e., only if the
system is conservative.                                                12:     3. Nudged Phase: Augmented Dynamics
Now, let SJ (x0 , θ) and AJ (x0 , θ) denote the symmetric              13:        Integrate the dynamics twice starting from x0
and antisymmetric parts of the Jacobian at the free (un-               14:
nudged) equilibrium, respectively. Then, we show in Ap-                        dx                 ∂C
pendix A that the gradient error increases with the spectral                       = F (x, θ) − β     (x) − 2AJ (x0 , θ) (x − x0 ),
                        −1                                                     dt                ∂x
radius of SJ (x0 , θ)        AJ (x0 , θ). Consequently, large                                                                 (20)
antisymmetric contributions degrade the gradient estima-               15:        until convergence to two new stationary states
tion, confirming empirical observations in the Appendix of                     x±β
                                                                                A .
(Ernoult et al., 2020). In fact, in the pathological limit where       16:     4. Parameter Update
the Jacobian would be purely antisymmetric SJ (x0 , θ) = 0,            17:        Update the parameters according to:
the update of VF gives the negative of the true gradient,              18:
maximizing the cost rather than minimizing it.                                                            ⊤                 !
                                                                                                                 xβA − x−β
                                                                                            
                                                                                                ∂F 0                    A
                                                                                   ∆θ = ϵ          (x , θ)                       .   (21)
                                                                                                ∂θ                   2β
3. Asymmetric EP
Here, we introduce Asymmetric EP (AsymEP), see Algo-                   19: until convergence of θ
rithm 1, which removes the gradient estimate error inherent            20: Output: Optimized parameters θ.
to VF by adding a local correction term to the augmented
inference dynamics. The new nudged equilibrium xβA satis-
fies:                                                                  where JFA (x, θ) is the Jacobian of the modified dynamical
                  ∂C β                                                 system Eq. (20). At the equilibrium point x0 , JFA is equal
  F (xβA , θ) − β   (x ) − 2AJ (x0 , θ) (xβA − x0 ) = 0, (22)          to the transpose of the original Jacobian:
                  ∂x A
As in VF, we then obtain two perturbed states x±β
                                               A for op-
                                                                                JFA (x0 , θ)    =   JF (x0 , θ) − 2AJ (x0 , θ)
posite nudging ±β and apply the contrastive learning rule                                       =   SJ (x0 , θ) − AJ (x0 , θ)
of Eq. (12).                                                                                    =   JF⊤ (x0 , θ).                    (24)
We now show that AsymEP gives rise to the correct learning
rule, i.e. that right-hand side of Eq. (21) is proportional to         where we have used the decomposition Eq. (44) of the orig-
the gradient of the cost function dC     0                             inal Jacobian J into its symmetric and antisymmetric com-
                                    dθ (x ) at the equilibrium
         0
point x (Eq. 14). To this end, note that the same reasoning            ponents. Therefore, the left hand side of Eq. (23) is equal to
leading to Eq. (16) leads to                                           the true post-synaptic term

           dxβA                   −1 ∂C 0                                         dxβA                   −1 ∂C 0
                   = JFA (x0 , θ)        (x ).              (23)                           = JF⊤ (x0 , θ)        (x ),               (25)
            dβ β=0                    ∂x                                            dβ β=0                    ∂x

                                                                   4
                                     Equilibrium Propagation for Non-Conservative Systems

which, using Eq. (14), proves the result. Additionally, al-            until a stationary point (z β , z ′β ) is reached. Upon conver-
though implied by the equality with the true gradient, we              gence, we follow the standard EP paradigm in using the
explicitly demonstrate the equivalence of the gradient esti-           difference z β − z ′β to compute the post-synaptic term. Un-
                                                                                                                   ′
                                                                                                                                 ′
mates obtained by AsymEP and Backpropagation Through                   der the change of variables m = z+z       2 and d = z − z , we
Time in Appendix B following (Ernoult et al., 2019).                   prove in Appendix D that m follows the original dynamics
                                                                       F (ensuring valid inference), while d relaxes to a "physical"
Note that the corrective term −2AJ (x0 , θ)(x − x0 ) in
                                                                       error signal proportional to the cost gradient.
Eq. (20) is spatially local: AJ vanishes for unconnected
neurons, and (x − x0 ) is available at the synapse given the           It is important to notice that while Dyadic EP introduces a
memory mechanism already required by Eq. (12). This                    distinct formulation, it remains consistent with the general
correction can create backward connections (Section 5.3).              theoretical setting of EP and matches the computational
However, in physical realizations, both feedforward and                cost of AsymEP. Note also that we start the evolution of
feedback connections must be physically present, though                the free phase (β = 0) with the identical initial condition
feedback may be deactivated during inference.                          for z and z ′ , (i.e., d = 0). This guarantees that integrat-
                                                                       ing Eq. (32) leads to a symmetric stationary point where
4. Dyadic EP                                                           z 0 = z ′0 . Finally, we underline that the modified varia-
                                                                       tional update rule in Eq. (34) is equivalent to the standard
We now introduce Dyadic EP (Algorithm 2), a variational                symmetric EP update rule in Eq. (8) (see Appendix D).
algorithm that computes the exact cost gradient in the limit
                                                                       Now, to make this concrete, consider a continuous Hopfield
of infinitesimal nudging. It maps the original n-variable
                                                                       network (see also Eq. (35)) with an asymmetric connection
dynamics F (x, θ) onto a 2n-variable system (z, z ′ ) defined
                                                                       matrix J. After some calculations (see Appendix F), the
by an energy H(z, z ′ , θ) and cost D(z, z ′ ). We show in
                                                                       augmented energy of the system can be re-expressed as:
Appendix E that AsymEP can be seen as the first-order
projection of Dyadic EP onto the original n-dimensional                             1                 1
                                                                       HT = − ρ(z)⊤ Sρ(z) + ρ(z ′ )⊤ Sρ(z ′ ) − ρ(z)⊤ Aρ(z ′ )
state space.                                                                        2                 2
                                                                                    1                    β
The new system is defined by the energy H and cost function                      + (∥z∥ − ∥z ∥ ) + (C(z, y) + C(z ′ , y)) ,
                                                                                           2      ′ 2
                                                                                    2                    2
D, given in terms of F and C by:                                                                                                  (29)
                                         
                                           z + z′
                                                                      where S and A are the symmetric and antisymmetric parts
      H(z, z ′ , θ) = −(z − z ′ )⊤ F              ,θ ,                 of J, respectively and ρ is an element-wise non-linearity.
                                             2
                                  ′
                                                                      An interesting analogy can be drawn with standard learning
                            z+z
        D(z, z ′ ) = C                 ,               (26)            rules in discrete Hopfield networks (Hopfield, 1982). For
                               2                                       a sequence of binary memories {ξ 1 , . . . , ξ m } where ξ µ ∈
where z, z ′ ∈ Rn . In order to learn, we introduce the aug-           {−1, 1}n , S P   corresponds to the standard autoassociative
mented energy                                                          Hebbian rule µ ξ µ (ξ µ )⊤ , creating stable attractors at each
                                                                       pattern. In contrast, A corresponds to the heteroassociative
        HT (z, z ′ , θ, β) = H(z, z ′ , θ) + βD(z, z ′ ).   (27)
                                                                       rule (e.g., a cycle between ξ µ and ξ ν given by ξ ν (ξ µ )⊤ −
The equilibrium configuration corresponds to a saddle point            ξ µ (ξ ν )⊤ ), encoding transitions between patterns.
of HT , where z minimizes and z ′ maximizes the energy.
                                                                       For this specific energy, the update rule given by Eq. (34)
This poses no issue for EP, which requires only that the
                                                                       can be re-expressed as:
joint state (z, z ′ ) reaches a stationary state. Although this
min-maximization can be interpreted as z evolving forward                          1                                         ⊤
                                                                                        ρ(z ′β ) − ρ(z β ) ρ(z ′β ) + ρ(z β ) . (30)
                                                                                                          
                                                                         ∆J ∝ −
and z ′ backward in time, in practice they evolve forward                         2β
simultaneously, as we integrate the coupled equations:                 In the limit β → 0, this gives:
                              z + z′
                                        
     dz       ∂HT                                                                                β
                                                                                                   !
         =−             =F            ,θ                                                       d
     dt                                                                               ∆J ∝           ⊙ ρ′ (m)ρ(m)⊤ .             (31)
              ∂z         ⊤     2
                                                                                                β
                 z − z′                    β ∂C z + z ′
                                                        
                             ∂F
           +                             −                 ,
                      2       ∂z z+z′      2 ∂z      2
                                    2                                  matching the learning rule in (Pineda, 1987), with
                                                                               β

    dz ′      ∂HT
                            
                              z + z′
                                                                      limβ→0 dβ being the error signal.
         =+         ′
                        =F            ,θ
    dt        ∂z                2
                         ′ ⊤
                                           β ∂C z + z ′
                                                       
                 z−z         ∂F                                        5. Numerical Experiments
           −                             +                 ,
                      2      ∂z ′ z+z′     2 ∂z ′    2
                                    2                                  In this section, we numerically validate AsymEP (Algo-
                                                           (28)        rithm 1). The neuronal dynamics follows the one introduced

                                                                   5
                                     Equilibrium Propagation for Non-Conservative Systems

Algorithm 2 Dyadic EP                                                 where ∥ · ∥F denotes the Frobenius norm. Note that this
 1: Inputs: Force field F (x, θ), cost function C(x, y),              metric does not capture the asymmetry of the Jacobian,
      nudging parameter β, learning rate ϵ                            which depends on the state x.
 2: repeat                                                            For numerical experiments, we restricted the network to a
 3:   1. Free Phase: Evolve to stationary state                       layered architecture with a single hidden layer to facilitate
 4:      Evolve the system dynamics, starting from identi-            comparison with prior work. Accordingly, J in contains
        cal initial conditions z(0) = z ′ (0) = z0 ,                  only input-to-hidden connections, while J dyn is block off-
 5:                                                                   diagonal, encoding bidirectional interactions between the
                  dz    ∂H             dz ′   ∂H
                     =−    ,                =+ ′,         (32)        hidden and output layers. Both J in and J dyn are trained.
                  dt    ∂z             dt     ∂z
                                                                      We first use MNIST (LeCun, 1998) (60k train, 10k test)
 6:        until stationary states z 0 , z ′0 are reached.
                                                                      followed by Fashion-MNIST to validate AsymEP, and then
 7:     2. Nudged Equilibrium
                                                                      we further validate AsymEP and Dyadic EP by comparing
 8:        Evolve the system dynamics, starting from the
                                                                      them to Backpropagation on a convolutional feedforward,
        solution of the free phase z 0 = z ′0 :
                                                                      with CIFAR-10. Inputs are normalized using min-max to
 9:
                 dz    ∂HT             dz ′    ∂HT                    [−1, 1] and targets are one-hot encoded in {−1, 1}. All
                    =−     ,                =+       ,    (33)        hyperparameters are detailed in Appendix G, along with
                 dt     ∂z             dt       ∂z ′
                                                                      additional details and numerical results.
10:        until two nudged stationary states z β , z ′β are
        reached.
                                                                      5.1. Symmetric Initialization
11:     3. Parameter Update
12:        Update the parameters according to:                        We start by comparing AsymEP with standard EP and
13:                                                                   VF. All algorithms are initialized with an identical sym-
                           1 ∂H(z β , z ′β , θ)
                                               
                   ∆θ = −ϵ                              (34)          metric matrix J dyn . EP maintains this symmetry through-
                           β          ∂θ                              out training, while VF and AsymEP induce asymmetry in
14: until convergence of θ                                            J dyn . Since EP and VF already achieve strong performance
15: Output: Optimized parameters θ.                                   on MNIST, the purpose of this experiment is to validate
                                                                      AsymEP and compare it against EP and VF rather than
                                                                      outperform the state of the art.
in (Scellier &Bengio, 2017), and is generalized to allow              Figure 1 compares the three algorithms as a function of
for non-reciprocal forces as in (Scellier et al., 2018). For          hidden-layer dimension after 1 and 20 epochs. AsymEP
clarity, we express the forces in a form that explicitly sepa-        consistently outperforms the baselines, suggesting it learns
rates the contributions of the external input and the recurrent       faster and better.
interactions:
                                                                      Figure 2 studies the evolution of the asymmetry ratio rstr .
          F (x) = ρ′ (x) ⊙ J in u + J dyn ρ(x) − x,
                                               
                                                           (35)       The results are reported for 50 hidden neurons. As expected,
                                                                      EP preserves the initial weight symmetry. In contrast, VF
where u ∈ RNin denotes the input and x ∈ RNdyn the neu-               and AsymEP induce non-trivial evolution of rstr following
ronal state, comprising both hidden and output units. The             two distinct patterns, resulting in three distinct network
matrices J in ∈ RNdyn ×Nin and J dyn ∈ RNdyn ×Ndyn define the         configurations. A complementary figure is available in Ap-
input and recurrent connectivity, respectively. The activation        pendix G.1.
function ρ(·) is taken to be the hyperbolic tangent, applied
element-wise.                                                         5.2. Fixed Asymmetry Ratio
If J dyn is symmetric, we can define the energy:                      While the previous section focused on networks compatible
            1       1                                                 with all three algorithms (EP, VF, AsymEP), we now turn
  E(x) =      ∥x∥2 − ρ(x)⊤ J dyn ρ(x) − ρ(x)⊤ J in u, (36)            to architectures with strong structural asymmetry. In this
            2       2
                                                                      regime, EP is inapplicable by construction, and, as we show,
which is identical to that of (Scellier &Bengio, 2017), pro-          VF performs poorly, contrary to AsymEP which remains
vided that the input neurons are activated as ρ(u).                   effective.
Equation (35) naturally motivates a quantitative measure of           To this end, we consider a class of networks where the
structural asymmetry rstr , defined as:                               asymmetry ratio rstr defined in Eq. (37) is kept fixed. Let S̃
                               ⊤                                      and Ã be arbitrary symmetric and antisymmetric matrices
                         ∥(J dyn − J dyn )/2∥F                        in RNdyn ×Ndyn respectively. We enforce a fixed rstr via the
                rstr =                         ,          (37)
                                ∥J dyn ∥F

                                                                  6
                                       Equilibrium Propagation for Non-Conservative Systems

                                                                            where γ ∈ R is a learnable global scale.
                                                                            Using VF and AsymEP, we train a layered network with one
                                                                            hidden layer of 50 neurons (in which case S̃ and Ã are block
                                                                            off-diagonal) for different values of rstr to investigate the
                                                                            impact of structural asymmetry. We compare two training
                                                                            regimes: training only the input weights J in (and the scale
                                                                            γ), versus training all parameters including J dyn . The first
                                                                            regime trains only the external forces from the input ρ′ (x) ⊙
                                                                            J in u (which correspond to a symmetric contribution in the
                                                                            Jacobian) applied to our non-conservative system, while
                                                                            the second additionally trains J dyn and therefore the non-
                   (a) Results after one epoch.                             symmetric part of the Jacobian directly.
                                                                            Figure 3 summarizes the results. We find that AsymEP
                                                                            maintains robust performance across all asymmetry levels
                                                                            (e.g., achieving an accuracy of 93.8 ± 0.4% at rstr = 0 and
                                                                            94.9 ± 0.2% at rstr = 0.875 when training all parameters)
                                                                            and can even learn when the recurrent connection matrix
                                                                            J dyn is completely antisymmetric (rstr = 1). Additionally,
                                                                            training all parameters shows significant improvement over
                                                                            training only J in .
                                                                            In contrast, VF performs well at low asymmetry ratios
                                                                            but degrades as asymmetry increases, eventually dropping
                                                                            to chance levels (e.g., accuracies of 5 ± 3% and 8 ± 4%
                   (b) Results after 20 epochs.                             at rstr = 1 for input-only and all-parameter training, re-
Figure 1. Comparison of algorithm performance on MNIST using                spectively). When only J in is trained, VF accuracy col-
a layered architecture with one hidden layer and symmetric ini-             lapses around rstr ≈ 0.5, whereas training all parameters
tialization. Squares denote AsymEP, circles EP, and triangles VF.           delays this collapse until rstr ≈ 0.8. Our analysis in Ap-
Test accuracy (averaged over 10 runs) is shown after one epoch
                                                                            pendix G.2.1 reveals that VF adjusts the dynamics such that
(Fig. 1a) and 20 epochs (Fig. 1b).
                                                                            the asymmetry of the Jacobian’s off-diagonal terms remains
                                                                            strictly lower than the structural asymmetry ratio. The train-
                                                                            ing appears to adjust the neuronal state such that neurons
                                                                            connected by strongly asymmetric weights have low activa-
                                                                            tion. As shown in Appendix G.2.1, AsymEP learns faster
                                                                            than VF across all levels of asymmetry.
                                                                            Finally, Appendix G.3 opens with a brief theoretical dis-
                                                                            cussion of the stability of these non-conservative dynamics,
                                                                            followed by simulations on all-to-all topologies with con-
                                                                            strained rstr and input projections J in . Even in this worst-
                                                                            case setting, AsymEP reduces oscillations and improves
                                                                            stability.

                                                                            5.3. Feedforward Architectures
Figure 2. Evolution of the asymmetry ratio rstr (defined in Eq. (37))       We now consider a purely feedforward architecture. Here
during training on MNIST for AsymEP, EP and VF, initialized
from a symmetric configuration. The models use 50 hidden neu-               VF trains only the last layer: with no backward connections,
rons.                                                                       the output nudging signal cannot reach earlier layers, so for
                                                                            every layer but the last the nudged stationary states coincide
                                                                            with the free states, giving zero weight updates. As only
following parameterization of the recurrent parameters:                     the output layer is trained, the system essentially becomes
                "q                               #                          an Extreme Learning Machine (Huang et al., 2006; Wang
         dyn              2     S̃           Ã                             et al., 2022). In contrast, AsymEP introduces a correction
        J =γ         1 − rstr       + rstr         ,  (38)                  that generates effective backward connections, allowing the
                              ∥S̃∥F        ∥Ã∥F

                                                                        7
                                      Equilibrium Propagation for Non-Conservative Systems

                                                                          tivity structures inspired by (Millidge et al., 2023), while
                                                                          keeping the number of trainable parameters fixed.
                                                                          Experiments are conducted on Fashion-MNIST using a two-
                                                                          hidden-layer network with hidden dimensions 500 and 200.
                                                                          Network states are denoted (x0 , x1 , x2 , x3 ), where x0 is
                                                                          the input and x3 = xL the output. Forward and backward
                                                                          connections are denoted by Wk and Bk , respectively, with
                                                                          W1 = J in .
                                                                          We consider three classes of dynamics. First, the Continuous
                                                                          Hopfield (CH) dynamics introduced previously:
                                                                          dxk                                                      
                                                                               = −xk +ρ′ (xk )⊙ Wk ρ(xk−1 )+(1−δk,L )Bk ρ(xk+1 ) .
                                                                           dt
                                                                                                                               (40)
                                                                          Second, Predictive Coding (PC) dynamics, defined through
                                                                          the prediction errors ek = xk − Wk ρ(xk−1 ), whose fixed
Figure 3. Impact of the structural asymmetry ratio rstr on accuracy
(top) and standard deviation over 10 runs (bottom) on MNIST.              point ek = 0 corresponds to a standard feedforward net-
We compare VF (orange) and AsymEP (blue) under two training               work:
regimes: training only J in (dashed) or all parameters (solid).
                                                                             dxk
                                                                                 = −ek + (1 − δk,L ) (ρ′ (xk ) ⊙ (Bk ek+1 )) .    (41)
                                                                              dt
nudging signal to influence all layers. We make this explicit
                                                                          Third, a standard dynamics chosen for direct comparison
for a network with one hidden layer.
                                                                          with backpropagation:
Let the state x be partitioned in hidden h and output o
                                                                           dxk
layers. The recurrent connection matrix is then J dyn =                       = −xk + Wk ρ(xk−1 ) + (1 − δk,L )Bk ρ(xk+1 ). (42)
                                                                            dt

    0     0
             . The forces of the system are:
  Wh→o 0                                                                  For each dynamics, we examine three connectivity scenar-
  β                                                                      ios.
   Fh = ρ′ (h) ⊙ J in u + λ(Wh→o )⊤ (o − o0 ) − h
                                             
 
 
                                               
                                               0
 
                                                                                                                    ⊤
   Fo = ρ′ (o) ⊙ Wh→o ρ(h) − λWh→o (h − h )                                  • In the asymmetric case (Bk ̸= Wk+1
  β
                                                                                                                       ), the backward
                                                     (39)
                                                                              weights Bk are randomly initialized and kept fixed
 
              ∂C                                                              while only the forward weights are trained, ensuring a
         + λβ      −o
 
 
               ∂o                                                              fair comparison (i.e., identical number of parameters);
where λ is 0 during the free inference and 1 during the                        in PC, the learning rule for Bk is zero when only inputs
nudged phase (Eq. 20). The force on the hidden layer Fhβ                       are clamped.
now depends on the output layer through the term ρ′ (h) ⊙                                                                  ⊤
         ⊤                                                                   • In the symmetric / conservative case (Bk = Wk+1 ), the
(Wh→o ) (o − o0 ), enabling the nudge (the term β ∂C ∂o ) to                   CH and PC dynamics derive from an energy functional,
influence the hidden layer. This implicitly assumes that the                   while the standard dynamics remains non-conservative
hardware implementation supports the physical activation                       due to its non-symmetric Jacobian.
of these backward connections.
                                                                             • In the feedforward case (Bk = 0), the PC and stan-
We validate this using a single hidden layer of only 20 neu-
                                                                               dard dynamics coincide; for the standard dynamics, the
rons on MNIST. After training, VF saturates with 64.3 ±
                                                                               AsymEP learning rule mirrors backpropagation, with
2.0% accuracy, whereas AsymEP reaches 92.7 ± 0.5% ac-
                                                                               ∆xβk = 2β1
                                                                                          (xβ − x−β ) acting as the propagated error
curacy. We expect this discrepancy to increase with network
                                                                               signal.
depth, since this increases the number of layers unable to
learn under VF. A figure with the accuracy during training
can be found in Appendix G.4.2.                                           Table 1 shows that AsymEP consistently outperforms VF
                                                                          in both asymmetric and feedforward settings, in final ac-
                                                                          curacy, learning speed, and stability. After a single epoch
5.4. Advantages of Non-Conservative Dynamics
                                                                          it already provides on average a 15% accuracy gain with
AsymEP is not tied to a specific neural dynamics. To further              an order-of-magnitude reduction in variance. Remarkably,
assess the benefits of training non-conservative dynamics                 AsymEP with asymmetric connectivity also surpasses EP
using AsymEP, we compare several dynamics and connec-                     on symmetric networks despite training only the forward

                                                                      8
                                    Equilibrium Propagation for Non-Conservative Systems

weights, suggesting that relaxing symmetry constraints may            6. Discussion and Conclusion
improve expressivity. Supplementary results are provided
in Appendix G.5.                                                      In this work, we extended Equilibrium Propagation (EP)
                                                                      to non-conservative systems that reach stationary states by
                                                                      deriving two mathematically equivalent algorithms that re-
                                                                      cover the exact gradient of the cost function in the limit of
Table 1. Test accuracy on Fashion-MNIST (%) at Epoch 50 (mean
± std 10 runs). BP on a standard feedforward architecture using
                                                                      infinitesimal nudging.
MSE and SGD achieve 87.37 ± 0.29%.                                    The first approach, Asymmetric EP, preserves the original
                                                                      inference dynamics. It introduces a corrective force during
                          EP          AsymEP            VF
                                                                      the nudged phase that remains spatially local, as the anti-
          Asym          -      86.78 ± 0.14        85.20 ± 0.12       symmetric Jacobian is null for unconnected neurons and the
 CH       Feedfor       -      86.05 ± 0.12        77.76 ± 0.37
          Sym     84.30 ± 0.13       -                   -
                                                                      perturbation from equilibrium is available at the synapse
          Asym          -      86.20 ± 0.17        80.71 ± 6.17       level. Unlike standard methods like Recurrent Backpropa-
 PC                                                                   gation (Almeida, 1990; Pineda, 1987), this avoids explicit
          Sym     84.78 ± 0.14       -                   -
          Asym          -      82.91 ± 0.48        75.52 ± 1.69       digital weight transposition. However, a physical mech-
 Standard
          Feedfor       -      86.25 ± 0.16        78.58 ± 0.28       anism to obtain the local corrective force at the synapse
                                                                      level remains a subject for future work. We also note that
                                                                      AsymEP shares the temporal non-locality of standard EP.
Finally, to investigate how AsymEP scales with depth, we              The second approach, Dyadic EP, doubles the state space
trained deeper fully connected networks with two and three            to map non-reciprocal dynamics onto an energy land-
hidden layers of 500 neurons on Fashion-MNIST, reaching               scape—conceptually reminiscent of multi-compartment cor-
86.41 ± 0.22% and 87.8 ± 0.15% test accuracy respectively.            tical neurons, where apical dendrites integrate feedback
                                                                      (analogous to z − z ′ ) separately from basal feedforward
5.5. Feedforward Training on CIFAR-10: BP vs. Dyadic                  input (analogous to z + z ′ ) (Guerguiev et al., 2017). Addi-
     EP vs. AsymEP                                                    tionally, this expanded space also enables the positive and
                                                                      negative nudging phases to run in parallel. This offers a
To test whether our framework scales beyond shallow net-              pathway to implement a version of EP that is local in time,
works, we conclude with a deep, purely feedforward CNN                but would require a doubling of the degrees of freedom
architecture trained on CIFAR-10. We compare backprop-                on the physical hardware. More fundamentally, the energy
agation (BP), VF, AsymEP and Dyadic EP in a controlled                defined on the extended state shows that the tools and the-
setting where the gradient estimator is the only difference           oretical guarantees obtained for EP should also apply to
between runs: all methods share the same configuration,               the case of non-reciprocal forces, and that the variational
with the BP gradient replaced by the contrast of stationary           principle behind EP is universal in the sense that it can be
states for the EP-based methods (see App. G.6 for details).           applied to all networks which operate in a stationary state.
Each configuration is trained for 40 epochs over 5 seeds.
                                                                      Furthermore, Dyadic EP is not restricted to the EP com-
Table 2 reports the final test accuracy. Both of our algo-            munity and could suggest a more physically plausible al-
rithms scale to this regime, closely tracking the BP baseline         ternative to the stationary-state Adjoint Method (for fixed
throughout training and matching its final accuracy: a paired         inputs) (Chen et al., 2018): by solving the forward and ad-
t-test finds no significant difference between Dyadic EP and          joint equations simultaneously via relaxation, it circumvents
BP (p = 0.75), and only a sub-percent gap for AsymEP.                 a separate backward-in-time pass.
In contrast, VF makes slight initial progress (peaking near
30%) before collapsing to chance level (10%). Additional              Finally, our experiments on MNIST, Fashion-MNIST, and
details can be found in Appendix G.6                                  CIFAR-10 confirm that AsymEP and Dyadic EP consis-
                                                                      tently outperform EP and VF, and notably enables effective
                                                                      training of feedforward networks.
                                                                      Our work thus opens new avenues for learning in neuro-
Table 2. Test accuracy on CIFAR-10 (%) at epoch 40 (mean ± std
over 5 seeds).                                                        morphic hardware, dissipative physical systems, and neural
                                                                      architectures where asymmetry is intrinsic rather than inci-
              Method              Test Acc. (%)                       dental.
              Backpropagation     90.66 ± 0.25
              Dyadic EP           90.69 ± 0.14
              AsymEP              89.74 ± 0.14
              VF                  10.00 ± 0.00


                                                                  9
                                   Equilibrium Propagation for Non-Conservative Systems

Impact Statement                                                     References
This paper presents results that advance the field of machine        Almeida, L. B. A learning rule for asynchronous percep-
learning. There are many potential societal consequences               trons with feedback in a combinatorial environment. In
of our work, none of which we feel must be specifically                Artificial neural networks: concept learning, pp. 102–111.
highlighted here.                                                      1990.
                                                                     Altman, L. E., Stern, M., Liu, A. J., and Durian, D. J. Ex-
Acknowledgments                                                        perimental demonstration of coupled learning in elastic
                                                                       networks. Physical Review Applied, 22(2):024053, 2024.
AES is fully funded by the Horizon Europe Marie
Skłodowska-Curie Doctoral Network ’Postdigital Plus’                 Aykroyd, C., Bourgoin, A., and Poncin-Lafitte, C. L. Hamil-
(Grant 101169118). DVA acknowledges the support of                     tonian treatment of non-conservative systems. arXiv
the French Community of Belgium through a FRIA fellow-                 preprint arXiv:2507.18658, 2025.
ship. SM acknowledges financial support by the Fonds de la
                                                                     Bateman, H. On dissipative systems and related variational
Recherche Scientifique–FNRS, Belgium under EOS Project
                                                                       principles. Physical Review, 38(4):815, 1931.
No. 40007536. Computational resources have been pro-
vided by the Consortium des Équipements de Calcul Intensif           Berneman, M. and Hexner, D. Equilibrium propagation for
(CÉCI), funded by the Fonds de la Recherche Scientifique               dissipative dynamics. Advanced Intelligent Systems, pp.
de Belgique (F.R.S.-FNRS) under Grant No. 2.5020.11 and                e202501310, 2025.
by the Walloon Region.
                                                                     Bishop, K. J., Biswal, S. L., and Bharti, B. Active colloids
                                                                       as models, materials, and machines. Annual Review of
                                                                       Chemical and Biomolecular Engineering, 14(1):1–30,
                     “ἁρμονίη ἀφανὴς φανερῆς κρείττων”
                                                                       2023.
                                                                     Bowick, M. J., Fakhri, N., Marchetti, M. C., and Ra-
                                                                       maswamy, S. Symmetry, thermodynamics, and topology
                                                                       in active matter. Physical Review X, 12(1):010501, 2022.
                                                                     Brandenbourger, M., Locsin, X., Lerner, E., and Coulais, C.
                                                                       Non-reciprocal robotic metamaterials. Nature communi-
                                                                       cations, 10(1):4608, 2019.
                                                                     Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and
                                                                       games. Cambridge university press, 2006.
                                                                     Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud,
                                                                       D. K. Neural ordinary differential equations. Advances
                                                                       in neural information processing systems, 31, 2018.
                                                                     Cin, N. D., Marquardt, F., and Wanjura, C. C. Training
                                                                       nonlinear optical neural networks with scattering back-
                                                                       propagation. arXiv preprint arXiv:2508.11750, 2025.
                                                                     Costa, P. and Santos, P. A. Directed equilibrium propagation
                                                                       revisited. Mathematics, 13(11), 2025. ISSN 2227-7390.
                                                                     Crick, F. The recent excitement about neural networks.
                                                                       Nature, 337, 1989.
                                                                     Dillavou, S., Stern, M., Liu, A. J., and Durian, D. J. Demon-
                                                                       stration of decentralized physics-driven learning. Physi-
                                                                       cal Review Applied, 18(1):014040, 2022.
                                                                     Dillavou, S., Beyer, B. D., Stern, M., Liu, A. J., Miskin,
                                                                       M. Z., and Durian, D. J. Machine learning without a pro-
                                                                       cessor: Emergent learning in a nonlinear analog network.
                                                                       Proceedings of the National Academy of Sciences, 121
                                                                       (28):e2319718121, 2024.

                                                                10
                                   Equilibrium Propagation for Non-Conservative Systems

Ernoult, M., Grollier, J., Querlioz, D., Bengio, Y., and Scel-        Indiveri, G. and Liu, S.-C. Memory and information pro-
  lier, B. Updates of equilibrium prop match gradients                  cessing in neuromorphic systems. Proceedings of the
  of backprop through time in an rnn with static input.                 IEEE, 103(8):1379–1397, 2015.
  Advances in neural information processing systems, 32,
  2019.                                                               Kalinin, K. P., Gladrow, J., Chu, J., Clegg, J. H., Cletheroe,
                                                                        D., Kelly, D. J., Rahmani, B., Brennan, G., Canakci, B.,
Ernoult, M., Grollier, J., Querlioz, D., Bengio, Y., and Scel-          Falck, F., et al. Analog optical computer for ai inference
  lier, B. Equilibrium propagation with continual weight                and combinatorial optimization. Nature, 645(8080):354–
  updates. arXiv preprint arXiv:2005.04168, 2020.                       361, 2025.
Falk, M. J., Strupp, A. T., Scellier, B., and Murugan,                Kendall, J., Pantone, R., Manickavasagam, K., Bengio,
  A. Temporal contrastive learning through implicit non-                Y., and Scellier, B. Training end-to-end analog neural
  equilibrium memory. Nature Communications, (16),                      networks with equilibrium propagation. arXiv preprint
  2025.                                                                 arXiv:2006.01981, 2020.
Farinha, M. T., Pequito, S., Santos, P. A., and Figueiredo,
                                                                      Laborieux, A. and Zenke, F. Holomorphic equilibrium
  M. A. T. Equilibrium propagation for complete directed
                                                                        propagation computes exact gradients through finite size
  neural networks. In Proceedings of the 28th European
                                                                        oscillations. Advances in Neural Information Processing
  Symposium on Artificial Neural Networks, Computational
                                                                        Systems, 35:12950–12963, 2022.
  Intelligence and Machine Learning (ESANN 2020), 2020.
Galley, C. R. Classical mechanics of nonconservative sys-             Laborieux, A. and Zenke, F. Improving equilibrium propa-
  tems. Physical review letters, 110(17):174301, 2013.                  gation without weight symmetry through jacobian home-
                                                                        ostasis. In Proceedings of the International Confer-
Guerguiev, J., Lillicrap, T. P., and Richards, B. A. Towards            ence on Learning Representations (ICLR) 2024, Virtual
  deep learning with segregated dendrites. elife, 6:e22901,             (ICLR), May 2024.
  2017.
                                                                      Laborieux, A., Ernoult, M., Scellier, B., Bengio, Y., Grollier,
Høier, R. and Zach, C. A lagrangian perspective on dual                 J., and Querlioz, D. Scaling equilibrium propagation to
  propagation. In Proceedings of the First Workshop on Ma-              deep convnets by drastically reducing its gradient estima-
  chine Learning with New Compute Paradigms at NeurIPS                  tor bias. Frontiers in neuroscience, 15:633674, 2021.
 2023, New Orleans, LA, USA, Dec 2023.
                                                                      Laydevant, J., Marković, D., and Grollier, J. Training an
Høier, R. and Zach, C. Two tales of single-phase contrastive
                                                                        ising machine with equilibrium propagation. Nature Com-
  hebbian learning. In Salakhutdinov, R., Kolter, Z., Heller,
                                                                        munications, 15(1):3671, 2024.
  K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F.
  (eds.), Proceedings of the 41st International Conference            LeCun, Y. The mnist database of handwritten digits.
  on Machine Learning, volume 235 of Proceedings of                     http://yann. lecun. com/exdb/mnist/, 1998.
  Machine Learning Research, pp. 18470–18488. PMLR,
  21–27 Jul 2024.                                                     Martin, E., Ernoult, M., Laydevant, J., Li, S., Querlioz, D.,
                                                                       Petrisor, T., and Grollier, J. Eqspike: spike-driven equi-
Høier, R., Staudt, D., and Zach, C. Dual propagation: accel-
                                                                       librium propagation for neuromorphic implementations.
  erating contrastive hebbian learning with dyadic neurons.
                                                                       Iscience, 24(3), 2021.
  In Proceedings of the 40th International Conference on
  Machine Learning, ICML’23. JMLR.org, 2023.                          Massar, S. Equilibrium propagation for learning in la-
Høier, R., Kalinin, K., Ernoult, M., and Zach, C. Dyadic               grangian dynamical systems. Physical Review E, 112
  learning in recurrent and feedforward models. In NeurIPS             (3):035304, 2025.
 2024 Workshop Machine Learning with new Compute                      Massar, S. and Mognetti, B. M. Equilibrium propagation:
 Paradigms, 2024.                                                      the quantum and the thermal cases. Quantum Studies:
Hopfield, J. J. Neural networks and physical systems with              Mathematics and Foundations, 12(1):6, 2025.
  emergent collective computational abilities. Proceedings
                                                                      Millidge, B., Song, Y., Salvatori, T., Lukasiewicz, T., and
  of the national academy of sciences, 79(8):2554–2558,
                                                                       Bogacz, R. Backpropagation at the infinitesimal infer-
 1982.
                                                                       ence limit of energy-based models: Unifying predictive
Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. Extreme learning             coding, equilibrium propagation, and contrastive hebbian
  machine: theory and applications. Neurocomputing, 70                 learning. In International Conference on Learning Rep-
  (1-3):489–501, 2006.                                                 resentations (ICLR), 2023.

                                                                 11
                                    Equilibrium Propagation for Non-Conservative Systems

Nest, T. and Høier, R. Dyadic learning in asymmetric                   Wang, Q., Wanjura, C. C., and Marquardt, F. Training
  convnets. In New Frontiers in Associative Memories-                   coupled phase oscillators as a neuromorphic platform
 Workshop at ICLR 2026.                                                 using equilibrium propagation. Neuromorphic Computing
                                                                        and Engineering, 4(3):034014, 2024.
Osat, S. and Golestanian, R. Non-reciprocal multifarious
  self-organization. Nature Nanotechnology, 18(1):79–85,               Wanjura, C. C. and Marquardt, F. Quantum equilibrium
  2023.                                                                 propagation for efficient training of quantum systems
                                                                        based on onsager reciprocity. Nature Communications,
O’Connor, P., Gavves, E., and Welling, M. Training a spik-              16(1):6595, 2025.
  ing neural network with equilibrium propagation. In The
  22nd international conference on artificial intelligence             Werbos, P. J. Backpropagation through time: what it does
  and statistics, pp. 1516–1523. PMLR, 2019.                            and how to do it. Proceedings of the IEEE, 78(10):1550–
                                                                        1560, 1990.
Pineda, F. Generalization of back propagation to recurrent
  and higher order neural networks. In Neural information              Yi, S.-i., Kendall, J. D., Williams, R. S., and Kumar, S.
  processing systems, 1987.                                              Activity-difference training of deep neural networks using
                                                                         memristor crossbars. Nature Electronics, 6(1):45–51,
Pourcel, G., Basu, D., Ernoult, M., and Gilra, A. Lagrangian-            2023.
  based equilibrium propagation: generalisation to arbi-
  trary boundary conditions & equivalence with hamilto-
  nian echo learning. arXiv preprint arXiv:2506.06248,
  2025.

Rageau, T. and Grollier, J. Training and synchronizing
  oscillator networks with equilibrium propagation. Neuro-
  morphic Computing and Engineering, 2025.

Sajnok, K. and Matuszewski, M. Near-equilibrium propaga-
  tion training in nonlinear wave systems. arXiv preprint
  arXiv:2510.16084, 2025.

Scellier, B. Quantum equilibrium propagation: Gradient-
  descent training of quantum systems. arXiv preprint
  arXiv:2406.00879, 2024.

Scellier, B. and Bengio, Y. Equilibrium propagation: Bridg-
  ing the gap between energy-based models and backprop-
  agation. Frontiers in computational neuroscience, 11:24,
  2017.

Scellier, B., Goyal, A., Binas, J., Mesnard, T., and Bengio,
  Y. Generalization of equilibrium propagation to vector
  field dynamics. arXiv preprint arXiv:1808.04873, 2018.

Scellier, B., Mishra, S., Bengio, Y., and Ollivier, Y. Agnostic
  physics-driven deep learning. arXiv:2205.15021v1, 2022.

Scurria, A. E. A physical theory of backpropagation: Exact
  gradients from the least-action principle. 2026.

Stern, M., Hexner, D., Rocks, J. W., and Liu, A. J. Su-
  pervised learning in physical networks: From machine
  learning to learning machines. Physical Review X, 11(2):
  021045, 2021.

Wang, J., Lu, S., Wang, S.-H., and Zhang, Y.-D. A review
 on extreme learning machine. Multimedia Tools and
 Applications, 81(29):41611–41660, 2022.

                                                                  12
                                       Equilibrium Propagation for Non-Conservative Systems

A. Gradient Estimation Error in VF                                           where s denotes the dynamical state of the system. This
                                                                             symmetry is the linchpin of the equivalence proof, as the
In this appendix, we quantify the gradient estimation error                  gradient expressions derived for BPTT and standard EP
introduced by VF in the limit where the Jacobian asymmetry                   differ precisely by a transpose operation applied to ∂F
                                                                                                                                  ∂s .
is small.
                                                                             This observation aligns with our analysis in the main text:
Comparing the post-synaptic update terms in Eqs. (12) and                    VF fails in non-conservative systems due to the missing
(14) gives the following error in the gradient of the cost:                  transpose in the post-synaptic term (see Eq. (16)). Following
                       ⊤                                                   the derivation in Ernoult et al. (2019) (viz., Appendix A, Eqs.
            ∂F 0                                                             (31–33)), the recursive relations for the gradients in BPTT
  Error = −     (x , θ)
             ∂θ                                                              are given by:
             −1                −1  ∂C 0
× JF (x0 , θ)     − JF⊤ (x0 , θ)          (x , y), (43)                                                           ∂ℓ
                                       ∂x                                                         ∇BPTT
                                                                                                   s    (0) =        (s⋆ , y),              (49)
                                                                                                                  ∂s
To quantify this error, we decompose the Jacobian JF (x, θ)                  and for all t = 1, . . . , K,
into its symmetric part SJ (x, θ) and antisymmetric part
                                                                                                                    ⊤
                                                                                                    ∂F
           SJ (x, θ) = 12 JF (x, θ) + JF⊤ (x, θ) ,                               ∇BPTT                                    ∇BPTT (t − 1),
                                                
                                                                                  s    (t) =           (x, s⋆ , θ)         s                (50)
                                                      (44)                                          ∂s
           AJ (x, θ) = 12 JF (x, θ) − JF⊤ (x, θ) .
                                                
                                                                                                                    ⊤
                                                                                                    ∂F
                                                                                 ∇BPTT
                                                                                  θ    (t) =           (x, s⋆ , θ)        ∇BPTT
                                                                                                                           s    (t − 1),    (51)
Assuming the asymmetry AJ (x, θ) is small, we can make                                              ∂θ
a series expansion in SJ−1 AJ (omitting the dependencies                     where θ represents the optimization parameters, ℓ is the
for clarity). Applying the Neumann expansion for small                       cost function, s⋆ is the free equilibrium state (satisfying
∥SJ−1 AJ ∥ gives                                                             F (s⋆ ) = 0), y is the target, and x is the input. The index t
                       ∞
                                                   !                         denotes the unrolled time steps, initialized at s(0) = s⋆ .
                       X
        (JF ) −1
                   =       (−1)  n
                                     (SJ−1 AJ )n        SJ−1 ,   (45)        In contrast, the gradients computed by VF follow the recur-
                       n=0                                                   sion (viz., Ernoult et al. (2019), Appendix A, Eqs. (24–26)):
                        ∞
                                           !
                       X
        (JF⊤ )−1 =           (SJ−1 AJ )n       SJ−1 .            (46)                                             ∂ℓ
                                                                                                  ∆EP
                                                                                                   s (0) = −         (s⋆ , y),              (52)
                       n=0                                                                                        ∂s
Subtracting the two series and assuming convergence, we                      and for all t ≥ 0,
finally obtain
                                                                                                     ∂F
                                                                                       ∆EP
                                                                                        s (t + 1) =     (x, s⋆ , θ) ∆EPs (t),               (53)
                              ∞                                                                      ∂s
                                                !
                              X
                                    −1
                                          2n+1
        −1      ⊤ −1
  (JF ) − (JF ) = −2               SJ AJ          SJ−1 .                                             
                                                                                                       ∂F
                                                                                                                       ⊤
                                 n=0                                                   ∆EP
                                                                                        θ  (t + 1) =      (x, s ⋆ , θ)    ∆EP
                                                                                                                            s (t).          (54)
                                                                 (47)                                  ∂θ
                                                                             Comparing these two sets of equations confirms that the only
B. Equivalence between AsymEP and BPTT                                       difference are Eqs. (50) and (53), specifically the transpose
                                                                             of the Jacobian ∂F
                                                                                              ∂s (ignoring the global sign difference in
In this appendix, we sketch the equivalence between the                      Eqs. (49) and (52)).
gradient estimate computed by AsymEP and Backpropaga-
tion Through Time (BPTT) (Werbos, 1990) for a Recurrent                      In AsymEP, we modify the dynamics by adding a correction
Neural Network with fixed inputs. Our derivation relies on                   term dependent on the antisymmetric part of the Jacobian.
the proof provided by Ernoult et al. (2019), which estab-                    Denoting the force of this augmented system by F A , the
lished that standard (conservative) EP computes gradients                    Jacobian at the free equilibrium satisfies:
identical to those of BPTT. To facilitate direct comparison,                                                                       ⊤
                                                                                         ∂F A
                                                                                                              
we adopt their notation for this section.                                                                         ∂F
                                                                                              (x, s⋆ , θ) =          (x, s⋆ , θ)        .   (55)
                                                                                          ∂s                      ∂s
The proof provided by Ernoult et al. (2019) relies on the
assumption that the vector field F (i.e., transition function)               By substituting this corrected Jacobian into the recursive
is derived from a scalar potential function, which implies                   relations, AsymEP recovers the exact transpose required
that                                                                         by BPTT. Consequently, our method extends the equiva-
                                 ⊤
                     ∂F        ∂F                                            lence between EP and BPTT to the general case of non-
                         =             ,                  (48)               conservative force.
                     ∂s        ∂s

                                                                        13
                                    Equilibrium Propagation for Non-Conservative Systems

C. Out-of-Equilibrium Mechanics                                         C.3. Symmetry Breaking as Credit Assignment
Here we sketch the physical picture behind the doubled-                 On the diagonal manifold z = z ′ the doubled system enjoys
energy construction of Eq. (26). The full derivation from               a gauge symmetry: the auxiliary variable z ′ is redundant and
Hamilton’s least-action principle, together with its connec-            the difference d is identically zero. Credit assignment is im-
tion to the Bateman–Galley formalism for non-conservative               plemented by deliberately breaking this symmetry through
classical mechanics (Bateman, 1931; Galley, 2013; Aykroyd               the task cost. Adding βD(z, z ′ ) = β C(m) to H exerts
et al., 2025), can be found in (Scurria, 2026).                         opposite forces on z and z ′ and drives them apart, so that
                                                                        the difference d ceases to be redundant and begins to carry
C.1. The Helmholtz Obstruction                                          information about the loss landscape.

The natural physical route to a variational principle for a
dynamical system ẋ = F (x, θ) is to seek a scalar potential            D. Proofs for Dyadic EP
E such that F = −∂x E. The classical Helmholtz integra-                 We now demonstrate that Dyadic EP correctly trains the
bility condition states that such an E exists if and only if the        parameters θ of the original force field F (x, θ), giving the
Jacobian JF is symmetric everywhere. Whenever the inter-                                    0
                                                                        exact gradient dC(x̄
                                                                                         dθ
                                                                                              )
                                                                                                in the limit of infinitesimal nudging.
actions are non-reciprocal — as in feedforward networks,
active matter, or driven optical systems — JF acquires
a non-zero antisymmetric part and the Helmholtz condi-                  D.1. Proof of EP
tion fails identically. No scalar potential on the original             First, recall that standard EP does not strictly require the
n-dimensional state space can then generate the dynamics,               system to settle at an energy minimum; it requires only that
and the “energy minimisation” route at the heart of standard            the system reaches a stationary state (a fixed point of the
EP is blocked at the structural level. The obstruction is not           dynamics). Indeed, using the notation of Section 2.1, EP
a matter of computational convenience: it reflects the fact             relies on the key identity:
that the rotational component of F carries information that
no scalar function of x alone can record.                                           d2                 d2
                                                                                        ET (xβ , θ) =      ET (xβ , θ).            (57)
                                                                                   dθdβ               dβdθ
C.2. Variational Reconstruction on a Doubled Space                      Expanding the total derivative with respect to β gives:
Applying the Bateman–Galley formalism circumvents this                                                         ⊤
                                                                                                ∂ET (xβ , θ)        dxβ   ∂ET (xβ , θ)
                                                                                            
obstruction by enlarging the configuration space. The single              d
                                                                            ET (xβ , θ) =                               +
state x ∈ Rn is replaced by a conjugate pair (z, z ′ ) ∈ R2n ,           dβ                        ∂x               dβ       ∂β
and the rotational component of F — which has no scalar                                  = C(xβ ).                                 (58)
generator on the original n-dimensional space — is ab-
sorbed into a bilinear coupling between z and z ′ on the                Where the first term vanishes because the system is at a
                                                                                                ∂
doubled space, where it does admit a variational descrip-               stationary state, i.e., ∂x ET (xβ , θ) = 0; this holds even if
tion. The physical motion is recovered on the diagonal                  the system is not at a minimum of ET . Similarly, for the
submanifold z = z ′ (the so called ’physical limit’), while             derivative with respect to θ:
the off-diagonal direction d = z − z ′ supplies the additional
                                                                                       d                ∂ET (xβ , θ)
degree of freedom needed to encode non-reciprocity.                                       ET (xβ , θ) =              ,             (59)
                                                                                       dθ                  ∂θ
Specializing this reconstruction to the overdamped (first-
                                                                        where we additionally assume that the cost function does
order) regime relevant to relaxational neural dynamics yields
                                                                        not depend explicitly on the parameters θ. Substituting these
the bilinear energy
                                                                        results into Eq. (57) in the limit of infinitesimal nudging
                                                                        (β → 0) recovers the fundamental relation given by Eq. (9).
                                             z + z′
                                                     
        H(z, z ′ , θ) = −(z − z ′ )⊤ F              ,θ ,   (56)
                                               2
                                                                        D.2. Proof of Dyadic EP
which is precisely Eq. (26). The symmetric midpoint m =                 We analyze now the stationary states of Dyadic EP by intro-
(z + z ′ )/2 plays the role of the physical coordinate of the           ducing the change of variables:
doubled system, while d is the auxiliary direction along
which non-reciprocity is stored. On the submanifold z = z ′                                  z + z′
                                                                                       m=           ,          d = z − z′.         (60)
the coupling proportional to (z − z ′ ) vanishes identically                                   2
and both states evolve under the original field F , so the              In these coordinates, the augmented energy HT becomes
doubling leaves the on-shell physics unchanged. We refer
the reader to (Scurria, 2026) for the full construction.                        HT (m, d, θ, β) = −d⊤ F (m, θ) + βC(m)             (61)

                                                                   14
                                   Equilibrium Propagation for Non-Conservative Systems

and the dynamics in Eq. (28) can be rewritten as:                          In Dyadic EP, we instead employ the single-phase update:

                                                                                                 1 ∂H(z β , z ′β , θ)
                                                                                                                     
    dm     ∂HT
        =−     = F (m, θ),                                    (62)                       ∆θ ∝ −                                (70)
    dt      ∂d                                                                                   β         ∂θ
     dd    ∂HT                     ∂
        =−     = dT JF (m, θ) − β    C(m).                    (63)         This choice avoids the overhead of evolving two coupled
     dt    ∂m                     ∂m
                                                                           equations in the extended space, which would be computa-
                             β
The stationary states (mβ , d ) are the solutions to:                      tionally equivalent to evolving four equations in the original
                                                                           space (two for +β and two for −β). Using Eq. (70), we
                              F (mβ , θ) = 0,                 (64)         evolve only one coupled equation for +β in the extended
                                                                           space; this corresponds to two equations in the original
            βT                ∂
           d JF (mβ , θ) − β    C(mβ ) = 0.                   (65)         space, thereby achieving the same computational complex-
                             ∂m                                            ity as AsymEP. Furthermore, this single-phase formulation
                                                                           suggests a pathway toward making the update local in time,
This leads to the following observations:                                  provided appropriate hardware is used to implement the
1) The stationary state of m is independent of β and coin-                 augmented phase.
cides with the stationary state of the original system:                    Mathematically, these two approaches yield the same gradi-
                                                                           ent estimate because the equations for dβ are linear. Explic-
               z β + z ′β
                          = mβ = m0 = x0 .                    (66)         itly we have :
                   2
                                                                              ∂H(z β , z ′β , θ)                   ∂F z β + z ′β
                                                                                                                                    
                                                                                                 = −(z β − z ′β )⊤                ,θ
2) The Jacobian of the extended system defined in Eq. (26)                         ∂θ                              ∂θ     2
is invertible, provided JF is invertible. This is most evident                                                      ⊤
                                                                                                         ∂F 0                        −1
from Eq. (63).                                                                                   = −β          z ,θ     JF⊤ (z 0 , θ)
                                                                                                         ∂θ
3) The stationary state value of d is given by:                                                        
                                                                                                          ∂C 0
                                                                                                                   
                                                                                                    ×         (z ) ,                   (71)
             β
                                      
                                 −1 ∂C 0
                                                                                                         ∂x
           d = β JF⊤ (m0 , θ)              (x )               (67)
                                        ∂x                                 where we have used Eqs. (66) and (67). Inspection of
                                                                           Eq. (71) confirms that, up to corrections of order β 2 , we
                                       0
In particular, when β = 0, we have d = 0, which implies                    obtain exactly the same gradient as in AsymEP.
that the free stationary states coincide: z 0 = z ′0 .
4) The cost at the stationary state of the extended system                 E. AsymEP versus Dyadic EP
is equal to the cost at the stationary state of the original
                                                                           In this appendix, we demonstrate that Asymmetric Equilib-
system:
                                                                           rium Propagation (AsymEP) emerges naturally as the first-
                     D(m0 ) = C(x0 ).                   (68)
                                                                           order projection of the 2N -dimensional Dyadic Equilibrium
Consequently, the gradients of the cost with respect to the                Propagation onto a single N -dimensional state space. We
parameters are identical.                                                  then formalize the physical trade-offs between the two ar-
                                                                           chitectures.
Since both the original and extended systems, given respec-
tively in Eq. (28) and Eq. (1-2), share the same cost at their
                                                                           E.1. AsymEP as the Linear Projection of Dyadic EP
respective stationary states, and because the Jacobians of
both models are invertible, applying EP update rule to the                 As established in Appendix D.2, transforming the 2N -
extended system give the correct gradient estimate for the                 dimensional extended space (z, z ′ ) into the mean state
                                                                                       ′
parameters θ of the original system.                                       m = z+z   2   and the difference state d = z − z ′ exactly
The final step of the proof is to establish the equivalence                decouples the stationary dynamics. Because the stationary
between the standard parameter update rule in Eq. (8) and                  state of m is the free state of the original system (mβ = x0 ),
the modified rule used by Dyadic EP in Eq. (34). Indeed, if                the cost function drives the difference variable to a stationary
                                                                                  β
we were to apply the standard update rule in the extended                  state d satisfying:
space, the update would be:                                                                              β      ∂C 0
                                                                                            JF⊤ (x0 , θ)d = β      (x )               (72)
          1
                      β   ′β
                   ∂H(z , z , θ) ∂H(z , z   −β   ′−β
                                                       , θ)
                                                                                                               ∂x
  ∆θ ∝ −                        −                                 .
         2β            ∂θ             ∂θ                                   To recover this exact error signal in an N -dimensional space,
                                                              (69)         we postulate a modified dynamical system FA (x) compris-

                                                                      15
                                   Equilibrium Propagation for Non-Conservative Systems

ing the standard EP dynamics and a spatial correction Γ(x):          F. Derivation of the Hopfield-like Energy
                                 ∂C                                  In this section, we derive the explicit energy functional for
            FA (x) = F (x) − β      (x) + Γ(x)          (73)         the Continuous Asymmetric Hopfield dynamics defined in
                                 ∂x
                                                                     Eq. (35). The force field is given by:
Let ∆x = xβA − x0 denote the displacement from the
free equilibrium. Expanding the stationarity condition                              F (x) = ρ′ (x) ⊙ (Jρ(x)) − x.              (78)
FA (xβA ) = 0 to first order around x0 yields:
                                                                     We omit external inputs J in for brevity, as they appear sym-
                             ∂C 0                                    metrically in the Jacobian. The variational Hamiltonian is
         JF (x0 , θ)∆x − β      (x ) + Γ(xβA ) ≈ 0      (74)         defined as:
                             ∂x
                                                                                                    z + z′              z + z′
                                                                                                                            
To ensure the first-order displacement matches the Dyadic             H(z, z ′ ) = −(z − z ′ )⊤ F             + βC               .
                              β                                                                        2                  2
EP error signal (i.e., ∆x ≈ d ), we substitute Eq. (72) into
                                                                                                                               (79)
the expansion:
                                                                     To analyze this expression, we introduce the midpoint m =
                                                                     z+z ′
         Γ(xβA ) = JF⊤ (x0 , θ) − JF (x0 , θ) ∆x                           and the difference d = z − z ′ . Since the separation
                                             
                                                        (75)           2
                                                                     between z and z ′ is induced solely by the nudging parameter
                 = −2AJ (x0 , θ)(xβA − x0 )             (76)
                                                                     β, the difference scales as ∥d∥ ∼ O(β). We therefore
This uniquely recovers the AsymEP augmented dynamics.                neglect terms of order O(∥d∥3 ) (i.e., or equivalently O(β 3 ))
Finally, to eliminate the O(β 2 ) error, AsymEP evaluates the        as they do not contribute to the gradient of the cost.
centered difference of two opposite nudges:                          The activation at the midpoint can be approximated as:

                             dxA                                                           ρ(z) + ρ(z ′ )
            x±β  0
             A =x ±β                 + O(β 2 )          (77)                     ρ(m) =                   + O(∥d∥2 ).          (80)
                              dβ β=0                                                            2
                                                                     Similarly, the difference in activations is:
Subtracting these states cancels the O(β 2 ) error, yielding
1   +β     −β       β        3
2 (xA − xA ) = d + O(β ), successfully recovering the                          ρ(z) − ρ(z ′ ) = ρ′ (m) ⊙ d + O(∥d∥3 ).         (81)
exact post-synaptic update term.
                                                                     Inverting this relation, we express the state difference as:
E.2. Physical Trade-offs and the Extended Space
                                                                         z − z ′ = (ρ(z) − ρ(z ′ )) ⊙ ρ′ (m) + O(∥d∥3 ).       (82)
We can view AsymEP and Dyadic EP as a space-time trade-
off of the same underlying physical optimization problem.
                                                                     We substitute these expansions into the interaction term
AsymEP preserves the original N -dimensional state space             of the Hamiltonian, Hint = −(z − z ′ )⊤ (ρ′ (m) ⊙ Jρ(m)).
of the network at the cost of temporal non-locality. The sys-        Applying the identity a⊤ (b ⊙ c) = (a ⊙ b)⊤ c, we obtain:
tem must evolve sequentially, requiring physical memory
                                                                                                        ⊤
not only to store the free equilibrium x0 for the asymmet-               Hint = − ((z − z ′ ) ⊙ ρ′ (m)) Jρ(m)
ric correction, but also to store the successive stationary                                           
                                                                                                        ρ(z) + ρ(z ′ )
                                                                                                                       
states required to evaluate the contrastive gradient update.                  ≈ −(ρ(z) − ρ(z ′ ))⊤ J                     .     (83)
                                                                                                             2
AsymEP thus serves as the direct, spatially minimal exten-
sion of EP.                                                          Expanding the product gives:
Dyadic EP provide a learning signal that is local in both
                                                                                         1h
space (where z − z ′ encodes the gradient) and time (allow-                   Hint = −      ρ(z)⊤ Jρ(z) + ρ(z)⊤ Jρ(z ′ )
                                                                                         2
ing the nudged phases to execute in parallel) at the cost                                                             i
of doubling the state space. In particular, capturing non-                           − ρ(z ′ )⊤ Jρ(z) − ρ(z ′ )⊤ Jρ(z ′ ) .    (84)
conservative forces in this extended space requires a spe-
cific bilinear coupling, rather than a trivial superposition         We decompose the connectivity matrix J into its symmetric
of uncoupled subsystems. It can be seen as a blueprint for           part S and antisymmetric part A. The first and last terms
future neuromorphic hardware.                                        simplify to ρ(z)⊤ Sρ(z). The cross terms satisfy:
Ultimately, the reduction of Dyadic EP to AsymEP via the               ρ(z)⊤ Jρ(z ′ ) − ρ(z ′ )⊤ Jρ(z) = ρ(z)⊤ (J − J ⊤ )ρ(z ′ )
variables m and d proves the universality of EP’s variational
principle.                                                                                             = ρ(z)⊤ (2A)ρ(z ′ ).    (85)

                                                                16
                                         Equilibrium Propagation for Non-Conservative Systems

Thus, the interaction term reduces to:                                  The input parameters are then updated using the standard
                                                                        learning rule (21). In particular, the presynaptic term associ-
                 1                1
        Hint = − ρ(z)⊤ Sρ(z) + ρ(z ′ )⊤ Sρ(z ′ )                        ated with the input weights is given by,
                 2                2
               − ρ(z)⊤ Aρ(z ′ ) + O(∥d∥3 ).                 (86)                            ∂Fi
                                                                                              in
                                                                                                 = δik ρ′ (xi )ul .               (93)
                                                                                            ∂Jkl
Finally, for the nudging term, we expand the cost function              The presynaptic terms associated with the dynamical param-
                                                                               dyn
around the midpoint:                                                    eters Jij  depend on the experiment.

                    1
          C(m) =      (C(z) + C(z ′ )) + O(∥d∥2 ).          (87)        G.1. Symmetric Initialization
                    2
                                                                        G.1.1. L EARNING RULES
When multiplying by β, the remainder term becomes β ·
O(∥d∥2 ). Since ∥d∥ ∼ O(β), this remainder is of order                  For clarity, we write the learning rules for VF and AsymEP.
O(β 3 ) and can be consistently discarded alongside the third-          For the input weights, using (93), we have:
order terms from the interaction expansion.                                                  1 h +β                      i
                                                                                      in
                                                                                   ∆Jik  ∝       (xi − x−β       ′ 0
                                                                                                           i )ρ (xi )uk ,      (94)
Combining all these components, the final Hamiltonian is:                                   2β

                    1               1                                   while for the recurrent weight, we get:
     H(z, z ′ ) = − ρ(z)⊤ Sρ(z) + ρ(z ′ )⊤ Sρ(z ′ )
                    2               2                                                      1 h +β                          i
                                   1                                            ∆Jijdyn
                                                                                        ∝      (xi − x−β i  )ρ′ 0
                                                                                                               (xi )ρ(x 0
                                                                                                                        j )  .    (95)
                  − ρ(z) Aρ(z ) + (∥z∥2 − ∥z ′ ∥2 )
                        ⊤    ′                                                            2β
                                   2
                    β            ′
                                                                        For EP, we have:
                  + (C(z) + C(z )).                 (88)
                    2                                                                       1 h +β              i
                                                                                   in
                                                                                 ∆Jik ∝         ρ(xi ) − ρ(x−β
                                                                                                            i  ) uk ,             (96)
The saddle-point dynamics, given by Eq. 32, generated by                                   2β
this Hamiltonian are:                                                   and for the recurrent weights:
    dz                                          β ∂C                                  1 h +β
        = ρ′ (z) ⊙ (Sρ(z) + Aρ(z ′ )) − z −
                                                                                                                      i
                                                     ,                        dyn                            −β    −β
    dt                                          2 ∂z
                                                            (89)          ∆Jij    ∝      ρ(xi )ρ(x+βj ) − ρ(xi )ρ(xj ) . (97)
                                                                                     2β
      ′
   dz                                            β ∂C
        = ρ′ (z ′ ) ⊙ (Sρ(z ′ ) + Aρ(z)) − z ′ +        .   (90)
   dt                                            2 ∂z ′                 G.1.2. S UPPLEMENTARY N UMERICAL R ESULTS
This system recovers the original continuous Hopfield dy-               To complement Fig. 2, we report the evolution of the accu-
namics when z = z ′ (assuming β = 0).                                   racy of the three methods in Fig. 4. We consider a layered
                                                                        network with 50 hidden neurons. While this capacity is
G. Experimental Details                                                 insufficient for state-of-the-art performance, it amplifies the
                                                                        difference in accuracy between models to aid visualization.
As in the main text, the neuronal dynamics are governed by              Models are trained for 20 epochs starting from a symmetric
the vector field:                                                       configuration, the natural setting for both VF and EP. With
                                                                      this initialization, AsymEP consistently outperforms the
                    X dyn                                               other methods and learns faster by exploiting the additional
     Fi = ρ′ (xi )      Jij ρ(xj ) + bi (u) − xi ,  (91)
                                                                        degrees of freedom of the asymmetric network.
                      j


where the input-dependent bias bi (u) is precomputed for                G.2. Fixed Asymmetry Ratio
each MNIST input u as:                                                  This section details the implementation for the fixed asym-
                                X                                       metry ratio experiments presented in Section 5.2, followed
                     bi (u) =          Jilin ul .           (92)        by complementary numerical results regarding learning
                                l∈in                                    speed and induced Jacobian asymmetry.
This term projects the input space into the recurrent sub-
                                                                        G.2.1. L EARNING RULES
space. The bias yields a diagonal contribution to the Jaco-
bian JF = ∂F ∂x , and therefore does not contribute to the              Parametrization and notation. To enforce a fixed asym-
antisymmetric correction used in the augmented dynamics                 metry ratio, we explicitly parameterize the independent ele-
Eq. (20) of AsymEP.                                                     ments of Eq. (38). We introduce two parameter vectors θS

                                                                   17
                                      Equilibrium Propagation for Non-Conservative Systems

         Parameter                                     Sym. Init. / Feedforward             Fixed rstr          Fixed rstr & rin
                                                            sec. 5.1 & 5.3                   sec. 5.2              app. G.3
         Learning Rate (Input-Hidden)                             0.05                       0.05                    0.0125
         Learning Rate (Hidden-Output)                            0.01                       0.01                    0.0025
         Time Step (Dynamics Integration)                          0.5                        0.3                      0.3
         Nudging Parameter (β)                                     0.5                        0.5                      0.5
         Free-phase Steps (nfree )                                  20                        30                       40
         Nudged-phase Steps (nnudge )                               10                        10                       10
         Number of Epochs                                        40 / 20                      30                       40
         Batch Size                                                 64                       √64                      √64
         Scaling Parameter γ                                       n.a.                        60                        60
         Structure                                            784 - n.a. -10               784-50-10           all-to-all, 500 hid
         Activation function ρ                                    tanh                       tanh                     tanh
         Initial Recurrent State s                            s ∼ U (−1, 1)              s ∼ U(−1, 1)            s ∼ U(−1, 1)
         Initial Parameters θ                                 θ ∼ N (0, N1 )             θ ∼ N (0, N1 )          θ ∼ N (0, N1 )
         Number of Runs (training + inference)                      10                        10                       10
Table 3. Trained Model Hyperparameters on MNIST. N is the total number of neurons, U(−1, 1) is a uniform distribution, and N (µ, σ 2 )
is a Gaussian distribution. For the rstr parametrization, we choose more cautious hyperparameters for training and inference compared to
the symmetric initialization, due to increasingly non-conservative and potentially oscillatory dynamics.


                                                                        elements of S̃, the full matrices are constructed as:
                                                                                                        S
                                                                             S̃ij = δij ξi + (1 − δij )θk(max(i,j),min(i,j)) ,         (99)
                                                                                         A
                                                                             Ãij = ϵij θk(max(i,j),min(i,j)) ,                       (100)
                                                                        where ϵij is the Levi-Civita symbol. The dynamical param-
                                                                        eters are then given by:
                                                                                         dyn
                                                                                        Jij  = γ(cS S̃ij + cA Ãij ),                 (101)
                                                                        with normalization coefficients
                                                                                        p
                                                                                                2
                                                                                          1 − rstr                           rstr
                                                                                 cS =              ,                  cA =        ,   (102)
                                                                                           FS                                FA
                                                                        defined in terms of the Frobenius norms:
                                                                                             v
                                                                                             uN            M
                                                                                             uX           X      2
Figure 4. Evolution of the mean accuracy and standard deviation                       F =t
                                                                                        S          ξ2 + 2i    θS ,     k              (103)
(over 10 runs) during training on MNIST for AsymEP, EP, and VF.                                  i=1            k=1
Models use 50 hidden neurons.                                                              v
                                                                                           u M
                                                                                           u X    2
                                                                                      FA = t2  θkA .                                  (104)
                                                                                                  k=1
and θA of size M = Ndyn (Ndyn − 1)/2, which encode the
off-diagonal elements of the symmetric and antisymmetric                Presynaptic computation. The dependence of the nor-
components S̃ and Ã, respectively. The correspondence                  malization coefficients on the parameters introduces addi-
between matrix and vector indices is given by:                          tional regularization terms in the learning rule compared
                                                                        to the parameterization of (Scellier &Bengio, 2017). The
               (i − 1)(i − 2)                                           gradients of the normalization coefficients are:
   k(i, j) =                  + j,     (1 ≤ j < i ≤ Ndyn )
                     2                                                      ∂cS          θkS                 ∂cS        ξm
                                                            (98)                 = −2cS      2,                  = −cS       2,       (105)
                                                                            ∂θkS        (FS )                ∂ξm       (FS )
                                                                            ∂cA        θA
where the condition j < i selects the strictly lower triangular               A
                                                                                = −2cA k 2 .                                          (106)
elements. Introducing an additional vector ξ for the diagonal               ∂θk       (FA )

                                                                   18
                                        Equilibrium Propagation for Non-Conservative Systems

          Parameter                                        Comparison Dyn.         2 hidden layers         3 hidden layers
                                                              sec. 5.4                 sec. 5.4                sec. 5.4
          Learning Rate (Input-Hidden)                          0.0016                 0.0013                   0.6
          Learning Rate (Hidden-Hidden)                         0.0016                 0.0013                   0.6
          Learning Rate (Hidden-Output)                         0.0016                 0.0013                   0.6
          Time Step (Dynamics Integration)                        0.4                    0.3                  0.0075
          Nudging Parameter (β)                                   0.3                    0.5                   0.20
          Free-phase Steps (nfree )                               40                     40                     60
          Nudged-phase Steps (nnudge )                            20                     20                     30
          Number of Epochs                                        50                     40                     40
          Batch Size                                              64                     64                     64
          Layer Structure                                   784-500-200-10         784-500-500-10       784-500-500-500-10
          Activation function ρ                                  tanh                   tanh                   tanh
          Initial Recurrent State s                          s ∼ U(−1, 1)           s ∼ U(−1, 1)           s ∼ U (−1, 1)
          Initial Parameters θ                               θ ∼ N (0, N1 )         θ ∼ N (0, N1 )         θ ∼ N (0, N1 )
          Number of Runs (training + inference)                   10                     10                     10
Table 4. Trained Model Hyperparameters on Fashion-MNIST. N is the total number of neurons, U(−1, 1) is a uniform distribution, and
N (µ, σ 2 ) is a Gaussian distribution. For the rstr parametrization, we choose more cautious hyperparameters for training and inference
compared to the symmetric initialization, due to increasingly non-conservative and potentially oscillatory dynamics.


Combining these with the derivatives of the matrices S̃ and                    (where p > q):
Ã, we have:
                                                                                                                   N
                                                                                                            θkA X
                                                                                                       
                                                                                   ∂Fi          ′
     ∂ S̃ij                             ∂ S̃ij                                          = γc A ρ  (xi ) −2            Ãij ρ(xj )
            = δip δjq + δiq δjp ,              = δij δkj     (107)                 ∂θkA                    (FA )2 j=1
     ∂θkS                               ∂ξk                                                                                        
     ∂ Ãij                                                                                               + δip ρ(xq ) − δiq ρ(xp ) .
            = δip δjq − δiq δjp ,                            (108)
     ∂θkA                                                                                                                           (111)
where k corresponds to the index pair (p, q) with p > q, as
defined in Eq. (98). The full presynaptic terms are then:                 Initialization. To ensure the stability of the system, we
                                                                          initialize our parameters suchhthat the
                                                                                                                i variance of dynam-
                                                                                                            dyn
   • For the diagonal parameters ξm :                                     ical parameters scales as Var Jij ∝ 1/Ndyn . This is a
                                                                          conservative choice for the layered architectures used in our
                                                                                                                 dyn
         ∂Fi
                           
                               ξm X
                                      N                                   experiments, where many entries of Jij     are zero.
             = γcS ρ′ (xi ) −            S̃ij ρ(xj )
         ∂ξm                  (FS )2 j=1                                  In practice, we initialize the parameter vectors θS , θA , and
                                                             (109)
                                                                         ξ with identical variances σ 2 . For large Ndyn , the expected
                             + δim ρ(xm ) .                               Frobenius norms approximate to E[FS,A ] ≈ Ndyn σ. Conse-
                                                                          quently, the normalization coefficients become:
                                                                                         p
                                                                                                 2
   • For the off-diagonal symmetric parameters θkS (where                                 1 − rstr                   rstr
                                                                                  cS ≈             ,        cA ≈           .        (112)
     p > q):                                                                              Ndyn σ                    Ndyn σ

                                           N                              Since the symmetric and antisymmetric components are sta-
                                    θkS X
                               
          ∂Fi          ′
               = γc S ρ  (x i ) −2            S̃ij ρ(xj )                 tistically independent, the variance of the weights is derived
          ∂θkS                     (FS )2 j=1
                                                                         as follows:
                                    + δip ρ(xq ) + δiq ρ(xp ) .
                                                                             • Diagonal elements (i = j):
                                                             (110)
                                                                                                                         2
                                                                                        h      i                    1 − rstr
                                                                                     Var Jiidyn = γ 2 c2S σ 2 ≈ γ 2    2     .      (113)
   • For the off-diagonal antisymmetric parameters θkA                                                               Ndyn

                                                                     19
                                      Equilibrium Propagation for Non-Conservative Systems

   • Off-diagonal elements (i ̸= j):                                    a zero-cost baseline (perfect prediction) during learning.
                                                                        Specifically, for each method and value of rstr , we calcu-
             h
               dyn
                   i                       γ2                           late the cumulative loss by summing the batch-averaged
                     = γ 2 c2S + c2A σ 2 ≈ 2 ,
                                    
          Var Jij                                         (114)
                                          Ndyn                          costs of the first 5 epochs (out of 30, to avoid saturation
              h     i                                                   effects), and reporting the mean and standard deviation over
                dyn
To satisfy Var Jij    ∝ 1/Ndyn , we set:                                10 independent training runs. Mathematically, for each run:

                               p
                         γ=     Ndyn                      (115)                                                                       
                                                                                               5     NX
                                                                                               X      batches       X        C(x0 , u) 
Note that by random matrix theory, diagonal elements do                   Cumul. Loss =                                                 ,
                                                                                                                              |Bk |
not affect stability in the large Ndyn limit.                                                epoch=1 k=1        (x0 ,u)∈Bk
                                                                                                                              (120)
Potential Simplification. Although the parameterization                 where Bk represents the k-th batch, and |Bk | denotes the
above is fully general, a simpler construction is possible              number of examples in the batch. The parameters are up-
by removing self-connections (ξ = 0) and enforcing identi-              dated after each batch step; consequently, the free equilib-
cal parameterization for the symmetric and antisymmetric                rium x0 is inferred using the updated parameters and the
components, i.e., θS = θA = θ. The matrix elements then                 current example u.
become:

            S̃ij = (1 − δij )θk(max(i,j),min(i,j)) ,      (116)
            Ãij = ϵij θk(max(i,j),min(i,j)) .            (117)

In this case, the Frobenius norms are equal (FS = FA ), and
we can omit the explicit normalization:
                       q
                 dyn           2 S̃ + r Ã .
               Jij   = 1 − rstr    ij   str ij       (118)

For a parameter θk corresponding to indices (p, q) with
p > q, the presynaptic term is given by:
                q                   
  ∂Fi
      = ρ′ (xi )           2 +r
                     1 − rstr      str δip ρ(xq )
  ∂θk
                     q                                  (119)
                  +             2 −r
                          1 − rstr            δ    ρ(x
                                        str     iq     p .
                                                        )

While this parameterization works in simulations and keeps              Figure 5. Cumulative loss as defined by (120) over the first 5
the number of parameters constant for all rstr , it constrains          epochs of training, for different asymmetry ratios rstr . We compare
the asymmetry to be “homogeneous”, by which we mean                     VF (orange) and AsymEP (blue), under two training regimes:
that the asymmetry ratio is identical for every pair of neu-            training only J in (dashed) or all parameters (solid).
rons; hence, the network cannot learn to be symmetric in one
region and antisymmetric in another. Therefore, we choose
to explore the more general case of (38) in our experiments.            In Fig 5, we observe that learning slows down for both al-
                                                                        gorithms when rstr ≳ 0.6. This behavior likely results from
G.2.2. S UPPLEMENTARY N UMERICAL R ESULTS                               the increased difficulty of reaching a stationary state as the
                                                                        dynamics become strongly asymmetric. With a fixed num-
To complement the results of Fig 3, we analyze the training
                                                                        ber of inference steps, incomplete convergence degrades the
efficiency as a function of the asymmetry ratio rstr and in-
                                                                        accuracy of the gradient estimates, thereby slowing down
vestigate the robustness of VF by monitoring the Jacobian
                                                                        the learning. Fig 5 shows that while VF can eventually
asymmetry.
                                                                        achieve competitive accuracy, it is consistently slower than
                                                                        AsymEP as soon as asymmetry is introduced.
Training efficiency. We first study the training efficiency
of the two algorithms as a function of the asymmetry ra-
tio rstr . Inspired by the related concept in (Cesa-Bianchi
&Lugosi, 2006), we define the cumulative loss as the accu-              Jacobian asymmetry. We next examine how the struc-
mulated difference between the free equilibrium cost and                tural asymmetry rstr is reflected in the Jacobian of the dy-

                                                                   20
                                     Equilibrium Propagation for Non-Conservative Systems

namics (35), given by:
    ∂Fi                       dyn ′
        = (1 − δij )ρ′ (xi )Jij  ρ (xj )
    ∂sj
               h                                           i
          + δij ρ′ (xi )(Jiidyn ρ′ (xi )) + ρ′′ (xi )bi − 1 .
                                                           (121)

In our layered architecture, the self-connections are zero
(Jiidyn = 0). For the following analysis, we neglect all diag-
onal terms in the Jacobian (including external inputs and
potential), since they do not contribute to the antisymmetric
correction (20) and thus to the discrepancy between the per-
formance of VF and AsymEP. Consequently, we define the
following asymmetry ratio based solely on the off-diagonal
Jacobian JF,off :                                                       Figure 6. Asymmetry ratio of the Jacobian rjac defined in equation
                                      ⊤
                                                                        (122) after training for different asymmetry ratios rstr . We compare
                          ∥JF,off − JF,off ∥F                           VF (orange) and AsymEP (blue), under two training regimes:
                 rjac =                       ,            (122)        training only J in (dashed) or all parameters (solid).
                             ∥JF,off ∥F

The results are presented in Fig 6. For each trained model              Consequently, local stability requires the largest real eigen-
and ratio rstr , we compute rjac averaged over the stationary           value of the effective weight matrix to be strictly less than 1.
states of the first batch (64 images) across 10 independent             Assuming weights are initialized independently with vari-
runs. We observe that when structural asymmetry is strong               ance σ 2 , Girko’s circular law dictates that the eigenvalues
and all parameters are trained, VF partially compensates for            of√an asymmetric matrix uniformly populate a disk of radius
the asymmetry by adjusting the neuronal states. This can be             σ n in the complex plane. In contrast, imposing symmetry
understood by rewriting the ratio as:                                   forces the eigenvalues √ onto the real line, broadening the
                                                                      spectral radius to 2σ n according to Wigner’s semicircle
                            dyn      dyn ⊤
                ρ′ (xi ) Jij    − (Jji  ) ρ′ (xj )                      law. As a result, asymmetric networks can stably accommo-
                                                   F
     rjac =                                          .  (123)           date larger variance in the weight initializations than their
                                  dyn ′
                         ρ′ (xi )Jij ρ (xj )                            symmetric counterparts.
                                          F

Compared to the structural asymmetry ratio in Eq. (37),                 Asymmetry nevertheless introduces imaginary eigenvalues
a value of rjac < rstr indicates that the neuronal states ef-           and, consequently, damped oscillations. To study this effect
fectively dampen the structural asymmetry, rendering the                experimentally in a controlled setting, we constrain the input
dynamics more symmetric. This symmetrization of the Ja-                 projections J in . In the experiments of the main text, fixing
cobian appears without imposing an additional symmetriza-               the structural asymmetry ratio rstr still allowed AsymEP
tion penalty and could be enhanced using the method of                  to reduce oscillations by aligning and increasing the input
(Laborieux &Zenke, 2022). This mechanism likely explains                projections J in , thereby adding stabilizing diagonal contri-
the superior performance of ‘All (VF)’ compared to ‘J in                butions to the Jacobian. To isolate the network’s ability to
(VF)’ in Fig 3, as the former is able to use the additional             suppress oscillations independently of the magnitude of the
degrees of freedom to reduce the effective asymmetry at                 input drive, we further constrain the relative scale of J in and
high rstr .                                                             J dyn by imposing
                                                                                                   ∥J in ∥F   ∥J in ∥F
G.3. Stability analysis with Fixed Asymmetry Ratio &                                      rin =             =          ,              (124)
                                                                                                  ∥J dyn ∥F      γ
     Constrained Inputs Projection
                                                                        where ∥J dyn ∥F = γ following Eq. (101). Defining unscaled
A complete stability analysis of the non-conservative dy-               input projections J˜in , we set
namics trainable with AsymEP is beyond the scope of this
work. Nevertheless, for the class of continuous Hopfield                                                        J˜in
                                                                                               J in = rin γ                           (125)
networks considered here, standard arguments from random                                                      ∥J˜in ∥F
matrix theory suggest that asymmetry inherently improves
asymptotic stability.                                                   G.3.1. L EARNING RULES
In the dynamics defined by Eq. (91), the linear leak term               Reusing the notations of the previous section, we write
−xi shifts the spectrum of the system’s Jacobian by −1.                 Jilin = γcin J˜ilin with the normalization cin = rin /Fin , where

                                                                   21
                                   Equilibrium Propagation for Non-Conservative Systems

Fin = ∥J˜in ∥F . Applying the chain rule yields:
                        "                   #
   ∂Fi            ′              J˜kl
                                   in X
                                        ˜in
          = γcin ρ (xi ) δik ul − 2     J um .          (126)
   ∂ J˜kl
       in                        Fin m im

And for γ we have:

                     ∂Fi  1
                         = (Fi + xi ).                  (127)
                     ∂γ   γ

G.3.2. S UPPLEMENTARY N UMERICAL R ESULTS
                                                                      Figure 7. Comparison of AsymEP and VF on a feedforward net-
Table 5 reports a worst-case control experiment where the
                                                                      work. Test accuracy on MNIST is shown as a function of training
structural asymmetry is fixed at rstr = 0.7 while the input           epochs for a single-hidden-layer network with 20 neurons. Curves
scale ratio rin is varied. The experiment uses an all-to-all          report the mean and standard deviation over 10 runs. Best accura-
architecture on MNIST (excluding direct input-to-output               cies are 92.7% ± 0.5% (AsymEP) and 64.3% ± 2.0% (VF).
connections). The output variance during extended infer-
ence (steps 30-50) confirms that the system successfully
learns to suppress oscillations even when rin is severely re-         G.5. Advantages of Non-Conservatives Dynamics
stricted. Any small residual oscillations can be mitigated by         In Section 5.4, we compare three (non-)conservative dynam-
time-averaging over the inference steps.                              ics under varying constraints. To further evaluate learning
Finally, rin can be interpreted as a measure of the external          speed, Table 6 reports network performance after a sin-
signal magnitude relative to the internal recurrent dynamics.         gle epoch. These results confirm our earlier observation:
These results indicate that the system remains capable of             AsymEP learns faster than VF.
learning and stabilizing even under a low external input
drive. Even when the input projection ∥J in ∥F is 100 times           G.6. Feedforward CIFAR-10 Experiments
smaller than the recurrent connections ∥J dyn ∥F , the network        This appendix details the architecture and hyperparameters
still achieves 36.34 ± 6.25% accuracy, which is well above            of the deep feedforward experiments comparing backprop-
chance level (∼ 10%).                                                 agation (BP), VF, AsymEP and Dyadic EP on CIFAR-10
                                                                      (see subsection 5.5)
G.4. Feedforward Network
G.4.1. L EARNING RULES                                                Architecture. We use a nine-layer convolutional network
                                                                      (denoted CNN9). The first eight layers are convolutional
For clarity, we write the learning rules for VF and AsymEP            with 3 × 3 kernels and zero-padding; spatial downsampling
in a feedforward architecture with one hidden layer using             is performed by strided convolutions (stride 2 on layers 2, 4,
the notation of Section 5.3. For the input weights connecting         6, 8 and stride 1 otherwise), so no pooling is used. The chan-
to the hidden layer, we get the usual formula:                        nel widths follow the sequence 3 → 64 → 64 → 128 →
                                                                      128 → 256 → 256 → 512 → 512, reducing the spatial
                  1 h +β    −β      0
                                         i
                                                                      resolution from 32 × 32 to 2 × 2. The last layer is a fully
         in
       ∆Jik ∝        (hi − hi )ρ′ (hi )uk ,             (128)
                 2β                                                   connected readout mapping the 512 × 2 × 2 feature map
                                                                      to the 10 class logits. All hidden units use a ReLU non-
while for the feedforward weights connecting the hidden to            linearity.
the output layer, we get:                                                   p Weights are initialized with the Kaiming scheme
                                                                      (σ = 2/fan-in) and biases at zero.
                    1 h +β                 0
                                              i
  ∆(Wh→o )ji ∝         (oj − o−β  ′ 0
                              j )ρ (oj )ρ(hi ) .        (129)         Training setup. All methods are trained for 40 epochs
                   2β                                                 with batch size 64 and repeated over 5 seeds. Inputs are
                                                                      normalized per channel and augmented with random 32 ×
Note that EP is not applicable in this case.
                                                                      32 crops (padding 4), random horizontal flips and Cutout
                                                                      (one 8 × 8 patch). Parameters are updated with SGD with
G.4.2. S UPPLEMENTARY N UMERICAL R ESULTS
                                                                      momentum 0.9, weight decay 5 × 10−4 and gradient-norm
In addition to the final accuracy reported in Sec. 5.3, we            clipping at 1, under a cosine learning-rate schedule decaying
show in Fig. 7 the evolution of the accuracy over 20 epochs           from 3.5 × 10−2 to 2 × 10−4 . Test accuracy is reported
for AsymEP and VF.                                                    on an exponential moving average of the weights (decay

                                                                 22
                                     Equilibrium Propagation for Non-Conservative Systems

Table 5. Output variance and final test accuracy on MNIST (%) across different values of rin with rstr = 0.7. (mean ± std over 10 runs)
(500 hiddens, all-to-all, no input-output).

                                                  Output variance                          Test Acc. (%)
                           rin            Untrained             Epoch 80                     Epoch 80
                           0.01    (3.38 ± 0.90) × 10−4        (5.22 ± 2.34) × 10−5         36.34 ± 6.25
                           0.10    (2.33 ± 0.48) × 10−4        (1.39 ± 0.17) × 10−4         90.54 ± 0.19
                           0.50    (1.34 ± 0.32) × 10−5        (1.06 ± 0.25) × 10−6         94.96 ± 0.10
                           1.00    (6.27 ± 1.24) × 10−7        (1.75 ± 0.50) × 10−8         96.30 ± 0.09


Table 6. Test accuracy on Fashion-MNIST (%) at Epoch 1 (mean
± std 10 runs). The table compares three classes of network
dynamics: Continuous Hopfield (CH), Predictive Coding (PC),
and Standard dynamics. Each is evaluated under three connec-
                                               ⊤
tivity structures: Asymmetric (Asym, Bk ̸= Wk+1   ), Symmet-
                                ⊤
ric/conservative (Sym, Bk = Wk+1 ), and Feedforward (Feedfor,
Bk = 0).

                           EP          AsymEP            VF
          Asym          -      74.91 ± 0.45         48.98 ± 4.09
 CH       Feedfor       -      74.36 ± 0.29         48.84 ± 3.46
          Sym     74.57 ± 0.43       -                    -
          Asym          -      77.83 ± 0.47         57.75 ± 3.37
 PC
          Sym     76.23 ± 0.39       -                    -
          Asym          -      76.87 ± 0.51         61.50 ± 4.06
 Standard
          Feedfor       -      77.92 ± 0.51         63.98 ± 0.73


0.9995). The targets are smoothed (ε = 0.1), which for
the EP methods amounts to nudging toward the smoothed
one-hot vector.

Relaxation hyperparameters. The four methods differ
only in the gradient estimator: BP uses automatic differ-
entiation, while the EP-based methods contrast stationary
states of the corresponding relaxation dynamics. VF uses
an integration step η = 1.0, nudging β = 0.1, and at most
K = 1000 relaxation steps with an early-stopping threshold
of 9 × 10−6 on the mean state update. Dyadic EP uses
the same settings except for a nudging strength β = 0.1.
AsymEP uses a smaller step η = 0.5, nudging β = 0.1,
and up to K = 250 relaxation steps with a threshold of
1 × 10−4 .


                                                                  23