Published as a conference paper at ICLR 2026


                                        T OWARD P RACTICAL E QUILIBRIUM P ROPAGATION :
                                        B RAIN - INSPIRED R ECURRENT N EURAL N ETWORK
                                        WITH F EEDBACK R EGULATION AND R ESIDUAL C ON -
                                        NECTIONS

                                         Zhuo Liu                                               Tao Chen ∗
                                         School of Microelectronics                             School of Microelectronics
                                         University of Science and Technology of China          University of Science and Technology of China
                                         Hefei 230026, Anhui, China                             Hefei 230026, Anhui, China
arXiv:2508.11659v2 [cs.NE] 7 May 2026


                                         zhuoliu00@mail.ustc.edu.cn                             tchen@ustc.edu.cn


                                                                                     A BSTRACT
                                                    Brain-like intelligent systems need brain-like learning methods. Equilibrium
                                                    Propagation (EP) is a biologically plausible learning framework with strong poten-
                                                    tial for brain-inspired computing hardware. However, existing implementations
                                                    of EP suffer from instability and prohibitively high computational costs. Inspired
                                                    by the structure and dynamics of the brain, we propose a biologically plausible
                                                    Feedback-regulated REsidual recurrent neural network (FRE-RNN) and study its
                                                    learning performance in the EP framework. Feedback regulation enables rapid
                                                    convergence by attenuating feedback signals and reducing the disturbance of feed-
                                                    back paths to feedforward paths. The improvement in the convergence property
                                                    reduces the computational cost and training time of EP by orders of magnitude,
                                                    delivering performance on par with backpropagation (BP) in benchmark tasks.
                                                    Meanwhile, residual connections with brain-inspired topologies help alleviate the
                                                    vanishing gradient problem that arises when feedback pathways are weak in deep
                                                    RNNs. Our approach substantially enhances the applicability and practicality of
                                                    EP. The techniques developed here also offer guidance for implementing in-situ
                                                    learning in physical neural networks.


                                        1       I NTRODUCTION
                                        Backpropagation (BP) has been the driving force behind the success of artificial intelligence (AI)
                                        across a wide variety of tasks, ranging from image recognition to natural language processing
                                        (Rumelhart et al., 1986; Lecun, 1988; He et al., 2016; Vaswani et al., 2017). Despite these tri-
                                        umphs, BP’s reliance on non-local error signals and weight transport lacks biological plausibility
                                        (Journé et al., 2023; Ororbia, 2023). The brain does not appear to implement the gradient computa-
                                        tions performed by BP, in particular the explicit derivative of the activation function, which demands
                                        precise access to the rate of change in neuronal activities at specific operating points (Ororbia, 2023).
                                        Moreover, implementing BP in neuromorphic systems incurs enormous overhead (Kudithipudi et al.,
                                        2025). Drawing inspiration from the topology and dynamics of the brain is a viable approach to ad-
                                        vancing biologically plausible learning mechanisms and to promoting energy-efficient computing
                                        systems for AI.
                                        Equilibrium Propagation (EP) (Scellier & Bengio, 2017; Ernoult et al., 2019; Laborieux et al., 2021)
                                        presents a compelling and hardware-friendly alternative. It leverages naturally settling dynamics in
                                        RNN for credit assignment, and eliminates the need for explicit activation derivatives. EP operates
                                        in two phases with nearly identical dynamics, and the synaptic adjustments depend only on local
                                        information (Ackley et al., 1985; Movellan, 1991; Ernoult et al., 2020). In EP, the output layer is
                                        softly nudged by the prediction error toward configurations that incrementally minimize the loss
                                        function, a regime termed weak supervision (Millidge et al., 2023). A major drawback of EP is
                                            ∗
                                                Corresponding Author


                                                                                            1
Published as a conference paper at ICLR 2026


its notably slow training speed and instability. An RNN often requires dozens or even hundreds
of iterations to reach a stable state (Scellier & Bengio, 2017). Previous attempts to optimize EP’s
performance have led to markedly more complicated procedures (O’Connor et al., 2019; Laborieux
& Zenke, 2024).
In this paper, we draw inspiration from the brain and propose a Feedback-regulated REsidual recur-
rent neural network (FRE-RNN). We substantially improve the convergence properties of the RNNs
and training speed of EP while achieving performance comparable to BP. Our contributions are as
follows:

       • By scaling down the feedback strength of RNNs, we enhance the robustness of EP and
         accelerate the training and inference speed by orders of magnitude because of the improved
         convergence properties.
       • To counteract the gradient vanishing problem caused by weak feedback, we introduce resid-
         ual connections into the layered RNNs, enabling the training of deep networks that previ-
         ously challenged EP and achieving performance closer to BP.
       • The feedback regulation and residual connections in RNNs of arbitrary graph topologies
         mirror the multi-scale recurrence in biological neural networks. Our work fosters EP’s bio-
         logical plausibility and extends its applicability in brain-inspired computational hardware.


2     BACKGROUND

2.1   C ONVERGENT RNN S WITH S TATIC I NPUT

Consider an RNN as a dynamical system driven by a static input x:

                                        s[t + 1] = F (x, s[t], θ),                                    (1)

where F is the transition function, s[t] is the network state at time step t(t = 0, 1, 2, . . . , T ), and
θ denotes the parameters. Assuming that the network state stabilizes in T steps, the RNN reaches
a stable point s[T ]. Its convergence is typically guaranteed by either symmetric connections with
asynchronous updates or by a sufficiently small spectral radius of asymmetric connections with
synchronous updates (Hopfield, 1982; Yildiz et al., 2012; Liu et al., 2026). That said, other factors,
e.g. activation function, also influence the dynamical properties of RNNs (Miller & Hardt, 2019).

2.2   S CALING A DJACENCY M ATRIX TO T UNE N ETWORK DYNAMICS

Scaling the spectral radius (SR) of the adjacency matrix, the largest eigenvalue of the weight matrix,
is a common method to tune the dynamics of RNN (Bai et al., 2012; Nakajima et al., 2024; Liu et al.,
2026). A SR less than one yields stable and convergent dynamics. In this case, injected signals tend
to decay over time, which manifests as short-term memory. A SR exceeding one can give rise to
expansive or even chaotic behavior in which small perturbations are amplified. By adjusting SR,
one can bias the RNN toward convergent, oscillatory, or edge-of-chaos regimes, thereby tuning
computational properties, such as convergence speed or long-term memory capacity. (Jaeger &
Haas, 2004; Legenstein & Maass, 2007; Miller & Hardt, 2019).

2.3   E QUILIBRIUM P ROPAGATION

Equilibrium propagation is a learning framework initially based on energy-based models. It proceeds
in two phases: a free (first) phase and a weakly clamped (second) phase. For the first phase, the
RNN converges to a steady state s0 under the stimulation of input alone. In the clamped phase, the
network is gently nudged by the prediction error and settles to a new stable state sβ . The weight
update can be simplified to a contrastive learning compatible with spiking-time-dependent plasticity
(STDP) (Scellier et al., 2018). EP has been further generalized to asymmetric RNNs governed by
vector field dynamics (Scellier et al., 2018). Recent work shows that asymmetry in skew-symmetric
Hopfield models can improve classification performance (Høier et al., 2024).


                                                    2
Published as a conference paper at ICLR 2026


2.4   F EEDBACK R EGULATION AND N ETWORK S TRUCTURE IN THE B RAIN

Cortical areas in the brain feature dynamic regulation of feedforward and feedback connections
(Felleman & Van Essen, 1991; Mejias et al., 2016; Michalareas et al., 2016; Semedo et al., 2022;
Fişek et al., 2023; Wang et al., 2023). In the visual system, for instance, feedforward signals domi-
nate immediately following the onset of external stimulus, whereas feedback signals become promi-
nent during spontaneous activity. Dynamically regulating the strength of feedback allows the brain
to optimize information integration, ensuring efficient perception and decision-making. In mam-
malian neocortices, information processing involves not only feedforward synaptic chains but also
extensive lateral and feedback loops that interconnect disparate regions, forming a richly recursive
network rather than a strictly layered structure. This topology implies short average path length
between neurons and efficient information flow (Watts & Strogatz, 1998; Markov et al., 2013; Lynn
& Bassett, 2019; Kulkarni & Bassett, 2025). In deep neural networks, residual connections reflect
the long-range skip-layer projections observed in cortical circuits (Perich & Rajan, 2020; Holk &
Mejias, 2024). They mitigate the vanishing gradient by providing skip pathways that preserve gra-
dient (He et al., 2016).

3     ACCELERATING EP WITH B RAIN - INSPIRED N ETWORK P ROPERTIES

                         (a)
                           𝑠0                𝑠1                  𝑠2              Predict: 𝑠𝑝
                                  𝑊0               𝛼1 𝑊1               𝑊𝑓                      Label: 𝑠𝑡


                                                   𝛽1 𝐵1              𝛽𝑓 𝐵𝑓
                                                                                 Error: 𝑒𝑝 = 𝑠𝑡 − 𝑠𝑝
                         (b)
                           𝑠0                𝑠1                  𝑠2         Predict: 𝑠𝑝
                                𝐶𝑜𝑛𝑣0             𝑃1 , 𝐶𝑜𝑛𝑣1          𝑃2 , 𝑊𝑓                  Label: 𝑠𝑡
                                (32,5,1,0)        2 , 64,5,1,0        (2)


                                                                            𝑇
                                             𝐶𝑜𝑛𝑣𝑇1 , 𝑃1−1 , 𝛽1       𝑊𝑓        , 𝑃2−1 , 𝛽𝑓
                                                                                   Error: 𝑒𝑝 = 𝑠𝑡 − 𝑠𝑝

Figure 1: Illustration of feedback and feedforward regulation. (a) Layered architecture of RNN. The
feedforward weights Wi and feedback weights Bi are rescaled by coefficients αi and βi respectively.
The dashed box encloses an RNN formed by layers s1 and s2 with feedforward and feedback path-
ways. βf is the nudging factor, which essentially scales the feedback strength of prediction error.
(b) Embedding convolutional architecture in RNN. Convolutional parameter (32,5,1,0) is written
as (channels, kernels, stride, padding). Parameter (2) in (b) denotes max-pooling with stride 2.
ConvTi represents transpose convolution, the inverse process of the convolution, and Pi−1 means
max-unpooling (Ernoult et al., 2019). Model architectures and training process are given in Ap-
pendix D.


3.1   P ROTOTYPICAL SETTING OF EQUILIBRIUM PROPAGATION

Unlike the prototypical setting of equilibrium propagation (P-EP) (Ernoult et al., 2019), we separate
the input and output layer from the recurrent network (Figure 1a). This separation allows the output
layer to adopt the SoftMax activation commonly used in feedforward networks, which facilitates
performance comparison (Laborieux & Zenke, 2024). For clarity, the RNN (black dashed box in
Figure 1a) shown here only contains two hidden layers s1 and s2 , but the approach applies to deeper
structures (see below). The states of the RNN evolve for T discrete steps until they converge. The
dynamics of the whole RNN can be formulated as:
                           sβf [t + 1] = F (sβf [t], b) = ρ(W · sβf [t] + b),
                                    b = [W0 · s0 , βf · Bf · ep ],                                         (2)


                                                                 3
Published as a conference paper at ICLR 2026


where sβf [t] is the state of the RNN at time t, ρ is the activation function, W is the forward weight
matrix of the RNN, and b combines the feedforward input and the error-nudging term. Note that
           β    β
sβf = [s1 f , s2 f ]. For each sample-label pair (x, star ), we run the free phase (βf = 0) for te
iterations, obtain the prediction sp = SoftMax(Wf · s2 ), and compute the prediction error ep =
star −sp . During the clamped phase, the error nudges the RNN through the feedback weights Bf and
scaling coefficient βf = βf 1 (βf 1 = 0.1 for layered architecture and βf 1 = 0.25 for convolutional
architecture by default). The network evolves for K further iterations under clamping to another
state. The weights (W0 , W1 ) are then updated with an STDP-compatible rule:
                                                                                           β
                                   ∆Wi = dsi+1 · (s0i )⊤ ,                        f1
                                                                        dsi+1 = si+1 − s0i+1 ,                           (3)
where dsi is the offset of stable point caused by the error nudging (Scellier et al., 2018). Similarly,
the final weight for output is updated:

                                                 ∆Wf = (star − s0p ) · (s02 )⊤ .                                         (4)
We also consider an RNN embedded with convolutional architecture in its forward paths (2 convo-
lution layers, 2 max-pooling layers and 1 fully connected layer) shown in Figure 1b. The forward
convolutional structure follows the architecture of existing convolutional neural networks (CNN)
(Krizhevsky et al., 2012; Simonyan & Zisserman, 2015), in which a pooling layer is placed after
the activation of the convolution layer. We transform the CNN to an RNN by adding feedback con-
nections symmetric with the feed-forward connections (See Appendix D for the pseudocode and
schematics).

3.2   F EEDBACK R EGULATION IN L AYERED RNN FOR FAST C ONVERGENCE

                  (a) 0                                                (b) 0
                 Index


                                                                    Index


                         100                                                100
                  (c) 0                                      100
                                                                       (d) 0                                      100
                 Index


                                                                    Index


                         100                                                100                                   10 2
                  (e) 0                                      10 1
                                                                       (f) 0
                                                                                                                  10 4
                 Index


                                                                    Index


                    100                                                  100
                  (g) 0                                      10 2
                                                                       (h) 0                                      10 6
                 Index


                                                                    Index


                         100                                                100
                               0   20   40       60   80                          0   20       40       60   80
                                             t                                                      t


Figure 2: Convergence dynamics and speed versus feedback scaling βi . All neurons in all hidden
layers are indexed (s1 :0-63; s2 :64-127). Colors indicate neuronal activity (a,c,e,g) and changes in
activity (b,d,f,h). (a) The state evolution of RNN with symmetric weights and βi = 0.1; (b) The
one-step difference of neural states in (a). (c, d) Symmetric weights with βi = 2; (e, f) Asymmetric
weights with βi = 0.1; (g, h) Asymmetric weights with βi = 4. In both symmetric and asymmetric
feedback cases, down-scaling feedback connections tends to stabilize the network. See Figure 5d
for the statistical robustness.

Although the SR can tune the RNN dynamics, scaling forward weights Wi distorts forward signal
propagation, which is harmful to performance (see below). Therefore, we turn to another choice,
namely, scaling only the feedback strength with βi . This coefficient scales the gradients, in the same
way as the nudging factor βf . We consider both symmetric (Bi = (Wi )⊤ ) and asymmetric (Bi ̸=
(Wi )⊤ ) recurrent connections in the study, and compare the results with FNNs of the same size
trained by BP (feedback connections removed) or feedback alignment (FA) (Lillicrap et al., 2016)
that uses random weights Bi ̸= (Wi )⊤ to feedback the error. Note that, after scaling, the overall
weight matrix W of a symmetric RNN is no longer strictly symmetric. Therefore, we started from
the vector field setting of EP rather then the energy-based setting in the first place. The feedforward
and feedback weights are multiplied by coefficients αi and βi respectively. Figure 2a-d shows
convergence speed for different βi . With asymmetric weights, the network can converge to a fixed
point (Figure 2e, f), exhibit cyclical oscillation (Figure 2g, h), or even become chaotic. The feedback
weights stay fixed during training process, which differs from EP in vector field dynamics (Scellier


                                                                   4
Published as a conference paper at ICLR 2026


et al., 2018). The pseudocode of learning procedure with a 2-hidden-layer RNN shown in Figure 1(a)
is provided in Algorithm 1.

Algorithm 1 EP with Feedforward and Feedback Scaling
Require: Input: (x, star )
Require: Parameters: θ = [W0 , W1 , Wf , Bf , B1 , α1 , β1 , βf 1 ]
Ensure: Updated parameters θ
 1: function F IRST- PHASE(θ, star )
 2:    s0 ← x
 3:    for t = 1 to T do
 4:         h1 ← W0 · s0 + β1 · B1 · s02
 5:         h2 ← α1 · W1 · s01
 6:         hp ← Wf · s02
 7:         s01 , s02 , s0p ← ρ(h1 ), ρ(h2 ), SoftMax(hp )
 8:    end for
 9:    Λ1 ← [s0i ], i = 0, 1, 2, p
10:    return Λ1
11: end function
12: function S ECOND - PHASE(θ, Λ1 , star )
         β        β        β
13:    s1 f 1 , s2 f 1 , sp f 1 ← s01 , s02 , s0p
14:    for t = 1 to K do
                                   β
15:         ep ← star − sp f 1
                                                  β
16:         h1 ← W0 · s0 + β1 · B1 · s2 f 1
                                     βf 1
17:         h2 ← α1 · W1 · s1 + βf · Bf · ep
                                 β
18:         hp ← Wf · s2 f 1
              βf 1      βf 1    βf 1
19:         s1 , s2 , sp ← ρ(h1 ), ρ(h2 ), SoftMax(hp )
20:    end for
                    β
21:    dsi ← si f 1 − s0i , i = 1, 2
22:    Λ2 ← [ds1 , ds2 ]
23:    return Λ2
24: end function
25: function U PDATING -W EIGHTS(θ, Λ1 , Λ2 , star )
26:    ∆Wi ← dsi+1 · (s0i )⊤ , i = 0, 1
27:    ∆Wf ← (star − s0p ) · (s02 )⊤
28: end function


3.3    R ESIDUAL C ONNECTIONS TO AVOID VANISHING G RADIENTS

In our 10-hidden-layer RNN with symmetric connections, we add cross layer residual links (Fig-
ure 3a-b) and carry out ablation study on their effects in performance. The three long-range bidi-
rectional connections bypass adjacent layers to reduce gradient decay. For RNN with asymmet-
ric connections, we introduce skip-layer connections between non-adjacent layers with P = 20%
probability, creating an RNN with arbitrary graph topologies (AGT) where any pair of layers form
connections stochastically (Figure 3c) (Salvatori et al., 2022). (See Algorithm S3 in Appendix D for
training detail)

4     E XPERIMENTS
We evaluated our RNN models on MNIST and CIFAR-10 datasets and compared the results with P-
EP and BP. The MNIST dataset consists of 70,000 grayscale handwritten digit images (28×28 pixels)
split into 60,000 training and 10,000 test samples. CIFAR-10 contains 60,000 RGB images (32×32
pixels) of 10 categories, divided into 50,000 training and 10,000 test samples. Pre-processing, net-
work structures and additional training details are in Appendix D.

4.1    I NFLUENCE OF F EEDFORWARD S CALING AND F EEDBACK S CALING

Figure 4 compares the effects of feedforward scaling αi and feedback scaling βi . In general, relative
small feedback scaling (βi = 0.1) yields high MNIST accuracy (Figure 4). In deeper RNNs, overly


                                                         5
Published as a conference paper at ICLR 2026


Figure 3: (a) A 10-hidden-layer RNN model with residual connections. The solid blue wires and
the dashed orange wires represent forward and feedback residual connections respectively. The
bidirectional connections are symmetric. (b) Adjacency matrix of (a). The blocks (green) other than
the sub-diagonals indicate residual connections. (c) Adjacency matrix for an RNN with arbitrary
graph topology.


                 (a)               2HL-Test accuracy                 (b)              3HL-Test accuracy                (c)                5HL-Test accuracy
                                                                                                                                                                                           1.00
                         0.001 0.9659 0.9730 0.9756 0.9753                 0.001 0.9348 0.9692 0.9718 0.9555                 0.001 0.3844 0.7980 0.9338 0.8048
                                                                                                                                                                                           0.95
                          0.01 0.9666 0.9723 0.9760 0.9758                  0.01 0.9246 0.9679 0.9765 0.9715                  0.01 0.2515 0.9365 0.9575 0.8583                             0.90
                     i


                                                                      i


                                                                                                                        i


                           0.1 0.9624 0.9711 0.9725 0.9694                   0.1 0.6323 0.9497 0.9741 0.9702                      0.1 0.2104 0.8386 0.9757 0.9332                          0.85

                                                                                                                                                                                           0.80
                           1.0 0.8925 0.9127 0.9300 0.7978                   1.0 0.4862 0.8739 0.5249 0.1676                      1.0 0.2096 0.3891 0.2500 0.1324
                                                                                                                                                                                           0.75
                                0.01   0.1       1.0         4.0                    0.01   0.1       1.0         4.0                   0.01   0.1                      1.0   4.0
                                             i                                                   i                                                                 i


Figure 4: The influence of feedforward scaling αi and feedback scaling βi on accuracy of MNIST
classification. (a) 2 hidden layers; (b) 3 hidden layers; (c) 5 hidden layers. Each layer has 64
neurons. By default, T = 10×NHiddenLayer , K = T /2. Each result is averaged over five repetitive
experiments.


   (a)                                                 (b)                                           (c)                                       (d)
               100                                                                                           1
                                                                                                                                                                   102
                                                               8                                             0
                                                                                                                                                Convergence time


                                                               7
                                                               6                                             1
 Test Error


                                                                                                     FTMLE


              10 1                                             5
                                                         SR


                                                               4                                             2                                                     101         symm-before training
                                                               3                                                                                                               asymm-before training
                                                               2                                             3                                                                 symm-after training
                                                               1                                                                                                               asymm-after training
              10 2                                             0                                             4                                                     100
                       01
                         1
                      0.1

                         5
                         1
                         2
                         4


                                                                     01
                                                                       1
                                                                    0.1

                                                                       5
                                                                       1
                                                                       2
                                                                       4


                                                                                                                   01
                                                                                                                     1
                                                                                                                  0.1

                                                                                                                     5
                                                                                                                     1
                                                                                                                     2
                                                                                                                     4


                                                                                                                                                                           01
                                                                                                                                                                             1
                                                                                                                                                                          0.1

                                                                                                                                                                             5
                                                                                                                                                                             1
                                                                                                                                                                             2
                                                                                                                                                                             4
                     0.0

                     0.2


                                                                   0.0

                                                                   0.2


                                                                                                                 0.0

                                                                                                                 0.2


                                                                                                                                                                         0.0

                                                                                                                                                                         0.2
                0.0


                                                               0.0


                                                                                                             0.0


                                                                                                                                                                       0.0


                                   i                                            i                                             i                                                        i


Figure 5: The test error, SR, FTMLE, and convergence time versus feedback scaling βi . The results
are obtained from a 3-hidden-layer (64 neurons per layer) model trained on MNIST dataset. Note
that the network does not converge under certain conditions, resulting in missing value in d.


                                                                                                 6
Published as a conference paper at ICLR 2026


low feedback scaling βi jeopardizes the performance, which we attribute to vanishing gradients
(Figure 4c, right two columns). In contrast, down-scaling the feedforward weights degrades perfor-
mance, as the inference signals are weakened through the layers (see rows of Figure 4a). However,
up-scaling αi can also be detrimental, as this easily leads to saturation of neural state. The best per-
formance of a 5-hidden-layer RNN is achieved without feedforward scaling αi = 1 and a trade-off
in feedback scaling at βi = 0.1. These results suggest that balancing the feedforward and feedback
strengths is critical for better performance, not only accuracy but also speed (see Table 1).
To further investigate the influence of feedback scaling βi , we plot the error, SR, finite time max-
imum Lyapunov exponent (FTMLE) (Shadden et al., 2005; Kanno & Uchida, 2014) and conver-
gence time against feedback scaling coefficient before and after training of a 3-hidden-layer RNN
on MNIST (Figure 5). It shows that larger feedback scaling βi decreases accuracy (Figure 5a). As
expected, SR is positively correlated to βi (see Figure 5b), and large SR can lead to instability of
an RNN indicated by the FTMLE shown in Figure 5c, which in turn explains the results in Figure
5a. In general, down-scaling the feedback (βi < 1) reduces the convergence time of RNN, which is
favorable. Note that up-scaling of feedback βi >1 can also decrease FTMLE and convergence time.
However, this is attributed to the saturation of neural state, and will also lower the performance.
Additionally, one might suspect that the gradient signals in the lower layers are not fulfilling their
intended role. In reservoir computing, where only the last layer is trained, the network can also
reach high accuracy as long as the output dimension is large enough. However, this is unlikely in
our case, as each layer in our network has only 64 neurons by default (other than the results in Table
1). To further confirm that the learning in lower layers is meaningful, we performed training with
the weights of lower layers frozen—details of these experiments are included in Appendix C.5. The
results clearly show that getting comparable results to BP requires effective training of lower layers.

      (a) 0                      i=0.01                             (b) 0                           i=0.1                             (c) 0                             i=1.0                         (d) 0                              i=4.0
                 10                                                             10                                                               10                                                               10
Testing Error


                                                               Testing Error


                                                                                                                                Testing Error


                10 1                                                           10 1                                                             10 1                                             Testing Error   10 1


                                10           50                                                10             50                                                   10            50                                                 10            50
                                20           100                                               20             100                                                  20            100                                                20            100
                10 2                                                           10 2                                                             10 2                                                             10 2
                       0   10   20    30           40     50                          0   10   20    30             40     50                          0   10      20    30            40   50                          0   10      20    30            40   50
                                 Epoch                                                          Epoch                                                               Epoch                                                            Epoch
      (e) 0                     2 Layers                               (f) 0                   3 Layers                              (g) 0                         5 Layers                           (h) 0                         10 Layers
                 10                                                             10                                                               10                                                               10


                                     0.001              1.0                                           0.001              1.0
Testing Error


                                                               Testing Error


                                                                                                                                Testing Error


                                                                                                                                                                                                 Testing Error


                                     0.01               2.0                                           0.01               2.0
                10 1                 0.1                4.0                    10 1                   0.1                4.0                    10 1                                                             10 1
                                     0.25                                                             0.25
                                                                                                                                                           0.001           1.0                                              0.001           1.0
                                                                                                                                                           0.01            2.0                                              0.01            2.0
                                                                                                                                                           0.1             4.0                                              0.1             4.0
                                                                                                                                                           0.25                                                             0.25
                10 2                                                           10 2                                                             10 2                                                             10 2
                       0   10   20    30           40     50                          0   10   20    30             40     50                          0   10      20    30            40   50                          0   10      20    30            40   50
                                 Epoch                                                          Epoch                                                               Epoch                                                            Epoch


Figure 6: Test error with different hyperparameters. The curves of different T (10, 20, 50, 100) with
2 hidden layers (64 neurons per hidden layer) and (a) βi = 0.01; (b) βi = 0.1; (c) βi = 1; (d)
βi = 4. The curves of different βi (0.001, 0.01, 0.1, 0.25, 1, 2, 4) with (e) 2 hidden layers; (f) 3
hidden layers; (g) 5 hidden layers; (h) 10 hidden layers. The shaded areas represent deviations of
five repeated experiments. By default, T = 10 × NHiddenLayer , K = T /2. See Appendix A for
more information.


4.2                    D OWN - SCALING F EEDBACK L EADS TO FASTER C ONVERGENCE

Figure 6a-d plots the error versus the number of epochs with different iteration steps T . Under the
condition of βi = 0.01 (Figure 6a), the model with T = 10 and K = 5 works as well as the model
with T = 100 and K = 50, suggesting possibility of speedup in training. Larger βi requires more
iterations to achieve a certain level of performance (See Figure 6b, c, d). Larger βi means larger SR
and FTMLE, thus requiring more iterations to settle the RNN as shown in Figure 2 and Figure 5(b-


                                                                                                                               7
Published as a conference paper at ICLR 2026


Table 1: Comparison with P-EP and BP in accuracy and cost. The results of P-EP come from
previous work (Ernoult et al., 2019). For BP results, we used a network with the same number
of layers and number of nodes/channels. Each experiment is repeated five times, and the standard
deviation is given. By default, βi = 0.01 in our results, the feedback weights are symmetric with
the feedforward weights for P-EP and Ours, and the learning rate in all layers are the same except
for Ours-DLR (different learning rate), which uses varying learning rates identical to that of P-EP.
For 2HL (two hidden layers) and 3HL (three hidden layers), there are 512 nodes per hidden layer.
See Appendix D for more details.
                                                                Epoch / Batch size Wall Clock Time
 Architecture   Training approach        Testing (Training)
                                                                       -T/K         (HH:MM:SS)
                 P-EP (sigmoid-s)      98.05%±0.10% (99.86%)      50/20-100/20          1:56:-
     2HL         Ours (tanh, Adam)    98.39%±0.04% (100.00%)       50/500-10/10        0:01:16
                  BP (tanh, Adam)     98.26%±0.06% (100.00%)        50/500-1/1         0:00:18
                 P-EP (sigmoid-s)      97.99%±0.18% (99.90%)      100/20-180/20         8:27:-
                 Ours-DLR (tanh)       97.65%±0.08% (98.93%)       100/20-18/10        1:01:14
     3HL            Ours (tanh)       97.83%±0.13% (99.98%)        100/20-18/10        1:01:54
                 Ours (tanh, Adam)    98.36%±0.06% (100.00%)       50/500-18/10        0:02:11
                  BP (tanh, Adam)     98.36%±0.08% (100.00%)        50/500-1/1         0:00:24
                P-EP (hard-sigmoid)    98.98%±0.04% (99.46%)      40/20-200/10          8:58:-
     Conv       Ours (hard-sigmoid)    99.14%±0.02% (99.78%)       40/128-20/10        0:12:28
                 BP (hard-sigmoid)    98.93%±0.18% (99.43%)         40/128-1/1         0:01:01


Table 2: Comparison with BP and FA and ablation study of residual connection. For layered
architecture, there are 64 nodes per hidden layer and we chose T = 10 × NHiddenLayer , and
K = 5 × NHiddenLayer , which guarantees saturation of accuracy at βi = 0.1. For convolutional
architectures, βi = 0.01. By default, the Adam optimizer is used. Each experiment is repeated five
times. See Appendix D for more training details.
  Architecture
                  Training approach MNIST-Testing (Training) CIFAR-10-Testing (Training)
  -connections
                         BP             97.69%±0.10% (100.00%)       49.23%±0.81% (56.72%)
     5-symm
                         Ours            97.64%±0.10% (99.98%)       50.72%±0.17% (57.02%)
                          FA             96.44%±0.10% (98.96%)       37.97%±2.18% (38.92%)
    5-asymm
                         Ours            96.37%±0.11% (97.99%)       45.27%±0.73% (46.79%)
                         BP              97.61%±0.04% (99.93%)       48.23%±1.26% (55.37%)
    10-symm              Ours            92.49%±0.32% (95.27%)       34.90%±0.38% (34.64%)
                    Ours-Residual        97.49%±0.05% (99.77%)       44.46%±0.51% (48.67%)
                          FA             94.52%±0.26% (95.54%)       30.16%±6.12% (30.20%)
    10-asymm             Ours            87.37%±0.49% (87.95%)       30.37%±1.09% (29.97%)
                      Ours-AGT           96.87%±0.11% (99.45%)       30.94%±4.90% (31.36%)
                         BP              97.48%±0.07% (99.74%)       47.35%±1.49% (54.59%)
    20-symm
                    Ours-Residual        95.95%±0.18% (98.20%)       43.61%±1.17% (44.26%)
                         BP              99.34%±0.04% (99.97%)       75.45%±0.46% (83.61%)
      Conv
                         Ours            99.27%±0.07% (99.78%)       75.04%±0.51% (80.79%)


d). Or even worse, the gradient signal is completely distorted. At βi = 4, even T=100 fails to exceed
95% accuracy. Figure 6e-h shows that while shallow networks benefit from low βi , deeper networks
(3, 5 and 10 layers) lose accuracy. In all cases, training performance peaks at certain βi dependent
on the network depth. Additional results are provided in Table S1 in Appendix B.
Table 1 compares our approach with P-EP, BP, and FA. Our model supersedes P-EP in training
speed by at least one order of magnitude for both convolutional architecture and layered architecture.
Importantly, our accuracy is comparable to BP and FA for the shallow architectures (5-hidden-layer
and conv model, see also Table 2). In consideration of the improved stability (Figure 6) via feedback
regulation, we anticipate that physical implementations of RNN can achieve performance on par


                                                  8
Published as a conference paper at ICLR 2026


with BP. Additionally, for layered architecture, we also adopt the same training parameters (learning
rate, batch size and epochs) as P-EP, differing only in feedback scaling (‘ours-DLR’ in Table 1). The
results present clear evidence of speedup, which mainly stems from the reduced number of iterations
required for convergence.


4.3   D OWN - SCALED F EEDBACK C OORDINATES P LASTICITY OF D IFFERENT L AYERS

It is hypothesized that the brain requires different plasticity in different areas due to their varying
functional roles (Atallah et al., 2004; Lowet et al., 2020). The variability in plasticity can be realized
explicitly by adjusting learning rates or implicitly by modulating the intensity of gradient. Previ-
ous work postulated that EP with weak feedback necessitates learning rates differing by orders of
magnitude across layers (Scellier & Bengio, 2017). Here, we found that due to gradient differences
across different layers induced by weak feedback, a 3-hidden-layer RNN at βi = 0.01 (Table 1,
‘ours (tanh)’) learns well with a uniform learning rate. This result suggests that the feedback scaling
alone is able to regulate gradient strength of different layers, pointing to another possible mechanism
to coordinate plasticity.


4.4   R ESIDUAL C ONNECTIONS OVERCOME THE G RADIENT VANISHING IN D EEP RNN S

Weak feedback exacerbates vanishing gradient in deeper layered RNN (Figures S5–S6 in Ap-
pendix B). Adding residual connections restores gradient flow (Figure S7 in Appendix B). As a
result, a 10-hidden-layer network sees substantial performance gains (Table 2), 5% increase in ac-
curacy for MNIST and 9% for CIFAR-10. Even 20-hidden-layer model can be trained. As shown
in Table 2, without residual connections, an asymmetric RNN trained by EP falls short of FA in
accuracy, but arbitrary residual links surpass the accuracy of FA for the MNIST classification (See
ablation study on connection probability in Appendix B.). For more complex dataset CIFAR-10, the
10-hidden-layer asymmetric model with residual random feedback connections achieves accuracy
nearly 14% below the symmetric model. A possible reason is that the gradient signal through mul-
tiple random fixed feedback connections becomes too distorted by error to coordinate the forward
weight learning.


5     D ISCUSSION

We have applied the feedback scaling to RNN to speed up the convergence and to accelerate training
with EP with negligible overhead. To counteract the vanishing gradient in deep architectures, we
have added residual connections to non-adjacent layers of deep RNNs, partly restoring classification
performance. In principle, the residual connections make credit assignment pathways shorter (Veit
et al., 2016). The training exhibits remarkable resilience to noise on weight and neural state. Our
structural modification is compatible with other algorithmic speed-ups (Scellier et al., 2023), thereby
expanding the design space for efficient EP implementations.
Recent work on credit assignment in brain-inspired networks, e.g. adjoint propagation (Liu et al.,
2026), partitions a large network into local RNNs with random internal connections of low SR
for fast convergence and dynamic resource allocation, yielding speed and accuracy similar to this
work. This work, however, adopts the feedback scaling to solve the stability issue and accelerate
convergence of EP.
Weak feedback is often considered in biologically plausible learning algorithms (Sacramento et al.,
2018; Haider et al., 2021; Meulemans et al., 2021). It has been shown that contrastive Hebbian
learning with weak feedback approximates backpropagation while converging quickly (Xie & Se-
ung, 2003). More recently, local representation alignment (LRA) likewise employed weak feedback
(Ororbia et al., 2023) and skip connections from the output to deep layers for efficient training. The
EP framework also approximates BP (Scellier & Bengio, 2017; Millidge et al., 2023), but under
the weak clamping condition (weak supervision) (Laborieux et al., 2021; Millidge et al., 2023). We
have shown that, at the infinitesimal inference limit, namely weak supervision and weak feedback
(Millidge et al., 2023), EP is equivalent to LRA and BP (Appendix C). In other words, the dynamics
of FRE-RNN is more like the feedforward neural network due to its weak feedback.


                                                    9
Published as a conference paper at ICLR 2026


However, there are still a few limitations to our approaches for large-scale neural networks that
underpin artificial intelligence. For complex datasets like CIFAR-10, there exists a notable perfor-
mance gap compared to BP, using deep fully connected neural networks. We attribute this gap to
the inaccurate approximation to the true gradient as computed by BP (See Appendix C.4). There-
fore, although EP can be extended to deep fully connected network (20-hidden-layers) and shallow
CNNs, its applicability for deep CNN remains to be explored. For deep architectures with asymmet-
ric connections, the accuracy decreases faster with increasing depth due to the inaccurate random
error feedback. More in-depth investigation on residual connection topology is required to scale
up the methodology to large scale deep architectures. Besides, the hyperparameters are optimized
empirically. We find a feedback scaling in the range of 0.01-0.1 is favorable for shallow networks
(less than 4 layers) and 0.1-0.25 for deeper architectures. Finding a general way to determine these
parameters is still on-going. Additionally, existing research on EP converging naturally continues
to focus primarily on static-input settings (Laborieux et al., 2021; Ernoult et al., 2019; Laborieux
& Zenke, 2024). Extending naturally converging RNN trained by EP to sequence tasks remains a
challenge.
From a neurobiological perspective, residual connections, particularly the randomly generated arbi-
trary graph topologies, yield cortex-like connectivity patterns in the brain. The feedback-regulated
residual RNNs equip the biologically plausible learning framework, EP, with biologically plausible
network architecture. Although it currently runs on GPUs, it can exploit the natural convergence of
physical RNNs and facilitate efficient learning and inference on dedicated neuromorphic hardware.

ACKNOWLEDGEMENTS
This work was supported by the National Key R&D Program of China (Grant No.
2024YFA1208804). Additional financial support from the University of Science and Technology
of China and the Chinese Academy of Sciences is also gratefully acknowledged.

C ODE AVAILABILITY
The code used in this work is available at https://github.com/Zero0Hero/
FRE-RNN-EP.

R EPRODUCIBILITY STATEMENT
The code necessary to reproduce the main results is provided as Jupyter Notebooks in the Supple-
mentary Materials. Researchers can directly run them to reproduce the results. Further details on
data pre-processing and training process are available within the provided code and in Appendix D.

T HE U SE OF L ARGE L ANGUAGE M ODELS (LLM S )
In the preparation of this work, the authors used GPT-5 and DeepSeek solely for the purpose of
polishing and improving the linguistic fluency and readability of the text. This includes tasks such
as correcting grammar and rephrasing sentences. After using the model, the authors have reviewed
and edited all content extensively and take full responsibility for all ideas, claims, and the final
language presented in this paper.

R EFERENCES
David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for boltz-
  mann machines. Cognitive Science, 9(1):147–169, 1985.
Hisham E. Atallah, Michael J. Frank, and Randall C. O’Reilly. Hippocampus, cortex, and basal
  ganglia: Insights from computational models of complementary learning systems. Neurobi-
  ology of Learning and Memory, 82(3):253–267, 2004. ISSN 1074-7427. doi: https://doi.
  org/10.1016/j.nlm.2004.06.004. URL https://www.sciencedirect.com/science/
  article/pii/S1074742704000693.
Zhang Bai, D. J. Miller, and Wang Yue. Nonlinear system modeling with random matrices: Echo
  state networks revisited. IEEE Transactions on Neural Networks and Learning Systems, 23(1):
  175–182, 2012. ISSN 2162-237X 2162-2388. doi: 10.1109/tnnls.2011.2178562.


                                                10
Published as a conference paper at ICLR 2026


Maxence Ernoult, Julie Grollier, Damien Querlioz, Yoshua Bengio, and Benjamin Scellier. Updates
 of equilibrium prop match gradients of backprop through time in an rnn with static input. In
 Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.
 Article 636. Curran Associates Inc., 2019.
Maxence Ernoult, Julie Grollier, Damien Querlioz, Yoshua Bengio, and Benjamin Scellier. Equilib-
 rium propagation with continual weight updates, 2020. URL https://openreview.net/
 forum?id=H1xJhJStPS.
D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in the primate cerebral
  cortex. Cerebral Cortex, 1(1):1–47, 1991. ISSN 1047-3211. doi: 10.1093/cercor/1.1.1. URL
  <GotoISI>://WOS:000208047200002.
Mehmet Fişek, Dustin Herrmann, Alexander Egea-Weiss, Matilda Cloves, Lisa Bauer, Tai-Ying
 Lee, Lloyd E. Russell, and Michael Häusser. Cortico-cortical feedback engages active den-
 drites in visual cortex. Nature, 617(7962):769–776, 2023. ISSN 1476-4687. doi: 10.1038/
 s41586-023-06007-6. URL https://doi.org/10.1038/s41586-023-06007-6.
Paul Haider, Benjamin Ellenberger, Laura Kriener, Jakob Jordan, Walter Senn, and Mihai A. Petro-
  vici. Latent equilibrium: a unified learning theory for arbitrarily fast computation with arbitrarily
  slow neurons. In Proceedings of the 35th International Conference on Neural Information Pro-
  cessing Systems, pp. Article 1365. Curran Associates Inc., 2021.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE
  Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. ISBN
  1063-6919. doi: 10.1109/CVPR.2016.90.
Maya van Holk and Jorge F Mejias. Biologically plausible models of cognitive flexibility: merging
 recurrent neural networks with full-brain dynamics. Current Opinion in Behavioral Sciences, 56:
 101351, 2024. doi: 10.1016/j.cobeha.2024.101351. van Holk, Maya; Mejias, Jorge F. ORCIDs:
 0000-0003-4930-1472 2352-1546 2352-1554.
J. J. Hopfield. Neural networks and physical systems with emergent collective computational abili-
   ties. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. ISSN 0027-8424
   1091-6490. doi: 10.1073/pnas.79.8.2554.
Rasmus Høier, Kirill Kalinin, Maxence Ernoult, and Christopher Zach. Dyadic learning in recur-
  rent and feedforward models. In NeurIPS 2024 Workshop Machine Learning with new Compute
  Paradigms, 2024. URL https://openreview.net/forum?id=LNfWowAErI.
Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and sav-
  ing energy in wireless communication. Science, 304(5667):78–80, 2004. doi: 10.1126/science.
  1091277. URL https://doi.org/10.1126/science.1091277. doi: 10.1126/sci-
  ence.1091277.
Adrien Journé, Hector Garcia Rodriguez, Qinghai Guo, and Timoleon Moraitis. Hebbian deep learn-
  ing without feedback. In The Eleventh International Conference on Learning Representations,
  2023. URL https://openreview.net/forum?id=8gd4M-_Rj1.
Kazutaka Kanno and Atsushi Uchida. Finite-time lyapunov exponents in time-delayed nonlinear
  dynamical systems. Physical Review E, 89(3):032918, 2014. ISSN 1539-3755 1550-2376. doi:
  10.1103/PhysRevE.89.032918.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International
  Conference for Learning Representations. ICLR, 2015. doi: doi.org/10.48550/arXiv.1412.6980.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convo-
  lutional neural networks. In Advances in Neural Information Processing Systems, 2012.
Dhireesha Kudithipudi, Catherine Schuman, Craig M. Vineyard, Tej Pandit, Cory Merkel, Rajkumar
  Kubendran, James B. Aimone, Garrick Orchard, Christian Mayr, Ryad Benosman, Joe Hays, Cliff
  Young, Chiara Bartolozzi, Amitava Majumdar, Suma George Cardwell, Melika Payvand, Sonia
  Buckley, Shruti Kulkarni, Hector A. Gonzalez, Gert Cauwenberghs, Chetan Singh Thakur, Anand
  Subramoney, and Steve Furber. Neuromorphic computing at scale. Nature, 637(8047):801–812,
  2025. ISSN 0028-0836 1476-4687. doi: 10.1038/s41586-024-08253-8.


                                                  11
Published as a conference paper at ICLR 2026


Suman Kulkarni and Dani S. Bassett.      Toward principles of brain network organiza-
  tion and function.   Annual Review of Biophysics, 54(Volume 54, 2025):353–378,
  2025.   ISSN 1936-1238.   doi: https://doi.org/10.1146/annurev-biophys-030722-110624.
  URL         https://www.annualreviews.org/content/journals/10.1146/
  annurev-biophys-030722-110624.
Axel Laborieux and Friedemann Zenke. Improving equilibrium propagation without weight sym-
  metry through jacobian homeostasis. In The Twelfth International Conference on Learning Rep-
  resentations. ICLR, 2024. URL https://openreview.net/forum?id=kUveo5k1GF.
Axel Laborieux, Maxence Ernoult, Benjamin Scellier, Yoshua Bengio, Julie Grollier, and Damien
  Querlioz. Scaling equilibrium propagation to deep convnets by drastically reducing its gradient
  estimator bias. Frontiers in Neuroscience, 15:633674, 2021. ISSN 1662-453X. doi: 10.3389/
  fnins.2021.633674.
Yann Lecun. A theoretical framework for back-propagation. In Proceedings of the 1988 Connec-
  tionist Models Summer School, 1988.
Robert Legenstein and Wolfgang Maass. Edge of chaos and prediction of computational perfor-
  mance for neural circuit models. Neural Networks, 20(3):323–334, 2007. ISSN 08936080. doi:
  10.1016/j.neunet.2007.04.017.
Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic
  feedback weights support error backpropagation for deep learning. Nature Communications, 7
  (1):13276, 2016. ISSN 2041-1723. doi: 10.1038/ncomms13276.
Zhuo Liu, Hao Shu, Linmiao Wang, Xu Meng, Yousheng Wang, Xuancheng Li, Wei Wang, and
  Tao Chen. Adjoint propagation of error signal through modular recurrent neural networks for
  biologically plausible learning. eLife, 15:e108237, 2026. ISSN 2050-084X. doi: 10.7554/eLife.
  108237. URL https://doi.org/10.7554/eLife.108237.
Adam S. Lowet, Qiao Zheng, Sara Matias, Jan Drugowitsch, and Naoshige Uchida. Distribu-
  tional reinforcement learning in the brain. Trends in Neurosciences, 43(12):980–997, 2020.
  ISSN 0166-2236. doi: https://doi.org/10.1016/j.tins.2020.09.004. URL https://www.
  sciencedirect.com/science/article/pii/S0166223620301983.
Christopher W. Lynn and Danielle S. Bassett. The physics of brain network structure, function
  and control. Nature Reviews Physics, 1(5):318–332, 2019. ISSN 2522-5820. doi: 10.1038/
  s42254-019-0040-8.
Nikola T. Markov, Mária Ercsey-Ravasz, David C. Van Essen, Kenneth Knoblauch, Zoltán
  Toroczkai, and Henry Kennedy. Cortical high-density counterstream architectures. Science, 342
  (6158), 2013. ISSN 0036-8075 1095-9203. doi: 10.1126/science.1238406.
Jorge F. Mejias, John D. Murray, Henry Kennedy, and Xiao-Jing Wang. Feedforward and feedback
  frequency-dependent interactions in a large-scale laminar network of the primate cortex. Science
  Advances, 2(11):e1601335, 2016. doi: doi:10.1126/sciadv.1601335. URL https://www.
  science.org/doi/abs/10.1126/sciadv.1601335.
Jan Melchior and Laurenz Wiskott. Hebbian-descent, 2019. URL https://arxiv.org/abs/
  1905.10585.
Alexander Meulemans, Matilde Tristany Farinha, Javier Garcia Ordonez, Pau Vilimelis Aceituno,
  João Sacramento, and Benjamin F. Grewe. Credit assignment in neural networks through deep
  feedback control. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman
  Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 4674–4687.
  Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_
  files/paper/2021/file/25048eb6a33209cb5a815bff0cf6887c-Paper.pdf.
Georgios Michalareas, Julien Vezoli, Stan van Pelt, Jan-Mathijs Schoffelen, Henry Kennedy, and
  Pascal Fries. Alpha-beta and gamma rhythms subserve feedback and feedforward influences
  among human visual cortical areas. Neuron, 89(2):384–397, 2016. ISSN 0896-6273. doi:
  https://doi.org/10.1016/j.neuron.2015.12.018. URL https://www.sciencedirect.com/
  science/article/pii/S0896627315011204.


                                               12
Published as a conference paper at ICLR 2026


John Miller and Moritz Hardt. Stable recurrent models. In International Conference on Learning
  Representations, 2019. URL https://openreview.net/forum?id=Hygxb2CqKm.
Beren Millidge, Yuhang Song, Tommaso Salvatori, Thomas Lukasiewicz, and Rafal Bogacz. Back-
  propagation at the infinitesimal inference limit of energy-based models: Unifying predictive cod-
  ing, equilibrium propagation, and contrastive hebbian learning. In The Eleventh International
  Conference on Learning Representations, 2023. URL https://openreview.net/forum?
  id=nIMifqu2EO.
Javier R. Movellan. Contrastive Hebbian Learning in the Continuous Hopfield Model, pp.
  10–17. Morgan Kaufmann, 1991. ISBN 978-1-4832-1448-1. doi: https://doi.org/10.1016/
  B978-1-4832-1448-1.50007-X. URL https://www.sciencedirect.com/science/
  article/pii/B978148321448150007X.
Mitsumasa Nakajima, Yongbo Zhang, Katsuma Inoue, Yasuo Kuniyoshi, Toshikazu Hashimoto,
 and Kohei Nakajima. Reservoir direct feedback alignment: deep learning by physical dynamics.
 Communications Physics, 7(1):411, 2024. ISSN 2399-3650. doi: 10.1038/s42005-024-01895-0.
Arild Nøkland.     Direct feedback alignment provides learning in deep neural net-
  works. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Ad-
  vances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.,
  2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/
  file/d490d7b4576290fa60eb31b5fc917ad1-Paper.pdf.
Peter O’Connor, Efstratios Gavves, and Max Welling. Initialized equilibrium propagation for
  backprop-free training. In International Conference on Learning Representations. ICLR, 2019.
  URL https://openreview.net/forum?id=B1GMDsR5tm.
Alexander Ororbia and Ankur Mali. Biologically motivated algorithms for propagating local target
  representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.
  4651–4658. AAAI, 2019.
Alexander Ororbia, Patrick Haffner, David Reitter, and C. Lee Giles. Learning to adapt by minimiz-
  ing discrepancy, 2017.
Alexander G. Ororbia. Brain-inspired machine intelligence: A survey of neurobiologically-plausible
  credit assignment, 2023. URL https://arxiv.org/abs/2312.09257.
Alexander G. Ororbia, Ankur Mali, Daniel Kifer, and C. Lee Giles. Backpropagation-free deep
  learning with recursive local representation alignment. In Proceedings of the AAAI Conference on
  Artificial Intelligence, volume 37, pp. 9327–9335. AAAI, 2023. URL https://ojs.aaai.
  org/index.php/AAAI/article/view/26118.
Matthew G. Perich and Kanaka Rajan. Rethinking brain-wide interactions through multi-region
 ‘network of networks’ models. Current Opinion in Neurobiology, 65:146–151, 2020. ISSN
 09594388. doi: 10.1016/j.conb.2020.11.003.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating
  errors. Nature, 323(6088):533–536, 1986. ISSN 0028-0836. doi: 10.1038/323533a0. URL
  <GotoISI>://WOS:A1986E327500055. E3275 Times Cited:16725 Cited References
  Count:4.
João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn.     Dendritic corti-
   cal microcircuits approximate the backpropagation algorithm. In S. Bengio, H. Wal-
   lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Ad-
   vances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.,
   2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/
   file/1dc3a89d0d440ba31729b0ba74b93a33-Paper.pdf.
Tommaso Salvatori, Luca Pinchetti, Beren Millidge, Yuhang Song, Tianyi Bao, Rafal Bogacz, and
  Thomas Lukasiewicz. Learning on arbitrary graph topologies via predictive coding. In Alice H.
  Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Informa-
  tion Processing Systems, 2022. URL https://openreview.net/forum?id=dqO59nI_
  R9A.


                                                13
Published as a conference paper at ICLR 2026


Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-
  based models and backpropagation. Frontiers in Computational Neuroscience, 11:24, 2017. ISSN
  1662-5188. doi: 10.3389/fncom.2017.00024.
Benjamin Scellier, Anirudh Goyal, Jonathan Binas, Thomas Mesnard, and Yoshua Bengio. Gener-
  alization of equilibrium propagation to vector field dynamics, 2018. URL https://arxiv.
  org/abs/1808.04873.
Benjamin Scellier, Maxence Ernoult, Jack Kendall, and Suhas Kumar. Energy-based learn-
  ing algorithms for analog computing: a comparative study.    In A. Oh, T. Naumann,
  A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural In-
  formation Processing Systems, volume 36, pp. 52705–52731. Curran Associates, Inc.,
  2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/
  file/a52b0d191b619477cc798d544f4f0e4b-Paper-Conference.pdf.
João D. Semedo, Anna I. Jasper, Amin Zandvakili, Aravind Krishna, Amir Aschner, Christian K.
   Machens, Adam Kohn, and Byron M. Yu. Feedforward and feedback interactions between visual
   cortical areas use different population activity patterns. Nature Communications, 13(1):1099,
   2022. ISSN 2041-1723. doi: 10.1038/s41467-022-28552-w. URL https://doi.org/10.
   1038/s41467-022-28552-w.
Shawn C. Shadden, Francois Lekien, and Jerrold E. Marsden. Definition and properties of la-
  grangian coherent structures from finite-time lyapunov exponents in two-dimensional aperi-
  odic flows. Physica D: Nonlinear Phenomena, 212(3-4):271–304, 2005. ISSN 01672789. doi:
  10.1016/j.physd.2005.10.007.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
  recognition. In International Conference on Learning Representations, 2015.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
  Ł. ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von
  Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Ad-
  vances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.,
  2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/
  file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensem-
  bles of relatively shallow networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and
  R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran
  Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/
  paper/2016/file/37bc2f75bf1bcfe8450a1a41c200364c-Paper.pdf.
Ran Wang, Xupeng Chen, Amirhossein Khalilian-Gourtani, Leyao Yu, Patricia Dugan, Daniel Fried-
  man, Werner Doyle, Orrin Devinsky, Yao Wang, and Adeen Flinker. Distributed feedforward
  and feedback cortical processing supports human speech production. Proceedings of the Na-
  tional Academy of Sciences, 120(42):e2300255120, 2023. doi: 10.1073/pnas.2300255120. URL
  https://doi.org/10.1073/pnas.2300255120. doi: 10.1073/pnas.2300255120.
D. J. Watts and S. H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393
  (6684):440–442, 1998. ISSN 0028-0836. doi: Doi10.1038/30918. URL <GotoISI>://WOS:
  000074020000035. Zr842 Times Cited:29354 Cited References Count:27.
Alan Wolf, Jack B. Swift, Harry L. Swinney, and John A. Vastano. Determining lyapunov
  exponents from a time series. Physica D: Nonlinear Phenomena, 16(3):285–317, 1985.
  ISSN 0167-2789. doi: https://doi.org/10.1016/0167-2789(85)90011-9. URL https://www.
  sciencedirect.com/science/article/pii/0167278985900119.
X. Xie and H. S. Seung. Equivalence of backpropagation and contrastive hebbian learning in a
  layered network. Neural Computation, 15(2):441–454, 2003. ISSN 0899-7667. doi: 10.1162/
  089976603762552988.
Izzet B. Yildiz, Herbert Jaeger, and Stefan J. Kiebel. Re-visiting the echo state property. Neural
   Networks, 35:1–9, 2012. ISSN 08936080. doi: 10.1016/j.neunet.2012.07.005.


                                               14
Published as a conference paper at ICLR 2026


A    T HE DYNAMICS OF THE RNN
We quantify the convergence property of the recurrent neural network (RNN) with maximum Lya-
punov exponent (MLE) (Wolf et al., 1985), and finite time maximum Lyapunov exponent (FTMLE)
(Kanno & Uchida, 2014). When the MLE/FTMLE is large, the RNN converges slow or even not
at all. To compute MLE and FTMLE, we first initialize a random perturbation vector δ0 . Then we
record the sequence of states s0 [t] with t = 0, 1, 2, . . . , Te − 1 corresponding to the last sample of a
training set (see Figure 2 in the main text), and run the following steps:

      1. Normalize perturbation vectors to unit length:
                                                             δt
                                                    δt ←
                                                            ∥δt ∥
      2. Calculate the Jacobian matrix:
                                                           ∂F (s0 [t], b)
                                             J(s0 [t]) =
                                                             ∂s0 [t]
      3. Update the perturbation:
                                               δt+1 = J(s0 [t]) · δt
      4. Record
                                                  ri = ln ∥δt+1 ∥
                                                                      PTe −1
The maximum Lyapunov exponent is computed as λmax = T1e                     t=0   ri for a sufficiently large Te
(default Te = 500). The results at any T < Te are the FTMLE.
Figure S1–S2 show the FTMLE, MLE, training accuracy and test accuracy versus epochs of dif-
ferent models. In all cases, smaller βi usually yields smaller (FT)MLE, whereas larger βi do not
always lead to larger (FT)MLE because the activation function saturates. The saturation diminishes
perturbation.
For 2-hidden-layer RNN, smaller feedback scaling βi yields steady training progress and better ac-
curacy. Figure S3 plots the FTMLE and test accuracy against feedback scaling for different numbers
of hidden layers. It shows that smaller βi is favorable for shallow networks, because the RNN is eas-
ier to converge (indicated by FTMLE). But for deeper networks (5-hidden-layer or more), smaller
βi degrades performance because of vanishing gradient.
Further comparison between our FRE-RNN that incorporates convolutional structure with previous
work (Ernoult et al., 2019) are also plotted in Figure S4. These results suggest that small feedback
scaling (βi = 0.01) leads to a smoother training process.

B    G RADIENT VANISHING AND THE RESIDUAL CONNECTIONS
Figure S5 and S6 plot the error of each neuron versus epoch at different βi . For a 2-hidden-layer
RNN, the best performance is obtained at βi = 0.001. In this situation, the error of the first hidden
layer is at least two orders of magnitude less than the second hidden layer. At βi = 2, the error also
decreases from higher (high index neurons, closer to output layer) to lower layers, which is attributed
to the saturation of the activation function. In general, the training progresses more steadily for
smaller βi despite the vanishing gradient, which also applies to deeper networks (up to 10-hidden-
layer).
To eliminate the vanishing gradient in EP, direct feedback from the higher layers or local amplifi-
cation (with higher learning rate) is unavoidable (Nøkland, 2016; Ororbia et al., 2023). Figure S7
shows the effect of residual connections. βi = 0.1 yield the best accuracy 97.5%, due to the balance
between gradient flow and convergence.
Figure S8 shows the testing accuracy varies with the connection probability P of AGT with 10
hidden layers. Except for the connections in layered model, the connection between any two hidden
layers is generated with probability P , i.e., we first use P to decide if the connections between any
two layers will be established. As P increases, the accuracy rises first, peaks at 0.2 and decreases
around 1. However, the reason behind is yet to be explored.


                                                    15
Published as a conference paper at ICLR 2026


Figure S1: The FTMLE, MLE, training accuracy and testing accuracy of symmetric RNNs versus
epochs with different feedback scaling βi (legend). First row: 2 hidden layers; Second row: 3 hidden
layers; Third row: 5 hidden layers; Fourth row: 10 hidden layers. The activation is tanh. Each case
is repeated 5 times.


                                                16
Published as a conference paper at ICLR 2026


Figure S2: The FTMLE, MLE, training accuracy and testing accuracy of asymmetric RNNs versus
epochs with different feedback scaling βi (legend). First row: 2 hidden layers; Second row: 3 hidden
layers; Third row: 5 hidden layers; Fourth row: 10 hidden layers. The activation is tanh. Each case
is repeated 5 times.


Figure S3: The FTMLE and testing accuracy versus feedback scaling βi with different numbers of
hidden layers. (a) Symmetry weights; (b) Asymmetric weights. The FTMLE and testing accuracy
given here correspond to their maxima in all epochs. Note that the 5-hidden-layer asymmetric RNN
with large βi diverged and resulted in missing data points in (b). Each case is repeated 5 times.


                                                17
Published as a conference paper at ICLR 2026


Figure S4: Comparison of RNN embedded with convolutional structure on the MNIST between
P-EP (a) (Ernoult et al., 2019) and our approach at different βi (b-d). We used the same parameters
as the EP reference (Ernoult et al., 2019).


                                                18
Published as a conference paper at ICLR 2026


Figure S5: For 2-hidden-layer RNN, the mean error of each neuron in the last batch and testing
accuracy versus epochs at different βi . All neurons in the hidden layers and the output layer are
indexed from the input to the output layer.


                                               19
Published as a conference paper at ICLR 2026


Figure S6: For the 10-hidden-layer model, the mean error of each neuron in the last batch and
testing accuracy versus epochs at different βi . All neurons in the hidden layers and the output layer
are indexed from the input to the output layer.


                                                 20
Published as a conference paper at ICLR 2026


Figure S7: For the 10-hidden-layer model with residual connections, the mean error of each neuron
in the last batch and testing accuracy versus epochs at different βi . All neurons in the hidden layers
and the output layer are indexed from the input to the output layer.


Figure S8: The testing accuracy on MNIST varies with the connection probability P of AGT with
10 hidden layers. The experiments are repeated 5 times.


                                                  21
Published as a conference paper at ICLR 2026


Table S1: Testing accuracy (mean of 5 repeated experiments) with different feedback scaling βi . By
default, T = 10 × NHiddenLayer , K = 5 × NHiddenLayer . Each hidden layer has 64 nodes.
 Architecture-connections   βi = 0.001   βi = 0.01     βi = 0.1   βi = 0.25   βi = 1   βi = 2   βi = 4
        2HL-symm              97.69%      97.57%       97.25%      96.22%     93.12%   66.04%   40.92%
        3HL-symm              97.22%      97.64%       97.41%      96.60%     55.86%   32.64%   22.11%
        5HL-symm              93.54%      95.54%       97.60%      90.63%     25.31%   17.88%   14.61%
       10HL-symm              87.15%      89.99%       92.54%      41.84%     14.07%   14.30%   14.23%
  10HL-Residual-symm             –        97.52%       97.46%         –       95.51%      –        –
        conv-symm                –        99.15%       98.71%         –       11.35%      –        –
       2HL-asymm              96.96%      96.97%       96.88%      96.79%     93.88%   91.81%   89.91%
       3HL-asymm              95.17%      96.91%       96.76%      96.66%     91.21%   54.65%   26.72%
       5HL-asymm              91.14%      92.34%       96.41%      96.35%     17.15%   11.35%   13.07%
      10HL-asymm              84.27%      85.83%       87.79%      90.97%     16.13%   14.21%   16.67%
    10HL-AGT-asymm               –        96.37%       96.75%         –       33.31%      –        –


C     E QUIVALENCE WITH EP AND BP UNDER THE CONDITION OF
      INFINITESIMAL INFERENCE LIMIT


Figure S9: A layered network model used to illustrate the process of backpropagation (BP), local
representation alignment (LRA), and EP. Note that the final prediction layer ·p corresponds to the
third layer with subindex ·3 . For LRA, we use βLRA instead of β1 and βf . For BP, the feedback
(orange) paths are absent.

In this section, we will use the infinitesimal inference limit (Millidge et al., 2023) to derive the
equivalence of EP with LRA and BP.

C.1   BACKPROPAGATION

When we remove the feedback connection of a 2-hidden-layer RNN shown in Figure S9, a feedfor-
ward network is left and can be trained with BP. The forward process of BP is described by:


                                      s1 = ρ(h1 ), h1 = W0 · s0 ,
                                      s2 = ρ(h2 ), h2 = W1 · s1 ,                                  (S1)
                                       sp = hp , hp = Wf · s2 .

Defining a loss LBP = 21 (sp −star )2 , the weights adjust according to the gradient of the loss. Taking
∆W0 as an example:
                             ∂LBP
                 ∆W0 = −
                             ∂W0
                       = −ρ′ (h1 ) ⊙ W1⊤ · ρ′ (h2 ) ⊙ Wf⊤ · (sp − star ) · (s0 )⊤ ,
                                                                            
                                                                                                    (S2)

where “⊙” means Hadamard product (element-wise product), “·” means scalar or matrix multipli-
cation. For two vectors/matrices, “⊙” requires identical dimensions and computes element-wise
products. Broadcasting rules may apply (e.g., a column vector vm×1 ⊙ Am×n scales each column
of A by v).


                                                  22
Published as a conference paper at ICLR 2026


C.2   L OCAL R EPRESENTATION A LIGNMENT

LRA is an alternative training method following the principle of discrepancy reduction (Ororbia
et al., 2017; Ororbia & Mali, 2019). It can be divided into two phases: 1) the network runs the
forward process, producing latent representations of the input samples. 2) the weights adjust in the
direction of reducing the mismatch between current latent representations and target representations
in each layer.
The forward process is the same as BP:
                                  s01 = ρ(h01 ),         h01 = W0 · s0 ,
                                     s02 = ρ(h02 ),      h02 = W1 · s01 ,                            (S3)
                                      s0p = h0p ,       h0p = Wf · s02 .
where s0i are interpreted as the latent representations. The prediction error is ep = star − s0p . Then
we can get the target representations of the second hidden layer:
                    sβ2 LRA = ρ(hβ2 LRA ),    hβ2 LRA = W1 · s01 + βLRA · Bf · ep ,                  (S4)
The same goes for the first hidden layer:
          sβ1 LRA = ρ(hβ1 LRA ),   hβ1 LRA = W1 · s0 + βLRA · B1 · e2 ,       e2 = sβ2 LRA − s02 ,   (S5)
LRA defines the loss as the total discrepancy between latent representations and target representa-
tions:
                                 L                             L
                                X                             X   1 0
                      LLRA =         ki Li (s0i , sβi LRA ) =       (si − sβi LRA )2 ,         (S6)
                                i=1                           i=1
                                                                  2
The weight Wi adjusts according to the local mismatch between s0i+1 and sβi+1
                                                                           LRA
                                                                               :
                                         ∂ki Li (s0i+1 , sβi+1
                                                            LRA
                                                                )
                              ∆Wi = −
                                                 ∂Wi
                                   = (sβi+1
                                         LRA
                                             − s0i+1 ) ⊙ f ′ (h0i+1 ) · (s0i )⊤
                                     ≈ (sβi+1
                                           LRA
                                               − s0i+1 ) · (s0i )⊤ ,                             (S7)
where the derivative of the activation function is omitted in the last row, a useful practice common
in LRA (Melchior & Wiskott, 2019; Ororbia & Mali, 2019; Ororbia et al., 2023). When βLRA → 0,
sβi LRA → s0i and hβi LRA → h0i , then


                   ei = sβi LRA − s0i = ρ(hβi LRA ) − ρ(h0i )
                      = ρ(h0i + βLRA · Bi · ei+1 ) − ρ(h0i )
                      ≈ [ρ(h0i ) + ρ′ (h0i ) ⊙ (βLRA · Bi · ei+1 ) − ρ(h0i ))]βLRA →0 ,              (S8)
                          ′
                      = ρ (h0i ) ⊙ (βLRA · Bi · ei+1 )

The approximation in Equation S8 is based on a first-order Taylor expansion of ρ(h0i + ∆h) around
h0i , where ∆h = βLRA · Bi · ei+1 . For a small perturbation ∆h → 0, the Taylor expansion gives:
                          ρ(h0i + ∆h) = ρ(h0i ) + ρ′ (h0i ) · ∆h + O(∆h2 ),                          (S9)
                                                    2
When βLRA → 0, higher order terms O(∆h ) are negligible, leaving only the linear terms. We
arrive at the last row after canceling out ρ(h0i ). There we can express the weight adjustments as
   ∆W0 = e1 · (s00 )⊤
                                                                        
             ′ 0                     ′ 0
                                                                       
                                                                            · (s0 )⊤ B =(W )⊤
                                                                                    
         = ρ (h1 ) ⊙ βLRA · B1 · ρ (h2 ) ⊙ βLRA · Bf · (star − sp )
                                                                                      i   i

                            ′ 0        ⊤     ′ 0         ⊤                     ⊤
                                                                       
         = −βLRA · βLRA · ρ (h1 ) ⊙ W1 · ρ (h2 ) ⊙ Wf · (sp − star ) · (s0 ) ,              (S10)
which is the same as BP (Equation S2) except for a constant. Thus, LRA at weak feedback limit
approximates BP. An LRA algorithm for a 2-hidden-layer network is described in Algorithm S1. The
feedback weights in LRA need not be learned here, but can be kept symmetric with the feedforward
weights.


                                                        23
Published as a conference paper at ICLR 2026


C.3    E QUILIBRIUM P ROPAGATION

We can also formulate EP in terms of discrepancy reduction. In EP (Algorithm 1 in the main text),
the network states evolve as follows (β = 0 for the first phase and β = βf for the second phase):


                                             hβ1 = W0 · sβ0 + β1 · B1 · sβ2 ,
                                            hβ2 = W1 · sβ1 + βf · Bf · ep ,
                                                       hβp = Wf · sβ2 ,
                                            sβ1 , sβ2 , sβp = ρ(hβ1 ), ρ(hβ2 ), hβp ,                   (S11)


where ep = star − s0p is the predicting error. The network converges to final states h01 , h02 , s01 , s02 in
the free phase. The error of s2 neurons can be described by:

                                                        β
                                         ds2 = [ρ(h2 f )]βf →0 − [ρ(h02 )]βf =0
                                                ≈ ρ′ (h02 ) ⊙ (βf · Bf · ep ),                          (S12)

where only the first-order infinitesimal term is retained as β1 → 0. The same goes for the first
hidden layer:

                                            β
                              ds1 = [ρ(h1 f )]βf →0 − [ρ(h01 )]βf =0
                                   ≈ ρ′ (h01 ) ⊙ (β1 · B1 · (ρ′ (h02 ) ⊙ (βf · Bf · ep ))),             (S13)

The weight W0 can be updated by:

                                  ds1 · (s00 )⊤
                  ∆W0 =                         = ρ′ (h01 ) ⊙ B1 · (ρ′ (h02 ) ⊙ Bf · ep ) · (s00 )⊤ ,   (S14)
                                    β1 · β f

With Bi = Wi⊤ ,

         ds1 = βf · β1 · ρ′ (h01 ) ⊙ W1⊤ · (ρ′ (h02 ) ⊙ Wf⊤ · −(sp − star )),                           (S15)
                 ds1 · (s00 )⊤
                               = −ρ′ (h01 ) ⊙ W1⊤ · ρ′ (h02 ) ⊙ Wf⊤ · (sp − star ) · (s00 )⊤ .
                                                                                  
       ∆W0 =                                                                                            (S16)
                   β1 · β f

Note that compared with the weight update in the main text, 1/(β1 ·βf ) is added to recover a gradient
amplitude similar to BP. Further, if we assume that the high-order infinitesimal in the first phase can
be omitted, the dynamics of RNN is governed by:

                      s01      = ρ(hβ1 ),       h01 = [W0 · s0 + β1 · B1 · s02 ]β1 →0 ≈ W0 · s0 ,       (S17)
                s02         = ρ(h02 ),   h02 = [W1 · s01 + βf · Bf · ep ]β1 →0,βf =0 ≈ W1 · s01 ,       (S18)
                                            s0p = h0p , h0p = Wf · s02 .                                (S19)


The information flow of RNN degenerates into that of a feedforward network. This does not affect
the error information dsi , thus Equation S16 approximates Equation S2 for BP. Meanwhile, it re-
sembles LRA with low βLRA , which turns explicit error into implicit error. Hitherto, we have shown
that although the errors are obtained differently in EP, LRA, and BP, they are equivalent under the
assumption of weak supervision and weak feedback.


                                                              24
Published as a conference paper at ICLR 2026


Algorithm S1 Local Representation Alignment (LRA)
Input: (x, star )
Parameter: θ = [W0 , W1 , W2 , B2 , B1 , βLRA ]
Output: θ
 1: function F ORWARD(θ, x)
 2:    s0 ← x
 3:    s01 ← ρ(h1 ), h1 ← W0 · s0
 4:    s02 ← ρ(h2 ), h2 ← W1 · s01
 5:    s0p ← Wf · s02
 6:    Λ1 ← [s0i ], i = 0, 1, 2, p
 7:    return Λ1
 8: end function
 9: function F EEDBACK(θ, Λ1 , star )
10:    ep ← star − s0p
11:    sβ2 LRA ← ρ(h2 ), h2 ← W1 · s01 + βLRA · Bf · ep
12:    e2 ← sβ2 LRA − s02
13:    sβ1 LRA ← ρ(h1 ), h1 ← W0 · s0 + βLRA · B1 · e2
14:    e1 ← sβ1 LRA − s01
15:    Λ2 ← [e1 , e2 , ep ]
16:    return Λ2
17: end function
18: function U PDATING -W EIGHTS(θ, Λ1 , Λ2 )
19:    ∆Wi ← ei+1 · (s0i )T , i = 0, 1
20:    ∆Wf ← ep · (s02 )T
21: end function


C.4   E XPERIMENTS FOR EQUIVALENCE WITH EP AND BP

Prior works have shown that EP can be equalized to BPTT in specific conditions and can achieve
comparable performance (Ernoult et al., 2019; Laborieux et al., 2021). As discussed in the previ-
ous section, although the overall architecture forms an RNN, the network behaves similarly to a
feedforward model due to weak feedback connections.
To experimentally show the equivalence of EP and BP, we can further compare our model with
FNN with same feedforward weights trained by BP. We mainly compare cosine similarity of states,
bias gradients and weight gradients for the first batch (batch size is 200) as given in Figure S10.
Figure S10(a-c) shows similarity under the conditions of βi = 1 with different iterations. For the the
bias gradients, i.e., dsi , the cosine similarity declines rapidly, indicating no similarity between our
model and BP. With weak feedback βi = 0.1, as shown in Figure S10(d-f), the similarity of states
approaches 1 and the similarity of bias gradient of last 6(4) layers exceeds 0.5 with T = 500/50
(T = 20). These results provide further evidence that EP is equivalent to BP under the condition of
weak feedback.
We further studied the influence of βi on the cosine similarity. Figure S10(g) shows that larger
βi leads to lower similarity of states. Figure S10(h) shows that lower βi = 0.01 also leads to
the decrease in similarity, which may caused by insufficient precision of data storage (float32 by
default). Therefore, we use datatype float64 to repeat experiments. Figure S10(k,l) shows that the
similarity of gradient signal remains around 1 with βi = 0.1. This indicates that weak feedback does
indeed lead to an exponential decline in gradient signals, thus requiring higher relative accuracy.


                                                  25
 Published as a conference paper at ICLR 2026


        (a)                            states                      (b)                       bias gradients                      (c)                    weight gradients
                    1.0                                                         1.0                                                         1.0
                                                                                                                                                         T=500,K=200
cosine similarity


                                                            cosine similarity


                                                                                                                        cosine similarity
                                                                                                                                                         T=50,K=20
                    0.5                                                         0.5                                                         0.5          T=20,K=8
                    0.0                                                         0.0                                                         0.0

                    0.5                                                         0.5                                                         0.5
                          fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out                         fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out                         fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out
       (d)                             states                       (e)                      bias gradients                        (f)                  weight gradients
                    1.0                                                         1.0                                                         1.0
cosine similarity


                                                            cosine similarity


                                                                                                                        cosine similarity
                    0.5                                                         0.5                                                         0.5

                    0.0                                                         0.0                                                         0.0                    T=500,K=200
                                                                                                                                                                   T=50,K=20
                                                                                                                                                                   T=20,K=8
                    0.5                                                         0.5                                                         0.5
                          fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out                         fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out                         fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out
       (g)                             states                       (h)                      bias gradients                         (i)                 weight gradients
                    1.0                                                         1.0                                                         1.0
                                                                                                                                                          i=0.0              i=0.5
cosine similarity


                                                            cosine similarity


                                                                                                                        cosine similarity

                                                                                                                                                          i=0.01             i=1
                    0.5                                                         0.5                                                         0.5           i=0.1

                    0.0                                                         0.0                                                         0.0

                    0.5                                                         0.5                                                         0.5
                          fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out                         fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out                         fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out
            (j)                        states                       (k)                      bias gradients                         (l)                 weight gradients
                    1.0                                                         1.0                                                         1.0
cosine similarity


                                                            cosine similarity


                                                                                                                        cosine similarity


                    0.5                                                         0.5                                                         0.5                            i=0.0
                                                                                                                                                                           i=0.01
                                                                                                                                                                           i=0.1
                    0.0                                                         0.0                                                         0.0
                                                                                                                                                                           i=0.5
                                                                                                                                                                           i=1
                    0.5                                                         0.5                                                         0.5
                          fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out                         fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out                         fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out

 Figure S10: The cosine similarity of gradients and states between our model and feedforward model
 trained by BP in an 8-hidden-layer FNN (states: s0i ; bias gradients: dsi ; weight gradients: ∆Wi ).
 The axis x is the layers of the model. Error propagates from the last layer ”out” to the first hidden
 layer ”fc1” layer by layer. (a-c), with different numbers of iterations under feedback scaling βi = 1.
 (d-f), with different numbers of iterations under small feedback scaling βi = 0.1. (g-i), with different
 feedback scaling (T=50,K=20). (j-l), Repeat (g-i) with datatype float64 (float32 by default).


                                                                                                26
Published as a conference paper at ICLR 2026


C.5                         V ERIFYING THE EFFECTIVENESS OF WEAK FEEDBACK IN E QUILIBRIUM P ROPAGATION


                                                                       1.00
                                                                                                                i=0.01
                                                                                                                i=0.1
                                                                       0.95


                                                       Test accuracy
                                                                       0.90

                                                                       0.85

                                                                       0.80
                                                                                                 1                            3                                                 5
                                                                                                               Only last nll layers learning

Figure S11: The testing accuracy on MNIST with different βi varies with nll . The experiments are
repeated 5 times.

To demonstrate that the lower few layers of our model are indeed receiving meaningful credit signals,
we report the test accuracy of only updating the last nll layer (i.e., freezing the weights of Layers
1−nll ) in Figure S11. For a 5-hidden layer model with βi = 0.1, updating only the final layer yields
a test accuracy of about 85%. As nll increases to 5, the accuracy also reaches around 97.5%. A
similar trend is observed for the model with βi = 0.01. These results show that achieving over 97%
accuracy requires effective gradient propagation to all layers, confirming that our model successfully
delivers usable credit signals throughout the entire network.

C.6                         ROBUSTNESS TO THE NOISE

               (a)                        2HL                                                 (b)                        3HL                                       (c)                        5HL
                                                                        1.00                                                               1.00                                                                 1.00
                            0 0.972 0.972 0.972 0.904                                                      0 0.974 0.974 0.967 0.752                                          0 0.976 0.969 0.814 0.469
Weight noise intensity


                                                                               Weight noise intensity


                                                                                                                                                  Weight noise intensity


                                                                        0.95                                                               0.95                                                                 0.95
                         0.001 0.972 0.972 0.972 0.896                  0.90                            0.001 0.973 0.975 0.964 0.749      0.90                            0.001 0.973 0.965 0.812 0.464        0.90

                         0.01 0.917 0.917 0.912 0.689                   0.85                            0.01 0.903 0.904 0.838 0.485       0.85                            0.01 0.876 0.815 0.520 0.299         0.85
                                                                        0.80                                                               0.80                                                                 0.80
                          0.1 0.194 0.206 0.196 0.217                                                    0.1 0.167 0.180 0.174 0.167                                        0.1 0.134 0.135 0.134 0.135
                                                                        0.75                                                               0.75                                                                 0.75
                                0      1e-05 0.001 0.1                                                         0      1e-05 0.001 0.1                                               0      1e-05 0.001 0.1
                                    State noise intensity                                                          State noise intensity                                                State noise intensity


Figure S12: The maximum test accuracy model with different noise intensity on weights and states
added both in training and test. (a) With 2 hidden layers; (b) With 3 hidden layers; (c) With 5 hidden
layers. The model is trained for 50 epochs and the experiments are repeated 5 times.

To evaluate the robustness of the model, we introduce noise on weights and time-varying noise
on states, which are random Gauss noise imposed at each weight update or at each state update,
respectively. The noise on weights is directly added to the weight, while the noise on states is added
as the bias b in Equation 3.1. The mean absolute values of non-zero weights and neural activations
after noiseless training are approximately 0.09 and 0.76 respectively.
The accuracy of the model with two hidden layers varies with two types of noises as presented in Fig-
ure S12(a). It maintains satisfactory performance when the standard deviation of state noise reaches
0.1 or the standard deviation of weight noise reaches 0.01. In deeper structures (Figure S12(b,c)),
the results are consistent with the aforementioned observations for weight noise, demonstrating ex-
cellent robustness. However, the tolerance to time-varying state noise degrades significantly, which
we attribute to the layer-wise noise accumulation and the distortion of weak gradient signal by the
noise in the training process. To confirm our hypothesis, we impose the noises only in the test,


                                                                                                                         27
Published as a conference paper at ICLR 2026


                      (a)                   2HL                                 (b)                                 3HL                                         (c)                            5HL
                                                                1.00                                                                       1.00                                                                       1.00
                             0 0.973 0.973 0.973 0.971 0.944                                       0 0.974 0.976 0.975 0.974 0.940                                            0 0.975 0.977 0.977 0.977 0.919
                                                                0.95                                                                       0.95                                                                       0.95


                                                                       Weight noise intensity


                                                                                                                                                  Weight noise intensity
 Weight noise intensity
                          0.001 0.971 0.971 0.974 0.970 0.945   0.90                            0.001 0.975 0.973 0.973 0.972 0.945        0.90                            0.001 0.973 0.973 0.973 0.973 0.928        0.90

                           0.01 0.919 0.916 0.919 0.918 0.912   0.85                             0.01 0.908 0.903 0.908 0.902 0.894        0.85                             0.01 0.885 0.876 0.880 0.873 0.842        0.85
                                                                0.80                                                                       0.80                                                                       0.80
                            0.1 0.199 0.187 0.194 0.219 0.177                                     0.1 0.176 0.182 0.152 0.167 0.142                                          0.1 0.133 0.128 0.143 0.163 0.142
                                                                0.75                                                                       0.75                                                                       0.75
                                  0   0.001 0.001 0.1 0.                                                0   0.001 0.001 0.1          0.5                                           0   0.001 0.001 0.1          0.5
                                      State noise intensity                                                  State noise intensity                                                      State noise intensity


Figure S13: The maximum test accuracy model with different noise intensity on weights and states
(the state noise is added only in test). (a) With 2 hidden layers; (b) With 3 hidden layers; (c) With 5
hidden layers. The model is trained for 50 epochs in a single experiment.


and the test accuracy almost remains unaffected (Figure S13). Therefore, the network is potentially
resilient to noise. However, how to improve resilience in the training process requires further study.


D                            T RAINING DETAILS

Table S2 provides the parameters of the Adam optimizer that are used in Tables S1–S2 (Kingma &
Ba, 2015). The training details for Table 1 are given in Table S3. For convolutional architectures
in EP, the training process can be described by Algorithm S2. The training sample is fed into the
network through Conv0 . Then the state of the first layer goes through max pooling MaxPool1 and
convolution Conv1 sequentially to reach the second layer. The second layer also feedbacks its states
to the first layer through transposed convolution ConvT1 and max-unpooling MaxUnpool1 . With
T iterations, the RNN converges to the steady states and produces outputs through MaxPool2 and
a fully connected layer. Then the prediction error is computed and used to nudge the RNN by
the reverse of the fully connected layer and max-unpooling MaxUnpool2 . Note that the unpooling
MaxUnpooli requires the indices from the corresponding pooling MaxPooli .
For Table S2, Adam optimizer is used for all experiments. The activation functions sigmoid-s
                                              1
and hard-sigmoid are defined as ρ(x) = 1+e−4(x−0.5) , ρ(x) = max(min(x, 0), 1), respectively
(Ernoult et al., 2019). For 5-HL, 10-HL and 20-HL architectures, the Adam optimizer parameters
are as shown in Table S2 (epoch: 50, batch size: 500). The inference details of the architecture
shown in Figure 3b are described by Algorithm S3. The details for convolutional architectures are
given in Table S3 and Figure S14–S15. The cosine-annealing scheduler is used in convolutional
architectures for CIFAR-10 (Tmax = 50, ηmin = 10−6 ).
For MNIST, no pre-processing is used. For the CIFAR-10 dataset, we follow ref. (Scellier
et al., 2023) to pre-process the images. We normalize the input images using mean µ =
(0.4914, 0.4822, 0.4465) and standard deviation σ = 3 × (0.2023, 0.1944, 0.2010).
The results for comparison of time consumption were obtained in a virtualized Windows 11 envi-
ronment with Intel Xeon Gold 6238R CPU, 16 GB RAM, and Nvidia RTX A5000 (24 GB VRAM).
Other results were obtained on a Windows 11 environment with Intel Core i5-12490F, 32 GB RAM,
and Nvidia GTX 1650 (4 GB VRAM) or a Windows 11 environment with AMD R7-7700, 32 GB
RAM, and Nvidia RTX 4070 (12 GB VRAM). The default numerical precision is float32 (single-
precision float).


                                                     Table S2: The parameters of the Adam optimizer.
                                                           Parameter Name                     Default Value
                                                   Learning rate (MNIST / CIFAR-10)          10−3 /2 × 10−4
                                             First-order moment estimation decay rate (β1 )        0.9
                                            Second-order moment estimation decay rate (β2 )       0.999
                                                Small constant for numerical stability (ϵ)        10−8


                                                                                                                    28
Published as a conference paper at ICLR 2026


Table S3: Training details for Table 1 and Table 2. The results of EB-EP and P-EP come from
previous work (Ernoult et al., 2019). SGD refers to Stochastic Gradient Descent with mini-batches.
                                                                        Epoch / Batch size
    Architecture             Training approach           Optimizer                                       Learning rate         Weight decay
                                                                               -T/K
                               P-EP (sigmoid-s)             SGD            50/20-100/20               [0.005, 0.05, 0.2]           None
        2HL
                            Proposed (tanh, Adam)           Adam           50/500-10/10             [0.001, 0.001, 0.001]          None
                               P-EP (sigmoid-s)             SGD           100/20-180/20            [0.002, 0.01, 0.05, 0.2]        None
                             Proposed-DLR (tanh)            SGD           100/20-18/10             [0.002, 0.01, 0.05, 0.2]        None
        3HL                     Proposed (tanh)             SGD            100/20-18/10               [0.1, 0.1, 0.1, 0.1]         None
                            Proposed (tanh, Adam)           Adam           50/500-18/10                      10−3                  None
                               BP (tanh, Adam)              Adam            50/500-1/1                       10−3                  None
                             P-EP (hard-sigmoid)            SGD            40/20-200/10              [0.015, 0.035, 0.15]          None
        Conv
                           Proposed (hard-sigmoid)          SGD           40/128-20/10                 [0.15, 0.35, 0.9]           10−5
      (Table 1)
                              BP (hard-sigmoid)             SGD             40/128-1/1                [0.001, 0.02, 0.4]           10−5
         Conv              Proposed (hard-sigmoid)          Adam          40/128-20/10                    2 × 10−4                 10−6
   (Table 2 MNIST)            BP (hard-sigmoid)             Adam            40/128-1/1                    2 × 10−4                 10−6
        Conv               Proposed (hard-sigmoid)          Adam          50/128-40/10                   2.5 × 10−4              2 × 10−4
 (Table 2 CIFAR-10)           BP (hard-sigmoid)             Adam            50/128-1/1                   2.5 × 10−4              2 × 10−4


                                                                   64@8x8                     64@4x4
                           32@24x24                32@12x12
         1@28x28                                                                                                                1x10


                       Conv1                       MaxPool1                   Conv2                        MaxPool2           Dense

                                 Figure S14: Convolutional architectures for MNIST.


                                                                               128@8x8                 128@4x4
                                                 64@16x16      64@8x8
                      32@32x32        32@16x16
         3@32x32                                                                                                                 1x10


                   Conv1           MaxPool1      Conv2             MaxPool2           Conv3                      MaxPool3

                             Figure S15: Convolutional architectures for CIFAR-10.


                                                                   29
Published as a conference paper at ICLR 2026


Algorithm S2 Two phases in EP training process for convolution architecture
Input: Sample-label pairs (x, star )
Parameter: θ = [W0 , W1 , Wf , Bf , B1 , α1 , β1 , βf 1 ]
Output: θ
 1: function F IRST- PHASE(θ, star )
 2:    s0 ← x
 3:    for t ← 1 to T do
 4:        h1 ← Conv0 (s0 ) + β1 · MaxUnpool1 (ConvT1 (s02 ))
 5:        h2 ← Conv1 (MaxPool1 (s01 ))
 6:        hp ← Wf · Flatten(MaxPool2 (s02 ))
 7:        s01 , s02 , s0p ← ρ(h1 ), ρ(h2 ), SoftMax(hp )
 8:    end for
 9:    Λ1 ← [s0i ], i = 0, 1, 2, p
10:    return Λ1
11: end function
12: function S ECOND - PHASE(θ, Λ1 , star )
             β        β        β
13:    s0 , s1 f 1 , s2 f 1 , sp f 1 ← x, s01 , s02 , s0p
14:    for t ← 1 to K do
                                    β
15:         ep ← star − sp f 1
                                                                     β
16:            h1 ← Conv0 (s0 ) + β1 · MaxUnpool1 (ConvT1 (s2 f 1 ))
                                        β
17:            h2 ← Conv1 (MaxPool1 (s1 f 1 )) + βf · MaxUnpool2 (Unflatten(Wf⊤ ep ))
                                                            β
18:        hp ← Wf · Flatten(MaxPool2 (s2 f 1 ))
            β        β        β
19:        s1 f 1 , s2 f 1 , sp f 1 ← ρ(h1 ), ρ(h2 ), SoftMax(hp )
20:    end for
21: end function


                                                                30
Published as a conference paper at ICLR 2026


Algorithm S3 EP with feedback scaling and residual connections (Figure 3b)
Input: (x, star )
Parameter: θ = [W0 , Wi , Wf , Bf , Bi , βi , βf 1 ]
Output: θ
 1: function I TERATION(θ, Λ1 , star )
 2:    for t ← 1 to K do
 3:         if Nudging Phase then
 4:              βf ← βf 1
                       β        β        β
 5:              s0 , s1 f 1 , s2 f 1 , sp f 1 ← x, s01 , s02 , s0p
 6:         else
 7:              βf ← 0
 8:              s0 ← x
 9:         end if
                                              β                   β
10:         h1 ← W0 s0 + β1 B1 s2 f + β4,1 B4,1 s4 f
                            βf                  βf
11:         h2 ← W1 s1 + β2 B2 s3
                            β                   β
12:         h3 ← W2 s2 f + β3 B3 s4 f
                            βf                   β              β   β
13:         h4 ← W3 s3 + β14 B4 s5 f + W1,4 s1 f + β7,4 B7,4 s7 f
                            β                   β
14:         h5 ← W4 s4 f + β5 B5 s6 f
                            βf                  β
15:         h6 ← W5 s5 + β6 B6 s7 f
                            β                   β             β      β
16:         h7 ← W6 s6 f + β7 B7 s8 f + W4,7 s4 f + β10,7 B10,7 s10f
                            βf                  βf
17:         h8 ← W7 s7 + β8 B8 s9
                            β                   β
18:         h9 ← W8 s8 f + β9 B9 s10f
                              βf                  β               β
19:         h10 ← W9 s9 + βf Bf ep f + W7,10 s7 f
                            β
20:         hp ← Wf s10f
              βf
21:         si ← ρ(hi ), i = 0, 1, 2, . . . , 10
              β
22:         sp f ← SoftMax(hp )
23:    end for
24: end function


                                               31