Initial commit: ept — backprop-free equilibrium transformer (EP)

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
author: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
committer: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
commit: b83947778e2c776f757a07d4719b7ce961d7ed55 (patch)
tree: b9cc01d7adda691d9156d9d04f4fb2f644674e96 /refs/fre_rnn_full.txt
1 files changed, 1945 insertions, 0 deletions
diff --git a/refs/fre_rnn_full.txt b/refs/fre_rnn_full.txt
new file mode 100644
index 0000000..48da35c
--- /dev/null
+++ b/refs/fre_rnn_full.txt
@@ -0,0 +1,1945 @@
+                                        Published as a conference paper at ICLR 2026
+
+
+
+
+                                        TOWARD PRACTICAL EQUILIBRIUM PROPAGATION:
+                                        BRAIN-INSPIRED RECURRENT NEURAL NETWORK
+                                        WITH FEEDBACK REGULATION AND RESIDUAL
+                                        CONNECTIONS
+
+                                         Zhuo Liu                                               Tao Chen ∗
+                                         School of Microelectronics                             School of Microelectronics
+                                         University of Science and Technology of China          University of Science and Technology of China
+                                         Hefei 230026, Anhui, China                             Hefei 230026, Anhui, China
+arXiv:2508.11659v2 [cs.NE] 7 May 2026
+
+
+
+
+                                         zhuoliu00@mail.ustc.edu.cn                             tchen@ustc.edu.cn
+
+
+
+                                                                                     ABSTRACT
+                                                    Brain-like intelligent systems need brain-like learning methods. Equilibrium
+                                                    Propagation (EP) is a biologically plausible learning framework with strong
+                                                    potential for brain-inspired computing hardware. However, existing
+                                                    implementations of EP suffer from instability and prohibitively high
+                                                    computational costs. Inspired by the structure and dynamics of the brain, we
+                                                    propose a biologically plausible Feedback-regulated REsidual recurrent neural
+                                                    network (FRE-RNN) and study its learning performance in the EP framework.
+                                                    Feedback regulation enables rapid convergence by attenuating feedback signals
+                                                    and reducing the disturbance of feedback paths to feedforward paths. The
+                                                    improvement in the convergence property reduces the computational cost and
+                                                    training time of EP by orders of magnitude, delivering performance on par with
+                                                    backpropagation (BP) in benchmark tasks. Meanwhile, residual connections with
+                                                    brain-inspired topologies help alleviate the vanishing gradient problem that
+                                                    arises when feedback pathways are weak in deep RNNs. Our approach substantially
+                                                    enhances the applicability and practicality of EP. The techniques developed
+                                                    here also offer guidance for implementing in-situ learning in physical neural
+                                                    networks.
+
+
+                                        1       INTRODUCTION
+                                        Backpropagation (BP) has been the driving force behind the success of artificial intelligence (AI)
+                                        across a wide variety of tasks, ranging from image recognition to natural language processing
+                                        (Rumelhart et al., 1986; Lecun, 1988; He et al., 2016; Vaswani et al., 2017). Despite these
+                                        triumphs, BP’s reliance on non-local error signals and weight transport lacks biological plausibility
+                                        (Journé et al., 2023; Ororbia, 2023). The brain does not appear to implement the gradient
+                                        computations performed by BP, in particular the explicit derivative of the activation function, which
+                                        demands precise access to the rate of change in neuronal activities at specific operating points
+                                        (Ororbia, 2023). Moreover, implementing BP in neuromorphic systems incurs enormous overhead
+                                        (Kudithipudi et al., 2025). Drawing inspiration from the topology and dynamics of the brain is a viable
+                                        approach to advancing biologically plausible learning mechanisms and to promoting energy-efficient
+                                        computing systems for AI.
+                                        Equilibrium Propagation (EP) (Scellier & Bengio, 2017; Ernoult et al., 2019; Laborieux et al., 2021)
+                                        presents a compelling and hardware-friendly alternative. It leverages naturally settling dynamics in
+                                        RNN for credit assignment, and eliminates the need for explicit activation derivatives. EP operates
+                                        in two phases with nearly identical dynamics, and the synaptic adjustments depend only on local
+                                        information (Ackley et al., 1985; Movellan, 1991; Ernoult et al., 2020). In EP, the output layer is
+                                        softly nudged by the prediction error toward configurations that incrementally minimize the loss
+                                        function, a regime termed weak supervision (Millidge et al., 2023). A major drawback of EP is
+                                        its notably slow training speed and instability. An RNN often requires dozens or even hundreds
+                                        of iterations to reach a stable state (Scellier & Bengio, 2017). Previous attempts to optimize EP’s
+                                        performance have led to markedly more complicated procedures (O’Connor et al., 2019; Laborieux
+                                        & Zenke, 2024).
+
+                                            ∗ Corresponding Author
+
+In this paper, we draw inspiration from the brain and propose a Feedback-regulated REsidual
+recurrent neural network (FRE-RNN). We substantially improve the convergence properties of the RNNs
+and training speed of EP while achieving performance comparable to BP. Our contributions are as
+follows:
+
+       • By scaling down the feedback strength of RNNs, we enhance the robustness of EP and
+         accelerate the training and inference speed by orders of magnitude because of the improved
+         convergence properties.
+       • To counteract the gradient vanishing problem caused by weak feedback, we introduce
+         residual connections into the layered RNNs, enabling the training of deep networks that
+         previously challenged EP and achieving performance closer to BP.
+       • The feedback regulation and residual connections in RNNs of arbitrary graph topologies
+         mirror the multi-scale recurrence in biological neural networks. Our work fosters EP’s
+         biological plausibility and extends its applicability in brain-inspired computational hardware.
+
+
+2     BACKGROUND
+
+2.1   CONVERGENT RNNS WITH STATIC INPUT
+
+Consider an RNN as a dynamical system driven by a static input x:
+
+                                        s[t + 1] = F (x, s[t], θ),                                    (1)
+
+where F is the transition function, s[t] is the network state at time step t (t = 0, 1, 2, ..., T), and
+θ denotes the parameters. Assuming that the network state stabilizes within T steps, the RNN reaches
+a stable point s[T]. Convergence is typically guaranteed either by symmetric connections with
+asynchronous updates or by a sufficiently small spectral radius of asymmetric connections with
+synchronous updates (Hopfield, 1982; Yildiz et al., 2012; Liu et al., 2026). That said, other factors,
+e.g., the activation function, also influence the dynamical properties of RNNs (Miller & Hardt, 2019).
+
+2.2   SCALING ADJACENCY MATRIX TO TUNE NETWORK DYNAMICS
+
+Scaling the spectral radius (SR) of the adjacency matrix, i.e., the largest absolute value among the
+eigenvalues of the weight matrix, is a common method for tuning the dynamics of RNNs (Bai et al., 2012;
+Nakajima et al., 2024; Liu et al., 2026). An SR less than one typically yields stable and convergent
+dynamics. In this case, injected signals tend to decay over time, which manifests as short-term memory.
+An SR exceeding one can give rise to expansive or even chaotic behavior in which small perturbations
+are amplified. By adjusting the SR, one can bias the RNN toward convergent, oscillatory, or
+edge-of-chaos regimes, thereby tuning computational properties such as convergence speed or long-term
+memory capacity (Jaeger & Haas, 2004; Legenstein & Maass, 2007; Miller & Hardt, 2019).
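
As a rough illustration of the paragraph above, the following Python sketch rescales a weight matrix to a target SR and then iterates the linear map s ← W · s. It is restricted to 2×2 matrices so that the eigenvalues have a closed form and only the standard library is needed. The function names and the example matrix are illustrative assumptions and do not come from the paper.

```python
import cmath


def spectral_radius_2x2(W):
    """Largest absolute eigenvalue of a 2x2 matrix (closed form)."""
    (a, b), (c, d) = W
    trace, det = a + d, a * d - b * c
    disc = cmath.sqrt(trace * trace - 4 * det)
    return max(abs((trace + disc) / 2), abs((trace - disc) / 2))


def rescale_to_sr(W, target_sr):
    """Multiply W by a scalar so that its spectral radius equals target_sr."""
    k = target_sr / spectral_radius_2x2(W)
    return [[k * w for w in row] for row in W]


def iterate_norm(W, s, steps):
    """Apply s <- W s for `steps` steps and return the Euclidean norm."""
    for _ in range(steps):
        s = [W[0][0] * s[0] + W[0][1] * s[1],
             W[1][0] * s[0] + W[1][1] * s[1]]
    return (s[0] ** 2 + s[1] ** 2) ** 0.5


if __name__ == "__main__":
    W = [[0.0, 2.0], [-1.0, 1.0]]  # hypothetical asymmetric weights
    for sr in (0.5, 0.9, 1.5):
        norm = iterate_norm(rescale_to_sr(W, sr), [1.0, 1.0], 50)
        print(f"SR={sr}: |s[50]| = {norm:.3e}")
```

With an SR below one the injected state decays, and with an SR above one it grows, which is the linear counterpart of the convergent and expansive regimes described above. A nonlinear activation would bound the growth but is omitted here for simplicity.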
+
+2.3   EQUILIBRIUM PROPAGATION
+
+Equilibrium propagation is a learning framework initially based on energy-based models. It proceeds
+in two phases: a free (first) phase and a weakly clamped (second) phase. In the free phase, the
+RNN converges to a steady state s^0 under the stimulation of the input alone. In the clamped phase, the
+network is gently nudged by the prediction error and settles to a new stable state s^β. The weight
+update can be simplified to a contrastive learning rule compatible with spike-timing-dependent
+plasticity (STDP) (Scellier et al., 2018). EP has been further generalized to asymmetric RNNs governed
+by vector field dynamics (Scellier et al., 2018). Recent work shows that asymmetry in skew-symmetric
+Hopfield models can improve classification performance (Høier et al., 2024).
+
+
+
+2.4   FEEDBACK REGULATION AND NETWORK STRUCTURE IN THE BRAIN
+
+Cortical areas in the brain feature dynamic regulation of feedforward and feedback connections
+(Felleman & Van Essen, 1991; Mejias et al., 2016; Michalareas et al., 2016; Semedo et al., 2022;
+Fişek et al., 2023; Wang et al., 2023). In the visual system, for instance, feedforward signals
+dominate immediately following the onset of an external stimulus, whereas feedback signals become
+prominent during spontaneous activity. Dynamically regulating the strength of feedback allows the brain
+to optimize information integration, ensuring efficient perception and decision-making. In
+mammalian neocortices, information processing involves not only feedforward synaptic chains but also
+extensive lateral and feedback loops that interconnect disparate regions, forming a richly recursive
+network rather than a strictly layered structure. This topology implies a short average path length
+between neurons and efficient information flow (Watts & Strogatz, 1998; Markov et al., 2013; Lynn
+& Bassett, 2019; Kulkarni & Bassett, 2025). In deep neural networks, residual connections reflect
+the long-range skip-layer projections observed in cortical circuits (Perich & Rajan, 2020; Holk &
+Mejias, 2024). They mitigate the vanishing gradient problem by providing skip pathways that preserve
+the gradient (He et al., 2016).
+
+3     ACCELERATING EP WITH BRAIN-INSPIRED NETWORK PROPERTIES
+
+[Figure 1 schematic. (a) s_0 → W_0 → s_1 → α_1 W_1 → s_2 → W_f → prediction s_p; feedback paths
+β_1 B_1 and β_f B_f; label s_t; error e_p = s_t − s_p. (b) s_0 → Conv_0 (32,5,1,0) → s_1 →
+P_1 (2), Conv_1 (64,5,1,0) → s_2 → P_2 (2), W_f → prediction s_p; feedback paths Conv_1^T, P_1^{-1}, β_1
+and W_f^T, P_2^{-1}, β_f; label s_t; error e_p = s_t − s_p.]
+
+Figure 1: Illustration of feedback and feedforward regulation. (a) Layered architecture of the RNN. The
+feedforward weights W_i and feedback weights B_i are rescaled by coefficients α_i and β_i, respectively.
+The dashed box encloses an RNN formed by layers s_1 and s_2 with feedforward and feedback
+pathways. β_f is the nudging factor, which essentially scales the feedback strength of the prediction
+error. (b) Embedding a convolutional architecture in the RNN. The convolutional parameters (32,5,1,0)
+are written as (channels, kernel size, stride, padding). The parameter (2) in (b) denotes max-pooling
+with stride 2. Conv_i^T represents transposed convolution, the inverse process of the convolution, and
+P_i^{-1} means max-unpooling (Ernoult et al., 2019). Model architectures and the training process are
+given in Appendix D.
+
+
+3.1   PROTOTYPICAL SETTING OF EQUILIBRIUM PROPAGATION
+
+Unlike the prototypical setting of equilibrium propagation (P-EP) (Ernoult et al., 2019), we separate
+the input and output layers from the recurrent network (Figure 1a). This separation allows the output
+layer to adopt the SoftMax activation commonly used in feedforward networks, which facilitates
+performance comparison (Laborieux & Zenke, 2024). For clarity, the RNN (black dashed box in
+Figure 1a) shown here contains only two hidden layers, s_1 and s_2, but the approach applies to deeper
+structures (see below). The states of the RNN evolve for T discrete steps until they converge. The
+dynamics of the whole RNN can be formulated as:
+                           s^{β_f}[t + 1] = F(s^{β_f}[t], b) = ρ(W · s^{β_f}[t] + b),
+                                    b = [W_0 · s_0, β_f · B_f · e_p],                                      (2)
+
+where s^{β_f}[t] is the state of the RNN at time t, ρ is the activation function, W is the forward weight
+matrix of the RNN, and b combines the feedforward input and the error-nudging term. Note that
+s^{β_f} = [s_1^{β_f}, s_2^{β_f}]. For each sample-label pair (x, s_{tar}), we run the free phase (β_f = 0) for t_e
+iterations, obtain the prediction s_p = SoftMax(W_f · s_2), and compute the prediction error e_p =
+s_{tar} − s_p. During the clamped phase, the error nudges the RNN through the feedback weights B_f and
+scaling coefficient β_f = β_{f1} (β_{f1} = 0.1 for the layered architecture and β_{f1} = 0.25 for the
+convolutional architecture by default). The network evolves for K further iterations under clamping to
+another state. The weights (W_0, W_1) are then updated with an STDP-compatible rule:
+            ΔW_i = ds_{i+1} · (s_i^0)^⊤,        ds_{i+1} = s_{i+1}^{β_{f1}} − s_{i+1}^0,                   (3)
+where ds_i is the offset of the stable point caused by the error nudging (Scellier et al., 2018). Similarly,
+the final weight for the output is updated:
+
+                                                 ΔW_f = (s_{tar} − s_p^0) · (s_2^0)^⊤.                     (4)
+We also consider an RNN embedded with a convolutional architecture in its forward paths (2 convolution
+layers, 2 max-pooling layers and 1 fully connected layer), shown in Figure 1b. The forward
+convolutional structure follows the architecture of existing convolutional neural networks (CNN)
+(Krizhevsky et al., 2012; Simonyan & Zisserman, 2015), in which a pooling layer is placed after
+the activation of the convolution layer. We transform the CNN to an RNN by adding feedback
+connections symmetric with the feedforward connections (see Appendix D for the pseudocode and
+schematics).
+
+3.2   FEEDBACK REGULATION IN LAYERED RNN FOR FAST CONVERGENCE
+
+[Figure 2 plots: eight heat-map panels (a)-(h); vertical axis: neuron index; horizontal axis: time step t
+(0 to 80).]
+
+Figure 2: Convergence dynamics and speed versus feedback scaling β_i. All neurons in all hidden
+layers are indexed (s_1: 0-63; s_2: 64-127). Colors indicate neuronal activity (a, c, e, g) and changes in
+activity (b, d, f, h). (a) The state evolution of an RNN with symmetric weights and β_i = 0.1; (b) the
+one-step difference of the neural states in (a). (c, d) Symmetric weights with β_i = 2; (e, f) asymmetric
+weights with β_i = 0.1; (g, h) asymmetric weights with β_i = 4. In both the symmetric and asymmetric
+feedback cases, down-scaling feedback connections tends to stabilize the network. See Figure 5d
+for the statistical robustness.
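
The effect of feedback scaling on settling time can be illustrated with a toy two-layer loop. The sketch below is a minimal example under stated assumptions, not the paper's experiment: the 3×3 weights, the constant drive standing in for W_0 · s_0, the tanh activation, and the tolerance are all hypothetical. It counts synchronous update steps until the largest one-step state change drops below the tolerance.

```python
import math

# Hypothetical feedforward weights from s1 to s2.
W1 = [[0.5, -0.3, 0.2],
      [0.1, 0.4, -0.5],
      [-0.2, 0.3, 0.6]]
B1 = [list(col) for col in zip(*W1)]  # symmetric feedback: B1 = W1^T
DRIVE = [0.5, -0.2, 0.1]              # stands in for the static term W0 · s0


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def settle(beta, max_steps=500, tol=1e-8):
    """Return the number of synchronous steps needed to settle, or None."""
    s1, s2 = [0.0] * 3, [0.0] * 3
    for step in range(1, max_steps + 1):
        feedback = matvec(B1, s2)
        new_s1 = [math.tanh(d + beta * f) for d, f in zip(DRIVE, feedback)]
        new_s2 = [math.tanh(h) for h in matvec(W1, s1)]
        delta = max(abs(a - b) for a, b in zip(new_s1 + new_s2, s1 + s2))
        s1, s2 = new_s1, new_s2
        if delta < tol:
            return step
    return None  # did not settle within max_steps


if __name__ == "__main__":
    for beta in (0.0, 0.1, 1.0, 4.0):
        print(f"beta={beta}: settled after {settle(beta)} steps")
```

With β = 0 the loop is purely feedforward and settles as soon as the signal has passed through both layers. For small β the two-step map is a contraction in this toy setting, so the state settles within a few tens of steps; larger β carries no such guarantee.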
+
+Although the SR can tune the RNN dynamics, scaling the forward weights W_i distorts forward signal
+propagation, which is harmful to performance (see below). Therefore, we turn to another choice,
+namely, scaling only the feedback strength with β_i. This coefficient scales the gradients in the same
+way as the nudging factor β_f. We consider both symmetric (B_i = (W_i)^⊤) and asymmetric (B_i ≠
+(W_i)^⊤) recurrent connections in the study, and compare the results with FNNs of the same size
+trained by BP (feedback connections removed) or by feedback alignment (FA) (Lillicrap et al., 2016),
+which uses random weights B_i ≠ (W_i)^⊤ to feed back the error. Note that, after scaling, the overall
+weight matrix W of a symmetric RNN is no longer strictly symmetric. Therefore, we start from
+the vector field setting of EP rather than the energy-based setting in the first place. The feedforward
+and feedback weights are multiplied by coefficients α_i and β_i, respectively. Figure 2a-d shows the
+convergence speed for different β_i. With asymmetric weights, the network can converge to a fixed
+point (Figure 2e, f), exhibit cyclical oscillation (Figure 2g, h), or even become chaotic. The feedback
+weights stay fixed during the training process, which differs from EP in vector field dynamics (Scellier
+et al., 2018). The pseudocode of the learning procedure for the 2-hidden-layer RNN shown in Figure 1a
+is provided in Algorithm 1.
+
+Algorithm 1 EP with Feedforward and Feedback Scaling
+Require: Input: (x, s_{tar})
+Require: Parameters: θ = [W_0, W_1, W_f, B_f, B_1, α_1, β_1, β_{f1}]
+Ensure: Updated parameters θ
+function FIRST-PHASE(θ, s_{tar})
+    s_0 ← x
+    for t = 1 to T do
+        h_1 ← W_0 · s_0 + β_1 · B_1 · s_2^0
+        h_2 ← α_1 · W_1 · s_1^0
+        h_p ← W_f · s_2^0
+        s_1^0, s_2^0, s_p^0 ← ρ(h_1), ρ(h_2), SoftMax(h_p)
+    end for
+    Λ_1 ← [s_i^0], i = 0, 1, 2, p
+    return Λ_1
+end function
+function SECOND-PHASE(θ, Λ_1, s_{tar})
+    s_1^{β_{f1}}, s_2^{β_{f1}}, s_p^{β_{f1}} ← s_1^0, s_2^0, s_p^0
+    for t = 1 to K do
+        e_p ← s_{tar} − s_p^{β_{f1}}
+        h_1 ← W_0 · s_0 + β_1 · B_1 · s_2^{β_{f1}}
+        h_2 ← α_1 · W_1 · s_1^{β_{f1}} + β_f · B_f · e_p
+        h_p ← W_f · s_2^{β_{f1}}
+        s_1^{β_{f1}}, s_2^{β_{f1}}, s_p^{β_{f1}} ← ρ(h_1), ρ(h_2), SoftMax(h_p)
+    end for
+    ds_i ← s_i^{β_{f1}} − s_i^0, i = 1, 2
+    Λ_2 ← [ds_1, ds_2]
+    return Λ_2
+end function
+function UPDATING-WEIGHTS(θ, Λ_1, Λ_2, s_{tar})
+    ΔW_i ← ds_{i+1} · (s_i^0)^⊤, i = 0, 1
+    ΔW_f ← (s_{tar} − s_p^0) · (s_2^0)^⊤
+end function
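
Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the activation ρ is assumed to be a clip to [0, 1], hidden states are assumed to start at zero, the feedback weights are taken as the symmetric case (B_1 = W_1^⊤, B_f = W_f^⊤), and all helper names (`step`, `make_params`, and so on) are hypothetical. No learning rate is applied, because Algorithm 1 only specifies the update directions.

```python
import math
import random


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def outer(u, v):
    return [[a * b for b in v] for a in u]


def rho(h):
    # Assumed activation: clip to [0, 1]; the section does not fix rho.
    return [min(1.0, max(0.0, x)) for x in h]


def softmax(h):
    m = max(h)
    e = [math.exp(x - m) for x in h]
    z = sum(e)
    return [x / z for x in e]


def step(p, x, s1, s2, nudge):
    """One synchronous update of both hidden layers and the readout."""
    fb1 = matvec(p["B1"], s2)
    ff2 = matvec(p["W1"], s1)
    h1 = [a + p["beta1"] * b for a, b in zip(matvec(p["W0"], x), fb1)]
    h2 = [p["alpha1"] * a + n for a, n in zip(ff2, nudge)]
    return rho(h1), rho(h2), softmax(matvec(p["Wf"], s2))


def first_phase(p, x, T):
    """Free phase: no nudging, T synchronous steps from a zero state."""
    s1, s2 = [0.0] * len(p["W0"]), [0.0] * len(p["W1"])
    sp, zero = None, [0.0] * len(p["W1"])
    for _ in range(T):
        s1, s2, sp = step(p, x, s1, s2, zero)
    return s1, s2, sp


def second_phase(p, x, free, target, K):
    """Clamped phase: K steps nudged by beta_f * Bf * e_p, starting at the free state."""
    s1, s2, sp = free
    for _ in range(K):
        ep = [t - y for t, y in zip(target, sp)]
        nudge = [p["beta_f"] * v for v in matvec(p["Bf"], ep)]
        s1, s2, sp = step(p, x, s1, s2, nudge)
    ds1 = [a - b for a, b in zip(s1, free[0])]
    ds2 = [a - b for a, b in zip(s2, free[1])]
    return ds1, ds2


def weight_updates(x, free, ds, target):
    """Update directions of Equations (3) and (4)."""
    s1, s2, sp = free
    dW0 = outer(ds[0], x)
    dW1 = outer(ds[1], s1)
    dWf = outer([t - y for t, y in zip(target, sp)], s2)
    return dW0, dW1, dWf


def make_params(n_in, n1, n2, n_out, seed=0):
    rng = random.Random(seed)
    rand = lambda r, c: [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
    W0, W1, Wf = rand(n1, n_in), rand(n2, n1), rand(n_out, n2)
    return {"W0": W0, "W1": W1, "Wf": Wf,
            "B1": [list(c) for c in zip(*W1)],  # assumed symmetric feedback
            "Bf": [list(c) for c in zip(*Wf)],
            "alpha1": 1.0, "beta1": 0.1, "beta_f": 0.1}


if __name__ == "__main__":
    p = make_params(4, 3, 3, 2)
    x, target = [0.2, 0.7, 0.1, 0.9], [1.0, 0.0]
    free = first_phase(p, x, T=20)                    # T = 10 x hidden layers
    ds = second_phase(p, x, free, target, K=10)       # K = T / 2
    dW0, dW1, dWf = weight_updates(x, free, ds, target)
    print("prediction:", free[2])
    print("dW1:", dW1)
```

The demo uses T = 20 and K = 10, which follows the default T = 10 × N_HiddenLayer and K = T/2 stated for the layered experiments.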
+
+
+
+3.3    RESIDUAL CONNECTIONS TO AVOID VANISHING GRADIENTS
+
+In our 10-hidden-layer RNN with symmetric connections, we add cross-layer residual links
+(Figure 3a-b) and carry out an ablation study of their effect on performance. The three long-range
+bidirectional connections bypass adjacent layers to reduce gradient decay. For RNNs with asymmetric
+connections, we introduce skip-layer connections between non-adjacent layers with probability
+P = 20%, creating an RNN with arbitrary graph topologies (AGT) in which any pair of layers forms
+connections stochastically (Figure 3c) (Salvatori et al., 2022). (See Algorithm S3 in Appendix D for
+training details.)
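
The stochastic skip-layer wiring can be sketched as a layer-level connectivity mask. The example below is a hypothetical illustration: it assumes that adjacent layers are always connected in both directions, that there are no self-connections, and that each ordered pair of non-adjacent layers is sampled independently with probability `p_skip`. The section does not specify these details, and the function name is an assumption.

```python
import random


def agt_layer_mask(n_layers, p_skip=0.2, seed=0):
    """Return mask[i][j] = 1 if layer j projects to layer i, else 0."""
    rng = random.Random(seed)
    mask = [[0] * n_layers for _ in range(n_layers)]
    for i in range(n_layers):
        for j in range(n_layers):
            if i == j:
                continue  # assumed: no self-connections
            if abs(i - j) == 1:
                mask[i][j] = 1  # adjacent layers: always connected
            elif rng.random() < p_skip:
                mask[i][j] = 1  # non-adjacent layers: stochastic skip link
    return mask


if __name__ == "__main__":
    for row in agt_layer_mask(10, p_skip=0.2, seed=0):
        print(" ".join(str(v) for v in row))
```

Each nonzero entry would correspond to one weight block in an adjacency matrix such as the one in Figure 3c. Sampling each direction independently gives asymmetric connectivity; a symmetric variant would sample only pairs with i < j and mirror them.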
+
+4     EXPERIMENTS
+We evaluated our RNN models on the MNIST and CIFAR-10 datasets and compared the results with
+P-EP and BP. The MNIST dataset consists of 70,000 grayscale handwritten digit images (28×28 pixels)
+split into 60,000 training and 10,000 test samples. CIFAR-10 contains 60,000 RGB images (32×32
+pixels) in 10 categories, divided into 50,000 training and 10,000 test samples. Pre-processing,
+network structures and additional training details are in Appendix D.
+
+4.1    INFLUENCE OF FEEDFORWARD SCALING AND FEEDBACK SCALING
+
+Figure 4 compares the effects of feedforward scaling α_i and feedback scaling β_i. In general, relatively
+small feedback scaling (β_i = 0.1) yields high MNIST accuracy (Figure 4). In deeper RNNs, overly
+
+
+Figure 3: (a) A 10-hidden-layer RNN model with residual connections. The solid blue wires and
+the dashed orange wires represent forward and feedback residual connections respectively. The
+bidirectional connections are symmetric. (b) Adjacency matrix of (a). The blocks (green) other than
+the sub-diagonals indicate residual connections. (c) Adjacency matrix for an RNN with arbitrary
+graph topology.
+
+
+
+[Figure 4 heat maps, test accuracy. Rows: feedback scaling β_i; columns: feedforward scaling α_i;
+color scale from 0.75 to 1.00.]
+
+(a) 2HL test accuracy
+   β_i \ α_i     0.01      0.1       1.0       4.0
+   0.001        0.9659    0.9730    0.9756    0.9753
+   0.01         0.9666    0.9723    0.9760    0.9758
+   0.1          0.9624    0.9711    0.9725    0.9694
+   1.0          0.8925    0.9127    0.9300    0.7978
+
+(b) 3HL test accuracy
+   β_i \ α_i     0.01      0.1       1.0       4.0
+   0.001        0.9348    0.9692    0.9718    0.9555
+   0.01         0.9246    0.9679    0.9765    0.9715
+   0.1          0.6323    0.9497    0.9741    0.9702
+   1.0          0.4862    0.8739    0.5249    0.1676
+
+(c) 5HL test accuracy
+   β_i \ α_i     0.01      0.1       1.0       4.0
+   0.001        0.3844    0.7980    0.9338    0.8048
+   0.01         0.2515    0.9365    0.9575    0.8583
+   0.1          0.2104    0.8386    0.9757    0.9332
+   1.0          0.2096    0.3891    0.2500    0.1324
+
+Figure 4: The influence of feedforward scaling α_i and feedback scaling β_i on the accuracy of MNIST
+classification. (a) 2 hidden layers; (b) 3 hidden layers; (c) 5 hidden layers. Each layer has 64
+neurons. By default, T = 10 × N_HiddenLayer and K = T/2. Each result is averaged over five repeated
+experiments.
+
+
+[Figure 5 plots: (a) test error, (b) SR, (c) FTMLE, (d) convergence time, each versus feedback scaling
+β_i. Legend: symm-before training, asymm-before training, symm-after training, asymm-after training.]
+
+
+Figure 5: The test error, SR, FTMLE, and convergence time versus feedback scaling βi. The results
+are obtained from a 3-hidden-layer (64 neurons per layer) model trained on the MNIST dataset. Note
+that the network does not converge under certain conditions, resulting in missing values in panel d.
+
+low feedback scaling βi jeopardizes the performance, which we attribute to vanishing gradients
+(Figure 4c, right two columns). In contrast, down-scaling the feedforward weights degrades
+performance, as the inference signals are weakened through the layers (see rows of Figure 4a).
+However, up-scaling αi can also be detrimental, as this easily leads to saturation of the neural
+states. The best performance of a 5-hidden-layer RNN is achieved without feedforward scaling
+(αi = 1) and with a trade-off in feedback scaling at βi = 0.1. These results suggest that balancing
+the feedforward and feedback strengths is critical for better performance, in terms of not only
+accuracy but also speed (see Table 1).
+To further investigate the influence of feedback scaling βi, we plot the error, SR, finite-time
+maximum Lyapunov exponent (FTMLE) (Shadden et al., 2005; Kanno & Uchida, 2014) and convergence
+time against the feedback scaling coefficient before and after training of a 3-hidden-layer RNN
+on MNIST (Figure 5). Larger feedback scaling βi decreases accuracy (Figure 5a). As expected, SR is
+positively correlated with βi (see Figure 5b), and a large SR can lead to instability of the RNN,
+as indicated by the FTMLE shown in Figure 5c, which in turn explains the results in Figure 5a.
+In general, down-scaling the feedback (βi < 1) reduces the convergence time of the RNN, which is
+favorable. Note that up-scaling the feedback (βi > 1) can also decrease FTMLE and convergence time.
+However, this is attributed to the saturation of the neural states, and it also lowers performance.
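+The dependence of SR on the feedback scaling can be illustrated with a simplified linear sketch.
+The block matrix below is an assumption made for illustration, not the exact FRE-RNN dynamics: it
+couples two layers through feedforward weights αW and symmetric feedback weights βW^T (α, β ≥ 0),
+ignores the activation derivative, and writes σ_k(W) for the singular values of W. The FTMLE
+expression is the common finite-time definition based on the growth of a small state perturbation
+δs over T steps; the numerical procedure used for Figure 5c may differ in detail.
+
+```latex
+\[
+M(\alpha,\beta) =
+\begin{pmatrix} 0 & \beta W^{\top} \\ \alpha W & 0 \end{pmatrix},
+\qquad
+M^{2} =
+\begin{pmatrix} \alpha\beta\, W^{\top} W & 0 \\ 0 & \alpha\beta\, W W^{\top} \end{pmatrix}
+\]
+\[
+\lambda_k(M) = \pm\sqrt{\alpha\beta}\,\sigma_k(W)
+\;\Longrightarrow\;
+\rho(M) = \sqrt{\alpha\beta}\,\sigma_{\max}(W),
+\qquad
+\mathrm{FTMLE}(T) = \frac{1}{T}\,\ln\frac{\lVert \delta s(T) \rVert}{\lVert \delta s(0) \rVert}
+\]
+```
+
+In this sketch with α = 1, reducing β from 1 to 0.01 shrinks ρ(M) by a factor of 10, while β = 4
+doubles it, which is in line with the positive correlation between SR and βi seen in Figure 5b.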
+Additionally, one might suspect that the gradient signals in the lower layers are not fulfilling
+their intended role. In reservoir computing, where only the last layer is trained, the network can
+also reach high accuracy as long as the output dimension is large enough. However, this is unlikely
+in our case, as each layer in our network has only 64 neurons by default (except for the results in
+Table 1). To further confirm that the learning in the lower layers is meaningful, we performed
+training with the weights of the lower layers frozen; details of these experiments are included in
+Appendix C.5. The results clearly show that achieving results comparable to BP requires effective
+training of the lower layers.
+
+[Figure 6, panels (a)-(h): Testing Error (log scale) versus Epoch (0 to 50). Legends: T = 10, 20,
+50, 100 in (a)-(d); βi = 0.001, 0.01, 0.1, 0.25, 1.0, 2.0, 4.0 in (e)-(h). Axis tick labels omitted.]
+
+Figure 6: Test error with different hyperparameters. The curves of different T (10, 20, 50, 100) with
+2 hidden layers (64 neurons per hidden layer) and (a) βi = 0.01; (b) βi = 0.1; (c) βi = 1; (d)
+βi = 4. The curves of different βi (0.001, 0.01, 0.1, 0.25, 1, 2, 4) with (e) 2 hidden layers; (f) 3
+hidden layers; (g) 5 hidden layers; (h) 10 hidden layers. The shaded areas represent deviations of
+five repeated experiments. By default, T = 10 × NHiddenLayer , K = T /2. See Appendix A for
+more information.
+
+
+4.2   DOWN-SCALING FEEDBACK LEADS TO FASTER CONVERGENCE
+
+Figure 6a-d plots the error versus the number of epochs for different numbers of iteration steps T.
+With βi = 0.01 (Figure 6a), the model with T = 10 and K = 5 works as well as the model with
+T = 100 and K = 50, suggesting a possible speedup in training. Larger βi requires more iterations
+to achieve a given level of performance (see Figure 6b, c, d). Larger βi means larger SR and
+FTMLE, so more iterations are required to settle the RNN, as shown in Figure 2 and Figure 5(b-d).
+In the worst case, the gradient signal is completely distorted. At βi = 4, even T = 100 fails to
+exceed 95% accuracy. Figure 6e-h shows that while shallow networks benefit from low βi, deeper
+networks (3, 5 and 10 layers) lose accuracy at low βi. In all cases, training performance peaks at
+a certain βi that depends on the network depth. Additional results are provided in Table S1 in
+Appendix B.
+
+Table 1: Comparison with P-EP and BP in accuracy and cost. The results of P-EP come from
+previous work (Ernoult et al., 2019). For BP results, we used a network with the same number
+of layers and number of nodes/channels. Each experiment is repeated five times, and the standard
+deviation is given. By default, βi = 0.01 in our results, the feedback weights are symmetric with
+the feedforward weights for P-EP and Ours, and the learning rates in all layers are the same except
+for Ours-DLR (different learning rate), which uses varying learning rates identical to those of P-EP.
+For 2HL (two hidden layers) and 3HL (three hidden layers), there are 512 nodes per hidden layer.
+See Appendix D for more details.
+
+ Architecture   Training approach      Testing (Training)        Epoch / Batch size - T/K   Wall Clock Time (HH:MM:SS)
+ 2HL            P-EP (sigmoid-s)       98.05%±0.10% (99.86%)     50/20-100/20               1:56:-
+                Ours (tanh, Adam)      98.39%±0.04% (100.00%)    50/500-10/10               0:01:16
+                BP (tanh, Adam)        98.26%±0.06% (100.00%)    50/500-1/1                 0:00:18
+ 3HL            P-EP (sigmoid-s)       97.99%±0.18% (99.90%)     100/20-180/20              8:27:-
+                Ours-DLR (tanh)        97.65%±0.08% (98.93%)     100/20-18/10               1:01:14
+                Ours (tanh)            97.83%±0.13% (99.98%)     100/20-18/10               1:01:54
+                Ours (tanh, Adam)      98.36%±0.06% (100.00%)    50/500-18/10               0:02:11
+                BP (tanh, Adam)        98.36%±0.08% (100.00%)    50/500-1/1                 0:00:24
+ Conv           P-EP (hard-sigmoid)    98.98%±0.04% (99.46%)     40/20-200/10               8:58:-
+                Ours (hard-sigmoid)    99.14%±0.02% (99.78%)     40/128-20/10               0:12:28
+                BP (hard-sigmoid)      98.93%±0.18% (99.43%)     40/128-1/1                 0:01:01
+
+Table 2: Comparison with BP and FA and ablation study of residual connection. For the layered
+architecture, there are 64 nodes per hidden layer and we chose T = 10 × NHiddenLayer and
+K = 5 × NHiddenLayer, which guarantees saturation of accuracy at βi = 0.1. For convolutional
+architectures, βi = 0.01. By default, the Adam optimizer is used. Each experiment is repeated five
+times. See Appendix D for more training details.
+
+ Architecture-connections   Training approach   MNIST-Testing (Training)   CIFAR-10-Testing (Training)
+ 5-symm                     BP                  97.69%±0.10% (100.00%)     49.23%±0.81% (56.72%)
+                            Ours                97.64%±0.10% (99.98%)      50.72%±0.17% (57.02%)
+ 5-asymm                    FA                  96.44%±0.10% (98.96%)      37.97%±2.18% (38.92%)
+                            Ours                96.37%±0.11% (97.99%)      45.27%±0.73% (46.79%)
+ 10-symm                    BP                  97.61%±0.04% (99.93%)      48.23%±1.26% (55.37%)
+                            Ours                92.49%±0.32% (95.27%)      34.90%±0.38% (34.64%)
+                            Ours-Residual       97.49%±0.05% (99.77%)      44.46%±0.51% (48.67%)
+ 10-asymm                   FA                  94.52%±0.26% (95.54%)      30.16%±6.12% (30.20%)
+                            Ours                87.37%±0.49% (87.95%)      30.37%±1.09% (29.97%)
+                            Ours-AGT            96.87%±0.11% (99.45%)      30.94%±4.90% (31.36%)
+ 20-symm                    BP                  97.48%±0.07% (99.74%)      47.35%±1.49% (54.59%)
+                            Ours-Residual       95.95%±0.18% (98.20%)      43.61%±1.17% (44.26%)
+ Conv                       BP                  99.34%±0.04% (99.97%)      75.45%±0.46% (83.61%)
+                            Ours                99.27%±0.07% (99.78%)      75.04%±0.51% (80.79%)
+
+Table 1 compares our approach with P-EP and BP. Our model surpasses P-EP in training speed by
+at least one order of magnitude for both the convolutional and the layered architectures.
+Importantly, our accuracy is comparable to BP and FA for the shallow architectures (5-hidden-layer
+and conv models, see also Table 2). Considering the improved stability (Figure 6) achieved via
+feedback regulation, we anticipate that physical implementations of RNNs can achieve performance on par
+with BP. Additionally, for the layered architecture, we also adopt the same training parameters
+(learning rate, batch size and epochs) as P-EP, differing only in feedback scaling (‘Ours-DLR’ in
+Table 1). The results present clear evidence of speedup, which mainly stems from the reduced number
+of iterations required for convergence.
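+As a rough check of this claim, the Table 1 entries for the 3HL network can be compared directly.
+The arithmetic below is a sketch under stated assumptions, not a result reported by the authors: it
+assumes that T and K count the relaxation steps of the two EP phases, that the cost per sample
+scales with T + K, and that the missing seconds in the P-EP wall-clock entry (8:27:-) are zero.
+The P-EP timing comes from previous work, so hardware differences may also contribute to the ratio.
+
+```latex
+\[
+\frac{T_{\text{P-EP}} + K_{\text{P-EP}}}{T_{\text{Ours-DLR}} + K_{\text{Ours-DLR}}}
+  = \frac{180 + 20}{18 + 10} = \frac{200}{28} \approx 7.1,
+\qquad
+\frac{t_{\text{P-EP}}}{t_{\text{Ours-DLR}}}
+  = \frac{8\,\text{h}\;27\,\text{min}}{1\,\text{h}\;1\,\text{min}\;14\,\text{s}}
+  = \frac{30420\,\text{s}}{3674\,\text{s}} \approx 8.3
+\]
+```
+
+Under these assumptions, the reduction in iteration count accounts for most of the wall-clock
+ratio at matched epochs and batch size (100/20).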
+
+
+4.3   DOWN-SCALED FEEDBACK COORDINATES PLASTICITY OF DIFFERENT LAYERS
+
+It is hypothesized that the brain requires different plasticity in different areas due to their
+varying functional roles (Atallah et al., 2004; Lowet et al., 2020). The variability in plasticity
+can be realized explicitly by adjusting learning rates or implicitly by modulating the intensity of
+the gradient. Previous work postulated that EP with weak feedback necessitates learning rates
+differing by orders of magnitude across layers (Scellier & Bengio, 2017). Here, we found that, owing
+to the gradient differences across layers induced by weak feedback, a 3-hidden-layer RNN at
+βi = 0.01 (Table 1, ‘Ours (tanh)’) learns well with a uniform learning rate. This result suggests
+that feedback scaling alone is able to regulate the gradient strength of different layers,
+pointing to another possible mechanism to coordinate plasticity.
+
+
+4.4   RESIDUAL CONNECTIONS OVERCOME THE GRADIENT VANISHING IN DEEP RNNS
+
+Weak feedback exacerbates vanishing gradients in deeper layered RNNs (Figures S5–S6 in
+Appendix B). Adding residual connections restores gradient flow (Figure S7 in Appendix B). As a
+result, a 10-hidden-layer network sees substantial performance gains (Table 2): test accuracy rises
+by about 5 percentage points on MNIST (92.49% to 97.49%) and by about 9.6 percentage points on
+CIFAR-10 (34.90% to 44.46%). Even a 20-hidden-layer model can be trained. As shown in Table 2,
+without residual connections, an asymmetric RNN trained by EP falls short of FA in accuracy, but
+with arbitrary residual links it surpasses FA on MNIST classification (96.87% vs. 94.52%; see the
+ablation study on connection probability in Appendix B). For the more complex CIFAR-10 dataset,
+the 10-hidden-layer asymmetric model with residual random feedback connections achieves an accuracy
+nearly 14 percentage points below that of the symmetric model (30.94% vs. 44.46%). A possible reason
+is that the gradient signal passing through multiple fixed random feedback connections becomes too
+distorted by error to coordinate the forward weight learning.
+
+
+5     DISCUSSION
+
+We have applied feedback scaling to RNNs to speed up convergence and to accelerate EP training
+with negligible overhead. To counteract the vanishing gradient in deep architectures, we have added
+residual connections between non-adjacent layers of deep RNNs, partly restoring classification
+performance. In principle, the residual connections make credit assignment pathways shorter (Veit
+et al., 2016). The training exhibits remarkable resilience to noise in the weights and neural
+states. Our structural modification is compatible with other algorithmic speed-ups (Scellier et
+al., 2023), thereby expanding the design space for efficient EP implementations.
+Recent work on credit assignment in brain-inspired networks, e.g. adjoint propagation (Liu et al.,
+2026), partitions a large network into local RNNs with random internal connections of low SR
+for fast convergence and dynamic resource allocation, yielding speed and accuracy similar to those
+of this work. This work, in contrast, adopts feedback scaling to solve the stability issue and
+accelerate the convergence of EP.
+Weak feedback is often considered in biologically plausible learning algorithms (Sacramento et al.,
+2018; Haider et al., 2021; Meulemans et al., 2021). It has been shown that contrastive Hebbian
+learning with weak feedback approximates backpropagation while converging quickly (Xie & Seung,
+2003). More recently, local representation alignment (LRA) likewise employed weak feedback
+(Ororbia et al., 2023) and skip connections from the output to deep layers for efficient training.
+The EP framework also approximates BP (Scellier & Bengio, 2017; Millidge et al., 2023), but under
+the weak clamping condition (weak supervision) (Laborieux et al., 2021; Millidge et al., 2023). We
+have shown that, at the infinitesimal inference limit, namely weak supervision and weak feedback
+(Millidge et al., 2023), EP is equivalent to LRA and BP (Appendix C). In other words, the dynamics
+of FRE-RNN resemble those of a feedforward neural network owing to its weak feedback.
+However, our approach still has a few limitations with respect to the large-scale neural networks
+that underpin artificial intelligence. For complex datasets like CIFAR-10, there is a notable
+performance gap compared to BP when using deep fully connected neural networks. We attribute this
+gap to the inaccurate approximation of the true gradient as computed by BP (see Appendix C.4).
+Therefore, although EP can be extended to deep fully connected networks (20 hidden layers) and
+shallow CNNs, its applicability to deep CNNs remains to be explored. For deep architectures with
+asymmetric connections, the accuracy decreases faster with increasing depth due to the inaccurate
+random error feedback. More in-depth investigation of residual connection topology is required to
+scale up the methodology to large-scale deep architectures. Besides, the hyperparameters are
+optimized empirically. We find that a feedback scaling in the range of 0.01-0.1 is favorable for
+shallow networks (fewer than 4 layers) and 0.1-0.25 for deeper architectures. Finding a general
+way to determine these parameters is still ongoing work. Additionally, existing research on
+naturally converging networks trained by EP continues to focus primarily on static-input settings
+(Laborieux et al., 2021; Ernoult et al., 2019; Laborieux & Zenke, 2024). Extending naturally
+converging RNNs trained by EP to sequence tasks remains a challenge.
+From a neurobiological perspective, residual connections, particularly the randomly generated
+arbitrary graph topologies, yield connectivity patterns resembling those of the cortex. The
+feedback-regulated residual RNNs equip the biologically plausible learning framework, EP, with a
+biologically plausible network architecture. Although our approach currently runs on GPUs, it can
+exploit the natural convergence of physical RNNs and facilitate efficient learning and inference
+on dedicated neuromorphic hardware.
+
+ACKNOWLEDGEMENTS
+This work was supported by the National Key R&D Program of China (Grant No.
+2024YFA1208804). Additional financial support from the University of Science and Technology
+of China and the Chinese Academy of Sciences is also gratefully acknowledged.
+
+CODE AVAILABILITY
+The code used in this work is available at https://github.com/Zero0Hero/FRE-RNN-EP.
+
+REPRODUCIBILITY STATEMENT
+The code necessary to reproduce the main results is provided as Jupyter Notebooks in the Supple-
+mentary Materials. Researchers can directly run them to reproduce the results. Further details on
+data pre-processing and training process are available within the provided code and in Appendix D.
+
+THE USE OF LARGE LANGUAGE MODELS (LLMS)
+In the preparation of this work, the authors used GPT-5 and DeepSeek solely for the purpose of
+polishing and improving the linguistic fluency and readability of the text. This includes tasks such
+as correcting grammar and rephrasing sentences. After using the model, the authors have reviewed
+and edited all content extensively and take full responsibility for all ideas, claims, and the final
+language presented in this paper.
+
+REFERENCES
+David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for boltz-
+  mann machines. Cognitive Science, 9(1):147–169, 1985.
+Hisham E. Atallah, Michael J. Frank, and Randall C. O’Reilly. Hippocampus, cortex, and basal
+  ganglia: Insights from computational models of complementary learning systems. Neurobi-
+  ology of Learning and Memory, 82(3):253–267, 2004. ISSN 1074-7427. doi: https://doi.
+  org/10.1016/j.nlm.2004.06.004. URL https://www.sciencedirect.com/science/
+  article/pii/S1074742704000693.
+Zhang Bai, D. J. Miller, and Wang Yue. Nonlinear system modeling with random matrices: Echo
+  state networks revisited. IEEE Transactions on Neural Networks and Learning Systems, 23(1):
+  175–182, 2012. ISSN 2162-237X 2162-2388. doi: 10.1109/tnnls.2011.2178562.
+Maxence Ernoult, Julie Grollier, Damien Querlioz, Yoshua Bengio, and Benjamin Scellier. Updates
+ of equilibrium prop match gradients of backprop through time in an rnn with static input. In
+ Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.
+ Article 636. Curran Associates Inc., 2019.
+Maxence Ernoult, Julie Grollier, Damien Querlioz, Yoshua Bengio, and Benjamin Scellier. Equilib-
+ rium propagation with continual weight updates, 2020. URL https://openreview.net/
+ forum?id=H1xJhJStPS.
+D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in the primate cerebral
+  cortex. Cerebral Cortex, 1(1):1–47, 1991. ISSN 1047-3211. doi: 10.1093/cercor/1.1.1.
+Mehmet Fişek, Dustin Herrmann, Alexander Egea-Weiss, Matilda Cloves, Lisa Bauer, Tai-Ying
+ Lee, Lloyd E. Russell, and Michael Häusser. Cortico-cortical feedback engages active den-
+ drites in visual cortex. Nature, 617(7962):769–776, 2023. ISSN 1476-4687. doi: 10.1038/
+ s41586-023-06007-6. URL https://doi.org/10.1038/s41586-023-06007-6.
+Paul Haider, Benjamin Ellenberger, Laura Kriener, Jakob Jordan, Walter Senn, and Mihai A. Petro-
+  vici. Latent equilibrium: a unified learning theory for arbitrarily fast computation with arbitrarily
+  slow neurons. In Proceedings of the 35th International Conference on Neural Information Pro-
+  cessing Systems, pp. Article 1365. Curran Associates Inc., 2021.
+K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE
+  Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. ISBN
+  1063-6919. doi: 10.1109/CVPR.2016.90.
+Maya van Holk and Jorge F Mejias. Biologically plausible models of cognitive flexibility: merging
+ recurrent neural networks with full-brain dynamics. Current Opinion in Behavioral Sciences, 56:
+ 101351, 2024. doi: 10.1016/j.cobeha.2024.101351.
+J. J. Hopfield. Neural networks and physical systems with emergent collective computational abili-
+   ties. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. ISSN 0027-8424
+   1091-6490. doi: 10.1073/pnas.79.8.2554.
+Rasmus Høier, Kirill Kalinin, Maxence Ernoult, and Christopher Zach. Dyadic learning in recur-
+  rent and feedforward models. In NeurIPS 2024 Workshop Machine Learning with new Compute
+  Paradigms, 2024. URL https://openreview.net/forum?id=LNfWowAErI.
+Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving
+  energy in wireless communication. Science, 304(5667):78–80, 2004. doi: 10.1126/science.1091277.
+  URL https://doi.org/10.1126/science.1091277.
+Adrien Journé, Hector Garcia Rodriguez, Qinghai Guo, and Timoleon Moraitis. Hebbian deep learn-
+  ing without feedback. In The Eleventh International Conference on Learning Representations,
+  2023. URL https://openreview.net/forum?id=8gd4M-_Rj1.
+Kazutaka Kanno and Atsushi Uchida. Finite-time lyapunov exponents in time-delayed nonlinear
+  dynamical systems. Physical Review E, 89(3):032918, 2014. ISSN 1539-3755 1550-2376. doi:
+  10.1103/PhysRevE.89.032918.
+Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International
+  Conference on Learning Representations. ICLR, 2015. doi: 10.48550/arXiv.1412.6980.
+Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convo-
+  lutional neural networks. In Advances in Neural Information Processing Systems, 2012.
+Dhireesha Kudithipudi, Catherine Schuman, Craig M. Vineyard, Tej Pandit, Cory Merkel, Rajkumar
+  Kubendran, James B. Aimone, Garrick Orchard, Christian Mayr, Ryad Benosman, Joe Hays, Cliff
+  Young, Chiara Bartolozzi, Amitava Majumdar, Suma George Cardwell, Melika Payvand, Sonia
+  Buckley, Shruti Kulkarni, Hector A. Gonzalez, Gert Cauwenberghs, Chetan Singh Thakur, Anand
+  Subramoney, and Steve Furber. Neuromorphic computing at scale. Nature, 637(8047):801–812,
+  2025. ISSN 0028-0836 1476-4687. doi: 10.1038/s41586-024-08253-8.
+Suman Kulkarni and Dani S. Bassett. Toward principles of brain network organization and function.
+  Annual Review of Biophysics, 54:353–378, 2025. ISSN 1936-1238.
+  doi: https://doi.org/10.1146/annurev-biophys-030722-110624.
+  URL https://www.annualreviews.org/content/journals/10.1146/annurev-biophys-030722-110624.
+Axel Laborieux and Friedemann Zenke. Improving equilibrium propagation without weight sym-
+  metry through jacobian homeostasis. In The Twelfth International Conference on Learning Rep-
+  resentations. ICLR, 2024. URL https://openreview.net/forum?id=kUveo5k1GF.
+Axel Laborieux, Maxence Ernoult, Benjamin Scellier, Yoshua Bengio, Julie Grollier, and Damien
+  Querlioz. Scaling equilibrium propagation to deep convnets by drastically reducing its gradient
+  estimator bias. Frontiers in Neuroscience, 15:633674, 2021. ISSN 1662-453X. doi: 10.3389/
+  fnins.2021.633674.
+Yann Lecun. A theoretical framework for back-propagation. In Proceedings of the 1988 Connec-
+  tionist Models Summer School, 1988.
+Robert Legenstein and Wolfgang Maass. Edge of chaos and prediction of computational perfor-
+  mance for neural circuit models. Neural Networks, 20(3):323–334, 2007. ISSN 08936080. doi:
+  10.1016/j.neunet.2007.04.017.
+Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic
+  feedback weights support error backpropagation for deep learning. Nature Communications, 7
+  (1):13276, 2016. ISSN 2041-1723. doi: 10.1038/ncomms13276.
+Zhuo Liu, Hao Shu, Linmiao Wang, Xu Meng, Yousheng Wang, Xuancheng Li, Wei Wang, and
+  Tao Chen. Adjoint propagation of error signal through modular recurrent neural networks for
+  biologically plausible learning. eLife, 15:e108237, 2026. ISSN 2050-084X. doi: 10.7554/eLife.
+  108237. URL https://doi.org/10.7554/eLife.108237.
+Adam S. Lowet, Qiao Zheng, Sara Matias, Jan Drugowitsch, and Naoshige Uchida. Distribu-
+  tional reinforcement learning in the brain. Trends in Neurosciences, 43(12):980–997, 2020.
+  ISSN 0166-2236. doi: https://doi.org/10.1016/j.tins.2020.09.004. URL https://www.
+  sciencedirect.com/science/article/pii/S0166223620301983.
+Christopher W. Lynn and Danielle S. Bassett. The physics of brain network structure, function
+  and control. Nature Reviews Physics, 1(5):318–332, 2019. ISSN 2522-5820. doi: 10.1038/
+  s42254-019-0040-8.
+Nikola T. Markov, Mária Ercsey-Ravasz, David C. Van Essen, Kenneth Knoblauch, Zoltán
+  Toroczkai, and Henry Kennedy. Cortical high-density counterstream architectures. Science, 342
+  (6158), 2013. ISSN 0036-8075 1095-9203. doi: 10.1126/science.1238406.
+Jorge F. Mejias, John D. Murray, Henry Kennedy, and Xiao-Jing Wang. Feedforward and feedback
+  frequency-dependent interactions in a large-scale laminar network of the primate cortex. Science
+  Advances, 2(11):e1601335, 2016. doi: 10.1126/sciadv.1601335.
+  URL https://www.science.org/doi/abs/10.1126/sciadv.1601335.
+Jan Melchior and Laurenz Wiskott. Hebbian-descent, 2019. URL https://arxiv.org/abs/
+  1905.10585.
+Alexander Meulemans, Matilde Tristany Farinha, Javier Garcia Ordonez, Pau Vilimelis Aceituno,
+  João Sacramento, and Benjamin F. Grewe. Credit assignment in neural networks through deep
+  feedback control. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman
+  Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 4674–4687.
+  Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_
+  files/paper/2021/file/25048eb6a33209cb5a815bff0cf6887c-Paper.pdf.
+Georgios Michalareas, Julien Vezoli, Stan van Pelt, Jan-Mathijs Schoffelen, Henry Kennedy, and
+  Pascal Fries. Alpha-beta and gamma rhythms subserve feedback and feedforward influences
+  among human visual cortical areas. Neuron, 89(2):384–397, 2016. ISSN 0896-6273. doi:
+  https://doi.org/10.1016/j.neuron.2015.12.018. URL https://www.sciencedirect.com/
+  science/article/pii/S0896627315011204.
+John Miller and Moritz Hardt. Stable recurrent models. In International Conference on Learning
+  Representations, 2019. URL https://openreview.net/forum?id=Hygxb2CqKm.
+Beren Millidge, Yuhang Song, Tommaso Salvatori, Thomas Lukasiewicz, and Rafal Bogacz. Back-
+  propagation at the infinitesimal inference limit of energy-based models: Unifying predictive cod-
+  ing, equilibrium propagation, and contrastive hebbian learning. In The Eleventh International
+  Conference on Learning Representations, 2023. URL https://openreview.net/forum?
+  id=nIMifqu2EO.
+Javier R. Movellan. Contrastive Hebbian Learning in the Continuous Hopfield Model, pp.
+  10–17. Morgan Kaufmann, 1991. ISBN 978-1-4832-1448-1. doi: https://doi.org/10.1016/
+  B978-1-4832-1448-1.50007-X. URL https://www.sciencedirect.com/science/
+  article/pii/B978148321448150007X.
+Mitsumasa Nakajima, Yongbo Zhang, Katsuma Inoue, Yasuo Kuniyoshi, Toshikazu Hashimoto,
+ and Kohei Nakajima. Reservoir direct feedback alignment: deep learning by physical dynamics.
+ Communications Physics, 7(1):411, 2024. ISSN 2399-3650. doi: 10.1038/s42005-024-01895-0.
+Arild Nøkland.     Direct feedback alignment provides learning in deep neural net-
+  works. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Ad-
+  vances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.,
+  2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/
+  file/d490d7b4576290fa60eb31b5fc917ad1-Paper.pdf.
+Peter O’Connor, Efstratios Gavves, and Max Welling. Initialized equilibrium propagation for
+  backprop-free training. In International Conference on Learning Representations. ICLR, 2019.
+  URL https://openreview.net/forum?id=B1GMDsR5tm.
+Alexander Ororbia and Ankur Mali. Biologically motivated algorithms for propagating local target
+  representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.
+  4651–4658. AAAI, 2019.
+Alexander Ororbia, Patrick Haffner, David Reitter, and C. Lee Giles. Learning to adapt by minimiz-
+  ing discrepancy, 2017.
+Alexander G. Ororbia. Brain-inspired machine intelligence: A survey of neurobiologically-plausible
+  credit assignment, 2023. URL https://arxiv.org/abs/2312.09257.
+Alexander G. Ororbia, Ankur Mali, Daniel Kifer, and C. Lee Giles. Backpropagation-free deep
+  learning with recursive local representation alignment. In Proceedings of the AAAI Conference on
+  Artificial Intelligence, volume 37, pp. 9327–9335. AAAI, 2023. URL https://ojs.aaai.
+  org/index.php/AAAI/article/view/26118.
+Matthew G. Perich and Kanaka Rajan. Rethinking brain-wide interactions through multi-region
+ ‘network of networks’ models. Current Opinion in Neurobiology, 65:146–151, 2020. ISSN
+ 09594388. doi: 10.1016/j.conb.2020.11.003.
+D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating
+  errors. Nature, 323(6088):533–536, 1986. ISSN 0028-0836. doi: 10.1038/323533a0.
+João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn.     Dendritic corti-
+   cal microcircuits approximate the backpropagation algorithm. In S. Bengio, H. Wal-
+   lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Ad-
+   vances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.,
+   2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/
+   file/1dc3a89d0d440ba31729b0ba74b93a33-Paper.pdf.
+Tommaso Salvatori, Luca Pinchetti, Beren Millidge, Yuhang Song, Tianyi Bao, Rafal Bogacz, and
+  Thomas Lukasiewicz. Learning on arbitrary graph topologies via predictive coding. In Alice H.
+  Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Informa-
+  tion Processing Systems, 2022. URL https://openreview.net/forum?id=dqO59nI_
+  R9A.
+Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-
+  based models and backpropagation. Frontiers in Computational Neuroscience, 11:24, 2017. ISSN
+  1662-5188. doi: 10.3389/fncom.2017.00024.
+Benjamin Scellier, Anirudh Goyal, Jonathan Binas, Thomas Mesnard, and Yoshua Bengio. Gener-
+  alization of equilibrium propagation to vector field dynamics, 2018. URL https://arxiv.
+  org/abs/1808.04873.
+Benjamin Scellier, Maxence Ernoult, Jack Kendall, and Suhas Kumar. Energy-based learn-
+  ing algorithms for analog computing: a comparative study.    In A. Oh, T. Naumann,
+  A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural In-
+  formation Processing Systems, volume 36, pp. 52705–52731. Curran Associates, Inc.,
+  2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/
+  file/a52b0d191b619477cc798d544f4f0e4b-Paper-Conference.pdf.
+João D. Semedo, Anna I. Jasper, Amin Zandvakili, Aravind Krishna, Amir Aschner, Christian K.
+   Machens, Adam Kohn, and Byron M. Yu. Feedforward and feedback interactions between visual
+   cortical areas use different population activity patterns. Nature Communications, 13(1):1099,
+   2022. ISSN 2041-1723. doi: 10.1038/s41467-022-28552-w. URL https://doi.org/10.
+   1038/s41467-022-28552-w.
+Shawn C. Shadden, Francois Lekien, and Jerrold E. Marsden. Definition and properties of la-
+  grangian coherent structures from finite-time lyapunov exponents in two-dimensional aperi-
+  odic flows. Physica D: Nonlinear Phenomena, 212(3-4):271–304, 2005. ISSN 01672789. doi:
+  10.1016/j.physd.2005.10.007.
+Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
+  recognition. In International Conference on Learning Representations, 2015.
+Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
+  Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von
+  Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Ad-
+  vances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.,
+  2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/
+  file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
+Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensem-
+  bles of relatively shallow networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and
+  R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran
+  Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/
+  paper/2016/file/37bc2f75bf1bcfe8450a1a41c200364c-Paper.pdf.
+Ran Wang, Xupeng Chen, Amirhossein Khalilian-Gourtani, Leyao Yu, Patricia Dugan, Daniel Fried-
+  man, Werner Doyle, Orrin Devinsky, Yao Wang, and Adeen Flinker. Distributed feedforward
+  and feedback cortical processing supports human speech production. Proceedings of the Na-
+  tional Academy of Sciences, 120(42):e2300255120, 2023. doi: 10.1073/pnas.2300255120. URL
+  https://doi.org/10.1073/pnas.2300255120.
+D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393
+  (6684):440–442, 1998. ISSN 0028-0836. doi: 10.1038/30918.
+Alan Wolf, Jack B. Swift, Harry L. Swinney, and John A. Vastano. Determining lyapunov
+  exponents from a time series. Physica D: Nonlinear Phenomena, 16(3):285–317, 1985.
+  ISSN 0167-2789. doi: https://doi.org/10.1016/0167-2789(85)90011-9. URL https://www.
+  sciencedirect.com/science/article/pii/0167278985900119.
+X. Xie and H. S. Seung. Equivalence of backpropagation and contrastive hebbian learning in a
+  layered network. Neural Computation, 15(2):441–454, 2003. ISSN 0899-7667. doi: 10.1162/
+  089976603762552988.
+Izzet B. Yildiz, Herbert Jaeger, and Stefan J. Kiebel. Re-visiting the echo state property. Neural
+   Networks, 35:1–9, 2012. ISSN 08936080. doi: 10.1016/j.neunet.2012.07.005.
+
+A    THE DYNAMICS OF THE RNN
+We quantify the convergence of the recurrent neural network (RNN) with the maximum Lyapunov
+exponent (MLE) (Wolf et al., 1985) and the finite-time maximum Lyapunov exponent (FTMLE)
+(Kanno & Uchida, 2014). When the MLE/FTMLE is large, the RNN converges slowly or not at all.
+To compute the MLE and FTMLE, we first initialize a random perturbation vector δ0. We then
+record the sequence of states s0[t], t = 0, 1, 2, ..., Te − 1, corresponding to the last sample of a
+training set (see Figure 2 in the main text), and run the following steps:
+
+      1. Normalize the perturbation vector to unit length:
+             $\delta_t \leftarrow \delta_t / \lVert \delta_t \rVert$
+      2. Calculate the Jacobian matrix:
+             $J(s_0[t]) = \partial F(s_0[t], b) / \partial s_0[t]$
+      3. Update the perturbation:
+             $\delta_{t+1} = J(s_0[t]) \cdot \delta_t$
+      4. Record the logarithmic growth:
+             $r_t = \ln \lVert \delta_{t+1} \rVert$
+The maximum Lyapunov exponent is computed as $\lambda_{\max} = \frac{1}{T_e} \sum_{t=0}^{T_e - 1} r_t$
+for a sufficiently large Te (default Te = 500). The result at any T < Te is the FTMLE.
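+
+The four steps above can be sketched in a few lines of plain Python. This is only an
+illustration under stated assumptions: the map F(s) = tanh(W s + b), the toy weights, and the
+helper names (matvec, tanh_layer_jacobian, lyapunov_estimates) are hypothetical and are not
+taken from the paper's code.
+
```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def tanh_layer_jacobian(w, b, s):
    """Jacobian of the assumed toy map F(s) = tanh(W s + b) with respect to s."""
    h = [hi + bi for hi, bi in zip(matvec(w, s), b)]
    return [[(1.0 - math.tanh(hi) ** 2) * wij for wij in row]
            for hi, row in zip(h, w)]

def lyapunov_estimates(jacobians, delta0):
    """Running averages (1/T) * sum_{t<T} r_t for T = 1..Te.

    The last entry is the MLE estimate; earlier entries are FTMLE values.
    """
    delta = list(delta0)
    total = 0.0
    estimates = []
    for t, jac in enumerate(jacobians):
        n = norm(delta)
        delta = [x / n for x in delta]      # step 1: normalize to unit length
        delta = matvec(jac, delta)          # steps 2-3: apply the Jacobian J(s0[t])
        total += math.log(norm(delta))      # step 4: r_t = ln ||delta_{t+1}||
        estimates.append(total / (t + 1))
    return estimates

# Toy trajectory s[t+1] = tanh(W s[t] + b) with Te = 500 recorded states.
w = [[0.5, -0.3], [0.2, 0.4]]
b = [0.1, -0.1]
s = [0.3, -0.2]
jacobians = []
for _ in range(500):
    jacobians.append(tanh_layer_jacobian(w, b, s))
    s = [math.tanh(h + bi) for h, bi in zip(matvec(w, s), b)]

ftmle = lyapunov_estimates(jacobians, [1.0, 0.0])
mle = ftmle[-1]   # negative here, because the toy map is contracting
```
+
+A negative value means that perturbations shrink along the trajectory, i.e., the toy network
+converges; a value near or above zero would indicate slow or no convergence.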
+Figures S1 and S2 show the FTMLE, MLE, training accuracy, and test accuracy versus epochs for
+different models. In all cases, smaller βi usually yields a smaller (FT)MLE, whereas larger βi does
+not always lead to a larger (FT)MLE because the activation function saturates, and saturation
+diminishes the perturbation.
+For a 2-hidden-layer RNN, smaller feedback scaling βi yields steady training progress and better
+accuracy. Figure S3 plots the FTMLE and test accuracy against feedback scaling for different numbers
+of hidden layers. It shows that smaller βi is favorable for shallow networks, because the RNN
+converges more easily (as indicated by the FTMLE). For deeper networks (5 hidden layers or more),
+however, smaller βi degrades performance because of vanishing gradients.
+A further comparison between our FRE-RNN with a convolutional structure and previous work
+(Ernoult et al., 2019) is plotted in Figure S4. These results suggest that small feedback
+scaling (βi = 0.01) leads to a smoother training process.
+
+B    GRADIENT VANISHING AND THE RESIDUAL CONNECTIONS
+Figures S5 and S6 plot the error of each neuron versus epoch at different βi. For a 2-hidden-layer
+RNN, the best performance is obtained at βi = 0.001. In this case, the error of the first hidden
+layer is at least two orders of magnitude smaller than that of the second hidden layer. At βi = 2, the
+error also decreases from higher layers (high-index neurons, closer to the output layer) to lower
+layers, which is attributed to the saturation of the activation function. In general, training
+progresses more steadily for smaller βi despite the vanishing gradient, which also applies to deeper
+networks (up to 10 hidden layers).
+To eliminate the vanishing gradient in EP, direct feedback from the higher layers or local
+amplification (with a higher learning rate) is unavoidable (Nøkland, 2016; Ororbia et al., 2023).
+Figure S7 shows the effect of residual connections: βi = 0.1 yields the best accuracy (97.5%), owing
+to the balance between gradient flow and convergence.
+Figure S8 shows how the testing accuracy varies with the connection probability P of AGT with 10
+hidden layers. Apart from the connections of the layered model, a connection between any two
+hidden layers is generated with probability P, i.e., P decides whether the connection between a given
+pair of layers is established. As P increases, the accuracy first rises, peaks at P = 0.2, and
+decreases as P approaches 1. The reason for this behavior remains to be explored.
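+
+One way such a topology could be sampled is sketched below. The details are assumptions made
+for illustration: hidden layers are indexed 0 to N − 1, adjacent layers are always connected
+(the layered model), each non-adjacent pair is treated as unordered and connected with
+probability P, and the function name sample_layer_connections is hypothetical.
+
```python
import random

def sample_layer_connections(n_hidden, p, seed=0):
    """Return sorted (i, j) pairs, i < j, of connected hidden layers."""
    rng = random.Random(seed)
    # Connections of the layered model: every layer feeds the next one.
    edges = {(i, i + 1) for i in range(n_hidden - 1)}
    # Extra connections between non-adjacent layers, each kept with probability p.
    for i in range(n_hidden):
        for j in range(i + 2, n_hidden):
            if rng.random() < p:
                edges.add((i, j))
    return sorted(edges)

edges_p0 = sample_layer_connections(10, 0.0)    # purely layered model
edges_p02 = sample_layer_connections(10, 0.2)   # sparse extra connections
edges_p1 = sample_layer_connections(10, 1.0)    # every pair of layers connected
```
+
+With P = 0 only the layered connections remain, and with P = 1 every pair of hidden layers is
+connected, which brackets the range of P explored in Figure S8.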
+
+Figure S1: The FTMLE, MLE, training accuracy and testing accuracy of symmetric RNNs versus
+epochs with different feedback scaling βi (legend). First row: 2 hidden layers; Second row: 3 hidden
+layers; Third row: 5 hidden layers; Fourth row: 10 hidden layers. The activation is tanh. Each case
+is repeated 5 times.
+
+
+Figure S2: The FTMLE, MLE, training accuracy and testing accuracy of asymmetric RNNs versus
+epochs with different feedback scaling βi (legend). First row: 2 hidden layers; Second row: 3 hidden
+layers; Third row: 5 hidden layers; Fourth row: 10 hidden layers. The activation is tanh. Each case
+is repeated 5 times.
+
+
+
+
+Figure S3: The FTMLE and testing accuracy versus feedback scaling βi with different numbers of
+hidden layers. (a) Symmetric weights; (b) Asymmetric weights. The FTMLE and testing accuracy
+given here are their maxima over all epochs. Note that the 5-hidden-layer asymmetric RNN
+with large βi diverged, which results in missing data points in (b). Each case is repeated 5 times.
+
+Figure S4: Comparison of RNNs with an embedded convolutional structure on MNIST between
+P-EP (a) (Ernoult et al., 2019) and our approach at different βi (b-d). We used the same parameters
+as the EP reference (Ernoult et al., 2019).
+
+Figure S5: For 2-hidden-layer RNN, the mean error of each neuron in the last batch and testing
+accuracy versus epochs at different βi . All neurons in the hidden layers and the output layer are
+indexed from the input to the output layer.
+
+Figure S6: For the 10-hidden-layer model, the mean error of each neuron in the last batch and
+testing accuracy versus epochs at different βi . All neurons in the hidden layers and the output layer
+are indexed from the input to the output layer.
+
+Figure S7: For the 10-hidden-layer model with residual connections, the mean error of each neuron
+in the last batch and testing accuracy versus epochs at different βi . All neurons in the hidden layers
+and the output layer are indexed from the input to the output layer.
+
+
+
+
+Figure S8: The testing accuracy on MNIST varies with the connection probability P of AGT with
+10 hidden layers. The experiments are repeated 5 times.
+
+Table S1: Testing accuracy (mean of 5 repeated experiments) with different feedback scaling βi . By
+default, T = 10 × NHiddenLayer , K = 5 × NHiddenLayer . Each hidden layer has 64 nodes.
+ Architecture-connections   βi = 0.001   βi = 0.01     βi = 0.1   βi = 0.25   βi = 1   βi = 2   βi = 4
+        2HL-symm              97.69%      97.57%       97.25%      96.22%     93.12%   66.04%   40.92%
+        3HL-symm              97.22%      97.64%       97.41%      96.60%     55.86%   32.64%   22.11%
+        5HL-symm              93.54%      95.54%       97.60%      90.63%     25.31%   17.88%   14.61%
+       10HL-symm              87.15%      89.99%       92.54%      41.84%     14.07%   14.30%   14.23%
+  10HL-Residual-symm             –        97.52%       97.46%         –       95.51%      –        –
+        conv-symm                –        99.15%       98.71%         –       11.35%      –        –
+       2HL-asymm              96.96%      96.97%       96.88%      96.79%     93.88%   91.81%   89.91%
+       3HL-asymm              95.17%      96.91%       96.76%      96.66%     91.21%   54.65%   26.72%
+       5HL-asymm              91.14%      92.34%       96.41%      96.35%     17.15%   11.35%   13.07%
+      10HL-asymm              84.27%      85.83%       87.79%      90.97%     16.13%   14.21%   16.67%
+    10HL-AGT-asymm               –        96.37%       96.75%         –       33.31%      –        –
+
+
+
+C     EQUIVALENCE WITH EP AND BP UNDER THE CONDITION OF
+      INFINITESIMAL INFERENCE LIMIT
+
+
+
+
+Figure S9: A layered network model used to illustrate the process of backpropagation (BP), local
+representation alignment (LRA), and EP. Note that the final prediction layer ·p corresponds to the
+third layer with subindex ·3 . For LRA, we use βLRA instead of β1 and βf . For BP, the feedback
+(orange) paths are absent.
+
+In this section, we will use the infinitesimal inference limit (Millidge et al., 2023) to derive the
+equivalence of EP with LRA and BP.
+
+C.1   BACKPROPAGATION
+
+When we remove the feedback connection of a 2-hidden-layer RNN shown in Figure S9, a feedfor-
+ward network is left and can be trained with BP. The forward process of BP is described by:
+
+
+                                      s1 = ρ(h1 ), h1 = W0 · s0 ,
+                                      s2 = ρ(h2 ), h2 = W1 · s1 ,                                  (S1)
+                                       sp = hp , hp = Wf · s2 .
+
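+Defining a loss $L_{BP} = \frac{1}{2}(s_p - s_{tar})^2$, the weights adjust according to the gradient
+of the loss. Taking ∆W0 as an example:
+    $\Delta W_0 = -\frac{\partial L_{BP}}{\partial W_0}$
+    $= -\left[\rho'(h_1) \odot \left(W_1^\top \cdot \left(\rho'(h_2) \odot \left(W_f^\top \cdot (s_p - s_{tar})\right)\right)\right)\right] \cdot (s_0)^\top$,     (S2)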
+
+where “⊙” means Hadamard product (element-wise product), “·” means scalar or matrix multipli-
+cation. For two vectors/matrices, “⊙” requires identical dimensions and computes element-wise
+products. Broadcasting rules may apply (e.g., a column vector vm×1 ⊙ Am×n scales each column
+of A by v).
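+
+Equation S2 can be checked numerically on a tiny network. The sketch below assumes ρ = tanh and
+uses made-up weights and inputs; the helper names (forward, loss, delta_w0, numerical_delta_w0)
+are hypothetical. It compares the analytic ∆W0 with a central finite-difference estimate of
+−∂LBP/∂W0.
+
```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def forward(w0, w1, wf, s0):
    """Forward pass of Equation S1 with rho = tanh (an assumed activation)."""
    h1 = matvec(w0, s0)
    h2 = matvec(w1, [math.tanh(x) for x in h1])
    sp = matvec(wf, [math.tanh(x) for x in h2])
    return h1, h2, sp

def loss(w0, w1, wf, s0, s_tar):
    sp = forward(w0, w1, wf, s0)[2]
    return 0.5 * sum((p - t) ** 2 for p, t in zip(sp, s_tar))

def delta_w0(w0, w1, wf, s0, s_tar):
    """Analytic update of Equation S2: -[rho'(h1) * W1^T (rho'(h2) * Wf^T err)] s0^T."""
    h1, h2, sp = forward(w0, w1, wf, s0)
    err = [p - t for p, t in zip(sp, s_tar)]                      # s_p - s_tar
    d2 = [(1 - math.tanh(h) ** 2) * g
          for h, g in zip(h2, matvec(transpose(wf), err))]        # Hadamard product
    d1 = [(1 - math.tanh(h) ** 2) * g
          for h, g in zip(h1, matvec(transpose(w1), d2))]
    return [[-d * x for x in s0] for d in d1]                     # outer product with s0

def numerical_delta_w0(w0, w1, wf, s0, s_tar, eps=1e-6):
    """Central finite-difference estimate of -dL/dW0."""
    result = []
    for i, row in enumerate(w0):
        out_row = []
        for j in range(len(row)):
            plus = [r[:] for r in w0]
            minus = [r[:] for r in w0]
            plus[i][j] += eps
            minus[i][j] -= eps
            diff = loss(plus, w1, wf, s0, s_tar) - loss(minus, w1, wf, s0, s_tar)
            out_row.append(-diff / (2 * eps))
        result.append(out_row)
    return result

w0 = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]
w1 = [[0.3, -0.1], [-0.2, 0.4]]
wf = [[0.6, -0.5]]
s0, s_tar = [1.0, 0.5, -0.3], [0.7]

analytic = delta_w0(w0, w1, wf, s0, s_tar)
numeric = numerical_delta_w0(w0, w1, wf, s0, s_tar)
max_abs_diff = max(abs(a - b)
                   for ra, rb in zip(analytic, numeric)
                   for a, b in zip(ra, rb))
```
+
+The element-wise products in delta_w0 play the role of “⊙”, and the final list comprehension
+forms the outer product with (s0)⊤.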
+
+C.2   LOCAL REPRESENTATION ALIGNMENT
+
+LRA is an alternative training method following the principle of discrepancy reduction (Ororbia
+et al., 2017; Ororbia & Mali, 2019). It can be divided into two phases: 1) the network runs the
+forward process, producing latent representations of the input samples. 2) the weights adjust in the
+direction of reducing the mismatch between current latent representations and target representations
+in each layer.
+The forward process is the same as BP:
+    $s_1^0 = \rho(h_1^0), \quad h_1^0 = W_0 \cdot s_0$,
+    $s_2^0 = \rho(h_2^0), \quad h_2^0 = W_1 \cdot s_1^0$,                            (S3)
+    $s_p^0 = h_p^0, \quad h_p^0 = W_f \cdot s_2^0$,
+where $s_i^0$ are interpreted as the latent representations. The prediction error is
+$e_p = s_{tar} - s_p^0$. Then we can get the target representation of the second hidden layer:
+    $s_2^{\beta_{LRA}} = \rho(h_2^{\beta_{LRA}}), \quad h_2^{\beta_{LRA}} = W_1 \cdot s_1^0 + \beta_{LRA} \cdot B_f \cdot e_p$.                  (S4)
+The same goes for the first hidden layer:
+    $s_1^{\beta_{LRA}} = \rho(h_1^{\beta_{LRA}}), \quad h_1^{\beta_{LRA}} = W_0 \cdot s_0 + \beta_{LRA} \cdot B_1 \cdot e_2, \quad e_2 = s_2^{\beta_{LRA}} - s_2^0$.   (S5)
+LRA defines the loss as the total discrepancy between latent representations and target
+representations:
+    $L_{LRA} = \sum_{i=1}^{L} k_i L_i(s_i^0, s_i^{\beta_{LRA}}) = \sum_{i=1}^{L} \frac{1}{2}(s_i^0 - s_i^{\beta_{LRA}})^2$.         (S6)
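+The weight $W_i$ adjusts according to the local mismatch between $s_{i+1}^0$ and $s_{i+1}^{\beta_{LRA}}$:
+    $\Delta W_i = -\frac{\partial k_i L_i(s_{i+1}^0, s_{i+1}^{\beta_{LRA}})}{\partial W_i}$
+    $= \left[(s_{i+1}^{\beta_{LRA}} - s_{i+1}^0) \odot \rho'(h_{i+1}^0)\right] \cdot (s_i^0)^\top$
+    $\approx (s_{i+1}^{\beta_{LRA}} - s_{i+1}^0) \cdot (s_i^0)^\top$,                             (S7)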
+where the derivative of the activation function is omitted in the last row, a useful practice common
+in LRA (Melchior & Wiskott, 2019; Ororbia & Mali, 2019; Ororbia et al., 2023). When
+$\beta_{LRA} \to 0$, $s_i^{\beta_{LRA}} \to s_i^0$ and $h_i^{\beta_{LRA}} \to h_i^0$, so
+
+    $e_i = s_i^{\beta_{LRA}} - s_i^0 = \rho(h_i^{\beta_{LRA}}) - \rho(h_i^0)$
+    $= \rho(h_i^0 + \beta_{LRA} \cdot B_i \cdot e_{i+1}) - \rho(h_i^0)$
+    $\approx \left[\rho(h_i^0) + \rho'(h_i^0) \odot (\beta_{LRA} \cdot B_i \cdot e_{i+1}) - \rho(h_i^0)\right]_{\beta_{LRA} \to 0}$              (S8)
+    $= \rho'(h_i^0) \odot (\beta_{LRA} \cdot B_i \cdot e_{i+1})$.
+
+The approximation in Equation S8 is based on a first-order Taylor expansion of $\rho(h_i^0 + \Delta h)$
+around $h_i^0$, where $\Delta h = \beta_{LRA} \cdot B_i \cdot e_{i+1}$. For a small perturbation
+$\Delta h \to 0$, the Taylor expansion gives:
+    $\rho(h_i^0 + \Delta h) = \rho(h_i^0) + \rho'(h_i^0) \cdot \Delta h + O(\Delta h^2)$.                          (S9)
+When $\beta_{LRA} \to 0$, the higher-order terms $O(\Delta h^2)$ are negligible, leaving only the linear
+term. We arrive at the last row after canceling out $\rho(h_i^0)$. Then we can express the weight
+adjustments as
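+    $\Delta W_0 = e_1 \cdot (s_0^0)^\top$
+    $= \left[\rho'(h_1^0) \odot \left(\beta_{LRA} \cdot B_1 \cdot \left(\rho'(h_2^0) \odot \left(\beta_{LRA} \cdot B_f \cdot (s_{tar} - s_p)\right)\right)\right)\right] \cdot (s_0)^\top$
+    $\overset{B_i = (W_i)^\top}{=} -\beta_{LRA} \cdot \beta_{LRA} \cdot \left[\rho'(h_1^0) \odot \left(W_1^\top \cdot \left(\rho'(h_2^0) \odot \left(W_f^\top \cdot (s_p - s_{tar})\right)\right)\right)\right] \cdot (s_0)^\top$,              (S10)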
+which is the same as BP (Equation S2) except for the constant factor βLRA · βLRA. Thus, LRA in the
+weak-feedback limit approximates BP. An LRA algorithm for a 2-hidden-layer network is described in
+Algorithm S1. The feedback weights in LRA need not be learned here, but can be kept symmetric with
+the feedforward weights.
+
+C.3    EQUILIBRIUM PROPAGATION
+
+We can also formulate EP in terms of discrepancy reduction. In EP (Algorithm 1 in the main text),
+the network states evolve as follows (β = 0 for the first phase and β = βf for the second phase):
+
+
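+    $h_1^{\beta} = W_0 \cdot s_0^{\beta} + \beta_1 \cdot B_1 \cdot s_2^{\beta}$,
+    $h_2^{\beta} = W_1 \cdot s_1^{\beta} + \beta_f \cdot B_f \cdot e_p$,
+    $h_p^{\beta} = W_f \cdot s_2^{\beta}$,
+    $s_1^{\beta}, s_2^{\beta}, s_p^{\beta} = \rho(h_1^{\beta}), \rho(h_2^{\beta}), h_p^{\beta}$,                   (S11)
+
+where $e_p = s_{tar} - s_p^0$ is the prediction error. The network converges to the final states
+$h_1^0, h_2^0, s_1^0, s_2^0$ in the free phase. The error of the s2 neurons can be described by: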
+
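+    $ds_2 = [\rho(h_2^{\beta_f})]_{\beta_f \to 0} - [\rho(h_2^0)]_{\beta_f = 0}$
+    $\approx \rho'(h_2^0) \odot (\beta_f \cdot B_f \cdot e_p)$,                          (S12)
+
+where only the first-order infinitesimal term is retained as β1 → 0. The same goes for the first
+hidden layer:
+
+    $ds_1 = [\rho(h_1^{\beta_f})]_{\beta_f \to 0} - [\rho(h_1^0)]_{\beta_f = 0}$
+    $\approx \rho'(h_1^0) \odot \left(\beta_1 \cdot B_1 \cdot \left(\rho'(h_2^0) \odot (\beta_f \cdot B_f \cdot e_p)\right)\right)$.             (S13)
+
+The weight W0 can be updated by:
+
+    $\Delta W_0 = \frac{ds_1 \cdot (s_0^0)^\top}{\beta_1 \cdot \beta_f} = \left[\rho'(h_1^0) \odot \left(B_1 \cdot \left(\rho'(h_2^0) \odot (B_f \cdot e_p)\right)\right)\right] \cdot (s_0^0)^\top$.   (S14)
+
+With $B_i = W_i^\top$,
+
+    $ds_1 = \beta_f \cdot \beta_1 \cdot \rho'(h_1^0) \odot \left(W_1^\top \cdot \left(\rho'(h_2^0) \odot \left(W_f^\top \cdot \left(-(s_p - s_{tar})\right)\right)\right)\right)$,                           (S15)
+    $\Delta W_0 = \frac{ds_1 \cdot (s_0^0)^\top}{\beta_1 \cdot \beta_f} = -\left[\rho'(h_1^0) \odot \left(W_1^\top \cdot \left(\rho'(h_2^0) \odot \left(W_f^\top \cdot (s_p - s_{tar})\right)\right)\right)\right] \cdot (s_0^0)^\top$.   (S16)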
+
+Note that, compared with the weight update in the main text, the factor 1/(β1 · βf) is added to
+recover a gradient amplitude similar to BP. Further, if we assume that the high-order infinitesimal
+in the first phase can be omitted, the dynamics of the RNN are governed by:
+
+    $s_1^0 = \rho(h_1^0), \quad h_1^0 = [W_0 \cdot s_0 + \beta_1 \cdot B_1 \cdot s_2^0]_{\beta_1 \to 0} \approx W_0 \cdot s_0$,       (S17)
+    $s_2^0 = \rho(h_2^0), \quad h_2^0 = [W_1 \cdot s_1^0 + \beta_f \cdot B_f \cdot e_p]_{\beta_1 \to 0, \beta_f = 0} \approx W_1 \cdot s_1^0$,       (S18)
+    $s_p^0 = h_p^0, \quad h_p^0 = W_f \cdot s_2^0$.                                (S19)
+
+
+The information flow of the RNN degenerates into that of a feedforward network. This does not affect
+the error information dsi, so Equation S16 approximates Equation S2 for BP. Meanwhile, it resembles
+LRA with low βLRA, which turns explicit error into implicit error. We have thus shown that, although
+the errors are obtained differently in EP, LRA, and BP, they are equivalent under the assumption of
+weak supervision and weak feedback.
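+
+The weak-feedback argument can be illustrated numerically. The following sketch is not the
+paper's Algorithm 1: it iterates the dynamics of Equation S11 to a fixed point in both phases,
+under the assumptions ρ = tanh, Bi = Wi⊤, β1 = βf = 0.01, and made-up weights, and then compares
+ds1 · (s0)⊤ / (β1 · βf) with the BP update of Equation S2. The function relax and all variable
+names are hypothetical.
+
```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def rho(h):
    return [math.tanh(x) for x in h]

def relax(w0, w1, b1, bf, s0, ep, beta1, beta, steps=100):
    """Iterate Equation S11 to a fixed point; beta = 0 gives the free phase."""
    s1, s2 = [0.0] * len(w0), [0.0] * len(w1)
    drive1 = matvec(w0, s0)                        # W0 . s0 (input is clamped)
    nudge2 = [beta * x for x in matvec(bf, ep)]    # beta . Bf . ep
    for _ in range(steps):
        s1 = rho([a + beta1 * b for a, b in zip(drive1, matvec(b1, s2))])
        s2 = rho([a + b for a, b in zip(matvec(w1, s1), nudge2)])
    return s1, s2

w0 = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]
w1 = [[0.3, -0.1], [-0.2, 0.4]]
wf = [[0.6, -0.5]]
b1, bf = transpose(w1), transpose(wf)              # symmetric feedback, Bi = Wi^T
s0, s_tar = [1.0, 0.5, -0.3], [0.7]
beta1, beta_f = 0.01, 0.01

# First (free) phase, then second (nudged) phase with the error of the free phase.
s1_free, s2_free = relax(w0, w1, b1, bf, s0, [0.0], beta1, 0.0)
ep = [t - p for t, p in zip(s_tar, matvec(wf, s2_free))]
s1_nudged, _ = relax(w0, w1, b1, bf, s0, ep, beta1, beta_f)
ds1 = [a - b for a, b in zip(s1_nudged, s1_free)]
ep_dw0 = [[d * x / (beta1 * beta_f) for x in s0] for d in ds1]      # Equation S14

# BP reference (Equation S2) on the feedforward network with the same weights.
h1 = matvec(w0, s0)
h2 = matvec(w1, rho(h1))
err = [t - p for t, p in zip(s_tar, matvec(wf, rho(h2)))]           # -(s_p - s_tar)
d2 = [(1 - math.tanh(h) ** 2) * g for h, g in zip(h2, matvec(bf, err))]
d1 = [(1 - math.tanh(h) ** 2) * g for h, g in zip(h1, matvec(b1, d2))]
bp_dw0 = [[d * x for x in s0] for d in d1]

flat_ep = [x for row in ep_dw0 for x in row]
flat_bp = [x for row in bp_dw0 for x in row]
cosine = (sum(a * b for a, b in zip(flat_ep, flat_bp))
          / math.sqrt(sum(a * a for a in flat_ep))
          / math.sqrt(sum(b * b for b in flat_bp)))
```
+
+For these small β values the cosine similarity is close to 1; increasing beta1 or beta_f in the
+sketch makes the two updates drift apart, in line with the weak-feedback condition.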
+
+Algorithm S1 Local Representation Alignment (LRA)
+Input: (x, s_tar)
+Parameter: θ = [W0, W1, Wf, Bf, B1, β_LRA]
+Output: θ
+function FORWARD(θ, x)
+    s_0 ← x
+    h_1 ← W0 · s_0,      s_1^0 ← ρ(h_1)
+    h_2 ← W1 · s_1^0,    s_2^0 ← ρ(h_2)
+    s_p^0 ← Wf · s_2^0
+    Λ1 ← [s_i^0], i = 0, 1, 2, p
+    return Λ1
+end function
+function FEEDBACK(θ, Λ1, s_tar)
+    e_p ← s_tar − s_p^0
+    h_2 ← W1 · s_1^0 + β_LRA · Bf · e_p,    s_2^{β_LRA} ← ρ(h_2)
+    e_2 ← s_2^{β_LRA} − s_2^0
+    h_1 ← W0 · s_0 + β_LRA · B1 · e_2,      s_1^{β_LRA} ← ρ(h_1)
+    e_1 ← s_1^{β_LRA} − s_1^0
+    Λ2 ← [e_1, e_2, e_p]
+    return Λ2
+end function
+function UPDATING-WEIGHTS(θ, Λ1, Λ2)
+    ∆W_i ← e_{i+1} · (s_i^0)^T, i = 0, 1
+    ∆W_f ← e_p · (s_2^0)^T
+end function
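+
+Algorithm S1 can be turned into a short runnable sketch. The version below is an illustration
+under stated assumptions rather than the authors' implementation: ρ = tanh, the feedback weights
+are set to Bf = Wf⊤ and B1 = W1⊤, the weights and inputs are made up, and the names lra_forward,
+lra_feedback, lra_updates, and bp_delta_w0 are hypothetical. It also checks Equation S10 by
+comparing ∆W0 / βLRA² with the BP update of Equation S2.
+
```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def outer(u, v):
    return [[a * b for b in v] for a in u]

def rho(h):
    return [math.tanh(x) for x in h]

def lra_forward(theta, x):
    """FORWARD: latent representations s_0, s_1^0, s_2^0, s_p^0."""
    s0 = list(x)
    s1 = rho(matvec(theta["W0"], s0))
    s2 = rho(matvec(theta["W1"], s1))
    sp = matvec(theta["Wf"], s2)
    return s0, s1, s2, sp

def lra_feedback(theta, states, s_tar):
    """FEEDBACK: errors e_1, e_2, e_p between targets and latent representations."""
    s0, s1, s2, sp = states
    beta = theta["beta_lra"]
    ep = [t - p for t, p in zip(s_tar, sp)]
    h2 = [a + beta * b
          for a, b in zip(matvec(theta["W1"], s1), matvec(theta["Bf"], ep))]
    e2 = [a - b for a, b in zip(rho(h2), s2)]
    h1 = [a + beta * b
          for a, b in zip(matvec(theta["W0"], s0), matvec(theta["B1"], e2))]
    e1 = [a - b for a, b in zip(rho(h1), s1)]
    return e1, e2, ep

def lra_updates(states, errors):
    """UPDATING-WEIGHTS: local outer-product updates for W0, W1, Wf."""
    s0, s1, s2, _ = states
    e1, e2, ep = errors
    return outer(e1, s0), outer(e2, s1), outer(ep, s2)

def bp_delta_w0(theta, x, s_tar):
    """BP update of Equation S2, using tanh'(h) = 1 - tanh(h)^2."""
    s0, s1, s2, sp = lra_forward(theta, x)
    err = [t - p for t, p in zip(s_tar, sp)]
    d2 = [(1 - a * a) * g for a, g in zip(s2, matvec(transpose(theta["Wf"]), err))]
    d1 = [(1 - a * a) * g for a, g in zip(s1, matvec(transpose(theta["W1"]), d2))]
    return outer(d1, s0)

theta = {
    "W0": [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]],
    "W1": [[0.3, -0.1], [-0.2, 0.4]],
    "Wf": [[0.6, -0.5]],
    "beta_lra": 1e-3,
}
theta["B1"] = transpose(theta["W1"])   # assumed symmetric feedback
theta["Bf"] = transpose(theta["Wf"])
x, s_tar = [1.0, 0.5, -0.3], [0.7]

states = lra_forward(theta, x)
errors = lra_feedback(theta, states, s_tar)
dw0, dw1, dwf = lra_updates(states, errors)

scale = theta["beta_lra"] ** 2
flat_lra = [v / scale for row in dw0 for v in row]
flat_bp = [v for row in bp_delta_w0(theta, x, s_tar) for v in row]
cosine = (sum(a * b for a, b in zip(flat_lra, flat_bp))
          / math.sqrt(sum(a * a for a in flat_lra))
          / math.sqrt(sum(b * b for b in flat_bp)))
```
+
+In this sketch the update for Wf is exactly the BP update, while the updates for W1 and W0 are
+scaled by βLRA and βLRA², respectively, so a practical implementation would need to rescale them
+or use layer-wise learning rates.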
+
+
+C.4   EXPERIMENTS FOR EQUIVALENCE WITH EP AND BP
+
+Prior works have shown that EP is equivalent to BPTT under specific conditions and can achieve
+comparable performance (Ernoult et al., 2019; Laborieux et al., 2021). As discussed in the previous
+section, although the overall architecture forms an RNN, the network behaves similarly to a
+feedforward model because the feedback connections are weak.
+To show the equivalence of EP and BP experimentally, we compare our model with an FNN that has the
+same feedforward weights and is trained by BP. We compare the cosine similarity of the states,
+bias gradients, and weight gradients for the first batch (batch size 200), as shown in Figure S10.
+Figure S10(a-c) shows the similarity at βi = 1 for different numbers of iterations. For the
+bias gradients, i.e., dsi, the cosine similarity declines rapidly, indicating no similarity between our
+model and BP. With weak feedback, βi = 0.1, as shown in Figure S10(d-f), the similarity of the states
+approaches 1, and the similarity of the bias gradients exceeds 0.5 for the last 6 layers with T = 500
+or T = 50, and for the last 4 layers with T = 20. These results provide further evidence that EP is
+equivalent to BP under the condition of weak feedback.
+We further studied the influence of βi on the cosine similarity. Figure S10(g) shows that larger
+βi leads to lower similarity of the states. Figure S10(h) shows that a lower βi = 0.01 also decreases
+the similarity, which may be caused by the insufficient precision of the data storage (float32 by
+default). We therefore repeated the experiments with the float64 datatype. Figure S10(k,l) shows that
+the similarity of the gradient signal remains around 1 with βi = 0.1. This indicates that weak feedback
+does indeed lead to an exponential decline in gradient signals, thus requiring higher numerical
+precision.
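+
+The two ingredients of this comparison, cosine similarity and storage precision, can be
+illustrated with the Python standard library. The numbers below are made up and the helper names
+(cosine_similarity, to_float32) are hypothetical; the sketch only shows why a very small
+difference between two phases can vanish when states are stored as float32.
+
```python
import math
import struct

def cosine_similarity(a, b):
    """Cosine similarity of two flattened signals (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # convention chosen here for an all-zero signal
    return dot / (na * nb)

def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# A tiny perturbation dh of the pre-activation, as produced by very weak feedback.
h, dh = 0.5, 1e-10
ds64 = math.tanh(h + dh) - math.tanh(h)                            # resolved in float64
ds32 = to_float32(math.tanh(h + dh)) - to_float32(math.tanh(h))    # lost in float32

aligned = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```
+
+Once the difference signal rounds to zero, its direction carries no information, so any cosine
+similarity computed from it no longer reflects the BP gradient.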
+
+[Figure panels (a)-(i): cosine similarity (y-axis) for layers fc1-fc8 and out (x-axis). Panels
+(a, d, g): states; (b, e, h): bias gradients; (c, f, i): weight gradients. Legend in (a)-(f):
+T=500,K=200; T=50,K=20; T=20,K=8. Legend in (g)-(i): βi = 0.0, 0.01, 0.1, 0.5, 1.]
+
+                    0.5                                                         0.5                                                         0.5
+                          fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out                         fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out                         fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out
+            (j)                        states                       (k)                      bias gradients                         (l)                 weight gradients
+                    1.0                                                         1.0                                                         1.0
+cosine similarity
+
+
+
+
+                                                            cosine similarity
+
+
+
+
+                                                                                                                        cosine similarity
+
+
+
+
+                    0.5                                                         0.5                                                         0.5                            i=0.0
+                                                                                                                                                                           i=0.01
+                                                                                                                                                                           i=0.1
+                    0.0                                                         0.0                                                         0.0
+                                                                                                                                                                           i=0.5
+                                                                                                                                                                           i=1
+                    0.5                                                         0.5                                                         0.5
+                          fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out                         fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out                         fc1 fc2 fc3 fc4 fc5 fc6 fc7 fc8 out
+
+ Figure S10: Cosine similarity of gradients and states between our model and a feedforward model
+ trained by BP in an 8-hidden-layer FNN (states: s_i^0; bias gradients: ds_i; weight gradients: ΔW_i).
+ The x-axis shows the layers of the model. Error propagates layer by layer from the last layer "out"
+ to the first hidden layer "fc1". (a-c) Different numbers of iterations with feedback scaling βi = 1.
+ (d-f) Different numbers of iterations with small feedback scaling βi = 0.1. (g-i) Different feedback
+ scaling values (T=50, K=20). (j-l) Same as (g-i) with data type float64 (float32 by default).
+
+
+
+
+
+
+
+
+C.5 VERIFYING THE EFFECTIVENESS OF WEAK FEEDBACK IN EQUILIBRIUM PROPAGATION
+
+
+[Figure S11 plot. x-axis: only last nll layers learning (1, 3, 5); y-axis: test accuracy (0.80 to 1.00). Curves: βi = 0.01 and βi = 0.1.]
+
+Figure S11: Test accuracy on MNIST as a function of nll for different βi. The experiments are
+repeated 5 times.
+
+To demonstrate that the lower layers of our model are indeed receiving meaningful credit signals,
+we report the test accuracy when only the last nll layers are updated (i.e., freezing the weights of Layers
+1−nll ) in Figure S11. For a 5-hidden-layer model with βi = 0.1, updating only the final layer yields
+a test accuracy of about 85%. As nll increases to 5, the accuracy rises to around 97.5%. A
+similar trend is observed for the model with βi = 0.01. These results show that achieving over 97%
+accuracy requires effective gradient propagation to all layers, confirming that our model
+delivers usable credit signals throughout the entire network.
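
The freezing protocol described above can be illustrated with a minimal sketch. It is not the authors' code: the helper names `trainable_mask` and `apply_update`, the scalar "weights", and the plain gradient step are all assumptions made for illustration. It only shows how a choice of nll selects which layers receive updates.

```python
def trainable_mask(num_layers: int, n_ll: int) -> list[bool]:
    """One flag per layer (index 0 = first layer); True means the layer is updated."""
    if not 0 <= n_ll <= num_layers:
        raise ValueError("n_ll must be between 0 and num_layers")
    first_trainable = num_layers - n_ll
    return [i >= first_trainable for i in range(num_layers)]


def apply_update(weights, grads, mask, lr=0.1):
    """Plain gradient step that leaves frozen layers untouched.

    Hypothetical stand-in for the real update rule; weights are scalars for brevity.
    """
    return [w - lr * g if m else w for w, g, m in zip(weights, grads, mask)]


# Example: 5 layers, only the last 2 are allowed to learn.
mask = trainable_mask(num_layers=5, n_ll=2)
new_weights = apply_update([1.0] * 5, [0.5] * 5, mask)
print(mask)
print(new_weights)
```

With nll equal to the number of layers, every flag is True and the whole network learns, which corresponds to the right-most point of the sweep.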
+
+C.6 ROBUSTNESS TO THE NOISE
+
+[Figure S12 heatmaps: test accuracy by weight noise intensity (rows) and state noise intensity (columns).]
+
+(a) 2HL
+Weight noise \ State noise |   0   | 1e-05 | 0.001 |  0.1
+0                          | 0.972 | 0.972 | 0.972 | 0.904
+0.001                      | 0.972 | 0.972 | 0.972 | 0.896
+0.01                       | 0.917 | 0.917 | 0.912 | 0.689
+0.1                        | 0.194 | 0.206 | 0.196 | 0.217
+
+(b) 3HL
+Weight noise \ State noise |   0   | 1e-05 | 0.001 |  0.1
+0                          | 0.974 | 0.974 | 0.967 | 0.752
+0.001                      | 0.973 | 0.975 | 0.964 | 0.749
+0.01                       | 0.903 | 0.904 | 0.838 | 0.485
+0.1                        | 0.167 | 0.180 | 0.174 | 0.167
+
+(c) 5HL
+Weight noise \ State noise |   0   | 1e-05 | 0.001 |  0.1
+0                          | 0.976 | 0.969 | 0.814 | 0.469
+0.001                      | 0.973 | 0.965 | 0.812 | 0.464
+0.01                       | 0.876 | 0.815 | 0.520 | 0.299
+0.1                        | 0.134 | 0.135 | 0.134 | 0.135
+
+
+Figure S12: Maximum test accuracy of the model for different noise intensities on weights and states,
+with noise added during both training and testing. (a) With 2 hidden layers; (b) With 3 hidden layers;
+(c) With 5 hidden layers. The model is trained for 50 epochs and the experiments are repeated 5 times.
+
+To evaluate the robustness of the model, we introduce noise on weights and time-varying noise
+on states: random Gaussian noise is imposed at each weight update or at each state update,
+respectively. The weight noise is added directly to the weights, while the state noise is added
+as the bias b in Equation 3.1. The mean absolute values of non-zero weights and neural activations
+after noiseless training are approximately 0.09 and 0.76, respectively.
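
As a rough illustration of the two noise sources described above, the sketch below injects zero-mean Gaussian noise once per weight update and once per state update. The function names, the generic `deltas` update, and the single-neuron pre-activation are assumptions for illustration; the actual state dynamics are those of Equation 3.1, which is not reproduced here.

```python
import random


def noisy_weight_update(weights, deltas, weight_noise_std, rng):
    """Apply an update and add fresh Gaussian noise directly to each weight."""
    return [w + d + rng.gauss(0.0, weight_noise_std) for w, d in zip(weights, deltas)]


def noisy_pre_activation(weights, inputs, state_noise_std, rng):
    """Pre-activation of one neuron, with the noise entering as the bias term b.

    A new noise sample is drawn at every call, i.e. at every state update,
    which makes the state noise time-varying.
    """
    b = rng.gauss(0.0, state_noise_std)
    return sum(w * x for w, x in zip(weights, inputs)) + b


rng = random.Random(0)  # fixed seed so the sketch is reproducible
w = noisy_weight_update([0.09, -0.09], [0.01, 0.01], weight_noise_std=0.001, rng=rng)
h = noisy_pre_activation(w, [0.76, 0.76], state_noise_std=0.1, rng=rng)
print(w, h)
```

Setting either standard deviation to zero recovers the noiseless update, which matches the first row and first column of the noise grids.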
+The accuracy of the model with two hidden layers under the two types of noise is presented in
+Figure S12(a). It maintains satisfactory performance when the standard deviation of the state noise
+reaches 0.1 or the standard deviation of the weight noise reaches 0.01. In deeper structures
+(Figure S12(b,c)), the results for weight noise are consistent with these observations, demonstrating
+excellent robustness. However, the tolerance to time-varying state noise degrades significantly, which
+we attribute to layer-wise noise accumulation and the distortion of the weak gradient signal by the
+noise during training. To test this hypothesis, we impose the state noise only at test time,
+
+
+
+
+
+
+[Figure S13 heatmaps: test accuracy by weight noise intensity (rows) and state noise intensity (columns).]
+
+(a) 2HL
+Weight noise \ State noise |   0   | 0.001 | 0.001 |  0.1  |  0.5
+0                          | 0.973 | 0.973 | 0.973 | 0.971 | 0.944
+0.001                      | 0.971 | 0.971 | 0.974 | 0.970 | 0.945
+0.01                       | 0.919 | 0.916 | 0.919 | 0.918 | 0.912
+0.1                        | 0.199 | 0.187 | 0.194 | 0.219 | 0.177
+
+(b) 3HL
+Weight noise \ State noise |   0   | 0.001 | 0.001 |  0.1  |  0.5
+0                          | 0.974 | 0.976 | 0.975 | 0.974 | 0.940
+0.001                      | 0.975 | 0.973 | 0.973 | 0.972 | 0.945
+0.01                       | 0.908 | 0.903 | 0.908 | 0.902 | 0.894
+0.1                        | 0.176 | 0.182 | 0.152 | 0.167 | 0.142
+
+(c) 5HL
+Weight noise \ State noise |   0   | 0.001 | 0.001 |  0.1  |  0.5
+0                          | 0.975 | 0.977 | 0.977 | 0.977 | 0.919
+0.001                      | 0.973 | 0.973 | 0.973 | 0.973 | 0.928
+0.01                       | 0.885 | 0.876 | 0.880 | 0.873 | 0.842
+0.1                        | 0.133 | 0.128 | 0.143 | 0.163 | 0.142
+
+
+
+
+Figure S13: Maximum test accuracy of the model for different noise intensities on weights and states
+(the state noise is added only in test). (a) With 2 hidden layers; (b) With 3 hidden layers; (c) With 5
+hidden layers. The model is trained for 50 epochs in a single experiment.
+
+
+and the test accuracy remains almost unaffected (Figure S13). The network is therefore potentially
+resilient to noise; how to improve resilience during training requires further study.
+
+
+D TRAINING DETAILS
+
+Table S2 provides the parameters of the Adam optimizer that are used in Tables S1–S2 (Kingma &
+Ba, 2015). The training details for Table 1 are given in Table S3. For convolutional architectures
+in EP, the training process is described by Algorithm S2. The training sample is fed into the
+network through Conv_0. The state of the first layer then passes through max pooling MaxPool_1 and
+convolution Conv_1 in sequence to reach the second layer. The second layer also feeds its states
+back to the first layer through transposed convolution ConvT_1 and max-unpooling MaxUnpool_1. After
+T iterations, the RNN converges to a steady state and produces outputs through MaxPool_2 and
+a fully connected layer. The prediction error is then computed and used to nudge the RNN through
+the transpose of the fully connected layer and max-unpooling MaxUnpool_2. Note that the unpooling
+MaxUnpool_i requires the indices from the corresponding pooling MaxPool_i.
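
The dependence of max-unpooling on the pooling indices can be shown with a one-dimensional toy example. This is an illustrative sketch only: the helpers `max_pool_1d` and `max_unpool_1d` are hypothetical, operate on plain Python lists, and use non-overlapping windows, whereas the model applies the same idea to 2D feature maps.

```python
def max_pool_1d(x, size):
    """Non-overlapping max pooling that also records where each maximum came from."""
    pooled, indices = [], []
    for start in range(0, len(x) - size + 1, size):
        window = x[start:start + size]
        offset = max(range(size), key=window.__getitem__)  # position of the max in the window
        pooled.append(window[offset])
        indices.append(start + offset)
    return pooled, indices


def max_unpool_1d(values, indices, length):
    """Scatter values back to the recorded positions; all other positions stay zero."""
    out = [0.0] * length
    for v, i in zip(values, indices):
        out[i] = v
    return out


x = [0.1, 0.9, 0.4, 0.3, 0.7, 0.2]
pooled, idx = max_pool_1d(x, size=2)            # forward path (MaxPool)
feedback = max_unpool_1d([1.0, 2.0, 3.0], idx, len(x))  # feedback path (MaxUnpool)
print(pooled, idx, feedback)
```

Without `idx` the feedback signal could not be routed to the units that actually produced the pooled activations, which is why each unpooling step is paired with its pooling step.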
+For Table S2, the Adam optimizer is used for all experiments. The activation functions sigmoid-s
+and hard-sigmoid are defined as ρ(x) = 1 / (1 + e^{−4(x − 0.5)}) and ρ(x) = min(max(x, 0), 1), respectively
+(Ernoult et al., 2019). For 5-HL, 10-HL and 20-HL architectures, the Adam optimizer parameters
+are as shown in Table S2 (epochs: 50, batch size: 500). The inference details of the architecture
+shown in Figure 3b are described by Algorithm S3. The details for convolutional architectures are
+given in Table S3 and Figures S14–S15. The cosine-annealing scheduler is used in convolutional
+architectures for CIFAR-10 (T_max = 50, η_min = 10^{−6}).
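
The two activation functions and the learning-rate schedule can be written down in a few lines. This is a sketch under stated assumptions: hard-sigmoid is taken to be a clamp to [0, 1], the schedule uses the standard closed-form cosine-annealing expression without restarts, and the base learning rate of 2.5 × 10^{−4} is simply the CIFAR-10 value from Table S3 used as an example.

```python
import math


def sigmoid_s(x: float) -> float:
    """Shifted, steepened sigmoid: 1 / (1 + exp(-4 (x - 0.5)))."""
    return 1.0 / (1.0 + math.exp(-4.0 * (x - 0.5)))


def hard_sigmoid(x: float) -> float:
    """Clamp x to the interval [0, 1] (assumed definition)."""
    return min(max(x, 0.0), 1.0)


def cosine_annealing_lr(epoch: int, base_lr: float, t_max: int = 50, eta_min: float = 1e-6) -> float:
    """Standard cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))


print(sigmoid_s(0.5), hard_sigmoid(1.7))
print([cosine_annealing_lr(e, 2.5e-4) for e in (0, 25, 50)])
```

Both activations map 0.5 to 0.5 and saturate towards 0 and 1, so they keep the states in the same range.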
+For MNIST, no pre-processing is used. For the CIFAR-10 dataset, we follow Scellier
+et al. (2023) to pre-process the images. We normalize the input images using mean µ =
+(0.4914, 0.4822, 0.4465) and standard deviation σ = 3 × (0.2023, 0.1944, 0.2010).
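
A per-channel normalization with these constants might look as follows. The sketch assumes pixel values already scaled to [0, 1] and the usual (x − µ) / σ form; the helper `normalize_pixel` is hypothetical and handles a single RGB pixel for brevity.

```python
# Per-channel statistics quoted in the text; sigma is three times the listed values.
MEAN = (0.4914, 0.4822, 0.4465)
STD = tuple(3 * s for s in (0.2023, 0.1944, 0.2010))


def normalize_pixel(rgb):
    """Normalize one RGB pixel (values assumed in [0, 1]) channel by channel."""
    return tuple((c - m) / s for c, m, s in zip(rgb, MEAN, STD))


print(normalize_pixel((1.0, 1.0, 1.0)))  # brightest pixel
print(normalize_pixel((0.0, 0.0, 0.0)))  # darkest pixel
```

Because σ is scaled by 3, the normalized inputs occupy a narrower range than with the unscaled standard deviations.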
+The results for the comparison of time consumption were obtained in a virtualized Windows 11
+environment with Intel Xeon Gold 6238R CPU, 16 GB RAM, and Nvidia RTX A5000 (24 GB VRAM).
+Other results were obtained in a Windows 11 environment with Intel Core i5-12490F, 32 GB RAM,
+and Nvidia GTX 1650 (4 GB VRAM) or a Windows 11 environment with AMD R7-7700, 32 GB
+RAM, and Nvidia RTX 4070 (12 GB VRAM). The default numerical precision is float32
+(single-precision float).
+
+
+Table S2: The parameters of the Adam optimizer.
+Parameter Name                                    | Default Value
+Learning rate (MNIST / CIFAR-10)                  | 10^{−3} / 2 × 10^{−4}
+First-order moment estimation decay rate (β_1)    | 0.9
+Second-order moment estimation decay rate (β_2)   | 0.999
+Small constant for numerical stability (ϵ)        | 10^{−8}
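
For reference, one Adam update with the Table S2 defaults can be sketched on a single scalar parameter. This follows the standard Adam rule (Kingma & Ba, 2015) with bias correction; the function name `adam_step` and the scalar setting are illustrative assumptions, not the training code.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


p, m, v = 1.0, 0.0, 0.0
p, m, v = adam_step(p, grad=0.5, m=m, v=v, t=1)
print(p)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient magnitude.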
+
+
+
+
+
+
+
+Table S3: Training details for Table 1 and Table 2. The results of EB-EP and P-EP come from
+previous work (Ernoult et al., 2019). SGD refers to Stochastic Gradient Descent with mini-batches.
+Architecture              | Training approach       | Optimizer | Epoch / Batch size - T/K | Learning rate            | Weight decay
+2HL                       | P-EP (sigmoid-s)        | SGD       | 50/20-100/20             | [0.005, 0.05, 0.2]       | None
+2HL                       | Proposed (tanh, Adam)   | Adam      | 50/500-10/10             | [0.001, 0.001, 0.001]    | None
+3HL                       | P-EP (sigmoid-s)        | SGD       | 100/20-180/20            | [0.002, 0.01, 0.05, 0.2] | None
+3HL                       | Proposed-DLR (tanh)     | SGD       | 100/20-18/10             | [0.002, 0.01, 0.05, 0.2] | None
+3HL                       | Proposed (tanh)         | SGD       | 100/20-18/10             | [0.1, 0.1, 0.1, 0.1]     | None
+3HL                       | Proposed (tanh, Adam)   | Adam      | 50/500-18/10             | 10^{−3}                  | None
+3HL                       | BP (tanh, Adam)         | Adam      | 50/500-1/1               | 10^{−3}                  | None
+Conv (Table 1)            | P-EP (hard-sigmoid)     | SGD       | 40/20-200/10             | [0.015, 0.035, 0.15]     | None
+Conv (Table 1)            | Proposed (hard-sigmoid) | SGD       | 40/128-20/10             | [0.15, 0.35, 0.9]        | 10^{−5}
+Conv (Table 1)            | BP (hard-sigmoid)       | SGD       | 40/128-1/1               | [0.001, 0.02, 0.4]       | 10^{−5}
+Conv (Table 2 MNIST)      | Proposed (hard-sigmoid) | Adam      | 40/128-20/10             | 2 × 10^{−4}              | 10^{−6}
+Conv (Table 2 MNIST)      | BP (hard-sigmoid)       | Adam      | 40/128-1/1               | 2 × 10^{−4}              | 10^{−6}
+Conv (Table 2 CIFAR-10)   | Proposed (hard-sigmoid) | Adam      | 50/128-40/10             | 2.5 × 10^{−4}            | 2 × 10^{−4}
+Conv (Table 2 CIFAR-10)   | BP (hard-sigmoid)       | Adam      | 50/128-1/1               | 2.5 × 10^{−4}            | 2 × 10^{−4}
+
+
+
+
+[Figure S14 diagram: 1@28x28 → Conv1 → 32@24x24 → MaxPool1 → 32@12x12 → Conv2 → 64@8x8 → MaxPool2 → 64@4x4 → Dense → 1x10]
+
+                                 Figure S14: Convolutional architectures for MNIST.
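
The feature-map sizes in Figure S14 can be checked with a short calculation. The kernel and pooling sizes are not stated in this section, so the sketch assumes 5x5 convolutions with stride 1 and no padding, and 2x2 pooling; these values are inferred only from the listed sizes (28 → 24 → 12 → 8 → 4) and are assumptions.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, window: int) -> int:
    """Spatial output size of non-overlapping pooling along one dimension."""
    return size // window


size = 28
trace = [size]
# Assumed layer sequence for MNIST: Conv1 (5x5), MaxPool1 (2x2), Conv2 (5x5), MaxPool2 (2x2).
for op, arg in [("conv", 5), ("pool", 2), ("conv", 5), ("pool", 2)]:
    size = conv_out(size, arg) if op == "conv" else pool_out(size, arg)
    trace.append(size)

flattened = 64 * size * size  # 64 channels of size x size feed the dense layer
print(trace, flattened)
```

Under these assumptions the dense layer maps the flattened 64@4x4 feature map to the 10 output classes.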
+
+
+
+
+[Figure S15 diagram: 3@32x32 → Conv1 → 32@32x32 → MaxPool1 → 32@16x16 → Conv2 → 64@16x16 → MaxPool2 → 64@8x8 → Conv3 → 128@8x8 → MaxPool3 → 128@4x4 → 1x10]
+
+                             Figure S15: Convolutional architectures for CIFAR-10.
+
+
+
+
+
+
+
+
+Algorithm S2 Two phases in EP training process for convolution architecture
+Input: Sample-label pairs (x, s_tar)
+Parameter: θ = [W_0, W_1, W_f, B_f, B_1, α_1, β_1, β_{f1}]
+Output: θ
+function FIRST-PHASE(θ, s_tar)
+    s_0 ← x
+    for t ← 1 to T do
+        h_1 ← Conv_0(s_0) + β_1 · MaxUnpool_1(ConvT_1(s_2^0))
+        h_2 ← Conv_1(MaxPool_1(s_1^0))
+        h_p ← W_f · Flatten(MaxPool_2(s_2^0))
+        s_1^0, s_2^0, s_p^0 ← ρ(h_1), ρ(h_2), SoftMax(h_p)
+    end for
+    Λ_1 ← [s_i^0], i = 0, 1, 2, p
+    return Λ_1
+end function
+function SECOND-PHASE(θ, Λ_1, s_tar)
+    s_0, s_1^{β_{f1}}, s_2^{β_{f1}}, s_p^{β_{f1}} ← x, s_1^0, s_2^0, s_p^0
+    for t ← 1 to K do
+        e_p ← s_tar − s_p^{β_{f1}}
+        h_1 ← Conv_0(s_0) + β_1 · MaxUnpool_1(ConvT_1(s_2^{β_{f1}}))
+        h_2 ← Conv_1(MaxPool_1(s_1^{β_{f1}})) + β_f · MaxUnpool_2(Unflatten(W_f^⊤ e_p))
+        h_p ← W_f · Flatten(MaxPool_2(s_2^{β_{f1}}))
+        s_1^{β_{f1}}, s_2^{β_{f1}}, s_p^{β_{f1}} ← ρ(h_1), ρ(h_2), SoftMax(h_p)
+    end for
+end function
+
+
+
+
+
+
+
+
+Algorithm S3 EP with feedback scaling and residual connections (Figure 3b)
+Input: (x, s_tar)
+Parameter: θ = [W_0, W_i, W_f, B_f, B_i, β_i, β_{f1}]
+Output: θ
+function ITERATION(θ, Λ_1, s_tar)
+    for t ← 1 to K do
+        if Nudging Phase then
+            β_f ← β_{f1}
+            s_0, s_1^{β_{f1}}, s_2^{β_{f1}}, s_p^{β_{f1}} ← x, s_1^0, s_2^0, s_p^0
+        else
+            β_f ← 0
+            s_0 ← x
+        end if
+        h_1 ← W_0 s_0 + β_1 B_1 s_2^{β_f} + β_{4,1} B_{4,1} s_4^{β_f}
+        h_2 ← W_1 s_1^{β_f} + β_2 B_2 s_3^{β_f}
+        h_3 ← W_2 s_2^{β_f} + β_3 B_3 s_4^{β_f}
+        h_4 ← W_3 s_3^{β_f} + β_4 B_4 s_5^{β_f} + W_{1,4} s_1^{β_f} + β_{7,4} B_{7,4} s_7^{β_f}
+        h_5 ← W_4 s_4^{β_f} + β_5 B_5 s_6^{β_f}
+        h_6 ← W_5 s_5^{β_f} + β_6 B_6 s_7^{β_f}
+        h_7 ← W_6 s_6^{β_f} + β_7 B_7 s_8^{β_f} + W_{4,7} s_4^{β_f} + β_{10,7} B_{10,7} s_10^{β_f}
+        h_8 ← W_7 s_7^{β_f} + β_8 B_8 s_9^{β_f}
+        h_9 ← W_8 s_8^{β_f} + β_9 B_9 s_10^{β_f}
+        h_10 ← W_9 s_9^{β_f} + β_f B_f e_p^{β_f} + W_{7,10} s_7^{β_f}
+        h_p ← W_f s_10^{β_f}
+        s_i^{β_f} ← ρ(h_i), i = 0, 1, 2, . . . , 10
+        s_p^{β_f} ← SoftMax(h_p)
+    end for
+end function
+
+
+
+
+
+\ No newline at end of file