author: Yuren Hao <yurenh2@illinois.edu>, 2026-07-03 05:56:50 -0500
committer: Yuren Hao <yurenh2@illinois.edu>, 2026-07-03 05:56:50 -0500
commit: b83947778e2c776f757a07d4719b7ce961d7ed55
tree: b9cc01d7adda691d9156d9d04f4fb2f644674e96 /refs/fre_rnn_full.txt
Initial commit: ept — backprop-free equilibrium transformer (EP)
Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
Diffstat (limited to 'refs/fre_rnn_full.txt')
-rw-r--r-- refs/fre_rnn_full.txt 1945
1 files changed, 1945 insertions, 0 deletions
diff --git a/refs/fre_rnn_full.txt b/refs/fre_rnn_full.txt
new file mode 100644
index 0000000..48da35c
--- /dev/null
+++ b/refs/fre_rnn_full.txt
@@ -0,0 +1,1945 @@
+ Published as a conference paper at ICLR 2026
+
+
+
+
+         TOWARD PRACTICAL EQUILIBRIUM PROPAGATION:
+         BRAIN-INSPIRED RECURRENT NEURAL NETWORK
+         WITH FEEDBACK REGULATION AND RESIDUAL CONNECTIONS
+
+ Zhuo Liu Tao Chen ∗
+ School of Microelectronics School of Microelectronics
+ University of Science and Technology of China University of Science and Technology of China
+ Hefei 230026, Anhui, China Hefei 230026, Anhui, China
+arXiv:2508.11659v2 [cs.NE] 7 May 2026
+
+
+
+
+ zhuoliu00@mail.ustc.edu.cn tchen@ustc.edu.cn
+
+
+
+                                              ABSTRACT
+ Brain-like intelligent systems need brain-like learning methods. Equilibrium
+ Propagation (EP) is a biologically plausible learning framework with strong poten-
+ tial for brain-inspired computing hardware. However, existing implementations
+ of EP suffer from instability and prohibitively high computational costs. Inspired
+ by the structure and dynamics of the brain, we propose a biologically plausible
+ Feedback-regulated REsidual recurrent neural network (FRE-RNN) and study its
+ learning performance in the EP framework. Feedback regulation enables rapid
+ convergence by attenuating feedback signals and reducing the disturbance of feed-
+ back paths to feedforward paths. The improvement in the convergence property
+ reduces the computational cost and training time of EP by orders of magnitude,
+ delivering performance on par with backpropagation (BP) in benchmark tasks.
+ Meanwhile, residual connections with brain-inspired topologies help alleviate the
+ vanishing gradient problem that arises when feedback pathways are weak in deep
+ RNNs. Our approach substantially enhances the applicability and practicality of
+ EP. The techniques developed here also offer guidance for implementing in-situ
+ learning in physical neural networks.
+
+
+         1   INTRODUCTION
+ Backpropagation (BP) has been the driving force behind the success of artificial intelligence (AI)
+ across a wide variety of tasks, ranging from image recognition to natural language processing
+ (Rumelhart et al., 1986; Lecun, 1988; He et al., 2016; Vaswani et al., 2017). Despite these tri-
+ umphs, BP’s reliance on non-local error signals and weight transport lacks biological plausibility
+ (Journé et al., 2023; Ororbia, 2023). The brain does not appear to implement the gradient computa-
+ tions performed by BP, in particular the explicit derivative of the activation function, which demands
+ precise access to the rate of change in neuronal activities at specific operating points (Ororbia, 2023).
+ Moreover, implementing BP in neuromorphic systems incurs enormous overhead (Kudithipudi et al.,
+ 2025). Drawing inspiration from the topology and dynamics of the brain is a viable approach to ad-
+ vancing biologically plausible learning mechanisms and to promoting energy-efficient computing
+ systems for AI.
+ Equilibrium Propagation (EP) (Scellier & Bengio, 2017; Ernoult et al., 2019; Laborieux et al., 2021)
+ presents a compelling and hardware-friendly alternative. It leverages naturally settling dynamics in
+ RNN for credit assignment, and eliminates the need for explicit activation derivatives. EP operates
+ in two phases with nearly identical dynamics, and the synaptic adjustments depend only on local
+ information (Ackley et al., 1985; Movellan, 1991; Ernoult et al., 2020). In EP, the output layer is
+ softly nudged by the prediction error toward configurations that incrementally minimize the loss
+ function, a regime termed weak supervision (Millidge et al., 2023). A major drawback of EP is
+ ∗
+ Corresponding Author
+its notably slow training speed and instability. An RNN often requires dozens or even hundreds
+of iterations to reach a stable state (Scellier & Bengio, 2017). Previous attempts to optimize EP’s
+performance have led to markedly more complicated procedures (O’Connor et al., 2019; Laborieux
+& Zenke, 2024).
+In this paper, we draw inspiration from the brain and propose a Feedback-regulated REsidual recur-
+rent neural network (FRE-RNN). We substantially improve the convergence properties of the RNNs
+and training speed of EP while achieving performance comparable to BP. Our contributions are as
+follows:
+
+ • By scaling down the feedback strength of RNNs, we enhance the robustness of EP and
+ accelerate the training and inference speed by orders of magnitude because of the improved
+ convergence properties.
+ • To counteract the gradient vanishing problem caused by weak feedback, we introduce resid-
+ ual connections into the layered RNNs, enabling the training of deep networks that previ-
+ ously challenged EP and achieving performance closer to BP.
+ • The feedback regulation and residual connections in RNNs of arbitrary graph topologies
+ mirror the multi-scale recurrence in biological neural networks. Our work fosters EP’s bio-
+ logical plausibility and extends its applicability in brain-inspired computational hardware.
+
+
+2 BACKGROUND
+
+2.1 CONVERGENT RNNS WITH STATIC INPUT
+
+Consider an RNN as a dynamical system driven by a static input x:
+
+ s[t + 1] = F (x, s[t], θ), (1)
+
+where F is the transition function, s[t] is the network state at time step t(t = 0, 1, 2, . . . , T ), and
+θ denotes the parameters. Assuming that the network state stabilizes in T steps, the RNN reaches
+a stable point s[T ]. Its convergence is typically guaranteed by either symmetric connections with
+asynchronous updates or by a sufficiently small spectral radius of asymmetric connections with
+synchronous updates (Hopfield, 1982; Yildiz et al., 2012; Liu et al., 2026). That said, other factors,
+e.g. activation function, also influence the dynamical properties of RNNs (Miller & Hardt, 2019).
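
The settling process in Eq. (1) can be sketched as a plain fixed-point iteration. This is a minimal illustration, not the authors' implementation: the function names `transition` and `settle`, the tanh activation, the layer sizes, and the tolerance are all assumptions. The recurrent weights are rescaled so that their spectral norm is 0.5, which is a sufficient (stronger than necessary) condition for convergence with a 1-Lipschitz activation.

```python
import numpy as np

def transition(x, s, theta):
    """One step of s[t+1] = F(x, s[t], theta) with a tanh activation (assumed)."""
    W_in, W_rec = theta
    return np.tanh(W_in @ x + W_rec @ s)

def settle(F, x, s_init, theta, max_steps=200, tol=1e-8):
    """Iterate F under a static input until the state stops changing."""
    s = s_init
    for t in range(1, max_steps + 1):
        s_next = F(x, s, theta)
        if np.max(np.abs(s_next - s)) < tol:
            return s_next, t          # converged after t steps
        s = s_next
    return s, max_steps               # did not converge within the budget

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8                 # hypothetical sizes
W_in = rng.normal(size=(n_hidden, n_in))
W_rec = rng.normal(size=(n_hidden, n_hidden))
W_rec *= 0.5 / np.linalg.norm(W_rec, 2)   # spectral norm 0.5 => contraction
x = rng.normal(size=n_in)

s_star, steps = settle(transition, x, np.zeros(n_hidden), (W_in, W_rec))
```

Here `s_star` plays the role of the stable point s[T], and `steps` is the number of iterations the network needed to reach it.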
+
+2.2 SCALING ADJACENCY MATRIX TO TUNE NETWORK DYNAMICS
+
+Scaling the spectral radius (SR) of the adjacency matrix, i.e., the largest absolute value among the
+eigenvalues of the weight matrix, is a common method to tune the dynamics of an RNN (Bai et al.,
+2012; Nakajima et al., 2024; Liu et al., 2026). An SR less than one yields stable and convergent
+dynamics. In this case, injected signals tend to decay over time, which manifests as short-term
+memory. An SR exceeding one can give rise to expansive or even chaotic behavior in which small
+perturbations are amplified. By adjusting the SR, one can bias the RNN toward convergent,
+oscillatory, or edge-of-chaos regimes, thereby tuning computational properties such as convergence
+speed or long-term memory capacity (Jaeger & Haas, 2004; Legenstein & Maass, 2007; Miller &
+Hardt, 2019).
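
As a rough illustration of SR scaling, the sketch below computes the spectral radius of a random weight matrix and rescales the matrix to a target value. The helper names, the matrix size, and the target of 0.9 are assumptions made for the example; only standard NumPy calls are used.

```python
import numpy as np

def spectral_radius(W):
    """Largest absolute value among the eigenvalues of W."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))

def rescale_to_spectral_radius(W, target):
    """Return a copy of W scaled so that its spectral radius equals target."""
    sr = spectral_radius(W)
    if sr == 0.0:
        raise ValueError("cannot rescale a matrix with zero spectral radius")
    return W * (target / sr)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))              # hypothetical recurrent weights
W_stable = rescale_to_spectral_radius(W, 0.9)

# With SR < 1, an injected signal decays under the linearized dynamics s <- W s.
s = rng.normal(size=64)
initial_norm = np.linalg.norm(s)
for _ in range(200):
    s = W_stable @ s
final_norm = np.linalg.norm(s)
```

Because eigenvalues scale linearly with the matrix, multiplying W by `target / sr` sets the SR exactly (up to floating-point error); the decay loop shows the short-term-memory behavior in the linear case only.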
+
+2.3 EQUILIBRIUM PROPAGATION
+
+Equilibrium propagation is a learning framework initially based on energy-based models. It proceeds
+in two phases: a free (first) phase and a weakly clamped (second) phase. For the first phase, the
+RNN converges to a steady state s0 under the stimulation of input alone. In the clamped phase, the
+network is gently nudged by the prediction error and settles to a new stable state sβ . The weight
+update can be simplified to a contrastive learning rule compatible with spike-timing-dependent plasticity
+(STDP) (Scellier et al., 2018). EP has been further generalized to asymmetric RNNs governed by
+vector field dynamics (Scellier et al., 2018). Recent work shows that asymmetry in skew-symmetric
+Hopfield models can improve classification performance (Høier et al., 2024).
+
+
+2.4 FEEDBACK REGULATION AND NETWORK STRUCTURE IN THE BRAIN
+
+Cortical areas in the brain feature dynamic regulation of feedforward and feedback connections
+(Felleman & Van Essen, 1991; Mejias et al., 2016; Michalareas et al., 2016; Semedo et al., 2022;
+Fişek et al., 2023; Wang et al., 2023). In the visual system, for instance, feedforward signals domi-
+nate immediately following the onset of external stimulus, whereas feedback signals become promi-
+nent during spontaneous activity. Dynamically regulating the strength of feedback allows the brain
+to optimize information integration, ensuring efficient perception and decision-making. In mam-
+malian neocortices, information processing involves not only feedforward synaptic chains but also
+extensive lateral and feedback loops that interconnect disparate regions, forming a richly recursive
+network rather than a strictly layered structure. This topology implies short average path length
+between neurons and efficient information flow (Watts & Strogatz, 1998; Markov et al., 2013; Lynn
+& Bassett, 2019; Kulkarni & Bassett, 2025). In deep neural networks, residual connections reflect
+the long-range skip-layer projections observed in cortical circuits (Perich & Rajan, 2020; Holk &
+Mejias, 2024). They mitigate the vanishing gradient by providing skip pathways that preserve gra-
+dient (He et al., 2016).
+
+3 ACCELERATING EP WITH BRAIN-INSPIRED NETWORK PROPERTIES
+
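+[Figure 1 graphic. (a) Layered RNN: s0 -> s1 -> s2 -> prediction sp, with feedforward weights W0,
+α1·W1, Wf and feedback weights β1·B1, βf·Bf; label st; error ep = st − sp. (b) Convolutional RNN:
+s0 -> s1 -> s2 -> prediction sp, with forward path Conv0 (32,5,1,0), P1 (2), Conv1 (64,5,1,0), P2 (2), Wf
+and feedback path ConvT1, P1^−1, β1 and Wf^T, P2^−1, βf; label st; error ep = st − sp.]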
+
+Figure 1: Illustration of feedback and feedforward regulation. (a) Layered architecture of RNN. The
+feedforward weights Wi and feedback weights Bi are rescaled by coefficients αi and βi respectively.
+The dashed box encloses an RNN formed by layers s1 and s2 with feedforward and feedback path-
+ways. βf is the nudging factor, which essentially scales the feedback strength of prediction error.
+(b) Embedding convolutional architecture in RNN. Convolutional parameter (32,5,1,0) is written
+as (channels, kernels, stride, padding). Parameter (2) in (b) denotes max-pooling with stride 2.
+ConvTi represents transpose convolution, the inverse process of the convolution, and Pi−1 means
+max-unpooling (Ernoult et al., 2019). Model architectures and training process are given in Ap-
+pendix D.
+
+
+3.1 PROTOTYPICAL SETTING OF EQUILIBRIUM PROPAGATION
+
+Unlike the prototypical setting of equilibrium propagation (P-EP) (Ernoult et al., 2019), we separate
+the input and output layer from the recurrent network (Figure 1a). This separation allows the output
+layer to adopt the SoftMax activation commonly used in feedforward networks, which facilitates
+performance comparison (Laborieux & Zenke, 2024). For clarity, the RNN (black dashed box in
+Figure 1a) shown here only contains two hidden layers s1 and s2 , but the approach applies to deeper
+structures (see below). The states of the RNN evolve for T discrete steps until they converge. The
+dynamics of the whole RNN can be formulated as:
+    s^{\beta_f}[t+1] = F(s^{\beta_f}[t], b) = \rho(W \cdot s^{\beta_f}[t] + b),
+    b = [W_0 \cdot s_0,\ \beta_f \cdot B_f \cdot e_p],    (2)
+where $s^{\beta_f}[t]$ is the state of the RNN at time t, ρ is the activation function, W is the forward weight
+matrix of the RNN, and b combines the feedforward input and the error-nudging term. Note that
+$s^{\beta_f} = [s_1^{\beta_f}, s_2^{\beta_f}]$. For each sample-label pair $(x, s_{tar})$, we run the free phase ($\beta_f = 0$) for te
+iterations, obtain the prediction $s_p = \mathrm{SoftMax}(W_f \cdot s_2)$, and compute the prediction error
+$e_p = s_{tar} - s_p$. During the clamped phase, the error nudges the RNN through the feedback weights $B_f$ and
+scaling coefficient $\beta_f = \beta_{f1}$ ($\beta_{f1} = 0.1$ for layered architecture and $\beta_{f1} = 0.25$ for convolutional
+architecture by default). The network evolves for K further iterations under clamping to another
+state. The weights $(W_0, W_1)$ are then updated with an STDP-compatible rule:
+
+    \Delta W_i = ds_{i+1} \cdot (s_i^0)^\top,
+    ds_{i+1} = s_{i+1}^{\beta_{f1}} - s_{i+1}^0,    (3)
+
+where $ds_i$ is the offset of the stable point caused by the error nudging (Scellier et al., 2018). Similarly,
+the final weight for output is updated:
+
+    \Delta W_f = (s_{tar} - s_p^0) \cdot (s_2^0)^\top.    (4)
+We also consider an RNN embedded with convolutional architecture in its forward paths (2 convo-
+lution layers, 2 max-pooling layers and 1 fully connected layer) shown in Figure 1b. The forward
+convolutional structure follows the architecture of existing convolutional neural networks (CNN)
+(Krizhevsky et al., 2012; Simonyan & Zisserman, 2015), in which a pooling layer is placed after
+the activation of the convolution layer. We transform the CNN to an RNN by adding feedback con-
+nections symmetric with the feed-forward connections (See Appendix D for the pseudocode and
+schematics).
+
+3.2 FEEDBACK REGULATION IN LAYERED RNN FOR FAST CONVERGENCE
+
+[Figure 2 graphic: panels (a)-(h); vertical axis: neuron index; horizontal axis: time step t (0 to 80).]
+
+Figure 2: Convergence dynamics and speed versus feedback scaling βi . All neurons in all hidden
+layers are indexed (s1 :0-63; s2 :64-127). Colors indicate neuronal activity (a,c,e,g) and changes in
+activity (b,d,f,h). (a) The state evolution of RNN with symmetric weights and βi = 0.1; (b) The
+one-step difference of neural states in (a). (c, d) Symmetric weights with βi = 2; (e, f) Asymmetric
+weights with βi = 0.1; (g, h) Asymmetric weights with βi = 4. In both symmetric and asymmetric
+feedback cases, down-scaling feedback connections tends to stabilize the network. See Figure 5d
+for the statistical robustness.
+
+Although the SR can tune the RNN dynamics, scaling forward weights Wi distorts forward signal
+propagation, which is harmful to performance (see below). Therefore, we turn to another choice,
+namely, scaling only the feedback strength with βi . This coefficient scales the gradients, in the same
+way as the nudging factor βf . We consider both symmetric (Bi = (Wi )⊤ ) and asymmetric (Bi ≠
+(Wi )⊤ ) recurrent connections in the study, and compare the results with feedforward neural networks
+(FNNs) of the same size trained by BP (feedback connections removed) or feedback alignment (FA)
+(Lillicrap et al., 2016), which uses random weights Bi ≠ (Wi )⊤ to feed back the error. Note that, after
+scaling, the overall weight matrix W of a symmetric RNN is no longer strictly symmetric. Therefore, we
+start from the vector field setting of EP rather than the energy-based setting. The feedforward
+and feedback weights are multiplied by coefficients αi and βi respectively. Figure 2a-d shows
+convergence speed for different βi . With asymmetric weights, the network can converge to a fixed
+point (Figure 2e, f), exhibit cyclical oscillation (Figure 2g, h), or even become chaotic. The feedback
+weights stay fixed during training process, which differs from EP in vector field dynamics (Scellier
+et al., 2018). The pseudocode of learning procedure with a 2-hidden-layer RNN shown in Figure 1(a)
+is provided in Algorithm 1.
+
+Algorithm 1 EP with Feedforward and Feedback Scaling
+Require: Input: (x, s_tar)
+Require: Parameters: θ = [W_0, W_1, W_f, B_f, B_1, α_1, β_1, β_f1]
+Ensure: Updated parameters θ
+function FIRST-PHASE(θ, s_tar)
+    s_0 ← x
+    for t = 1 to T do
+        h_1 ← W_0 · s_0 + β_1 · B_1 · s_2^0
+        h_2 ← α_1 · W_1 · s_1^0
+        h_p ← W_f · s_2^0
+        s_1^0, s_2^0, s_p^0 ← ρ(h_1), ρ(h_2), SoftMax(h_p)
+    end for
+    Λ_1 ← [s_i^0], i = 0, 1, 2, p
+    return Λ_1
+end function
+function SECOND-PHASE(θ, Λ_1, s_tar)
+    s_1^{βf1}, s_2^{βf1}, s_p^{βf1} ← s_1^0, s_2^0, s_p^0
+    for t = 1 to K do
+        e_p ← s_tar − s_p^{βf1}
+        h_1 ← W_0 · s_0 + β_1 · B_1 · s_2^{βf1}
+        h_2 ← α_1 · W_1 · s_1^{βf1} + β_f · B_f · e_p
+        h_p ← W_f · s_2^{βf1}
+        s_1^{βf1}, s_2^{βf1}, s_p^{βf1} ← ρ(h_1), ρ(h_2), SoftMax(h_p)
+    end for
+    ds_i ← s_i^{βf1} − s_i^0, i = 1, 2
+    Λ_2 ← [ds_1, ds_2]
+    return Λ_2
+end function
+function UPDATING-WEIGHTS(θ, Λ_1, Λ_2, s_tar)
+    ΔW_i ← ds_{i+1} · (s_i^0)^⊤, i = 0, 1
+    ΔW_f ← (s_tar − s_p^0) · (s_2^0)^⊤
+end function
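
The three functions of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading of the pseudocode rather than the authors' code. Assumptions not stated in the algorithm: tanh is used for ρ, hidden states start at zero, feedback is symmetric (B1 = W1^T, Bf = Wf^T), and the layer sizes, initialization scale, and the helper `make_params` are hypothetical. The sketch returns the raw updates and leaves the learning rate and optimizer to the caller.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - np.max(h))          # shift for numerical stability
    return e / e.sum()

def first_phase(p, x, T):
    """Free phase: synchronous updates with no error nudging."""
    s1 = np.zeros(p["W0"].shape[0])    # zero initial state (assumption)
    s2 = np.zeros(p["W1"].shape[0])
    sp = softmax(p["Wf"] @ s2)
    for _ in range(T):
        h1 = p["W0"] @ x + p["beta1"] * (p["B1"] @ s2)
        h2 = p["alpha1"] * (p["W1"] @ s1)
        hp = p["Wf"] @ s2
        s1, s2, sp = np.tanh(h1), np.tanh(h2), softmax(hp)
    return s1, s2, sp

def second_phase(p, x, s_tar, free, K):
    """Clamped phase: start from the free state and nudge s2 with the error."""
    s1_0, s2_0, sp_0 = free
    s1, s2, sp = s1_0, s2_0, sp_0
    for _ in range(K):
        e_p = s_tar - sp
        h1 = p["W0"] @ x + p["beta1"] * (p["B1"] @ s2)
        h2 = p["alpha1"] * (p["W1"] @ s1) + p["beta_f"] * (p["Bf"] @ e_p)
        hp = p["Wf"] @ s2
        s1, s2, sp = np.tanh(h1), np.tanh(h2), softmax(hp)
    return s1 - s1_0, s2 - s2_0        # ds1, ds2

def weight_updates(x, s_tar, free, ds):
    """Contrastive updates for W0, W1 and the output update for Wf."""
    s1_0, s2_0, sp_0 = free
    ds1, ds2 = ds
    dW0 = np.outer(ds1, x)
    dW1 = np.outer(ds2, s1_0)
    dWf = np.outer(s_tar - sp_0, s2_0)
    return dW0, dW1, dWf

def make_params(n_in, n1, n2, n_out, alpha1=1.0, beta1=0.1, beta_f=0.1, seed=0):
    """Hypothetical initializer with symmetric feedback weights."""
    rng = np.random.default_rng(seed)
    init = lambda m, n: rng.normal(scale=0.5 / np.sqrt(n), size=(m, n))
    W0, W1, Wf = init(n1, n_in), init(n2, n1), init(n_out, n2)
    return dict(W0=W0, W1=W1, Wf=Wf, B1=W1.T.copy(), Bf=Wf.T.copy(),
                alpha1=alpha1, beta1=beta1, beta_f=beta_f)

p = make_params(n_in=8, n1=16, n2=16, n_out=4)
x = np.linspace(-1.0, 1.0, 8)          # toy input
s_tar = np.eye(4)[2]                   # one-hot target for class 2
free = first_phase(p, x, T=20)
ds = second_phase(p, x, s_tar, free, K=10)
dW0, dW1, dWf = weight_updates(x, s_tar, free, ds)
```

A caller would then apply the updates with a learning rate of its choice, for example `p["W0"] += lr * dW0`, where `lr` is a hypothetical hyperparameter.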
+
+
+
+3.3 RESIDUAL CONNECTIONS TO AVOID VANISHING GRADIENTS
+
+In our 10-hidden-layer RNN with symmetric connections, we add cross-layer residual links
+(Figure 3a-b) and carry out an ablation study of their effect on performance. The three long-range
+bidirectional connections bypass adjacent layers to reduce gradient decay. For RNNs with asymmetric
+connections, we introduce skip-layer connections between non-adjacent layers with probability
+P = 20%, creating an RNN with arbitrary graph topology (AGT) in which any pair of layers forms
+a connection stochastically (Figure 3c) (Salvatori et al., 2022). See Algorithm S3 in Appendix D for
+training details.
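
One way to picture the AGT construction is as a layer-level connectivity mask. The sketch below is an assumption-laden illustration, not the procedure of Algorithm S3: it assumes adjacent layers are always connected, each ordered pair of non-adjacent layers is connected independently with probability P, and there are no within-layer connections. The function names are hypothetical.

```python
import numpy as np

def agt_layer_mask(n_layers, p_skip=0.2, seed=0):
    """mask[i, j] is True when layer j projects to layer i."""
    rng = np.random.default_rng(seed)
    mask = rng.random((n_layers, n_layers)) < p_skip   # stochastic skip links
    idx = np.arange(n_layers)
    adjacent = np.abs(idx[:, None] - idx[None, :]) == 1
    mask |= adjacent                                   # keep the layered backbone
    np.fill_diagonal(mask, False)                      # no within-layer links (assumption)
    return mask

def expand_to_neurons(mask, n_per_layer):
    """Blow each layer-level entry up into an n-by-n block of neuron pairs."""
    return np.repeat(np.repeat(mask, n_per_layer, axis=0), n_per_layer, axis=1)

layer_mask = agt_layer_mask(n_layers=10, p_skip=0.2, seed=0)
neuron_mask = expand_to_neurons(layer_mask, n_per_layer=64)
```

Multiplying a dense weight matrix elementwise by `neuron_mask` would give a block-sparse adjacency matrix of the kind shown in Figure 3c.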
+
+4 EXPERIMENTS
+We evaluated our RNN models on MNIST and CIFAR-10 datasets and compared the results with P-
+EP and BP. The MNIST dataset consists of 70,000 grayscale handwritten digit images (28×28 pixels)
+split into 60,000 training and 10,000 test samples. CIFAR-10 contains 60,000 RGB images (32×32
+pixels) of 10 categories, divided into 50,000 training and 10,000 test samples. Pre-processing, net-
+work structures and additional training details are in Appendix D.
+
+4.1 INFLUENCE OF FEEDFORWARD SCALING AND FEEDBACK SCALING
+
+Figure 4 compares the effects of feedforward scaling αi and feedback scaling βi . In general, relatively
+small feedback scaling (βi = 0.1) yields high MNIST accuracy (Figure 4). In deeper RNNs, overly
+
+Figure 3: (a) A 10-hidden-layer RNN model with residual connections. The solid blue wires and
+the dashed orange wires represent forward and feedback residual connections respectively. The
+bidirectional connections are symmetric. (b) Adjacency matrix of (a). The blocks (green) other than
+the sub-diagonals indicate residual connections. (c) Adjacency matrix for an RNN with arbitrary
+graph topology.
+
+
+
+[Figure 4 heatmap values: test accuracy; rows: feedback scaling βi ; columns: feedforward scaling αi ]
+
+(a) 2HL test accuracy
+  βi \ αi |  0.01  |  0.1   |  1.0   |  4.0
+  0.001   | 0.9659 | 0.9730 | 0.9756 | 0.9753
+  0.01    | 0.9666 | 0.9723 | 0.9760 | 0.9758
+  0.1     | 0.9624 | 0.9711 | 0.9725 | 0.9694
+  1.0     | 0.8925 | 0.9127 | 0.9300 | 0.7978
+
+(b) 3HL test accuracy
+  βi \ αi |  0.01  |  0.1   |  1.0   |  4.0
+  0.001   | 0.9348 | 0.9692 | 0.9718 | 0.9555
+  0.01    | 0.9246 | 0.9679 | 0.9765 | 0.9715
+  0.1     | 0.6323 | 0.9497 | 0.9741 | 0.9702
+  1.0     | 0.4862 | 0.8739 | 0.5249 | 0.1676
+
+(c) 5HL test accuracy
+  βi \ αi |  0.01  |  0.1   |  1.0   |  4.0
+  0.001   | 0.3844 | 0.7980 | 0.9338 | 0.8048
+  0.01    | 0.2515 | 0.9365 | 0.9575 | 0.8583
+  0.1     | 0.2104 | 0.8386 | 0.9757 | 0.9332
+  1.0     | 0.2096 | 0.3891 | 0.2500 | 0.1324
+
+Figure 4: The influence of feedforward scaling αi and feedback scaling βi on accuracy of MNIST
+classification. (a) 2 hidden layers; (b) 3 hidden layers; (c) 5 hidden layers. Each layer has 64
+neurons. By default, T = 10×NHiddenLayer , K = T /2. Each result is averaged over five repeated
+experiments.
+
+
+[Figure 5 graphic: panels (a) test error, (b) SR, (c) FTMLE, (d) convergence time, each plotted against
+feedback scaling βi ; legend: symm-before training, asymm-before training, symm-after training,
+asymm-after training.]
+Figure 5: The test error, SR, FTMLE, and convergence time versus feedback scaling βi . The results
+are obtained from a 3-hidden-layer (64 neurons per layer) model trained on the MNIST dataset. Note
+that the network does not converge under certain conditions, resulting in missing values in (d).
+
+low feedback scaling βi jeopardizes the performance, which we attribute to vanishing gradients
+(Figure 4c, right two columns). In contrast, down-scaling the feedforward weights degrades perfor-
+mance, as the inference signals are weakened through the layers (see rows of Figure 4a). However,
+up-scaling αi can also be detrimental, as this easily leads to saturation of the neural state. The best
+performance of a 5-hidden-layer RNN is achieved without feedforward scaling (αi = 1) and with a
+moderate feedback scaling of βi = 0.1. These results suggest that balancing the feedforward and feedback
+strengths is critical for better performance, not only accuracy but also speed (see Table 1).
+To further investigate the influence of feedback scaling βi , we plot the error, SR, finite time max-
+imum Lyapunov exponent (FTMLE) (Shadden et al., 2005; Kanno & Uchida, 2014) and conver-
+gence time against feedback scaling coefficient before and after training of a 3-hidden-layer RNN
+on MNIST (Figure 5). It shows that larger feedback scaling βi decreases accuracy (Figure 5a). As
+expected, SR is positively correlated to βi (see Figure 5b), and large SR can lead to instability of
+an RNN indicated by the FTMLE shown in Figure 5c, which in turn explains the results in Figure
+5a. In general, down-scaling the feedback (βi < 1) reduces the convergence time of RNN, which is
+favorable. Note that up-scaling of feedback βi >1 can also decrease FTMLE and convergence time.
+However, this is attributed to the saturation of neural state, and will also lower the performance.
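
The positive relation between SR and βi can be checked on a simplified case. The sketch below assumes a two-hidden-layer block with symmetric feedback (B1 = W1^T) and ignores the activation function, so it describes only the linearized recurrent matrix. Under these assumptions the block matrix M = [[0, β W1^T], [α W1, 0]] satisfies M² = αβ · diag(W1^T W1, W1 W1^T), so its spectral radius is sqrt(αβ) times the largest singular value of W1. The function names and sizes are hypothetical.

```python
import numpy as np

def block_recurrent_matrix(W1, alpha, beta):
    """Recurrent matrix acting on [s1, s2] with feedforward alpha*W1 and feedback beta*W1^T."""
    n2, n1 = W1.shape
    top = np.hstack([np.zeros((n1, n1)), beta * W1.T])
    bottom = np.hstack([alpha * W1, np.zeros((n2, n2))])
    return np.vstack([top, bottom])

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

rng = np.random.default_rng(0)
W1 = rng.normal(scale=1.0 / 8.0, size=(64, 64))    # hypothetical hidden-to-hidden weights
sigma_max = np.linalg.norm(W1, 2)                  # largest singular value

betas = (0.01, 0.1, 1.0, 4.0)
srs = {b: spectral_radius(block_recurrent_matrix(W1, 1.0, b)) for b in betas}
predicted = {b: np.sqrt(1.0 * b) * sigma_max for b in betas}
```

In this simplified setting the SR grows as the square root of βi, so reducing βi from 1 to 0.01 lowers the SR by a factor of ten.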
+Additionally, one might suspect that the gradient signals in the lower layers are not fulfilling their
+intended role. In reservoir computing, where only the last layer is trained, the network can also
+reach high accuracy as long as the output dimension is large enough. However, this is unlikely in
+our case, as each layer in our network has only 64 neurons by default (other than the results in Table
+1). To further confirm that the learning in lower layers is meaningful, we performed training with
+the weights of lower layers frozen—details of these experiments are included in Appendix C.5. The
+results clearly show that getting comparable results to BP requires effective training of lower layers.
+
+[Figure 6 graphic: testing error (log scale) versus epoch (0 to 50). Panels (a)-(d): βi = 0.01, 0.1, 1.0, 4.0,
+with curves for T = 10, 20, 50, 100. Panels (e)-(h): 2, 3, 5, 10 layers, with curves for
+βi = 0.001, 0.01, 0.1, 0.25, 1.0, 2.0, 4.0.]
+
+Figure 6: Test error with different hyperparameters. The curves of different T (10, 20, 50, 100) with
+2 hidden layers (64 neurons per hidden layer) and (a) βi = 0.01; (b) βi = 0.1; (c) βi = 1; (d)
+βi = 4. The curves of different βi (0.001, 0.01, 0.1, 0.25, 1, 2, 4) with (e) 2 hidden layers; (f) 3
+hidden layers; (g) 5 hidden layers; (h) 10 hidden layers. The shaded areas represent deviations of
+five repeated experiments. By default, T = 10 × NHiddenLayer , K = T /2. See Appendix A for
+more information.
+
+
+4.2 DOWN-SCALING FEEDBACK LEADS TO FASTER CONVERGENCE
+
+Figure 6a-d plots the error versus the number of epochs with different iteration steps T . Under the
+condition of βi = 0.01 (Figure 6a), the model with T = 10 and K = 5 works as well as the model
+with T = 100 and K = 50, suggesting possibility of speedup in training. Larger βi requires more
+iterations to achieve a certain level of performance (See Figure 6b, c, d). Larger βi means larger SR
+and FTMLE, thus requiring more iterations to settle the RNN as shown in Figure 2 and Figure 5(b-
+
+Table 1: Comparison with P-EP and BP in accuracy and cost. The results of P-EP come from
+previous work (Ernoult et al., 2019). For BP results, we used a network with the same number
+of layers and number of nodes/channels. Each experiment is repeated five times, and the standard
+deviation is given. By default, βi = 0.01 in our results, the feedback weights are symmetric with
+the feedforward weights for P-EP and Ours, and the learning rate in all layers are the same except
+for Ours-DLR (different learning rate), which uses varying learning rates identical to that of P-EP.
+For 2HL (two hidden layers) and 3HL (three hidden layers), there are 512 nodes per hidden layer.
+See Appendix D for more details.
+ Architecture | Training approach   | Testing (Training)       | Epoch / Batch size - T/K | Wall Clock Time (HH:MM:SS)
+ 2HL          | P-EP (sigmoid-s)    | 98.05%±0.10% (99.86%)    | 50/20-100/20             | 1:56:-
+ 2HL          | Ours (tanh, Adam)   | 98.39%±0.04% (100.00%)   | 50/500-10/10             | 0:01:16
+ 2HL          | BP (tanh, Adam)     | 98.26%±0.06% (100.00%)   | 50/500-1/1               | 0:00:18
+ 3HL          | P-EP (sigmoid-s)    | 97.99%±0.18% (99.90%)    | 100/20-180/20            | 8:27:-
+ 3HL          | Ours-DLR (tanh)     | 97.65%±0.08% (98.93%)    | 100/20-18/10             | 1:01:14
+ 3HL          | Ours (tanh)         | 97.83%±0.13% (99.98%)    | 100/20-18/10             | 1:01:54
+ 3HL          | Ours (tanh, Adam)   | 98.36%±0.06% (100.00%)   | 50/500-18/10             | 0:02:11
+ 3HL          | BP (tanh, Adam)     | 98.36%±0.08% (100.00%)   | 50/500-1/1               | 0:00:24
+ Conv         | P-EP (hard-sigmoid) | 98.98%±0.04% (99.46%)    | 40/20-200/10             | 8:58:-
+ Conv         | Ours (hard-sigmoid) | 99.14%±0.02% (99.78%)    | 40/128-20/10             | 0:12:28
+ Conv         | BP (hard-sigmoid)   | 98.93%±0.18% (99.43%)    | 40/128-1/1               | 0:01:01
+
+
+
+Table 2: Comparison with BP and FA and ablation study of residual connection. For layered
+architecture, there are 64 nodes per hidden layer and we chose T = 10 × NHiddenLayer , and
+K = 5 × NHiddenLayer , which guarantees saturation of accuracy at βi = 0.1. For convolutional
+architectures, βi = 0.01. By default, the Adam optimizer is used. Each experiment is repeated five
+times. See Appendix D for more training details.
+ Architecture-connections | Training approach | MNIST-Testing (Training)  | CIFAR-10-Testing (Training)
+ 5-symm                   | BP                | 97.69%±0.10% (100.00%)    | 49.23%±0.81% (56.72%)
+ 5-symm                   | Ours              | 97.64%±0.10% (99.98%)     | 50.72%±0.17% (57.02%)
+ 5-asymm                  | FA                | 96.44%±0.10% (98.96%)     | 37.97%±2.18% (38.92%)
+ 5-asymm                  | Ours              | 96.37%±0.11% (97.99%)     | 45.27%±0.73% (46.79%)
+ 10-symm                  | BP                | 97.61%±0.04% (99.93%)     | 48.23%±1.26% (55.37%)
+ 10-symm                  | Ours              | 92.49%±0.32% (95.27%)     | 34.90%±0.38% (34.64%)
+ 10-symm                  | Ours-Residual     | 97.49%±0.05% (99.77%)     | 44.46%±0.51% (48.67%)
+ 10-asymm                 | FA                | 94.52%±0.26% (95.54%)     | 30.16%±6.12% (30.20%)
+ 10-asymm                 | Ours              | 87.37%±0.49% (87.95%)     | 30.37%±1.09% (29.97%)
+ 10-asymm                 | Ours-AGT          | 96.87%±0.11% (99.45%)     | 30.94%±4.90% (31.36%)
+ 20-symm                  | BP                | 97.48%±0.07% (99.74%)     | 47.35%±1.49% (54.59%)
+ 20-symm                  | Ours-Residual     | 95.95%±0.18% (98.20%)     | 43.61%±1.17% (44.26%)
+ Conv                     | BP                | 99.34%±0.04% (99.97%)     | 75.45%±0.46% (83.61%)
+ Conv                     | Ours              | 99.27%±0.07% (99.78%)     | 75.04%±0.51% (80.79%)
+
+
+d). Or even worse, the gradient signal is completely distorted. At βi = 4, even T=100 fails to exceed
+95% accuracy. Figure 6e-h shows that while shallow networks benefit from low βi , deeper networks
+(3, 5 and 10 layers) lose accuracy. In all cases, training performance peaks at certain βi dependent
+on the network depth. Additional results are provided in Table S1 in Appendix B.
+Tables 1 and 2 compare our approach with P-EP, BP, and FA. Our model outperforms P-EP in training
+speed by at least one order of magnitude for both the convolutional and the layered architectures.
+Importantly, our accuracy is comparable to BP and FA for the shallow architectures (5-hidden-layer
+and conv model, see also Table 2). In consideration of the improved stability (Figure 6) via feedback
+regulation, we anticipate that physical implementations of RNN can achieve performance on par
+with BP. Additionally, for layered architecture, we also adopt the same training parameters (learning
+rate, batch size and epochs) as P-EP, differing only in feedback scaling (‘ours-DLR’ in Table 1). The
+results present clear evidence of speedup, which mainly stems from the reduced number of iterations
+required for convergence.
+
+
+4.3 DOWN-SCALED FEEDBACK COORDINATES PLASTICITY OF DIFFERENT LAYERS
+
+It is hypothesized that the brain requires different plasticity in different areas due to their varying
+functional roles (Atallah et al., 2004; Lowet et al., 2020). The variability in plasticity can be realized
+explicitly by adjusting learning rates or implicitly by modulating the intensity of gradient. Previ-
+ous work postulated that EP with weak feedback necessitates learning rates differing by orders of
+magnitude across layers (Scellier & Bengio, 2017). Here, we found that due to gradient differences
+across different layers induced by weak feedback, a 3-hidden-layer RNN at βi = 0.01 (Table 1,
+‘ours (tanh)’) learns well with a uniform learning rate. This result suggests that the feedback scaling
+alone is able to regulate gradient strength of different layers, pointing to another possible mechanism
+to coordinate plasticity.
+
+
+4.4 RESIDUAL CONNECTIONS OVERCOME THE GRADIENT VANISHING IN DEEP RNNS
+
+Weak feedback exacerbates vanishing gradient in deeper layered RNN (Figures S5–S6 in Ap-
+pendix B). Adding residual connections restores gradient flow (Figure S7 in Appendix B). As a
+result, a 10-hidden-layer network sees substantial performance gains (Table 2): accuracy rises by about 5
+percentage points on MNIST and by more than 9 percentage points on CIFAR-10. Even a 20-hidden-layer
+model can be trained. As shown in Table 2, without residual connections, an asymmetric RNN trained
+by EP falls short of FA in accuracy, but with arbitrary residual links it surpasses the accuracy of FA
+on MNIST classification (see the ablation study on connection probability in Appendix B). For the
+more complex CIFAR-10 dataset, the 10-hidden-layer asymmetric model with random residual feedback
+connections achieves accuracy nearly 14 percentage points below the symmetric model. A possible
+reason is that the gradient signal passing through multiple random fixed feedback connections becomes
+too distorted to coordinate the forward weight learning.
+
+
+5 DISCUSSION
+
+We have applied the feedback scaling to RNN to speed up the convergence and to accelerate training
+with EP with negligible overhead. To counteract the vanishing gradient in deep architectures, we
+have added residual connections to non-adjacent layers of deep RNNs, partly restoring classification
+performance. In principle, the residual connections make credit assignment pathways shorter (Veit
+et al., 2016). The training exhibits remarkable resilience to noise on weight and neural state. Our
+structural modification is compatible with other algorithmic speed-ups (Scellier et al., 2023), thereby
+expanding the design space for efficient EP implementations.
+Recent work on credit assignment in brain-inspired networks, e.g. adjoint propagation (Liu et al.,
+2026), partitions a large network into local RNNs with random internal connections of low SR
+for fast convergence and dynamic resource allocation, yielding speed and accuracy similar to this
+work. This work, however, adopts the feedback scaling to solve the stability issue and accelerate
+convergence of EP.
+Weak feedback is often considered in biologically plausible learning algorithms (Sacramento et al.,
+2018; Haider et al., 2021; Meulemans et al., 2021). It has been shown that contrastive Hebbian
+learning with weak feedback approximates backpropagation while converging quickly (Xie & Se-
+ung, 2003). More recently, local representation alignment (LRA) likewise employed weak feedback
+(Ororbia et al., 2023) and skip connections from the output to deep layers for efficient training. The
+EP framework also approximates BP (Scellier & Bengio, 2017; Millidge et al., 2023), but under
+the weak clamping condition (weak supervision) (Laborieux et al., 2021; Millidge et al., 2023). We
+have shown that, at the infinitesimal inference limit, namely weak supervision and weak feedback
+(Millidge et al., 2023), EP is equivalent to LRA and BP (Appendix C). In other words, because of its
+weak feedback, the dynamics of FRE-RNN resemble those of a feedforward neural network.
+
+However, our approach still has a few limitations with respect to the large-scale neural networks that
+underpin artificial intelligence. For complex datasets like CIFAR-10, there is a notable performance
+gap compared to BP when using deep fully connected neural networks. We attribute this gap to
+the inaccurate approximation of the true gradient as computed by BP (see Appendix C.4). Therefore,
+although EP can be extended to deep fully connected networks (20 hidden layers) and shallow
+CNNs, its applicability to deep CNNs remains to be explored. For deep architectures with asymmetric
+connections, the accuracy decreases faster with increasing depth because of the inaccurate random
+error feedback. A more in-depth investigation of residual connection topology is required to scale
+the methodology up to large deep architectures. Besides, the hyperparameters are optimized
+empirically. We find that a feedback scaling in the range of 0.01-0.1 is favorable for shallow networks
+(fewer than 4 layers) and 0.1-0.25 for deeper architectures. Finding a general way to determine these
+parameters is ongoing work. Additionally, existing research on naturally converging EP continues
+to focus primarily on static-input settings (Laborieux et al., 2021; Ernoult et al., 2019; Laborieux
+& Zenke, 2024). Extending naturally converging RNNs trained by EP to sequence tasks remains a
+challenge.
+From a neurobiological perspective, residual connections, particularly the randomly generated
+arbitrary graph topologies, yield connectivity patterns resembling those of the cortex. The feedback-regulated
+residual RNNs equip the biologically plausible learning framework, EP, with a biologically plausible
+network architecture. Although it currently runs on GPUs, it can exploit the natural convergence of
+physical RNNs and facilitate efficient learning and inference on dedicated neuromorphic hardware.
+
+ACKNOWLEDGEMENTS
+This work was supported by the National Key R&D Program of China (Grant No.
+2024YFA1208804). Additional financial support from the University of Science and Technology
+of China and the Chinese Academy of Sciences is also gratefully acknowledged.
+
+CODE AVAILABILITY
+The code used in this work is available at https://github.com/Zero0Hero/FRE-RNN-EP.
+
+REPRODUCIBILITY STATEMENT
+The code necessary to reproduce the main results is provided as Jupyter Notebooks in the Supple-
+mentary Materials. Researchers can directly run them to reproduce the results. Further details on
+data pre-processing and training process are available within the provided code and in Appendix D.
+
+THE USE OF LARGE LANGUAGE MODELS (LLMS)
+In the preparation of this work, the authors used GPT-5 and DeepSeek solely for the purpose of
+polishing and improving the linguistic fluency and readability of the text. This includes tasks such
+as correcting grammar and rephrasing sentences. After using the model, the authors have reviewed
+and edited all content extensively and take full responsibility for all ideas, claims, and the final
+language presented in this paper.
+
+REFERENCES
+David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for boltz-
+ mann machines. Cognitive Science, 9(1):147–169, 1985.
+Hisham E. Atallah, Michael J. Frank, and Randall C. O’Reilly. Hippocampus, cortex, and basal
+ ganglia: Insights from computational models of complementary learning systems. Neurobi-
+ ology of Learning and Memory, 82(3):253–267, 2004. ISSN 1074-7427. doi: https://doi.
+ org/10.1016/j.nlm.2004.06.004. URL https://www.sciencedirect.com/science/
+ article/pii/S1074742704000693.
+Zhang Bai, D. J. Miller, and Wang Yue. Nonlinear system modeling with random matrices: Echo
+ state networks revisited. IEEE Transactions on Neural Networks and Learning Systems, 23(1):
+ 175–182, 2012. ISSN 2162-237X 2162-2388. doi: 10.1109/tnnls.2011.2178562.
+Maxence Ernoult, Julie Grollier, Damien Querlioz, Yoshua Bengio, and Benjamin Scellier. Updates
+ of equilibrium prop match gradients of backprop through time in an rnn with static input. In
+ Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.
+ Article 636. Curran Associates Inc., 2019.
+Maxence Ernoult, Julie Grollier, Damien Querlioz, Yoshua Bengio, and Benjamin Scellier. Equilib-
+ rium propagation with continual weight updates, 2020. URL https://openreview.net/
+ forum?id=H1xJhJStPS.
+D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in the primate cerebral
+ cortex. Cerebral Cortex, 1(1):1–47, 1991. ISSN 1047-3211. doi: 10.1093/cercor/1.1.1.
+Mehmet Fişek, Dustin Herrmann, Alexander Egea-Weiss, Matilda Cloves, Lisa Bauer, Tai-Ying
+ Lee, Lloyd E. Russell, and Michael Häusser. Cortico-cortical feedback engages active den-
+ drites in visual cortex. Nature, 617(7962):769–776, 2023. ISSN 1476-4687. doi: 10.1038/
+ s41586-023-06007-6. URL https://doi.org/10.1038/s41586-023-06007-6.
+Paul Haider, Benjamin Ellenberger, Laura Kriener, Jakob Jordan, Walter Senn, and Mihai A. Petro-
+ vici. Latent equilibrium: a unified learning theory for arbitrarily fast computation with arbitrarily
+ slow neurons. In Proceedings of the 35th International Conference on Neural Information Pro-
+ cessing Systems, pp. Article 1365. Curran Associates Inc., 2021.
+K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE
+ Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. ISBN
+ 1063-6919. doi: 10.1109/CVPR.2016.90.
+Maya van Holk and Jorge F Mejias. Biologically plausible models of cognitive flexibility: merging
+ recurrent neural networks with full-brain dynamics. Current Opinion in Behavioral Sciences, 56:
+ 101351, 2024. doi: 10.1016/j.cobeha.2024.101351.
+J. J. Hopfield. Neural networks and physical systems with emergent collective computational abili-
+ ties. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. ISSN 0027-8424
+ 1091-6490. doi: 10.1073/pnas.79.8.2554.
+Rasmus Høier, Kirill Kalinin, Maxence Ernoult, and Christopher Zach. Dyadic learning in recur-
+ rent and feedforward models. In NeurIPS 2024 Workshop Machine Learning with new Compute
+ Paradigms, 2024. URL https://openreview.net/forum?id=LNfWowAErI.
+Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving
+ energy in wireless communication. Science, 304(5667):78–80, 2004. doi: 10.1126/science.1091277.
+ URL https://doi.org/10.1126/science.1091277.
+Adrien Journé, Hector Garcia Rodriguez, Qinghai Guo, and Timoleon Moraitis. Hebbian deep learn-
+ ing without feedback. In The Eleventh International Conference on Learning Representations,
+ 2023. URL https://openreview.net/forum?id=8gd4M-_Rj1.
+Kazutaka Kanno and Atsushi Uchida. Finite-time lyapunov exponents in time-delayed nonlinear
+ dynamical systems. Physical Review E, 89(3):032918, 2014. ISSN 1539-3755 1550-2376. doi:
+ 10.1103/PhysRevE.89.032918.
+Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International
+ Conference on Learning Representations. ICLR, 2015. doi: 10.48550/arXiv.1412.6980.
+Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convo-
+ lutional neural networks. In Advances in Neural Information Processing Systems, 2012.
+Dhireesha Kudithipudi, Catherine Schuman, Craig M. Vineyard, Tej Pandit, Cory Merkel, Rajkumar
+ Kubendran, James B. Aimone, Garrick Orchard, Christian Mayr, Ryad Benosman, Joe Hays, Cliff
+ Young, Chiara Bartolozzi, Amitava Majumdar, Suma George Cardwell, Melika Payvand, Sonia
+ Buckley, Shruti Kulkarni, Hector A. Gonzalez, Gert Cauwenberghs, Chetan Singh Thakur, Anand
+ Subramoney, and Steve Furber. Neuromorphic computing at scale. Nature, 637(8047):801–812,
+ 2025. ISSN 0028-0836 1476-4687. doi: 10.1038/s41586-024-08253-8.
+Suman Kulkarni and Dani S. Bassett. Toward principles of brain network organization
+ and function. Annual Review of Biophysics, 54:353–378, 2025. ISSN 1936-1238.
+ doi: 10.1146/annurev-biophys-030722-110624. URL
+ https://www.annualreviews.org/content/journals/10.1146/annurev-biophys-030722-110624.
+Axel Laborieux and Friedemann Zenke. Improving equilibrium propagation without weight sym-
+ metry through jacobian homeostasis. In The Twelfth International Conference on Learning Rep-
+ resentations. ICLR, 2024. URL https://openreview.net/forum?id=kUveo5k1GF.
+Axel Laborieux, Maxence Ernoult, Benjamin Scellier, Yoshua Bengio, Julie Grollier, and Damien
+ Querlioz. Scaling equilibrium propagation to deep convnets by drastically reducing its gradient
+ estimator bias. Frontiers in Neuroscience, 15:633674, 2021. ISSN 1662-453X. doi: 10.3389/
+ fnins.2021.633674.
+Yann Lecun. A theoretical framework for back-propagation. In Proceedings of the 1988 Connec-
+ tionist Models Summer School, 1988.
+Robert Legenstein and Wolfgang Maass. Edge of chaos and prediction of computational perfor-
+ mance for neural circuit models. Neural Networks, 20(3):323–334, 2007. ISSN 08936080. doi:
+ 10.1016/j.neunet.2007.04.017.
+Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic
+ feedback weights support error backpropagation for deep learning. Nature Communications, 7
+ (1):13276, 2016. ISSN 2041-1723. doi: 10.1038/ncomms13276.
+Zhuo Liu, Hao Shu, Linmiao Wang, Xu Meng, Yousheng Wang, Xuancheng Li, Wei Wang, and
+ Tao Chen. Adjoint propagation of error signal through modular recurrent neural networks for
+ biologically plausible learning. eLife, 15:e108237, 2026. ISSN 2050-084X. doi: 10.7554/eLife.
+ 108237. URL https://doi.org/10.7554/eLife.108237.
+Adam S. Lowet, Qiao Zheng, Sara Matias, Jan Drugowitsch, and Naoshige Uchida. Distribu-
+ tional reinforcement learning in the brain. Trends in Neurosciences, 43(12):980–997, 2020.
+ ISSN 0166-2236. doi: https://doi.org/10.1016/j.tins.2020.09.004. URL https://www.
+ sciencedirect.com/science/article/pii/S0166223620301983.
+Christopher W. Lynn and Danielle S. Bassett. The physics of brain network structure, function
+ and control. Nature Reviews Physics, 1(5):318–332, 2019. ISSN 2522-5820. doi: 10.1038/
+ s42254-019-0040-8.
+Nikola T. Markov, Mária Ercsey-Ravasz, David C. Van Essen, Kenneth Knoblauch, Zoltán
+ Toroczkai, and Henry Kennedy. Cortical high-density counterstream architectures. Science, 342
+ (6158), 2013. ISSN 0036-8075 1095-9203. doi: 10.1126/science.1238406.
+Jorge F. Mejias, John D. Murray, Henry Kennedy, and Xiao-Jing Wang. Feedforward and feedback
+ frequency-dependent interactions in a large-scale laminar network of the primate cortex. Science
+ Advances, 2(11):e1601335, 2016. doi: 10.1126/sciadv.1601335. URL
+ https://www.science.org/doi/abs/10.1126/sciadv.1601335.
+Jan Melchior and Laurenz Wiskott. Hebbian-descent, 2019. URL https://arxiv.org/abs/
+ 1905.10585.
+Alexander Meulemans, Matilde Tristany Farinha, Javier Garcia Ordonez, Pau Vilimelis Aceituno,
+ João Sacramento, and Benjamin F. Grewe. Credit assignment in neural networks through deep
+ feedback control. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman
+ Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 4674–4687.
+ Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_
+ files/paper/2021/file/25048eb6a33209cb5a815bff0cf6887c-Paper.pdf.
+Georgios Michalareas, Julien Vezoli, Stan van Pelt, Jan-Mathijs Schoffelen, Henry Kennedy, and
+ Pascal Fries. Alpha-beta and gamma rhythms subserve feedback and feedforward influences
+ among human visual cortical areas. Neuron, 89(2):384–397, 2016. ISSN 0896-6273. doi:
+ https://doi.org/10.1016/j.neuron.2015.12.018. URL https://www.sciencedirect.com/
+ science/article/pii/S0896627315011204.
+John Miller and Moritz Hardt. Stable recurrent models. In International Conference on Learning
+ Representations, 2019. URL https://openreview.net/forum?id=Hygxb2CqKm.
+Beren Millidge, Yuhang Song, Tommaso Salvatori, Thomas Lukasiewicz, and Rafal Bogacz. Back-
+ propagation at the infinitesimal inference limit of energy-based models: Unifying predictive cod-
+ ing, equilibrium propagation, and contrastive hebbian learning. In The Eleventh International
+ Conference on Learning Representations, 2023. URL https://openreview.net/forum?
+ id=nIMifqu2EO.
+Javier R. Movellan. Contrastive Hebbian Learning in the Continuous Hopfield Model, pp.
+ 10–17. Morgan Kaufmann, 1991. ISBN 978-1-4832-1448-1. doi: https://doi.org/10.1016/
+ B978-1-4832-1448-1.50007-X. URL https://www.sciencedirect.com/science/
+ article/pii/B978148321448150007X.
+Mitsumasa Nakajima, Yongbo Zhang, Katsuma Inoue, Yasuo Kuniyoshi, Toshikazu Hashimoto,
+ and Kohei Nakajima. Reservoir direct feedback alignment: deep learning by physical dynamics.
+ Communications Physics, 7(1):411, 2024. ISSN 2399-3650. doi: 10.1038/s42005-024-01895-0.
+Arild Nøkland. Direct feedback alignment provides learning in deep neural net-
+ works. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Ad-
+ vances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.,
+ 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/
+ file/d490d7b4576290fa60eb31b5fc917ad1-Paper.pdf.
+Peter O’Connor, Efstratios Gavves, and Max Welling. Initialized equilibrium propagation for
+ backprop-free training. In International Conference on Learning Representations. ICLR, 2019.
+ URL https://openreview.net/forum?id=B1GMDsR5tm.
+Alexander Ororbia and Ankur Mali. Biologically motivated algorithms for propagating local target
+ representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.
+ 4651–4658. AAAI, 2019.
+Alexander Ororbia, Patrick Haffner, David Reitter, and C. Lee Giles. Learning to adapt by minimiz-
+ ing discrepancy, 2017.
+Alexander G. Ororbia. Brain-inspired machine intelligence: A survey of neurobiologically-plausible
+ credit assignment, 2023. URL https://arxiv.org/abs/2312.09257.
+Alexander G. Ororbia, Ankur Mali, Daniel Kifer, and C. Lee Giles. Backpropagation-free deep
+ learning with recursive local representation alignment. In Proceedings of the AAAI Conference on
+ Artificial Intelligence, volume 37, pp. 9327–9335. AAAI, 2023. URL https://ojs.aaai.
+ org/index.php/AAAI/article/view/26118.
+Matthew G. Perich and Kanaka Rajan. Rethinking brain-wide interactions through multi-region
+ ‘network of networks’ models. Current Opinion in Neurobiology, 65:146–151, 2020. ISSN
+ 09594388. doi: 10.1016/j.conb.2020.11.003.
+D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating
+ errors. Nature, 323(6088):533–536, 1986. ISSN 0028-0836. doi: 10.1038/323533a0.
+João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic corti-
+ cal microcircuits approximate the backpropagation algorithm. In S. Bengio, H. Wal-
+ lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Ad-
+ vances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.,
+ 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/
+ file/1dc3a89d0d440ba31729b0ba74b93a33-Paper.pdf.
+Tommaso Salvatori, Luca Pinchetti, Beren Millidge, Yuhang Song, Tianyi Bao, Rafal Bogacz, and
+ Thomas Lukasiewicz. Learning on arbitrary graph topologies via predictive coding. In Alice H.
+ Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Informa-
+ tion Processing Systems, 2022. URL https://openreview.net/forum?id=dqO59nI_
+ R9A.
+Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-
+ based models and backpropagation. Frontiers in Computational Neuroscience, 11:24, 2017. ISSN
+ 1662-5188. doi: 10.3389/fncom.2017.00024.
+Benjamin Scellier, Anirudh Goyal, Jonathan Binas, Thomas Mesnard, and Yoshua Bengio. Gener-
+ alization of equilibrium propagation to vector field dynamics, 2018. URL https://arxiv.
+ org/abs/1808.04873.
+Benjamin Scellier, Maxence Ernoult, Jack Kendall, and Suhas Kumar. Energy-based learn-
+ ing algorithms for analog computing: a comparative study. In A. Oh, T. Naumann,
+ A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural In-
+ formation Processing Systems, volume 36, pp. 52705–52731. Curran Associates, Inc.,
+ 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/
+ file/a52b0d191b619477cc798d544f4f0e4b-Paper-Conference.pdf.
+João D. Semedo, Anna I. Jasper, Amin Zandvakili, Aravind Krishna, Amir Aschner, Christian K.
+ Machens, Adam Kohn, and Byron M. Yu. Feedforward and feedback interactions between visual
+ cortical areas use different population activity patterns. Nature Communications, 13(1):1099,
+ 2022. ISSN 2041-1723. doi: 10.1038/s41467-022-28552-w. URL https://doi.org/10.
+ 1038/s41467-022-28552-w.
+Shawn C. Shadden, Francois Lekien, and Jerrold E. Marsden. Definition and properties of la-
+ grangian coherent structures from finite-time lyapunov exponents in two-dimensional aperi-
+ odic flows. Physica D: Nonlinear Phenomena, 212(3-4):271–304, 2005. ISSN 01672789. doi:
+ 10.1016/j.physd.2005.10.007.
+Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
+ recognition. In International Conference on Learning Representations, 2015.
+Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
+ Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von
+ Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Ad-
+ vances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.,
+ 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/
+ file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
+Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensem-
+ bles of relatively shallow networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and
+ R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran
+ Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/
+ paper/2016/file/37bc2f75bf1bcfe8450a1a41c200364c-Paper.pdf.
+Ran Wang, Xupeng Chen, Amirhossein Khalilian-Gourtani, Leyao Yu, Patricia Dugan, Daniel Friedman,
+ Werner Doyle, Orrin Devinsky, Yao Wang, and Adeen Flinker. Distributed feedforward
+ and feedback cortical processing supports human speech production. Proceedings of the National
+ Academy of Sciences, 120(42):e2300255120, 2023. doi: 10.1073/pnas.2300255120. URL
+ https://doi.org/10.1073/pnas.2300255120.
+D. J. Watts and S. H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393
+ (6684):440–442, 1998. ISSN 0028-0836. doi: 10.1038/30918.
+Alan Wolf, Jack B. Swift, Harry L. Swinney, and John A. Vastano. Determining lyapunov
+ exponents from a time series. Physica D: Nonlinear Phenomena, 16(3):285–317, 1985.
+ ISSN 0167-2789. doi: https://doi.org/10.1016/0167-2789(85)90011-9. URL https://www.
+ sciencedirect.com/science/article/pii/0167278985900119.
+X. Xie and H. S. Seung. Equivalence of backpropagation and contrastive hebbian learning in a
+ layered network. Neural Computation, 15(2):441–454, 2003. ISSN 0899-7667. doi: 10.1162/
+ 089976603762552988.
+Izzet B. Yildiz, Herbert Jaeger, and Stefan J. Kiebel. Re-visiting the echo state property. Neural
+ Networks, 35:1–9, 2012. ISSN 08936080. doi: 10.1016/j.neunet.2012.07.005.
+
+A THE DYNAMICS OF THE RNN
+We quantify the convergence property of the recurrent neural network (RNN) with the maximum Lyapunov
+exponent (MLE) (Wolf et al., 1985) and the finite-time maximum Lyapunov exponent (FTMLE)
+(Kanno & Uchida, 2014). When the MLE/FTMLE is large, the RNN converges slowly or not
+at all. To compute the MLE and FTMLE, we first initialize a random perturbation vector $\delta_0$. Then we
+record the sequence of states $s^0[t]$ with $t = 0, 1, 2, \ldots, T_e - 1$ corresponding to the last sample of a
+training set (see Figure 2 in the main text), and run the following steps:
+
+ 1. Normalize the perturbation vector to unit length:
+ $\delta_t \leftarrow \delta_t / \|\delta_t\|$
+ 2. Calculate the Jacobian matrix:
+ $J(s^0[t]) = \partial F(s^0[t], b) / \partial s^0[t]$
+ 3. Update the perturbation:
+ $\delta_{t+1} = J(s^0[t]) \cdot \delta_t$
+ 4. Record:
+ $r_t = \ln \|\delta_{t+1}\|$
+
+The maximum Lyapunov exponent is computed as $\lambda_{\max} = \frac{1}{T_e} \sum_{t=0}^{T_e - 1} r_t$ for a sufficiently large $T_e$
+(default $T_e = 500$). The results at any $T < T_e$ are the FTMLE.
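+
+The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather
+than the authors' implementation: the toy update F(s, b) = tanh(W · s + b), the function names, and
+the choice to generate the trajectory inside the loop (instead of replaying recorded states) are all
+assumptions made for the example.
+

```python
import numpy as np


def toy_step(W, b, s):
    """Hypothetical RNN update F(s, b) = tanh(W @ s + b) (an assumption)."""
    return np.tanh(W @ s + b)


def toy_jacobian(W, b, s):
    """Jacobian dF/ds of the toy update, evaluated at state s."""
    h = W @ s + b
    return (1.0 - np.tanh(h) ** 2)[:, None] * W


def lyapunov_exponents(W, b, s_init, T_e=500, seed=0):
    """Return (MLE estimate, FTMLE curve) following steps 1-4."""
    rng = np.random.default_rng(seed)
    delta = rng.standard_normal(s_init.shape[0])   # random perturbation
    s = np.array(s_init, dtype=float)
    r = np.empty(T_e)
    for t in range(T_e):
        delta = delta / np.linalg.norm(delta)      # step 1: normalize
        J = toy_jacobian(W, b, s)                  # step 2: Jacobian
        delta = J @ delta                          # step 3: propagate
        r[t] = np.log(np.linalg.norm(delta))       # step 4: record log growth
        s = toy_step(W, b, s)                      # advance the trajectory
    # Running mean of r: entry T-1 is the finite-time exponent over T steps.
    ftmle = np.cumsum(r) / np.arange(1, T_e + 1)
    return ftmle[-1], ftmle
```

+
+A negative estimate indicates that perturbations shrink along the trajectory, i.e. the toy network
+converges; the last entry of the returned curve is the estimate at the full horizon.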
+Figures S1–S2 show the FTMLE, MLE, training accuracy and test accuracy versus epochs for different
+models. In all cases, smaller βi usually yields smaller (FT)MLE, whereas larger βi does not
+always lead to larger (FT)MLE because the activation function saturates, and saturation diminishes
+the perturbation.
+For the 2-hidden-layer RNN, smaller feedback scaling βi yields steady training progress and better
+accuracy. Figure S3 plots the FTMLE and test accuracy against feedback scaling for different numbers
+of hidden layers. It shows that smaller βi is favorable for shallow networks, because the RNN converges
+more easily (as indicated by the FTMLE). For deeper networks (5 hidden layers or more), however, smaller
+βi degrades performance because of the vanishing gradient.
+A further comparison between our FRE-RNN incorporating a convolutional structure and previous
+work (Ernoult et al., 2019) is plotted in Figure S4. These results suggest that small feedback
+scaling (βi = 0.01) leads to a smoother training process.
+
+B GRADIENT VANISHING AND THE RESIDUAL CONNECTIONS
+Figures S5 and S6 plot the error of each neuron versus epoch at different βi. For a 2-hidden-layer
+RNN, the best performance is obtained at βi = 0.001. In this situation, the error of the first hidden
+layer is at least two orders of magnitude smaller than that of the second hidden layer. At βi = 2, the error also
+decreases from higher layers (high-index neurons, closer to the output layer) to lower layers, which is attributed
+to the saturation of the activation function. In general, training progresses more steadily for
+smaller βi despite the vanishing gradient, which also applies to deeper networks (up to 10 hidden
+layers).
+To eliminate the vanishing gradient in EP, direct feedback from the higher layers or local amplification
+(with a higher learning rate) is unavoidable (Nøkland, 2016; Ororbia et al., 2023). Figure S7
+shows the effect of residual connections. βi = 0.1 yields the best accuracy, 97.5%, owing to the balance
+between gradient flow and convergence.
+Figure S8 shows how the testing accuracy varies with the connection probability P of AGT with 10
+hidden layers. Apart from the connections of the layered model, the connection between any two hidden
+layers is generated with probability P, i.e., we first use P to decide whether a connection between any
+two layers will be established. As P increases, the accuracy first rises, peaks at P = 0.2, and decreases
+as P approaches 1. The reason for this behavior is yet to be explored.
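+
+The sampling rule for the connection probability P can be sketched as follows. This is only an
+illustration under stated assumptions: the layer-level boolean mask, the symmetric (undirected)
+treatment of each layer pair, and the name sample_layer_connections are hypothetical and not taken
+from the released code.
+

```python
import numpy as np


def sample_layer_connections(n_hidden, p, seed=0):
    """Boolean layer-to-layer mask for a hypothetical AGT-style topology.

    Adjacent layers (the layered model) are always connected; every other
    pair of hidden layers is connected with probability p. The mask is
    assumed symmetric at the layer level.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n_hidden, n_hidden), dtype=bool)
    for i in range(n_hidden):
        for j in range(i + 1, n_hidden):
            if j == i + 1:
                connected = True                # connection of the layered model
            else:
                connected = rng.random() < p    # sampled residual connection
            mask[i, j] = mask[j, i] = connected
    return mask


# Example: 10 hidden layers with P = 0.2.
mask = sample_layer_connections(10, 0.2)
n_residual = (int(mask.sum()) - 2 * 9) // 2     # pairs beyond the layered model
```

+
+With P = 0 the mask reduces to the plain layered model, and with P = 1 every pair of hidden layers
+is connected.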
+
+Figure S1: The FTMLE, MLE, training accuracy and testing accuracy of symmetric RNNs versus
+epochs with different feedback scaling βi (legend). First row: 2 hidden layers; Second row: 3 hidden
+layers; Third row: 5 hidden layers; Fourth row: 10 hidden layers. The activation is tanh. Each case
+is repeated 5 times.
+
+Figure S2: The FTMLE, MLE, training accuracy and testing accuracy of asymmetric RNNs versus
+epochs with different feedback scaling βi (legend). First row: 2 hidden layers; Second row: 3 hidden
+layers; Third row: 5 hidden layers; Fourth row: 10 hidden layers. The activation is tanh. Each case
+is repeated 5 times.
+
+
+
+
+Figure S3: The FTMLE and testing accuracy versus feedback scaling βi with different numbers of
+hidden layers. (a) Symmetric weights; (b) Asymmetric weights. The FTMLE and testing accuracy
+given here are their maxima over all epochs. Note that the 5-hidden-layer asymmetric RNN
+with large βi diverged, which results in missing data points in (b). Each case is repeated 5 times.
+
+Figure S4: Comparison of RNNs with embedded convolutional structure on MNIST between
+P-EP (a) (Ernoult et al., 2019) and our approach at different βi (b-d). We used the same parameters
+as the EP reference (Ernoult et al., 2019).
+
+Figure S5: For 2-hidden-layer RNN, the mean error of each neuron in the last batch and testing
+accuracy versus epochs at different βi . All neurons in the hidden layers and the output layer are
+indexed from the input to the output layer.
+
+Figure S6: For the 10-hidden-layer model, the mean error of each neuron in the last batch and
+testing accuracy versus epochs at different βi . All neurons in the hidden layers and the output layer
+are indexed from the input to the output layer.
+
+Figure S7: For the 10-hidden-layer model with residual connections, the mean error of each neuron
+in the last batch and testing accuracy versus epochs at different βi . All neurons in the hidden layers
+and the output layer are indexed from the input to the output layer.
+
+
+
+
+Figure S8: The testing accuracy on MNIST varies with the connection probability P of AGT with
+10 hidden layers. The experiments are repeated 5 times.
+
+Table S1: Testing accuracy (mean of 5 repeated experiments) with different feedback scaling βi. By
+default, T = 10 × N_HiddenLayer, K = 5 × N_HiddenLayer. Each hidden layer has 64 nodes.
+Architecture-connections  βi=0.001  βi=0.01   βi=0.1    βi=0.25   βi=1      βi=2      βi=4
+2HL-symm                  97.69%    97.57%    97.25%    96.22%    93.12%    66.04%    40.92%
+3HL-symm                  97.22%    97.64%    97.41%    96.60%    55.86%    32.64%    22.11%
+5HL-symm                  93.54%    95.54%    97.60%    90.63%    25.31%    17.88%    14.61%
+10HL-symm                 87.15%    89.99%    92.54%    41.84%    14.07%    14.30%    14.23%
+10HL-Residual-symm        –         97.52%    97.46%    –         95.51%    –         –
+conv-symm                 –         99.15%    98.71%    –         11.35%    –         –
+2HL-asymm                 96.96%    96.97%    96.88%    96.79%    93.88%    91.81%    89.91%
+3HL-asymm                 95.17%    96.91%    96.76%    96.66%    91.21%    54.65%    26.72%
+5HL-asymm                 91.14%    92.34%    96.41%    96.35%    17.15%    11.35%    13.07%
+10HL-asymm                84.27%    85.83%    87.79%    90.97%    16.13%    14.21%    16.67%
+10HL-AGT-asymm            –         96.37%    96.75%    –         33.31%    –         –
+
+
+
+C EQUIVALENCE WITH EP AND BP UNDER THE CONDITION OF INFINITESIMAL INFERENCE LIMIT
+
+
+
+
+Figure S9: A layered network model used to illustrate the process of backpropagation (BP), local
+representation alignment (LRA), and EP. Note that the final prediction layer ·p corresponds to the
+third layer with subindex ·3 . For LRA, we use βLRA instead of β1 and βf . For BP, the feedback
+(orange) paths are absent.
+
+In this section, we will use the infinitesimal inference limit (Millidge et al., 2023) to derive the
+equivalence of EP with LRA and BP.
+
+C.1 BACKPROPAGATION
+
+When we remove the feedback connection of a 2-hidden-layer RNN shown in Figure S9, a feedfor-
+ward network is left and can be trained with BP. The forward process of BP is described by:
+
+
+ s1 = ρ(h1 ), h1 = W0 · s0 ,
+ s2 = ρ(h2 ), h2 = W1 · s1 , (S1)
+ sp = hp , hp = Wf · s2 .
+
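+Defining a loss $L_{BP} = \frac{1}{2}(s_p - s_{tar})^2$, the weights adjust according to the gradient of the loss. Taking
+$\Delta W_0$ as an example:
+$$
+\begin{aligned}
+\Delta W_0 &= -\frac{\partial L_{BP}}{\partial W_0} \\
+ &= -\left[\rho'(h_1) \odot W_1^\top \cdot \left(\rho'(h_2) \odot W_f^\top \cdot (s_p - s_{tar})\right)\right] \cdot (s_0)^\top,
+\end{aligned} \tag{S2}
+$$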
+where “⊙” means Hadamard product (element-wise product), “·” means scalar or matrix multipli-
+cation. For two vectors/matrices, “⊙” requires identical dimensions and computes element-wise
+products. Broadcasting rules may apply (e.g., a column vector vm×1 ⊙ Am×n scales each column
+of A by v).
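+
+Equation S2 can be checked numerically on a small random network. The sketch below is an
+illustration under assumptions, not the authors' code: it takes ρ = tanh, uses arbitrary layer sizes
+and a fixed random seed, and compares the closed-form update with a central finite-difference
+estimate of −∂L_BP/∂W0. All function names are hypothetical.
+

```python
import numpy as np


def rho(h):
    return np.tanh(h)                     # assumed activation


def rho_prime(h):
    return 1.0 - np.tanh(h) ** 2


def loss_bp(W0, W1, Wf, s0, s_tar):
    """L_BP = 1/2 * ||s_p - s_tar||^2 for the forward pass of Equation S1."""
    s1 = rho(W0 @ s0)
    s2 = rho(W1 @ s1)
    sp = Wf @ s2
    return 0.5 * np.sum((sp - s_tar) ** 2)


def delta_W0_bp(W0, W1, Wf, s0, s_tar):
    """Closed-form update of Equation S2."""
    h1 = W0 @ s0
    h2 = W1 @ rho(h1)
    sp = Wf @ rho(h2)
    d2 = rho_prime(h2) * (Wf.T @ (sp - s_tar))   # error at hidden layer 2
    d1 = rho_prime(h1) * (W1.T @ d2)             # error at hidden layer 1
    return -np.outer(d1, s0)


def delta_W0_numeric(W0, W1, Wf, s0, s_tar, eps=1e-6):
    """Central finite-difference estimate of -dL_BP/dW0."""
    grad = np.zeros_like(W0)
    for idx in np.ndindex(*W0.shape):
        W_plus, W_minus = W0.copy(), W0.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (loss_bp(W_plus, W1, Wf, s0, s_tar)
                     - loss_bp(W_minus, W1, Wf, s0, s_tar)) / (2 * eps)
    return -grad


rng = np.random.default_rng(0)
W0 = 0.5 * rng.standard_normal((5, 4))
W1 = 0.5 * rng.standard_normal((6, 5))
Wf = 0.5 * rng.standard_normal((3, 6))
s0 = rng.standard_normal(4)
s_tar = rng.standard_normal(3)

max_diff = np.max(np.abs(delta_W0_bp(W0, W1, Wf, s0, s_tar)
                         - delta_W0_numeric(W0, W1, Wf, s0, s_tar)))
```

+
+Here the element-wise products are written with NumPy's `*` on vectors, which matches the "⊙"
+convention above once the matrix-vector products have been evaluated.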
+
+C.2 LOCAL REPRESENTATION ALIGNMENT
+
+LRA is an alternative training method following the principle of discrepancy reduction (Ororbia
+et al., 2017; Ororbia & Mali, 2019). It can be divided into two phases: 1) the network runs the
+forward process, producing latent representations of the input samples. 2) the weights adjust in the
+direction of reducing the mismatch between current latent representations and target representations
+in each layer.
+The forward process is the same as BP:
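+$$
+\begin{aligned}
+s_1^0 &= \rho(h_1^0), & h_1^0 &= W_0 \cdot s_0, \\
+s_2^0 &= \rho(h_2^0), & h_2^0 &= W_1 \cdot s_1^0, \\
+s_p^0 &= h_p^0, & h_p^0 &= W_f \cdot s_2^0,
+\end{aligned} \tag{S3}
+$$
+where $s_i^0$ are interpreted as the latent representations. The prediction error is $e_p = s_{tar} - s_p^0$. Then
+we can get the target representations of the second hidden layer:
+$$
+s_2^{\beta_{LRA}} = \rho(h_2^{\beta_{LRA}}), \quad h_2^{\beta_{LRA}} = W_1 \cdot s_1^0 + \beta_{LRA} \cdot B_f \cdot e_p. \tag{S4}
+$$
+The same goes for the first hidden layer:
+$$
+s_1^{\beta_{LRA}} = \rho(h_1^{\beta_{LRA}}), \quad h_1^{\beta_{LRA}} = W_0 \cdot s_0 + \beta_{LRA} \cdot B_1 \cdot e_2, \quad e_2 = s_2^{\beta_{LRA}} - s_2^0. \tag{S5}
+$$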
+LRA defines the loss as the total discrepancy between latent representations and target representations:
+$$
+L_{LRA} = \sum_{i=1}^{L} k_i L_i(s_i^0, s_i^{\beta_{LRA}}) = \sum_{i=1}^{L} \frac{1}{2}(s_i^0 - s_i^{\beta_{LRA}})^2. \tag{S6}
+$$
+The weight $W_i$ adjusts according to the local mismatch between $s_{i+1}^0$ and $s_{i+1}^{\beta_{LRA}}$:
+$$
+\begin{aligned}
+\Delta W_i &= -\frac{\partial k_i L_i(s_{i+1}^0, s_{i+1}^{\beta_{LRA}})}{\partial W_i} \\
+ &= (s_{i+1}^{\beta_{LRA}} - s_{i+1}^0) \odot \rho'(h_{i+1}^0) \cdot (s_i^0)^\top \\
+ &\approx (s_{i+1}^{\beta_{LRA}} - s_{i+1}^0) \cdot (s_i^0)^\top,
+\end{aligned} \tag{S7}
+$$
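+where the derivative of the activation function is omitted in the last row, a useful practice common
+in LRA (Melchior & Wiskott, 2019; Ororbia & Mali, 2019; Ororbia et al., 2023). When $\beta_{LRA} \to 0$,
+$s_i^{\beta_{LRA}} \to s_i^0$ and $h_i^{\beta_{LRA}} \to h_i^0$, then
+$$
+\begin{aligned}
+e_i &= s_i^{\beta_{LRA}} - s_i^0 = \rho(h_i^{\beta_{LRA}}) - \rho(h_i^0) \\
+ &= \rho(h_i^0 + \beta_{LRA} \cdot B_i \cdot e_{i+1}) - \rho(h_i^0) \\
+ &\approx \left[\rho(h_i^0) + \rho'(h_i^0) \odot (\beta_{LRA} \cdot B_i \cdot e_{i+1}) - \rho(h_i^0)\right]_{\beta_{LRA} \to 0} \\
+ &= \rho'(h_i^0) \odot (\beta_{LRA} \cdot B_i \cdot e_{i+1}).
+\end{aligned} \tag{S8}
+$$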
+
+The approximation in Equation S8 is based on a first-order Taylor expansion of ρ(h0i + ∆h) around
+h0i , where ∆h = βLRA · Bi · ei+1 . For a small perturbation ∆h → 0, the Taylor expansion gives:
+ ρ(h0i + ∆h) = ρ(h0i ) + ρ′ (h0i ) · ∆h + O(∆h2 ), (S9)
+ 2
+When βLRA → 0, higher order terms O(∆h ) are negligible, leaving only the linear terms. We
+arrive at the last row after canceling out ρ(h0i ). There we can express the weight adjustments as
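+    \Delta W_0 = e_1 \cdot (s_0^0)^\top
+               = [\rho'(h_1^0) \odot (\beta_{LRA} \cdot B_1 \cdot (\rho'(h_2^0) \odot (\beta_{LRA} \cdot B_f \cdot (s_{tar} - s_p))))] \cdot (s_0)^\top
+               \overset{B_i = (W_i)^\top}{=} -\beta_{LRA} \cdot \beta_{LRA} \cdot [\rho'(h_1^0) \odot (W_1^\top \cdot (\rho'(h_2^0) \odot (W_f^\top \cdot (s_p - s_{tar}))))] \cdot (s_0)^\top,        (S10)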
+which is the same as BP (Equation S2) up to a constant factor. Thus, LRA approximates BP in the
+weak-feedback limit. An LRA algorithm for a 2-hidden-layer network is described in Algorithm S1. The
+feedback weights in LRA need not be learned here, but can be kept symmetric with the feedforward
+weights.
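
The first-order step behind Equations S8 and S9 can be checked numerically. The sketch below is an illustration only, not code from the paper: it assumes ρ = tanh, random vectors for h and for the feedback direction, and a hypothetical helper name `taylor_remainder`. It measures how far ρ(h + Δh) − ρ(h) is from ρ′(h) ⊙ Δh as the feedback scale shrinks.

```python
import numpy as np

def taylor_remainder(h, direction, beta):
    """Largest absolute gap between rho(h + dh) - rho(h) and rho'(h) * dh,
    with rho = tanh (an assumption) and dh = beta * direction."""
    dh = beta * direction
    exact = np.tanh(h + dh) - np.tanh(h)
    linear = (1.0 - np.tanh(h) ** 2) * dh   # rho'(h) ⊙ dh
    return float(np.max(np.abs(exact - linear)))

rng = np.random.default_rng(0)
h = rng.normal(size=64)            # stand-in for the pre-activations h_i^0
direction = rng.normal(size=64)    # stand-in for B_i · e_{i+1}

# The remainder should shrink roughly quadratically with beta.
errors = {beta: taylor_remainder(h, direction, beta) for beta in (1e-1, 1e-2, 1e-3)}
for beta, err in errors.items():
    print(f"beta={beta:g}  max remainder={err:.3e}")
```

Under these assumptions, reducing the scale by a factor of 10 should reduce the remainder by roughly a factor of 100, which is the O(Δh²) behaviour stated in Equation S9.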
+
+
+C.3 EQUILIBRIUM PROPAGATION
+
+We can also formulate EP in terms of discrepancy reduction. In EP (Algorithm 1 in the main text),
+the network states evolve as follows (β = 0 for the first phase and β = βf for the second phase):
+
+
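+    h_1^\beta = W_0 \cdot s_0^\beta + \beta_1 \cdot B_1 \cdot s_2^\beta,
+    h_2^\beta = W_1 \cdot s_1^\beta + \beta_f \cdot B_f \cdot e_p,
+    h_p^\beta = W_f \cdot s_2^\beta,
+    s_1^\beta, s_2^\beta, s_p^\beta = \rho(h_1^\beta), \rho(h_2^\beta), h_p^\beta,        (S11)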
+
+
+where $e_p = s_{tar} - s_p^0$ is the prediction error. The network converges to the final states
+$h_1^0, h_2^0, s_1^0, s_2^0$ in the free phase. The error of the $s_2$ neurons can be described by:
+
+    ds_2 = [\rho(h_2^{\beta_f})]_{\beta_f \to 0} - [\rho(h_2^0)]_{\beta_f = 0}
+         \approx \rho'(h_2^0) \odot (\beta_f \cdot B_f \cdot e_p),        (S12)
+
+where only the first-order infinitesimal term is retained as $\beta_1 \to 0$. The same goes for the first
+hidden layer:
+
+    ds_1 = [\rho(h_1^{\beta_f})]_{\beta_f \to 0} - [\rho(h_1^0)]_{\beta_f = 0}
+         \approx \rho'(h_1^0) \odot (\beta_1 \cdot B_1 \cdot (\rho'(h_2^0) \odot (\beta_f \cdot B_f \cdot e_p))).        (S13)
+
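+The weight $W_0$ can be updated by:
+
+    \Delta W_0 = \frac{ds_1 \cdot (s_0^0)^\top}{\beta_1 \cdot \beta_f} = [\rho'(h_1^0) \odot (B_1 \cdot (\rho'(h_2^0) \odot (B_f \cdot e_p)))] \cdot (s_0^0)^\top.        (S14)
+
+With $B_i = W_i^\top$,
+
+    ds_1 = \beta_f \cdot \beta_1 \cdot \rho'(h_1^0) \odot (W_1^\top \cdot (\rho'(h_2^0) \odot (W_f^\top \cdot (-(s_p - s_{tar}))))),        (S15)
+    \Delta W_0 = \frac{ds_1 \cdot (s_0^0)^\top}{\beta_1 \cdot \beta_f} = -[\rho'(h_1^0) \odot (W_1^\top \cdot (\rho'(h_2^0) \odot (W_f^\top \cdot (s_p - s_{tar}))))] \cdot (s_0^0)^\top.        (S16)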
+
+Note that compared with the weight update in the main text, the factor $1/(\beta_1 \cdot \beta_f)$ is added to recover a gradient
+amplitude similar to BP. Further, if we assume that the high-order infinitesimals in the first phase can
+be omitted, the dynamics of the RNN are governed by:
+
+    s_1^0 = \rho(h_1^0), \quad h_1^0 = [W_0 \cdot s_0 + \beta_1 \cdot B_1 \cdot s_2^0]_{\beta_1 \to 0} \approx W_0 \cdot s_0,        (S17)
+    s_2^0 = \rho(h_2^0), \quad h_2^0 = [W_1 \cdot s_1^0 + \beta_f \cdot B_f \cdot e_p]_{\beta_1 \to 0, \beta_f = 0} \approx W_1 \cdot s_1^0,        (S18)
+    s_p^0 = h_p^0, \quad h_p^0 = W_f \cdot s_2^0.        (S19)
+
+
+The information flow of the RNN degenerates into that of a feedforward network. This does not affect
+the error information $ds_i$, so Equation S16 approximates Equation S2 for BP. It also
+resembles LRA with low βLRA, which turns explicit error into implicit error. So far, we have shown
+that although the errors are obtained differently in EP, LRA, and BP, they are equivalent under the
+assumption of weak supervision and weak feedback.
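
The equivalence argued above can be probed on a toy problem. The following sketch is not the authors' implementation: it assumes ρ = tanh, symmetric feedback B_i = W_iᵀ, small random weights, synchronous state updates, and hypothetical function names. It runs the free and nudged dynamics of Equation S11, forms the update of Equation S14, and compares it with the BP direction of Equation S16 computed on the plain feedforward network.

```python
import numpy as np

def drho(h):
    return 1.0 - np.tanh(h) ** 2          # derivative of rho = tanh (assumption)

def ep_delta_w0(W0, W1, Wf, x, s_tar, beta1, betaf, T=100, K=100):
    """Free phase, nudged phase, then Delta W0 = ds1 · x^T / (beta1 · betaf)."""
    B1, Bf = W1.T, Wf.T                    # symmetric feedback weights (assumption)
    s1, s2 = np.zeros(W0.shape[0]), np.zeros(W1.shape[0])
    for _ in range(T):                     # free phase: beta_f = 0
        h1 = W0 @ x + beta1 * (B1 @ s2)
        h2 = W1 @ s1
        s1, s2 = np.tanh(h1), np.tanh(h2)
    s1_free = s1.copy()
    e_p = s_tar - Wf @ s2                  # prediction error from the free phase
    for _ in range(K):                     # nudged phase, started from the free state
        h1 = W0 @ x + beta1 * (B1 @ s2)
        h2 = W1 @ s1 + betaf * (Bf @ e_p)
        s1, s2 = np.tanh(h1), np.tanh(h2)
    return np.outer(s1 - s1_free, x) / (beta1 * betaf)

def bp_delta_w0(W0, W1, Wf, x, s_tar):
    """Negative gradient of 0.5 * ||s_p - s_tar||^2 with respect to W0."""
    h1 = W0 @ x
    h2 = W1 @ np.tanh(h1)
    s_p = Wf @ np.tanh(h2)
    d2 = drho(h2) * (Wf.T @ (s_p - s_tar))
    d1 = drho(h1) * (W1.T @ d2)
    return -np.outer(d1, x)

def cosine(a, b):
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
W0 = rng.normal(size=(16, 8)) / np.sqrt(8)
W1 = rng.normal(size=(16, 16)) / np.sqrt(16)
Wf = rng.normal(size=(4, 16)) / np.sqrt(16)
x, s_tar = rng.normal(size=8), np.eye(4)[1]

similarity = cosine(ep_delta_w0(W0, W1, Wf, x, s_tar, beta1=0.01, betaf=0.01),
                    bp_delta_w0(W0, W1, Wf, x, s_tar))
print(f"cosine similarity between EP and BP updates: {similarity:.4f}")
```

With both scalings set to 0.01 the two directions should be nearly parallel; raising `beta1` or `betaf` in this toy setting is a simple way to see the agreement degrade.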
+
+
+Algorithm S1 Local Representation Alignment (LRA)
+Input: (x, s_{tar})
+Parameter: θ = [W_0, W_1, W_f, B_f, B_1, \beta_{LRA}]
+Output: θ
+function FORWARD(θ, x)
+    s_0 ← x
+    h_1 ← W_0 · s_0,  s_1^0 ← \rho(h_1)
+    h_2 ← W_1 · s_1^0,  s_2^0 ← \rho(h_2)
+    s_p^0 ← W_f · s_2^0
+    Λ_1 ← [s_i^0], i = 0, 1, 2, p
+    return Λ_1
+end function
+function FEEDBACK(θ, Λ_1, s_{tar})
+    e_p ← s_{tar} − s_p^0
+    h_2 ← W_1 · s_1^0 + \beta_{LRA} · B_f · e_p,  s_2^{\beta_{LRA}} ← \rho(h_2)
+    e_2 ← s_2^{\beta_{LRA}} − s_2^0
+    h_1 ← W_0 · s_0 + \beta_{LRA} · B_1 · e_2,  s_1^{\beta_{LRA}} ← \rho(h_1)
+    e_1 ← s_1^{\beta_{LRA}} − s_1^0
+    Λ_2 ← [e_1, e_2, e_p]
+    return Λ_2
+end function
+function UPDATING-WEIGHTS(θ, Λ_1, Λ_2)
+    ΔW_i ← e_{i+1} · (s_i^0)^\top, i = 0, 1
+    ΔW_f ← e_p · (s_2^0)^\top
+end function
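
The three functions of Algorithm S1 can be condensed into a single NumPy routine. This is a minimal sketch under stated assumptions rather than the authors' code: ρ is taken to be tanh, the function name `lra_updates` is hypothetical, and the routine only returns the weight changes (applying them with a learning rate is left to the caller).

```python
import numpy as np

def lra_updates(W0, W1, Wf, B1, Bf, x, s_tar, beta_lra):
    """One LRA pass for a 2-hidden-layer network, returning (dW0, dW1, dWf)."""
    # FORWARD: latent representations s_i^0
    s0 = x
    s1 = np.tanh(W0 @ s0)
    s2 = np.tanh(W1 @ s1)
    sp = Wf @ s2
    # FEEDBACK: target representations and local errors
    e_p = s_tar - sp
    s2_target = np.tanh(W1 @ s1 + beta_lra * (Bf @ e_p))
    e2 = s2_target - s2
    s1_target = np.tanh(W0 @ s0 + beta_lra * (B1 @ e2))
    e1 = s1_target - s1
    # UPDATING-WEIGHTS: outer products of local error and presynaptic state
    return np.outer(e1, s0), np.outer(e2, s1), np.outer(e_p, s2)

rng = np.random.default_rng(0)
W0 = rng.normal(size=(6, 4)) / 2.0
W1 = rng.normal(size=(6, 6)) / 2.0
Wf = rng.normal(size=(3, 6)) / 2.0
x, s_tar = rng.normal(size=4), np.eye(3)[0]

# Feedback weights kept symmetric with the feedforward weights.
dW0, dW1, dWf = lra_updates(W0, W1, Wf, W1.T, Wf.T, x, s_tar, beta_lra=0.1)
print(dW0.shape, dW1.shape, dWf.shape)
```

Passing `W1.T` and `Wf.T` as feedback weights corresponds to the symmetric case mentioned in the text; any other matrices of matching shape could be supplied instead.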
+
+
+C.4 EXPERIMENTS FOR EQUIVALENCE WITH EP AND BP
+
+Prior work has shown that EP is equivalent to BPTT under specific conditions and can achieve
+comparable performance (Ernoult et al., 2019; Laborieux et al., 2021). As discussed in the previous
+section, although the overall architecture forms an RNN, the network behaves similarly to a
+feedforward model due to weak feedback connections.
+To show the equivalence of EP and BP experimentally, we compare our model with an
+FNN with the same feedforward weights trained by BP. We mainly compare the cosine similarity of states,
+bias gradients and weight gradients for the first batch (batch size 200), as given in Figure S10.
+Figure S10(a-c) shows the similarity for βi = 1 with different numbers of iterations. For the
+bias gradients, i.e., $ds_i$, the cosine similarity declines rapidly, indicating no similarity between our
+model and BP. With weak feedback βi = 0.1, as shown in Figure S10(d-f), the similarity of states
+approaches 1 and the bias-gradient similarity of the last 6 layers exceeds 0.5 for T = 500 and T = 50
+(last 4 layers for T = 20). These results provide further evidence that EP is equivalent to BP under the condition of
+weak feedback.
+We further studied the influence of βi on the cosine similarity. Figure S10(g) shows that larger
+βi leads to lower similarity of states. Figure S10(h) shows that a lower βi = 0.01 also leads to
+a decrease in similarity, which may be caused by insufficient precision of data storage (float32 by
+default). Therefore, we repeated the experiments with datatype float64. Figure S10(k,l) shows that the
+similarity of the gradient signal remains around 1 with βi = 0.1. This indicates that weak feedback does
+indeed lead to an exponential decline in gradient signals, thus requiring higher numerical precision.
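
The precision argument can be illustrated in isolation. In the sketch below (an illustration with assumed inputs, not the paper's experiment) the error signal is formed, as in EP, by subtracting two nearly identical states, ρ(h + δ) − ρ(h), with ρ = tanh. The scale of δ stands in for a product of several small feedback scalings, and the result is compared with the analytic first-order signal ρ′(h) ⊙ δ in float32 and float64.

```python
import numpy as np

def cosine(a, b):
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def difference_signal(h, delta, dtype):
    """Error signal obtained by subtracting two states stored in `dtype`."""
    h = h.astype(dtype)
    delta = delta.astype(dtype)
    return np.tanh(h + delta) - np.tanh(h)

rng = np.random.default_rng(0)
h = rng.normal(size=512)                       # stand-in pre-activations
direction = rng.normal(size=512)               # stand-in feedback direction
reference = (1.0 - np.tanh(h) ** 2) * direction  # analytic first-order signal

results = {}
for scale in (1e-2, 1e-5, 1e-8):
    results[scale] = tuple(
        cosine(difference_signal(h, scale * direction, dtype), reference)
        for dtype in (np.float32, np.float64)
    )
    print(f"scale={scale:g}  float32={results[scale][0]:.4f}  float64={results[scale][1]:.4f}")
```

In this toy setting the float64 signal should stay aligned with the reference down to the smallest scale, while the float32 signal loses alignment once the perturbation approaches the spacing of representable float32 values.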
+
+
+
+
+ [Figure S10 plot: twelve panels in four rows, (a)-(c), (d)-(f), (g)-(i), (j)-(l). In each row the
+ panels show, from left to right, states, bias gradients and weight gradients. y-axis: cosine
+ similarity; x-axis: layers fc1, fc2, ..., fc8, out. Legend for (a)-(f): T=500,K=200; T=50,K=20;
+ T=20,K=8. Legend for (g)-(l): βi = 0.0, 0.01, 0.1, 0.5, 1.]
+
+ Figure S10: Cosine similarity of states and gradients between our model and a feedforward model
+ trained by BP in an 8-hidden-layer FNN (states: $s_i^0$; bias gradients: $ds_i$; weight gradients: $\Delta W_i$).
+ The x-axis indexes the layers of the model. Error propagates layer by layer from the last layer "out" to the
+ first hidden layer "fc1". (a-c) Different numbers of iterations with feedback scaling βi = 1.
+ (d-f) Different numbers of iterations with small feedback scaling βi = 0.1. (g-i) Different
+ feedback scalings (T=50, K=20). (j-l) Same as (g-i) with datatype float64 (float32 by default).
+
+
+
+
+C.5 VERIFYING THE EFFECTIVENESS OF WEAK FEEDBACK IN EQUILIBRIUM PROPAGATION
+
+
+ [Figure S11 plot: test accuracy (y-axis) against the number of learning layers nll = 1, 3, 5
+ (x-axis, "Only last nll layers learning"), with one curve each for βi = 0.01 and βi = 0.1.]
+
+Figure S11: Test accuracy on MNIST as a function of nll for different βi. The experiments are
+repeated 5 times.
+
+To demonstrate that the lower layers of our model do receive meaningful credit signals,
+we report in Figure S11 the test accuracy when only the last nll layers are updated (i.e., the weights
+of all other layers are frozen). For a 5-hidden-layer model with βi = 0.1, updating only the final layer yields
+a test accuracy of about 85%. As nll increases to 5, the accuracy rises to around 97.5%. A
+similar trend is observed for the model with βi = 0.01. These results show that achieving over 97%
+accuracy requires effective gradient propagation to all layers, confirming that our model successfully
+delivers usable credit signals throughout the entire network.
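
One way to set up such a partial-learning experiment is to mask the weight updates by layer index. The sketch below is hypothetical (the helper names and the convention of counting layers from the input side are assumptions, and the paper does not specify how freezing was implemented).

```python
import numpy as np

def trainable_mask(num_layers, n_ll):
    """True for the last n_ll layers (counted from the input side), False for frozen ones."""
    if not 0 <= n_ll <= num_layers:
        raise ValueError("n_ll must lie between 0 and num_layers")
    return [i >= num_layers - n_ll for i in range(num_layers)]

def apply_updates(weights, deltas, n_ll, lr):
    """Add lr * delta to the trainable layers and leave frozen layers untouched."""
    mask = trainable_mask(len(weights), n_ll)
    return [W + lr * dW if train else W
            for W, dW, train in zip(weights, deltas, mask)]

weights = [np.zeros((2, 2)) for _ in range(5)]
deltas = [np.ones((2, 2)) for _ in range(5)]
updated = apply_updates(weights, deltas, n_ll=2, lr=0.1)
print([float(W[0, 0]) for W in updated])
```

Here only the last two of five weight matrices change; the error signals are still computed for every layer, so the mask affects learning but not the network dynamics.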
+
+C.6 ROBUSTNESS TO THE NOISE
+
+ Test accuracy (rows: weight noise intensity; columns: state noise intensity)
+
+ (a) 2HL        0       1e-05   0.001   0.1
+   0            0.972   0.972   0.972   0.904
+   0.001        0.972   0.972   0.972   0.896
+   0.01         0.917   0.917   0.912   0.689
+   0.1          0.194   0.206   0.196   0.217
+
+ (b) 3HL        0       1e-05   0.001   0.1
+   0            0.974   0.974   0.967   0.752
+   0.001        0.973   0.975   0.964   0.749
+   0.01         0.903   0.904   0.838   0.485
+   0.1          0.167   0.180   0.174   0.167
+
+ (c) 5HL        0       1e-05   0.001   0.1
+   0            0.976   0.969   0.814   0.469
+   0.001        0.973   0.965   0.812   0.464
+   0.01         0.876   0.815   0.520   0.299
+   0.1          0.134   0.135   0.134   0.135
+
+
+Figure S12: Maximum test accuracy of the model for different noise intensities on weights and states,
+with noise added during both training and testing. (a) With 2 hidden layers; (b) With 3 hidden layers; (c) With 5 hidden
+layers. The model is trained for 50 epochs and the experiments are repeated 5 times.
+
+To evaluate the robustness of the model, we introduce noise on weights and time-varying noise
+on states. Both are random Gaussian noise, imposed at each weight update and at each state update,
+respectively. The noise on weights is added directly to the weights, while the noise on states is added
+as the bias b in Equation 3.1. The mean absolute values of non-zero weights and neural activations
+after noiseless training are approximately 0.09 and 0.76 respectively.
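
The two noise sources can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function names are hypothetical, ρ is assumed to be tanh, the noise intensity is interpreted as the standard deviation of zero-mean Gaussian noise, and the state update is reduced to a single feedforward term.

```python
import numpy as np

def noisy_weight_update(W, dW, lr, sigma_w, rng):
    """Apply a weight update and add fresh Gaussian noise directly to the weights."""
    return W + lr * dW + rng.normal(0.0, sigma_w, size=W.shape)

def noisy_state_update(W, s_prev, sigma_s, rng):
    """One state update in which fresh Gaussian noise enters as a bias term."""
    b = rng.normal(0.0, sigma_s, size=W.shape[0])   # redrawn at every state update
    return np.tanh(W @ s_prev + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4)) * 0.1
s_prev = rng.normal(size=4)

W_noisy = noisy_weight_update(W, np.zeros_like(W), lr=0.1, sigma_w=0.01, rng=rng)
s_clean = noisy_state_update(W, s_prev, sigma_s=0.0, rng=rng)
s_noisy = noisy_state_update(W, s_prev, sigma_s=0.1, rng=rng)
print(np.abs(W_noisy - W).mean(), np.abs(s_noisy - s_clean).mean())
```

Comparing `sigma_w` with the reported mean absolute weight of about 0.09 gives a rough sense of scale: a weight noise of 0.01 is roughly a tenth of a typical weight, while 0.1 is of the same order as the weights themselves.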
+The accuracy of the model with two hidden layers under the two types of noise is presented in Figure
+S12(a). It maintains satisfactory performance when the standard deviation of the state noise reaches
+0.1 or the standard deviation of the weight noise reaches 0.01. In deeper structures (Figure S12(b,c)),
+the results for weight noise are consistent with these observations, demonstrating excellent
+robustness. However, the tolerance to time-varying state noise degrades significantly, which
+we attribute to layer-wise noise accumulation and the distortion of the weak gradient signal by the
+noise during training. To test this hypothesis, we impose the noise only at test time,
+
+
+ Test accuracy (rows: weight noise intensity; columns: state noise intensity, tick labels as printed)
+
+ (a) 2HL        0       0.001   0.001   0.1     0.5
+   0            0.973   0.973   0.973   0.971   0.944
+   0.001        0.971   0.971   0.974   0.970   0.945
+   0.01         0.919   0.916   0.919   0.918   0.912
+   0.1          0.199   0.187   0.194   0.219   0.177
+
+ (b) 3HL        0       0.001   0.001   0.1     0.5
+   0            0.974   0.976   0.975   0.974   0.940
+   0.001        0.975   0.973   0.973   0.972   0.945
+   0.01         0.908   0.903   0.908   0.902   0.894
+   0.1          0.176   0.182   0.152   0.167   0.142
+
+ (c) 5HL        0       0.001   0.001   0.1     0.5
+   0            0.975   0.977   0.977   0.977   0.919
+   0.001        0.973   0.973   0.973   0.973   0.928
+   0.01         0.885   0.876   0.880   0.873   0.842
+   0.1          0.133   0.128   0.143   0.163   0.142
+
+
+
+
+Figure S13: Maximum test accuracy of the model for different noise intensities on weights and states
+(the state noise is added only at test time). (a) With 2 hidden layers; (b) With 3 hidden layers; (c) With 5
+hidden layers. The model is trained for 50 epochs in a single experiment.
+
+
+and the test accuracy remains almost unaffected (Figure S13). Therefore, the network is potentially
+resilient to noise. However, how to improve resilience during training requires further study.
+
+
+D TRAINING DETAILS
+
+Table S2 provides the parameters of the Adam optimizer that are used in Tables S1–S2 (Kingma &
+Ba, 2015). The training details for Table 1 are given in Table S3. For convolutional architectures
+in EP, the training process can be described by Algorithm S2. The training sample is fed into the
+network through Conv_0. Then the state of the first layer goes through max pooling MaxPool_1 and
+convolution Conv_1 sequentially to reach the second layer. The second layer also feeds its states back
+to the first layer through transposed convolution ConvT_1 and max-unpooling MaxUnpool_1. After
+T iterations, the RNN converges to its steady states and produces outputs through MaxPool_2 and
+a fully connected layer. Then the prediction error is computed and used to nudge the RNN through
+the transpose of the fully connected layer and max-unpooling MaxUnpool_2. Note that the unpooling
+MaxUnpool_i requires the indices from the corresponding pooling MaxPool_i.
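
The dependence of unpooling on the pooling indices can be made concrete with a small NumPy sketch. It assumes non-overlapping 2×2 windows on a (channels, height, width) array with even height and width; the function names are hypothetical and the window size used in the paper's experiments is not stated here.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling on a (C, H, W) array; returns pooled values and argmax indices."""
    c, h, w = x.shape
    windows = (x.reshape(c, h // 2, 2, w // 2, 2)
                .transpose(0, 1, 3, 2, 4)
                .reshape(c, h // 2, w // 2, 4))
    idx = windows.argmax(axis=-1)
    pooled = np.take_along_axis(windows, idx[..., None], axis=-1)[..., 0]
    return pooled, idx

def max_unpool_2x2(y, idx):
    """Place each value of y back at the position recorded in idx; zeros elsewhere."""
    c, hp, wp = y.shape
    windows = np.zeros((c, hp, wp, 4), dtype=y.dtype)
    np.put_along_axis(windows, idx[..., None], y[..., None], axis=-1)
    return (windows.reshape(c, hp, wp, 2, 2)
                   .transpose(0, 1, 3, 2, 4)
                   .reshape(c, hp * 2, wp * 2))

x = np.arange(16, dtype=float).reshape(1, 4, 4)
pooled, idx = max_pool_2x2(x)
restored = max_unpool_2x2(pooled, idx)
print(pooled[0])
print(restored[0])
```

Because the feedback path reuses the indices recorded on the forward path, a feedback signal is routed only to the units that won the forward max, which is why the two operations have to be paired layer by layer.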
+For Table S2, the Adam optimizer is used for all experiments. The activation functions sigmoid-s
+and hard-sigmoid are defined as $\rho(x) = \frac{1}{1 + e^{-4(x - 0.5)}}$ and $\rho(x) = \max(\min(x, 1), 0)$, respectively
+(Ernoult et al., 2019). For 5-HL, 10-HL and 20-HL architectures, the Adam optimizer parameters
+are as shown in Table S2 (epoch: 50, batch size: 500). The inference details of the architecture
+shown in Figure 3b are described by Algorithm S3. The details for convolutional architectures are
+given in Table S3 and Figures S14–S15. The cosine-annealing scheduler is used in convolutional
+architectures for CIFAR-10 ($T_{max} = 50$, $\eta_{min} = 10^{-6}$).
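
The activation functions and the learning-rate schedule can be written down compactly. The sketch below rests on two assumptions that should be checked against the original code: the hard-sigmoid is read as a clamp to [0, 1], and the cosine-annealing schedule is taken to be the standard closed form without restarts. The initial rate 2.5 × 10⁻⁴ in the usage line is the CIFAR-10 value listed in Table S3.

```python
import math
import numpy as np

def sigmoid_s(x):
    """Shifted, steepened sigmoid: 1 / (1 + exp(-4 (x - 0.5)))."""
    return 1.0 / (1.0 + np.exp(-4.0 * (np.asarray(x, dtype=float) - 0.5)))

def hard_sigmoid(x):
    """Clamp to [0, 1] (assumed reading of the hard-sigmoid)."""
    return np.clip(np.asarray(x, dtype=float), 0.0, 1.0)

def cosine_annealing_lr(epoch, lr0, t_max=50, eta_min=1e-6):
    """Standard cosine annealing from lr0 at epoch 0 down to eta_min at epoch t_max."""
    return eta_min + 0.5 * (lr0 - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

print(sigmoid_s([0.0, 0.5, 1.0]))
print(hard_sigmoid([-1.0, 0.3, 2.0]))
print([cosine_annealing_lr(e, 2.5e-4) for e in (0, 25, 50)])
```

Both activations map into [0, 1] and are centred at x = 0.5, so they can be swapped without rescaling the states.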
+For MNIST, no pre-processing is used. For the CIFAR-10 dataset, we follow ref. (Scellier
+et al., 2023) to pre-process the images. We normalize the input images using mean µ =
+(0.4914, 0.4822, 0.4465) and standard deviation σ = 3 × (0.2023, 0.1944, 0.2010).
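
The CIFAR-10 normalization amounts to a per-channel shift and scale. The sketch below assumes images stored as floats in [0, 1] with layout (N, 3, H, W); the function name is hypothetical, and any data augmentation used in the cited reference is not covered.

```python
import numpy as np

MEAN = np.array([0.4914, 0.4822, 0.4465])
STD = 3.0 * np.array([0.2023, 0.1944, 0.2010])   # standard deviation scaled by 3, as in the text

def normalize_cifar10(images):
    """Per-channel normalization of a float array of shape (N, 3, H, W)."""
    return (images - MEAN[None, :, None, None]) / STD[None, :, None, None]

batch = np.random.default_rng(0).uniform(size=(2, 3, 32, 32))
normalized = normalize_cifar10(batch)
print(normalized.shape, normalized.min(), normalized.max())
```

Because the divisor is three times the usual standard deviation, the normalized inputs span a narrower range than with the conventional normalization.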
+The results for comparison of time consumption were obtained in a virtualized Windows 11 envi-
+ronment with Intel Xeon Gold 6238R CPU, 16 GB RAM, and Nvidia RTX A5000 (24 GB VRAM).
+Other results were obtained on a Windows 11 environment with Intel Core i5-12490F, 32 GB RAM,
+and Nvidia GTX 1650 (4 GB VRAM) or a Windows 11 environment with AMD R7-7700, 32 GB
+RAM, and Nvidia RTX 4070 (12 GB VRAM). The default numerical precision is float32 (single-
+precision float).
+
+
+                          Table S2: The parameters of the Adam optimizer.
+ Parameter Name                                        Default Value
+ Learning rate (MNIST / CIFAR-10)                      10^{-3} / 2 × 10^{-4}
+ First-order moment estimation decay rate (β1)         0.9
+ Second-order moment estimation decay rate (β2)        0.999
+ Small constant for numerical stability (ϵ)            10^{-8}
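
For reference, one step of the standard Adam rule with the Table S2 values looks as follows. This is the textbook update (Kingma & Ba, 2015) written as a hypothetical helper, not the paper's training loop. It is stated for a gradient to be descended, so an EP weight update ΔW would enter as grad = −ΔW. The β1 and β2 here are Adam's moment decay rates and are unrelated to the feedback scalings.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count. Returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
grad = np.array([1.0, -2.0])
theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by about one learning rate regardless of the gradient's magnitude.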
+
+
+
+Table S3: Training details for Table 1 and Table 2. The results of EB-EP and P-EP come from
+previous work (Ernoult et al., 2019). SGD refers to Stochastic Gradient Descent with mini-batches.
+ Architecture               Training approach          Optimizer  Epoch/Batch size - T/K  Learning rate              Weight decay
+ 2HL                        P-EP (sigmoid-s)           SGD        50/20 - 100/20          [0.005, 0.05, 0.2]         None
+ 2HL                        Proposed (tanh, Adam)      Adam       50/500 - 10/10          [0.001, 0.001, 0.001]      None
+ 3HL                        P-EP (sigmoid-s)           SGD        100/20 - 180/20         [0.002, 0.01, 0.05, 0.2]   None
+ 3HL                        Proposed-DLR (tanh)        SGD        100/20 - 18/10          [0.002, 0.01, 0.05, 0.2]   None
+ 3HL                        Proposed (tanh)            SGD        100/20 - 18/10          [0.1, 0.1, 0.1, 0.1]       None
+ 3HL                        Proposed (tanh, Adam)      Adam       50/500 - 18/10          10^{-3}                    None
+ 3HL                        BP (tanh, Adam)            Adam       50/500 - 1/1            10^{-3}                    None
+ Conv (Table 1)             P-EP (hard-sigmoid)        SGD        40/20 - 200/10          [0.015, 0.035, 0.15]       None
+ Conv (Table 1)             Proposed (hard-sigmoid)    SGD        40/128 - 20/10          [0.15, 0.35, 0.9]          10^{-5}
+ Conv (Table 1)             BP (hard-sigmoid)          SGD        40/128 - 1/1            [0.001, 0.02, 0.4]         10^{-5}
+ Conv (Table 2 MNIST)       Proposed (hard-sigmoid)    Adam       40/128 - 20/10          2 × 10^{-4}                10^{-6}
+ Conv (Table 2 MNIST)       BP (hard-sigmoid)          Adam       40/128 - 1/1            2 × 10^{-4}                10^{-6}
+ Conv (Table 2 CIFAR-10)    Proposed (hard-sigmoid)    Adam       50/128 - 40/10          2.5 × 10^{-4}              2 × 10^{-4}
+ Conv (Table 2 CIFAR-10)    BP (hard-sigmoid)          Adam       50/128 - 1/1            2.5 × 10^{-4}              2 × 10^{-4}
+
+
+
+
+ [Figure S14 diagram: 1@28x28 → Conv1 → 32@24x24 → MaxPool1 → 32@12x12 → Conv2 → 64@8x8 →
+ MaxPool2 → 64@4x4 → Dense → 1x10]
+
+                     Figure S14: Convolutional architectures for MNIST.
+
+
+
+
+ [Figure S15 diagram: 3@32x32 → Conv1 → 32@32x32 → MaxPool1 → 32@16x16 → Conv2 → 64@16x16 →
+ MaxPool2 → 64@8x8 → Conv3 → 128@8x8 → MaxPool3 → 128@4x4 → 1x10]
+
+                     Figure S15: Convolutional architectures for CIFAR-10.
+
+
+
+
+Algorithm S2 Two phases in EP training process for convolution architecture
+Input: Sample-label pairs (x, s_{tar})
+Parameter: θ = [W_0, W_1, W_f, B_f, B_1, α_1, β_1, β_{f1}]
+Output: θ
+function FIRST-PHASE(θ, x)
+    s_0 ← x
+    for t ← 1 to T do
+        h_1 ← Conv_0(s_0) + β_1 · MaxUnpool_1(ConvT_1(s_2^0))
+        h_2 ← Conv_1(MaxPool_1(s_1^0))
+        h_p ← W_f · Flatten(MaxPool_2(s_2^0))
+        s_1^0, s_2^0, s_p^0 ← \rho(h_1), \rho(h_2), SoftMax(h_p)
+    end for
+    Λ_1 ← [s_i^0], i = 0, 1, 2, p
+    return Λ_1
+end function
+function SECOND-PHASE(θ, Λ_1, s_{tar})
+    s_0, s_1^{β_{f1}}, s_2^{β_{f1}}, s_p^{β_{f1}} ← x, s_1^0, s_2^0, s_p^0
+    for t ← 1 to K do
+        e_p ← s_{tar} − s_p^{β_{f1}}
+        h_1 ← Conv_0(s_0) + β_1 · MaxUnpool_1(ConvT_1(s_2^{β_{f1}}))
+        h_2 ← Conv_1(MaxPool_1(s_1^{β_{f1}})) + β_f · MaxUnpool_2(Unflatten(W_f^\top e_p))
+        h_p ← W_f · Flatten(MaxPool_2(s_2^{β_{f1}}))
+        s_1^{β_{f1}}, s_2^{β_{f1}}, s_p^{β_{f1}} ← \rho(h_1), \rho(h_2), SoftMax(h_p)
+    end for
+end function
+
+
+
+
+Algorithm S3 EP with feedback scaling and residual connections (Figure 3b)
+Input: (x, s_{tar})
+Parameter: θ = [W_0, W_i, W_f, B_f, B_i, β_i, β_{f1}]
+Output: θ
+function ITERATION(θ, Λ_1, s_{tar})
+    for t ← 1 to K do
+        if Nudging Phase then
+            β_f ← β_{f1}
+            s_0, s_1^{β_{f1}}, s_2^{β_{f1}}, s_p^{β_{f1}} ← x, s_1^0, s_2^0, s_p^0
+        else
+            β_f ← 0
+            s_0 ← x
+        end if
+        h_1 ← W_0 s_0 + β_1 B_1 s_2^{β_f} + β_{4,1} B_{4,1} s_4^{β_f}
+        h_2 ← W_1 s_1^{β_f} + β_2 B_2 s_3^{β_f}
+        h_3 ← W_2 s_2^{β_f} + β_3 B_3 s_4^{β_f}
+        h_4 ← W_3 s_3^{β_f} + β_4 B_4 s_5^{β_f} + W_{1,4} s_1^{β_f} + β_{7,4} B_{7,4} s_7^{β_f}
+        h_5 ← W_4 s_4^{β_f} + β_5 B_5 s_6^{β_f}
+        h_6 ← W_5 s_5^{β_f} + β_6 B_6 s_7^{β_f}
+        h_7 ← W_6 s_6^{β_f} + β_7 B_7 s_8^{β_f} + W_{4,7} s_4^{β_f} + β_{10,7} B_{10,7} s_{10}^{β_f}
+        h_8 ← W_7 s_7^{β_f} + β_8 B_8 s_9^{β_f}
+        h_9 ← W_8 s_8^{β_f} + β_9 B_9 s_{10}^{β_f}
+        h_{10} ← W_9 s_9^{β_f} + β_f B_f e_p^{β_f} + W_{7,10} s_7^{β_f}
+        h_p ← W_f s_{10}^{β_f}
+        s_i^{β_f} ← \rho(h_i), i = 1, 2, . . . , 10
+        s_p^{β_f} ← SoftMax(h_p)
+    end for
+end function
+ \ No newline at end of file