| author | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
|---|---|---|
| committer | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
| commit | b83947778e2c776f757a07d4719b7ce961d7ed55 (patch) | |
| tree | b9cc01d7adda691d9156d9d04f4fb2f644674e96 /refs/fre_rnn_full.txt | |
Initial commit: ept — backprop-free equilibrium transformer (EP)
Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
Diffstat (limited to 'refs/fre_rnn_full.txt')
| -rw-r--r-- | refs/fre_rnn_full.txt | 1945 |
1 files changed, 1945 insertions, 0 deletions
diff --git a/refs/fre_rnn_full.txt b/refs/fre_rnn_full.txt
new file mode 100644
index 0000000..48da35c
--- /dev/null
+++ b/refs/fre_rnn_full.txt
@@ -0,0 +1,1945 @@
Published as a conference paper at ICLR 2026

TOWARD PRACTICAL EQUILIBRIUM PROPAGATION: BRAIN-INSPIRED RECURRENT NEURAL NETWORK WITH FEEDBACK REGULATION AND RESIDUAL CONNECTIONS

Zhuo Liu
School of Microelectronics, University of Science and Technology of China, Hefei 230026, Anhui, China
zhuoliu00@mail.ustc.edu.cn

Tao Chen ∗
School of Microelectronics, University of Science and Technology of China, Hefei 230026, Anhui, China
tchen@ustc.edu.cn

arXiv:2508.11659v2 [cs.NE] 7 May 2026

ABSTRACT
Brain-like intelligent systems need brain-like learning methods. Equilibrium Propagation (EP) is a biologically plausible learning framework with strong potential for brain-inspired computing hardware. However, existing implementations of EP suffer from instability and prohibitively high computational costs. Inspired by the structure and dynamics of the brain, we propose a biologically plausible Feedback-regulated REsidual recurrent neural network (FRE-RNN) and study its learning performance in the EP framework. Feedback regulation enables rapid convergence by attenuating feedback signals and reducing the disturbance of feedback paths to feedforward paths. The improvement in the convergence property reduces the computational cost and training time of EP by orders of magnitude, delivering performance on par with backpropagation (BP) in benchmark tasks. Meanwhile, residual connections with brain-inspired topologies help alleviate the vanishing gradient problem that arises when feedback pathways are weak in deep RNNs. Our approach substantially enhances the applicability and practicality of EP. The techniques developed here also offer guidance for implementing in-situ learning in physical neural networks.
∗ Corresponding Author

1 INTRODUCTION

Backpropagation (BP) has been the driving force behind the success of artificial intelligence (AI) across a wide variety of tasks, ranging from image recognition to natural language processing (Rumelhart et al., 1986; Lecun, 1988; He et al., 2016; Vaswani et al., 2017). Despite these triumphs, BP's reliance on non-local error signals and weight transport lacks biological plausibility (Journé et al., 2023; Ororbia, 2023). The brain does not appear to implement the gradient computations performed by BP, in particular the explicit derivative of the activation function, which demands precise access to the rate of change in neuronal activities at specific operating points (Ororbia, 2023). Moreover, implementing BP in neuromorphic systems incurs enormous overhead (Kudithipudi et al., 2025). Drawing inspiration from the topology and dynamics of the brain is a viable approach to advancing biologically plausible learning mechanisms and to promoting energy-efficient computing systems for AI.

Equilibrium Propagation (EP) (Scellier & Bengio, 2017; Ernoult et al., 2019; Laborieux et al., 2021) presents a compelling and hardware-friendly alternative. It leverages naturally settling dynamics in RNN for credit assignment, and eliminates the need for explicit activation derivatives. EP operates in two phases with nearly identical dynamics, and the synaptic adjustments depend only on local information (Ackley et al., 1985; Movellan, 1991; Ernoult et al., 2020). In EP, the output layer is softly nudged by the prediction error toward configurations that incrementally minimize the loss function, a regime termed weak supervision (Millidge et al., 2023). A major drawback of EP is its notably slow training speed and instability.
An RNN often requires dozens or even hundreds of iterations to reach a stable state (Scellier & Bengio, 2017). Previous attempts to optimize EP's performance have led to markedly more complicated procedures (O'Connor et al., 2019; Laborieux & Zenke, 2024).

In this paper, we draw inspiration from the brain and propose a Feedback-regulated REsidual recurrent neural network (FRE-RNN). We substantially improve the convergence properties of the RNNs and the training speed of EP while achieving performance comparable to BP. Our contributions are as follows:

• By scaling down the feedback strength of RNNs, we enhance the robustness of EP and accelerate the training and inference speed by orders of magnitude because of the improved convergence properties.
• To counteract the gradient vanishing problem caused by weak feedback, we introduce residual connections into the layered RNNs, enabling the training of deep networks that previously challenged EP and achieving performance closer to BP.
• The feedback regulation and residual connections in RNNs of arbitrary graph topologies mirror the multi-scale recurrence in biological neural networks. Our work fosters EP's biological plausibility and extends its applicability in brain-inspired computational hardware.

2 BACKGROUND

2.1 CONVERGENT RNNS WITH STATIC INPUT

Consider an RNN as a dynamical system driven by a static input $x$:

$$ s[t+1] = F(x, s[t], \theta), \tag{1} $$

where $F$ is the transition function, $s[t]$ is the network state at time step $t$ ($t = 0, 1, 2, \dots, T$), and $\theta$ denotes the parameters. Assuming that the network state stabilizes in $T$ steps, the RNN reaches a stable point $s[T]$. Its convergence is typically guaranteed by either symmetric connections with asynchronous updates or by a sufficiently small spectral radius of asymmetric connections with synchronous updates (Hopfield, 1982; Yildiz et al., 2012; Liu et al., 2026). That said, other factors, e.g. the activation function, also influence the dynamical properties of RNNs (Miller & Hardt, 2019).

2.2 SCALING ADJACENCY MATRIX TO TUNE NETWORK DYNAMICS

Scaling the spectral radius (SR) of the adjacency matrix, i.e. the largest absolute value among the eigenvalues of the weight matrix, is a common method to tune the dynamics of an RNN (Bai et al., 2012; Nakajima et al., 2024; Liu et al., 2026). An SR less than one yields stable and convergent dynamics. In this case, injected signals tend to decay over time, which manifests as short-term memory. An SR exceeding one can give rise to expansive or even chaotic behavior in which small perturbations are amplified. By adjusting SR, one can bias the RNN toward convergent, oscillatory, or edge-of-chaos regimes, thereby tuning computational properties such as convergence speed or long-term memory capacity (Jaeger & Haas, 2004; Legenstein & Maass, 2007; Miller & Hardt, 2019).

2.3 EQUILIBRIUM PROPAGATION

Equilibrium propagation is a learning framework initially based on energy-based models. It proceeds in two phases: a free (first) phase and a weakly clamped (second) phase. In the first phase, the RNN converges to a steady state $s^0$ under the stimulation of input alone. In the clamped phase, the network is gently nudged by the prediction error and settles to a new stable state $s^\beta$. The weight update can be simplified to a contrastive learning rule compatible with spike-timing-dependent plasticity (STDP) (Scellier et al., 2018). EP has been further generalized to asymmetric RNNs governed by vector field dynamics (Scellier et al., 2018). Recent work shows that asymmetry in skew-symmetric Hopfield models can improve classification performance (Høier et al., 2024).
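The role of the spectral radius in Sections 2.1 and 2.2 can be illustrated with a small numerical sketch. It is not taken from the paper's code: the helper names (`spectral_radius`, `rescale_to_radius`, `settle`), the tanh transition function, the random weights and the target radius of 0.2 are all assumptions chosen for illustration.

```python
import numpy as np

def spectral_radius(W):
    """Largest absolute value among the eigenvalues of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))

def rescale_to_radius(W, target):
    """Return a copy of W scaled so that its spectral radius equals `target`."""
    return W * (target / spectral_radius(W))

def settle(W, b, max_steps=200, tol=1e-8):
    """Iterate s[t+1] = tanh(W s[t] + b) from s = 0 with a static drive b.

    Returns (state, steps) where steps is None if the state did not settle.
    """
    s = np.zeros(W.shape[0])
    for t in range(1, max_steps + 1):
        s_next = np.tanh(W @ s + b)
        if np.max(np.abs(s_next - s)) < tol:
            return s_next, t
        s = s_next
    return s, None

# Hypothetical 64-unit network with random asymmetric weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) / np.sqrt(64)
b = rng.standard_normal(64)          # plays the role of the static input term

W_small = rescale_to_radius(W, 0.2)  # strongly convergent regime
state, n_steps = settle(W_small, b)
print("spectral radius:", spectral_radius(W_small), "steps to settle:", n_steps)
```

Raising the target radius toward or above one in this sketch is a simple way to observe the slower settling, oscillation or non-convergence described in Section 2.2; the exact behavior depends on the random draw and the activation function.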
2.4 FEEDBACK REGULATION AND NETWORK STRUCTURE IN THE BRAIN

Cortical areas in the brain feature dynamic regulation of feedforward and feedback connections (Felleman & Van Essen, 1991; Mejias et al., 2016; Michalareas et al., 2016; Semedo et al., 2022; Fişek et al., 2023; Wang et al., 2023). In the visual system, for instance, feedforward signals dominate immediately following the onset of external stimulus, whereas feedback signals become prominent during spontaneous activity. Dynamically regulating the strength of feedback allows the brain to optimize information integration, ensuring efficient perception and decision-making. In mammalian neocortices, information processing involves not only feedforward synaptic chains but also extensive lateral and feedback loops that interconnect disparate regions, forming a richly recursive network rather than a strictly layered structure. This topology implies short average path length between neurons and efficient information flow (Watts & Strogatz, 1998; Markov et al., 2013; Lynn & Bassett, 2019; Kulkarni & Bassett, 2025). In deep neural networks, residual connections reflect the long-range skip-layer projections observed in cortical circuits (Perich & Rajan, 2020; Holk & Mejias, 2024). They mitigate the vanishing gradient by providing skip pathways that preserve gradients (He et al., 2016).

3 ACCELERATING EP WITH BRAIN-INSPIRED NETWORK PROPERTIES

[Figure 1 schematic. (a) Layered RNN: states $s_0$, $s_1$, $s_2$; forward weights $W_0$, $\alpha_1 W_1$, $W_f$; feedback weights $\beta_1 B_1$, $\beta_f B_f$; prediction $s_p$, label $s_t$, error $e_p = s_t - s_p$. (b) Convolutional variant: forward path $\mathrm{Conv}_0$ (32,5,1,0), $P_1$ (2), $\mathrm{Conv}_1$ (64,5,1,0), $P_2$ (2), $W_f$; feedback path through $\mathrm{Conv}_1^{\top}$, $P_1^{-1}$ scaled by $\beta_1$ and through $W_f^{\top}$, $P_2^{-1}$ scaled by $\beta_f$.]

Figure 1: Illustration of feedback and feedforward regulation. (a) Layered architecture of RNN. The feedforward weights $W_i$ and feedback weights $B_i$ are rescaled by coefficients $\alpha_i$ and $\beta_i$ respectively. The dashed box encloses an RNN formed by layers $s_1$ and $s_2$ with feedforward and feedback pathways. $\beta_f$ is the nudging factor, which essentially scales the feedback strength of the prediction error. (b) Embedding convolutional architecture in RNN. The convolutional parameter (32,5,1,0) is written as (channels, kernel size, stride, padding). Parameter (2) in (b) denotes max-pooling with stride 2. $\mathrm{Conv}_i^{\top}$ represents transposed convolution, the inverse process of the convolution, and $P_i^{-1}$ means max-unpooling (Ernoult et al., 2019). Model architectures and training process are given in Appendix D.

3.1 PROTOTYPICAL SETTING OF EQUILIBRIUM PROPAGATION

Unlike the prototypical setting of equilibrium propagation (P-EP) (Ernoult et al., 2019), we separate the input and output layers from the recurrent network (Figure 1a). This separation allows the output layer to adopt the SoftMax activation commonly used in feedforward networks, which facilitates performance comparison (Laborieux & Zenke, 2024). For clarity, the RNN (black dashed box in Figure 1a) shown here only contains two hidden layers $s_1$ and $s_2$, but the approach applies to deeper structures (see below). The states of the RNN evolve for $T$ discrete steps until they converge. The dynamics of the whole RNN can be formulated as:

$$ s^{\beta_f}[t+1] = F(s^{\beta_f}[t], b) = \rho(W \cdot s^{\beta_f}[t] + b), \qquad b = [W_0 \cdot s_0,\ \beta_f \cdot B_f \cdot e_p], \tag{2} $$

where $s^{\beta_f}[t]$ is the state of the RNN at time $t$, $\rho$ is the activation function, $W$ is the forward weight matrix of the RNN, and $b$ combines the feedforward input and the error-nudging term. Note that $s^{\beta_f} = [s_1^{\beta_f}, s_2^{\beta_f}]$. For each sample-label pair $(x, s_{\mathrm{tar}})$, we run the free phase ($\beta_f = 0$) for $T$ iterations, obtain the prediction $s_p = \mathrm{SoftMax}(W_f \cdot s_2)$, and compute the prediction error $e_p = s_{\mathrm{tar}} - s_p$.
During the clamped phase, the error nudges the RNN through the feedback weights $B_f$ and scaling coefficient $\beta_f = \beta_{f1}$ ($\beta_{f1} = 0.1$ for layered architecture and $\beta_{f1} = 0.25$ for convolutional architecture by default). The network evolves for $K$ further iterations under clamping to another state. The weights ($W_0$, $W_1$) are then updated with an STDP-compatible rule:

$$ \Delta W_i = ds_{i+1} \cdot (s_i^0)^{\top}, \qquad ds_{i+1} = s_{i+1}^{\beta_{f1}} - s_{i+1}^0, \tag{3} $$

where $ds_i$ is the offset of the stable point caused by the error nudging (Scellier et al., 2018). Similarly, the final weight for output is updated:

$$ \Delta W_f = (s_{\mathrm{tar}} - s_p^0) \cdot (s_2^0)^{\top}. \tag{4} $$

We also consider an RNN embedded with convolutional architecture in its forward paths (2 convolution layers, 2 max-pooling layers and 1 fully connected layer) shown in Figure 1b. The forward convolutional structure follows the architecture of existing convolutional neural networks (CNN) (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015), in which a pooling layer is placed after the activation of the convolution layer. We transform the CNN to an RNN by adding feedback connections symmetric with the feedforward connections (see Appendix D for the pseudocode and schematics).

3.2 FEEDBACK REGULATION IN LAYERED RNN FOR FAST CONVERGENCE

[Figure 2 plots omitted: eight heat maps, panels (a)-(h), of neuron index versus time step $t$.]

Figure 2: Convergence dynamics and speed versus feedback scaling $\beta_i$. All neurons in all hidden layers are indexed ($s_1$: 0-63; $s_2$: 64-127). Colors indicate neuronal activity (a, c, e, g) and changes in activity (b, d, f, h). (a) The state evolution of RNN with symmetric weights and $\beta_i = 0.1$; (b) The one-step difference of neural states in (a). (c, d) Symmetric weights with $\beta_i = 2$; (e, f) Asymmetric weights with $\beta_i = 0.1$; (g, h) Asymmetric weights with $\beta_i = 4$. In both symmetric and asymmetric feedback cases, down-scaling feedback connections tends to stabilize the network. See Figure 5d for the statistical robustness.

Although the SR can tune the RNN dynamics, scaling forward weights $W_i$ distorts forward signal propagation, which is harmful to performance (see below). Therefore, we turn to another choice, namely, scaling only the feedback strength with $\beta_i$. This coefficient scales the gradients in the same way as the nudging factor $\beta_f$. We consider both symmetric ($B_i = (W_i)^{\top}$) and asymmetric ($B_i \neq (W_i)^{\top}$) recurrent connections in the study, and compare the results with FNNs of the same size trained by BP (feedback connections removed) or feedback alignment (FA) (Lillicrap et al., 2016), which uses random weights $B_i \neq (W_i)^{\top}$ to feed back the error. Note that, after scaling, the overall weight matrix $W$ of a symmetric RNN is no longer strictly symmetric. Therefore, we started from the vector field setting of EP rather than the energy-based setting in the first place. The feedforward and feedback weights are multiplied by coefficients $\alpha_i$ and $\beta_i$ respectively. Figure 2a-d shows convergence speed for different $\beta_i$. With asymmetric weights, the network can converge to a fixed point (Figure 2e, f), exhibit cyclical oscillation (Figure 2g, h), or even become chaotic. The feedback weights stay fixed during the training process, which differs from EP in vector field dynamics (Scellier et al., 2018). The pseudocode of the learning procedure with the 2-hidden-layer RNN shown in Figure 1(a) is provided in Algorithm 1.
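The two phases and the local update rule of Section 3.1 (Eqs. 2 to 4), which the paper lists as Algorithm 1, can also be sketched in NumPy for a single sample. This is an illustrative reading of the pseudocode rather than the authors' implementation: the function name `ep_step`, the tanh activation, the layer sizes, the weight initialization and the learning rate are assumptions, and the feedback weights are taken as the transposes of the forward weights (the symmetric case).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def ep_step(x, target, W0, W1, Wf, B1, Bf,
            alpha1=1.0, beta1=0.1, beta_f=0.1, T=20, K=10):
    """One EP pass for one sample; returns (dW0, dW1, dWf, free-phase prediction)."""
    s1, s2 = np.zeros(W0.shape[0]), np.zeros(W1.shape[0])
    sp = softmax(Wf @ s2)
    # First (free) phase: input only, all layers updated synchronously.
    for _ in range(T):
        h1 = W0 @ x + beta1 * (B1 @ s2)   # feedforward drive + scaled feedback
        h2 = alpha1 * (W1 @ s1)
        hp = Wf @ s2
        s1, s2, sp = np.tanh(h1), np.tanh(h2), softmax(hp)
    s1_free, s2_free, sp_free = s1, s2, sp
    # Second (nudged) phase: the prediction error is fed back into the top hidden layer.
    for _ in range(K):
        e = target - sp
        h1 = W0 @ x + beta1 * (B1 @ s2)
        h2 = alpha1 * (W1 @ s1) + beta_f * (Bf @ e)
        hp = Wf @ s2
        s1, s2, sp = np.tanh(h1), np.tanh(h2), softmax(hp)
    # Local updates: state offset times presynaptic free-phase activity (Eqs. 3 and 4).
    dW0 = np.outer(s1 - s1_free, x)
    dW1 = np.outer(s2 - s2_free, s1_free)
    dWf = np.outer(target - sp_free, s2_free)
    return dW0, dW1, dWf, sp_free

# Hypothetical toy problem: 8 inputs, two hidden layers of 16 units, 3 classes.
rng = np.random.default_rng(0)
W0 = 0.1 * rng.standard_normal((16, 8))
W1 = 0.1 * rng.standard_normal((16, 16))
Wf = 0.1 * rng.standard_normal((3, 16))
x = rng.standard_normal(8)
target = np.array([1.0, 0.0, 0.0])

dW0, dW1, dWf, sp_free = ep_step(x, target, W0, W1, Wf, B1=W1.T, Bf=Wf.T)
lr = 0.05                                  # assumed learning rate
W0_new, W1_new, Wf_new = W0 + lr * dW0, W1 + lr * dW1, Wf + lr * dWf
```

With `beta_f=0` the second phase simply continues the free dynamics, so the hidden-layer updates vanish once the free phase has settled; this is a convenient sanity check on any implementation of the rule.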
Algorithm 1 EP with Feedforward and Feedback Scaling
Require: Input: (x, s_tar)
Require: Parameters: θ = [W_0, W_1, W_f, B_f, B_1, α_1, β_1, β_f1]
Ensure: Updated parameters θ
function FIRST-PHASE(θ, s_tar)
    s_0 ← x
    for t = 1 to T do
        h_1 ← W_0 · s_0 + β_1 · B_1 · s_2^0
        h_2 ← α_1 · W_1 · s_1^0
        h_p ← W_f · s_2^0
        s_1^0, s_2^0, s_p^0 ← ρ(h_1), ρ(h_2), SoftMax(h_p)
    end for
    Λ_1 ← [s_i^0], i = 0, 1, 2, p
    return Λ_1
end function
function SECOND-PHASE(θ, Λ_1, s_tar)
    s_1^βf1, s_2^βf1, s_p^βf1 ← s_1^0, s_2^0, s_p^0
    for t = 1 to K do
        e_p ← s_tar − s_p^βf1
        h_1 ← W_0 · s_0 + β_1 · B_1 · s_2^βf1
        h_2 ← α_1 · W_1 · s_1^βf1 + β_f · B_f · e_p
        h_p ← W_f · s_2^βf1
        s_1^βf1, s_2^βf1, s_p^βf1 ← ρ(h_1), ρ(h_2), SoftMax(h_p)
    end for
    ds_i ← s_i^βf1 − s_i^0, i = 1, 2
    Λ_2 ← [ds_1, ds_2]
    return Λ_2
end function
function UPDATING-WEIGHTS(θ, Λ_1, Λ_2, s_tar)
    ΔW_i ← ds_{i+1} · (s_i^0)^⊤, i = 0, 1
    ΔW_f ← (s_tar − s_p^0) · (s_2^0)^⊤
end function

3.3 RESIDUAL CONNECTIONS TO AVOID VANISHING GRADIENTS

In our 10-hidden-layer RNN with symmetric connections, we add cross-layer residual links (Figure 3a-b) and carry out an ablation study of their effect on performance. The three long-range bidirectional connections bypass adjacent layers to reduce gradient decay. For RNNs with asymmetric connections, we introduce skip-layer connections between non-adjacent layers with probability $P = 20\%$, creating an RNN with arbitrary graph topologies (AGT) where any pair of layers forms connections stochastically (Figure 3c) (Salvatori et al., 2022). See Algorithm S3 in Appendix D for training details.

4 EXPERIMENTS

We evaluated our RNN models on the MNIST and CIFAR-10 datasets and compared the results with P-EP and BP. The MNIST dataset consists of 70,000 grayscale handwritten digit images (28×28 pixels) split into 60,000 training and 10,000 test samples.
CIFAR-10 contains 60,000 RGB images (32×32 pixels) of 10 categories, divided into 50,000 training and 10,000 test samples. Pre-processing, network structures and additional training details are in Appendix D.

Figure 3: (a) A 10-hidden-layer RNN model with residual connections. The solid blue wires and the dashed orange wires represent forward and feedback residual connections respectively. The bidirectional connections are symmetric. (b) Adjacency matrix of (a). The blocks (green) other than the sub-diagonals indicate residual connections. (c) Adjacency matrix for an RNN with arbitrary graph topology.

4.1 INFLUENCE OF FEEDFORWARD SCALING AND FEEDBACK SCALING

| Panel | $\beta_i$ | $\alpha_i = 0.01$ | $\alpha_i = 0.1$ | $\alpha_i = 1.0$ | $\alpha_i = 4.0$ |
|---|---|---|---|---|---|
| (a) 2HL | 0.001 | 0.9659 | 0.9730 | 0.9756 | 0.9753 |
| (a) 2HL | 0.01 | 0.9666 | 0.9723 | 0.9760 | 0.9758 |
| (a) 2HL | 0.1 | 0.9624 | 0.9711 | 0.9725 | 0.9694 |
| (a) 2HL | 1.0 | 0.8925 | 0.9127 | 0.9300 | 0.7978 |
| (b) 3HL | 0.001 | 0.9348 | 0.9692 | 0.9718 | 0.9555 |
| (b) 3HL | 0.01 | 0.9246 | 0.9679 | 0.9765 | 0.9715 |
| (b) 3HL | 0.1 | 0.6323 | 0.9497 | 0.9741 | 0.9702 |
| (b) 3HL | 1.0 | 0.4862 | 0.8739 | 0.5249 | 0.1676 |
| (c) 5HL | 0.001 | 0.3844 | 0.7980 | 0.9338 | 0.8048 |
| (c) 5HL | 0.01 | 0.2515 | 0.9365 | 0.9575 | 0.8583 |
| (c) 5HL | 0.1 | 0.2104 | 0.8386 | 0.9757 | 0.9332 |
| (c) 5HL | 1.0 | 0.2096 | 0.3891 | 0.2500 | 0.1324 |

Figure 4: The influence of feedforward scaling $\alpha_i$ and feedback scaling $\beta_i$ on test accuracy of MNIST classification (heat-map values tabulated above). (a) 2 hidden layers; (b) 3 hidden layers; (c) 5 hidden layers. Each layer has 64 neurons. By default, $T = 10 \times N_{\mathrm{HiddenLayer}}$, $K = T/2$. Each result is averaged over five repeated experiments.

Figure 4 compares the effects of feedforward scaling $\alpha_i$ and feedback scaling $\beta_i$. In general, relatively small feedback scaling ($\beta_i = 0.1$) yields high MNIST accuracy (Figure 4). In deeper RNNs, overly low feedback scaling $\beta_i$ jeopardizes the performance, which we attribute to vanishing gradients (Figure 4c, right two columns). In contrast, down-scaling the feedforward weights degrades performance, as the inference signals are weakened through the layers (see rows of Figure 4a). However, up-scaling $\alpha_i$ can also be detrimental, as this easily leads to saturation of the neural state. The best performance of a 5-hidden-layer RNN is achieved without feedforward scaling ($\alpha_i = 1$) and with a trade-off feedback scaling of $\beta_i = 0.1$. These results suggest that balancing the feedforward and feedback strengths is critical for better performance, not only in accuracy but also in speed (see Table 1).

[Figure 5 plots omitted: panels (a) test error, (b) SR, (c) FTMLE, (d) convergence time, each versus $\beta_i$; curves for symmetric and asymmetric weights, before and after training.]

Figure 5: The test error, SR, FTMLE, and convergence time versus feedback scaling $\beta_i$. The results are obtained from a 3-hidden-layer (64 neurons per layer) model trained on the MNIST dataset. Note that the network does not converge under certain conditions, resulting in missing values in (d).

To further investigate the influence of feedback scaling $\beta_i$, we plot the error, SR, finite-time maximum Lyapunov exponent (FTMLE) (Shadden et al., 2005; Kanno & Uchida, 2014) and convergence time against the feedback scaling coefficient before and after training of a 3-hidden-layer RNN on MNIST (Figure 5). It shows that larger feedback scaling $\beta_i$ decreases accuracy (Figure 5a).
As expected, SR is positively correlated with $\beta_i$ (see Figure 5b), and a large SR can lead to instability of an RNN, as indicated by the FTMLE shown in Figure 5c, which in turn explains the results in Figure 5a. In general, down-scaling the feedback ($\beta_i < 1$) reduces the convergence time of the RNN, which is favorable. Note that up-scaling the feedback ($\beta_i > 1$) can also decrease FTMLE and convergence time. However, this is attributed to the saturation of the neural state, and it also lowers the performance.

Additionally, one might suspect that the gradient signals in the lower layers are not fulfilling their intended role. In reservoir computing, where only the last layer is trained, the network can also reach high accuracy as long as the output dimension is large enough. However, this is unlikely in our case, as each layer in our network has only 64 neurons by default (other than the results in Table 1). To further confirm that the learning in lower layers is meaningful, we performed training with the weights of lower layers frozen; details of these experiments are included in Appendix C.5. The results clearly show that getting comparable results to BP requires effective training of lower layers.
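The relationship between feedback scaling, SR and convergence time discussed in this subsection can be probed on a toy model. The sketch below is an assumption-laden illustration, not the paper's experiment: it uses untrained random weights, three hidden layers of 64 units, symmetric feedback $B_i = \beta W_i^{\top}$, a tanh activation, and hypothetical helper names (`layered_matrix`, `steps_to_converge`). For this strictly layered symmetric construction the SR scales exactly as $\sqrt{\beta}$, because the matrix is similar to $\sqrt{\beta}$ times its unscaled symmetric counterpart.

```python
import numpy as np

def layered_matrix(weights, beta):
    """Adjacency matrix of a layered RNN: W_i below the block diagonal, beta * W_i^T above it."""
    sizes = [weights[0].shape[1]] + [W.shape[0] for W in weights]
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    M = np.zeros((offsets[-1], offsets[-1]))
    for i, W in enumerate(weights):
        lo, mid, hi = offsets[i], offsets[i + 1], offsets[i + 2]
        M[mid:hi, lo:mid] = W            # feedforward block, layer i -> i+1
        M[lo:mid, mid:hi] = beta * W.T   # scaled symmetric feedback block
    return M

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def steps_to_converge(M, b, tol=1e-6, max_steps=500):
    """Iterate s <- tanh(M s + b) from zero; return the step count, or None if not settled."""
    s = np.zeros(M.shape[0])
    for t in range(1, max_steps + 1):
        s_next = np.tanh(M @ s + b)
        if np.max(np.abs(s_next - s)) < tol:
            return t
        s = s_next
    return None

rng = np.random.default_rng(0)
n = 64
weights = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(2)]  # 3 hidden layers
b = np.concatenate([rng.standard_normal(n), np.zeros(2 * n)])           # static drive on layer 1

results = {}
for beta in (0.01, 0.1, 1.0, 4.0):
    M = layered_matrix(weights, beta)
    results[beta] = (spectral_radius(M), steps_to_converge(M, b))
    print(f"beta={beta}: SR={results[beta][0]:.3f}, steps={results[beta][1]}")
```

A step count of `None` marks a run that did not settle within the budget, which is the toy analogue of the missing values in Figure 5d. Trained networks, other activations or asymmetric feedback may behave differently.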
[Figure 6 plots omitted: testing error (log scale) versus epoch (0-50). Panels (a)-(d): $\beta_i$ = 0.01, 0.1, 1.0, 4.0 with curves for $T$ = 10, 20, 50, 100; panels (e)-(h): 2, 3, 5, 10 layers with curves for $\beta_i$ = 0.001, 0.01, 0.1, 0.25, 1.0, 2.0, 4.0.]

Figure 6: Test error with different hyperparameters. The curves of different $T$ (10, 20, 50, 100) with 2 hidden layers (64 neurons per hidden layer) and (a) $\beta_i = 0.01$; (b) $\beta_i = 0.1$; (c) $\beta_i = 1$; (d) $\beta_i = 4$. The curves of different $\beta_i$ (0.001, 0.01, 0.1, 0.25, 1, 2, 4) with (e) 2 hidden layers; (f) 3 hidden layers; (g) 5 hidden layers; (h) 10 hidden layers. The shaded areas represent deviations of five repeated experiments. By default, $T = 10 \times N_{\mathrm{HiddenLayer}}$, $K = T/2$. See Appendix A for more information.

4.2 DOWN-SCALING FEEDBACK LEADS TO FASTER CONVERGENCE

Figure 6a-d plots the error versus the number of epochs with different iteration steps $T$. Under the condition of $\beta_i = 0.01$ (Figure 6a), the model with $T = 10$ and $K = 5$ works as well as the model with $T = 100$ and $K = 50$, suggesting possibility of speedup in training. Larger $\beta_i$ requires more iterations to achieve a certain level of performance (See Figure 6b, c, d).
Larger $\beta_i$ means larger SR and FTMLE, thus requiring more iterations to settle the RNN, as shown in Figure 2 and Figure 5(b-d). Or even worse, the gradient signal is completely distorted. At $\beta_i = 4$, even $T = 100$ fails to exceed 95% accuracy. Figure 6e-h shows that while shallow networks benefit from low $\beta_i$, deeper networks (3, 5 and 10 layers) lose accuracy. In all cases, training performance peaks at a certain $\beta_i$ that depends on the network depth. Additional results are provided in Table S1 in Appendix B.

Table 1: Comparison with P-EP and BP in accuracy and cost. The results of P-EP come from previous work (Ernoult et al., 2019). For BP results, we used a network with the same number of layers and number of nodes/channels. Each experiment is repeated five times, and the standard deviation is given. By default, $\beta_i = 0.01$ in our results, the feedback weights are symmetric with the feedforward weights for P-EP and Ours, and the learning rate is the same in all layers except for Ours-DLR (different learning rate), which uses varying learning rates identical to those of P-EP. For 2HL (two hidden layers) and 3HL (three hidden layers), there are 512 nodes per hidden layer. See Appendix D for more details.

| Architecture | Training approach | Testing (Training) | Epoch / Batch size - T/K | Wall Clock Time (HH:MM:SS) |
|---|---|---|---|---|
| 2HL | P-EP (sigmoid-s) | 98.05%±0.10% (99.86%) | 50/20-100/20 | 1:56:- |
| 2HL | Ours (tanh, Adam) | 98.39%±0.04% (100.00%) | 50/500-10/10 | 0:01:16 |
| 2HL | BP (tanh, Adam) | 98.26%±0.06% (100.00%) | 50/500-1/1 | 0:00:18 |
| 3HL | P-EP (sigmoid-s) | 97.99%±0.18% (99.90%) | 100/20-180/20 | 8:27:- |
| 3HL | Ours-DLR (tanh) | 97.65%±0.08% (98.93%) | 100/20-18/10 | 1:01:14 |
| 3HL | Ours (tanh) | 97.83%±0.13% (99.98%) | 100/20-18/10 | 1:01:54 |
| 3HL | Ours (tanh, Adam) | 98.36%±0.06% (100.00%) | 50/500-18/10 | 0:02:11 |
| 3HL | BP (tanh, Adam) | 98.36%±0.08% (100.00%) | 50/500-1/1 | 0:00:24 |
| Conv | P-EP (hard-sigmoid) | 98.98%±0.04% (99.46%) | 40/20-200/10 | 8:58:- |
| Conv | Ours (hard-sigmoid) | 99.14%±0.02% (99.78%) | 40/128-20/10 | 0:12:28 |
| Conv | BP (hard-sigmoid) | 98.93%±0.18% (99.43%) | 40/128-1/1 | 0:01:01 |

Table 2: Comparison with BP and FA and ablation study of residual connection. For layered architecture, there are 64 nodes per hidden layer and we chose $T = 10 \times N_{\mathrm{HiddenLayer}}$ and $K = 5 \times N_{\mathrm{HiddenLayer}}$, which guarantees saturation of accuracy at $\beta_i = 0.1$. For convolutional architectures, $\beta_i = 0.01$. By default, the Adam optimizer is used. Each experiment is repeated five times. See Appendix D for more training details.

| Architecture-connections | Training approach | MNIST-Testing (Training) | CIFAR-10-Testing (Training) |
|---|---|---|---|
| 5-symm | BP | 97.69%±0.10% (100.00%) | 49.23%±0.81% (56.72%) |
| 5-symm | Ours | 97.64%±0.10% (99.98%) | 50.72%±0.17% (57.02%) |
| 5-asymm | FA | 96.44%±0.10% (98.96%) | 37.97%±2.18% (38.92%) |
| 5-asymm | Ours | 96.37%±0.11% (97.99%) | 45.27%±0.73% (46.79%) |
| 10-symm | BP | 97.61%±0.04% (99.93%) | 48.23%±1.26% (55.37%) |
| 10-symm | Ours | 92.49%±0.32% (95.27%) | 34.90%±0.38% (34.64%) |
| 10-symm | Ours-Residual | 97.49%±0.05% (99.77%) | 44.46%±0.51% (48.67%) |
| 10-asymm | FA | 94.52%±0.26% (95.54%) | 30.16%±6.12% (30.20%) |
| 10-asymm | Ours | 87.37%±0.49% (87.95%) | 30.37%±1.09% (29.97%) |
| 10-asymm | Ours-AGT | 96.87%±0.11% (99.45%) | 30.94%±4.90% (31.36%) |
| 20-symm | BP | 97.48%±0.07% (99.74%) | 47.35%±1.49% (54.59%) |
| 20-symm | Ours-Residual | 95.95%±0.18% (98.20%) | 43.61%±1.17% (44.26%) |
| Conv | BP | 99.34%±0.04% (99.97%) | 75.45%±0.46% (83.61%) |
| Conv | Ours | 99.27%±0.07% (99.78%) | 75.04%±0.51% (80.79%) |

Table 1 compares our approach with P-EP and BP. Our model surpasses P-EP in training speed by at least one order of magnitude for both convolutional architecture and layered architecture. Importantly, our accuracy is comparable to BP and FA for the shallow architectures (5-hidden-layer and conv model, see also Table 2).
In consideration of the improved stability (Figure 6) via feedback regulation, we anticipate that physical implementations of RNNs can achieve performance on par with BP. Additionally, for layered architecture, we also adopt the same training parameters (learning rate, batch size and epochs) as P-EP, differing only in feedback scaling ('Ours-DLR' in Table 1). The results present clear evidence of speedup, which mainly stems from the reduced number of iterations required for convergence.

4.3 DOWN-SCALED FEEDBACK COORDINATES PLASTICITY OF DIFFERENT LAYERS

It is hypothesized that the brain requires different plasticity in different areas due to their varying functional roles (Atallah et al., 2004; Lowet et al., 2020). The variability in plasticity can be realized explicitly by adjusting learning rates or implicitly by modulating the intensity of the gradient. Previous work postulated that EP with weak feedback necessitates learning rates differing by orders of magnitude across layers (Scellier & Bengio, 2017). Here, we found that due to gradient differences across different layers induced by weak feedback, a 3-hidden-layer RNN at $\beta_i = 0.01$ (Table 1, 'Ours (tanh)') learns well with a uniform learning rate. This result suggests that the feedback scaling alone is able to regulate the gradient strength of different layers, pointing to another possible mechanism to coordinate plasticity.

4.4 RESIDUAL CONNECTIONS OVERCOME THE GRADIENT VANISHING IN DEEP RNNS

Weak feedback exacerbates the vanishing gradient in deeper layered RNNs (Figures S5–S6 in Appendix B). Adding residual connections restores gradient flow (Figure S7 in Appendix B). As a result, a 10-hidden-layer network sees substantial performance gains (Table 2): test accuracy rises by about 5 percentage points on MNIST (92.49% to 97.49%) and by about 9.6 percentage points on CIFAR-10 (34.90% to 44.46%). Even a 20-hidden-layer model can be trained.
As shown in Table 2, without residual connections, an asymmetric RNN trained by EP falls short of FA in accuracy, but with arbitrary residual links it surpasses the accuracy of FA for MNIST classification (see the ablation study on connection probability in Appendix B). For the more complex CIFAR-10 dataset, the 10-hidden-layer asymmetric model with residual random feedback connections achieves accuracy nearly 14 percentage points below the symmetric model. A possible reason is that the gradient signal through multiple random fixed feedback connections becomes too distorted by error to coordinate the forward weight learning.

5 DISCUSSION

We have applied feedback scaling to RNNs to speed up the convergence and to accelerate training with EP with negligible overhead. To counteract the vanishing gradient in deep architectures, we have added residual connections to non-adjacent layers of deep RNNs, partly restoring classification performance. In principle, the residual connections make credit assignment pathways shorter (Veit et al., 2016). The training exhibits remarkable resilience to noise on weights and neural states. Our structural modification is compatible with other algorithmic speed-ups (Scellier et al., 2023), thereby expanding the design space for efficient EP implementations.

Recent work on credit assignment in brain-inspired networks, e.g. adjoint propagation (Liu et al., 2026), partitions a large network into local RNNs with random internal connections of low SR for fast convergence and dynamic resource allocation, yielding speed and accuracy similar to this work. This work, however, adopts feedback scaling to solve the stability issue and accelerate convergence of EP.

Weak feedback is often considered in biologically plausible learning algorithms (Sacramento et al., 2018; Haider et al., 2021; Meulemans et al., 2021).
It has been shown that contrastive Hebbian learning with weak feedback approximates backpropagation while converging quickly (Xie & Seung, 2003). More recently, local representation alignment (LRA) likewise employed weak feedback (Ororbia et al., 2023), together with skip connections from the output to deep layers, for efficient training. The EP framework also approximates BP (Scellier & Bengio, 2017; Millidge et al., 2023), but only under the weak clamping condition (weak supervision) (Laborieux et al., 2021; Millidge et al., 2023). We have shown that, at the infinitesimal inference limit, namely weak supervision and weak feedback (Millidge et al., 2023), EP is equivalent to LRA and BP (Appendix C). In other words, owing to its weak feedback, the dynamics of FRE-RNN resemble those of a feedforward neural network.

However, our approach still has a few limitations with respect to the large-scale neural networks that underpin artificial intelligence. For complex datasets like CIFAR-10, there remains a notable performance gap compared to BP when using deep fully connected neural networks. We attribute this gap to the inaccurate approximation of the true gradient as computed by BP (see Appendix C.4). Therefore, although EP can be extended to deep fully connected networks (20 hidden layers) and shallow CNNs, its applicability to deep CNNs remains to be explored. For deep architectures with asymmetric connections, the accuracy decreases faster with increasing depth due to the inaccurate random error feedback. A more in-depth investigation of residual connection topology is required to scale the methodology up to large-scale deep architectures. In addition, the hyperparameters are optimized empirically. We find that a feedback scaling in the range of 0.01–0.1 is favorable for shallow networks (fewer than 4 layers) and 0.1–0.25 for deeper architectures. Finding a general way to determine these parameters remains ongoing work.
Additionally, existing research on naturally converging EP still focuses primarily on static-input settings (Laborieux et al., 2021; Ernoult et al., 2019; Laborieux & Zenke, 2024). Extending naturally converging RNNs trained by EP to sequence tasks remains a challenge.
From a neurobiological perspective, residual connections, particularly the randomly generated arbitrary graph topologies, yield connectivity patterns resembling those of the cortex. The feedback-regulated residual RNNs equip the biologically plausible learning framework, EP, with a biologically plausible network architecture. Although it currently runs on GPUs, it can exploit the natural convergence of physical RNNs and facilitate efficient learning and inference on dedicated neuromorphic hardware.

Acknowledgements
This work was supported by the National Key R&D Program of China (Grant No. 2024YFA1208804). Additional financial support from the University of Science and Technology of China and the Chinese Academy of Sciences is also gratefully acknowledged.

Code Availability
The code used in this work is available at https://github.com/Zero0Hero/FRE-RNN-EP.

Reproducibility Statement
The code necessary to reproduce the main results is provided as Jupyter Notebooks in the Supplementary Materials. Researchers can run them directly to reproduce the results. Further details on data pre-processing and the training process are available within the provided code and in Appendix D.

The Use of Large Language Models (LLMs)
In the preparation of this work, the authors used GPT-5 and DeepSeek solely for the purpose of polishing and improving the linguistic fluency and readability of the text. This includes tasks such as correcting grammar and rephrasing sentences. After using the models, the authors reviewed and edited all content extensively and take full responsibility for all ideas, claims, and the final language presented in this paper.

References
David H.
Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
Hisham E. Atallah, Michael J. Frank, and Randall C. O’Reilly. Hippocampus, cortex, and basal ganglia: Insights from computational models of complementary learning systems. Neurobiology of Learning and Memory, 82(3):253–267, 2004. ISSN 1074-7427. doi: 10.1016/j.nlm.2004.06.004. URL https://www.sciencedirect.com/science/article/pii/S1074742704000693.
Zhang Bai, D. J. Miller, and Wang Yue. Nonlinear system modeling with random matrices: Echo state networks revisited. IEEE Transactions on Neural Networks and Learning Systems, 23(1):175–182, 2012. ISSN 2162-237X 2162-2388. doi: 10.1109/tnnls.2011.2178562.
Maxence Ernoult, Julie Grollier, Damien Querlioz, Yoshua Bengio, and Benjamin Scellier. Updates of equilibrium prop match gradients of backprop through time in an rnn with static input. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. Article 636. Curran Associates Inc., 2019.
Maxence Ernoult, Julie Grollier, Damien Querlioz, Yoshua Bengio, and Benjamin Scellier. Equilibrium propagation with continual weight updates, 2020. URL https://openreview.net/forum?id=H1xJhJStPS.
D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1–47, 1991. ISSN 1047-3211. doi: 10.1093/cercor/1.1.1.
Mehmet Fişek, Dustin Herrmann, Alexander Egea-Weiss, Matilda Cloves, Lisa Bauer, Tai-Ying Lee, Lloyd E. Russell, and Michael Häusser. Cortico-cortical feedback engages active dendrites in visual cortex. Nature, 617(7962):769–776, 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06007-6. URL https://doi.org/10.1038/s41586-023-06007-6.
Paul Haider, Benjamin Ellenberger, Laura Kriener, Jakob Jordan, Walter Senn, and Mihai A. Petrovici. Latent equilibrium: a unified learning theory for arbitrarily fast computation with arbitrarily slow neurons. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. Article 1365. Curran Associates Inc., 2021.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. ISBN 1063-6919. doi: 10.1109/CVPR.2016.90.
Maya van Holk and Jorge F Mejias. Biologically plausible models of cognitive flexibility: merging recurrent neural networks with full-brain dynamics. Current Opinion in Behavioral Sciences, 56:101351, 2024. doi: 10.1016/j.cobeha.2024.101351.
J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. ISSN 0027-8424 1091-6490. doi: 10.1073/pnas.79.8.2554.
Rasmus Høier, Kirill Kalinin, Maxence Ernoult, and Christopher Zach. Dyadic learning in recurrent and feedforward models. In NeurIPS 2024 Workshop Machine Learning with new Compute Paradigms, 2024. URL https://openreview.net/forum?id=LNfWowAErI.
Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004. doi: 10.1126/science.1091277. URL https://doi.org/10.1126/science.1091277.
Adrien Journé, Hector Garcia Rodriguez, Qinghai Guo, and Timoleon Moraitis. Hebbian deep learning without feedback. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=8gd4M-_Rj1.
Kazutaka Kanno and Atsushi Uchida.
Finite-time lyapunov exponents in time-delayed nonlinear dynamical systems. Physical Review E, 89(3):032918, 2014. ISSN 1539-3755 1550-2376. doi: 10.1103/PhysRevE.89.032918.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations. ICLR, 2015. doi: 10.48550/arXiv.1412.6980.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
Dhireesha Kudithipudi, Catherine Schuman, Craig M. Vineyard, Tej Pandit, Cory Merkel, Rajkumar Kubendran, James B. Aimone, Garrick Orchard, Christian Mayr, Ryad Benosman, Joe Hays, Cliff Young, Chiara Bartolozzi, Amitava Majumdar, Suma George Cardwell, Melika Payvand, Sonia Buckley, Shruti Kulkarni, Hector A. Gonzalez, Gert Cauwenberghs, Chetan Singh Thakur, Anand Subramoney, and Steve Furber. Neuromorphic computing at scale. Nature, 637(8047):801–812, 2025. ISSN 0028-0836 1476-4687. doi: 10.1038/s41586-024-08253-8.
Suman Kulkarni and Dani S. Bassett. Toward principles of brain network organization and function. Annual Review of Biophysics, 54:353–378, 2025. ISSN 1936-1238. doi: 10.1146/annurev-biophys-030722-110624. URL https://www.annualreviews.org/content/journals/10.1146/annurev-biophys-030722-110624.
Axel Laborieux and Friedemann Zenke. Improving equilibrium propagation without weight symmetry through jacobian homeostasis. In The Twelfth International Conference on Learning Representations. ICLR, 2024. URL https://openreview.net/forum?id=kUveo5k1GF.
Axel Laborieux, Maxence Ernoult, Benjamin Scellier, Yoshua Bengio, Julie Grollier, and Damien Querlioz. Scaling equilibrium propagation to deep convnets by drastically reducing its gradient estimator bias.
Frontiers in Neuroscience, 15:633674, 2021. ISSN 1662-453X. doi: 10.3389/fnins.2021.633674.
Yann LeCun. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, 1988.
Robert Legenstein and Wolfgang Maass. Edge of chaos and prediction of computational performance for neural circuit models. Neural Networks, 20(3):323–334, 2007. ISSN 08936080. doi: 10.1016/j.neunet.2007.04.017.
Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7(1):13276, 2016. ISSN 2041-1723. doi: 10.1038/ncomms13276.
Zhuo Liu, Hao Shu, Linmiao Wang, Xu Meng, Yousheng Wang, Xuancheng Li, Wei Wang, and Tao Chen. Adjoint propagation of error signal through modular recurrent neural networks for biologically plausible learning. eLife, 15:e108237, 2026. ISSN 2050-084X. doi: 10.7554/eLife.108237. URL https://doi.org/10.7554/eLife.108237.
Adam S. Lowet, Qiao Zheng, Sara Matias, Jan Drugowitsch, and Naoshige Uchida. Distributional reinforcement learning in the brain. Trends in Neurosciences, 43(12):980–997, 2020. ISSN 0166-2236. doi: 10.1016/j.tins.2020.09.004. URL https://www.sciencedirect.com/science/article/pii/S0166223620301983.
Christopher W. Lynn and Danielle S. Bassett. The physics of brain network structure, function and control. Nature Reviews Physics, 1(5):318–332, 2019. ISSN 2522-5820. doi: 10.1038/s42254-019-0040-8.
Nikola T. Markov, Mária Ercsey-Ravasz, David C. Van Essen, Kenneth Knoblauch, Zoltán Toroczkai, and Henry Kennedy. Cortical high-density counterstream architectures. Science, 342(6158), 2013. ISSN 0036-8075 1095-9203. doi: 10.1126/science.1238406.
Jorge F. Mejias, John D. Murray, Henry Kennedy, and Xiao-Jing Wang. Feedforward and feedback frequency-dependent interactions in a large-scale laminar network of the primate cortex.
Science Advances, 2(11):e1601335, 2016. doi: 10.1126/sciadv.1601335. URL https://www.science.org/doi/abs/10.1126/sciadv.1601335.
Jan Melchior and Laurenz Wiskott. Hebbian-descent, 2019. URL https://arxiv.org/abs/1905.10585.
Alexander Meulemans, Matilde Tristany Farinha, Javier Garcia Ordonez, Pau Vilimelis Aceituno, João Sacramento, and Benjamin F. Grewe. Credit assignment in neural networks through deep feedback control. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 4674–4687. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/25048eb6a33209cb5a815bff0cf6887c-Paper.pdf.
Georgios Michalareas, Julien Vezoli, Stan van Pelt, Jan-Mathijs Schoffelen, Henry Kennedy, and Pascal Fries. Alpha-beta and gamma rhythms subserve feedback and feedforward influences among human visual cortical areas. Neuron, 89(2):384–397, 2016. ISSN 0896-6273. doi: 10.1016/j.neuron.2015.12.018. URL https://www.sciencedirect.com/science/article/pii/S0896627315011204.
John Miller and Moritz Hardt. Stable recurrent models. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Hygxb2CqKm.
Beren Millidge, Yuhang Song, Tommaso Salvatori, Thomas Lukasiewicz, and Rafal Bogacz. Backpropagation at the infinitesimal inference limit of energy-based models: Unifying predictive coding, equilibrium propagation, and contrastive hebbian learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=nIMifqu2EO.
Javier R. Movellan. Contrastive Hebbian Learning in the Continuous Hopfield Model, pp. 10–17. Morgan Kaufmann, 1991. ISBN 978-1-4832-1448-1. doi: 10.1016/B978-1-4832-1448-1.50007-X.
URL https://www.sciencedirect.com/science/article/pii/B978148321448150007X.
Mitsumasa Nakajima, Yongbo Zhang, Katsuma Inoue, Yasuo Kuniyoshi, Toshikazu Hashimoto, and Kohei Nakajima. Reservoir direct feedback alignment: deep learning by physical dynamics. Communications Physics, 7(1):411, 2024. ISSN 2399-3650. doi: 10.1038/s42005-024-01895-0.
Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/d490d7b4576290fa60eb31b5fc917ad1-Paper.pdf.
Peter O’Connor, Efstratios Gavves, and Max Welling. Initialized equilibrium propagation for backprop-free training. In International Conference on Learning Representations. ICLR, 2019. URL https://openreview.net/forum?id=B1GMDsR5tm.
Alexander Ororbia and Ankur Mali. Biologically motivated algorithms for propagating local target representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4651–4658. AAAI, 2019.
Alexander Ororbia, Patrick Haffner, David Reitter, and C. Lee Giles. Learning to adapt by minimizing discrepancy, 2017.
Alexander G. Ororbia. Brain-inspired machine intelligence: A survey of neurobiologically-plausible credit assignment, 2023. URL https://arxiv.org/abs/2312.09257.
Alexander G. Ororbia, Ankur Mali, Daniel Kifer, and C. Lee Giles. Backpropagation-free deep learning with recursive local representation alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 9327–9335. AAAI, 2023. URL https://ojs.aaai.org/index.php/AAAI/article/view/26118.
Matthew G. Perich and Kanaka Rajan. Rethinking brain-wide interactions through multi-region ‘network of networks’ models. Current Opinion in Neurobiology, 65:146–151, 2020. ISSN 09594388.
doi: 10.1016/j.conb.2020.11.003.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986. ISSN 0028-0836. doi: 10.1038/323533a0.
João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/1dc3a89d0d440ba31729b0ba74b93a33-Paper.pdf.
Tommaso Salvatori, Luca Pinchetti, Beren Millidge, Yuhang Song, Tianyi Bao, Rafal Bogacz, and Thomas Lukasiewicz. Learning on arbitrary graph topologies via predictive coding. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=dqO59nI_R9A.
Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11:24, 2017. ISSN 1662-5188. doi: 10.3389/fncom.2017.00024.
Benjamin Scellier, Anirudh Goyal, Jonathan Binas, Thomas Mesnard, and Yoshua Bengio. Generalization of equilibrium propagation to vector field dynamics, 2018. URL https://arxiv.org/abs/1808.04873.
Benjamin Scellier, Maxence Ernoult, Jack Kendall, and Suhas Kumar. Energy-based learning algorithms for analog computing: a comparative study. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 52705–52731. Curran Associates, Inc., 2023.
URL https://proceedings.neurips.cc/paper_files/paper/2023/file/a52b0d191b619477cc798d544f4f0e4b-Paper-Conference.pdf.
João D. Semedo, Anna I. Jasper, Amin Zandvakili, Aravind Krishna, Amir Aschner, Christian K. Machens, Adam Kohn, and Byron M. Yu. Feedforward and feedback interactions between visual cortical areas use different population activity patterns. Nature Communications, 13(1):1099, 2022. ISSN 2041-1723. doi: 10.1038/s41467-022-28552-w. URL https://doi.org/10.1038/s41467-022-28552-w.
Shawn C. Shadden, Francois Lekien, and Jerrold E. Marsden. Definition and properties of lagrangian coherent structures from finite-time lyapunov exponents in two-dimensional aperiodic flows. Physica D: Nonlinear Phenomena, 212(3-4):271–304, 2005. ISSN 01672789. doi: 10.1016/j.physd.2005.10.007.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/37bc2f75bf1bcfe8450a1a41c200364c-Paper.pdf.
Ran Wang, Xupeng Chen, Amirhossein Khalilian-Gourtani, Leyao Yu, Patricia Dugan, Daniel Friedman, Werner Doyle, Orrin Devinsky, Yao Wang, and Adeen Flinker. Distributed feedforward and feedback cortical processing supports human speech production. Proceedings of the National Academy of Sciences, 120(42):e2300255120, 2023. doi: 10.1073/pnas.2300255120. URL https://doi.org/10.1073/pnas.2300255120.
D. J. Watts and S. H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393(6684):440–442, 1998. ISSN 0028-0836. doi: 10.1038/30918.
Alan Wolf, Jack B. Swift, Harry L. Swinney, and John A. Vastano. Determining lyapunov exponents from a time series. Physica D: Nonlinear Phenomena, 16(3):285–317, 1985. ISSN 0167-2789. doi: 10.1016/0167-2789(85)90011-9. URL https://www.sciencedirect.com/science/article/pii/0167278985900119.
X. Xie and H. S. Seung. Equivalence of backpropagation and contrastive hebbian learning in a layered network. Neural Computation, 15(2):441–454, 2003. ISSN 0899-7667. doi: 10.1162/089976603762552988.
Izzet B. Yildiz, Herbert Jaeger, and Stefan J. Kiebel. Re-visiting the echo state property. Neural Networks, 35:1–9, 2012. ISSN 08936080. doi: 10.1016/j.neunet.2012.07.005.

A The Dynamics of the RNN
We quantify the convergence property of the recurrent neural network (RNN) with the maximum Lyapunov exponent (MLE) (Wolf et al., 1985) and the finite-time maximum Lyapunov exponent (FTMLE) (Kanno & Uchida, 2014). When the MLE/FTMLE is large, the RNN converges slowly or not at all. To compute the MLE and FTMLE, we first initialize a random perturbation vector $\delta_0$. Then we record the sequence of states $s^0[t]$ with $t = 0, 1, 2, \ldots$
, $T_e - 1$ corresponding to the last sample of a training set (see Figure 2 in the main text), and run the following steps:

 1. Normalize the perturbation vector to unit length:
    \[ \delta_t \leftarrow \frac{\delta_t}{\|\delta_t\|} \]
 2. Calculate the Jacobian matrix:
    \[ J(s^0[t]) = \frac{\partial F(s^0[t], b)}{\partial s^0[t]} \]
 3. Update the perturbation:
    \[ \delta_{t+1} = J(s^0[t]) \cdot \delta_t \]
 4. Record:
    \[ r_t = \ln \|\delta_{t+1}\| \]

The maximum Lyapunov exponent is computed as $\lambda_{\max} = \frac{1}{T_e} \sum_{t=0}^{T_e - 1} r_t$ for a sufficiently large $T_e$ (default $T_e = 500$). The result at any $T < T_e$ is the FTMLE.
Figures S1–S2 show the FTMLE, MLE, training accuracy, and test accuracy versus epochs for different models. In all cases, a smaller $\beta_i$ usually yields a smaller (FT)MLE, whereas a larger $\beta_i$ does not always lead to a larger (FT)MLE because the activation function saturates, and saturation diminishes the perturbation.
For the 2-hidden-layer RNN, a smaller feedback scaling $\beta_i$ yields steady training progress and better accuracy. Figure S3 plots the FTMLE and test accuracy against feedback scaling for different numbers of hidden layers. It shows that a smaller $\beta_i$ is favorable for shallow networks, because the RNN converges more easily (as indicated by the FTMLE). For deeper networks (5 hidden layers or more), however, a smaller $\beta_i$ degrades performance because of the vanishing gradient.
A further comparison between our FRE-RNN incorporating a convolutional structure and previous work (Ernoult et al., 2019) is plotted in Figure S4. These results suggest that small feedback scaling ($\beta_i = 0.01$) leads to a smoother training process.

B Gradient Vanishing and the Residual Connections
Figures S5 and S6 plot the error of each neuron versus epoch at different $\beta_i$. For a 2-hidden-layer RNN, the best performance is obtained at $\beta_i = 0.001$. In this situation, the error of the first hidden layer is at least two orders of magnitude smaller than that of the second hidden layer.
At $\beta_i = 2$, the error also decreases from higher layers (high-index neurons, closer to the output layer) to lower layers, which is attributed to the saturation of the activation function. In general, training progresses more steadily for smaller $\beta_i$ despite the vanishing gradient, and this also applies to deeper networks (up to 10 hidden layers).
To eliminate the vanishing gradient in EP, direct feedback from the higher layers or local amplification (with a higher learning rate) is unavoidable (Nøkland, 2016; Ororbia et al., 2023). Figure S7 shows the effect of residual connections. $\beta_i = 0.1$ yields the best accuracy, 97.5%, owing to the balance between gradient flow and convergence.
Figure S8 shows how the testing accuracy varies with the connection probability $P$ of AGT with 10 hidden layers. Apart from the connections of the layered model, a connection between any two hidden layers is generated with probability $P$, i.e., we first use $P$ to decide whether the connection between two given layers is established. As $P$ increases, the accuracy first rises, peaks at 0.2, and decreases toward $P = 1$. The reason behind this trend is yet to be explored.

Figure S1: The FTMLE, MLE, training accuracy, and testing accuracy of symmetric RNNs versus epochs with different feedback scaling $\beta_i$ (legend). First row: 2 hidden layers; second row: 3 hidden layers; third row: 5 hidden layers; fourth row: 10 hidden layers. The activation is tanh. Each case is repeated 5 times.

Figure S2: The FTMLE, MLE, training accuracy, and testing accuracy of asymmetric RNNs versus epochs with different feedback scaling $\beta_i$ (legend). First row: 2 hidden layers; second row: 3 hidden layers; third row: 5 hidden layers; fourth row: 10 hidden layers. The activation is tanh. Each case is repeated 5 times.
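
The (FT)MLE values shown in Figures S1–S2 follow the four-step procedure of Appendix A. The sketch below illustrates that procedure on a toy map; it is not the authors' implementation. The single-matrix update `s[t+1] = tanh(W s[t] + b)`, the function names, and the contraction used in the example are all assumptions made for illustration, whereas the paper applies the procedure to the full FRE-RNN dynamics $F$.

```python
import numpy as np


def step(s, W, b):
    """One update of a toy discrete-time RNN: s[t+1] = tanh(W s[t] + b)."""
    return np.tanh(W @ s + b)


def jacobian(s, W, b):
    """Jacobian of `step` with respect to s, evaluated at state s."""
    pre = W @ s + b
    return (1.0 - np.tanh(pre) ** 2)[:, None] * W


def lyapunov_growth(W, b, s_init, T_e=500, seed=0):
    """Return the per-step log growth rates r_t for t = 0, ..., T_e - 1."""
    rng = np.random.default_rng(seed)
    delta = rng.standard_normal(s_init.shape[0])  # random initial perturbation
    s = np.array(s_init, dtype=float)
    r = np.empty(T_e)
    for t in range(T_e):
        delta = delta / np.linalg.norm(delta)     # 1. normalize to unit length
        J = jacobian(s, W, b)                     # 2. Jacobian at the current state
        delta = J @ delta                         # 3. propagate the perturbation
        r[t] = np.log(np.linalg.norm(delta))      # 4. record the log growth
        s = step(s, W, b)                         # advance along the trajectory
    return r


# Toy example (assumption): weights rescaled to spectral norm 0.5, a contraction.
rng = np.random.default_rng(1)
A = rng.standard_normal((32, 32))
W = 0.5 * A / np.linalg.norm(A, 2)
b = 0.1 * rng.standard_normal(32)
r = lyapunov_growth(W, b, s_init=np.zeros(32))

mle = float(r.mean())            # average over all T_e steps
ftmle_50 = float(r[:50].mean())  # finite-time value at T = 50
```

Because the toy map is a contraction, every recorded growth rate is bounded above by ln 0.5, so both `mle` and `ftmle_50` are negative, which corresponds to the fast-converging regime described in Appendix A.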

Figure S3: The FTMLE and testing accuracy versus feedback scaling $\beta_i$ with different numbers of hidden layers. (a) Symmetric weights; (b) asymmetric weights. The FTMLE and testing accuracy given here correspond to their maxima over all epochs. Note that the 5-hidden-layer asymmetric RNN with large $\beta_i$ diverged, resulting in missing data points in (b). Each case is repeated 5 times.

Figure S4: Comparison of RNNs embedded with a convolutional structure on MNIST between P-EP (a) (Ernoult et al., 2019) and our approach at different $\beta_i$ (b-d). We used the same parameters as the EP reference (Ernoult et al., 2019).

Figure S5: For the 2-hidden-layer RNN, the mean error of each neuron in the last batch and the testing accuracy versus epochs at different $\beta_i$. All neurons in the hidden layers and the output layer are indexed from the input to the output layer.

Figure S6: For the 10-hidden-layer model, the mean error of each neuron in the last batch and the testing accuracy versus epochs at different $\beta_i$. All neurons in the hidden layers and the output layer are indexed from the input to the output layer.

Figure S7: For the 10-hidden-layer model with residual connections, the mean error of each neuron in the last batch and the testing accuracy versus epochs at different $\beta_i$. All neurons in the hidden layers and the output layer are indexed from the input to the output layer.

Figure S8: The testing accuracy on MNIST versus the connection probability $P$ of AGT with 10 hidden layers. The experiments are repeated 5 times.

Table S1: Testing accuracy (mean of 5 repeated experiments) with different feedback scaling $\beta_i$.
By default, $T = 10 \times N_{\mathrm{HiddenLayer}}$, $K = 5 \times N_{\mathrm{HiddenLayer}}$. Each hidden layer has 64 nodes.

Architecture-connections   βi = 0.001   βi = 0.01   βi = 0.1   βi = 0.25   βi = 1    βi = 2    βi = 4
2HL-symm                   97.69%       97.57%      97.25%     96.22%      93.12%    66.04%    40.92%
3HL-symm                   97.22%       97.64%      97.41%     96.60%      55.86%    32.64%    22.11%
5HL-symm                   93.54%       95.54%      97.60%     90.63%      25.31%    17.88%    14.61%
10HL-symm                  87.15%       89.99%      92.54%     41.84%      14.07%    14.30%    14.23%
10HL-Residual-symm         –            97.52%      97.46%     –           95.51%    –         –
conv-symm                  –            99.15%      98.71%     –           11.35%    –         –
2HL-asymm                  96.96%       96.97%      96.88%     96.79%      93.88%    91.81%    89.91%
3HL-asymm                  95.17%       96.91%      96.76%     96.66%      91.21%    54.65%    26.72%
5HL-asymm                  91.14%       92.34%      96.41%     96.35%      17.15%    11.35%    13.07%
10HL-asymm                 84.27%       85.83%      87.79%     90.97%      16.13%    14.21%    16.67%
10HL-AGT-asymm             –            96.37%      96.75%     –           33.31%    –         –

C Equivalence with EP and BP under the Condition of Infinitesimal Inference Limit

Figure S9: A layered network model used to illustrate the process of backpropagation (BP), local representation alignment (LRA), and EP. Note that the final prediction layer $\cdot_p$ corresponds to the third layer with subindex $\cdot_3$. For LRA, we use $\beta_{\mathrm{LRA}}$ instead of $\beta_1$ and $\beta_f$. For BP, the feedback (orange) paths are absent.

In this section, we use the infinitesimal inference limit (Millidge et al., 2023) to derive the equivalence of EP with LRA and BP.

C.1 Backpropagation

When we remove the feedback connections of the 2-hidden-layer RNN shown in Figure S9, a feedforward network is left, which can be trained with BP. The forward process of BP is described by:

\[
\begin{aligned}
s_1 &= \rho(h_1), & h_1 &= W_0 \cdot s_0, \\
s_2 &= \rho(h_2), & h_2 &= W_1 \cdot s_1, \\
s_p &= h_p, & h_p &= W_f \cdot s_2.
\end{aligned}
\tag{S1}
\]

Defining a loss $\mathcal{L}_{\mathrm{BP}} = \frac{1}{2}(s_p - s_{\mathrm{tar}})^2$, the weights adjust according to the gradient of the loss.
Taking $\Delta W_0$ as an example:

\[
\Delta W_0 = -\frac{\partial \mathcal{L}_{\mathrm{BP}}}{\partial W_0}
= -\left[ \rho'(h_1) \odot \left( W_1^\top \cdot \left[ \rho'(h_2) \odot \left( W_f^\top \cdot (s_p - s_{\mathrm{tar}}) \right) \right] \right) \right] \cdot (s_0)^\top,
\tag{S2}
\]

where “$\odot$” denotes the Hadamard (element-wise) product and “$\cdot$” denotes scalar or matrix multiplication. For two vectors or matrices, “$\odot$” requires identical dimensions and computes element-wise products. Broadcasting rules may apply (e.g., for a column vector $v_{m \times 1}$, the product $v \odot A_{m \times n}$ scales each column of $A$ by $v$).

C.2 Local Representation Alignment

LRA is an alternative training method that follows the principle of discrepancy reduction (Ororbia et al., 2017; Ororbia & Mali, 2019). It can be divided into two phases: 1) the network runs the forward process, producing latent representations of the input samples; 2) the weights adjust in the direction that reduces the mismatch between the current latent representations and the target representations in each layer.
The forward process is the same as in BP:

\[
\begin{aligned}
s_1^0 &= \rho(h_1^0), & h_1^0 &= W_0 \cdot s_0, \\
s_2^0 &= \rho(h_2^0), & h_2^0 &= W_1 \cdot s_1^0, \\
s_p^0 &= h_p^0, & h_p^0 &= W_f \cdot s_2^0,
\end{aligned}
\tag{S3}
\]

where the $s_i^0$ are interpreted as the latent representations. The prediction error is $e_p = s_{\mathrm{tar}} - s_p^0$.
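Then we can obtain the target representation of the second hidden layer:

\[
s_2^{\beta_{\mathrm{LRA}}} = \rho(h_2^{\beta_{\mathrm{LRA}}}), \qquad h_2^{\beta_{\mathrm{LRA}}} = W_1 \cdot s_1^0 + \beta_{\mathrm{LRA}} \cdot B_f \cdot e_p.
\tag{S4}
\]

The same goes for the first hidden layer:

\[
s_1^{\beta_{\mathrm{LRA}}} = \rho(h_1^{\beta_{\mathrm{LRA}}}), \qquad h_1^{\beta_{\mathrm{LRA}}} = W_0 \cdot s_0 + \beta_{\mathrm{LRA}} \cdot B_1 \cdot e_2, \qquad e_2 = s_2^{\beta_{\mathrm{LRA}}} - s_2^0.
\tag{S5}
\]

LRA defines the loss as the total discrepancy between latent representations and target representations:

\[
\mathcal{L}_{\mathrm{LRA}} = \sum_{i=1}^{L} k_i \mathcal{L}_i(s_i^0, s_i^{\beta_{\mathrm{LRA}}}) = \sum_{i=1}^{L} \frac{1}{2} \left( s_i^0 - s_i^{\beta_{\mathrm{LRA}}} \right)^2.
\tag{S6}
\]

The weight $W_i$ adjusts according to the local mismatch between $s_{i+1}^0$ and $s_{i+1}^{\beta_{\mathrm{LRA}}}$:

\[
\begin{aligned}
\Delta W_i &= -\frac{\partial\, k_i \mathcal{L}_i(s_{i+1}^0, s_{i+1}^{\beta_{\mathrm{LRA}}})}{\partial W_i} \\
&= \left[ (s_{i+1}^{\beta_{\mathrm{LRA}}} - s_{i+1}^0) \odot \rho'(h_{i+1}^0) \right] \cdot (s_i^0)^\top \\
&\approx (s_{i+1}^{\beta_{\mathrm{LRA}}} - s_{i+1}^0) \cdot (s_i^0)^\top,
\end{aligned}
\tag{S7}
\]

where the derivative of the activation function is omitted in the last row, a useful practice common in LRA (Melchior & Wiskott, 2019; Ororbia & Mali, 2019; Ororbia et al., 2023). When $\beta_{\mathrm{LRA}} \to 0$, $s_i^{\beta_{\mathrm{LRA}}} \to s_i^0$ and $h_i^{\beta_{\mathrm{LRA}}} \to h_i^0$, so that

\[
\begin{aligned}
e_i &= s_i^{\beta_{\mathrm{LRA}}} - s_i^0 = \rho(h_i^{\beta_{\mathrm{LRA}}}) - \rho(h_i^0) \\
&= \rho(h_i^0 + \beta_{\mathrm{LRA}} \cdot B_i \cdot e_{i+1}) - \rho(h_i^0) \\
&\approx \left[ \rho(h_i^0) + \rho'(h_i^0) \odot (\beta_{\mathrm{LRA}} \cdot B_i \cdot e_{i+1}) - \rho(h_i^0) \right]_{\beta_{\mathrm{LRA}} \to 0} \\
&= \rho'(h_i^0) \odot (\beta_{\mathrm{LRA}} \cdot B_i \cdot e_{i+1}).
\end{aligned}
\tag{S8}
\]

The approximation in Equation S8 is based on a first-order Taylor expansion of $\rho(h_i^0 + \Delta h)$ around $h_i^0$, where $\Delta h = \beta_{\mathrm{LRA}} \cdot B_i \cdot e_{i+1}$. For a small perturbation $\Delta h \to 0$, the Taylor expansion gives:

\[
\rho(h_i^0 + \Delta h) = \rho(h_i^0) + \rho'(h_i^0) \odot \Delta h + O(\Delta h^2).
\tag{S9}
\]

When $\beta_{\mathrm{LRA}} \to 0$, the higher-order terms $O(\Delta h^2)$ are negligible, leaving only the linear term. We arrive at the last row of Equation S8 after canceling $\rho(h_i^0)$. We can then express the weight adjustment as

\[
\begin{aligned}
\Delta W_0 &= e_1 \cdot (s_0^0)^\top \\
&= \left[ \rho'(h_1^0) \odot \left( \beta_{\mathrm{LRA}} \cdot B_1 \cdot \left[ \rho'(h_2^0) \odot \left( \beta_{\mathrm{LRA}} \cdot B_f \cdot (s_{\mathrm{tar}} - s_p) \right) \right] \right) \right] \cdot (s_0)^\top \Big|_{B_i = (W_i)^\top} \\
&= -\beta_{\mathrm{LRA}} \cdot \beta_{\mathrm{LRA}} \cdot \left[ \rho'(h_1^0) \odot \left( W_1^\top \cdot \left[ \rho'(h_2^0) \odot \left( W_f^\top \cdot (s_p - s_{\mathrm{tar}}) \right) \right] \right) \right] \cdot (s_0)^\top,
\end{aligned}
\tag{S10}
\]

which is the same as BP (Equation S2) up to a constant factor. Thus, LRA at the weak feedback limit approximates BP.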
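
The weak feedback limit of Equation S10 can be checked numerically. The following sketch is an illustration under stated assumptions rather than the authors' code: the layer sizes, random weights, tanh activation, symmetric feedback $B_i = W_i^\top$, the value of `beta`, and all function names are chosen here for the example. It compares the LRA update $e_1 \cdot (s_0)^\top$ with the negative BP gradient of Equation S2.

```python
import numpy as np


def rho(h):
    return np.tanh(h)


def drho(h):
    return 1.0 - np.tanh(h) ** 2


def forward(W0, W1, Wf, s0):
    """Feedforward pass shared by BP and LRA (Equations S1 and S3)."""
    h1 = W0 @ s0
    s1 = rho(h1)
    h2 = W1 @ s1
    s2 = rho(h2)
    sp = Wf @ s2
    return h1, s1, h2, s2, sp


def bp_delta_W0(W0, W1, Wf, s0, s_tar):
    """Negative BP gradient of 0.5 * ||sp - s_tar||^2 w.r.t. W0 (Equation S2)."""
    h1, _, h2, _, sp = forward(W0, W1, Wf, s0)
    d2 = drho(h2) * (Wf.T @ (sp - s_tar))
    d1 = drho(h1) * (W1.T @ d2)
    return -np.outer(d1, s0)


def lra_delta_W0(W0, W1, Wf, s0, s_tar, beta):
    """LRA update e1 * s0^T with symmetric feedback B_i = W_i^T (Equations S4, S5, S7)."""
    _, s1, _, s2, sp = forward(W0, W1, Wf, s0)
    e_p = s_tar - sp
    e2 = rho(W1 @ s1 + beta * (Wf.T @ e_p)) - s2  # target minus latent, layer 2
    e1 = rho(W0 @ s0 + beta * (W1.T @ e2)) - s1   # target minus latent, layer 1
    return np.outer(e1, s0)


def cosine(a, b):
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical small network with random weights.
rng = np.random.default_rng(0)
n_in, n_h, n_out = 8, 16, 4
W0 = rng.standard_normal((n_h, n_in)) / np.sqrt(n_in)
W1 = rng.standard_normal((n_h, n_h)) / np.sqrt(n_h)
Wf = rng.standard_normal((n_out, n_h)) / np.sqrt(n_h)
s0 = rng.standard_normal(n_in)
s_tar = rng.standard_normal(n_out)

beta = 1e-3  # illustrative weak-feedback value
g_bp = bp_delta_W0(W0, W1, Wf, s0, s_tar)
g_lra = lra_delta_W0(W0, W1, Wf, s0, s_tar, beta)

cos_sim = cosine(g_lra, g_bp)
rel_err = float(np.linalg.norm(g_lra / beta**2 - g_bp) / np.linalg.norm(g_bp))
```

Dividing the LRA update by `beta**2` removes the constant factor in Equation S10, so `rel_err` measures only the higher-order Taylor terms; increasing `beta` should make the two updates drift apart.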
An LRA algorithm for a 2-hidden-layer network is described in Algorithm S1. The feedback weights in LRA need not be learned here; they can instead be kept symmetric with the feedforward weights.

C.3 EQUILIBRIUM PROPAGATION

We can also formulate EP in terms of discrepancy reduction. In EP (Algorithm 1 in the main text), the network states evolve as follows ($\beta = 0$ for the first phase and $\beta = \beta_f$ for the second phase):

$$\begin{aligned}
h_1^{\beta} &= W_0 \cdot s_0^{\beta} + \beta_1 \cdot B_1 \cdot s_2^{\beta}, \\
h_2^{\beta} &= W_1 \cdot s_1^{\beta} + \beta_f \cdot B_f \cdot e_p, \\
h_p^{\beta} &= W_f \cdot s_2^{\beta}, \\
s_1^{\beta},\ s_2^{\beta},\ s_p^{\beta} &= \rho(h_1^{\beta}),\ \rho(h_2^{\beta}),\ h_p^{\beta},
\end{aligned} \tag{S11}$$

where $e_p = s_{tar} - s_p^0$ is the prediction error. The network converges to the final states $h_1^0, h_2^0, s_1^0, s_2^0$ in the free phase. The error of the $s_2$ neurons can be described by:

$$ds_2 = [\rho(h_2^{\beta_f})]_{\beta_f \to 0} - [\rho(h_2^0)]_{\beta_f = 0} \approx \rho'(h_2^0) \odot (\beta_f \cdot B_f \cdot e_p), \tag{S12}$$

where only the first-order infinitesimal term is retained as $\beta_1 \to 0$. The same goes for the first hidden layer:

$$ds_1 = [\rho(h_1^{\beta_f})]_{\beta_f \to 0} - [\rho(h_1^0)]_{\beta_f = 0} \approx \rho'(h_1^0) \odot \big( \beta_1 \cdot B_1 \cdot ( \rho'(h_2^0) \odot (\beta_f \cdot B_f \cdot e_p) ) \big). \tag{S13}$$

The weight $W_0$ can be updated by:

$$\Delta W_0 = \frac{ds_1 \cdot (s_0^0)^\top}{\beta_1 \cdot \beta_f} = \rho'(h_1^0) \odot \Big[ B_1 \cdot \big( \rho'(h_2^0) \odot [ B_f \cdot e_p ] \big) \Big] \cdot (s_0^0)^\top. \tag{S14}$$

With $B_i = W_i^\top$,

$$ds_1 = \beta_f \cdot \beta_1 \cdot \rho'(h_1^0) \odot \Big[ W_1^\top \cdot \big( \rho'(h_2^0) \odot \big[ W_f^\top \cdot ( -(s_p - s_{tar}) ) \big] \big) \Big], \tag{S15}$$

$$\Delta W_0 = \frac{ds_1 \cdot (s_0^0)^\top}{\beta_1 \cdot \beta_f} = -\rho'(h_1^0) \odot \Big[ W_1^\top \cdot \big( \rho'(h_2^0) \odot \big[ W_f^\top \cdot (s_p - s_{tar}) \big] \big) \Big] \cdot (s_0^0)^\top. \tag{S16}$$

Note that, compared with the weight update in the main text, the factor $1/(\beta_1 \cdot \beta_f)$ is added to recover a gradient amplitude similar to BP. Further, if we assume that the high-order infinitesimal in the first phase can be omitted, the dynamics of the RNN are governed by:

$$s_1^0 = \rho(h_1^0), \quad h_1^0 = [W_0 \cdot s_0 + \beta_1 \cdot B_1 \cdot s_2^0]_{\beta_1 \to 0} \approx W_0 \cdot s_0, \tag{S17}$$

$$s_2^0 = \rho(h_2^0), \quad h_2^0 = [W_1 \cdot s_1^0 + \beta_f \cdot B_f \cdot e_p]_{\beta_1 \to 0,\ \beta_f = 0} \approx W_1 \cdot s_1^0, \tag{S18}$$

$$s_p^0 = h_p^0, \quad h_p^0 = W_f \cdot s_2^0. \tag{S19}$$

The information flow of the RNN degenerates into that of a feedforward network.
This does not affect the error information $ds_i$, thus Equation S16 approximates Equation S2 for BP. Meanwhile, it resembles LRA with low $\beta_{LRA}$, which turns explicit error into implicit error. So far, we have shown that although the errors are obtained differently in EP, LRA, and BP, they are equivalent under the assumption of weak supervision and weak feedback.

Algorithm S1 Local Representation Alignment (LRA)
Input: $(x, s_{tar})$
Parameter: $\theta = [W_0, W_1, W_f, B_f, B_1, \beta_{LRA}]$
Output: $\theta$
function FORWARD($\theta$, $x$)
    $s_0 \leftarrow x$
    $h_1 \leftarrow W_0 \cdot s_0$, $s_1^0 \leftarrow \rho(h_1)$
    $h_2 \leftarrow W_1 \cdot s_1^0$, $s_2^0 \leftarrow \rho(h_2)$
    $s_p^0 \leftarrow W_f \cdot s_2^0$
    $\Lambda_1 \leftarrow [s_i^0]$, $i = 0, 1, 2, p$
    return $\Lambda_1$
end function
function FEEDBACK($\theta$, $\Lambda_1$, $s_{tar}$)
    $e_p \leftarrow s_{tar} - s_p^0$
    $h_2 \leftarrow W_1 \cdot s_1^0 + \beta_{LRA} \cdot B_f \cdot e_p$, $s_2^{\beta_{LRA}} \leftarrow \rho(h_2)$
    $e_2 \leftarrow s_2^{\beta_{LRA}} - s_2^0$
    $h_1 \leftarrow W_0 \cdot s_0 + \beta_{LRA} \cdot B_1 \cdot e_2$, $s_1^{\beta_{LRA}} \leftarrow \rho(h_1)$
    $e_1 \leftarrow s_1^{\beta_{LRA}} - s_1^0$
    $\Lambda_2 \leftarrow [e_1, e_2, e_p]$
    return $\Lambda_2$
end function
function UPDATING-WEIGHTS($\theta$, $\Lambda_1$, $\Lambda_2$)
    $\Delta W_i \leftarrow e_{i+1} \cdot (s_i^0)^\top$, $i = 0, 1$
    $\Delta W_f \leftarrow e_p \cdot (s_2^0)^\top$
end function

C.4 EXPERIMENTS FOR EQUIVALENCE WITH EP AND BP

Prior works have shown that EP is equivalent to BPTT under specific conditions and can achieve comparable performance (Ernoult et al., 2019; Laborieux et al., 2021). As discussed in the previous section, although the overall architecture forms an RNN, the network behaves similarly to a feedforward model due to weak feedback connections.
To show the equivalence of EP and BP experimentally, we compare our model with a feedforward neural network (FNN) that has the same feedforward weights and is trained by BP. We mainly compare the cosine similarity of states, bias gradients, and weight gradients for the first batch (batch size 200), as given in Figure S10.
Figure S10(a-c) shows the similarity under $\beta_i = 1$ with different numbers of iterations.
For the bias gradients, i.e., $ds_i$, the cosine similarity declines rapidly, indicating no similarity between our model and BP. With weak feedback $\beta_i = 0.1$, as shown in Figure S10(d-f), the similarity of states approaches 1 and the similarity of the bias gradients of the last 6 (4) layers exceeds 0.5 with $T = 500/50$ ($T = 20$). These results provide further evidence that EP is equivalent to BP under the condition of weak feedback.
We further studied the influence of $\beta_i$ on the cosine similarity. Figure S10(g) shows that a larger $\beta_i$ leads to a lower similarity of states. Figure S10(h) shows that a lower $\beta_i = 0.01$ also leads to a decrease in similarity, which may be caused by the insufficient precision of data storage (float32 by default). Therefore, we repeat the experiments with datatype float64. Figure S10(k,l) shows that the similarity of the gradient signal remains around 1 with $\beta_i = 0.1$. This indicates that weak feedback does indeed lead to an exponential decline in gradient signals, thus requiring higher relative accuracy.
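The precision argument above can be illustrated with a small numerical sketch. It is an illustration under stated assumptions rather than a reproduction of the experiment: the perturbation reaching a layer is modeled as $\beta^{d}$ for a hypothetical depth $d$ (in the spirit of the $\beta_1 \cdot \beta_f$ product in Equation S13), the activation is assumed to be tanh, and the pre-activation values are arbitrary numbers of order 1. The helper `nudged_difference` is a hypothetical name.

```python
import numpy as np


def nudged_difference(h, beta, depth, dtype):
    """Return tanh(h + beta**depth) - tanh(h), computed entirely in `dtype`."""
    h = np.asarray(h, dtype=dtype)
    delta = dtype(beta ** depth)          # perturbation cast to the working precision
    return np.tanh(h + delta) - np.tanh(h)


# Illustrative pre-activations of order 1 (assumption).
h = [0.5, 0.75, 1.0, 1.5]
beta, depth = 0.01, 4                     # perturbation of about 1e-8

ds32 = nudged_difference(h, beta, depth, np.float32)
ds64 = nudged_difference(h, beta, depth, np.float64)

# First-order prediction rho'(h) * delta, evaluated in float64 for reference.
expected = (1.0 - np.tanh(np.asarray(h)) ** 2) * beta ** depth

print("float32 difference:", ds32)
print("float64 difference:", ds64)
print("first-order value :", expected)
```

In this sketch the perturbation is smaller than half the float32 spacing around the chosen pre-activations, so the float32 state difference vanishes, while float64 still resolves the first-order term. This is consistent with the observation that very weak feedback calls for higher relative accuracy.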
[Figure S10: twelve panels of cosine similarity (y-axis) versus layer (x-axis: fc1 to fc8, out). Panel titles: (a, d, g, j) states; (b, e, h, k) bias gradients; (c, f, i, l) weight gradients. Legend for (a-f): T=500,K=200; T=50,K=20; T=20,K=8. Legend for (g-l): βi = 0.0, 0.01, 0.1, 0.5, 1.]

Figure S10: The cosine similarity of gradients and states between our model and a feedforward model trained by BP in an 8-hidden-layer FNN (states: $s_i^0$; bias gradients: $ds_i$; weight gradients: $\Delta W_i$). The x-axis indexes the layers of the model. Error propagates from the last layer "out" to the first hidden layer "fc1" layer by layer. (a-c) Different numbers of iterations under feedback scaling $\beta_i = 1$. (d-f) Different numbers of iterations under small feedback scaling $\beta_i = 0.1$. (g-i) Different feedback scaling (T=50, K=20). (j-l) Repeat of (g-i) with datatype float64 (float32 by default).

C.5 VERIFYING THE EFFECTIVENESS OF WEAK FEEDBACK IN EQUILIBRIUM PROPAGATION

[Figure S11: test accuracy (y-axis, 0.80 to 1.00) versus the number of learning layers $n_{ll}$ (x-axis "Only last $n_{ll}$ layers learning": 1, 3, 5). Legend: βi = 0.01, βi = 0.1.]

Figure S11: Test accuracy on MNIST as a function of $n_{ll}$ for different $\beta_i$. The experiments are repeated 5 times.

To demonstrate that the lower layers of our model indeed receive meaningful credit signals, we report in Figure S11 the test accuracy obtained when only the last $n_{ll}$ layers are updated (i.e., the weights of the remaining lower layers are frozen). For a 5-hidden-layer model with $\beta_i = 0.1$, updating only the final layer yields a test accuracy of about 85%. As $n_{ll}$ increases to 5, the accuracy rises to around 97.5%. A similar trend is observed for the model with $\beta_i = 0.01$. These results show that achieving over 97% accuracy requires effective gradient propagation to all layers, confirming that our model successfully delivers usable credit signals throughout the entire network.

C.6 ROBUSTNESS TO THE NOISE

[Figure S12: heat maps of maximum test accuracy. Rows: weight noise intensity; columns: state noise intensity.]

(a) 2HL
Weight noise \ State noise    0        1e-05    0.001    0.1
0                             0.972    0.972    0.972    0.904
0.001                         0.972    0.972    0.972    0.896
0.01                          0.917    0.917    0.912    0.689
0.1                           0.194    0.206    0.196    0.217

(b) 3HL
Weight noise \ State noise    0        1e-05    0.001    0.1
0                             0.974    0.974    0.967    0.752
0.001                         0.973    0.975    0.964    0.749
0.01                          0.903    0.904    0.838    0.485
0.1                           0.167    0.180    0.174    0.167

(c) 5HL
Weight noise \ State noise    0        1e-05    0.001    0.1
0                             0.976    0.969    0.814    0.469
0.001                         0.973    0.965    0.812    0.464
0.01                          0.876    0.815    0.520    0.299
0.1                           0.134    0.135    0.134    0.135

Figure S12: Maximum test accuracy of the model under different noise intensities on weights and states, with noise added in both training and test. (a) With 2 hidden layers; (b) With 3 hidden layers; (c) With 5 hidden layers. The model is trained for 50 epochs and the experiments are repeated 5 times.

To evaluate the robustness of the model, we introduce noise on weights and time-varying noise on states, which are random Gaussian noise imposed at each weight update or at each state update, respectively. The noise on weights is directly added to the weights, while the noise on states is added as the bias $b$ in Equation 3.1. The mean absolute values of non-zero weights and neural activations after noiseless training are approximately 0.09 and 0.76, respectively.
The accuracy of the model with two hidden layers under the two types of noise is presented in Figure S12(a). The model maintains satisfactory performance when the standard deviation of the state noise reaches 0.1 or the standard deviation of the weight noise reaches 0.01. In deeper structures (Figure S12(b,c)), the results for weight noise are consistent with these observations, demonstrating excellent robustness. However, the tolerance to time-varying state noise degrades significantly, which we attribute to the layer-wise accumulation of noise and to the distortion of the weak gradient signal by the noise during training.
To confirm our hypothesis, we impose the noise only in the test, and the test accuracy remains almost unaffected (Figure S13). Therefore, the network is potentially resilient to noise. However, how to improve resilience in the training process requires further study.

[Figure S13: heat maps of maximum test accuracy. Rows: weight noise intensity; columns: state noise intensity, with tick labels 0, 0.001, 0.001, 0.1, 0.5.]

(a) 2HL
Weight noise \ State noise    0        0.001    0.001    0.1      0.5
0                             0.973    0.973    0.973    0.971    0.944
0.001                         0.971    0.971    0.974    0.970    0.945
0.01                          0.919    0.916    0.919    0.918    0.912
0.1                           0.199    0.187    0.194    0.219    0.177

(b) 3HL
Weight noise \ State noise    0        0.001    0.001    0.1      0.5
0                             0.974    0.976    0.975    0.974    0.940
0.001                         0.975    0.973    0.973    0.972    0.945
0.01                          0.908    0.903    0.908    0.902    0.894
0.1                           0.176    0.182    0.152    0.167    0.142

(c) 5HL
Weight noise \ State noise    0        0.001    0.001    0.1      0.5
0                             0.975    0.977    0.977    0.977    0.919
0.001                         0.973    0.973    0.973    0.973    0.928
0.01                          0.885    0.876    0.880    0.873    0.842
0.1                           0.133    0.128    0.143    0.163    0.142

Figure S13: Maximum test accuracy of the model under different noise intensities on weights and states (the state noise is added only in test). (a) With 2 hidden layers; (b) With 3 hidden layers; (c) With 5 hidden layers. The model is trained for 50 epochs in a single experiment.

D TRAINING DETAILS

Table S2 provides the parameters of the Adam optimizer that are used in Tables S1–S2 (Kingma & Ba, 2015). The training details for Table 1 are given in Table S3. For convolutional architectures in EP, the training process can be described by Algorithm S2. The training sample is fed into the network through $\mathrm{Conv}_0$. Then the state of the first layer goes through max pooling $\mathrm{MaxPool}_1$ and convolution $\mathrm{Conv}_1$ sequentially to reach the second layer.
The second layer also feeds its states back to the first layer through the transposed convolution $\mathrm{ConvT}_1$ and max-unpooling $\mathrm{MaxUnpool}_1$. After $T$ iterations, the RNN converges to the steady states and produces outputs through $\mathrm{MaxPool}_2$ and a fully connected layer. Then the prediction error is computed and used to nudge the RNN through the reverse of the fully connected layer and max-unpooling $\mathrm{MaxUnpool}_2$. Note that the unpooling $\mathrm{MaxUnpool}_i$ requires the indices from the corresponding pooling $\mathrm{MaxPool}_i$.
For Table S2, the Adam optimizer is used for all experiments. The activation functions sigmoid-s and hard-sigmoid are defined as $\rho(x) = \frac{1}{1 + e^{-4(x - 0.5)}}$ and $\rho(x) = \max(\min(x, 1), 0)$, respectively (Ernoult et al., 2019). For the 5-HL, 10-HL, and 20-HL architectures, the Adam optimizer parameters are as shown in Table S2 (epochs: 50, batch size: 500). The inference details of the architecture shown in Figure 3b are described by Algorithm S3. The details for convolutional architectures are given in Table S3 and Figures S14–S15. The cosine-annealing scheduler is used in convolutional architectures for CIFAR-10 ($T_{max} = 50$, $\eta_{min} = 10^{-6}$).
For MNIST, no pre-processing is used. For the CIFAR-10 dataset, we follow Scellier et al. (2023) to pre-process the images. We normalize the input images using mean $\mu = (0.4914, 0.4822, 0.4465)$ and standard deviation $\sigma = 3 \times (0.2023, 0.1944, 0.2010)$.
The results for the comparison of time consumption were obtained in a virtualized Windows 11 environment with an Intel Xeon Gold 6238R CPU, 16 GB RAM, and an Nvidia RTX A5000 (24 GB VRAM). Other results were obtained in a Windows 11 environment with an Intel Core i5-12490F, 32 GB RAM, and an Nvidia GTX 1650 (4 GB VRAM), or in a Windows 11 environment with an AMD R7-7700, 32 GB RAM, and an Nvidia RTX 4070 (12 GB VRAM). The default numerical precision is float32 (single-precision float).

Table S2: The parameters of the Adam optimizer.
Parameter Name                                            Default Value
Learning rate (MNIST / CIFAR-10)                          $10^{-3}$ / $2 \times 10^{-4}$
First-order moment estimation decay rate ($\beta_1$)      0.9
Second-order moment estimation decay rate ($\beta_2$)     0.999
Small constant for numerical stability ($\epsilon$)       $10^{-8}$

Table S3: Training details for Table 1 and Table 2. The results of EB-EP and P-EP come from previous work (Ernoult et al., 2019). SGD refers to Stochastic Gradient Descent with mini-batches.

Architecture              Training approach          Optimizer   Epochs/Batch size-T/K   Learning rate               Weight decay
2HL                       P-EP (sigmoid-s)           SGD         50/20-100/20            [0.005, 0.05, 0.2]          None
2HL                       Proposed (tanh, Adam)      Adam        50/500-10/10            [0.001, 0.001, 0.001]       None
3HL                       P-EP (sigmoid-s)           SGD         100/20-180/20           [0.002, 0.01, 0.05, 0.2]    None
3HL                       Proposed-DLR (tanh)        SGD         100/20-18/10            [0.002, 0.01, 0.05, 0.2]    None
3HL                       Proposed (tanh)            SGD         100/20-18/10            [0.1, 0.1, 0.1, 0.1]        None
3HL                       Proposed (tanh, Adam)      Adam        50/500-18/10            $10^{-3}$                   None
3HL                       BP (tanh, Adam)            Adam        50/500-1/1              $10^{-3}$                   None
Conv (Table 1)            P-EP (hard-sigmoid)        SGD         40/20-200/10            [0.015, 0.035, 0.15]        None
Conv (Table 1)            Proposed (hard-sigmoid)    SGD         40/128-20/10            [0.15, 0.35, 0.9]           $10^{-5}$
Conv (Table 1)            BP (hard-sigmoid)          SGD         40/128-1/1              [0.001, 0.02, 0.4]          $10^{-5}$
Conv (Table 2 MNIST)      Proposed (hard-sigmoid)    Adam        40/128-20/10            $2 \times 10^{-4}$          $10^{-6}$
Conv (Table 2 MNIST)      BP (hard-sigmoid)          Adam        40/128-1/1              $2 \times 10^{-4}$          $10^{-6}$
Conv (Table 2 CIFAR-10)   Proposed (hard-sigmoid)    Adam        50/128-40/10            $2.5 \times 10^{-4}$        $2 \times 10^{-4}$
Conv (Table 2 CIFAR-10)   BP (hard-sigmoid)          Adam        50/128-1/1              $2.5 \times 10^{-4}$        $2 \times 10^{-4}$

[Figure S14: 1@28x28 → Conv1 → 32@24x24 → MaxPool1 → 32@12x12 → Conv2 → 64@8x8 → MaxPool2 → 64@4x4 → Dense → 1x10]

Figure S14: Convolutional architectures for MNIST.

[Figure S15: 3@32x32 → Conv1 → 32@32x32 → MaxPool1 → 32@16x16 → Conv2 → 64@16x16 → MaxPool2 → 64@8x8 → Conv3 → 128@8x8 → MaxPool3 → 128@4x4 → 1x10]

Figure S15: Convolutional architectures for CIFAR-10.
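The feature-map sizes listed in Figures S14 and S15 can be cross-checked with the standard output-size formula for convolution and pooling. The sketch below is only a consistency check under assumptions that are not stated in the text: 5x5 kernels with stride 1 and no padding for MNIST, 3x3 kernels with padding 1 for CIFAR-10, and 2x2 non-overlapping max pooling. Other kernel and padding combinations could produce the same sizes. All function names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a square convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2):
    """Spatial size after non-overlapping max pooling."""
    return size // window


def trace_sizes(size, layers):
    """Return the spatial size of the input and after every layer in `layers`."""
    sizes = [size]
    for kind, kwargs in layers:
        step = conv_out if kind == "conv" else pool_out
        sizes.append(step(sizes[-1], **kwargs))
    return sizes


# ASSUMPTION: 5x5 kernels, stride 1, no padding (matches 28 -> 24 and 12 -> 8).
mnist_layers = [("conv", {"kernel": 5}), ("pool", {}),
                ("conv", {"kernel": 5}), ("pool", {})]
# ASSUMPTION: 3x3 kernels with padding 1, so each convolution keeps the size.
cifar_layers = [("conv", {"kernel": 3, "padding": 1}), ("pool", {})] * 3

mnist_sizes = trace_sizes(28, mnist_layers)
cifar_sizes = trace_sizes(32, cifar_layers)

# Number of inputs to the final dense layer (channels x height x width).
mnist_features = 64 * mnist_sizes[-1] ** 2
cifar_features = 128 * cifar_sizes[-1] ** 2

print("MNIST spatial sizes:", mnist_sizes, "flattened features:", mnist_features)
print("CIFAR-10 spatial sizes:", cifar_sizes, "flattened features:", cifar_features)
```

Under these assumptions the traced sizes agree with the figures, and the flattened feature counts give the input dimension of the final fully connected layer.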
Algorithm S2 Two phases in EP training process for convolution architecture
Input: Sample-label pairs $(x, s_{tar})$
Parameter: $\theta = [W_0, W_1, W_f, B_f, B_1, \alpha_1, \beta_1, \beta_{f1}]$
Output: $\theta$
function FIRST-PHASE($\theta$, $s_{tar}$)
    $s_0 \leftarrow x$
    for $t \leftarrow 1$ to $T$ do
        $h_1 \leftarrow \mathrm{Conv}_0(s_0) + \beta_1 \cdot \mathrm{MaxUnpool}_1(\mathrm{ConvT}_1(s_2^0))$
        $h_2 \leftarrow \mathrm{Conv}_1(\mathrm{MaxPool}_1(s_1^0))$
        $h_p \leftarrow W_f \cdot \mathrm{Flatten}(\mathrm{MaxPool}_2(s_2^0))$
        $s_1^0, s_2^0, s_p^0 \leftarrow \rho(h_1), \rho(h_2), \mathrm{SoftMax}(h_p)$
    end for
    $\Lambda_1 \leftarrow [s_i^0]$, $i = 0, 1, 2, p$
    return $\Lambda_1$
end function
function SECOND-PHASE($\theta$, $\Lambda_1$, $s_{tar}$)
    $s_0, s_1^{\beta_{f1}}, s_2^{\beta_{f1}}, s_p^{\beta_{f1}} \leftarrow x, s_1^0, s_2^0, s_p^0$
    for $t \leftarrow 1$ to $K$ do
        $e_p \leftarrow s_{tar} - s_p^{\beta_{f1}}$
        $h_1 \leftarrow \mathrm{Conv}_0(s_0) + \beta_1 \cdot \mathrm{MaxUnpool}_1(\mathrm{ConvT}_1(s_2^{\beta_{f1}}))$
        $h_2 \leftarrow \mathrm{Conv}_1(\mathrm{MaxPool}_1(s_1^{\beta_{f1}})) + \beta_f \cdot \mathrm{MaxUnpool}_2(\mathrm{Unflatten}(W_f^\top e_p))$
        $h_p \leftarrow W_f \cdot \mathrm{Flatten}(\mathrm{MaxPool}_2(s_2^{\beta_{f1}}))$
        $s_1^{\beta_{f1}}, s_2^{\beta_{f1}}, s_p^{\beta_{f1}} \leftarrow \rho(h_1), \rho(h_2), \mathrm{SoftMax}(h_p)$
    end for
end function

Algorithm S3 EP with feedback scaling and residual connections (Figure 3b)
Input: $(x, s_{tar})$
Parameter: $\theta = [W_0, W_i, W_f, B_f, B_i, \beta_i, \beta_{f1}]$
Output: $\theta$
function ITERATION($\theta$, $\Lambda_1$, $s_{tar}$)
    for $t \leftarrow 1$ to $K$ do
        if Nudging Phase then
            $\beta_f \leftarrow \beta_{f1}$
            $s_0, s_1^{\beta_{f1}}, s_2^{\beta_{f1}}, s_p^{\beta_{f1}} \leftarrow x, s_1^0, s_2^0, s_p^0$
        else
            $\beta_f \leftarrow 0$
            $s_0 \leftarrow x$
        end if
        $h_1 \leftarrow W_0 s_0 + \beta_1 B_1 s_2^{\beta_f} + \beta_{4,1} B_{4,1} s_4^{\beta_f}$
        $h_2 \leftarrow W_1 s_1^{\beta_f} + \beta_2 B_2 s_3^{\beta_f}$
        $h_3 \leftarrow W_2 s_2^{\beta_f} + \beta_3 B_3 s_4^{\beta_f}$
        $h_4 \leftarrow W_3 s_3^{\beta_f} + \beta_4 B_4 s_5^{\beta_f} + W_{1,4} s_1^{\beta_f} + \beta_{7,4} B_{7,4} s_7^{\beta_f}$
        $h_5 \leftarrow W_4 s_4^{\beta_f} + \beta_5 B_5 s_6^{\beta_f}$
        $h_6 \leftarrow W_5 s_5^{\beta_f} + \beta_6 B_6 s_7^{\beta_f}$
        $h_7 \leftarrow W_6 s_6^{\beta_f} + \beta_7 B_7 s_8^{\beta_f} + W_{4,7} s_4^{\beta_f} + \beta_{10,7} B_{10,7} s_{10}^{\beta_f}$
        $h_8 \leftarrow W_7 s_7^{\beta_f} + \beta_8 B_8 s_9^{\beta_f}$
        $h_9 \leftarrow W_8 s_8^{\beta_f} + \beta_9 B_9 s_{10}^{\beta_f}$
        $h_{10} \leftarrow W_9 s_9^{\beta_f} + \beta_f B_f e_p^{\beta_f} + W_{7,10} s_7^{\beta_f}$
        $h_p \leftarrow W_f s_{10}^{\beta_f}$
        $s_i^{\beta_f} \leftarrow \rho(h_i)$, $i = 0, 1, 2, \ldots, 10$
        $s_p^{\beta_f} \leftarrow \mathrm{SoftMax}(h_p)$
    end for
end function
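The two-phase iteration used in Algorithms S2 and S3 can be sketched on the small fully connected network of Equation S11, which also makes the EP to BP correspondence of Equation S16 easy to check. This is a minimal sketch under assumptions, not the authors' implementation: $\rho = \tanh$, a linear output layer, symmetric feedback ($B_1 = W_1^\top$, $B_f = W_f^\top$), synchronous state updates, a fixed error $e_p$ taken from the free phase, and illustrative sizes, weights, and iteration counts. The names `relax`, `ep_delta_w0`, and `bp_delta_w0` are hypothetical.

```python
import numpy as np


def rho(x):
    return np.tanh(x)


def rho_prime(x):
    return 1.0 - np.tanh(x) ** 2


def relax(W0, W1, Wf, s0, s1, s2, e_p, beta1, beta_f, steps):
    """Iterate the state update of Eq. S11, assuming B_1 = W1^T and B_f = Wf^T."""
    for _ in range(steps):
        h1 = W0 @ s0 + beta1 * (W1.T @ s2)      # feedforward input plus scaled feedback
        h2 = W1 @ s1 + beta_f * (Wf.T @ e_p)    # nudging term vanishes when beta_f = 0
        s1, s2 = rho(h1), rho(h2)
    return s1, s2, Wf @ s2


def ep_delta_w0(W0, W1, Wf, s0, s_tar, beta1, beta_f, T=100, K=100):
    """Free phase, nudged phase, then the rescaled update of Eq. S16."""
    s1 = np.zeros(W0.shape[0])
    s2 = np.zeros(W1.shape[0])
    no_error = np.zeros(Wf.shape[0])
    s1_free, s2_free, sp_free = relax(W0, W1, Wf, s0, s1, s2, no_error, beta1, 0.0, T)
    e_p = s_tar - sp_free
    s1_nudged, _, _ = relax(W0, W1, Wf, s0, s1_free, s2_free, e_p, beta1, beta_f, K)
    ds1 = s1_nudged - s1_free
    return np.outer(ds1, s0) / (beta1 * beta_f)


def bp_delta_w0(W0, W1, Wf, s0, s_tar):
    """BP update of Eq. S2 on the feedforward network with the same weights."""
    h1 = W0 @ s0
    h2 = W1 @ rho(h1)
    sp = Wf @ rho(h2)
    d2 = rho_prime(h2) * (Wf.T @ (sp - s_tar))
    d1 = rho_prime(h1) * (W1.T @ d2)
    return -np.outer(d1, s0)


# Illustrative sizes, weights, and feedback scalings (assumptions).
rng = np.random.default_rng(0)
W0 = rng.normal(scale=0.5, size=(5, 6))
W1 = rng.normal(scale=0.5, size=(4, 5))
Wf = rng.normal(scale=0.5, size=(3, 4))
s0 = rng.normal(size=6)
s_tar = rng.normal(size=3)

g_ep = ep_delta_w0(W0, W1, Wf, s0, s_tar, beta1=1e-3, beta_f=1e-3)
g_bp = bp_delta_w0(W0, W1, Wf, s0, s_tar)
cosine = np.sum(g_ep * g_bp) / (np.linalg.norm(g_ep) * np.linalg.norm(g_bp))
rel_err = np.linalg.norm(g_ep - g_bp) / np.linalg.norm(g_bp)
print(f"cosine similarity: {cosine:.6f}, relative difference: {rel_err:.2e}")
```

With weak feedback the two updates should be nearly parallel, in line with the equivalence argued in Section C.3. Increasing `beta1` or `beta_f` in this sketch is a simple way to see the agreement degrade.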
