| author | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
|---|---|---|
| committer | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
| commit | b83947778e2c776f757a07d4719b7ce961d7ed55 (patch) | |
| tree | b9cc01d7adda691d9156d9d04f4fb2f644674e96 /refs | |
Initial commit: ept — backprop-free equilibrium transformer (EP)
Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
Diffstat (limited to 'refs')
| -rw-r--r-- | refs/fre_rnn_full.txt | 1945 |
| -rw-r--r-- | refs/hw_groups_claims.json | 103 |
| -rw-r--r-- | refs/hw_research_claims.json | 134 |
| -rw-r--r-- | refs/hw_research_claims2.json | 85 |
| -rw-r--r-- | refs/paper_2603.12934.txt | 2039 |
| -rw-r--r-- | refs/scurria_2602.03670v2.txt | 3311 |
6 files changed, 7617 insertions, 0 deletions
diff --git a/refs/fre_rnn_full.txt b/refs/fre_rnn_full.txt
new file mode 100644
index 0000000..48da35c
--- /dev/null
+++ b/refs/fre_rnn_full.txt
@@ -0,0 +1,1945 @@
Published as a conference paper at ICLR 2026

TOWARD PRACTICAL EQUILIBRIUM PROPAGATION: BRAIN-INSPIRED RECURRENT NEURAL NETWORK WITH FEEDBACK REGULATION AND RESIDUAL CONNECTIONS

Zhuo Liu
School of Microelectronics
University of Science and Technology of China
Hefei 230026, Anhui, China
zhuoliu00@mail.ustc.edu.cn

Tao Chen ∗
School of Microelectronics
University of Science and Technology of China
Hefei 230026, Anhui, China
tchen@ustc.edu.cn

arXiv:2508.11659v2 [cs.NE] 7 May 2026

ABSTRACT
Brain-like intelligent systems need brain-like learning methods. Equilibrium Propagation (EP) is a biologically plausible learning framework with strong potential for brain-inspired computing hardware. However, existing implementations of EP suffer from instability and prohibitively high computational costs. Inspired by the structure and dynamics of the brain, we propose a biologically plausible Feedback-regulated REsidual recurrent neural network (FRE-RNN) and study its learning performance in the EP framework. Feedback regulation enables rapid convergence by attenuating feedback signals and reducing the disturbance of feedback paths to feedforward paths. The improvement in the convergence property reduces the computational cost and training time of EP by orders of magnitude, delivering performance on par with backpropagation (BP) in benchmark tasks. Meanwhile, residual connections with brain-inspired topologies help alleviate the vanishing gradient problem that arises when feedback pathways are weak in deep RNNs. Our approach substantially enhances the applicability and practicality of EP. The techniques developed here also offer guidance for implementing in-situ learning in physical neural networks.
∗ Corresponding Author

1 INTRODUCTION
Backpropagation (BP) has been the driving force behind the success of artificial intelligence (AI) across a wide variety of tasks, ranging from image recognition to natural language processing (Rumelhart et al., 1986; Lecun, 1988; He et al., 2016; Vaswani et al., 2017). Despite these triumphs, BP's reliance on non-local error signals and weight transport lacks biological plausibility (Journé et al., 2023; Ororbia, 2023). The brain does not appear to implement the gradient computations performed by BP, in particular the explicit derivative of the activation function, which demands precise access to the rate of change in neuronal activities at specific operating points (Ororbia, 2023). Moreover, implementing BP in neuromorphic systems incurs enormous overhead (Kudithipudi et al., 2025). Drawing inspiration from the topology and dynamics of the brain is a viable approach to advancing biologically plausible learning mechanisms and to promoting energy-efficient computing systems for AI.
Equilibrium Propagation (EP) (Scellier & Bengio, 2017; Ernoult et al., 2019; Laborieux et al., 2021) presents a compelling and hardware-friendly alternative. It leverages naturally settling dynamics in RNNs for credit assignment, and eliminates the need for explicit activation derivatives. EP operates in two phases with nearly identical dynamics, and the synaptic adjustments depend only on local information (Ackley et al., 1985; Movellan, 1991; Ernoult et al., 2020). In EP, the output layer is softly nudged by the prediction error toward configurations that incrementally minimize the loss function, a regime termed weak supervision (Millidge et al., 2023). A major drawback of EP is its notably slow training speed and instability.
An RNN often requires dozens or even hundreds of iterations to reach a stable state (Scellier & Bengio, 2017). Previous attempts to optimize EP's performance have led to markedly more complicated procedures (O'Connor et al., 2019; Laborieux & Zenke, 2024).
In this paper, we draw inspiration from the brain and propose a Feedback-regulated REsidual recurrent neural network (FRE-RNN). We substantially improve the convergence properties of the RNNs and training speed of EP while achieving performance comparable to BP. Our contributions are as follows:

 • By scaling down the feedback strength of RNNs, we enhance the robustness of EP and accelerate the training and inference speed by orders of magnitude because of the improved convergence properties.
 • To counteract the gradient vanishing problem caused by weak feedback, we introduce residual connections into the layered RNNs, enabling the training of deep networks that previously challenged EP and achieving performance closer to BP.
 • The feedback regulation and residual connections in RNNs of arbitrary graph topologies mirror the multi-scale recurrence in biological neural networks. Our work fosters EP's biological plausibility and extends its applicability in brain-inspired computational hardware.

2 BACKGROUND

2.1 CONVERGENT RNNs WITH STATIC INPUT

Consider an RNN as a dynamical system driven by a static input $x$:

$$s[t+1] = F(x, s[t], \theta), \tag{1}$$

where $F$ is the transition function, $s[t]$ is the network state at time step $t$ ($t = 0, 1, 2, \ldots, T$), and $\theta$ denotes the parameters. Assuming that the network state stabilizes in $T$ steps, the RNN reaches a stable point $s[T]$. Its convergence is typically guaranteed by either symmetric connections with asynchronous updates or by a sufficiently small spectral radius of asymmetric connections with synchronous updates (Hopfield, 1982; Yildiz et al., 2012; Liu et al., 2026). That said, other factors, e.g.
activation function, also influence the dynamical properties of RNNs (Miller & Hardt, 2019).

2.2 SCALING ADJACENCY MATRIX TO TUNE NETWORK DYNAMICS

Scaling the spectral radius (SR) of the adjacency matrix, i.e. the largest absolute value among the eigenvalues of the weight matrix, is a common method to tune the dynamics of RNNs (Bai et al., 2012; Nakajima et al., 2024; Liu et al., 2026). An SR less than one yields stable and convergent dynamics. In this case, injected signals tend to decay over time, which manifests as short-term memory. An SR exceeding one can give rise to expansive or even chaotic behavior in which small perturbations are amplified. By adjusting SR, one can bias the RNN toward convergent, oscillatory, or edge-of-chaos regimes, thereby tuning computational properties, such as convergence speed or long-term memory capacity (Jaeger & Haas, 2004; Legenstein & Maass, 2007; Miller & Hardt, 2019).

2.3 EQUILIBRIUM PROPAGATION

Equilibrium propagation is a learning framework initially based on energy-based models. It proceeds in two phases: a free (first) phase and a weakly clamped (second) phase. In the first phase, the RNN converges to a steady state $s^0$ under the stimulation of input alone. In the clamped phase, the network is gently nudged by the prediction error and settles to a new stable state $s^\beta$. The weight update can be simplified to a contrastive learning rule compatible with spike-timing-dependent plasticity (STDP) (Scellier et al., 2018). EP has been further generalized to asymmetric RNNs governed by vector field dynamics (Scellier et al., 2018). Recent work shows that asymmetry in skew-symmetric Hopfield models can improve classification performance (Høier et al., 2024).
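The link between spectral radius and settling time described in Sections 2.1 and 2.2 can be illustrated with a small standard-library sketch. It is not taken from the paper's code: the helper names, the network size, the tolerance, and the use of `tanh` are assumptions made for illustration, and a symmetric random matrix is chosen only so that plain power iteration suffices to estimate the SR. With synchronous updates and an SR below one, the `tanh` map is a contraction, so the loop is expected to settle.

```python
import math
import random


def matvec(W, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def random_symmetric(n, seed=0):
    """Random symmetric matrix with zero diagonal (illustrative choice)."""
    rng = random.Random(seed)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = rng.gauss(0.0, 1.0)
    return W


def spectral_radius_symmetric(W, iters=500, seed=0):
    """Estimate the SR of a symmetric matrix by power iteration on W @ W.

    For symmetric W the eigenvalues of W @ W are the squared eigenvalues
    of W, so the dominant one is SR squared.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in W]
    sr_squared = 0.0
    for _ in range(iters):
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        v = matvec(W, matvec(W, v))
        sr_squared = math.sqrt(sum(x * x for x in v))
    return math.sqrt(sr_squared)


def scale_to_radius(W, target):
    """Rescale W so that its estimated SR equals `target`."""
    rho = spectral_radius_symmetric(W)
    return [[w * target / rho for w in row] for row in W]


def steps_to_converge(W, b, tol=1e-6, max_steps=1000):
    """Iterate s[t+1] = tanh(W s[t] + b) from s = 0 with synchronous updates.

    Returns the first step at which the largest one-step change drops
    below `tol`, or None if that never happens within `max_steps`.
    """
    s = [0.0] * len(b)
    for t in range(1, max_steps + 1):
        h = matvec(W, s)
        s_new = [math.tanh(hi + bi) for hi, bi in zip(h, b)]
        if max(abs(a - c) for a, c in zip(s_new, s)) < tol:
            return t
        s = s_new
    return None


if __name__ == "__main__":
    W = random_symmetric(16, seed=1)
    rng = random.Random(2)
    b = [rng.gauss(0.0, 1.0) for _ in range(16)]  # stands in for a static input
    for target in (0.2, 0.5, 0.9):
        steps = steps_to_converge(scale_to_radius(W, target), b)
        print(f"SR = {target}: settled after {steps} steps")
```

Running the script prints the number of steps needed for a few SR values; the contraction argument only bounds this number from above, so the exact counts depend on the random draw.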
2.4 FEEDBACK REGULATION AND NETWORK STRUCTURE IN THE BRAIN

Cortical areas in the brain feature dynamic regulation of feedforward and feedback connections (Felleman & Van Essen, 1991; Mejias et al., 2016; Michalareas et al., 2016; Semedo et al., 2022; Fişek et al., 2023; Wang et al., 2023). In the visual system, for instance, feedforward signals dominate immediately following the onset of external stimulus, whereas feedback signals become prominent during spontaneous activity. Dynamically regulating the strength of feedback allows the brain to optimize information integration, ensuring efficient perception and decision-making. In mammalian neocortices, information processing involves not only feedforward synaptic chains but also extensive lateral and feedback loops that interconnect disparate regions, forming a richly recursive network rather than a strictly layered structure. This topology implies short average path length between neurons and efficient information flow (Watts & Strogatz, 1998; Markov et al., 2013; Lynn & Bassett, 2019; Kulkarni & Bassett, 2025). In deep neural networks, residual connections reflect the long-range skip-layer projections observed in cortical circuits (Perich & Rajan, 2020; Holk & Mejias, 2024). They mitigate the vanishing gradient by providing skip pathways that preserve gradient (He et al., 2016).

3 ACCELERATING EP WITH BRAIN-INSPIRED NETWORK PROPERTIES

[Figure 1 image: (a) layered RNN with states $s_0$, $s_1$, $s_2$, forward weights $W_0$, $\alpha_1 W_1$, $W_f$ and feedback weights $\beta_1 B_1$, $\beta_f B_f$; (b) convolutional variant with Conv0 (32,5,1,0), Conv1 (64,5,1,0), pooling (2) and $W_f$ in the forward path. Both panels show the prediction $s_p$, the label $s_t$ and the error $e_p = s_t - s_p$.]

Figure 1: Illustration of feedback and feedforward regulation. (a) Layered architecture of RNN. The feedforward weights $W_i$ and feedback weights $B_i$ are rescaled by coefficients $\alpha_i$ and $\beta_i$ respectively.
The dashed box encloses an RNN formed by layers $s_1$ and $s_2$ with feedforward and feedback pathways. $\beta_f$ is the nudging factor, which essentially scales the feedback strength of prediction error. (b) Embedding convolutional architecture in RNN. Convolutional parameter (32,5,1,0) is written as (channels, kernel size, stride, padding). Parameter (2) in (b) denotes max-pooling with stride 2. $\mathrm{Conv}^{T}_i$ represents transpose convolution, the inverse process of the convolution, and $P_i^{-1}$ means max-unpooling (Ernoult et al., 2019). Model architectures and training process are given in Appendix D.

3.1 PROTOTYPICAL SETTING OF EQUILIBRIUM PROPAGATION

Unlike the prototypical setting of equilibrium propagation (P-EP) (Ernoult et al., 2019), we separate the input and output layer from the recurrent network (Figure 1a). This separation allows the output layer to adopt the SoftMax activation commonly used in feedforward networks, which facilitates performance comparison (Laborieux & Zenke, 2024). For clarity, the RNN (black dashed box in Figure 1a) shown here only contains two hidden layers $s_1$ and $s_2$, but the approach applies to deeper structures (see below). The states of the RNN evolve for $T$ discrete steps until they converge. The dynamics of the whole RNN can be formulated as:

$$s^{\beta_f}[t+1] = F(s^{\beta_f}[t], b) = \rho(W \cdot s^{\beta_f}[t] + b), \qquad b = [W_0 \cdot s_0,\; \beta_f \cdot B_f \cdot e_p], \tag{2}$$

where $s^{\beta_f}[t]$ is the state of the RNN at time $t$, $\rho$ is the activation function, $W$ is the forward weight matrix of the RNN, and $b$ combines the feedforward input and the error-nudging term. Note that $s^{\beta_f} = [s_1^{\beta_f}, s_2^{\beta_f}]$. For each sample-label pair $(x, s_{tar})$, we run the free phase ($\beta_f = 0$) for $t_e$ iterations, obtain the prediction $s_p = \mathrm{SoftMax}(W_f \cdot s_2)$, and compute the prediction error $e_p = s_{tar} - s_p$.
During the clamped phase, the error nudges the RNN through the feedback weights $B_f$ and scaling coefficient $\beta_f = \beta_{f1}$ ($\beta_{f1} = 0.1$ for layered architecture and $\beta_{f1} = 0.25$ for convolutional architecture by default). The network evolves for $K$ further iterations under clamping to another state. The weights ($W_0$, $W_1$) are then updated with an STDP-compatible rule:

$$\Delta W_i = ds_{i+1} \cdot (s_i^0)^\top, \qquad ds_{i+1} = s_{i+1}^{\beta_{f1}} - s_{i+1}^0, \tag{3}$$

where $ds_i$ is the offset of stable point caused by the error nudging (Scellier et al., 2018). Similarly, the final weight for output is updated:

$$\Delta W_f = (s_{tar} - s_p^0) \cdot (s_2^0)^\top. \tag{4}$$

We also consider an RNN embedded with convolutional architecture in its forward paths (2 convolution layers, 2 max-pooling layers and 1 fully connected layer) shown in Figure 1b. The forward convolutional structure follows the architecture of existing convolutional neural networks (CNN) (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015), in which a pooling layer is placed after the activation of the convolution layer. We transform the CNN to an RNN by adding feedback connections symmetric with the feed-forward connections (see Appendix D for the pseudocode and schematics).

3.2 FEEDBACK REGULATION IN LAYERED RNN FOR FAST CONVERGENCE

[Figure 2 image: panels (a)-(h), heatmaps of neuron index versus time step $t$.]

Figure 2: Convergence dynamics and speed versus feedback scaling $\beta_i$. All neurons in all hidden layers are indexed ($s_1$: 0-63; $s_2$: 64-127). Colors indicate neuronal activity (a,c,e,g) and changes in activity (b,d,f,h). (a) The state evolution of RNN with symmetric weights and $\beta_i = 0.1$; (b) The one-step difference of neural states in (a).
(c, d) Symmetric weights with $\beta_i = 2$; (e, f) Asymmetric weights with $\beta_i = 0.1$; (g, h) Asymmetric weights with $\beta_i = 4$. In both symmetric and asymmetric feedback cases, down-scaling feedback connections tends to stabilize the network. See Figure 5d for the statistical robustness.

Although the SR can tune the RNN dynamics, scaling forward weights $W_i$ distorts forward signal propagation, which is harmful to performance (see below). Therefore, we turn to another choice, namely, scaling only the feedback strength with $\beta_i$. This coefficient scales the gradients, in the same way as the nudging factor $\beta_f$. We consider both symmetric ($B_i = (W_i)^\top$) and asymmetric ($B_i \neq (W_i)^\top$) recurrent connections in the study, and compare the results with FNNs of the same size trained by BP (feedback connections removed) or feedback alignment (FA) (Lillicrap et al., 2016) that uses random weights $B_i \neq (W_i)^\top$ to feed back the error. Note that, after scaling, the overall weight matrix $W$ of a symmetric RNN is no longer strictly symmetric. Therefore, we started from the vector field setting of EP rather than the energy-based setting in the first place. The feedforward and feedback weights are multiplied by coefficients $\alpha_i$ and $\beta_i$ respectively. Figure 2a-d shows convergence speed for different $\beta_i$. With asymmetric weights, the network can converge to a fixed point (Figure 2e, f), exhibit cyclical oscillation (Figure 2g, h), or even become chaotic. The feedback weights stay fixed during the training process, which differs from EP in vector field dynamics (Scellier et al., 2018). The pseudocode of the learning procedure with a 2-hidden-layer RNN shown in Figure 1(a) is provided in Algorithm 1.
Algorithm 1 EP with Feedforward and Feedback Scaling
Require: Input: $(x, s_{tar})$
Require: Parameters: $\theta = [W_0, W_1, W_f, B_f, B_1, \alpha_1, \beta_1, \beta_{f1}]$
Ensure: Updated parameters $\theta$
function FIRST-PHASE($\theta$, $s_{tar}$)
    $s_0 \leftarrow x$
    for $t = 1$ to $T$ do
        $h_1 \leftarrow W_0 \cdot s_0 + \beta_1 \cdot B_1 \cdot s_2^0$
        $h_2 \leftarrow \alpha_1 \cdot W_1 \cdot s_1^0$
        $h_p \leftarrow W_f \cdot s_2^0$
        $s_1^0, s_2^0, s_p^0 \leftarrow \rho(h_1), \rho(h_2), \mathrm{SoftMax}(h_p)$
    end for
    $\Lambda_1 \leftarrow [s_i^0],\ i = 0, 1, 2, p$
    return $\Lambda_1$
end function
function SECOND-PHASE($\theta$, $\Lambda_1$, $s_{tar}$)
    $s_1^{\beta_{f1}}, s_2^{\beta_{f1}}, s_p^{\beta_{f1}} \leftarrow s_1^0, s_2^0, s_p^0$
    for $t = 1$ to $K$ do
        $e_p \leftarrow s_{tar} - s_p^{\beta_{f1}}$
        $h_1 \leftarrow W_0 \cdot s_0 + \beta_1 \cdot B_1 \cdot s_2^{\beta_{f1}}$
        $h_2 \leftarrow \alpha_1 \cdot W_1 \cdot s_1^{\beta_{f1}} + \beta_f \cdot B_f \cdot e_p$
        $h_p \leftarrow W_f \cdot s_2^{\beta_{f1}}$
        $s_1^{\beta_{f1}}, s_2^{\beta_{f1}}, s_p^{\beta_{f1}} \leftarrow \rho(h_1), \rho(h_2), \mathrm{SoftMax}(h_p)$
    end for
    $ds_i \leftarrow s_i^{\beta_{f1}} - s_i^0,\ i = 1, 2$
    $\Lambda_2 \leftarrow [ds_1, ds_2]$
    return $\Lambda_2$
end function
function UPDATING-WEIGHTS($\theta$, $\Lambda_1$, $\Lambda_2$, $s_{tar}$)
    $\Delta W_i \leftarrow ds_{i+1} \cdot (s_i^0)^\top,\ i = 0, 1$
    $\Delta W_f \leftarrow (s_{tar} - s_p^0) \cdot (s_2^0)^\top$
end function

3.3 RESIDUAL CONNECTIONS TO AVOID VANISHING GRADIENTS

In our 10-hidden-layer RNN with symmetric connections, we add cross-layer residual links (Figure 3a-b) and carry out an ablation study of their effect on performance. The three long-range bidirectional connections bypass adjacent layers to reduce gradient decay. For RNNs with asymmetric connections, we introduce skip-layer connections between non-adjacent layers with $P = 20\%$ probability, creating an RNN with arbitrary graph topologies (AGT) where any pair of layers form connections stochastically (Figure 3c) (Salvatori et al., 2022). (See Algorithm S3 in Appendix D for training details.)

4 EXPERIMENTS
We evaluated our RNN models on MNIST and CIFAR-10 datasets and compared the results with P-EP and BP. The MNIST dataset consists of 70,000 grayscale handwritten digit images (28×28 pixels) split into 60,000 training and 10,000 test samples.
CIFAR-10 contains 60,000 RGB images (32×32 pixels) of 10 categories, divided into 50,000 training and 10,000 test samples. Pre-processing, network structures and additional training details are in Appendix D.

Figure 3: (a) A 10-hidden-layer RNN model with residual connections. The solid blue wires and the dashed orange wires represent forward and feedback residual connections respectively. The bidirectional connections are symmetric. (b) Adjacency matrix of (a). The blocks (green) other than the sub-diagonals indicate residual connections. (c) Adjacency matrix for an RNN with arbitrary graph topology.

4.1 INFLUENCE OF FEEDFORWARD SCALING AND FEEDBACK SCALING

(a) 2HL test accuracy

| $\beta_i$ \ $\alpha_i$ | 0.01 | 0.1 | 1.0 | 4.0 |
|---|---|---|---|---|
| 0.001 | 0.9659 | 0.9730 | 0.9756 | 0.9753 |
| 0.01 | 0.9666 | 0.9723 | 0.9760 | 0.9758 |
| 0.1 | 0.9624 | 0.9711 | 0.9725 | 0.9694 |
| 1.0 | 0.8925 | 0.9127 | 0.9300 | 0.7978 |

(b) 3HL test accuracy

| $\beta_i$ \ $\alpha_i$ | 0.01 | 0.1 | 1.0 | 4.0 |
|---|---|---|---|---|
| 0.001 | 0.9348 | 0.9692 | 0.9718 | 0.9555 |
| 0.01 | 0.9246 | 0.9679 | 0.9765 | 0.9715 |
| 0.1 | 0.6323 | 0.9497 | 0.9741 | 0.9702 |
| 1.0 | 0.4862 | 0.8739 | 0.5249 | 0.1676 |

(c) 5HL test accuracy

| $\beta_i$ \ $\alpha_i$ | 0.01 | 0.1 | 1.0 | 4.0 |
|---|---|---|---|---|
| 0.001 | 0.3844 | 0.7980 | 0.9338 | 0.8048 |
| 0.01 | 0.2515 | 0.9365 | 0.9575 | 0.8583 |
| 0.1 | 0.2104 | 0.8386 | 0.9757 | 0.9332 |
| 1.0 | 0.2096 | 0.3891 | 0.2500 | 0.1324 |

Figure 4: The influence of feedforward scaling $\alpha_i$ and feedback scaling $\beta_i$ on accuracy of MNIST classification. (a) 2 hidden layers; (b) 3 hidden layers; (c) 5 hidden layers. Each layer has 64 neurons. By default, $T = 10 \times N_{HiddenLayer}$, $K = T/2$. Each result is averaged over five repeated experiments.

Figure 4 compares the effects of feedforward scaling $\alpha_i$ and feedback scaling $\beta_i$. In general, relatively small feedback scaling ($\beta_i = 0.1$) yields high MNIST accuracy (Figure 4). In deeper RNNs, overly
low feedback scaling $\beta_i$ jeopardizes the performance, which we attribute to vanishing gradients (Figure 4c, right two columns). In contrast, down-scaling the feedforward weights degrades performance, as the inference signals are weakened through the layers (see rows of Figure 4a). However, up-scaling $\alpha_i$ can also be detrimental, as this easily leads to saturation of neural state. The best performance of a 5-hidden-layer RNN is achieved without feedforward scaling ($\alpha_i = 1$) and with a trade-off in feedback scaling at $\beta_i = 0.1$. These results suggest that balancing the feedforward and feedback strengths is critical for better performance, not only accuracy but also speed (see Table 1).

[Figure 5 image: panels (a) test error, (b) SR, (c) FTMLE, (d) convergence time, each plotted against $\beta_i$ for symmetric and asymmetric weights, before and after training.]

Figure 5: The test error, SR, FTMLE, and convergence time versus feedback scaling $\beta_i$. The results are obtained from a 3-hidden-layer (64 neurons per layer) model trained on MNIST dataset. Note that the network does not converge under certain conditions, resulting in missing values in (d).

To further investigate the influence of feedback scaling $\beta_i$, we plot the error, SR, finite-time maximum Lyapunov exponent (FTMLE) (Shadden et al., 2005; Kanno & Uchida, 2014) and convergence time against feedback scaling coefficient before and after training of a 3-hidden-layer RNN on MNIST (Figure 5). It shows that larger feedback scaling $\beta_i$ decreases accuracy (Figure 5a).
As expected, SR is positively correlated with $\beta_i$ (see Figure 5b), and large SR can lead to instability of an RNN indicated by the FTMLE shown in Figure 5c, which in turn explains the results in Figure 5a. In general, down-scaling the feedback ($\beta_i < 1$) reduces the convergence time of RNN, which is favorable. Note that up-scaling of feedback ($\beta_i > 1$) can also decrease FTMLE and convergence time. However, this is attributed to the saturation of neural state, and will also lower the performance.
Additionally, one might suspect that the gradient signals in the lower layers are not fulfilling their intended role. In reservoir computing, where only the last layer is trained, the network can also reach high accuracy as long as the output dimension is large enough. However, this is unlikely in our case, as each layer in our network has only 64 neurons by default (other than the results in Table 1). To further confirm that the learning in lower layers is meaningful, we performed training with the weights of lower layers frozen; details of these experiments are included in Appendix C.5. The results clearly show that getting comparable results to BP requires effective training of lower layers.
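The positive correlation between SR and $\beta_i$ noted above can be made concrete for one idealized case. The following derivation is an illustrative sketch rather than a result stated in the paper. It assumes a layered RNN with symmetric feedback ($B_i = W_i^\top$), no feedforward scaling ($\alpha_i = 1$), a single feedback coefficient $\beta > 0$ shared by all layers, and no residual connections, and it considers only the linear weight matrix (the activation function and the output nudging path are ignored). The symbols $L$, $D$ and $c$ are introduced here only for the derivation.

```latex
% Block weight matrix of an n-layer chain: W_i sits on the block sub-diagonal
% (forward path) and \beta W_i^\top on the block super-diagonal (feedback path).
\begin{align}
W &= L + \beta L^\top, & L_{i+1,i} &= W_i, \\
% Similarity transform with D = diag(cI, c^2 I, ..., c^n I) and c = \sqrt{\beta}:
(D W D^{-1})_{i+1,i} &= c\,W_i = \sqrt{\beta}\,W_i, &
(D W D^{-1})_{i,i+1} &= c^{-1}\beta\,W_i^\top = \sqrt{\beta}\,W_i^\top, \\
% W is therefore similar to a symmetric matrix scaled by \sqrt{\beta}:
D W D^{-1} &= \sqrt{\beta}\,(L + L^\top)
& \Longrightarrow\quad \mathrm{SR}(W) &= \sqrt{\beta}\;\mathrm{SR}(L + L^\top).
\end{align}
```

Under these assumptions the SR grows as $\sqrt{\beta}$, so lowering $\beta$ from 1 to 0.01 would reduce the SR of the linear part by a factor of 10. With residual connections or asymmetric feedback the same similarity transform no longer symmetrizes the matrix, so this scaling should be read as a rough guide only.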
[Figure 6 image: test error (log scale) versus epoch, panels (a)-(h).]

Figure 6: Test error with different hyperparameters. The curves of different $T$ (10, 20, 50, 100) with 2 hidden layers (64 neurons per hidden layer) and (a) $\beta_i = 0.01$; (b) $\beta_i = 0.1$; (c) $\beta_i = 1$; (d) $\beta_i = 4$. The curves of different $\beta_i$ (0.001, 0.01, 0.1, 0.25, 1, 2, 4) with (e) 2 hidden layers; (f) 3 hidden layers; (g) 5 hidden layers; (h) 10 hidden layers. The shaded areas represent deviations of five repeated experiments. By default, $T = 10 \times N_{HiddenLayer}$, $K = T/2$. See Appendix A for more information.

4.2 DOWN-SCALING FEEDBACK LEADS TO FASTER CONVERGENCE

Figure 6a-d plots the error versus the number of epochs with different iteration steps $T$. Under the condition of $\beta_i = 0.01$ (Figure 6a), the model with $T = 10$ and $K = 5$ works as well as the model with $T = 100$ and $K = 50$, suggesting a possible speedup in training. Larger $\beta_i$ requires more iterations to achieve a certain level of performance (see Figure 6b, c, d).
Larger $\beta_i$ means larger SR and FTMLE, thus requiring more iterations to settle the RNN as shown in Figure 2 and Figure 5(b-d). Or even worse, the gradient signal is completely distorted. At $\beta_i = 4$, even $T = 100$ fails to exceed 95% accuracy. Figure 6e-h shows that while shallow networks benefit from low $\beta_i$, deeper networks (3, 5 and 10 layers) lose accuracy. In all cases, training performance peaks at certain $\beta_i$ dependent on the network depth. Additional results are provided in Table S1 in Appendix B.

Table 1: Comparison with P-EP and BP in accuracy and cost. The results of P-EP come from previous work (Ernoult et al., 2019). For BP results, we used a network with the same number of layers and number of nodes/channels. Each experiment is repeated five times, and the standard deviation is given. By default, $\beta_i = 0.01$ in our results, the feedback weights are symmetric with the feedforward weights for P-EP and Ours, and the learning rates in all layers are the same except for Ours-DLR (different learning rate), which uses varying learning rates identical to those of P-EP. For 2HL (two hidden layers) and 3HL (three hidden layers), there are 512 nodes per hidden layer. See Appendix D for more details.

| Architecture | Training approach | Testing (Training) | Epoch / Batch size - T/K | Wall Clock Time (HH:MM:SS) |
|---|---|---|---|---|
| 2HL | P-EP (sigmoid-s) | 98.05%±0.10% (99.86%) | 50/20-100/20 | 1:56:- |
| 2HL | Ours (tanh, Adam) | 98.39%±0.04% (100.00%) | 50/500-10/10 | 0:01:16 |
| 2HL | BP (tanh, Adam) | 98.26%±0.06% (100.00%) | 50/500-1/1 | 0:00:18 |
| 3HL | P-EP (sigmoid-s) | 97.99%±0.18% (99.90%) | 100/20-180/20 | 8:27:- |
| 3HL | Ours-DLR (tanh) | 97.65%±0.08% (98.93%) | 100/20-18/10 | 1:01:14 |
| 3HL | Ours (tanh) | 97.83%±0.13% (99.98%) | 100/20-18/10 | 1:01:54 |
| 3HL | Ours (tanh, Adam) | 98.36%±0.06% (100.00%) | 50/500-18/10 | 0:02:11 |
| 3HL | BP (tanh, Adam) | 98.36%±0.08% (100.00%) | 50/500-1/1 | 0:00:24 |
| Conv | P-EP (hard-sigmoid) | 98.98%±0.04% (99.46%) | 40/20-200/10 | 8:58:- |
| Conv | Ours (hard-sigmoid) | 99.14%±0.02% (99.78%) | 40/128-20/10 | 0:12:28 |
| Conv | BP (hard-sigmoid) | 98.93%±0.18% (99.43%) | 40/128-1/1 | 0:01:01 |

Table 2: Comparison with BP and FA and ablation study of residual connection. For layered architecture, there are 64 nodes per hidden layer and we chose $T = 10 \times N_{HiddenLayer}$ and $K = 5 \times N_{HiddenLayer}$, which guarantees saturation of accuracy at $\beta_i = 0.1$. For convolutional architectures, $\beta_i = 0.01$. By default, the Adam optimizer is used. Each experiment is repeated five times. See Appendix D for more training details.

| Architecture-connections | Training approach | MNIST-Testing (Training) | CIFAR-10-Testing (Training) |
|---|---|---|---|
| 5-symm | BP | 97.69%±0.10% (100.00%) | 49.23%±0.81% (56.72%) |
| 5-symm | Ours | 97.64%±0.10% (99.98%) | 50.72%±0.17% (57.02%) |
| 5-asymm | FA | 96.44%±0.10% (98.96%) | 37.97%±2.18% (38.92%) |
| 5-asymm | Ours | 96.37%±0.11% (97.99%) | 45.27%±0.73% (46.79%) |
| 10-symm | BP | 97.61%±0.04% (99.93%) | 48.23%±1.26% (55.37%) |
| 10-symm | Ours | 92.49%±0.32% (95.27%) | 34.90%±0.38% (34.64%) |
| 10-symm | Ours-Residual | 97.49%±0.05% (99.77%) | 44.46%±0.51% (48.67%) |
| 10-asymm | FA | 94.52%±0.26% (95.54%) | 30.16%±6.12% (30.20%) |
| 10-asymm | Ours | 87.37%±0.49% (87.95%) | 30.37%±1.09% (29.97%) |
| 10-asymm | Ours-AGT | 96.87%±0.11% (99.45%) | 30.94%±4.90% (31.36%) |
| 20-symm | BP | 97.48%±0.07% (99.74%) | 47.35%±1.49% (54.59%) |
| 20-symm | Ours-Residual | 95.95%±0.18% (98.20%) | 43.61%±1.17% (44.26%) |
| Conv | BP | 99.34%±0.04% (99.97%) | 75.45%±0.46% (83.61%) |
| Conv | Ours | 99.27%±0.07% (99.78%) | 75.04%±0.51% (80.79%) |

Table 1 compares our approach with P-EP, BP, and FA. Our model surpasses P-EP in training speed by at least one order of magnitude for both convolutional architecture and layered architecture. Importantly, our accuracy is comparable to BP and FA for the shallow architectures (5-hidden-layer and conv model, see also Table 2).
In consideration of the improved stability (Figure 6) via feedback regulation, we anticipate that physical implementations of RNN can achieve performance on par with BP. Additionally, for layered architecture, we also adopt the same training parameters (learning rate, batch size and epochs) as P-EP, differing only in feedback scaling ('Ours-DLR' in Table 1). The results present clear evidence of speedup, which mainly stems from the reduced number of iterations required for convergence.

4.3 DOWN-SCALED FEEDBACK COORDINATES PLASTICITY OF DIFFERENT LAYERS

It is hypothesized that the brain requires different plasticity in different areas due to their varying functional roles (Atallah et al., 2004; Lowet et al., 2020). The variability in plasticity can be realized explicitly by adjusting learning rates or implicitly by modulating the intensity of gradient. Previous work postulated that EP with weak feedback necessitates learning rates differing by orders of magnitude across layers (Scellier & Bengio, 2017). Here, we found that due to gradient differences across different layers induced by weak feedback, a 3-hidden-layer RNN at $\beta_i = 0.01$ (Table 1, 'Ours (tanh)') learns well with a uniform learning rate. This result suggests that the feedback scaling alone is able to regulate gradient strength of different layers, pointing to another possible mechanism to coordinate plasticity.

4.4 RESIDUAL CONNECTIONS OVERCOME THE GRADIENT VANISHING IN DEEP RNNs

Weak feedback exacerbates vanishing gradient in deeper layered RNN (Figures S5–S6 in Appendix B). Adding residual connections restores gradient flow (Figure S7 in Appendix B). As a result, a 10-hidden-layer network sees substantial performance gains (Table 2): a 5% increase in accuracy for MNIST and 9% for CIFAR-10. Even a 20-hidden-layer model can be trained.
As shown in Table 2, without residual connections, an asymmetric RNN trained by EP falls short of FA in accuracy, but with arbitrary residual links it surpasses the accuracy of FA for the MNIST classification (see the ablation study on connection probability in Appendix B). For the more complex dataset CIFAR-10, the 10-hidden-layer asymmetric model with residual random feedback connections achieves accuracy nearly 14% below the symmetric model. A possible reason is that the gradient signal through multiple random fixed feedback connections becomes too distorted by error to coordinate the forward weight learning.

5 DISCUSSION

We have applied the feedback scaling to RNN to speed up the convergence and to accelerate training with EP with negligible overhead. To counteract the vanishing gradient in deep architectures, we have added residual connections to non-adjacent layers of deep RNNs, partly restoring classification performance. In principle, the residual connections make credit assignment pathways shorter (Veit et al., 2016). The training exhibits remarkable resilience to noise on weight and neural state. Our structural modification is compatible with other algorithmic speed-ups (Scellier et al., 2023), thereby expanding the design space for efficient EP implementations.
Recent work on credit assignment in brain-inspired networks, e.g. adjoint propagation (Liu et al., 2026), partitions a large network into local RNNs with random internal connections of low SR for fast convergence and dynamic resource allocation, yielding speed and accuracy similar to this work. This work, however, adopts the feedback scaling to solve the stability issue and accelerate convergence of EP.
Weak feedback is often considered in biologically plausible learning algorithms (Sacramento et al., 2018; Haider et al., 2021; Meulemans et al., 2021).
It has been shown that contrastive Hebbian learning with weak feedback approximates
+backpropagation while converging quickly (Xie & Seung, 2003). More recently, local representation
+alignment (LRA) likewise employed weak feedback (Ororbia et al., 2023) and skip connections from
+the output to deep layers for efficient training. The EP framework also approximates BP (Scellier &
+Bengio, 2017; Millidge et al., 2023), but only under the weak clamping condition (weak supervision)
+(Laborieux et al., 2021; Millidge et al., 2023). We have shown that, at the infinitesimal inference
+limit, namely weak supervision and weak feedback (Millidge et al., 2023), EP is equivalent to LRA
+and BP (Appendix C). In other words, because of its weak feedback, the dynamics of FRE-RNN
+resemble those of a feedforward neural network.
+However, our approach still has several limitations with respect to the large-scale neural
+networks that underpin artificial intelligence. For complex datasets such as CIFAR-10, deep fully
+connected networks trained with our method show a notable performance gap compared to BP. We
+attribute this gap to the inaccurate approximation of the true gradient as computed by BP (see
+Appendix C.4). Therefore, although EP can be extended to deep fully connected networks (20 hidden
+layers) and shallow CNNs, its applicability to deep CNNs remains to be explored. For deep
+architectures with asymmetric connections, the accuracy decreases faster with increasing depth
+because of the inaccurate random error feedback. A more in-depth investigation of the residual
+connection topology is required to scale the methodology up to large deep architectures. Besides,
+the hyperparameters are optimized empirically. We find that a feedback scaling in the range of
+0.01–0.1 is favorable for shallow networks (fewer than 4 layers) and 0.1–0.25 for deeper
+architectures. Finding a general way to determine these parameters is ongoing work.
Additionally, existing research on naturally converging EP continues to focus primarily on
+static-input settings (Laborieux et al., 2021; Ernoult et al., 2019; Laborieux & Zenke, 2024).
+Extending naturally converging RNNs trained by EP to sequence tasks remains a challenge.
+From a neurobiological perspective, residual connections, particularly the randomly generated
+arbitrary graph topologies, yield connectivity patterns resembling those of the cortex. The
+feedback-regulated residual RNNs equip the biologically plausible learning framework, EP, with a
+biologically plausible network architecture. Although it currently runs on GPUs, it can exploit the
+natural convergence of physical RNNs and facilitate efficient learning and inference on dedicated
+neuromorphic hardware.
+
+ACKNOWLEDGEMENTS
+This work was supported by the National Key R&D Program of China (Grant No. 2024YFA1208804).
+Additional financial support from the University of Science and Technology of China and the
+Chinese Academy of Sciences is also gratefully acknowledged.
+
+CODE AVAILABILITY
+The code used in this work is available at https://github.com/Zero0Hero/FRE-RNN-EP.
+
+REPRODUCIBILITY STATEMENT
+The code necessary to reproduce the main results is provided as Jupyter Notebooks in the
+Supplementary Materials. Researchers can run them directly to reproduce the results. Further
+details on data pre-processing and the training process are available in the provided code and in
+Appendix D.
+
+THE USE OF LARGE LANGUAGE MODELS (LLMS)
+In the preparation of this work, the authors used GPT-5 and DeepSeek solely for the purpose of
+polishing and improving the linguistic fluency and readability of the text. This includes tasks
+such as correcting grammar and rephrasing sentences. After using the models, the authors reviewed
+and edited all content extensively and take full responsibility for all ideas, claims, and the
+final language presented in this paper.
+
+REFERENCES
+David H.
Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for boltz- + mann machines. Cognitive Science, 9(1):147–169, 1985. +Hisham E. Atallah, Michael J. Frank, and Randall C. O’Reilly. Hippocampus, cortex, and basal + ganglia: Insights from computational models of complementary learning systems. Neurobi- + ology of Learning and Memory, 82(3):253–267, 2004. ISSN 1074-7427. doi: https://doi. + org/10.1016/j.nlm.2004.06.004. URL https://www.sciencedirect.com/science/ + article/pii/S1074742704000693. +Zhang Bai, D. J. Miller, and Wang Yue. Nonlinear system modeling with random matrices: Echo + state networks revisited. IEEE Transactions on Neural Networks and Learning Systems, 23(1): + 175–182, 2012. ISSN 2162-237X 2162-2388. doi: 10.1109/tnnls.2011.2178562. + + + 10 +Published as a conference paper at ICLR 2026 + + + + +Maxence Ernoult, Julie Grollier, Damien Querlioz, Yoshua Bengio, and Benjamin Scellier. Updates + of equilibrium prop match gradients of backprop through time in an rnn with static input. In + Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. + Article 636. Curran Associates Inc., 2019. +Maxence Ernoult, Julie Grollier, Damien Querlioz, Yoshua Bengio, and Benjamin Scellier. Equilib- + rium propagation with continual weight updates, 2020. URL https://openreview.net/ + forum?id=H1xJhJStPS. +D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in the primate cerebral + cortex. Cerebral Cortex, 1(1):1–47, 1991. ISSN 1047-3211. doi: 10.1093/cercor/1.1.1. URL + <GotoISI>://WOS:000208047200002. +Mehmet Fişek, Dustin Herrmann, Alexander Egea-Weiss, Matilda Cloves, Lisa Bauer, Tai-Ying + Lee, Lloyd E. Russell, and Michael Häusser. Cortico-cortical feedback engages active den- + drites in visual cortex. Nature, 617(7962):769–776, 2023. ISSN 1476-4687. doi: 10.1038/ + s41586-023-06007-6. URL https://doi.org/10.1038/s41586-023-06007-6. 
+Paul Haider, Benjamin Ellenberger, Laura Kriener, Jakob Jordan, Walter Senn, and Mihai A. Petro- + vici. Latent equilibrium: a unified learning theory for arbitrarily fast computation with arbitrarily + slow neurons. In Proceedings of the 35th International Conference on Neural Information Pro- + cessing Systems, pp. Article 1365. Curran Associates Inc., 2021. +K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE + Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. ISBN + 1063-6919. doi: 10.1109/CVPR.2016.90. +Maya van Holk and Jorge F Mejias. Biologically plausible models of cognitive flexibility: merging + recurrent neural networks with full-brain dynamics. Current Opinion in Behavioral Sciences, 56: + 101351, 2024. doi: 10.1016/j.cobeha.2024.101351. van Holk, Maya; Mejias, Jorge F. ORCIDs: + 0000-0003-4930-1472 2352-1546 2352-1554. +J. J. Hopfield. Neural networks and physical systems with emergent collective computational abili- + ties. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. ISSN 0027-8424 + 1091-6490. doi: 10.1073/pnas.79.8.2554. +Rasmus Høier, Kirill Kalinin, Maxence Ernoult, and Christopher Zach. Dyadic learning in recur- + rent and feedforward models. In NeurIPS 2024 Workshop Machine Learning with new Compute + Paradigms, 2024. URL https://openreview.net/forum?id=LNfWowAErI. +Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and sav- + ing energy in wireless communication. Science, 304(5667):78–80, 2004. doi: 10.1126/science. + 1091277. URL https://doi.org/10.1126/science.1091277. doi: 10.1126/sci- + ence.1091277. +Adrien Journé, Hector Garcia Rodriguez, Qinghai Guo, and Timoleon Moraitis. Hebbian deep learn- + ing without feedback. In The Eleventh International Conference on Learning Representations, + 2023. URL https://openreview.net/forum?id=8gd4M-_Rj1. +Kazutaka Kanno and Atsushi Uchida. 
Finite-time lyapunov exponents in time-delayed nonlinear + dynamical systems. Physical Review E, 89(3):032918, 2014. ISSN 1539-3755 1550-2376. doi: + 10.1103/PhysRevE.89.032918. +Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International + Conference for Learning Representations. ICLR, 2015. doi: doi.org/10.48550/arXiv.1412.6980. +Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convo- + lutional neural networks. In Advances in Neural Information Processing Systems, 2012. +Dhireesha Kudithipudi, Catherine Schuman, Craig M. Vineyard, Tej Pandit, Cory Merkel, Rajkumar + Kubendran, James B. Aimone, Garrick Orchard, Christian Mayr, Ryad Benosman, Joe Hays, Cliff + Young, Chiara Bartolozzi, Amitava Majumdar, Suma George Cardwell, Melika Payvand, Sonia + Buckley, Shruti Kulkarni, Hector A. Gonzalez, Gert Cauwenberghs, Chetan Singh Thakur, Anand + Subramoney, and Steve Furber. Neuromorphic computing at scale. Nature, 637(8047):801–812, + 2025. ISSN 0028-0836 1476-4687. doi: 10.1038/s41586-024-08253-8. + + + 11 +Published as a conference paper at ICLR 2026 + + + + +Suman Kulkarni and Dani S. Bassett. Toward principles of brain network organiza- + tion and function. Annual Review of Biophysics, 54(Volume 54, 2025):353–378, + 2025. ISSN 1936-1238. doi: https://doi.org/10.1146/annurev-biophys-030722-110624. + URL https://www.annualreviews.org/content/journals/10.1146/ + annurev-biophys-030722-110624. +Axel Laborieux and Friedemann Zenke. Improving equilibrium propagation without weight sym- + metry through jacobian homeostasis. In The Twelfth International Conference on Learning Rep- + resentations. ICLR, 2024. URL https://openreview.net/forum?id=kUveo5k1GF. +Axel Laborieux, Maxence Ernoult, Benjamin Scellier, Yoshua Bengio, Julie Grollier, and Damien + Querlioz. Scaling equilibrium propagation to deep convnets by drastically reducing its gradient + estimator bias. 
Frontiers in Neuroscience, 15:633674, 2021. ISSN 1662-453X. doi: 10.3389/ + fnins.2021.633674. +Yann Lecun. A theoretical framework for back-propagation. In Proceedings of the 1988 Connec- + tionist Models Summer School, 1988. +Robert Legenstein and Wolfgang Maass. Edge of chaos and prediction of computational perfor- + mance for neural circuit models. Neural Networks, 20(3):323–334, 2007. ISSN 08936080. doi: + 10.1016/j.neunet.2007.04.017. +Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic + feedback weights support error backpropagation for deep learning. Nature Communications, 7 + (1):13276, 2016. ISSN 2041-1723. doi: 10.1038/ncomms13276. +Zhuo Liu, Hao Shu, Linmiao Wang, Xu Meng, Yousheng Wang, Xuancheng Li, Wei Wang, and + Tao Chen. Adjoint propagation of error signal through modular recurrent neural networks for + biologically plausible learning. eLife, 15:e108237, 2026. ISSN 2050-084X. doi: 10.7554/eLife. + 108237. URL https://doi.org/10.7554/eLife.108237. +Adam S. Lowet, Qiao Zheng, Sara Matias, Jan Drugowitsch, and Naoshige Uchida. Distribu- + tional reinforcement learning in the brain. Trends in Neurosciences, 43(12):980–997, 2020. + ISSN 0166-2236. doi: https://doi.org/10.1016/j.tins.2020.09.004. URL https://www. + sciencedirect.com/science/article/pii/S0166223620301983. +Christopher W. Lynn and Danielle S. Bassett. The physics of brain network structure, function + and control. Nature Reviews Physics, 1(5):318–332, 2019. ISSN 2522-5820. doi: 10.1038/ + s42254-019-0040-8. +Nikola T. Markov, Mária Ercsey-Ravasz, David C. Van Essen, Kenneth Knoblauch, Zoltán + Toroczkai, and Henry Kennedy. Cortical high-density counterstream architectures. Science, 342 + (6158), 2013. ISSN 0036-8075 1095-9203. doi: 10.1126/science.1238406. +Jorge F. Mejias, John D. Murray, Henry Kennedy, and Xiao-Jing Wang. Feedforward and feedback + frequency-dependent interactions in a large-scale laminar network of the primate cortex. 
Science + Advances, 2(11):e1601335, 2016. doi: doi:10.1126/sciadv.1601335. URL https://www. + science.org/doi/abs/10.1126/sciadv.1601335. +Jan Melchior and Laurenz Wiskott. Hebbian-descent, 2019. URL https://arxiv.org/abs/ + 1905.10585. +Alexander Meulemans, Matilde Tristany Farinha, Javier Garcia Ordonez, Pau Vilimelis Aceituno, + João Sacramento, and Benjamin F. Grewe. Credit assignment in neural networks through deep + feedback control. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman + Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 4674–4687. + Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_ + files/paper/2021/file/25048eb6a33209cb5a815bff0cf6887c-Paper.pdf. +Georgios Michalareas, Julien Vezoli, Stan van Pelt, Jan-Mathijs Schoffelen, Henry Kennedy, and + Pascal Fries. Alpha-beta and gamma rhythms subserve feedback and feedforward influences + among human visual cortical areas. Neuron, 89(2):384–397, 2016. ISSN 0896-6273. doi: + https://doi.org/10.1016/j.neuron.2015.12.018. URL https://www.sciencedirect.com/ + science/article/pii/S0896627315011204. + + + 12 +Published as a conference paper at ICLR 2026 + + + + +John Miller and Moritz Hardt. Stable recurrent models. In International Conference on Learning + Representations, 2019. URL https://openreview.net/forum?id=Hygxb2CqKm. +Beren Millidge, Yuhang Song, Tommaso Salvatori, Thomas Lukasiewicz, and Rafal Bogacz. Back- + propagation at the infinitesimal inference limit of energy-based models: Unifying predictive cod- + ing, equilibrium propagation, and contrastive hebbian learning. In The Eleventh International + Conference on Learning Representations, 2023. URL https://openreview.net/forum? + id=nIMifqu2EO. +Javier R. Movellan. Contrastive Hebbian Learning in the Continuous Hopfield Model, pp. + 10–17. Morgan Kaufmann, 1991. ISBN 978-1-4832-1448-1. doi: https://doi.org/10.1016/ + B978-1-4832-1448-1.50007-X. 
URL https://www.sciencedirect.com/science/ + article/pii/B978148321448150007X. +Mitsumasa Nakajima, Yongbo Zhang, Katsuma Inoue, Yasuo Kuniyoshi, Toshikazu Hashimoto, + and Kohei Nakajima. Reservoir direct feedback alignment: deep learning by physical dynamics. + Communications Physics, 7(1):411, 2024. ISSN 2399-3650. doi: 10.1038/s42005-024-01895-0. +Arild Nøkland. Direct feedback alignment provides learning in deep neural net- + works. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Ad- + vances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., + 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/ + file/d490d7b4576290fa60eb31b5fc917ad1-Paper.pdf. +Peter O’Connor, Efstratios Gavves, and Max Welling. Initialized equilibrium propagation for + backprop-free training. In International Conference on Learning Representations. ICLR, 2019. + URL https://openreview.net/forum?id=B1GMDsR5tm. +Alexander Ororbia and Ankur Mali. Biologically motivated algorithms for propagating local target + representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. + 4651–4658. AAAI, 2019. +Alexander Ororbia, Patrick Haffner, David Reitter, and C. Lee Giles. Learning to adapt by minimiz- + ing discrepancy, 2017. +Alexander G. Ororbia. Brain-inspired machine intelligence: A survey of neurobiologically-plausible + credit assignment, 2023. URL https://arxiv.org/abs/2312.09257. +Alexander G. Ororbia, Ankur Mali, Daniel Kifer, and C. Lee Giles. Backpropagation-free deep + learning with recursive local representation alignment. In Proceedings of the AAAI Conference on + Artificial Intelligence, volume 37, pp. 9327–9335. AAAI, 2023. URL https://ojs.aaai. + org/index.php/AAAI/article/view/26118. +Matthew G. Perich and Kanaka Rajan. Rethinking brain-wide interactions through multi-region + ‘network of networks’ models. Current Opinion in Neurobiology, 65:146–151, 2020. ISSN + 09594388. 
doi: 10.1016/j.conb.2020.11.003. +D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating + errors. Nature, 323(6088):533–536, 1986. ISSN 0028-0836. doi: 10.1038/323533a0. URL + <GotoISI>://WOS:A1986E327500055. E3275 Times Cited:16725 Cited References + Count:4. +João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic corti- + cal microcircuits approximate the backpropagation algorithm. In S. Bengio, H. Wal- + lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Ad- + vances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., + 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/ + file/1dc3a89d0d440ba31729b0ba74b93a33-Paper.pdf. +Tommaso Salvatori, Luca Pinchetti, Beren Millidge, Yuhang Song, Tianyi Bao, Rafal Bogacz, and + Thomas Lukasiewicz. Learning on arbitrary graph topologies via predictive coding. In Alice H. + Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Informa- + tion Processing Systems, 2022. URL https://openreview.net/forum?id=dqO59nI_ + R9A. + + + 13 +Published as a conference paper at ICLR 2026 + + + + +Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy- + based models and backpropagation. Frontiers in Computational Neuroscience, 11:24, 2017. ISSN + 1662-5188. doi: 10.3389/fncom.2017.00024. +Benjamin Scellier, Anirudh Goyal, Jonathan Binas, Thomas Mesnard, and Yoshua Bengio. Gener- + alization of equilibrium propagation to vector field dynamics, 2018. URL https://arxiv. + org/abs/1808.04873. +Benjamin Scellier, Maxence Ernoult, Jack Kendall, and Suhas Kumar. Energy-based learn- + ing algorithms for analog computing: a comparative study. In A. Oh, T. Naumann, + A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural In- + formation Processing Systems, volume 36, pp. 52705–52731. Curran Associates, Inc., + 2023. 
URL https://proceedings.neurips.cc/paper_files/paper/2023/ + file/a52b0d191b619477cc798d544f4f0e4b-Paper-Conference.pdf. +João D. Semedo, Anna I. Jasper, Amin Zandvakili, Aravind Krishna, Amir Aschner, Christian K. + Machens, Adam Kohn, and Byron M. Yu. Feedforward and feedback interactions between visual + cortical areas use different population activity patterns. Nature Communications, 13(1):1099, + 2022. ISSN 2041-1723. doi: 10.1038/s41467-022-28552-w. URL https://doi.org/10. + 1038/s41467-022-28552-w. +Shawn C. Shadden, Francois Lekien, and Jerrold E. Marsden. Definition and properties of la- + grangian coherent structures from finite-time lyapunov exponents in two-dimensional aperi- + odic flows. Physica D: Nonlinear Phenomena, 212(3-4):271–304, 2005. ISSN 01672789. doi: + 10.1016/j.physd.2005.10.007. +Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image + recognition. In International Conference on Learning Representations, 2015. +Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, + Ł. ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von + Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Ad- + vances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., + 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/ + file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. +Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensem- + bles of relatively shallow networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and + R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran + Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/ + paper/2016/file/37bc2f75bf1bcfe8450a1a41c200364c-Paper.pdf. 
+Ran Wang, Xupeng Chen, Amirhossein Khalilian-Gourtani, Leyao Yu, Patricia Dugan, Daniel Fried- + man, Werner Doyle, Orrin Devinsky, Yao Wang, and Adeen Flinker. Distributed feedforward + and feedback cortical processing supports human speech production. Proceedings of the Na- + tional Academy of Sciences, 120(42):e2300255120, 2023. doi: 10.1073/pnas.2300255120. URL + https://doi.org/10.1073/pnas.2300255120. doi: 10.1073/pnas.2300255120. +D. J. Watts and S. H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393 + (6684):440–442, 1998. ISSN 0028-0836. doi: Doi10.1038/30918. URL <GotoISI>://WOS: + 000074020000035. Zr842 Times Cited:29354 Cited References Count:27. +Alan Wolf, Jack B. Swift, Harry L. Swinney, and John A. Vastano. Determining lyapunov + exponents from a time series. Physica D: Nonlinear Phenomena, 16(3):285–317, 1985. + ISSN 0167-2789. doi: https://doi.org/10.1016/0167-2789(85)90011-9. URL https://www. + sciencedirect.com/science/article/pii/0167278985900119. +X. Xie and H. S. Seung. Equivalence of backpropagation and contrastive hebbian learning in a + layered network. Neural Computation, 15(2):441–454, 2003. ISSN 0899-7667. doi: 10.1162/ + 089976603762552988. +Izzet B. Yildiz, Herbert Jaeger, and Stefan J. Kiebel. Re-visiting the echo state property. Neural + Networks, 35:1–9, 2012. ISSN 08936080. doi: 10.1016/j.neunet.2012.07.005. + + + 14 +Published as a conference paper at ICLR 2026 + + + + +A T HE DYNAMICS OF THE RNN +We quantify the convergence property of the recurrent neural network (RNN) with maximum Lya- +punov exponent (MLE) (Wolf et al., 1985), and finite time maximum Lyapunov exponent (FTMLE) +(Kanno & Uchida, 2014). When the MLE/FTMLE is large, the RNN converges slow or even not +at all. To compute MLE and FTMLE, we first initialize a random perturbation vector δ0 . Then we +record the sequence of states s0 [t] with t = 0, 1, 2, . . . 
, Te − 1 corresponding to the last sample of a
+training set (see Figure 2 in the main text), and run the following steps:
+
+  1. Normalize the perturbation vector to unit length:
+     $$\delta_t \leftarrow \frac{\delta_t}{\lVert \delta_t \rVert}$$
+  2. Calculate the Jacobian matrix:
+     $$J(s^0[t]) = \frac{\partial F(s^0[t], b)}{\partial s^0[t]}$$
+  3. Update the perturbation:
+     $$\delta_{t+1} = J(s^0[t]) \cdot \delta_t$$
+  4. Record
+     $$r_t = \ln \lVert \delta_{t+1} \rVert$$
+
+The maximum Lyapunov exponent is computed as
+$$\lambda_{\max} = \frac{1}{T_e} \sum_{t=0}^{T_e - 1} r_t$$
+for a sufficiently large Te (default Te = 500). The results at any T < Te are the FTMLE.
+Figures S1–S2 show the FTMLE, MLE, training accuracy, and test accuracy versus epochs for
+different models. In all cases, a smaller βi usually yields a smaller (FT)MLE, whereas a larger βi
+does not always lead to a larger (FT)MLE because the activation function saturates, and saturation
+diminishes the perturbation.
+For the 2-hidden-layer RNN, a smaller feedback scaling βi yields steady training progress and
+better accuracy. Figure S3 plots the FTMLE and test accuracy against feedback scaling for different
+numbers of hidden layers. It shows that a smaller βi is favorable for shallow networks, because the
+RNN converges more easily (as indicated by the FTMLE). For deeper networks (5 hidden layers or
+more), however, a smaller βi degrades performance because of the vanishing gradient.
+A further comparison between our FRE-RNN with convolutional structure and previous work
+(Ernoult et al., 2019) is plotted in Figure S4. These results suggest that a small feedback
+scaling (βi = 0.01) leads to a smoother training process.
+
+B GRADIENT VANISHING AND THE RESIDUAL CONNECTIONS
+Figures S5 and S6 plot the error of each neuron versus epoch at different βi. For a 2-hidden-layer
+RNN, the best performance is obtained at βi = 0.001. In this situation, the error of the first
+hidden layer is at least two orders of magnitude smaller than that of the second hidden layer.
At βi = 2, the error also decreases from higher layers (high-index neurons, closer to the output
+layer) to lower layers, which is attributed to the saturation of the activation function. In
+general, training progresses more steadily for smaller βi despite the vanishing gradient, and this
+also applies to deeper networks (up to 10 hidden layers).
+To eliminate the vanishing gradient in EP, direct feedback from the higher layers or local
+amplification (with a higher learning rate) is unavoidable (Nøkland, 2016; Ororbia et al., 2023).
+Figure S7 shows the effect of residual connections. βi = 0.1 yields the best accuracy, 97.5%,
+owing to the balance between gradient flow and convergence.
+Figure S8 shows how the testing accuracy varies with the connection probability P of AGT with 10
+hidden layers. Apart from the connections of the layered model, a connection between any two
+hidden layers is generated with probability P, i.e., we first use P to decide whether a
+connection between two given layers is established. As P increases, the accuracy first rises,
+peaks at P = 0.2, and decreases as P approaches 1. The reason for this behavior is yet to be
+explored.
+
+Figure S1: The FTMLE, MLE, training accuracy and testing accuracy of symmetric RNNs versus
+epochs with different feedback scaling βi (legend). First row: 2 hidden layers; Second row: 3 hidden
+layers; Third row: 5 hidden layers; Fourth row: 10 hidden layers. The activation is tanh. Each case
+is repeated 5 times.
+
+Figure S2: The FTMLE, MLE, training accuracy and testing accuracy of asymmetric RNNs versus
+epochs with different feedback scaling βi (legend). First row: 2 hidden layers; Second row: 3 hidden
+layers; Third row: 5 hidden layers; Fourth row: 10 hidden layers. The activation is tanh. Each case
+is repeated 5 times.
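The (FT)MLE curves in Figures S1 and S2 follow the four-step procedure of Appendix A. The sketch below illustrates that procedure in plain Python. It assumes discrete-time dynamics s[t+1] = tanh(W s[t] + b) as a stand-in for the paper's F; the function name `lyapunov_running_means` and the dense matrix `W` are illustrative choices and are not taken from the released code.

```python
import math
import random


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def lyapunov_running_means(W, b, s_init, T_e=500, seed=0):
    """Running means of r_t = ln||delta_{t+1}||.

    Entry T-1 is the finite-time exponent (FTMLE) at horizon T; the last
    entry approximates the MLE. Assumed dynamics: s[t+1] = tanh(W s[t] + b).
    """
    rng = random.Random(seed)
    delta = [rng.gauss(0.0, 1.0) for _ in s_init]  # random initial perturbation
    s = list(s_init)
    log_sum = 0.0
    running = []
    for t in range(T_e):
        n = norm(delta)
        delta = [d / n for d in delta]             # step 1: normalize to unit length
        s_next = [math.tanh(h + bi) for h, bi in zip(matvec(W, s), b)]
        # steps 2-3: for tanh, J = diag(1 - s_next^2) W, so J @ delta is an
        # element-wise rescaling of W @ delta
        Wd = matvec(W, delta)
        delta = [(1.0 - y * y) * x for y, x in zip(s_next, Wd)]
        log_sum += math.log(norm(delta))           # step 4: record r_t
        running.append(log_sum / (t + 1))
        s = s_next
    return running
```

For a linear contraction such as W = 0.5·I with zero bias and zero initial state, every r_t equals ln 0.5, so the running mean is negative and the dynamics converge. Scaling the feedback block of `W` by a smaller βi lowers the exponent in the same way.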
+
+Figure S3: The FTMLE and testing accuracy versus feedback scaling βi with different numbers of
+hidden layers. (a) Symmetric weights; (b) Asymmetric weights. The FTMLE and testing accuracy
+given here correspond to their maxima over all epochs. Note that the 5-hidden-layer asymmetric RNN
+with large βi diverged, which results in missing data points in (b). Each case is repeated 5 times.
+
+Figure S4: Comparison of RNNs with embedded convolutional structure on MNIST between
+P-EP (a) (Ernoult et al., 2019) and our approach at different βi (b-d). We used the same parameters
+as the EP reference (Ernoult et al., 2019).
+
+Figure S5: For the 2-hidden-layer RNN, the mean error of each neuron in the last batch and testing
+accuracy versus epochs at different βi. All neurons in the hidden layers and the output layer are
+indexed from the input to the output layer.
+
+Figure S6: For the 10-hidden-layer model, the mean error of each neuron in the last batch and
+testing accuracy versus epochs at different βi. All neurons in the hidden layers and the output
+layer are indexed from the input to the output layer.
+
+Figure S7: For the 10-hidden-layer model with residual connections, the mean error of each neuron
+in the last batch and testing accuracy versus epochs at different βi. All neurons in the hidden
+layers and the output layer are indexed from the input to the output layer.
+
+Figure S8: The testing accuracy on MNIST as a function of the connection probability P of AGT with
+10 hidden layers. The experiments are repeated 5 times.
+
+Table S1: Testing accuracy (mean of 5 repeated experiments) with different feedback scaling βi.
By default, T = 10 × N_HiddenLayer and K = 5 × N_HiddenLayer. Each hidden layer has 64 nodes.
+
+  Architecture-connections  βi=0.001  βi=0.01  βi=0.1  βi=0.25  βi=1    βi=2    βi=4
+  2HL-symm                  97.69%    97.57%   97.25%  96.22%   93.12%  66.04%  40.92%
+  3HL-symm                  97.22%    97.64%   97.41%  96.60%   55.86%  32.64%  22.11%
+  5HL-symm                  93.54%    95.54%   97.60%  90.63%   25.31%  17.88%  14.61%
+  10HL-symm                 87.15%    89.99%   92.54%  41.84%   14.07%  14.30%  14.23%
+  10HL-Residual-symm        –         97.52%   97.46%  –        95.51%  –       –
+  conv-symm                 –         99.15%   98.71%  –        11.35%  –       –
+  2HL-asymm                 96.96%    96.97%   96.88%  96.79%   93.88%  91.81%  89.91%
+  3HL-asymm                 95.17%    96.91%   96.76%  96.66%   91.21%  54.65%  26.72%
+  5HL-asymm                 91.14%    92.34%   96.41%  96.35%   17.15%  11.35%  13.07%
+  10HL-asymm                84.27%    85.83%   87.79%  90.97%   16.13%  14.21%  16.67%
+  10HL-AGT-asymm            –         96.37%   96.75%  –        33.31%  –       –
+
+
+C EQUIVALENCE WITH EP AND BP UNDER THE CONDITION OF INFINITESIMAL INFERENCE LIMIT
+
+Figure S9: A layered network model used to illustrate the process of backpropagation (BP), local
+representation alignment (LRA), and EP. Note that the final prediction layer $\cdot_p$ corresponds
+to the third layer with subindex $\cdot_3$. For LRA, we use $\beta_{LRA}$ instead of $\beta_1$ and
+$\beta_f$. For BP, the feedback (orange) paths are absent.
+
+In this section, we use the infinitesimal inference limit (Millidge et al., 2023) to derive the
+equivalence of EP with LRA and BP.
+
+C.1 BACKPROPAGATION
+
+When we remove the feedback connections of the 2-hidden-layer RNN shown in Figure S9, a
+feedforward network is left, which can be trained with BP. The forward process of BP is described
+by:
+$$\begin{aligned}
+  s_1 &= \rho(h_1), & h_1 &= W_0 \cdot s_0, \\
+  s_2 &= \rho(h_2), & h_2 &= W_1 \cdot s_1, \\
+  s_p &= h_p,       & h_p &= W_f \cdot s_2.
+\end{aligned} \tag{S1}$$
+Defining a loss $L_{BP} = \frac{1}{2}(s_p - s_{tar})^2$, the weights adjust according to the
+gradient of the loss.
Taking $\Delta W_0$ as an example:
+$$\Delta W_0 = -\frac{\partial L_{BP}}{\partial W_0}
+  = -\left[\rho'(h_1) \odot W_1^\top \cdot \left(\rho'(h_2) \odot W_f^\top \cdot (s_p - s_{tar})\right)\right] \cdot (s_0)^\top, \tag{S2}$$
+where “⊙” denotes the Hadamard (element-wise) product and “·” denotes scalar or matrix
+multiplication. For two vectors or matrices, “⊙” requires identical dimensions. Broadcasting rules
+may apply (e.g., for a column vector $v_{m \times 1}$, the product $v \odot A_{m \times n}$
+multiplies each column of $A$ element-wise by $v$).
+
+C.2 LOCAL REPRESENTATION ALIGNMENT
+
+LRA is an alternative training method following the principle of discrepancy reduction (Ororbia
+et al., 2017; Ororbia & Mali, 2019). It can be divided into two phases: 1) the network runs the
+forward process, producing latent representations of the input samples; 2) the weights adjust in
+the direction that reduces the mismatch between the current latent representations and the target
+representations in each layer.
+The forward process is the same as in BP:
+$$\begin{aligned}
+  s_1^0 &= \rho(h_1^0), & h_1^0 &= W_0 \cdot s_0, \\
+  s_2^0 &= \rho(h_2^0), & h_2^0 &= W_1 \cdot s_1^0, \\
+  s_p^0 &= h_p^0,       & h_p^0 &= W_f \cdot s_2^0,
+\end{aligned} \tag{S3}$$
+where the $s_i^0$ are interpreted as the latent representations. The prediction error is
+$e_p = s_{tar} - s_p^0$.
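The target representation of the second hidden layer is then
+$$s_2^{\beta_{LRA}} = \rho(h_2^{\beta_{LRA}}), \qquad
+  h_2^{\beta_{LRA}} = W_1 \cdot s_1^0 + \beta_{LRA} \cdot B_f \cdot e_p. \tag{S4}$$
+The same goes for the first hidden layer:
+$$s_1^{\beta_{LRA}} = \rho(h_1^{\beta_{LRA}}), \qquad
+  h_1^{\beta_{LRA}} = W_0 \cdot s_0 + \beta_{LRA} \cdot B_1 \cdot e_2, \qquad
+  e_2 = s_2^{\beta_{LRA}} - s_2^0. \tag{S5}$$
+LRA defines the loss as the total discrepancy between latent representations and target
+representations:
+$$L_{LRA} = \sum_{i=1}^{L} k_i L_i(s_i^0, s_i^{\beta_{LRA}})
+  = \sum_{i=1}^{L} \frac{1}{2} (s_i^0 - s_i^{\beta_{LRA}})^2. \tag{S6}$$
+The weight $W_i$ adjusts according to the local mismatch between $s_{i+1}^0$ and
+$s_{i+1}^{\beta_{LRA}}$:
+$$\begin{aligned}
+  \Delta W_i &= -\frac{\partial\, k_i L_i(s_{i+1}^0, s_{i+1}^{\beta_{LRA}})}{\partial W_i} \\
+  &= \left[(s_{i+1}^{\beta_{LRA}} - s_{i+1}^0) \odot \rho'(h_{i+1}^0)\right] \cdot (s_i^0)^\top \\
+  &\approx (s_{i+1}^{\beta_{LRA}} - s_{i+1}^0) \cdot (s_i^0)^\top,
+\end{aligned} \tag{S7}$$
+where the derivative of the activation function is omitted in the last row, a useful practice
+common in LRA (Melchior & Wiskott, 2019; Ororbia & Mali, 2019; Ororbia et al., 2023). When
+$\beta_{LRA} \to 0$, $s_i^{\beta_{LRA}} \to s_i^0$ and $h_i^{\beta_{LRA}} \to h_i^0$, so that
+$$\begin{aligned}
+  e_i &= s_i^{\beta_{LRA}} - s_i^0 = \rho(h_i^{\beta_{LRA}}) - \rho(h_i^0) \\
+  &= \rho(h_i^0 + \beta_{LRA} \cdot B_i \cdot e_{i+1}) - \rho(h_i^0) \\
+  &\approx \left[\rho(h_i^0) + \rho'(h_i^0) \odot (\beta_{LRA} \cdot B_i \cdot e_{i+1}) - \rho(h_i^0)\right]_{\beta_{LRA} \to 0} \\
+  &= \rho'(h_i^0) \odot (\beta_{LRA} \cdot B_i \cdot e_{i+1}).
+\end{aligned} \tag{S8}$$
+The approximation in Equation S8 is based on a first-order Taylor expansion of
+$\rho(h_i^0 + \Delta h)$ around $h_i^0$, where $\Delta h = \beta_{LRA} \cdot B_i \cdot e_{i+1}$.
+For a small perturbation $\Delta h \to 0$, the Taylor expansion gives:
+$$\rho(h_i^0 + \Delta h) = \rho(h_i^0) + \rho'(h_i^0) \odot \Delta h + O(\Delta h^2). \tag{S9}$$
+When $\beta_{LRA} \to 0$, the higher-order terms $O(\Delta h^2)$ are negligible, leaving only the
+linear term. We arrive at the last row of Equation S8 after canceling out $\rho(h_i^0)$. The
+weight adjustment can then be expressed as
+$$\begin{aligned}
+  \Delta W_0 &= e_1 \cdot (s_0^0)^\top \\
+  &= \left[\rho'(h_1^0) \odot \beta_{LRA} \cdot B_1 \cdot \left(\rho'(h_2^0) \odot \beta_{LRA} \cdot B_f \cdot (s_{tar} - s_p)\right)\right] \cdot (s_0)^\top \Big|_{B_i = (W_i)^\top} \\
+  &= -\beta_{LRA} \cdot \beta_{LRA} \cdot \left[\rho'(h_1^0) \odot W_1^\top \cdot \left(\rho'(h_2^0) \odot W_f^\top \cdot (s_p - s_{tar})\right)\right] \cdot (s_0)^\top,
+\end{aligned} \tag{S10}$$
+which equals the BP update (Equation S2) up to the constant factor $\beta_{LRA}^2$. Thus, LRA in
+the weak-feedback limit approximates BP.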
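The relation between Equations S2 and S10 can be checked numerically on a toy network. The following sketch uses plain Python with ρ = tanh, symmetric feedback (B = Wᵀ), and a small feedback scale. The weight values, the input, the target, and all function names are hypothetical and chosen only for illustration; they do not come from the paper's experiments.

```python
import math

# Hypothetical toy network: 3 inputs, two hidden layers of 2 units, 2 outputs.
W0 = [[0.5, -0.3, 0.2], [0.1, 0.4, -0.6]]
W1 = [[0.7, -0.2], [0.3, 0.5]]
Wf = [[0.6, -0.4], [0.2, 0.9]]
s0 = [1.0, 0.5, -1.0]
s_tar = [1.0, 0.0]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def forward(W0, W1, Wf, s0):
    """Feedforward pass of Equations S1/S3 with rho = tanh."""
    s1 = [math.tanh(h) for h in matvec(W0, s0)]
    s2 = [math.tanh(h) for h in matvec(W1, s1)]
    return s1, s2, matvec(Wf, s2)


def bp_update(W0, W1, Wf, s0, s_tar):
    """Delta W0 of Equation S2; tanh'(h) is written as 1 - s^2."""
    s1, s2, sp = forward(W0, W1, Wf, s0)
    err = [p - t for p, t in zip(sp, s_tar)]
    d2 = [(1.0 - s * s) * x for s, x in zip(s2, matvec(transpose(Wf), err))]
    d1 = [(1.0 - s * s) * x for s, x in zip(s1, matvec(transpose(W1), d2))]
    return [[-d * x for x in s0] for d in d1]


def lra_update(W0, W1, Wf, s0, s_tar, beta):
    """Delta W0 = e1 . s0^T from Equations S4, S5 and S7 (last row), B = W^T."""
    s1, s2, sp = forward(W0, W1, Wf, s0)
    e_p = [t - p for t, p in zip(s_tar, sp)]
    fb2 = matvec(transpose(Wf), e_p)
    s2_target = [math.tanh(h + beta * f) for h, f in zip(matvec(W1, s1), fb2)]
    e2 = [a - b for a, b in zip(s2_target, s2)]
    fb1 = matvec(transpose(W1), e2)
    s1_target = [math.tanh(h + beta * f) for h, f in zip(matvec(W0, s0), fb1)]
    e1 = [a - b for a, b in zip(s1_target, s1)]
    return [[e * x for x in s0] for e in e1]


def relative_gap(A, B):
    """Frobenius norm of A - B relative to the norm of B."""
    num = sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))
    den = sum(b * b for rb in B for b in rb)
    return math.sqrt(num / den)
```

Dividing the result of `lra_update` by `beta ** 2` should approach `bp_update` as `beta` shrinks, and `relative_gap` quantifies the remaining first-order error.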
An LRA algorithm for a 2-hidden-layer network is described in Algorithm S1. The feedback weights in LRA need not be learned here, but can be kept symmetric with the feedforward weights.
+
+C.3 EQUILIBRIUM PROPAGATION
+
+We can also formulate EP in terms of discrepancy reduction. In EP (Algorithm 1 in the main text), the network states evolve as follows ($\beta = 0$ for the first phase and $\beta = \beta_f$ for the second phase):
+$$\begin{aligned}
+h_1^\beta &= W_0 \cdot s_0^\beta + \beta_1 \cdot B_1 \cdot s_2^\beta, \\
+h_2^\beta &= W_1 \cdot s_1^\beta + \beta_f \cdot B_f \cdot e_p, \\
+h_p^\beta &= W_f \cdot s_2^\beta, \\
+s_1^\beta, s_2^\beta, s_p^\beta &= \rho(h_1^\beta), \rho(h_2^\beta), h_p^\beta,
+\end{aligned} \tag{S11}$$
+where $e_p = s_{tar} - s_p^0$ is the prediction error. The network converges to final states $h_1^0, h_2^0, s_1^0, s_2^0$ in the free phase. The error of the $s_2$ neurons can be described by:
+$$ds_2 = [\rho(h_2^{\beta_f})]_{\beta_f \to 0} - [\rho(h_2^0)]_{\beta_f = 0} \approx \rho'(h_2^0) \odot (\beta_f \cdot B_f \cdot e_p), \tag{S12}$$
+where only the first-order infinitesimal term is retained as $\beta_1 \to 0$. The same goes for the first hidden layer:
+$$ds_1 = [\rho(h_1^{\beta_f})]_{\beta_f \to 0} - [\rho(h_1^0)]_{\beta_f = 0} \approx \rho'(h_1^0) \odot \Big( \beta_1 \cdot B_1 \cdot \big( \rho'(h_2^0) \odot (\beta_f \cdot B_f \cdot e_p) \big) \Big). \tag{S13}$$
+The weight $W_0$ can be updated by:
+$$\Delta W_0 = \frac{ds_1 \cdot (s_0^0)^\top}{\beta_1 \cdot \beta_f} = \rho'(h_1^0) \odot \Big[ B_1 \cdot \big( \rho'(h_2^0) \odot (B_f \cdot e_p) \big) \Big] \cdot (s_0^0)^\top. \tag{S14}$$
+With $B_i = W_i^\top$,
+$$ds_1 = \beta_f \cdot \beta_1 \cdot \rho'(h_1^0) \odot \Big[ W_1^\top \cdot \Big( \rho'(h_2^0) \odot \big( W_f^\top \cdot (-(s_p - s_{tar})) \big) \Big) \Big], \tag{S15}$$
+$$\Delta W_0 = \frac{ds_1 \cdot (s_0^0)^\top}{\beta_1 \cdot \beta_f} = -\rho'(h_1^0) \odot \Big[ W_1^\top \cdot \Big( \rho'(h_2^0) \odot \big( W_f^\top \cdot (s_p - s_{tar}) \big) \Big) \Big] \cdot (s_0^0)^\top. \tag{S16}$$
+Note that, compared with the weight update in the main text, the factor $1/(\beta_1 \cdot \beta_f)$ is added to recover a gradient amplitude similar to BP. Further, if we assume that the high-order infinitesimal in the first phase can be omitted, the dynamics of the RNN are governed by:
+$$s_1^0 = \rho(h_1^0), \quad h_1^0 = [W_0 \cdot s_0 + \beta_1 \cdot B_1 \cdot s_2^0]_{\beta_1 \to 0} \approx W_0 \cdot s_0, \tag{S17}$$
+$$s_2^0 = \rho(h_2^0), \quad h_2^0 = [W_1 \cdot s_1^0 + \beta_f \cdot B_f \cdot e_p]_{\beta_1 \to 0, \beta_f = 0} \approx W_1 \cdot s_1^0, \tag{S18}$$
+$$s_p^0 = h_p^0, \quad h_p^0 = W_f \cdot s_2^0. \tag{S19}$$
+The information flow of the RNN degenerates into that of a feedforward network.
This does not affect the error information $ds_i$, thus Equation S16 approximates Equation S2 for BP. Meanwhile, it resembles LRA with low $\beta_{LRA}$, which turns explicit error into implicit error. So far, we have shown that although the errors are obtained differently in EP, LRA, and BP, they are equivalent under the assumption of weak supervision and weak feedback.
+
+Algorithm S1 Local Representation Alignment (LRA)
+Input: (x, s_tar)
+Parameter: θ = [W_0, W_1, W_f, B_f, B_1, β_LRA]
+Output: θ
+function FORWARD(θ, x)
+    s_0 ← x
+    h_1 ← W_0 · s_0,  s_1^0 ← ρ(h_1)
+    h_2 ← W_1 · s_1^0,  s_2^0 ← ρ(h_2)
+    s_p^0 ← W_f · s_2^0
+    Λ_1 ← [s_i^0], i = 0, 1, 2, p
+    return Λ_1
+end function
+function FEEDBACK(θ, Λ_1, s_tar)
+    e_p ← s_tar − s_p^0
+    h_2 ← W_1 · s_1^0 + β_LRA · B_f · e_p,  s_2^{β_LRA} ← ρ(h_2)
+    e_2 ← s_2^{β_LRA} − s_2^0
+    h_1 ← W_0 · s_0 + β_LRA · B_1 · e_2,  s_1^{β_LRA} ← ρ(h_1)
+    e_1 ← s_1^{β_LRA} − s_1^0
+    Λ_2 ← [e_1, e_2, e_p]
+    return Λ_2
+end function
+function UPDATING-WEIGHTS(θ, Λ_1, Λ_2)
+    ΔW_i ← e_{i+1} · (s_i^0)^T, i = 0, 1
+    ΔW_f ← e_p · (s_2^0)^T
+end function
+
+C.4 EXPERIMENTS FOR EQUIVALENCE WITH EP AND BP
+
+Prior works have shown that EP can be made equivalent to BPTT under specific conditions and can achieve comparable performance (Ernoult et al., 2019; Laborieux et al., 2021). As discussed in the previous section, although the overall architecture forms an RNN, the network behaves similarly to a feedforward model due to weak feedback connections.
+To show the equivalence of EP and BP experimentally, we further compare our model with an FNN that has the same feedforward weights and is trained by BP. We mainly compare the cosine similarity of states, bias gradients, and weight gradients for the first batch (batch size 200), as given in Figure S10. Figure S10(a-c) shows the similarity under $\beta_i = 1$ with different numbers of iterations.
For the bias gradients, i.e., $ds_i$, the cosine similarity declines rapidly, indicating no similarity between our model and BP. With weak feedback $\beta_i = 0.1$, as shown in Figure S10(d-f), the similarity of states approaches 1, and the bias-gradient similarity exceeds 0.5 for the last 6 layers with T = 500 or T = 50 and for the last 4 layers with T = 20. These results provide further evidence that EP is equivalent to BP under the condition of weak feedback.
+We further studied the influence of $\beta_i$ on the cosine similarity. Figure S10(g) shows that larger $\beta_i$ leads to lower similarity of states. Figure S10(h) shows that a lower $\beta_i = 0.01$ also leads to a decrease in similarity, which may be caused by insufficient precision of data storage (float32 by default). Therefore, we repeat the experiments with datatype float64. Figure S10(k,l) shows that the similarity of the gradient signal remains around 1 with $\beta_i = 0.1$. This indicates that weak feedback does indeed lead to an exponential decline in gradient signals, thus requiring higher relative accuracy.
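The comparison described above can be sketched on a toy scale. The code below is an illustration under stated assumptions, not the authors' experiment: it uses a 2-hidden-layer network following Equation S11 with tanh activations, symmetric feedback ($B_i = W_i^\top$), a prediction error fixed at its free-phase value, and Python's default double precision. The names `ep_state_changes`, `bp_deltas`, and `similarity`, and the network sizes, are hypothetical.

```python
import math
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def rho(h):
    return [math.tanh(x) for x in h]

def rho_prime(h):
    return [1.0 - math.tanh(x) ** 2 for x in h]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ep_state_changes(W0, W1, Wf, s0, s_tar, beta_1, beta_f, T=50, K=20):
    """Run the free and nudged phases of Equation S11; return (ds1, ds2)."""
    B1, Bf = transpose(W1), transpose(Wf)  # assumption: symmetric feedback
    s1, s2 = [0.0] * len(W0), [0.0] * len(W1)

    def step(s1, s2, nudge):
        # Synchronous update: both layers read the previous states.
        h1 = [a + beta_1 * b for a, b in zip(matvec(W0, s0), matvec(B1, s2))]
        h2 = [a + n for a, n in zip(matvec(W1, s1), nudge)]
        return rho(h1), rho(h2)

    no_nudge = [0.0] * len(W1)
    for _ in range(T):  # free phase (beta_f = 0)
        s1, s2 = step(s1, s2, no_nudge)
    s1_free, s2_free = s1, s2
    e_p = [t - p for t, p in zip(s_tar, matvec(Wf, s2_free))]
    nudge = [beta_f * b for b in matvec(Bf, e_p)]
    for _ in range(K):  # nudged phase, starting from the free-phase state
        s1, s2 = step(s1, s2, nudge)
    ds1 = [a - b for a, b in zip(s1, s1_free)]
    ds2 = [a - b for a, b in zip(s2, s2_free)]
    return ds1, ds2

def bp_deltas(W0, W1, Wf, s0, s_tar):
    """Negative squared-error gradients w.r.t. h1 and h2 in the feedforward net."""
    h1 = matvec(W0, s0)
    h2 = matvec(W1, rho(h1))
    e_p = [t - p for t, p in zip(s_tar, matvec(Wf, rho(h2)))]
    d2 = [g * b for g, b in zip(rho_prime(h2), matvec(transpose(Wf), e_p))]
    d1 = [g * b for g, b in zip(rho_prime(h1), matvec(transpose(W1), d2))]
    return d1, d2

def similarity(beta_1, beta_f=0.01, seed=0):
    """Cosine similarity between EP state changes and BP deltas, per layer."""
    rng = random.Random(seed)
    rand = lambda r, c: [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
    W0, W1, Wf = rand(4, 3), rand(4, 4), rand(2, 4)
    s0 = [rng.uniform(-1, 1) for _ in range(3)]
    s_tar = [1.0, 0.0]
    ds1, ds2 = ep_state_changes(W0, W1, Wf, s0, s_tar, beta_1, beta_f)
    d1, d2 = bp_deltas(W0, W1, Wf, s0, s_tar)
    return cosine_similarity(ds1, d1), cosine_similarity(ds2, d2)

if __name__ == "__main__":
    for beta_1 in (0.01, 0.1, 1.0):
        print(beta_1, similarity(beta_1))
```

Because cosine similarity ignores scale, the factor $\beta_1 \cdot \beta_f$ in Equation S15 does not need to be divided out before comparing the two signals.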
+
+[Figure S10: twelve panels of cosine similarity versus layer (fc1 to fc8, out). Columns: states, bias gradients, weight gradients. Panels (a-c) and (d-f): curves for T=500,K=200; T=50,K=20; T=20,K=8. Panels (g-i) and (j-l): curves for βi = 0.0, 0.01, 0.1, 0.5, 1.]
+
+Figure S10: The cosine similarity of gradients and states between our model and a feedforward model trained by BP in an 8-hidden-layer FNN (states: $s_i^0$; bias gradients: $ds_i$; weight gradients: $\Delta W_i$). The x axis is the layers of the model. Error propagates from the last layer "out" to the first hidden layer "fc1" layer by layer. (a-c), with different numbers of iterations under feedback scaling βi = 1. (d-f), with different numbers of iterations under small feedback scaling βi = 0.1. (g-i), with different feedback scaling (T=50,K=20).
(j-l), repeating (g-i) with datatype float64 (float32 by default).
+
+C.5 VERIFYING THE EFFECTIVENESS OF WEAK FEEDBACK IN EQUILIBRIUM PROPAGATION
+
+[Figure S11: test accuracy (axis range 0.80 to 1.00) versus the number of last layers that learn, nll = 1, 3, 5, with curves for βi = 0.01 and βi = 0.1.]
+
+Figure S11: The testing accuracy on MNIST with different βi varies with nll. The experiments are repeated 5 times.
+
+To demonstrate that the lower few layers of our model are indeed receiving meaningful credit signals, we report the test accuracy obtained when only the last nll layers are updated (i.e., freezing the weights of Layers 1−nll) in Figure S11. For a 5-hidden-layer model with βi = 0.1, updating only the final layer yields a test accuracy of about 85%. As nll increases to 5, the accuracy reaches around 97.5%. A similar trend is observed for the model with βi = 0.01. These results show that achieving over 97% accuracy requires effective gradient propagation to all layers, confirming that our model successfully delivers usable credit signals throughout the entire network.
+
+C.6 ROBUSTNESS TO THE NOISE
+
+[Figure S12: heatmaps of maximum test accuracy. Rows: weight noise intensity. Columns: state noise intensity.]
+(a) 2HL
+Weight noise | State 0 | State 1e-05 | State 0.001 | State 0.1
+0            | 0.972   | 0.972       | 0.972       | 0.904
+0.001        | 0.972   | 0.972       | 0.972       | 0.896
+0.01         | 0.917   | 0.917       | 0.912       | 0.689
+0.1          | 0.194   | 0.206       | 0.196       | 0.217
+(b) 3HL
+Weight noise | State 0 | State 1e-05 | State 0.001 | State 0.1
+0            | 0.974   | 0.974       | 0.967       | 0.752
+0.001        | 0.973   | 0.975       | 0.964       | 0.749
+0.01         | 0.903   | 0.904       | 0.838       | 0.485
+0.1          | 0.167   | 0.180       | 0.174       | 0.167
+(c) 5HL
+Weight noise | State 0 | State 1e-05 | State 0.001 | State 0.1
+0            | 0.976   | 0.969       | 0.814       | 0.469
+0.001        | 0.973   | 0.965       | 0.812       | 0.464
+0.01         | 0.876   | 0.815       | 0.520       | 0.299
+0.1          | 0.134   | 0.135       | 0.134       | 0.135
+
+Figure S12: The maximum test accuracy of the model with different noise intensities on weights and states, added both in training and test.
(a) With 2 hidden layers; (b) With 3 hidden layers; (c) With 5 hidden layers. The model is trained for 50 epochs and the experiments are repeated 5 times.
+
+To evaluate the robustness of the model, we introduce noise on weights and time-varying noise on states. Both are random Gaussian noise, imposed at each weight update and at each state update, respectively. The noise on weights is added directly to the weights, while the noise on states is added as the bias b in Equation 3.1. The mean absolute values of non-zero weights and neural activations after noiseless training are approximately 0.09 and 0.76, respectively.
+The accuracy of the model with two hidden layers under both types of noise is presented in Figure S12(a). It maintains satisfactory performance when the standard deviation of state noise reaches 0.1 or the standard deviation of weight noise reaches 0.01. In deeper structures (Figure S12(b,c)), the results for weight noise are consistent with these observations, demonstrating excellent robustness. However, the tolerance to time-varying state noise degrades significantly, which we attribute to layer-wise noise accumulation and the distortion of the weak gradient signal by noise during training.
To confirm our hypothesis, we impose the noises only in the test, and the test accuracy almost remains unaffected (Figure S13). Therefore, the network is potentially resilient to noise. However, how to improve resilience in the training process requires further study.
+
+[Figure S13: heatmaps of maximum test accuracy. Rows: weight noise intensity. Columns: state noise intensity, with tick labels as printed in the source.]
+(a) 2HL
+Weight noise | State 0 | State 0.001 | State 0.001 | State 0.1 | State 0.5
+0            | 0.973   | 0.973       | 0.973       | 0.971     | 0.944
+0.001        | 0.971   | 0.971       | 0.974       | 0.970     | 0.945
+0.01         | 0.919   | 0.916       | 0.919       | 0.918     | 0.912
+0.1          | 0.199   | 0.187       | 0.194       | 0.219     | 0.177
+(b) 3HL
+Weight noise | State 0 | State 0.001 | State 0.001 | State 0.1 | State 0.5
+0            | 0.974   | 0.976       | 0.975       | 0.974     | 0.940
+0.001        | 0.975   | 0.973       | 0.973       | 0.972     | 0.945
+0.01         | 0.908   | 0.903       | 0.908       | 0.902     | 0.894
+0.1          | 0.176   | 0.182       | 0.152       | 0.167     | 0.142
+(c) 5HL
+Weight noise | State 0 | State 0.001 | State 0.001 | State 0.1 | State 0.5
+0            | 0.975   | 0.977       | 0.977       | 0.977     | 0.919
+0.001        | 0.973   | 0.973       | 0.973       | 0.973     | 0.928
+0.01         | 0.885   | 0.876       | 0.880       | 0.873     | 0.842
+0.1          | 0.133   | 0.128       | 0.143       | 0.163     | 0.142
+
+Figure S13: The maximum test accuracy of the model with different noise intensities on weights and states (the state noise is added only in test). (a) With 2 hidden layers; (b) With 3 hidden layers; (c) With 5 hidden layers. The model is trained for 50 epochs in a single experiment.
+
+D TRAINING DETAILS
+
+Table S2 provides the parameters of the Adam optimizer that are used in Tables S1–S2 (Kingma & Ba, 2015). The training details for Table 1 are given in Table S3. For convolutional architectures in EP, the training process can be described by Algorithm S2. The training sample is fed into the network through Conv0. Then the state of the first layer goes through max pooling MaxPool1 and convolution Conv1 sequentially to reach the second layer.
The second layer also feeds its states back to the first layer through transposed convolution ConvT1 and max-unpooling MaxUnpool1. After T iterations, the RNN converges to the steady states and produces outputs through MaxPool2 and a fully connected layer. Then the prediction error is computed and used to nudge the RNN through the reverse of the fully connected layer and max-unpooling MaxUnpool2. Note that the unpooling MaxUnpooli requires the indices from the corresponding pooling MaxPooli.
+For Table S2, Adam optimizer is used for all experiments. The activation functions sigmoid-s and hard-sigmoid are defined as $\rho(x) = \frac{1}{1 + e^{-4(x - 0.5)}}$ and $\rho(x) = \max(\min(x, 1), 0)$, respectively (Ernoult et al., 2019). For 5-HL, 10-HL and 20-HL architectures, the Adam optimizer parameters are as shown in Table S2 (epoch: 50, batch size: 500). The inference details of the architecture shown in Figure 3b are described by Algorithm S3. The details for convolutional architectures are given in Table S3 and Figures S14–S15. The cosine-annealing scheduler is used in convolutional architectures for CIFAR-10 ($T_{max} = 50$, $\eta_{min} = 10^{-6}$).
+For MNIST, no pre-processing is used. For the CIFAR-10 dataset, we follow Scellier et al. (2023) to pre-process the images. We normalize the input images using mean µ = (0.4914, 0.4822, 0.4465) and standard deviation σ = 3 × (0.2023, 0.1944, 0.2010).
+The results for the comparison of time consumption were obtained in a virtualized Windows 11 environment with an Intel Xeon Gold 6238R CPU, 16 GB RAM, and an Nvidia RTX A5000 (24 GB VRAM). Other results were obtained in a Windows 11 environment with an Intel Core i5-12490F, 32 GB RAM, and an Nvidia GTX 1650 (4 GB VRAM), or in a Windows 11 environment with an AMD R7-7700, 32 GB RAM, and an Nvidia RTX 4070 (12 GB VRAM). The default numerical precision is float32 (single-precision float).
+
+Table S2: The parameters of the Adam optimizer.
+Parameter name                                   | Default value
+Learning rate (MNIST / CIFAR-10)                 | 10^-3 / 2 × 10^-4
+First-order moment estimation decay rate (β1)    | 0.9
+Second-order moment estimation decay rate (β2)   | 0.999
+Small constant for numerical stability (ϵ)       | 10^-8
+
+Table S3: Training details for Table 1 and Table 2. The results of EB-EP and P-EP come from previous work (Ernoult et al., 2019). SGD refers to Stochastic Gradient Descent with mini-batches.
+Architecture            | Training approach       | Optimizer | Epoch/Batch size - T/K | Learning rate            | Weight decay
+2HL                     | P-EP (sigmoid-s)        | SGD       | 50/20-100/20           | [0.005, 0.05, 0.2]       | None
+2HL                     | Proposed (tanh, Adam)   | Adam      | 50/500-10/10           | [0.001, 0.001, 0.001]    | None
+3HL                     | P-EP (sigmoid-s)        | SGD       | 100/20-180/20          | [0.002, 0.01, 0.05, 0.2] | None
+3HL                     | Proposed-DLR (tanh)     | SGD       | 100/20-18/10           | [0.002, 0.01, 0.05, 0.2] | None
+3HL                     | Proposed (tanh)         | SGD       | 100/20-18/10           | [0.1, 0.1, 0.1, 0.1]     | None
+3HL                     | Proposed (tanh, Adam)   | Adam      | 50/500-18/10           | 10^-3                    | None
+3HL                     | BP (tanh, Adam)         | Adam      | 50/500-1/1             | 10^-3                    | None
+Conv (Table 1)          | P-EP (hard-sigmoid)     | SGD       | 40/20-200/10           | [0.015, 0.035, 0.15]     | None
+Conv (Table 1)          | Proposed (hard-sigmoid) | SGD       | 40/128-20/10           | [0.15, 0.35, 0.9]        | 10^-5
+Conv (Table 1)          | BP (hard-sigmoid)       | SGD       | 40/128-1/1             | [0.001, 0.02, 0.4]       | 10^-5
+Conv (Table 2 MNIST)    | Proposed (hard-sigmoid) | Adam      | 40/128-20/10           | 2 × 10^-4                | 10^-6
+Conv (Table 2 MNIST)    | BP (hard-sigmoid)       | Adam      | 40/128-1/1             | 2 × 10^-4                | 10^-6
+Conv (Table 2 CIFAR-10) | Proposed (hard-sigmoid) | Adam      | 50/128-40/10           | 2.5 × 10^-4              | 2 × 10^-4
+Conv (Table 2 CIFAR-10) | BP (hard-sigmoid)       | Adam      | 50/128-1/1             | 2.5 × 10^-4              | 2 × 10^-4
+
+[Figure S14: 1@28x28 → Conv1 → 32@24x24 → MaxPool1 → 32@12x12 → Conv2 → 64@8x8 → MaxPool2 → 64@4x4 → Dense → 1x10]
+
+Figure S14: Convolutional architectures for MNIST.
+
+[Figure S15: 3@32x32 → Conv1 → 32@32x32 → MaxPool1 → 32@16x16 → Conv2 → 64@16x16 → MaxPool2 → 64@8x8 → Conv3 → 128@8x8 → MaxPool3 → 128@4x4 → 1x10]
+
+Figure S15: Convolutional architectures for CIFAR-10.
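The feature-map sizes in Figures S14 and S15 can be reproduced with the standard output-size formulas for convolution and pooling. The sketch below is illustrative only: the kernel sizes and paddings (5x5 without padding for MNIST, 3x3 with padding 1 for CIFAR-10) are assumptions chosen because they reproduce the sizes in the figures, and the source does not state them. The helper names are hypothetical.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2):
    """Spatial output size of non-overlapping max pooling."""
    return size // window

def trace_shapes(in_channels, in_size, conv_layers):
    """Return (channels, size) after the input, each convolution, and each pooling,
    plus the number of features seen by the final fully connected layer."""
    shapes = [(in_channels, in_size)]
    size = in_size
    for out_channels, kernel, padding in conv_layers:
        size = conv_out(size, kernel, padding)
        shapes.append((out_channels, size))
        size = pool_out(size)
        shapes.append((out_channels, size))
    channels, size = shapes[-1]
    return shapes, channels * size * size

# Assumed (out_channels, kernel, padding) per convolution; not given in the source.
MNIST_LAYERS = [(32, 5, 0), (64, 5, 0)]
CIFAR10_LAYERS = [(32, 3, 1), (64, 3, 1), (128, 3, 1)]

if __name__ == "__main__":
    print(trace_shapes(1, 28, MNIST_LAYERS))
    print(trace_shapes(3, 32, CIFAR10_LAYERS))
```

Under these assumptions the dense layer receives 64 × 4 × 4 features for MNIST and 128 × 4 × 4 features for CIFAR-10 before mapping to the 10 outputs.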
+
+Algorithm S2 Two phases in EP training process for convolution architecture
+Input: Sample-label pairs (x, s_tar)
+Parameter: θ = [W_0, W_1, W_f, B_f, B_1, α_1, β_1, β_f1]
+Output: θ
+function FIRST-PHASE(θ, s_tar)
+    s_0 ← x
+    for t ← 1 to T do
+        h_1 ← Conv_0(s_0) + β_1 · MaxUnpool_1(ConvT_1(s_2^0))
+        h_2 ← Conv_1(MaxPool_1(s_1^0))
+        h_p ← W_f · Flatten(MaxPool_2(s_2^0))
+        s_1^0, s_2^0, s_p^0 ← ρ(h_1), ρ(h_2), SoftMax(h_p)
+    end for
+    Λ_1 ← [s_i^0], i = 0, 1, 2, p
+    return Λ_1
+end function
+function SECOND-PHASE(θ, Λ_1, s_tar)
+    s_0, s_1^{β_f1}, s_2^{β_f1}, s_p^{β_f1} ← x, s_1^0, s_2^0, s_p^0
+    for t ← 1 to K do
+        e_p ← s_tar − s_p^{β_f1}
+        h_1 ← Conv_0(s_0) + β_1 · MaxUnpool_1(ConvT_1(s_2^{β_f1}))
+        h_2 ← Conv_1(MaxPool_1(s_1^{β_f1})) + β_f · MaxUnpool_2(Unflatten(W_f^⊤ e_p))
+        h_p ← W_f · Flatten(MaxPool_2(s_2^{β_f1}))
+        s_1^{β_f1}, s_2^{β_f1}, s_p^{β_f1} ← ρ(h_1), ρ(h_2), SoftMax(h_p)
+    end for
+end function
+
+Algorithm S3 EP with feedback scaling and residual connections (Figure 3b)
+Input: (x, s_tar)
+Parameter: θ = [W_0, W_i, W_f, B_f, B_i, β_i, β_f1]
+Output: θ
+function ITERATION(θ, Λ_1, s_tar)
+    for t ← 1 to K do
+        if Nudging Phase then
+            β_f ← β_f1
+            s_0, s_1^{β_f1}, s_2^{β_f1}, s_p^{β_f1} ← x, s_1^0, s_2^0, s_p^0
+        else
+            β_f ← 0
+            s_0 ← x
+        end if
+        h_1 ← W_0 s_0 + β_1 B_1 s_2^{β_f} + β_{4,1} B_{4,1} s_4^{β_f}
+        h_2 ← W_1 s_1^{β_f} + β_2 B_2 s_3^{β_f}
+        h_3 ← W_2 s_2^{β_f} + β_3 B_3 s_4^{β_f}
+        h_4 ← W_3 s_3^{β_f} + β_4 B_4 s_5^{β_f} + W_{1,4} s_1^{β_f} + β_{7,4} B_{7,4} s_7^{β_f}
+        h_5 ← W_4 s_4^{β_f} + β_5 B_5 s_6^{β_f}
+        h_6 ← W_5 s_5^{β_f} + β_6 B_6 s_7^{β_f}
+        h_7 ← W_6 s_6^{β_f} + β_7 B_7 s_8^{β_f} + W_{4,7} s_4^{β_f} + β_{10,7} B_{10,7} s_10^{β_f}
+        h_8 ← W_7 s_7^{β_f} + β_8 B_8 s_9^{β_f}
+        h_9 ← W_8 s_8^{β_f} + β_9 B_9 s_10^{β_f}
+        h_10 ← W_9 s_9^{β_f} + β_f B_f e_p^{β_f} + W_{7,10} s_7^{β_f}
+        h_p ← W_f s_10^{β_f}
+        s_i^{β_f} ← ρ(h_i), i = 0, 1, 2, . . . , 10
+        s_p^{β_f} ← SoftMax(h_p)
+    end for
+end function
+
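The connection pattern of Algorithm S3 (a feedforward chain, scaled feedback from the next layer, forward skips 1→4, 4→7, 7→10, and scaled feedback skips 4→1, 7→4, 10→7) can be expressed compactly with lookup tables. The sketch below covers only the free phase (no nudging) and rests on several assumptions that are not in the source: equal layer widths, tanh activation, a single shared feedback scale `beta`, feedback weights tied to the transposed forward weights, and synchronous updates. All names are hypothetical.

```python
import math
import random

N_LAYERS = 10
FORWARD_SKIPS = {4: 1, 7: 4, 10: 7}   # target <- source: W_{1,4}, W_{4,7}, W_{7,10}
FEEDBACK_SKIPS = {1: 4, 4: 7, 7: 10}  # target <- source: B_{4,1}, B_{7,4}, B_{10,7}

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def add_scaled(acc, scale, v):
    return [a + scale * x for a, x in zip(acc, v)]

def make_params(width, n_in, seed=0):
    rng = random.Random(seed)
    rand = lambda r, c: [[rng.uniform(-0.3, 0.3) for _ in range(c)] for _ in range(r)]
    W = {0: rand(width, n_in)}                                      # W[i]: layer i -> i+1
    W.update({i: rand(width, width) for i in range(1, N_LAYERS)})
    W_skip = {dst: rand(width, width) for dst in FORWARD_SKIPS}     # keyed by target layer
    return W, W_skip

def step(W, W_skip, s, beta):
    """One synchronous update of all hidden layers (free phase, no nudging)."""
    new_s = {0: s[0]}
    for i in range(1, N_LAYERS + 1):
        h = matvec(W[i - 1], s[i - 1])                               # feedforward drive
        if i < N_LAYERS:                                             # B_i assumed equal to W_i^T
            h = add_scaled(h, beta, matvec(transpose(W[i]), s[i + 1]))
        if i in FORWARD_SKIPS:                                       # residual drive
            h = add_scaled(h, 1.0, matvec(W_skip[i], s[FORWARD_SKIPS[i]]))
        if i in FEEDBACK_SKIPS:                                      # skip feedback, tied weights
            src = FEEDBACK_SKIPS[i]
            h = add_scaled(h, beta, matvec(transpose(W_skip[src]), s[src]))
        new_s[i] = [math.tanh(x) for x in h]
    return new_s

def relax(W, W_skip, x, beta, steps):
    """Iterate the dynamics from zero hidden states."""
    width = len(W[0])
    s = {0: x}
    s.update({i: [0.0] * width for i in range(1, N_LAYERS + 1)})
    for _ in range(steps):
        s = step(W, W_skip, s, beta)
    return s

def feedforward(W, W_skip, x):
    """Single pass through the same network with all feedback removed."""
    s = {0: x}
    for i in range(1, N_LAYERS + 1):
        h = matvec(W[i - 1], s[i - 1])
        if i in FORWARD_SKIPS:
            h = add_scaled(h, 1.0, matvec(W_skip[i], s[FORWARD_SKIPS[i]]))
        s[i] = [math.tanh(v) for v in h]
    return s
```

With `beta = 0` the relaxed states coincide with a plain residual feedforward pass after `N_LAYERS` steps, which mirrors the earlier observation that weak feedback makes the RNN behave like a feedforward network.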
\ No newline at end of file diff --git a/refs/hw_groups_claims.json b/refs/hw_groups_claims.json new file mode 100644 index 0000000..3a3bffc --- /dev/null +++ b/refs/hw_groups_claims.json @@ -0,0 +1,103 @@ +[ + { + "claim": "Naresh Shanbhag's UIUC DIMA line produced a fabricated 65nm SRAM-CIM test-chip that is the closest named-group match for an EP demo because it does BOTH analog/mixed-signal in-memory MVM AND genuine on-chip in-situ weight update. The chip (Gonugondla/Kang/Shanbhag, JSSC 2018, 'A Variation-Tolerant In-Memory ML Classifier via On-Chip Training') implements a Deep In-Memory Architecture SVM on a standard 16kB 6T SRAM array, computing the dot product W^T X via 'functional read' (PWAM word-line pulses generate a bit-line discharge proportional to a weighted sum), column-pitch-matched bit-line processors doing signed multiplication, and charge-sharing aggregation. It has a dedicated on-chip digital trainer that computes SGD gradients from on-chip inference and writes updated weights back into the SRAM array each batch (converging from random init to within 1% of floating-point in ~400 batches). In-situ training yields a concrete benefit: it lets the chip run at 38% lower bit-line swing (320mV vs 520mV) at iso-accuracy, a 2.4x IM-CORE energy reduction vs an off-chip-trained DIMA.", + "confidence": "high", + "sources": [ + "http://shanbhag.ece.illinois.edu/publications/sujan-jssc-2018.pdf" + ], + "evidence": "Four unanimous 3-0 claims [0,1,2,3] all grounded in the primary JSSC 2018 paper (text extracted directly via pdftotext by verifiers). Abstract: 'a robust deep in-memory machine learning classifier with a stochastic gradient descent (SGD)-based on-chip trainer using a standard 16 kB 6T SRAM array... 65 nm CMOS prototype IC... on-chip trainable support vector machine.' Architecture has 'IM-CORE' (inference) + 'digital trainer' that 'writes the updated weights once per batch into the BCA'. 
Datapath: functional read -> BLP signed multiply -> CBLP charge-sharing -> W^T X. Energy: '38% lower VBL,max = 320 mV... reduction in IM-CORE energy by 2.4x at iso-accuracy.' This is the ONLY named-group chip combining analog MVM + on-chip gradient-based weight write-back.", + "vote": "3-0 (all four constituent claims)" + }, + { + "claim": "Caveat on Shanbhag's match: the on-chip-trainable DIMA chip is a single-layer SVM binary/linear classifier with batch-mode SGD, not a deep multilayer or equilibrium/recurrent network. It demonstrates the two EP-critical mechanisms (analog in-memory MVM + per-batch on-chip weight write-back) but lacks a native settling/feedback (relaxation) loop and multilayer credit assignment; an EP equilibrium-transformer demo would need to add the relaxation dynamics and local two-phase update rule on top of this substrate.", + "confidence": "high", + "sources": [ + "http://shanbhag.ece.illinois.edu/publications/sujan-jssc-2018.pdf" + ], + "evidence": "Verifier notes on claims [1] and [3] explicitly flag: 'it is a single-layer SVM with batch-mode SGD, not a deep multilayer trainer' and 'the demo is an SVM classifier on 6T SRAM-CIM, not an NN/transformer.' 
The chip proves on-chip SGD gradient computation + per-batch weight write, which is the load-bearing capability for EP-style local updates, but not the equilibrium/settling dynamics themselves.", + "vote": "3-0" + }, + { + "claim": "Shanbhag's group also produced the C3SRAM macro (with ASU; Jiang/Yin/Seo/Seok, ESSCIRC 2019 + JSSC 2020), a fabricated 65nm SRAM in-memory-computing macro using analog-mixed-signal capacitive-coupling computing to do fully parallel XNOR-and-accumulate vector-matrix multiplication in a single cycle with one ADC per column \u2014 BUT it is INFERENCE-ONLY for binary (1-bit) weights/activations, with no on-chip training or in-situ weight update.", + "confidence": "high", + "sources": [ + "https://www.researchgate.net/publication/336711317_C3SRAM_In-Memory-Computing_SRAM_Macro_Based_on_Capacitive-Coupling_Computing" + ], + "evidence": "Two unanimous 3-0 claims [4,5]. JSSC 2020 abstract: 'analog-mixed-signal (AMS) capacitive-coupling computing... binary-multiply-and-accumulate... one ADC per column... fully parallel vector-matrix multiplication in a single cycle... prototyped in 65-nm CMOS' (671.5 TOPS/W, 98.3% MNIST, 85.5% CIFAR-10). Targeted WebFetch of the paper confirmed 'a purely inference-accelerator system, not an on-chip training platform... does not support on-chip training, weight gradient computation, or in-situ learning.' A search snippet suggesting in-situ updates was confirmed to conflate a DIFFERENT (Meng-Fan Chang) macro. So C3SRAM provides analog MVM but fails the EP in-situ-update requirement.", + "vote": "3-0 (both constituent claims)" + }, + { + "claim": "Yingyan Celine Lin is a UIUC ECE PhD (2017), was at Rice 2017-2022, and is now Associate Professor in the School of Computer Science at Georgia Tech \u2014 i.e. 'Yingyan Lin/Zhu at UIUC' refers to a UIUC-trained researcher now at Georgia Tech, NOT current UIUC faculty, and the surname is Lin, not Zhu. 
Her group does digital-accelerator/hardware-aware co-design and Digital-CIM (DCIM), with NO evidence of fabricated analog/mixed-signal, SRAM-analog-CIM, RRAM/memristor, or in-situ-trainable silicon; her venues are ML/architecture/EDA (MICRO/ISCA/HPCA/DAC/ICLR), not analog/circuits (ISSCC/VLSI/JSSC/Nature Electronics). She is therefore a poor fit to host the analog-equilibrium hardware an EP demo needs.", + "confidence": "high", + "sources": [ + "https://www.cc.gatech.edu/people/yingyan-celine-lin" + ], + "evidence": "Two unanimous 3-0 claims [6,7]. GT page: PhD 'in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign in 2017... Assistant Professor at Rice University from 2017 to 2022... currently Associate Professor in the School of Computer Science.' dblp (pid 120/6981) confirms ZERO ISSCC/JSSC/VLSI/Nature-Electronics publications. Closest hardware works are DIGITAL: 'Fusion-3D' (MICRO 2024, taped-out 28nm DIGITAL NeRF accelerator) and '3DGauCIM' (explicitly Digital-CIM). No fabricated analog/RRAM/in-situ-trainable silicon attributed to her EIC lab.", + "vote": "3-0 (both constituent claims)" + }, + { + "claim": "USTC's Tao Chen (\u9648\u6d9b) is a Special Professor (\u7279\u4efb\u6559\u6388) in the School of Microelectronics at USTC doing device/materials-level 'unconventional information processing' / in-materio reservoir computing \u2014 disordered dopant-atom networks in silicon (lead author of Nature 2020 'Classification with a disordered dopant-atom network in silicon', done at U. Twente), sulfonated-polyaniline reservoir networks, graphene devices. He does NOT build SRAM-CIM, RRAM arrays, ADC/DAC, or switched-capacitor circuit test-chips, and does NOT do EP hardware. 
His substrate is device-physics, not circuit-level CIM, so it is not a match for hosting an EP equilibrium-transformer demo.", + "confidence": "high", + "sources": [ + "https://faculty.ustc.edu.cn/tchen/en/index.htm" + ], + "evidence": "Two unanimous 3-0 claims [12,13]. Faculty page: 'Special Professor (\u7279\u4efb\u6559\u6388)... School of Microelectronics... exploring the essence of intelligence and developing unconventional information processing technology' via 'multi-degree-of-freedom physical systems.' Flagship: Nature 577, 341-345 (2020) dopant-network classifier (nonlinear electron-hopping as physical reservoir). Targeted searches for RRAM/SRAM-CIM/ADC-DAC/switched-cap test-chips by this Tao Chen returned nothing; the 'graphene memristor' hits belong to other authors. Confirms the user's premise that he does hardware but not EP/CIM circuit hardware.", + "vote": "3-0 (both constituent claims)" + }, + { + "claim": "The Stanford targets Wong and Raina (with UCSD's Gert Cauwenberghs, plus Tsinghua/Notre Dame/Pittsburgh) co-built the NeuRRAM chip (Wan et al., Nature 608:504-512, 2022): a fully-integrated RRAM/memristor compute-in-memory chip (48 cores, ~3 million RRAM cells + CMOS neurons) that performs analog in-memory matrix-vector multiplication \u2014 exactly the analog MVM primitive an EP equilibrium network needs. This establishes a fabricated, EP-relevant analog substrate from the named Stanford group, and a plausible collaboration target for a student.", + "confidence": "high", + "sources": [ + "https://www.nature.com/articles/s41586-022-04992-8", + "https://web.stanford.edu/~hspwong/" + ], + "evidence": "Four unanimous 3-0 claims [8,10,11] (plus the inference-only caveat in [9]). Nature paper: '3 million RRAM devices... monolithically integrated with CMOS... 48-core RRAM-CIM hardware... in-memory matrix-vector multiplication.' 
Author list spans Stanford (Wong, Raina), UCSD (Cauwenberghs), Tsinghua (Huaqiang Wu), Notre Dame (Joshi/Schaefer), Pittsburgh (Kubendran). Official Stanford EE page titled 'Priyanka Raina and H.S. Philip Wong's NeuRRAM chip.' 130nm foundry process, real silicon. This is the strongest analog-MVM substrate among the named groups.", + "vote": "3-0 (all constituent claims)" + }, + { + "claim": "Critical limitation: NeuRRAM is INFERENCE-ONLY and does NOT satisfy the EP in-situ local-weight-update requirement. Weights are trained off-chip in software and programmed onto the RRAM via incremental-pulse write-verify; the only on-hardware adaptation is 'chip-in-the-loop progressive fine-tuning' where the chip runs only the forward pass while backpropagation/weight updates run in software off-chip. So Stanford/Wong-Raina provide the analog MVM primitive but NOT native on-chip learning.", + "confidence": "high", + "sources": [ + "https://www.nature.com/articles/s41586-022-04992-8" + ], + "evidence": "Unanimous 3-0 claim [9]. Nature paper: 'chip-in-the-loop progressive model finetuning, which uses the chip to perform the forward-pass one layer at a time during the back-propagation fine-tuning.' Weights set via 'incremental-pulse write-verify technique' \u2014 no on-chip learning rule. Corroborated by IEEE Brain ('AI Inference' chip) and UCSD/Stanford press ('Future versions will be able to learn'). IMPORTANT for the synthesis: a claim that Wong's group has demonstrated on-chip/in-situ weight update (RRAM gain-cell, IEDM 2024) was REFUTED 0-3, and an ISSCC-2020 74-TOPS/W RRAM test-chip attribution was refuted 1-2 \u2014 so do not over-credit Stanford with in-situ training.", + "vote": "3-0" + }, + { + "claim": "Industry/analog in-memory vendor TetraMem (Qiangfei Xia / J. Joshua Yang, USC/UMass Amherst) ships a real fabricated memristive SoC, the MX100: 10 memristive computing cores each with a 248x256 1T1R RRAM/memristor crossbar in 65nm CMOS + RISC-V (Nature Electronics 2025). 
But it is INFERENCE-ONLY: networks are trained offline externally and the trained weights are mapped/programmed onto memristor conductances (closed-loop programming is write-verify of pre-computed weights, not gradient-based learning). It does NOT meet the EP local on-chip weight-update requirement. This also disambiguates the RRAM-SoC work as TetraMem/Xia/Yang, not USTC's Tao Chen.", + "confidence": "high", + "sources": [ + "https://www.nature.com/articles/s41928-025-01409-y" + ], + "evidence": "Two unanimous 3-0 claims [14,15]. Nature Electronics 2025 'Radiofrequency signal processing with a memristive system-on-a-chip': '10 memristive computing cores... 65-nm CMOS... 248x256 1T1R crossbar.' Co-author Yi Huang's first-party post confirms 10 cores + 1T1R + RISC-V. Inference-only: sibling MX100 paper (arXiv 2410.14882) states 'weights were trained using a quantization-aware training method and subsequently converted to int8... transferred to the NPUs.' Vendor positions it for 'energy-efficient AI inference.' Mar-2026 currency check: still framed as 'AI inference chips' \u2014 no on-chip-training announcement.", + "vote": "3-0 (both constituent claims)" + }, + { + "claim": "The in-situ on-chip-training capability EP needs currently exists primarily as research test-chips, not commercial product. A representative example: a fabricated 28nm CMOS Hybrid-Stochastic-Neuron (HSN) chip integrated with a 32x32 TiOx-based memristor VMM crossbar (Neurocomputing/Elsevier 2025, S0925231225031972, NRF Korea) performs FULLY IN-SITU training \u2014 the memristor VMM does forward propagation while the CMOS HSN chip runs backpropagation (gradient calculation AND weight updates) directly in hardware, host-free (demonstrated MNIST 94.13%, CIFAR-10 79.88%, 1.78 TOPS). 
This is an on-chip-trainable analog/mixed-signal system at small (32x32) RRAM test-chip scale \u2014 the architectural template closest to EP's in-situ-learning requirement, though it implements backprop, not EP.", + "confidence": "high", + "sources": [ + "https://www.sciencedirect.com/science/article/abs/pii/S0925231225031972" + ], + "evidence": "Two unanimous 3-0 claims [17,18]. Abstract: 'Fabricated using 28 nm CMOS technology, the HSN chip integrates with a 32x32 TiOx-based memristor vector matrix multiplication (VMM)... performing backpropagation computations, including gradient calculation and weight updates, directly in hardware to eliminate external dependencies.' 'In-situ training operations were successfully demonstrated.' Caveat: 'analog' is loose (mixed-signal CMOS + analog memristor VMM), and full-CNN figures are chip+simulation system-level, not fully silicon-measured end-to-end. Demonstrates that on-chip gradient + weight update on a memristor crossbar is achievable today at test-chip scale.", + "vote": "3-0 (both constituent claims)" + }, + { + "claim": "A practical path for a lab to build its OWN in-situ-trainable EP test-chip is foundry RRAM/MPW access: SkyWater's S130 (130nm volume-production CMOS) now offers a Weebit Nano ReRAM IP module (256Kb ReRAM array + control logic, decoders, IOs, ECC; fully JEDEC/AEC-Q100 qualified June 2023), explicitly positioned first for analog/mixed-signal applications, and SKY130 is the industry's first open-source foundry PDK (Apache 2.0, June 2020) based on a real commercially-manufactured node. This gives a research lab accessible embedded RRAM + an open PDK to tape out a custom analog-CIM/EP test-chip via MPW.", + "confidence": "medium", + "sources": [ + "https://www.skywatertechnology.com/weebit-nano-reram-ip-now-available-in-skywater-technologys-s130-process/", + "https://www.skywatertechnology.com/sky130-open-source-pdk/" + ], + "evidence": "Claims [19] (3-0), [21] (3-0), and [20] (2-1). 
SkyWater/Weebit press (Mar 2023): S130 module 'includes a 256Kb ReRAM array, control logic, decoders, IOs... and error correcting code (ECC)' for 'analog/mixed-signal, IoT, automotive, industrial, and medical.' SKY130 = 'the industry's first open-source foundry Process Design Kit... based on SkyWater's volume 130 nm CMOS technology' (corroborated by FOSSi Foundation, Hackaday, GSA). Confidence MEDIUM (not high) because the analog-positioning claim [20] split 2-1, and the key CAVEAT (flagged by verifier): foundry 'analog/mixed-signal' marketing means embedded digital 1T1R NVM ON analog SoCs, NOT that this IP itself performs analog in-memory MVM or supports the multi-level analog-conductance weight tuning EP would ultimately need. So it is a substrate to BUILD on, not turnkey EP hardware.", + "vote": "3-0 (S130 RRAM IP, SKY130 PDK); 2-1 (analog/mixed-signal positioning)" + }, + { + "claim": "There is no fabricated EP equilibrium-transformer hardware demo in the verified evidence; the only directly-EP-on-hardware work found is circuit-level SPICE simulation, not silicon. 'Memristor Crossbar Circuits Implementing Equilibrium Propagation for On-Device Learning' (Oh et al., Kookmin University, MDPI Micromachines 2023) is a Cadence Spectre / Ngspice / Verilog-A simulation study using a behavioral memristor model fit to other groups' device curves \u2014 there is no taped-out chip or physical crossbar. This frames the collaboration as net-new hardware work: the EP-specific two-phase local update has been simulated but not fabricated.", + "confidence": "high", + "sources": [ + "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10384638/" + ], + "evidence": "Unanimous 3-0 claim [16]. Paper states results are 'simulated and verified' using 'CADENCE SPECTRE', 'Ngspice and Python', 'CADENCE GPDK 45-nm CMOS parameters'; memristor is a Verilog-A behavioral model. No fabricated chip/tape-out/measured results. 
Additionally, a claim that this work computes free-phase and nudge-phase EP solutions simultaneously in analog circuits was REFUTED 0-3. So EP-on-hardware remains pre-silicon; an EP equilibrium-transformer demo would be a first-of-kind fabrication built on one of the above substrates.", + "vote": "3-0" + } +]
\ No newline at end of file diff --git a/refs/hw_research_claims.json b/refs/hw_research_claims.json new file mode 100644 index 0000000..12b6206 --- /dev/null +++ b/refs/hw_research_claims.json @@ -0,0 +1,134 @@ +[ + { + "claim": "The Mythic M1076 Analog Matrix Processor uses analog compute-in-memory to perform matrix multiplication (MVM) on-chip without external DRAM/memory, storing up to 80 million weights on a single chip.", + "source": "https://mythic.ai/products/m1076-analog-matrix-processor/", + "quote": "Capacity for up to 80M weights - able to run single or multiple complex DNNs entirely on-chip ... execute matrix multiplication operations without any external memory", + "vote": "3-0" + }, + { + "claim": "The M1076 is an inference-only accelerator: DNN models are trained off-device and then programmed into the chip for inference, with no described support for online/in-situ weight updates inside a training loop.", + "source": "https://mythic.ai/products/m1076-analog-matrix-processor/", + "quote": "DNN models...then programmed into the Mythic AMP for inference", + "vote": "3-0" + }, + { + "claim": "The IBM HERMES chip is a 64-core mixed-signal in-memory compute chip fabricated in 14 nm CMOS with backend-integrated phase-change memory (PCM), with each core holding a 256x256 analog weight matrix, for up to 4,194,304 (~4.2 million) weights stored on a single chip using over 16 million PCM devices (four PCM devices per unit cell).", + "source": "https://www.nature.com/articles/s41928-023-01010-1", + "quote": "up to 4,194,304 weights can be stored on the chip ... 64 AIMC cores interconnected via an on-chip communication network ... 
Four PCM devices per unit cell (two for each polarity)", + "vote": "3-0" + }, + { + "claim": "The chip is inference-only: weights are programmed once via offline hardware-aware training before deployment, and it does not support in-situ / online weight update inside a training loop (making it unsuitable as-is for Equilibrium Propagation's local in-situ weight updates).", + "source": "https://www.nature.com/articles/s41928-023-01010-1", + "quote": "Inference-only; supports offline training with \"hardware-aware training\" before deployment.", + "vote": "3-0" + }, + { + "claim": "A memristor-based compute-in-memory module performs vector-matrix multiplication (VMM) in situ and in parallel \u2014 directly relevant to the analog MVM substrate sought for the EP relaxation feedback path.", + "source": "https://arxiv.org/abs/2305.14547", + "quote": "Memristor-based compute-in-memory (CIM) modules can perform vector-matrix multiplication (VMM) in situ and in parallel, and have shown great promises in DNN inference applications.", + "vote": "3-0" + }, + { + "claim": "The work experimentally implements an on-chip mixed-precision TRAINING scheme (not inference-only) on a bulk-switching memristor CIM module, demonstrating in-situ weight update during training rather than one-time-programmed fixed weights.", + "source": "https://arxiv.org/abs/2305.14547", + "quote": "In this work, we experimentally implement a mixed-precision training scheme to mitigate these effects using a bulk-switching memristor CIM module.", + "vote": "3-0" + }, + { + "claim": "Weight updates are accumulated in digital units at high precision and the analog memristor devices are only physically programmed when the accumulated update exceeds a threshold \u2014 a concrete mechanism for fast online weight update that limits write frequency/endurance stress, exactly the kind of hybrid-update scheme an EP training loop needs.", + "source": "https://arxiv.org/abs/2305.14547", + "quote": "Lowprecision CIM modules 
are used to accelerate the expensive VMM operations, with high precision weight updates accumulated in digital units. Memristor devices are only changed when the accumulated weight update value exceeds a pre-defined threshold.", + "vote": "3-0" + }, + { + "claim": "Most memristor/ReRAM in-memory ML accelerators are designed for inference-only workloads that deliberately avoid frequent weight updates, because of high write energy and limited write endurance \u2014 meaning in-situ/online training (required for EP) is not the dominant design target.", + "source": "https://arxiv.org/html/2501.12644v1", + "quote": "Most memristor-based machine learning accelerators target workloads that avoid frequent memristor updates because of high write energy and limited endurance.", + "vote": "3-0" + }, + { + "claim": "Attention-based models like Transformers require frequent updates of K, V, and Q matrices, unlike standard inference where weights are stationary \u2014 a property that conflicts with memristor crossbars' weakness at frequent reprogramming and is flagged as an open hardware challenge.", + "source": "https://arxiv.org/html/2501.12644v1", + "quote": "Unlike conventional neural network inference, where all weights are stationary, models with attention mechanisms, such as Transformers, require frequent updates of K, V, and Q matrices and computation with them.", + "vote": "3-0" + }, + { + "claim": "In-situ (on-chip) training of MLP, CNN, LSTM, and reinforcement-learning models has been experimentally demonstrated on memristor crossbar platforms, establishing that local in-array weight updates during a learning loop are physically achievable on memristive hardware.", + "source": "https://arxiv.org/html/2501.12644v1", + "quote": "Soon after, various ML algorithms have been experimentally implemented on the same platform, including in-situ training of multilayer perceptron (MLP), convolutional neural network (CNN), long short-term memory (LSTM) and reinforcement 
learning(RL).", + "vote": "3-0" + }, + { + "claim": "Memristive crossbar weight reprogramming is constrained by limited non-volatile memory (NVM) endurance, which restricts the number of times memristors can be rewritten \u2014 a fundamental obstacle for in-situ training loops that require repeated weight updates.", + "source": "https://arxiv.org/abs/2410.21730", + "quote": "Our idea addresses the limited non-volatile memory endurance, which restrict the number of times they can be reprogrammed.", + "vote": "2-1" + }, + { + "claim": "This work deliberately avoids on-chip/in-situ training for MRAM in-memory computing, instead programming weights once via off-chip adaptive quantization and keeping them fixed during inference \u2014 i.e., it is an inference-only approach, not online in-situ weight update.", + "source": "https://www.science.org/doi/10.1126/sciadv.adp3710", + "quote": "off-chip adaptive training method to adjust deep neural network parameters, achieving accurate AiMC inference", + "vote": "3-0" + }, + { + "claim": "The authors state that in-situ/on-chip training, while optimal for accuracy, increases energy consumption and degrades device programming lifespan \u2014 the motivating reason analog in-memory substrates avoid online weight updates (directly relevant to EP's requirement for repeated in-situ updates).", + "source": "https://www.science.org/doi/10.1126/sciadv.adp3710", + "quote": "in situ training achieves optimal accuracy, the on-chip update cycles increase energy consumption and reduce memristor programming life span", + "vote": "3-0" + }, + { + "claim": "RRAM/memristor devices suffer non-idealities that directly constrain in-situ trainability: high variability, asymmetric and nonlinear weight update, limited endurance, retention loss, and stuck-at-faults.", + "source": "https://escholarship.org/uc/item/4t9278vc", + "quote": "they suffer from numerous nonidealities limiting performance, including high variability, asymmetric and nonlinear 
weight update, endurance, retention and stuck at fault (SAF)", + "vote": "3-0" + }, + { + "claim": "The thesis develops methods for online (in-situ) training of memristive crossbars under stochastic asymmetric nonlinear weight update, including a compensation technique (small-scale) and stochastic rounding (tested on spiking neural networks) \u2014 i.e., memristive crossbars can be trained with weights updated inside the loop, not just one-time programmed.", + "source": "https://escholarship.org/uc/item/4t9278vc", + "quote": "developed a model to incorporate stochastic asymmetric nonlinear weight update in online (in-situ) training, proposing solutions including a compensation technique tested on small-scale problems and stochastic rounding tested on spiking neural networks", + "vote": "3-0" + }, + { + "claim": "Weight programming here is INFERENCE-ONLY / one-time: it is treated as a one-time cost computed per DNN, not an in-situ/online update inside a training loop. This places PCM crossbars in this regime in the LARGE-but-fixed-weight category, failing EP's critical in-situ-update filter.", + "source": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9247051/", + "quote": "weight programming optimisation represents a one-time computational cost that should be performed for each unique DNN and unique set of underlying analogue memory device characteristics.", + "vote": "3-0" + }, + { + "claim": "A physical analog electronic network of self-adjusting nonlinear transistor-based resistive elements learns nonlinear tasks (XOR, nonlinear regression) entirely in-situ without a computer/processor and without backpropagation, using the local Coupled Learning rule explicitly described as closely related to Equilibrium Propagation and Contrastive Hebbian Learning.", + "source": "https://www.pnas.org/doi/10.1073/pnas.2319718121", + "quote": "Here we introduce a nonlinear learning metamaterial -- an analog electronic network made of self-adjusting nonlinear resistive elements 
based on transistors. We demonstrate that the system learns tasks unachievable in linear systems, including XOR and nonlinear regression, without a computer. ... ac", + "vote": "3-0" + }, + { + "claim": "Weights are physically adjustable in-situ: each edge's learning degree of freedom is a transistor gate voltage stored on a local capacitor (C0 = 22 uF), and dedicated on-edge circuitry continuously updates it by charging/discharging based on the local difference between a 'free' and a 'clamped' twin network \u2014 a physically realized free-phase/nudged-phase contrastive update with no central gradient computation.", + "source": "https://www.pnas.org/doi/10.1073/pnas.2319718121", + "quote": "During training, circuitry on each twin edge continuously updates G by charging or discharging the local capacitor, depending on the local difference between the electronic states in the two networks. ... We designate one network as 'Free' and impose only inputs (V1 , V- ), and the other as 'Clamped", + "vote": "3-0" + }, + { + "claim": "The forward computation is physical settling: the analog network relaxes to an equilibrium steady state on a ~1 microsecond timescale (tau_V ~ 1 us), so the relaxation IS the physical circuit dynamics rather than a digital iteration, and learning updates evolve on a slower 18 ms timescale.", + "source": "https://www.pnas.org/doi/10.1073/pnas.2319718121", + "quote": "the network 'calculates' the output (orange) naturally from the inputs ... reaches equilibrium on a timescale of tau_V ~ 1 us ... 
They will evolve with timescale tau0 = 18 ms until frozen, or until the system reaches a state where the two networks have", + "vote": "3-0" + }, + { + "claim": "Equilibrium Propagation has been physically realized on a commercial hardware substrate: the D-Wave quantum-annealing Ising machine, where the physical machine performs both the free-phase and nudge-phase relaxation to a steady state (the settling is physical, not simulated).", + "source": "https://www.nature.com/articles/s41467-024-46879-4", + "quote": "Employing the commercial D-Wave Ising machine, composed of thousands of two-state components... we utilized the quantum annealing procedure of the D-Wave chip to achieve the free phase of EP... The steady states of the spins at the end of the free and nudge phase are measured, recorded, and used to ", + "vote": "3-0" + }, + { + "claim": "The weights are NOT stored or updated in-situ on the analog/quantum substrate; they live on a classical digital computer, where the input-weight product and the SGD weight update are computed, and only bias fields and couplings are loaded onto the chip each phase. This makes it an inference-only / externally-trained substrate from the EP-in-situ-learning standpoint.", + "source": "https://www.nature.com/articles/s41467-024-46879-4", + "quote": "we calculate the product of the input data X (an input image, for instance) and a trainable weight matrix Winput using a digital computer... 
The updates are then applied to the weights using the standard stochastic gradient descent algorithm", + "vote": "1-1" + }, + { + "claim": "The learning rule is local: weight updates are derived purely from local measurements of the two equilibrium (free and nudged) states, explicitly avoiding backpropagation's non-local computation \u2014 validating EP/local backprop-free learning as a viable training rule on physical equilibrium hardware.", + "source": "https://www.nature.com/articles/s41467-024-46879-4", + "quote": "The parameter changes required for learning are derived from local measurements of the equilibrium states, as opposed to a complex non-local analytic mathematical procedure like backpropagation", + "vote": "3-0" + } +]
\ No newline at end of file diff --git a/refs/hw_research_claims2.json b/refs/hw_research_claims2.json new file mode 100644 index 0000000..0f28d6a --- /dev/null +++ b/refs/hw_research_claims2.json @@ -0,0 +1,85 @@ +[ + { + "claim": "Analog/in-memory self-attention HARDWARE provably exists across multiple substrates, spanning real fabricated silicon to detailed simulation \u2014 directly answering GAP 1's core question. Fabricated: a UCSD 65nm CMOS charge-based SRAM-CIM attention accelerator (Moradifirouzabadi, Dodla, Kang; arXiv 2409.04940 / ESSERC 2024 / IEEE 10719540), the first to use charge-based analog CIM in SRAM for transformers, with measured 14.8 TOPS/W (analog core) and a custom 9-T bitcell doing the Q-dot-K^T score via capacitor charge-sharing. Simulated/architected: a Julich gain-cell (charge-based, capacitor) in-memory attention design (Leroux et al.; arXiv 2409.19315 / Nature Comp. Sci. 2025, s43588-025-00854-1) reporting ~70,000x energy / ~100x speedup vs GPU. Memristor/RRAM: a Nature Sci. Reports 2024 memristor self-attention accelerator (s41598-024-75021-z; 128x128 subarrays, 2-bit cells, NeuroSim/32nm) and the STAR RRAM softmax engine (arXiv 2401.17582, DATE 2023). Photonic: a cascaded TFLN-microring softmax proposal (arXiv 2603.12934, 2026).", + "confidence": "high", + "sources": [ + "https://arxiv.org/pdf/2409.04940", + "https://arxiv.org/html/2409.04940v2", + "https://arxiv.org/abs/2409.19315", + "https://www.nature.com/articles/s41598-024-75021-z", + "https://arxiv.org/pdf/2401.17582", + "https://arxiv.org/pdf/2603.12934" + ], + "evidence": "Merges claims 0, 3, 16, 19, 5, 6. UCSD chip: 'The accelerator is fabricated in 65nm CMOS technology... first to use charge-based analog CIM in SRAM... for Transformer application' with measured 14.8 TOPS/W and a die photo \u2014 real silicon (claims 3, 16, all 3-0). Julich: 'custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells' (claim 0, 3-0). 
Nature Sci Reports memristor: 128x128 subarrays, 2-bit, Roff/Ron 1MOhm/100kOhm, NeuroSim V3.0 at 32nm (claim 19, 3-0). STAR is an RRAM softmax engine (claim 5, 3-0). All primary peer-reviewed sources, unanimous votes.", + "vote": "consolidated 3-0 (claims 0,3,5,16,19); claim 6 photonic at 3-0" + }, + { + "claim": "EVERY surviving analog/in-memory attention implementation is INFERENCE-ONLY / fixed-weight \u2014 none performs in-situ training or on-device weight updates \u2014 so they fall in the large-but-fixed-weight inference camp, NOT the in-situ-trainable camp the EP demo requires. The Julich gain-cell design reaches GPT-2-comparable performance via OFFLINE hardware-aware initialization + offline backprop fine-tuning (~3,000 + ~10,000 iterations), all before deployment; gain cells store KV-cache ACTIVATIONS (token projections), not learned weights. This is the load-bearing limitation for an in-situ-EP demo: the analog-attention datapath is validated, but the trainability requirement is unmet by all named products/papers.", + "confidence": "high", + "sources": [ + "https://arxiv.org/abs/2409.19315", + "https://arxiv.org/pdf/2409.19315v1", + "https://arxiv.org/pdf/2409.04940" + ], + "evidence": "Claim 2 (3-0): 'initialization algorithm achieving text processing performance comparable to GPT-2 without training from scratch'; verifier confirmed weights computed offline/digitally, NO in-situ training, gain cells store KV-cache activations not weights, with ~13,000 offline fine-tuning iterations. Claim 12 (3-0) confirms the Julich design is SPICE-simulated (TSMC 28nm PDK), full floorplan/layout, but explicitly 'NOT physically fabricated' and 'limited to device simulations.' The UCSD chip (claims 3,4,16) is likewise an inference accelerator. 
Across all 20 claims, none asserts in-situ weight learning.", + "vote": "3-0 (claims 2, 12)" + }, + { + "claim": "The softmax exponential + normalization IS physically realizable in analog in principle: a standard emitter-coupled (BJT) or subthreshold source-coupled (NMOS) differential-pair / winner-take-all network natively computes softmax, because KCL at the shared tail node makes each branch current = exp(x_i/V_T) / sum_k exp(x_k/V_T) \u2014 the normalization/division is obtained 'for free' from the shared-current node with NO explicit analog divider. The exponential I-V comes from subthreshold MOS (ID proportional to exp((Vgs-VTH)/nVT), tail currents kept small, ~180-300 nA) or BJT forward-active (any tail current). This is the classic Gilbert normalizer / translinear-principle result, independently corroborated across multiple sources.", + "confidence": "high", + "sources": [ + "https://arxiv.org/pdf/2305.13649", + "https://arxiv.org/html/2507.04338v1" + ], + "evidence": "Merges claims 8, 9, 11. Claim 8 (3-0): 'IC,i = exp(xi/VT) / sum_k exp(xk/VT) * IEE'; KCL at shared emitter node yields the softmax denominator for free, no divider; verifier notes this is textbook Gilbert normalizer/translinear, independently corroborated (MDPI Electronics 2021 current-mode softmax). Claim 9 (3-0): subthreshold ID exponential, 180-300nA tested, BJT forward-active any tail current. Claim 11 (3-0): voltage-mode WTA branch current 'is identical to the softmax equation' (Zyarah & Kudithipudi arXiv 2507.04338, citing foundational Elfadel & Wyatt 1994). Caveats: NMOS exponent has slope factor n; only valid in weak inversion; real BJT breadboard shows +/-4.2% error.", + "vote": "3-0 (claims 8, 9, 11)" + }, + { + "claim": "GAP 1(c) CONFIRMED \u2014 real analog-attention prototypes overwhelmingly adopt the pragmatic mixed-signal split: keep softmax/normalization (and often the value-multiply) in DIGITAL/LUT/CMOS, put only the linear maps + score dot-products in analog. 
Concrete instances: (a) UCSD 65nm chip computes only Q-dot-K^T in analog CIM (binary token-pruning decision via comparator), with softmax AND value-multiply done exclusively in the digital processor for the ~25% unpruned tokens; (b) the FeFET IMC approach (Julich ref [10] = Laguna et al., Frontiers in Electronics 2022) puts ONLY Q/K/V linear projections in analog, computing the attention dot-product in digital CMOS with K/V cached in SRAM; (c) the Nature Sci. Reports memristor accelerator maps only the linear MatMuls onto RRAM crossbars while softmax is done off-crossbar (RRAM compare-select for xmax + LUT/CMOS for exp/log); (d) STAR computes softmax exp via CAM+LUT lookup, not analog exp physics. NOTE the dissent: the Julich gain-cell paper itself does the OPPOSITE (computes attention in-memory) and replaces softmax with ReLU/HardSigmoid because softmax normalization needs a costly across-sequence vertical connection 'challenging to implement using analog circuitry.'", + "confidence": "high", + "sources": [ + "https://arxiv.org/pdf/2409.04940", + "https://arxiv.org/pdf/2409.19315v1", + "https://www.nature.com/articles/s41598-024-75021-z", + "https://arxiv.org/pdf/2401.17582" + ], + "evidence": "Merges claims 4, 15, 18, 5, 13. Claim 4 (3-0): 'Softmax... and multiplying with value embeddings (V) are also performed in the digital processor only for the unpruned tokens'; 'No Softmax, normalization, or value multiplication occurs in the analog domain.' Claim 15 (3-0): FeFET ref [10] uses 'IMC only for computing the linear projections... attention itself is not computed in memory.' Claim 18 (3-0): memristor accelerator softmax 'divided into two parts: RRAM-based compare and select logics... and look-up tables for exponential and logarithmic functions.' Claim 5 (3-0): STAR uses CAM+LUT for exp, exploiting softmax precision-insensitivity. 
Claim 13 (3-0): Julich replaces softmax with ReLU because normalization 'necessitate[s] an additional vertical connection along the sequence dimension... challenging to implement using analog circuitry' \u2014 a concrete instance of the softmax-mapping difficulty.", + "vote": "3-0 (claims 4, 5, 13, 15, 18)" + }, + { + "claim": "Photonic softmax for attention has been PROPOSED (not fabricated): a cascade of N tunable thin-film lithium niobate (TFLN) microring resonators synthesizes the per-channel exp(x_n - max) optically, with a single shared-WDM-laser PI feedback loop performing normalization, and electronic preprocessing (max-finding, shift/bias, digital interfacing) kept off-chip \u2014 itself a photonic instance of the GAP 1(c) split. Accuracy is depth/Q-limited: at the FDTD-validated quality factor (Q~10,300-15,500, Dmax~0.36) only N=5-7 rings are practical with error up to ~11%; sub-percent error (N~20-30) requires unrealized high-Q (Q>=1e6) devices. This is conceptual/layout + 3D-FDTD simulation only, not buy-now hardware.", + "confidence": "medium", + "sources": [ + "https://arxiv.org/pdf/2603.12934" + ], + "evidence": "Claim 6 (3-0): N=10 -> 2.68% error, N=20 -> 0.67%, N=30 -> 0.30% analytically, but FDTD regime supports only N=5-7 with ~11% error. Claim 7 (2-1, the only split vote): electronic max/shift/bias preprocessing, photonic exp + PI-feedback normalization; verifier flagged that summation is actually ELECTRICAL (Eq.35) and only normalization is photonic, and that TABLE VII labels the full softmax as 'Conceptual + layout' \u2014 simulation/proposal, not silicon. 
Single source (Park & Park, Chungnam National Univ., 2026); medium confidence due to one split vote and proposal-stage maturity.", + "vote": "claim 6 at 3-0; claim 7 at 2-1 (split)" + }, + { + "claim": "PARTIAL GAP-3 signal (the only endurance/write-stress evidence in the surviving set): non-volatile memories (memristor/RRAM, Flash, FeFET, PCM) were explicitly rejected for in-memory KV-cache computation because each step must WRITE K/V values and NVM has slow writes, high write energy, and low endurance; charge-based gain cells were chosen specifically for higher endurance and lower write energy/time. This indirectly supports the EP-demo concern that write-limited NVM substrates are stressed by frequent updates \u2014 but NO surviving claim gives per-device endurance NUMBERS (cycles-to-failure), and ECRAM/electrochemical-RAM is NOT mentioned anywhere in the evidence.", + "confidence": "high", + "sources": [ + "https://arxiv.org/pdf/2409.19315v1" + ], + "evidence": "Claim 14 (3-0): 'Non-volatile memory technologies exhibit slow write speeds, high energy consumption during the writing process, and low endurance, which collectively limit their suitability for IMC of the attention mechanisms... gain cells have more endurance and require less write energy and time than non-volatile memories.' Verifier added independent context (not in primary claim): RRAM endurance ~1e6-1e9 cycles, PCM similar, vs charge-based/SRAM effectively >1e15 \u2014 but this is verifier background, not a separately voted claim. The reference [9] is Sebastian et al. Nature Nanotechnology 2020. 
This is the closest the evidence gets to GAP 3; it is directional, not a quantified endurance budget.", + "vote": "3-0 (claim 14)" + }, + { + "claim": "Noise-margin quantification for an analog attention pipeline: a variation-aware memristor ViT accelerator simulation (MDPI Electronics 2026, 15(5),1116) reports tolerating ~35% analog computation error + ~10% memristor conductance variation while matching Top-1 accuracy of a digital baseline \u2014 useful for sizing how much analog imprecision an in-memory attention datapath can absorb. Caveats: it is simulation (not silicon), the match is a LARGER analog ViT-L vs a SMALLER digital ViT-B (not like-for-like), the '5nm baseline' is primarily the ENERGY comparator (not the accuracy one), and the paper is standard ViT inference (NOT energy-based/Hopfield attention and NOT Equilibrium Propagation).", + "confidence": "medium", + "sources": [ + "https://www.mdpi.com/2079-9292/15/5/1116" + ], + "evidence": "Claim 17 (3-0): 'up to 35% analog computation error and 10% memristor conductance variation, the analog ViT-L accelerator maintains Top-1 accuracy equivalent to that of a digital ViT-B.' Verifier flagged three qualifications (ViT-L vs ViT-B mismatch; 5nm baseline is energy comparator; no EP/energy-based attention). Single peer-reviewed source; the headline numbers are robust but the 'energy-based attention' framing is the claimant's interpretive bridge. Medium confidence: unanimous vote but single source and interpretive caveats.", + "vote": "3-0 (claim 17)" + }, + { + "claim": "Sole datapoint relevant to GAP 1's full-analog-nonlinearity question: the analog-softmax-circuit author (Sillman, arXiv 2305.13649) argues an analog softmax block only pays off INSIDE a fully-analog system, because isolating it and bridging via ADC/DAC arrays would dwarf the processing block's power; he points to IBM's fully-analog NN-training hardware (PCM + capacitive-MOS matrix-multiply arrays) as the right integration context. 
This is a hedged single-author opinion (breadboard prototype) and argues AGAINST the GAP 1(c) digital-softmax split when the surrounding datapath is already analog \u2014 relevant nuance for an all-analog EP system.", + "confidence": "medium", + "sources": [ + "https://arxiv.org/pdf/2305.13649" + ], + "evidence": "Claim 10 (3-0): 'the amount of power the ADC and DAC arrays would consume... would also dwarf the power consumption of the processing block... this processor will likely find its home in a fully-analog design scheme. AI labs such as IBM have already begun redesigning neural network training hardware as fully analog systems by using phase-change memory (PCM) and capacitive MOS arrays.' Verifier notes this is a hedged opinion in a breadboard preprint and the IBM line is one motivational sentence. Medium confidence: unanimous vote but explicitly an opinion, not a measured result.", + "vote": "3-0 (claim 10)" + } +]
\ No newline at end of file diff --git a/refs/paper_2603.12934.txt b/refs/paper_2603.12934.txt new file mode 100644 index 0000000..4f521d8 --- /dev/null +++ b/refs/paper_2603.12934.txt @@ -0,0 +1,2039 @@
+ Photonic Exponential Approximation via Cascaded TFLN Microring Resonators toward Softmax
+
+ Hyoseok Park (1) and Yeonsang Park (1, ∗)
+ (1) Department of Physics, Chungnam National University, Daejeon 34134, Republic of Korea
+ (Dated: March 26, 2026)
+ arXiv:2603.12934v3 [physics.optics] 25 Mar 2026
+
+ The rapid growth of large-scale AI models has intensified energy consumption and data-movement
+ challenges in modern datacenters. Photonic accelerators offer a promising path by executing the linear
+ matrix multiplications of transformer inference at high throughput and low energy. However, the
+ softmax attention layer, which requires element-wise exponentiation followed by normalization, still
+ relies on electronic post-processing, creating an electro-optic conversion bottleneck that negates much
+ of the potential photonic advantage.
+
+ We present a cascaded micro-ring resonator (MRR) architecture that synthesizes the per-channel
+ exponential function required by softmax, $e^{x_n - \max(\mathbf{x})}$, over a finite interval with
+ tunable worst-case relative error. A control signal detunes each ring via an electro-optic mechanism;
+ a weak probe at fixed frequency experiences Lorentzian transmission, and cascading $N$ identical
+ stages yields a multiplicative transfer function whose logarithm is approximately linear.
+
+ We derive mapping rules, depth-scaling estimates, and a minimax fitting formulation, and validate
+ the framework with three-dimensional FDTD simulations of X-cut thin-film lithium niobate (TFLN)
+ add-drop micro-ring resonators. Direct multi-ring FDTD validation extends to a five-ring cascade
+ and confirms agreement with theory primarily over the upper operating range; deeper cascades and
+ higher quality factors are assessed analytically.
+The cascade implements the per-channel exponential block, the key missing nonlinearity for photonic
+softmax. We further present a WDM-parallel chip architecture with closed-loop PI feedback that
+completes the full softmax (exponentiation, summation, and normalization) on a single photonic chip
+without per-channel normalization circuitry.
+
+* yeonsang.park@cnu.ac.kr; Corresponding author
+
+I. INTRODUCTION
+
+Transformer inference is often limited by power and memory traffic, motivating optical accelerators
+that exploit parallel propagation and multiplexing [1, 2, 4, 5, 7, 9]. Recent perspective articles
+also discuss data-center power consumption as one motivation for optical computing [3, 8]. While
+linear operators are comparatively amenable to photonic implementation [4–6], the softmax function
+used in attention layers requires an exponential mapping together with global normalization, both of
+which are difficult to realize in passive photonic circuits, where transmission is fundamentally
+bounded by unity. Parallel digital-hardware studies treat the exponential/softmax stage as a
+bottleneck and propose dedicated approximations [11–19]. Many integrated-photonic classifier
+demonstrations still rely on electronic post-processing for the final nonlinear readout [10]; the
+resulting electro-optic conversion overhead can negate the throughput and energy benefits of the
+photonic front-end. Notably, the SOFTONIC architecture [11] explicitly argues that “the inability of
+MRRs and MZMs to handle SMA’s exponential and division functions” necessitates alternative approaches
+based on microdisk modulators and polynomial approximation, achieving 89.7% accuracy with a
+third-degree Chebyshev polynomial. Here we challenge this premise: we show that a passive Lorentzian
+cascade of microring resonators can be tuned so that its logarithm is approximately linear over a
+finite interval, enabling exponential-function synthesis with sub-2% worst-case error (an order of
+magnitude more accurate than SOFTONIC’s polynomial approach) while remaining compatible with
+integrated microring platforms [20–24]. We term this cascade block an approximate exponential
+function (AEF) unit. We further propose a WDM-parallel architecture with a single PI feedback loop
+that realizes the complete softmax function, including summation and normalization, without
+per-channel electronic processing.
+
+We extend the theoretical framework with three-dimensional FDTD simulations of a single X-cut TFLN
+add-drop micro-ring resonator. The simulated device parameters (quality factor, free spectral range,
+and electro-optic sensitivity) calibrate the cascade design parameters, bridging analytical fitting
+and physically realizable hardware. Two operating regimes emerge from this calibration: an
+FDTD-characterized regime with moderate drop-port depth (D_max ≈ 0.36), where the analytic error
+stays below ∼5% for N ≤ 7 but the power budget limits practical cascades to N ≤ 5; and a projected
+high-Q regime (D_max ≥ 0.95), enabling deeper cascades (N ≤ 30) with sub-percent error. Cascade
+performance is predicted analytically and validated by a five-ring cascade 3D FDTD simulation
+(Sec. IV).
+
+The paper is organized as follows: Section II presents the mapping, transfer model, and depth-design
+rules; Section III provides numerical fits and validation; Section IV describes the single-ring TFLN
+device design and FDTD validation; Section V assesses physical feasibility including voltage
+requirements, insertion loss, and energy efficiency; Section VI discusses implementation scope,
+platform comparisons, and limits; and Section VII concludes.
+
+II. MODEL AND DESIGN FRAMEWORK
+
+Target mapping. Let x = (x_1, ..., x_K) ∈ R^K be an arbitrary real-valued sequence (or vector).
+Directly generating exp(x_n) as a passive optical transmission is impossible in general because
+exp(x) grows beyond unity while a passive transmission satisfies 0 < T ≤ 1 [25]. However, for softmax,
+
+    \mathrm{softmax}(x)_n = \frac{e^{x_n}}{\sum_j e^{x_j}},    (1)
+
+a common shift cancels:
+
+    \frac{e^{x_n + c}}{\sum_j e^{x_j + c}} = \frac{e^{x_n}}{\sum_j e^{x_j}} \quad (\forall c \in \mathbb{R}).    (2)
+
+Thus it suffices to generate
+
+    e^{x_n - m}, \qquad m \equiv \max_j x_j,    (3)
+
+since the global factor e^m cancels.
+
+To ensure a nonnegative control-signal amplitude, define
+
+    u_n \equiv x_n - m \le 0, \qquad L \equiv -\min_n u_n = m - \min_n x_n \ge 0,    (4)
+
+and map each scalar to a nonnegative control-signal amplitude
+
+    I_n \equiv u_n + L \in [0, L].    (5)
+
+Then
+
+    e^{x_n - m} = e^{u_n} = e^{I_n - L}.    (6)
+
+Hence the optical design task is to realize, for I ∈ [0, L],
+
+    f(I) = e^{I - L} \in [e^{-L}, 1].    (7)
+
+Control–probe transfer. Consider a weak probe at fixed angular frequency ω_L. For the kth ring, let
+ω_{0,k} denote its resonance frequency and Γ > 0 its loaded half-width at half maximum (HWHM). Define
+the detuning
+
+    \Delta\omega_k \equiv \omega_L - \omega_{0,k}.    (8)
+
+Near resonance, the normalized Lorentzian transmission is modeled as [20, 21]
+
+    T_k(\Delta\omega_k) = \frac{1}{1 + \left(\Delta\omega_k / \Gamma\right)^2}.    (9)
+
+In a control–probe architecture, a nonnegative control-signal amplitude I ≥ 0 shifts the ring
+resonance. Here I denotes a generic control amplitude: for optical-pump operation it maps to optical
+intensity, while for EO operation it maps to electrical control level (e.g., voltage). Across many
+physical mechanisms (optical pump via Kerr/XPM, EO drive via Pockels effect, thermal, carrier tuning),
+the shift can be linearized on a working range [20, 26–30]:
+
+    \omega_{0,k}(I) = \omega_{0,k}^{(0)} + \eta I,    (10)
+
+where ω_{0,k}^{(0)} is the cold-cavity resonance and η is the control-to-resonance sensitivity. In
+practice, the control channel can be optical or electrical (optical pump, EO/Pockels drive, thermal,
+or carrier tuning); a quantitative EO feasibility example is given in the Discussion. With
+Δω_{0,k} ≡ ω_L − ω_{0,k}^{(0)}, the control-dependent detuning becomes
+
+    \Delta\omega_k(I) = \Delta\omega_{0,k} - \eta I.    (11)
+
+Define dimensionless parameters
+
+    a_k \equiv \frac{\Delta\omega_{0,k}}{\Gamma}, \qquad b \equiv -\frac{\eta}{\Gamma}.    (12)
+
+Then Eq. (9) yields the control-to-probe transfer of a single ring,
+
+    T_k(I) = \frac{1}{1 + (a_k + bI)^2}.    (13)
+
+Physical meaning: a_k is a static detuning in linewidth units (set by heater/carrier
+tuning/fabrication), and |b| is the normalized sensitivity magnitude (linewidths of resonance shift
+per unit control-signal amplitude); the sign convention is absorbed into the detuning expression. For
+“same-material/same-geometry” rings, b is often common, while a_k can be tuned per ring.
+
+Sign convention. Simultaneously flipping (a_k, b) ↦ (−a_k, −b) leaves T_k(I) unchanged, so we may
+take b > 0 without loss of generality.
+
+Let N rings be cascaded in a serial add-drop topology: T_k(I) denotes the add-to-drop transmission of
+ring k, and the drop output of ring k feeds the add (input bus) port of ring k+1. Assuming the probe
+is sufficiently weak so the control channel dominates the resonance shift, the normalized probe output
+is the product
+
+    y(I) \equiv \frac{P_{\mathrm{out}}^{(\mathrm{probe})}(I)}{P_{\mathrm{in}}^{(\mathrm{probe})}} = \prod_{k=1}^{N} T_k(I) = \prod_{k=1}^{N} \frac{1}{1 + (a_k + bI)^2}.    (14)
+
+FIG. 1: Overview of the control–probe add-drop cascade N-MRR exponential block. (a) Electronic
+preprocessing maps an arbitrary input sequence {x_n} to a nonnegative control signal via
+m = max_n x_n, u_n = x_n − m, and I_n = u_n + L with L = m − min_n x_n.
+(b) The control signal I_n induces resonance shifts in a cascade of N rings, while a weak
+fixed-frequency probe propagates through the serial add-drop cascade (the drop output of each ring
+feeds the next stage), experiencing multiplicative transmission. (c) After photodetection, the block
+implements y(I_n) ≈ exp(I_n − L) ≈ exp(x_n − m), i.e., the normalized exponential used in softmax.
+
+To focus on the shape of the approximation, we allow a global scale factor C > 0:
+
+    \tilde{y}(I) \equiv C\, y(I).    (15)
+
+In softmax, p_n = C e^{I_n − L} / Σ_j C e^{I_j − L}, so C cancels between numerator and denominator
+and is physically inessential; nevertheless it is convenient for error analysis. For a fixed
+(N, b, {a_k}), the optimal C for the minimax log-error in Eq. (18) can be written in closed form. Let
+g(I) ≡ ln y(I) − (I − L) on [0, L]. Then the minimax-optimal shift is
+ln C* = −(max_I g(I) + min_I g(I))/2, yielding E_∞ = (max_I g(I) − min_I g(I))/2.
+
+Taking logarithms,
+
+    \ln \tilde{y}(I) = \ln C - \sum_{k=1}^{N} \ln\!\left[1 + (a_k + bI)^2\right].    (16)
+
+The target ln f(I) = I − L is linear; hence exponential approximation is equivalent to the
+log-linearization goal
+
+    \ln \tilde{y}(I) \approx I - L \quad \text{uniformly on } I \in [0, L].    (17)
+
+Error metric. Define the worst-case log-error on [0, L]:
+
+    E_\infty \equiv \sup_{I \in [0, L]} \left| \ln \tilde{y}(I) - (I - L) \right|.    (18)
+
+If E_∞ ≤ ε_log, then for all I ∈ [0, L],
+
+    e^{-\varepsilon_{\log}} \le \frac{\tilde{y}(I)}{f(I)} \le e^{\varepsilon_{\log}} \;\Rightarrow\; \left| \frac{\tilde{y}(I)}{f(I)} - 1 \right| \le e^{\varepsilon_{\log}} - 1.    (19)
+
+Thus achieving a prescribed worst-case relative error ε is guaranteed by
+
+    E_\infty \le \varepsilon_{\log} \equiv \ln(1 + \varepsilon) \approx \varepsilon.    (20)
+
+Depth scaling. We derive depth-related constraints and design rules for a prescribed approximation
+tolerance.
+
+Necessary slope condition. Differentiate Eq. (16):
+
+    \frac{d}{dI} \ln y(I) = -\sum_{k=1}^{N} \frac{2b\,(a_k + bI)}{1 + (a_k + bI)^2}.    (21)
+
+Since |2u/(1 + u^2)| ≤ 1 for all real u,
+
+    \left| \frac{d}{dI} \ln y(I) \right| \le N |b|.    (22)
+
+The target ln f(I) = I − L has constant slope +1, so a necessary condition to track it is
+
+    N |b| \gtrsim 1.    (23)
+
+Near-optimal parameterization. The full design problem can be written as a minimax fit in the log
+domain [31]:
+
+    \min_{a_1, \dots, a_N,\, \ln C} \; \sup_{I \in [0, L]} |r(I)|, \qquad r(I) \equiv \ln C - \sum_{k=1}^{N} \ln\!\left[1 + (a_k + bI)^2\right] - (I - L).    (24)
+
+This objective is permutation-invariant in the a_k’s (ring index k). In practice (and in numerical
+experiments reported below), the optimizer frequently collapses to a permutation-symmetric solution
+
+    a_1 = \cdots = a_N \equiv a,    (25)
+
+reducing the design to two parameters (a, b) (plus C). With Eq. (25),
+
+    \tilde{y}(I) = C\, y(I) = C \left[ \frac{1}{1 + (a + bI)^2} \right]^{N}.    (26)
+
+A robust initialization is obtained by placing the midpoint of the interval on the Lorentzian
+half-maximum flank and matching the slope:
+
+    a + b \frac{L}{2} \approx -1, \qquad N b \approx 1.    (27)
+
+These two equations already yield a good design; a small (two-parameter) refinement can then enforce
+the desired worst-case tolerance.
+
+Local expansion and depth scaling. A Taylor expansion of the log-domain residual around the
+flank-centered point I_0 = L/2 (with a + bI_0 = −1 and Nb = 1) shows that the quadratic term vanishes
+identically, leaving a leading cubic residual r(δ) ∼ δ^3/(6N^2). Over I ∈ [0, L], this implies
+E_∞ ∼ L^3/N^2, so that achieving a prescribed tolerance ε_log requires N ∝ L^{3/2}/√ε_log, which
+explains the scaling used in Eq. (28). The full derivation is provided in Supplementary Sec. S0; an
+intuitive local-expansion summary appears in Sec. S1.
+
+Practical engineering estimate. Given L and a target worst-case relative error ε, define
+ε_log = ln(1 + ε). A heuristic engineering estimate (not a rigorous bound) that matched our
+percent-level numerical designs is
+
+    N \approx \max\!\left( \frac{1}{b_{\max}},\; \kappa \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}} \right),    (28)
+
+where b_max is the physically achievable sensitivity bound and κ ≃ 0.07 for the identical-detuning
+flank design with a minimax refinement. After choosing N, set b = min(b_max, 1/N) and a = −1 − bL/2
+as initialization, then refine (a, b) by a two-parameter minimax fit on [0, L].
+
+A heuristic conservative screening bound N ≥ ⌈(L^2/4 + 1/(2b^2))/ln(1 + ε)⌉ (derived via the same
+local-expansion argument; see Supplementary Sec. S1) provides a quick upper estimate but is not a
+rigorous guarantee.
+
+III. NUMERICAL FITS AND VALIDATION
+
+We validate the analytical framework with minimax numerical fits and sampled robustness checks.
+Figure 2 shows the fitted approximation quality at L = 8: the top (linear) panel plots N = 1, 3, 5, 7
+over I ∈ [0, 20], the middle (log) panel compares N = 5, 10, 20, 30 on I ∈ [0, 8], and the bottom
+panel shows the pointwise relative error with the characteristic Chebyshev equioscillation pattern.
+We fit identical-detuning cascades (Eq. 25) on I ∈ [0, L] and compare several depths using a minimax
+criterion.
+
+Table I makes the accuracy–depth trade-off explicit at L = 8. A worked input-to-output example
+demonstrating the mapping from an arbitrary input sequence x = [−3.2, 1.2, 4.8, −0.9] through the
+cascade is provided in Supplementary Sec. S2. The example shows that the N = 10 cascade keeps the
+worst-case relative error below 2.7% across all channels.
+
+Empirical calibration. We calibrate the effective logit range L_eff from autoregressive Transformers
+(distilgpt2/gpt2) [1, 32–35] at context length 128, finding L_{eff,0.999} ≈ 7–9 at the 50th–90th
+percentiles (Supplementary Sec. S2). A clipping threshold t* = −12 preserves p99 softmax accuracy
+below 0.1%. Full protocol details, clipping-sweep tables/plots, and per-run statistics are provided
+in Supplementary Sec. S3.
+
+A synthetic design-space map (Supplementary Table S3) shows that near L ≈ 8, moderate depth (N ≥ 10)
+reaches few-percent error, whereas L ≳ 12 requires deeper cascades. All fits follow the same
+pipeline: minimize the worst-case log-error on a uniform grid, initialize from the flank rules in
+Eq. (27), perform multi-start global search, and apply bounded local refinement; implementation
+details and scripts are provided in a public repository [36] (commit: 585e695).
+
+TABLE I: Depth comparison for L = 8 using fitted ỹ(I) = C[1 + (a + bI)^2]^{−N} (same fitting
+pipeline for all N).
+
+    N     a          b          max rel. err.   mean rel. err.
+    5     −2.0789    0.21658    10.9%           6.43%
+    10    −1.4588    0.10202    2.68%           1.65%
+    20    −1.2135    0.05025    0.67%           0.42%
+    30    −1.1392    0.03341    0.30%           0.19%
+
+TABLE II: Waveguide and ring parameters of the X-cut TFLN micro-ring resonator. Electro-optic
+electrode parameters are listed separately in Table III.
+
+    Parameter                 Symbol    Value    Unit
+    Total TFLN thickness      t_TFLN    600      nm
+    Etch depth                t_etch    500      nm
+    Slab thickness            t_slab    100      nm
+    Waveguide width           w         1.4      µm
+    Bend radius               R         20       µm
+    Coupling gap              g         100      nm
+    Circumference             L_ring    125.7    µm
+    Free spectral range       FSR       8.29     nm
+    Effective index (TE0)     n_eff     1.903    -
+    Group index (TE0)         n_g       2.24     -
+    Extraordinary index       n_e       2.138    -
+
+IV. TFLN SINGLE-RING DEVICE DESIGN AND FDTD VALIDATION
+
+A. Waveguide and ring geometry
+
+The device is based on an X-cut thin-film lithium niobate (LiNbO3) on insulator wafer with a
+600 nm-thick LiNbO3 film on SiO2. A 500 nm-deep rib etch defines a 1.4 µm-wide single-mode waveguide
+with a 100 nm unetched slab (Fig. 3). Lumerical MODE simulations yield n_eff = 1.903 and n_g = 2.24
+at λ = 1550 nm for the fundamental TE0 mode.
+
+The ring resonator (R = 20 µm, L_ring = 125.7 µm) is configured as an add-drop resonator with 100 nm
+coupling gaps (Fig. 4). The FDTD-measured free spectral range is FSR = 8.29 nm (n_g ≈ 2.30), slightly
+above the MODE value due to bend-induced dispersion.
+
+FIG. 2: Minimax cascade fits at L = 8. (a) Linear scale: shallow cascades (N = 1, 3, 5, 7) over
+I ∈ [0, 20]. The target e^{I−L} (black) is progressively better matched as N increases. (b) Log
+scale: depth comparison (N = 5, 10, 20, 30) on I ∈ [0, 8]. Inset zooms into I ∈ [6, 8] showing
+convergence.
+(c) Pointwise relative error showing the Chebyshev equioscillation pattern characteristic of minimax
+optimality.
+
+FIG. 3: Cross-section of the X-cut TFLN rib waveguide on a SiO2 substrate. The 600 nm LiNbO3 film is
+etched 500 nm to form a 1.4 µm-wide single-mode rib waveguide. Lateral signal (S) and ground (G)
+electrode positions are indicated; electrode design details are discussed in Sec. IV D.
+
+Table II summarizes the waveguide and ring parameters.
+
+B. 3D FDTD Methodology
+
+The ring resonator response is simulated using Lumerical 3D FDTD with conformal variant 1 meshing. A
+broadband TE0 mode source (1530 nm to 1570 nm) is injected into the input bus waveguide, and
+through- and drop-port spectra are recorded. A “z-refined 3-fix” meshing strategy ensures convergence
+in the thin-film geometry [37]; detailed simulation setup is provided in Supplementary Sec. S4
+(Table S6).
+
+FIG. 4: Top view of the single add-drop micro-ring resonator used in the 3D FDTD simulation. The ring
+waveguide (R = 20 µm, w = 1.4 µm) is evanescently coupled to input and drop bus waveguides through
+100 nm gaps at coupling points CP1 and CP2.
+
+C. Single-Ring Add-Drop Results
+
+Figure 5 shows the through- and drop-port spectra from 3D FDTD. Five resonances are resolved across
+1530 nm to 1570 nm with FSR = 8.29 nm (n_g ≈ 2.30).
+
+FIG. 5: Simulated through-port (blue) and drop-port (red) transmission spectra of the single add-drop
+micro-ring resonator from 3D FDTD. Top: logarithmic scale; bottom: linear scale. Five resonances are
+visible with FSR ≈ 8.29 nm.
+
+Lorentzian fitting of the drop-port peaks yields Q_L = 10,300–15,500, with the best resonance at
+λ = 1566 nm reaching Q_L = 15,500 (FWHM = 101 pm, D_max = 0.360, −4.4 dB). The through-port
+extinction ratio is 1.6 dB to 2.6 dB, and the five-resonance mean is Q_L = 12,500 ± 1,800
+(D_max = 0.29–0.36). CMT analysis of the best resonance gives
+Q_i = Q_L/(1 − √D_max) = 15,500/0.400 ≈ 38,800, confirming that the 500 nm etch provides sufficient
+confinement and that the 100 nm gap places the ring in the coupling-limited regime. The cascade
+analysis below adopts the best-case FDTD calibration (Q_L = 15,500, D_max = 0.360); using the
+five-resonance mean would increase required voltages by ∼24% (see Table IV caption).
+
+The simulation time of 50 ps exceeds the loaded photon lifetime τ_L = Q_L λ_0/(2πc) ≈ 12.7 ps by
+∼4×, but the intrinsic lifetime τ_i ≈ 32 ps is comparable, so the extracted Q_i may be slightly
+conservative. An independent eigenmode (FDE) analysis of the same cross-section at R = 20 µm, using a
+300 × 300 mesh (Δy ≈ 10 nm, 5× finer than the FDTD vertical grid), yields Q_{rad+leak} = 2.4 × 10^7;
+including bulk LiNbO3 absorption (Γ = 0.89) gives a theoretical Q_i > 10^7 [37–42], confirming that
+the gap between the numerical Q_i and published values (> 10^6) originates from mesh discretization
+(Supplementary S4.5, Table S8). In the CMT framework, D_max = [2κ/(2κ + γ)]^2 increases as Q_i rises;
+at the present coupling gap, increasing Q_i to 10^6 would raise D_max from 0.36 to ∼0.95 and Q_L
+from 15,500 to ∼25,200.
+
+Figure 6(a) shows a Lorentzian fit to the best drop-port resonance at λ = 1566 nm, validating the
+cascade model (Eq. 9). Figure 6(b) demonstrates that cascading N copies of this FDTD-extracted
+Lorentzian reproduces the target exponential e^{I−L} with increasing fidelity as N grows.
+
+To validate the cascade prediction directly, a five-ring cascade 3D FDTD simulation was performed
+using Tidy3D [43]; the full simulation notebook is publicly available [43]. The |E|^2 field at
+λ = 1549 nm [Fig. 6(d)] confirms resonant excitation across all five rings. Mapping the drop-port
+spectrum onto the control variable I yields 11 data points within the AEF operating range
+[Fig. 6(e, f)], with the FDTD transmission closely tracking the N = 5 theoretical curve near
+I ≈ L = 8.
+
+FIG. 6: FDTD-based AEF validation. (a) Lorentzian fit to the drop-port resonance at λ = 1566 nm from
+3D FDTD (Lumerical) (Q_L = 15,500, D_max = 0.360, b_V = 0.180 V^−1). (b) Five-ring cascade drop-port
+spectrum near λ_0 ≈ 1550 nm with Lorentzian^5 fit (red curve), confirming the expected T^5 line
+shape. (c) Five-ring cascade MRR layout with diagonal zigzag bus waveguides. (d) |E|^2 field profile
+at λ = 1549 nm from a five-ring cascade 3D FDTD simulation (Tidy3D [43]). (e, f) AEF validation of
+the five-ring cascade on log (e) and linear (f) scales with 11 spectral FDTD data points.
+
+D. X-cut electrode design and EO parameters
+
+We employ lateral signal–ground (S–G) arc electrodes on the slab surface alongside the ring waveguide
+(Fig. 7). In the X-cut orientation, the crystal Z-axis is at 45° from the horizontal in the substrate
+plane, giving a lateral-field projection proportional to cos(θ − 45°) at azimuthal angle θ. The
+cos(θ − 45°) = 0 boundaries at θ = 135° and 315° naturally separate the coupling regions from the
+electrode regions. Each ring carries a full semicircular arc electrode on the side opposite to its
+coupling points, engaging the large r_33 = 30.9 pm V^−1 Pockels coefficient [37, 38]. The effective
+EO fill factor follows from integrating |cos(θ − 45°)| over the semicircle:
+
+    f_{\mathrm{EO}} = \frac{1}{\pi} \approx 0.318    (29)
+
+(see Supplementary Sec. S4 for derivation). The electrode gap is g_el = 5 µm (d_eff ≈ 2.5 µm), and
+the electro-optic overlap integral is Γ_EO = 0.7. Table III lists the electrode parameters.
+
+TABLE III: Electro-optic electrode parameters for the X-cut TFLN micro-ring with lateral S–G arc
+electrodes.
+
+    Parameter                       Symbol    Value           Unit
+    Crystal orientation             -         X-cut           -
+    EO coefficient                  r_33      30.9            pm V^−1
+    EO fill factor                  f_EO      1/π ≈ 0.318     -
+    EO overlap factor               Γ_EO      0.7             -
+    Electrode gap                   g_el      5               µm
+    Effective electrode distance    d_eff     2.5             µm
+
+FIG. 7: Illustrative two-ring cascade layout showing the lateral S–G arc electrode placement on X-cut
+TFLN (the cascade design extends to N rings; this two-ring example clarifies the electrode geometry).
+The crystal Z-axis is oriented at 45° from the horizontal in the substrate plane. The
+cos(θ − 45°) = 0 boundaries at θ = 135° and 315° naturally separate the bus-waveguide coupling
+regions from the electrode semicircles: each ring carries a full semicircular arc electrode on the
+side opposite to its coupling points. The resulting effective EO fill factor is f_EO = 1/π ≈ 0.318.
+
+E. FDTD-Calibrated b_V and Cascade Optimization
+
+From the device parameters in Tables II and III and the FDTD-calibrated n_g ≈ 2.30, the effective
+normalized voltage sensitivity is (Supplementary Sec. S4; here dλ/dV = 28.5 pm/V is the
+straight-section value and f_EO accounts for partial electrode coverage of the ring circumference):
+
+    b_V = f_{\mathrm{EO}} \, \frac{2 Q \,(d\lambda/dV)}{\lambda_0} \approx 0.182\ \mathrm{V}^{-1}    (30)
+
+at Q_L = 15,500. This estimate relies on a first-order electrostatic model (d_eff ≈ 2.5 µm,
+Γ_EO = 0.7); a ±30% variation in b_V would shift the cascade depth by one to two rings at constant
+ε_max (Table IV), leaving the qualitative design conclusions unchanged. With the cascade framework of
+Sec. II (Eqs. 14–18), the N-ring drop-port transmission ỹ(I) = C[1 + (a + bI)^2]^{−N} approximates
+e^{I−L} over I ∈ [0, L], with (a, b) optimized by minimax fitting for each N.
+
+Table IV presents the optimization results for the standard dynamic range L = 8 (e^8 ≈ 2981,
+34.7 dB).
+
+TABLE IV: Cascade optimization results for L = 8. The bias voltage V_bias = |a|/b_V sets the DC
+offset, and V_ctrl = bL/b_V is the maximum control voltage at I = L. Voltages computed with
+b_V = 0.182 V^−1 (X-cut arc electrode, FDTD-calibrated best resonance Q_L = 15,500, n_g = 2.30). The
+mean FDTD quality factor across five resonances is Q_L = 12,500 ± 1,800; using the mean would
+increase voltages by ∼24%.
+
+    N     a          b          E_∞       ε_max (%)   V_bias (V)   V_ctrl (V)
+    5     −2.0789    0.21658    0.1035    10.91       11.4         9.5
+    10    −1.4588    0.10202    0.0265    2.68        8.0          4.5
+    12    −1.3731    0.08450    0.0184    1.86        7.5          3.7
+    20    −1.2136    0.05025    0.0067    0.67        6.7          2.2
+    25    −1.1685    0.04013    0.0043    0.43        6.4          1.8
+    30    −1.141     0.03340    0.0030    0.30        6.3          1.5
+    32    −1.1301    0.03131    0.0026    0.26        6.2          1.4
+
+    a The complete cascade optimization results for all N values are listed in Supplementary Table S7.
+
+The approximation quality across different cascade depths is shown in Fig. 2 (Sec. III). Key
+thresholds (e.g., ε < 2% at N ≥ 12, ε < 1% at N ≥ 17) and the complete optimization results are
+listed in Supplementary Sec. S4.
+
+V. PHYSICAL FEASIBILITY
+
+Having established the cascade approximation theory (Sec. II) and the FDTD-calibrated device
+parameters (Sec. IV), we now assess the physical feasibility of the proposed architecture in terms of
+voltage requirements, insertion loss, and energy efficiency.
+
+A. Electro-optic voltage requirements
+
+TABLE V: Two-regime power budget for the MRR cascade. P_out assumes per-channel input
+P_in,ch = 100 µW (from a shared P_in,tot = 1 mW CW laser split across M = 10 parallel channels via a
+1×M splitter, or equivalently multiplexed as d WDM channels sharing a single cascade) and accounts
+only for the ideal on-resonance cascade transmission D_max^N (upper bound); additional inter-ring
+coupling loss (η_coupling ≈ 0.9 per stage, ∼0.46 dB/stage) and off-resonance propagation loss
+(0.08–0.25 dB/stage) are analyzed separately in Sec. V C.
+
+    Regime         D_max    N     D_max^N         (dB)     P_out      ε_max
+    I (FDTD)       0.36     3     0.0467          −13.3    4.67 µW    ∼15%
+    I (FDTD)       0.36     5     0.00605         −22.2    0.61 µW    10.9%
+    I (FDTD)       0.36     7     7.84 × 10^−4    −31.1    78 nW      ∼5%
+    II (high-Q)    0.95     10    0.599           −2.2     59.9 µW    2.68%
+    II (high-Q)    0.95     20    0.358           −4.5     35.8 µW    0.67%
+    II (high-Q)    0.95     30    0.215           −6.7     21.5 µW    ∼0.30%
+
+    Regime I: FDTD-characterized (Q_i = 38,800). Regime II: fabricated high-Q (Q_i > 10^6). P_out
+    scales linearly with P_in,ch.
+
+For the primary target of ε < 2% (N = 12), minimax optimization gives a = −1.373, b = 0.0845. With
+the FDTD-calibrated Q_L = 15,500 (b_V = 0.182 V^−1), the required voltages are
+
+    V_{\mathrm{bias}} = \frac{|a|}{b_V} = \frac{1.373}{0.182} = 7.5\ \mathrm{V},    (31)
+
+    V_{\mathrm{ctrl,max}} = \frac{bL}{b_V} = \frac{0.0845 \times 8}{0.182} = 3.7\ \mathrm{V}.    (32)
+
+Since b_V ∝ Q, voltage scales inversely with quality factor:
+
+    V_{\mathrm{ctrl}} = \frac{bL}{b_V} = \frac{bL\,\lambda_0}{2Q\,|d\lambda_0/dV|}.    (33)
+
+CMOS-compatible control voltages (V_ctrl < 3.3 V) are achievable at N ≥ 14 with Q_L = 15,500; at the
+design point N = 30 (ε_max = 0.30%), V_ctrl = 1.47 V.
+
+B. Power budget: two-regime analysis
+
+The on-resonance cascade transmission D_max^N is the dominant contribution to total insertion loss.
+Table V presents two regimes: the FDTD-characterized regime (D_max = 0.36) and the fabricated high-Q
+regime (D_max = 0.95, achievable with Q_i > 10^6 and gap-optimized coupling).
+
+In the FDTD-characterized regime, D_max = 0.36 limits practical cascades to N ≤ 5: at N = 5 the
+output is 0.61 µW (−22.2 dB) with ε = 10.9%, suited for proof-of-concept validation. In the
+fabricated high-Q regime (D_max ≥ 0.95), deep cascades become practical: N = 30 yields
+P_out = 21.5 µW (−6.7 dB) with ε_max ≈ 0.30%. The transition to fabricated high-Q devices is
+therefore critical for achieving both high accuracy and sufficient output power.
+
+C. Feasibility outlook
+
+Published TFLN micro-ring resonators achieve Q_i ≥ 10^6–10^8 using optimized fabrication [39–42]. At
+Q_i = 10^6 with the present coupling geometry, CMT predicts D_max ≈ 0.95 and Q_L ≈ 25,200
+(Supplementary Sec. S5, Tables S4–S7), enabling deep cascades (N ≤ 30) with sub-percent error. The
+literature values provide strong independent evidence that intrinsic quality factors in the projected
+range are physically achievable in TFLN, albeit with wider waveguides and larger ring radii than the
+present design. Transferring comparable sidewall quality to our geometry (R = 20 µm, W = 1.4 µm) is
+an open fabrication challenge; the projections should be read as design targets contingent on
+achieving it.
+
+The total insertion loss comprises on-resonance cascade transmission D_max^N, inter-ring coupling
+loss (∼0.46 dB/stage for the present diagonal-bus layout), off-resonance propagation loss
+(0.08–0.25 dB/stage), and fiber-to-chip coupling (1.5–3.0 dB). For the fabricated high-Q regime
+(N = 30), the total ranges from ∼13 dB (optimized layout) to ∼24 dB (current geometry); see
+Supplementary Sec. S6 for detailed scenarios.
+
+D. Energy comparison
+
+For N = 30 X-cut TFLN micro-ring resonators in the fabricated high-Q regime (Q_L ≈ 25,200 at
+Q_i = 10^6; Supplementary Sec. S5), the three energy components are EO tuning (E_EO = 0.22 pJ),
+amortized laser (E_laser = 0.07 pJ, shared across M = 10 channels), and photodetector
+(E_PD = 0.50 pJ), yielding E_photonic = 0.79 pJ (derivations in Supplementary Sec. S7). Including
+thermal stabilization for N = 30 rings (0.15–0.60 pJ; Supplementary Sec. S7), the total rises to
+0.94–1.39 pJ.
+
+Table S12 compares the photonic cascade with digital implementations. Including thermal stabilization
+(0.94–1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4×, while operating at 10 GHz bandwidth
+and 58× lower than digital FP32 (46 pJ). At fabricated Q ≥ 30,000, E_EO drops to 0.16 pJ and
+E_total ≈ 0.73 pJ (excluding thermal; Supplementary Table S11), recovering a 3.2× advantage over
+INT8. Since E_EO ∝ 1/Q^2, improving Q beyond ∼30,000 yields diminishing energy returns but continues
+to relax CMOS driver voltage requirements.
+
+TABLE VI: Energy per exponential operation: single-channel comparison.
+
+    Implementation            E/op (pJ)    Bandwidth    Notes
+    Digital FP32 (Taylor)     ∼46          1 GHz        10 FP MACs
+    Digital INT8 (Taylor)     ∼2.3         1 GHz        10 INT MACs
+    Photonic MRR (N = 30)     0.94–1.39    10 GHz       Analog†
+
+    † 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal. Self-consistent with fabricated
+    high-Q regime (Q_L = 25,200); see Supplementary Sec. S7.
+
+VI. DISCUSSION
+
+Practical design procedure. For a given input sequence x = (x_1, ..., x_K), the design proceeds as
+follows:
+
+  1. Compute m = max_n x_n, u_n = x_n − m, and L = −min_n u_n.
+  2. Map to nonnegative control-signal amplitudes: I_n = u_n + L ∈ [0, L].
+  3. Choose tolerance ε and set ε_log = ln(1 + ε).
+  4. Select a physically feasible b_max and estimate N using Eq. (28).
+  5. Initialize b = min(b_max, 1/N) and a = −1 − bL/2, then refine (a, b) by a two-parameter minimax
+     fit if required.
+  6. The optical block yields ỹ(I_n) ≈ e^{x_n − m}, and softmax weights follow as
+
+    p_n = \frac{\tilde{y}(I_n)}{\sum_j \tilde{y}(I_j)}.    (34)
+
+Scope and limits. The approximation is for a finite interval I ∈ [0, L], where L is determined by the
+input batch via Eq. (4). In practice, one designs for a worst-case L expected in operation (or retunes
+a and rescales the control signal to adapt L). Noise, insertion loss, and control-induced parasitics
+limit accuracy and dynamic range; we treat these effects as platform-specific margins. Detailed
+non-ideality assumptions, parameter distributions, and robustness statistics are reported in
+Supplementary Sec. S8. With K channels in parallel, one can form softmax by summing channel powers and
+applying a shared reciprocal scale factor, depending on the chosen mixed-signal normalization scheme.
+
+WDM parallelism. A particularly hardware-efficient realization exploits wavelength-division
+multiplexing (WDM): d probe wavelengths λ_1, ..., λ_d, each resonant with a distinct FSR order of the
+same ring set, traverse a single N-ring cascade simultaneously (Fig. 8). Because each channel λ_j
+sees its own Lorentzian detuning set by an independent control voltage V_j, the cascade output per
+channel is ỹ_j = C ∏_{k=1}^{N} T_{drop,k}(λ_j, V_j) ≈ e^{V_j}, and all d exponentials are computed
+in parallel on the same physical waveguide. Compared with a 1×M power-splitter architecture that
+replicates the cascade for each channel, the WDM approach reduces the total ring count from N × d to
+N (a factor-d saving) and eliminates the splitter insertion loss (10 log10 d dB). At the output, a
+WDM demultiplexer or wavelength-selective photodetector array separates the channels for electrical
+readout. Figure 8 shows a representative chip layout for N = 5 cascade stages and d = 8 WDM channels,
+where alternating U-turn bus connections route the drop-port output of each stage into the input bus
+of the next.
+
+Why cascade helps. A single Lorentzian in I is too rigid to mimic the log-linear target over a wide
+interval. Cascading turns the transfer into a product; taking a logarithm gives a sum of smooth
+terms, and the approximation improves as N increases. The slope constraint N|b| ≳ 1 is an immediate
+feasibility check.
+
+Global softmax normalization via WDM feedback. The WDM-parallel architecture (Fig. 8) integrates
+naturally with a closed-loop normalization scheme to complete the full softmax function. After the
+N-stage cascade, a WDM demultiplexer (e.g., arrayed-waveguide grating or ring-filter bank) routes
+each channel λ_j to a dedicated photodetector, producing photocurrents
+I_{λ_j} ∝ ỹ_j ≈ C P_in e^{V_j}. The d photocurrents are summed electrically:
+
+    S = \sum_{j=1}^{d} I_{\lambda_j} \propto C\, P_{\mathrm{in}} \sum_{j=1}^{d} e^{V_j}.    (35)
+
+A proportional–integral (PI) controller compares S with a fixed reference S_ref and adjusts the
+shared WDM laser power P_in so that S → S_ref [44, 45]. Because all d channels share the same probe
+source, scaling P_in multiplies every ỹ_j by the same factor; upon convergence
+
+    p_j = \frac{I_{\lambda_j}}{S_{\mathrm{ref}}} = \frac{e^{V_j}}{\sum_{k=1}^{d} e^{V_k}} = \mathrm{softmax}(V)_j,    (36)
+
+realizing the complete softmax with a single feedback loop and no per-channel normalization
+circuitry. Compared with the replicated-cascade approach (one AEF block per channel), WDM feedback
+offers two additional benefits: (i) the splitter-induced power imbalance that would bias the I_{λ_j}
+ratios is absent, since all channels traverse the same optical path; and (ii) a single laser control
+point replaces d independent probe adjustments. Design details and stability analysis of the PI loop
+are provided in Supplementary Sec. S9.
+
+Beyond ring-resonator AEF implementations, the same cascade principle can be extended to other
+cavity-based photonic platforms, such as serial 1D photonic-crystal cavities and other cascaded
+resonant architectures [21, 46]. What these platforms share is transfer-function shaping through
+cascaded resonances; loss, tuning range, fabrication tolerance, and calibration overhead remain
+platform-dependent.
+
+The insertion loss budget (Sec. V C) and electro-optic voltage requirements (Sec. V A) suggest that
+the cascade architecture is feasible under optimized coupling and layout conditions. Using monolithic
+TFLN microring data from Bahadori et al. [47] (Q ≈ 5432, dλ_0/dV ≈ 9–20 pm/V), the normalized
+sensitivity is b_V ≃ 0.063–0.14 V^−1, within the range required by the cascade design.
+
+Crystal orientation and electrode design. The X-cut TFLN platform was chosen for several reasons.
+First, X-cut is the prevailing industry standard for integrated TFLN modulators, with
+well-established fabrication processes and commercial wafer availability [37, 38]. Second, the TE0
+mode, which is strongly confined in the rib waveguide geometry, can engage the large r_33 coefficient
+via lateral electric fields aligned with the crystal Z-axis. In contrast, Z-cut geometry with TE
+polarization can only access the smaller r_13 coefficient (∼10 pm/V), resulting in significantly
+lower electro-optic efficiency. The arc electrode design (Sec. IV D) addresses the phase-cancellation
+problem inherent to X-cut circular rings [47] by orienting the crystal Z-axis at 45° from the
+horizontal in the substrate plane. This rotation places the cos(θ − 45°) = 0 boundaries at θ = 135°
+and 315°, naturally separating the bus-waveguide coupling regions from the electrode regions. Each
+ring carries a full semicircular arc electrode on the side opposite to its coupling points, yielding
+an effective fill factor f_EO = 1/π ≈ 0.318. While this reduces the round-trip EO efficiency compared
+to a hypothetical full-circumference design, it preserves the compact footprint of a circular ring
+resonator. The cascade performance can be further improved beyond the R = 20 µm circular-ring design
+presented here. Increasing the ring radius reduces bending loss and raises the intrinsic quality
+factor Q_i, which directly increases b_V (∝ Q) and lowers the required control voltage.
+Alternatively, adopting a racetrack geometry with extended straight coupling sections strengthens the
+bus–ring coupling, pushing the drop-port maximum D_max closer to critical coupling and improving the
+per-stage transfer efficiency. Either approach, or their combination, would yield higher b_V and
+D_max, enabling lower N or tighter approximation accuracy at reduced operating voltages.
+
+TABLE VII: Summary of evidence levels for key claims.
+
+    Claim                            Evidence               Sec.
+    Cascade → exp. approx.           Analytic               II
+    Depth scaling                    Analytic + num.        II, III
+    Q_L, D_max, b_V                  3-D FDTD               IV
+    5-ring line shape                3-D FDTD               IV
+    N ≤ 30 deep cascade              CMT proj.*             V
+    Energy < 1 pJ                    Estimate               V
+    Full softmax (WDM + feedback)    Conceptual + layout    VI
+
+    * Based on published Q_i ≥ 10^6 values [39, 42] and CMT coupling model.
+
+tified in the Monte Carlo robustness analysis (Supplementary Sec. S8). Monte Carlo simulations
+(Supplementary Sec. S8) show that under nominal non-ideality levels (σ_a = 0.020, σ_{b,rel} = 0.020),
+a single-point calibration of C per chip keeps the median softmax KL divergence below 2.2 × 10^−4,
+with 95th-percentile max probability error under 0.32%. Even under stress conditions (σ_a = 0.032),
+95th-percentile errors remain below 0.42%, demonstrating that the identical-detuning design is robust
+to realistic fabrication variations provided a per-chip calibration step is performed. Conversely, if
+coupling gaps are intentionally varied across rings, the per-ring parameters (a_k, b_k) become
+independent degrees of freedom. A Taylor-expansion analysis shows that K non-identical rings can
+cancel curvature terms up to order 2K in the Taylor series of g(I) = Σ_k ln T_k, one order higher
+than K identical rings, so that fewer rings suffice for a given error target.
+
+Fabrication considerations. The X-cut TFLN rib waveguide (600 nm total thickness, 500 nm etch,
+w = 1.4 µm) follows established fabrication processes for commercial TFLN wafers on SiO2 [37, 38].
+The lateral signal–ground (SG) electrode configuration is fabricated in a single metal layer, which
+is standard in TFLN foundry processes. The primary fabrication challenge for the cascade architecture
+is maintaining uniform coupling gaps (g = 100 nm) across N rings to ensure identical Lorentzian
+transfer functions.
Post-fabrication trim- +ming via UV exposure or localized thermal oxidation can +compensate residual detuning variations [30], as quan- + 12 + + + + + Softmax Full Chip Layout – N = 5 × d = 8 (TFLN) + d = 8 WDM channels + + + Vλ1 Vλ2 Vλ3 Vλ4 Vλ5 Vλ6 Vλ7 Vλ8 + + WDM + λ1−λ8 n=1 + Pin + + + n=2 + N = 5 + cascade + n=3 stages + + + + + n=4 + + + n=5 + + + + + WDM Demux (AWG / ring filter) + + Sref + PD1 PD2 PD3 PD4 PD5 PD6 PD7 PD8 + Iλ + j S e + Σ − PI + p1 p2 p3 p4 p5 p6 p7 p8 + + + + + Feedback: adjust Pin + Iλj + Output: pj = = softmax(V )j + Sref + +FIG. 8: WDM-parallel MRR-AEF system with closed-loop softmax normalization (N = 5 cascade stages, d = 8 WDM + channels) on X-cut TFLN. A single WDM source (λ1 –λ8 ) enters the top input bus waveguide; each stage applies a + Lorentzian drop-port transfer, and alternating U-turn connections route the drop-port output into the next stage’s +input bus. Per-channel EO bias voltages (Vλ1 –Vλ8 ) independently tune each column of rings. The final drop output + passes through a WDM demultiplexer (AWG / ring filter) and is detected by a PD array, producing per-channel + photocurrents Iλj ∝ eVj . The photocurrents are summed (Σ) and compared with a reference Sref ; a PI controller + adjusts the shared laser power Pin until S = Sref , at which point each PD output directly yields + pj = Iλj /Sref = softmax(V )j (Eq. 36). + 13 + + VII. CONCLUSION Dmax ≥ 0.95) are realized in the cascade geometry, deeper + cascades (N ≈ 20–30) would reach sub-percent approx- + We have presented a cascaded micro-ring resonator ar- imation error with an estimated per-operation energy +chitecture that approximates the exponential function of 0.79–1.39 pJ, which is 1.7–2.4× lower than an INT8 +exn −m on a finite interval [0, L] using multiplicative MAC at the 7 nm node. Monte Carlo analysis shows that +Lorentzian transfer functions. 
Increasing the cascade the identical-detuning design tolerates realistic fabrica- +depth N systematically reduces the worst-case relative tion variations (σa = 0.020, σb,rel = 0.020) with a single +error, and an identical-detuning design initialized by flank per-chip calibration, keeping the 95th-percentile softmax +and slope matching provides a practical two-parameter probability error below 0.32%. +design. + Three-dimensional FDTD simulations of a single X-cut The formulation is not restricted to electro-optic tuning: +TFLN add-drop ring (R = 20 µm, g = 100 nm) yield it requires only a controllable detuning coordinate with lo- +QL = 10,300–15,500 and Dmax ≈ 0.36, calibrating the cal linearization, so both Pockels and optical (Kerr/XPM) +cascade transfer model. A five-ring cascade 3D FDTD mechanisms are compatible [37, 38, 47, 48]. We demon- +simulation directly validates the multi-ring framework: strate a photonic exponential block and present a WDM- +all five rings exhibit resonant excitation, and mapping parallel chip architecture (Fig. 8) in which d wavelength +the drop-port spectrum onto the dimensionless control channels share a single N -ring cascade, reducing the total +variable reproduces the theoretical N = 5 curve with ring count by a factor of d and eliminating power-splitter +∼11% integrated relative-area error over the upper op- loss. Combined with a single-loop PI feedback that adjusts +erating range (I ≥ 5.8), providing the first multi-ring the shared WDM laser power, the architecture realizes the +confirmation of the cascade exponential approximation. complete softmax function—exponentiation, summation, +At the present FDTD-characterized quality factor, practi- and normalization—without per-channel normalization +cal cascades are limited to N = 5–7 (ε ≲ 11%). If high-Q circuitry. Max-finding and digital interfacing remain open +TFLN resonators reported in the literature (Qi ≥ 106 , for future experimental validation. 
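The closed-loop normalization of Eqs. (35)–(36) can be illustrated with a small numerical sketch. The Python snippet below is an idealized discrete-time toy model, not the loop analyzed in Supplementary Sec. S9: it assumes a perfect exponential block (photocurrent proportional to $P_{\mathrm{in}} e^{V_j - \max V}$), and the gains `kp` and `ki`, the step count, and all function names are hypothetical choices made only for illustration.

```python
import math

def exact_softmax(logits):
    """Reference softmax computed digitally, for comparison only."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pi_normalized_softmax(logits, s_ref=1.0, kp=0.02, ki=0.1, steps=400):
    """Scale a shared power p_in with a PI loop until the summed current equals s_ref."""
    m = max(logits)
    # Assumption: ideal exponential block, response exp(V_j - max V) per unit power.
    response = [math.exp(x - m) for x in logits]
    p_in, integral = 0.0, 0.0
    for _ in range(steps):
        currents = [p_in * r for r in response]   # I_j proportional to p_in * e^{V_j}
        error = s_ref - sum(currents)             # compare S with S_ref (Eq. 35)
        integral += error
        p_in = kp * error + ki * integral         # PI update of the shared power
    currents = [p_in * r for r in response]
    return [c / s_ref for c in currents]          # p_j = I_j / S_ref (Eq. 36)

logits = [-3.2, 1.2, 4.8, -0.9]
loop_out = pi_normalized_softmax(logits)
reference = exact_softmax(logits)
max_dev = max(abs(p - q) for p, q in zip(loop_out, reference))
print("loop output:", [f"{p:.6f}" for p in loop_out])
print("max deviation from exact softmax:", f"{max_dev:.2e}")
```

With these particular gains the toy loop converges only while $\sum_j e^{V_j - \max V}$ stays below roughly 14; a hardware controller would need gains chosen for the actual channel count, detector responsivity, and loop delay.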
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages 5998–6008, 2017.
[2] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), pages 16344–16359, 2022.
[3] Neil Savage. Light could lower AI's appetite for power. Nature Nanotechnology, 21:6–8, 2026.
[4] Yichen Shen et al. Deep learning with coherent nanophotonic circuits. Nature Photonics, 11(7):441–446, 2017.
[5] Johannes Feldmann et al. Parallel convolutional processing using an integrated photonic tensor core. Nature, 589(7840):52–58, 2021.
[6] Nicholas C. Harris et al. Linear programmable nanophotonic processors. Optica, 5(12):1623–1631, 2018.
[7] Bowei Dong, Samarth Aggarwal, Wen Zhou, Utku Emre Ali, Nikolaos Farmakidis, June Sang Lee, Yuhan He, Xuan Li, Dim-Lee Kwong, C. D. Wright, Wolfram H. P. Pernice, and H. Bhaskaran. Higher-dimensional processing using a photonic tensor core with continuous-time data. Nature Photonics, 17(12):1080–1088, 2023.
[8] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and Bhavin J. Shastri. Roadmapping the next generation of silicon photonics. Nature Communications, 15:751, 2024.
[9] Mario Miscuglio and Volker J. Sorger. Photonic tensor cores for machine learning. Applied Physics Reviews, 7(3):031404, 2020.
[10] Yaowen Hu, Yunxiang Song, Xinrui Zhu, Xiangwen Guo, Shengyuan Lu, Qihang Zhang, Lingyan He, C. A. A. Franken, Keith Powell, Hana Warner, Daniel Assumpcao, Dylan Renaud, Ying Wang, et al. Integrated lithium niobate photonic computing circuit based on efficient and high-speed electro-optic conversion. Nature Communications, 16:8178, 2025.
[11] Priyabrata Dash, Anxiao Jiang, and Dharanidhar Dang. SOFTONIC: A photonic design approach to softmax activation for high-speed fully analog AI acceleration. In Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI '25), pages 118–125, 2025.
[12] Ziyu Zhan, Hao Wang, Qiang Liu, and Xing Fu. Optoelectronic nonlinear softmax operator based on diffractive neural networks. Optics Express, 32(15):26458–26469, 2024.
[13] Ye Tian, Shuiying Xiang, Xingxing Guo, Yahui Zhang, Jiashang Xu, Shangxuan Shi, Haowen Zhao, Yizhi Wang, Xinran Niu, Wenzhuo Liu, and Yue Hao. Photonic transformer chip: interference is all you need. PhotoniX, 6:45, 2025.
[14] Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, and Anand Raghunathan. Softermax: Hardware/software co-design of an efficient softmax for transformers. In Proceedings of the 58th ACM/IEEE Design Automation Conference (DAC), pages 469–474, 2021.
[15] Nazim Altar Koca, Anh Tuan Do, and Chip-Hong Chang. Hardware-efficient softmax approximation for self-attention networks. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5, 2023.
[16] Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, and Yongpan Liu. SOLE: Hardware-software co-design of softmax and layernorm for efficient transformer inference. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1–9, 2023.
[17] Yuan Zhang, Yonggang Zhang, Lele Peng, Lianghua Quan, Shubin Zheng, Zhonghai Lu, and Hui Chen. Base-2 softmax function: Suitability for training and efficient hardware implementation. IEEE Transactions on Circuits and Systems I: Regular Papers, 69(9):3605–3618, 2022.
[18] Zhengyu Mei, Hongxi Dong, Yuxuan Wang, and Hongbing Pan. TEA-S: A tiny and efficient architecture for PLAC-based softmax in transformers. IEEE Transactions on Circuits and Systems II: Express Briefs, 70:3594–3598, 2023.
[19] Ke Chen, Yue Gao, Haroon Waris, Weiqiang Liu, and Fabrizio Lombardi. Approximate softmax functions for energy-efficient deep neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 31:4–16, 2023.
[20] Wim Bogaerts et al. Silicon microring resonators. Laser & Photonics Reviews, 6(1):47–73, 2012.
[21] John E. Heebner, Robert W. Boyd, and Q.-Han Park. Scissor solitons and other propagation effects in microresonator-modified waveguides. Journal of the Optical Society of America B, 19(4):722–731, 2002.
[22] Jiahui Wang, Sean P. Rodrigues, Ercan M. Dede, and Shanhui Fan. Microring-based programmable coherent optical neural networks. Optics Express, 31(12):18871, 2023.
[23] Pengxing Guo, Niujie Zhou, Weigang Hou, and Lei Guo. StarLight: a photonic neural network accelerator featuring a hybrid mode-wavelength division multiplexing and photonic nonvolatile memory. Optics Express, 30:37051, 2022.
[24] Weizhen Yu, Shuang Zheng, Zhenyu Zhao, Bin Wang, and Weifeng Zhang. Reconfigurable low-threshold all-optical nonlinear activation functions based on an add-drop silicon microring resonator. IEEE Photonics Journal, 14(6):1–7, 2022.
[25] Bahaa E. A. Saleh and Malvin C. Teich. Fundamentals of Photonics. Wiley, Hoboken, NJ, 2nd edition, 2007.
[26] Vítor R. Almeida, Carlos A. Barrios, Roberto R. Panepucci, and Michal Lipson. All-optical control of light on a silicon chip. Nature, 431(7012):1081–1084, 2004.
[27] Qianfan Xu, Bradley Schmidt, Sameer Pradhan, and Michal Lipson. Micrometre-scale silicon electro-optic modulator. Nature, 435(7040):325–327, 2005.
[28] Kishore Padmaraju and Keren Bergman. Resolving the thermal challenges for silicon microring resonator devices. Nanophotonics, 3:269–281, 2014.
[29] Erwen Li, Behzad Ashrafi Nia, Bokun Zhou, and Alan X. Wang. Transparent conductive oxide-gated silicon microring with extreme resonance wavelength tunability. Photonics Research, 7(4):473, 2019.
[30] Lahiru Jayatilleka et al. Post-fabrication trimming of silicon photonic ring resonators at wafer-scale. Journal of Lightwave Technology, 39:5083–5088, 2021.
[31] Elliott W. Cheney. Introduction to Approximation Theory. McGraw–Hill, New York, 1966.
[32] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
[33] Hugging Face. distilgpt2 model card, 2025. accessed 2026-02-21.
[34] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. accessed 2026-02-21.
[35] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. accessed 2026-02-21.
[36] Hyoseok Park. MRR-AEF: reproducible MRR depth-sweep fitting and supplementary validation scripts. GitHub repository, 2025. commit 585e695, accessed 2026-02-21.
[37] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics, 13(2):242–352, 2021.
[38] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng, CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair, and Marko Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
[39] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic ultra-high-Q lithium niobate microring resonator. Optica, 4(12):1536–1537, 2017.
[40] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with wet etching. Adv. Mater., 35(3):2208113, 2023.
[41] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães, Amirhassan Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic microresonators on thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
[42] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng. Lithium niobate microring with ultra-high Q factor above 10^8. Chin. Opt. Lett., 20(1):011902, 2022.
[43] Flexcompute Inc. Tidy3D: electromagnetic simulation software. https://www.flexcompute.com/tidy3d/, 2024. v2.10; cloud GPU FDTD. Accompanying notebook: https://www.flexcompute.com/tidy3d/community/notebooks/CascadedMRRTFLN/.
[44] John K. Doylend, Paul E. Jessop, and Andrew P. Knights. Silicon photonic dynamic optical channel leveler with external feedback loop. Optics Express, 18(13):13805–13812, 2010.
[45] Karl J. Åström and Richard M. Murray. Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, Princeton, NJ, 2008.
[46] Amnon Yariv, Yong Xu, Reginald K. Lee, and Axel Scherer. Coupled-resonator optical waveguide: a proposal and analysis. Optics Letters, 24(11):711–713, 1999.
[47] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient and fully isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform. Optics Express, 28(20):29644–29661, 2020.
[48] Abu Naim R. Ahmed, Shouyuan Shi, Mathew Zablocki, Peng Yao, and Dennis W. Prather. Tunable hybrid silicon nitride and thin-film lithium niobate electro-optic microresonator. Optics Letters, 44(3):618, 2019.

SUPPLEMENTARY INFORMATION

Supplementary material accompanying "Photonic Exponential Approximation via Cascaded TFLN Microring Resonators toward Softmax."

S0. RIGOROUS DERIVATION AND VALIDITY SCOPE

This section derives the depth-scaling relations and screening bounds used in the main text, and states the assumptions under which they apply, together with validity scope and failure cases.
We separate proved statements (Lemma, Proposition, Theorem) from heuristic engineering estimates that rely on empirical calibration.

S0.1 Assumptions

Assumption 1 (Lorentzian single-ring transfer). Each ring $k$ has a normalized add-to-drop transmission of the form $T_k(I) = [1 + (a_k + bI)^2]^{-1}$, where $a_k \in \mathbb{R}$ is the dimensionless static detuning, $b > 0$ is the common normalized sensitivity, and $I \geq 0$ is a nonnegative control-signal amplitude.

Assumption 2 (Multiplicative cascade). The $N$ rings are cascaded in a serial add-drop topology (the drop output of ring $k$ feeds the add input of ring $k+1$), and the probe is sufficiently weak that cross-ring and nonlinear probe-induced effects are negligible; thus the total normalized transmission is $y(I) = \prod_{k=1}^{N} T_k(I)$.

Assumption 3 (Identical-detuning family). All rings share the same static detuning: $a_1 = \cdots = a_N \equiv a$. This reduces the design space to $(a, b)$ and a global scale $C > 0$; the scaled output is $\tilde y(I) = C\,[1 + (a + bI)^2]^{-N}$.

Assumption 4 (Linear control-to-resonance mapping). Within the operating range $I \in [0, L]$, the resonance shift is a linear function of the control-signal amplitude (Eq. (10) of main text), i.e., higher-order detuning nonlinearity is negligible.

Assumption 5 (Finite interval and bounded $L$). The approximation target is $f(I) = e^{I-L}$ on a finite interval $I \in [0, L]$ with $L > 0$ determined by the input batch ($L = \max(x) - \min(x)$). The depth-scaling results are derived for fixed, finite $L$.

Assumption 6 (Flank-centered operating regime). The design uses the "flank-centered" initialization: $a + b(L/2) = -1$ (midpoint on the Lorentzian half-maximum) and $Nb = 1$ (slope matching). This places the operating point in the steepest-slope region of the Lorentzian, where the log-transfer is most nearly linear.
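Assumptions 1–3 and 6 can be checked numerically with a few lines of Python. The sketch below is illustrative only: the helper names, the choice $N = 10$, $L = 8$, and the finite-difference step are assumptions made for the example, and the global scale is fixed to $C = 1$ because it does not affect log-derivatives.

```python
import math

def flank_centered_params(n_rings, span):
    """Assumption 6: slope matching N*b = 1 and midpoint condition a + b*(L/2) = -1."""
    b = 1.0 / n_rings
    a = -1.0 - b * span / 2.0
    return a, b

def log_cascade(i, a, b, n_rings):
    """ln y(I) for N identical Lorentzian stages (Assumptions 1-3), with C = 1."""
    return -n_rings * math.log(1.0 + (a + b * i) ** 2)

def central_diff(f, x, h=1e-3):
    """Return (first, second) central finite-difference derivatives of f at x."""
    first = (f(x + h) - f(x - h)) / (2.0 * h)
    second = (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)
    return first, second

N_RINGS, SPAN = 10, 8.0   # illustrative depth and interval length
a, b = flank_centered_params(N_RINGS, SPAN)
slope, curvature = central_diff(lambda i: log_cascade(i, a, b, N_RINGS), SPAN / 2.0)

print(f"a = {a:.4f}, b = {b:.4f}")
print(f"log-slope at midpoint     = {slope:.6f}")
print(f"log-curvature at midpoint = {curvature:.2e}")
```

At the interval midpoint the log-slope should come out close to 1 (matching the target $I - L$) and the log-curvature close to 0, which is what "most nearly linear" means in Assumption 6. For comparison, the re-fitted values quoted in Sec. S2.1 ($a = -1.4588$, $b = 0.10202$) sit near this initialization.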
S0.2 Rigorous results

Throughout, define the log-domain residual
\[ r(I) \equiv \ln \tilde y(I) - (I - L) = \ln C - N \ln\!\left[1 + (a + bI)^2\right] - (I - L), \tag{S0.1} \]
and the worst-case log-error $E_\infty = \sup_{I \in [0, L]} |r(I)|$. We set $C$ to the minimax-optimal value $\ln C^\star = -\left[\max_I g(I) + \min_I g(I)\right]/2$, where $g(I) = \ln y(I) - (I - L)$, throughout.

Lemma 1 (Slope bound, rigorous). Under Assumptions 1–3, for every $I \geq 0$,
\[ \left| \frac{d}{dI} \ln y(I) \right| \leq N |b|. \]

Proof. From Assumption 3, $\ln y(I) = -N \ln\!\left[1 + (a + bI)^2\right]$. Differentiating:
\[ \frac{d}{dI} \ln y(I) = -N\, \frac{2b(a + bI)}{1 + (a + bI)^2}. \]
Let $u \equiv a + bI \in \mathbb{R}$. The function $h(u) = 2u/(1 + u^2)$ satisfies $|h(u)| \leq 1$ for all $u \in \mathbb{R}$ (since $1 + u^2 \geq 2|u|$ by AM–GM). Therefore $|d(\ln y)/dI| = N |b|\, |h(u)| \leq N |b|$.

Remark 1 (Necessary condition for approximation). Since the target $\ln f(I) = I - L$ has constant slope $+1$, a necessary condition for the cascade log-transfer to track this slope at any point is $N|b| \geq 1$. This is Eq. (23) of the main text and is a rigorous (not heuristic) necessary condition.

Proposition 1 (Log-domain Taylor expansion at flank center). Under Assumptions 1–6, define $I_0 = L/2$ and $\delta = I - I_0$. Then
\[ \ln \tilde y(I) = \mathrm{const} + \delta - \frac{\delta^3}{6N^2} + R_4(\delta), \tag{S0.2} \]
where $|R_4(\delta)| \leq M_4 \delta^4$ with $M_4 = N |b|^4 \cdot \sup_{|\delta| \leq L/2} |\phi^{(4)}(u(\delta))|$ and $\phi(u) = -\ln(1 + u^2)$. In particular, the quadratic term vanishes identically at the flank point $u_0 = a + bI_0 = -1$.

Proof. Set $u(\delta) = a + b(I_0 + \delta) = -1 + b\delta$ (using $a + bI_0 = -1$). With $\phi(u) = -\ln(1 + u^2)$, we have $\ln y(I) = N \phi(u(\delta))$ and $u'(\delta) = b$. Compute derivatives of $\phi$ at $u_0 = -1$:
\[ \phi'(u) = -\frac{2u}{1 + u^2}, \qquad \phi'(-1) = 1, \]
\[ \phi''(u) = \frac{2(u^2 - 1)}{(1 + u^2)^2}, \qquad \phi''(-1) = 0, \]
\[ \phi'''(u) = \frac{4u(3 - u^2)}{(1 + u^2)^3}, \qquad \phi'''(-1) = \frac{4(-1)(3 - 1)}{(1 + 1)^3} = -1. \]
By the chain rule, writing $F(\delta) = N \phi(u(\delta))$:
\[ F'(0) = N b\, \phi'(-1) = N b = 1, \]
\[ F''(0) = N b^2 \phi''(-1) = 0, \]
\[ F'''(0) = N b^3 \phi'''(-1) = -N b^3 = -\frac{1}{N^2}, \]
where we used $b = 1/N$ from Assumption 6 in the last step. The negative cubic coefficient reflects that the log-slope attains its maximum value $Nb$ at the flank point (Lemma 1) and decreases on either side. Hence the Taylor expansion with the minimax-optimal $C$ is
\[ \ln \tilde y(I) = \mathrm{const} + \delta + 0 \cdot \frac{\delta^2}{2} - \frac{1}{N^2} \cdot \frac{\delta^3}{6} + R_4(\delta). \]
Subtracting the target $\delta$ (the linear part of $I - L$ around $I_0$) gives a leading residual $-\delta^3/(6N^2)$. The remainder is bounded by the standard Taylor remainder estimate.

Theorem 1 (Heuristic depth-scaling law). Under Assumptions 1–6 and ignoring the fourth-order remainder $R_4$, the leading-order worst-case log-error on $I \in [0, L]$ satisfies
\[ E_\infty^{(\mathrm{leading})} \sim \frac{1}{6N^2} \left(\frac{L}{2}\right)^3 = \frac{L^3}{48 N^2}. \tag{S0.3} \]
Setting $E_\infty^{(\mathrm{leading})} \leq \varepsilon_{\log} = \ln(1 + \varepsilon)$ and solving for $N$ gives
\[ N \geq \frac{L^{3/2}}{\sqrt{48\, \varepsilon_{\log}}}. \tag{S0.4} \]

Derivation (heuristic). From Proposition 1, the residual with respect to the target is dominated in magnitude by $|\delta|^3/(6N^2)$ for $|\delta| \leq L/2$. The maximum of $|\delta|^3$ on $[-L/2, L/2]$ is $(L/2)^3$. Setting the bound equal to $\varepsilon_{\log}$ and solving:
\[ \frac{L^3}{48 N^2} \leq \varepsilon_{\log} \quad \Longrightarrow \quad N \geq \frac{L^{3/2}}{\sqrt{48\, \varepsilon_{\log}}}. \]
With $1/\sqrt{48} \approx 0.144$, and accounting for the fact that the minimax-optimal residual is typically smaller than the one-sided Taylor bound by a factor of $\sim 2$ (equi-oscillation), the effective prefactor becomes $\kappa \approx 0.07$, yielding the main-text engineering estimate Eq. (28): $N \approx \lceil \max(1/b_{\max},\ \kappa L^{3/2}/\sqrt{\varepsilon_{\log}}) \rceil$.

Remark 2 (Status of Theorem 1). This is a heuristic scaling law, not a rigorous minimax guarantee. The derivation truncates the Taylor series at third order and approximates the equi-oscillation factor empirically ($\kappa \approx 0.07$). For a rigorous bound one would need explicit control of $R_4$ over the full interval $[0, L]$, which depends on $L$, $N$, and higher derivatives of the Lorentzian; we do not claim such a bound here. The scaling $N \sim L^{3/2}/\sqrt{\varepsilon_{\log}}$ is supported by numerical evidence (Table I) but should be treated as an engineering design rule.

S0.3 Derivation of the conservative screening bound

We now derive the conservative screening bound (Eqs. S0.7–S0.8 below), which is stated inline in Sec. II of the main text.

Proposition 2 (Conservative log-error bound).
Under Assumptions 1–5 (identical detuning, but not restricted to the flank-centered choice), fix $b > 0$ and choose the normalization $\tilde y(L) = 1$. Define $\phi(u) = -\ln(1 + u^2)$ and write
\[ \ln \tilde y(I) = N \left[ \phi(a + bI) - \phi(a + bL) \right]. \]
The target in this normalization is $(I - L)$. Denoting the residual $r(I) = \ln \tilde y(I) - (I - L)$, we have $r(L) = 0$ and $r(0) = N[\phi(a) - \phi(a + bL)] + L$.

For any choice of $a$ such that the operating range $\{a + bI : I \in [0, L]\}$ lies in the region where $\phi$ is concave (i.e., $\phi''(u) \leq 0$ throughout), the worst-case log-error satisfies
\[ E_\infty \leq \frac{N \|\phi''\|_\infty\, b^2 L^2}{8} + \frac{\left| N \phi'(a + bL)\, b - 1 \right|}{2} \cdot L, \tag{S0.5} \]
where $\|\phi''\|_\infty = \sup_{u \in [a,\, a + bL]} |\phi''(u)|$.

Derivation sketch. Write $h(I) \equiv N \phi(a + bI)$. The slope is $h'(I) = N b\, \phi'(a + bI)$. At $I = L$, we want the slope to match the target slope 1; define the slope mismatch $\Delta_s \equiv h'(L) - 1 = N b\, \phi'(a + bL) - 1$. By the fundamental theorem of calculus on $[I, L]$:
\[ r(I) - r(L) = r(I) = h(I) - h(L) - (I - L) = \int_I^L \left[ 1 - h'(t) \right] dt. \]
Write $1 - h'(t) = (1 - h'(L)) + (h'(L) - h'(t)) = -\Delta_s + \int_t^L h''(s)\, ds$. Since $h''(s) = N b^2 \phi''(a + bs)$, we bound $|h''(s)| \leq N b^2 \|\phi''\|_\infty$. Integrating twice and applying the triangle inequality gives (S0.5).

Corollary 1 (Main-text conservative bound). Under slope matching at $I = L$ (i.e., $N b\, \phi'(a + bL) = 1$, so $\Delta_s = 0$), and using $\|\phi''\|_\infty \leq 2$ (which holds since $|\phi''(u)| = |2(u^2 - 1)/(1 + u^2)^2| \leq 2$ for all $u \in \mathbb{R}$), the bound simplifies to
\[ E_\infty \leq \frac{N b^2 L^2}{4}. \tag{S0.6} \]
Using $b = 1/N$ (the slope-matching choice from $Nb = 1$) gives $E_\infty \leq L^2/(4N)$. If instead we retain a general $b$ but add the penalty from imperfect slope matching (e.g., from the constraint $b \leq b_{\max}$), a combined conservative bound is
\[ E_\infty \leq \frac{L^2}{4N} + \frac{1}{2 b^2 N}, \tag{S0.7} \]
which provides a conservative heuristic bound on the log-error. Setting this $\leq \ln(1 + \varepsilon)$ and solving for $N$ yields the conservative screening depth:
\[ N_{\mathrm{safe}} \geq \frac{L^2/4 + 1/(2b^2)}{\ln(1 + \varepsilon)}. \tag{S0.8} \]

Remark 3 (Status of the conservative bound). Equation (S0.7) is a conservative heuristic design rule. It is conservative because: (i) we use a global upper bound $\|\phi''\|_\infty \leq 2$ instead of the actual curvature, (ii) we do not exploit the minimax-optimal $C$ shift. It is heuristic (not a formal guarantee) because: (i) the derivation assumes the operating range lies in the concavity region of $\phi$, which may not hold for all detuning choices; (ii) the second term $1/(2b^2 N)$ arises from a simplified penalty model for flank-curvature mismatch that has not been proved to be a rigorous upper bound in all parameter regimes. $N_{\mathrm{safe}}$ from Eq. (S0.8) is therefore a screening estimate, suitable for preliminary design-space exploration but not a certified minimax guarantee.

S0.4 Validity scope and failure cases

The derivations above hold under the stated assumptions. We now identify the regimes where each assumption may break down.

(V1) Lorentzian model (A1). The single-ring Lorentzian form $T = [1 + (a + bI)^2]^{-1}$ is a near-resonance approximation valid when the probe frequency is within a few linewidths of the resonance. Far from resonance, higher-order dispersion, Fano interference, or multi-mode effects introduce deviations. Failure case: operation with very large detuning ($|a + bI| \gg 1$ across the full interval), where the Lorentzian tails may not be accurate for high-Q rings.

(V2) Multiplicative cascade (A2). Requires that inter-ring reflections and back-scattering are negligible (forward-propagating coupling only, i.e., negligible back-reflection at each inter-ring junction). Failure case: very high ring count $N$ with non-negligible back-reflection per stage, which can produce Fabry–Pérot-like ripples in the cascade transfer function.

(V3) Identical-detuning family (A3). The Taylor expansion and conservative bound both assume $a_1 = \cdots = a_N$. In practice, fabrication variations introduce per-ring detuning spread $\sigma_a$.
The Monte Carlo analysis in Sec. S8 quantifies robustness, but the analytical bounds (S0.2)–(S0.8) are strictly valid only for identical detuning.

(V4) Linear control-to-resonance mapping (A4). The linearized model $\omega_0(I) = \omega_0^{(0)} + \eta I$ introduces systematic error at large control amplitudes. For carrier-injection (free-carrier plasma effect) or thermal tuning over wide ranges, second-order nonlinearity in the control-to-detuning mapping can exceed 1%. Failure case: large $L$ requiring a control swing exceeding the linearity range of the tuning mechanism.

(V5) Finite interval (A5). All bounds scale with $L$ (typically as $L^2$ or $L^{3/2}$). As $L \to \infty$, $N$ grows without bound and insertion loss accumulates ($\sim N \cdot \mathrm{IL}_{\mathrm{stage}}$), eventually degrading the probe SNR below the useful regime. There is no finite $N$ that works for all $L$ simultaneously. Practical regime: $L \lesssim 10$–$12$ (consistent with $L_{\mathrm{eff}}$ at p90–p95 from Sec. S3) is the primary target; $L \gtrsim 16$ requires $N \gtrsim 30$ even for moderate tolerance, pushing loss budgets.

(V6) Flank-centered initialization (A6). The Taylor-based scaling (Theorem 1) relies on the cancellation $\phi''(-1) = 0$ at the half-maximum point. If the operating point deviates (e.g., due to fabrication offset pushing $a + bI_0$ away from $-1$), a nonzero quadratic residual appears and the effective scaling worsens to $E_\infty \sim L^2/N$ rather than $L^3/N^2$. Mitigation: heater/bias trimming to restore the flank condition.

S0.5 Mapping to main-text equations

For reference, the results derived here correspond to the following main-text equations:

• Slope bound (Lemma 1): rigorous; corresponds to main-text Eqs. (22)–(23). This is a guaranteed necessary condition.

• Engineering $N$-estimate (Theorem 1): heuristic scaling with empirical prefactor $\kappa \approx 0.07$; corresponds to main-text Eq. (28). This is a heuristic design rule calibrated against numerical fits.

• Conservative bound (Corollary 1): conservative but not rigorously certified as a minimax upper bound; derived as Eq. (S0.7) in this supplement, stated inline in Sec. II. This is a conservative heuristic screening condition.

• $N_{\mathrm{safe}}$ (Corollary 1, Eq. S0.8): the safe screening depth derived from the conservative bound; derived as Eq. (S0.8) in this supplement, stated inline in Sec. II. This is a conservative backstop estimate for preliminary design.

Summary of guarantee status:

Result                                     Status                                            Main-text Eq.
Slope bound N|b| ≥ 1                       Rigorous (proved)                                 (23)
Scaling N ∼ κ L^{3/2}/√ε_log               Heuristic (Taylor truncation + empirical κ)       (28)
Bound E∞ ≤ L^2/(4N) + 1/(2b^2 N)           Conservative heuristic                            (S0.7)
N_safe screening depth                     Conservative backstop                             (S0.8)

S1. DEPTH-SCALING DERIVATION AND CONSERVATIVE SCREENING BOUND

This section provides the detailed derivations underlying the depth-scaling relations and conservative screening bounds summarized in the main text (Sec. II). These results complement the rigorous treatment in Sec. S0.

S1.1 Local expansion and exponential-like behavior

To provide immediate local intuition (without changing the global minimax objective), let $\delta = I - I_0$ around the flank-centered point $I_0 = L/2$ and impose $a + bI_0 = -1$. With the local normalization $C = 2^N$ (so that $\tilde y(I_0) = 1$), a third-order expansion of $\tilde y(I) = C[1 + (a + bI)^2]^{-N}$ gives
\[ \tilde y(I) \approx 1 + N b\, \delta + \frac{N^2}{2} b^2 \delta^2 + \frac{N(N^2 - 1)}{6} b^3 \delta^3 + O(\delta^4), \tag{S1.1} \]
so with $b \sim 1/N$, the linear and quadratic coefficients align with those of $e^\delta = 1 + \delta + \delta^2/2 + \delta^3/6 + \cdots$, explaining why the initialization is already close before refinement.

S1.2 Log-domain analysis and scaling derivation

For depth scaling, the logarithmic domain is more transparent. Under the same flank centering ($a + bI_0 = -1$), expand around $I_0 = L/2$ with $\delta = I - I_0$ to obtain
\[ \ln \tilde y(I) = \mathrm{const} + N b\, \delta - \frac{N b^3}{6} \delta^3 + O(\delta^4). \tag{S1.2} \]
Exponentiating (S1.2) reproduces the cubic coefficient $N(N^2 - 1) b^3/6$ of (S1.1). At $a + bI_0 = -1$, the quadratic term cancels identically in the log expansion; imposing slope matching ($Nb = 1$) gives
\[ \ln \tilde y(I) = \mathrm{const} + \delta - \frac{\delta^3}{6N^2} + O(\delta^4). \tag{S1.3} \]
Hence the leading log-domain residual scales as $|r(\delta)| \sim |\delta|^3/N^2$. Over $I \in [0, L]$ with $|\delta| \leq L/2$, this implies $E_\infty \sim L^3/N^2$. Requiring $E_\infty \leq \varepsilon_{\log}$ leads to
\[ N \propto \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}}, \tag{S1.4} \]
which explains the scaling used in the main-text engineering estimate (Eq. (28)). This derivation is heuristic (not a formal guarantee), and the prefactor remains platform- and fitting-criterion dependent.

S1.3 Conservative upper bound and screening depth

For fixed $b$ and the identical-detuning family ($a_1 = \cdots = a_N \equiv a$), one can write a conservative heuristic condition for achieving a prescribed log-tolerance. A simple normalization is to enforce $\tilde y(L) = 1$ (matching the target $f(L) = 1$). For a particular constructive choice of $a$ that keeps $(a + bI)$ large and negative across $[0, L]$, one can bound the worst-case log-error as
\[ E_\infty \leq \frac{L^2}{4N} + \frac{1}{2 b^2 N}. \tag{S1.5} \]
(This is a conservative rule of thumb; obtaining a formal guarantee would require a separate proof.) As a screening estimate (not a formal guarantee), one may use
\[ N \geq \frac{L^2/4 + 1/(2b^2)}{\ln(1 + \varepsilon)}. \tag{S1.6} \]
While this bound is typically pessimistic, it provides a conservative backstop-style estimate for preliminary design screening. The rigorous derivation of these bounds, including the concavity conditions and slope-matching assumptions, is given in Sec. S0.3.

S2. WORKED EXAMPLE AND EMPIRICAL LOGIT-RANGE CALIBRATION

This section provides the detailed worked example for the input-to-output mapping and the empirical logit-range calibration tables referenced in the main text (Sec. III).

S2.1 Worked input-to-output mapping example

As a worked example, consider
\[ x = [-3.2,\ 1.2,\ 4.8,\ -0.9]. \tag{S2.1} \]
Compute $m = \max_n x_n = 4.8$. Then $u = x - m = [-8.0,\ -3.6,\ 0,\ -5.7]$ and $L = -\min_n u_n = 8.0$.
The mapped control-signal levels are
+
+    \[ I = u + L = [0,\ 4.4,\ 8.0,\ 2.3], \]    (S2.2)
+
+and the required normalized exponentials are $e^{x_n - m} = e^{u_n} = e^{I_n - L}$. Using the fitted model directly,
+
+    \[ T_k(I_n) = \frac{1}{1 + (a_k + bI_n)^2}, \qquad y(I_n) = \prod_{k=1}^{N} T_k(I_n). \]
+
+Under the identical-detuning fit ($a_1 = \cdots = a_N \equiv a$), this becomes
+
+    \[ \tilde{y}(I_n) = C\, y(I_n) = C \left[ \frac{1}{1 + (a + bI_n)^2} \right]^{N}. \]
+
+For the re-fitted parameters used in this example,
+
+    \[ a = -1.4588, \quad b = 0.10202, \quad N = 10, \quad C = 3.0896 \times 10^{1}, \]    (S2.3)
+
+which gives
+
+    \[ \tilde{y}(I_n) = C \left[ \frac{1}{1 + (a + bI_n)^2} \right]^{N} \approx [3.44 \times 10^{-4},\ 2.73 \times 10^{-2},\ 9.74 \times 10^{-1},\ 3.26 \times 10^{-3}]. \]    (S2.4)
+
+ For reference, the corresponding target terms are
+
+    \[ I_n - L = [-8.0,\ -3.6,\ 0,\ -5.7], \]    (S2.5)
+
+and
+
+    \[ e^{I_n - L} \approx [3.35 \times 10^{-4},\ 2.73 \times 10^{-2},\ 1.00,\ 3.35 \times 10^{-3}]. \]    (S2.6)
+
+
+ S2.2 Effective-range percentiles and clipping calibration
+
+ We first estimate the logit range observed in data and then choose clipping accordingly. From two autoregressive
+Transformers (distilgpt2 and gpt2) and two public corpora (Tiny Shakespeare and Pride and Prejudice) [1–5] at context
+length 128, the effective range
+
+    \[ L_{\mathrm{eff},\alpha} = \max(\log p_{\mathrm{kept}}) - \min(\log p_{\mathrm{kept}}), \qquad \alpha = 0.999, \]    (S2.7)
+
+fell in a relatively narrow band, summarized in Table S2.
+
+ TABLE S1: Example ($N = 10$): approximating $e^{x_n - m} = e^{I_n - L}$ using $\tilde{y}(I) = C[1 + (a + bI)^2]^{-N}$ with parameters
+ re-fitted on $I \in [0, 8.0]$ using the same minimax pipeline.
+
+  x_n     I_n    target e^(x_n - m)    approx ỹ(I_n)      rel. err.
+ −3.2     0.0    3.3546 × 10^-4        3.4443 × 10^-4     2.673%
+  1.2     4.4    2.7324 × 10^-2        2.7325 × 10^-2     0.004%
+  4.8     8.0    1.0000                0.9739             2.608%
+ −0.9     2.3    3.3460 × 10^-3        3.2585 × 10^-3     2.614%
+
+
+ TABLE S2: Effective-range percentiles ($L_{\mathrm{eff},0.999}$) at context length 128.
+
+ Percentile    All runs (4 runs)    GPT-2
+ p50           6.92–7.23            7.09–7.23
+ p90           8.60–8.75            8.73–8.75
+ p95           8.97–9.12            9.06–9.12
+ p99           9.50–9.69            9.58–9.69
+
+
+ We then test clipping on the same rows with
+
+    \[ E_{\mathrm{cum}}(t) = \tfrac{1}{2} \left\| \mathrm{softmax}(u^{(t)}) - \mathrm{softmax}(u) \right\|_1, \qquad u^{(t)} = \max(u, t), \quad u = s - \max(s), \]    (S2.8)
+
+and require $\mathrm{p99}\{E_{\mathrm{cum}}\} \le 10^{-3}$ (0.1% budget). This criterion is satisfied at $t = -12$ (p99 $\approx 4.27 \times 10^{-4}$) and violated
+at $t = -11$ (p99 $\approx 1.24 \times 10^{-3}$), so we set $t^* = -12$ ($N_{\mathrm{clip}} = 12$).
+ In practice, we (i) estimate an effective $L$ from data, (ii) verify that fixed clipping keeps softmax error small, and (iii)
+choose representative design points (e.g., $L \approx 8$ or $L \approx 12$) while treating the clipped tail as negligible. Full protocol
+details, clipping-sweep tables/plots, and per-run statistics are provided in Sec. S3.
+
+
+ S2.3 Illustrative synthetic range map
+
+ As a design-space reference, we consider synthetic logit-range regimes using $L = \max(x) - \min(x)$ after $QK^\top/\sqrt{d_k}$
+scaling. These regimes are illustrative rather than corpus-level percentiles; using the same fitting pipeline, Table S3
+summarizes achievable approximation error versus depth.
+
+ TABLE S3: Synthetic softmax logit-range regimes ($L = \max(x) - \min(x)$) and fitted worst-case relative error
+ (design-space illustration; not intended as corpus-level statistics).
+
+ L regime    N = 5     N = 10    N = 20    N = 30
+ L = 8       10.9%     2.68%     0.67%     0.30%
+ L = 12      40.0%     9.25%     2.27%     1.01%
+ L = 16      113%      23.0%     5.44%     2.41%
+
+
+ Table S3 suggests a simple rule of thumb: the required depth depends mainly on the target $L$ regime. Near $L \approx 8$,
+moderate depth reaches a few-percent error, whereas $L \gtrsim 12$ typically requires deeper cascades to approach < 1%
+error.
+ We include Table S3 as a synthetic design map rather than an empirical benchmark.
+
+ S3. EMPIRICAL LOGIT-RANGE EXTRACTION FROM REAL TRANSFORMER RUNS
+
+ We extracted empirical attention-logit ranges from real model runs to complement the synthetic L-regime map in
+the main text.
We used two open-source autoregressive Transformers (distilgpt2 and gpt2) and two public corpora
+(Tiny Shakespeare and Pride and Prejudice), with context length 128 and causal masking. For each valid attention
+row, if $p = \mathrm{softmax}(s)$ then the raw range is
+
+    \[ L_{\mathrm{raw}} = \max(s) - \min(s) = \max(\log p) - \min(\log p), \]    (37)
+
+where max/min are taken over valid causal keys only. Because very small tail probabilities can dominate $\min(\log p)$,
+we additionally report an effective range:
+
+    \[ L_{\mathrm{eff},\alpha} = \max(\log p_{\mathrm{kept}}) - \min(\log p_{\mathrm{kept}}), \]    (38)
+
+where keys are sorted by attention weight and retained until cumulative mass reaches $\alpha = 0.999$.
+ To stay within a 16 GB RAM budget, we processed one model at a time, batch size 1, fixed windowing (stride 128),
+and streaming histogram quantiles. Observed process RSS stayed below 1.24 GB in these runs.
+
+ TABLE S4: Empirical global logit-range percentiles from real model–dataset runs (context length 128): raw vs
+ effective ($\alpha = 0.999$).
+
+ Model        Dataset             raw p95    raw p99    Leff p50    Leff p90    Leff p95    Leff p99
+ distilgpt2   tiny shakespeare    22.82      69.00      7.10        8.60        8.97        9.50
+ distilgpt2   pride prejudice     21.76      68.60      6.92        8.60        9.03        9.57
+ gpt2         tiny shakespeare    25.48      43.34      7.23        8.73        9.06        9.58
+ gpt2         pride prejudice     24.13      40.92      7.09        8.75        9.12        9.69
+
+ For quick linkage to the main manuscript: the effective-range summary quoted in the main text corresponds to this
+table (all runs: p50 = 6.92–7.23, p90 = 8.60–8.75, p95 = 8.97–9.12, p99 = 9.50–9.69), and the GPT-2 subset is p50
+= 7.09–7.23, p90 = 8.73–8.75, p95 = 9.06–9.12, p99 = 9.58–9.69.
+Clipping-validity sweep (additional justification). To test whether practical clipping magnitudes can be used
+without materially changing softmax outputs, we evaluated a thresholded-logit approximation. For each row, define
+$u = s - \max(s)$ and, for threshold $t \le 0$,
+
+    \[ u^{(t)} = \max(u, t), \qquad p^{(t)} = \mathrm{softmax}(u^{(t)}). \]    (39)
+
+We report the cumulative softmax error
+
+    \[ E_{\mathrm{cum}}(t) = \tfrac{1}{2} \left\| p^{(t)} - p \right\|_1, \]    (40)
+
+then sweep $t \in \{-14, -13, \ldots, -6\}$ and compute p50/p90/p95/p99 of $E_{\mathrm{cum}}$ over all extracted rows.
+
+ TABLE S5: Global clipping-validity sweep: percentile statistics of $E_{\mathrm{cum}}(t)$ versus clipping threshold $t$.
+
+   t      p50             p90             p95             p99
+ −14      2.53 × 10^-5    4.55 × 10^-5    4.80 × 10^-5    5.18 × 10^-5
+ −13      2.69 × 10^-5    4.85 × 10^-5    7.38 × 10^-5    1.48 × 10^-4
+ −12      2.99 × 10^-5    1.21 × 10^-4    2.13 × 10^-4    4.27 × 10^-4
+ −11      3.31 × 10^-5    3.95 × 10^-4    6.55 × 10^-4    1.24 × 10^-3
+ −10      3.72 × 10^-5    1.28 × 10^-3    2.01 × 10^-3    3.58 × 10^-3
+  −9      4.41 × 10^-5    4.04 × 10^-3    6.11 × 10^-3    1.03 × 10^-2
+  −8      2.25 × 10^-4    1.26 × 10^-2    1.83 × 10^-2    2.91 × 10^-2
+  −7      2.76 × 10^-3    3.85 × 10^-2    5.30 × 10^-2    7.89 × 10^-2
+  −6      1.88 × 10^-2    1.11 × 10^-1    1.43 × 10^-1    1.95 × 10^-1
+
+ Under a conservative budget criterion $\mathrm{p99}\{E_{\mathrm{cum}}\} \le 10^{-3}$, the least negative admissible threshold in this sweep
+is $t^* = -12$ (p99 $\approx 4.27 \times 10^{-4}$). Equivalently, the operational clipping magnitude is $N_{\mathrm{clip}} \equiv -t^* = 12$. Notably,
+this is closely aligned with the empirical effective-range scale (Table S4: p99 of $L_{\mathrm{eff},0.999}$ up to ≈ 9.69), indicating
+that clipping-constrained implementation and effective-range statistics operate in the same order-of-magnitude range
+budget. This supports using a practical clipping magnitude comparable to the design range scale ($L \approx N_{\mathrm{clip}}$) while
+keeping aggregate softmax distortion below 0.1%.
+
+ FIG. S1: Global CDFs of raw $L_{\mathrm{raw}}$ (dashed) and effective $L_{\mathrm{eff},0.999}$ (solid) for the four model–dataset runs.
+
+ FIG. S2: Percentile curves of cumulative softmax error $E_{\mathrm{cum}}(t)$ versus clipping threshold $t$. The dashed line marks the
+ 0.1% budget ($10^{-3}$).
+
+ S4. FDTD METHODOLOGY DETAILS AND X-CUT bV DERIVATION
+
+ This section provides the detailed FDTD simulation methodology, the step-by-step X-cut arc electrode voltage
+sensitivity derivation, and the full cascade optimization table referenced in the main text (Sec. IV–V).
+ + + S4.1 z-refined 3-fix simulation strategy + + For thin-film LiNbO3 structures, special care is required in the vertical (z) direction due to the high index contrast +between LiNbO3 (no ≈ 2.21) and SiO2 (n ≈ 1.44) and the sub-micron film thickness. We apply a “z-refined 3-fix” +strategy: + 1. Ordinary index correction: the material model uses the corrected ordinary refractive index no appropriate + for the TE mode in X-cut geometry, rather than the extraordinary index ne that governs TM propagation; + 2. z-span expansion: the simulation z-span is extended beyond the minimal waveguide region to include sufficient + substrate and superstrate so that evanescent field tails are captured without PML truncation artifacts; + 3. Auto-mesh: accuracy level 3; conformal variant 1 meshing is enabled, and no manual mesh override is applied. + The resulting vertical grid spacing in the slab region is approximately 55 nm, providing ∼2 cells across the 100 nm + slab. +This refinement strategy is critical for obtaining converged results in TFLN ring resonators, where the high-Q spectral +features are sensitive to numerical dispersion in under-resolved thin films [6]. Table S6 lists the full simulation +parameters. + + TABLE S6: 3D FDTD simulation parameters (Lumerical). + +Parameter Value +Solver Lumerical 3D FDTD +Mesh type Conformal variant 1 +Mesh accuracy 3 (auto-mesh) +z-mesh override None (auto-mesh) +Simulation time 50 ps +Auto shutoff 1 × 10−6 +Wavelength range 1530 nm to 1570 nm +Grid size 532 × 816 × 44 +Source Broadband mode source (TE0 ) + + + + + S4.2 X-cut arc electrode bV step-by-step derivation + + For the X-cut circular ring with lateral S–G arc electrodes (Table II), the crystal Z-axis (c-axis) is oriented at 45◦ +from the horizontal axis in the substrate plane. At azimuthal angle θ around the ring, the projection of the lateral +electric field onto the Z-axis is proportional to cos(θ − 45◦ ). 
The $\cos(\theta - 45^\circ) = 0$ boundaries fall at θ = 135° and
+θ = 315°, naturally separating the bus-waveguide coupling regions from the electrode regions. Each ring carries a full
+semicircular arc electrode on the side opposite to its coupling points. By the substitution $\varphi = \theta - 45^\circ$, the effective
+EO fill factor is
+
+    \[ f_{\mathrm{EO}} = \frac{1}{2\pi} \int_{\mathrm{semicircle}} |\cos(\theta - 45^\circ)|\, d\theta = \frac{1}{2\pi} \int_{-\pi/2}^{+\pi/2} \cos\varphi\, d\varphi = \frac{1}{2\pi} \Big[ \sin\varphi \Big]_{-\pi/2}^{+\pi/2} = \frac{1}{\pi} \approx 0.318. \]    (S4.1)
+
+The 45° rotation ensures that the electrode semicircle does not overlap with the coupling points, while the fill factor
+integral is identical to the standard $\cos\theta$ case by the change of variable.
+ The lateral S–G electrodes have gap $g_{\mathrm{el}} = 5$ µm, giving an effective electrode–waveguide distance $d_{\mathrm{eff}} \approx g_{\mathrm{el}}/2 = 2.5$ µm.
+The lateral field geometry yields an EO overlap factor $\Gamma_{\mathrm{EO}} = 0.7$, compared to 0.5 for a vertical electrode configuration.
+ The refractive index change per volt in the electrode-covered section is
+
+    \[ \frac{\Delta n_{\mathrm{eff}}}{V} = -\frac{1}{2} n_e^3 r_{33} \frac{\Gamma_{\mathrm{EO}}}{d_{\mathrm{eff}}} = -\frac{1}{2} \times 2.138^3 \times 30.9 \times 10^{-12} \times \frac{0.7}{2.5 \times 10^{-6}} = -4.226 \times 10^{-5}\ \mathrm{V^{-1}}. \]    (S4.2)
+
+The corresponding resonance wavelength shift is
+
+    \[ \left. \frac{d\lambda_0}{dV} \right|_{\mathrm{straight}} = \frac{1550 \times 4.226 \times 10^{-5}}{2.30} = 28.48\ \mathrm{pm\,V^{-1}}, \]    (S4.3)
+
+giving an intrinsic (straight-section) voltage sensitivity of
+
+    \[ b_V^{\mathrm{straight}} = \frac{2 Q_L}{\lambda_0} \left. \frac{d\lambda_0}{dV} \right|_{\mathrm{straight}} = \frac{2 \times 15{,}500}{1550} \times 0.02848 = 0.570\ \mathrm{V^{-1}}. \]    (S4.4)
+
+However, only the arc-electrode portion of the ring circumference contributes to the round-trip phase shift. The
+effective voltage sensitivity is therefore
+
+    \[ b_V = b_V^{\mathrm{straight}} \times f_{\mathrm{EO}} = 0.570 \times \frac{1}{\pi} \approx 0.182\ \mathrm{V^{-1}}. \]    (S4.5)
+
+A 1 V applied voltage shifts the normalized detuning by $\Delta a \approx 0.182$. Despite the fill-factor penalty ($f_{\mathrm{EO}} = 1/\pi \approx 0.318$),
+the X-cut arc design benefits from a smaller effective electrode distance (2.5 µm vs. 4 µm for vertical configurations)
+and a higher overlap factor (0.7 vs. 0.5), which partially compensate for the reduced active length.
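The chain of Eqs. (S4.1)–(S4.5) can be re-evaluated numerically as a sanity check. The following is a minimal sketch using only the values quoted in this subsection; the constant and function names are illustrative and do not come from the original analysis scripts. Because the inputs are not rounded at each step, the last digit can differ slightly from the printed values.

```python
import math

# Values quoted in Sec. S4.2 (names are illustrative assumptions).
N_E = 2.138            # index used in Eq. (S4.2)
R33_M_PER_V = 30.9e-12 # electro-optic coefficient r33, m/V
GAMMA_EO = 0.7         # EO overlap factor
D_EFF_M = 2.5e-6       # effective electrode-waveguide distance, m
LAMBDA0_NM = 1550.0    # resonance wavelength, nm
N_G = 2.30             # group index (FDTD FSR-derived)
Q_LOADED = 15_500      # loaded quality factor


def fill_factor() -> float:
    """Eq. (S4.1): (1/2pi) * integral of cos(phi) over [-pi/2, pi/2]."""
    return (math.sin(math.pi / 2) - math.sin(-math.pi / 2)) / (2 * math.pi)


def index_change_per_volt() -> float:
    """Eq. (S4.2): effective-index change per volt, in 1/V."""
    return -0.5 * N_E**3 * R33_M_PER_V * GAMMA_EO / D_EFF_M


def wavelength_shift_nm_per_volt() -> float:
    """Eq. (S4.3): resonance shift per volt, in nm/V."""
    return LAMBDA0_NM * abs(index_change_per_volt()) / N_G


def b_v_straight() -> float:
    """Eq. (S4.4): straight-section voltage sensitivity, in 1/V."""
    return 2 * Q_LOADED / LAMBDA0_NM * wavelength_shift_nm_per_volt()


def b_v() -> float:
    """Eq. (S4.5): effective sensitivity after the fill-factor penalty."""
    return b_v_straight() * fill_factor()


if __name__ == "__main__":
    print(f"f_EO        = {fill_factor():.4f}")
    print(f"dn/dV       = {index_change_per_volt():.4e} 1/V")
    print(f"dlambda/dV  = {1e3 * wavelength_shift_nm_per_volt():.2f} pm/V")
    print(f"b_V         = {b_v():.4f} 1/V")
```
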
+ + + S4.3 Full cascade optimization table + + Table S7 presents the complete optimization results for the standard dynamic range L = 8 (corresponding to +e8 ≈ 2981, i.e., 34.7 dB), covering all cascade depths from N = 5 to N = 30. + + TABLE S7: Cascade optimization results for L = 8. The bias voltage Vbias = |a|/bV sets the DC offset, and +Vctrl = bL/bV is the maximum control voltage at I = L. Voltages computed with bV = 0.182 V−1 (FDTD-calibrated + best resonance QL = 15,500). + +N a b E∞ εmax (%) Vbias (V) Vctrl (V) + 5 −2.0789 0.21658 0.1035 10.91 11.4 9.5 + 8 −1.5959 0.12896 0.0412 4.20 8.8 5.7 +10 −1.4588 0.10202 0.0265 2.68 8.0 4.5 +12 −1.3731 0.08450 0.0184 1.86 7.5 3.7 +15 −1.2914 0.06726 0.0118 1.19 7.1 3.0 +17 −1.2543 0.05923 0.0092 0.92 6.9 2.6 +20 −1.2136 0.05025 0.0067 0.67 6.7 2.2 +25 −1.1685 0.04013 0.0043 0.43 6.4 1.8 +30 −1.1392 0.03341 0.0030 0.30 6.3 1.5 + + + Key thresholds for the minimum number of rings at various error targets are: + • ε < 10%: N ≥ 6, + • ε < 5%: N ≥ 8, + • ε < 2%: N ≥ 12, + • ε < 1%: N ≥ 17, + • ε < 0.5%: N ≥ 24. +These thresholds are independent of the quality factor Q, since the minimax approximation operates entirely in +normalized detuning space. The Q factor affects only the physical voltage required to achieve the necessary detuning +range, through bV . + + + S4.4 Lorentzian fit validation + + Figure S3 shows the Lorentzian fit to the FDTD drop-port resonance near λ = 1566 nm. The analytical Lorentzian +Tdrop (∆λ) = A/[1 + (2Q∆λ/λ0 )2 ] with QL = 15,500 closely tracks the FDTD data, validating the single-ring transfer +function model used in the cascade analysis. + 26 + + + + + FIG. S3: Lorentzian fit to the FDTD drop-port resonance. Markers: FDTD data; solid line: Lorentzian fit. The + extracted quality factor is QL = 15,500 with FWHM = 101 pm. 
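The Lorentzian model of Sec. S4.4 ties the quoted linewidth to the loaded quality factor through FWHM = λ0/Q_L. The short sketch below (function names are illustrative, not taken from the fitting code) evaluates the analytical line shape and checks that Q_L = 15,500 near λ0 = 1566 nm corresponds to a linewidth of roughly 101 pm.

```python
def lorentzian_drop(delta_lambda_nm: float, lambda0_nm: float,
                    q_loaded: float, amplitude: float = 1.0) -> float:
    """T_drop = A / [1 + (2 Q dLambda / lambda0)^2], as in Sec. S4.4."""
    x = 2.0 * q_loaded * delta_lambda_nm / lambda0_nm
    return amplitude / (1.0 + x * x)


def fwhm_pm(lambda0_nm: float, q_loaded: float) -> float:
    """Full width at half maximum of the Lorentzian, in picometres."""
    return 1e3 * lambda0_nm / q_loaded


if __name__ == "__main__":
    width_pm = fwhm_pm(1566.0, 15_500)
    half_width_nm = 0.5e-3 * width_pm
    print(f"FWHM = {width_pm:.1f} pm")
    print(f"T_drop at half width = {lorentzian_drop(half_width_nm, 1566.0, 15_500):.3f}")
```
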
+
+
+ S4.5 Eigenmode (FDE) analysis of theoretical Qi
+
+ To quantify how far below the physical limit the FDTD-extracted $Q_i = 38{,}800$ lies, we perform a two-dimensional
+finite-difference eigenmode (FDE) analysis of the bent rib waveguide cross-section using Lumerical MODE Solutions.
+ a. Setup. The FDE solver models the cross-section of the rib waveguide at the design bend radius $R = 20$ µm
+and wavelength $\lambda = 1550$ nm, with perfectly matched layer (PML) boundaries on all four edges. The geometry is
+identical to the 3D FDTD model: 600 nm total LiNbO3 ($n_o = 2.211$, lossless dielectric), 100 nm slab, 500 nm rib etch,
+waveguide width $W = 1.4$ µm, on a 2 µm SiO2 substrate ($n = 1.444$) with air cladding. The mesh is set to 300 × 300
+cells over a 6 µm × 3 µm cross-section, yielding effective grid spacings $\Delta x \approx 20$ nm and $\Delta y \approx 10$ nm, substantially
+finer than the 3D FDTD auto-mesh (55 nm vertical).
+ b. Complex effective index. The FDE solver returns a complex effective index $n_{\mathrm{eff}} = n_r + i\,n_i$ for each guided
+mode, where the imaginary part $n_i$ encodes propagation loss. For the fundamental TE mode at $R = 20$ µm:
+
+    \[ n_{\mathrm{eff}} = 1.9653 + i\,(4.73 \times 10^{-8}), \]    (41)
+
+    \[ \alpha_{\mathrm{rad+leak}} = \frac{4\pi n_i}{\lambda} = 0.383\ \mathrm{m^{-1}} \quad (0.017\ \mathrm{dB\,cm^{-1}}). \]    (42)
+
+Since the material is set as lossless, this $\alpha$ captures only bending radiation loss and substrate leakage through the
+100 nm slab. The corresponding quality factor is
+
+    \[ Q_{\mathrm{rad+leak}} = \frac{2\pi n_g}{\alpha_{\mathrm{rad+leak}}\, \lambda} = 2.43 \times 10^{7}, \]    (43)
+
+where $n_g = 2.354$ is the group index from the FDE solver (consistent with the FDTD FSR-derived $n_g = 2.30$; the
+small difference arises from the straight-section approximation inherent to 2D FDE).
+ c. Decomposition into bending and leakage. A separate FDE run with $R = 1$ mm (effectively straight) yields
+$Q_{\mathrm{leak}} = 2.93 \times 10^{7}$, isolating the substrate leakage contribution. The pure bending radiation quality factor follows from
+
+    \[ \frac{1}{Q_{\mathrm{bend}}} = \frac{1}{Q_{\mathrm{rad+leak}}} - \frac{1}{Q_{\mathrm{leak}}}, \qquad Q_{\mathrm{bend}} = 1.43 \times 10^{8}. \]    (44)
+
+This confirms that bending radiation loss at $R = 20$ µm is negligible; substrate leakage through the thin slab is the
+dominant geometric loss channel.
+ d. Material absorption. The FDE mode profile yields a confinement factor $\Gamma = 0.887$ (fraction of the optical
+intensity within the LiNbO3 core and slab regions). The material-absorption-limited quality factor is
+
+    \[ Q_{\mathrm{abs}} = \frac{2\pi n_g}{\Gamma\, \alpha_{\mathrm{mat}}\, \lambda}, \]    (45)
+
+where $\alpha_{\mathrm{mat}}$ is the bulk material power-attenuation coefficient of LiNbO3 at 1550 nm. Table S8 evaluates Eq. (45) for
+representative TFLN absorption values from the literature [6, 7].
+
+TABLE S8: Theoretical intrinsic quality factor $Q_i$ of the $R = 20$ µm TFLN ring, decomposed into radiation ($Q_{\mathrm{bend}}$),
+ substrate leakage ($Q_{\mathrm{leak}}$), and material absorption ($Q_{\mathrm{abs}}$). Sidewall scattering (fabrication-dependent) is excluded.
+ The total is $1/Q_i = 1/Q_{\mathrm{rad+leak}} + 1/Q_{\mathrm{abs}}$ with $Q_{\mathrm{rad+leak}} = 2.43 \times 10^{7}$.
+
+ Material condition        α_mat (dB/cm)    Q_abs         Q_i (total)
+ Bulk LiNbO3 (pristine)    0.002            2.3 × 10^8    2.2 × 10^7
+ High-quality TFLN         0.01             4.7 × 10^7    1.6 × 10^7
+ Good TFLN                 0.03             1.6 × 10^7    9.5 × 10^6
+ Typical TFLN              0.1              4.7 × 10^6    3.9 × 10^6
+
+
+ For high-quality TFLN ($\alpha_{\mathrm{mat}} \lesssim 0.01\ \mathrm{dB\,cm^{-1}}$), the theoretical $Q_i$ exceeds $10^7$, more than 400× higher than the
+FDTD-extracted value of 38,800. This confirms that the FDTD result is dominated by numerical mesh artifacts
+(approximately two cells across the 100 nm slab), not by physical loss mechanisms. Bending radiation loss at $R = 20$ µm
+is negligible ($Q_{\mathrm{bend}} = 1.43 \times 10^{8}$); the dominant geometric loss channel in the ideal structure is substrate leakage
+through the thin slab ($Q_{\mathrm{leak}} = 2.93 \times 10^{7}$).
+
+ S5. FABRICATED HIGH-Q DESIGN PROJECTIONS
+
+ Reproducing $Q_i > 10^5$ in three-dimensional FDTD is computationally impractical: at accuracy level 3 the 100 nm
+slab requires $\Delta z \lesssim 20$ nm to suppress staircase-induced scattering, inflating wall times beyond 30 days per run.
The numerically extracted $Q_i = 38{,}800$ therefore represents a simulation floor, not a physical one. A two-dimensional
+MODE-solver bend analysis confirms $Q_{\mathrm{bend}} > 4.5 \times 10^{7}$ for $R = 20$ µm, placing bending radiation loss far below any
+realistic intrinsic loss.
+ Table S9 surveys recent high-Q TFLN microring demonstrations. These studies show that $Q_i \ge 9 \times 10^{6}$ has been
+demonstrated in X-cut TFLN using multiple fabrication routes, including Ar+ milling, wet etching, and ICP-RIE/CMP-based
+processes.
+
+ TABLE S9: Demonstrated intrinsic quality factors in TFLN micro-ring resonators. “EO compatible” indicates
+ whether the fabrication process preserves electrode patterning capability.
+
+ Ref.           Q_i           R (µm)    w (µm)    Etch
+ Zhang [8]      10^7          80        ∼2        Ar+ mill
+ Gao [9]        10^8          100       ∼3        CMP*
+ Zhuang [10]    9 × 10^6      100       ∼2        Wet etch
+ Song [11]      2.9 × 10^7    200       4.5       ICP-RIE+CMP
+ All processes except * are EO-electrode compatible. * CMP-only (no dry etch); subsequent electrode patterning may degrade $Q_i$.
+
+ To project cascade performance into the fabricated regime, we fix $Q_{\mathrm{ext}} = 25{,}800$ (the FDTD-extracted coupling
+quality factor at gap = 100 nm) and compute $D_{\max} = [Q_i/(Q_i + Q_{\mathrm{ext}})]^2$ for three representative intrinsic quality
+factors (Table S10).
+
+ TABLE S10: Projected cascade transmission for fabricated $Q_i$ values at fixed $Q_{\mathrm{ext}} = 25{,}800$. $D_{\max}^{N}$ is the ideal
+on-resonance cascade transmission in dB. The minimax approximation error $\varepsilon_{\max}$ depends only on $N$ and $L$ (not on
+ $Q_i$); at $N = 20$, $L = 8$: $\varepsilon_{\max} = 0.67\%$ (Table I).
+
+ Projection       Q_i            D_max    N = 10    N = 20    N = 30
+ FDTD baseline    3.88 × 10^4    0.36     −44.3     −88.5     −132.8
+ Conservative     5 × 10^5       0.90     −4.4      −8.8      −13.2
+ Moderate         10^6           0.95     −2.2      −4.5      −6.7
+ Optimistic       5 × 10^6       0.99     −0.44     −0.88     −1.3
+
+
+ Even in the conservative scenario ($Q_i = 5 \times 10^{5}$), $D_{\max} = 0.90$ and the $N = 10$ cascade loss is only −4.4 dB, an
+order-of-magnitude improvement over the FDTD baseline. The moderate projection ($Q_i = 10^{6}$) matches the “fabricated
+high-Q” column in Table V.
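The projection rule behind Table S10 is a two-line calculation, sketched below under the stated assumption that the cascade transmission in dB is simply N · 10 log10(D_max). Function names are illustrative. Because this sketch keeps D_max unrounded, some entries can differ from the printed table in the last digit.

```python
import math

Q_EXT = 25_800  # FDTD-extracted coupling quality factor quoted in Sec. S5


def d_max(q_i: float, q_ext: float = Q_EXT) -> float:
    """Single-ring on-resonance drop transmission, [Qi / (Qi + Qext)]^2."""
    ratio = q_i / (q_i + q_ext)
    return ratio * ratio


def cascade_db(q_i: float, n_rings: int, q_ext: float = Q_EXT) -> float:
    """Ideal on-resonance transmission of n_rings cascaded rings, in dB."""
    return n_rings * 10.0 * math.log10(d_max(q_i, q_ext))


if __name__ == "__main__":
    for label, q_i in [("FDTD baseline", 3.88e4), ("Conservative", 5e5),
                       ("Moderate", 1e6), ("Optimistic", 5e6)]:
        row = ", ".join(f"N={n}: {cascade_db(q_i, n):.2f} dB" for n in (10, 20, 30))
        print(f"{label:14s} D_max={d_max(q_i):.3f}  {row}")
```
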
Because $Q_{\mathrm{bend}} \approx 4.5 \times 10^{7} \gg Q_i$ for all projections, bending loss is never the bottleneck;
+the dominant loss mechanism is sidewall scattering, which is determined entirely by fabrication quality. The literature
+values in Table S9 support the view that intrinsic quality factors in the projected range are physically achievable
+in TFLN, albeit with wider waveguides ($w \ge 2$ µm) and larger ring radii ($R \ge 80$ µm) than the present design.
+Transferring comparable sidewall quality to our geometry ($R = 20$ µm, $w = 1.4$ µm) is an open fabrication challenge;
+the projections in Table S10 should be read as design targets contingent on achieving it.
+
+ S6. INSERTION LOSS BUDGET DETAILS
+
+ For a cascade of $N$ rings, the total insertion loss is modeled as
+
+    \[ IL_{\mathrm{tot}} \approx N \cdot IL_{\mathrm{stage}} + IL_{\mathrm{coupling}}, \]    (S6.1)
+
+where $IL_{\mathrm{stage}}$ is the per-ring insertion loss at off-resonance operation and $IL_{\mathrm{coupling}}$ accounts for fiber-to-chip and
+chip-to-fiber coupling losses. Using typical loss numbers from the literature [12–16], we consider two scenarios:
+
+ • Optimistic: $IL_{\mathrm{stage}} = 0.08$ dB, $IL_{\mathrm{coupling}} = 1.5$ dB. Then $IL_{\mathrm{tot}} \approx$ 1.90 dB ($N = 5$), 2.30 dB ($N = 10$), 3.10 dB
+ ($N = 20$), and 3.90 dB ($N = 30$).
+ • Conservative: $IL_{\mathrm{stage}} = 0.25$ dB, $IL_{\mathrm{coupling}} = 3.0$ dB. Then $IL_{\mathrm{tot}} \approx$ 4.25 dB ($N = 5$), 5.50 dB ($N = 10$),
+ 8.00 dB ($N = 20$), and 10.5 dB ($N = 30$).
+
+ In both scenarios, $N$ = 5–10 is manageable for probe-power budgeting, whereas $N = 20$ and $N = 30$ require tighter
+power budgeting and more amplification margin. Higher $IL_{\mathrm{tot}}$ raises the required probe SNR and pushes operation
+closer to the detector noise floor, reducing usable dynamic range.
+ e. Four-component loss breakdown. The total insertion loss of the cascade has four components:
+ 1. On-resonance cascade transmission $D_{\max}^{N}$ (dominant; see Table V);
+ 2. Inter-ring coupling loss $(N - 1) \times (-10 \log_{10} \eta_{\mathrm{coupling}})$, where $\eta_{\mathrm{coupling}}$ is the power transfer efficiency at each
+ inter-ring bus section.
Two-ring FDTD yields $\eta_{\mathrm{coupling}} \approx 0.9$ for the present diagonal-bus geometry, corresponding
+ to ∼0.46 dB per inter-ring stage;
+ 3. Off-resonance propagation loss $N \times IL_{\mathrm{stage}}$, where $IL_{\mathrm{stage}}$ = 0.08–0.25 dB per ring [12–14, 16];
+ 4. Fiber-to-chip coupling loss $IL_{\mathrm{coupling}}$ = 1.5–3.0 dB [15].
+Table V presents the ideal on-resonance budget ($D_{\max}^{N}$ only). Including all four components for the present diagonal-bus
+layout: in the FDTD-characterized regime ($D_{\max} = 0.36$, $N = 5$), the total loss is approximately 22.2 + 1.8 + 0.4 + 1.5 ≈
+26 dB; in the fabricated high-Q regime ($D_{\max} = 0.95$, $N = 30$), the total loss is 6.7 + 13.3 + 2.4 + 1.5 ≈ 24 dB. The
+inter-ring coupling loss dominates in the high-Q regime, underscoring that layout optimization (e.g., adiabatic tapers or
+straight-bus coupling) is as important as achieving $D_{\max} \ge 0.95$ through quality-factor improvement. For an optimized
+layout with $\eta_{\mathrm{coupling}} \ge 0.98$ (≤0.09 dB per stage), the $N = 30$ total loss would reduce to ∼13 dB.
+
+ S7. ENERGY EFFICIENCY DETAILED DERIVATION
+
+ This section provides the detailed energy-per-operation derivations for both electrical analog exponential circuits
+and the photonic MRR cascade, as summarized in the main text (Sec. V).
+
+
+ S7.1 Electrical analog exponential circuits
+
+ Three main families of electrical circuits realize the exponential function in the analog domain:
+ f. BJT translinear / Gilbert cell circuits. The collector current of a bipolar junction transistor is $I_C =
+I_S \exp(V_{BE}/V_T)$, providing an intrinsic exponential map [17, 18]. A Gilbert cell multiplier, the core building
+block of translinear exponential circuits, dissipates 250–325 µW in typical CMOS/BiCMOS implementations [19]. At
+a signal bandwidth of $B \approx 100$ MHz, the energy per operation is
+
+    \[ E_{\mathrm{Gilbert}} = \frac{P}{B} = \frac{300\ \mu\mathrm{W}}{100\ \mathrm{MHz}} = 3\ \mathrm{pJ}. \]    (S7.1)
+
+ g. CMOS subthreshold exponential circuits.
A MOSFET in weak inversion exhibits $I_D \propto \exp(V_{GS}/nV_T)$, enabling
+direct exponential computation at ultra-low power [18]. A reconfigurable softmax circuit in 180 nm CMOS implements
+a 10-input softmax at $V_{DD} = 500$ mV with $P = 3$ µW [20]. Per-channel: $P_{\exp} \approx 0.43$ µW. At $B \approx 1$ MHz (limited by
+subthreshold $f_T$):
+
+    \[ E_{\mathrm{sub}\text{-}V_T} = \frac{0.43\ \mu\mathrm{W}}{1\ \mathrm{MHz}} = 0.43\ \mathrm{pJ}. \]    (S7.2)
+
+This is the most energy-efficient electrical approach, but at severely limited bandwidth (∼1 MHz).
+ h. Digital CMOS (for reference). A digital exponential via Taylor series requires ∼10 multiply-add operations.
+Using Horowitz’s energy figures [21] for 45 nm at 0.9 V: 32-bit FP multiply costs 3.7 pJ, FP add costs 0.9 pJ, giving
+
+    \[ E_{\mathrm{digital}} \approx 10 \times (3.7\ \mathrm{pJ} + 0.9\ \mathrm{pJ}) = 46\ \mathrm{pJ}. \]    (S7.3)
+
+At 8-bit precision (sufficient for inference): ∼2.3 pJ.
+
+
+ S7.2 Photonic MRR cascade: single-channel energy derivation
+
+ We evaluate the energy at $N = 30$ cascaded X-cut TFLN micro-ring resonators with $R = 20$ µm in the fabricated
+high-Q regime ($Q_i = 10^{6}$, $Q_L \approx 25{,}200$; Supplementary Sec. S5), which achieves $\varepsilon_{\max} = 0.30\%$ with $V_{\mathrm{ctrl}} = 0.91$ V
+(fully CMOS-compatible). The energy per exponential operation has three components:
+ (i) Electro-optic tuning energy. Each ring is tuned by charging the arc electrode capacitance to $V_{\mathrm{ctrl}}$. For the lateral
+S–G arc electrodes covering one semicircle ($L_{\mathrm{arc}} = \pi R = 62.8$ µm), the electrode capacitance is estimated as
+
+    \[ C_{\mathrm{el}} \approx 18\ \mathrm{fF}, \]    (S7.4)
+
+based on coplanar electrode modeling for TFLN lateral S–G geometries with $g_{\mathrm{el}} = 5$ µm (comparable to values reported
+by Bahadori et al. [22] for similar geometries). The switching energy per ring at $V_{\mathrm{ctrl}} = 0.91$ V (using the projected
+$Q_L = 25{,}200$, which gives $b_V = 0.295\ \mathrm{V^{-1}}$):
+
+    \[ E_{\mathrm{ring}} = \tfrac{1}{2} C_{\mathrm{el}} V_{\mathrm{ctrl}}^2 = \tfrac{1}{2} \times 18\ \mathrm{fF} \times (0.91\ \mathrm{V})^2 = 7.4\ \mathrm{fJ}. \]    (S7.5)
+
+For $N = 30$ rings: $E_{\mathrm{EO}} = 30 \times 7.4\ \mathrm{fJ} = 0.22$ pJ.
+ Note the important scaling: $E_{\mathrm{EO}} \propto 1/N$ since $b \propto 1/N$ from minimax optimization, because
+
+    \[ E_{\mathrm{EO}} \propto N \times V_{\mathrm{ctrl}}^2 \propto N \times (b/b_V)^2 \propto 1/N. \]
(S7.6)
+The bias voltage (3.9 V) is static and does not contribute per-operation energy.
+ (ii) Laser source energy (amortized). Because every cascade channel uses the same fixed probe wavelength, a single
+CW laser can be shared among $M$ parallel softmax channels via a $1 \times M$ optical power splitter. With wall-plug
+efficiency $\eta_{\mathrm{WPE}} \approx 15\%$ [23], the per-channel optical power is $P_{\mathrm{opt}} = P_{\mathrm{in}}/M \approx 100$ µW (for $P_{\mathrm{in}} = 1$ mW, $M = 10$),
+requiring $P_{\mathrm{laser}} \approx 667$ µW per channel. At $f_{\mathrm{mod}} = 10$ GHz: $E_{\mathrm{laser}}$ = 667 µW / 10 GHz = 67 fJ.
+ (iii) Photodetector energy. Integrated SiGe photodetectors with TIA achieve sub-pJ reception [24]: $E_{\mathrm{PD}} \approx 0.5$ pJ.
+ The total single-channel energy is
+
+    \[ E_{\mathrm{photonic}}^{(1\mathrm{ch})} = E_{\mathrm{EO}} + E_{\mathrm{laser}} + E_{\mathrm{PD}} = 0.22 + 0.07 + 0.50 = 0.79\ \mathrm{pJ}. \]    (S7.7)
+
+ S7.3 Q-factor scaling of energy efficiency
+
+ Since $V_{\mathrm{ctrl}} \propto 1/Q$ and $E_{\mathrm{EO}} \propto V_{\mathrm{ctrl}}^2$, the EO energy scales as $1/Q^2$. Table S11 shows the total energy for $N = 30$ at
+various quality factors.
+
+TABLE S11: Energy per exponential operation vs. quality factor ($N = 30$, $\varepsilon_{\max} = 0.30\%$, X-cut arc electrode with $b_V$
+ scaling linearly with $Q$; $C_{\mathrm{el}} = 18$ fF). $E_{\mathrm{laser}} + E_{\mathrm{PD}} = 0.57$ pJ is the Q-independent floor. The dagger (†) marks the
+FDTD-calibrated quality factor; the double dagger (‡) marks the high-Q design point ($Q_i = 10^{6}$). Excludes thermal
+ stabilization (0.15–0.60 pJ for $N = 30$).
+
+ Q          V_ctrl (V)    V_bias (V)    E_EO (pJ)    E_total (pJ)
+  5,000     4.57          19.5          5.64         6.21
+ 10,000     2.28          9.7           1.40         1.97
+ 12,500     1.83          7.8           0.90         1.47
+ 15,500†    1.47          6.3           0.58         1.15
+ 20,000     1.14          4.9           0.35         0.92
+ 25,200‡    0.91          3.9           0.22         0.79
+ 30,000     0.76          3.2           0.16         0.73
+ 50,000     0.46          1.9           0.06         0.63
+
+
+ At $Q_L = 15{,}500$ (FDTD-calibrated), the EO contribution (0.58 pJ) is comparable to the optical floor, placing the
+design in the efficient operating regime. Beyond $Q \approx 30{,}000$, the EO contribution becomes negligible and the total
+energy saturates near the floor; further Q improvement primarily benefits CMOS driver voltage compatibility rather
+than energy.
+ i.
Additional energy contributions. The estimates above exclude two further contributions: (i) DAC energy
+for setting the per-ring control voltages, typically 0.1–1 pJ per conversion at 10 GHz bandwidth; and (ii) thermal
+stabilization power for maintaining resonance alignment, estimated at ∼50–200 µW per ring for TFLN (lower than
+silicon due to the small thermo-optic coefficient of LiNbO3, $dn/dT \approx 3.9 \times 10^{-6}\ \mathrm{K^{-1}}$). At 10 GHz modulation rate,
+the thermal contribution amounts to ∼0.005–0.02 pJ per ring per operation. For the $N = 30$ cascade, this sums to
+0.15–0.60 pJ, which is comparable to $E_{\mathrm{EO}}$ and must be included in the total: $E_{\mathrm{total}} \approx$ 0.94–1.39 pJ. The total energy
+comparison should therefore be treated as an order-of-magnitude estimate.
+
+
+ S7.4 Comparison with electronic implementations
+
+ Here we provide an order-of-magnitude energy comparison between electrical analog exponential circuits and our
+photonic MRR cascade, grounding the analysis in published device data and first-principles estimates. We assume
+a shared CW laser with total optical output $P_{\mathrm{in,tot}} = 1$ mW, split across $M = 10$ parallel softmax channels via a
+$1 \times M$ power splitter, yielding per-channel input $P_{\mathrm{in,ch}} = 100$ µW. The output power at the cascade drop port is
+$P_{\mathrm{out}} = P_{\mathrm{in,ch}} \times D_{\max}^{N}$, which ranges from 0.61 µW (FDTD regime, $N = 5$) to 21.5 µW (fabricated regime, $N = 30$)
+(Table V).
+ j. Electrical analog exponential circuits. Three main families of electrical analog exponential circuits are compared:
+BJT translinear/Gilbert cell (∼3 pJ at 100 MHz [17–19]), CMOS subthreshold (∼0.43 pJ at 1 MHz [18, 20]), and
+digital FP32 Taylor series (∼46 pJ at 1 GHz [21]).
+ k. Photonic MRR cascade: single-channel energy.
For $N = 30$ X-cut TFLN micro-ring resonators in the self-consistent
+high-Q regime ($Q_L = 25{,}200$), the three energy components are EO tuning ($E_{\mathrm{EO}} = 0.22$ pJ), amortized
+laser ($E_{\mathrm{laser}} = 0.07$ pJ, shared across $M = 10$ parallel channels), and photodetector ($E_{\mathrm{PD}} = 0.50$ pJ), yielding
+$E_{\mathrm{photonic}} = 0.79$ pJ. Including thermal stabilization for $N = 30$ rings (0.15–0.60 pJ), the total rises to 0.94–1.39 pJ.
+Notably, $E_{\mathrm{EO}} \propto 1/N$ since $b \propto 1/N$ from minimax optimization.
+ l. Single-channel comparison. Table S12 presents the comparison. The photonic cascade at $N = 30$ achieves a
+0.79 pJ baseline, 3.8× lower than the BJT Gilbert cell (3 pJ) and 58× lower than digital FP32 (46 pJ). Including
+thermal stabilization (0.94–1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4×, while operating at 10 GHz
+bandwidth. At fabricated $Q \ge 30{,}000$, $E_{\mathrm{EO}}$ drops to 0.16 pJ and $E_{\mathrm{total}} \approx 0.73$ pJ (excluding thermal; Table S11),
+recovering a 3.2× advantage over INT8. Subthreshold CMOS achieves the lowest energy (0.43 pJ) but at 10,000×
+lower bandwidth.
+
+ TABLE S12: Energy per exponential operation: single-channel comparison.
+
+ Implementation            E/op (pJ)    Bandwidth    Notes
+ Digital FP32 (Taylor)     ∼46          1 GHz        10 FP MACs
+ BJT Gilbert cell          ∼3           100 MHz      Analog
+ Digital INT8 (Taylor)     ∼2.3         1 GHz        10 INT MACs
+ Photonic MRR (N = 30)     0.94–1.39    10 GHz       Analog†
+ Subthreshold CMOS         ∼0.43        1 MHz        Analog
+ † 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal. Self-consistent with fabricated high-Q regime ($Q_L = 25{,}200$); see
+ Supplementary Sec. S7.
+
+ m. Caveats. These values are order-of-magnitude estimates, not device-accurate predictions. The photonic
+estimate excludes DAC energy for voltage generation (typically 0.1–1 pJ per conversion at 10 GHz bandwidth, shared
+with any analog approach) and thermal tuning power for maintaining resonance alignment (∼50–200 µW per ring for
+TFLN, lower than silicon due to the small thermo-optic coefficient of LiNbO3, $dn/dT \approx 3.9 \times 10^{-6}\ \mathrm{K^{-1}}$).
Effective precision at the photodetector is limited to ∼6–8 bits by shot noise and receiver electronics. The energy advantage
+over electrical implementations is strongest in the fabricated high-Q regime ($D_{\max} \ge 0.95$), where $N = 30$ is practical
+and $V_{\mathrm{ctrl}}$ remains CMOS-compatible.
+
+ S8. MONTE CARLO ROBUSTNESS UNDER DEVICE NON-IDEALITIES
+
+ This section describes the robustness model summarized in the main text. For the fitted $L = 8$, $N = 10$ design
+($a = -1.4588$, $b = 0.10202$), each Monte Carlo chip sample includes: (i) per-ring static detuning spread, (ii) per-ring
+sensitivity spread, (iii) global thermal drift and crosstalk-like slope drift, (iv) stage insertion-loss variation, (v)
+control-channel noise, and (vi) detector noise with one-point calibration at $I = L$.
+ For ring $k$, we use
+
+    \[ T_k(I) = \frac{1}{1 + (a_k + b_k I + d_{\mathrm{th}} + d_{\mathrm{xt}} I/L)^2}, \]    (46)
+
+with
+
+    \[ y(I) = \prod_{k=1}^{N} T_k(I) \times 10^{-IL_{\mathrm{tot}}/10}, \]    (47)
+
+and one-point calibration $\tilde{y}(I) = C_{\mathrm{cal}}\, y(I)$ such that $\tilde{y}(L) = 1$ for the same chip instance.
+
+ TABLE S13: Non-ideality distributions used in the Monte Carlo sweeps.
+
+ Parameter                 Nominal         Stress
+ σ_a                       0.020           0.032
+ σ_b,rel                   0.020           0.032
+ σ_th                      0.015           0.025
+ σ_xt                      0.012           0.020
+ σ_I                       0.004           0.007
+ IL_stage (dB, µ ± σ)      0.12 ± 0.03     0.18 ± 0.05
+ σ_det                     3.0 × 10^-6     6.0 × 10^-6
+
+
+ TABLE S14: Monte Carlo summary (same run reported in main text).
+
+ Metric                        Nominal         Stress
+ Median KL(p_ref ∥ p_approx)   2.17 × 10^-4    7.39 × 10^-4
+ p95 KL(p_ref ∥ p_approx)      5.92 × 10^-4    2.21 × 10^-3
+ Median max |∆p|               0.170%          0.193%
+ p95 max |∆p|                  0.319%          0.419%
+
+Conservative-bound sketch used for the main-text screening equation. For the identical-detuning family
+with fixed $b$, define
+
+    \[ \ln \tilde{y}(I) = N\phi(a + bI) - N\phi(a + bL), \qquad \phi(u) = -\ln(1 + u^2), \]    (48)
+
+so that $\tilde{y}(L) = 1$ by construction.
Around a constructive choice with a + bI < 0 on [0, L], a second-order remainder argument for the mismatch between
+the target slope and the fitted slope yields a term scaling as $L^2/(4N)$, while the flank-curvature penalty
+contributes a term scaling as $1/(2b^2 N)$. Combining the two contributions gives the screening inequality
+
+ $$ E_\infty \lesssim \frac{L^2}{4N} + \frac{1}{2 b^2 N}, \tag{49} $$
+
+which leads to the conservative screening equation reported in the main manuscript. We emphasize that this is a
+conservative heuristic design rule (not a formal minimax theorem), used only for preliminary depth screening.
+
+FIG. S4: CDF of end-to-end softmax probability error under the same non-ideality samples.
+
+ S9. DELAY-AWARE FEEDBACK NORMALIZATION VALIDATION
+
+ We model global normalization as a delayed PI-controlled loop:
+
+ $$ S(t) = G(t) P(t) + n(t), \tag{50} $$
+ $$ \tau \frac{dP}{dt} = -P(t) + u(t - T_d), \tag{51} $$
+ $$ u(t) = K_p e(t) + K_i \int e(t)\, dt, \qquad e(t) = S_{ref} - S(t), \tag{52} $$
+
+with actuator saturation $0 \le u \le P_{max}$. A piecewise G(t) profile is used to emulate workload changes. For
+physical intuition, Table S15 converts normalized delay/settling metrics into absolute-time examples.
+
+TABLE S15: Example absolute-time interpretation of normalized PI-loop metrics using one representative stable case
+ (($K_p$, $K_i$, $T_d/\tau$) = (0.55, 0.8, 0.2)) and a ±2% settling-time definition ($T_{settle}$ ∼ 12.4τ).
+
+ Assumed τ    Delay $T_d$ = 0.2τ    Settling ∼ 12.4τ    Interpretation
+ 100 ns       20 ns                 1.24 µs             fast loop
+ 1 µs         200 ns                12.4 µs             moderate loop
+ 5 µs         1 µs                  62 µs               slower loop
+
+Reference-backed latency context for bottleneck screening. To place the delayed-loop times against mixed-signal
+system latencies, Table S16 summarizes representative time scales with explicit path classes (on-chip vs off-chip)
+for memory and interconnect paths, alongside conversion latency ranges. These are intentionally order-of-magnitude
+ranges (not fixed constants), and can shift with architecture, clocking, and protocol stack choices.
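As a rough illustration of the delayed PI loop in Eqs. (50)–(52), the Python sketch below integrates the model with
forward Euler for the representative gains ($K_p$, $K_i$, $T_d/\tau$) = (0.55, 0.8, 0.2). The constant gain G = 1, the
noise-free setting n(t) = 0, the saturation limit $P_{max}$ = 2, the time normalization τ = 1, the step size, and the
function name simulate_pi_loop are all assumptions made for this example and are not taken from the validation
scripts, so the overshoot and settling time it produces need not match the reported values.

```python
from collections import deque


def simulate_pi_loop(kp=0.55, ki=0.8, delay=0.2, tau=1.0, s_ref=1.0,
                     gain=1.0, p_max=2.0, dt=1e-3, t_end=40.0):
    """Forward-Euler integration of the delayed PI loop, Eqs. (50)-(52).

    Time is measured in units where tau = 1 (an assumption of this sketch).
    Returns the list of sampled outputs S(t).
    """
    n_delay = max(1, round(delay / dt))
    # FIFO of past control values; u(t - Td) is read from the left end.
    u_hist = deque([0.0] * n_delay, maxlen=n_delay)
    p, integral = 0.0, 0.0
    trace = []
    for _ in range(round(t_end / dt)):
        s = gain * p                       # Eq. (50) with n(t) = 0
        e = s_ref - s                      # error signal e(t)
        integral += e * dt                 # running integral of e(t)
        u = kp * e + ki * integral         # Eq. (52)
        u = min(max(u, 0.0), p_max)        # actuator saturation 0 <= u <= Pmax
        u_delayed = u_hist[0]              # control value issued Td ago
        u_hist.append(u)                   # maxlen drops the oldest entry
        p += dt * (-p + u_delayed) / tau   # Eq. (51)
        trace.append(s)
    return trace


trace = simulate_pi_loop()
print(f"final S = {trace[-1]:.4f}, peak S = {max(trace):.4f}")
```

Changing `delay`, `kp`, or `ki` in the call gives a quick feel for how the loop degrades as the delay ratio grows; a
piecewise G(t) profile would require replacing the constant `gain` with a function of time.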
+
+ TABLE S16: Representative subsystem latency ranges used for conservative bottleneck screening in Sec. S9.
+
+ Subsystem path                 $T_{sys}$     Sources
+ On-chip memory (L1/L2)         20–200 ns     [25]
+ Off-chip memory (DRAM)         200–700 ns    [25, 26]
+ ADC conversion                 10–710 ns     [27, 28]
+ DAC + driver/settling          1–200 ns      [29]
+ On-chip interconnect (NoC)     5–100 ns      [30]
+ Off-chip I/O (PCIe/CXL)        1–10 µs       [25, 31]
+
+Conservative risk-screening heuristic for loop latency. As a screening heuristic, we use the settling time from
+one representative stable case (($K_p$, $K_i$, $T_d/\tau$) = (0.55, 0.8, 0.2); Table S18), with settling defined as the
+first time entering and remaining within a ±2% band around $S_{ref}$, as a normalization-loop latency proxy:
+
+ $$ T_{norm} \approx 12.4\, \tau. \tag{53} $$
+
+This value is not a universal bound; different gain settings, delay ratios, or loop architectures will yield different
+settling times. It is used only as a reference point for order-of-magnitude risk screening. Define the conservative
+screening metric
+
+ $$ T_{norm} \ge \beta\, T_{sys}, \tag{54} $$
+
+with β = 1 (high-risk screening line) and β = 0.5 (early-warning line); this is a heuristic risk indicator, not a
+formal dominance proof. The corresponding threshold is
+
+ $$ \tau_{crit}(\beta) = \frac{\beta\, T_{sys}}{12.4}. \tag{55} $$
+
+Table S17 gives the resulting numeric ranges.
+For the explicit examples in Table S15, τ = 0.1 µs gives $T_{norm}$ ≈ 1.24 µs, τ = 1 µs gives $T_{norm}$ ≈ 12.4 µs, and
+τ = 5 µs gives $T_{norm}$ ≈ 62 µs. These numbers indicate a risk trend (not a hard boundary): for this representative
+case, the normalization loop is typically non-dominant when τ is well below the relevant $\tau_{crit}$ band, and it may
+become dominant as τ approaches or exceeds that band. The transition depends on path class (on-chip vs off-chip) and
+on architecture-specific timing closure, including whether the normalization path lies on the end-to-end critical
+path (Table S16). Accordingly, this analysis is intended for preliminary risk screening only; concrete implementations
+require full timing validation.
+
+ TABLE S17: Computed $\tau_{crit}$ ranges from Eq. (55) using Table S16.
+
+ Subsystem                      $T_{sys}$ range   $\tau_{crit}$ (β = 0.5)   $\tau_{crit}$ (β = 1)
+ On-chip memory path            20–200 ns         0.81–8.06 ns              1.61–16.13 ns
+ Off-chip memory path           200–700 ns        8.06–28.23 ns             16.13–56.45 ns
+ ADC conversion                 10–710 ns         0.40–28.63 ns             0.81–57.26 ns
+ DAC+driver/settling            1–200 ns          0.04–8.06 ns              0.08–16.13 ns
+ On-chip interconnect (NoC)     5–100 ns          0.20–4.03 ns              0.40–8.06 ns
+ Off-chip I/O fabric            1–10 µs           0.04–0.40 µs              0.08–0.81 µs
+
+TABLE S18: Representative step-response cases for the delayed PI loop (settling defined by a ±2% band around $S_{ref}$).
+
+ Case        ($K_p$, $K_i$, $T_d/\tau$)    Overshoot    Settling       Stable
+ Stable      (0.55, 0.8, 0.2)              25.6%        ∼ 12.4τ        Yes
+ Marginal    (0.95, 1.6, 0.45)             25.6%        ∼ 12.8τ        Yes
+ Unstable    (1.2, 2.2, 0.75)              45.1%        not settled    No
+
+ TABLE S19: Stable-region fraction from gain-map scans at each delay ratio.
+
+ $T_d/\tau$    Stable fraction
+ 0.0           88.1%
+ 0.2           88.0%
+ 0.5           72.4%
+ 0.8           47.5%
+
+FIG. S5: Step-response examples of the delayed PI normalization loop.
+
+FIG. S6: Delay-dependent stability maps over scanned ($K_p$, $K_i$) ranges.
+
+ S10. REPRODUCIBILITY
+
+ Scripts used for this Supplementary validation:
+ • scripts/nonideality montecarlo.py
+ • scripts/feedback loop validation.py
+ • scripts/extract logit range effective.py
+ • scripts/analyze softmax clipping validity.py
+Public code repository: https://github.com/hyoseokp/MRR-AEF (commit 585e695). Empirical extraction outputs
+are stored under:
+ • paper/empirical L v3/
+
+ [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
+ Polosukhin.
Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages + 5998–6008, 2017. + [2] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019. + [3] Hugging Face. distilgpt2 model card, 2025. accessed 2026-02-21. + [4] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. accessed 2026-02-21. + [5] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. accessed 2026-02-21. + [6] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics, 13(2):242–352, 2021. + [7] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng, + CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair, and Marko + Loncar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025. + [8] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic ultra-high-Q lithium + niobate microring resonator. Optica, 4(12):1536–1537, 2017. + [9] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng. Lithium + niobate microring with ultra-high Q factor above 108 . Chin. Opt. Lett., 20(1):011902, 2022. +[10] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with wet etching. + Adv. Mater., 35(3):2208113, 2023. +[11] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letı́cia S. Magalhães, Amirhassan + Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic microresonators on + thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024. +[12] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and Bhavin J. + Shastri. Roadmapping the next generation of silicon photonics. 
Nature Communications, 15:751, 2024. +[13] Xaveer Leijtens et al. Multimode silicon photonics. Nanophotonics, 7:1571–1580, 2018. +[14] Haoqian Li et al. In-memory photonic dot-product engine with electrically programmable weight banks. Nature Communi- + cations, 14:2389, 2023. +[15] Daan Vermeulen et al. High-efficiency fiber-to-chip grating couplers realized using an advanced cmos-compatible silicon-on- + insulator platform. Optics Express, 18(17):18278–18283, 2010. +[16] F. S. Tan, D. J. W. Klunder, H. F. Bulthuis, G. Sengo, H. J. W. M. Hoekstra, and A. Driessen. Direct measurement of + the on-chip insertion loss of high finesse microring resonators in si3 n4 -sio2 technology. In Proceedings of the IEEE LEOS + Benelux Chapter, 2001. +[17] B. Gilbert. Translinear circuits: a proposed classification. Electron. Lett., 11(1):14–16, 1975. +[18] C. Mead. Analog VLSI and Neural Systems. Addison-Wesley, 1989. +[19] B. Razavi. Design of Analog CMOS Integrated Circuits. McGraw-Hill, 2 edition, 2017. +[20] Massimo Vatalaro, Tatiana Moposita, Sebastiano Strangio, Lionel Trojman, Andrei Vladimirescu, Marco Lanuzza, and + Felice Crupi. A low-voltage, low-power reconfigurable current-mode softmax circuit for analog neural networks. Electronics, + 10(9):1004, 2021. +[21] M. Horowitz. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State + Circuits Conference (ISSCC), pages 10–14, 2014. +[22] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient and fully + isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform. Optics Express, 28(20):29644– + 29661, 2020. +[23] A. Biberman and K. Bergman. Optical interconnection networks for high-performance computing systems. Rep. Prog. + Phys., 75(4):046402, 2012. + 40 + +[24] D. A. B. Miller. Attojoule optoelectronics for low-energy information processing and communications. J. 
Lightwave Technol., + 35(3):346–396, 2017. +[25] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. Dissecting the NVIDIA volta GPU architecture via + microbenchmarking. arXiv preprint arXiv:1804.06826, 2018. +[26] Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu. A case for exploiting subarray-level parallelism + (SALP) in DRAM. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), pages + 368–379, 2012. +[27] Texas Instruments. ADC12DJ3200: 6.4-GSPS single-channel or 3.2-GSPS dual-channel, 12-bit, RF-sampling analog-to-digital + converter. Datasheet (SLVSD97A, revised April 2020), 2020. Accessed 2026-02-22. +[28] Texas Instruments. ADS8881: 18-bit, 1-MSPS, low-power, true-differential SAR ADC. Datasheet (SBAS547D, revised + August 2015), 2015. Accessed 2026-02-22. +[29] Texas Instruments. DAC38RF82/DAC38RF89: Dual-channel, 14-bit, 9-GSPS and 6-GSPS RF DACs. Datasheet + (SLASEA6D, revised June 2020), 2020. Accessed 2026-02-22. +[30] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Proceedings of the 38th Design + Automation Conference (DAC), pages 684–689, 2001. +[31] Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki, Yu Nakanishi, Daisuke Taki, and + Akiyuki Kaneko. Gpu graph processing on CXL-based microsecond-latency external memory. In Proceedings of the SC ’23 + Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023. +
\ No newline at end of file diff --git a/refs/scurria_2602.03670v2.txt b/refs/scurria_2602.03670v2.txt new file mode 100644 index 0000000..d9eba6a --- /dev/null +++ b/refs/scurria_2602.03670v2.txt @@ -0,0 +1,3311 @@ +Equilibrium Propagation for Non-Conservative Systems + +Antonino Emanuele Scurria 1 Dimitri Vanden Abeele 1 Bortolo Matteo Mognetti 2 Serge Massar 1 + +arXiv:2602.03670v2 [cs.LG] 1 Jun 2026 + +Abstract + +from inference, the transmission of nonlocal error signals, +and synchronous layer-wise computations with explicit gradient storage. These constraints have no clear analog in +physical systems, making backpropagation challenging to +implement in neuromorphic or analog hardware. Consequently, understanding how credit assignment can instead +emerge from intrinsic system dynamics, through local interactions and continuous relaxation, is a central question in +neuroscience and machine learning. + +Equilibrium Propagation (EP) is a physicsinspired learning algorithm that uses stationary +states of a dynamical system both for inference +and learning. In its original formulation it is +limited to conservative systems, i.e. to dynamics which derive from an energy function. Given +their applications, it is important to extend EP +to non-conservative systems, i.e. systems with +non-reciprocal interactions. Previous attempts to +generalize EP to such systems failed to compute +the exact gradient of the cost function. Here we +propose a framework that extends EP to arbitrary +non-conservative systems, including feedforward +networks. We keep the key property of equilibrium propagation, namely the use of stationary +states both for inference and learning. However, +we modify the dynamics in the learning phase by +a term proportional to the non-reciprocal part of +the interaction so as to obtain the exact gradient +of the cost function. 
This algorithm can also be +derived using a variational formulation that generates the learning dynamics through an energy +function defined over an augmented state space. +Numerical experiments show that this algorithm +achieves better performance and learns faster than +previous proposals. + +Equilibrium Propagation (EP) (Scellier &Bengio, 2017) +represents one of the most promising advances in this direction. It formulates supervised learning as a contrast between +two stationary states of a dynamical system: a ‘free’ phase +where the system evolves autonomously, and a ‘nudged’ +phase where outputs are weakly pushed toward their targets. +The local change in neural states between these phases recovers the exact gradient of the cost function with respect to +parameters. This enables spatially local learning exploiting +the continuous relaxation of the system without a distinct +backward circuit or explicit weight transport. +Since its introduction, several works have sought to improve the practicality and biological realism of EP. Algorithmic adaptations include enforcing temporal locality to +avoid state storage (Ernoult et al., 2020; Falk et al., 2025), +deriving agnostic updates for black-box energies (Scellier et al., 2022), and substituting nudging with clamping +(Stern et al., 2021). Theoretically, the framework has been +extended to stochastic systems (Scellier &Bengio, 2017; +Massar &Mognetti, 2025) and Lagrangian dynamics for +time-varying inputs (Massar, 2025; Pourcel et al., 2025; +Berneman &Hexner, 2025). In parallel, simulations have +explored suitable substrates, ranging from spiking (Martin et al., 2021; O’Connor et al., 2019) and resistive networks (Kendall et al., 2020) to coupled oscillators (Wang +et al., 2024; Rageau &Grollier, 2025), as well as quantum +systems (Wanjura &Marquardt, 2025; Massar &Mognetti, +2025; Scellier, 2024). 
Experimental realizations have been +demonstrated in memristor crossbars (Yi et al., 2023), selfadjusting electrical circuits (Dillavou et al., 2022; 2024), +elastic networks (Altman et al., 2024), and classical Ising +models trained on quantum annealers (Laydevant et al., +2024). + +1. Introduction +Standard neural network optimization relies on error backpropagation, an algorithm whose computational mechanism +is difficult to reconcile with biological (Crick, 1989) and +physical implementations (Indiveri &Liu, 2015). Specifically, backpropagation requires a backward pass distinct +1 +Laboratoire d’Information Quantique (LIQ) CP224, Université +libre de Bruxelles (ULB), Av. F. D. Roosevelt 50, 1050 Bruxelles, +Belgium 2 Interdisciplinary Center for Nonlinear Phenomena and +Complex Systems CP231, Université libre de Bruxelles (ULB), Av. +F. D. Roosevelt 50, 1050 Bruxelles, Belgium. Correspondence to: +Antonino Emanuele Scurria <antonino.scurria@ulb.be>. + +Proceedings of the 43 rd International Conference on Machine +Learning, Seoul, South Korea. PMLR 306, 2026. Copyright 2026 +by the author(s). + +Despite these recent developments and the theoretical el- + +1 + +Equilibrium Propagation for Non-Conservative Systems + +egance of EP, its standard formulation remains restricted +to conservative systems. In these systems, dynamics are +derived from an energy function, which inherently enforces +symmetry (e.g., symmetric synaptic connections Jij = Jji ) +through the action-reaction principle. This constraint precludes the use of EP in a broad class of models characterized +by non-conservative forces. 
This includes the feedforward +architectures dominant in modern AI, biological circuits, +as well as physical systems that reach stationary states far +from thermodynamic equilibrium, such as nonlinear optical +systems driven by external lasers (Cin et al., 2025), optoelectronic systems (Kalinin et al., 2025), exciton-polariton +condensates (Sajnok &Matuszewski, 2025), active metamaterials (Brandenbourger et al., 2019) and active colloids +(Bishop et al., 2023; Osat &Golestanian, 2023) (see (Bowick +et al., 2022) for a review). + +a framework where the original dynamics serve for inference, while a new augmented dynamic is used to compute +gradients of the cost Eq. (2). In this augmented phase, the +output neurons are nudged towards their targets (as in standard EP), while a local corrective term – proportional to the +antisymmetric part of the Jacobian at the free equilibrium +∂ +JF (x0 , θ, u) = ∂x +F (x0 , θ, u) – is added to the forces. The +exact gradients of the cost with respect to parameters are +then obtained by contrasting stationary states of the augmented system. +Second, we introduce Dyadic EP, a ‘variational’ approach +to learning in non-conservative systems. This method involves doubling the number of variables in the system’s +state space and subsequently introducing a new energy function in this extended space. This approach takes advantage +of the extended space to execute the positive and negative +nudging phases in parallel, recovering the same computational cost as AsymEP. Derived from first principles, this +approach is inspired by established methods for mapping +dissipative dynamical systems onto conservative ones by +doubling the degrees of freedom (Bateman, 1931; Galley, +2013; Aykroyd et al., 2025). A more comprehensive study +of the theoretical framework and its application to feedforward networks can be found in (Scurria, 2026). 
Our method +is related to the Dual Propagation algorithm (Høier et al., +2023; Høier &Zach, 2023; 2024) and constitutes an independent, first-principles generalization of Dyadic Learning +(Nest &Høier; Høier et al., 2024)—previously limited to +Hopfield networks—to arbitrary force fields. + +Formally, we consider a dynamical system governed by +a non-reciprocal force field F (x, θ, u), which relaxes to a +stationary configuration x0 satisfying: +F (x0 , θ, u) = 0, + +(1) + +where x represents the state variables, θ the learnable parameters and u the static input. Our goal, given a target y(u), is +to compute the gradient of the cost function C(x0 , y) at this +equilibrium, +dC 0 +(x , y), +(2) +dθ +and update θ to minimize the cost. +Previous attempts to extend EP to non-conservative dynamics include the Vector Field (VF) algorithm (Scellier et al., +2018). However, as noted by the authors, this method provides an unbiased gradient of the cost Eq. (2) only in the +conservative case. To mitigate this, (Laborieux &Zenke, +2024) proposed adding a penalty to keep the Jacobian close +to symmetry, essentially forcing the system to be as conservative as possible. Alternative methods related to VF, +which similarly do not compute the exact gradient, were +proposed in (Farinha et al., 2020; Costa &Santos, 2025) and +for specific systems in simulation (Cin et al., 2025; Sajnok +&Matuszewski, 2025). + +Third, we validate our framework on MNIST (LeCun, 1998), +Fashion-MNIST, and CIFAR-10. In continuous Hopfield +networks initialized with symmetric connection matrices, +AsymEP achieves better accuracy and learns faster than +EP and VF. Additionally, when we constrain the network +to have a strong degree of structural asymmetry, in which +case EP is inapplicable, AsymEP outperforms VF. 
Finally, when we restrict connections to a feedforward structure, our algorithm effectively trains all parameters; in
+contrast, VF is limited to training the last layer, acting essentially as an Extreme Learning Machine (Huang et al.,
+2006; Wang et al., 2022) with poor performance.
+
+In summary, this theoretical work proposes two generalizations of EP beyond conservative systems to arbitrary
+differentiable dynamics that compute in their stationary states.
+
+Conversely, generalizations of backpropagation can handle non-reciprocal forces and compute the exact gradient of
+the cost Eq. (2) but inherit the same challenges in physical implementations. For instance, Backpropagation Through
+Time (Werbos, 1990) unfolds the network in time to apply standard backpropagation, Recurrent Backpropagation
+(Almeida, 1990; Pineda, 1987) avoids this memory requirement but still requires a specific circuit to propagate
+errors, and the continuous Adjoint Method (Chen et al., 2018) additionally requires integrating the dynamics
+backward in time, which is not physically possible for a dissipative system.
+
+In this paper, we first propose Asymmetric EP (AsymEP),
+
+2. Equilibrium Propagation Overview
+2.1. Conservative Systems
+We first review standard Equilibrium Propagation (EP) (Scellier & Bengio, 2017). We consider a network described
+by an energy function E(x, θ, u), such that the force field is derived from the potential E:
+
+$$ F_E(x, \theta, u) = -\frac{\partial}{\partial x} E(x, \theta, u). \tag{3} $$
+
+The objective is to compute the total gradient $\frac{dC}{d\theta}(x^0, y)$ of a (quadratic) cost function C(x, y)
+evaluated at the minimum energy configuration of the system. This free equilibrium, denoted $x^0$ (which depends
+implicitly on θ and u), satisfies the stationarity condition:
+
+$$ -\frac{\partial}{\partial x} E(x^0, \theta, u) = 0. \tag{4} $$
+
+To compute gradients, we introduce the augmented energy functional:
+
+$$ E_T(x, \theta, \beta, u, y) = E(x, \theta, u) + \beta\, C(x, y), \tag{5} $$
+
+where β is a scalar nudging parameter. The stationary configuration of this augmented system is obtained by
+integrating the dynamics
+
+$$ \frac{dx}{dt} = -\frac{\partial E_T(x, \theta, \beta, u)}{\partial x}, \tag{6} $$
+
+until the energy minimum is reached. This new fixed point $x^\beta$, called nudged equilibrium, satisfies:
+
+$$ \frac{\partial E(x^\beta, \theta, u)}{\partial x} + \beta \frac{\partial C(x^\beta, y)}{\partial x} = 0. \tag{7} $$
+
+The training procedure, as improved in (Laborieux et al., 2021), uses two nudged phases with factors ±β (with
+β ≠ 0). Starting from $x^0$, the system relaxes to two nearby perturbed equilibria, $x^{+\beta}$ and $x^{-\beta}$. The
+displacement $x^{+\beta} - x^{-\beta}$ is then used to compute the parameter update in the learning rule:
+
+$$ \Delta\theta = -\epsilon\, \frac{1}{2\beta} \left[ \frac{\partial E(x^{\beta}, \theta, u)}{\partial \theta} - \frac{\partial E(x^{-\beta}, \theta, u)}{\partial \theta} \right], \tag{8} $$
+
+where ϵ > 0 is the learning rate. The theoretical foundation of EP is the result that, in the limit β → 0 of
+Eq. (8), we get:
+
+$$ \frac{dC(x^0, y)}{d\theta} = \frac{d}{d\beta} \frac{\partial E(x^\beta, \theta, u)}{\partial \theta}, \tag{9} $$
+
+see Appendix D.1. The error of the above method is $O(\beta^2)$. This error can be further reduced using holomorphic
+equilibrium propagation (Laborieux & Zenke, 2022).
+
+Thus, EP recovers the exact gradient of the cost function using only local computations. In this manner, learning
+implements gradient descent without an explicit backward pass, and credit assignment is realized through the
+system's intrinsic relaxation dynamics.
+
+Three remarks can be made at this point. First, EP does not require the system to be at an energy minimum, but only
+at a stationary point, i.e., that Eq. (7) holds. Second, EP implicitly assumes that the Jacobian
+$J_E(x^0, u) = \frac{\partial}{\partial x} F_E(x^0, u)$ is invertible. In this work, we assume this condition holds and
+will not state it explicitly hereafter. Third, for simplicity, we omit the dependency on the input u and target y in
+the following equations.
+
+2.2. Vector Field
+The Vector Field (VF) algorithm, introduced in (Scellier et al., 2018), is an early attempt to adapt EP to
+non-reciprocal forces. This method relies on the observation that, for conservative systems, linearizing the
+right-hand side of Eq. (9) around the equilibrium point $x^0$ yields
+
+$$ \lim_{\beta \to 0} \frac{1}{2\beta} \left[ \frac{\partial E(x^\beta, \theta)}{\partial \theta} - \frac{\partial E(x^{-\beta}, \theta)}{\partial \theta} \right] = \lim_{\beta \to 0} -\left( \frac{\partial F_E}{\partial \theta}(x^0, \theta) \right)^{\top} \frac{x^\beta - x^{-\beta}}{2\beta}, \tag{10} $$
+
+where $F_E = -\partial_x E(x, \theta)$ is the conservative force. It is therefore tempting to use the right-hand side
+of Eq. (10) for parameter updates of non-conservative systems, for which no energy function E exists.
+
+The VF algorithm adopts precisely this approach. It uses the nudged counterpart of Eq. (7),
+
+$$ F(x^\beta, \theta) - \beta \frac{\partial C}{\partial x}(x^\beta) = 0, \tag{11} $$
+
+in conjunction with the learning rule Eq. (10):
+
+$$ \Delta\theta = \epsilon \left( \frac{\partial F}{\partial \theta}(x^0, \theta) \right)^{\top} \frac{x^\beta - x^{-\beta}}{2\beta}. \tag{12} $$
+
+However, as noted in (Scellier et al., 2018), Eq. (12) does not align with the true gradient
+$\frac{dC}{d\theta}(x^0)$ and is exact only if the force is conservative. To see this, let $J_F(x, \theta)$ denote the
+Jacobian of the vector field F(x, θ) (in components $(J_F(x, \theta))_{ij} = \frac{\partial F_i(x, \theta)}{\partial x_j}$).
+Differentiating the equilibrium condition $F(x^0, \theta) = 0$ with respect to θ gives
+
+$$ J_F(x^0, \theta) \frac{dx^0}{d\theta} + \frac{\partial F}{\partial \theta}(x^0, \theta) = 0. \tag{13} $$
+
+Consequently, the exact gradient of the cost is
+
+$$ \frac{dC}{d\theta}(x^0) = \left( \frac{dx^0}{d\theta} \right)^{\top} \frac{\partial C}{\partial x}(x^0) = -\underbrace{\left( \frac{\partial F}{\partial \theta}(x^0, \theta) \right)^{\top}}_{\text{pre-synaptic}} \underbrace{\left( J_F^{\top}(x^0, \theta) \right)^{-1} \frac{\partial C}{\partial x}(x^0)}_{\text{post-synaptic}}. \tag{14} $$
+
+The terms 'pre-synaptic' and 'post-synaptic' in Eq. (14) are used by analogy with neuronal transmission: the
+pre-synaptic factor captures the local influence of θ on the force F, while the post-synaptic factor is the
+sensitivity of the cost to state perturbations.
+
+If instead we differentiate the nudged equilibrium condition in Eq. (11) with respect to β and evaluate at β = 0, we
+obtain
+
+$$ J_F(x^0, \theta) \left. \frac{dx^\beta}{d\beta} \right|_{\beta=0} - \frac{\partial C}{\partial x}(x^0) = 0, \tag{15} $$
+
+which gives
+
+$$ \left. \frac{dx^\beta}{d\beta} \right|_{\beta=0} = J_F(x^0, \theta)^{-1} \frac{\partial C}{\partial x}(x^0, y). \tag{16} $$
+
+The right-hand side of Eq. (16) represents the effective post-synaptic term used by the VF algorithm (Eq. 12).
+Comparing this with the exact post-synaptic term derived in Eq. (14), we see that they coincide only if
+$J_F = J_F^{\top}$, i.e., only if the system is conservative.
+
+Now, let $S_J(x^0, \theta)$ and $A_J(x^0, \theta)$ denote the symmetric and antisymmetric parts of the Jacobian at the
+free (unnudged) equilibrium, respectively. Then, we show in Appendix A that the gradient error increases with the
+spectral radius of $S_J(x^0, \theta)^{-1} A_J(x^0, \theta)$. Consequently, large antisymmetric contributions degrade the
+gradient estimation, confirming empirical observations in the Appendix of (Ernoult et al., 2020). In fact, in the
+pathological limit where the Jacobian would be purely antisymmetric, $S_J(x^0, \theta) = 0$, the update of VF gives the
+negative of the true gradient, maximizing the cost rather than minimizing it.
+
+3. Asymmetric EP
+Here, we introduce Asymmetric EP (AsymEP), see Algorithm 1, which removes the gradient estimate error inherent to
+VF by adding a local correction term to the augmented inference dynamics. The new nudged equilibrium $x_A^\beta$
+satisfies:
+
+$$ F(x_A^\beta, \theta) - \beta \frac{\partial C}{\partial x}(x_A^\beta) - 2 A_J(x^0, \theta)\,(x_A^\beta - x^0) = 0, \tag{22} $$
+
+As in VF, we then obtain two perturbed states $x_A^{\pm\beta}$ for opposite nudging ±β and apply the contrastive
+learning rule of Eq. (12).
+
+Algorithm 1 Asymmetric EP (AsymEP)
+  Inputs: Force field F(x, θ), cost function C(x), nudging parameter β, learning rate ϵ.
+  repeat
+    1. Free Phase: Evolve to stationary state
+       Evolve the system dynamics
+         $$ \frac{dx}{dt} = F(x, \theta), \tag{17} $$
+       until convergence to the stationary state $x^0$.
+    2. Jacobian Decomposition
+       Compute the Jacobian at equilibrium:
+         $$ J_F(x^0, \theta) = \frac{\partial F}{\partial x}(x^0, \theta), \tag{18} $$
+       and extract its antisymmetric part:
+         $$ A_J(x^0, \theta) = \tfrac{1}{2} \left( J_F(x^0, \theta) - J_F(x^0, \theta)^{\top} \right). \tag{19} $$
+    3. Nudged Phase: Augmented Dynamics
+       Integrate the dynamics twice (once for +β, once for −β), starting from $x^0$,
+         $$ \frac{dx}{dt} = F(x, \theta) - \beta \frac{\partial C}{\partial x}(x) - 2 A_J(x^0, \theta)\,(x - x^0), \tag{20} $$
+       until convergence to two new stationary states $x_A^{\pm\beta}$.
+    4. Parameter Update
+       Update the parameters according to:
+         $$ \Delta\theta = \epsilon \left( \frac{\partial F}{\partial \theta}(x^0, \theta) \right)^{\top} \left( \frac{x_A^{\beta} - x_A^{-\beta}}{2\beta} \right). \tag{21} $$
+  until convergence of θ
+  Output: Optimized parameters θ.
+
+until a stationary point $(z^\beta, z'^\beta)$ is reached. Upon convergence, we follow the standard EP paradigm in
+using the difference $z^\beta - z'^\beta$ to compute the post-synaptic term. Under the change of variables
+$m = \frac{z + z'}{2}$ and $d = z - z'$, we prove in Appendix D that m follows the original dynamics F (ensuring valid
+inference), while d relaxes to a "physical" error signal proportional to the cost gradient.
+
+We now show that AsymEP gives rise to the correct learning rule, i.e., that the right-hand side of Eq. (21) is
+proportional to the gradient of the cost function $\frac{dC}{d\theta}(x^0)$ at the equilibrium point $x^0$ (Eq. 14). To
+this end, note that the same reasoning leading to Eq. (16) leads to
+
+$$ \left. \frac{dx_A^\beta}{d\beta} \right|_{\beta=0} = J_{F_A}(x^0, \theta)^{-1} \frac{\partial C}{\partial x}(x^0), \tag{23} $$
+
+where $J_{F_A}(x, \theta)$ is the Jacobian of the modified dynamical system Eq. (20). At the equilibrium point $x^0$,
+$J_{F_A}$ is equal to the transpose of the original Jacobian:
+
+$$ J_{F_A}(x^0, \theta) = J_F(x^0, \theta) - 2 A_J(x^0, \theta) = S_J(x^0, \theta) - A_J(x^0, \theta) = J_F^{\top}(x^0, \theta), \tag{24} $$
+
+where we have used the decomposition Eq. (44) of the original Jacobian J into its symmetric and antisymmetric
+components. Therefore, the left-hand side of Eq. (23) is equal to the true post-synaptic term
+
+$$ \left. \frac{dx_A^\beta}{d\beta} \right|_{\beta=0} = \left( J_F^{\top}(x^0, \theta) \right)^{-1} \frac{\partial C}{\partial x}(x^0), \tag{25} $$
+
+which, using Eq. (14), proves the result.
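The comparison between Eq. (16) and Eq. (25) can be checked numerically on a toy problem. The sketch below assumes a
two-dimensional linear force field F(x, θ) = −Mx + θ with a fixed non-symmetric matrix M and a quadratic cost
C(x) = ½‖x − y‖², so that each stationary state follows from a 2×2 linear solve rather than from a relaxation. The
values of M, θ, y and β and all helper names are illustrative choices for this example, not taken from the paper. It
evaluates the VF estimate (Eqs. 11–12), the AsymEP estimate (Eqs. 21–22), and the exact gradient of Eq. (14).

```python
# Toy non-conservative system; every number here is an illustrative assumption.
# F(x, theta) = -M x + theta, so J_F = -M and dF/dtheta is the identity.
M = [[2.0, 1.0], [-0.5, 1.5]]
THETA = [1.0, -1.0]
Y = [0.5, -0.5]
BETA = 1e-3


def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def matvec(a, v):
    """Product of a 2x2 matrix with a length-2 vector."""
    return [a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1]]


J = [[-M[i][j] for j in range(2)] for i in range(2)]                    # Eq. (18)
A = [[0.5 * (J[i][j] - J[j][i]) for j in range(2)] for i in range(2)]   # Eq. (19)
X0 = solve2(M, THETA)   # free equilibrium: F(x0, theta) = 0


def vf_equilibrium(beta):
    """Stationary state of Eq. (11): -M x + theta - beta (x - y) = 0."""
    lhs = [[M[i][j] + (beta if i == j else 0.0) for j in range(2)]
           for i in range(2)]
    rhs = [THETA[i] + beta * Y[i] for i in range(2)]
    return solve2(lhs, rhs)


def asymep_equilibrium(beta):
    """Stationary state of Eq. (22): F(x) - beta dC/dx - 2 A (x - x0) = 0."""
    lhs = [[M[i][j] + 2.0 * A[i][j] + (beta if i == j else 0.0)
            for j in range(2)] for i in range(2)]
    ax0 = matvec(A, X0)
    rhs = [THETA[i] + beta * Y[i] + 2.0 * ax0[i] for i in range(2)]
    return solve2(lhs, rhs)


def contrastive_grad(equilibrium):
    """Gradient estimate -(x^beta - x^-beta) / (2 beta); dF/dtheta = I drops out."""
    xp, xm = equilibrium(BETA), equilibrium(-BETA)
    return [-(xp[i] - xm[i]) / (2.0 * BETA) for i in range(2)]


# Exact gradient, Eq. (14): -(J_F^T)^{-1} dC/dx(x0), with dC/dx = x - y.
JT = [[J[0][0], J[1][0]], [J[0][1], J[1][1]]]
RESIDUAL = [X0[i] - Y[i] for i in range(2)]
G_EXACT = [-g for g in solve2(JT, RESIDUAL)]
G_VF = contrastive_grad(vf_equilibrium)
G_ASYM = contrastive_grad(asymep_equilibrium)

print("exact :", G_EXACT)
print("VF    :", G_VF)
print("AsymEP:", G_ASYM)
```

With this choice of M the AsymEP estimate agrees with the exact gradient up to the O(β²) finite-difference error,
while the VF estimate does not. Replacing M by a symmetric matrix makes A vanish, so the two estimates coincide, as
expected for a conservative force.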
Additionally, although implied by the equality with the true gradient, we explicitly demonstrate the equivalence of
+the gradient estimates obtained by AsymEP and Backpropagation Through Time in Appendix B, following (Ernoult et al.,
+2019).
+Note that the corrective term $-2 A_J(x^0, \theta)(x - x^0)$ in Eq. (20) is spatially local: $A_J$ vanishes for
+unconnected neurons, and $(x - x^0)$ is available at the synapse given the memory mechanism already required by
+Eq. (12). This correction can create backward connections (Section 5.3). However, in physical realizations, both
+feedforward and feedback connections must be physically present, though feedback may be deactivated during inference.
+
+4. Dyadic EP
+We now introduce Dyadic EP (Algorithm 2), a variational algorithm that computes the exact cost gradient in the limit
+of infinitesimal nudging. It maps the original n-variable dynamics F(x, θ) onto a 2n-variable system (z, z′) defined
+by an energy H(z, z′, θ) and cost D(z, z′). We show in Appendix E that AsymEP can be seen as the first-order
+projection of Dyadic EP onto the original n-dimensional state space.
+
+The new system is defined by the energy H and cost function D, given in terms of F and C by:
+
+$$ H(z, z', \theta) = -(z - z')^{\top} F\!\left( \frac{z + z'}{2}, \theta \right), \qquad D(z, z') = C\!\left( \frac{z + z'}{2} \right), \tag{26} $$
+
+where $z, z' \in \mathbb{R}^n$. In order to learn, we introduce the augmented energy
+
+$$ H_T(z, z', \theta, \beta) = H(z, z', \theta) + \beta D(z, z'). \tag{27} $$
+
+The equilibrium configuration corresponds to a saddle point of $H_T$, where z minimizes and z′ maximizes the energy.
+This poses no issue for EP, which requires only that the joint state (z, z′) reaches a stationary state. Although
+this min-maximization can be interpreted as z evolving forward and z′ backward in time, in practice they evolve
+forward simultaneously, as we integrate the coupled equations:
+
+$$ \frac{dz}{dt} = -\frac{\partial H_T}{\partial z} = F\!\left( \frac{z + z'}{2}, \theta \right) + \left( \frac{z - z'}{2} \right)^{\top} \left. \frac{\partial F}{\partial z} \right|_{\frac{z + z'}{2}} - \frac{\beta}{2} \frac{\partial C}{\partial z}\!\left( \frac{z + z'}{2} \right), $$
+$$ \frac{dz'}{dt} = +\frac{\partial H_T}{\partial z'} = F\!\left( \frac{z + z'}{2}, \theta \right) - \left( \frac{z - z'}{2} \right)^{\top} \left. \frac{\partial F}{\partial z'} \right|_{\frac{z + z'}{2}} + \frac{\beta}{2} \frac{\partial C}{\partial z'}\!\left( \frac{z + z'}{2} \right), \tag{28} $$
+
+It is important to notice that while Dyadic EP introduces a distinct formulation, it remains consistent with the
+general theoretical setting of EP and matches the computational cost of AsymEP. Note also that we start the
+evolution of the free phase (β = 0) with identical initial conditions for z and z′ (i.e., d = 0). This guarantees
+that integrating Eq. (32) leads to a symmetric stationary point where $z^0 = z'^0$. Finally, we underline that the
+modified variational update rule in Eq. (34) is equivalent to the standard symmetric EP update rule in Eq. (8) (see
+Appendix D).
+
+Now, to make this concrete, consider a continuous Hopfield network (see also Eq. (35)) with an asymmetric connection
+matrix J. After some calculations (see Appendix F), the augmented energy of the system can be re-expressed as:
+
+$$ H_T = -\tfrac{1}{2} \rho(z)^{\top} S \rho(z) + \tfrac{1}{2} \rho(z')^{\top} S \rho(z') - \rho(z)^{\top} A \rho(z') + \tfrac{1}{2} \left( \|z\|^2 - \|z'\|^2 \right) + \tfrac{\beta}{2} \left( C(z, y) + C(z', y) \right), \tag{29} $$
+
+where S and A are the symmetric and antisymmetric parts of J, respectively, and ρ is an element-wise non-linearity.
+An interesting analogy can be drawn with standard learning rules in discrete Hopfield networks (Hopfield, 1982). For
+a sequence of binary memories $\{\xi^1, \ldots, \xi^m\}$ where $\xi^\mu \in \{-1, 1\}^n$, S corresponds to the standard
+autoassociative Hebbian rule $\sum_\mu \xi^\mu (\xi^\mu)^{\top}$, creating stable attractors at each pattern. In
+contrast, A corresponds to the heteroassociative rule (e.g., a cycle between $\xi^\mu$ and $\xi^\nu$ given by
+$\xi^\nu (\xi^\mu)^{\top} - \xi^\mu (\xi^\nu)^{\top}$), encoding transitions between patterns.
+
+For this specific energy, the update rule given by Eq. (34) can be re-expressed as:
+
+$$ \Delta J \propto -\frac{1}{2\beta} \left( \rho(z'^{\beta}) - \rho(z^{\beta}) \right) \left( \rho(z'^{\beta}) + \rho(z^{\beta}) \right)^{\top}. \tag{30} $$
+
+In the limit β → 0, this gives:
+
+$$ \Delta J \propto \left( \frac{d^{\beta}}{\beta} \odot \rho'(m) \right) \rho(m)^{\top}, \tag{31} $$
+
+matching the learning rule in (Pineda, 1987), with $\lim_{\beta \to 0} \frac{d^\beta}{\beta}$ being the error signal.
+
+Algorithm 2 Dyadic EP
+  Inputs: Force field F(x, θ), cost function C(x, y), nudging parameter β, learning rate ϵ
+  repeat
+    1. Free Phase: Evolve to stationary state
+       Evolve the system dynamics, starting from identical initial conditions $z(0) = z'(0) = z_0$,
+         $$ \frac{dz}{dt} = -\frac{\partial H}{\partial z}, \qquad \frac{dz'}{dt} = +\frac{\partial H}{\partial z'}, \tag{32} $$
+       until stationary states $z^0, z'^0$ are reached.
+    2. Nudged Equilibrium
+       Evolve the system dynamics, starting from the solution of the free phase $z^0 = z'^0$:
+         $$ \frac{dz}{dt} = -\frac{\partial H_T}{\partial z}, \qquad \frac{dz'}{dt} = +\frac{\partial H_T}{\partial z'}, \tag{33} $$
+       until two nudged stationary states $z^\beta, z'^\beta$ are reached.
+    3. Parameter Update
+       Update the parameters according to:
+         $$ \Delta\theta = -\epsilon\, \frac{1}{\beta} \frac{\partial H(z^\beta, z'^\beta, \theta)}{\partial \theta} \tag{34} $$
+  until convergence of θ
+  Output: Optimized parameters θ.
+
+5. Numerical Experiments
+In this section, we numerically validate AsymEP (Algorithm 1). The neuronal dynamics follows the one introduced in
+(Scellier & Bengio, 2017), and is generalized to allow for non-reciprocal forces as in (Scellier et al., 2018). For
+clarity, we express the forces in a form that explicitly separates the contributions of the external input and the
+recurrent interactions:
+
+$$ F(x) = \rho'(x) \odot \left( J^{in} u + J^{dyn} \rho(x) \right) - x, \tag{35} $$
+
+where $u \in \mathbb{R}^{N_{in}}$ denotes the input and $x \in \mathbb{R}^{N_{dyn}}$ the neuronal state, comprising both
+hidden and output units. The matrices $J^{in} \in \mathbb{R}^{N_{dyn} \times N_{in}}$ and
+$J^{dyn} \in \mathbb{R}^{N_{dyn} \times N_{dyn}}$ define the input and recurrent connectivity, respectively. The
+activation function ρ(·) is taken to be the hyperbolic tangent, applied element-wise.
+
+where $\| \cdot \|_F$ denotes the Frobenius norm. Note that this metric does not capture the asymmetry of the
+Jacobian, which depends on the state x.
+
+For numerical experiments, we restricted the network to a layered architecture with a single hidden layer to
+facilitate comparison with prior work. Accordingly, $J^{in}$ contains only input-to-hidden connections, while
+$J^{dyn}$ is block off-diagonal, encoding bidirectional interactions between the hidden and output layers. Both
+$J^{in}$ and $J^{dyn}$ are trained.
+
+We first use MNIST (LeCun, 1998) (60k train, 10k test) followed by Fashion-MNIST to validate AsymEP, and then we
+further validate AsymEP and Dyadic EP by comparing them to Backpropagation on a convolutional feedforward network
+with CIFAR-10. Inputs are normalized using min-max to [−1, 1] and targets are one-hot encoded in {−1, 1}. All
+hyperparameters are detailed in Appendix G, along with additional details and numerical results.
+
+5.1. Symmetric Initialization
+We start by comparing AsymEP with standard EP and VF. All algorithms are initialized with an identical symmetric
+matrix $J^{dyn}$. EP maintains this symmetry throughout training, while VF and AsymEP induce asymmetry in
+$J^{dyn}$. Since EP and VF already achieve strong performance on MNIST, the purpose of this experiment is to
+validate AsymEP and compare it against EP and VF rather than outperform the state of the art.
+
+Figure 1 compares the three algorithms as a function of hidden-layer dimension after 1 and 20 epochs. AsymEP
+consistently outperforms the baselines, suggesting it learns faster and better.
+Figure 2 studies the evolution of the asymmetry ratio $r_{str}$. The results are reported for 50 hidden neurons. As
+expected, EP preserves the initial weight symmetry. In contrast, VF and AsymEP induce non-trivial evolution of
+$r_{str}$ following two distinct patterns, resulting in three distinct network configurations. A complementary figure
+is available in Appendix G.1.
+
+5.2.
Fixed Asymmetry Ratio + +If J dyn is symmetric, we can define the energy: +E(x) = + +While the previous section focused on networks compatible +with all three algorithms (EP, VF, AsymEP), we now turn +to architectures with strong structural asymmetry. In this +regime, EP is inapplicable by construction, and, as we show, +VF performs poorly, contrary to AsymEP which remains +effective. + +1 +1 +∥x∥2 − ρ(x)⊤ J dyn ρ(x) − ρ(x)⊤ J in u, (36) +2 +2 + +which is identical to that of (Scellier &Bengio, 2017), provided that the input neurons are activated as ρ(u). +Equation (35) naturally motivates a quantitative measure of +structural asymmetry rstr , defined as: + +To this end, we consider a class of networks where the +asymmetry ratio rstr defined in Eq. (37) is kept fixed. Let S̃ +and à be arbitrary symmetric and antisymmetric matrices +in RNdyn ×Ndyn respectively. We enforce a fixed rstr via the + +⊤ + +rstr = + +∥(J dyn − J dyn )/2∥F +, +∥J dyn ∥F + +(37) +6 + +Equilibrium Propagation for Non-Conservative Systems + +where γ ∈ R is a learnable global scale. +Using VF and AsymEP, we train a layered network with one +hidden layer of 50 neurons (in which case S̃ and à are block +off-diagonal) for different values of rstr to investigate the +impact of structural asymmetry. We compare two training +regimes: training only the input weights J in (and the scale +γ), versus training all parameters including J dyn . The first +regime trains only the external forces from the input ρ′ (x) ⊙ +J in u (which correspond to a symmetric contribution in the +Jacobian) applied to our non-conservative system, while +the second additionally trains J dyn and therefore the nonsymmetric part of the Jacobian directly. + +(a) Results after one epoch. + +Figure 3 summarizes the results. 
We find that AsymEP +maintains robust performance across all asymmetry levels +(e.g., achieving an accuracy of 93.8 ± 0.4% at rstr = 0 and +94.9 ± 0.2% at rstr = 0.875 when training all parameters) +and can even learn when the recurrent connection matrix +J dyn is completely antisymmetric (rstr = 1). Additionally, +training all parameters shows significant improvement over +training only J in . +In contrast, VF performs well at low asymmetry ratios +but degrades as asymmetry increases, eventually dropping +to chance levels (e.g., accuracies of 5 ± 3% and 8 ± 4% +at rstr = 1 for input-only and all-parameter training, respectively). When only J in is trained, VF accuracy collapses around rstr ≈ 0.5, whereas training all parameters +delays this collapse until rstr ≈ 0.8. Our analysis in Appendix G.2.1 reveals that VF adjusts the dynamics such that +the asymmetry of the Jacobian’s off-diagonal terms remains +strictly lower than the structural asymmetry ratio. The training appears to adjust the neuronal state such that neurons +connected by strongly asymmetric weights have low activation. As shown in Appendix G.2.1, AsymEP learns faster +than VF across all levels of asymmetry. + +(b) Results after 20 epochs. +Figure 1. Comparison of algorithm performance on MNIST using +a layered architecture with one hidden layer and symmetric initialization. Squares denote AsymEP, circles EP, and triangles VF. +Test accuracy (averaged over 10 runs) is shown after one epoch +(Fig. 1a) and 20 epochs (Fig. 1b). + +Finally, Appendix G.3 opens with a brief theoretical discussion of the stability of these non-conservative dynamics, +followed by simulations on all-to-all topologies with constrained rstr and input projections J in . Even in this worstcase setting, AsymEP reduces oscillations and improves +stability. +5.3. Feedforward Architectures +Figure 2. Evolution of the asymmetry ratio rstr (defined in Eq. 
(37)) +during training on MNIST for AsymEP, EP and VF, initialized +from a symmetric configuration. The models use 50 hidden neurons. + +We now consider a purely feedforward architecture. Here +VF trains only the last layer: with no backward connections, +the output nudging signal cannot reach earlier layers, so for +every layer but the last the nudged stationary states coincide +with the free states, giving zero weight updates. As only +the output layer is trained, the system essentially becomes +an Extreme Learning Machine (Huang et al., 2006; Wang +et al., 2022). In contrast, AsymEP introduces a correction +that generates effective backward connections, allowing the + +following parameterization of the recurrent parameters: +"q +# +S̃ +à +dyn +2 +J =γ +1 − rstr ++ rstr +, +(38) +∥S̃∥F +∥Ã∥F +7 + +Equilibrium Propagation for Non-Conservative Systems + +tivity structures inspired by (Millidge et al., 2023), while +keeping the number of trainable parameters fixed. +Experiments are conducted on Fashion-MNIST using a twohidden-layer network with hidden dimensions 500 and 200. +Network states are denoted (x0 , x1 , x2 , x3 ), where x0 is +the input and x3 = xL the output. Forward and backward +connections are denoted by Wk and Bk , respectively, with +W1 = J in . +We consider three classes of dynamics. First, the Continuous +Hopfield (CH) dynamics introduced previously: + + +dxk += −xk +ρ′ (xk )⊙ Wk ρ(xk−1 )+(1−δk,L )Bk ρ(xk+1 ) . +dt +(40) +Second, Predictive Coding (PC) dynamics, defined through +the prediction errors ek = xk − Wk ρ(xk−1 ), whose fixed +point ek = 0 corresponds to a standard feedforward network: + +Figure 3. Impact of the structural asymmetry ratio rstr on accuracy +(top) and standard deviation over 10 runs (bottom) on MNIST. +We compare VF (orange) and AsymEP (blue) under two training +regimes: training only J in (dashed) or all parameters (solid). + +dxk += −ek + (1 − δk,L ) (ρ′ (xk ) ⊙ (Bk ek+1 )) . +dt + +nudging signal to influence all layers. 
We make this explicit +for a network with one hidden layer. + +(41) + +Third, a standard dynamics chosen for direct comparison +with backpropagation: + +Let the state x be partitioned in hidden h and output o +layers. The recurrent connection matrix is then J dyn = + +0 +0 +. The forces of the system are: +Wh→o 0 + β + + +Fh = ρ′ (h) ⊙ J in u + λ(Wh→o )⊤ (o − o0 ) − h + + + + + + β +0 +Fo = ρ′ (o) ⊙ Wh→o ρ(h) − λWh→o (h − h ) +(39) + + + +∂C + + +−o ++ λβ +∂o + +dxk += −xk + Wk ρ(xk−1 ) + (1 − δk,L )Bk ρ(xk+1 ). (42) +dt +For each dynamics, we examine three connectivity scenarios. +⊤ +• In the asymmetric case (Bk ̸= Wk+1 +), the backward +weights Bk are randomly initialized and kept fixed +while only the forward weights are trained, ensuring a +fair comparison (i.e., identical number of parameters); +in PC, the learning rule for Bk is zero when only inputs +are clamped. + +where λ is 0 during the free inference and 1 during the +nudged phase (Eq. 20). The force on the hidden layer Fhβ +now depends on the output layer through the term ρ′ (h) ⊙ +⊤ +(Wh→o ) (o − o0 ), enabling the nudge (the term β ∂C +∂o ) to +influence the hidden layer. This implicitly assumes that the +hardware implementation supports the physical activation +of these backward connections. + +⊤ +• In the symmetric / conservative case (Bk = Wk+1 +), the +CH and PC dynamics derive from an energy functional, +while the standard dynamics remains non-conservative +due to its non-symmetric Jacobian. + +• In the feedforward case (Bk = 0), the PC and standard dynamics coincide; for the standard dynamics, the +AsymEP learning rule mirrors backpropagation, with +1 +∆xβk = 2β +(xβ − x−β ) acting as the propagated error +signal. + +We validate this using a single hidden layer of only 20 neurons on MNIST. After training, VF saturates with 64.3 ± +2.0% accuracy, whereas AsymEP reaches 92.7 ± 0.5% accuracy. 
We expect this discrepancy to increase with network +depth, since this increases the number of layers unable to +learn under VF. A figure with the accuracy during training +can be found in Appendix G.4.2. + +Table 1 shows that AsymEP consistently outperforms VF +in both asymmetric and feedforward settings, in final accuracy, learning speed, and stability. After a single epoch +it already provides on average a 15% accuracy gain with +an order-of-magnitude reduction in variance. Remarkably, +AsymEP with asymmetric connectivity also surpasses EP +on symmetric networks despite training only the forward + +5.4. Advantages of Non-Conservative Dynamics +AsymEP is not tied to a specific neural dynamics. To further +assess the benefits of training non-conservative dynamics +using AsymEP, we compare several dynamics and connec8 + +Equilibrium Propagation for Non-Conservative Systems + +6. Discussion and Conclusion + +weights, suggesting that relaxing symmetry constraints may +improve expressivity. Supplementary results are provided +in Appendix G.5. + +In this work, we extended Equilibrium Propagation (EP) +to non-conservative systems that reach stationary states by +deriving two mathematically equivalent algorithms that recover the exact gradient of the cost function in the limit of +infinitesimal nudging. + +Table 1. Test accuracy on Fashion-MNIST (%) at Epoch 50 (mean +± std 10 runs). BP on a standard feedforward architecture using +MSE and SGD achieve 87.37 ± 0.29%. +EP + +AsymEP + +Asym +86.78 ± 0.14 +Feedfor +86.05 ± 0.12 +Sym +84.30 ± 0.13 +Asym +86.20 ± 0.17 +PC +Sym +84.78 ± 0.14 +Asym +82.91 ± 0.48 +Standard +Feedfor +86.25 ± 0.16 +CH + +The first approach, Asymmetric EP, preserves the original +inference dynamics. It introduces a corrective force during +the nudged phase that remains spatially local, as the antisymmetric Jacobian is null for unconnected neurons and the +perturbation from equilibrium is available at the synapse +level. 
Unlike standard methods like Recurrent Backpropagation (Almeida, 1990; Pineda, 1987), this avoids explicit +digital weight transposition. However, a physical mechanism to obtain the local corrective force at the synapse +level remains a subject for future work. We also note that +AsymEP shares the temporal non-locality of standard EP. + +VF +85.20 ± 0.12 +77.76 ± 0.37 +80.71 ± 6.17 +75.52 ± 1.69 +78.58 ± 0.28 + +Finally, to investigate how AsymEP scales with depth, we +trained deeper fully connected networks with two and three +hidden layers of 500 neurons on Fashion-MNIST, reaching +86.41 ± 0.22% and 87.8 ± 0.15% test accuracy respectively. + +The second approach, Dyadic EP, doubles the state space +to map non-reciprocal dynamics onto an energy landscape—conceptually reminiscent of multi-compartment cortical neurons, where apical dendrites integrate feedback +(analogous to z − z ′ ) separately from basal feedforward +input (analogous to z + z ′ ) (Guerguiev et al., 2017). Additionally, this expanded space also enables the positive and +negative nudging phases to run in parallel. This offers a +pathway to implement a version of EP that is local in time, +but would require a doubling of the degrees of freedom +on the physical hardware. More fundamentally, the energy +defined on the extended state shows that the tools and theoretical guarantees obtained for EP should also apply to +the case of non-reciprocal forces, and that the variational +principle behind EP is universal in the sense that it can be +applied to all networks which operate in a stationary state. + +5.5. Feedforward Training on CIFAR-10: BP vs. Dyadic +EP vs. AsymEP +To test whether our framework scales beyond shallow networks, we conclude with a deep, purely feedforward CNN +architecture trained on CIFAR-10. 
We compare backpropagation (BP), VF, AsymEP and Dyadic EP in a controlled +setting where the gradient estimator is the only difference +between runs: all methods share the same configuration, +with the BP gradient replaced by the contrast of stationary +states for the EP-based methods (see App. G.6 for details). +Each configuration is trained for 40 epochs over 5 seeds. + +Furthermore, Dyadic EP is not restricted to the EP community and could suggest a more physically plausible alternative to the stationary-state Adjoint Method (for fixed +inputs) (Chen et al., 2018): by solving the forward and adjoint equations simultaneously via relaxation, it circumvents +a separate backward-in-time pass. + +Table 2 reports the final test accuracy. Both of our algorithms scale to this regime, closely tracking the BP baseline +throughout training and matching its final accuracy: a paired +t-test finds no significant difference between Dyadic EP and +BP (p = 0.75), and only a sub-percent gap for AsymEP. +In contrast, VF makes slight initial progress (peaking near +30%) before collapsing to chance level (10%). Additional +details can be found in Appendix G.6 + +Finally, our experiments on MNIST, Fashion-MNIST, and +CIFAR-10 confirm that AsymEP and Dyadic EP consistently outperform EP and VF, and notably enables effective +training of feedforward networks. +Our work thus opens new avenues for learning in neuromorphic hardware, dissipative physical systems, and neural +architectures where asymmetry is intrinsic rather than incidental. + +Table 2. Test accuracy on CIFAR-10 (%) at epoch 40 (mean ± std +over 5 seeds). +Method + +Test Acc. (%) + +Backpropagation +Dyadic EP +AsymEP +VF + +90.66 ± 0.25 +90.69 ± 0.14 +89.74 ± 0.14 +10.00 ± 0.00 + +9 + +Equilibrium Propagation for Non-Conservative Systems + +Impact Statement + +References + +This paper presents results that advance the field of machine +learning. 
There are many potential societal consequences +of our work, none of which we feel must be specifically +highlighted here. + +Almeida, L. B. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In +Artificial neural networks: concept learning, pp. 102–111. +1990. + +Acknowledgments + +Altman, L. E., Stern, M., Liu, A. J., and Durian, D. J. Experimental demonstration of coupled learning in elastic +networks. Physical Review Applied, 22(2):024053, 2024. + +AES is fully funded by the Horizon Europe Marie +Skłodowska-Curie Doctoral Network ’Postdigital Plus’ +(Grant 101169118). DVA acknowledges the support of +the French Community of Belgium through a FRIA fellowship. SM acknowledges financial support by the Fonds de la +Recherche Scientifique–FNRS, Belgium under EOS Project +No. 40007536. Computational resources have been provided by the Consortium des Équipements de Calcul Intensif +(CÉCI), funded by the Fonds de la Recherche Scientifique +de Belgique (F.R.S.-FNRS) under Grant No. 2.5020.11 and +by the Walloon Region. + +Aykroyd, C., Bourgoin, A., and Poncin-Lafitte, C. L. Hamiltonian treatment of non-conservative systems. arXiv +preprint arXiv:2507.18658, 2025. +Bateman, H. On dissipative systems and related variational +principles. Physical Review, 38(4):815, 1931. +Berneman, M. and Hexner, D. Equilibrium propagation for +dissipative dynamics. Advanced Intelligent Systems, pp. +e202501310, 2025. +Bishop, K. J., Biswal, S. L., and Bharti, B. Active colloids +as models, materials, and machines. Annual Review of +Chemical and Biomolecular Engineering, 14(1):1–30, +2023. + +“ἁρμονίη ἀφανὴς φανερῆς κρείττων” + +Bowick, M. J., Fakhri, N., Marchetti, M. C., and Ramaswamy, S. Symmetry, thermodynamics, and topology +in active matter. Physical Review X, 12(1):010501, 2022. +Brandenbourger, M., Locsin, X., Lerner, E., and Coulais, C. +Non-reciprocal robotic metamaterials. Nature communications, 10(1):4608, 2019. +Cesa-Bianchi, N. 
and Lugosi, G. Prediction, learning, and games. Cambridge university press, 2006.
+
+Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.
+
+Cin, N. D., Marquardt, F., and Wanjura, C. C. Training nonlinear optical neural networks with scattering backpropagation. arXiv preprint arXiv:2508.11750, 2025.
+
+Costa, P. and Santos, P. A. Directed equilibrium propagation revisited. Mathematics, 13(11), 2025. ISSN 2227-7390.
+
+Crick, F. The recent excitement about neural networks. Nature, 337, 1989.
+
+Dillavou, S., Stern, M., Liu, A. J., and Durian, D. J. Demonstration of decentralized physics-driven learning. Physical Review Applied, 18(1):014040, 2022.
+
+Dillavou, S., Beyer, B. D., Stern, M., Liu, A. J., Miskin, M. Z., and Durian, D. J. Machine learning without a processor: Emergent learning in a nonlinear analog network. Proceedings of the National Academy of Sciences, 121(28):e2319718121, 2024.
+
+Ernoult, M., Grollier, J., Querlioz, D., Bengio, Y., and Scellier, B. Updates of equilibrium prop match gradients of backprop through time in an rnn with static input. Advances in neural information processing systems, 32, 2019.
+
+Ernoult, M., Grollier, J., Querlioz, D., Bengio, Y., and Scellier, B. Equilibrium propagation with continual weight updates. arXiv preprint arXiv:2005.04168, 2020.
+
+Falk, M. J., Strupp, A. T., Scellier, B., and Murugan, A. Temporal contrastive learning through implicit nonequilibrium memory. Nature Communications, (16), 2025.
+
+Farinha, M. T., Pequito, S., Santos, P. A., and Figueiredo, M. A. T. Equilibrium propagation for complete directed neural networks. In Proceedings of the 28th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2020), 2020.
+
+Galley, C. R. Classical mechanics of nonconservative systems. Physical review letters, 110(17):174301, 2013.
+
+Guerguiev, J., Lillicrap, T. P., and Richards, B. A. Towards deep learning with segregated dendrites. elife, 6:e22901, 2017.
+
+Høier, R. and Zach, C. A lagrangian perspective on dual propagation. In Proceedings of the First Workshop on Machine Learning with New Compute Paradigms at NeurIPS 2023, New Orleans, LA, USA, Dec 2023.
+
+Høier, R. and Zach, C. Two tales of single-phase contrastive hebbian learning. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 18470–18488. PMLR, 21–27 Jul 2024.
+
+Høier, R., Staudt, D., and Zach, C. Dual propagation: accelerating contrastive hebbian learning with dyadic neurons. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023.
+
+Høier, R., Kalinin, K., Ernoult, M., and Zach, C. Dyadic learning in recurrent and feedforward models. In NeurIPS 2024 Workshop Machine Learning with new Compute Paradigms, 2024.
+
+Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
+
+Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. Extreme learning machine: theory and applications. Neurocomputing, 70(1-3):489–501, 2006.
+
+Indiveri, G. and Liu, S.-C. Memory and information processing in neuromorphic systems. Proceedings of the IEEE, 103(8):1379–1397, 2015.
+
+Kalinin, K. P., Gladrow, J., Chu, J., Clegg, J. H., Cletheroe, D., Kelly, D. J., Rahmani, B., Brennan, G., Canakci, B., Falck, F., et al. Analog optical computer for ai inference and combinatorial optimization. Nature, 645(8080):354–361, 2025.
+
+Kendall, J., Pantone, R., Manickavasagam, K., Bengio, Y., and Scellier, B. Training end-to-end analog neural networks with equilibrium propagation. arXiv preprint arXiv:2006.01981, 2020.
+
+Laborieux, A. and Zenke, F. Holomorphic equilibrium propagation computes exact gradients through finite size oscillations. Advances in Neural Information Processing Systems, 35:12950–12963, 2022.
+
+Laborieux, A. and Zenke, F. Improving equilibrium propagation without weight symmetry through jacobian homeostasis. In Proceedings of the International Conference on Learning Representations (ICLR) 2024, Virtual (ICLR), May 2024.
+
+Laborieux, A., Ernoult, M., Scellier, B., Bengio, Y., Grollier, J., and Querlioz, D. Scaling equilibrium propagation to deep convnets by drastically reducing its gradient estimator bias. Frontiers in neuroscience, 15:633674, 2021.
+
+Laydevant, J., Marković, D., and Grollier, J. Training an ising machine with equilibrium propagation. Nature Communications, 15(1):3671, 2024.
+
+LeCun, Y. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
+
+Martin, E., Ernoult, M., Laydevant, J., Li, S., Querlioz, D., Petrisor, T., and Grollier, J. Eqspike: spike-driven equilibrium propagation for neuromorphic implementations. Iscience, 24(3), 2021.
+
+Massar, S. Equilibrium propagation for learning in lagrangian dynamical systems. Physical Review E, 112(3):035304, 2025.
+
+Massar, S. and Mognetti, B. M. Equilibrium propagation: the quantum and the thermal cases. Quantum Studies: Mathematics and Foundations, 12(1):6, 2025.
+
+Millidge, B., Song, Y., Salvatori, T., Lukasiewicz, T., and Bogacz, R. Backpropagation at the infinitesimal inference limit of energy-based models: Unifying predictive coding, equilibrium propagation, and contrastive hebbian learning. In International Conference on Learning Representations (ICLR), 2023.
+
+Nest, T. and Høier, R. Dyadic learning in asymmetric convnets. In New Frontiers in Associative Memories Workshop at ICLR 2026.
+
+Osat, S. and Golestanian, R. Non-reciprocal multifarious self-organization. Nature Nanotechnology, 18(1):79–85, 2023.
+
+O'Connor, P., Gavves, E., and Welling, M. Training a spiking neural network with equilibrium propagation. In The 22nd international conference on artificial intelligence and statistics, pp. 1516–1523. PMLR, 2019.
+
+Pineda, F. Generalization of back propagation to recurrent and higher order neural networks. In Neural information processing systems, 1987.
+
+Pourcel, G., Basu, D., Ernoult, M., and Gilra, A. Lagrangian-based equilibrium propagation: generalisation to arbitrary boundary conditions & equivalence with hamiltonian echo learning. arXiv preprint arXiv:2506.06248, 2025.
+
+Rageau, T. and Grollier, J. Training and synchronizing oscillator networks with equilibrium propagation. Neuromorphic Computing and Engineering, 2025.
+
+Sajnok, K. and Matuszewski, M. Near-equilibrium propagation training in nonlinear wave systems. arXiv preprint arXiv:2510.16084, 2025.
+
+Scellier, B. Quantum equilibrium propagation: Gradient-descent training of quantum systems. arXiv preprint arXiv:2406.00879, 2024.
+
+Scellier, B. and Bengio, Y. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in computational neuroscience, 11:24, 2017.
+
+Scellier, B., Goyal, A., Binas, J., Mesnard, T., and Bengio, Y. Generalization of equilibrium propagation to vector field dynamics. arXiv preprint arXiv:1808.04873, 2018.
+
+Scellier, B., Mishra, S., Bengio, Y., and Ollivier, Y. Agnostic physics-driven deep learning. arXiv:2205.15021v1, 2022.
+
+Scurria, A. E. A physical theory of backpropagation: Exact gradients from the least-action principle. 2026.
+
+Stern, M., Hexner, D., Rocks, J. W., and Liu, A. J. Supervised learning in physical networks: From machine learning to learning machines. Physical Review X, 11(2):021045, 2021.
+
+Wang, J., Lu, S., Wang, S.-H., and Zhang, Y.-D. A review on extreme learning machine. Multimedia Tools and Applications, 81(29):41611–41660, 2022.
+
+Wang, Q., Wanjura, C. C., and Marquardt, F. Training coupled phase oscillators as a neuromorphic platform using equilibrium propagation. Neuromorphic Computing and Engineering, 4(3):034014, 2024.
+
+Wanjura, C. C. and Marquardt, F. Quantum equilibrium propagation for efficient training of quantum systems based on onsager reciprocity. Nature Communications, 16(1):6595, 2025.
+
+Werbos, P. J. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
+
+Yi, S.-i., Kendall, J. D., Williams, R. S., and Kumar, S. Activity-difference training of deep neural networks using memristor crossbars. Nature Electronics, 6(1):45–51, 2023.
+
+A. Gradient Estimation Error in VF
+
+In this appendix, we quantify the gradient estimation error introduced by VF in the limit where the Jacobian asymmetry is small.
+
+Comparing the post-synaptic update terms in Eqs. (12) and (14) gives the following error in the gradient of the cost:
+
+$$\mathrm{Error} = -\left(\frac{\partial F}{\partial \theta}(x^0,\theta)\right)^{\top} \times \left[\left(J_F(x^0,\theta)\right)^{-1} - \left(J_F^{\top}(x^0,\theta)\right)^{-1}\right] \frac{\partial C}{\partial x}(x^0, y). \tag{43}$$
+
+To quantify this error, we decompose the Jacobian $J_F(x,\theta)$ into its symmetric part $S_J(x,\theta)$ and antisymmetric part $A_J(x,\theta)$:
+
+$$S_J(x,\theta) = \tfrac{1}{2}\left(J_F(x,\theta) + J_F^{\top}(x,\theta)\right), \tag{44}$$
+$$A_J(x,\theta) = \tfrac{1}{2}\left(J_F(x,\theta) - J_F^{\top}(x,\theta)\right). \tag{45}$$
+
+Assuming the asymmetry $A_J(x,\theta)$ is small, we can make a series expansion in $S_J^{-1}A_J$ (omitting the dependencies for clarity). Applying the Neumann expansion for small $\|S_J^{-1}A_J\|$ gives
+
+$$(J_F)^{-1} = \left(\sum_{n=0}^{\infty} (-1)^n \left(S_J^{-1}A_J\right)^n\right) S_J^{-1},$$
+$$(J_F^{\top})^{-1} = \left(\sum_{n=0}^{\infty} \left(S_J^{-1}A_J\right)^n\right) S_J^{-1}. \tag{46}$$
+
+Subtracting the two series and assuming convergence, we finally obtain
+
+$$(J_F)^{-1} - (J_F^{\top})^{-1} = -2\left(\sum_{n=0}^{\infty} \left(S_J^{-1}A_J\right)^{2n+1}\right) S_J^{-1}. \tag{47}$$
+
+B. Equivalence between AsymEP and BPTT
+
+In this appendix, we sketch the equivalence between the gradient estimate computed by AsymEP and Backpropagation Through Time (BPTT) (Werbos, 1990) for a Recurrent Neural Network with fixed inputs. Our derivation relies on the proof provided by Ernoult et al. (2019), which established that standard (conservative) EP computes gradients identical to those of BPTT. To facilitate direct comparison, we adopt their notation for this section.
+
+The proof provided by Ernoult et al. (2019) relies on the assumption that the vector field $F$ (i.e., transition function) is derived from a scalar potential function, which implies that
+
+$$\left(\frac{\partial F}{\partial s}\right)^{\top} = \frac{\partial F}{\partial s}, \tag{48}$$
+
+where $s$ denotes the dynamical state of the system. This symmetry is the linchpin of the equivalence proof, as the gradient expressions derived for BPTT and standard EP differ precisely by a transpose operation applied to $\partial F/\partial s$.
+
+This observation aligns with our analysis in the main text: VF fails in non-conservative systems due to the missing transpose in the post-synaptic term (see Eq. (16)). Following the derivation in Ernoult et al. (2019) (viz., Appendix A, Eqs. (31–33)), the recursive relations for the gradients in BPTT are given by:
+
+$$\nabla_s^{\mathrm{BPTT}}(0) = \frac{\partial \ell}{\partial s}(s_\star, y), \tag{49}$$
+
+and for all $t = 1, \dots, K$,
+
+$$\nabla_s^{\mathrm{BPTT}}(t) = \left(\frac{\partial F}{\partial s}(x, s_\star, \theta)\right)^{\top} \nabla_s^{\mathrm{BPTT}}(t-1), \tag{50}$$
+$$\nabla_\theta^{\mathrm{BPTT}}(t) = \left(\frac{\partial F}{\partial \theta}(x, s_\star, \theta)\right)^{\top} \nabla_s^{\mathrm{BPTT}}(t-1), \tag{51}$$
+
+where $\theta$ represents the optimization parameters, $\ell$ is the cost function, $s_\star$ is the free equilibrium state (satisfying $F(s_\star) = 0$), $y$ is the target, and $x$ is the input. The index $t$ denotes the unrolled time steps, initialized at $s(0) = s_\star$.
+
+In contrast, the gradients computed by VF follow the recursion (viz., Ernoult et al. (2019), Appendix A, Eqs. (24–26)):
+
+$$\Delta_s^{\mathrm{EP}}(0) = -\frac{\partial \ell}{\partial s}(s_\star, y), \tag{52}$$
+
+and for all $t \geq 0$,
+
+$$\Delta_s^{\mathrm{EP}}(t+1) = \frac{\partial F}{\partial s}(x, s_\star, \theta)\, \Delta_s^{\mathrm{EP}}(t), \tag{53}$$
+$$\Delta_\theta^{\mathrm{EP}}(t+1) = \left(\frac{\partial F}{\partial \theta}(x, s_\star, \theta)\right)^{\top} \Delta_s^{\mathrm{EP}}(t). \tag{54}$$
+
+Comparing these two sets of equations confirms that the only difference lies between Eqs. (50) and (53), specifically in the transpose of the Jacobian $\partial F/\partial s$ (ignoring the global sign difference in Eqs. (49) and (52)).
+
+In AsymEP, we modify the dynamics by adding a correction term dependent on the antisymmetric part of the Jacobian. Denoting the force of this augmented system by $F^A$, the Jacobian at the free equilibrium satisfies:
+
+$$\frac{\partial F^A}{\partial s}(x, s_\star, \theta) = \left(\frac{\partial F}{\partial s}(x, s_\star, \theta)\right)^{\top}. \tag{55}$$
+
+By substituting this corrected Jacobian into the recursive relations, AsymEP recovers the exact transpose required by BPTT. Consequently, our method extends the equivalence between EP and BPTT to the general case of non-conservative forces.
+
+C. Out-of-Equilibrium Mechanics
+
+Here we sketch the physical picture behind the doubled-energy construction of Eq. (26). The full derivation from Hamilton's least-action principle, together with its connection to the Bateman–Galley formalism for non-conservative classical mechanics (Bateman, 1931; Galley, 2013; Aykroyd et al., 2025), can be found in (Scurria, 2026).
+
+C.1. The Helmholtz Obstruction
+
+The natural physical route to a variational principle for a dynamical system $\dot{x} = F(x,\theta)$ is to seek a scalar potential $E$ such that $F = -\partial_x E$. The classical Helmholtz integrability condition states that such an $E$ exists if and only if the Jacobian $J_F$ is symmetric everywhere. Whenever the interactions are non-reciprocal (as in feedforward networks, active matter, or driven optical systems), $J_F$ acquires a non-zero antisymmetric part and the Helmholtz condition fails identically. No scalar potential on the original $n$-dimensional state space can then generate the dynamics, and the "energy minimisation" route at the heart of standard EP is blocked at the structural level. The obstruction is not a matter of computational convenience: it reflects the fact that the rotational component of $F$ carries information that no scalar function of $x$ alone can record.
+
+C.2. Variational Reconstruction on a Doubled Space
+
+Applying the Bateman–Galley formalism circumvents this obstruction by enlarging the configuration space. The single state $x \in \mathbb{R}^n$ is replaced by a conjugate pair $(z, z') \in \mathbb{R}^{2n}$, and the rotational component of $F$, which has no scalar generator on the original $n$-dimensional space, is absorbed into a bilinear coupling between $z$ and $z'$ on the doubled space, where it does admit a variational description. The physical motion is recovered on the diagonal submanifold $z = z'$ (the so-called 'physical limit'), while the off-diagonal direction $d = z - z'$ supplies the additional degree of freedom needed to encode non-reciprocity.
+
+Specializing this reconstruction to the overdamped (first-order) regime relevant to relaxational neural dynamics yields the bilinear energy
+
+$$H(z, z', \theta) = -(z - z')^{\top} F\!\left(\frac{z+z'}{2}, \theta\right), \tag{56}$$
+
+which is precisely Eq. (26). The symmetric midpoint $m = (z + z')/2$ plays the role of the physical coordinate of the doubled system, while $d$ is the auxiliary direction along which non-reciprocity is stored. On the submanifold $z = z'$ the coupling proportional to $(z - z')$ vanishes identically and both states evolve under the original field $F$, so the doubling leaves the on-shell physics unchanged. We refer the reader to (Scurria, 2026) for the full construction.
+
+C.3. Symmetry Breaking as Credit Assignment
+
+On the diagonal manifold $z = z'$ the doubled system enjoys a gauge symmetry: the auxiliary variable $z'$ is redundant and the difference $d$ is identically zero. Credit assignment is implemented by deliberately breaking this symmetry through the task cost. Adding $\beta D(z, z') = \beta\, C(m)$ to $H$ exerts opposite forces on $z$ and $z'$ and drives them apart, so that the difference $d$ ceases to be redundant and begins to carry information about the loss landscape.
+
+D. Proofs for Dyadic EP
+
+We now demonstrate that Dyadic EP correctly trains the parameters $\theta$ of the original force field $F(x,\theta)$, giving the exact gradient $\frac{dC(\bar{x}^0)}{d\theta}$ in the limit of infinitesimal nudging.
+
+D.1. Proof of EP
+
+First, recall that standard EP does not strictly require the system to settle at an energy minimum; it requires only that the system reaches a stationary state (a fixed point of the dynamics). Indeed, using the notation of Section 2.1, EP relies on the key identity:
+
+$$\frac{d^2}{d\theta\, d\beta} E_T(x^{\beta}, \theta) = \frac{d^2}{d\beta\, d\theta} E_T(x^{\beta}, \theta). \tag{57}$$
+
+Expanding the total derivative with respect to $\beta$ gives:
+
+$$\frac{d}{d\beta} E_T(x^{\beta}, \theta) = \left(\frac{\partial E_T(x^{\beta}, \theta)}{\partial x}\right)^{\top} \frac{dx^{\beta}}{d\beta} + \frac{\partial E_T(x^{\beta}, \theta)}{\partial \beta} = C(x^{\beta}), \tag{58}$$
+
+where the first term vanishes because the system is at a stationary state, i.e., $\frac{\partial}{\partial x} E_T(x^{\beta}, \theta) = 0$; this holds even if the system is not at a minimum of $E_T$. Similarly, for the derivative with respect to $\theta$:
+
+$$\frac{d}{d\theta} E_T(x^{\beta}, \theta) = \frac{\partial E_T(x^{\beta}, \theta)}{\partial \theta}, \tag{59}$$
+
+where we additionally assume that the cost function does not depend explicitly on the parameters $\theta$. Substituting these results into Eq. (57) in the limit of infinitesimal nudging ($\beta \to 0$) recovers the fundamental relation given by Eq. (9).
+
+D.2. Proof of Dyadic EP
We now analyze the stationary states of Dyadic EP by introducing the change of variables:

$$m = \frac{z + z'}{2}, \qquad d = z - z'. \tag{60}$$

In these coordinates, the augmented energy $H_T$ becomes

$$H_T(m, d, \theta, \beta) = -d^{\top} F(m, \theta) + \beta\, C(m), \tag{61}$$

and the dynamics in Eq. (28) can be rewritten as:

$$\frac{dm}{dt} = -\frac{\partial H_T}{\partial d} = F(m, \theta), \tag{62}$$

$$\frac{dd}{dt} = -\frac{\partial H_T}{\partial m} = d^{\top} J_F(m, \theta) - \beta \frac{\partial}{\partial m} C(m). \tag{63}$$

The stationary states $(m_\beta, d^\beta)$ are the solutions to:

$$F(m_\beta, \theta) = 0, \tag{64}$$

$$d^{\beta\top} J_F(m_\beta, \theta) - \beta \frac{\partial}{\partial m} C(m_\beta) = 0. \tag{65}$$

This leads to the following observations:

1) The stationary state of $m$ is independent of $\beta$ and coincides with the stationary state of the original system:

$$\frac{z^\beta + z'^\beta}{2} = m_\beta = m_0 = x^0. \tag{66}$$

2) The Jacobian of the extended system defined in Eq. (26) is invertible, provided $J_F$ is invertible. This is most evident from Eq. (63).

3) The stationary state value of $d$ is given by:

$$d^\beta = \beta \left(J_F^{\top}(m_0, \theta)\right)^{-1} \frac{\partial C}{\partial x}(x^0). \tag{67}$$

In particular, when $\beta = 0$, we have $d^0 = 0$, which implies that the free stationary states coincide: $z^0 = z'^0$.

4) The cost at the stationary state of the extended system is equal to the cost at the stationary state of the original system:

$$D(m_0) = C(x^0). \tag{68}$$

Consequently, the gradients of the cost with respect to the parameters are identical.

Since both the original and extended systems, given respectively in Eqs. (1–2) and Eq. (28), share the same cost at their respective stationary states, and because the Jacobians of both models are invertible, applying the EP update rule to the extended system gives the correct gradient estimate for the parameters $\theta$ of the original system.

The final step of the proof is to establish the equivalence between the standard parameter update rule in Eq. (8) and the modified rule used by Dyadic EP in Eq. (34). Indeed, if we were to apply the standard update rule in the extended space, the update would be:

$$\Delta\theta \propto -\frac{1}{2\beta}\left[\frac{\partial H(z^{\beta}, z'^{\beta}, \theta)}{\partial \theta} - \frac{\partial H(z^{-\beta}, z'^{-\beta}, \theta)}{\partial \theta}\right]. \tag{69}$$

In Dyadic EP, we instead employ the single-phase update:

$$\Delta\theta \propto -\frac{1}{\beta}\, \frac{\partial H(z^{\beta}, z'^{\beta}, \theta)}{\partial \theta}. \tag{70}$$

This choice avoids the overhead of evolving two coupled equations in the extended space, which would be computationally equivalent to evolving four equations in the original space (two for $+\beta$ and two for $-\beta$). Using Eq. (70), we evolve only one coupled equation for $+\beta$ in the extended space; this corresponds to two equations in the original space, thereby achieving the same computational complexity as AsymEP. Furthermore, this single-phase formulation suggests a pathway toward making the update local in time, provided appropriate hardware is used to implement the augmented phase.

Mathematically, these two approaches yield the same gradient estimate because the equations for $d^\beta$ are linear. Explicitly, we have:

$$\frac{\partial H(z^{\beta}, z'^{\beta}, \theta)}{\partial \theta} = -(z^{\beta} - z'^{\beta})^{\top} \frac{\partial F}{\partial \theta}\left(\frac{z^{\beta} + z'^{\beta}}{2}, \theta\right) = -\beta \left(\frac{\partial F}{\partial \theta}(z^0, \theta)\right)^{\top} \left(J_F^{\top}(z^0, \theta)\right)^{-1} \frac{\partial C}{\partial x}(z^0), \tag{71}$$

where we have used Eqs. (66) and (67). Inspection of Eq. (71) confirms that, up to corrections of order $\beta^2$, we obtain exactly the same gradient as in AsymEP.

E. AsymEP versus Dyadic EP

In this appendix, we demonstrate that Asymmetric Equilibrium Propagation (AsymEP) emerges naturally as the first-order projection of the $2N$-dimensional Dyadic Equilibrium Propagation onto a single $N$-dimensional state space. We then formalize the physical trade-offs between the two architectures.

E.1. AsymEP as the Linear Projection of Dyadic EP

As established in Appendix D.2, transforming the $2N$-dimensional extended space $(z, z')$ into the mean state $m = \frac{z + z'}{2}$ and the difference state $d = z - z'$ exactly decouples the stationary dynamics. Because the stationary state of $m$ is the free state of the original system ($m_\beta = x^0$), the cost function drives the difference variable to a stationary state $d^\beta$ satisfying:

$$J_F^{\top}(x^0, \theta)\, d^\beta = \beta \frac{\partial C}{\partial x}(x^0). \tag{72}$$

To recover this exact error signal in an $N$-dimensional space, we postulate a modified dynamical system $F_A(x)$ comprising the standard EP dynamics and a spatial correction $\Gamma(x)$:

$$F_A(x) = F(x) - \beta \frac{\partial C}{\partial x}(x) + \Gamma(x). \tag{73}$$

Let $\Delta x = x_A^\beta - x^0$ denote the displacement from the free equilibrium. Expanding the stationarity condition $F_A(x_A^\beta) = 0$ to first order around $x^0$ yields:

$$J_F(x^0, \theta)\, \Delta x - \beta \frac{\partial C}{\partial x}(x^0) + \Gamma(x_A^\beta) \approx 0. \tag{74}$$

To ensure the first-order displacement matches the Dyadic EP error signal (i.e., $\Delta x \approx d^\beta$), we substitute Eq. (72) into the expansion:

$$\Gamma(x_A^\beta) = \left(J_F^{\top}(x^0, \theta) - J_F(x^0, \theta)\right) \Delta x \tag{75}$$

$$\phantom{\Gamma(x_A^\beta)} = -2 A_J(x^0, \theta)\,(x_A^\beta - x^0). \tag{76}$$

This uniquely recovers the AsymEP augmented dynamics. Finally, to eliminate the $O(\beta^2)$ error, AsymEP evaluates the centered difference of two opposite nudges:

$$x_A^{\pm\beta} = x^0 \pm \beta \left.\frac{dx_A}{d\beta}\right|_{\beta=0} + O(\beta^2). \tag{77}$$

Subtracting these states cancels the $O(\beta^2)$ error, yielding $\frac{1}{2}(x_A^{+\beta} - x_A^{-\beta}) = d^\beta + O(\beta^3)$, successfully recovering the exact post-synaptic update term.

E.2. Physical Trade-offs and the Extended Space

We can view AsymEP and Dyadic EP as a space-time tradeoff of the same underlying physical optimization problem.

AsymEP preserves the original $N$-dimensional state space of the network at the cost of temporal non-locality. The system must evolve sequentially, requiring physical memory not only to store the free equilibrium $x^0$ for the asymmetric correction, but also to store the successive stationary states required to evaluate the contrastive gradient update. AsymEP thus serves as the direct, spatially minimal extension of EP.

Dyadic EP provides a learning signal that is local in both space (where $z - z'$ encodes the gradient) and time (allowing the nudged phases to execute in parallel) at the cost of doubling the state space. In particular, capturing non-conservative forces in this extended space requires a specific bilinear coupling, rather than a trivial superposition of uncoupled subsystems. It can be seen as a blueprint for future neuromorphic hardware.

Ultimately, the reduction of Dyadic EP to AsymEP via the variables $m$ and $d$ proves the universality of EP's variational principle.

F. Derivation of the Hopfield-like Energy

In this section, we derive the explicit energy functional for the Continuous Asymmetric Hopfield dynamics defined in Eq. (35). The force field is given by:

$$F(x) = \rho'(x) \odot (J\rho(x)) - x. \tag{78}$$

We omit external inputs $J^{\mathrm{in}}$ for brevity, as they appear symmetrically in the Jacobian. The variational Hamiltonian is defined as:

$$H(z, z') = -(z - z')^{\top} F\left(\frac{z + z'}{2}\right) + \beta\, C\left(\frac{z + z'}{2}\right). \tag{79}$$

To analyze this expression, we introduce the midpoint $m = \frac{z + z'}{2}$ and the difference $d = z - z'$. Since the separation between $z$ and $z'$ is induced solely by the nudging parameter $\beta$, the difference scales as $\|d\| \sim O(\beta)$. We therefore neglect terms of order $O(\|d\|^3)$ (equivalently, $O(\beta^3)$), as they do not contribute to the gradient of the cost.

The activation at the midpoint can be approximated as:

$$\rho(m) = \frac{\rho(z) + \rho(z')}{2} + O(\|d\|^2). \tag{80}$$

Similarly, the difference in activations is:

$$\rho(z) - \rho(z') = \rho'(m) \odot d + O(\|d\|^3). \tag{81}$$

Inverting this relation, we express the state difference as:

$$z - z' = (\rho(z) - \rho(z')) \oslash \rho'(m) + O(\|d\|^3), \tag{82}$$

where $\oslash$ denotes element-wise division.

We substitute these expansions into the interaction term of the Hamiltonian, $H_{\mathrm{int}} = -(z - z')^{\top} (\rho'(m) \odot J\rho(m))$. Applying the identity $a^{\top}(b \odot c) = (a \odot b)^{\top} c$, we obtain:

$$H_{\mathrm{int}} = -\left((z - z') \odot \rho'(m)\right)^{\top} J\rho(m) \approx -(\rho(z) - \rho(z'))^{\top} J \left(\frac{\rho(z) + \rho(z')}{2}\right). \tag{83}$$

Expanding the product gives:

$$H_{\mathrm{int}} = -\frac{1}{2}\Big[\rho(z)^{\top} J \rho(z) + \rho(z)^{\top} J \rho(z') - \rho(z')^{\top} J \rho(z) - \rho(z')^{\top} J \rho(z')\Big]. \tag{84}$$

We decompose the connectivity matrix $J$ into its symmetric part $S$ and antisymmetric part $A$. The first and last terms simplify to $\rho(z)^{\top} S \rho(z)$ and $\rho(z')^{\top} S \rho(z')$, since the antisymmetric part does not contribute to a quadratic form. The cross terms satisfy:

$$\rho(z)^{\top} J \rho(z') - \rho(z')^{\top} J \rho(z) = \rho(z)^{\top} (J - J^{\top}) \rho(z') = \rho(z)^{\top} (2A) \rho(z'). \tag{85}$$

Thus, the interaction term reduces to:

$$H_{\mathrm{int}} = -\frac{1}{2}\rho(z)^{\top} S \rho(z) + \frac{1}{2}\rho(z')^{\top} S \rho(z') - \rho(z)^{\top} A \rho(z') + O(\|d\|^3). \tag{86}$$

Finally, for the nudging term, we expand the cost function around the midpoint:

$$C(m) = \frac{1}{2}\left(C(z) + C(z')\right) + O(\|d\|^2). \tag{87}$$

When multiplying by $\beta$, the remainder term becomes $\beta \cdot O(\|d\|^2)$. Since $\|d\| \sim O(\beta)$, this remainder is of order $O(\beta^3)$ and can be consistently discarded alongside the third-order terms from the interaction expansion.

Combining all these components, the final Hamiltonian is:

$$H(z, z') = -\frac{1}{2}\rho(z)^{\top} S \rho(z) + \frac{1}{2}\rho(z')^{\top} S \rho(z') - \rho(z)^{\top} A \rho(z') + \frac{1}{2}\left(\|z\|^2 - \|z'\|^2\right) + \frac{\beta}{2}\left(C(z) + C(z')\right). \tag{88}$$

The saddle-point dynamics, given by Eq. (32), generated by this Hamiltonian are:

$$\frac{dz}{dt} = \rho'(z) \odot \left(S\rho(z) + A\rho(z')\right) - z - \frac{\beta}{2}\frac{\partial C}{\partial z}, \tag{89}$$

$$\frac{dz'}{dt} = \rho'(z') \odot \left(S\rho(z') + A\rho(z)\right) - z' + \frac{\beta}{2}\frac{\partial C}{\partial z'}. \tag{90}$$

This system recovers the original continuous Hopfield dynamics when $z = z'$ (assuming $\beta = 0$).

G. Experimental Details

As in the main text, the neuronal dynamics are governed by the vector field:

$$F_i = \rho'(x_i)\left(\sum_j J^{\mathrm{dyn}}_{ij}\, \rho(x_j) + b_i(u)\right) - x_i, \tag{91}$$

where the input-dependent bias $b_i(u)$ is precomputed for each MNIST input $u$ as:

$$b_i(u) = \sum_{l \in \mathrm{in}} J^{\mathrm{in}}_{il}\, u_l. \tag{92}$$

This term projects the input space into the recurrent subspace. The bias yields a diagonal contribution to the Jacobian $J_F = \frac{\partial F}{\partial x}$, and therefore does not contribute to the antisymmetric correction used in the augmented dynamics Eq. (20) of AsymEP.

The input parameters are then updated using the standard learning rule (21). In particular, the presynaptic term associated with the input weights is given by

$$\frac{\partial F_i}{\partial J^{\mathrm{in}}_{kl}} = \delta_{ik}\, \rho'(x_i)\, u_l. \tag{93}$$

The presynaptic terms associated with the dynamical parameters $J^{\mathrm{dyn}}_{ij}$ depend on the experiment.

G.1. Symmetric Initialization

G.1.1. Learning Rules

For clarity, we write the learning rules for VF and AsymEP. For the input weights, using (93), we have:

$$\Delta J^{\mathrm{in}}_{ik} \propto \frac{1}{2\beta}\left[(x_i^{\beta} - x_i^{-\beta})\, \rho'(x_i^0)\, u_k\right], \tag{94}$$

while for the recurrent weights, we get:

$$\Delta J^{\mathrm{dyn}}_{ij} \propto \frac{1}{2\beta}\left[(x_i^{\beta} - x_i^{-\beta})\, \rho'(x_i^0)\, \rho(x_j^0)\right]. \tag{95}$$

For EP, we have:

$$\Delta J^{\mathrm{in}}_{ik} \propto \frac{1}{2\beta}\left[\left(\rho(x_i^{\beta}) - \rho(x_i^{-\beta})\right) u_k\right], \tag{96}$$

and for the recurrent weights:

$$\Delta J^{\mathrm{dyn}}_{ij} \propto \frac{1}{2\beta}\left[\rho(x_i^{\beta})\rho(x_j^{+\beta}) - \rho(x_i^{-\beta})\rho(x_j^{-\beta})\right]. \tag{97}$$

G.1.2. Supplementary Numerical Results

To complement Fig. 2, we report the evolution of the accuracy of the three methods in Fig. 4. We consider a layered network with 50 hidden neurons. While this capacity is insufficient for state-of-the-art performance, it amplifies the difference in accuracy between models to aid visualization. Models are trained for 20 epochs starting from a symmetric configuration, the natural setting for both VF and EP. With this initialization, AsymEP consistently outperforms the other methods and learns faster by exploiting the additional degrees of freedom of the asymmetric network.

G.2. Fixed Asymmetry Ratio

This section details the implementation for the fixed asymmetry ratio experiments presented in Section 5.2, followed by complementary numerical results regarding learning speed and induced Jacobian asymmetry.

G.2.1. 
Learning Rules

Parametrization and notation. To enforce a fixed asymmetry ratio, we explicitly parameterize the independent elements of Eq. (38). We introduce two parameter vectors $\theta^S$ and $\theta^A$ of size $M = N_{\mathrm{dyn}}(N_{\mathrm{dyn}} - 1)/2$, which encode the off-diagonal elements of the symmetric and antisymmetric components $\tilde{S}$ and $\tilde{A}$, respectively. The correspondence between matrix and vector indices is given by:

$$k(i, j) = \frac{(i-1)(i-2)}{2} + j, \qquad (1 \leq j < i \leq N_{\mathrm{dyn}}), \tag{98}$$

where the condition $j < i$ selects the strictly lower triangular elements. Introducing an additional vector $\xi$ for the diagonal elements of $\tilde{S}$, the full matrices are constructed as:

$$\tilde{S}_{ij} = \delta_{ij}\, \xi_i + (1 - \delta_{ij})\, \theta^S_{k(\max(i,j),\, \min(i,j))}, \tag{99}$$

$$\tilde{A}_{ij} = \epsilon_{ij}\, \theta^A_{k(\max(i,j),\, \min(i,j))}, \tag{100}$$

where $\epsilon_{ij}$ is the Levi-Civita symbol. The dynamical parameters are then given by:

$$J^{\mathrm{dyn}}_{ij} = \gamma\left(c_S \tilde{S}_{ij} + c_A \tilde{A}_{ij}\right), \tag{101}$$

with normalization coefficients

$$c_S = \frac{\sqrt{1 - r_{\mathrm{str}}^2}}{F_S}, \qquad c_A = \frac{r_{\mathrm{str}}}{F_A}, \tag{102}$$

defined in terms of the Frobenius norms:

$$F_S = \sqrt{\sum_{i=1}^{N} \xi_i^2 + 2\sum_{k=1}^{M} \left(\theta^S_k\right)^2}, \tag{103}$$

$$F_A = \sqrt{2\sum_{k=1}^{M} \left(\theta^A_k\right)^2}. \tag{104}$$

Table 3. Trained Model Hyperparameters on MNIST. $N$ is the total number of neurons, $U(-1, 1)$ is a uniform distribution, and $N(\mu, \sigma^2)$ is a Gaussian distribution. For the $r_{\mathrm{str}}$ parametrization, we choose more cautious hyperparameters for training and inference compared to the symmetric initialization, due to increasingly non-conservative and potentially oscillatory dynamics.

Parameter                                Sym. Init. / Feedforward   Fixed r_str       Fixed r_str & r_in
                                         (Sec. 5.1 & 5.3)           (Sec. 5.2)        (App. G.3)
Learning Rate (Input-Hidden)             0.05                       0.05              0.0125
Learning Rate (Hidden-Output)            0.01                       0.01              0.0025
Time Step (Dynamics Integration)         0.5                        0.3               0.3
Nudging Parameter (β)                    0.5                        0.5               0.5
Free-phase Steps (n_free)                20                         30                40
Nudged-phase Steps (n_nudge)             10                         10                10
Number of Epochs                         40 / 20                    30                40
Batch Size                               64                         64                64
Scaling Parameter γ                      n.a.                       √60               √60
Structure                                784-n.a.-10                784-50-10         all-to-all, 500 hid
Activation function ρ                    tanh                       tanh              tanh
Initial Recurrent State s                s ∼ U(−1, 1)               s ∼ U(−1, 1)      s ∼ U(−1, 1)
Initial Parameters θ                     θ ∼ N(0, 1/N)              θ ∼ N(0, 1/N)     θ ∼ N(0, 1/N)
Number of Runs (training + inference)    10                         10                10

Figure 4. Evolution of the mean accuracy and standard deviation (over 10 runs) during training on MNIST for AsymEP, EP, and VF. Models use 50 hidden neurons.

Presynaptic computation. The dependence of the normalization coefficients on the parameters introduces additional regularization terms in the learning rule compared to the parameterization of (Scellier & Bengio, 2017). The gradients of the normalization coefficients are:

$$\frac{\partial c_S}{\partial \theta^S_k} = -2 c_S \frac{\theta^S_k}{(F_S)^2}, \qquad \frac{\partial c_S}{\partial \xi_m} = -c_S \frac{\xi_m}{(F_S)^2}, \tag{105}$$

$$\frac{\partial c_A}{\partial \theta^A_k} = -2 c_A \frac{\theta^A_k}{(F_A)^2}. \tag{106}$$

Combining these with the derivatives of the matrices $\tilde{S}$ and $\tilde{A}$, we have:

$$\frac{\partial \tilde{S}_{ij}}{\partial \theta^S_k} = \delta_{ip}\delta_{jq} + \delta_{iq}\delta_{jp}, \qquad \frac{\partial \tilde{S}_{ij}}{\partial \xi_k} = \delta_{ij}\delta_{kj}, \tag{107}$$

$$\frac{\partial \tilde{A}_{ij}}{\partial \theta^A_k} = \delta_{ip}\delta_{jq} - \delta_{iq}\delta_{jp}, \tag{108}$$

where $k$ corresponds to the index pair $(p, q)$ with $p > q$, as defined in Eq. (98). The full presynaptic terms are then:

• For the diagonal parameters $\xi_m$:

$$\frac{\partial F_i}{\partial \xi_m} = \gamma c_S\, \rho'(x_i)\left[-\frac{\xi_m}{(F_S)^2}\sum_{j=1}^{N} \tilde{S}_{ij}\, \rho(x_j) + \delta_{im}\, \rho(x_m)\right]. \tag{109}$$

• For the off-diagonal symmetric parameters $\theta^S_k$ (where $p > q$):

$$\frac{\partial F_i}{\partial \theta^S_k} = \gamma c_S\, \rho'(x_i)\left[-2\frac{\theta^S_k}{(F_S)^2}\sum_{j=1}^{N} \tilde{S}_{ij}\, \rho(x_j) + \delta_{ip}\, \rho(x_q) + \delta_{iq}\, \rho(x_p)\right]. \tag{110}$$

• For the off-diagonal antisymmetric parameters $\theta^A_k$ (where $p > q$):

$$\frac{\partial F_i}{\partial \theta^A_k} = \gamma c_A\, \rho'(x_i)\left[-2\frac{\theta^A_k}{(F_A)^2}\sum_{j=1}^{N} \tilde{A}_{ij}\, \rho(x_j) + \delta_{ip}\, \rho(x_q) - \delta_{iq}\, \rho(x_p)\right]. \tag{111}$$

Table 4. Trained Model Hyperparameters on Fashion-MNIST. $N$ is the total number of neurons, $U(-1, 1)$ is a uniform distribution, and $N(\mu, \sigma^2)$ is a Gaussian distribution. For the $r_{\mathrm{str}}$ parametrization, we choose more cautious hyperparameters for training and inference compared to the symmetric initialization, due to increasingly non-conservative and potentially oscillatory dynamics.

Parameter                                Comparison Dyn.    2 hidden layers    3 hidden layers
                                         (Sec. 5.4)         (Sec. 5.4)         (Sec. 5.4)
Learning Rate (Input-Hidden)             0.0016             0.0013             0.6
Learning Rate (Hidden-Hidden)            0.0016             0.0013             0.6
Learning Rate (Hidden-Output)            0.0016             0.0013             0.6
Time Step (Dynamics Integration)         0.4                0.3                0.0075
Nudging Parameter (β)                    0.3                0.5                0.20
Free-phase Steps (n_free)                40                 40                 60
Nudged-phase Steps (n_nudge)             20                 20                 30
Number of Epochs                         50                 40                 40
Batch Size                               64                 64                 64
Layer Structure                          784-500-200-10     784-500-500-10     784-500-500-500-10
Activation function ρ                    tanh               tanh               tanh
Initial Recurrent State s                s ∼ U(−1, 1)       s ∼ U(−1, 1)       s ∼ U(−1, 1)
Initial Parameters θ                     θ ∼ N(0, 1/N)      θ ∼ N(0, 1/N)      θ ∼ N(0, 1/N)
Number of Runs (training + inference)    10                 10                 10

Initialization. To ensure the stability of the system, we initialize our parameters such that the variance of the dynamical parameters scales as $\mathrm{Var}\left[J^{\mathrm{dyn}}_{ij}\right] \propto 1/N_{\mathrm{dyn}}$. This is a conservative choice for the layered architectures used in our experiments, where many entries of $J^{\mathrm{dyn}}_{ij}$ are zero.

In practice, we initialize the parameter vectors $\theta^S$, $\theta^A$, and $\xi$ with identical variances $\sigma^2$. For large $N_{\mathrm{dyn}}$, the expected Frobenius norms approximate to $\mathbb{E}[F_{S,A}] \approx N_{\mathrm{dyn}}\, \sigma$. Consequently, the normalization coefficients become:

$$c_S \approx \frac{\sqrt{1 - r_{\mathrm{str}}^2}}{N_{\mathrm{dyn}}\, \sigma}, \qquad c_A \approx \frac{r_{\mathrm{str}}}{N_{\mathrm{dyn}}\, \sigma}. \tag{112}$$

Since the symmetric and antisymmetric components are statistically independent, the variance of the weights is derived as follows:

• Diagonal elements ($i = j$):

$$\mathrm{Var}\left[J^{\mathrm{dyn}}_{ii}\right] = \gamma^2 c_S^2 \sigma^2 \approx \gamma^2\, \frac{1 - r_{\mathrm{str}}^2}{N_{\mathrm{dyn}}^2}. \tag{113}$$

• Off-diagonal elements ($i \neq j$):

$$\mathrm{Var}\left[J^{\mathrm{dyn}}_{ij}\right] = \gamma^2\left(c_S^2 + c_A^2\right)\sigma^2 \approx \frac{\gamma^2}{N_{\mathrm{dyn}}^2}. \tag{114}$$

To satisfy $\mathrm{Var}\left[J^{\mathrm{dyn}}_{ij}\right] \propto 1/N_{\mathrm{dyn}}$, we set:

$$\gamma = \sqrt{N_{\mathrm{dyn}}}. \tag{115}$$

Note that by random matrix theory, diagonal elements do not affect stability in the large $N_{\mathrm{dyn}}$ limit.

Potential Simplification. Although the parameterization above is fully general, a simpler construction is possible by removing self-connections ($\xi = 0$) and enforcing identical parameterization for the symmetric and antisymmetric components, i.e., $\theta^S = \theta^A = \theta$. The matrix elements then become:

$$\tilde{S}_{ij} = (1 - \delta_{ij})\, \theta_{k(\max(i,j),\, \min(i,j))}, \tag{116}$$

$$\tilde{A}_{ij} = \epsilon_{ij}\, \theta_{k(\max(i,j),\, \min(i,j))}. \tag{117}$$

In this case, the Frobenius norms are equal ($F_S = F_A$), and we can omit the explicit normalization:

$$J^{\mathrm{dyn}}_{ij} = \sqrt{1 - r_{\mathrm{str}}^2}\; \tilde{S}_{ij} + r_{\mathrm{str}}\, \tilde{A}_{ij}. \tag{118}$$

For a parameter $\theta_k$ corresponding to indices $(p, q)$ with $p > q$, the presynaptic term is given by:

$$\frac{\partial F_i}{\partial \theta_k} = \rho'(x_i)\left[\left(\sqrt{1 - r_{\mathrm{str}}^2} + r_{\mathrm{str}}\right)\delta_{ip}\, \rho(x_q) + \left(\sqrt{1 - r_{\mathrm{str}}^2} - r_{\mathrm{str}}\right)\delta_{iq}\, \rho(x_p)\right]. \tag{119}$$

While this parameterization works in simulations and keeps the number of parameters constant for all $r_{\mathrm{str}}$, it constrains the asymmetry to be "homogeneous", by which we mean that the asymmetry ratio is identical for every pair of neurons; hence, the network cannot learn to be symmetric in one region and antisymmetric in another. Therefore, we choose to explore the more general case of (38) in our experiments.

G.2.2. Supplementary Numerical Results

To complement the results of Fig. 3, we analyze the training efficiency as a function of the asymmetry ratio $r_{\mathrm{str}}$ and investigate the robustness of VF by monitoring the Jacobian asymmetry.

Training efficiency. We first study the training efficiency of the two algorithms as a function of the asymmetry ratio $r_{\mathrm{str}}$. Inspired by the related concept in (Cesa-Bianchi & Lugosi, 2006), we define the cumulative loss as the accumulated difference between the free equilibrium cost and a zero-cost baseline (perfect prediction) during learning. Specifically, for each method and value of $r_{\mathrm{str}}$, we calculate the cumulative loss by summing the batch-averaged costs of the first 5 epochs (out of 30, to avoid saturation effects), and report the mean and standard deviation over 10 independent training runs. Mathematically, for each run:

$$\text{Cumul. Loss} = \sum_{\text{epoch}=1}^{5}\; \sum_{k=1}^{N_{\text{batches}}}\; \sum_{(x^0, u) \in B_k} \frac{C(x^0, u)}{|B_k|}, \tag{120}$$

where $B_k$ represents the $k$-th batch, and $|B_k|$ denotes the number of examples in the batch. The parameters are updated after each batch step; consequently, the free equilibrium $x^0$ is inferred using the updated parameters and the current example $u$.

Figure 5. Cumulative loss as defined by (120) over the first 5 epochs of training, for different asymmetry ratios $r_{\mathrm{str}}$. We compare VF (orange) and AsymEP (blue), under two training regimes: training only $J^{\mathrm{in}}$ (dashed) or all parameters (solid).

In Fig. 5, we observe that learning slows down for both algorithms when $r_{\mathrm{str}} \gtrsim 0.6$. This behavior likely results from the increased difficulty of reaching a stationary state as the dynamics become strongly asymmetric. With a fixed number of inference steps, incomplete convergence degrades the accuracy of the gradient estimates, thereby slowing down the learning. Fig. 5 shows that while VF can eventually achieve competitive accuracy, it is consistently slower than AsymEP as soon as asymmetry is introduced.

Jacobian asymmetry. 
We next examine how the structural asymmetry rstr is reflected in the Jacobian of the dy20 + +Equilibrium Propagation for Non-Conservative Systems + +namics (35), given by: +∂Fi +dyn ′ += (1 − δij )ρ′ (xi )Jij +ρ (xj ) +∂sj +h +i ++ δij ρ′ (xi )(Jiidyn ρ′ (xi )) + ρ′′ (xi )bi − 1 . +(121) +In our layered architecture, the self-connections are zero +(Jiidyn = 0). For the following analysis, we neglect all diagonal terms in the Jacobian (including external inputs and +potential), since they do not contribute to the antisymmetric +correction (20) and thus to the discrepancy between the performance of VF and AsymEP. Consequently, we define the +following asymmetry ratio based solely on the off-diagonal +Jacobian JF,off : +(122) + +Figure 6. Asymmetry ratio of the Jacobian rjac defined in equation +(122) after training for different asymmetry ratios rstr . We compare +VF (orange) and AsymEP (blue), under two training regimes: +training only J in (dashed) or all parameters (solid). + +The results are presented in Fig 6. For each trained model +and ratio rstr , we compute rjac averaged over the stationary +states of the first batch (64 images) across 10 independent +runs. We observe that when structural asymmetry is strong +and all parameters are trained, VF partially compensates for +the asymmetry by adjusting the neuronal states. This can be +understood by rewriting the ratio as: + + +dyn +dyn ⊤ +ρ′ (xi ) Jij +− (Jji +) ρ′ (xj ) +F +rjac = +. +(123) +dyn ′ +ρ′ (xi )Jij +ρ (xj ) + +Consequently, local stability requires the largest real eigenvalue of the effective weight matrix to be strictly less than 1. +Assuming weights are initialized independently with variance σ 2 , Girko’s circular law dictates that the eigenvalues +of√an asymmetric matrix uniformly populate a disk of radius +σ n in the complex plane. In contrast, imposing symmetry +forces the eigenvalues +√ onto the real line, broadening the +spectral radius to 2σ n according to Wigner’s semicircle +law. 
As a result, asymmetric networks can stably accommodate larger variance in the weight initializations than their +symmetric counterparts. + +Compared to the structural asymmetry ratio in Eq. (37), +a value of rjac < rstr indicates that the neuronal states effectively dampen the structural asymmetry, rendering the +dynamics more symmetric. This symmetrization of the Jacobian appears without imposing an additional symmetrization penalty and could be enhanced using the method of +(Laborieux &Zenke, 2022). This mechanism likely explains +the superior performance of ‘All (VF)’ compared to ‘J in +(VF)’ in Fig 3, as the former is able to use the additional +degrees of freedom to reduce the effective asymmetry at +high rstr . + +Asymmetry nevertheless introduces imaginary eigenvalues +and, consequently, damped oscillations. To study this effect +experimentally in a controlled setting, we constrain the input +projections J in . In the experiments of the main text, fixing +the structural asymmetry ratio rstr still allowed AsymEP +to reduce oscillations by aligning and increasing the input +projections J in , thereby adding stabilizing diagonal contributions to the Jacobian. To isolate the network’s ability to +suppress oscillations independently of the magnitude of the +input drive, we further constrain the relative scale of J in and +J dyn by imposing + +rjac = + +⊤ +∥JF,off − JF,off +∥F +, +∥JF,off ∥F + +F + +rin = + +G.3. Stability analysis with Fixed Asymmetry Ratio & +Constrained Inputs Projection + +∥J in ∥F +∥J in ∥F += +, +∥J dyn ∥F +γ + +(124) + +where ∥J dyn ∥F = γ following Eq. (101). Defining unscaled +input projections J˜in , we set + +A complete stability analysis of the non-conservative dynamics trainable with AsymEP is beyond the scope of this +work. Nevertheless, for the class of continuous Hopfield +networks considered here, standard arguments from random +matrix theory suggest that asymmetry inherently improves +asymptotic stability. 
+
+$$ J^{\mathrm{in}} = r_{\mathrm{in}}\, \gamma\, \frac{\tilde{J}^{\mathrm{in}}}{\|\tilde{J}^{\mathrm{in}}\|_F} \tag{125} $$
+
+G.3.1. Learning Rules
+
+In the dynamics defined by Eq. (91), the linear leak term $-x_i$ shifts the spectrum of the
+system's Jacobian by $-1$.
+
+Reusing the notation of the previous section, we write
+$J^{\mathrm{in}}_{il} = \gamma c_{\mathrm{in}} \tilde{J}^{\mathrm{in}}_{il}$ with the normalization
+$c_{\mathrm{in}} = r_{\mathrm{in}} / F_{\mathrm{in}}$, where
+$F_{\mathrm{in}} = \|\tilde{J}^{\mathrm{in}}\|_F$. Applying the chain rule yields:
+
+$$ \frac{\partial F_i}{\partial \tilde{J}^{\mathrm{in}}_{kl}} = \gamma c_{\mathrm{in}}\, \rho'(x_i) \left[ \delta_{ik} u_l - \frac{\tilde{J}^{\mathrm{in}}_{kl}}{F_{\mathrm{in}}^2} \sum_m \tilde{J}^{\mathrm{in}}_{im} u_m \right]. \tag{126} $$
+
+And for $\gamma$ we have:
+
+$$ \frac{\partial F_i}{\partial \gamma} = \frac{1}{\gamma} (F_i + x_i). \tag{127} $$
+
+G.3.2. Supplementary Numerical Results
+
+Table 5 reports a worst-case control experiment where the structural asymmetry is fixed at
+$r_{\mathrm{str}} = 0.7$ while the input scale ratio $r_{\mathrm{in}}$ is varied. The experiment
+uses an all-to-all architecture on MNIST (excluding direct input-to-output connections). The
+output variance during extended inference (steps 30–50) confirms that the system successfully
+learns to suppress oscillations even when $r_{\mathrm{in}}$ is severely restricted. Any small
+residual oscillations can be mitigated by time-averaging over the inference steps.
+
+Finally, $r_{\mathrm{in}}$ can be interpreted as a measure of the external signal magnitude
+relative to the internal recurrent dynamics. These results indicate that the system remains
+capable of learning and stabilizing even under a low external input drive. Even when the input
+projection $\|J^{\mathrm{in}}\|_F$ is 100 times smaller than the recurrent connections
+$\|J^{\mathrm{dyn}}\|_F$, the network still achieves 36.34 ± 6.25% accuracy, which is well above
+chance level (∼10%).
+
+Table 5. Output variance and final test accuracy on MNIST (%) across different values of
+$r_{\mathrm{in}}$ with $r_{\mathrm{str}} = 0.7$ (mean ± std over 10 runs; 500 hidden units,
+all-to-all, no input-output connections).
+
+r_in   Output variance (untrained)   Output variance (epoch 80)   Test acc. (%) (epoch 80)
+0.01   (3.38 ± 0.90) × 10^{-4}       (5.22 ± 2.34) × 10^{-5}      36.34 ± 6.25
+0.10   (2.33 ± 0.48) × 10^{-4}       (1.39 ± 0.17) × 10^{-4}      90.54 ± 0.19
+0.50   (1.34 ± 0.32) × 10^{-5}       (1.06 ± 0.25) × 10^{-6}      94.96 ± 0.10
+1.00   (6.27 ± 1.24) × 10^{-7}       (1.75 ± 0.50) × 10^{-8}      96.30 ± 0.09
+
+G.4. Feedforward Network
+
+G.4.1. Learning Rules
+
+For clarity, we write the learning rules for VF and AsymEP in a feedforward architecture with
+one hidden layer using the notation of Section 5.3. For the input weights connecting to the
+hidden layer, we get the usual formula:
+
+$$ \Delta J^{\mathrm{in}}_{ik} \propto \frac{1}{2\beta} \left[ (h_i^{\beta} - h_i^{-\beta})\, \rho'(h_i^{0})\, u_k \right], \tag{128} $$
+
+while for the feedforward weights connecting the hidden to the output layer, we get:
+
+$$ \Delta (W_{h \to o})_{ji} \propto \frac{1}{2\beta} \left[ (o_j^{\beta} - o_j^{-\beta})\, \rho'(o_j^{0})\, \rho(h_i^{0}) \right]. \tag{129} $$
+
+Note that EP is not applicable in this case.
+
+G.4.2. Supplementary Numerical Results
+
+In addition to the final accuracy reported in Sec. 5.3, we show in Fig. 7 the evolution of the
+accuracy over 20 epochs for AsymEP and VF.
+
+Figure 7. Comparison of AsymEP and VF on a feedforward network. Test accuracy on MNIST is shown
+as a function of training epochs for a single-hidden-layer network with 20 neurons. Curves
+report the mean and standard deviation over 10 runs. Best accuracies are 92.7% ± 0.5% (AsymEP)
+and 64.3% ± 2.0% (VF).
+
+G.5. Advantages of Non-Conservative Dynamics
+
+In Section 5.4, we compare three (non-)conservative dynamics under varying constraints. To
+further evaluate learning speed, Table 6 reports network performance after a single epoch.
+These results confirm our earlier observation: AsymEP learns faster than VF.
+
+Table 6. Test accuracy on Fashion-MNIST (%) at Epoch 1 (mean ± std over 10 runs). The table
+compares three classes of network dynamics: Continuous Hopfield (CH), Predictive Coding (PC),
+and Standard dynamics. Each is evaluated under three connectivity structures: Asymmetric
+(Asym, $B_k \neq W_{k+1}^{\top}$), Symmetric/conservative (Sym, $B_k = W_{k+1}^{\top}$), and
+Feedforward (Feedfor, $B_k = 0$).
+
+Dynamics   Structure   EP / AsymEP     VF
+CH         Asym        74.91 ± 0.45    48.98 ± 4.09
+CH         Feedfor     74.36 ± 0.29    48.84 ± 3.46
+CH         Sym         74.57 ± 0.43    –
+PC         Asym        77.83 ± 0.47    57.75 ± 3.37
+PC         Sym         76.23 ± 0.39    –
+Standard   Asym        76.87 ± 0.51    61.50 ± 4.06
+Standard   Feedfor     77.92 ± 0.51    63.98 ± 0.73
+
+G.6. Feedforward CIFAR-10 Experiments
+
+This appendix details the architecture and hyperparameters of the deep feedforward experiments
+comparing backpropagation (BP), VF, AsymEP and Dyadic EP on CIFAR-10 (see subsection 5.5).
+
+Architecture. We use a nine-layer convolutional network (denoted CNN9). The first eight layers
+are convolutional with 3 × 3 kernels and zero-padding; spatial downsampling is performed by
+strided convolutions (stride 2 on layers 2, 4, 6, 8 and stride 1 otherwise), so no pooling is
+used. The channel widths follow the sequence 3 → 64 → 64 → 128 → 128 → 256 → 256 → 512 → 512,
+reducing the spatial resolution from 32 × 32 to 2 × 2. The last layer is a fully connected
+readout mapping the 512 × 2 × 2 feature map to the 10 class logits. All hidden units use a ReLU
+nonlinearity. Weights are initialized with the Kaiming scheme
+($\sigma = \sqrt{2/\text{fan-in}}$) and biases at zero.
+
+Training setup. All methods are trained for 40 epochs with batch size 64 and repeated over 5
+seeds. Inputs are normalized per channel and augmented with random 32 × 32 crops (padding 4),
+random horizontal flips and Cutout (one 8 × 8 patch). Parameters are updated with SGD with
+momentum 0.9, weight decay $5 \times 10^{-4}$ and gradient-norm clipping at 1, under a cosine
+learning-rate schedule decaying from $3.5 \times 10^{-2}$ to $2 \times 10^{-4}$. Test accuracy
+is reported on an exponential moving average of the weights (decay 0.9995). The targets are
+smoothed ($\varepsilon = 0.1$), which for the EP methods amounts to nudging toward the smoothed
+one-hot vector.
+
+Relaxation hyperparameters. The four methods differ only in the gradient estimator: BP uses
+automatic differentiation, while the EP-based methods contrast stationary states of the
+corresponding relaxation dynamics. VF uses an integration step $\eta = 1.0$, nudging
+$\beta = 0.1$, and at most $K = 1000$ relaxation steps with an early-stopping threshold of
+$9 \times 10^{-6}$ on the mean state update. Dyadic EP uses the same settings except for a
+nudging strength $\beta = 0.1$. AsymEP uses a smaller step $\eta = 0.5$, nudging $\beta = 0.1$,
+and up to $K = 250$ relaxation steps with a threshold of $1 \times 10^{-4}$.
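
The CNN9 geometry described under "Architecture" in G.6 can be checked with a few lines of shape arithmetic. The sketch below is illustrative only and is not the authors' code: it assumes a padding of 1 pixel for the 3 × 3 convolutions (the text says only "zero-padding"), and the names `conv_out`, `CHANNELS`, `STRIDES` and `cnn9_shapes` are hypothetical.

```python
# Shape arithmetic for the CNN9 described in G.6 (illustrative sketch).
# Assumption: 3x3 kernels use zero-padding of 1 pixel; the source does not
# state the padding amount explicitly.

def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

# Channel widths from the text: 3 -> 64 -> 64 -> 128 -> 128 -> 256 -> 256 -> 512 -> 512
CHANNELS = [3, 64, 64, 128, 128, 256, 256, 512, 512]
# Stride 2 on convolutional layers 2, 4, 6, 8 and stride 1 otherwise.
STRIDES = [1, 2, 1, 2, 1, 2, 1, 2]

def cnn9_shapes(input_size=32):
    """Return (channels, height, width) after each of the eight conv layers."""
    shapes = []
    size = input_size
    for out_channels, stride in zip(CHANNELS[1:], STRIDES):
        size = conv_out(size, stride=stride)
        shapes.append((out_channels, size, size))
    return shapes

shapes = cnn9_shapes()
channels, height, width = shapes[-1]
readout_in = channels * height * width  # input features of the final linear readout

for layer, shape in enumerate(shapes, start=1):
    print(f"conv{layer}: {shape}")
print("readout input features:", readout_in)
```

Under the padding assumption, the four stride-2 layers halve the resolution from 32 to 2, which matches the 512 × 2 × 2 feature map that the text feeds to the 10-way readout.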
\ No newline at end of file |
