From b83947778e2c776f757a07d4719b7ce961d7ed55 Mon Sep 17 00:00:00 2001
From: Yuren Hao <yurenh2@illinois.edu>
Date: Fri, 3 Jul 2026 05:56:50 -0500
Subject: =?UTF-8?q?Initial=20commit:=20ept=20=E2=80=94=20backprop-free=20e?=
 =?UTF-8?q?quilibrium=20transformer=20(EP)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
---
 ep_run/analogET_extracted.txt | 1861 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1861 insertions(+)
 create mode 100644 ep_run/analogET_extracted.txt

diff --git a/ep_run/analogET_extracted.txt b/ep_run/analogET_extracted.txt
new file mode 100644
index 0000000..b139640
--- /dev/null
+++ b/ep_run/analogET_extracted.txt
@@ -0,0 +1,1861 @@
+                                         Dense Associative Memories with Analog Circuits
+                                         Marc Gong Bacvanski^1, Xincheng You^2, John Hopfield^3, and Dmitry Krotov^4
+                                         ^1 MIT, ^2 Independent Researcher, ^3 Princeton University, ^4 IBM Research
+
+                                         December 16 2025
+                                         arXiv:2512.15002v1 [cs.NE] 17 Dec 2025
+
+
+
+
+                                             Abstract: The increasing computational demands of modern AI systems have exposed fundamental
+                                         limitations of digital hardware, driving interest in alternative paradigms for efficient large-scale inference.
+                                         Dense Associative Memory (DenseAM) is a family of models that offers a flexible framework for repre-
+                                         senting many contemporary neural architectures, such as transformers and diffusion models, by casting
+                                         them as dynamical systems evolving on an energy landscape. In this work, we propose a general method
+                                         for building analog accelerators for DenseAMs and implementing them using electronic RC circuits, cross-
+                                         bar arrays, and amplifiers. We find that our analog DenseAM hardware performs inference in constant
+                                         time independent of model size. This result highlights an asymptotic advantage of analog DenseAMs
+                                         over digital numerical solvers that scale at least linearly with the model size. We consider three settings
+                                         of progressively increasing complexity: XOR, the Hamming (7,4) code, and a simple language model
+                                         defined on binary variables. We propose analog implementations of these three models and analyze the
+                                         scaling of inference time, energy consumption, and hardware. Finally, we estimate lower bounds on the
+                                         achievable time constants imposed by amplifier specifications, suggesting that even conservative existing
+                                         analog technology can enable inference times on the order of tens to hundreds of nanoseconds. By har-
+                                         nessing the intrinsic parallelism and continuous-time operation of analog circuits, our DenseAM-based
+                                         accelerator design offers a new avenue for fast and scalable AI hardware.
+
+
+                                         1     Introduction
+                                         The unprecedented growth of artificial intelligence (AI) has driven demand for increasingly large and
+                                         powerful models. At present, the field of generative AI is primarily driven by two settings: autore-
+                                         gressive transformers [1] and diffusion models [2]. While these settings have demonstrated remarkable
+                                         capabilities, they do so at a substantial computational cost. Their current implementations utilize digital
+                                         computation, which faces fundamental challenges in energy efficiency, scalability, and latency, especially
+                                         as model sizes and deployment demands continue to grow [3, 4, 5]. These limitations have prompted
+                                         interest in alternative computational paradigms that can efficiently handle the demands of modern AI
+                                         workloads [6].
+                                             Dense Associative Memories (DenseAMs) [7, 8], a promising class of AI models which generalize
+                                         Hopfield networks [9], offer a new angle for tackling these problems. Unlike conventional feed-forward
+                                         models, DenseAM inference can be defined through the temporal evolution of a state vector that is
+                                         governed by a system of differential equations [10]. The state vector can be thought of as a particle
+                                         exploring the surface of a high-dimensional energy landscape, which is the Lyapunov function of these
+                                         dynamical equations. DenseAMs have been demonstrated to be flexible and expressive computational
+                                         frameworks, capable of representing many primitives of modern AI architectures, such as the attention
+                                         mechanism [11], transformers [12], and diffusion models [13, 14, 15]. Furthermore, DenseAMs are
+                                         error-correcting systems [16], a property ensuring that small perturbations of the desired temporal
+                                         evolution of the state vector are corrected away by the dynamics of the network itself, rather than
+                                         accumulated in time. Finally, DenseAMs are asymptotically stable: the computation happens during a
+                                         finite transient period of time, which is followed by a steady state of neural activities. This
+                                         asymptotic stabilization of dynamical trajectories removes the requirement to read out the “answer”
+                                         to the computation problem at a precise moment of time, making DenseAMs robust to several classes of
+                                         hardware imperfections. The confluence of the above properties makes DenseAMs appealing networks for
+                                         analog hardware implementations that, on the one hand, are grounded in the physics of stable
+                                         error-correcting dynamical systems and, on the other hand, are capable of representing computation
+                                         in state-of-the-art AI networks.
+
+                                         Code available at https://github.com/mbacvanski/AnalogET.
+
+    In 1989, Hopfield argued that analog neural hardware can exceed the efficiency of digital implemen-
+tations when the device physics directly instantiate the computational dynamics of the model itself [17].
+Here, we revisit this idea with DenseAM models: we propose an analog circuit-based hardware accel-
+erator design whose dynamics directly realize those of the DenseAM. We find that analog DenseAM
+hardware enables constant-time inference independent of model size, which is in stark contrast to GPU
+solvers and digital implementations. This intrinsic property makes DenseAM a natural fit for analog AI
+accelerators, and it highlights our circuit architecture as a viable hardware path to realize them. Using
+component specifications already demonstrated in fabricated devices, analog DenseAM hardware may
+achieve inference times on the order of tens to hundreds of nanoseconds, several orders of magnitude
+faster than digital systems.
+    By leveraging the natural dynamics of analog systems, this work establishes a new design of fast and
+scalable AI accelerators. The framework of DenseAMs and their efficient analog hardware implementa-
+tions suggest a pathway for fundamentally redesigning the hardware-software interface for AI, enabling
+a new paradigm for fast, energy-efficient, and scalable computation.
+
+
+2    Dense Associative Memory basics
+The DenseAM framework [10, 18] provides a model that has straightforward neuronal dynamics, yet is
+surprisingly expressive in its ability to represent AI models including transformer attention, diffusion
+models, and associative memories. In its simplest form it is defined by two sets of neurons (typically
+called visible and hidden neurons) and a system of coupled non-linear differential equations governing
+their behavior, see Figure 1. The visible neurons are characterized by their internal states vi and their
+outputs gi , index i = 1 . . . Nv ; while the hidden neurons have internal states hµ and outputs fµ , index
+µ = 1 . . . Nh . From the AI perspective, one can think about the internal state of the neuron as a pre-activation
+of that neuron, and the output as a post-activation, which is obtained by applying an activation function
+to the pre-activation. From the biological perspective, one can think about the internal state of the
+neuron as a membrane voltage potential, and the output of that neuron as an axonal output, or a firing
+rate of that neuron. This framework admits both neuron-wise activation functions (gi = g(vi ), where
+g(·) is some continuous function, e.g., a ReLU), and collective activation functions such as softmax or
+layer normalization, which depend on the states of multiple neurons.
+    The network parameters are stored in the synaptic weights ξ ∈ R^(Nh×Nv), whose matrix elements
+ξµi can be either hand-engineered or learned. The time decay constants for the two groups
+of neurons are τv and τh . With these conventions, the temporal evolution of the two groups of neurons
+can be expressed as
+
+    \tau_v \frac{dv_i}{dt} = \sum_{\mu=1}^{N_h} \xi_{\mu i} f_\mu + a_i - v_i
+    \tau_h \frac{dh_\mu}{dt} = \sum_{i=1}^{N_v} \xi_{\mu i} g_i + b_\mu - h_\mu        (1)
+
+This forms a bipartite graph of neuronal connections, where the state of the hidden neurons is updated
+by the state of the visible neurons, and vice versa. Importantly, the same matrix ξ appears in both
+equations, once as ξ and again as ξ ⊤ . Although this is sometimes described as using “symmetric”
+weights, ξ is not assumed to be symmetric in the linear-algebraic sense; it is simply the same matrix
+used in both directions. Finally, ai and bµ denote biases, which are additional weights of the system and
+whose values may be hard-coded or learned depending on the application.
+
+Figure 1: Top left: Bipartite neural network formulation, where hidden neurons hµ and visible neurons
+vi are connected via symmetric synaptic weights ξ. Top right: Circuit realization of symmetric weight
+matrix via resistive crossbar array. Each crosspoint encodes a weight ξµi by its resistance Rµi = 1/ξµi .
+Lower right: Circuit schematic of a single hidden neuron. It drives its row of the crossbar array with
+a voltage according to its activation fµ , and its internal dynamics are driven by the incoming current
+flowing into it from the crossbar array. Lower left: Softmax activation function built from bipolar
+junction transistors (some components not shown).
+
+    The most important aspect of this model is the existence of a global energy function (Lyapunov
+function) that describes neuronal dynamics. To demonstrate this, it is most convenient to use the
+Lagrangian formalism [10, 18, 16]. Each set of neurons is defined through a Lagrangian function of their
+internal states. The activation functions are defined as partial derivatives of that Lagrangian with respect
+to internal states. The total energy is the sum of energies of each set of neurons, plus the interaction
+energy. The energy of each set of neurons is a Legendre transformation of the corresponding Lagrangian
+(plus the term proportional to the bias). Thus, the global energy of Equation 1 is given by
+
+    E = \underbrace{\sum_{i=1}^{N_v} g_i (v_i - a_i) - L_v}_{\text{energy of visible neurons}}
+      + \underbrace{\sum_{\mu=1}^{N_h} f_\mu (h_\mu - b_\mu) - L_h}_{\text{energy of hidden neurons}}
+      - \underbrace{\sum_{\mu=1}^{N_h} \sum_{i=1}^{N_v} f_\mu \xi_{\mu i} g_i}_{\text{interaction energy}}        (2)
+
+where the activation functions are defined as partial derivatives of the Lagrangians
+
+    g_i = \frac{\partial L_v}{\partial v_i}, \qquad f_\mu = \frac{\partial L_h}{\partial h_\mu}
+
+For convex Lagrangians this global energy decreases with time on the dynamical trajectories of Equa-
+tion 1. If, additionally, the activation functions (and corresponding Lagrangians) are chosen in such a
+way that this energy is bounded from below, the dynamical trajectories are guaranteed to arrive at a
+stable fixed point of activations. The dynamical equations typically have many asymptotic fixed points,
+which correspond to local minima of the energy function in Equation 2. Both properties above (convexity
+of Lagrangians and lower-bounded energy) are satisfied for all settings studied in this paper. By picking
+different nonlinear activation functions (or corresponding Lagrangians), this system yields a variety of
+models that can describe associative memory, softmax attention, and other commonly used settings in
+AI [10, 11, 18, 19, 20].
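+    The monotonic decrease of the energy stated above can be checked directly from Equations 1 and 2.
+The block below is a sketch of the standard argument, assuming twice-differentiable Lagrangians; the
+Hessian symbols H^v and H^h are notation introduced here for illustration and are not used elsewhere.
+
+```latex
+% Hessians of the Lagrangians (illustrative notation, not from the text)
+H^v_{ij} = \frac{\partial^2 L_v}{\partial v_i \partial v_j}, \qquad
+H^h_{\mu\nu} = \frac{\partial^2 L_h}{\partial h_\mu \partial h_\nu}
+
+% Differentiating Equation 2 with g_j = \partial L_v / \partial v_j, the bare g_i terms cancel, leaving
+\frac{\partial E}{\partial v_i}
+  = -\sum_{j=1}^{N_v} H^v_{ij} \Big( \sum_{\mu=1}^{N_h} \xi_{\mu j} f_\mu + a_j - v_j \Big)
+  = -\tau_v \sum_{j=1}^{N_v} H^v_{ij} \frac{dv_j}{dt}
+
+% The same steps for the hidden neurons give
+\frac{\partial E}{\partial h_\mu} = -\tau_h \sum_{\nu=1}^{N_h} H^h_{\mu\nu} \frac{dh_\nu}{dt}
+
+% Chain rule along a trajectory of Equation 1
+\frac{dE}{dt}
+  = -\tau_v \sum_{i,j} \frac{dv_i}{dt} H^v_{ij} \frac{dv_j}{dt}
+    -\tau_h \sum_{\mu,\nu} \frac{dh_\mu}{dt} H^h_{\mu\nu} \frac{dh_\nu}{dt}
+  \le 0
+```
+
+The inequality holds because convex Lagrangians have positive semi-definite Hessians, so each
+quadratic form is non-negative.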
+    A particularly relevant example for modern sequence modeling is the Energy Transformer (ET) [12],
+which reformulates the transformer’s inference pass as a gradient flow on an energy function defined over
+the set of tokens. The ET block contains two contributions to the energy function: an attention energy
+and a Hopfield network energy. The energy attention module routes the information between the tokens,
+while the Hopfield module aligns the tokens with the manifold of token embeddings. In our implementation,
+the context tokens act as a set of dynamically instantiated memories that interact with the predicted
+token through a DenseAM-like energy. In Section 6 we exploit this connection to construct an Analog Energy
+Transformer (Analog ET) whose continuous-time dynamics are implemented directly in hardware using
+our DenseAM circuit primitives.
+
+
+3    Related work
+Early analog implementations of associative memories focused on the classical Hopfield network. Founda-
+tional designs, such as continuous-time analog circuits [21, 22] and later demonstrations using amorphous-
+silicon resistors [23], memristive devices [24, 25], and phase-change memories [26], targeted the quadratic
+Hopfield energy function. These works emphasize device engineering and memory-cell design rather than
+system-level dynamics, and inherit the limited storage capacity and representational power of traditional
+Hopfield networks. That line of research is largely concerned with how to fabricate programmable re-
+sistance elements themselves; our work assumes programmable conductances as a given primitive and
+focuses on the continuous-time dynamics that operate on top of them. Our work also differs from these
+works by addressing DenseAMs with higher-order energy functions and continuous-valued states.
+     Another direction is the use of cavity-QED systems for associative memory. Marsh et al. [27] analyze
+a confocal cavity implementation of a quadratic Hopfield network and show that the cavity dynamics
+induce a descent-like relaxation rule on spin states. Their model remains restricted to quadratic energies
+and binary spins, and operates in a cryogenic, cavity-QED setting. Our work instead targets higher-order
+DenseAMs with continuous states, and emphasizes scalable, room-temperature analog microelectronics
+with explicit hardware-aware dynamical analysis.
+     More recent physical implementations move beyond purely quadratic energies. Musa et al. [28]
+propose a free-space optical realization of the higher-order DenseAM energy. Their system constructs a
+static physical representation of the energy landscape, but inference relies on an external digital controller
+that performs iterative spin-flip updates. Thus, the hardware computes energies, while the optimization
+dynamics remain digital. In contrast, our analog microelectronic architecture embeds the gradient flow
+itself into hardware: inference is performed by a single continuous-time evolution rather than by discrete
+digital updates.
+
+
+4    DenseAM circuit design
+Here, we introduce a novel architecture for a class of analog electronic hardware accelerators whose time
+evolution realizes the system of nonlinear differential equations in Equation 1. Our DenseAM design,
+shown in Figure 1, consists of two sets of neurons that interact through a resistive crossbar array.
+The resistive crossbar array turns voltage differences between neurons into currents flowing between the
+neurons according to synaptic weights, and each neuron’s internal circuitry converts those currents into
+dynamics that reproduce Equation 1.
+
+Resistive weights as a crossbar array. The crossbar array construction is a canonical design of
+matrix-vector multiplication using analog electronics [17, 29], and is a natural fit for the weight matrix
+ξ in our model. Traditionally, each crosspoint between a row and column line is connected by a resistor
+(often memristors, RRAM, or other analog memories that produce resistances), a vector of input voltages
+is applied at row lines, and the column lines are held at ground typically via a transimpedance amplifier.
+By Ohm’s law, each resistive crosspoint produces a current that multiplies the row’s input voltage by
+the inverse of the resistance. Because currents add along each column line, the total current output at a
+column is the inner product between the vector of input voltages and the column’s conductance vector.
+Thus, the array as a whole implements a parallel analog matrix multiplication of the form Iout = GVin ,
+where G is the matrix of conductances (inverse of resistances).
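+    As a small numerical illustration of this column-current sum, consider a hypothetical 2 × 2 crossbar.
+The voltages and resistances below are made-up values chosen for easy arithmetic, not taken from the
+paper; R_{ij} denotes the resistor joining row i to column j.
+
+```latex
+% Hypothetical input voltages and crosspoint resistances
+V_1 = 1\,\mathrm{V}, \quad V_2 = 0.5\,\mathrm{V}, \qquad
+R_{11} = 1\,\mathrm{k\Omega}, \quad R_{21} = 2\,\mathrm{k\Omega}, \quad
+R_{12} = 4\,\mathrm{k\Omega}, \quad R_{22} = 1\,\mathrm{k\Omega}
+
+% Ohm's law at each crosspoint, then Kirchhoff's current law along each grounded column
+I_1 = \frac{V_1}{R_{11}} + \frac{V_2}{R_{21}} = 1\,\mathrm{mA} + 0.25\,\mathrm{mA} = 1.25\,\mathrm{mA}
+I_2 = \frac{V_1}{R_{12}} + \frac{V_2}{R_{22}} = 0.25\,\mathrm{mA} + 0.5\,\mathrm{mA} = 0.75\,\mathrm{mA}
+```
+
+Both column currents settle simultaneously, which is the sense in which the array multiplies in parallel.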
+    Unlike a traditional crossbar array whose rows are driven at a fixed voltage and whose columns
+are held at ground, our DenseAM circuit design uses each weight bidirectionally, exactly representing
+the bidirectional connections between visible and hidden neurons. As a result, the current flowing into
+each neuron corresponds to the weighted sum of the differences between visible and hidden neuron
+activations. For example, for hidden neuron µ, this current is Σ_i ξµi (gi − fµ). This construction enables
+weight symmetry to be enforced by hardware sharing: both forward and reverse weights are realized by
+the same resistive elements. Importantly, as long as weights are represented as conductances, they must
+be non-negative.
+
+Figure 2: Solving XOR with a DenseAM. Visible neuron g3 = v3 serves as the output, while the two
+input neurons (unlabeled, thin lines) are clamped at 1 and 0 for True and False. Output v3 is initialized
+at 0.5 and converges to a positive prediction of 1. The hidden neuron f3 for the truth-table row (1, 0, 1)
+becomes highly activated, while the others (fine lines) are suppressed by softmax. Energy (2), or
+equivalently (5), decreases monotonically along the inference trajectory. [Panels, top to bottom: visible
+neuron activations, hidden neuron activations, and energy, each plotted against time (s).]
+
+Figure 3: XOR energy landscape of neuron v3 under different settings of visible input neurons v1 and
+v2. Minima in the energy function correspond to stationary points of the dynamics. Gradient flow
+dynamics bring v3 to these attractor points, resulting in correct XOR outputs. [Panels: energy as a
+function of v3 for clamped inputs (1, 0), (1, 1), (0, 0), and (0, 1).]
+
+Design of a single neuron. Each neuron in the circuit computes its dynamics by integrating the currents
+it receives from the crossbar array, which represent weighted differences between its own activation
+and those of connected neurons. Considering a hidden neuron (the design for visible neurons is
+analogous), the neuron’s internal voltage hµ is stored on capacitor C1 and evolves in continuous time,
+while the neuron’s activation fµ is obtained by passing hµ through a nonlinear function (e.g. ReLU or
+softmax).
+    The current flowing into hidden neuron µ is produced by its interaction with all visible neurons via
+the synaptic weights ξµi for i = 1, . . . , Nv. Specifically, this current is a weighted sum of the
+differences between neuron activations: Σ_i ξµi (gi − fµ). Inside each neuron, a “self” path scales fµ to
+produce the voltage sµ = fµ Σ_i ξµi. This term is added to the value of the incoming current so that the
+−fµ Σ_i ξµi term is cancelled inside each neuron. As a result, the hidden state, represented as the
+voltage across capacitor C1, integrates only the desired weighted input plus any external stimulus bµ.
+Its dynamics reduce to the canonical DenseAM form with a time constant of R2 C1:
+
+    R_2 C_1 \frac{dh_\mu}{dt} = \sum_{i=1}^{N_v} \xi_{\mu i} g_i + b_\mu - h_\mu        (3)
+
+Elementwise (or vectorized) nonlinearities then produce activations gi = g(vi ) and fµ = f (hµ ) (e.g.,
+ReLU, softmax) across the visible and hidden neurons. See Appendix A for the full circuit derivation.
+
+
+5           Analog DenseAM Examples
+We begin by studying two examples of the proposed design: the XOR task, and the (7,4) error-correcting
+Hamming code.
+
+5.1    XOR
+The XOR problem is a canonical test for nonlinear representation and inference, as it cannot be solved
+by any linear model. We show a minimal DenseAM model for the XOR task, illustrating how energy-
+based dynamics can solve this simple task with a continuous-time analog system. The network consists
+of Nv = 3 visible neurons, and Nh = 4 hidden neurons. At t = 0 visible neurons v1 and v2 are initialized
+at their input values corresponding to the input bits. The last visible neuron v3 is initialized at v3 = 0.5.
+The hidden neurons are initialized at zero. The two input visible neurons remain clamped during the
+dynamics, while the third output visible neuron and the hidden neurons evolve in time according to (1).
+Each row of the memory matrix ξ corresponds to a row of the XOR truth table. The visible neurons
+use an identity activation function where gi = vi , and the hidden neurons use a softmax activation. The
+biases are set as
+
+    a_i = 0, \qquad b_\mu = -\frac{1}{2} \sum_{i=1}^{N_v} \xi_{\mu i}^2
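+
+    For concreteness, the memory matrix and biases can be written out. This is a sketch that assumes
+the truth-table rows are ordered (0, 0), (0, 1), (1, 0), (1, 1), an ordering inferred from Figure 2, where
+f3 corresponds to the row (1, 0, 1); the actual ordering used in the implementation may differ.
+
+```latex
+% Rows: (input 1, input 2, XOR output); the row order is an assumption
+\xi = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix},
+\qquad
+b = -\frac{1}{2} \begin{pmatrix} 0 \\ 2 \\ 2 \\ 2 \end{pmatrix}
+  = \begin{pmatrix} 0 \\ -1 \\ -1 \\ -1 \end{pmatrix}
+
+% Steady-state hidden inputs \sum_i \xi_{\mu i} g_i + b_\mu at the initial visible state v = (1, 0, 0.5)
+\xi v + b
+  = \begin{pmatrix} 0 \\ 0.5 \\ 1.5 \\ 1 \end{pmatrix} + \begin{pmatrix} 0 \\ -1 \\ -1 \\ -1 \end{pmatrix}
+  = \begin{pmatrix} 0 \\ -0.5 \\ 0.5 \\ 0 \end{pmatrix}
+```
+
+Under this ordering the largest entry is the third, so the softmax already favors the row (1, 0, 1) at t = 0.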
+
+    Figure 2 shows the temporal evolution of visible and hidden neuron activations, as well as the total
+energy, during inference on the XOR input (1, 0). The output visible neuron’s activation g3 gradually
+converges to the correct prediction of 1, while the hidden neuron associated with that memory, f3 ,
+becomes strongly activated and the remaining hidden neurons are suppressed by the softmax nonlinearity.
+The system’s energy decreases monotonically throughout the trajectory and stabilizes once the network
+reaches its fixed-point prediction. Figure 3 depicts the system’s energy landscape as a function of output
+neuron v3 for different clamped input configurations (v1 , v2 ). In each case, the energy exhibits a clear
+convex minimum at the correct XOR output, demonstrating that gradient flow along the energy surface
+drives v3 reliably toward the correct prediction. As shown in Appendix C, we validate our circuit design
+and dynamics using SPICE simulation.
+    To analyze this DenseAM, it is instructive to consider the limit τh → 0. Since the second equation in
+(1) is linear in hidden units hµ, they can be integrated out. With Σ_µ fµ = 1, the resulting dynamics
+of the visible neurons can be written as
+
+    \tau_v \frac{dv_i}{dt} = \sum_{\mu=1}^{N_h} (\xi_{\mu i} - v_i) f_\mu
+    \quad \text{where} \quad
+    f_\mu = \mathrm{softmax}\Big( -\frac{\beta}{2} \sum_{i=1}^{N_v} (\xi_{\mu i} - v_i)^2 \Big)        (4)
+
+The effective energy on the visible neurons can be written as
+
+    E^{\mathrm{eff}}(v) = -\frac{1}{\beta} \log \sum_{\mu=1}^{N_h}
+        \exp\Big[ -\frac{\beta}{2} \sum_{i=1}^{N_v} (\xi_{\mu i} - v_i)^2 \Big]        (5)
+
+Intuitively, each hidden neuron computes a squared Euclidean distance between the visible state and its
+stored pattern ξµ. The softmax nonlinearity assigns higher weight to the pattern closest to the current
+state of the visible neurons. The resulting visible neuron dynamics are gradient flow for this effective
+energy. It is important to note that memories in this implementation are represented by conductances
+of the crossbar array, which are always positive. For this reason, matrix elements of memories ξµi must
+be positive, necessitating the use of the bias terms, which are just voltage sources that can be arbitrarily
+signed.
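+    The link between the bias choice and Equations 4 and 5 can be made explicit. The sketch below
+assumes that the hidden softmax acts on β hµ with inverse temperature β, which is inferred from Equation 4
+rather than stated above, and uses gi = vi and ai = 0 as in the XOR setup.
+
+```latex
+% Steady state of the hidden equation in (1) with b_\mu = -\frac{1}{2} \sum_i \xi_{\mu i}^2
+h_\mu = \sum_{i=1}^{N_v} \xi_{\mu i} v_i - \frac{1}{2} \sum_{i=1}^{N_v} \xi_{\mu i}^2
+      = -\frac{1}{2} \sum_{i=1}^{N_v} (\xi_{\mu i} - v_i)^2 + \frac{1}{2} \sum_{i=1}^{N_v} v_i^2
+
+% The last term is the same for every \mu, and softmax is invariant to a common shift, so
+f_\mu = \mathrm{softmax}(\beta h_\mu)
+      = \mathrm{softmax}\Big( -\frac{\beta}{2} \sum_{i=1}^{N_v} (\xi_{\mu i} - v_i)^2 \Big)
+
+% The gradient of the effective energy (5) reproduces the right-hand side of (4)
+-\frac{\partial E^{\mathrm{eff}}}{\partial v_i}
+  = \sum_{\mu=1}^{N_h} f_\mu (\xi_{\mu i} - v_i)
+  = \tau_v \frac{dv_i}{dt}
+```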
+    While a time constant of τh = 0 is impossible to physically construct due to finite conductances
+and nonzero capacitances, setting τh ≪ τv realizes the same adiabatic limit in practice. When hidden
+neurons evolve much faster than visible ones, they reach their steady state almost instantaneously for each
+configuration of visible neurons. The result is an adiabatic elimination of hidden dynamics, yielding the
+effective visible-only dynamics above. In practice, for the XOR task, even a relatively modest τh = τv /10
+ratio yields perfect performance.
+
+5.2    Hamming (7,4) code
+The Hamming (7,4) code is an error-correcting code that encodes 4 data bits into a 7-bit codeword by
+adding 3 parity bits. The resulting 7-bit strings are special: only certain patterns are valid codewords,
+and they are spaced apart so that if a single bit is flipped, the error can be detected and corrected [30].
+Table 1 lists the 16 codewords corresponding to four arbitrary data bits.
+
+Figure 4: Correcting a bit error in a Hamming (7,4) code. Visible neuron g5 flips indicating the
+bit flip error happened on the 5th codeword bit. f7 is the hidden neuron corresponding to the memory
+of the correct codeword. Thin lines correspond to the other neuron activations. [Panels, top to bottom:
+visible neuron activations, hidden neuron activations, and energy, each plotted against time (s).]
+
+    Data bits (d1 d2 d3 d4)    Codeword (c1 c2 c3 c4 c5 c6 c7)
+    0000                       0000000
+    0001                       0001111
+    0010                       0010110
+    0011                       0011001
+    0100                       0100101
+    0101                       0101010
+    0110                       0110011
+    0111                       0111100
+    1000                       1000011
+    1001                       1001100
+    1010                       1010101
+    1011                       1011010
+    1100                       1100110
+    1101                       1101001
+    1110                       1110000
+    1111                       1111111
+
+Table 1: Valid codewords of the Hamming(7,4) code, ordered by their 4-bit data payload.
+
+
+    Unlike the XOR case where the only evolving neuron is the readout bit, the Hamming (7,4) code may
+require flipping the value of any one of the visible neurons. During inference, the visible neurons are
+initialized to the corrupted 7-bit input word. All neurons are left free to evolve, and the dynamics relax
+the state toward the nearest stored codeword. Energy minima are located at the valid codewords, so the
+network converges to the correct code provided the error is within the Hamming radius of 1. Thus, the
+DenseAM replicates the standard decoding property of the Hamming (7,4) code: any single-bit flip is
+corrected automatically. Figure 4 illustrates a case where a flipped bit g5 is restored during convergence.
+    The Hamming (7,4) model’s 7 visible neurons, each corresponding to a codeword bit, are connected
+to 16 hidden neurons, each representing one valid codeword. The weight matrix ξ ∈ {0, 1}^{16×7} is formed
+by stacking the 16 codewords as its rows. Visible neurons have the identity activation, hidden neurons
+use a softmax activation, and biases are chosen as in the XOR case to give the same integrated-out
+visible dynamics as (4).
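+
+    The decoding behaviour described above can be reproduced with a minimal numerical sketch. It assumes
+the adiabatic limit (hidden neurons at their steady state h = ξv + b), zero visible bias, and the
+illustrative bias choice b_µ = −|ξ_µ|²/2; this bias choice, the values of β, dt, and the step count,
+and the helper names decode and flip are assumptions made for the example, not taken from the paper.
+
```python
import math

# The 16 codewords of Table 1, in the order of their 4-bit data payload.
CODEWORDS = [
    "0000000", "0001111", "0010110", "0011001",
    "0100101", "0101010", "0110011", "0111100",
    "1000011", "1001100", "1010101", "1011010",
    "1100110", "1101001", "1110000", "1111111",
]
XI = [[float(ch) for ch in word] for word in CODEWORDS]  # memory matrix, 16 x 7


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode(word, beta=8.0, tau_v=1.0, dt=0.1, steps=60):
    """Relax a 7-bit word under the visible dynamics and round the final state."""
    v = [float(ch) for ch in word]
    # Assumed bias choice (hypothetical): b_mu = -|xi_mu|^2 / 2.
    bias = [-0.5 * sum(x * x for x in row) for row in XI]
    for _ in range(steps):
        # Hidden neurons sit at their steady state h = xi v + b (adiabatic limit).
        h = [sum(x * vi for x, vi in zip(row, v)) + b for row, b in zip(XI, bias)]
        f = softmax([beta * hm for hm in h])
        pull = [sum(f[m] * XI[m][i] for m in range(16)) for i in range(7)]
        # Forward-Euler step of tau_v * dv/dt = xi^T f - v.
        v = [vi + dt / tau_v * (p - vi) for vi, p in zip(v, pull)]
    return "".join("1" if vi > 0.5 else "0" for vi in v)


def flip(word, i):
    """Return word with the bit at zero-based position i inverted."""
    return word[:i] + ("1" if word[i] == "0" else "0") + word[i + 1:]
```
+
+    Under these assumptions, calling decode(flip(c, i)) for a codeword c and any position i relaxes back
+to c, mirroring the single-error correction of Figure 4.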
+
+
+6    Analog Energy Transformer (Analog ET) via DenseAM
+Our DenseAM circuit construction can be used to build more complex energy-based models, such as
+the transformer-like architecture proposed in the Energy Transformer paper [12]. For causal next-token
+prediction with a single attention head, the Energy Transformer’s energy function can be written as the
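+following (see Appendix J for the full derivation):
+
+    \[
+    E = \tfrac{1}{2} \lVert v - a \rVert^2
+        - v^\top \left[ (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}} + (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}} \right]
+        + (f^{\mathrm{attn}})^\top \left( h^{\mathrm{attn}} - b \right)
+        + (f^{\mathrm{hopf}})^\top \left( h^{\mathrm{hopf}} - c \right)
+        - L^{\mathrm{attn}}(h^{\mathrm{attn}}) - L^{\mathrm{hopf}}(h^{\mathrm{hopf}})
+    \tag{6}
+    \]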
+
+with the activation functions and their Lagrangians defined as
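+
+    \[
+    f^{\mathrm{attn}}_A = \mathrm{softmax}(\beta h^{\mathrm{attn}})_A , \qquad
+    L^{\mathrm{attn}}(h) = \frac{1}{\beta} \log \sum_{A=1}^{L} e^{\beta h_A}
+    \tag{7}
+    \]
+    \[
+    f^{\mathrm{hopf}}_\mu = \mathrm{ReLU}(h^{\mathrm{hopf}}_\mu) , \qquad
+    L^{\mathrm{hopf}}(h) = \frac{1}{2} \sum_{\mu=1}^{M} \left[ \mathrm{ReLU}(h_\mu) \right]^2
+    \tag{8}
+    \]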
+
+where a, b, and c correspond to the biases of the visible neurons, attention hidden neurons, and Hopfield
+network hidden neurons, respectively. The L context tokens are indexed by A, and the M hidden neurons
+of the Hopfield network are indexed by µ. Because the visible units use an identity activation function,
+gi = vi in the language of Equation 1, the gradient flow of the energy yields the dynamics:
+
+    \[
+    \tau_v \dot{v} = -\frac{\partial E}{\partial v}
+        = (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}} + (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}} + a - v
+    \tag{9}
+    \]
+    \[
+    \tau_h \dot{h}^{\mathrm{attn}} = -\frac{\partial E}{\partial f^{\mathrm{attn}}}
+        = \xi^{\mathrm{attn}} v + b - h^{\mathrm{attn}}
+    \tag{10}
+    \]
+    \[
+    \tau_h \dot{h}^{\mathrm{hopf}} = -\frac{\partial E}{\partial f^{\mathrm{hopf}}}
+        = \xi^{\mathrm{hopf}} v + c - h^{\mathrm{hopf}}
+    \tag{11}
+    \]
+
+Figure 5: [circuit diagram omitted] Analog ET circuit demonstrating the autoregressive inference
+procedure. A newly inferred token is decoded, sampled, and re-embedded to obtain the weight vector
+ξ^attn_{L+1}, which is set as the weight vector for a new hidden neuron h^attn_{L+1} in the energy
+attention block (light gray on right). For this layout we have flipped the crossbar array, so that
+indices A and µ run horizontally and index i runs vertically.
+
+In this formulation, v represents the embedding of the output (next) token, and its evolution is driven by
+two terms: one from the energy attention with weights ξ^attn and hidden neuron activations f^attn,
+and one from the Hopfield network with weights ξ^hopf and hidden neuron activations f^hopf. The
+weights of the energy attention DenseAM depend on the context: for a token dimension D, context
+length L, and the task of predicting the token at index L + 1, the weights ξ^attn ∈ R^{L×D} are generated
+by applying a learned embedding matrix to each context token.
+In contrast, the Hopfield network weights ξ^hopf are learned during training and fixed at inference. The
+number of memories in the Hopfield network is a hyperparameter M, such that ξ^hopf ∈ R^{M×D}.
+    This system suggests a hardware implementation where v interacts with two independent DenseAMs,
+one for the energy attention and one for the Hopfield term, which can share the same physical crossbar
+structure. Figure 5 shows that the circuit structure remains a crossbar array (like Figure 1), but with
+two distinct classes of hidden neurons. Because currents sum along each row of the
+crossbar array, the incoming current to visible neuron vi is the sum of contributions from the energy
+attention block and from the Hopfield network block. The energy attention hidden neurons h^attn use a
+softmax activation function, while the Hopfield network hidden neurons h^hopf use a ReLU activation.
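+
+    The relation between the energy (6) and the visible dynamics (9) can be checked numerically. The
+sketch below evaluates (6) with the Lagrangians (7) and (8) for small random weights and compares a
+central-difference estimate of −∂E/∂v with the right-hand side of (9). The sizes D, L, M, the value of
+β, the random weights, and all function names are illustrative assumptions.
+
```python
import math
import random

BETA = 2.0  # inverse temperature of the attention softmax (illustrative value)


def softmax(h):
    top = max(h)
    exps = [math.exp(BETA * (x - top)) for x in h]
    total = sum(exps)
    return [e / total for e in exps]


def relu(h):
    return [max(0.0, x) for x in h]


def drive(p, f_attn, f_hopf):
    """xi_attn^T f_attn + xi_hopf^T f_hopf: the current summed into each visible neuron."""
    return [sum(row[i] * f for row, f in zip(p["xi_attn"], f_attn))
            + sum(row[i] * f for row, f in zip(p["xi_hopf"], f_hopf))
            for i in range(len(p["a"]))]


def energy(p, v, h_attn, h_hopf):
    """Energy of equation (6) with the Lagrangians of equations (7) and (8)."""
    f_attn, f_hopf = softmax(h_attn), relu(h_hopf)
    top = max(h_attn)
    lag_attn = top + math.log(sum(math.exp(BETA * (x - top)) for x in h_attn)) / BETA
    lag_hopf = 0.5 * sum(x * x for x in f_hopf)
    d = drive(p, f_attn, f_hopf)
    return (0.5 * sum((vi - ai) ** 2 for vi, ai in zip(v, p["a"]))
            - sum(vi * di for vi, di in zip(v, d))
            + sum(f * (h - b) for f, h, b in zip(f_attn, h_attn, p["b"]))
            + sum(f * (h - c) for f, h, c in zip(f_hopf, h_hopf, p["c"]))
            - lag_attn - lag_hopf)


def v_rhs(p, v, h_attn, h_hopf):
    """Right-hand side of equation (9)."""
    d = drive(p, softmax(h_attn), relu(h_hopf))
    return [di + ai - vi for di, ai, vi in zip(d, p["a"], v)]


def max_mismatch(p, v, h_attn, h_hopf, eps=1e-4):
    """Largest gap between -dE/dv (central differences, h held fixed) and equation (9)."""
    rhs = v_rhs(p, v, h_attn, h_hopf)
    worst = 0.0
    for i in range(len(v)):
        up = v[:i] + [v[i] + eps] + v[i + 1:]
        down = v[:i] + [v[i] - eps] + v[i + 1:]
        grad = (energy(p, up, h_attn, h_hopf) - energy(p, down, h_attn, h_hopf)) / (2 * eps)
        worst = max(worst, abs(-grad - rhs[i]))
    return worst


rng = random.Random(0)
D, L, M = 3, 4, 5  # token dimension, context length, Hopfield memories (toy sizes)


def vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


params = {"xi_attn": [vec(D) for _ in range(L)], "xi_hopf": [vec(D) for _ in range(M)],
          "a": vec(D), "b": vec(L), "c": vec(M)}
mismatch = max_mismatch(params, vec(D), vec(L), vec(M))
```
+
+    Because (6) is quadratic in v at fixed hidden state, the central difference carries no truncation
+error, so mismatch is expected to sit at the level of floating-point rounding.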
+
+6.1    Analog Energy Transformer on the parity task
+We build and evaluate the Analog ET on the L-bit parity task, which can be thought of as an elementary
+“language model”: given bits bit_1, ..., bit_L, predict
+
+    \[ \mathrm{bit}_{L+1} = \left( \sum_{A=1}^{L} \mathrm{bit}_A \right) \bmod 2 . \]
+
+Parity is instructive
+because it requires a representation of a global, order-L interaction, precluding linear and shallow models
+from representing it efficiently. A successful model must be able to form high-order interactions in order
+to generalize. We formulate parity as a next-token prediction problem: given an L-bit string as context,
+predict its parity in the next token.
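+
+    As a concrete illustration of the task, the sketch below enumerates all L-bit contexts together with
+their parity labels. The function names are hypothetical, and no particular train/validation split is
+implied.
+
```python
from itertools import product


def parity(bits):
    """Next-token target: the sum of the context bits modulo 2."""
    return sum(bits) % 2


def parity_dataset(length):
    """All 2**length contexts paired with their parity label."""
    return [(bits, parity(bits)) for bits in product((0, 1), repeat=length)]


def flip_bit(bits, i):
    """Return a copy of bits with position i inverted."""
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]


dataset = parity_dataset(8)
```
+
+    Flipping any single context bit flips the label, which is why no fixed subset of the bits suffices
+to predict the target.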
+    We train the Analog ET model digitally using backpropagation through time [31], implemented with
+JAX’s automatic differentiation. The resulting weights can be deployed onto the analog hardware; in
+our experiments we simulate the dynamics of the hardware with the Diffrax [32] ODE solver library. On
+the 8-bit parity task, our model achieves 100% accuracy on the hold-out validation set of 52 bit strings,
+demonstrating clear generalization capabilities. See Appendix H.1 for more details on training and model
+design.
+
+Figure 6: [plot omitted; panels titled "11001010 0" and "01000110 1", horizontal axis: time t]
+Inference of parity Analog ET on two example 8-bit strings. Top row plots the visible neurons vi
+over time, middle row plots the decoded token prediction, bottom row plots the energy, which decreases
+monotonically during inference. After a transient period of computation, the network arrives at a steady
+state, making the result of the computation robust against the precise timing of the readout.
+
+    Figure 6 shows the dynamics of the visible neurons and energy during two example inference runs
+of the Analog ET. Notably, the visible neuron values are constant by the end of the inference period,
+so the readout is robust to timing mismatch and delay. A
+single sample-and-hold and switching circuit would enable a single analog-to-digital converter (ADC) to
+read out all the visible neurons at convergence, significantly reducing mismatch and saving
+device area, complexity, and energy. The intrinsic stability of attractor points arises from
+the continuous-time dynamics of the DenseAM, making these models particularly well suited to analog
+hardware.
+
+6.2    Autoregressive inference
+Dashed lines in Figure 5 illustrate the autoregressive inference procedure of the Analog ET. To generate
+the L-th token given context tokens x^(1), ..., x^(L−1), each token is first embedded and concatenated to
+form the attention weight matrix
+
+    \[
+    \xi^{\mathrm{attn},(L-1)} =
+    \begin{pmatrix} e^{(1)} \\ e^{(2)} \\ \vdots \\ e^{(L-1)} \end{pmatrix}
+    \in \mathbb{R}^{(L-1) \times D}
+    \]
+
+These rows are loaded into the Analog ET’s energy attention weight matrix ξ^attn by programming the
+corresponding crossbar resistances. During inference, the visible state v(t) evolves according to the
+Analog ET dynamics until convergence. A decoder readout (e.g., a linear layer) applied to the converged
+v(t = T) values produces logits, from which the next token x^(L) is sampled. This token is then embedded
+to form e^(L) and appended to the existing context. The cycle repeats with the updated attention weight
+matrix
+
+    \[
+    \xi^{\mathrm{attn},(L)} =
+    \begin{pmatrix} \xi^{\mathrm{attn},(L-1)} \\ e^{(L)} \end{pmatrix}
+    \in \mathbb{R}^{L \times D}
+    \]
+
+which now includes the new embedding e^(L). In hardware, this corresponds to connecting an additional
+hidden neuron in the energy attention block of Figure 5 and setting its resistive weights to e^(L).
+Because the physical order of hidden neurons does not affect the energy function, this new neuron can
+be placed in any position among the hidden neurons. When the context length is fixed, the hidden
+neuron corresponding to the earliest token can simply be reprogrammed with the new vector of weights
+e^(L), resulting in the hardware equivalent of a sliding-window context. In practice, an external digital
+controller, e.g., a field-programmable gate array (FPGA) or application-specific integrated circuit
+(ASIC), would orchestrate crossbar programming and token decoding, while the DenseAM dynamics
+perform the far more substantial workload of computing each next-token embedding.
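+
+    The row-append and sliding-window bookkeeping can be sketched in software. The class below is a
+hypothetical stand-in for the digital controller’s view of the attention rows: it appends embeddings
+until a fixed capacity is reached and then overwrites the row holding the earliest token. The function
+attn_lagrangian evaluates the Lagrangian of (7) at the hidden steady state h = ξ^attn v of (10) with
+b = 0 (an assumption made for brevity), which is unchanged by any reordering of the rows.
+
```python
import math


class ContextCrossbar:
    """Software model of the attention rows of the crossbar (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = []
        self.oldest = 0  # slot that currently holds the earliest token

    def append(self, embedding):
        if len(self.rows) < self.capacity:
            # Context still growing: connect one more hidden neuron.
            self.rows.append(list(embedding))
        else:
            # Fixed context length: reprogram the row of the earliest token.
            self.rows[self.oldest] = list(embedding)
            self.oldest = (self.oldest + 1) % self.capacity


def attn_lagrangian(v, rows, beta=1.0):
    """(1/beta) * log(sum_A exp(beta * h_A)) with h = rows @ v, as in equation (7)."""
    h = [sum(x * vi for x, vi in zip(row, v)) for row in rows]
    top = max(h)
    return top + math.log(sum(math.exp(beta * (x - top)) for x in h)) / beta


bar = ContextCrossbar(capacity=3)
for k in range(1, 6):
    bar.append([float(k), -float(k)])  # toy 2-dimensional embeddings
```
+
+    After the five appends, the three stored rows are the three most recent embeddings, held in a
+physical order that differs from their temporal order.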
+    This procedure is analogous to key-value (KV) caching in standard transformer inference [33]. Context
+tokens x(1) , . . . , x(L−1) produce key and value vectors k(1) , . . . , k(L−1) and v(1) , . . . , v(L−1) respectively.
+When new token x(L) is generated, its corresponding k(L) and v(L) vectors are appended to the cache,
+allowing all previous k(<L) and v(<L) to be reused without recomputation. When the key and value
+matrices are tied so that k(A) = v(A) , the ET’s row-append operation is equivalent to the standard KV-
+cache update. The ET performs an autoregressive rollout that reproduces the same recurrence structure
+as KV-cached transformer inference, but implemented physically through the addition of new neurons
+and weights without touching existing hardware. For a formal derivation of the equivalence between ET
+attention and conventional attention with tied keys and values, see [12].
+
+
+7     Scaling properties
+Inference time and energy consumption are crucial characteristics of our system. This section investigates
+these metrics with respect to the network size.
+
+7.1      Inference time scaling
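+We consider the model defined by (4) and (5). In the adiabatic limit (τh → 0), which is satisfied by our
+hardware implementation, the time derivative of the energy can be written as
+
+    \[
+    \frac{dE^{\mathrm{eff}}}{dt}
+        = \sum_{i=1}^{N_v} \frac{\partial E^{\mathrm{eff}}}{\partial v_i} \frac{dv_i}{dt}
+        = -\frac{1}{\tau_v} \sum_{i=1}^{N_v} \left( \frac{\partial E^{\mathrm{eff}}}{\partial v_i} \right)^2
+        \sim -\frac{N_v}{\tau_v}
+    \tag{12}
+    \]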
+
+This derivative is never positive, since the dynamical system performs gradient descent on the
+energy landscape. It vanishes once the network state vector v converges to the
+steady state. Since the state vector v is typically initialized in the vicinity of the memory vectors, which
+are chosen to be of order one (∼ 1), the right-hand side of (4) is of order one too, independent of the
+network size. This results in the characteristic value of the temporal derivative shown in (12).
+    At the same time, the typical value (see footnote 1) of the energy (5) is
+
+    \[ |E^{\mathrm{eff}}| \sim N_v + \frac{1}{\beta} \log(N_h) \tag{13} \]
+During the inference dynamics the network is initialized in a high energy state, which has the
+characteristic value of energy (13), and performs energy descent to a lower value of the energy (which has a
+similar order of magnitude). To estimate the scaling of the time required to perform this energy
+descent, one can take the ratio of the energy drop to the rate of the energy decrease (12). This gives the
+following estimate
+
+    \[
+    T^{\mathrm{conv}} \sim \frac{|E^{\mathrm{eff}}|}{\left| dE^{\mathrm{eff}} / dt \right|}
+        \sim \tau_v \left( 1 + \frac{1}{\beta} \frac{\log(N_h)}{N_v} \right)
+        \sim \tau_v
+    \tag{14}
+    \]
+
+The last ∼ sign holds since in none of the designs presented here does Nh grow super-exponentially in
+Nv. In fact, in all the use cases Nh is sub-exponential in Nv.
+   Footnote 1: We estimate the absolute value of the energy, since it can be both positive and negative
+depending on the mutual arrangement of memories, the state vector, and the number of hidden units.
+
+    This back-of-the-envelope estimate provides the core intuition behind the scaling relationship:
+the inference time is constant, independent of the size of the network. A more careful analysis
+(Appendix E) shows that in the high-β regime the worst-case dependence is O((τv /β) log(Nh)/Nv), which
+remains bounded for all architectures we consider. Thus, for our settings the convergence time is
+effectively constant in Nv and Nh. Based on amplifier gain–bandwidth, slew-rate, and output-current
+constraints, we estimate achievable inference times of tens to hundreds of nanoseconds using existing
+CMOS technology (see Appendix I.2).
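+
+    To make the estimate (14) concrete, the short sketch below evaluates the ratio T^conv/τv =
+1 + log(Nh)/(β Nv) for a few network sizes. The sizes and the choice β = 1 are illustrative
+assumptions, and the function name is hypothetical.
+
```python
import math


def conv_time_ratio(n_visible, n_hidden, beta=1.0):
    """Estimate of T_conv / tau_v from equation (14)."""
    return 1.0 + math.log(n_hidden) / (beta * n_visible)


# Illustrative (N_v, N_h) pairs; the first matches the Hamming (7,4) network.
sizes = [(7, 16), (64, 4096), (1024, 2 ** 20)]
ratios = [conv_time_ratio(n_v, n_h) for n_v, n_h in sizes]

# Even when N_h grows exponentially, N_h = 2**N_v, the ratio stays at 1 + ln(2)/beta.
exponential_case = conv_time_ratio(10, 2 ** 10)
```
+
+    For polynomially growing Nh the correction term shrinks as the network grows, so the ratio
+approaches one.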
+
+7.2    Scaling of energy consumption
+We now analyze how the total inference energy scales with network size. Energy dissipation arises
+primarily from (i) Ohmic loss in the resistive weights, (ii) charging of neuron-state capacitors, and (iii)
+constant per-neuron overhead from amplifiers and bias currents. We show that, under bounded voltage
+swings and fixed conductance budgets, total energy grows only linearly with the number of neurons.
+
+Weight dissipation. Let the neuron output voltages be proportional to activations: u = κg and
+w = κf , where κ is a fixed voltage swing. Such a bounded swing can always be enforced by global
+rescaling of ξ, β, and voltage units without changing the dynamics (see Appendix F). The instantaneous
+power dissipated by the resistive crossbar array is
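+
+    \[
+    P_{\mathrm{weights}}(t) = \sum_{\mu=1}^{N_h} \sum_{i=1}^{N_v} \xi_{\mu i} \left( u_i - w_\mu \right)^2
+    \tag{15}
+    \]
+
+With 0 ≤ gi ≤ 1, f given by a softmax, and row/column conductance budgets
+\sum_\mu \xi_{\mu i} \le C_c and \sum_i \xi_{\mu i} \le C_r, the total power obeys
+
+    \[
+    P_{\mathrm{weights}}(t) \le 2 \kappa^2 \left( C_c N_v + C_r \right) = O(N_v)
+    \tag{16}
+    \]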
+
+For a runtime of duration T ∼ T conv , the energy dissipated by the weights is therefore Eweights = O(Nv T ),
+where T ∼ 1 from subsection 7.1.
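+
+    The bound (16) can be sanity-checked numerically. The sketch below draws random non-negative
+conductances, evaluates the dissipation (15), and compares it with the right-hand side of (16), taking
+C_c and C_r to be the largest column and row sums of the drawn matrix. The sizes, the value of κ, and
+the random draws are illustrative assumptions.
+
```python
import math
import random


def weight_power(xi, g, f, kappa):
    """Crossbar dissipation of equation (15) with u = kappa * g and w = kappa * f."""
    return sum(xi[m][i] * (kappa * g[i] - kappa * f[m]) ** 2
               for m in range(len(f)) for i in range(len(g)))


def power_bound(xi, kappa):
    """Right-hand side of equation (16) using the tightest budgets for this xi."""
    n_h, n_v = len(xi), len(xi[0])
    c_col = max(sum(xi[m][i] for m in range(n_h)) for i in range(n_v))  # sum over mu
    c_row = max(sum(row) for row in xi)                                 # sum over i
    return 2.0 * kappa ** 2 * (c_col * n_v + c_row)


rng = random.Random(0)
n_h, n_v, kappa = 6, 4, 0.5  # toy sizes and voltage swing
xi = [[rng.random() for _ in range(n_v)] for _ in range(n_h)]
g = [rng.random() for _ in range(n_v)]           # visible activations in [0, 1]
scores = [rng.uniform(-2.0, 2.0) for _ in range(n_h)]
z = sum(math.exp(s) for s in scores)
f = [math.exp(s) / z for s in scores]            # softmax hidden activations
power = weight_power(xi, g, f, kappa)
bound = power_bound(xi, kappa)
```
+
+    The inequality follows from (u_i − w_µ)² ≤ 2(u_i² + w_µ²) together with g_i² ≤ 1 and
+\sum_µ f_µ² ≤ \sum_µ f_µ = 1 for a softmax.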
+
+Capacitive and overhead energy. Each neuron charges a local capacitor a finite number of times
+by at most Vswing ∼ κ, giving
+
+    \[
+    E_{\mathrm{cap}} \le \kappa^2 \left( \sum_i C_i^{(v)} + \sum_\mu C_\mu^{(h)} \right) = O(N_v + N_h)
+    \tag{17}
+    \]
+
+Active bias and amplifier inefficiencies contribute fixed per-neuron power, yielding Eother = O((Nv + Nh )T ).
+
+Total energy scaling.      With bounded voltage swing and conductance budgets,
+
+                                           Etotal = O(Nv + Nh )                                                  (18)
+
+Hence, the total inference energy scales only linearly with system size. For the full derivation, see
+Appendix G.
+
+7.3    Scaling of hardware area
+The area is dominated by two components: the area taken up by the synaptic weights, which are
+implemented as a crossbar array with programmable weights, and the area taken up by the neurons feeding
+the crossbar array. The area of the crossbar array scales as the number of weights, O(Nv Nh). The area
+of the neurons scales as O(Nv + Nh).
+
+
+8     Conclusion
+In this paper, we have presented an analog accelerator architecture for Dense Associative Memories,
+implemented using resistive crossbar arrays and continuous-time RC neuron dynamics. Our design
+implements DenseAM inference as time evolution of a physical dynamical system, rather than a sequence of
+discrete numerical update steps. We demonstrated this architecture with three representative settings of
+increasing complexity: XOR, Hamming (7,4) error decoding, and an Energy Transformer-style sequence
+model. These examples show that the analog DenseAM accelerator architecture covers both associative
+memory tasks and attention-based sequence models.
+    Our analysis shows that DenseAM accelerators enjoy favorable asymptotic scaling properties.
+Inference time is constant in model size, meaning that it is governed
+primarily by the physical time constants of the circuit. This is in sharp contrast to digital
+implementations of the same dynamics, whose runtime must grow at least linearly with model size.
+    To assess hardware feasibility, we derived lower bounds on the neuronal time constants imposed by
+amplifier gain-bandwidth product, slew rate, and output current limits in our neuron design. Reported
+figures from representative CMOS OTAs in the literature give inference times on the order of tens to
+hundreds of nanoseconds, even with conservative design margins. Combined with the constant scaling of
+inference time with model size, these estimates suggest that DenseAM accelerators can match or exceed the
+latency of digital GPUs as models grow, without requiring exotic devices or beyond-CMOS technologies.
+    Our results highlight DenseAMs as a natural abstraction for analog AI hardware. Their
+error-correcting dynamics and asymptotic stability directly address long-standing concerns about robustness and
+readout timing: small perturbations are corrected by the dynamics instead of accumulating, and the final
+state is stable when readout happens over a wide temporal window. At the same time, the DenseAM
+framework is expressive enough to capture modern primitives such as attention and transformer-like
+architectures, as illustrated by our Analog Energy Transformer construction. These properties suggest that
+DenseAM-based analog accelerators may be a promising substrate for future AI systems, and motivate
+further co-design of models, dynamics, and devices.
+
+Acknowledgements
+MGB would like to thank Faiz Muhammad for exploratory attempts at SPICE simulations. DK would
+like to thank Kwabena Boahen for helpful discussions.
+
+
+References
+ [1]   Ashish Vaswani et al. “Attention is all you need”. In: arXiv preprint arXiv:1706.03762 (2017).
+ [2]   Jascha Sohl-Dickstein et al. “Deep unsupervised learning using nonequilibrium thermodynamics”.
+       In: International conference on machine learning. pmlr. 2015, pp. 2256–2265.
+ [3]   Norman P Jouppi et al. “In-datacenter performance analysis of a tensor processing unit”. In:
+       Proceedings of the 44th annual international symposium on computer architecture. 2017, pp. 1–12.
+ [4]   Eric Masanet et al. “Recalibrating global data center energy-use estimates”. In: Science 367.6481
+       (2020), pp. 984–986.
+ [5]   David Patterson et al. “Carbon emissions and large neural network training”. In: arXiv preprint
+       arXiv:2104.10350 (2021).
+ [6]   Maxwell Aifer et al. “Solving the compute crisis with physics-based ASICs”. In: arXiv preprint
+       arXiv:2507.10463 (2025).
+ [7]   Dmitry Krotov and John J Hopfield. “Dense associative memory for pattern recognition”. In:
+       Advances in neural information processing systems 29 (2016).
+ [8]   Dmitry Krotov and John Hopfield. “Dense associative memory is robust to adversarial inputs”. In:
+       Neural computation 30.12 (2018), pp. 3151–3167.
+ [9]   John J Hopfield. “Neural networks and physical systems with emergent collective computational
+       abilities.” In: Proceedings of the national academy of sciences 79.8 (1982), pp. 2554–2558.
+[10]   Dmitry Krotov and John J Hopfield. “Large Associative Memory Problem in Neurobiology and
+       Machine Learning”. In: International Conference on Learning Representations. 2021.
+[11]   Hubert Ramsauer et al. “Hopfield networks is all you need”. In: arXiv preprint arXiv:2008.02217
+       (2020).
+[12]   Benjamin Hoover et al. “Energy transformer”. In: Advances in Neural Information Processing
+       Systems 36 (2024).
+[13]   Benjamin Hoover et al. “Memory in plain sight: A survey of the uncanny resemblances between
+       diffusion models and associative memories”. In: arXiv preprint arXiv:2309.16750 (2023).
+[14]   Luca Ambrogioni. “In search of dispersed memories: Generative diffusion models are associative
+       memory networks”. In: arXiv preprint arXiv:2309.17290 (2023).
+[15]   Bao Pham et al. “Memorization to generalization: Emergence of diffusion models from associative
+       memory”. In: arXiv preprint arXiv:2505.21777 (2025).
+[16]   Dmitry Krotov et al. “Modern methods in associative memory”. In: arXiv preprint arXiv:2507.06211
+       (2025).
+[17]   JJ Hopfield. “The effectiveness of analogue ‘neural network’ hardware”. In: Network: Computation
+       in Neural Systems 1.1 (1990), p. 27.
+[18]   Dmitry Krotov. “Hierarchical associative memory”. In: arXiv preprint arXiv:2107.06446 (2021).
+[19]   Fei Tang and Michael Kopp. “A remark on a paper of krotov and hopfield [arxiv: 2008.06996]”. In:
+       arXiv preprint arXiv:2105.15034 (2021).
+[20]   Benjamin Hoover et al. “A universal abstraction for hierarchical hopfield networks”. In: The Sym-
+       biosis of Deep Learning and Differential Equations II. 2022.
+[21]   John J Hopfield. “Neurons with graded response have collective computational properties like those
+       of two-state neurons.” In: Proceedings of the national academy of sciences 81.10 (1984), pp. 3088–
+       3092.
+[22]   David W Tank and John J Hopfield. “Simple “Neural” optimization networks: an A/D converter,
+       signal decision circuit, and a linear programming circuit”. In: Artificial neural networks: theoretical
+       concepts. 1988, pp. 87–95.
+[23]   HP Graf et al. “VLSI implementation of a neural network memory with several hundreds of neu-
+       rons”. In: AIP conference proceedings. Vol. 151. 1. American Institute of Physics. 1986, pp. 182–
+       187.
+[24]   Xinjie Guo et al. “Modeling and experimental demonstration of a Hopfield network analog-to-
+       digital converter with hybrid CMOS/memristor circuits”. In: Frontiers in neuroscience 9 (2015),
+       p. 488.
+[25]   SG Hu et al. “Associative memory realized by a reconfigurable memristive Hopfield neural net-
+       work”. In: Nature communications 6.1 (2015), p. 7522.
+[26]   Sukru B Eryilmaz et al. “Brain-like associative learning using a nanoscale non-volatile phase change
+       synaptic device array”. In: Frontiers in neuroscience 8 (2014), p. 205.
+[27]   Brendan P Marsh et al. “Enhancing associative memory recall and storage capacity using confocal
+       cavity QED”. In: Physical Review X 11.2 (2021), p. 021048.
+[28]   Khalid Musa et al. “Dense Associative Memory in a Nonlinear Optical Hopfield Neural Network”.
+       In: arXiv preprint arXiv:2506.07849 (2025).
+[29]   Carver Mead and Mohammed Ismail. Analog VLSI implementation of neural systems. Vol. 80.
+       Springer Science & Business Media, 2012.
+[30]   Richard W Hamming. “Error detecting and error correcting codes”. In: The Bell system technical
+       journal 29.2 (1950), pp. 147–160.
+[31]   Paul J Werbos. “Backpropagation through time: what it does and how to do it”. In: Proceedings
+       of the IEEE 78.10 (1990), pp. 1550–1560.
+[32]   Patrick Kidger. “On Neural Differential Equations”. PhD thesis. University of Oxford, 2021.
+[33]   Zihang Dai et al. “Transformer-xl: Attentive language models beyond a fixed-length context”.
+       In: Proceedings of the 57th annual meeting of the association for computational linguistics. 2019,
+       pp. 2978–2988.
+[34]   Jacob Sillman. “Analog Implementation of the Softmax Function”. In: arXiv preprint arXiv:2305.13649
+       (2023).
+[35]   John J Hopfield and David W Tank. “Computing with neural circuits: A model”. In: Science
+       233.4764 (1986), pp. 625–633.
+[36]   Aldo Pena Perez and Franco Maloberti. “Performance enhanced op-amp for 65nm CMOS tech-
+       nologies and below”. In: 2012 IEEE International Symposium on Circuits and Systems (ISCAS).
+       IEEE. 2012, pp. 201–204.
+
+
+                                  Figure 7: Circuit for a single neuron.
+
+
+[37]   Rida S Assaad and Jose Silva-Martinez. “The recycling folded cascode: A general enhancement of
+       the folded cascode amplifier”. In: IEEE Journal of Solid-State Circuits 44.9 (2009), pp. 2535–2542.
+[38]   Alec Yen and Benjamin J Blalock. “A High Slew Rate, Low Power, Compact Operational Ampli-
+       fier Based on the Super-Class AB Recycling Folded Cascode”. In: 2020 IEEE 63rd International
+       Midwest Symposium on Circuits and Systems (MWSCAS). IEEE. 2020, pp. 9–12.
+[39]   Mohammad H Naderi, Suraj Prakash, and Jose Silva-Martinez. “Operational transconductance
+       amplifier with class-B slew-rate boosting for fast high-performance switched-capacitor circuits”.
+       In: IEEE Transactions on Circuits and Systems I: Regular Papers 65.11 (2018), pp. 3769–3779.
+[40]   Franz Schlögl and Horst Zimmermann. “A design example of a 65 nm CMOS operational amplifier”.
+       In: International Journal of Circuit Theory and Applications 35.3 (2007), pp. 343–354.
+
+
+A      Neuron Design
+Figure 7 shows the circuit design of a single neuron, with labels corresponding to this being a hidden
+neuron at index µ. We derive the dynamics of the neuron internal state hµ and activation output voltage
+fµ . We proceed using only Kirchhoff’s Current Law (KCL) and the definition of an ideal op-amp.
+
+Assumptions and conventions.
+    • Ideal op-amps: infinite open-loop gain, infinite input impedance (no input current), zero output
+      impedance. Under stable negative feedback this enforces a virtual short V+ = V− .
+    • Current Jµ : we define Jµ as the current which flows from fµ to mµ through R1 .
+
+    • Op-amp input labels: We denote the inverting and noninverting inputs of each op-amp explicitly,
+      e.g. U 2− for the inverting input of U2, U 3+ for the noninverting input of U3, etc.
+    • Node labels: Label mµ as the output of U1, sµ as the output of U2, and dµ as the output of U3.
+      The neuron pre-activation state is labeled hµ , and the post-activation state is labeled fµ . Voltage
+      bµ (as an ideal voltage source) drives the bias for this neuron. Voltages hµ , bµ , and fµ correspond
+      directly to the state variables in equation (1).
+
+Block U1: buffer of activation voltage fµ. Op-amp U1 buffers the output of the activation function
+f(·) and drives the output of the neuron, fµ. Because no current can flow into U1−, all the current
+flowing into this neuron must flow through R1 to mµ and is sourced or sunk by U1’s output node.
+
+Block U2: non-inverting stage producing sµ from fµ. The positive input of U2 is
+U2+ = fµ, and by U2’s virtual short, the negative input is U2− = U2+ = fµ. By KCL at U2−,
+\[ \frac{U_{2-}}{R_{10}} = \frac{s_\mu - U_{2-}}{R_9} \;\Rightarrow\; s_\mu = \left(1 + \frac{R_9}{R_{10}}\right) f_\mu \tag{19} \]
+
+Block U3: non-inverting stage producing dµ from sµ, bµ, and mµ. By KCL at the positive input
+of U3,
+\[ \frac{b_\mu - U_{3+}}{R_3} + \frac{s_\mu - U_{3+}}{R_4} = \frac{U_{3+}}{R_5} \;\Rightarrow\; U_{3+} = \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} \tag{20} \]
+KCL at the negative input of U3 gives us
+\[ \frac{m_\mu - U_{3-}}{R_6} + \frac{-U_{3-}}{R_7} = \frac{U_{3-} - d_\mu}{R_8} \;\Rightarrow\; d_\mu = U_{3-}\left(1 + R_8\left(\frac{1}{R_6} + \frac{1}{R_7}\right)\right) - \frac{R_8 m_\mu}{R_6} \tag{21} \]
+The virtual short of U3 means U3− = U3+. Combining equations (20) and (21), we get
+\[ d_\mu = \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} m_\mu \tag{22} \]
+
+Dynamics of RC circuit. R2 and C1 form an RC circuit driven by voltage dµ. The voltage across
+the capacitor, hµ, follows the relation
+\[ \begin{aligned}
+R_2 C_1 \frac{dh_\mu}{dt} &= -h_\mu + d_\mu \\
+&= -h_\mu + \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} m_\mu
+\end{aligned} \tag{23} \]
+With incoming current. Take the incoming current $J_\mu = \sum_i \xi_{\mu i} (g_i - f_\mu)$. This produces a voltage
+drop across R1 such that $m_\mu = f_\mu - R_1 J_\mu = f_\mu - R_1 \sum_i \xi_{\mu i} (g_i - f_\mu)$. Then, the dynamics of hµ from
+equation (23) are
+\[ R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} \left( f_\mu - R_1 J_\mu \right) \tag{24} \]
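+Substituting in sµ from equation (19) and Jµ:
+\[ R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 \left(1 + \frac{R_9}{R_{10}}\right) f_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} \left( f_\mu - R_1 \sum_i \xi_{\mu i} (g_i - f_\mu) \right) \tag{25} \]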
+
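+Equal-resistance special case. Set R1 = R3 = R4 = R5 = R6 = R7 = R8, and express the weights ξµi in units
+of 1/R1 so that the dimensionless product R1 ξµi is written simply as ξµi. Then, equation (25)
+reduces to
+\[ R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + b_\mu + \frac{R_9}{R_{10}} f_\mu + \sum_i \xi_{\mu i} (g_i - f_\mu) \tag{26} \]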
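+
+The reduction from (25) to (26) can be checked factor by factor. A short sketch, writing R for the common value of R3 through R8 (the symbol R is introduced here only for illustration):
+
+```latex
+\begin{align*}
+\frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} &= \frac{R^2 + 2R^2}{R^2} = 3, \\
+\frac{R_4 R_5 b_\mu + R_3 R_5 \left(1 + \frac{R_9}{R_{10}}\right) f_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4}
+  &= \frac{1}{3}\left[ b_\mu + \left(1 + \frac{R_9}{R_{10}}\right) f_\mu \right], \\
+\frac{R_8}{R_6}\Big( f_\mu - R_1 \sum_i \xi_{\mu i}(g_i - f_\mu) \Big)
+  &= f_\mu - R_1 \sum_i \xi_{\mu i}(g_i - f_\mu).
+\end{align*}
+```
+
+The product of the first two factors is bµ + (1 + R9/R10) fµ, and subtracting the third line leaves
+bµ + (R9/R10) fµ + R1 Σ_i ξµi (gi − fµ). This agrees with (26) when the products R1 ξµi are written as ξµi,
+i.e. when the weights are measured in units of 1/R1.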
+
+
+Selection of the R9/R10 self-term gain. Evidently, in order to match the form of equation (1), we need
+to cancel the $-f_\mu \sum_i \xi_{\mu i}$ term that appears on the right-hand side of equation (26). The R9/R10 term
+allows us to do that by setting
+\[ \frac{R_9}{R_{10}} = \sum_i \xi_{\mu i} \tag{27} \]
+
+Taking equation (27)’s assignment to R9 and R10 simplifies equation (26) into
+\[ R_2 C_1 \frac{dh_\mu}{dt} = \sum_i \xi_{\mu i} g_i - h_\mu + b_\mu \tag{28} \]
+which exactly matches our desired dynamics.
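+
+The cancellation can be written out explicitly by splitting the sum in (26) and substituting (27):
+
+```latex
+\begin{align*}
+\frac{R_9}{R_{10}} f_\mu + \sum_i \xi_{\mu i} (g_i - f_\mu)
+  &= \frac{R_9}{R_{10}} f_\mu + \sum_i \xi_{\mu i} g_i - f_\mu \sum_i \xi_{\mu i} \\
+  &= \sum_i \xi_{\mu i} g_i + \Big( \frac{R_9}{R_{10}} - \sum_i \xi_{\mu i} \Big) f_\mu
+   = \sum_i \xi_{\mu i} g_i .
+\end{align*}
+```
+
+The self-term therefore vanishes for any value of fµ, and the remaining right-hand side is the one in (28).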
+
+
+Figure 8: Crossbar Array. Each pentagon contains a neuron of the design in Figure 7. In this layout we
+have flipped the crossbar array, so that index µ runs horizontally and index i runs vertically.
+
+
+A.1     Activation function
+The voltage across C1 gives us the dynamics of the neuron internal state hµ . Figure 7 contains a block
+representing a nonlinear amplifier, denoted f (·), whose input is hµ and whose output is fµ = f (hµ ). This
+voltage is buffered with U1 onto the neuron output line, labeled fµ , which is what other neurons “see”
+in the crossbar array. The chosen activation function does not affect the rest of the dynamics of the
+neuron. Particularly, the activation function need not be element-wise: a vector-wise activation function
+like softmax can be readily applied instead.
+
+A.2     Neurons interacting in a network
+So far we have examined the dynamics of a single neuron, treating as an assumption that the neuron will
+receive an incoming current $J_\mu = \sum_i \xi_{\mu i} (g_i - f_\mu)$. Now, we will show how to wire these neurons together
+to realize this. Figure 8 shows the simplest DenseAM construction where each pentagonal node is a
+circuit of the design in Figure 7. Each neuron exposes a single node whose voltage is driven at the activation
+of the neuron, and which accepts an incoming current which it uses to drive its dynamics. Each hidden
+neuron fµ is connected to a visible neuron gi via a resistance $R_{\mu i} = 1/\xi_{\mu i}$ that is the inverse of the weight
+it represents. The current flowing into node fµ is $J_\mu = \sum_i \frac{1}{R_{\mu i}} (g_i - f_\mu)$, which is the assumption needed
+for equation (24). This same analysis holds for other hidden and visible neurons, and so together they
+realize the large dynamical system of equation (1).
+
+A.3     SPICE Netlist
+Following is the SPICE netlist for the single neuron circuit, using ideal op-amps. Component values are
+omitted for brevity. There is no nonlinearity here; adding one would be a matter of inserting a nonlinear
+amplifier between node h_µ and XU1’s positive terminal.
+R1 f_µ m_µ
+XU1 f_µ h_µ m_µ opamp Aol=100K GBW=10Meg
+XU2 u2- f_µ s_µ opamp Aol=100K GBW=10Meg
+R2 u2- 0
+R3 s_µ u2-
+R4 u3+ s_µ
+R5 u3+ 0
+XU3 u3- u3+ d_µ opamp Aol=100K GBW=10Meg
+R6 u3- m_µ
+R7 d_µ u3-
+R8 d_µ h_µ
+C1 h_µ 0
+V§b_µ N001 0
+R9 u3+ N001
+R10 u3- 0
+
+
+                                           Figure 9: Softmax circuit design
+
+
+
+
+B      Softmax Circuit
+For demonstration purposes, we follow the construction of an analog softmax circuit using bipolar junction
+transistors (BJTs) described in [34]. Figure 9 shows the design of a four-way softmax circuit using
+BJTs. The softmax function we aim to produce is:
+\[ \mathrm{softmax}_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \qquad i = 1, \dots, N \tag{29} \]
+
+   For the µth BJT in the circuit, the collector current IC,µ can be expressed in terms of the base voltage
+hµ and the emitter voltage VE when in the forward-active mode as:
+\[ I_{C,\mu} = I_S e^{V_{BE,\mu}/V_T}, \qquad V_{BE,\mu} = h_\mu - V_E \;\Rightarrow\; I_{C,\mu} = I_S e^{(h_\mu - V_E)/V_T} \tag{30} \]
+where IS is the BJT’s saturation current and VT is the thermal voltage. Assuming large BJT β (note:
+this β is unrelated to the softmax β)², we can neglect base currents, so that IC,µ = IE,µ. Applying KCL at
+the shared emitter node VE, the total current is $I_{EE} = \sum_{\mu=1}^{N_h} I_{C,\mu}$. We can expand the expression for the
+collector currents to get the currents in terms of node voltages:
+\[ I_{EE} = \sum_{\mu=1}^{N_h} I_S e^{(h_\mu - V_E)/V_T} = \sum_{\mu=1}^{N_h} \frac{I_S e^{h_\mu/V_T}}{e^{V_E/V_T}} \tag{31} \]
+
+Simultaneously, the current IEE is also fixed by the ideal current source, so IC,µ can also be expressed
+as its share of the total current multiplied by the total current: $I_{C,\mu} = \frac{I_{C,\mu}}{I_{EE}} I_{EE}$. Plugging in (30) for IC,µ
+in the numerator and (31) for IEE in the denominator and canceling the term containing VE,
+\[ I_{C,\mu} = \frac{e^{h_\mu/V_T}}{\sum_{j=1}^{N_h} e^{h_j/V_T}} I_{EE} \tag{32} \]
+
+This already looks very much like the ideal softmax function. The voltage at node fµ is created by
+current flowing through resistor Rµ, producing a voltage drop relative to VCC. Specifically, the voltage is
+$f_\mu = V_{CC} - \frac{e^{h_\mu/V_T}}{\sum_{j=1}^{N_h} e^{h_j/V_T}} I_{EE} R_\mu$. When IEE Rµ = 1 V, this voltage fµ is a negated and shifted softmax
+spanning a range of 1 volt. This scale and negation can be easily corrected with an op amp, which is also needed
+to isolate the node and prevent loading. Note that VCC must be chosen to be a positive supply in order
+for the BJTs to remain in the forward-active mode.
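+
+The cancellation of VE that leads to (32) can be made explicit by dividing (30) by (31):
+
+```latex
+\[
+\frac{I_{C,\mu}}{I_{EE}}
+  = \frac{I_S\, e^{h_\mu/V_T}\, e^{-V_E/V_T}}{\sum_{j=1}^{N_h} I_S\, e^{h_j/V_T}\, e^{-V_E/V_T}}
+  = \frac{e^{h_\mu/V_T}}{\sum_{j=1}^{N_h} e^{h_j/V_T}} .
+\]
+```
+
+The common factor IS e^(−VE/VT) drops out. This step assumes matched transistors (identical IS and VT), as the
+single symbol IS in (30) and (31) already implies. The result is a softmax over the base voltages with inverse temperature 1/VT.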
+  ² In BJTs, β denotes the ratio of the collector current to the base current. High BJT β indicates the transistor is able to
+amplify a small base current into a much larger collector current, allowing the BJT to function as an amplifier or switch.
+A high β reflects that the BJT can efficiently transmit carriers from emitter to collector, without losing them to the base.
+
+
+                                       Parameter                Value       Unit
+                                       RF                         1000       Ω
+                                       RT                            1       Ω
+                                       R1                            1       Ω
+                                       R2, R3, ..., R8          10 000       Ω
+                                       RS                           40       Ω
+                                       C                            10       µF
+                                       a3                            0       V
+                                       b1                            0       V
+                                       b2                           −1       V
+                                       b3                           −1       V
+                                       b4                           −1       V
+
+                                Table 2: Component and parameter values.
+
+
+C     XOR DenseAM Circuit
+Figure 10 is a full circuit diagram of the DenseAM that solves the XOR problem. Given input voltages
+V1, V2 ∈ {0, 1}, the output voltage at g3 is the result of the XOR operation between V1 and V2. In
+this model, the visible neuron is linear, and the hidden neurons share a softmax activation function
+implemented by a set of bipolar junction transistors. Table 2 lists the component values used in simulation.
+
+
+Visible neurons. In the XOR task, only one visible neuron is left evolving, corresponding to the output
+column of the truth table. As such, the first two neurons are clamped to the input voltages, represented
+by V1 and V2. The third visible neuron, highlighted in blue, is a linear unit with no nonlinear activation:
+the internal state voltage v3 directly drives the output, setting g3 = v3 . This is the same circuit described
+in Appendix A, except that the activation block is omitted.
+
+Hidden neurons. The XOR task requires four hidden neurons, highlighted in green. These are iden-
+tical circuit constructions with the exception of the voltage sources bµ for the biases, which are set
+according to the values in Table 2. Unlike the visible neuron, the hidden neurons have a softmax activa-
+tion function, such that fµ = softmaxµ (h).
+
+Softmax activation function. The red highlight marks the same softmax circuit described in Appendix B,
+composed of BJTs, resistors, a voltage source for VCC, and a current source for IEE. We
+use 2N5088 transistors in our model, a standard and widely available BJT. Noninverting
+buffers (U10, U11, etc.) are used to prevent loading effects on the state capacitors Cµ from the current drawn
+by the BJT bases in forward-active mode. As discussed in Appendix B, the softmax circuit itself produces
+an output voltage of
+\[ V_{CC} - \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \qquad i = 1, \dots, N \]
+
+When VCC = 5 V as in this circuit, this requires extra circuitry, highlighted in yellow, to shift and negate
+the softmax output. This is done by first buffering the voltage output to prevent loading effects, followed
+by a summing op amp that subtracts VCC and inverts the result. For the first hidden neuron
+h1 (lower left of figure), op-amp U2 buffers the voltage output, while U1 is configured in an inverting
+summing configuration to add −5 V (the negative of VCC) to the buffered voltage output, producing the
+correct softmax output.
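+
+As a worked check of the shift-and-negate stage, assume a unity-gain inverting summer and IEE Rµ = 1 V as in
+Appendix B. The unity gain and the symbol V_raw for the buffered softmax-circuit output are assumptions made here
+for illustration and are not read off the schematic:
+
+```latex
+\begin{align*}
+V_{\mathrm{raw},\mu} &= V_{CC} - \mathrm{softmax}_\mu(h) \cdot 1\,\mathrm{V}, \\
+f_\mu &= -\big( V_{\mathrm{raw},\mu} + (-V_{CC}) \big) = V_{CC} - V_{\mathrm{raw},\mu}
+       = \mathrm{softmax}_\mu(h) \cdot 1\,\mathrm{V}.
+\end{align*}
+```
+
+For example, with VCC = 5 V and four equal hidden states, each softmax value is 0.25, so V_raw = 4.75 V and the
+corrected output is 0.25 V.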
+
+     Figure 10: Full schematic for XOR DenseAM built with 1 evolving linear visible neuron and 4 hidden neurons with softmax activation. Blue: visible neuron.
+     Green: hidden neurons. Yellow: buffers for softmax activation circuit. Red: analog softmax circuit.
+
+Weight matrix. Resistors R1–R12 implement the weight matrix
+ξ. These are set directly according to the XOR truth table, where each row corresponds to one hidden
+neuron. A boolean value of 1 (RT) is set to be a high conductance (1 Ω, i.e. 1 S), while a boolean value of 0 (RF)
+is set to be a relatively small conductance (1 kΩ, i.e. 1 mS).
+    The resistor ratio that sets the gain si/gi (R9/R10 in Appendix A) is set to the sum of the conductances in that
+neuron’s crossbar column. The column for neuron 1 has three RF resistances, whose conductances sum to 3 × 10⁻³ S. Hence,
+neuron 1’s R47/R46 = 3/1000. The crossbar columns for neurons 2, 3, and 4 have two RT resistances
+and one RF resistance, whose conductances sum to approximately 2 S. Hence, we approximate R59/R56 = 2000/1000,
+and similarly for hidden neurons 3 and 4.
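+
+The two gain settings follow from summing conductances with the Table 2 values RT = 1 Ω and RF = 1000 Ω. A short
+worked version, with the sums in siemens:
+
+```latex
+\begin{align*}
+\text{hidden neuron 1:}\quad & 3 \cdot \frac{1}{R_F} = \frac{3}{1000\,\Omega} = 3 \times 10^{-3}\,\mathrm{S}, \\
+\text{hidden neurons 2--4:}\quad & 2 \cdot \frac{1}{R_T} + \frac{1}{R_F}
+  = 2\,\mathrm{S} + 10^{-3}\,\mathrm{S} = 2.001\,\mathrm{S} \approx 2\,\mathrm{S}.
+\end{align*}
+```
+
+Rounding 2.001 to 2 introduces a relative error of about 0.05%.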
+
+
+D     Design and implementation variations
+A large design space remains open across analog electronics and other substrates for realizing DenseAMs,
+with clear speed–energy–area–precision trade-offs. In electronics, the core primitives admit multiple re-
+alizations: passive, nonvolatile weights (e.g., memristors, triode-region or floating-gate transistors, and
+other programmable conductors); active, gained weights via OTAs; and nonlinearities via diode clamps,
+reverse-biased diode/BJT exponentials, MOS quadratic regions, or translinear blocks. Architectures in
+the spirit of [35, 23] are compact but couple synaptic values to neuronal time constants, making dynamics
+drift when a single weight changes—problematic for learning and consistent timing—whereas our decou-
+pled neuron preserves a fixed time constant under weight updates. Simpler neuron/network topologies
+likely exist and can be attractive in resource-constrained regimes, provided their deviations from the
+target ODEs are validated not to degrade performance. Beyond CMOS, photonics (e.g., overdamped,
+low-Q microring resonators) can naturally implement first-order ODEs and can offer extreme bandwidth
+with distinct calibration and noise constraints. Across these options, open problems include robust
+weight storage/programmability and drift control, mixed-signal learning rules compatible with device
+limits, scaling under current/GBW/SR constraints, tolerance to mismatch/noise, and algorithm–circuit
+co-design to exploit substrate-specific advantages.
+
+
+E     Scaling of inference time
+There are two conditions under which inference times should be studied, dependent on the softmax
+temperature β. In the low-β regime, the DenseAM reaches equilibria with multiple hidden neurons
+“competing” in the softmax, while in the high-β regime, the DenseAM reaches equilibria with only one
+hidden neuron “winning out” in the softmax. Intuitively, the high-β regime corresponds to exact memory
+recall, while the low-β regime corresponds to interpolation. The XOR and Hamming (7,4) code are in
+the high-β regime, while the energy transformer lies in the low-β regime. In both regimes, we find that
+the DenseAM converges in time that is constant with respect to the number of neurons.
+
+Assumptions.
+(A1) There is a per-synapse device limit of 0 ≤ ξµi ≤ Gmax, where Gmax is the maximum conductance
+    set by the physics of the crossbar crosspoints. Because f is the output of a softmax, fµ ≥ 0 and Σ_µ fµ = 1,
+    so the weighted sum below is a convex combination of conductances and
+\[ \sum_\mu \xi_{\mu i} f_\mu \le G_{\max} \tag{33} \]
+     so the RHS of the visible neuron dynamics is O(1).
+     There exist both column-sum and row-sum budgets that are enforced by the hardware, since each
+     neuron’s output stage can only source/sink a finite amount of current while maintaining GBW/SR
+     margins. This dictates a per-column and per-row conductance budget to stay within this maximum
+     current, resulting in
+\[ \sum_{i=1}^{N_v} \xi_{\mu i} \le C_r \;\; \forall \mu, \qquad \sum_{\mu=1}^{N_h} \xi_{\mu i} \le C_c \;\; \forall i \tag{34} \]
+     Weights can only be nonnegative since conductances can only be nonnegative, so ξµi ≥ 0.
+     As a corollary of (A1), note also that we can bound ∥ξµ∥2 ≤ S ∀µ; since ∥ξµ∥2 ≤ ∥ξµ∥1 = Σ_i ξµi,
+     the row budget in (34) gives ∥ξµ∥2 ≤ Cr ∀µ.
+(A2) Bounded biases. |ai | ≤ A, |bµ | ≤ B for all i, µ. In realistic regimes, this typically holds, for
+    example the typical choice in boolean functions of bµ = − β2 ∥ξ µ ∥2 (seen in Section 5.1).
+
+
+
+Model. Take the system of equation (1) with a softmax activation on hidden neurons and an identity
+activation on visible neurons. For clarity we assume zero biases on visible neurons (a = 0) in the steps below; nonzero
+biases do not change the analysis.
+
+                      τv v̇ = ξ⊤ f + a − v,   τh ḣ = ξv + b − h,      f = softmaxβ (h)            (35)
+
+Integrating out the hidden units,
+
+\[ \tau_v \dot{v} = \xi^\top f(v) - v, \tag{36} \]
+\[ f(v) = \mathrm{softmax}\big(\beta(\xi v + b)\big) \tag{37} \]
+
+yields the effective energy function expressed in terms of visible neurons:
+\[ E(v) = \frac{1}{2}\|v\|^2 - \frac{1}{\beta} \log \sum_\mu \exp\big(\beta(\xi_\mu^\top v + b_\mu)\big) \tag{38} \]
+
+
+where ∇E(v) = v − ξ⊤ f(v). Because τv v̇ = −∇E(v), we see that the dynamical trajectory causes the
+energy to monotonically decrease over time:
+\[ \frac{d}{dt} E(v(t)) = \nabla E(v(t))^\top \dot{v} = -\frac{1}{\tau_v} \|\nabla E(v(t))\|^2 \le 0 \tag{39} \]
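+
+The gradient formula used here follows from differentiating (38) term by term. A short sketch, writing
+sµ(v) = ξµ⊤ v + bµ for the logits as in (50):
+
+```latex
+\begin{align*}
+\nabla_v \tfrac{1}{2}\|v\|^2 &= v, \\
+\nabla_v \frac{1}{\beta} \log \sum_\mu e^{\beta s_\mu(v)}
+  &= \frac{1}{\beta} \cdot \frac{\sum_\mu \beta\, \xi_\mu\, e^{\beta s_\mu(v)}}{\sum_\nu e^{\beta s_\nu(v)}}
+   = \sum_\mu f_\mu(v)\, \xi_\mu = \xi^\top f(v), \\
+\nabla E(v) &= v - \xi^\top f(v).
+\end{align*}
+```
+
+Comparing with (36) gives τv v̇ = −∇E(v), so the visible dynamics are a gradient flow on E.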
+
+E.1    Low-β regime
+The energy landscape in the low-β regime exhibits uniform strong convexity, so the gradient flow dy-
+namics cause the energy gap to decay exponentially, reaching an ϵ-fraction of the original energy gap
+in constant time. To show E(v) is α-strongly convex, we must show ∇2 E(v) ⪰ αI for some α > 0.
+This means that all the eigenvalues of the Hessian are ≥ α. Equivalently, λmin (∇2 E) ≥ α. Denote
+G(f ) = Diag(f ) − ff ⊤ ⪰ 0, which is the Jacobian of the softmax function f (v) = softmax(β(ξv + b)).
+
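+\begin{align}
+\nabla^2 E(v) &= I - \beta \xi^\top G(f) \xi \tag{40} \\
+\lambda_{\min}\big(\nabla^2 E(v)\big) &= \lambda_{\min}\big(I - \beta \xi^\top G(f) \xi\big) \tag{41} \\
+&= 1 - \beta \lambda_{\max}\big(\xi^\top G(f) \xi\big) \tag{42} \\
+\Rightarrow \nabla^2 E(v) &\succeq \big(1 - \beta \lambda_{\max}(\xi^\top G(f) \xi)\big) I \tag{43}
+\end{align}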
+
+Because G(f) is PSD with G(f) ⪯ Diag(f) ⪯ I, the matrix ξ⊤G(f)ξ is also PSD; and since G(f) is a
+probability-weighted covariance where $\sum_\mu f_\mu = 1$,
+\[ \lambda_{\max}(\xi^\top G(f) \xi) \le \operatorname{tr}(\xi^\top G(f) \xi) \le \sum_\mu f_\mu \|\xi_\mu\|^2 \le \max_\mu \|\xi_\mu\|^2 \tag{44} \]
+
+
+Denote S² = maxµ ∥ξµ∥², which is bounded independently of system size by the conductance budgets in (A1).
+Therefore, the Hessian of E can be bounded as
+
+                                       ∇2 E(v) ⪰ (1 − βS 2 )I = αI                                 (45)
+
+where α = 1 − βS 2 . Then α > 0 when β < 1/ maxµ ∥ξ µ ∥2 . This is a sufficient (but not necessary)
+condition for the system to be in the low-β (uniformly convex) regime, where the softmax is diffuse
+enough that its covariance term does not contribute so much negative curvature as to overwhelm the
+positive curvature contributed by the identity term. In this regime, the uniform lower bound on the
+Hessian implies α-strong convexity, which gives the Polyak–Łojasiewicz (PL) inequality
+\[ \frac{1}{2} \|\nabla E(v)\|^2 \ge \alpha \big(E(v) - E^\star\big) \tag{46} \]
+where E⋆ = min_v E(v). Together with (39), this allows us to bound the time constant of gradient flow:
+
+\[ \frac{d}{dt}\big(E(v(t)) - E^\star\big) = -\frac{1}{\tau_v}\|\nabla E(v(t))\|^2 \le -\frac{2\alpha}{\tau_v}\big(E(v(t)) - E^\star\big) \tag{47} \]
+
+
+If the curvature is bounded below by α, then the gradient magnitude grows at least linearly with distance
+to the minimum, so the energy function is “steep enough” to guarantee exponential convergence.
+Integrating,
+\[ E(v(t)) - E^\star \le \big(E(v(0)) - E^\star\big) e^{-\frac{2\alpha}{\tau_v} t} \tag{48} \]
+This indicates exponential decay of the energy gap. Reaching an ϵ-fraction of the original energy
+gap takes time
+\[ T(\epsilon) \le \frac{\tau_v}{2\alpha} \log\frac{1}{\epsilon} = O(\tau_v \log(1/\epsilon)) \tag{49} \]
+which is entirely independent of system size Nv and Nh. In the energy transformer case, this means that
+convergence time is entirely independent of context length L and token dimension D.
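+
+As an illustrative instance of (45) and (49), with values chosen here purely for the arithmetic (they are not taken
+from the experiments): take S² = 1 and β = 0.5, so that α = 1 − βS² = 0.5, and a tolerance ϵ = 10⁻³.
+
+```latex
+\[
+T(\epsilon) \le \frac{\tau_v}{2\alpha} \log\frac{1}{\epsilon}
+  = \frac{\tau_v}{2 \cdot 0.5} \log\left(10^{3}\right)
+  = \tau_v \cdot 3 \ln 10 \approx 6.9\, \tau_v .
+\]
+```
+
+The bound is a fixed multiple of the visible-neuron time constant and contains neither Nv nor Nh.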
+
+E.2      High-β regime
+E.2.1    TI : Basin selection
+Denote
+\[ s_\mu(v) := \xi_\mu^\top v + b_\mu, \qquad m(v) := \max_\mu s_\mu(v), \qquad f := \mathrm{softmax}(\beta s) \tag{50} \]
+
+Define the basin of attraction around the winning softmax logit k by the margin γ > 0:
+\[ B_k(\gamma) = \{ v : s_k(v) - \max_{j \ne k} s_j(v) \ge \gamma \} \tag{51} \]
+
+Let TI be the first time t such that v(t) ∈ ∪k Bk(γ). Defining the softmax component of the energy
+function (38) as
+\[ \mathrm{LSE}_\beta(s) = \frac{1}{\beta} \log \sum_{\mu=1}^{N_h} e^{\beta s_\mu} \]
+
+then for every v, we can bound the LSE as
+\[ m(v) \le \mathrm{LSE}_\beta(s(v)) \le m(v) + \frac{1}{\beta} \log N_h \tag{52} \]
+Thus, the “softmax slack” δ(v) := LSEβ(s(v)) − m(v) obeys $0 \le \delta(v) \le \frac{1}{\beta} \log N_h$. In the high-β regime,
+there are no critical points other than the softmax basins (those within ∪k Bk(γ) for any reasonable
+γ > ϵ > 0). To reduce δ from its initial value to the cusp of one of the basins requires dissipating at most
+\[ \Delta E_{\mathrm{softmax}} \le \frac{1}{\beta} \log N_h \tag{53} \]
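+
+The two-sided bound (52) follows from sandwiching the sum of exponentials between its largest term and Nh copies
+of that term, then taking the logarithm and dividing by β:
+
+```latex
+\begin{gather*}
+e^{\beta m(v)} \;\le\; \sum_{\mu=1}^{N_h} e^{\beta s_\mu(v)} \;\le\; N_h\, e^{\beta m(v)} \\
+\Longrightarrow\quad
+m(v) \;\le\; \frac{1}{\beta} \log \sum_{\mu=1}^{N_h} e^{\beta s_\mu(v)} \;\le\; m(v) + \frac{1}{\beta} \log N_h .
+\end{gather*}
+```
+
+The upper bound is attained when all Nh logits are equal, and for fixed Nh the bound on the slack shrinks as 1/β.
+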
+Since $\frac{\partial E}{\partial v_i} = -\tau_v \dot{v}_i$, and outside winning basins τv v̇i ∼ 1 in magnitude, the squared magnitude
+of the gradient grows at least linearly in Nv:
+\[ \|\nabla E(v)\|^2 = \sum_{i=1}^{N_v} \left(\frac{\partial E}{\partial v_i}\right)^2 \ge c N_v \tag{54} \]
+for some c > 0 independent of Nv and Nh for all v in the trajectory outside a winning basin. Therefore,
+the energy dissipation rate satisfies
+\[ -\dot{E}(t) = \frac{1}{\tau_v} \|\nabla E(v(t))\|^2 \ge \frac{c}{\tau_v} N_v \tag{55} \]
+    Under assumptions (A1)–(A2), the visible state v remains in a bounded box, so the quadratic part of
+the energy contributes at most O(Nv ) to the energy difference between any two points on the trajectory.
+Since the energy dissipation rate during TI scales proportionally to Nv , the quadratic component of
+the energy contribution is dissipated in constant time. The only nontrivial Nh dependence is due to the
+softmax slack. Together with the bound on ∆Esoftmax, the total time this phase takes is characteristically
+\[ T_I = O\left( \frac{\tau_v \log N_h}{\beta N_v} \right) \tag{56} \]
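+
+The scaling in (56) is the softmax energy budget (53) divided by the minimum dissipation rate (55). Written out,
+with c the constant from (54):
+
+```latex
+\[
+T_I \;\lesssim\; \frac{\Delta E_{\mathrm{softmax}}}{\min_t \big(-\dot{E}(t)\big)}
+  \;\le\; \frac{\frac{1}{\beta} \log N_h}{\frac{c}{\tau_v} N_v}
+  \;=\; \frac{\tau_v \log N_h}{c\, \beta\, N_v} .
+\]
+```
+
+The quadratic part of the energy, which is O(Nv), is dissipated at the same rate and so contributes only an O(τv)
+term that does not depend on Nv or Nh.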
+
+E.2.2    TII : Contractive convergence within a winning basin
+Consider the basin Bk(γ) that is entered at tin = TI. We will now show local strong convexity within this
+basin, allowing us to invoke the PL inequality and find exponential convergence within the basin. Define
+G := Diag(f) − ff⊤. First, consider that the non-winning softmax mass is 1 − fk, which is
+\[ 1 - f_k = \sum_{j \ne k} f_j \le (N_h - 1) e^{-\beta\gamma} \tag{57} \]
+
+Additionally, since $\|f\|^2 = f_k^2 + \sum_{j \ne k} f_j^2 \ge f_k^2$ and 0 ≤ fk ≤ 1,
+
+\[ \lambda_{\max}(G(f)) \le \operatorname{tr}(G(f)) = 1 - \|f\|^2 \le 1 - f_k^2 \le 2(1 - f_k) \le 2(N_h - 1) e^{-\beta\gamma} \tag{58} \]
+
+Hence, with S² = maxµ ∥ξµ∥²,
+
+\[ \lambda_{\max}(\xi^\top G(f) \xi) \le S^2 \lambda_{\max}(G(f)) \le 2 S^2 (N_h - 1) e^{-\beta\gamma} \tag{59} \]
+
+This gives a bound on the largest eigenvalue of ξ⊤G(f)ξ in a way that incorporates the softmax β.
+   Now, we can show local strong convexity in the winning basin:
+
+\[ \nabla^2 E(v) = I - \beta \xi^\top G(f) \xi \succeq \big(1 - 2\beta S^2 (N_h - 1) e^{-\beta\gamma}\big) I \equiv \alpha(\beta,\gamma) I \tag{60} \]
+
+for all v ∈ Bk(γ). In particular, if
+\[ e^{-\beta\gamma} (N_h - 1) \le \frac{1}{4 \beta S^2} \tag{61} \]
+
+then α(β, γ) ≥ 1/2, independent of Nh, Nv. Note that this is always possible: if the softmax is not peaked
+enough to make this inequality true, simply keep moving in trajectory “Phase I” for a little longer until
+the margin γ grows slightly larger such that the condition holds true. This strong convexity within Bk(γ)
+implies the PL inequality
+\[ \frac{1}{2} \|\nabla E(v)\|^2 \ge \alpha(\beta,\gamma) \big(E(v) - E^\star\big), \qquad \forall v \in B_k(\gamma) \tag{62} \]
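+
+Substituting condition (61) into the definition of α(β, γ) in (60) shows where the constant 1/2 comes from, and
+solving (61) for γ gives the margin it requires (for Nh > 1):
+
+```latex
+\begin{align*}
+\alpha(\beta,\gamma) &= 1 - 2\beta S^2 (N_h - 1) e^{-\beta\gamma}
+  \;\ge\; 1 - 2\beta S^2 \cdot \frac{1}{4\beta S^2} = \frac{1}{2}, \\
+\gamma &\ge \frac{1}{\beta} \log\big( 4 \beta S^2 (N_h - 1) \big).
+\end{align*}
+```
+
+The required margin grows only logarithmically in Nh.
+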
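+Therefore, along the trajectory within the basin for times t ≥ tin ,
+\[
+    \frac{d}{dt}\bigl(E(v(t)) - E^\star\bigr) = -\frac{1}{\tau_v}\|\nabla E(v(t))\|^2 \le -\frac{2\alpha(\beta,\gamma)}{\tau_v}\bigl(E(v(t)) - E^\star\bigr). \tag{63}
+\]
+Integrating,
+\[
+    E(v(t)) - E^\star \le e^{-\frac{2\alpha(\beta,\gamma)}{\tau_v}(t - t_{\mathrm{in}})}\,\bigl(E(v(t_{\mathrm{in}})) - E^\star\bigr). \tag{64}
+\]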
+
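+Impose a relative-to-initial convergence criterion:
+\[
+    E(v(t)) - E^\star \le \epsilon\,\bigl(E(v(0)) - E^\star\bigr), \qquad \epsilon \in (0, 1).
+\]
+Since E is non-increasing along the trajectory, E(v(tin )) − E ⋆ ≤ E(v(0)) − E ⋆ , so it suffices that
+\[
+    e^{-\frac{2\alpha(\beta,\gamma)}{\tau_v}(t - t_{\mathrm{in}})} \le \epsilon.
+\]
+Hence the in-basin time satisfies
+\[
+    T_{\mathrm{II}} \le \frac{\tau_v}{2\alpha(\beta,\gamma)}\log\frac{1}{\epsilon} = O\!\left(\tau_v \log\frac{1}{\epsilon}\right), \tag{65}
+\]
+which is independent of Nh and Nv .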
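+
+As a rough illustration of (65), assume an example tolerance ϵ = 10^{-3} (a value chosen here, not taken from
+the analysis above) and the threshold case α(β, γ) = 1/2 allowed by (61):
+\[
+    T_{\mathrm{II}} \le \frac{\tau_v}{2 \cdot \tfrac{1}{2}} \log\frac{1}{10^{-3}} = \tau_v \log 10^{3} \approx 6.9\,\tau_v .
+\]
+Under these assumptions the in-basin phase takes about seven visible time constants, regardless of Nh and Nv .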
+
+E.2.3    Combined bound
+Altogether, in the high-β regime, to reach a relative-to-initial tolerance of
+\[
+    E(v(t)) - E^\star \le \epsilon\,\bigl(E(v(0)) - E^\star\bigr) \tag{66}
+\]
+the combined convergence time satisfies
+\[
+    T(\epsilon) = \underbrace{O\!\left(\frac{\tau_v \log N_h}{\beta N_v}\right)}_{\text{winner selection } (T_{\mathrm{I}})}
+                + \underbrace{O\!\left(\tau_v \log\frac{1}{\epsilon}\right)}_{\text{convergence within basin } (T_{\mathrm{II}})} \tag{67}
+\]
+For fixed ϵ, β, and τv , TII is independent of Nv and Nh , while TI carries all the model-size dependence.
+The dependence of the convergence time on Nh and Nv in the high-β regime is
+\[
+    T(\epsilon) = O\!\left(\frac{\tau_v \log N_h}{\beta N_v}\right). \tag{68}
+\]
+The convergence time is at most logarithmic in the number of hidden neurons Nh , and actually decreases
+as 1/Nv in the number of visible neurons.
+
+E.3     Limitations
+Our analysis assumes that the timescales of the crossbar array are much faster than the fastest neuronal
+timescales. In practice, as the crossbar array gets bigger, it may contribute to the timescales of the
+entire system, since wires have non-zero capacitance. Once the crossbar array is large enough to
+significantly modify the timescales of the neurons, our analysis and the scaling argument
+become invalid. For this reason, one cannot scale this design to arbitrarily large sizes. Analyzing that
+boundary is outside the scope of our paper, because it depends on fabrication and design parameters,
+which sit at a different level of abstraction from the present paper.
+
+
+F       Design invariance under voltage scaling
+Given hardware constraints of Gmax , Cc , and Cr , we can still implement models with arbitrarily large
+weights. Convergence bounds rely on the weight matrix constraints, which can be made feasible by
+global normalization at the hardware level, keeping the effective model weights unchanged. Consider the
+scaling factor for any non-negative ξ:
+\[
+    \kappa = \min\left\{ 1,\; \frac{G_{\max}}{\max_{\mu,i} \xi_{\mu i}},\; \frac{C_c}{\max_i \sum_\mu \xi_{\mu i}},\; \frac{C_r}{\max_\mu \sum_i \xi_{\mu i}} \right\} \tag{69}
+\]
+Set ξ̃ = κξ. Then, ξ̃ satisfies all the hardware constraints of assumption (A1):
+\[
+    0 \le \tilde\xi_{\mu i} \le G_{\max}, \qquad \sum_i \tilde\xi_{\mu i} \le C_r \;\; \forall \mu, \qquad \sum_\mu \tilde\xi_{\mu i} \le C_c \;\; \forall i \tag{70}
+\]
+So any ξ matrix can be mapped onto budgets with one scalar κ. Consider the pre-softmax arguments
+for the hidden neurons: if we scale weights ξ → ξ̃ = κξ, rescale the voltage unit v → ṽ = κv and biases
+b → b̃ = κ² b and set β̃ = β/κ² , then
+\[
+    \tilde\beta\,\bigl(\tilde\xi_\mu^\top \tilde v + \tilde b\bigr) = \beta\,\bigl(\xi_\mu^\top v + b\bigr) \tag{71}
+\]
+
+so the softmax outputs f and the system’s attractors are unchanged. The visible ODE τv v̇ = ξ⊤ f (v) − v
+is preserved up to units, as the κ terms can be absorbed into the gain of U2 and U3 without affecting the
+convergence time bounds.
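+
+For reference, (71) can be checked by direct substitution of the rescaled quantities defined above; no
+further assumptions are needed:
+\[
+    \tilde\beta\,\bigl(\tilde\xi_\mu^\top \tilde v + \tilde b\bigr)
+    = \frac{\beta}{\kappa^2}\bigl((\kappa\xi_\mu)^\top(\kappa v) + \kappa^2 b\bigr)
+    = \frac{\beta}{\kappa^2}\,\kappa^2\bigl(\xi_\mu^\top v + b\bigr)
+    = \beta\,\bigl(\xi_\mu^\top v + b\bigr).
+\]
+The two factors of κ from the weights and voltages, and the κ² on the bias, are cancelled exactly by the 1/κ² in β̃.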
+
+
+G       Scaling of energy consumption
+The energy consumption of DenseAM circuits can be broken up into two parts: the energy dissipated
+by the weights as a result of Ohm’s Law, and the energy from engineering overhead found in amplifiers
+and active circuitry. The energy dissipated by the weights in the crossbar array can be expressed as the
+integral of the power dissipated by each resistor of resistance Rµi from time 0 until convergence at Tconv .
+
+Energy consumption of weights. Let the neuron output voltages be proportional to activations:
+ui = κgi and wµ = κfµ , where κ is a fixed voltage scale. We assume rail-bounded outputs |ui | ≤ κ and
+|wµ | ≤ κ (by Appendix F, global rescaling of ξ, voltages, and β preserves the DenseAM dynamics, so
+this choice of κ does not affect behavior). The instantaneous power in the resistive crossbar is:
+\[
+    P_{\mathrm{weights}}(t) = \sum_{i,\mu} \xi_{\mu i}\,(u_i - w_\mu)^2 \tag{72}
+\]
+Using the row/column conductance budgets $\sum_\mu \xi_{\mu i} \le C_c$ and $\sum_i \xi_{\mu i} \le C_r$ (Appendix E) and the
+inequality $(a - b)^2 \le 2a^2 + 2b^2$,
+\begin{align}
+    P_{\mathrm{weights}}(t) &\le 2\left( \sum_{i,\mu} \xi_{\mu i}\, u_i^2 + \sum_{i,\mu} \xi_{\mu i}\, w_\mu^2 \right) \tag{73} \\
+    &= 2\left( \sum_i u_i^2 \Bigl(\sum_\mu \xi_{\mu i}\Bigr) + \sum_\mu w_\mu^2 \Bigl(\sum_i \xi_{\mu i}\Bigr) \right) \tag{74} \\
+    &\le 2\left( C_c \sum_i u_i^2 + C_r \sum_\mu w_\mu^2 \right) \tag{75}
+\end{align}
+If the hidden layer uses a softmax activation, then $\sum_\mu f_\mu^2 \le 1$ and so $\sum_\mu w_\mu^2 \le \kappa^2$; and rail bounds give
+$\sum_i u_i^2 \le N_v \kappa^2$. Therefore,
+\[
+    P_{\mathrm{weights}}(t) \le 2\kappa^2 (C_c N_v + C_r) = O(N_v) \tag{76}
+\]
+Therefore, a system taking time $T^{\mathrm{conv}}$ to converge results in an energy consumption of
+\[
+    E_{\mathrm{weights}} = \int_0^{T^{\mathrm{conv}}} P_{\mathrm{weights}}(t)\,dt \le 2\kappa^2 (C_c N_v + C_r)\, T^{\mathrm{conv}} \tag{77}
+\]
+According to the convergence time bounds of Appendix E, $T^{\mathrm{conv}} = O(\tau_v)$. Thus, Eweights = O(Nv ), as
+a function of system size.
+
+Energy consumption of capacitors. Let each neuron node voltage be bounded by hardware limits
+|ui (t)|, |wµ (t)| ≤ κ. Charging a capacitor of capacitance C from a supply through a resistive path draws
+CV 2 from the power supply. The number of times each capacitor charges is finite because the Lyapunov
+energy of the DenseAM forbids limit cycles. This means the total supply energy per node can be bounded
+by a constant. Therefore, the total energy needed to (re)charge all neuron capacitors is bounded by
+\[
+    E_{\mathrm{capacitors}} \le O(1) \cdot \kappa^2 \left( \sum_{i=1}^{N_v} C_i^{(v)} + \sum_{\mu=1}^{N_h} C_\mu^{(h)} \right) = O(N_v + N_h) \tag{78}
+\]
+
+
+Energy consumption of amplifiers, bias, control, and overhead. Per neuron, the energy expended
+on amplifier inefficiency, bias terms, and general overhead does not depend on system size. For a
+runtime of duration T conv , the energy consumption of these elements in the entire network scales as
+
+                                       Eother = O((Nv + Nh )T conv )                                                      (79)
+
+Combined energy consumption.             All together, the total energy consumption can be written as
+
+                                             Etotal = O(Nv + Nh )                                                         (80)
+
+
+H     Model Specifications and Details
+Table 3, Table 4, and Table 5 summarize the model design for the XOR, Hamming (7,4), and parity
+DenseAM models.
+
+                                    Table 3: XOR model specification
+
+Visible neurons vi                    Nv = 3 (inputs v1 , v2 clamped to {0,1}; output v3 free)
+Hidden neurons hµ                     Nh = 4 (one per truth-table row)
+Visible activation and Lagrangian     Identity: $g_i = v_i$, $L_v = \frac{1}{2}\sum_{i=1}^{N_v} v_i^2$
+Hidden activation and Lagrangian      Softmax: $f_\mu = \mathrm{softmax}(\beta h_\mu)$, $L_h = \frac{1}{\beta}\log\Bigl(\sum_{\mu=1}^{N_h} e^{\beta h_\mu}\Bigr)$
+Visible biases                        ai = 0
+Hidden biases                         $b_\mu = -\frac{1}{2}\sum_{i=1}^{N_v} \xi_{\mu i}^2$
+Weights ξ                             $\xi \in \{0,1\}^{4\times 3}$, rows encode memories:
+                                      $\xi = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$
+Inference protocol                    Clamp (v1 , v2 ) to input values; read out v3 at convergence
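+
+A short worked reading of the hidden biases in Table 3, assuming the hidden pre-activation takes the form
+hµ = ξµ⊤ v + bµ (the form of the pre-softmax argument in (71); this is an assumption made for the illustration).
+The four rows of ξ have squared norms 0, 2, 2, 2, so
+\[
+    b = (0,\,-1,\,-1,\,-1), \qquad
+    \xi_\mu^\top v + b_\mu = \xi_\mu^\top v - \tfrac{1}{2}\|\xi_\mu\|^2 = \tfrac{1}{2}\|v\|^2 - \tfrac{1}{2}\|v - \xi_\mu\|^2 .
+\]
+The term ½∥v∥² is common to all µ and cancels in the softmax, so under this assumption the largest activation
+goes to the stored row nearest to v in Euclidean distance.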
+
+
+
+
+                             Table 4: Hamming (7,4) model specification
+
+Visible neurons (Nv )   7 (codeword bits)
+Hidden neurons (Nh )    16 (one per valid codeword)
+Visible activation      Identity: gi = vi
+Hidden activation       Softmax over µ ∈ {1, . . . , 16} with temperature β
+Visible biases          ai = 0
+Hidden biases           $b_\mu = -\frac{1}{2}\sum_{i=1}^{N_v} \xi_{\mu i}^2$
+Weights ξ               ξ ∈ {0, 1}16×7 , each row is a valid Hamming(7,4) codeword
+Inference protocol      Initialize visible neurons to corrupted 7-bit input codeword; let all visible and
+                        hidden neurons evolve; converged visible neurons give the corrected codeword
+
+
+
+
+                               Table 5: 8-bit parity model specification
+
+Visible neurons vi                                          Nv = 16 (dimension of embedding D)
+Hidden neurons (energy attention) $h^{\mathrm{attn}}_A$          $N_h^{\mathrm{attn}} = 8$ (context length L)
+Hidden neurons (Hopfield network) $h^{\mathrm{hopf}}_\mu$        $N_h^{\mathrm{hopf}} = 16$ (Hopfield network memories M )
+Hidden neurons (total)                                      Nh = 24 (L + M )
+Visible activation                                          Identity: gi = vi
+Hidden activation (energy attention)                        Softmax: $f_A^{\mathrm{attn}} = \mathrm{softmax}(\beta h^{\mathrm{attn}})_A$ for A = 1, . . . , L
+Hidden activation (Hopfield network)                        ReLU: $f_\mu^{\mathrm{hopf}} = \max(h_\mu^{\mathrm{hopf}}, 0)$ for µ = 1, . . . , M
+Weights (energy attention)                                  $\xi^{\mathrm{attn}} \in \mathbb{R}^{L \times D}$, where $\xi_A^{\mathrm{attn}}$ is the embedded A’th context token
+Weights (Hopfield network)                                  $\xi^{\mathrm{hopf}} \in \mathbb{R}^{M \times D}$, static after training
+Inference protocol                                          Embed L context tokens to obtain $\xi^{\mathrm{attn}}$. Let visible neurons
+                                                            evolve until convergence
+
+H.1      Bit string energy transformer implementation
+As described in Table 5, our trained model uses an embedding matrix of 2 × D = 32 parameters, the
+Hopfield network with D × M = 256 parameters, an additional D × 2 = 32 parameter matrix to decode
+embeddings to logits, a total of D + L + M = 40 neuron bias terms, and 2 biases for the linear decoder.
+This is a total of 362 parameters.
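+
+The count can be tallied term by term with D = 16, L = 8, and M = 16 from Table 5 (the grouping below simply
+follows the order of the sentence above):
+\[
+    \underbrace{2D}_{32} + \underbrace{DM}_{256} + \underbrace{2D}_{32} + \underbrace{D + L + M}_{40} + \underbrace{2}_{2} = 362 .
+\]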
+    In training and inference we use time constants τv = 0.1 and τh = 0.01. We train with Euler steps of
+1e-3, and test with Euler steps of 1e-4 for a time horizon of T = 1 second. JAX’s automatic differentiation
+was used to implement backpropagation through time. We encourage the model to reach fixed points
+by penalizing v̇ at time T. This yields models that are more robust to hardware imperfection due to the
+intrinsic stability of attractor points. The convergence to an attractor also means the inference remains
+stable to mismatch and delay in timing during readout.
+
+
+I     Hardware analysis
+I.1     Hardware speed analysis
+As discussed in subsection 7.1, the convergence time of analog DenseAMs is governed not by system size,
+but rather primarily by the timescales of the dynamics in hardware. These timescales are set by the time
+constants τv and τh . The smaller these time constants, the faster the dynamics move, and the faster the
+system converges. In this section, we derive bounds on the minimum time constant min{τv , τh } of the
+DenseAM, which is limited by the constraints of active components like amplifiers.
+    The maximum speed of neuronal dynamics is limited by the ability of active stages (op-amps/buffers)
+to track changing signals. If the input slope to an active stage exceeds its slew rate (SR), the output
+distorts; if the signal spectrum approaches or exceeds the stage’s closed-loop bandwidth, attenuation
+and phase lag appear. Here, we derive lower bounds on the time constants τv , τh imposed by (i) finite
+gain–bandwidth product (GBW) and (ii) finite SR of the three active stages in the neuron design (Ap-
+pendix A). Without loss of generality we will express the derivation for the hidden neurons, with the
+derivations for visible neurons following by symmetry. Throughout, define the following:
+
+    • State swing: |vi (t)| ≤ Av , so that |v̇i | ≲ Av /τ . Similarly, |hµ (t)| ≤ Ah , so that |ḣµ | ≲ Ah /τ .
+    • Activation swing: Visible activation g(·) is Lipschitz with slope bound Lg = supx |g ′ (x)|. Then,
+      |ġi | ≤ Lg |v̇i | ≤ Lg Av /τ . Similarly, hidden activation f (·) is Lipschitz with slope bounded by
+      Lf = supx |f ′ (x)|. Then, |f˙µ | ≤ Lf |ḣµ | ≤ Lf Ah /τ .
+
+    • Weights ξ ≥ 0. Hardware normalization gives per-row/column conductivity budgets, so the
+      self-term gain for hidden neuron µ is $A_{\mathrm{self},\mu} = \sum_i \xi_{\mu i} = O(1)$.
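+We will derive three independent lower bounds and then take the max:
+\[
+    \tau_{\min} \ge \max\bigl\{ \underbrace{\tau_{\mathrm{GBW}}}_{\text{tracking small signals}},\;
+                                \underbrace{\tau_{\mathrm{SR}}}_{\text{edge/large-signals}},\;
+                                \underbrace{\tau_{I\text{-limit}}}_{\text{output current}} \bigr\} \tag{81}
+\]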
+
+
+I.1.1   Gain-bandwidth product bound
+For a single-pole op-amp with gain-bandwidth product GBW in a closed-loop configuration with closed-loop
+gain ACL , the −3 dB bandwidth is fc ≈ GBW/ACL . In order for the neuron to faithfully track with a
+time constant τ , we require fc ≳ 1/(2πτ ) for every stage in the signal path, that is, τ ≳ ACL /(2π GBW).
+The closed-loop gains of the op-amps are: ACL (U1) = 1 because it is a unity-gain buffer, ACL (U2) = Aself
+because it needs to realize the self-term gain, and ACL (U3) ≈ 1 because it is a unity-gain summer.
+Assuming the same op-amp design for U1, U2, and U3, and taking the worst case,
+\[
+    \tau_{\mathrm{GBW}} = \frac{\max(1, A_{\mathrm{self}})}{2\pi\,\mathrm{GBW}} \tag{82}
+\]
+
+I.1.2   Slew rate bound
+The slew-rate limits cap the maximum output slope of each op-amp stage:
+    • U1: activation buffer. |f˙µ | ≤ Lf Ah /τ , which gives τ ≥ (Lf Ah )/SRU1 .
+
+Table 6: Estimated neuron time constants and conservative convergence times with Av = Ah = 1 V,
+Lg = 1, Aself = 1 for representative amplifiers in literature. GBW bound: τGBW = 1/(2π GBW); SR bound:
+τSR = Lg Av /SR (visible path). Overall τmin = max{τGBW , τSR }; we report Tconv = 10 τmin .
+
+CMOS Amplifier (ref.)                          SR (V/µs)        GBW (MHz)            τSR (ns)    τGBW (ns)        Tconv (ns)
+Perez and Maloberti [36]                              84.50               321.50         11.83             0.50    118.34
+Assaad and Silva-Martinez [37]                        94.10               134.20         10.63             1.19    106.27
+Yen and Blalock [38]                                 202.00                10.70          4.95            14.87    148.74
+Naderi, Prakash, and Silva-Martinez [39]            1250.00              3600.00          0.80             0.04     8.00
+Schlögl and Zimmermann [40]                        1650.00              2510.00          0.61             0.06     6.06
+Notes. (i) τSR values assume the visible path dominates the summer’s SR (low/moderate-β). If softmax dominates at U3
+in the high-β regime, multiply SR-limited values by κ = (β/2) (Ah /Av ) (with Ah = Av = 1 V, simply β/2). (ii) The
+current-limit bound τI-limit = C Av /Imax is typically ≪ all reported values for C ∼ 50 fF and Imax ∼ mA, so it is omitted
+from the table but must still be respected in circuit sizing.
+
+
+    • U2: self-term. sµ = Aself fµ , so |ṡµ | = Aself |f˙µ | ≤ (Aself Lf Ah )/τ , which gives τ ≥ (Aself Lf Ah )/SRU2 .
+    • U3: internal state drive. The time-varying portion of the RC circuit drive dµ is a linear combination
+      of fµ and gi , with coefficients that have a maximum magnitude of Aself . Using the bounds on
+      the slopes of those inputs, we get the following bound on |ḋµ | and subsequently the time constant
+      bound:
+\[
+    |\dot d_\mu| \lesssim \frac{A_{\mathrm{self}}}{\tau} \max\{L_f A_h,\, L_g A_v\}
+    \quad\Rightarrow\quad
+    \tau \ge \frac{A_{\mathrm{self}} \max(L_f A_h,\, L_g A_v)}{\mathrm{SR}_{U3}} \tag{83}
+\]
+All together, the combined constraint is
+\[
+    \tau_{\mathrm{SR}} = \max\left\{ \frac{L_f A_h}{\mathrm{SR}_{U1}},\; \frac{A_{\mathrm{self}} L_f A_h}{\mathrm{SR}_{U2}},\; \frac{A_{\mathrm{self}} \max(L_f A_h,\, L_g A_v)}{\mathrm{SR}_{U3}} \right\} \tag{84}
+\]
+
+I.1.3   Current / headroom limit
+U3 must provide the current through R2 to charge C1 . The RC circuit dynamics dictate R2 C1 ḣµ =
+−hµ + dµ , so the instantaneous current needed by U3 is
+\[
+    I_{U3,\mathrm{out}} = \frac{d_\mu - h_\mu}{R_2} = C_1 \dot h_\mu \tag{85}
+\]
+We must respect |IU3,out | ≤ Imax,U3 . With |ḣµ | ≲ Ah /τ ,
+\[
+    \tau_{I\text{-limit}} \ge \frac{C_1 A_h}{I_{\max,U3}} \tag{86}
+\]
+
+I.1.4   Combined bound on minimum time constant
+Taken together, the minimum time constant must satisfy the bounds (82), (84), and (86):
+
+                                           τmin ≥ max{τGBW , τSR , τI-limit }                                        (87)
+
+I.2     Estimates of inference times with existing hardware
+Under standard assumptions for DenseAMs (symmetric couplings and monotone activations), the Lya-
+punov energy decreases monotonically and the dynamics converge without oscillations. The settling time
+is therefore on the order of a few multiples of the largest neuronal time constant, which we bound by
+amplifier non-idealities. In this section we take some representative examples of op-amps from literature
+and estimate the inference speeds from reasonable and representative design parameters.
+
+Minimum time constant.             For illustration purposes, we choose three reasonable hardware constraints:
+    • Activation slopes. Take the slope of the visible activation to be Lg = 1, as occurs for
+      an identity visible neuron activation. Take the worst-case (maximum) slope of the hidden activation
+      to be that of the softmax with fixed β, whose Jacobian is βG(f ) with ∥G(f )∥2 ≤ 1/2, so a safe
+      global bound is Lf ≤ β/2.
+    • Signal swing. Use the voltage scaling invariance (see Appendix F) to rescale v, ξ, and β together
+      to pick a swing that is slew-rate friendly but well above component noise limits. Take both Av =
+      Ah = 1V .
+
+    • Self-term gain. With row/column budgets, use Aself as a worst-case bound.
+With those choices, the three lower bounds per neuron are:
+
+    1. GBW Bound: τGBW = max(1, Aself )/(2π GBW) = 1/(2π GBW).
+    2. SR Bound: The U1/U2 paths give τSR,vis = Lg Av /SR = (1/SR) µs, with SR expressed in V/µs. In the U3
+       (summer) path, equation (84) has two cases. In the low-β regime where Lg Av ≥ Lf Ah , the U3 bound
+       reduces to (1/SR) µs. In the high-β regime where Lf Ah = β/2 dominates, scale the slew-rate limited
+       bound by β/2.
+    3. Output Current Bound: In practice, this bound generally does not limit the op-amp choice:
+       even with a large capacitor C = 50 fF, Av = 1 V, Imax = 2 mA, τI-limit ≈ 0.025 ns, which is negligible
+       compared to the bounds from SR and GBW.
+To quantify realistic inference speeds, Table 6 lists representative CMOS operational transconductance
+amplifiers (OTAs; see footnote 3) drawn from recent literature, together with their corresponding lower
+bounds on neuronal time constants under the GBW and slew-rate limits. Even using conservative assumptions
+with existing amplifier designs, the analysis shows that modern high-speed OTAs can achieve sub-10 ns
+neuronal convergence times, corresponding to inference rates in the hundreds of megahertz.
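+
+As a worked check of one Table 6 row, using only the table's stated assumptions (Av = 1 V, Lg = 1, Aself = 1),
+take the Schlögl and Zimmermann [40] amplifier with SR = 1650 V/µs and GBW = 2510 MHz:
+\[
+    \tau_{\mathrm{SR}} = \frac{1\ \mathrm{V}}{1650\ \mathrm{V}/\mu\mathrm{s}} \approx 0.61\ \mathrm{ns}, \qquad
+    \tau_{\mathrm{GBW}} = \frac{1}{2\pi \cdot 2.51 \times 10^{9}\ \mathrm{Hz}} \approx 0.06\ \mathrm{ns}, \qquad
+    T_{\mathrm{conv}} = 10 \max\{\tau_{\mathrm{GBW}}, \tau_{\mathrm{SR}}\} \approx 6.06\ \mathrm{ns}.
+\]
+The slew rate is the binding limit for this amplifier, and 1/Tconv ≈ 165 MHz. Reading 1/Tconv as an inference
+rate is an interpretation made here; it is consistent with the “hundreds of megahertz” figure quoted above.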
+
+
+J     Connection between analog and canonical Energy Transformer
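+In this section we show that in the adiabatic limit, our Analog Energy Transformer (Analog ET) reduces
+to the canonical Energy Transformer. Begin with the dynamics for the Analog Energy Transformer
+implemented by our circuit designs.
+\begin{align}
+    \tau_v \dot v &= -\frac{\partial E}{\partial v} = (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}} + (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}} + a - v \tag{88} \\
+    \tau_h \dot h^{\mathrm{attn}} &= -\frac{\partial E}{\partial f^{\mathrm{attn}}} = \xi^{\mathrm{attn}} v + b - h^{\mathrm{attn}} \tag{89} \\
+    \tau_h \dot h^{\mathrm{hopf}} &= -\frac{\partial E}{\partial f^{\mathrm{hopf}}} = \xi^{\mathrm{hopf}} v + c - h^{\mathrm{hopf}} \tag{90}
+\end{align}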
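+Integrating out hidden neurons in the adiabatic limit where τh → 0, we see the relations
+\begin{align}
+    h^{\mathrm{attn}}(v) &= \xi^{\mathrm{attn}} v + b \tag{91} \\
+    h^{\mathrm{hopf}}(v) &= \xi^{\mathrm{hopf}} v + c \tag{92}
+\end{align}
+which we can use to integrate out the hidden neuron activations as
+\begin{align}
+    f^{\mathrm{attn}}(v) &= \mathrm{softmax}\bigl(\xi^{\mathrm{attn}} v + b\bigr) \tag{93} \\
+    f^{\mathrm{hopf}}(v) &= \mathrm{ReLU}\bigl(\xi^{\mathrm{hopf}} v + c\bigr) \tag{94}
+\end{align}
+Substituting into the visible dynamics:
+\[
+    \tau_v \dot v = (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}}(v) + (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}}(v) + a - v \tag{95}
+\]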
+
+Footnote 3: Many high-speed CMOS “op-amps” are reported as OTAs (transconductors). In our neuron, these OTA cores
+operate in closed-loop (unity/non-inverting) configurations, so the literature SR and GBW directly constrain τ via
+Eqs. (82)–(84).
+
+We can ask ourselves, what scalar energy produces this ODE? We seek an energy Eeff (v) such that
+$\tau_v \dot v = -\partial E_{\mathrm{eff}} / \partial v$. Equivalently,
+\[
+    \nabla_v E_{\mathrm{eff}}(v) = v - a - (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}}(v) - (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}}(v) \tag{96}
+\]
+We can construct Eeff (v) as a sum of three pieces whose gradients match each term: Eeff (v) = Equad (v) +
+Eattn (v) + Ehopf (v). By inspection we see that $E_{\mathrm{quad}}(v) = \frac{1}{2}\|v - a\|^2$.
+
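+Attention term.       The energy function
+\[
+    E_{\mathrm{attn}}(v) = -\frac{1}{\beta} \log \sum_A \exp\Bigl(\beta\bigl(\xi_A^{\mathrm{attn}} v + b_A\bigr)\Bigr) \tag{97}
+\]
+satisfies our requirement. We can see that by differentiating with respect to vi , we get
+\begin{align}
+    \frac{\partial E_{\mathrm{attn}}}{\partial v_i} &= -\sum_A \mathrm{softmax}\bigl(\xi^{\mathrm{attn}} v + b\bigr)_A \cdot \xi_{Ai}^{\mathrm{attn}} \tag{98} \\
+    &= -\sum_A \xi_{Ai}^{\mathrm{attn}} f_A^{\mathrm{attn}}(v) \tag{99}
+\end{align}
+which yields our desired dynamics of $\nabla_v E_{\mathrm{attn}}(v) = -(\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}}(v)$.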
+
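+Hopfield term.       A simple way to achieve the desired dynamics is with a Hopfield-type energy function
+\[
+    E_{\mathrm{hopf}}(v) = -\sum_\mu \frac{1}{2}\, \mathrm{ReLU}\bigl(\xi_\mu^{\mathrm{hopf}} v + c_\mu\bigr)^2 \tag{100}
+\]
+whose derivative with respect to vi yields
+\begin{align}
+    \frac{\partial E_{\mathrm{hopf}}}{\partial v_i} &= -\sum_\mu \mathrm{ReLU}\bigl(\xi_\mu^{\mathrm{hopf}} v + c_\mu\bigr) \cdot \xi_{\mu i}^{\mathrm{hopf}} \tag{101} \\
+    &= -\sum_\mu \xi_{\mu i}^{\mathrm{hopf}} f_\mu^{\mathrm{hopf}}(v) \tag{102}
+\end{align}
+which yields our desired dynamics of $\nabla_v E_{\mathrm{hopf}}(v) = -(\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}}(v)$.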
+
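+Effective energy function of analog energy transformer.               All together, the effective scalar energy
+over the visible state v after integrating out hidden neurons is
+\[
+    E_{\mathrm{eff}}(v) = \underbrace{\frac{1}{2}\|v - a\|_2^2}_{E_{\mathrm{quad}}}
+    \;\underbrace{-\,\frac{1}{\beta} \log \sum_A \exp\Bigl(\beta\bigl(\xi_A^{\mathrm{attn}} v + b_A\bigr)\Bigr)}_{E_{\mathrm{attn}}}
+    \;\underbrace{-\,\sum_\mu \frac{1}{2}\, \mathrm{ReLU}\bigl(\xi_\mu^{\mathrm{hopf}} v + c_\mu\bigr)^2}_{E_{\mathrm{hopf}}} \tag{103}
+\]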
+
+This effective energy aligns with the canonical Energy Transformer’s energy function. Because our effec-
+tive dynamics use hidden neurons, the energy function written in the main text reflects the contributions
+of the hidden neurons. When τh ≪ τv , this regime converges to the behavior when the hidden neurons
+are integrated out. Hence, the effective expressibility and behavior of our system is equivalent to that of
+the original Energy Transformer.
+    In our model we omit the layer normalization activation that the original Energy Transformer applies
+to the visible neurons. This keeps the circuit design simple, while still enabling models with high
+expressibility. This choice does not modify the structure of the attention or the Hopfield parts of the
+energy; only the self-energy of v differs. From a modeling perspective, layer normalization mainly
+improves conditioning and learning of deep networks rather than changing the computational primitive
+and expressibility. We empirically observe that the resulting models without layer normalization remain
+expressive enough to solve the problems we present. In principle, a layer normalization-type visible
+activation function could be implemented in analog hardware (e.g. by subtracting the mean voltage
+and normalizing by an on-chip variance estimate), but this would add distracting complications to the
+minimalist neuron and circuit designs we show in this paper.
+
+
\ No newline at end of file
-- 
cgit v1.2.3