Diffstat (limited to 'ep_run/analogET_extracted.txt')
-rw-r--r--ep_run/analogET_extracted.txt1861
1 files changed, 1861 insertions, 0 deletions
diff --git a/ep_run/analogET_extracted.txt b/ep_run/analogET_extracted.txt
new file mode 100644
index 0000000..b139640
--- /dev/null
+++ b/ep_run/analogET_extracted.txt
@@ -0,0 +1,1861 @@
+ Dense Associative Memories with Analog Circuits
+ Marc Gong Bacvanski^1, Xincheng You^2, John Hopfield^3, and Dmitry Krotov^4
+ ^1 MIT
+ ^2 Independent Researcher
+ ^3 Princeton University
+ ^4 IBM Research
+
+ December 16, 2025
+arXiv:2512.15002v1 [cs.NE] 17 Dec 2025
+
+
+
+
+ Abstract: The increasing computational demands of modern AI systems have exposed fundamental
+ limitations of digital hardware, driving interest in alternative paradigms for efficient large-scale inference.
+ Dense Associative Memory (DenseAM) is a family of models that offers a flexible framework for repre-
+ senting many contemporary neural architectures, such as transformers and diffusion models, by casting
+ them as dynamical systems evolving on an energy landscape. In this work, we propose a general method
+ for building analog accelerators for DenseAMs and implementing them using electronic RC circuits, cross-
+ bar arrays, and amplifiers. We find that our analog DenseAM hardware performs inference in constant
+ time independent of model size. This result highlights an asymptotic advantage of analog DenseAMs
+ over digital numerical solvers that scale at least linearly with the model size. We consider three settings
+ of progressively increasing complexity: XOR, the Hamming (7,4) code, and a simple language model
+ defined on binary variables. We propose analog implementations of these three models and analyze the
+ scaling of inference time, energy consumption, and hardware. Finally, we estimate lower bounds on the
+ achievable time constants imposed by amplifier specifications, suggesting that even conservative existing
+ analog technology can enable inference times on the order of tens to hundreds of nanoseconds. By har-
+ nessing the intrinsic parallelism and continuous-time operation of analog circuits, our DenseAM-based
+ accelerator design offers a new avenue for fast and scalable AI hardware.
+
+
+ 1 Introduction
+ The unprecedented growth of artificial intelligence (AI) has driven demand for increasingly large and
+ powerful models. At present, the field of generative AI is primarily driven by two settings: autore-
+ gressive transformers [1] and diffusion models [2]. While these settings have demonstrated remarkable
+ capabilities, they do so at a substantial computational cost. Their current implementations utilize digital
+ computation, which faces fundamental challenges in energy efficiency, scalability, and latency, especially
+ as model sizes and deployment demands continue to grow [3, 4, 5]. These limitations have prompted
+ interest in alternative computational paradigms that can efficiently handle the demands of modern AI
+ workloads [6].
+ Dense Associative Memories (DenseAMs) [7, 8], a promising class of AI models which generalize
+ Hopfield networks [9], offer a new angle for tackling these problems. Unlike conventional feed-forward
+ models, DenseAM inference can be defined through the temporal evolution of a state vector that is
+ governed by a system of differential equations [10]. The state vector can be thought of as a particle
+ exploring the surface of a high-dimensional energy landscape, which is the Lyapunov function of these
+ dynamical equations. DenseAMs have been demonstrated to be flexible and expressive computational
+ frameworks, capable of representing many primitives of modern AI architectures, such as the attention
+ mechanism [11], transformers [12], and diffusion models [13, 14, 15]. Furthermore, DenseAMs are error-
+ correcting systems [16], a property ensuring that small perturbations of the desired temporal evolution
+ of the state vector are corrected away by the dynamics of the network itself, rather than accumulated
+ in time. Finally, DenseAMs are asymptotically stable: during the course of temporal evolution the
+ computation happens during a finite transient period of time, which is followed by a steady state of
+ neural activities. This asymptotic stabilization of dynamical trajectories removes the requirement to read
+out the “answer” to the computation problem at a precise moment of time, making DenseAMs robust
+to several classes of hardware imperfections. The confluence of the above properties makes DenseAMs
+appealing networks for analog hardware implementations that, on the one hand, are grounded in the
+physics of stable error-correcting dynamical systems and, on the other hand, are capable of representing
+computation in state-of-the-art AI networks.
+
+Code available at https://github.com/mbacvanski/AnalogET.
+
+ In 1989, Hopfield argued that analog neural hardware can exceed the efficiency of digital implemen-
+tations when the device physics directly instantiate the computational dynamics of the model itself [17].
+Here, we revisit this idea with DenseAM models: we propose an analog circuit-based hardware accel-
+erator design whose dynamics directly realize those of the DenseAM. We find that analog DenseAM
+hardware enables constant-time inference independent of model size, which is in stark contrast to GPU
+solvers and digital implementations. This intrinsic property makes DenseAM a natural fit for analog AI
+accelerators, and it highlights our circuit architecture as a viable hardware path to realize them. Using
+component specifications already demonstrated in fabricated devices, analog DenseAM hardware may
+achieve inference times on the order of tens to hundreds of nanoseconds, several orders of magnitude
+faster than digital systems.
+ By leveraging the natural dynamics of analog systems, this work establishes a new design of fast and
+scalable AI accelerators. The framework of DenseAMs and their efficient analog hardware implementa-
+tions suggest a pathway for fundamentally redesigning the hardware-software interface for AI, enabling
+a new paradigm for fast, energy-efficient, and scalable computation.
+
+
+2 Dense Associative Memory basics
+The DenseAM framework [10, 18] provides a model that has straightforward neuronal dynamics, yet is
+surprisingly expressive in its ability to represent AI models including transformer attention, diffusion
+models, and associative memories. In its simplest form it is defined by two sets of neurons (typically
+called visible and hidden neurons) and a system of coupled non-linear differential equations governing
+their behavior, see Figure 1. The visible neurons are characterized by their internal states vi and their
+outputs gi , index i = 1 . . . Nv ; while the hidden neurons have internal states hµ and outputs fµ , index
+µ = 1 . . . Nh . From the AI perspective, one can think about the internal state of the neuron as a pre-activation
+of that neuron, and the output as a post-activation, which is obtained by applying an activation function
+to the pre-activation. From the biological perspective, one can think about the internal state of the
+neuron as a membrane voltage potential, and the output of that neuron as an axonal output, or a firing
+rate of that neuron. This framework admits both neuron-wise activation functions (gi = g(vi ), where
+g(·) is some continuous function, e.g., a ReLU), and collective activation functions such as softmax or
+layer normalization, which depend on the states of multiple neurons.
+ The network parameters are stored in the synaptic weights ξ ∈ R^(Nh×Nv), whose matrix elements
+denoted by ξµi can be either hand-engineered or learned. The time decay constants for the two groups
+of neurons are τv and τh . With these conventions, the temporal evolution of the two groups of neurons
+can be expressed as
+\[
+\begin{cases}
+\tau_v \dfrac{dv_i}{dt} = \sum_{\mu=1}^{N_h} \xi_{\mu i} f_\mu + a_i - v_i \\[2ex]
+\tau_h \dfrac{dh_\mu}{dt} = \sum_{i=1}^{N_v} \xi_{\mu i} g_i + b_\mu - h_\mu
+\end{cases}
+\tag{1}
+\]
+
+This forms a bipartite graph of neuronal connections, where the state of the hidden neurons is updated
+by the state of the visible neurons, and vice versa. Importantly, the same matrix ξ appears in both
+equations, once as ξ and again as ξ ⊤ . Although this is sometimes described as using “symmetric”
+weights, ξ is not assumed to be symmetric in the linear-algebraic sense; it is simply the same matrix
+used in both directions. Finally, ai and bµ denote biases, which are additional weights of the system and
+whose values may be hard-coded or learned depending on the application.
+Figure 1: Top left: Bipartite neural network formulation, where hidden neurons hµ and visible neurons
+vi are connected via symmetric synaptic weights ξ. Top right: Circuit realization of the symmetric weight
+matrix via a resistive crossbar array. Each crosspoint encodes a weight ξµi by its resistance Rµi = 1/ξµi .
+Lower right: Circuit schematic of a single hidden neuron. It drives its row of the crossbar array with
+a voltage according to its activation fµ , and its internal dynamics are driven by the incoming current
+flowing into it from the crossbar array. Lower left: Softmax activation function built from bipolar
+junction transistors (some components not shown).
+
+ The most important aspect of this model is the existence of a global energy function (Lyapunov
+function) that describes neuronal dynamics. To demonstrate this, it is most convenient to use the
+Lagrangian formalism [10, 18, 16]. Each set of neurons is defined through a Lagrangian function of their
+internal states. The activation functions are defined as partial derivatives of that Lagrangian with respect
+to internal states. The total energy is the sum of energies of each set of neurons, plus the interaction
+energy. The energy of each set of neurons is a Legendre transformation of the corresponding Lagrangian
+(plus the term proportional to the bias). Thus, the global energy of the system in Equation 1 is given by
+\[
+E = \underbrace{\Big[ \sum_{i=1}^{N_v} g_i (v_i - a_i) - L_v \Big]}_{\text{energy of visible neurons}}
+  + \underbrace{\Big[ \sum_{\mu=1}^{N_h} f_\mu (h_\mu - b_\mu) - L_h \Big]}_{\text{energy of hidden neurons}}
+  - \underbrace{\sum_{\mu=1}^{N_h} \sum_{i=1}^{N_v} f_\mu \xi_{\mu i} g_i}_{\text{interaction energy}}
+\tag{2}
+\]
+where the activation functions are defined as partial derivatives of the Lagrangians
+\[
+g_i = \frac{\partial L_v}{\partial v_i}, \qquad f_\mu = \frac{\partial L_h}{\partial h_\mu}
+\]
+For convex Lagrangians this global energy decreases with time on the dynamical trajectories of Equa-
+tion 1. If, additionally, the activation functions (and corresponding Lagrangians) are chosen in such a
+way that this energy is bounded from below, the dynamical trajectories are guaranteed to arrive at a
+stable fixed point of activations. The dynamical equations typically have many asymptotic fixed points,
+which correspond to local minima of the energy function in Equation 2. Both properties above (convexity
+of Lagrangians and lower-bounded energy) are satisfied for all settings studied in this paper. By picking
+different nonlinear activation functions (or corresponding Lagrangians), this system yields a variety of
+models that can describe associative memory, softmax attention, and other commonly used settings in
+AI [10, 11, 18, 19, 20].
+ A particularly relevant example for modern sequence modeling is the Energy Transformer (ET) [12],
+which reformulates the transformer’s inference pass as a gradient flow on an energy function defined over the
+set of tokens. The ET block contains two contributions to the energy function: attention energy and the
+Hopfield network. The energy attention module routes the information between the tokens, while the
+Hopfield module aligns the tokens with the manifold of token embeddings. In our implementation, the
+context tokens act as a set of dynamically instantiated memories that interact with the predicted token
+through a DenseAM-like energy. In section 6 we exploit this connection to construct an Analog Energy
+Transformer (Analog ET) whose continuous-time dynamics are implemented directly in hardware using
+our DenseAM circuit primitives.
+
+
+3 Related work
+Early analog implementations of associative memories focused on the classical Hopfield network. Founda-
+tional designs, such as continuous-time analog circuits [21, 22] and later demonstrations using amorphous-
+silicon resistors [23], memristive devices [24, 25], and phase-change memories [26], targeted the quadratic
+Hopfield energy function. These works emphasize device engineering and memory-cell design rather than
+system-level dynamics, and inherit the limited storage capacity and representational power of traditional
+Hopfield networks. That line of research is largely concerned with how to fabricate programmable re-
+sistance elements themselves; our work assumes programmable conductances as a given primitive and
+focuses on the continuous-time dynamics that operate on top of them. Our work also differs from these
+works by addressing DenseAMs with higher-order energy functions and continuous-valued states.
+ Another direction is the use of cavity-QED systems for associative memory. Marsh et al. [27] analyze
+a confocal cavity implementation of a quadratic Hopfield network and show that the cavity dynamics
+induce a descent-like relaxation rule on spin states. Their model remains restricted to quadratic energies
+and binary spins, and operates in a cryogenic, cavity-QED setting. Our work instead targets higher-order
+DenseAMs with continuous states, and emphasizes scalable, room-temperature analog microelectronics
+with explicit hardware-aware dynamical analysis.
+ More recent physical implementations move beyond purely quadratic energies. Musa et al. [28]
+propose a free-space optical realization of the higher-order DenseAM energy. Their system constructs a
+static physical representation of the energy landscape, but inference relies on an external digital controller
+that performs iterative spin-flip updates. Thus, the hardware computes energies, while the optimization
+dynamics remain digital. In contrast, our analog microelectronic architecture embeds the gradient flow
+itself into hardware: inference is performed by a single continuous-time evolution rather than by discrete
+digital updates.
+
+
+4 DenseAM circuit design
+Here, we introduce a novel architecture for a class of analog electronic hardware accelerators that models
+Equation 1’s system of nonlinear differential equations using time evolution. Our DenseAM design
+shown in Figure 1 is comprised of two sets of neurons that interact through a resistive crossbar array.
+The resistive crossbar array turns voltage differences between neurons into currents flowing between the
+neurons according to synaptic weights, and each neuron’s internal circuitry converts those currents into
+dynamics that reproduce Equation 1.
+
+Resistive weights as a crossbar array. The crossbar array construction is a canonical design of
+matrix-vector multiplication using analog electronics [17, 29], and is a natural fit for the weight matrix
+ξ in our model. Traditionally, each crosspoint between a row and column line is connected by a resistor
+(often memristors, RRAM, or other analog memories that produce resistances), a vector of input voltages
+is applied at row lines, and the column lines are held at ground typically via a transimpedance amplifier.
+By Ohm’s law, each resistive crosspoint produces a current that multiplies the row’s input voltage by
+the inverse of the resistance. Because currents add along each column line, the total current output at a
+column is the inner product between the vector of input voltages and the column’s conductance vector.
+Thus, the array as a whole implements a parallel analog matrix multiplication of the form Iout = GVin ,
+where G is the matrix of conductances (inverse of resistances).
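As a hedged illustration of the relation Iout = G Vin, the following plain-Python sketch computes the current collected on each grounded output line from the input voltages. The resistance values, the voltages, and the helper name `crossbar_currents` are hypothetical and chosen only for illustration; `resistances[j][i]` is assumed to be the resistance linking input line i to output line j, and the output lines are assumed to sit at an ideal virtual ground.

```python
def crossbar_currents(resistances, v_in):
    """Return the current collected on each grounded output line.

    resistances[j][i] is the (hypothetical) resistance in ohms between
    input line i and output line j; v_in[i] is the voltage on input line i.
    """
    currents = []
    for line in resistances:
        # Ohm's law at each crosspoint (I = V / R); Kirchhoff's current law
        # sums the crosspoint currents along the shared output line.
        currents.append(sum(v / r for v, r in zip(v_in, line)))
    return currents


# Illustrative values only: two output lines, three input lines.
R = [[1e3, 2e3, 4e3],
     [2e3, 2e3, 1e3]]
V = [1.0, 0.5, 0.2]
I_out = crossbar_currents(R, V)  # amperes, one entry per output line
```

Each output is the inner product of the input voltage vector with one line's conductance vector, so all outputs are produced in a single parallel step in the physical array.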
+ Unlike a traditional crossbar array whose rows are driven at a fixed voltage and whose columns
+are held at ground, our DenseAM circuit design uses each weight bidirectionally, exactly representing
+the bidirectional connections between visible and hidden neurons. As a result, the current flowing into
+each neuron corresponds to the weighted sum of the differences between visible and hidden neuron
+activations. For example, for hidden neuron µ, this current is Σ_i ξµi (gi − fµ ). This construction enables
+weight symmetry to be enforced by hardware sharing: both forward and reverse weights are realized by
+the same resistive elements. Importantly, as long as weights are represented as conductances, they must
+be non-negative.
+
+[Figure 2 plot: visible neuron activations (g3 labeled), hidden neuron activations (f3 labeled), and energy versus time (s).]
+Figure 2: Solving XOR with a DenseAM. Visible neuron g3 = v3 serves as the output, while the two
+input neurons (unlabeled, thin lines) are clamped at 1 and 0 for True and False. Output v3 is
+initialized at 0.5 and converges to a positive prediction of 1. The hidden neuron f3 for the
+truth-table row (1, 0, 1) becomes highly activated, while the others (fine lines) are suppressed by softmax.
+Energy (2), or equivalently (5), decreases monotonically along the inference trajectory.
+
+[Figure 3 plot: energy versus v3 in four panels, one per clamped input (v1 , v2 ) = (0, 0), (0, 1), (1, 0), (1, 1).]
+Figure 3: XOR energy landscape of neuron v3 under different settings of visible input neurons v1 and
+v2 . Minima in the energy function correspond to stationary points of the dynamics. Gradient flow
+dynamics bring v3 to these attractor points, resulting in correct XOR outputs.
+
+
+Design of a single neuron. Each neuron in the circuit computes its dynamics by integrating the currents
+it receives from the crossbar array, which represent weighted differences between its own activation
+and those of connected neurons. Consider a hidden neuron (the design for visible neurons is analogous).
+The neuron’s internal voltage hµ is stored on capacitor C1 and evolves in continuous time,
+while the neuron’s activation fµ is obtained by passing hµ through a nonlinear function (e.g. ReLU or
+softmax).
+ The current flowing into hidden neuron µ is produced by its interaction with all visible neurons via
+the synaptic weights ξµi for i = 1, . . . , Nv . Specifically, this current is a weighted sum of the differences
+between neuron activations: Σ_i ξµi (gi − fµ ). Inside each neuron, a “self” path scales fµ to produce the
+voltage sµ = fµ Σ_i ξµi . This term is added to the value of the incoming current so that the −fµ Σ_i ξµi
+term is cancelled inside each neuron. As a result, the hidden state, represented as the voltage across
+capacitor C1 , integrates only the desired weighted input plus any external stimulus bµ . Its dynamics
+reduce to the canonical DenseAM form with a time constant of R2 C1 :
+\[
+R_2 C_1 \frac{dh_\mu}{dt} = \sum_{i=1}^{N_v} \xi_{\mu i} g_i + b_\mu - h_\mu
+\tag{3}
+\]
+
+Elementwise (or vectorized) nonlinearities then produce activations gi = g(vi ) and fµ = f (hµ ) (e.g.,
+ReLU, softmax) across the visible and hidden neurons. See Appendix A for the full circuit derivation.
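The cancellation performed by the “self” path can be checked numerically. The sketch below is only an arithmetic illustration in normalized units, not a circuit model; the weights, activations, and the helper name `net_drive` are hypothetical.

```python
def net_drive(xi_row, g, f_mu):
    """Return (crossbar term, self-path term, their sum) for one hidden neuron."""
    # Bidirectional crossbar: weighted sum of activation differences.
    crossbar = sum(w * (g_i - f_mu) for w, g_i in zip(xi_row, g))
    # Self path: s_mu = f_mu * sum_i xi_mu_i, added back inside the neuron.
    self_path = f_mu * sum(xi_row)
    return crossbar, self_path, crossbar + self_path


xi_row = [0.0, 1.0, 1.0]  # hypothetical weights (conductances) of neuron mu
g = [0.2, 0.9, 0.7]       # hypothetical visible activations
f_mu = 0.6                # hypothetical activation of the hidden neuron
crossbar, self_path, total = net_drive(xi_row, g, f_mu)

# The term the neuron should integrate according to Equation 3 (before bias and leak).
desired = sum(w * g_i for w, g_i in zip(xi_row, g))
```

With these numbers the crossbar term alone is 0.4, while adding the self-path term recovers the desired drive of 1.6, independent of the neuron's own activation.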
+
+
+5 Analog DenseAM Examples
+We begin by studying two examples of the proposed design: the XOR task, and the (7,4) error-correcting
+Hamming code.
+
+5.1 XOR
+The XOR problem is a canonical test for nonlinear representation and inference, as it cannot be solved
+by any linear model. We show a minimal DenseAM model for the XOR task, illustrating how energy-
+based dynamics can solve this simple task with a continuous-time analog system. The network consists
+of Nv = 3 visible neurons, and Nh = 4 hidden neurons. At t = 0 visible neurons v1 and v2 are initialized
+at their input values corresponding to the input bits. The last visible neuron v3 is initialized at v3 = 0.5.
+The hidden neurons are initialized at zero. The two input visible neurons remain clamped during the
+dynamics, while the third output visible neuron and the hidden neurons evolve in time according to (1).
+Each row of the memory matrix ξ corresponds to a row of the XOR truth table. The visible neurons
+use an identity activation function where gi = vi , and the hidden neurons use a softmax activation. The
+biases are set as
+\[
+a_i = 0, \qquad b_\mu = -\frac{1}{2} \sum_{i=1}^{N_v} \xi_{\mu i}^2
+\]
+
+ Figure 2 shows the temporal evolution of visible and hidden neuron activations, as well as the total
+energy, during inference on the XOR input (1, 0). The output visible neuron’s activation g3 gradually
+converges to the correct prediction of 1, while the hidden neuron associated with that memory, f3 ,
+becomes strongly activated and the remaining hidden neurons are suppressed by the softmax nonlinearity.
+The system’s energy decreases monotonically throughout the trajectory and stabilizes once the network
+reaches its fixed-point prediction. Figure 3 depicts the system’s energy landscape as a function of output
+neuron v3 for different clamped input configurations (v1 , v2 ). In each case, the energy exhibits a clear
+convex minimum at the correct XOR output, demonstrating that gradient flow along the energy surface
+drives v3 reliably toward the correct prediction. As shown in Appendix C, we validate our circuit design
+and dynamics using SPICE simulation.
+ To analyze this DenseAM, it is instructive to consider the limit τh → 0. Since the second equation in
+(1) is linear in hidden units hµ , they can be integrated out. With Σ_µ fµ = 1, the resulting dynamics
+of the visible neurons can be written as
+\[
+\tau_v \frac{dv_i}{dt} = \sum_{\mu=1}^{N_h} \big( \xi_{\mu i} - v_i \big) f_\mu
+\quad \text{where} \quad
+f_\mu = \mathrm{softmax}\Big( -\frac{\beta}{2} \sum_{i=1}^{N_v} (\xi_{\mu i} - v_i)^2 \Big)
+\tag{4}
+\]
+The effective energy on the visible neurons can be written as
+\[
+E^{\mathrm{eff}}(v) = -\frac{1}{\beta} \log \sum_{\mu=1}^{N_h} \exp\Big[ -\frac{\beta}{2} \sum_{i=1}^{N_v} (\xi_{\mu i} - v_i)^2 \Big]
+\tag{5}
+\]
+
+
+Intuitively, each hidden neuron computes a squared Euclidean distance between the visible state and its
+stored pattern ξ µ . The softmax nonlinearity assigns higher weight to the pattern closest to the current
+state of the visible neurons. The resulting visible neuron dynamics are gradient flow for this effective
+energy. It is important to note that memories in this implementation are represented by conductances
+of the crossbar array, which are always non-negative. For this reason, matrix elements of memories ξµi must
+be non-negative, necessitating the use of the bias terms, which are just voltage sources that can be arbitrarily
+signed.
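The effective energy of Equation 5 can be evaluated directly to reproduce the qualitative picture of Figure 3. The sketch below is a minimal illustration, not the authors' code: the value β = 5 is an assumption (β is not specified in this section), and the function names and grid resolution are hypothetical.

```python
import math

XI = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]  # rows of the XOR truth table
BETA = 5.0  # assumed inverse temperature, chosen for illustration


def effective_energy(v, beta=BETA):
    """Equation 5: -(1/beta) * log sum_mu exp(-(beta/2) * |xi_mu - v|^2)."""
    exponents = [-0.5 * beta * sum((x - vi) ** 2 for x, vi in zip(row, v))
                 for row in XI]
    m = max(exponents)  # log-sum-exp shift for numerical stability
    return -(m + math.log(sum(math.exp(e - m) for e in exponents))) / beta


def best_output(v1, v2, steps=100):
    """Grid-search v3 in [0, 1] for the lowest effective energy."""
    grid = [k / steps for k in range(steps + 1)]
    return min(grid, key=lambda v3: effective_energy((v1, v2, v3)))


# Threshold the energy-minimizing v3 for each clamped input pair.
predictions = {(a, b): round(best_output(a, b)) for a in (0, 1) for b in (0, 1)}
```

Under these assumptions the minimum over v3 sits close to, but not exactly at, the correct binary output, because a finite β leaves a small pull from the other stored patterns.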
+ While a time constant of τh = 0 is impossible to physically construct due to finite conductances
+and nonzero capacitances, setting τh ≪ τv realizes the same adiabatic limit in practice. When hidden
+neurons evolve much faster than visible ones, they reach their steady state almost instantaneously for each
+configuration of visible neurons. The result is an adiabatic elimination of hidden dynamics, yielding the
+effective visible-only dynamics above. In practice, for the XOR task, even a relatively modest τh = τv /10
+ratio yields perfect performance.
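The XOR inference described above can be sketched in software by integrating Equation 1 with a simple forward-Euler scheme. This is a hedged illustration rather than the authors' simulation: β = 5, the step size, the end time, and the function names are assumptions, while the initialization, clamping, biases, and the ratio τh = τv/10 follow the text.

```python
import math

XI = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]        # rows of the XOR truth table
BIAS_H = [-0.5 * sum(x * x for x in row) for row in XI]  # b_mu = -(1/2) sum_i xi_mu_i^2


def softmax(h, beta):
    m = max(h)  # shift for numerical stability
    e = [math.exp(beta * (x - m)) for x in h]
    z = sum(e)
    return [x / z for x in e]


def run_xor(bit1, bit2, beta=5.0, tau_v=1.0, tau_h=0.1, dt=1e-3, t_end=10.0):
    """Forward-Euler integration of Equation 1 with v1 and v2 clamped."""
    v = [float(bit1), float(bit2), 0.5]  # output neuron starts at 0.5
    h = [0.0] * len(XI)                  # hidden neurons start at zero
    for _ in range(int(t_end / dt)):
        f = softmax(h, beta)
        # tau_v dv3/dt = sum_mu xi_mu3 f_mu + a_3 - v3, with a_3 = 0
        dv3 = (sum(row[2] * f_mu for row, f_mu in zip(XI, f)) - v[2]) / tau_v
        # tau_h dh_mu/dt = sum_i xi_mu_i g_i + b_mu - h_mu, with g_i = v_i
        dh = [(sum(x * vi for x, vi in zip(row, v)) + b - h_mu) / tau_h
              for row, b, h_mu in zip(XI, BIAS_H, h)]
        v[2] += dt * dv3
        h = [h_mu + dt * d for h_mu, d in zip(h, dh)]
    return v[2]


outputs = {(a, b): run_xor(a, b) for a in (0, 1) for b in (0, 1)}
```

Thresholding each converged v3 at 0.5 gives the XOR of the clamped inputs. A digital solver like this one takes many sequential steps, whereas the analog circuit performs the same relaxation as a single continuous-time evolution.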
+
+5.2 Hamming (7,4) code
+The Hamming (7,4) code is an error-correcting code that encodes 4 data bits into a 7-bit codeword by
+adding 3 parity bits. The resulting 7-bit strings are special: only certain patterns are valid codewords,
+and they are spaced apart so that if a single bit is flipped, the error can be detected and corrected [30].
+Table 1 lists the 16 codewords corresponding to four arbitrary data bits.
+
+Data bits (d1 d2 d3 d4 )    Codeword (c1 c2 c3 c4 c5 c6 c7 )
+0000                        0000000
+0001                        0001111
+0010                        0010110
+0011                        0011001
+0100                        0100101
+0101                        0101010
+0110                        0110011
+0111                        0111100
+1000                        1000011
+1001                        1001100
+1010                        1010101
+1011                        1011010
+1100                        1100110
+1101                        1101001
+1110                        1110000
+1111                        1111111
+
+Table 1: Valid codewords of the Hamming(7,4) code, ordered by their 4-bit data payload.
+
+[Figure 4 plot: visible neuron activations (g5 labeled), hidden neuron activations (f7 labeled), and energy versus time (s).]
+Figure 4: Correcting a bit error in a Hamming (7,4) code. Visible neuron g5 flips indicating the
+bit flip error happened on the 5th codeword bit. f7 is the hidden neuron corresponding to the memory
+of the correct codeword. Thin lines correspond to the other neuron activations.
+
+
+ Unlike the XOR case where the only evolving neuron is the readout bit, the Hamming (7,4) code may
+require flipping the value of any one of the visible neurons. During inference, the visible neurons are
+initialized to the corrupted 7-bit input word. All neurons are left free to evolve, and the dynamics relax
+the state toward the nearest stored codeword. Energy minima are located at the valid codewords, so the
+network converges to the correct code provided the error is within the Hamming radius of 1. Thus, the
+DenseAM replicates the standard decoding property of the Hamming (7,4) code: any single-bit flip is
+corrected automatically. Figure 4 illustrates a case where a flipped bit g5 is restored during convergence.
+ The Hamming (7,4) model’s 7 visible neurons, each corresponding to a codeword bit, are connected
+to 16 hidden neurons, each representing one valid codeword. The weight matrix ξ ∈ {0, 1}^(16×7) is formed
+by stacking the 16 codewords as its rows. Visible neurons have the identity activation, hidden neurons
+use a softmax activation, and biases are chosen as in the XOR case to give the same integrated-out
+visible dynamics as (4).
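The decoding behavior can be sketched by integrating the visible-only dynamics of Equation 4 with the 16 codewords of Table 1 as memories. This is an illustrative sketch under stated assumptions, not the authors' implementation: β = 5, the Euler step, the end time, and the helper names `decode` and `flip` are hypothetical, and the example codeword is chosen only to loosely mirror Figure 4.

```python
import math

# The 16 valid codewords listed in Table 1; each becomes one row of xi.
CODEWORDS = ["0000000", "0001111", "0010110", "0011001",
             "0100101", "0101010", "0110011", "0111100",
             "1000011", "1001100", "1010101", "1011010",
             "1100110", "1101001", "1110000", "1111111"]
XI = [[int(c) for c in word] for word in CODEWORDS]


def decode(word, beta=5.0, tau_v=1.0, dt=0.05, t_end=8.0):
    """Relax a 7-bit word under Equation 4 and threshold the result at 0.5."""
    v = [float(bit) for bit in word]
    for _ in range(int(t_end / dt)):
        # Softmax over negative squared distances to each stored codeword.
        expo = [-0.5 * beta * sum((x - vi) ** 2 for x, vi in zip(row, v))
                for row in XI]
        m = max(expo)
        w = [math.exp(e - m) for e in expo]
        z = sum(w)
        f = [x / z for x in w]
        # tau_v dv_i/dt = sum_mu (xi_mu_i - v_i) f_mu
        dv = [sum((row[i] - v[i]) * f_mu for row, f_mu in zip(XI, f)) / tau_v
              for i in range(len(v))]
        v = [vi + dt * d for vi, d in zip(v, dv)]
    return "".join("1" if vi > 0.5 else "0" for vi in v)


def flip(word, position):
    """Flip one bit (0-based position) of a bit string."""
    bit = "0" if word[position] == "1" else "1"
    return word[:position] + bit + word[position + 1:]


corrupted = flip("0110011", 4)  # error on the 5th codeword bit
restored = decode(corrupted)
```

Because the corrupted word is at squared distance 1 from the correct codeword and at least 2 from every other codeword, the softmax weight concentrates on the correct memory and the flipped bit is pulled back across the 0.5 threshold.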
+
+
+6 Analog Energy Transformer (Analog ET) via DenseAM
+Our DenseAM circuit construction can be used to build more complex energy-based models, such as
+the transformer-like architecture proposed in the Energy Transformer paper [12]. For causal next-token
+prediction with a single attention head, the Energy Transformer’s energy function can be written as the
+following (See Appendix J for full derivation):
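+\[
+E = \tfrac{1}{2} \lVert v - a \rVert^2
+  - v^\top \Big( (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}} + (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}} \Big)
+  + (f^{\mathrm{attn}})^\top \big( h^{\mathrm{attn}} - b \big)
+  + (f^{\mathrm{hopf}})^\top \big( h^{\mathrm{hopf}} - c \big)
+  - L^{\mathrm{attn}}\big( h^{\mathrm{attn}} \big) - L^{\mathrm{hopf}}\big( h^{\mathrm{hopf}} \big)
+\tag{6}
+\]
+
+with the activation functions and their Lagrangians defined as
+\[
+f_A^{\mathrm{attn}} = \mathrm{softmax}\big( \beta h^{\mathrm{attn}} \big)_A, \qquad
+L^{\mathrm{attn}}(h) = \frac{1}{\beta} \log \sum_{A=1}^{L} e^{\beta h_A}
+\tag{7}
+\]
+\[
+f_\mu^{\mathrm{hopf}} = \mathrm{ReLU}\big( h_\mu^{\mathrm{hopf}} \big), \qquad
+L^{\mathrm{hopf}}(h) = \frac{1}{2} \sum_{\mu=1}^{M} \big[ \mathrm{ReLU}(h_\mu) \big]^2
+\tag{8}
+\]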
+
+where a, b, and c correspond to the biases of the visible neurons, attention hidden neurons, and Hopfield
+network hidden neurons, respectively. The L context tokens are indexed by A, and the M hidden neurons
+of the Hopfield network are indexed by µ. Because the visible units use an identity activation function,
+gi = vi in the language of Equation 1, the gradient flow of the energy yields the dynamics:
+\[
+\tau_v \dot{v} = -\frac{\partial E}{\partial v} = (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}} + (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}} + a - v
+\tag{9}
+\]
+\[
+\tau_h \dot{h}^{\mathrm{attn}} = -\frac{\partial E}{\partial f^{\mathrm{attn}}} = \xi^{\mathrm{attn}} v + b - h^{\mathrm{attn}}
+\tag{10}
+\]
+\[
+\tau_h \dot{h}^{\mathrm{hopf}} = -\frac{\partial E}{\partial f^{\mathrm{hopf}}} = \xi^{\mathrm{hopf}} v + c - h^{\mathrm{hopf}}
+\tag{11}
+\]
+
+Figure 5: Analog ET circuit demonstrating the autoregressive inference procedure. A newly inferred
+token is decoded, sampled, and re-embedded to obtain the weight vector ξ^attn_(L+1) , which is set as the weight
+vector for a new hidden neuron h^attn_(L+1) in the energy attention block (light gray on right). For this layout
+we have flipped the crossbar array, so that indices A and µ run horizontally and index i runs vertically.
+
+In this formulation, v represents the embedding of the output (next) token, and its evolution is driven by
+two terms: one term from the energy attention with weights ξattn and hidden neuron activations f attn ,
+and one term from the Hopfield network with weights ξ hopf and hidden neuron activations f hopf . The
+weights of the energy attention DenseAM are dependent on the context: for a token dimension D, context
+length L, and the task of predicting the token at index L + 1, the weights ξ^attn ∈ R^(L×D) are generated
+by applying a learned embedding matrix to each context token.
+In contrast, the Hopfield network weights ξ^hopf are learned during training and fixed at inference. The
+number of memories in the Hopfield network is a hyperparameter M , such that ξ^hopf ∈ R^(M×D) .
+ This system suggests a hardware implementation where v interacts with two independent DenseAMs,
+one for the energy attention and one for the Hopfield term, which can share the same physical crossbar
+structure. Figure 5 shows that the circuit structure remains a crossbar array (like Figure 1), but with
+two distinct classes of hidden neurons. Because of the summation of currents along each row of the
+crossbar array, the incoming current to visible neuron vi is the sum of contributions from the energy
+attention block and from the Hopfield network block. The energy attention hidden neurons hattn use a
+softmax activation function, while the Hopfield network hidden neurons hhopf use a ReLU activation.
+
+6.1 Analog Energy Transformer on the parity task
+We build and evaluate the Analog ET on the L-bit parity task, which can be thought of as an elementary
+“language model”: given bits bit_1 , . . . , bit_L , predict bit_(L+1) = (Σ_{A=1}^{L} bit_A ) mod 2. Parity is instructive
+because it requires a representation of a global, order-L interaction, precluding linear and shallow models
+from representing it efficiently. A successful model must be able to form high-order interactions in order
+to generalize. We formulate parity as a next-token prediction problem: given an L-bit string as context,
+predict its parity in the next token.
+ We train the Analog ET model digitally using backpropagation through time [31] implemented with
+JAX’s automatic differentiation. The resulting weights can be deployed onto the analog hardware; in
+our experiments we simulate the dynamics of the hardware with the Diffrax [32] ODE solver library. On
+the 8-bit parity task, our model achieves 100% accuracy on the hold-out validation set of 52 bit strings,
+demonstrating clear generalization capabilities. See Appendix H.1 for more details on training and model
+design.
+
+[Figure 6 plot: two panels titled “11001010 0” and “01000110 1”; rows show visible neurons, prediction, and energy versus time t.]
+Figure 6: Inference of parity Analog ET on two example 8-bit strings. Top row plots the visible neurons vi
+over time, middle row plots the decoded token prediction, bottom row plots the energy that monotonically
+decreases during inference. After a transient period of computation, the network arrives at a steady
+state, making the result of the computation robust against the precise timing of the readout.
+
+ Figure 6 shows the dynamics of the visible neurons and energy during two example inference runs
+of the Analog ET. Notably, the visible neuron values are constant by the end of the inference period,
+meaning that the readout is robust to mismatch and delay in timing. A
+single sample-and-hold and switching circuit would enable a single analog-to-digital converter (ADC) to
+read out all the visible neurons at convergence, significantly reducing mismatch and saving
+device area, complexity, and energy. The intrinsic stability of attractor points arises uniquely from
+the continuous-time dynamics of the DenseAM, making these models particularly well suited to analog
+hardware.
+
+6.2 Autoregressive inference
+Dashed lines in Figure 5 illustrate the autoregressive inference procedure of the Analog ET. To generate
+the L-th token given context tokens x^(1) , . . . , x^(L−1) , each token is first embedded and the embeddings are stacked to
+form the attention weight matrix
+\[
+\xi^{\mathrm{attn},(L-1)} =
+\begin{pmatrix} e^{(1)} \\ e^{(2)} \\ \vdots \\ e^{(L-1)} \end{pmatrix}
+\in \mathbb{R}^{(L-1) \times D}
+\]
+
+These rows are loaded into the Analog ET’s energy attention weight matrix ξ attn by programming the
+corresponding crossbar resistances. During inference, the visible state v(t) evolves according to the
+Analog ET dynamics until convergence. A decoder readout (e.g., a linear layer) applied to the converged
+v(t = T ) values produces logits, from which the next token x(L) is sampled. This token is then embedded
+to form e(L) , and appended to the existing context. The cycle repeats with the updated attention weight
+matrix
+\[
+\xi^{\mathrm{attn},(L)} = \begin{bmatrix} \xi^{\mathrm{attn},(L-1)} \\ e^{(L)} \end{bmatrix} \in \mathbb{R}^{L \times D}
+\]
+
+which now includes the new embedding e(L) . In hardware, this corresponds to connecting an additional
+hidden neuron in the energy attention block of Figure 5, and setting its resistive weights with e(L) .
+Because the physical order of hidden neurons does not affect the energy function, this new neuron can
+be placed in any position among the hidden neurons. When the context length is fixed, the hidden
+neuron corresponding to the earliest token can simply be reprogrammed with the new vector of weights
+e(L) , resulting in the hardware equivalent of a sliding-window context. In practice, an external digital
+controller, e.g., a Field-Programmable Gate Array (FPGA) or Application-Specific Integrated Circuit
+(ASIC), would orchestrate crossbar programming and token decoding, while the DenseAM dynamics
+perform the far more substantial workload of computing each next-token embedding.
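The autoregressive loop described in this subsection can be sketched in software as follows. This is only an illustrative toy, not the authors' implementation: it models the energy attention block alone with the dynamics tau dv/dt = xi^T softmax(beta xi v) - v, integrates them with a plain Euler step, uses a hypothetical three-token one-hot vocabulary (`EMBED`), and replaces sampling with a greedy nearest-embedding decoder. The names `relax`, `update_context`, `decode`, and `generate` are all assumptions introduced here.

```python
import math

# Hypothetical 3-token vocabulary with one-hot embeddings (illustrative only).
EMBED = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def relax(xi, v0, beta=2.0, tau=1.0, dt=0.05, steps=400):
    """Euler-integrate tau * dv/dt = xi^T softmax(beta * xi v) - v (attention block only)."""
    v = list(v0)
    for _ in range(steps):
        f = softmax([beta * sum(w * x for w, x in zip(row, v)) for row in xi])
        pull = [sum(f[m] * xi[m][i] for m in range(len(xi))) for i in range(len(v))]
        v = [x + (dt / tau) * (p - x) for x, p in zip(v, pull)]
    return v


def update_context(xi, e_new, max_len=None):
    """Append e_new as a new row (a new hidden neuron). With a fixed context
    length, drop the oldest row instead; row order does not affect the energy."""
    if max_len is not None and len(xi) >= max_len:
        return xi[1:] + [list(e_new)]
    return xi + [list(e_new)]


def decode(v):
    """Toy greedy decoder: the token whose embedding has the largest dot product with v."""
    return max(EMBED, key=lambda t: sum(a * b for a, b in zip(EMBED[t], v)))


def generate(tokens, n_new, max_len=4):
    xi = []
    for t in tokens:
        xi = update_context(xi, EMBED[t], max_len)
    out = list(tokens)
    for _ in range(n_new):
        v = relax(xi, v0=xi[-1])  # assumption: start from the latest embedding
        t = decode(v)
        out.append(t)
        xi = update_context(xi, EMBED[t], max_len)
    return out, xi
```

Calling `generate([0, 1], 3, max_len=4)` returns five tokens and a weight matrix capped at four rows, mirroring the sliding-window behaviour described above. In hardware the `relax` call is replaced by the physical settling of the circuit, and only `update_context` and `decode` would run on the digital controller.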
+ This procedure is analogous to key-value (KV) caching in standard transformer inference [33]. Context
+tokens x(1) , . . . , x(L−1) produce key and value vectors k(1) , . . . , k(L−1) and v(1) , . . . , v(L−1) respectively.
+When new token x(L) is generated, its corresponding k(L) and v(L) vectors are appended to the cache,
+allowing all previous k(<L) and v(<L) to be reused without recomputation. When the key and value
+matrices are tied so that k(A) = v(A) , the ET’s row-append operation is equivalent to the standard KV-
+cache update. The ET performs an autoregressive rollout that reproduces the same recurrence structure
+as KV-cached transformer inference, but implemented physically through the addition of new neurons
+and weights without touching existing hardware. For a formal derivation of the equivalence between ET
+attention and conventional attention with tied keys and values, see [12].
+
+
+7 Scaling properties
+Inference time and energy consumption are crucial characteristics of our system. This section investigates
+these metrics with respect to the network size.
+
+7.1 Inference time scaling
+We consider the model defined by (4) and (5). In the adiabatic limit (τh → 0), which is satisfied by our hardware
+implementation, the time derivative of the energy can be written as
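+\[
+\frac{dE^{\mathrm{eff}}}{dt} = \sum_{i=1}^{N_v} \frac{\partial E^{\mathrm{eff}}}{\partial v_i} \frac{dv_i}{dt} = -\frac{1}{\tau_v} \sum_{i=1}^{N_v} \left( \frac{\partial E^{\mathrm{eff}}}{\partial v_i} \right)^2 \sim -\frac{N_v}{\tau_v} \tag{12}
+\]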
+
+This derivative is never positive, since the dynamical system performs gradient descent on the
+energy landscape, and it vanishes once the network state vector v converges to the
+steady state. Since the state vector v is typically initialized in the vicinity of the memory vectors, which
+are chosen to be of order one (∼ 1), the right hand side of (4) is of order one too, independent of the
+network size. This results in the characteristic value of the temporal derivative shown in (12).
+ At the same time, the typical value1 of the energy (5) is
+\[
+|E^{\mathrm{eff}}| \sim N_v + \frac{1}{\beta} \log(N_h) \tag{13}
+\]
+During the inference dynamics the network is initialized in a high energy state, which has the charac-
+teristic value of energy (13), and performs energy descent to a lower value of the energy (which has a
+similar order of magnitude). To estimate the scaling of the time required to perform this energy
+descent, one can take the ratio of the energy drop to the rate of energy decrease (12). This gives the
+following estimate
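+\[
+T^{\mathrm{conv}} \sim \frac{|E^{\mathrm{eff}}|}{\left| \frac{dE^{\mathrm{eff}}}{dt} \right|} \sim \tau_v \left( 1 + \frac{1}{\beta} \frac{\log(N_h)}{N_v} \right) \sim \tau_v \tag{14}
+\]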
+
+The last ∼ sign holds since in none of the designs presented here does Nh grow super-exponentially in
+Nv . In fact, in all the use cases Nh is sub-exponential in Nv .
+ 1 We estimate the absolute value of the energy, since it can be both positive and negative depending on the mutual
+arrangement of memories, the state vector, and the number of hidden units.
+
+
+ This back-of-the-envelope estimation provides the core intuition behind the scaling relationship.
+The inference time is constant, and independent of the size of the network. A more careful analysis
+(Appendix E) shows that in the high-β regime the worst-case dependence is $O\!\left( \frac{\tau_v}{\beta} \frac{\log N_h}{N_v} \right)$, which
+remains bounded for all architectures we consider. Thus, for our settings the convergence time is ef-
+fectively constant in Nv and Nh . Based on amplifier gain–bandwidth, slew-rate, and output-current
+constraints, we estimate achievable inference times of tens to hundreds of nanoseconds using existing
+CMOS technology (see Appendix I.2).
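As a worked illustration of estimate (14), the short sketch below evaluates tau_v (1 + log(N_h) / (beta N_v)) for a few network sizes. It is a back-of-the-envelope calculator only: the function name `t_conv_estimate` and the example sizes are assumptions made here, and the order-one prefactors that the ∼ signs hide are set to one.

```python
import math


def t_conv_estimate(n_v, n_h, beta, tau_v=1.0):
    """Estimate (14): T_conv ~ tau_v * (1 + log(N_h) / (beta * N_v)), prefactors set to 1."""
    return tau_v * (1.0 + math.log(n_h) / (beta * n_v))


# Polynomial growth of N_h in N_v: the correction term shrinks with size.
small = t_conv_estimate(n_v=10, n_h=10 ** 2, beta=1.0)
large = t_conv_estimate(n_v=1000, n_h=1000 ** 2, beta=1.0)

# Exponential growth N_h = 2^N_v: the correction stays at log(2) / beta.
exp_small = t_conv_estimate(n_v=10, n_h=2 ** 10, beta=1.0)
exp_large = t_conv_estimate(n_v=1000, n_h=2 ** 1000, beta=1.0)
```

With polynomially many hidden units the estimate approaches tau_v as the network grows, and even with exponentially many hidden units it stays at tau_v (1 + log(2)/beta), which is the sense in which the last ∼ in (14) holds unless N_h grows super-exponentially in N_v.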
+
+7.2 Scaling of energy consumption
+We now analyze how the total inference energy scales with network size. Energy dissipation arises
+primarily from (i) Ohmic loss in the resistive weights, (ii) charging of neuron-state capacitors, and (iii)
+constant per-neuron overhead from amplifiers and bias currents. We show that, under bounded voltage
+swings and fixed conductance budgets, total energy grows only linearly with the number of neurons.
+
+Weight dissipation. Let the neuron output voltages be proportional to activations: u = κg and
+w = κf , where κ is a fixed voltage swing. Such a bounded swing can always be enforced by global
+rescaling of ξ, β, and voltage units without changing the dynamics (see Appendix F). The instantaneous
+power dissipated by the resistive crossbar array is
+\[
+P_{\mathrm{weights}}(t) = \sum_{\mu=1}^{N_h} \sum_{i=1}^{N_v} \xi_{\mu i} \, (u_i - w_\mu)^2 \tag{15}
+\]
+With 0 ≤ gi ≤ 1, f given by a softmax, and column/row conductance budgets $\sum_\mu \xi_{\mu i} \le C_c$, $\sum_i \xi_{\mu i} \le C_r$, the total
+power obeys
+
+\[
+P_{\mathrm{weights}}(t) \le 2\kappa^2 (C_c N_v + C_r) = O(N_v) \tag{16}
+\]
+
+For a runtime of duration T ∼ T conv , the energy dissipated by the weights is therefore Eweights = O(Nv T ),
+where T ∼ τv is independent of network size by subsection 7.1.
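The bound (16) can be checked numerically on random instances. The sketch below is an illustration under stated assumptions, not part of the paper's analysis: conductances are drawn uniformly at random, the budgets C_c and C_r are taken to be the largest column and row sums of that random matrix, and the helper names (`weight_power`, `power_bound`, `random_check`) are hypothetical.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def weight_power(xi, g, f, kappa):
    """Equation (15) with u = kappa * g and w = kappa * f."""
    return sum(xi[m][i] * (kappa * g[i] - kappa * f[m]) ** 2
               for m in range(len(xi)) for i in range(len(g)))


def power_bound(xi, kappa):
    """Right-hand side of (16), using the tightest budgets for this matrix."""
    n_h, n_v = len(xi), len(xi[0])
    c_c = max(sum(xi[m][i] for m in range(n_h)) for i in range(n_v))  # column sums over mu
    c_r = max(sum(row) for row in xi)                                 # row sums over i
    return 2.0 * kappa ** 2 * (c_c * n_v + c_r)


def random_check(n_v, n_h, kappa=1.0, seed=0):
    rng = random.Random(seed)
    xi = [[rng.random() for _ in range(n_v)] for _ in range(n_h)]
    g = [rng.random() for _ in range(n_v)]                 # 0 <= g_i <= 1
    f = softmax([rng.gauss(0.0, 1.0) for _ in range(n_h)])  # softmax output
    return weight_power(xi, g, f, kappa), power_bound(xi, kappa)
```

For any sizes passed to `random_check`, the first returned value (the instantaneous power) should not exceed the second (the bound), since (u_i - w_mu)^2 is at most 2 kappa^2 (g_i^2 + f_mu^2) and the two resulting sums are controlled by the column and row budgets respectively.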
+
+Capacitive and overhead energy. Each neuron charges a local capacitor a finite number of times
+by at most Vswing ∼ κ, giving
+\[
+E_{\mathrm{cap}} \le \kappa^2 \left( \sum_i C_i^{(v)} + \sum_\mu C_\mu^{(h)} \right) = O(N_v + N_h) \tag{17}
+\]
+
+Active bias and amplifier inefficiencies contribute fixed per-neuron power, yielding Eother = O((Nv + Nh )T ).
+
+Total energy scaling. With bounded voltage swing and conductance budgets,
+
+ Etotal = O(Nv + Nh ) (18)
+
+Hence, the total inference energy scales only linearly with system size. For the full derivation, see
+Appendix G.
+
+7.3 Scaling of hardware area
+The area is dominated by two components: the synaptic weights, which are implemented
+as a crossbar array with programmable weights, and the neurons feeding
+the crossbar array. The area of the crossbar array scales as the number of weights O(Nv Nh ). The area
+of the neurons scales as O(Nv + Nh ).
+
+
+8 Conclusion
+In this paper, we have presented an analog accelerator architecture for Dense Associative Memories,
+implemented using resistive crossbar arrays and continuous-time RC neuron dynamics. Our design im-
+plements DenseAM inference as time evolution of a physical dynamical system, rather than a sequence of
+ discrete numerical update steps. We demonstrated this architecture with three representative settings of
+increasing complexity: XOR, Hamming (7,4) error decoding, and an Energy Transformer-style sequence
+model. These examples show that the analog DenseAM accelerator architecture covers both associative
+memory tasks and attention-based sequence models.
+ Our analysis shows that DenseAM accelerators enjoy favorable asymptotic scaling properties. Inference
+time is constant in model size and is governed
+primarily by the physical time constants of the circuit. This is in sharp contrast to digital implementations
+of the same dynamics, whose runtime must grow at least linearly with model size.
+ To assess hardware feasibility, we derived lower bounds on the neuronal time constants imposed by
+amplifier gain-bandwidth product, slew rate, and output current limits in our neuron design. Reported
+figures from representative CMOS OTAs in the literature give inference times on the order of tens to
+hundreds of nanoseconds, even with conservative design margins. Combined with the constant scaling of
+inference time with model size, these estimates suggest that DenseAM accelerators can match or beat the
+latency of digital GPUs as models grow, without requiring exotic devices or beyond-CMOS technologies.
+ Our results highlight DenseAMs as a natural abstraction for analog AI hardware. Their error-correcting
+dynamics and asymptotic stability directly address long-standing concerns about robustness and
+readout timing: small perturbations are corrected by the dynamics instead of accumulating, and the final
+state is stable, so readout can happen anywhere within a wide temporal window. At the same time, the DenseAM
+framework is expressive enough to capture modern primitives such as attention and transformer-like ar-
+chitectures, as illustrated by our Analog Energy Transformer construction. These properties suggest that
+DenseAM-based analog accelerators may be a promising substrate for future AI systems, and motivate
+further co-design of models, dynamics, and devices.
+
+Acknowledgements
+MGB would like to thank Faiz Muhammad for exploratory attempts at SPICE simulations. DK would
+like to thank Kwabena Boahen for helpful discussions.
+
+
+References
+ [1] Ashish Vaswani et al. “Attention is all you need”. In: arXiv preprint arXiv:1706.03762 (2017).
+ [2] Jascha Sohl-Dickstein et al. “Deep unsupervised learning using nonequilibrium thermodynamics”.
+ In: International conference on machine learning. pmlr. 2015, pp. 2256–2265.
+ [3] Norman P Jouppi et al. “In-datacenter performance analysis of a tensor processing unit”. In:
+ Proceedings of the 44th annual international symposium on computer architecture. 2017, pp. 1–12.
+ [4] Eric Masanet et al. “Recalibrating global data center energy-use estimates”. In: Science 367.6481
+ (2020), pp. 984–986.
+ [5] David Patterson et al. “Carbon emissions and large neural network training”. In: arXiv preprint
+ arXiv:2104.10350 (2021).
+ [6] Maxwell Aifer et al. “Solving the compute crisis with physics-based ASICs”. In: arXiv preprint
+ arXiv:2507.10463 (2025).
+ [7] Dmitry Krotov and John J Hopfield. “Dense associative memory for pattern recognition”. In:
+ Advances in neural information processing systems 29 (2016).
+ [8] Dmitry Krotov and John Hopfield. “Dense associative memory is robust to adversarial inputs”. In:
+ Neural computation 30.12 (2018), pp. 3151–3167.
+ [9] John J Hopfield. “Neural networks and physical systems with emergent collective computational
+ abilities.” In: Proceedings of the national academy of sciences 79.8 (1982), pp. 2554–2558.
+[10] Dmitry Krotov and John J Hopfield. “Large Associative Memory Problem in Neurobiology and
+ Machine Learning”. In: International Conference on Learning Representations. 2021.
+[11] Hubert Ramsauer et al. “Hopfield networks is all you need”. In: arXiv preprint arXiv:2008.02217
+ (2020).
+[12] Benjamin Hoover et al. “Energy transformer”. In: Advances in Neural Information Processing
+ Systems 36 (2024).
+
+
+
+ [13] Benjamin Hoover et al. “Memory in plain sight: A survey of the uncanny resemblances between
+ diffusion models and associative memories”. In: arXiv preprint arXiv:2309.16750 (2023).
+[14] Luca Ambrogioni. “In search of dispersed memories: Generative diffusion models are associative
+ memory networks”. In: arXiv preprint arXiv:2309.17290 (2023).
+[15] Bao Pham et al. “Memorization to generalization: Emergence of diffusion models from associative
+ memory”. In: arXiv preprint arXiv:2505.21777 (2025).
+[16] Dmitry Krotov et al. “Modern methods in associative memory”. In: arXiv preprint arXiv:2507.06211
+ (2025).
+[17] JJ Hopfield. “The effectiveness of analogue ‘neural network’ hardware”. In: Network: Computation
+ in Neural Systems 1.1 (1990), p. 27.
+[18] Dmitry Krotov. “Hierarchical associative memory”. In: arXiv preprint arXiv:2107.06446 (2021).
+[19] Fei Tang and Michael Kopp. “A remark on a paper of Krotov and Hopfield [arXiv: 2008.06996]”. In:
+ arXiv preprint arXiv:2105.15034 (2021).
+[20] Benjamin Hoover et al. “A universal abstraction for hierarchical hopfield networks”. In: The Sym-
+ biosis of Deep Learning and Differential Equations II. 2022.
+[21] John J Hopfield. “Neurons with graded response have collective computational properties like those
+ of two-state neurons.” In: Proceedings of the national academy of sciences 81.10 (1984), pp. 3088–
+ 3092.
+[22] David W Tank and John J Hopfield. “Simple “Neural” optimization networks: an A/D converter,
+ signal decision circuit, and a linear programming circuit”. In: Artificial neural networks: theoretical
+ concepts. 1988, pp. 87–95.
+[23] HP Graf et al. “VLSI implementation of a neural network memory with several hundreds of neu-
+ rons”. In: AIP conference proceedings. Vol. 151. 1. American Institute of Physics. 1986, pp. 182–
+ 187.
+[24] Xinjie Guo et al. “Modeling and experimental demonstration of a Hopfield network analog-to-
+ digital converter with hybrid CMOS/memristor circuits”. In: Frontiers in neuroscience 9 (2015),
+ p. 488.
+[25] SG Hu et al. “Associative memory realized by a reconfigurable memristive Hopfield neural net-
+ work”. In: Nature communications 6.1 (2015), p. 7522.
+[26] Sukru B Eryilmaz et al. “Brain-like associative learning using a nanoscale non-volatile phase change
+ synaptic device array”. In: Frontiers in neuroscience 8 (2014), p. 205.
+[27] Brendan P Marsh et al. “Enhancing associative memory recall and storage capacity using confocal
+ cavity QED”. In: Physical Review X 11.2 (2021), p. 021048.
+[28] Khalid Musa et al. “Dense Associative Memory in a Nonlinear Optical Hopfield Neural Network”.
+ In: arXiv preprint arXiv:2506.07849 (2025).
+[29] Carver Mead and Mohammed Ismail. Analog VLSI implementation of neural systems. Vol. 80.
+ Springer Science & Business Media, 2012.
+[30] Richard W Hamming. “Error detecting and error correcting codes”. In: The Bell system technical
+ journal 29.2 (1950), pp. 147–160.
+[31] Paul J Werbos. “Backpropagation through time: what it does and how to do it”. In: Proceedings
+ of the IEEE 78.10 (1990), pp. 1550–1560.
+[32] Patrick Kidger. “On Neural Differential Equations”. PhD thesis. University of Oxford, 2021.
+[33] Zihang Dai et al. “Transformer-xl: Attentive language models beyond a fixed-length context”.
+ In: Proceedings of the 57th annual meeting of the association for computational linguistics. 2019,
+ pp. 2978–2988.
+[34] Jacob Sillman. “Analog Implementation of the Softmax Function”. In: arXiv preprint arXiv:2305.13649
+ (2023).
+[35] John J Hopfield and David W Tank. “Computing with neural circuits: A model”. In: Science
+ 233.4764 (1986), pp. 625–633.
+[36] Aldo Pena Perez and Franco Maloberti. “Performance enhanced op-amp for 65nm CMOS tech-
+ nologies and below”. In: 2012 IEEE International Symposium on Circuits and Systems (ISCAS).
+ IEEE. 2012, pp. 201–204.
+
+
+ Figure 7: Circuit for a single neuron.
+
+
+[37] Rida S Assaad and Jose Silva-Martinez. “The recycling folded cascode: A general enhancement of
+ the folded cascode amplifier”. In: IEEE Journal of Solid-State Circuits 44.9 (2009), pp. 2535–2542.
+[38] Alec Yen and Benjamin J Blalock. “A High Slew Rate, Low Power, Compact Operational Ampli-
+ fier Based on the Super-Class AB Recycling Folded Cascode”. In: 2020 IEEE 63rd International
+ Midwest Symposium on Circuits and Systems (MWSCAS). IEEE. 2020, pp. 9–12.
+[39] Mohammad H Naderi, Suraj Prakash, and Jose Silva-Martinez. “Operational transconductance
+ amplifier with class-B slew-rate boosting for fast high-performance switched-capacitor circuits”.
+ In: IEEE Transactions on Circuits and Systems I: Regular Papers 65.11 (2018), pp. 3769–3779.
+[40] Franz Schlögl and Horst Zimmermann. “A design example of a 65 nm CMOS operational amplifier”.
+ In: International Journal of Circuit Theory and Applications 35.3 (2007), pp. 343–354.
+
+
+A Neuron Design
+Figure 7 shows the circuit design of a single neuron, with labels corresponding to this being a hidden
+neuron at index µ. We derive the dynamics of the neuron internal state hµ and activation output voltage
+fµ . We proceed using only Kirchhoff’s Current Law (KCL) and the definition of an ideal op-amp.
+
+Assumptions and conventions.
+ • Ideal op-amps: infinite open-loop gain, infinite input impedance (no input current), zero output
+ impedance. Under stable negative feedback this enforces a virtual short V+ = V− .
+ • Current Jµ : we define Jµ as the current which flows from fµ to mµ through R1 .
+
+ • Op-amp input labels: We denote the inverting and noninverting inputs of each op-amp explicitly,
+ e.g. U 2− for the inverting input of U2, U 3+ for the noninverting input of U3, etc.
+ • Node labels: Label mµ as the output of U1, sµ as the output of U2, and dµ as the output of U3.
+ The neuron pre-activation state is labeled hµ , and the post-activation state is labeled fµ . Voltage
+ bµ (as an ideal voltage source) drives the bias for this neuron. Voltages hµ , bµ , and fµ correspond
+ directly to the state variables in equation (1).
+
+
+
+
+ Block U1: buffer of activation voltage fµ . Op-amp U1 buffers the output of the activation function
+f (·) and drives the output of the neuron, fµ . Because no current can flow into U 1− , all the current
+flowing into this neuron must flow through R1 to mµ and is sourced or sunk by U1’s output node.
+
+Block U2: non-inverting stage producing sµ from fµ and mµ . The positive input of U2 is
+U 2+ = fµ , and by U2’s virtual short, the negative input U 2− = U 2+ = fµ . By KCL at U 2− ,
+\[
+\frac{U2^-}{R_{10}} = \frac{s_\mu - U2^-}{R_9} \;\Rightarrow\; s_\mu = \left( 1 + \frac{R_9}{R_{10}} \right) f_\mu \tag{19}
+\]
+
+Block U3: non-inverting stage producing dµ from sµ , bµ , and mµ . By KCL at the positive input
+of U3,
+\[
+\frac{b_\mu - U3^+}{R_3} + \frac{s_\mu - U3^+}{R_4} = \frac{U3^+}{R_5} \;\Rightarrow\; U3^+ = \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} \tag{20}
+\]
+KCL at the negative input of U3 gives us
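+\[
+\frac{m_\mu - U3^-}{R_6} + \frac{-U3^-}{R_7} = \frac{U3^- - d_\mu}{R_8} \;\Rightarrow\; d_\mu = U3^- \left( 1 + R_8 \left( \frac{1}{R_6} + \frac{1}{R_7} \right) \right) - \frac{R_8 m_\mu}{R_6} \tag{21}
+\]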
+The virtual short of U3 means U 3− = U 3+ . Combining equations (20) and (21), we get
+\[
+d_\mu = \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} m_\mu \tag{22}
+\]
+
+Dynamics of RC circuit. R2 and C1 form an RC circuit driven by voltage dµ . The voltage across
+the capacitor hµ follows the relation
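+\begin{align}
+R_2 C_1 \frac{dh_\mu}{dt} &= -h_\mu + d_\mu \nonumber \\
+&= -h_\mu + \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} m_\mu \tag{23}
+\end{align}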
+With incoming current. Take the incoming current $J_\mu = \sum_i \xi_{\mu i} (g_i - f_\mu)$. This produces a voltage
+drop across R1 such that $m_\mu = f_\mu - R_1 J_\mu = f_\mu - R_1 \sum_i \xi_{\mu i} (g_i - f_\mu)$. Then, the dynamics of hµ from
+equation (23) are
+\[
+R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} (f_\mu - R_1 J_\mu) \tag{24}
+\]
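+Substituting in sµ from equation (19) and Jµ :
+\[
+R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 \left( 1 + \frac{R_9}{R_{10}} \right) f_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} \left( f_\mu - R_1 \sum_i \xi_{\mu i} (g_i - f_\mu) \right) \tag{25}
+\]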
+
+Equal-resistance special case. Set R1 = R3 = R4 = R5 = R6 = R7 = R8 . Then, equation (25)
+reduces to
+\[
+R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + b_\mu + \frac{R_9}{R_{10}} f_\mu + \sum_i \xi_{\mu i} (g_i - f_\mu) \tag{26}
+\]
+
+
+Selection of R9 /R10 self-term gain. Evidently, in order to match the form of equation (1), we need
+to cancel the $-f_\mu \sum_i \xi_{\mu i}$ term that appears on the right hand side of equation (26). The R9 /R10 term
+allows us to do that by setting
+\[
+\frac{R_9}{R_{10}} = \sum_i \xi_{\mu i} \tag{27}
+\]
+
+Taking equation (27)’s assignment to R9 and R10 simplifies equation (26) into
+\[
+R_2 C_1 \frac{dh_\mu}{dt} = \sum_i \xi_{\mu i} g_i - h_\mu + b_\mu \tag{28}
+\]
+which exactly matches our desired dynamics.
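The reduction from equation (25) to equation (28) can be sanity-checked numerically. The sketch below evaluates the general right-hand side built from equations (19), (20), (22), and (23), and compares it with the target form for hypothetical values. It assumes all resistances R1 and R3 through R8 are normalised to 1 in the chosen units, so that the factor R1 multiplying the incoming current is unity; with a shared value other than 1 that factor would remain in front of the current term. The function names and all numeric values are illustrative assumptions.

```python
def neuron_rhs(h, b, f, J, R1, R3, R4, R5, R6, R7, R8, ratio_9_10):
    """Right-hand side of R2*C1*dh/dt for the single-neuron circuit."""
    s = (1.0 + ratio_9_10) * f                                   # eq. (19)
    u3 = (R4 * R5 * b + R3 * R5 * s) / (R4 * R5 + R3 * R5 + R3 * R4)  # eq. (20)
    m = f - R1 * J                                               # voltage drop across R1
    d = u3 * (R6 * R7 + R8 * (R6 + R7)) / (R6 * R7) - (R8 / R6) * m   # eq. (22)
    return -h + d                                                # eq. (23)


def target_rhs(h, b, xi_row, g):
    """Desired form, eq. (28): sum_i xi_i * g_i - h + b."""
    return sum(w * x for w, x in zip(xi_row, g)) - h + b


# Hypothetical operating point for one hidden neuron.
xi_row = [0.2, 0.5, 0.1]      # conductances of this neuron's crossbar connections
g = [1.0, 0.0, 0.7]           # visible-neuron output voltages
f, h, b = 0.3, 0.4, -0.2      # this neuron's output, state, and bias

J = sum(w * (x - f) for w, x in zip(xi_row, g))   # incoming current
ratio = sum(xi_row)                               # eq. (27): R9/R10 = sum_i xi_i

R = 1.0  # equal-resistance case in normalised units (assumption)
circuit = neuron_rhs(h, b, f, J, R, R, R, R, R, R, R, ratio)
wanted = target_rhs(h, b, xi_row, g)
```

For these values both `circuit` and `wanted` evaluate to -0.33 up to rounding, which is the cancellation of the self-term that the choice (27) is designed to achieve.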
+
+
+ Figure 8: Crossbar Array. Each pentagon contains a neuron of design in Figure 7. In this layout we
+have flipped the crossbar array, so that index µ runs horizontally and index i runs vertically.
+
+
+A.1 Activation function
+The voltage across C1 gives us the dynamics of the neuron internal state hµ . Figure 7 contains a block
+representing a nonlinear amplifier, denoted f (·), whose input is hµ and whose output is fµ = f (hµ ). This
+voltage is buffered with U1 onto the neuron output line, labeled fµ , which is what other neurons “see”
+in the crossbar array. The chosen activation function does not affect the rest of the dynamics of the
+neuron. Particularly, the activation function need not be element-wise: a vector-wise activation function
+like softmax can be readily applied instead.
+
+A.2 Neurons interacting in a network
+So far we have examined the dynamics of a single neuron, treating as an assumption that the neuron will
+receive an incoming current $J_\mu = \sum_i \xi_{\mu i} (g_i - f_\mu)$. Now, we will show how to wire these neurons together
+to realize this. Figure 8 shows the simplest DenseAM construction where each pentagonal node is a
+circuit of the design in Figure 7. Each neuron exposes a single node whose voltage is driven at the activation
+of the neuron, and which accepts an incoming current which it uses to drive its dynamics. Each hidden
+neuron fµ is connected to a visible neuron gi via a resistance $R_{\mu i} = 1/\xi_{\mu i}$ that is the inverse of the weight
+it represents. The current flowing into node fµ is $J_\mu = \sum_i \frac{1}{R_{\mu i}} (g_i - f_\mu)$, which is the assumption needed
+for equation (24). This same analysis holds for other hidden and visible neurons, and so together they
+realize the large dynamical system of (1).
+
+A.3 SPICE Netlist
+Following is the SPICE netlist for the single neuron circuit, using ideal op-amps. Component values are
+omitted for brevity. There is no nonlinearity here; adding one would be a matter of inserting a nonlinear
+amplifier between node h_µ and XU1’s positive terminal.
+R1 f_µ m_µ
+XU1 f_µ h_µ m_µ opamp Aol=100K GBW=10Meg
+XU2 u2- f_µ s_µ opamp Aol=100K GBW=10Meg
+R2 u2- 0
+R3 s_µ u2-
+R4 u3+ s_µ
+R5 u3+ 0
+XU3 u3- u3+ d_µ opamp Aol=100K GBW=10Meg
+R6 u3- m_µ
+R7 d_µ u3-
+R8 d_µ h_µ
+C1 h_µ 0
+V§b_µ N001 0
+R9 u3+ N001
+R10 u3- 0
+
+ Figure 9: Softmax circuit design
+
+
+B Softmax Circuit
+For demonstration purposes, we follow the construction of an analog softmax circuit using bipolar junc-
+tion transistors (BJTs) described in [34]. Figure 9 shows the design of a four-way softmax circuit using
+BJTs. The softmax function we aim to produce is:
+\[
+\mathrm{softmax}_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \quad i = 1, \ldots, N \tag{29}
+\]
+
+ For the µth BJT in the circuit, the collector current IC,µ can be expressed in terms of the base voltage
+hµ and the emitter voltage VE when in the forward-active mode as:
+\[
+I_{C,\mu} = I_S e^{V_{BE,\mu}/V_T}, \quad V_{BE,\mu} = h_\mu - V_E \;\Rightarrow\; I_{C,\mu} = I_S e^{(h_\mu - V_E)/V_T} \tag{30}
+\]
+where IS is the BJT’s saturation current and VT is the thermal voltage. Assuming large BJT β (note:
+this β is unrelated to the softmax β)2 , we can neglect base currents, so that IC,µ ≈ IE,µ . Applying KCL at
+the shared emitter node VE gives the total current $I_{EE} = \sum_{\mu=1}^{N_h} I_{C,\mu}$. We can expand the expression for the
+collector currents to get the currents in terms of node voltages:
+\begin{align}
+I_{EE} &= \sum_{\mu=1}^{N_h} I_S e^{(h_\mu - V_E)/V_T} \nonumber \\
+&= \sum_{\mu=1}^{N_h} \frac{I_S e^{h_\mu/V_T}}{e^{V_E/V_T}} \tag{31}
+\end{align}
+
+Simultaneously, the current IEE is also fixed by the ideal current source, so IC,µ can also be expressed
+as the ratio of the branch current to the total current: $I_{C,\mu} = \frac{I_{C,\mu}}{I_{EE}} I_{EE}$. Plugging in (30) for IC,µ and
+(31) for IEE in the denominator and canceling the term containing VE ,
+\[
+I_{C,\mu} = \frac{e^{h_\mu/V_T}}{\sum_{j=1}^{N_h} e^{h_j/V_T}} I_{EE} \tag{32}
+\]
+
+This already has the form of the ideal softmax function. The voltage at node fµ is created by
+current flowing through resistor Rµ , producing a voltage drop relative to VCC . Specifically, the voltage
+$f_\mu = V_{CC} - \frac{e^{h_\mu/V_T}}{\sum_{j=1}^{N_h} e^{h_j/V_T}} I_{EE} R_\mu$. When IEE Rµ = 1, this voltage fµ is a negated and shifted softmax with a
+range of 1 volt. This scale and negation can be easily corrected with an op amp, which is also needed
+to isolate the node and prevent loading. Note that VCC must be chosen to be a positive supply in order
+for the BJTs to remain in the forward-active mode.
+ 2 In BJTs, β denotes the ratio of the collector current to the base current. High BJT β indicates the transistor is able to
+amplify a small base current into a much larger collector current, allowing the BJT to function as an amplifier or switch.
+A high β reflects that the BJT can efficiently transmit carriers from emitter to collector, without losing them to the base.
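The cancellation of the emitter voltage in equation (32) can be illustrated numerically with the idealised exponential BJT law of equation (30). The sketch below is a toy model under stated assumptions: base currents are neglected, the thermal voltage, saturation current, tail current, and base voltages are hypothetical round numbers, and the function names are introduced here for illustration only.

```python
import math

V_T = 0.02585  # thermal voltage near room temperature, in volts (assumed value)


def collector_currents(h, i_ee, v_t=V_T):
    """Equation (32): I_C = I_EE * softmax(h / V_T), with no reference to V_E."""
    m = max(h)
    w = [math.exp((x - m) / v_t) for x in h]
    total = sum(w)
    return [i_ee * x / total for x in w]


def emitter_voltage(h, i_ee, i_s, v_t=V_T):
    """Solve equation (31), I_EE = sum_mu I_S * exp((h_mu - V_E) / V_T), for V_E."""
    m = max(h)
    s = sum(math.exp((x - m) / v_t) for x in h)
    return m + v_t * math.log(i_s * s / i_ee)


def node_voltages(h, i_ee, v_cc=5.0, v_t=V_T):
    """f_mu = V_CC - I_C,mu * R_mu with the load chosen so that I_EE * R_mu = 1 V."""
    r = 1.0 / i_ee
    return [v_cc - ic * r for ic in collector_currents(h, i_ee, v_t)]


# Hypothetical operating point: four base voltages, 1 mA tail current.
h = [0.00, 0.01, 0.02, 0.05]
i_ee, i_s = 1e-3, 1e-14

ratio_form = collector_currents(h, i_ee)
v_e = emitter_voltage(h, i_ee, i_s)
direct_form = [i_s * math.exp((x - v_e) / V_T) for x in h]  # equation (30)
f = node_voltages(h, i_ee)
```

Here `direct_form` (computed through the emitter voltage) and `ratio_form` (computed without it) agree, and the quantities V_CC - f sum to one, which is the negated and shifted softmax described in the text.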
+
+
+ Parameter Value
+ RF 1000 Ω
+ RT 1 Ω
+ R1 1 Ω
+ R2 , R3 , . . . , R8 10 000 Ω
+ RS 40 Ω
+ C 10 µF
+ a3 0 V
+ b1 0 V
+ b2 −1 V
+ b3 −1 V
+ b4 −1 V
+
+ Table 2: Component and parameter values.
+
+
+C XOR DenseAM Circuit
+Figure 10 is a full circuit diagram of the DenseAM that solves the XOR problem. Given input voltages
+at V1, V2∈ {0, 1}, the output voltage at g3 is the result of the XOR operation between V1 and V2. In
+this model, the visible neuron is linear, and the hidden neurons share a softmax activation function im-
+plemented by a set of bipolar junction transistors. Table 2 lists the component values used in simulation.
+
+
+Visible neurons. In the XOR task, only one visible neuron is left evolving, corresponding to the output
+column of the truth table. As such, the first two neurons are clamped to the input voltages, represented
+by V1 and V2. The third visible neuron, highlighted in blue, is a linear unit with no nonlinear activation:
+the internal state voltage v3 directly drives the output, setting g3 = v3 . This is the same circuit described
+in Appendix A, except that the activation block is omitted.
+
+Hidden neurons. The XOR task requires four hidden neurons, highlighted in green. These are iden-
+tical circuit constructions with the exception of the voltage sources bµ for the biases, which are set
+according to the values in Table 2. Unlike the visible neuron, the hidden neurons have a softmax activa-
+tion function, such that fµ = softmaxµ (h).
+
+Softmax activation function. The red highlight marks the same softmax circuit described in Appendix B,
+composed of BJTs, resistors, a voltage source for VCC , and a current source for IEE . We
+use 2N5088 transistors in our model, a standard and widely available BJT. Noninverting
+buffers (U10, U11, etc.) are used to prevent loading effects on the state capacitors Cµ from current draw
+of the BJT base in forward-active mode. As discussed in Appendix B, the softmax circuit itself produces
+an output voltage of
+\[
+f_i = V_{CC} - \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \quad i = 1, \ldots, N
+\]
+
+When VCC = 5V as in this circuit, this requires extra circuitry, highlighted in yellow, to shift and negate
+the softmax output. This is done by first buffering the voltage output to prevent loading effects, followed
+by a summing op amp that subtracts VCC and inverts the softmax output. For the first hidden neuron
+h1 (lower left of figure), op-amp U2 buffers the voltage output, while U1 is configured in an inverting
+summing configuration to add −5 V (the negative of VCC ) to the buffered voltage output, producing the
+correct softmax output.
+
+Weight matrix. The weight matrix ξ is implemented by resistors R1 -R12 .
+These are set directly according to the XOR truth table, where each row corresponds to one hidden
+neuron. A boolean value of 1 (RT ) is set to be a high conductance (a 1 Ω resistor), while a boolean value of 0 (RF )
+is set to be a relatively small conductance (a 1 kΩ resistor).
+ The resistor ratio that sets the gain si /gi (R9 /R10 in Appendix A) is set to be the sum of the conductances in that neuron’s crossbar
+column. The column for neuron 1 has 3 RF resistances, whose conductances sum to $3 \times 10^{-3}$. Hence,
+neuron 1’s R47 /R46 = 3/1000. The crossbar columns for neurons 2, 3, and 4 have 2 RT resistances
+and one RF resistance, whose conductances sum to approximately 2. Hence, we approximate R59 /R56 = 2000/1000
+and similarly for hidden neurons 3 and 4.
+
+ Figure 10: Full schematic for XOR DenseAM built with 1 evolving linear visible neuron and 4 hidden neurons with softmax activation. Blue: visible neuron.
+ Green: hidden neurons. Yellow: buffers for softmax activation circuit. Red: analog softmax circuit.
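The self-term gains quoted above follow from summing conductances along each hidden neuron's connections. The arithmetic can be reproduced with the short sketch below, which assumes that hidden neurons 1 to 4 correspond to the XOR truth-table rows in the order (0,0,0), (0,1,1), (1,0,1), (1,1,0); that ordering and the names `XOR_ROWS` and `self_gain` are assumptions made for illustration.

```python
R_T = 1.0      # ohms, encodes boolean 1 (high conductance)
R_F = 1000.0   # ohms, encodes boolean 0 (low conductance)

# Rows of the XOR truth table (input 1, input 2, output); one row per hidden neuron.
XOR_ROWS = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]


def self_gain(row):
    """Sum of the conductances attached to one hidden neuron, i.e. the R9/R10 ratio of eq. (27)."""
    return sum(1.0 / (R_T if bit else R_F) for bit in row)


gains = [self_gain(row) for row in XOR_ROWS]
```

The first entry of `gains` is 3/1000, and the remaining three are 2.001, which the text rounds to 2000/1000.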
+
+
+D Design and implementation variations
+A large design space remains open across analog electronics and other substrates for realizing DenseAMs,
+with clear speed–energy–area–precision trade-offs. In electronics, the core primitives admit multiple re-
+alizations: passive, nonvolatile weights (e.g., memristors, triode-region or floating-gate transistors, and
+other programmable conductors); active, gained weights via OTAs; and nonlinearities via diode clamps,
+reverse-biased diode/BJT exponentials, MOS quadratic regions, or translinear blocks. Architectures in
+the spirit of [35, 23] are compact but couple synaptic values to neuronal time constants, making dynamics
+drift when a single weight changes—problematic for learning and consistent timing—whereas our decou-
+pled neuron preserves a fixed time constant under weight updates. Simpler neuron/network topologies
+likely exist and can be attractive in resource-constrained regimes, provided their deviations from the
+target ODEs are validated not to degrade performance. Beyond CMOS, photonics (e.g., overdamped,
+low-Q microring resonators) can naturally implement first-order ODEs and can offer extreme bandwidth
+with distinct calibration and noise constraints. Across these options, open problems include robust
+weight storage/programmability and drift control, mixed-signal learning rules compatible with device
+limits, scaling under current/GBW/SR constraints, tolerance to mismatch/noise, and algorithm–circuit
+co-design to exploit substrate-specific advantages.
+
+
+E Scaling of inference time
+There are two conditions under which inference times should be studied, dependent on the softmax
+temperature β. In the low-β regime, the DenseAM reaches equilibria with multiple hidden neurons
+“competing” in the softmax, while in the high-β regime, the DenseAM reaches equilibria with only one
+hidden neuron “winning out” in the softmax. Intuitively, the high-β regime corresponds to exact memory
+recall, while the low-β regime corresponds to interpolation. The XOR and Hamming (7,4) code are in
+the high-β regime, while the energy transformer lies in the low-β regime. In both regimes, we find that
+the DenseAM converges in time that is constant with respect to the number of neurons.
+
+Assumptions.
+(A1) There is a per-synapse device limit of 0 ≤ ξµi ≤ Gmax where Gmax is the maximum conductance
+ set by the physics of the crossbar crosspoints. Because f is the output of a softmax, fµ ≥ 0 and $\sum_\mu f_\mu = 1$, so
+\[
+\sum_\mu \xi_{\mu i} f_\mu \le G_{\max} \tag{33}
+\]
+ and the RHS of the visible neuron dynamics is O(1).
+ There exist both column-sum and row-sum budgets that are enforced by the hardware, since each
+ neuron’s output stage can only source/sink a finite amount of current while maintaining GBW/SR
+ margins. This dictates a per-column and per-row conductance budget to stay within this maximum
+ current, resulting in
+\[
+\sum_{i}^{N_v} \xi_{\mu i} \le C_r \;\; \forall \mu, \qquad \sum_{\mu}^{N_h} \xi_{\mu i} \le C_c \;\; \forall i \tag{34}
+\]
+
+
+ Weights can only be positive since conductances can only be positive, so ξµi ≥ 0.
+ As a corollary of (A1), note also that we can bound ∥ξ µ ∥2 ≤ S ∀µ, and since ∥ξµ ∥2 ≤ ∥ξ µ ∥1 , then
+ ∥ξ µ ∥2 ≤ Cc ∀µ.
+(A2) Bounded biases. |ai | ≤ A, |bµ | ≤ B for all i, µ. In realistic regimes, this typically holds; for
+ example, a typical choice for boolean functions is $b_\mu = -\frac{\beta}{2} \|\xi_\mu\|^2$ (seen in Section 5.1).
+
+
+
+ Model. Take the system of equation (1) with a softmax activation on hidden neurons and an identity
+activation on visible neurons. For clarity we assume 0 biases on visible neurons, but they do not change
+the analysis.
+
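+\[
+\tau_v \dot{v} = \xi^\top f + a - v, \qquad \tau_h \dot{h} = \xi v + b - h, \qquad f = \mathrm{softmax}_\beta(h) \tag{35}
+\]
+Integrating out the hidden units,
+\begin{align}
+\tau_v \dot{v} &= \xi^\top f(v) - v, \tag{36} \\
+f(v) &= \mathrm{softmax}\big( \beta (\xi v + b) \big) \tag{37}
+\end{align}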
+
+yields the effective energy function expressed in terms of visible neurons:
+    $$E(v) = \frac{1}{2}\|v\|^2 - \frac{1}{\beta} \log \sum_{\mu} \exp\!\big(\beta(\xi_\mu^\top v + b_\mu)\big) \tag{38}$$
+
+
+where ∇E(v) = v − ξ ⊤ f (v). Because τv v̇ = −∇E(v), we see that the dynamical trajectory causes the
+energy to monotonically decrease over time:
+    $$\frac{d}{dt} E(v(t)) = \nabla E(v(t))^\top \dot{v} = -\frac{1}{\tau_v} \|\nabla E(v(t))\|^2 \le 0 \tag{39}$$
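+
+The monotone decrease in (39) can be checked numerically. The following is a minimal sketch, assuming a
+forward-Euler discretization of (36); the weights, biases, β, step size, and the helper names
+(softmax, logits, energy, euler_step) are illustrative assumptions and are not taken from the paper.
+

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]

def logits(v, xi, b, beta):
    """beta * (xi v + b), one entry per hidden neuron."""
    return [beta * (sum(w * x for w, x in zip(row, v)) + bm) for row, bm in zip(xi, b)]

def energy(v, xi, b, beta):
    """Effective energy E(v) of equation (38)."""
    s = logits(v, xi, b, beta)
    m = max(s)
    lse = (m + math.log(sum(math.exp(x - m) for x in s))) / beta
    return 0.5 * sum(x * x for x in v) - lse

def euler_step(v, xi, b, beta, tau_v, dt):
    """One forward-Euler step of tau_v * dv/dt = xi^T f(v) - v, equation (36)."""
    f = softmax(logits(v, xi, b, beta))
    drive = [sum(xi[mu][i] * f[mu] for mu in range(len(xi))) for i in range(len(v))]
    return [x + dt / tau_v * (d - x) for x, d in zip(v, drive)]

# Illustrative values (assumptions, not taken from the paper).
xi = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
b = [-0.5 * sum(w * w for w in row) for row in xi]
beta, tau_v, dt = 4.0, 0.1, 1e-3

v = [0.9, 0.2, 0.1]
energies = [energy(v, xi, b, beta)]
for _ in range(2000):
    v = euler_step(v, xi, b, beta, tau_v, dt)
    energies.append(energy(v, xi, b, beta))

# Energy should never increase along the discretized trajectory (up to rounding).
monotone = all(e1 <= e0 + 1e-12 for e0, e1 in zip(energies, energies[1:]))
print("energy non-increasing:", monotone)
```

+
+The small tolerance only absorbs floating-point noise near the fixed point; the continuous-time statement
+(39) itself needs no tolerance.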
+
+E.1 Low-β regime
+The energy landscape in the low-β regime exhibits uniform strong convexity, so the gradient flow dy-
+namics cause the energy gap to decay exponentially, reaching an ϵ-fraction of the original energy gap
+in constant time. To show E(v) is α-strongly convex, we must show ∇2 E(v) ⪰ αI for some α > 0.
+This means that all the eigenvalues of the Hessian are ≥ α. Equivalently, λmin (∇2 E) ≥ α. Denote
+G(f ) = Diag(f ) − ff ⊤ ⪰ 0, which is the Jacobian of the softmax function f (v) = softmax(β(ξv + b)).
+
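+    $$\nabla^2 E(v) = I - \beta\, \xi^\top G(f)\, \xi \tag{40}$$
+    $$\lambda_{\min}\big(\nabla^2 E(v)\big) = \lambda_{\min}\big(I - \beta\, \xi^\top G(f)\, \xi\big) \tag{41}$$
+    $$= 1 - \beta\, \lambda_{\max}\big(\xi^\top G(f)\, \xi\big) \tag{42}$$
+    $$\Rightarrow\; \nabla^2 E(v) \succeq \big(1 - \beta\, \lambda_{\max}(\xi^\top G(f)\, \xi)\big)\, I \tag{43}$$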
+
+Because 0 ⪯ G(f) ⪯ Diag(f) ⪯ I, the matrix ξ⊤G(f)ξ is also PSD, and G(f) is a probability-weighted
+covariance with $\sum_\mu f_\mu = 1$, so
+    $$\lambda_{\max}(\xi^\top G(f)\xi) \le \mathrm{tr}(\xi^\top G(f)\xi) \le \sum_\mu f_\mu \|\xi_\mu\|^2 \le \max_\mu \|\xi_\mu\|^2 \tag{44}$$
+
+
+Denote $S^2 = \max_\mu \|\xi_\mu\|^2$, which is bounded as in (A1). Therefore, the Hessian of E can be bounded as
+
+    $$\nabla^2 E(v) \succeq (1 - \beta S^2)\, I = \alpha I \tag{45}$$
+
+where α = 1 − βS². Then α > 0 when β < 1/S² = 1/maxµ ∥ξµ∥². This is a sufficient (but not necessary)
+condition for the system to be in the low-β (uniformly convex) regime, where the softmax is diffuse
+enough that its covariance term does not contribute so much negative curvature as to overwhelm the
+positive curvature contributed by the identity term. In this regime, the uniform lower bound on the
+Hessian implies α-strong convexity, which gives the PL inequality
+    $$\frac{1}{2}\|\nabla E(v)\|^2 \ge \alpha\,(E(v) - E^\star) \tag{46}$$
+Together with (39), this allows us to bound the time constant of gradient flow:
+
+    $$\frac{d}{dt}\big(E(v(t)) - E^\star\big) = -\frac{1}{\tau_v}\|\nabla E(v(t))\|^2 \le -\frac{2\alpha}{\tau_v}\big(E(v(t)) - E^\star\big) \tag{47}$$
+
+
+If the curvature is bounded below by α, then the gradient magnitude grows at least linearly with distance
+to the minimum, so the energy function is “steep enough” to guarantee exponential convergence.
+Integrating,
+    $$E(v(t)) - E^\star \le \big(E(v(0)) - E^\star\big)\, e^{-\frac{2\alpha}{\tau_v} t} \tag{48}$$
+This indicates exponential decay of the energy gap. Reaching an ϵ-fraction of the original energy
+gap therefore takes time
+    $$T(\epsilon) \le \frac{\tau_v}{2\alpha} \log\frac{1}{\epsilon} = O(\tau_v \log(1/\epsilon)) \tag{49}$$
+which is entirely independent of system size Nv and Nh . In the energy transformer case, this means that
+convergence time is entirely independent of context length L and token dimension D.
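+
+As a sketch of how (45) and (49) combine, the helper below computes α = 1 − βS² and the resulting time
+bound for a given weight matrix. The function name low_beta_bound and the example matrices are
+hypothetical; the point is only that stacking more memories with the same row norms leaves the bound unchanged.
+

```python
import math

def low_beta_bound(xi, beta, tau_v, eps):
    """Return (alpha, T_bound) from equations (45) and (49).

    Returns None when beta * S^2 >= 1, i.e. when the sufficient condition
    for uniform strong convexity is not met.
    """
    s2 = max(sum(w * w for w in row) for row in xi)  # S^2 = max_mu ||xi_mu||^2
    alpha = 1.0 - beta * s2
    if alpha <= 0.0:
        return None
    return alpha, tau_v / (2.0 * alpha) * math.log(1.0 / eps)

# Illustrative weights (assumptions): 3 memories, then 300 memories with the same rows.
xi_small = [[0.3, 0.1], [0.2, 0.4], [0.1, 0.1]]
xi_large = xi_small * 100

print(low_beta_bound(xi_small, beta=2.0, tau_v=0.1, eps=1e-3))
print(low_beta_bound(xi_large, beta=2.0, tau_v=0.1, eps=1e-3))
```

+
+Both calls give the same pair, because the bound depends on the weights only through S².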
+
+E.2 High-β regime
+E.2.1 TI : Basin selection
+Denote
+    $$s_\mu(v) := \xi_\mu^\top v + b_\mu, \qquad m(v) := \max_\mu s_\mu(v), \qquad f := \mathrm{softmax}(\beta s) \tag{50}$$
+
+Define the basin of attraction around the winning softmax logit k by the margin γ > 0:
+    $$B_k(\gamma) = \{\, v : s_k(v) - \max_{j \ne k} s_j(v) \ge \gamma \,\} \tag{51}$$
+
+Let TI be the first time t such that v(t) ∈ ∪k Bk (γ). Defining the softmax component of the energy
+function (38) as
+    $$\mathrm{LSE}_\beta(s) = \frac{1}{\beta} \log \sum_{\mu=1}^{N_h} e^{\beta s_\mu}$$
+
+then for every v, we can bound the LSE as
+    $$m(v) \le \mathrm{LSE}_\beta(s(v)) \le m(v) + \frac{1}{\beta}\log N_h \tag{52}$$
+Thus, the “softmax slack” δ(v) := LSEβ(s(v)) − m(v) obeys 0 ≤ δ(v) ≤ (1/β) log Nh. In the high-β regime,
+there are no critical points other than the softmax basins (those within ∪k Bk(γ) for any reasonable
+γ > ϵ > 0). To reduce δ from its initial value to the cusp of one of the basins requires dissipating at most
+    $$\Delta E_{\mathrm{softmax}} \le \frac{1}{\beta}\log N_h \tag{53}$$
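+
+The sandwich bound (52) on the log-sum-exp, and hence the bound on the softmax slack, is easy to verify
+numerically. This is a small sketch with made-up logits; lse and softmax_slack are hypothetical helper names.
+

```python
import math

def lse(s, beta):
    """LSE_beta(s) = (1/beta) * log(sum_mu exp(beta * s_mu)), computed stably."""
    m = max(s)
    return m + math.log(sum(math.exp(beta * (x - m)) for x in s)) / beta

def softmax_slack(s, beta):
    """delta = LSE_beta(s) - max(s), the quantity bounded in equation (52)."""
    return lse(s, beta) - max(s)

s = [0.3, -1.2, 0.9, 0.85, -0.4]  # illustrative logits (assumption)
for beta in (0.5, 2.0, 10.0, 50.0):
    delta = softmax_slack(s, beta)
    upper = math.log(len(s)) / beta
    assert 0.0 <= delta <= upper
    print(f"beta={beta}: slack={delta:.4f}, upper bound={upper:.4f}")
```

+
+The upper bound is attained only when all logits are equal, which is the most diffuse softmax.
+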
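+Since $\partial E / \partial v_i = -\tau_v \dot{v}_i$, and outside winning basins $|\tau_v \dot{v}_i| \sim 1$, the squared magnitude of the gradient grows at
+least linearly in Nv:
+    $$\|\nabla E(v)\|^2 = \sum_{i=1}^{N_v} \left(\frac{\partial E}{\partial v_i}\right)^2 \ge c N_v \tag{54}$$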
+
+for some c > 0 independent of Nv and Nh for all v in the trajectory outside a winning basin. Therefore,
+the energy dissipation rate satisfies
+    $$-\dot{E}(t) = \frac{1}{\tau_v}\|\nabla E(v(t))\|^2 \ge \frac{c}{\tau_v} N_v \tag{55}$$
+ Under assumptions (A1)–(A2), the visible state v remains in a bounded box, so the quadratic part of
+the energy contributes at most O(Nv ) to the energy difference between any two points on the trajectory.
+Since the energy dissipation rate during TI scales proportionally to Nv , the quadratic component of
+the energy contribution is dissipated in constant time. The only nontrivial Nh dependence is due to the
+softmax slack. Together with the bound on ∆Esoftmax, the total time this phase takes is characteristically
+    $$T_I = O\!\left(\frac{\tau_v}{\beta}\,\frac{\log N_h}{N_v}\right) \tag{56}$$
+
+ E.2.2 TII : Contractive convergence within a winning basin
+Find a basin Bk (γ) that is entered at tin = TI . We will now show local strong convexity within this
+basin, allowing us to invoke the PL inequality and find exponential convergence within the basin. Define
+G := Diag(f) − ff⊤. First, note that the non-winning softmax mass 1 − fk satisfies
+    $$1 - f_k = \sum_{j \ne k} f_j \le (N_h - 1)\, e^{-\beta\gamma} \tag{57}$$
+
+
+Additionally, since $\|f\|^2 = f_k^2 + \sum_{j \ne k} f_j^2 \ge f_k^2$ and 0 ≤ fk ≤ 1,
+
+    $$\lambda_{\max}(G(f)) \le \mathrm{tr}(G(f)) = 1 - \|f\|^2 \le 1 - f_k^2 \le 2(1 - f_k) \le 2(N_h - 1)\, e^{-\beta\gamma} \tag{58}$$
+
+Hence, with $S^2 = \max_\mu \|\xi_\mu\|^2$,
+
+    $$\lambda_{\max}(\xi^\top G(f)\xi) \le S^2 \lambda_{\max}(G(f)) \le 2 S^2 (N_h - 1)\, e^{-\beta\gamma} \tag{59}$$
+
+This bounds the largest eigenvalue of ξ⊤G(f)ξ in a way that incorporates the softmax β.
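+
+The chain of inequalities in (57) and (58) can be sanity-checked on a concrete logit vector. The sketch
+below is illustrative only: tail_bounds is a hypothetical helper, and the logits and β are assumptions.
+

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]

def tail_bounds(s, beta):
    """Return (gamma, 1 - f_k, tr G, 2 (N_h - 1) exp(-beta * gamma)) for logits s."""
    f = softmax([beta * x for x in s])
    k = max(range(len(s)), key=lambda j: s[j])          # winning index
    gamma = s[k] - max(x for j, x in enumerate(s) if j != k)
    tail = 1.0 - f[k]                                   # non-winning softmax mass
    trace_g = 1.0 - sum(x * x for x in f)               # tr(Diag(f) - f f^T)
    bound = 2.0 * (len(s) - 1) * math.exp(-beta * gamma)
    return gamma, tail, trace_g, bound

gamma, tail, trace_g, bound = tail_bounds([1.0, 0.2, -0.3, 0.1], beta=8.0)
print(f"margin={gamma:.2f}, tail mass={tail:.5f}, tr G={trace_g:.5f}, bound={bound:.5f}")
```

+
+Increasing β or the margin γ shrinks the final bound exponentially, which is what makes the in-basin
+Hessian close to the identity.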
+ Now, we can show local strong convexity in the winning basin:
+
+    $$\nabla^2 E(v) = I - \beta\, \xi^\top G(f)\xi \succeq \big(1 - 2\beta S^2 (N_h - 1)\, e^{-\beta\gamma}\big)\, I \equiv \alpha(\beta, \gamma)\, I \tag{60}$$
+
+for all v ∈ Bk(γ). In particular, if
+    $$e^{-\beta\gamma}(N_h - 1) \le \frac{1}{4\beta S^2} \tag{61}$$
+
+then α(β, γ) ≥ 1/2, independent of Nh and Nv. Note that this is always possible: if the softmax is not peaked
+enough to make this inequality true, simply keep moving in trajectory “Phase I” for a little longer until
+the margin γ grows slightly larger such that the condition holds true. This strong convexity within Bk (γ)
+implies the PL inequality
+    $$\frac{1}{2}\|\nabla E(v)\|^2 \ge \alpha(\beta,\gamma)\,(E(v) - E^\star), \qquad \forall v \in B_k(\gamma) \tag{62}$$
+Therefore, along the trajectory within the basin for times t ≥ tin ,
+
+    $$\frac{d}{dt}\big(E(v(t)) - E^\star\big) = -\frac{1}{\tau_v}\|\nabla E(v(t))\|^2 \le -\frac{2\alpha(\beta,\gamma)}{\tau_v}\big(E(v(t)) - E^\star\big) \tag{63}$$
+Integrating,
+    $$E(v(t)) - E^\star \le e^{-\frac{2\alpha(\beta,\gamma)}{\tau_v}(t - t_{\mathrm{in}})} \big(E(v(t_{\mathrm{in}})) - E^\star\big) \tag{64}$$
+
+Impose a relative-to-initial convergence criterion:
+
+    $$E(v(t)) - E^\star \le \epsilon\, \big(E(v(0)) - E^\star\big), \qquad \epsilon \in (0, 1)$$
+
+Since E is non-increasing along the trajectory, E(v(tin )) − E ⋆ ≤ E(v(0)) − E ⋆ , so it suffices that
+    $$e^{-\frac{2\alpha(\beta,\gamma)}{\tau_v}(t - t_{\mathrm{in}})} \le \epsilon$$
+
+Hence the in-basin time satisfies
+    $$T_{II} \le \frac{\tau_v}{2\alpha(\beta,\gamma)} \log\frac{1}{\epsilon} = O\!\left(\tau_v \log\frac{1}{\epsilon}\right) \tag{65}$$
+
+which is independent of Nh and Nv.
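+
+Condition (61) can be solved for the smallest margin γ that guarantees α(β, γ) ≥ 1/2, which then feeds
+the in-basin time bound (65). The sketch below does this arithmetic; the function names and the example
+values of β, S², and Nh are assumptions chosen for illustration.
+

```python
import math

def min_margin(beta, s2, n_hidden):
    """Smallest gamma with (N_h - 1) exp(-beta * gamma) <= 1 / (4 beta S^2), equation (61).

    Requires n_hidden >= 2. Returns 0.0 when the condition already holds at gamma = 0.
    """
    return max(0.0, math.log(4.0 * beta * s2 * (n_hidden - 1)) / beta)

def alpha_in_basin(beta, gamma, s2, n_hidden):
    """Curvature lower bound alpha(beta, gamma) from equation (60)."""
    return 1.0 - 2.0 * beta * s2 * (n_hidden - 1) * math.exp(-beta * gamma)

def t_two_bound(alpha, tau_v, eps):
    """In-basin time bound of equation (65)."""
    return tau_v / (2.0 * alpha) * math.log(1.0 / eps)

gamma = min_margin(beta=10.0, s2=3.0, n_hidden=16)
alpha = alpha_in_basin(10.0, gamma, 3.0, 16)
print(f"gamma_min={gamma:.4f}, alpha={alpha:.4f}, T_II<={t_two_bound(alpha, 0.1, 1e-3):.4f}")
```

+
+The required margin grows only logarithmically in Nh and is divided by β, so for sharply peaked softmaxes
+it is small compared with typical logit separations.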
+
+
+
+
+ E.2.3 Combined bound
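+Altogether, in the high-β regime, to reach a relative-to-initial tolerance of
+    $$E(v(t)) - E^\star \le \epsilon\, \big(E(v(0)) - E^\star\big) \tag{66}$$
+the combined convergence time satisfies
+    $$T(\epsilon) = \underbrace{O\!\left(\frac{\tau_v}{\beta}\,\frac{\log N_h}{N_v}\right)}_{\text{winner selection } (T_I)} + \underbrace{O\!\left(\tau_v \log\frac{1}{\epsilon}\right)}_{\text{convergence within basin } (T_{II})} \tag{67}$$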
+
+For fixed ϵ, β, and τv , TII is independent of Nv and Nh , while TI carries all the model-size dependence.
+The dependence of the convergence time on Nh and Nv in the high-β regime is
+    $$T(\epsilon) = O\!\left(\frac{\tau_v}{\beta}\,\frac{\log N_h}{N_v}\right). \tag{68}$$
+The convergence time is at most logarithmic in the number of hidden neurons Nh , and actually decreases
+as 1/Nv in the number of visible neurons.
+
+E.3 Limitations
+Our analysis assumes that the timescales of the crossbar array are much faster than the fastest neuronal
+timescales. In practice, as the crossbar array gets bigger, it may contribute to the time scales of the
+entire system, since wires have non-zero capacitances. Once the crossbar array is large enough to
+significantly modify the time scales of the neurons, our analysis and the scaling argument
+become invalid. For this reason, one cannot scale this design to arbitrarily large sizes. Analyzing that
+boundary is outside the scope of our paper, because it depends on fabrication and design parameters,
+which belong to a different level of abstraction than the present paper.
+
+
+F Design invariance under voltage scaling
+Given hardware constraints of Gmax , Cc , and Cr , we can still implement models with arbitrarily large
+weights. Convergence bounds rely on the weight matrix constraints, which can be made feasible by
+global normalization at the hardware level, keeping the effective model weights unchanged. Consider the
+scaling factor for any non-negative ξ:
+    $$\kappa = \min\left\{ 1,\; \frac{G_{\max}}{\max_{\mu,i} \xi_{\mu i}},\; \frac{C_c}{\max_i \sum_\mu \xi_{\mu i}},\; \frac{C_r}{\max_\mu \sum_i \xi_{\mu i}} \right\} \tag{69}$$
+
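+Set ξ̃ = κξ. Then, ξ̃ satisfies all the hardware constraints of assumption (A1):
+    $$0 \le \tilde{\xi}_{\mu i} \le G_{\max}, \qquad \sum_i \tilde{\xi}_{\mu i} \le C_r \;\; \forall \mu, \qquad \sum_\mu \tilde{\xi}_{\mu i} \le C_c \;\; \forall i \tag{70}$$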
+
+So any ξ matrix can be mapped onto budgets with one scalar κ. Consider the pre-softmax arguments
+for the hidden neurons: if we scale weights ξ → ξ̃ = κξ, rescale the voltage unit v → ṽ = κv and biases
+b → b̃ = κ²b, and set β̃ = β/κ², then
+    $$\tilde{\beta}\,(\tilde{\xi}_\mu^\top \tilde{v} + \tilde{b}_\mu) = \beta\,(\xi_\mu^\top v + b_\mu) \tag{71}$$
+
+so the softmax outputs f and the system’s attractors are unchanged. The visible ODE τv v̇ = ξ⊤ f (v) − v
+is preserved up to units, as the κ terms can be absorbed into the gain of U2 and U3 without affecting the
+convergence time bounds.
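+
+The rescaling above can be sketched in a few lines: compute κ from (69), rescale ξ, v, b, and β, and
+confirm that the pre-softmax arguments of (71) are unchanged. The budget values, weights, and the helper
+names scale_factor and logits are assumptions made for illustration; the sketch also assumes ξ has at
+least one non-zero entry.
+

```python
def scale_factor(xi, g_max, c_r, c_c):
    """kappa of equation (69) for a non-negative matrix xi (rows = hidden units)."""
    max_entry = max(max(row) for row in xi)
    max_row_sum = max(sum(row) for row in xi)          # max_mu sum_i xi_mu_i
    max_col_sum = max(sum(col) for col in zip(*xi))    # max_i sum_mu xi_mu_i
    return min(1.0, g_max / max_entry, c_c / max_col_sum, c_r / max_row_sum)

def logits(xi, v, b, beta):
    """Pre-softmax arguments beta * (xi_mu . v + b_mu)."""
    return [beta * (sum(w * x for w, x in zip(row, v)) + bm) for row, bm in zip(xi, b)]

# Illustrative model and hardware budgets (assumptions).
xi = [[3.0, 0.0, 2.0], [1.0, 4.0, 0.0]]
b = [-6.5, -8.5]
v = [0.7, 0.1, 0.4]
beta = 2.0

kappa = scale_factor(xi, g_max=1.0, c_r=1.5, c_c=1.2)
xi_t = [[kappa * w for w in row] for row in xi]   # scaled weights
v_t = [kappa * x for x in v]                      # scaled voltages
b_t = [kappa ** 2 * bm for bm in b]               # scaled biases
beta_t = beta / kappa ** 2                        # scaled inverse temperature

print(kappa, logits(xi, v, b, beta), logits(xi_t, v_t, b_t, beta_t))
```

+
+Here the per-synapse limit is the binding constraint, so κ is set by Gmax rather than by the row or
+column budgets.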
+
+
+G Scaling of energy consumption
+The energy consumption of DenseAM circuits can be broken up into two parts: the energy dissipated
+by the weights as a result of Ohm’s Law, and the energy from engineering overhead found in amplifiers
+and active circuitry. The energy dissipated by the weights in the crossbar array can be expressed as the
+integral of the power dissipated by each resistor of resistance Rµi from time 0 until convergence at Tconv .
+
+
+ Energy consumption of weights. Let the neuron output voltages be proportional to activations:
+ui = κgi and wµ = κfµ , where κ is a fixed voltage scale. We assume rail-bounded outputs |ui | ≤ κ and
+|wµ | ≤ κ (by Appendix F, global rescaling of ξ, voltages, and β preserves the DenseAM dynamics, so
+this choice of κ does not affect behavior.) The instantaneous power in the resistive crossbar is:
+    $$P_{\mathrm{weights}}(t) = \sum_{i,\mu} \xi_{\mu i}\, (u_i - w_\mu)^2 \tag{72}$$
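+Using the row/column conductance budgets $\sum_\mu \xi_{\mu i} \le C_c$ and $\sum_i \xi_{\mu i} \le C_r$ (Appendix E) and the
+inequality (a − b)² ≤ 2a² + 2b²,
+    $$P_{\mathrm{weights}}(t) \le 2\Big( \sum_{i,\mu} \xi_{\mu i} u_i^2 + \sum_{i,\mu} \xi_{\mu i} w_\mu^2 \Big) \tag{73}$$
+    $$= 2\Big( \sum_i u_i^2 \sum_\mu \xi_{\mu i} + \sum_\mu w_\mu^2 \sum_i \xi_{\mu i} \Big) \tag{74}$$
+    $$\le 2\Big( C_c \sum_i u_i^2 + C_r \sum_\mu w_\mu^2 \Big) \tag{75}$$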
+
+If the hidden layer uses a softmax activation, then $\sum_\mu f_\mu^2 \le 1$ and so $\sum_\mu w_\mu^2 \le \kappa^2$; and rail bounds give
+$\sum_i u_i^2 \le N_v \kappa^2$. Therefore,
+
+    $$P_{\mathrm{weights}}(t) \le 2\kappa^2 (C_c N_v + C_r) = O(N_v) \tag{76}$$
+
+Therefore, a system taking time Tconv to converge results in an energy consumption of
+    $$E_{\mathrm{weights}} = \int_0^{T_{\mathrm{conv}}} P_{\mathrm{weights}}(t)\, dt \le 2\kappa^2 (C_c N_v + C_r)\, T_{\mathrm{conv}} \tag{77}$$
+
+According to the convergence time bounds of Appendix E, T conv = O(τv ). Thus, Eweights = O(Nv ), as
+a function of system size.
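+
+As a worked instance of (72) and (76), the sketch below evaluates the instantaneous crossbar power for a
+small conductance matrix and compares it against the bound. All numbers are in normalized units and are
+assumptions, as are the function names crossbar_power and power_bound.
+

```python
def crossbar_power(xi, u, w):
    """Instantaneous power of equation (72): sum over (i, mu) of xi_mu_i * (u_i - w_mu)^2."""
    return sum(xi[mu][i] * (u[i] - w[mu]) ** 2
               for mu in range(len(xi)) for i in range(len(u)))

def power_bound(n_visible, c_c, c_r, kappa):
    """Upper bound of equation (76): 2 * kappa^2 * (C_c * N_v + C_r)."""
    return 2.0 * kappa ** 2 * (c_c * n_visible + c_r)

# Illustrative crossbar (assumption): 2 hidden rows, 3 visible columns.
xi = [[0.2, 0.1, 0.3],
      [0.1, 0.4, 0.2]]
c_c, c_r, kappa = 0.5, 0.7, 1.0   # column budget, row budget, voltage scale
u = [1.0, -1.0, 0.5]              # visible outputs, |u_i| <= kappa
w = [0.8, 0.2]                    # hidden outputs kappa * f with f a softmax output

p = crossbar_power(xi, u, w)
print(f"P = {p:.3f}, bound = {power_bound(len(u), c_c, c_r, kappa):.3f}")
```

+
+The bound is loose here because it assumes every visible output sits at the rail, which is the worst case
+used in the derivation.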
+
+Energy consumption of capacitors. Let each neuron node voltage be bounded by hardware limits
+|ui(t)|, |wµ(t)| ≤ κ. Charging a capacitor of capacitance C to a voltage V from a supply through a resistive
+path draws CV² from the power supply. The number of times each capacitor charges is finite because the Lyapunov
+energy of the DenseAM forbids limit cycles. This means the total supply energy per node can be bounded
+by a constant. Therefore, the total energy needed to (re)charge all neuron capacitors is bounded by
+    $$E_{\mathrm{capacitors}} \le O(1) \cdot \kappa^2 \Big( \sum_{i=1}^{N_v} C_i^{(v)} + \sum_{\mu=1}^{N_h} C_\mu^{(h)} \Big) = O(N_v + N_h) \tag{78}$$
+
+
+Energy consumption of amplifiers, bias, control, and overhead. Per neuron, the energy expenditure
+due to amplifier inefficiency, bias terms, and general overhead does not depend on system size. For a
+runtime of duration Tconv, the energy consumption of these elements in the entire network scales as
+
+ Eother = O((Nv + Nh )T conv ) (79)
+
+Combined energy consumption. All together, the total energy consumption can be written as
+
+ Etotal = O(Nv + Nh ) (80)
+
+
+H Model Specifications and Details
+Table 3, Table 4, and Table 5 summarize the model design for the XOR, Hamming (7,4), and parity
+DenseAM models.
+
+
+
+ Table 3: XOR model specification
+
+Visible neurons vi                    Nv = 3 (inputs v1, v2 clamped to {0,1}; output v3 free)
+Hidden neurons hµ                     Nh = 4 (one per truth-table row)
+Visible activation and Lagrangian     Identity: gi = vi, $L_v = \frac{1}{2}\sum_{i=1}^{N_v} v_i^2$
+Hidden activation and Lagrangian      Softmax: fµ = softmax(βhµ), $L_h = \frac{1}{\beta}\log \sum_{\mu=1}^{N_h} e^{\beta h_\mu}$
+Visible biases                        ai = 0
+Hidden biases                         $b_\mu = -\frac{1}{2}\sum_{i=1}^{N_v} \xi_{\mu i}^2$
+Weights ξ                             ξ ∈ {0, 1}^{4×3}, rows encode memories:
+                                      $\xi = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$
+Inference protocol                    Clamp (v1, v2) to input values; read out v3 at convergence
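+
+The XOR specification in Table 3 can be simulated directly from the dynamics (35). The following is a
+minimal sketch, assuming forward-Euler integration, β = 10, an initial output voltage v3 = 0.5, and the
+time constants quoted in Section H.1; these choices and the name xor_infer are assumptions, not the
+authors' implementation.
+

```python
import math

XI = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]     # memories from Table 3
B = [-0.5 * sum(w * w for w in row) for row in XI]    # hidden biases b_mu

def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]

def xor_infer(x1, x2, beta=10.0, tau_v=0.1, tau_h=0.01, dt=1e-3, t_end=1.0):
    """Clamp (v1, v2), integrate v3 and the hidden neurons, return v3 at t_end."""
    v = [float(x1), float(x2), 0.5]   # v3 starts midway between 0 and 1 (assumption)
    h = [0.0] * len(XI)
    for _ in range(int(round(t_end / dt))):
        f = softmax([beta * x for x in h])
        # Only the free output neuron v3 evolves; v1 and v2 stay clamped.
        dv3 = (sum(XI[mu][2] * f[mu] for mu in range(len(XI))) - v[2]) / tau_v
        dh = [(sum(w * x for w, x in zip(XI[mu], v)) + B[mu] - h[mu]) / tau_h
              for mu in range(len(XI))]
        v[2] += dt * dv3
        h = [x + dt * d for x, d in zip(h, dh)]
    return v[2]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", round(xor_infer(x1, x2)))
```

+
+Rounding the converged v3 to the nearest integer gives the readout; the analog value itself sits close to,
+but not exactly at, 0 or 1 because β is finite.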
+
+
+
+
+ Table 4: Hamming (7,4) model specification
+
+Visible neurons (Nv ) 7 (codeword bits)
+Hidden neurons (Nh ) 16 (one per valid codeword)
+Visible activation Identity: gi = vi
+Hidden activation Softmax over µ ∈ {1, . . . , 16} with temperature β
+Visible biases ai = 0
+Hidden biases $b_\mu = -\frac{1}{2}\sum_{i=1}^{N_v} \xi_{\mu i}^2$
+Weights ξ ξ ∈ {0, 1}16×7 , each row is a valid Hamming(7,4) codeword
+Inference protocol Initialize visible neurons to corrupted 7-bit input codeword; let all visible and
+ hidden neurons evolve; converged visible neurons give the corrected codeword
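+
+A sketch of the Hamming (7,4) decoder in Table 4 follows. It assumes one particular systematic generator
+matrix (the table does not specify which code construction is used), integrates out the hidden neurons as
+in (36) rather than evolving them explicitly, and uses β = 10 with forward-Euler steps; the names encode,
+decode, and P are hypothetical.
+

```python
import math
from itertools import product

# Parity part of a systematic generator matrix [I_4 | P] (an assumed, standard choice).
P = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]

def encode(bits):
    """Map 4 data bits to a 7-bit codeword."""
    parity = [sum(bits[k] * P[k][j] for k in range(4)) % 2 for j in range(3)]
    return list(bits) + parity

CODEWORDS = [encode(d) for d in product((0, 1), repeat=4)]   # 16 memories xi_mu
BIAS = [-0.5 * sum(c) for c in CODEWORDS]                    # -1/2 sum_i xi^2 (bits: xi^2 = xi)

def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]

def decode(word, beta=10.0, tau_v=0.1, dt=1e-3, t_end=1.0):
    """Relax tau_v * dv/dt = xi^T f(v) - v from the received word and round the result."""
    v = [float(x) for x in word]
    for _ in range(int(round(t_end / dt))):
        f = softmax([beta * (sum(c * x for c, x in zip(cw, v)) + bm)
                     for cw, bm in zip(CODEWORDS, BIAS)])
        drive = [sum(CODEWORDS[mu][i] * f[mu] for mu in range(16)) for i in range(7)]
        v = [x + dt / tau_v * (d - x) for x, d in zip(v, drive)]
    return [int(round(x)) for x in v]

sent = CODEWORDS[11]
received = list(sent)
received[2] ^= 1                      # flip one bit
print(sent, received, decode(received))
```

+
+With these biases the logit of each memory is, up to a term shared by all memories, minus half its squared
+Euclidean distance to the state, so the dynamics pull a singly corrupted word toward the nearest codeword.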
+
+
+
+
+ Table 5: 8-bit parity model specification
+
+Visible neurons vi                                            Nv = 16 (dimension of embedding D)
+Hidden neurons (energy attention) $h^{\mathrm{attn}}_A$       $N_h^{\mathrm{attn}} = 8$ (context length L)
+Hidden neurons (Hopfield network) $h^{\mathrm{hopf}}_\mu$     $N_h^{\mathrm{hopf}} = 16$ (Hopfield network memories M)
+Hidden neurons (total)                                        Nh = 24 (L + M)
+Visible activation                                            Identity: gi = vi
+Hidden activation (energy attention)                          Softmax: $f^{\mathrm{attn}}_A = \mathrm{softmax}(\beta h^{\mathrm{attn}})_A$ for A = 1, . . . , L
+Hidden activation (Hopfield network)                          ReLU: $f^{\mathrm{hopf}}_\mu = \max(h^{\mathrm{hopf}}_\mu, 0)$ for µ = 1, . . . , M
+Weights (energy attention)                                    $\xi^{\mathrm{attn}} \in \mathbb{R}^{L \times D}$, where $\xi^{\mathrm{attn}}_A$ is the embedded A’th context token
+Weights (Hopfield network)                                    $\xi^{\mathrm{hopf}} \in \mathbb{R}^{M \times D}$, static after training
+Inference protocol                                            Embed L context tokens to obtain $\xi^{\mathrm{attn}}$. Let visible neurons
+                                                              evolve until convergence
+
+
+
+
+ H.1 Bit string energy transformer implementation
+As described in Table 5, our trained model uses an embedding matrix of 2 × D = 32 parameters, the
+Hopfield network with D × M = 256 parameters, an additional D × 2 = 32 parameter matrix to decode
+embeddings to logits, a total of D + L + M = 40 neuron bias terms, and 2 biases for the linear decoder.
+This is a total of 362 parameters.
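+
+The parameter count can be reproduced with a few lines of arithmetic. The dictionary keys below are
+descriptive labels chosen for this sketch; the sizes D, L, and M come from Table 5, and the vocabulary size
+of 2 reflects the binary tokens.
+

```python
D, L, M = 16, 8, 16   # embedding dimension, context length, Hopfield memories (Table 5)
VOCAB = 2             # binary tokens

counts = {
    "embedding": VOCAB * D,          # 2 x D embedding matrix
    "hopfield_weights": D * M,       # Hopfield network weights
    "decoder_weights": D * VOCAB,    # linear decoder from embeddings to logits
    "neuron_biases": D + L + M,      # visible, attention, and Hopfield neuron biases
    "decoder_biases": VOCAB,         # biases of the linear decoder
}
total = sum(counts.values())
print(counts, total)
```
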
+ In training and inference we use time constants τv = 0.1 and τh = 0.01. We train with Euler steps of
+1e-3, and test with Euler steps of 1e-4 for a time horizon of T = 1 second. JAX’s automatic differentiation
+was used to implement backpropagation through time. We encourage the model to reach fixed points
+by penalizing v̇ at time T. This yields models that are more robust to hardware imperfection due to the
+intrinsic stability of attractor points. The convergence to an attractor also means the inference remains
+stable to mismatch and delay in timing during readout.
+
+
+I Hardware analysis
+I.1 Hardware speed analysis
+As discussed in subsection 7.1, the convergence time of analog DenseAMs is governed not by system size,
+but rather primarily by the timescales of the dynamics in hardware. These timescales are set by the time
+constants τv and τh . The smaller these time constants, the faster the dynamics move, and the faster the
+system converges. In this section, we derive bounds on the minimum time constant min{τv , τh } of the
+DenseAM, which is limited by the constraints of active components like amplifiers.
+ The maximum speed of neuronal dynamics is limited by the ability of active stages (op-amps/buffers)
+to track changing signals. If the input slope to an active stage exceeds its slew rate (SR), the output
+distorts; if the signal spectrum approaches or exceeds the stage’s closed-loop bandwidth, attenuation
+and phase lag appear. Here, we derive lower bounds on the time constants τv , τh imposed by (i) finite
+gain–bandwidth product (GBW) and (ii) finite SR of the three active stages in the neuron design (Ap-
+pendix A). Without loss of generality we will express the derivation for the hidden neurons, with the
+derivations for visible neurons following by symmetry. Throughout, define the following:
+
+ • State swing: |vi (t)| ≤ Av , so that |v̇i | ≲ Av /τ . Similarly, |hµ (t)| ≤ Ah , so that |ḣµ | ≲ Ah /τ .
+ • Activation swing: Visible activation g(·) is Lipschitz with slope bound Lg = supx |g ′ (x)|. Then,
+ |ġi | ≤ Lg |v̇i | ≤ Lg Av /τ . Similarly, hidden activation f (·) is Lipschitz with slope bounded by
+ Lf = supx |f ′ (x)|. Then, |f˙µ | ≤ Lf |ḣµ | ≤ Lf Ah /τ .
+
+ • Weights ξ ≥ 0. Hardware normalization gives per-row/column conductance budgets, so the
+ self-term gain for hidden neuron µ is $A_{\mathrm{self},\mu} = \sum_i \xi_{\mu i} = O(1)$.
+We will derive three independent lower bounds and then take the max:
+
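+    $$\tau_{\min} \ge \max\{\, \underbrace{\tau_{\mathrm{GBW}}}_{\text{tracking small signals}},\; \underbrace{\tau_{\mathrm{SR}}}_{\text{edge/large signals}},\; \underbrace{\tau_{\text{I-limit}}}_{\text{output current}} \,\} \tag{81}$$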
+
+
+I.1.1 Gain-bandwidth product bound
+For a single-pole op-amp with gain-bandwidth product GBW in a closed-loop configuration with closed-loop
+gain ACL, the −3 dB bandwidth is fc ≈ GBW/ACL. In order for the neuron to faithfully track with a
+time constant τ, we require fc ≳ 1/(2πτ) for every stage in the signal path. Closed-loop gains for each
+of the op-amps are: ACL (U 1) = 1 because it is a unity-gain buffer, ACL (U 2) = Aself because it needs
+to realize the self term gain, and ACL (U 3) ≈ 1 because it is a unity-gain summer. Assuming the same
+op-amp design for U1, U2, and U3, and taking the worst case,
+
+    $$\tau_{\mathrm{GBW}} = \frac{\max(1, A_{\mathrm{self}})}{2\pi\, \mathrm{GBW}} \tag{82}$$
+
+I.1.2 Slew rate bound
+The slew-rate limits cap the maximum output slope of each op-amp stage:
+ • U1: activation buffer. |f˙µ | ≤ Lf Ah /τ , which gives τ ≥ (Lf Ah )/SRU1 .
+
+
+ Table 6: Estimated neuron time constants and conservative convergence times with Av = Ah = 1 V,
+Lg = 1, Aself = 1 for representative amplifiers in literature. GBW bound $\tau_{\mathrm{GBW}} = \frac{1}{2\pi\,\mathrm{GBW}}$; SR bound
+$\tau_{\mathrm{SR}} = \frac{L_g A_v}{\mathrm{SR}}$ (visible path). Overall τmin = max{τGBW, τSR}; we report Tconv = 10 τmin.
+
+CMOS Amplifier (ref.)                       SR (V/µs)   GBW (MHz)   τSR (ns)   τGBW (ns)   Tconv (ns)
+Perez and Maloberti [36]                        84.50      321.50      11.83        0.50       118.34
+Assaad and Silva-Martinez [37]                  94.10      134.20      10.63        1.19       106.27
+Yen and Blalock [38]                           202.00       10.70       4.95       14.87       148.74
+Naderi, Prakash, and Silva-Martinez [39]      1250.00     3600.00       0.80        0.04         8.00
+Schlögl and Zimmermann [40]                   1650.00     2510.00       0.61        0.06         6.06
+Notes. (i) τSR values assume the visible path dominates the summer’s SR (low/moderate-β). If softmax dominates at U3
+ in the high-β regime, multiply SR-limited values by κ = (β/2) (Ah /Av ) (with Ah = Av = 1 V, simply β/2). (ii) The
+ current-limit bound τI-limit = CAv /Imax is typically ≪ all reported values for C ∼ 50 fF and Imax ∼mA, so it is omitted
+ from the table but must still be respected in circuit sizing.
+
+
+ • U2: self-term. sµ = Aself fµ , so |ṡµ | = Aself |f˙µ | ≤ (Aself Lf Ah )/τ , which gives τ ≥ (Aself Lf Ah )/SRU2 .
+ • U3: internal state drive. The time-varying portion of the RC circuit drive dµ is a linear combination
+ of fµ and gi, with coefficients that have a maximum magnitude of Aself. Using the bounds on
+ the slopes of those inputs, we get the following bound on $|\dot{d}_\mu|$ and subsequently the time constant
+ bound:
+    $$|\dot{d}_\mu| \lesssim \frac{A_{\mathrm{self}}}{\tau} \max\{L_f A_h, L_g A_v\} \;\Rightarrow\; \tau \ge \frac{A_{\mathrm{self}} \max(L_f A_h, L_g A_v)}{\mathrm{SR}_{U3}} \tag{83}$$
+
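+All together, the combined constraint is
+    $$\tau_{\mathrm{SR}} = \max\left\{ \frac{L_f A_h}{\mathrm{SR}_{U1}},\; \frac{A_{\mathrm{self}} L_f A_h}{\mathrm{SR}_{U2}},\; \frac{A_{\mathrm{self}} \max(L_f A_h, L_g A_v)}{\mathrm{SR}_{U3}} \right\} \tag{84}$$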
+
+I.1.3 Current / headroom limit
+U3 must provide the current through R2 to charge C1 . The RC circuit dynamics dictate R2 C1 ḣµ =
+−hµ + dµ , so the instantaneous current needed by U3 is
+
+    $$I_{U3,\mathrm{out}} = \frac{d_\mu - h_\mu}{R_2} = C_1 \dot{h}_\mu \tag{85}$$
+
+We must respect |IU3,out | ≤ Imax,U3 . With |ḣµ | ≲ Ah /τ ,
+
+    $$\tau_{\text{I-limit}} \ge \frac{C_1 A_h}{I_{\max,U3}} \tag{86}$$
+
+I.1.4 Combined bound on minimum time constant
+Taken together, the minimum time constant must satisfy the bounds (82), (84), and (86):
+
+ τmin ≥ max{τGBW , τSR , τI-limit } (87)
+
+I.2 Estimates of inference times with existing hardware
+Under standard assumptions for DenseAMs (symmetric couplings and monotone activations), the Lya-
+punov energy decreases monotonically and the dynamics converge without oscillations. The settling time
+is therefore on the order of a few multiples of the largest neuronal time constant, which we bound by
+amplifier non-idealities. In this section we take some representative examples of op-amps from literature
+and estimate the inference speeds from reasonable and representative design parameters.
+
+
+
+
+ Minimum time constant. For illustration purposes, we choose three reasonable hardware constraints:
+ • Activation slopes. Take the slope of the visible activation to be Lg = 1, as occurs for
+ an identity visible neuron activation. Take the worst-case (maximum) slope of the hidden activation
+ to be that of the softmax with fixed β, whose Jacobian is βG(f) with ∥G(f)∥₂ ≤ 1/2, so a safe
+ global bound is Lf ≤ β/2.
+ • Signal swing. Use the voltage scaling invariance (see Appendix F) to rescale v, ξ, and β together
+ to pick a swing that is slew-rate friendly but well above component noise limits. Take both Av =
+ Ah = 1V .
+
+ • Self-term gain. With row/column budgets, use Aself = 1 as a worst-case bound.
+With those choices, the three lower bounds per neuron are:
+
+ 1. GBW Bound: $\tau_{\mathrm{GBW}} = \frac{\max(1, A_{\mathrm{self}})}{2\pi\,\mathrm{GBW}} = \frac{1}{2\pi\,\mathrm{GBW}}$.
+ 2. SR Bound: The U1/U2 path gives $\tau_{\mathrm{SR,vis}} = \frac{L_g A_v}{\mathrm{SR}} = \frac{1}{\mathrm{SR}}$ µs (with SR in V/µs). In the U3 (summer) path, equation (84)
+ has two cases. In the low-β regime where Lg Av ≥ Lf Ah, the U3 bound reduces to 1/SR µs. In
+ the high-β regime where Lf Ah = β/2 dominates, scale the slew-rate limited bound by β/2.
+ 3. Output Current Bound: In practice, this bound generally does not limit the op amp choice:
+ even with a large capacitor C = 50 fF, Av = 1 V, Imax = 2 mA, τI-limit ≈ 0.025 ns, which is negligible
+ compared to the bounds from SR and GBW.
+To quantify realistic inference speeds, Table 6 lists representative CMOS operational transconductance
+amplifiers (OTAs)³ drawn from recent literature, together with their corresponding lower bounds on
+neuronal time constants under the GBW and slew-rate limits. Even using conservative assumptions
+with existing amplifier designs, the analysis shows that modern high-speed OTAs can achieve sub-10 ns
+neuronal convergence times, corresponding to inference rates in the hundreds of megahertz.
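+
+The entries of Table 6 follow from (82), the visible-path slew bound, and (86) by unit conversion. The
+sketch below recomputes them under the table's stated assumptions (Av = 1 V, Lg = 1, Aself = 1, and
+Tconv = 10 τmin); the function names and the dictionary layout are hypothetical.
+

```python
import math

def tau_gbw_ns(gbw_mhz, a_self=1.0):
    """Equation (82) in nanoseconds, with GBW given in MHz."""
    return max(1.0, a_self) / (2.0 * math.pi * gbw_mhz * 1e6) * 1e9

def tau_sr_ns(sr_v_per_us, slope=1.0, swing_v=1.0):
    """Visible-path slew bound L_g * A_v / SR in nanoseconds, with SR in V/us."""
    return slope * swing_v / sr_v_per_us * 1e3

def tau_current_ns(c_farad, swing_v, i_max_amp):
    """Equation (86) in nanoseconds."""
    return c_farad * swing_v / i_max_amp * 1e9

# (SR in V/us, GBW in MHz) as listed in Table 6.
AMPLIFIERS = {
    "Perez and Maloberti [36]": (84.5, 321.5),
    "Assaad and Silva-Martinez [37]": (94.1, 134.2),
    "Yen and Blalock [38]": (202.0, 10.7),
    "Naderi, Prakash, and Silva-Martinez [39]": (1250.0, 3600.0),
    "Schlögl and Zimmermann [40]": (1650.0, 2510.0),
}

for name, (sr, gbw) in AMPLIFIERS.items():
    tau_min = max(tau_sr_ns(sr), tau_gbw_ns(gbw))
    print(f"{name}: tau_min = {tau_min:.2f} ns, T_conv = {10 * tau_min:.2f} ns")

print(f"current-limit bound: {tau_current_ns(50e-15, 1.0, 2e-3):.3f} ns")
```

+
+Only the third amplifier is bandwidth-limited; the others are limited by slew rate under these assumptions.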
+
+
+J Connection between analog and canonical Energy Transformer
+In this section we show that in the adiabatic limit, our Analog Energy Transformer (Analog ET) reduces
+to the canonical Energy Transformer. Begin with the dynamics for the Analog Energy Transformer
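+implemented by our circuit designs.
+
+    $$\tau_v \dot{v} = -\frac{\partial E}{\partial v} = (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}} + (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}} + a - v \tag{88}$$
+    $$\tau_h \dot{h}^{\mathrm{attn}} = -\frac{\partial E}{\partial f^{\mathrm{attn}}} = \xi^{\mathrm{attn}} v + b - h^{\mathrm{attn}} \tag{89}$$
+    $$\tau_h \dot{h}^{\mathrm{hopf}} = -\frac{\partial E}{\partial f^{\mathrm{hopf}}} = \xi^{\mathrm{hopf}} v + c - h^{\mathrm{hopf}} \tag{90}$$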
+Integrating out hidden neurons in the adiabatic limit where τh → 0, we see the relations
+
+    $$h^{\mathrm{attn}}(v) = \xi^{\mathrm{attn}} v + b \tag{91}$$
+    $$h^{\mathrm{hopf}}(v) = \xi^{\mathrm{hopf}} v + c \tag{92}$$
+
+which we can use to integrate out the hidden neuron activations as
+
+    $$f^{\mathrm{attn}}(v) = \mathrm{softmax}\big(\xi^{\mathrm{attn}} v + b\big) \tag{93}$$
+    $$f^{\mathrm{hopf}}(v) = \mathrm{ReLU}\big(\xi^{\mathrm{hopf}} v + c\big) \tag{94}$$
+
+Substituting into the visible dynamics:
+    $$\tau_v \dot{v} = (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}}(v) + (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}}(v) + a - v \tag{95}$$
+ ³ Many high-speed CMOS “op-amps” are reported as OTAs (transconductors). In our neuron, these OTA cores operate
+in closed-loop (unity/non-inverting) configurations, so the literature SR and GBW directly constrain τ via Eqs. (82)–(84).
+
+
+
+ We can ask ourselves, what scalar energy produces this ODE? We seek an energy Eeff(v) such that
+$\tau_v \dot{v} = -\partial E_{\mathrm{eff}} / \partial v$. Equivalently,
+
+    $$\nabla_v E_{\mathrm{eff}}(v) = v - a - (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}}(v) - (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}}(v) \tag{96}$$
+
+We can construct Eeff(v) as a sum of three pieces whose gradients match each term: Eeff(v) = Equad(v) +
+Eattn(v) + Ehopf(v). By inspection we see that $E_{\mathrm{quad}}(v) = \frac{1}{2}\|v - a\|^2$.
+
+Attention term. The energy function
+    $$E_{\mathrm{attn}}(v) = -\frac{1}{\beta} \log \sum_A \exp\!\big(\beta(\xi_A^{\mathrm{attn}} v + b_A)\big) \tag{97}$$
+
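+satisfies our requirement. We can see that by differentiating with respect to vi, we get
+    $$\frac{\partial E_{\mathrm{attn}}}{\partial v_i} = -\sum_A \mathrm{softmax}(\xi^{\mathrm{attn}} v + b)_A \cdot \xi_{Ai}^{\mathrm{attn}} \tag{98}$$
+    $$= -\sum_A \xi_{Ai}^{\mathrm{attn}} f_A^{\mathrm{attn}}(v) \tag{99}$$
+which yields our desired dynamics of $\nabla_v E_{\mathrm{attn}}(v) = -(\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}}(v)$.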
+
+Hopfield term. A simple way to achieve the desired dynamics is with a Hopfield-type energy function
+    $$E_{\mathrm{hopf}}(v) = -\sum_\mu \frac{1}{2}\, \mathrm{ReLU}\big(\xi_\mu^{\mathrm{hopf}} v + c_\mu\big)^2 \tag{100}$$
+
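+whose derivative with respect to vi yields
+    $$\frac{\partial E_{\mathrm{hopf}}}{\partial v_i} = -\sum_\mu \mathrm{ReLU}\big(\xi_\mu^{\mathrm{hopf}} v + c_\mu\big) \cdot \xi_{\mu i}^{\mathrm{hopf}} \tag{101}$$
+    $$= -\sum_\mu \xi_{\mu i}^{\mathrm{hopf}} f_\mu^{\mathrm{hopf}}(v) \tag{102}$$
+
+which yields our desired dynamics of $\nabla_v E_{\mathrm{hopf}}(v) = -(\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}}(v)$.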
+
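+Effective energy function of analog energy transformer. All together, the effective scalar energy
+over the visible state v after integrating out hidden neurons is
+    $$E_{\mathrm{eff}}(v) = \underbrace{\frac{1}{2}\|v - a\|_2^2}_{E_{\mathrm{quad}}} \;\underbrace{-\, \frac{1}{\beta} \log \sum_A \exp\!\big(\beta(\xi_A^{\mathrm{attn}} v + b_A)\big)}_{E_{\mathrm{attn}}} \;\underbrace{-\, \sum_\mu \frac{1}{2}\, \mathrm{ReLU}\big(\xi_\mu^{\mathrm{hopf}} v + c_\mu\big)^2}_{E_{\mathrm{hopf}}} \tag{103}$$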
+
+This effective energy aligns with the canonical Energy Transformer’s energy function. Because our
+dynamics use explicit hidden neurons, the energy function written in the main text includes the contributions
+of the hidden neurons. When τh ≪ τv, the dynamics converge to the behavior obtained when the hidden neurons
+are integrated out. Hence, the effective expressibility and behavior of our system is equivalent to that of
+the original Energy Transformer.
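+
+The claim that Eeff generates the reduced visible dynamics can be checked with a finite-difference
+gradient test. The sketch below compares the analytic gradient (96) against central differences of (103)
+on a toy model. All weights and biases are made-up values, the helper names are hypothetical, and the
+attention softmax is taken with inverse temperature β so that it matches the energy (97).
+

```python
import math

def dot(row, v):
    return sum(w * x for w, x in zip(row, v))

def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]

def e_eff(v, a, xi_attn, b, xi_hopf, c, beta):
    """Effective energy of equation (103)."""
    quad = 0.5 * sum((x - ai) ** 2 for x, ai in zip(v, a))
    s = [beta * (dot(row, v) + bm) for row, bm in zip(xi_attn, b)]
    m = max(s)
    attn = -(m + math.log(sum(math.exp(x - m) for x in s))) / beta
    hopf = -sum(0.5 * max(dot(row, v) + cm, 0.0) ** 2 for row, cm in zip(xi_hopf, c))
    return quad + attn + hopf

def grad_e_eff(v, a, xi_attn, b, xi_hopf, c, beta):
    """Analytic gradient, the right-hand side of equation (96)."""
    f_attn = softmax([beta * (dot(row, v) + bm) for row, bm in zip(xi_attn, b)])
    f_hopf = [max(dot(row, v) + cm, 0.0) for row, cm in zip(xi_hopf, c)]
    return [
        v[i] - a[i]
        - sum(xi_attn[k][i] * f_attn[k] for k in range(len(xi_attn)))
        - sum(xi_hopf[k][i] * f_hopf[k] for k in range(len(xi_hopf)))
        for i in range(len(v))
    ]

def numeric_grad(fn, v, step=1e-6):
    """Central-difference gradient of a scalar function fn at v."""
    grad = []
    for i in range(len(v)):
        up = list(v)
        dn = list(v)
        up[i] += step
        dn[i] -= step
        grad.append((fn(up) - fn(dn)) / (2.0 * step))
    return grad

# Toy model (assumptions): D = 2, L = 3 attention rows, M = 2 Hopfield rows.
a = [0.1, -0.2]
xi_attn = [[0.6, -0.1], [0.2, 0.7], [-0.4, 0.3]]
b = [0.0, 0.1, -0.1]
xi_hopf = [[0.5, -0.2], [-0.3, 0.4]]
c = [0.1, -0.6]
beta = 2.0
v = [0.4, 0.3]   # chosen so no ReLU pre-activation sits at its kink

analytic = grad_e_eff(v, a, xi_attn, b, xi_hopf, c, beta)
numeric = numeric_grad(lambda x: e_eff(x, a, xi_attn, b, xi_hopf, c, beta), v)
max_err = max(abs(p - q) for p, q in zip(analytic, numeric))
print(analytic, numeric, max_err)
```

+
+Agreement between the two gradients confirms, for this toy instance, that −∇Eeff reproduces the
+right-hand side of (95).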
+ In our model we omit the layer normalization activation that the original Energy Transformer applies
+to the visible neurons. This keeps the circuit design simple, while still enabling models with high
+expressibility. This choice does not modify the structure of the attention or the Hopfield parts of the
+energy; only the self-energy of v differs. From a modeling perspective, layer normalization mainly
+improves conditioning and learning of deep networks rather than changing the computational primitive
+and expressibility. We empirically observe that the resulting models without layer normalization remain
+expressive enough to solve the problems we present. In principle, a layer normalization-type visible
+activation function could be implemented in analog hardware (e.g. by subtracting the mean voltage
+and normalizing by an on-chip variance estimate), but this would add distracting complications to the
+minimalist neuron and circuit designs we show in this paper.
+
+
+ \ No newline at end of file