Dense Associative Memories with Analog Circuits
Marc Gong Bacvanski (1), Xincheng You (2), John Hopfield (3), and Dmitry Krotov (4)
(1) MIT
(2) Independent Researcher
(3) Princeton University
(4) IBM Research

                                                                                       December 16 2025
arXiv:2512.15002v1 [cs.NE] 17 Dec 2025


                                             Abstract: The increasing computational demands of modern AI systems have exposed fundamental
                                         limitations of digital hardware, driving interest in alternative paradigms for efficient large-scale inference.
                                         Dense Associative Memory (DenseAM) is a family of models that offers a flexible framework for repre-
                                         senting many contemporary neural architectures, such as transformers and diffusion models, by casting
                                         them as dynamical systems evolving on an energy landscape. In this work, we propose a general method
                                         for building analog accelerators for DenseAMs and implementing them using electronic RC circuits, cross-
                                         bar arrays, and amplifiers. We find that our analog DenseAM hardware performs inference in constant
                                         time independent of model size. This result highlights an asymptotic advantage of analog DenseAMs
                                         over digital numerical solvers that scale at least linearly with the model size. We consider three settings
                                         of progressively increasing complexity: XOR, the Hamming (7,4) code, and a simple language model
                                         defined on binary variables. We propose analog implementations of these three models and analyze the
                                         scaling of inference time, energy consumption, and hardware. Finally, we estimate lower bounds on the
                                         achievable time constants imposed by amplifier specifications, suggesting that even conservative existing
                                         analog technology can enable inference times on the order of tens to hundreds of nanoseconds. By har-
                                         nessing the intrinsic parallelism and continuous-time operation of analog circuits, our DenseAM-based
                                         accelerator design offers a new avenue for fast and scalable AI hardware.


                                         1     Introduction
                                         The unprecedented growth of artificial intelligence (AI) has driven demand for increasingly large and
                                         powerful models. At present, the field of generative AI is primarily driven by two settings: autore-
                                         gressive transformers [1] and diffusion models [2]. While these settings have demonstrated remarkable
                                         capabilities, they do so at a substantial computational cost. Their current implementations utilize digital
                                         computation, which faces fundamental challenges in energy efficiency, scalability, and latency, especially
                                         as model sizes and deployment demands continue to grow [3, 4, 5]. These limitations have prompted
                                         interest in alternative computational paradigms that can efficiently handle the demands of modern AI
                                         workloads [6].
                                             Dense Associative Memories (DenseAMs) [7, 8], a promising class of AI models which generalize
                                         Hopfield networks [9], offer a new angle for tackling these problems. Unlike conventional feed-forward
                                         models, DenseAM inference can be defined through the temporal evolution of a state vector that is
                                         governed by a system of differential equations [10]. The state vector can be thought of as a particle
                                         exploring the surface of a high-dimensional energy landscape, which is the Lyapunov function of these
                                         dynamical equations. DenseAMs have been demonstrated to be flexible and expressive computational
frameworks, capable of representing many primitives of modern AI architectures, such as the attention
mechanism [11], transformers [12], and diffusion models [13, 14, 15]. Furthermore, DenseAMs are
error-correcting systems [16], a property ensuring that small perturbations of the desired temporal evolution
                                         of the state vector are corrected away by the dynamics of the network itself, rather than accumulated
in time. Finally, DenseAMs are asymptotically stable: during the course of temporal evolution the
computation happens during a finite transient period of time, which is followed by a steady state of
neural activities. This asymptotic stabilization of dynamical trajectories removes the requirement to read
out the “answer” to the computation problem at a precise moment of time, making DenseAMs robust
to several classes of hardware imperfections. The confluence of the above properties makes DenseAMs
appealing networks for analog hardware implementations that, on the one hand, are grounded in the
physics of stable error-correcting dynamical systems and, on the other hand, are capable of representing
computation in state-of-the-art AI networks.

Code available at https://github.com/mbacvanski/AnalogET.
    In 1989, Hopfield argued that analog neural hardware can exceed the efficiency of digital implemen-
tations when the device physics directly instantiate the computational dynamics of the model itself [17].
Here, we revisit this idea with DenseAM models: we propose an analog circuit-based hardware accel-
erator design whose dynamics directly realize those of the DenseAM. We find that analog DenseAM
hardware enables constant-time inference independent of model size, which is in stark contrast to GPU
solvers and digital implementations. This intrinsic property makes DenseAM a natural fit for analog AI
accelerators, and it highlights our circuit architecture as a viable hardware path to realize them. Using
component specifications already demonstrated in fabricated devices, analog DenseAM hardware may
achieve inference times on the order of tens to hundreds of nanoseconds, several orders of magnitude
faster than digital systems.
    By leveraging the natural dynamics of analog systems, this work establishes a new design of fast and
scalable AI accelerators. The framework of DenseAMs and their efficient analog hardware implementa-
tions suggest a pathway for fundamentally redesigning the hardware-software interface for AI, enabling
a new paradigm for fast, energy-efficient, and scalable computation.


2    Dense Associative Memory basics
The DenseAM framework [10, 18] provides a model that has straightforward neuronal dynamics, yet is
surprisingly expressive in its ability to represent AI models including transformer attention, diffusion
models, and associative memories. In its simplest form it is defined by two sets of neurons (typically
called visible and hidden neurons) and a system of coupled non-linear differential equations governing
their behavior, see Figure 1. The visible neurons are characterized by their internal states vi and their
outputs gi , index i = 1 . . . Nv ; while the hidden neurons have internal states hµ and outputs fµ , index
µ = 1 . . . Nh . From the AI perspective, one can think of the internal state of the neuron as a pre-activation
of that neuron, and the output as a post-activation, which is obtained by applying an activation function
to the pre-activation. From the biological perspective, one can think about the internal state of the
neuron as a membrane voltage potential, and the output of that neuron as an axonal output, or a firing
rate of that neuron. This framework admits both neuron-wise activation functions (gi = g(vi ), where
g(·) is some continuous function, e.g., a ReLU), and collective activation functions such as softmax or
layer normalization, which depend on the states of multiple neurons.
    The network parameters are stored in the synaptic weights ξ ∈ R^{Nh×Nv}, whose matrix elements
denoted by ξµi can be either hand-engineered or learned. The time decay constants for the two groups
of neurons are τv and τh . With these conventions, the temporal evolution of the two groups of neurons
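can be expressed as

$$
\begin{cases}
\tau_v \dfrac{dv_i}{dt} = \displaystyle\sum_{\mu=1}^{N_h} \xi_{\mu i} f_\mu + a_i - v_i \\[2ex]
\tau_h \dfrac{dh_\mu}{dt} = \displaystyle\sum_{i=1}^{N_v} \xi_{\mu i} g_i + b_\mu - h_\mu
\end{cases}
\tag{1}
$$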

This forms a bipartite graph of neuronal connections, where the state of the hidden neurons is updated
by the state of the visible neurons, and vice versa. Importantly, the same matrix ξ appears in both
equations, once as ξ and again as ξ ⊤ . Although this is sometimes described as using “symmetric”
weights, ξ is not assumed to be symmetric in the linear-algebraic sense; it is simply the same matrix
used in both directions. Finally, ai and bµ denote biases, which are additional weights of the system and
whose values may be hard-coded or learned depending on the application.
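
As a rough numerical illustration of Equation 1, the sketch below integrates the two coupled equations with a forward-Euler step. The function name `simulate`, the random weights, the step size, and the choice of identity visible activations with softmax hidden activations are illustrative assumptions; this is a digital stand-in for the dynamics, not the analog circuit itself.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over all hidden neurons.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def simulate(xi, a, b, v0, h0, tau_v=1.0, tau_h=0.1, dt=5e-3, steps=8000):
    """Forward-Euler integration of Equation 1 (hypothetical helper).

    xi has shape (N_h, N_v); the same matrix is used in both directions.
    """
    v, h = np.array(v0, dtype=float), np.array(h0, dtype=float)
    for _ in range(steps):
        g, f = v, softmax(h)               # assumed activations: identity and softmax
        dv = (xi.T @ f + a - v) / tau_v    # visible equation of (1)
        dh = (xi @ g + b - h) / tau_h      # hidden equation of (1)
        v, h = v + dt * dv, h + dt * dh
    return v, h

rng = np.random.default_rng(0)
xi = rng.uniform(0.0, 1.0, size=(4, 3))    # arbitrary non-negative weights
a, b = np.zeros(3), np.zeros(4)
v, h = simulate(xi, a, b, v0=rng.uniform(size=3), h0=np.zeros(4))

# At a fixed point both right-hand sides of Equation 1 vanish.
residual_v = np.max(np.abs(xi.T @ softmax(h) + a - v))
residual_h = np.max(np.abs(xi @ v + b - h))
```

The two residuals measure how close the final state is to a fixed point of Equation 1; they should be small once the transient has died out.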

Figure 1: Top left: Bipartite neural network formulation, where hidden neurons hµ and visible neurons
vi are connected via symmetric synaptic weights ξ. Top right: Circuit realization of symmetric weight
matrix via resistive crossbar array. Each crosspoint encodes a weight ξµi by its resistance Rµi = 1/ξµi .
Lower right: Circuit schematic of a single hidden neuron. It drives its row of the crossbar array with
a voltage according to its activation fµ , and its internal dynamics are driven by the incoming current
flowing into it from the crossbar array. Lower left: Softmax activation function built from bipolar
junction transistors (some components not shown).

    The most important aspect of this model is the existence of a global energy function (Lyapunov
function) that describes neuronal dynamics. To demonstrate this, it is most convenient to use the
Lagrangian formalism [10, 18, 16]. Each set of neurons is defined through a Lagrangian function of their
internal states. The activation functions are defined as partial derivatives of that Lagrangian with respect
to internal states. The total energy is the sum of energies of each set of neurons, plus the interaction
energy. The energy of each set of neurons is a Legendre transformation of the corresponding Lagrangian
(plus the term proportional to the bias). Thus, the global energy of Equation 1 is given by
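
$$
E = \underbrace{\Big[\sum_{i=1}^{N_v} g_i (v_i - a_i) - L_v\Big]}_{\text{energy of visible neurons}}
  + \underbrace{\Big[\sum_{\mu=1}^{N_h} f_\mu (h_\mu - b_\mu) - L_h\Big]}_{\text{energy of hidden neurons}}
  - \underbrace{\sum_{\mu=1}^{N_h} \sum_{i=1}^{N_v} f_\mu \xi_{\mu i} g_i}_{\text{interaction energy}}
\tag{2}
$$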

where the activation functions are defined as partial derivatives of the Lagrangians

$$
g_i = \frac{\partial L_v}{\partial v_i}, \qquad f_\mu = \frac{\partial L_h}{\partial h_\mu}
$$

For convex Lagrangians this global energy decreases with time on the dynamical trajectories of Equa-
tion 1. If, additionally, the activation functions (and corresponding Lagrangians) are chosen in such a
way that this energy is bounded from below, the dynamical trajectories are guaranteed to arrive at a
stable fixed point of activations. The dynamical equations typically have many asymptotic fixed points,
which correspond to local minima of the energy function in Equation 2. Both properties above (convexity
of Lagrangians and lower-bounded energy) are satisfied for all settings studied in this paper. By picking
different nonlinear activation functions (or corresponding Lagrangians), this system yields a variety of
models that can describe associative memory, softmax attention, and other commonly used settings in
AI [10, 11, 18, 19, 20].
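
To make the energy of Equation 2 concrete, the following sketch evaluates it along a forward-Euler trajectory of Equation 1. It assumes one particular pair of convex Lagrangians, L_v = (1/2) Σ_i vi² (so gi = vi) and L_h = log Σ_µ exp(hµ) (so fµ is a softmax); the weights, time constants, and step size are arbitrary illustrative values.

```python
import numpy as np

def softmax(h):
    z = np.exp(h - h.max())
    return z / z.sum()

def logsumexp(h):
    m = h.max()
    return m + np.log(np.exp(h - m).sum())

def energy(v, h, xi, a, b):
    """Equation 2 for the assumed Lagrangians L_v = 0.5*|v|^2 and L_h = logsumexp(h)."""
    g, f = v, softmax(h)                      # g = dL_v/dv, f = dL_h/dh
    e_visible = g @ (v - a) - 0.5 * v @ v
    e_hidden = f @ (h - b) - logsumexp(h)
    e_interaction = f @ xi @ g
    return e_visible + e_hidden - e_interaction

rng = np.random.default_rng(0)
xi = rng.uniform(0.0, 1.0, size=(4, 3))       # arbitrary non-negative weights
a, b = np.zeros(3), np.zeros(4)
v, h = rng.uniform(size=3), rng.normal(size=4)
tau_v, tau_h, dt = 1.0, 0.5, 1e-3

energies = [energy(v, h, xi, a, b)]
for _ in range(5000):
    f = softmax(h)
    dv = (xi.T @ f + a - v) / tau_v           # visible equation of (1)
    dh = (xi @ v + b - h) / tau_h             # hidden equation of (1)
    v, h = v + dt * dv, h + dt * dh
    energies.append(energy(v, h, xi, a, b))
energies = np.array(energies)
```

Plotting `energies` against the step index gives a curve of the kind described in the text: it falls during the transient and flattens as the state approaches a fixed point.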
    A particularly relevant example for modern sequence modeling is the Energy Transformer (ET) [12],
which reformulates the transformer’s inference pass as a gradient flow on an energy function defined over the
set of tokens. The ET block contains two contributions to the energy function: attention energy and the
Hopfield network. The energy attention module routes the information between the tokens, while the
Hopfield module aligns the tokens with the manifold of token embeddings. In our implementation, the
context tokens act as a set of dynamically instantiated memories that interact with the predicted token
through a DenseAM-like energy. In Section 6 we exploit this connection to construct an Analog Energy
Transformer (Analog ET) whose continuous-time dynamics are implemented directly in hardware using
our DenseAM circuit primitives.


3    Related work
Early analog implementations of associative memories focused on the classical Hopfield network. Founda-
tional designs, such as continuous-time analog circuits [21, 22] and later demonstrations using amorphous-
silicon resistors [23], memristive devices [24, 25], and phase-change memories [26], targeted the quadratic
Hopfield energy function. These works emphasize device engineering and memory-cell design rather than
system-level dynamics, and inherit the limited storage capacity and representational power of traditional
Hopfield networks. That line of research is largely concerned with how to fabricate programmable re-
sistance elements themselves; our work assumes programmable conductances as a given primitive and
focuses on the continuous-time dynamics that operate on top of them. Our work also differs from these
works by addressing DenseAMs with higher-order energy functions and continuous-valued states.
     Another direction is the use of cavity-QED systems for associative memory. Marsh et al. [27] analyze
a confocal cavity implementation of a quadratic Hopfield network and show that the cavity dynamics
induce a descent-like relaxation rule on spin states. Their model remains restricted to quadratic energies
and binary spins, and operates in a cryogenic, cavity-QED setting. Our work instead targets higher-order
DenseAMs with continuous states, and emphasizes scalable, room-temperature analog microelectronics
with explicit hardware-aware dynamical analysis.
     More recent physical implementations move beyond purely quadratic energies. Musa et al. [28]
propose a free-space optical realization of the higher-order DenseAM energy. Their system constructs a
static physical representation of the energy landscape, but inference relies on an external digital controller
that performs iterative spin-flip updates. Thus, the hardware computes energies, while the optimization
dynamics remain digital. In contrast, our analog microelectronic architecture embeds the gradient flow
itself into hardware: inference is performed by a single continuous-time evolution rather than by discrete
digital updates.


4    DenseAM circuit design
Here, we introduce an architecture for a class of analog electronic hardware accelerators whose physical
time evolution realizes the system of nonlinear differential equations in Equation 1. Our DenseAM design,
shown in Figure 1, consists of two sets of neurons that interact through a resistive crossbar array.
The resistive crossbar array turns voltage differences between neurons into currents flowing between the
neurons according to synaptic weights, and each neuron’s internal circuitry converts those currents into
dynamics that reproduce Equation 1.

Resistive weights as a crossbar array. The crossbar array construction is a canonical design of
matrix-vector multiplication using analog electronics [17, 29], and is a natural fit for the weight matrix
ξ in our model. Traditionally, each crosspoint between a row and column line is connected by a resistor
(often memristors, RRAM, or other analog memories that produce resistances), a vector of input voltages
is applied at row lines, and the column lines are held at ground typically via a transimpedance amplifier.
By Ohm’s law, each resistive crosspoint produces a current that multiplies the row’s input voltage by
the inverse of the resistance. Because currents add along each column line, the total current output at a
column is the inner product between the vector of input voltages and the column’s conductance vector.
Thus, the array as a whole implements a parallel analog matrix multiplication of the form Iout = GVin ,
where G is the matrix of conductances (inverse of resistances).
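
The crossbar relation Iout = GVin can be checked with a few lines of arithmetic. The sketch below uses made-up resistance and voltage values and an indexing convention chosen for illustration (`R[j, i]` connects input row line i to output column line j, with the column held at ground); it compares the matrix product against an explicit application of Ohm's law and current summation.

```python
import numpy as np

# Hypothetical crosspoint resistances in ohms: R[j, i] links row line i to column line j.
R = np.array([[1e3, 2e3, 4e3],
              [5e3, 1e3, 2e3]])
G = 1.0 / R                                  # conductances in siemens
V_in = np.array([0.2, 0.5, 0.1])             # row voltages in volts

# Matrix form: one analog "multiply-accumulate" per column line.
I_out = G @ V_in

# Explicit form: Ohm's law per crosspoint, then currents add along each column.
I_loop = np.array([sum(V_in[i] / R[j, i] for i in range(R.shape[1]))
                   for j in range(R.shape[0])])
```

Both computations give the same column currents; in the analog array the sum is performed by the wiring itself rather than by sequential arithmetic.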
    Unlike a traditional crossbar array whose rows are driven at a fixed voltage and whose columns
are held at ground, our DenseAM circuit design uses each weight bidirectionally, exactly representing
the bidirectional connections between visible and hidden neurons. As a result, the current flowing into
each neuron corresponds to the weighted sum of the differences between visible and hidden neuron
activations. For example, for hidden neuron µ, this current is Σ_i ξµi (gi − fµ ). This construction enables
weight symmetry to be enforced by hardware sharing: both forward and reverse weights are realized by
the same resistive elements. Importantly, as long as weights are represented as conductances, they must
be non-negative.

[Figure 2 panels: visible neuron activations, hidden neuron activations, and energy, each plotted against time (s).]

Figure 2: Solving XOR with a DenseAM. Visible neuron g3 = v3 serves as the output, while the two
input neurons (unlabeled, thin lines) are clamped at 1 and 0 for True and False. Output v3 is initialized
at 0.5 and converges to a positive prediction of 1. The hidden neuron f3 for the truth-table row (1, 0, 1)
becomes highly activated, while the others (fine lines) are suppressed by softmax. Energy (2), or
equivalently (5), decreases monotonically along the inference trajectory.

[Figure 3 panels, titled (1, 0), (1, 1), (0, 0), and (0, 1): energy plotted against v3.]

Figure 3: XOR energy landscape of neuron v3 under different settings of visible input neurons v1 and
v2 . Minima in the energy function correspond to stationary points of the dynamics. Gradient flow
dynamics bring v3 to these attractor points, resulting in correct XOR outputs.

Design of a single neuron. Each neuron in the circuit computes its dynamics by integrating the
currents it receives from the crossbar array, which represent weighted differences between its own activation
and those of connected neurons. Consider a hidden neuron (the design for visible neurons is symmetric).
The neuron’s internal voltage hµ is stored on capacitor C1 and evolves in continuous time,
while the neuron’s activation fµ is obtained by passing hµ through a nonlinear function (e.g. ReLU or
softmax).
    The current flowing into hidden neuron µ is produced by its interaction with all visible neurons via
the synaptic weights ξµi for i = 1, . . . , Nv . Specifically, this current is a weighted sum of the differences
between neuron activations: Σ_i ξµi (gi − fµ ). Inside each neuron, a “self” path scales fµ to produce the
voltage sµ = fµ Σ_i ξµi . This term is added to the value of the incoming current so that the −fµ Σ_i ξµi
term is cancelled inside each neuron. As a result, the hidden state, represented as the voltage across
capacitor C1 , integrates only the desired weighted input plus any external stimulus bµ . Its dynamics
reduce to the canonical DenseAM form with a time constant of R2 C1 :

$$
R_2 C_1 \frac{dh_\mu}{dt} = \sum_{i=1}^{N_v} \xi_{\mu i} g_i + b_\mu - h_\mu
\tag{3}
$$


Elementwise (or vectorized) nonlinearities then produce activations gi = g(vi ) and fµ = f (hµ ) (e.g.,
ReLU, softmax) across the visible and hidden neurons. See Appendix A for the full circuit derivation.
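
The cancellation performed by the "self" path can be verified numerically. The sketch below is an arithmetic check in normalized units, not a circuit simulation: the weights, activations, and the values of `R2` and `C1` are hypothetical, and a ReLU is assumed for the hidden activation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_h, n_v = 4, 3
xi = rng.uniform(0.1, 1.0, size=(n_h, n_v))   # non-negative, conductance-like weights
g = rng.uniform(size=n_v)                     # visible activations
h = rng.normal(size=n_h)                      # hidden internal states
f = np.maximum(h, 0.0)                        # assumed ReLU hidden activation
b = np.zeros(n_h)                             # external stimulus (bias)

# Current delivered by the bidirectional crossbar: sum_i xi[mu, i] * (g[i] - f[mu]).
crossbar_current = (xi * (g[None, :] - f[:, None])).sum(axis=1)

# "Self" path inside each neuron: s[mu] = f[mu] * sum_i xi[mu, i].
self_term = f * xi.sum(axis=1)

# Adding the two cancels the -f[mu] * sum_i xi[mu, i] contribution.
drive = crossbar_current + self_term

R2, C1 = 1.0, 1.0                             # hypothetical normalized component values
dh_dt = (drive + b - h) / (R2 * C1)           # right-hand side of Equation 3
```

After the correction, `drive` equals the plain weighted input `xi @ g`, which is the term that appears in Equation 3.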


5           Analog DenseAM Examples
We begin by studying two examples of the proposed design: the XOR task, and the (7,4) error-correcting
Hamming code.


5.1    XOR
The XOR problem is a canonical test for nonlinear representation and inference, as it cannot be solved
by any linear model. We show a minimal DenseAM model for the XOR task, illustrating how energy-
based dynamics can solve this simple task with a continuous-time analog system. The network consists
of Nv = 3 visible neurons, and Nh = 4 hidden neurons. At t = 0 visible neurons v1 and v2 are initialized
at their input values corresponding to the input bits. The last visible neuron v3 is initialized at v3 = 0.5.
The hidden neurons are initialized at zero. The two input visible neurons remain clamped during the
dynamics, while the third output visible neuron and the hidden neurons evolve in time according to (1).
Each row of the memory matrix ξ corresponds to a row of the XOR truth table. The visible neurons
use an identity activation function where gi = vi , and the hidden neurons use a softmax activation. The
biases are set as

$$
a_i = 0, \qquad b_\mu = -\frac{1}{2} \sum_{i=1}^{N_v} \xi_{\mu i}^2
$$

    Figure 2 shows the temporal evolution of visible and hidden neuron activations, as well as the total
energy, during inference on the XOR input (1, 0). The output visible neuron’s activation g3 gradually
converges to the correct prediction of 1, while the hidden neuron associated with that memory, f3 ,
becomes strongly activated and the remaining hidden neurons are suppressed by the softmax nonlinearity.
The system’s energy decreases monotonically throughout the trajectory and stabilizes once the network
reaches its fixed-point prediction. Figure 3 depicts the system’s energy landscape as a function of output
neuron v3 for different clamped input configurations (v1 , v2 ). In each case, the energy exhibits a clear
convex minimum at the correct XOR output, demonstrating that gradient flow along the energy surface
drives v3 reliably toward the correct prediction. As shown in Appendix C, we validate our circuit design
and dynamics using SPICE simulation.
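
The XOR setup described above can be reproduced numerically in a few lines. The sketch below integrates Equation 1 with forward Euler, with the two input neurons clamped, v3 initialized at 0.5, and hidden neurons initialized at zero. The softmax inverse temperature `beta = 4`, the time constants, and the step size are assumptions chosen for illustration; this is not the SPICE model of Appendix C.

```python
import numpy as np

# Rows of the memory matrix are the rows of the XOR truth table (x1, x2, x1 XOR x2).
XI = np.array([[0, 0, 0],
               [0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]], dtype=float)
B = -0.5 * (XI ** 2).sum(axis=1)              # hidden biases; visible biases are zero

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def xor_denseam(x1, x2, beta=4.0, tau_v=1.0, tau_h=0.1, dt=5e-3, steps=2000):
    """Return the final value of the output neuron v3 (hypothetical helper)."""
    v = np.array([x1, x2, 0.5], dtype=float)  # inputs clamped, output starts at 0.5
    h = np.zeros(4)
    for _ in range(steps):
        f = softmax(beta * h)                 # assumed softmax with inverse temperature beta
        dv3 = (XI[:, 2] @ f - v[2]) / tau_v   # only the output neuron evolves
        dh = (XI @ v + B - h) / tau_h
        v[2] += dt * dv3
        h += dt * dh
    return v[2]

outputs = {(x1, x2): xor_denseam(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
predictions = {key: int(val > 0.5) for key, val in outputs.items()}
```

Thresholding the final v3 at 0.5 gives the predicted bit for each of the four input pairs.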
    To analyze this DenseAM, it is instructive to consider the limit τh → 0. Since the second equation in
(1) is linear in hidden units hµ , they can be integrated out. With Σ_{µ=1}^{Nh} fµ = 1, the resulting dynamics
of the visible neurons can be written as
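
$$
\tau_v \frac{dv_i}{dt} = \sum_{\mu=1}^{N_h} \big(\xi_{\mu i} - v_i\big) f_\mu
\qquad \text{where} \qquad
f_\mu = \operatorname{softmax}\Big(-\frac{\beta}{2} \sum_{i=1}^{N_v} (\xi_{\mu i} - v_i)^2\Big)
\tag{4}
$$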

The effective energy on the visible neurons can be written as

$$
E^{\text{eff}}(v) = -\frac{1}{\beta} \log \sum_{\mu=1}^{N_h} \exp\Big[-\frac{\beta}{2} \sum_{i=1}^{N_v} (\xi_{\mu i} - v_i)^2\Big]
\tag{5}
$$

Intuitively, each hidden neuron computes a squared Euclidean distance between the visible state and its
stored pattern ξµ . The softmax nonlinearity assigns higher weight to the pattern closest to the current
state of the visible neurons. The resulting visible neuron dynamics are gradient flow for this effective
energy. It is important to note that memories in this implementation are represented by conductances
of the crossbar array, which are always non-negative. For this reason, matrix elements of memories ξµi must
be non-negative, necessitating the use of the bias terms, which are just voltage sources that can be arbitrarily
signed.
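
The statement that the visible dynamics (4) are gradient flow for the effective energy (5) can be checked by finite differences. In the sketch below the XOR truth table is used as the memory matrix, and the test point, `beta`, and the difference step `eps` are arbitrary illustrative choices.

```python
import numpy as np

XI = np.array([[0, 0, 0],
               [0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]], dtype=float)

def rhs(v, xi, beta):
    """Right-hand side of Equation 4 (without the 1/tau_v factor)."""
    d2 = ((xi - v) ** 2).sum(axis=1)
    z = np.exp(-0.5 * beta * (d2 - d2.min()))   # shifted for numerical stability
    f = z / z.sum()
    return ((xi - v) * f[:, None]).sum(axis=0)

def e_eff(v, xi, beta):
    """Effective energy of Equation 5, evaluated with a stable log-sum-exp."""
    d2 = ((xi - v) ** 2).sum(axis=1)
    m = d2.min()
    return 0.5 * m - np.log(np.exp(-0.5 * beta * (d2 - m)).sum()) / beta

beta, eps = 4.0, 1e-5
v = np.array([0.9, 0.2, 0.4])

# Central finite-difference estimate of the gradient of E_eff.
grad = np.zeros_like(v)
for i in range(v.size):
    step = np.zeros_like(v)
    step[i] = eps
    grad[i] = (e_eff(v + step, XI, beta) - e_eff(v - step, XI, beta)) / (2 * eps)

flow = rhs(v, XI, beta)   # should match -grad
```

Agreement between `flow` and `-grad` confirms, at this test point, that τv dv/dt = −∂E^eff/∂v.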
    While a time constant of τh = 0 is impossible to physically construct due to finite conductances
and nonzero capacitances, setting τh ≪ τv realizes the same adiabatic limit in practice. When hidden
neurons evolve much faster than visible ones, they reach their steady state almost instantaneously for each
configuration of visible neurons. The result is an adiabatic elimination of hidden dynamics, yielding the
effective visible-only dynamics above. In practice, for the XOR task, even a relatively modest τh = τv /10
ratio yields perfect performance.
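
The effect of the time-scale separation can be illustrated by comparing the full dynamics of Equation 1 against the reduced dynamics of Equation 4 on the XOR input (1, 0). The sketch below is a forward-Euler comparison under assumed values (`beta = 4`, τv = 1, and the listed step size); setting `tau_h=None` is a convention of this sketch meaning that the hidden neurons are placed at their steady state at every step.

```python
import numpy as np

XI = np.array([[0, 0, 0],
               [0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]], dtype=float)
B = -0.5 * (XI ** 2).sum(axis=1)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def v3_trajectory(tau_h, x1=1.0, x2=0.0, beta=4.0, tau_v=1.0, dt=1e-3, steps=5000):
    """Trajectory of the output neuron; tau_h=None selects the adiabatic limit."""
    v = np.array([x1, x2, 0.5])
    h = np.zeros(4)
    traj = np.empty(steps)
    for k in range(steps):
        if tau_h is None:
            h = XI @ v + B                        # hidden neurons at steady state
        f = softmax(beta * h)
        dv3 = (XI[:, 2] @ f - v[2]) / tau_v
        if tau_h is not None:
            h = h + dt * (XI @ v + B - h) / tau_h # finite hidden time constant
        v[2] += dt * dv3
        traj[k] = v[2]
    return traj

reduced = v3_trajectory(None)
err_slow = np.max(np.abs(v3_trajectory(0.1) - reduced))    # tau_h = tau_v / 10
err_fast = np.max(np.abs(v3_trajectory(0.01) - reduced))   # tau_h = tau_v / 100
```

Under these assumptions, the trajectory with the smaller τh stays closer to the reduced dynamics, while both settings end on the same side of the decision threshold.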

5.2    Hamming (7,4) code
The Hamming (7,4) code is an error-correcting code that encodes 4 data bits into a 7-bit codeword by
adding 3 parity bits. The resulting 7-bit strings are special: only certain patterns are valid codewords,
and they are spaced apart so that if a single bit is flipped, the error can be detected and corrected [30].
Table 1 lists the 16 codewords, one for each setting of the four data bits.


[Figure 4 panels: visible neuron activations, hidden neuron activations, and energy, each plotted against time (s).]

Figure 4: Correcting a bit error in a Hamming (7,4) code. Visible neuron g5 flips, indicating the
bit flip error happened on the 5th codeword bit. f7 is the hidden neuron corresponding to the memory
of the correct codeword. Thin lines correspond to the other neuron activations.

Data bits (d1 d2 d3 d4 )       Codeword (c1 c2 c3 c4 c5 c6 c7 )
0000                           0000000
0001                           0001111
0010                           0010110
0011                           0011001
0100                           0100101
0101                           0101010
0110                           0110011
0111                           0111100
1000                           1000011
1001                           1001100
1010                           1010101
1011                           1011010
1100                           1100110
1101                           1101001
1110                           1110000
1111                           1111111

Table 1: Valid codewords of the Hamming(7,4) code, ordered by their 4-bit data payload.


    Unlike the XOR case where the only evolving neuron is the readout bit, the Hamming (7,4) code may
require flipping the value of any one of the visible neurons. During inference, the visible neurons are
initialized to the corrupted 7-bit input word. All neurons are left free to evolve, and the dynamics relax
the state toward the nearest stored codeword. Energy minima are located at the valid codewords, so the
network converges to the correct codeword provided the corrupted word lies within Hamming distance 1
of it. Thus, the DenseAM replicates the standard decoding property of the Hamming (7,4) code: any
single-bit flip is corrected automatically. Figure 4 illustrates a case where a flipped bit g5 is restored
during convergence.
    The Hamming (7,4) model’s 7 visible neurons, each corresponding to a codeword bit, are connected
to 16 hidden neurons, each representing one valid codeword. The weight matrix $\xi \in \{0, 1\}^{16 \times 7}$
is formed by stacking the 16 codewords as its rows. Visible neurons have the identity activation, hidden
neurons use a softmax activation, and biases are chosen as in the XOR case to give the same
integrated-out visible dynamics as (4).
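
The decoding behavior described above can be sketched numerically. The following is a minimal simulation, not the circuit-level model: it assumes the integrated-out visible dynamics take the form τv dv/dt = ξ⊤ softmax(β(ξv + bias)) − v with hidden bias −|ξµ|²/2, so that the softmax favors the codeword nearest to v. That bias choice, the value β = 8, the forward-Euler step, and the names `decode`, `XI`, and `CODEWORDS` are illustrative assumptions; the codewords themselves are those of Table 1.

```python
import numpy as np

# Codewords from Table 1, ordered by their 4-bit data payload.
CODEWORDS = [
    "0000000", "0001111", "0010110", "0011001",
    "0100101", "0101010", "0110011", "0111100",
    "1000011", "1001100", "1010101", "1011010",
    "1100110", "1101001", "1110000", "1111111",
]
# Weight matrix: one row per codeword, shape (16, 7).
XI = np.array([[int(bit) for bit in word] for word in CODEWORDS], dtype=float)


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


def decode(word, beta=8.0, tau_v=1.0, dt=0.01, steps=500):
    """Relax a 7-bit word toward the nearest stored codeword (forward Euler)."""
    v = np.asarray(word, dtype=float).copy()
    # Assumed hidden bias: -|xi_mu|^2 / 2, making the hidden input a
    # negative squared distance to codeword mu (up to a shared offset).
    bias = -0.5 * np.sum(XI ** 2, axis=1)
    for _ in range(steps):
        f = softmax(beta * (XI @ v + bias))   # hidden activations
        v += dt * (XI.T @ f - v) / tau_v      # visible update
    return (v > 0.5).astype(int)              # threshold the analog state


sent = XI[7].astype(int)     # an arbitrary codeword from Table 1 (0111100)
received = sent.copy()
received[4] ^= 1             # flip the 5th codeword bit
recovered = decode(received)
```

Under these assumptions `recovered` matches `sent`; a larger β sharpens the softmax and makes the pull toward the nearest codeword more decisive.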


6     Analog Energy Transformer (Analog ET) via DenseAM
Our DenseAM circuit construction can be used to build more complex energy-based models, such as
the transformer-like architecture proposed in the Energy Transformer paper [12]. For causal next-token
prediction with a single attention head, the Energy Transformer’s energy function can be written as the
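following (see Appendix J for the full derivation):
$$
E = \tfrac{1}{2}\lVert v - a \rVert^2
  - v^\top \left( \xi^{\mathrm{attn}\top} f^{\mathrm{attn}} + \xi^{\mathrm{hopf}\top} f^{\mathrm{hopf}} \right)
  + f^{\mathrm{attn}\top} \left( h^{\mathrm{attn}} - b \right)
  + f^{\mathrm{hopf}\top} \left( h^{\mathrm{hopf}} - c \right)
  - L^{\mathrm{attn}}\left( h^{\mathrm{attn}} \right)
  - L^{\mathrm{hopf}}\left( h^{\mathrm{hopf}} \right)
\tag{6}
$$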

with the activation functions and their Lagrangians defined as
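$$
f^{\mathrm{attn}}_A = \mathrm{softmax}\left( \beta h^{\mathrm{attn}} \right)_A, \qquad
L^{\mathrm{attn}}(h) = \frac{1}{\beta} \log \sum_{A=1}^{L} e^{\beta h_A}
\tag{7}
$$
$$
f^{\mathrm{hopf}}_\mu = \mathrm{ReLU}\left( h^{\mathrm{hopf}}_\mu \right), \qquad
L^{\mathrm{hopf}}(h) = \frac{1}{2} \sum_{\mu=1}^{M} \left[ \mathrm{ReLU}(h_\mu) \right]^2
\tag{8}
$$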

where a, b, and c correspond to the biases of the visible neurons, attention hidden neurons, and Hopfield
network hidden neurons, respectively. The L context tokens are indexed by A, and the M hidden neurons
of the Hopfield network are indexed by µ. Because the visible units use an identity activation function
(gi = vi in the language of Equation 1), the gradient flow of the energy yields the dynamics (9)-(11).

Figure 5: Analog ET circuit demonstrating the autoregressive inference procedure. A newly inferred
token is decoded, sampled, and re-embedded to obtain the weight vector $\xi^{\mathrm{attn}}_{L+1}$, which is set as the
weight vector for a new hidden neuron $h^{\mathrm{attn}}_{L+1}$ in the energy attention block (light gray on right). For
this layout we have flipped the crossbar array, so that indices A and µ run horizontally and index i runs
vertically.

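$$
\tau_v \dot{v} = -\frac{\partial E}{\partial v}
  = \xi^{\mathrm{attn}\top} f^{\mathrm{attn}} + \xi^{\mathrm{hopf}\top} f^{\mathrm{hopf}} + a - v
\tag{9}
$$
$$
\tau_h \dot{h}^{\mathrm{attn}} = -\frac{\partial E}{\partial f^{\mathrm{attn}}}
  = \xi^{\mathrm{attn}} v + b - h^{\mathrm{attn}}
\tag{10}
$$
$$
\tau_h \dot{h}^{\mathrm{hopf}} = -\frac{\partial E}{\partial f^{\mathrm{hopf}}}
  = \xi^{\mathrm{hopf}} v + c - h^{\mathrm{hopf}}
\tag{11}
$$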
In this formulation, v represents the embedding of the output (next) token, and its evolution is driven by
two terms: one from the energy attention with weights $\xi^{\mathrm{attn}}$ and hidden neuron activations $f^{\mathrm{attn}}$,
and one from the Hopfield network with weights $\xi^{\mathrm{hopf}}$ and hidden neuron activations $f^{\mathrm{hopf}}$. The
weights of the energy attention DenseAM depend on the context: for token dimension D, context
length L, and the task of predicting the token at index L + 1, the weights $\xi^{\mathrm{attn}} \in \mathbb{R}^{L \times D}$ are generated
by applying a learned embedding matrix to each context token. In contrast, the Hopfield network
weights $\xi^{\mathrm{hopf}}$ are learned during training and fixed at inference. The number of memories in the
Hopfield network is a hyperparameter M, so that $\xi^{\mathrm{hopf}} \in \mathbb{R}^{M \times D}$.
    This system suggests a hardware implementation where v interacts with two independent DenseAMs,
one for the energy attention and one for the Hopfield term, which can share the same physical crossbar
structure. Figure 5 shows that the circuit structure remains a crossbar array (like Figure 1), but with
two distinct classes of hidden neurons. Because of the summation of currents along each row of the
crossbar array, the incoming current to visible neuron vi is the sum of contributions from the energy
attention block and from the Hopfield network block. The energy attention hidden neurons hattn use a
softmax activation function, while the Hopfield network hidden neurons hhopf use a ReLU activation.
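
As an illustration of the coupled dynamics (9)-(11) and the energy (6), the sketch below integrates the three equations with forward Euler and records the energy along the trajectory. It is not the authors' simulation (the paper uses Diffrax): the random weights, zero biases, time constants, step size, the small scale on the Hopfield weights, and the choice to initialize each hidden state at its instantaneous drive are all assumptions made for the example, as are the function names `et_energy` and `simulate`.

```python
import numpy as np


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


def et_energy(v, h_attn, h_hopf, xi_attn, xi_hopf, a, b, c, beta):
    """Energy (6) with the Lagrangians (7) and (8)."""
    f_attn = softmax(beta * h_attn)
    f_hopf = np.maximum(h_hopf, 0.0)
    m = np.max(beta * h_attn)  # shift for a stable log-sum-exp
    lag_attn = (m + np.log(np.sum(np.exp(beta * h_attn - m)))) / beta
    lag_hopf = 0.5 * np.sum(f_hopf ** 2)
    return (0.5 * np.sum((v - a) ** 2)
            - v @ (xi_attn.T @ f_attn + xi_hopf.T @ f_hopf)
            + f_attn @ (h_attn - b) + f_hopf @ (h_hopf - c)
            - lag_attn - lag_hopf)


def simulate(xi_attn, xi_hopf, a, b, c, v0, beta=1.0,
             tau_v=1.0, tau_h=0.1, dt=0.01, steps=2000):
    """Forward-Euler integration of (9)-(11); returns final v and the energy trace."""
    v = v0.copy()
    h_attn = xi_attn @ v + b   # assumed initialization: hidden states at their drive
    h_hopf = xi_hopf @ v + c
    energies = np.empty(steps + 1)
    for k in range(steps):
        energies[k] = et_energy(v, h_attn, h_hopf, xi_attn, xi_hopf, a, b, c, beta)
        f_attn = softmax(beta * h_attn)
        f_hopf = np.maximum(h_hopf, 0.0)
        dv = (xi_attn.T @ f_attn + xi_hopf.T @ f_hopf + a - v) / tau_v   # (9)
        dh_attn = (xi_attn @ v + b - h_attn) / tau_h                     # (10)
        dh_hopf = (xi_hopf @ v + c - h_hopf) / tau_h                     # (11)
        v = v + dt * dv
        h_attn = h_attn + dt * dh_attn
        h_hopf = h_hopf + dt * dh_hopf
    energies[steps] = et_energy(v, h_attn, h_hopf, xi_attn, xi_hopf, a, b, c, beta)
    return v, energies


rng = np.random.default_rng(0)
L, M, D = 5, 6, 4                          # context length, memories, token dimension
xi_attn = rng.normal(size=(L, D))          # stand-in for embedded context tokens
xi_hopf = 0.1 * rng.normal(size=(M, D))    # small scale keeps the ReLU block bounded
a, b, c = np.zeros(D), np.zeros(L), np.zeros(M)
v_final, energies = simulate(xi_attn, xi_hopf, a, b, c, v0=rng.normal(size=D))
```

Both blocks feed the same visible update, mirroring the summation of currents along each crossbar row; only the activation applied to the hidden state differs between them.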

6.1    Analog Energy Transformer on the parity task
We build and evaluate the Analog ET on the L-bit parity task, which can be thought of as an elementary
“language model”: given bits $\mathrm{bit}_1, \ldots, \mathrm{bit}_L$, predict
$\mathrm{bit}_{L+1} = \left( \sum_{A=1}^{L} \mathrm{bit}_A \right) \bmod 2$. Parity is instructive
because it requires a representation of a global, order-L interaction, precluding linear and shallow models
from representing it efficiently. A successful model must be able to form high-order interactions in order
to generalize. We formulate parity as a next-token prediction problem: given an L-bit string as context,
predict its parity in the next token.
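
For concreteness, the parity task can be written out as a dataset. The sketch below enumerates all L-bit strings with their parity labels and holds out a random subset for validation. The hold-out size of 52 is the figure reported for the 8-bit experiment; the random split, the seed, and the function names `parity_dataset` and `train_val_split` are assumptions for illustration, since the exact split procedure is not described here.

```python
import itertools

import numpy as np


def parity_dataset(num_bits):
    """All bit strings of length num_bits (lexicographic order) and their parities."""
    bits = np.array(list(itertools.product([0, 1], repeat=num_bits)), dtype=np.int64)
    labels = bits.sum(axis=1) % 2   # next "token" = sum of context bits mod 2
    return bits, labels


def train_val_split(bits, labels, num_val, seed=0):
    """Random hold-out split (assumed; the split procedure is not specified)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(bits))
    val, train = perm[:num_val], perm[num_val:]
    return (bits[train], labels[train]), (bits[val], labels[val])


bits, labels = parity_dataset(8)
(train_x, train_y), (val_x, val_y) = train_val_split(bits, labels, num_val=52)
```

With 8 bits there are 256 strings in total, so a 52-string hold-out leaves 204 strings for training.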
    We train the Analog ET model digitally using backpropagation through time [31], implemented with
JAX’s automatic differentiation. The resulting weights can be deployed onto the analog hardware; in
our experiments we simulate the dynamics of the hardware with the Diffrax [32] ODE solver library. On
the 8-bit parity task, our model achieves 100% accuracy on the hold-out validation set of 52 bit strings,
demonstrating clear generalization capabilities. See Appendix H.1 for more details on training and model
design.

[Figure 6 image. Two columns, one per example input (panel titles: 11001010 with parity 0, and
01000110 with parity 1). Rows: visible neurons, prediction, energy. Horizontal axis: time t.]

Figure 6: Inference of parity Analog ET on two example 8-bit strings. The top row plots the visible
neurons vi over time, the middle row plots the decoded token prediction, and the bottom row plots the
energy, which decreases monotonically during inference. After a transient period of computation, the
network arrives at a steady state, making the result of the computation robust against the precise timing
of the readout.

    Figure 6 shows the dynamics of the visible neurons and energy during two example inference runs
of the Analog ET. Notably, the visible neuron values are constant by the end of the inference period,
so the readout is insensitive to mismatch and delay in its timing. A single sample-and-hold and
switching circuit would enable a single analog-to-digital converter (ADC) to read out all the visible
neurons at convergence, significantly reducing mismatch and saving device area, complexity, and
energy. The intrinsic stability of attractor points arises from the continuous-time dynamics of the
DenseAM, making these models particularly well suited to analog hardware.

6.2    Autoregressive inference
Dashed lines in Figure 5 illustrate the autoregressive inference procedure of the Analog ET. To generate
the L-th token given context tokens $x^{(1)}, \ldots, x^{(L-1)}$, each token is first embedded and concatenated to
form the attention weight matrix
$$
\xi^{\mathrm{attn},(L-1)} =
\begin{bmatrix} e^{(1)} \\ e^{(2)} \\ \vdots \\ e^{(L-1)} \end{bmatrix}
\in \mathbb{R}^{(L-1) \times D}
$$

These rows are loaded into the Analog ET’s energy attention weight matrix $\xi^{\mathrm{attn}}$ by programming the
corresponding crossbar resistances. During inference, the visible state v(t) evolves according to the
Analog ET dynamics until convergence. A decoder readout (e.g., a linear layer) applied to the converged
v(t = T) values produces logits, from which the next token $x^{(L)}$ is sampled. This token is then embedded
to form $e^{(L)}$, and appended to the existing context. The cycle repeats with the updated attention weight
matrix
$$
\xi^{\mathrm{attn},(L)} =
\begin{bmatrix} \xi^{\mathrm{attn},(L-1)} \\ e^{(L)} \end{bmatrix}
\in \mathbb{R}^{L \times D}
$$

which now includes the new embedding $e^{(L)}$. In hardware, this corresponds to connecting an additional
hidden neuron in the energy attention block of Figure 5, and setting its resistive weights with $e^{(L)}$.
Because the physical order of hidden neurons does not affect the energy function, this new neuron can
be placed in any position among the hidden neurons. When the context length is fixed, the hidden
neuron corresponding to the earliest token can simply be reprogrammed with the new vector of weights
$e^{(L)}$, resulting in the hardware equivalent of a sliding-window context. In practice, an external digital
controller, e.g., a Field-Programmable Gate Array (FPGA) or Application-Specific Integrated Circuit
(ASIC), would orchestrate crossbar programming and token decoding, while the DenseAM dynamics
perform the far more substantial workload of computing each next-token embedding.
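
The control loop that such a digital controller would run can be sketched in software. The example below is a simplified stand-in, not the Analog ET itself: `relax` keeps only the attention block in the adiabatic limit (the Hopfield block and all biases are omitted), the visible state is initialized at the most recent embedding, and the embedding table `embed` and linear decoder `decode` are random placeholders. All of these, along with the function names, are assumptions for illustration.

```python
import numpy as np


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


def relax(xi_attn, v0, beta=1.0, tau_v=1.0, dt=0.05, steps=400):
    """Attention-only relaxation of v (simplified stand-in for the analog dynamics)."""
    v = v0.copy()
    for _ in range(steps):
        f = softmax(beta * (xi_attn @ v))
        v = v + dt * (xi_attn.T @ f - v) / tau_v
    return v


def generate(tokens, embed, decode, num_new, max_context=None, seed=0):
    """Autoregressive rollout; returns all tokens and the crossbar rows used per step."""
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    rows = [embed[t] for t in tokens]        # one attention hidden neuron per token
    rows_used = []
    for _ in range(num_new):
        if max_context is not None:
            rows = rows[-max_context:]       # sliding window: earliest rows are reused
        xi_attn = np.stack(rows)             # program the crossbar rows
        rows_used.append(len(rows))
        v = relax(xi_attn, v0=rows[-1])      # assumed initial state: latest embedding
        probs = softmax(decode @ v)          # decoder readout -> token distribution
        new_token = int(rng.choice(len(probs), p=probs))
        tokens.append(new_token)
        rows.append(embed[new_token])        # row-append: connect a new hidden neuron
    return tokens, rows_used


rng = np.random.default_rng(1)
vocab_size, dim = 4, 3
embed = rng.normal(size=(vocab_size, dim))    # hypothetical embedding table
decode = rng.normal(size=(vocab_size, dim))   # hypothetical linear decoder
out_tokens, rows_used = generate([0, 1, 2], embed, decode, num_new=5, max_context=4)
```

Because the order of rows does not affect the energy, truncating to the most recent `max_context` rows is equivalent to reprogramming the neuron that held the earliest token.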
    This procedure is analogous to key-value (KV) caching in standard transformer inference [33]. Context
tokens $x^{(1)}, \ldots, x^{(L-1)}$ produce key and value vectors $k^{(1)}, \ldots, k^{(L-1)}$ and $v^{(1)}, \ldots, v^{(L-1)}$, respectively.
When a new token $x^{(L)}$ is generated, its corresponding $k^{(L)}$ and $v^{(L)}$ vectors are appended to the cache,
allowing all previous $k^{(<L)}$ and $v^{(<L)}$ to be reused without recomputation. When the key and value
matrices are tied so that $k^{(A)} = v^{(A)}$, the ET’s row-append operation is equivalent to the standard
KV-cache update. The ET performs an autoregressive rollout that reproduces the same recurrence structure
as KV-cached transformer inference, but implemented physically through the addition of new neurons
and weights without touching existing hardware. For a formal derivation of the equivalence between ET
attention and conventional attention with tied keys and values, see [12].


7     Scaling properties
Inference time and energy consumption are crucial characteristics of our system. This section investigates
these metrics with respect to the network size.

7.1      Inference time scaling
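We consider the model defined by (4) and (5). In the adiabatic limit (τh → 0), which is satisfied by our
hardware implementation, the time derivative of the energy can be written as
$$
\frac{dE^{\mathrm{eff}}}{dt}
  = \sum_{i=1}^{N_v} \frac{\partial E^{\mathrm{eff}}}{\partial v_i} \frac{dv_i}{dt}
  = -\frac{1}{\tau_v} \sum_{i=1}^{N_v} \left( \frac{\partial E^{\mathrm{eff}}}{\partial v_i} \right)^2
  \sim -\frac{N_v}{\tau_v}
\tag{12}
$$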

This derivative is never positive, since the dynamical system performs gradient descent on the
energy landscape, and it vanishes once the network state vector v converges to the steady state.
Since the state vector v is typically initialized in the vicinity of the memory vectors, which
are chosen to be of order one (∼ 1), the right-hand side of (4) is of order one too, independent of the
network size. Each of the Nv terms in the sum therefore contributes an order-one amount, which gives
the characteristic value of the temporal derivative shown in (12).
    At the same time, the typical value¹ of the energy (5) is
$$
|E^{\mathrm{eff}}| \sim N_v + \frac{1}{\beta} \log(N_h)
\tag{13}
$$
During the inference dynamics the network is initialized in a high-energy state, which has the
characteristic value of energy (13), and performs energy descent to a lower value of the energy (which
has a similar order of magnitude). To estimate the scaling of the time required to perform this energy
descent, one can take the ratio of the energy drop to the rate of energy decrease (12). This gives the
following estimate
$$
T^{\mathrm{conv}} \sim \frac{|E^{\mathrm{eff}}|}{\left| \frac{dE}{dt} \right|}
  \sim \tau_v \left( 1 + \frac{1}{\beta} \frac{\log(N_h)}{N_v} \right)
  \sim \tau_v
\tag{14}
$$

The last ∼ sign holds since in none of the designs presented here does Nh grow super-exponentially in
Nv . In fact, in all the use cases Nh is sub-exponential in Nv .
   ¹ We estimate the absolute value of the energy, since it can be both positive and negative depending
on the mutual arrangement of memories, the state vector, and the number of hidden units.

    This back-of-the-envelope estimation provides the core intuition behind the scaling relationship.
The inference time is constant, and independent of the size of the network. A more careful analysis
(Appendix E) shows that in the high-β regime the worst-case dependence is
$O\left( \frac{\tau_v}{\beta} \frac{\log N_h}{N_v} \right)$, which remains bounded for all architectures we consider.
Thus, for our settings the convergence time is effectively constant in Nv and Nh. Based on amplifier
gain–bandwidth, slew-rate, and output-current constraints, we estimate achievable inference times of
tens to hundreds of nanoseconds using existing CMOS technology (see Appendix I.2).
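
To make the estimate (14) concrete, the short sketch below evaluates τv (1 + log(Nh) / (β Nv)) for a few network sizes. It only evaluates the back-of-the-envelope formula and is not a measurement; the function name `convergence_time_estimate`, the choice β = τv = 1, and the example growth laws for Nh are assumptions for illustration.

```python
import math


def convergence_time_estimate(n_v, n_h, beta=1.0, tau_v=1.0):
    """Evaluate the scaling estimate (14): tau_v * (1 + log(N_h) / (beta * N_v))."""
    return tau_v * (1.0 + math.log(n_h) / (beta * n_v))


# Hamming (7,4) model: 7 visible neurons, 16 hidden neurons.
t_hamming = convergence_time_estimate(7, 16)

# Polynomial growth N_h = N_v^2: the correction term shrinks as the network grows.
t_poly = [convergence_time_estimate(n, n ** 2) for n in (10, 100, 1000)]

# Exponential growth N_h = 2^N_v: the estimate stays fixed at 1 + ln(2) / beta.
t_exp = [convergence_time_estimate(n, 2 ** n) for n in (10, 100, 1000)]
```

Even in the exponential case the estimate stays at a constant multiple of τv, which is why the last step of (14) holds as long as Nh grows no faster than exponentially in Nv.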

7.2    Scaling of energy consumption
We now analyze how the total inference energy scales with network size. Energy dissipation arises
primarily from (i) Ohmic loss in the resistive weights, (ii) charging of neuron-state capacitors, and (iii)
constant per-neuron overhead from amplifiers and bias currents. We show that, under bounded voltage
swings and fixed conductance budgets, total energy grows only linearly with the number of neurons.

Weight dissipation. Let the neuron output voltages be proportional to activations: u = κg and
w = κf , where κ is a fixed voltage swing. Such a bounded swing can always be enforced by global
rescaling of ξ, β, and voltage units without changing the dynamics (see Appendix F). The instantaneous
power dissipated by the resistive crossbar array is
$$
P_{\mathrm{weights}}(t) = \sum_{\mu=1}^{N_h} \sum_{i=1}^{N_v} \xi_{\mu i} \, (u_i - w_\mu)^2
\tag{15}
$$
With $0 \le g_i \le 1$, softmax $f$, and row/column conductance budgets $\sum_\mu \xi_{\mu i} \le C_c$,
$\sum_i \xi_{\mu i} \le C_r$, the total power obeys
$$
P_{\mathrm{weights}}(t) \le 2\kappa^2 (C_c N_v + C_r) = O(N_v)
\tag{16}
$$
For a runtime of duration $T \sim T^{\mathrm{conv}}$, the energy dissipated by the weights is therefore
$E_{\mathrm{weights}} = O(N_v T)$, where T ∼ 1 from subsection 7.1.
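
The bound (16) can be checked numerically on a random instance. The sketch below evaluates the crossbar power (15) and compares it with 2κ²(Cc Nv + Cr), taking Cc and Cr to be the largest column and row sums of the conductance matrix. The random conductances, activations, sizes, and the value of κ are assumptions chosen only to exercise the inequality.

```python
import numpy as np


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


def crossbar_power(xi, g, f, kappa):
    """Instantaneous power (15) with u = kappa * g and w = kappa * f."""
    u = kappa * g
    w = kappa * f
    return float(np.sum(xi * (u[None, :] - w[:, None]) ** 2))


def power_bound(xi, kappa):
    """Right-hand side of (16), using the tightest budgets for this xi."""
    c_col = xi.sum(axis=0).max()   # C_c: bound on sum over mu of xi[mu, i]
    c_row = xi.sum(axis=1).max()   # C_r: bound on sum over i of xi[mu, i]
    n_v = xi.shape[1]
    return float(2.0 * kappa ** 2 * (c_col * n_v + c_row))


rng = np.random.default_rng(0)
n_h, n_v, kappa = 32, 12, 0.5
xi = rng.uniform(0.0, 1.0, size=(n_h, n_v))   # non-negative conductances, shape (N_h, N_v)
g = rng.uniform(0.0, 1.0, size=n_v)           # visible activations in [0, 1]
f = softmax(rng.normal(size=n_h))             # softmax hidden activations
power = crossbar_power(xi, g, f, kappa)
bound = power_bound(xi, kappa)
```

The inequality follows from (u − w)² ≤ 2u² + 2w² together with Σ fµ² ≤ 1 for a softmax, so the hidden-neuron contribution is bounded by Cr regardless of Nh.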

Capacitive and overhead energy. Each neuron charges a local capacitor a finite number of times
by at most $V_{\mathrm{swing}} \sim \kappa$, giving
$$
E_{\mathrm{cap}} \le \kappa^2 \left( \sum_i C_i^{(v)} + \sum_\mu C_\mu^{(h)} \right) = O(N_v + N_h)
\tag{17}
$$

Active bias and amplifier inefficiencies contribute fixed per-neuron power, yielding
$E_{\mathrm{other}} = O((N_v + N_h) T)$.

Total energy scaling.      With bounded voltage swing and conductance budgets,

                                           Etotal = O(Nv + Nh )                                                  (18)

Hence, the total inference energy scales only linearly with system size. For the full derivation, see
Appendix G.

7.3    Scaling of hardware area
The area is dominated by two components: the area taken up by the synaptic weights, which are
implemented as a crossbar array with programmable weights, and the area taken up by the neurons
feeding the crossbar array. The area of the crossbar array scales as the number of weights, O(Nv Nh).
The area of the neurons scales as O(Nv + Nh).


8     Conclusion
In this paper, we have presented an analog accelerator architecture for Dense Associative Memories,
implemented using resistive crossbar arrays and continuous-time RC neuron dynamics. Our design
implements DenseAM inference as time evolution of a physical dynamical system, rather than a sequence
of discrete numerical update steps. We demonstrated this architecture with three representative settings
of increasing complexity: XOR, Hamming (7,4) error decoding, and an Energy Transformer-style sequence
model. These examples show that the analog DenseAM accelerator architecture covers both associative
memory tasks and attention-based sequence models.
    Our analysis shows that DenseAM accelerators enjoy favorable asymptotic scaling properties. In-
ference time is constant in the dimensions of the model size, meaning that inference time is governed
primarily by the physical time constants of the circuit. This is in sharp contrast to digital implementa-
tions of the same dynamics, whose runtime must grow at least linearly with model size.
    To assess hardware feasibility, we derived lower bounds on the neuronal time constants imposed by
amplifier gain-bandwidth product, slew rate, and output current limits in our neuron design. Reported
figures from representative CMOS OTAs in the literature give inference times on the order of tens-to-
hundreds of nanoseconds, even with conservative design margins. Combined with the constant scaling of
inference time with model size, these estimates suggest that DenseAM accelerators can match or beat
the latency of digital GPUs as models grow, without requiring exotic devices or beyond-CMOS technologies.
    Our results highlight DenseAMs as a natural abstraction for analog AI hardware. Their error cor-
recting dynamics and asymptotic stability directly address long-standing concerns about robustness and
readout timing: small perturbations are corrected by the dynamics instead of accumulated, and the final
state is stable when readout happens over a wide temporal window. At the same time, the DenseAM
framework is expressive enough to capture modern primitives such as attention and transformer-like ar-
chitectures, as illustrated by our Analog Energy Transformer construction. These properties suggest that
DenseAM-based analog accelerators may be a promising substrate for future AI systems, and motivate
further co-design of models, dynamics, and devices.

Acknowledgements
MGB would like to thank Faiz Muhammad for exploratory attempts at SPICE simulations. DK would
like to thank Kwabena Boahen for helpful discussions.


References
 [1]   Ashish Vaswani et al. “Attention is all you need”. In: arXiv preprint arXiv:1706.03762 (2017).
 [2]   Jascha Sohl-Dickstein et al. “Deep unsupervised learning using nonequilibrium thermodynamics”.
       In: International conference on machine learning. PMLR. 2015, pp. 2256–2265.
 [3]   Norman P Jouppi et al. “In-datacenter performance analysis of a tensor processing unit”. In:
       Proceedings of the 44th annual international symposium on computer architecture. 2017, pp. 1–12.
 [4]   Eric Masanet et al. “Recalibrating global data center energy-use estimates”. In: Science 367.6481
       (2020), pp. 984–986.
 [5]   David Patterson et al. “Carbon emissions and large neural network training”. In: arXiv preprint
       arXiv:2104.10350 (2021).
 [6]   Maxwell Aifer et al. “Solving the compute crisis with physics-based ASICs”. In: arXiv preprint
       arXiv:2507.10463 (2025).
 [7]   Dmitry Krotov and John J Hopfield. “Dense associative memory for pattern recognition”. In:
       Advances in neural information processing systems 29 (2016).
 [8]   Dmitry Krotov and John Hopfield. “Dense associative memory is robust to adversarial inputs”. In:
       Neural computation 30.12 (2018), pp. 3151–3167.
 [9]   John J Hopfield. “Neural networks and physical systems with emergent collective computational
       abilities.” In: Proceedings of the national academy of sciences 79.8 (1982), pp. 2554–2558.
[10]   Dmitry Krotov and John J Hopfield. “Large Associative Memory Problem in Neurobiology and
       Machine Learning”. In: International Conference on Learning Representations. 2021.
[11]   Hubert Ramsauer et al. “Hopfield networks is all you need”. In: arXiv preprint arXiv:2008.02217
       (2020).
[12]   Benjamin Hoover et al. “Energy transformer”. In: Advances in Neural Information Processing
       Systems 36 (2024).


[13]   Benjamin Hoover et al. “Memory in plain sight: A survey of the uncanny resemblances between
       diffusion models and associative memories”. In: arXiv preprint arXiv:2309.16750 (2023).
[14]   Luca Ambrogioni. “In search of dispersed memories: Generative diffusion models are associative
       memory networks”. In: arXiv preprint arXiv:2309.17290 (2023).
[15]   Bao Pham et al. “Memorization to generalization: Emergence of diffusion models from associative
       memory”. In: arXiv preprint arXiv:2505.21777 (2025).
[16]   Dmitry Krotov et al. “Modern methods in associative memory”. In: arXiv preprint arXiv:2507.06211
       (2025).
[17]   JJ Hopfield. “The effectiveness of analogue ‘neural network’ hardware”. In: Network: Computation
       in Neural Systems 1.1 (1990), p. 27.
[18]   Dmitry Krotov. “Hierarchical associative memory”. In: arXiv preprint arXiv:2107.06446 (2021).
[19]   Fei Tang and Michael Kopp. “A remark on a paper of Krotov and Hopfield [arXiv:2008.06996]”. In:
       arXiv preprint arXiv:2105.15034 (2021).
[20]   Benjamin Hoover et al. “A universal abstraction for hierarchical Hopfield networks”. In: The
       Symbiosis of Deep Learning and Differential Equations II. 2022.
[21]   John J Hopfield. “Neurons with graded response have collective computational properties like those
       of two-state neurons.” In: Proceedings of the national academy of sciences 81.10 (1984),
       pp. 3088–3092.
[22]   David W Tank and John J Hopfield. “Simple “Neural” optimization networks: an A/D converter,
       signal decision circuit, and a linear programming circuit”. In: Artificial neural networks: theoretical
       concepts. 1988, pp. 87–95.
[23]   HP Graf et al. “VLSI implementation of a neural network memory with several hundreds of
       neurons”. In: AIP conference proceedings. Vol. 151. 1. American Institute of Physics. 1986,
       pp. 182–187.
[24]   Xinjie Guo et al. “Modeling and experimental demonstration of a Hopfield network
       analog-to-digital converter with hybrid CMOS/memristor circuits”. In: Frontiers in neuroscience 9
       (2015), p. 488.
[25]   SG Hu et al. “Associative memory realized by a reconfigurable memristive Hopfield neural
       network”. In: Nature communications 6.1 (2015), p. 7522.
[26]   Sukru B Eryilmaz et al. “Brain-like associative learning using a nanoscale non-volatile phase change
       synaptic device array”. In: Frontiers in neuroscience 8 (2014), p. 205.
[27]   Brendan P Marsh et al. “Enhancing associative memory recall and storage capacity using confocal
       cavity QED”. In: Physical Review X 11.2 (2021), p. 021048.
[28]   Khalid Musa et al. “Dense Associative Memory in a Nonlinear Optical Hopfield Neural Network”.
       In: arXiv preprint arXiv:2506.07849 (2025).
[29]   Carver Mead and Mohammed Ismail. Analog VLSI implementation of neural systems. Vol. 80.
       Springer Science & Business Media, 2012.
[30]   Richard W Hamming. “Error detecting and error correcting codes”. In: The Bell system technical
       journal 29.2 (1950), pp. 147–160.
[31]   Paul J Werbos. “Backpropagation through time: what it does and how to do it”. In: Proceedings
       of the IEEE 78.10 (1990), pp. 1550–1560.
[32]   Patrick Kidger. “On Neural Differential Equations”. PhD thesis. University of Oxford, 2021.
[33]   Zihang Dai et al. “Transformer-XL: Attentive language models beyond a fixed-length context”.
       In: Proceedings of the 57th annual meeting of the association for computational linguistics. 2019,
       pp. 2978–2988.
[34]   Jacob Sillman. “Analog Implementation of the Softmax Function”. In: arXiv preprint arXiv:2305.13649
       (2023).
[35]   John J Hopfield and David W Tank. “Computing with neural circuits: A model”. In: Science
       233.4764 (1986), pp. 625–633.
[36]   Aldo Pena Perez and Franco Maloberti. “Performance enhanced op-amp for 65nm CMOS
       technologies and below”. In: 2012 IEEE International Symposium on Circuits and Systems (ISCAS).
       IEEE. 2012, pp. 201–204.
[37]   Rida S Assaad and Jose Silva-Martinez. “The recycling folded cascode: A general enhancement of
       the folded cascode amplifier”. In: IEEE Journal of Solid-State Circuits 44.9 (2009), pp. 2535–2542.
[38]   Alec Yen and Benjamin J Blalock. “A High Slew Rate, Low Power, Compact Operational
       Amplifier Based on the Super-Class AB Recycling Folded Cascode”. In: 2020 IEEE 63rd International
       Midwest Symposium on Circuits and Systems (MWSCAS). IEEE. 2020, pp. 9–12.
[39]   Mohammad H Naderi, Suraj Prakash, and Jose Silva-Martinez. “Operational transconductance
       amplifier with class-B slew-rate boosting for fast high-performance switched-capacitor circuits”.
       In: IEEE Transactions on Circuits and Systems I: Regular Papers 65.11 (2018), pp. 3769–3779.
[40]   Franz Schlögl and Horst Zimmermann. “A design example of a 65 nm CMOS operational amplifier”.
       In: International Journal of Circuit Theory and Applications 35.3 (2007), pp. 343–354.


Figure 7: Circuit for a single neuron.


A      Neuron Design
Figure 7 shows the circuit design of a single neuron, with labels corresponding to this being a hidden
neuron at index µ. We derive the dynamics of the neuron internal state hµ and activation output voltage
fµ . We proceed using only Kirchhoff’s Current Law (KCL) and the definition of an ideal op-amp.

Assumptions and conventions.
    • Ideal op-amps: infinite open-loop gain, infinite input impedance (no input current), zero output
      impedance. Under stable negative feedback this enforces a virtual short V+ = V− .
    • Current Jµ : we define Jµ as the current which flows from fµ to mµ through R1 .

    • Op-amp input labels: We denote the inverting and noninverting inputs of each op-amp explicitly,
      e.g. U 2− for the inverting input of U2, U 3+ for the noninverting input of U3, etc.
    • Node labels: Label mµ as the output of U1, sµ as the output of U2, and dµ as the output of U3.
      The neuron pre-activation state is labeled hµ , and the post-activation state is labeled fµ . Voltage
      bµ (as an ideal voltage source) drives the bias for this neuron. Voltages hµ , bµ , and fµ correspond
      directly to the state variables in equation (1).


Block U1: buffer of activation voltage fµ . Op-amp U1 buffers the output of the activation function
f (·) and drives the output of the neuron, fµ . Because no current can flow into U 1− , all the current
flowing into this neuron must flow through R1 to mµ and is sourced or sunk by U1’s output node.

Block U2: non-inverting stage producing sµ from fµ and mµ . The positive input of U2 is
$U2^{+} = f_\mu$, and by U2’s virtual short, the negative input $U2^{-} = U2^{+} = f_\mu$. By KCL at $U2^{-}$,
$$
\frac{U2^{-}}{R_{10}} = \frac{s_\mu - U2^{-}}{R_9}
\;\Rightarrow\;
s_\mu = \left( 1 + \frac{R_9}{R_{10}} \right) f_\mu
\tag{19}
$$

Block U3: non-inverting stage producing dµ from sµ , bµ , and mµ . By KCL at the positive input
of U3,
$$
\frac{b_\mu - U3^{+}}{R_3} + \frac{s_\mu - U3^{+}}{R_4} = \frac{U3^{+}}{R_5}
\;\Rightarrow\;
U3^{+} = \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4}
\tag{20}
$$
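KCL at the negative input of U3 gives us
$$
\frac{m_\mu - U3^{-}}{R_6} + \frac{-U3^{-}}{R_7} = \frac{U3^{-} - d_\mu}{R_8}
\;\Rightarrow\;
d_\mu = U3^{-} \left[ 1 + R_8 \left( \frac{1}{R_6} + \frac{1}{R_7} \right) \right] - \frac{R_8 m_\mu}{R_6}
\tag{21}
$$
The virtual short of U3 means $U3^{-} = U3^{+}$. Combining equations (20) and (21), we get
$$
d_\mu = \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7}
  \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4}
  - \frac{R_8}{R_6} m_\mu
\tag{22}
$$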
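
The algebra leading to (22) can be cross-checked numerically. The sketch below solves the two KCL node equations of the U3 stage as a small linear system, under the same ideal op-amp assumptions, and compares the result with the closed form (22). The resistor and voltage values are arbitrary test numbers, and the function names are hypothetical.

```python
import numpy as np


def u3_output_closed_form(b, s, m, R3, R4, R5, R6, R7, R8):
    """Output d of U3 according to equation (22)."""
    gain = (R6 * R7 + R8 * (R6 + R7)) / (R6 * R7)
    u3_plus = (R4 * R5 * b + R3 * R5 * s) / (R4 * R5 + R3 * R5 + R3 * R4)
    return gain * u3_plus - (R8 / R6) * m


def u3_output_from_kcl(b, s, m, R3, R4, R5, R6, R7, R8):
    """Solve KCL at both U3 inputs directly, with the virtual short U3+ = U3- = x."""
    # Unknowns: x (input node voltage) and d (output of U3).
    # Node U3+: (b - x)/R3 + (s - x)/R4 - x/R5 = 0
    # Node U3-: (m - x)/R6 - x/R7 - (x - d)/R8 = 0
    A = np.array([
        [-(1 / R3 + 1 / R4 + 1 / R5), 0.0],
        [-(1 / R6 + 1 / R7 + 1 / R8), 1 / R8],
    ])
    rhs = np.array([-(b / R3 + s / R4), -m / R6])
    x, d = np.linalg.solve(A, rhs)
    return d


# Arbitrary test values: voltages in volts, resistances in ohms.
b, s, m = 0.3, 0.8, 0.5
resistors = dict(R3=1.0e3, R4=2.2e3, R5=4.7e3, R6=1.5e3, R7=3.3e3, R8=6.8e3)
d_closed = u3_output_closed_form(b, s, m, **resistors)
d_kcl = u3_output_from_kcl(b, s, m, **resistors)

# With R3 = ... = R8 equal, (22) collapses to d = b + s - m.
d_equal = u3_output_closed_form(b, s, m, *([1.0e3] * 6))
```

The equal-resistor case shows the role of the stage: it adds the bias and the scaled activation and subtracts the sensed node voltage mµ.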

Dynamics of RC circuit. R2 and C1 form an RC circuit driven by voltage dµ . The voltage across
the capacitor hµ follows the relation
\[
R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + d_\mu
= -h_\mu + \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} m_\mu \tag{23}
\]
With incoming current. Take the incoming current $J_\mu = \sum_i \xi_{\mu i} (g_i - f_\mu)$. This produces a voltage
drop across R1 such that $m_\mu = f_\mu - R_1 J_\mu = f_\mu - R_1 \sum_i \xi_{\mu i} (g_i - f_\mu)$. Then, the dynamics of $h_\mu$ from
equation (23) are
\[
R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} \left(f_\mu - R_1 J_\mu\right) \tag{24}
\]
Substituting in sµ from equation (19) and Jµ :
                                                                  
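\[
R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 \left(1 + \frac{R_9}{R_{10}}\right) f_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} \left( f_\mu - R_1 \sum_i \xi_{\mu i} (g_i - f_\mu) \right) \tag{25}
\]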

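Equal-resistance special case. Set R1 = R3 = R4 = R5 = R6 = R7 = R8 . Then, with the weights $\xi_{\mu i}$
expressed in units of $1/R_1$ (so that the prefactor $R_1$ of the crossbar sum equals one), equation (25)
reduces to
\[
R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + b_\mu + \frac{R_9}{R_{10}} f_\mu + \sum_i \xi_{\mu i} (g_i - f_\mu) \tag{26}
\]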


Selection of R9/R10 self-term gain. Evidently, in order to match the form of equation (1), we need
to cancel the $-f_\mu \sum_i \xi_{\mu i}$ term that appears on the right hand side of equation (26). The R9/R10 term
allows us to do that by setting
\[
\frac{R_9}{R_{10}} = \sum_i \xi_{\mu i} \tag{27}
\]

Taking equation (27)’s assignment to R9 and R10 simplifies equation (26) into
\[
R_2 C_1 \frac{dh_\mu}{dt} = \sum_i \xi_{\mu i} g_i - h_\mu + b_\mu \tag{28}
\]
which exactly matches our desired dynamics.
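
As a numerical sanity check of the reduction from equation (25) to equation (28), the following Python sketch evaluates both right-hand sides for a single hidden neuron. It assumes units in which the common resistance equals 1, so that the factor R1 multiplying the crossbar sum is numerically one; the function names and the test values are illustrative and do not come from the circuit in Figure 7.

```python
def rhs_eq25(h, b, f, g, xi, R, r9_over_r10):
    """Right-hand side of equation (25), i.e. R2*C1*dh/dt, for one hidden neuron.

    R is a dict with keys R1, R3, ..., R8; xi holds the crossbar conductances.
    """
    J = sum(x * (gi - f) for x, gi in zip(xi, g))      # incoming crossbar current
    s = (1.0 + r9_over_r10) * f                         # equation (19)
    u3 = (R["R4"] * R["R5"] * b + R["R3"] * R["R5"] * s) / (
        R["R4"] * R["R5"] + R["R3"] * R["R5"] + R["R3"] * R["R4"]
    )                                                   # equation (20)
    gain = (R["R6"] * R["R7"] + R["R8"] * (R["R6"] + R["R7"])) / (R["R6"] * R["R7"])
    m = f - R["R1"] * J                                 # voltage drop across R1
    return -h + gain * u3 - (R["R8"] / R["R6"]) * m


def rhs_eq28(h, b, g, xi):
    """Right-hand side of equation (28): sum_i xi_i * g_i - h + b."""
    return sum(x * gi for x, gi in zip(xi, g)) - h + b


# Illustrative values (assumptions, not taken from the paper).
R = {name: 1.0 for name in ("R1", "R3", "R4", "R5", "R6", "R7", "R8")}
xi = [0.2, 0.5, 0.1]
g = [1.0, 0.0, 0.7]
h, b, f = 0.3, -0.4, 0.6

value_25 = rhs_eq25(h, b, f, g, xi, R, r9_over_r10=sum(xi))  # gain from equation (27)
value_28 = rhs_eq28(h, b, g, xi)
print(abs(value_25 - value_28) < 1e-12)
```

Choosing any other value for `r9_over_r10` leaves a residual term proportional to the neuron's own output `f`, which is the self-term that equation (27) is designed to cancel.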


Figure 8: Crossbar Array. Each pentagon contains a neuron of the design shown in Figure 7. In this layout we
have flipped the crossbar array, so that index µ runs horizontally and index i runs vertically.


A.1     Activation function
The voltage across C1 gives us the dynamics of the neuron internal state hµ . Figure 7 contains a block
representing a nonlinear amplifier, denoted f (·), whose input is hµ and whose output is fµ = f (hµ ). This
voltage is buffered with U1 onto the neuron output line, labeled fµ , which is what other neurons “see”
in the crossbar array. The chosen activation function does not affect the rest of the dynamics of the
neuron. Particularly, the activation function need not be element-wise: a vector-wise activation function
like softmax can be readily applied instead.

A.2     Neurons interacting in a network
So far we have examined the dynamics of a single neuron, treating as an assumption that the neuron will
receive an incoming current $J_\mu = \sum_i \xi_{\mu i} (g_i - f_\mu)$. Now, we will show how to wire these neurons together
to realize this. Figure 8 shows the simplest DenseAM construction where each pentagonal node is a
circuit of the design in Figure 7. Each neuron exposes a single node whose voltage is driven at the activation
of the neuron, and which accepts an incoming current which it uses to drive its dynamics. Each hidden
neuron $f_\mu$ is connected to a visible neuron $g_i$ via a resistance $R_{\mu i} = 1/\xi_{\mu i}$ that is the inverse of the weight
it represents. The current flowing into node $f_\mu$ is $J_\mu = \sum_i \frac{1}{R_{\mu i}} (g_i - f_\mu)$, which is the assumption needed
for equation (24). This same analysis holds for other hidden and visible neurons, and so together they
realize the large dynamical system of (1).
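
The wiring rule above can be sketched in a few lines of Python: each weight is turned into a crosspoint resistance, and Ohm's law at every crosspoint is summed to obtain the current entering each hidden node. This is a minimal illustration with ideal resistors and strictly positive weights; the function name and the numbers are assumptions made for the example.

```python
def crossbar_currents(xi, g, f):
    """Current J_mu flowing into each hidden node f_mu through the crossbar.

    xi[mu][i] is the weight (a conductance, assumed > 0); the resistor at
    crosspoint (mu, i) is R_mu_i = 1 / xi[mu][i]. g holds the visible-node
    voltages and f the hidden-node voltages.
    """
    currents = []
    for mu, row in enumerate(xi):
        J = 0.0
        for i, weight in enumerate(row):
            resistance = 1.0 / weight             # R_mu_i = 1 / xi_mu_i
            J += (g[i] - f[mu]) / resistance      # Ohm's law at one crosspoint
        currents.append(J)
    return currents


# Illustrative values (assumptions): 2 hidden neurons, 2 visible neurons.
xi = [[0.5, 0.25], [1.0, 2.0]]
g = [1.0, 0.0]
f = [0.2, 0.6]
J = crossbar_currents(xi, g, f)
print(J)
```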

A.3     SPICE Netlist
Following is the SPICE netlist for the single neuron circuit, using ideal op-amps. Component values are
omitted for brevity. There is no nonlinearity here; adding one would be a matter of inserting a nonlinear
amplifier between node h_µ and XU1’s positive terminal.
R1 f_µ m_µ
XU1 f_µ h_µ m_µ opamp Aol=100K GBW=10Meg
XU2 u2- f_µ s_µ opamp Aol=100K GBW=10Meg
R2 u2- 0
R3 s_µ u2-
R4 u3+ s_µ
R5 u3+ 0
XU3 u3- u3+ d_µ opamp Aol=100K GBW=10Meg
R6 u3- m_µ
R7 d_µ u3-
R8 d_µ h_µ
C1 h_µ 0
V§b_µ N001 0
R9 u3+ N001
R10 u3- 0


                                           Figure 9: Softmax circuit design


B      Softmax Circuit
For demonstration purposes, we follow the construction of an analog softmax circuit using bipolar junc-
tion transistors (BJTs) described in [34]. Figure 9 shows the design of a four-way softmax circuit using
BJTs. The softmax function we aim to produce is:
\[
\mathrm{softmax}_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \qquad i = 1, \dots, N \tag{29}
\]

   For the µth BJT in the circuit, the collector current IC,µ can be expressed in terms of the base voltage
hµ and the emitter voltage VE when in the forward-active mode as:
\[
I_{C,\mu} = I_S e^{V_{BE,\mu}/V_T}, \qquad V_{BE,\mu} = h_\mu - V_E, \qquad \Rightarrow \qquad I_{C,\mu} = I_S e^{(h_\mu - V_E)/V_T} \tag{30}
\]
where $I_S$ is the BJT’s saturation current and $V_T$ is the thermal voltage. Assuming large BJT β (note:
this β is unrelated to the softmax β)², we can neglect base currents, so that $I_{C,\mu} = I_{E,\mu}$. Applying KCL at
the shared emitter node $V_E$, the total current is $I_{EE} = \sum_{\mu=1}^{N_h} I_{C,\mu}$. We can expand the expression for the
collector currents to get the currents in terms of node voltages:
\begin{align}
I_{EE} &= \sum_{\mu=1}^{N_h} I_S e^{(h_\mu - V_E)/V_T} \nonumber \\
       &= \sum_{\mu=1}^{N_h} \frac{I_S e^{h_\mu/V_T}}{e^{V_E/V_T}} \tag{31}
\end{align}

Simultaneously, the current $I_{EE}$ is also fixed by the ideal current source, so $I_{C,\mu}$ can also be expressed
through the ratio of the branch current to the total current: $I_{C,\mu} = \frac{I_{C,\mu}}{I_{EE}} I_{EE}$. Plugging in (30) for $I_{C,\mu}$ and
(31) for $I_{EE}$ in the denominator and canceling the term containing $V_E$,
\[
I_{C,\mu} = \frac{e^{h_\mu/V_T}}{\sum_{j=1}^{N_h} e^{h_j/V_T}} I_{EE} \tag{32}
\]

This already looks very much like the ideal softmax function. The voltage at node $f_\mu$ is created by
current flowing through resistor $R_\mu$, producing a voltage drop relative to $V_{CC}$. Specifically, the voltage is
$f_\mu = V_{CC} - \frac{e^{h_\mu/V_T}}{\sum_{j=1}^{N_h} e^{h_j/V_T}} I_{EE} R_\mu$. When $I_{EE} R_\mu = 1$ V, this voltage $f_\mu$ is a negated and shifted softmax
spanning a range of 1 volt. This scale and negation can be easily corrected with an op amp, which is also needed
to isolate the node and prevent loading. Note that $V_{CC}$ must be chosen to be a positive supply in order
for the BJTs to remain in the forward-active mode.
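
To make the argument concrete, the sketch below models the shared-emitter stage with the ideal forward-active law of equation (30): it solves KCL at the emitter for V_E, evaluates the collector currents, and compares them with a softmax of h/V_T scaled by I_EE. The numerical values of I_S, I_EE, R and V_CC are illustrative assumptions (chosen so that I_EE R = 1 V), and base currents and other non-idealities are ignored.

```python
import math


def bjt_softmax(h, V_T=0.02585, I_S=1e-14, I_EE=1e-3, R=1000.0, V_CC=5.0):
    """Ideal forward-active model of the shared-emitter BJT softmax stage.

    Returns (collector currents, output node voltages). All default values
    are illustrative assumptions, with I_EE * R = 1 V.
    """
    # KCL at the emitter fixes V_E: I_EE = sum_mu I_S * exp((h_mu - V_E) / V_T)
    V_E = V_T * math.log(I_S * sum(math.exp(x / V_T) for x in h) / I_EE)
    I_C = [I_S * math.exp((x - V_E) / V_T) for x in h]   # equation (30)
    f = [V_CC - i * R for i in I_C]                      # voltage drop across R_mu
    return I_C, f


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]


h = [0.00, 0.01, 0.02, 0.05]            # base voltages in volts (illustrative)
I_C, f = bjt_softmax(h)
ideal = softmax([x / 0.02585 for x in h])
for current, voltage, p in zip(I_C, f, ideal):
    print(current / 1e-3, 5.0 - voltage, p)   # the three columns should agree
```

In this idealized model the effective inverse temperature of the softmax is 1/V_T, so any other temperature would have to be realized by scaling the voltages applied to the bases.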
  2 In BJTs, β denotes the ratio of the collector current to the base current. High BJT β indicates the transistor is able to
amplify a small base current into a much larger collector current, allowing the BJT to function as an amplifier or switch.
A high β reflects that the BJT can efficiently transmit carriers from emitter to collector, without losing them to the base.


                                       Parameter               Value     Unit
                                       RF                      1000      Ω
                                       RT                      1         Ω
                                       R1                      1         Ω
                                       R2, R3, ..., R8         10 000    Ω
                                       RS                      40        Ω
                                       C                       10        µF
                                       a3                      0         V
                                       b1                      0         V
                                       b2                      −1        V
                                       b3                      −1        V
                                       b4                      −1        V

                                Table 2: Component and parameter values.


C     XOR DenseAM Circuit
Figure 10 is a full circuit diagram of the DenseAM that solves the XOR problem. Given input voltages
V1, V2 ∈ {0, 1}, the output voltage at g3 is the result of the XOR operation between V1 and V2. In
this model, the visible neuron is linear, and the hidden neurons share a softmax activation function
implemented by a set of bipolar junction transistors. Table 2 lists the component values used in simulation.


Visible neurons. In the XOR task, only one visible neuron is left evolving, corresponding to the output
column of the truth table. As such, the first two neurons are clamped to the input voltages, represented
by V1 and V2. The third visible neuron, highlighted in blue, is a linear unit with no nonlinear activation:
the internal state voltage v3 directly drives the output, setting g3 = v3 . This is the same circuit described
in Appendix A, except that the activation block is absent.

Hidden neurons. The XOR task requires four hidden neurons, highlighted in green. These are iden-
tical circuit constructions with the exception of the voltage sources bµ for the biases, which are set
according to the values in Table 2. Unlike the visible neuron, the hidden neurons have a softmax activa-
tion function, such that fµ = softmaxµ (h).

Softmax activation function. The red highlight marks the same softmax circuit described in Appendix B,
composed of BJTs, resistors, a voltage source for VCC and a current source for IEE . We
use the 2N5088 transistors in our model, reflecting a standard and widely available BJT. Noninverting
buffers (U10, U11, etc.) are used to prevent loading effects on the state capacitors Cµ from current draw
of the BJT base in forward-active mode. As discussed in Appendix B, the softmax circuit itself produces
an output voltage of
\[
\mathrm{softmax}(z)_i = V_{CC} - \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \qquad i = 1, \dots, N
\]

When VCC = 5 V as in this circuit, this requires extra circuitry, highlighted in yellow, to shift and negate
the softmax output. This is done by first buffering the voltage output to prevent loading effects, followed
by a summing op amp that subtracts VCC and inverts the result. For the first hidden neuron
h1 (lower left of figure), op-amp U2 buffers the voltage output, while U1 is configured in an inverting
summing configuration to add −5 V (the negative of VCC ) to the buffered voltage output, producing the
correct softmax output.

Weight matrix. Resistors R1 -R12 implement the weight matrix ξ. They are set directly according to the
XOR truth table, where each row corresponds to one hidden neuron. A boolean value of 1 is represented
by a high conductance (RT = 1 Ω), while a boolean value of 0 is represented by a relatively small
conductance (RF = 1 kΩ).
     Figure 10: Full schematic for XOR DenseAM built with 1 evolving linear visible neuron and 4 hidden neurons with softmax activation. Blue: visible neuron.
     Green: hidden neurons. Yellow: buffers for softmax activation circuit. Red: analog softmax circuit.

    The gain si /gi governing the value of si is set by the sum of the conductances in that neuron’s crossbar
column. The column for neuron 1 contains three RF resistors, whose conductances sum to 3 × 10⁻³ S. Hence,
neuron 1’s R47 /R46 = 3/1000. The crossbar columns for neurons 2, 3, and 4 each contain two RT resistors
and one RF resistor, whose conductances sum to approximately 2 S. Hence, we approximate R59 /R56 = 2000/1000
and similarly for hidden neurons 3 and 4.
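
The gain ratios quoted above can be reproduced with a short calculation. The sketch below assumes that each hidden neuron stores one row of the XOR truth table in the order (V1, V2, output), which is consistent with the resistor counts stated in the text but is not read off the schematic; RT and RF are taken from Table 2.

```python
R_T = 1.0      # ohms, encodes boolean 1 (Table 2)
R_F = 1000.0   # ohms, encodes boolean 0 (Table 2)

# One row per hidden neuron: (V1, V2, XOR output). The ordering is an assumption.
truth_table = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]


def row_conductances(bits):
    """Conductance of each crosspoint resistor in one hidden neuron's column."""
    return [1.0 / (R_T if bit else R_F) for bit in bits]


# Equation (27): the self-term gain ratio equals the summed conductance.
gain_ratios = [sum(row_conductances(row)) for row in truth_table]
for neuron, ratio in enumerate(gain_ratios, start=1):
    print(neuron, ratio)
```

The exact sum for neurons 2 to 4 is 2.001, which the text rounds to 2 when choosing R59/R56 = 2000/1000.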


D     Design and implementation variations
A large design space remains open across analog electronics and other substrates for realizing DenseAMs,
with clear speed–energy–area–precision trade-offs. In electronics, the core primitives admit multiple re-
alizations: passive, nonvolatile weights (e.g., memristors, triode-region or floating-gate transistors, and
other programmable conductors); active, gained weights via OTAs; and nonlinearities via diode clamps,
reverse-biased diode/BJT exponentials, MOS quadratic regions, or translinear blocks. Architectures in
the spirit of [35, 23] are compact but couple synaptic values to neuronal time constants, making dynamics
drift when a single weight changes—problematic for learning and consistent timing—whereas our decou-
pled neuron preserves a fixed time constant under weight updates. Simpler neuron/network topologies
likely exist and can be attractive in resource-constrained regimes, provided their deviations from the
target ODEs are validated not to degrade performance. Beyond CMOS, photonics (e.g., overdamped,
low-Q microring resonators) can naturally implement first-order ODEs and can offer extreme bandwidth
with distinct calibration and noise constraints. Across these options, open problems include robust
weight storage/programmability and drift control, mixed-signal learning rules compatible with device
limits, scaling under current/GBW/SR constraints, tolerance to mismatch/noise, and algorithm–circuit
co-design to exploit substrate-specific advantages.


E     Scaling of inference time
There are two conditions under which inference times should be studied, dependent on the softmax
temperature β. In the low-β regime, the DenseAM reaches equilibria with multiple hidden neurons
“competing” in the softmax, while in the high-β regime, the DenseAM reaches equilibria with only one
hidden neuron “winning out” in the softmax. Intuitively, the high-β regime corresponds to exact memory
recall, while the low-β regime corresponds to interpolation. The XOR and Hamming (7,4) code are in
the high-β regime, while the energy transformer lies in the low-β regime. In both regimes, we find that
the DenseAM converges in time that is constant with respect to the number of neurons.

Assumptions.
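(A1) There is a per-synapse device limit of $0 \le \xi_{\mu i} \le G_{\max}$ where $G_{\max}$ is the maximum conductance
    set by the physics of the crossbar crosspoints. Because f is the output of a softmax, $f_\mu \ge 0$ and
    $\sum_\mu f_\mu = 1$, so the weighted sum is a convex combination of the weights and
    \[
    \sum_\mu \xi_{\mu i} f_\mu \le G_{\max} \tag{33}
    \]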

     so the RHS of the visible neuron dynamics is O(1).
     There exist both column-sum and row-sum budgets that are enforced by the hardware, since each
     neuron’s output stage can only source/sink a finite amount of current while maintaining GBW/SR
     margins. This dictates a per-column and per-row conductance budget to stay within this maximum
     current, resulting in
     \[
     \sum_{i}^{N_v} \xi_{\mu i} \le C_r \quad \forall \mu, \qquad \sum_{\mu}^{N_h} \xi_{\mu i} \le C_c \quad \forall i \tag{34}
     \]


     Weights can only be positive since conductances can only be positive, so ξµi ≥ 0.
     As a corollary of (A1), note also that we can bound $\|\xi_\mu\|_2 \le S$ $\forall \mu$, and since $\|\xi_\mu\|_2 \le \|\xi_\mu\|_1$, then
     $\|\xi_\mu\|_2 \le C_c$ $\forall \mu$.
(A2) Bounded biases. $|a_i| \le A$, $|b_\mu| \le B$ for all $i, \mu$. In realistic regimes this typically holds, for
    example with the typical choice for boolean functions of $b_\mu = -\frac{\beta}{2} \|\xi_\mu\|^2$ (seen in Section 5.1).


Model. Take the system of equation (1) with a softmax activation on hidden neurons and an identity
activation on visible neurons. For clarity we assume 0 biases on visible neurons, but they do not change
the analysis.

\[
\tau_v \dot{v} = \xi^\top f + a - v, \qquad \tau_h \dot{h} = \xi v + b - h, \qquad f = \mathrm{softmax}_\beta(h) \tag{35}
\]

Integrating out the hidden units,

\begin{align}
\tau_v \dot{v} &= \xi^\top f(v) - v, \tag{36} \\
f(v) &= \mathrm{softmax}\big(\beta(\xi v + b)\big) \tag{37}
\end{align}

yields the effective energy function expressed in terms of visible neurons:
\[
E(v) = \frac{1}{2} \|v\|^2 - \frac{1}{\beta} \log \sum_\mu \exp\Big(\beta \big(\xi_\mu^\top v + b_\mu\big)\Big) \tag{38}
\]


where ∇E(v) = v − ξ ⊤ f (v). Because τv v̇ = −∇E(v), we see that the dynamical trajectory causes the
energy to monotonically decrease over time:
\[
\frac{d}{dt} E(v(t)) = \nabla E(v(t))^\top \dot{v} = -\frac{1}{\tau_v} \|\nabla E(v(t))\|^2 \le 0 \tag{39}
\]
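
The monotone decrease in equation (39) can be illustrated numerically. The sketch below integrates the reduced dynamics (36) with forward Euler on a tiny, arbitrarily chosen network and records the energy (38) at every step. Forward Euler is only an approximation of the continuous-time flow; a small step is used so that the discrete energy is also non-increasing. The weights, biases, and step size are assumptions made for the example.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]


def logits(v, xi, b, beta):
    """beta * (xi v + b), one entry per hidden neuron."""
    return [beta * (sum(w * x for w, x in zip(row, v)) + bm) for row, bm in zip(xi, b)]


def energy(v, xi, b, beta):
    """Equation (38), using a numerically stable log-sum-exp."""
    s = logits(v, xi, b, beta)
    m = max(s)
    lse = m + math.log(sum(math.exp(x - m) for x in s))
    return 0.5 * sum(x * x for x in v) - lse / beta


def grad(v, xi, b, beta):
    """Gradient of E: v - xi^T f(v), with f(v) = softmax(beta * (xi v + b))."""
    f = softmax(logits(v, xi, b, beta))
    return [v[i] - sum(f[mu] * xi[mu][i] for mu in range(len(xi))) for i in range(len(v))]


def simulate(v0, xi, b, beta, tau_v=1.0, dt=0.05, steps=400):
    """Forward-Euler integration of tau_v * dv/dt = -grad E(v)."""
    v, energies = list(v0), []
    for _ in range(steps):
        energies.append(energy(v, xi, b, beta))
        gvec = grad(v, xi, b, beta)
        v = [x - (dt / tau_v) * gx for x, gx in zip(v, gvec)]
    return v, energies


# Illustrative network (assumptions): 3 hidden neurons, 2 visible neurons.
xi = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
b = [0.0, 0.0, -0.2]
v_final, energies = simulate([0.9, -0.3], xi, b, beta=4.0)
monotone = all(e2 <= e1 + 1e-12 for e1, e2 in zip(energies, energies[1:]))
print(monotone, energies[0], energies[-1])
```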

E.1    Low-β regime
The energy landscape in the low-β regime exhibits uniform strong convexity, so the gradient flow dy-
namics cause the energy gap to decay exponentially, reaching an ϵ-fraction of the original energy gap
in constant time. To show E(v) is α-strongly convex, we must show ∇2 E(v) ⪰ αI for some α > 0.
This means that all the eigenvalues of the Hessian are ≥ α. Equivalently, λmin (∇2 E) ≥ α. Denote
G(f ) = Diag(f ) − ff ⊤ ⪰ 0, which is the Jacobian of the softmax function f (v) = softmax(β(ξv + b)).

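\begin{align}
\nabla^2 E(v) &= I - \beta \xi^\top G(f) \xi \tag{40} \\
\lambda_{\min}\big(\nabla^2 E(v)\big) &= \lambda_{\min}\big(I - \beta \xi^\top G(f) \xi\big) \tag{41} \\
&= 1 - \beta \lambda_{\max}\big(\xi^\top G(f) \xi\big) \tag{42} \\
\Rightarrow \nabla^2 E(v) &\succeq \big(1 - \beta \lambda_{\max}(\xi^\top G(f) \xi)\big) I \tag{43}
\end{align}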

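Because $G(f) \preceq \mathrm{Diag}(f) \preceq I$ is PSD and therefore $\xi^\top G(f) \xi$ is also PSD, and $G(f)$ is a probability-
weighted covariance where $\sum_\mu f_\mu = 1$,
\[
\lambda_{\max}(\xi^\top G(f) \xi) \le \mathrm{tr}(\xi^\top G(f) \xi) \le \sum_\mu f_\mu \|\xi_\mu\|^2 \le \max_\mu \|\xi_\mu\|^2 \tag{44}
\]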
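
The last two inequalities of the chain in equation (44) can be checked directly without forming G(f), because the trace of ξ⊤G(f)ξ is the softmax-weighted variance of the rows of ξ. The following sketch does this for a small example; the weights and logits are arbitrary illustrative values, and the first inequality (largest eigenvalue at most the trace) is not computed here.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]


def trace_bounds(xi, f):
    """Return (tr(xi^T G xi), sum_mu f_mu |xi_mu|^2, max_mu |xi_mu|^2)
    for G = Diag(f) - f f^T."""
    row_norms_sq = [sum(w * w for w in row) for row in xi]
    weighted = sum(fm * n for fm, n in zip(f, row_norms_sq))
    # xi^T f is the softmax-weighted mean of the rows of xi.
    mean = [sum(fm * row[i] for fm, row in zip(f, xi)) for i in range(len(xi[0]))]
    trace = weighted - sum(x * x for x in mean)
    return trace, weighted, max(row_norms_sq)


# Illustrative values (assumptions).
xi = [[0.3, 0.4], [0.0, 0.5], [0.6, 0.0]]
f = softmax([0.1, 0.5, -0.2])
trace, weighted, S2 = trace_bounds(xi, f)
print(0.0 <= trace <= weighted <= S2)
```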


Denote $S^2 = \max_\mu \|\xi_\mu\|^2 \le C_c$ as in (A1). Therefore, the Hessian of E can be bounded as

\[
\nabla^2 E(v) \succeq (1 - \beta S^2) I = \alpha I \tag{45}
\]

where $\alpha = 1 - \beta S^2$. Then $\alpha > 0$ when $\beta < 1/\max_\mu \|\xi_\mu\|^2$. This is a sufficient (but not necessary)
condition for the system to be in the low-β (uniformly convex) regime, where the softmax is diffuse
enough that its covariance term does not contribute so much negative curvature as to overwhelm the
positive curvature contributed by the identity term. In this regime, the uniform lower bound on the
Hessian implies α-strong convexity, which gives the PL inequality
\[
\frac{1}{2} \|\nabla E(v)\|^2 \ge \alpha \big(E(v) - E^\star\big) \tag{46}
\]
Together with (39), this allows us to bound the time constant of gradient flow:

\[
\frac{d}{dt} \big(E(v(t)) - E^\star\big) = -\frac{1}{\tau_v} \|\nabla E(v(t))\|^2 \le -\frac{2\alpha}{\tau_v} \big(E(v(t)) - E^\star\big) \tag{47}
\]

If the curvature is bounded below by α, then the gradient magnitude grows at least linearly with distance
to the minimum, ensuring the energy function is “steep enough” to ensure exponential convergence.
Integrating,
\[
E(v(t)) - E^\star \le \big(E(v(0)) - E^\star\big) e^{-\frac{2\alpha}{\tau_v} t} \tag{48}
\]
This indicates exponential decay of the energy gap. In order to reach an ϵ-fraction of the original energy
gap, this takes time
\[
T(\epsilon) \le \frac{\tau_v}{2\alpha} \log \frac{1}{\epsilon} = O\big(\tau_v \log(1/\epsilon)\big) \tag{49}
\]
which is entirely independent of system size Nv and Nh . In the energy transformer case, this means that
convergence time is entirely independent of context length L and token dimension D.
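
As a small worked example of equation (49), the sketch below computes α = 1 − βS² from a weight matrix and evaluates the resulting time bound. The helper name and the weight values are illustrative assumptions. Stacking many copies of the same rows increases the number of hidden neurons without changing S², so the bound is unchanged.

```python
import math


def low_beta_time_bound(xi, beta, tau_v, eps):
    """Upper bound T(eps) of equation (49), using alpha = 1 - beta * S^2."""
    S2 = max(sum(w * w for w in row) for row in xi)    # S^2 = max_mu |xi_mu|^2
    alpha = 1.0 - beta * S2
    if alpha <= 0.0:
        raise ValueError("beta >= 1/S^2: the sufficient low-beta condition fails")
    return tau_v / (2.0 * alpha) * math.log(1.0 / eps)


# Illustrative weights (assumptions): both rows have squared norm 0.25.
small = [[0.3, 0.4], [0.0, 0.5]]
large = small * 50                 # 100 hidden neurons with the same row norms
T_small = low_beta_time_bound(small, beta=2.0, tau_v=1.0, eps=1e-3)
T_large = low_beta_time_bound(large, beta=2.0, tau_v=1.0, eps=1e-3)
print(T_small, T_large)
```

With these numbers α = 0.5, so the bound equals τv log(1/ϵ) in both cases.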

E.2      High-β regime
E.2.1    TI : Basin selection
Denote
\[
s_\mu(v) := \xi_\mu^\top v + b_\mu, \qquad m(v) := \max_\mu s_\mu(v), \qquad f := \mathrm{softmax}(\beta s) \tag{50}
\]

Define the basin of attraction around the winning softmax logit k by the margin γ > 0:
\[
B_k(\gamma) = \{ v : s_k(v) - \max_{j \ne k} s_j(v) \ge \gamma \} \tag{51}
\]

Let TI be the first time t such that v(t) ∈ ∪k Bk (γ). Defining the softmax component of the energy
function (38) as
\[
\mathrm{LSE}_\beta(s) = \frac{1}{\beta} \log \sum_{\mu=1}^{N_h} e^{\beta s_\mu}
\]

then for every v, we can bound the LSE as
\[
m(v) \le \mathrm{LSE}_\beta(s(v)) \le m(v) + \frac{1}{\beta} \log N_h \tag{52}
\]
Thus, the “softmax slack” $\delta(v) := \mathrm{LSE}_\beta(s(v)) - m(v)$ obeys $0 \le \delta(v) \le \frac{1}{\beta} \log N_h$. In the high-β regime,
there are no critical points other than the softmax basins (those within $\cup_k B_k(\gamma)$ for any reasonable
$\gamma > \epsilon > 0$). To reduce δ from its initial value to the cusp of one of the basins requires dissipating at most
\[
\Delta E_{\mathrm{softmax}} \le \frac{1}{\beta} \log N_h \tag{53}
\]
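
The bound on the softmax slack in equation (52) is easy to verify numerically. The sketch below evaluates the log-sum-exp in a numerically stable way and prints the slack next to its upper bound for a few values of β; the logits are arbitrary illustrative values and the function names are not from the paper.

```python
import math


def lse(s, beta):
    """Numerically stable LSE_beta(s) = (1/beta) * log sum_mu exp(beta * s_mu)."""
    m = max(s)
    return m + math.log(sum(math.exp(beta * (x - m)) for x in s)) / beta


def softmax_slack(s, beta):
    """delta = LSE_beta(s) - max(s), which lies in [0, log(N_h) / beta]."""
    return lse(s, beta) - max(s)


s = [0.2, -1.0, 0.2, 0.05]         # illustrative logits, N_h = 4
for beta in (0.5, 2.0, 20.0):
    print(beta, softmax_slack(s, beta), math.log(len(s)) / beta)
```

The upper bound is attained when all logits are equal, and the slack shrinks as 1/β, which is why the softmax contribution to the dissipated energy becomes small in the high-β regime.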
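$\frac{\partial E}{\partial v_i} = -\tau_v \dot{v}_i$, and outside winning basins $\tau_v \dot{v}_i \sim 1$, so the squared magnitude of the gradient grows at
least linearly in $N_v$:
\[
\|\nabla E(v)\|^2 = \sum_{i=1}^{N_v} \left( \frac{\partial E}{\partial v_i} \right)^2 \ge c N_v \tag{54}
\]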

for some c > 0 independent of Nv and Nh for all v in the trajectory outside a winning basin. Therefore,
the energy dissipation rate satisfies
\[
-\dot{E}(t) = \frac{1}{\tau_v} \|\nabla E(v(t))\|^2 \ge \frac{c}{\tau_v} N_v \tag{55}
\]
    Under assumptions (A1)–(A2), the visible state v remains in a bounded box, so the quadratic part of
the energy contributes at most O(Nv ) to the energy difference between any two points on the trajectory.
Since the energy dissipation rate during TI scales proportionally to Nv , the quadratic component of
the energy contribution is dissipated in constant time. The only nontrivial Nh dependence is due to the
softmax slack. Together with the bound on ∆Esoftmax , the total time this phase takes is characteristically
                                                             
\[
T_I = O\left( \frac{\tau_v \log N_h}{\beta N_v} \right) \tag{56}
\]

E.2.2    TII : Contractive convergence within a winning basin
Find a basin Bk (γ) that is entered at tin = TI . We will now show local strong convexity within this
basin, allowing us to invoke the PL inequality and find exponential convergence within the basin. Define
G := Diag(f ) − ff ⊤ . First, consider that the non-winning softmax mass is 1 − fk , which is
\[
1 - f_k = \sum_{j \ne k} f_j \le (N_h - 1) e^{-\beta\gamma} \tag{57}
\]


Additionally, since $\|f\|^2 = f_k^2 + \sum_{j \ne k} f_j^2 \ge f_k^2$ and $0 \le f_k \le 1$,

\[
\lambda_{\max}(G(f)) \le \mathrm{tr}(G(f)) = 1 - \|f\|^2 \le 1 - f_k^2 \le 2(1 - f_k) \le 2(N_h - 1) e^{-\beta\gamma} \tag{58}
\]

Hence, with $S^2 = \max_\mu \|\xi_\mu\|^2$,

\[
\lambda_{\max}(\xi^\top G(f) \xi) \le S^2 \lambda_{\max}(G(f)) \le 2 S^2 (N_h - 1) e^{-\beta\gamma} \tag{59}
\]

This gives a bound on the largest eigenvalue of G(f ) in a way that incorporates the softmax beta.
   Now, we can show local strong convexity in the winning basin:

\[
\nabla^2 E(v) = I - \beta \xi^\top G(f) \xi \succeq \big(1 - 2\beta S^2 (N_h - 1) e^{-\beta\gamma}\big) I \equiv \alpha(\beta, \gamma) I \tag{60}
\]

for all v ∈ Bk (γ). Particularly, if
\[
e^{-\beta\gamma} (N_h - 1) \le \frac{1}{4 \beta S^2} \tag{61}
\]

then $\alpha(\beta, \gamma) \ge \frac{1}{2}$, independent of $N_h$, $N_v$. Note that this is always possible: if the softmax is not peaked
enough to make this inequality true, simply keep moving in trajectory “Phase I” for a little longer until
the margin γ grows slightly larger such that the condition holds true. This strong convexity within Bk (γ)
implies the PL inequality
\[
\frac{1}{2} \|\nabla E(v)\|^2 \ge \alpha(\beta, \gamma) \big(E(v) - E^\star\big), \qquad \forall v \in B_k(\gamma) \tag{62}
\]
Therefore, along the trajectory within the basin for times t ≥ tin ,

\[
\frac{d}{dt} \big(E(v(t)) - E^\star\big) = -\frac{1}{\tau_v} \|\nabla E(v(t))\|^2 \le -\frac{2\alpha(\beta, \gamma)}{\tau_v} \big(E(v(t)) - E^\star\big) \tag{63}
\]
Integrating,
\[
E(v(t)) - E^\star \le e^{-\frac{2\alpha(\beta, \gamma)}{\tau_v} (t - t_{\mathrm{in}})} \big(E(v(t_{\mathrm{in}})) - E^\star\big) \tag{64}
\]

Impose a relative-to-initial convergence criterion:

\[
E(v(t)) - E^\star \le \epsilon \big(E(v(0)) - E^\star\big), \qquad \epsilon \in (0, 1)
\]

Since E is non-increasing along the trajectory, E(v(tin )) − E ⋆ ≤ E(v(0)) − E ⋆ , so it suffices that
\[
e^{-\frac{2\alpha(\beta, \gamma)}{\tau_v} (t - t_{\mathrm{in}})} \le \epsilon
\]

Hence the in-basin time satisfies
                                                                        
\[
T_{II} \le \frac{\tau_v}{2\alpha(\beta, \gamma)} \log \frac{1}{\epsilon} = O\left( \tau_v \log \frac{1}{\epsilon} \right) \tag{65}
\]

which is independent of Nh and Nv .
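
The in-basin argument can be turned into a short calculation. The sketch below evaluates α(β, γ) from equation (60), solves condition (61) for the smallest admissible margin, and plugs the result into the bound (65). The function names and the numerical values are illustrative assumptions; the margin helper requires at least two hidden neurons.

```python
import math


def alpha_in_basin(beta, gamma, S2, n_hidden):
    """alpha(beta, gamma) = 1 - 2*beta*S^2*(N_h - 1)*exp(-beta*gamma), equation (60)."""
    return 1.0 - 2.0 * beta * S2 * (n_hidden - 1) * math.exp(-beta * gamma)


def min_margin(beta, S2, n_hidden):
    """Smallest margin gamma >= 0 for which condition (61) holds (n_hidden >= 2)."""
    return max(0.0, math.log(4.0 * beta * S2 * (n_hidden - 1)) / beta)


def in_basin_time_bound(alpha, tau_v, eps):
    """Upper bound on T_II from equation (65)."""
    return tau_v / (2.0 * alpha) * math.log(1.0 / eps)


beta, S2, tau_v, eps = 10.0, 2.0, 1.0, 1e-3     # illustrative values
results = []
for n_hidden in (10, 1000, 100000):
    gamma = min_margin(beta, S2, n_hidden)
    alpha = alpha_in_basin(beta, gamma, S2, n_hidden)
    results.append((n_hidden, gamma, alpha, in_basin_time_bound(alpha, tau_v, eps)))
    print(results[-1])
```

Under these assumptions the required margin grows only like log(Nh)/β, while the time bound at that margin stays the same for every network size.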


E.2.3    Combined bound
Altogether, in the high-β regime, to reach a relative-to-initial tolerance of
\[
E(v(t)) - E^\star \le \epsilon \big(E(v(0)) - E^\star\big) \tag{66}
\]
the combined convergence time satisfies
                                                                             
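\[
T(\epsilon) = \underbrace{O\left( \frac{\tau_v \log N_h}{\beta N_v} \right)}_{\text{winner selection } (T_I)} + \underbrace{O\left( \tau_v \log \frac{1}{\epsilon} \right)}_{\text{convergence within basin } (T_{II})} \tag{67}
\]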

For fixed ϵ, β, and τv , TII is independent of Nv and Nh , while TI carries all the model-size dependence.
The dependence of the convergence time on Nh and Nv in the high-β regime is
                                                               
$$ T(\epsilon) = O\!\left(\frac{\tau_v \log N_h}{\beta N_v}\right). \tag{68} $$
The convergence time is at most logarithmic in the number of hidden neurons Nh , and actually decreases
as 1/Nv in the number of visible neurons.
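
As a rough illustration of how the two phases of the bound behave, the following Python sketch evaluates T_I and T_II for given sizes. The function name, the constant `c_select` (standing in for the constant hidden by the O(·) in T_I), and the numerical values are assumptions made only for this example; `alpha` plays the role of α(β, γ).

```python
import math

def convergence_time_bound(tau_v, beta, n_h, n_v, eps, alpha, c_select=1.0):
    """Return (T_I, T_II) for the two-phase bound in the high-beta regime.

    c_select is a placeholder for the constant hidden by the O(.) in T_I;
    it is an illustrative assumption, not a value from the analysis.
    alpha is the in-basin contraction rate alpha(beta, gamma).
    """
    t_select = c_select * tau_v * math.log(n_h) / (beta * n_v)  # winner selection
    t_basin = tau_v / (2.0 * alpha) * math.log(1.0 / eps)       # in-basin bound, Eq. (65)
    return t_select, t_basin

# Hypothetical numbers: tau_v = 10 ns, beta = 10, alpha = 0.5, eps = 1e-3.
for n_h, n_v in [(16, 7), (1024, 700)]:
    t_i, t_ii = convergence_time_bound(1e-8, 10.0, n_h, n_v, 1e-3, alpha=0.5)
    print(f"N_h={n_h:5d} N_v={n_v:4d}  T_I={t_i:.3e} s  T_II={t_ii:.3e} s")
```

Under these assumptions T_II is identical for both sizes, while T_I shrinks as Nv grows faster than log Nh.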

E.3     Limitations
Our analysis assumes that the timescales of the crossbar array are much faster than the fastest neuronal
timescales. In practice, because wires have non-zero capacitance, a larger crossbar array may contribute
to the time scales of the entire system. Once the crossbar array is large enough to significantly modify
the time scales of the neurons, our analysis and the scaling argument become invalid. For this reason,
one cannot scale this design to arbitrarily large sizes. Analyzing that boundary is outside the scope of
our paper, because it depends on fabrication and design parameters, which belong to a different level of
abstraction than the one treated here.


F       Design invariance under voltage scaling
Given hardware constraints of Gmax , Cc , and Cr , we can still implement models with arbitrarily large
weights. Convergence bounds rely on the weight matrix constraints, which can be made feasible by
global normalization at the hardware level, keeping the effective model weights unchanged. Consider the
scaling factor for any non-negative ξ:
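$$ \kappa = \min\left\{ 1,\ \frac{G_{max}}{\max_{\mu,i} \xi_{\mu i}},\ \frac{C_c}{\max_i \sum_\mu \xi_{\mu i}},\ \frac{C_r}{\max_\mu \sum_i \xi_{\mu i}} \right\} \tag{69} $$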

Set ξ̃ = κξ. Then, ξ̃ satisfies all the hardware constraints of assumption (A1):
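$$ 0 \le \tilde\xi_{\mu i} \le G_{max}, \qquad \sum_i \tilde\xi_{\mu i} \le C_r \ \ \forall \mu, \qquad \sum_\mu \tilde\xi_{\mu i} \le C_c \ \ \forall i \tag{70} $$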

So any ξ matrix can be mapped onto the budgets with one scalar κ. Consider the pre-softmax arguments
for the hidden neurons: if we scale weights ξ → ξ̃ = κξ, rescale the voltage unit v → ṽ = κv and biases
b → b̃ = κ²b, and set β̃ = β/κ², then
$$ \tilde\beta\left(\tilde\xi_\mu^\top \tilde v + \tilde b\right) = \beta\left(\xi_\mu^\top v + b\right) \tag{71} $$

so the softmax outputs f and the system’s attractors are unchanged. The visible ODE τv v̇ = ξ⊤ f (v) − v
is preserved up to units, as the κ terms can be absorbed into the gain of U2 and U3 without affecting the
convergence time bounds.
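
The rescaling can be checked numerically. The sketch below computes κ from Eq. (69) for a random non-negative weight matrix, applies the substitutions ξ̃ = κξ, ṽ = κv, b̃ = κ²b, β̃ = β/κ², and compares the softmax outputs before and after. The budget values, matrix sizes, and helper names are hypothetical and chosen only for illustration.

```python
import numpy as np

def hardware_scale(xi, g_max, c_c, c_r):
    """Scalar kappa of Eq. (69); xi has rows mu and columns i.

    Assumes xi is non-negative with at least one positive entry.
    """
    return min(
        1.0,
        g_max / xi.max(),
        c_c / xi.sum(axis=0).max(),  # largest column sum: max_i sum_mu xi[mu, i]
        c_r / xi.sum(axis=1).max(),  # largest row sum:    max_mu sum_i xi[mu, i]
    )

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
xi = rng.uniform(0.0, 5.0, size=(16, 7))   # hypothetical model weights
v = rng.uniform(-1.0, 1.0, size=7)
b = rng.uniform(-1.0, 0.0, size=16)
beta = 3.0
g_max, c_c, c_r = 1.0, 4.0, 2.0            # hypothetical hardware budgets

kappa = hardware_scale(xi, g_max, c_c, c_r)
xi_t, v_t, b_t, beta_t = kappa * xi, kappa * v, kappa**2 * b, beta / kappa**2

f = softmax(beta * (xi @ v + b))            # original model
f_t = softmax(beta_t * (xi_t @ v_t + b_t))  # rescaled model, Eq. (71)
print("kappa =", kappa, " max |f - f_t| =", np.abs(f - f_t).max())
```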


G       Scaling of energy consumption
The energy consumption of DenseAM circuits can be broken up into two parts: the energy dissipated
by the weights as a result of Ohm’s Law, and the energy from engineering overhead found in amplifiers
and active circuitry. The energy dissipated by the weights in the crossbar array can be expressed as the
integral of the power dissipated by each resistor of resistance Rµi from time 0 until convergence at Tconv .


Energy consumption of weights. Let the neuron output voltages be proportional to activations:
ui = κgi and wµ = κfµ , where κ is a fixed voltage scale. We assume rail-bounded outputs |ui | ≤ κ and
|wµ | ≤ κ. By Appendix F, global rescaling of ξ, voltages, and β preserves the DenseAM dynamics, so
this choice of κ does not affect behavior. The instantaneous power in the resistive crossbar is:
$$ P_{weights}(t) = \sum_{i,\mu} \xi_{\mu i}\,(u_i - w_\mu)^2 \tag{72} $$
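Using the row/column conductance budgets $\sum_\mu \xi_{\mu i} \le C_c$ and $\sum_i \xi_{\mu i} \le C_r$ (Appendix E) and the
inequality $(a - b)^2 \le 2a^2 + 2b^2$,
$$ P_{weights}(t) \le 2\left(\sum_{i,\mu} \xi_{\mu i}\, u_i^2 + \sum_{i,\mu} \xi_{\mu i}\, w_\mu^2\right) \tag{73} $$
$$ = 2\left(\sum_i u_i^2 \Big(\sum_\mu \xi_{\mu i}\Big) + \sum_\mu w_\mu^2 \Big(\sum_i \xi_{\mu i}\Big)\right) \tag{74} $$
$$ \le 2\left(C_c \sum_i u_i^2 + C_r \sum_\mu w_\mu^2\right) \tag{75} $$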

If the hidden layer uses a softmax activation, then $\sum_\mu f_\mu^2 \le 1$ and so $\sum_\mu w_\mu^2 \le \kappa^2$; and rail bounds give
$\sum_i u_i^2 \le N_v \kappa^2$. Therefore,

$$ P_{weights}(t) \le 2\kappa^2 (C_c N_v + C_r) = O(N_v) \tag{76} $$

Therefore, a system taking time $T_{conv}$ to converge results in an energy consumption of
$$ E_{weights} = \int_0^{T_{conv}} P_{weights}(t)\,dt \le 2\kappa^2 (C_c N_v + C_r)\, T_{conv} \tag{77} $$

According to the convergence time bounds of Appendix E, $T_{conv} = O(\tau_v)$. Thus, $E_{weights} = O(N_v)$ as
a function of system size.
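
The power bound can be sanity-checked numerically. The sketch below draws a random conductance matrix that respects the row/column budgets, evaluates the instantaneous crossbar power of Eq. (72), and compares it with the bound of Eq. (76). All sizes, budgets, and the voltage scale are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_h, n_v = 32, 12
kappa = 0.5             # voltage scale (hypothetical)
c_c, c_r = 1.0, 1.0     # column/row conductance budgets (hypothetical units)

# Random non-negative conductances, rescaled so every row and column sum meets its budget.
xi = rng.uniform(0.0, 1.0, size=(n_h, n_v))
xi *= min(c_c / xi.sum(axis=0).max(), c_r / xi.sum(axis=1).max())

# Rail-bounded visible outputs and softmax hidden outputs, both scaled by kappa.
u = kappa * rng.uniform(-1.0, 1.0, size=n_v)
logits = rng.normal(size=n_h)
f = np.exp(logits - logits.max())
f /= f.sum()
w = kappa * f

# Eq. (72): P = sum over (i, mu) of xi[mu, i] * (u[i] - w[mu])**2
p_weights = float(np.sum(xi * (u[None, :] - w[:, None]) ** 2))
p_bound = 2.0 * kappa**2 * (c_c * n_v + c_r)   # Eq. (76)
print(f"P_weights = {p_weights:.4f}  <=  bound = {p_bound:.4f}")
```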

Energy consumption of capacitors. Let each neuron node voltage be bounded by hardware limits
$|u_i(t)|, |w_\mu(t)| \le \kappa$. Charging a capacitor of capacitance C from a supply through a resistive path draws
$CV^2$ from the power supply. The number of times each capacitor charges is finite because the Lyapunov
energy of the DenseAM forbids limit cycles. This means the total supply energy per node can be bounded
by a constant. Therefore, the total energy needed to (re)charge all neuron capacitors is bounded by
$$ E_{capacitors} \le O(1) \cdot \kappa^2 \left( \sum_{i=1}^{N_v} C_i^{(v)} + \sum_{\mu=1}^{N_h} C_\mu^{(h)} \right) = O(N_v + N_h) \tag{78} $$


Energy consumption of amplifiers, bias, control, and overhead. Per neuron, the power lost to
amplifier inefficiency, bias terms, and general overhead does not depend on system size. For a
runtime of duration $T_{conv}$, the energy consumption of these elements in the entire network scales as

$$ E_{other} = O((N_v + N_h)\, T_{conv}) \tag{79} $$

Combined energy consumption. Summing Eqs. (77)–(79) and using $T_{conv} = O(\tau_v)$, which does not grow
with system size, the total energy consumption can be written as

$$ E_{total} = O(N_v + N_h) \tag{80} $$


H     Model Specifications and Details
Table 3, Table 4, and Table 5 summarize the model design for the XOR, Hamming (7,4), and parity
DenseAM models.


                                    Table 3: XOR model specification

Visible neurons vi                    Nv = 3 (inputs v1 , v2 clamped to {0,1}; output v3 free)
Hidden neurons hµ                     Nh = 4 (one per truth-table row)
Visible activation and Lagrangian     Identity: gi = vi , $L_v = \frac{1}{2} \sum_{i=1}^{N_v} v_i^2$
Hidden activation and Lagrangian      Softmax: fµ = softmax(βhµ ), $L_h = \frac{1}{\beta} \log \sum_{\mu=1}^{N_h} e^{\beta h_\mu}$
Visible biases                        ai = 0
Hidden biases                         $b_\mu = -\frac{1}{2} \sum_{i=1}^{N_v} \xi_{\mu i}^2$
Weights ξ                             $\xi \in \{0, 1\}^{4 \times 3}$, rows encode memories:
                                      $\xi = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$
Inference protocol                    Clamp (v1 , v2 ) to input values; read out v3 at convergence
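
The inference protocol of Table 3 can be sketched in a few lines of Python. This is a minimal illustration, not the simulation used for the results: it assumes the hidden neurons sit at their steady state h = ξv + b (the fast-hidden-neuron limit), uses forward Euler integration, and picks β, τv, the step size, and the initial value of v3 arbitrarily. The function name `xor_readout` is hypothetical.

```python
import numpy as np

XI = np.array([[0, 0, 0],
               [0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]], dtype=float)   # rows are the memories of Table 3
B = -0.5 * (XI ** 2).sum(axis=1)          # hidden biases b_mu = -1/2 sum_i xi_mu_i^2

def xor_readout(x1, x2, beta=10.0, tau_v=0.1, dt=1e-3, t_end=1.0):
    """Clamp (v1, v2), relax v3 with forward Euler, and return the settled v3."""
    v = np.array([x1, x2, 0.5])           # v3 starts undecided (assumption)
    for _ in range(int(round(t_end / dt))):
        h = XI @ v + B                    # hidden neurons at steady state (assumption)
        f = np.exp(beta * (h - h.max()))
        f /= f.sum()                      # softmax(beta * h)
        v[2] += dt / tau_v * (XI[:, 2] @ f - v[2])   # only the free neuron evolves
    return v[2]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", int(xor_readout(x1, x2) > 0.5))
```

Thresholding the settled v3 at 0.5 recovers the XOR truth table under these assumptions.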


                             Table 4: Hamming (7,4) model specification

Visible neurons (Nv )   7 (codeword bits)
Hidden neurons (Nh )    16 (one per valid codeword)
Visible activation      Identity: gi = vi
Hidden activation       Softmax over µ ∈ {1, . . . , 16} with temperature β
Visible biases          ai = 0
Hidden biases           $b_\mu = -\frac{1}{2} \sum_{i=1}^{N_v} \xi_{\mu i}^2$
Weights ξ               ξ ∈ {0, 1}16×7 , each row is a valid Hamming(7,4) codeword
Inference protocol      Initialize visible neurons to corrupted 7-bit input codeword; let all visible and
                        hidden neurons evolve; converged visible neurons give the corrected codeword
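
A minimal simulation of the Table 4 protocol is sketched below. It assumes one particular systematic generator matrix for the Hamming(7,4) code, forward Euler integration of both the visible and hidden neurons, hidden neurons initialized at their steady state, and illustrative values for β, the time constants, and the step size. None of these choices are claimed to match the experiments; the function name `decode` is hypothetical.

```python
import itertools
import numpy as np

# One common systematic generator matrix [I | P] for a Hamming(7,4) code (assumption).
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
MESSAGES = np.array(list(itertools.product([0, 1], repeat=4)))
XI = ((MESSAGES @ G) % 2).astype(float)   # 16 x 7, one valid codeword per row
B = -0.5 * (XI ** 2).sum(axis=1)          # hidden biases b_mu

def decode(word, beta=20.0, tau_v=0.1, tau_h=0.01, dt=1e-3, t_end=1.0):
    """Relax visible and hidden neurons from a corrupted word; return rounded bits."""
    v = np.asarray(word, dtype=float).copy()
    h = XI @ v + B                        # hidden neurons start at steady state (assumption)
    for _ in range(int(round(t_end / dt))):
        f = np.exp(beta * (h - h.max()))
        f /= f.sum()                      # softmax(beta * h)
        dv = (XI.T @ f - v) / tau_v       # visible dynamics (zero visible bias)
        dh = (XI @ v + B - h) / tau_h     # hidden dynamics
        v, h = v + dt * dv, h + dt * dh
    return (v > 0.5).astype(int)

codeword = XI[11].astype(int)
corrupted = codeword.copy()
corrupted[2] ^= 1                         # flip a single bit
corrected = decode(corrupted)
print("sent     ", codeword)
print("received ", corrupted)
print("corrected", corrected)
```

Because the bias b_µ makes the hidden input equal, up to a term shared by all µ, to minus half the squared distance between v and memory µ, the softmax favors the nearest codeword, which for a single flipped bit is the transmitted one.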


                               Table 5: 8-bit parity model specification

Visible neurons vi                                  Nv = 16 (dimension of embedding D)
Hidden neurons (energy attention) $h_A^{attn}$      $N_h^{attn} = 8$ (context length L)
Hidden neurons (Hopfield network) $h_\mu^{hopf}$    $N_h^{hopf} = 16$ (Hopfield network memories M )
Hidden neurons (total)                              Nh = 24 (L + M )
Visible activation                                  Identity: gi = vi
Hidden activation (energy attention)                Softmax: $f_A^{attn} = \mathrm{softmax}(\beta h^{attn})_A$ for A = 1, . . . , L
Hidden activation (Hopfield network)                ReLU: $f_\mu^{hopf} = \max(h_\mu^{hopf}, 0)$ for µ = 1, . . . , M
Weights (energy attention)                          $\xi^{attn} \in \mathbb{R}^{L \times D}$, where $\xi_A^{attn}$ is the embedded A’th context token
Weights (Hopfield network)                          $\xi^{hopf} \in \mathbb{R}^{M \times D}$, static after training
Inference protocol                                  Embed L context tokens to obtain $\xi^{attn}$. Let visible neurons
                                                    evolve until convergence


H.1      Bit string energy transformer implementation
As described in Table 5, our trained model uses an embedding matrix of 2 × D = 32 parameters, the
Hopfield network with D × M = 256 parameters, an additional D × 2 = 32 parameter matrix to decode
embeddings to logits, a total of D + L + M = 40 neuron bias terms, and 2 biases for the linear decoder.
This is a total of 362 parameters.
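
The parameter tally can be reproduced with a few lines of arithmetic. The variable names below are illustrative, and the vocabulary size of 2 (bit tokens 0 and 1) is our reading of the "2 × D" embedding matrix.

```python
D, L, M = 16, 8, 16   # embedding dimension, context length, Hopfield memories (Table 5)
VOCAB = 2             # bit tokens 0 and 1 (inferred from the 2 x D embedding matrix)

param_counts = {
    "embedding": VOCAB * D,          # 2 x D
    "hopfield_weights": D * M,       # D x M
    "decoder_weights": D * VOCAB,    # D x 2
    "neuron_biases": D + L + M,      # visible + attention + Hopfield biases
    "decoder_biases": VOCAB,
}
total_params = sum(param_counts.values())
print(param_counts, total_params)
```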
    In training and inference we use time constants τv = 0.1 and τh = 0.01. We train with Euler steps of
1e-3 and test with Euler steps of 1e-4, over a time horizon of T = 1 second. JAX’s automatic differentiation
was used to implement backpropagation through time. We encourage the model to reach fixed points
by penalizing v̇ at time T. This yields models that are more robust to hardware imperfections, owing to
the intrinsic stability of attractor points. Convergence to an attractor also means that inference remains
stable under timing mismatch and delay during readout.


I     Hardware analysis
I.1     Hardware speed analysis
As discussed in subsection 7.1, the convergence time of analog DenseAMs is governed not by system size,
but rather primarily by the timescales of the dynamics in hardware. These timescales are set by the time
constants τv and τh . The smaller these time constants, the faster the dynamics move, and the faster the
system converges. In this section, we derive bounds on the minimum time constant min{τv , τh } of the
DenseAM, which is limited by the constraints of active components like amplifiers.
    The maximum speed of neuronal dynamics is limited by the ability of active stages (op-amps/buffers)
to track changing signals. If the input slope to an active stage exceeds its slew rate (SR), the output
distorts; if the signal spectrum approaches or exceeds the stage’s closed-loop bandwidth, attenuation
and phase lag appear. Here, we derive lower bounds on the time constants τv , τh imposed by (i) finite
gain–bandwidth product (GBW) and (ii) finite SR of the three active stages in the neuron design (Ap-
pendix A). Without loss of generality we will express the derivation for the hidden neurons, with the
derivations for visible neurons following by symmetry. Throughout, define the following:

    • State swing: |vi (t)| ≤ Av , so that |v̇i | ≲ Av /τ . Similarly, |hµ (t)| ≤ Ah , so that |ḣµ | ≲ Ah /τ .
    • Activation swing: Visible activation g(·) is Lipschitz with slope bound Lg = supx |g ′ (x)|. Then,
      |ġi | ≤ Lg |v̇i | ≤ Lg Av /τ . Similarly, hidden activation f (·) is Lipschitz with slope bounded by
      Lf = supx |f ′ (x)|. Then, |f˙µ | ≤ Lf |ḣµ | ≤ Lf Ah /τ .

    • Weights ξ ≥ 0. Hardware normalization gives per-row/column conductance budgets, so the
      self-term gain for hidden neuron µ is $A_{self,\mu} = \sum_i \xi_{\mu i} = O(1)$.
We will derive three independent lower bounds and then take the max:

$$ \tau_{min} \ge \max\{\, \underbrace{\tau_{GBW}}_{\text{tracking small signals}},\ \underbrace{\tau_{SR}}_{\text{edge/large signals}},\ \underbrace{\tau_{I\text{-limit}}}_{\text{output current}} \,\} \tag{81} $$


I.1.1   Gain-bandwidth product bound
For a single-pole op-amp with gain-bandwidth product GBW in a closed-loop configuration with closed-loop
gain ACL , the −3 dB bandwidth is fc ≈ GBW/ACL . For the neuron to faithfully track signals with a
time constant τ , we require fc ≳ 1/(2πτ ) for every stage in the signal path. The closed-loop gains of
the op-amps are: ACL (U1) = 1 because it is a unity-gain buffer, ACL (U2) = Aself because it needs
to realize the self-term gain, and ACL (U3) ≈ 1 because it is a unity-gain summer. Assuming the same
op-amp design for U1, U2, and U3, and taking the worst case,

$$ \tau_{GBW} = \frac{\max(1, A_{self})}{2\pi\,\mathrm{GBW}} \tag{82} $$

I.1.2   Slew rate bound
The slew-rate limits cap the maximum output slope of each op-amp stage:
    • U1: activation buffer. |f˙µ | ≤ Lf Ah /τ , which gives τ ≥ (Lf Ah )/SRU1 .


Table 6: Estimated neuron time constants and conservative convergence times with Av = Ah = 1 V,
Lg = 1, Aself = 1 for representative amplifiers in literature. GBW bound τGBW = 1/(2π GBW); SR bound
τSR = Lg Av /SR (visible path). Overall τmin = max{τGBW , τSR }; we report Tconv = 10 τmin .

CMOS Amplifier (ref.)                          SR (V/µs)        GBW (MHz)            τSR (ns)    τGBW (ns)        Tconv (ns)
Perez and Maloberti [36]                              84.50               321.50         11.83             0.50    118.34
Assaad and Silva-Martinez [37]                        94.10               134.20         10.63             1.19    106.27
Yen and Blalock [38]                                 202.00                10.70          4.95            14.87    148.74
Naderi, Prakash, and Silva-Martinez [39]            1250.00              3600.00          0.80             0.04     8.00
Schlögl and Zimmermann [40]                        1650.00              2510.00          0.61             0.06     6.06
Notes. (i) τSR values assume the visible path dominates the summer’s SR (low/moderate-β). If softmax dominates at U3
   in the high-β regime, multiply SR-limited values by κ = (β/2) (Ah /Av ) (with Ah = Av = 1 V, simply β/2). (ii) The
 current-limit bound τI-limit = CAv /Imax is typically ≪ all reported values for C ∼ 50 fF and Imax ∼mA, so it is omitted
                                from the table but must still be respected in circuit sizing.


   • U2: self-term. sµ = Aself fµ , so |ṡµ | = Aself |f˙µ | ≤ (Aself Lf Ah )/τ , which gives τ ≥ (Aself Lf Ah )/SRU2 .
   • U3: internal state drive. The time-varying portion of the RC circuit drive dµ is a linear combina-
     tion of fµ and gi , with coefficients that have a maximum magnitude of Aself . Using the bounds on
     the slopes of those inputs, we get the following bound on |d˙µ | and subsequently the time constant
     bound:
     $$ |\dot d_\mu| \lesssim \frac{A_{self}}{\tau} \max\{L_f A_h, L_g A_v\} \quad\Rightarrow\quad \tau \ge \frac{A_{self} \max(L_f A_h, L_g A_v)}{SR_{U3}} \tag{83} $$

All together, the combined constraint is
                                                                               
$$ \tau_{SR} = \max\left\{ \frac{L_f A_h}{SR_{U1}},\ \frac{A_{self} L_f A_h}{SR_{U2}},\ \frac{A_{self} \max(L_f A_h, L_g A_v)}{SR_{U3}} \right\} \tag{84} $$

I.1.3   Current / headroom limit
U3 must provide the current through R2 to charge C1 . The RC circuit dynamics dictate R2 C1 ḣµ =
−hµ + dµ , so the instantaneous current needed by U3 is

$$ I_{U3,out} = \frac{d_\mu - h_\mu}{R_2} = C_1 \dot h_\mu \tag{85} $$

We must respect |IU3,out | ≤ Imax,U3 . With |ḣµ | ≲ Ah /τ ,

$$ \tau_{I\text{-limit}} \ge \frac{C_1 A_h}{I_{max,U3}} \tag{86} $$

I.1.4   Combined bound on minimum time constant
Taken together, the minimum time constant must satisfy the bounds (82), (84), and (86):

                                           τmin ≥ max{τGBW , τSR , τI-limit }                                        (87)
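
The three bounds can be combined in a small helper. The sketch below assumes, as in the derivation, that U1, U2, and U3 share the same op-amp (one GBW, one SR, one Imax); the function name and the example device parameters are hypothetical.

```python
import math

def tau_min(gbw_hz, sr_v_per_s, i_max_a, c_farad, a_self, l_f, l_g, a_h, a_v):
    """Lower bound on the neuron time constant from Eqs. (82), (84), (86), (87).

    Assumes identical op-amps for U1, U2, and U3. Returns (tau_min, (tau_gbw, tau_sr, tau_i)).
    """
    tau_gbw = max(1.0, a_self) / (2.0 * math.pi * gbw_hz)               # Eq. (82)
    tau_sr = max(l_f * a_h,
                 a_self * l_f * a_h,
                 a_self * max(l_f * a_h, l_g * a_v)) / sr_v_per_s       # Eq. (84)
    tau_i = c_farad * a_h / i_max_a                                     # Eq. (86)
    return max(tau_gbw, tau_sr, tau_i), (tau_gbw, tau_sr, tau_i)

# Hypothetical op-amp: GBW = 100 MHz, SR = 100 V/us, Imax = 2 mA, C1 = 50 fF,
# unit swings and slopes (L_f = 1 corresponds to beta = 2 under L_f <= beta / 2).
tau, (tau_gbw, tau_sr, tau_i) = tau_min(
    gbw_hz=100e6, sr_v_per_s=100e6, i_max_a=2e-3, c_farad=50e-15,
    a_self=1.0, l_f=1.0, l_g=1.0, a_h=1.0, a_v=1.0)
print(f"tau_GBW={tau_gbw:.3e} s  tau_SR={tau_sr:.3e} s  tau_I={tau_i:.3e} s  tau_min={tau:.3e} s")
```

For these illustrative numbers the slew-rate bound is the binding one.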

I.2     Estimates of inference times with existing hardware
Under standard assumptions for DenseAMs (symmetric couplings and monotone activations), the Lya-
punov energy decreases monotonically and the dynamics converge without oscillations. The settling time
is therefore on the order of a few multiples of the largest neuronal time constant, which we bound by
amplifier non-idealities. In this section we take some representative examples of op-amps from literature
and estimate the inference speeds from reasonable and representative design parameters.


Minimum time constant.             For illustration purposes, we choose three reasonable hardware constraints:
    • Activation slopes. Take the slope of the visible activation to be Lg = 1, as occurs for an
      identity visible neuron activation. Take the worst-case (maximum) slope of the hidden activation
      to be that of the softmax with fixed β, whose Jacobian is βG(f ) with ∥G(f )∥2 ≤ 1/2, so a safe
      global bound is Lf ≤ β/2.
    • Signal swing. Use the voltage scaling invariance (see Appendix F) to rescale v, ξ, and β together
      to pick a swing that is slew-rate friendly but well above component noise limits. Take
      Av = Ah = 1 V.

    • Self-term gain. With row/column budgets, use Aself = 1 as a worst-case bound, the value assumed
      in Table 6.
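
The slope bound Lf ≤ β/2 for the softmax can be probed numerically. The following sketch builds the Jacobian β(diag(f) − f fᵀ) of softmax(βh) and measures its spectral norm on random inputs and on a two-way tie, where the bound is expected to be tight. The sample sizes and the value of β are arbitrary choices for illustration.

```python
import numpy as np

def softmax_jacobian(h, beta):
    """Jacobian of softmax(beta * h) with respect to h: beta * (diag(f) - f f^T)."""
    z = beta * (h - h.max())
    f = np.exp(z)
    f /= f.sum()
    return beta * (np.diag(f) - np.outer(f, f))

rng = np.random.default_rng(0)
beta = 8.0

# Largest spectral norm seen over random hidden inputs.
worst = 0.0
for _ in range(1000):
    h = rng.normal(scale=2.0, size=16)
    worst = max(worst, np.linalg.norm(softmax_jacobian(h, beta), 2))

# Two hidden neurons tied for the lead, all others far behind.
h_tie = np.array([0.0, 0.0] + [-50.0] * 14)
tie_norm = np.linalg.norm(softmax_jacobian(h_tie, beta), 2)

print(f"random worst case: {worst:.4f}, tie: {tie_norm:.4f}, beta/2 = {beta / 2}")
```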
With those choices, the three lower bounds per neuron are:

    1. GBW Bound: $\tau_{GBW} = \frac{\max(1, A_{self})}{2\pi\,\mathrm{GBW}} = \frac{1}{2\pi\,\mathrm{GBW}}$.
    2. SR Bound: The U1/U2 path gives $\tau_{SR,vis} = \frac{L_g A_v}{SR} = \frac{1}{SR}$ µs, with SR expressed in V/µs. In the
       U3 (summer) path, equation (84) has two cases. In the low-β regime where Lg Av ≥ Lf Ah , the U3
       bound reduces to 1/SR µs. In the high-β regime where Lf Ah = β/2 dominates, scale the
       slew-rate-limited bound by β/2.
    3. Output Current Bound: In practice, this bound generally does not limit the op-amp choice:
       even with a large capacitor C = 50 fF, Av = 1 V, and Imax = 2 mA, τI-limit ≈ 0.025 ns, which is
       negligible compared to the bounds from SR and GBW.
To quantify realistic inference speeds, Table 6 lists representative CMOS operational transconductance
amplifiers (OTAs)³ drawn from recent literature, together with their corresponding lower bounds on
neuronal time constants under the GBW and slew-rate limits. Even using conservative assumptions
with existing amplifier designs, the analysis shows that modern high-speed OTAs can achieve sub-10 ns
neuronal convergence times (6.06 ns and 8.00 ns for the two fastest entries in Table 6), corresponding
to inference rates above 100 MHz.
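
The entries of Table 6 follow from the SR and GBW columns by direct arithmetic. The sketch below recomputes them using the simplified formulas from the table caption (Av = 1 V, Lg = 1, Aself = 1, Tconv = 10 τmin); the function and variable names are illustrative.

```python
import math

# (reference, SR in V/us, GBW in MHz), copied from Table 6
AMPLIFIERS = [
    ("Perez and Maloberti [36]", 84.50, 321.50),
    ("Assaad and Silva-Martinez [37]", 94.10, 134.20),
    ("Yen and Blalock [38]", 202.00, 10.70),
    ("Naderi, Prakash, and Silva-Martinez [39]", 1250.00, 3600.00),
    ("Schlögl and Zimmermann [40]", 1650.00, 2510.00),
]

def table6_row(sr_v_per_us, gbw_mhz, l_g=1.0, a_v=1.0, a_self=1.0, settle_factor=10.0):
    """Return (tau_SR, tau_GBW, T_conv) in nanoseconds."""
    tau_sr_ns = l_g * a_v / sr_v_per_us * 1e3                          # us -> ns
    tau_gbw_ns = max(1.0, a_self) / (2.0 * math.pi * gbw_mhz) * 1e3    # 1/MHz = us -> ns
    t_conv_ns = settle_factor * max(tau_sr_ns, tau_gbw_ns)
    return tau_sr_ns, tau_gbw_ns, t_conv_ns

rows = {name: table6_row(sr, gbw) for name, sr, gbw in AMPLIFIERS}
for name, (tau_sr, tau_gbw, t_conv) in rows.items():
    rate_mhz = 1e3 / t_conv   # one inference per T_conv, expressed in MHz
    print(f"{name:42s} tau_SR={tau_sr:6.2f} ns  tau_GBW={tau_gbw:6.2f} ns  "
          f"T_conv={t_conv:7.2f} ns  (~{rate_mhz:.0f} MHz)")
```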


J     Connection between analog and canonical Energy Transformer
In this section we show that in the adiabatic limit, our Analog Energy Transformer (Analog ET) reduces
to the canonical Energy Transformer. Begin with the dynamics for the Analog Energy Transformer
implemented by our circuit designs.

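$$ \tau_v \dot v = -\frac{\partial E}{\partial v} = (\xi^{attn})^\top f^{attn} + (\xi^{hopf})^\top f^{hopf} + a - v \tag{88} $$
$$ \tau_h \dot h^{attn} = -\frac{\partial E}{\partial f^{attn}} = \xi^{attn} v + b - h^{attn} \tag{89} $$
$$ \tau_h \dot h^{hopf} = -\frac{\partial E}{\partial f^{hopf}} = \xi^{hopf} v + c - h^{hopf} \tag{90} $$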
Integrating out hidden neurons in the adiabatic limit where τh → 0, we see the relations

$$ h^{attn}(v) = \xi^{attn} v + b \tag{91} $$
$$ h^{hopf}(v) = \xi^{hopf} v + c \tag{92} $$

which we can use to integrate out the hidden neuron activations as

$$ f^{attn}(v) = \mathrm{softmax}\left(\beta\,(\xi^{attn} v + b)\right) \tag{93} $$
$$ f^{hopf}(v) = \mathrm{ReLU}\left(\xi^{hopf} v + c\right) \tag{94} $$

Substituting into the visible dynamics:
$$ \tau_v \dot v = (\xi^{attn})^\top f^{attn}(v) + (\xi^{hopf})^\top f^{hopf}(v) + a - v \tag{95} $$
   Footnote 3. Many high-speed CMOS “op-amps” are reported as OTAs (transconductors). In our neuron, these OTA
cores operate in closed-loop (unity/non-inverting) configurations, so the literature SR and GBW directly constrain τ
via Eqs. (82)–(84).


We can ask ourselves, what scalar energy produces this ODE? We seek an energy $E_{eff}(v)$ such that
$\tau_v \dot v = -\partial E_{eff} / \partial v$. Equivalently,

$$ \nabla_v E_{eff}(v) = v - a - (\xi^{attn})^\top f^{attn}(v) - (\xi^{hopf})^\top f^{hopf}(v) \tag{96} $$

We can construct $E_{eff}(v)$ as a sum of three pieces whose gradients match each term:
$E_{eff}(v) = E_{quad}(v) + E_{attn}(v) + E_{hopf}(v)$. By inspection we see that $E_{quad}(v) = \frac{1}{2} \|v - a\|^2$.

Attention term.       The energy function
$$ E_{attn}(v) = -\frac{1}{\beta} \log \sum_A \exp\left(\beta\left(\xi_A^{attn} v + b_A\right)\right) \tag{97} $$

satisfies our requirement. Differentiating with respect to $v_i$, we get
$$ \frac{\partial E_{attn}}{\partial v_i} = -\sum_A \mathrm{softmax}\left(\beta\,(\xi^{attn} v + b)\right)_A \cdot \xi_{Ai}^{attn} \tag{98} $$
$$ = -\sum_A \xi_{Ai}^{attn} f_A^{attn}(v) \tag{99} $$
which yields our desired dynamics of $\nabla_v E_{attn}(v) = -(\xi^{attn})^\top f^{attn}(v)$.

Hopfield term.       A simple way to achieve the desired dynamics is with a Hopfield-type energy function
$$ E_{hopf}(v) = -\sum_\mu \frac{1}{2}\, \mathrm{ReLU}\left(\xi_\mu^{hopf} v + c_\mu\right)^2 \tag{100} $$

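whose derivative with respect to $v_i$ yields
$$ \frac{\partial E_{hopf}}{\partial v_i} = -\sum_\mu \mathrm{ReLU}\left(\xi_\mu^{hopf} v + c_\mu\right) \cdot \xi_{\mu i}^{hopf} \tag{101} $$
$$ = -\sum_\mu \xi_{\mu i}^{hopf} f_\mu^{hopf}(v) \tag{102} $$

which yields our desired dynamics of $\nabla_v E_{hopf}(v) = -(\xi^{hopf})^\top f^{hopf}(v)$.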

Effective energy function of analog energy transformer.               All together, the effective scalar energy
over the visible state v after integrating out hidden neurons is
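$$ E_{eff}(v) = \underbrace{\frac{1}{2} \|v - a\|_2^2}_{E_{quad}} \; \underbrace{-\, \frac{1}{\beta} \log \sum_A \exp\left(\beta\left(\xi_A^{attn} v + b_A\right)\right)}_{E_{attn}} \; \underbrace{-\, \sum_\mu \frac{1}{2}\, \mathrm{ReLU}\left(\xi_\mu^{hopf} v + c_\mu\right)^2}_{E_{hopf}} \tag{103} $$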

This effective energy aligns with the canonical Energy Transformer’s energy function. Because our circuit
dynamics retain explicit hidden neurons, the energy function written in the main text includes their
contributions. When τh ≪ τv , the dynamics approach those obtained by integrating out the hidden
neurons. Hence, the effective expressibility and behavior of our system are equivalent to those of
the original Energy Transformer.
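
The relation between the effective energy and the reduced visible dynamics can be verified numerically. The sketch below compares a central finite-difference gradient of the effective energy over v with the right-hand side of the reduced dynamics, using random weights with the dimensions of Table 5. The softmax is taken as softmax(βh), as in Table 5; the random parameters, β, and step size are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, M, beta = 16, 8, 16, 2.0                      # sizes from Table 5; beta is arbitrary
xi_attn, xi_hopf = rng.normal(size=(L, D)), rng.normal(size=(M, D))
a, b, c = rng.normal(size=D), rng.normal(size=L), rng.normal(size=M)

def energy(v):
    """Effective energy: quadratic + attention (log-sum-exp) + Hopfield (ReLU^2) terms."""
    z = beta * (xi_attn @ v + b)
    e_attn = -(z.max() + np.log(np.exp(z - z.max()).sum())) / beta   # stable log-sum-exp
    e_hopf = -0.5 * np.sum(np.maximum(xi_hopf @ v + c, 0.0) ** 2)
    return 0.5 * np.sum((v - a) ** 2) + e_attn + e_hopf

def drive(v):
    """Right-hand side of the reduced visible dynamics, i.e. tau_v * dv/dt."""
    z = beta * (xi_attn @ v + b)
    f_attn = np.exp(z - z.max())
    f_attn /= f_attn.sum()
    f_hopf = np.maximum(xi_hopf @ v + c, 0.0)
    return xi_attn.T @ f_attn + xi_hopf.T @ f_hopf + a - v

v = rng.normal(size=D)
eps = 1e-6
grad = np.array([(energy(v + eps * e) - energy(v - eps * e)) / (2 * eps)
                 for e in np.eye(D)])
max_err = np.max(np.abs(grad + drive(v)))           # should vanish if drive = -grad E_eff
print("max |grad E_eff + tau_v * dv/dt| =", max_err)
```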
    In our model we omit the layer normalization activation that the original Energy Transformer applies
to the visible neurons. This keeps the circuit design simple, while still enabling models with high
expressibility. This choice does not modify the structure of the attention or the Hopfield parts of the
energy; only the self-energy of v differs. From a modeling perspective, layer normalization mainly
improves conditioning and learning of deep networks rather than changing the computational primitive
and expressibility. We empirically observe that the resulting models without layer normalization remain
expressive enough to solve the problems we present. In principle, a layer normalization-type visible
activation function could be implemented in analog hardware (e.g. by subtracting the mean voltage
and normalizing by an on-chip variance estimate), but this would add distracting complications to the
minimalist neuron and circuit designs we show in this paper.

