Dense Associative Memories with Analog Circuits

Marc Gong Bacvanski¹, Xincheng You², John Hopfield³, and Dmitry Krotov⁴
¹MIT, ²Independent Researcher, ³Princeton University, ⁴IBM Research

December 16, 2025
arXiv:2512.15002v1 [cs.NE] 17 Dec 2025

Abstract: The increasing computational demands of modern AI systems have exposed fundamental limitations of digital hardware, driving interest in alternative paradigms for efficient large-scale inference. Dense Associative Memory (DenseAM) is a family of models that offers a flexible framework for representing many contemporary neural architectures, such as transformers and diffusion models, by casting them as dynamical systems evolving on an energy landscape. In this work, we propose a general method for building analog accelerators for DenseAMs and implementing them using electronic RC circuits, crossbar arrays, and amplifiers. We find that our analog DenseAM hardware performs inference in constant time independent of model size. This result highlights an asymptotic advantage of analog DenseAMs over digital numerical solvers that scale at least linearly with the model size. We consider three settings of progressively increasing complexity: XOR, the Hamming (7,4) code, and a simple language model defined on binary variables. We propose analog implementations of these three models and analyze the scaling of inference time, energy consumption, and hardware. Finally, we estimate lower bounds on the achievable time constants imposed by amplifier specifications, suggesting that even conservative existing analog technology can enable inference times on the order of tens to hundreds of nanoseconds. By harnessing the intrinsic parallelism and continuous-time operation of analog circuits, our DenseAM-based accelerator design offers a new avenue for fast and scalable AI hardware.

1 Introduction

The unprecedented growth of artificial intelligence (AI) has driven demand for increasingly large and powerful models.
At present, the field of generative AI is primarily driven by two settings: autoregressive transformers [1] and diffusion models [2]. While these settings have demonstrated remarkable capabilities, they do so at a substantial computational cost. Their current implementations utilize digital computation, which faces fundamental challenges in energy efficiency, scalability, and latency, especially as model sizes and deployment demands continue to grow [3, 4, 5]. These limitations have prompted interest in alternative computational paradigms that can efficiently handle the demands of modern AI workloads [6].

Code available at https://github.com/mbacvanski/AnalogET.

Dense Associative Memories (DenseAMs) [7, 8], a promising class of AI models which generalize Hopfield networks [9], offer a new angle for tackling these problems. Unlike conventional feed-forward models, DenseAM inference can be defined through the temporal evolution of a state vector that is governed by a system of differential equations [10]. The state vector can be thought of as a particle exploring the surface of a high-dimensional energy landscape, which is the Lyapunov function of these dynamical equations. DenseAMs have been demonstrated to be flexible and expressive computational frameworks, capable of representing many primitives of modern AI architectures, such as the attention mechanism [11], transformers [12], and diffusion models [13, 14, 15]. Furthermore, DenseAMs are error-correcting systems [16], a property ensuring that small perturbations of the desired temporal evolution of the state vector are corrected away by the dynamics of the network itself, rather than accumulated in time. Finally, DenseAMs are asymptotically stable: the computation happens during a finite transient period of time, which is followed by a steady state of neural activities.
This asymptotic stabilization of dynamical trajectories removes the requirement to read out the “answer” to the computation problem at a precise moment of time, making DenseAMs robust to several classes of hardware imperfections. The confluence of the above properties makes DenseAMs appealing networks for analog hardware implementations that, on the one hand, are grounded in the physics of stable error-correcting dynamical systems and, on the other hand, are capable of representing computation in state-of-the-art AI networks.

In 1989, Hopfield argued that analog neural hardware can exceed the efficiency of digital implementations when the device physics directly instantiate the computational dynamics of the model itself [17]. Here, we revisit this idea with DenseAM models: we propose an analog circuit-based hardware accelerator design whose dynamics directly realize those of the DenseAM. We find that analog DenseAM hardware enables constant-time inference independent of model size, which is in stark contrast to GPU solvers and digital implementations. This intrinsic property makes DenseAM a natural fit for analog AI accelerators, and it highlights our circuit architecture as a viable hardware path to realize them. Using component specifications already demonstrated in fabricated devices, analog DenseAM hardware may achieve inference times on the order of tens to hundreds of nanoseconds, several orders of magnitude faster than digital systems.

By leveraging the natural dynamics of analog systems, this work establishes a new design of fast and scalable AI accelerators. The framework of DenseAMs and their efficient analog hardware implementations suggest a pathway for fundamentally redesigning the hardware-software interface for AI, enabling a new paradigm for fast, energy-efficient, and scalable computation.
2 Dense Associative Memory basics

The DenseAM framework [10, 18] provides a model that has straightforward neuronal dynamics, yet is surprisingly expressive in its ability to represent AI models including transformer attention, diffusion models, and associative memories. In its simplest form it is defined by two sets of neurons (typically called visible and hidden neurons) and a system of coupled non-linear differential equations governing their behavior; see Figure 1. The visible neurons are characterized by their internal states $v_i$ and their outputs $g_i$, with index $i = 1, \ldots, N_v$, while the hidden neurons have internal states $h_\mu$ and outputs $f_\mu$, with index $\mu = 1, \ldots, N_h$. From the AI perspective, one can think of the internal state of a neuron as its pre-activation, and of the output as its post-activation, which is obtained by applying an activation function to the pre-activation. From the biological perspective, one can think of the internal state of a neuron as a membrane voltage potential, and of its output as an axonal output, or firing rate. This framework admits both neuron-wise activation functions ($g_i = g(v_i)$, where $g(\cdot)$ is some continuous function, e.g., a ReLU), and collective activation functions such as softmax or layer normalization, which depend on the states of multiple neurons.

The network parameters are stored in the synaptic weights $\xi \in \mathbb{R}^{N_h \times N_v}$, whose matrix elements, denoted by $\xi_{\mu i}$, can be either hand-engineered or learned. The time decay constants for the two groups of neurons are $\tau_v$ and $\tau_h$. With these conventions, the temporal evolution of the two groups of neurons can be expressed as

$$
\begin{cases}
\tau_v \dfrac{dv_i}{dt} = \sum\limits_{\mu=1}^{N_h} \xi_{\mu i} f_\mu + a_i - v_i \\[2ex]
\tau_h \dfrac{dh_\mu}{dt} = \sum\limits_{i=1}^{N_v} \xi_{\mu i} g_i + b_\mu - h_\mu
\end{cases}
\tag{1}
$$

This forms a bipartite graph of neuronal connections, where the state of the hidden neurons is updated by the state of the visible neurons, and vice versa.
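As an illustration of how Equation 1 can be evaluated numerically, the following minimal sketch integrates the two coupled equations with a forward-Euler scheme. It is not the authors' implementation: the function name `simulate_denseam`, the toy weight matrix `XI`, the bias choice, the inverse temperature `beta`, and the step size are all assumptions made for this example, and the activations are fixed to an identity on the visible neurons and a softmax on the hidden neurons.

```python
import math


def softmax(h, beta=1.0):
    """Numerically stable softmax with inverse temperature beta."""
    m = max(h)
    e = [math.exp(beta * (x - m)) for x in h]
    s = sum(e)
    return [x / s for x in e]


def simulate_denseam(xi, a, b, v0, h0, beta=5.0,
                     tau_v=1.0, tau_h=0.1, dt=0.01, steps=5000):
    """Forward-Euler integration of Equation 1.

    Assumes g_i = v_i (identity) for the visible neurons and
    f = softmax(beta * h) for the hidden neurons.
    """
    n_h, n_v = len(xi), len(xi[0])
    v, h = list(v0), list(h0)
    for _ in range(steps):
        g = v                      # visible outputs
        f = softmax(h, beta)       # hidden outputs
        # tau_v dv_i/dt = sum_mu xi[mu][i] f_mu + a_i - v_i
        dv = [(sum(xi[mu][i] * f[mu] for mu in range(n_h)) + a[i] - v[i]) / tau_v
              for i in range(n_v)]
        # tau_h dh_mu/dt = sum_i xi[mu][i] g_i + b_mu - h_mu
        dh = [(sum(xi[mu][i] * g[i] for i in range(n_v)) + b[mu] - h[mu]) / tau_h
              for mu in range(n_h)]
        v = [v[i] + dt * dv[i] for i in range(n_v)]
        h = [h[mu] + dt * dh[mu] for mu in range(n_h)]
    return v, h


# Hypothetical toy problem: two stored patterns over three visible neurons.
XI = [[1.0, 0.0, 1.0],
      [0.0, 1.0, 0.0]]
A = [0.0, 0.0, 0.0]
# Illustrative bias choice: b_mu = -1/2 * sum_i xi[mu][i]^2.
B = [-0.5 * sum(x * x for x in row) for row in XI]

v_final, h_final = simulate_denseam(XI, A, B, v0=[0.9, 0.1, 0.8], h0=[0.0, 0.0])
print([round(x, 3) for x in v_final])
```

Starting from a noisy version of the first row of `XI`, the visible state relaxes to a point close to that row. The same matrix `XI` is used in both update rules, once summed over hidden neurons and once over visible neurons.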
Importantly, the same matrix $\xi$ appears in both equations, once as $\xi$ and again as $\xi^\top$. Although this is sometimes described as using “symmetric” weights, $\xi$ is not assumed to be symmetric in the linear-algebraic sense; it is simply the same matrix used in both directions. Finally, $a_i$ and $b_\mu$ denote biases, which are additional weights of the system and whose values may be hard-coded or learned depending on the application.

Figure 1: Top left: Bipartite neural network formulation, where hidden neurons $h_\mu$ and visible neurons $v_i$ are connected via symmetric synaptic weights $\xi$. Top right: Circuit realization of symmetric weight matrix via resistive crossbar array. Each crosspoint encodes a weight $\xi_{\mu i}$ by its resistance $R_{\mu i} = 1/\xi_{\mu i}$. Lower right: Circuit schematic of a single hidden neuron. It drives its row of the crossbar array with a voltage according to its activation $f_\mu$, and its internal dynamics are driven by the incoming current flowing into it from the crossbar array. Lower left: Softmax activation function built from bipolar junction transistors (some components not shown).

The most important aspect of this model is the existence of a global energy function (Lyapunov function) that describes neuronal dynamics. To demonstrate this, it is most convenient to use the Lagrangian formalism [10, 18, 16]. Each set of neurons is defined through a Lagrangian function of their internal states. The activation functions are defined as partial derivatives of that Lagrangian with respect to internal states. The total energy is the sum of energies of each set of neurons, plus the interaction energy. The energy of each set of neurons is a Legendre transformation of the corresponding Lagrangian (plus the term proportional to the bias).
Thus, the global energy of Equation 1 is given by

$$
E = \underbrace{\Big[ \sum_{i=1}^{N_v} g_i (v_i - a_i) - L_v \Big]}_{\text{energy of visible neurons}}
+ \underbrace{\Big[ \sum_{\mu=1}^{N_h} f_\mu (h_\mu - b_\mu) - L_h \Big]}_{\text{energy of hidden neurons}}
- \underbrace{\sum_{\mu=1}^{N_h} \sum_{i=1}^{N_v} f_\mu \xi_{\mu i} g_i}_{\text{interaction energy}}
\tag{2}
$$

where the activation functions are defined as partial derivatives of the Lagrangians

$$
g_i = \frac{\partial L_v}{\partial v_i}, \qquad f_\mu = \frac{\partial L_h}{\partial h_\mu}
$$

For convex Lagrangians this global energy decreases with time on the dynamical trajectories of Equation 1. If, additionally, the activation functions (and corresponding Lagrangians) are chosen in such a way that this energy is bounded from below, the dynamical trajectories are guaranteed to arrive at a stable fixed point of activations. The dynamical equations typically have many asymptotic fixed points, which correspond to local minima of the energy function in Equation 2. Both properties above (convexity of Lagrangians and lower-bounded energy) are satisfied for all settings studied in this paper. By picking different nonlinear activation functions (or corresponding Lagrangians), this system yields a variety of models that can describe associative memory, softmax attention, and other commonly used settings in AI [10, 11, 18, 19, 20].

A particularly relevant example for modern sequence modeling is the Energy Transformer (ET) [12], which reformulates the transformer’s inference pass as a gradient flow on an energy function defined over the set of tokens. The ET block contains two contributions to the energy function: the attention energy and the Hopfield network energy. The energy attention module routes the information between the tokens, while the Hopfield module aligns the tokens with the manifold of token embeddings. In our implementation, the context tokens act as a set of dynamically instantiated memories that interact with the predicted token through a DenseAM-like energy.
In Section 6 we exploit this connection to construct an Analog Energy Transformer (Analog ET) whose continuous-time dynamics are implemented directly in hardware using our DenseAM circuit primitives.

3 Related work

Early analog implementations of associative memories focused on the classical Hopfield network. Foundational designs, such as continuous-time analog circuits [21, 22] and later demonstrations using amorphous-silicon resistors [23], memristive devices [24, 25], and phase-change memories [26], targeted the quadratic Hopfield energy function. These works emphasize device engineering and memory-cell design rather than system-level dynamics, and inherit the limited storage capacity and representational power of traditional Hopfield networks. That line of research is largely concerned with how to fabricate the programmable resistance elements themselves; our work assumes programmable conductances as a given primitive and focuses on the continuous-time dynamics that operate on top of them. Our work also differs from these works by addressing DenseAMs with higher-order energy functions and continuous-valued states.

Another direction is the use of cavity-QED systems for associative memory. Marsh et al. [27] analyze a confocal cavity implementation of a quadratic Hopfield network and show that the cavity dynamics induce a descent-like relaxation rule on spin states. Their model remains restricted to quadratic energies and binary spins, and operates in a cryogenic, cavity-QED setting. Our work instead targets higher-order DenseAMs with continuous states, and emphasizes scalable, room-temperature analog microelectronics with explicit hardware-aware dynamical analysis.

More recent physical implementations move beyond purely quadratic energies. Musa et al. [28] propose a free-space optical realization of the higher-order DenseAM energy.
Their system constructs a static physical representation of the energy landscape, but inference relies on an external digital controller that performs iterative spin-flip updates. Thus, the hardware computes energies, while the optimization dynamics remain digital. In contrast, our analog microelectronic architecture embeds the gradient flow itself into hardware: inference is performed by a single continuous-time evolution rather than by discrete digital updates.

4 DenseAM circuit design

Here, we introduce a novel architecture for a class of analog electronic hardware accelerators whose physical time evolution models the system of nonlinear differential equations in Equation 1. Our DenseAM design, shown in Figure 1, comprises two sets of neurons that interact through a resistive crossbar array. The resistive crossbar array turns voltage differences between neurons into currents flowing between the neurons according to synaptic weights, and each neuron’s internal circuitry converts those currents into dynamics that reproduce Equation 1.

Resistive weights as a crossbar array. The crossbar array construction is a canonical design of matrix-vector multiplication using analog electronics [17, 29], and is a natural fit for the weight matrix $\xi$ in our model. Traditionally, each crosspoint between a row and column line is connected by a resistor (often memristors, RRAM, or other analog memories that produce resistances), a vector of input voltages is applied at the row lines, and the column lines are held at ground, typically via a transimpedance amplifier. By Ohm’s law, each resistive crosspoint produces a current that multiplies the row’s input voltage by the inverse of the resistance. Because currents add along each column line, the total current output at a column is the inner product between the vector of input voltages and the column’s conductance vector.
Thus, the array as a whole implements a parallel analog matrix multiplication of the form $I_{\text{out}} = G V_{\text{in}}$, where $G$ is the matrix of conductances (inverse of resistances). Unlike a traditional crossbar array whose rows are driven at a fixed voltage and whose columns are held at ground, our DenseAM circuit design uses each weight bidirectionally, exactly representing the bidirectional connections between visible and hidden neurons. As a result, the current flowing into each neuron corresponds to the weighted sum of the differences between visible and hidden neuron activations. For example, for hidden neuron $\mu$, this current is $\sum_i \xi_{\mu i} (g_i - f_\mu)$. This construction enables weight symmetry to be enforced by hardware sharing: both forward and reverse weights are realized by the same resistive elements. Importantly, as long as weights are represented as conductances, they must be non-negative.

Figure 2: Solving XOR with a DenseAM. Visible neuron $g_3 = v_3$ serves as the output, while the two input neurons (unlabeled, thin lines) are clamped at 1 and 0 for True and False. Output $v_3$ is initialized at 0.5 and converges to a positive prediction of 1. The hidden neuron $f_3$ for the truth-table row (1, 0, 1) becomes highly activated, while the others (fine lines) are suppressed by softmax. Energy (2), or equivalently (5), decreases monotonically along the inference trajectory. [Panels: visible neurons, hidden neurons, and energy versus time (s).]

Figure 3: XOR energy landscape of neuron $v_3$ under different settings of visible input neurons $v_1$ and $v_2$. Minima in the energy function correspond to stationary points of the dynamics. Gradient flow dynamics bring $v_3$ to these attractor points, resulting in correct XOR outputs. [Panels: energy versus $v_3$ for inputs (0, 0), (0, 1), (1, 0), and (1, 1).]

Design of a single neuron.
Each neuron in the circuit computes its dynamics by integrating the currents it receives from the crossbar array, which represent weighted differences between its own activation and those of connected neurons. Considering a hidden neuron (the design for visible neurons is symmetric), the neuron’s internal voltage $h_\mu$ is stored on capacitor $C_1$ and evolves in continuous time, while the neuron’s activation $f_\mu$ is obtained by passing $h_\mu$ through a nonlinear function (e.g., ReLU or softmax). The current flowing into hidden neuron $\mu$ is produced by its interaction with all visible neurons via the synaptic weights $\xi_{\mu i}$ for $i = 1, \ldots, N_v$. Specifically, this is a weighted sum of the differences between neuron activations: $\sum_i \xi_{\mu i} (g_i - f_\mu)$. Inside each neuron, a “self” path scales $f_\mu$ to produce the voltage $s_\mu = f_\mu \sum_i \xi_{\mu i}$. This term is added to the value of the incoming current so that the $-f_\mu \sum_i \xi_{\mu i}$ term is cancelled inside each neuron. As a result, the hidden state, represented as the voltage across capacitor $C_1$, integrates only the desired weighted input plus any external stimulus $b_\mu$. Its dynamics reduce to the canonical DenseAM form with a time constant of $R_2 C_1$:

$$
R_2 C_1 \frac{dh_\mu}{dt} = \sum_{i=1}^{N_v} \xi_{\mu i} g_i + b_\mu - h_\mu
\tag{3}
$$

Elementwise (or vectorized) nonlinearities then produce activations $g_i = g(v_i)$ and $f_\mu = f(h_\mu)$ (e.g., ReLU, softmax) across the visible and hidden neurons. See Appendix A for the full circuit derivation.

5 Analog DenseAM Examples

We begin by studying two examples of the proposed design: the XOR task, and the (7,4) error-correcting Hamming code.

5.1 XOR

The XOR problem is a canonical test for nonlinear representation and inference, as it cannot be solved by any linear model. We show a minimal DenseAM model for the XOR task, illustrating how energy-based dynamics can solve this simple task with a continuous-time analog system. The network consists of $N_v = 3$ visible neurons and $N_h = 4$ hidden neurons.
At $t = 0$, visible neurons $v_1$ and $v_2$ are initialized at their input values corresponding to the input bits. The last visible neuron $v_3$ is initialized at $v_3 = 0.5$. The hidden neurons are initialized at zero. The two input visible neurons remain clamped during the dynamics, while the third (output) visible neuron and the hidden neurons evolve in time according to (1). Each row of the memory matrix $\xi$ corresponds to a row of the XOR truth table. The visible neurons use an identity activation function where $g_i = v_i$, and the hidden neurons use a softmax activation. The biases are set as

$$
a_i = 0, \qquad b_\mu = -\frac{1}{2} \sum_{i=1}^{N_v} \xi_{\mu i}^2
$$

Figure 2 shows the temporal evolution of visible and hidden neuron activations, as well as the total energy, during inference on the XOR input (1, 0). The output visible neuron’s activation $g_3$ gradually converges to the correct prediction of 1, while the hidden neuron associated with that memory, $f_3$, becomes strongly activated and the remaining hidden neurons are suppressed by the softmax nonlinearity. The system’s energy decreases monotonically throughout the trajectory and stabilizes once the network reaches its fixed-point prediction.

Figure 3 depicts the system’s energy landscape as a function of output neuron $v_3$ for different clamped input configurations $(v_1, v_2)$. In each case, the energy exhibits a clear convex minimum at the correct XOR output, demonstrating that gradient flow along the energy surface drives $v_3$ reliably toward the correct prediction. As shown in Appendix C, we validate our circuit design and dynamics using SPICE simulation.

To analyze this DenseAM, it is instructive to consider the limit $\tau_h \to 0$. Since the second equation in (1) is linear in hidden units $h_\mu$, they can be integrated out.
With $\sum_{\mu=1}^{N_h} f_\mu = 1$, the resulting dynamics of the visible neurons can be written as

$$
\tau_v \frac{dv_i}{dt} = \sum_{\mu=1}^{N_h} \left( \xi_{\mu i} - v_i \right) f_\mu
\quad \text{where} \quad
f_\mu = \operatorname{softmax}\!\left( -\frac{\beta}{2} \sum_{i=1}^{N_v} \left( \xi_{\mu i} - v_i \right)^2 \right)
\tag{4}
$$

The effective energy on the visible neurons can be written as

$$
E^{\text{eff}}(v) = -\frac{1}{\beta} \log \sum_{\mu=1}^{N_h} \exp\!\left[ -\frac{\beta}{2} \sum_{i=1}^{N_v} \left( \xi_{\mu i} - v_i \right)^2 \right]
\tag{5}
$$

Intuitively, each hidden neuron computes a squared Euclidean distance between the visible state and its stored pattern $\xi_\mu$. The softmax nonlinearity assigns higher weight to the pattern closest to the current state of the visible neurons. The resulting visible neuron dynamics are gradient flow for this effective energy. It is important to note that memories in this implementation are represented by conductances of the crossbar array, which are always non-negative. For this reason, matrix elements of memories $\xi_{\mu i}$ must be non-negative, necessitating the use of the bias terms, which are just voltage sources that can be arbitrarily signed.

While a time constant of $\tau_h = 0$ is impossible to physically construct due to finite conductances and nonzero capacitances, setting $\tau_h \ll \tau_v$ realizes the same adiabatic limit in practice. When hidden neurons evolve much faster than visible ones, they reach their steady state almost instantaneously for each configuration of visible neurons. The result is an adiabatic elimination of hidden dynamics, yielding the effective visible-only dynamics above. In practice, for the XOR task, even a relatively modest ratio of $\tau_h = \tau_v / 10$ yields perfect performance.

5.2 Hamming (7,4) code

The Hamming (7,4) code is an error-correcting code that encodes 4 data bits into a 7-bit codeword by adding 3 parity bits. The resulting 7-bit strings are special: only certain patterns are valid codewords, and they are spaced apart so that if a single bit is flipped, the error can be detected and corrected [30]. Table 1 lists the 16 codewords corresponding to four arbitrary data bits.
Table 1: Valid codewords of the Hamming (7,4) code, ordered by their 4-bit data payload.

Data bits (d1 d2 d3 d4)    Codeword (c1 c2 c3 c4 c5 c6 c7)
0000                       0000000
0001                       0001111
0010                       0010110
0011                       0011001
0100                       0100101
0101                       0101010
0110                       0110011
0111                       0111100
1000                       1000011
1001                       1001100
1010                       1010101
1011                       1011010
1100                       1100110
1101                       1101001
1110                       1110000
1111                       1111111

Figure 4: Correcting a bit error in a Hamming (7,4) code. Visible neuron $g_5$ flips, indicating the bit flip error happened on the 5th codeword bit. $f_7$ is the hidden neuron corresponding to the memory of the correct codeword. Thin lines correspond to the other neuron activations. [Panels: visible neurons, hidden neurons, and energy versus time (s).]

Unlike the XOR case where the only evolving neuron is the readout bit, the Hamming (7,4) code may require flipping the value of any one of the visible neurons. During inference, the visible neurons are initialized to the corrupted 7-bit input word. All neurons are left free to evolve, and the dynamics relax the state toward the nearest stored codeword. Energy minima are located at the valid codewords, so the network converges to the correct code provided the error is within the Hamming radius of 1. Thus, the DenseAM replicates the standard decoding property of the Hamming (7,4) code: any single-bit flip is corrected automatically. Figure 4 illustrates a case where a flipped bit $g_5$ is restored during convergence.

The Hamming (7,4) model’s 7 visible neurons, each corresponding to a codeword bit, are connected to 16 hidden neurons, each representing one valid codeword. The weight matrix $\xi \in \{0, 1\}^{16 \times 7}$ is formed by stacking the 16 codewords as its rows. Visible neurons have the identity activation, hidden neurons use a softmax activation, and biases are chosen as in the XOR case to give the same integrated-out visible dynamics as (4).
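The integrated-out dynamics (4) and the effective energy (5) can be exercised on the Hamming (7,4) code with a short numerical sketch. This is a digital forward-Euler approximation written for illustration, not the analog circuit or the authors' code: the function names, the inverse temperature `beta = 8`, the step size, and the number of steps are assumptions of this example. The rows of `XI` are the 16 codewords of Table 1, and the example word `0110111` is the seventh codeword of Table 1 with its fifth bit flipped.

```python
import math

# The 16 valid codewords from Table 1; each one becomes a row of xi.
CODEWORDS = [
    "0000000", "0001111", "0010110", "0011001",
    "0100101", "0101010", "0110011", "0111100",
    "1000011", "1001100", "1010101", "1011010",
    "1100110", "1101001", "1110000", "1111111",
]
XI = [[float(c) for c in word] for word in CODEWORDS]


def _logits(v, xi, beta):
    """-(beta/2) * squared distance between v and each stored row."""
    return [-0.5 * beta * sum((x - vi) ** 2 for x, vi in zip(row, v))
            for row in xi]


def hidden_activations(v, xi, beta):
    """Softmax over hidden neurons, as in the right-hand part of Eq. 4."""
    logits = _logits(v, xi, beta)
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    s = sum(e)
    return [x / s for x in e]


def effective_energy(v, xi, beta):
    """Eq. 5, evaluated with a stable log-sum-exp."""
    logits = _logits(v, xi, beta)
    m = max(logits)
    return -(m + math.log(sum(math.exp(l - m) for l in logits))) / beta


def correct(word, xi=XI, beta=8.0, tau_v=1.0, dt=0.05, steps=200):
    """Relax a 7-bit word with forward-Euler steps of Eq. 4."""
    v = [float(c) for c in word]
    energies = [effective_energy(v, xi, beta)]
    for _ in range(steps):
        f = hidden_activations(v, xi, beta)
        # tau_v dv_i/dt = sum_mu (xi[mu][i] - v_i) f_mu
        v = [vi + (dt / tau_v) * sum(fm * (row[i] - vi) for fm, row in zip(f, xi))
             for i, vi in enumerate(v)]
        energies.append(effective_energy(v, xi, beta))
    decoded = "".join(str(int(round(x))) for x in v)
    return decoded, energies


decoded, energies = correct("0110111")
print(decoded)
```

Under these assumed settings the corrupted word relaxes back to `0110011`, and the recorded values of the effective energy do not increase from one step to the next, which is the discrete counterpart of the monotone energy decrease described above.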
6 Analog Energy Transformer (Analog ET) via DenseAM

Our DenseAM circuit construction can be used to build more complex energy-based models, such as the transformer-like architecture proposed in the Energy Transformer paper [12]. For causal next-token prediction with a single attention head, the Energy Transformer’s energy function can be written as follows (see Appendix J for the full derivation):

$$
E = \tfrac{1}{2} \lVert v - a \rVert^2
- v^\top \left[ (\xi^{\text{attn}})^\top f^{\text{attn}} + (\xi^{\text{hopf}})^\top f^{\text{hopf}} \right]
+ (f^{\text{attn}})^\top \left( h^{\text{attn}} - b \right)
+ (f^{\text{hopf}})^\top \left( h^{\text{hopf}} - c \right)
- L^{\text{attn}}\!\left( h^{\text{attn}} \right)
- L^{\text{hopf}}\!\left( h^{\text{hopf}} \right)
\tag{6}
$$

with the activation functions and their Lagrangians defined as

$$
f^{\text{attn}}_A = \operatorname{softmax}(\beta h^{\text{attn}})_A, \qquad
L^{\text{attn}}(h) = \frac{1}{\beta} \log \sum_{A=1}^{L} e^{\beta h_A}
\tag{7}
$$

$$
f^{\text{hopf}}_\mu = \operatorname{ReLU}(h^{\text{hopf}}_\mu), \qquad
L^{\text{hopf}}(h) = \frac{1}{2} \sum_{\mu=1}^{M} \left[ \operatorname{ReLU}(h_\mu) \right]^2
\tag{8}
$$

where $a$, $b$, and $c$ correspond to the biases of the visible neurons, attention hidden neurons, and Hopfield network hidden neurons, respectively. The $L$ context tokens are indexed by $A$, and the $M$ hidden neurons of the Hopfield network are indexed by $\mu$.

Figure 5: Analog ET circuit demonstrating the autoregressive inference procedure. A newly inferred token is decoded, sampled, and re-embedded to obtain the weight vector $\xi^{\text{attn}}_{L+1}$, which is set as the weight vector for a new hidden neuron $h^{\text{attn}}_{L+1}$ in the energy attention block (light gray on right). For this layout we have flipped the crossbar array, so that indices $A$ and $\mu$ run horizontally and index $i$ runs vertically.

Because the visible units use an identity activation function,
$g_i = v_i$ using the language of Equation 1, the gradient flow of the energy yields the dynamics:

$$
\tau_v \dot{v} = -\frac{\partial E}{\partial v} = (\xi^{\text{attn}})^\top f^{\text{attn}} + (\xi^{\text{hopf}})^\top f^{\text{hopf}} + a - v
\tag{9}
$$

$$
\tau_h \dot{h}^{\text{attn}} = -\frac{\partial E}{\partial f^{\text{attn}}} = \xi^{\text{attn}} v + b - h^{\text{attn}}
\tag{10}
$$

$$
\tau_h \dot{h}^{\text{hopf}} = -\frac{\partial E}{\partial f^{\text{hopf}}} = \xi^{\text{hopf}} v + c - h^{\text{hopf}}
\tag{11}
$$

In this formulation, $v$ represents the embedding of the output (next) token, and its evolution is driven by two terms: one term from the energy attention with weights $\xi^{\text{attn}}$ and hidden neuron activations $f^{\text{attn}}$, and one term from the Hopfield network with weights $\xi^{\text{hopf}}$ and hidden neuron activations $f^{\text{hopf}}$. The weights of the energy attention DenseAM depend on the context: for a token dimension $D$, context length $L$, and the task of predicting the token at index $L + 1$, the weights $\xi^{\text{attn}} \in \mathbb{R}^{L \times D}$ are generated by applying a learned embedding matrix to each context token. In contrast, the Hopfield network weights $\xi^{\text{hopf}}$ are learned during training and fixed at inference. The number of memories in the Hopfield network is a hyperparameter $M$, such that $\xi^{\text{hopf}} \in \mathbb{R}^{M \times D}$.

This system suggests a hardware implementation where $v$ interacts with two independent DenseAMs, one for the energy attention and one for the Hopfield term, which can share the same physical crossbar structure. Figure 5 shows that the circuit structure remains a crossbar array (like Figure 1), but with two distinct classes of hidden neurons. Because of the summation of currents along each row of the crossbar array, the incoming current to visible neuron $v_i$ is the sum of contributions from the energy attention block and from the Hopfield network block. The energy attention hidden neurons $h^{\text{attn}}$ use a softmax activation function, while the Hopfield network hidden neurons $h^{\text{hopf}}$ use a ReLU activation.

6.1 Analog Energy Transformer on the parity task

We build and evaluate the Analog ET on the $L$-bit parity task, which can be thought of as an elementary “language model”: given bits $\mathrm{bit}_1, \ldots$
, $\mathrm{bit}_L$, predict $\mathrm{bit}_{L+1} = \left( \sum_{A=1}^{L} \mathrm{bit}_A \right) \bmod 2$. Parity is instructive because it requires a representation of a global, order-$L$ interaction, precluding linear and shallow models from representing it efficiently. A successful model must be able to form high-order interactions in order to generalize. We formulate parity as a next-token prediction problem: given an $L$-bit string as context, predict its parity in the next token.

We train the Analog ET model digitally using backpropagation through time [31] implemented with JAX’s automatic differentiation. The resulting weights can be deployed onto the analog hardware; in our experiments we simulate the dynamics of hardware with the Diffrax [32] ODE solver library. On the 8-bit parity task, our model achieves 100% accuracy on the held-out validation set of 52 bit strings, demonstrating clear generalization capabilities. See Appendix H.1 for more details on training and model design.

Figure 6: Inference of parity Analog ET on two example 8-bit strings. Top row plots the visible neurons $v_i$ over time, middle row plots the decoded token prediction, bottom row plots the energy that monotonically decreases during inference. After a transient period of computation, the network arrives at a steady state, making the result of the computation robust against the precise timing of the readout. [Panels: input 11001010 with prediction 0, and input 01000110 with prediction 1; horizontal axis is time t.]

Figure 6 shows the dynamics of the visible neurons and energy during two example inference runs of the Analog ET. Notably, the visible neuron values are constant by the end of the inference period, meaning that the inference remains highly stable to mismatch and delay in timing during readout.
A single sample-and-hold and switching circuit would enable a single Analog-to-Digital Converter (ADC) to read out all the visible neurons at convergence, significantly reducing mismatch and drastically saving device area, complexity, and energy. The intrinsic stability of attractor points arises uniquely from the continuous-time dynamics of the DenseAM, making these models particularly well suited to analog hardware.

6.2 Autoregressive inference

Dashed lines in Figure 5 illustrate the autoregressive inference procedure of the Analog ET. To generate the $L$-th token given context tokens $x^{(1)}, \ldots, x^{(L-1)}$, each token is first embedded and concatenated to form the attention weight matrix

$$
\xi^{\text{attn},(L-1)} =
\begin{bmatrix}
e^{(1)} \\ e^{(2)} \\ \vdots \\ e^{(L-1)}
\end{bmatrix}
\in \mathbb{R}^{(L-1) \times D}
$$

These rows are loaded into the Analog ET’s energy attention weight matrix $\xi^{\text{attn}}$ by programming the corresponding crossbar resistances. During inference, the visible state $v(t)$ evolves according to the Analog ET dynamics until convergence. A decoder readout (e.g., a linear layer) applied to the converged $v(t = T)$ values produces logits, from which the next token $x^{(L)}$ is sampled. This token is then embedded to form $e^{(L)}$, and appended to the existing context. The cycle repeats with the updated attention weight matrix

$$
\xi^{\text{attn},(L)} =
\begin{bmatrix}
\xi^{\text{attn},(L-1)} \\ e^{(L)}
\end{bmatrix}
\in \mathbb{R}^{L \times D}
$$

which now includes the new embedding $e^{(L)}$. In hardware, this corresponds to connecting an additional hidden neuron in the energy attention block of Figure 5, and setting its resistive weights with $e^{(L)}$. Because the physical order of hidden neurons does not affect the energy function, this new neuron can be placed in any position among the hidden neurons. When the context length is fixed, the hidden neuron corresponding to the earliest token can simply be reprogrammed with the new vector of weights $e^{(L)}$, resulting in the hardware equivalent of a sliding-window context.
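The autoregressive cycle described above can be sketched in software as follows. Everything specific in this block is a stand-in chosen for illustration and is not taken from the paper: the two-token embedding table `EMBED`, the untrained Hopfield weights `XI_HOPF`, the zero biases, the value of `beta`, the forward-Euler integration of Equations 9 to 11 in place of the analog relaxation, and the nearest-embedding rule `decode` in place of a learned decoder readout followed by sampling.

```python
import math

# Hypothetical embedding table (token -> D-dimensional vector, D = 2).
EMBED = {0: [1.0, 0.0], 1: [0.0, 1.0]}
# Hypothetical, untrained Hopfield weights (M = 2 memories), kept small
# so that the ReLU feedback cannot make the toy dynamics diverge.
XI_HOPF = [[0.3, 0.0], [0.0, 0.3]]


def softmax(h, beta):
    m = max(h)
    e = [math.exp(beta * (x - m)) for x in h]
    s = sum(e)
    return [x / s for x in e]


def matvec(mat, vec):
    """mat @ vec."""
    return [sum(m * x for m, x in zip(row, vec)) for row in mat]


def mat_t_vec(mat, vec):
    """mat^T @ vec."""
    return [sum(mat[r][i] * vec[r] for r in range(len(mat)))
            for i in range(len(mat[0]))]


def relax(xi_attn, xi_hopf, beta=4.0, tau_v=1.0, tau_h=0.1, dt=0.01, steps=3000):
    """Forward-Euler integration of Eqs. 9-11 with biases a = b = c = 0."""
    v = [0.0] * len(xi_hopf[0])
    h_attn = [0.0] * len(xi_attn)
    h_hopf = [0.0] * len(xi_hopf)
    for _ in range(steps):
        f_attn = softmax(h_attn, beta)
        f_hopf = [max(0.0, x) for x in h_hopf]
        drive = [p + q for p, q in zip(mat_t_vec(xi_attn, f_attn),
                                       mat_t_vec(xi_hopf, f_hopf))]
        new_v = [vi + (dt / tau_v) * (d - vi) for vi, d in zip(v, drive)]
        h_attn = [h + (dt / tau_h) * (c - h) for h, c in zip(h_attn, matvec(xi_attn, v))]
        h_hopf = [h + (dt / tau_h) * (c - h) for h, c in zip(h_hopf, matvec(xi_hopf, v))]
        v = new_v
    return v


def decode(v):
    """Stand-in readout: pick the token whose embedding is nearest to v."""
    return min(EMBED, key=lambda t: sum((a - b) ** 2 for a, b in zip(EMBED[t], v)))


def generate(context, n_new, window=None):
    """Append n_new tokens, one relaxation per token."""
    tokens = list(context)
    rows = [EMBED[t] for t in tokens]          # rows of xi_attn
    if window is not None:
        rows = rows[-window:]
    for _ in range(n_new):
        tok = decode(relax(rows, XI_HOPF))
        tokens.append(tok)
        if window is not None and len(rows) == window:
            rows.pop(0)                        # reuse the earliest token's neuron
        rows.append(EMBED[tok])                # program the new embedding
    return tokens, rows


tokens, rows = generate([0, 0, 1], n_new=4, window=4)
print(tokens, len(rows))
```

Each pass through the loop corresponds to one cycle of the procedure: program the attention rows, let the state relax, read out a token, and store its embedding as a new row. With `window` set, the row for the earliest token is dropped before the new one is added, which mirrors the sliding-window reprogramming described above; the order of the rows is irrelevant to the dynamics.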
In practice, an external digital controller, e.g., a Field-Programmable Gate Array (FPGA) or Application-Specific Integrated Circuit (ASIC), would orchestrate crossbar programming and token decoding, while the DenseAM dynamics perform the far more substantial workload of computing each next-token embedding.

This procedure is analogous to key-value (KV) caching in standard transformer inference [33]. Context tokens $x^{(1)}, \ldots, x^{(L-1)}$ produce key and value vectors $k^{(1)}, \ldots, k^{(L-1)}$ and $v^{(1)}, \ldots, v^{(L-1)}$, respectively. When a new token $x^{(L)}$ is generated, its corresponding $k^{(L)}$ and $v^{(L)}$ vectors are appended to the cache, allowing all previous k(