From b83947778e2c776f757a07d4719b7ce961d7ed55 Mon Sep 17 00:00:00 2001
From: Yuren Hao
Date: Fri, 3 Jul 2026 05:56:50 -0500
Subject: =?UTF-8?q?Initial=20commit:=20ept=20=E2=80=94=20backprop-free=20e?=
 =?UTF-8?q?quilibrium=20transformer=20(EP)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).

Co-Authored-By: Claude Opus 4.8
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
---
 ep_run/analogET_extracted.txt | 1861 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1861 insertions(+)
 create mode 100644 ep_run/analogET_extracted.txt

(limited to 'ep_run/analogET_extracted.txt')

diff --git a/ep_run/analogET_extracted.txt b/ep_run/analogET_extracted.txt
new file mode 100644
index 0000000..b139640
--- /dev/null
+++ b/ep_run/analogET_extracted.txt
@@ -0,0 +1,1861 @@
+Dense Associative Memories with Analog Circuits
+
+Marc Gong Bacvanski (1), Xincheng You (2), John Hopfield (3), and Dmitry Krotov (4)
+(1) MIT, (2) Independent Researcher, (3) Princeton University, (4) IBM Research
+
+December 16 2025
+arXiv:2512.15002v1 [cs.NE] 17 Dec 2025
+
+Abstract: The increasing computational demands of modern AI systems have exposed fundamental
+limitations of digital hardware, driving interest in alternative paradigms for efficient
+large-scale inference. Dense Associative Memory (DenseAM) is a family of models that offers a
+flexible framework for representing many contemporary neural architectures, such as transformers
+and diffusion models, by casting them as dynamical systems evolving on an energy landscape. In this
+work, we propose a general method for building analog accelerators for DenseAMs and implementing
+them using electronic RC circuits, crossbar arrays, and amplifiers.
+We find that our analog DenseAM hardware performs inference in constant time independent of model
+size. This result highlights an asymptotic advantage of analog DenseAMs over digital numerical
+solvers that scale at least linearly with the model size. We consider three settings of
+progressively increasing complexity: XOR, the Hamming (7,4) code, and a simple language model
+defined on binary variables. We propose analog implementations of these three models and analyze
+the scaling of inference time, energy consumption, and hardware. Finally, we estimate lower bounds
+on the achievable time constants imposed by amplifier specifications, suggesting that even
+conservative existing analog technology can enable inference times on the order of tens to
+hundreds of nanoseconds. By harnessing the intrinsic parallelism and continuous-time operation of
+analog circuits, our DenseAM-based accelerator design offers a new avenue for fast and scalable AI
+hardware.
+
+1 Introduction
+
+The unprecedented growth of artificial intelligence (AI) has driven demand for increasingly large
+and powerful models. At present, the field of generative AI is primarily driven by two settings:
+autoregressive transformers [1] and diffusion models [2]. While these settings have demonstrated
+remarkable capabilities, they do so at a substantial computational cost. Their current
+implementations utilize digital computation, which faces fundamental challenges in energy
+efficiency, scalability, and latency, especially as model sizes and deployment demands continue to
+grow [3, 4, 5]. These limitations have prompted interest in alternative computational paradigms
+that can efficiently handle the demands of modern AI workloads [6].
+
+Dense Associative Memories (DenseAMs) [7, 8], a promising class of AI models which generalize
+Hopfield networks [9], offer a new angle for tackling these problems.
+Unlike conventional feed-forward models, DenseAM inference can be defined through the temporal
+evolution of a state vector that is governed by a system of differential equations [10]. The state
+vector can be thought of as a particle exploring the surface of a high-dimensional energy
+landscape, which is the Lyapunov function of these dynamical equations. DenseAMs have been
+demonstrated to be flexible and expressive computational frameworks, capable of representing many
+primitives of modern AI architectures, such as the attention mechanism [11], transformers [12],
+and diffusion models [13, 14, 15]. Furthermore, DenseAMs are error-correcting systems [16], a
+property ensuring that small perturbations of the desired temporal evolution of the state vector
+are corrected away by the dynamics of the network itself, rather than accumulated in time.
+Finally, DenseAMs are asymptotically stable: during the course of temporal evolution, the
+computation happens during a finite transient period of time, which is followed by a steady state
+of neural activities. This asymptotic stabilization of dynamical trajectories removes the
+requirement to read out the “answer” to the computation problem at a precise moment of time,
+making DenseAMs robust to several classes of hardware imperfections. The confluence of the above
+properties makes DenseAMs appealing networks for analog hardware implementations that, on the one
+hand, are grounded in the physics of stable error-correcting dynamical systems and, on the other
+hand, are capable of representing computation in state-of-the-art AI networks.
+
+Code available at https://github.com/mbacvanski/AnalogET.
+
+In 1989, Hopfield argued that analog neural hardware can exceed the efficiency of digital
+implementations when the device physics directly instantiate the computational dynamics of the
+model itself [17].
+Here, we revisit this idea with DenseAM models: we propose an analog circuit-based hardware
+accelerator design whose dynamics directly realize those of the DenseAM. We find that analog
+DenseAM hardware enables constant-time inference independent of model size, which is in stark
+contrast to GPU solvers and digital implementations. This intrinsic property makes DenseAM a
+natural fit for analog AI accelerators, and it highlights our circuit architecture as a viable
+hardware path to realize them. Using component specifications already demonstrated in fabricated
+devices, analog DenseAM hardware may achieve inference times on the order of tens to hundreds of
+nanoseconds, several orders of magnitude faster than digital systems.
+
+By leveraging the natural dynamics of analog systems, this work establishes a new design of fast
+and scalable AI accelerators. The framework of DenseAMs and their efficient analog hardware
+implementations suggest a pathway for fundamentally redesigning the hardware-software interface
+for AI, enabling a new paradigm for fast, energy-efficient, and scalable computation.
+
+2 Dense Associative Memory basics
+
+The DenseAM framework [10, 18] provides a model that has straightforward neuronal dynamics, yet is
+surprisingly expressive in its ability to represent AI models including transformer attention,
+diffusion models, and associative memories. In its simplest form it is defined by two sets of
+neurons (typically called visible and hidden neurons) and a system of coupled non-linear
+differential equations governing their behavior, see Figure 1. The visible neurons are
+characterized by their internal states $v_i$ and their outputs $g_i$, index $i = 1 \dots N_v$;
+while the hidden neurons have internal states $h_\mu$ and outputs $f_\mu$, index
+$\mu = 1 \dots N_h$.
+From the AI perspective, one can think about the internal state of the neuron as a pre-activation
+of that neuron, and the output as a post-activation, which is obtained by applying an activation
+function to the pre-activation. From the biological perspective, one can think about the internal
+state of the neuron as a membrane voltage potential, and the output of that neuron as an axonal
+output, or a firing rate of that neuron. This framework admits both neuron-wise activation
+functions ($g_i = g(v_i)$, where $g(\cdot)$ is some continuous function, e.g., a ReLU), and
+collective activation functions such as softmax or layer normalization, which depend on the states
+of multiple neurons.
+
+The network parameters are stored in the synaptic weights $\xi \in \mathbb{R}^{N_h \times N_v}$,
+whose matrix elements, denoted by $\xi_{\mu i}$, can be either hand-engineered or learned. The time
+decay constants for the two groups of neurons are $\tau_v$ and $\tau_h$. With these conventions,
+the temporal evolution of the two groups of neurons can be expressed as
+$$
+\begin{cases}
+\tau_v \dfrac{dv_i}{dt} = \sum\limits_{\mu=1}^{N_h} \xi_{\mu i} f_\mu + a_i - v_i \\[2ex]
+\tau_h \dfrac{dh_\mu}{dt} = \sum\limits_{i=1}^{N_v} \xi_{\mu i} g_i + b_\mu - h_\mu
+\end{cases}
+\tag{1}
+$$
+This forms a bipartite graph of neuronal connections, where the state of the hidden neurons is
+updated by the state of the visible neurons, and vice versa. Importantly, the same matrix $\xi$
+appears in both equations, once as $\xi$ and again as $\xi^\top$. Although this is sometimes
+described as using “symmetric” weights, $\xi$ is not assumed to be symmetric in the
+linear-algebraic sense; it is simply the same matrix used in both directions. Finally, $a_i$ and
+$b_\mu$ denote biases, which are additional weights of the system and whose values may be
+hard-coded or learned depending on the application.
+
+The most important aspect of this model is the existence of a global energy function (Lyapunov
+function) that describes neuronal dynamics. To demonstrate this, it is most convenient to use the
+Lagrangian formalism [10, 18, 16].
+Each set of neurons is defined through a Lagrangian function of their internal states. The
+activation functions are defined as partial derivatives of that Lagrangian with respect to
+internal states. The total energy is the sum of energies of each set of neurons, plus the
+interaction energy.
+
+Figure 1: Top left: Bipartite neural network formulation, where hidden neurons $h_\mu$ and visible
+neurons $v_i$ are connected via symmetric synaptic weights $\xi$. Top right: Circuit realization
+of symmetric weight matrix via resistive crossbar array. Each crosspoint encodes a weight
+$\xi_{\mu i}$ by its resistance $R_{\mu i} = 1/\xi_{\mu i}$. Lower right: Circuit schematic of a
+single hidden neuron. It drives its row of the crossbar array with a voltage according to its
+activation $f_\mu$, and its internal dynamics are driven by the incoming current flowing into it
+from the crossbar array. Lower left: Softmax activation function built from bipolar junction
+transistors (some components not shown).
+
+The energy of each set of neurons is a Legendre transformation of the corresponding Lagrangian
+(plus the term proportional to the bias). Thus, the global energy of Equation 1 is given by
+$$
+E = \underbrace{\Big[\sum_{i=1}^{N_v} g_i (v_i - a_i) - L_v\Big]}_{\text{energy of visible neurons}}
+  + \underbrace{\Big[\sum_{\mu=1}^{N_h} f_\mu (h_\mu - b_\mu) - L_h\Big]}_{\text{energy of hidden neurons}}
+  - \underbrace{\sum_{\mu=1}^{N_h} \sum_{i=1}^{N_v} f_\mu \xi_{\mu i} g_i}_{\text{interaction energy}}
+\tag{2}
+$$
+where the activation functions are defined as partial derivatives of the Lagrangians
+$$
+g_i = \frac{\partial L_v}{\partial v_i}, \qquad f_\mu = \frac{\partial L_h}{\partial h_\mu}
+$$
+For convex Lagrangians this global energy decreases with time on the dynamical trajectories of
+Equation 1. If, additionally, the activation functions (and corresponding Lagrangians) are chosen
+in such a way that this energy is bounded from below, the dynamical trajectories are guaranteed to
+arrive at a stable fixed point of activations. The dynamical equations typically have many
+asymptotic fixed points, which correspond to local minima of the energy function in Equation 2.
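The claim that the energy decreases for convex Lagrangians can be checked in a few lines. The following is a sketch of the standard argument, written here for convenience rather than quoted from the paper; it uses only Equations 1 and 2 and the definitions $g_i = \partial L_v / \partial v_i$, $f_\mu = \partial L_h / \partial h_\mu$. Differentiating Equation 2 in time, the terms $g_i \dot v_i - \dot L_v$ and $f_\mu \dot h_\mu - \dot L_h$ cancel by the chain rule, which leaves:

```latex
\begin{align*}
\frac{dE}{dt}
  &= \sum_{i=1}^{N_v} \frac{dg_i}{dt}\Big(v_i - a_i - \sum_{\mu=1}^{N_h} \xi_{\mu i} f_\mu\Big)
   + \sum_{\mu=1}^{N_h} \frac{df_\mu}{dt}\Big(h_\mu - b_\mu - \sum_{i=1}^{N_v} \xi_{\mu i} g_i\Big) \\
  &= -\tau_v \sum_{i=1}^{N_v} \frac{dg_i}{dt}\frac{dv_i}{dt}
     -\tau_h \sum_{\mu=1}^{N_h} \frac{df_\mu}{dt}\frac{dh_\mu}{dt}
     \qquad \text{(substituting Equation 1)} \\
  &= -\tau_v \sum_{i,j} \frac{dv_i}{dt}\,\frac{\partial^2 L_v}{\partial v_i\,\partial v_j}\,\frac{dv_j}{dt}
     -\tau_h \sum_{\mu,\nu} \frac{dh_\mu}{dt}\,\frac{\partial^2 L_h}{\partial h_\mu\,\partial h_\nu}\,\frac{dh_\nu}{dt}
  \;\le\; 0 .
\end{align*}
```

The last step holds because the Hessian of a convex Lagrangian is positive semi-definite, so each quadratic form is non-negative.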
+Both properties above (convexity of Lagrangians and lower-bounded energy) are satisfied for all
+settings studied in this paper. By picking different nonlinear activation functions (or
+corresponding Lagrangians), this system yields a variety of models that can describe associative
+memory, softmax attention, and other commonly used settings in AI [10, 11, 18, 19, 20].
+
+A particularly relevant example for modern sequence modeling is the Energy Transformer (ET) [12],
+which reformulates the transformer's inference pass as a gradient flow on an energy function
+defined over the set of tokens. The ET block contains two contributions to the energy function:
+the attention energy and the Hopfield network. The energy attention module routes the information
+between the tokens, while the Hopfield module aligns the tokens with the manifold of token
+embeddings. In our implementation, the context tokens act as a set of dynamically instantiated
+memories that interact with the predicted token through a DenseAM-like energy. In Section 6 we
+exploit this connection to construct an Analog Energy Transformer (Analog ET) whose
+continuous-time dynamics are implemented directly in hardware using our DenseAM circuit
+primitives.
+
+3 Related work
+
+Early analog implementations of associative memories focused on the classical Hopfield network.
+Foundational designs, such as continuous-time analog circuits [21, 22] and later demonstrations
+using amorphous-silicon resistors [23], memristive devices [24, 25], and phase-change memories
+[26], targeted the quadratic Hopfield energy function. These works emphasize device engineering
+and memory-cell design rather than system-level dynamics, and inherit the limited storage capacity
+and representational power of traditional Hopfield networks.
+That line of research is largely concerned with how to fabricate programmable resistance elements
+themselves; our work assumes programmable conductances as a given primitive and focuses on the
+continuous-time dynamics that operate on top of them. Our work also differs from these works by
+addressing DenseAMs with higher-order energy functions and continuous-valued states.
+
+Another direction is the use of cavity-QED systems for associative memory. Marsh et al. [27]
+analyze a confocal cavity implementation of a quadratic Hopfield network and show that the cavity
+dynamics induce a descent-like relaxation rule on spin states. Their model remains restricted to
+quadratic energies and binary spins, and operates in a cryogenic, cavity-QED setting. Our work
+instead targets higher-order DenseAMs with continuous states, and emphasizes scalable,
+room-temperature analog microelectronics with explicit hardware-aware dynamical analysis.
+
+More recent physical implementations move beyond purely quadratic energies. Musa et al. [28]
+propose a free-space optical realization of the higher-order DenseAM energy. Their system
+constructs a static physical representation of the energy landscape, but inference relies on an
+external digital controller that performs iterative spin-flip updates. Thus, the hardware computes
+energies, while the optimization dynamics remain digital. In contrast, our analog microelectronic
+architecture embeds the gradient flow itself into hardware: inference is performed by a single
+continuous-time evolution rather than by discrete digital updates.
+
+4 DenseAM circuit design
+
+Here, we introduce a novel architecture for a class of analog electronic hardware accelerators
+that models Equation 1's system of nonlinear differential equations using time evolution. Our
+DenseAM design, shown in Figure 1, comprises two sets of neurons that interact through a resistive
+crossbar array.
+The resistive crossbar array turns voltage differences between neurons into currents flowing
+between the neurons according to synaptic weights, and each neuron's internal circuitry converts
+those currents into dynamics that reproduce Equation 1.
+
+Resistive weights as a crossbar array. The crossbar array construction is a canonical design of
+matrix-vector multiplication using analog electronics [17, 29], and is a natural fit for the
+weight matrix $\xi$ in our model. Traditionally, each crosspoint between a row and column line is
+connected by a resistor (often memristors, RRAM, or other analog memories that produce
+resistances), a vector of input voltages is applied at row lines, and the column lines are held at
+ground, typically via a transimpedance amplifier. By Ohm's law, each resistive crosspoint produces
+a current that multiplies the row's input voltage by the inverse of the resistance. Because
+currents add along each column line, the total current output at a column is the inner product
+between the vector of input voltages and the column's conductance vector. Thus, the array as a
+whole implements a parallel analog matrix multiplication of the form $I_{out} = G V_{in}$, where
+$G$ is the matrix of conductances (inverse of resistances).
+
+Unlike a traditional crossbar array whose rows are driven at a fixed voltage and whose columns are
+held at ground, our DenseAM circuit design uses each weight bidirectionally, exactly representing
+the bidirectional connections between visible and hidden neurons. As a result, the current flowing
+into each neuron corresponds to the weighted sum of the differences between visible and hidden
+neuron activations. For example, for hidden neuron $\mu$, this current is
+$\sum_i \xi_{\mu i} (g_i - f_\mu)$.
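As a small numerical illustration of the bidirectional crossbar current, the sketch below evaluates $\sum_i \xi_{\mu i}(g_i - f_\mu)$ directly from Ohm's law and checks that it splits into the weighted input $\sum_i \xi_{\mu i} g_i$ minus a term $f_\mu \sum_i \xi_{\mu i}$. The conductance and voltage values and all function names are made up for illustration and are not taken from the paper.

```python
# Illustrative values only: a hypothetical 2x3 conductance matrix xi[mu][i]
# and arbitrary activation voltages. None of these numbers come from the paper.
xi = [[1.0, 0.5, 0.0],
      [0.2, 0.0, 0.8]]
g = [0.9, 0.1, 0.4]  # visible activations g_i (voltages on one side of the array)
f = [0.3, 0.7]       # hidden activations f_mu (voltages on the other side)


def current_into_hidden(mu):
    """Sum over crosspoints of conductance times voltage difference (Ohm's law)."""
    return sum(xi[mu][i] * (g[i] - f[mu]) for i in range(len(g)))


def current_into_visible(i):
    """Same conductances, with the voltage difference seen from the other side."""
    return sum(xi[mu][i] * (f[mu] - g[i]) for mu in range(len(f)))


def weighted_input(mu):
    """The term sum_i xi[mu][i] * g[i] that appears in Equation 1."""
    return sum(xi[mu][i] * g[i] for i in range(len(g)))


def self_term(mu):
    """The extra term f[mu] * sum_i xi[mu][i] that comes with bidirectional use."""
    return f[mu] * sum(xi[mu])


for mu in range(len(f)):
    # The two printed numbers agree: crossbar current = weighted input - self term.
    print(mu, current_into_hidden(mu), weighted_input(mu) - self_term(mu))
```

Because every crosspoint carries the same current in both directions with opposite sign, the currents into all hidden and all visible neurons sum to zero in this sketch.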
+This construction enables weight symmetry to be enforced by hardware sharing: both forward and
+reverse weights are realized by the same resistive elements. Importantly, as long as weights are
+represented as conductances, they must be non-negative.
+
+[Figure 2: plot omitted. Panels: visible neurons, hidden neurons, energy; horizontal axis: time (s).]
+Figure 2: Solving XOR with a DenseAM. Visible neuron $g_3 = v_3$ serves as the output, while the
+two input neurons (unlabeled, thin lines) are clamped at 1 and 0 for True and False. Output $v_3$
+is initialized at 0.5 and converges to a positive prediction of 1. The hidden neuron $f_3$ for the
+truth-table row (1, 0, 1) becomes highly activated, while the others (fine lines) are suppressed
+by softmax. Energy (2), or equivalently (5), decreases monotonically along the inference
+trajectory.
+
+[Figure 3: plot omitted. Panels for inputs (1, 0), (1, 1), (0, 0), (0, 1); energy versus $v_3$.]
+Figure 3: XOR energy landscape of neuron $v_3$ under different settings of visible input neurons
+$v_1$ and $v_2$. Minima in the energy function correspond to stationary points of the dynamics.
+Gradient flow dynamics bring $v_3$ to these attractor points, resulting in correct XOR outputs.
+
+Design of a single neuron. Each neuron in the circuit computes its dynamics by integrating the
+currents it receives from the crossbar array, which represent weighted differences between its own
+activation and those of connected neurons. Considering a hidden neuron (the design for visible
+neurons is symmetric), the neuron's internal voltage $h_\mu$ is stored on capacitor $C_1$ and
+evolves in continuous time, while the neuron's activation $f_\mu$ is obtained by passing $h_\mu$
+through a nonlinear function (e.g. ReLU or softmax).
+
+The current flowing into hidden neuron $\mu$ is produced by its interaction with all visible
+neurons via the synaptic weights $\xi_{\mu i}$ for $i = 1, \dots, N_v$. Specifically, this is a
+weighted sum of the differences between neuron activations: $\sum_i \xi_{\mu i} (g_i - f_\mu)$.
+Inside each neuron, a “self” path scales $f_\mu$ to produce the voltage
+$s_\mu = f_\mu \sum_i \xi_{\mu i}$. This term is added to the value of the incoming current so
+that the $-f_\mu \sum_i \xi_{\mu i}$ term is cancelled inside each neuron. As a result, the hidden
+state, represented as the voltage across capacitor $C_1$, integrates only the desired weighted
+input plus any external stimulus $b_\mu$. Its dynamics reduce to the canonical DenseAM form with a
+time constant of $R_2 C_1$:
+$$
+R_2 C_1 \frac{dh_\mu}{dt} = \sum_{i=1}^{N_v} \xi_{\mu i} g_i + b_\mu - h_\mu \tag{3}
+$$
+Elementwise (or vectorized) nonlinearities then produce activations $g_i = g(v_i)$ and
+$f_\mu = f(h_\mu)$ (e.g., ReLU, softmax) across the visible and hidden neurons. See Appendix A for
+the full circuit derivation.
+
+5 Analog DenseAM Examples
+
+We begin by studying two examples of the proposed design: the XOR task, and the (7,4)
+error-correcting Hamming code.
+
+5.1 XOR
+
+The XOR problem is a canonical test for nonlinear representation and inference, as it cannot be
+solved by any linear model. We show a minimal DenseAM model for the XOR task, illustrating how
+energy-based dynamics can solve this simple task with a continuous-time analog system. The network
+consists of $N_v = 3$ visible neurons and $N_h = 4$ hidden neurons. At $t = 0$, visible neurons
+$v_1$ and $v_2$ are initialized at their input values corresponding to the input bits. The last
+visible neuron $v_3$ is initialized at $v_3 = 0.5$. The hidden neurons are initialized at zero.
+The two input visible neurons remain clamped during the dynamics, while the third output visible
+neuron and the hidden neurons evolve in time according to (1). Each row of the memory matrix $\xi$
+corresponds to a row of the XOR truth table. The visible neurons use an identity activation
+function where $g_i = v_i$, and the hidden neurons use a softmax activation.
+The biases are set as
+$$
+a_i = 0, \qquad b_\mu = -\frac{1}{2} \sum_{i=1}^{N_v} \xi_{\mu i}^2
+$$
+
+Figure 2 shows the temporal evolution of visible and hidden neuron activations, as well as the
+total energy, during inference on the XOR input (1, 0). The output visible neuron's activation
+$g_3$ gradually converges to the correct prediction of 1, while the hidden neuron associated with
+that memory, $f_3$, becomes strongly activated and the remaining hidden neurons are suppressed by
+the softmax nonlinearity. The system's energy decreases monotonically throughout the trajectory
+and stabilizes once the network reaches its fixed-point prediction. Figure 3 depicts the system's
+energy landscape as a function of output neuron $v_3$ for different clamped input configurations
+$(v_1, v_2)$. In each case, the energy exhibits a clear convex minimum at the correct XOR output,
+demonstrating that gradient flow along the energy surface drives $v_3$ reliably toward the correct
+prediction. As shown in Appendix C, we validate our circuit design and dynamics using SPICE
+simulation.
+
+To analyze this DenseAM, it is instructive to consider the limit $\tau_h \to 0$. Since the second
+equation in (1) is linear in hidden units $h_\mu$, they can be integrated out. With
+$\sum_{\mu=1}^{N_h} f_\mu = 1$, the resulting dynamics of the visible neurons can be written as
+$$
+\tau_v \frac{dv_i}{dt} = \sum_{\mu=1}^{N_h} \big(\xi_{\mu i} - v_i\big) f_\mu,
+\qquad \text{where} \quad
+f_\mu = \mathrm{softmax}\Big[-\frac{\beta}{2} \sum_{i=1}^{N_v} (\xi_{\mu i} - v_i)^2\Big]
+\tag{4}
+$$
+The effective energy on the visible neurons can be written as
+$$
+E^{\mathrm{eff}}(v) = -\frac{1}{\beta} \log \sum_{\mu=1}^{N_h}
+  \exp\Big[-\frac{\beta}{2} \sum_{i=1}^{N_v} (\xi_{\mu i} - v_i)^2\Big]
+\tag{5}
+$$
+Intuitively, each hidden neuron computes a squared Euclidean distance between the visible state
+and its stored pattern $\xi_\mu$. The softmax nonlinearity assigns higher weight to the pattern
+closest to the current state of the visible neurons. The resulting visible neuron dynamics are
+gradient flow for this effective energy.
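A minimal numerical sketch of the effective XOR dynamics follows. It integrates Equation 4 with forward Euler and evaluates the effective energy of Equation 5 along the way. Several things here are assumptions made for illustration and are not taken from the paper: the value of $\beta$, the step size, the number of steps, the ordering of the truth-table rows, and the use of a digital Euler loop in place of the analog circuit.

```python
import math

# XOR truth table as stored patterns (in1, in2, out); the row order is an assumption.
XI = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
BETA = 10.0  # hypothetical inverse temperature, not a value from the paper


def scores(v):
    """-beta/2 times the squared distance between the visible state and each pattern."""
    return [-0.5 * BETA * sum((x - vi) ** 2 for x, vi in zip(row, v)) for row in XI]


def hidden_activations(v):
    """Softmax over the scores (the f_mu of Equation 4), computed stably."""
    s = scores(v)
    m = max(s)
    exps = [math.exp(x - m) for x in s]
    total = sum(exps)
    return [e / total for e in exps]


def effective_energy(v):
    """Equation 5: -(1/beta) times the log-sum-exp of the scores."""
    s = scores(v)
    m = max(s)
    return -(m + math.log(sum(math.exp(x - m) for x in s))) / BETA


def run_xor(in1, in2, tau_v=1.0, dt=0.01, steps=1000):
    """Forward-Euler integration of Equation 4 with the two input neurons clamped."""
    v = [float(in1), float(in2), 0.5]  # output neuron starts at 0.5, as in the text
    energies = [effective_energy(v)]
    for _ in range(steps):
        f = hidden_activations(v)
        dv3 = sum(fm * (row[2] - v[2]) for fm, row in zip(f, XI)) / tau_v
        v[2] += dt * dv3  # only the output neuron evolves
        energies.append(effective_energy(v))
    return v[2], energies


for a in (0, 1):
    for b in (0, 1):
        out, _ = run_xor(a, b)
        print((a, b), round(out))
```

In this sketch the rounded output matches XOR for all four inputs, and the recorded energies do not increase from one step to the next, which mirrors the monotone energy decrease described for Figure 2.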
+It is important to note that memories in this implementation are represented by conductances of
+the crossbar array, which cannot be negative. For this reason, matrix elements of memories
+$\xi_{\mu i}$ must be non-negative, necessitating the use of the bias terms, which are just
+voltage sources that can be arbitrarily signed.
+
+While a time constant of $\tau_h = 0$ is impossible to physically construct due to finite
+conductances and nonzero capacitances, setting $\tau_h \ll \tau_v$ realizes the same adiabatic
+limit in practice. When hidden neurons evolve much faster than visible ones, they reach their
+steady state almost instantaneously for each configuration of visible neurons. The result is an
+adiabatic elimination of hidden dynamics, yielding the effective visible-only dynamics above. In
+practice, for the XOR task, even a relatively modest ratio of $\tau_h = \tau_v / 10$ yields
+perfect performance.
+
+5.2 Hamming (7,4) code
+
+The Hamming (7,4) code is an error-correcting code that encodes 4 data bits into a 7-bit codeword
+by adding 3 parity bits. The resulting 7-bit strings are special: only certain patterns are valid
+codewords, and they are spaced apart so that if a single bit is flipped, the error can be detected
+and corrected [30]. Table 1 lists the 16 codewords corresponding to four arbitrary data bits.
+
+[Figure 4: plot omitted. Panels: visible neurons, hidden neurons, energy; horizontal axis: time (s).]
+Figure 4: Correcting a bit error in a Hamming (7,4) code. Visible neuron $g_5$ flips, indicating
+that the bit flip error happened on the 5th codeword bit. $f_7$ is the hidden neuron corresponding
+to the memory of the correct codeword. Thin lines correspond to the other neuron activations.
+
+Table 1: Valid codewords of the Hamming(7,4) code, ordered by their 4-bit data payload.
+
+    Data bits (d1 d2 d3 d4)    Codeword (c1 c2 c3 c4 c5 c6 c7)
+    0000                       0000000
+    0001                       0001111
+    0010                       0010110
+    0011                       0011001
+    0100                       0100101
+    0101                       0101010
+    0110                       0110011
+    0111                       0111100
+    1000                       1000011
+    1001                       1001100
+    1010                       1010101
+    1011                       1011010
+    1100                       1100110
+    1101                       1101001
+    1110                       1110000
+    1111                       1111111
+
+Unlike the XOR case where the only evolving neuron is the readout bit, the Hamming (7,4) code may
+require flipping the value of any one of the visible neurons. During inference, the visible
+neurons are initialized to the corrupted 7-bit input word. All neurons are left free to evolve,
+and the dynamics relax the state toward the nearest stored codeword. Energy minima are located at
+the valid codewords, so the network converges to the correct code provided the error is within the
+Hamming radius of 1. Thus, the DenseAM replicates the standard decoding property of the Hamming
+(7,4) code: any single-bit flip is corrected automatically. Figure 4 illustrates a case where a
+flipped bit $g_5$ is restored during convergence.
+
+The Hamming (7,4) model's 7 visible neurons, each corresponding to a codeword bit, are connected
+to 16 hidden neurons, each representing one valid codeword. The weight matrix
+$\xi \in \{0, 1\}^{16 \times 7}$ is formed by stacking the 16 codewords as its rows. Visible
+neurons have the identity activation, hidden neurons use a softmax activation, and biases are
+chosen as in the XOR case to give the same integrated-out visible dynamics as (4).
+
+6 Analog Energy Transformer (Analog ET) via DenseAM
+
+Our DenseAM circuit construction can be used to build more complex energy-based models, such as
+the transformer-like architecture proposed in the Energy Transformer paper [12].
+For causal next-token prediction with a single attention head, the Energy Transformer's energy
+function can be written as follows (see Appendix J for the full derivation):
+$$
+E = \tfrac{1}{2} \lVert v - a \rVert^2
+  - v^\top \Big[ (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}} + (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}} \Big]
+  + (f^{\mathrm{attn}})^\top \big( h^{\mathrm{attn}} - b \big)
+  + (f^{\mathrm{hopf}})^\top \big( h^{\mathrm{hopf}} - c \big)
+  - L^{\mathrm{attn}}\big(h^{\mathrm{attn}}\big) - L^{\mathrm{hopf}}\big(h^{\mathrm{hopf}}\big)
+\tag{6}
+$$
+with the activation functions and their Lagrangians defined as
+$$
+f^{\mathrm{attn}}_A = \mathrm{softmax}\big(\beta h^{\mathrm{attn}}\big)_A, \qquad
+L^{\mathrm{attn}}(h) = \frac{1}{\beta} \log \sum_{A=1}^{L} e^{\beta h_A}
+\tag{7}
+$$
+$$
+f^{\mathrm{hopf}}_\mu = \mathrm{ReLU}\big(h^{\mathrm{hopf}}_\mu\big), \qquad
+L^{\mathrm{hopf}}(h) = \frac{1}{2} \sum_{\mu=1}^{M} \big[\mathrm{ReLU}(h_\mu)\big]^2
+\tag{8}
+$$
+where $a$, $b$, and $c$ correspond to the biases of the visible neurons, attention hidden neurons,
+and Hopfield network hidden neurons, respectively. The $L$ context tokens are indexed by $A$, and
+the $M$ hidden neurons of the Hopfield network are indexed by $\mu$.
+
+Figure 5: Analog ET circuit demonstrating the autoregressive inference procedure. A newly inferred
+token is decoded, sampled, and re-embedded to obtain the weight vector
+$\xi^{\mathrm{attn}}_{L+1}$, which is set as the weight vector for a new hidden neuron
+$h^{\mathrm{attn}}_{L+1}$ in the energy attention block (light gray on right). For this layout we
+have flipped the crossbar array, so that indices $A$ and $\mu$ run horizontally and index $i$ runs
+vertically.
+
+Because the visible units use an identity activation function, $g_i = v_i$ in the language of
+Equation 1, the gradient flow of the energy yields the dynamics:
+$$
+\tau_v \dot{v} = -\frac{\partial E}{\partial v}
+  = (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}} + (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}} + a - v
+\tag{9}
+$$
+$$
+\tau_h \dot{h}^{\mathrm{attn}} = -\frac{\partial E}{\partial f^{\mathrm{attn}}}
+  = \xi^{\mathrm{attn}} v + b - h^{\mathrm{attn}}
+\tag{10}
+$$
+$$
+\tau_h \dot{h}^{\mathrm{hopf}} = -\frac{\partial E}{\partial f^{\mathrm{hopf}}}
+  = \xi^{\mathrm{hopf}} v + c - h^{\mathrm{hopf}}
+\tag{11}
+$$
+In this formulation, $v$ represents the embedding of the output (next) token, and its evolution is
+driven by two terms: one term from the energy attention with weights $\xi^{\mathrm{attn}}$ and
+hidden neuron activations $f^{\mathrm{attn}}$, and one term from the Hopfield network with weights
+$\xi^{\mathrm{hopf}}$ and hidden neuron activations $f^{\mathrm{hopf}}$.
+The weights of the energy attention DenseAM are dependent on the context: for a token dimension
+$D$, context length $L$, and the task of predicting the token at index $L + 1$, the weights
+$\xi^{\mathrm{attn}} \in \mathbb{R}^{L \times D}$ are generated by embedding each token of the
+context via a learned embedding matrix applied to each context token. In contrast, the Hopfield
+network weights $\xi^{\mathrm{hopf}}$ are learned during training and fixed at inference. The
+number of memories in the Hopfield network is a hyperparameter $M$, such that
+$\xi^{\mathrm{hopf}} \in \mathbb{R}^{M \times D}$.
+
+This system suggests a hardware implementation where $v$ interacts with two independent DenseAMs,
+one for the energy attention and one for the Hopfield term, which can share the same physical
+crossbar structure. Figure 5 shows that the circuit structure remains a crossbar array (like
+Figure 1), but with two distinct classes of hidden neurons. Because of the summation of currents
+along each row of the crossbar array, the incoming current to visible neuron $v_i$ is the sum of
+contributions from the energy attention block and from the Hopfield network block. The energy
+attention hidden neurons $h^{\mathrm{attn}}$ use a softmax activation function, while the Hopfield
+network hidden neurons $h^{\mathrm{hopf}}$ use a ReLU activation.
+
+6.1 Analog Energy Transformer on the parity task
+
+We build and evaluate the Analog ET on the $L$-bit parity task, which can be thought of as an
+elementary “language model”: given bits $\mathrm{bit}_1, \dots, \mathrm{bit}_L$, predict
+$\mathrm{bit}_{L+1} = \big(\sum_{A=1}^{L} \mathrm{bit}_A\big) \bmod 2$. Parity is instructive
+because it requires a representation of a global, order-$L$ interaction, precluding linear and
+shallow models from representing it efficiently. A successful model must be able to form
+high-order interactions in order to generalize. We formulate parity as a next-token prediction
+problem: given an $L$-bit string as context, predict its parity in the next token.
+
+We train the Analog ET model digitally using backpropagation through time [31] implemented with
+JAX's automatic differentiation.
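The parity task formulated above is small enough to enumerate exhaustively. The sketch below builds every $L$-bit context with its parity label and carves out a random hold-out set. The function names, the seeded shuffle, and the way the split is drawn are assumptions for illustration; the paper's own data pipeline (described in its Appendix H.1) may differ.

```python
import itertools
import random


def parity_dataset(num_bits):
    """All 2**num_bits contexts, each paired with its parity bit (sum of bits mod 2)."""
    return [(bits, sum(bits) % 2) for bits in itertools.product((0, 1), repeat=num_bits)]


def holdout_split(dataset, num_holdout, seed=0):
    """Shuffle with a fixed seed, then reserve the first num_holdout examples."""
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    return shuffled[num_holdout:], shuffled[:num_holdout]


data = parity_dataset(8)                  # 256 (context, next-token) pairs
train, val = holdout_split(data, 52)      # hold-out size chosen for illustration
print(len(train), len(val))
```

This only constructs the input/target pairs; the training procedure itself is the one described in the surrounding text.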
+The resulting weights can be deployed onto the analog hardware; in our experiments we simulate the
+dynamics of hardware with the Diffrax [32] ODE solver library. On the 8-bit parity task, our model
+achieves 100% accuracy on the hold-out validation set of 52 bit strings, demonstrating clear
+generalization capabilities. See Appendix H.1 for more details on training and model design.
+
+[Figure 6: plot omitted. Columns: context 11001010 with parity 0, context 01000110 with parity 1.
+Rows: visible neurons, prediction, energy; horizontal axis: time t.]
+Figure 6: Inference of parity Analog ET on two example 8-bit strings. Top row plots the visible
+neurons $v_i$ over time, middle row plots the decoded token prediction, bottom row plots the
+energy that monotonically decreases during inference. After a transient period of computation, the
+network arrives at a steady state, making the result of the computation robust against the precise
+timing of the readout.
+
+Figure 6 shows the dynamics of the visible neurons and energy during two example inference runs of
+the Analog ET. Notably, the visible neuron values are constant by the end of the inference period,
+meaning that the inference remains highly stable to mismatch and delay in timing during readout. A
+single sample-and-hold and switching circuit would enable a single Analog-to-Digital Converter
+(ADC) to read out all the visible neurons at convergence, significantly reducing mismatch, and
+drastically saving device area, complexity, and energy. The intrinsic stability of attractor
+points arises uniquely from the continuous-time dynamics of the DenseAM, making these models
+particularly well suited to analog hardware.
+
+6.2 Autoregressive inference
+
+Dashed lines in Figure 5 illustrate the autoregressive inference procedure of the Analog ET. To
+generate the $L$-th token given context tokens $x^{(1)}, \dots, x^{(L-1)}$, each token is first
+embedded and concatenated to form the attention weight matrix
+$$
+\xi^{\mathrm{attn},(L-1)} =
+\begin{bmatrix} e^{(1)} \\ e^{(2)} \\ \vdots \\ e^{(L-1)} \end{bmatrix}
+\in \mathbb{R}^{(L-1) \times D}
+$$
+These rows are loaded into the Analog ET's energy attention weight matrix $\xi^{\mathrm{attn}}$ by
+programming the corresponding crossbar resistances. During inference, the visible state $v(t)$
+evolves according to the Analog ET dynamics until convergence. A decoder readout (e.g., a linear
+layer) applied to the converged $v(t = T)$ values produces logits, from which the next token
+$x^{(L)}$ is sampled. This token is then embedded to form $e^{(L)}$, and appended to the existing
+context. The cycle repeats with the updated attention weight matrix
+$$
+\xi^{\mathrm{attn},(L)} =
+\begin{bmatrix} \xi^{\mathrm{attn},(L-1)} \\ e^{(L)} \end{bmatrix}
+\in \mathbb{R}^{L \times D}
+$$
+which now includes the new embedding $e^{(L)}$. In hardware, this corresponds to connecting an
+additional hidden neuron in the energy attention block of Figure 5, and setting its resistive
+weights with $e^{(L)}$. Because the physical order of hidden neurons does not affect the energy
+function, this new neuron can be placed in any position among the hidden neurons. When the context
+length is fixed, the hidden neuron corresponding to the earliest token can simply be reprogrammed
+with the new vector of weights $e^{(L)}$, resulting in the hardware equivalent of a sliding-window
+context. In practice, an external digital controller, e.g., a Field-Programmable Gate Array (FPGA)
+or Application-Specific Integrated Circuit (ASIC), would orchestrate crossbar programming and
+token decoding, while the DenseAM dynamics perform the far more substantial workload of computing
+each next-token embedding.
+
+This procedure is analogous to key-value (KV) caching in standard transformer inference [33].
+Context tokens $x^{(1)}, \dots, x^{(L-1)}$ produce key and value vectors
+$k^{(1)}, \dots, k^{(L-1)}$ and $v^{(1)}, \dots, v^{(L-1)}$ respectively.
+When new token x(L) is generated, its corresponding k(L) and v(L) vectors are appended to the cache, +allowing all previous k(