Diffstat (limited to 'ep_run/analogET_extracted.txt')
| -rw-r--r-- | ep_run/analogET_extracted.txt | 1861 |
1 files changed, 1861 insertions, 0 deletions
diff --git a/ep_run/analogET_extracted.txt b/ep_run/analogET_extracted.txt
new file mode 100644
index 0000000..b139640
--- /dev/null
+++ b/ep_run/analogET_extracted.txt
@@ -0,0 +1,1861 @@
+Dense Associative Memories with Analog Circuits
+
+Marc Gong Bacvanski^1, Xincheng You^2, John Hopfield^3, and Dmitry Krotov^4
+^1 MIT
+^2 Independent Researcher
+^3 Princeton University
+^4 IBM Research
+
+December 16 2025
+arXiv:2512.15002v1 [cs.NE] 17 Dec 2025
+
+Abstract: The increasing computational demands of modern AI systems have exposed fundamental
+limitations of digital hardware, driving interest in alternative paradigms for efficient large-scale
+inference. Dense Associative Memory (DenseAM) is a family of models that offers a flexible framework
+for representing many contemporary neural architectures, such as transformers and diffusion models,
+by casting them as dynamical systems evolving on an energy landscape. In this work, we propose a
+general method for building analog accelerators for DenseAMs and implementing them using electronic
+RC circuits, crossbar arrays, and amplifiers. We find that our analog DenseAM hardware performs
+inference in constant time independent of model size. This result highlights an asymptotic advantage
+of analog DenseAMs over digital numerical solvers that scale at least linearly with the model size.
+We consider three settings of progressively increasing complexity: XOR, the Hamming (7,4) code, and
+a simple language model defined on binary variables. We propose analog implementations of these
+three models and analyze the scaling of inference time, energy consumption, and hardware. Finally,
+we estimate lower bounds on the achievable time constants imposed by amplifier specifications,
+suggesting that even conservative existing analog technology can enable inference times on the
+order of tens to hundreds of nanoseconds.
+By harnessing the intrinsic parallelism and continuous-time operation of analog circuits, our
+DenseAM-based accelerator design offers a new avenue for fast and scalable AI hardware.
+
+1 Introduction
+
+The unprecedented growth of artificial intelligence (AI) has driven demand for increasingly large
+and powerful models. At present, the field of generative AI is primarily driven by two settings:
+autoregressive transformers [1] and diffusion models [2]. While these settings have demonstrated
+remarkable capabilities, they do so at a substantial computational cost. Their current
+implementations utilize digital computation, which faces fundamental challenges in energy
+efficiency, scalability, and latency, especially as model sizes and deployment demands continue to
+grow [3, 4, 5]. These limitations have prompted interest in alternative computational paradigms that
+can efficiently handle the demands of modern AI workloads [6].
+
+Dense Associative Memories (DenseAMs) [7, 8], a promising class of AI models which generalize
+Hopfield networks [9], offer a new angle for tackling these problems. Unlike conventional
+feed-forward models, DenseAM inference can be defined through the temporal evolution of a state
+vector that is governed by a system of differential equations [10]. The state vector can be thought
+of as a particle exploring the surface of a high-dimensional energy landscape, which is the Lyapunov
+function of these dynamical equations. DenseAMs have been demonstrated to be flexible and expressive
+computational frameworks, capable of representing many primitives of modern AI architectures, such
+as the attention mechanism [11], transformers [12], and diffusion models [13, 14, 15]. Furthermore,
+DenseAMs are error-correcting systems [16], a property ensuring that small perturbations of the
+desired temporal evolution of the state vector are corrected away by the dynamics of the network
+itself, rather than accumulated in time.
+Finally, DenseAMs are asymptotically stable: the computation happens during a finite transient
+period of the temporal evolution, which is followed by a steady state of neural activities. This
+asymptotic stabilization of dynamical trajectories removes the requirement to read out the "answer"
+to the computation problem at a precise moment of time, making DenseAMs robust to several classes of
+hardware imperfections. The confluence of the above properties makes DenseAMs appealing networks for
+analog hardware implementations that, on the one hand, are grounded in the physics of stable
+error-correcting dynamical systems and, on the other hand, are capable of representing computation
+in state-of-the-art AI networks.
+
+Code available at https://github.com/mbacvanski/AnalogET.
+
+In 1989, Hopfield argued that analog neural hardware can exceed the efficiency of digital
+implementations when the device physics directly instantiate the computational dynamics of the model
+itself [17]. Here, we revisit this idea with DenseAM models: we propose an analog circuit-based
+hardware accelerator design whose dynamics directly realize those of the DenseAM. We find that
+analog DenseAM hardware enables constant-time inference independent of model size, which is in stark
+contrast to GPU solvers and digital implementations. This intrinsic property makes DenseAM a natural
+fit for analog AI accelerators, and it highlights our circuit architecture as a viable hardware path
+to realize them. Using component specifications already demonstrated in fabricated devices, analog
+DenseAM hardware may achieve inference times on the order of tens to hundreds of nanoseconds,
+several orders of magnitude faster than digital systems.
+
+By leveraging the natural dynamics of analog systems, this work establishes a new design for fast
+and scalable AI accelerators.
+The framework of DenseAMs and their efficient analog hardware implementations suggest a pathway for
+fundamentally redesigning the hardware-software interface for AI, enabling a new paradigm for fast,
+energy-efficient, and scalable computation.
+
+2 Dense Associative Memory basics
+
+The DenseAM framework [10, 18] provides a model that has straightforward neuronal dynamics, yet is
+surprisingly expressive in its ability to represent AI models including transformer attention,
+diffusion models, and associative memories. In its simplest form it is defined by two sets of
+neurons (typically called visible and hidden neurons) and a system of coupled non-linear
+differential equations governing their behavior (see Figure 1). The visible neurons are
+characterized by their internal states $v_i$ and their outputs $g_i$, with index
+$i = 1, \dots, N_v$, while the hidden neurons have internal states $h_\mu$ and outputs $f_\mu$, with
+index $\mu = 1, \dots, N_h$. From the AI perspective, one can think about the internal state of a
+neuron as its pre-activation, and the output as its post-activation, which is obtained by applying
+an activation function to the pre-activation. From the biological perspective, one can think about
+the internal state of the neuron as a membrane voltage potential, and the output of that neuron as
+an axonal output, or a firing rate of that neuron. This framework admits both neuron-wise activation
+functions ($g_i = g(v_i)$, where $g(\cdot)$ is some continuous function, e.g., a ReLU), and
+collective activation functions such as softmax or layer normalization, which depend on the states
+of multiple neurons.
+
+The network parameters are stored in the synaptic weights $\xi \in \mathbb{R}^{N_h \times N_v}$,
+whose matrix elements, denoted by $\xi_{\mu i}$, can be either hand-engineered or learned. The time
+decay constants for the two groups of neurons are $\tau_v$ and $\tau_h$.
+With these conventions, the temporal evolution of the two groups of neurons can be expressed as
+\[
+\begin{cases}
+\tau_v \dfrac{dv_i}{dt} = \sum_{\mu=1}^{N_h} \xi_{\mu i} f_\mu + a_i - v_i \\
+\tau_h \dfrac{dh_\mu}{dt} = \sum_{i=1}^{N_v} \xi_{\mu i} g_i + b_\mu - h_\mu
+\end{cases}
+\tag{1}
+\]
+This forms a bipartite graph of neuronal connections, where the state of the hidden neurons is
+updated by the state of the visible neurons, and vice versa. Importantly, the same matrix $\xi$
+appears in both equations, once as $\xi$ and again as $\xi^\top$. Although this is sometimes
+described as using "symmetric" weights, $\xi$ is not assumed to be symmetric in the linear-algebraic
+sense; it is simply the same matrix used in both directions. Finally, $a_i$ and $b_\mu$ denote
+biases, which are additional weights of the system and whose values may be hard-coded or learned
+depending on the application.
+
+Figure 1: Top left: Bipartite neural network formulation, where hidden neurons $h_\mu$ and visible
+neurons $v_i$ are connected via symmetric synaptic weights $\xi$. Top right: Circuit realization of
+the symmetric weight matrix via a resistive crossbar array. Each crosspoint encodes a weight
+$\xi_{\mu i}$ by its resistance $R_{\mu i} = 1/\xi_{\mu i}$. Lower right: Circuit schematic of a
+single hidden neuron. It drives its row of the crossbar array with a voltage according to its
+activation $f_\mu$, and its internal dynamics are driven by the incoming current flowing into it
+from the crossbar array. Lower left: Softmax activation function built from bipolar junction
+transistors (some components not shown).
+
+The most important aspect of this model is the existence of a global energy function (Lyapunov
+function) that describes neuronal dynamics. To demonstrate this, it is most convenient to use the
+Lagrangian formalism [10, 18, 16]. Each set of neurons is defined through a Lagrangian function of
+their internal states. The activation functions are defined as partial derivatives of that
+Lagrangian with respect to internal states. The total energy is the sum of energies of each set of
+neurons, plus the interaction energy. The energy of each set of neurons is a Legendre transformation
+of the corresponding Lagrangian (plus the term proportional to the bias). Thus, the global energy of
+Equation 1 is given by
+\[
+E = \underbrace{\sum_{i=1}^{N_v} g_i (v_i - a_i) - L_v}_{\text{energy of visible neurons}}
+  + \underbrace{\sum_{\mu=1}^{N_h} f_\mu (h_\mu - b_\mu) - L_h}_{\text{energy of hidden neurons}}
+  - \underbrace{\sum_{\mu=1}^{N_h} \sum_{i=1}^{N_v} f_\mu \xi_{\mu i} g_i}_{\text{interaction energy}}
+\tag{2}
+\]
+where the activation functions are defined as partial derivatives of the Lagrangians
+\[
+g_i = \frac{\partial L_v}{\partial v_i}, \qquad f_\mu = \frac{\partial L_h}{\partial h_\mu}
+\]
+For convex Lagrangians this global energy decreases with time on the dynamical trajectories of
+Equation 1. If, additionally, the activation functions (and corresponding Lagrangians) are chosen in
+such a way that this energy is bounded from below, the dynamical trajectories are guaranteed to
+arrive at a stable fixed point of activations. The dynamical equations typically have many
+asymptotic fixed points, which correspond to local minima of the energy function in Equation 2. Both
+properties above (convexity of Lagrangians and lower-bounded energy) are satisfied for all settings
+studied in this paper. By picking different nonlinear activation functions (or corresponding
+Lagrangians), this system yields a variety of models that can describe associative memory, softmax
+attention, and other commonly used settings in AI [10, 11, 18, 19, 20].
+
+A particularly relevant example for modern sequence modeling is the Energy Transformer (ET) [12],
+which reformulates the transformer's inference pass as a gradient flow on an energy function defined
+over the set of tokens. The ET block contains two contributions to the energy function: attention
+energy and the Hopfield network. The energy attention module routes the information between the
+tokens, while the Hopfield module aligns the tokens with the manifold of token embeddings.
+In our implementation, the context tokens act as a set of dynamically instantiated memories that
+interact with the predicted token through a DenseAM-like energy. In Section 6 we exploit this
+connection to construct an Analog Energy Transformer (Analog ET) whose continuous-time dynamics are
+implemented directly in hardware using our DenseAM circuit primitives.
+
+3 Related work
+
+Early analog implementations of associative memories focused on the classical Hopfield network.
+Foundational designs, such as continuous-time analog circuits [21, 22] and later demonstrations
+using amorphous-silicon resistors [23], memristive devices [24, 25], and phase-change memories [26],
+targeted the quadratic Hopfield energy function. These works emphasize device engineering and
+memory-cell design rather than system-level dynamics, and inherit the limited storage capacity and
+representational power of traditional Hopfield networks. That line of research is largely concerned
+with how to fabricate programmable resistance elements themselves; our work assumes programmable
+conductances as a given primitive and focuses on the continuous-time dynamics that operate on top of
+them. Our work also differs from these works by addressing DenseAMs with higher-order energy
+functions and continuous-valued states.
+
+Another direction is the use of cavity-QED systems for associative memory. Marsh et al. [27] analyze
+a confocal cavity implementation of a quadratic Hopfield network and show that the cavity dynamics
+induce a descent-like relaxation rule on spin states. Their model remains restricted to quadratic
+energies and binary spins, and operates in a cryogenic, cavity-QED setting. Our work instead targets
+higher-order DenseAMs with continuous states, and emphasizes scalable, room-temperature analog
+microelectronics with explicit hardware-aware dynamical analysis.
+
+More recent physical implementations move beyond purely quadratic energies. Musa et al. [28] propose
+a free-space optical realization of the higher-order DenseAM energy. Their system constructs a
+static physical representation of the energy landscape, but inference relies on an external digital
+controller that performs iterative spin-flip updates. Thus, the hardware computes energies, while
+the optimization dynamics remain digital. In contrast, our analog microelectronic architecture
+embeds the gradient flow itself into hardware: inference is performed by a single continuous-time
+evolution rather than by discrete digital updates.
+
+4 DenseAM circuit design
+
+Here, we introduce a novel architecture for a class of analog electronic hardware accelerators that
+models Equation 1's system of nonlinear differential equations using time evolution. Our DenseAM
+design shown in Figure 1 consists of two sets of neurons that interact through a resistive crossbar
+array. The resistive crossbar array turns voltage differences between neurons into currents flowing
+between the neurons according to synaptic weights, and each neuron's internal circuitry converts
+those currents into dynamics that reproduce Equation 1.
+
+Resistive weights as a crossbar array. The crossbar array construction is a canonical design of
+matrix-vector multiplication using analog electronics [17, 29], and is a natural fit for the weight
+matrix $\xi$ in our model. Traditionally, each crosspoint between a row and column line is connected
+by a resistor (often memristors, RRAM, or other analog memories that produce resistances), a vector
+of input voltages is applied at row lines, and the column lines are held at ground, typically via a
+transimpedance amplifier. By Ohm's law, each resistive crosspoint produces a current that multiplies
+the row's input voltage by the inverse of the resistance. Because currents add along each column
+line, the total current output at a column is the inner product between the vector of input voltages
+and the column's conductance vector.
+Thus, the array as a whole implements a parallel analog matrix multiplication of the form
+$I_{\mathrm{out}} = G V_{\mathrm{in}}$, where $G$ is the matrix of conductances (inverse of
+resistances).
+
+Unlike a traditional crossbar array whose rows are driven at a fixed voltage and whose columns are
+held at ground, our DenseAM circuit design uses each weight bidirectionally, exactly representing
+the bidirectional connections between visible and hidden neurons. As a result, the current flowing
+into each neuron corresponds to the weighted sum of the differences between visible and hidden
+neuron activations. For example, for hidden neuron $\mu$, this current is
+$\sum_i \xi_{\mu i} (g_i - f_\mu)$. This construction enables weight symmetry to be enforced by
+hardware sharing: both forward and reverse weights are realized by the same resistive elements.
+Importantly, as long as weights are represented as conductances, they must be non-negative.
+
+Figure 2: Solving XOR with a DenseAM. Visible neuron $g_3 = v_3$ serves as the output, while the two
+input neurons (unlabeled, thin lines) are clamped at 1 and 0 for True and False. Output $v_3$ is
+initialized at 0.5 and converges to a positive prediction of 1. The hidden neuron $f_3$ for the
+truth-table row (1, 0, 1) becomes highly activated, while the others (fine lines) are suppressed by
+softmax. Energy (2), or equivalently (5), decreases monotonically along the inference trajectory.
+[Plot panels: visible neurons, hidden neurons, and energy versus time (s).]
+
+Figure 3: XOR energy landscape of neuron $v_3$ under different settings of visible input neurons
+$v_1$ and $v_2$. Minima in the energy function correspond to stationary points of the dynamics.
+Gradient flow dynamics bring $v_3$ to these attractor points, resulting in correct XOR outputs.
+[Plot panels: energy versus $v_3$ for inputs (0, 0), (0, 1), (1, 0), and (1, 1).]
+
+Design of a single neuron. Each neuron in the circuit computes its dynamics by integrating the
+currents it receives from the crossbar array, which represent weighted differences between its own
+activation and those of connected neurons. Considering a hidden neuron (the design for visible
+neurons is analogous), the neuron's internal voltage $h_\mu$ is stored on capacitor $C_1$ and
+evolves in continuous time, while the neuron's activation $f_\mu$ is obtained by passing $h_\mu$
+through a nonlinear function (e.g. ReLU or softmax).
+
+The current flowing into hidden neuron $\mu$ is produced by its interaction with all visible neurons
+via the synaptic weights $\xi_{\mu i}$ for $i = 1, \dots, N_v$. Specifically, it is a weighted sum
+of the differences between neuron activations: $\sum_i \xi_{\mu i} (g_i - f_\mu)$. Inside each
+neuron, a "self" path scales $f_\mu$ to produce the voltage $s_\mu = f_\mu \sum_i \xi_{\mu i}$. This
+term is added to the value of the incoming current so that the $-f_\mu \sum_i \xi_{\mu i}$ term is
+cancelled inside each neuron. As a result, the hidden state, represented as the voltage across
+capacitor $C_1$, integrates only the desired weighted input plus any external stimulus $b_\mu$. Its
+dynamics reduce to the canonical DenseAM form with a time constant of $R_2 C_1$:
+\[
+R_2 C_1 \frac{dh_\mu}{dt} = \sum_{i=1}^{N_v} \xi_{\mu i} g_i + b_\mu - h_\mu
+\tag{3}
+\]
+Elementwise (or vectorized) nonlinearities then produce activations $g_i = g(v_i)$ and
+$f_\mu = f(h_\mu)$ (e.g., ReLU, softmax) across the visible and hidden neurons. See Appendix A for
+the full circuit derivation.
+
+5 Analog DenseAM Examples
+
+We begin by studying two examples of the proposed design: the XOR task, and the (7,4)
+error-correcting Hamming code.
+
+5.1 XOR
+
+The XOR problem is a canonical test for nonlinear representation and inference, as it cannot be
+solved by any linear model. We show a minimal DenseAM model for the XOR task, illustrating how
+energy-based dynamics can solve this simple task with a continuous-time analog system.
+The network consists of $N_v = 3$ visible neurons and $N_h = 4$ hidden neurons. At $t = 0$ visible
+neurons $v_1$ and $v_2$ are initialized at their input values corresponding to the input bits. The
+last visible neuron $v_3$ is initialized at $v_3 = 0.5$. The hidden neurons are initialized at zero.
+The two input visible neurons remain clamped during the dynamics, while the third output visible
+neuron and the hidden neurons evolve in time according to (1). Each row of the memory matrix $\xi$
+corresponds to a row of the XOR truth table. The visible neurons use an identity activation function
+where $g_i = v_i$, and the hidden neurons use a softmax activation. The biases are set as
+\[
+a_i = 0, \qquad b_\mu = -\frac{1}{2} \sum_{i=1}^{N_v} \xi_{\mu i}^2
+\]
+Figure 2 shows the temporal evolution of visible and hidden neuron activations, as well as the total
+energy, during inference on the XOR input (1, 0). The output visible neuron's activation $g_3$
+gradually converges to the correct prediction of 1, while the hidden neuron associated with that
+memory, $f_3$, becomes strongly activated and the remaining hidden neurons are suppressed by the
+softmax nonlinearity. The system's energy decreases monotonically throughout the trajectory and
+stabilizes once the network reaches its fixed-point prediction. Figure 3 depicts the system's energy
+landscape as a function of output neuron $v_3$ for different clamped input configurations
+$(v_1, v_2)$. In each case, the energy exhibits a clear convex minimum at the correct XOR output,
+demonstrating that gradient flow along the energy surface drives $v_3$ reliably toward the correct
+prediction. As shown in Appendix C, we validate our circuit design and dynamics using SPICE
+simulation.
+
+To analyze this DenseAM, it is instructive to consider the limit $\tau_h \to 0$. Since the second
+equation in (1) is linear in hidden units $h_\mu$, they can be integrated out. With
+$\sum_{\mu=1}^{N_h} f_\mu = 1$, the resulting dynamics of the visible neurons can be written as
+\[
+\tau_v \frac{dv_i}{dt} = \sum_{\mu=1}^{N_h} (\xi_{\mu i} - v_i) f_\mu,
+\qquad \text{where} \quad
+f_\mu = \operatorname{softmax}_\mu\!\left[ -\frac{\beta}{2} \sum_{i=1}^{N_v} (\xi_{\mu i} - v_i)^2 \right]
+\tag{4}
+\]
+The effective energy on the visible neurons can be written as
+\[
+E^{\mathrm{eff}}(v) = -\frac{1}{\beta} \log \sum_{\mu=1}^{N_h}
+\exp\!\left[ -\frac{\beta}{2} \sum_{i=1}^{N_v} (\xi_{\mu i} - v_i)^2 \right]
+\tag{5}
+\]
+Intuitively, each hidden neuron computes a squared Euclidean distance between the visible state and
+its stored pattern $\xi_\mu$. The softmax nonlinearity assigns higher weight to the pattern closest
+to the current state of the visible neurons. The resulting visible neuron dynamics are gradient flow
+for this effective energy. It is important to note that memories in this implementation are
+represented by conductances of the crossbar array, which cannot be negative. For this reason, matrix
+elements of memories $\xi_{\mu i}$ must be non-negative, necessitating the use of the bias terms,
+which are just voltage sources that can be arbitrarily signed.
+
+While a time constant of $\tau_h = 0$ is impossible to physically construct due to finite
+conductances and nonzero capacitances, setting $\tau_h \ll \tau_v$ realizes the same adiabatic limit
+in practice. When hidden neurons evolve much faster than visible ones, they reach their steady state
+almost instantaneously for each configuration of visible neurons. The result is an adiabatic
+elimination of hidden dynamics, yielding the effective visible-only dynamics above. In practice, for
+the XOR task, even a relatively modest ratio of $\tau_h = \tau_v / 10$ yields perfect performance.
+
+5.2 Hamming (7,4) code
+
+The Hamming (7,4) code is an error-correcting code that encodes 4 data bits into a 7-bit codeword by
+adding 3 parity bits. The resulting 7-bit strings are special: only certain patterns are valid
+codewords, and they are spaced apart so that if a single bit is flipped, the error can be detected
+and corrected [30]. Table 1 lists the 16 codewords, one for each setting of the four data bits.
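The structure of the codeword list can be checked with a short script. The sketch below is an illustration only, not part of the paper's code: it assumes the code is linear, takes the four codewords that Table 1 assigns to the unit data words 1000, 0100, 0010, and 0001 as a basis, and XORs them together to rebuild all 16 entries. The names `BASIS`, `encode`, and `hamming_distance` are hypothetical helpers introduced here.

```python
from itertools import product

# Codewords that Table 1 assigns to the unit data words 1000, 0100, 0010, 0001.
BASIS = ("1000011", "0100101", "0010110", "0001111")

def encode(data_bits):
    """XOR together the basis codewords selected by the four data bits."""
    word = [0] * 7
    for bit, basis in zip(data_bits, BASIS):
        if bit:
            word = [w ^ int(c) for w, c in zip(word, basis)]
    return "".join(map(str, word))

def hamming_distance(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

# product() enumerates the data words in the order 0000, 0001, ..., 1111.
codewords = [encode(d) for d in product((0, 1), repeat=4)]

# Smallest pairwise distance between distinct codewords.
min_distance = min(
    hamming_distance(a, b)
    for i, a in enumerate(codewords)
    for b in codewords[i + 1:]
)

print(codewords)
print("minimum distance:", min_distance)
```

A minimum distance of 3 is what allows any single flipped bit to be attributed to a unique nearest codeword, which is the property the DenseAM energy minima rely on.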
+Figure 4: Correcting a bit error in a Hamming (7,4) code. Visible neuron $g_5$ flips, indicating the
+bit flip error happened on the 5th codeword bit. $f_7$ is the hidden neuron corresponding to the
+memory of the correct codeword. Thin lines correspond to the other neuron activations.
+[Plot panels: visible neurons, hidden neurons, and energy versus time (s).]
+
+Table 1: Valid codewords of the Hamming (7,4) code, ordered by their 4-bit data payload.
+
+  Data bits (d1 d2 d3 d4)   Codeword (c1 c2 c3 c4 c5 c6 c7)
+  0000                      0000000
+  0001                      0001111
+  0010                      0010110
+  0011                      0011001
+  0100                      0100101
+  0101                      0101010
+  0110                      0110011
+  0111                      0111100
+  1000                      1000011
+  1001                      1001100
+  1010                      1010101
+  1011                      1011010
+  1100                      1100110
+  1101                      1101001
+  1110                      1110000
+  1111                      1111111
+
+Unlike the XOR case where the only evolving neuron is the readout bit, the Hamming (7,4) code may
+require flipping the value of any one of the visible neurons. During inference, the visible neurons
+are initialized to the corrupted 7-bit input word. All neurons are left free to evolve, and the
+dynamics relax the state toward the nearest stored codeword. Energy minima are located at the valid
+codewords, so the network converges to the correct code provided the error is within the Hamming
+radius of 1. Thus, the DenseAM replicates the standard decoding property of the Hamming (7,4) code:
+any single-bit flip is corrected automatically. Figure 4 illustrates a case where a flipped bit
+$g_5$ is restored during convergence.
+
+The Hamming (7,4) model's 7 visible neurons, each corresponding to a codeword bit, are connected to
+16 hidden neurons, each representing one valid codeword. The weight matrix
+$\xi \in \{0, 1\}^{16 \times 7}$ is formed by stacking the 16 codewords as its rows. Visible neurons
+have the identity activation, hidden neurons use a softmax activation, and biases are chosen as in
+the XOR case to give the same integrated-out visible dynamics as (4).
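To make the decoding behaviour concrete, the following sketch integrates the visible-only dynamics (4) numerically for the Hamming (7,4) memories. It is a rough software stand-in, not the paper's circuit or its SPICE and Diffrax simulations: forward Euler replaces the continuous-time evolution, and the values of `beta`, `dt`, and `steps`, as well as the helper names `memory_matrix`, `relax`, and `decode`, are assumptions chosen for illustration.

```python
import math

# Codewords for the unit data words 1000, 0100, 0010, 0001 (from Table 1).
BASIS = ("1000011", "0100101", "0010110", "0001111")

def memory_matrix():
    """Stack the 16 codewords as rows of xi, with entries 0.0 or 1.0."""
    rows = []
    for data in range(16):
        row = [0] * 7
        for k, basis in enumerate(BASIS):
            if (data >> (3 - k)) & 1:
                row = [r ^ int(c) for r, c in zip(row, basis)]
        rows.append([float(r) for r in row])
    return rows

def relax(xi, v0, beta=4.0, tau_v=1.0, dt=0.1, steps=100):
    """Forward-Euler integration of Eq. (4) in the adiabatic limit tau_h -> 0."""
    v = list(v0)
    for _ in range(steps):
        # Hidden activations: softmax over negative scaled squared distances.
        scores = [-0.5 * beta * sum((x - vi) ** 2 for x, vi in zip(row, v)) for row in xi]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        f = [w / total for w in weights]
        # Because the f sum to 1, the right-hand side of Eq. (4) is (sum_mu xi_mu_i f_mu) - v_i.
        pull = [sum(fm * row[i] for fm, row in zip(f, xi)) for i in range(len(v))]
        v = [vi + (dt / tau_v) * (p - vi) for vi, p in zip(v, pull)]
    return v

def decode(word, **kwargs):
    """Relax a 7-bit string and threshold the final visible state at 0.5."""
    v = relax(memory_matrix(), [float(c) for c in word], **kwargs)
    return "".join("1" if x > 0.5 else "0" for x in v)

received = "0110111"      # illustrative input: codeword 0110011 with its 5th bit flipped
print(decode(received))   # 0110011
```

With these assumed settings the state settles within a few multiples of `tau_v`; the threshold at 0.5 is only a readout convention for turning the analog state back into bits.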
+6 Analog Energy Transformer (Analog ET) via DenseAM
+
+Our DenseAM circuit construction can be used to build more complex energy-based models, such as the
+transformer-like architecture proposed in the Energy Transformer paper [12]. For causal next-token
+prediction with a single attention head, the Energy Transformer's energy function can be written as
+follows (see Appendix J for the full derivation):
+\[
+E = \tfrac{1}{2} \lVert v - a \rVert^2
+  - v^\top \left( \xi^{\mathrm{attn}\,\top} f^{\mathrm{attn}} + \xi^{\mathrm{hopf}\,\top} f^{\mathrm{hopf}} \right)
+  + f^{\mathrm{attn}\,\top} \left( h^{\mathrm{attn}} - b \right)
+  + f^{\mathrm{hopf}\,\top} \left( h^{\mathrm{hopf}} - c \right)
+  - L^{\mathrm{attn}}\!\left( h^{\mathrm{attn}} \right)
+  - L^{\mathrm{hopf}}\!\left( h^{\mathrm{hopf}} \right)
+\tag{6}
+\]
+with the activation functions and their Lagrangians defined as
+\[
+f_A^{\mathrm{attn}} = \operatorname{softmax}(\beta h^{\mathrm{attn}})_A, \qquad
+L^{\mathrm{attn}}(h) = \frac{1}{\beta} \log \sum_{A=1}^{L} e^{\beta h_A}
+\tag{7}
+\]
+\[
+f_\mu^{\mathrm{hopf}} = \operatorname{ReLU}(h_\mu^{\mathrm{hopf}}), \qquad
+L^{\mathrm{hopf}}(h) = \frac{1}{2} \sum_{\mu=1}^{M} \left[ \operatorname{ReLU}(h_\mu) \right]^2
+\tag{8}
+\]
+where $a$, $b$, and $c$ correspond to the biases of the visible neurons, attention hidden neurons,
+and Hopfield network hidden neurons, respectively. The $L$ context tokens are indexed by $A$, and
+the $M$ hidden neurons of the Hopfield network are indexed by $\mu$. Because the visible units use
+an identity activation function, $g_i = v_i$ in the language of Equation 1, the gradient flow of the
+energy yields the dynamics:
+\[
+\tau_v \dot{v} = -\frac{\partial E}{\partial v}
+  = \xi^{\mathrm{attn}\,\top} f^{\mathrm{attn}} + \xi^{\mathrm{hopf}\,\top} f^{\mathrm{hopf}} + a - v
+\tag{9}
+\]
+\[
+\tau_h \dot{h}^{\mathrm{attn}} = -\frac{\partial E}{\partial f^{\mathrm{attn}}}
+  = \xi^{\mathrm{attn}} v + b - h^{\mathrm{attn}}
+\tag{10}
+\]
+\[
+\tau_h \dot{h}^{\mathrm{hopf}} = -\frac{\partial E}{\partial f^{\mathrm{hopf}}}
+  = \xi^{\mathrm{hopf}} v + c - h^{\mathrm{hopf}}
+\tag{11}
+\]
+
+Figure 5: Analog ET circuit demonstrating the autoregressive inference procedure. A newly inferred
+token is decoded, sampled, and re-embedded to obtain the weight vector $\xi^{\mathrm{attn}}_{L+1}$,
+which is set as the weight vector for a new hidden neuron $h^{\mathrm{attn}}_{L+1}$ in the energy
+attention block (light gray on right). For this layout we have flipped the crossbar array, so that
+indices $A$ and $\mu$ run horizontally and index $i$ runs vertically.
+
+In this formulation, $v$ represents the embedding of the output (next) token, and its evolution is
+driven by two terms: one term from the energy attention with weights $\xi^{\mathrm{attn}}$ and
+hidden neuron activations $f^{\mathrm{attn}}$, and one term from the Hopfield network with weights
+$\xi^{\mathrm{hopf}}$ and hidden neuron activations $f^{\mathrm{hopf}}$. The weights of the energy
+attention DenseAM are dependent on the context: for a token dimension $D$, context length $L$, and
+the task of predicting the token at index $L + 1$, the weights
+$\xi^{\mathrm{attn}} \in \mathbb{R}^{L \times D}$ are generated by embedding each token of the
+context via a learned embedding matrix. In contrast, the Hopfield network weights
+$\xi^{\mathrm{hopf}}$ are learned during training and fixed at inference. The number of memories in
+the Hopfield network is a hyperparameter $M$, such that
+$\xi^{\mathrm{hopf}} \in \mathbb{R}^{M \times D}$.
+
+This system suggests a hardware implementation where $v$ interacts with two independent DenseAMs,
+one for the energy attention and one for the Hopfield term, which can share the same physical
+crossbar structure. Figure 5 shows that the circuit structure remains a crossbar array (like
+Figure 1), but with two distinct classes of hidden neurons. Because of the summation of currents
+along each row of the crossbar array, the incoming current to visible neuron $v_i$ is the sum of
+contributions from the energy attention block and from the Hopfield network block. The energy
+attention hidden neurons $h^{\mathrm{attn}}$ use a softmax activation function, while the Hopfield
+network hidden neurons $h^{\mathrm{hopf}}$ use a ReLU activation.
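As a rough numerical illustration of Equations (6) and (9) to (11), the sketch below integrates the coupled dynamics with forward Euler for a tiny system and records the energy along the way. Everything concrete in it is an assumption made for illustration: the weight matrices are small hypothetical values rather than trained embeddings, all biases are set to zero, the hidden states start at zero, and Euler stepping stands in for the analog circuit and for the Diffrax solver used in the paper.

```python
import math

def softmax(h, beta):
    top = max(h)
    w = [math.exp(beta * (x - top)) for x in h]
    total = sum(w)
    return [x / total for x in w]

def apply(W, x):
    """Matrix-vector product W x for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def apply_T(W, f):
    """Transposed product W^T f."""
    return [sum(row[c] * fr for row, fr in zip(W, f)) for c in range(len(W[0]))]

def energy(v, h_a, h_p, f_a, f_p, drive, beta):
    """Eq. (6) with all biases set to zero (a = b = c = 0)."""
    dot = lambda p, q: sum(x * y for x, y in zip(p, q))
    top = max(h_a)
    lagr_attn = top + math.log(sum(math.exp(beta * (x - top)) for x in h_a)) / beta  # Eq. (7)
    lagr_hopf = 0.5 * sum(x * x for x in f_p)                                         # Eq. (8)
    return 0.5 * dot(v, v) - dot(v, drive) + dot(f_a, h_a) + dot(f_p, h_p) - lagr_attn - lagr_hopf

def simulate(xi_attn, xi_hopf, v0, beta=1.0, tau_v=1.0, tau_h=0.1, dt=0.01, steps=8000):
    v, h_a, h_p = list(v0), [0.0] * len(xi_attn), [0.0] * len(xi_hopf)
    energies = []
    for _ in range(steps):
        f_a = softmax(h_a, beta)            # softmax attention activations
        f_p = [max(0.0, x) for x in h_p]    # ReLU Hopfield activations
        drive = [x + y for x, y in zip(apply_T(xi_attn, f_a), apply_T(xi_hopf, f_p))]
        energies.append(energy(v, h_a, h_p, f_a, f_p, drive, beta))
        dv = [(d - x) / tau_v for d, x in zip(drive, v)]                   # Eq. (9)
        dh_a = [(t - x) / tau_h for t, x in zip(apply(xi_attn, v), h_a)]   # Eq. (10)
        dh_p = [(t - x) / tau_h for t, x in zip(apply(xi_hopf, v), h_p)]   # Eq. (11)
        v = [x + dt * d for x, d in zip(v, dv)]
        h_a = [x + dt * d for x, d in zip(h_a, dh_a)]
        h_p = [x + dt * d for x, d in zip(h_p, dh_p)]
    residual = max(abs(d) for d in dv + dh_a + dh_p)  # largest derivative at the last step
    return v, energies, residual

# Hypothetical weights: three context embeddings (D = 2) and two small Hopfield memories.
xi_attn = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
xi_hopf = [[0.5, 0.0], [0.0, 0.5]]
v_final, energies, residual = simulate(xi_attn, xi_hopf, v0=[0.0, 0.0])
print(v_final, energies[0], energies[-1], residual)
```

The Hopfield weights are kept small on purpose: with ReLU hidden units the energy is only bounded from below when the Hopfield term cannot outgrow the quadratic term in $v$, which is the lower-boundedness condition stated in Section 2.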
+6.1 Analog Energy Transformer on the parity task
+
+We build and evaluate the Analog ET on the $L$-bit parity task, which can be thought of as an
+elementary "language model": given bits $\mathrm{bit}_1, \dots, \mathrm{bit}_L$, predict
+$\mathrm{bit}_{L+1} = \sum_{A=1}^{L} \mathrm{bit}_A \bmod 2$. Parity is instructive because it
+requires a representation of a global, order-$L$ interaction, precluding linear and shallow models
+from representing it efficiently. A successful model must be able to form high-order interactions in
+order to generalize. We formulate parity as a next-token prediction problem: given an $L$-bit string
+as context, predict its parity in the next token.
+
+We train the Analog ET model digitally using backpropagation through time [31] implemented with
+JAX's automatic differentiation. The resulting weights can be deployed onto the analog hardware; in
+our experiments we simulate the dynamics of hardware with the Diffrax [32] ODE solver library. On
+the 8-bit parity task, our model achieves 100% accuracy on the hold-out validation set of 52 bit
+strings, demonstrating clear generalization capabilities. See Appendix H.1 for more details on
+training and model design.
+
+Figure 6: Inference of parity Analog ET on two example 8-bit strings. Top row plots the visible
+neurons $v_i$ over time, middle row plots the decoded token prediction, bottom row plots the energy
+that monotonically decreases during inference. After a transient period of computation, the network
+arrives at a steady state, making the result of the computation robust against the precise timing of
+the readout.
+[Plot panels titled "11001010 0" and "01000110 1"; rows show visible neurons, prediction, and energy
+versus time t.]
+
+Figure 6 shows the dynamics of the visible neurons and energy during two example inference runs of
+the Analog ET.
+Notably, the visible neuron values are constant by the end of the inference period, so the result is
+robust to timing mismatch and delay during readout. A single sample-and-hold and switching circuit
+would enable a single analog-to-digital converter (ADC) to read out all the visible neurons at
+convergence, significantly reducing mismatch and drastically saving device area, complexity, and
+energy. The intrinsic stability of attractor points arises uniquely from the continuous-time
+dynamics of the DenseAM, making these models particularly well suited to analog hardware.
+
+6.2 Autoregressive inference
+
+Dashed lines in Figure 5 illustrate the autoregressive inference procedure of the Analog ET. To
+generate the $L$-th token given context tokens $x^{(1)}, \dots, x^{(L-1)}$, each token is first
+embedded and concatenated to form the attention weight matrix
+\[
+\xi^{\mathrm{attn},(L-1)} =
+\begin{bmatrix} e^{(1)} \\ e^{(2)} \\ \vdots \\ e^{(L-1)} \end{bmatrix}
+\in \mathbb{R}^{(L-1) \times D}
+\]
+These rows are loaded into the Analog ET's energy attention weight matrix $\xi^{\mathrm{attn}}$ by
+programming the corresponding crossbar resistances. During inference, the visible state $v(t)$
+evolves according to the Analog ET dynamics until convergence. A decoder readout (e.g., a linear
+layer) applied to the converged $v(t = T)$ values produces logits, from which the next token
+$x^{(L)}$ is sampled. This token is then embedded to form $e^{(L)}$, and appended to the existing
+context. The cycle repeats with the updated attention weight matrix
+\[
+\xi^{\mathrm{attn},(L)} =
+\begin{bmatrix} \xi^{\mathrm{attn},(L-1)} \\ e^{(L)} \end{bmatrix}
+\in \mathbb{R}^{L \times D}
+\]
+which now includes the new embedding $e^{(L)}$. In hardware, this corresponds to connecting an
+additional hidden neuron in the energy attention block of Figure 5, and setting its resistive
+weights with $e^{(L)}$. Because the physical order of hidden neurons does not affect the energy
+function, this new neuron can be placed in any position among the hidden neurons.
When the context length is fixed, the hidden neuron corresponding to the earliest token can simply be reprogrammed with the new vector of weights $e^{(L)}$, resulting in the hardware equivalent of a sliding-window context. In practice, an external digital controller, e.g., a field-programmable gate array (FPGA) or application-specific integrated circuit (ASIC), would orchestrate crossbar programming and token decoding, while the DenseAM dynamics perform the far more substantial workload of computing each next-token embedding.
+    This procedure is analogous to key-value (KV) caching in standard transformer inference [33]. Context tokens $x^{(1)}, \ldots, x^{(L-1)}$ produce key and value vectors $k^{(1)}, \ldots, k^{(L-1)}$ and $v^{(1)}, \ldots, v^{(L-1)}$, respectively. When a new token $x^{(L)}$ is generated, its corresponding $k^{(L)}$ and $v^{(L)}$ vectors are appended to the cache, allowing all previous $k^{(<L)}$ and $v^{(<L)}$ to be reused without recomputation. When the key and value matrices are tied so that $k^{(A)} = v^{(A)}$, the ET’s row-append operation is equivalent to the standard KV-cache update. The ET performs an autoregressive rollout that reproduces the same recurrence structure as KV-cached transformer inference, but implemented physically through the addition of new neurons and weights without touching existing hardware. For a formal derivation of the equivalence between ET attention and conventional attention with tied keys and values, see [12].
+
+
+7 Scaling properties
+Inference time and energy consumption are crucial characteristics of our system. This section investigates how these metrics depend on the network size.
+
+7.1 Inference time scaling
+We consider the model defined by (4) and (5).
In the adiabatic limit ($\tau_h \to 0$), which is satisfied by our hardware implementation, the time derivative of the energy can be written as
+\[ \frac{dE^{\mathrm{eff}}}{dt} = \sum_{i=1}^{N_v} \frac{\partial E^{\mathrm{eff}}}{\partial v_i} \frac{dv_i}{dt} = -\frac{1}{\tau_v} \sum_{i=1}^{N_v} \left( \frac{\partial E^{\mathrm{eff}}}{\partial v_i} \right)^2 \sim -\frac{N_v}{\tau_v} \tag{12} \]
+This derivative is never positive, since the dynamical system performs gradient descent on the energy landscape, and it vanishes once the network state vector $v$ converges to the steady state. Since the state vector $v_i$ is typically initialized in the vicinity of the memory vectors, which are chosen to be of order one ($\sim 1$), the right-hand side of (4) is of order one too, independent of the network size. This results in the characteristic value of the temporal derivative shown in (12).
+    At the same time, the typical value^1 of the energy (5) is
+\[ |E^{\mathrm{eff}}| \sim N_v + \frac{1}{\beta} \log(N_h) \tag{13} \]
+During the inference dynamics the network is initialized in a high-energy state, which has the characteristic energy (13), and performs energy descent to a lower value of the energy (which has a similar order of magnitude). To estimate the scaling of the time required for this energy descent, one can take the ratio of the energy drop to the rate of energy decrease (12). This gives the following estimate
+\[ T^{\mathrm{conv}} \sim \frac{|E^{\mathrm{eff}}|}{\left| \frac{dE}{dt} \right|} \sim \tau_v \left[ 1 + \frac{1}{\beta} \frac{\log(N_h)}{N_v} \right] \sim \tau_v \tag{14} \]
+The last $\sim$ sign holds since in none of the designs presented here does $N_h$ grow super-exponentially in $N_v$. In fact, in all the use cases $N_h$ is sub-exponential in $N_v$.
+    1 We estimate the absolute value of the energy, since it can be both positive and negative depending on the mutual arrangement of memories, the state vector, and the number of hidden units.
+
+    This back-of-the-envelope estimation provides the core intuition behind the scaling relationship. The inference time is constant, and independent of the size of the network.
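To get a feel for how quickly the correction term in estimate (14) becomes negligible, the ratio of (13) to the magnitude of (12) can be evaluated directly. The sketch below is only an order-of-magnitude illustration: the network sizes are hypothetical, and `convergence_time_estimate` is a name introduced here, not a function from the paper.

```python
import math

def convergence_time_estimate(n_visible, n_hidden, beta=1.0, tau_v=1.0):
    """Order-of-magnitude estimate of equation (14), in units of tau_v."""
    energy_scale = n_visible + math.log(n_hidden) / beta  # equation (13)
    descent_rate = n_visible / tau_v                      # magnitude of (12)
    return energy_scale / descent_rate

# Hypothetical (N_v, N_h) pairs of increasing size.
for n_v, n_h in [(8, 16), (64, 1024), (1024, 10**6)]:
    print(n_v, n_h, round(convergence_time_estimate(n_v, n_h), 3))
```

The estimate approaches tau_v as N_v grows whenever log(N_h) grows more slowly than N_v, which is the sub-exponential condition stated above.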
A more careful analysis (Appendix E) shows that in the high-$\beta$ regime the worst-case dependence is $O\!\left( \frac{\tau_v}{\beta} \frac{\log N_h}{N_v} \right)$, which remains bounded for all architectures we consider. Thus, for our settings the convergence time is effectively constant in $N_v$ and $N_h$. Based on amplifier gain–bandwidth, slew-rate, and output-current constraints, we estimate achievable inference times of tens to hundreds of nanoseconds using existing CMOS technology (see Appendix I.2).
+
+7.2 Scaling of energy consumption
+We now analyze how the total inference energy scales with network size. Energy dissipation arises primarily from (i) Ohmic loss in the resistive weights, (ii) charging of neuron-state capacitors, and (iii) constant per-neuron overhead from amplifiers and bias currents. We show that, under bounded voltage swings and fixed conductance budgets, total energy grows only linearly with the number of neurons.
+
+Weight dissipation. Let the neuron output voltages be proportional to activations: $u = \kappa g$ and $w = \kappa f$, where $\kappa$ is a fixed voltage swing. Such a bounded swing can always be enforced by global rescaling of $\xi$, $\beta$, and voltage units without changing the dynamics (see Appendix F). The instantaneous power dissipated by the resistive crossbar array is
+\[ P_{\mathrm{weights}}(t) = \sum_{\mu=1}^{N_h} \sum_{i=1}^{N_v} \xi_{\mu i} (u_i - w_\mu)^2 \tag{15} \]
+With $0 \le g_i \le 1$, $f$ the output of a softmax, and row/column conductance budgets $\sum_\mu \xi_{\mu i} \le C_c$, $\sum_i \xi_{\mu i} \le C_r$, the total power obeys
+\[ P_{\mathrm{weights}}(t) \le 2\kappa^2 (C_c N_v + C_r) = O(N_v) \tag{16} \]
+For a runtime of duration $T \sim T^{\mathrm{conv}}$, the energy dissipated by the weights is therefore $E_{\mathrm{weights}} = O(N_v T)$, where $T = O(1)$ in the network size, from subsection 7.1.
+
+Capacitive and overhead energy. Each neuron charges a local capacitor a finite number of times by at most $V_{\mathrm{swing}} \sim \kappa$, giving
+\[ E_{\mathrm{cap}} \le \kappa^2 \left( \sum_i C_i^{(v)} + \sum_\mu C_\mu^{(h)} \right) = O(N_v + N_h) \tag{17} \]
+Active bias and amplifier inefficiencies contribute fixed per-neuron power, yielding $E_{\mathrm{other}} = O((N_v + N_h) T)$.
+
+Total energy scaling.
With bounded voltage swing and conductance budgets,
+\[ E_{\mathrm{total}} = O(N_v + N_h) \tag{18} \]
+Hence, the total inference energy scales only linearly with system size. For the full derivation, see Appendix G.
+
+7.3 Scaling of hardware area
+The area is dominated by two components: the synaptic weights, which are implemented as a crossbar array with programmable weights, and the neurons feeding the crossbar array. The area of the crossbar array scales with the number of weights, $O(N_v N_h)$. The area of the neurons scales as $O(N_v + N_h)$.
+
+
+8 Conclusion
+In this paper, we have presented an analog accelerator architecture for Dense Associative Memories, implemented using resistive crossbar arrays and continuous-time RC neuron dynamics. Our design implements DenseAM inference as the time evolution of a physical dynamical system, rather than a sequence of discrete numerical update steps. We demonstrated this architecture with three representative settings of increasing complexity: XOR, Hamming (7,4) error decoding, and an Energy Transformer-style sequence model. These examples show that the analog DenseAM accelerator architecture covers both associative memory tasks and attention-based sequence models.
+    Our analysis shows that DenseAM accelerators enjoy favorable asymptotic scaling properties. Inference time is constant in the model size, meaning that it is governed primarily by the physical time constants of the circuit. This is in sharp contrast to digital implementations of the same dynamics, whose runtime must grow at least linearly with model size.
+    To assess hardware feasibility, we derived lower bounds on the neuronal time constants imposed by amplifier gain-bandwidth product, slew rate, and output current limits in our neuron design.
Reported +figures from representative CMOS OTAs in the literature give inference times on the order of tens-to- +hundreds of nanoseconds, even with conservative design margins. Combined with the constant scaling of +inference with model size, these estimates suggest that DenseAM accelerators can match or exceed the +latency of digital GPUs as models grow, without requiring exotic devices or beyond-CMOS technologies. + Our results highlight DenseAMs as a natural abstraction for analog AI hardware. Their error cor- +recting dynamics and asymptotic stability directly address long-standing concerns about robustness and +readout timing: small perturbations are corrected by the dynamics instead of accumulated, and the final +state is stable when readout happens over a wide temporal window. At the same time, the DenseAM +framework is expressive enough to capture modern primitives such as attention and transformer-like ar- +chitectures, as illustrated by our Analog Energy Transformer construction. These properties suggest that +DenseAM-based analog accelerators may be a promising substrate for future AI systems, and motivate +further co-design of models, dynamics, and devices. + +Acknowledgements +MGB would like to thank Faiz Muhammad for exploratory attempts at SPICE simulations. DK would +like to thank Kwabena Boahen for helpful discussions. + + +References + [1] Ashish Vaswani. “Attention is all you need”. In: arXiv preprint arXiv:1706.03762 (2017). + [2] Jascha Sohl-Dickstein et al. “Deep unsupervised learning using nonequilibrium thermodynamics”. + In: International conference on machine learning. pmlr. 2015, pp. 2256–2265. + [3] Norman P Jouppi et al. “In-datacenter performance analysis of a tensor processing unit”. In: + Proceedings of the 44th annual international symposium on computer architecture. 2017, pp. 1–12. + [4] Eric Masanet et al. “Recalibrating global data center energy-use estimates”. In: Science 367.6481 + (2020), pp. 984–986. 
+ [5] David Patterson et al. “Carbon emissions and large neural network training”. In: arXiv preprint + arXiv:2104.10350 (2021). + [6] Maxwell Aifer et al. “Solving the compute crisis with physics-based ASICs”. In: arXiv preprint + arXiv:2507.10463 (2025). + [7] Dmitry Krotov and John J Hopfield. “Dense associative memory for pattern recognition”. In: + Advances in neural information processing systems 29 (2016). + [8] Dmitry Krotov and John Hopfield. “Dense associative memory is robust to adversarial inputs”. In: + Neural computation 30.12 (2018), pp. 3151–3167. + [9] John J Hopfield. “Neural networks and physical systems with emergent collective computational + abilities.” In: Proceedings of the national academy of sciences 79.8 (1982), pp. 2554–2558. +[10] Dmitry Krotov and John J Hopfield. “Large Associative Memory Problem in Neurobiology and + Machine Learning”. In: International Conference on Learning Representations. 2021. +[11] Hubert Ramsauer et al. “Hopfield networks is all you need”. In: arXiv preprint arXiv:2008.02217 + (2020). +[12] Benjamin Hoover et al. “Energy transformer”. In: Advances in Neural Information Processing + Systems 36 (2024). + + + + 12 +[13] Benjamin Hoover et al. “Memory in plain sight: A survey of the uncanny resemblances between + diffusion models and associative memories”. In: arXiv preprint arXiv:2309.16750 (2023). +[14] Luca Ambrogioni. “In search of dispersed memories: Generative diffusion models are associative + memory networks”. In: arXiv preprint arXiv:2309.17290 (2023). +[15] Bao Pham et al. “Memorization to generalization: Emergence of diffusion models from associative + memory”. In: arXiv preprint arXiv:2505.21777 (2025). +[16] Dmitry Krotov et al. “Modern methods in associative memory”. In: arXiv preprint arXiv:2507.06211 + (2025). +[17] JJ Hopfield. “The effectiveness of analogue’neural network’hardware”. In: Network: Computation + in Neural Systems 1.1 (1990), p. 27. +[18] Dmitry Krotov. 
“Hierarchical associative memory”. In: arXiv preprint arXiv:2107.06446 (2021). +[19] Fei Tang and Michael Kopp. “A remark on a paper of krotov and hopfield [arxiv: 2008.06996]”. In: + arXiv preprint arXiv:2105.15034 (2021). +[20] Benjamin Hoover et al. “A universal abstraction for hierarchical hopfield networks”. In: The Sym- + biosis of Deep Learning and Differential Equations II. 2022. +[21] John J Hopfield. “Neurons with graded response have collective computational properties like those + of two-state neurons.” In: Proceedings of the national academy of sciences 81.10 (1984), pp. 3088– + 3092. +[22] David W Tank and John J Hopfield. “Simple “Neural” optimization networks: an A/D converter, + signal decision circuit, and a linear programming circuit”. In: Artificial neural networks: theoretical + concepts. 1988, pp. 87–95. +[23] HP Graf et al. “VLSI implementation of a neural network memory with several hundreds of neu- + rons”. In: AIP conference proceedings. Vol. 151. 1. American Institute of Physics. 1986, pp. 182– + 187. +[24] Xinjie Guo et al. “Modeling and experimental demonstration of a Hopfield network analog-to- + digital converter with hybrid CMOS/memristor circuits”. In: Frontiers in neuroscience 9 (2015), + p. 488. +[25] SG Hu et al. “Associative memory realized by a reconfigurable memristive Hopfield neural net- + work”. In: Nature communications 6.1 (2015), p. 7522. +[26] Sukru B Eryilmaz et al. “Brain-like associative learning using a nanoscale non-volatile phase change + synaptic device array”. In: Frontiers in neuroscience 8 (2014), p. 205. +[27] Brendan P Marsh et al. “Enhancing associative memory recall and storage capacity using confocal + cavity QED”. In: Physical Review X 11.2 (2021), p. 021048. +[28] Khalid Musa et al. “Dense Associative Memory in a Nonlinear Optical Hopfield Neural Network”. + In: arXiv preprint arXiv:2506.07849 (2025). +[29] Carver Mead and Mohammed Ismail. Analog VLSI implementation of neural systems. Vol. 80. 
+ Springer Science & Business Media, 2012. +[30] Richard W Hamming. “Error detecting and error correcting codes”. In: The Bell system technical + journal 29.2 (1950), pp. 147–160. +[31] Paul J Werbos. “Backpropagation through time: what it does and how to do it”. In: Proceedings + of the IEEE 78.10 (2002), pp. 1550–1560. +[32] Patrick Kidger. “On Neural Differential Equations”. PhD thesis. University of Oxford, 2021. +[33] Zihang Dai et al. “Transformer-xl: Attentive language models beyond a fixed-length context”. + In: Proceedings of the 57th annual meeting of the association for computational linguistics. 2019, + pp. 2978–2988. +[34] Jacob Sillman. “Analog Implementation of the Softmax Function”. In: arXiv preprint arXiv:2305.13649 + (2023). +[35] John J Hopfield and David W Tank. “Computing with neural circuits: A model”. In: Science + 233.4764 (1986), pp. 625–633. +[36] Aldo Pena Perez and Franco Maloberti. “Performance enhanced op-amp for 65nm CMOS tech- + nologies and below”. In: 2012 IEEE International Symposium on Circuits and Systems (ISCAS). + IEEE. 2012, pp. 201–204. + + + 13 + Figure 7: Circuit for a single neuron. + + +[37] Rida S Assaad and Jose Silva-Martinez. “The recycling folded cascode: A general enhancement of + the folded cascode amplifier”. In: IEEE Journal of Solid-State Circuits 44.9 (2009), pp. 2535–2542. +[38] Alec Yen and Benjamin J Blalock. “A High Slew Rate, Low Power, Compact Operational Ampli- + fier Based on the Super-Class AB Recycling Folded Cascode”. In: 2020 IEEE 63rd International + Midwest Symposium on Circuits and Systems (MWSCAS). IEEE. 2020, pp. 9–12. +[39] Mohammad H Naderi, Suraj Prakash, and Jose Silva-Martinez. “Operational transconductance + amplifier with class-B slew-rate boosting for fast high-performance switched-capacitor circuits”. + In: IEEE Transactions on Circuits and Systems I: Regular Papers 65.11 (2018), pp. 3769–3779. +[40] Franz Schlögl and Horst Zimmermann. 
“A design example of a 65 nm CMOS operational amplifier”. In: International Journal of Circuit Theory and Applications 35.3 (2007), pp. 343–354.
+
+
+A Neuron Design
+Figure 7 shows the circuit design of a single neuron, with labels corresponding to a hidden neuron at index $\mu$. We derive the dynamics of the neuron internal state $h_\mu$ and activation output voltage $f_\mu$, using only Kirchhoff’s Current Law (KCL) and the definition of an ideal op-amp.
+
+Assumptions and conventions.
+    • Ideal op-amps: infinite open-loop gain, infinite input impedance (no input current), zero output impedance. Under stable negative feedback this enforces a virtual short $V_+ = V_-$.
+    • Current $J_\mu$: we define $J_\mu$ as the current which flows from $f_\mu$ to $m_\mu$ through $R_1$.
+    • Op-amp input labels: we denote the inverting and noninverting inputs of each op-amp explicitly, e.g. $U2^-$ for the inverting input of U2, $U3^+$ for the noninverting input of U3, etc.
+    • Node labels: label $m_\mu$ as the output of U1, $s_\mu$ as the output of U2, and $d_\mu$ as the output of U3. The neuron pre-activation state is labeled $h_\mu$, and the post-activation state is labeled $f_\mu$. Voltage $b_\mu$ (an ideal voltage source) drives the bias for this neuron. Voltages $h_\mu$, $b_\mu$, and $f_\mu$ correspond directly to the state variables in equation (1).
+
+Block U1: buffer of activation voltage $f_\mu$. Op-amp U1 buffers the output of the activation function $f(\cdot)$ and drives the output of the neuron, $f_\mu$. Because no current can flow into $U1^-$, all the current flowing into this neuron must flow through $R_1$ to $m_\mu$ and is sourced or sunk by U1’s output node.
+
+Block U2: non-inverting stage producing $s_\mu$ from $f_\mu$ and $m_\mu$. The positive input of U2 is $U2^+ = f_\mu$, and by U2’s virtual short, the negative input is $U2^- = U2^+ = f_\mu$. By KCL at $U2^-$,
+\[ \frac{U2^-}{R_{10}} = \frac{s_\mu - U2^-}{R_9} \;\Rightarrow\; s_\mu = \left( 1 + \frac{R_9}{R_{10}} \right) f_\mu \tag{19} \]
+
+Block U3: non-inverting stage producing $d_\mu$ from $s_\mu$, $b_\mu$, and $m_\mu$.
By KCL at the positive input of U3,
+\[ \frac{b_\mu - U3^+}{R_3} + \frac{s_\mu - U3^+}{R_4} = \frac{U3^+}{R_5} \;\Rightarrow\; U3^+ = \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} \tag{20} \]
+KCL at the negative input of U3 gives
+\[ \frac{m_\mu - U3^-}{R_6} + \frac{-U3^-}{R_7} = \frac{U3^- - d_\mu}{R_8} \;\Rightarrow\; d_\mu = U3^- \left[ 1 + R_8 \left( \frac{1}{R_6} + \frac{1}{R_7} \right) \right] - \frac{R_8}{R_6} m_\mu \tag{21} \]
+The virtual short of U3 means $U3^- = U3^+$. Combining equations (20) and (21), we get
+\[ d_\mu = \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} m_\mu \tag{22} \]
+
+Dynamics of RC circuit. $R_2$ and $C_1$ form an RC circuit driven by voltage $d_\mu$. The voltage across the capacitor, $h_\mu$, follows the relation
+\[ R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + d_\mu = -h_\mu + \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} m_\mu \tag{23} \]
+
+With incoming current. Take the incoming current $J_\mu = \sum_i \xi_{\mu i} (g_i - f_\mu)$. This produces a voltage drop across $R_1$ such that $m_\mu = f_\mu - R_1 J_\mu = f_\mu - R_1 \sum_i \xi_{\mu i} (g_i - f_\mu)$. Then, the dynamics of $h_\mu$ from equation (23) are
+\[ R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 s_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} (f_\mu - R_1 J_\mu) \tag{24} \]
+Substituting $s_\mu$ from equation (19) and $J_\mu$:
+\[ R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + \frac{R_6 R_7 + R_8 (R_6 + R_7)}{R_6 R_7} \cdot \frac{R_4 R_5 b_\mu + R_3 R_5 \left( 1 + \frac{R_9}{R_{10}} \right) f_\mu}{R_4 R_5 + R_3 R_5 + R_3 R_4} - \frac{R_8}{R_6} \left( f_\mu - R_1 \sum_i \xi_{\mu i} (g_i - f_\mu) \right) \tag{25} \]
+
+Equal-resistance special case. Set $R_1 = R_3 = R_4 = R_5 = R_6 = R_7 = R_8$. Then, equation (25) reduces to (taking this common resistance as the unit of resistance, so that the prefactor $R_1$ of the sum equals 1)
+\[ R_2 C_1 \frac{dh_\mu}{dt} = -h_\mu + b_\mu + \frac{R_9}{R_{10}} f_\mu + \sum_i \xi_{\mu i} (g_i - f_\mu) \tag{26} \]
+
+Selection of $R_9/R_{10}$ self-term gain. In order to match the form of equation (1), we need to cancel the $-f_\mu \sum_i \xi_{\mu i}$ term that appears on the right-hand side of equation (26). The $R_9/R_{10}$ term allows us to do that by setting
+\[ \frac{R_9}{R_{10}} = \sum_i \xi_{\mu i} \tag{27} \]
+With the assignment of equation (27) to $R_9$ and $R_{10}$, equation (26) simplifies to
+\[ R_2 C_1 \frac{dh_\mu}{dt} = \sum_i \xi_{\mu i} g_i - h_\mu + b_\mu \tag{28} \]
+which exactly matches our desired dynamics.
+
+Figure 8: Crossbar Array. Each pentagon contains a neuron of design in Figure 7.
In this layout we have flipped the crossbar array, so that index $\mu$ runs horizontally and index $i$ runs vertically.
+
+
+A.1 Activation function
+The voltage across $C_1$ gives us the dynamics of the neuron internal state $h_\mu$. Figure 7 contains a block representing a nonlinear amplifier, denoted $f(\cdot)$, whose input is $h_\mu$ and whose output is $f_\mu = f(h_\mu)$. This voltage is buffered by U1 onto the neuron output line, labeled $f_\mu$, which is what other neurons “see” in the crossbar array. The chosen activation function does not affect the rest of the dynamics of the neuron. In particular, the activation function need not be element-wise: a vector-wise activation function like softmax can be readily applied instead.
+
+A.2 Neurons interacting in a network
+So far we have examined the dynamics of a single neuron, assuming that the neuron receives an incoming current $J_\mu = \sum_i \xi_{\mu i} (g_i - f_\mu)$. Now we show how to wire these neurons together to realize this. Figure 8 shows the simplest DenseAM construction, where each pentagonal node is a circuit of the design in Figure 7. Each neuron exposes a single node whose voltage is driven at the activation of the neuron, and which accepts an incoming current that drives its dynamics. Each hidden neuron $f_\mu$ is connected to a visible neuron $g_i$ via a resistance $R_{\mu i} = 1/\xi_{\mu i}$, the inverse of the weight it represents. The current flowing into node $f_\mu$ is $J_\mu = \sum_i \frac{1}{R_{\mu i}} (g_i - f_\mu)$, which is the assumption needed for equation (24). The same analysis holds for other hidden and visible neurons, and so together they realize the large dynamical system of (1).
+
+A.3 SPICE Netlist
+Following is the SPICE netlist for the single neuron circuit, using ideal op-amps. Component values are omitted for brevity. There is no nonlinearity here; adding one would be a matter of inserting a nonlinear amplifier between node h_µ and XU1’s positive terminal.
+R1 f_µ m_µ
+XU1 f_µ h_µ m_µ opamp Aol=100K GBW=10Meg
+XU2 u2- f_µ s_µ opamp Aol=100K GBW=10Meg
+R2 u2- 0
+R3 s_µ u2-
+R4 u3+ s_µ
+R5 u3+ 0
+XU3 u3- u3+ d_µ opamp Aol=100K GBW=10Meg
+R6 u3- m_µ
+R7 d_µ u3-
+R8 d_µ h_µ
+C1 h_µ 0
+V§b_µ N001 0
+R9 u3+ N001
+R10 u3- 0
+
+Figure 9: Softmax circuit design
+
+
+B Softmax Circuit
+For demonstration purposes, we follow the construction of an analog softmax circuit using bipolar junction transistors (BJTs) described in [34]. Figure 9 shows the design of a four-way softmax circuit using BJTs. The softmax function we aim to produce is:
+\[ \mathrm{softmax}_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \quad i = 1, \ldots, N \tag{29} \]
+    For the $\mu$th BJT in the circuit, the collector current $I_{C,\mu}$ in forward-active mode can be expressed in terms of the base voltage $h_\mu$ and the emitter voltage $V_E$ as:
+\[ I_{C,\mu} = I_S e^{V_{BE,\mu}/V_T}, \quad V_{BE,\mu} = h_\mu - V_E \;\Rightarrow\; I_{C,\mu} = I_S e^{(h_\mu - V_E)/V_T} \tag{30} \]
+where $I_S$ is the BJT’s saturation current and $V_T$ is the thermal voltage. Assuming large BJT $\beta$ (note: this $\beta$ is unrelated to the softmax $\beta$)^2, we can neglect base currents, so that $I_{C,\mu} = I_{E,\mu}$. Applying KCL at the shared emitter node $V_E$, the total current is $I_{EE} = \sum_{\mu=1}^{N_h} I_{C,\mu}$. We can expand the expression for the collector currents to get the currents in terms of node voltages:
+\[ I_{EE} = \sum_{\mu=1}^{N_h} I_S e^{(h_\mu - V_E)/V_T} = \sum_{\mu=1}^{N_h} \frac{I_S e^{h_\mu/V_T}}{e^{V_E/V_T}} \tag{31} \]
+Simultaneously, the current $I_{EE}$ is also fixed by the ideal current source, so $I_{C,\mu}$ can also be expressed as the ratio of the branch current to the total current: $I_{C,\mu} = \frac{I_{C,\mu}}{I_{EE}} I_{EE}$. Plugging in (30) for $I_{C,\mu}$ in the numerator and (31) for $I_{EE}$ in the denominator, and canceling the terms containing $V_E$,
+\[ I_{C,\mu} = \frac{e^{h_\mu/V_T}}{\sum_{j=1}^{N_h} e^{h_j/V_T}} I_{EE} \tag{32} \]
+This already looks very much like the ideal softmax function. The voltage at node $f_\mu$ is created by current flowing through resistor $R_\mu$, producing a voltage drop relative to $V_{CC}$. Specifically, the voltage is $f_\mu = V_{CC} - \frac{e^{h_\mu/V_T}}{\sum_{j=1}^{N_h} e^{h_j/V_T}} I_{EE} R_\mu$.
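Equations (30) to (32) can be checked numerically by solving KCL at the shared emitter node for V_E and comparing the resulting collector currents with a softmax of h/V_T. This is a sketch under idealized assumptions (no base currents, ideal current source); the device values I_S = 1e-14 A, V_T = 25.85 mV, I_EE = 1 mA and the base voltages are hypothetical illustration values, not taken from the paper or from a datasheet.

```python
import math

def bjt_softmax_currents(h, i_ee=1e-3, v_t=0.02585, i_s=1e-14):
    """Collector currents of BJTs sharing one emitter node that is fed
    by an ideal current source i_ee. h holds the base voltages in volts."""
    h_max = max(h)
    # KCL at the emitter: sum_mu i_s * exp((h_mu - v_e) / v_t) = i_ee.
    # Solve for v_e, shifting by h_max to keep the exponentials small.
    total = sum(math.exp((x - h_max) / v_t) for x in h)
    v_e = h_max + v_t * math.log(i_s * total / i_ee)
    # Equation (30) for each transistor.
    return [i_s * math.exp((x - v_e) / v_t) for x in h]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

h = [0.60, 0.62, 0.58, 0.61]            # hypothetical base voltages (V)
currents = bjt_softmax_currents(h)
reference = softmax([x / 0.02585 for x in h])
for c, r in zip(currents, reference):
    print(c / 1e-3, r)                  # normalized current vs. softmax
```

The saturation current cancels out of the normalized currents, and the thermal voltage plays the role of an inverse temperature: in this idealized picture the effective softmax beta is 1/V_T.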
When $I_{EE} R_\mu = 1$ V, this voltage $f_\mu$ is a negated and shifted softmax with a range of 1 volt. This scale and negation can be easily corrected with an op-amp, which is also needed to isolate the node and prevent loading. Note that $V_{CC}$ must be chosen to be a positive supply in order for the BJTs to remain in forward-active mode.
+    2 In BJTs, $\beta$ denotes the ratio of the collector current to the base current. A high BJT $\beta$ indicates that the transistor is able to amplify a small base current into a much larger collector current, allowing the BJT to function as an amplifier or switch. A high $\beta$ reflects that the BJT can efficiently transmit carriers from emitter to collector, without losing them to the base.
+
+    Parameter               Value
+    R_F                     1000 Ω
+    R_T                     1 Ω
+    R_1                     1 Ω
+    R_2, R_3, ..., R_8      10 000 Ω
+    R_S                     40 Ω
+    C                       10 µF
+    a_3                     0 V
+    b_1                     0 V
+    b_2                     −1 V
+    b_3                     −1 V
+    b_4                     −1 V
+
+    Table 2: Component and parameter values.
+
+
+C XOR DenseAM Circuit
+Figure 10 is a full circuit diagram of the DenseAM that solves the XOR problem. Given input voltages V1, V2 ∈ {0, 1}, the output voltage at $g_3$ is the result of the XOR operation between V1 and V2. In this model, the visible neuron is linear, and the hidden neurons share a softmax activation function implemented by a set of bipolar junction transistors. Table 2 lists the component values used in simulation.
+
+Visible neurons. In the XOR task, only one visible neuron is left evolving, corresponding to the output column of the truth table. The first two neurons are therefore clamped to the input voltages, represented by V1 and V2. The third visible neuron, highlighted in blue, is a linear unit with no nonlinear activation: the internal state voltage $v_3$ directly drives the output, setting $g_3 = v_3$. This is the same circuit described in Appendix A, except that the activation block is not present.
+
+Hidden neurons. The XOR task requires four hidden neurons, highlighted in green.
These are identical circuit constructions with the exception of the voltage sources $b_\mu$ for the biases, which are set according to the values in Table 2. Unlike the visible neuron, the hidden neurons have a softmax activation function, such that $f_\mu = \mathrm{softmax}_\mu(h)$.
+
+Softmax activation function. The red highlight marks the same softmax circuit described in Appendix B, composed of BJTs, resistors, a voltage source for $V_{CC}$, and a current source for $I_{EE}$. We use 2N5088 transistors in our model, reflecting a standard and widely available BJT. Noninverting buffers (U10, U11, etc.) are used to prevent loading effects on the state capacitors $C_\mu$ from the current drawn by the BJT bases in forward-active mode. As discussed in Appendix B, the softmax circuit itself produces an output voltage of
+\[ \mathrm{softmax}(z)_i = V_{CC} - \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \quad i = 1, \ldots, N \]
+When $V_{CC} = 5$ V as in this circuit, this requires extra circuitry, highlighted in yellow, to shift and negate the softmax output. This is done by first buffering the voltage output to prevent loading effects, followed by a summing op-amp that subtracts $V_{CC}$ and inverts the softmax output. For the first hidden neuron $h_1$ (lower left of the figure), op-amp U2 buffers the voltage output, while U1 is configured as an inverting summing amplifier to add −5 V (the inverse of $V_{CC}$) to the buffered voltage output, producing the correct softmax output.
+
+Weight matrix. The weight matrix $\xi$ is composed of resistors $R_1$-$R_{12}$. These are set directly according to the XOR truth table, where each row corresponds to one hidden neuron. A boolean value of 1 ($R_T$) is realized as a high conductance (a 1 Ω resistor), while a boolean value of 0 ($R_F$) is realized as a relatively small conductance (a 1 kΩ resistor).
+    The gain $s_i/g_i$ governing the value of $s_i$ is set according to the sum of the conductances in that neuron’s crossbar column. The column for neuron 1 has three $R_F$ resistances, whose conductances sum to $3 \times 10^{-3}$ S.
Hence, neuron 1’s $R_{47}/R_{46} = 3/1000$. The crossbar columns for neurons 2, 3, and 4 each have two $R_T$ resistances and one $R_F$ resistance, whose conductances sum to approximately 2. Hence, we approximate $R_{59}/R_{56} = 2000/1000$, and similarly for hidden neurons 3 and 4.
+
+Figure 10: Full schematic for XOR DenseAM built with 1 evolving linear visible neuron and 4 hidden neurons with softmax activation. Blue: visible neuron. Green: hidden neurons. Yellow: buffers for softmax activation circuit. Red: analog softmax circuit.
+
+
+D Design and implementation variations
+A large design space remains open across analog electronics and other substrates for realizing DenseAMs, with clear speed–energy–area–precision trade-offs. In electronics, the core primitives admit multiple realizations: passive, nonvolatile weights (e.g., memristors, triode-region or floating-gate transistors, and other programmable conductors); active, gained weights via OTAs; and nonlinearities via diode clamps, reverse-biased diode/BJT exponentials, MOS quadratic regions, or translinear blocks. Architectures in the spirit of [35, 23] are compact but couple synaptic values to neuronal time constants, so the dynamics drift when a single weight changes, which is problematic for learning and consistent timing. Our decoupled neuron, in contrast, preserves a fixed time constant under weight updates. Simpler neuron/network topologies likely exist and can be attractive in resource-constrained regimes, provided their deviations from the target ODEs are validated not to degrade performance. Beyond CMOS, photonics (e.g., overdamped, low-Q microring resonators) can naturally implement first-order ODEs and can offer extreme bandwidth with distinct calibration and noise constraints.
Across these options, open problems include robust weight storage/programmability and drift control, mixed-signal learning rules compatible with device limits, scaling under current/GBW/SR constraints, tolerance to mismatch/noise, and algorithm–circuit co-design to exploit substrate-specific advantages.
+
+
+E Scaling of inference time
+There are two conditions under which inference times should be studied, depending on the softmax inverse temperature $\beta$. In the low-$\beta$ regime, the DenseAM reaches equilibria with multiple hidden neurons “competing” in the softmax, while in the high-$\beta$ regime, the DenseAM reaches equilibria with only one hidden neuron “winning out” in the softmax. Intuitively, the high-$\beta$ regime corresponds to exact memory recall, while the low-$\beta$ regime corresponds to interpolation. The XOR and Hamming (7,4) code are in the high-$\beta$ regime, while the energy transformer lies in the low-$\beta$ regime. In both regimes, we find that the DenseAM converges in time that is constant with respect to the number of neurons.
+
+Assumptions.
+(A1) There is a per-synapse device limit $0 \le \xi_{\mu i} \le G_{\max}$, where $G_{\max}$ is the maximum conductance set by the physics of the crossbar crosspoints. Because $f$ is the output of a softmax, so that $f_\mu \le 1$ for all $\mu$ and $\sum_\mu f_\mu = 1$, this means
+\[ \sum_\mu \xi_{\mu i} f_\mu \le G_{\max} \tag{33} \]
+    so the RHS of the visible neuron dynamics is $O(1)$.
+    There exist both column-sum and row-sum budgets that are enforced by the hardware, since each neuron’s output stage can only source/sink a finite amount of current while maintaining GBW/SR margins. This dictates a per-column and per-row conductance budget to stay within this maximum current, resulting in
+\[ \sum_{i}^{N_v} \xi_{\mu i} \le C_r \;\; \forall \mu, \qquad \sum_{\mu}^{N_h} \xi_{\mu i} \le C_c \;\; \forall i \tag{34} \]
+    Weights cannot be negative since conductances cannot be negative, so $\xi_{\mu i} \ge 0$.
+    As a corollary of (A1), note also that we can bound $\|\xi_\mu\|_2 \le S \;\forall \mu$, and since $\|\xi_\mu\|_2 \le \|\xi_\mu\|_1$, then $\|\xi_\mu\|_2 \le C_c \;\forall \mu$.
+(A2) Bounded biases. $|a_i| \le A$, $|b_\mu| \le B$ for all $i$, $\mu$.
In realistic regimes, this typically holds, for example with the typical choice for boolean functions of $b_\mu = -\frac{\beta}{2} \|\xi_\mu\|^2$ (seen in Section 5.1).
+
+Model. Take the system of equation (1) with a softmax activation on hidden neurons and an identity activation on visible neurons. For clarity we assume zero biases on visible neurons in what follows; they do not change the analysis.
+\[ \tau_v \dot{v} = \xi^\top f + a - v, \qquad \tau_h \dot{h} = \xi v + b - h, \qquad f = \mathrm{softmax}_\beta(h) \tag{35} \]
+Integrating out the hidden units,
+\[ \tau_v \dot{v} = \xi^\top f(v) - v, \tag{36} \]
+\[ f(v) = \mathrm{softmax}\big( \beta (\xi v + b) \big) \tag{37} \]
+yields the effective energy function expressed in terms of visible neurons:
+\[ E(v) = \frac{1}{2} \|v\|^2 - \frac{1}{\beta} \log \sum_\mu \exp\!\big( \beta (\xi_\mu^\top v + b_\mu) \big) \tag{38} \]
+where $\nabla E(v) = v - \xi^\top f(v)$. Because $\tau_v \dot{v} = -\nabla E(v)$, the energy decreases monotonically along the dynamical trajectory:
+\[ \frac{d}{dt} E(v(t)) = \nabla E(v(t))^\top \dot{v} = -\frac{1}{\tau_v} \|\nabla E(v(t))\|^2 \le 0 \tag{39} \]
+
+E.1 Low-β regime
+The energy landscape in the low-$\beta$ regime exhibits uniform strong convexity, so the gradient flow dynamics cause the energy gap to decay exponentially, reaching an $\epsilon$-fraction of the original energy gap in constant time. To show that $E(v)$ is $\alpha$-strongly convex, we must show $\nabla^2 E(v) \succeq \alpha I$ for some $\alpha > 0$, i.e. that all the eigenvalues of the Hessian are $\ge \alpha$. Equivalently, $\lambda_{\min}(\nabla^2 E) \ge \alpha$. Denote $G(f) = \mathrm{Diag}(f) - f f^\top \succeq 0$, which is the Jacobian of the softmax function $f(v) = \mathrm{softmax}(\beta(\xi v + b))$.
+\[ \nabla^2 E(v) = I - \beta \xi^\top G(f) \xi \tag{40} \]
+\[ \lambda_{\min}\big( \nabla^2 E(v) \big) = \lambda_{\min}\big( I - \beta \xi^\top G(f) \xi \big) \tag{41} \]
+\[ = 1 - \beta \lambda_{\max}\big( \xi^\top G(f) \xi \big) \tag{42} \]
+\[ \Rightarrow \nabla^2 E(v) \succeq \big( 1 - \beta \lambda_{\max}( \xi^\top G(f) \xi ) \big) I \tag{43} \]
+Because $G(f)$ is PSD with $G(f) \preceq \mathrm{Diag}(f) \preceq I$, the matrix $\xi^\top G(f) \xi$ is also PSD; and since $G(f)$ is a probability-weighted covariance with $\sum_\mu f_\mu = 1$,
+\[ \lambda_{\max}(\xi^\top G(f) \xi) \le \mathrm{tr}(\xi^\top G(f) \xi) \le \sum_\mu f_\mu \|\xi_\mu\|^2 \le \max_\mu \|\xi_\mu\|^2 \tag{44} \]
+Denote $S^2 = \max_\mu \|\xi_\mu\|^2 \le C_c$ as in (A1). Therefore, the Hessian of $E$ can be bounded as
+\[ \nabla^2 E(v) \succeq (1 - \beta S^2) I = \alpha I \tag{45} \]
+where $\alpha = 1 - \beta S^2$. Then $\alpha > 0$ when $\beta < 1/\max_\mu \|\xi_\mu\|^2$.
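The strong-convexity margin in equation (45) is straightforward to evaluate for a given weight matrix. The sketch below uses idealized XOR truth-table rows (exact 0/1 weights rather than the finite conductances of Table 2) purely as a small illustration; the paper places the actual XOR circuit in the high-β regime, and `strong_convexity_margin` is a hypothetical helper name.

```python
def strong_convexity_margin(xi, beta):
    """alpha = 1 - beta * S^2 with S^2 = max_mu ||xi_mu||^2 (equation (45))."""
    s_squared = max(sum(w * w for w in row) for row in xi)
    return 1.0 - beta * s_squared

# Idealized XOR truth-table rows, one row per hidden neuron (assumption).
xi = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]

print(strong_convexity_margin(xi, beta=0.25))  # S^2 = 2, margin is positive
print(strong_convexity_margin(xi, beta=1.0))   # margin is negative
```

A positive margin certifies the uniformly convex regime; a negative one only means that this particular bound no longer applies.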
This is a sufficient (but not necessary)
+condition for the system to be in the low-β (uniformly convex) regime, where the softmax is diffuse
+enough that its covariance term does not contribute so much negative curvature as to overwhelm the
+positive curvature contributed by the identity term. In this regime, the uniform lower bound on the
+Hessian implies α-strong convexity, which gives the PL inequality
+
+    \[ \frac{1}{2} \|\nabla E(v)\|^2 \ge \alpha \big(E(v) - E^\star\big) \tag{46} \]
+
+Together with (39), this allows us to bound the time constant of gradient flow:
+
+    \[ \frac{d}{dt} \big(E(v(t)) - E^\star\big) = -\frac{1}{\tau_v} \|\nabla E(v(t))\|^2 \le -\frac{2\alpha}{\tau_v} \big(E(v(t)) - E^\star\big) \tag{47} \]
+
+If the curvature is bounded below by α, then the gradient magnitude grows at least linearly with distance
+to the minimum, so the energy function is “steep enough” to guarantee exponential convergence.
+Integrating,
+
+    \[ E(v(t)) - E^\star \le \big(E(v(0)) - E^\star\big)\, e^{-\frac{2\alpha}{\tau_v} t} \tag{48} \]
+
+This indicates exponential decay of the energy gap. Reaching an ϵ-fraction of the original energy
+gap takes time
+
+    \[ T(\epsilon) \le \frac{\tau_v}{2\alpha} \log\frac{1}{\epsilon} = O\big(\tau_v \log(1/\epsilon)\big) \tag{49} \]
+
+which is entirely independent of system size N_v and N_h. In the energy transformer case, this means that
+convergence time is entirely independent of context length L and token dimension D.
+
+E.2 High-β regime
+E.2.1 T_I: Basin selection
+Denote
+
+    \[ s_\mu(v) := \xi_\mu^\top v + b_\mu, \qquad m(v) := \max_{\mu} s_\mu(v), \qquad f := \mathrm{softmax}(\beta s) \tag{50} \]
+
+Define the basin of attraction around the winning softmax logit k by the margin γ > 0:
+
+    \[ B_k(\gamma) = \{\, v : s_k(v) - \max_{j \ne k} s_j(v) \ge \gamma \,\} \tag{51} \]
+
+Let T_I be the first time t such that v(t) ∈ ∪_k B_k(γ). Defining the softmax component of the energy
+function (38) as
+
+    \[ \mathrm{LSE}_\beta(s) = \frac{1}{\beta} \log \sum_{\mu=1}^{N_h} e^{\beta s_\mu} \]
+
+then for every v, we can bound the LSE as
+
+    \[ m(v) \le \mathrm{LSE}_\beta(s(v)) \le m(v) + \frac{1}{\beta} \log N_h \tag{52} \]
+
+Thus, the “softmax slack” δ(v) := LSE_β(s(v)) − m(v) obeys 0 ≤ δ(v) ≤ (1/β) log N_h.
In the high-β regime,
+there are no critical points other than the softmax basins (those within ∪_k B_k(γ) for any reasonable
+γ > ϵ > 0). Reducing δ from its initial value to the cusp of one of the basins requires dissipating at most
+
+    \[ \Delta E_{\mathrm{softmax}} \le \frac{1}{\beta} \log N_h \tag{53} \]
+
+Since ∂E/∂v_i = −τ_v v̇_i and, outside winning basins, τ_v v̇_i ∼ 1, the squared magnitude of the gradient
+grows at least linearly in N_v:
+
+    \[ \|\nabla E(v)\|^2 = \sum_{i=1}^{N_v} \left(\frac{\partial E}{\partial v_i}\right)^2 \ge c N_v \tag{54} \]
+
+for some c > 0 independent of N_v and N_h, for all v in the trajectory outside a winning basin. Therefore,
+the energy dissipation rate satisfies
+
+    \[ -\dot{E}(t) = \frac{1}{\tau_v} \|\nabla E(v(t))\|^2 \ge \frac{c}{\tau_v} N_v \tag{55} \]
+
+    Under assumptions (A1)–(A2), the visible state v remains in a bounded box, so the quadratic part of
+the energy contributes at most O(N_v) to the energy difference between any two points on the trajectory.
+Since the energy dissipation rate during T_I scales proportionally to N_v, the quadratic component of
+the energy is dissipated in constant time. The only nontrivial N_h dependence is due to the
+softmax slack. Together with the bound on ∆E_softmax, the total time this phase takes is characteristically
+
+    \[ T_I = O\!\left(\frac{\tau_v}{\beta} \frac{\log N_h}{N_v}\right) \tag{56} \]
+
+E.2.2 T_II: Contractive convergence within a winning basin
+Consider the basin B_k(γ) that is entered at t_in = T_I. We will now show local strong convexity within this
+basin, allowing us to invoke the PL inequality and find exponential convergence within the basin. Define
+G := Diag(f) − f f^⊤. First, consider that the non-winning softmax mass is 1 − f_k, which is
+
+    \[ 1 - f_k = \sum_{j \ne k} f_j \le (N_h - 1)\, e^{-\beta\gamma} \tag{57} \]
+
+Additionally, since ∥f∥² = f_k² + Σ_{j≠k} f_j² ≥ f_k² and 0 ≤ f_k ≤ 1,
+
+    \[ \lambda_{\max}(G(f)) \le \mathrm{tr}(G(f)) = 1 - \|f\|^2 \le 1 - f_k^2 \le 2(1 - f_k) \le 2(N_h - 1)\, e^{-\beta\gamma} \tag{58} \]
+
+Hence, with S² = max_µ ∥ξ_µ∥²,
+
+    \[ \lambda_{\max}(\xi^\top G(f)\, \xi) \le S^2 \lambda_{\max}(G(f)) \le 2 S^2 (N_h - 1)\, e^{-\beta\gamma} \tag{59} \]
+
+This bounds the largest eigenvalue of ξ^⊤ G(f) ξ in a way that incorporates the softmax β.
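The chain of inequalities in Eqs. (57)–(58) can be spot-checked numerically. The following is a minimal sketch, not part of the original analysis: the logit values and β are arbitrary illustrative choices, and the helper names `softmax` and `basin_bounds` are hypothetical. It evaluates tr G(f), twice the non-winning mass, and the margin bound 2(N_h − 1)e^{−βγ} for one set of logits.

```python
import math


def softmax(scores, beta):
    """Numerically stable softmax of beta * scores."""
    top = max(scores)
    weights = [math.exp(beta * (s - top)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def basin_bounds(scores, beta):
    """Return (tr G(f), 2 * (1 - f_k), 2 * (N_h - 1) * exp(-beta * gamma))."""
    f = softmax(scores, beta)
    k = max(range(len(scores)), key=lambda j: scores[j])  # winning logit index
    runner_up = max(s for j, s in enumerate(scores) if j != k)
    gamma = scores[k] - runner_up  # margin of the winner, as in Eq. (51)
    trace_g = 1.0 - sum(p * p for p in f)  # tr(Diag(f) - f f^T) = 1 - ||f||^2
    non_winning_mass = sum(p for j, p in enumerate(f) if j != k)  # 1 - f_k
    margin_bound = 2.0 * (len(scores) - 1) * math.exp(-beta * gamma)
    return trace_g, 2.0 * non_winning_mass, margin_bound


# Illustrative (assumed) logits s_mu and inverse temperature beta.
trace_g, twice_mass, margin_bound = basin_bounds([2.0, 1.0, 0.5, -1.0], beta=5.0)
print(trace_g, twice_mass, margin_bound)
```

For these values the three printed quantities should appear in non-decreasing order, matching tr G ≤ 2(1 − f_k) ≤ 2(N_h − 1)e^{−βγ}; a single example illustrates the bound but does not prove it.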
+    Now, we can show local strong convexity in the winning basin:
+
+    \[ \nabla^2 E(v) = I - \beta\, \xi^\top G(f)\, \xi \succeq \big(1 - 2\beta S^2 (N_h - 1)\, e^{-\beta\gamma}\big) I \equiv \alpha(\beta, \gamma)\, I \tag{60} \]
+
+for all v ∈ B_k(γ). In particular, if
+
+    \[ e^{-\beta\gamma} (N_h - 1) \le \frac{1}{4 \beta S^2} \tag{61} \]
+
+then α(β, γ) ≥ 1/2, independent of N_h and N_v. Note that this is always possible: if the softmax is not peaked
+enough to make this inequality true, simply keep moving in trajectory “Phase I” for a little longer until
+the margin γ grows large enough that the condition holds. This strong convexity within B_k(γ)
+implies the PL inequality
+
+    \[ \frac{1}{2} \|\nabla E(v)\|^2 \ge \alpha(\beta, \gamma) \big(E(v) - E^\star\big), \qquad \forall v \in B_k(\gamma) \tag{62} \]
+
+Therefore, along the trajectory within the basin for times t ≥ t_in,
+
+    \[ \frac{d}{dt} \big(E(v(t)) - E^\star\big) = -\frac{1}{\tau_v} \|\nabla E(v(t))\|^2 \le -\frac{2\alpha(\beta, \gamma)}{\tau_v} \big(E(v(t)) - E^\star\big) \tag{63} \]
+
+Integrating,
+
+    \[ E(v(t)) - E^\star \le e^{-\frac{2\alpha(\beta,\gamma)}{\tau_v}(t - t_{\mathrm{in}})} \big(E(v(t_{\mathrm{in}})) - E^\star\big) \tag{64} \]
+
+Impose a relative-to-initial convergence criterion:
+
+    \[ E(v(t)) - E^\star \le \epsilon \big(E(v(0)) - E^\star\big), \qquad \epsilon \in (0, 1) \]
+
+Since E is non-increasing along the trajectory, E(v(t_in)) − E⋆ ≤ E(v(0)) − E⋆, so it suffices that
+
+    \[ e^{-\frac{2\alpha(\beta,\gamma)}{\tau_v}(t - t_{\mathrm{in}})} \le \epsilon \]
+
+Hence the in-basin time satisfies
+
+    \[ T_{II} \le \frac{\tau_v}{2\alpha(\beta, \gamma)} \log\frac{1}{\epsilon} = O\!\left(\tau_v \log\frac{1}{\epsilon}\right) \tag{65} \]
+
+which is independent of N_h and N_v.
+
+E.2.3 Combined bound
+Altogether, in the high-β regime, to reach a relative-to-initial tolerance of
+
+    \[ E(v(t)) - E^\star \le \epsilon \big(E(v(0)) - E^\star\big) \tag{66} \]
+
+the combined convergence time satisfies
+
+    \[ T(\epsilon) = \underbrace{O\!\left(\frac{\tau_v}{\beta} \frac{\log N_h}{N_v}\right)}_{\text{winner selection } (T_I)} + \underbrace{O\!\left(\tau_v \log\frac{1}{\epsilon}\right)}_{\text{convergence within basin } (T_{II})} \tag{67} \]
+
+For fixed ϵ, β, and τ_v, T_II is independent of N_v and N_h, while T_I carries all the model-size dependence.
+The dependence of the convergence time on N_h and N_v in the high-β regime is
+
+    \[ T(\epsilon) = O\!\left(\frac{\tau_v}{\beta} \frac{\log N_h}{N_v}\right). \tag{68} \]
+
+The convergence time is at most logarithmic in the number of hidden neurons N_h, and actually decreases
+as 1/N_v in the number of visible neurons.
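To make the scaling in Eqs. (49), (56), and (65) concrete, the bounds can be evaluated directly. This is a rough sketch under stated assumptions: the constant c from Eq. (54) is not given numerically in the text, so it is set to 1 here purely for illustration, α(β, γ) is taken at its guaranteed value 1/2, and the function names `low_beta_time` and `high_beta_time` are hypothetical.

```python
import math


def low_beta_time(tau_v, beta, s_squared, eps):
    """Upper bound of Eq. (49), valid when alpha = 1 - beta * S^2 > 0."""
    alpha = 1.0 - beta * s_squared
    if alpha <= 0.0:
        raise ValueError("low-beta bound requires beta * S^2 < 1")
    return tau_v / (2.0 * alpha) * math.log(1.0 / eps)


def high_beta_time(tau_v, beta, n_h, n_v, eps, alpha=0.5, c=1.0):
    """Return (T_I scale, T_II bound) following Eqs. (56) and (65).

    c is the unspecified constant of Eq. (54); c = 1 is an assumption.
    """
    t_select = tau_v * math.log(n_h) / (beta * c * n_v)  # softmax slack / dissipation rate
    t_basin = tau_v / (2.0 * alpha) * math.log(1.0 / eps)
    return t_select, t_basin


small = high_beta_time(tau_v=1.0, beta=10.0, n_h=16, n_v=7, eps=0.01)
large = high_beta_time(tau_v=1.0, beta=10.0, n_h=16, n_v=700, eps=0.01)
print(small, large)
```

Growing N_v shrinks the winner-selection term while the in-basin term stays fixed, which is the size dependence stated in Eq. (68). The O(1) time spent dissipating the quadratic part of the energy is omitted from `t_select`.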
+
+E.3 Limitations
+Our analysis assumes that the timescales of the crossbar array are much faster than the fastest neuronal
+timescales. In practice, as the crossbar array gets bigger, it may contribute to the time scales of the
+entire system, since wires have non-zero capacitances. Once the crossbar array is large enough
+to significantly modify the time scales of the neurons, our analysis and the scaling argument
+become invalid. For this reason, one cannot scale this design to infinitely large sizes. Analyzing that
+boundary is outside the scope of our paper, because it depends on fabrication and design parameters,
+which belong to a different level of abstraction than the present paper.
+
+
+F Design invariance under voltage scaling
+Given hardware constraints of G_max, C_c, and C_r, we can still implement models with arbitrarily large
+weights. Convergence bounds rely on the weight matrix constraints, which can be made feasible by
+global normalization at the hardware level, keeping the effective model weights unchanged. Consider the
+scaling factor for any non-negative ξ:
+
+    \[ \kappa = \min\left\{ 1,\ \frac{G_{\max}}{\max_{\mu,i} \xi_{\mu i}},\ \frac{C_c}{\max_i \sum_\mu \xi_{\mu i}},\ \frac{C_r}{\max_\mu \sum_i \xi_{\mu i}} \right\} \tag{69} \]
+
+Set ξ̃ = κξ. Then, ξ̃ satisfies all the hardware constraints of assumption (A1):
+
+    \[ 0 \le \tilde{\xi}_{\mu i} \le G_{\max}, \qquad \sum_i \tilde{\xi}_{\mu i} \le C_r \quad \forall \mu, \qquad \sum_\mu \tilde{\xi}_{\mu i} \le C_c \quad \forall i \tag{70} \]
+
+So any ξ matrix can be mapped onto the budgets with one scalar κ. Consider the pre-softmax arguments
+for the hidden neurons: if we scale weights ξ → ξ̃ = κξ, rescale the voltage unit v → ṽ = κv and biases
+b → b̃ = κ²b, and set β̃ = β/κ², then
+
+    \[ \tilde{\beta}\big(\tilde{\xi}_\mu^\top \tilde{v} + \tilde{b}_\mu\big) = \beta\big(\xi_\mu^\top v + b_\mu\big) \tag{71} \]
+
+so the softmax outputs f and the system’s attractors are unchanged. The visible ODE τ_v v̇ = ξ^⊤ f(v) − v
+is preserved up to units, as the κ terms can be absorbed into the gain of U2 and U3 without affecting the
+convergence time bounds.
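The normalization of Eqs. (69)–(71) is easy to check on a small example. The sketch below is illustrative only: it reuses the XOR weight matrix of Table 3 with biases b_µ = −(1/2)∥ξ_µ∥², but the budget values passed as `g_max`, `c_col`, `c_row`, the test state `v`, and β = 4 are arbitrary assumptions in normalized units, and the helper names are hypothetical.

```python
def scale_factor(xi, g_max, c_col, c_row):
    """Scalar kappa of Eq. (69) for a non-negative weight matrix xi (rows = memories)."""
    largest_entry = max(max(row) for row in xi)
    if largest_entry == 0:
        return 1.0  # all-zero weights already satisfy every budget
    largest_col_sum = max(sum(col) for col in zip(*xi))  # max_i sum_mu xi[mu][i]
    largest_row_sum = max(sum(row) for row in xi)        # max_mu sum_i xi[mu][i]
    return min(1.0,
               g_max / largest_entry,
               c_col / largest_col_sum,
               c_row / largest_row_sum)


def logits(xi, v, b, beta):
    """Pre-softmax arguments beta * (xi_mu . v + b_mu) for every hidden neuron."""
    return [beta * (sum(w * x for w, x in zip(row, v)) + bias)
            for row, bias in zip(xi, b)]


# XOR memories from Table 3; budgets and state below are assumed values.
xi = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
b = [-0.5 * sum(w * w for w in row) for row in xi]
v, beta = [1.0, 0.0, 0.3], 4.0

kappa = scale_factor(xi, g_max=0.5, c_col=0.8, c_row=0.6)
xi_s = [[kappa * w for w in row] for row in xi]   # xi -> kappa * xi
v_s = [kappa * x for x in v]                      # v  -> kappa * v
b_s = [kappa ** 2 * bias for bias in b]           # b  -> kappa^2 * b
beta_s = beta / kappa ** 2                        # beta -> beta / kappa^2

print(kappa)
print(logits(xi, v, b, beta))
print(logits(xi_s, v_s, b_s, beta_s))
```

The two printed logit lists should agree up to floating-point rounding, so the softmax outputs are unchanged while the rescaled weights fit inside the assumed budgets.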
+
+
+G Scaling of energy consumption
+The energy consumption of DenseAM circuits can be broken up into two parts: the energy dissipated
+by the weights as a result of Ohm’s Law, and the energy from engineering overhead found in amplifiers
+and active circuitry. The energy dissipated by the weights in the crossbar array can be expressed as the
+integral of the power dissipated by each resistor of resistance R_{µi} from time 0 until convergence at T_conv.
+
+Energy consumption of weights. Let the neuron output voltages be proportional to activations:
+u_i = κ g_i and w_µ = κ f_µ, where κ is a fixed voltage scale. We assume rail-bounded outputs |u_i| ≤ κ and
+|w_µ| ≤ κ (by Appendix F, global rescaling of ξ, voltages, and β preserves the DenseAM dynamics, so
+this choice of κ does not affect behavior). The instantaneous power in the resistive crossbar is:
+
+    \[ P_{\mathrm{weights}}(t) = \sum_{i,\mu} \xi_{\mu i} (u_i - w_\mu)^2 \tag{72} \]
+
+Using the row/column conductance budgets Σ_µ ξ_{µi} ≤ C_c and Σ_i ξ_{µi} ≤ C_r (Appendix E) and the
+inequality (a − b)² ≤ 2a² + 2b²,
+
+    \[ P_{\mathrm{weights}}(t) \le 2 \left( \sum_{i,\mu} \xi_{\mu i} u_i^2 + \sum_{i,\mu} \xi_{\mu i} w_\mu^2 \right) \tag{73} \]
+    \[ \phantom{P_{\mathrm{weights}}(t)} = 2 \left( \sum_i u_i^2 \Big( \sum_\mu \xi_{\mu i} \Big) + \sum_\mu w_\mu^2 \Big( \sum_i \xi_{\mu i} \Big) \right) \tag{74} \]
+    \[ \phantom{P_{\mathrm{weights}}(t)} \le 2 \left( C_c \sum_i u_i^2 + C_r \sum_\mu w_\mu^2 \right) \tag{75} \]
+
+If the hidden layer uses a softmax activation, then Σ_µ f_µ² ≤ 1 and so Σ_µ w_µ² ≤ κ²; the rail bounds give
+Σ_i u_i² ≤ N_v κ². Therefore,
+
+    \[ P_{\mathrm{weights}}(t) \le 2\kappa^2 (C_c N_v + C_r) = O(N_v) \tag{76} \]
+
+Therefore, a system taking time T_conv to converge results in an energy consumption of
+
+    \[ E_{\mathrm{weights}} = \int_0^{T_{\mathrm{conv}}} P_{\mathrm{weights}}(t)\, dt \le 2\kappa^2 (C_c N_v + C_r)\, T_{\mathrm{conv}} \tag{77} \]
+
+According to the convergence time bounds of Appendix E, T_conv = O(τ_v). Thus, E_weights = O(N_v) as
+a function of system size.
+
+Energy consumption of capacitors. Let each neuron node voltage be bounded by hardware limits
+|u_i(t)|, |w_µ(t)| ≤ κ. Charging a capacitor of capacitance C from a supply through a resistive path draws
+CV² from the power supply.
The number of times each capacitor charges is finite because the Lyapunov
+energy of the DenseAM forbids limit cycles. This means the total supply energy per node can be bounded
+by a constant. Therefore, the total energy needed to (re)charge all neuron capacitors is bounded by
+
+    \[ E_{\mathrm{capacitors}} \le O(1) \cdot \kappa^2 \left( \sum_{i=1}^{N_v} C_i^{(v)} + \sum_{\mu=1}^{N_h} C_\mu^{(h)} \right) = O(N_v + N_h) \tag{78} \]
+
+Energy consumption of amplifiers, bias, control, and overhead. Per neuron, the energy expended
+on amplifier inefficiency, bias terms, and general overhead does not depend on system size. For a
+runtime of duration T_conv, the energy consumption of these elements in the entire network scales as
+
+    \[ E_{\mathrm{other}} = O\big((N_v + N_h)\, T_{\mathrm{conv}}\big) \tag{79} \]
+
+Combined energy consumption. All together, the total energy consumption can be written as
+
+    \[ E_{\mathrm{total}} = O(N_v + N_h) \tag{80} \]
+
+
+H Model Specifications and Details
+Table 3, Table 4, and Table 5 summarize the model design for the XOR, Hamming (7,4), and parity
+DenseAM models.
+
+                          Table 3: XOR model specification
+
+Visible neurons v_i                    N_v = 3 (inputs v_1, v_2 clamped to {0,1}; output v_3 free)
+Hidden neurons h_µ                     N_h = 4 (one per truth-table row)
+Visible activation and Lagrangian      Identity: g_i = v_i, L_v = (1/2) Σ_{i=1}^{N_v} v_i²
+Hidden activation and Lagrangian       Softmax: f_µ = softmax(βh_µ), L_h = (1/β) log Σ_{µ=1}^{N_h} e^{βh_µ}
+Visible biases                         a_i = 0
+Hidden biases                          b_µ = −(1/2) Σ_{i=1}^{N_v} ξ_{µi}²
+Weights ξ                              ξ ∈ {0,1}^{4×3}, rows encode memories:
+                                         ξ = [ 0 0 0 ]
+                                             [ 0 1 1 ]
+                                             [ 1 0 1 ]
+                                             [ 1 1 0 ]
+Inference protocol                     Clamp (v_1, v_2) to input values; read out v_3 at convergence
+
+
+                          Table 4: Hamming (7,4) model specification
+
+Visible neurons (N_v)                  7 (codeword bits)
+Hidden neurons (N_h)                   16 (one per valid codeword)
+Visible activation                     Identity: g_i = v_i
+Hidden activation                      Softmax over µ ∈ {1, ..., 16} with temperature β
+Visible biases                         a_i = 0
+Hidden biases                          b_µ = −(1/2) Σ_{i=1}^{N_v} ξ_{µi}²
+Weights ξ                              ξ ∈ {0,1}^{16×7}, each row is a valid Hamming(7,4) codeword
+Inference protocol                     Initialize visible neurons to corrupted 7-bit input codeword; let all visible and
+                                       hidden neurons evolve; converged visible neurons give the corrected codeword
+
+
+                          Table 5: 8-bit parity model specification
+
+Visible neurons v_i                    N_v = 16 (dimension of embedding D)
+Hidden neurons (energy attention)      h^attn_A, N_h^attn = 8 (context length L)
+Hidden neurons (Hopfield network)      h^hopf_µ, N_h^hopf = 16 (Hopfield network memories M)
+Hidden neurons (total)                 N_h = 24 (L + M)
+Visible activation                     Identity: g_i = v_i
+Hidden activation (energy attention)   Softmax: f^attn_A = softmax(βh^attn)_A for A = 1, ..., L
+Hidden activation (Hopfield network)   ReLU: f^hopf_µ = max(h^hopf_µ, 0) for µ = 1, ..., M
+Weights (energy attention)             ξ^attn ∈ R^{L×D}, where ξ^attn_A is the embedded A’th context token
+Weights (Hopfield network)             ξ^hopf ∈ R^{M×D}, static after training
+Inference protocol                     Embed L context tokens to obtain ξ^attn. Let visible neurons
+                                       evolve until convergence
+
+
+H.1 Bit string energy transformer implementation
+As described in Table 5, our trained model uses an embedding matrix of 2 × D = 32 parameters, the
+Hopfield network with D × M = 256 parameters, an additional D × 2 = 32 parameter matrix to decode
+embeddings to logits, a total of D + L + M = 40 neuron bias terms, and 2 biases for the linear decoder.
+This is a total of 362 parameters.
+    In training and inference we use time constants τ_v = 0.1 and τ_h = 0.01. We train with Euler steps of
+1e-3, and test with Euler steps of 1e-4 for a time horizon of T = 1 second. JAX’s automatic differentiation
+was used to implement backpropagation through time. We encourage the model to reach fixed points
+by penalizing v̇ at time T.
This yields models that are more robust to hardware imperfection due to the
+intrinsic stability of attractor points. The convergence to an attractor also means the inference remains
+stable to mismatch and delay in timing during readout.
+
+
+I Hardware analysis
+I.1 Hardware speed analysis
+As discussed in subsection 7.1, the convergence time of analog DenseAMs is governed not by system size,
+but rather primarily by the timescales of the dynamics in hardware. These timescales are set by the time
+constants τ_v and τ_h. The smaller these time constants, the faster the dynamics move, and the faster the
+system converges. In this section, we derive bounds on the minimum time constant min{τ_v, τ_h} of the
+DenseAM, which is limited by the constraints of active components like amplifiers.
+    The maximum speed of neuronal dynamics is limited by the ability of active stages (op-amps/buffers)
+to track changing signals. If the input slope to an active stage exceeds its slew rate (SR), the output
+distorts; if the signal spectrum approaches or exceeds the stage’s closed-loop bandwidth, attenuation
+and phase lag appear. Here, we derive lower bounds on the time constants τ_v, τ_h imposed by (i) finite
+gain–bandwidth product (GBW) and (ii) finite SR of the three active stages in the neuron design
+(Appendix A). Without loss of generality we will express the derivation for the hidden neurons, with the
+derivations for visible neurons following by symmetry. Throughout, define the following:
+
+  • State swing: |v_i(t)| ≤ A_v, so that |v̇_i| ≲ A_v/τ. Similarly, |h_µ(t)| ≤ A_h, so that |ḣ_µ| ≲ A_h/τ.
+  • Activation swing: Visible activation g(·) is Lipschitz with slope bound L_g = sup_x |g′(x)|. Then,
+    |ġ_i| ≤ L_g |v̇_i| ≤ L_g A_v/τ. Similarly, hidden activation f(·) is Lipschitz with slope bounded by
+    L_f = sup_x |f′(x)|. Then, |ḟ_µ| ≤ L_f |ḣ_µ| ≤ L_f A_h/τ.
+  • Weights ξ ≥ 0.
Hardware normalization gives per-row/column conductance budgets, so the
+    self-term gain for hidden neuron µ is A_{self,µ} = Σ_i ξ_{µi} = O(1).
+We will derive three independent lower bounds and then take the max:
+
+    \[ \tau_{\min} \ge \max\{\, \underbrace{\tau_{\mathrm{GBW}}}_{\text{tracking small signals}},\ \underbrace{\tau_{\mathrm{SR}}}_{\text{edge/large signals}},\ \underbrace{\tau_{I\text{-limit}}}_{\text{output current}} \,\} \tag{81} \]
+
+I.1.1 Gain-bandwidth product bound
+For a single-pole op-amp with gain-bandwidth product GBW in a closed-loop configuration with closed-loop
+gain A_CL, the −3 dB bandwidth is f_c ≈ GBW/A_CL. In order for the neuron to faithfully track with a
+time constant τ, we require f_c ≳ 1/(2πτ) for every stage in the signal path. Closed-loop gains for each
+of the op-amps are: A_CL(U1) = 1 because it is a unity-gain buffer, A_CL(U2) = A_self because it needs
+to realize the self-term gain, and A_CL(U3) ≈ 1 because it is a unity-gain summer. Assuming the same
+op-amp design for U1, U2, and U3, and taking the worst case,
+
+    \[ \tau_{\mathrm{GBW}} = \frac{\max(1, A_{\mathrm{self}})}{2\pi\, \mathrm{GBW}} \tag{82} \]
+
+I.1.2 Slew rate bound
+The slew-rate limits cap the maximum output slope of each op-amp stage:
+  • U1: activation buffer. |ḟ_µ| ≤ L_f A_h/τ, which gives τ ≥ (L_f A_h)/SR_U1.
+
+Table 6: Estimated neuron time constants and conservative convergence times with A_v = A_h = 1 V,
+L_g = 1, A_self = 1 for representative amplifiers in literature. GBW bound τ_GBW = 1/(2π GBW); SR bound
+τ_SR = L_g A_v/SR (visible path). Overall τ_min = max{τ_GBW, τ_SR}; we report T_conv = 10 τ_min.
+
+CMOS Amplifier (ref.)                       SR (V/µs)   GBW (MHz)   τ_SR (ns)   τ_GBW (ns)   T_conv (ns)
+Perez and Maloberti [36]                        84.50      321.50       11.83         0.50        118.34
+Assaad and Silva-Martinez [37]                  94.10      134.20       10.63         1.19        106.27
+Yen and Blalock [38]                           202.00       10.70        4.95        14.87        148.74
+Naderi, Prakash, and Silva-Martinez [39]      1250.00     3600.00        0.80         0.04          8.00
+Schlögl and Zimmermann [40]                   1650.00     2510.00        0.61         0.06          6.06
+Notes. (i) τ_SR values assume the visible path dominates the summer’s SR (low/moderate-β).
If softmax dominates at U3
+  in the high-β regime, multiply SR-limited values by κ = (β/2)(A_h/A_v) (with A_h = A_v = 1 V, simply β/2). (ii) The
+  current-limit bound τ_{I-limit} = C A_v/I_max is typically ≪ all reported values for C ∼ 50 fF and I_max ∼ mA, so it is omitted
+  from the table but must still be respected in circuit sizing.
+
+  • U2: self-term. s_µ = A_self f_µ, so |ṡ_µ| = A_self |ḟ_µ| ≤ (A_self L_f A_h)/τ, which gives τ ≥ (A_self L_f A_h)/SR_U2.
+  • U3: internal state drive. The time-varying portion of the RC circuit drive d_µ is a linear combination
+    of f_µ and g_i, with coefficients that have a maximum magnitude of A_self. Using the bounds on
+    the slopes of those inputs, we get the following bound on |ḋ_µ| and subsequently the time constant
+    bound:
+
+    \[ |\dot{d}_\mu| \lesssim \frac{A_{\mathrm{self}}}{\tau} \max\{L_f A_h,\ L_g A_v\} \quad \Rightarrow \quad \tau \ge \frac{A_{\mathrm{self}} \max(L_f A_h,\ L_g A_v)}{\mathrm{SR}_{U3}} \tag{83} \]
+
+All together, the combined constraint is
+
+    \[ \tau_{\mathrm{SR}} = \max\left\{ \frac{L_f A_h}{\mathrm{SR}_{U1}},\ \frac{A_{\mathrm{self}} L_f A_h}{\mathrm{SR}_{U2}},\ \frac{A_{\mathrm{self}} \max(L_f A_h,\ L_g A_v)}{\mathrm{SR}_{U3}} \right\} \tag{84} \]
+
+I.1.3 Current / headroom limit
+U3 must provide the current through R_2 to charge C_1. The RC circuit dynamics dictate
+R_2 C_1 ḣ_µ = −h_µ + d_µ, so the instantaneous current needed by U3 is
+
+    \[ I_{U3,\mathrm{out}} = \frac{d_\mu - h_\mu}{R_2} = C_1 \dot{h}_\mu \tag{85} \]
+
+We must respect |I_{U3,out}| ≤ I_{max,U3}. With |ḣ_µ| ≲ A_h/τ,
+
+    \[ \tau_{I\text{-limit}} \ge \frac{C_1 A_h}{I_{\max,U3}} \tag{86} \]
+
+I.1.4 Combined bound on minimum time constant
+Taken together, the minimum time constant must satisfy the bounds (82), (84), and (86):
+
+    \[ \tau_{\min} \ge \max\{\tau_{\mathrm{GBW}},\ \tau_{\mathrm{SR}},\ \tau_{I\text{-limit}}\} \tag{87} \]
+
+I.2 Estimates of inference times with existing hardware
+Under standard assumptions for DenseAMs (symmetric couplings and monotone activations), the
+Lyapunov energy decreases monotonically and the dynamics converge without oscillations. The settling time
+is therefore on the order of a few multiples of the largest neuronal time constant, which we bound by
+amplifier non-idealities.
In this section we take some representative examples of op-amps from the literature
+and estimate the inference speeds from reasonable and representative design parameters.
+
+Minimum time constant. For illustration purposes, we choose three reasonable hardware constraints:
+  • Activation slopes. Take the slope of the visible activation to be L_g = 1, as occurs for
+    an identity visible neuron activation. Take the worst-case (maximum) slope of the hidden activation
+    to be that of the softmax with fixed β, whose Jacobian is βG(f) with ∥G(f)∥_2 ≤ 1/2, so a safe
+    global bound is L_f ≤ β/2.
+  • Signal swing. Use the voltage scaling invariance (see Appendix F) to rescale v, ξ, and β together
+    to pick a swing that is slew-rate friendly but well above component noise limits. Take both
+    A_v = A_h = 1 V.
+  • Self-term gain. With row/column budgets, use A_self = 1 as a worst-case bound.
+With those choices, the three lower bounds per neuron are:
+
+  1. GBW Bound: τ_GBW = max(1, A_self)/(2π GBW) = 1/(2π GBW).
+  2. SR Bound: The U1/U2 path gives τ_{SR,vis} = L_g A_v/SR = (1/SR) µs, with SR in V/µs. In the U3
+     (summer) path, equation (84) has two cases. In the low-β regime where L_g A_v ≥ L_f A_h, the U3
+     bound reduces to (1/SR) µs. In the high-β regime where L_f A_h = β/2 dominates, scale the
+     slew-rate limited bound by β/2.
+  3. Output Current Bound: In practice, this bound generally does not limit the op-amp choice:
+     even with a large capacitor C = 50 fF, A_v = 1 V, I_max = 2 mA, τ_{I-limit} ≈ 0.025 ns, which is negligible
+     compared to the bounds from SR and GBW.
+To quantify realistic inference speeds, Table 6 lists representative CMOS operational transconductance
+amplifiers (OTAs; see footnote 3) drawn from recent literature, together with their corresponding lower
+bounds on neuronal time constants under the GBW and slew-rate limits.
Even using conservative assumptions
+with existing amplifier designs, the analysis shows that modern high-speed OTAs can achieve sub-10 ns
+neuronal convergence times, corresponding to inference rates in the hundreds of megahertz.
+
+
+J Connection between analog and canonical Energy Transformer
+In this section we show that in the adiabatic limit, our Analog Energy Transformer (Analog ET) reduces
+to the canonical Energy Transformer. Begin with the dynamics for the Analog Energy Transformer
+implemented by our circuit designs.
+
+    \[ \tau_v \dot{v} = -\frac{\partial E}{\partial v} = (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}} + (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}} + a - v \tag{88} \]
+    \[ \tau_h \dot{h}^{\mathrm{attn}} = -\frac{\partial E}{\partial f^{\mathrm{attn}}} = \xi^{\mathrm{attn}} v + b - h^{\mathrm{attn}} \tag{89} \]
+    \[ \tau_h \dot{h}^{\mathrm{hopf}} = -\frac{\partial E}{\partial f^{\mathrm{hopf}}} = \xi^{\mathrm{hopf}} v + c - h^{\mathrm{hopf}} \tag{90} \]
+
+Integrating out hidden neurons in the adiabatic limit where τ_h → 0, we see the relations
+
+    \[ h^{\mathrm{attn}}(v) = \xi^{\mathrm{attn}} v + b \tag{91} \]
+    \[ h^{\mathrm{hopf}}(v) = \xi^{\mathrm{hopf}} v + c \tag{92} \]
+
+which we can use to express the hidden neuron activations as functions of v, with the softmax taken at
+temperature β as in Table 5:
+
+    \[ f^{\mathrm{attn}}(v) = \mathrm{softmax}\big(\beta(\xi^{\mathrm{attn}} v + b)\big) \tag{93} \]
+    \[ f^{\mathrm{hopf}}(v) = \mathrm{ReLU}\big(\xi^{\mathrm{hopf}} v + c\big) \tag{94} \]
+
+Substituting into the visible dynamics:
+
+    \[ \tau_v \dot{v} = (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}}(v) + (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}}(v) + a - v \tag{95} \]
+
+Footnote 3: Many high-speed CMOS “op-amps” are reported as OTAs (transconductors). In our neuron, these OTA
+cores operate in closed-loop (unity/non-inverting) configurations, so the literature SR and GBW directly
+constrain τ via Eqs. (82)–(84).
+
+We can ask ourselves, what scalar energy produces this ODE? We seek an energy E_eff(v) such that
+τ_v v̇ = −∂E_eff/∂v. Equivalently,
+
+    \[ \nabla_v E_{\mathrm{eff}}(v) = v - a - (\xi^{\mathrm{attn}})^\top f^{\mathrm{attn}}(v) - (\xi^{\mathrm{hopf}})^\top f^{\mathrm{hopf}}(v) \tag{96} \]
+
+We can construct E_eff(v) as a sum of three pieces whose gradients match each term:
+E_eff(v) = E_quad(v) + E_attn(v) + E_hopf(v). By inspection we see that E_quad(v) = (1/2)∥v − a∥².
+
+Attention term. The energy function
+
+    \[ E_{\mathrm{attn}}(v) = -\frac{1}{\beta} \log \sum_{A} \exp\!\Big(\beta\big(\xi^{\mathrm{attn}}_A v + b_A\big)\Big) \tag{97} \]
+
+satisfies our requirement.
Differentiating with respect to v_i, we get
+
+    \[ \frac{\partial E_{\mathrm{attn}}}{\partial v_i} = -\sum_{A} \mathrm{softmax}\big(\beta(\xi^{\mathrm{attn}} v + b)\big)_A \cdot \xi^{\mathrm{attn}}_{Ai} \tag{98} \]
+    \[ \phantom{\frac{\partial E_{\mathrm{attn}}}{\partial v_i}} = -\sum_{A} \xi^{\mathrm{attn}}_{Ai} f^{\mathrm{attn}}_A(v) \tag{99} \]
+
+which yields our desired gradient ∇_v E_attn(v) = −(ξ^attn)^⊤ f^attn(v).
+
+Hopfield term. A simple way to achieve the desired dynamics is with a Hopfield-type energy function
+
+    \[ E_{\mathrm{hopf}}(v) = -\sum_{\mu} \frac{1}{2} \mathrm{ReLU}\big(\xi^{\mathrm{hopf}}_\mu v + c_\mu\big)^2 \tag{100} \]
+
+whose derivative with respect to v_i yields
+
+    \[ \frac{\partial E_{\mathrm{hopf}}}{\partial v_i} = -\sum_{\mu} \mathrm{ReLU}\big(\xi^{\mathrm{hopf}}_\mu v + c_\mu\big) \cdot \xi^{\mathrm{hopf}}_{\mu i} \tag{101} \]
+    \[ \phantom{\frac{\partial E_{\mathrm{hopf}}}{\partial v_i}} = -\sum_{\mu} \xi^{\mathrm{hopf}}_{\mu i} f^{\mathrm{hopf}}_\mu(v) \tag{102} \]
+
+which yields our desired gradient ∇_v E_hopf(v) = −(ξ^hopf)^⊤ f^hopf(v).
+
+Effective energy function of analog energy transformer. All together, the effective scalar energy
+over the visible state v after integrating out hidden neurons is
+
+    \[ E_{\mathrm{eff}}(v) = \underbrace{\frac{1}{2}\|v - a\|_2^2}_{E_{\mathrm{quad}}} \ \underbrace{-\ \frac{1}{\beta} \log \sum_{A} \exp\!\Big(\beta\big(\xi^{\mathrm{attn}}_A v + b_A\big)\Big)}_{E_{\mathrm{attn}}} \ \underbrace{-\ \sum_{\mu} \frac{1}{2} \mathrm{ReLU}\big(\xi^{\mathrm{hopf}}_\mu v + c_\mu\big)^2}_{E_{\mathrm{hopf}}} \tag{103} \]
+
+This effective energy aligns with the canonical Energy Transformer’s energy function. Because our
+dynamics use hidden neurons, the energy function written in the main text reflects the contributions
+of the hidden neurons. When τ_h ≪ τ_v, the system converges to the behavior obtained when the hidden
+neurons are integrated out. Hence, the effective expressibility and behavior of our system is equivalent to that of
+the original Energy Transformer.
+    In our model we omit the layer normalization activation that the original Energy Transformer applies
+to the visible neurons. This keeps the circuit design simple, while still enabling models with high
+expressibility. This choice does not modify the structure of the attention or the Hopfield parts of the
+energy; only the self-energy of v differs. From a modeling perspective, layer normalization mainly
+improves conditioning and learning of deep networks rather than changing the computational primitive
+and expressibility.
We empirically observe that the resulting models without layer normalization remain
+expressive enough to solve the problems we present. In principle, a layer normalization-type visible
+activation function could be implemented in analog hardware (e.g. by subtracting the mean voltage
+and normalizing by an on-chip variance estimate), but this would add distracting complications to the
+minimalist neuron and circuit designs we show in this paper.
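As a sanity check on the effective energy of Eq. (103), its gradient from Eq. (96) can be compared against a finite-difference estimate. The sketch below is not from the paper: all weights, biases, the state `v0`, and β = 2 are made-up illustrative values, the function names are hypothetical, and the softmax is taken at temperature β as in Table 5.

```python
import math


def matvec(matrix, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def effective_energy(v, a, xi_attn, b, xi_hopf, c, beta):
    """E_eff(v) = E_quad + E_attn + E_hopf, following Eq. (103)."""
    quad = 0.5 * sum((vi - ai) ** 2 for vi, ai in zip(v, a))
    s = [beta * (h + bias) for h, bias in zip(matvec(xi_attn, v), b)]
    top = max(s)  # shift for a stable log-sum-exp
    lse = (top + math.log(sum(math.exp(x - top) for x in s))) / beta
    hopf = sum(0.5 * max(h + bias, 0.0) ** 2
               for h, bias in zip(matvec(xi_hopf, v), c))
    return quad - lse - hopf


def effective_gradient(v, a, xi_attn, b, xi_hopf, c, beta):
    """Analytic gradient v - a - xi_attn^T f_attn - xi_hopf^T f_hopf, as in Eq. (96)."""
    s = [beta * (h + bias) for h, bias in zip(matvec(xi_attn, v), b)]
    top = max(s)
    w = [math.exp(x - top) for x in s]
    total = sum(w)
    f_attn = [x / total for x in w]
    f_hopf = [max(h + bias, 0.0) for h, bias in zip(matvec(xi_hopf, v), c)]
    grad = []
    for i in range(len(v)):
        attn = sum(row[i] * f for row, f in zip(xi_attn, f_attn))
        hopf = sum(row[i] * f for row, f in zip(xi_hopf, f_hopf))
        grad.append(v[i] - a[i] - attn - hopf)
    return grad


def finite_difference(v, params, step=1e-6):
    """Central-difference estimate of the gradient of effective_energy."""
    grad = []
    for i in range(len(v)):
        up, down = list(v), list(v)
        up[i] += step
        down[i] -= step
        grad.append((effective_energy(up, *params)
                     - effective_energy(down, *params)) / (2.0 * step))
    return grad


# Assumed toy parameters: D = 2 visible units, L = 3 context tokens, M = 2 memories.
params = ([0.1, -0.2],                                   # a
          [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]], [0.0, 0.1, -0.1],  # xi_attn, b
          [[1.0, 0.5], [-0.4, 0.9]], [0.2, -0.3],        # xi_hopf, c
          2.0)                                           # beta
v0 = [0.3, -0.6]
analytic = effective_gradient(v0, *params)
numeric = finite_difference(v0, params)
print(analytic, numeric)
```

The two printed gradients should agree to within finite-difference error, which is consistent with τ_v v̇ = −∇_v E_eff(v) reproducing the visible dynamics of Eq. (95) at this test point.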
\ No newline at end of file
