Recursive reasoning dynamics: analysis pipeline, paper drafts, toy models

Failure=more-chaotic (task-general under validity labeling) reduces to convergence/completeness detection; mechanism (transient chaos vs multistability vs input-induced) under investigation. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
author: YurenHao0426 <blackhao0426@gmail.com> 2026-06-29 12:15:51 -0500
committer: YurenHao0426 <blackhao0426@gmail.com> 2026-06-29 12:15:51 -0500
commit: a6ec4288a2232988b130b2f00bb2565f81706966 (patch)
tree: 1bb86e7f0b899b823b9e7fdf383e832d30a181e0 /papers
6 files changed, 7947 insertions, 0 deletions
diff --git a/papers/txt/engelken2023_gradient_flossing.txt b/papers/txt/engelken2023_gradient_flossing.txt
new file mode 100644
index 0000000..6d70608
--- /dev/null
+++ b/papers/txt/engelken2023_gradient_flossing.txt
@@ -0,0 +1,1879 @@
+                                              Gradient Flossing: Improving Gradient Descent
+                                                 through Dynamic Control of Jacobians
+
+
+                                                                                Rainer Engelken
+                                                                      Zuckerman Mind Brain Behavior Institute
+                                                                               Columbia University
+                                                                                New York, USA
+                                                                             re2365@columbia.edu
+
+arXiv:2312.17306v1 [cs.LG] 28 Dec 2023
+
+
+
+                                                                                      Abstract
+                                                  Training recurrent neural networks (RNNs) remains a challenge due to the instability
+                                                  of gradients across long time horizons, which can lead to exploding and
+                                                  vanishing gradients. Recent research has linked these problems to the values
+                                                  of Lyapunov exponents for the forward-dynamics, which describe the growth or
+                                                  shrinkage of infinitesimal perturbations. Here, we propose gradient flossing, a
+                                                  novel approach to tackling gradient instability by pushing Lyapunov exponents
+                                                  of the forward dynamics toward zero during learning. We achieve this by regularizing
+                                                  Lyapunov exponents through backpropagation using differentiable linear
+                                                  algebra. This enables us to "floss" the gradients, stabilizing them and thus improving
+                                                  network training. We demonstrate that gradient flossing controls not only the
+                                                  gradient norm but also the condition number of the long-term Jacobian, facilitating
+                                                  multidimensional error feedback propagation. We find that applying gradient
+                                                  flossing prior to training enhances both the success rate and convergence speed for
+                                                  tasks involving long time horizons. For challenging tasks, we show that gradient
+                                                  flossing during training can further increase the time horizon that can be bridged
+                                                  by backpropagation through time. Moreover, we demonstrate the effectiveness of
+                                                  our approach on various RNN architectures and tasks of variable temporal complexity.
+                                                  Additionally, we provide a simple implementation of our gradient flossing
+                                                  algorithm that can be used in practice. Our results indicate that gradient flossing
+                                                  via regularizing Lyapunov exponents can significantly enhance the effectiveness of
+                                                  RNN training and mitigate the exploding and vanishing gradients problem.
+
+
+                                         1   Introduction
+                                         Recurrent neural networks are commonly used both in machine learning and computational
+                                         neuroscience for tasks that involve input-to-output mappings over sequences and dynamic trajectories.
+                                         Training is often achieved through gradient descent by the backpropagation of error information
+                                         across time steps [1, 2, 3, 4]. This amounts to unrolling the network dynamics in time and recursively
+                                         applying the chain rule to calculate the gradient of the loss with respect to the network parameters.
+                                         Mathematically, evaluating the product of Jacobians of the recurrent state update describes how error
+                                         signals travel across time steps. When trained on tasks that have long-range temporal dependencies,
+                                         recurrent neural networks are prone to exploding and vanishing gradients [5, 6, 7, 8]. These arise from
+                                         the exponential amplification or attenuation of recursive derivatives of recurrent network states over
+                                         many time steps. Intuitively, to evaluate how an output error depends on a small parameter change at
+                                         a much earlier point in time, the error information has to be propagated through the recurrent network
+                                         states iteratively. Mathematically, this corresponds to a product of Jacobians that describe how
+                                         changes in one recurrent network state depend on changes in the previous network state. Together,
+                                         this product forms the long-term Jacobian. The singular value spectrum of the long-term Jacobian
+regulates how well error signals can propagate backwards along multiple time steps, allowing temporal
+credit assignment. A close mathematical correspondence of these singular values and the Lyapunov
+exponents of the forward dynamics was established recently [9, 10, 11, 12]. Lyapunov exponents
+characterize the asymptotic average rate of exponential divergence or convergence of nearby initial
+conditions and are a cornerstone of dynamical systems theory [13, 14]. We will use this link to
+improve the trainability of RNNs.
+
+37th Conference on Neural Information Processing Systems (NeurIPS 2023).
+
+Previous approaches that tackled the problem of exploding or vanishing gradients have suggested
+solutions at different levels. First, specialized units such as LSTM and GRU were introduced,
+which have additional latent variables that can be decoupled from the recurrent network states via
+multiplicative (gating) interactions. The gating interactions shield the latent memory state, which can
+therefore transport information across multiple time steps [5, 6, 15]. Second, exploding gradients can
+be avoided by gradient clipping, which rescales the gradient norm [16] or the individual gradient elements
+[17] if they become too large [18]. Third, normalization schemes like batch normalization prevent
+saturated nonlinearities that contribute to vanishing gradients [19]. Fourth, it was suggested that the
+problem of exploding/vanishing gradients can be ameliorated by specialized network architectures,
+for example, antisymmetric networks [20], orthogonal/unitary initializations [21, 22, 23], coupled
+oscillatory RNNs [24], Lipschitz RNNs [25], linear recurrent units [26], echo state networks [27, 28],
+(recurrent) highway networks [29, 30], and stable limit cycle neural networks [11, 31, 32]. Fifth, for
+large networks, a suitable choice of weights can guarantee a well-conditioned Jacobian at initialization
+[21, 33, 34, 35, 36, 37, 38, 39, 40, 41]. These initializations are based on mean-field methods, which
+become exact only in the large-network limit. Such initialization schemes have also been suggested
+for gated networks [40]. However, even when initializing the network with well-behaved gradients,
+gradients will typically not retain their stability during training once the network parameters have
+changed.
+Here, we propose a novel approach to tackling this challenge by introducing gradient flossing, a
+technique that keeps gradients well-behaved throughout training. Gradient flossing is based on
+a recently described link between the gradients of backpropagation through time and Lyapunov
+exponents, which are the time-averaged logarithms of the singular values of the long-term Jacobian
+[9, 11, 12, 32]. Gradient flossing regularizes one or several Lyapunov exponents to keep them close
+to zero during training. This improves not only the error gradient norm but also the condition number
+of the long-term Jacobian. As a result, error signals can be propagated back over longer time horizons.
+We first demonstrate that the Lyapunov exponents can be controlled during training by including an
+additional loss term. We then demonstrate that gradient flossing improves the gradient norm and
+effective dimension of the gradient signal. We find empirically that gradient flossing improves test
+accuracy and convergence speed on synthetic tasks over a range of temporal complexities. Finally,
+we find that gradient flossing during training further helps to bridge long time horizons and show
+that it combines well with other approaches to ameliorate exploding and vanishing gradients, such as
+dynamic mean-field theory for initialization, orthogonal initialization and gated units.
+Our contributions include:
+
+       • Gradient flossing, a novel approach to the problem of exploding and vanishing gradients in
+         recurrent neural networks based on regularization of Lyapunov exponents.
+
+       • Analytical estimates of the condition number of the long-term Jacobian based on Lyapunov
+         exponents.
+
+       • Empirical evidence that gradient flossing improves training on tasks that involve bridging
+         long time horizons.
+
+
+2   RNN Gradients and Lyapunov Exponents
+
+We begin by revisiting the established mathematical relationship between the gradients of the loss
+function, computed via backpropagation through time, and Lyapunov exponents [9, 12], and how
+it relates to the problem of vanishing and exploding gradients. In backpropagation through time,
+network parameters θ are iteratively updated by stochastic gradient descent such that a loss L_t is
+locally reduced [1, 2, 3, 4]. For RNN dynamics h_{s+1} = f_θ(h_s, x_{s+1}), with recurrent network state
+h, external input x, and parameters θ, the gradient of the loss L_t with respect to θ is evaluated by
+unrolling the network dynamics in time. The resulting expression for the gradient is given by:
+
+\[
+\frac{\partial L_t}{\partial \theta}
+= \frac{\partial L_t}{\partial h_t} \sum_{\tau=t-l}^{t-1} \left( \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} \right) \frac{\partial h_\tau}{\partial \theta}
+= \frac{\partial L_t}{\partial h_t} \sum_{\tau} T_t(h_\tau) \frac{\partial h_\tau}{\partial \theta}
+\tag{1}
+\]
+
+where T_t(h_τ) is composed of a product of one-step Jacobians D_s = ∂h_{s+1}/∂h_s:
+
+\[ T_t(h_\tau) = \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} = \prod_{\tau'=\tau}^{t-1} D_{\tau'} \tag{2} \]
+
+Due to the chain of matrix multiplications in T_t, the gradients tend to vanish or explode exponentially
+with time. This complicates training, particularly when the task loss at time t depends on inputs x
+or states h from many time steps prior, which creates long temporal dependencies [5, 6, 7, 8]. How
+well error signals can propagate back in time is constrained by the tangent space dynamics along
+the trajectory h_t, which dictate how local perturbations around each point on the trajectory stretch, rotate,
+shear, or compress as the system evolves.
+The singular values of the Jacobian product T_t, which determine how quickly gradients vanish or
+explode during backpropagation through time, are directly related to the Lyapunov exponents of the
+forward dynamics [9, 12]: Lyapunov exponents λ_1 ≥ λ_2 ≥ ... ≥ λ_N are defined as the asymptotic
+time-averaged logarithms of the singular values of the long-term Jacobian [13, 42, 43]
+
+\[ \lambda_i = \lim_{t \to \infty} \frac{1}{t - \tau} \log(\sigma_{i,t}) \tag{3} \]
+
+where σ_{i,t} denotes the ith singular value of T_t(h_τ) with σ_{1,t} ≥ σ_{2,t} ≥ ... ≥ σ_{N,t} (see Appendix I
+for details). This means that positive Lyapunov exponents in the forward dynamics correspond to
+exponentially exploding gradient modes, while negative Lyapunov exponents in the forward dynamics
+correspond to exponentially vanishing gradient modes.
+In summary, the Lyapunov exponents give the average asymptotic exponential growth rates of
+infinitesimal perturbations in the tangent space of the forward dynamics, which also constrain the
+signal propagation in backpropagation for long time horizons. Lyapunov exponents close to zero in
+the forward dynamics correspond to tangent space directions along which error signals are neither
+drastically attenuated nor amplified in backpropagation through time. Such close-to-neutral modes in
+the tangent dynamics can propagate information reliably across many time steps.
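As a minimal illustration of Eq. 3 (a sketch added for clarity, not taken from the paper), consider a constant one-step Jacobian. The long-term Jacobian is then a matrix power, and the time-averaged log singular values recover the exponents directly. The matrix `D` and the horizon `t` are arbitrary choices made for this example.

```python
import numpy as np

# Toy one-step Jacobian (an assumption for illustration): one expanding,
# one neutral and one contracting direction.
D = np.diag([2.0, 1.0, 0.5])
t = 20  # time horizon t - tau

# Long-term Jacobian T_t as a product of t identical one-step Jacobians (Eq. 2).
T = np.linalg.matrix_power(D, t)

# Eq. 3 evaluated at finite t: time-averaged logarithms of the singular values.
sigma = np.linalg.svd(T, compute_uv=False)  # returned in descending order
lyapunov = np.log(sigma) / t

print(lyapunov)               # close to [log 2, 0, -log 2]
print(sigma[0] / sigma[-1])   # condition number of T_t, here 4**t
```

The positive exponent corresponds to an exploding gradient mode, the negative one to a vanishing mode, and the zero exponent to a direction that carries error signals without amplification or attenuation. Evaluating the product directly only works for short horizons, which is why the paper switches to repeated QR reorthonormalization.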
+
+3   Gradient Flossing: Idea and Algorithm
+We now leverage the mathematical connection established between Lyapunov exponents and the
+prevalent issue of exploding and vanishing gradients for regularizing the singular values of the
+long-term Jacobian. We term this procedure gradient flossing. To prevent exploding and vanishing
+gradients, we constrain Lyapunov exponents to be close to zero. This ensures that the corresponding
+directions in tangent space grow and shrink on average only slowly. This leads to a better-conditioned
+long-term Jacobian T_t(h_τ). We achieve this by using the sum of the squares of the k largest
+Lyapunov exponents λ_1, λ_2, ..., λ_k as a loss function:
+
+\[ L_{\text{flossing}} = \sum_{i=1}^{k} \lambda_i^2 \tag{4} \]
+
+and evaluate the gradient obtained from backpropagation through time:
+
+\[ \frac{\partial L_{\text{flossing}}}{\partial \theta} = \sum_{i=1}^{k} \frac{\partial \lambda_i^2}{\partial \theta} \tag{5} \]
+
+This might seem like an ill-fated enterprise, as the gradient expression in Eq 5 suffers from its
+own problem of exploding and vanishing gradients. However, instead of calculating the Lyapunov
+exponents by directly evaluating the long-term Jacobian Tt (Eq 2), we use an established iterative
+reorthonormalization method involving QR decomposition that avoids directly evaluating the
+ill-conditioned long-term Jacobian [12, 44].
+First, we evolve an initially orthonormal system Q_s = [q_s^1, q_s^2, ..., q_s^k] in the tangent space along the
+trajectory using the Jacobian D_s = ∂h_{s+1}/∂h_s. This means calculating
+
+\[ \tilde{Q}_{s+1} = D_s Q_s \tag{6} \]
+
+at every time step. Second, we extract the exponential growth rates using the QR decomposition,
+
+\[ \tilde{Q}_{s+1} = Q_{s+1} R_{s+1}, \]
+
+which decomposes \tilde{Q}_{s+1} uniquely into the product of an orthonormal matrix Q_{s+1} of size N × k,
+so Q_{s+1}^⊤ Q_{s+1} = 1_{k×k}, and an upper triangular matrix R_{s+1} of size k × k with positive diagonal
+elements. Note that the QR decomposition does not have to be applied at every step, just sufficiently
+often, i.e., once every t_ONS steps, such that \tilde{Q} does not become ill-conditioned.
+The Lyapunov exponents are given by time-averaged logarithms of the diagonal entries of R_s [43, 44]:
+
+\[ \lambda_i = \lim_{t \to \infty} \frac{1}{t} \log \prod_{s=1}^{t} R^s_{ii} = \lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \log R^s_{ii}. \tag{7} \]
+
+This way, the Lyapunov exponent can be expressed in terms of a temporal average over the diagonal
+elements of the R_s-matrix of a QR decomposition of the iterated Jacobian. To propagate the gradient
+of the square of the Lyapunov exponents backward through time in gradient flossing, we used an
+analytical expression for the pullback (backward pass) of the QR decomposition [45, 46, 47, 48]:
+
+\[ \bar{\tilde{Q}} = \left[ \bar{Q} + Q \, \mathrm{copyltu}(M) \right] R^{-T}, \tag{8} \]
+
+where M = R \bar{R}^T − \bar{Q}^T Q and the copyltu function generates a symmetric matrix by copying
+the lower triangle of the input matrix to its upper triangle, with the element [copyltu(M)]_{ij} =
+M_{max(i,j),min(i,j)} [45, 46, 47, 48]. Here a bar denotes an adjoint variable, \bar{T} = ∂L/∂T. A simple
+implementation of this algorithm in pseudocode is:
+Algorithm 1 Algorithm for gradient flossing of k tangent space directions
+  initialize h, Q
+  for e = 1 → E do
+       γ_i ← 0 for i = 1, ..., k
+       for t = 1 → T do
+           h ← f_θ(h, x)
+           D ← dh_t / dh_{t−1}
+           Q ← D · Q
+           if t ≡ 0 (mod t_ONS) then
+               Q, R ← qr(Q)
+               γ_i += log(R_ii)
+           end if
+       end for
+       λ_i = γ_i / T
+       L_flossing = Σ_{i=1}^{k} λ_i^2
+       θ_{e+1} ← θ_e − η ∂L_flossing/∂θ
+  end for
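The inner loop of Algorithm 1 can be sketched in NumPy as follows. This is a forward-pass-only sketch under stated assumptions, not the author's implementation: the function name `lyapunov_exponents_qr`, the random initial `Q`, and the constant test Jacobian are all hypothetical choices for illustration. The parameter update would additionally require differentiating through the QR steps, for example with an automatic differentiation library.

```python
import numpy as np

def lyapunov_exponents_qr(jacobians, k, t_ons=1, seed=0):
    """Estimate the k largest Lyapunov exponents from a sequence of one-step
    Jacobians via repeated QR reorthonormalization (forward pass only)."""
    jacobians = list(jacobians)
    n = jacobians[0].shape[0]
    rng = np.random.default_rng(seed)
    # Random initial orthonormal system Q of size n x k.
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    gamma = np.zeros(k)  # accumulated log growth factors
    for t, D in enumerate(jacobians, start=1):
        Q = D @ Q  # evolve the tangent vectors (Eq. 6)
        if t % t_ons == 0:
            Q, R = np.linalg.qr(Q)
            # numpy's QR does not enforce a positive diagonal of R,
            # so the absolute value is taken before the logarithm.
            gamma += np.log(np.abs(np.diag(R)))
    # Time average (Eq. 7); growth after the last QR step is ignored.
    return gamma / len(jacobians)

# Sanity check on a constant diagonal map, whose exponents are log|d_i|.
D = np.diag([2.0, 0.5, 0.1])
lam = lyapunov_exponents_qr([D] * 2000, k=2)
flossing_loss = np.sum(lam ** 2)  # Eq. 4 evaluated on the estimate
print(lam, flossing_loss)
```

Choosing `t_ons` larger than 1 reduces the number of QR decompositions at the cost of a less well-conditioned `Q` between reorthonormalizations.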
+For clarity, we described gradient flossing in terms of stochastic gradient descent, but we actually
+implemented it with the ADAM optimizer using standard hyperparameters η, β1 and β2 . An example
+implementation is available here. Note that this algorithm also works for different recurrent network
+architectures. In this case, the Jacobian D has size n × n, where n is the number of dynamic
+variables of the recurrent network model. For example, in the case of a single recurrent network of
+N LSTM units, the Jacobian has size 2N × 2N [9, 12, 41]. The Jacobian matrix D can either be
+calculated analytically or it can be obtained via automatic differentiation.
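The QR backward pass of Eq. 8 can be checked numerically against finite differences. The sketch below assumes the standard form of the pullback, with M = R R̄^T − Q̄^T Q and the adjoint of the decomposed matrix on the left-hand side. The helper names `qr_pos`, `copyltu` and `qr_backward`, and the random linear test loss, are hypothetical and introduced only for this example.

```python
import numpy as np

def qr_pos(A):
    """Thin QR decomposition with a positive diagonal of R (unique for full rank A)."""
    Q, R = np.linalg.qr(A)
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    return Q * s, s[:, None] * R  # flip signs of columns of Q and rows of R

def copyltu(M):
    """Symmetric matrix with entries M[max(i,j), min(i,j)]."""
    return np.tril(M) + np.tril(M, -1).T

def qr_backward(Q, R, Q_bar, R_bar):
    """Adjoint of the decomposed matrix given the adjoints of Q and R (Eq. 8)."""
    M = R @ R_bar.T - Q_bar.T @ Q
    B = Q_bar + Q @ copyltu(M)
    # B @ inv(R).T, computed with a linear solve instead of an explicit inverse.
    return np.linalg.solve(R, B.T).T

# Finite-difference check on a random tall matrix with a random linear loss.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
W_Q = rng.standard_normal((5, 3))            # plays the role of dL/dQ
W_R = np.triu(rng.standard_normal((3, 3)))   # plays the role of dL/dR

def loss(A):
    Q, R = qr_pos(A)
    return np.sum(W_Q * Q) + np.sum(W_R * R)

Q, R = qr_pos(A)
A_bar = qr_backward(Q, R, W_Q, W_R)

eps = 1e-6
A_bar_fd = np.zeros_like(A)
for idx in np.ndindex(*A.shape):
    E = np.zeros_like(A)
    E[idx] = eps
    A_bar_fd[idx] = (loss(A + E) - loss(A - E)) / (2 * eps)

max_err = np.max(np.abs(A_bar - A_bar_fd))
print(max_err)  # should be close to zero
```

In practice, automatic differentiation frameworks that support a differentiable QR decomposition apply a rule of this kind internally, so the pullback rarely needs to be written by hand.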
+
+4   Gradient Flossing: Control of Lyapunov Exponents
+In Fig 1, we demonstrate that gradient flossing can set one or several Lyapunov exponents to a
+target value via gradient descent with the ADAM optimizer in random Vanilla RNNs initialized with
+different weight variances. The N units of the recurrent neural network follow the dynamics
+\[ h_{s+1} = f(h_s, x_{s+1}) = W \phi(h_s) + V x_{s+1}. \tag{9} \]
+
+[Figure 1 panels. A) Schematic of the long-term Jacobian T_t(h_τ) = ∏_{τ'=τ}^{t−1} ∂h_{τ'+1}/∂h_{τ'}. B) λ_1 (1/τ) vs. epochs; legend: target λ_1, actual λ_1. C) (1/k) Σ_{i=1}^{k} λ_i^2 (log scale) vs. epochs; legend: number of flossed λ_i, k = 32, 16, 1. D) λ_i (1/τ) vs. i/N; legend: number of flossed λ_i, k = 32, 16, 1.]
+Figure 1: Gradient flossing controls Lyapunov exponents and gradient signal propagation.
+A) Exploding and vanishing gradients in backpropagation through time arise from amplification/attenuation
+of the product of Jacobians that form the long-term Jacobian T_t(h_τ) = ∏_{τ'=τ}^{t−1} ∂h_{τ'+1}/∂h_{τ'}.
+B) First Lyapunov exponent of Vanilla RNN as a function of training epochs. Minimizing the
+mean squared error between estimated first Lyapunov exponent and target Lyapunov exponent
+λ_1 = −1, −0.5, 0 by gradient descent. 10 Vanilla RNNs were initialized with Gaussian recurrent
+weights W_ij ∼ N(0, g^2/N) where values of g were drawn g ∼ Unif(0, 1). C) Gradient flossing
+minimizes the square of Lyapunov exponents over epochs. D) Full Lyapunov spectrum of Vanilla
+RNN after a different number of Lyapunov exponents are pushed to zero via gradient flossing. Note
+the variability of the Lyapunov exponents that were not flossed. Parameters: network size N = 32
+with 10 network realizations. Error bars in C indicate the 25th and 75th percentiles and the solid line
+shows the median.
+
+
+The initial entries of W are drawn independently from a Gaussian distribution with zero mean and
+variance g^2/N, where g is a gain parameter that controls the heterogeneity of weights. We here use
+the transfer function ϕ(x) = tanh(x) (see appendix B for gradient flossing with ReLU and LSTM
+units). x_s is a stream of i.i.d. Gaussian inputs, x_s ∼ N(0, 1), and V is the input weight, with
+entries drawn from N(0, 1). Both W and V are trained during gradient
+flossing.
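For the Vanilla RNN of Eq. 9 with ϕ = tanh, the one-step Jacobian has the closed form D_s = W diag(1 − tanh(h_s)^2). The sketch below builds such a network and estimates its first Lyapunov exponent by tracking the growth of a single tangent vector. It is a minimal illustration under assumptions: a scalar input x_s, the function names `rnn_step` and `rnn_jacobian`, and the values of N, g and T are choices made for this example, not settings confirmed by the paper.

```python
import numpy as np

def rnn_step(h, x, W, V):
    """One step of Eq. 9: h_{s+1} = W tanh(h_s) + V x_{s+1}."""
    return W @ np.tanh(h) + V * x

def rnn_jacobian(h, W):
    """One-step Jacobian D_s = dh_{s+1}/dh_s = W diag(1 - tanh(h_s)^2)."""
    return W * (1.0 - np.tanh(h) ** 2)  # scales column j of W by phi'(h_j)

N, g, T = 32, 1.0, 500
rng = np.random.default_rng(0)
W = rng.normal(0.0, g / np.sqrt(N), size=(N, N))  # W_ij ~ N(0, g^2/N)
V = rng.normal(0.0, 1.0, size=N)                  # input weights ~ N(0, 1)
h = rng.standard_normal(N)

# First Lyapunov exponent from the growth of one tangent vector (k = 1),
# renormalized at every step to avoid overflow or underflow.
q = rng.standard_normal(N)
q /= np.linalg.norm(q)
gamma = 0.0
for _ in range(T):
    D = rnn_jacobian(h, W)                        # Jacobian at the current state
    h = rnn_step(h, rng.standard_normal(), W, V)  # i.i.d. Gaussian scalar input
    q = D @ q
    norm = np.linalg.norm(q)
    gamma += np.log(norm)
    q /= norm
lambda_1 = gamma / T
print(lambda_1)
```

With k = 1 the QR decomposition reduces to a vector normalization, which is why no explicit call to a QR routine is needed here.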
+In Fig 1B, we show that for randomly initialized RNNs, the Lyapunov exponent can be modified by
+gradient flossing to match a desired target value. The networks were initialized with 10 different values
+of initial weight strength g chosen uniformly between 0 and 1. During gradient flossing, they quickly
+approached three different target values of the first Lyapunov exponent, λ_1^target ∈ {−1, −0.5, 0},
+in fewer than 100 training epochs with batch size B = 1. We note that gradient flossing with a
+positive target λ_1^target does not seem to arrive at a positive Lyapunov exponent λ_1.
+Fig 1C shows gradient flossing for different numbers of Lyapunov exponents k. Here, during gradient
+descent, the sum of the squares of 1, 16, or 32 Lyapunov exponents is used as loss in gradient flossing
+(see Fig 1A). Fig 1D shows the Lyapunov spectrum after flossing, which now has 1, 16, or 32
+Lyapunov exponents close to zero. We conclude that gradient flossing can selectively manipulate one,
+several, or all Lyapunov exponents before or during network training. Gradient flossing also works for
+RNNs of ReLU and LSTM units (see appendix B). Further, we find that the computational bottleneck
+of gradient flossing is the QR decomposition, which has a computational complexity of O(N k^2),
+both in the forward pass and in the backward pass. Thus, gradient flossing of the entire Lyapunov
+spectrum is computationally expensive. However, as we will show, not all Lyapunov exponents need
+to be flossed and only short episodes of gradient flossing are sufficient for significantly improving the
+training performance.
+
+5   Gradient Flossing: Condition Number of the Long-Term Jacobian
+
+[Figure 2 panels. A) Condition number κ_2(T_t(h_τ)) (log scale) vs. time horizon t; legend: initial, after flossing, theory, simulations. B) κ_2(T_t(h_τ)) (log scale) vs. tangent space dimension m; same legend. C) κ_2 (theory) vs. κ_2 (numerical), both on log scales; legend: k = 5, 10, 15.]
+Figure 2: Gradient flossing reduces condition number of the long-term Jacobian. A) Condition
+number κ_2 of long-term Jacobian T_t(h_τ) as a function of time horizon t − τ at initialization (blue)
+and after gradient flossing (orange). Direct numerical simulations are done with arbitrary-precision
+floating-point arithmetic with 256 bits per float (transparent lines); asymptotic theory based on
+Lyapunov exponents is shown as dashed lines (Eq 10). B) Condition number for different numbers of tangent
+space dimensions m. Simulations (dots) and Lyapunov exponent-based theory (dashed lines) at
+initialization (blue) and after gradient flossing (orange). Gradient flossing increases the number of
+tangent space dimensions available for backpropagation for a given condition number (grey dotted
+line as a guide for the eye at κ_2 = 10^5). The first 15 Lyapunov exponents were flossed. C) Comparison of
+condition number obtained via direct numerical simulations vs. the Lyapunov exponent-based theory. Colors
+denote the number of flossed Lyapunov exponents k. Parameters: g = 1, batch size b = 1, N = 80,
+epochs = 500, T = 500, gradient flossing for E_f = 500 epochs. Input x_s identical to delayed XOR
+task in Fig 3D.
+
+A well-conditioned Jacobian is essential for efficient and fast learning [21, 49, 50]. Gradient
+flossing improves the condition number of the long-term Jacobian which constrains the error signal
+propagation across long time horizons in backpropagation (Fig 2). The condition number κ_2 of a
+linear map A measures how close the map is to being singular and is given by the ratio of the largest
+singular value σ_max and the smallest singular value σ_min, so κ_2(A) = σ_max(A)/σ_min(A). According to the
+rule of thumb given in [51], if κ_2(A) = 10^p, one can anticipate losing at least p digits of precision
+when solving the equation Ax = b. Note that the long-term Jacobian T_t is composed of a product of
+Jacobians, which generically makes it ill-conditioned. To nevertheless quantify the condition number
+numerically, we use arbitrary-precision arithmetic with 256 bits per float. We find numerically that
+the condition number of Tt exponentially diverges with the number of time steps (Fig 2A). We
+compare the numerically measured condition number κ2 with an asymptotic approximation of the
+condition number based on Lyapunov exponents that are calculated in the forward pass and find a
+good match (Fig 2A).
+Our theoretical estimate of the condition number κ_2 of an orthonormal system Q of size N × m that
+is temporally evolved by the long-term Jacobian T_t is:
+
+\[ \kappa_2(\tilde{Q}_{t+\tau}) = \kappa_2\left( T_t(h_\tau) Q_t \right) = \frac{\sigma_1(T_t(h_\tau))}{\sigma_m(T_t(h_\tau))} \approx \exp\left( (\lambda_1 - \lambda_m)(t - \tau) \right), \tag{10} \]
+
+where σ_1(T_t(h_τ)) and σ_m(T_t(h_τ)) are the first and mth singular value of the long-term Jacobian.
+We note that this theoretical estimate of the condition number follows from the asymptotic definition
+of Lyapunov exponents and should be exact in the limit of long times. We find that gradient flossing
+reduces the condition number by a factor whose magnitude increases exponentially with time (orange
+in Fig 2A). Thus, we can expect that gradient flossing has a stronger effect on problems with a long
+time horizon to bridge. We will later confirm this numerically.
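+As a hedged illustration of Eq. (10), the following sketch compares the asymptotic estimate with a direct SVD-based condition number on a toy system. The constant diagonal Jacobian, the chosen exponents, and the function name condition_number_estimate are assumptions made for this example; they are not taken from the experiments reported here, which use RNN Jacobians and 256-bit arithmetic.

```python
import numpy as np

def condition_number_estimate(lyap, m, steps):
    """Asymptotic estimate of Eq. (10): exp((lambda_1 - lambda_m) * steps)."""
    lyap = np.asarray(lyap, dtype=float)
    return float(np.exp((lyap[0] - lyap[m - 1]) * steps))

# Toy system (assumption): constant diagonal Jacobian with known exponents.
lyap = np.array([0.1, 0.0, -0.2, -0.5])
steps, m = 20, 3
jacobian = np.diag(np.exp(lyap))
long_term = np.linalg.matrix_power(jacobian, steps)  # product of one-step Jacobians
q = np.eye(len(lyap))[:, :m]                         # orthonormal system Q (N x m)

direct = np.linalg.cond(long_term @ q)               # sigma_1 / sigma_m via SVD
estimate = condition_number_estimate(lyap, m, steps)
print(direct, estimate)
```

+For this toy system the two numbers agree up to floating-point error because the Jacobian is constant and diagonal; for a trained RNN the estimate is expected to hold only asymptotically.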
+Moreover, Lyapunov exponents enable the estimation of the number of gradient dimensions available
+for the backpropagation of error signals. Generally, the long-term Jacobian is ill-conditioned; however,
+for a given number of tangent space dimensions, the Lyapunov spectrum provides an estimate of the
+condition number. This indicates how close to singular the gradient signal for a given number of
+tangent space dimensions is. Given a fixed acceptable condition number—determined, for example,
+by noise level or floating-point precision—we observe that gradient flossing increases the number of
+usable tangent space dimensions for backpropagation (Fig 2B).
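+The counting argument above can be sketched as follows: given a Lyapunov spectrum, a time horizon, and an acceptable condition number, count the tangent space dimensions whose estimated condition number stays below the threshold. The function name usable_dimensions and the two example spectra are hypothetical and chosen only to illustrate the idea.

```python
import numpy as np

def usable_dimensions(lyap, steps, kappa_max):
    """Largest m such that exp((lambda_1 - lambda_m) * steps) <= kappa_max."""
    lyap = np.sort(np.asarray(lyap, dtype=float))[::-1]  # descending order
    log_kappa = (lyap[0] - lyap) * steps                  # log condition number for each m
    return int(np.sum(log_kappa <= np.log(kappa_max)))

# Hypothetical spectra: a broad one and one with exponents pushed toward zero.
broad = [0.0, -0.01, -0.1, -1.0]
narrow = [0.0, -0.01, -0.02, -0.05]
print(usable_dimensions(broad, steps=100, kappa_max=1e6))   # 3
print(usable_dimensions(narrow, steps=100, kappa_max=1e6))  # 4
```

+Because the log condition number is nondecreasing in m for a sorted spectrum, counting the entries below the threshold gives the largest admissible m.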
+Finally, we show that the asymptotic estimate of the condition number based on Lyapunov exponents
+can even predict differences in condition number that originate from finite network size N (Fig 2C).
+We emphasize that this goes beyond mean-field methods, which become exact only in the large-
+network limit N → ∞ and usually do not capture finite-size effects [52] (see appendix G).
+
+6      Initial Gradient Flossing Improves Trainability
+[Figure 3, plot omitted. Panels A (delayed copy task) and B (delayed XOR task): test loss vs. epochs, with and without gradient flossing. Panels C (delayed copy task) and D (delayed XOR task): test loss vs. task difficulty (delay d).]
+Figure 3: Gradient flossing improves trainability on tasks that involve long time horizons. A) Test
+error for Vanilla RNNs trained on the delayed copy task y_t = x_{t−d} for d = 40 with and without gradient
+flossing. Solid lines are medians across 5 network realizations. B) Same as A for the delayed
+XOR task with y_t = |x_{t−d/2} − x_{t−d}|. C) Mean final test loss as a function of task difficulty (delay d)
+for the delayed copy task. D) Mean final test loss as a function of task difficulty (delay d) for the delayed
+XOR task. Parameters: g = 1, batch size b = 16, N = 80, epochs = 10^4, T = 300, gradient flossing
+for E_f = 500 epochs on k = 75 before training. Shaded regions in C and D indicate the 20% and
+80% percentiles and solid lines show means. Dots are individual runs. Task loss: MSE(y, ŷ).
+
+We next present numerical results on two tasks with variable spatial and temporal complexity,
+demonstrating that gradient flossing before training improves the trainability of Vanilla RNNs. In the
+following, we call gradient flossing before training preflossing. For preflossing, we first initialize the
+network randomly, then minimize $L_{\text{flossing}} = \sum_{i=1}^{k} \lambda_i^2$ using the ADAM optimizer and subsequently
+train on the tasks. We deliberately do not use sequential MNIST or similar toy tasks commonly used
+to probe exploding/vanishing gradients, because we want a task where the structure of long-range
+dependencies in the data is transparent and can be varied as desired.
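+A minimal sketch of how the preflossing objective could be evaluated is given below. It assumes input-free dynamics h_{s+1} = tanh(W h_s) and Gaussian weights with standard deviation g/sqrt(N); both are assumptions for illustration, as are the names flossing_loss, w, and h0. The sketch only evaluates the loss in NumPy. Minimizing it additionally requires differentiating through the QR decomposition, which is not shown here.

```python
import numpy as np

def flossing_loss(w, h0, steps, k, seed=0):
    """Sum of squared first-k Lyapunov exponents of h_{s+1} = tanh(w @ h_s)."""
    n = w.shape[0]
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, k)))  # random orthonormal system
    h = np.array(h0, dtype=float)
    log_r = np.zeros(k)
    for _ in range(steps):
        h = np.tanh(w @ h)
        jac = (1.0 - h**2)[:, None] * w      # one-step Jacobian of the assumed dynamics
        q, r = np.linalg.qr(jac @ q)         # reorthonormalize the tangent vectors
        log_r += np.log(np.abs(np.diag(r)))  # accumulate local stretching factors
    lyap = log_r / steps
    return float(np.sum(lyap**2)), lyap

# Example with assumed parameters N = 20, g = 1, k = 5.
rng = np.random.default_rng(1)
n, g = 20, 1.0
w = g * rng.standard_normal((n, n)) / np.sqrt(n)
loss, lyap = flossing_loss(w, rng.standard_normal(n), steps=200, k=5)
print(loss, lyap)
```

+The loss is zero only if all k estimated exponents are zero, which is the target that preflossing drives the network toward before task training begins.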
+First, we consider the delayed copy task, where a scalar stream of random input numbers x must be
+reproduced by the output y delayed by d time steps, i.e., y_t = x_{t−d}. Although the task itself is trivial
+and can be solved even by a linear network through a delay line (see appendix E), RNNs encounter
+vanishing gradients for large delays d during training even with ’critical’ initialization with g = 1.
+Our experiments show that gradient flossing can substantially improve the performance of RNNs
+on this task (Fig 3A, C). While Vanilla RNNs without gradient flossing fail to train reliably beyond
+d = 20, Vanilla RNNs with gradient flossing can be reliably trained for d = 40 (Fig 3C). Note that
+we flossed here k = 40 Lyapunov exponents before training. We will later investigate the role of the
+number of flossed Lyapunov exponents.
+Second, we consider the temporal XOR task, which requires the RNN to perform a nonlinear input-output
+computation on a sequential stream of scalar inputs, i.e., y_t = |x_{t−d/2} − x_{t−d}|, where d
+denotes a time delay of d time steps (for details, see appendix H). Fig 3D demonstrates that gradient
+flossing helps to train networks on substantially longer delays d. We found similar improvements
+through gradient flossing for RNNs initialized with orthogonal weights (see appendix G).
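+The two task definitions can be written out as data generators, sketched below under stated assumptions. The input distribution (uniform on [0, 1)), the zero targets for the first d steps, integer division for d/2, and the function names are assumptions for illustration; the text only fixes the input-output relations.

```python
import numpy as np

def delayed_copy_task(steps, delay, rng):
    """Targets y_t = x_{t-d}; targets for t < d are set to zero (assumption)."""
    x = rng.uniform(0.0, 1.0, size=steps)
    y = np.zeros(steps)
    y[delay:] = x[:steps - delay]
    return x, y

def delayed_xor_task(steps, delay, rng):
    """Targets y_t = |x_{t-d/2} - x_{t-d}|; targets for t < d are set to zero."""
    half = delay // 2
    x = rng.uniform(0.0, 1.0, size=steps)
    y = np.zeros(steps)
    y[delay:] = np.abs(x[delay - half:steps - half] - x[:steps - delay])
    return x, y

rng = np.random.default_rng(0)
x_copy, y_copy = delayed_copy_task(steps=300, delay=40, rng=rng)
x_xor, y_xor = delayed_xor_task(steps=300, delay=40, rng=rng)
```

+Varying the delay argument changes the task difficulty while leaving the rest of the data pipeline untouched.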
+
+7      Gradient Flossing During Training
+[Figure 4, plot omitted. Panels A (delayed temporal XOR) and B (delayed spatial XOR): test accuracy (%) vs. epochs for flossing during training, preflossing, and no flossing. Panels C and D: test accuracy (%) vs. complexity (delay d) for the same three conditions.]
+Figure 4: Gradient flossing during training further improves trainability.
+A) Test accuracy for Vanilla RNNs trained on the delayed temporal binary XOR task y_t = x_{t−d/2} ⊕ x_{t−d}
+with gradient flossing during training (green), preflossing (gradient flossing before training) (orange),
+and with no gradient flossing (blue) for d = 70. Solid lines are means across 20 network realizations;
+individual network realizations are shown as transparent fine lines. B) Same as A for the delayed spatial
+XOR task with y_t = x^1_{t−d} ⊕ x^2_{t−d} ⊕ x^3_{t−d}. C) Test accuracy
+as a function of task difficulty (delay d) for the delayed temporal XOR task. D) Test accuracy as a
+function of task difficulty (delay d) for the delayed spatial XOR task. Parameters: g = 1, batch size
+b = 16, N = 80, epochs = 10^4, T = 300, gradient flossing for E_f = 500 epochs on k = 75 before
+training and during training for green lines, and only before training for orange lines. Same plotting
+conventions as previous figure. Task loss: cross-entropy between y and ŷ.
+We next investigate the effects of gradient flossing during training and find that it
+can further improve trainability. We trained RNNs on two more challenging tasks
+with variable temporal complexity and performed gradient flossing either both during and before
+training, only before training, or not at all.
+Fig 4A shows the test accuracy for Vanilla RNNs trained on the delayed temporal XOR task
+y_t = x_{t−d/2} ⊕ x_{t−d} with a random Bernoulli process x ∈ {0, 1}. The accuracy of Vanilla RNNs
+falls to chance level for d ≥ 40 (Fig 4C). With gradient flossing before training, the trainability
+can be improved, but accuracy still falls to chance level for d = 70. In contrast, for networks with gradient
+flossing during training, the accuracy is improved to > 80% at d = 70. In this case, we preflossed
+for 500 epochs before task training and again after 500 epochs of training on the task. In Fig 4B,
+D the networks have to perform the nonlinear XOR operation y_t = x^1_{t−d} ⊕ x^2_{t−d} ⊕ x^3_{t−d} on a
+three-dimensional binary input signal x^1, x^2, and x^3 and generate the correct output with a delay of
+d steps. While the solution of the task itself is not difficult and could even be implemented by hand
+(see appendix), the task is challenging for backpropagation through time because nonlinear temporal
+associations bridging long time horizons have to be formed. Again, we observe that gradient flossing
+before training improves the performance compared to baseline, but starts failing for long delays
+d > 60. In contrast, networks that are also flossed during training can solve even more difficult tasks
+(Fig 4D). We find that after gradient flossing, the norm of the error gradient with respect to initial
+conditions h_0 is amplified (appendix C). Interestingly, gradient flossing can also be detrimental to
+task performance if it is continued throughout all training epochs (appendix C).
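+For concreteness, the two binary tasks of this section can be generated as in the sketch below. The Bernoulli parameter 0.5, the zero targets for the first d steps, integer division for d/2, and the function names are assumptions; only the XOR relations themselves come from the text.

```python
import numpy as np

def temporal_xor_task(steps, delay, rng):
    """Targets y_t = x_{t-d/2} XOR x_{t-d} for a binary scalar input stream."""
    half = delay // 2
    x = rng.integers(0, 2, size=steps)
    y = np.zeros(steps, dtype=np.int64)
    y[delay:] = x[delay - half:steps - half] ^ x[:steps - delay]
    return x, y

def spatial_xor_task(steps, delay, rng):
    """Targets y_t = x1_{t-d} XOR x2_{t-d} XOR x3_{t-d} for three binary channels."""
    x = rng.integers(0, 2, size=(steps, 3))
    y = np.zeros(steps, dtype=np.int64)
    past = x[:steps - delay]
    y[delay:] = past[:, 0] ^ past[:, 1] ^ past[:, 2]
    return x, y

rng = np.random.default_rng(0)
x_t, y_t = temporal_xor_task(steps=300, delay=70, rng=rng)
x_s, y_s = spatial_xor_task(steps=300, delay=70, rng=rng)
```

+Since the targets are binary, a cross-entropy loss on the network output is the natural choice, consistent with the task loss stated in the caption of Fig 4.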
+We note that merely regularizing the spectral radius of the recurrent weight matrix W or the individual
+one-step Jacobians Ds numerically or based on mean-field theory does not yield such a training
+improvement. This suggests that taking the temporal correlations between Jacobians Ds into account
+is important for improving trainability.
+
+7.1   Gradient Flossing for Different Numbers of Flossed Lyapunov Exponents
+
+We investigated how many Lyapunov exponents k have to be flossed to achieve an improvement in
+training success (Fig 5). We studied this in the binary temporal delayed XOR task with gradient
+flossing during training (same as Fig 3) and varied the task difficulty by changing the delay d.
+We found that as the task becomes more difficult, networks where not enough Lyapunov exponents
+k are flossed begin to fall below 100% test accuracy (Fig 5A). Correspondingly, when measuring
+final test accuracy as a function of the number of flossed Lyapunov exponents, we observed that
+more Lyapunov exponents k have to be flossed to achieve 100% accuracy as the tasks become more
+difficult (Fig 5B). We also show the entire parameter plane of median test accuracy as a function of
+both number of flossed Lyapunov exponents k and task difficulty (delay d), and found the same trend
+(Fig 5B). Overall, we found that tasks with larger delay d require more Lyapunov exponents close to
+zero. We note that this might also partially be caused by the ’streaming’ nature of the task: in our
+tasks, longer delays automatically imply that more values have to be stored as at any moment all the
+values in the ’delay line’ have to be remembered to successfully solve the tasks. This is different from
+tasks where a single variable has to be stored and recalled after a long delay. It would be interesting
+to study tasks where the number of delay steps and the number of items in memory can be varied
+independently.
+Finally, we did the same analysis on networks with only preflossing (gradient flossing before training)
+and found the same trend (supplement Fig 7D). However, in that case, even when all N Lyapunov
+exponents were flossed (k = N), the networks were not able to solve the most difficult tasks. This seems
+to indicate that gradient flossing during training cannot be replaced by simply flossing more
+Lyapunov exponents before training.
+
+
+
+
+Figure 5: Gradient flossing for different numbers of flossed Lyapunov exponents.
+A) Test accuracy for the delayed temporal XOR task as a function of delay d with different numbers
+of flossed Lyapunov exponents k. B) Same data as A, but showing test accuracy as a function of the number
+of flossed Lyapunov exponents k. Parameters: g = 1, batch size b = 16, N = 80, epochs = 10^4
+for delayed temporal XOR, epochs = 5000 for delayed spatial XOR, T = 300, gradient flossing
+for E_f = 500 epochs before training and during training for A, B. Shaded areas are 25% and 75%
+percentiles, solid lines are means, transparent dots are individual simulations. Task loss: cross-entropy
+between y and ŷ.
+8     Limitations
+The mathematical connection between Lyapunov exponents and backpropagation through time
+exploited in gradient flossing is rigorously established only in the infinite-time limit. It would be
+interesting to extend our analysis to finite-time Lyapunov exponents.
+Furthermore, the backpropagation through time gradient involves a sum over products of Jacobians
+of different time periods t − τ , but the Lyapunov exponent only considers the asymptotic longest
+product. Additionally, Lyapunov exponents characterize the asymptotic dynamics on the attractor of
+the dynamics, whereas RNNs often exploit transient dynamics from some initial conditions outside
+or towards the attractor.
+Although our proposed method focuses on exploiting Lyapunov exponents, it neglects the geometry
+of covariant Lyapunov vectors [53], which could be used to improve training performance, speed,
+and reliability. Additionally, it is important to investigate how sensitive the method is to the choice
+of orthonormal basis employed because it is only guaranteed to become unique asymptotically [54].
+Finally, the computational cost of our method scales with O(N k^2), where N is the network size
+and k is the number of Lyapunov exponents calculated. To reduce the computational cost, we
+suggest doing QR decomposition only sufficiently often to ensure that the orthonormal system is
+not ill-conditioned and using gradient flossing only intermittently or as pretraining. One could also
+calculate the Lyapunov spectrum for a shorter time interval or use a cheaper proxy for the Lyapunov
+spectrum and investigate more efficient gradient flossing schedules.
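+The suggestion of performing the QR decomposition less often can be sketched as follows. The function lyapunov_spectrum and its qr_interval parameter are hypothetical names for this illustration, and the random Gaussian Jacobians stand in for the one-step Jacobians of an actual network. Between reorthonormalizations the tangent vectors are only multiplied by the Jacobians, so the interval has to stay short enough that the evolved system does not become ill-conditioned.

```python
import numpy as np

def lyapunov_spectrum(jacobians, k, qr_interval=1, seed=0):
    """First k Lyapunov exponents, reorthonormalizing every qr_interval steps."""
    rng = np.random.default_rng(seed)
    n = jacobians[0].shape[0]
    q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    log_r = np.zeros(k)
    total = len(jacobians)
    for step, jac in enumerate(jacobians, start=1):
        q = jac @ q                                  # evolve tangent vectors
        if step % qr_interval == 0 or step == total:
            q, r = np.linalg.qr(q)                   # reorthonormalize
            log_r += np.log(np.abs(np.diag(r)))      # accumulated stretching
    return log_r / total

# Stand-in Jacobians (assumption): 100 random 6 x 6 Gaussian matrices.
rng = np.random.default_rng(1)
jacobians = [rng.standard_normal((6, 6)) / np.sqrt(6) for _ in range(100)]
every_step = lyapunov_spectrum(jacobians, k=4, qr_interval=1)
every_fifth = lyapunov_spectrum(jacobians, k=4, qr_interval=5)
print(every_step, every_fifth)
```

+In exact arithmetic both schedules give the same exponents, because the triangular factors of a product are unique up to signs; in floating point the agreement degrades as the interval grows.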
+
+
+9   Discussion
+
+We tackle the problem of gradient signal propagation in recurrent neural networks through a dy-
+namical systems lens. We introduce a novel method called gradient flossing that addresses the
+problem of gradient instability during training. Our approach enhances gradient signal stability both
+before and during training by regularizing Lyapunov exponents. By keeping the long-term Jacobian
+well-conditioned, gradient flossing optimizes both training accuracy and speed. To achieve this,
+we combine established dynamical systems methods for calculating Lyapunov exponents with an
+analytical pullback of the QR factorization. This allows us to establish and maintain gradient stability
+in a manner that is memory-efficient, numerically stable, and exact across long time horizons.
+Our method is applicable to arbitrary RNN architectures, nonlinearities, and also neural ODEs [55].
+Empirically, pre-training with gradient flossing enhances both training speed and accuracy. For
+difficult temporal credit assignment problems, gradient flossing throughout training further enhances
+signal propagation. We also demonstrate the versatility of our method on a set of synthetic tasks
+with controllable time-complexity and show that it can be combined with other approaches to tackle
+exploding and vanishing gradients, such as dynamic mean-field theory for initialization, orthogonal
+initialization and specialized single units, such as LSTMs.
+Prior research on exploding and vanishing gradients mainly focused on selecting network architectures
+that are less prone to exploding/vanishing gradients or finding parameter initializations that provide
+well-conditioned gradients at least at the beginning of training. Our introduced gradient flossing can
+be seen as a complementary approach that can further enhance gradient stability throughout training.
+Compared to the work on picking good parameter initializations based on random matrix theory [41]
+and mean-field heuristics [40], gradient flossing provides several improvements: First, mean-field
+theory only considers the gradient flow at initialization, while gradient flossing can maintain gradient
+flow and well-conditioned Jacobians throughout the training process. Second, random matrix theory
+and mean-field heuristics are usually confined to the limit of large networks [52], while gradient
+flossing can be used for networks of any size. The link between Lyapunov exponents and the gradients
+of backpropagation through time has been described previously [9, 12] and has been spelled out
+analytically and studied numerically [10, 11, 56, 57, 58]. In contrast, we use Lyapunov exponents
+here not only as a diagnostic tool for gradient stability but also to show that they can directly be part
+of the cure for exploding and vanishing gradients.
+Future investigations could delve further into the roles of the second to Nth Lyapunov exponents
+in trainability, and how they are related to the task at hand, the rank of the parameter update, the
+dimensionality of the solution space, as well as the network dynamics (see also [32, 59, 60]). Our results
+suggest a trade-off between trainability across long time horizons and the nonlinear task demands
+that is worth exploring in more detail (appendix C). Applying gradient flossing to real-time recurrent
+learning and its biologically plausible variants is another avenue [61]. Extending gradient flossing to
+feedforward networks, state-space models and transformers is a promising avenue for future research
+(see also [62, 63]). While Lyapunov exponents are only strictly defined for dynamical systems, such
+as maps or flows that are endomorphisms, the long-term Jacobian of deep feedforward networks
+could be treated similarly. This could also provide a link between the stability of the network against
+adversarial examples and its dynamic stability, as measured by Lyapunov exponents. Given that
+time-varying input can suppress chaos in recurrent networks [9, 12, 64, 65, 66, 67], we anticipate that it
+may exacerbate vanishing gradients. Gradient flossing could also be applied in neural architecture
+search, to identify and optimize trainable networks. Finally, gradient flossing is applicable to other
+model parameters, as well. For instance, gradients of Lyapunov exponents with respect to single-
+unit parameters could optimize the activation function and single-neuron biophysics in biologically
+plausible neuron models.
+
+
+Acknowledgments and Disclosure of Funding
+I thank E. Izhikevich, F. Wolf, S. Goedeke, J. Lindner, L.F. Abbott, L. Logiaco, M. Schottdorf, G.
+Wayne and P. Sokol for fruitful discussions and J. Stone, L.F. Abbott, M. Ding, O. Marshall, S.
+Goedeke, S. Lippl, M.P. Puelma-Touzel, J. Lindner and the reviewers for feedback on the manuscript.
+I thank Jinguo Liu for the Julia package BackwardsLinalg.jl. Research supported by NSF NeuroNex
+Award (DBI-1707398), the Gatsby Charitable Foundation (GAT3708), the Simons Collaboration for
+the Global Brain (542939SPI), and the Swartz Foundation (2021-6).
+
+References
+ [1] P. Werbos. Beyond Regression. Ph.D. dissertation, Harvard University, 1974.
+ [2] DB Parker. Learning-logic (TR-47). Center for Computational Research in Economics and Management
+     Science. MIT-Press, Cambridge, Mass, 8, 1985.
+ [3] Y. LeCun. Une procedure d’apprentissage pour reseau a seuil asymetrique. Proceedings of Cognitiva 85,
+     pages 599–604, 1985.
+ [4] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-
+     propagating errors. Nature, 323(6088):533, October 1986.
+ [5] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. PhD thesis, 1991.
+ [6] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–
+     1780, 1997.
+ [7] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult.
+     IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
+ [8] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training Recurrent Neural
+     Networks. arXiv:1211.5063 [cs], November 2012. arXiv: 1211.5063.
+ [9] Rainer Engelken, Fred Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.
+     arXiv:2006.02427 [nlin, q-bio], June 2020. arXiv: 2006.02427.
+[10] Jonas Mikhaeil, Zahra Monfared, and Daniel Durstewitz. On the difficulty of learning chaotic dynamics
+     with RNNs. Advances in Neural Information Processing Systems, 35:11297–11312, December 2022.
+[11] Il Memming Park, Ábel Ságodi, and Piotr Aleksander Sokół. Persistent learning signals and working
+     memory without continuous attractors. ArXiv, page arXiv:2308.12585v1, August 2023.
+[12] Rainer Engelken, Fred Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.
+     Physical Review Research, 5(4):043044, October 2023.
+[13] J. P. Eckmann and D. Ruelle. Ergodic theory of chaos and strange attractors. Reviews of Modern Physics,
+     57(3):617–656, July 1985.
+[14] Arkady Pikovsky and Antonio Politi. Lyapunov Exponents: A Tool to Explore Complex Dynamics.
+     Cambridge University Press, Cambridge, February 2016.
+[15] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the Properties of
+     Neural Machine Translation: Encoder-Decoder Approaches. Technical report, September 2014. ADS
+     Bibcode: 2014arXiv1409.1259C Type: article.
+[16] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural
+     networks. In Proceedings of the 30th International Conference on Machine
+     Learning - Volume 28, ICML’13, pages III–1310–III–1318, Atlanta, GA, USA, June 2013. JMLR.org.
+[17] Tomáš Mikolov. Statistical language models based on neural networks. Ph.D. thesis, Brno University of
+     Technology, Faculty of Information Technology, Brno, CZ, 2012.
+[18] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, November 2016.
+     Google-Books-ID: omivDQAAQBAJ.
+[19] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by
+     Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine
+     Learning, pages 448–456. PMLR, June 2015.
+[20] Bo Chang, Minmin Chen, Eldad Haber, and Ed H. Chi. AntisymmetricRNN: A Dynamical System View
+     on Recurrent Neural Networks. International Conference on Learning Representations, December 2018.
+
+[21] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of
+     learning in deep linear neural networks. arXiv:1312.6120 [cond-mat, q-bio, stat], December 2013. arXiv:
+     1312.6120.
+
+[22] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary Evolution Recurrent Neural Networks. In
+     Proceedings of The 33rd International Conference on Machine Learning, pages 1120–1128. PMLR, June
+     2016.
+
+[23] Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal Recurrent Neural Networks with Scaled Cayley
+     Transform. In Proceedings of the 35th International Conference on Machine Learning, pages 1969–1978.
+     PMLR, July 2018.
+
+[24] T. Konstantin Rusch and Siddhartha Mishra. Coupled Oscillatory Recurrent Neural Network (coRNN):
+     An accurate and (gradient) stable architecture for learning long time dependencies. arXiv e-prints, page
+     arXiv:2010.00951, October 2020.
+
+[25] N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W. Mahoney.
+     Lipschitz Recurrent Neural Networks. International Conference on Learning Representations, January
+     2021.
+
+[26] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and
+     Soham De. Resurrecting Recurrent Neural Networks for Long Sequences. Technical report, March 2023.
+     ADS Bibcode: 2023arXiv230306349O Type: article.
+
+[27] Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks-with an
+     erratum note. Bonn, Germany: German National Research Center for Information Technology GMD
+     Technical Report, 148:34, 2001.
+
+[28] Mustafa C. Ozturk, Dongming Xu, and José C. Príncipe. Analysis and Design of Echo State Networks.
+     Neural Computation, 19(1):111–138, January 2007.
+
+[29] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training Very Deep Networks. In Advances
+     in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
+
+[30] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutnik, and Jürgen Schmidhuber. Recurrent Highway
+     Networks. In Proceedings of the 34th International Conference on Machine Learning, pages 4189–4198.
+     PMLR, July 2017.
+
+[31] Piotr A. Sokół, Ian Jordan, Eben Kadile, and Il Memming Park. Adjoint Dynamics of Stable Limit Cycle
+     Neural Networks. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pages 884–887,
+     November 2019. ISSN: 2576-2303.
+
+[32] Piotr A. Sokół. Geometry of Learning and Representations in Neural Networks. PhD thesis, Stony Brook
+     University, May 2023.
+
+[33] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep Sparse Rectifier Neural Networks. volume 15,
+     pages 315–323, April 2011.
+
+[34] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential
+     expressivity in deep neural networks through transient chaos. arXiv:1606.05340 [cond-mat, stat], June
+     2016. arXiv: 1606.05340.
+
+[35] Minmin Chen, Jeffrey Pennington, and Samuel S. Schoenholz. Dynamical Isometry and a Mean Field
+     Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks. arXiv:1806.05394
+     [cs, stat], August 2018. arXiv: 1806.05394.
+
+[36] Boris Hanin and Mihai Nica. Products of Many Large Random Matrices and Gradients in Deep Neural
+     Networks. December 2018.
+
+[37] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep Information
+     Propagation. November 2016.
+
+[38] Boris Hanin and David Rolnick. How to Start Training: The Effect of Initialization and Architecture. In
+     Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
+
+[39] Piotr A. Sokol and Il Memming Park. Information Geometry of Orthogonal Initializations and Training.
+     Technical report, October 2018. ADS Bibcode: 2018arXiv181003785S Type: article.
+[40] Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, and Jeffrey
+     Pennington. Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs. January 2019.
+[41] Tankut Can, Kamesh Krishnamurthy, and David J. Schwab. Gating creates slow modes and controls
+     phase-space complexity in GRUs and LSTMs. arXiv:2002.00025 [cond-mat, stat], January 2020. arXiv:
+     2002.00025.
+[42] Valery Iustinovich Oseledets. A multiplicative ergodic theorem. Characteristic Ljapunov, exponents of
+     dynamical systems. Trudy Moskovskogo Matematicheskogo Obshchestva, 19:179–210, 1968.
+[43] Karlheinz Geist, Ulrich Parlitz, and Werner Lauterborn. Comparison of Different Methods for Computing
+     Lyapunov Exponents. Progress of Theoretical Physics, 83(5):875–893, May 1990.
+[44] Giancarlo Benettin, Luigi Galgani, Antonio Giorgilli, and Jean-Marie Strelcyn. Lyapunov Characteristic
+     Exponents for smooth dynamical systems and for hamiltonian systems; A method for computing all of
+     them. Part 2: Numerical application. Meccanica, 15(1):21–30, March 1980.
+[45] Hai-Jun Liao, Jin-Guo Liu, Lei Wang, and Tao Xiang. Differentiable Programming Tensor Networks.
+     Physical Review X, 9(3):031041, September 2019. arXiv:1903.09650 [cond-mat, physics:quant-ph].
+[46] S. F. Walter and L. Lehmann. Algorithmic Differentiation of Linear Algebra Functions with Application
+     in Optimum Experimental Design (Extended Version). Technical report, January 2010. ADS Bibcode:
+     2010arXiv1001.1654W Type: article.
+[47] Robert Schreiber and Charles Van Loan. A Storage-Efficient WY Representation for Products of
+     Householder Transformations. SIAM Journal on Scientific and Statistical Computing, 10(1):53–57, January
+     1989.
+[48] Matthias Seeger, Asmus Hetzel, Zhenwen Dai, Eric Meissner, and Neil D. Lawrence. Auto-Differentiating
+     Linear Algebra. Technical report, October 2017. ADS Bibcode: 2017arXiv171008717S Type: article.
+[49] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning
+     through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems,
+     volume 30. Curran Associates, Inc., 2017.
+[50] Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. The Emergence of Spectral Universality in
+     Deep Networks. arXiv:1802.09979 [cs, stat], February 2018. arXiv: 1802.09979.
+[51] E. Cheney and David Kincaid. Numerical Mathematics and Computing. Cengage Learning, August 2007.
+     Google-Books-ID: ZUfVZELlrMEC.
+[52] Yasaman Bahri, Jonathan Kadmon, Jeffrey Pennington, Sam S. Schoenholz, Jascha Sohl-Dickstein, and
+     Surya Ganguli. Statistical Mechanics of Deep Learning. Annual Review of Condensed Matter Physics,
+     11(1):501–528, 2020.
+[53] F. Ginelli, P. Poggi, A. Turchi, H. Chaté, R. Livi, and A. Politi. Characterizing Dynamics with Covariant
+     Lyapunov Vectors. Physical Review Letters, 99(13):130601, September 2007.
+[54] Sergey V. Ershov and Alexei B. Potapov. On the concept of stationary Lyapunov basis. Physica D:
+     Nonlinear Phenomena, 118(3):167–198, July 1998.
+[55] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural Ordinary Differential
+     Equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.,
+     2018.
+[56] Javed Lindner. Investigating the exploding and vanishing gradients problem with Lyapunov exponents.
+     Master’s thesis, RWTH Aachen, Aachen/Juelich, 2021.
+[57] Ryan Vogt, Maximilian Puelma Touzel, Eli Shlizerman, and Guillaume Lajoie. On Lyapunov Exponents
+     for RNNs: Understanding Information Propagation Using Dynamical Systems Tools. Frontiers in Applied
+     Mathematics and Statistics, 8, 2022.
+[58] Kamesh Krishnamurthy, Tankut Can, and David J. Schwab. Theory of Gating in Recurrent Neural Networks.
+     Physical Review X, 12(1):011011, January 2022.
+[59] L. S. Pontryagin. Mathematical Theory of Optimal Processes. Routledge, New York, 1st edition,
+     March 1987.
+
+[60] Daniel Liberzon. Calculus of Variations and Optimal Control Theory: A Concise Introduction. In Calculus
+     of Variations and Optimal Control Theory. Princeton University Press, December 2011.
+
+[61] Owen Marschall, Kyunghyun Cho, and Cristina Savin. A unified framework of online learning algorithms
+     for training recurrent neural networks. The Journal of Machine Learning Research, 21(1):135:5320–
+     135:5353, January 2020.
+
+[62] Judy Hoffman, Daniel A. Roberts, and Sho Yaida. Robust Learning with Jacobian Regularization.
+     arXiv:1908.02729, August 2019.
+[63] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup Initialization: Residual Learning Without
+     Normalization. arXiv:1901.09321, January 2019.
+
+[64] L. Molgedey, J. Schuchhardt, and H. G. Schuster. Suppressing chaos in neural networks by noise. Physical
+     Review Letters, 69(26):3717–3719, December 1992.
+
+[65] Kanaka Rajan, L. F. Abbott, and Haim Sompolinsky. Stimulus-dependent suppression of chaos in recurrent
+     neural networks. Physical Review E, 82(1):011903, July 2010.
+
+[66] Jannis Schuecker, Sven Goedeke, David Dahmen, and Moritz Helias. Functional methods for disordered
+     neural networks. arXiv:1605.06758 [cond-mat, q-bio], May 2016. arXiv: 1605.06758.
+
+[67] Rainer Engelken, Alessandro Ingrosso, Ramin Khajeh, Sven Goedeke, and L. F. Abbott. Input correlations
+     impede suppression of chaos and learning in balanced firing-rate networks. PLOS Computational Biology,
+     18(12):e1010590, December 2022.
+
+[68] Edward Ott. Chaos in Dynamical Systems. Cambridge University Press, August 2002.
+[69] Angelo Vulpiani, Fabio Cecconi, and Massimo Cencini. Chaos: From Simple Models to Complex Systems.
+     World Scientific Pub Co Inc, Hackensack, NJ, September 2009.
+
+A         Backpropagation Through QR Decomposition
+The backward pass of the QR decomposition A = QR is given by [45, 46, 47, 48]
+                     \bar{A} = [\bar{Q} + Q \, \mathrm{copyltu}(M)] R^{-T}                                    (11)
+where M = R \bar{R}^T − \bar{Q}^T Q and the copyltu function generates a symmetric matrix by copying the lower
+triangle of the input matrix to its upper triangle, with the element [copyltu(M)]_{ij} = M_{max(i,j),min(i,j)}
+[45, 46, 47, 48]. Adjoint variables are written here as \bar{T} = ∂L/∂T.
+Using an analytical pullback is more memory-efficient and less computationally costly than directly doing
+automatic differentiation through the QR decomposition. Moreover, from a practical perspective, QR
+decomposition often relies on BLAS/LAPACK routines, which are not amenable to common differentiable
+programming frameworks like TensorFlow, PyTorch, JAX and Zygote. In our implementation of gradient
+flossing, we used the Julia package BackwardsLinalg.jl by Jinguo Liu, available here.
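As an illustration of the pullback in Eq. 11, the following is a minimal NumPy sketch, not the BackwardsLinalg.jl implementation. It assumes a square, full-rank matrix A and the standard form in which the left-hand side is the adjoint of A. The names `copyltu`, `qr_backward` and `toy_loss` are hypothetical helpers introduced here, and the loss is an arbitrary linear function of Q and R chosen only so that the result can be compared against finite differences.

```python
import numpy as np

def copyltu(M):
    """Symmetric matrix from the lower triangle of M: out[i, j] = M[max(i, j), min(i, j)]."""
    return np.tril(M) + np.tril(M, -1).T

def qr_backward(Q, R, Q_bar, R_bar):
    """Pullback of A = QR for square, full-rank A (assumed setting for Eq. 11)."""
    M = R @ R_bar.T - Q_bar.T @ Q
    B = Q_bar + Q @ copyltu(M)
    # B @ R^{-T}, computed with a linear solve instead of an explicit inverse
    return np.linalg.solve(R, B.T).T

def toy_loss(A, G_q, G_r):
    """Hypothetical scalar loss, linear in Q and R, used only to check the pullback."""
    Q, R = np.linalg.qr(A)
    return np.sum(G_q * Q) + np.sum(G_r * R)

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n)) + 3.0 * np.eye(n)   # well-conditioned test matrix
G_q = rng.normal(size=(n, n))                   # plays the role of the adjoint of Q
G_r = np.triu(rng.normal(size=(n, n)))          # adjoint of R (upper triangular)

Q, R = np.linalg.qr(A)
A_bar = qr_backward(Q, R, G_q, G_r)

# Central finite differences as an independent check of the analytical pullback
eps = 1e-6
A_bar_fd = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        A_bar_fd[i, j] = (toy_loss(A + E, G_q, G_r) - toy_loss(A - E, G_q, G_r)) / (2 * eps)

max_abs_err = np.max(np.abs(A_bar - A_bar_fd))
print("max |analytical - finite difference| =", max_abs_err)
```

The linear solve avoids forming the inverse of R explicitly; for tall or rank-deficient matrices a different pullback would be needed.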
+
+B       Further Details and Analysis of Gradient Flossing
+An example implementation of gradient flossing in Flux, a machine learning library in Julia, is available here.
+We are actively developing implementations for other widely used differentiable programming frameworks.
+
+B.1       Gradient Flossing for recurrent LSTM and ReLU networks
+
+[Figure 6 image omitted. Panels A (LSTM) and C (ReLU): first Lyapunov exponent λ1 (target and actual) versus
+epochs (0 to 100). Panels B and D: squared first Lyapunov exponent versus epochs (0 to 100), logarithmic scale.]
+
+Figure 6: Gradient flossing for recurrent LSTM networks and recurrent ReLU networks. A) First
+Lyapunov exponent of LSTM networks as a function of training epochs. The mean
+squared error between the estimated first Lyapunov exponent and the target Lyapunov exponent λ1 = 0
+is minimized by gradient descent. The first Lyapunov exponent of each LSTM network (solid lines) converges to
+the target value (thick dashed lines) in fewer than 100 epochs. 10 random LSTM RNNs were initialized with
+Gaussian recurrent weights, where standard deviations of weight scaling were drawn g ∼ Unif(0, 1).
+B) Gradient flossing minimizes the square of the first Lyapunov exponent of random recurrent
+LSTM networks over epochs. C) Same as A for recurrent ReLU networks. Here networks were
+initialized with Gaussian recurrent weights Wij ∼ N(−0.1, g²/N), where values of g were drawn
+g ∼ Unif(0, 1). D) Same as B for recurrent ReLU networks. Parameters: network size N = 32 with 10
+network realizations. Shaded regions in B, D are 25th and 75th percentiles, solid line shows median.
+
+We demonstrate that gradient flossing can also be applied to recurrent LSTM and ReLU networks in Fig 6. To
+this end, we generated random LSTM networks where the weights of all the different gates and biases were
+independently and identically distributed (i.i.d.) and sampled from Gaussian distributions of different variance.
+Our results show that gradient flossing can also constrain the Lyapunov exponent to be close to zero. The
+dynamics of each of the N LSTM units follows the map [6]:
+                                      ft    =    σg (Uf ht−1 + Wf xt + bf )                                      (12)
+                                      ot    =    σg (Uo ht−1 + Wo xt + bo )                                      (13)
+                                      it    =    σg (Ui ht−1 + Wi xt + bi )                                      (14)
+                                      c̃t   =    σh (Uc ht−1 + Wc xt + bc )                                      (15)
+                                      ct    =    ft ⊙ ct−1 + it ⊙ c̃t                                            (16)
+                                      ht    =    ot ⊙ ϕ(ct )                                                     (17)
+where ⊙ denotes the Hadamard product, σg(x) = 1/(1 + exp(−x)) is the sigmoid function, σh(x) = tanh(x), and
+entries of the matrices Ux are drawn from Ux ∼ N(0, gx²/N). For simplicity, the bias terms bx are scalars.
+Subscripts f , o and i denote respectively the forget gate, the output gate, the input gate, and c is the cell state.
+In each LSTM unit, there are two dynamic variables c and h, and three gates f , o, and i that control the flow
+of signals into and out of the cell c. We set the values gih , gix , gf x , bf , gch , gcx , gcx , gox to be uniformly
+distributed between 0 and 1 and initialize bi , gf h ,bc ,b0 as zero.
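The LSTM map of Eqs. 12 to 17 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used for the experiments: the parameter dictionary `p`, the helper `init_params`, the input dimension, and the scaling of the input weights W are assumptions made here. For brevity, all gains are drawn from Unif(0, 1) and all biases are set to zero, which is simpler than the initialization described above, and ϕ in Eq. 17 is assumed to be tanh.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x, p):
    """One step of Eqs. 12-17; p is a dict of parameters (hypothetical container)."""
    f = sigmoid(p["Uf"] @ h_prev + p["Wf"] @ x + p["bf"])        # forget gate, Eq. 12
    o = sigmoid(p["Uo"] @ h_prev + p["Wo"] @ x + p["bo"])        # output gate, Eq. 13
    i = sigmoid(p["Ui"] @ h_prev + p["Wi"] @ x + p["bi"])        # input gate, Eq. 14
    c_tilde = np.tanh(p["Uc"] @ h_prev + p["Wc"] @ x + p["bc"])  # candidate state, Eq. 15
    c = f * c_prev + i * c_tilde                                 # cell state, Eq. 16
    h = o * np.tanh(c)                                           # Eq. 17, assuming phi = tanh
    return h, c

def init_params(N, n_in, rng):
    """Simplified random initialization: U ~ N(0, g^2 / N) with g ~ Unif(0, 1), zero biases."""
    p = {}
    for gate in "foic":
        g_h, g_x = rng.uniform(0.0, 1.0, size=2)
        p["U" + gate] = rng.normal(0.0, g_h / np.sqrt(N), size=(N, N))
        p["W" + gate] = rng.normal(0.0, g_x / np.sqrt(N), size=(N, n_in))  # assumed scaling
        p["b" + gate] = 0.0                                                # scalar bias
    return p

rng = np.random.default_rng(0)
N, n_in, T = 32, 1, 100
params = init_params(N, n_in, rng)
h, c = np.zeros(N), np.zeros(N)
for x in rng.normal(size=(T, n_in)):   # i.i.d. Gaussian input stream (assumption)
    h, c = lstm_step(h, c, x, params)
print("norm of final hidden state:", np.linalg.norm(h))
```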
+During gradient flossing, the actual Lyapunov exponents of different random network realizations converge close
+to the target Lyapunov exponent λ1^target = 0 in fewer than 100 epochs, as shown in Fig 6A. Fig 6B shows that
+the squared Lyapunov exponents converge towards zero. We note that for LSTM networks, a target Lyapunov
+exponent of λ1^target = −1 is achieved after 100 gradient flossing steps only for a subset of random network
+realizations (not shown). We speculate that this behavior is influenced by the gating structure of LSTM units,
+which seems to naturally place the first Lyapunov exponent close to zero for certain initializations (see also
+[9, 12, 41, 58]).
+For the recurrent ReLU networks, we considered the same Vanilla RNN dynamics as in Eq 9 of the main
+manuscript,
+                              h_{s+1} = f(h_s, x_{s+1}) = W ϕ(h_s) + V x_{s+1}.
+The initial entries of W are drawn independently from a Gaussian distribution with a negative mean of −0.1
+and variance g²/N, where g is a gain parameter that controls the heterogeneity of weights. We use the transfer
+function ϕ(x) = max(x, 0). The input xs is a stream of i.i.d. Gaussian variables xs ∼ N(0, 1), and the entries
+of the input weights V are drawn from N(0, 1). Both W and V are trained during gradient
+flossing. We found that some ReLU networks initially had unstable dynamics with positive Lyapunov exponents
+(Fig 6C). However, during gradient flossing, these unstable networks were quickly stabilized. Fig 6D shows that
+the squared Lyapunov exponents of ReLU networks converge towards zero.
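For concreteness, the quantity being flossed can be estimated with the standard QR reorthonormalization procedure. The sketch below is a minimal NumPy illustration for the ReLU network above, not the Flux implementation: the function name `lyapunov_spectrum`, the sequence length, the input dimension of 1, and the small floor inside the logarithm are assumptions made here. It only evaluates the flossing objective (squared deviation from a target of zero) and does not differentiate it.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def lyapunov_spectrum(W, V, xs, h0, k, rng):
    """First k Lyapunov exponents of h_{s+1} = W relu(h_s) + V x_{s+1} via QR reorthonormalization."""
    N = W.shape[0]
    h = h0.copy()
    Q = np.linalg.qr(rng.normal(size=(N, k)))[0]   # random orthonormal start basis
    log_r = np.zeros(k)
    for x in xs:
        D = W * (h > 0.0)          # one-step Jacobian W diag(relu'(h_s)), evaluated before the update
        h = W @ relu(h) + V @ x
        Q, R = np.linalg.qr(D @ Q)
        # floor guards against log(0) if the ReLU Jacobian is rank-deficient (assumption of this sketch)
        log_r += np.log(np.maximum(np.abs(np.diag(R)), 1e-300))
    return log_r / len(xs)

rng = np.random.default_rng(1)
N, T = 32, 500
g = rng.uniform(0.0, 1.0)                            # gain g ~ Unif(0, 1)
W = rng.normal(-0.1, g / np.sqrt(N), size=(N, N))    # W_ij ~ N(-0.1, g^2 / N)
V = rng.normal(0.0, 1.0, size=(N, 1))
xs = rng.normal(0.0, 1.0, size=(T, 1))               # i.i.d. Gaussian input stream
lam = lyapunov_spectrum(W, V, xs, rng.normal(size=N), k=1, rng=rng)
flossing_loss = np.sum((lam - 0.0) ** 2)             # squared distance to the target exponent 0
print("lambda_1 estimate:", lam[0], "flossing loss:", flossing_loss)
```

In gradient flossing this loss would be differentiated with respect to W and V, which is where the analytical QR pullback of Appendix A enters.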
+
+B.2    Additional Results for Different Numbers of Flossed Lyapunov Exponents
+In addition to the analysis in Fig 5 of the main text, we did the same analysis on networks with only preflossing
+(gradient flossing before training) and found that more Lyapunov exponents k have to be flossed to achieve 100%
+accuracy as the tasks become more difficult (Fig 7D). However, in that case, even when all N Lyapunov exponents
+were flossed (k = N), the networks were not able to solve the most difficult tasks. This seems to indicate that
+gradient flossing during training cannot be replaced by just flossing more Lyapunov exponents before training.
+
+
+C     Additional Results on Gradient Flossing Throughout Training
+We now discuss some additional results on gradient flossing throughout training. First, we analyze how gradient
+flossing affects the gradients and find that during gradient flossing, the norm of gradients that bridge many time
+steps is boosted. Moreover, subordinate singular values of the error gradient of the recurrent weights are also
+boosted, indicating that gradient flossing can increase the effective rank of the parameter update. Additionally,
+we show that if gradient flossing is continued throughout training it can be detrimental to the accuracy. Finally,
+we show that Lyapunov exponents of successfully trained networks after training for the spatial delayed XOR
+task have a simple relationship to the delay d.
+
+
+D     Gradient Flossing boosts the Gradient Norm for Long Time Horizons
+In this section, we investigate the impact of gradient flossing on the norm and structure of the gradient. It is
+important to note that the complete error gradient of backpropagation through time is composed of a summation
+of products of one-step Jacobians, reflecting the number of "loops" the error signal traverses through the recurrent
+dynamics before reaching its target. Consequently, when the singular values of the long-term Jacobian are
+smaller than 1, the influence of the shorter loops typically dominates the long-term Jacobian.
+
+[Figure 7 image omitted. Panel A: test accuracy (%) versus complexity (delay d) for k ∈ {1, 20, 40, 60, 80}.
+Panel B: test accuracy (%) versus number of flossed λi (k) for delays 30, 50, 70. Panels C and D: test accuracy (%)
+as a color scale over complexity (delay d) and number of flossed λi (k).]
+
+Figure 7: Gradient flossing for different numbers of flossed Lyapunov exponents
+A) Test accuracy for delayed temporal XOR task as a function of delay d with different numbers
+of flossed Lyapunov exponents k. B) Same data as A, but here test accuracy as a function of the number of
+flossed Lyapunov exponents k. C) Median test accuracy for delayed temporal XOR task as a function
+of delay d and k for networks with gradient flossing during training (500 steps of gradient flossing at
+epochs e ∈ {0, 100, 200, 300, 400}). D) Same as C for preflossing only. Parameters: g = 1, batch
+size b = 16, N = 80, epochs = 10^4 for delayed temporal XOR, epochs = 5000 for delayed spatial
+XOR, T = 300, gradient flossing for Ef = 500 epochs before training and during training for A, B,
+C, and only before training for D. Shaded areas are 25th and 75th percentiles, solid lines are means,
+transparent dots are individual simulations, task loss: cross-entropy between y and ŷ.
+
+
+In our tasks, we have full control over the correlation structure of the task and thus know exactly which loop
+length of backpropagation through time is necessary for finding the correct association. We were moreover
+careful in our task design not to have any additional signals in our task that might help to bridge the long time
+scale. In the case of vanishing gradients, the gradient norm is predominantly influenced by the shorter loops,
+even though the actual signal in the gradient originates solely from the loop of length d in our task. To mitigate
+the contamination of spurious signals from shorter loops and effectively extract the gradient that spans long time
+horizons, we focus on the gradient with respect to the initial conditions h0 .
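+     \frac{\partial L_t}{\partial h_0}
+       = \frac{\partial L_t}{\partial h_t} \sum_{\tau=t-l}^{t-1} \left( \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} \right) \frac{\partial h_\tau}{\partial h_0}
+       = \frac{\partial L_t}{\partial h_t} \sum_{\tau=t-l}^{t-1} \left( \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} \right) \delta_{\tau 0}
+       = \frac{\partial L_t}{\partial h_t} T_t(h_0)                                                    (18)
+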
+
+We note that the sum conveniently collapses: the only summand that contributes is the longest ’loop’, that is,
+the product of Jacobians going from 0 to t. By considering this gradient, we can therefore ensure
+that no undesired signals stemming from shorter loops interfere with the analysis. Moreover, we note that we
+use the binary cross entropy loss, which makes the derivative ∂Lt/∂ht trivial.
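The gradient with respect to the initial condition can be sketched as a backward accumulation of one-step Jacobians. The NumPy example below is illustrative only: it assumes the vanilla dynamics of Eq 9 with ϕ = tanh, a hypothetical quadratic loss on the final state in place of the cross-entropy readout loss, and the helper names `forward` and `grad_wrt_h0` are introduced here.

```python
import numpy as np

def forward(W, V, xs, h0):
    """States h_0, ..., h_t of h_{s+1} = W tanh(h_s) + V x_{s+1} (phi = tanh is an assumption)."""
    hs = [h0]
    for x in xs:
        hs.append(W @ np.tanh(hs[-1]) + V @ x)
    return hs

def grad_wrt_h0(W, hs, dL_dht):
    """dL/dh_0 = dL/dh_t T_t(h_0), accumulated backwards through the one-step Jacobians."""
    g = dL_dht.copy()
    for h in reversed(hs[:-1]):               # h_{t-1}, ..., h_0
        J = W * (1.0 - np.tanh(h) ** 2)       # J_s = W diag(phi'(h_s))
        g = g @ J
    return g

rng = np.random.default_rng(0)
N, T = 20, 10
W = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))   # gain g = 1
V = rng.normal(size=(N, 1))
xs = rng.normal(size=(T, 1))
h0 = rng.normal(size=N)
y = rng.normal(size=N)                                # hypothetical target for the toy loss

def loss(h_init):
    """Hypothetical quadratic loss on the final state, used only for illustration."""
    return 0.5 * np.sum((forward(W, V, xs, h_init)[-1] - y) ** 2)

hs = forward(W, V, xs, h0)
g = grad_wrt_h0(W, hs, hs[-1] - y)                    # dL/dh_t = h_t - y for this loss
grad_norm = np.linalg.norm(g)
print("|dL/dh0| =", grad_norm)
```

Because only the product of all t Jacobians appears, the norm of this vector shrinks or grows with the singular values of the long-term Jacobian.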
+In Fig 8 we show that gradient flossing boosts the gradient with respect to the initial conditions. Specifically, we
+compare two identical networks trained on the binary delayed temporal XOR task with a loop length of d = 70.
+One network is trained with gradient flossing at epochs e ∈ {0, 100, 200, 300, 400}, while the other is trained
+without gradient flossing.
+
+For the network without gradient flossing, the gradient norm |dL/dh0| diminishes to extremely small values
+(< 10^−6) and remains small throughout training. In contrast, for the network trained with gradient flossing, each
+episode of gradient flossing causes the norm |dL/dh0| to spike to values larger than 10^−2. These findings
+are direct evidence that gradient flossing boosts the gradient norm, which helps to bridge long time horizons in
+challenging temporal credit assignment tasks. We observe that after several episodes of gradient flossing, the
+gradient norm |dL/dh0| of the networks stays around 10^−4 and eventually rises to values around 10^−2.
+Subsequently, the test accuracy surpasses chance level (Fig 8B). We observed this temporal relationship between
+gradient norm |dL/dh0| and training success consistently across numerous network realizations (Fig 8C and D).
+These findings suggest that the gradient norm |dL/dh0| can be a good predictor of learning success, sometimes
+hundreds of epochs before the accuracy exceeds the chance level of 50%. Indeed, when depicting the gradient
+norm aligned to the last epoch where accuracy was ≤ 50%, we see for many network realizations a gradual
+growth of gradient norm over epochs before accuracy surpasses chance level (Fig 9A). Analogously, when
+plotting the accuracy as a function of epoch aligned with the last epoch with |dL/dh0| < 0.001, we observe for this
+task that the increase of gradient norm |dL/dh0| reliably precedes the epoch at which the accuracy surpasses the
+chance level (see Fig 9B). We note that when measuring the overlap of the orientation of the gradient vector
+dL/dh0 with the first covariant Lyapunov vector of the forward dynamics, we found a significant increase in overlap
+around the training epoch where the accuracy surpasses the chance level, both in networks with and without
+gradient flossing. This does not come as a surprise, as the first covariant Lyapunov vector measures the most
+unstable (or least stable) direction in the tangent space of a trajectory, and perturbations of h0 that have to
+travel over many epochs align with this direction.
+
+D.1 Gradient Flossing Boosts Effective Dimension of Error Gradient
+To further investigate the effect of gradient flossing on training, we investigated the structure of the error gradient
+and how it is changed by gradient flossing. To this end, we decompose the recurrent weight gradient dL/dW
+into a weighted sum of outer products using singular value decomposition, with singular values σi(dL/dW) (Fig 10).
+As the Lyapunov exponents are the time-averaged logarithms of the singular values of the asymptotic long-term
+Jacobian Tt (hτ ), this allows us to directly link the effect of pushing Lyapunov exponents toward zero during
+gradient flossing to the structure of the error gradient of the recurrent weights, as they are intimately linked:
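+     \frac{\partial L_t}{\partial W}
+       = \frac{\partial L_t}{\partial h_t} \sum_{\tau=t-l}^{t-1} \left( \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} \right) \frac{\partial h_\tau}{\partial W}
+       = \frac{\partial L_t}{\partial h_t} \sum_{\tau} T_t(h_\tau) \frac{\partial h_\tau}{\partial W}                                  (19)
+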
+
+We again note that different ’loops’ contribute to the total gradient expression and the Lyapunov exponents only
+characterize the longest loop. Further, we note that in our controlled tasks, depending on delay d, only a few of the
+summands are relevant for solving the task. We thus expect the relevant gradient summand that carries important
+signals about the task to be contaminated by summands of both shorter and longer chains, which contribute
+irrelevant fluctuations.
+The singular values of the recurrent weight gradient, σi(dL/dW), as a function of training epoch reveal that the
+subordinate singular values σ20 and σ40 exhibit peaks at the times of gradient flossing,
+while the first singular value σ1 only shows a slight peak (Fig 10A). This indicates that gradient flossing
+increases the effective rank of the recurrent weight gradient dL/dW. In other words, gradient flossing facilitates
+high-dimensional parameter updates. Our interpretation is that, as gradient flossing pushes Lyapunov exponents
+to zero, the different summands in the total gradient contribute more equitably, as long loops have neither a
+dominant contribution (which would happen for exploding gradients) nor a vanishing contribution (which would
+happen for vanishing gradients). This way, the sum of gradient terms has a higher effective rank.
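To make the notion of effective rank concrete, one can inspect the singular values of a gradient matrix directly. The sketch below uses the participation ratio of the squared singular values as an effective-rank proxy; this particular measure is an assumption made here for illustration, since the text reports individual singular values (σ1, σ20, σ40) rather than a single rank statistic. The synthetic matrices stand in for dL/dW and are not data from the experiments.

```python
import numpy as np

def singular_values(G):
    """Singular values of a gradient matrix, sorted in descending order."""
    return np.linalg.svd(G, compute_uv=False)

def participation_ratio(s):
    """(sum s_i^2)^2 / sum s_i^4: equals 1 for a rank-1 matrix and n for n equal singular values."""
    s2 = s ** 2
    return s2.sum() ** 2 / (s2 ** 2).sum()

rng = np.random.default_rng(0)
N = 80

# Synthetic stand-ins for dL/dW: a sum of outer products with decaying or comparable weights
u = rng.normal(size=(N, N))
v = rng.normal(size=(N, N))
weights_fast = 0.5 ** np.arange(N)    # one summand dominates (vanishing-gradient-like)
weights_flat = np.ones(N)             # summands contribute comparably
G_fast = (u * weights_fast) @ v.T
G_flat = (u * weights_flat) @ v.T

pr_fast = participation_ratio(singular_values(G_fast))
pr_flat = participation_ratio(singular_values(G_flat))
print("effective rank, dominated sum:", pr_fast)
print("effective rank, balanced sum: ", pr_flat)
```

A sum whose terms contribute comparably yields a larger participation ratio than a sum dominated by a few terms, which mirrors the interpretation given above.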
+In contrast, without gradient flossing, the subordinate singular values (σ20 and σ40 in Fig 10A) rapidly diminish
+to extremely small values over training epochs and remain very small throughout training. Note, however, that the
+leading singular values σ1 are of comparable size irrespective of whether gradient flossing was performed or not.
+We note that, similar to the gradient norm of the loss with respect to the initial condition |dL/dh0|, the subordinate
+singular values seem to predict when the test accuracy of networks with gradient flossing grows beyond chance
+level (Fig 10B). We confirmed this in multiple other network realizations and give here another example where the
+accuracy grows beyond chance only later during training (Fig 11).
+
+D.2 Gradient Flossing Throughout Training Can Be Detrimental
+We find that gradient flossing continued throughout all training epochs can be detrimental to performance
+(Fig 12). We demonstrate this again in the binary delayed temporal XOR task. We compare three different
+conditions: we either floss throughout training, every 100 training epochs for 500 flossing epochs (red),
+floss only early during training, at training epochs e ∈ {0, 100, 200, 300, 400} (green), or do not floss at
+all (blue).
+
+[Figure 8 image omitted. Panels A and C: gradient norm |dL/dh0| versus epochs (0 to 3000), logarithmic scale in A
+and linear scale in C. Panels B and D: accuracy (%) versus epochs (0 to 3000). Legend in A and B: with flossing
+during training, without flossing.]
+Figure 8: Gradient flossing boosts norm of long-term Jacobian. A) Gradient norm |dL/dh0| as
+a function of training epochs for networks without flossing (blue) and networks with flossing
+during training (orange). Error gradient norm is boosted after gradient flossing at epochs e ∈
+{0, 100, 200, 300, 400, 500}. In networks without gradient flossing, the gradient norms |dL/dh0| are
+much smaller overall. One out of ten random network realizations with solid line, the other 9 with
+transparent line. B) Accuracy as a function of epoch, same depiction and network realizations as in
+A. Note that accuracy of networks with gradient flossing grows beyond chance level approximately
+when the gradient norm |dL/dh0| becomes macroscopically large. C) Same as A in linear scale. Mean
+final test loss as a function of task difficulty (delay d) for delayed copy task. Different colors are
+different network realizations with gradient flossing during training. Black lines are without any
+gradient flossing. D) Accuracy as a function of epochs, same colors as in C. Note that for all network
+realizations the moment at which the gradient norm |dL/dh0| becomes macroscopically large coincides with
+the moment the accuracy is beyond chance level. Parameters: g = 1, batch size b = 16, N = 80,
+epochs = 10^4, T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before
+training. Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ).
+
+
+We observe that after every episode of gradient flossing, the accuracy drops close to the chance level of 50%
+(Fig 12A, red line). Between flossing episodes, the accuracy quickly recovers but never reaches 100%.
+Simultaneously, when the accuracy drops, the test error jumps up (Fig 12B). We also observed that the gradient
+norm |dL/dh0| is initially boosted by gradient flossing, but becomes nearly indistinguishable between conditions
+once the gradient norm |dL/dh0| becomes
+macroscopically large (Fig 12C). This suggests that once gradient flossing facilitates signal propagation across
+long time horizons and the network picks up the relevant gradient signal, further gradient flossing can be harmful
+to the actual task execution. We hypothesize that there might be (at least for the Vanilla networks considered
+here) a trade-off between the ability to bridge long time scales, which seems to require one or several Lyapunov
+exponents of the forward dynamics close to zero, and nonlinear task requirements, which require at least a
+fraction of the units to be in the nonlinear regime of the nonlinearity ϕ, where ϕ′(x) < 1. It would be an
+interesting future research avenue to further investigate this potential trade-off also in other network architectures.
+
+D.3             Lyapunov Exponents after Training With and Without Gradient Flossing
+In Fig 13, we show the first (Fig 13A, B) and the tenth (Fig 13C, D) Lyapunov exponent after training on
+the spatial delayed XOR task both with and without gradient flossing. For successful networks with
+gradient flossing, we find a systematic relationship between the first Lyapunov exponent and the delay d, which can be
+fitted approximately by λ1 (d) = −0.2 exp(−0.03 d). Unsuccessful networks with accuracy at chance level
+have a much smaller largest Lyapunov exponent. The same seems to hold true for the tenth Lyapunov exponent.
+In a previous study [57], a similar trend was observed, albeit in the context of a task that did not possess an
+
+
+Figure 9: Increase of gradient norm precedes epoch when accuracy exceeds chance level
+A) Gradient norm |dL/dh0| as a function of training epochs for 20 network realizations with flossing
+during training. Epochs are aligned to last epoch where accuracy is ≤ 50%. B) Same task and
+simulations as in A, but here accuracy (%) as a function of epoch, for 20 network realizations with
+flossing during training. Epochs are aligned to the last epoch with |dL/dh0| < 0.001. Different colors
+are different network realizations. Parameters: g = 1, batch size b = 16, N = 80, epochs = 10^4,
+T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents early during training at
+training epochs e ∈ {0, 100, 200, 300, 400}. Task: binary delayed XOR, delay d = 70, loss: cross
+entropy(y, ŷ).
+
+
+analytically tractable temporal correlation structure, which might partially explain the less conclusive results.
+It is important to note that the numerical evaluation of Lyapunov exponents in recurrent LSTM networks in
+[57] was based solely on the N × N Jacobian of the memory state. From a dynamical systems standpoint, a
+2N × 2N Jacobian matrix that takes the interactions between both memory and cell states into account is
+required [9, 12, 58].
+
+
+E Gradient Flossing for Linear Network
+We provide code for gradient flossing in linear networks here. We find that gradient flossing also helps to train
+linear networks on tasks with many time steps that can be solved by linear networks, for example the copy task,
+but not on tasks that require a nonlinear input-output operation, like the temporally delayed XOR task. A full
+analytical description of gradient flossing for linear networks would be a promising avenue for future research, as
+networks with linear dynamics can still have nonlinear learning dynamics [21]. However, this is beyond the scope
+of the presented work.
+
+
+F      Computational Complexity of Gradient Flossing
+We present here a more in-depth scaling analysis of the computational cost of gradient flossing. There are three
+main contributors to the computational cost (Table 1): First, the RNN step, which has a computational complexity
+of O(N^2 b) per time step, where N is the dimension of the recurrent network state (which in the case of Vanilla
+networks equals the number of units) and b is the batch size, both in the forward and backward pass. Second, the
+Jacobian step, which scales with O(N^2 k) per time step, where k is the number of flossed Lyapunov exponents.
+Third, the QR decomposition, which scales with O(N k^2), where k is the number of Lyapunov exponents
+considered.
+Together, this results in a total amortized cost of O(N^2 b T) per training epoch, where T is the number of
+training time steps, and a total amortized cost per flossing epoch of O(N^2 Tf (1 + k/tONS + k)), where Tf is
+the number of flossing time steps.
+
+
+Figure 10: Gradient flossing decreases condition number of recurrent weight error gradient
+A) Singular values σi (dL/dW) of the recurrent weight gradient as a function of training epochs for singular
+values i ∈ {1, 20, 40} for networks without gradient flossing (blue) and early training gradient flossing
+(green). At epochs of gradient flossing, the subordinate singular values σ20 and σ40 are peaked, while
+the first singular value σ1 has only a slight peak. This indicates that gradient flossing increases the
+effective rank of the recurrent weight gradient dL/dW. B) Accuracy (%) as a function of training epochs.
+Note that accuracy of networks with gradient flossing grows beyond chance level approximately
+when the subordinate singular values σ20 and σ40 peak, which enables
+high-dimensional parameter updates. Parameters: g = 1, batch size b = 16, N = 80, epochs = 10^3,
+T = 300, gradient flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before training.
+Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ).
+
+
+
+
+In the case of preflossing, the total computational cost thus scales with O(N^2 [E b T + Ep Tf (1 + k/tONS + k)]),
+where E is the number of training epochs and Ep is the number of preflossing epochs.
+For gradient flossing during training (assuming that preflossing is also done), the amortized cost scales with
+O(N^2 [E b T + Ep Tp + Ef Tf (1 + k/tONS + k)]), where Ef is the total number of flossing epochs during
+training.
+Empirically, we find that both the number of preflossing epochs Ep and the number of flossing episodes Ef necessary for
+training success are much smaller than the total number of training epochs E. For example, the preflossing for
+500 epochs in the numerical experiment of Fig 3 took ∼ 37 seconds, while the overall training on 10000 training
+epochs with batch size b = 16 took ∼ 1680 seconds. Thus only approximately 2.2% of the total training time
+was spent on gradient flossing. Moreover, Tp can be smaller than T; it only has to be long enough that the
+temporal correlations in the task can be bridged. For the tasks discussed in the manuscript, this would be
+the delay d. It remains an important challenge to infer a suitable number of flossing time steps Tf for tasks
+with unknown temporal correlation structure.
+It would also be interesting to investigate how the resources (CPU hours, wall-clock time, flops, Joule, CO2 emissions) spent
+on gradient flossing trade off against those spent on training networks with larger N. For this, we would
+suggest first finding the smallest network that, on median, successfully trains on a binary temporal XOR task for
+a fixed given delay d, and measuring the computational resources involved in training it, e.g. in terms of CPU
+hours, and then comparing it to a network with gradient flossing. This would be a promising analysis but is beyond
+our current computational budget. We will start such experiments and might be able to provide results during the
+reviewer period.
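+
+As a rough illustration of the per-epoch scaling expressions above, the following sketch evaluates them as
+operation-count proxies. The function names are hypothetical, the constants hidden by the O-notation are ignored,
+and tONS = 1 and Tf = T are assumptions, so the numbers are relative proxies and not wall-clock predictions.
+The last lines only redo the arithmetic behind the 2.2% wall-clock figure quoted above.
+
```python
def training_epoch_cost(N, b, T):
    """Proxy for the O(N^2 b T) cost of one training epoch (constants omitted)."""
    return N**2 * b * T


def flossing_epoch_cost(N, k, T_f, t_ons):
    """Proxy for the O(N^2 T_f (1 + k/t_ons + k)) cost of one flossing epoch."""
    return N**2 * T_f * (1 + k / t_ons + k)


# Parameters as quoted in the figure captions; t_ons = 1 and T_f = T are assumptions.
N, b, T, k = 80, 16, 300, 75
train_cost = training_epoch_cost(N, b, T)
floss_cost = flossing_epoch_cost(N, k, T_f=T, t_ons=1)
print("training epoch proxy:", train_cost)
print("flossing epoch proxy:", floss_cost)

# Wall-clock fraction from the measured timings quoted in the text.
wall_clock_fraction = 37 / 1680
print(f"wall-clock fraction spent on preflossing: {wall_clock_fraction:.1%}")
```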
+
+
+Figure 11: Gradient flossing decreases condition number of recurrent weight error gradient
+Same as Fig 10 for a different network realization.
+
+
+
+G         Additional controls
+
+We also investigate the effects of gradient flossing during the training with orthogonal weight initializations
+and confirm our finding that gradient flossing improves trainability on tasks that have long time horizons to
+bridge. Moreover, we find that gradient flossing during training can further improve trainability. We replicated
+the two more challenging tasks from the main paper (Fig 4) for orthogonal initialization with variable temporal
+complexity and performed gradient flossing either both during and before training, only before training, or not at
+all.
+Fig 14A shows the test accuracy for Vanilla RNNs with orthogonal initialization trained on the delayed temporal
+XOR task yt = xt−d/2 ⊕ xt−d with random Bernoulli process x ∈ {0, 1}. The accuracy of orthogonal Vanilla
+RNNs falls to chance level for d ≥ 40 (Fig 14C). With gradient flossing before training, the trainability can
+be improved, but still falls close to chance level for d = 70. In contrast, for initially orthogonal networks with
+gradient flossing during training, the accuracy is improved to > 80% at d = 70. In this case, we preflossed for
+500 epochs before task training and again after 500 epochs of training on the task. In Fig 14B, D the networks
+have to perform the nonlinear XOR operation yt = x^1_{t−d} ⊕ x^2_{t−d} ⊕ x^3_{t−d} on a three-dimensional binary input
+signal x^1, x^2, and x^3 and generate the correct output with a delay of d steps, identical to Fig 4 in the main text.
+Again, we observe, similar to networks with Gaussian initialization, that flossing before training improves the
+performance compared to baseline, but starts failing for long delays d > 60. In contrast, orthogonal networks
+that are also flossed during training can solve even more difficult tasks (Fig 14D). We note that for Fig 14B and
+D, we trained the networks for only 5000 epochs, compared to 10000 epochs for networks with random Gaussian
+initialization, because with 10000 epochs both networks with gradient flossing only before training and networks with
+gradient flossing before and during training were able to bridge d = 70. These results suggest that orthogonal
+initialization seems to slightly improve performance for tasks with long time horizons to bridge and that gradient
+flossing can additionally boost the performance. Thus orthogonal initialization and gradient flossing seem to go
+well together. It would be interesting to study whether orthogonal initialization also reduces the number of gradient
+flossing steps necessary to improve performance.
+
+
+
+H         Additional Details on Training Tasks
+
+In this section, we provide a more rigorous definition of the tasks used for training, as discussed in Section 3:
+
+
+Figure 12: Gradient flossing throughout training can be detrimental to learning A) Accuracy (%) as a
+function of training epochs for binary temporal delayed XOR task for gradient flossing throughout
+training every 100 training epochs (red). Accuracy drops down close to chance level every time after
+gradient flossing but recovers quickly between. Same for only 5 episodes of gradient flossing at
+epochs e ∈ {0, 100, 200, 300, 400} (green) and no flossing at all (blue). B) Test error as a function
+of training epochs. C) Gradient norm |dL/dh0| as a function of training epochs for networks without
+gradient flossing (blue) and networks with flossing throughout training (red) and early training
+gradient flossing (green). Error gradient norm is boosted after each gradient flossing. In networks
+without gradient flossing, the gradient norms |dL/dh0| are much smaller overall. Parameters: g = 1,
+batch size b = 16, N = 80, epochs = 10^4, T = 300, gradient flossing for Ef = 500 epochs on
+k = 75 Lyapunov exponents before training. Task: binary delayed XOR, delay d = 70, loss: cross
+entropy(y, ŷ).
+
+
+H.1        Copy task
+For the copy task, the target network readout at time t is yt = xt−d , where d denotes the delay. We chose the
+input to be sampled i.i.d. from a uniform distribution between 0 and 1.
+
+H.2        Temporal XOR task
+The temporal XOR task requires the target network readout yt at time t to be computed as follows:
+                                              yt = |xt−d/2 − xt−d |                                            (20)
+                                                                                  2
+where again d denotes a time delay of d time steps. In the case of x ∈ {0, 1} and y ∈ {0, 1}, the output yt
+follows the truth table of the XOR digital logic gate (Table 2). Thus, the function f (xa , xb ) = |xa − xb | can be
+seen as an analytical representation of the XOR gate. It is important to note that f (x, 0) = x only for x ≥ 0,
+and that this task requires a nonlinearity. An implementation can easily be constructed analytically: for example,
+using two rectified linear units ϕ(x) = max(x, 0), the output can be constructed as
+                               f (xa , xb ) = |xa − xb | = ϕ(xa − xb ) + ϕ(xb − xa ).                          (21)
+Together with a delay line to transmit the signal xt−d over time, this can solve the task.
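+
+The construction in Eq 21 can be checked numerically. The sketch below is a minimal illustration and not the
+code used for the experiments: the helper names xor_relu and temporal_xor_targets are hypothetical, targets are
+only produced for t ≥ d, and an even delay d is assumed so that d/2 is an integer.
+
```python
import numpy as np


def relu(x):
    """Rectified linear unit phi(x) = max(x, 0)."""
    return np.maximum(x, 0)


def xor_relu(xa, xb):
    """|xa - xb| written as the sum of two rectified linear units (Eq 21)."""
    return relu(xa - xb) + relu(xb - xa)


def temporal_xor_targets(x, d):
    """Targets y_t = |x_{t-d/2} - x_{t-d}| (Eq 20) for t >= d; d is assumed even."""
    half = d // 2
    t = np.arange(d, len(x))
    return xor_relu(x[t - half], x[t - d])


# For binary inputs the construction reproduces the XOR truth table.
for xa in (0, 1):
    for xb in (0, 1):
        assert xor_relu(xa, xb) == (xa ^ xb)

rng = np.random.default_rng(0)       # seed chosen arbitrarily
x = rng.integers(0, 2, size=20)      # binary Bernoulli(0.5) input sequence
y = temporal_xor_targets(x, d=4)
print(x)
print(y)
```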
+
+
+Figure 13: Lyapunov exponents of trained networks with and without gradient flossing A) First
+Lyapunov exponents λ1 for Vanilla networks trained on spatial delayed XOR task as a function of
+the delay (complexity) with no gradient flossing. Color-coded is test accuracy at the end of training, where red
+corresponds to 100% accuracy and blue to chance level (50%). B) Same as A for networks with
+gradient flossing during training. Black dashed line shows that Lyapunov exponents of successfully
+trained networks can be approximated by the empirical fit λ1 (d) = −0.2 exp(−0.03 d). (Protocol
+for gradient flossing during training same as main text Fig 4B). C) Same as A for tenth Lyapunov
+exponents λ10 . D) Same as B for tenth Lyapunov exponents λ10 . Same fit as in B also describes
+λ10 . Parameters: g = 1, batch size b = 16, N = 80, epochs = 10^4, T = 300, gradient flossing for
+Ef = 500 epochs on k = 75 Lyapunov exponents before training. Task: binary spatial delayed XOR,
+loss: cross entropy(y, ŷ).
+
+
+I   Additional Background on Lyapunov Exponents of RNNs
+
+An autonomous dynamical system is usually defined by a set of ordinary differential equations dh/dt =
+F(h), h ∈ RN in the case of continuous-time dynamics, or as a map hs+1 = f (hs ) in the case of discrete-time
+dynamics. In the following, the theory is presented for discrete-time dynamical systems for ease of notation,
+but everything directly extends to continuous-time systems [43]. Together with an initial condition h0 , the
+map forms a trajectory. As a natural extension of linear stability analysis, one can ask how an infinitesimal
+perturbation h′0 = h0 + ϵu0 evolves in time. Chaotic systems are sensitive to initial conditions; almost all
+infinitesimal perturbations ϵu0 of the initial condition grow exponentially with time |ϵut | ≈ exp(λ1 t)|ϵu0 |.
+Finite-size perturbations, therefore, may lead to a drastically different subsequent behavior. The largest Lyapunov
+exponent λ1 measures the average rate of exponential divergence or convergence of nearby initial conditions:
+
+
+              \lambda_1(h_0) = \lim_{t\to\infty} \frac{1}{t} \lim_{\epsilon\to 0} \log \frac{\|\epsilon u_t\|}{\|\epsilon u_0\|}              (22)
+In dynamical systems that are ergodic on the attractor, the Lyapunov exponents do not depend on the initial
+conditions as long as the initial conditions are in the basins of attraction of the attractor. Note that it is crucial
+to first take the limit ϵ → 0 and then t → ∞, as λ1 (h0 ) would be trivially zero for a bounded attractor if the
+limits are exchanged, as lim_{t→∞} log(||ϵut|| / ||ϵu0||) is bounded for finite perturbations even if the system is chaotic. To
+measure k Lyapunov exponents, one has to study the evolution of k independent infinitesimal perturbations us
+spanning the tangent space:
+
+
+                                                   us+1 = Ds us                                                  (23)
+
+
+                                                         forward pass                                        backward pass
+    RNN dynamics                                         O(N^2 b)                                            "
+    Jacobian step                                        O(N^2 k)                                            "
+    QR step                                              O(N k^2)                                            "
+    total amortized costs per training epoch             O(N^2 b T)                                          "
+    total amortized costs per gradient flossing epoch    O(N^2 Tf (1 + k/tONS + k))                          "
+    total amortized costs of preflossing                 O(N^2 [E b T + Ep Tf (1 + k/tONS + k)])             "
+    total amortized costs flossing during training       O(N^2 [E b T + Ep Tp + Ef Tf (1 + k/tONS + k)])     "
+
+Table 1: Computational cost for gradient flossing and training of RNNs
+N denotes number of neurons, b is the batch size, T is the number of time steps in forward pass of
+training, Tf is the number of time steps in forward pass of flossing, tONS is the reorthonormalization
+interval, k is the number of flossed Lyapunov exponents, E is the number of training epochs, Ep is
+the number of preflossing epochs, Ef is the number of flossing epochs during training. Empirically,
+we find that the necessary numbers of preflossing epochs Ep and flossing episodes Ef are much smaller
+than the total number of training epochs E. Moreover, Tp can be smaller than T .
+
+                                                     Table 2: XOR
+                                  input xt−d        input xt−2d     target output yt
+                                  0                 0               0
+                                  0                 1               1
+                                  1                 0               1
+                                  1                 1               0
+
+
+where the N × N Jacobian Ds (hs ) = df (hs )/dh characterizes the evolution of generic infinitesimal perturbations
+during one step. Note that this Jacobian along the trajectory is equivalent to a stability matrix only at a
+fixed point, i.e., when hs+1 = f (hs ) = hs .
+We are interested in the asymptotic behavior, and therefore we study the long-term Jacobian
+
+                                      Tt (h0 ) = Dt−1 (ht−1 ) . . . D1 (h1 )D0 (h0 ).                        (24)
+Note that Tt (h0 ) is a product of generally noncommuting matrices. The Lyapunov exponents λ1 ≥ λ2 · · · ≥ λN
+are defined as the logarithms of the eigenvalues of the Oseledets matrix
+              \Lambda(h_0) = \lim_{t\to\infty} \left[ T_t(h_0)^\top T_t(h_0) \right]^{\frac{1}{2t}},              (25)
+
+where ⊤ denotes the transpose operation. The expression inside the brackets is the Gram matrix of the long-term
+Jacobian Tt (h0 ). Geometrically, the determinant of the Gram matrix is the squared volume of the parallelotope
+spanned by the columns of Tt (h0 ). Thus, the exponential volume growth rate is given by the sum of the
+logarithms of its first k (sorted) eigenvalues. Oseledets’ multiplicative ergodic theorem guarantees the existence
+of the Oseledets matrix Λ(h0 ) for almost all initial conditions h0 [42]. In ergodic systems, the Lyapunov
+exponents λi do not depend on the initial condition h0 . However, for a numerical calculation of the Lyapunov
+spectrum, Eq 25 cannot be used directly because the long-term Jacobian Tt (h0 ) quickly becomes ill-conditioned,
+i.e., the ratio between its largest and smallest singular value diverges exponentially with time.
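+
+The ill-conditioning of the long-term Jacobian can be illustrated with a toy example. In the sketch below, the
+constant diagonal Jacobian D is an assumption chosen purely for illustration (it is not an RNN Jacobian from the
+experiments); for this choice the condition number of Tt equals 4^t and thus grows exponentially with t.
+
```python
import numpy as np

# Toy constant Jacobian (an assumption for illustration, not an RNN Jacobian).
D = np.diag([2.0, 0.5])

T_t = np.eye(2)
cond_by_step = {}
for t in range(1, 31):
    T_t = D @ T_t                                # Eq 24: product of Jacobians
    if t % 10 == 0:
        cond_by_step[t] = np.linalg.cond(T_t)    # largest / smallest singular value

for t, c in cond_by_step.items():
    print(f"t = {t:2d}   condition number = {c:.3e}")
```
+
+Already after a few dozen steps the condition number exceeds the inverse of double-precision machine epsilon,
+which is why the exponents are obtained with repeated reorthonormalization instead of from Eq 25 directly.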
+
+
+J      Algorithm for Calculating Lyapunov Spectrum of Rate Networks
+For calculating the first k Lyapunov exponents, we exploit the fact that the growth rate of an m-dimensional
+infinitesimal volume element is given by λ^(m) = Σ_{i=1}^{m} λi . Therefore, λ1 = λ^(1), λ2 = λ^(2) − λ1 ,
+λ3 = λ^(3) − λ1 − λ2 , . . . [44]. The volume growth rates can be obtained via QR-decomposition.
+
+
+Figure 14: Gradient flossing before and during training improves trainability for orthogonal nets
+(orthogonal weight initialization; left column: delayed temporal XOR, right column: delayed spatial XOR)
+A) Test accuracy (%) as a function of epochs for orthogonally initialized vanilla RNNs trained on delayed temporal binary XOR
+task yt = xt−d/2 ⊕ xt−d with gradient flossing during training (green), preflossing (orange), and
+with no gradient flossing (blue) for d = 70. Solid lines are mean, transparent thin lines are individual
+network realizations. B) Same as A for delayed spatial XOR task with yt = x^1_{t−d} ⊕ x^2_{t−d} ⊕ x^3_{t−d} .
+C) Test accuracy as a function of task difficulty (delay d) for delayed temporal XOR task. D) Test
+accuracy as a function of task difficulty (delay d) for delayed spatial XOR task. Parameters: g = 1,
+batch size b = 16, N = 80, epochs = 10^4 for delayed temporal XOR, epochs = 5000 for delayed
+spatial XOR, T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before training
+and during training for green lines, and only before training for orange lines. Shaded areas are 25%
+and 75% percentiles, solid lines are means, transparent dots are individual simulations, task loss is
+cross-entropy between y, ŷ.
+
+
+First, we evolve an initially orthonormal system Qs = [q^1_s , q^2_s , . . . , q^k_s ] in the tangent space along the trajectory
+using the Jacobian Ds :
+                                        \tilde{Q}_{s+1} = D_s Q_s                                                  (26)
+A continuous system can be transformed into a discrete system by considering a stroboscopic representation,
+where the trajectory is only considered at certain discrete time points. We use here the notation of discrete
+dynamical systems, where this corresponds to performing the product of Jacobians along the trajectory, Q̃s+1 =
+Ds Qs . We study the discrete network dynamics in the limit of small time step ∆t → 0 and for discrete time
+∆t = 1. The notation can be readily extended to continuous systems [43].
+Second, we extract the exponential growth rates using the QR-decomposition,
+                                        \tilde{Q}_{s+1} = Q_{s+1} R_{s+1},
+
+which uniquely decomposes Q̃s+1 into an orthonormal matrix Qs+1 of size N × k, so that Q⊤s+1 Qs+1 = 1k×k ,
+and an upper triangular matrix Rs+1 of size k × k with positive diagonal elements. Geometrically, Qs+1
+describes the rotation of Qs caused by Ds and the diagonal entries of Rs+1 describe the stretching and shrinking
+of the columns of Qs , while the off-diagonal elements represent the shearing. Fig 15 visualizes Ds and the
+QR-decomposition for k = 2.
+The Lyapunov exponents are given by time-averaged logarithms of the diagonal elements of Rs :
+      \lambda_i = \lim_{t\to\infty} \frac{1}{t} \log \prod_{s=1}^{t} R^{s}_{ii} = \lim_{t\to\infty} \frac{1}{t} \sum_{s=1}^{t} \log R^{s}_{ii} .      (27)
+
+
+
+[Figure 15 diagram: the columns $q_s^1, q_s^2$ are mapped by $D_s$ to $\tilde{q}_{s+1}^1, \tilde{q}_{s+1}^2$; the QR step yields $q_{s+1}^1, q_{s+1}^2$ with stretching factors $R^{s+1}_{11}$ and $R^{s+1}_{22}$.]
+
+Figure 15: Geometric illustration of Lyapunov spectrum calculation. An orthonormal matrix
+$Q_s = [q_s^1, q_s^2, \ldots, q_s^m]$, whose columns are the axes of a k-dimensional cube, is rotated and distorted
+by the Jacobian $D_s$ into a k-dimensional parallelotope $\tilde{Q}_{s+1} = D_s Q_s$ embedded in $\mathbb{R}^N$. The
+figure illustrates this for k = 2, in which case the columns of $\tilde{Q}_{s+1}$ span a parallelogram, which
+can be divided into a right triangle and a trapezoid and rearranged into a rectangle. Thus, the
+area of the gray parallelogram is the same as that of the orange rectangle. The QR-decomposition
+reorthonormalizes $\tilde{Q}_{s+1}$ by decomposing it into the product of an orthonormal matrix
+$Q_{s+1} = [q_{s+1}^1, q_{s+1}^2, \ldots, q_{s+1}^m]$ and the upper-triangular matrix $R^{s+1}$. $Q_{s+1}$ describes the rotation of $Q_s$
+caused by $D_s$. The diagonal entries of $R^{s+1}$ give the stretching/shrinking along the columns of
+$Q_{s+1}$; thus the volume of the parallelotope formed by the first m columns of $\tilde{Q}_{s+1}$ is given by
+$V_m = \prod_{i=1}^{m} R^{s+1}_{ii}$. The time-averaged logarithms of the diagonal elements of $R^s$ give the Lyapunov
+spectrum: $\lambda_i = \lim_{t_{\mathrm{sim}} \to \infty} \frac{1}{t_{\mathrm{sim}}} \log \prod_{s=1}^{t_{\mathrm{sim}}} R^s_{ii} = \lim_{t_{\mathrm{sim}} \to \infty} \frac{1}{t_{\mathrm{sim}}} \sum_{s=1}^{t_{\mathrm{sim}}} \log R^s_{ii}$.
+
+
+Note that the QR-decomposition does not need to be performed at every simulation step, just sufficiently often,
+i.e., once every $s_{\mathrm{ONS}}$ steps such that $\tilde{Q}_{s+s_{\mathrm{ONS}}} = D_{s+s_{\mathrm{ONS}}-1} \cdot D_{s+s_{\mathrm{ONS}}-2} \cdots D_s \cdot Q_s$ remains well-conditioned
+[44]. An appropriate reorthonormalization interval $s_{\mathrm{ONS}} = t_{\mathrm{ONS}}/\Delta t$ thus depends on the condition number, the
+ratio of the largest to the smallest singular value:
+          $\kappa_2(\tilde{Q}_{s+s_{\mathrm{ONS}}}) = \kappa_2(R^{s+s_{\mathrm{ONS}}}) = \frac{\sigma_1(R^{s+s_{\mathrm{ONS}}})}{\sigma_m(R^{s+s_{\mathrm{ONS}}})} = \frac{R^{s+s_{\mathrm{ONS}}}_{11}}{R^{s+s_{\mathrm{ONS}}}_{mm}}.$                   (28)
+
+An initial transient should be disregarded in the calculation of the Lyapunov spectrum because h first has to
+converge towards the attractor and Q has to converge to the unique eigenvectors of the Oseledets matrix (Eq 25)
+[54]. A simple example of this algorithm in pseudocode is:
+
+Algorithm 2 Jacobian-based algorithm for Lyapunov spectrum
+  initialize h, Q
+  evolve h until it is on attractor (avoid initial transient)
+  evolve Q until it converges to the eigenvectors of the backward Oseledets matrix
+  set γi = 0
+  for t = 1 → T do
+       h ← f (h)
+       D ← df/dh
+       Q ← D · Q
+       if t ≡ 0 (mod sONS ) then
+           Q, R ← qr(Q)
+           γi += log(Rii )
+       end if
+  end for
+  λi = γi /T
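+
+The steps of Algorithm 2 can be sketched in plain Python as follows. This is a minimal illustration under stated
+assumptions, not the author's implementation: the helper names (matmul, qr_gram_schmidt, lyapunov_spectrum) are
+hypothetical, the QR step uses modified Gram-Schmidt instead of a library routine, the Jacobian is evaluated at the
+state before the update, and the warm-up of Q is folded into the main loop, which adds a bias of order 1/T.
+
```python
import math


def matmul(A, B):
    """Multiply matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def qr_gram_schmidt(M):
    """Thin QR by modified Gram-Schmidt; the diagonal of R is positive."""
    n_cols = len(M[0])
    cols = [list(c) for c in zip(*M)]
    q_cols = []
    R = [[0.0] * n_cols for _ in range(n_cols)]
    for j, v in enumerate(cols):
        for i, q in enumerate(q_cols):
            R[i][j] = sum(a * b for a, b in zip(q, v))
            v = [a - R[i][j] * b for a, b in zip(v, q)]
        R[j][j] = math.sqrt(sum(a * a for a in v))
        q_cols.append([a / R[j][j] for a in v])
    Q = [list(row) for row in zip(*q_cols)]
    return Q, R


def lyapunov_spectrum(step, jacobian, h0, m, n_steps, s_ons=1, n_transient=0):
    """Estimate the first m Lyapunov exponents of the map h <- step(h).

    n_steps should be a multiple of s_ons, otherwise the last partial
    product of Jacobians is not accounted for.
    """
    h = list(h0)
    for _ in range(n_transient):        # discard the initial transient of h
        h = step(h)
    n = len(h)
    Q = [[1.0 if r == c else 0.0 for c in range(m)] for r in range(n)]
    gamma = [0.0] * m
    for t in range(1, n_steps + 1):
        D = jacobian(h)                 # assumption: Jacobian at the pre-update state
        h = step(h)
        Q = matmul(D, Q)
        if t % s_ons == 0:              # reorthonormalize every s_ons steps
            Q, R = qr_gram_schmidt(Q)
            for i in range(m):
                gamma[i] += math.log(R[i][i])
    return [g / n_steps for g in gamma]


# Toy check on a linear map h <- A h, whose exponents are log|eigenvalues of A|.
A = [[2.0, 1.0], [1.0, 1.0]]


def linear_step(h):
    return [sum(a * x for a, x in zip(row, h)) for row in A]


def linear_jacobian(h):
    return A


lams = lyapunov_spectrum(linear_step, linear_jacobian, [1.0, 0.0], m=2, n_steps=200)
lams_sparse = lyapunov_spectrum(linear_step, linear_jacobian, [1.0, 0.0], m=2,
                                n_steps=200, s_ons=5)
```
+
+For this toy map the two exponents should approach ± log((3 + √5)/2), and reorthonormalizing every five steps
+instead of every step should leave the estimate essentially unchanged, since A^5 is still well-conditioned.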
+
+
+It is guaranteed that under general conditions initially random orthonormal systems will exponentially converge
+towards a unique basis that is given by the eigenvectors of the Oseledets matrix Eq 25 [54]. A minimal example
+of this algorithm in pseudocode is shown in appendix 3. A feasible strategy to determine the reorthonormalization
+time interval tONS is to first obtain a rough estimate of the Lyapunov spectrum using a short simulation time tsim and
+a small tONS, and then repeat with a longer simulation time and a tONS chosen from that rough estimate.
+Another strategy is to first iteratively adapt tONS on a short simulation
+run until the condition number is acceptable. Note that a variety of other methods exist to
+estimate the Lyapunov spectrum [14, 43, 68, 69].
+
+
+K            Convergence of Lyapunov Exponents of RNNs
+In Fig. 16, we demonstrate the convergence of the Lyapunov exponents. We show the estimate of the Lyapunov
+exponents λi for i = 1, 20, 60, 80 for different initial conditions but identical network realization.
+
+[Figure 16 plot: Lyapunov exponents λi (1/steps) versus simulation steps, with a logarithmic horizontal axis from 10^0 to 10^2.]
+
+Figure 16: Convergence of Lyapunov exponents. Convergence of selected Lyapunov exponents
+λi (i = 1, 20, 60, 80) with simulation time, for ten different initial conditions of the same network
+realization, with σ = 1 and g = 1. (Other parameters: N = 80, tsim = 100 steps, tONS = 1).
+
+
+
+
+
+\ No newline at end of file
diff --git a/papers/txt/gram2025_generative_recursive.txt b/papers/txt/gram2025_generative_recursive.txt
new file mode 100644
index 0000000..d5f299d
--- /dev/null
+++ b/papers/txt/gram2025_generative_recursive.txt
@@ -0,0 +1,1532 @@
+                                                                    Generative Recursive Reasoning
+
+
+                                                                      Junyeob Baek1†∗   Mingyu Jo1†∗   Minsu Kim1,2
+
+                                                                      Mengye Ren3   Yoshua Bengio2,4   Sungjin Ahn1,3†
+arXiv:2605.19376v2 [cs.AI] 20 May 2026
+
+                                                                      1 KAIST   2 Mila – Québec AI Institute
+                                                                      3 New York University   4 Université de Montréal
+
+
+
+                                                                                          Abstract
+                                                     How should future neural reasoning systems implement extended computation?
+                                                     Recursive Reasoning Models (RRMs) offer a promising alternative to autoregres-
+                                                     sive sequence extension by performing iterative latent-state refinement with shared
+                                                     transition functions. Yet existing RRMs are largely deterministic, following a
+                                                     single latent trajectory and converging to a single prediction. We introduce Gen-
+                                                     erative Recursive reAsoning Models (GRAM), a framework that turns recursive
+                                                     latent reasoning into probabilistic multi-trajectory computation. GRAM models
+                                                     reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alterna-
+                                                     tive solution strategies, and inference-time scaling through both recursive depth
+                                                     and parallel trajectory sampling. This yields a latent-variable generative model
+                                                     supporting conditional reasoning via pθ (y | x) and, with fixed or absent inputs,
+                                                     unconditional generation via pθ (x). Trained with amortized variational inference,
+                                                     GRAM improves over deterministic recurrent and recursive baselines on structured
+                                                     reasoning and multi-solution constraint satisfaction tasks, while demonstrating an
+                                                     unconditional generation capability. https://ahn-ml.github.io/gram-website
+
+
+                                         1     Introduction
+                                         A central question for future neural reasoning systems is how extended computation should be imple-
+                                         mented. Large autoregressive models typically scale reasoning by extending a sequence-generation
+                                         process, whether intermediate computation is expressed explicitly as chain-of-thought tokens or im-
+                                         plicitly in hidden or latent representations [1–6]. A complementary direction is explored by Recursive
+                                         Reasoning Models (RRMs), which use repeated computation to refine a persistent latent state rather
+                                         than to append new elements to an output or reasoning sequence [7–9]. This approach is appealing
+                                         because it decouples reasoning depth from both parameter scale and output length: a compact model
+                                         can perform many steps of internal computation by repeatedly applying shared transition functions
+                                         over time.
+                                         Recent recursive reasoning models such as HRM [8] and TRM [9] provide early evidence for the
+                                         potential of this approach in structured reasoning. Rather than producing a solution in a single
+                                         feedforward pass, they perform extended computation through iterative latent-state refinement, deep
+                                         supervision across refinement steps, and reasoning-oriented recurrent designs such as hierarchical
+                                         latent dynamics. These features make them well suited to problems requiring constraint propagation,
+                                         state tracking, iterative correction, and multi-step inference. More broadly, they build on a principle
+                                             ∗ Equal contribution
+                                            † Correspondence to: Junyeob Baek (wnsdlqjtm@kaist.ac.kr), Mingyu Jo (mingyu.jo@kaist.ac.kr), Sungjin Ahn
+
+                                         (sungjin.ahn@kaist.ac.kr)
+
+
+                                         Preprint.
+[Figure 1 panels: an input task with Solution 1 and Solution 2 (left); (a) Deterministic RRMs and (b) GRAM (Ours) (right).]
+
+Figure 1: Comparison of Latent Reasoning Trajectories. Left: N-Queens example with two valid solutions.
+Right: Given three independent runs for latent reasoning (τ1 , τ2 , τ3 ): (a) Prior RRMs (e.g. HRM, TRM) are
+deterministic: all runs collapse to an identical trajectory, converging to a single solution and failing to explore
+alternatives, while (b) GRAM explores diverse trajectories that reach multiple
+valid solutions y1 and y2 and naturally enables parallel inference-time scaling.
+
+also explored in recurrent Transformer architectures such as Universal Transformers [10] and Looped
+Transformers [7]: shared Transformer blocks can be repeatedly applied to increase computational
+depth without increasing parameter count. Together, these models suggest that reasoning capability
+can emerge not only from scaling model size or generating longer traces, but also from the organization
+of computation itself.
+While recurrent latent-state refinement provides an appealing mechanism for efficiently increasing
+reasoning depth, depth alone is not sufficient for many reasoning problems. A capable reasoning
+system should also be able to maintain uncertainty, consider alternative hypotheses, and explore
+multiple possible solution strategies [11, 12]. This is especially important in settings where ambiguity
+or multiple valid solutions are intrinsic, and more generally in problems where a single refinement
+path may become trapped in a suboptimal reasoning trajectory. In this sense, future RRMs should be
+not only deep, in the sense of repeated refinement, but also wide, in the sense of maintaining and
+exploring multiple latent trajectories in parallel.
+Existing RRMs [7–10], however, remain fundamentally deterministic: given the same input and
+initialization, they follow a single latent trajectory and converge to a single prediction. This deter-
+ministic recursion collapses the space of plausible reasoning paths into a single attractor, leaving
+probabilistic multi-hypothesis latent reasoning largely unexplored within the RRM paradigm. This
+motivates the central question of our work: can recursive latent computation support probabilistic,
+generative, multi-hypothesis reasoning while preserving the efficiency of compact recurrent models?
+In this paper, we propose Generative Recursive reAsoning Models (GRAM), a framework that turns
+recursive latent reasoning into probabilistic multi-trajectory computation. GRAM treats the reasoning
+process itself as a stochastic latent trajectory: at each recursion step, the model samples a transition
+conditioned on the input and the current reasoning state, rather than deterministically updating to a
+single next state. Repeating this process defines a distribution over possible reasoning trajectories,
+allowing the model to maintain multiple hypotheses, explore alternative solution strategies, and scale
+inference not only by increasing recursive depth but also by sampling trajectories in parallel. From
+a probabilistic perspective, GRAM is a latent-variable generative model: it models pθ (y | x) by
+marginalizing over latent reasoning trajectories, while the same recursive process can also define an
+unconditional generative model pθ (x) when the input is fixed or absent.
+We evaluate GRAM on controlled reasoning and generation tasks that serve as probes of the ar-
+chitectural properties targeted by our formulation: recursive refinement, stochastic exploration,
+multi-solution coverage, and inference-time scaling. Given this goal, our experiments focus on
+comparisons with the most relevant deterministic recurrent and recursive latent reasoning baselines,
+including Looped Transformers, HRM, and TRM, rather than frontier-scale general-purpose LLMs
+whose training data, inference budgets, and external scaffolding are not directly comparable. Sudoku-
+Extreme [8] and ARC-AGI [13, 14] test structured reasoning under hard constraints and abstract
+transformations; N-Queens and Graph Coloring evaluate multi-solution recovery; and binarized
+MNIST [15] probes the unconditional generative interpretation.
+Our main contribution is to establish probabilistic multi-trajectory recursion as a design principle
+for future recurrent and recursive reasoning architectures. Concretely, we make three contributions.
+
+
+First, we formulate recursive reasoning as a latent-variable generative process, where solutions are
+obtained by marginalizing over stochastic reasoning trajectories. Second, we introduce width-based
+inference-time scaling, enabling inference to scale not only with recursive depth but also with the
+number of sampled latent trajectories. Third, we provide empirical evidence that this formulation
+yields the intended architectural advantages over deterministic recurrent and recursive baselines,
+improving structured reasoning, multi-solution constraint satisfaction, and unconditional generation.
+
+2     Generative Recursive Reasoning Models
+In this section, we introduce Generative Recursive reAsoning Models (GRAM), an instantiation
+of probabilistic recursive reasoning. We describe the architecture in Section 2.1 and the training
+procedure in Section 2.2, with an architecture schematic shown in Figure 2.
+
+2.1   Architecture
+
+Overview. GRAM models the conditional distribution pθ (y | x) by marginalizing over stochastic
+latent reasoning trajectories. Given an input x, GRAM first computes an embedding
+                $e_x = f_{\mathrm{enc}}(x; \theta)$,                       (1)
+which is reused throughout the entire recursive computation. Starting from a fixed initial latent state z0 ,
+the model evolves the latent state through learned stochastic transitions. The recursive computation is
+organized into two nested levels: inner and outer loops.
+
+[Figure 2 diagram labels: Encoder fenc , low-level update fL (K times), high-level update fH , Prior pθ (· | ut ),
+Posterior qϕ (· | ut , y), Decoder fdec , (A) CE loss, (B) KL Div.]
+Figure 2: GRAM Architecture. A single stochastic latent transition in the hierarchical instantiation z =
+(h, l). After K low-level refinements via fL , the high-level update fH produces a deterministic proposal ut , to
+which stochastic guidance ϵt is added: ht = ut + ϵt .
+
+At the inner level, a latent transition samples a new latent state conditioned on the previous
+latent state and the input embedding,
+                $z_t \sim p_\theta(z_t \mid z_{t-1}, e_x), \qquad t = 1, \ldots, T.$                                                   (2)
+At the end of the T transitions, the decoder produces a prediction, ŷ = arg max fdec (zT ; θ). We refer
+to the sequence of T transitions from the initial state z0 to the final state zT as a supervision step. A
+supervision step is the unit at which the decoder is invoked, and the training objective is applied, with
+gradients computed as described in Section 2.2.
+At the outer level, Nsup supervision steps are applied recursively, with the final state of one supervision
+step serving as the initial state of the next, thereby forming the full recursive computation:
+   $z_0^{(1)} \xrightarrow{T\ \text{transitions}} z_T^{(1)} = z_0^{(2)} \xrightarrow{T\ \text{transitions}} \cdots \xrightarrow{T\ \text{transitions}} z_T^{(N_{\mathrm{sup}})}$,                      (3)
+where $z_t^{(n)}$ denotes the latent state at the t-th transition of the n-th supervision step, $z_0^{(1)}$ is the
+fixed initial state, and the terminal state of one supervision step serves as the initial state of the next
+($z_0^{(n+1)} := z_T^{(n)}$). This abstract formulation can be instantiated with various recurrent Transformer
+backbones, including flat designs such as Universal Transformers and Looped Transformers [10, 7],
+as well as hierarchical designs such as HRM and TRM [8, 9].
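+
+The nesting in Equation (3) can be sketched as two loops. This is only an illustration: run_recursion and the
+scalar toy_transition are hypothetical stand-ins, and the toy transition is deterministic, whereas GRAM's
+transition is sampled.
+
```python
def run_recursion(transition, z0, n_sup, T):
    """Apply n_sup supervision steps of T transitions each.

    Returns the terminal state z_T^(n) of every supervision step, i.e. the
    states at which the decoder (and the training objective) would be applied.
    """
    z = z0
    terminal_states = []
    for _ in range(n_sup):          # outer level: supervision steps
        for _ in range(T):          # inner level: latent transitions
            z = transition(z)
        terminal_states.append(z)   # becomes z_0 of the next supervision step
    return terminal_states


def toy_transition(z):
    """Hypothetical stand-in for a latent transition (contracts towards 2)."""
    return 0.5 * z + 1.0


states = run_recursion(toy_transition, z0=0.0, n_sup=3, T=4)
```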
+Stochastic Latent Transitions. Unlike prior recursive reasoning models (RRMs) that update the
+latent state deterministically and follow a single fixed trajectory [8, 9], GRAM defines pθ (zt |
+zt−1 , ex ) as a stochastic transition, so that repeated computation induces a distribution over latent
+reasoning trajectories. Concretely, GRAM realizes this transition as a learned stochastic residual
+perturbation around a deterministic update: at each transition, the model first computes a deterministic
+update ut from zt−1 and ex , then samples a conditional perturbation from a state-dependent Gaussian,
+and adds it to ut :
+                $\epsilon_t \sim p_\theta(\epsilon_t \mid u_t) := \mathcal{N}\big(\mu_\theta(u_t),\, \sigma_\theta^2(u_t) I\big),$                                       (4)
+                $z_t = u_t + \epsilon_t.$                                                                                               (5)
+We refer to ϵt as the learnable stochastic guidance. The mean µθ (ut ) encodes a state-dependent
+direction in which the trajectory is steered, while the variance σθ2 (ut ) controls the amount of ex-
+ploration. This design allows GRAM to capture uncertainty, prevent convergence to local minima,
+and support robust exploration of the solution space without discarding the deterministic refinement
+performed by ut .
+Hierarchical Instantiation. We instantiate the latent state with two interacting components, z =
+(h, l). The high-level component h is updated once per latent transition and carries abstract reasoning
+state, while the low-level component l is updated K times within a single transition and carries
+fine-grained intermediate computation. This decomposition separates the two roles across time scales,
+with h accumulating slowly across transitions and l refined rapidly within each one.
+With this hierarchical multi-scale structure, a single transition zt−1 → zt is computed as follows.
+The low-level component is first refined for K updates, with the high-level component held fixed:
+                              lt,k = fL (ht−1 , lt,k−1 , ex ; θ),            k = 1, . . . , K,                (6)
+where lt,0 := lt−1 and we write lt := lt,K for the refined low-level component. The high-level
+component is then updated as a stochastic transition conditioned on the refined lt ,
+                $u_t = f_H(h_{t-1}, l_t; \theta),$                                                  (7)
+                $\epsilon_t \sim p_\theta(\epsilon_t \mid u_t) := \mathcal{N}\big(\mu_\theta(u_t),\, \sigma_\theta^2(u_t) I\big),$                 (8)
+                $h_t = u_t + \epsilon_t,$                                                            (9)
+and we set zt = (ht , lt ). Note that stochasticity is introduced only at the high level: the low-
+level refinement is fully deterministic, while the stochastic guidance signal ϵt acts on the slower,
+more abstract component of the latent state, where it can steer the overall reasoning trajectory
+across transitions³. Under this instantiation, the decoder reads only the high-level component, i.e.,
+fdec (zT ) = fdec (hT ). Additional architectural details are provided in Appendix B.1.
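+
+A single hierarchical transition, Equations (6) to (9), can be sketched as below. This is a toy illustration, not
+the authors' code: gram_transition and the toy_* functions are hypothetical stand-ins for the learned
+Transformer modules, states are plain lists of floats, and σθ is taken to be a scalar standard deviation.
+
```python
import math
import random


def gram_transition(h_prev, l_prev, e_x, f_L, f_H, mu, sigma, K, rng):
    """One transition z_{t-1} -> z_t for z = (h, l); returns (h_t, l_t, u_t)."""
    l = l_prev
    for _ in range(K):                   # Eq. (6): K low-level updates, h held fixed
        l = f_L(h_prev, l, e_x)
    u = f_H(h_prev, l)                   # Eq. (7): deterministic high-level proposal
    m, s = mu(u), sigma(u)
    # Eq. (8): eps ~ N(mu(u), sigma(u)^2 I), sampled by reparameterization
    eps = [m_i + s * rng.gauss(0.0, 1.0) for m_i in m]
    h = [u_i + e_i for u_i, e_i in zip(u, eps)]   # Eq. (9)
    return h, l, u


# Hypothetical toy modules, used only to make the sketch executable.
def toy_f_L(h, l, e_x):
    return [math.tanh(0.5 * a + 0.5 * b + c) for a, b, c in zip(h, l, e_x)]


def toy_f_H(h, l):
    return [0.9 * a + 0.1 * b for a, b in zip(h, l)]


def toy_mu(u):
    return [0.0 for _ in u]


def toy_sigma(u):
    return 0.1


h0, l0, e_x = [0.0, 0.0], [0.0, 0.0], [0.3, -0.2]
h_a, l_a, u_a = gram_transition(h0, l0, e_x, toy_f_L, toy_f_H, toy_mu, toy_sigma,
                                K=3, rng=random.Random(0))
h_b, l_b, u_b = gram_transition(h0, l0, e_x, toy_f_L, toy_f_H, toy_mu, toy_sigma,
                                K=3, rng=random.Random(1))
```
+
+In this sketch the two runs share the same low-level state and the same proposal ut , and differ only in the
+sampled high-level state, mirroring the statement that stochasticity enters only at the high level.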
+Modeling Unconditional Distribution. While the description so far focuses on the conditional
+setting pθ (y | x), the same recursive process can also be defined as an unconditional generative model
+pθ (x) when the input is replaced with an empty conditioning embedding. We use this formulation for
+generation tasks in Section 4.3.
+
+2.2   Training
+
+GRAM is trained to model the conditional distribution pθ (y | x), where each training example
+consists of an input x and its corresponding target y. As a probabilistic model, GRAM adopts a
+latent-variable formulation and is optimized by maximizing an evidence lower bound (ELBO) with
+respect to the generative parameters θ and variational parameters ϕ.
+Latent Variable Modeling. We model GRAM as a latent-variable probabilistic model pθ , where
+the full latent trajectory $\tau = (z_0 \to \cdots \to z_{T_{\mathrm{Total}}})$ consists of a sequence of latent variables, with
+$T_{\mathrm{Total}} = T \times N_{\mathrm{sup}}$. The conditional likelihood is defined as
+                $p_\theta(y \mid x) = \int p_\theta(y \mid \tau, x)\, p_\theta(\tau \mid x)\, d\tau,$                           (10)
+
+where x denotes the input problem and y denotes the corresponding ground-truth output.
+Direct maximum likelihood estimation of log pθ (y | x) is intractable due to the marginalization
+over latent trajectories. We therefore introduce a variational posterior qϕ (τ | x, y) and optimize the
+evidence lower bound (ELBO), jointly training θ and ϕ via variational inference:
+     $\log p_\theta(y \mid x) \ge \mathbb{E}_{q_\phi(\tau \mid x, y)}\big[\log p_\theta(y \mid \tau, x)\big] - \mathrm{KL}\big(q_\phi(\tau \mid x, y) \,\|\, p_\theta(\tau \mid x)\big).$           (11)
+
+During training, latent trajectories are sampled from the variational posterior qϕ (· | x, y), which has
+access to both the input problem x and the target output y. At inference time, where y is unavailable,
+trajectories are instead generated from the learned prior pθ (· | x).
+   ³ We also tried injecting noise into the low-level state, but found that it did not improve performance.
+
+Both the prior and the posterior are modeled as conditional Markov processes over latent states:
+ $p_\theta(\tau \mid x) = p(z_0) \prod_{t=1}^{T_{\mathrm{Total}}} p_\theta(z_t \mid z_{t-1}, x), \qquad q_\phi(\tau \mid x, y) = p(z_0) \prod_{t=1}^{T_{\mathrm{Total}}} q_\phi(z_t \mid z_{t-1}, x, y).$ (12)
+Here, z0 is a fixed initial state shared by the prior and posterior. Both transitions are implemented by
+adding reparameterized Gaussian noise ϵt after a deterministic update ut ; the posterior uses the same
+transition module as the prior, but samples from a target-conditioned noise distribution qϕ (ϵt | ut , y),
+whereas the prior uses pθ (ϵt | ut ).
+Since the two processes share the same Markov structure and all stochasticity is introduced through
+$\epsilon_{1:T_{\mathrm{Total}}}$, their trajectory distributions can be equivalently represented in noise space. Moreover,
+since GRAM decodes the output only from the terminal latent state, the likelihood term satisfies
+$p_\theta(y \mid \tau, x) = p_\theta(y \mid z_{T_{\mathrm{Total}}}, x)$. Therefore, the full trajectory-level ELBO can be written as
+ $\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi}\big[\log p_\theta(y \mid z_{T_{\mathrm{Total}}}, x)\big] - \sum_{t=1}^{T_{\mathrm{Total}}} \mathbb{E}_{q_\phi(\epsilon_{<t} \mid x, y)}\Big[\mathrm{KL}\big(q_\phi(\epsilon_t \mid u_t, y) \,\|\, p_\theta(\epsilon_t \mid u_t)\big)\Big].$ (13)
+Here, ut = fH (ht−1 , lt ) denotes the deterministic high-level update before noise injection, as defined
+in Equation (7). Since ut depends on ht−1 , which is determined by the previously sampled noise
+variables ϵ<t := (ϵ1 , . . . , ϵt−1 ), the expectation averages over these ancestral samples.
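+
+When both noise distributions are isotropic Gaussians, as in Equation (8) for the prior, each per-step KL term
+has a standard closed form. The sketch below assumes the posterior has the same isotropic form with its own mean
+and scalar standard deviation, which the text does not state explicitly; kl_isotropic_gaussians is a hypothetical
+helper name.
+
```python
import math


def kl_isotropic_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ) for list-valued means.

    sigma_q and sigma_p are positive scalar standard deviations.
    """
    d = len(mu_q)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_q, mu_p))
    return (d * math.log(sigma_p / sigma_q)
            + (d * sigma_q ** 2 + sq_dist) / (2.0 * sigma_p ** 2)
            - 0.5 * d)


kl_same = kl_isotropic_gaussians([0.2, -0.1], 0.5, [0.2, -0.1], 0.5)
kl_shift = kl_isotropic_gaussians([1.0], 1.0, [0.0], 1.0)
```
+
+The KL vanishes when posterior and prior coincide, and a unit mean shift at unit variance in one dimension
+gives 1/2.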
+Practical Implementation. In practice, following previous recursive reasoning models [8, 9], we
+train GRAM with deep supervision over Nsup consecutive supervision steps, each consisting of T
+recursive latent transitions. This provides dense learning signals along the full latent trajectory, rather
+than supervising only the final state after TTotal = T × Nsup transitions. The terminal state of each
+step is reused as the initial state of the next step.
+Following standard practice for recurrent models with long computation chains, we apply truncated
+gradient propagation [16, 17], as used in recent recursive reasoning models [8, 9, 18]. In our
+implementation, gradients are propagated only through the final transition of each supervision step,
+$z_{T-1}^{(n)} \to z_T^{(n)}$. This gives the following surrogate objective for each supervision step:
+   $\mathcal{L}_{\mathrm{GRAM}}^{(n)}(x, y; \theta, \phi) = \mathbb{E}_{q_\phi}\big[\log p_\theta(y \mid z_T^{(n)}, x)\big] - \mathrm{KL}\big(q_\phi(\epsilon_T^{(n)} \mid u_T^{(n)}, y) \,\|\, p_\theta(\epsilon_T^{(n)} \mid u_T^{(n)})\big),$ (14)
+where $z_T^{(n)}$ is the terminal state of the current supervision step n, and gradients are stopped through
+preceding states. Thus, LGRAM should be viewed as a truncated surrogate objective rather than the
+exact ELBO; it introduces a biased but memory-efficient approximation to the full ELBO. Further
+analysis of this approximation is provided in Appendix A.3, and detailed training hyperparameters
+are listed in Appendix B.2.
+
+2.3   Inference-Time Scaling
+
+GRAM supports two complementary axes of inference-time scaling: depth, by varying the number
+of recursive transitions, and width, by sampling multiple latent reasoning trajectories in parallel.
+For depth, we follow prior recursive reasoning models [8, 9] in adopting adaptive computation
+time (ACT) [8–10], which allows each trajectory to terminate at a learned halting depth (details in
+Appendix A.1). For width, the focus of this section, we draw $\{\tau^{(i)}\}_{i=1}^{N} \sim p_\theta(\tau \mid x)$ from the
+learned prior and decode each terminal state into a candidate output $\hat{y}^{(i)} = f_{\mathrm{dec}}(z_T^{(i)})$, exploring
+multiple stochastic reasoning paths simultaneously rather than extending a single trajectory.
+To select among candidates, we use either majority voting or best-of-N with a Latent Process Reward
+Model (LPRM). The LPRM is a value head vψ (zt ) trained to predict the final quality of a trajectory
+from its latent state, using a regression target r ∈ [0, 1] given by the final prediction accuracy. At
+inference time, majority voting selects the most frequent prediction, whereas LPRM-guided selection
+chooses the candidate with the highest predicted terminal value. Details of LPRM training are
+provided in Appendix A.2. Overall, this procedure improves robustness and solution quality through
+parallel exploration, without increasing the sequential recursion length.
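+
+The two selection rules can be sketched as follows. This is a minimal illustration: majority_vote and best_of_n
+are hypothetical helper names, candidates are assumed to be hashable (for example tuples), and the values list
+stands in for the LPRM's predicted terminal values, which are not computed here.
+
```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent candidate output."""
    return Counter(candidates).most_common(1)[0][0]


def best_of_n(candidates, values):
    """Return the candidate whose predicted terminal value is highest."""
    best_index = max(range(len(candidates)), key=lambda i: values[i])
    return candidates[best_index]


# Three sampled trajectories decoded into candidate outputs (toy data).
candidates = [(1, 3, 0, 2), (2, 0, 3, 1), (1, 3, 0, 2)]
values = [0.4, 0.9, 0.6]   # hypothetical value-head predictions

voted = majority_vote(candidates)
selected = best_of_n(candidates, values)
```
+
+The two rules can disagree, as in this toy data: voting favors the output reached by most trajectories, while
+value-guided selection can pick an output that only one trajectory reached.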
+
+3     Related Work
+Latent Reasoning. Latent reasoning aims to reduce the inefficiency and verbosity of explicit
+Chain-of-Thought (CoT) by shifting part or all of the reasoning process into latent or continuous
+
+
+[Figure 3 bar charts, accuracy (%):
+ Sudoku: Looped TF 61.3, HRM 55.0, TRM 87.4, GRAM (Ours) 97.0.
+ ARC-AGI-1: HRM 40.3, TRM 44.6, GRAM (Ours) 52.0, o3-mini-high 34.5, GPT 5.2 (low) 55.7, Grok-4-thinking 66.7.
+ ARC-AGI-2: HRM 5.0, TRM 7.8, GRAM (Ours) 11.1, o3-mini-high 3.0, GPT 5.2 (low) 9.7, Grok-4-thinking 16.0.
+ Legend: Recursive Models, GRAM (Ours), LLMs.]
+
+Figure 3: Performance on puzzle benchmarks. On both Sudoku-Extreme and ARC-AGI, GRAM consistently
+outperforms all deterministic recursive baselines (Looped TF, HRM, TRM), demonstrating that stochastic latent
+transitions yield substantial gains within the recursive-reasoning paradigm. Looped TF results on ARC-AGI are
+omitted due to prohibitive training cost (see Section C.1.1). Note that large reasoning model scores are included
+only as external reference points for benchmark difficulty.
+
+representations [1–6]. By avoiding token-by-token generation of intermediate steps, such representa-
+tions can make reasoning traces more compact and reduce generation overhead. Existing approaches
+instantiate this idea through hidden states, latent or soft tokens, continuous thoughts, internal rea-
+soning traces, and recursive state updates for scaling test-time computation [4, 7, 19–23, 18, 24–26].
+However, many remain organized around autoregressive sequence generation, where additional
+computation is tied to generating more tokens, latent positions, or sequential reasoning states.
+Recursive Architectures. Recursive architectures perform iterative state updates and have evolved
+from RNNs to weight-sharing Transformers with adaptive computation [7, 10, 27–32, 25]. Recent
+recursive reasoning models show that increasing inference-time depth can outperform larger static
+models [8, 9, 18, 24]. GRAM builds on this line but formulates recurrence as a probabilistic process:
+instead of following a single deterministic refinement path, it maintains stochastic latent trajectories,
+enabling multi-path exploration and generative sampling.
+Probabilistic Latent State-Space Models. Probabilistic recurrent models use stochastic latent
+transitions to capture uncertainty and multimodal dynamics, often trained with variational infer-
+ence [33–38]. They have been widely used in sequential generative modeling, video prediction,
+and model-based reinforcement learning. GRAM shares this latent state-space view but reinterprets
+stochastic dynamics as computation rather than temporal observation modeling: latent transitions
+define possible reasoning trajectories, supporting multi-hypothesis exploration and both conditional
+pθ (y | x) and unconditional pθ (x) generation.
+
+
+4                Experiments
+
+GRAM is designed as an architecture for probabilistic recursive reasoning, rather than as a general-
+purpose large language reasoning model whose training data, inference budgets, prompting strategies,
+tool use, and external scaffolding are not directly comparable. Following prior work on recurrent and
+recursive reasoning models [8, 9], we therefore evaluate GRAM on standard structured reasoning
+tasks that probe the computational properties targeted by our formulation: iterative latent refinement,
+stochastic trajectory exploration, multi-solution coverage, and inference-time scaling.
+In the following, we first evaluate structured reasoning performance on Sudoku-Extreme and ARC-
+AGI (Section 4.1). We then assess multi-solution behavior on N-Queens and Graph Coloring
+(Section 4.2). Next, we examine the unconditional generative interpretation of GRAM on binarized
+MNIST (Section 4.3). Finally, we perform ablation studies to evaluate the impact of key design
+choices (Section 4.4).
+
+4.1                  Challenging Puzzle Tasks
+
+Setup. We evaluate on Sudoku-Extreme [8], which contains 9×9 puzzles with minimal clues requiring
+extensive constraint propagation, and the ARC-AGI Challenge [13, 14], which tests abstract
+visual reasoning through few-shot pattern recognition. We compare against direct prediction
+(Transformer [39]) and recursive baselines (Looped TF [7], HRM [8], TRM [9]). Reported large reasoning
+model results [40] are included as external reference points for benchmark difficulty, rather than
+as controlled baselines, since their training and inference settings are not directly comparable to
+task-specific recursive models. For the scaling analysis, all baselines (Looped TF, HRM, TRM) are
+reproduced under identical settings following Yang et al. [7] and Jolicoeur-Martineau [9].
+
+[Figure 4 image: (Left) accuracy vs. iterations (8 to 320) for Looped TF, HRM, TRM, and GRAM (Ours) with N = 1, 5, 10, 20, 50. (Right) accuracy vs. number of solutions (2 to 18) for Looped TF, HRM, TRM, and GRAM (N=1).]
+
+Figure 4: (Left) Inference-time scaling on Sudoku-Extreme. While both TRM and GRAM benefit from
+longer recursion (x-axis), GRAM additionally scales with parallel sampling (N = number of samples). Each
+iteration corresponds to one supervision step, which amounts to K× more flat iterations in Looped TF. (Right)
+Accuracy across the number of solutions in N-Queens (8 × 8). Conventional deterministic recursive models suffer
+a sharp performance drop as the number of possible solutions increases, whereas GRAM maintains consistent
+performance.
+
+ Stochastic Guidance Improves Reasoning. Figure 3 and Table 8 summarize our main results.
+ GRAM consistently outperforms prior recursive models across all benchmarks. We attribute this
+ improvement to the fundamental difference in how reasoning trajectories are utilized. While Looped
+ TF, HRM, and TRM are restricted to learning from a single deterministic path, GRAM leverages
+ stochastic transitions to explore diverse reasoning trajectories. By training on this richer distribution
+ of solution paths, GRAM acquires more robust reasoning capabilities, allowing it to navigate complex
+ problem spaces more effectively than models constrained to a single sequential refinement process.
+ Detailed experimental results, including additional state-of-the-art methods, are provided in Appendix D.1.
+ Parallel Sampling Provides a New Test-time Scaling Axis. Figure 4 (left) shows that increasing the
+ number of parallel samples consistently improves performance across all iteration counts. Notably,
+ GRAM with N = 20 samples at 16 iterations outperforms all deterministic baselines at 320 iterations,
+ including TRM (97.0% vs 90.5%), despite a comparable computational budget. While deterministic
+ recursive models scale only through sequential refinement, GRAM leverages stochastic transitions to
+ explore multiple reasoning paths in parallel. To select the best trajectory, we employ a Latent Process
+ Reward Model (LPRM) that predicts output correctness (Section 2.3). This parallel scaling bypasses
+ the latency bottlenecks of depth-based scaling while achieving superior performance. Additional
+ analysis on the ARC-AGI Challenge is provided in Appendix D.2.
+
+ 4.2        Multi-solution Puzzle Tasks
+
+Setup. To evaluate whether GRAM can capture diverse solutions, we test on N-Queens (8 × 8,
+10 × 10) and Graph Coloring (8-vertex, 10-vertex) tasks, where multiple valid solutions exist for each
+input. We compare against direct prediction (Transformer [39]), recursive models (Looped TF [7],
+HRM [8], TRM [9]), and generative models (Autoregressive Transformer (AR), MDLM [41]). For
+N-Queens, we report accuracy (whether the output satisfies all constraints) and coverage (found /
+total valid solutions, with 20 samples). For Graph Coloring, we report conflict edges (number of
+constraint violations; lower is better) instead of accuracy. Detailed configurations are provided in
+Appendix C.2.
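+
+The metrics named in this setup can be made concrete with a short sketch. It assumes a board encoding
+(one queen column per row) and a coloring encoding (vertex to color mapping) chosen only for illustration;
+the function names are hypothetical, and how conflict counts are aggregated over a test set is not
+specified in this section.
+
```python
from typing import Dict, Hashable, Iterable, Sequence, Tuple


def is_valid_queens(cols: Sequence[int]) -> bool:
    """Check an N-Queens board where cols[r] is the queen's column in row r."""
    n = len(cols)
    if sorted(cols) != list(range(n)):
        return False  # some column is reused, skipped, or out of range
    # Queens attack along diagonals when r - c or r + c coincide.
    main_diagonals = {r - c for r, c in enumerate(cols)}
    anti_diagonals = {r + c for r, c in enumerate(cols)}
    return len(main_diagonals) == n and len(anti_diagonals) == n


def coverage(samples: Iterable[Sequence[int]],
             valid_solutions: Iterable[Sequence[int]]) -> float:
    """Fraction of the valid solutions that appear at least once among the samples."""
    valid = {tuple(solution) for solution in valid_solutions}
    if not valid:
        raise ValueError("need at least one valid solution")
    found = {tuple(sample) for sample in samples} & valid
    return len(found) / len(valid)


def conflict_edges(coloring: Dict[Hashable, Hashable],
                   edges: Iterable[Tuple[Hashable, Hashable]]) -> int:
    """Count edges whose two endpoints received the same color (lower is better)."""
    return sum(1 for u, v in edges if coloring[u] == coloring[v])


if __name__ == "__main__":
    # The 4 × 4 board has exactly two solutions; the samples find one of them.
    all_solutions = [(1, 3, 0, 2), (2, 0, 3, 1)]
    sampled = [(1, 3, 0, 2), (1, 3, 0, 2), (0, 1, 2, 3)]
    print(is_valid_queens(sampled[0]), coverage(sampled, all_solutions))
    print(conflict_edges({0: "r", 1: "r", 2: "g"}, [(0, 1), (1, 2), (0, 2)]))
```
+
+Duplicated samples do not raise coverage, which is why a model that always returns the same valid board
+can score high accuracy and low coverage at the same time.
+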
+ Deterministic Recursion Fails on Multi-Solution Tasks. Table 1 reveals that deterministic recursive
+ models structurally cannot capture multiple solutions, with coverage at most 36.1% across all tasks.
+ Figure 4 (right) further illustrates this limitation: as the number of valid solutions increases, all three
+ deterministic recursive baselines exhibit sharp accuracy degradation, whereas GRAM maintains
+ consistent performance regardless of solution count. This confirms that deterministic latent updates
+
+Table 1: Evaluation on N-Queens and Graph Coloring benchmarks. Rec. and Gen. indicate whether the
+model uses recursive computation and generative sampling, respectively. Values are mean ± standard deviation
+over runs. Accuracy: single-sample (%). Conflict: constraint-violating edges (↓). Coverage: unique valid
+solutions discovered with 20 samples (%).
+                                                N-Queens 8×8            N-Queens 10×10          Graph Coloring 8-vertex   Graph Coloring 10-vertex
+Method                   Rec.  Gen.  # Params   Accuracy    Coverage    Accuracy    Coverage    Conflict↓    Coverage     Conflict↓    Coverage
+Direct Pred (8 layers)   ✗     ✗     27M        40.4±1.1    13.7±1.1    13.6±0.5    1.6±0.2     179.3±4.0    19.9±0.2     198.7±5.0    6.7±0.1
+Direct Pred (32 layers)  ✗     ✗     100M       40.2±1.3    13.6±1.1    13.1±0.4    1.6±0.2     174.0±18.0   19.1±1.7     227.7±34.5   6.5±1.9
+Looped TF                ✓     ✗     7M         68.4±3.7    23.6±1.9    50.0±7.6    6.2±3.2     136.0±16.1   20.5±1.5     157.3±9.0    7.2±0.7
+HRM                      ✓     ✗     27M        78.7±2.9    26.7±1.3    37.4±0.3    4.7±0.1     109.7±1.5    21.8±0.3     164.3±21.6   8.9±1.7
+TRM                      ✓     ✗     7M         66.8±5.7    36.1±22.5   17.5±11.2   2.0±1.3     109.3±3.1    22.3±0.6     170.7±17.9   6.8±0.3
+AR                       ✗     ✓     10.6M      96.3±1.0    84.8±0.8    90.0±2.2    53.2±0.8    19.0±11.3    83.0±0.7     61.3±8.3     40.0±0.3
+MDLM                     ✗     ✓     12.6M      96.1±1.5    87.2±0.6    74.3±6.6    47.4±2.2    2.7±0.6      84.5±4.0     12.0±7.0     48.2±1.4
+GRAM (Ours)              ✓     ✓     10M        99.7±0.3    90.3±1.9    89.7±2.7    57.5±3.4    2.7±2.1      85.8±0.5     3.3±1.5      51.3±2.8
+
+cause mode collapse when multiple valid outputs exist for the same input. Additional coverage
+analysis is provided in Appendix D.3.
+Recursive Refinement Yields Sharper Constraint Satisfaction. While generative models (AR,
+MDLM) achieve high coverage, GRAM consistently attains higher accuracy with comparable diver-
+sity. On N-Queens, GRAM reaches 99.7% accuracy versus 96.3% (AR) and 96.1% (MDLM). The
+gap is more pronounced on Graph Coloring, where GRAM reduces conflict edges to 2.7 and 3.3 on 8-
+and 10-vertex tasks, compared to 19.0 and 61.3 for AR. This demonstrates that recursive refinement
+enables stricter constraint satisfaction than generative sampling alone.
+
+4.3   Exploring GRAM as an Unconditional Generator
+
+Setup. To investigate GRAM’s unconditional generative capability beyond conditional reasoning, we
+evaluate generation in two domains: structured constraint generation on Sudoku (from empty boards,
+evaluated by the fraction of generated boards satisfying Sudoku constraints) and image generation on
+binarized MNIST [15], where pixel values are thresholded to 0 or 1 (evaluated by Inception Score (IS) [42]
+and FID [43]). In both cases, the input is replaced by an empty conditioning signal and the model samples
+an output from its learned prior. Baselines include D3PM [44], a discrete diffusion model, on both tasks,
+and additionally a VAE [45] trained with binary reconstruction loss on MNIST. To ensure a fair comparison
+with existing literature, FID is calculated using real samples from the original standard MNIST.
+
+Table 2: Unconditional generation results on binarized MNIST. We report IS (↑) and FID (↓).
+For iterative models, a step corresponds to a supervision step for TRM and GRAM, and a denoising
+step for D3PM. FID is calculated using real samples with original pixel values (0–255).
+Method                    IS (↑)    FID (↓)
+VAE                        1.70       86.28
+D3PM (1000 steps)          1.86       74.03
+TRM (16 steps)             1.00      303.29
+GRAM (Ours)
+  8 steps                  1.85       84.08
+  16 steps                 1.89       77.79
+  32 steps                 1.91       76.65
+  64 steps                 1.95       75.39
+  128 steps                1.99       74.30
+  256 steps                2.04       73.34
+
+Figure 5: Unconditional Sudoku generation. Validity (%) of generated Sudoku puzzles. GRAM
+achieves higher validity than D3PM with substantially fewer parameters and steps.
+
+Generative Behavior Beyond Reasoning. GRAM extends from conditional reasoning to unconditional
+generation in two different domains. On Sudoku generation (Figure 5), GRAM produces valid boards
+with 99.05% validity using 10.9M parameters and 16 supervision steps, surpassing D3PM baselines
+that use up to 55.1M parameters and 1000 denoising steps. Figure 7 shows qualitative examples,
+illustrating that the model produces diverse, fully valid boards from empty inputs without any explicit
+constraint checker. On MNIST (Table 2), the deterministic baseline TRM exhibits mode collapse (FID
+303.29), whereas GRAM produces recognizable digits with IS and FID comparable to D3PM. Together,
+these results indicate that GRAM’s stochastic latent
+
+[Figure 6 image: (a) Generation Process, with rows D3PM (t = 0 to 1000), TRM (t = 0 to 16), and GRAM (t = 0 to 16); (b) Samples 1 to 4.]
+  Figure 6: Visualization of the generation process and samples. (a) The generation process over recursion
+  steps. Each row corresponds to a different model. GRAM (bottom) progressively refines the generated image
+  through recursive latent updates, correcting initial errors. (b) Unconditional generated samples from each model.
+
+  Table 3: Ablation study on Sudoku-Extreme and N-Queens (8 × 8). We evaluate with 5 samples. For (a),
+  components are added cumulatively to the Looped TF baseline (DS = deep supervision, HR = hierarchical
+  recursion, SG = stochastic guidance). For (b), both stochasticity and learned guidance are essential;
+  removing either significantly degrades performance.
+
+  (a) Architecture Ablation.
+  Model variant                  Sudoku           N-Queens
+  base (Looped TF)               61.25            71.30
+  + DS + HR (=HRM, TRM)          55.00 / 87.40    80.70 / 72.90
+  + SG                           65.64            86.30
+  + DS + SG                      73.90            100.00
+  + DS + HR + SG (=GRAM)         93.96            99.69
+
+  (b) Mechanism Ablation.
+  Model variant                  Sudoku           N-Queens
+  GRAM (ours)                    93.96            99.69
+  w/o stochastic guidance        82.87            72.91
+  stochasticity only             94.88            50.27
+  guide only                     0.00             0.00
+  w/ direct prediction           63.43            61.44
+  TRM w/ stochastic decoder      82.87            71.66
+  TRM w/ random init.            78.53            71.82
+
+  transitions support generative modeling beyond symbolic reasoning, with constraint satisfaction
+  emerging as a natural byproduct of the recursive generative process.
+  Inference-Time Scaling Transfers to Generation. Table 2 further shows that increasing recursion
+  at inference improves generation quality monotonically (IS 1.85 → 2.04, FID 84.08 → 73.34 from 8
+  to 256 steps), even though training uses only 16 steps. This indicates that the iterative-refinement
+  advantage of recursive models carries over into the generative regime. Figure 6 visualizes this process;
+  additional samples are in Section D.4.
+
+  4.4        Ablation Study
+
+  We ablate key design choices of GRAM on Sudoku-Extreme and N-Queens (8 × 8) using 5 samples.
+  Table 3 summarizes the results.
+  Stochastic Guidance Provides Consistent Gains Across Architectures. Table 3a shows that
+  stochastic guidance (SG) improves performance regardless of the underlying architecture: SG alone
+  lifts the flat Looped TF baseline, and combining SG with deep supervision already reaches 100%
+  on N-Queens. The full GRAM (with hierarchical recursion on top) achieves the best results overall
+  (93.96% / 99.69%). While the effect of hierarchical recursion is task-dependent, SG yields consistent
+  gains in every configuration, supporting our design of stochastic guidance as the core extension
+  introduced by GRAM.
+  Both Stochasticity and Guidance Are Essential. We ablate each component by modifying the
+  learned distribution $\epsilon_t \sim \mathcal{N}(\mu_\theta, \sigma_\theta^2 I)$ in Equation (4). Removing guidance ($\mathcal{N}(0, \sigma_\theta^2 I)$) maintains
+  Sudoku performance (94.88%), indicating that stochasticity alone can enable diverse reasoning paths.
+  However, this variant collapses on N-Queens (50.27%), where structured guidance is necessary to
+  navigate multi-solution spaces. Removing stochasticity ($\mathcal{N}(\mu_\theta, 0)$) fails completely (0.0% on both
+  tasks), as deterministic guidance conditioned on the target leads to severe overfitting.
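+
+  The three noise distributions compared in this ablation can be written out as a small sampler. This is
+  only a sketch under stated assumptions: `sample_noise` and the variant names are hypothetical, µθ is
+  passed as a plain list, and σθ is treated as a single scalar shared across dimensions.
+
```python
import random
from typing import List, Sequence


def sample_noise(mu: Sequence[float], sigma: float, variant: str,
                 rng: random.Random) -> List[float]:
    """Draw one noise vector for the chosen ablation variant.

    "full"               -> N(mu, sigma^2 I): learned mean and learned scale
    "stochasticity_only" -> N(0,  sigma^2 I): guidance (the mean) removed
    "guide_only"         -> N(mu, 0):         randomness removed, returns mu itself
    """
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    if variant == "full":
        return [rng.gauss(m, sigma) for m in mu]
    if variant == "stochasticity_only":
        return [rng.gauss(0.0, sigma) for _ in mu]
    if variant == "guide_only":
        return list(mu)
    raise ValueError(f"unknown variant: {variant}")


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the illustration is repeatable
    mu = [1.0, -2.0]        # hypothetical guidance mean
    for variant in ("full", "stochasticity_only", "guide_only"):
        print(variant, sample_noise(mu, 0.5, variant, rng))
```
+
+  Only the "full" variant keeps both a target-informed mean and sample-to-sample variation, which is the
+  combination that Table 3b reports as necessary on both tasks.
+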
+  Naive Stochasticity Does Not Help TRM. We test two simple approaches to add stochasticity to
+  TRM: (1) stochastic decoding, which samples from the output distribution instead of argmax, and
+
+[Figure 7 image: Sudoku boards over the generation process, with rows D3PM (t = 0 to 1000) and GRAM (iter = 0 to 16).]
+Figure 7: Qualitative examples of unconditional Sudoku generation by GRAM. Each board is independently
+sampled from an empty grid using the learned prior. GRAM produces diverse, complete boards satisfying all
+row, column, and box constraints, without an explicit constraint checker or search procedure. Incorrect digits are
+highlighted in red.
+
+
+(2) random initialization, which samples $z_0$ from a Gaussian $\mathcal{N}(0, I)$ at each inference run. Neither
+improves performance, indicating that GRAM’s gains stem from the variational framework rather
+than mere randomness.
+
+5      Conclusions and Limitations
+We introduced GRAM, a generative framework that transforms deterministic recursive architectures
+into probabilistic generative models capable of modeling both p(y | x) and p(x) via recursive amor-
+tized variational inference. For reasoning problems, introducing stochasticity into latent transitions
+enables diverse solution discovery and improved exploration compared to deterministic counterparts.
+Notably, we demonstrate that GRAM can leverage width-based inference-time scaling as a complement
+to depth: sampling multiple latent trajectories in parallel bypasses the latency bottleneck of
+depth-only scaling. Our ablations further indicate that stochastic guidance is a general-purpose
+extension that consistently improves the recursive architectures we tested, and that the gains stem
+specifically from the variational framework rather than from mere randomness, as naive stochastic
+alternatives applied to existing models yield no improvement.
+Beyond solution-seeking, GRAM also demonstrates potential as an unconditional generative model
+through recursion-based generation over inputs, with generation quality improving monotonically
+with recursive depth even beyond training-time steps. This suggests new directions for generative
+modeling via hierarchical recursion. Despite these strengths, the sequential nature of deep supervision
+limits training efficiency compared to Transformers, posing a significant barrier to scaling GRAM
+toward larger foundation models.
+
+Acknowledgment
+This research was supported by the Brain Pool Plus Program (No. 2021H1D3A2A03103645) and
+the GRDC (Global Research Development Center) Cooperative Hub Program (RS-2024-00436165)
+through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and
+ICT (MSIT). This work was also supported by the Institute of Information & Communications
+Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-
+2024-00509279, Global AI Frontier Lab) and by the NYU-KAIST Global Innovation and Research
+Institute. Minsu Kim acknowledges funding from the KAIST Jang Young Sil Fellow Program. We
+are especially grateful to Gyubin and Seungju for their non-trivial contributions, and we thank the
+members of the MLML for valuable discussions and feedback throughout this project.
+
+Broader Impacts
+GRAM studies probabilistic recursive reasoning for structured reasoning and generation. By maintaining
+multiple latent trajectories, it may benefit tasks such as constraint satisfaction and scientific
+problem solving, where uncertainty and multiple valid solutions are common. It also suggests a way
+to improve reasoning through inference-time computation rather than parameter scaling alone. Its
+generality also entails risks: plausible but invalid generations may be mistaken for verified solutions
+in downstream decision-making pipelines, and multi-sample inference may increase computational
+and energy costs at scale. Since our experiments focus on controlled benchmarks, deployment in
+real-world or high-stakes settings would require rigorous validation, uncertainty calibration, and
+domain-specific safeguards.
+
+References
+ [1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
+     Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
+     Advances in neural information processing systems, 35:24824–24837, 2022.
+
+ [2] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik
+     Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Ad-
+     vances in neural information processing systems, 36:11809–11822, 2023.
+
+ [3] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas
+     Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph
+     of thoughts: Solving elaborate problems with large language models. In Proceedings of the
+     AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024.
+
+ [4] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong
+     Tian. Training large language models to reason in a continuous latent space. arXiv preprint
+     arXiv:2412.06769, 2024.
+
+ [5] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning
+     by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint
+     arXiv:2505.12514, 2025.
+
+ [6] Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh
+     Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and
+     reasoning. arXiv preprint arXiv:2505.23648, 2025.
+
+ [7] Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers
+     are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2023.
+
+ [8] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and
+     Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
+
+ [9] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv
+     preprint arXiv:2510.04871, 2025.
+[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
+     versal transformers. arXiv preprint arXiv:1807.03819, 2018.
+[11] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011.
+[12] Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017.
+[13] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
+[14] François Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2:
+     A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831, 2025.
+[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
+     recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.
+[16] Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of
+     recurrent network trajectories. Neural computation, 2(4):490–501, 1990.
+[17] Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv
+     preprint arXiv:1705.08209, 2017.
+[18] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartold-
+     son, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute
+     with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.
+[19] Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, and Jianfeng Gao. Text generation
+     beyond discrete token sampling. arXiv preprint arXiv:2505.14827, 2025.
+[20] Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong
+     Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous
+     concept space. arXiv preprint arXiv:2505.15778, 2025.
+[21] Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens,
+     hard truths. arXiv preprint arXiv:2509.19170, 2025.
+[22] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi:
+     Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint
+     arXiv:2502.21074, 2025.
+[23] Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu
+     Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv
+     preprint arXiv:2505.18454, 2025.
+[24] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch,
+     Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic
+     recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025.
+[25] Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain-of-
+     thought driven architecture with budget-adaptive computation cost at inference. arXiv preprint
+     arXiv:2310.10845, 2023.
+[26] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster.
+     Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. arXiv preprint
+     arXiv:2410.20672, 2024.
+[27] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
+[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
+     1735–1780, 1997.
+[29] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
+     Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-
+     decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
+[30] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
+     Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv
+     preprint arXiv:1909.11942, 2019.
+[31] Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. arXiv
+     preprint arXiv:1910.10073, 2019.
+[32] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
+     arXiv:1603.08983, 2016.
+[33] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua
+     Bengio. A recurrent latent variable model for sequential data, 2016. URL https://arxiv.
+     org/abs/1506.02216.
+[34] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural
+     models with stochastic layers, 2016. URL https://arxiv.org/abs/1605.07571.
+[35] Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep kalman filters, 2015. URL https:
+     //arxiv.org/abs/1511.05121.
+[36] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and
+     James Davidson. Learning latent dynamics for planning from pixels, 2019. URL https:
+     //arxiv.org/abs/1811.04551.
+[37] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with
+     discrete world models. arXiv preprint arXiv:2010.02193, 2020.
+[38] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains
+     through world models. arXiv preprint arXiv:2301.04104, 2023.
+[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
+     Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
+     processing systems, 30, 2017.
+[40] ARC-Prize-Foundation. ARC-AGI benchmarking: Leaderboard and dataset for the ARC-AGI
+     benchmark. https://arcprize.org/leaderboard, 2026. Accessed: 2026-1-22.
+[41] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu,
+     Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language
+     models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
+[42] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973,
+     2018.
+[43] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
+     Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in
+     neural information processing systems, 30, 2017.
+[44] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg.
+     Structured denoising diffusion models in discrete state-spaces. Advances in neural information
+     processing systems, 34:17981–17993, 2021.
+[45] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
+     arXiv:1312.6114, 2013.
+[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
+     Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
+[47] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
+[48] Simo Ryu. Minimal implementation of a d3pm (structured denoising diffusion models in
+     discrete state-spaces), in pytorch. https://github.com/cloneofsimo/d3pm, 2024.
+[49] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings
+     of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
+[50] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
+     applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
+[51] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
+     convolutional neural networks. Advances in neural information processing systems, 25, 2012.
+[52] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network
+     function approximation in reinforcement learning. Neural networks, 107:3–11, 2018.
+[53] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference
+     on computer vision (ECCV), pages 3–19, 2018.
+[54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint
+     arXiv:1711.05101, 2017.
+[55] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity
+     natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
+[56] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models.
+     Advances in neural information processing systems, 33:12438–12448, 2020.
+[57] P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Mathematica
+     Academiae Scientiarum Hungarica, 12(1):261–267, Mar 1964. ISSN 1588-2632. doi: 10.1007/
+     BF02066689. URL https://doi.org/10.1007/BF02066689.
+[58] Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep
+     learning: Effective graph neural network models for combinatorial problems. In 2019 IEEE
+     31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885.
+     IEEE, 2019.
+[59] Alexey L. Pomerantsev. Principal component analysis (pca). Encyclopedia of Autism Spectrum
+     Disorders, 2014. URL https://api.semanticscholar.org/CorpusID:2534141.
+[60] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Com-
+     mun. ACM, 18:509–517, 1975. URL https://api.semanticscholar.org/CorpusID:
+     13091446.
+
+A     Additional Method Details
+A.1   Adaptive Computation Time
+
+GRAM optionally adopts adaptive computation time (ACT) [8–10] at inference, allowing each
+trajectory to terminate at a learned halting depth rather than running for a fixed number of supervision
+steps. We follow the Q-learning formulation introduced by HRM [8] and adopted in TRM [9].
+Halt head. The decoder includes an auxiliary head $q_\psi : \mathbb{R}^D \to \mathbb{R}^2$ that maps the high-level state h
+to two scalar values, $q_\psi(h) = (q^{\text{halt}}, q^{\text{continue}})$. These are interpreted as estimated Q-values for the
+binary action of halting or continuing computation at the current supervision step.
+Training. The halt head is trained jointly with the main objective via a temporal-difference loss.
+After computing the latent state $z_T^{(n)}$ at the end of supervision step n, we form Q-learning targets:
+
+       • $\hat{q}_n^{\text{halt}} = \mathbb{1}[\hat{y}^{(n)} = y]$, indicating whether decoding the current state would yield a correct
+         prediction.
+       • $\hat{q}_n^{\text{continue}} = \max\big(q_{n+1}^{\text{halt}}, q_{n+1}^{\text{continue}}\big)$, the bootstrapped value of running one more supervision
+         step.
+
+The halt head is trained by regression to these targets:
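+$$\mathcal{L}_{\text{ACT}} = \sum_{n=1}^{N_{\text{sup}}} \Big[ \big(q_n^{\text{halt}} - \hat{q}_n^{\text{halt}}\big)^2 + \big(q_n^{\text{continue}} - \hat{q}_n^{\text{continue}}\big)^2 \Big]. \tag{15}$$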
+
+This auxiliary loss is added to the main training objective and contributes only through the halt head;
+it does not propagate gradients into the recursive core.
+Inference. At inference, computation proceeds one supervision step at a time. After each step
+n, we evaluate $q_\psi(h^{(n)})$ and halt if $q_n^{\text{halt}} > q_n^{\text{continue}}$, returning $\hat{y}^{(n)}$ as the prediction. Otherwise,
+computation continues to the next supervision step, up to a maximum budget of $N_{\text{sup}}^{\max}$ steps. Different
+trajectories sampled in parallel may therefore terminate at different depths, complementing the
+parallel-sampling scheme described in Section 2.3. In practice, we found that using only $q^{\text{halt}}$
+(halting when $\sigma(q^{\text{halt}}) > 0.5$, without the continue branch) performs comparably while simplifying
+implementation; our released code uses this variant.
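+
+The target construction, the loss in Equation (15), and the halting rule can be sketched in a few lines of
+plain Python. This is a minimal illustration, not the released implementation: the function names are
+hypothetical, the Q-values are given as plain lists, and the handling of the last supervision step (which
+has no successor to bootstrap from) is an assumption that the text does not specify.
+
```python
def act_targets(correct, q_halt, q_continue):
    """Return (halt_targets, continue_targets) for every supervision step.

    correct[n] is True when decoding step n gives the right answer;
    q_halt[n] and q_continue[n] are the halt-head outputs at step n.
    """
    n_steps = len(correct)
    halt_targets = [1.0 if c else 0.0 for c in correct]
    continue_targets = []
    for n in range(n_steps):
        if n + 1 < n_steps:
            # Bootstrapped value of running one more supervision step.
            continue_targets.append(max(q_halt[n + 1], q_continue[n + 1]))
        else:
            # Assumption: no successor at the last step, reuse its halt target.
            continue_targets.append(halt_targets[n])
    return halt_targets, continue_targets


def act_loss(q_halt, q_continue, halt_targets, continue_targets):
    """Sum of squared errors of both Q-values over all steps (Eq. 15)."""
    return sum(
        (qh - th) ** 2 + (qc - tc) ** 2
        for qh, qc, th, tc in zip(q_halt, q_continue, halt_targets, continue_targets)
    )


def halting_step(q_halt, q_continue):
    """First 0-based step where q_halt > q_continue, else the last step."""
    for n, (qh, qc) in enumerate(zip(q_halt, q_continue)):
        if qh > qc:
            return n
    return len(q_halt) - 1
```
+
+In an actual training loop the targets would be treated as constants (no gradient flows through them), and
+the loss would only update the halt head, as stated above.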
+
+A.2   Latent Process Reward Model (LPRM)
+
+To rank or select among sampled candidates, we train a value head vψ (zt ) to predict the expected
+accuracy of the final output, conditioned on the current latent state zt . The LPRM is trained jointly
+with the main objective via a regression loss:
+$$\mathcal{L}_{\text{LPRM}} = \sum_{t=1}^{T} \big(v_\psi(z_t) - r\big)^2, \tag{16}$$
+
+where r ∈ [0, 1] denotes the accuracy of the final prediction for a given trajectory.
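+
+As a rough illustration of Equation (16) and of how the value head could be used for selection, the sketch
+below computes the regression loss for one trajectory and picks the candidate with the highest predicted
+value. The function names are hypothetical, and ranking by a single scalar value per candidate is an
+assumption; the text does not specify which latent state's value is used for ranking.
+
```python
def lprm_loss(values, reward):
    """Squared error between per-step value predictions and the final accuracy r."""
    return sum((v - reward) ** 2 for v in values)


def select_best(candidates):
    """Pick the output whose predicted value is highest.

    candidates is a list of (output, predicted_value) pairs.
    """
    best_output, _ = max(candidates, key=lambda pair: pair[1])
    return best_output
```
+
+For example, `lprm_loss([0.5, 0.75, 1.0], 1.0)` penalizes the two early under-estimates and leaves the exact
+final prediction unpenalized.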
+
+A.3   Empirical Validation of the Surrogate Objective
+
+We further analyze the approximation introduced by the surrogate training objective LGRAM used in
+Section 2.2, both qualitatively and empirically.
+Truncation as a gradient approximation. We frame LGRAM as a gradient approximation rather than
+a separate variational objective. The full trajectory-level ELBO (Equation (13)) involves a sum of KL
+terms across all TTotal transitions, and computing its exact gradient requires backpropagation through
+the entire trajectory. To enable training with constant memory, we propagate gradients only through
+the final transition of each supervision step. This is a standard practice in recurrent latent variable
+models with long computation chains: ELBOs over truncated sequences are used, for example, in
+VRNN [33] and SRNN [34], while truncated latent imagination is used in Dreamer-family world
+models [37, 38]. Trading a small gradient bias for training stability via local truncation is therefore
+well-precedented; what is specific to GRAM is applying this approximation at the level of recursive
+reasoning trajectories rather than temporal sequences.
+Empirical validation. To verify that optimizing LGRAM effectively drives improvement in the full
+variational bound, we compute both quantities on the validation set throughout training. The full
+ELBO LELBO is evaluated as in Equation (13), summing the reconstruction term and KL contributions
+across all TTotal transitions; the surrogate objective is evaluated as the average of $\mathcal{L}_{\text{GRAM}}^{(n)}$ over the
+Nsup supervision steps. Figure 8 reports the results on Sudoku-Extreme and N-Queens 8 × 8.
+
+[Figure 8 plot: two panels, Sudoku-Extreme (left) and N-Queens 8 × 8 (right, log-scale y-axis). x-axis: Training Step; y-axis: ELBO. Curves: GRAM (training objective) and ELBO (full).]
+Figure 8: Full ELBO LELBO and surrogate objective LGRAM throughout training (plotted as −ELBO,
+smaller is better). On both Sudoku-Extreme (left) and N-Queens 8 × 8 (right), both quantities decrease
+monotonically over training, indicating that gradient updates of LGRAM consistently improve the full variational
+bound. The two curves do not coincide because LELBO sums KL contributions across all TTotal transitions
+while LGRAM evaluates only the final-step KL of each supervision step; their gap reflects the cumulative KL
+across earlier transitions, not a failure of optimization. The N-Queens plot uses a log scale on the y-axis due to
+the large dynamic range.
+
+Both LELBO and LGRAM improve monotonically throughout training on both tasks. This indicates
+that, despite the truncation, gradient updates of LGRAM effectively drive improvement in the full
+variational bound. Since LELBO also serves as an indirect estimate of the negative log-likelihood,
+its consistent improvement provides evidence that GRAM optimizes a well-defined data likelihood,
+even though training relies on the surrogate.
+The gap between the two curves in Figure 8 reflects the structural difference between the two
+quantities — LELBO accumulates KL terms across all transitions while LGRAM evaluates only the
+final-step KL of each supervision step — rather than an optimization failure. This gap is consistent
+with LGRAM being a biased but useful surrogate for LELBO .
+
+B     Training and Architecture Details
+B.1    Architecture Details
+
+GRAM consists of three components: Encoder, Recursive Core, and Decoder.
+Encoder. Input tokens are mapped to embeddings via a token embedding layer, optionally concatenated
+with puzzle embeddings (for ARC [13, 14]), and combined with positional encodings
+(RoPE) [46]. The embeddings are scaled by √D and prepended with 16 puzzle embedding tokens [8].
+Recursive Core. The core maintains two latent states: h (high-level) and l (low-level). For each outer
+step, the low-level state is refined K times via l ← fL (l, h + ex ), injecting the input embedding at
+each iteration. The high-level state is then updated via h ← fH (h, l). Both fL and fH share the same
+architecture: a stack of attention and SwiGLU [47] MLP layers. As an exception, for Sudoku tasks we use a
+[SwiGLU + SwiGLU] network for the Recursive Core module instead of [Attention + SwiGLU],
+following [9]. To initialize z0 = (h0 , l0 ), we sample once from the standard
+Gaussian distribution N (0, I) and store the sample in the network checkpoint; it is reloaded with the
+model, so z0 is a fixed value rather than being resampled.
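+
+The update order described above can be sketched as a pair of nested loops. This is only an illustration of
+the control flow: `recursion_step` and `run_transition` are hypothetical names, and `f_L` and `f_H` are
+passed in as arbitrary callables (the toy lambdas in the usage note stand in for the attention and SwiGLU
+stacks).
+
```python
def recursion_step(h, l, e_x, f_L, f_H, K):
    """One outer step: refine l for K inner iterations, then update h once."""
    for _ in range(K):
        # The input embedding e_x is re-injected at every inner iteration.
        l = f_L(l, h + e_x)
    h = f_H(h, l)
    return h, l


def run_transition(h, l, e_x, f_L, f_H, K, T):
    """One latent transition, made of T outer steps of K inner steps each."""
    for _ in range(T):
        h, l = recursion_step(h, l, e_x, f_L, f_H, K)
    return h, l
```
+
+With scalar toy states and averaging updates, `run_transition(0.0, 0.0, 1.0, lambda l, c: 0.5 * l + 0.5 * c,
+lambda h, l: 0.5 * h + 0.5 * l, K=2, T=1)` moves l toward the injected input twice before h is updated once.
+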
+Decoder. The decoder extracts content tokens from h (excluding puzzle embedding positions)
+and maps them to logits via a SwiGLU MLP head. An auxiliary head predicts halt decisions and
+correctness values from the first token of h.
+
+                                       Table 4: Architecture components.
+                  Component         Module                   Description
+                  Encoder           Token Embedding          vocab → D
+                                    Puzzle Embedding         16 tokens (optional, for ARC)
+                                    Position Encoding        RoPE or learned
+                  Recursive Core    f_L, f_H                 [Attention + SwiGLU] × 2 layers
+                                    Iterations               K low-level, T high-level steps
+                                    µ_θ, σ_θ, µ_ϕ, σ_ϕ       SwiGLU MLP for each parameter
+                  Decoder           LM Head                  Linear(D → vocab)
+                                    Q Head                   Linear(D → 2) for halt
+                                    V Head                   Linear(D → 1) for value
+
+
+
+Encoder and Decoder for Image Patches. In the MNIST [15] image generation task, we first
+construct a binarized dataset by normalizing the original discrete pixel values (0–255) to the
+continuous range [0, 1] and applying a threshold at 0.5. For the network architecture, we employ a
+convolutional patch encoder, following [48, 49].
+The encoding process proceeds in three stages. First, the discrete input tokens x ∈ {0, 1} are
+normalized to the range [−1, 1]. Second, to capture local spatial dependencies before patchification,
+the normalized image passes through a shallow convolutional encoder. This encoder consists of
+two stacked blocks, where each block comprises a 2D convolution [50, 51] with a 5 × 5 kernel and
+padding 2, a SiLU non-linearity [52], and Group Normalization (GN) [53]. Finally, the resulting
+feature map is divided into non-overlapping patches of size P × P and linearly projected to match
+the model’s hidden dimension D. The detailed architectural specifications and dimension transitions
+are summarized in Table 5.
+
+Table 5: Detailed architecture of the Image Patch Encoder for MNIST. H, W denote image resolution, C input
+channels, P patch size, Np the number of patches, and D the hidden dimension.
+                        Stage        Layer / Operation                        Output Dim.
+                                     Input Tokens                             (B, C, H, W)
+                        1. Norm.     Linear Scaling [−1, 1]                   (B, C, H, W)
+                        2. Conv      Conv2d 5 × 5 (p = 2), SiLU → GN(32)      (B, D/2, H, W)
+                                     Conv2d 5 × 5 (p = 2), SiLU → GN(32)      (B, D/2, H, W)
+                        3. Patch     Flatten Patches                          (B, Np, P² · D/2)
+                                     Linear Projection                        (B, Np, D)
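+
+The dimension transitions in Table 5 can be checked with a small shape calculator. This is a sketch under
+stated assumptions: the function name is hypothetical, the convolutions are assumed to use stride 1 (so a
+5 × 5 kernel with padding 2 preserves H and W), and the patch size P = 2 in the usage note is inferred from
+28 × 28 MNIST images yielding 14 × 14 patches rather than stated in the text.
+
```python
def patch_encoder_shapes(B, C, H, W, D, P):
    """Return the tensor shape after each stage of the image patch encoder."""
    if H % P != 0 or W % P != 0:
        raise ValueError("image size must be divisible by the patch size")
    n_patches = (H // P) * (W // P)
    return {
        "input": (B, C, H, W),
        "norm": (B, C, H, W),
        # Stride-1 5x5 conv with padding 2: H + 2*2 - 5 + 1 = H (same for W).
        "conv": (B, D // 2, H, W),
        # Each P x P patch is flattened across its D/2 channels.
        "flatten": (B, n_patches, P * P * (D // 2)),
        "project": (B, n_patches, D),
    }
```
+
+Calling `patch_encoder_shapes(1, 1, 28, 28, 512, 2)` gives 196 patches, matching the MNIST sequence length
+listed in Table 6.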
+
+
+Hyperparameters. Following Wang et al. [8] and Jolicoeur-Martineau [9], both the input and output are
+represented as sequences of shape [B, L], where B denotes the batch size and L the context length.
+Each input sequence includes 16 fixed puzzle embedding tokens. The latent states ht and lt , as well
+as the decoder output, have shape [B, L, D], with embedding dimension D. The Transformer [39]
+backbone uses embedding dimension D = 512, attention heads Nhead = 8, and FFN hidden dimension
+Dh = 512. Within a recursion step, meaning a latent transition zt → zt+1 , we use low-level (inner)
+steps K = 6 for Sudoku [8] and K = 4 for all other tasks, with high-level (outer) steps T = 3.
+
+B.2   Training Details
+
+Task Configuration. All tasks represent inputs and outputs as discrete token sequences (summarized
+in Table 6).
+
+        • For Sudoku [8], the 9×9 grid is flattened row-by-row into 81 tokens with vocabulary size 11
+          (0=pad, 1=blank, 2–10=digits).
+        • For ARC-AGI [13, 14], variable-size grids are padded to a fixed 30×30 canvas with EOS
+          markers, yielding 900 tokens and vocabulary size 12 (0=pad, 1=eos, 2–11=colors); task-
+          specific puzzle embeddings are prepended to distinguish different ARC tasks.
+        • N-Queens flattens an N × N board row-by-row into N² tokens with vocabulary size 3
+          (0=pad, 1=empty, 2=queen).
+        • Graph Coloring encodes the strict upper triangle of the adjacency matrix as n(n−1)/2 tokens,
+          using 0=PAD, 1=no-edge, and 2=edge for inputs and 3 + color_id for output colors.
+        • For image generation on MNIST [15], images are quantized and processed via CNN-based
+          patchification [49, 50], with the encoder applying patchify and the decoder unpatchify. The
+          patched input forms a flattened sequence of 14 × 14 = 196 tokens with vocabulary size 3 (0=pad,
+          1=black, 2=white).
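+
+The grid and graph encodings listed above can be illustrated with three small helpers. These are sketches,
+not the authors' preprocessing code: the function names are hypothetical, blanks are assumed to be given as
+0 in the raw Sudoku grid, and the mapping of digit d to token d + 1 is an assumption consistent with
+"2–10=digits".
+
```python
def encode_sudoku(grid):
    """Row-major flattening; 0 in the raw grid means blank.

    Tokens: 1 = blank, 2..10 = digits 1..9 (digit d -> d + 1, an assumption).
    """
    return [1 if v == 0 else v + 1 for row in grid for v in row]


def encode_queens(board):
    """Row-major flattening of an N x N 0/1 board; 1 = empty, 2 = queen."""
    return [2 if cell else 1 for row in board for cell in row]


def encode_graph(adj):
    """Strict upper triangle of the adjacency matrix; 1 = no-edge, 2 = edge."""
    n = len(adj)
    return [2 if adj[i][j] else 1 for i in range(n) for j in range(i + 1, n)]
```
+
+For an n-node graph, `encode_graph` returns n(n−1)/2 tokens, which matches the Graph Coloring sequence length
+in Table 6.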
+
+
+                                  Table 6: Task-specific configurations.
+           Task             Seq. Len     Vocab     Puzzle Emb     Encoding
+           Sudoku           81           11        ✗              9×9 grid, row-major
+           ARC-AGI          900          12        ✓              30×30 padded canvas
+           N-Queens         N²           3         ✗              N × N board
+           Graph Coloring   n(n−1)/2     6         ✗              Strict adjacency upper triangle
+           MNIST            196          3         ✗              14×14 patches
+
+Training Details. We train all models using AdamW [54] with learning rate 10⁻⁴, weight decay
+1.0, and gradient clipping at 1.0. The global batch size is 768. For stability, we apply exponential
+moving average (EMA) with decay 0.9999, following Brock et al. [55] and Song and Ermon [56].
+To prevent posterior collapse, we use a KL balance [37, 38] coefficient of 0.8. The number of deep
+supervision steps is Nsup = 16 for all tasks. The KL coefficient β is set to 0.1 (Sudoku), 0.04/0.1
+(ARC-AGI-1/2), 0.07/0.045 (N-Queens 8 × 8/10 × 10), 0.5/0.45 (Graph Coloring with 8/10 nodes),
+and 0.07 (MNIST). Task-specific training configurations are summarized in Table 7.
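+
+For reference, the EMA mentioned above is the usual exponentially weighted average of the parameters. The
+sketch below shows one update on a flat list of floats; the function name and the flat-list representation
+are illustrative assumptions, and a real training loop would apply the same rule to every parameter tensor.
+
```python
def ema_update(ema_params, params, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current parameter."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```
+
+With decay 0.9999, each update moves the averaged weights only 0.01% of the way toward the current weights,
+so the average reflects roughly the last ten thousand optimizer steps.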
+
+                      Table 7: Training configurations on NVIDIA RTX 4090 GPUs.
+                         Task                             Epochs   GPUs     Time
+                         Sudoku                            50K      8        2h
+                         ARC-AGI                          200K      8      5 days
+                         N-Queens (8×8)                     3K      8        1h
+                         N-Queens (10×10)                   1K      8        3h
+                         Graph Coloring (8 nodes)           5K      8       1.5h
+                         Graph Coloring (10 nodes)          5K      8        6h
+                         MNIST                            1.8K      8       16h
+
+
+
+C     Additional Details of Experiment Setup
+C.1     Challenging Puzzle Tasks
+
+C.1.1    Looped TF on ARC-AGI
+We report Looped Transformer [7] results on Sudoku-Extreme but omit them on ARC-AGI due to
+prohibitive training cost. Under the same setup used for our other recursive baselines (200K epochs,
+batch size 768, on 8× NVIDIA RTX Pro 6000 GPUs), training Looped TF on Sudoku-Extreme
+already takes 19 hours, and extrapolating to ARC-AGI — which uses substantially longer sequences
+and a larger training set — suggests approximately 97 days (≈ 776 GPU-days) for a full training run.
+This gap stems from two compounding factors. First, Looped TF lacks deep supervision: HRM,
+TRM, and GRAM perform Nsup gradient updates per trajectory (one per segment), whereas Looped
+TF performs only one update at the end of the full trajectory, slowing convergence. Second, Looped
+TF lacks adaptive halting such as ACT [8–10], so every input must be processed for the maximum
+recursion depth, increasing per-example sequential compute. Both inefficiencies compound at
+ARC-AGI scale, making a full Looped TF training run impractical.
+
+C.2     Multi-solution Puzzle Tasks
+
+C.2.1    N-Queens Problem
+
+[Figure 9 panels: Input, Solution 1, Solution 2, Solution 3.]
+Figure 9: Example of an 8 × 8 N-Queens puzzle instance. In this example, 5 queens are removed from the
+full board, leaving 3 queens. The model must find positions for the 5 removed queens. This configuration
+admits exactly 3 valid solutions.
+
+Data Generation Details. The N-Queens problem requires placing N queens on an N × N
+chessboard such that no two queens attack each other—meaning no queens share the same row,
+column, or diagonal. Figure 9 illustrates an example where 5 queens are removed from an 8 × 8
+solution, resulting in a puzzle with 3 distinct valid completions.
+To construct the dataset, we first generated all valid complete N-Queens solutions for N = 8 and
+N = 10. We then created puzzle instances by removing a specific number of queens, treating the
+remaining partial configuration as the input and the original complete board as the target label. To
+generate instances yielding diverse valid completions, we removed k ∈ {5, 6, 7} queens for the 8 × 8
+setting and k ∈ {7, 8, 9} queens for the 10 × 10 setting. The distribution of solution counts for our
+generated dataset is shown in Figure 10.
+For evaluation, we employed an 85:15 train-test split. Crucially, to prevent data leakage and ensure
+the model learns to reason rather than memorize, the split was performed based on unique input
+configurations. This guarantees that no input pattern in the test set appears in the training set. Inputs
+are flattened into discrete 1D sequences x ∈ {0, 1, 2}^L, where L = N², along with zero-padded
+puzzle embedding tokens. Vocabulary mapping follows: padding (0), empty (1), and queen (2).
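+
+The generation procedure can be sketched as follows: enumerate all complete solutions by backtracking, blank
+out a chosen set of rows, and count how many complete solutions remain consistent with the partial board.
+This is a minimal illustration with hypothetical function names; it represents a board by the queen's column
+in each row (valid because every complete solution has exactly one queen per row) and does not reproduce the
+authors' sampling of which queens to remove.
+
```python
def all_solutions(n):
    """Return every N-Queens solution as a tuple cols, with cols[r] the column of the queen in row r."""
    solutions = []

    def place(row, cols, diag1, diag2, partial):
        if row == n:
            solutions.append(tuple(partial))
            return
        for c in range(n):
            # Skip squares attacked along a column or either diagonal.
            if c in cols or (row - c) in diag1 or (row + c) in diag2:
                continue
            place(row + 1, cols | {c}, diag1 | {row - c}, diag2 | {row + c}, partial + [c])

    place(0, frozenset(), frozenset(), frozenset(), [])
    return solutions


def make_puzzle(solution, removed_rows):
    """Remove the queens in removed_rows; None marks a row left for the model to fill."""
    return tuple(None if r in removed_rows else c for r, c in enumerate(solution))


def completions(puzzle, solutions):
    """All complete solutions that agree with every queen still on the board."""
    return [s for s in solutions if all(p is None or p == c for p, c in zip(puzzle, s))]
```
+
+For N = 8 the enumeration yields the well-known 92 solutions, and `len(completions(puzzle, solutions))` gives
+the solution count of a puzzle instance, the quantity whose distribution is plotted in Figure 10.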
+
+[Figure 10 plot: two histograms, 8x8 N-Queens (left) and 10x10 N-Queens (right). x-axis: Number of solutions; y-axis: Counts.]
+Figure 10: Distribution of the number of valid solutions for generated N-Queens instances. The dataset
+covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+
+C.2.2    Graph Coloring Problem
+Data Generation Details. The Graph Coloring problem requires assigning one of k colors to each
+node in a graph such that no two adjacent nodes share the same color. We consider graphs with
+N ∈ {8, 10} nodes and use k = 3 colors. Figure 11 illustrates an example instance with N = 8
+nodes and k = 3 colors.
+Graphs are generated using the Erdős–Rényi random graph model [57], following the generation
+pipeline from GNN-GCP [58]. Specifically, for each instance, edges are sampled independently
+with a fixed probability p, producing a symmetric adjacency matrix. We retain only graphs that are
+3-colorable.
+For each graph, we enumerate all valid 3-colorings and retain only canonical forms to eliminate
+redundant solutions under color permutation (e.g., swapping red and blue). This yields a set of
+structurally distinct solutions per input. The distribution of solution counts is shown in Figure 12.
+The final dataset consists of 7,002 training and 255 test instances for N = 8, and 13,465 training and
+192 test instances for N = 10.
+Input and Output Representation. The input graph is represented by extracting the upper triangular
+portion of the adjacency matrix (excluding the diagonal) and flattening it into a 1D sequence. The
+output is a sequence of length N , where each position encodes the assigned color for the corresponding
+node. Vocabulary mapping is as follows: PAD (0), no edge (1), edge (2), and colors (3, 4, 5) for red,
+blue, and green respectively.
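+
+Enumerating colorings and removing duplicates under color permutation can be sketched as below. The text does
+not say how the canonical form is chosen, so the rule used here (relabel colors in order of first appearance)
+is an assumption, and the function names are hypothetical. Brute-force enumeration over k^N assignments is
+practical only for small graphs such as N ∈ {8, 10}.
+
```python
from itertools import product


def valid_colorings(n, edges, k=3):
    """All assignments of k colors to n nodes with no monochromatic edge."""
    return [
        coloring
        for coloring in product(range(k), repeat=n)
        if all(coloring[u] != coloring[v] for u, v in edges)
    ]


def canonical(coloring):
    """Relabel colors by order of first appearance (assumed canonical form)."""
    relabel = {}
    for color in coloring:
        if color not in relabel:
            relabel[color] = len(relabel)
    return tuple(relabel[color] for color in coloring)


def canonical_solutions(n, edges, k=3):
    """Structurally distinct colorings, one representative per permutation class."""
    return sorted({canonical(c) for c in valid_colorings(n, edges, k)})


def to_tokens(coloring):
    """Map color ids to output tokens using the 3 + color_id convention."""
    return [3 + color for color in coloring]
```
+
+A triangle has six valid 3-colorings but a single canonical one, which is the kind of redundancy the
+canonicalization step removes.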
+
+[Figure 11 panels: Input, Solution 1, Solution 2, Solution 3, Solution 4.]
+Figure 11: Graph Coloring Example
+
+
+[Figure 12 plot: two histograms, Vertex 8 Graph Coloring (left) and Vertex 10 Graph Coloring (right). x-axis: Number of solutions; y-axis: Counts.]
+Figure 12: Distribution of the number of valid solutions for generated graph coloring instances. The
+dataset covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+
+D     Additional Experiment Results
+D.1   Additional Results on Challenging Puzzle Benchmarks
+
+Table 8 reports test accuracy on three challenging puzzle benchmarks. Here we provide additional
+observations complementing the main text.
+
+GRAM Advances the Recursive-Reasoning Line. Across all three benchmarks, GRAM consis-
+tently outperforms prior recursive baselines (Looped TF, HRM, TRM) while using fewer parameters
+than HRM (10M vs. 27M). The complete failure of direct prediction on Sudoku and ARC-AGI-2 (0%
+in both cases) further confirms that recursive computation is essential for these tasks — single-pass
+models, regardless of capacity, cannot solve them. Together, these results indicate that GRAM’s gains
+arise from how recursive computation is organized (probabilistic, multi-trajectory) rather than from
+increased model capacity.
+
+Sudoku-Extreme Resists Parameter Scaling. All tested large reasoning models (LRMs), including
+Deepseek-R1 (671B), score 0% on Sudoku-Extreme. This suggests that pretrained capacity alone
+does not transfer to constraint-propagation reasoning, and that benchmarks like Sudoku-Extreme
+probe a fundamentally different axis from those captured by general-purpose LRMs. On ARC-AGI,
+more recent LRMs such as Gemini 3 Pro (75.0% on ARC-1, 31.1% on ARC-2) remain substantially
+ahead of all recursive models, highlighting that abstract few-shot reasoning still benefits from scale;
+we view these numbers as benchmark-difficulty reference points rather than controlled baselines.
+
+Table 8: Test accuracy (%) on Challenging Puzzle Benchmarks. GRAM significantly outperforms prior
+recursive models. All recursive model scores were obtained at 16 supervision steps.
+                         Method                   #Params Sudoku ARC-1 ARC-2
+                         Large Reasoning Models
+                           DeepSeek-R1             671B    0.0    15.8    1.3
+                           Claude 3.7 16k          N/A     0.0    28.6    0.7
+                           o3-mini-high            N/A     0.0    34.5    3.0
+                           GPT 5.2 (low)           N/A      –     55.7    9.7
+                           Grok-4-thinking         1.7T     –     66.7   16.0
+                           Gemini 3 Pro            N/A      –     75.0   31.1
+                         Recursive Models
+                           Direct Pred             27M      0.0   21.0   0.0
+                           Looped TF                7M     61.3    –      –
+                           HRM                     27M     55.0   40.3   5.0
+                           TRM                      7M     87.4   44.6   7.8
+                           GRAM (Ours)             10M     97.0   52.0   11.1
+                         Human Results
+                           Avg. Human                –      –     60.2     –
+                           Best Human                –      –     98.0   100.0
+
+D.2   Scaling with Parallel Sampling on the ARC-AGI Challenge
+
+To investigate the effect of GRAM’s sampling on the ARC-AGI-1 benchmark, we measured perfor-
+mance without relying on external data augmentation. Typically, TRM achieves its reported accuracy
+by generating 1,000 augmentations for a single problem and performing majority voting over the
+results. Because this augmentation process itself creates a wide variety of samples, we isolated
+the specific effect of generative sampling by performing inference solely on the original problem
+instance and conducting majority voting over multiple sampled paths. For a fair comparison, TRM
+was evaluated using the same hyperparameters as GRAM, including the number of epochs, learning
+rate, and the number of layers.
+As illustrated in Figure 13, removing augmentations causes a performance decline for both GRAM
+and TRM compared to the values reported in Table 8. However, in the case of GRAM, we observe
+that accuracy consistently improves as the model generates more parallel samples. This trend mirrors
+observations in Section 4.2, suggesting that increased inference-time compute through width scaling
+allows the model to explore more plausible reasoning trajectories and recover from initial errors,
+eventually leading to more robust solution discovery.
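+The voting step described above can be sketched as follows. This is a minimal illustration, not the
+authors' evaluation code: the function name `majority_vote`, the tuple-of-rows grid encoding, and the
+first-seen tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(samples: Sequence[Hashable]) -> Hashable:
    """Return the most frequent prediction; ties go to the earliest-seen one."""
    if not samples:
        raise ValueError("need at least one sampled prediction")
    counts = Counter(samples)
    best = max(counts.values())
    # Walk the samples in their original order so tie-breaking is deterministic.
    for s in samples:
        if counts[s] == best:
            return s


# Hypothetical sampled outputs for one puzzle (each grid is a tuple of rows).
grid_a = ((1, 0), (0, 1))
grid_b = ((0, 1), (1, 0))
sampled = [grid_b, grid_a, grid_a, grid_b, grid_a]
voted = majority_vote(sampled)  # grid_a appears 3 times, grid_b 2 times
```

+Predictions must be hashable for this to work, which is why the grids are stored as nested tuples
+rather than lists.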
+
+Interaction between Augmentation and Sampling. A natural question arises: why not combine
+higher levels of augmentation with extensive parallel sampling? To address this, we conducted an
+ablation study examining the interaction between data augmentation and inference-time sampling.
+Figure 14 presents the results across varying augmentation levels (Aug=0 to Aug=50). Without
+augmentation (Aug=0), increasing the number of samples yields consistent accuracy improvements,
+demonstrating that stochastic sampling effectively explores diverse reasoning trajectories. However,
+as the level of augmentation increases, the marginal benefit of additional sampling diminishes
+substantially. At Aug=50, performance saturates regardless of sample count—accuracy remains
+nearly constant whether we draw 1 or 50 samples. This observation reveals that augmentation
+and sampling serve complementary rather than additive roles: both mechanisms enable the model
+to capture solution diversity, but through different means. When training data is limited, parallel
+sampling compensates by exploring varied reasoning paths at inference time. When training data
+is abundant through augmentation, the model has already internalized sufficient diversity during
+training, rendering additional inference-time exploration redundant. Consequently, scaling sampling
+beyond augmentation provides diminishing returns, justifying our experimental design choice to
+evaluate these two scaling axes separately.
+
+
+[Figure 13 plot: Accuracy vs. Number of Samples (N), N from 1 to 500; curves: TRM, GRAM ( =0.1), GRAM ( =0.05), GRAM ( =0.04)]
+Figure 13: Effect of sampling on ARC-AGI-1 without data augmentation. To isolate the internal sampling
+effect, both models are evaluated on original problem instances without 1,000 augmentations. While removing
+augmentations causes an initial performance drop, GRAM exhibits robust scaling through generative sampling
+as the number of parallel samples N increases, outperforming the TRM baseline.
+
+
+[Figure 14 plot: Accuracy vs. Number of Samples (N), N from 1 to 500; curves: Aug=0, Aug=5, Aug=10, Aug=50]
+Figure 14: Effect of augmentation on sampling efficiency. With limited augmentation (Aug=0), parallel
+sampling provides consistent gains. As augmentation increases, sampling benefits diminish: at Aug=50,
+performance saturates regardless of sample count, suggesting augmentation and sampling serve complementary
+roles in capturing solution diversity.
+
+
+
+D.3   Solution Coverage Analysis
+
+We analyze the ability of GRAM to capture the diversity of the solution space compared to determin-
+istic baselines. Figure 15 presents the solution coverage on 8 × 8 and 10 × 10 N-Queens tasks with
+respect to the total number of valid ground-truth solutions.
+As shown in Figure 15, deterministic recursive models (HRM and TRM) exhibit a sharp decline in
+coverage as the number of possible solutions increases. Since these models are constrained to a single
+fixed reasoning trajectory, they structurally fail to explore alternative paths, resulting in severe mode
+collapse in multi-solution landscapes.
+In contrast, GRAM effectively leverages its generative latent transitions to cover a broader range
+of solutions. As the number of parallel samples N increases (from 1 to 20), the solution coverage
+improves monotonically across both 8 × 8 and 10 × 10 settings. This empirical evidence confirms
+that GRAM’s stochastic guidance mechanism is essential for navigating complex problem spaces
+where multiple valid reasoning paths exist.
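+A small sketch of how such a coverage number could be computed. It assumes that coverage means the
+fraction of an instance's ground-truth solutions that appear among the sampled outputs; this appendix
+does not state the exact definition. The board encoding (one column index per row) and the 6 × 6
+toy instance are likewise choices made for the example.

```python
from itertools import permutations
from typing import Iterable, Set, Tuple

Board = Tuple[int, ...]  # board[r] = column of the queen placed in row r


def is_valid_queens(board: Board) -> bool:
    """Check that no two queens share a column or a diagonal."""
    n = len(board)
    if sorted(board) != list(range(n)):  # exactly one queen per column
        return False
    return all(abs(board[i] - board[j]) != j - i
               for i in range(n) for j in range(i + 1, n))


def coverage(samples: Iterable[Board], ground_truth: Set[Board]) -> float:
    """Fraction of distinct ground-truth solutions recovered by the samples."""
    found = {s for s in samples if s in ground_truth}
    return len(found) / len(ground_truth)


# Toy instance: every solution of the empty 6 x 6 board, found by brute force.
ground_truth = {p for p in permutations(range(6)) if is_valid_queens(p)}
solutions = sorted(ground_truth)

# A deterministic model returns the same board on every run.
single_path = [solutions[0]] * 5
# A sampler may return repeats, distinct solutions, and invalid boards.
multi_path = [solutions[0], solutions[0], solutions[1], (0, 1, 2, 3, 4, 5)]

cov_single = coverage(single_path, ground_truth)
cov_multi = coverage(multi_path, ground_truth)
```

+Repeated samples of the same solution do not raise coverage, so a model that always follows one
+trajectory is capped at one solution per instance no matter how many times it is run.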
+
+
+D.4   Additional Generated Image Samples
+
+In this section, we provide further qualitative results demonstrating GRAM’s capability in uncon-
+ditional image generation. Figure 16 presents a diverse set of samples generated on the binarized
+MNIST dataset, visualized across the recursive inference steps t = 0 to t = 16.
+
+
+[Figure 15 plots: Coverage (0.0 to 1.0) vs. Number of Solutions; panels: (a) N-Queens 8 × 8, (b) N-Queens 10 × 10; curves: HRM, TRM, GRAM (N=1), GRAM (N=5), GRAM (N=10), GRAM (N=20)]
+
+ Figure 15: Solution coverage analysis on N-Queens (8 × 8 and 10 × 10) with respect to the number of
+ ground-truth solutions. While deterministic baselines (HRM, TRM) suffer from mode collapse as the solution
+ space grows, GRAM demonstrates monotonic improvement in coverage as the number of parallel samples N
+ increases.
+
+
+
+ As observed in the main text, GRAM exhibits a distinct progressive refinement behavior. Starting
+ from a black initialization, the model iteratively adds details and sharpens the structure of the digit.
+ A particularly compelling property of this process is the model’s ability to recover from initially
+ ambiguous or incorrect formations.
+ For instance, in the second row (generating the digit '2') and the last row (generating the digit '1'),
+ the early predictions at t = 1 and t = 2 manifest as disjointed artifacts or incorrect shapes. However,
+ as the recursion proceeds, GRAM effectively leverages its feedback loop to correct these initial errors,
+ resolving the ambiguity and converging to a coherent, high-quality digit by t = 16.
+
+
+
+
+[Figure 16 image grid: columns show recursion steps t = 0, 1, 2, 4, 6, 7, 9, 11, 12, 14, 16]
+ Figure 16: Additional generated samples from GRAM. We provide 8 additional samples generated uncondi-
+ tionally on binarized MNIST using GRAM. Each row represents a single generated sample, visualized across its
+ recursive refinement process.
+
+
+
+
+D.5   Additional Experiment Results on Unconditional Sudoku Generation
+
+In this section, we provide additional details on unconditional Sudoku generation. Unlike the
+conditional Sudoku-solving setting, where the input board contains given clues, the model receives an
+entirely blank board and samples a complete 9 × 9 Sudoku board from its learned prior. We evaluate
+each generated board using the standard Sudoku validity criterion: every row, column, and 3 × 3 box
+must contain the digits 1 through 9 exactly once. We report the validity rate over 100K generated
+boards. To check whether high validity comes from repeatedly producing the same board, we also
+compute the fraction of unique boards among valid samples.
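+The two metrics described above can be sketched in a few lines. This is an illustrative
+implementation rather than the authors' evaluation script; the function names are hypothetical,
+and the two boards are the valid and invalid samples shown in Figure 17.

```python
from typing import Sequence, Tuple

Board = Sequence[Sequence[int]]
DIGITS = set(range(1, 10))


def is_valid_sudoku(board: Board) -> bool:
    """Every row, column, and 3 x 3 box must contain the digits 1-9 exactly once."""
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    rows = [set(row) for row in board]
    cols = [{board[r][c] for r in range(9)} for c in range(9)]
    boxes = [{board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
             for br in (0, 3, 6) for bc in (0, 3, 6)]
    # Nine cells whose value set equals {1..9} hold each digit exactly once.
    return all(group == DIGITS for group in rows + cols + boxes)


def validity_and_uniqueness(boards: Sequence[Board]) -> Tuple[float, float]:
    """Return (valid fraction of all boards, unique fraction of valid boards)."""
    valid = [tuple(map(tuple, b)) for b in boards if is_valid_sudoku(b)]
    validity = len(valid) / len(boards)
    unique = len(set(valid)) / len(valid) if valid else 0.0
    return validity, unique


VALID = [[3, 6, 5, 4, 7, 8, 9, 2, 1], [9, 4, 1, 2, 5, 6, 8, 7, 3],
         [2, 7, 8, 9, 1, 3, 6, 5, 4], [5, 2, 9, 8, 4, 7, 3, 1, 6],
         [7, 3, 6, 1, 9, 2, 5, 4, 8], [1, 8, 4, 3, 6, 5, 2, 9, 7],
         [8, 9, 7, 5, 3, 4, 1, 6, 2], [4, 1, 3, 6, 2, 9, 7, 8, 5],
         [6, 5, 2, 7, 8, 1, 4, 3, 9]]
INVALID = [[4, 3, 8, 2, 9, 1, 7, 6, 5], [5, 2, 1, 7, 3, 6, 4, 9, 8],
           [7, 6, 9, 4, 5, 8, 1, 2, 3], [9, 1, 3, 4, 4, 7, 5, 8, 6],
           [2, 5, 4, 6, 8, 9, 3, 1, 7], [6, 8, 7, 5, 1, 3, 2, 4, 9],
           [3, 7, 2, 9, 6, 4, 8, 5, 1], [1, 9, 5, 3, 2, 8, 6, 7, 4],
           [8, 4, 6, 1, 7, 5, 9, 3, 2]]

# Two copies of the valid board plus two invalid ones:
# half the boards are valid, and half of the valid boards are unique.
validity, unique = validity_and_uniqueness([VALID, VALID, INVALID, INVALID])
```

+Uniqueness here uses exact board matching, the same criterion the text applies when it reports that
+all valid samples are unique.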
+For GRAM, we construct the unconditional training set from Sudoku-Extreme [8], the Sudoku
+benchmark used by HRM and TRM. We sample 50K complete solutions from the original training
+split, discard the clue patterns, and use an all-blank board as input with the complete solution as
+the target. No data augmentation is used. We train GRAM on this derived 50K-solution set for 200
+epochs with learning rate 10^-4, EMA decay 0.999, and KL coefficient 0.05. The resulting model
+contains 10.9M parameters and uses 16 inference steps.
+For D3PM baselines, we use a DiT-style Transformer backbone and evaluate two model sizes. D3PM-
+Big uses hidden dimension 768, 5 Transformer blocks, and 12 attention heads, yielding 55.1M
+parameters, while D3PM-Small uses hidden dimension 512, 3 Transformer blocks, and 8 attention
+heads, yielding 15.9M parameters. Both variants are trained on the same derived training set and
+generate boards with 1000 denoising steps.
+As shown in Table 9, GRAM achieves 99.05% validity, outperforming all D3PM baselines. The
+strongest D3PM baseline, D3PM-Uniform (Big), reaches 91.33% validity while using 55.1M parame-
+ters and 1000 denoising steps. In contrast, GRAM uses fewer parameters and only 16 inference steps.
+In all cases, the valid samples are unique under exact board matching, indicating that the reported
+validity is not due to simple repetition of a small set of boards. These results show that GRAM can
+generate highly constrained symbolic structures from an empty input, supporting its potential as a
+generator beyond conditional puzzle solving.
+Figure 17 illustrates the unconditional Sudoku generation setup. Starting from an empty board,
+the task is to generate complete boards, and validity is determined by whether the generated board
+satisfies all Sudoku constraints. Figure 7 shows qualitative examples of boards generated by GRAM.
+
+Table 9: Unconditional Sudoku generation. We report the ratio of generated boards satisfying Sudoku
+constraints over 100K samples. All valid boards are unique for all methods in this evaluation.
+                         Method                     #Params            Steps       Validity(%)
+                         D3PM-Uniform (Big)         55.1M              1000           91.33
+                         D3PM-Uniform (Small)       15.9M              1000           29.24
+                         D3PM-Absorb (Big)          55.1M              1000           79.18
+                         D3PM-Absorb (Small)        15.9M              1000           21.88
+                         GRAM (Ours)                10.9M               16            99.05
+
+
+           Empty input                                  Valid sample                              Invalid sample
+                                            3   6   5    4   7   8     9   2   1        4     3   8   2   9   1   7   6   5
+                                            9   4   1    2   5   6     8   7   3        5     2   1   7   3   6   4   9   8
+                                            2   7   8    9   1   3     6   5   4        7     6   9   4   5   8   1   2   3
+                                            5   2   9    8   4   7     3   1   6        9     1   3   4   4   7   5   8   6
+                                            7   3   6    1   9   2     5   4   8        2     5   4   6   8   9   3   1   7
+                                            1   8   4    3   6   5     2   9   7        6     8   7   5   1   3   2   4   9
+                                            8   9   7    5   3   4     1   6   2        3     7   2   9   6   4   8   5   1
+                                            4   1   3    6   2   9     7   8   5        1     9   5   3   2   8   6   7   4
+                                            6   5   2    7   8   1     4   3   9        8     4   6   1   7   5   9   3   2
+
+
+Figure 17: Unconditional Sudoku generation setup. Starting from an empty board, the task is to generate
+complete Sudoku boards. The valid sample satisfies all Sudoku constraints, while red entries in the invalid
+sample indicate cells involved in constraint violations.
+
+
+
+D.6   Visualizing Latent Recursion Process
+
+To understand how stochastic guidance shapes reasoning, we visualize latent trajectories during
+recursive computation. Specifically, we track the high-level state h at each supervision step throughout
+the recursion process. For visualization, we project these latent vectors into 2D using PCA [59] and
+interpolate unobserved states via a K-D tree [60] to construct a continuous loss landscape.
+Figures 18 and 19 compare TRM and GRAM on the same Sudoku puzzle. TRM follows a single
+deterministic path from initialization to solution, offering no mechanism to escape if the trajectory
+enters a suboptimal region. In contrast, GRAM samples diverse trajectories that explore different
+regions of latent space before converging. While some trajectories become trapped in local minima
+(bright yellow regions), others successfully navigate toward the global optimum (dark blue regions).
+This diversity enables GRAM to discover valid solutions more reliably through parallel exploration.
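+The projection step can be sketched with NumPy as below. Several things here are assumptions: the
+latent states and losses are synthetic, PCA is computed through an SVD of the centered states, and
+the landscape is filled by brute-force nearest-neighbor lookup, which is a simple stand-in for the
+K-D tree query mentioned above (the paper's exact interpolation scheme is not specified here).

```python
import numpy as np


def pca_project(states: np.ndarray, k: int = 2) -> np.ndarray:
    """Project the rows of `states` (n, d) onto their top-k principal components."""
    centered = states - states.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


def nearest_loss(query_xy: np.ndarray, points_xy: np.ndarray,
                 losses: np.ndarray) -> np.ndarray:
    """Give each query point the loss of its nearest observed point."""
    diff = query_xy[:, None, :] - points_xy[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)  # (num_queries, num_points)
    return losses[dist2.argmin(axis=1)]


# Synthetic stand-ins: 5 trajectories x 17 steps of a 32-dimensional state h.
rng = np.random.default_rng(0)
states = rng.normal(size=(5 * 17, 32))
losses = rng.random(5 * 17)

xy = pca_project(states)  # (85, 2) coordinates for plotting

# Evaluate the nearest-neighbor landscape on a 20 x 20 grid over the projection.
gx, gy = np.meshgrid(np.linspace(xy[:, 0].min(), xy[:, 0].max(), 20),
                     np.linspace(xy[:, 1].min(), xy[:, 1].max(), 20))
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
landscape = nearest_loss(grid, xy, losses).reshape(20, 20)
```

+For many observed states, a spatial index such as a K-D tree would replace the dense distance
+matrix, which grows with the product of grid size and number of states.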
+
+
+
+
+Figure 18: Latent reasoning trajectory of TRM. The red dot indicates the initial state h0 and the green dot
+indicates the final state hT . Background color represents the loss landscape: bright yellow corresponds to high
+loss regions, while dark blue indicates low loss (optimal) regions. TRM follows a single deterministic path with
+no ability to escape suboptimal trajectories.
+
+
+
+
+Figure 19: Latent reasoning trajectories of GRAM (50 samples). Using the same visualization scheme as
+Figure 18, we show 50 sampled trajectories from GRAM. The stochastic guidance enables diverse exploration
+of the latent space: while some trajectories converge to local minima (bottom right), others successfully reach
+the global optimum (middle left), demonstrating how parallel sampling improves solution discovery.
+
+
+
+
+Licenses
+Table 10: Existing assets, licenses, and source links. We list the existing datasets, benchmarks, and public
+reference implementations used or cited in our experiments. Synthetic N-Queens and Graph Coloring instances
+are generated by the authors and are therefore not external assets.
+
+ Asset | Use in this paper | License / terms | Source link
+ MNIST | Binarized MNIST generation experiments | Creative Commons Attribution-Share Alike 3.0 | https://keras.io/api/datasets/mnist/
+ ARC-AGI-1 / original ARC | ARC-AGI reasoning benchmark | Apache License 2.0 | https://github.com/fchollet/ARC-AGI
+ ARC-AGI-2 | ARC-AGI-2 reasoning benchmark / reference results | Apache License 2.0 | https://github.com/arcprize/ARC-AGI-2
+ HRM repository | HRM baseline and Sudoku-Extreme-related reference implementation | Apache License 2.0 | https://github.com/sapientinc/HRM
+ TinyRecursiveModels / TRM repository | TRM baseline and recursive reasoning reference implementation | MIT License | https://github.com/SamsungSAILMontreal/TinyRecursiveModels
+ MDLM repository | Masked diffusion baseline reference implementation, if public code is used | Apache License 2.0 | https://github.com/kuleshov-group/mdlm
+ Google Research D3PM implementation | D3PM image-generation baseline reference implementation, if public code is used | Apache License 2.0 | https://github.com/google-research/google-research/blob/master/d3pm/images/diffusion_categorical.py
+ Looped Transformer repository | Looped Transformer baseline reference implementation, if public code is used | MIT License | https://github.com/Leiay/looped_transformer
+ N-Queens | Synthetic multi-solution constraint satisfaction task generated by the authors | Not an external asset | N/A
+ Graph Coloring | Synthetic multi-solution constraint satisfaction task generated by the authors | Not an external asset | N/A
+
+
+
+
+
+\ No newline at end of file
diff --git a/papers/txt/gram2026_generative_recursive.txt b/papers/txt/gram2026_generative_recursive.txt
new file mode 100644
index 0000000..d5f299d
--- /dev/null
+++ b/papers/txt/gram2026_generative_recursive.txt
@@ -0,0 +1,1532 @@
+                                                                    Generative Recursive Reasoning
+
+
+                                                                      Junyeob Baek1†∗ Mingyu Jo1†∗          Minsu Kim1,2
+
+                                                                     Mengye Ren3       Yoshua Bengio2,4     Sungjin Ahn1,3†
+arXiv:2605.19376v2 [cs.AI] 20 May 2026
+
+
+
+
+                                                                     1 KAIST   2 Mila – Québec AI Institute
+                                                                     3 New York University   4 Université de Montréal
+
+
+
+                                                                                          Abstract
+                                                     How should future neural reasoning systems implement extended computation?
+                                                     Recursive Reasoning Models (RRMs) offer a promising alternative to autoregres-
+                                                     sive sequence extension by performing iterative latent-state refinement with shared
+                                                     transition functions. Yet existing RRMs are largely deterministic, following a
+                                                     single latent trajectory and converging to a single prediction. We introduce Gen-
+                                                     erative Recursive reAsoning Models (GRAM), a framework that turns recursive
+                                                     latent reasoning into probabilistic multi-trajectory computation. GRAM models
+                                                     reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alterna-
+                                                     tive solution strategies, and inference-time scaling through both recursive depth
+                                                     and parallel trajectory sampling. This yields a latent-variable generative model
+                                                     supporting conditional reasoning via pθ (y | x) and, with fixed or absent inputs,
+                                                     unconditional generation via pθ (x). Trained with amortized variational inference,
+                                                     GRAM improves over deterministic recurrent and recursive baselines on structured
+                                                     reasoning and multi-solution constraint satisfaction tasks, while demonstrating an
+                                                     unconditional generation capability. https://ahn-ml.github.io/gram-website
+
+
+                                         1     Introduction
+                                         A central question for future neural reasoning systems is how extended computation should be imple-
+                                         mented. Large autoregressive models typically scale reasoning by extending a sequence-generation
+                                         process, whether intermediate computation is expressed explicitly as chain-of-thought tokens or im-
+                                         plicitly in hidden or latent representations [1–6]. A complementary direction is explored by Recursive
+                                         Reasoning Models (RRMs), which use repeated computation to refine a persistent latent state rather
+                                         than to append new elements to an output or reasoning sequence [7–9]. This approach is appealing
+                                         because it decouples reasoning depth from both parameter scale and output length: a compact model
+                                         can perform many steps of internal computation by repeatedly applying shared transition functions
+                                         over time.
+                                         Recent recursive reasoning models such as HRM [8] and TRM [9] provide early evidence for the
+                                         potential of this approach in structured reasoning. Rather than producing a solution in a single
+                                         feedforward pass, they perform extended computation through iterative latent-state refinement, deep
+                                         supervision across refinement steps, and reasoning-oriented recurrent designs such as hierarchical
+                                         latent dynamics. These features make them well suited to problems requiring constraint propagation,
+                                         state tracking, iterative correction, and multi-step inference. More broadly, they build on a principle
+                                             ∗ Equal contribution
+                                            † Correspondence to: Junyeob Baek (wnsdlqjtm@kaist.ac.kr), Mingyu Jo (mingyu.jo@kaist.ac.kr), Sungjin Ahn (sungjin.ahn@kaist.ac.kr)
+
+
+                                         Preprint.
+[Figure 1 panels: N-Queens example (Input Task, Solution 1, Solution 2); (a) Deterministic RRMs; (b) GRAM (Ours)]
+
+Figure 1: Comparison of Latent Reasoning Trajectories. Left: N-Queens Example with two valid solutions.
+Right: Given three independent runs for latent reasoning (τ1, τ2, τ3): (a) Prior RRMs (e.g. HRM, TRM) are
+deterministic: all runs collapse to an identical trajectory, converging to a single solution and failing to explore
+alternatives, while (b) GRAM explores diverse trajectories that reach multiple valid solutions y1 and y2,
+naturally enabling parallel inference-time scaling.
+
+also explored in recurrent Transformer architectures such as Universal Transformers [10] and Looped
+Transformers [7]: shared Transformer blocks can be repeatedly applied to increase computational
+depth without increasing parameter count. Together, these models suggest that reasoning capability
+can emerge not only from scaling model size or generating longer traces, but also from the organization
+of computation itself.
+While recurrent latent-state refinement provides an appealing mechanism for efficiently increasing
+reasoning depth, depth alone is not sufficient for many reasoning problems. A capable reasoning
+system should also be able to maintain uncertainty, consider alternative hypotheses, and explore
+multiple possible solution strategies [11, 12]. This is especially important in settings where ambiguity
+or multiple valid solutions are intrinsic, and more generally in problems where a single refinement
+path may become trapped in a suboptimal reasoning trajectory. In this sense, future RRMs should be
+not only deep, in the sense of repeated refinement, but also wide, in the sense of maintaining and
+exploring multiple latent trajectories in parallel.
+Existing RRMs [7–10], however, remain fundamentally deterministic: given the same input and
+initialization, they follow a single latent trajectory and converge to a single prediction. This deter-
+ministic recursion collapses the space of plausible reasoning paths into a single attractor, leaving
+probabilistic multi-hypothesis latent reasoning largely unexplored within the RRM paradigm. This
+motivates the central question of our work: can recursive latent computation support probabilistic,
+generative, multi-hypothesis reasoning while preserving the efficiency of compact recurrent models?
+In this paper, we propose Generative Recursive reAsoning Models (GRAM), a framework that turns
+recursive latent reasoning into probabilistic multi-trajectory computation. GRAM treats the reasoning
+process itself as a stochastic latent trajectory: at each recursion step, the model samples a transition
+conditioned on the input and the current reasoning state, rather than deterministically updating to a
+single next state. Repeating this process defines a distribution over possible reasoning trajectories,
+allowing the model to maintain multiple hypotheses, explore alternative solution strategies, and scale
+inference not only by increasing recursive depth but also by sampling trajectories in parallel. From
+a probabilistic perspective, GRAM is a latent-variable generative model: it models pθ (y | x) by
+marginalizing over latent reasoning trajectories, while the same recursive process can also define an
+unconditional generative model pθ (x) when the input is fixed or absent.
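+The contrast between deterministic and stochastic recursion can be illustrated with a toy scalar
+model. Everything in this sketch is hypothetical: the `transition` function, the additive Gaussian
+noise, and all constants are chosen only to show the qualitative behavior, and none of it
+reproduces GRAM's learned prior or posterior.

```python
import math
import random
from typing import List


def transition(h: float, x: float) -> float:
    """Hypothetical shared update applied at every recursion step."""
    return math.tanh(0.9 * h + x)


def rollout(x: float, steps: int, noise_std: float, seed: int) -> List[float]:
    """Unroll the recursion from h = 0, adding Gaussian noise after each update."""
    rng = random.Random(seed)
    h = 0.0
    trajectory = [h]
    for _ in range(steps):
        # With noise_std = 0 this reduces to a purely deterministic recursion.
        h = transition(h, x) + rng.gauss(0.0, noise_std)
        trajectory.append(h)
    return trajectory


# Three independent runs on the same input, as in Figure 1.
deterministic = [rollout(0.3, 16, 0.0, seed) for seed in range(3)]
stochastic = [rollout(0.3, 16, 0.1, seed) for seed in range(3)]

all_identical = deterministic[0] == deterministic[1] == deterministic[2]
num_distinct = len({tuple(t) for t in stochastic})
```

+Without noise, every run traces the same path regardless of seed, so extra runs add nothing. With
+noise, each run is a separate sample from a distribution over trajectories, which is what makes
+scaling by the number of parallel runs meaningful.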
+We evaluate GRAM on controlled reasoning and generation tasks that serve as probes of the ar-
+chitectural properties targeted by our formulation: recursive refinement, stochastic exploration,
+multi-solution coverage, and inference-time scaling. Given this goal, our experiments focus on
+comparisons with the most relevant deterministic recurrent and recursive latent reasoning baselines,
+including Looped Transformers, HRM, and TRM, rather than frontier-scale general-purpose LLMs
+whose training data, inference budgets, and external scaffolding are not directly comparable. Sudoku-
+Extreme [8] and ARC-AGI [13, 14] test structured reasoning under hard constraints and abstract
+transformations; N-Queens and Graph Coloring evaluate multi-solution recovery; and binarized
+MNIST [15] probes the unconditional generative interpretation.
+Our main contribution is to establish probabilistic multi-trajectory recursion as a design principle
+for future recurrent and recursive reasoning architectures. Concretely, we make three contributions.
+First, we formulate recursive reasoning as a latent-variable generative process, where solutions are
+obtained by marginalizing over stochastic reasoning trajectories. Second, we introduce width-based
+inference-time scaling, enabling inference to scale not only with recursive depth but also with the
+number of sampled latent trajectories. Third, we provide empirical evidence that this formulation
+yields the intended architectural advantages over deterministic recurrent and recursive baselines,
+improving structured reasoning, multi-solution constraint satisfaction, and unconditional generation.
+
+2     Generative Recursive Reasoning Models
+In this section, we introduce Generative Recursive reAsoning Models (GRAM), an instantiation
+of probabilistic recursive reasoning. We describe the architecture in Section 2.1 and the training
+procedure in Section 2.2, with an architecture schematic shown in Figure 2.
+
+2.1   Architecture
+
+Overview. GRAM models the conditional distribution pθ (y | x) by marginalizing over stochastic
+latent reasoning trajectories. Given an input x, GRAM first computes an embedding
+                ex = fenc (x; θ),                       (1)
+which is reused throughout the entire recursive computation. Starting from a fixed initial latent
+state z0 , the model evolves the latent state through learned stochastic transitions. The recursive
+computation is organized into two nested levels: inner and outer loops.
+
+[Figure 2 diagram; labels: Encoder fenc , fL (K times), fH , Prior pθ (· | ut ), Posterior qϕ (· | ut , y), Decoder fdec ; (A) CE loss, (B) KL Div.]
+Figure 2: GRAM Architecture. A single stochastic latent transition in the hierarchical instantiation
+z = (h, l). After K low-level refinements via fL , the high-level update fH produces a deterministic
+proposal ut , to which stochastic guidance ϵt is added: ht = ut + ϵt .
+
+At the inner level, a latent transition samples a new latent state conditioned on the previous
+latent state and the input embedding,
+                                  zt ∼ pθ (zt | zt−1 , ex ),              t = 1, . . . , T.              (2)
+At the end of the T transitions, the decoder produces a prediction, ŷ = arg max fdec (zT ; θ). We refer
+to the sequence of T transitions from the initial state z0 to the final state zT as a supervision step. A
+supervision step is the unit at which the decoder is invoked, and the training objective is applied, with
+gradients computed as described in Section 2.2.
+At the outer level, Nsup supervision steps are applied recursively, with the final state of one supervision
+step serving as the initial state of the next, thereby forming the full recursive computation:
+    z_0^{(1)} \xrightarrow{T \text{ transitions}} z_T^{(1)} = z_0^{(2)} \xrightarrow{T \text{ transitions}} \cdots \xrightarrow{T \text{ transitions}} z_T^{(N_{\mathrm{sup}})},    (3)
+where z_t^{(n)} denotes the latent state at the t-th transition of the n-th supervision step, z_0^{(1)} is the
+fixed initial state, and the terminal state of one supervision step serves as the initial state of the next
+(z_0^{(n+1)} := z_T^{(n)}). This abstract formulation can be instantiated with various recurrent Transformer
+backbones, including flat designs such as Universal Transformers and Looped Transformers [10, 7],
+as well as hierarchical designs such as HRM and TRM [8, 9].
+Stochastic Latent Transitions. Unlike prior recursive reasoning models (RRMs) that update the
+latent state deterministically and follow a single fixed trajectory [8, 9], GRAM defines pθ (zt |
+zt−1 , ex ) as a stochastic transition, so that repeated computation induces a distribution over latent
+reasoning trajectories. Concretely, GRAM realizes this transition as a learned stochastic residual
+perturbation around a deterministic update: at each transition, the model first computes a deterministic
+update ut from zt−1 and ex , then samples a conditional perturbation from a state-dependent Gaussian,
+and adds it to ut :
+    \epsilon_t \sim p_\theta(\epsilon_t \mid u_t) := \mathcal{N}\big(\mu_\theta(u_t),\, \sigma_\theta^2(u_t) I\big),    (4)
+    z_t = u_t + \epsilon_t.    (5)
+We refer to ϵt as the learnable stochastic guidance. The mean µθ (ut ) encodes a state-dependent
+direction in which the trajectory is steered, while the variance σθ2 (ut ) controls the amount of ex-
+ploration. This design allows GRAM to capture uncertainty, prevent convergence to local minima,
+and support robust exploration of the solution space without discarding the deterministic refinement
+performed by ut .
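+A minimal sketch of the transition in Equations (4)–(5) in reparameterized form, included to make the
+sampling step concrete. The auxiliary noise ξt is notation introduced here (an assumption, not the
+paper's), and σθ (ut ) is read as the square root of the variance in Equation (4):
+```latex
+\begin{aligned}
+\xi_t &\sim \mathcal{N}(0, I), \\
+\epsilon_t &= \mu_\theta(u_t) + \sigma_\theta(u_t)\, \xi_t, \\
+z_t &= u_t + \epsilon_t
+\;\;\Longrightarrow\;\;
+z_t \mid u_t \sim \mathcal{N}\big(u_t + \mu_\theta(u_t),\; \sigma_\theta^2(u_t)\, I\big).
+\end{aligned}
+```
+Under this reading, µθ ≡ 0 together with σθ → 0 recovers the deterministic update zt = ut .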
+Hierarchical Instantiation. We instantiate the latent state with two interacting components, z =
+(h, l). The high-level component h is updated once per latent transition and carries abstract reasoning
+state, while the low-level component l is updated K times within a single transition and carries
+fine-grained intermediate computation. This decomposition separates the two roles across time scales,
+with h accumulating slowly across transitions and l refined rapidly within each one.
+With this hierarchical multi-scale structure, a single transition zt−1 → zt is computed as follows.
+The low-level component is first refined for K updates, with the high-level component held fixed:
+                              lt,k = fL (ht−1 , lt,k−1 , ex ; θ),            k = 1, . . . , K,                (6)
+where lt,0 := lt−1 and we write lt := lt,K for the refined low-level component. The high-level
+component is then updated as a stochastic transition conditioned on the refined lt ,
+    u_t = f_H(h_{t-1}, l_t; \theta),    (7)
+    \epsilon_t \sim p_\theta(\epsilon_t \mid u_t) := \mathcal{N}\big(\mu_\theta(u_t),\, \sigma_\theta^2(u_t) I\big),    (8)
+    h_t = u_t + \epsilon_t,    (9)
+and we set zt = (ht , lt ). Note that stochasticity is introduced only at the high level: the low-
+level refinement is fully deterministic, while the stochastic guidance signal ϵt acts on the slower,
+more abstract component of the latent state, where it can steer the overall reasoning trajectory
+across transitions.³ Under this instantiation, the decoder reads only the high-level component, i.e.,
+fdec (zT ) = fdec (hT ). Additional architectural details are provided in Appendix B.1.
+Modeling Unconditional Distribution. While the description so far focuses on the conditional
+setting pθ (y | x), the same recursive process can also be defined as an unconditional generative model
+pθ (x) when the input is replaced with an empty conditioning embedding. We use this formulation for
+generation tasks in Section 4.3.
+
+2.2   Training
+
+GRAM is trained to model the conditional distribution pθ (y | x), where each training example
+consists of an input x and its corresponding target y. As a probabilistic model, GRAM adopts a
+latent-variable formulation and is optimized by maximizing an evidence lower bound (ELBO) with
+respect to the generative parameters θ and variational parameters ϕ.
+Latent Variable Modeling. We model GRAM as a latent-variable probabilistic model pθ , where
+the full latent trajectory τ = (z0 → · · · → zTTotal ) consists of a sequence of latent variables, with
+TTotal = T × Nsup . The conditional likelihood is defined as
+    p_\theta(y \mid x) = \int p_\theta(y \mid \tau, x)\, p_\theta(\tau \mid x)\, d\tau,    (10)
+
+where x denotes the input problem and y denotes the corresponding ground-truth output.
+Direct maximum likelihood estimation of log pθ (y | x) is intractable due to the marginalization
+over latent trajectories. We therefore introduce a variational posterior qϕ (τ | x, y) and optimize the
+evidence lower bound (ELBO), jointly training θ and ϕ via variational inference:
+              log pθ (y | x) ≥ Eqϕ (τ |x,y) [log pθ (y | τ, x)] − KL(qϕ (τ | x, y) ∥ pθ (τ | x)) .           (11)
+
+During training, latent trajectories are sampled from the variational posterior qϕ (· | x, y), which has
+access to both the input problem x and the target output y. At inference time, where y is unavailable,
+trajectories are instead generated from the learned prior pθ (· | x).
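+
+For reference, the bound in Equation (11) follows from the standard importance-weighting argument
+combined with Jensen's inequality. The sketch below uses only the quantities defined above and
+assumes qϕ (τ | x, y) > 0 wherever the integrand is non-zero:
+```latex
+\begin{aligned}
+\log p_\theta(y \mid x)
+&= \log \int q_\phi(\tau \mid x, y)\,
+   \frac{p_\theta(y \mid \tau, x)\, p_\theta(\tau \mid x)}{q_\phi(\tau \mid x, y)}\, d\tau \\
+&\ge \mathbb{E}_{q_\phi(\tau \mid x, y)}\!\left[
+   \log p_\theta(y \mid \tau, x)
+   + \log \frac{p_\theta(\tau \mid x)}{q_\phi(\tau \mid x, y)} \right] \\
+&= \mathbb{E}_{q_\phi(\tau \mid x, y)}\big[\log p_\theta(y \mid \tau, x)\big]
+   - \mathrm{KL}\big(q_\phi(\tau \mid x, y) \,\|\, p_\theta(\tau \mid x)\big).
+\end{aligned}
+```
+The gap in the inequality equals KL(qϕ (τ | x, y) ∥ pθ (τ | x, y)), so the bound is tight when the
+variational posterior matches the true posterior over trajectories.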
+   ³ We also tried injecting noise into the low-level state, but found that it did not improve performance.
+
+Both the prior and the posterior are modeled as conditional Markov processes over latent states:
+    p_\theta(\tau \mid x) = p(z_0) \prod_{t=1}^{T_{\mathrm{Total}}} p_\theta(z_t \mid z_{t-1}, x), \qquad q_\phi(\tau \mid x, y) = p(z_0) \prod_{t=1}^{T_{\mathrm{Total}}} q_\phi(z_t \mid z_{t-1}, x, y).    (12)
+Here, z0 is a fixed initial state shared by the prior and posterior. Both transitions are implemented by
+adding reparameterized Gaussian noise ϵt after a deterministic update ut ; the posterior uses the same
+transition module as the prior, but samples from a target-conditioned noise distribution qϕ (ϵt | ut , y),
+whereas the prior uses pθ (ϵt | ut ).
+Since the two processes share the same Markov structure and all stochasticity is introduced through
+ϵ1:TTotal , their trajectory distributions can be equivalently represented in noise space. Moreover,
+since GRAM decodes the output only from the terminal latent state, the likelihood term satisfies
+pθ (y | τ, x) = pθ (y | zTTotal , x). Therefore, the full trajectory-level ELBO can be written as
+    \mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi}\big[\log p_\theta(y \mid z_{T_{\mathrm{Total}}}, x)\big] - \sum_{t=1}^{T_{\mathrm{Total}}} \mathbb{E}_{q_\phi(\epsilon_{<t} \mid x, y)}\big[\mathrm{KL}\big(q_\phi(\epsilon_t \mid u_t, y) \,\|\, p_\theta(\epsilon_t \mid u_t)\big)\big].    (13)
+Here, ut = fH (ht−1 , lt ) denotes the deterministic high-level update before noise injection, as defined
+in Equation (7). Since ut depends on ht−1 , which is determined by the previously sampled noise
+variables ϵ<t := (ϵ1 , . . . , ϵt−1 ), the expectation averages over these ancestral samples.
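+Each KL term in Equation (13) has a closed form when both noise distributions are diagonal Gaussians.
+The sketch below assumes the posterior is parameterized as qϕ (ϵt | ut , y) = N (µϕ , diag(σϕ2 )) and
+writes the prior per coordinate; the symbols µϕ , σϕ , the coordinate index d, and the dimension D are
+introduced here for illustration and are not taken from the paper:
+```latex
+\mathrm{KL}\big(q_\phi \,\|\, p_\theta\big)
+= \frac{1}{2} \sum_{d=1}^{D} \left[
+  \log \frac{\sigma_{\theta,d}^{2}}{\sigma_{\phi,d}^{2}}
+  + \frac{\sigma_{\phi,d}^{2} + (\mu_{\phi,d} - \mu_{\theta,d})^{2}}{\sigma_{\theta,d}^{2}}
+  - 1 \right]
+```
+Each summand is non-negative and vanishes exactly when the two distributions agree in that coordinate.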
+Practical Implementation. In practice, following previous recursive reasoning models [8, 9], we
+train GRAM with deep supervision over Nsup consecutive supervision steps, each consisting of T
+recursive latent transitions. This provides dense learning signals along the full latent trajectory, rather
+than supervising only the final state after TTotal = T × Nsup transitions. The terminal state of each
+step is reused as the initial state of the next step.
+Following standard practice for recurrent models with long computation chains, we apply truncated
+gradient propagation [16, 17], as used in recent recursive reasoning models [8, 9, 18]. In our
+implementation, gradients are propagated only through the final transition of each supervision step,
+z_{T-1}^{(n)} → z_T^{(n)}. This gives the following surrogate objective for each supervision step:
+    \mathcal{L}_{\mathrm{GRAM}}^{(n)}(x, y; \theta, \phi) = \mathbb{E}_{q_\phi}\big[\log p_\theta(y \mid z_T^{(n)}, x)\big] - \mathrm{KL}\big(q_\phi(\epsilon_T^{(n)} \mid u_T^{(n)}, y) \,\|\, p_\theta(\epsilon_T^{(n)} \mid u_T^{(n)})\big),    (14)
+where z_T^{(n)} is the terminal state of the current supervision step n, and gradients are stopped through
+preceding states. Thus, LGRAM should be viewed as a truncated surrogate objective rather than the
+exact ELBO; it introduces a biased but memory-efficient approximation to the full ELBO. Further
+analysis of this approximation is provided in Appendix A.3, and detailed training hyperparameters
+are listed in Appendix B.2.
+
+2.3   Inference-Time Scaling
+
+GRAM supports two complementary axes of inference-time scaling: depth, by varying the number
+of recursive transitions, and width, by sampling multiple latent reasoning trajectories in parallel.
+For depth, we follow prior recursive reasoning models [8, 9] in adopting adaptive computation
+time (ACT) [8–10], which allows each trajectory to terminate at a learned halting depth (details in
+Appendix A.1). For width, the focus of this section, we draw {τ^{(i)}}_{i=1}^{N} ∼ pθ (τ | x) from the
+learned prior and decode each terminal state into a candidate output ŷ^{(i)} = fdec (z_T^{(i)}), exploring
+multiple stochastic reasoning paths simultaneously rather than extending a single trajectory.
+To select among candidates, we use either majority voting or best-of-N with a Latent Process Reward
+Model (LPRM). The LPRM is a value head vψ (zt ) trained to predict the final quality of a trajectory
+from its latent state, using a regression target r ∈ [0, 1] given by the final prediction accuracy. At
+inference time, majority voting selects the most frequent prediction, whereas LPRM-guided selection
+chooses the candidate with the highest predicted terminal value. Details of LPRM training are
+provided in Appendix A.2. Overall, this procedure improves robustness and solution quality through
+parallel exploration, without increasing the sequential recursion length.
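+The two selection rules above can be written compactly as follows. This is a sketch: the subscripts
+MV and LPRM and the index i⋆ are labels introduced here, the value head is evaluated at the terminal
+state of each trajectory (our reading of "highest predicted terminal value"), and tie-breaking is
+left unspecified:
+```latex
+\begin{aligned}
+\hat{y}_{\mathrm{MV}} &= \arg\max_{y} \sum_{i=1}^{N} \mathbb{1}\big[\hat{y}^{(i)} = y\big], \\
+i^{\star} &= \arg\max_{i \in \{1, \dots, N\}} v_\psi\big(z_T^{(i)}\big), \qquad
+\hat{y}_{\mathrm{LPRM}} = \hat{y}^{(i^{\star})}.
+\end{aligned}
+```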
+
+3     Related Work
+[Figure 3 bar charts, accuracy (%). Sudoku: Looped TF 61.3, HRM 55.0, TRM 87.4, GRAM (Ours) 97.0. ARC-AGI-1: HRM 40.3, TRM 44.6, GRAM (Ours) 52.0, o3-mini-high 34.5, GPT 5.2 (low) 55.7, Grok-4-thinking 66.7. ARC-AGI-2: HRM 5.0, TRM 7.8, GRAM (Ours) 11.1, o3-mini-high 3.0, GPT 5.2 (low) 9.7, Grok-4-thinking 16.0. Legend: Recursive Models, GRAM (Ours), LLMs.]
+Figure 3: Performance on puzzle benchmarks. On both Sudoku-Extreme and ARC-AGI, GRAM consistently
+outperforms all deterministic recursive baselines (Looped TF, HRM, TRM), demonstrating that stochastic latent
+transitions yield substantial gains within the recursive-reasoning paradigm. Looped TF results on ARC-AGI are
+omitted due to prohibitive training cost (see Section C.1.1). Note that large reasoning model scores are included
+only as external reference points for benchmark difficulty.
+
+Latent Reasoning. Latent reasoning aims to reduce the inefficiency and verbosity of explicit
+Chain-of-Thought (CoT) by shifting part or all of the reasoning process into latent or continuous
+representations [1–6]. By avoiding token-by-token generation of intermediate steps, such
+representations can make reasoning traces more compact and reduce generation overhead. Existing approaches
+instantiate this idea through hidden states, latent or soft tokens, continuous thoughts, internal rea-
+soning traces, and recursive state updates for scaling test-time computation [4, 7, 19–23, 18, 24–26].
+However, many remain organized around autoregressive sequence generation, where additional
+computation is tied to generating more tokens, latent positions, or sequential reasoning states.
+Recursive Architectures. Recursive architectures perform iterative state updates and have evolved
+from RNNs to weight-sharing Transformers with adaptive computation [7, 10, 27–32, 25]. Recent
+recursive reasoning models show that increasing inference-time depth can outperform larger static
+models [8, 9, 18, 24]. GRAM builds on this line but formulates recurrence as a probabilistic process:
+instead of following a single deterministic refinement path, it maintains stochastic latent trajectories,
+enabling multi-path exploration and generative sampling.
+Probabilistic Latent State-Space Models. Probabilistic recurrent models use stochastic latent
+transitions to capture uncertainty and multimodal dynamics, often trained with variational infer-
+ence [33–38]. They have been widely used in sequential generative modeling, video prediction,
+and model-based reinforcement learning. GRAM shares this latent state-space view but reinterprets
+stochastic dynamics as computation rather than temporal observation modeling: latent transitions
+define possible reasoning trajectories, supporting multi-hypothesis exploration and both conditional
+pθ (y | x) and unconditional pθ (x) generation.
+
+
+4                Experiments
+
+GRAM is designed as an architecture for probabilistic recursive reasoning, rather than as a general-
+purpose large language reasoning model whose training data, inference budgets, prompting strategies,
+tool use, and external scaffolding are not directly comparable. Following prior work on recurrent and
+recursive reasoning models [8, 9], we therefore evaluate GRAM on standard structured reasoning
+tasks that probe the computational properties targeted by our formulation: iterative latent refinement,
+stochastic trajectory exploration, multi-solution coverage, and inference-time scaling.
+In the following, we first evaluate structured reasoning performance on Sudoku-Extreme and ARC-
+AGI (Section 4.1). We then assess multi-solution behavior on N-Queens and Graph Coloring
+(Section 4.2). Next, we examine the unconditional generative interpretation of GRAM on binarized
+MNIST (Section 4.3). Finally, we perform ablation studies to evaluate the impact of key design
+choices (Section 4.4).
+
+4.1                  Challenging Puzzle Tasks
+
+[Figure 4 plots. Left: accuracy vs. iterations (8, 16, 32, 128, 320) for Looped TF, HRM, TRM, and GRAM (Ours) with N = 1, 5, 10, 20, 50. Right: accuracy vs. number of solutions (2 to 18) for Looped TF, HRM, TRM, and GRAM (N=1).]
+Figure 4: (Left) Inference-time scaling on Sudoku-Extreme. While both TRM and GRAM benefit from
+longer recursion (x-axis), GRAM additionally scales with parallel sampling (N = number of samples); each
+iteration corresponds to a supervision step, which amounts to K× more flat iterations in Looped TF. (Right)
+Accuracy across number of solutions in N-Queens (8 × 8). Conventional deterministic recursive models suffer
+a sharp performance drop as the number of possible solutions increases, whereas GRAM maintains consistent
+performance.
+
+Setup. We evaluate on Sudoku-Extreme [8], which contains 9×9 puzzles with minimal clues requiring
+extensive constraint propagation, and ARC-AGI Challenge [13, 14], which tests abstract
+visual reasoning through few-shot pattern recognition. We compare against direct prediction
+(Transformer [39]) and recursive baselines (Looped TF [7], HRM [8], TRM [9]). Reported large reasoning
+ model results [40] are included as external reference points for benchmark difficulty, rather than
+ as controlled baselines, since their training and inference settings are not directly comparable to
+ task-specific recursive models. For the scaling analysis, all baselines (Looped TF, HRM, TRM) are
+ reproduced under identical settings following Yang et al. [7] and Jolicoeur-Martineau [9].
+ Stochastic Guidance Improves Reasoning. Figure 3 and Table 8 summarize our main results.
+ GRAM consistently outperforms prior recursive models across all benchmarks. We attribute this
+ improvement to the fundamental difference in how reasoning trajectories are utilized. While Looped
+ TF, HRM, and TRM are restricted to learning from a single deterministic path, GRAM leverages
+ stochastic transitions to explore diverse reasoning trajectories. By training on this richer distribution
+ of solution paths, GRAM acquires more robust reasoning capabilities, allowing it to navigate complex
+ problem spaces more effectively than models constrained to a single sequential refinement process.
+ Detailed experimental results, including more state-of-the-art methods, are provided in Appendix D.1.
+ Parallel Sampling Provides a New Test-time Scaling Axis. Figure 4 (left) shows that increasing the
+ number of parallel samples consistently improves performance across all iteration counts. Notably,
+ GRAM with N = 20 samples at 16 iterations outperforms all deterministic baselines at 320 iterations,
+ including TRM (97.0% vs. 90.5%), despite a comparable computational budget. While deterministic
+ recursive models scale only through sequential refinement, GRAM leverages stochastic transitions to
+ explore multiple reasoning paths in parallel. To select the best trajectory, we employ a Latent Process
+ Reward Model (LPRM) that predicts output correctness (Section 2.3). This parallel scaling bypasses
+ the latency bottlenecks of depth-based scaling while achieving superior performance. Additional
+ analysis on the ARC-AGI Challenge is provided in Appendix D.2.
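+ One way to read the "comparable computational budget" comparison above is to count iteration
+ evaluations. The tally below is a rough sketch under our own assumptions: every iteration is taken
+ to cost the same across models, and the cost of scoring candidates with the LPRM is ignored:
+```latex
+\begin{aligned}
+\text{GRAM: } & N \times \text{iterations} = 20 \times 16 = 320 \ \text{iteration evaluations},
+  \quad \text{sequential depth } 16, \\
+\text{TRM: }  & 1 \times 320 = 320 \ \text{iteration evaluations},
+  \quad \text{sequential depth } 320, \\
+& \text{depth ratio} = 320 / 16 = 20 .
+\end{aligned}
+```
+ Under these assumptions the total work matches, while the sequential chain is 20 times shorter.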
+
+ 4.2        Multi-solution Puzzle Tasks
+
+Setup. To evaluate whether GRAM can capture diverse solutions, we test on N-Queens (8 × 8,
+10 × 10) and Graph Coloring (8-vertex, 10-vertex) tasks, where multiple valid solutions exist for each
+input. We compare against direct prediction (Transformer [39]), recursive models (Looped TF [7],
+HRM [8], TRM [9]), and generative models (Autoregressive Transformer (AR), MDLM [41]). For
+N-Queens, we report accuracy (whether the output satisfies all constraints) and coverage (found /
+total valid solutions, with 20 samples). For Graph Coloring, we report conflict edges (number of
+constraint violations; lower is better) instead of accuracy. Detailed configurations are provided in
+Appendix C.2.
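+The coverage metric described above can be written as follows. This is a sketch: S(x) is notation
+introduced here for the set of valid solutions of input x, the numbers in the example are
+hypothetical, and averaging this per-input quantity over the test set is an assumption:
+```latex
+\mathrm{Coverage}(x)
+= \frac{\big|\{\hat{y}^{(1)}, \dots, \hat{y}^{(20)}\} \cap \mathcal{S}(x)\big|}{|\mathcal{S}(x)|}
+\;\le\; \frac{\min\big(20, |\mathcal{S}(x)|\big)}{|\mathcal{S}(x)|},
+\qquad \text{e.g. } \frac{9}{12} = 75\% .
+```
+In the hypothetical example, an input with 12 valid solutions for which the 20 samples contain 9
+distinct valid ones has 75% coverage; a deterministic model that always returns the same output can
+cover at most one of the 12.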
+ Deterministic Recursion Fails on Multi-Solution Tasks. Table 1 reveals that deterministic recursive
+ models structurally cannot capture multiple solutions, with coverage at most 36.1% across all tasks.
+ Figure 4 (right) further illustrates this limitation: as the number of valid solutions increases, all three
+ deterministic recursive baselines exhibit sharp accuracy degradation, whereas GRAM maintains
+ consistent performance regardless of solution count. This confirms that deterministic latent updates
+ cause mode collapse when multiple valid outputs exist for the same input. Additional coverage
+ analysis is provided in Appendix D.3.
+
+Table 1: Evaluation on N-Queens and Graph Coloring benchmarks. Rec. and Gen. indicate whether the
+model uses recursive computation and generative sampling, respectively. Values are mean ± standard deviation
+over runs. Accuracy: single-sample (%). Conflict: constraint-violating edges (↓). Coverage: unique valid
+solutions discovered with 20 samples (%).
+                                               N-Queens 8 × 8          N-Queens 10 × 10        Graph Coloring 8-vertex Graph Coloring 10-vertex
+ Method                  Rec.  Gen.  # Params  Accuracy    Coverage    Accuracy    Coverage    Conflict↓   Coverage    Conflict↓   Coverage
+ Direct Pred (8 layers)  ✗     ✗     27M       40.4±1.1    13.7±1.1    13.6±0.5    1.6±0.2     179.3±4.0   19.9±0.2    198.7±5.0   6.7±0.1
+ Direct Pred (32 layers) ✗     ✗     100M      40.2±1.3    13.6±1.1    13.1±0.4    1.6±0.2     174.0±18.0  19.1±1.7    227.7±34.5  6.5±1.9
+ Looped TF               ✓     ✗     7M        68.4±3.7    23.6±1.9    50.0±7.6    6.2±3.2     136.0±16.1  20.5±1.5    157.3±9.0   7.2±0.7
+ HRM                     ✓     ✗     27M       78.7±2.9    26.7±1.3    37.4±0.3    4.7±0.1     109.7±1.5   21.8±0.3    164.3±21.6  8.9±1.7
+ TRM                     ✓     ✗     7M        66.8±5.7    36.1±22.5   17.5±11.2   2.0±1.3     109.3±3.1   22.3±0.6    170.7±17.9  6.8±0.3
+ AR                      ✗     ✓     10.6M     96.3±1.0    84.8±0.8    90.0±2.2    53.2±0.8    19.0±11.3   83.0±0.7    61.3±8.3    40.0±0.3
+ MDLM                    ✗     ✓     12.6M     96.1±1.5    87.2±0.6    74.3±6.6    47.4±2.2    2.7±0.6     84.5±4.0    12.0±7.0    48.2±1.4
+ GRAM (Ours)             ✓     ✓     10M       99.7±0.3    90.3±1.9    89.7±2.7    57.5±3.4    2.7±2.1     85.8±0.5    3.3±1.5     51.3±2.8
+
+Recursive Refinement Yields Sharper Constraint Satisfaction. While generative models (AR,
+MDLM) achieve high coverage, GRAM attains higher or comparable accuracy with comparable diversity.
+On 8 × 8 N-Queens, GRAM reaches 99.7% accuracy versus 96.3% (AR) and 96.1% (MDLM); on 10 × 10,
+it is on par with AR (89.7% vs. 90.0%) and ahead of MDLM (74.3%). The gap is more pronounced on
+Graph Coloring, where GRAM reduces conflict edges to 2.7 and 3.3 on the 8- and 10-vertex tasks,
+compared to 19.0 and 61.3 for AR. This demonstrates that recursive refinement
+enables stricter constraint satisfaction than generative sampling alone.
+
+4.3   Exploring GRAM as an Unconditional Generator
+
+Setup. To investigate GRAM’s unconditional generative capability beyond conditional reasoning, we
+evaluate generation in two domains: structured constraint generation on Sudoku (from empty boards,
+evaluated by the fraction of generated boards satisfying Sudoku constraints) and image generation on
+binarized MNIST [15], where pixel values are thresholded to 0 or 1 (evaluated by Inception Score (IS) [42]
+and FID [43]). In both cases, the input is replaced by an empty conditioning signal and the model samples
+an output from its learned prior. Baselines include D3PM [44], a discrete diffusion model, on both tasks,
+and additionally a VAE [45] trained with binary reconstruction loss on MNIST. To ensure a fair comparison
+with existing literature, FID is calculated using real samples from the original standard MNIST.
+
+Table 2: Unconditional generation results on binarized MNIST. We report IS (↑) and FID (↓).
+For iterative models, a step corresponds to a supervision step for TRM and GRAM, and a denoising
+step for D3PM. FID is calculated using real samples with original pixel values (0–255).
+    Method                 IS (↑)    FID (↓)
+    VAE                    1.70      86.28
+    D3PM (1000 steps)      1.86      74.03
+    TRM (16 steps)         1.00      303.29
+    GRAM (Ours)
+      8 steps              1.85      84.08
+      16 steps             1.89      77.79
+      32 steps             1.91      76.65
+      64 steps             1.95      75.39
+      128 steps            1.99      74.30
+      256 steps            2.04      73.34
+
+[Figure 5 plot not recoverable from the text extraction.]
+Figure 5: Unconditional Sudoku generation. Validity (%) of generated Sudoku puzzles. GRAM
+achieves higher validity than D3PM with substantially fewer parameters and steps.
+
+Generative Behavior Beyond Reasoning. GRAM extends from conditional reasoning to unconditional
+generation in two different domains. On Sudoku generation (Figure 5), GRAM produces valid boards
+with 99.05% validity using 10.9M parameters and 16 supervision steps, surpassing D3PM baselines
+that use up to 55.1M parameters and 1000 denoising steps. Figure 7 shows qualitative examples,
+illustrating that the model produces diverse, fully valid boards from empty inputs without any
+explicit constraint checker. On MNIST (Table 2), the deterministic baseline TRM exhibits mode
+collapse (FID 303.29), whereas GRAM produces recognizable digits with IS and FID comparable to
+D3PM. Together, these results indicate that GRAM’s stochastic latent
+
+[Figure 6 images. (a) Generation Process: D3PM (top row) at t = 0, 100, 200, ..., 1000; TRM (middle row) and GRAM (bottom row) at t = 0, 1, 2, 4, 6, 7, 9, 11, 12, 14, 16. (b) Samples: Sample 1 to Sample 4 per model.]
+  Figure 6: Visualization of the generation process and samples. (a) The generation process over recursion
+  steps. Each row corresponds to a different model. GRAM (bottom) progressively refines the generated image
+  through recursive latent updates, correcting initial errors. (b) Unconditional generated samples from each model.
+
+  Table 3: Ablation study on Sudoku-Extreme and N-Queens (8 × 8). We evaluate with 5 samples. For (a),
+  Components are added cumulatively to the Looped TF baseline (DS = deep supervision, HR = hierarchical
+  recursion, SG = stochastic guidance). For (b), both stochasticity and learned guidance are essential—removing
+  either significantly degrades performance.
+  (a) Architecture Ablation.
+
+  Model variant               Sudoku           N-Queens
+  base (Looped TF)            61.25            71.30
+  + DS + HR (=HRM, TRM)       55.00 / 87.40    80.70 / 72.90
+  + SG                        65.64            86.30
+  + DS + SG                   73.90            100.00
+  + DS + HR + SG (=GRAM)      93.96            99.69
+
+  (b) Mechanism Ablation.
+
+  Model variant               Sudoku           N-Queens
+  GRAM (ours)                 93.96            99.69
+  w/o stochastic guidance     82.87            72.91
+  stochasticity only          94.88            50.27
+  guide only                  0.00             0.00
+  w/ direct prediction        63.43            61.44
+  TRM w/ stochastic decoder   82.87            71.66
+  TRM w/ random init.         78.53            71.82
+
+
+  transitions support generative modeling beyond symbolic reasoning, with constraint satisfaction
+  emerging as a natural byproduct of the recursive generative process.
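+  As a minimal sketch of how the validity percentage reported for Sudoku generation could be computed,
+  the Python snippet below checks that every row, column, and 3 × 3 box of a completed board contains
+  the digits 1 to 9 exactly once. The function names are hypothetical and this is not the authors'
+  evaluation code; the example grid is the final GRAM board shown in Figure 7.
+
```python
from typing import List, Sequence

Board = List[List[int]]


def parse_board(rows: Sequence[str]) -> Board:
    """Turn nine 9-character digit strings into a 9x9 integer grid."""
    return [[int(ch) for ch in row] for row in rows]


def is_valid_board(board: Board) -> bool:
    """True if every row, column, and 3x3 box holds the digits 1..9 exactly once."""
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in board]
    cols = [{board[r][c] for r in range(9)} for c in range(9)]
    boxes = [
        {board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(group == digits for group in rows + cols + boxes)


def validity_percent(boards: Sequence[Board]) -> float:
    """Percentage of boards that pass the constraint check (0.0 for no boards)."""
    if not boards:
        return 0.0
    return 100.0 * sum(is_valid_board(b) for b in boards) / len(boards)


# Final GRAM board from Figure 7 (iter=16).
GRAM_FINAL = parse_board([
    "716348952", "493527861", "528169734",
    "864215379", "359874126", "172936548",
    "937482615", "645791283", "281653497",
])

# Corrupt one cell to obtain an invalid board for comparison.
BROKEN = [row[:] for row in GRAM_FINAL]
BROKEN[0][0] = BROKEN[0][1]
```
+
+  A checker of this kind is used only for evaluation here; the model itself generates boards without
+  calling it.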
+  Inference-Time Scaling Transfers to Generation. Table 2 further shows that increasing recursion
+  at inference improves generation quality monotonically (IS 1.85 → 2.04, FID 84.08 → 73.34 from 8
+  to 256 steps), even though training uses only 16 steps. This indicates that the iterative-refinement
+  advantage of recursive models carries over into the generative regime. Figure 6 visualizes this process;
+  additional samples are in Section D.4.
+
+  4.4        Ablation Study
+
+  We ablate key design choices of GRAM on Sudoku-Extreme and N-Queens (8 × 8) using 5 samples.
+  Table 3 summarizes the results.
+  Stochastic Guidance Provides Consistent Gains Across Architectures. Table 3a shows that
+  stochastic guidance (SG) improves performance regardless of the underlying architecture: SG alone
+  lifts the flat Looped TF baseline, and combining SG with deep supervision already reaches 100%
+  on N-Queens. The full GRAM (with hierarchical recursion on top) achieves the best results overall
+  (93.96% / 99.69%). While the effect of hierarchical recursion is task-dependent, SG yields consistent
+  gains in every configuration, supporting our design of stochastic guidance as the core extension
+  introduced by GRAM.
+  Both Stochasticity and Guidance Are Essential. We ablate each component by modifying the
+  learned distribution ϵt ∼ N (µθ , σθ² I) in Equation (4). Removing guidance (N (0, σθ² I)) maintains
+  Sudoku performance (94.88%), indicating that stochasticity alone can enable diverse reasoning paths.
+  However, this variant collapses on N-Queens (50.27%), where structured guidance is necessary to
+  navigate multi-solution spaces. Removing stochasticity (N (µθ , 0)) fails completely (0.0% on both
+  tasks), as deterministic guidance conditioned on the target leads to severe overfitting.
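+  The three noise distributions compared in this ablation can be sketched as follows. This is an
+  illustrative NumPy snippet, not the released implementation: the function name, the variant labels,
+  and the treatment of µθ and σθ as plain arrays (rather than network outputs) are assumptions.
+
```python
import numpy as np


def sample_noise(mu, sigma, rng, variant="full"):
    """Draw the transition noise for one ablation variant.

    variant="full":               eps ~ N(mu, sigma^2 I)
    variant="stochasticity_only": eps ~ N(0,  sigma^2 I)  (guidance removed)
    variant="guide_only":         eps = mu                (stochasticity removed)
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    if variant == "guide_only":
        return mu.copy()
    # Reparameterized Gaussian noise with standard deviation sigma.
    noise = sigma * rng.standard_normal(mu.shape)
    if variant == "stochasticity_only":
        return noise
    if variant == "full":
        return mu + noise
    raise ValueError(f"unknown variant: {variant}")
```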
+  Naive Stochasticity Does Not Help TRM. We test two simple approaches to add stochasticity to
+  TRM: (1) stochastic decoding, which samples from the output distribution instead of argmax, and
+
+
+[Figure 7 panels: 9 × 9 boards filling in over time for D3PM at t=0, 125, 250, 375, 500, 625, 750, 875, 1000 (top) and for GRAM at iter=0, 1, 2, 4, 6, 8, 10, 13, 16 (bottom).]
+Figure 7: Qualitative examples of unconditional Sudoku generation by GRAM. Each board is independently
+sampled from an empty grid using the learned prior. GRAM produces diverse, complete boards satisfying all
+row, column, and box constraints, without an explicit constraint checker or search procedure. Incorrect digits are
+highlighted in red.
+
+
+(2) random initialization, which samples z0 from a Gaussian N (0, I) at each inference. Neither
+improves performance, demonstrating that GRAM’s gains stem from the variational framework rather
+than mere randomness.
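+
+For concreteness, the two naive alternatives can be sketched as below. This is a hedged NumPy
+illustration with hypothetical function names; it shows only the sampling operations and is not
+the TRM code used in the experiments.
+
```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def decode_argmax(logits):
    """Deterministic decoding: pick the most likely token at each position."""
    return np.asarray(logits).argmax(axis=-1)


def decode_sampled(logits, rng):
    """(1) Stochastic decoding: sample each token from the output distribution."""
    probs = softmax(logits)
    cdf = probs.cumsum(axis=-1)
    u = rng.random(probs.shape[:-1] + (1,))
    idx = (u > cdf).sum(axis=-1)  # inverse-CDF sampling
    return np.minimum(idx, probs.shape[-1] - 1)  # guard against rounding in the CDF


def random_init(shape, rng):
    """(2) Random initialization: draw z0 ~ N(0, I) afresh at each inference."""
    return rng.standard_normal(shape)
```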
+
+5      Conclusions and Limitations
+We introduced GRAM, a generative framework that transforms deterministic recursive architectures
+into probabilistic generative models capable of modeling both p(y | x) and p(x) via recursive amor-
+tized variational inference. For reasoning problems, introducing stochasticity into latent transitions
+enables diverse solution discovery and improved exploration compared to deterministic counterparts.
+Notably, we demonstrate that GRAM can leverage width-based inference-time scaling as a complement
+to depth: sampling multiple latent trajectories in parallel bypasses the latency bottleneck of
+depth-only scaling. Our ablations further reveal that stochastic guidance is a general-purpose
+extension that consistently improves every recursive architecture we tested, and that the gains stem
+specifically from the variational framework rather than from mere randomness, as naive stochastic
+alternatives applied to existing models yield no improvement.
+Beyond solution-seeking, GRAM also demonstrates potential as an unconditional generative model
+through recursion-based generation over inputs, with generation quality improving monotonically
+with recursive depth even beyond training-time steps. This suggests new directions for generative
+modeling via hierarchical recursion. Despite these strengths, the sequential nature of deep supervision
+limits training efficiency compared to Transformers, posing a significant barrier to scaling GRAM
+toward larger foundation models.
+
+Acknowledgment
+This research was supported by the Brain Pool Plus Program (No. 2021H1D3A2A03103645) and
+the GRDC (Global Research Development Center) Cooperative Hub Program (RS-2024-00436165)
+through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and
+ICT (MSIT). This work was also supported by the Institute of Information & Communications
+Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-
+2024-00509279, Global AI Frontier Lab) and by the NYU-KAIST Global Innovation and Research
+Institute. Minsu Kim acknowledges funding from the KAIST Jang Young Sil Fellow Program. We
+are especially grateful to Gyubin and Seungju for their non-trivial contributions, and we thank the
+members of the MLML for valuable discussions and feedback throughout this project.
+
+Broader Impacts
+GRAM studies probabilistic recursive reasoning for structured reasoning and generation. By maintaining
+multiple latent trajectories, it may benefit tasks such as constraint satisfaction and scientific
+problem solving, where uncertainty and multiple valid solutions are common. It also suggests a way
+to improve reasoning through inference-time computation rather than parameter scaling alone. Its
+generality also entails risks: plausible but invalid generations may be mistaken for verified solutions
+in downstream decision-making pipelines, and multi-sample inference may increase computational
+and energy costs at scale. Since our experiments focus on controlled benchmarks, deployment in
+real-world or high-stakes settings would require rigorous validation, uncertainty calibration, and
+domain-specific safeguards.
+
+References
+ [1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
+     Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
+     Advances in neural information processing systems, 35:24824–24837, 2022.
+
+ [2] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik
+     Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Ad-
+     vances in neural information processing systems, 36:11809–11822, 2023.
+
+ [3] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas
+     Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph
+     of thoughts: Solving elaborate problems with large language models. In Proceedings of the
+     AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024.
+
+ [4] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong
+     Tian. Training large language models to reason in a continuous latent space. arXiv preprint
+     arXiv:2412.06769, 2024.
+
+ [5] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning
+     by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint
+     arXiv:2505.12514, 2025.
+
+ [6] Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh
+     Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and
+     reasoning. arXiv preprint arXiv:2505.23648, 2025.
+
+ [7] Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers
+     are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2023.
+
+ [8] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and
+     Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
+
+ [9] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv
+     preprint arXiv:2510.04871, 2025.
+
+[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
+     versal transformers. arXiv preprint arXiv:1807.03819, 2018.
+[11] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011.
+[12] Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017.
+[13] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
+[14] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
+     agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
+     2025.
+[15] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
+     recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.
+[16] Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of
+     recurrent network trajectories. Neural computation, 2(4):490–501, 1990.
+[17] Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv
+     preprint arXiv:1705.08209, 2017.
+[18] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartold-
+     son, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute
+     with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.
+[19] Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, and Jianfeng Gao. Text generation
+     beyond discrete token sampling. arXiv preprint arXiv:2505.14827, 2025.
+[20] Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong
+     Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous
+     concept space. arXiv preprint arXiv:2505.15778, 2025.
+[21] Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens,
+     hard truths. arXiv preprint arXiv:2509.19170, 2025.
+[22] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi:
+     Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint
+     arXiv:2502.21074, 2025.
+[23] Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu
+     Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv
+     preprint arXiv:2505.18454, 2025.
+[24] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch,
+     Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic
+     recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025.
+[25] Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain-of-
+     thought driven architecture with budget-adaptive computation cost at inference. arXiv preprint
+     arXiv:2310.10845, 2023.
+[26] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster.
+     Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. arXiv preprint
+     arXiv:2410.20672, 2024.
+[27] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
+[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
+     1735–1780, 1997.
+[29] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
+     Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-
+     decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
+[30] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
+     Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv
+     preprint arXiv:1909.11942, 2019.
+[31] Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. arXiv
+     preprint arXiv:1910.10073, 2019.
+[32] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
+     arXiv:1603.08983, 2016.
+[33] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua
+     Bengio. A recurrent latent variable model for sequential data, 2016. URL https://arxiv.
+     org/abs/1506.02216.
+[34] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural
+     models with stochastic layers, 2016. URL https://arxiv.org/abs/1605.07571.
+[35] Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep kalman filters, 2015. URL https:
+     //arxiv.org/abs/1511.05121.
+[36] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and
+     James Davidson. Learning latent dynamics for planning from pixels, 2019. URL https:
+     //arxiv.org/abs/1811.04551.
+[37] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with
+     discrete world models. arXiv preprint arXiv:2010.02193, 2020.
+[38] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains
+     through world models. arXiv preprint arXiv:2301.04104, 2023.
+[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
+     Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
+     processing systems, 30, 2017.
+[40] ARC-Prize-Foundation. ARC-AGI benchmarking: Leaderboard and dataset for the ARC-AGI
+     benchmark. https://arcprize.org/leaderboard, 2026. Accessed: 2026-1-22.
+[41] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu,
+     Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language
+     models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
+[42] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973,
+     2018.
+[43] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
+     Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in
+     neural information processing systems, 30, 2017.
+[44] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg.
+     Structured denoising diffusion models in discrete state-spaces. Advances in neural information
+     processing systems, 34:17981–17993, 2021.
+[45] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
+     arXiv:1312.6114, 2013.
+[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
+     Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
+[47] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
+[48] Simo Ryu. Minimal implementation of a d3pm (structured denoising diffusion models in
+     discrete state-spaces), in pytorch. https://github.com/cloneofsimo/d3pm, 2024.
+[49] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings
+     of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
+[50] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
+     applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
+[51] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
+     convolutional neural networks. Advances in neural information processing systems, 25, 2012.
+[52] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network
+     function approximation in reinforcement learning. Neural networks, 107:3–11, 2018.
+[53] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference
+     on computer vision (ECCV), pages 3–19, 2018.
+[54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint
+     arXiv:1711.05101, 2017.
+[55] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity
+     natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
+[56] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models.
+     Advances in neural information processing systems, 33:12438–12448, 2020.
+[57] P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Mathematica
+     Academiae Scientiarum Hungarica, 12(1):261–267, Mar 1964. ISSN 1588-2632. doi: 10.1007/
+     BF02066689. URL https://doi.org/10.1007/BF02066689.
+[58] Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep
+     learning: Effective graph neural network models for combinatorial problems. In 2019 IEEE
+     31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885.
+     IEEE, 2019.
+[59] Alexey L. Pomerantsev. Principal component analysis (pca). Encyclopedia of Autism Spectrum
+     Disorders, 2014. URL https://api.semanticscholar.org/CorpusID:2534141.
+[60] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Com-
+     mun. ACM, 18:509–517, 1975. URL https://api.semanticscholar.org/CorpusID:
+     13091446.
+
+A     Additional Method Details
+A.1   Adaptive Computation Time
+
+GRAM optionally adopts adaptive computation time (ACT) [8–10] at inference, allowing each
+trajectory to terminate at a learned halting depth rather than running for a fixed number of supervision
+steps. We follow the Q-learning formulation introduced by HRM [8] and adopted in TRM [9].
+Halt head. The decoder includes an auxiliary head q_ψ : R^D → R^2 that maps the high-level state h
+to two scalar values, q_ψ(h) = (q^{halt}, q^{continue}). These are interpreted as estimated Q-values for the
+binary action of halting or continuing computation at the current supervision step.
+Training. The halt head is trained jointly with the main objective via a temporal-difference loss.
+After computing the latent state z_T^{(n)} at the end of supervision step n, we form Q-learning targets:
+
+       • \hat{q}_n^{halt} = 1[\hat{y}^{(n)} = y], indicating whether decoding the current state would yield a correct
+         prediction.
+       • \hat{q}_n^{continue} = \max( q_{n+1}^{halt}, q_{n+1}^{continue} ), the bootstrapped value of running one more
+         supervision step.
+
+The halt head is trained by regression to these targets:
+
+       L_{ACT} = \sum_{n=1}^{N_{sup}} [ ( q_n^{halt} - \hat{q}_n^{halt} )^2 + ( q_n^{continue} - \hat{q}_n^{continue} )^2 ].     (15)
+
+This auxiliary loss is added to the main training objective and contributes only through the halt head;
+it does not propagate gradients into the recursive core.
+Inference. At inference, computation proceeds one supervision step at a time. After each step
+n, we evaluate q_ψ(h^{(n)}) and halt if q_n^{halt} > q_n^{continue}, returning \hat{y}^{(n)} as the prediction. Otherwise,
+computation continues to the next supervision step, up to a maximum budget of N_{sup}^{max} steps. Different
+trajectories sampled in parallel may therefore terminate at different depths, complementing the
+parallel-sampling scheme described in Section 2.3. In practice, we found that using only q halt
+(halting when σ(q halt ) > 0.5, without the continue branch) performs comparably while simplifying
+implementation; our released code uses this variant.
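+
+The targets, regression loss, and halting rule above can be sketched as follows. This is a minimal
+NumPy illustration under stated assumptions, not the released code: the function names are
+hypothetical, the targets are treated as constants, and the predictions are assumed to include one
+extra bootstrap step (index N_sup + 1) so that the continue target is defined at the last step.
+
```python
import numpy as np


def act_loss(q_halt, q_continue, correct):
    """Squared-error ACT loss summed over supervision steps n = 1..N_sup.

    q_halt, q_continue: predicted values, length N_sup + 1 (assumed extra
                        bootstrap step at the end).
    correct:            1 if decoding at step n is correct, else 0; length N_sup.
    """
    q_halt = np.asarray(q_halt, dtype=float)
    q_continue = np.asarray(q_continue, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n_sup = correct.shape[0]
    if q_halt.shape[0] != n_sup + 1 or q_continue.shape[0] != n_sup + 1:
        raise ValueError("predictions must have length N_sup + 1")
    target_halt = correct  # indicator that the current prediction is correct
    target_continue = np.maximum(q_halt[1:], q_continue[1:])  # bootstrapped value
    return float(np.sum((q_halt[:n_sup] - target_halt) ** 2
                        + (q_continue[:n_sup] - target_continue) ** 2))


def halting_step(q_halt, q_continue, max_steps):
    """Return the 1-based step at which inference halts.

    Halts at the first step where q_halt > q_continue, or at max_steps.
    """
    for n, (qh, qc) in enumerate(zip(q_halt, q_continue), start=1):
        if n >= max_steps or qh > qc:
            return n
    return max_steps
```
+
+For the simplified variant mentioned above, the comparison `qh > qc` would be replaced by a
+threshold on σ(q^{halt}) alone.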
+
+A.2   Latent Process Reward Model (LPRM)
+
+To rank or select among sampled candidates, we train a value head vψ (zt ) to predict the expected
+accuracy of the final output, conditioned on the current latent state zt . The LPRM is trained jointly
+with the main objective via a regression loss:
+       L_{LPRM} = \sum_{t=1}^{T} ( v_ψ(z_t) - r )^2 ,     (16)
+
+where r ∈ [0, 1] denotes the accuracy of the final prediction for a given trajectory.
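+
+As a rough sketch, the regression loss and its use for candidate selection might look like the
+following. The function names are hypothetical, and scoring each candidate by a single value-head
+output is an assumption about how the ranking is applied.
+
```python
import numpy as np


def lprm_loss(values, final_accuracy):
    """Sum over t of (v(z_t) - r)^2 for one trajectory.

    values:         value-head predictions v(z_1), ..., v(z_T).
    final_accuracy: r in [0, 1], accuracy of the trajectory's final prediction.
    """
    values = np.asarray(values, dtype=float)
    return float(np.sum((values - final_accuracy) ** 2))


def select_best(candidates, scores):
    """Return the candidate whose predicted value is highest."""
    return candidates[int(np.argmax(scores))]
```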
+
+A.3   Empirical Validation of the Surrogate Objective
+
+We further analyze the approximation introduced by the surrogate training objective LGRAM used in
+Section 2.2, both qualitatively and empirically.
+Truncation as a gradient approximation. We frame LGRAM as a gradient approximation rather than
+a separate variational objective. The full trajectory-level ELBO (Equation (13)) involves a sum of KL
+terms across all TTotal transitions, and computing its exact gradient requires backpropagation through
+the entire trajectory. To enable training with constant memory, we propagate gradients only through
+the final transition of each supervision step. This is a standard practice in recurrent latent variable
+models with long computation chains: ELBOs over truncated sequences are used, for example, in
+VRNN [33] and SRNN [34], while truncated latent imagination is used in Dreamer-family world
+models [37, 38]. Trading a small gradient bias for training stability via local truncation is therefore
+well-precedented; what is specific to GRAM is applying this approximation at the level of recursive
+reasoning trajectories rather than temporal sequences.
+Empirical validation. To verify that optimizing LGRAM effectively drives improvement in the full
+variational bound, we compute both quantities on the validation set throughout training. The full
+ELBO LELBO is evaluated as in Equation (13), summing the reconstruction term and KL contributions
+across all TTotal transitions; the surrogate objective is evaluated as the average of L_{GRAM}^{(n)} over the
+Nsup supervision steps. Figure 8 reports the results on Sudoku-Extreme and N-Queens 8 × 8.
+
+[Figure 8 plots: Sudoku-Extreme (left) and N-Queens 8 × 8 (right). x-axis: Training Step; y-axis: ELBO; curves: GRAM (training objective) and ELBO (full).]
+Figure 8: Full ELBO LELBO and surrogate objective LGRAM throughout training (plotted as −ELBO,
+smaller is better). On both Sudoku-Extreme (left) and N-Queens 8 × 8 (right), both quantities decrease
+monotonically over training, indicating that gradient updates of LGRAM consistently improve the full variational
+bound. The two curves do not coincide because LELBO sums KL contributions across all TTotal transitions
+while LGRAM evaluates only the final-step KL of each supervision step; their gap reflects the cumulative KL
+across earlier transitions, not a failure of optimization. The N-Queens plot uses a log scale on the y-axis due to
+the large dynamic range.
+
+Both LELBO and LGRAM improve monotonically throughout training on both tasks. This indicates
+that, despite the truncation, gradient updates of LGRAM effectively drive improvement in the full
+variational bound. Since LELBO also serves as an indirect estimate of the negative log-likelihood,
+its consistent improvement provides evidence that GRAM optimizes a well-defined data likelihood,
+even though training relies on the surrogate.
+The gap between the two curves in Figure 8 reflects the structural difference between the two
+quantities — LELBO accumulates KL terms across all transitions while LGRAM evaluates only the
+final-step KL of each supervision step — rather than an optimization failure. This gap is consistent
+with LGRAM being a biased but useful surrogate for LELBO .
+
+B     Training and Architecture Details
+B.1    Architecture Details
+
+GRAM consists of three components: Encoder, Recursive Core, and Decoder.
+Encoder. Input tokens are mapped to embeddings via a token embedding layer, optionally concatenated
+with puzzle embeddings (for ARC [13, 14]), and combined with positional encodings
+(RoPE) [46]. The embeddings are scaled by √D and prepended with 16 puzzle embedding tokens [8].
+Recursive Core. The core maintains two latent states: h (high-level) and l (low-level). For each outer
+step, the low-level state is refined K times via l ← fL (l, h + ex ), injecting the input embedding at
+each iteration. The high-level state is then updated via h ← fH (h, l). Both fL and fH share the same
+architecture: a stack of attention and SwiGLU [47] MLP layers. As an exception, for Sudoku tasks we
+use a [SwiGLU + SwiGLU] network for the Recursive Core module instead of [Attention + SwiGLU],
+following [9]. To initialize z0 = (h0 , l0 ), we sample once from the standard Gaussian
+distribution N (0, I) and store the sample in the network checkpoint, so z0 takes the same fixed
+value whenever the checkpoint is loaded.
+Decoder. The decoder extracts content tokens from h (excluding puzzle embedding positions)
+and maps them to logits via a SwiGLU MLP head. An auxiliary head predicts halt decisions and
+correctness values from the first token of h.
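+
+The nested update of the Recursive Core described above can be sketched as a short loop. This is a
+simplified, deterministic illustration: `f_low` and `f_high` are hypothetical stand-ins for fL and
+fH, and the stochastic guidance terms are omitted.
+
```python
import numpy as np


def recursive_core(h, l, e_x, f_low, f_high, k_low, t_high):
    """Run t_high outer steps, each with k_low low-level refinements.

    h, l:   high-level and low-level latent states.
    e_x:    input embedding, injected at every low-level iteration.
    f_low:  callable (l, context) -> new l   (stand-in for f_L).
    f_high: callable (h, l) -> new h         (stand-in for f_H).
    """
    for _ in range(t_high):
        for _ in range(k_low):
            l = f_low(l, h + e_x)  # l <- f_L(l, h + e_x)
        h = f_high(h, l)           # h <- f_H(h, l)
    return h, l


# Toy usage with averaging maps in place of the learned networks.
calls = {"low": 0, "high": 0}


def toy_low(l, context):
    calls["low"] += 1
    return 0.5 * (l + context)


def toy_high(h, l):
    calls["high"] += 1
    return 0.5 * (h + l)


h_out, l_out = recursive_core(np.zeros(4), np.zeros(4), np.ones(4),
                              toy_low, toy_high, k_low=3, t_high=2)
```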
+
+                                       Table 4: Architecture components.
+                  Component         Module                   Description
+                  Encoder           Token Embedding          vocab → D
+                                    Puzzle Embedding         16 tokens (optional, for ARC)
+                                    Position Encoding        RoPE or learned
+                  Recursive Core    fL , fH                  [Attention + SwiGLU] × 2 layers
+                                    Iterations               K low-level, T high-level steps
+                                    µθ , σθ , µϕ , σϕ        SwiGLU MLP for each parameter
+                  Decoder           LM Head                  Linear(D → vocab)
+                                    Q Head                   Linear(D → 2) for halt
+                                    V Head                   Linear(D → 1) for value
+
+
+
+Encoder and Decoder for Image Patches. In the MNIST [15] image generation task, we first
+construct a binarized dataset by normalizing the original discrete pixel values (0 ∼ 255) to the
+continuous range [0, 1] and applying a threshold at 0.5. For the network architecture, we employ a
+convolutional patch encoder, following [48, 49].
+The encoding process proceeds in three stages. First, the discrete input tokens x ∈ {0, 1} are
+normalized to the range [−1, 1]. Second, to capture local spatial dependencies before patchification,
+the normalized image passes through a shallow convolutional encoder. This encoder consists of
+two stacked blocks, where each block comprises a 2D convolution [50, 51] with a 5 × 5 kernel and
+padding 2, a SiLU non-linearity [52], and Group Normalization (GN) [53]. Finally, the resulting
+feature map is divided into non-overlapping patches of size P × P and linearly projected to match
+the model’s hidden dimension D. The detailed architectural specifications and dimension transitions
+are summarized in Table 5.
+
+Table 5: Detailed architecture of the Image Patch Encoder for MNIST. H, W denote image resolution, C input
+channels, P patch size, Np the number of patches, and D the hidden dimension.
+                        Stage           Layer / Operation                       Output Dim.
+                        1. Norm.        Input Tokens                            (B, C, H, W)
+                                        Linear Scaling [−1, 1]                  (B, C, H, W)
+                        2. Conv         Conv2d 5 × 5 (p = 2) → SiLU → GN(32)    (B, D/2, H, W)
+                                        Conv2d 5 × 5 (p = 2) → SiLU → GN(32)    (B, D/2, H, W)
+                        3. Patch        Flatten Patches                         (B, Np, P² · D/2)
+                                        Linear Projection                       (B, Np, D)
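The dimension transitions in Table 5 can be traced with a small shape calculator. This is a sketch under stated assumptions: the convolutions are taken to have stride 1, and the example uses 28 × 28 MNIST images with P = 2, a value inferred here from the 14 × 14 patch grid rather than stated in the text. The function name is hypothetical.

```python
def patch_encoder_shapes(B, C, H, W, P, D, kernel=5, pad=2):
    """Trace tensor shapes through the three encoder stages (no computation)."""
    if H % P or W % P:
        raise ValueError("image size must be divisible by the patch size")
    if D % 2:
        raise ValueError("D must be even, since the conv blocks use D/2 channels")
    # A stride-1 convolution with a 5x5 kernel and padding 2 keeps H and W.
    conv_h = H + 2 * pad - kernel + 1
    conv_w = W + 2 * pad - kernel + 1
    n_patches = (conv_h // P) * (conv_w // P)
    return [
        ("input tokens", (B, C, H, W)),
        ("linear scaling", (B, C, H, W)),
        ("conv block 1", (B, D // 2, conv_h, conv_w)),
        ("conv block 2", (B, D // 2, conv_h, conv_w)),
        ("flatten patches", (B, n_patches, P * P * D // 2)),
        ("linear projection", (B, n_patches, D)),
    ]

shapes = patch_encoder_shapes(B=1, C=1, H=28, W=28, P=2, D=512)
for stage, shape in shapes:
    print(f"{stage:18s} {shape}")
```

With these values the flattened patches have 196 positions, matching the MNIST sequence length listed in Table 6.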
+
+
+Hyperparameters. Following Wang et al. [8], Jolicoeur-Martineau [9], both the input and output are
+represented as sequences of shape [B, L], where B denotes the batch size and L the context length.
+Each input sequence includes 16 fixed puzzle embedding tokens. The latent states ht and lt , as well
+as the decoder output, have shape [B, L, D], with embedding dimension D. The Transformer [39]
+backbone uses embedding dimension D = 512, attention heads Nhead = 8, and FFN hidden dimension
+Dh = 512. Within a recursion step, meaning a latent transition zt → zt+1 , we use low-level (inner)
+steps K = 6 for Sudoku [8] and K = 4 for all other tasks, with high-level (outer) steps T = 3.
+
+B.2   Training Details
+
+Task Configuration. All tasks represent inputs and outputs as discrete token sequences (summarized
+in Table 6).
+
+        • For Sudoku [8], the 9×9 grid is flattened row-by-row into 81 tokens with vocabulary size 11
+          (0=pad, 1=blank, 2–10=digits).
+        • For ARC-AGI [13, 14], variable-size grids are padded to a fixed 30×30 canvas with EOS
+          markers, yielding 900 tokens and vocabulary size 12 (0=pad, 1=eos, 2–11=colors); task-
+          specific puzzle embeddings are prepended to distinguish different ARC tasks.
+        • N-Queens flattens an N × N board row-by-row into N² tokens with vocabulary size 3
+          (0=pad, 1=empty, 2=queen).
+        • Graph Coloring encodes the strict upper triangle of the adjacency matrix as n(n−1)/2 tokens,
+          using 0=PAD, 1=no-edge, and 2=edge for inputs and 3 + color_id for output colors.
+        • For image generation on MNIST [15], images are quantized and processed via CNN-based
+          patchification [50, 49], with the encoder applying patchify and the decoder unpatchify. The
+          patched input then forms a flattened sequence of 14 × 14 = 196 tokens with vocabulary size 3
+          (0=pad, 1=black, 2=white).
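The token encodings listed above for N-Queens and Graph Coloring can be sketched as follows. The function names are hypothetical, and the row-by-row traversal of the strict upper triangle is an assumption; the text only states which entries are kept.

```python
PAD, NO_EDGE, EDGE = 0, 1, 2   # graph-coloring input vocabulary
EMPTY, QUEEN = 1, 2            # N-Queens vocabulary (0 is padding)
COLOR_OFFSET = 3               # output token = 3 + color_id

def encode_queens(board):
    """Flatten an N x N 0/1 board row-by-row into N^2 tokens."""
    return [QUEEN if cell else EMPTY for row in board for cell in row]

def encode_graph(adj):
    """Encode the strict upper triangle of an adjacency matrix, n(n-1)/2 tokens."""
    n = len(adj)
    return [EDGE if adj[i][j] else NO_EDGE
            for i in range(n) for j in range(i + 1, n)]

def encode_colors(colors):
    """Map per-node color ids (0, 1, 2) to output tokens (3, 4, 5)."""
    return [COLOR_OFFSET + c for c in colors]

triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
graph_tokens = encode_graph(triangle)      # three edges
color_tokens = encode_colors([0, 1, 2])
queen_tokens = encode_queens([[0, 1], [0, 0]])
```

For an 8-node graph this gives 28 input tokens and for a 10-node graph 45, consistent with the n(n−1)/2 sequence length.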
+
+
+                                  Table 6: Task-specific configurations.
+           Task             Seq. Len    Vocab     Puzzle Emb     Encoding
+           Sudoku           81          11        ✗              9×9 grid, row-major
+           ARC-AGI          900         12        ✓              30×30 padded canvas
+           N-Queens         N²          3         ✗              N × N board
+           Graph Coloring   n(n−1)/2    6         ✗              Strict adjacency upper triangle
+           MNIST            196         3         ✗              14×14 patches
+
+Training Details. We train all models using AdamW [54] with learning rate 10⁻⁴, weight decay
+1.0, and gradient clipping at 1.0. The global batch size is 768. For stability, we apply exponential
+moving average (EMA) with decay 0.9999, following Brock et al. [55] and Song and Ermon [56].
+To prevent posterior collapse, we use a KL balance [37, 38] coefficient of 0.8. The number of deep
+supervision steps is Nsup = 16 for all tasks. The KL coefficient β is set to 0.1 (Sudoku), 0.04/0.1
+(ARC-AGI-1/2), 0.07/0.045 (N-Queens 8 × 8/10 × 10), 0.5/0.45 (Graph Coloring with 8/10 nodes),
+and 0.07 (MNIST). Task-specific training configurations are summarized in Table 7.
+
+                      Table 7: Training configurations on NVIDIA RTX 4090 GPUs.
+                         Task                             Epochs   GPUs     Time
+                         Sudoku                            50K      8        2h
+                         ARC-AGI                          200K      8      5 days
+                         N-Queens (8×8)                     3K      8        1h
+                         N-Queens (10×10)                   1K      8        3h
+                         Graph Coloring (8 nodes)           5K      8       1.5h
+                         Graph Coloring (10 nodes)          5K      8        6h
+                         MNIST                            1.8K      8       16h
+
+
+
+C     Additional Details of Experiment Setup
+C.1     Challenging Puzzle Tasks
+
+C.1.1    Looped TF on ARC-AGI
+We report Looped Transformer [7] results on Sudoku-Extreme but omit them on ARC-AGI due to
+prohibitive training cost. Under the same setup used for our other recursive baselines (200K epochs,
+batch size 768, on 8× NVIDIA RTX Pro 6000 GPUs), training Looped TF on Sudoku-Extreme
+already takes 19 hours, and extrapolating to ARC-AGI — which uses substantially longer sequences
+and a larger training set — suggests approximately 97 days (≈ 776 GPU-days) for a full training run.
+This gap stems from two compounding factors. First, Looped TF lacks deep supervision: HRM,
+TRM, and GRAM perform Nsup gradient updates per trajectory (one per segment), whereas Looped
+TF performs only one update at the end of the full trajectory, slowing convergence. Second, Looped
+TF lacks adaptive halting such as ACT [8–10], so every input must be processed for the maximum
+recursion depth, increasing per-example sequential compute. Both inefficiencies compound at
+ARC-AGI scale, making a full Looped TF training run impractical.
+
+C.2     Multi-solution Puzzle Tasks
+
+C.2.1    N-Queens Problem
+
+[Figure 9 panels: Input, Solution 1, Solution 2, Solution 3]
+Figure 9: Example of an 8 × 8 N-Queens puzzle instance. In this example, 5 queens are removed from the
+full board, leaving 3 queens. The model must find positions for the 5 removed queens. This configuration
+admits exactly 3 valid solutions.
+
+Data Generation Details. The N-Queens problem requires placing N queens on an N × N
+chessboard such that no two queens attack each other—meaning no queens share the same row,
+column, or diagonal. Figure 9 illustrates an example where 5 queens are removed from an 8 × 8
+solution, resulting in a puzzle with 3 distinct valid completions.
+To construct the dataset, we first generated all valid complete N-Queens solutions for N = 8 and
+N = 10. We then created puzzle instances by removing a specific number of queens, treating the
+remaining partial configuration as the input and the original complete board as the target label. To
+generate instances yielding diverse valid completions, we removed k ∈ {5, 6, 7} queens for the 8 × 8
+setting and k ∈ {7, 8, 9} queens for the 10 × 10 setting. The distribution of solution counts for our
+generated dataset is shown in Figure 10.
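The generation procedure above can be sketched as follows: enumerate all complete solutions, blank k queens, and count the complete solutions consistent with what remains. This is an illustrative sketch, not the authors' pipeline; the function names are hypothetical, and a board is represented as a tuple giving each row's queen column.

```python
def all_solutions(n):
    """Enumerate every N-Queens solution; cols[r] is the queen's column in row r."""
    found = []

    def place(cols):
        row = len(cols)
        if row == n:
            found.append(tuple(cols))
            return
        for c in range(n):
            # Reject a shared column or a shared diagonal with any earlier queen.
            if all(c != pc and abs(c - pc) != row - pr
                   for pr, pc in enumerate(cols)):
                place(cols + [c])

    place([])
    return found

def make_puzzle(solution, removed_rows):
    """Remove the queens in removed_rows; None marks a row left for the model."""
    return tuple(None if r in removed_rows else c
                 for r, c in enumerate(solution))

def completions(puzzle, solutions):
    """All complete solutions that agree with the queens still on the board."""
    return [s for s in solutions
            if all(p is None or p == c for p, c in zip(puzzle, s))]

solutions_8 = all_solutions(8)
puzzle = make_puzzle(solutions_8[0], removed_rows={0, 1, 2, 3, 4})  # k = 5
valid = completions(puzzle, solutions_8)
print(len(solutions_8), len(valid))
```

The length of `valid` is the solution count that the histogram in Figure 10 aggregates over instances; the original board is always among the completions.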
+For evaluation, we employed an 85:15 train-test split. Crucially, to prevent data leakage and ensure
+the model learns to reason rather than memorize, the split was performed based on unique input
+configurations. This guarantees that no input pattern in the test set appears in the training set. Inputs
+are flattened into discrete 1D sequences x ∈ {0, 1, 2}^L, where L = N², along with zero-padded
+puzzle embedding tokens. Vocabulary mapping follows: padding (0), empty (1), and queen (2).
+
+[Figure 10 panels: 8x8 N-Queens; 10x10 N-Queens. Axes: Number of solutions (x), Counts (y)]
+Figure 10: Distribution of the number of valid solutions for generated N-Queens instances. The dataset
+covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+
+C.2.2    Graph Coloring Problem
+Data Generation Details. The Graph Coloring problem requires assigning one of k colors to each
+node in a graph such that no two adjacent nodes share the same color. We consider graphs with
+N ∈ {8, 10} nodes and use k = 3 colors. Figure 11 illustrates an example instance with N = 8
+nodes and k = 3 colors.
+Graphs are generated using the Erdős–Rényi random graph model [57], following the generation
+pipeline from GNN-GCP [58]. Specifically, for each instance, edges are sampled independently
+with a fixed probability p, producing a symmetric adjacency matrix. We retain only graphs that are
+3-colorable.
+For each graph, we enumerate all valid 3-colorings and retain only canonical forms to eliminate
+redundant solutions under color permutation (e.g., swapping red and blue). This yields a set of
+structurally distinct solutions per input. The distribution of solution counts is shown in Figure 12.
+The final dataset consists of 7,002 training and 255 test instances for N = 8, and 13,465 training and
+192 test instances for N = 10.
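Enumerating colorings and removing color-permutation duplicates can be sketched as follows. The canonical form used here, relabeling colors in order of first appearance, is one common choice and an assumption; the text does not specify which canonicalization the dataset uses. Function names are hypothetical.

```python
from itertools import product

def valid_colorings(n, edges, k=3):
    """Brute-force all k-colorings of n nodes in which no edge is monochromatic."""
    return [c for c in product(range(k), repeat=n)
            if all(c[u] != c[v] for u, v in edges)]

def canonical(coloring):
    """Relabel colors by order of first appearance, e.g. (2, 0, 2) -> (0, 1, 0)."""
    relabel = {}
    return tuple(relabel.setdefault(c, len(relabel)) for c in coloring)

def canonical_solutions(n, edges, k=3):
    """Distinct solutions up to a permutation of the color names."""
    return sorted({canonical(c) for c in valid_colorings(n, edges, k)})

triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2)]
print(len(valid_colorings(3, triangle)), canonical_solutions(3, triangle))
print(len(valid_colorings(3, path)), canonical_solutions(3, path))
```

A graph is 3-colorable exactly when `valid_colorings` is non-empty, which corresponds to the filtering step described above. Brute force over 3^n assignments is cheap for n = 8 or 10.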
+Input and Output Representation. The input graph is represented by extracting the upper triangular
+portion of the adjacency matrix (excluding the diagonal) and flattening it into a 1D sequence. The
+output is a sequence of length N , where each position encodes the assigned color for the corresponding
+node. Vocabulary mapping is as follows: PAD (0), no edge (1), edge (2), and colors (3, 4, 5) for red,
+blue, and green respectively.
+
+[Figure 11 panels: Input, Solution 1, Solution 2, Solution 3, Solution 4]
+Figure 11: Graph Coloring Example
+
+
+[Figure 12 panels: Vertex 8 Graph Coloring; Vertex 10 Graph Coloring. Axes: Number of solutions (x), Counts (y)]
+Figure 12: Distribution of the number of valid solutions for generated graph coloring instances. The
+dataset covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+
+D     Additional Experiment Results
+D.1   Additional Results on Challenging Puzzle Benchmarks
+
+Table 8 reports test accuracy on three challenging puzzle benchmarks. Here we provide additional
+observations complementing the main text.
+
+GRAM Advances the Recursive-Reasoning Line. Across all three benchmarks, GRAM consis-
+tently outperforms prior recursive baselines (Looped TF, HRM, TRM) while using fewer parameters
+than HRM (10M vs. 27M). The complete failure of direct prediction on Sudoku and ARC-AGI-2 (0%
+in both cases) further confirms that recursive computation is essential for these tasks — single-pass
+models, regardless of capacity, cannot solve them. Together, these results indicate that GRAM’s gains
+arise from how recursive computation is organized (probabilistic, multi-trajectory) rather than from
+increased model capacity.
+
+Sudoku-Extreme Resists Parameter Scaling. All tested large reasoning models (LRMs), including
+Deepseek-R1 (671B), score 0% on Sudoku-Extreme. This suggests that pretrained capacity alone
+does not transfer to constraint-propagation reasoning, and that benchmarks like Sudoku-Extreme
+probe a fundamentally different axis from those captured by general-purpose LRMs. On ARC-AGI,
+more recent LRMs such as Gemini 3 Pro (75.0% on ARC-1, 31.1% on ARC-2) remain substantially
+
+Table 8: Test accuracy (%) on Challenging Puzzle Benchmarks. GRAM significantly outperforms prior
+recursive models. All recursive model scores were obtained at 16 supervision steps.
+                         Method                   #Params Sudoku ARC-1 ARC-2
+                         Large Reasoning Models
+                           Deepseek-R1             671B    0.0    15.8    1.3
+                           Claude 3.7 16k          N/A     0.0    28.6    0.7
+                           o3-mini-high            N/A     0.0    34.5    3.0
+                           GPT 5.2 (low)           N/A      –     55.7    9.7
+                           Grok-4-thinking         1.7T     –     66.7   16.0
+                           Gemini 3 Pro            N/A      –     75.0   31.1
+                         Recursive Models
+                           Direct Pred             27M      0.0   21.0   0.0
+                           Looped TF                7M     61.3    –      –
+                           HRM                     27M     55.0   40.3   5.0
+                           TRM                      7M     87.4   44.6   7.8
+                           GRAM (Ours)             10M     97.0   52.0   11.1
+                         Human Results
+                           Avg. Human                –      –     60.2     –
+                           Best Human                –      –     98.0   100.0
+
+
+
+ahead of all recursive models, highlighting that abstract few-shot reasoning still benefits from scale;
+we view these numbers as benchmark-difficulty reference points rather than controlled baselines.
+
+D.2   Scales with Parallel Sampling on ARC-AGI Challenge
+
+To investigate the effect of GRAM’s sampling on the ARC-AGI-1 benchmark, we measured perfor-
+mance without relying on external data augmentation. Typically, TRM achieves its reported accuracy
+by generating 1,000 augmentations for a single problem and performing majority voting over the
+results. Because this augmentation process itself creates a wide variety of samples, we isolated
+the specific effect of generative sampling by performing inference solely on the original problem
+instance and conducting majority voting over multiple sampled paths. For a fair comparison, TRM
+was evaluated using the same hyperparameters as GRAM, including the number of epochs, learning
+rate, and the number of layers.
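Majority voting over sampled paths, as used in this evaluation, can be sketched as follows. This is a minimal illustration: each prediction is assumed to be a hashable token sequence, and ties are broken in favor of the answer sampled first, a detail the text does not specify.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent prediction among N sampled outputs."""
    if not predictions:
        raise ValueError("need at least one sampled prediction")
    counts = Counter(predictions)
    # most_common orders equal counts by first occurrence.
    return counts.most_common(1)[0][0]

# Three sampled outputs for one problem instance (toy token tuples).
samples = [(1, 2, 2), (1, 3, 2), (1, 2, 2)]
answer = majority_vote(samples)
```

A deterministic model returns the same output on every pass, so voting over repeated passes on the original instance cannot change its answer; this is why TRM relies on augmented inputs for its votes.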
+As illustrated in Figure 13, removing augmentations causes a performance decline for both GRAM
+and TRM compared to the values reported in Table 8. However, in the case of GRAM, we observe
+that accuracy consistently improves as the model generates more parallel samples. This trend mirrors
+observations in Section 4.2, suggesting that increased inference-time compute through width scaling
+allows the model to explore more plausible reasoning trajectories and recover from initial errors,
+eventually leading to more robust solution discovery.
+
+Interaction between Augmentation and Sampling. A natural question arises: why not combine
+higher levels of augmentation with extensive parallel sampling? To address this, we conducted an
+ablation study examining the interaction between data augmentation and inference-time sampling.
+Figure 14 presents the results across varying augmentation levels (Aug=0 to Aug=50). Without
+augmentation (Aug=0), increasing the number of samples yields consistent accuracy improvements,
+demonstrating that stochastic sampling effectively explores diverse reasoning trajectories. However,
+as the level of augmentation increases, the marginal benefit of additional sampling diminishes
+substantially. At Aug=50, performance saturates regardless of sample count—accuracy remains
+nearly constant whether we draw 1 or 50 samples. This observation reveals that augmentation
+and sampling serve complementary rather than additive roles: both mechanisms enable the model
+to capture solution diversity, but through different means. When training data is limited, parallel
+sampling compensates by exploring varied reasoning paths at inference time. When training data
+is abundant through augmentation, the model has already internalized sufficient diversity during
+training, rendering additional inference-time exploration redundant. Consequently, scaling sampling
+beyond augmentation provides diminishing returns, justifying our experimental design choice to
+evaluate these two scaling axes separately.
+
+[Figure 13 plot: Accuracy (y) vs. Number of Samples (N) (x), N from 1 to 500; curves: TRM, GRAM (β=0.1), GRAM (β=0.05), GRAM (β=0.04)]
+Figure 13: Effect of sampling on ARC-AGI-1 without data augmentation. To isolate the internal sampling
+effect, both models are evaluated on original problem instances without 1,000 augmentations. While removing
+augmentations causes an initial performance drop, GRAM exhibits robust scaling through generative sampling
+as the number of parallel samples N increases, outperforming the TRM baseline.
+
+
+[Figure 14 plot: Accuracy (y) vs. Number of Samples (N) (x), N from 1 to 500; curves: Aug=0, Aug=5, Aug=10, Aug=50]
+Figure 14: Effect of augmentation on sampling efficiency. With limited augmentation (Aug=0), parallel
+sampling provides consistent gains. As augmentation increases, sampling benefits diminish; at Aug=50,
+performance saturates regardless of sample count, suggesting augmentation and sampling serve complementary
+roles in capturing solution diversity.
+
+
+
+D.3   Solution Coverage Analysis
+
+We analyze the ability of GRAM to capture the diversity of the solution space compared to determin-
+istic baselines. Figure 15 presents the solution coverage on 8 × 8 and 10 × 10 N-Queens tasks with
+respect to the total number of valid ground-truth solutions.
+As shown in Figure 15, deterministic recursive models (HRM and TRM) exhibit a sharp decline in
+coverage as the number of possible solutions increases. Since these models are constrained to a single
+fixed reasoning trajectory, they structurally fail to explore alternative paths, resulting in severe mode
+collapse in multi-solution landscapes.
+In contrast, GRAM effectively leverages its generative latent transitions to cover a broader range
+of solutions. As the number of parallel samples N increases (from 1 to 20), the solution coverage
+improves monotonically across both 8 × 8 and 10 × 10 settings. This empirical evidence confirms
+that GRAM’s stochastic guidance mechanism is essential for navigating complex problem spaces
+where multiple valid reasoning paths exist.
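One plausible reading of the coverage metric is the fraction of ground-truth solutions that appear at least once among the N sampled outputs. The sketch below assumes that definition, which the text does not spell out; the function name is hypothetical.

```python
def solution_coverage(samples, ground_truth):
    """Fraction of ground-truth solutions recovered by at least one sample."""
    truth = set(ground_truth)
    if not truth:
        raise ValueError("instance has no ground-truth solutions")
    # Invalid samples and duplicates do not add coverage.
    return len(set(samples) & truth) / len(truth)

truth = [("a",), ("b",), ("c",)]
one_sample = solution_coverage([("a",)], truth)
four_samples = solution_coverage([("a",), ("a",), ("b",), ("x",)], truth)
```

Under this definition a deterministic model, which always emits the same output, covers at most one solution out of M, so its coverage is bounded by 1/M. Adding samples can never decrease the value.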
+
+
+D.4   Additional Generated Image Samples
+
+In this section, we provide further qualitative results demonstrating GRAM’s capability in uncon-
+ditional image generation. Figure 16 presents a diverse set of samples generated on the binarized
+MNIST dataset, visualized across the recursive inference steps t = 0 to t = 16.
+
+[Figure 15 panels: (a) N-Queens 8 × 8; (b) N-Queens 10 × 10. Axes: Number of Solutions (x), Coverage (y, 0.0 to 1.0); curves: HRM, TRM, GRAM (N=1), GRAM (N=5), GRAM (N=10), GRAM (N=20)]
+
+ Figure 15: Solution coverage analysis on N-Queens (8 × 8 and 10 × 10) with respect to the number of
+ ground-truth solutions. While deterministic baselines (HRM, TRM) suffer from mode collapse as the solution
+ space grows, GRAM demonstrates monotonic improvement in coverage as the number of parallel samples N
+ increases.
+
+
+
+ As observed in the main text, GRAM exhibits a distinct progressive refinement behavior. Starting
+ from a black initialization, the model iteratively adds details and sharpens the structure of the digit.
+ A particularly compelling property of this process is the model’s ability to recover from initially
+ ambiguous or incorrect formations.
+ For instance, in the second row (generating the digit ’2’) and the last row (generating the digit ’1’),
+ the early predictions at t = 1 and t = 2 manifest as disjointed artifacts or incorrect shapes. However,
+ as the recursion proceeds, GRAM effectively leverages its feedback loop to correct these initial errors,
+ resolving the ambiguity and converging to a coherent, high-quality digit by t = 16.
+
+[Figure 16 columns: t=0, t=1, t=2, t=4, t=6, t=7, t=9, t=11, t=12, t=14, t=16]
+ Figure 16: Additional generated samples from GRAM. We provide 8 additional samples generated uncondi-
+ tionally on binarized MNIST using GRAM. Each row represents a single generated sample, visualized across its
+ recursive refinement process.
+
+D.5   Additional Experiment Results on Unconditional Sudoku Generation
+
+In this section, we provide additional details on unconditional Sudoku generation. Unlike the
+conditional Sudoku-solving setting, where the input board contains given clues, the model receives an
+entirely blank board and samples a complete 9 × 9 Sudoku board from its learned prior. We evaluate
+each generated board using the standard Sudoku validity criterion: every row, column, and 3 × 3 box
+must contain the digits 1 through 9 exactly once. We report the validity rate over 100K generated
+boards. To check whether high validity comes from repeatedly producing the same board, we also
+compute the fraction of unique boards among valid samples.
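The validity criterion and the uniqueness check described above can be sketched as follows. This is an illustrative implementation, not the authors' evaluation code; the function names are hypothetical, and the test boards are built from a standard shifting pattern rather than model samples.

```python
def is_valid_sudoku(board):
    """True if every row, column, and 3x3 box holds the digits 1..9 exactly once."""
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in board]
    cols = [set(col) for col in zip(*board)]
    boxes = [{board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
             for br in (0, 3, 6) for bc in (0, 3, 6)]
    # Each group has 9 cells, so matching the digit set implies no repeats.
    return all(group == digits for group in rows + cols + boxes)

def validity_and_uniqueness(boards):
    """Validity rate over all boards, and unique fraction among the valid ones."""
    valid = [tuple(map(tuple, b)) for b in boards if is_valid_sudoku(b)]
    validity = len(valid) / len(boards)
    unique = len(set(valid)) / len(valid) if valid else 0.0
    return validity, unique

# A known-valid board from the shifting pattern, and a relabeled copy of it.
base = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
shifted = [[v % 9 + 1 for v in row] for row in base]
broken = [row[:] for row in base]
broken[0][0] = broken[0][1]            # duplicate a digit in the first row

validity, unique = validity_and_uniqueness([base, base, shifted, broken])
```

Exact board matching is used for uniqueness, in line with the exact-match statement made for Table 9.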
+For GRAM, we construct the unconditional training set from Sudoku-Extreme [8], the Sudoku
+benchmark used by HRM and TRM. We sample 50K complete solutions from the original training
+split, discard the clue patterns, and use an all-blank board as input with the complete solution as
+the target. No data augmentation is used. We train GRAM on this derived 50K-solution set for 200
+epochs with learning rate 10⁻⁴, EMA decay 0.999, and KL coefficient 0.05. The resulting model
+contains 10.9M parameters and uses 16 inference steps.
+For D3PM baselines, we use a DiT-style Transformer backbone and evaluate two model sizes. D3PM-
+Big uses hidden dimension 768, 5 Transformer blocks, and 12 attention heads, yielding 55.1M
+parameters, while D3PM-Small uses hidden dimension 512, 3 Transformer blocks, and 8 attention
+heads, yielding 15.9M parameters. Both variants are trained on the same derived training set and
+generate boards with 1000 denoising steps.
+As shown in Table 9, GRAM achieves 99.05% validity, outperforming all D3PM baselines. The
+strongest D3PM baseline, D3PM-Uniform (Big), reaches 91.33% validity while using 55.1M parame-
+ters and 1000 denoising steps. In contrast, GRAM uses fewer parameters and only 16 inference steps.
+In all cases, the valid samples are unique under exact board matching, indicating that the reported
+validity is not due to simple repetition of a small set of boards. These results show that GRAM can
+generate highly constrained symbolic structures from an empty input, supporting its potential as a
+generator beyond conditional puzzle solving.
+Figure 17 illustrates the unconditional Sudoku generation setup. Starting from an empty board,
+the task is to generate complete boards, and validity is determined by whether the generated board
+satisfies all Sudoku constraints. Figure 7 shows qualitative examples of boards generated by GRAM.
+
+Table 9: Unconditional Sudoku generation. We report the ratio of generated boards satisfying Sudoku
+constraints over 100K samples. All valid boards are unique for all methods in this evaluation.
+                         Method                     #Params            Steps       Validity(%)
+                         D3PM-Uniform (Big)         55.1M              1000           91.33
+                         D3PM-Uniform (Small)       15.9M              1000           29.24
+                         D3PM-Absorb (Big)          55.1M              1000           79.18
+                         D3PM-Absorb (Small)        15.9M              1000           21.88
+                         GRAM (Ours)                10.9M               16            99.05
+
+
+           Empty input                                  Valid sample                              Invalid sample
+                                            3   6   5    4   7   8     9   2   1        4     3   8   2   9   1   7   6   5
+                                            9   4   1    2   5   6     8   7   3        5     2   1   7   3   6   4   9   8
+                                            2   7   8    9   1   3     6   5   4        7     6   9   4   5   8   1   2   3
+                                            5   2   9    8   4   7     3   1   6        9     1   3   4   4   7   5   8   6
+                                            7   3   6    1   9   2     5   4   8        2     5   4   6   8   9   3   1   7
+                                            1   8   4    3   6   5     2   9   7        6     8   7   5   1   3   2   4   9
+                                            8   9   7    5   3   4     1   6   2        3     7   2   9   6   4   8   5   1
+                                            4   1   3    6   2   9     7   8   5        1     9   5   3   2   8   6   7   4
+                                            6   5   2    7   8   1     4   3   9        8     4   6   1   7   5   9   3   2
+
+
+Figure 17: Unconditional Sudoku generation setup. Starting from an empty board, the task is to generate
+complete Sudoku boards. The valid sample satisfies all Sudoku constraints, while red entries in the invalid
+sample indicate cells involved in constraint violations.
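+
+The validity criterion behind Table 9 and Figure 17 can be sketched as follows. This is a minimal
+Python illustration, not the authors' evaluation code: the names is_valid_sudoku and validity_rate
+are hypothetical, and the two boards are transcribed from Figure 17.
+
```python
from typing import List

Board = List[List[int]]


def is_valid_sudoku(board: Board) -> bool:
    """Return True if a fully filled 9x9 board satisfies every Sudoku constraint."""
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in board]
    cols = [{board[r][c] for r in range(9)} for c in range(9)]
    boxes = [
        {board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    # Each row, column, and 3x3 box must contain the digits 1..9 exactly once.
    return all(unit == digits for unit in rows + cols + boxes)


def validity_rate(boards: List[Board]) -> float:
    """Percentage of boards that are valid (a Validity(%)-style metric)."""
    if not boards:
        return 0.0
    return 100.0 * sum(is_valid_sudoku(b) for b in boards) / len(boards)


# Boards transcribed from Figure 17.
VALID_SAMPLE = [
    [3, 6, 5, 4, 7, 8, 9, 2, 1],
    [9, 4, 1, 2, 5, 6, 8, 7, 3],
    [2, 7, 8, 9, 1, 3, 6, 5, 4],
    [5, 2, 9, 8, 4, 7, 3, 1, 6],
    [7, 3, 6, 1, 9, 2, 5, 4, 8],
    [1, 8, 4, 3, 6, 5, 2, 9, 7],
    [8, 9, 7, 5, 3, 4, 1, 6, 2],
    [4, 1, 3, 6, 2, 9, 7, 8, 5],
    [6, 5, 2, 7, 8, 1, 4, 3, 9],
]
INVALID_SAMPLE = [
    [4, 3, 8, 2, 9, 1, 7, 6, 5],
    [5, 2, 1, 7, 3, 6, 4, 9, 8],
    [7, 6, 9, 4, 5, 8, 1, 2, 3],
    [9, 1, 3, 4, 4, 7, 5, 8, 6],  # the digit 4 appears twice in this row
    [2, 5, 4, 6, 8, 9, 3, 1, 7],
    [6, 8, 7, 5, 1, 3, 2, 4, 9],
    [3, 7, 2, 9, 6, 4, 8, 5, 1],
    [1, 9, 5, 3, 2, 8, 6, 7, 4],
    [8, 4, 6, 1, 7, 5, 9, 3, 2],
]
```
+
+Applied to a list of generated boards, validity_rate gives the fraction of valid samples as a
+percentage; uniqueness of the valid boards would need a separate check.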
+
+
+
+D.6   Visualizing Latent Recursion Process
+
+To understand how stochastic guidance shapes reasoning, we visualize latent trajectories during
+recursive computation. Specifically, we track the high-level state h at each supervision step throughout
+the recursion process. For visualization, we project these latent vectors into 2D using PCA [59] and
+interpolate unobserved states via K-D tree [60] to construct a continuous loss landscape.
+Figures 18 and 19 compare TRM and GRAM on the same Sudoku puzzle. TRM follows a single
+deterministic path from initialization to solution, offering no mechanism to escape if the trajectory
+enters a suboptimal region. In contrast, GRAM samples diverse trajectories that explore different
+regions of latent space before converging. While some trajectories become trapped in local minima
+(bright yellow regions), others successfully navigate toward the global optimum (dark blue regions).
+This diversity enables GRAM to discover valid solutions more reliably through parallel exploration.
+
+
+
+
+Figure 18: Latent reasoning trajectory of TRM. The red dot indicates the initial state h0 and the green dot
+indicates the final state hT . Background color represents the loss landscape: bright yellow corresponds to high
+loss regions, while dark blue indicates low loss (optimal) regions. TRM follows a single deterministic path with
+no ability to escape suboptimal trajectories.
+
+
+
+
+Figure 19: Latent reasoning trajectories of GRAM (50 samples). Using the same visualization scheme as
+Figure 18, we show 50 sampled trajectories from GRAM. The stochastic guidance enables diverse exploration
+of the latent space: while some trajectories converge to local minima (bottom right), others successfully reach
+the global optimum (middle left), demonstrating how parallel sampling improves solution discovery.
+
+
+
+
+Licenses
+Table 10: Existing assets, licenses, and source links. We list the existing datasets, benchmarks, and public
+reference implementations used or cited in our experiments. Synthetic N-Queens and Graph Coloring instances
+are generated by the authors and are therefore not external assets.
+
+ Asset                     Use in this paper                     License / terms               Source link
+ MNIST                     Binarized MNIST generation            Creative Commons              https://keras.io/api/datasets/mnist/
+                           experiments                           Attribution-Share Alike 3.0
+ ARC-AGI-1 / original ARC  ARC-AGI reasoning benchmark           Apache License 2.0            https://github.com/fchollet/ARC-AGI
+ ARC-AGI-2                 ARC-AGI-2 reasoning benchmark /       Apache License 2.0            https://github.com/arcprize/ARC-AGI-2
+                           reference results
+ HRM repository            HRM baseline and                      Apache License 2.0            https://github.com/sapientinc/HRM
+                           Sudoku-Extreme-related reference
+                           implementation
+ TinyRecursiveModels /     TRM baseline and recursive            MIT License                   https://github.com/SamsungSAILMontreal/TinyRecursiveModels
+ TRM repository            reasoning reference implementation
+ MDLM repository           Masked diffusion baseline reference   Apache License 2.0            https://github.com/kuleshov-group/mdlm
+                           implementation, if public code is
+                           used
+ Google Research D3PM      D3PM image-generation baseline        Apache License 2.0            https://github.com/google-research/google-research/blob/master/d3pm/images/diffusion_categorical.py
+ implementation            reference implementation, if public
+                           code is used
+ Looped Transformer        Looped Transformer baseline           MIT License                   https://github.com/Leiay/looped_transformer
+ repository                reference implementation, if public
+                           code is used
+ N-Queens                  Synthetic multi-solution constraint   Not an external asset         N/A
+                           satisfaction task generated by the
+                           authors
+ Graph Coloring            Synthetic multi-solution constraint   Not an external asset         N/A
+                           satisfaction task generated by the
+                           authors
+
+
+
+
+
+\ No newline at end of file
diff --git a/papers/txt/hrm2025_hierarchical_reasoning.txt b/papers/txt/hrm2025_hierarchical_reasoning.txt
new file mode 100644
index 0000000..3cc1f2e
--- /dev/null
+++ b/papers/txt/hrm2025_hierarchical_reasoning.txt
@@ -0,0 +1,1302 @@
+                                                Hierarchical Reasoning Model
+                                                Guan Wang1,†, Jin Li1, Yuhao Sun1, Xing Chen1, Changling Liu1,
+                                                Yue Wu1, Meng Lu1,†, Sen Song2,†, Yasin Abbasi Yadkori1,†
+                                                1 Sapient Intelligence, Singapore
+                                                arXiv:2506.21734v3 [cs.AI] 4 Aug 2025
+
+                                                Abstract
+
+                                                Reasoning, the process of devising and executing complex goal-oriented action sequences,
+                                                remains a critical challenge in AI. Current large language models (LLMs) primarily employ
+                                                Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive
+                                                data requirements, and high latency. Inspired by the hierarchical and multi-timescale pro-
+                                                cessing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel
+                                                recurrent architecture that attains significant computational depth while maintaining both train-
+                                                ing stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass
+                                                without explicit supervision of the intermediate process, through two interdependent recurrent
+                                                modules: a high-level module responsible for slow, abstract planning, and a low-level mod-
+                                                ule handling rapid, detailed computations. With only 27 million parameters, HRM achieves
+                                                exceptional performance on complex reasoning tasks using only 1000 training samples. The
+                                                model operates without pre-training or CoT data, yet achieves nearly perfect performance on
+                                                challenging tasks including complex Sudoku puzzles and optimal path finding in large mazes.
+                                                Furthermore, HRM outperforms much larger models with significantly longer context windows
+                                                on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring artificial
+                                                general intelligence capabilities. These results underscore HRM’s potential as a transformative
+                                                advancement toward universal computation and general-purpose reasoning systems.
+
+
+                                        [Figure 1, right panel: bar charts, y-axis "Accuracy %". Benchmarks: ARC-AGI-1 (960 training
+                                        examples), ARC-AGI-2 (1120 training examples), Sudoku-Extreme (9x9, 1000 training examples),
+                                        Maze-Hard (30x30, 1000 training examples). Models: Deepseek R1, Claude 3.7 8K, o3-mini-high
+                                        (chain-of-thought, pretrained); Direct pred, HRM (direct prediction, small-sample learning).
+                                        HRM bars: 40.3, 5.0, 55.0, 74.5.]
+
+
+
+                                        Figure 1: Left: HRM is inspired by hierarchical processing and temporal separation in the brain. It
+                                        has two recurrent networks operating at different timescales to collaboratively solve tasks. Right:
+                                        With only about 1000 training examples, the HRM (~27M parameters) surpasses state-of-the-art
+                                        CoT models on inductive benchmarks (ARC-AGI) and challenging symbolic tree-search puzzles
+                                        (Sudoku-Extreme, Maze-Hard) where CoT models failed completely. The HRM was randomly
+                                        initialized, and it solved the tasks directly from inputs without chain of thought.
+                                          2 Tsinghua University. † Corresponding author. Contact: research@sapient.inc.
+                                              Code available at: github.com/sapientinc/HRM
+
+1        Introduction
+Deep learning, as its name suggests, emerged from the idea of stacking more layers to achieve
+increased representation power and improved performance 1,2 . However, despite the remarkable
+success of large language models, their core architecture is paradoxically shallow 3 . This imposes
+a fundamental constraint on their most sought-after capability: reasoning. The fixed depth of stan-
+dard Transformers places them in computational complexity classes such as AC^0 or TC^0 4 , prevent-
+ing them from solving problems that require polynomial time 5,6 . LLMs are not Turing-complete
+and thus they cannot, at least in a purely end-to-end manner, execute complex algorithmic rea-
+soning that is necessary for deliberate planning or symbolic manipulation tasks 7,8 . For example,
+our results on the Sudoku task show that increasing Transformer model depth can improve per-
+formance,1 but performance remains far from optimal even with very deep models (see Figure 2),
+which supports the conjectured limitations of the LLM scaling paradigm 9 .
+The LLM literature has relied largely on Chain-of-Thought (CoT) prompting for reasoning 10 .
+CoT externalizes reasoning into token-level language by breaking down complex tasks into sim-
+pler intermediate steps, sequentially generating text using a shallow model 11 . However, CoT for
+reasoning is a crutch, not a satisfactory solution. It relies on brittle, human-defined decompositions
+where a single misstep or a misorder of the steps can derail the reasoning process entirely 12,13 . This
+dependency on explicit linguistic steps tethers reasoning to patterns at the token level. As a result,
+CoT reasoning often requires a significant amount of training data and generates a large number of
+tokens for complex reasoning tasks, resulting in slow response times. A more efficient approach is
+needed to minimize these data requirements 14 .
+Towards this goal, we explore “latent reasoning”, where the model conducts computations within
+its internal hidden state space 15,16 . This aligns with the understanding that language is a tool for
+human communication, not the substrate of thought itself 17 ; the brain sustains lengthy, coherent
+chains of reasoning with remarkable efficiency in a latent space, without constant translation back
+to language. However, the power of latent reasoning is still fundamentally constrained by a model’s
+effective computational depth. Naively stacking layers is notoriously difficult due to vanishing gra-
+dients, which plague training stability and effectiveness 1,18 . Recurrent architectures, a natural al-
+ternative for sequential tasks, often suffer from early convergence, rendering subsequent computa-
+tional steps inert, and rely on the biologically implausible, computationally expensive and memory
+intensive Backpropagation Through Time (BPTT) for training 19 .
+The human brain provides a compelling blueprint for achieving the effective computational depth
+that contemporary artificial models lack. It organizes computation hierarchically across corti-
+cal regions operating at different timescales, enabling deep, multi-stage reasoning 20,21,22 . Recur-
+rent feedback loops iteratively refine internal representations, allowing slow, higher-level areas to
+guide, and fast, lower-level circuits to execute, subordinate processing while preserving global
+coherence 23,24,25 . Notably, the brain achieves such depth without incurring the prohibitive credit-
+assignment costs that typically hamper recurrent networks trained with backpropagation through time 19,26 .
+Inspired by this hierarchical and multi-timescale biological architecture, we propose the Hierar-
+chical Reasoning Model (HRM). HRM is designed to significantly increase the effective compu-
+tational depth. It features two coupled recurrent modules: a high-level (H) module for abstract,
+deliberate reasoning, and a low-level (L) module for fast, detailed computations. This structure
+    1 Simply increasing the model width does not improve performance here.
+
+[Figure 2 plots, y-axis "Accuracy %". Left: accuracy vs. Parameters (27M, 54M, 109M, 218M, 436M, 872M)
+for "Scaling Width - 8 layers fixed" and "Scaling Depth - 512 hidden size fixed". Right: accuracy vs.
+Depth / Transformer layers computed (8 to 512) for Transformer, Recurrent Transformer, and HRM.]
+
+Figure 2: The necessity of depth for complex reasoning. Left: On Sudoku-Extreme Full, which
+requires extensive tree-search and backtracking, increasing a Transformer’s width yields no perfor-
+mance gain, while increasing depth is critical. Right: Standard architectures saturate, failing to
+benefit from increased depth. HRM overcomes this fundamental limitation, effectively using its
+computational depth to achieve near-perfect accuracy.
+
+avoids the rapid convergence of standard recurrent models through a process we term “hierarchi-
+cal convergence.” The slow-updating H-module advances only after the fast-updating L-module
+has completed multiple computational steps and reached a local equilibrium, at which point the
+L-module is reset to begin a new computational phase.
+Furthermore, we propose a one-step gradient approximation for training HRM, which offers im-
+proved efficiency and eliminates the requirement for BPTT. This design maintains a constant mem-
+ory footprint (O(1) compared to BPTT’s O(T ) for T timesteps) throughout the backpropagation
+process, making it scalable and more biologically plausible.
+Leveraging its enhanced effective depth, HRM excels at tasks that demand extensive search and
+backtracking. Using only 1,000 input-output examples, without pre-training or CoT supervi-
+sion, HRM learns to solve problems that are intractable for even the most advanced LLMs. For
+example, it achieves near-perfect accuracy in complex Sudoku puzzles (Sudoku-Extreme Full) and
+optimal pathfinding in 30x30 mazes, where state-of-the-art CoT methods completely fail (0% ac-
+curacy). In the Abstraction and Reasoning Corpus (ARC) AGI Challenge 27,28,29 - a benchmark
+of inductive reasoning - HRM, trained from scratch with only the official dataset (~1000 exam-
+ples), with only 27M parameters and a 30x30 grid context (900 tokens), achieves a performance
+of 40.3%, which substantially surpasses leading CoT-based models like o3-mini-high (34.5%)
+and Claude 3.7 8K context (21.2%), despite their considerably larger parameter sizes and con-
+text lengths, as shown in Figure 1. This represents a promising direction toward the development
+of next-generation AI reasoning systems with universal computational capabilities.
+
+
+2    Hierarchical Reasoning Model
+We present the HRM, inspired by three fundamental principles of neural computation observed in
+the brain:
+• Hierarchical processing: The brain processes information across a hierarchy of cortical ar-
+  eas. Higher-level areas integrate information over longer timescales and form abstract repre-
+  sentations, while lower-level areas handle more immediate, detailed sensory and motor process-
+  ing 20,22,21 .
+• Temporal Separation: These hierarchical levels in the brain operate at distinct intrinsic timescales,
+  reflected in neural rhythms (e.g., slow theta waves, 4–8 Hz and fast gamma waves, 30–100
+  Hz) 30,31 . This separation allows for stable, high-level guidance of rapid, low-level computa-
+  tions 32,33 .
+• Recurrent Connectivity: The brain features extensive recurrent connections. These feedback
+  loops enable iterative refinement, yielding more accurate and context-sensitive representations
+  at the cost of additional processing time. Additionally, the brain largely avoids the problematic
+  deep credit assignment problem associated with BPTT 19 .
+The HRM model consists of four learnable components: an input network f_I(·; θ_I), a low-level
+recurrent module f_L(·; θ_L), a high-level recurrent module f_H(·; θ_H), and an output network f_O(·; θ_O).
+The model’s dynamics unfold over N high-level cycles of T low-level timesteps each (see footnote 2). We index
+the total timesteps of one forward pass by i = 1, . . . , N × T . The modules f_L and f_H each keep a
+hidden state, z_L^i for f_L and z_H^i for f_H, which are initialized with the vectors z_L^0 and z_H^0,
+respectively.
+The HRM maps an input vector x to an output prediction vector ŷ as follows. First, the input x is
+projected into a working representation x̃ by the input network:
+
+                                                  x̃ = fI (x; θI ) .
+At each timestep i, the L-module updates its state conditioned on its own previous state, the H-
+module’s current state (which remains fixed throughout the cycle), and the input representation.
+The H-module only updates once per cycle (i.e., every T timesteps) using the L-module’s final
+state at the end of that cycle:
+
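+                           z_L^i = f_L\left(z_L^{i-1}, z_H^{i-1}, \tilde{x}; \theta_L\right),
+
+                           z_H^i = \begin{cases}
+                                     f_H\left(z_H^{i-1}, z_L^{i-1}; \theta_H\right) & \text{if } i \equiv 0 \pmod{T}, \\
+                                     z_H^{i-1} & \text{otherwise}.
+                                   \end{cases}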
+
+Finally, after N full cycles, a prediction ŷ is extracted from the hidden state of the H-module:
+
+
+                                               \hat{y} = f_O\left(z_H^{NT}; \theta_O\right).
+
+This entire N T -timestep process represents a single forward pass of the HRM. A halting mecha-
+nism (detailed later in this section) determines whether the model should terminate, in which case
+ŷ will be used as the final prediction, or continue with an additional forward pass.
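+
+The forward pass described above can be sketched in plain Python. This is a toy illustration under
+stated assumptions, not the HRM implementation: the four modules are random tanh layers (make_layer
+and hrm_forward are hypothetical names), the initial states are zero vectors, the halting mechanism
+is omitted, and the update order follows the equations as written (z_H^i is computed from z_L^{i-1}).
+
```python
import math
import random
from typing import Callable, List, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(in_dim: int, out_dim: int, rng: random.Random) -> Layer:
    """Hypothetical stand-in for a learned module: v -> tanh(W v) with random W."""
    weights = [
        [rng.gauss(0.0, 1.0 / math.sqrt(in_dim)) for _ in range(in_dim)]
        for _ in range(out_dim)
    ]

    def layer(v: Vector) -> Vector:
        return [math.tanh(sum(w * x for w, x in zip(row, v))) for row in weights]

    return layer


def hrm_forward(x: Vector, n_cycles: int, t_steps: int,
                dim: int = 4, seed: int = 0) -> Tuple[Vector, int]:
    """Run N cycles of T low-level steps; return (prediction, number of H-updates)."""
    rng = random.Random(seed)
    f_in = make_layer(len(x), dim, rng)      # f_I
    f_low = make_layer(3 * dim, dim, rng)    # f_L(z_L, z_H, x_tilde)
    f_high = make_layer(2 * dim, dim, rng)   # f_H(z_H, z_L)
    f_out = make_layer(dim, len(x), rng)     # f_O

    x_tilde = f_in(x)
    z_low = [0.0] * dim    # z_L^0 (assumed zero here)
    z_high = [0.0] * dim   # z_H^0 (assumed zero here)
    high_updates = 0
    for i in range(1, n_cycles * t_steps + 1):
        # L-module: uses the previous L-state and the previous H-state.
        next_low = f_low(z_low + z_high + x_tilde)
        # H-module: updates only when i is a multiple of T, otherwise stays fixed.
        if i % t_steps == 0:
            z_high = f_high(z_high + z_low)
            high_updates += 1
        z_low = next_low
    # The prediction is read out from the final H-state z_H^{NT}.
    return f_out(z_high), high_updates
```
+
+For example, hrm_forward([0.1, -0.2, 0.3], n_cycles=3, t_steps=4) runs 12 low-level steps and
+performs exactly 3 H-module updates, one at the end of each cycle.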
+Hierarchical convergence Although convergence is crucial for recurrent networks, standard RNNs
+are fundamentally limited by their tendency to converge too early. As the hidden state settles toward
+a fixed point, update magnitudes shrink, effectively stalling subsequent computation and capping
+the network’s effective depth. To preserve computational power, we actually want convergence to
+proceed very slowly, but engineering that gradual approach is difficult, since pushing convergence
+too far edges the system toward instability.
+    2 While inspired by temporal separation in the brain, our model’s “high-level” and “low-level” modules are concep-
+tual abstractions and do not map directly to specific neural oscillation frequencies.
+
+
+[Figure 3 plots. Top row: forward residual vs. step index for HRM (H and L modules) and for a Recurrent
+Neural Net, and vs. layer index for a Deep Neural Net. Bottom row: PCA trajectories (principal components)
+of the corresponding states, indexed by step (HRM, Recurrent Neural Net) or by layer (Deep Neural Net).]
+
+Figure 3: Comparison of forward residuals and PCA trajectories. HRM shows hierarchical conver-
+gence: the H-module steadily converges, while the L-module repeatedly converges within cycles
+before being reset by H, resulting in residual spikes. The recurrent neural network exhibits rapid
+convergence with residuals quickly approaching zero. In contrast, the deep neural network experi-
+ences vanishing gradients, with significant residuals primarily in the initial (input) and final layers.
+
+HRM is explicitly designed to counteract this premature convergence through a process we term
+hierarchical convergence. During each cycle, the L-module (an RNN) exhibits stable convergence
+to a local equilibrium. This equilibrium, however, depends on the high-level state zH supplied
+during that cycle. After completing the T steps, the H-module incorporates the sub-computation’s
+outcome (the final state zL ) and performs its own update. This zH update establishes a fresh context
+for the L-module, essentially “restarting” its computational path and initiating a new convergence
+phase toward a different local equilibrium.
+This process allows the HRM to perform a sequence of distinct, stable, nested computations, where
+the H-module directs the overall problem-solving strategy and the L-module executes the intensive
+search or refinement required for each step. Although a standard RNN may approach convergence
+within T iterations, the hierarchical convergence benefits from an enhanced effective depth of N T
+steps. As empirically shown in Figure 3, this mechanism allows HRM both to maintain high
+computational activity (forward residual) over many steps (in contrast to a standard RNN, whose
+activity rapidly decays) and to enjoy stable convergence. This translates into better performance at
+any computation depth, as illustrated in Figure 2.
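+
+The residual spikes described above can be reproduced with a toy scalar loop. This is only a sketch
+under stated assumptions: the two update rules are hypothetical stand-ins for the L- and H-modules
+(a contraction toward 2 * z_h, and an average of z_h and z_l), chosen so the arithmetic is exact.
+

```python
def forward_residuals(n_cycles=3, t_steps=5):
    """Toy scalar version of the nested L/H update loop.

    The update rules are hypothetical stand-ins, not HRM's networks:
    the L-state contracts toward 2 * z_h, and the H-state mixes in z_l.
    """
    z_h, z_l = 1.0, 0.0
    residuals = []
    for i in range(n_cycles * t_steps):
        z_l_next = 0.5 * z_l + z_h             # L-step: contraction for fixed z_h
        residuals.append(abs(z_l_next - z_l))  # forward residual of the L-state
        z_l = z_l_next
        if (i + 1) % t_steps == 0:             # H-step once per cycle
            z_h = 0.5 * z_h + 0.5 * z_l
    return residuals


residuals = forward_residuals()
print([round(r, 4) for r in residuals])
```

+
+Within each cycle the residual halves at every step, and it jumps back up on the first step after
+each H-update, which is the qualitative pattern attributed to the L-module in Figure 3.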
+Approximate gradient Recurrent models typically use BPTT to compute gradients. However,
+BPTT requires storing the hidden states from the forward pass and then combining them with
+gradients during the backward pass, which demands O(T ) memory for T timesteps. This heavy
+memory burden forces smaller batch sizes and leads to poor GPU utilization, especially for large-
+scale networks. Additionally, because retaining the full history trace through time is biologically
+implausible, it is unlikely that the brain implements BPTT 19 .
+Fortunately, if a recurrent neural network converges to a fixed point, we can avoid unrolling its state
+sequence by applying backpropagation in a single step at that equilibrium point. Moreover, such a
+mechanism could plausibly be implemented in the brain using only local learning rules 34,35 . Based
+on this finding, we propose a one-step approximation of the HRM gradient, using the gradient of
+the last state of each module and treating other states as constant. The gradient path is, therefore,
+
+      Output head → final state of the H-module → final state of the L-module → input embedding
+
+The above method needs O(1) memory, does not require unrolling through time, and can be easily
+implemented with an autograd framework such as PyTorch, as shown in Figure 4. Given that
+each module only needs to back-propagate errors through its most recent local synaptic activity,
+this approach aligns well with the perspective that cortical credit assignment relies on short-range,
+temporally local mechanisms rather than on a global replay of activity patterns.
+The one-step gradient approximation is theoretically grounded in the mathematics of Deep Equilibrium
+Models (DEQ) 36 which employs the Implicit Function Theorem (IFT) to bypass BPTT, as detailed next.
+Consider an idealized HRM behavior where, during high-level cycle k, the L-module repeatedly updates
+until its state $z_L$ converges to a local fixed point $z_L^\star$. This fixed point, given the
+current high-level state $z_H^{k-1}$, can be expressed as
+
+      $z_L^\star = f_L(z_L^\star, z_H^{k-1}, \tilde{x}; \theta_L)$.
+
+The H-module then performs a single update using this converged L-state:
+
+      $z_H^k = f_H(z_H^{k-1}, z_L^\star; \theta_H)$.
+
+With a proper mapping F, the updates to the high-level state can be written in a more compact form as
+$z_H^k = F(z_H^{k-1}; \tilde{x}, \theta)$, where $\theta = (\theta_I, \theta_L)$, and the fixed-point
+can be written as $z_H^\star = F(z_H^\star; \tilde{x}, \theta)$. Let $J_F = \frac{\partial F}{\partial z_H}$
+be the Jacobian of F, and assume that the matrix $I - J_F$ is invertible at $z_H^\star$ and that the
+mapping F is continuously differentiable. The Implicit Function Theorem then allows us to calculate
+the exact gradient of fixed point $z_H^\star$ with respect to the parameters $\theta$ without explicit
+backpropagation:
+
+      $\frac{\partial z_H^\star}{\partial \theta} = \left( I - J_F \big|_{z_H^\star} \right)^{-1} \frac{\partial F}{\partial \theta} \Big|_{z_H^\star}$.      (1)
+
+    def hrm(z, x, N=2, T=2):
+        x = input_embedding(x)
+        zH, zL = z
+        with torch.no_grad():
+            for _i in range(N * T - 1):
+                zL = L_net(zL, zH, x)
+                if (_i + 1) % T == 0:
+                    zH = H_net(zH, zL)
+
+        # 1-step grad
+        zL = L_net(zL, zH, x)
+        zH = H_net(zH, zL)
+        return (zH, zL), output_head(zH)
+
+    # Deep Supervision
+    for x, y_true in train_dataloader:
+        z = z_init
+        for step in range(N_supervision):
+            z, y_hat = hrm(z, x)
+
+            loss = softmax_cross_entropy(y_hat, y_true)
+            z = z.detach()
+
+            loss.backward()
+            opt.step()
+            opt.zero_grad()
+
+Figure 4: Top: Diagram of HRM with approximate gradient. Bottom: Pseudocode of HRM with deep
+supervision training in PyTorch.
+
+
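+Calculating the above gradient requires evaluating and inverting the matrix $(I - J_F)$, which can be
+computationally expensive. Given the Neumann series expansion,
+
+      $(I - J_F)^{-1} = I + J_F + J_F^2 + J_F^3 + \dots$,
+
+the so-called 1-step gradient 37 approximates the series by considering only its first term, i.e.
+$(I - J_F)^{-1} \approx I$, and leads to the following approximation of Equation (1):
+
+      $\frac{\partial z_H^\star}{\partial \theta_H} \approx \frac{\partial f_H}{\partial \theta_H}, \quad
+       \frac{\partial z_H^\star}{\partial \theta_L} \approx \frac{\partial f_H}{\partial z_L^\star} \cdot \frac{\partial z_L^\star}{\partial \theta_L}, \quad
+       \frac{\partial z_H^\star}{\partial \theta_I} \approx \frac{\partial f_H}{\partial z_L^\star} \cdot \frac{\partial z_L^\star}{\partial \theta_I}$.      (2)
+
+The gradients of the low-level fixed point, $\frac{\partial z_L^\star}{\partial \theta_L}$ and
+$\frac{\partial z_L^\star}{\partial \theta_I}$, can also be approximated using another
+application of the 1-step gradient:
+
+      $\frac{\partial z_L^\star}{\partial \theta_L} \approx \frac{\partial f_L}{\partial \theta_L}, \quad
+       \frac{\partial z_L^\star}{\partial \theta_I} \approx \frac{\partial f_L}{\partial \theta_I}$.      (3)
+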
+By substituting Equation (3) back into Equation (2), we arrive at the final simplified gradients.
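+
+To make the size of the truncation concrete, the sketch below works through a scalar linear fixed
+point z = a z + b θ, for which the Jacobian is just the number a. The map and the values of a, b
+and θ are illustrative assumptions, not quantities from HRM; the exact gradient b / (1 - a) is
+compared with the 1-step value (first Neumann term) and with a ten-term partial sum.
+

```python
def fixed_point(a, b, theta, steps=200):
    """Iterate z <- a*z + b*theta from zero; converges when |a| < 1."""
    z = 0.0
    for _ in range(steps):
        z = a * z + b * theta
    return z


def neumann_gradient(a, b, terms):
    """Approximate dz*/dtheta = (1 - a)^-1 * b by the first `terms` Neumann terms."""
    return sum(a ** k for k in range(terms)) * b


a, b, theta = 0.5, 2.0, 3.0            # illustrative values only
z_star = fixed_point(a, b, theta)       # analytic fixed point: b*theta / (1 - a)
exact = b / (1.0 - a)                   # implicit-function-theorem gradient
one_step = neumann_gradient(a, b, 1)    # keeps only the identity term
ten_terms = neumann_gradient(a, b, 10)
print(z_star, exact, one_step, ten_terms)
```

+
+In this toy case the 1-step gradient has the correct sign but half the exact magnitude; the gap
+shrinks as a (the Jacobian) gets smaller, which is the regime where the approximation is tightest.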
+Before defining our loss function, we must first introduce two key elements of our proposed
+method: deep supervision and adaptive computational time.
+Deep supervision Inspired by the principle that periodic neural oscillations regulate when learning
+occurs in the brain 38 , we incorporate a deep supervision mechanism into HRM, as detailed next.
+Given a data sample (x, y), we run multiple forward passes of the HRM model, each of which we
+refer to as a segment. Let M denote the total number of segments executed before termination.
+For each segment $m \in \{1, \dots, M\}$, let $z^m = (z_H^{mNT}, z_L^{mNT})$ represent the hidden state at the
+conclusion of segment m, encompassing both high-level and low-level state components.
+At each segment m, we apply a deep supervision step as follows:
+   1. Given the state $z^{m-1}$ from the previous segment, compute the next state $z^m$ and its associated
+      output $\hat{y}^m$ through a forward pass in the HRM model:
+
+                                     $(z^m, \hat{y}^m) \leftarrow \mathrm{HRM}(z^{m-1}, x; \theta)$
+
+   2. Compute the loss for the current segment:
+
+                                           $L^m \leftarrow \mathrm{LOSS}(\hat{y}^m, y)$
+
+   3. Update parameters:
+
+                                    $\theta \leftarrow \mathrm{OPTIMIZERSTEP}(\theta, \nabla_\theta L^m)$
+
+The crucial aspect of this procedure is that the hidden state $z^m$ is “detached” from the computation
+graph before being used as the input state for the next segment. Consequently, gradients from
+segment m + 1 do not propagate back through segment m, effectively creating a 1-step approxi-
+mation of the gradient of the recursive deep supervision process 39,40 . This approach provides more
+frequent feedback to the H-module and serves as a regularization mechanism, demonstrating supe-
+rior empirical performance and enhanced stability in deep equilibrium models when compared to
+more complex, Jacobian-based regularization techniques 39,41 . Figure 4 shows pseudocode of deep
+supervision training.
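+
+The effect of detaching the carried state can be sketched without an autograd library. The toy
+below is an assumption-laden illustration, not the HRM training loop: the "model" is a single
+scalar tanh update with one weight w, the loss is a squared error, and the gradient is written by
+hand so that the incoming state is visibly treated as a constant at every segment.
+

```python
import math


def segment(w, z_prev, x):
    """One toy segment: a single tanh update of a scalar state (hypothetical model)."""
    return math.tanh(w * z_prev + x)


def train_deep_supervision(x, y, w=0.1, z_init=0.0, n_segments=200, lr=0.5):
    """Run segments back to back, updating w after each one."""
    z, losses = z_init, []
    for _ in range(n_segments):
        z_new = segment(w, z, x)
        err = z_new - y
        losses.append(err * err)
        # Gradient of the squared error w.r.t. w, with the incoming state z
        # treated as a constant (the analogue of detaching it).
        grad_w = 2.0 * err * (1.0 - z_new * z_new) * z
        w -= lr * grad_w
        z = z_new  # carried forward as a plain number, with no gradient history
    return w, losses


w_final, losses = train_deep_supervision(x=0.5, y=0.8)
print(losses[0], losses[-1])
```

+
+Each update only needs quantities from the current segment, so memory does not grow with the
+number of segments, while the state itself still accumulates computation across segments.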
+Adaptive computational time (ACT) The brain dynamically alternates between automatic think-
+ing (“System 1”) and deliberate reasoning (“System 2”) 42 . Neuroscientific evidence shows that
+these cognitive modes share overlapping neural circuits, particularly within regions such as the
+prefrontal cortex and the default mode network 43,44 . This indicates that the brain dynamically mod-
+ulates the “runtime” of these circuits according to task complexity and potential rewards 45,46 .
+Inspired by the above mechanism, we incorporate an adaptive halting strategy into HRM that en-
+ables “thinking, fast and slow”. This integration leverages deep supervision and uses the Q-learning
+algorithm 47 to adaptively determine the number of segments. A Q-head uses the final state of the
+H-module to predict the Q-values $\hat{Q}^m = (\hat{Q}^m_{\text{halt}}, \hat{Q}^m_{\text{continue}})$ of the “halt” and “continue” actions:
+
+                                           $\hat{Q}^m = \sigma(\theta_Q^\top z_H^{mNT})$,
+where σ denotes the sigmoid function applied element-wise. The halt or continue action is chosen
+using a randomized strategy as detailed next. Let Mmax denote the maximum number of segments
+(a fixed hyperparameter) and Mmin denote the minimum number of segments (a random variable).
+The value of Mmin is determined stochastically: with probability ε, it is sampled uniformly from the
+set {2, · · · , Mmax } (to encourage longer thinking), and with probability 1 − ε, it is set to 1. The halt
+action is selected under two conditions: when the segment count surpasses the maximum threshold
+Mmax , or when the estimated halt value Q̂halt exceeds the estimated continue value Q̂continue and the
+segment count has reached at least the minimum threshold Mmin .
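+
+The halting rule just described can be sketched in a few lines. The function names are
+hypothetical, segments are assumed to be counted from 1, and the maximum is assumed to take effect
+once the count reaches Mmax (one reading of "surpasses"); neither detail is specified by the text.
+

```python
import random


def sample_m_min(m_max, eps, rng):
    """With probability eps draw M_min uniformly from {2, ..., m_max}; otherwise use 1."""
    if rng.random() < eps:
        return rng.randint(2, m_max)  # randint is inclusive on both ends
    return 1


def should_halt(segment, q_halt, q_continue, m_min, m_max):
    """Halting rule for a 1-indexed segment count."""
    if segment >= m_max:  # assumption: the limit applies once the count reaches m_max
        return True
    return q_halt > q_continue and segment >= m_min


rng = random.Random(0)
m_min = sample_m_min(m_max=8, eps=0.1, rng=rng)
print(m_min, should_halt(segment=1, q_halt=0.9, q_continue=0.2, m_min=m_min, m_max=8))
```

+
+Sampling a larger Mmin for a fraction of the examples forces the model to keep computing even when
+the Q-head already prefers to halt, which is how longer thinking is encouraged during training.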
+The Q-head is updated through a Q-learning algorithm, which is defined on the following episodic
+Markov Decision Process (MDP). The state of the MDP at segment m is $z^m$, and the action space
+is {halt, continue}. Choosing the action “halt” terminates the episode and returns a binary reward
+indicating prediction correctness, i.e., $\mathbb{1}\{\hat{y}^m = y\}$. Choosing “continue” yields a reward of 0 and
+the state transitions to $z^{m+1}$. Thus, the Q-learning targets for the two actions
+$\hat{G}^m = (\hat{G}^m_{\text{halt}}, \hat{G}^m_{\text{continue}})$ are given by
+
+      $\hat{G}^m_{\text{halt}} = \mathbb{1}\{\hat{y}^m = y\}$,
+
+      $\hat{G}^m_{\text{continue}} =
+         \begin{cases}
+           \hat{Q}^{m+1}_{\text{halt}}, & \text{if } m \geq M_{\max}, \\
+           \max(\hat{Q}^{m+1}_{\text{halt}}, \hat{Q}^{m+1}_{\text{continue}}), & \text{otherwise}.
+         \end{cases}$
+
+We can now define the loss function of our learning procedure. The overall loss for each supervision
+segment combines both the Q-head loss and the sequence-to-sequence loss:
+
+      $L^m_{\text{ACT}} = \mathrm{LOSS}(\hat{y}^m, y) + \mathrm{BINARYCROSSENTROPY}(\hat{Q}^m, \hat{G}^m)$.
+
+Minimizing the above loss enables both accurate predictions and nearly optimal stopping decisions.
+Selecting the “halt” action ends the supervision loop. In practice, sequences are processed in
+batches, which can be easily handled by substituting any halted sample in the batch with a fresh
+sample from the dataloader.
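+
+A minimal sketch of the Q-learning targets and the combined loss follows. The function names are
+hypothetical, m_max stands for the segment limit in the case split of the "continue" target, and
+summing the binary cross-entropy over the two actions is an assumption (the text does not say
+whether it is summed or averaged).
+

```python
import math


def q_targets(correct, q_next_halt, q_next_continue, m, m_max):
    """Targets (G_halt, G_continue) for segment m, given the next segment's Q-values."""
    g_halt = 1.0 if correct else 0.0
    if m >= m_max:
        g_continue = q_next_halt
    else:
        g_continue = max(q_next_halt, q_next_continue)
    return g_halt, g_continue


def binary_cross_entropy(q, g, tiny=1e-12):
    """BCE between a predicted value q in (0, 1) and a target g in [0, 1]."""
    q = min(max(q, tiny), 1.0 - tiny)  # clamp to keep the logarithms finite
    return -(g * math.log(q) + (1.0 - g) * math.log(1.0 - q))


def act_loss(seq_loss, q, g):
    """Sequence loss plus BCE over both actions (summation is an assumption)."""
    return seq_loss + sum(binary_cross_entropy(qi, gi) for qi, gi in zip(q, g))


targets = q_targets(correct=True, q_next_halt=0.3, q_next_continue=0.7, m=1, m_max=4)
print(targets, act_loss(seq_loss=0.25, q=(0.6, 0.4), g=targets))
```

+
+In a real training loop the targets would be treated as constants (no gradient flows through the
+next segment's Q-values), in line with the detached state used by deep supervision.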
+Figure 5 presents a performance comparison between two HRM variants: one incorporating ACT
+and another employing a fixed computational step count equivalent to ACT’s Mmax parameter. It
+shows that ACT effectively adapts its computational resources based on task complexity, achieving
+significant computational savings with minimal impact on performance.
+Inference-time scaling An effective neural model should exploit additional computational re-
+sources during inference to enhance performance. As illustrated in Figure 5-(c), HRM seamlessly
+achieves inference-time scaling by simply increasing the computational limit parameter, Mmax ,
+without requiring further training or architectural modifications.
+Additional compute is especially effective for tasks that demand deeper reasoning. On Sudoku—
+a problem that often requires long-term planning—HRM exhibits strong inference-time scaling.
+On the other hand, we find that extra computational resources yield minimal gains on the ARC-AGI
+challenge, as solutions generally require only a few transformations.
+
+[Figure 5 plots omitted. Panels: (a) ACT Compute Spent, mean compute steps versus M (Fixed) or
+Mmax (ACT); (b) ACT Performance, accuracy % versus M (Fixed) or Mmax (ACT); (c) Inference-time
+scaling, accuracy % versus inference Mmax for models trained with Mmax = 2, 4 and 8. Legend in
+(a) and (b): Fixed M, ACT (Mmax limit).]
+
+Figure 5: Effectiveness of Adaptive Computation Time (ACT) on the Sudoku-Extreme-Full. (a)
+Mean compute steps used by models with ACT versus models with a fixed number of compute steps
+(M ). ACT maintains a low and stable number of average compute steps even as the maximum limit
+(Mmax ) increases. (b) Accuracy comparison. The ACT model achieves performance comparable
+to the fixed-compute model while utilizing substantially fewer computational steps on average. (c)
+Inference-time scalability. Models trained with a specific Mmax can generalize to higher compu-
+tational limits during inference, leading to improved accuracy. For example, a model trained with
+Mmax = 8 continues to see accuracy gains when run with Mmax = 16 during inference.
+
+Stability of Q-learning in ACT The deep Q-learning that underpins our ACT mechanism is
+known to be prone to instability, often requiring stabilization techniques such as replay buffers
+and target networks 48 , which are absent in our design. Our approach, however, achieves stability
+through the intrinsic properties of our model and training procedure. Recent theoretical work by
+Gallici et al. 49 shows that Q-learning can achieve convergence if network parameters are bounded,
+weight decay is incorporated during training, and post-normalization layers are implemented. Our
+model satisfies these conditions through its Post-Norm architecture that employs RMSNorm (a
+layer normalization variant) and the AdamW optimizer. AdamW has been shown to solve an L∞ -
+constrained optimization problem, ensuring that model parameters remain bounded by 1/λ 50 .
+Architectural details We employ a sequence-to-sequence architecture for HRM. Both input and
+output are represented as token sequences: x = (x1 , . . . , xl ) and y = (y1 , . . . , yl′ ) respectively.
+The model includes an embedding layer fI that converts discrete tokens into vector representa-
+tions, and an output head fO (z; θO ) = softmax(θO z) that transforms hidden states into token prob-
+ability distributions ŷ. For small-sample experiments, we replace softmax with stablemax 51 to
+improve generalization performance. The sequence-to-sequence loss is averaged over all tokens,
+$\mathrm{LOSS}(\hat{y}, y) = -\frac{1}{l'} \sum_{i=1}^{l'} \log p(y_i)$, where $p(y_i)$ is the probability that distribution $\hat{y}_i$ assigns to token
+$y_i$. The initial hidden states $z^0$ are initialized by sampling from a truncated normal distribution with
+standard deviation 1 and truncation 2, and are kept fixed throughout training.
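+
+The token-averaged loss can be sketched as follows. It is written here as a mean negative
+log-probability, so that lower values are better; the helper names are hypothetical, a plain
+softmax is used, and the stablemax variant mentioned above is not implemented.
+

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_loss(logits_per_token, targets):
    """Mean negative log-probability assigned to the target token at each position."""
    total = 0.0
    for logits, target in zip(logits_per_token, targets):
        total -= math.log(softmax(logits)[target])
    return total / len(targets)


# Two positions, vocabulary of four tokens, uninformative logits.
print(sequence_loss([[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]], [1, 3]))
```

+
+With uniform logits over four tokens every position contributes log 4, which gives a quick sanity
+check for an implementation.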
+Both the low-level and high-level recurrent modules fL and fH are implemented using encoder-
+only Transformer 52 blocks with identical architectures and dimensions. These modules take multiple
+inputs, and we use straightforward element-wise addition to combine them; more sophisticated
+merging techniques, such as gating mechanisms, could potentially improve performance and are
+left for future work. For all Transformer blocks in this work, including those in the baseline
+models, we incorporate the enhancements found in modern LLMs (based on Llama 53
+architectures). These improvements include Rotary Positional Encoding 54 , Gated Linear Units 55 ,
+RMSNorm 56 , and the removal of bias terms from linear layers.
+Furthermore, both HRM and recurrent Transformer models implement a Post-Norm architecture
+
+[Figure 6 images omitted. Panels: (a) ARC-AGI, (b) Sudoku-Hard, (c) Maze navigation,
+(d) Sudoku-Extreme subset difficulty.]
+
+Figure 6: Left: Visualization of benchmark tasks. Right: Difficulty of Sudoku-Extreme examples.
+
+with weights initialized via truncated LeCun Normal initialization 57,58,59 , while the scale and bias
+parameters are excluded from RMSNorm. All parameters are optimized using the Adam-atan2 op-
+timizer 60 , a scale-invariant variant of Adam 61 , combined with a constant learning rate that includes
+linear warm-up.
+
+
+3     Results
+This section begins by describing the ARC-AGI, Sudoku, and Maze benchmarks, followed by an
+overview of the baseline models and their results. Figure 6-(a,b,c) presents a visual representa-
+tion of the three benchmark tasks, which are selected to evaluate various reasoning abilities in AI
+models.
+
+3.1    Benchmarks
+ARC-AGI Challenge The ARC-AGI benchmark evaluates general fluid intelligence through IQ-
+test-like puzzles that require inductive reasoning 27 . The initial version, ARC-AGI-1, presents chal-
+lenges as input-label grid pairs that force AI systems to extract and generalize abstract rules from
+just a few examples. Each task provides a few input–output demonstration pairs (usually 2–3) and
+a test input. An AI model has two attempts to produce the correct output grid. Although some be-
+lieve that mastering ARC-AGI would signal true artificial general intelligence, its primary purpose
+is to expose the current roadblocks in AGI progress. In fact, both conventional deep learning meth-
+ods and CoT techniques have faced significant challenges with ARC-AGI-1, primarily because it
+requires the ability to generalize to entirely new tasks 28 .
+Addressing the limitations identified in ARC-AGI-1, ARC-AGI-2 significantly expands the bench-
+mark by providing a more comprehensive and carefully refined collection of tasks. These new
+tasks emphasize deeper compositional reasoning, multi-step logic, contextual rule application, and
+symbolic abstraction. Human calibration studies show these tasks are challenging but doable for
+people, while being much harder for current AI systems, offering a clearer measure of general
+reasoning abilities 29 .
+
+Sudoku-Extreme Sudoku is a 9×9 logic puzzle, requiring each row, column, and 3×3 block to
+contain the digits 1–9 exactly once. A prediction is considered correct if it exactly matches the
+puzzle’s unique solution. Sudoku’s complex logical structure makes it a popular benchmark for
+evaluating logical reasoning in machine learning 62,63,64 .
+The most frequently used Sudoku dataset in research, namely the Kaggle dataset 65 , can be fully
+solved using elementary single-digit techniques 66 . The minimal 17-clue puzzles 62 , another widely
+used collection, might seem more challenging due to their small number of clues. However, this
+perception is misleading: because 17 is the minimum number of clues required to guarantee
+a unique Sudoku solution, these hints need to be highly orthogonal to each other. This orthogonal
+arrangement leads to many direct, easily resolved solution paths 67 .
+We introduce Sudoku-Extreme, a more challenging dataset that is compiled from the aforemen-
+tioned easy datasets as well as puzzles recognized by the Sudoku community as exceptionally
+difficult for human players:
+• Easy puzzles compiled from Kaggle, 17-clue, plus unbiased samples from the Sudoku puzzle
+  distribution 67 : totaling 1 149 158 puzzles.
+• Challenging puzzles compiled from Magictour 1465, Forum-Hard and Forum-Extreme subsets:
+  totaling 3 104 157 puzzles.
+The compiled data then undergo a strict 90/10 train-test split, ensuring that the test set puzzles
+cannot be derived through equivalent transformations of any training samples. Sudoku-Extreme is
+a down-sampled subset of this data containing 1000 training examples. We use Sudoku-Extreme in
+our main experiments (Figure 1), which focus on small-sample learning scenarios. To guarantee
+convergence and control overfitting effects in our analysis experiments (Figures 2, 3 and 5), we use
+the complete training data, Sudoku-Extreme-Full, containing 3 831 994 examples.
+We measure puzzle difficulty by counting the number of search backtracks (“guesses”) required
+by a smart Sudoku solver program tdoku, which uses propositional logic to reduce the number of
+guesses 67 . Our Sudoku-Extreme dataset exhibits a mean difficulty of 22 backtracks per puzzle,
+significantly higher than existing datasets, including the recent handmade Sudoku-Bench puzzles 68 , which
+average just 0.45 backtracks per puzzle. These subset complexity levels are shown in Figure 6-(d).
+Maze-Hard This task involves finding the optimal path in a 30×30 maze, making it interpretable
+and frequently used for training LLMs in search tasks 69,70,71 . We adopt the instance generation
+procedure of Lehnert et al. 71 , but introduce an additional filter to retain only those instances whose
+difficulty exceeds 110. Here, “difficulty” is defined as the length of the shortest path, which aligns
+with the linear time complexity of the wavefront breadth-first search algorithm on GPUs 72 . A path
+is considered correct if it is valid and optimal—that is, the shortest route from the start to the goal.
+The training and test sets each include 1000 examples.
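+
+The difficulty measure can be sketched with an ordinary breadth-first search. This is a CPU
+illustration, not the wavefront GPU algorithm cited above; the grid encoding ('#' for walls) and
+the choice to count moves rather than visited cells are assumptions.
+

```python
from collections import deque


def shortest_path_length(grid, start, goal):
    """Moves on a shortest 4-connected path from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return dist[(r, c)]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return None


def is_hard(grid, start, goal, threshold=110):
    """Keep only instances whose shortest path is longer than the threshold."""
    length = shortest_path_length(grid, start, goal)
    return length is not None and length > threshold


maze = [
    "S.#.",
    ".##.",
    "...G",
]
print(shortest_path_length(maze, (0, 0), (2, 3)))  # 5 moves: down, down, right, right, right
```

+
+The same routine also serves as a checker: a predicted path is optimal exactly when it is valid
+and its length equals the value returned here.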
+
+3.2    Evaluation Details
+For all benchmarks, HRM models were initialized with random weights and trained in the sequence-
+to-sequence setup using the input-output pairs. The two-dimensional input and output grids were
+flattened and then padded to the maximum sequence length. The resulting performance is shown in
+Figure 1. Remarkably, HRM attains these results with just ~1000 training examples per task—and
+without pretraining or CoT labels.
+
+For the ARC-AGI challenge, we start with (1) all demonstration and test input-label pairs from the
+training set, and (2) all demonstration pairs along with test inputs from the evaluation set. The
+dataset is augmented by applying translations, rotations, flips, and color permutations to the puz-
+zles. Each task example is prepended with a learnable special token that represents the puzzle it
+belongs to. At test time, we proceed as follows for each test input in the evaluation set: (1) Gener-
+ate and solve 1000 augmented variants and, for each, apply the inverse-augmentation transform to
+obtain a prediction. (2) Choose the two most popular predictions as the final outputs.3 All reported
+results are obtained by comparing the outputs with the withheld test labels from the evaluation set.
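+
+The test-time procedure (augment, solve, undo the augmentation, vote) can be sketched as below.
+Only 90-degree rotations are used as the augmentation, and `solve` is a hypothetical stand-in for
+the trained model; the paper's full augmentation set and its 1000 variants are not reproduced.
+

```python
from collections import Counter


def rotate(grid, k):
    """Rotate a grid (tuple of tuples) clockwise by k quarter turns; negative k undoes it."""
    for _ in range(k % 4):
        grid = tuple(zip(*grid[::-1]))
    return grid


def vote(test_input, solve, ks=(0, 1, 2, 3), top=2):
    """Solve each rotated variant, map predictions back, and return the most popular ones."""
    predictions = []
    for k in ks:
        output = solve(rotate(test_input, k))
        predictions.append(rotate(output, -k))  # inverse-augmentation transform
    return [grid for grid, _count in Counter(predictions).most_common(top)]


def invert_colors(grid):
    """Toy 'model' used only for this demonstration: swap colors 0 and 1."""
    return tuple(tuple(1 - v for v in row) for row in grid)


example = ((0, 1), (1, 1))
print(vote(example, invert_colors))
```

+
+Predictions are stored as tuples of tuples so that they are hashable and can be counted directly;
+with `top=2` the result corresponds to the two attempts allowed per test input.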
+We augment Sudoku puzzles by applying band and digit permutations, while data augmentation is
+disabled for Maze tasks. Both tasks undergo only a single inference pass.
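+
+A validity-preserving Sudoku augmentation can be sketched as follows. The helper names are
+hypothetical, only row bands are permuted (column bands could be handled the same way on the
+transposed grid), and the base grid is a standard constructed pattern used purely as test data.
+

```python
import random


def base_solution():
    """A standard valid 9x9 Sudoku pattern, used here only as test data."""
    return [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]


def is_valid(grid):
    """Check that every row, column and 3x3 box contains the digits 1-9 exactly once."""
    digits = set(range(1, 10))
    rows_ok = all(set(row) == digits for row in grid)
    cols_ok = all({grid[r][c] for r in range(9)} == digits for c in range(9))
    boxes_ok = all(
        {grid[br + i][bc + j] for i in range(3) for j in range(3)} == digits
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    )
    return rows_ok and cols_ok and boxes_ok


def augment(grid, rng):
    """Relabel the digits and reorder the three row bands; 0 (blank) stays 0."""
    relabel = list(range(1, 10))
    rng.shuffle(relabel)
    mapping = {0: 0}
    mapping.update({d: relabel[d - 1] for d in range(1, 10)})
    bands = [0, 1, 2]
    rng.shuffle(bands)
    row_order = [3 * b + i for b in bands for i in range(3)]
    return [[mapping[v] for v in grid[r]] for r in row_order]


solved = base_solution()
print(is_valid(solved), is_valid(augment(solved, random.Random(0))))
```

+
+A puzzle and its solution must be transformed with the same permutation, for example by calling
+`augment` on both with generators created from the same seed.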
+For ARC-AGI, the scores of the CoT models are taken from the official leaderboard 29 , while for
+Sudoku and Maze, the scores are obtained by evaluating through the corresponding API.
+In Figure 1, the baselines are grouped based on whether they are pre-trained and use CoT, or neither.
+The “Direct pred” baseline means using “direct prediction without CoT and pre-training”, which
+retains the exact training setup of HRM but swaps in a Transformer architecture. Interestingly, on
+ARC-AGI-1, “Direct pred” matches the performance of Liao and Gu 73 , who built a carefully de-
+signed, domain-specific equivariant network for learning the ARC-AGI task from scratch, without
+pre-training. By substituting the Transformer architecture with HRM’s hierarchical framework and
+implementing ACT, we achieve more than a twofold performance improvement.
+On the Sudoku-Extreme and Maze-Hard benchmarks, the performance gap between HRM and the
+baseline methods is significant, as the baselines almost never manage to solve the tasks. These
+benchmarks that demand lengthy reasoning traces are particularly difficult for CoT-based methods.
+With only 1000 training examples, the “Direct pred” baseline—which employs an 8-layer Trans-
+former identical in size to HRM—fails entirely on these challenging reasoning problems. When
+trained on the larger Sudoku-Extreme-Full dataset, however, “Direct pred” can solve some easy
+Sudoku puzzles and reaches 16.9% accuracy (see Figure 2). Lehnert et al. 71 showed that a large
+vanilla Transformer model with 175M parameters, trained on 1 million examples across multiple
+trials, achieved only marginal success on 30x30 Maze tasks, with accuracy below 20% using the
+pass@64 evaluation metric.
+
+3.3       Visualization of intermediate timesteps
+Although HRM demonstrates strong performance on complex reasoning tasks, it raises an intrigu-
+ing question: what underlying reasoning algorithms does the HRM neural network actually imple-
+ment? Addressing this question is important for enhancing model interpretability and developing a
+deeper understanding of the HRM solution space.
+While a definitive answer lies beyond our current scope, we begin our investigation by analyzing
+state trajectories and their corresponding solution evolution. More specifically, at each timestep
+$i$, given the low-level and high-level state pair ($z_L^i$ and $z_H^i$), we perform a preliminary forward
+pass through the H-module to obtain $\bar{z}^i = f_H(z_H^i, z_L^i; \theta_H)$ and its corresponding decoded prediction
+$\bar{y}^i = f_O(\bar{z}^i; \theta_O)$. The prediction $\bar{y}^i$ is then visualized in Figure 7.
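+The decoding procedure described above can be sketched as follows. This is only an illustration
+under stated assumptions, not the authors' code: f_H and f_O are hypothetical stand-ins built from
+random (untrained) weights, and the per-cell state layout (81 cells, 16 features, 10 symbols) is
+chosen arbitrarily to resemble a Sudoku grid.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_cells, d_model, n_symbols = 4, 81, 16, 10

# Hypothetical stand-ins for theta_H and theta_O: random, NOT trained HRM weights.
W_hh = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
W_hl = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
W_out = rng.normal(scale=d_model ** -0.5, size=(d_model, n_symbols))


def f_H(z_H, z_L):
    """Toy H-module update: combines the high-level and low-level states."""
    return np.tanh(z_H @ W_hh + z_L @ W_hl)


def f_O(z_bar):
    """Toy output head: picks the highest-scoring symbol for every cell."""
    return (z_bar @ W_out).argmax(axis=-1)


def intermediate_predictions(z_H_traj, z_L_traj):
    """For each timestep i, compute z_bar^i = f_H(z_H^i, z_L^i) and decode it."""
    return [f_O(f_H(z_H, z_L)) for z_H, z_L in zip(z_H_traj, z_L_traj)]


# Placeholder trajectories; in practice these would be states recorded from a model.
z_H_traj = rng.normal(size=(n_steps, n_cells, d_model))
z_L_traj = rng.normal(size=(n_steps, n_cells, d_model))

preds = intermediate_predictions(z_H_traj, z_L_traj)
grids = [p.reshape(9, 9) for p in preds]  # one 9x9 grid of symbols per timestep
print(len(grids), grids[0].shape)
```

+Because the extra pass through f_H is applied to copies of the recorded states, the sketch leaves
+the trajectory itself untouched; it only produces one decoded grid per timestep for plotting.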
+In the Maze task, HRM appears to initially explore several potential paths simultaneously, subse-
+quently eliminating blocked or inefficient routes, then constructing a preliminary solution outline
+[Footnote 3: ARC-AGI allows two attempts for each test input.]
+
+          [Figure 7 image. Panels: Maze-Hard, timesteps i = 0 to 6; Sudoku-Extreme, initial grid and
+          timesteps i = 0 to 7; ARC-AGI-2 tasks 7666fa5d (timesteps i = 0 to 4) and 7b80bb43
+          (timesteps i = 0 to 6), each with example input, example output, and test input.]
+          Figure 7: Visualization of intermediate predictions by HRM on benchmark tasks. Top: Maze-
+          Hard—blue cells indicate the predicted path. Middle: Sudoku-Extreme—bold cells represent ini-
+          tial givens; red highlights cells violating Sudoku constraints; grey shading indicates changes from
+          the previous timestep. Bottom: ARC-AGI-2 Task—left: provided example input-output pair; right:
+          intermediate steps solving the test input.
+
+          followed by multiple refinement iterations. In Sudoku, the strategy resembles a depth-first search
+          approach, where the model appears to explore potential solutions and backtracks when it hits dead
+          ends. HRM uses a different approach for ARC tasks, making incremental adjustments to the board
+          and iteratively improving it until reaching a solution. Unlike Sudoku, which involves frequent
+          backtracking, the ARC solution path follows a more consistent progression similar to hill-climbing
+          optimization.
+          Importantly, the model shows that it can adapt to different reasoning approaches, likely choosing an
+          effective strategy for each particular task. Further research is needed to gain more comprehensive
+          insights into these solution strategies.
+
+
+          4                 Brain Correspondence
+          A key principle from systems neuroscience is that a brain region’s functional repertoire—its ability
+          to handle diverse and complex tasks—is closely linked to the dimensionality of its neural represen-
+          tations 75,76 . Higher-order cortical areas, responsible for complex reasoning and decision-making,
+          must handle a wide variety of tasks, demanding more flexible and context-dependent processing 77 .
+          In dynamical systems, this flexibility is often realized through higher-dimensional state-space tra-
+          jectories, which allow for a richer repertoire of potential computations 78 . This principle gives rise
+          to an observable dimensionality hierarchy, where a region’s position in the processing hierarchy
+
+[Figure 8 image. Panels (a)-(f); panel (b) plots Participation Ratio (PR) against position in the hierarchy.]
+
+Figure 8: Hierarchical Dimensionality Organization in the HRM and Mouse Cortex. (a,b) are
+adapted from Posani et al. 74 . (a) Anatomical illustration of mouse cortical areas, color-coded by
+functional modules. (b) Correlation between Participation Ratio (PR), a measure of effective neural
+dimensionality, and hierarchical position across different mouse cortical areas. Higher positions in
+the hierarchy (e.g., MOs, ACAd) exhibit significantly higher PR values compared to lower sensory
+areas (e.g., SSp-n), with a Spearman correlation coefficient of ρ = 0.79 (P = 0.0003). (c,d) Trained
+HRM. (c) PR scaling of the trained HRM with task diversity. The dimensionality of the high-
+level module (zH ) scales with the number of unique tasks (trajectories) included in the analysis,
+indicating an adaptive expansion of its representational capacity. In contrast, the low-level module’s
+(zL ) dimensionality remains stable. (d) PR values for the low-level (zL , PR = 30.22) and high-
+level (zH , PR = 89.95) modules of the trained HRM, computed from neural activity during 100
+unique Sudoku-solving trajectories. A clear dimensionality hierarchy is observed, with the high-
+level module operating in a substantially higher-dimensional space. (e,f) Analysis of Untrained
+Network. To verify that the dimensionality hierarchy is an emergent property of training, the same
+analyses were performed on an untrained HRM with random weights. (e) In contrast to the trained
+model’s scaling in (c), the dimensionality of both modules in the untrained model remains low and
+stable, failing to scale with the number of tasks. (f) Similarly, contrasting with the clear separation
+in (d), the PR values for the untrained model’s modules (zL , PR = 42.09; zH , PR = 40.75) are
+low and nearly identical, showing no evidence of hierarchical separation. This confirms that the
+observed hierarchical organization of dimensionality is a learned property that emerges through
+training, not an artifact of the model’s architecture.
+
+correlates with its effective dimensionality. To quantify this phenomenon, we can examine the
+Participation Ratio (PR), which serves as a standard measure of the effective dimensionality of a
+high-dimensional representation 79 . The PR is calculated using the formula
+\[ \mathrm{PR} = \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2}, \]
+
+where {λi } are the eigenvalues of the covariance matrix of neural trajectories. Intuitively, a higher
+PR value signifies that variance is distributed more evenly across many dimensions, corresponding
+to a higher-dimensional representation. Conversely, a lower PR value indicates that variance is
+concentrated in only a few principal components, reflecting a more compact, lower-dimensional
+structure.
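+As a minimal sketch of the PR formula above (not the authors' analysis code), the snippet below
+estimates the participation ratio from a matrix of recorded states. The function name
+participation_ratio and the (samples x units) array layout are assumptions made for illustration.

```python
import numpy as np


def participation_ratio(states: np.ndarray) -> float:
    """Effective dimensionality of `states`, shaped (n_samples, n_units).

    Computes PR = (sum_i lambda_i)^2 / sum_i lambda_i^2, where lambda_i are the
    eigenvalues of the sample covariance matrix of the states.
    """
    centered = states - states.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (states.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)        # the covariance matrix is symmetric
    eigvals = np.clip(eigvals, 0.0, None)    # remove tiny negative round-off values
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())


# Variance spread evenly over 4 orthogonal directions gives PR = 4.
even = np.vstack([np.eye(4), -np.eye(4)])
# Variance confined to a single direction gives PR = 1.
line = np.outer(np.linspace(-1.0, 1.0, 50), np.ones(5))

print(participation_ratio(even), participation_ratio(line))
```

+The two toy inputs bracket the interpretation given in the text: evenly distributed variance
+yields a PR equal to the number of occupied dimensions, while variance concentrated in one
+principal component yields a PR of one.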
+The dimensionality hierarchy can be observed, for example, in the mouse cortex, where the PR of
+population activity increases monotonically from low-level sensory areas to high-level associative
+areas, supporting this link between dimensionality and functional complexity 74 (Figure 8 (a,b)).
+We evaluated whether HRM reproduces this neuroscientific principle by calculating the PR for
+both recurrent modules after training on the Sudoku-Extreme Full dataset. The PR computation
+used the covariance matrix derived from neural states gathered across multiple Sudoku-solving
+trajectories. The results show a striking parallel to the biological findings. The low-level module’s
+state (zL ) occupies a relatively small subspace with a participation ratio of 30.22, whereas the high-
+level module’s state (zH ) operates in a substantially larger subspace with a participation ratio of
+89.95, as shown in Figure 8(d). Furthermore, Figure 8(c) shows that increasing the number of
+unique tasks (trajectories) from 10 to 100 causes zH dimensionality to scale up accordingly, while
+zL dimensionality remains stable. These results suggest an emergent separation of representational
+capacity between the modules that parallels their functional roles.
+To confirm that this hierarchical organization is an emergent property of training, and not an artifact
+of the network’s architecture, we performed a control analysis: we initialized an identical HRM
+architecture with random weights and, without any training, measured the PR of its modules as the
+network processed the same task-specific inputs given to the trained model.
+The results, shown in Figure 8(e,f), reveal a stark contrast: the high-level and low-level modules of
+the untrained network exhibit no hierarchical separation, with their PR values remaining low and
+nearly indistinguishable from each other. This control analysis validates that the dimensionality
+hierarchy is an emergent property that arises as the model learns to perform complex reasoning.
+The high-to-low PR ratio in HRM (zH /zL ≈ 2.98) closely matches that measured in the mouse
+cortex (≈ 2.25). In contrast, conventional deep networks often exhibit neural collapse, where
+last-layer features converge to a low-dimensional subspace 80,81,82 . HRM therefore departs from the
+collapse pattern and instead fosters a high-dimensional representation in its higher module. This
+is significant because such representations are considered crucial for cognitive flexibility and are a
+hallmark of higher-order brain regions like the prefrontal cortex (PFC), which is central to complex
+reasoning.
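+The quoted HRM ratio follows directly from the PR values reported above. The untrained-network
+ratio shown alongside it is not stated in the text; it is computed here from the values in the
+Figure 8 caption, purely for comparison.

```latex
\[
\left.\frac{\mathrm{PR}(z_H)}{\mathrm{PR}(z_L)}\right|_{\text{trained}}
  = \frac{89.95}{30.22} \approx 2.98,
\qquad
\left.\frac{\mathrm{PR}(z_H)}{\mathrm{PR}(z_L)}\right|_{\text{untrained}}
  = \frac{40.75}{42.09} \approx 0.97.
\]
```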
+This structural parallel suggests that, by learning to partition its representations into a
+high-capacity, high-dimensional subspace (zH ) and a more specialized, low-dimensional one (zL ),
+HRM autonomously discovers an organizational principle that is thought to be fundamental for
+achieving robust and flexible reasoning in biological
+systems. This provides a potential mechanistic explanation for the model’s success on complex,
+long-horizon tasks that are intractable for models lacking such a differentiated internal structure.
+We emphasize, however, that this evidence is correlational. While a causal link could be tested
+via intervention (e.g., by constraining the H-module’s dimensionality), such methods are difficult
+to interpret in deep learning due to potential confounding effects on the training process itself.
+Thus, the causal necessity of this emergent hierarchy remains an important question for future
+investigation.
+
+
+5    Related Work
+Reasoning and algorithm learning Given the central role of reasoning problems and their close
+relation to algorithms, researchers have long explored neural architectures that enable algorithm
+learning from training instances. This line of work includes Neural Turing Machines (NTM) 83 ,
+the Differentiable Neural Computer (DNC) 84 , and Neural GPUs 85 , all of which construct iterative
+neural architectures that mimic computational hardware for algorithm execution, and are trained to
+learn algorithms from data. Another notable work in this area is Recurrent Relational Networks
+(RRN) 62 , which executes algorithms on graph representations through graph neural networks.
+Recent studies have integrated algorithm learning approaches with Transformer-based architec-
+tures. Universal Transformers extend the standard Transformer model by introducing a recurrent
+loop over the layers and implementing an adaptive halting mechanism. Geiping et al. 86 demonstrate
+that looped Transformers can generalize to a larger number of recurrent steps during inference than
+what they were trained on. Shen et al. 16 propose adding continuous recurrent reasoning tokens
+to the Transformer. Finally, TransNAR 8 combines recurrent graph neural networks with language
+models.
+Building on the success of CoT-based reasoning, a line of work has introduced fine-tuning
+methods that use reasoning paths from search algorithms (like A*) as SFT targets 87,71,70 .
+We also mention adaptive halting mechanisms designed to allocate additional computational re-
+sources to more challenging problems. This includes the Adaptive Computation Time (ACT) for
+RNNs 88 and follow-up research like PonderNet 89 , which aims to improve the stability of this allo-
+cation process.
+HRM further pushes the boundary of algorithm learning through a brain-inspired computational
+architecture that achieves exceptional data efficiency and model expressiveness, successfully dis-
+covering complex and diverse algorithms from just 1000 training examples.
+Brain-inspired reasoning architectures Developing a model with the reasoning power of the
+brain has long been a goal in brain-inspired computing. Spaun 90 is one notable example, which uses
+spiking neural networks to create distinct modules corresponding to brain regions like the visual
+cortex and prefrontal cortex. This design enables an architecture to perform a range of cognitive
+tasks, from memory recall to simple reasoning puzzles. However, its reasoning relies on hand-
+designed algorithms, which may limit its ability to learn new tasks. Another significant model is the
+Tolman-Eichenbaum Machine (TEM) 91 , which is inspired by the hippocampal-entorhinal system’s
+role in spatial and relational memory tasks. TEM proposes that medial entorhinal cells create a
+basis for structural knowledge, while hippocampal cells link this basis to sensory information. This
+allows TEM to generalize and explains the emergence of various cell types like grid, border, and
+place cells. Another approach involves neural sampling models 92 , which view the neural signaling
+process as inference over a distribution, functioning similarly to a Boltzmann machine. These
+models often require hand-made rules to be set up for solving a specific reasoning task. In essence,
+while prior models are restricted to simple reasoning problems, HRM is designed to solve complex
+tasks that are hard for even advanced LLMs, without pre-training or task-specific manual design.
+Hierarchical memory The hierarchical multi-timescale structure also plays an important role in
+how the brain processes memory. Models such as Hierarchical Sequential Models 93 and Clockwork
+RNN 94 use multiple recurrent modules that operate at varying time scales to more effectively cap-
+ture long-range dependencies within sequences, thereby mitigating the forgetting issue in RNNs.
+Similar mechanisms have also been adopted in linear attention methods for memorizing long con-
+texts (see the Discussions section). Since HRM focuses on reasoning, full attention is applied for
+simplicity. Incorporating hierarchical memory into HRM could be a promising future direction.
+
+
+6    Discussions
+Turing-completeness of HRM Like earlier neural reasoning algorithms including the Universal
+Transformer 95 , HRM is computationally universal when given sufficient memory and time.
+In other words, it falls into the category of models that can simulate any Turing machine,
+overcoming the computational limitations of standard Transformers discussed previously in the
+introduction. Given that earlier neural algorithm reasoners were trained as recurrent neural networks,
+they suffer from premature convergence and memory-intensive BPTT. Therefore, in practice, their
+effective computational depth remains limited, though still deeper than that of a standard Trans-
+former. By resolving these two challenges and being equipped with adaptive computation, HRM
+could be trained on long reasoning processes, solve complex puzzles requiring intensive depth-first
+search and backtracking, and move closer to practical Turing-completeness.
+Reinforcement learning with chain-of-thought Beyond fine-tuning using human-annotated CoT,
+reinforcement learning (RL) represents another widely adopted training methodology. However,
+recent evidence suggests that RL primarily unlocks existing CoT-like capabilities rather than dis-
+covering fundamentally new reasoning mechanisms 96,97,98,99 . Additionally, CoT-training with RL
+is known for its instability and data inefficiency, often requiring extensive exploration and careful
+reward design. In contrast, HRM takes feedback from dense gradient-based supervision rather than
+relying on a sparse reward signal. Moreover, HRM operates naturally in a continuous space, which
+is biologically plausible and avoids allocating the same computational resources to each token, even
+though tokens vary in their reasoning and planning complexity 16 .
+Linear attention Recurrence has been explored not only for its capability in universal computa-
+tion, but also as a means to replace the attention mechanism in Transformers, which suffers from
+quadratic time and memory complexity 100 . Recurrent alternatives offer a more efficient design by
+processing input tokens sequentially and predicting the next token at each time step, similar to early
+RNN-based language models.
+Some linear-attention variants, such as Log-linear Attention 101 , share an RNN-like state-update that
+can be interpreted as propagating multi-timescale summary statistics, thereby retaining long-range
+context without the quadratic memory growth of standard self-attention. However, substituting the
+attention mechanism alone does not change the fact that Transformers are still fixed-depth, and
+require CoT as a compensatory mechanism. Notably, linear attention can operate with a reduced
+key-value cache over extended contexts, making it more suitable for deployment on resource-
+constrained edge devices.
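+The RNN-like state update mentioned above can be illustrated with plain causal linear attention.
+This is a generic sketch, not Log-linear Attention and not part of HRM; the function names, and
+the omission of a feature map and a normalizer, are simplifying assumptions. It checks that the
+recurrent form, which carries only a fixed-size state, matches the masked quadratic form.

```python
import numpy as np


def linear_attention_recurrent(Q, K, V):
    """Causal linear attention computed with a running state S of shape (d_k, d_v)."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        S += np.outer(K[t], V[t])   # state update: S_t = S_{t-1} + k_t v_t^T
        out[t] = Q[t] @ S           # readout: o_t = q_t^T S_t
    return out


def linear_attention_quadratic(Q, K, V):
    """Same quantity via a T x T score matrix with a causal (lower-triangular) mask."""
    scores = np.tril(Q @ K.T)
    return scores @ V


rng = np.random.default_rng(0)
T, d_k, d_v = 12, 4, 3
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))

match = np.allclose(linear_attention_recurrent(Q, K, V),
                    linear_attention_quadratic(Q, K, V))
print(match)
```

+The recurrent version stores d_k x d_v numbers regardless of sequence length, whereas the
+quadratic version materializes a T x T score matrix, which is the memory contrast drawn in the text.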
+
+
+7    Conclusion
+This work introduces the Hierarchical Reasoning Model, a brain-inspired architecture that lever-
+ages hierarchical structure and multi-timescale processing to achieve substantial computational
+depth without sacrificing training stability or efficiency. With only 27M parameters and train-
+ing on just 1000 examples, HRM effectively solves challenging reasoning problems such as ARC,
+Sudoku, and complex maze navigation, tasks that typically pose significant difficulties for
+contemporary LLM and chain-of-thought models.
+Although the brain relies heavily on hierarchical structures to enable most cognitive processes,
+these concepts have largely remained confined to academic literature rather than being translated
+into practical applications. The prevailing AI approach continues to favor non-hierarchical models.
+Our results challenge this established paradigm and suggest that the Hierarchical Reasoning Model
+represents a viable alternative to the currently dominant chain-of-thought reasoning methods, ad-
+vancing toward a foundational framework capable of Turing-complete universal computation.
+Acknowledgements We thank Mingli Yuan, Ahmed Murtadha Hasan Mahyoub and Hengshuai
+Yao for their insightful discussions and valuable feedback throughout the course of this work.
+
+References
+ 1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
+    http://www.deeplearningbook.org.
+ 2. Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
+    recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
+    pages 770–778, 2015.
+ 3. Lena Strobl. Average-hard attention transformers are constant-depth uniform threshold
+    circuits, 2023.
+ 4. Tom Bylander. Complexity results for planning. In Proceedings of the 12th International Joint
+    Conference on Artificial Intelligence - Volume 1, IJCAI’91, page 274–279, San Francisco,
+    CA, USA, 1991. Morgan Kaufmann Publishers Inc. ISBN 1558601600.
+ 5. William Merrill and Ashish Sabharwal. A logic for expressing log-precision transformers. In
+    Neural Information Processing Systems, 2023.
+ 6. David Chiang. Transformers in DLOGTIME-uniform TC^0. Transactions on Machine
+    Learning Research, 2025.
+ 7. Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul McVay, Michael
+    Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search
+    dynamics bootstrapping. In First Conference on Language Modeling, 2024.
+ 8. Wilfried Bounsi, Borja Ibarz, Andrew Dudzik, Jessica B. Hamrick, Larisa Markeeva, Alex
+    Vitvitskyi, Razvan Pascanu, and Petar Veličković. Transformers meet neural algorithmic
+    reasoners. ArXiv, abs/2406.09308, 2024.
+ 9. William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision
+    transformers. Transactions of the Association for Computational Linguistics, 11:531–545,
+    2023. doi: 10.1162/tacl_a_00562.
+10. Jason Wei, Yi Tay, et al. Chain-of-thought prompting elicits reasoning in large language
+    models, 2022. arXiv preprint arXiv:2201.11903.
+11. William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of
+    thought. In ICLR, 2024.
+12. Xinyun Chen, Ryan A. Chi, Xuezhi Wang, and Denny Zhou. Premise order matters in
+    reasoning with large language models. ArXiv, abs/2402.08939, 2024.
+13. Rongwu Xu, Zehan Qi, and Wei Xu. Preemptive answer "attacks" on chain-of-thought
+    reasoning. In Annual Meeting of the Association for Computational Linguistics, 2024.
+14. Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius
+    Hobbhahn. Will we run out of data? limits of llm scaling based on human-generated data.
+    arXiv preprint arXiv:2211.04325, 2022.
+15. Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang,
+    Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A comprehensive
+    survey on latent chain-of-thought reasoning, 2025.
+16. Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu.
+    Training large language models to reason in a continuous latent space. arXiv preprint
+    arXiv:2412.07423, 2024.
+17. Evelina Fedorenko, Steven T Piantadosi, and Edward AF Gibson. Language is primarily a
+    tool for communication rather than thought. Nature, 630(8017):575–586, 2024.
+18. Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei.
+    Deepnet: Scaling transformers to 1,000 layers. IEEE Transactions on Pattern Analysis and
+    Machine Intelligence, 2024.
+19. Timothy P Lillicrap and Adam Santoro. Backpropagation through time and the brain. Current
+    Opinion in Neurobiology, 55:82–89, 2019. ISSN 0959-4388. doi: https://doi.org/10.1016/j.
+    conb.2019.01.011.
+20. John D Murray, Alberto Bernacchia, David J Freedman, Ranulfo Romo, Jonathan D Wallis,
+    Xinying Cai, Camillo Padoa-Schioppa, Tatiana Pasternak, Hyojung Seo, Daeyeol Lee, et al.
+    A hierarchy of intrinsic timescales across primate cortex. Nature neuroscience, 17(12):1661–
+    1663, 2014.
+21. Roxana Zeraati, Yan-Liang Shi, Nicholas A Steinmetz, Marc A Gieselmann, Alexander
+    Thiele, Tirin Moore, Anna Levina, and Tatiana A Engel. Intrinsic timescales in the
+    visual cortex change with selective attention and reflect spatial connectivity. Nature
+    communications, 14(1):1858, 2023.
+22. Julia M Huntenburg, Pierre-Louis Bazin, and Daniel S Margulies. Large-scale gradients in
+    human cortical organization. Trends in cognitive sciences, 22(1):21–31, 2018.
+23. Victor AF Lamme and Pieter R Roelfsema. The distinct modes of vision offered by
+    feedforward and recurrent processing. Trends in neurosciences, 23(11):571–579, 2000.
+24. Andre M Bastos, W Martin Usrey, Rick A Adams, George R Mangun, Pascal Fries, and Karl J
+    Friston. Canonical microcircuits for predictive coding. Neuron, 76(4):695–711, 2012.
+25. Klara Kaleb, Barbara Feulner, Juan Gallego, and Claudia Clopath. Feedback control guides
+    credit assignment in recurrent neural networks. Advances in Neural Information Processing
+    Systems, 37:5122–5144, 2024.
+26. Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton.
+    Backpropagation and the brain. Nature Reviews Neuroscience, 21(6):335–346, 2020.
+27. François Chollet. On the measure of intelligence (abstraction and reasoning corpus), 2019.
+    arXiv preprint arXiv:1911.01547.
+28. Francois Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. Arc prize 2024:
+    Technical report. ArXiv, abs/2412.04604, 2024.
+29. Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
+    agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
+    2025.
+30. György Buzsáki. Gamma, alpha, delta, and theta oscillations govern cognitive processes.
+    International Journal of Psychophysiology, 39:241–248, 2000.
+31. György Buzsáki. Rhythms of the Brain. Oxford university press, 2006.
+32. Anja Pahor and Norbert Jaušovec. Theta–gamma cross-frequency coupling relates to the level
+    of human intelligence. Intelligence, 46:283–290, 2014.
+33. Adriano BL Tort, Robert W Komorowski, Joseph R Manns, Nancy J Kopell, and Howard
+    Eichenbaum. Theta–gamma coupling increases during the learning of item–context
+    associations. Proceedings of the National Academy of Sciences, 106(49):20942–20947, 2009.
+34. Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between
+    energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11,
+    2016.
+35. Guillaume Bellec, Franz Scherr, Anand Subramoney, Elias Hajek, Darjan Salaj, Robert
+    Legenstein, and Wolfgang Maass. A solution to the learning dilemma for recurrent
+    networks of spiking neurons. Nature Communications, 11, 07 2020. doi: 10.1038/
+    s41467-020-17236-y.
+36. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in
+    Neural Information Processing Systems, pages 690–701, 2019.
+37. Zhengyang Geng, Xinyu Zhang, Shaojie Bai, Yisen Wang, and Zhouchen Lin. On training
+    implicit models. ArXiv, abs/2111.05177, 2021.
+38. Katarina Begus and Elizabeth Bonawitz. The rhythm of learning: Theta oscillations as an
+    index of active learning in infancy. Developmental Cognitive Neuroscience, 45:100810, 2020.
+    ISSN 1878-9293. doi: https://doi.org/10.1016/j.dcn.2020.100810.
+39. Shaojie Bai, Zhengyang Geng, Yash Savani, and J. Zico Kolter. Deep Equilibrium
+    Optical Flow Estimation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern
+    Recognition (CVPR), pages 610–620, 2022.
+40. Zaccharie Ramzi, Florian Mannel, Shaojie Bai, Jean-Luc Starck, Philippe Ciuciu, and
+    Thomas Moreau. Shine: Sharing the inverse estimate from the forward pass for bi-level
+    optimization and implicit models. ArXiv, abs/2106.00553, 2021.
+41. Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Stabilizing equilibrium models by jacobian
+    regularization. In International Conference on Machine Learning, 2021.
+42. Daniel Kahneman and P Egan. Thinking, fast and slow (farrar, straus and giroux, new york),
+    2011.
+43. Matthew D Lieberman. Social cognitive neuroscience: a review of core processes. Annu. Rev.
+    Psychol., 58(1):259–289, 2007.
+44. Randy L Buckner, Jessica R Andrews-Hanna, and Daniel L Schacter. The brain’s default
+    network: anatomy, function, and relevance to disease. Annals of the new York Academy of
+    Sciences, 1124(1):1–38, 2008.
+45. Marcus E Raichle. The brain’s default mode network. Annual review of neuroscience, 38(1):
+    433–447, 2015.
+46. Andrew Westbrook and Todd S Braver. Cognitive effort: A neuroeconomic approach.
+    Cognitive, Affective, & Behavioral Neuroscience, 15:395–415, 2015.
+47. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT
+    Press, Cambridge, MA, 2018.
+48. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan
+    Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. ArXiv,
+    abs/1312.5602, 2013.
+49. Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja,
+    Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning,
+    2025.
+50. Shuo Xie and Zhiyuan Li. Implicit bias of AdamW: ℓ∞ norm constrained optimization.
+    ArXiv, abs/2404.04454, 2024.
+51. Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, and Tolga Birdal. Grokking at the edge of
+    numerical stability. In The Thirteenth International Conference on Learning Representations,
+    2025.
+52. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
+    Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural
+    information processing systems, pages 5998–6008, 2017.
+53. Meta AI. Llama 3: State-of-the-art open weight language models. Technical report, Meta,
+    2024. URL https://ai.meta.com/llama/.
+54. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
+    Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
+55. Noam M. Shazeer. Glu variants improve transformer. ArXiv, abs/2002.05202, 2020.
+56. Biao Zhang and Rico Sennrich. Root mean square layer normalization. ArXiv,
+    abs/1910.07467, 2019.
+57. Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-
+    normalizing neural networks. In Neural Information Processing Systems, 2017.
+58. JAX Developers. jax.nn.initializers.lecun_normal. Google Research, 2025. URL
+    https://docs.jax.dev/en/latest/_autosummary/jax.nn.initializers.lecun_normal.html.
+    Accessed June 22, 2025.
+59. Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop.
+    In Neural networks: Tricks of the trade, pages 9–50. Springer, 2002.
+60. Katie E Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak,
+    Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and
+    Jeffrey Pennington. Scaling exponents across parameterizations and optimizers. In Forty-first
+    International Conference on Machine Learning, 2024.
+61. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
+62. Rasmus Berg Palm, Ulrich Paquet, and Ole Winther. Recurrent relational networks. In Neural
+    Information Processing Systems, 2017.
+63. Jieyi Long. Large language model guided tree-of-thought. ArXiv, abs/2305.08291, 2023.
+64. Yilun Du, Jiayuan Mao, and Josh Tenenbaum. Learning iterative reasoning through energy
+    diffusion. ArXiv, abs/2406.11179, 2024.
+65. Kyubyong Park. Can convolutional neural networks crack sudoku puzzles?
+    https://github.com/Kyubyong/sudoku, 2018.
+66. Single-digit techniques. https://hodoku.sourceforge.net/en/tech_singles.php.
+    Accessed: 2025-06-16.
+67. Tom Dillion. Tdoku: A fast sudoku solver and generator.
+    https://t-dillon.github.io/tdoku/, 2025.
+68. Jeffrey Seely, Yuki Imajuku, Tianyu Zhao, Edoardo Cetin, and Llion Jones. Sudoku-bench:
+    Evaluating creative reasoning with sudoku variants. arXiv preprint arXiv:2505.16135, 2025.
+69. Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, and Llion Jones. Continuous
+    thought machines. arXiv preprint arXiv:2505.05522, 2025.
+70. DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, and Qinqing Zheng.
+    Dualformer: Controllable fast and slow thinking by learning with randomized reasoning
+    traces, 2025.
+71. Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul McVay, Michael
+    Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search
+    dynamics bootstrapping. In First Conference on Language Modeling, 2024.
+72. Mubbasir Kapadia, Francisco Garcia, Cory D. Boatright, and Norman I. Badler. Dynamic
+    search on the gpu. In 2013 IEEE/RSJ International Conference on Intelligent Robots and
+    Systems, pages 3332–3337, 2013. doi: 10.1109/IROS.2013.6696830.
+73. Isaac Liao and Albert Gu. Arc-agi without pretraining, 2025. URL
+    https://iliao2345.github.io/blog_posts/arc_agi_without_pretraining/arc_agi_without_pretraining.html.
+74. Lorenzo Posani, Shuqi Wang, Samuel P Muscinelli, Liam Paninski, and Stefano Fusi.
+    Rarely categorical, always high-dimensional: how the neural code changes along the cortical
+    hierarchy. bioRxiv, pages 2024–11, 2025.
+75. Mattia Rigotti, Omri Barak, Melissa R. Warden, Xiao-Jing Wang, Nathaniel D. Daw, Earl K.
+    Miller, and Stefano Fusi. The importance of mixed selectivity in complex cognitive tasks.
+    Nature, 497:585–590, 2013. doi: 10.1038/nature12160.
+76. Valerio Mante, David Sussillo, Krishna V. Shenoy, and William T. Newsome. Context-
+    dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78–84,
+    2013. doi: 10.1038/nature12742.
+77. Earl K. Miller and Jonathan D. Cohen. An integrative theory of prefrontal cortex function.
+    Annual Review of Neuroscience, 24(1):167–202, 2001. doi: 10.1146/annurev.neuro.24.1.167.
+78. Wolfgang Maass. Real-time computing without stable states: a new framework for neural
+    computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002. doi:
+    10.1162/089976602760407955.
+79. Ege Altan, Sara A. Solla, Lee E. Miller, and Eric J. Perreault. Estimating the dimensionality
+    of the manifold underlying multi-electrode neural recordings. PLoS Computational Biology,
+    17(11):e1008591, 2021. doi: 10.1371/journal.pcbi.1008591.
+80. Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the
+    terminal phase of deep learning training. Proceedings of the National Academy of Sciences,
+    117(40):24652–24663, 2020. doi: 10.1073/pnas.2015509117.
+81. Cong Fang, Hangfeng He, Qi Long, and Weijie J. Su. Exploring deep neural networks via
+    layer–peeled model: Minority collapse in imbalanced training. Proceedings of the National
+    Academy of Sciences, 118(43):e2103091118, 2021. doi: 10.1073/pnas.2103091118.
+82. Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu.
+    A geometric analysis of neural collapse with unconstrained features. In Advances in Neural
+    Information Processing Systems, volume 34 of NeurIPS, pages 29820–29834, 2021.
+83. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines, 2014.
+84. Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka
+    Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John
+    Agapiou, et al. Hybrid computing using a neural network with dynamic external memory.
+    Nature, 538(7626):471–476, 2016.
+ 85. Lukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In ICLR, 2016.
+ 86. Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R.
+     Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time
+     compute with latent reasoning: A recurrent depth approach, 2025.
+ 87. Tiedong Liu and Kian Hsiang Low. Goat: Fine-tuned llama outperforms gpt-4 on arithmetic
+     tasks. ArXiv, abs/2305.14201, 2023.
+ 88. Alex Graves. Adaptive computation time for recurrent neural networks. ArXiv,
+     abs/1603.08983, 2016.
+ 89. Andrea Banino, Jan Balaguer, and Charles Blundell. Pondernet: Learning to ponder. ArXiv,
+     abs/2107.05407, 2021.
+ 90. Chris Eliasmith, Terrence C Stewart, Xuan Choo, Trevor Bekolay, Travis DeWolf, Yichuan
+     Tang, and Daniel Rasmussen. A large-scale model of the functioning brain. science, 338
+     (6111):1202–1205, 2012.
+ 91. James CR Whittington, Timothy H Muller, Shirley Mark, Guifen Chen, Caswell Barry, Neil
+     Burgess, and Timothy EJ Behrens. The tolman-eichenbaum machine: unifying space and
+     relational memory through generalization in the hippocampal formation. Cell, 183(5):1249–
+     1263, 2020.
+ 92. Lars Buesing, Johannes Bill, Bernhard Nessler, and Wolfgang Maass. Neural dynamics as
+     sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLoS
+     computational biology, 7(11):e1002211, 2011.
+ 93. Salah Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term
+     dependencies. In D. Touretzky, M.C. Mozer, and M. Hasselmo, editors, Advances in Neural
+     Information Processing Systems, volume 8. MIT Press, 1995.
+ 94. Jan Koutník, Klaus Greff, Faustino J. Gomez, and Jürgen Schmidhuber. A clockwork rnn. In
+     International Conference on Machine Learning, 2014.
+ 95. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser.
+     Universal transformers, 2018. arXiv preprint arXiv:1807.03819.
+ 96. Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng,
+     Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du,
+     and Yelong Shen. Reinforcement learning for reasoning in large language models with one
+     training example, 2025. URL https://arxiv.org/abs/2504.20571.
+ 97. Niklas Muennighoff. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
+ 98. Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu,
+     Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng
+     Zhang. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond, 2025.
+ 99. Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling, 2025.
+100. Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms
+     through structured state space duality. ArXiv, abs/2405.21060, 2024.
+101. Han Guo, Songlin Yang, Tarushii Goel, Eric P Xing, Tri Dao, and Yoon Kim. Log-linear
+     attention. arXiv preprint arXiv:2506.04761, 2025.
+\ No newline at end of file
diff --git a/papers/txt/ptrm2025_probabilistic_trm.txt b/papers/txt/ptrm2025_probabilistic_trm.txt
new file mode 100644
index 0000000..a00f31c
--- /dev/null
+++ b/papers/txt/ptrm2025_probabilistic_trm.txt
@@ -0,0 +1,906 @@
+Probabilistic Tiny Recursive Model
+
+Amin Sghaier (Mila – Quebec AI Institute; ILLS & ETS Montreal)
+Ali Parviz (Mila – Quebec AI Institute)
+Alexia Jolicoeur-Martineau (Independent)
+{amin.sghaier, ali.parviz}@mila.quebec
+alexia.jolicoeur-martineau@mail.mcgill.ca
+
+arXiv:2605.19943v1 [cs.AI] 19 May 2026
+
+
+
+                                                                                                                                                             Abstract
+                                                                 Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of
+                                                                 the parameters of modern large language models (LLMs) by iteratively refining a
+                                                                 latent state and final answer. While powerful, their deterministic recursion can lead
+                                                                 to convergence at suboptimal solutions with no escape mechanism. A common
+                                                                 workaround relies on task-specific input perturbations at test time combined with
+                                                                 answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task-
+                                                                 agnostic framework for test-time compute scaling that addresses this limitation
+                                                                 through stochastic exploration. PTRM injects Gaussian noise at each deep recursion
+                                                                 step, enabling parallel trajectories to explore diverse solution basins, and selects
+                                                                 among them using the model’s existing Q head (used for early stopping in the
+                                                                 original TRM). Without requiring retraining or task-specific augmentations, PTRM
+                                                                 enables substantial accuracy gains across benchmarks, including Sudoku-Extreme
+                                                                 (87.4% to 98.75%) and on various puzzles from Pencil Puzzle Bench (62.6% to
+                                                                 91.2%). On the latter, PTRM achieves nearly double the accuracy of frontier LLMs
+                                                                 (91.2% vs. 55.1%) at less than 0.0001x the cost, using only 7M parameters.
+
+
+
+[Figure 1 plot: two bar charts of Accuracy (%). Left panel, "PPBench Puzzles (sudoku, lightup,
+nurikabe, heyawake, and tapa)", bars left to right: Direct pred, gpt-5.2@xhigh, gemini-3.1-pro,
+claude-opus-4-6, LLM ensemble, TRM, PTRM (ours). Right panel, "Sudoku-Extreme", bars left to right:
+Deepseek R1, Claude 3.7 8K, o3-mini-high, Direct pred, HRM, TRM, PTRM (ours). Legend: chain-of-thought
+pretrained LLM; LLM ensemble (best of 7 strongest LLMs, assuming access to a perfect verifier);
+direct prediction; deterministic recursive prediction; probabilistic recursive prediction (ours).]
+
+Figure 1: PTRM performance comparison. On various PPBench puzzles, PTRM improves TRM accuracy
+by 28.6 points without any retraining. It outperforms the strongest single frontier LLM by 56.5
+points and an ensemble of the seven strongest LLMs (assuming a perfect verifier) by 36 points.
+On Sudoku-Extreme, PTRM reaches a state-of-the-art 98.75%.
+1   Introduction
+Tiny Recursive Models (TRM) [1] achieve strong performance on complex reasoning puzzles with
+orders of magnitude fewer parameters than the large language models (LLMs) they outperform on
+tasks like Sudoku-Extreme [2] and ARC-AGI [3, 4]. TRM and its predecessor Hierarchical Reasoning
+Model (HRM) [2] represent an emerging architectural alternative to standard autoregressive reasoning
+models. Rather than autoregressively generating chains of token-level reasoning, they recursively
+refine a latent state. This approach produces a single deterministic answer per input, fitting well with
+tasks where the answer is unique.
+Despite their strong performance, their deterministic inference does not make full use of their
+capabilities. We show that many of TRM’s incorrect answers come from rollouts trapped in bad
+latent-space basins (i.e., regions of the latent space that decode to incorrect answers and from which
+the deterministic recursion cannot escape). This observation, which aligns with recent mechanistic
+work on related models [5], suggests that TRM is capable of solving significantly more problems
+but is limited by its standard inference procedure.
+Although each puzzle has a unique correct answer, many distinct latent trajectories can reach it. This
+is analogous to reasoning LLMs, where many reasoning trajectories can lead to the same unique
+answer. However, because LLMs are non-deterministic, they can be sampled repeatedly to produce
+different trajectories (each comprising a chain of thought and a final answer). By then selecting a
+trajectory by vote or by the answer’s projected value (via a verifier), LLMs can leverage
+test-time compute to achieve very high accuracy [6]. We propose a way to achieve similar test-time
+scaling performance gains by sampling stochastic latent trajectories, each producing a deterministic
+decoded answer, and selecting among the answers using the model’s own Q head.
+TRM’s Q head is trained jointly with the rest of the network as a correctness classifier and is
+conventionally used only at training time, for adaptive computation time (ACT) halting [7]. It carries
+valuable information that the standard inference procedure discards.
+We propose Probabilistic TRM (PTRM), a test-time compute scaling framework that introduces a
+new width scaling axis. At inference we run K parallel rollouts per puzzle, each receiving Gaussian
+noise injected into the latent at every deep recursion step. The noise causes rollouts to follow different
+latent trajectories and settle in different basins. Among the resulting candidate answers, the Q head
+is used to select the one most likely to be correct. PTRM requires no training changes and no
+task-specific test-time augmentation, yet, as illustrated in Figure 1, delivers substantial accuracy
+gains across diverse reasoning benchmarks.
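+
+A minimal sketch of this width-scaling loop, under stated assumptions: toy_step and toy_q_head are
+hypothetical stand-ins for one deep recursion step of a trained TRM and for its Q head, the latent is
+a short list of floats, and noise is added right after every step. It is not the authors'
+implementation; it only illustrates running K noisy rollouts from the same start and keeping the
+candidate with the highest Q score.
+
```python
import math
import random


def noisy_rollout(z0, step, n_steps, sigma, rng):
    """Run one rollout, adding Gaussian noise to the latent after every step."""
    z = list(z0)
    for _ in range(n_steps):
        z = step(z)
        z = [zi + rng.gauss(0.0, sigma) for zi in z]
    return z


def best_of_k(z0, step, q_head, k, n_steps, sigma, seed=0):
    """Run k noisy rollouts and return (best latent, its Q score, all Q scores)."""
    rng = random.Random(seed)
    candidates = [noisy_rollout(z0, step, n_steps, sigma, rng) for _ in range(k)]
    scores = [q_head(z) for z in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best], scores[best], scores


# Hypothetical toy dynamics: z <- tanh(2z) has two attracting basins,
# near -0.96 and +0.96, separated by an unstable point at 0.
def toy_step(z):
    return [math.tanh(2.0 * zi) for zi in z]


# Hypothetical toy "Q head": it simply prefers the positive basin.
def toy_q_head(z):
    return z[0]


if __name__ == "__main__":
    start = [-0.2]  # the deterministic rollout from here settles in the negative basin
    _, q_det, _ = best_of_k(start, toy_step, toy_q_head, k=1, n_steps=20, sigma=0.0)
    _, q_best, _ = best_of_k(start, toy_step, toy_q_head, k=16, n_steps=20, sigma=0.5)
    print("deterministic Q:", q_det)
    print("best-of-16 Q:   ", q_best)
```
+
+With sigma = 0 all K rollouts coincide with the deterministic trajectory, so any gain in this toy
+setting has to come from noise moving some rollouts into a different basin.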
+
+2   Background: Tiny Recursive Model
+Tiny Recursive Model (TRM) is a single network that iteratively refines a predicted answer y to a
+question x through recursive updates of a reasoning latent z. Specifically, a single latent recursion
+consists of n updates to the latent state z followed by one update to the predicted answer y, all using
+the same two-layer network fθ : z ← fθ (x + y + z) n times, then y ← fθ (y + z).
+fθ distinguishes the two update types by whether the input includes x. A deep recursion runs T
+latent recursions in sequence, with only the final one retaining gradients, allowing the model to
+leverage a large effective depth while keeping training efficient.
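+
+The update order above can be sketched as follows. This is an illustration under assumptions, not
+the TRM code: f_theta is a hypothetical elementwise stand-in for the shared two-layer network,
+vectors are plain Python lists, and no gradients are tracked, so the "only the final latent
+recursion retains gradients" rule appears only as a comment.
+
```python
import math


def f_theta(v):
    """Hypothetical stand-in for the shared network: an elementwise squashing map."""
    return [math.tanh(0.5 * vi) for vi in v]


def add(*vectors):
    """Elementwise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def latent_recursion(x, y, z, n):
    """n latent updates z <- f(x + y + z), then one answer update y <- f(y + z)."""
    for _ in range(n):
        z = f_theta(add(x, y, z))
    y = f_theta(add(y, z))
    return y, z


def deep_recursion(x, y, z, n, T):
    """T latent recursions in sequence.

    In training, only the last of the T recursions would retain gradients;
    this sketch has no autodiff, so all T are treated alike.
    """
    for _ in range(T):
        y, z = latent_recursion(x, y, z, n)
    return y, z


if __name__ == "__main__":
    x = [1.0, -1.0]
    y0 = [0.0, 0.0]
    z0 = [0.0, 0.0]
    y, z = deep_recursion(x, y0, z0, n=6, T=3)
    print("answer state y:", y)
    print("latent state z:", z)
```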
+Rather than taking one optimization step per sample, TRM is trained via deep supervision: the
+previous latent state z and answer y are detached from the computational graph and reused as the
+initialization for the next supervision step. This is repeated for up to Nsup supervision steps.
+The loss at each step is calculated using cross entropy between the predicted answer logits fO (y)
+(where fO is a linear output head) and the ground truth ytrue . This trains the network to progressively
+refine its prediction across reasoning steps. At inference, the recurrence can be unrolled for more
+steps than during training, providing a depth axis for test-time compute scaling (additional steps may
+correct otherwise-incorrect answers).
+ [Figure 2 plot: three columns titled "Quick success", "Delayed success", and "Failure". Top row: PCA
+ projection of the trajectory (PC 1 / PC 2 explained variance 84% / 15%, 58% / 36%, and 85% / 8%,
+ respectively), with legend entries for correct answer, incorrect answer, cell accuracy, start, and end.
+ Bottom row: Q value (left axis) and cell accuracy (right axis) against supervision step (0 to 15).]
+ Figure 2: TRM Trajectory Modes. PCA projection of y (top) and Q value (solid, left axis) with cell
+ accuracy (dashed, right axis) across supervision steps (bottom) for three PPBench puzzles, illustrating
+ three trajectory modes (left to right): quick success, delayed success, and failure (Sec. 3). Latents are
+ projected into the principal plane per puzzle, so PC axes are not comparable across plots. Trajectories
+ fade from light (early steps) to dark (later steps). A circle marks the start and a square marks the end.
+
+ Without a halting mechanism during training, each puzzle stays in the mini-batch for Nsup supervision
+ steps rather than being replaced after each one. To avoid wasting compute on already-solved samples,
+ an Adaptive Computation Time (ACT) halting mechanism is used. This is done by adding a binary
+ cross-entropy loss between a halting logit q̂ = fQ(y) (where fQ is a linear Q head) and the binary exact
+ correctness of the predicted answer ŷ = arg max fO(y): Lstep = CE(fO(y), ytrue) + BCE(q̂, 1[ŷ = ytrue]).
+ The Q head thus allows the supervision loop to halt early on samples where sigmoid(q̂) > 0.5,
+ improving data efficiency. During inference, the Q head is not used, and the model performs Nsup
+ supervision steps to maximize answer correctness.
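+
+ The per-step loss and the halting rule can be written out as a small stdlib-only sketch. The
+ function names, the list-of-logits layout (one list of class logits per answer cell), and the
+ averaging of the cross entropy over cells are assumptions made for illustration; the text only
+ specifies Lstep = CE + BCE and the sigmoid(q̂) > 0.5 threshold.
+
```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of class logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sigmoid(q):
    return 1.0 / (1.0 + math.exp(-q))


def bce_with_logit(q, target):
    """Binary cross entropy between sigmoid(q) and a 0/1 target, in stable form."""
    return max(q, 0.0) - q * target + math.log1p(math.exp(-abs(q)))


def step_loss(cell_logits, y_true, q_logit):
    """L_step = CE(f_O(y), y_true) + BCE(q_hat, 1[y_hat == y_true]).

    cell_logits: one list of class logits per answer cell.
    y_true: one class index per cell.
    The CE term is averaged over cells (an assumption of this sketch).
    """
    ce = -sum(log_softmax(l)[t] for l, t in zip(cell_logits, y_true)) / len(y_true)
    y_hat = [max(range(len(l)), key=lambda c: l[c]) for l in cell_logits]
    exact = 1.0 if y_hat == list(y_true) else 0.0  # whole answer must match
    return ce + bce_with_logit(q_logit, exact), y_hat, exact


def should_halt(q_logit):
    """Training-time ACT rule from the text: halt when sigmoid(q_hat) > 0.5."""
    return sigmoid(q_logit) > 0.5


if __name__ == "__main__":
    logits = [[2.0, 0.0], [0.0, 3.0]]
    loss, y_hat, exact = step_loss(logits, [0, 1], q_logit=4.0)
    print("loss:", loss, "prediction:", y_hat, "exact:", exact)
    print("halt:", should_halt(4.0))
```
+
+ Because sigmoid(q̂) > 0.5 is equivalent to q̂ > 0, the halting test reduces to checking the sign
+ of the Q logit.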
+ While TRM is powerful, it sometimes gets stuck in incorrect solutions. In the next section, we
+ investigate such failure cases in order to determine how to remedy them.
+
+
+ 3        Problem: When Does TRM Fail?
+
+ 3.1           Analysis of failures and successes
+
+ We present observations about TRM that motivate our method. In this section, we train a TRM on
+ multiple Pencil Puzzle Bench (PPBench) [8] puzzles and inspect the latent dynamics and Q head
+ behavior across supervision steps on a held-out validation set. For each puzzle, we record the latent
+ yt and the Q logit q̂t = fQ (yt ) at every supervision step t = 1, . . . , Nsup , project the latents into
+ the principal plane (PCA per puzzle), and jointly plot the Q value alongside cell accuracy (fraction
+ of correct cells in the predicted answer) over supervision steps. Figure 2 shows paired PCA and
+ Q/cell-accuracy plots for three representative puzzles, illustrating three trajectory modes we observe:
+ Quick success: the trajectory transitions in a few steps from its starting location to a convergence
+ region and remains there. Cell accuracy and the Q value rise together and saturate near their maxima
+ within the same few steps.
+ Delayed success: the trajectory initially oscillates around one region and remains there for multiple
+ supervision steps before sharply escaping to a different region where it converges. During the initial
+phase, the Q value is negative, and at the step where the trajectory escapes, both Q value and cell
+accuracy spike together.
+Failure: the trajectory oscillates in a bounded region without converging. Cell accuracy never
+approaches 100%, and the Q value stays negative at every supervision step.
+We refer to latent space regions that trajectories remain in across multiple supervision steps and
+exhibit similar cell accuracy throughout as basins. Basins where cell accuracy is near-maximal are
+good basins and basins where it is not are bad basins. Initially, failures and delayed successes behave
+similarly (both are caught in bad basins with negative Q). They diverge only later in their trajectories,
+when delayed successes find an escape to a good basin while failures remain stuck.
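+
+One way to make the three modes operational is to label a trajectory from its per-step cell
+accuracy alone. The sketch below is illustrative only: the thresholds good_acc (what counts as
+"near-maximal") and quick_steps (what counts as "a few steps") are hypothetical choices, not values
+from the paper, and the rule ignores the latent geometry shown in the PCA plots.
+
```python
def classify_trajectory(cell_acc, good_acc=0.99, quick_steps=3):
    """Label a trajectory as quick success, delayed success, or failure.

    cell_acc: cell accuracy at each supervision step, in order.
    good_acc: hypothetical threshold for a "good basin".
    quick_steps: hypothetical cutoff; settling before this step index is "quick".
    """
    # Find the first step from which accuracy stays at or above good_acc to the end.
    settle = None
    for t in range(len(cell_acc) - 1, -1, -1):
        if cell_acc[t] >= good_acc:
            settle = t
        else:
            break
    if settle is None:
        return "failure"  # never ends in a good basin
    return "quick success" if settle < quick_steps else "delayed success"


if __name__ == "__main__":
    print(classify_trajectory([0.70, 0.95, 1.0, 1.0, 1.0]))
    print(classify_trajectory([0.80] * 8 + [1.0] * 4))
    print(classify_trajectory([0.80, 0.82, 0.79, 0.81]))
```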
+
+3.2           The Q head tracks trajectory quality
+
+[Figure 3 plot: mean Q value (left axis) and mean cell accuracy (right axis) against supervision
+step (0 to 14). Legend: Incorrect (28), Correct (69), Cell accuracy (right axis).]
+Figure 3: Q value follows cell accuracy across reasoning. Mean Q value (solid, left axis) and mean
+cell accuracy (dashed, right axis) over supervision steps, aggregated over 100 PPBench validation
+puzzles, separated by final correctness (green: correct, red: incorrect).
+
+Across all three modes (failures, delayed successes, and quick successes), we find that the Q head’s
+value closely tracks cell accuracy at every supervision step. To further confirm this, Figure 3
+aggregates trajectories from 100 PPBench validation puzzles, separating them by final-answer
+correctness. The aggregate view corroborates the per-puzzle observation: mean Q and mean cell
+accuracy rise together on correct trajectories and remain mostly flat on incorrect ones. Moreover, at
+convergence, the Q logit sharply separates the two populations where q̂ ≈ +6 (sigmoid ≈ 1) for
+correct trajectories and q̂ ≈ −6 (sigmoid ≈ 0) for incorrect ones. The Q head is therefore a reliable
+learned indicator of whether a trajectory has reached a good basin.
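+As a small worked check of the logit values quoted above, the sketch below maps q̂ = ±6 through a
+sigmoid. The function name is illustrative and not taken from the TRM code base.
```python
import math

def sigmoid(logit: float) -> float:
    """Map a Q logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Logit values quoted in the text for converged trajectories.
p_correct = sigmoid(6.0)     # close to 1
p_incorrect = sigmoid(-6.0)  # close to 0
print(f"{p_correct:.4f} {p_incorrect:.4f}")
```
+A gap of about 12 logit units therefore leaves the two populations almost perfectly separated.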
+Given the Q head’s ability to distinguish good from bad trajectories, a natural question follows:
+can we leverage the Q head to identify better trajectories? The main challenge is that the standard
+TRM is inherently deterministic, and thus cannot be used to sample different trajectories for a given
+problem. In the next section, we will show that by simply adding Gaussian noise to the latent state,
+we can sample different parallel trajectories and leverage the Q head to pick the best one.
+
+4         Method: Test-Time Compute Scaling via Stochastic Rollouts
+We propose Probabilistic TRM (PTRM), an inference-time procedure that makes the TRM recursion
+stochastic and selects the best of K resulting trajectories. PTRM requires no special training and
+can be readily applied to any pretrained TRM model. Furthermore, it requires no task-specific
+augmentations. PTRM works as follows: at each supervision step, we add Gaussian noise (scaled by
+σ) to the latent state input. The Q head fQ scores each candidate latent output, and the one with the
+highest Q value is selected and then decoded using the model’s output head fO . The algorithm in
+Figure 4 (left) states this formally. PTRM offers two complementary benefits: 1) it enables trajectories
+to escape bad basins where deterministic TRM remains stuck, and 2) it introduces width as a new
+axis for test-time scaling.
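+The procedure described above can be sketched in a few lines of NumPy. This is a minimal
+illustration under stated assumptions, not the authors' implementation: `rec`, `f_o`, and `f_q`
+stand for the pretrained recursion step, output head, and Q head, and the `toy_*` functions are
+hypothetical stand-ins so that the example runs on its own.
```python
import numpy as np

def ptrm_inference(x, z0, y0, rec, f_o, f_q, K=8, D=16, sigma=0.2, seed=0):
    """Run K noisy rollouts of D steps each and keep the one with the highest Q value."""
    rng = np.random.default_rng(seed)
    best_q, best_answer = -np.inf, None
    for _ in range(K):                 # rollouts are independent; a real run would batch them
        z, y = z0.copy(), y0.copy()
        for _ in range(D):
            z = z + rng.normal(0.0, sigma, size=z.shape)  # Gaussian noise on the latent state
            z, y = rec(x, z, y)                           # one deep recursion (supervision) step
        q = float(f_q(y))                                 # Q head scores the final state
        if q > best_q:
            best_q, best_answer = q, np.argmax(f_o(y), axis=-1)
    return best_answer, best_q

# Hypothetical stand-ins for a pretrained model (assumptions for illustration only).
def toy_rec(x, z, y):
    z_new = np.tanh(z + x)
    return z_new, 0.5 * y + 0.5 * z_new

def toy_f_o(y):
    return y                           # treat y (cells x classes) as output logits

def toy_f_q(y):
    return y.max(axis=-1).mean()       # scalar "confidence" score

x = np.linspace(-1.0, 1.0, 12).reshape(4, 3)
z0 = np.zeros((4, 3))
y0 = np.zeros((4, 3))
answer, q_best = ptrm_inference(x, z0, y0, toy_rec, toy_f_o, toy_f_q)
```
+Setting sigma to 0 makes every rollout identical, which recovers the deterministic TRM behaviour.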
+
+4.1           Escaping bad basins
+
+PTRM Inference
+
+    Input: puzzle x, rollouts K, supervision steps D, noise scale σ
+    for k = 1, ..., K in parallel do
+        Initialize z_0^(k), y_0^(k)
+        for t = 1, ..., D do
+            z_{t−1}^(k) += ϵ,  ϵ ∼ N(0, σ² I)
+            z_t^(k), y_t^(k) ← rec(x, z_{t−1}^(k), y_{t−1}^(k))
+        end for
+        ŷ^(k) ← arg max f_O(y_D^(k))
+        q̂^(k) ← f_Q(y_D^(k))
+    end for
+    return ŷ^(k*), where k* = arg max_k q̂^(k)
+
+[Diagram: (a) Standard TRM (deterministic): puzzle, D deep recursion steps along the depth axis,
+answer. (b) PTRM (ours): K stochastic rollouts along the width axis (k = 1, ..., K), with Gaussian
+noise +ϵ injected before each deep recursion step, followed by Q-head selection (arg max_k Q_k) of
+the final answer.]
+
+Figure 4: Left: PTRM inference procedure (the rec() function refers to a deep recursion step). Right:
+PTRM mechanism. (a) Standard TRM: a single deterministic rollout. (b) PTRM: K stochastic latent
+rollouts with Gaussian noise ϵ at each deep recursion step, with the Q head selecting the final answer.
+
+In Sec. 3, we found that some failed deterministic trajectories are caught in bad solution basins in
+latent space, with no way to escape. PTRM lets us test whether stochastic perturbations are enough
+for some of the rollouts of a previously failed puzzle to reach a good solution basin. Figure 5 shows
+K=100 independent rollouts, from the same failed puzzle used in Figure 2 (which fails at K=1),
+projected into the principal plane. Most rollouts (92%) remain stuck in the same bad basin, while
+a minority (8%) escape to a distinct region in latent space and produce correct answers. We also
+observe that recurrent noise creates a per-rollout probability of escape: at K = 5 no rollouts escape,
+at K = 25 one does, and at K = 100 eight do. This confirms that noise provides the stochasticity
+needed to occasionally find an escape trajectory.
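+To make the per-rollout escape probability concrete, the sketch below computes the chance that at
+least one of K rollouts escapes. It assumes that rollouts are independent and share one escape
+probability p, and it plugs in p = 8/100 from the single puzzle above. That estimate is illustrative
+only and will differ from puzzle to puzzle.
```python
def prob_any_escape(p: float, k: int) -> float:
    """Probability that at least one of k independent rollouts escapes."""
    return 1.0 - (1.0 - p) ** k

p_hat = 8 / 100  # empirical escape rate for the single puzzle discussed above
for k in (1, 5, 25, 100):
    print(k, round(prob_any_escape(p_hat, k), 3))
```
+Under these assumptions a single rollout rarely escapes, while a large K makes at least one escape
+very likely.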
+
+4.2   Width scaling
+
+Since more rollouts per puzzle compound the chance that at least one reaches a good basin, the
+number of rollouts K is a natural quantity to scale. Given K independent rollouts, pass@K (any
+rollout correct) is the oracle upper bound and best-Q@K (the rollout with highest q̂ is correct) is a
+metric available at inference without a correctness oracle. The choice of Q as selector is motivated by
+Sec. 3’s observation that Q accurately separates correct from incorrect trajectories (Figure 3).
+Figure 6 shows pass@K and best-Q@K as K grows, averaged over 3 seeds on the held-out PPBench
+validation set (sudoku, nurikabe, tapa, lightup, and heyawake). Both metrics rise from 76.4% at
+K = 1 to 89.5% at K = 100, a gain of 13 percentage points. Across all tested K, the gap between
+pass@K and best-Q@K stays under 1pp, making the Q head a strong verifier on this validation set.
+By contrast, mode@K (most frequent answer across rollouts) rises by only 1.3pp over the same
+range, showing that the width-scaling gains come mostly from the Q head’s ability to identify correct
+solutions even when they are rare.
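+The three selection metrics can be written down for a single puzzle as follows. This is a minimal
+sketch: the function name and the example answers and Q values are made up for illustration, and
+answers are assumed to be hashable (for example strings or tuples).
```python
from collections import Counter

def width_metrics(answers, q_values, target):
    """Return (pass@K, best-Q@K, mode@K) as booleans for one puzzle."""
    correct = [a == target for a in answers]
    pass_at_k = any(correct)                                 # oracle: any rollout correct
    best_idx = max(range(len(answers)), key=lambda i: q_values[i])
    best_q_at_k = correct[best_idx]                          # rollout with highest Q is correct
    mode_answer, _ = Counter(answers).most_common(1)[0]
    mode_at_k = mode_answer == target                        # most frequent answer is correct
    return pass_at_k, best_q_at_k, mode_at_k

# One rare correct rollout ("B") with a high Q value among three identical wrong ones.
result = width_metrics(["A", "A", "B", "A"], [-5.8, -6.1, 5.9, -5.5], target="B")
print(result)
```
+In this made-up case best-Q@K succeeds while mode@K fails, which is the situation the paragraph
+above describes: the correct answer is rare but the Q head still ranks it first.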
+Interaction with depth scaling. Depth is another scaling axis already supported by TRM: the model
+runs more deep recursions (supervision steps) at inference than the N_sup it was trained with.
+On the deterministic baseline (K=1), tripling the depth from 16 to 48 steps raises
+PPBench validation accuracy from 76.4% to 79.5% (+3.1pp). At higher K, depth scaling only
+provides additional gains on specific puzzle types such as sudoku (+4pp at K = 100). Both depth
+and width scaling can be seen as ways to explore the model’s solution space. Since rollouts are
+independent and parallelizable while extra depth is sequential, width is the more practical scaling
+axis.
+PTRM unlocks a simple and task-agnostic recipe for scaling TRM test-time compute. The next
+section evaluates the method across multiple benchmarks and against several baselines, including
+frontier LLMs.
+
+5     Experiments
+This section evaluates PTRM’s performance on diverse reasoning benchmarks. We compare against
+the deterministic TRM baseline, a non-recursive direct-prediction baseline, and frontier LLMs.
+Across several PPBench puzzles [8], Sudoku-Extreme [2], Maze-Hard [2], and ARC-AGI-2 [4],
+PTRM substantially boosts the performance of each pretrained TRM using only inference compute.
+
+
+Figure 5: Stochastic rollouts escape bad basins. Principal plane projection of K = 100 independent
+rollouts of the same failed puzzle as in Figure 2 (right). 92 rollouts remain caught in the bad basin
+(red). 8 escape to a good basin and produce correct answers (green). [Plot: x-axis, PC 1 (53% var);
+y-axis, PC 2 (34% var); legend: Correct (8), Incorrect (92), Start, End.]
+
+Figure 6: Width scaling. pass@K, best-Q@K, and mode@K as K grows, averaged over 3 seeds on a
+held-out PPBench validation set. The Q head is a strong verifier on the tested puzzles, consistently
+outperforming selection of the most frequent answer. [Plot: x-axis, rollouts per puzzle K (log scale;
+ticks at 1, 5, 10, 25, 100); y-axis, PPBench accuracy (%).]
+
+
+ 5.1                  Setup
+
+Datasets. Pencil Puzzle Bench (PPBench) [8] consists of 62,231 constraint-satisfaction pencil puzzles
+(from 94 puzzle types). From the full PPBench dataset, 300 puzzles (15 puzzles from 20 types)
+selected by Waugh [8] are held out to form the golden set. From the remainder we hold out a
+fixed-size validation set of 100 puzzles per puzzle type (50 for tapa, due to its smaller base size),
+and the rest forms the training set. We filter all three sets to puzzles of six types (sudoku, lightup,
+nurikabe, shakashaka, heyawake, and tapa) of grid size 9×9 for sudoku, and 10×10 for the rest.
+We use the validation set to track performance during training and select the final checkpoint. We
+report per-puzzle accuracy on five of these types on the golden set (TRM already reaches 100% on
+shakashaka, so we omit it from the reported results), with aggregate scores sample-weighted across
+types. We also report results on the Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 datasets.
+ Models and inference. For each benchmark we use a standard TRM checkpoint. For Sudoku-
+ Extreme we use the TRM-MLP variant (which the TRM paper showed to be stronger on Sudoku),
+ and for the other datasets, we use TRM-Att. PTRM inference uses K parallel rollouts each running
+ D supervision steps with Gaussian noise of scale σ added to the latent state at each supervision step.
+ The selected configuration (K, D, σ) varies by benchmark and is given alongside each result. Metrics
+ are averaged across three seeds.
+ Baselines. To isolate the contribution of PTRM’s stochastic rollouts from the underlying backbone,
+ we report standard TRM performance (the same checkpoint as PTRM, run deterministically). For
+ each dataset, we report the performance of frontier LLMs. For Sudoku-Extreme, Maze-Hard, and
+ ARC-AGI-2 we additionally report the published direct prediction and TRM baselines from [1].
+ Cost estimation. PPBench provides the dollar cost per attempt for each LLM. We convert PTRM’s
+ wall-clock to a comparable dollar figure using a single H100 at $2.50/hr (standard cloud pricing [9])
+ so that cost = $2.50 · t_puzzle / 3600, where t_puzzle is the time (in seconds) to complete a puzzle.
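+ The conversion above amounts to one line of arithmetic. In the sketch below the function name and
+ the example runtime of 1.44 seconds are hypothetical, chosen only to show the order of magnitude.
+ The example runtime is not a measured value from the paper.
```python
H100_DOLLARS_PER_HOUR = 2.50  # hourly rate quoted in the text

def ptrm_cost_dollars(t_puzzle_seconds: float, rate: float = H100_DOLLARS_PER_HOUR) -> float:
    """Convert wall-clock seconds on one GPU into dollars at an hourly rate."""
    return rate * t_puzzle_seconds / 3600.0

# Hypothetical runtime, for illustration only.
example_cost = ptrm_cost_dollars(1.44)
print(f"${example_cost:.4f}")
```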
+
+
+ 5.2                  Pencil Puzzle Bench
+
+ 5.2.1                 Per-puzzle accuracy
+
+ Table 1 reports per-puzzle accuracy on the PPBench golden set. PTRM at K=100, D=48, σ=0.2
+ raises aggregate best-Q@K from 62.6% to 91.2%. Increasing supervision depth alone (K=1, D=48)
+ gives a small boost over the standard TRM baseline (K=1, D=16). Most of the gain comes
+ from scaling width (stochastic rollouts). The largest improvements are on puzzle types where
+the deterministic baseline performed the worst (most headroom): sudoku improves from 46.7% to
+97.8% and tapa from 40.0% to 80.0%.
+
+  % accuracy                      # Params   sudoku   lightup   nurikabe   heyawake   tapa   agg.
+  Direct prediction                 27M         0.0       0.0        0.0       14.3    0.0    2.0
+  TRM (K=1, D=16)                   7M         46.7      87.5       74.1       85.7   40.0   62.6
+  TRM (K=1, D=48)                   7M         57.8      87.5       74.1       85.7   40.0   66.0
+  PTRM, best-Q@K (K=100, D=16)      7M         93.3     100         88.9       85.7   80.0   89.8
+  PTRM, best-Q@K (K=100, D=48)      7M         97.8     100         88.9       85.7   80.0   91.2
+Table 1: PPBench per-puzzle accuracy on the golden set. PTRM uses the same backbone as
+the deterministic TRM. Scaling depth alone (K=1, D=48) lifts aggregate accuracy by 3.4 points
+over the standard D=16 baseline. Combining depth with K=100 stochastic (σ=0.2) rollouts raises
+accuracy by 28.6 percentage points overall. The direct-prediction baseline is a larger transformer
+trained on the same data.
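+Sec. 5.1 describes the aggregate column as sample-weighted across puzzle types. The sketch below
+shows what such a weighted mean looks like. The per-type puzzle counts are not stated in the text.
+The counts used here (15, 8, 9, 7, 10) are an assumption made for illustration, chosen because they
+give a weighted mean close to the reported 62.6 for the TRM (K=1, D=16) row.
```python
def sample_weighted_mean(acc_by_type, count_by_type):
    """Weighted mean of per-type accuracies, weighted by the number of puzzles per type."""
    total = sum(count_by_type.values())
    return sum(acc_by_type[t] * count_by_type[t] for t in acc_by_type) / total

# ASSUMED golden-set counts per puzzle type (not given in the text).
assumed_counts = {"sudoku": 15, "lightup": 8, "nurikabe": 9, "heyawake": 7, "tapa": 10}

# Per-type accuracies copied from the TRM (K=1, D=16) row of Table 1.
trm_d16 = {"sudoku": 46.7, "lightup": 87.5, "nurikabe": 74.1, "heyawake": 85.7, "tapa": 40.0}

agg = sample_weighted_mean(trm_d16, assumed_counts)
print(round(agg, 1))
```
+A plain unweighted mean of the same five numbers would be higher, because sudoku and tapa, the two
+weakest types, carry the largest assumed weights.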
+
+
+
+5.2.2    Comparison with frontier LLMs on golden set
+PPBench reported per-puzzle results for several frontier LLMs using two strategies: 1) direct response
+from a single prompt, and 2) multi-turn agentic strategy with verification. We report results for direct
+and any (best of any strategy attempted, including agentic). The agentic strategy gives the LLM
+substantially more resources than PTRM has access to. It provides the LLM the ability to iteratively
+verify each move with a perfect verifier. The direct strategy is the fairer comparison since, while
+it may use the model provider’s reasoning harness, it does not have direct access to a multi-turn
+verifier (the LLM could still self-verify by writing verification code within the same response). We
+additionally observe that the agentic strategy was applied selectively in the published PPBench data:
+across the LLMs we compare against, only 9.6% of direct failures on the golden set were retried
+with agentic. We restrict the comparison to the 7 strongest LLMs that attempted every puzzle in our
+golden set: claude-opus-4-6@thinking, gpt-5.2@xhigh, gemini-3.1-pro, gpt-5.2@high,
+claude-sonnet-4-6@thinking, gpt-5.2@medium, and kimi-k2.5. Table 2 lists the top 3 in
+each strategy block.
+We additionally report an ensemble score formed from these 7 LLMs where a puzzle counts as solved
+if at least one of them solved it via any strategy. This ensemble setup is deliberately stacked against
+PTRM. It assumes a perfect verifier since, if any of the 7 LLMs produced a correct answer under
+any strategy, the ensemble counts it as solved, even though in practice we would not have access
+to an oracle verifier. Although it is not deployable, we include the ensemble to demonstrate that
+even under these heavily favorable conditions, frontier LLMs fall well short of PTRM. Ensemble
+cost-per-attempt averages over the attempts of all 7 models on each puzzle, and cost-per-correct
+divides total cost by the number of puzzles the ensemble solved.
+Table 2 reports the comparison. PTRM exceeds the strongest single LLM (direct strategy) by 57
+points aggregate (91.2% vs. 34.7%), and exceeds the LLM ensemble by 36 points (91.2% vs. 55.1%)
+despite the ensemble’s stacked advantages. Cost per attempt is several orders of magnitude higher for
+LLMs than PTRM.
+
+5.3     Sudoku-Extreme, Maze-Hard, and ARC-AGI-2
+
+For each benchmark we use the standard TRM checkpoint trained as described in [1] without
+modification (TRM-MLP for Sudoku-Extreme and TRM-Att for Maze-Hard and ARC-AGI-2).
+Table 3 summarizes results on all three.
+On Sudoku-Extreme, PTRM at K=100, D=64, σ=0.3 raises the deterministic baseline of 87.3% to
+99.06% pass@K and 98.75% best-Q@K, achieving state of the art.
+On Maze-Hard, PTRM at K=100, D=16, σ=1.0 reaches 95.63% pass@K, an 11.83 point gain
+over the 83.8% deterministic baseline. mode@K gives the best PTRM accuracy here at 86.73%
+(+2.93 points), with best-Q@K slightly behind at 85.17% (+1.37 points). While pass@K shows
+that PTRM is able to unlock several correct answers, the Q head identifies them less reliably than on
+the previous benchmarks.
+
+ % accuracy                            sudoku  lightup  nurikabe  heyawake   tapa   agg.   $/att.   $/corr.
+ Direct
+ gemini-3.1-pro                           6.7     75.0      22.2       0.0   30.0   24.5    $0.40     $1.62
+ gpt-5.2@xhigh                           20.0     50.0       0.0       0.0   50.0   24.5    $1.79     $7.29
+ claude-opus-4-6@thinking                 0.0     87.5      44.4       0.0   60.0   34.7    $2.91     $8.40
+ Any strategy (direct or agentic)†
+ gemini-3.1-pro                           6.7     87.5      33.3       0.0   40.0   30.6   $10.38    $33.91
+ gpt-5.2@xhigh                           33.3     75.0       0.0       0.0   60.0   34.7    $3.09     $8.90
+ claude-opus-4-6@thinking                 0.0     87.5      44.4       0.0   70.0   36.7    $4.38    $11.92
+ LLM ensemble†
+ Any strategy (direct or agentic)        46.7    100        44.4       0.0   80.0   55.1    $2.66    $38.51
+ Ours, trained from scratch, 7M parameters
+ PTRM, best-Q@K                          97.8    100        88.9      85.7   80.0   91.2   $0.001    $0.001
+Table 2: PTRM vs. frontier LLMs on PPBench golden. Per-puzzle accuracy and per-attempt /
+per-correct cost on the golden set. LLM costs are from PPBench. PTRM cost is estimated from H100
+wall-clock (Sec. 5.1). The direct and agentic blocks list the 3 highest scoring LLMs on aggregate,
+and the ensemble row uses all 7 listed in Sec. 5.2.2. † Assumes access to a perfect verifier.
+
+
+
+
+On ARC-AGI-2, the standard inference pipeline applies data augmentations and votes across them.
+PTRM adds K stochastic rollouts per augmentation. For selection, we pick the rollout with the
+highest Q value within each augmentation, then vote across augmentations as in the standard pipeline.
+With K=25 and σ=0.2, PTRM lifts pass@1 from 7.36% to 8.47% and pass@100 from 14.31% to
+15.97% over our deterministic TRM baseline, while matching it at pass@2.
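+The two-stage selection described above (best Q within each augmentation, then a vote across
+augmentations) can be sketched as follows. This is an illustration under assumptions: the function
+name is hypothetical, answers are assumed to be already mapped back to the original orientation and
+hashable, and the real pipeline's voting details may differ.
```python
from collections import Counter

def select_then_vote(rollouts_per_aug, top_k=2):
    """rollouts_per_aug: one list per augmentation, each holding (answer, q_value) pairs.

    Returns the top_k most frequent per-augmentation picks, most voted first.
    """
    # Stage 1: within each augmentation keep the rollout with the highest Q value.
    picks = [max(rollouts, key=lambda r: r[1])[0] for rollouts in rollouts_per_aug]
    # Stage 2: majority vote across augmentations.
    return [answer for answer, _ in Counter(picks).most_common(top_k)]

# Made-up example with three augmentations and two rollouts each.
example = [
    [("X", 0.1), ("Y", 2.0)],
    [("Y", 1.0), ("Z", -1.0)],
    [("X", 3.0), ("Y", 0.0)],
]
top2 = select_then_vote(example, top_k=2)
print(top2)
```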
+
+
+                                                Sudoku-Extreme   Maze-Hard           ARC-AGI-2
+ Method                            # Params        Acc. (%)       Acc. (%)    pass@1   pass@2   pass@100
+ HRM                                 27M             55.0           74.5         –       5.0        –
+ TRM                                 5M / 7M†        87.4           85.3         –       7.8        –
+ Ours
+ Standard TRM, our reproduction      5M / 7M†        87.28          83.80       7.36     9.72     14.31
+ PTRM                                5M / 7M†        98.75          86.73       8.47     9.72     15.97
+Table 3: Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 results. For Sudoku-Extreme, K=100,
+D=64, σ=0.3. For Maze-Hard, K=100, D=16, σ=1.0. For ARC-AGI-2, K=25, D=16, σ=0.2.
+pass@k for ARC-AGI-2 reports the top-k predictions from the augmentation-voting pipeline. PTRM
+shows an accuracy improvement over standard TRM across all 3 benchmarks. † Following [1], 5M
+for Sudoku-Extreme (TRM-MLP), 7M for Maze-Hard and ARC-AGI-2 (TRM-Att).
+
+
+5.4   Q head selection as σ grows
+
+With a higher σ value, PTRM finds many correct solutions that the deterministic inference misses.
+For instance, on Maze-Hard, the deterministic model solves 83.8% of puzzles, but PTRM raises
+pass@K to nearly 96%. The extent to which PTRM helps depends on the task, but on every dataset
+we tested, it unlocks correct solutions well beyond the deterministic model’s reach.
+TRM’s jointly trained Q head serves as a strong verifier on most tasks. On PPBench and Sudoku-
+Extreme, best-Q@K reaches values within a point of the saturated pass@K, so PTRM’s exploration
+translates directly into accuracy gains. On Maze-Hard, more exploration (higher σ) produces
+significantly more correct rollouts, but the existing Q head is not able to identify them, leaving
+performance on the table. The gap between best-Q@K and pass@K represents headroom for a
+stronger verifier, which is left for future work. Appendix B reports the full σ sweep.
+
+
+6   Related Work
+
+A long line of work explores recursive computation for iterative reasoning and representation
+refinement. Early examples include Universal Transformers [10], Mixture-of-Recursions [11], Deep
+Thinking models [12, 13, 14], and HRM [2], all of which investigate the use of repeated computation
+steps to improve reasoning performance. More recent work has introduced methods to substantially
+accelerate TRM training [15], while TRM-style recursive architectures have also been extended to
+language modeling tasks [16].
+Building on this broader perspective of recursive computation, a growing body of work studies
+latent-space reasoning through the reuse of hidden states. Hao et al. [17] propose continuous
+“thinking tokens” derived from Chain-of-Thought (CoT) traces [18], which are autoregressively
+generated and appended to the model context, enabling reasoning directly in latent space without
+producing intermediate textual outputs. Similarly, Zhu et al. [19] formalize learning by superposition
+and demonstrate improvements on tasks such as graph reachability. By avoiding explicit token
+sampling and implicitly representing multiple reasoning trajectories, these approaches may mitigate
+the unfaithfulness and backtracking often observed in standard autoregressive reasoning [20, 21].
+Related to our work, Baek et al. [22] propose a generative version of TRM where the hidden state
+z is sampled rather than computed deterministically. This improves performance on multiple tasks, but requires
+retraining. Efstathiou and Balwani [23] (concurrent work) propose a similar test-time compute
+method where they only apply noise in the initial hidden state z, while we apply noise at every
+supervision step. Furthermore, they test their method on a small subset of the Sudoku-Extreme
+dataset, and treat it as a proof-of-concept that needs to be developed and tested further. Note that
+Baek et al. [22] also tested applying noise to the initial z with TRM and obtained negative results (no
+improvement in accuracy on two datasets).
+Our observations in Sec. 3 are consistent with the mechanistic analysis of Ren and Liu [5], who
+identify spurious fixed points in HRM’s latent dynamics on Sudoku-Extreme. Their method mitigates
+these attractors through a combination of task-specific training data augmentation, inference-time
+input perturbations, and model bootstrapping across training checkpoints, thereby effectively
+increasing test-time compute. However, these interventions are comparatively less general and less
+computationally efficient. In contrast, we observe analogous basin structure in TRM across multiple
+puzzle types and achieve attractor escape using a substantially simpler, task-agnostic mechanism:
+injecting Gaussian noise into the latent state at each supervision step while using a single deterministic
+checkpoint.
+
+
+7   Conclusion
+
+In this work, we introduced Probabilistic TRM (PTRM), a novel test-time scaling paradigm for
+Tiny Recursive Models (TRM) through parallel exploration and selection. This approach scales
+test-time compute using width (K parallel rollouts), yielding substantially larger gains than depth
+scaling (increasing deep recursion steps) alone. PTRM requires no retraining and does not rely on
+task-specific data augmentations, making it extremely easy to use and versatile.
+By scaling both width and depth, PTRM obtains significant gains in accuracy when tested on a wide
+selection of puzzles. On PPBench (Sudoku, Lightup, Nurikabe, Heyawake, Tapa puzzles), PTRM
+obtains about 1.65 times the accuracy (91.2%; $0.001 per correct answer) of an ensemble of SOTA
+LLMs (55.1%; $38.51 per correct answer) at less than 0.0001x the cost. Furthermore, PTRM improves
+accuracy over our deterministic TRM baseline on Sudoku-Extreme (from 87.28% to 98.75%),
+Maze-Hard (from 83.80% to 86.73%), and ARC-AGI-2 (from 7.36% to 8.47% pass@1).
+Limitations. Our experiments focus on reasoning puzzles rather than general tasks. We only test
+on a subset of PPBench puzzles. We are limited to puzzles with a small grid-size due to limited
+computational resources. It is not guaranteed that the method works as well for all types of problems
+(e.g., accuracy gains on ARC-AGI-2 and Heyawake are smaller).
+Future work. It would be interesting to understand why some puzzles benefit from test-time scaling
+more than others. We suspect that problems that are harder to verify (e.g., ARC-AGI-2) benefit less
+from PTRM because the Q head may struggle to distinguish correct solutions from incorrect ones.
+Developing stronger verifiers than the existing Q head is an interesting direction for future work.
+
+
+References
+ [1] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv
+     preprint arXiv:2510.04871, 2025.
+ [2] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and
+     Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
+ [3] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
+ [4] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
+     agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
+     2025.
+ [5] Zirui Ren and Ziming Liu. Are your reasoning models reasoning or guessing? a mechanistic
+     analysis of hierarchical reasoning models. arXiv preprint arXiv:2601.10679, 2026.
+ [6] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute
+     optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314,
+     2024.
+ [7] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
+     arXiv:1603.08983, 2016.
+ [8] Justin Waugh. Pencil puzzle bench: A benchmark for multi-step verifiable reasoning. arXiv
+     preprint arXiv:2603.02119, 2026.
+ [9] Vast.ai. Rent h100 pcie gpus on vast.ai. https://vast.ai/pricing/gpu/H100-PCIE, 2026.
+     Accessed: 2026-05-01.
+[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser.
+     Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
+[11] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch,
+     Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic
+     recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025.
+[12] Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum,
+     and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with
+     recurrent networks. Advances in Neural Information Processing Systems, 34:6695–6706, 2021.
+[13] Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum,
+     and Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Extrapolation
+     without overthinking. Advances in Neural Information Processing Systems, 35:20232–20242,
+     2022.
+[14] Jay Bear, Adam Prugel-Bennett, and Jonathon Hare. Rethinking deep thinking: Stable learning
+     of algorithms using lipschitz constraints. Advances in Neural Information Processing Systems,
+     37:97027–97052, 2024.
+[15] Navid Hakimi. Form follows function: Recursive stem model. arXiv preprint arXiv:2603.15641,
+     2026.
+[16] Yinxi Li, Jiaao Chen, Fang Wu, Jiakai Yu, Heli Qi, Weihao Xuan, Haokai Zhao, Pengyu Nie,
+     Di Jin, and Xiangru Tang. Learning multi-step reasoning via persistent latent state propagation.
+     In Workshop on Latent & Implicit Thinking – Going Beyond CoT Reasoning,
+     2026.
+[17] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong
+     Tian. Training large language models to reason in a continuous latent space. arXiv preprint
+     arXiv:2412.06769, 2024.
+[18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
+     Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
+     Advances in neural information processing systems, 35:24824–24837, 2022.
+[19] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning
+     by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint
+     arXiv:2505.12514, 2025.
+[20] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny
+     Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring
+     faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
+[21] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul-
+     man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don’t
+     always say what they think. arXiv preprint arXiv:2505.05410, 2025.
+[22] Junyeob Baek, Mingyu Jo, Minsu Kim, Yoshua Bengio, and Sungjin Ahn. Generative recursive
+     reasoning models. ICLR 2026 Workshop on AI with Recursive Self-Improvement, 2026.
+[23] Andreas Efstathiou and Aishwarya Balwani. Recursive reasoning as attractor landscape search:
+     Mechanistic dynamics of the tiny recursive model. Workshop on Latent & Implicit Think-
+     ing – Going Beyond CoT Reasoning, 2026. URL https://openreview.net/forum?id=
+     kKps9W1K7n.
+
+A     Implementation Details
+
+A.1   Compute
+
+We train and evaluate all models on a single NVIDIA H100 80GB GPU. PTRM introduces no
+additional training cost over standard TRM since it operates entirely at inference time.
+
+A.2   Models
+
+All experiments use the standard TRM backbone [1] with the released architecture and training recipes.
+Following the TRM paper, we use the MLP variant (TRM-MLP, 5M parameters) for Sudoku-Extreme
+and the attention variant (TRM-Att, 7M parameters) for Maze-Hard, ARC-AGI-2, and PPBench.
+Layout and hyperparameters are unchanged from TRM.
+
+A.3   PPBench dataset construction
+
+Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 use the same checkpoints and data splits as TRM.
+The PPBench dataset is more recent and has previously been used only with frontier LLMs, so we
+detail how we built our training, validation, and golden splits.
+
+Source. PPBench contains 62,231 constraint-satisfaction pencil puzzles spanning 94 puzzle types.
+Of these, 300 puzzles (15 puzzles × 20 types) are held out as the golden benchmark set by Waugh [8].
+
+Filtering. From the remaining 61,931 puzzles we hold out a validation set by sampling 100 puzzles
+from each puzzle type (50 for tapa, due to its smaller base size), and the rest forms the training
+set. We then filter all three sets (training, validation, golden) to retain only puzzles of six types
+(sudoku, lightup, nurikabe, shakashaka, heyawake, tapa) at fixed grid sizes: 9×9 for sudoku
+and 10×10 for the others. Sudoku grids are padded with a pad token to 10×10, giving a uniform
+sequence length of seq_len = 100 across all six puzzle types. The deterministic TRM baseline
+reaches 100% accuracy on shakashaka, so we exclude it from per-puzzle accuracy reporting (no
+headroom to compare against PTRM).
+
+Augmentation. Each training puzzle is expanded into 10 examples using two augmentations: 1)
+trajectory sampling, where the input is set to a random intermediate solve state along the puzzle’s
+solution trajectory rather than always the empty initial grid, while the label is always the fully solved
+grid; and 2) dihedral transformation, where a random dihedral transformation of a square grid, among
+the 8 possibilities given by 4 rotations × 2 {identity, reflection}, is applied to both the input and the
+label. For each puzzle, the first example is the unaugmented (initial state, solved) pair. The remaining
+9 are randomly sampled (trajectory and dihedral transform). Validation and golden splits are not
+augmented.
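+
+The dihedral augmentation above can be sketched in a few lines. This is a minimal illustration, not
+the authors' pipeline: the helper names (rotate90, reflect, dihedral, augment_pair) are hypothetical,
+and grids are assumed to be square lists of rows. The only property taken from the text is that one
+of the 8 transforms (4 rotations × 2 {identity, reflection}) is applied identically to input and label.
+
```python
import random


def rotate90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def reflect(grid):
    """Mirror a grid left-to-right."""
    return [list(row[::-1]) for row in grid]


def dihedral(grid, k):
    """Apply transform k in 0..7: (k % 4) quarter turns, then a reflection if k >= 4."""
    out = [list(row) for row in grid]
    for _ in range(k % 4):
        out = rotate90(out)
    return reflect(out) if k >= 4 else out


def augment_pair(inp, label, rng):
    """Draw one dihedral transform and apply it to both the input and the label."""
    k = rng.randrange(8)
    return dihedral(inp, k), dihedral(label, k)


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical seed, for reproducibility only
    puzzle = [[1, 2], [3, 4]]
    solved = [[5, 6], [7, 8]]
    print(augment_pair(puzzle, solved, rng))
```
+
+Sharing a single index k between the input and the label keeps each augmented pair consistent, which
+is the requirement stated in the augmentation description.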
+
+Resulting splits. The merged multi-type splits use a unified vocabulary of 294 tokens and seq_len =
+100. Per-type sample counts are reported in Table 4.
+
+                               puzzle type        train    val    golden
+                               sudoku           7,810       97        15
+                               lightup          9,504       65         8
+                               nurikabe        15,180       55         9
+                               heyawake        42,108       70         7
+                               tapa             3,663       26        10
+                               shakashaka∗     20,702       62        12
+                               total           98,967     375         61
+Table 4: Per-puzzle-type sample counts in the PPBench splits used in training and evaluation.
+∗ Shakashaka is included in training but excluded from per-puzzle accuracy reporting because
+deterministic TRM already solves all evaluated shakashaka puzzles.
+
+B     Noise Ablation
+
+  We ablate the inference noise level σ on three benchmarks at K=25 (K=100 for Maze-Hard) and
+  D=16 to keep the sweep tractable. For Sudoku-Extreme we randomly sample 1000 puzzles from the
+  test set for the same reason. Figure 7 shows pass@K, best-Q@K, and mode@K as a function of σ,
+  averaged over three random seeds.
+
+  [Figure 7 plot: three panels (Sudoku-Extreme, Maze-Hard, ARC-AGI-2 (within-aug)); x-axis: σ from
+  0.0 to 1.0; y-axis: accuracy (%); series: pass@K, best-Q@K, mode@K, and the K = 1 baseline.]
+
+  Figure 7: pass@K, best-Q@K, and mode@K across σ per rollout batch. On every task,
+  increasing the inference noise consistently produces more correct rollouts (pass@K, blue) up to
+  a task-dependent σ value. The Q head (best-Q@K, orange) tracks the pass@K ceiling closely
+  on Sudoku-Extreme and leaves a larger gap on Maze-Hard and ARC-AGI-2. The shaded region
+  represents the verifier headroom (accuracy that a better verifier could extract). mode@K (green) has
+  the edge over the Q head only on Maze-Hard. For ARC-AGI-2, metrics are per puzzle/augmentation
+  to isolate the Q head’s verification abilities from the augmentation pipeline.
+
+  On Maze-Hard pass@K climbs from 83.8% (deterministic) to nearly 96% by σ≈1.0 and then
+  plateaus. On Sudoku-Extreme it is already near its ceiling at σ=0.1 and stays roughly flat across the
+  sweep. On ARC-AGI-2 it peaks near σ=0.6 before declining. Q head selection nearly matches the
+  ceiling (maximum pass@K) on Sudoku-Extreme, where best-Q@K peaks at 98.5% (within a point of
+  pass@K’s peak of 99.3%). By contrast, the gap between best-Q@K and maximum pass@K is more
+  pronounced on Maze-Hard and ARC-AGI-2; this gap is headroom that a stronger verifier could close.
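+
+  The three metrics compared in this ablation can be computed per puzzle as sketched below. The
+  definitions are assumptions inferred from the discussion, since this appendix does not restate
+  them: pass@K is taken to mean that any of the K rollouts is correct, best-Q@K that the rollout
+  with the highest Q value is correct, and mode@K that the most frequent answer is correct. The
+  function name rollout_metrics and its argument layout are hypothetical.
+
```python
from collections import Counter


def rollout_metrics(answers, q_values, target):
    """Score K rollouts of one puzzle.

    answers:  list of K hashable decoded answers
    q_values: list of K Q-head scores, aligned with answers
    target:   the correct answer
    """
    # pass@K: at least one rollout is correct (oracle selection).
    pass_at_k = any(a == target for a in answers)
    # best-Q@K: the rollout preferred by the Q head is correct.
    best_idx = max(range(len(answers)), key=lambda i: q_values[i])
    best_q_at_k = answers[best_idx] == target
    # mode@K: the most frequent answer is correct (majority vote).
    mode_answer, _ = Counter(answers).most_common(1)[0]
    mode_at_k = mode_answer == target
    return {"pass@K": pass_at_k, "best-Q@K": best_q_at_k, "mode@K": mode_at_k}


if __name__ == "__main__":
    print(rollout_metrics(["a", "b", "b", "c"], [0.9, 0.2, 0.3, 0.1], "a"))
```
+
+  Under these definitions pass@K upper-bounds the other two metrics on every puzzle, which is why
+  the distance from best-Q@K to pass@K can be read as verifier headroom.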
+
+
+C     Q-guided Langevin sampling
+
+  We initially explored Langevin sampling as a more principled exploration mechanism than the
+  Gaussian noise injection used in PTRM. The idea is to better guide the stochastic search by
+  additionally steering each rollout, via the Q head gradient, toward regions of high Q value. We
+  ultimately found that the gain from this approach was entirely attributable to the Langevin noise
+  term, with the gradient component contributing nothing measurable on top of the equivalent
+  recurrent noise of Sec. 4. We document the approach here as a negative result.
+
+  Motivation. The Q head is trained as a correctness predictor over latent states. Let fQ (z) denote
+  the head’s scalar output. We treated E(z) = − log sigmoid(fQ (z)) as an energy function over latent
+  space. Empirical observations during early experiments suggested that regions of low E correspond
+  to good basins from which the decoded answer is likely correct. PCA visualizations of the latent
+  dynamics showed that ∇z fQ points toward the good-basin region from both good-basin (correct) and
+  bad-basin (incorrect) latents (Figure 8). This made ∇z fQ look like a valuable direction along which
+  to push latents.
+
+  Method. We sample from the target distribution p(z) ∝ e^{−E(z)} = sigmoid(fQ (z)) via Langevin
+  dynamics: at the end of each deep recursion step t = 1, . . . , D, we apply N Langevin steps to
+  the latent,
+
+                           z ← z − η ∇z E(z) + √(2η) ξ,    ξ ∼ N (0, I).
+
+  The number of Langevin steps N is the additional scaling axis under this scheme.
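+
+  A single Langevin update of this form can be sketched as follows. This is an illustration under
+  stated assumptions, not the implementation used in the experiments: the Q head is replaced by a
+  hypothetical linear toy head f_q(z) = w·z + b so that ∇z E has a closed form, the latent is a plain
+  list of floats, and all function names are invented for the example. The use_gradient flag mirrors
+  the noise-only ablation, in which the −η ∇z E(z) term is zeroed out.
+
```python
import math
import random


def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def f_q(z, w, b):
    """Hypothetical linear stand-in for the Q head: f_q(z) = w . z + b."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def energy(z, w, b):
    """E(z) = -log sigmoid(f_q(z))."""
    return -math.log(sigmoid(f_q(z, w, b)))


def grad_energy(z, w, b):
    """Gradient of E for the linear toy head: -(1 - sigmoid(f_q(z))) * w."""
    s = sigmoid(f_q(z, w, b))
    return [-(1.0 - s) * wi for wi in w]


def langevin_step(z, w, b, eta, rng, use_gradient=True):
    """One update z <- z - eta * grad E(z) + sqrt(2 * eta) * xi, with xi ~ N(0, I)."""
    grad = grad_energy(z, w, b) if use_gradient else [0.0] * len(z)
    noise_scale = math.sqrt(2.0 * eta)
    return [zi - eta * gi + noise_scale * rng.gauss(0.0, 1.0)
            for zi, gi in zip(z, grad)]


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical seed
    z = [0.0, 0.0]
    for _ in range(5):  # N = 5 Langevin steps
        z = langevin_step(z, w=[1.0, -2.0], b=0.0, eta=0.01, rng=rng)
    print(z)
```
+
+  Passing use_gradient=False keeps the √(2η) ξ noise term while dropping the drift, which is the
+  comparison needed to separate the contribution of the noise from that of the gradient.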
+
+[Figure 8 plot: four panels at t = 0, t = 5, t = 10, and t = 15; legend entries: Correct (21),
+Incorrect (4), Q.]
+
+Figure 8: y latents and their ∇z fQ gradients projected into the principal plane at several recur-
+sive/supervision steps, for multiple rollouts (using recurrent noise) of a single puzzle (correct rollouts
+in green, incorrect in red). Arrows are drawn at each latent in the direction of ∇z fQ . From both
+good-basin and bad-basin latents, gradients point toward the good-basin region. This visualization
+motivated the Langevin sampling experiment described in this appendix.
+
+
+Tractable gradient computation. TRM’s original Q head is a linear projection on a single token,
+fQ (y) = w⊤ y[:, 0]+b, so its gradient with respect to this head’s input is a constant vector independent
+of z. For ∇z fQ to be input-dependent, the gradient must flow back through the last latent recursion.
+This works but requires backpropagating through a full latent recursion at every Langevin step, which
+scales poorly with N . To make guidance tractable for large N , we replaced the linear Q head with
+an attention-pooled variant that reads the full latent and produces a scalar through a small nonlinear
+network. With this head, ∇z fQ can be computed by backpropagating through the head alone, which
+is ∼8× faster per step and does not sacrifice accuracy.
+
+The gain came from the noise, not the gradient. Comparing Langevin sampling against a noise-only
+ablation (with the same √(2η) ξ term, but with the −η ∇z E(z) term zeroed out) produced essentially
+identical accuracy at matched N . The gradient component contributed nothing measurable on
+top of the equivalent recurrent noise. This prompted us to focus on the noise-only formulation in
+Sec. 4, which is much more impactful since it is: 1) significantly simpler (no retraining, no test-time
+backpropagation), 2) applicable to any TRM checkpoint out of the box, and 3) equally effective.
+
+D      Per-puzzle accuracy on the PPBench validation set
+The main paper reports per-puzzle accuracy on the PPBench golden set (Table 1) for direct compara-
+bility with the LLM evaluations from Waugh [8] who used that set. For a lower-variance complement,
+Table 5 reports results on our validation set (313 puzzles across the five reported types vs. 49 for
+golden). Trends match the golden-set results: depth scaling alone (K=1, D=48) provides a small lift,
+and combining depth with stochastic rollouts (K=100, D=48, σ=0.2) raises aggregate best-Q@K
+from 76.4% to 90.4%, a 14.0 percentage-point improvement. The biggest gains again are on puzzles
+where the deterministic baseline has the most headroom (tapa ∼ 40% to 71.8%, sudoku ∼ 69%
+to 93.3%). Types where the baseline is already near ceiling (heyawake at 96.7%) increase only
+marginally.
+
+  % accuracy                            # Params sudoku lightup nurikabe heyawake           tapa   agg.
+  Direct prediction                         27M      0.0     10.0         4.0     14.0      0.0  6.2
+  TRM (K=1, D=16)                           7M      68.7     83.3        76.0     96.7     39.7 76.4
+  TRM (K=1, D=48)                           7M      74.0     84.0        76.7     98.0     41.0 78.3
+  PTRM, best-Q@K (K=100, D=48)              7M      93.3     93.3        84.7     100      71.8 90.4
+Table 5: PPBench per-puzzle accuracy on the validation set. PTRM uses the same backbone as the
+deterministic TRM. Results on the larger validation set follow the same trends as on the golden set.
+\ No newline at end of file
diff --git a/papers/txt/trm2025_tiny_recursive.txt b/papers/txt/trm2025_tiny_recursive.txt
new file mode 100644
index 0000000..55cf994
--- /dev/null
+++ b/papers/txt/trm2025_tiny_recursive.txt
@@ -0,0 +1,796 @@
+                                                       Less is More: Recursive Reasoning with Tiny Networks
+
+
+
+                                                                                    Alexia Jolicoeur-Martineau
+                                                                                     Samsung SAIL Montréal
+                                                                                      alexia.j@samsung.com
+
+
+                                                               Abstract
+arXiv:2510.04871v1 [cs.LG] 6 Oct 2025
+
+
+
+
+                                            Hierarchical Reasoning Model (HRM) is a
+                                            novel approach using two small neural net-
+                                            works recursing at different frequencies. This
+                                            biologically inspired method beats Large Lan-
+                                            guage models (LLMs) on hard puzzle tasks
+                                            such as Sudoku, Maze, and ARC-AGI while
+                                            trained with small models (27M parameters)
+                                            on small data (∼ 1000 examples). HRM holds
+                                            great promise for solving hard problems with
+                                            small networks, but it is not yet well un-
+                                            derstood and may be suboptimal. We pro-
+                                            pose Tiny Recursive Model (TRM), a much
+                                            simpler recursive reasoning approach that
+                                            achieves significantly higher generalization
+                                            than HRM, while using a single tiny network
+                                            with only 2 layers. With only 7M parameters,
+                                            TRM obtains 45% test-accuracy on ARC-AGI-
+                                            1 and 8% on ARC-AGI-2, higher than most
+                                            LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5
+                                            Pro) with less than 0.01% of the parameters.
+
+
+
+1. Introduction
+
+While powerful, Large Language models (LLMs) can struggle on hard question-answer problems. Given
+that they generate their answer auto-regressively, there is a high risk of error since a single
+incorrect token can render an answer invalid. To improve their reliability, LLMs rely on
+Chain-of-thoughts (CoT) (Wei et al., 2022) and Test-Time Compute (TTC) (Snell et al., 2024). CoTs
+seek to emulate human reasoning by having the LLM sample step-by-step reasoning traces prior to
+giving their answer. Doing so can improve accuracy, but CoT is expensive, requires high-quality
+reasoning data (which may not be available), and can be brittle since the generated reasoning may
+be wrong. To further improve reliability, test-time compute can be used by reporting the most
+common answer out of K or the highest-reward answer (Snell et al., 2024).
+
+Figure 1. Tiny Recursion Model (TRM) recursively improves its predicted answer y with a tiny
+network. It starts with the embedded input question x and initial embedded answer y, and latent z.
+For up to N_sup = 16 improvement steps, it tries to improve its answer y. It does so by i)
+recursively updating n times its latent z given the question x, current answer y, and current
+latent z (recursive reasoning), and then ii) updating its answer y given the current answer y and
+current latent z. This recursive process allows the model to progressively improve its answer
+(potentially addressing any errors from its previous answer) in an extremely parameter-efficient
+manner while minimizing overfitting.
+
+                                      Recursive Reasoning with Tiny Networks
+
+However, this may not be enough. LLMs with CoT and TTC are not enough to beat every problem. While
+LLMs have made significant progress on ARC-AGI (Chollet, 2019) since 2019, human-level accuracy
+still has not been reached (6 years later, as of writing of this paper). Furthermore, LLMs struggle
+on the newer ARC-AGI-2 (e.g., Gemini 2.5 Pro only obtains 4.9% test accuracy with a high amount of
+TTC) (Chollet et al., 2025; ARC Prize Foundation, 2025b).
+
+An alternative direction has recently been proposed by Wang et al. (2025). They propose a new way
+forward through their novel Hierarchical Reasoning Model (HRM), which obtains high accuracy on
+puzzle tasks where LLMs struggle to make a dent (e.g., Sudoku solving, Maze pathfinding, and
+ARC-AGI). HRM is a supervised learning model with two main novelties: 1) recursive hierarchical
+reasoning, and 2) deep supervision.
+
+Recursive hierarchical reasoning consists of recursing multiple times through two small networks
+(f_L at high frequency and f_H at low frequency) to predict the answer. Each network generates a
+different latent feature: f_L outputs z_L and f_H outputs z_H. Both features (z_L, z_H) are used as
+input to the two networks. The authors provide some biological arguments in favor of recursing at
+different hierarchies based on the different temporal frequencies at which the brain operates and
+hierarchical processing of sensory inputs.
+
+Deep supervision consists of improving the answer through multiple supervision steps while carrying
+the two latent features as initialization for the improvement steps (after detaching them from the
+computational graph so that their gradients do not propagate). This provides residual connections,
+which emulate very deep neural networks that are too memory expensive to apply in one forward pass.
+
+An independent analysis on the ARC-AGI benchmark showed that deep supervision seems to be the
+primary driver of the performance gains (ARC Prize Foundation, 2025a). Using deep supervision
+doubled accuracy over single-step supervision (going from 19% to 39% accuracy), while recursive
+hierarchical reasoning only slightly improved accuracy over a regular model with a single forward
+pass (going from 35.7% to 39.0% accuracy). This suggests that reasoning across different
+supervision steps is worth it, but the recursion done in each supervision step is not particularly
+important.
+
+In this work, we show that the benefit from recursive reasoning can be massively improved, making
+it much more than incremental. We propose Tiny Recursive Model (TRM), an improved and simplified
+approach using a much smaller tiny network with only 2 layers that achieves significantly higher
+generalization than HRM on a variety of problems. In doing so, we improve the state-of-the-art test
+accuracy on Sudoku-Extreme from 55% to 87%, Maze-Hard from 75% to 85%, ARC-AGI-1 from 40% to 45%,
+and ARC-AGI-2 from 5% to 8%.
+
+2. Background
+
+HRM is described in Algorithm 2. We discuss the details of the algorithm further below.
+
+2.1. Structure and goal
+
+The focus of HRM is supervised learning. Given an input, produce an output. Both input and output
+are assumed to have shape [B, L] (when the shape differs, padding tokens can be added), where B is
+the batch-size and L is the context-length.
+
+HRM contains four learnable components: the input embedding f_I(·; θ_I), low-level recurrent
+network f_L(·; θ_L), high-level recurrent network f_H(·; θ_H), and the output head f_O(·; θ_O).
+Once the input is embedded, the shape becomes [B, L, D] where D is the embedding size. Each network
+is a 4-layer Transformers architecture (Vaswani et al., 2017), with RMSNorm (Zhang & Sennrich,
+2019), no bias (Chowdhery et al., 2023), rotary embeddings (Su et al., 2024), and SwiGLU activation
+function (Hendrycks & Gimpel, 2016; Shazeer, 2020).
+
+2.2. Recursion at two different frequencies
+
+Given the hyperparameters used by Wang et al. (2025) (n = 2 f_L steps, 1 f_H step; done T = 2
+times), a forward pass of HRM is done as follows:
+
+      x   ← f_I(x̃)
+      z_L ← f_L(z_L + z_H + x)   # without gradients
+      z_L ← f_L(z_L + z_H + x)   # without gradients
+      z_H ← f_H(z_L + z_H)       # without gradients
+      z_L ← f_L(z_L + z_H + x)   # without gradients
+      z_L ← z_L.detach()
+      z_H ← z_H.detach()
+      z_L ← f_L(z_L + z_H + x)   # with gradients
+      z_H ← f_H(z_L + z_H)       # with gradients
+      ŷ   ← argmax(f_O(z_H))
+
+where ŷ is the predicted output answer, z_L and z_H are either initialized embeddings or the
+embeddings of the previous deep supervision step (after detaching them from the computational
+graph). As can be seen,
+a forward pass of HRM consists of applying 6 function evaluations, where the first 4 function
+evaluations are detached from the computational graph and are not back-propagated through. The
+authors use n = 2 with T = 2 in all experiments, but HRM can be generalized by allowing for an
+arbitrary number of L steps (n) and recursions (T) as shown in Algorithm 2.
+
+    def hrm(z, x, n=2, T=2):  # hierarchical reasoning
+        zH, zL = z
+        with torch.no_grad():
+            for i in range(n * T - 2):
+                zL = L_net(zL, zH, x)
+                if (i + 1) % T == 0:
+                    zH = H_net(zH, zL)
+        # 1-step grad
+        zL = L_net(zL, zH, x)
+        zH = H_net(zH, zL)
+        return (zH, zL), output_head(zH), Q_head(zH)
+
+    def ACT_halt(q, y_hat, y_true):
+        target_halt = (y_hat == y_true)
+        loss = 0.5 * binary_cross_entropy(q[0], target_halt)
+        return loss
+
+    def ACT_continue(q, last_step):
+        if last_step:
+            target_continue = sigmoid(q[0])
+        else:
+            target_continue = sigmoid(max(q[0], q[1]))
+        loss = 0.5 * binary_cross_entropy(q[1], target_continue)
+        return loss
+
+    # Deep Supervision
+    for x_input, y_true in train_dataloader:
+        z = z_init
+        for step in range(N_sup):  # deep supervision
+            x = input_embedding(x_input)
+            z, y_pred, q = hrm(z, x)
+            loss = softmax_cross_entropy(y_pred, y_true)
+            # Adaptive computational time (ACT) using Q-learning
+            loss += ACT_halt(q, y_pred, y_true)
+            _, _, q_next = hrm(z, x)  # extra forward pass
+            loss += ACT_continue(q_next, step == N_sup - 1)
+            z = z.detach()
+            loss.backward()
+            opt.step()
+            opt.zero_grad()
+            if q[0] > q[1]:  # early-stopping
+                break
+
+Figure 2. Pseudocode of Hierarchical Reasoning Models (HRMs).
+
+2.3. Fixed-point recursion with 1-step gradient approximation
+
+Assuming that (z_L, z_H) reaches a fixed-point (z*_L, z*_H) through recursing from both f_L and
+f_H,
+
+      z*_L ≈ f_L(z*_L + z_H + x)
+      z*_H ≈ f_H(z_L + z*_H),
+
+the Implicit Function Theorem (Krantz & Parks, 2002) with the 1-step gradient approximation (Bai et
+al., 2019) is used to approximate the gradient by back-propagating only the last f_L and f_H steps.
+This theorem is used to justify only tracking the gradients of the last two steps (out of 6), which
+greatly reduces memory demands.
+
+2.4. Deep supervision
+
+To improve effective depth, deep supervision is used. This consists of reusing the previous latent
+features (z_H and z_L) as initialization for the next forward pass. This allows the model to reason
+over many iterations and improve its latent features (z_L and z_H) until it (hopefully) converges
+to the correct solution. At most N_sup = 16 supervision steps are used.
+
+2.5. Adaptive computational time (ACT)
+
+With deep supervision, each mini-batch of data samples must be used for N_sup = 16 supervision
+steps before moving to the next mini-batch. This is expensive, and there is a balance to be reached
+between optimizing a few data examples for many supervision steps versus optimizing many data
+examples with fewer supervision steps. To reach a better balance, a halting mechanism is
+incorporated to determine whether the model should terminate early. It is learned through a
+Q-learning objective that requires passing z_H through an additional head and running an
+additional forward pass (to determine if halting now rather than later would have been preferable).
+They call this method Adaptive computational time (ACT). It is only used during training, while the
+full N_sup = 16 supervision steps are done at test time to maximize downstream performance. ACT
+greatly diminishes the time spent per example (on average spending less than 2 steps on the
+Sudoku-Extreme dataset rather than the full N_sup = 16 steps), allowing more coverage of the
+dataset given a fixed number of training iterations.
+
+2.6. Deep supervision and 1-step gradient approximation replace BPTT
+
+Deep supervision and the 1-step gradient approximation provide a more biologically plausible and
+less computationally expensive alternative to Backpropagation Through Time (BPTT) (Werbos, 1974;
+Rumelhart et al., 1985; LeCun, 1985) for solving the temporal credit assignment (TCA) (Rumelhart et
+al., 1985; Werbos, 1988; Elman, 1990) problem (Lillicrap & Santoro, 2019). The implication is that
+HRM can learn what would normally require an extremely large network without having to
+back-propagate through its entire depth. Given the hyperparameters used by Jang et al. (2023) in
+all their experiments, HRM effectively reasons over n_layers (n + 1) T N_sup = 4 × (2 + 1) × 2 × 16
+= 384 layers of effective depth.
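+
+The effective-depth figure quoted in Section 2.6 is a plain product, and the short sketch below
+makes the arithmetic explicit. The function name effective_depth and its argument names are
+hypothetical; only the formula n_layers (n + 1) T N_sup and the values 4, 2, 2, and 16 come from
+the text.
+
```python
def effective_depth(n_layers: int, n: int, T: int, n_sup: int) -> int:
    """Effective depth n_layers * (n + 1) * T * N_sup.

    n_layers: layers per network
    n:        low-level steps per cycle (plus one high-level step, hence n + 1)
    T:        number of cycles per forward pass
    n_sup:    number of deep supervision steps
    """
    return n_layers * (n + 1) * T * n_sup


if __name__ == "__main__":
    # Hyperparameters quoted in Section 2.6.
    print(effective_depth(n_layers=4, n=2, T=2, n_sup=16))
```
+
+Each factor scales the result linearly, so halving the number of supervision steps halves the
+effective depth under this accounting.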
+2.7. Summary of HRM                                               different from the much smaller n = 2 and T = 2 used
+                                                                  in every experiment of their paper, we observe the
+HRM leverages recursion from two networks at dif-
+                                                                  following:
+ferent frequencies (high frequency versus low fre-
+quency) and deep supervision to learn to improve
+its answer over multiple supervision steps (with ACT               1. the residual for z H is clearly well above 0 at every
+to reduce time spent per data example). This enables                  step
+the model to imitate extremely large depth without
+requiring backpropagation through all layers. This
+approach obtains significantly higher performance on               2. the residual for z L only becomes closer to 0 after
+hard question-answer tasks that regular supervised                    many cycles, but it remains significantly above 0
+models struggle with. However, this method is quite
+complicated, relying a bit too heavily on uncertain
+biological arguments and fixed-point theorems that                 3. z L is very far from converged after one f L evalu-
+are not guaranteed to be applicable. In the next sec-                 ation at T cycles, which is when the fixed-point
+tion, we discuss those issues and potential targets for               is assumed to be reached and the 1-step gradient
+improvements in HRM.                                                  approximation is used
+
+3. Target for improvements in Hierarchical                        Thus, while the application of the IFT theorem and
+   Reasoning Models                                               1-step gradient approximation to HRM has some basis
+                                                                  since the residuals do generally reduce over time, a
+In this section, we identify key targets for improve-
+                                                                  fixed point is unlikely to be reached when the theorem
+ments in HRM, which will be addressed by our pro-
+                                                                  is actually applied.
+posed method, Tiny Recursion Models (TRMs).
+                                                                  In the next section, we show that we can bypass the
+3.1. Implicit Function Theorem (IFT) with 1-step                  need for the IFT theorem and 1-step gradient approxi-
+     gradient approximation                                       mation, thus bypassing the issue entirely.
+
+HRM only back-propagates through the last 2 of the 6              3.2. Twice the forward passes with Adaptive
+recursions. The authors justify this by leveraging the                 computational time (ACT)
+Implicit Function Theorem (IFT) and one-step approx-
+imation, which states that when a recurrent function              HRM uses Adaptive computational time (ACT) during
+converges to a fixed point, backpropagation can be                training to optimize the time spent on each data
+applied in a single step at that equilibrium point.               sample. Without ACT, Nsup = 16 supervision steps would
+                                                                  be spent on the same data sample, which is highly in-
+There are concerns about applying this theorem to
+                                                                  efficient. They implement ACT through an additional
+HRM. Most importantly, there is no guarantee that
+                                                                  Q-learning objective, which decides when to halt and
+a fixed-point is reached. Deep equilibrium models
+                                                                  move to a new data sample rather than keep iterating
+normally do fixed-point iteration to solve for the fixed
+                                                                  on the same data. This allows much more efficient
+point z∗ = f (z∗ ) (Bai et al., 2019). However, in the case
+                                                                  use of time especially since the average number of su-
+of HRM, it is not iterating to the fixed-point but simply
+                                                                  pervision steps during training is quite low with ACT
+doing forward passes of f L and f H . To make matters
+                                                                  (less than 2 steps on the Sudoku-Extreme dataset as
+worse, HRM is only doing 4 recursions before stopping
+                                                                  per their reported numbers).
+to apply the one-step approximation. After its first
+loop of two f L and 1 f H evaluations, it only applies a          However, ACT comes at a cost. This cost is not directly
+single f L evaluation before assuming that a fixed-point          shown in the HRM paper, but it is shown in their
+is reached for both z L and z H (z∗L = f L (z∗L + z H + x )       official code. The Q-learning objective relies on a halting
+and z∗H = f H (z∗L + z∗H )). Then, the one-step gradient          loss and a continue loss. The continue loss requires an
+approximation is applied to both latent variables in              extra forward pass through HRM (with all 6 function
+succession.                                                       evaluations). This means that while ACT optimizes
+                                                                  time more efficiently per sample, it requires 2 forward
+The authors justify that a fixed-point is reached by
+                                                                  passes per optimization step. The exact formulation is
+depicting an example with n = 7 and T = 7 where
+                                                                  shown in Algorithm 2.
+the forward residuals are reduced over time (Figure 3
+in Wang et al. (2025)). Even in this setting, which is            In the next section, we show that we can bypass the
+                                                                  need for two forward passes in ACT.
+
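+For reference, the argument discussed in Section 3.1 can be written out for a generic recurrent
+map. The sketch below is a standard derivation, not taken from HRM or its code; theta (the
+parameters), J_f (the Jacobian of f with respect to z) and the residual r_k are notation
+introduced here for illustration, and f is assumed differentiable with I - J_f invertible at
+the fixed point.
+
+```latex
+\begin{align*}
+z^{*} &= f(z^{*}, \theta)
+  && \text{(fixed-point condition)} \\
+\frac{\partial z^{*}}{\partial \theta}
+  &= \left(I - J_f\right)^{-1} \frac{\partial f}{\partial \theta},
+  \qquad J_f = \left.\frac{\partial f}{\partial z}\right|_{z^{*}}
+  && \text{(Implicit Function Theorem)} \\
+\frac{\partial z^{*}}{\partial \theta}
+  &\approx \frac{\partial f}{\partial \theta}
+  && \text{(1-step approximation: } (I - J_f)^{-1} \approx I \text{)} \\
+r_k &= \lVert f(z_k, \theta) - z_k \rVert
+  && \text{(forward residual; } r_k = 0 \text{ only at a fixed point)}
+\end{align*}
+```
+
+Both gradient expressions are derived at z∗, so they presuppose that the residual is close to 0
+at the step where backpropagation is applied, which is the condition questioned in Section 3.1.
+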
+
+3.3. Hierarchical interpretation based on complex                        of 2 passes). Our approach is described in Algorithm 3
+     biological arguments                                                and illustrated in Figure 1. We also provide an ablation
+                                                                         in Table 1 on the Sudoku-Extreme dataset (a dataset
+The HRM’s authors justify the two latent variables
+                                                                         of difficult Sudokus with only 1K training examples,
+and two networks operating at different hierarchies
+                                                                         but 423K test examples). Below, we explain the key
+based on biological arguments, which are very far
+                                                                         components of TRMs.
+from artificial neural networks. They even try to match
+HRM to actual brain experiments on mice. While in-
+teresting, this sort of explanation makes it incredibly                  Table 1. Ablation of TRM on Sudoku-Extreme comparing %
+hard to parse out why HRM is designed the way it                         Test accuracy, effective depth per supervision step ( T (n +
+is. Given the lack of an ablation table in their paper, the              1)nlayers ), number of Forward Passes (NFP) per optimization
+over-reliance on biological arguments and fixed-point                    step, and number of parameters
+theorems (that are not perfectly applicable), it is hard                  Method                Acc (%) Depth NFP # Params
+to determine which parts of HRM are helping what and                      HRM                    55.0    24    2     27M
+why. Furthermore, it is not clear why they use two                        TRM (T = 3, n = 6) 87.4        42    1     5M
+latent features rather than other combinations of fea-                    w/ ACT                 86.1    42    2      5M
+tures.                                                                    w/ separate f H , f L  82.4    42    1     10M
+                                                                          no EMA                 79.9    42    1      5M
+In the next section, we show that the recursive process                   w/ 4-layers, n = 3     79.5    48    1     10M
+can be greatly simplified and understood in a much                        w/ self-attention      74.7    42    1      7M
+simpler manner that does not require any biological                       w/ T = 2, n = 2        73.7    12    1     5M
+argument, fixed-point theorem, hierarchical interpre-                     w/ 1-step gradient 56.5        42    1      5M
+tation, nor using two networks. It also explains why 2
+is the optimal number of features (z L and z H ).
+                                                                         4.1. No fixed-point theorem required
+
+def latent_recursion(x, y, z, n=6):                                      HRM assumes that the recursions converge to a fixed-
+    for i in range(n): # latent reasoning                                point for both z L and z H in order to leverage the 1-step
+        z = net(x, y, z)
+    y = net(y, z) # refine output answer                                 gradient approximation (Bai et al., 2019). This allows
+    return y, z
+                                                                         the authors to justify only back-propagating through
+def deep_recursion(x, y, z, n=6, T=3):                                   the last two function evaluations (1 f L and 1 f H ). To
+    # recursing T-1 times to improve y and z (no gradients needed)
+    with torch.no_grad():                                                bypass this theoretical requirement, we define a full
+        for j in range(T-1):                                             recursion process as containing n evaluations of f L
+            y, z = latent_recursion(x, y, z, n)
+    # recursing once to improve y and z                                  and 1 evaluation of f H :
+    y, z = latent_recursion(x, y, z, n)
+    return (y.detach(), z.detach()), output_head(y), Q_head(y)                            z L ← f L (z L + z H + x )
+# Deep Supervision                                                                           ...
+for x_input, y_true in train_dataloader:
+    y, z = y_init, z_init                                                                 z L ← f L (z L + z H + x )
+    for step in range(N_supervision):
+        x = input_embedding(x_input)                                                      z H ← f H (z L + z H ) .
+        (y, z), y_hat, q_hat = deep_recursion(x, y, z)
+        loss = softmax_cross_entropy(y_hat, y_true)
+        loss += binary_cross_entropy(q_hat, (y_hat == y_true))           Then, we simply back-propagate through the full
+        loss.backward()                                                  recursion process.
+        opt.step()
+        opt.zero_grad()
+        if q_hat > 0: # early-stopping                                   Through deep supervision, the model learns to take
+            break                                                        any (z L , z H ) and improve it through a full recursion
+                                                                         process, hopefully making z H closer to the solution.
+Figure 3. Pseudocode of Tiny Recursion Models (TRMs).                    This means that by the design of the deep supervi-
+                                                                         sion goal, running a few full recursion processes (even
+                                                                         without gradients) is expected to bring us closer to the
+4. Tiny Recursion Models                                                 solution. We propose to run T − 1 recursion processes
+                                                                         without gradient to improve (z L , z H ) before running
+In this section, we present Tiny Recursion Models
+                                                                         one recursion process with backpropagation.
+(TRMs). Contrary to HRM, TRM requires no com-
+plex mathematical theorem, hierarchy, nor biological                     Thus, instead of using the 1-step gradient approxi-
+arguments. It generalizes better while requiring only                    mation, we apply a full recursion process containing
+a single tiny network (instead of two medium-size net-                   n evaluations of f L and 1 evaluation of f H . This re-
+works) and a single forward pass for the ACT (instead                    moves entirely the need to assume that a fixed-point
+
+
+is reached and the use of the IFT theorem with 1-step            While this is intuitive, we wanted to verify whether
+gradient approximation. Yet, we can still leverage               using more or fewer features could be helpful. Results
+multiple backpropagation-free recursion processes to             are shown in Table 2.
+improve (z L , z H ). With this approach, we obtain a
+                                                                 More features (> 2): We tested splitting z into dif-
+massive boost in generalization on Sudoku-Extreme
+                                                                 ferent features by treating each of the n recursions as
+(improving TRM from 56.5% to 87.4%; see Table 1).
+                                                                 producing a different zi for i = 1, ..., n. Then, each
+                                                                 zi is carried across supervision steps. The approach
+4.2. Simpler reinterpretation of z H and z L                     is described in Algorithm 5. In doing so, we found
+HRM is interpreted as doing hierarchical reasoning               performance to drop. This is expected because, as dis-
+over two latent features of different hierarchies due to         cussed, there is no apparent need for splitting z into
+arguments from biology. However, one might wonder                multiple parts. It does not have to be hierarchical.
+why use two latent features instead of 1, 3, or more?            Single feature: Similarly, we tested the idea of taking
+And do we really need to justify these so-called ”hier-          a single feature by only carrying z H across supervi-
+archical” features based on biology to make sense of             sion steps. The approach is described in Algorithm 4.
+them? We propose a simple non-biological explana-                In doing so, we found performance to drop. This is
+tion, which is more natural, and directly answers the            expected because, as discussed, it forces the model to
+question of why there are 2 features.                            store the solution y within z.
+The fact of the matter is: z H is simply the current             Thus, we explored using more or fewer latent variables
+(embedded) solution. The embedding is reversed by                on Sudoku-Extreme, but found that having only y and
+applying the output head and rounding to the nearest             z leads to better test accuracy in addition to being the
+token using the argmax operation. On the other hand,             simplest and most natural approach.
+z L is a latent feature that does not directly correspond
+to a solution, but it can be transformed into a solution
+by applying z H ← f H ( x, z L , z H ). We show an example
+on Sudoku-Extreme in Figure 6 to highlight the fact              Table 2. TRM on Sudoku-Extreme comparing % Test
+that z H does correspond to the solution, but z L does           accuracy when using more or fewer latent features
+not.                                                                 Method                 # of features   Acc (%)
+                                                                     TRM y, z (Ours)              2          87.4
+Once this is understood, hierarchy is not needed; there              TRM multi-scale z       n+1 = 7         77.6
+is simply an input x, a proposed solution y (previously              TRM single z                 1          71.9
+called z H ), and a latent reasoning feature z (previously
+called z L ). Given the input question x, current solution
+y, and current latent reasoning z, the model recursively
+improves its latent z. Then, given the current latent z          4.3. Single network
+and the previous solution y, the model proposes a new            HRM uses two networks, one applied frequently as a
+solution y (or stays at the current solution if it is already    low-level module ( f L ) and one applied rarely as a high-
+good).                                                           level module ( f H ). This requires twice the number of
+Although this has no direct influence on the algorithm,          parameters compared to regular supervised learning
+this re-interpretation is much simpler and more natural. It      with a single network.
+answers the question about why two features: remem-              As mentioned previously, while f L iterates on the la-
+bering in context the question x, previous reasoning             tent reasoning feature z (z L in HRM), the goal of f H
+z, and previous answer y helps the model iterate on              is to update the solution y (z H in HRM) given the la-
+the next reasoning z and then the next answer y. If              tent reasoning and current solution. Importantly, since
+we were not passing the previous reasoning z, the                z ← f L ( x + y + z) contains x but y ← f H (y + z) does
+model would forget how it got to the previous                    not contain x, the task to achieve (iterating on z versus
+solution y (since z acts similarly to a chain-of-thought). If    using z to update y) is directly specified by the
+we were not passing the previous solution y, then the            inclusion or lack of x in the inputs. Thus, we considered
+model would forget what solution it had and would                the possibility that both networks could be replaced
+be forced to store the solution y within z instead of            by a single network doing both tasks. In doing so, we
+using it for latent reasoning. Thus, we need both y and          obtain better generalization on Sudoku-Extreme (im-
+z separately, and there is no apparent reason why one            proving TRM from 82.4% to 87.4%; see Table 1) while
+would need to split z into multiple features.                    reducing the number of parameters by half. It turns
+                                                                 out that a single network is enough.
+
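+The recursion described in Sections 4.1 to 4.3 can be summarized with a single network f. The
+block below is a sketch in the notation of this section (x, y, z, n, T); the superscripts and
+the counts N_eval, N_nograd and N_grad are bookkeeping introduced here for illustration,
+evaluated for the TRM setting T = 3, n = 6 with 2 layers.
+
+```latex
+\begin{align*}
+z^{(i+1)} &= f\!\left(x + y + z^{(i)}\right), \qquad i = 0, \dots, n-1
+  && \text{(latent reasoning, input } x \text{ included)} \\
+y' &= f\!\left(y + z^{(n)}\right)
+  && \text{(answer update, no } x \text{)} \\
+N_{\text{eval}} &= T\,(n+1) = 3 \cdot 7 = 21, \\
+N_{\text{nograd}} &= (T-1)\,(n+1) = 14, \qquad
+N_{\text{grad}} = n + 1 = 7, \\
+\text{depth per supervision step} &= N_{\text{eval}} \cdot n_{\text{layers}} = 21 \cdot 2 = 42 .
+\end{align*}
+```
+
+The resulting depth of 42 matches the value listed for TRM (T = 3, n = 6) in Table 1.
+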
+
+4.4. Less is more                                                  4.6. No additional forward pass needed with ACT
+We attempted to increase capacity by increasing the                As previously mentioned, the implementation of ACT
+number of layers in order to scale the model. Sur-                 in HRM through Q-learning requires two forward
+prisingly, we found that adding layers decreased gen-              passes, which slows down training. We propose a
+eralization due to overfitting. In doing the oppo-                 simple solution, which is to get rid of the continue loss
+site, decreasing the number of layers while scaling                (from the Q-learning) and only learn a halting proba-
+the number of recursions (n) proportionally (to keep               bility through a Binary-Cross-Entropy loss of having
+the amount of compute and emulated depth approxi-                  reached the correct solution. By removing the continue
+mately the same), we found that using 2 layers (instead            loss, we remove the need for the expensive second for-
+of 4 layers) maximized generalization. In doing so, we             ward pass, while still being able to determine when to
+obtain better generalization on Sudoku-Extreme (im-                halt with relatively good accuracy. We found no sig-
+proving TRM from 79.5% to 87.4%; see Table 1) while                nificant difference in generalization from this change
+reducing the number of parameters by half (again).                 (going from 86.1% to 87.4%; see Table 1).
+It is quite surprising that smaller networks are bet-
+ter, but 2 layers seems to be the optimal choice. Bai              4.7. Exponential Moving Average (EMA)
+& Melas-Kyriazi (2024) also observed optimal perfor-               On small data (such as Sudoku-Extreme and Maze-
+mance for 2-layers in the context of deep equilibrium              Hard), HRM tends to overfit quickly and then diverge.
+diffusion models; however, they had similar performance            To reduce this problem and improve stability, we
+to the bigger networks, while we instead observe                   integrate an Exponential Moving Average (EMA) of the
+better performance with 2 layers. This may appear                  weights, a common technique in GANs and diffusion
+unusual, as with modern neural networks, generalization            models to improve stability (Brock et al., 2018; Song &
+tends to directly correlate with model sizes.                      Ermon, 2020). We find that it prevents sharp collapse
+However, when data is too scarce and model size is                 and leads to higher generalization (going from 79.9%
+large, there can be an overfitting penalty (Kaplan et al.,         to 87.4%; see Table 1).
+2020). This is likely an indication that there is too little
+data. Thus, using tiny networks with deep recursion                4.8. Optimal number of recursions
+and deep supervision appears to allow us to bypass a
+lot of the overfitting.                                            We experimented with different numbers of recursions
+                                                                   by varying T and n and found that T = 3, n = 3
+4.5. Attention-free architecture for tasks with small              (equivalent to 48 recursions) in HRM and T = 3, n = 6
+     fixed context length                                          in TRM (equivalent to 42 recursions) to lead to optimal
+                                                                   generalization on Sudoku-Extreme. More recursions
+Self-attention is particularly good for long-context               could be helpful for harder problems (we have not
+lengths when L ≫ D since it only requires a matrix of              tested it, given our limited resources); however, in-
+[ D, 3D ] parameters, even though it can account for the           creasing either T or n incurs massive slowdowns. We
+whole sequence. However, when focusing on tasks                    show results at different n and T for HRM and TRM
+where L ≤ D, a linear layer is cheap, requiring only a             in Table 3. Note that TRM requires backpropagation
+matrix of [ L, L] parameters. Taking inspiration from              through a full recursion process, thus increasing n too
+the MLP-Mixer (Tolstikhin et al., 2021), we can replace            much leads to Out Of Memory (OOM) errors.
+the self-attention layer with a multilayer perceptron              However, this memory cost is well worth paying.
+(MLP) applied on the sequence length. Using an MLP
+instead of self-attention, we obtain better generaliza-            In the following section, we show our main results on
+tion on Sudoku-Extreme (improving from 74.7% to                    multiple datasets comparing HRM, TRM, and LLMs.
+87.4%; see Table 1). This worked well on Sudoku 9x9
+grids, given the small and fixed context length; how-              5. Results
+ever, we found this architecture to be suboptimal for
+tasks with large context length, such as Maze-Hard                 Following Wang et al. (2025), we test our approach
+and ARC-AGI (both using 30x30 grids). We show                      on the following datasets: Sudoku-Extreme (Wang
+results with and without self-attention for all                    et al., 2025), Maze-Hard (Wang et al., 2025),
+experiments.                                                       ARC-AGI-1 (Chollet, 2019), and ARC-AGI-2 (Chollet et al., 2025).
+                                                                   Results are presented in Tables 4 and 5. Hyperparame-
+                                                                   ters are detailed in Section 6. Datasets are discussed
+                                                                   below.
+
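+The EMA of the weights mentioned in Section 4.7 follows the usual update rule, sketched below.
+The decay beta and the step index t are notation introduced here for illustration; the value of
+beta used by the authors is not stated in this section, so none is assumed.
+
+```latex
+\begin{align*}
+\theta^{\text{EMA}}_{t} &= \beta\, \theta^{\text{EMA}}_{t-1} + (1 - \beta)\, \theta_{t},
+  \qquad 0 < \beta < 1 .
+\end{align*}
+```
+
+Here theta_t denotes the trained weights after optimization step t; the averaged weights are
+typically the ones used at evaluation time.
+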
+
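+The training signal in Figure 3, including the simplified halting rule of Section 4.6, can be
+written compactly. This is a reading of the pseudocode rather than the authors' exact
+formulation; sigma (the logistic sigmoid) and the indicator notation are introduced here, and
+the halting output is assumed to be a logit, consistent with the early-stopping test in Figure 3.
+
+```latex
+\begin{align*}
+\mathcal{L} &= \mathrm{CE}\!\left(\hat{y},\, y_{\text{true}}\right)
+  + \mathrm{BCE}\!\left(\sigma(\hat{q}),\, \mathbb{1}\!\left[\hat{y} = y_{\text{true}}\right]\right), \\
+\text{halt} &\iff \hat{q} > 0 \iff \sigma(\hat{q}) > \tfrac{1}{2} .
+\end{align*}
+```
+
+Only one forward pass is needed to evaluate both terms, since no continue loss appears.
+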
+
+                                                                 ity of the MLP on large 30x30 grids). TRM with self-
+Table 3. % Test accuracy on Sudoku-Extreme dataset. HRM
+                                                                 attention obtains 85.3% accuracy on Maze-Hard, 44.6%
+versus TRM matched at a similar effective depth per super-
+vision step ( T (n + 1)nlayers )
+                                                                 accuracy on ARC-AGI-1, and 7.8% accuracy on ARC-
+                                                                 AGI-2 with 7M parameters. This is significantly higher
+                  HRM                  TRM
+                                                                 than the 74.5%, 40.3%, and 5.0% obtained by HRM us-
+              n = k, 4 layers     n = 2k, 2 layers
+                                                                 ing 4 times the number of parameters (27M).
+    k   T    Depth Acc (%)        Depth Acc (%)
+    1   1      9         46.4       7        63.2
+    2   2      24        55.0       20       81.9                Table 4. % Test accuracy on Puzzle Benchmarks (Sudoku-
+    3   3      48        61.6       42       87.4                Extreme and Maze-Hard)
+    4   4      80        59.5       72       84.2                   Method              # Params Sudoku Maze
+    6   3      84        62.3       78      OOM                              Chain-of-thought, pretrained
+    3   6      96        58.8       84       85.8                   Deepseek R1            671B       0.0     0.0
+    6   6     168        57.5      156      OOM                     Claude 3.7 8K            ?        0.0      0.0
+                                                                    o3-mini-high             ?        0.0      0.0
+                                                                       Direct prediction, small-sample training
+Sudoku-Extreme consists of extremely difficult Su-                  Direct pred            27M        0.0     0.0
+doku puzzles (Dillion, 2025; Palm et al., 2018; Park,               HRM                    27M       55.0     74.5
+2018) (9x9 grid), for which only 1K training samples                TRM-Att (Ours)          7M       74.7     85.3
+are used to test small-sample learning. Testing is done             TRM-MLP (Ours)      5M/19M¹      87.4      0.0
+on 423K samples. Maze-Hard consists of 30x30 mazes
+generated by the procedure by Lehnert et al. (2024)
+whose shortest path is of length above 110; both the
+training set and test set include 1000 mazes.                    Table 5. % Test accuracy on ARC-AGI Benchmarks (2 tries)
+                                                                   Method               # Params ARC-1 ARC-2
+ARC-AGI-1 and ARC-AGI-2 are geometric puzzles in-                           Chain-of-thought, pretrained
+volving monetary prizes. Each puzzle is designed to                Deepseek R1            671B      15.8      1.3
+be easy for a human, yet hard for current AI models.               Claude 3.7 16K            ?      28.6      0.7
+Each puzzle task consists of 2-3 input–output demon-               o3-mini-high              ?      34.5      3.0
+stration pairs and 1-2 test inputs to be solved. The final         Gemini 2.5 Pro 32K        ?      37.0      4.9
+score is computed as the accuracy over all test inputs             Grok-4-thinking         1.7T     66.7     16.0
+from two attempts to produce the correct output grid.              Bespoke (Grok-4)        1.7T     79.6     29.4
+The maximum grid size is 30x30. ARC-AGI-1 con-                        Direct prediction, small-sample training
+tains 800 tasks, while ARC-AGI-2 contains 1120 tasks.
+                                                                   Direct pred             27M      21.0      0.0
+We also augment our data with the 160 tasks from
+                                                                   HRM                     27M      40.3      5.0
+the closely related ConceptARC dataset (Moskvichev
+et al., 2023). We provide results on the public evalua-            TRM-Att (Ours)           7M      44.6      7.8
+tion set for both ARC-AGI-1 and ARC-AGI-2.                         TRM-MLP (Ours)          19M      29.6      2.4
+
+While these datasets are small, heavy data-
+augmentation is used in order to improve gen-                    6. Conclusion
+eralization. Sudoku-Extreme uses 1000 shuffling
+(done without breaking the Sudoku rules) augmenta-               We propose Tiny Recursion Models (TRM), a simple
+tions per data example. Maze-Hard uses 8 dihedral                recursive reasoning approach that achieves strong gen-
+transformations per data example. ARC-AGI uses                   eralization on hard tasks using a single tiny network
+1000 data augmentations (color permutation, dihedral-            recursing on its latent reasoning feature and
+group, and translation transformations) per data                 progressively improving its final answer. Contrary to the
+example. The dihedral-group transformations consist              Hierarchical Reasoning Model (HRM), TRM requires
+of random 90-degree rotations, horizontal/vertical               no fixed-point theorem, no complex biological justi-
+flips, and reflections.                                          fications, and no hierarchy. It significantly reduces
+                                                                 the number of parameters by halving the number of
+From the results, we see that TRM without self-                  layers and replacing the two networks with a single
+attention obtains the best generalization on Sudoku-             tiny network. It also simplifies the halting process,
+Extreme (87.4% test accuracy). Meanwhile, TRM with               removing the need for the extra forward pass. Over-
+self-attention generalizes better on the other tasks
+(probably due to inductive biases and the overcapac-                1 5M on Sudoku and 19M on Maze
+
+
+
+
+all, TRM is much simpler than HRM, while achieving                gan training for high fidelity natural image synthe-
+better generalization.                                            sis. arXiv preprint arXiv:1809.11096, 2018.
+While our approach led to better generalization on 4            Chollet, F. On the measure of intelligence. arXiv
+benchmarks, not every choice made is guaranteed to               preprint arXiv:1911.01547, 2019.
+be optimal on every dataset. For example, we found
+that replacing the self-attention with an MLP worked            Chollet, F., Knoop, M., Kamradt, G., Landers, B.,
+extremely well on Sudoku-Extreme (improving test ac-             and Pinkard, H. Arc-agi-2: A new challenge
+curacy by 10%), but poorly on other datasets. Different          for frontier ai reasoning systems. arXiv preprint
+problem settings may require different architectures             arXiv:2505.11831, 2025.
+or number of parameters. Scaling laws are needed
+to parametrize these networks optimally. Although               Chowdhery, A., Narang, S., Devlin, J., Bosma, M.,
+we simplified and improved on deep recursion, the                Mishra, G., Roberts, A., Barham, P., Chung, H. W.,
+question of why recursion helps so much compared                 Sutton, C., Gehrmann, S., et al. Palm: Scaling lan-
+to using a larger and deeper network remains to be               guage modeling with pathways. Journal of Machine
+explained; we suspect it has to do with overfitting, but         Learning Research, 24(240):1–113, 2023.
+we have no theory to back this explaination. Not all
+our ideas made the cut; we briefly discuss some of the          Dillion, T. Tdoku: A fast sudoku solver and gener-
+failed ideas that we tried but did not work in Section 6.         ator. https://t-dillon.github.io/tdoku/,
+Currently, recursive reasoning models such as HRM                 2025.
+and TRM are supervised learning methods rather than
+                                                                Elman, J. L. Finding structure in time. Cognitive science,
+generative models. This means that given an input
+                                                                  14(2):179–211, 1990.
+question, they can only provide a single deterministic
+answer. In many settings, multiple answers exist for a          Fedus, W., Zoph, B., and Shazeer, N. Switch transform-
+question. Thus, it would be interesting to extend TRM             ers: Scaling to trillion parameter models with simple
+to generative tasks.                                              and efficient sparsity. Journal of Machine Learning Re-
+                                                                  search, 23(120):1–39, 2022.
+Acknowledgements
+                                                                Geng, Z. and Kolter, J. Z. Torchdeq: A library for deep
+Thank you Emy Gervais for your invaluable support                 equilibrium models. arXiv preprint arXiv:2310.18605,
+and extra push. This research was enabled in part                 2023.
+by computing resources, software, and technical as-
+sistance provided by Mila and the Digital Research              Hendrycks, D. and Gimpel, K. Gaussian error linear
+Alliance of Canada.                                              units (gelus). arXiv preprint arXiv:1606.08415, 2016.
+
+                                                                Jang, Y., Kim, D., and Ahn, S. Hierarchical graph
+References                                                        generation with k2-trees. In ICML 2023 Workshop on
+ARC Prize Foundation. The Hidden Drivers of HRM’s                 Structured Probabilistic Inference Generative Modeling,
+ Performance on ARC-AGI. https://arcprize.                        2023.
+ org/blog/hrm-analysis, 2025a. [Online; ac-
+                                                                Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B.,
+ cessed 2025-09-15].
+                                                                  Chess, B., Child, R., Gray, S., Radford, A., Wu, J.,
+ARC Prize Foundation. ARC-AGI Leaderboard.                        and Amodei, D. Scaling laws for neural language
+ https://arcprize.org/leaderboard, 2025b.                         models. arXiv preprint arXiv:2001.08361, 2020.
+ [Online; accessed 2025-09-24].
+                                                                Kingma, D. P. and Ba, J. Adam: A method for stochas-
+Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium           tic optimization. arXiv preprint arXiv:1412.6980,
+  models. Advances in neural information processing               2014.
+  systems, 32, 2019.
+                                                                Krantz, S. G. and Parks, H. R. The implicit function
+Bai, X. and Melas-Kyriazi, L. Fixed point diffusion               theorem: history, theory, and applications. Springer
+  models. In Proceedings of the IEEE/CVF Conference               Science & Business Media, 2002.
+  on Computer Vision and Pattern Recognition, pp. 9430–
+  9440, 2024.                                                   LeCun, Y. Une procedure d’apprentissage ponr reseau
+                                                                  a seuil asymetrique. Proceedings of cognitiva 85, pp.
+Brock, A., Donahue, J., and Simonyan, K. Large scale              599–604, 1985.
+
+                                                            9
+                                      Recursive Reasoning with Tiny Networks
+
+Lehnert, L., Sukhbaatar, S., Su, D., Zheng, Q., Mcvay, P.,          all-mlp architecture for vision. Advances in neural
+  Rabbat, M., and Tian, Y. Beyond a*: Better planning               information processing systems, 34:24261–24272, 2021.
+  with transformers via search dynamics bootstrap-
+  ping. arXiv preprint arXiv:2402.14083, 2024.                    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
+                                                                    Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin,
+Lillicrap, T. P. and Santoro, A. Backpropagation                    I. Attention is all you need. Advances in neural
+  through time and the brain. Current opinion in neuro-             information processing systems, 30, 2017.
+  biology, 55:82–89, 2019.
+                                                                  Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y.,
+Loshchilov, I. and Hutter, F. Decoupled weight decay               Lu, M., Song, S., and Yadkori, Y. A. Hierarchical
+  regularization. arXiv preprint arXiv:1711.05101, 2017.           reasoning model. arXiv preprint arXiv:2506.21734,
+                                                                   2025.
+Moskvichev, A., Odouard, V. V., and Mitchell, M. The
+ conceptarc benchmark: Evaluating understanding                   Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F.,
+ and generalization in the arc domain. arXiv preprint              Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought
+ arXiv:2305.07141, 2023.                                           prompting elicits reasoning in large language mod-
+                                                                   els. Advances in neural information processing systems,
+Palm, R., Paquet, U., and Winther, O. Recurrent re-
+                                                                   35:24824–24837, 2022.
+  lational networks. Advances in neural information
+  processing systems, 31, 2018.                                   Werbos, P. Beyond regression: New tools for predic-
+                                                                   tion and analysis in the behavioral sciences. PhD
+Park, K.    Can convolutional neural networks
+                                                                   thesis, Committee on Applied Mathematics, Harvard
+  crack sudoku puzzles? https://github.com/
+                                                                   University, Cambridge, MA, 1974.
+  Kyubyong/sudoku, 2018.
+                                                                  Werbos, P. J. Generalization of backpropagation with
+Prieto, L., Barsbey, M., Mediano, P. A., and Birdal, T.
+                                                                   application to a recurrent gas market model. Neural
+  Grokking at the edge of numerical stability. arXiv
+                                                                   networks, 1(4):339–356, 1988.
+  preprint arXiv:2501.04697, 2025.
+                                                                  Zhang, B. and Sennrich, R. Root mean square layer
+Rumelhart, D. E., Hinton, G. E., and Williams, R. J.
+                                                                    normalization. Advances in Neural Information Pro-
+  Learning internal representations by error propaga-
+                                                                    cessing Systems, 32, 2019.
+  tion. Technical report, 1985.
+Shazeer, N. Glu variants improve transformer. arXiv
+  preprint arXiv:2002.05202, 2020.
+Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le,
+  Q., Hinton, G., and Dean, J. Outrageously large neu-
+  ral networks: The sparsely-gated mixture-of-experts
+  layer. arXiv preprint arXiv:1701.06538, 2017.
+Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling
+  llm test-time compute optimally can be more effec-
+  tive than scaling model parameters. arXiv preprint
+  arXiv:2408.03314, 2024.
+Song, Y. and Ermon, S. Improved techniques for train-
+  ing score-based generative models. Advances in
+  neural information processing systems, 33:12438–12448,
+  2020.
+Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu,
+  Y. Roformer: Enhanced transformer with rotary
+  position embedding. Neurocomputing, 568:127063,
+  2024.
+Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer,
+  L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A.,
+  Keysers, D., Uszkoreit, J., et al. Mlp-mixer: An
+
+Hyper-parameters and setup
+
+All models are trained with the AdamW optimizer (Loshchilov & Hutter, 2017;
+Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.95, small learning rate warm-up (2K
+iterations), batch-size 768, hidden-size of 512, Nsup = 16 max supervision
+steps, and stable-max loss (Prieto et al., 2025) for improved stability. TRM
+uses an Exponential Moving Average (EMA) of 0.999. HRM uses n = 2, T = 2 with
+two 4-layer networks, while we use n = 6, T = 3 with one 2-layer network.
+For Sudoku-Extreme and Maze-Hard, we train for 60k epochs with learning rate
+1e-4 and weight decay 1.0. For ARC-AGI, we train for 100K epochs with learning
+rate 1e-4 (with 1e-2 learning rate for the embeddings) and weight decay 0.1.
+The numbers for Deepseek R1, Claude 3.7 8K, O3-mini-high, Direct prediction,
+and HRM in Tables 4 and 5 are taken from Wang et al. (2025). Both HRM and TRM
+add an embedding of shape [0, 1, D ] on Sudoku-Extreme and Maze-Hard to the
+input. For ARC-AGI, each puzzle (containing 2-3 training examples and 1-2 test
+examples) at each data-augmentation is given a specific embedding of shape
+[0, 1, D ] and, at test-time, the most common answer out of the 1000 data
+augmentations is given as the answer.
+Experiments on Sudoku-Extreme were run with 1 L40S with 40 GB of RAM for
+generally less than 36 hours. Experiments on Maze-Hard were run with 4 L40S
+with 40 GB of RAM for less than 24 hours. Experiments on ARC-AGI were run for
+around 3 days with 4 H100 with 80 GB of RAM.
+
+Ideas that failed
+
+In this section, we quickly mention a few ideas that did not work to prevent
+others from making the same mistake.
+We tried replacing the SwiGLU MLPs by SwiGLU Mixture-of-Experts (MoEs) (Shazeer
+et al., 2017; Fedus et al., 2022), but we found generalization to decrease
+massively. MoEs clearly add too much unnecessary capacity, just like increasing
+the number of layers does.
+Instead of back-propagating through the whole n + 1 recursions, we tried a
+compromise between this and HRM's 1-step gradient approximation, which
+back-propagates through only the last 2 recursions. We did so by decoupling n
+from the number of last recursions k that we back-propagate through. For
+example, while n = 6 requires 7 steps with gradients in TRM, we can use
+gradients for only the k = 4 last steps. However, we found that this did not
+help generalization in any way, and it made the approach more complicated.
+Back-propagating through the whole n + 1 recursions makes the most sense and
+works best.
+We tried removing ACT with the option of stopping when the solution is reached,
+but we found that generalization dropped significantly. This can probably be
+attributed to the fact that the model is spending too much time on the same
+data samples rather than focusing on learning on a wide range of data samples.
+We tried weight tying the input embedding and output head, but this was too
+constraining and led to a massive generalization drop.
+We tried using TorchDEQ (Geng & Kolter, 2023) to replace the recursion steps by
+fixed-point iteration as done by Deep Equilibrium Models (Bai et al., 2019).
+This would provide a better justification for the 1-step gradient
+approximation. However, this slowed down training due to the fixed-point
+iteration and led to worse generalization. This highlights the fact that
+converging to a fixed-point is not essential.
+
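The decoupling of n from k described under "Ideas that failed" can be illustrated with a small sketch. The helper below is hypothetical (the name `gradient_flags` and its signature are not from the paper), and it assumes recursion steps are indexed from 0. It only marks which of the n + 1 recursion steps would carry gradients when back-propagating through the last k steps.

```python
from typing import List, Optional


def gradient_flags(n: int = 6, k: Optional[int] = None) -> List[bool]:
    """Mark which of the n + 1 recursion steps are back-propagated through.

    Hypothetical helper for illustration only. k=None means full
    back-propagation through all n + 1 steps, as described in the text.
    """
    total_steps = n + 1
    if k is None:
        k = total_steps
    if not 0 <= k <= total_steps:
        raise ValueError("k must lie between 0 and n + 1")
    # Steps before this index run without gradients.
    first_with_grad = total_steps - k
    return [step >= first_with_grad for step in range(total_steps)]


if __name__ == "__main__":
    print(gradient_flags(n=6))       # all 7 steps carry gradients
    print(gradient_flags(n=6, k=4))  # only the last 4 of the 7 steps do
```

In a PyTorch training loop, the steps flagged False would typically run under `torch.no_grad()`, in the same way the pseudocode treats the first T-1 deep recursions.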
+
+Algorithms with different number of latent features
+
+def latent_recursion(x, z, n=6):
+    for i in range(n+1): # latent recursion
+        z = net(x, z)
+    return z
+
+def deep_recursion(x, z, n=6, T=3):
+    # recursing T-1 times to improve z (no gradients needed)
+    with torch.no_grad():
+        for j in range(T-1):
+            z = latent_recursion(x, z, n)
+    # recursing once to improve z
+    z = latent_recursion(x, z, n)
+    # with a single z there is no separate y, so both heads read z
+    return z.detach(), output_head(z), Q_head(z)
+
+# Deep Supervision
+for x_input, y_true in train_dataloader:
+    z = z_init
+    for step in range(N_supervision):
+        x = input_embedding(x_input)
+        z, y_hat, q_hat = deep_recursion(x, z)
+        loss = softmax_cross_entropy(y_hat, y_true)
+        loss += binary_cross_entropy(q_hat, (y_hat == y_true))
+        z = z.detach()
+        loss.backward()
+        opt.step()
+        opt.zero_grad()
+        if q_hat[0] > 0: # early-stopping
+            break
+
+Figure 4. Pseudocode of TRM using a single-z with deep supervision training in
+PyTorch.
+
+def latent_recursion(x, y, z, n=6):
+    for i in range(n): # latent recursion
+        z[i] = net(x, y, z[0], ... , z[n-1])
+    y = net(y, z[0], ... , z[n-1]) # refine output answer
+    return y, z
+
+def deep_recursion(x, y, z, n=6, T=3):
+    # recursing T-1 times to improve y and z (no gradients needed)
+    with torch.no_grad():
+        for j in range(T-1):
+            y, z = latent_recursion(x, y, z, n)
+    # recursing once to improve y and z
+    y, z = latent_recursion(x, y, z, n)
+    return (y.detach(), z.detach()), output_head(y), Q_head(y)
+
+# Deep Supervision
+for x_input, y_true in train_dataloader:
+    y, z = y_init, z_init
+    for step in range(N_supervision):
+        x = input_embedding(x_input)
+        (y, z), y_hat, q_hat = deep_recursion(x, y, z)
+        loss = softmax_cross_entropy(y_hat, y_true)
+        loss += binary_cross_entropy(q_hat, (y_hat == y_true))
+        loss.backward()
+        opt.step()
+        opt.zero_grad()
+        if q_hat[0] > 0: # early-stopping
+            break
+
+Figure 5. Pseudocode of TRM using multi-scale z with deep supervision training
+in PyTorch.
+
+Example on Sudoku-Extreme
+
+[Figure 6: four 9x9 Sudoku grids, titled "Input x", "Output y", "Tokenized z_H
+(denoted y in TRM)", and "Tokenized z_L (denoted z in TRM)".]
+
+Figure 6. This Sudoku-Extreme example shows an input, expected output, and the
+tokenized z_H and z_L (after reversing the embedding and using argmax) for a
+pretrained model. This highlights the fact that z_H corresponds to the
+predicted response, while z_L is a latent feature that cannot be decoded to a
+sensible output unless transformed into z_H by f_H.
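The halting signal used in the pseudocode (the Q head output, trained with binary cross-entropy against whether the prediction matches the label, with early stopping once it exceeds 0) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. It assumes the halting value is a raw logit, so that a threshold of 0 corresponds to a predicted probability of 0.5, and the function names `halting_loss` and `should_halt` are hypothetical.

```python
import math


def halting_loss(q_logit: float, prediction_correct: bool) -> float:
    """Binary cross-entropy on a logit, in a numerically stable form.

    Assumption: the halting value is a raw logit, and the target is 1.0
    when the current prediction matches the label, else 0.0.
    """
    target = 1.0 if prediction_correct else 0.0
    return (
        max(q_logit, 0.0)
        - q_logit * target
        + math.log1p(math.exp(-abs(q_logit)))
    )


def should_halt(q_logit: float) -> bool:
    """Stop early when the logit is positive, i.e. sigmoid(q_logit) > 0.5."""
    return q_logit > 0.0


if __name__ == "__main__":
    for q in (-2.0, 0.0, 2.0):
        print(q, should_halt(q), round(halting_loss(q, True), 4))
```

Under this assumption, a confident positive logit is cheap when the prediction is right and expensive when it is wrong, which is what pushes the halting head to fire only once the answer is likely correct.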
+
+\ No newline at end of file