1 files changed, 796 insertions, 0 deletions
diff --git a/papers/txt/trm2025_tiny_recursive.txt b/papers/txt/trm2025_tiny_recursive.txt
new file mode 100644
index 0000000..55cf994
--- /dev/null
+++ b/papers/txt/trm2025_tiny_recursive.txt
@@ -0,0 +1,796 @@
+                                                       Less is More: Recursive Reasoning with Tiny Networks
+
+
+
+                                                                                    Alexia Jolicoeur-Martineau
+                                                                                     Samsung SAIL Montréal
+                                                                                      alexia.j@samsung.com
+
+
+arXiv:2510.04871v1 [cs.LG] 6 Oct 2025
+
+Abstract
+
+Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks
+recursing at different frequencies. This biologically inspired method beats Large
+Language Models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while
+trained with small models (27M parameters) on small data (∼1000 examples). HRM holds
+great promise for solving hard problems with small networks, but it is not yet well
+understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler
+recursive reasoning approach that achieves significantly higher generalization than HRM,
+while using a single tiny network with only 2 layers. With only 7M parameters, TRM
+obtains 45% test accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs
+(e.g., DeepSeek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.
+
+
+
+Figure 1. Tiny Recursion Model (TRM) recursively improves its predicted answer y with a
+tiny network. It starts with the embedded input question x, an initial embedded answer y,
+and a latent z. For up to N_sup = 16 improvement steps, it tries to improve its answer y.
+It does so by i) recursively updating its latent z n times given the question x, current
+answer y, and current latent z (recursive reasoning), and then ii) updating its answer y
+given the current answer y and current latent z. This recursive process allows the model
+to progressively improve its answer (potentially addressing any errors from its previous
+answer) in an extremely parameter-efficient manner while minimizing overfitting.
+
+1. Introduction
+
+While powerful, Large Language Models (LLMs) can struggle on hard question-answer
+problems. Given that they generate their answer auto-regressively, there is a high risk
+of error since a single incorrect token can render an answer invalid. To improve their
+reliability, LLMs rely on Chain-of-Thought (CoT) (Wei et al., 2022) and Test-Time
+Compute (TTC) (Snell et al., 2024). CoT seeks to emulate human reasoning by having the
+LLM sample step-by-step reasoning traces prior to giving its answer. Doing so can improve
+accuracy, but CoT is expensive, requires high-quality reasoning data (which may not be
+available), and can be brittle since the generated reasoning may be wrong. To further
+improve reliability, test-time compute can be used by reporting the most common answer
+out of K sampled answers or the highest-reward answer (Snell et al., 2024).
+
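The two test-time compute strategies mentioned above (most common answer out of K, or highest-reward answer) can be sketched in a few lines. This is a minimal illustration only: the function names `majority_vote` and `best_of_k`, the sample answers, and the reward scores are hypothetical and do not come from the paper or from Snell et al. (2024).

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer among K sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns [(answer, count)] for the most frequent answer
    return Counter(answers).most_common(1)[0][0]


def best_of_k(answers, rewards):
    """Return the sampled answer with the highest reward score."""
    if not answers or len(answers) != len(rewards):
        raise ValueError("answers and rewards must be non-empty and aligned")
    best_index = max(range(len(answers)), key=lambda i: rewards[i])
    return answers[best_index]


# Hypothetical K = 5 samples and made-up reward scores
samples = ["42", "41", "42", "42", "17"]
scores = [0.2, 0.9, 0.4, 0.3, 0.1]

voted = majority_vote(samples)
rewarded = best_of_k(samples, scores)
print(voted, rewarded)
```

Note that the two strategies can disagree, as they do in this toy input: voting favours the answer sampled most often, while reward-based selection depends entirely on the quality of the scoring model.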
+                                      Recursive Reasoning with Tiny Networks
+
+However, this may not be enough: LLMs with CoT and TTC still cannot solve every problem.
+While LLMs have made significant progress on ARC-AGI (Chollet, 2019) since 2019,
+human-level accuracy still has not been reached (6 years later, as of this writing).
+Furthermore, LLMs struggle on the newer ARC-AGI-2 (e.g., Gemini 2.5 Pro only obtains 4.9%
+test accuracy with a high amount of TTC) (Chollet et al., 2025; ARC Prize Foundation,
+2025b).
+
+An alternative direction has recently been proposed by Wang et al. (2025). They propose a
+new way forward through their novel Hierarchical Reasoning Model (HRM), which obtains
+high accuracy on puzzle tasks where LLMs struggle to make a dent (e.g., Sudoku solving,
+Maze pathfinding, and ARC-AGI). HRM is a supervised learning model with two main
+novelties: 1) recursive hierarchical reasoning, and 2) deep supervision.
+
+Recursive hierarchical reasoning consists of recursing multiple times through two small
+networks (f_L at high frequency and f_H at low frequency) to predict the answer. Each
+network generates a different latent feature: f_L outputs z_L and f_H outputs z_H. Both
+features (z_L, z_H) are used as input to the two networks. The authors provide some
+biological arguments in favor of recursing at different hierarchies based on the
+different temporal frequencies at which the brain operates and hierarchical processing of
+sensory inputs.
+
+Deep supervision consists of improving the answer through multiple supervision steps
+while carrying the two latent features as initialization for the improvement steps (after
+detaching them from the computational graph so that their gradients do not propagate).
+This provides residual connections, which emulate very deep neural networks that are too
+memory expensive to apply in one forward pass.
+
+An independent analysis on the ARC-AGI benchmark showed that deep supervision seems to be
+the primary driver of the performance gains (ARC Prize Foundation, 2025a). Using deep
+supervision doubled accuracy over single-step supervision (going from 19% to 39%
+accuracy), while recursive hierarchical reasoning only slightly improved accuracy over a
+regular model with a single forward pass (going from 35.7% to 39.0% accuracy). This
+suggests that reasoning across different supervision steps is worth it, but the recursion
+done in each supervision step is not particularly important.
+
+In this work, we show that the benefit from recursive reasoning can be massively
+improved, making it much more than incremental. We propose Tiny Recursive Model (TRM), an
+improved and simplified approach using a much smaller tiny network with only 2 layers
+that achieves significantly higher generalization than HRM on a variety of problems. In
+doing so, we improve the state-of-the-art test accuracy on Sudoku-Extreme from 55% to
+87%, Maze-Hard from 75% to 85%, ARC-AGI-1 from 40% to 45%, and ARC-AGI-2 from 5% to 8%.
+
+2. Background
+
+HRM is described in Algorithm 2. We discuss the details of the algorithm further below.
+
+2.1. Structure and goal
+
+The focus of HRM is supervised learning: given an input, the model must produce an
+output. Both input and output are assumed to have shape [B, L] (when the shape differs,
+padding tokens can be added), where B is the batch size and L is the context length.
+
+HRM contains four learnable components: the input embedding f_I(·; θ_I), low-level
+recurrent network f_L(·; θ_L), high-level recurrent network f_H(·; θ_H), and the output
+head f_O(·; θ_O). Once the input is embedded, the shape becomes [B, L, D] where D is the
+embedding size. Each network is a 4-layer Transformer architecture (Vaswani et al.,
+2017), with RMSNorm (Zhang & Sennrich, 2019), no bias (Chowdhery et al., 2023), rotary
+embeddings (Su et al., 2024), and SwiGLU activation function (Hendrycks & Gimpel, 2016;
+Shazeer, 2020).
+
+2.2. Recursion at two different frequencies
+
+Given the hyperparameters used by Wang et al. (2025) (n = 2 f_L steps, 1 f_H step; done
+T = 2 times), a forward pass of HRM is done as follows:
+
+      x   ← f_I(x̃)
+      z_L ← f_L(z_L + z_H + x)   # without gradients
+      z_L ← f_L(z_L + z_H + x)   # without gradients
+      z_H ← f_H(z_L + z_H)       # without gradients
+      z_L ← f_L(z_L + z_H + x)   # without gradients
+      z_L ← z_L.detach()
+      z_H ← z_H.detach()
+      z_L ← f_L(z_L + z_H + x)   # with gradients
+      z_H ← f_H(z_L + z_H)       # with gradients
+      ŷ   ← argmax(f_O(z_H))
+
+where ŷ is the predicted output answer, and z_L and z_H are either initialized embeddings
+or the embeddings of the previous deep supervision step (after detaching them from the
+computational graph). As can be seen,
+
+
+a forward pass of HRM consists of applying 6 function evaluations, where the first 4
+function evaluations are detached from the computational graph and are not
+back-propagated through. The authors use n = 2 with T = 2 in all experiments, but HRM can
+be generalized by allowing for an arbitrary number of L steps (n) and recursions (T) as
+shown in Algorithm 2.
+
+# Pseudocode: L_net, H_net, output_head, Q_head, input_embedding, the loss functions,
+# opt, z_init, N_sup, and train_dataloader are assumed to be defined elsewhere.
+import torch
+
+def hrm(z, x, n=2, T=2):  # hierarchical reasoning
+    zH, zL = z
+    with torch.no_grad():
+        # all but the last f_L and f_H evaluations run without gradients
+        for i in range(n * T - 1):
+            zL = L_net(zL, zH, x)
+            if (i + 1) % n == 0:  # one f_H step after every n f_L steps
+                zH = H_net(zH, zL)
+    # 1-step grad
+    zL = L_net(zL, zH, x)
+    zH = H_net(zH, zL)
+    return (zH, zL), output_head(zH), Q_head(zH)
+
+def ACT_halt(q, y_hat, y_true):
+    target_halt = (y_hat == y_true)
+    loss = 0.5 * binary_cross_entropy(q[0], target_halt)
+    return loss
+
+def ACT_continue(q, last_step):
+    if last_step:
+        target_continue = sigmoid(q[0])
+    else:
+        target_continue = sigmoid(max(q[0], q[1]))
+    loss = 0.5 * binary_cross_entropy(q[1], target_continue)
+    return loss
+
+# Deep supervision
+for x_input, y_true in train_dataloader:
+    z = z_init
+    for step in range(N_sup):  # deep supervision
+        x = input_embedding(x_input)
+        z, y_pred, q = hrm(z, x)
+        loss = softmax_cross_entropy(y_pred, y_true)
+        # Adaptive computational time (ACT) using Q-learning
+        loss += ACT_halt(q, y_pred, y_true)
+        _, _, q_next = hrm(z, x)  # extra forward pass
+        loss += ACT_continue(q_next, step == N_sup - 1)
+        z = z.detach()
+        loss.backward()
+        opt.step()
+        opt.zero_grad()
+        if q[0] > q[1]:  # early stopping
+            break
+
+Figure 2. Pseudocode of Hierarchical Reasoning Models (HRMs).
+
+2.3. Fixed-point recursion with 1-step gradient approximation
+
+Assuming that (z_L, z_H) reaches a fixed point (z_L^*, z_H^*) through recursing from both
+f_L and f_H,
+
+      z_L^* ≈ f_L(z_L^* + z_H + x)
+      z_H^* ≈ f_H(z_L + z_H^*),
+
+the Implicit Function Theorem (Krantz & Parks, 2002) with the 1-step gradient
+approximation (Bai et al., 2019) is used to approximate the gradient by back-propagating
+only the last f_L and f_H steps. This theorem is used to justify only tracking the
+gradients of the last two steps (out of 6), which greatly reduces memory demands.
+
+2.4. Deep supervision
+
+To improve effective depth, deep supervision is used. This consists of reusing the
+previous latent features (z_H and z_L) as initialization for the next forward pass. This
+allows the model to reason over many iterations and improve its latent features (z_L and
+z_H) until it (hopefully) converges to the correct solution. At most N_sup = 16
+supervision steps are used.
+
+2.5. Adaptive computational time (ACT)
+
+With deep supervision, each mini-batch of data samples must be used for N_sup = 16
+supervision steps before moving to the next mini-batch. This is expensive, and there is a
+balance to be reached between optimizing a few data examples for many supervision steps
+versus optimizing many data examples with fewer supervision steps. To reach a better
+balance, a halting mechanism is incorporated to determine whether the model should
+terminate early. It is learned through a Q-learning objective that requires passing z_H
+through an additional head and running an additional forward pass (to determine if
+halting now rather than later would have been preferable). They call this method Adaptive
+computational time (ACT). It is only used during training, while the full N_sup = 16
+supervision steps are done at test time to maximize downstream performance. ACT greatly
+diminishes the time spent per example (on average spending less than 2 steps on the
+Sudoku-Extreme dataset rather than the full N_sup = 16 steps), allowing more coverage of
+the dataset given a fixed number of training iterations.
+
+2.6. Deep supervision and 1-step gradient approximation replace BPTT
+
+Deep supervision and the 1-step gradient approximation provide a more biologically
+plausible and less computationally expensive alternative to Backpropagation Through Time
+(BPTT) (Werbos, 1974; Rumelhart et al., 1985; LeCun, 1985) for solving the temporal
+credit assignment (TCA) (Rumelhart et al., 1985; Werbos, 1988; Elman, 1990) problem
+(Lillicrap & Santoro, 2019). The implication is that HRM can learn what would normally
+require an extremely large network without having to back-propagate through its entire
+depth. Given the hyperparameters used by Wang et al. (2025) in all their experiments, HRM
+effectively reasons over n_layers (n + 1) T N_sup = 4 * (2 + 1) * 2 * 16 = 384 layers of
+effective depth.
+
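The effective-depth figure above follows directly from the formula n_layers (n + 1) T N_sup. The short sketch below reproduces that arithmetic; the helper names `depth_per_supervision_step` and `effective_depth` are illustrative choices, not names used by the paper.

```python
def depth_per_supervision_step(n_layers, n, T):
    """Layers traversed in one supervision step: T cycles of (n + 1) network calls."""
    return T * (n + 1) * n_layers


def effective_depth(n_layers, n, T, n_sup):
    """Layers traversed across all n_sup deep supervision steps."""
    return depth_per_supervision_step(n_layers, n, T) * n_sup


# HRM hyperparameters as stated in the text: 4 layers, n = 2, T = 2, N_sup = 16
hrm_per_step = depth_per_supervision_step(n_layers=4, n=2, T=2)
hrm_total = effective_depth(n_layers=4, n=2, T=2, n_sup=16)
print(hrm_per_step, hrm_total)
```

Only the last two function evaluations of each supervision step carry gradients in HRM, so the depth counted here is the depth of the forward computation, not the depth of the back-propagated graph.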
+
+2.7. Summary of HRM
+
+HRM leverages recursion from two networks at different frequencies (high frequency versus
+low frequency) and deep supervision to learn to improve its answer over multiple
+supervision steps (with ACT to reduce time spent per data example). This enables the
+model to imitate extremely large depth without requiring backpropagation through all
+layers. This approach obtains significantly higher performance on hard question-answer
+tasks that regular supervised models struggle with. However, this method is quite
+complicated, relying a bit too heavily on uncertain biological arguments and fixed-point
+theorems that are not guaranteed to be applicable. In the next section, we discuss those
+issues and potential targets for improvement in HRM.
+
+3. Targets for improvement in Hierarchical Reasoning Models
+
+In this section, we identify key targets for improvement in HRM, which will be addressed
+by our proposed method, Tiny Recursion Models (TRMs).
+
+3.1. Implicit Function Theorem (IFT) with 1-step gradient approximation
+
+HRM only back-propagates through the last 2 of the 6 recursions. The authors justify this
+by leveraging the Implicit Function Theorem (IFT) and one-step approximation, which
+states that when a recurrent function converges to a fixed point, backpropagation can be
+applied in a single step at that equilibrium point.
+
+There are concerns about applying this theorem to HRM. Most importantly, there is no
+guarantee that a fixed point is reached. Deep equilibrium models normally do fixed-point
+iteration to solve for the fixed point z^* = f(z^*) (Bai et al., 2019). However, in the
+case of HRM, it is not iterating to the fixed point but simply doing forward passes of
+f_L and f_H. To make matters worse, HRM only does 4 recursions before stopping to apply
+the one-step approximation. After its first loop of two f_L and one f_H evaluations, it
+only applies a single f_L evaluation before assuming that a fixed point is reached for
+both z_L and z_H (z_L^* = f_L(z_L^* + z_H + x) and z_H^* = f_H(z_L^* + z_H^*)). Then, the
+one-step gradient approximation is applied to both latent variables in succession.
+
+The authors justify that a fixed point is reached by depicting an example with n = 7 and
+T = 7 where the forward residuals are reduced over time (Figure 3 in Wang et al. (2025)).
+Even in this setting, which is different from the much smaller n = 2 and T = 2 used in
+every experiment of their paper, we observe the following:
+
+ 1. the residual for z_H is clearly well above 0 at every step
+
+ 2. the residual for z_L only becomes closer to 0 after many cycles, but it remains
+    significantly above 0
+
+ 3. z_L is very far from converged after one f_L evaluation at T cycles, which is when
+    the fixed point is assumed to be reached and the 1-step gradient approximation is
+    used
+
+Thus, while the application of the IFT and 1-step gradient approximation to HRM has some
+basis since the residuals do generally reduce over time, a fixed point is unlikely to be
+reached when the theorem is actually applied.
+
+In the next section, we show that we can bypass the need for the IFT and 1-step gradient
+approximation, thus bypassing the issue entirely.
+
+3.2. Twice the forward passes with Adaptive computational time (ACT)
+
+HRM uses Adaptive computational time (ACT) during training to optimize the time spent on
+each data sample. Without ACT, N_sup = 16 supervision steps would be spent on the same
+data sample, which is highly inefficient. They implement ACT through an additional
+Q-learning objective, which decides when to halt and move to a new data sample rather
+than keep iterating on the same data. This allows much more efficient use of time,
+especially since the average number of supervision steps during training is quite low
+with ACT (less than 2 steps on the Sudoku-Extreme dataset as per their reported numbers).
+
+However, ACT comes at a cost. This cost is not directly shown in the HRM paper, but it is
+shown in their official code. The Q-learning objective relies on a halting loss and a
+continue loss. The continue loss requires an extra forward pass through HRM (with all 6
+function evaluations). This means that while ACT optimizes time more efficiently per
+sample, it requires 2 forward passes per optimization step. The exact formulation is
+shown in Algorithm 2.
+
+In the next section, we show that we can bypass the need for two forward passes in ACT.
+
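The convergence concern raised in Section 3.1 can be made concrete with a toy example. The sketch below iterates a simple scalar contraction and records the forward residual |f(z) - z| at each step. The function `toy_update` is a hypothetical stand-in chosen because its fixed point (z^* = 2) is known exactly; it is not HRM's f_L or f_H, and the numbers say nothing about HRM's actual residuals.

```python
def forward_residuals(f, z0, steps):
    """Iterate z <- f(z) and record the residual |f(z) - z| before each update."""
    residuals = []
    z = z0
    for _ in range(steps):
        z_next = f(z)
        residuals.append(abs(z_next - z))
        z = z_next
    return residuals


def toy_update(z):
    # Hypothetical contraction with fixed point z* = 2 (illustration only)
    return 0.5 * z + 1.0


residuals = forward_residuals(toy_update, z0=0.0, steps=6)
for step, residual in enumerate(residuals, start=1):
    print(step, residual)
```

Even for this well-behaved map, the residual only halves at each step, so stopping after a handful of evaluations leaves the iterate measurably away from the fixed point. A gradient computed under the assumption that the fixed point has been reached is then only an approximation.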
+
+3.3. Hierarchical interpretation based on complex biological arguments
+
+The HRM authors justify the two latent variables and two networks operating at different
+hierarchies based on biological arguments, which are very far from artificial neural
+networks. They even try to match HRM to actual brain experiments on mice. While
+interesting, this sort of explanation makes it incredibly hard to parse out why HRM is
+designed the way it is. Given the lack of an ablation table in their paper and the
+over-reliance on biological arguments and fixed-point theorems (that are not perfectly
+applicable), it is hard to determine which parts of HRM help and why. Furthermore, it is
+not clear why they use two latent features rather than other combinations of features.
+
+In the next section, we show that the recursive process can be greatly simplified and
+understood in a much simpler manner that does not require any biological argument,
+fixed-point theorem, hierarchical interpretation, or two networks. It also explains why 2
+is the optimal number of features (z_L and z_H).
+
+4. Tiny Recursion Models
+
+In this section, we present Tiny Recursion Models (TRMs). Contrary to HRM, TRM requires
+no complex mathematical theorem, hierarchy, or biological arguments. It generalizes
+better while requiring only a single tiny network (instead of two medium-size networks)
+and a single forward pass for the ACT (instead of 2 passes). Our approach is described in
+Algorithm 3 and illustrated in Figure 1. We also provide an ablation in Table 1 on the
+Sudoku-Extreme dataset (a dataset of difficult Sudokus with only 1K training examples,
+but 423K test examples). Below, we explain the key components of TRMs.
+
+# Pseudocode: net, output_head, Q_head, input_embedding, the loss functions, opt,
+# y_init, z_init, N_supervision, and train_dataloader are assumed to be defined elsewhere.
+def latent_recursion(x, y, z, n=6):
+    for i in range(n):  # latent reasoning
+        z = net(x, y, z)
+    y = net(y, z)  # refine output answer
+    return y, z
+
+def deep_recursion(x, y, z, n=6, T=3):
+    # recursing T-1 times to improve y and z (no gradients needed)
+    with torch.no_grad():
+        for j in range(T - 1):
+            y, z = latent_recursion(x, y, z, n)
+    # recursing once to improve y and z
+    y, z = latent_recursion(x, y, z, n)
+    return (y.detach(), z.detach()), output_head(y), Q_head(y)
+
+# Deep supervision
+for x_input, y_true in train_dataloader:
+    y, z = y_init, z_init
+    for step in range(N_supervision):
+        x = input_embedding(x_input)
+        (y, z), y_hat, q_hat = deep_recursion(x, y, z)
+        loss = softmax_cross_entropy(y_hat, y_true)
+        loss += binary_cross_entropy(q_hat, (y_hat == y_true))
+        loss.backward()
+        opt.step()
+        opt.zero_grad()
+        if q_hat > 0:  # early stopping
+            break
+
+Figure 3. Pseudocode of Tiny Recursion Models (TRMs).
+
+Table 1. Ablation of TRM on Sudoku-Extreme comparing % test accuracy, effective depth per
+supervision step (T(n + 1)n_layers), number of forward passes (NFP) per optimization
+step, and number of parameters.
+
+  Method                   Acc (%)   Depth   NFP   # Params
+  HRM                       55.0      24      2      27M
+  TRM (T = 3, n = 6)        87.4      42      1       5M
+  w/ ACT                    86.1      42      2       5M
+  w/ separate f_H, f_L      82.4      42      1      10M
+  no EMA                    79.9      42      1       5M
+  w/ 4-layers, n = 3        79.5      48      1      10M
+  w/ self-attention         74.7      42      1       7M
+  w/ T = 2, n = 2           73.7      12      1       5M
+  w/ 1-step gradient        56.5      42      1       5M
+
+4.1. No fixed-point theorem required
+
+HRM assumes that the recursions converge to a fixed point for both z_L and z_H in order
+to leverage the 1-step gradient approximation (Bai et al., 2019). This allows the authors
+to justify only back-propagating through the last two function evaluations (1 f_L and
+1 f_H). To bypass this theoretical requirement, we define a full recursion process as
+containing n evaluations of f_L and 1 evaluation of f_H:
+
+      z_L ← f_L(z_L + z_H + x)
+          ...
+      z_L ← f_L(z_L + z_H + x)
+      z_H ← f_H(z_L + z_H).
+
+Then, we simply back-propagate through the full recursion process.
+
+Through deep supervision, the model learns to take any (z_L, z_H) and improve it through
+a full recursion process, hopefully making z_H closer to the solution. This means that by
+the design of the deep supervision goal, running a few full recursion processes (even
+without gradients) is expected to bring us closer to the solution. We propose to run
+T - 1 recursion processes without gradients to improve (z_L, z_H) before running one
+recursion process with backpropagation.
+
+Thus, instead of using the 1-step gradient approximation, we apply a full recursion
+process containing n evaluations of f_L and 1 evaluation of f_H. This removes entirely
+the need to assume that a fixed point
+
+
+is reached and the use of the IFT theorem with 1-step            While this is intuitive, we wanted to verify whether
+gradient approximation. Yet, we can still leverage               using more or fewer features could be helpful. Results
+multiple backpropagation-free recursion processes to             are shown in Table 2.
+improve (z L , z H ). With this approach, we obtain a
+                                                                 More features (> 2): We tested splitting z into dif-
+massive boost in generalization on Sudoku-Extreme
+                                                                 ferent features by treating each of the n recursions as
+(improving TRM from 56.5% to 87.4%; see Table 1).
+                                                                 producing a different zi for i = 1, ..., n. Then, each
+                                                                 zi is carried across supervision steps. The approach
+4.2. Simpler reinterpretation of z H and z L                     is described in Algorithm 5. In doing so, we found
+HRM is interpreted as doing hierarchical reasoning               performance to drop. This is expected because, as dis-
+over two latent features of different hierarchies due to         cussed, there is no apparent need for splitting z into
+arguments from biology. However, one might wonder                multiple parts. It does not have to be hierarchical.
+why use two latent features instead of 1, 3, or more?            Single feature: Similarly, we tested the idea of taking
+And do we really need to justify these so-called ”hier-          a single feature by only carrying z H across supervi-
+archical” features based on biology to make sense of             sion steps. The approach is described in Algorithm 4.
+them? We propose a simple non-biological explana-                In doing so, we found performance to drop. This is
+tion, which is more natural, and directly answers the            expected because, as discussed, it forces the model to
+question of why there are 2 features.                            store the solution y within z.
+The fact of the matter is: z H is simply the current             Thus, we explored using more or less latent variables
+(embedded) solution. The embedding is reversed by                on Sudoku-Extreme, but found that having only y and
+applying the output head and rounding to the nearest             z lead to better test accuracy in addition to being the
+token using the argmax operation. On the other hand,             simplest more natural approach.
+z L is a latent feature that does not directly correspond
+to a solution, but it can be transformed into a solution
+by applying z H ← f H ( x, z L , z H ). We show an example
+on Sudoku-Extreme in Figure 6 to highlight the fact              Table 2. TRM on Sudoku-Extreme comparing % Test accu-
+that z H does correspond to the solution, but z L does           racy when using more or less latent features
+not.                                                                 Method                 # of features   Acc (%)
+                                                                     TRM y, z (Ours)              2          87.4
+Once this is understood, hierarchy is not needed; there              TRM multi-scale z       n+1 = 7         77.6
+is simply an input x, a proposed solution y (previously              TRM single z                 1          71.9
+called z H ), and a latent reasoning feature z (previously
+called z L ). Given the input question x, current solution
+y, and current latent reasoning z, the model recursively
+improves its latent z. Then, given the current latent z          4.3. Single network
+and the previous solution y, the model proposes a new            HRM uses two networks, one applied frequently as a
+solution y (or stay at the current solution if its already       low-level module f H and one applied rarely as an high-
+good).                                                           level module ( f H ). This requires twice the number of
+Although this has no direct influence on the algorithm,          parameters compared to regular supervised learning
+this re-interpretation is much simpler and natural. It           with a single network.
+answers the question about why two features: remem-              As mentioned previously, while f L iterates on the la-
+bering in context the question x, previous reasoning             tent reasoning feature z (z L in HRM), the goal of f H
+z, and previous answer y helps the model iterate on              is to update the solution y (z H in HRM) given the la-
+the next reasoning z and then the next answer y. If              tent reasoning and current solution. Importantly, since
+we were not passing the previous reasoning z, the                z ← f L ( x + y + z) contains x but y ← f H (y + z) does
+model would forget how it got to the previous solu-              not contains x, the task to achieve (iterating on z versus
+tion y (since z acts similarly as a chain-of-thought). If        using z to update y) is directly specified by the inclu-
+we were not passing the previous solution y, then the            sion or lack of x in the inputs. Thus, we considered
+model would forget what solution it had and would                the possibility that both networks could be replaced
+be forced to store the solution y within z instead of            by a single network doing both tasks. In doing so, we
+using it for latent reasoning. Thus, we need both y and          obtain better generalization on Sudoku-Extreme (im-
+z separately, and there is no apparent reason why one            proving TRM from 82.4% to 87.4%; see Table 1) while
+would need to split z into multiple features.                    reducing the number of parameters by half. It turns
+                                                                 out that a single network is enough.
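The division of labor between the two update rules in Section 4.3 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `make_counting_net`, `add`, and the elementwise `tanh` stand-in for the network are hypothetical and not part of TRM. Only the structure follows the text: n latent updates that see x, then one solution update that does not, repeated T times.

```python
import math


def make_counting_net():
    """Return a toy stand-in for the single shared network and a call counter."""
    calls = {"count": 0}

    def net(h):
        # Hypothetical placeholder for the real network: an elementwise tanh.
        calls["count"] += 1
        return [math.tanh(v) for v in h]

    return net, calls


def add(*vectors):
    """Elementwise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def latent_recursion(net, x, y, z, n=6):
    # n updates of the latent z; the input x is part of the sum.
    for _ in range(n):
        z = net(add(x, y, z))
    # One update of the solution y; x is deliberately left out.
    y = net(add(y, z))
    return y, z


def deep_recursion(net, x, y, z, n=6, T=3):
    # T recursion processes in total. In training, the first T - 1 would run
    # without gradients; this toy version has no gradients at all.
    for _ in range(T):
        y, z = latent_recursion(net, x, y, z, n)
    return y, z


net, calls = make_counting_net()
x = [0.5, -0.25, 1.0]
y0 = [0.0, 0.0, 0.0]
z0 = [0.0, 0.0, 0.0]
y, z = deep_recursion(net, x, y0, z0)
print(calls["count"])  # 21 = T * (n + 1) evaluations of the same network
```

The same `net` serves both roles, and the only thing that tells it which task to perform is whether x is included in its input.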
+
+
+4.4. Less is more
+We attempted to increase capacity by increasing the
+number of layers in order to scale the model. Sur-
+prisingly, we found that adding layers decreased gen-
+eralization due to overfitting. In doing the oppo-
+site, decreasing the number of layers while scaling
+the number of recursions (n) proportionally (to keep
+the amount of compute and emulated depth approxi-
+mately the same), we found that using 2 layers (instead
+of 4 layers) maximized generalization. In doing so, we
+obtain better generalization on Sudoku-Extreme (im-
+proving TRM from 79.5% to 87.4%; see Table 1) while
+reducing the number of parameters by half (again).
+
+It is quite surprising that smaller networks are bet-
+ter, but 2 layers seem to be the optimal choice. Bai
+& Melas-Kyriazi (2024) also observed optimal perfor-
+mance with 2 layers in the context of deep equilibrium
+diffusion models; however, they had similar perfor-
+mance to the bigger networks, while we instead ob-
+serve better performance with 2 layers. This may ap-
+pear unusual, as with modern neural networks, gener-
+alization tends to directly correlate with model sizes.
+However, when data is too scarce and model size is
+large, there can be an overfitting penalty (Kaplan et al.,
+2020). This is likely an indication that there is too little
+data. Thus, using tiny networks with deep recursion
+and deep supervision appears to allow us to bypass a
+lot of the overfitting.
+
+4.5. Attention-free architecture for tasks with small
+     fixed context length
+Self-attention is particularly good for long context
+lengths when L ≫ D since it only requires a matrix of
+[D, 3D] parameters, even though it can account for the
+whole sequence. However, when focusing on tasks
+where L ≤ D, a linear layer is cheap, requiring only a
+matrix of [L, L] parameters. Taking inspiration from
+the MLP-Mixer (Tolstikhin et al., 2021), we can replace
+the self-attention layer with a multilayer perceptron
+(MLP) applied on the sequence length. Using an MLP
+instead of self-attention, we obtain better generaliza-
+tion on Sudoku-Extreme (improving from 74.7% to
+87.4%; see Table 1). This worked well on Sudoku 9x9
+grids, given the small and fixed context length; how-
+ever, we found this architecture to be suboptimal for
+tasks with large context length, such as Maze-Hard
+and ARC-AGI (both using 30x30 grids). We show
+results with and without self-attention for all experi-
+ments.
+
+4.6. No additional forward pass needed with ACT
+As previously mentioned, the implementation of ACT
+in HRM through Q-learning requires two forward
+passes, which slows down training. We propose a
+simple solution, which is to get rid of the continue loss
+(from the Q-learning) and only learn a halting proba-
+bility through a Binary-Cross-Entropy loss of having
+reached the correct solution. By removing the continue
+loss, we remove the need for the expensive second for-
+ward pass, while still being able to determine when to
+halt with relatively good accuracy. We found no sig-
+nificant difference in generalization from this change
+(going from 86.1% to 87.4%; see Table 1).
+
+4.7. Exponential Moving Average (EMA)
+On small data (such as Sudoku-Extreme and Maze-
+Hard), HRM tends to overfit quickly and then diverge.
+To reduce this problem and improve stability, we
+integrate Exponential Moving Average (EMA) of the
+weights, a common technique in GANs and diffusion
+models to improve stability (Brock et al., 2018; Song &
+Ermon, 2020). We find that it prevents sharp collapse
+and leads to higher generalization (going from 79.9%
+to 87.4%; see Table 1).
+
+4.8. Optimal number of recursions
+We experimented with different numbers of recursions
+by varying T and n and found T = 3, n = 3
+(equivalent to 48 recursions) in HRM and T = 3, n = 6
+in TRM (equivalent to 42 recursions) to lead to optimal
+generalization on Sudoku-Extreme. More recursions
+could be helpful for harder problems (we have not
+tested it, given our limited resources); however, in-
+creasing either T or n incurs massive slowdowns. We
+show results at different n and T for HRM and TRM
+in Table 3. Note that TRM requires backpropagation
+through a full recursion process, thus increasing n too
+much leads to Out Of Memory (OOM) errors. How-
+ever, this memory cost is well worth its price in gold.
+
+In the following section, we show our main results on
+multiple datasets comparing HRM, TRM, and LLMs.
+
+5. Results
+Following Wang et al. (2025), we test our approach
+on the following datasets: Sudoku-Extreme (Wang
+et al., 2025), Maze-Hard (Wang et al., 2025), ARC-AGI-
+1 (Chollet, 2019), and ARC-AGI-2 (Chollet et al., 2025).
+Results are presented in Tables 4 and 5. Hyperparame-
+ters are detailed in the section “Hyper-parameters and
+setup”. Datasets are discussed below.
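The recursion counts quoted in Section 4.8 (48 for HRM, 42 for TRM) match the effective-depth formula T(n + 1) n_layers given in the caption of Table 3. The short sketch below reproduces that arithmetic. The function names are illustrative, and `backprop_depth` is an assumption of this sketch: it simply counts the layers in the one recursion process that carries gradients, as a rough proxy for why a large n runs out of memory.

```python
def effective_depth(T, n, n_layers):
    """Effective depth per supervision step: T * (n + 1) * n_layers."""
    return T * (n + 1) * n_layers


def backprop_depth(n, n_layers):
    """Layers traversed with gradients in one full recursion process."""
    return (n + 1) * n_layers


# Settings reported as optimal on Sudoku-Extreme.
hrm_depth = effective_depth(T=3, n=3, n_layers=4)
trm_depth = effective_depth(T=3, n=6, n_layers=2)
print(hrm_depth, trm_depth)  # 48 42

# Gradient-carrying depth for TRM at n = 6 versus n = 12 (2 layers each).
print(backprop_depth(6, 2), backprop_depth(12, 2))  # 14 26
```

The T = 1 row of Table 3 lists depths of 9 and 7 rather than the 8 and 6 this formula gives, so the function should be read as the caption's formula and not as a reproduction of every table entry.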
+
+
+
+Table 3. % Test accuracy on Sudoku-Extreme dataset. HRM
+versus TRM matched at a similar effective depth per super-
+vision step (T(n + 1) n_layers)
+
+                   HRM                   TRM
+             n = k, 4 layers       n = 2k, 2 layers
+    k   T    Depth   Acc (%)       Depth   Acc (%)
+    1   1      9      46.4           7      63.2
+    2   2     24      55.0          20      81.9
+    3   3     48      61.6          42      87.4
+    4   4     80      59.5          72      84.2
+    6   3     84      62.3          78      OOM
+    3   6     96      58.8          84      85.8
+    6   6    168      57.5         156      OOM
+
+Sudoku-Extreme consists of extremely difficult Su-
+doku puzzles (Dillion, 2025; Palm et al., 2018; Park,
+2018) (9x9 grid), for which only 1K training samples
+are used to test small-sample learning. Testing is done
+on 423K samples. Maze-Hard consists of 30x30 mazes
+generated by the procedure of Lehnert et al. (2024)
+whose shortest path is of length above 110; both the
+training set and test set include 1000 mazes.
+
+ARC-AGI-1 and ARC-AGI-2 are geometric puzzles in-
+volving monetary prizes. Each puzzle is designed to
+be easy for a human, yet hard for current AI models.
+Each puzzle task consists of 2-3 input–output demon-
+stration pairs and 1-2 test inputs to be solved. The final
+score is computed as the accuracy over all test inputs
+from two attempts to produce the correct output grid.
+The maximum grid size is 30x30. ARC-AGI-1 con-
+tains 800 tasks, while ARC-AGI-2 contains 1120 tasks.
+We also augment our data with the 160 tasks from
+the closely related ConceptARC dataset (Moskvichev
+et al., 2023). We provide results on the public evalua-
+tion set for both ARC-AGI-1 and ARC-AGI-2.
+
+While these datasets are small, heavy data-
+augmentation is used in order to improve gen-
+eralization. Sudoku-Extreme uses 1000 shuffling
+(done without breaking the Sudoku rules) augmenta-
+tions per data example. Maze-Hard uses 8 dihedral
+transformations per data example. ARC-AGI uses
+1000 data augmentations (color permutation, dihedral-
+group, and translation transformations) per data
+example. The dihedral-group transformations consist
+of random 90-degree rotations, horizontal/vertical
+flips, and reflections.
+
+From the results, we see that TRM without self-
+attention obtains the best generalization on Sudoku-
+Extreme (87.4% test accuracy). Meanwhile, TRM with
+self-attention generalizes better on the other tasks
+(probably due to inductive biases and the overcapac-
+ity of the MLP on large 30x30 grids). TRM with self-
+attention obtains 85.3% accuracy on Maze-Hard, 44.6%
+accuracy on ARC-AGI-1, and 7.8% accuracy on ARC-
+AGI-2 with 7M parameters. This is significantly higher
+than the 74.5%, 40.3%, and 5.0% obtained by HRM us-
+ing 4 times the number of parameters (27M).
+
+Table 4. % Test accuracy on Puzzle Benchmarks (Sudoku-
+Extreme and Maze-Hard)
+    Method               # Params    Sudoku    Maze
+    Chain-of-thought, pretrained
+    Deepseek R1            671B        0.0      0.0
+    Claude 3.7 8K            ?         0.0      0.0
+    o3-mini-high             ?         0.0      0.0
+    Direct prediction, small-sample training
+    Direct pred             27M        0.0      0.0
+    HRM                     27M       55.0     74.5
+    TRM-Att (Ours)           7M       74.7     85.3
+    TRM-MLP (Ours)       5M/19M¹      87.4      0.0
+    ¹ 5M on Sudoku and 19M on Maze
+
+Table 5. % Test accuracy on ARC-AGI Benchmarks (2 tries)
+    Method               # Params    ARC-1    ARC-2
+    Chain-of-thought, pretrained
+    Deepseek R1            671B       15.8      1.3
+    Claude 3.7 16K           ?        28.6      0.7
+    o3-mini-high             ?        34.5      3.0
+    Gemini 2.5 Pro 32K       ?        37.0      4.9
+    Grok-4-thinking        1.7T       66.7     16.0
+    Bespoke (Grok-4)       1.7T       79.6     29.4
+    Direct prediction, small-sample training
+    Direct pred             27M       21.0      0.0
+    HRM                     27M       40.3      5.0
+    TRM-Att (Ours)           7M       44.6      7.8
+    TRM-MLP (Ours)          19M       29.6      2.4
+
+6. Conclusion
+We propose Tiny Recursion Models (TRM), a simple
+recursive reasoning approach that achieves strong gen-
+eralization on hard tasks using a single tiny network
+recursing on its latent reasoning feature and progres-
+sively improving its final answer. Contrary to the
+Hierarchical Reasoning Model (HRM), TRM requires
+no fixed-point theorem, no complex biological justi-
+fications, and no hierarchy. It significantly reduces
+the number of parameters by halving the number of
+layers and replacing the two networks with a single
+tiny network. It also simplifies the halting process,
+removing the need for the extra forward pass. Over-
+
+
+
+
+all, TRM is much simpler than HRM, while achieving
+better generalization.
+
+While our approach led to better generalization on 4
+benchmarks, not every choice made is guaranteed to
+be optimal on every dataset. For example, we found
+that replacing the self-attention with an MLP worked
+extremely well on Sudoku-Extreme (improving test ac-
+curacy by 10%), but poorly on other datasets. Different
+problem settings may require different architectures
+or numbers of parameters. Scaling laws are needed
+to parametrize these networks optimally. Although
+we simplified and improved on deep recursion, the
+question of why recursion helps so much compared
+to using a larger and deeper network remains to be
+explained; we suspect it has to do with overfitting, but
+we have no theory to back this explanation. Not all
+our ideas made the cut; we briefly discuss some of the
+ideas that we tried but that did not work in the section
+“Ideas that failed”.
+
+Currently, recursive reasoning models such as HRM
+and TRM are supervised learning methods rather than
+generative models. This means that given an input
+question, they can only provide a single deterministic
+answer. In many settings, multiple answers exist for a
+question. Thus, it would be interesting to extend TRM
+to generative tasks.
+
+Acknowledgements
+Thank you Emy Gervais for your invaluable support
+and extra push. This research was enabled in part
+by computing resources, software, and technical as-
+sistance provided by Mila and the Digital Research
+Alliance of Canada.
+
+References
+ARC Prize Foundation. The Hidden Drivers of HRM’s
+  Performance on ARC-AGI.
+  https://arcprize.org/blog/hrm-analysis, 2025a.
+  [Online; accessed 2025-09-15].
+
+ARC Prize Foundation. ARC-AGI Leaderboard.
+  https://arcprize.org/leaderboard, 2025b.
+  [Online; accessed 2025-09-24].
+
+Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium
+  models. Advances in neural information processing
+  systems, 32, 2019.
+
+Bai, X. and Melas-Kyriazi, L. Fixed point diffusion
+  models. In Proceedings of the IEEE/CVF Conference
+  on Computer Vision and Pattern Recognition,
+  pp. 9430–9440, 2024.
+
+Brock, A., Donahue, J., and Simonyan, K. Large scale
+  gan training for high fidelity natural image synthesis.
+  arXiv preprint arXiv:1809.11096, 2018.
+
+Chollet, F. On the measure of intelligence. arXiv
+  preprint arXiv:1911.01547, 2019.
+
+Chollet, F., Knoop, M., Kamradt, G., Landers, B.,
+  and Pinkard, H. Arc-agi-2: A new challenge
+  for frontier ai reasoning systems. arXiv preprint
+  arXiv:2505.11831, 2025.
+
+Chowdhery, A., Narang, S., Devlin, J., Bosma, M.,
+  Mishra, G., Roberts, A., Barham, P., Chung, H. W.,
+  Sutton, C., Gehrmann, S., et al. Palm: Scaling language
+  modeling with pathways. Journal of Machine
+  Learning Research, 24(240):1–113, 2023.
+
+Dillion, T. Tdoku: A fast sudoku solver and generator.
+  https://t-dillon.github.io/tdoku/, 2025.
+
+Elman, J. L. Finding structure in time. Cognitive science,
+  14(2):179–211, 1990.
+
+Fedus, W., Zoph, B., and Shazeer, N. Switch transformers:
+  Scaling to trillion parameter models with simple
+  and efficient sparsity. Journal of Machine Learning
+  Research, 23(120):1–39, 2022.
+
+Geng, Z. and Kolter, J. Z. Torchdeq: A library for deep
+  equilibrium models. arXiv preprint arXiv:2310.18605,
+  2023.
+
+Hendrycks, D. and Gimpel, K. Gaussian error linear
+  units (gelus). arXiv preprint arXiv:1606.08415, 2016.
+
+Jang, Y., Kim, D., and Ahn, S. Hierarchical graph
+  generation with k2-trees. In ICML 2023 Workshop on
+  Structured Probabilistic Inference Generative Modeling,
+  2023.
+
+Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B.,
+  Chess, B., Child, R., Gray, S., Radford, A., Wu, J.,
+  and Amodei, D. Scaling laws for neural language
+  models. arXiv preprint arXiv:2001.08361, 2020.
+
+Kingma, D. P. and Ba, J. Adam: A method for stochastic
+  optimization. arXiv preprint arXiv:1412.6980,
+  2014.
+
+Krantz, S. G. and Parks, H. R. The implicit function
+  theorem: history, theory, and applications. Springer
+  Science & Business Media, 2002.
+
+LeCun, Y. Une procedure d’apprentissage pour reseau
+  a seuil asymetrique. Proceedings of cognitiva 85, pp.
+  599–604, 1985.
+
+
+Lehnert, L., Sukhbaatar, S., Su, D., Zheng, Q., Mcvay, P.,
+  Rabbat, M., and Tian, Y. Beyond a*: Better planning
+  with transformers via search dynamics bootstrapping.
+  arXiv preprint arXiv:2402.14083, 2024.
+
+Lillicrap, T. P. and Santoro, A. Backpropagation
+  through time and the brain. Current opinion in
+  neurobiology, 55:82–89, 2019.
+
+Loshchilov, I. and Hutter, F. Decoupled weight decay
+  regularization. arXiv preprint arXiv:1711.05101, 2017.
+
+Moskvichev, A., Odouard, V. V., and Mitchell, M. The
+  conceptarc benchmark: Evaluating understanding
+  and generalization in the arc domain. arXiv preprint
+  arXiv:2305.07141, 2023.
+
+Palm, R., Paquet, U., and Winther, O. Recurrent
+  relational networks. Advances in neural information
+  processing systems, 31, 2018.
+
+Park, K. Can convolutional neural networks
+  crack sudoku puzzles?
+  https://github.com/Kyubyong/sudoku, 2018.
+
+Prieto, L., Barsbey, M., Mediano, P. A., and Birdal, T.
+  Grokking at the edge of numerical stability. arXiv
+  preprint arXiv:2501.04697, 2025.
+
+Rumelhart, D. E., Hinton, G. E., and Williams, R. J.
+  Learning internal representations by error propagation.
+  Technical report, 1985.
+
+Shazeer, N. Glu variants improve transformer. arXiv
+  preprint arXiv:2002.05202, 2020.
+
+Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le,
+  Q., Hinton, G., and Dean, J. Outrageously large neural
+  networks: The sparsely-gated mixture-of-experts
+  layer. arXiv preprint arXiv:1701.06538, 2017.
+
+Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling
+  llm test-time compute optimally can be more effective
+  than scaling model parameters. arXiv preprint
+  arXiv:2408.03314, 2024.
+
+Song, Y. and Ermon, S. Improved techniques for training
+  score-based generative models. Advances in
+  neural information processing systems, 33:12438–12448,
+  2020.
+
+Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu,
+  Y. Roformer: Enhanced transformer with rotary
+  position embedding. Neurocomputing, 568:127063,
+  2024.
+
+Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer,
+  L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A.,
+  Keysers, D., Uszkoreit, J., et al. Mlp-mixer: An
+  all-mlp architecture for vision. Advances in neural
+  information processing systems, 34:24261–24272, 2021.
+
+Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
+  Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin,
+  I. Attention is all you need. Advances in neural
+  information processing systems, 30, 2017.
+
+Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y.,
+  Lu, M., Song, S., and Yadkori, Y. A. Hierarchical
+  reasoning model. arXiv preprint arXiv:2506.21734,
+  2025.
+
+Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F.,
+  Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought
+  prompting elicits reasoning in large language models.
+  Advances in neural information processing systems,
+  35:24824–24837, 2022.
+
+Werbos, P. Beyond regression: New tools for prediction
+  and analysis in the behavioral sciences. PhD
+  thesis, Committee on Applied Mathematics, Harvard
+  University, Cambridge, MA, 1974.
+
+Werbos, P. J. Generalization of backpropagation with
+  application to a recurrent gas market model. Neural
+  networks, 1(4):339–356, 1988.
+
+Zhang, B. and Sennrich, R. Root mean square layer
+  normalization. Advances in Neural Information
+  Processing Systems, 32, 2019.
+
+
+Hyper-parameters and setup
+
+All models are trained with the AdamW opti-
+mizer (Loshchilov & Hutter, 2017; Kingma & Ba, 2014)
+with β_1 = 0.9, β_2 = 0.95, small learning rate warm-
+up (2K iterations), batch-size 768, hidden-size of 512,
+N_sup = 16 max supervision steps, and stable-max loss
+(Prieto et al., 2025) for improved stability. TRM uses an
+Exponential Moving Average (EMA) of 0.999. HRM
+uses n = 2, T = 2 with two 4-layer networks, while
+we use n = 6, T = 3 with one 2-layer network.
+
+For Sudoku-Extreme and Maze-Hard, we train for 60K
+epochs with learning rate 1e-4 and weight decay 1.0.
+For ARC-AGI, we train for 100K epochs with learning
+rate 1e-4 (with 1e-2 learning rate for the embeddings)
+and weight decay 0.1. The numbers for Deepseek R1,
+Claude 3.7 8K, o3-mini-high, Direct prediction, and
+HRM in Tables 4 and 5 are taken from Wang et al.
+(2025). Both HRM and TRM add an embedding of
+shape [0, 1, D] on Sudoku-Extreme and Maze-Hard to
+the input. For ARC-AGI, each puzzle (containing 2-3
+training examples and 1-2 test examples) at each data-
+augmentation is given a specific embedding of shape
+[0, 1, D] and, at test-time, the most common answer
+out of the 1000 data augmentations is given as answer.
+
+Experiments on Sudoku-Extreme were run with 1 L40S
+with 40GB of RAM for generally less than 36 hours.
+Experiments on Maze-Hard were run with 4 L40S with
+40GB of RAM for less than 24 hours. Experiments on
+ARC-AGI were run for around 3 days with 4 H100
+with 80GB of RAM.
+
+Ideas that failed
+
+In this section, we briefly mention a few ideas that
+did not work so that others can avoid making the same
+mistakes.
+
+We tried replacing the SwiGLU MLPs by SwiGLU
+Mixture-of-Experts (MoEs) (Shazeer et al., 2017; Fedus
+et al., 2022), but we found generalization to decrease
+massively. MoEs clearly add too much unnecessary
+capacity, just like increasing the number of layers does.
+
+Instead of back-propagating through the whole n + 1
+recursions, we tried a compromise between that and
+HRM's 1-step gradient approximation, which back-
+propagates through only the last 2 recursions. We did
+so by decoupling n from the number of last recursions
+k that we back-propagate through. For example, while
+n = 6 requires 7 steps with gradients in TRM, we can use
+gradients for only the k = 4 last steps. However, we
+found that this did not help generalization in any way,
+and it made the approach more complicated. Back-
+propagating through the whole n + 1 recursions makes
+the most sense and works best.
+
+We tried removing ACT with the option of stopping
+when the solution is reached, but we found that gen-
+eralization dropped significantly. This can probably
+be attributed to the fact that the model is spending
+too much time on the same data samples rather than
+focusing on learning on a wide range of data samples.
+
+We tried weight tying the input embedding and out-
+put head, but this was too constraining and led to a
+massive generalization drop.
+
+We tried using TorchDEQ (Geng & Kolter, 2023) to
+replace the recursion steps by fixed-point iteration as
+done by Deep Equilibrium Models (Bai et al., 2019).
+This would provide a better justification for the 1-step
+gradient approximation. However, this slowed down
+training due to the fixed-point iteration and led to
+worse generalization. This highlights the fact that
+converging to a fixed-point is not essential.
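The test-time rule described under "Hyper-parameters and setup" for ARC-AGI (return the most common answer across the data augmentations) can be illustrated with a small voting helper. This is a minimal sketch rather than the authors' code: `most_common_answer` is a hypothetical name, grids are represented as nested tuples so they can be counted, and the sketch assumes every prediction has already been mapped back to the original orientation and colors before voting.

```python
from collections import Counter


def most_common_answer(predictions):
    """Return the prediction that occurs most often.

    Ties are resolved in favor of the prediction seen first.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    return Counter(predictions).most_common(1)[0][0]


# Toy example: three augmented runs of the same puzzle, already mapped back
# to the original frame (an assumption of this sketch).
grid_a = ((1, 2), (3, 4))
grid_b = ((1, 2), (4, 3))
votes = [grid_a, grid_b, grid_a]
print(most_common_answer(votes))  # ((1, 2), (3, 4))
```

In the setting described above, `votes` would hold one predicted grid per augmentation, up to 1000 per puzzle.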
+
+
+
+Algorithms with different number of latent features
+
+def latent_recursion(x, z, n=6):
+    for i in range(n + 1):  # latent recursion
+        z = net(x, z)
+    return z
+
+def deep_recursion(x, z, n=6, T=3):
+    # recursing T-1 times to improve z (no gradients needed)
+    with torch.no_grad():
+        for j in range(T - 1):
+            z = latent_recursion(x, z, n)
+    # recursing once to improve z (with gradients)
+    z = latent_recursion(x, z, n)
+    # single-z variant: there is no separate y, so both heads read z
+    return z.detach(), output_head(z), Q_head(z)
+
+# Deep Supervision
+for x_input, y_true in train_dataloader:
+    z = z_init
+    for step in range(N_supervision):
+        x = input_embedding(x_input)
+        z, y_hat, q_hat = deep_recursion(x, z)
+        loss = softmax_cross_entropy(y_hat, y_true)
+        loss += binary_cross_entropy(q_hat, (y_hat == y_true))
+        z = z.detach()
+        loss.backward()
+        opt.step()
+        opt.zero_grad()
+        if q_hat[0] > 0:  # early stopping
+            break
+
+Figure 4. Pseudocode of TRM using a single-z with deep
+supervision training in PyTorch.
+
+def latent_recursion(x, y, z, n=6):
+    for i in range(n):  # latent recursion
+        z[i] = net(x, y, z[0], ..., z[n-1])
+    y = net(y, z[0], ..., z[n-1])  # refine output answer
+    return y, z
+
+def deep_recursion(x, y, z, n=6, T=3):
+    # recursing T-1 times to improve y and z (no gradients needed)
+    with torch.no_grad():
+        for j in range(T - 1):
+            y, z = latent_recursion(x, y, z, n)
+    # recursing once to improve y and z (with gradients)
+    y, z = latent_recursion(x, y, z, n)
+    return (y.detach(), z.detach()), output_head(y), Q_head(y)
+
+# Deep Supervision
+for x_input, y_true in train_dataloader:
+    y, z = y_init, z_init
+    for step in range(N_supervision):
+        x = input_embedding(x_input)
+        (y, z), y_hat, q_hat = deep_recursion(x, y, z)
+        loss = softmax_cross_entropy(y_hat, y_true)
+        loss += binary_cross_entropy(q_hat, (y_hat == y_true))
+        loss.backward()
+        opt.step()
+        opt.zero_grad()
+        if q_hat[0] > 0:  # early stopping
+            break
+
+Figure 5. Pseudocode of TRM using multi-scale z with deep
+supervision training in PyTorch.
+
+[Figure 6: Example on Sudoku-Extreme. Four 9x9 grids: Input x;
+Output y; Tokenized z_H (denoted y in TRM); Tokenized z_L
+(denoted z in TRM).]
+
+Figure 6. This Sudoku-Extreme example shows an input, expected
+output, and the tokenized z_H and z_L (after reversing the
+embedding and using argmax) for a pretrained model. This
+highlights the fact that z_H corresponds to the predicted
+response, while z_L is a latent feature that cannot be decoded
+to a sensible output unless transformed into z_H by f_H.
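+
+For readers who want to execute the single-z loop, a
+self-contained toy version might look like the following. It is
+a sketch under stated assumptions rather than the released
+implementation: TinyNet, the sizes D, L and V, the mean-pooled
+halting head, the toy task, and train_on_batch are all
+hypothetical choices made so that the example runs on its own.
+
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
D, L, V = 16, 4, 5   # hidden size, sequence length, vocabulary size (toy values)


class TinyNet(nn.Module):
    """Hypothetical stand-in for the TRM network: (x, z) -> new z."""

    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * D, D), nn.GELU(), nn.Linear(D, D))

    def forward(self, x, z):
        return z + self.mlp(torch.cat([x, z], dim=-1))


net = TinyNet()
input_embedding = nn.Embedding(V, D)
output_head = nn.Linear(D, V)   # per-token logits
q_head = nn.Linear(D, 1)        # halting logit (mean-pooling is an assumption)
params = [*net.parameters(), *input_embedding.parameters(),
          *output_head.parameters(), *q_head.parameters()]
opt = torch.optim.AdamW(params, lr=1e-3)


def latent_recursion(x, z, n=6):
    for _ in range(n + 1):
        z = net(x, z)
    return z


def deep_recursion(x, z, n=6, T=3):
    with torch.no_grad():                      # T - 1 passes without gradients
        for _ in range(T - 1):
            z = latent_recursion(x, z, n)
    z = latent_recursion(x, z, n)              # one pass with gradients
    q_hat = q_head(z.mean(dim=1)).squeeze(-1)  # one halting logit per example
    return z.detach(), output_head(z), q_hat


def train_on_batch(x_input, y_true, n_supervision=4):
    z = torch.zeros(x_input.shape[0], L, D)
    losses = []
    for _ in range(n_supervision):
        x = input_embedding(x_input)
        z, y_hat, q_hat = deep_recursion(x, z)
        loss = F.cross_entropy(y_hat.reshape(-1, V), y_true.reshape(-1))
        # Halting target: 1 when every token of an example is predicted correctly.
        solved = (y_hat.argmax(dim=-1) == y_true).all(dim=1).float()
        loss = loss + F.binary_cross_entropy_with_logits(q_hat, solved)
        loss.backward()
        opt.step()
        opt.zero_grad()
        losses.append(loss.item())
        if bool((q_hat > 0).all()):            # early stopping once all examples halt
            break
    return losses


x_input = torch.randint(0, V, (8, L))
y_true = (x_input + 1) % V                     # toy task: shift every token by one
losses = train_on_batch(x_input, y_true)
```
+
+Because z is detached between supervision steps, each call to
+loss.backward() only traverses the final latent_recursion pass
+of that step, which keeps memory use independent of the number
+of supervision steps.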
+
+
+\ No newline at end of file