summaryrefslogtreecommitdiff
path: root/papers/txt/trm2025_tiny_recursive.txt
diff options
context:
space:
mode:
Diffstat (limited to 'papers/txt/trm2025_tiny_recursive.txt')
-rw-r--r--papers/txt/trm2025_tiny_recursive.txt796
1 files changed, 796 insertions, 0 deletions
diff --git a/papers/txt/trm2025_tiny_recursive.txt b/papers/txt/trm2025_tiny_recursive.txt
new file mode 100644
index 0000000..55cf994
--- /dev/null
+++ b/papers/txt/trm2025_tiny_recursive.txt
@@ -0,0 +1,796 @@
+ Less is More: Recursive Reasoning with Tiny Networks
+
+
+
+ Alexia Jolicoeur-Martineau
+ Samsung SAIL Montréal
+ alexia.j@samsung.com
+
+
+ Abstract
+arXiv:2510.04871v1 [cs.LG] 6 Oct 2025
+
+
+
+
+ Hierarchical Reasoning Model (HRM) is a
+ novel approach using two small neural net-
+ works recursing at different frequencies. This
+ biologically inspired method beats Large Lan-
+ guage models (LLMs) on hard puzzle tasks
+ such as Sudoku, Maze, and ARC-AGI while
+ trained with small models (27M parameters)
+ on small data (∼ 1000 examples). HRM holds
+ great promise for solving hard problems with
+ small networks, but it is not yet well un-
+ derstood and may be suboptimal. We pro-
+ pose Tiny Recursive Model (TRM), a much
+ simpler recursive reasoning approach that
+ achieves significantly higher generalization
+ than HRM, while using a single tiny network
+ with only 2 layers. With only 7M parameters,
+ TRM obtains 45% test-accuracy on ARC-AGI-
+ 1 and 8% on ARC-AGI-2, higher than most
+ LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5
+ Pro) with less than 0.01% of the parameters.
+
+
+
+ 1. Introduction
+ While powerful, Large Language models (LLMs) can
+ struggle on hard question-answer problems. Given
+ that they generate their answer auto-regressively, there
+ is a high risk of error since a single incorrect token can
+ render an answer invalid. To improve their reliabil- Figure 1. Tiny Recursion Model (TRM) recursively improves
+ ity, LLMs rely on Chain-of-thoughts (CoT) (Wei et al., its predicted answer y with a tiny network. It starts with the
+ 2022) and Test-Time Compute (TTC) (Snell et al., 2024). embedded input question x and initial embedded answer
+ y, and latent z. For up to Nsup = 16 improvements steps,
+ CoTs seek to emulate human reasoning by having the
+ it tries to improve its answer y. It does so by i) recursively
+ LLM to sample step-by-step reasoning traces prior to
+ updating n times its latent z given the question x, current
+ giving their answer. Doing so can improve accuracy, answer y, and current latent z (recursive reasoning), and
+ but CoT is expensive, requires high-quality reasoning then ii) updating its answer y given the current answer y
+ data (which may not be available), and can be brittle and current latent z. This recursive process allows the model
+ since the generated reasoning may be wrong. To fur- to progressively improve its answer (potentially address-
+ ther improve reliability, test-time compute can be used ing any errors from its previous answer) in an extremely
+ by reporting the most common answer out of K or the parameter-efficient manner while minimizing overfitting.
+ highest-reward answer (Snell et al., 2024).
+
+ 1
+ Recursive Reasoning with Tiny Networks
+
+However, this may not be enough. LLMs with CoT ers that achieves significantly higher generalization
+and TTC are not enough to beat every problem. While than HRM on a variety of problems. In doing so, we
+LLMs have made significant progress on ARC-AGI improve the state-of-the-art test accuracy on Sudoku-
+(Chollet, 2019) since 2019, human-level accuracy still Extreme from 55% to 87%, Maze-Hard from 75% to
+has not been reached (6 years later, as of writing of 85%, ARC-AGI-1 from 40% to 45%, and ARC-AGI-2
+this paper). Furthermore, LLMs struggle on the newer from 5% to 8%.
+ARC-AGI-2 (e.g., Gemini 2.5 Pro only obtains 4.9% test
+accuracy with a high amount of TTC) (Chollet et al., 2. Background
+2025; ARC Prize Foundation, 2025b).
+ HRM is described in Algorithm 2. We discuss the
+An alternative direction has recently been proposed by
+ details of the algorithm further below.
+Wang et al. (2025). They propose a new way forward
+through their novel Hierarchical Reasoning Model
+(HRM), which obtains high accuracy on puzzle tasks 2.1. Structure and goal
+where LLMs struggle to make a dent (e.g., Sudoku The focus of HRM is supervised learning. Given an
+solving, Maze pathfinding, and ARC-AGI). HRM is a input, produce an output. Both input and output are
+supervised learning model with two main novelties: 1) assumed to have shape [ B, L] (when the shape differs,
+recursive hierarchical reasoning, and 2) deep supervision. padding tokens can be added), where B is the batch-
+Recursive hierarchical reasoning consists of recurs- size and L is the context-length.
+ing multiple times through two small networks ( f L at HRM contains four learnable components: the in-
+high frequency and f H at low frequency) to predict the put embedding f I (·; θ I ), low-level recurrent network
+answer. Each network generates a different latent fea- f L (·; θ L ), high-level recurrent network f H (·; θ H ), and
+ture: f L outputs z H and f H outputs z L . Both features the output head fO (·; θO ). Once the input is embedded,
+(z L , z H ) are used as input to the two networks. The the shape becomes [ B, L, D ] where D is the embedding
+authors provide some biological arguments in favor of size. Each network is a 4-layer Transformers architec-
+recursing at different hierarchies based on the different ture (Vaswani et al., 2017), with RMSNorm (Zhang
+temporal frequencies at which the brains operate and & Sennrich, 2019), no bias (Chowdhery et al., 2023),
+hierarchical processing of sensory inputs. rotary embeddings (Su et al., 2024), and SwiGLU acti-
+Deep supervision consists of improving the answer vation function (Hendrycks & Gimpel, 2016; Shazeer,
+through multiple supervision steps while carrying the 2020).
+two latent features as initialization for the improve-
+ment steps (after detaching them from the computa- 2.2. Recursion at two different frequencies
+tional graph so that their gradients do not propagate). Given the hyperparameters used by Wang et al. (2025)
+This provide residual connections, which emulates (n = 2 f L steps, 1 f H steps; done T = 2 times), a
+very deep neural networks that are too memory ex- forward pass of HRM is done as follows:
+pensive to apply in one forward pass.
+An independent analysis on the ARC-AGI benchmark x ← f I ( x̃ )
+showed that deep supervision seems to be the primary z L ← f L (z L + z H + x ) # without gradients
+driver of the performance gains (ARC Prize Founda- z L ← f L (z L + z H + x ) # without gradients
+tion, 2025a). Using deep supervision doubled accuracy
+ z H ← f H (z L + z H ) # without gradients
+over single-step supervision (going from 19% to 39%
+accuracy), while recursive hierarchical reasoning only z L ← f L (z L + z H + x ) # without gradients
+slightly improved accuracy over a regular model with z L ← z L .detach()
+a single forward pass (going from 35.7% to 39.0% ac- z H ← z H .detach()
+curacy). This suggests that reasoning across different
+ z L ← f L (z L + z H + x ) # with gradients
+supervision steps is worth it, but the recursion done
+in each supervision step is not particularly important. z H ← f H (z L + z H ) # with gradients
+ ŷ ← argmax( fO (z H ))
+In this work, we show that the benefit from recursive
+reasoning can be massively improved, making it much
+more than incremental. We propose Tiny Recursive where ŷ is the predicted output answer, z L and z H are
+Model (TRM), an improved and simplified approach either initialized embeddings or the embeddings of
+using a much smaller tiny network with only 2 lay- the previous deep supervision step (after detaching
+ them from the computational graph). As can be seen,
+
+ 2
+ Recursive Reasoning with Tiny Networks
+
+def hrm(z, x, n=2, T=2): # hierarchical reasoning 2.4. Deep supervision
+ zH, zL = z
+ with torch.no grad(): To improve effective depth, deep supervision is used.
+ for i in range(nT − 2):
+ zL = L net(zL, zH, x) This consists of reusing the previous latent features
+ if (i + 1) % T == 0:
+ zH = H net(zH, zL)
+ (z H and z L ) as initialization for the next forward pass.
+ # 1−step grad This allows the model to reason over many iterations
+ zL = L net(zL, zH, x)
+ zH = H net(zH, zL) and improve its latent features (z L and z H ) until it
+ return (zH, zL), output head(zH), Q head(zH) (hopefully) converges to the correct solution. At most
+def ACT halt(q, y hat, y true): Nsup = 16 supervision steps are used.
+ target halt = (y hat == y true)
+ loss = 0.5∗binary cross entropy(q[0], target halt)
+ return loss 2.5. Adaptive computational time (ACT)
+def ACT continue(q, last step):
+ if last step: With deep supervision, each mini-batch of data sam-
+ target continue = sigmoid(q[0]) ples must be used for Nsup = 16 supervision steps
+ else:
+ target continue = sigmoid(max(q[0], q[1]))) before moving to the next mini-batch. This is expen-
+ loss = 0.5∗binary cross entropy(q[1], target continue)
+ return loss
+ sive, and there is a balance to be reached between
+ optimizing a few data examples for many supervision
+# Deep Supervision
+for x input, y true in train dataloader: steps versus optimizing many data examples with less
+ z = z init supervision steps. To reach a better balance, a halting
+ for step in range(N sup): # deep supervision
+ x = input embedding(x input) mechanism is incorporated to determine whether the
+ z, y pred, q = hrm(z, x)
+ loss = softmax cross entropy(y pred, y true)
+ model should terminate early. It is learned through
+ # Adaptive computational time (ACT) using Q−learning a Q-learning objective that requires passing the z H
+ loss += ACT halt(q, y pred, y true)
+ , , q next = hrm(z, x) # extra forward pass through an additional head and running an additional
+ loss += ACT continue(q next, step == N sup − 1) forward pass (to determine if halting now rather than
+ z = z.detach()
+ loss.backward() later would have been preferable). They call this
+ opt.step()
+ opt.zero grad()
+ method Adaptive computational time (ACT). It is only
+ if q[0] > q[1]: # early−stopping used during training, while the full Nsup = 16 super-
+ break
+ vision steps are done at test time to maximize down-
+ stream performance. ACT greatly diminishes the time
+Figure 2. Pseudocode of Hierarchical Reasoning Models spent per example (on average spending less than 2
+(HRMs). steps on the Sudoku-Extreme dataset rather than the
+ full Nsup = 16 steps), allowing more coverage of the
+a forward pass of HRM consists of applying 6 function dataset given a fixed number of training iterations.
+evaluations, where the first 4 function evaluations are
+detached from the computational graph and are not 2.6. Deep supervision and 1-step gradient
+back-propagated through. The authors uses n = 2 approximations replaces BPTT
+with T = 2 in all experiments, but HRM can be gener-
+alized by allowing for an arbitrary number of L steps Deep supervision and the 1-step gradient approxima-
+(n) and recursions (T) as shown in Algorithm 2. tion provide a more biologically plausible and less
+ computationally-expansive alternative to Backpropa-
+2.3. Fixed-point recursion with 1-step gradient gation Through Time (BPTT) (Werbos, 1974; Rumel-
+ approximation hart et al., 1985; LeCun, 1985) for solving the temporal
+ credit assignment (TCA) (Rumelhart et al., 1985; Wer-
+Assuming that (z L , z H ) reaches a fixed-point (z∗L , z∗H ) bos, 1988; Elman, 1990) problem (Lillicrap & Santoro,
+through recursing from both f L and f H , 2019). The implication is that HRM can learn what
+ would normally require an extremely large network
+ z∗L ≈ f L (z∗L + z H + x ) without having to back-propagate through its entire
+ z∗H ≈ f H (z L + z∗H ) , depth. Given the hyperparameters used by Jang et al.
+ (2023) in all their experiments, HRM effectively rea-
+the Implicit Function Theorem (Krantz & Parks, 2002)
+ sons over nlayers (n + 1) TNsup = 4 ∗ (2 + 1) ∗ 2 ∗ 16 =
+with the 1-step gradient approximation (Bai et al.,
+ 384 layers of effective depth.
+2019) is used to approximate the gradient by back-
+propagating only the last f L and f H steps. This theo-
+rem is used to justify only tracking the gradients of
+the last two steps (out of 6), which greatly reduces
+memory demands.
+
+ 3
+ Recursive Reasoning with Tiny Networks
+
+2.7. Summary of HRM different from the much smaller n = 2 and T = 2 used
+ in every experiment of their paper, we observe the
+HRM leverages recursion from two networks at dif-
+ following:
+ferent frequencies (high frequency versus low fre-
+quency) and deep supervision to learn to improve
+its answer over multiple supervision steps (with ACT 1. the residual for z H is clearly well above 0 at every
+to reduce time spent per data example). This enables step
+the model to imitate extremely large depth without
+requiring backpropagation through all layers. This
+approach obtains significantly higher performance on 2. the residual for z L only becomes closer to 0 after
+hard question-answer tasks that regular supervised many cycles, but it remains significantly above 0
+models struggle with. However, this method is quite
+complicated, relying a bit too heavily on uncertain
+biological arguments and fixed-point theorems that 3. z L is very far from converged after one f L evalu-
+are not guaranteed to be applicable. In the next sec- ation at T cycles, which is when the fixed-point
+tion, we discuss those issues and potential targets for is assumed to be reached and the 1-step gradient
+improvements in HRM. approximation is used
+
+3. Target for improvements in Hierarchical Thus, while the application of the IFT theorem and
+ Reasoning Models 1-step gradient approximation to HRM has some basis
+ since the residuals do generally reduce over time, a
+In this section, we identify key targets for improve-
+ fixed point is unlikely to be reached when the theorem
+ments in HRM, which will be addressed by our pro-
+ is actually applied.
+posed method, Tiny Recursion Models (TRMs).
+ In the next section, we show that we can bypass the
+3.1. Implicit Function Theorem (IFT) with 1-step need for the IFT theorem and 1-step gradient approxi-
+ gradient approximation mation, thus bypassing the issue entirely.
+
+HRM only back-propagates through the last 2 of the 6 3.2. Twice the forward passes with Adaptive
+recursions. The authors justify this by leveraging the computational time (ACT)
+Implicit Function Theorem (IFT) and one-step approx-
+imation, which states that when a recurrent function HRM uses Adaptive computational time (ACT) during
+converges to a fixed point, backpropagation can be training to optimize the time spent of each data sam-
+applied in a single step at that equilibrium point. ple. Without ACT, Nsup = 16 supervision steps would
+ be spent on the same data sample, which is highly in-
+There are concerns about applying this theorem to
+ efficient. They implement ACT through an additional
+HRM. Most importantly, there is no guarantee that
+ Q-learning objective, which decides when to halt and
+a fixed-point is reached. Deep equilibrium models
+ move to a new data sample rather than keep iterating
+normally do fixed-point iteration to solve for the fixed
+ on the same data. This allows much more efficient
+pointz∗ = f (z∗ ) (Bai et al., 2019). However, in the case
+ use of time especially since the average number of su-
+of HRM, it is not iterating to the fixed-point but simply
+ pervision steps during training is quite low with ACT
+doing forward passes of f L and f H . To make matters
+ (less than 2 steps on the Sudoku-Extreme dataset as
+worse, HRM is only doing 4 recursions before stopping
+ per their reported numbers).
+to apply the one-step approximation. After its first
+loop of two f L and 1 f H evaluations, it only apply a However, ACT comes at a cost. This cost is not directly
+single f L evaluation before assuming that a fixed-point shown in the HRM’s paper, but it is shown in their of-
+is reached for both z L and z H (z∗L = f L (z∗L + z H + x ) ficial code. The Q-learning objective relies on a halting
+and z∗H = f H (z∗L + z∗H )). Then, the one-step gradient loss and a continue loss. The continue loss requires an
+approximation is applied to both latent variables in extra forward pass through HRM (with all 6 function
+succession. evaluations). This means that while ACT optimizes
+ time more efficiently per sample, it requires 2 forward
+The authors justify that a fixed-point is reached by
+ passes per optimization step. The exact formulation is
+depicting an example with n = 7 and T = 7 where
+ shown in Algorithm 2.
+the forward residuals is reduced over time (Figure 3
+in Wang et al. (2025)). Even in this setting, which is In the next section, we show that we can bypass the
+ need for two forward passes in ACT.
+
+ 4
+ Recursive Reasoning with Tiny Networks
+
+3.3. Hierarchical interpretation based on complex of 2 passes). Our approach is described in Algorithm 3
+ biological arguments and illustrated in Figure 1. We also provide an ablation
+ in Table 1 on the Sudoku-Extreme dataset (a dataset
+The HRM’s authors justify the two latent variables
+ of difficult Sudokus with only 1K training examples,
+and two networks operating at different hierarchies
+ but 423K test examples). Below, we explain the key
+based on biological arguments, which are very far
+ components of TRMs.
+from artificial neural networks. They even try to match
+HRM to actual brain experiments on mice. While in-
+teresting, this sort of explanation makes it incredibly Table 1. Ablation of TRM on Sudoku-Extreme comparing %
+hard to parse out why HRM is designed the way it Test accuracy, effective depth per supervision step ( T (n +
+is. Given the lack of ablation table in their paper, the 1)nlayers ), number of Forward Passes (NFP) per optimization
+over-reliance on biological arguments and fixed-point step, and number of parameters
+theorems (that are not perfectly applicable), it is hard Method Acc (%) Depth NFP # Params
+to determine what parts of HRM is helping what and HRM 55.0 24 2 27M
+why. Furthermore, it is not clear why they use two TRM (T = 3, n = 6) 87.4 42 1 5M
+latent features rather than other combinations of fea- w/ ACT 86.1 42 2 5M
+tures. w/ separate f H , f L 82.4 42 1 10M
+ no EMA 79.9 42 1 5M
+In the next section, we show that the recursive process w/ 4-layers, n = 3 79.5 48 1 10M
+can be greatly simplified and understood in a much w/ self-attention 74.7 42 1 7M
+simpler manner that does not require any biological w/ T = 2, n = 2 73.7 12 1 5M
+argument, fixed-point theorem, hierarchical interpre- w/ 1-step gradient 56.5 42 1 5M
+tation, nor using two networks. It also explains why 2
+is the optimal number of features (z L and z H ).
+ 4.1. No fixed-point theorem required
+
+def latent recursion(x, y, z, n=6): HRM assumes that the recursions converge to a fixed-
+ for i in range(n): # latent reasoning point for both z L and z H in order to leverage the 1-step
+ z = net(x, y, z)
+ y = net(y, z) # refine output answer gradient approximation (Bai et al., 2019). This allows
+ return y, z
+ the authors to justify only back-propagating through
+def deep recursion(x, y, z, n=6, T=3): the last two function evaluations (1 f L and 1 f H ). To
+ # recursing T−1 times to improve y and z (no gradients needed)
+ with torch.no grad(): bypass this theoretical requirement, we define a full
+ for j in range(T−1): recursion process as containing n evaluations of f L
+ y, z = latent recursion(x, y, z, n)
+ # recursing once to improve y and z and 1 evaluation of f H :
+ y, z = latent recursion(x, y, z, n)
+ return (y.detach(), z.detach()), output head(y), Q head(y) z L ← f L (z L + z H + x )
+# Deep Supervision ...
+for x input, y true in train dataloader:
+ y, z = y init, z init z L ← f L (z L + z H + x )
+ for step in range(N supervision):
+ x = input embedding(x input) z H ← f H (z L + z H ) .
+ (y, z), y hat, q hat = deep recursion(x, y, z)
+ loss = softmax cross entropy(y hat, y true)
+ loss += binary cross entropy(q hat, (y hat == y true)) Then, we simply back-propagate through the full re-
+ loss.backward() cursion process.
+ opt.step()
+ opt.zero grad()
+ if q hat > 0: # early−stopping Through deep supervision, the models learns to take
+ break any (z L , z H ) and improve it through a full recursion
+ process, hopefully making z H closer to the solution.
+Figure 3. Pseudocode of Tiny Recursion Models (TRMs). This means that by the design of the deep supervi-
+ sion goal, running a few full recursion processes (even
+ without gradients) is expected to bring us closer to the
+4. Tiny Recursion Models solution. We propose to run T − 1 recursion processes
+ without gradient to improve (z L , z H ) before running
+In this section, we present Tiny Recursion Models
+ one recursion process with backpropagation.
+(TRMs). Contrary to HRM, TRM requires no com-
+plex mathematical theorem, hierarchy, nor biological Thus, instead of using the 1-step gradient approxi-
+arguments. It generalizes better while requiring only mation, we apply a full recursion process containing
+a single tiny network (instead of two medium-size net- n evaluations of f L and 1 evaluation of f H . This re-
+works) and a single forward pass for the ACT (instead moves entirely the need to assume that a fixed-point
+
+ 5
+ Recursive Reasoning with Tiny Networks
+
+is reached and the use of the IFT theorem with 1-step While this is intuitive, we wanted to verify whether
+gradient approximation. Yet, we can still leverage using more or less features could be helpful. Results
+multiple backpropagation-free recursion processes to are shown in Table 2.
+improve (z L , z H ). With this approach, we obtain a
+ More features (> 2): We tested splitting z into dif-
+massive boost in generalization on Sudoku-Extreme
+ ferent features by treating each of the n recursions as
+(improving TRM from 56.5% to 87.4%; see Table 1).
+ producing a different zi for i = 1, ..., n. Then, each
+ zi is carried across supervision steps. The approach
+4.2. Simpler reinterpretation of z H and z L is described in Algorithm 5. In doing so, we found
+HRM is interpreted as doing hierarchical reasoning performance to drop. This is expected because, as dis-
+over two latent features of different hierarchies due to cussed, there is no apparent need for splitting z into
+arguments from biology. However, one might wonder multiple parts. It does not have to be hierarchical.
+why use two latent features instead of 1, 3, or more? Single feature: Similarly, we tested the idea of taking
+And do we really need to justify these so-called ”hier- a single feature by only carrying z H across supervi-
+archical” features based on biology to make sense of sion steps. The approach is described in Algorithm 4.
+them? We propose a simple non-biological explana- In doing so, we found performance to drop. This is
+tion, which is more natural, and directly answers the expected because, as discussed, it forces the model to
+question of why there are 2 features. store the solution y within z.
+The fact of the matter is: z H is simply the current Thus, we explored using more or less latent variables
+(embedded) solution. The embedding is reversed by on Sudoku-Extreme, but found that having only y and
+applying the output head and rounding to the nearest z lead to better test accuracy in addition to being the
+token using the argmax operation. On the other hand, simplest more natural approach.
+z L is a latent feature that does not directly correspond
+to a solution, but it can be transformed into a solution
+by applying z H ← f H ( x, z L , z H ). We show an example
+on Sudoku-Extreme in Figure 6 to highlight the fact Table 2. TRM on Sudoku-Extreme comparing % Test accu-
+that z H does correspond to the solution, but z L does racy when using more or less latent features
+not. Method # of features Acc (%)
+ TRM y, z (Ours) 2 87.4
+Once this is understood, hierarchy is not needed; there TRM multi-scale z n+1 = 7 77.6
+is simply an input x, a proposed solution y (previously TRM single z 1 71.9
+called z H ), and a latent reasoning feature z (previously
+called z L ). Given the input question x, current solution
+y, and current latent reasoning z, the model recursively
+improves its latent z. Then, given the current latent z 4.3. Single network
+and the previous solution y, the model proposes a new HRM uses two networks, one applied frequently as a
+solution y (or stay at the current solution if its already low-level module f H and one applied rarely as an high-
+good). level module ( f H ). This requires twice the number of
+Although this has no direct influence on the algorithm, parameters compared to regular supervised learning
+this re-interpretation is much simpler and natural. It with a single network.
+answers the question about why two features: remem- As mentioned previously, while f L iterates on the la-
+bering in context the question x, previous reasoning tent reasoning feature z (z L in HRM), the goal of f H
+z, and previous answer y helps the model iterate on is to update the solution y (z H in HRM) given the la-
+the next reasoning z and then the next answer y. If tent reasoning and current solution. Importantly, since
+we were not passing the previous reasoning z, the z ← f L ( x + y + z) contains x but y ← f H (y + z) does
+model would forget how it got to the previous solu- not contains x, the task to achieve (iterating on z versus
+tion y (since z acts similarly as a chain-of-thought). If using z to update y) is directly specified by the inclu-
+we were not passing the previous solution y, then the sion or lack of x in the inputs. Thus, we considered
+model would forget what solution it had and would the possibility that both networks could be replaced
+be forced to store the solution y within z instead of by a single network doing both tasks. In doing so, we
+using it for latent reasoning. Thus, we need both y and obtain better generalization on Sudoku-Extreme (im-
+z separately, and there is no apparent reason why one proving TRM from 82.4% to 87.4%; see Table 1) while
+would need to split z into multiple features. reducing the number of parameters by half. It turns
+ out that a single network is enough.
+
+ 6
+ Recursive Reasoning with Tiny Networks
+
+4.4. Less is more 4.6. No additional forward pass needed with ACT
+We attempted to increase capacity by increasing the As previously mentioned, the implementation of ACT
+number of layers in order to scale the model. Sur- in HRM through Q-learning requires two forward
+prisingly, we found that adding layers decreased gen- passes, which slows down training. We propose a
+eralization due to overfitting. In doing the oppo- simple solution, which is to get rid of the continue loss
+site, decreasing the number of layers while scaling (from the Q-learning) and only learn a halting proba-
+the number of recursions (n) proportionally (to keep bility through a Binary-Cross-Entropy loss of having
+the amount of compute and emulated depth approxi- reached the correct solution. By removing the continue
+mately the same), we found that using 2 layers (instead loss, we remove the need for the expensive second for-
+of 4 layers) maximized generalization. In doing so, we ward pass, while still being able to determine when to
+obtain better generalization on Sudoku-Extreme (im- halt with relatively good accuracy. We found no sig-
+proving TRM from 79.5% to 87.4%; see Table 1) while nificant difference in generalization from this change
+reducing the number of parameters by half (again). (going from 86.1% to 87.4%; see Table 1).
+It is quite surprising that smaller networks are bet-
+ter, but 2 layers seems to be the optimal choice. Bai 4.7. Exponential Moving Average (EMA)
+& Melas-Kyriazi (2024) also observed optimal perfor- On small data (such as Sudoku-Extreme and Maze-
+mance for 2-layers in the context of deep equilibrium Hard), HRM tends to overfit quickly and then diverge.
+diffusion models; however, they had similar perfor- To reduce this problem and improves stability, we
+mance to the bigger networks, while we instead ob- integrate Exponential Moving Average (EMA) of the
+serve better performance with 2 layers. This may ap- weights, a common technique in GANs and diffusion
+pear unusual, as with modern neural networks, gener- models to improve stability (Brock et al., 2018; Song &
+alization tends to directly correlate with model sizes. Ermon, 2020). We find that it prevents sharp collapse
+However, when data is too scarce and model size is and leads to higher generalization (going from 79.9%
+large, there can be an overfitting penalty (Kaplan et al., to 87.4%; see Table 1).
+2020). This is likely an indication that there is too little
+data. Thus, using tiny networks with deep recursion 4.8. Optimal the number of recursions
+and deep supervision appears to allow us to bypass a
+lot of the overfitting. We experimented with different number of recursions
+ by varying T and n and found that T = 3, n = 3
+4.5. attention-free architecture for tasks with small (equivalent to 48 recursions) in HRM and T = 3, n = 6
+ fixed context length in TRM (equivalent to 42 recursions) to lead to optimal
+ generalization on Sudoku-Extreme. More recursions
+Self-attention is particularly good for long-context could be helpful for harder problems (we have not
+lengths when L ≫ D since it only requires a matrix of tested it, given our limited resources); however, in-
+[ D, 3D ] parameters, even though it can account for the creasing either T or n incurs massive slowdowns. We
+whole sequence. However, when focusing on tasks show results at different n and T for HRM and TRM
+where L ≤ D, a linear layer is cheap, requiring only a in Table 3. Note that TRM requires backpropagation
+matrix of [ L, L] parameters. Taking inspiration from through a full recursion process, thus increasing n too
+the MLP-Mixer (Tolstikhin et al., 2021), we can replace much leads to Out Of Memory (OOM) errors. How-
+the self-attention layer with a multilayer perceptron ever, this memory cost is well worth its price in gold.
+(MLP) applied on the sequence length. Using an MLP
+instead of self-attention, we obtain better generaliza- In the following section, we show our main results on
+tion on Sudoku-Extreme (improving from 74.7% to multiple datasets comparing HRM, TRM, and LLMs.
+87.4%; see Table 1). This worked well on Sudoku 9x9
+grids, given the small and fixed context length; how- 5. Results
+ever, we found this architecture to be suboptimal for
+tasks with large context length, such as Maze-Hard Following Wang et al. (2025), we test our approach
+and ARC-AGI (both using 30x30 grids). We show on the following datasets: Sudoku-Extreme (Wang
+results with and without self-attention for all experi- et al., 2025), Maze-Hard (Wang et al., 2025), ARC-AGI-
+ments. 1 (Chollet, 2019) and, ARC-AGI-2 (Chollet et al., 2025).
+ Results are presented in Tables 4 and 5. Hyperparame-
+ ters are detailed in Section 6. Datasets are discussed
+ below.
+
+
+ 7
+ Recursive Reasoning with Tiny Networks
+
+ ity of the MLP on large 30x30 grids). TRM with self-
+Table 3. % Test accuracy on Sudoku-Extreme dataset. HRM
+ attention obtains 85.3% accuracy on Maze-Hard, 44.6%
+versus TRM matched at a similar effective depth per super-
+vision step ( T (n + 1)nlayers )
+ accuracy on ARC-AGI-1, and 7.8% accuracy on ARC-
+ AGI-2 with 7M parameters. This is significantly higher
+ HRM TRM
+ than the 74.5%, 40.3%, and 5.0% obtained by HRM us-
+ n = k, 4 layers n = 2k, 2 layers
+ ing 4 times the number of parameters (27M).
+ k T Depth Acc (%) Depth Acc (%)
+ 1 1 9 46.4 7 63.2
+ 2 2 24 55.0 20 81.9 Table 4. % Test accuracy on Puzzle Benchmarks (Sudoku-
+ 3 3 48 61.6 42 87.4 Extreme and Maze-Hard)
+ 4 4 80 59.5 72 84.2 Method # Params Sudoku Maze
+ 6 3 84 62.3 78 OOM Chain-of-thought, pretrained
+ 3 6 96 58.8 84 85.8 Deepseek R1 671B 0.0 0.0
+ 6 6 168 57.5 156 OOM Claude 3.7 8K ? 0.0 0.0
+ O3-mini-high ? 0.0 0.0
+ Direct prediction, small-sample training
+Sudoku-Extreme consists of extremely difficult Su- Direct pred 27M 0.0 0.0
+doku puzzles (Dillion, 2025; Palm et al., 2018; Park, HRM 27M 55.0 74.5
+2018) (9x9 grid), for which only 1K training samples TRM-Att (Ours) 7M 74.7 85.3
+are used to test small-sample learning. Testing is done TRM-MLP (Ours) 5M/19M 1 87.4 0.0
+on 423K samples. Maze-Hard consists of 30x30 mazes
+generated by the procedure by Lehnert et al. (2024)
+whose shortest path is of length above 110; both the
+training set and test set include 1000 mazes. Table 5. % Test accuracy on ARC-AGI Benchmarks (2 tries)
+ Method # Params ARC-1 ARC-2
+ARC-AGI-1 and ARC-AGI-2 are geometric puzzles in- Chain-of-thought, pretrained
+volving monetary prizes. Each puzzle is designed to Deepseek R1 671B 15.8 1.3
+be easy for a human, yet hard for current AI models. Claude 3.7 16K ? 28.6 0.7
+Each puzzle task consists of 2-3 input–output demon- o3-mini-high ? 34.5 3.0
+stration pairs and 1-2 test inputs to be solved. The final Gemini 2.5 Pro 32K ? 37.0 4.9
+score is computed as the accuracy over all test inputs Grok-4-thinking 1.7T 66.7 16.0
+from two attempts to produce the correct output grid. Bespoke (Grok-4) 1.7T 79.6 29.4
+The maximum grid size is 30x30. ARC-AGI-1 con- Direct prediction, small-sample training
+tains 800 tasks, while ARC-AGI-2 contains 1120 tasks.
+ Direct pred 27M 21.0 0.0
+We also augment our data with the 160 tasks from
+ HRM 27M 40.3 5.0
+the closely related ConceptARC dataset (Moskvichev
+et al., 2023). We provide results on the public evalua- TRM-Att (Ours) 7M 44.6 7.8
+tion set for both ARC-AGI-1 and ARC-AGI-2. TRM-MLP (Ours) 19M 29.6 2.4
+
+While these datasets are small, heavy data-
+augmentation is used in order to improve gen- 6. Conclusion
+eralization. Sudoku-Extreme uses 1000 shuffling
+(done without breaking the Sudoku rules) augmenta- We propose Tiny Recursion Models (TRM), a simple
+tions per data example. Maze-Hard uses 8 dihedral recursive reasoning approach that achieves strong gen-
+transformations per data example. ARC-AGI uses eralization on hard tasks using a single tiny network
+1000 data augmentations (color permutation, dihedral- recursing on its latent reasoning feature and progres-
+group, and translations transformations) per data sively improving its final answer. Contrary to the
+example. The dihedral-group transformations consist Hierarchical Reasoning Model (HRM), TRM requires
+of random 90-degree rotations, horizontal/vertical no fixed-point theorem, no complex biological justi-
+flips, and reflections. fications, and no hierarchy. It significantly reduces
+ the number of parameters by halving the number of
+From the results, we see that TRM without self- layers and replacing the two networks with a single
+attention obtains the best generalization on Sudoku- tiny network. It also simplifies the halting process,
+Extreme (87.4% test accuracy). Meanwhile, TRM with removing the need for the extra forward pass. Over-
+self-attention generalizes better on the other tasks
+(probably due to inductive biases and the overcapac- 1 5M on Sudoku and 19M on Maze
+
+
+
+ 8
+ Recursive Reasoning with Tiny Networks
+
+all, TRM is much simpler than HRM, while achieving gan training for high fidelity natural image synthe-
+better generalization. sis. arXiv preprint arXiv:1809.11096, 2018.
+While our approach led to better generalization on 4 Chollet, F. On the measure of intelligence. arXiv
+benchmarks, every choice made is not guaranteed to preprint arXiv:1911.01547, 2019.
+be optimal on every dataset. For example, we found
+that replacing the self-attention with an MLP worked Chollet, F., Knoop, M., Kamradt, G., Landers, B.,
+extremely well on Sudoku-Extreme (improving test ac- and Pinkard, H. Arc-agi-2: A new challenge
+curacy by 10%), but poorly on other datasets. Different for frontier ai reasoning systems. arXiv preprint
+problem settings may require different architectures arXiv:2505.11831, 2025.
+or number of parameters. Scaling laws are needed
+to parametrize these networks optimally. Although Chowdhery, A., Narang, S., Devlin, J., Bosma, M.,
+we simplified and improved on deep recursion, the Mishra, G., Roberts, A., Barham, P., Chung, H. W.,
+question of why recursion helps so much compared Sutton, C., Gehrmann, S., et al. Palm: Scaling lan-
+to using a larger and deeper network remains to be guage modeling with pathways. Journal of Machine
+explained; we suspect it has to do with overfitting, but Learning Research, 24(240):1–113, 2023.
+we have no theory to back this explaination. Not all
+our ideas made the cut; we briefly discuss some of the Dillion, T. Tdoku: A fast sudoku solver and gener-
+failed ideas that we tried but did not work in Section 6. ator. https://t-dillon.github.io/tdoku/,
+Currently, recursive reasoning models such as HRM 2025.
+and TRM are supervised learning methods rather than
+ Elman, J. L. Finding structure in time. Cognitive science,
+generative models. This means that given an input
+ 14(2):179–211, 1990.
+question, they can only provide a single deterministic
+answer. In many settings, multiple answers exist for a Fedus, W., Zoph, B., and Shazeer, N. Switch transform-
+question. Thus, it would be interesting to extend TRM ers: Scaling to trillion parameter models with simple
+to generative tasks. and efficient sparsity. Journal of Machine Learning Re-
+ search, 23(120):1–39, 2022.
+Acknowledgements
+ Geng, Z. and Kolter, J. Z. Torchdeq: A library for deep
+Thank you Emy Gervais for your invaluable support equilibrium models. arXiv preprint arXiv:2310.18605,
+and extra push. This research was enabled in part 2023.
+by computing resources, software, and technical as-
+sistance provided by Mila and the Digital Research Hendrycks, D. and Gimpel, K. Gaussian error linear
+Alliance of Canada. units (gelus). arXiv preprint arXiv:1606.08415, 2016.
+
+ Jang, Y., Kim, D., and Ahn, S. Hierarchical graph
+References generation with k2-trees. In ICML 2023 Workshop on
+ARC Prize Foundation. The Hidden Drivers of HRM’s Structured Probabilistic Inference Generative Modeling,
+ Performance on ARC-AGI. https://arcprize. 2023.
+ org/blog/hrm-analysis, 2025a. [Online; ac-
+ Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B.,
+ cessed 2025-09-15].
+ Chess, B., Child, R., Gray, S., Radford, A., Wu, J.,
+ARC Prize Foundation. ARC-AGI Leaderboard. and Amodei, D. Scaling laws for neural language
+ https://arcprize.org/leaderboard, 2025b. models. arXiv preprint arXiv:2001.08361, 2020.
+ [Online; accessed 2025-09-24].
+ Kingma, D. P. and Ba, J. Adam: A method for stochas-
+Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium tic optimization. arXiv preprint arXiv:1412.6980,
+ models. Advances in neural information processing 2014.
+ systems, 32, 2019.
+ Krantz, S. G. and Parks, H. R. The implicit function
+Bai, X. and Melas-Kyriazi, L. Fixed point diffusion theorem: history, theory, and applications. Springer
+ models. In Proceedings of the IEEE/CVF Conference Science & Business Media, 2002.
+ on Computer Vision and Pattern Recognition, pp. 9430–
+ 9440, 2024. LeCun, Y. Une procedure d’apprentissage ponr reseau
+ a seuil asymetrique. Proceedings of cognitiva 85, pp.
+Brock, A., Donahue, J., and Simonyan, K. Large scale 599–604, 1985.
+
+ 9
+ Recursive Reasoning with Tiny Networks
+
+Lehnert, L., Sukhbaatar, S., Su, D., Zheng, Q., Mcvay, P., all-mlp architecture for vision. Advances in neural
+ Rabbat, M., and Tian, Y. Beyond a*: Better planning information processing systems, 34:24261–24272, 2021.
+ with transformers via search dynamics bootstrap-
+ ping. arXiv preprint arXiv:2402.14083, 2024. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
+ Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin,
+Lillicrap, T. P. and Santoro, A. Backpropagation I. Attention is all you need. Advances in neural
+ through time and the brain. Current opinion in neuro- information processing systems, 30, 2017.
+ biology, 55:82–89, 2019.
+ Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y.,
+Loshchilov, I. and Hutter, F. Decoupled weight decay Lu, M., Song, S., and Yadkori, Y. A. Hierarchical
+ regularization. arXiv preprint arXiv:1711.05101, 2017. reasoning model. arXiv preprint arXiv:2506.21734,
+ 2025.
+Moskvichev, A., Odouard, V. V., and Mitchell, M. The
+ conceptarc benchmark: Evaluating understanding Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F.,
+ and generalization in the arc domain. arXiv preprint Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought
+ arXiv:2305.07141, 2023. prompting elicits reasoning in large language mod-
+ els. Advances in neural information processing systems,
+Palm, R., Paquet, U., and Winther, O. Recurrent re-
+ 35:24824–24837, 2022.
+ lational networks. Advances in neural information
+ processing systems, 31, 2018. Werbos, P. Beyond regression: New tools for predic-
+ tion and analysis in the behavioral sciences. PhD
+Park, K. Can convolutional neural networks
+ thesis, Committee on Applied Mathematics, Harvard
+ crack sudoku puzzles? https://github.com/
+ University, Cambridge, MA, 1974.
+ Kyubyong/sudoku, 2018.
+ Werbos, P. J. Generalization of backpropagation with
+Prieto, L., Barsbey, M., Mediano, P. A., and Birdal, T.
+ application to a recurrent gas market model. Neural
+ Grokking at the edge of numerical stability. arXiv
+ networks, 1(4):339–356, 1988.
+ preprint arXiv:2501.04697, 2025.
+ Zhang, B. and Sennrich, R. Root mean square layer
+Rumelhart, D. E., Hinton, G. E., and Williams, R. J.
+ normalization. Advances in Neural Information Pro-
+ Learning internal representations by error propaga-
+ cessing Systems, 32, 2019.
+ tion. Technical report, 1985.
+Shazeer, N. Glu variants improve transformer. arXiv
+ preprint arXiv:2002.05202, 2020.
+Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le,
+ Q., Hinton, G., and Dean, J. Outrageously large neu-
+ ral networks: The sparsely-gated mixture-of-experts
+ layer. arXiv preprint arXiv:1701.06538, 2017.
+Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling
+ llm test-time compute optimally can be more effec-
+ tive than scaling model parameters. arXiv preprint
+ arXiv:2408.03314, 2024.
+Song, Y. and Ermon, S. Improved techniques for train-
+ ing score-based generative models. Advances in
+ neural information processing systems, 33:12438–12448,
+ 2020.
+Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu,
+ Y. Roformer: Enhanced transformer with rotary
+ position embedding. Neurocomputing, 568:127063,
+ 2024.
+Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer,
+ L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A.,
+ Keysers, D., Uszkoreit, J., et al. Mlp-mixer: An
+
+ 10
+ Recursive Reasoning with Tiny Networks
+
+Hyper-parameters and setup propagating through the whole n + 1 recursions makes
+ the most sense and works best.
+All models are trained with the AdamW opti-
+mizer(Loshchilov & Hutter, 2017; Kingma & Ba, 2014) We tried removing ACT with the option of stopping
+with β 1 = 0.9, β 2 = 0.95, small learning rate warm- when the solution is reached, but we found that gen-
+up (2K iterations), batch-size 768, hidden-size of 512, eralization dropped significantly. This can probably
+Nsup = 16 max supervision steps, and stable-max loss be attributed to the fact that the model is spending
+(Prieto et al., 2025) for improved stability. TRM uses an too much time on the same data samples rather than
+Exponential Moving Average (EMA) of 0.999. HRM focusing on learning on a wide range of data samples.
+uses n = 2, T = 2 with two 4-layers networks, while We tried weight tying the input embedding and out-
+we use n = 6, T = 3 with one 2-layer network. put head, but this was too constraining and led to a
+For Sudoku-Extreme and Maze-Hard, we train for 60k massive generalization drop.
+epochs with learning rate 1e-4 and weight decay 1.0. We tried using TorchDEQ (Geng & Kolter, 2023) to
+For ARC-AGI, we train for 100K epochs with learning replace the recursion steps by fixed-point iteration as
+rate 1e-4 (with 1e-2 learning rate for the embeddings) done by Deep Equilibrium Models (Bai et al., 2019).
+and weight decay 0.1. The numbers for Deepseek R1, This would provide a better justification for the 1-step
+Claude 3.7 8K, O3-mini-high, Direct prediction, and gradient approximation. However, this slowed down
+HRM from the Table 4 and 5 are taken from Wang et al. training due to the fixed-point iteration and led to
+(2025). Both HRM and TRM add an embedding of worse generalization. This highlights the fact that
+shape [0, 1, D ] on Sudoku-Extreme and Maze-Hard to converging to a fixed-point is not essential.
+the input. For ARC-AGI, each puzzle (containing 2-3
+training examples and 1-2 test examples) at each data-
+augmentation is given a specific embedding of shape
+[0, 1, D ] and, at test-time, the most common answer
+out of the 1000 data augmentations is given as answer.
+Experiments on Sudoku-Extreme were ran with 1 L40S
+with 40Gb of RAM for generally less than 36 hours.
+Experiments on Maze-Hard were ran with 4 L40S with
+40Gb of RAM for less than 24 hours. Experiments on
+ARC-AGI were ran for around 3 days with 4 H100
+with 80Gb of RAM.
+
+Ideas that failed
+In this section, we quickly mention a few ideas that
+did not work to prevent others from making the same
+mistake.
+We tried replacing the SwiGLU MLPs by SwiGLU
+Mixture-of-Experts (MoEs) (Shazeer et al., 2017; Fedus
+et al., 2022), but we found generalization to decrease
+massively. MoEs clearly add too much unnecessary
+capacity, just like increasing the number of layers does.
+Instead of back-propagating through the whole n + 1
+recursions, we tried a compromise between HRM 1-
+step gradient approximation, which back-propagates
+through the last 2 recursions. We did so by decou-
+pling n from the number of last recursions k that we
+back-propagate through. For example, while n = 6
+requires 7 steps with gradients in TRM, we can use
+gradients for only the k = 4 last steps. However, we
+found that this did not help generalization in any way,
+and it made the approach more complicated. Back-
+
+
+ 11
+ Recursive Reasoning with Tiny Networks
+
+Algorithms with different number of latent Example on Sudoku-Extreme
+features
+ 8 3 1
+ 9 6 8 7
+def latent recursion(x, z, n=6): 3 5
+ for i in range(n+1): # latent recursion
+ z = net(x, z) 6 8
+ return z 6 2
+def deep recursion(x, z, n=6, T=3):
+ 7 4 3
+ # recursing T−1 times to improve z (no gradients needed) 9 4
+ with torch.no grad():
+ for j in range(T−1):
+ 2 4 1
+ z = latent recursion(x, z, n) 6 2 5 7
+ # recursing once to improve z
+ z = latent recursion(x, z, n) Input x
+ return z.detach(), output head(y), Q head(y)
+ 5 2 6 7 9 4 8 3 1
+# Deep Supervision
+for x input, y true in train dataloader: 3 9 1 2 6 8 4 7 5
+ z = z init 4 8 7 3 1 5 2 9 6
+ for step in range(N supervision):
+ x = input embedding(x input)
+ 1 6 8 5 3 2 7 4 9
+ z, y hat, q hat = deep recursion(x, z) 9 3 5 4 7 6 1 8 2
+ loss = softmax cross entropy(y hat, y true) 7 4 2 9 8 1 5 6 3
+ loss += binary cross entropy(q hat, (y hat == y true))
+ z = z.detach() 8 7 3 1 5 9 6 2 4
+ loss.backward() 2 5 9 6 4 7 3 1 8
+ opt.step()
+ opt.zero grad() 6 1 4 8 5 3 9 5 7
+ if q[0] > 0: # early−stopping
+ break Output y
+ 5 2 6 7 9 4 8 3 1
+Figure 4. Pseudocode of TRM using a single-z with deep 3 9 1 2 6 8 4 7 5
+supervision training in PyTorch. 4 8 7 3 1 5 2 9 6
+ 1 6 8 5 3 2 7 4 9
+ 9 3 5 4 7 6 1 8 2
+ 7 4 2 9 8 1 5 6 3
+ 8 7 3 1 5 9 6 2 4
+def latent recursion(x, y, z, n=6):
+ for i in range(n): # latent recursion 2 5 9 6 4 7 3 1 8
+ z[i] = net(x, y, z[0], ... , z[n−1]) 6 1 4 8 5 3 9 5 7
+ y = net(y, z[0], ... , z[n−1]) # refine output answer
+ return y, z Tokenized z H (denoted y in TRM)
+def deep recursion(x, y, z, n=6, T=3): 5 5 4 9 4 6 3
+ # recursing T−1 times to improve y and z (no gradients needed)
+ with torch.no grad(): 4 3 1 4 6 5
+ for j in range(T−1): 4 8 4 3 6 6 4
+ y, z = latent recursion(x, y, z, n)
+ # recursing once to improve y and z 9 6 5 3 5 4
+ y, z = latent recursion(x, y, z, n) 3 5 4 3 5 4 4
+ return (y.detach(), z.detach()), output head(y), Q head(y)
+ 6 3 3 3 5 8 8
+# Deep Supervision 3 3 3 6 5 6 6 4
+for x input, y true in train dataloader: 7 5 6 3 3 6 6
+ y, z = y init, z init
+ for step in range(N supervision): 4 3 4 8 3 6 6 4
+ x = input embedding(x input)
+ (y, z), y hat, q hat = deep recursion(x, y, z) Tokenized z L (denoted z in TRM)
+ loss = softmax cross entropy(y hat, y true)
+ loss += binary cross entropy(q hat, (y hat == y true))
+ loss.backward()
+ opt.step()
+ Figure 6. This Sudoku-Extreme example shows an input, ex-
+ opt.zero grad() pected output, and the tokenized z H and z L (after reversing
+ if q[0] > 0: # early−stopping the embedding and using argmax) for a pretrained model.
+ break
+ This highlights the fact that z H corresponds to the predicted
+ response, while z L is a latent feature that cannot be decoded
+Figure 5. Pseudocode of TRM using multi-scale z with deep to a sensible output unless transformed into z H by f H .
+supervision training in PyTorch.
+
+
+
+
+ 12
+ \ No newline at end of file