diff options
Diffstat (limited to 'papers/txt/trm2025_tiny_recursive.txt')
| -rw-r--r-- | papers/txt/trm2025_tiny_recursive.txt | 796 |
1 files changed, 796 insertions, 0 deletions
diff --git a/papers/txt/trm2025_tiny_recursive.txt b/papers/txt/trm2025_tiny_recursive.txt new file mode 100644 index 0000000..55cf994 --- /dev/null +++ b/papers/txt/trm2025_tiny_recursive.txt @@ -0,0 +1,796 @@ + Less is More: Recursive Reasoning with Tiny Networks + + + + Alexia Jolicoeur-Martineau + Samsung SAIL Montréal + alexia.j@samsung.com + + + Abstract +arXiv:2510.04871v1 [cs.LG] 6 Oct 2025 + + + + + Hierarchical Reasoning Model (HRM) is a + novel approach using two small neural net- + works recursing at different frequencies. This + biologically inspired method beats Large Lan- + guage models (LLMs) on hard puzzle tasks + such as Sudoku, Maze, and ARC-AGI while + trained with small models (27M parameters) + on small data (∼ 1000 examples). HRM holds + great promise for solving hard problems with + small networks, but it is not yet well un- + derstood and may be suboptimal. We pro- + pose Tiny Recursive Model (TRM), a much + simpler recursive reasoning approach that + achieves significantly higher generalization + than HRM, while using a single tiny network + with only 2 layers. With only 7M parameters, + TRM obtains 45% test-accuracy on ARC-AGI- + 1 and 8% on ARC-AGI-2, higher than most + LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 + Pro) with less than 0.01% of the parameters. + + + + 1. Introduction + While powerful, Large Language models (LLMs) can + struggle on hard question-answer problems. Given + that they generate their answer auto-regressively, there + is a high risk of error since a single incorrect token can + render an answer invalid. To improve their reliabil- Figure 1. Tiny Recursion Model (TRM) recursively improves + ity, LLMs rely on Chain-of-thoughts (CoT) (Wei et al., its predicted answer y with a tiny network. It starts with the + 2022) and Test-Time Compute (TTC) (Snell et al., 2024). embedded input question x and initial embedded answer + y, and latent z. For up to Nsup = 16 improvements steps, + CoTs seek to emulate human reasoning by having the + it tries to improve its answer y. It does so by i) recursively + LLM to sample step-by-step reasoning traces prior to + updating n times its latent z given the question x, current + giving their answer. Doing so can improve accuracy, answer y, and current latent z (recursive reasoning), and + but CoT is expensive, requires high-quality reasoning then ii) updating its answer y given the current answer y + data (which may not be available), and can be brittle and current latent z. This recursive process allows the model + since the generated reasoning may be wrong. To fur- to progressively improve its answer (potentially address- + ther improve reliability, test-time compute can be used ing any errors from its previous answer) in an extremely + by reporting the most common answer out of K or the parameter-efficient manner while minimizing overfitting. + highest-reward answer (Snell et al., 2024). + + 1 + Recursive Reasoning with Tiny Networks + +However, this may not be enough. LLMs with CoT ers that achieves significantly higher generalization +and TTC are not enough to beat every problem. While than HRM on a variety of problems. In doing so, we +LLMs have made significant progress on ARC-AGI improve the state-of-the-art test accuracy on Sudoku- +(Chollet, 2019) since 2019, human-level accuracy still Extreme from 55% to 87%, Maze-Hard from 75% to +has not been reached (6 years later, as of writing of 85%, ARC-AGI-1 from 40% to 45%, and ARC-AGI-2 +this paper). Furthermore, LLMs struggle on the newer from 5% to 8%. +ARC-AGI-2 (e.g., Gemini 2.5 Pro only obtains 4.9% test +accuracy with a high amount of TTC) (Chollet et al., 2. Background +2025; ARC Prize Foundation, 2025b). + HRM is described in Algorithm 2. We discuss the +An alternative direction has recently been proposed by + details of the algorithm further below. +Wang et al. (2025). They propose a new way forward +through their novel Hierarchical Reasoning Model +(HRM), which obtains high accuracy on puzzle tasks 2.1. Structure and goal +where LLMs struggle to make a dent (e.g., Sudoku The focus of HRM is supervised learning. Given an +solving, Maze pathfinding, and ARC-AGI). HRM is a input, produce an output. Both input and output are +supervised learning model with two main novelties: 1) assumed to have shape [ B, L] (when the shape differs, +recursive hierarchical reasoning, and 2) deep supervision. padding tokens can be added), where B is the batch- +Recursive hierarchical reasoning consists of recurs- size and L is the context-length. +ing multiple times through two small networks ( f L at HRM contains four learnable components: the in- +high frequency and f H at low frequency) to predict the put embedding f I (·; θ I ), low-level recurrent network +answer. Each network generates a different latent fea- f L (·; θ L ), high-level recurrent network f H (·; θ H ), and +ture: f L outputs z H and f H outputs z L . Both features the output head fO (·; θO ). Once the input is embedded, +(z L , z H ) are used as input to the two networks. The the shape becomes [ B, L, D ] where D is the embedding +authors provide some biological arguments in favor of size. Each network is a 4-layer Transformers architec- +recursing at different hierarchies based on the different ture (Vaswani et al., 2017), with RMSNorm (Zhang +temporal frequencies at which the brains operate and & Sennrich, 2019), no bias (Chowdhery et al., 2023), +hierarchical processing of sensory inputs. rotary embeddings (Su et al., 2024), and SwiGLU acti- +Deep supervision consists of improving the answer vation function (Hendrycks & Gimpel, 2016; Shazeer, +through multiple supervision steps while carrying the 2020). +two latent features as initialization for the improve- +ment steps (after detaching them from the computa- 2.2. Recursion at two different frequencies +tional graph so that their gradients do not propagate). Given the hyperparameters used by Wang et al. (2025) +This provide residual connections, which emulates (n = 2 f L steps, 1 f H steps; done T = 2 times), a +very deep neural networks that are too memory ex- forward pass of HRM is done as follows: +pensive to apply in one forward pass. +An independent analysis on the ARC-AGI benchmark x ← f I ( x̃ ) +showed that deep supervision seems to be the primary z L ← f L (z L + z H + x ) # without gradients +driver of the performance gains (ARC Prize Founda- z L ← f L (z L + z H + x ) # without gradients +tion, 2025a). Using deep supervision doubled accuracy + z H ← f H (z L + z H ) # without gradients +over single-step supervision (going from 19% to 39% +accuracy), while recursive hierarchical reasoning only z L ← f L (z L + z H + x ) # without gradients +slightly improved accuracy over a regular model with z L ← z L .detach() +a single forward pass (going from 35.7% to 39.0% ac- z H ← z H .detach() +curacy). This suggests that reasoning across different + z L ← f L (z L + z H + x ) # with gradients +supervision steps is worth it, but the recursion done +in each supervision step is not particularly important. z H ← f H (z L + z H ) # with gradients + ŷ ← argmax( fO (z H )) +In this work, we show that the benefit from recursive +reasoning can be massively improved, making it much +more than incremental. We propose Tiny Recursive where ŷ is the predicted output answer, z L and z H are +Model (TRM), an improved and simplified approach either initialized embeddings or the embeddings of +using a much smaller tiny network with only 2 lay- the previous deep supervision step (after detaching + them from the computational graph). As can be seen, + + 2 + Recursive Reasoning with Tiny Networks + +def hrm(z, x, n=2, T=2): # hierarchical reasoning 2.4. Deep supervision + zH, zL = z + with torch.no grad(): To improve effective depth, deep supervision is used. + for i in range(nT − 2): + zL = L net(zL, zH, x) This consists of reusing the previous latent features + if (i + 1) % T == 0: + zH = H net(zH, zL) + (z H and z L ) as initialization for the next forward pass. + # 1−step grad This allows the model to reason over many iterations + zL = L net(zL, zH, x) + zH = H net(zH, zL) and improve its latent features (z L and z H ) until it + return (zH, zL), output head(zH), Q head(zH) (hopefully) converges to the correct solution. At most +def ACT halt(q, y hat, y true): Nsup = 16 supervision steps are used. + target halt = (y hat == y true) + loss = 0.5∗binary cross entropy(q[0], target halt) + return loss 2.5. Adaptive computational time (ACT) +def ACT continue(q, last step): + if last step: With deep supervision, each mini-batch of data sam- + target continue = sigmoid(q[0]) ples must be used for Nsup = 16 supervision steps + else: + target continue = sigmoid(max(q[0], q[1]))) before moving to the next mini-batch. This is expen- + loss = 0.5∗binary cross entropy(q[1], target continue) + return loss + sive, and there is a balance to be reached between + optimizing a few data examples for many supervision +# Deep Supervision +for x input, y true in train dataloader: steps versus optimizing many data examples with less + z = z init supervision steps. To reach a better balance, a halting + for step in range(N sup): # deep supervision + x = input embedding(x input) mechanism is incorporated to determine whether the + z, y pred, q = hrm(z, x) + loss = softmax cross entropy(y pred, y true) + model should terminate early. It is learned through + # Adaptive computational time (ACT) using Q−learning a Q-learning objective that requires passing the z H + loss += ACT halt(q, y pred, y true) + , , q next = hrm(z, x) # extra forward pass through an additional head and running an additional + loss += ACT continue(q next, step == N sup − 1) forward pass (to determine if halting now rather than + z = z.detach() + loss.backward() later would have been preferable). They call this + opt.step() + opt.zero grad() + method Adaptive computational time (ACT). It is only + if q[0] > q[1]: # early−stopping used during training, while the full Nsup = 16 super- + break + vision steps are done at test time to maximize down- + stream performance. ACT greatly diminishes the time +Figure 2. Pseudocode of Hierarchical Reasoning Models spent per example (on average spending less than 2 +(HRMs). steps on the Sudoku-Extreme dataset rather than the + full Nsup = 16 steps), allowing more coverage of the +a forward pass of HRM consists of applying 6 function dataset given a fixed number of training iterations. +evaluations, where the first 4 function evaluations are +detached from the computational graph and are not 2.6. Deep supervision and 1-step gradient +back-propagated through. The authors uses n = 2 approximations replaces BPTT +with T = 2 in all experiments, but HRM can be gener- +alized by allowing for an arbitrary number of L steps Deep supervision and the 1-step gradient approxima- +(n) and recursions (T) as shown in Algorithm 2. tion provide a more biologically plausible and less + computationally-expansive alternative to Backpropa- +2.3. Fixed-point recursion with 1-step gradient gation Through Time (BPTT) (Werbos, 1974; Rumel- + approximation hart et al., 1985; LeCun, 1985) for solving the temporal + credit assignment (TCA) (Rumelhart et al., 1985; Wer- +Assuming that (z L , z H ) reaches a fixed-point (z∗L , z∗H ) bos, 1988; Elman, 1990) problem (Lillicrap & Santoro, +through recursing from both f L and f H , 2019). The implication is that HRM can learn what + would normally require an extremely large network + z∗L ≈ f L (z∗L + z H + x ) without having to back-propagate through its entire + z∗H ≈ f H (z L + z∗H ) , depth. Given the hyperparameters used by Jang et al. + (2023) in all their experiments, HRM effectively rea- +the Implicit Function Theorem (Krantz & Parks, 2002) + sons over nlayers (n + 1) TNsup = 4 ∗ (2 + 1) ∗ 2 ∗ 16 = +with the 1-step gradient approximation (Bai et al., + 384 layers of effective depth. +2019) is used to approximate the gradient by back- +propagating only the last f L and f H steps. This theo- +rem is used to justify only tracking the gradients of +the last two steps (out of 6), which greatly reduces +memory demands. + + 3 + Recursive Reasoning with Tiny Networks + +2.7. Summary of HRM different from the much smaller n = 2 and T = 2 used + in every experiment of their paper, we observe the +HRM leverages recursion from two networks at dif- + following: +ferent frequencies (high frequency versus low fre- +quency) and deep supervision to learn to improve +its answer over multiple supervision steps (with ACT 1. the residual for z H is clearly well above 0 at every +to reduce time spent per data example). This enables step +the model to imitate extremely large depth without +requiring backpropagation through all layers. This +approach obtains significantly higher performance on 2. the residual for z L only becomes closer to 0 after +hard question-answer tasks that regular supervised many cycles, but it remains significantly above 0 +models struggle with. However, this method is quite +complicated, relying a bit too heavily on uncertain +biological arguments and fixed-point theorems that 3. z L is very far from converged after one f L evalu- +are not guaranteed to be applicable. In the next sec- ation at T cycles, which is when the fixed-point +tion, we discuss those issues and potential targets for is assumed to be reached and the 1-step gradient +improvements in HRM. approximation is used + +3. Target for improvements in Hierarchical Thus, while the application of the IFT theorem and + Reasoning Models 1-step gradient approximation to HRM has some basis + since the residuals do generally reduce over time, a +In this section, we identify key targets for improve- + fixed point is unlikely to be reached when the theorem +ments in HRM, which will be addressed by our pro- + is actually applied. +posed method, Tiny Recursion Models (TRMs). + In the next section, we show that we can bypass the +3.1. Implicit Function Theorem (IFT) with 1-step need for the IFT theorem and 1-step gradient approxi- + gradient approximation mation, thus bypassing the issue entirely. + +HRM only back-propagates through the last 2 of the 6 3.2. Twice the forward passes with Adaptive +recursions. The authors justify this by leveraging the computational time (ACT) +Implicit Function Theorem (IFT) and one-step approx- +imation, which states that when a recurrent function HRM uses Adaptive computational time (ACT) during +converges to a fixed point, backpropagation can be training to optimize the time spent of each data sam- +applied in a single step at that equilibrium point. ple. Without ACT, Nsup = 16 supervision steps would + be spent on the same data sample, which is highly in- +There are concerns about applying this theorem to + efficient. They implement ACT through an additional +HRM. Most importantly, there is no guarantee that + Q-learning objective, which decides when to halt and +a fixed-point is reached. Deep equilibrium models + move to a new data sample rather than keep iterating +normally do fixed-point iteration to solve for the fixed + on the same data. This allows much more efficient +pointz∗ = f (z∗ ) (Bai et al., 2019). However, in the case + use of time especially since the average number of su- +of HRM, it is not iterating to the fixed-point but simply + pervision steps during training is quite low with ACT +doing forward passes of f L and f H . To make matters + (less than 2 steps on the Sudoku-Extreme dataset as +worse, HRM is only doing 4 recursions before stopping + per their reported numbers). +to apply the one-step approximation. After its first +loop of two f L and 1 f H evaluations, it only apply a However, ACT comes at a cost. This cost is not directly +single f L evaluation before assuming that a fixed-point shown in the HRM’s paper, but it is shown in their of- +is reached for both z L and z H (z∗L = f L (z∗L + z H + x ) ficial code. The Q-learning objective relies on a halting +and z∗H = f H (z∗L + z∗H )). Then, the one-step gradient loss and a continue loss. The continue loss requires an +approximation is applied to both latent variables in extra forward pass through HRM (with all 6 function +succession. evaluations). This means that while ACT optimizes + time more efficiently per sample, it requires 2 forward +The authors justify that a fixed-point is reached by + passes per optimization step. The exact formulation is +depicting an example with n = 7 and T = 7 where + shown in Algorithm 2. +the forward residuals is reduced over time (Figure 3 +in Wang et al. (2025)). Even in this setting, which is In the next section, we show that we can bypass the + need for two forward passes in ACT. + + 4 + Recursive Reasoning with Tiny Networks + +3.3. Hierarchical interpretation based on complex of 2 passes). Our approach is described in Algorithm 3 + biological arguments and illustrated in Figure 1. We also provide an ablation + in Table 1 on the Sudoku-Extreme dataset (a dataset +The HRM’s authors justify the two latent variables + of difficult Sudokus with only 1K training examples, +and two networks operating at different hierarchies + but 423K test examples). Below, we explain the key +based on biological arguments, which are very far + components of TRMs. +from artificial neural networks. They even try to match +HRM to actual brain experiments on mice. While in- +teresting, this sort of explanation makes it incredibly Table 1. Ablation of TRM on Sudoku-Extreme comparing % +hard to parse out why HRM is designed the way it Test accuracy, effective depth per supervision step ( T (n + +is. Given the lack of ablation table in their paper, the 1)nlayers ), number of Forward Passes (NFP) per optimization +over-reliance on biological arguments and fixed-point step, and number of parameters +theorems (that are not perfectly applicable), it is hard Method Acc (%) Depth NFP # Params +to determine what parts of HRM is helping what and HRM 55.0 24 2 27M +why. Furthermore, it is not clear why they use two TRM (T = 3, n = 6) 87.4 42 1 5M +latent features rather than other combinations of fea- w/ ACT 86.1 42 2 5M +tures. w/ separate f H , f L 82.4 42 1 10M + no EMA 79.9 42 1 5M +In the next section, we show that the recursive process w/ 4-layers, n = 3 79.5 48 1 10M +can be greatly simplified and understood in a much w/ self-attention 74.7 42 1 7M +simpler manner that does not require any biological w/ T = 2, n = 2 73.7 12 1 5M +argument, fixed-point theorem, hierarchical interpre- w/ 1-step gradient 56.5 42 1 5M +tation, nor using two networks. It also explains why 2 +is the optimal number of features (z L and z H ). + 4.1. No fixed-point theorem required + +def latent recursion(x, y, z, n=6): HRM assumes that the recursions converge to a fixed- + for i in range(n): # latent reasoning point for both z L and z H in order to leverage the 1-step + z = net(x, y, z) + y = net(y, z) # refine output answer gradient approximation (Bai et al., 2019). This allows + return y, z + the authors to justify only back-propagating through +def deep recursion(x, y, z, n=6, T=3): the last two function evaluations (1 f L and 1 f H ). To + # recursing T−1 times to improve y and z (no gradients needed) + with torch.no grad(): bypass this theoretical requirement, we define a full + for j in range(T−1): recursion process as containing n evaluations of f L + y, z = latent recursion(x, y, z, n) + # recursing once to improve y and z and 1 evaluation of f H : + y, z = latent recursion(x, y, z, n) + return (y.detach(), z.detach()), output head(y), Q head(y) z L ← f L (z L + z H + x ) +# Deep Supervision ... +for x input, y true in train dataloader: + y, z = y init, z init z L ← f L (z L + z H + x ) + for step in range(N supervision): + x = input embedding(x input) z H ← f H (z L + z H ) . + (y, z), y hat, q hat = deep recursion(x, y, z) + loss = softmax cross entropy(y hat, y true) + loss += binary cross entropy(q hat, (y hat == y true)) Then, we simply back-propagate through the full re- + loss.backward() cursion process. + opt.step() + opt.zero grad() + if q hat > 0: # early−stopping Through deep supervision, the models learns to take + break any (z L , z H ) and improve it through a full recursion + process, hopefully making z H closer to the solution. +Figure 3. Pseudocode of Tiny Recursion Models (TRMs). This means that by the design of the deep supervi- + sion goal, running a few full recursion processes (even + without gradients) is expected to bring us closer to the +4. Tiny Recursion Models solution. We propose to run T − 1 recursion processes + without gradient to improve (z L , z H ) before running +In this section, we present Tiny Recursion Models + one recursion process with backpropagation. +(TRMs). Contrary to HRM, TRM requires no com- +plex mathematical theorem, hierarchy, nor biological Thus, instead of using the 1-step gradient approxi- +arguments. It generalizes better while requiring only mation, we apply a full recursion process containing +a single tiny network (instead of two medium-size net- n evaluations of f L and 1 evaluation of f H . This re- +works) and a single forward pass for the ACT (instead moves entirely the need to assume that a fixed-point + + 5 + Recursive Reasoning with Tiny Networks + +is reached and the use of the IFT theorem with 1-step While this is intuitive, we wanted to verify whether +gradient approximation. Yet, we can still leverage using more or less features could be helpful. Results +multiple backpropagation-free recursion processes to are shown in Table 2. +improve (z L , z H ). With this approach, we obtain a + More features (> 2): We tested splitting z into dif- +massive boost in generalization on Sudoku-Extreme + ferent features by treating each of the n recursions as +(improving TRM from 56.5% to 87.4%; see Table 1). + producing a different zi for i = 1, ..., n. Then, each + zi is carried across supervision steps. The approach +4.2. Simpler reinterpretation of z H and z L is described in Algorithm 5. In doing so, we found +HRM is interpreted as doing hierarchical reasoning performance to drop. This is expected because, as dis- +over two latent features of different hierarchies due to cussed, there is no apparent need for splitting z into +arguments from biology. However, one might wonder multiple parts. It does not have to be hierarchical. +why use two latent features instead of 1, 3, or more? Single feature: Similarly, we tested the idea of taking +And do we really need to justify these so-called ”hier- a single feature by only carrying z H across supervi- +archical” features based on biology to make sense of sion steps. The approach is described in Algorithm 4. +them? We propose a simple non-biological explana- In doing so, we found performance to drop. This is +tion, which is more natural, and directly answers the expected because, as discussed, it forces the model to +question of why there are 2 features. store the solution y within z. +The fact of the matter is: z H is simply the current Thus, we explored using more or less latent variables +(embedded) solution. The embedding is reversed by on Sudoku-Extreme, but found that having only y and +applying the output head and rounding to the nearest z lead to better test accuracy in addition to being the +token using the argmax operation. On the other hand, simplest more natural approach. +z L is a latent feature that does not directly correspond +to a solution, but it can be transformed into a solution +by applying z H ← f H ( x, z L , z H ). We show an example +on Sudoku-Extreme in Figure 6 to highlight the fact Table 2. TRM on Sudoku-Extreme comparing % Test accu- +that z H does correspond to the solution, but z L does racy when using more or less latent features +not. Method # of features Acc (%) + TRM y, z (Ours) 2 87.4 +Once this is understood, hierarchy is not needed; there TRM multi-scale z n+1 = 7 77.6 +is simply an input x, a proposed solution y (previously TRM single z 1 71.9 +called z H ), and a latent reasoning feature z (previously +called z L ). Given the input question x, current solution +y, and current latent reasoning z, the model recursively +improves its latent z. Then, given the current latent z 4.3. Single network +and the previous solution y, the model proposes a new HRM uses two networks, one applied frequently as a +solution y (or stay at the current solution if its already low-level module f H and one applied rarely as an high- +good). level module ( f H ). This requires twice the number of +Although this has no direct influence on the algorithm, parameters compared to regular supervised learning +this re-interpretation is much simpler and natural. It with a single network. +answers the question about why two features: remem- As mentioned previously, while f L iterates on the la- +bering in context the question x, previous reasoning tent reasoning feature z (z L in HRM), the goal of f H +z, and previous answer y helps the model iterate on is to update the solution y (z H in HRM) given the la- +the next reasoning z and then the next answer y. If tent reasoning and current solution. Importantly, since +we were not passing the previous reasoning z, the z ← f L ( x + y + z) contains x but y ← f H (y + z) does +model would forget how it got to the previous solu- not contains x, the task to achieve (iterating on z versus +tion y (since z acts similarly as a chain-of-thought). If using z to update y) is directly specified by the inclu- +we were not passing the previous solution y, then the sion or lack of x in the inputs. Thus, we considered +model would forget what solution it had and would the possibility that both networks could be replaced +be forced to store the solution y within z instead of by a single network doing both tasks. In doing so, we +using it for latent reasoning. Thus, we need both y and obtain better generalization on Sudoku-Extreme (im- +z separately, and there is no apparent reason why one proving TRM from 82.4% to 87.4%; see Table 1) while +would need to split z into multiple features. reducing the number of parameters by half. It turns + out that a single network is enough. + + 6 + Recursive Reasoning with Tiny Networks + +4.4. Less is more 4.6. No additional forward pass needed with ACT +We attempted to increase capacity by increasing the As previously mentioned, the implementation of ACT +number of layers in order to scale the model. Sur- in HRM through Q-learning requires two forward +prisingly, we found that adding layers decreased gen- passes, which slows down training. We propose a +eralization due to overfitting. In doing the oppo- simple solution, which is to get rid of the continue loss +site, decreasing the number of layers while scaling (from the Q-learning) and only learn a halting proba- +the number of recursions (n) proportionally (to keep bility through a Binary-Cross-Entropy loss of having +the amount of compute and emulated depth approxi- reached the correct solution. By removing the continue +mately the same), we found that using 2 layers (instead loss, we remove the need for the expensive second for- +of 4 layers) maximized generalization. In doing so, we ward pass, while still being able to determine when to +obtain better generalization on Sudoku-Extreme (im- halt with relatively good accuracy. We found no sig- +proving TRM from 79.5% to 87.4%; see Table 1) while nificant difference in generalization from this change +reducing the number of parameters by half (again). (going from 86.1% to 87.4%; see Table 1). +It is quite surprising that smaller networks are bet- +ter, but 2 layers seems to be the optimal choice. Bai 4.7. Exponential Moving Average (EMA) +& Melas-Kyriazi (2024) also observed optimal perfor- On small data (such as Sudoku-Extreme and Maze- +mance for 2-layers in the context of deep equilibrium Hard), HRM tends to overfit quickly and then diverge. +diffusion models; however, they had similar perfor- To reduce this problem and improves stability, we +mance to the bigger networks, while we instead ob- integrate Exponential Moving Average (EMA) of the +serve better performance with 2 layers. This may ap- weights, a common technique in GANs and diffusion +pear unusual, as with modern neural networks, gener- models to improve stability (Brock et al., 2018; Song & +alization tends to directly correlate with model sizes. Ermon, 2020). We find that it prevents sharp collapse +However, when data is too scarce and model size is and leads to higher generalization (going from 79.9% +large, there can be an overfitting penalty (Kaplan et al., to 87.4%; see Table 1). +2020). This is likely an indication that there is too little +data. Thus, using tiny networks with deep recursion 4.8. Optimal the number of recursions +and deep supervision appears to allow us to bypass a +lot of the overfitting. We experimented with different number of recursions + by varying T and n and found that T = 3, n = 3 +4.5. attention-free architecture for tasks with small (equivalent to 48 recursions) in HRM and T = 3, n = 6 + fixed context length in TRM (equivalent to 42 recursions) to lead to optimal + generalization on Sudoku-Extreme. More recursions +Self-attention is particularly good for long-context could be helpful for harder problems (we have not +lengths when L ≫ D since it only requires a matrix of tested it, given our limited resources); however, in- +[ D, 3D ] parameters, even though it can account for the creasing either T or n incurs massive slowdowns. We +whole sequence. However, when focusing on tasks show results at different n and T for HRM and TRM +where L ≤ D, a linear layer is cheap, requiring only a in Table 3. Note that TRM requires backpropagation +matrix of [ L, L] parameters. Taking inspiration from through a full recursion process, thus increasing n too +the MLP-Mixer (Tolstikhin et al., 2021), we can replace much leads to Out Of Memory (OOM) errors. How- +the self-attention layer with a multilayer perceptron ever, this memory cost is well worth its price in gold. +(MLP) applied on the sequence length. Using an MLP +instead of self-attention, we obtain better generaliza- In the following section, we show our main results on +tion on Sudoku-Extreme (improving from 74.7% to multiple datasets comparing HRM, TRM, and LLMs. +87.4%; see Table 1). This worked well on Sudoku 9x9 +grids, given the small and fixed context length; how- 5. Results +ever, we found this architecture to be suboptimal for +tasks with large context length, such as Maze-Hard Following Wang et al. (2025), we test our approach +and ARC-AGI (both using 30x30 grids). We show on the following datasets: Sudoku-Extreme (Wang +results with and without self-attention for all experi- et al., 2025), Maze-Hard (Wang et al., 2025), ARC-AGI- +ments. 1 (Chollet, 2019) and, ARC-AGI-2 (Chollet et al., 2025). + Results are presented in Tables 4 and 5. Hyperparame- + ters are detailed in Section 6. Datasets are discussed + below. + + + 7 + Recursive Reasoning with Tiny Networks + + ity of the MLP on large 30x30 grids). TRM with self- +Table 3. % Test accuracy on Sudoku-Extreme dataset. HRM + attention obtains 85.3% accuracy on Maze-Hard, 44.6% +versus TRM matched at a similar effective depth per super- +vision step ( T (n + 1)nlayers ) + accuracy on ARC-AGI-1, and 7.8% accuracy on ARC- + AGI-2 with 7M parameters. This is significantly higher + HRM TRM + than the 74.5%, 40.3%, and 5.0% obtained by HRM us- + n = k, 4 layers n = 2k, 2 layers + ing 4 times the number of parameters (27M). + k T Depth Acc (%) Depth Acc (%) + 1 1 9 46.4 7 63.2 + 2 2 24 55.0 20 81.9 Table 4. % Test accuracy on Puzzle Benchmarks (Sudoku- + 3 3 48 61.6 42 87.4 Extreme and Maze-Hard) + 4 4 80 59.5 72 84.2 Method # Params Sudoku Maze + 6 3 84 62.3 78 OOM Chain-of-thought, pretrained + 3 6 96 58.8 84 85.8 Deepseek R1 671B 0.0 0.0 + 6 6 168 57.5 156 OOM Claude 3.7 8K ? 0.0 0.0 + O3-mini-high ? 0.0 0.0 + Direct prediction, small-sample training +Sudoku-Extreme consists of extremely difficult Su- Direct pred 27M 0.0 0.0 +doku puzzles (Dillion, 2025; Palm et al., 2018; Park, HRM 27M 55.0 74.5 +2018) (9x9 grid), for which only 1K training samples TRM-Att (Ours) 7M 74.7 85.3 +are used to test small-sample learning. Testing is done TRM-MLP (Ours) 5M/19M 1 87.4 0.0 +on 423K samples. Maze-Hard consists of 30x30 mazes +generated by the procedure by Lehnert et al. (2024) +whose shortest path is of length above 110; both the +training set and test set include 1000 mazes. Table 5. % Test accuracy on ARC-AGI Benchmarks (2 tries) + Method # Params ARC-1 ARC-2 +ARC-AGI-1 and ARC-AGI-2 are geometric puzzles in- Chain-of-thought, pretrained +volving monetary prizes. Each puzzle is designed to Deepseek R1 671B 15.8 1.3 +be easy for a human, yet hard for current AI models. Claude 3.7 16K ? 28.6 0.7 +Each puzzle task consists of 2-3 input–output demon- o3-mini-high ? 34.5 3.0 +stration pairs and 1-2 test inputs to be solved. The final Gemini 2.5 Pro 32K ? 37.0 4.9 +score is computed as the accuracy over all test inputs Grok-4-thinking 1.7T 66.7 16.0 +from two attempts to produce the correct output grid. Bespoke (Grok-4) 1.7T 79.6 29.4 +The maximum grid size is 30x30. ARC-AGI-1 con- Direct prediction, small-sample training +tains 800 tasks, while ARC-AGI-2 contains 1120 tasks. + Direct pred 27M 21.0 0.0 +We also augment our data with the 160 tasks from + HRM 27M 40.3 5.0 +the closely related ConceptARC dataset (Moskvichev +et al., 2023). We provide results on the public evalua- TRM-Att (Ours) 7M 44.6 7.8 +tion set for both ARC-AGI-1 and ARC-AGI-2. TRM-MLP (Ours) 19M 29.6 2.4 + +While these datasets are small, heavy data- +augmentation is used in order to improve gen- 6. Conclusion +eralization. Sudoku-Extreme uses 1000 shuffling +(done without breaking the Sudoku rules) augmenta- We propose Tiny Recursion Models (TRM), a simple +tions per data example. Maze-Hard uses 8 dihedral recursive reasoning approach that achieves strong gen- +transformations per data example. ARC-AGI uses eralization on hard tasks using a single tiny network +1000 data augmentations (color permutation, dihedral- recursing on its latent reasoning feature and progres- +group, and translations transformations) per data sively improving its final answer. Contrary to the +example. The dihedral-group transformations consist Hierarchical Reasoning Model (HRM), TRM requires +of random 90-degree rotations, horizontal/vertical no fixed-point theorem, no complex biological justi- +flips, and reflections. fications, and no hierarchy. It significantly reduces + the number of parameters by halving the number of +From the results, we see that TRM without self- layers and replacing the two networks with a single +attention obtains the best generalization on Sudoku- tiny network. It also simplifies the halting process, +Extreme (87.4% test accuracy). Meanwhile, TRM with removing the need for the extra forward pass. Over- +self-attention generalizes better on the other tasks +(probably due to inductive biases and the overcapac- 1 5M on Sudoku and 19M on Maze + + + + 8 + Recursive Reasoning with Tiny Networks + +all, TRM is much simpler than HRM, while achieving gan training for high fidelity natural image synthe- +better generalization. sis. arXiv preprint arXiv:1809.11096, 2018. +While our approach led to better generalization on 4 Chollet, F. On the measure of intelligence. arXiv +benchmarks, every choice made is not guaranteed to preprint arXiv:1911.01547, 2019. +be optimal on every dataset. For example, we found +that replacing the self-attention with an MLP worked Chollet, F., Knoop, M., Kamradt, G., Landers, B., +extremely well on Sudoku-Extreme (improving test ac- and Pinkard, H. Arc-agi-2: A new challenge +curacy by 10%), but poorly on other datasets. Different for frontier ai reasoning systems. arXiv preprint +problem settings may require different architectures arXiv:2505.11831, 2025. +or number of parameters. Scaling laws are needed +to parametrize these networks optimally. Although Chowdhery, A., Narang, S., Devlin, J., Bosma, M., +we simplified and improved on deep recursion, the Mishra, G., Roberts, A., Barham, P., Chung, H. W., +question of why recursion helps so much compared Sutton, C., Gehrmann, S., et al. Palm: Scaling lan- +to using a larger and deeper network remains to be guage modeling with pathways. Journal of Machine +explained; we suspect it has to do with overfitting, but Learning Research, 24(240):1–113, 2023. +we have no theory to back this explaination. Not all +our ideas made the cut; we briefly discuss some of the Dillion, T. Tdoku: A fast sudoku solver and gener- +failed ideas that we tried but did not work in Section 6. ator. https://t-dillon.github.io/tdoku/, +Currently, recursive reasoning models such as HRM 2025. +and TRM are supervised learning methods rather than + Elman, J. L. Finding structure in time. Cognitive science, +generative models. This means that given an input + 14(2):179–211, 1990. +question, they can only provide a single deterministic +answer. In many settings, multiple answers exist for a Fedus, W., Zoph, B., and Shazeer, N. Switch transform- +question. Thus, it would be interesting to extend TRM ers: Scaling to trillion parameter models with simple +to generative tasks. and efficient sparsity. Journal of Machine Learning Re- + search, 23(120):1–39, 2022. +Acknowledgements + Geng, Z. and Kolter, J. Z. Torchdeq: A library for deep +Thank you Emy Gervais for your invaluable support equilibrium models. arXiv preprint arXiv:2310.18605, +and extra push. This research was enabled in part 2023. +by computing resources, software, and technical as- +sistance provided by Mila and the Digital Research Hendrycks, D. and Gimpel, K. Gaussian error linear +Alliance of Canada. units (gelus). arXiv preprint arXiv:1606.08415, 2016. + + Jang, Y., Kim, D., and Ahn, S. Hierarchical graph +References generation with k2-trees. In ICML 2023 Workshop on +ARC Prize Foundation. The Hidden Drivers of HRM’s Structured Probabilistic Inference Generative Modeling, + Performance on ARC-AGI. https://arcprize. 2023. + org/blog/hrm-analysis, 2025a. [Online; ac- + Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., + cessed 2025-09-15]. + Chess, B., Child, R., Gray, S., Radford, A., Wu, J., +ARC Prize Foundation. ARC-AGI Leaderboard. and Amodei, D. Scaling laws for neural language + https://arcprize.org/leaderboard, 2025b. models. arXiv preprint arXiv:2001.08361, 2020. + [Online; accessed 2025-09-24]. + Kingma, D. P. and Ba, J. Adam: A method for stochas- +Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium tic optimization. arXiv preprint arXiv:1412.6980, + models. Advances in neural information processing 2014. + systems, 32, 2019. + Krantz, S. G. and Parks, H. R. The implicit function +Bai, X. and Melas-Kyriazi, L. Fixed point diffusion theorem: history, theory, and applications. Springer + models. In Proceedings of the IEEE/CVF Conference Science & Business Media, 2002. + on Computer Vision and Pattern Recognition, pp. 9430– + 9440, 2024. LeCun, Y. Une procedure d’apprentissage ponr reseau + a seuil asymetrique. Proceedings of cognitiva 85, pp. +Brock, A., Donahue, J., and Simonyan, K. Large scale 599–604, 1985. + + 9 + Recursive Reasoning with Tiny Networks + +Lehnert, L., Sukhbaatar, S., Su, D., Zheng, Q., Mcvay, P., all-mlp architecture for vision. Advances in neural + Rabbat, M., and Tian, Y. Beyond a*: Better planning information processing systems, 34:24261–24272, 2021. + with transformers via search dynamics bootstrap- + ping. arXiv preprint arXiv:2402.14083, 2024. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., + Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, +Lillicrap, T. P. and Santoro, A. Backpropagation I. Attention is all you need. Advances in neural + through time and the brain. Current opinion in neuro- information processing systems, 30, 2017. + biology, 55:82–89, 2019. + Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., +Loshchilov, I. and Hutter, F. Decoupled weight decay Lu, M., Song, S., and Yadkori, Y. A. Hierarchical + regularization. arXiv preprint arXiv:1711.05101, 2017. reasoning model. arXiv preprint arXiv:2506.21734, + 2025. +Moskvichev, A., Odouard, V. V., and Mitchell, M. The + conceptarc benchmark: Evaluating understanding Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., + and generalization in the arc domain. arXiv preprint Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought + arXiv:2305.07141, 2023. prompting elicits reasoning in large language mod- + els. Advances in neural information processing systems, +Palm, R., Paquet, U., and Winther, O. Recurrent re- + 35:24824–24837, 2022. + lational networks. Advances in neural information + processing systems, 31, 2018. Werbos, P. Beyond regression: New tools for predic- + tion and analysis in the behavioral sciences. PhD +Park, K. Can convolutional neural networks + thesis, Committee on Applied Mathematics, Harvard + crack sudoku puzzles? https://github.com/ + University, Cambridge, MA, 1974. + Kyubyong/sudoku, 2018. + Werbos, P. J. Generalization of backpropagation with +Prieto, L., Barsbey, M., Mediano, P. A., and Birdal, T. + application to a recurrent gas market model. Neural + Grokking at the edge of numerical stability. arXiv + networks, 1(4):339–356, 1988. + preprint arXiv:2501.04697, 2025. + Zhang, B. and Sennrich, R. Root mean square layer +Rumelhart, D. E., Hinton, G. E., and Williams, R. J. + normalization. Advances in Neural Information Pro- + Learning internal representations by error propaga- + cessing Systems, 32, 2019. + tion. Technical report, 1985. +Shazeer, N. Glu variants improve transformer. arXiv + preprint arXiv:2002.05202, 2020. +Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, + Q., Hinton, G., and Dean, J. Outrageously large neu- + ral networks: The sparsely-gated mixture-of-experts + layer. arXiv preprint arXiv:1701.06538, 2017. +Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling + llm test-time compute optimally can be more effec- + tive than scaling model parameters. arXiv preprint + arXiv:2408.03314, 2024. +Song, Y. and Ermon, S. Improved techniques for train- + ing score-based generative models. Advances in + neural information processing systems, 33:12438–12448, + 2020. +Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, + Y. Roformer: Enhanced transformer with rotary + position embedding. Neurocomputing, 568:127063, + 2024. +Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, + L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., + Keysers, D., Uszkoreit, J., et al. Mlp-mixer: An + + 10 + Recursive Reasoning with Tiny Networks + +Hyper-parameters and setup propagating through the whole n + 1 recursions makes + the most sense and works best. +All models are trained with the AdamW opti- +mizer(Loshchilov & Hutter, 2017; Kingma & Ba, 2014) We tried removing ACT with the option of stopping +with β 1 = 0.9, β 2 = 0.95, small learning rate warm- when the solution is reached, but we found that gen- +up (2K iterations), batch-size 768, hidden-size of 512, eralization dropped significantly. This can probably +Nsup = 16 max supervision steps, and stable-max loss be attributed to the fact that the model is spending +(Prieto et al., 2025) for improved stability. TRM uses an too much time on the same data samples rather than +Exponential Moving Average (EMA) of 0.999. HRM focusing on learning on a wide range of data samples. +uses n = 2, T = 2 with two 4-layers networks, while We tried weight tying the input embedding and out- +we use n = 6, T = 3 with one 2-layer network. put head, but this was too constraining and led to a +For Sudoku-Extreme and Maze-Hard, we train for 60k massive generalization drop. +epochs with learning rate 1e-4 and weight decay 1.0. We tried using TorchDEQ (Geng & Kolter, 2023) to +For ARC-AGI, we train for 100K epochs with learning replace the recursion steps by fixed-point iteration as +rate 1e-4 (with 1e-2 learning rate for the embeddings) done by Deep Equilibrium Models (Bai et al., 2019). +and weight decay 0.1. The numbers for Deepseek R1, This would provide a better justification for the 1-step +Claude 3.7 8K, O3-mini-high, Direct prediction, and gradient approximation. However, this slowed down +HRM from the Table 4 and 5 are taken from Wang et al. training due to the fixed-point iteration and led to +(2025). Both HRM and TRM add an embedding of worse generalization. This highlights the fact that +shape [0, 1, D ] on Sudoku-Extreme and Maze-Hard to converging to a fixed-point is not essential. +the input. For ARC-AGI, each puzzle (containing 2-3 +training examples and 1-2 test examples) at each data- +augmentation is given a specific embedding of shape +[0, 1, D ] and, at test-time, the most common answer +out of the 1000 data augmentations is given as answer. +Experiments on Sudoku-Extreme were ran with 1 L40S +with 40Gb of RAM for generally less than 36 hours. +Experiments on Maze-Hard were ran with 4 L40S with +40Gb of RAM for less than 24 hours. Experiments on +ARC-AGI were ran for around 3 days with 4 H100 +with 80Gb of RAM. + +Ideas that failed +In this section, we quickly mention a few ideas that +did not work to prevent others from making the same +mistake. +We tried replacing the SwiGLU MLPs by SwiGLU +Mixture-of-Experts (MoEs) (Shazeer et al., 2017; Fedus +et al., 2022), but we found generalization to decrease +massively. MoEs clearly add too much unnecessary +capacity, just like increasing the number of layers does. +Instead of back-propagating through the whole n + 1 +recursions, we tried a compromise between HRM 1- +step gradient approximation, which back-propagates +through the last 2 recursions. We did so by decou- +pling n from the number of last recursions k that we +back-propagate through. For example, while n = 6 +requires 7 steps with gradients in TRM, we can use +gradients for only the k = 4 last steps. However, we +found that this did not help generalization in any way, +and it made the approach more complicated. Back- + + + 11 + Recursive Reasoning with Tiny Networks + +Algorithms with different number of latent Example on Sudoku-Extreme +features + 8 3 1 + 9 6 8 7 +def latent recursion(x, z, n=6): 3 5 + for i in range(n+1): # latent recursion + z = net(x, z) 6 8 + return z 6 2 +def deep recursion(x, z, n=6, T=3): + 7 4 3 + # recursing T−1 times to improve z (no gradients needed) 9 4 + with torch.no grad(): + for j in range(T−1): + 2 4 1 + z = latent recursion(x, z, n) 6 2 5 7 + # recursing once to improve z + z = latent recursion(x, z, n) Input x + return z.detach(), output head(y), Q head(y) + 5 2 6 7 9 4 8 3 1 +# Deep Supervision +for x input, y true in train dataloader: 3 9 1 2 6 8 4 7 5 + z = z init 4 8 7 3 1 5 2 9 6 + for step in range(N supervision): + x = input embedding(x input) + 1 6 8 5 3 2 7 4 9 + z, y hat, q hat = deep recursion(x, z) 9 3 5 4 7 6 1 8 2 + loss = softmax cross entropy(y hat, y true) 7 4 2 9 8 1 5 6 3 + loss += binary cross entropy(q hat, (y hat == y true)) + z = z.detach() 8 7 3 1 5 9 6 2 4 + loss.backward() 2 5 9 6 4 7 3 1 8 + opt.step() + opt.zero grad() 6 1 4 8 5 3 9 5 7 + if q[0] > 0: # early−stopping + break Output y + 5 2 6 7 9 4 8 3 1 +Figure 4. Pseudocode of TRM using a single-z with deep 3 9 1 2 6 8 4 7 5 +supervision training in PyTorch. 4 8 7 3 1 5 2 9 6 + 1 6 8 5 3 2 7 4 9 + 9 3 5 4 7 6 1 8 2 + 7 4 2 9 8 1 5 6 3 + 8 7 3 1 5 9 6 2 4 +def latent recursion(x, y, z, n=6): + for i in range(n): # latent recursion 2 5 9 6 4 7 3 1 8 + z[i] = net(x, y, z[0], ... , z[n−1]) 6 1 4 8 5 3 9 5 7 + y = net(y, z[0], ... , z[n−1]) # refine output answer + return y, z Tokenized z H (denoted y in TRM) +def deep recursion(x, y, z, n=6, T=3): 5 5 4 9 4 6 3 + # recursing T−1 times to improve y and z (no gradients needed) + with torch.no grad(): 4 3 1 4 6 5 + for j in range(T−1): 4 8 4 3 6 6 4 + y, z = latent recursion(x, y, z, n) + # recursing once to improve y and z 9 6 5 3 5 4 + y, z = latent recursion(x, y, z, n) 3 5 4 3 5 4 4 + return (y.detach(), z.detach()), output head(y), Q head(y) + 6 3 3 3 5 8 8 +# Deep Supervision 3 3 3 6 5 6 6 4 +for x input, y true in train dataloader: 7 5 6 3 3 6 6 + y, z = y init, z init + for step in range(N supervision): 4 3 4 8 3 6 6 4 + x = input embedding(x input) + (y, z), y hat, q hat = deep recursion(x, y, z) Tokenized z L (denoted z in TRM) + loss = softmax cross entropy(y hat, y true) + loss += binary cross entropy(q hat, (y hat == y true)) + loss.backward() + opt.step() + Figure 6. This Sudoku-Extreme example shows an input, ex- + opt.zero grad() pected output, and the tokenized z H and z L (after reversing + if q[0] > 0: # early−stopping the embedding and using argmax) for a pretrained model. + break + This highlights the fact that z H corresponds to the predicted + response, while z L is a latent feature that cannot be decoded +Figure 5. Pseudocode of TRM using multi-scale z with deep to a sensible output unless transformed into z H by f H . +supervision training in PyTorch. + + + + + 12 +
\ No newline at end of file |
