


                                                       Less is More: Recursive Reasoning with Tiny Networks


                                                                                    Alexia Jolicoeur-Martineau
                                                                                     Samsung SAIL Montréal
                                                                                      alexia.j@samsung.com


arXiv:2510.04871v1 [cs.LG] 6 Oct 2025

Abstract

Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (∼1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.


1. Introduction

Figure 1. Tiny Recursion Model (TRM) recursively improves its predicted answer $y$ with a tiny network. It starts with the embedded input question $x$, an initial embedded answer $y$, and a latent $z$. For up to $N_{sup} = 16$ improvement steps, it tries to improve its answer $y$. It does so by i) recursively updating its latent $z$ $n$ times given the question $x$, current answer $y$, and current latent $z$ (recursive reasoning), and then ii) updating its answer $y$ given the current answer $y$ and current latent $z$. This recursive process allows the model to progressively improve its answer (potentially addressing any errors from its previous answer) in an extremely parameter-efficient manner while minimizing overfitting.

While powerful, Large Language models (LLMs) can struggle on hard question-answer problems. Given that they generate their answer auto-regressively, there is a high risk of error since a single incorrect token can render an answer invalid. To improve their reliability, LLMs rely on Chain-of-Thought (CoT) (Wei et al., 2022) and Test-Time Compute (TTC) (Snell et al., 2024). CoT seeks to emulate human reasoning by having the LLM sample step-by-step reasoning traces prior to giving its answer. Doing so can improve accuracy, but CoT is expensive, requires high-quality reasoning data (which may not be available), and can be brittle since the generated reasoning may be wrong. To further improve reliability, test-time compute can be used by reporting the most common answer out of K sampled answers or the highest-reward answer (Snell et al., 2024).
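As a rough illustration of the "most common answer out of K" strategy mentioned above, the sketch below performs a simple majority vote. The function name `majority_vote` and the example answers are hypothetical, and how the K answers are sampled from a model is left outside the sketch.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among K sampled answers.

    Ties go to the answer that was seen first, which is how
    Counter.most_common orders entries with equal counts.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]

# Hypothetical example: K = 5 sampled answers to the same question.
samples = ["42", "41", "42", "17", "42"]
best = majority_vote(samples)
```

The highest-reward variant would replace the count with a score from a reward model, which is not sketched here.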


However, this may not be enough. LLMs with CoT and TTC are not enough to beat every problem. While LLMs have made significant progress on ARC-AGI (Chollet, 2019) since 2019, human-level accuracy still has not been reached (6 years later, as of writing of this paper). Furthermore, LLMs struggle on the newer ARC-AGI-2 (e.g., Gemini 2.5 Pro only obtains 4.9% test accuracy with a high amount of TTC) (Chollet et al., 2025; ARC Prize Foundation, 2025b).

An alternative direction has recently been proposed by Wang et al. (2025). They propose a new way forward through their novel Hierarchical Reasoning Model (HRM), which obtains high accuracy on puzzle tasks where LLMs struggle to make a dent (e.g., Sudoku solving, Maze pathfinding, and ARC-AGI). HRM is a supervised learning model with two main novelties: 1) recursive hierarchical reasoning, and 2) deep supervision.

Recursive hierarchical reasoning consists of recursing multiple times through two small networks ($f_L$ at high frequency and $f_H$ at low frequency) to predict the answer. Each network generates a different latent feature: $f_L$ outputs $z_L$ and $f_H$ outputs $z_H$. Both features ($z_L$, $z_H$) are used as input to the two networks. The authors provide some biological arguments in favor of recursing at different hierarchies based on the different temporal frequencies at which the brain operates and the hierarchical processing of sensory inputs.

Deep supervision consists of improving the answer through multiple supervision steps while carrying the two latent features as initialization for the improvement steps (after detaching them from the computational graph so that their gradients do not propagate). This provides residual connections, which emulate very deep neural networks that are too memory expensive to apply in one forward pass.

An independent analysis on the ARC-AGI benchmark showed that deep supervision seems to be the primary driver of the performance gains (ARC Prize Foundation, 2025a). Using deep supervision doubled accuracy over single-step supervision (going from 19% to 39% accuracy), while recursive hierarchical reasoning only slightly improved accuracy over a regular model with a single forward pass (going from 35.7% to 39.0% accuracy). This suggests that reasoning across different supervision steps is worth it, but the recursion done in each supervision step is not particularly important.

In this work, we show that the benefit from recursive reasoning can be massively improved, making it much more than incremental. We propose Tiny Recursive Model (TRM), an improved and simplified approach using a much smaller tiny network with only 2 layers that achieves significantly higher generalization than HRM on a variety of problems. In doing so, we improve the state-of-the-art test accuracy on Sudoku-Extreme from 55% to 87%, Maze-Hard from 75% to 85%, ARC-AGI-1 from 40% to 45%, and ARC-AGI-2 from 5% to 8%.

2. Background

HRM is described in Algorithm 2. We discuss the details of the algorithm further below.

2.1. Structure and goal

The focus of HRM is supervised learning: given an input, produce an output. Both input and output are assumed to have shape $[B, L]$ (when the shape differs, padding tokens can be added), where $B$ is the batch size and $L$ is the context length.

HRM contains four learnable components: the input embedding $f_I(\cdot; \theta_I)$, low-level recurrent network $f_L(\cdot; \theta_L)$, high-level recurrent network $f_H(\cdot; \theta_H)$, and the output head $f_O(\cdot; \theta_O)$. Once the input is embedded, the shape becomes $[B, L, D]$ where $D$ is the embedding size. Each network is a 4-layer Transformer architecture (Vaswani et al., 2017), with RMSNorm (Zhang & Sennrich, 2019), no bias (Chowdhery et al., 2023), rotary embeddings (Su et al., 2024), and SwiGLU activation function (Hendrycks & Gimpel, 2016; Shazeer, 2020).

2.2. Recursion at two different frequencies

Given the hyperparameters used by Wang et al. (2025) ($n = 2$ $f_L$ steps, 1 $f_H$ step; done $T = 2$ times), a forward pass of HRM is done as follows:

\begin{align*}
x &\leftarrow f_I(\tilde{x}) \\
z_L &\leftarrow f_L(z_L + z_H + x) && \text{\# without gradients} \\
z_L &\leftarrow f_L(z_L + z_H + x) && \text{\# without gradients} \\
z_H &\leftarrow f_H(z_L + z_H) && \text{\# without gradients} \\
z_L &\leftarrow f_L(z_L + z_H + x) && \text{\# without gradients} \\
z_L &\leftarrow z_L.\text{detach}() \\
z_H &\leftarrow z_H.\text{detach}() \\
z_L &\leftarrow f_L(z_L + z_H + x) && \text{\# with gradients} \\
z_H &\leftarrow f_H(z_L + z_H) && \text{\# with gradients} \\
\hat{y} &\leftarrow \operatorname{argmax}(f_O(z_H))
\end{align*}

where $\hat{y}$ is the predicted output answer, $z_L$ and $z_H$ are either initialized embeddings or the embeddings of the previous deep supervision step (after detaching them from the computational graph). As can be seen,


a forward pass of HRM consists of applying 6 function evaluations, where the first 4 function evaluations are detached from the computational graph and are not back-propagated through. The authors use $n = 2$ with $T = 2$ in all experiments, but HRM can be generalized by allowing for an arbitrary number of L steps ($n$) and recursions ($T$) as shown in Algorithm 2.

# Pseudocode: L_net, H_net, output_head, Q_head, input_embedding, the loss
# functions, sigmoid, opt, z_init, N_sup, and train_dataloader are assumed
# to be defined elsewhere.
import torch

def hrm(z, x, n=2, T=2):  # hierarchical reasoning
    zH, zL = z
    with torch.no_grad():
        for i in range(n * T - 2):
            zL = L_net(zL, zH, x)
            if (i + 1) % T == 0:
                zH = H_net(zH, zL)
    # 1-step grad
    zL = L_net(zL, zH, x)
    zH = H_net(zH, zL)
    return (zH, zL), output_head(zH), Q_head(zH)

def ACT_halt(q, y_hat, y_true):
    target_halt = (y_hat == y_true)
    loss = 0.5 * binary_cross_entropy(q[0], target_halt)
    return loss

def ACT_continue(q, last_step):
    if last_step:
        target_continue = sigmoid(q[0])
    else:
        target_continue = sigmoid(max(q[0], q[1]))
    loss = 0.5 * binary_cross_entropy(q[1], target_continue)
    return loss

# Deep Supervision
for x_input, y_true in train_dataloader:
    z = z_init
    for step in range(N_sup):  # deep supervision
        x = input_embedding(x_input)
        z, y_pred, q = hrm(z, x)
        loss = softmax_cross_entropy(y_pred, y_true)
        # Adaptive computational time (ACT) using Q-learning
        loss += ACT_halt(q, y_pred, y_true)
        _, _, q_next = hrm(z, x)  # extra forward pass
        loss += ACT_continue(q_next, step == N_sup - 1)
        z = z.detach()  # detach both latents (zH, zL)
        loss.backward()
        opt.step()
        opt.zero_grad()
        if q[0] > q[1]:  # early-stopping
            break

Figure 2. Pseudocode of Hierarchical Reasoning Models (HRMs).

2.3. Fixed-point recursion with 1-step gradient approximation

Assuming that $(z_L, z_H)$ reaches a fixed-point $(z_L^*, z_H^*)$ through recursing from both $f_L$ and $f_H$,

\begin{align*}
z_L^* &\approx f_L(z_L^* + z_H + x) \\
z_H^* &\approx f_H(z_L + z_H^*),
\end{align*}

the Implicit Function Theorem (Krantz & Parks, 2002) with the 1-step gradient approximation (Bai et al., 2019) is used to approximate the gradient by back-propagating only the last $f_L$ and $f_H$ steps. This theorem is used to justify only tracking the gradients of the last two steps (out of 6), which greatly reduces memory demands.

2.4. Deep supervision

To improve effective depth, deep supervision is used. This consists of reusing the previous latent features ($z_H$ and $z_L$) as initialization for the next forward pass. This allows the model to reason over many iterations and improve its latent features ($z_L$ and $z_H$) until it (hopefully) converges to the correct solution. At most $N_{sup} = 16$ supervision steps are used.

2.5. Adaptive computational time (ACT)

With deep supervision, each mini-batch of data samples must be used for $N_{sup} = 16$ supervision steps before moving to the next mini-batch. This is expensive, and there is a balance to be reached between optimizing a few data examples for many supervision steps versus optimizing many data examples with fewer supervision steps. To reach a better balance, a halting mechanism is incorporated to determine whether the model should terminate early. It is learned through a Q-learning objective that requires passing $z_H$ through an additional head and running an additional forward pass (to determine if halting now rather than later would have been preferable). They call this method Adaptive computational time (ACT). It is only used during training, while the full $N_{sup} = 16$ supervision steps are done at test time to maximize downstream performance. ACT greatly diminishes the time spent per example (on average spending less than 2 steps on the Sudoku-Extreme dataset rather than the full $N_{sup} = 16$ steps), allowing more coverage of the dataset given a fixed number of training iterations.

2.6. Deep supervision and 1-step gradient approximation replace BPTT

Deep supervision and the 1-step gradient approximation provide a more biologically plausible and less computationally expensive alternative to Backpropagation Through Time (BPTT) (Werbos, 1974; Rumelhart et al., 1985; LeCun, 1985) for solving the temporal credit assignment (TCA) (Rumelhart et al., 1985; Werbos, 1988; Elman, 1990) problem (Lillicrap & Santoro, 2019). The implication is that HRM can learn what would normally require an extremely large network without having to back-propagate through its entire depth. Given the hyperparameters used by Wang et al. (2025) in all their experiments, HRM effectively reasons over $n_{layers} (n + 1) T N_{sup} = 4 \cdot (2 + 1) \cdot 2 \cdot 16 = 384$ layers of effective depth.
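The effective-depth arithmetic for HRM can be checked with a few lines of Python. This is only a sketch: the function name `effective_depth` and its argument names are hypothetical, chosen for illustration, and the values plugged in are the HRM hyperparameters stated in the text.

```python
def effective_depth(n_layers, n, T, n_sup):
    """Effective depth n_layers * (n + 1) * T * N_sup.

    n_layers: layers per network
    n:        number of f_L evaluations per f_H evaluation
    T:        number of (n x f_L, 1 x f_H) cycles per forward pass
    n_sup:    maximum number of deep supervision steps
    """
    return n_layers * (n + 1) * T * n_sup

# HRM hyperparameters from the text: 4 layers, n = 2, T = 2, N_sup = 16.
hrm_depth = effective_depth(n_layers=4, n=2, T=2, n_sup=16)
print(hrm_depth)  # 384
```

Setting `n_sup=1` gives the depth of a single supervision step, which is 24 for these HRM settings.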


2.7. Summary of HRM

HRM leverages recursion from two networks at different frequencies (high frequency versus low frequency) and deep supervision to learn to improve its answer over multiple supervision steps (with ACT to reduce time spent per data example). This enables the model to imitate extremely large depth without requiring backpropagation through all layers. This approach obtains significantly higher performance on hard question-answer tasks that regular supervised models struggle with. However, this method is quite complicated, relying a bit too heavily on uncertain biological arguments and fixed-point theorems that are not guaranteed to be applicable. In the next section, we discuss those issues and potential targets for improvements in HRM.

3. Target for improvements in Hierarchical Reasoning Models

In this section, we identify key targets for improvements in HRM, which will be addressed by our proposed method, Tiny Recursion Models (TRMs).

3.1. Implicit Function Theorem (IFT) with 1-step gradient approximation

HRM only back-propagates through the last 2 of the 6 recursions. The authors justify this by leveraging the Implicit Function Theorem (IFT) and one-step approximation, which states that when a recurrent function converges to a fixed point, backpropagation can be applied in a single step at that equilibrium point.

There are concerns about applying this theorem to HRM. Most importantly, there is no guarantee that a fixed-point is reached. Deep equilibrium models normally do fixed-point iteration to solve for the fixed point $z^* = f(z^*)$ (Bai et al., 2019). However, in the case of HRM, it is not iterating to the fixed-point but simply doing forward passes of $f_L$ and $f_H$. To make matters worse, HRM is only doing 4 recursions before stopping to apply the one-step approximation. After its first loop of two $f_L$ and one $f_H$ evaluations, it only applies a single $f_L$ evaluation before assuming that a fixed-point is reached for both $z_L$ and $z_H$ ($z_L^* = f_L(z_L^* + z_H + x)$ and $z_H^* = f_H(z_L^* + z_H^*)$). Then, the one-step gradient approximation is applied to both latent variables in succession.

The authors justify that a fixed-point is reached by depicting an example with $n = 7$ and $T = 7$ where the forward residuals are reduced over time (Figure 3 in Wang et al. (2025)). Even in this setting, which is different from the much smaller $n = 2$ and $T = 2$ used in every experiment of their paper, we observe the following:

1. the residual for $z_H$ is clearly well above 0 at every step
2. the residual for $z_L$ only becomes closer to 0 after many cycles, but it remains significantly above 0
3. $z_L$ is very far from converged after one $f_L$ evaluation at $T$ cycles, which is when the fixed-point is assumed to be reached and the 1-step gradient approximation is used

Thus, while the application of the IFT and 1-step gradient approximation to HRM has some basis since the residuals do generally reduce over time, a fixed point is unlikely to be reached when the theorem is actually applied.

In the next section, we show that we can bypass the need for the IFT and 1-step gradient approximation, thus bypassing the issue entirely.

3.2. Twice the forward passes with Adaptive computational time (ACT)

HRM uses Adaptive computational time (ACT) during training to optimize the time spent on each data sample. Without ACT, $N_{sup} = 16$ supervision steps would be spent on the same data sample, which is highly inefficient. They implement ACT through an additional Q-learning objective, which decides when to halt and move to a new data sample rather than keep iterating on the same data. This allows much more efficient use of time, especially since the average number of supervision steps during training is quite low with ACT (less than 2 steps on the Sudoku-Extreme dataset as per their reported numbers).

However, ACT comes at a cost. This cost is not directly shown in the HRM paper, but it is shown in their official code. The Q-learning objective relies on a halting loss and a continue loss. The continue loss requires an extra forward pass through HRM (with all 6 function evaluations). This means that while ACT optimizes time more efficiently per sample, it requires 2 forward passes per optimization step. The exact formulation is shown in Algorithm 2.

In the next section, we show that we can bypass the need for two forward passes in ACT.
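To make the source of the second forward pass concrete, here is a minimal scalar sketch of the two ACT losses from the HRM pseudocode, written in plain Python rather than PyTorch. It assumes that `q` is a pair of logits (halt, continue) for a single example, that the cross-entropy is taken on logits, and that `is_correct` is 1.0 when the prediction matches the target. These names and the scalar form are simplifications for illustration, not the official implementation.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def bce_with_logit(logit, target):
    """Binary cross-entropy between sigmoid(logit) and a target in [0, 1]."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def act_halt_loss(q, is_correct):
    # q[0] is the halt logit from the current forward pass; its target is
    # whether the current answer is already correct.
    return 0.5 * bce_with_logit(q[0], is_correct)

def act_continue_loss(q_next, last_step):
    # q_next must come from an extra forward pass, which is why ACT
    # costs two forward passes per optimization step.
    if last_step:
        target = sigmoid(q_next[0])
    else:
        target = sigmoid(max(q_next[0], q_next[1]))
    return 0.5 * bce_with_logit(q_next[1], target)

# Hypothetical logits for one example.
halt = act_halt_loss((0.0, 0.0), is_correct=1.0)
cont = act_continue_loss((0.0, 0.0), last_step=True)
```

Only `act_continue_loss` depends on `q_next`, so dropping the continue loss would remove the need for the second pass.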


3.3. Hierarchical interpretation based on complex biological arguments

The HRM authors justify the two latent variables and two networks operating at different hierarchies based on biological arguments, which are very far from artificial neural networks. They even try to match HRM to actual brain experiments on mice. While interesting, this sort of explanation makes it incredibly hard to parse out why HRM is designed the way it is. Given the lack of an ablation table in their paper and the over-reliance on biological arguments and fixed-point theorems (that are not perfectly applicable), it is hard to determine which parts of HRM are helping, with what, and why. Furthermore, it is not clear why they use two latent features rather than other combinations of features.

In the next section, we show that the recursive process can be greatly simplified and understood in a much simpler manner that does not require any biological argument, fixed-point theorem, hierarchical interpretation, or two networks. It also explains why 2 is the optimal number of features ($z_L$ and $z_H$).

4. Tiny Recursion Models

In this section, we present Tiny Recursion Models (TRMs). Contrary to HRM, TRM requires no complex mathematical theorem, hierarchy, or biological arguments. It generalizes better while requiring only a single tiny network (instead of two medium-size networks) and a single forward pass for the ACT (instead of 2 passes). Our approach is described in Algorithm 3 and illustrated in Figure 1. We also provide an ablation in Table 1 on the Sudoku-Extreme dataset (a dataset of difficult Sudokus with only 1K training examples, but 423K test examples). Below, we explain the key components of TRMs.

# Pseudocode: net, output_head, Q_head, input_embedding, the loss functions,
# opt, y_init, z_init, N_supervision, and train_dataloader are assumed to be
# defined elsewhere.
import torch

def latent_recursion(x, y, z, n=6):
    for i in range(n):  # latent reasoning
        z = net(x, y, z)
    y = net(y, z)  # refine output answer
    return y, z

def deep_recursion(x, y, z, n=6, T=3):
    # recursing T-1 times to improve y and z (no gradients needed)
    with torch.no_grad():
        for j in range(T - 1):
            y, z = latent_recursion(x, y, z, n)
    # recursing once to improve y and z
    y, z = latent_recursion(x, y, z, n)
    return (y.detach(), z.detach()), output_head(y), Q_head(y)

# Deep Supervision
for x_input, y_true in train_dataloader:
    y, z = y_init, z_init
    for step in range(N_supervision):
        x = input_embedding(x_input)
        (y, z), y_hat, q_hat = deep_recursion(x, y, z)
        loss = softmax_cross_entropy(y_hat, y_true)
        loss += binary_cross_entropy(q_hat, (y_hat == y_true))
        loss.backward()
        opt.step()
        opt.zero_grad()
        if q_hat > 0:  # early-stopping
            break

Figure 3. Pseudocode of Tiny Recursion Models (TRMs).

Table 1. Ablation of TRM on Sudoku-Extreme comparing % test accuracy, effective depth per supervision step ($T(n + 1)n_{layers}$), number of Forward Passes (NFP) per optimization step, and number of parameters.

Method                    Acc (%)   Depth   NFP   # Params
HRM                       55.0      24      2     27M
TRM (T = 3, n = 6)        87.4      42      1     5M
w/ ACT                    86.1      42      2     5M
w/ separate f_H, f_L      82.4      42      1     10M
no EMA                    79.9      42      1     5M
w/ 4-layers, n = 3        79.5      48      1     10M
w/ self-attention         74.7      42      1     7M
w/ T = 2, n = 2           73.7      12      1     5M
w/ 1-step gradient        56.5      42      1     5M

4.1. No fixed-point theorem required

HRM assumes that the recursions converge to a fixed-point for both $z_L$ and $z_H$ in order to leverage the 1-step gradient approximation (Bai et al., 2019). This allows the authors to justify only back-propagating through the last two function evaluations (1 $f_L$ and 1 $f_H$). To bypass this theoretical requirement, we define a full recursion process as containing $n$ evaluations of $f_L$ and 1 evaluation of $f_H$:

\begin{align*}
z_L &\leftarrow f_L(z_L + z_H + x) \\
&\;\;\vdots \\
z_L &\leftarrow f_L(z_L + z_H + x) \\
z_H &\leftarrow f_H(z_L + z_H).
\end{align*}

Then, we simply back-propagate through the full recursion process.

Through deep supervision, the model learns to take any $(z_L, z_H)$ and improve it through a full recursion process, hopefully making $z_H$ closer to the solution. This means that by the design of the deep supervision goal, running a few full recursion processes (even without gradients) is expected to bring us closer to the solution. We propose to run $T - 1$ recursion processes without gradient to improve $(z_L, z_H)$ before running one recursion process with backpropagation.

Thus, instead of using the 1-step gradient approximation, we apply a full recursion process containing $n$ evaluations of $f_L$ and 1 evaluation of $f_H$. This removes entirely the need to assume that a fixed-point

is reached and the use of the IFT with the 1-step gradient approximation. Yet, we can still
leverage multiple backpropagation-free recursion processes to improve (z_L, z_H). With this
approach, we obtain a massive boost in generalization on Sudoku-Extreme (improving TRM from 56.5%
to 87.4%; see Table 1).

4.2. Simpler reinterpretation of z_H and z_L

HRM is interpreted as doing hierarchical reasoning over two latent features of different
hierarchies due to arguments from biology. However, one might wonder why use two latent features
instead of 1, 3, or more? And do we really need to justify these so-called "hierarchical" features
based on biology to make sense of them? We propose a simple non-biological explanation, which is
more natural and directly answers the question of why there are 2 features.

The fact of the matter is: z_H is simply the current (embedded) solution. The embedding is
reversed by applying the output head and rounding to the nearest token using the argmax operation.
On the other hand, z_L is a latent feature that does not directly correspond to a solution, but it
can be transformed into a solution by applying z_H ← f_H(x, z_L, z_H). We show an example on
Sudoku-Extreme in Figure 6 to highlight the fact that z_H does correspond to the solution, but z_L
does not.

Once this is understood, hierarchy is not needed; there is simply an input x, a proposed solution
y (previously called z_H), and a latent reasoning feature z (previously called z_L). Given the
input question x, current solution y, and current latent reasoning z, the model recursively
improves its latent z. Then, given the current latent z and the previous solution y, the model
proposes a new solution y (or stays at the current solution if it is already good).

Although this has no direct influence on the algorithm, this re-interpretation is much simpler and
more natural. It answers the question of why there are two features: remembering in context the
question x, previous reasoning z, and previous answer y helps the model iterate on the next
reasoning z and then the next answer y. If we were not passing the previous reasoning z, the model
would forget how it got to the previous solution y (since z acts similarly to a chain-of-thought).
If we were not passing the previous solution y, then the model would forget what solution it had
and would be forced to store the solution y within z instead of using it for latent reasoning.
Thus, we need both y and z separately, and there is no apparent reason why one would need to split
z into multiple features.

While this is intuitive, we wanted to verify whether using more or fewer features could be
helpful. Results are shown in Table 2.

More features (> 2): We tested splitting z into different features by treating each of the n
recursions as producing a different z_i for i = 1, ..., n. Then, each z_i is carried across
supervision steps. The approach is described in Algorithm 5. In doing so, we found performance to
drop. This is expected because, as discussed, there is no apparent need for splitting z into
multiple parts. It does not have to be hierarchical.

Single feature: Similarly, we tested the idea of taking a single feature by only carrying z_H
across supervision steps. The approach is described in Algorithm 4. In doing so, we found
performance to drop. This is expected because, as discussed, it forces the model to store the
solution y within z.

Thus, we explored using more or fewer latent variables on Sudoku-Extreme, but found that having
only y and z leads to better test accuracy in addition to being the simplest and most natural
approach.

Table 2. TRM on Sudoku-Extreme: % test accuracy when using more or fewer latent features

    Method               # of features   Acc (%)
    TRM y, z (Ours)      2               87.4
    TRM multi-scale z    n + 1 = 7       77.6
    TRM single z         1               71.9

4.3. Single network

HRM uses two networks, one applied frequently as a low-level module (f_L) and one applied rarely
as a high-level module (f_H). This requires twice the number of parameters compared to regular
supervised learning with a single network.

As mentioned previously, while f_L iterates on the latent reasoning feature z (z_L in HRM), the
goal of f_H is to update the solution y (z_H in HRM) given the latent reasoning and current
solution. Importantly, since z ← f_L(x + y + z) contains x but y ← f_H(y + z) does not contain x,
the task to achieve (iterating on z versus using z to update y) is directly specified by the
inclusion or lack of x in the inputs. Thus, we considered the possibility that both networks could
be replaced by a single network doing both tasks. In doing so, we obtain better generalization on
Sudoku-Extreme (improving TRM from 82.4% to 87.4%; see Table 1) while reducing the number of
parameters by half. It turns out that a single network is enough.
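
To make the control flow of Sections 4.2 and 4.3 concrete, here is a minimal runnable sketch in plain Python. It is not the authors' implementation: the scalar "network" and its update rules are placeholders invented for illustration, gradients are ignored entirely, and the helper `make_counting_net` is a hypothetical name. The only point is that one function serves both roles, with the presence or absence of x selecting between updating z and updating y.

```python
def make_counting_net():
    """Hypothetical stand-in for the single TRM network (toy scalar arithmetic)."""
    calls = {"latent": 0, "answer": 0}

    def net(y, z, x=None):
        if x is not None:
            # x is present: iterate on the latent reasoning feature z
            calls["latent"] += 1
            return 0.5 * (x + y + z)  # placeholder update, not a real model
        # x is absent: use z to update the current answer y
        calls["answer"] += 1
        return 0.5 * (y + z)  # placeholder update, not a real model

    return net, calls


def latent_recursion(net, x, y, z, n=6):
    for _ in range(n):      # n latent-reasoning updates
        z = net(y, z, x=x)
    y = net(y, z)           # one answer refinement
    return y, z


def deep_recursion(net, x, y, z, n=6, T=3):
    # The T - 1 gradient-free passes and the final pass with gradients
    # share the same control flow, so this sketch simply runs T passes.
    for _ in range(T):
        y, z = latent_recursion(net, x, y, z, n)
    return y, z


net, calls = make_counting_net()
y, z = deep_recursion(net, x=1.0, y=0.0, z=0.0)
print(calls)
```

With the defaults n = 6 and T = 3, the same function is evaluated T(n + 1) = 21 times per supervision step: 18 latent updates and 3 answer updates.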


4.4. Less is more

We attempted to increase capacity by increasing the number of layers in order to scale the model.
Surprisingly, we found that adding layers decreased generalization due to overfitting. In doing
the opposite, decreasing the number of layers while scaling the number of recursions (n)
proportionally (to keep the amount of compute and emulated depth approximately the same), we found
that using 2 layers (instead of 4 layers) maximized generalization. In doing so, we obtain better
generalization on Sudoku-Extreme (improving TRM from 79.5% to 87.4%; see Table 1) while reducing
the number of parameters by half (again).

It is quite surprising that smaller networks are better, but 2 layers seems to be the optimal
choice. Bai & Melas-Kyriazi (2024) also observed optimal performance with 2 layers in the context
of deep equilibrium diffusion models; however, they had similar performance to the bigger
networks, while we instead observe better performance with 2 layers. This may appear unusual, as
with modern neural networks, generalization tends to directly correlate with model size. However,
when data is too scarce and model size is large, there can be an overfitting penalty (Kaplan et
al., 2020). This is likely an indication that there is too little data. Thus, using tiny networks
with deep recursion and deep supervision appears to allow us to bypass a lot of the overfitting.

4.5. Attention-free architecture for tasks with small fixed context length

Self-attention is particularly good for long context lengths when L ≫ D since it only requires a
matrix of [D, 3D] parameters, even though it can account for the whole sequence. However, when
focusing on tasks where L ≤ D, a linear layer is cheap, requiring only a matrix of [L, L]
parameters. Taking inspiration from the MLP-Mixer (Tolstikhin et al., 2021), we can replace the
self-attention layer with a multilayer perceptron (MLP) applied on the sequence length. Using an
MLP instead of self-attention, we obtain better generalization on Sudoku-Extreme (improving from
74.7% to 87.4%; see Table 1). This worked well on Sudoku 9x9 grids, given the small and fixed
context length; however, we found this architecture to be suboptimal for tasks with large context
length, such as Maze-Hard and ARC-AGI (both using 30x30 grids). We show results with and without
self-attention for all experiments.

4.6. No additional forward pass needed with ACT

As previously mentioned, the implementation of ACT in HRM through Q-learning requires two forward
passes, which slows down training. We propose a simple solution, which is to get rid of the
continue loss (from the Q-learning) and only learn a halting probability through a binary
cross-entropy loss of having reached the correct solution. By removing the continue loss, we
remove the need for the expensive second forward pass, while still being able to determine when to
halt with relatively good accuracy. We found no significant difference in generalization from this
change (going from 86.1% to 87.4%; see Table 1).

4.7. Exponential Moving Average (EMA)

On small data (such as Sudoku-Extreme and Maze-Hard), HRM tends to overfit quickly and then
diverge. To reduce this problem and improve stability, we integrate an Exponential Moving Average
(EMA) of the weights, a common technique in GANs and diffusion models to improve stability (Brock
et al., 2018; Song & Ermon, 2020). We find that it prevents sharp collapse and leads to higher
generalization (going from 79.9% to 87.4%; see Table 1).

4.8. Optimal number of recursions

We experimented with different numbers of recursions by varying T and n and found that T = 3,
n = 3 (equivalent to 48 recursions) in HRM and T = 3, n = 6 in TRM (equivalent to 42 recursions)
lead to optimal generalization on Sudoku-Extreme. More recursions could be helpful for harder
problems (we have not tested it, given our limited resources); however, increasing either T or n
incurs massive slowdowns. We show results at different n and T for HRM and TRM in Table 3. Note
that TRM requires backpropagation through a full recursion process, thus increasing n too much
leads to Out Of Memory (OOM) errors. However, this memory cost is well worth paying.

In the following section, we show our main results on multiple datasets comparing HRM, TRM, and
LLMs.

5. Results

Following Wang et al. (2025), we test our approach on the following datasets: Sudoku-Extreme (Wang
et al., 2025), Maze-Hard (Wang et al., 2025), ARC-AGI-1 (Chollet, 2019), and ARC-AGI-2 (Chollet et
al., 2025). Results are presented in Tables 4 and 5. Hyperparameters are detailed in the
"Hyper-parameters and setup" appendix. Datasets are discussed below.
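
As a quick arithmetic check on the recursion counts quoted in Section 4.8, the sketch below evaluates the effective-depth expression T(n + 1) n_layers given in the caption of Table 3. The function name `effective_depth` is hypothetical and chosen only for this illustration; the layer counts (4 for HRM, 2 for TRM) are the ones stated in Section 4.4.

```python
def effective_depth(T, n, n_layers):
    """Effective depth per supervision step: T passes of (n + 1) evaluations,
    each evaluation going through n_layers layers."""
    return T * (n + 1) * n_layers


# HRM setting from Section 4.8: T = 3, n = 3, 4-layer networks
hrm_depth = effective_depth(T=3, n=3, n_layers=4)

# TRM setting from Section 4.8: T = 3, n = 6, 2-layer network
trm_depth = effective_depth(T=3, n=6, n_layers=2)

print(hrm_depth, trm_depth)
```

Under these assumptions the two values come out to 48 and 42, matching the counts quoted for HRM and TRM respectively.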



Table 3. % Test accuracy on the Sudoku-Extreme dataset. HRM versus TRM matched at a similar
effective depth per supervision step (T(n + 1) n_layers)

                HRM (n = k, 4 layers)     TRM (n = 2k, 2 layers)
    k    T      Depth    Acc (%)          Depth    Acc (%)
    1    1      9        46.4             7        63.2
    2    2      24       55.0             20       81.9
    3    3      48       61.6             42       87.4
    4    4      80       59.5             72       84.2
    6    3      84       62.3             78       OOM
    3    6      96       58.8             84       85.8
    6    6      168      57.5             156      OOM

Sudoku-Extreme consists of extremely difficult Sudoku puzzles (Dillion, 2025; Palm et al., 2018;
Park, 2018) (9x9 grid), for which only 1K training samples are used to test small-sample learning.
Testing is done on 423K samples. Maze-Hard consists of 30x30 mazes generated by the procedure of
Lehnert et al. (2024), whose shortest path has length above 110; both the training set and test
set include 1000 mazes.

ARC-AGI-1 and ARC-AGI-2 are geometric puzzles involving monetary prizes. Each puzzle is designed
to be easy for a human, yet hard for current AI models. Each puzzle task consists of 2-3
input–output demonstration pairs and 1-2 test inputs to be solved. The final score is computed as
the accuracy over all test inputs from two attempts to produce the correct output grid. The
maximum grid size is 30x30. ARC-AGI-1 contains 800 tasks, while ARC-AGI-2 contains 1120 tasks. We
also augment our data with the 160 tasks from the closely related ConceptARC dataset (Moskvichev
et al., 2023). We provide results on the public evaluation set for both ARC-AGI-1 and ARC-AGI-2.

While these datasets are small, heavy data augmentation is used in order to improve
generalization. Sudoku-Extreme uses 1000 shuffling augmentations (done without breaking the Sudoku
rules) per data example. Maze-Hard uses 8 dihedral transformations per data example. ARC-AGI uses
1000 data augmentations (color permutation, dihedral-group, and translation transformations) per
data example. The dihedral-group transformations consist of random 90-degree rotations,
horizontal/vertical flips, and reflections.

From the results, we see that TRM without self-attention obtains the best generalization on
Sudoku-Extreme (87.4% test accuracy). Meanwhile, TRM with self-attention generalizes better on the
other tasks (probably due to inductive biases and the overcapacity of the MLP on large 30x30
grids). TRM with self-attention obtains 85.3% accuracy on Maze-Hard, 44.6% accuracy on ARC-AGI-1,
and 7.8% accuracy on ARC-AGI-2 with 7M parameters. This is significantly higher than the 74.5%,
40.3%, and 5.0% obtained by HRM using 4 times the number of parameters (27M).

Table 4. % Test accuracy on Puzzle Benchmarks (Sudoku-Extreme and Maze-Hard)

    Method              # Params    Sudoku    Maze
    Chain-of-thought, pretrained
    Deepseek R1         671B        0.0       0.0
    Claude 3.7 8K       ?           0.0       0.0
    O3-mini-high        ?           0.0       0.0
    Direct prediction, small-sample training
    Direct pred         27M         0.0       0.0
    HRM                 27M         55.0      74.5
    TRM-Att (Ours)      7M          74.7      85.3
    TRM-MLP (Ours)      5M/19M¹     87.4      0.0

    ¹ 5M on Sudoku and 19M on Maze

Table 5. % Test accuracy on ARC-AGI Benchmarks (2 tries)

    Method                # Params    ARC-1    ARC-2
    Chain-of-thought, pretrained
    Deepseek R1           671B        15.8     1.3
    Claude 3.7 16K        ?           28.6     0.7
    o3-mini-high          ?           34.5     3.0
    Gemini 2.5 Pro 32K    ?           37.0     4.9
    Grok-4-thinking       1.7T        66.7     16.0
    Bespoke (Grok-4)      1.7T        79.6     29.4
    Direct prediction, small-sample training
    Direct pred           27M         21.0     0.0
    HRM                   27M         40.3     5.0
    TRM-Att (Ours)        7M          44.6     7.8
    TRM-MLP (Ours)        19M         29.6     2.4

6. Conclusion

We propose Tiny Recursion Models (TRM), a simple recursive reasoning approach that achieves strong
generalization on hard tasks using a single tiny network recursing on its latent reasoning feature
and progressively improving its final answer. Contrary to the Hierarchical Reasoning Model (HRM),
TRM requires no fixed-point theorem, no complex biological justifications, and no hierarchy. It
significantly reduces the number of parameters by halving the number of layers and replacing the
two networks with a single tiny network. It also simplifies the halting process, removing the need
for the extra forward pass. Overall, TRM is much simpler than HRM, while achieving better
generalization.

While our approach led to better generalization on 4 benchmarks, not every choice made is
guaranteed to be optimal on every dataset. For example, we found that replacing the self-attention
with an MLP worked extremely well on Sudoku-Extreme (improving test accuracy from 74.7% to 87.4%),
but poorly on other datasets. Different problem settings may require different architectures or
numbers of parameters. Scaling laws are needed to parametrize these networks optimally. Although
we simplified and improved on deep recursion, the question of why recursion helps so much compared
to using a larger and deeper network remains to be explained; we suspect it has to do with
overfitting, but we have no theory to back this explanation. Not all our ideas made the cut; we
briefly discuss some of the ideas that we tried but that did not work in the "Ideas that failed"
appendix.

Currently, recursive reasoning models such as HRM and TRM are supervised learning methods rather
than generative models. This means that given an input question, they can only provide a single
deterministic answer. In many settings, multiple answers exist for a question. Thus, it would be
interesting to extend TRM to generative tasks.

Acknowledgements

Thank you Emy Gervais for your invaluable support and extra push. This research was enabled in
part by computing resources, software, and technical assistance provided by Mila and the Digital
Research Alliance of Canada.

References

ARC Prize Foundation. The Hidden Drivers of HRM's Performance on ARC-AGI.
  https://arcprize.org/blog/hrm-analysis, 2025a. [Online; accessed 2025-09-15].

ARC Prize Foundation. ARC-AGI Leaderboard. https://arcprize.org/leaderboard, 2025b.
  [Online; accessed 2025-09-24].

Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium models. Advances in neural information
  processing systems, 32, 2019.

Bai, X. and Melas-Kyriazi, L. Fixed point diffusion models. In Proceedings of the IEEE/CVF
  Conference on Computer Vision and Pattern Recognition, pp. 9430–9440, 2024.

Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image
  synthesis. arXiv preprint arXiv:1809.11096, 2018.

Chollet, F. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.

Chollet, F., Knoop, M., Kamradt, G., Landers, B., and Pinkard, H. Arc-agi-2: A new challenge for
  frontier ai reasoning systems. arXiv preprint arXiv:2505.11831, 2025.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung,
  H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. Journal
  of Machine Learning Research, 24(240):1–113, 2023.

Dillion, T. Tdoku: A fast sudoku solver and generator. https://t-dillon.github.io/tdoku/, 2025.

Elman, J. L. Finding structure in time. Cognitive science, 14(2):179–211, 1990.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models
  with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

Geng, Z. and Kolter, J. Z. Torchdeq: A library for deep equilibrium models. arXiv preprint
  arXiv:2310.18605, 2023.

Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). arXiv preprint
  arXiv:1606.08415, 2016.

Jang, Y., Kim, D., and Ahn, S. Hierarchical graph generation with k2-trees. In ICML 2023 Workshop
  on Structured Probabilistic Inference Generative Modeling, 2023.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford,
  A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint
  arXiv:2001.08361, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint
  arXiv:1412.6980, 2014.

Krantz, S. G. and Parks, H. R. The implicit function theorem: history, theory, and applications.
  Springer Science & Business Media, 2002.

LeCun, Y. Une procedure d'apprentissage pour reseau a seuil asymetrique. Proceedings of cognitiva
  85, pp. 599–604, 1985.


Lehnert, L., Sukhbaatar, S., Su, D., Zheng, Q., Mcvay, P., Rabbat, M., and Tian, Y. Beyond a*:
  Better planning with transformers via search dynamics bootstrapping. arXiv preprint
  arXiv:2402.14083, 2024.

Lillicrap, T. P. and Santoro, A. Backpropagation through time and the brain. Current opinion in
  neurobiology, 55:82–89, 2019.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint
  arXiv:1711.05101, 2017.

Moskvichev, A., Odouard, V. V., and Mitchell, M. The conceptarc benchmark: Evaluating
  understanding and generalization in the arc domain. arXiv preprint arXiv:2305.07141, 2023.

Palm, R., Paquet, U., and Winther, O. Recurrent relational networks. Advances in neural
  information processing systems, 31, 2018.

Park, K. Can convolutional neural networks crack sudoku puzzles?
  https://github.com/Kyubyong/sudoku, 2018.

Prieto, L., Barsbey, M., Mediano, P. A., and Birdal, T. Grokking at the edge of numerical
  stability. arXiv preprint arXiv:2501.04697, 2025.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning internal representations by error
  propagation. Technical report, 1985.

Shazeer, N. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J.
  Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint
  arXiv:1701.06538, 2017.

Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling llm test-time compute optimally can be more
  effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.

Song, Y. and Ermon, S. Improved techniques for training score-based generative models. Advances
  in neural information processing systems, 33:12438–12448, 2020.

Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. Roformer: Enhanced transformer with
  rotary position embedding. Neurocomputing, 568:127063, 2024.

Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J.,
  Steiner, A., Keysers, D., Uszkoreit, J., et al. Mlp-mixer: An all-mlp architecture for vision.
  Advances in neural information processing systems, 34:24261–24272, 2021.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and
  Polosukhin, I. Attention is all you need. Advances in neural information processing systems,
  30, 2017.

Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., Lu, M., Song, S., and Yadkori, Y. A.
  Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al.
  Chain-of-thought prompting elicits reasoning in large language models. Advances in neural
  information processing systems, 35:24824–24837, 2022.

Werbos, P. Beyond regression: New tools for prediction and analysis in the behavioral sciences.
  PhD thesis, Committee on Applied Mathematics, Harvard University, Cambridge, MA, 1974.

Werbos, P. J. Generalization of backpropagation with application to a recurrent gas market model.
  Neural networks, 1(4):339–356, 1988.

Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in Neural Information
  Processing Systems, 32, 2019.


Hyper-parameters and setup
All models are trained with the AdamW optimizer (Loshchilov & Hutter, 2017;
Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.95, a small learning-rate warm-up
(2K iterations), batch size 768, hidden size 512, Nsup = 16 max supervision
steps, and stable-max loss (Prieto et al., 2025) for improved stability. TRM
uses an Exponential Moving Average (EMA) of 0.999. HRM uses n = 2, T = 2 with
two 4-layer networks, while we use n = 6, T = 3 with one 2-layer network.

For Sudoku-Extreme and Maze-Hard, we train for 60K epochs with learning rate
1e-4 and weight decay 1.0. For ARC-AGI, we train for 100K epochs with learning
rate 1e-4 (with 1e-2 learning rate for the embeddings) and weight decay 0.1.
The numbers for Deepseek R1, Claude 3.7 8K, O3-mini-high, Direct prediction,
and HRM in Tables 4 and 5 are taken from Wang et al. (2025).

Both HRM and TRM add an embedding of shape [0, 1, D] to the input on
Sudoku-Extreme and Maze-Hard. For ARC-AGI, each puzzle (containing 2-3
training examples and 1-2 test examples) at each data augmentation is given a
specific embedding of shape [0, 1, D] and, at test time, the most common
answer out of the 1000 data augmentations is given as the answer.

Experiments on Sudoku-Extreme were run on 1 L40S with 40 GB of RAM, generally
for less than 36 hours. Experiments on Maze-Hard were run on 4 L40S with 40 GB
of RAM for less than 24 hours. Experiments on ARC-AGI were run for around 3
days on 4 H100 with 80 GB of RAM.

Ideas that failed
In this section, we briefly mention a few ideas that did not work, to save
others from making the same mistakes.

We tried replacing the SwiGLU MLPs by SwiGLU Mixture-of-Experts (MoEs)
(Shazeer et al., 2017; Fedus et al., 2022), but we found generalization to
decrease massively. MoEs clearly add too much unnecessary capacity, just like
increasing the number of layers does.

Instead of back-propagating through the whole n + 1 recursions, we tried a
compromise between full back-propagation and the HRM 1-step gradient
approximation, which back-propagates through the last 2 recursions. We did so
by decoupling n from the number of last recursions k that we back-propagate
through. For example, while n = 6 requires 7 steps with gradients in TRM, we
can use gradients for only the k = 4 last steps. However, we found that this
did not help generalization in any way, and it made the approach more
complicated. Back-propagating through the whole n + 1 recursions makes the
most sense and works best.

We tried removing ACT with the option of stopping when the solution is
reached, but we found that generalization dropped significantly. This can
probably be attributed to the fact that the model is spending too much time on
the same data samples rather than focusing on learning on a wide range of data
samples.

We tried weight tying the input embedding and output head, but this was too
constraining and led to a massive generalization drop.

We tried using TorchDEQ (Geng & Kolter, 2023) to replace the recursion steps
by fixed-point iteration as done by Deep Equilibrium Models (Bai et al.,
2019). This would provide a better justification for the 1-step gradient
approximation. However, this slowed down training due to the fixed-point
iteration and led to worse generalization. This highlights the fact that
converging to a fixed-point is not essential.
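The decoupling of n from k described in this section can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the function name latent_recursion_last_k, the toy TinyNet module, and the tensor sizes are assumptions made for illustration. The first n + 1 - k recursions run without gradient tracking, and only the last k are recorded by autograd.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Toy stand-in for the TRM network (hypothetical, for illustration only)."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.lin = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Combine the embedded input x and the latent z into a new latent.
        return torch.tanh(self.lin(torch.cat([x, z], dim=-1)))


def latent_recursion_last_k(net, x, z, n=6, k=4):
    """Run n + 1 recursions, back-propagating only through the last k."""
    total = n + 1
    k = max(1, min(k, total))  # clamp so that 1 <= k <= n + 1
    with torch.no_grad():  # first n + 1 - k steps: no graph is built
        for _ in range(total - k):
            z = net(x, z)
    for _ in range(k):  # last k steps: tracked by autograd
        z = net(x, z)
    return z
```

With k = n + 1 the no-gradient loop is empty and the function reduces to full back-propagation through all recursions, which is the setting reported above as working best.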


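The warm-up and EMA settings listed under "Hyper-parameters and setup" can be sketched in plain Python. The text gives only the warm-up length (2K iterations) and the EMA value (0.999); the linear warm-up shape, the function names, and the reading of 0.999 as the EMA decay factor are assumptions made for this sketch.

```python
def warmup_lr(step, base_lr=1e-4, warmup_steps=2000):
    """Learning rate at a given step; the linear ramp is an assumption."""
    if step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


def ema_update(ema, params, decay=0.999):
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```

In a training loop, warmup_lr would be evaluated before each optimizer step and ema_update after it, with the averaged values used for evaluation.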

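The test-time rule for ARC-AGI (take the most common answer over the data augmentations) can be sketched as a simple majority vote. This assumes each augmented prediction has already been mapped back to the original orientation and colors and is stored as a grid of integers; the function name and the tie-breaking rule (the first answer seen wins) are illustrative choices, not details from the paper.

```python
from collections import Counter


def most_common_answer(predictions):
    """Return the grid predicted most often across augmentations."""
    if not predictions:
        raise ValueError("need at least one prediction")
    # Lists of lists are not hashable, so convert each grid to a tuple of tuples.
    keys = [tuple(tuple(row) for row in grid) for grid in predictions]
    winner, _count = Counter(keys).most_common(1)[0]
    return [list(row) for row in winner]
```

For example, given three predictions of which two are identical, the function returns the grid that appears twice.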
Algorithms with different number of latent features

def latent_recursion(x, z, n=6):
    for i in range(n+1):  # latent recursion
        z = net(x, z)
    return z

def deep_recursion(x, z, n=6, T=3):
    # recursing T-1 times to improve z (no gradients needed)
    with torch.no_grad():
        for j in range(T-1):
            z = latent_recursion(x, z, n)
    # recursing once to improve z
    z = latent_recursion(x, z, n)
    # single-z variant: there is no separate y, so the heads read from z
    return z.detach(), output_head(z), Q_head(z)

# Deep Supervision
for x_input, y_true in train_dataloader:
    z = z_init
    for step in range(N_supervision):
        x = input_embedding(x_input)
        z, y_hat, q_hat = deep_recursion(x, z)
        loss = softmax_cross_entropy(y_hat, y_true)
        loss += binary_cross_entropy(q_hat, (y_hat == y_true))
        z = z.detach()  # no-op: deep_recursion already returns a detached z
        loss.backward()
        opt.step()
        opt.zero_grad()
        if q_hat[0] > 0:  # early-stopping
            break

Figure 4. Pseudocode of TRM using a single-z with deep supervision training in PyTorch.

def latent_recursion(x, y, z, n=6):
    for i in range(n):  # latent recursion
        z[i] = net(x, y, z[0], ... , z[n-1])
    y = net(y, z[0], ... , z[n-1])  # refine output answer
    return y, z

def deep_recursion(x, y, z, n=6, T=3):
    # recursing T-1 times to improve y and z (no gradients needed)
    with torch.no_grad():
        for j in range(T-1):
            y, z = latent_recursion(x, y, z, n)
    # recursing once to improve y and z
    y, z = latent_recursion(x, y, z, n)
    return (y.detach(), z.detach()), output_head(y), Q_head(y)

# Deep Supervision
for x_input, y_true in train_dataloader:
    y, z = y_init, z_init
    for step in range(N_supervision):
        x = input_embedding(x_input)
        (y, z), y_hat, q_hat = deep_recursion(x, y, z)
        loss = softmax_cross_entropy(y_hat, y_true)
        loss += binary_cross_entropy(q_hat, (y_hat == y_true))
        loss.backward()
        opt.step()
        opt.zero_grad()
        if q_hat[0] > 0:  # early-stopping
            break

Figure 5. Pseudocode of TRM using multi-scale z with deep supervision training in PyTorch.

Example on Sudoku-Extreme

In the grids below, "." marks an empty cell.

Input x
. . . . . . 8 3 1
. 9 . . 6 8 . 7 .
. . . 3 . 5 . . .
. 6 8 . . . . . .
. . . . . 6 . . 2
7 4 . . . . . . 3
. . . . . 9 . . 4
2 . . . 4 . . 1 .
6 . . . 2 . . 5 7

Output y
5 2 6 7 9 4 8 3 1
3 9 1 2 6 8 4 7 5
4 8 7 3 1 5 2 9 6
1 6 8 5 3 2 7 4 9
9 3 5 4 7 6 1 8 2
7 4 2 9 8 1 5 6 3
8 7 3 1 5 9 6 2 4
2 5 9 6 4 7 3 1 8
6 1 4 8 5 3 9 5 7

Tokenized z_H (denoted y in TRM)
5 2 6 7 9 4 8 3 1
3 9 1 2 6 8 4 7 5
4 8 7 3 1 5 2 9 6
1 6 8 5 3 2 7 4 9
9 3 5 4 7 6 1 8 2
7 4 2 9 8 1 5 6 3
8 7 3 1 5 9 6 2 4
2 5 9 6 4 7 3 1 8
6 1 4 8 5 3 9 5 7

Tokenized z_L (denoted z in TRM)
5 . 5 4 9 4 . 6 3
4 . 3 1 . . 4 6 5
4 8 4 . 3 . 6 6 4
9 . 6 5 3 . 5 4 .
. 3 5 4 3 . 5 4 4
6 . 3 . 3 3 5 8 8
3 3 3 6 5 . 6 6 4
7 5 . 6 . 3 3 6 6
4 3 4 8 . 3 6 6 4

Figure 6. This Sudoku-Extreme example shows an input, expected output, and the
tokenized z_H and z_L (after reversing the embedding and using argmax) for a
pretrained model. This highlights the fact that z_H corresponds to the
predicted response, while z_L is a latent feature that cannot be decoded to a
sensible output unless transformed into z_H by f_H.
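In the pseudocode of Figures 4 and 5, the halting target is written as (y_hat == y_true). One way to compute this term is sketched below. It assumes that y_hat holds per-token logits, that q_hat is a per-example halting logit (consistent with the "> 0" early-stopping test), and that an example counts as solved only when every token matches. The function name halting_loss and these choices are assumptions, not details confirmed by the text.

```python
import torch
import torch.nn.functional as F


def halting_loss(y_hat, y_true, q_hat):
    """Binary cross-entropy between the halting logit and 'is the answer right'.

    y_hat:  [batch, length, vocab] logits
    y_true: [batch, length] integer targets
    q_hat:  [batch] halting logits
    """
    pred = y_hat.argmax(dim=-1)                     # decode tokens from logits
    solved = (pred == y_true).all(dim=-1).float()   # 1.0 only if all tokens match
    loss = F.binary_cross_entropy_with_logits(q_hat, solved)
    return loss, solved
```

Because the target is built from an argmax and a comparison, no gradient flows through it; only q_hat receives a gradient from this term.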

