1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
|
Less is More: Recursive Reasoning with Tiny Networks
Alexia Jolicoeur-Martineau
Samsung SAIL Montréal
alexia.j@samsung.com
Abstract
arXiv:2510.04871v1 [cs.LG] 6 Oct 2025
Hierarchical Reasoning Model (HRM) is a
novel approach using two small neural net-
works recursing at different frequencies. This
biologically inspired method beats Large Lan-
guage models (LLMs) on hard puzzle tasks
such as Sudoku, Maze, and ARC-AGI while
trained with small models (27M parameters)
on small data (∼ 1000 examples). HRM holds
great promise for solving hard problems with
small networks, but it is not yet well un-
derstood and may be suboptimal. We pro-
pose Tiny Recursive Model (TRM), a much
simpler recursive reasoning approach that
achieves significantly higher generalization
than HRM, while using a single tiny network
with only 2 layers. With only 7M parameters,
TRM obtains 45% test-accuracy on ARC-AGI-
1 and 8% on ARC-AGI-2, higher than most
LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5
Pro) with less than 0.01% of the parameters.
1. Introduction
While powerful, Large Language models (LLMs) can
struggle on hard question-answer problems. Given
that they generate their answer auto-regressively, there
is a high risk of error since a single incorrect token can
render an answer invalid. To improve their reliabil- Figure 1. Tiny Recursion Model (TRM) recursively improves
ity, LLMs rely on Chain-of-thoughts (CoT) (Wei et al., its predicted answer y with a tiny network. It starts with the
2022) and Test-Time Compute (TTC) (Snell et al., 2024). embedded input question x and initial embedded answer
y, and latent z. For up to Nsup = 16 improvements steps,
CoTs seek to emulate human reasoning by having the
it tries to improve its answer y. It does so by i) recursively
LLM to sample step-by-step reasoning traces prior to
updating n times its latent z given the question x, current
giving their answer. Doing so can improve accuracy, answer y, and current latent z (recursive reasoning), and
but CoT is expensive, requires high-quality reasoning then ii) updating its answer y given the current answer y
data (which may not be available), and can be brittle and current latent z. This recursive process allows the model
since the generated reasoning may be wrong. To fur- to progressively improve its answer (potentially address-
ther improve reliability, test-time compute can be used ing any errors from its previous answer) in an extremely
by reporting the most common answer out of K or the parameter-efficient manner while minimizing overfitting.
highest-reward answer (Snell et al., 2024).
1
Recursive Reasoning with Tiny Networks
However, this may not be enough. LLMs with CoT ers that achieves significantly higher generalization
and TTC are not enough to beat every problem. While than HRM on a variety of problems. In doing so, we
LLMs have made significant progress on ARC-AGI improve the state-of-the-art test accuracy on Sudoku-
(Chollet, 2019) since 2019, human-level accuracy still Extreme from 55% to 87%, Maze-Hard from 75% to
has not been reached (6 years later, as of writing of 85%, ARC-AGI-1 from 40% to 45%, and ARC-AGI-2
this paper). Furthermore, LLMs struggle on the newer from 5% to 8%.
ARC-AGI-2 (e.g., Gemini 2.5 Pro only obtains 4.9% test
accuracy with a high amount of TTC) (Chollet et al., 2. Background
2025; ARC Prize Foundation, 2025b).
HRM is described in Algorithm 2. We discuss the
An alternative direction has recently been proposed by
details of the algorithm further below.
Wang et al. (2025). They propose a new way forward
through their novel Hierarchical Reasoning Model
(HRM), which obtains high accuracy on puzzle tasks 2.1. Structure and goal
where LLMs struggle to make a dent (e.g., Sudoku The focus of HRM is supervised learning. Given an
solving, Maze pathfinding, and ARC-AGI). HRM is a input, produce an output. Both input and output are
supervised learning model with two main novelties: 1) assumed to have shape [ B, L] (when the shape differs,
recursive hierarchical reasoning, and 2) deep supervision. padding tokens can be added), where B is the batch-
Recursive hierarchical reasoning consists of recurs- size and L is the context-length.
ing multiple times through two small networks ( f L at HRM contains four learnable components: the in-
high frequency and f H at low frequency) to predict the put embedding f I (·; θ I ), low-level recurrent network
answer. Each network generates a different latent fea- f L (·; θ L ), high-level recurrent network f H (·; θ H ), and
ture: f L outputs z H and f H outputs z L . Both features the output head fO (·; θO ). Once the input is embedded,
(z L , z H ) are used as input to the two networks. The the shape becomes [ B, L, D ] where D is the embedding
authors provide some biological arguments in favor of size. Each network is a 4-layer Transformers architec-
recursing at different hierarchies based on the different ture (Vaswani et al., 2017), with RMSNorm (Zhang
temporal frequencies at which the brains operate and & Sennrich, 2019), no bias (Chowdhery et al., 2023),
hierarchical processing of sensory inputs. rotary embeddings (Su et al., 2024), and SwiGLU acti-
Deep supervision consists of improving the answer vation function (Hendrycks & Gimpel, 2016; Shazeer,
through multiple supervision steps while carrying the 2020).
two latent features as initialization for the improve-
ment steps (after detaching them from the computa- 2.2. Recursion at two different frequencies
tional graph so that their gradients do not propagate). Given the hyperparameters used by Wang et al. (2025)
This provide residual connections, which emulates (n = 2 f L steps, 1 f H steps; done T = 2 times), a
very deep neural networks that are too memory ex- forward pass of HRM is done as follows:
pensive to apply in one forward pass.
An independent analysis on the ARC-AGI benchmark x ← f I ( x̃ )
showed that deep supervision seems to be the primary z L ← f L (z L + z H + x ) # without gradients
driver of the performance gains (ARC Prize Founda- z L ← f L (z L + z H + x ) # without gradients
tion, 2025a). Using deep supervision doubled accuracy
z H ← f H (z L + z H ) # without gradients
over single-step supervision (going from 19% to 39%
accuracy), while recursive hierarchical reasoning only z L ← f L (z L + z H + x ) # without gradients
slightly improved accuracy over a regular model with z L ← z L .detach()
a single forward pass (going from 35.7% to 39.0% ac- z H ← z H .detach()
curacy). This suggests that reasoning across different
z L ← f L (z L + z H + x ) # with gradients
supervision steps is worth it, but the recursion done
in each supervision step is not particularly important. z H ← f H (z L + z H ) # with gradients
ŷ ← argmax( fO (z H ))
In this work, we show that the benefit from recursive
reasoning can be massively improved, making it much
more than incremental. We propose Tiny Recursive where ŷ is the predicted output answer, z L and z H are
Model (TRM), an improved and simplified approach either initialized embeddings or the embeddings of
using a much smaller tiny network with only 2 lay- the previous deep supervision step (after detaching
them from the computational graph). As can be seen,
2
Recursive Reasoning with Tiny Networks
def hrm(z, x, n=2, T=2): # hierarchical reasoning 2.4. Deep supervision
zH, zL = z
with torch.no grad(): To improve effective depth, deep supervision is used.
for i in range(nT − 2):
zL = L net(zL, zH, x) This consists of reusing the previous latent features
if (i + 1) % T == 0:
zH = H net(zH, zL)
(z H and z L ) as initialization for the next forward pass.
# 1−step grad This allows the model to reason over many iterations
zL = L net(zL, zH, x)
zH = H net(zH, zL) and improve its latent features (z L and z H ) until it
return (zH, zL), output head(zH), Q head(zH) (hopefully) converges to the correct solution. At most
def ACT halt(q, y hat, y true): Nsup = 16 supervision steps are used.
target halt = (y hat == y true)
loss = 0.5∗binary cross entropy(q[0], target halt)
return loss 2.5. Adaptive computational time (ACT)
def ACT continue(q, last step):
if last step: With deep supervision, each mini-batch of data sam-
target continue = sigmoid(q[0]) ples must be used for Nsup = 16 supervision steps
else:
target continue = sigmoid(max(q[0], q[1]))) before moving to the next mini-batch. This is expen-
loss = 0.5∗binary cross entropy(q[1], target continue)
return loss
sive, and there is a balance to be reached between
optimizing a few data examples for many supervision
# Deep Supervision
for x input, y true in train dataloader: steps versus optimizing many data examples with less
z = z init supervision steps. To reach a better balance, a halting
for step in range(N sup): # deep supervision
x = input embedding(x input) mechanism is incorporated to determine whether the
z, y pred, q = hrm(z, x)
loss = softmax cross entropy(y pred, y true)
model should terminate early. It is learned through
# Adaptive computational time (ACT) using Q−learning a Q-learning objective that requires passing the z H
loss += ACT halt(q, y pred, y true)
, , q next = hrm(z, x) # extra forward pass through an additional head and running an additional
loss += ACT continue(q next, step == N sup − 1) forward pass (to determine if halting now rather than
z = z.detach()
loss.backward() later would have been preferable). They call this
opt.step()
opt.zero grad()
method Adaptive computational time (ACT). It is only
if q[0] > q[1]: # early−stopping used during training, while the full Nsup = 16 super-
break
vision steps are done at test time to maximize down-
stream performance. ACT greatly diminishes the time
Figure 2. Pseudocode of Hierarchical Reasoning Models spent per example (on average spending less than 2
(HRMs). steps on the Sudoku-Extreme dataset rather than the
full Nsup = 16 steps), allowing more coverage of the
a forward pass of HRM consists of applying 6 function dataset given a fixed number of training iterations.
evaluations, where the first 4 function evaluations are
detached from the computational graph and are not 2.6. Deep supervision and 1-step gradient
back-propagated through. The authors uses n = 2 approximations replaces BPTT
with T = 2 in all experiments, but HRM can be gener-
alized by allowing for an arbitrary number of L steps Deep supervision and the 1-step gradient approxima-
(n) and recursions (T) as shown in Algorithm 2. tion provide a more biologically plausible and less
computationally-expansive alternative to Backpropa-
2.3. Fixed-point recursion with 1-step gradient gation Through Time (BPTT) (Werbos, 1974; Rumel-
approximation hart et al., 1985; LeCun, 1985) for solving the temporal
credit assignment (TCA) (Rumelhart et al., 1985; Wer-
Assuming that (z L , z H ) reaches a fixed-point (z∗L , z∗H ) bos, 1988; Elman, 1990) problem (Lillicrap & Santoro,
through recursing from both f L and f H , 2019). The implication is that HRM can learn what
would normally require an extremely large network
z∗L ≈ f L (z∗L + z H + x ) without having to back-propagate through its entire
z∗H ≈ f H (z L + z∗H ) , depth. Given the hyperparameters used by Jang et al.
(2023) in all their experiments, HRM effectively rea-
the Implicit Function Theorem (Krantz & Parks, 2002)
sons over nlayers (n + 1) TNsup = 4 ∗ (2 + 1) ∗ 2 ∗ 16 =
with the 1-step gradient approximation (Bai et al.,
384 layers of effective depth.
2019) is used to approximate the gradient by back-
propagating only the last f L and f H steps. This theo-
rem is used to justify only tracking the gradients of
the last two steps (out of 6), which greatly reduces
memory demands.
3
Recursive Reasoning with Tiny Networks
2.7. Summary of HRM different from the much smaller n = 2 and T = 2 used
in every experiment of their paper, we observe the
HRM leverages recursion from two networks at dif-
following:
ferent frequencies (high frequency versus low fre-
quency) and deep supervision to learn to improve
its answer over multiple supervision steps (with ACT 1. the residual for z H is clearly well above 0 at every
to reduce time spent per data example). This enables step
the model to imitate extremely large depth without
requiring backpropagation through all layers. This
approach obtains significantly higher performance on 2. the residual for z L only becomes closer to 0 after
hard question-answer tasks that regular supervised many cycles, but it remains significantly above 0
models struggle with. However, this method is quite
complicated, relying a bit too heavily on uncertain
biological arguments and fixed-point theorems that 3. z L is very far from converged after one f L evalu-
are not guaranteed to be applicable. In the next sec- ation at T cycles, which is when the fixed-point
tion, we discuss those issues and potential targets for is assumed to be reached and the 1-step gradient
improvements in HRM. approximation is used
3. Target for improvements in Hierarchical Thus, while the application of the IFT theorem and
Reasoning Models 1-step gradient approximation to HRM has some basis
since the residuals do generally reduce over time, a
In this section, we identify key targets for improve-
fixed point is unlikely to be reached when the theorem
ments in HRM, which will be addressed by our pro-
is actually applied.
posed method, Tiny Recursion Models (TRMs).
In the next section, we show that we can bypass the
3.1. Implicit Function Theorem (IFT) with 1-step need for the IFT theorem and 1-step gradient approxi-
gradient approximation mation, thus bypassing the issue entirely.
HRM only back-propagates through the last 2 of the 6 3.2. Twice the forward passes with Adaptive
recursions. The authors justify this by leveraging the computational time (ACT)
Implicit Function Theorem (IFT) and one-step approx-
imation, which states that when a recurrent function HRM uses Adaptive computational time (ACT) during
converges to a fixed point, backpropagation can be training to optimize the time spent of each data sam-
applied in a single step at that equilibrium point. ple. Without ACT, Nsup = 16 supervision steps would
be spent on the same data sample, which is highly in-
There are concerns about applying this theorem to
efficient. They implement ACT through an additional
HRM. Most importantly, there is no guarantee that
Q-learning objective, which decides when to halt and
a fixed-point is reached. Deep equilibrium models
move to a new data sample rather than keep iterating
normally do fixed-point iteration to solve for the fixed
on the same data. This allows much more efficient
pointz∗ = f (z∗ ) (Bai et al., 2019). However, in the case
use of time especially since the average number of su-
of HRM, it is not iterating to the fixed-point but simply
pervision steps during training is quite low with ACT
doing forward passes of f L and f H . To make matters
(less than 2 steps on the Sudoku-Extreme dataset as
worse, HRM is only doing 4 recursions before stopping
per their reported numbers).
to apply the one-step approximation. After its first
loop of two f L and 1 f H evaluations, it only apply a However, ACT comes at a cost. This cost is not directly
single f L evaluation before assuming that a fixed-point shown in the HRM’s paper, but it is shown in their of-
is reached for both z L and z H (z∗L = f L (z∗L + z H + x ) ficial code. The Q-learning objective relies on a halting
and z∗H = f H (z∗L + z∗H )). Then, the one-step gradient loss and a continue loss. The continue loss requires an
approximation is applied to both latent variables in extra forward pass through HRM (with all 6 function
succession. evaluations). This means that while ACT optimizes
time more efficiently per sample, it requires 2 forward
The authors justify that a fixed-point is reached by
passes per optimization step. The exact formulation is
depicting an example with n = 7 and T = 7 where
shown in Algorithm 2.
the forward residuals is reduced over time (Figure 3
in Wang et al. (2025)). Even in this setting, which is In the next section, we show that we can bypass the
need for two forward passes in ACT.
4
Recursive Reasoning with Tiny Networks
3.3. Hierarchical interpretation based on complex of 2 passes). Our approach is described in Algorithm 3
biological arguments and illustrated in Figure 1. We also provide an ablation
in Table 1 on the Sudoku-Extreme dataset (a dataset
The HRM’s authors justify the two latent variables
of difficult Sudokus with only 1K training examples,
and two networks operating at different hierarchies
but 423K test examples). Below, we explain the key
based on biological arguments, which are very far
components of TRMs.
from artificial neural networks. They even try to match
HRM to actual brain experiments on mice. While in-
teresting, this sort of explanation makes it incredibly Table 1. Ablation of TRM on Sudoku-Extreme comparing %
hard to parse out why HRM is designed the way it Test accuracy, effective depth per supervision step ( T (n +
is. Given the lack of ablation table in their paper, the 1)nlayers ), number of Forward Passes (NFP) per optimization
over-reliance on biological arguments and fixed-point step, and number of parameters
theorems (that are not perfectly applicable), it is hard Method Acc (%) Depth NFP # Params
to determine what parts of HRM is helping what and HRM 55.0 24 2 27M
why. Furthermore, it is not clear why they use two TRM (T = 3, n = 6) 87.4 42 1 5M
latent features rather than other combinations of fea- w/ ACT 86.1 42 2 5M
tures. w/ separate f H , f L 82.4 42 1 10M
no EMA 79.9 42 1 5M
In the next section, we show that the recursive process w/ 4-layers, n = 3 79.5 48 1 10M
can be greatly simplified and understood in a much w/ self-attention 74.7 42 1 7M
simpler manner that does not require any biological w/ T = 2, n = 2 73.7 12 1 5M
argument, fixed-point theorem, hierarchical interpre- w/ 1-step gradient 56.5 42 1 5M
tation, nor using two networks. It also explains why 2
is the optimal number of features (z L and z H ).
4.1. No fixed-point theorem required
def latent recursion(x, y, z, n=6): HRM assumes that the recursions converge to a fixed-
for i in range(n): # latent reasoning point for both z L and z H in order to leverage the 1-step
z = net(x, y, z)
y = net(y, z) # refine output answer gradient approximation (Bai et al., 2019). This allows
return y, z
the authors to justify only back-propagating through
def deep recursion(x, y, z, n=6, T=3): the last two function evaluations (1 f L and 1 f H ). To
# recursing T−1 times to improve y and z (no gradients needed)
with torch.no grad(): bypass this theoretical requirement, we define a full
for j in range(T−1): recursion process as containing n evaluations of f L
y, z = latent recursion(x, y, z, n)
# recursing once to improve y and z and 1 evaluation of f H :
y, z = latent recursion(x, y, z, n)
return (y.detach(), z.detach()), output head(y), Q head(y) z L ← f L (z L + z H + x )
# Deep Supervision ...
for x input, y true in train dataloader:
y, z = y init, z init z L ← f L (z L + z H + x )
for step in range(N supervision):
x = input embedding(x input) z H ← f H (z L + z H ) .
(y, z), y hat, q hat = deep recursion(x, y, z)
loss = softmax cross entropy(y hat, y true)
loss += binary cross entropy(q hat, (y hat == y true)) Then, we simply back-propagate through the full re-
loss.backward() cursion process.
opt.step()
opt.zero grad()
if q hat > 0: # early−stopping Through deep supervision, the models learns to take
break any (z L , z H ) and improve it through a full recursion
process, hopefully making z H closer to the solution.
Figure 3. Pseudocode of Tiny Recursion Models (TRMs). This means that by the design of the deep supervi-
sion goal, running a few full recursion processes (even
without gradients) is expected to bring us closer to the
4. Tiny Recursion Models solution. We propose to run T − 1 recursion processes
without gradient to improve (z L , z H ) before running
In this section, we present Tiny Recursion Models
one recursion process with backpropagation.
(TRMs). Contrary to HRM, TRM requires no com-
plex mathematical theorem, hierarchy, nor biological Thus, instead of using the 1-step gradient approxi-
arguments. It generalizes better while requiring only mation, we apply a full recursion process containing
a single tiny network (instead of two medium-size net- n evaluations of f L and 1 evaluation of f H . This re-
works) and a single forward pass for the ACT (instead moves entirely the need to assume that a fixed-point
5
Recursive Reasoning with Tiny Networks
is reached and the use of the IFT theorem with 1-step While this is intuitive, we wanted to verify whether
gradient approximation. Yet, we can still leverage using more or less features could be helpful. Results
multiple backpropagation-free recursion processes to are shown in Table 2.
improve (z L , z H ). With this approach, we obtain a
More features (> 2): We tested splitting z into dif-
massive boost in generalization on Sudoku-Extreme
ferent features by treating each of the n recursions as
(improving TRM from 56.5% to 87.4%; see Table 1).
producing a different zi for i = 1, ..., n. Then, each
zi is carried across supervision steps. The approach
4.2. Simpler reinterpretation of z H and z L is described in Algorithm 5. In doing so, we found
HRM is interpreted as doing hierarchical reasoning performance to drop. This is expected because, as dis-
over two latent features of different hierarchies due to cussed, there is no apparent need for splitting z into
arguments from biology. However, one might wonder multiple parts. It does not have to be hierarchical.
why use two latent features instead of 1, 3, or more? Single feature: Similarly, we tested the idea of taking
And do we really need to justify these so-called ”hier- a single feature by only carrying z H across supervi-
archical” features based on biology to make sense of sion steps. The approach is described in Algorithm 4.
them? We propose a simple non-biological explana- In doing so, we found performance to drop. This is
tion, which is more natural, and directly answers the expected because, as discussed, it forces the model to
question of why there are 2 features. store the solution y within z.
The fact of the matter is: z H is simply the current Thus, we explored using more or less latent variables
(embedded) solution. The embedding is reversed by on Sudoku-Extreme, but found that having only y and
applying the output head and rounding to the nearest z lead to better test accuracy in addition to being the
token using the argmax operation. On the other hand, simplest more natural approach.
z L is a latent feature that does not directly correspond
to a solution, but it can be transformed into a solution
by applying z H ← f H ( x, z L , z H ). We show an example
on Sudoku-Extreme in Figure 6 to highlight the fact Table 2. TRM on Sudoku-Extreme comparing % Test accu-
that z H does correspond to the solution, but z L does racy when using more or less latent features
not. Method # of features Acc (%)
TRM y, z (Ours) 2 87.4
Once this is understood, hierarchy is not needed; there TRM multi-scale z n+1 = 7 77.6
is simply an input x, a proposed solution y (previously TRM single z 1 71.9
called z H ), and a latent reasoning feature z (previously
called z L ). Given the input question x, current solution
y, and current latent reasoning z, the model recursively
improves its latent z. Then, given the current latent z 4.3. Single network
and the previous solution y, the model proposes a new HRM uses two networks, one applied frequently as a
solution y (or stay at the current solution if its already low-level module f H and one applied rarely as an high-
good). level module ( f H ). This requires twice the number of
Although this has no direct influence on the algorithm, parameters compared to regular supervised learning
this re-interpretation is much simpler and natural. It with a single network.
answers the question about why two features: remem- As mentioned previously, while f L iterates on the la-
bering in context the question x, previous reasoning tent reasoning feature z (z L in HRM), the goal of f H
z, and previous answer y helps the model iterate on is to update the solution y (z H in HRM) given the la-
the next reasoning z and then the next answer y. If tent reasoning and current solution. Importantly, since
we were not passing the previous reasoning z, the z ← f L ( x + y + z) contains x but y ← f H (y + z) does
model would forget how it got to the previous solu- not contains x, the task to achieve (iterating on z versus
tion y (since z acts similarly as a chain-of-thought). If using z to update y) is directly specified by the inclu-
we were not passing the previous solution y, then the sion or lack of x in the inputs. Thus, we considered
model would forget what solution it had and would the possibility that both networks could be replaced
be forced to store the solution y within z instead of by a single network doing both tasks. In doing so, we
using it for latent reasoning. Thus, we need both y and obtain better generalization on Sudoku-Extreme (im-
z separately, and there is no apparent reason why one proving TRM from 82.4% to 87.4%; see Table 1) while
would need to split z into multiple features. reducing the number of parameters by half. It turns
out that a single network is enough.
6
Recursive Reasoning with Tiny Networks
4.4. Less is more 4.6. No additional forward pass needed with ACT
We attempted to increase capacity by increasing the As previously mentioned, the implementation of ACT
number of layers in order to scale the model. Sur- in HRM through Q-learning requires two forward
prisingly, we found that adding layers decreased gen- passes, which slows down training. We propose a
eralization due to overfitting. In doing the oppo- simple solution, which is to get rid of the continue loss
site, decreasing the number of layers while scaling (from the Q-learning) and only learn a halting proba-
the number of recursions (n) proportionally (to keep bility through a Binary-Cross-Entropy loss of having
the amount of compute and emulated depth approxi- reached the correct solution. By removing the continue
mately the same), we found that using 2 layers (instead loss, we remove the need for the expensive second for-
of 4 layers) maximized generalization. In doing so, we ward pass, while still being able to determine when to
obtain better generalization on Sudoku-Extreme (im- halt with relatively good accuracy. We found no sig-
proving TRM from 79.5% to 87.4%; see Table 1) while nificant difference in generalization from this change
reducing the number of parameters by half (again). (going from 86.1% to 87.4%; see Table 1).
It is quite surprising that smaller networks are bet-
ter, but 2 layers seems to be the optimal choice. Bai 4.7. Exponential Moving Average (EMA)
& Melas-Kyriazi (2024) also observed optimal perfor- On small data (such as Sudoku-Extreme and Maze-
mance for 2-layers in the context of deep equilibrium Hard), HRM tends to overfit quickly and then diverge.
diffusion models; however, they had similar perfor- To reduce this problem and improves stability, we
mance to the bigger networks, while we instead ob- integrate Exponential Moving Average (EMA) of the
serve better performance with 2 layers. This may ap- weights, a common technique in GANs and diffusion
pear unusual, as with modern neural networks, gener- models to improve stability (Brock et al., 2018; Song &
alization tends to directly correlate with model sizes. Ermon, 2020). We find that it prevents sharp collapse
However, when data is too scarce and model size is and leads to higher generalization (going from 79.9%
large, there can be an overfitting penalty (Kaplan et al., to 87.4%; see Table 1).
2020). This is likely an indication that there is too little
data. Thus, using tiny networks with deep recursion 4.8. Optimal the number of recursions
and deep supervision appears to allow us to bypass a
lot of the overfitting. We experimented with different number of recursions
by varying T and n and found that T = 3, n = 3
4.5. attention-free architecture for tasks with small (equivalent to 48 recursions) in HRM and T = 3, n = 6
fixed context length in TRM (equivalent to 42 recursions) to lead to optimal
generalization on Sudoku-Extreme. More recursions
Self-attention is particularly good for long-context could be helpful for harder problems (we have not
lengths when L ≫ D since it only requires a matrix of tested it, given our limited resources); however, in-
[ D, 3D ] parameters, even though it can account for the creasing either T or n incurs massive slowdowns. We
whole sequence. However, when focusing on tasks show results at different n and T for HRM and TRM
where L ≤ D, a linear layer is cheap, requiring only a in Table 3. Note that TRM requires backpropagation
matrix of [ L, L] parameters. Taking inspiration from through a full recursion process, thus increasing n too
the MLP-Mixer (Tolstikhin et al., 2021), we can replace much leads to Out Of Memory (OOM) errors. How-
the self-attention layer with a multilayer perceptron ever, this memory cost is well worth its price in gold.
(MLP) applied on the sequence length. Using an MLP
instead of self-attention, we obtain better generaliza- In the following section, we show our main results on
tion on Sudoku-Extreme (improving from 74.7% to multiple datasets comparing HRM, TRM, and LLMs.
87.4%; see Table 1). This worked well on Sudoku 9x9
grids, given the small and fixed context length; how- 5. Results
ever, we found this architecture to be suboptimal for
tasks with large context length, such as Maze-Hard Following Wang et al. (2025), we test our approach
and ARC-AGI (both using 30x30 grids). We show on the following datasets: Sudoku-Extreme (Wang
results with and without self-attention for all experi- et al., 2025), Maze-Hard (Wang et al., 2025), ARC-AGI-
ments. 1 (Chollet, 2019) and, ARC-AGI-2 (Chollet et al., 2025).
Results are presented in Tables 4 and 5. Hyperparame-
ters are detailed in Section 6. Datasets are discussed
below.
7
Recursive Reasoning with Tiny Networks
ity of the MLP on large 30x30 grids). TRM with self-
Table 3. % Test accuracy on Sudoku-Extreme dataset. HRM
attention obtains 85.3% accuracy on Maze-Hard, 44.6%
versus TRM matched at a similar effective depth per super-
vision step ( T (n + 1)nlayers )
accuracy on ARC-AGI-1, and 7.8% accuracy on ARC-
AGI-2 with 7M parameters. This is significantly higher
HRM TRM
than the 74.5%, 40.3%, and 5.0% obtained by HRM us-
n = k, 4 layers n = 2k, 2 layers
ing 4 times the number of parameters (27M).
k T Depth Acc (%) Depth Acc (%)
1 1 9 46.4 7 63.2
2 2 24 55.0 20 81.9 Table 4. % Test accuracy on Puzzle Benchmarks (Sudoku-
3 3 48 61.6 42 87.4 Extreme and Maze-Hard)
4 4 80 59.5 72 84.2 Method # Params Sudoku Maze
6 3 84 62.3 78 OOM Chain-of-thought, pretrained
3 6 96 58.8 84 85.8 Deepseek R1 671B 0.0 0.0
6 6 168 57.5 156 OOM Claude 3.7 8K ? 0.0 0.0
O3-mini-high ? 0.0 0.0
Direct prediction, small-sample training
Sudoku-Extreme consists of extremely difficult Su- Direct pred 27M 0.0 0.0
doku puzzles (Dillion, 2025; Palm et al., 2018; Park, HRM 27M 55.0 74.5
2018) (9x9 grid), for which only 1K training samples TRM-Att (Ours) 7M 74.7 85.3
are used to test small-sample learning. Testing is done TRM-MLP (Ours) 5M/19M 1 87.4 0.0
on 423K samples. Maze-Hard consists of 30x30 mazes
generated by the procedure by Lehnert et al. (2024)
whose shortest path is of length above 110; both the
training set and test set include 1000 mazes. Table 5. % Test accuracy on ARC-AGI Benchmarks (2 tries)
Method # Params ARC-1 ARC-2
ARC-AGI-1 and ARC-AGI-2 are geometric puzzles in- Chain-of-thought, pretrained
volving monetary prizes. Each puzzle is designed to Deepseek R1 671B 15.8 1.3
be easy for a human, yet hard for current AI models. Claude 3.7 16K ? 28.6 0.7
Each puzzle task consists of 2-3 input–output demon- o3-mini-high ? 34.5 3.0
stration pairs and 1-2 test inputs to be solved. The final Gemini 2.5 Pro 32K ? 37.0 4.9
score is computed as the accuracy over all test inputs Grok-4-thinking 1.7T 66.7 16.0
from two attempts to produce the correct output grid. Bespoke (Grok-4) 1.7T 79.6 29.4
The maximum grid size is 30x30. ARC-AGI-1 con- Direct prediction, small-sample training
tains 800 tasks, while ARC-AGI-2 contains 1120 tasks.
Direct pred 27M 21.0 0.0
We also augment our data with the 160 tasks from
HRM 27M 40.3 5.0
the closely related ConceptARC dataset (Moskvichev
et al., 2023). We provide results on the public evalua- TRM-Att (Ours) 7M 44.6 7.8
tion set for both ARC-AGI-1 and ARC-AGI-2. TRM-MLP (Ours) 19M 29.6 2.4
While these datasets are small, heavy data-
augmentation is used in order to improve gen- 6. Conclusion
eralization. Sudoku-Extreme uses 1000 shuffling
(done without breaking the Sudoku rules) augmenta- We propose Tiny Recursion Models (TRM), a simple
tions per data example. Maze-Hard uses 8 dihedral recursive reasoning approach that achieves strong gen-
transformations per data example. ARC-AGI uses eralization on hard tasks using a single tiny network
1000 data augmentations (color permutation, dihedral- recursing on its latent reasoning feature and progres-
group, and translations transformations) per data sively improving its final answer. Contrary to the
example. The dihedral-group transformations consist Hierarchical Reasoning Model (HRM), TRM requires
of random 90-degree rotations, horizontal/vertical no fixed-point theorem, no complex biological justi-
flips, and reflections. fications, and no hierarchy. It significantly reduces
the number of parameters by halving the number of
From the results, we see that TRM without self- layers and replacing the two networks with a single
attention obtains the best generalization on Sudoku- tiny network. It also simplifies the halting process,
Extreme (87.4% test accuracy). Meanwhile, TRM with removing the need for the extra forward pass. Over-
self-attention generalizes better on the other tasks
(probably due to inductive biases and the overcapac- 1 5M on Sudoku and 19M on Maze
8
Recursive Reasoning with Tiny Networks
all, TRM is much simpler than HRM, while achieving gan training for high fidelity natural image synthe-
better generalization. sis. arXiv preprint arXiv:1809.11096, 2018.
While our approach led to better generalization on 4 Chollet, F. On the measure of intelligence. arXiv
benchmarks, every choice made is not guaranteed to preprint arXiv:1911.01547, 2019.
be optimal on every dataset. For example, we found
that replacing the self-attention with an MLP worked Chollet, F., Knoop, M., Kamradt, G., Landers, B.,
extremely well on Sudoku-Extreme (improving test ac- and Pinkard, H. Arc-agi-2: A new challenge
curacy by 10%), but poorly on other datasets. Different for frontier ai reasoning systems. arXiv preprint
problem settings may require different architectures arXiv:2505.11831, 2025.
or number of parameters. Scaling laws are needed
to parametrize these networks optimally. Although Chowdhery, A., Narang, S., Devlin, J., Bosma, M.,
we simplified and improved on deep recursion, the Mishra, G., Roberts, A., Barham, P., Chung, H. W.,
question of why recursion helps so much compared Sutton, C., Gehrmann, S., et al. Palm: Scaling lan-
to using a larger and deeper network remains to be guage modeling with pathways. Journal of Machine
explained; we suspect it has to do with overfitting, but Learning Research, 24(240):1–113, 2023.
we have no theory to back this explaination. Not all
our ideas made the cut; we briefly discuss some of the Dillion, T. Tdoku: A fast sudoku solver and gener-
failed ideas that we tried but did not work in Section 6. ator. https://t-dillon.github.io/tdoku/,
Currently, recursive reasoning models such as HRM 2025.
and TRM are supervised learning methods rather than
Elman, J. L. Finding structure in time. Cognitive science,
generative models. This means that given an input
14(2):179–211, 1990.
question, they can only provide a single deterministic
answer. In many settings, multiple answers exist for a Fedus, W., Zoph, B., and Shazeer, N. Switch transform-
question. Thus, it would be interesting to extend TRM ers: Scaling to trillion parameter models with simple
to generative tasks. and efficient sparsity. Journal of Machine Learning Re-
search, 23(120):1–39, 2022.
Acknowledgements
Geng, Z. and Kolter, J. Z. Torchdeq: A library for deep
Thank you Emy Gervais for your invaluable support equilibrium models. arXiv preprint arXiv:2310.18605,
and extra push. This research was enabled in part 2023.
by computing resources, software, and technical as-
sistance provided by Mila and the Digital Research Hendrycks, D. and Gimpel, K. Gaussian error linear
Alliance of Canada. units (gelus). arXiv preprint arXiv:1606.08415, 2016.
Jang, Y., Kim, D., and Ahn, S. Hierarchical graph
References generation with k2-trees. In ICML 2023 Workshop on
ARC Prize Foundation. The Hidden Drivers of HRM’s Structured Probabilistic Inference Generative Modeling,
Performance on ARC-AGI. https://arcprize. 2023.
org/blog/hrm-analysis, 2025a. [Online; ac-
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B.,
cessed 2025-09-15].
Chess, B., Child, R., Gray, S., Radford, A., Wu, J.,
ARC Prize Foundation. ARC-AGI Leaderboard. and Amodei, D. Scaling laws for neural language
https://arcprize.org/leaderboard, 2025b. models. arXiv preprint arXiv:2001.08361, 2020.
[Online; accessed 2025-09-24].
Kingma, D. P. and Ba, J. Adam: A method for stochas-
Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium tic optimization. arXiv preprint arXiv:1412.6980,
models. Advances in neural information processing 2014.
systems, 32, 2019.
Krantz, S. G. and Parks, H. R. The implicit function
Bai, X. and Melas-Kyriazi, L. Fixed point diffusion theorem: history, theory, and applications. Springer
models. In Proceedings of the IEEE/CVF Conference Science & Business Media, 2002.
on Computer Vision and Pattern Recognition, pp. 9430–
9440, 2024. LeCun, Y. Une procedure d’apprentissage ponr reseau
a seuil asymetrique. Proceedings of cognitiva 85, pp.
Brock, A., Donahue, J., and Simonyan, K. Large scale 599–604, 1985.
9
Recursive Reasoning with Tiny Networks
Lehnert, L., Sukhbaatar, S., Su, D., Zheng, Q., Mcvay, P., all-mlp architecture for vision. Advances in neural
Rabbat, M., and Tian, Y. Beyond a*: Better planning information processing systems, 34:24261–24272, 2021.
with transformers via search dynamics bootstrap-
ping. arXiv preprint arXiv:2402.14083, 2024. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin,
Lillicrap, T. P. and Santoro, A. Backpropagation I. Attention is all you need. Advances in neural
through time and the brain. Current opinion in neuro- information processing systems, 30, 2017.
biology, 55:82–89, 2019.
Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y.,
Loshchilov, I. and Hutter, F. Decoupled weight decay Lu, M., Song, S., and Yadkori, Y. A. Hierarchical
regularization. arXiv preprint arXiv:1711.05101, 2017. reasoning model. arXiv preprint arXiv:2506.21734,
2025.
Moskvichev, A., Odouard, V. V., and Mitchell, M. The
conceptarc benchmark: Evaluating understanding Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F.,
and generalization in the arc domain. arXiv preprint Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought
arXiv:2305.07141, 2023. prompting elicits reasoning in large language mod-
els. Advances in neural information processing systems,
Palm, R., Paquet, U., and Winther, O. Recurrent re-
35:24824–24837, 2022.
lational networks. Advances in neural information
processing systems, 31, 2018. Werbos, P. Beyond regression: New tools for predic-
tion and analysis in the behavioral sciences. PhD
Park, K. Can convolutional neural networks
thesis, Committee on Applied Mathematics, Harvard
crack sudoku puzzles? https://github.com/
University, Cambridge, MA, 1974.
Kyubyong/sudoku, 2018.
Werbos, P. J. Generalization of backpropagation with
Prieto, L., Barsbey, M., Mediano, P. A., and Birdal, T.
application to a recurrent gas market model. Neural
Grokking at the edge of numerical stability. arXiv
networks, 1(4):339–356, 1988.
preprint arXiv:2501.04697, 2025.
Zhang, B. and Sennrich, R. Root mean square layer
Rumelhart, D. E., Hinton, G. E., and Williams, R. J.
normalization. Advances in Neural Information Pro-
Learning internal representations by error propaga-
cessing Systems, 32, 2019.
tion. Technical report, 1985.
Shazeer, N. Glu variants improve transformer. arXiv
preprint arXiv:2002.05202, 2020.
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le,
Q., Hinton, G., and Dean, J. Outrageously large neu-
ral networks: The sparsely-gated mixture-of-experts
layer. arXiv preprint arXiv:1701.06538, 2017.
Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling
llm test-time compute optimally can be more effec-
tive than scaling model parameters. arXiv preprint
arXiv:2408.03314, 2024.
Song, Y. and Ermon, S. Improved techniques for train-
ing score-based generative models. Advances in
neural information processing systems, 33:12438–12448,
2020.
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu,
Y. Roformer: Enhanced transformer with rotary
position embedding. Neurocomputing, 568:127063,
2024.
Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer,
L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A.,
Keysers, D., Uszkoreit, J., et al. Mlp-mixer: An
10
Recursive Reasoning with Tiny Networks
Hyper-parameters and setup propagating through the whole n + 1 recursions makes
the most sense and works best.
All models are trained with the AdamW opti-
mizer(Loshchilov & Hutter, 2017; Kingma & Ba, 2014) We tried removing ACT with the option of stopping
with β 1 = 0.9, β 2 = 0.95, small learning rate warm- when the solution is reached, but we found that gen-
up (2K iterations), batch-size 768, hidden-size of 512, eralization dropped significantly. This can probably
Nsup = 16 max supervision steps, and stable-max loss be attributed to the fact that the model is spending
(Prieto et al., 2025) for improved stability. TRM uses an too much time on the same data samples rather than
Exponential Moving Average (EMA) of 0.999. HRM focusing on learning on a wide range of data samples.
uses n = 2, T = 2 with two 4-layers networks, while We tried weight tying the input embedding and out-
we use n = 6, T = 3 with one 2-layer network. put head, but this was too constraining and led to a
For Sudoku-Extreme and Maze-Hard, we train for 60k massive generalization drop.
epochs with learning rate 1e-4 and weight decay 1.0. We tried using TorchDEQ (Geng & Kolter, 2023) to
For ARC-AGI, we train for 100K epochs with learning replace the recursion steps by fixed-point iteration as
rate 1e-4 (with 1e-2 learning rate for the embeddings) done by Deep Equilibrium Models (Bai et al., 2019).
and weight decay 0.1. The numbers for Deepseek R1, This would provide a better justification for the 1-step
Claude 3.7 8K, O3-mini-high, Direct prediction, and gradient approximation. However, this slowed down
HRM from the Table 4 and 5 are taken from Wang et al. training due to the fixed-point iteration and led to
(2025). Both HRM and TRM add an embedding of worse generalization. This highlights the fact that
shape [0, 1, D ] on Sudoku-Extreme and Maze-Hard to converging to a fixed-point is not essential.
the input. For ARC-AGI, each puzzle (containing 2-3
training examples and 1-2 test examples) at each data-
augmentation is given a specific embedding of shape
[0, 1, D ] and, at test-time, the most common answer
out of the 1000 data augmentations is given as answer.
Experiments on Sudoku-Extreme were ran with 1 L40S
with 40Gb of RAM for generally less than 36 hours.
Experiments on Maze-Hard were ran with 4 L40S with
40Gb of RAM for less than 24 hours. Experiments on
ARC-AGI were ran for around 3 days with 4 H100
with 80Gb of RAM.
Ideas that failed
In this section, we quickly mention a few ideas that
did not work to prevent others from making the same
mistake.
We tried replacing the SwiGLU MLPs by SwiGLU
Mixture-of-Experts (MoEs) (Shazeer et al., 2017; Fedus
et al., 2022), but we found generalization to decrease
massively. MoEs clearly add too much unnecessary
capacity, just like increasing the number of layers does.
Instead of back-propagating through the whole n + 1
recursions, we tried a compromise between HRM 1-
step gradient approximation, which back-propagates
through the last 2 recursions. We did so by decou-
pling n from the number of last recursions k that we
back-propagate through. For example, while n = 6
requires 7 steps with gradients in TRM, we can use
gradients for only the k = 4 last steps. However, we
found that this did not help generalization in any way,
and it made the approach more complicated. Back-
11
Recursive Reasoning with Tiny Networks
Algorithms with different number of latent Example on Sudoku-Extreme
features
8 3 1
9 6 8 7
def latent recursion(x, z, n=6): 3 5
for i in range(n+1): # latent recursion
z = net(x, z) 6 8
return z 6 2
def deep recursion(x, z, n=6, T=3):
7 4 3
# recursing T−1 times to improve z (no gradients needed) 9 4
with torch.no grad():
for j in range(T−1):
2 4 1
z = latent recursion(x, z, n) 6 2 5 7
# recursing once to improve z
z = latent recursion(x, z, n) Input x
return z.detach(), output head(y), Q head(y)
5 2 6 7 9 4 8 3 1
# Deep Supervision
for x input, y true in train dataloader: 3 9 1 2 6 8 4 7 5
z = z init 4 8 7 3 1 5 2 9 6
for step in range(N supervision):
x = input embedding(x input)
1 6 8 5 3 2 7 4 9
z, y hat, q hat = deep recursion(x, z) 9 3 5 4 7 6 1 8 2
loss = softmax cross entropy(y hat, y true) 7 4 2 9 8 1 5 6 3
loss += binary cross entropy(q hat, (y hat == y true))
z = z.detach() 8 7 3 1 5 9 6 2 4
loss.backward() 2 5 9 6 4 7 3 1 8
opt.step()
opt.zero grad() 6 1 4 8 5 3 9 5 7
if q[0] > 0: # early−stopping
break Output y
5 2 6 7 9 4 8 3 1
Figure 4. Pseudocode of TRM using a single-z with deep 3 9 1 2 6 8 4 7 5
supervision training in PyTorch. 4 8 7 3 1 5 2 9 6
1 6 8 5 3 2 7 4 9
9 3 5 4 7 6 1 8 2
7 4 2 9 8 1 5 6 3
8 7 3 1 5 9 6 2 4
def latent recursion(x, y, z, n=6):
for i in range(n): # latent recursion 2 5 9 6 4 7 3 1 8
z[i] = net(x, y, z[0], ... , z[n−1]) 6 1 4 8 5 3 9 5 7
y = net(y, z[0], ... , z[n−1]) # refine output answer
return y, z Tokenized z H (denoted y in TRM)
def deep recursion(x, y, z, n=6, T=3): 5 5 4 9 4 6 3
# recursing T−1 times to improve y and z (no gradients needed)
with torch.no grad(): 4 3 1 4 6 5
for j in range(T−1): 4 8 4 3 6 6 4
y, z = latent recursion(x, y, z, n)
# recursing once to improve y and z 9 6 5 3 5 4
y, z = latent recursion(x, y, z, n) 3 5 4 3 5 4 4
return (y.detach(), z.detach()), output head(y), Q head(y)
6 3 3 3 5 8 8
# Deep Supervision 3 3 3 6 5 6 6 4
for x input, y true in train dataloader: 7 5 6 3 3 6 6
y, z = y init, z init
for step in range(N supervision): 4 3 4 8 3 6 6 4
x = input embedding(x input)
(y, z), y hat, q hat = deep recursion(x, y, z) Tokenized z L (denoted z in TRM)
loss = softmax cross entropy(y hat, y true)
loss += binary cross entropy(q hat, (y hat == y true))
loss.backward()
opt.step()
Figure 6. This Sudoku-Extreme example shows an input, ex-
opt.zero grad() pected output, and the tokenized z H and z L (after reversing
if q[0] > 0: # early−stopping the embedding and using argmax) for a pretrained model.
break
This highlights the fact that z H corresponds to the predicted
response, while z L is a latent feature that cannot be decoded
Figure 5. Pseudocode of TRM using multi-scale z with deep to a sensible output unless transformed into z H by f H .
supervision training in PyTorch.
12
|