summaryrefslogtreecommitdiff
path: root/papers/txt/hrm2025_hierarchical_reasoning.txt
diff options
context:
space:
mode:
Diffstat (limited to 'papers/txt/hrm2025_hierarchical_reasoning.txt')
-rw-r--r--papers/txt/hrm2025_hierarchical_reasoning.txt1302
1 files changed, 1302 insertions, 0 deletions
diff --git a/papers/txt/hrm2025_hierarchical_reasoning.txt b/papers/txt/hrm2025_hierarchical_reasoning.txt
new file mode 100644
index 0000000..3cc1f2e
--- /dev/null
+++ b/papers/txt/hrm2025_hierarchical_reasoning.txt
@@ -0,0 +1,1302 @@
+ Hierarchical Reasoning Model
+ Guan Wang1,† , Jin Li1 , Yuhao Sun1 , Xing Chen1 , Changling Liu1 ,
+ Yue Wu1 , Meng Lu1,† , Sen Song2,† , Yasin Abbasi Yadkori1,†
+ 1
+ Sapient Intelligence, Singapore
+
+
+
+ Abstract
+
+ Reasoning, the process of devising and executing complex goal-oriented action sequences,
+arXiv:2506.21734v3 [cs.AI] 4 Aug 2025
+
+
+
+
+ remains a critical challenge in AI. Current large language models (LLMs) primarily employ
+ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive
+ data requirements, and high latency. Inspired by the hierarchical and multi-timescale pro-
+ cessing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel
+ recurrent architecture that attains significant computational depth while maintaining both train-
+ ing stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass
+ without explicit supervision of the intermediate process, through two interdependent recurrent
+ modules: a high-level module responsible for slow, abstract planning, and a low-level mod-
+ ule handling rapid, detailed computations. With only 27 million parameters, HRM achieves
+ exceptional performance on complex reasoning tasks using only 1000 training samples. The
+ model operates without pre-training or CoT data, yet achieves nearly perfect performance on
+ challenging tasks including complex Sudoku puzzles and optimal path finding in large mazes.
+ Furthermore, HRM outperforms much larger models with significantly longer context windows
+ on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring artificial
+ general intelligence capabilities. These results underscore HRM’s potential as a transformative
+ advancement toward universal computation and general-purpose reasoning systems.
+
+
+ ARC-AGI-1 ARC-AGI-2 Sudoku-Extreme (9x9) Maze-Hard (30x30)
+ 960 training examples 1120 training examples 1000 training examples 1000 training examples
+ 40.3 5.0 60 55.0 80 74.5
+ 40 5
+ 34.5
+ 60
+ Deepseek R1
+
+
+
+
+ 4
+ Claude 3.7 8K
+
+
+
+
+ 30 40
+ Accuracy %
+
+
+
+
+ 3.0
+ 3
+ Claude 3.7 8K
+
+
+
+
+ Claude 3.7 8K
+ 21.0 21.2
+ Deepseek R1
+
+
+
+
+ Deepseek R1
+ 40
+ o3-mini-high
+
+
+
+
+ o3-mini-high
+
+
+ 20 15.8
+ Direct pred
+
+
+
+
+ Direct pred
+
+
+
+
+ Direct pred
+ o3-mini-high
+
+
+
+
+ o3-mini-high
+ Claude 3.7 8K
+
+
+
+
+ 2 20
+ Deepseek R1
+
+
+
+
+ 1.3
+ Direct pred
+
+
+
+
+ 10 0.9 20
+ 1
+ HRM
+
+
+
+
+ HRM
+
+
+
+
+ HRM
+
+
+
+
+ HRM
+
+ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
+ 0 0 0 0
+ Chain-of-thought, pretrained Direct prediction, small-sample learning
+
+
+
+ Figure 1: Left: HRM is inspired by hierarchical processing and temporal separation in the brain. It
+ has two recurrent networks operating at different timescales to collaboratively solve tasks. Right:
+ With only about 1000 training examples, the HRM (~27M parameters) surpasses state-of-the-art
+ CoT models on inductive benchmarks (ARC-AGI) and challenging symbolic tree-search puzzles
+ (Sudoku-Extreme, Maze-Hard) where CoT models failed completely. The HRM was randomly
+ initialized, and it solved the tasks directly from inputs without chain of thoughts.
+ 2
+ Tsinghua University † Corresponding author. Contact: research@sapient.inc.
+ Code available at: github.com/sapientinc/HRM
+
+ 1
+ 1 Introduction
+Deep learning, as its name suggests, emerged from the idea of stacking more layers to achieve
+increased representation power and improved performance 1,2 . However, despite the remarkable
+success of large language models, their core architecture is paradoxically shallow 3 . This imposes
+a fundamental constraint on their most sought-after capability: reasoning. The fixed depth of stan-
+dard Transformers places them in computational complexity classes such as AC 0 or T C 0 4 , prevent-
+ing them from solving problems that require polynomial time 5,6 . LLMs are not Turing-complete
+and thus they cannot, at least in a purely end-to-end manner, execute complex algorithmic rea-
+soning that is necessary for deliberate planning or symbolic manipulation tasks 7,8 . For example,
+our results on the Sudoku task show that increasing Transformer model depth can improve per-
+formance,1 but performance remains far from optimal even with very deep models (see Figure 2),
+which supports the conjectured limitations of the LLM scaling paradigm 9 .
+The LLMs literature has relied largely on Chain-of-Thought (CoT) prompting for reasoning 10 .
+CoT externalizes reasoning into token-level language by breaking down complex tasks into sim-
+pler intermediate steps, sequentially generating text using a shallow model 11 . However, CoT for
+reasoning is a crutch, not a satisfactory solution. It relies on brittle, human-defined decompositions
+where a single misstep or a misorder of the steps can derail the reasoning process entirely 12,13 . This
+dependency on explicit linguistic steps tethers reasoning to patterns at the token level. As a result,
+CoT reasoning often requires significant amount of training data and generates a large number of
+tokens for complex reasoning tasks, resulting in slow response times. A more efficient approach is
+needed to minimize these data requirements 14 .
+Towards this goal, we explore “latent reasoning”, where the model conducts computations within
+its internal hidden state space 15,16 . This aligns with the understanding that language is a tool for
+human communication, not the substrate of thought itself 17 ; the brain sustains lengthy, coherent
+chains of reasoning with remarkable efficiency in a latent space, without constant translation back
+to language. However, the power of latent reasoning is still fundamentally constrained by a model’s
+effective computational depth. Naively stacking layers is notoriously difficult due to vanishing gra-
+dients, which plague training stability and effectiveness 1,18 . Recurrent architectures, a natural al-
+ternative for sequential tasks, often suffer from early convergence, rendering subsequent computa-
+tional steps inert, and rely on the biologically implausible, computationally expensive and memory
+intensive Backpropagation Through Time (BPTT) for training 19 .
+The human brain provides a compelling blueprint for achieving the effective computational depth
+that contemporary artificial models lack. It organizes computation hierarchically across corti-
+cal regions operating at different timescales, enabling deep, multi-stage reasoning 20,21,22 . Recur-
+rent feedback loops iteratively refine internal representations, allowing slow, higher-level areas to
+guide, and fast, lower-level circuits to execute—subordinate processing while preserving global
+coherence 23,24,25 . Notably, the brain achieves such depth without incurring the prohibitive credit-
+assignment costs that typically hamper recurrent networks from backpropagation through time 19,26 .
+Inspired by this hierarchical and multi-timescale biological architecture, we propose the Hierar-
+chical Reasoning Model (HRM). HRM is designed to significantly increase the effective compu-
+tational depth. It features two coupled recurrent modules: a high-level (H) module for abstract,
+deliberate reasoning, and a low-level (L) module for fast, detailed computations. This structure
+ 1
+ Simply increasing the model width does not improve performance here.
+
+ 2
+ 100 100
+ Scaling Width - 8 layers fixed Transformer
+ Scaling Depth - 512 hidden size fixed Recurrent Transformer
+ 80 80 HRM
+ Accuracy %
+ 60 60
+
+ 40 40
+
+ 20 20
+ 27M 54M 109M 218M 436M 872M 8 16 32 64 128 256 512
+ Parameters Depth / Transformer layers computed
+
+Figure 2: The necessity of depth for complex reasoning. Left: On Sudoku-Extreme Full, which
+require extensive tree-search and backtracking, increasing a Transformer’s width yields no perfor-
+mance gain, while increasing depth is critical. Right: Standard architectures saturates, failing to
+benefit from increased depth. HRM overcomes this fundamental limitation, effectively using its
+computational depth to achieve near-perfect accuracy.
+
+avoids the rapid convergence of standard recurrent models through a process we term “hierarchi-
+cal convergence.” The slow-updating H-module advances only after the fast-updating L-module
+has completed multiple computational steps and reached a local equilibrium, at which point the
+L-module is reset to begin a new computational phase.
+Furthermore, we propose a one-step gradient approximation for training HRM, which offers im-
+proved efficiency and eliminates the requirement for BPTT. This design maintains a constant mem-
+ory footprint (O(1) compared to BPTT’s O(T ) for T timesteps) throughout the backpropagation
+process, making it scalable and more biologically plausible.
+Leveraging its enhanced effective depth, HRM excels at tasks that demand extensive search and
+backtracking. Using only 1,000 input-output examples, without pre-training or CoT supervi-
+sion, HRM learns to solve problems that are intractable for even the most advanced LLMs. For
+example, it achieves near-perfect accuracy in complex Sudoku puzzles (Sudoku-Extreme Full) and
+optimal pathfinding in 30x30 mazes, where state-of-the-art CoT methods completely fail (0% ac-
+curacy). In the Abstraction and Reasoning Corpus (ARC) AGI Challenge 27,28,29 - a benchmark
+of inductive reasoning - HRM, trained from scratch with only the official dataset (~1000 exam-
+ples), with only 27M parameters and a 30x30 grid context (900 tokens), achieves a performance
+of 40.3%, which substantially surpasses leading CoT-based models like o3-mini-high (34.5%)
+and Claude 3.7 8K context (21.2%), despite their considerably larger parameter sizes and con-
+text lengths, as shown in Figure 1. This represents a promising direction toward the development
+of next-generation AI reasoning systems with universal computational capabilities.
+
+
+2 Hierarchical Reasoning Model
+We present the HRM, inspired by three fundamental principles of neural computation observed in
+the brain:
+• Hierarchical processing: The brain processes information across a hierarchy of cortical ar-
+ eas. Higher-level areas integrate information over longer timescales and form abstract repre-
+ sentations, while lower-level areas handle more immediate, detailed sensory and motor process-
+ ing 20,22,21 .
+
+ 3
+ • Temporal Separation: These hierarchical levels in the brain operate at distinct intrinsic timescales,
+ reflected in neural rhythms (e.g., slow theta waves, 4–8 Hz and fast gamma waves, 30–100
+ Hz) 30,31 . This separation allows for stable, high-level guidance of rapid, low-level computa-
+ tions 32,33 .
+• Recurrent Connectivity: The brain features extensive recurrent connections. These feedback
+ loops enable iterative refinement, yielding more accurate and context-sensitive representations
+ at the cost of additional processing time. Additionally, the brain largely avoids the problematic
+ deep credit assignment problem associated with BPTT 19 .
+The HRM model consists of four learnable components: an input network fI (·; θI ), a low-level re-
+current module fL (·; θL ), a high-level recurrent module fH (·; θH ), and an output network fO (·; θO ).
+The model’s dynamics unfold over N high-level cycles of T low-level timesteps each2 . We index
+the total timesteps of one forward pass by i = 1, . . . , N × T . The modules fL and fH each keep a
+hidden state—zLi for fL and zH i
+ for fH —which are initialized with the vectors zL0 and zH0
+ , respec-
+tively.
+The HRM maps an input vector x to an output prediction vector ŷ as follows. First, the input x is
+projected into a working representation x̃ by the input network:
+
+ x̃ = fI (x; θI ) .
+At each timestep i, the L-module updates its state conditioned on its own previous state, the H-
+module’s current state (which remains fixed throughout the cycle), and the input representation.
+The H-module only updates once per cycle (i.e., every T timesteps) using the L-module’s final
+state at the end of that cycle:
+
+ zLi = fL zLi−1 , zH
+ i−1
+ 
+ , x̃; θL ,
+ (
+ i−1 i−1
+ 
+ i fH zH , zL ; θH if i ≡ 0 (mod T ) ,
+ zH = i−1
+ zH otherwise .
+
+Finally, after N full cycles, a prediction ŷ is extracted from the hidden state of the H-module:
+
+
+ NT
+ ŷ = fO (zH ; θO ) .
+
+This entire N T -timestep process represents a single forward pass of the HRM. A halting mecha-
+nism (detailed later in this section) determines whether the model should terminate, in which case
+ŷ will be used as the final prediction, or continue with an additional forward pass.
+Hierarchical convergence Although convergence is crucial for recurrent networks, standard RNNs
+are fundamentally limited by their tendency to converge too early. As the hidden state settles toward
+a fixed point, update magnitudes shrink, effectively stalling subsequent computation and capping
+the network’s effective depth. To preserve computational power, we actually want convergence to
+proceed very slowly–but engineering that gradual approach is difficult, since pushing convergence
+too far edges the system toward instability.
+ 2
+ While inspired by temporal separation in the brain, our model’s “high-level” and “low-level” modules are concep-
+tual abstractions and do not map directly to specific neural oscillation frequencies.
+
+
+ 4
+ 250 250 250
+ HRM H Recurrent Neural Net Deep Neural Net
+ 200 200 200
+Forward residual
+ HRM L
+ 150 150 150
+ 100 100 100
+ 50 50 50
+ 0 0 0
+ 0 20 40 60 0 20 40 60 0 100 200
+ Step Index # Step Index # Layer Index #
+
+ 60 60
+
+
+
+
+ Layer Index #
+ Step Index #
+
+
+
+
+ Step Index #
+ 200
+ 30 30 100
+
+
+ Principal Components Principal Components Principal Components
+
+Figure 3: Comparison of forward residuals and PCA trajectories. HRM shows hierarchical conver-
+gence: the H-module steadily converges, while the L-module repeatedly converges within cycles
+before being reset by H, resulting in residual spikes. The recurrent neural network exhibits rapid
+convergence with residuals quickly approaching zero. In contrast, the deep neural network experi-
+ences vanishing gradients, with significant residuals primarily in the initial (input) and final layers.
+
+HRM is explicitly designed to counteract this premature convergence through a process we term
+hierarchical convergence. During each cycle, the L-module (an RNN) exhibits stable convergence
+to a local equilibrium. This equilibrium, however, depends on the high-level state zH supplied
+during that cycle. After completing the T steps, the H-module incorporates the sub-computation’s
+outcome (the final state zL ) and performs its own update. This zH update establishes a fresh context
+for the L-module, essentially “restarting” its computational path and initiating a new convergence
+phase toward a different local equilibrium.
+This process allows the HRM to perform a sequence of distinct, stable, nested computations, where
+the H-module directs the overall problem-solving strategy and the L-module executes the intensive
+search or refinement required for each step. Although a standard RNN may approach convergence
+within T iterations, the hierarchical convergence benefits from an enhanced effective depth of N T
+steps. As empirically shown in Figure 3, this mechanism allows HRM both to maintain high
+computational activity (forward residual) over many steps (in contrast to a standard RNN, whose
+activity rapidly decays) and to enjoy stable convergence. This translates into better performance at
+any computation depth, as illustrated in Figure 2.
+Approximate gradient Recurrent models typically use BPTT to compute gradients. However,
+BPTT requires storing the hidden states from the forward pass and then combining them with
+gradients during the backward pass, which demands O(T ) memory for T timesteps. This heavy
+memory burden forces smaller batch sizes and leads to poor GPU utilization, especially for large-
+scale networks. Additionally, because retaining the full history trace through time is biologically
+implausible, it is unlikely that the brain implements BPTT 19 .
+Fortunately, if a recurrent neural network converges to a fixed point, we can avoid unrolling its state
+sequence by applying backpropagation in a single step at that equilibrium point. Moreover, such a
+mechanism could plausibly be implemented in the brain using only local learning rules 34,35 . Based
+
+ 5
+ on this finding, we propose a one-step approximation of the HRM gradient–using the gradient of
+the last state of each module and treating other states as constant. The gradient path is, therefore,
+
+ Output head → final state of the H-module → final state of the L-module → input embedding
+
+The above method needs O(1) memory, does not require unrolling through time, and can be easily
+implemented with an autograd framework such as PyTorch, as shown in Figure 4. Given that
+each module only needs to back-propagate errors through its most recent local synaptic activity,
+this approach aligns well with the perspective that cortical credit assignment relies on short-range,
+temporally local mechanisms rather than on a global replay of activity patterns.
+The one-step gradient approximation is theoretically
+grounded in the mathematics of Deep Equilibrium Mod-
+els (DEQ) 36 which employs the Implicit Function Theo-
+rem (IFT) to bypass BPTT, as detailed next. Consider an
+idealized HRM behavior where, during high-level cycle
+k, the L-module repeatedly updates until its state zL con- def hrm(z, x, N=2, T=2):
+ x = input_embedding(x)
+verges to a local fixed point zL⋆ . This fixed point, given zH, zL = z
+ k−1
+the current high-level state zH , can be expressed as with torch.no_grad():
+ for _i in range(N ∗ T − 1):
+ k−1 zL = L_net(zL, zH, x)
+ zL⋆ = fL (zL⋆ , zH , x̃; θL ) . if (_i + 1) % T == 0:
+ zH = H_net(zH, zL)
+
+The H-module then performs a single update using this # 1−step grad
+ zL = L_net(zL, zH, x)
+converged L-state: zH = H_net(zH, zL)
+ return (zH, zL), output_head(zH)
+ k k−1 ⋆
+ zH = fH (zH , zL ; θH ) . # Deep Supervision
+ for x, y_true in train_dataloader:
+ z = z_init
+With a proper mapping F, the updates to the high-level for step in range(N_supervision):
+ k z, y_hat = hrm(z, x)
+state can be written in a more compact form as zH =
+ k−1 loss = softmax_cross_entropy(y_hat, y_true)
+F(zH ; x̃, θ), where θ = (θI , θL ), and the fixed-point z = z.detach()
+ ⋆ ⋆ ∂F
+can be written as zH = F(zH ; x̃, θ). Let JF = ∂zH be loss.backward()
+the Jacobian of F, and assume that the matrix I − JF is opt.step()
+ opt.zero_grad()
+ ⋆
+invertible at zH and that the mapping F is continuously
+differentiable. The Implicit Function Theorem then al- Figure 4: Top: Diagram of HRM with
+ ⋆
+lows us to calculate the exact gradient of fixed point zH approximate gradient. Bottom: Pseu-
+with respect to the parameters θ without explicit back- docode of HRM with deep supervision
+propagation: training in PyTorch.
+ ⋆ −1 ∂F
+ ∂zH 
+ = I − JF z⋆ . (1)
+ ∂θ H ∂θ z⋆
+ H
+
+
+Calculating the above gradient requires evaluating and inverting matrix (I − JF ) that can be com-
+putationally expensive. Given the Neumann series expansion,
+ (I − JF )−1 = I + JF + JF2 + JF3 + . . . ,
+the so-called 1-step gradient 37 approximates the series by considering only its first term, i.e. (I −
+JF )−1 ≈ I, and leads to the following approximation of Equation (1):
+ ∗ ∗
+ ∂zH ∂fH ∂zH ∂fH ∂zL∗ ∗
+ ∂zH ∂fH ∂zL∗
+ ≈ , ≈ · , ≈ · . (2)
+ ∂θH ∂θH ∂θL ∂zL∗ ∂θL ∂θI ∂zL∗ ∂θI
+ 6
+ ∂z ∗ ∂z ∗
+The gradients of the low-level fixed point, ∂θLL and ∂θLI , can also be approximated using another
+application of the 1-step gradient:
+ ∂zL∗ ∂fL ∂zL∗ ∂fL
+ ≈ , ≈ . (3)
+ ∂θL ∂θL ∂θI ∂θI
+By substituting Equation (3) back into Equation (2), we arrive at the final simplified gradients.
+Before defining our loss function, we must first introduce two key elements of our proposed
+method: deep supervision and adaptive computational time.
+Deep supervision Inspired by the principle that periodic neural oscillations regulate when learning
+occurs in the brain 38 , we incorporate a deep supervision mechanism into HRM, as detailed next.
+Given a data sample (x, y), we run multiple forward passes of the HRM model, each of which we
+refer to as a segment. Let M denote the total number of segments executed before termination.
+For each segment m ∈ {1, . . . , M }, let z m = (zHmN T
+ , zLmN T ) represent the hidden state at the
+conclusion of segment m, encompassing both high-level and low-level state components.
+At each segment m, we apply a deep supervision step as follows:
+ 1. Given the state z m−1 from the previous segment, compute the next state z m and its associated
+ output ŷ m through a forward pass in the HRM model:
+
+ (z m , ŷ m ) ← HRM(z m−1 , x; θ)
+
+ 2. Compute the loss for the current segment:
+
+ Lm ← L OSS(ŷ m , y)
+
+ 3. Update parameters:
+
+ θ ← O PTIMIZER S TEP(θ, ∇θ Lm )
+
+The crucial aspect of this procedure is that the hidden state z m is “detached” from the computa-
+tion graph before being used as the input state for the next segment. Consequently, gradients from
+segment m + 1 do not propagate back through segment m, effectively creating a 1-step approxi-
+mation of the gradient of the recursive deep supervision process 39,40 . This approach provides more
+frequent feedback to the H-module and serves as a regularization mechanism, demonstrating supe-
+rior empirical performance and enhanced stability in deep equilibrium models when compared to
+more complex, Jacobian-based regularization techniques 39,41 . Figure 4 shows pseudocode of deep
+supervision training.
+Adaptive computational time (ACT) The brain dynamically alternates between automatic think-
+ing (“System 1”) and deliberate reasoning (“System 2”) 42 . Neuroscientific evidence shows that
+these cognitive modes share overlapping neural circuits, particularly within regions such as the
+prefrontal cortex and the default mode network 43,44 . This indicates that the brain dynamically mod-
+ulates the “runtime” of these circuits according to task complexity and potential rewards 45,46 .
+Inspired by the above mechanism, we incorporate an adaptive halting strategy into HRM that en-
+ables “thinking, fast and slow”. This integration leverages deep supervision and uses the Q-learning
+
+
+
+ 7
+ algorithm 47 to adaptively determine the number of segments. A Q-head uses the final state of the
+H-module to predict the Q-values Q̂m = (Q̂m m
+ halt , Q̂continue ) of the “halt” and “continue” actions:
+
+ ⊤ mN T
+ Q̂m = σ(θQ zH ) ,
+where σ denotes the sigmoid function applied element-wise. The halt or continue action is chosen
+using a randomized strategy as detailed next. Let Mmax denote the maximum number of segments
+(a fixed hyperparameter) and Mmin denote the minimum number of segments (a random variable).
+The value of Mmin is determined stochastically: with probability ε, it is sampled uniformly from the
+set {2, · · · , Mmax } (to encourage longer thinking), and with probability 1 − ε, it is set to 1. The halt
+action is selected under two conditions: when the segment count surpasses the maximum threshold
+Mmax , or when the estimated halt value Q̂halt exceeds the estimated continue value Q̂continue and the
+segment count has reached at least the minimum threshold Mmin .
+The Q-head is updated through a Q-learning algorithm, which is defined on the following episodic
+Markov Decision Process (MDP). The state of the MDP at segment m is z m , and the action space
+is {halt, continue}. Choosing the action “halt” terminates the episode and returns a binary reward
+indicating prediction correctness, i.e., 1{ŷ m = y}. Choosing “continue” yields a reward of 0 and
+the state transitions to z m+1 . Thus, the Q-learning targets for the two actions Ĝm = (Ĝm m
+ halt , Ĝcontinue )
+are given by
+
+ Ĝm m
+ halt = 1{ŷ = y} ,
+ 
+ Q̂m+1
+ halt , if m ≥ Nmax ,
+ m
+ Ĝcontinue =
+ max(Q̂m+1 , Q̂m+1 ) , otherwise .
+ halt continue
+
+We can now define the loss function of our learning procedure. The overall loss for each supervision
+segment combines both the Q-head loss and the sequence-to-sequence loss:
+
+ Lm m m m
+ ACT = L OSS (ŷ , y) + B INARY C ROSS E NTROPY (Q̂ , Ĝ ) .
+
+Minimizing the above loss enables both accurate predictions and nearly optimal stopping decisions.
+Selecting the “halt” action ends the supervision loop. In practice, sequences are processed in
+batches, which can be easily handled by substituting any halted sample in the batch with a fresh
+sample from the dataloader.
+Figure 5 presents a performance comparison between two HRM variants: one incorporating ACT
+and another employing a fixed computational step count equivalent to ACT’s Mmax parameter. It
+shows that ACT effectively adapts its computational resources based on task complexity, achieving
+significant computational savings with minimal impact on performance.
+Inference-time scaling An effective neural model should exploit additional computational re-
+sources during inference to enhance performance. As illustrated in Figure 5-(c), HRM seamlessly
+achieves inference-time scaling by simply increasing the computational limit parameter, Mmax
+without requiring further training or architectural modifications.
+Additional compute is especially effective for tasks that demand deeper reasoning. On Sudoku—
+a problem that often requires long-term planning—HRM exhibits strong inference-time scaling.
+On the other hand, we find that extra computational resources yield minimal gains in ARC-AGI
+challenge, as solutions generally require only a few transformations.
+
+ 8
+ (a) ACT Compute Spent (b) ACT Performance (c) Inference-time scaling
+ 8 100.0 100.0
+ Fixed M Fixed M
+ 7
+Mean Compute Steps
+
+ ACT (Mmax limit) 97.5 ACT (Mmax limit) 97.5
+ 6 95.0 95.0
+
+
+
+
+ Accuracy %
+
+
+
+
+ Accuracy %
+ 5 92.5 92.5
+ 4 90.0 90.0
+ 3 87.5 87.5 Train Mmax = 2
+ 85.0 85.0 Train Mmax = 4
+ 2 Train Mmax = 8
+ 1 82.5 82.5
+ 2 4 8 2 4 8 2 4 8 16
+ M (Fixed) or Mmax (ACT) M (Fixed) or Mmax (ACT) Inference Mmax
+
+
+Figure 5: Effectiveness of Adaptive Computation Time (ACT) on the Sudoku-Extreme-Full. (a)
+Mean compute steps used by models with ACT versus models with a fixed number of compute steps
+(M ). ACT maintains a low and stable number of average compute steps even as the maximum limit
+(Mmax ) increases. (b) Accuracy comparison. The ACT model achieves performance comparable
+to the fixed-compute model while utilizing substantially fewer computational steps on average. (c)
+Inference-time scalability. Models trained with a specific Mmax can generalize to higher compu-
+tational limits during inference, leading to improved accuracy. For example, a model trained with
+Mmax = 8 continues to see accuracy gains when run with Mmax = 16 during inference.
+
+Stability of Q-learning in ACT The deep Q-learning that underpins our ACT mechanism is
+known to be prone to instability, often requiring stabilization techniques such as replay buffers
+and target networks 48 , which are absent in our design. Our approach, however, achieves stability
+through the intrinsic properties of our model and training procedure. Recent theoretical work by
+Gallici et al. 49 shows that Q-learning can achieve convergence if network parameters are bounded,
+weight decay is incorporated during training, and post-normalization layers are implemented. Our
+model satisfies these conditions through its Post-Norm architecture that employs RMSNorm (a
+layer normalization variant) and the AdamW optimizer. AdamW has been shown to solve an L∞ -
+constrained optimization problem, ensuring that model parameters remain bounded by 1/λ 50 .
+Architectural details We employ a sequence-to-sequence architecture for HRM. Both input and
+output are represented as token sequences: x = (x1 , . . . , xl ) and y = (y1 , . . . , yl′ ) respectively.
+The model includes an embedding layer fI that converts discrete tokens into vector representa-
+tions, and an output head fO (z; θO ) = softmax(θO z) that transforms hidden states into token prob-
+ability distributions ŷ. For small-sample experiments, we replace softmax with stablemax 51 to
+improve generalization performance. The sequence-to-sequence loss is averaged over all tokens,
+ Pl′
+L OSS(ŷ, y) = l1′ i=1 log p(yi ), where p(yi ) is the probability that distribution ŷi assigns to token
+yi . The initial hidden states z 0 are initialized by sampling from a truncated normal distribution with
+standard deviation of 1, truncation of 2, and kept fixed throughout training.
+Both the low-level and high-level recurrent modules fL and fH are implemented using encoder-
+only Transformer 52 blocks with identical architectures and dimensions. These modules take mul-
+tiple inputs, and we use straightforward element-wise addition to combine them, though more
+sophisticated merging techniques such as gating mechanisms could potentially improve perfor-
+mance and is left for future work. For all Transformer blocks in this work—including those in
+the baseline models—we incorporate the enhancements found in modern LLMs (based on Llama 53
+architectures). These improvements include Rotary Positional Encoding 54 , Gated Linear Units 55 ,
+RMSNorm 56 , and the removal of bias terms from linear layers.
+Furthermore, both HRM and recurrent Transformer models implement a Post-Norm architecture
+
+ 9
+ 8 4 5 6
+ 8 7
+ 3 4
+ 3 8 4 2
+ 6 3 8
+ 9 6
+ 5
+ 2 1
+ 2 5 3 8
+
+
+
+
+ 7 8 4 1 2 5 9 6 3
+ 2 6 1 3 8 9 7 4 5
+ 3 5 9 6 4 7 8 1 2
+ 5 3 8 4 9 6 1 2 7
+ 4 1 6 2 7 3 5 9 8
+ 9 7 2 8 5 1 4 3 6
+ 6 9 3 5 1 8 2 7 4
+ 8 4 7 9 6 2 3 5 1
+ 1 2 5 7 3 4 6 8 9
+
+
+
+
+ (a) ARC-AGI (b) Sudoku-Hard (c) Maze navigation (d) Sudoku-Extreme subset difficulty
+
+Figure 6: Left: Visualization of benchmark tasks. Right: Difficulty of Sudoku-Extreme examples.
+
+with weights initialized via truncated LeCun Normal initialization 57,58,59 , while the scale and bias
+parameters are excluded from RMSNorm. All parameters are optimized using the Adam-atan2 op-
+timizer 60 , a scale-invariant variant of Adam 61 , combined with a constant learning rate that includes
+linear warm-up.
+
+
+3 Results
+This section begins by describing the ARC-AGI, Sudoku, and Maze benchmarks, followed by an
+overview of the baseline models and their results. Figure 6-(a,b,c) presents a visual representa-
+tion of the three benchmark tasks, which are selected to evaluate various reasoning abilities in AI
+models.
+
+3.1 Benchmarks
+ARC-AGI Challenge The ARC-AGI benchmark evaluates general fluid intelligence through IQ-
+test-like puzzles that require inductive reasoning 27 . The initial version, ARC-AGI-1, presents chal-
+lenges as input-label grid pairs that force AI systems to extract and generalize abstract rules from
+just a few examples. Each task provides a few input–output demonstration pairs (usually 2–3) and
+a test input. An AI model has two attempts to produce the correct output grid. Although some be-
+lieve that mastering ARC-AGI would signal true artificial general intelligence, its primary purpose
+is to expose the current roadblocks in AGI progress. In fact, both conventional deep learning meth-
+ods and CoT techniques have faced significant challenges with ARC-AGI-1, primarily because it
+requires the ability to generalize to entirely new tasks 28 .
+Addressing the limitations identified in ARC-AGI-1, ARC-AGI-2 significantly expands the bench-
+mark by providing a more comprehensive and carefully refined collection of tasks. These new
+tasks emphasize deeper compositional reasoning, multi-step logic, contextual rule application, and
+symbolic abstraction. Human calibration studies show these tasks are challenging but doable for
+people, while being much harder for current AI systems, offering a clearer measure of general
+reasoning abilities 29 .
+
+
+ 10
+ Sudoku-Extreme Sudoku is a 9×9 logic puzzle, requiring each row, column, and 3×3 block to
+contain the digits 1–9 exactly once. A prediction is considered correct if it exactly matches the
+puzzle’s unique solution. Sudoku’s complex logical structure makes it a popular benchmark for
+evaluating logical reasoning in machine learning 62,63,64 .
+The most frequently used Sudoku dataset in research, namely the Kaggle dataset 65 , can be fully
+solved using elementary single-digit techniques 66 . The minimal 17-clue puzzles 62 , another widely-
+used collection, might seem more challenging due to its small number of clues. However, this
+perception is misleading—since 17 represents the minimum number of clues required to guarantee
+a unique Sudoku solution, these hints need to be highly orthogonal to each other. This orthogonal
+arrangement leads to many direct, easily-resolved solution paths 67 .
+We introduce Sudoku-Extreme, a more challenging dataset that is compiled from the aforemen-
+tioned easy datasets as well as puzzles recognized by the Sudoku community as exceptionally
+difficult for human players:
+• Easy puzzles compiled from Kaggle, 17-clue, plus unbiased samples from the Sudoku puzzle
+ distribution 67 : totaling 1 149 158 puzzles.
+• Challenging puzzles compiled from Magictour 1465, Forum-Hard and Forum-Extreme subsets:
+ totaling 3 104 157 puzzles.
+The compiled data then undergo a strict 90/10 train-test split, ensuring that the test set puzzles
+cannot be derived through equivalent transformations of any training samples. Sudoku-Extreme is
+a down-sampled subset of this data containing 1000 training examples. We use Sudoku-Extreme in
+our main experiments (Figure 1), which focuses on small-sample learning scenarios. To guarantee
+convergence and control overfitting effects in our analysis experiments (Figures 2, 3 and 5), we use
+the complete training data, Sudoku-Extreme-Full, containing 3 831 994 examples.
+We measure puzzle difficulty by counting the number of search backtracks (“guesses”) required
+by a smart Sudoku solver program tdoku, which uses propositional logic to reduce the number of
+guesses 67 . Our Sudoku-Extreme dataset exhibits a mean difficulty of 22 backtracks per puzzle, sig-
+nificantly higher than existing datasets, including recent handmade puzzles Sudoku-Bench 68 which
+average just 0.45 backtracks per puzzle. These subset complexity levels are shown in Figure 6-(d).
+Maze-Hard This task involves finding the optimal path in a 30×30 maze, making it interpretable
+and frequently used for training LLMs in search tasks 69,70,71 . We adopt the instance generation
+procedure of Lehnert et al. 71 , but introduce an additional filter to retain only those instances whose
+difficulty exceeds 110. Here, “difficulty” is defined as the length of the shortest path, which aligns
+with the linear time complexity of the wavefront breadth-first search algorithm on GPUs 72 . A path
+is considered correct if it is valid and optimal—that is, the shortest route from the start to the goal.
+The training and test set both include 1000 examples.
+
+3.2 Evaluation Details
+For all benchmarks, HRM models were initialized with random weights and trained in the sequence-
+to-sequence setup using the input-output pairs. The two-dimensional input and output grids were
+flattened and then padded to the maximum sequence length. The resulting performance is shown in
+Figure 1. Remarkably, HRM attains these results with just ~1000 training examples per task—and
+without pretraining or CoT labels.
+
+
+ 11
+ For ARC-AGI challenge, we start with (1) all demonstration and test input-label pairs from the
+training set, and (2) all demonstration pairs along with test inputs from the evaluation set. The
+dataset is augmented by applying translations, rotations, flips, and color permutations to the puz-
+zles. Each task example is prepended with a learnable special token that represents the puzzle it
+belongs to. At test time, we proceed as follows for each test input in the evaluation set: (1) Gener-
+ate and solve 1000 augmented variants and, for each, apply the inverse-augmentation transform to
+obtain a prediction. (2) Choose the two most popular predictions as the final outputs.3 All reported
+results are obtained by comparing the outputs with the withheld test labels from the evaluation set.
+We augment Sudoku puzzles by applying band and digit permutations, while data augmentation is
+disabled for Maze tasks. Both tasks undergo only a single inference pass.
+For ARC-AGI, the scores of the CoT models are taken from the official leaderboard 29 , while for
+Sudoku and Maze, the scores are obtained by evaluating through the corresponding API.
+In Figure 1, the baselines are grouped based on whether they are pre-trained and use CoT, or neither.
+The “Direct pred” baseline means using “direct prediction without CoT and pre-training”, which
+retains the exact training setup of HRM but swaps in a Transformer architecture. Interestingly, on
+ARC-AGI-1, “Direct pred” matches the performance of Liao and Gu 73 , who built a carefully de-
+signed, domain-specific equivariant network for learning the ARC-AGI task from scratch, without
+pre-training. By substituting the Transformer architecture with HRM’s hierarchical framework and
+implementing ACT, we achieve more than a twofold performance improvement.
+On the Sudoku-Extreme and Maze-Hard benchmarks, the performance gap between HRM and the
+baseline methods is significant, as the baselines almost never manage to solve the tasks. These
+benchmarks that demand lengthy reasoning traces are particularly difficult for CoT-based methods.
+With only 1000 training examples, the “Direct pred” baseline—which employs an 8-layer Trans-
+former identical in size to HRM—fails entirely on these challenging reasoning problems. When
+trained on the larger Sudoku-Extreme-Full dataset, however, “Direct pred” can solve some easy
+Sudoku puzzles and reaches 16.9% accuracy (see Figure 2). Lehnert et al. 71 showed that a large
+vanilla Transformer model with 175M parameters, trained on 1 million examples across multiple
+trials, achieved only marginal success on 30x30 Maze tasks, with accuracy below 20% using the
+pass@64 evaluation metric.
+
+3.3 Visualization of intermediate timesteps
+Although HRM demonstrates strong performance on complex reasoning tasks, it raises an intrigu-
+ing question: what underlying reasoning algorithms does the HRM neural network actually imple-
+ment? Addressing this question is important for enhancing model interpretability and developing a
+deeper understanding of the HRM solution space.
+While a definitive answer lies beyond our current scope, we begin our investigation by analyzing
+state trajectories and their corresponding solution evolution. More specifically, at each timestep
+i and given the low-level and high-level state pair (zLi and zH i
+ ) we perform a preliminary forward
+ i i i
+pass through the H-module to obtain z̄ = fH (zH , zL ; θH ) and its corresponding decoded prediction
+ȳ i = fO (z̄ i ; θO ). The prediction ȳ i is then visualized in Figure 7.
+In the Maze task, HRM appears to initially explore several potential paths simultaneously, subse-
+quently eliminating blocked or inefficient routes, then constructing a preliminary solution outline
+ 3
+ The ARC-AGI allows two attempts for each test input.
+
+ 12
+ Timestep i = 0 Timestep i = 1 Timestep i = 2 Timestep i = 3 Timestep i = 4 Timestep i = 5 Timestep i = 6
+
+
+
+
+ Initial Timestep i = 0 Timestep i = 1 Timestep i = 2 Timestep i = 3 Timestep i = 4 Timestep i = 5 Timestep i = 6 Timestep i = 7
+ 4 89
+ 2 4 3 5 7 1 8 9 6 2 4 3 5 7 1 8 9 6 2 4 3 5 7 1 8 9 3 2 4 3 5 7 1 8 9 3 2 4 1 6 7 5 8 9 3 2 4 1 6 7 5 8 9 3 2 4 1 6 7 5 8 9 3 2 4 1 6 7 5 8 9 3
+ 7 3 1
+ 6 7 8 6 3 4 1 5 4 6 7 8 6 3 4 1 5 4 8 7 9 6 3 4 1 5 2 8 7 9 6 3 4 1 5 2 6 7 9 6 3 4 1 5 2 6 7 9 9 3 4 1 5 2 6 7 9 8 3 4 1 5 2 6 7 9 8 3 4 1 5 2
+ 2 6 5 1 2 7 9 7 3 4 6 5 1 2 8 9 7 3 4 6 5 1 2 8 8 7 3 4 6 5 1 2 8 9 7 3 4 6 5 3 2 1 8 7 6 4 9 5 1 2 8 8 7 3 4 8 5 3 2 1 9 7 6 4 8 5 3 2 1 9 7 6 4
+ 67 8 3 4 8 6 7 2 1 2 5 3 4 8 6 7 2 1 9 5 3 4 8 6 7 2 1 5 5 3 4 8 6 7 2 1 5 5 3 4 9 6 7 2 1 5 5 3 4 8 6 7 2 1 8 5 3 4 9 6 7 2 1 8 5 3 4 9 6 7 2 1 8
+ 3 4 7 2 8 3 1 8 6 4 8 7 2 5 3 1 5 6 4 8 7 2 5 3 1 5 6 4 9 7 2 5 3 1 1 6 4 6 7 2 5 3 8 1 6 4 6 7 2 5 3 1 1 6 4 6 7 2 8 3 8 1 6 4 6 7 2 8 3 5 1 6 4 9
+1 64 23 15649237 9 19645237 7 19645237 7 19645237 7 19645238 7 19645237 7 19648237 5 19648237 5
+ 27 3 8 1 2 7 9 3 4 6 6 9 1 2 7 4 3 9 8 6 9 1 2 7 4 3 5 6 8 9 1 2 7 4 3 5 6 8 9 1 2 7 4 3 5 6 8 9 1 2 7 4 3 5 8 6 9 1 2 7 4 3 5 8 6 9 1 2 7 4 3 5 8 6
+46 12 468128 7 3 7 465128 9 7 3 465128 9 7 3 468128 9 7 3 468128 9 7 3 468128 9 3 7 465128 9 3 7 465128 9 3 7
+3 7 6 1 3979 964 21 3875 564 21 3879 564 21 3879 564 21 3875 864 21 3875 864 21 3875 964 21 3875 964 21
+[7666fa5d] Example Input [7666fa5d] Example Output [7666fa5d] Test Input Timestep i = 0 Timestep i = 1 Timestep i = 2 Timestep i = 3 Timestep i = 4
+
+
+
+
+ [7b80bb43] Test Input Timestep i = 0 Timestep i = 1 Timestep i = 2 Timestep i = 3 Timestep i = 4 Timestep i = 5 Timestep i = 6
+ [7b80bb43] Example Input [7b80bb43] Example Output
+
+
+
+
+ Figure 7: Visualization of intermediate predictions by HRM on benchmark tasks. Top: Maze-
+ Hard—blue cells indicate the predicted path. Middle: Sudoku-Extreme—bold cells represent ini-
+ tial givens; red highlights cells violating Sudoku constraints; grey shading indicates changes from
+ the previous timestep. Bottom: ARC-AGI-2 Task—left: provided example input-output pair; right:
+ intermediate steps solving the test input.
+
+ followed by multiple refinement iterations. In Sudoku, the strategy resembles a depth-first search
+ approach, where the model appears to explore potential solutions and backtracks when it hits dead
+ ends. HRM uses a different approach for ARC tasks, making incremental adjustments to the board
+ and iteratively improving it until reaching a solution. Unlike Sudoku, which involves frequent
+ backtracking, the ARC solution path follows a more consistent progression similar to hill-climbing
+ optimization.
+ Importantly, the model shows that it can adapt to different reasoning approaches, likely choosing an
+ effective strategy for each particular task. Further research is needed to gain more comprehensive
+ insights into these solution strategies.
+
+
+ 4 Brain Correspondence
+ A key principle from systems neuroscience is that a brain region’s functional repertoire—its ability
+ to handle diverse and complex tasks—is closely linked to the dimensionality of its neural represen-
+ tations 75,76 . Higher-order cortical areas, responsible for complex reasoning and decision-making,
+ must handle a wide variety of tasks, demanding more flexible and context-dependent processing 77 .
+ In dynamical systems, this flexibility is often realized through higher-dimensional state-space tra-
+ jectories, which allow for a richer repertoire of potential computations 78 . This principle gives rise
+ to an observable dimensionality hierarchy, where a region’s position in the processing hierarchy
+
+ 13
+ (a) (c) (e)
+
+
+
+
+ (b) (d) (f)
+ 5.0
+
+ 4.5
+ Participation Ratio (PR)
+
+
+
+
+ 4.0
+
+ 3.5
+
+ 3.0
+
+ 2.5
+
+ 2.0
+
+ 0 20 40
+ Position in the hierarchy
+
+
+Figure 8: Hierarchical Dimensionality Organization in the HRM and Mouse Cortex. (a,b) are
+adapted from Posani et al. 74 . (a) Anatomical illustration of mouse cortical areas, color-coded by
+functional modules. (b) Correlation between Participation Ratio (PR), a measure of effective neural
+dimensionality, and hierarchical position across different mouse cortical areas. Higher positions in
+the hierarchy (e.g., MOs, ACAd) exhibit significantly higher PR values compared to lower sensory
+areas (e.g., SSp-n), with a Spearman correlation coefficient of ρ = 0.79 (P = 0.0003). (c,d) Trained
+HRM. (c) PR scaling of the trained HRM with task diversity. The dimensionality of the high-
+level module (zH ) scales with the number of unique tasks (trajectories) included in the analysis,
+indicating an adaptive expansion of its representational capacity. In contrast, the low-level module’s
+(zL ) dimensionality remains stable. (d) PR values for the low-level (zL , PR = 30.22) and high-
+level (zH , PR = 89.95) modules of the trained HRM, computed from neural activity during 100
+unique Sudoku-solving trajectories. A clear dimensionality hierarchy is observed, with the high-
+level module operating in a substantially higher-dimensional space. (e,f) Analysis of Untrained
+Network. To verify that the dimensionality hierarchy is an emergent property of training, the same
+analyses were performed on an untrained HRM with random weights. (e) In contrast to the trained
+model’s scaling in (c), the dimensionality of both modules in the untrained model remains low and
+stable, failing to scale with the number of tasks. (f) Similarly, contrasting with the clear separation
+in (d), the PR values for the untrained model’s modules (zL , PR = 42.09; zH , PR = 40.75) are
+low and nearly identical, showing no evidence of hierarchical separation. This confirms that the
+observed hierarchical organization of dimensionality is a learned property that emerges through
+training, not an artifact of the model’s architecture.
+
+
+
+
+ 14
+ correlates with its effective dimensionality. To quantify this phenomenon, we can examine the
+Participation Ratio (PR), which serves as a standard measure of the effective dimensionality of a
+high-dimensional representation 79 . The PR is calculated using the formula
+ ( i λi )2
+ P
+ PR = P 2 ,
+ i λi
+
+where {λi } are the eigenvalues of the covariance matrix of neural trajectories. Intuitively, a higher
+PR value signifies that variance is distributed more evenly across many dimensions, corresponding
+to a higher-dimensional representation. Conversely, a lower PR value indicates that variance is
+concentrated in only a few principal components, reflecting a more compact, lower-dimensional
+structure.
+The dimensionality hierarchy can be observed, for example, in the mouse cortex, where the PR of
+population activity increases monotonically from low-level sensory areas to high-level associative
+areas, supporting this link between dimensionality and functional complexity 74 (Figure 8 (a,b)).
+We evaluated whether HRM reproduces this neuroscientific principle by calculating the PR for
+both recurrent modules after training on the Sudoku-Extreme Full dataset. The PR computation
+used the covariance matrix derived from neural states gathered across multiple Sudoku-solving
+trajectories. The results show a striking parallel to the biological findings. The low-level module’s
+state (zL ) occupies a relatively small subspace with a participation ratio of 30.22, whereas the high-
+level module’s state (zH ) operates in a substantially larger subspace with a participation ratio of
+89.95, as shown in Figure 8(c). Furthermore, Figure 8(d) shows that increasing the number of
+unique tasks (trajectories) from 10 to 100 causes zH dimensionality to scale up accordingly, while
+zL dimensionality remains stable. These results suggest an emergent separation of representational
+capacity between the modules that parallels their functional roles.
+To confirm that this hierarchical organization is an emergent property of training, and not an artifact
+of the network’s architecture, we performed a control analysis using an identical but untrained
+network with random weights.
+We initialized an identical HRM architecture with random weights and, without any training, mea-
+sured the PR of its modules as the network processed the same task-specific inputs given to the
+trained model.
+The results, shown in Figure 8(e,f), reveal a stark contrast: the high-level and low-level modules of
+the untrained network exhibit no hierarchical separation, with their PR values remaining low and
+nearly indistinguishable from each other. This control analysis validates that the dimensionality
+hierarchy is an emergent property that arises as the model learns to perform complex reasoning.
+The high-to-low PR ratio in HRM (zH /zL ≈ 2.98) closely matches that measured in the mouse
+cortex (≈ 2.25). In contrast, conventional deep networks often exhibit neural collapse, where
+last-layer features converge to a low-dimensional subspace 80,81,82 . HRM therefore departs from the
+collapse pattern and instead fosters a high-dimensional representation in its higher module. This
+is significant because such representations are considered crucial for cognitive flexibility and are a
+hallmark of higher-order brain regions like the prefrontal cortex (PFC), which is central to complex
+reasoning.
+This structural parallel suggests the model has discovered a fundamental organizational principle.
+By learning to partition its representations into a high-capacity, high-dimensional subspace (zH )
+
+ 15
+ and a more specialized, low-dimensional one (zL ), HRM autonomously discovers an organizational
+principle that is thought to be fundamental for achieving robust and flexible reasoning in biological
+systems. This provides a potential mechanistic explanation for the model’s success on complex,
+long-horizon tasks that are intractable for models lacking such a differentiated internal structure.
+We emphasize, however, that this evidence is correlational. While a causal link could be tested
+via intervention (e.g., by constraining the H-module’s dimensionality), such methods are difficult
+to interpret in deep learning due to potential confounding effects on the training process itself.
+Thus, the causal necessity of this emergent hierarchy remains an important question for future
+investigation.
+
+
+5 Related Work
+Reasoning and algorithm learning Given the central role of reasoning problems and their close
+relation to algorithms, researchers have long explored neural architectures that enable algorithm
+learning from training instances. This line of work includes Neural Turing Machines (NTM) 83 ,
+the Differentiable Neural Computer (DNC) 84 , and Neural GPUs 85 –all of which construct iterative
+neural architectures that mimic computational hardware for algorithm execution, and are trained to
+learn algorithms from data. Another notable work in this area is Recurrent Relational Networks
+(RRN) 62 , which executes algorithms on graph representations through graph neural networks.
+Recent studies have integrated algorithm learning approaches with Transformer-based architec-
+tures. Universal Transformers extend the standard Transformer model by introducing a recurrent
+loop over the layers and implementing an adaptive halting mechanism. Geiping et al. 86 demonstrate
+that looped Transformers can generalize to a larger number of recurrent steps during inference than
+what they were trained on. Shen et al. 16 propose adding continuous recurrent reasoning tokens
+to the Transformer. Finally, TransNAR 8 combine recurrent graph neural networks with language
+models.
+Building on the success of CoT-based reasoning, a line of work have introduced fine-tuning meth-
+ods that use reasoning paths from search algorithms (like A*) as SFT targets 87,71,70 .
+We also mention adaptive halting mechanisms designed to allocate additional computational re-
+sources to more challenging problems. This includes the Adaptive Computation Time (ACT) for
+RNNs 88 and follow-up research like PonderNet 89 , which aims to improve the stability of this allo-
+cation process.
+HRM further pushes the boundary of algorithm learning through a brain-inspired computational
+architecture that achieves exceptional data efficiency and model expressiveness, successfully dis-
+covering complex and diverse algorithms from just 1000 training examples.
+Brain-inspired reasoning architectures Developing a model with the reasoning power of the
+brain has long been a goal in brain-inspired computing. Spaun 90 is one notable example, which uses
+spiking neural networks to create distinct modules corresponding to brain regions like the visual
+cortex and prefrontal cortex. This design enables an architecture to perform a range of cognitive
+tasks, from memory recall to simple reasoning puzzles. However, its reasoning relies on hand-
+designed algorithms, which may limit its ability to learn new tasks. Another significant model is the
+Tolman-Eichenbaum Machine (TEM) 91 , which is inspired by the hippocampal-entorhinal system’s
+role in spatial and relational memory tasks. TEM proposes that medial entorhinal cells create a
+basis for structural knowledge, while hippocampal cells link this basis to sensory information. This
+
+ 16
+ allows TEM to generalize and explains the emergence of various cell types like grid, border, and
+place cells. Another approach involves neural sampling models 92 , which view the neural signaling
+process as inference over a distribution, functioning similarly to a Boltzmann machine. These
+models often require hand-made rules to be set up for solving a specific reasoning task. In essence,
+while prior models are restricted to simple reasoning problems, HRM is designed to solve complex
+tasks that are hard for even advanced LLMs, without pre-training or task-specific manual design.
+Hierarchical memory The hierarchical multi-timescale structure also plays an important role in
+how the brain processes memory. Models such as Hierarchical Sequential Models 93 and Clockwork
+RNN 94 use multiple recurrent modules that operate at varying time scales to more effectively cap-
+ture long-range dependencies within sequences, thereby mitigating the forgetting issue in RNNs.
+Similar mechanisms have also been adopted in linear attention methods for memorizing long con-
+texts (see the Discussions section). Since HRM focuses on reasoning, full attention is applied for
+simplicity. Incorporating hierarchical memory into HRM could be a promising future direction.
+
+
+6 Discussions
+Turing-completeness of HRM Like earlier neural reasoning algorithms including the Universal
+Transformer 95 , HRM is computationally universal when given sufficient memory and time con-
+straints. In other words, it falls into the category of models that can simulate any Turing machine,
+overcoming the computational limitations of standard Transformers discussed previously in the in-
+troduction. Given that earlier neural algorithm reasoners were trained as recurrent neural networks,
+they suffer from premature convergence and memory intensive BPTT. Therefore, in practice, their
+effective computational depth remains limited, though still deeper than that of a standard Trans-
+former. By resolving these two challenges and being equipped with adaptive computation, HRM
+could be trained on long reasoning processes, solve complex puzzles requiring intensive depth-first
+search and backtracking, and move closer to practical Turing-completeness.
+Reinforcement learning with chain-of-thought Beyond fine-tuning using human-annotated CoT,
+reinforcement learning (RL) represents another widely adopted training methodology. However,
+recent evidence suggests that RL primarily unlocks existing CoT-like capabilities rather than dis-
+covering fundamentally new reasoning mechanisms 96,97,98,99 . Additionally, CoT-training with RL
+is known for its instability and data inefficiency, often requiring extensive exploration and careful
+reward design. In contrast, HRM takes feedback from dense gradient-based supervision rather than
+relying on a sparse reward signal. Moreover, HRM operates naturally in a continuous space, which
+is biologically plausible and avoids allocating same computational resources to each token, even
+though tokens vary in their reasoning and planning complexity 16 .
+Linear attention Recurrence has been explored not only for its capability in universal computa-
+tion, but also as a means to replace the attention mechanism in Transformers, which suffers from
+quadratic time and memory complexity 100 . Recurrent alternatives offer a more efficient design by
+processing input tokens sequentially and predicting the next token at each time step, similar to early
+RNN-based language models.
+Some linear-attention variants, such as Log-linear Attention 101 , share an RNN-like state-update that
+can be interpreted as propagating multi-timescale summary statistics, thereby retaining long-range
+context without the quadratic memory growth of standard self-attention. However, substituting the
+attention mechanism alone does not change the fact that Transformers are still fixed-depth, and
+
+ 17
+ require CoT as a compensatory mechanism. Notably, linear attention can operate with a reduced
+key-value cache over extended contexts, making them more suitable for deployment on resource-
+constrained edge devices.
+
+
+7 Conclusion
+This work introduces the Hierarchical Reasoning Model, a brain-inspired architecture that lever-
+ages hierarchical structure and multi-timescale processing to achieve substantial computational
+depth without sacrificing training stability or efficiency. With only 27M parameters and train-
+ing on just 1000 examples, HRM effectively solves challenging reasoning problems such as ARC,
+Sudoku, and complex maze navigation–tasks that typically pose significant difficulties for contem-
+porary LLM and chain-of-thought models.
+Although the brain relies heavily on hierarchical structures to enable most cognitive processes,
+these concepts have largely remained confined to academic literature rather than being translated
+into practical applications. The prevailing AI approach continues to favor non-hierarchical models.
+Our results challenge this established paradigm and suggest that the Hierarchical Reasoning Model
+represents a viable alternative to the currently dominant chain-of-thought reasoning methods, ad-
+vancing toward a foundational framework capable of Turing-complete universal computation.
+Acknowledgements We thank Mingli Yuan, Ahmed Murtadha Hasan Mahyoub and Hengshuai
+Yao for their insightful discussions and valuable feedback throughout the course of this work.
+
+
+
+
+ 18
+ References
+ 1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
+ http://www.deeplearningbook.org.
+ 2. Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
+ recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
+ pages 770–778, 2015.
+ 3. Lena Strobl. Average-hard attention transformers are constant-depth uniform threshold
+ circuits, 2023.
+ 4. Tom Bylander. Complexity results for planning. In Proceedings of the 12th International Joint
+ Conference on Artificial Intelligence - Volume 1, IJCAI’91, page 274–279, San Francisco,
+ CA, USA, 1991. Morgan Kaufmann Publishers Inc. ISBN 1558601600.
+ 5. William Merrill and Ashish Sabharwal. A logic for expressing log-precision transformers. In
+ Neural Information Processing Systems, 2023.
+ 6. David Chiang. Transformers in DLOGTIME-uniform TC0 . Transactions on Machine
+ Learning Research, 2025.
+ 7. Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul McVay, Michael
+ Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search
+ dynamics bootstrapping. In First Conference on Language Modeling, 2024.
+ 8. Wilfried Bounsi, Borja Ibarz, Andrew Dudzik, Jessica B. Hamrick, Larisa Markeeva, Alex
+ Vitvitskyi, Razvan Pascanu, and Petar Velivckovi’c. Transformers meet neural algorithmic
+ reasoners. ArXiv, abs/2406.09308, 2024.
+ 9. William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision
+ transformers. Transactions of the Association for Computational Linguistics, 11:531–545,
+ 2023. doi: 10.1162/tacl_a_00562.
+10. Jason Wei, Yi Tay, et al. Chain-of-thought prompting elicits reasoning in large language
+ models, 2022. arXiv preprint arXiv:2201.11903.
+11. William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of
+ thought. In ICLR, 2024.
+12. Xinyun Chen, Ryan A. Chi, Xuezhi Wang, and Denny Zhou. Premise order matters in
+ reasoning with large language models. ArXiv, abs/2402.08939, 2024.
+13. Rongwu Xu, Zehan Qi, and Wei Xu. Preemptive answer "attacks" on chain-of-thought
+ reasoning. In Annual Meeting of the Association for Computational Linguistics, 2024.
+14. Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius
+ Hobbhahn. Will we run out of data? limits of llm scaling based on human-generated data.
+ arXiv preprint arXiv:2211.04325, 2022.
+15. Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang,
+ Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A comprehensive
+ survey on latent chain-of-thought reasoning, 2025.
+16. Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu.
+ Training large language models to reason in a continuous latent space. arXiv preprint
+ arXiv:2412.07423, 2024.
+
+
+ 19
+ 17. Evelina Fedorenko, Steven T Piantadosi, and Edward AF Gibson. Language is primarily a
+ tool for communication rather than thought. Nature, 630(8017):575–586, 2024.
+18. Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei.
+ Deepnet: Scaling transformers to 1,000 layers. IEEE Transactions on Pattern Analysis and
+ Machine Intelligence, 2024.
+19. Timothy P Lillicrap and Adam Santoro. Backpropagation through time and the brain. Current
+ Opinion in Neurobiology, 55:82–89, 2019. ISSN 0959-4388. doi: https://doi.org/10.1016/j.
+ conb.2019.01.011.
+20. John D Murray, Alberto Bernacchia, David J Freedman, Ranulfo Romo, Jonathan D Wallis,
+ Xinying Cai, Camillo Padoa-Schioppa, Tatiana Pasternak, Hyojung Seo, Daeyeol Lee, et al.
+ A hierarchy of intrinsic timescales across primate cortex. Nature neuroscience, 17(12):1661–
+ 1663, 2014.
+21. Roxana Zeraati, Yan-Liang Shi, Nicholas A Steinmetz, Marc A Gieselmann, Alexander
+ Thiele, Tirin Moore, Anna Levina, and Tatiana A Engel. Intrinsic timescales in the
+ visual cortex change with selective attention and reflect spatial connectivity. Nature
+ communications, 14(1):1858, 2023.
+22. Julia M Huntenburg, Pierre-Louis Bazin, and Daniel S Margulies. Large-scale gradients in
+ human cortical organization. Trends in cognitive sciences, 22(1):21–31, 2018.
+23. Victor AF Lamme and Pieter R Roelfsema. The distinct modes of vision offered by
+ feedforward and recurrent processing. Trends in neurosciences, 23(11):571–579, 2000.
+24. Andre M Bastos, W Martin Usrey, Rick A Adams, George R Mangun, Pascal Fries, and Karl J
+ Friston. Canonical microcircuits for predictive coding. Neuron, 76(4):695–711, 2012.
+25. Klara Kaleb, Barbara Feulner, Juan Gallego, and Claudia Clopath. Feedback control guides
+ credit assignment in recurrent neural networks. Advances in Neural Information Processing
+ Systems, 37:5122–5144, 2024.
+26. Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton.
+ Backpropagation and the brain. Nature Reviews Neuroscience, 21(6):335–346, 2020.
+27. François Chollet. On the measure of intelligence (abstraction and reasoning corpus), 2019.
+ arXiv preprint arXiv:1911.01547.
+28. Francois Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. Arc prize 2024:
+ Technical report. ArXiv, abs/2412.04604, 2024.
+29. Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
+ agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
+ 2025.
+30. György Buzsáki. Gamma, alpha, delta, and theta oscillations govern cognitive processes.
+ International Journal of Psychophysiology, 39:241–248, 2000.
+31. György Buzsáki. Rhythms of the Brain. Oxford university press, 2006.
+32. Anja Pahor and Norbert Jaušovec. Theta–gamma cross-frequency coupling relates to the level
+ of human intelligence. Intelligence, 46:283–290, 2014.
+33. Adriano BL Tort, Robert W Komorowski, Joseph R Manns, Nancy J Kopell, and Howard
+ Eichenbaum. Theta–gamma coupling increases during the learning of item–context
+ associations. Proceedings of the National Academy of Sciences, 106(49):20942–20947, 2009.
+
+
+ 20
+ 34. Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between
+ energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11,
+ 2016.
+35. Guillaume Bellec, Franz Scherr, Anand Subramoney, Elias Hajek, Darjan Salaj, Robert
+ Legenstein, and Wolfgang Maass. A solution to the learning dilemma for recurrent
+ networks of spiking neurons. Nature Communications, 11, 07 2020. doi: 10.1038/
+ s41467-020-17236-y.
+36. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in
+ Neural Information Processing Systems, pages 690–701, 2019.
+37. Zhengyang Geng, Xinyu Zhang, Shaojie Bai, Yisen Wang, and Zhouchen Lin. On training
+ implicit models. ArXiv, abs/2111.05177, 2021.
+38. Katarina Begus and Elizabeth Bonawitz. The rhythm of learning: Theta oscillations as an
+ index of active learning in infancy. Developmental Cognitive Neuroscience, 45:100810, 2020.
+ ISSN 1878-9293. doi: https://doi.org/10.1016/j.dcn.2020.100810.
+39. Shaojie Bai, Zhengyang Geng, Yash Savani, and J. Zico Kolter. Deep Equilibrium
+ Optical Flow Estimation . In 2022 IEEE/CVF Conference on Computer Vision and Pattern
+ Recognition (CVPR), pages 610–620, 2022.
+40. Zaccharie Ramzi, Florian Mannel, Shaojie Bai, Jean-Luc Starck, Philippe Ciuciu, and
+ Thomas Moreau. Shine: Sharing the inverse estimate from the forward pass for bi-level
+ optimization and implicit models. ArXiv, abs/2106.00553, 2021.
+41. Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Stabilizing equilibrium models by jacobian
+ regularization. In International Conference on Machine Learning, 2021.
+42. Daniel Kahneman and P Egan. Thinking, fast and slow (farrar, straus and giroux, new york),
+ 2011.
+43. Matthew D Lieberman. Social cognitive neuroscience: a review of core processes. Annu. Rev.
+ Psychol., 58(1):259–289, 2007.
+44. Randy L Buckner, Jessica R Andrews-Hanna, and Daniel L Schacter. The brain’s default
+ network: anatomy, function, and relevance to disease. Annals of the new York Academy of
+ Sciences, 1124(1):1–38, 2008.
+45. Marcus E Raichle. The brain’s default mode network. Annual review of neuroscience, 38(1):
+ 433–447, 2015.
+46. Andrew Westbrook and Todd S Braver. Cognitive effort: A neuroeconomic approach.
+ Cognitive, Affective, & Behavioral Neuroscience, 15:395–415, 2015.
+47. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT
+ Press, Cambridge, MA, 2018.
+48. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan
+ Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. ArXiv,
+ abs/1312.5602, 2013.
+49. Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja,
+ Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning,
+ 2025.
+
+
+
+ 21
+ 50. Shuo Xie and Zhiyuan Li. Implicit bias of adamw: L inf norm constrained optimization.
+ ArXiv, abs/2404.04454, 2024.
+51. Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, and Tolga Birdal. Grokking at the edge of
+ numerical stability. In The Thirteenth International Conference on Learning Representations,
+ 2025.
+52. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
+ Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural
+ information processing systems, pages 5998–6008, 2017.
+53. Meta AI. Llama 3: State-of-the-art open weight language models. Technical report, Meta,
+ 2024. URL https://ai.meta.com/llama/.
+54. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
+ Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
+55. Noam M. Shazeer. Glu variants improve transformer. ArXiv, abs/2002.05202, 2020.
+56. Biao Zhang and Rico Sennrich. Root mean square layer normalization. ArXiv,
+ abs/1910.07467, 2019.
+57. Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-
+ normalizing neural networks. In Neural Information Processing Systems, 2017.
+58. JAX Developers. jax.nn.initializers.lecun_normal. Google Research, 2025. URL
+ https://docs.jax.dev/en/latest/_autosummary/jax.nn.initializers.lecun_
+ normal.html. Accessed June 22, 2025.
+59. Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop.
+ In Neural networks: Tricks of the trade, pages 9–50. Springer, 2002.
+60. Katie E Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak,
+ Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and
+ Jeffrey Pennington. Scaling exponents across parameterizations and optimizers. In Forty-first
+ International Conference on Machine Learning, 2024.
+61. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
+62. Rasmus Berg Palm, Ulrich Paquet, and Ole Winther. Recurrent relational networks. In Neural
+ Information Processing Systems, 2017.
+63. Jieyi Long. Large language model guided tree-of-thought. ArXiv, abs/2305.08291, 2023.
+64. Yilun Du, Jiayuan Mao, and Josh Tenenbaum. Learning iterative reasoning through energy
+ diffusion. ArXiv, abs/2406.11179, 2024.
+65. Kyubyong Park. Can convolutional neural networks crack sudoku puzzles? https:
+ //github.com/Kyubyong/sudoku, 2018.
+66. Single-digit techniques. https://hodoku.sourceforge.net/en/tech_singles.php.
+ Accessed: 2025-06-16.
+67. Tom Dillion. Tdoku: A fast sudoku solver and generator. https://t-dillon.github.io/
+ tdoku/, 2025.
+68. Jeffrey Seely, Yuki Imajuku, Tianyu Zhao, Edoardo Cetin, and Llion Jones. Sudoku-bench:
+ Evaluating creative reasoning with sudoku variants. arXiv preprint arXiv:2505.16135, 2025.
+69. Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, and Llion Jones. Continuous
+ thought machines. arXiv preprint arXiv:2505.05522, 2025.
+
+ 22
+ 70. DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, and Qinqing Zheng.
+ Dualformer: Controllable fast and slow thinking by learning with randomized reasoning
+ traces, 2025.
+71. Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul McVay, Michael
+ Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search
+ dynamics bootstrapping. In First Conference on Language Modeling, 2024.
+72. Mubbasir Kapadia, Francisco Garcia, Cory D. Boatright, and Norman I. Badler. Dynamic
+ search on the gpu. In 2013 IEEE/RSJ International Conference on Intelligent Robots and
+ Systems, pages 3332–3337, 2013. doi: 10.1109/IROS.2013.6696830.
+73. Isaac Liao and Albert Gu. Arc-agi without pretraining, 2025. URL https:
+ //iliao2345.github.io/blog_posts/arc_agi_without_pretraining/arc_agi_
+ without_pretraining.html.
+74. Lorenzo Posani, Shuqi Wang, Samuel P Muscinelli, Liam Paninski, and Stefano Fusi.
+ Rarely categorical, always high-dimensional: how the neural code changes along the cortical
+ hierarchy. bioRxiv, pages 2024–11, 2025.
+75. Mattia Rigotti, Omri Barak, Melissa R. Warden, Xiao-Jing Wang, Nathaniel D. Daw, Earl K.
+ Miller, and Stefano Fusi. The importance of mixed selectivity in complex cognitive tasks.
+ Nature, 497:585–590, 2013. doi: 10.1038/nature12160.
+76. Valerio Mante, David Sussillo, Krishna V. Shenoy, and William T. Newsome. Context-
+ dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78–84,
+ 2013. doi: 10.1038/nature12742.
+77. Earl K. Miller and Jonathan D. Cohen. An integrative theory of prefrontal cortex function.
+ Annual Review of Neuroscience, 24(1):167–202, 2001. doi: 10.1146/annurev.neuro.24.1.167.
+78. Wolfgang Maass. Real-time computing without stable states: a new framework for neural
+ computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002. doi:
+ 10.1162/089976602760407955.
+79. Ege Altan, Sara A. Solla, Lee E. Miller, and Eric J. Perreault. Estimating the dimensionality
+ of the manifold underlying multi-electrode neural recordings. PLoS Computational Biology,
+ 17(11):e1008591, 2021. doi: 10.1371/journal.pcbi.1008591.
+80. Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the
+ terminal phase of deep learning training. Proceedings of the National Academy of Sciences,
+ 117(40):24652–24663, 2020. doi: 10.1073/pnas.2015509117.
+81. Cong Fang, Hangfeng He, Qi Long, and Weijie J. Su. Exploring deep neural networks via
+ layer–peeled model: Minority collapse in imbalanced training. Proceedings of the National
+ Academy of Sciences, 118(43):e2103091118, 2021. doi: 10.1073/pnas.2103091118.
+82. Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu.
+ A geometric analysis of neural collapse with unconstrained features. In Advances in Neural
+ Information Processing Systems, volume 34 of NeurIPS, pages 29820–29834, 2021.
+83. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines, 2014.
+84. Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka
+ Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John
+ Agapiou, et al. Hybrid computing using a neural network with dynamic external memory.
+ Nature, 538(7626):471–476, 2016.
+
+ 23
+ 85. Lukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In ICLR, 2016.
+ 86. Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R.
+ Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time
+ compute with latent reasoning: A recurrent depth approach, 2025.
+ 87. Tiedong Liu and Kian Hsiang Low. Goat: Fine-tuned llama outperforms gpt-4 on arithmetic
+ tasks. ArXiv, abs/2305.14201, 2023.
+ 88. Alex Graves. Adaptive computation time for recurrent neural networks. ArXiv,
+ abs/1603.08983, 2016.
+ 89. Andrea Banino, Jan Balaguer, and Charles Blundell. Pondernet: Learning to ponder. ArXiv,
+ abs/2107.05407, 2021.
+ 90. Chris Eliasmith, Terrence C Stewart, Xuan Choo, Trevor Bekolay, Travis DeWolf, Yichuan
+ Tang, and Daniel Rasmussen. A large-scale model of the functioning brain. science, 338
+ (6111):1202–1205, 2012.
+ 91. James CR Whittington, Timothy H Muller, Shirley Mark, Guifen Chen, Caswell Barry, Neil
+ Burgess, and Timothy EJ Behrens. The tolman-eichenbaum machine: unifying space and
+ relational memory through generalization in the hippocampal formation. Cell, 183(5):1249–
+ 1263, 2020.
+ 92. Lars Buesing, Johannes Bill, Bernhard Nessler, and Wolfgang Maass. Neural dynamics as
+ sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLoS
+ computational biology, 7(11):e1002211, 2011.
+ 93. Salah Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term
+ dependencies. In D. Touretzky, M.C. Mozer, and M. Hasselmo, editors, Advances in Neural
+ Information Processing Systems, volume 8. MIT Press, 1995.
+ 94. Jan Koutník, Klaus Greff, Faustino J. Gomez, and Jürgen Schmidhuber. A clockwork rnn. In
+ International Conference on Machine Learning, 2014.
+ 95. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser.
+ Universal transformers, 2018. arXiv preprint arXiv:1807.03819.
+ 96. Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng,
+ Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du,
+ and Yelong Shen. Reinforcement learning for reasoning in large language models with one
+ training example, 2025. URL https://arxiv.org/abs/2504.20571.
+ 97. Niklas Muennighoff. s1: Simple test-time scaling. arXiv preprint arXiv:2502.23456, 2025.
+ 98. Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu,
+ Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng
+ Zhang. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond, 2025.
+ 99. Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling, 2025.
+100. Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms
+ through structured state space duality. ArXiv, abs/2405.21060, 2024.
+101. Han Guo, Songlin Yang, Tarushii Goel, Eric P Xing, Tri Dao, and Yoon Kim. Log-linear
+ attention. arXiv preprint arXiv:2506.04761, 2025.
+
+
+
+
+ 24
+ \ No newline at end of file