Diffstat (limited to 'papers/txt/hrm2025_hierarchical_reasoning.txt')
| -rw-r--r-- | papers/txt/hrm2025_hierarchical_reasoning.txt | 1302 |
1 files changed, 1302 insertions, 0 deletions
diff --git a/papers/txt/hrm2025_hierarchical_reasoning.txt b/papers/txt/hrm2025_hierarchical_reasoning.txt
new file mode 100644
index 0000000..3cc1f2e
--- /dev/null
+++ b/papers/txt/hrm2025_hierarchical_reasoning.txt
@@ -0,0 +1,1302 @@
+Hierarchical Reasoning Model
+
+Guan Wang1,†, Jin Li1, Yuhao Sun1, Xing Chen1, Changling Liu1, Yue Wu1, Meng Lu1,†, Sen Song2,†, Yasin Abbasi Yadkori1,†
+1 Sapient Intelligence, Singapore
+
+arXiv:2506.21734v3 [cs.AI] 4 Aug 2025
+
+Abstract
+
+Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive data requirements, and high latency. Inspired by the hierarchical and multi-timescale processing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel recurrent architecture that attains significant computational depth while maintaining both training stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass without explicit supervision of the intermediate process, through two interdependent recurrent modules: a high-level module responsible for slow, abstract planning, and a low-level module handling rapid, detailed computations. With only 27 million parameters, HRM achieves exceptional performance on complex reasoning tasks using only 1000 training samples. The model operates without pre-training or CoT data, yet achieves nearly perfect performance on challenging tasks including complex Sudoku puzzles and optimal path finding in large mazes. Furthermore, HRM outperforms much larger models with significantly longer context windows on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring artificial general intelligence capabilities.
These results underscore HRM’s potential as a transformative advancement toward universal computation and general-purpose reasoning systems.
+
+[Figure 1, right panel: bar charts of accuracy (%) for Direct pred, Deepseek R1, Claude 3.7 8K, o3-mini-high, and HRM on four benchmarks: ARC-AGI-1 (960 training examples), ARC-AGI-2 (1120 training examples), Sudoku-Extreme 9x9 (1000 training examples), and Maze-Hard 30x30 (1000 training examples). Legend: chain-of-thought, pretrained; direct prediction, small-sample learning.]
+
+Figure 1: Left: HRM is inspired by hierarchical processing and temporal separation in the brain. It has two recurrent networks operating at different timescales to collaboratively solve tasks. Right: With only about 1000 training examples, the HRM (~27M parameters) surpasses state-of-the-art CoT models on inductive benchmarks (ARC-AGI) and challenging symbolic tree-search puzzles (Sudoku-Extreme, Maze-Hard) where CoT models failed completely. The HRM was randomly initialized, and it solved the tasks directly from inputs without chain of thoughts.
+
+2 Tsinghua University. † Corresponding author. Contact: research@sapient.inc.
+Code available at: github.com/sapientinc/HRM
+
+1 Introduction
+
+Deep learning, as its name suggests, emerged from the idea of stacking more layers to achieve increased representation power and improved performance [1,2]. However, despite the remarkable success of large language models, their core architecture is paradoxically shallow [3].
This imposes a fundamental constraint on their most sought-after capability: reasoning. The fixed depth of standard Transformers places them in computational complexity classes such as $AC^0$ or $TC^0$ [4], preventing them from solving problems that require polynomial time [5,6]. LLMs are not Turing-complete and thus cannot, at least in a purely end-to-end manner, execute the complex algorithmic reasoning that is necessary for deliberate planning or symbolic manipulation tasks [7,8]. For example, our results on the Sudoku task show that increasing Transformer model depth can improve performance (see footnote 1), but performance remains far from optimal even with very deep models (see Figure 2), which supports the conjectured limitations of the LLM scaling paradigm [9].
+
+The LLM literature has relied largely on Chain-of-Thought (CoT) prompting for reasoning [10]. CoT externalizes reasoning into token-level language by breaking down complex tasks into simpler intermediate steps, sequentially generating text using a shallow model [11]. However, CoT for reasoning is a crutch, not a satisfactory solution. It relies on brittle, human-defined decompositions where a single misstep or misordering of the steps can derail the reasoning process entirely [12,13]. This dependency on explicit linguistic steps tethers reasoning to patterns at the token level. As a result, CoT reasoning often requires a significant amount of training data and generates a large number of tokens for complex reasoning tasks, resulting in slow response times. A more efficient approach is needed to minimize these data requirements [14].
+
+Towards this goal, we explore “latent reasoning”, where the model conducts computations within its internal hidden state space [15,16].
This aligns with the understanding that language is a tool for human communication, not the substrate of thought itself [17]; the brain sustains lengthy, coherent chains of reasoning with remarkable efficiency in a latent space, without constant translation back to language. However, the power of latent reasoning is still fundamentally constrained by a model’s effective computational depth. Naively stacking layers is notoriously difficult due to vanishing gradients, which plague training stability and effectiveness [1,18]. Recurrent architectures, a natural alternative for sequential tasks, often suffer from early convergence, rendering subsequent computational steps inert, and rely on the biologically implausible, computationally expensive, and memory-intensive Backpropagation Through Time (BPTT) for training [19].
+
+The human brain provides a compelling blueprint for achieving the effective computational depth that contemporary artificial models lack. It organizes computation hierarchically across cortical regions operating at different timescales, enabling deep, multi-stage reasoning [20,21,22]. Recurrent feedback loops iteratively refine internal representations, allowing slow, higher-level areas to guide, and fast, lower-level circuits to execute, subordinate processing while preserving global coherence [23,24,25]. Notably, the brain achieves such depth without incurring the prohibitive credit-assignment costs of backpropagation through time that typically hamper recurrent networks [19,26].
+
+Inspired by this hierarchical and multi-timescale biological architecture, we propose the Hierarchical Reasoning Model (HRM). HRM is designed to significantly increase the effective computational depth. It features two coupled recurrent modules: a high-level (H) module for abstract, deliberate reasoning, and a low-level (L) module for fast, detailed computations.
This structure avoids the rapid convergence of standard recurrent models through a process we term “hierarchical convergence.” The slow-updating H-module advances only after the fast-updating L-module has completed multiple computational steps and reached a local equilibrium, at which point the L-module is reset to begin a new computational phase.
+
+Footnote 1: Simply increasing the model width does not improve performance here.
+
+[Figure 2 plots. Left: accuracy (%) versus parameters (27M to 872M) for "Scaling Width - 8 layers fixed" and "Scaling Depth - 512 hidden size fixed". Right: accuracy (%) versus depth / Transformer layers computed (8 to 512) for Transformer, Recurrent Transformer, and HRM.]
+
+Figure 2: The necessity of depth for complex reasoning. Left: On Sudoku-Extreme Full, which requires extensive tree-search and backtracking, increasing a Transformer’s width yields no performance gain, while increasing depth is critical. Right: Standard architectures saturate, failing to benefit from increased depth. HRM overcomes this fundamental limitation, effectively using its computational depth to achieve near-perfect accuracy.
+
+Furthermore, we propose a one-step gradient approximation for training HRM, which offers improved efficiency and eliminates the requirement for BPTT. This design maintains a constant memory footprint (O(1) compared to BPTT’s O(T) for T timesteps) throughout the backpropagation process, making it scalable and more biologically plausible.
+
+Leveraging its enhanced effective depth, HRM excels at tasks that demand extensive search and backtracking. Using only 1,000 input-output examples, without pre-training or CoT supervision, HRM learns to solve problems that are intractable for even the most advanced LLMs. For example, it achieves near-perfect accuracy in complex Sudoku puzzles (Sudoku-Extreme Full) and optimal pathfinding in 30x30 mazes, where state-of-the-art CoT methods completely fail (0% accuracy).
On the Abstraction and Reasoning Corpus (ARC) AGI Challenge [27,28,29], a benchmark of inductive reasoning, HRM achieves 40.3% when trained from scratch on only the official dataset (~1000 examples), with only 27M parameters and a 30x30 grid context (900 tokens). This substantially surpasses leading CoT-based models such as o3-mini-high (34.5%) and Claude 3.7 8K context (21.2%), despite their considerably larger parameter counts and context lengths, as shown in Figure 1. HRM thus represents a promising direction toward the development of next-generation AI reasoning systems with universal computational capabilities.
+
+
+2 Hierarchical Reasoning Model
+
+We present the HRM, inspired by three fundamental principles of neural computation observed in the brain:
+• Hierarchical processing: The brain processes information across a hierarchy of cortical areas. Higher-level areas integrate information over longer timescales and form abstract representations, while lower-level areas handle more immediate, detailed sensory and motor processing [20,21,22].
+• Temporal separation: These hierarchical levels in the brain operate at distinct intrinsic timescales, reflected in neural rhythms (e.g., slow theta waves, 4–8 Hz, and fast gamma waves, 30–100 Hz) [30,31]. This separation allows for stable, high-level guidance of rapid, low-level computations [32,33].
+• Recurrent connectivity: The brain features extensive recurrent connections. These feedback loops enable iterative refinement, yielding more accurate and context-sensitive representations at the cost of additional processing time. Additionally, the brain largely avoids the deep credit assignment problem associated with BPTT [19].
+
+The HRM model consists of four learnable components: an input network $f_I(\cdot; \theta_I)$, a low-level recurrent module $f_L(\cdot; \theta_L)$, a high-level recurrent module $f_H(\cdot; \theta_H)$, and an output network $f_O(\cdot; \theta_O)$.
+The model’s dynamics unfold over N high-level cycles of T low-level timesteps each (see footnote 2). We index the total timesteps of one forward pass by $i = 1, \dots, N \times T$. The modules $f_L$ and $f_H$ each keep a hidden state, $z_L^i$ for $f_L$ and $z_H^i$ for $f_H$, which are initialized with the vectors $z_L^0$ and $z_H^0$, respectively.
+
+The HRM maps an input vector $x$ to an output prediction vector $\hat{y}$ as follows. First, the input $x$ is projected into a working representation $\tilde{x}$ by the input network:
+
+    \tilde{x} = f_I(x; \theta_I) .
+
+At each timestep $i$, the L-module updates its state conditioned on its own previous state, the H-module’s current state (which remains fixed throughout the cycle), and the input representation. The H-module only updates once per cycle (i.e., every T timesteps) using the L-module’s final state at the end of that cycle:
+
+    z_L^i = f_L(z_L^{i-1}, z_H^{i-1}, \tilde{x}; \theta_L) ,
+
+    z_H^i = \begin{cases} f_H(z_H^{i-1}, z_L^{i-1}; \theta_H) & \text{if } i \equiv 0 \pmod{T} , \\ z_H^{i-1} & \text{otherwise} . \end{cases}
+
+Finally, after N full cycles, a prediction $\hat{y}$ is extracted from the hidden state of the H-module:
+
+    \hat{y} = f_O(z_H^{NT}; \theta_O) .
+
+This entire NT-timestep process represents a single forward pass of the HRM. A halting mechanism (detailed later in this section) determines whether the model should terminate, in which case $\hat{y}$ will be used as the final prediction, or continue with an additional forward pass.
+
+Hierarchical convergence. Although convergence is crucial for recurrent networks, standard RNNs are fundamentally limited by their tendency to converge too early. As the hidden state settles toward a fixed point, update magnitudes shrink, effectively stalling subsequent computation and capping the network’s effective depth. To preserve computational power, we actually want convergence to proceed very slowly, but engineering that gradual approach is difficult, since pushing convergence too far edges the system toward instability.
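The update schedule in the equations above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the helper name `hrm_forward`, the toy stand-ins `toy_f_L` and `toy_f_H`, and the state size are assumptions made for the example. Only the timing of the updates follows the equations.

```python
import numpy as np

def hrm_forward(x_tilde, z_H, z_L, f_L, f_H, N, T):
    """Run N high-level cycles of T low-level steps each.

    Returns the final (z_H, z_L) and the step indices at which H was updated.
    """
    h_update_steps = []
    for i in range(1, N * T + 1):
        # Both updates read the states from step i-1.
        z_L_next = f_L(z_L, z_H, x_tilde)
        if i % T == 0:
            # H changes only once per cycle; otherwise it is carried over.
            z_H = f_H(z_H, z_L)
            h_update_steps.append(i)
        z_L = z_L_next
    return z_H, z_L, h_update_steps

# Toy stand-ins for the learned modules (hypothetical, for illustration only).
def toy_f_L(z_L, z_H, x_tilde):
    return np.tanh(0.5 * z_L + z_H + x_tilde)

def toy_f_H(z_H, z_L):
    return np.tanh(z_H + z_L)

x_tilde = np.ones(4)
z_H, z_L, h_update_steps = hrm_forward(
    x_tilde, np.zeros(4), np.zeros(4), toy_f_L, toy_f_H, N=3, T=4
)
print(h_update_steps)  # [4, 8, 12]
```

With N=3 and T=4, the H state changes only at steps 4, 8, and 12, while the L state changes at every step.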
+
+Footnote 2: While inspired by temporal separation in the brain, our model’s “high-level” and “low-level” modules are conceptual abstractions and do not map directly to specific neural oscillation frequencies.
+
+[Figure 3 plots. Top row: forward residual versus step index for HRM (H and L modules) and for a recurrent neural net, and versus layer index for a deep neural net. Bottom row: PCA trajectories (principal components) by step index or layer index for the same three models.]
+
+Figure 3: Comparison of forward residuals and PCA trajectories. HRM shows hierarchical convergence: the H-module steadily converges, while the L-module repeatedly converges within cycles before being reset by H, resulting in residual spikes. The recurrent neural network exhibits rapid convergence with residuals quickly approaching zero. In contrast, the deep neural network experiences vanishing gradients, with significant residuals primarily in the initial (input) and final layers.
+
+HRM is explicitly designed to counteract this premature convergence through a process we term hierarchical convergence. During each cycle, the L-module (an RNN) exhibits stable convergence to a local equilibrium. This equilibrium, however, depends on the high-level state $z_H$ supplied during that cycle. After completing the T steps, the H-module incorporates the sub-computation’s outcome (the final state $z_L$) and performs its own update. This $z_H$ update establishes a fresh context for the L-module, essentially “restarting” its computational path and initiating a new convergence phase toward a different local equilibrium.
+
+This process allows the HRM to perform a sequence of distinct, stable, nested computations, where the H-module directs the overall problem-solving strategy and the L-module executes the intensive search or refinement required for each step.
Although a standard RNN may approach convergence within T iterations, hierarchical convergence provides an enhanced effective depth of NT steps. As empirically shown in Figure 3, this mechanism allows HRM both to maintain high computational activity (forward residual) over many steps (in contrast to a standard RNN, whose activity rapidly decays) and to enjoy stable convergence. This translates into better performance at any computation depth, as illustrated in Figure 2.
+
+Approximate gradient. Recurrent models typically use BPTT to compute gradients. However, BPTT requires storing the hidden states from the forward pass and then combining them with gradients during the backward pass, which demands O(T) memory for T timesteps. This heavy memory burden forces smaller batch sizes and leads to poor GPU utilization, especially for large-scale networks. Additionally, because retaining the full history trace through time is biologically implausible, it is unlikely that the brain implements BPTT [19].
+
+Fortunately, if a recurrent neural network converges to a fixed point, we can avoid unrolling its state sequence by applying backpropagation in a single step at that equilibrium point. Moreover, such a mechanism could plausibly be implemented in the brain using only local learning rules [34,35]. Based on this finding, we propose a one-step approximation of the HRM gradient, using the gradient of the last state of each module and treating other states as constant. The gradient path is, therefore,
+
+    Output head → final state of the H-module → final state of the L-module → input embedding
+
+The above method needs O(1) memory, does not require unrolling through time, and can be easily implemented with an autograd framework such as PyTorch, as shown in Figure 4.
Given that each module only needs to back-propagate errors through its most recent local synaptic activity, this approach aligns well with the perspective that cortical credit assignment relies on short-range, temporally local mechanisms rather than on a global replay of activity patterns.
+
+[Figure 4, top: diagram of HRM with approximate gradient.]
+
+    def hrm(z, x, N=2, T=2):
+        x = input_embedding(x)
+        zH, zL = z
+
+        # All but the last step run without gradient tracking.
+        with torch.no_grad():
+            for _i in range(N * T - 1):
+                zL = L_net(zL, zH, x)
+                if (_i + 1) % T == 0:
+                    zH = H_net(zH, zL)
+
+        # 1-step grad
+        zL = L_net(zL, zH, x)
+        zH = H_net(zH, zL)
+
+        return (zH, zL), output_head(zH)
+
+    # Deep Supervision
+    for x, y_true in train_dataloader:
+        z = z_init
+        for step in range(N_supervision):
+            z, y_hat = hrm(z, x)
+
+            loss = softmax_cross_entropy(y_hat, y_true)
+            z = z.detach()
+
+            loss.backward()
+            opt.step()
+            opt.zero_grad()
+
+Figure 4: Top: Diagram of HRM with approximate gradient. Bottom: Pseudocode of HRM with deep supervision training in PyTorch.
+
+The one-step gradient approximation is theoretically grounded in the mathematics of Deep Equilibrium Models (DEQ) [36], which employ the Implicit Function Theorem (IFT) to bypass BPTT, as detailed next. Consider an idealized HRM behavior where, during high-level cycle $k$, the L-module repeatedly updates until its state $z_L$ converges to a local fixed point $z_L^\star$. This fixed point, given the current high-level state $z_H^{k-1}$, can be expressed as
+
+    z_L^\star = f_L(z_L^\star, z_H^{k-1}, \tilde{x}; \theta_L) .
+
+The H-module then performs a single update using this converged L-state:
+
+    z_H^k = f_H(z_H^{k-1}, z_L^\star; \theta_H) .
+
+With a proper mapping $F$, the updates to the high-level state can be written in a more compact form as $z_H^k = F(z_H^{k-1}; \tilde{x}, \theta)$, where $\theta = (\theta_I, \theta_L)$, and the fixed point can be written as $z_H^\star = F(z_H^\star; \tilde{x}, \theta)$. Let $J_F = \partial F / \partial z_H$ be the Jacobian of $F$, and assume that the matrix $I - J_F$ is invertible at $z_H^\star$ and that the mapping $F$ is continuously differentiable. The Implicit Function Theorem then allows us to calculate the exact gradient of the fixed point $z_H^\star$ with respect to the parameters $\theta$ without explicit backpropagation:
+
+    \frac{\partial z_H^\star}{\partial \theta} = \left( I - J_F \big|_{z_H^\star} \right)^{-1} \frac{\partial F}{\partial \theta} \bigg|_{z_H^\star} .    (1)
+
+Calculating the above gradient requires evaluating and inverting the matrix $(I - J_F)$, which can be computationally expensive. Given the Neumann series expansion,
+
+    (I - J_F)^{-1} = I + J_F + J_F^2 + J_F^3 + \dots ,
+
+the so-called 1-step gradient [37] approximates the series by considering only its first term, i.e. $(I - J_F)^{-1} \approx I$, and leads to the following approximation of Equation (1):
+
+    \frac{\partial z_H^\star}{\partial \theta_H} \approx \frac{\partial f_H}{\partial \theta_H} , \quad
+    \frac{\partial z_H^\star}{\partial \theta_L} \approx \frac{\partial f_H}{\partial z_L^\star} \cdot \frac{\partial z_L^\star}{\partial \theta_L} , \quad
+    \frac{\partial z_H^\star}{\partial \theta_I} \approx \frac{\partial f_H}{\partial z_L^\star} \cdot \frac{\partial z_L^\star}{\partial \theta_I} .    (2)
+
+The gradients of the low-level fixed point, $\partial z_L^\star / \partial \theta_L$ and $\partial z_L^\star / \partial \theta_I$, can also be approximated using another application of the 1-step gradient:
+
+    \frac{\partial z_L^\star}{\partial \theta_L} \approx \frac{\partial f_L}{\partial \theta_L} , \quad
+    \frac{\partial z_L^\star}{\partial \theta_I} \approx \frac{\partial f_L}{\partial \theta_I} .    (3)
+
+By substituting Equation (3) back into Equation (2), we arrive at the final simplified gradients.
+
+Before defining our loss function, we must first introduce two key elements of our proposed method: deep supervision and adaptive computational time.
+
+Deep supervision. Inspired by the principle that periodic neural oscillations regulate when learning occurs in the brain [38], we incorporate a deep supervision mechanism into HRM, as detailed next.
+
+Given a data sample $(x, y)$, we run multiple forward passes of the HRM model, each of which we refer to as a segment. Let $M$ denote the total number of segments executed before termination. For each segment $m \in \{1, \dots, M\}$, let $z^m = (z_H^{mNT}, z_L^{mNT})$ represent the hidden state at the conclusion of segment $m$, encompassing both high-level and low-level state components.
+
+At each segment $m$, we apply a deep supervision step as follows:
+
+  1. Given the state $z^{m-1}$ from the previous segment, compute the next state $z^m$ and its associated output $\hat{y}^m$ through a forward pass in the HRM model:
+
+         (z^m, \hat{y}^m) \leftarrow \mathrm{HRM}(z^{m-1}, x; \theta)
+
+  2. Compute the loss for the current segment:
+
+         L^m \leftarrow \mathrm{LOSS}(\hat{y}^m, y)
+
+  3. Update parameters:
+
+         \theta \leftarrow \mathrm{OPTIMIZERSTEP}(\theta, \nabla_\theta L^m)
+
+The crucial aspect of this procedure is that the hidden state $z^m$ is “detached” from the computation graph before being used as the input state for the next segment. Consequently, gradients from segment $m + 1$ do not propagate back through segment $m$, effectively creating a 1-step approximation of the gradient of the recursive deep supervision process [39,40]. This approach provides more frequent feedback to the H-module and serves as a regularization mechanism, demonstrating superior empirical performance and enhanced stability in deep equilibrium models when compared to more complex, Jacobian-based regularization techniques [39,41]. Figure 4 shows pseudocode of deep supervision training.
+
+Adaptive computational time (ACT). The brain dynamically alternates between automatic thinking (“System 1”) and deliberate reasoning (“System 2”) [42]. Neuroscientific evidence shows that these cognitive modes share overlapping neural circuits, particularly within regions such as the prefrontal cortex and the default mode network [43,44]. This indicates that the brain dynamically modulates the “runtime” of these circuits according to task complexity and potential rewards [45,46].
+
+Inspired by the above mechanism, we incorporate an adaptive halting strategy into HRM that enables “thinking, fast and slow”. This integration leverages deep supervision and uses the Q-learning algorithm [47] to adaptively determine the number of segments. A Q-head uses the final state of the H-module to predict the Q-values $\hat{Q}^m = (\hat{Q}^m_{\text{halt}}, \hat{Q}^m_{\text{continue}})$ of the “halt” and “continue” actions:
+
+    \hat{Q}^m = \sigma(\theta_Q^\top z_H^{mNT}) ,
+
+where $\sigma$ denotes the sigmoid function applied element-wise. The halt or continue action is chosen using a randomized strategy as detailed next.
Let $M_{\max}$ denote the maximum number of segments (a fixed hyperparameter) and $M_{\min}$ denote the minimum number of segments (a random variable). The value of $M_{\min}$ is determined stochastically: with probability $\varepsilon$, it is sampled uniformly from the set $\{2, \dots, M_{\max}\}$ (to encourage longer thinking), and with probability $1 - \varepsilon$, it is set to 1. The halt action is selected under two conditions: when the segment count surpasses the maximum threshold $M_{\max}$, or when the estimated halt value $\hat{Q}_{\text{halt}}$ exceeds the estimated continue value $\hat{Q}_{\text{continue}}$ and the segment count has reached at least the minimum threshold $M_{\min}$.
+
+The Q-head is updated through a Q-learning algorithm, which is defined on the following episodic Markov Decision Process (MDP). The state of the MDP at segment $m$ is $z^m$, and the action space is {halt, continue}. Choosing the action “halt” terminates the episode and returns a binary reward indicating prediction correctness, i.e., $\mathbb{1}\{\hat{y}^m = y\}$. Choosing “continue” yields a reward of 0 and the state transitions to $z^{m+1}$. Thus, the Q-learning targets for the two actions $\hat{G}^m = (\hat{G}^m_{\text{halt}}, \hat{G}^m_{\text{continue}})$ are given by
+
+    \hat{G}^m_{\text{halt}} = \mathbb{1}\{\hat{y}^m = y\} ,
+
+    \hat{G}^m_{\text{continue}} = \begin{cases} \hat{Q}^{m+1}_{\text{halt}} , & \text{if } m \ge M_{\max} , \\ \max(\hat{Q}^{m+1}_{\text{halt}}, \hat{Q}^{m+1}_{\text{continue}}) , & \text{otherwise} . \end{cases}
+
+We can now define the loss function of our learning procedure. The overall loss for each supervision segment combines both the Q-head loss and the sequence-to-sequence loss:
+
+    L^m_{\text{ACT}} = \mathrm{LOSS}(\hat{y}^m, y) + \mathrm{BINARYCROSSENTROPY}(\hat{Q}^m, \hat{G}^m) .
+
+Minimizing the above loss enables both accurate predictions and nearly optimal stopping decisions.
+
+Selecting the “halt” action ends the supervision loop. In practice, sequences are processed in batches, which can be easily handled by substituting any halted sample in the batch with a fresh sample from the dataloader.
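The halting rule and the Q-learning targets described above can be sketched as plain Python functions. This is a hedged illustration, not the released code: the function names and arguments are hypothetical, and treating "surpasses the maximum threshold" as `m >= m_max` (so that at most M_max segments run) is an assumption of this sketch.

```python
import random

def sample_m_min(m_max, eps, rng=random):
    """With probability eps draw M_min uniformly from {2, ..., m_max}; otherwise use 1."""
    if m_max >= 2 and rng.random() < eps:
        return rng.randint(2, m_max)  # randint includes both endpoints
    return 1

def should_halt(m, q_halt, q_continue, m_min, m_max):
    """Decide whether to halt after segment m (1-indexed).

    Assumption: reaching m_max counts as hitting the limit.
    """
    if m >= m_max:
        return True
    return q_halt > q_continue and m >= m_min

def q_targets(is_correct, next_q_halt, next_q_continue, m, m_max):
    """Return (G_halt, G_continue) for segment m, following the targets above."""
    g_halt = 1.0 if is_correct else 0.0
    if m >= m_max:
        g_continue = next_q_halt
    else:
        g_continue = max(next_q_halt, next_q_continue)
    return g_halt, g_continue

# A confident halt estimate is ignored until the sampled minimum is reached.
print(should_halt(m=1, q_halt=0.9, q_continue=0.1, m_min=3, m_max=8))  # False
print(should_halt(m=3, q_halt=0.9, q_continue=0.1, m_min=3, m_max=8))  # True
```

Sampling a larger M_min on some examples forces the model to keep computing even when the Q-head already prefers halting, which is how the text describes encouraging longer thinking.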
+
+Figure 5 presents a performance comparison between two HRM variants: one incorporating ACT and another employing a fixed computational step count equivalent to ACT’s $M_{\max}$ parameter. It shows that ACT effectively adapts its computational resources based on task complexity, achieving significant computational savings with minimal impact on performance.
+
+Inference-time scaling. An effective neural model should exploit additional computational resources during inference to enhance performance. As illustrated in Figure 5-(c), HRM seamlessly achieves inference-time scaling by simply increasing the computational limit parameter, $M_{\max}$, without requiring further training or architectural modifications.
+
+Additional compute is especially effective for tasks that demand deeper reasoning. On Sudoku, a problem that often requires long-term planning, HRM exhibits strong inference-time scaling. On the other hand, we find that extra computational resources yield minimal gains on the ARC-AGI challenge, as solutions generally require only a few transformations.
+
+[Figure 5 plots. (a) ACT Compute Spent: mean compute steps versus M (Fixed) or Mmax (ACT), for Fixed M and ACT (Mmax limit). (b) ACT Performance: accuracy (%) versus M (Fixed) or Mmax (ACT), for Fixed M and ACT (Mmax limit). (c) Inference-time scaling: accuracy (%) versus inference Mmax, for models trained with Mmax = 2, 4, and 8.]
+
+Figure 5: Effectiveness of Adaptive Computation Time (ACT) on Sudoku-Extreme-Full. (a) Mean compute steps used by models with ACT versus models with a fixed number of compute steps (M). ACT maintains a low and stable number of average compute steps even as the maximum limit ($M_{\max}$) increases. (b) Accuracy comparison. The ACT model achieves performance comparable to the fixed-compute model while utilizing substantially fewer computational steps on average.
(c) Inference-time scalability. Models trained with a specific $M_{\max}$ can generalize to higher computational limits during inference, leading to improved accuracy. For example, a model trained with $M_{\max} = 8$ continues to see accuracy gains when run with $M_{\max} = 16$ during inference.
+
+Stability of Q-learning in ACT. The deep Q-learning that underpins our ACT mechanism is known to be prone to instability, often requiring stabilization techniques such as replay buffers and target networks [48], which are absent in our design. Our approach, however, achieves stability through the intrinsic properties of our model and training procedure. Recent theoretical work by Gallici et al. [49] shows that Q-learning can achieve convergence if network parameters are bounded, weight decay is incorporated during training, and post-normalization layers are implemented. Our model satisfies these conditions through its Post-Norm architecture that employs RMSNorm (a layer normalization variant) and the AdamW optimizer. AdamW has been shown to solve an $L_\infty$-constrained optimization problem, ensuring that model parameters remain bounded by $1/\lambda$ [50].
+
+Architectural details. We employ a sequence-to-sequence architecture for HRM. Both input and output are represented as token sequences: $x = (x_1, \dots, x_l)$ and $y = (y_1, \dots, y_{l'})$ respectively. The model includes an embedding layer $f_I$ that converts discrete tokens into vector representations, and an output head $f_O(z; \theta_O) = \mathrm{softmax}(\theta_O z)$ that transforms hidden states into token probability distributions $\hat{y}$. For small-sample experiments, we replace softmax with stablemax [51] to improve generalization performance. The sequence-to-sequence loss is averaged over all tokens, $\mathrm{LOSS}(\hat{y}, y) = -\frac{1}{l'} \sum_{i=1}^{l'} \log p(y_i)$, where $p(y_i)$ is the probability that distribution $\hat{y}_i$ assigns to token $y_i$.
The initial hidden states $z^0$ are sampled from a truncated normal distribution with a standard deviation of 1 and truncation at 2, and are kept fixed throughout training.
+
+Both the low-level and high-level recurrent modules $f_L$ and $f_H$ are implemented using encoder-only Transformer [52] blocks with identical architectures and dimensions. These modules take multiple inputs, and we use straightforward element-wise addition to combine them, though more sophisticated merging techniques such as gating mechanisms could potentially improve performance and are left for future work. For all Transformer blocks in this work, including those in the baseline models, we incorporate the enhancements found in modern LLMs (based on Llama [53] architectures). These improvements include Rotary Positional Encoding [54], Gated Linear Units [55], RMSNorm [56], and the removal of bias terms from linear layers.
+
+Furthermore, both HRM and recurrent Transformer models implement a Post-Norm architecture with weights initialized via truncated LeCun Normal initialization [57,58,59], while the scale and bias parameters are excluded from RMSNorm. All parameters are optimized using the Adam-atan2 optimizer [60], a scale-invariant variant of Adam [61], combined with a constant learning rate that includes linear warm-up.
+
+[Figure 6 panels: (a) ARC-AGI, (b) Sudoku-Hard (example puzzle and its solution), (c) Maze navigation, (d) Sudoku-Extreme subset difficulty.]
+
+Figure 6: Left: Visualization of benchmark tasks. Right: Difficulty of Sudoku-Extreme examples.
+
+
+3 Results
+
+This section begins by describing the ARC-AGI, Sudoku, and Maze benchmarks, followed by an overview of the baseline models and their results.
Figure 6-(a,b,c) presents a visual representation of the three benchmark tasks, which are selected to evaluate various reasoning abilities in AI models.
+
+3.1 Benchmarks
+
+ARC-AGI Challenge. The ARC-AGI benchmark evaluates general fluid intelligence through IQ-test-like puzzles that require inductive reasoning [27]. The initial version, ARC-AGI-1, presents challenges as input-label grid pairs that force AI systems to extract and generalize abstract rules from just a few examples. Each task provides a few input–output demonstration pairs (usually 2–3) and a test input. An AI model has two attempts to produce the correct output grid. Although some believe that mastering ARC-AGI would signal true artificial general intelligence, its primary purpose is to expose the current roadblocks in AGI progress. In fact, both conventional deep learning methods and CoT techniques have faced significant challenges with ARC-AGI-1, primarily because it requires the ability to generalize to entirely new tasks [28].
+
+Addressing the limitations identified in ARC-AGI-1, ARC-AGI-2 significantly expands the benchmark by providing a more comprehensive and carefully refined collection of tasks. These new tasks emphasize deeper compositional reasoning, multi-step logic, contextual rule application, and symbolic abstraction. Human calibration studies show these tasks are challenging but doable for people, while being much harder for current AI systems, offering a clearer measure of general reasoning abilities [29].
+
+Sudoku-Extreme. Sudoku is a 9×9 logic puzzle, requiring each row, column, and 3×3 block to contain the digits 1–9 exactly once. A prediction is considered correct if it exactly matches the puzzle’s unique solution. Sudoku’s complex logical structure makes it a popular benchmark for evaluating logical reasoning in machine learning [62,63,64].
+The most frequently used Sudoku dataset in research, namely the Kaggle dataset [65], can be fully solved using elementary single-digit techniques [66]. The minimal 17-clue puzzles [62], another widely-used collection, might seem more challenging due to their small number of clues. However, this perception is misleading: since 17 is the smallest number of clues with which a Sudoku puzzle can have a unique solution, these hints need to be highly orthogonal to each other. This orthogonal arrangement leads to many direct, easily-resolved solution paths [67].
+We introduce Sudoku-Extreme, a more challenging dataset that is compiled from the aforementioned easy datasets as well as puzzles recognized by the Sudoku community as exceptionally difficult for human players:
+• Easy puzzles compiled from Kaggle, 17-clue, plus unbiased samples from the Sudoku puzzle distribution [67]: totaling 1,149,158 puzzles.
+• Challenging puzzles compiled from Magictour 1465, Forum-Hard and Forum-Extreme subsets: totaling 3,104,157 puzzles.
+The compiled data then undergo a strict 90/10 train-test split, ensuring that the test set puzzles cannot be derived through equivalent transformations of any training samples. Sudoku-Extreme is a down-sampled subset of this data containing 1000 training examples. We use Sudoku-Extreme in our main experiments (Figure 1), which focus on small-sample learning scenarios. To guarantee convergence and control overfitting effects in our analysis experiments (Figures 2, 3 and 5), we use the complete training data, Sudoku-Extreme-Full, containing 3,831,994 examples.
+We measure puzzle difficulty by counting the number of search backtracks (“guesses”) required by a smart Sudoku solver program tdoku, which uses propositional logic to reduce the number of guesses [67].
+Our Sudoku-Extreme dataset exhibits a mean difficulty of 22 backtracks per puzzle, significantly higher than existing datasets, including the recent handmade puzzles of Sudoku-Bench [68], which average just 0.45 backtracks per puzzle. These subset complexity levels are shown in Figure 6-(d).
+Maze-Hard. This task involves finding the optimal path in a 30×30 maze, making it interpretable and frequently used for training LLMs in search tasks [69,70,71]. We adopt the instance generation procedure of Lehnert et al. [71], but introduce an additional filter to retain only those instances whose difficulty exceeds 110. Here, “difficulty” is defined as the length of the shortest path, which aligns with the linear time complexity of the wavefront breadth-first search algorithm on GPUs [72]. A path is considered correct if it is valid and optimal, that is, the shortest route from the start to the goal. The training and test set both include 1000 examples.
+
+3.2 Evaluation Details
+For all benchmarks, HRM models were initialized with random weights and trained in the sequence-to-sequence setup using the input-output pairs. The two-dimensional input and output grids were flattened and then padded to the maximum sequence length. The resulting performance is shown in Figure 1. Remarkably, HRM attains these results with just ~1000 training examples per task, and without pre-training or CoT labels.
+For the ARC-AGI challenge, we start with (1) all demonstration and test input-label pairs from the training set, and (2) all demonstration pairs along with test inputs from the evaluation set. The dataset is augmented by applying translations, rotations, flips, and color permutations to the puzzles. Each task example is prepended with a learnable special token that represents the puzzle it belongs to.
+At test time, we proceed as follows for each test input in the evaluation set: (1) Generate and solve 1000 augmented variants and, for each, apply the inverse-augmentation transform to obtain a prediction. (2) Choose the two most popular predictions as the final outputs (see footnote 3). All reported results are obtained by comparing the outputs with the withheld test labels from the evaluation set.
+We augment Sudoku puzzles by applying band and digit permutations, while data augmentation is disabled for Maze tasks. Both tasks undergo only a single inference pass.
+For ARC-AGI, the scores of the CoT models are taken from the official leaderboard [29], while for Sudoku and Maze, the scores are obtained by evaluating through the corresponding API.
+In Figure 1, the baselines are grouped based on whether they are pre-trained and use CoT, or neither. The “Direct pred” baseline denotes “direct prediction without CoT and pre-training”; it retains the exact training setup of HRM but swaps in a Transformer architecture. Interestingly, on ARC-AGI-1, “Direct pred” matches the performance of Liao and Gu [73], who built a carefully designed, domain-specific equivariant network for learning the ARC-AGI task from scratch, without pre-training. By substituting the Transformer architecture with HRM’s hierarchical framework and implementing ACT, we achieve more than a twofold performance improvement.
+On the Sudoku-Extreme and Maze-Hard benchmarks, the performance gap between HRM and the baseline methods is significant, as the baselines almost never manage to solve the tasks. These benchmarks, which demand lengthy reasoning traces, are particularly difficult for CoT-based methods. With only 1000 training examples, the “Direct pred” baseline (which employs an 8-layer Transformer identical in size to HRM) fails entirely on these challenging reasoning problems.
+When trained on the larger Sudoku-Extreme-Full dataset, however, “Direct pred” can solve some easy Sudoku puzzles and reaches 16.9% accuracy (see Figure 2). Lehnert et al. [71] showed that a large vanilla Transformer model with 175M parameters, trained on 1 million examples across multiple trials, achieved only marginal success on 30×30 Maze tasks, with accuracy below 20% using the pass@64 evaluation metric.
+
+Footnote 3: The ARC-AGI allows two attempts for each test input.
+
+3.3 Visualization of intermediate timesteps
+Although HRM demonstrates strong performance on complex reasoning tasks, it raises an intriguing question: what underlying reasoning algorithms does the HRM neural network actually implement? Addressing this question is important for enhancing model interpretability and developing a deeper understanding of the HRM solution space.
+While a definitive answer lies beyond our current scope, we begin our investigation by analyzing state trajectories and their corresponding solution evolution. More specifically, at each timestep $i$ and given the low-level and high-level state pair ($z_L^i$ and $z_H^i$), we perform a preliminary forward pass through the H-module to obtain $\bar{z}^i = f_H(z_H^i, z_L^i; \theta_H)$ and its corresponding decoded prediction $\bar{y}^i = f_O(\bar{z}^i; \theta_O)$. The prediction $\bar{y}^i$ is then visualized in Figure 7.
+In the Maze task, HRM appears to initially explore several potential paths simultaneously, subsequently eliminating blocked or inefficient routes, then constructing a preliminary solution outline followed by multiple refinement iterations. In Sudoku, the strategy resembles a depth-first search approach, where the model appears to explore potential solutions and backtracks when it hits dead ends. HRM uses a different approach for ARC tasks, making incremental adjustments to the board and iteratively improving it until reaching a solution. Unlike Sudoku, which involves frequent backtracking, the ARC solution path follows a more consistent progression similar to hill-climbing optimization.
+Importantly, the model shows that it can adapt to different reasoning approaches, likely choosing an effective strategy for each particular task. Further research is needed to gain more comprehensive insights into these solution strategies.
+
+[Figure 7 panels: intermediate predictions per timestep for Maze-Hard (i = 0 to 6), Sudoku-Extreme (initial grid, then i = 0 to 7), and ARC-AGI-2 tasks 7666fa5d (example input, example output, test input, i = 0 to 4) and 7b80bb43 (example input, example output, test input, i = 0 to 6)]
+Figure 7: Visualization of intermediate predictions by HRM on benchmark tasks. Top: Maze-Hard, where blue cells indicate the predicted path. Middle: Sudoku-Extreme, where bold cells represent initial givens; red highlights cells violating Sudoku constraints; grey shading indicates changes from the previous timestep. Bottom: ARC-AGI-2 Task, with the provided example input-output pair on the left and the intermediate steps solving the test input on the right.
+
+
+4 Brain Correspondence
+A key principle from systems neuroscience is that a brain region’s functional repertoire (its ability to handle diverse and complex tasks) is closely linked to the dimensionality of its neural representations [75,76]. Higher-order cortical areas, responsible for complex reasoning and decision-making, must handle a wide variety of tasks, demanding more flexible and context-dependent processing [77]. In dynamical systems, this flexibility is often realized through higher-dimensional state-space trajectories, which allow for a richer repertoire of potential computations [78].
+This principle gives rise to an observable dimensionality hierarchy, where a region’s position in the processing hierarchy correlates with its effective dimensionality. To quantify this phenomenon, we can examine the Participation Ratio (PR), which serves as a standard measure of the effective dimensionality of a high-dimensional representation [79]. The PR is calculated using the formula
+$$\mathrm{PR} = \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2},$$
+where $\{\lambda_i\}$ are the eigenvalues of the covariance matrix of neural trajectories. Intuitively, a higher PR value signifies that variance is distributed more evenly across many dimensions, corresponding to a higher-dimensional representation. Conversely, a lower PR value indicates that variance is concentrated in only a few principal components, reflecting a more compact, lower-dimensional structure.
+
+[Figure 8 panels (a) to (f); recoverable axis labels: Participation Ratio (PR) versus position in the hierarchy]
+Figure 8: Hierarchical Dimensionality Organization in the HRM and Mouse Cortex. (a,b) are adapted from Posani et al. [74]. (a) Anatomical illustration of mouse cortical areas, color-coded by functional modules. (b) Correlation between Participation Ratio (PR), a measure of effective neural dimensionality, and hierarchical position across different mouse cortical areas. Higher positions in the hierarchy (e.g., MOs, ACAd) exhibit significantly higher PR values compared to lower sensory areas (e.g., SSp-n), with a Spearman correlation coefficient of ρ = 0.79 (P = 0.0003). (c,d) Trained HRM. (c) PR scaling of the trained HRM with task diversity. The dimensionality of the high-level module ($z_H$) scales with the number of unique tasks (trajectories) included in the analysis, indicating an adaptive expansion of its representational capacity. In contrast, the low-level module’s ($z_L$) dimensionality remains stable. (d) PR values for the low-level ($z_L$, PR = 30.22) and high-level ($z_H$, PR = 89.95) modules of the trained HRM, computed from neural activity during 100 unique Sudoku-solving trajectories. A clear dimensionality hierarchy is observed, with the high-level module operating in a substantially higher-dimensional space. (e,f) Analysis of Untrained Network. To verify that the dimensionality hierarchy is an emergent property of training, the same analyses were performed on an untrained HRM with random weights. (e) In contrast to the trained model’s scaling in (c), the dimensionality of both modules in the untrained model remains low and stable, failing to scale with the number of tasks. (f) Similarly, contrasting with the clear separation in (d), the PR values for the untrained model’s modules ($z_L$, PR = 42.09; $z_H$, PR = 40.75) are low and nearly identical, showing no evidence of hierarchical separation. This confirms that the observed hierarchical organization of dimensionality is a learned property that emerges through training, not an artifact of the model’s architecture.
+
+The dimensionality hierarchy can be observed, for example, in the mouse cortex, where the PR of population activity increases monotonically from low-level sensory areas to high-level associative areas, supporting this link between dimensionality and functional complexity [74] (Figure 8 (a,b)).
+We evaluated whether HRM reproduces this neuroscientific principle by calculating the PR for both recurrent modules after training on the Sudoku-Extreme Full dataset. The PR computation used the covariance matrix derived from neural states gathered across multiple Sudoku-solving trajectories. The results show a striking parallel to the biological findings.
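As a minimal sketch of how a participation ratio of this kind could be computed from recorded states, the Python below uses only the standard library. It is not the authors' analysis code: the names `covariance_matrix` and `participation_ratio` and the (num_samples, num_units) input layout are assumptions for illustration. Instead of an explicit eigendecomposition it swaps in an equivalent identity for a symmetric covariance matrix C, namely that the sum of eigenvalues equals trace(C) and the sum of squared eigenvalues equals the sum of squared entries of C.

```python
from typing import List, Sequence


def covariance_matrix(states: Sequence[Sequence[float]]) -> List[List[float]]:
    """Sample covariance of `states`, assumed shape (num_samples, num_units)."""
    n, d = len(states), len(states[0])
    means = [sum(row[j] for row in states) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in states:
        centered = [row[j] - means[j] for j in range(d)]
        for a in range(d):
            ca = centered[a]
            for b in range(a, d):  # fill the upper triangle only
                cov[a][b] += ca * centered[b]
    for a in range(d):
        for b in range(a, d):
            cov[a][b] /= n - 1
            cov[b][a] = cov[a][b]  # mirror into the lower triangle
    return cov


def participation_ratio(states: Sequence[Sequence[float]]) -> float:
    """PR = (sum_i lambda_i)^2 / sum_i lambda_i^2 for the covariance eigenvalues.

    Uses trace(C) for the eigenvalue sum and the sum of squared entries of C
    (equal to trace(C @ C) for symmetric C) for the sum of squared eigenvalues.
    """
    cov = covariance_matrix(states)
    trace = sum(cov[i][i] for i in range(len(cov)))
    sum_sq = sum(v * v for row in cov for v in row)
    return trace * trace / sum_sq
```

Under this definition the value lies between 1 (all variance along one direction) and the number of units (variance spread evenly), which is the range in which the PR figures reported here should be read.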
+The low-level module’s state ($z_L$) occupies a relatively small subspace with a participation ratio of 30.22, whereas the high-level module’s state ($z_H$) operates in a substantially larger subspace with a participation ratio of 89.95, as shown in Figure 8(d). Furthermore, Figure 8(c) shows that increasing the number of unique tasks (trajectories) from 10 to 100 causes $z_H$ dimensionality to scale up accordingly, while $z_L$ dimensionality remains stable. These results suggest an emergent separation of representational capacity between the modules that parallels their functional roles.
+To confirm that this hierarchical organization is an emergent property of training, and not an artifact of the network’s architecture, we performed a control analysis: we initialized an identical HRM architecture with random weights and, without any training, measured the PR of its modules as the network processed the same task-specific inputs given to the trained model.
+The results, shown in Figure 8(e,f), reveal a stark contrast: the high-level and low-level modules of the untrained network exhibit no hierarchical separation, with their PR values remaining low and nearly indistinguishable from each other. This control analysis validates that the dimensionality hierarchy is an emergent property that arises as the model learns to perform complex reasoning.
+The high-to-low PR ratio in HRM ($z_H / z_L \approx 2.98$) is comparable to that measured in the mouse cortex ($\approx 2.25$). In contrast, conventional deep networks often exhibit neural collapse, where last-layer features converge to a low-dimensional subspace [80,81,82]. HRM therefore departs from the collapse pattern and instead fosters a high-dimensional representation in its higher module.
+This is significant because such representations are considered crucial for cognitive flexibility and are a hallmark of higher-order brain regions like the prefrontal cortex (PFC), which is central to complex reasoning.
+This structural parallel suggests the model has discovered a fundamental organizational principle. By learning to partition its representations into a high-capacity, high-dimensional subspace ($z_H$) and a more specialized, low-dimensional one ($z_L$), HRM autonomously discovers an organizational principle that is thought to be fundamental for achieving robust and flexible reasoning in biological systems. This provides a potential mechanistic explanation for the model’s success on complex, long-horizon tasks that are intractable for models lacking such a differentiated internal structure.
+We emphasize, however, that this evidence is correlational. While a causal link could be tested via intervention (e.g., by constraining the H-module’s dimensionality), such methods are difficult to interpret in deep learning due to potential confounding effects on the training process itself. Thus, the causal necessity of this emergent hierarchy remains an important question for future investigation.
+
+
+5 Related Work
+Reasoning and algorithm learning. Given the central role of reasoning problems and their close relation to algorithms, researchers have long explored neural architectures that enable algorithm learning from training instances. This line of work includes Neural Turing Machines (NTM) [83], the Differentiable Neural Computer (DNC) [84], and Neural GPUs [85], all of which construct iterative neural architectures that mimic computational hardware for algorithm execution, and are trained to learn algorithms from data. Another notable work in this area is Recurrent Relational Networks (RRN) [62], which executes algorithms on graph representations through graph neural networks.
+Recent studies have integrated algorithm learning approaches with Transformer-based architectures. Universal Transformers extend the standard Transformer model by introducing a recurrent loop over the layers and implementing an adaptive halting mechanism. Geiping et al. [86] demonstrate that looped Transformers can generalize to a larger number of recurrent steps during inference than what they were trained on. Shen et al. [16] propose adding continuous recurrent reasoning tokens to the Transformer. Finally, TransNAR [8] combines recurrent graph neural networks with language models.
+Building on the success of CoT-based reasoning, a line of work has introduced fine-tuning methods that use reasoning paths from search algorithms (like A*) as SFT targets [87,71,70].
+We also mention adaptive halting mechanisms designed to allocate additional computational resources to more challenging problems. This includes the Adaptive Computation Time (ACT) for RNNs [88] and follow-up research like PonderNet [89], which aims to improve the stability of this allocation process.
+HRM further pushes the boundary of algorithm learning through a brain-inspired computational architecture that achieves exceptional data efficiency and model expressiveness, successfully discovering complex and diverse algorithms from just 1000 training examples.
+Brain-inspired reasoning architectures. Developing a model with the reasoning power of the brain has long been a goal in brain-inspired computing. Spaun [90] is one notable example, which uses spiking neural networks to create distinct modules corresponding to brain regions like the visual cortex and prefrontal cortex. This design enables an architecture to perform a range of cognitive tasks, from memory recall to simple reasoning puzzles. However, its reasoning relies on hand-designed algorithms, which may limit its ability to learn new tasks.
+Another significant model is the Tolman-Eichenbaum Machine (TEM) [91], which is inspired by the hippocampal-entorhinal system’s role in spatial and relational memory tasks. TEM proposes that medial entorhinal cells create a basis for structural knowledge, while hippocampal cells link this basis to sensory information. This allows TEM to generalize and explains the emergence of various cell types like grid, border, and place cells. Another approach involves neural sampling models [92], which view the neural signaling process as inference over a distribution, functioning similarly to a Boltzmann machine. These models often require hand-made rules to be set up for solving a specific reasoning task. In essence, while prior models are restricted to simple reasoning problems, HRM is designed to solve complex tasks that are hard for even advanced LLMs, without pre-training or task-specific manual design.
+Hierarchical memory. The hierarchical multi-timescale structure also plays an important role in how the brain processes memory. Models such as Hierarchical Sequential Models [93] and Clockwork RNN [94] use multiple recurrent modules that operate at varying time scales to more effectively capture long-range dependencies within sequences, thereby mitigating the forgetting issue in RNNs. Similar mechanisms have also been adopted in linear attention methods for memorizing long contexts (see the Discussions section). Since HRM focuses on reasoning, full attention is applied for simplicity. Incorporating hierarchical memory into HRM could be a promising future direction.
+
+
+6 Discussions
+Turing-completeness of HRM. Like earlier neural reasoning algorithms including the Universal Transformer [95], HRM is computationally universal when given sufficient memory and time.
+In other words, it falls into the category of models that can simulate any Turing machine, overcoming the computational limitations of standard Transformers discussed previously in the introduction. Given that earlier neural algorithm reasoners were trained as recurrent neural networks, they suffer from premature convergence and memory-intensive BPTT. Therefore, in practice, their effective computational depth remains limited, though still deeper than that of a standard Transformer. By resolving these two challenges and being equipped with adaptive computation, HRM could be trained on long reasoning processes, solve complex puzzles requiring intensive depth-first search and backtracking, and move closer to practical Turing-completeness.
+Reinforcement learning with chain-of-thought. Beyond fine-tuning using human-annotated CoT, reinforcement learning (RL) represents another widely adopted training methodology. However, recent evidence suggests that RL primarily unlocks existing CoT-like capabilities rather than discovering fundamentally new reasoning mechanisms [96,97,98,99]. Additionally, CoT-training with RL is known for its instability and data inefficiency, often requiring extensive exploration and careful reward design. In contrast, HRM takes feedback from dense gradient-based supervision rather than relying on a sparse reward signal. Moreover, HRM operates naturally in a continuous space, which is biologically plausible and avoids allocating the same computational resources to each token, even though tokens vary in their reasoning and planning complexity [16].
+Linear attention. Recurrence has been explored not only for its capability in universal computation, but also as a means to replace the attention mechanism in Transformers, which suffers from quadratic time and memory complexity [100].
+Recurrent alternatives offer a more efficient design by processing input tokens sequentially and predicting the next token at each time step, similar to early RNN-based language models.
+Some linear-attention variants, such as Log-linear Attention [101], share an RNN-like state-update that can be interpreted as propagating multi-timescale summary statistics, thereby retaining long-range context without the quadratic memory growth of standard self-attention. However, substituting the attention mechanism alone does not change the fact that Transformers are still fixed-depth, and require CoT as a compensatory mechanism. Notably, linear attention can operate with a reduced key-value cache over extended contexts, making it more suitable for deployment on resource-constrained edge devices.
+
+
+7 Conclusion
+This work introduces the Hierarchical Reasoning Model, a brain-inspired architecture that leverages hierarchical structure and multi-timescale processing to achieve substantial computational depth without sacrificing training stability or efficiency. With only 27M parameters and training on just 1000 examples, HRM effectively solves challenging reasoning problems such as ARC, Sudoku, and complex maze navigation, tasks that typically pose significant difficulties for contemporary LLM and chain-of-thought models.
+Although the brain relies heavily on hierarchical structures to enable most cognitive processes, these concepts have largely remained confined to academic literature rather than being translated into practical applications. The prevailing AI approach continues to favor non-hierarchical models. Our results challenge this established paradigm and suggest that the Hierarchical Reasoning Model represents a viable alternative to the currently dominant chain-of-thought reasoning methods, advancing toward a foundational framework capable of Turing-complete universal computation.
+Acknowledgements We thank Mingli Yuan, Ahmed Murtadha Hasan Mahyoub and Hengshuai +Yao for their insightful discussions and valuable feedback throughout the course of this work. + + + + + 18 +References + 1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. + http://www.deeplearningbook.org. + 2. Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image + recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), + pages 770–778, 2015. + 3. Lena Strobl. Average-hard attention transformers are constant-depth uniform threshold + circuits, 2023. + 4. Tom Bylander. Complexity results for planning. In Proceedings of the 12th International Joint + Conference on Artificial Intelligence - Volume 1, IJCAI’91, page 274–279, San Francisco, + CA, USA, 1991. Morgan Kaufmann Publishers Inc. ISBN 1558601600. + 5. William Merrill and Ashish Sabharwal. A logic for expressing log-precision transformers. In + Neural Information Processing Systems, 2023. + 6. David Chiang. Transformers in DLOGTIME-uniform TC0 . Transactions on Machine + Learning Research, 2025. + 7. Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul McVay, Michael + Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search + dynamics bootstrapping. In First Conference on Language Modeling, 2024. + 8. Wilfried Bounsi, Borja Ibarz, Andrew Dudzik, Jessica B. Hamrick, Larisa Markeeva, Alex + Vitvitskyi, Razvan Pascanu, and Petar Velivckovi’c. Transformers meet neural algorithmic + reasoners. ArXiv, abs/2406.09308, 2024. + 9. William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision + transformers. Transactions of the Association for Computational Linguistics, 11:531–545, + 2023. doi: 10.1162/tacl_a_00562. +10. Jason Wei, Yi Tay, et al. Chain-of-thought prompting elicits reasoning in large language + models, 2022. arXiv preprint arXiv:2201.11903. +11. 
William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of + thought. In ICLR, 2024. +12. Xinyun Chen, Ryan A. Chi, Xuezhi Wang, and Denny Zhou. Premise order matters in + reasoning with large language models. ArXiv, abs/2402.08939, 2024. +13. Rongwu Xu, Zehan Qi, and Wei Xu. Preemptive answer "attacks" on chain-of-thought + reasoning. In Annual Meeting of the Association for Computational Linguistics, 2024. +14. Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius + Hobbhahn. Will we run out of data? limits of llm scaling based on human-generated data. + arXiv preprint arXiv:2211.04325, 2022. +15. Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, + Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A comprehensive + survey on latent chain-of-thought reasoning, 2025. +16. Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu. + Training large language models to reason in a continuous latent space. arXiv preprint + arXiv:2412.07423, 2024. + + + 19 +17. Evelina Fedorenko, Steven T Piantadosi, and Edward AF Gibson. Language is primarily a + tool for communication rather than thought. Nature, 630(8017):575–586, 2024. +18. Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. + Deepnet: Scaling transformers to 1,000 layers. IEEE Transactions on Pattern Analysis and + Machine Intelligence, 2024. +19. Timothy P Lillicrap and Adam Santoro. Backpropagation through time and the brain. Current + Opinion in Neurobiology, 55:82–89, 2019. ISSN 0959-4388. doi: https://doi.org/10.1016/j. + conb.2019.01.011. +20. John D Murray, Alberto Bernacchia, David J Freedman, Ranulfo Romo, Jonathan D Wallis, + Xinying Cai, Camillo Padoa-Schioppa, Tatiana Pasternak, Hyojung Seo, Daeyeol Lee, et al. + A hierarchy of intrinsic timescales across primate cortex. Nature neuroscience, 17(12):1661– + 1663, 2014. +21. 
Roxana Zeraati, Yan-Liang Shi, Nicholas A Steinmetz, Marc A Gieselmann, Alexander + Thiele, Tirin Moore, Anna Levina, and Tatiana A Engel. Intrinsic timescales in the + visual cortex change with selective attention and reflect spatial connectivity. Nature + communications, 14(1):1858, 2023. +22. Julia M Huntenburg, Pierre-Louis Bazin, and Daniel S Margulies. Large-scale gradients in + human cortical organization. Trends in cognitive sciences, 22(1):21–31, 2018. +23. Victor AF Lamme and Pieter R Roelfsema. The distinct modes of vision offered by + feedforward and recurrent processing. Trends in neurosciences, 23(11):571–579, 2000. +24. Andre M Bastos, W Martin Usrey, Rick A Adams, George R Mangun, Pascal Fries, and Karl J + Friston. Canonical microcircuits for predictive coding. Neuron, 76(4):695–711, 2012. +25. Klara Kaleb, Barbara Feulner, Juan Gallego, and Claudia Clopath. Feedback control guides + credit assignment in recurrent neural networks. Advances in Neural Information Processing + Systems, 37:5122–5144, 2024. +26. Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton. + Backpropagation and the brain. Nature Reviews Neuroscience, 21(6):335–346, 2020. +27. François Chollet. On the measure of intelligence (abstraction and reasoning corpus), 2019. + arXiv preprint arXiv:1911.01547. +28. Francois Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. Arc prize 2024: + Technical report. ArXiv, abs/2412.04604, 2024. +29. Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc- + agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831, + 2025. +30. György Buzsáki. Gamma, alpha, delta, and theta oscillations govern cognitive processes. + International Journal of Psychophysiology, 39:241–248, 2000. +31. György Buzsáki. Rhythms of the Brain. Oxford university press, 2006. +32. Anja Pahor and Norbert Jaušovec. 
Theta–gamma cross-frequency coupling relates to the level + of human intelligence. Intelligence, 46:283–290, 2014. +33. Adriano BL Tort, Robert W Komorowski, Joseph R Manns, Nancy J Kopell, and Howard + Eichenbaum. Theta–gamma coupling increases during the learning of item–context + associations. Proceedings of the National Academy of Sciences, 106(49):20942–20947, 2009. + + + 20 +34. Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between + energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11, + 2016. +35. Guillaume Bellec, Franz Scherr, Anand Subramoney, Elias Hajek, Darjan Salaj, Robert + Legenstein, and Wolfgang Maass. A solution to the learning dilemma for recurrent + networks of spiking neurons. Nature Communications, 11, 07 2020. doi: 10.1038/ + s41467-020-17236-y. +36. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in + Neural Information Processing Systems, pages 690–701, 2019. +37. Zhengyang Geng, Xinyu Zhang, Shaojie Bai, Yisen Wang, and Zhouchen Lin. On training + implicit models. ArXiv, abs/2111.05177, 2021. +38. Katarina Begus and Elizabeth Bonawitz. The rhythm of learning: Theta oscillations as an + index of active learning in infancy. Developmental Cognitive Neuroscience, 45:100810, 2020. + ISSN 1878-9293. doi: https://doi.org/10.1016/j.dcn.2020.100810. +39. Shaojie Bai, Zhengyang Geng, Yash Savani, and J. Zico Kolter. Deep Equilibrium + Optical Flow Estimation . In 2022 IEEE/CVF Conference on Computer Vision and Pattern + Recognition (CVPR), pages 610–620, 2022. +40. Zaccharie Ramzi, Florian Mannel, Shaojie Bai, Jean-Luc Starck, Philippe Ciuciu, and + Thomas Moreau. Shine: Sharing the inverse estimate from the forward pass for bi-level + optimization and implicit models. ArXiv, abs/2106.00553, 2021. +41. Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Stabilizing equilibrium models by jacobian + regularization. 
In International Conference on Machine Learning, 2021. +42. Daniel Kahneman and P Egan. Thinking, fast and slow (farrar, straus and giroux, new york), + 2011. +43. Matthew D Lieberman. Social cognitive neuroscience: a review of core processes. Annu. Rev. + Psychol., 58(1):259–289, 2007. +44. Randy L Buckner, Jessica R Andrews-Hanna, and Daniel L Schacter. The brain’s default + network: anatomy, function, and relevance to disease. Annals of the New York Academy of + Sciences, 1124(1):1–38, 2008. +45. Marcus E Raichle. The brain’s default mode network. Annual review of neuroscience, 38(1): + 433–447, 2015. +46. Andrew Westbrook and Todd S Braver. Cognitive effort: A neuroeconomic approach. + Cognitive, Affective, & Behavioral Neuroscience, 15:395–415, 2015. +47. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT + Press, Cambridge, MA, 2018. +48. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan + Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. ArXiv, + abs/1312.5602, 2013. +49. Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, + Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning, + 2025. +50. Shuo Xie and Zhiyuan Li. Implicit bias of adamw: L inf norm constrained optimization. + ArXiv, abs/2404.04454, 2024. +51. Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, and Tolga Birdal. Grokking at the edge of + numerical stability. In The Thirteenth International Conference on Learning Representations, + 2025. +52. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, + Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural + information processing systems, pages 5998–6008, 2017. +53. Meta AI. Llama 3: State-of-the-art open weight language models. Technical report, Meta, + 2024. URL https://ai.meta.com/llama/. +54.
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: + Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. +55. Noam M. Shazeer. Glu variants improve transformer. ArXiv, abs/2002.05202, 2020. +56. Biao Zhang and Rico Sennrich. Root mean square layer normalization. ArXiv, + abs/1910.07467, 2019. +57. Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self- + normalizing neural networks. In Neural Information Processing Systems, 2017. +58. JAX Developers. jax.nn.initializers.lecun_normal. Google Research, 2025. URL + https://docs.jax.dev/en/latest/_autosummary/jax.nn.initializers.lecun_ + normal.html. Accessed June 22, 2025. +59. Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. + In Neural networks: Tricks of the trade, pages 9–50. Springer, 2002. +60. Katie E Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak, + Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and + Jeffrey Pennington. Scaling exponents across parameterizations and optimizers. In Forty-first + International Conference on Machine Learning, 2024. +61. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. +62. Rasmus Berg Palm, Ulrich Paquet, and Ole Winther. Recurrent relational networks. In Neural + Information Processing Systems, 2017. +63. Jieyi Long. Large language model guided tree-of-thought. ArXiv, abs/2305.08291, 2023. +64. Yilun Du, Jiayuan Mao, and Josh Tenenbaum. Learning iterative reasoning through energy + diffusion. ArXiv, abs/2406.11179, 2024. +65. Kyubyong Park. Can convolutional neural networks crack sudoku puzzles? https: + //github.com/Kyubyong/sudoku, 2018. +66. Single-digit techniques. https://hodoku.sourceforge.net/en/tech_singles.php. + Accessed: 2025-06-16. +67. Tom Dillon. Tdoku: A fast sudoku solver and generator. https://t-dillon.github.io/ + tdoku/, 2025. +68.
Jeffrey Seely, Yuki Imajuku, Tianyu Zhao, Edoardo Cetin, and Llion Jones. Sudoku-bench: + Evaluating creative reasoning with sudoku variants. arXiv preprint arXiv:2505.16135, 2025. +69. Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, and Llion Jones. Continuous + thought machines. arXiv preprint arXiv:2505.05522, 2025. +70. DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, and Qinqing Zheng. + Dualformer: Controllable fast and slow thinking by learning with randomized reasoning + traces, 2025. +71. Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul McVay, Michael + Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search + dynamics bootstrapping. In First Conference on Language Modeling, 2024. +72. Mubbasir Kapadia, Francisco Garcia, Cory D. Boatright, and Norman I. Badler. Dynamic + search on the gpu. In 2013 IEEE/RSJ International Conference on Intelligent Robots and + Systems, pages 3332–3337, 2013. doi: 10.1109/IROS.2013.6696830. +73. Isaac Liao and Albert Gu. Arc-agi without pretraining, 2025. URL https: + //iliao2345.github.io/blog_posts/arc_agi_without_pretraining/arc_agi_ + without_pretraining.html. +74. Lorenzo Posani, Shuqi Wang, Samuel P Muscinelli, Liam Paninski, and Stefano Fusi. + Rarely categorical, always high-dimensional: how the neural code changes along the cortical + hierarchy. bioRxiv, pages 2024–11, 2025. +75. Mattia Rigotti, Omri Barak, Melissa R. Warden, Xiao-Jing Wang, Nathaniel D. Daw, Earl K. + Miller, and Stefano Fusi. The importance of mixed selectivity in complex cognitive tasks. + Nature, 497:585–590, 2013. doi: 10.1038/nature12160. +76. Valerio Mante, David Sussillo, Krishna V. Shenoy, and William T. Newsome. Context- + dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78–84, + 2013. doi: 10.1038/nature12742. +77. Earl K. Miller and Jonathan D. Cohen. An integrative theory of prefrontal cortex function.
+ Annual Review of Neuroscience, 24(1):167–202, 2001. doi: 10.1146/annurev.neuro.24.1.167. +78. Wolfgang Maass. Real-time computing without stable states: a new framework for neural + computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002. doi: + 10.1162/089976602760407955. +79. Ege Altan, Sara A. Solla, Lee E. Miller, and Eric J. Perreault. Estimating the dimensionality + of the manifold underlying multi-electrode neural recordings. PLoS Computational Biology, + 17(11):e1008591, 2021. doi: 10.1371/journal.pcbi.1008591. +80. Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the + terminal phase of deep learning training. Proceedings of the National Academy of Sciences, + 117(40):24652–24663, 2020. doi: 10.1073/pnas.2015509117. +81. Cong Fang, Hangfeng He, Qi Long, and Weijie J. Su. Exploring deep neural networks via + layer–peeled model: Minority collapse in imbalanced training. Proceedings of the National + Academy of Sciences, 118(43):e2103091118, 2021. doi: 10.1073/pnas.2103091118. +82. Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. + A geometric analysis of neural collapse with unconstrained features. In Advances in Neural + Information Processing Systems, volume 34 of NeurIPS, pages 29820–29834, 2021. +83. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines, 2014. +84. Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka + Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John + Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. + Nature, 538(7626):471–476, 2016. + 85. Lukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In ICLR, 2016. + 86. Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. + Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein.
Scaling up test-time + compute with latent reasoning: A recurrent depth approach, 2025. + 87. Tiedong Liu and Kian Hsiang Low. Goat: Fine-tuned llama outperforms gpt-4 on arithmetic + tasks. ArXiv, abs/2305.14201, 2023. + 88. Alex Graves. Adaptive computation time for recurrent neural networks. ArXiv, + abs/1603.08983, 2016. + 89. Andrea Banino, Jan Balaguer, and Charles Blundell. Pondernet: Learning to ponder. ArXiv, + abs/2107.05407, 2021. + 90. Chris Eliasmith, Terrence C Stewart, Xuan Choo, Trevor Bekolay, Travis DeWolf, Yichuan + Tang, and Daniel Rasmussen. A large-scale model of the functioning brain. science, 338 + (6111):1202–1205, 2012. + 91. James CR Whittington, Timothy H Muller, Shirley Mark, Guifen Chen, Caswell Barry, Neil + Burgess, and Timothy EJ Behrens. The tolman-eichenbaum machine: unifying space and + relational memory through generalization in the hippocampal formation. Cell, 183(5):1249– + 1263, 2020. + 92. Lars Buesing, Johannes Bill, Bernhard Nessler, and Wolfgang Maass. Neural dynamics as + sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLoS + computational biology, 7(11):e1002211, 2011. + 93. Salah Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term + dependencies. In D. Touretzky, M.C. Mozer, and M. Hasselmo, editors, Advances in Neural + Information Processing Systems, volume 8. MIT Press, 1995. + 94. Jan Koutník, Klaus Greff, Faustino J. Gomez, and Jürgen Schmidhuber. A clockwork rnn. In + International Conference on Machine Learning, 2014. + 95. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. + Universal transformers, 2018. arXiv preprint arXiv:1807.03819. + 96. Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, + Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, + and Yelong Shen. 
Reinforcement learning for reasoning in large language models with one + training example, 2025. URL https://arxiv.org/abs/2504.20571. + 97. Niklas Muennighoff. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025. + 98. Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, + Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng + Zhang. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond, 2025. + 99. Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling, 2025. +100. Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms + through structured state space duality. ArXiv, abs/2405.21060, 2024. +101. Han Guo, Songlin Yang, Tarushii Goel, Eric P Xing, Tri Dao, and Yoon Kim. Log-linear + attention. arXiv preprint arXiv:2506.04761, 2025.
