# SRM Design Round 2: Claude's analysis (codex placeholder failed twice)

Building on `srm_design_codex.md`. Codex round 1 produced a solid joint-operator design; rounds 2-3 hit routing issues. This is my analysis of the 6 implementation questions.

## (1) Attention compatibility

**Recommendation: start with `mlp_t=True` for Sudoku, defer Lipschitz-attention to ARC phase.**

Reasoning: TRM's reported 87% on Sudoku 1k uses `mlp_t=True` (MLP over tokens, no self-attention). Their `pretrain_att_sudoku` (with attention) is reportedly ~75%. So for Sudoku specifically, attention is not helping: replacing it with an MLP gives **+12 percentage points of accuracy**. We can use MLP-over-tokens for both L and H updates; MLPs admit cheap Lipschitz bounds (Sandwich / AOL compose trivially with ReLU/GELU).

For ARC, where attention likely matters: use **LipsFormer-style attention** (Qi et al. 2023), which normalizes via L_∞ / cosine to bound the Lipschitz constant of softmax(QK^T)V. Or skip self-attention and use **cross-attention only** with bounded-norm keys.

## (2) Single vs two-state

**Keep two-state (h, ℓ), with single tied operator.**

Why: even though TRM ties weights, the two states play **functionally distinct roles**:

- ℓ (z_L) receives `z_H + input_embedding` injection: "deliberation conditioned on current answer + question"
- h (z_H) receives `z_L` injection: "answer update from deliberation"
- The hierarchy of update frequencies (H every T steps vs L every step) is preserved
- Empirical: TRM with H_cycles=3 L_cycles=6 outperforms vanilla recursion (H=1, L=18) on Sudoku

A single state z would force the same vector to serve as both "answer" and "scratchpad", losing the inductive bias that's empirically validated.

## (3) κ fixed vs learnable

**Fixed κ=0.9 to start, learnable with anchor penalty as v2.**

Phase 1 (validation): κ=0.9 fixed. Provably contractive, easy to analyze, no risk of κ-collapse.

Phase 2 (refinement): `κ = sigmoid(c) * (0.99 - 0.5) + 0.5` learnable, which keeps κ in (0.5, 0.99).
Add anchor penalty `λ_anchor · (κ - 0.9)²` to keep κ near the empirical sweet spot.

Why not learnable from the start: the empirical observation `success λ_1 ≈ -0.15 ⇔ κ ≈ 0.86` is from a trained HRM. SRM might find a different sweet spot, but starting at the empirically validated value avoids early-training pathology.

## (4) Power iteration cost

**Recommendation: AOL (Almost-Orthogonal Layers, Prach & Lampert 2022).**

Compared to alternatives:

- **Sandwich layers (Wang & Manchester 2023)**: tightest Lipschitz bound, but costlier per step, since their Cayley-transform parameterization needs a matrix inverse for every weight at every training step. For 40h training this is cost-prohibitive on 4× A6000.
- **AOL**: rescales each weight matrix by a diagonal factor computed in closed form from `WᵀW`, which gives a **guaranteed 1-Lipschitz layer with no iteration** at training time. Slight expressivity loss (the rescaling is conservative, so layers are only almost orthogonal and their gain can fall below 1). For our recursive setting, where we want a guaranteed bound, this trade-off is favorable.
- **Householder reflections**: also exactly orthogonal, but slower per forward pass than AOL.
- **Cached spectral estimates**: cheapest, but training becomes unstable when stale estimates miss real growth.

AOL is the right balance for our compute budget.

## (5) Multi-modal capacity

**Phase 1: strict-contraction SRM (unique fixed point). Phase 2: small Gaussian noise injection if ARC plateaus.**

Sudoku puzzles have unique solutions (at least, the Sudoku-Extreme dataset filters for unique solutions). A unique-fixed-point SRM should match GRAM's 97%: they get it through stable dynamics + multi-sample voting at inference, not through architectural multi-modality during training.

For ARC, where multiple valid completions could exist: add `z_{t+1} = T(z_t) + σ·ε` with small σ.
Mathematically:

- Strict contraction: T is K-Lipschitz with K < 1, σ=0
- Stochastic contraction: T is K-Lipschitz, σ small → a stationary distribution exists, concentrated near the fixed point but with Gaussian fluctuation
- As σ increases: the distribution spreads, eventually multi-modal if T has multiple "near-fixed-points"

The σ value should be tuned per task. For Sudoku σ=0; for ARC σ=0.05-0.2.

Mixture of K operators (alternative): too many extra parameters, harder to analyze, defer.

## (6) Pseudocode for one SRM forward step

```python
import math

import torch
import torch.nn.functional as F


def srm_step(h, l, x, theta, kappa=0.9, eta=1.0, alpha=1.0):
    """
    One SRM update step (PyTorch sketch).

    h, l: (B, seq, hidden), high-level and low-level states
    x: input embedding, (B, seq, hidden)
    theta: parameters (Sandwich/AOL block + gain params + biases); assumed to expose
        aol_block: 1-Lipschitz map on (B, seq, 2*hidden)
        bias_in: x -> (B, seq, 2*hidden); bias_h, bias_l: x -> (B, seq, hidden)
        gain_H_logits, gain_L_logits: tensors of shape (2,)
        U_HL, U_LH: orthogonal (hidden, hidden) mixing matrices
    kappa, eta, alpha: SRM hyperparameters
    """
    s = math.sqrt(eta)

    # 1. Joint feature map via AOL block (1-Lipschitz by construction).
    #    Concatenate h with sqrt(eta)*l so the operator acts under the
    #    weighted norm ‖z‖²_P = ‖h‖² + η‖l‖².
    z = torch.cat([h, s * l], dim=-1)               # (B, seq, 2*hidden)
    psi = theta.aol_block(z + theta.bias_in(x))     # (B, seq, 2*hidden), Lip ≤ 1
    psi_h, psi_l_scaled = psi.chunk(2, dim=-1)
    psi_l = psi_l_scaled / s                        # unscale back

    # 2. Block-gain matrix A: each row sums to kappa under the weighted metric.
    #    In scaled coordinates the l->h gain is a_HL/sqrt(eta) and the h->l gain
    #    is sqrt(eta)*a_LH, so the softmax outputs are converted accordingly.
    #    Rows alone do not bound the P-norm gain of A by kappa; the column sums
    #    must also be <= kappa (e.g. tie gain_L_logits to gain_H_logits).
    a_HH, a_HL_scaled = kappa * F.softmax(theta.gain_H_logits, dim=0)  # [a_HH, a_HL/sqrt(eta)]
    a_HL = a_HL_scaled * s
    a_LL, a_LH_scaled = kappa * F.softmax(theta.gain_L_logits, dim=0)  # [a_LL, sqrt(eta)*a_LH]
    a_LH = a_LH_scaled / s

    # Apply block gain. U_HL and U_LH act on the feature axis; v @ U.T is the
    # batched form of U @ v for each token vector v.
    Az_h = a_HH * psi_h + a_HL * (psi_l @ theta.U_HL.T)
    Az_l = a_LH * (psi_h @ theta.U_LH.T) + a_LL * psi_l

    # 3. Damped update. The bound (1 - alpha) + alpha*kappa is below 1 for any
    #    alpha in (0, 1] only when kappa < 1; damping does not rescue kappa >= 1.
    h_new = (1 - alpha) * h + alpha * Az_h + theta.bias_h(x)
    l_new = (1 - alpha) * l + alpha * Az_l + theta.bias_l(x)

    return h_new, l_new
```

## A. Candidate architectures ranked

1. **SRM-Joint-AOL** (top): the above pseudocode. AOL block + block-gain row-sum bound + light damping. Strict contraction, deterministic, ~7-10M params (TRM-scale).
2. **SRM-Joint-Sandwich**: identical structure but Sandwich layers instead of AOL. Tighter Lipschitz bound, slower training. Use if AOL is too restrictive.
3. **SRM-Stochastic** (Phase 2): SRM-Joint-AOL + Gaussian noise injection `σε` per step. Trades strict contraction for multi-modal coverage. Use for ARC if Phase 1 plateaus.

## B. Full math for top candidate

State: `z = (h, ℓ) ∈ R^(2D)`, weighted norm `‖z‖²_P = ‖h‖² + η‖ℓ‖²`.

Per-step map: `T(z, x) = (1-α)z + α·A·ψ(z + b_in(x)) + b_out(x)`.

Where:

- `ψ: R^(2D) → R^(2D)` is AOL-parameterized, `Lip_P(ψ) ≤ 1`.
- `A` is a block matrix with rows `[a_HH·I, a_HL·U_HL]` and `[a_LH·U_LH, a_LL·I]`, `U_HL, U_LH` orthogonal.
- Block-row-sum bound: `a_HH + a_HL/√η ≤ κ` and `√η·a_LH + a_LL ≤ κ` with `κ ∈ (0.85, 0.95)`. In the scaled coordinates `(h, √η·ℓ)` the ℓ→h gain is `a_HL/√η` and the h→ℓ gain is `√η·a_LH`.

Lipschitz of T under P-norm: `Lip_P(T) ≤ (1-α) + α·Lip_P(A·ψ) ≤ (1-α) + α·κ < 1`, provided `Lip_P(A) ≤ κ`. The row-sum bound gives this when the 2×2 gain matrix in scaled coordinates is symmetric or its column sums are also ≤ κ; row sums alone only give `Lip_P(A) ≤ √2·κ`.

⇒ `log Lip_P(T) ≤ log((1-α) + α·κ) < 0`, so the map is non-chaotic by construction.

For α=1, κ=0.9: `λ_1 ≤ log 0.9 ≈ -0.105`. Matches the empirical success regime.

## C. Risks (what could prevent ≥87% Sudoku)

1. **AOL expressivity ceiling**: the 1-Lipschitz parameterization restricts the function class. Sandwich layers are less restrictive, so if AOL caps below 87%, swap in Sandwich (at slower training cost).
2. **Block-gain matrix too restrictive**: forcing row-sum ≤ κ might not preserve TRM's specific information flow pattern (e.g., L→H amplification). Solution: lower bounds on `a_HL` and `a_LH` (e.g., ≥ 0.1·κ) to prevent decoupling.
3. **Damping over-smooths**: α=1 means full step. If unstable, drop α. But α<1 pushes the rate `(1-α) + α·κ` toward 1, so each step makes less progress (anti-iterative).
4. **Weight tying may not transfer**: TRM ties across H and L (J_L = J_H = J via same module). SRM-Joint already does this.
   But our `A` and `ψ` blocks are also tied across recursion steps (Banach iteration). If this hurts representational depth, untie A while keeping ψ tied.
5. **Bias terms `b_in(x), b_out(x)`**: must be the only place where input x enters. If we naively use input-dependent ψ (`ψ(z, x)` instead of `ψ(z + b_in(x))`), the Lipschitz analysis breaks (Lipschitz w.r.t. z still holds, but coupling with x is opaque).

## D. Smallest validation experiment

**Replace HRM's L_level and H_level with the AOL block (1-Lipschitz) + block-gain A, train Sudoku 1k for 5000 steps (1/4 the full recipe). Compare the λ_joint_1 trajectory and final accuracy to baseline HRM at the same step count.**

Pass criteria:

- λ_joint_1 stays strictly < 0 throughout (validates by-construction stability)
- Accuracy at 5K steps ≥ HRM's accuracy at 5K steps (baseline ~15% per our checkpoint evolution)

If both pass, scale to the full 20K-step recipe.

This is ~2-4 hours on 4× A6000 (1/10 the full HRM training). Quick gate on whether SRM-Joint-AOL is viable.

## Notes for synthesis (when codex eventually responds)

- Codex's round 1 framing (joint operator, block-gain, κ=0.85-0.95) is solid. I largely agree.
- Implementation-wise I lean towards AOL, not Sandwich, for speed.
- Two-state I keep; codex left it ambiguous.
- Noise injection deferred to Phase 2.
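
The contraction bound in section B can be sanity-checked without any training. The sketch below is a minimal pure-Python check, not part of the design itself: it builds the 2×2 matrix `M` of block gains in the scaled coordinates `(h, √η·ℓ)` from two softmax rows, computes its spectral norm in closed form, and reports the resulting bound `(1-α) + α·‖M‖₂` on `Lip_P(T)`. It assumes `Lip_P(ψ) ≤ 1` and orthogonal `U_HL`, `U_LH`, in which case the P-norm gain of `A` is at most `‖M‖₂`. The helper names (`softmax2`, `scaled_gain_matrix`, `spectral_norm_2x2`, `rate_bound`) are hypothetical.

```python
import math


def softmax2(u, v):
    """Numerically stable softmax over two logits."""
    m = max(u, v)
    eu, ev = math.exp(u - m), math.exp(v - m)
    return eu / (eu + ev), ev / (eu + ev)


def scaled_gain_matrix(logits_H, logits_L, kappa):
    """2x2 block-gain matrix in scaled coordinates; every row sums to kappa.

    logits_H orders its entries as (H<-H, H<-L) and logits_L as (L<-L, L<-H),
    mirroring gain_H / gain_L in srm_step.
    """
    hh, hl = (kappa * p for p in softmax2(*logits_H))
    ll, lh = (kappa * p for p in softmax2(*logits_L))
    return [[hh, hl], [lh, ll]]


def spectral_norm_2x2(m):
    """Largest singular value of a 2x2 matrix, in closed form."""
    (a, b), (c, d) = m
    frob_sq = a * a + b * b + c * c + d * d   # sigma_1^2 + sigma_2^2
    det = a * d - b * c                       # |det| = sigma_1 * sigma_2
    disc = max(frob_sq * frob_sq - 4.0 * det * det, 0.0)
    return math.sqrt((frob_sq + math.sqrt(disc)) / 2.0)


def rate_bound(m, alpha):
    """Bound (1 - alpha) + alpha * ||M||_2 on Lip_P(T), assuming Lip_P(psi) <= 1."""
    return (1.0 - alpha) + alpha * spectral_norm_2x2(m)


kappa = 0.9
# Balanced gains: rows and columns all sum to kappa.
balanced = scaled_gain_matrix((0.0, 0.0), (0.0, 0.0), kappa)
# Skewed gains: both rows put almost all of their weight on the h column.
skewed = scaled_gain_matrix((10.0, -10.0), (-10.0, 10.0), kappa)

for name, m in [("balanced", balanced), ("skewed", skewed)]:
    rate = rate_bound(m, alpha=1.0)
    print(f"{name}: rate bound = {rate:.4f}, log = {math.log(rate):+.4f}")
```

Under these assumptions the balanced case reproduces the `log 0.9 ≈ -0.105` figure from section B, while the skewed case, whose rows still sum to κ, yields a bound above 1. That suggests logging `‖M‖₂` alongside λ_joint_1 in the validation run, or constraining column sums as well as row sums.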