# SRM Design Round 2 — Claude's analysis (codex placeholder failed twice)

Building on `srm_design_codex.md`. Codex round 1 produced solid joint-operator design; rounds 2-3 hit routing issues. This is my analysis of the 6 implementation questions.

## (1) Attention compatibility

**Recommendation: start with `mlp_t=True` for Sudoku, defer Lipschitz-attention to ARC phase.**

Reasoning: TRM's reported 87% on Sudoku 1k uses `mlp_t=True` (MLP over tokens, no self-attention). Their `pretrain_att_sudoku` (with attention) is reportedly ~75%. So for Sudoku specifically, attention isn't helping: replacing it with an MLP gains about **12 percentage points**. We can use MLP-over-tokens for both L and H updates; MLPs admit cheap Lipschitz bounds (Sandwich / AOL layers compose directly with 1-Lipschitz activations such as ReLU; GELU has a Lipschitz constant of about 1.13, so it needs a rescale).
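
To make the "cheap Lipschitz bound" point concrete, here is a minimal NumPy sketch of an MLP over the token axis. The bound is the product of the layers' spectral norms, since ReLU is 1-Lipschitz. The names `token_mlp` and `lipschitz_upper_bound` and the sizes are illustrative assumptions, not TRM's actual `mlp_t` implementation.

```python
import numpy as np

def token_mlp(z, w1, w2):
    """MLP over the token axis: mixes the seq positions of z, shape (seq, hidden)."""
    return w2 @ np.maximum(w1 @ z, 0.0)  # ReLU is 1-Lipschitz

def lipschitz_upper_bound(w1, w2):
    """Product of spectral norms: an upper bound on the Lipschitz constant."""
    return np.linalg.norm(w1, 2) * np.linalg.norm(w2, 2)

rng = np.random.default_rng(0)
seq, hidden, expand = 81, 16, 4  # 81 = 9x9 Sudoku cells; sizes are illustrative
w1 = rng.normal(size=(expand * seq, seq)) / np.sqrt(seq)
w2 = rng.normal(size=(seq, expand * seq)) / np.sqrt(expand * seq)

bound = lipschitz_upper_bound(w1, w2)

# Empirical ratio ||f(za) - f(zb)|| / ||za - zb|| for one random pair of inputs.
za = rng.normal(size=(seq, hidden))
zb = rng.normal(size=(seq, hidden))
ratio = (np.linalg.norm(token_mlp(za, w1, w2) - token_mlp(zb, w1, w2))
         / np.linalg.norm(za - zb))
print(f"bound={bound:.3f}, observed ratio={ratio:.3f}")
```

The observed ratio never exceeds the bound; a Lipschitz parameterization (AOL or Sandwich) would replace the random `w1`, `w2` with weights whose bound is at most 1.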

For ARC where attention likely matters: use **LipsFormer-style attention** (Qi et al. 2023) — normalizes via L_∞ / cosine to bound Lipschitz constant of softmax(QK^T)V. Or skip self-attention and use **cross-attention only** with bounded-norm keys.

## (2) Single vs two-state

**Keep two-state (h, ℓ), with a single tied operator.**

Why: even though TRM ties weights, the two states play **functionally distinct roles**:
- ℓ (z_L) receives `z_H + input_embedding` injection — "deliberation conditioned on current answer + question"
- h (z_H) receives `z_L` injection — "answer update from deliberation"
- The hierarchy of update frequencies (H every T steps vs L every step) is preserved
- Empirical: TRM with H_cycles=3 L_cycles=6 outperforms vanilla recursion (H=1, L=18) on Sudoku

Single state z would force the same vector to serve as both "answer" and "scratchpad", losing the inductive bias that's empirically validated.
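
The two update schedules compared above can be sketched as follows. This assumes each H cycle runs `l_cycles` low-level updates followed by one high-level update, which is my reading of the H_cycles/L_cycles settings; `update_schedule` is a hypothetical helper, not TRM code.

```python
def update_schedule(h_cycles, l_cycles):
    """Yield 'L' for each low-level update and 'H' for each high-level update."""
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            yield "L"
        yield "H"

nested = list(update_schedule(h_cycles=3, l_cycles=6))   # hierarchical schedule
flat = list(update_schedule(h_cycles=1, l_cycles=18))    # vanilla recursion

print("nested:", "".join(nested))
print("flat:  ", "".join(flat))
```

Both schedules spend the same 18 low-level updates; they differ only in how often the answer state h is refreshed (3 times vs once).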

## (3) κ fixed vs learnable

**Fixed κ=0.9 to start, learnable with anchor penalty as v2.**

Phase 1 (validation): κ=0.9 fixed. Provably contractive, easy to analyze, no risk of κ-collapse.

Phase 2 (refinement): `κ = sigmoid(c) * (0.99 - 0.5) + 0.5` learnable. Add anchor penalty `λ_anchor · (κ - 0.9)²` to keep near empirical sweet spot.

Why not learnable from start: the empirical observation `success λ_1 ≈ -0.15 ⇔ κ ≈ 0.86` is from a trained HRM. SRM might find a different sweet spot but starting at the empirically-validated value avoids early-training pathology.
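
A small sketch of the Phase 2 parameterization, under the assumption that κ is driven by a single scalar logit `c`. The helper names and the inverse map used for initialization are my own additions.

```python
import math

KAPPA_MIN, KAPPA_MAX, KAPPA_ANCHOR = 0.5, 0.99, 0.9

def kappa_from_logit(c):
    """kappa = sigmoid(c) * (0.99 - 0.5) + 0.5, so kappa stays in [0.5, 0.99]."""
    return (1.0 / (1.0 + math.exp(-c))) * (KAPPA_MAX - KAPPA_MIN) + KAPPA_MIN

def logit_for_kappa(kappa):
    """Inverse map, handy for initializing c so training starts at a chosen kappa."""
    p = (kappa - KAPPA_MIN) / (KAPPA_MAX - KAPPA_MIN)
    return math.log(p / (1.0 - p))

def anchor_penalty(kappa, lam_anchor):
    """lam_anchor * (kappa - 0.9)^2, pulling kappa toward the anchor value."""
    return lam_anchor * (kappa - KAPPA_ANCHOR) ** 2

c0 = logit_for_kappa(0.9)                # start Phase 2 exactly at the Phase 1 value
kappa_from_lambda = math.exp(-0.15)      # lambda_1 = -0.15 corresponds to kappa ~ 0.86
print(c0, kappa_from_logit(c0), kappa_from_lambda)
```

Initializing at `c0` means Phase 2 begins with zero anchor penalty and the same contraction rate as Phase 1.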

## (4) Power iteration cost

**Recommendation: AOL (Almost-Orthogonal Layers, Prach & Lampert 2022).**

Compared to alternatives:
- **Sandwich layers (Wang & Manchester 2023)**: tightest Lipschitz bound, but more expensive per step: the direct parameterization goes through a Cayley transform, i.e. a matrix inverse per layer. For a 40h training run on 4× A6000 this is likely cost-prohibitive.
- **AOL**: a closed-form rescaling of each weight matrix, `W · diag(d)` with `d_j = (Σ_i |WᵀW|_ij)^(-1/2)`, gives a **guaranteed 1-Lipschitz layer with no iteration** at training time. Slight expressivity loss (the bound is only tight when the columns of `W` are close to orthogonal, so not every 1-Lipschitz map is reachable). For our recursive setting, where we want a guarantee rather than an estimate, this trade-off is favorable.
- **Householder reflections**: exactly orthogonal, but slower per forward pass than AOL.
- **Cached spectral estimates**: cheapest but introduces unstable training when stale estimates miss real growth.

AOL is the right balance for our compute budget.
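
For reference, a minimal NumPy sketch of the AOL rescaling as I understand it from Prach & Lampert: each column `j` of `W` is scaled by `(Σ_i |WᵀW|_ij)^(-1/2)`, which guarantees a spectral norm of at most 1 without any iteration. `aol_rescale` and the `eps` guard are my own choices, not their reference implementation.

```python
import numpy as np

def aol_rescale(w, eps=1e-12):
    """Return W @ diag(d) with d_j = (sum_i |W^T W|_ij)^(-1/2); spectral norm <= 1."""
    gram_abs = np.abs(w.T @ w)                   # (in, in), symmetric
    d = 1.0 / np.sqrt(gram_abs.sum(axis=0) + eps)
    return w * d                                 # broadcasting scales column j by d_j

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)) * 3.0              # deliberately far from 1-Lipschitz
w_aol = aol_rescale(w)

spec_before = np.linalg.norm(w, 2)
spec_after = np.linalg.norm(w_aol, 2)
print(f"spectral norm before={spec_before:.2f}, after={spec_after:.3f}")

# An already-orthogonal matrix is left (numerically) unchanged.
eye_aol = aol_rescale(np.eye(4))
```

The cost is one Gram matrix and a row sum per layer. For a random `W` the rescaled norm lands well below 1, which is the expressivity loss mentioned above.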

## (5) Multi-modal capacity

**Phase 1: strict-contraction SRM (unique fixed point). Phase 2: small Gaussian noise injection if ARC plateaus.**

Sudoku puzzles here have unique solutions (the Sudoku-Extreme dataset, at least, filters for uniqueness). A unique-fixed-point SRM should therefore be able to match GRAM's 97%: GRAM gets there through stable dynamics plus multi-sample voting at inference, not through architectural multi-modality during training.

For ARC where multiple valid completions could exist: add `z_{t+1} = T(z_t) + σ·ε` with small σ. Mathematically:
- Strict contraction: T is K-Lipschitz with K < 1, σ=0
- Stochastic contraction: T is K-Lipschitz, σ small → stationary distribution exists, concentrated near fixed point but with Gaussian fluctuation
- As σ increases: distribution spreads, eventually multi-modal if T has multiple "near-fixed-points"

The σ value should be tuned per task. For Sudoku σ=0; for ARC σ=0.05-0.2.

Mixture of K operators (alternative): too many extra parameters, harder to analyze, defer.
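
A toy illustration of the strict vs stochastic regimes, assuming the simplest possible contraction: a scalar linear map `T(z) = k·z + b`. In this linear case the stationary distribution is a single Gaussian with std `σ/√(1-k²)`; it does not show multi-modality, which would need a nonlinear `T`. All names and values are illustrative.

```python
import numpy as np

def iterate(k, b, sigma, steps, n_chains, seed=0):
    """Run z <- k*z + b + sigma*eps for several independent scalar chains."""
    rng = np.random.default_rng(seed)
    z = np.zeros(n_chains)
    for _ in range(steps):
        z = k * z + b + sigma * rng.normal(size=n_chains)
    return z

def stationary_std(k, sigma):
    """Closed-form stationary std of the linear-Gaussian chain, valid for |k| < 1."""
    return sigma / np.sqrt(1.0 - k ** 2)

k, b = 0.9, 1.0
fixed_point = b / (1.0 - k)                      # unique fixed point of T

z_det = iterate(k, b, sigma=0.0, steps=300, n_chains=4)       # strict contraction
z_sto = iterate(k, b, sigma=0.1, steps=300, n_chains=20000)   # stochastic contraction

print("deterministic:", z_det)
print("stochastic mean/std:", z_sto.mean(), z_sto.std(), "theory std:", stationary_std(k, 0.1))
```

With σ=0 every chain lands on the fixed point; with σ=0.1 the chains fluctuate around it with the predicted spread, which is the Phase 2 behaviour described above.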

## (6) Pseudocode for one SRM forward step

```python
import math
import torch

def srm_step(h, l, x, theta, kappa=0.9, eta=1.0, alpha=1.0):
    """
    One SRM update step.
    h, l: (B, seq, hidden), high-level and low-level states
    x: input embedding, (B, seq, hidden)
    theta: parameter container with attributes
        aol                      callable, 1-Lipschitz block on (B, seq, 2*hidden)
        bias_in                  callable, x -> (B, seq, 2*hidden)
        bias_h, bias_l           callables, x -> (B, seq, hidden)
        gain_H_logits, gain_L_logits   tensors of shape (2,)
        U_HL, U_LH               orthogonal (hidden, hidden) mixing matrices
    kappa, eta, alpha: SRM hyperparameters
    """
    s = math.sqrt(eta)

    # 1. Joint feature map via AOL block (1-Lipschitz by construction).
    # Concatenate h with sqrt(eta)*l so the operator acts under
    # the weighted norm ‖z‖²_P = ‖h‖² + η‖l‖².
    z = torch.cat([h, s * l], dim=-1)                        # (B, seq, 2*hidden)
    psi = theta.aol(z + theta.bias_in(x))                    # (B, seq, 2*hidden), Lip≤1
    psi_h, psi_l_scaled = psi.chunk(2, dim=-1)
    psi_l = psi_l_scaled / s                                 # unscale back

    # 2. Block-gain matrix A. In scaled coordinates (h, sqrt(eta)*l) its gains are
    # [[a_HH, a_HL/sqrt(eta)], [sqrt(eta)*a_LH, a_LL]]; softmax makes each row sum to kappa.
    # Tying the two logit vectors also bounds the column sums by kappa.
    gain_H = kappa * torch.softmax(theta.gain_H_logits, dim=0)   # [a_HH, a_HL/sqrt(eta)]
    a_HH, a_HL = gain_H[0], gain_H[1] * s
    gain_L = kappa * torch.softmax(theta.gain_L_logits, dim=0)   # [a_LL, sqrt(eta)*a_LH]
    a_LL, a_LH = gain_L[0], gain_L[1] / s

    # Apply block gain (U_HL, U_LH act on the last, hidden, axis)
    Az_h = a_HH * psi_h + a_HL * (psi_l @ theta.U_HL.T)
    Az_l = a_LH * (psi_h @ theta.U_LH.T) + a_LL * psi_l

    # 3. Damped update: Lip ≤ (1 - alpha) + alpha*kappa, which is < 1 only if kappa < 1
    h_new = (1 - alpha) * h + alpha * Az_h + theta.bias_h(x)
    l_new = (1 - alpha) * l + alpha * Az_l + theta.bias_l(x)

    return h_new, l_new
```

## A. Candidate architectures ranked

1. **SRM-Joint-AOL** (top): the above pseudocode. AOL block + block-gain row-sum bound + light damping. Strict contraction, deterministic, ~7-10M params (TRM-scale).

2. **SRM-Joint-Sandwich**: identical structure but Sandwich layers instead of AOL. Tighter Lipschitz bound, slower training. Use if AOL is too restrictive.

3. **SRM-Stochastic** (Phase 2): SRM-Joint-AOL + Gaussian noise injection `σε` per step. Trades strict contraction for multi-modal coverage. Use for ARC if Phase 1 plateaus.

## B. Full math for top candidate

State: `z = (h, ℓ) ∈ R^(2D)`, weighted norm `‖z‖²_P = ‖h‖² + η‖ℓ‖²`.

Per-step map: `T(z, x) = (1-α)z + α·A·ψ(z + b_in(x)) + b_out(x)`.

Where:
- `ψ: R^(2D) → R^(2D)` is AOL-parameterized, `Lip_P(ψ) ≤ 1`.
- `A` is block matrix with rows `[a_HH·I, a_HL·U_HL]` and `[a_LH·U_LH, a_LL·I]`, `U_HL, U_LH` orthogonal.
- Block-row-sum bound in P-scaled coordinates `(h, √η·ℓ)`: `a_HH + a_HL/√η ≤ κ` and `√η·a_LH + a_LL ≤ κ` with `κ ∈ (0.85, 0.95)`.

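Lipschitz of T under P-norm: `Lip_P(T) ≤ (1-α) + α·Lip_P(A·ψ) ≤ (1-α) + α·κ < 1`, provided `Lip_P(A) ≤ κ`. Row sums alone bound only the ∞-norm of the 2×2 matrix `M` of scaled block gains; since `‖M‖₂ ≤ √(‖M‖₁·‖M‖_∞)`, the column sums must also be ≤ κ, which holds for example when both rows use the same split between diagonal and off-diagonal gain.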

⇒ `log Lip_P(T) ≤ log((1-α) + α·κ) < 0` — by construction non-chaotic.

For α=1, κ=0.9: `λ_1 ≤ log 0.9 ≈ -0.105`. This is an upper bound, so it is consistent with the empirical success regime (`λ_1 ≈ -0.15`).
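
A quick numerical check of the bound, as a sketch. It builds the 2×2 matrix of block gains of `A` in scaled coordinates `(h, √η·ℓ)` and measures its spectral norm, which upper-bounds `Lip_P(A)` when `U_HL`, `U_LH` are orthogonal. `scaled_gain_matrix` and `lip_bound` are hypothetical helpers. The question tested is whether rows that each sum to κ also give a spectral norm ≤ κ: with a shared softmax split they do, with independent splits they may not.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def scaled_gain_matrix(logits_h, logits_l, kappa):
    """[[a_HH, a_HL/sqrt(eta)], [sqrt(eta)*a_LH, a_LL]], each row summing to kappa."""
    gh = kappa * softmax(logits_h)   # [diagonal, off-diagonal] of the H row
    gl = kappa * softmax(logits_l)   # [diagonal, off-diagonal] of the L row
    return np.array([[gh[0], gh[1]],
                     [gl[1], gl[0]]])

def lip_bound(m, alpha):
    """(1 - alpha) + alpha * ||M||_2, the damped-step Lipschitz bound."""
    return (1.0 - alpha) + alpha * np.linalg.norm(m, 2)

kappa, alpha = 0.9, 1.0
shared = np.array([0.3, -0.2])
tied = scaled_gain_matrix(shared, shared, kappa)                 # same split in both rows
untied = scaled_gain_matrix(np.array([-3.0, 3.0]),               # H row mostly off-diagonal
                            np.array([3.0, -3.0]), kappa)        # L row mostly diagonal

print("tied:   ||M||_2 =", np.linalg.norm(tied, 2), " lambda_1 bound =", np.log(lip_bound(tied, alpha)))
print("untied: ||M||_2 =", np.linalg.norm(untied, 2))
```

In the tied case the norm equals κ and the `λ_1` bound is `log 0.9`. The untied example concentrates both rows on the same column and exceeds 1 despite row sums of κ, so the gain logits would need tying (or a column-sum constraint) for the guarantee to hold.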

## C. Risks (what could prevent ≥87% Sudoku)

1. **AOL expressivity ceiling**: the AOL constraint restricts the function class. Sandwich layers are less restrictive, so if AOL caps below 87%, swap in Sandwich (at the cost of slower training).

2. **Block-gain matrix too restrictive**: forcing row-sum ≤ κ might not preserve TRM's specific information flow pattern (e.g., L→H amplification). Solution: lower bounds on `a_HL` and `a_LH` (e.g., ≥ 0.1·κ) to prevent decoupling.

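3. **Damping over-smooths**: α=1 means full step. If unstable, drop α. But α<1 pushes the bound `(1-α) + α·κ` toward 1 (e.g. α=0.5, κ=0.9 gives 0.95), so each step contracts less and effective compute per step drops (anti-iterative).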

4. **Weight tying may not transfer**: TRM ties across H and L (J_L = J_H = J via same module). SRM-Joint already does this. But our `A` and `ψ` blocks are also tied across recursion steps (Banach iteration). If this hurts representational depth, untie A while keeping ψ tied.

5. **Bias terms `b_in(x), b_out(x)`**: must be the only place where input x enters. If we naively use input-dependent ψ (`ψ(z, x)` instead of `ψ(z + b_in(x))`), Lipschitz analysis breaks (Lipschitz w.r.t. z still holds, but coupling with x is opaque).
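
For risk 2, one way to enforce the lower bounds is to reserve a floor for every gain before the softmax distributes the rest. This is a sketch under the assumption that the gains are the scaled row entries; `floored_gains` and `floor_frac` are hypothetical names.

```python
import math

def floored_gains(logits, kappa, floor_frac=0.1):
    """Split kappa into len(logits) gains, each >= floor_frac*kappa, summing to kappa."""
    n = len(logits)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]     # numerically stable softmax
    total = sum(exps)
    free = 1.0 - n * floor_frac                  # share left after reserving the floors
    return [kappa * (floor_frac + free * e / total) for e in exps]

# Even with extreme logits the off-diagonal gain cannot collapse to zero.
gains = floored_gains([8.0, -8.0], kappa=0.9)
print(gains)
```

The row sum stays exactly κ, so the contraction bound is unaffected while the coupling terms stay at or above `0.1·κ`.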

## D. Smallest validation experiment

**Replace HRM's L_level and H_level with the AOL block (1-Lipschitz) + block-gain A, train Sudoku 1k for 5000 steps (1/4 the full recipe). Compare λ_joint_1 trajectory and final acc to baseline HRM at same step count.**

Pass criteria: 
- λ_joint_1 stays strictly < 0 throughout (validates by-construction stability)
- Accuracy at 5K steps ≥ HRM's accuracy at 5K steps (baseline ~15% per our checkpoint evolution)
- If both pass, scale to full 20K-step recipe.

This is ~2-4 hours on 4× A6000 (1/10 the full HRM training). Quick gate on whether SRM-Joint-AOL is viable.
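
The pass criteria can be written as a small gate function. This is only a sketch: `passes_gate` is a hypothetical helper, and the trajectory and SRM accuracy below are made-up placeholder values, not measurements. Only the ~15% baseline comes from our checkpoint evolution.

```python
def passes_gate(lambda_trajectory, srm_acc, baseline_acc):
    """Return (ok, details) for the 5K-step validation gate."""
    stable = all(lam < 0.0 for lam in lambda_trajectory)   # lambda_joint_1 strictly < 0
    accurate = srm_acc >= baseline_acc                     # at least HRM at the same step
    return stable and accurate, {"stable": stable, "accurate": accurate}

# Placeholder numbers for illustration only.
example_lambdas = [-0.12, -0.11, -0.13, -0.10]
ok, details = passes_gate(example_lambdas, srm_acc=0.18, baseline_acc=0.15)
print(ok, details)

# A single non-negative exponent fails the gate regardless of accuracy.
ok_unstable, _ = passes_gate([-0.12, 0.01], srm_acc=0.50, baseline_acc=0.15)
```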

## Notes for synthesis (when codex eventually responds)

- Codex's round 1 framing (joint operator, block-gain, κ=0.85-0.95) is solid. I largely agree.
- Implementation-wise I lean towards AOL not Sandwich for speed.
- Two-state I keep; codex left it ambiguous.
- Noise injection deferred to Phase 2.