research/flossing/srm_design_round2_claude.md



# SRM Design Round 2 — Claude's analysis (codex placeholder failed twice)

Building on `srm_design_codex.md`. Codex round 1 produced a solid joint-operator design; rounds 2-3 hit routing issues. This is my analysis of the 6 implementation questions.

## (1) Attention compatibility

**Recommendation: start with `mlp_t=True` for Sudoku, defer Lipschitz-attention to ARC phase.**

Reasoning: TRM's reported 87% on Sudoku 1k uses `mlp_t=True` (MLP over tokens, no self-attention). Their `pretrain_att_sudoku` run (with attention) is reportedly ~75%. So for Sudoku specifically, attention is not helping: replacing it with an MLP gives roughly **+12 percentage points** of accuracy. We can use MLP-over-tokens for both L and H updates; MLPs admit cheap Lipschitz bounds (Sandwich / AOL layers compose directly with 1-Lipschitz activations such as ReLU; GELU's slope peaks at about 1.13, so it needs a small rescaling to stay 1-Lipschitz).

For ARC, where attention likely matters: use **LipsFormer-style attention** (Qi et al. 2023), which normalizes queries and keys (L_∞ / cosine similarity) to bound the Lipschitz constant of softmax(QK^T)V. Alternatively, skip self-attention and use **cross-attention only** with bounded-norm keys.
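A minimal NumPy sketch of cosine-normalized attention, in the spirit of the option above. The function name `cosine_attention`, the temperature `tau`, and the choice to normalize `V` as well are assumptions for illustration, not LipsFormer's reference implementation. It only demonstrates that logits and outputs stay bounded regardless of input scale; it does not certify a Lipschitz constant.

```python
import numpy as np

def cosine_attention(Q, K, V, tau=1.0, eps=1e-6):
    """Single-head attention on L2-normalized rows of Q, K, V (hypothetical helper)."""
    def unit_rows(X):
        # eps keeps the division well-defined for near-zero rows
        return X / np.sqrt((X ** 2).sum(axis=-1, keepdims=True) + eps)

    Qn, Kn, Vn = unit_rows(Q), unit_rows(K), unit_rows(V)
    logits = tau * (Qn @ Kn.T)                      # every entry lies in [-tau, tau]
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a convex combination of rows with norm <= 1
    return weights @ Vn, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 16))
K = rng.normal(size=(8, 16))
V = rng.normal(size=(8, 16))
out, weights = cosine_attention(Q, K, V, tau=2.0)
print(np.linalg.norm(out, axis=-1).max())
```

Because the logits are bounded by `tau`, scaling the queries by a large factor leaves the output essentially unchanged, which is the property that makes this family easier to bound than raw dot-product attention.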

## (2) Single vs two-state

**Keep two-state (h, ℓ), with single tied operator.**

Why: even though TRM ties weights, the two states play **functionally distinct roles**:
- ℓ (z_L) receives `z_H + input_embedding` injection — "deliberation conditioned on current answer + question"
- h (z_H) receives `z_L` injection — "answer update from deliberation"
- The hierarchy of update frequencies (H every T steps vs L every step) is preserved
- Empirical: TRM with H_cycles=3 L_cycles=6 outperforms vanilla recursion (H=1, L=18) on Sudoku

Single state z would force the same vector to serve as both "answer" and "scratchpad", losing the inductive bias that's empirically validated.
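The update schedule described above can be sketched as a nested loop. This is a minimal illustration, assuming hypothetical callables `update_l` and `update_h` that stand in for the tied operator applied to each state; it is not TRM's actual training loop.

```python
def run_cycles(h, l, x, update_l, update_h, H_cycles=3, L_cycles=6):
    """Two-state recursion: l updates every inner step, h once per outer cycle."""
    for _ in range(H_cycles):
        for _ in range(L_cycles):
            # deliberation conditioned on current answer h and question x
            l = update_l(l, h, x)
        # answer update from the deliberation state
        h = update_h(h, l)
    return h, l

# Toy usage with scalar states, just to show the call counts
counts = {"l": 0, "h": 0}

def toy_update_l(l, h, x):
    counts["l"] += 1
    return 0.5 * l + 0.25 * (h + x)

def toy_update_h(h, l):
    counts["h"] += 1
    return 0.5 * h + 0.25 * l

h_out, l_out = run_cycles(0.0, 0.0, 1.0, toy_update_l, toy_update_h)
print(counts)
```

With `H_cycles=3, L_cycles=6` the low-level state receives 18 updates, the same count as the vanilla `(H=1, L=18)` configuration; the difference is only how often `h` is refreshed.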

## (3) κ fixed vs learnable

**Fixed κ=0.9 to start, learnable with anchor penalty as v2.**

Phase 1 (validation): κ=0.9 fixed. Provably contractive, easy to analyze, no risk of κ-collapse.

Phase 2 (refinement): `κ = sigmoid(c) * (0.99 - 0.5) + 0.5` learnable. Add anchor penalty `λ_anchor · (κ - 0.9)²` to keep near empirical sweet spot.

Why not learnable from start: the empirical observation `success λ_1 ≈ -0.15 ⇔ κ ≈ 0.86` is from a trained HRM. SRM might find a different sweet spot but starting at the empirically-validated value avoids early-training pathology.
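The Phase 2 parameterization can be sketched in a few lines of plain Python. The names `kappa_from_logit`, `logit_for_kappa`, and `lam_anchor` are assumptions introduced here; in a real model `c` would be a trainable scalar and the penalty would be added to the loss.

```python
import math

KAPPA_MIN, KAPPA_MAX, KAPPA_ANCHOR = 0.5, 0.99, 0.9

def kappa_from_logit(c):
    """kappa = sigmoid(c) * (0.99 - 0.5) + 0.5, so kappa stays in (0.5, 0.99)."""
    sig = 1.0 / (1.0 + math.exp(-c))
    return sig * (KAPPA_MAX - KAPPA_MIN) + KAPPA_MIN

def logit_for_kappa(kappa):
    """Inverse map, handy for initializing c so training starts at the anchor."""
    s = (kappa - KAPPA_MIN) / (KAPPA_MAX - KAPPA_MIN)
    return math.log(s / (1.0 - s))

def anchor_penalty(kappa, lam_anchor=1.0):
    """lam_anchor * (kappa - 0.9)^2, pulling kappa toward the empirical sweet spot."""
    return lam_anchor * (kappa - KAPPA_ANCHOR) ** 2

c0 = logit_for_kappa(KAPPA_ANCHOR)
print(c0, kappa_from_logit(c0), anchor_penalty(kappa_from_logit(c0)))
```

Initializing at `c0` makes Phase 2 start exactly where Phase 1 left off, so the switch from fixed to learnable κ does not perturb the dynamics at step zero.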

## (4) Power iteration cost

**Recommendation: AOL (Almost-Orthogonal Layers, Prach & Lampert 2022).**

Compared to alternatives:
- **Sandwich layers (Wang & Manchester 2023)**: tightest Lipschitz bound, but costlier per step: the Cayley-transform parameterization needs a matrix inverse per layer on every forward pass. For a 40h training run on 4× A6000 this is cost-prohibitive.
- **AOL**: a closed-form diagonal rescaling of each weight matrix gives a **guaranteed 1-Lipschitz layer with no iteration** at training time. Slight expressivity loss (the rescaling is conservative, so layers only approach the 1-Lipschitz limit when the weights are close to orthogonal). For our recursive setting, where we want the guarantee to hold at every step, this trade-off is favorable.
- **Householder reflections**: also exact orthogonal but slower per forward pass than AOL.
- **Cached spectral estimates**: cheapest but introduces unstable training when stale estimates miss real growth.

AOL is the right balance for our compute budget.
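For reference, the AOL rescaling is cheap enough to write out in full. The sketch below assumes a dense layer `y = W x` and uses the rescaling `d_j = (Σ_i |WᵀW|_ij)^(-1/2)` from Prach & Lampert; the function name `aol_rescale` and the `eps` guard are assumptions for illustration.

```python
import numpy as np

def aol_rescale(W, eps=1e-12):
    """Return W @ diag(d) with d_j = (sum_i |W^T W|_ij)^(-1/2).

    The rescaled matrix has spectral norm <= 1, with no power iteration.
    """
    gram_abs = np.abs(W.T @ W)                    # (n_in, n_in), symmetric
    d = 1.0 / np.sqrt(gram_abs.sum(axis=0) + eps)
    return W * d                                  # scales column j of W by d[j]

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_aol = aol_rescale(W)
print("spectral norm before:", np.linalg.norm(W, 2))
print("spectral norm after: ", np.linalg.norm(W_aol, 2))
```

The cost is one Gram matrix product per layer. For a random dense `W` the rescaled norm is typically well below 1, which is the conservativeness mentioned above; an already-orthogonal `W` is left unchanged.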

## (5) Multi-modal capacity

**Phase 1: strict-contraction SRM (unique fixed point). Phase 2: small Gaussian noise injection if ARC plateaus.**

Sudoku puzzles have unique solutions (at least in the Sudoku-Extreme dataset, which filters for them). A unique-fixed-point SRM should therefore be able to match GRAM's 97%: they get it through stable dynamics plus multi-sample voting at inference, not through architectural multi-modality during training.

For ARC where multiple valid completions could exist: add `z_{t+1} = T(z_t) + σ·ε` with small σ. Mathematically:
- Strict contraction: T is K-Lipschitz with K < 1, σ=0
- Stochastic contraction: T is K-Lipschitz, σ small → stationary distribution exists, concentrated near fixed point but with Gaussian fluctuation
- As σ increases: distribution spreads, eventually multi-modal if T has multiple "near-fixed-points"

The σ value should be tuned per task. For Sudoku σ=0; for ARC σ=0.05-0.2.
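To get a feel for the scale of σ, consider the simplest possible case: a scalar linear contraction `T(z) = κ·z + b`. This toy model is an assumption for illustration, not the SRM operator. Here the stationary distribution is Gaussian with mean `b / (1 - κ)` and standard deviation `σ / √(1 - κ²)`, which the simulation below checks empirically.

```python
import numpy as np

def stationary_stats(kappa, b, sigma):
    """Closed-form stationary mean and std of z_{t+1} = kappa*z_t + b + sigma*eps."""
    mean = b / (1.0 - kappa)
    std = sigma / np.sqrt(1.0 - kappa ** 2)
    return mean, std

def simulate(kappa, b, sigma, steps=50_000, burn_in=1_000, seed=0):
    """Run the noisy iteration and return the empirical mean and std after burn-in."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=steps + burn_in)
    samples = np.empty(steps)
    z = 0.0
    for t in range(steps + burn_in):
        z = kappa * z + b + sigma * noise[t]
        if t >= burn_in:
            samples[t - burn_in] = z
    return samples.mean(), samples.std()

theory = stationary_stats(kappa=0.9, b=0.1, sigma=0.05)
empirical = simulate(kappa=0.9, b=0.1, sigma=0.05)
print("theory:   ", theory)
print("empirical:", empirical)
```

At κ=0.9 the recursion amplifies the per-step noise by a factor of `1/√(1 - κ²) ≈ 2.3`, so the spread of the state around the fixed point is noticeably larger than σ itself. That is worth keeping in mind when picking σ in the 0.05-0.2 range.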

Mixture of K operators (alternative): too many extra parameters, harder to analyze, defer.

## (6) Pseudocode for one SRM forward step

```python
import numpy as np


def _softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()


def srm_step(h, l, x, theta, kappa=0.9, eta=1.0, alpha=1.0):
    """
    One SRM update step.
    h, l: (B, seq, hidden), high-level and low-level states
    x: input embedding, (B, seq, hidden)
    theta: parameters (Sandwich/AOL block + gain params + biases). Assumed
        interface: callables aol, bias_in (returns 2*hidden features), bias_h,
        bias_l; logit arrays gain_H_logits, gain_L_logits of shape (2,);
        orthogonal mixing matrices U_HL, U_LH of shape (hidden, hidden).
    kappa, eta, alpha: SRM hyperparameters
    """
    s = np.sqrt(eta)

    # 1. Joint feature map via AOL block (1-Lipschitz by construction).
    # Concatenate h with sqrt(eta)*l so the operator acts in coordinates where
    # the weighted norm ‖z‖²_P = ‖h‖² + η‖l‖² is the plain Euclidean norm.
    z = np.concatenate([h, s * l], axis=-1)                 # (B, seq, 2*hidden)
    psi = theta.aol(z + theta.bias_in(x))                   # (B, seq, 2*hidden), Lip≤1
    psi_h, psi_l_scaled = np.split(psi, 2, axis=-1)

    # 2. Block-gain matrix, parameterized in the scaled coordinates so that
    # each row sums to kappa. In unscaled terms a_HL = sqrt(eta)*m_HL and
    # a_LH = m_LH/sqrt(eta). Row sums bound only the inf-norm of the 2x2 gain
    # matrix; a 2-norm contraction also needs its column sums ≤ kappa.
    m_HH, m_HL = kappa * _softmax(theta.gain_H_logits)
    m_LL, m_LH = kappa * _softmax(theta.gain_L_logits)

    # Apply block gain (U_HL, U_LH are orthogonal, applied on the feature axis)
    Az_h = m_HH * psi_h + m_HL * (psi_l_scaled @ theta.U_HL.T)
    Az_l = (m_LH * (psi_h @ theta.U_LH.T) + m_LL * psi_l_scaled) / s  # unscale back

    # 3. Damped update. Lip ≤ (1 - alpha) + alpha*kappa, which is < 1 only
    # when kappa < 1; damping does not rescue kappa ≥ 1 under this bound.
    h_new = (1 - alpha) * h + alpha * Az_h + theta.bias_h(x)
    l_new = (1 - alpha) * l + alpha * Az_l + theta.bias_l(x)

    return h_new, l_new
```

## A. Candidate architectures ranked

1. **SRM-Joint-AOL** (top): the above pseudocode. AOL block + block-gain row-sum bound + light damping. Strict contraction, deterministic, ~7-10M params (TRM-scale).

2. **SRM-Joint-Sandwich**: identical structure but Sandwich layers instead of AOL. Tighter Lipschitz bound, slower training. Use if AOL is too restrictive.

3. **SRM-Stochastic** (Phase 2): SRM-Joint-AOL + Gaussian noise injection `σε` per step. Trades strict contraction for multi-modal coverage. Use for ARC if Phase 1 plateaus.

## B. Full math for top candidate

State: `z = (h, ℓ) ∈ R^(2D)`, weighted norm `‖z‖²_P = ‖h‖² + η‖ℓ‖²`.

Per-step map: `T(z, x) = (1-α)z + α·A·ψ(z + b_in(x)) + b_out(x)`.

Where:
- `ψ: R^(2D) → R^(2D)` is AOL-parameterized, `Lip_P(ψ) ≤ 1`.
- `A` is block matrix with rows `[a_HH·I, a_HL·U_HL]` and `[a_LH·U_LH, a_LL·I]`, `U_HL, U_LH` orthogonal.
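- Block-row-sum bound: `a_HH + a_HL/√η ≤ κ` and `√η·a_LH + a_LL ≤ κ` with `κ ∈ (0.85, 0.95)`. These are the row sums of the 2×2 gain matrix `M` written in the scaled coordinates `(h, √η·ℓ)`, in which `‖·‖_P` is the Euclidean norm.

Lipschitz of T under P-norm: `Lip_P(T) ≤ (1-α) + α·Lip_P(A·ψ) ≤ (1-α) + α·κ < 1` for `α ∈ (0, 1]`, provided `‖A‖_P ≤ κ`. Row sums alone bound only `‖M‖_∞`; since `‖A‖_P ≤ ‖M‖₂ ≤ √(‖M‖₁·‖M‖_∞)`, the column sums of `M` must also be ≤ κ for the last step to hold.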

⇒ `log Lip_P(T) ≤ log((1-α) + α·κ) < 0` — by construction non-chaotic.

For α=1, κ=0.9: `λ_1 ≤ log 0.9 ≈ -0.105`. This is an upper bound, consistent with the empirical success regime (`λ_1 ≈ -0.15`).
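The bound is easy to check numerically. In the sketch below, `M` is the 2×2 gain matrix expressed in P-scaled coordinates (with η = 1 its entries are simply `a_HH, a_HL, a_LH, a_LL`); the helper name `lipschitz_bound` and the two example matrices are assumptions chosen for illustration. The second example has both row sums equal to 0.9 but a column sum of 1.8, and shows why row sums alone do not give a 2-norm contraction.

```python
import numpy as np

def lipschitz_bound(M, alpha=1.0):
    """(1 - alpha) + alpha * ||M||_2, valid when Lip_P(psi) <= 1 and 0 < alpha <= 1."""
    return (1.0 - alpha) + alpha * np.linalg.norm(M, 2)

kappa = 0.9
# Rows and columns both sum to kappa
balanced = np.array([[0.45, 0.45],
                     [0.45, 0.45]])
# Rows sum to kappa, but the second column sums to 2*kappa
lopsided = np.array([[0.0, 0.9],
                     [0.0, 0.9]])

for name, M in [("balanced", balanced), ("lopsided", lopsided)]:
    bound = lipschitz_bound(M, alpha=1.0)
    print(name, "row sums:", M.sum(axis=1), "col sums:", M.sum(axis=0),
          "bound:", bound, "log bound:", np.log(bound))
```

For the balanced matrix the bound equals κ and its log is about -0.105, as in the text. For the lopsided one the bound exceeds 1, so the gain parameterization should constrain column sums as well (or bound `√(‖M‖₁·‖M‖_∞)` directly).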

## C. Risks (what could prevent ≥87% Sudoku)

1. **AOL expressivity ceiling**: the AOL rescaling restricts the function class. Sandwich layers are less restrictive: if AOL caps below 87%, swap in Sandwich (at a slower training cost).

2. **Block-gain matrix too restrictive**: forcing row-sum ≤ κ might not preserve TRM's specific information flow pattern (e.g., L→H amplification). Solution: lower bounds on `a_HL` and `a_LH` (e.g., ≥ 0.1·κ) to prevent decoupling.

3. **Damping over-smooths**: α=1 is the full, undamped step. If training is unstable, drop α. But α<1 moves the bound `(1-α) + α·κ` toward 1, so each step contracts less and the effective compute per step drops (anti-iterative).

4. **Weight tying may not transfer**: TRM ties across H and L (J_L = J_H = J via same module). SRM-Joint already does this. But our `A` and `ψ` blocks are also tied across recursion steps (Banach iteration). If this hurts representational depth, untie A while keeping ψ tied.

5. **Bias terms `b_in(x), b_out(x)`**: must be the only place where input x enters. If we naively use input-dependent ψ (`ψ(z, x)` instead of `ψ(z + b_in(x))`), Lipschitz analysis breaks (Lipschitz w.r.t. z still holds, but coupling with x is opaque).

## D. Smallest validation experiment

**Replace HRM's L_level and H_level with the AOL block (1-Lipschitz) + block-gain A, train Sudoku 1k for 5000 steps (1/4 the full recipe). Compare the λ_joint_1 trajectory and final accuracy to baseline HRM at the same step count.**

Pass criteria: 
- λ_joint_1 stays strictly < 0 throughout (validates by-construction stability)
- Accuracy at 5K steps ≥ HRM's accuracy at 5K steps (baseline ~15% per our checkpoint evolution)
- If both pass, scale to full 20K-step recipe.

This is ~2-4 hours on 4× A6000 (1/10 the full HRM training). Quick gate on whether SRM-Joint-AOL is viable.
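The gate can be written down as a small check so it is applied the same way on every run. This is a sketch: the function name `passes_gate` and its argument names are assumptions, and the logged values would come from whatever evaluation harness is already in use.

```python
def passes_gate(lambda_joint_1_traj, srm_acc, hrm_acc):
    """Return True if both pass criteria hold.

    lambda_joint_1_traj: iterable of lambda_joint_1 estimates logged during training
    srm_acc, hrm_acc: accuracies at the same step count (e.g. 5K steps)
    """
    stable = all(lam < 0.0 for lam in lambda_joint_1_traj)   # strictly negative throughout
    accurate = srm_acc >= hrm_acc
    return stable and accurate

# Illustrative values only, not measured results
print(passes_gate([-0.12, -0.10, -0.11], srm_acc=0.16, hrm_acc=0.15))
```

Checking the whole trajectory rather than only the final value matters here: a single excursion to λ_joint_1 ≥ 0 would mean the by-construction stability claim does not hold in practice.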

## Notes for synthesis (when codex eventually responds)

- Codex's round 1 framing (joint operator, block-gain, κ=0.85-0.95) is solid. I largely agree.
- Implementation-wise I lean towards AOL not Sandwich for speed.
- Two-state I keep; codex left it ambiguous.
- Noise injection deferred to Phase 2.