author YurenHao0426 <blackhao0426@gmail.com> 2026-06-13 12:35:36 -0500
committer YurenHao0426 <blackhao0426@gmail.com> 2026-06-13 12:35:36 -0500
commit 66e0d8b9fd4d0f7a2231d689c055e26fdf1cf04a (patch)
tree c29cba61124018755a19b02c9d33e3ad5f2e05cc /research/flossing/srm_design_round2_claude.md
rrm workspace: TRM/HRM/SRM code, Maze dataset, dynamical-analysis pipeline (HEAD, main)
Curated export for clone-and-run Maze training (2x A6000) + diagnostics. trm/hrm pretrain.py carry trajectory-augmentation code (backward-compatible). Heavy artifacts (checkpoints/wandb/npz) gitignored; see PROVENANCE.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'research/flossing/srm_design_round2_claude.md')
-rw-r--r-- research/flossing/srm_design_round2_claude.md 155
1 files changed, 155 insertions, 0 deletions
diff --git a/research/flossing/srm_design_round2_claude.md b/research/flossing/srm_design_round2_claude.md
new file mode 100644
index 0000000..b63b260
--- /dev/null
+++ b/research/flossing/srm_design_round2_claude.md
@@ -0,0 +1,155 @@
+# SRM Design Round 2 — Claude's analysis (codex placeholder failed twice)
+
+Building on `srm_design_codex.md`. Codex round 1 produced a solid joint-operator design; rounds 2-3 hit routing issues. This is my analysis of the 6 implementation questions.
+
+## (1) Attention compatibility
+
+**Recommendation: start with `mlp_t=True` for Sudoku, defer Lipschitz-attention to ARC phase.**
+
+Reasoning: TRM's reported 87% on Sudoku 1k uses `mlp_t=True` (MLP over tokens, no self-attention). Their `pretrain_att_sudoku` (with attention) is reportedly ~75%. So for Sudoku specifically, attention isn't helping: replacing it with an MLP gives **+12 percentage points**. We can use MLP-over-tokens for both L and H updates; MLPs admit cheap Lipschitz bounds (Sandwich / AOL layers compose directly with 1-Lipschitz activations such as ReLU; GELU's slope peaks at about 1.13, so it needs a rescale to stay 1-Lipschitz).
+
+For ARC where attention likely matters: use **LipsFormer-style attention** (Qi et al. 2023) — normalizes via L_∞ / cosine to bound Lipschitz constant of softmax(QK^T)V. Or skip self-attention and use **cross-attention only** with bounded-norm keys.
+
+## (2) Single vs two-state
+
+**Keep two-state (h, ℓ), with single tied operator.**
+
+Why: even though TRM ties weights, the two states play **functionally distinct roles**:
+- ℓ (z_L) receives `z_H + input_embedding` injection — "deliberation conditioned on current answer + question"
+- h (z_H) receives `z_L` injection — "answer update from deliberation"
+- The hierarchy of update frequencies (H every T steps vs L every step) is preserved
+- Empirical: TRM with H_cycles=3 L_cycles=6 outperforms vanilla recursion (H=1, L=18) on Sudoku
+
+Single state z would force the same vector to serve as both "answer" and "scratchpad", losing the inductive bias that's empirically validated.
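A toy sketch of the two-state schedule, assuming the nested loop as I understand TRM to run it (each H cycle performs `L_cycles` low-level updates, then one high-level update). `run_schedule` and `toy_f` are hypothetical names; `toy_f` is a scalar stand-in for the shared block, not TRM's module.

```python
def run_schedule(f, x, h_cycles, l_cycles, z_h=0.0, z_l=0.0):
    """Run a TRM-style nested recursion with one tied update function f.

    f(state, injection) -> new state. Returns final states and update counts.
    """
    n_l = n_h = 0
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            # l sees the current answer plus the question
            z_l = f(z_l, z_h + x)
            n_l += 1
        # h sees only the result of the deliberation
        z_h = f(z_h, z_l)
        n_h += 1
    return z_h, z_l, n_l, n_h


def toy_f(state, injection):
    # Hypothetical contractive stand-in for the shared block.
    return 0.5 * state + 0.4 * injection


two_state = run_schedule(toy_f, x=1.0, h_cycles=3, l_cycles=6)
vanilla = run_schedule(toy_f, x=1.0, h_cycles=1, l_cycles=18)
print(two_state[2:], vanilla[2:])  # (18, 3) (18, 1)
```

Both configurations spend 18 low-level updates; they differ only in how often the answer state is refreshed, which is the inductive bias at stake here.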
+
+## (3) κ fixed vs learnable
+
+**Fixed κ=0.9 to start, learnable with anchor penalty as v2.**
+
+Phase 1 (validation): κ=0.9 fixed. Provably contractive, easy to analyze, no risk of κ-collapse.
+
+Phase 2 (refinement): `κ = sigmoid(c) * (0.99 - 0.5) + 0.5` learnable. Add anchor penalty `λ_anchor · (κ - 0.9)²` to keep near empirical sweet spot.
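A minimal sketch of the Phase 2 parameterization, written in plain Python so the ranges are easy to check. The function names and the default `lam_anchor=1.0` are placeholders of mine, not tuned values.

```python
import math

KAPPA_MIN, KAPPA_MAX, KAPPA_ANCHOR = 0.5, 0.99, 0.9


def kappa_from_logit(c):
    """Map an unconstrained scalar c to kappa in (0.5, 0.99)."""
    sig = 1.0 / (1.0 + math.exp(-c))
    return sig * (KAPPA_MAX - KAPPA_MIN) + KAPPA_MIN


def logit_for_kappa(kappa):
    """Inverse map, useful for initialising c at a target kappa."""
    p = (kappa - KAPPA_MIN) / (KAPPA_MAX - KAPPA_MIN)
    return math.log(p / (1.0 - p))


def anchor_penalty(kappa, lam_anchor=1.0):
    """lam_anchor * (kappa - 0.9)^2; lam_anchor=1.0 is a placeholder."""
    return lam_anchor * (kappa - KAPPA_ANCHOR) ** 2


c0 = logit_for_kappa(KAPPA_ANCHOR)   # start exactly at the Phase 1 value
print(c0, kappa_from_logit(c0), anchor_penalty(kappa_from_logit(0.0)))
```

Initialising `c` at `logit_for_kappa(0.9)` rather than at 0 matters: `c = 0` corresponds to κ = 0.745, well below the Phase 1 setting.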
+
+Why not learnable from the start: the empirical observation `success λ_1 ≈ -0.15 ⇔ κ ≈ 0.86` (κ = e^{λ_1}) is from a trained HRM. SRM might find a different sweet spot, but starting near the empirically validated value avoids early-training pathology.
+
+## (4) Power iteration cost
+
+**Recommendation: AOL (Almost-Orthogonal Layers, Prach & Lampert 2022).**
+
+Compared to alternatives:
+- **Sandwich layers (Wang & Manchester 2023)**: tightest Lipschitz bound, but the parameterization applies a Cayley transform (a matrix solve) to the weights at every training step. For 40h training that is cost-prohibitive on 4× A6000.
+- **AOL**: rescales each weight matrix `P` to `W = P·D` with `D = diag(Σ_j |P^T P|_ij)^(-1/2)`, which gives a **guaranteed 1-Lipschitz layer with no iteration** at training time. Slight expressivity loss (the bound is conservative, so the weights end up only approximately orthogonal and some 1-Lipschitz maps are out of reach). For our recursive setting where we want a hard guarantee, this trade-off is favorable.
+- **Householder reflections**: also exact orthogonal but slower per forward pass than AOL.
+- **Cached spectral estimates**: cheapest but introduces unstable training when stale estimates miss real growth.
+
+AOL is the right balance for our compute budget.
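A small sketch of the AOL rescaling as I understand it from Prach & Lampert (2022): given an unconstrained matrix `P`, use `W = P·D` with `D_ii = (Σ_j |P^T P|_ij)^(-1/2)`. The helper names and the `eps` guard for all-zero columns are my additions, and plain lists are used only to keep the example dependency-free.

```python
import math
import random


def aol_rescale(P, eps=1e-12):
    """Return W = P @ D with D_ii = (sum_j |P^T P|_ij)^(-1/2).

    P is a list of rows (n_out x n_in). eps guards all-zero columns.
    """
    n_out, n_in = len(P), len(P[0])
    # Gram matrix of the columns, P^T P (n_in x n_in)
    gram = [[sum(P[r][i] * P[r][j] for r in range(n_out)) for j in range(n_in)]
            for i in range(n_in)]
    d = [1.0 / math.sqrt(sum(abs(g) for g in row) + eps) for row in gram]
    return [[P[r][i] * d[i] for i in range(n_in)] for r in range(n_out)]


def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


rng = random.Random(0)
P = [[rng.gauss(0.0, 3.0) for _ in range(6)] for _ in range(4)]
W = aol_rescale(P)

# Empirical check: no random direction is stretched by W.
worst = max(
    norm(matvec(W, v)) / norm(v)
    for v in ([rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(1000))
)
print(worst)
```

The whole cost is one Gram matrix and a column rescale per layer, which is why no power iteration or matrix solve is needed during training.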
+
+## (5) Multi-modal capacity
+
+**Phase 1: strict-contraction SRM (unique fixed point). Phase 2: small Gaussian noise injection if ARC plateaus.**
+
+Sudoku puzzles have unique solutions (at least in Sudoku-Extreme, which filters for them). A unique-fixed-point SRM should be able to match GRAM's 97%: they get it through stable dynamics plus multi-sample voting at inference, not through architectural multi-modality during training.
+
+For ARC where multiple valid completions could exist: add `z_{t+1} = T(z_t) + σ·ε` with small σ. Mathematically:
+- Strict contraction: T is K-Lipschitz with K < 1, σ=0
+- Stochastic contraction: T is K-Lipschitz with K < 1, σ small → a stationary distribution exists, concentrated near the fixed point but with Gaussian fluctuation
+- As σ increases: distribution spreads, eventually multi-modal if T has multiple "near-fixed-points"
+
+The σ value should be tuned per task. For Sudoku σ=0; for ARC σ=0.05-0.2.
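To make the stochastic-contraction case concrete, here is a scalar toy, assuming a linear map `T(z) = K·z + b` (my simplification, not the SRM operator). For this map the stationary distribution is Gaussian around the fixed point `b / (1 - K)` with standard deviation `σ / √(1 - K²)`, so the simulated spread can be compared with a closed form.

```python
import math
import random


def stationary_std(K, sigma):
    """Closed-form stationary std of z_{t+1} = K*z_t + b + sigma*eps, |K| < 1."""
    return sigma / math.sqrt(1.0 - K * K)


def simulate(K, b, sigma, steps=50_000, burn_in=1_000, seed=0):
    """Iterate the noisy scalar map and return (mean, std) after burn-in."""
    rng = random.Random(seed)
    z, samples = 0.0, []
    for t in range(steps + burn_in):
        z = K * z + b + sigma * rng.gauss(0.0, 1.0)
        if t >= burn_in:
            samples.append(z)
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, math.sqrt(var)


K, b = 0.9, 0.2   # fixed point of the noiseless map: b / (1 - K) = 2.0
for sigma in (0.0, 0.05, 0.2):
    mean, std = simulate(K, b, sigma)
    print(sigma, round(mean, 2), round(std, 3), round(stationary_std(K, sigma), 3))
```

A linear toy like this stays unimodal for every σ; it only illustrates how the spread scales with σ and K, not the multi-modal regime.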
+
+Mixture of K operators (alternative): too many extra parameters, harder to analyze, defer.
+
+## (6) Pseudocode for one SRM forward step
+
+```python
+import math
+
+import torch
+
+
+def srm_step(h, l, x, theta, kappa=0.9, eta=1.0, alpha=1.0):
+    """
+    One SRM update step.
+    h, l: (B, seq, hidden), high-level and low-level states
+    x: input embedding, (B, seq, hidden)
+    theta: parameters (Sandwich/AOL block + gain params + biases). Assumed here
+        to expose callables aol, bias_in, bias_h, bias_l and tensors U_HL,
+        U_LH, gain_H_logits, gain_L_logits.
+    kappa, eta, alpha: SRM hyperparameters
+    """
+    s = math.sqrt(eta)
+
+    # 1. Joint feature map via AOL block (1-Lipschitz by construction)
+    #    Concatenate h with sqrt(eta)*l so the operator naturally acts under
+    #    the weighted norm ‖z‖²_P = ‖h‖² + η‖l‖²
+    z = torch.cat([h, s * l], dim=-1)          # (B, seq, 2*hidden)
+    psi = theta.aol(z + theta.bias_in(x))      # (B, seq, 2*hidden), Lip≤1
+    psi_h, psi_l_scaled = psi.chunk(2, dim=-1)
+    psi_l = psi_l_scaled / s                   # unscale back
+
+    # 2. Block-gain matrix A, row-sum bounded by kappa under weighted metric.
+    #    In scaled coordinates the rows are [a_HH, a_HL/sqrt(eta)] and
+    #    [a_LL, sqrt(eta)*a_LH]; softmax makes each row sum to exactly kappa.
+    gain_H = kappa * torch.softmax(theta.gain_H_logits, dim=0)  # (2,)
+    a_HH, a_HL = gain_H[0], gain_H[1] * s
+    gain_L = kappa * torch.softmax(theta.gain_L_logits, dim=0)  # (2,)
+    a_LL, a_LH = gain_L[0], gain_L[1] / s
+
+    # Apply block gain (U_HL, U_LH are orthogonal (hidden, hidden) mixing
+    # matrices, applied along the last dimension)
+    Az_h = a_HH * psi_h + a_HL * (psi_l @ theta.U_HL.T)
+    Az_l = a_LH * (psi_h @ theta.U_LH.T) + a_LL * psi_l
+
+    # 3. Damped update: Lip_P ≤ (1 - alpha) + alpha*kappa. This bound is < 1
+    #    only while kappa < 1, so damping alone does not rescue kappa > 1.
+    h_new = (1 - alpha) * h + alpha * Az_h + theta.bias_h(x)
+    l_new = (1 - alpha) * l + alpha * Az_l + theta.bias_l(x)
+
+    return h_new, l_new
+```
+
+## A. Candidate architectures ranked
+
+1. **SRM-Joint-AOL** (top): the above pseudocode. AOL block + block-gain row-sum bound + light damping. Strict contraction, deterministic, ~7-10M params (TRM-scale).
+
+2. **SRM-Joint-Sandwich**: identical structure but Sandwich layers instead of AOL. Tighter Lipschitz bound, slower training. Use if AOL is too restrictive.
+
+3. **SRM-Stochastic** (Phase 2): SRM-Joint-AOL + Gaussian noise injection `σε` per step. Trades strict contraction for multi-modal coverage. Use for ARC if Phase 1 plateaus.
+
+## B. Full math for top candidate
+
+State: `z = (h, ℓ) ∈ R^(2D)`, weighted norm `‖z‖²_P = ‖h‖² + η‖ℓ‖²`.
+
+Per-step map: `T(z, x) = (1-α)z + α·A·ψ(z + b_in(x)) + b_out(x)`.
+
+Where:
+- `ψ: R^(2D) → R^(2D)` is AOL-parameterized, `Lip_P(ψ) ≤ 1`.
+- `A` is block matrix with rows `[a_HH·I, a_HL·U_HL]` and `[a_LH·U_LH, a_LL·I]`, `U_HL, U_LH` orthogonal.
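+- Block-row-sum bound: `a_HH + a_HL/√η ≤ κ` and `√η·a_LH + a_LL ≤ κ` with `κ ∈ (0.85, 0.95)`. The off-diagonal gains pick up `1/√η` and `√η` when A is written in the scaled coordinates `(h, √η·ℓ)` in which the P-norm is Euclidean.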
+
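+Lipschitz of T under P-norm, for `α ∈ (0, 1]`: `Lip_P(T) ≤ (1-α) + α·Lip_P(A·ψ) ≤ (1-α) + α·κ < 1`. The second step needs `‖A‖_P ≤ κ`. Row sums alone bound only the ∞-norm of the 2×2 gain matrix `G`; since `‖G‖₂ ≤ √(‖G‖₁·‖G‖_∞)`, the column sums of `G` must be ≤ κ as well for the bound to hold by construction.
+
+⇒ `log Lip_P(T) ≤ log((1-α) + α·κ) < 0`, and since the top Lyapunov exponent satisfies `λ_1 ≤ log Lip_P(T)`, the dynamics are non-chaotic by construction.
+
+For α=1, κ=0.9: `λ_1 ≤ log 0.9 ≈ -0.105`. Consistent with the empirical success regime (`λ_1 ≈ -0.15`).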
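A quick numeric check of the bound, plus one caveat worth seeing in numbers: a 2×2 gain matrix whose row sums equal κ can still have spectral norm above 1 if its column sums are unbounded. The matrices below are illustrative values I picked, written in the scaled coordinates where the P-norm is Euclidean.

```python
import math


def lip_bound(alpha, kappa):
    """Upper bound (1 - alpha) + alpha * kappa on Lip_P(T)."""
    return (1.0 - alpha) + alpha * kappa


def spectral_norm_2x2(G):
    """Largest singular value of a 2x2 matrix, via the eigenvalues of G^T G."""
    (a, b), (c, d) = G
    # G^T G = [[p, q], [q, r]]
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d
    top_eig = 0.5 * (p + r) + math.sqrt(0.25 * (p - r) ** 2 + q * q)
    return math.sqrt(top_eig)


kappa = 0.9
for alpha in (1.0, 0.5):
    print(alpha, round(math.log(lip_bound(alpha, kappa)), 3))

# Both gain matrices have row sums equal to kappa.
rows_only = [[0.9, 0.0], [0.9, 0.0]]       # column sums 1.8 and 0.0
rows_and_cols = [[0.6, 0.3], [0.3, 0.6]]   # column sums 0.9 and 0.9
print(spectral_norm_2x2(rows_only), spectral_norm_2x2(rows_and_cols))
```

The first matrix has norm 0.9·√2 ≈ 1.27 despite satisfying the row-sum constraint, while the second, with both row and column sums at κ, has norm 0.9. A softmax over each row alone therefore does not guarantee contraction in the P-norm.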
+
+## C. Risks (what could prevent ≥87% Sudoku)
+
+1. **AOL expressivity ceiling**: the AOL rescaling restricts the function class. Sandwich layers are less restrictive: if AOL caps below 87%, swap in Sandwich (at a slower training cost).
+
+2. **Block-gain matrix too restrictive**: forcing row-sum ≤ κ might not preserve TRM's specific information flow pattern (e.g., L→H amplification). Solution: lower bounds on `a_HL` and `a_LH` (e.g., ≥ 0.1·κ) to prevent decoupling.
+
+3. **Damping over-smooths**: α=1 means a full step (no damping). If training is unstable, lower α. But α<1 pushes the contraction factor `(1-α) + α·κ` toward 1, so each step makes less progress (anti-iterative).
+
+4. **Weight tying may not transfer**: TRM ties across H and L (J_L = J_H = J via same module). SRM-Joint already does this. But our `A` and `ψ` blocks are also tied across recursion steps (Banach iteration). If this hurts representational depth, untie A while keeping ψ tied.
+
+5. **Bias terms `b_in(x), b_out(x)`**: must be the only place where input x enters. If we naively use input-dependent ψ (`ψ(z, x)` instead of `ψ(z + b_in(x))`), Lipschitz analysis breaks (Lipschitz w.r.t. z still holds, but coupling with x is opaque).
+
+## D. Smallest validation experiment
+
+**Replace HRM's L_level and H_level with the AOL block (1-Lipschitz) + block-gain A, train Sudoku 1k for 5000 steps (1/4 of the full recipe). Compare the λ_joint_1 trajectory and final accuracy to baseline HRM at the same step count.**
+
+Pass criteria:
+- λ_joint_1 stays strictly < 0 throughout (validates by-construction stability)
+- Accuracy at 5K steps ≥ HRM's accuracy at 5K steps (baseline ~15% per our checkpoint evolution)
+- If both pass, scale to full 20K-step recipe.
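The pass criteria can be sketched as a small gate. This document does not show how the pipeline estimates λ_joint_1, so `top_lyapunov_1d` below is only a scalar toy estimator (track a renormalised perturbation and average its log growth), and `passes_gate` is a hypothetical helper name.

```python
import math


def top_lyapunov_1d(step, z0, n_steps=200, delta0=1e-6):
    """Estimate lambda_1 of a scalar map by tracking a renormalised perturbation."""
    z, z_pert, log_growth = z0, z0 + delta0, 0.0
    for _ in range(n_steps):
        z, z_pert = step(z), step(z_pert)
        delta = abs(z_pert - z)
        log_growth += math.log(delta / delta0)
        # Renormalise so the perturbation stays small but nonzero.
        z_pert = z + delta0 * (1.0 if z_pert >= z else -1.0)
    return log_growth / n_steps


def passes_gate(lambda_trace, srm_acc, hrm_acc):
    """Both criteria: lambda_1 < 0 at every checkpoint, accuracy >= baseline."""
    return all(lam < 0.0 for lam in lambda_trace) and srm_acc >= hrm_acc


lam = top_lyapunov_1d(lambda z: 0.9 * z + 0.2, z0=0.0)
print(round(lam, 3), passes_gate([lam], srm_acc=0.16, hrm_acc=0.15))
```

For the linear toy map the estimate recovers log 0.9, the same number as the by-construction bound for α=1, κ=0.9. The accuracies passed to `passes_gate` here are made-up inputs for illustration only.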
+
+This is ~2-4 hours on 4× A6000 (1/10 the full HRM training). Quick gate on whether SRM-Joint-AOL is viable.
+
+## Notes for synthesis (when codex eventually responds)
+
+- Codex's round 1 framing (joint operator, block-gain, κ=0.85-0.95) is solid. I largely agree.
+- Implementation-wise I lean towards AOL not Sandwich for speed.
+- Two-state I keep; codex left it ambiguous.
+- Noise injection deferred to Phase 2.