| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
| commit | 66e0d8b9fd4d0f7a2231d689c055e26fdf1cf04a (patch) | |
| tree | c29cba61124018755a19b02c9d33e3ad5f2e05cc /research/flossing/srm_design_round2_claude.md | |
Curated export for clone-and-run Maze training (2x A6000) + diagnostics.
trm/hrm pretrain.py carry trajectory-augmentation code (backward-compatible).
Heavy artifacts (checkpoints/wandb/npz) gitignored; see PROVENANCE.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'research/flossing/srm_design_round2_claude.md')
| -rw-r--r-- | research/flossing/srm_design_round2_claude.md | 155 |
1 files changed, 155 insertions, 0 deletions
diff --git a/research/flossing/srm_design_round2_claude.md b/research/flossing/srm_design_round2_claude.md
new file mode 100644
index 0000000..b63b260
--- /dev/null
+++ b/research/flossing/srm_design_round2_claude.md
@@ -0,0 +1,155 @@
+# SRM Design Round 2: Claude's analysis (codex placeholder failed twice)
+
+Building on `srm_design_codex.md`. Codex round 1 produced a solid joint-operator design; rounds 2-3 hit routing issues. This is my analysis of the 6 implementation questions.
+
+## (1) Attention compatibility
+
+**Recommendation: start with `mlp_t=True` for Sudoku, defer Lipschitz-attention to ARC phase.**
+
+Reasoning: TRM's reported 87% on Sudoku 1k uses `mlp_t=True` (MLP over tokens, no self-attention). Their `pretrain_att_sudoku` (with attention) is reportedly ~75%. So for Sudoku specifically, attention isn't helping: replacing it with an MLP gives **+12 percentage points of accuracy**. We can use MLP-over-tokens for both L and H updates; MLPs admit cheap Lipschitz bounds (Sandwich / AOL layers compose directly with ReLU; GELU has a Lipschitz constant of about 1.13, so it needs a rescale).
+
+For ARC, where attention likely matters: use **LipsFormer-style attention** (Qi et al. 2023), which normalizes via L_∞ / cosine to bound the Lipschitz constant of softmax(QK^T)V. Or skip self-attention and use **cross-attention only** with bounded-norm keys.
+
+## (2) Single vs two-state
+
+**Keep two-state (h, ℓ), with a single tied operator.**
+
+Why: even though TRM ties weights, the two states play **functionally distinct roles**:
+- ℓ (z_L) receives the `z_H + input_embedding` injection: "deliberation conditioned on current answer + question"
+- h (z_H) receives the `z_L` injection: "answer update from deliberation"
+- The hierarchy of update frequencies (H every T steps vs L every step) is preserved
+- Empirical: TRM with H_cycles=3 L_cycles=6 outperforms vanilla recursion (H=1, L=18) on Sudoku
+
+A single state z would force the same vector to serve as both "answer" and "scratchpad", losing an inductive bias that is empirically validated.
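The two-state schedule in section (2) can be sketched as a nested loop. This is an editorial illustration, not part of the commit: it assumes the schedule is a plain nested loop with one H update per block of L updates, and `two_state_recursion` and `toy_update` are hypothetical names standing in for the real tied operator.

```python
import numpy as np

def two_state_recursion(h, l, x, update, h_cycles=3, l_cycles=6):
    """Nested schedule: l is refined l_cycles times before each h update.

    `update(state, injection)` is a hypothetical shared (tied) operator.
    Returns the final (h, l) plus the number of l and h updates performed.
    """
    n_l = n_h = 0
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            l = update(l, h + x)   # l sees the current answer plus the question
            n_l += 1
        h = update(h, l)           # h absorbs the finished deliberation
        n_h += 1
    return h, l, n_l, n_h

def toy_update(state, injection):
    # Toy stand-in for the learned operator: average, then squash.
    return np.tanh(0.5 * state + 0.5 * injection)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8))            # (B, seq, hidden)
h0, l0 = np.zeros_like(x), np.zeros_like(x)

h, l, n_l, n_h = two_state_recursion(h0, l0, x, toy_update)
flat = two_state_recursion(h0, l0, x, toy_update, h_cycles=1, l_cycles=18)
print("hierarchical:", n_l, "L updates,", n_h, "H updates")
print("flat:", flat[2], "L updates,", flat[3], "H updates")
```

Both settings spend 18 L updates; they differ only in how often the answer state h is refreshed (3 times vs once), which is the comparison the last bullet refers to.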
+
+## (3) κ fixed vs learnable
+
+**Fixed κ=0.9 to start, learnable with anchor penalty as v2.**
+
+Phase 1 (validation): κ=0.9 fixed. Provably contractive, easy to analyze, no risk of κ-collapse.
+
+Phase 2 (refinement): `κ = sigmoid(c) * (0.99 - 0.5) + 0.5` learnable, which keeps κ in (0.5, 0.99). Add anchor penalty `λ_anchor · (κ - 0.9)²` to keep it near the empirical sweet spot.
+
+Why not learnable from start: the empirical observation `success λ_1 ≈ -0.15 ⇔ κ ≈ 0.86` (since e^(-0.15) ≈ 0.86) is from a trained HRM. SRM might find a different sweet spot, but starting at the empirically validated value avoids early-training pathology.
+
+## (4) Power iteration cost
+
+**Recommendation: AOL (Almost-Orthogonal Layers, Prach & Lampert 2022).**
+
+Compared to alternatives:
+- **Sandwich layers (Wang & Manchester 2023)**: tightest Lipschitz bound, but the parameterization is built on a Cayley transform, i.e. a matrix solve per layer on every forward pass (power iteration is what spectral normalization needs, not Sandwich). For 40h training this is cost-prohibitive on 4× A6000.
+- **AOL**: rescales each weight matrix as `W·D` with `D = diag((Σ_j |WᵀW|_ij)^(-1/2))`, which gives a **guaranteed 1-Lipschitz layer with no iteration** at training time. Slight expressivity loss (the bound is tight only when the columns of W are near-orthogonal, so training is pushed toward almost-orthogonal weights). For our recursive setting, where we want a guaranteed bound, this trade-off is favorable.
+- **Householder reflections**: exactly orthogonal (AOL is only almost orthogonal) but slower per forward pass than AOL.
+- **Cached spectral estimates**: cheapest, but training becomes unstable when stale estimates miss real growth in the spectral norm.
+
+AOL is the right balance for our compute budget.
+
+## (5) Multi-modal capacity
+
+**Phase 1: strict-contraction SRM (unique fixed point). Phase 2: small Gaussian noise injection if ARC plateaus.**
+
+Sudoku puzzles have unique solutions (at least, the Sudoku-Extreme dataset filters for unique solutions). A unique-fixed-point SRM should be able to match GRAM's 97%: GRAM gets there through stable dynamics plus multi-sample voting at inference, not through architectural multi-modality during training.
+
+For ARC, where multiple valid completions could exist: add `z_{t+1} = T(z_t) + σ·ε` with `ε ~ N(0, I)` and small σ. Mathematically:
+- Strict contraction: T is K-Lipschitz with K < 1, σ=0
+- Stochastic contraction: T is K-Lipschitz, σ small → stationary distribution exists, concentrated near fixed point but with Gaussian fluctuation
+- As σ increases: distribution spreads, eventually multi-modal if T has multiple "near-fixed-points"
+
+The σ value should be tuned per task. For Sudoku σ=0; for ARC σ=0.05-0.2.
+
+Mixture of K operators (alternative): too many extra parameters, harder to analyze, defer.
+
+## (6) Pseudocode for one SRM forward step
+
+```python
+def srm_step(h, l, x, theta, kappa=0.9, eta=1.0, alpha=1.0):
+    """
+    One SRM update step.
+    h, l: (B, seq, hidden), the high-level and low-level states
+    x: input embedding, (B, seq, hidden)
+    theta: parameters (Sandwich/AOL block + gain params + orthogonal U_HL, U_LH + biases)
+    kappa, eta, alpha: SRM hyperparameters
+    """
+    # 1. Joint feature map via AOL block (1-Lipschitz by construction)
+    # Concatenate h with sqrt(eta)*l so the operator naturally acts under
+    # the weighted norm ‖z‖²_P = ‖h‖² + η‖l‖²
+    z = concat(h, sqrt(eta) * l)  # (B, seq, 2*hidden)
+    psi = aol_block(z + bias_in(x), theta.aol)  # (B, seq, 2*hidden), Lip≤1
+    psi_h, psi_l_scaled = split(psi)
+    psi_l = psi_l_scaled / sqrt(eta)  # unscale back
+
+    # 2. Block-gain matrix A: row sums bounded by kappa under the weighted metric
+    # Softmax fixes each row sum, e.g. a_HH + a_HL/sqrt(eta) = kappa (column sums are not constrained here)
+    gain_H = kappa * softmax(theta.gain_H_logits)  # (2,): [a_HH, a_HL/sqrt(eta)]
+    a_HH, a_HL_scaled = gain_H[0], gain_H[1]
+    a_HL = a_HL_scaled * sqrt(eta)
+
+    gain_L = kappa * softmax(theta.gain_L_logits)  # (2,): [a_LL, sqrt(eta)*a_LH]
+    a_LL, a_LH_scaled = gain_L[0], gain_L[1]
+    a_LH = a_LH_scaled / sqrt(eta)
+
+    # Apply block gain (U_HL, U_LH are orthogonal mixing matrices acting on the hidden dimension)
+    Az_h = a_HH * psi_h + a_HL * (psi_l @ theta.U_HL.T)
+    Az_l = a_LH * (psi_h @ theta.U_LH.T) + a_LL * psi_l
+
+    # 3. Damping update: Lip ≤ (1 - alpha) + alpha*kappa, which is < 1 only if kappa < 1
+    h_new = (1 - alpha) * h + alpha * Az_h + bias_h(x)
+    l_new = (1 - alpha) * l + alpha * Az_l + bias_l(x)
+
+    return h_new, l_new
+```
+
+## A. Candidate architectures ranked
+
+1. **SRM-Joint-AOL** (top): the above pseudocode. AOL block + block-gain row-sum bound + light damping. Strict contraction, deterministic, ~7-10M params (TRM-scale).
+
+2. **SRM-Joint-Sandwich**: identical structure but Sandwich layers instead of AOL. Tighter Lipschitz bound, slower training. Use if AOL is too restrictive.
+
+3. **SRM-Stochastic** (Phase 2): SRM-Joint-AOL + Gaussian noise injection `σε` per step. Trades strict contraction for multi-modal coverage. Use for ARC if Phase 1 plateaus.
+
+## B. Full math for top candidate
+
+State: `z = (h, ℓ) ∈ R^(2D)`, weighted norm `‖z‖²_P = ‖h‖² + η‖ℓ‖²`.
+
+Per-step map: `T(z, x) = (1-α)z + α·A·ψ(z + b_in(x)) + b_out(x)`, with `α ∈ (0, 1]`.
+
+Where:
+- `ψ: R^(2D) → R^(2D)` is AOL-parameterized, `Lip_P(ψ) ≤ 1`.
+- `A` is block matrix with rows `[a_HH·I, a_HL·U_HL]` and `[a_LH·U_LH, a_LL·I]`, `U_HL, U_LH` orthogonal.
+- Block-row-sum bound: `a_HH + a_HL/√η ≤ κ` and `√η·a_LH + a_LL ≤ κ` with `κ ∈ (0.85, 0.95)`. These are the row sums of the 2×2 gain matrix `M = [[a_HH, a_HL/√η], [√η·a_LH, a_LL]]` that `A` induces in the scaled coordinates `(h, √η·ℓ)`. Since the P-norm is an ℓ2-type norm, `Lip_P(A) ≤ ‖M‖₂ ≤ √(‖M‖₁·‖M‖_∞)`, so the column sums need the same bound κ (or `‖M‖₂ ≤ κ` has to be enforced directly).
+
+Lipschitz of T under P-norm: `Lip_P(T) ≤ (1-α) + α·Lip_P(A·ψ) ≤ (1-α) + α·κ < 1` once `Lip_P(A) ≤ κ` holds.
+
+⇒ `log Lip_P(T) ≤ log((1-α) + α·κ) < 0`, so the dynamics are non-chaotic by construction.
+
+For α=1, κ=0.9: `λ_1 ≤ log 0.9 ≈ -0.105`. Consistent with the empirical success regime (`λ_1 ≈ -0.15`).
+
+## C. Risks (what could prevent ≥87% Sudoku)
+
+1. **AOL expressivity ceiling**: the AOL rescaling restricts the function class (its bound is tight only for near-orthogonal weights). Sandwich layers are less restrictive: if AOL caps below 87%, swap in Sandwich (at a slower training cost).
+
+2. **Block-gain matrix too restrictive**: forcing row-sum ≤ κ might not preserve TRM's specific information flow pattern (e.g., L→H amplification). Mitigation: lower bounds on `a_HL` and `a_LH` (e.g., ≥ 0.1·κ) to prevent decoupling.
+
+3. **Damping over-smooths**: α=1 means a full step (no damping). If training is unstable, lower α. But α<1 moves the bound `(1-α) + α·κ` toward 1 and reduces the effective compute per step (anti-iterative).
+
+4. **Weight tying may not transfer**: TRM ties across H and L (J_L = J_H = J via same module). SRM-Joint already does this. But our `A` and `ψ` blocks are also tied across recursion steps (Banach iteration). If this hurts representational depth, untie A while keeping ψ tied.
+
+5. **Bias terms `b_in(x), b_out(x)`**: must be the only place where input x enters. If we naively use input-dependent ψ (`ψ(z, x)` instead of `ψ(z + b_in(x))`), the analysis gets harder (the Lipschitz bound w.r.t. z can still hold, but the coupling with x is opaque).
+
+## D. Smallest validation experiment
+
+**Replace HRM's L_level and H_level with the AOL block (1-Lipschitz) + block-gain A, train Sudoku 1k for 5000 steps (1/4 of the full recipe). Compare the λ_joint_1 trajectory and final accuracy to baseline HRM at the same step count.**
+
+Pass criteria:
+- λ_joint_1 stays strictly < 0 throughout (validates by-construction stability)
+- Accuracy at 5K steps ≥ HRM's accuracy at 5K steps (baseline ~15% per our checkpoint evolution)
+- If both pass, scale to full 20K-step recipe.
+
+This is ~2-4 hours on 4× A6000 (1/10 the full HRM training). Quick gate on whether SRM-Joint-AOL is viable.
+
+## Notes for synthesis (when codex eventually responds)
+
+- Codex's round 1 framing (joint operator, block-gain, κ=0.85-0.95) is solid. I largely agree.
+- Implementation-wise I lean towards AOL over Sandwich for speed.
+- I keep the two-state design; codex left it ambiguous.
+- Noise injection deferred to Phase 2.
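The contraction bound in section B can be sanity-checked numerically. The sketch below is an editorial illustration, not part of the commit. It assumes the block gain is summarized by a 2×2 matrix `M` in the scaled coordinates `(h, √η·ℓ)`, that each row is `κ·softmax(logits)` as in the pseudocode, and that `Lip(ψ) ≤ 1`. The names `gain_matrix` and `lyapunov_bound` are hypothetical.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def gain_matrix(logits_h, logits_l, kappa=0.9):
    """2x2 gain matrix M in scaled coordinates (h, sqrt(eta) * l).

    Hypothetical layout: logits_h -> [M_HH, M_HL], logits_l -> [M_LH, M_LL].
    Each row is kappa * softmax(...), so every row sums to exactly kappa.
    """
    return np.stack([kappa * softmax(np.asarray(logits_h, float)),
                     kappa * softmax(np.asarray(logits_l, float))])

def lyapunov_bound(M, alpha=1.0):
    """log of the step bound (1 - alpha) + alpha * ||M||_2, assuming Lip(psi) <= 1."""
    return float(np.log((1.0 - alpha) + alpha * np.linalg.norm(M, 2)))

balanced = gain_matrix([0.0, 0.0], [0.0, 0.0])        # all four gains equal kappa / 2
skewed = gain_matrix([-20.0, 20.0], [-20.0, 20.0])    # both rows load the same column

for name, M in [("balanced", balanced), ("skewed", skewed)]:
    print(name,
          "row sums", M.sum(axis=1),
          "spectral norm", np.linalg.norm(M, 2),
          "log bound", lyapunov_bound(M))
```

With balanced gains the spectral norm is κ = 0.9 and the bound reproduces `log 0.9 ≈ -0.105`. In the skewed case the row sums are still κ, but the spectral norm is κ·√2 ≈ 1.27, so the log bound is positive. A row-sum constraint alone therefore does not certify contraction in the P-norm; constraining the column sums as well, or the spectral norm of `M` directly, would close that gap.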
