diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..1d83dec
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,1381 @@
+# CLAUDE.md — DAGFormer Project Specification
+
+**Read this file in full before writing any code.** This document is the
+single source of truth for all design decisions. If something contradicts
+the README or any other file, this document wins.
+
+---
+
+## 1. What Is This Project?
+
+**DAGFormer** trains a small neural network (the "structure predictor") to
+predict, for each token, the optimal wiring diagram (a DAG) of a frozen
+1B-parameter language model (OLMo2-1B). The predicted DAG controls which
+attention heads talk to which other heads. The entire system is trained
+end-to-end with language modeling loss — no labeled topology data needed.
+
+### Why?
+
+Standard transformers have a fixed sequential computation graph: layer 0 →
+layer 1 → ... → layer 15. Every input sees the same wiring. We showed via
+expensive oracle search that **context-dependent topologies** can reduce
+next-token prediction loss (NLL) from 2.58 to 0.12 (median across 50
+evaluation windows, 100% of windows improved). The oracle search is too
+slow to use at scale (500 gradient steps per window), so we need a learned
+predictor that produces the topology in a single forward pass.
+
+### What This Project Is NOT
+
+- This is NOT the oracle search codebase (that exists separately)
+- This is NOT a Mixture-of-Experts project (despite the repo history)
+- This does NOT modify the OLMo2-1B weights in Phase 1
+- This does NOT implement Phase 2 (joint training) yet — only the
+ infrastructure to support it later
+
+---
+
+## 2. Architecture — Exact Specification
+
+### 2.1 The Computation Graph of OLMo2-1B
+
+OLMo2-1B (HuggingFace ID: `allenai/OLMo-2-0425-1B`) has:
+
+- **16 transformer layers**, each with **16 attention heads**
+- This gives **256 "nodes"** total: node `(l, h)` = layer `l`, head `h`
+- We flatten to a single index: `node_id = l * 16 + h` (0-indexed)
+
+**Standard forward pass** — each layer does:
+```python
+# Input: residual (shared across all heads in this layer)
+normed = RMSNorm(residual)
+attn_out = self_attn(normed) # all 16 heads compute in parallel,
+ # outputs concatenated and projected by o_proj
+residual = residual + attn_out # attention residual connection
+normed2 = RMSNorm(residual)
+mlp_out = MLP(normed2)
+residual = residual + mlp_out # MLP residual connection
+```
+
+The **residual stream** at the start of layer `l` is therefore:
+```
+residual_l = embedding + Σ_{l'<l} (attn_output[l'] + mlp_output[l'])
+ = embedding + Σ_{l'<l} (Σ_h head_output[l',h] + mlp_output[l'])
+```
+where `head_output[l',h]` is head h's individual contribution to the
+attention output (its slice of `o_proj`, see §2.2). ALL heads in layer l
+see the SAME `residual_l` as input — there is no per-head differentiation
+in standard transformers.
+
+### 2.2 The Adjacency Matrix A
+
+We introduce a **256×256 adjacency matrix A** that controls information
+routing between attention heads across layers.
+
+```
+A[i][j] ∈ [0, 1] where i = source node, j = target node
+ i = l_i * 16 + h_i, j = l_j * 16 + h_j
+```
+
+#### Mask: Block-Upper-Triangular (NOT element-upper-triangular)
+
+**CRITICAL**: The mask is based on LAYER indices, not node indices.
+
+```python
+# CORRECT: block-upper-triangular based on layer
+mask[i, j] = 1 if layer(j) > layer(i) # i.e. j//16 > i//16
+mask[i, j] = 0 if layer(j) <= layer(i) # same layer or backward
+
+# WRONG: do NOT use torch.triu() — it would allow same-layer connections
+# e.g. triu would set mask[0,15]=1, but both are in layer 0
+```
+
+Heads in the same layer execute in parallel and cannot see each other's
+outputs. Only cross-layer forward connections are meaningful.
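+
+A minimal sketch of building this mask (the helper name
+`block_upper_triangular_mask` is illustrative, not part of the spec):
+
+```python
+import torch
+
+def block_upper_triangular_mask(num_layers: int = 16, heads_per_layer: int = 16) -> torch.Tensor:
+    """mask[i, j] = 1 iff layer(j) > layer(i); same-layer and backward entries are 0."""
+    num_nodes = num_layers * heads_per_layer
+    layer_idx = torch.arange(num_nodes) // heads_per_layer      # node_id -> layer index
+    return (layer_idx[None, :] > layer_idx[:, None]).float()    # [256, 256]
+
+mask = block_upper_triangular_mask()
+assert int(mask.sum()) == 30_720   # matches the connection count below
+assert mask[0, 15] == 0            # nodes 0 and 15 are both in layer 0; triu would wrongly allow this
+```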
+
+#### Connection Count
+
+| Type | Definition | Count | Role |
+|------|-----------|-------|------|
+| **Adjacent-layer** | `layer(j) = layer(i) + 1`, all head pairs | 15 × 16 × 16 = **3,840** | These exist in standard transformer (via shared residual). When gated to 1, behavior matches baseline. |
+| **Skip** | `layer(j) > layer(i) + 1`, all head pairs | 105 × 16 × 16 = **26,880** | These do NOT exist in standard transformer. They are additional direct routes that bypass intermediate layers. |
+| **Total** | All entries where `layer(j) > layer(i)` | **30,720** | |
+
+For logging and analysis, label connections as "adjacent" or "skip", but
+the forward pass treats all 30,720 entries identically.
+
+> **Note on "31K" count**: The oracle search reported "256 sequential +
+> 30,720 hyperconnection ≈ 31K" using a different parameterization
+> (separate per-head activity gates + routing gates). In our unified
+> 256×256 matrix, there are exactly 30,720 free entries. Both represent
+> the same underlying structure.
+
+#### What "head output" means (resolves shape ambiguity)
+
+In HuggingFace OLMo2, the `o_proj` layer concatenates all 16 heads and
+projects back to model_dim:
+
+```python
+# Inside self_attn:
+# Each head computes: attn_weights @ V → [batch, seq, head_dim] (head_dim = 128)
+# All heads concatenated: [batch, seq, 16 * 128] = [batch, seq, 2048]
+# Then: o_proj([batch, seq, 2048]) → [batch, seq, 2048]
+```
+
+The `o_proj` weight matrix `W_o ∈ R^{2048 × 2048}` can be viewed as 16
+column blocks: `W_o = [W_o^0 | W_o^1 | ... | W_o^15]`, each `W_o^h ∈
+R^{2048 × 128}`. The full attention output is:
+
+```
+attn_output = Σ_h W_o^h @ head_value_h = Σ_h head_output[l, h]
+```
+
+So **`head_output[l, h]` is `W_o^h @ head_value_h`, shape `[batch, seq,
+model_dim]` (= 2048)**. This is each head's contribution to the attention
+output in MODEL DIM space. Heads can be summed directly because they're
+already in the same 2048-dim space.
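+
+To make the column-block view concrete, a small self-check sketch (random
+weights, purely illustrative) showing that the per-head slices sum to the
+full `o_proj` output:
+
+```python
+import torch
+
+batch, seq, num_heads, head_dim, model_dim = 2, 4, 16, 128, 2048
+W_o = torch.randn(model_dim, num_heads * head_dim)            # o_proj.weight, [2048, 2048]
+head_values = torch.randn(batch, seq, num_heads, head_dim)    # per-head attention values
+
+full = head_values.reshape(batch, seq, -1) @ W_o.T            # standard path: concat heads, project
+
+per_head = []
+for h in range(num_heads):
+    W_o_h = W_o[:, h * head_dim:(h + 1) * head_dim]           # column block, [2048, 128]
+    per_head.append(head_values[:, :, h] @ W_o_h.T)           # head_output[l, h], [batch, seq, 2048]
+
+assert torch.allclose(sum(per_head), full, atol=1e-3, rtol=1e-4)   # Σ_h head_output == attn_output
+```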
+
+#### Per-Head Input Assembly (the core modification)
+
+In DAGFormer, each head j = (l_j, h_j) has its **own input**, assembled
+from three sources:
+
+```python
+input_j = embedding # (1) always present
+ + Σ_{l' < l_j} mlp_output[l'] # (2) all prior MLP outputs, NOT gated
+ + Σ_{i: layer(i) < l_j} A[i,j] * head_output[i] # (3) gated head outputs
+```
+
+**Source (1) — Token embedding**: The output of the token embedding layer
+(before any transformer layer). Always included for every head. This is
+NOT part of A. Think of it as the "seed" of the computation.
+
+**Source (2) — MLP outputs**: Each layer's MLP computes on a shared
+aggregated state (see below) and its output is added to ALL downstream
+head inputs equally. MLP outputs are **never gated by A**. This keeps
+the MLP's contribution identical to the standard forward pass.
+
+**Source (3) — Gated head outputs**: The only part controlled by A.
+Each entry A[i,j] scales how much of head i's output reaches head j's
+input. Different heads h within the same layer l_j can receive different
+weighted combinations of prior head outputs — this is what makes the
+256×256 per-head routing meaningful.
+
+#### Baseline Reproduction Proof (resolves the "double-counting" concern)
+
+When A[i,j] = 1 for ALL 30,720 valid entries:
+
+```
+input_j (where j is in layer l)
+ = embedding + Σ_{l'<l} mlp_output[l'] + Σ_{i: layer(i)<l} 1.0 * head_output[i]
+ = embedding + Σ_{l'<l} mlp_output[l'] + Σ_{l'<l} Σ_h head_output[l',h]
+ = embedding + Σ_{l'<l} (mlp_output[l'] + attn_output[l'])
+ = residual_l ← the standard residual stream!
+```
+
+All heads in the same layer l receive the SAME input (because A=1 gives
+the same weighted sum for all target heads). This equals the standard
+`residual_l`. Therefore **A=1 exactly reproduces vanilla OLMo2-1B**. ✓
+
+There is NO double-counting because each head output appears exactly once
+in the sum. Head (0,5)'s output is included via the gated sum (source 3),
+and it flows to layer 8 only via that direct A[0*16+5, 8*16+h] entry, NOT
+also through intermediate layers. The intermediate layers' head outputs
+are separate entries in the sum.
+
+#### MLP Execution (not gated, shared input)
+
+After all 16 heads in layer l compute, the MLP runs:
+
+```python
+# MLP input = standard aggregation (same as baseline when A=1)
+mlp_input_l = embedding
+ + Σ_{l'<l} mlp_output[l']
+ + Σ_{l'<=l} attn_output[l'] # includes current layer's heads
+# where attn_output[l'] = Σ_h head_output[l', h] (UNGATED sum of this layer's heads)
+
+# IMPORTANT: MLP always sees the UNGATED sum of its own layer's head outputs.
+# The gating by A only affects what OTHER layers' heads receive.
+# This is necessary for baseline reproduction.
+
+mlp_output[l] = MLP_l(RMSNorm(mlp_input_l))
+```
+
+The MLP input includes the **ungated** sum of the current layer's head
+outputs. Gating only affects cross-layer routing. This design ensures
+that A does not interfere with the MLP's computation within its own layer.
+
+#### Layer 0 Special Case
+
+Layer 0's 16 heads have no prior head outputs (no nodes with layer < 0).
+Their input is simply:
+```python
+input_j (j in layer 0) = embedding # no prior heads, no prior MLPs
+```
+This requires no special handling — the formula naturally produces
+`embedding` when there are no prior sources.
+
+### 2.3 The Structure Predictor
+
+The predictor takes the current context and outputs A. Architecture:
+
+```
+Input: raw text → Qwen tokenizer → qwen_ids [batch, qwen_seq_len]
+ (qwen_seq_len may differ from OLMo's seq_len — that's fine,
+ because mean pooling collapses to a single vector per sequence)
+ │
+ ▼
+ ┌─────────────────────────────────┐
+ │ Qwen-3-Embedding-0.6B │
+ │ HF ID: "Qwen/Qwen3-Embedding-0.6B" │
+ │ │
+ │ This is a TEXT EMBEDDING MODEL, │
+ │ NOT a generative LLM. It takes │
+ │ text and produces a single │
+ │ fixed-size vector per sequence. │
+ │ │
+ │ FROZEN — no gradients ever │
+ │ Uses its OWN tokenizer │
+ │ (separate from OLMo's) │
+ │ │
+ │ Input: raw text → Qwen tokenizer│
+ │ → Qwen forward → last_hidden │
+ │ Pooling: mean over seq_len dim │
+ │ → embedding e ∈ R^d │
+ │ (d = model.config.hidden_size │
+ │ — do NOT hardcode, query at │
+ │ runtime from the model) │
+ │ │
+ │ e is a SINGLE VECTOR per │
+ │ sequence — it summarizes the │
+ │ entire context. The predictor │
+ │ MLP then maps e → A matrix. │
+ │ Qwen has nothing to do with │
+ │ OLMo's vocabulary or tokens. │
+ └─────────────────────────────────┘
+ │
+ ▼
+ ┌─────────────────────────────────┐
+ │ MLP Decoder (TRAINABLE) │
+ │ │
+ │ Linear(d, hidden_dim) │
+ │ → GELU │
+ │ → Linear(hidden_dim, hidden_dim)│
+ │ → GELU │
+ │ → two heads: │
+ │ Linear(hidden_dim, 256 * r) │ → reshape to U ∈ R^{256×r}
+ │ Linear(hidden_dim, 256 * r) │ → reshape to V ∈ R^{256×r}
+ │ │
+ │ hidden_dim = 1024 (default) │
+ │ r = rank hyperparameter │
+ │ ablate: r ∈ {8, 16, 32, 64} │
+ │ default: r = 32 │
+ └─────────────────────────────────┘
+ │
+ ▼
+ ┌─────────────────────────────────┐
+ │ Low-Rank Logits │
+ │ │
+ │ Z = U @ V.transpose(-1, -2) │
+ │ Z ∈ R^{batch × 256 × 256} │
+ │ │
+ │ This is the logit matrix before │
+ │ any gating or masking. │
+ └─────────────────────────────────┘
+ │
+ ▼
+ ┌─────────────────────────────────┐
+ │ Block-Upper-Triangular Mask │
+ │ │
+ │ mask[i,j] = 1 if j//16 > i//16 │
+ │ (layer(j) > layer(i)) │
+ │ │
+ │ ⚠ Do NOT use torch.triu() — │
+ │ it allows same-layer connections│
+ │ │
+ │ Z_masked = Z * mask + (-1e9) * (1 - mask) │
+ │ (force invalid positions to │
+ │ -inf so sigmoid → 0) │
+ └─────────────────────────────────┘
+ │
+ ▼
+ ┌─────────────────────────────────┐
+ │ Gumbel-Sigmoid (3 modes) │
+ │ │
+ │ MODE 1 — TRAINING: │
+ │ G ~ Logistic(0, 1) │
+ │ (= log(U) - log(1-U), │
+ │ U ~ Uniform(0,1)) │
+ │ A = σ((Z_masked + G) / τ) │
+ │ Gumbel noise G adds stochastic│
+ │ exploration. Gradients flow │
+ │ through σ naturally (this is │
+ │ continuous relaxation, NOT │
+ │ STE — no straight-through │
+ │ estimator needed here). │
+ │ │
+ │ MODE 2 — EVAL SOFT: │
+ │ A = σ(Z_masked / τ) │
+ │ NO Gumbel noise. Deterministic│
+ │ soft gates. Values in (0,1). │
+ │ Used for eval/nll_soft metric.│
+ │ │
+ │ MODE 3 — EVAL HARD: │
+ │ A = (Z_masked > 0).float() │
+ │ NO Gumbel noise. Binary 0/1. │
+ │ Threshold at logit=0 (= prob │
+ │ 0.5). Used for eval/nll_hard │
+ │ and for final inference. │
+ │ │
+ │ τ = temperature, annealed │
+ │ during training (see §3) │
+ └─────────────────────────────────┘
+```
+
+**WHY Gumbel-Sigmoid — the gradient problem and its solution:**
+
+At inference we want binary A ∈ {0, 1} (discrete routing decisions).
+But discrete decisions have zero gradient — you can't backprop through
+`(Z > 0).float()`. The predictor MLP would receive no learning signal.
+
+Gumbel-Sigmoid is a **continuous relaxation** (also called the "concrete
+distribution" / "surrogate gradient" technique):
+
+```
+Training: A = σ((Z + G) / τ) ← continuous, differentiable
+ ∂A/∂Z = A(1-A)/τ ← well-defined gradient
+ backprop: ∂Loss/∂Z = ∂Loss/∂A · ∂A/∂Z
+
+Inference: A = (Z > 0).float() ← discrete, but no gradient needed
+```
+
+The complete gradient chain during training:
+```
+Loss (NLL from OLMo)
+ → ∂Loss/∂logits (OLMo's output layer)
+ → ∂logits/∂head_inputs (through OLMo's frozen layers — computed
+ but NOT used to update OLMo weights)
+ → ∂head_inputs/∂A (the gate multiplication: input_j = Σ A[i,j] * head_out[i])
+ → ∂A/∂Z (sigmoid derivative: A(1-A)/τ — THIS is the surrogate)
+ → ∂Z/∂(U,V) (low-rank matmul: Z = UV^T)
+ → ∂(U,V)/∂MLP_params (the trainable predictor MLP)
+ → optimizer updates MLP_params
+```
+
+Key points:
+- OLMo is frozen but its forward computation is ON the gradient tape
+ (it's a differentiable function of A). Gradients flow THROUGH OLMo
+ back to A, even though OLMo's own parameters don't get updated.
+- The sigmoid σ acts as a differentiable surrogate for the discrete
+ threshold. As τ → 0, σ((Z+G)/τ) approaches a step function, but
+ during training τ is always > 0 so gradients are always nonzero.
+- Gumbel noise G provides stochastic exploration: even if Z is small,
+ the noise can push A above or below 0.5, letting the model discover
+ which connections matter.
+- NO straight-through estimator (STE) is needed. The sigmoid itself
+ provides the gradient. STE would be needed if we hard-thresholded
+  during training, but we don't.
+
+```
+ │
+ ▼
+ ┌─────────────────────────────────┐
+ │ Cascading Gate │
+ │ │
+ │ Purpose: if node j has no │
+ │ incoming edges, it has no info │
+ │ to propagate, so kill outgoing. │
+ │ │
+ │ ONE-PASS computation (not │
+ │ sequential by layer): │
+ │ │
+ │ 1. Compute ALL incoming sums: │
+ │ inc_j = Σ_i A[i, j] ∀j │
+ │ 2. Compute ALL gates: │
+ │ g_j = σ(k * inc_j) ∀j │
+ │ 3. Apply ALL gates at once: │
+ │ A[j, :] *= g_j ∀j │
+ │ │
+ │ This is a single vectorized op, │
+ │ NOT a layer-by-layer cascade. │
+ │ All incoming sums use the │
+ │ ORIGINAL A values (before any │
+ │ gates are applied). │
+ │ │
+ │ k = 5.0 (fixed scalar) │
+ │ k can be made learnable later. │
+ │ This is fully differentiable. │
+ │ │
+ │ EVAL HARD MODE: After hard- │
+ │ thresholding A to binary 0/1, │
+ │ the cascading gate is also │
+ │ hard: g_j = 1 if inc_j > 0, │
+ │ else g_j = 0. (Because σ(5*0) │
+ │ = 0.5 would be wrong for │
+ │ binary gates — a disconnected │
+ │ node should be fully killed, │
+ │ not half-alive.) │
+ │ │
+ │ SOFT MODE NOTE: In training and │
+ │ eval-soft mode, the cascading │
+ │ gate uses σ(k * inc_j) with the│
+ │ ORIGINAL (pre-gate) A values. │
+ │ If all incoming gates are small │
+ │ (e.g. 0.01 each), inc_j can │
+ │ still be > 0, giving g_j > 0.5.│
+ │ This is INTENTIONAL: in soft │
+ │ mode, "weakly connected" is a │
+ │ valid state (the gradient can │
+ │ still flow). The cascading gate │
+ │ is a soft penalty, not a hard │
+ │ kill, during training. │
+ └─────────────────────────────────┘
+ │
+ ▼
+ Output: A ∈ [0,1]^{batch × 256 × 256}, block-upper-triangular
+ (30,720 free entries, rest forced to 0)
+```
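+
+A condensed, illustrative sketch of the masking, the three Gumbel-Sigmoid
+modes, and the cascading gate described above (the function name,
+signature, and mode strings are assumptions, not a required API):
+
+```python
+import torch
+
+def gate_from_logits(Z: torch.Tensor, mask: torch.Tensor, tau: float,
+                     mode: str, k: float = 5.0) -> torch.Tensor:
+    """Z: [batch, 256, 256] low-rank logits; mask: [256, 256] block-upper-triangular."""
+    Z_masked = Z * mask + (-1e9) * (1 - mask)
+    if mode == "train":                            # MODE 1: logistic ("Gumbel") noise + sigmoid
+        U = torch.rand_like(Z_masked).clamp(1e-6, 1 - 1e-6)
+        G = torch.log(U) - torch.log1p(-U)         # G ~ Logistic(0, 1)
+        A = torch.sigmoid((Z_masked + G) / tau)
+    elif mode == "eval_soft":                      # MODE 2: deterministic soft gates
+        A = torch.sigmoid(Z_masked / tau)
+    elif mode == "eval_hard":                      # MODE 3: deterministic hard gates
+        A = (Z_masked > 0).float()
+    else:
+        raise ValueError(mode)
+
+    incoming = A.sum(dim=1)                        # inc_j = Σ_i A[i, j], from pre-gate values
+    if mode == "eval_hard":
+        g = (incoming > 0).float()                 # hard cascade: fully kill disconnected nodes
+    else:
+        g = torch.sigmoid(k * incoming)            # soft cascade, differentiable
+    return A * g.unsqueeze(-1)                     # scale node j's outgoing row A[j, :] by g_j
+```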
+
+### 2.4 End-to-End Pipeline
+
+```
+raw text ──→ [Qwen tokenizer] ──→ qwen_ids ──→ [Qwen encoder] ──→ e ──→ [Predictor MLP] ──→ A
+ │ │
+ │ ▼
+ └──→ [OLMo tokenizer] ──→ olmo_ids ──→ [OLMo2-1B modified forward with A] ──→ logits ──→ NLL
+ │
+ ∇ backprop to predictor MLP
+```
+
+**Tokenization**: Qwen and OLMo have DIFFERENT tokenizers and vocabularies.
+The dataloader produces raw text strings. Each model tokenizes independently:
+```python
+# In the dataloader or pipeline:
+raw_text = batch["text"] # list of strings
+qwen_ids = qwen_tokenizer(raw_text, ...)["input_ids"] # Qwen's token IDs
+olmo_ids = olmo_tokenizer(raw_text, ...)["input_ids"] # OLMo's token IDs
+# These are DIFFERENT tensors with DIFFERENT lengths and vocabularies.
+# Qwen produces a pooled embedding (one vector per sequence).
+# OLMo produces per-token logits for NLL computation.
+```
+
+- Qwen and OLMo are FROZEN (Phase 1). Only the MLP decoder trains.
+- The same **raw text** goes to both Qwen and OLMo, but they use
+ **separate tokenizers**. Qwen tokenizes independently to produce an
+ embedding; OLMo tokenizes independently for language modeling.
+ Token IDs are NOT shared between the two models.
+- Qwen's output (a single pooled vector per sequence) goes to the
+ predictor MLP → A matrix. OLMo uses A in its modified forward pass.
+- Loss = NLL (from OLMo) + λ · mean(A) (sparsity regularization)
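+
+A schematic sketch of one training step through this pipeline, assuming
+the `dagformer_forward` semantics of §4.2 (the predictor call signature
+and variable names here are illustrative):
+
+```python
+import torch.nn.functional as F
+
+def pipeline_step(batch, predictor, olmo, tau, mode, lambda_t):
+    A = predictor(batch["raw_text"], tau=tau, mode=mode)        # [batch, 256, 256]
+    logits = dagformer_forward(olmo, batch["olmo_ids"], A)      # [batch, seq, vocab_size]
+    nll = F.cross_entropy(logits.transpose(1, 2), batch["olmo_labels"])
+    total_loss = nll + lambda_t * A.mean()                      # NLL + sparsity regularizer
+    return total_loss, nll, A
+```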
+
+---
+
+## 3. Training — Exact Specification
+
+### 3.1 Phase 1: Learn Topology (IMPLEMENT THIS)
+
+| What | Frozen/Trainable |
+|------|-----------------|
+| OLMo2-1B (`allenai/OLMo-2-0425-1B`) | ❄ FROZEN, no grad |
+| Qwen-3-Embedding (`Qwen/Qwen3-Embedding-0.6B`) | ❄ FROZEN, no grad |
+| Structure Predictor (MLP decoder) | 🔥 TRAINABLE |
+
+| Hyperparameter | Value | Notes |
+|----------------|-------|-------|
+| Dataset | Dolma v1.7 (`allenai/dolma`, `name="v1_7"`, streamed) | Specify version explicitly |
+| Token budget | 5–10B | Configurable |
+| Sequence length | 1024 | OLMo token count, matches oracle search |
+| Batch size | 32 (start) | Reduce if OOM |
+| Learning rate | 3e-4 | For predictor MLP only |
+| Optimizer | AdamW | β1=0.9, β2=0.999, weight_decay=0.01 |
+| LR schedule | Cosine decay to 0 | Standard |
+| τ (temperature) init | 5.0 | |
+| τ final | 0.2 | |
+| τ schedule | Cosine annealing | τ(t) = τ_f + 0.5(τ_i - τ_f)(1 + cos(πt/T)) |
+| Sparsity λ max | 0.01 | |
+| Sparsity ramp | Linear 0 → λ_max over first 20% of steps | |
+| Hardware | 4× A40 (48GB each) | Use DDP |
+
+**Loss function:**
+```python
+total_loss = nll_loss + lambda_t * A.mean()
+```
+where `lambda_t` ramps linearly from 0 to `lambda_max` over the first 20%
+of training steps.
+
+**τ schedule (exact formula):**
+```python
+tau_t = tau_final + 0.5 * (tau_init - tau_final) * (1 + math.cos(math.pi * step / total_steps))
+```
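+
+A minimal sketch of the two schedules (intended for `schedulers.py`;
+function names are illustrative):
+
+```python
+import math
+
+def tau_at(step: int, total_steps: int, tau_init: float = 5.0, tau_final: float = 0.2) -> float:
+    """Cosine-annealed Gumbel-Sigmoid temperature (formula above)."""
+    return tau_final + 0.5 * (tau_init - tau_final) * (1 + math.cos(math.pi * step / total_steps))
+
+def lambda_at(step: int, total_steps: int, lambda_max: float = 0.01, warmup_frac: float = 0.2) -> float:
+    """Sparsity coefficient: linear ramp from 0 to lambda_max over the first warmup_frac of steps."""
+    warmup_steps = max(1, int(warmup_frac * total_steps))
+    return lambda_max * min(1.0, step / warmup_steps)
+```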
+
+### 3.1.1 Data Processing Pipeline
+
+**Dolma version**: Use `allenai/dolma` with `name="v1_7"`. If v1_7 is not
+available on HuggingFace, fall back to whatever default version loads.
+Verify by printing the dataset info at startup and logging to wandb.
+
+**Sequence packing** (critical for training efficiency):
+
+Dolma contains documents of varying lengths. Do NOT pad short documents
+or discard them. Instead, **pack multiple documents into fixed-length
+sequences**:
+
+```python
+# Sequence packing as a streaming generator:
+def pack_documents(dolma_stream, olmo_tokenizer, seq_len):
+    buffer = []
+    for doc in dolma_stream:
+        olmo_tokens = olmo_tokenizer(doc["text"], add_special_tokens=False)["input_ids"]
+        buffer.extend(olmo_tokens)
+        buffer.append(olmo_tokenizer.eos_token_id)  # document separator
+
+        while len(buffer) >= seq_len + 1:  # +1 for labels (next-token prediction)
+            chunk = buffer[:seq_len + 1]
+            buffer = buffer[seq_len + 1:]
+            yield {
+                "olmo_ids": chunk[:seq_len],        # input
+                "olmo_labels": chunk[1:seq_len+1],  # shifted target
+                "raw_text": olmo_tokenizer.decode(chunk[:seq_len]),  # for Qwen
+            }
+```
+
+This means:
+- No padding ever. Every token contributes to NLL.
+- A single training sequence may contain parts of multiple documents,
+  separated by EOS tokens. Tokens can still attend across an EOS
+  boundary to earlier documents in the same window; this is standard
+  practice for packed LM pretraining and is accepted as harmless.
+- `raw_text` is decoded from OLMo tokens to feed to Qwen. This is a
+ slight approximation (Qwen re-tokenizes the decoded text), but Qwen
+ only needs to understand the gist of the context to produce a good
+ embedding — exact token alignment doesn't matter.
+- **Qwen sees the entire packed sequence** (which may contain multiple
+ document fragments separated by EOS). This is intentional: the
+ predictor should condition on exactly what OLMo will process. Qwen's
+ mean-pooling produces a single summary vector of the full 1024-token
+ window, which is the right granularity — one A matrix per window.
+- Qwen's `seq_len` will differ from OLMo's 1024 (different tokenizer
+ granularity), but this is fine because Qwen's output is mean-pooled
+ to a single vector regardless of input length.
+
+**Qwen input format**: Qwen3-Embedding-0.6B is an embedding model.
+**Decision**: Use raw text directly (no prefix). Rationale: our input is
+a packed sequence of document fragments, not a search query or passage
+in the retrieval sense. Prefixes like "query:" or "passage:" are designed
+for retrieval tasks and would be semantically misleading here. The Qwen
+encoder just needs a general-purpose text representation.
+
+Log `qwen_input_prefix: ""` to wandb config for reproducibility. If
+future experiments show that a prefix helps, it can be changed via the
+`qwen_input_prefix` config field.
+
+### 3.2 Evaluation Data
+
+Use a **fixed held-out subset of Dolma** for eval. Implementation:
+
+```python
+# At startup, skip the first N examples to get to a held-out region
+eval_dataset = load_dataset("allenai/dolma", name="v1_7", split="train", streaming=True)
+eval_dataset = eval_dataset.skip(1_000_000) # skip 1M examples
+eval_dataset = eval_dataset.take(1_000) # use next 1K as eval set
+
+# Pack these into fixed-length sequences using the same packing logic
+# as training (§3.1.1), then cache in memory at startup
+eval_batches = list(eval_dataloader) # ~1K sequences, fits in RAM
+```
+
+**Eval caching**: The `skip(1_000_000)` on a streaming dataset is slow
+(~minutes). To avoid this cost on every restart, **cache the packed eval
+sequences to disk** after the first run:
+```python
+eval_cache_path = os.path.join(config.save_dir, "eval_cache.pt")
+if os.path.exists(eval_cache_path):
+ eval_batches = torch.load(eval_cache_path)
+else:
+ eval_batches = list(build_eval_dataloader(...))
+ torch.save(eval_batches, eval_cache_path)
+```
+
+**Multi-GPU eval**: Run eval on **rank 0 only**. Other ranks skip eval
+and wait at a barrier. This avoids redundant computation and ensures
+consistent eval metrics (no need to reduce across GPUs).
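+
+Sketched below; `run_eval` is a placeholder for the eval loop described
+next:
+
+```python
+import torch.distributed as dist
+import wandb
+
+if dist.get_rank() == 0:
+    metrics = run_eval(pipeline, eval_batches, tau)   # eval/nll_soft, eval/nll_hard, eval/nll_baseline
+    wandb.log(metrics, step=step)
+dist.barrier()                                        # other ranks wait here until eval finishes
+```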
+
+Eval runs in **two modes** (both deterministic, no Gumbel noise):
+- **Soft**: A = σ(Z / τ) at current temperature. Reports `eval/nll_soft`.
+- **Hard**: A = (Z > 0).float(), binary 0/1. Reports `eval/nll_hard`.
+Also report `eval/nll_baseline` with A = all-ones (should be constant).
+
+### 3.3 Multi-GPU Strategy
+
+**Standard DDP — each GPU holds a full replica of all three models.**
+
+Memory budget per GPU (fp16/bf16):
+```
+OLMo2-1B parameters: ~2.0 GB
+Qwen-0.6B parameters: ~1.2 GB
+Predictor MLP: ~0.05 GB
+Optimizer states: ~0.1 GB (only predictor params)
+─────────────────────────────
+Model total: ~3.4 GB
+
+Activations (seq_len=1024, batch=8):
+  OLMo per-head forward: ~12-16 GB (per-head inputs mean 16 separate
+    attention computations per layer instead of 1 batched MHA — this
+    increases intermediate activation storage significantly vs standard)
+  Qwen forward: ~3 GB
+  Per-head A storage: negligible (256×256 × batch × float32 ≈ 2 MB)
+─────────────────────────────
+Activations total: ~16-20 GB
+
+Grand total: ~20-24 GB per GPU (with batch=8)
+Headroom on 48GB A40: ~24-28 GB
+```
+
+This fits but is tighter than a standard OLMo forward pass. If OOM,
+reduce batch_size to 4 first (halves activation memory). If still OOM,
+use **gradient accumulation** to maintain effective batch size:
+```python
+# Example: effective_batch=8, micro_batch=2, accumulation_steps=4
+accumulation_steps = config.batch_size // config.micro_batch_size
+for micro_step in range(accumulation_steps):
+    micro_batch = micro_batches[micro_step]  # `batch` pre-split into micro_batch_size chunks
+    loss = forward(micro_batch) / accumulation_steps
+ loss.backward()
+optimizer.step()
+optimizer.zero_grad()
+```
+Add `micro_batch_size` to config (default: same as `batch_size`, i.e. no
+accumulation). The per-head computation path creates ~2-3x more
+intermediates than standard batched MHA because we cannot fuse across
+heads when each has a different input.
+
+**Gradient checkpointing**: OLMo is frozen, so no gradients are computed
+for its parameters and no optimizer state is kept for them. However,
+gradients must still flow THROUGH OLMo's forward pass back to A (and then
+to the predictor), so the activations needed for that chain rule still
+have to be stored (or recomputed). Freezing saves the parameter-gradient
+work and optimizer memory, not the activation storage required to
+backpropagate to A.
+
+If memory is still tight, apply `torch.utils.checkpoint` to the OLMo
+layer loop — recompute each layer's forward pass during backward instead
+of storing all intermediates. This trades compute for memory.
+Gradient checkpointing is optional; try without it first.
+
+**Storing head_outputs**: The forward pass accumulates `head_outputs` for
+all 256 nodes (each `[batch, seq, 2048]`). At fp16 with batch=8,
+seq=1024: 256 × 8 × 1024 × 2048 × 2 bytes ≈ 8.6 GB. This is the
+dominant memory cost. Gradient checkpointing can reduce this by
+recomputing head outputs layer-by-layer during backward.
+
+Use `DistributedDataParallel` wrapping ONLY on the predictor MLP (the
+only trainable component). The frozen Qwen and OLMo models do not need
+DDP wrapping — just load them on each GPU.
+
+```python
+# DDP setup
+predictor = StructurePredictor(config).to(device)
+predictor = DDP(predictor, device_ids=[local_rank])
+
+# Frozen models — no DDP needed
+olmo = AutoModelForCausalLM.from_pretrained(...).to(device).eval()
+qwen = AutoModel.from_pretrained(...).to(device).eval()
+
+# Gradient disabled for frozen models
+for p in olmo.parameters(): p.requires_grad_(False)
+for p in qwen.parameters(): p.requires_grad_(False)
+```
+
+Data parallelism: Dolma is a streaming `IterableDataset` — do NOT use
+`DistributedSampler` (which requires map-style datasets). Instead, shard
+the stream manually:
+```python
+dataset = load_dataset("allenai/dolma", name="v1_7", split="train", streaming=True)
+dataset = dataset.shard(num_shards=world_size, index=rank)
+```
+Each GPU processes a disjoint shard. Gradients are synchronized by DDP.
+
+### 3.4 o_proj Bias and Weight Layout
+
+OLMo2-1B uses **no bias** in its linear layers (`bias=False` for q_proj,
+k_proj, v_proj, o_proj, and MLP layers). This is standard for modern LLMs.
+
+**Verify this at runtime:**
+```python
+assert model.layers[0].self_attn.o_proj.bias is None, \
+ "Expected no bias in o_proj — update per-head splitting if bias exists"
+```
+
+If a future model version adds bias, the per-head split must also split
+the bias: `bias_h = bias[h * head_dim : (h+1) * head_dim]` for input
+projections, or `bias / num_heads` for o_proj (since it's additive).
+But for OLMo2-1B, this is not needed.
+
+**o_proj weight layout** for per-head splitting:
+```python
+# o_proj.weight: [model_dim, model_dim] = [2048, 2048]
+# This maps concatenated head outputs → model_dim
+#
+# Split by INPUT dimension (head_dim chunks):
+# o_proj_h.weight = o_proj.weight[:, h*head_dim : (h+1)*head_dim]
+# Shape: [2048, 128]
+#
+# head_output[l, h] = attn_values_h @ o_proj_h.weight.T
+# Shape: [batch, seq, 128] @ [128, 2048] → [batch, seq, 2048]
+```
+
+### 3.5 Phase 2: Joint CPT (DO NOT IMPLEMENT — FUTURE WORK)
+
+In Phase 2, OLMo would be unfrozen and co-trained with the predictor using
+differential learning rates (OLMo: 3e-5, predictor: 1e-4). The training
+loop should be DESIGNED to support this (parameter groups with different
+LRs) but the actual unfreezing logic should NOT be implemented yet.
+
+---
+
+## 4. OLMo2-1B Modification — Implementation Guide
+
+This is the hardest part. The goal: intercept OLMo2-1B's forward pass to
+apply the adjacency matrix A without forking the model code.
+
+### 4.1 Strategy: Hook-Based or Subclass
+
+**Option A (preferred): Monkey-patch the attention forward.**
+- Load the model normally via HuggingFace
+- Replace each layer's attention module's forward method with a wrapped
+ version that accepts and applies A
+- Pro: no code duplication, stays in sync with HF updates
+- Con: fragile if HF changes internal API
+
+**Option B: Subclass the model.**
+- Subclass `Olmo2ForCausalLM` and override the relevant methods
+- Pro: cleaner, more explicit
+- Con: more code to maintain
+
+Choose whichever is cleaner after inspecting the actual OLMo2 source.
+The key requirement is: **when A is not provided or is all-ones, the
+output must be IDENTICAL to vanilla OLMo2-1B.**
+
+### 4.2 What Exactly to Modify
+
+**Reference pseudocode — this is the exact semantics to implement:**
+
+```python
+def dagformer_forward(model, olmo_ids, A, input_norm="none"):
+ """
+ Args:
+ model: OLMo2-1B loaded from HuggingFace
+ olmo_ids: [batch, seq_len] — tokenized by OLMo's tokenizer
+ A: [batch, 256, 256] — block-upper-triangular gate matrix
+ (produced by the predictor from Qwen embeddings — but this
+ function doesn't know about Qwen, it just takes A as input)
+ input_norm: normalization method for assembled head inputs
+ Returns:
+ logits: [batch, seq_len, vocab_size]
+ """
+ batch, seq = olmo_ids.shape
+ model_dim = 2048
+ num_layers = 16
+ num_heads = 16
+
+ # Token embedding (always included, not gated)
+ embedding = model.embed_tokens(olmo_ids) # [batch, seq, 2048]
+
+ # Storage for per-head outputs (in model_dim space)
+ head_outputs = {} # (layer, head) -> [batch, seq, model_dim]
+ mlp_outputs = {} # layer -> [batch, seq, model_dim]
+
+ for l in range(num_layers):
+ layer_module = model.layers[l]
+
+ # === ASSEMBLE PER-HEAD INPUTS ===
+ # Base: embedding + all prior MLP outputs (shared, ungated)
+ base_l = embedding.clone()
+ for prev_l in range(l):
+ base_l = base_l + mlp_outputs[prev_l]
+
+ # Per-head: add gated head outputs from earlier layers
+ # This is the KEY difference from standard transformer:
+ # each head h in layer l gets its OWN input.
+ per_head_inputs = []
+ for h in range(num_heads):
+ j = l * num_heads + h
+ assembled = base_l.clone() # [batch, seq, model_dim]
+
+ # Gated sum of all prior head outputs
+ gated_sum = torch.zeros_like(base_l)
+ for src_l in range(l):
+ for src_h in range(num_heads):
+ i = src_l * num_heads + src_h
+ gate = A[:, i, j] # [batch]
+ gated_sum = gated_sum + gate[:, None, None] * head_outputs[(src_l, src_h)]
+
+ # Apply normalization ONLY to gated_sum, then add to base (see §6.1)
+ assembled = base_l + apply_norm(gated_sum, method=input_norm)
+ per_head_inputs.append(assembled)
+
+ # === RUN ATTENTION WITH PER-HEAD INPUTS ===
+ # This requires splitting the attention computation:
+ # 1. Each head h gets its own Q from per_head_inputs[h]
+ # 2. Each head h gets its own K, V from per_head_inputs[h]
+ # 3. Each head computes attention independently
+ # 4. Each head's output is projected by its slice of o_proj
+ #
+ # In practice: need to split q_proj, k_proj, v_proj into per-head
+ # projections, run attention per-head, then apply per-head o_proj.
+ #
+ # Efficient batch implementation:
+ # Stack per_head_inputs → [batch, num_heads, seq, model_dim]
+ # Apply qkv projection in batch across heads
+ # Run attention, apply o_proj per-head
+ stacked = torch.stack(per_head_inputs, dim=1) # [batch, 16, seq, 2048]
+ normed = layer_module.input_layernorm(stacked) # RMSNorm per-head
+ # NOTE: RMSNorm(2048) operates on the last dimension. When applied to
+ # [batch, 16, seq, 2048], it normalizes each head's input independently.
+ # When A=1, all 16 heads have identical inputs, so RMSNorm produces
+ # identical outputs — matching the standard single-RMSNorm behavior.
+ # No numerical divergence occurs in the baseline case.
+
+ for h in range(num_heads):
+ q = q_proj_head(normed[:, h], head=h) # [batch, seq, head_dim]
+ k = k_proj_head(normed[:, h], head=h)
+ v = v_proj_head(normed[:, h], head=h)
+ attn_out_h = attention(q, k, v) # [batch, seq, head_dim]
+ head_outputs[(l, h)] = o_proj_head(attn_out_h, head=h) # [batch, seq, model_dim]
+
+ # === RUN MLP (ungated, standard) ===
+ # MLP input = base_l + ungated sum of THIS layer's head outputs
+ attn_output_l = sum(head_outputs[(l, h)] for h in range(num_heads))
+ mlp_input = base_l + attn_output_l # standard residual
+ mlp_outputs[l] = layer_module.mlp(layer_module.post_attention_layernorm(mlp_input))
+
+ # Final: assemble output state = embedding + all heads + all MLPs
+ # IMPORTANT: final output uses UNGATED sum of ALL head outputs.
+ # A only controls intermediate routing (which heads feed into which).
+ # The final output is NOT gated — every head's output contributes
+ # equally to the final state. This is a deliberate design choice:
+ # (1) it matches the standard residual stream when A=1,
+ # (2) it means A controls *how* heads compute (via their inputs),
+ # not *whether* their outputs are used in the final prediction.
+ final_state = embedding
+ for l in range(num_layers):
+ final_state = final_state + mlp_outputs[l]
+ for h in range(num_heads):
+ final_state = final_state + head_outputs[(l, h)]
+
+ logits = model.lm_head(model.norm(final_state))
+ return logits
+```
+
+⚠ **This pseudocode is for SEMANTIC CLARITY. The actual implementation
+MUST use batched operations for performance:**
+
+```python
+# Efficient version for the gated sum:
+# Stack all prior head outputs into a tensor
+prior_outputs = torch.stack([head_outputs[(src_l, src_h)]
+                             for src_l in range(l) for src_h in range(16)], dim=1)
+# prior_outputs: [batch, l*16, seq, model_dim]
+
+# Slice A for connections into this layer
+A_slice = A[:, :l*16, l*16:(l+1)*16] # [batch, l*16, 16]
+
+# Batched gated sum: einsum or matmul
+# per_head_gated[h] = sum_i A_slice[:, i, h] * prior_outputs[:, i]
+per_head_gated = torch.einsum('bih,bisd->bhsd', A_slice, prior_outputs)
+# per_head_gated: [batch, 16, seq, model_dim]
+
+# Apply normalization ONLY to per_head_gated, then add base (consistent with §6.1)
+per_head_gated = apply_norm(per_head_gated, method=input_norm)
+assembled = base_l.unsqueeze(1) + per_head_gated # [batch, 16, seq, model_dim]
+```
+
+**Splitting Q/K/V projections per-head**: OLMo2's `q_proj`, `k_proj`,
+`v_proj` are single linear layers projecting from `model_dim` to
+`num_heads * head_dim`. To apply different inputs per head:
+
+```python
+# Option A: reshape the weight matrix and apply per-head
+W_q = layer.self_attn.q_proj.weight # [2048, 2048]
+W_q_heads = W_q.view(num_heads, head_dim, model_dim) # [16, 128, 2048]
+# For each head h: q_h = assembled[:, h] @ W_q_heads[h].T
+
+# Option B: run the full projection on each head's input separately
+# Less efficient but simpler
+
+# Option C (RECOMMENDED): run full projection on stacked inputs
+# assembled: [batch, 16, seq, 2048]
+# Reshape to [batch*16, seq, 2048], run q_proj, reshape back
+```
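+
+For Option A, the per-head projections can be batched with a single
+einsum. A sketch, assuming `assembled` has shape `[batch, 16, seq, 2048]`
+and no bias (§3.4):
+
+```python
+# q_proj.weight rows are grouped by head: rows [h*128, (h+1)*128) produce head h's query.
+W_q = layer.self_attn.q_proj.weight.view(num_heads, head_dim, model_dim)   # [16, 128, 2048]
+q = torch.einsum('bhsd,hkd->bhsk', assembled, W_q)                         # [batch, 16, seq, 128]
+# The same pattern applies to k_proj and v_proj; o_proj uses column blocks instead (§3.4).
+```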
+
+**RoPE (Rotary Position Embeddings)**: OLMo2 applies RoPE to Q and K
+AFTER projection, BEFORE the attention dot product. This is critical for
+per-head computation:
+
+```python
+# Standard OLMo2 attention (simplified):
+q = q_proj(hidden_states) # [batch, seq, num_heads * head_dim]
+k = k_proj(hidden_states) # [batch, seq, num_kv_heads * head_dim]
+q, k = apply_rotary_emb(q, k, cos, sin, position_ids)
+# Then attention(q, k, v)
+
+# DAGFormer per-head version:
+# For each head h with its own input:
+q_h = q_proj_head(assembled_h, head=h) # [batch, seq, head_dim]
+k_h = k_proj_head(assembled_h, head=h) # [batch, seq, head_dim]
+q_h, k_h = apply_rotary_emb(q_h, k_h, cos, sin, position_ids) # SAME cos/sin/position_ids for all heads
+v_h = v_proj_head(assembled_h, head=h) # no RoPE on V
+attn_out_h = attention(q_h, k_h, v_h)
+```
+
+The cos/sin cache and position_ids are **shared across all heads** (they
+depend on sequence position, not on head identity). Extract them once from
+the model's rotary embedding module at the start of each layer, then reuse
+for all 16 heads. If the implementation fails to apply RoPE, the baseline
+reproduction sanity check WILL fail because position information is lost.
+
+**OLMo2-1B uses standard MHA (NOT GQA)**: OLMo2-1B has
+`num_attention_heads = 16` and `num_key_value_heads = 16` (every Q head
+has its own K and V head). This means per-head splitting is straightforward
+— no KV sharing complications. **Verify at runtime:**
+```python
+config = model.config
+assert config.num_attention_heads == 16
+assert config.num_key_value_heads == 16, \
+ f"Expected MHA (16 KV heads), got {config.num_key_value_heads} — GQA detected, update splitting logic"
+```
+If a future model uses GQA, the per-head splitting must account for shared
+KV heads. But OLMo2-1B does not require this.
+
+**Causal attention mask**: The standard causal mask within each head's
+self-attention (preventing token t from attending to tokens t+1, t+2, ...)
+is UNCHANGED. DAGFormer's adjacency matrix A controls **cross-layer**
+routing (which heads' outputs feed into which heads' inputs). The causal
+mask controls **within-sequence** attention (which token positions can
+attend to which). These are orthogonal — A operates at the head/layer
+level, causal mask operates at the token position level. Use OLMo's
+existing causal mask implementation as-is.
+
+### 4.3 Sanity Checks (MUST PASS before proceeding)
+
+1. **Baseline reproduction**: Set A[i,j]=1 for all 30,720 valid cross-layer
+ entries, `input_norm: "none"`. NLL must match vanilla OLMo2-1B within
+ 0.01 nats. This works because A=1 makes every head's input equal to the
+ standard residual stream (see §2.2 proof). If this fails, the gate
+ injection or per-head splitting logic is wrong.
+2. **A = all-zeros** (all 30,720 entries = 0): Every head sees only
+ embedding + MLP outputs, no cross-layer attention routing. NLL should
+ be significantly higher than baseline.
+3. **A = random in [0,1]**: NLL should be between the all-ones and all-zeros
+ cases.
+4. **Gradient check**: Create A as a leaf tensor with `requires_grad=True`,
+ compute NLL, call `.backward()`, verify `A.grad` is not None and has
+ nonzero entries at all 30,720 valid positions.
+5. **Normalization smoke test**: For each `input_norm` in {gate_mean,
+ rms_post, ln_post, rms_pre}, run forward with A=all-ones. NLL will NOT
+ match baseline (normalization changes the scale), but must be finite
+ (no NaN/Inf). This confirms the norm implementations don't crash.
+6. **Per-head input divergence**: Set A to a matrix where different heads
+ in the same layer have different gate values. Verify that the per-head
+ inputs are actually different (not collapsed to the same tensor).
+ This confirms the per-head routing works.
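+
+A sketch of checks 1 and 4, assuming the `dagformer_forward` signature
+from §4.2, the layer-based `mask` from §2.2, and a precomputed
+`vanilla_nll` from an unmodified forward pass:
+
+```python
+import torch.nn.functional as F
+
+# All-ones A on valid entries, as a leaf tensor with gradients enabled
+A = mask.unsqueeze(0).expand(batch, -1, -1).clone().requires_grad_(True)   # [batch, 256, 256]
+logits = dagformer_forward(olmo, olmo_ids, A, input_norm="none")
+nll = F.cross_entropy(logits.transpose(1, 2), olmo_labels)
+
+assert abs(nll.item() - vanilla_nll) < 0.01, "baseline reproduction failed"   # check 1
+
+nll.backward()                                                                # check 4
+assert A.grad is not None
+assert (A.grad[:, mask.bool()] != 0).all(), "zero gradient at some valid A entries"
+```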
+
+---
+
+## 5. Directory Structure
+
+```
+dagformer/
+├── CLAUDE.md # THIS FILE — the complete spec
+├── README.md # Brief public description
+├── pyproject.toml # Dependencies
+│
+├── configs/
+│ ├── sanity_check.yaml # 1K steps, verify baseline NLL reproduction
+│ ├── ablation_rank.yaml # r ∈ {8, 16, 32, 64}
+│ ├── ablation_tau.yaml # τ_init/τ_final sweep
+│ ├── ablation_lambda.yaml # sparsity coefficient sweep
+│ └── phase1_full.yaml # Full Phase 1 run
+│
+├── src/
+│ ├── __init__.py
+│ ├── model/
+│ │ ├── __init__.py
+│ │ ├── olmo_graph.py # Modified OLMo2 forward with A injection
+│ │ ├── predictor.py # Qwen encoder + MLP decoder + Gumbel + cascade
+│ │ └── pipeline.py # Combines predictor + OLMo into single forward
+│ ├── data/
+│ │ ├── __init__.py
+│ │ └── dolma.py # Streaming dataloader: produces raw text,
+│ │ # tokenizes with BOTH Qwen and OLMo tokenizers,
+│ │ # returns {qwen_ids, olmo_ids, labels}
+│ ├── training/
+│ │ ├── __init__.py
+│ │ ├── trainer.py # Training loop (pure PyTorch + DDP)
+│ │ ├── schedulers.py # τ annealing, λ ramp, LR schedule
+│ │ └── checkpointing.py # Save/load predictor + optimizer + schedule state
+│ └── utils/
+│ ├── __init__.py
+│ ├── logging.py # Wandb integration
+│ └── topology.py # A matrix analysis utilities
+│
+├── scripts/
+│ ├── train.py # Entry: python scripts/train.py --config configs/X.yaml
+│ ├── eval.py # Evaluate NLL with/without predictor
+│ ├── sanity_check.py # Verify A=1 reproduces baseline
+│ └── visualize_topology.py # Plot A matrices and gate distributions
+│
+└── tests/
+ ├── test_olmo_graph.py # Baseline reproduction test
+ ├── test_predictor.py # Shape and gradient tests
+ └── test_gumbel.py # Gumbel-Sigmoid correctness
+```
+
+**File responsibilities (be precise):**
+
+- `olmo_graph.py`: ONLY handles injecting A into OLMo's forward. Does NOT
+ know about the predictor, Qwen, or training. Exports a function or class
+ that takes `(model, olmo_ids, A)` and returns logits.
+
+- `predictor.py`: ONLY the structure predictor. Takes raw text (or
+ pre-tokenized Qwen IDs), returns A. Contains Qwen loading, Qwen
+ tokenizer, MLP, Gumbel-Sigmoid, cascading gate. Does NOT know about
+ OLMo or training.
+
+- `pipeline.py`: Glue. Takes raw text (or pre-tokenized batch dict with
+ both qwen_ids and olmo_ids), calls predictor to get A, calls modified
+ OLMo forward with A, returns loss. This is what the trainer calls.
+
+- `trainer.py`: Pure training loop. Loads config, builds pipeline, runs
+ forward/backward/step. Handles DDP, logging, checkpointing. No model
+ logic here.
+
+---
+
+## 6. Implementation Order (FOLLOW THIS EXACTLY)
+
+### Step 0: Project Scaffolding
+- Create directory structure above
+- Write `pyproject.toml`:
+ ```
+ dependencies: torch>=2.2, transformers>=4.40, datasets, wandb, pyyaml, einops
+ ```
+- Verify model loading:
+ ```python
+ from transformers import AutoModelForCausalLM, AutoModel, AutoTokenizer
+ olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0425-1B")
+ qwen = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
+ ```
+- Verify Dolma streaming:
+ ```python
+ from datasets import load_dataset
+ ds = load_dataset("allenai/dolma", name="v1_7", split="train", streaming=True)
+ print(next(iter(ds))) # verify a document loads
+ ```
+
+### Step 1: `olmo_graph.py` — Modified OLMo Forward
+**This is the foundation. Get it right.**
+
+1. Load OLMo2-1B, inspect its architecture:
+ - Find the attention layer class name
+ - Find where head outputs are computed and merged
+ - Find the residual connection logic
+2. Implement adjacency injection
+3. Run sanity checks (§4.3) — ALL SIX must pass
+4. Do not proceed to Step 2 until Step 1 passes all checks
+
+### Step 2: `predictor.py` — Structure Predictor
+1. Implement Qwen encoder wrapper (frozen, mean-pooled)
+2. Implement MLP decoder with low-rank heads
+3. Implement Gumbel-Sigmoid with 3 modes: (a) training: noise + τ,
+ (b) eval soft: σ(Z/τ) no noise, (c) eval hard: (Z>0).float() no noise
+4. Implement block-upper-triangular mask (based on layer index, NOT torch.triu)
+5. Implement cascading gate
+6. Test: output shape [batch, 256, 256], values in [0,1], block-upper-tri
+7. Test: `loss = A.sum(); loss.backward()` produces gradients in MLP params
+
+### Step 3: `pipeline.py` — End-to-End
+1. Wire predictor output into OLMo modified forward
+2. Verify full gradient chain: NLL.backward() updates predictor MLP
+3. Profile memory on single GPU (must fit seq_len=1024, batch=1 on 48GB)
+
+### Step 4: Training Infrastructure
+1. YAML config → Python dataclass (not raw dicts)
+2. `schedulers.py`: τ annealing, λ ramp, LR cosine decay
+3. `trainer.py`: training loop
+4. `logging.py`: Wandb metrics (see §7)
+5. `checkpointing.py`: save/load
+
+### Step 5: Sanity Check Training Run
+- Config: `sanity_check.yaml` — 1K steps, batch=4, high τ=5.0
+- Verify: loss decreases over 1K steps
+- Verify: A is not collapsing (not all-ones, not all-zeros)
+- Verify: gradient norms are reasonable
+- **STOP HERE if loss doesn't decrease.** Debug before proceeding.
+
+### Step 6: Ablations (later)
+- Rank r, τ schedule, sparsity λ, cascading gate on/off
+- **Input normalization** (see §6.1 below)
+
+### 6.1 Input Normalization Ablation (IMPORTANT)
+
+When head j receives gated inputs from multiple source heads across
+different layers, the magnitudes of these representations can differ
+significantly. Layer 0 outputs and layer 14 outputs live at different
+scales. The choice of how to normalize the aggregated input to each
+head is a critical design decision.
+
+**The problem** — referring to the per-head assembly from §2.2:
+```python
+gated_sum = Σ_{i: layer(i) < l} A[i,j] * head_output[i]
+# If 50 sources have A[i,j] > 0, gated_sum has much larger magnitude
+# than any single head_output. Scale depends on sparsity pattern of A.
+
+assembled = base_l + gated_sum # base_l = embedding + prior MLPs
+# The gated_sum component can dwarf or be dwarfed by base_l.
+```
+
+**Normalization is applied ONLY to the gated_sum, before adding to base_l:**
+```python
+assembled = base_l + normalize(gated_sum, method=config.input_norm)
+```
+This way, base_l (which is the standard, ungated component) is preserved
+as-is, and only the novel gated routing is normalized.
+
+**Ablation candidates (implement all, sweep in configs):**
+
+| ID | Method | Formula | Learnable params | Rationale |
+|----|--------|---------|-----------------|-----------|
+| `none` | Raw weighted sum | `gated_sum` as-is | 0 | Baseline. When A=1 this reproduces vanilla OLMo. |
+| `gate_mean` | Divide by gate sum | `gated_sum / (Σ_i A[i,j] + ε)` | 0 | Normalizes for fan-in. ε=1e-8. |
+| `rms_post` | RMSNorm after sum | `RMSNorm(gated_sum)` | 2048 (one gain vector) | One shared `nn.RMSNorm(model_dim)` instance, applied to each head's gated_sum. |
+| `ln_post` | LayerNorm after sum | `LayerNorm(gated_sum)` | 2×2048 (gain + bias) | One shared `nn.LayerNorm(model_dim)` instance. Affine params are trainable and counted as predictor params. |
+| `rms_pre` | RMSNorm each source before sum | `Σ_i A[i,j] * RMSNorm_i(head_output[i])` | 256 × 2048 = 524,288 | One `nn.RMSNorm(model_dim)` **per source node** (256 total). Each head gets its own learnable gain, allowing fine-grained per-head scale correction before mixing. |
+
+All learnable norm params (if any) are part of the predictor's parameter
+group and trained with the same LR and optimizer as the MLP decoder.
+
+**Default:** Start with `none` (to verify baseline reproduction), then
+switch to `gate_mean` for actual training. If that doesn't work, try
+`rms_post`.
+
+**Implementation:** The normalization method is a config string
+(`input_norm: "gate_mean"`). The `olmo_graph.py` code dispatches on this
+config. All five methods must be implemented.
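+
+A sketch of that dispatch (the §4.2 pseudocode calls this `apply_norm`;
+the exact argument list is an implementation choice):
+
+```python
+def apply_norm(gated_sum, method, A_slice=None, shared_norm=None):
+    """gated_sum: [batch, 16, seq, model_dim]; A_slice: [batch, n_prior, 16].
+    'rms_pre' is not handled here: it normalizes each source head_output
+    BEFORE the gated sum, so it lives inside the aggregation step instead."""
+    if method == "none":
+        return gated_sum
+    if method == "gate_mean":
+        fan_in = A_slice.sum(dim=1)                        # Σ_i A[i, j] per target head, [batch, 16]
+        return gated_sum / (fan_in[..., None, None] + 1e-8)
+    if method in ("rms_post", "ln_post"):
+        return shared_norm(gated_sum)                      # shared nn.RMSNorm / nn.LayerNorm(model_dim)
+    raise ValueError(f"unknown input_norm: {method}")
+```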
+
+**Config example for ablation:**
+```yaml
+# In configs/ablation_norm.yaml
+sweep:
+ input_norm: ["none", "gate_mean", "rms_post", "ln_post", "rms_pre"]
+```
+
+---
+
+## 7. Logging & Monitoring
+
+Log ALL of these to Wandb at every training step:
+
+| Metric | Formula / Source | Why |
+|--------|-----------------|-----|
+| `train/nll` | Cross-entropy loss | Primary objective |
+| `train/sparsity_loss` | λ_t · mean(A) | Regularization term |
+| `train/total_loss` | nll + sparsity_loss | What optimizer sees |
+| `eval/nll_soft` | NLL with **deterministic** soft gates: A = σ(Z / τ), NO Gumbel noise | Smooth relaxation perf |
+| `eval/nll_hard` | NLL with hard gates: A = (Z > 0).float(), NO Gumbel noise | Inference-mode perf |
+| `eval/nll_baseline` | NLL with A = all-ones | Should be constant |
+| `topology/mean_A` | mean(A) | Overall gate activation |
+| `topology/seq_gate_frac` | Fraction of adjacent-layer ("sequential") gates > 0.5 | |
+| `topology/hyp_gate_frac` | Fraction of skip ("hyperconnection") gates > 0.5 | |
+| `topology/jaccard_var` | Variance of pairwise Jaccard across batch | Context-dependency |
+| `schedule/tau` | Current temperature | |
+| `schedule/lambda` | Current sparsity coefficient | |
+| `grad/predictor_norm` | Total L2 norm of predictor gradients | |
+
+**Collapse alarm:** If `topology/mean_A` < 0.01 or > 0.99 for 100
+consecutive steps, log a WARNING. The predictor has degenerated.
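+
+One reasonable way to compute `topology/jaccard_var` (illustrative; any
+equivalent definition is fine as long as it is logged consistently):
+
+```python
+import torch
+
+def jaccard_variance(A: torch.Tensor, thresh: float = 0.5) -> float:
+    """A: [batch, 256, 256]. Variance of pairwise Jaccard similarity of the > thresh edge sets."""
+    edges = (A > thresh).flatten(1).float()          # [batch, 65536] active-edge indicators
+    inter = edges @ edges.T                          # [batch, batch] intersection sizes
+    sizes = edges.sum(dim=1, keepdim=True)
+    union = sizes + sizes.T - inter
+    jaccard = inter / union.clamp_min(1.0)
+    off_diag = jaccard[~torch.eye(len(jaccard), dtype=torch.bool, device=A.device)]
+    return off_diag.var().item()
+```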
+
+---
+
+## 8. Config Format
+
+Use YAML. Example (`configs/sanity_check.yaml`):
+
+```yaml
+# Model
+olmo_model_id: "allenai/OLMo-2-0425-1B"
+qwen_model_id: "Qwen/Qwen3-Embedding-0.6B"
+
+# Predictor
+predictor_hidden_dim: 1024
+predictor_rank: 32
+cascading_gate_k: 5.0
+input_norm: "none" # one of: none, gate_mean, rms_post, ln_post, rms_pre
+ # use "none" for sanity check, "gate_mean" for training
+
+# Data
+dataset: "allenai/dolma"
+dataset_name: "v1_7" # Dolma version / subset
+seq_len: 1024 # OLMo token count per packed sequence
+batch_size: 4
+micro_batch_size: 4 # per-GPU micro batch; if < batch_size, use gradient accumulation
+qwen_input_prefix: "" # use raw text directly (see §3.1.1)
+
+# Eval
+eval_skip: 1000000 # skip this many examples to reach held-out region
+eval_size: 1000 # number of eval sequences (cached in memory)
+
+# Training
+total_steps: 1000
+lr: 3e-4
+weight_decay: 0.01
+optimizer: "adamw" # only adamw supported
+
+# Schedules
+tau_init: 5.0
+tau_final: 0.2
+tau_schedule: "cosine"
+lambda_max: 0.0 # no sparsity for sanity check
+lambda_warmup_frac: 0.2
+
+# Logging
+wandb_project: "dagformer"
+wandb_run_name: "sanity-check"
+log_every: 10
+eval_every: 100
+
+# Checkpointing
+save_every: 500
+save_dir: "checkpoints/"
+
+# Hardware
+num_gpus: 1
+```
+
+Parse into a `@dataclass` with validation. Crash on unknown keys.
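+
+For example (illustrative; the field list must mirror the full YAML
+schema above):
+
+```python
+import dataclasses
+import yaml
+
+@dataclasses.dataclass
+class Config:
+    olmo_model_id: str = "allenai/OLMo-2-0425-1B"
+    qwen_model_id: str = "Qwen/Qwen3-Embedding-0.6B"
+    predictor_hidden_dim: int = 1024
+    predictor_rank: int = 32
+    input_norm: str = "none"
+    # ... one field per YAML key above ...
+
+def load_config(path: str) -> Config:
+    with open(path) as f:
+        raw = yaml.safe_load(f)
+    unknown = set(raw) - {fld.name for fld in dataclasses.fields(Config)}
+    if unknown:
+        raise ValueError(f"unknown config keys: {sorted(unknown)}")   # crash loudly
+    return Config(**raw)
+```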
+
+---
+
+## 9. Key Invariants (ALWAYS enforce)
+
+1. **Baseline reproduction**: A=1 (all 30,720 entries) with `input_norm:
+ "none"` → NLL matches vanilla OLMo within 0.01. This validates the
+ entire gate injection and per-head splitting logic. Test BEFORE and
+ AFTER any architectural change.
+
+2. **DAG constraint**: A is block-upper-triangular based on LAYER indices.
+ A[i,j] = 0 whenever `j//16 <= i//16`. Enforced by multiplicative mask,
+ never by loss or gradient clipping. Do NOT use `torch.triu()`.
+
+3. **Gradient flow**: After every forward-backward, assert that all
+ predictor parameters have non-None, non-zero gradients.
+
+4. **Memory budget**: Must fit on 4×A40 for seq_len=1024. If OOM, reduce
+ batch size. Do NOT change the architecture to fix memory.
+
+5. **Frozen models stay frozen**: OLMo and Qwen must have
+ `requires_grad=False` on ALL parameters. Verify this at startup.
+   In Phase 1, the only trainable parameters belong to the predictor
+   (the MLP decoder plus any learnable input-norm parameters from §6.1).
+
+6. **Deterministic eval**: Eval uses NO Gumbel noise, ever. Two eval modes:
+ (a) Soft: A = σ(Z/τ), continuous [0,1]. (b) Hard: A = (Z>0).float(),
+ binary {0,1}, with hard cascading gate (g_j = 1 if inc_j>0, else 0).
+ Always report both `eval/nll_soft` and `eval/nll_hard`.
+
+---
+
+## 10. Oracle Search Reference Numbers
+
+These are the results from the completed oracle search. Use them to
+validate that the learned predictor is heading in the right direction.
+
+| Metric | Value |
+|--------|-------|
+| Windows evaluated | 50 |
+| Window size | 1024 tokens |
+| Improvement rate | 100% (all 50 improved) |
+| Baseline NLL (median) | 2.58 |
+| Oracle NLL (median) | 0.12 |
+| NLL delta (median) | +2.38 |
+| Oracle sequential gates ON | ~91% |
+| Oracle hyperconnection gates ON | ~70% |
+| Oracle search steps per window | 500 |
+| Oracle search method | Surrogate gradient (STE) |
+
+The learned predictor does NOT need to match oracle performance. The
+**decision gate** for Phase 1 success is: predictor NLL ≤ dense baseline
+NLL (i.e., the predictor must not make things WORSE).
+
+---
+
+## 11. Dependencies (exact)
+
+```toml
+[project]
+name = "dagformer"
+version = "0.1.0"
+requires-python = ">=3.10"
+dependencies = [
+ "torch>=2.2",
+ "transformers>=4.40",
+ "datasets",
+ "wandb",
+ "pyyaml",
+ "einops",
+]
+```
+
+**NOT used (by design):**
+- No HuggingFace Accelerate
+- No PyTorch Lightning
+- No DeepSpeed
+- No custom CUDA kernels
+
+Multi-GPU via `torch.nn.parallel.DistributedDataParallel` only.
+
+---
+
+## 12. What NOT to Do
+
+- **Do NOT implement Phase 2** (joint OLMo training). Design the code to
+ support it (param groups, differential LR) but do not implement unfreezing.
+- **Do NOT implement a diffusion-based predictor.** The MLP decoder is
+ the current design. Diffusion is future work.
+- **Do NOT write custom CUDA kernels.** Use dense matmuls with masking.
+- **Do NOT support other base models.** Hardcode OLMo2-1B for now.
+- **Do NOT use Accelerate or Lightning.** Pure PyTorch.
+- **Do NOT run hyperparameter search.** Manual ablations only.
+- **Do NOT fork OLMo2's source code.** Load from HF and modify via
+ hooks, monkey-patching, or subclassing.
+- **Do NOT use `nn.DataParallel`.** Use `DistributedDataParallel` only.
+
+---
+
+## 13. Code Style
+
+- Type hints on all function signatures
+- Docstrings on all public functions and classes
+- Config as `@dataclass`, not raw dicts
+- `assert` for shape checks in every forward pass (e.g.,
+ `assert A.shape == (batch, 256, 256)`)
+- No silent failures — crash loudly with informative messages
+- Prefer explicit loops over clever one-liners when clarity matters
+- One class per file is fine; don't over-split
+- Use `einops.rearrange` for complex tensor reshaping (clearer than
+ chains of `.view().permute().contiguous()`)
+
+---
+
+## 14. Quick Reference — Model IDs and Shapes
+
+| Thing | Value |
+|-------|-------|
+| OLMo model ID | `allenai/OLMo-2-0425-1B` |
+| Qwen model ID | `Qwen/Qwen3-Embedding-0.6B` |
+| OLMo layers | 16 |
+| OLMo heads per layer | 16 |
+| Total nodes | 256 |
+| A matrix shape | `[batch, 256, 256]` |
+| A constraint | Block-upper-triangular: `A[i,j]=0` when `j//16 <= i//16` |
+| A free entries | 30,720 (cross-layer only) |
+| Predictor rank r | 32 (default) |
+| Predictor hidden dim | 1024 (default) |
+| Sequence length | 1024 |
+| Gumbel-Sigmoid τ range | 5.0 → 0.2 |
+| Cascading gate k | 5.0 |
+| Input normalization | `none` (sanity check), `gate_mean` (training default), ablate all 5 |
+| Sparsity λ range | 0 → 0.01 |