# DAGFormer Experiment Results
## Sanity Checks
### S0 — Dense Baseline (no predictor)
| Item | Value |
|------|-------|
| Status | **DONE** (from sanity training eval) |
| Date | 2025-02-09 |
| Job ID | 15785016 |
| Hardware | A40×1 |
| Eval set | skip=10000, size=50, seq_len=1024 |
| **NLL_base** | **2.4569** |
| Notes | All experiments must beat this baseline. Consider re-running with eval_size=1000 for a more robust estimate. |
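
NLL here is the mean per-token negative log-likelihood over the fixed eval slice. A minimal sketch of that reduction (`logits` and `targets` are illustrative names; the real harness presumably flattens 50 sequences of 1024 tokens):

```python
import numpy as np

def mean_nll(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean per-token NLL from unnormalized logits.

    logits:  (num_tokens, vocab) array of unnormalized scores
    targets: (num_tokens,) array of gold token ids
    """
    # log-softmax with max subtraction for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Sanity check: a uniform distribution over V tokens gives NLL = log(V).
uniform = np.zeros((10, 8))   # 10 tokens, vocab of 8
gold = np.arange(10) % 8
# mean_nll(uniform, gold) == log(8) ≈ 2.079
```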
---
### S1 — Predictor identity init (constant tau=5, ~10M tokens)
| Item | Value |
|------|-------|
| Status | **DONE** |
| Date | 2025-02-09 |
| Job ID | 15788145 |
| Config | r=32, tau=5→5 (constant), k=5, lambda=0 |
| Tokens | ~10M (2500 steps @ batch=4, seq=1024) |
| Hardware | A40×1 (gpub073) |
| Wall time | ~2 hrs |
| Target | NLL ≈ NLL_base (within 1%) |
| Purpose | Verify init reproduces dense topology |
| **Result** | **PASS** — NLL within 0.3% of baseline |

**Final metrics:**

| Metric | Value (final) |
|--------|---------------|
| eval/nll_soft | **2.4500** (baseline: 2.4569, diff: -0.3%) |
| eval/nll_hard | **2.4506** (diff: -0.3%) |
| eval/nll_baseline | 2.4569 |
| topology/mean_A | 0.975 |
| topology/seq_gate_frac | 0.986 |
| topology/hyp_gate_frac | 0.988 |
**Per-eval-step data:**
| Step | nll_soft | nll_hard | nll_base | mean_A |
|------|----------|----------|----------|--------|
| 100 | 2.4531 | 2.4512 | 2.4569 | 0.970 |
| 500 | 2.4588 | 2.4609 | 2.4569 | 0.974 |
| 1000 | 2.4506 | 2.4506 | 2.4569 | 0.978 |
| 1500 | 2.4562 | 2.4578 | 2.4569 | 0.972 |
| 2000 | 2.4500 | 2.4506 | 2.4569 | 0.978 |
| 2500 | 2.4500 | 2.4506 | 2.4569 | 0.975 |
**Observations:**
- Init NLL matches baseline from step 0 — identity init working correctly
- Step 700 showed a transient dip (mean_A=0.916, nll_soft=2.496) but recovered — consistent with Gumbel-noise exploration at high tau
- nll_hard ≈ nll_soft throughout — at tau=5 the soft gates sit near 0.95, so hard thresholding (logit > 0) yields nearly the same A
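
The soft/hard agreement can be reproduced with a toy gate. A sketch, assuming the gates are sigmoid((logit + logistic noise)/tau) during training and the hard gate thresholds the raw logit at 0 (names and the exact noise form are assumptions):

```python
import numpy as np

def soft_gate(logit: float, tau: float, rng=None) -> float:
    """Relaxed binary gate: sigmoid((logit + noise) / tau).

    With rng=None the gate is deterministic (eval mode).
    """
    noise = 0.0
    if rng is not None:
        u = rng.uniform(1e-6, 1 - 1e-6)
        noise = np.log(u) - np.log1p(-u)   # Logistic(0,1) == difference of two Gumbels
    return float(1.0 / (1.0 + np.exp(-(logit + noise) / tau)))

def hard_gate(logit: float) -> int:
    """Straight-through discretization: keep the edge iff logit > 0."""
    return int(logit > 0)

# Identity-style init pushes logits well above 0, e.g. logit = 14.7:
# soft_gate(14.7, tau=5) ≈ 0.95 while hard_gate(14.7) == 1,
# which is why nll_soft ≈ nll_hard at tau = 5.
```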
---
### S2 — Gradient flow check (constant tau=2, ~50M tokens)
| Item | Value |
|------|-------|
| Status | **RUNNING** (attempt 2) |
| Config | r=32, tau=2→2 (constant), k=5, lambda=0 |
| Tokens | ~50M (12,500 steps @ batch=4, seq=1024) |
| Hardware | A40×1 |
| Est. Time | ~15 hrs (within 48h limit) |
| Target | NLL < NLL_base (2.4569) |
| Purpose | Lower tau gives sharper gates — does the predictor learn a useful topology? |
**Attempt 1** — Job 15789537, crashed at step ~1860 (Dolma HTTP range request error)
| Step | nll_soft | nll_hard | nll_baseline | mean_A |
|------|----------|----------|--------------|--------|
| 500 | 2.4581 | 2.4581 | 2.4569 | 0.993 |
| 1000 | 2.4575 | 2.4569 | 2.4569 | 0.999 |
| 1500 | 2.4547 | 2.4559 | 2.4569 | 0.993 |
Observations (attempt 1):
- Eval NLL ≈ baseline throughout — predictor still near init (mean_A ≈ 0.99)
- Train NLL variance is high (0.27–2.96); this is normal batch-to-batch variation at batch_size=4
- No checkpoint saved (save_every=2500, crash at step ~1860)
- Crashed due to Dolma streaming HTTP error, not code bug
**Attempt 2** — Job 15798568 (fresh start, no checkpoint from attempt 1)
| Metric | Value |
|--------|-------|
| eval/nll_soft | |
| eval/nll_hard | |
| topology/mean_A | |
---
## Phase 1 Core
### P1 — Phase 1 default config (5B tokens)
| Item | Value |
|------|-------|
| Status | NOT STARTED |
| Config | r=32, tau=5→0.2 cosine, k=5, lambda=0→0.01 ramp |
| Tokens | 5B |
| Hardware | A40×4 |
| Est. Time | ~4 days |

| Metric | Value |
|--------|-------|
| eval/nll_soft | |
| eval/nll_hard | |
| topology/mean_A | |
| topology/seq_gate_frac | |
| topology/hyp_gate_frac | |
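
The two schedules in this config can be sketched as follows (cosine anneal 5→0.2 for tau, linear ramp 0→0.01 for lambda; the exact ramp boundaries are assumptions):

```python
import math

TAU_INIT, TAU_FINAL = 5.0, 0.2
LAMBDA_FINAL = 0.01

def tau_at(step: int, total_steps: int) -> float:
    """Cosine anneal from TAU_INIT down to TAU_FINAL."""
    progress = min(step / total_steps, 1.0)
    cos = 0.5 * (1.0 + math.cos(math.pi * progress))   # 1 -> 0
    return TAU_FINAL + (TAU_INIT - TAU_FINAL) * cos

def lambda_at(step: int, total_steps: int) -> float:
    """Linear ramp of the sparsity weight from 0 to LAMBDA_FINAL."""
    return LAMBDA_FINAL * min(step / total_steps, 1.0)
```

With these constants, `tau_at` reproduces the tau column of the preliminary sanity run below (e.g. tau ≈ 4.88 at step 100/1000 and 2.60 at step 500/1000), so the cosine form seems right.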
---
### P2 — Phase 1 extended (10B tokens)
| Item | Value |
|------|-------|
| Status | NOT STARTED |
| Config | Continue P1 if still improving at 5B |
| Tokens | 10B |
| Hardware | A40×4 |
| Est. Time | ~7 days |
---
## Ablations
### A1–A4: Rank r
| ID | Rank | NLL_soft | NLL_hard | Sparsity | Notes |
|----|------|----------|----------|----------|-------|
| A1 | 8 | | | | |
| A2 | 16 | | | | |
| P1 | 32 | | | | (reference) |
| A3 | 64 | | | | |
| A4 | 256 | | | | full rank |
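
On the rank axis: if the predictor factorizes an n×n gate-logit matrix as U Vᵀ with U, V ∈ ℝ^{n×r} (an assumption about the architecture; A4's "full rank" label suggests n = 256), the parameter cost scales as 2nr:

```python
def predictor_params(n: int, r: int) -> int:
    """Parameters of a rank-r factorization U @ V.T of an n x n logit matrix."""
    return 2 * n * r

N = 256  # assumed: "full rank" at r = 256 implies a 256 x 256 matrix
for r in (8, 16, 32, 64, 256):
    print(r, predictor_params(N, r))
# r = 32 costs 16,384 params vs 131,072 at r = 256 — the factorized full-rank
# form is 2x the dense n*n count, so A4 mainly tests capacity, not efficiency.
```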
### A5–A7: Temperature schedule
| ID | tau_init | tau_final | NLL_soft | NLL_hard | A entropy | Notes |
|----|----------|-----------|----------|----------|-----------|-------|
| A5 | 1 | 1 | | | | constant, perpetually soft |
| P1 | 5 | 0.2 | | | | (reference) |
| A6 | 5 | 0.05 | | | | aggressive anneal |
| A7 | 10 | 1.0 | | | | slow anneal |
### A8–A9: Sparsity lambda
| ID | lambda | NLL_soft | NLL_hard | Density | Notes |
|----|--------|----------|----------|---------|-------|
| A8 | 0 | | | | no sparsity |
| P1 | 0→0.01 | | | | (reference) |
| A9 | 0→0.05 | | | | high sparsity |
### A10–A11: Cascading gate
| ID | Gate | NLL_soft | NLL_hard | Dead heads | Notes |
|----|------|----------|----------|------------|-------|
| A10 | OFF | | | | |
| P1 | k=5 fixed | | | | (reference) |
| A11 | k=5 learnable | | | | |
---
## Analysis Experiments
### X1 — Topology variance analysis
| Item | Value |
|------|-------|
| Status | NOT STARTED |
| Result | |
### X2 — Domain-specific topology
| Item | Value |
|------|-------|
| Status | NOT STARTED |
| Result | |
### X3 — Topology-NLL sensitivity
| Item | Value |
|------|-------|
| Status | NOT STARTED |
| Result | |
---
## Speed Estimates (A40×1, batch=4, micro_batch=2, seq=1024)
| Component | Time | Notes |
|-----------|------|-------|
| Training step | ~3s | Forward + backward + optimizer |
| Eval round (50 samples) | ~2 min | 25 batches × 3 modes (soft/hard/baseline) |
| Model loading | ~10 min | OLMo + Qwen + eval set build |
| 1K steps (no eval) | ~50 min | |
| 1K steps (eval every 100) | ~70 min | 10 eval rounds add ~20 min |
| 10K steps | ~12 hrs | |
| 100K steps | ~5 days | Exceeds 48h SLURM limit, needs auto-resume |
**The previous 14s/step estimate was wrong** — it included model loading and eval overhead in the wall-clock average.
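
The line items above compose into a simple wall-clock estimator (constants taken from the table; this excludes the one-time ~10 min loading cost and queue time):

```python
STEP_S = 3.0    # forward + backward + optimizer, per training step
EVAL_S = 120.0  # one eval round: 25 batches x 3 modes (soft/hard/baseline)

def run_minutes(steps: int, eval_every: int = 0) -> float:
    """Estimated wall-clock minutes for a run, excluding model loading."""
    eval_rounds = steps // eval_every if eval_every else 0
    return (steps * STEP_S + eval_rounds * EVAL_S) / 60.0

# run_minutes(1000)        == 50.0  (matches the "1K steps, no eval" row)
# run_minutes(1000, 100)   == 70.0  (10 eval rounds add 20 min)
# run_minutes(10000, 100)  == 700.0 (~11.7 hrs, consistent with the ~12 hr row)
```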
---
## Preliminary Data (from sanity training job 15785016)
Run affected by the cascading-gate bug (layer 0 not exempted); 500 of 1000 steps completed before timeout.
| Step | train/nll | eval/nll_soft | eval/nll_hard | eval/nll_baseline | mean_A | tau |
|------|-----------|---------------|---------------|-------------------|--------|-----|
| 0 | 3.539 | — | — | — | 0.417 | 5.00 |
| 100 | 2.750 | 2.635 | 4.744 | 2.457 | 0.416 | 4.88 |
| 200 | 3.102 | 2.630 | 4.570 | 2.457 | 0.416 | 4.54 |
| 300 | 2.844 | 2.621 | 4.680 | 2.457 | 0.418 | 4.01 |
| 400 | 2.492 | 2.641 | 4.893 | 2.457 | 0.419 | 3.34 |
| 500 | 1.805 | 2.639 | 4.503 | 2.457 | 0.419 | 2.60 |
**Key observations:**
- train/nll decreasing (3.54 → 1.80) while eval/nll_soft stays flat (~2.63) — overfitting, or the predictor is not generalizing
- eval/nll_hard very high (4.5–4.9) due to the cascading-gate layer-0 bug (now fixed in `80579d6`)
- mean_A stable at ~0.42 (≈0.89 over valid entries), no collapse
- Baseline NLL = 2.4569 confirmed correct after double-shift fix