# Colab 50M EP run — self-contained notebook cells

Goal: train the 50M (C=2048) EP energy-transformer on TinyStories-BPE on a Colab A100/H100.
The bigger GPU addresses the two things timan1's A6000 can't: fp32 throughput, and the big-width
init instability (more headroom to tune the curriculum, plus a bigger batch). Checkpoints go to
Google Drive with full-state resume, so a 12 h Colab timeout costs at most the steps since the last
save: re-run Cell 1 if the runtime was reset, then re-run the training cell to continue.

PREP (once, on your laptop): download `~/ept/ept_colab.tar.gz` (16 KB, the code) from timan1 and
upload it to your Google Drive root as `ept_colab.tar.gz`. Data is regenerated in-notebook.
Optional: to skip the ~40 min prep, upload the `~/ept/lt_ep_code/.../tinystories_bpe` bins to
`MyDrive/ept_data/tinystories_bpe/` on Drive, which is where Cell 2 looks for `train.bin`.

────────────────────────────────────────────────────────────────────────
## Cell 1 — setup, Drive, GPU, deps
```python
import torch, subprocess, os
print(torch.__version__, torch.cuda.get_device_name(0))
assert torch.__version__ >= "2.1", "need torch>=2.1 for torch.func/compile"
print(subprocess.run(["nvidia-smi","--query-gpu=name,memory.total","--format=csv,noheader"],
                     capture_output=True,text=True).stdout)
from google.colab import drive; drive.mount('/content/drive')
!pip -q install tokenizers
WORK="/content/work"; DRIVE="/content/drive/MyDrive"; os.makedirs(WORK, exist_ok=True)
!tar xzf {DRIVE}/ept_colab.tar.gz -C {WORK}
print("code:", os.listdir(WORK))
```

## Cell 2 — data (regenerate, cached to Drive; skip if bins already uploaded)
```python
import os
DATA="/content/drive/MyDrive/ept_data/tinystories_bpe"
if os.path.exists(f"{DATA}/train.bin"):
    print("BPE bins found on Drive — reusing.")
else:
    os.makedirs("/content/drive/MyDrive/ept_data/tinystories", exist_ok=True)
    %cd /content/drive/MyDrive/ept_data/tinystories
    !test -f train.txt || wget -q -O train.txt https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
    !test -f valid.txt || wget -q -O valid.txt https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt
    # Point the prep script at the Drive paths. Replace the longer "_bpe" path first:
    # the raw-data path is a prefix of it, so the order of the two replaces matters.
    with open(f"{WORK}/prepare_tinystories_bpe.py") as f: src = f.read()
    src = src.replace("/tmp/lt_ep/data/tinystories_bpe", DATA)
    src = src.replace("/tmp/lt_ep/data/tinystories", "/content/drive/MyDrive/ept_data/tinystories")
    with open(f"{WORK}/prep_bpe_colab.py", "w") as f: f.write(src)
    %cd {WORK}
    !python prep_bpe_colab.py
print("data:", os.listdir(DATA))
```
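
The path rewrite in Cell 2 replaces two strings where one is a prefix of the other, so the result can depend on the order of the replaces. Below is a minimal sketch of an order-independent rewrite, assuming the prep script hard-codes exactly the two `/tmp/lt_ep/data/...` paths used in Cell 2. `rewrite_paths` is a hypothetical helper for illustration, not something shipped in the tarball.

```python
def rewrite_paths(src: str, mapping: dict[str, str]) -> str:
    """Replace each old path with its new path, longest old path first."""
    # Longest-first: "/a/b_bpe" must be handled before its prefix "/a/b",
    # otherwise the prefix replace would rewrite the first half of it.
    for old in sorted(mapping, key=len, reverse=True):
        src = src.replace(old, mapping[old])
    return src

mapping = {
    "/tmp/lt_ep/data/tinystories": "/content/drive/MyDrive/ept_data/tinystories",
    "/tmp/lt_ep/data/tinystories_bpe": "/content/drive/MyDrive/ept_data/tinystories_bpe",
}
# Stand-in for the two hard-coded paths in the prep script (assumed layout).
demo = 'RAW = "/tmp/lt_ep/data/tinystories"\nOUT = "/tmp/lt_ep/data/tinystories_bpe"\n'
print(rewrite_paths(demo, mapping))
```

With the Drive layout used here both orders give the same text, because the two targets share the same `_bpe` suffix. The longest-first rule only starts to matter if the bins are moved to a directory with a different name.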

## Cell 3 — STABILITY SMOKE (always run first; budget ~3-6 h for 1200 steps at the 0.06-0.1 it/s estimated in NOTES). Must survive past warmup without abort.
```python
%cd {WORK}
# C=2048 starting curriculum (muP-scaled from C=1024's lr 4e-4 -> ~2e-4; longer warmup; gentler resinit)
!python lt_ep_train.py --mode ep --attn_mode thick --B 16 --C 2048 --H 16 --T 512 \
  --c 1.0 --jacreg 1.0 --jr_floor 0.1 --res_target 1.5e-3 --jr_max 64 --res_ema 0.9 \
  --holo 2 --hr 0.02 --pema 0.999 --t1max 300 --res_est 1e-4 --t2sel 60 --res_gate 5e-3 \
  --qknorm --resinit 0.05 --warmup 2500 --compile --T1 150 --T2 20 --lr 2e-4 \
  --steps 1200 --log 100 --data {DATA}
# READ THE OUTPUT: if it ABORTs or res spikes >0.1 repeatedly through steps 600-1200, the curriculum
# is still too hot -> lower lr to 1e-4 and/or resinit 0.03 and/or warmup 4000, re-run this cell.
# If res stays <1e-2 and val descends past step 1000, the curriculum is good -> go to Cell 4.
```

## Cell 3b — KEEP-ALIVE (run once, then it auto-clicks connect every 60s to beat the ~90min idle kill)
Open the browser JS console (F12 → Console) on the Colab tab and paste:
```javascript
function keepAlive(){
  document.querySelector("colab-connect-button")?.shadowRoot?.querySelector("#connect")?.click();
}
setInterval(keepAlive, 60000);
```
This beats ONLY the idle timeout. The HARD cap (free 12h / Pro 24h, and Pro+ background execution
is unreliable in 2026) is unbeatable, which is why Cell 4 is built to RESUME. When Colab drops you,
reconnect, re-run Cell 1 if the runtime was reset (the Drive mount and `/content/work` do not
survive a new VM), then re-run Cell 4; it continues from the last `--save_every` checkpoint on Drive.

## Cell 4 — FULL RUN with Drive full-state resume. Re-run this exact cell after EVERY disconnect.
```python
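import os
# Redefine the paths so this cell also works in a fresh runtime. After a runtime reset, run
# Cell 1 first: it remounts Drive and re-extracts the code into /content/work.
WORK="/content/work"; DATA="/content/drive/MyDrive/ept_data/tinystories_bpe"
%cd {WORK}
ST="/content/drive/MyDrive/ept_ckpt/s4_50m.state"; CK="/content/drive/MyDrive/ept_ckpt/s4_50m.best.pt"
os.makedirs("/content/drive/MyDrive/ept_ckpt", exist_ok=True)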
# --resume loads ST (weights+optimizer+sched+step+jr+best) if present -> idempotent across timeouts.
# --save_every 100 = atomic full-state save every 100 steps -> a kill loses at most ~100 steps.
!python lt_ep_train.py --mode ep --attn_mode thick --B 16 --C 2048 --H 16 --T 512 \
  --c 1.0 --jacreg 1.0 --jr_floor 0.1 --res_target 1.5e-3 --jr_max 64 --res_ema 0.9 \
  --holo 2 --hr 0.02 --pema 0.999 --t1max 300 --res_est 1e-4 --t2sel 60 --res_gate 5e-3 \
  --qknorm --resinit 0.05 --warmup 2500 --compile --T1 150 --T2 20 --lr 2e-4 \
  --steps 24000 --log 200 --save_every 100 --data {DATA} --ckpt {CK} --state {ST} --resume
# IMPORTANT: match every flag here to the curriculum that PASSED Cell 3 (esp. lr/warmup/resinit).
# On the FIRST run ST won't exist (fresh start, prints init residual); every re-run prints "[resume] from ...".
```

### Checkpointing guarantees (tested on timan1)
- `--state` writes the FULL state (weights + AdamW moments + LR-schedule position + step + λ + best)
  to `ST.tmp` then `os.replace` → **atomic**: a kill mid-write leaves the previous good `ST` intact.
- `--resume` continues the LR schedule and optimizer momentum exactly (a true continuation, not a
  cold restart): verified step 150 → resumed 151 with val still descending monotonically.
- State size at 50M ≈ 1 GB (weights+pema+opt); `--save_every 100` means one ~1 GB Drive write every
  ~20 min of A100 wall-clock (well under Drive's daily quota). Lower to 50 if you want ≤10-min loss.
- `--ckpt` (CK) separately keeps the best-val weights for sampling (Cell 5), updated only on improvement.
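
The write-to-temp-then-rename pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `lt_ep_train.py`: `save_state_atomic` and `load_state_or_none` are hypothetical names, and `pickle` stands in for whatever serializer the trainer actually uses. `os.replace` is atomic on a local POSIX filesystem; the sketch does not verify how the Drive mount behaves.

```python
import os
import pickle
import tempfile

def save_state_atomic(state: dict, path: str) -> None:
    """Write `state` so that `path` only ever holds a complete file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming
    os.replace(tmp, path)     # swap in the new file in one step

def load_state_or_none(path: str):
    """Return the saved state, or None on a fresh start."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy training loop: resume if a state file exists, save every 100 steps.
with tempfile.TemporaryDirectory() as d:
    st = os.path.join(d, "demo.state")
    state = load_state_or_none(st) or {"step": 0}
    for step in range(state["step"] + 1, 251):
        state["step"] = step
        if step % 100 == 0:
            save_state_atomic(state, st)
    resumed = load_state_or_none(st)  # what a re-run would see after a kill
    print(resumed["step"])
```

In the toy loop the run stops at step 250 but the last save happened at step 200, so a re-run would pick up from 200. That is the "loses at most ~100 steps" behaviour described for `--save_every 100`.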

## Cell 5 — sample stories from the best checkpoint (run anytime; reads CK from Drive)
```python
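WORK="/content/work"; DATA="/content/drive/MyDrive/ept_data/tinystories_bpe"  # same paths as Cells 1-2
%cd {WORK}
CK="/content/drive/MyDrive/ept_ckpt/s4_50m.best.pt"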
!python sample_eq.py --ckpt {CK} --data {DATA} --C 2048 --H 16 --T 512 --use_pema --n 4 \
  --prompt "Once upon a time" --temp 0.8 --topk 40
```
Note: sample_eq.py reads vocab from meta.pkl; for BPE it prints token ids unless decoded — if it
shows numbers not text, ping me and I'll add the BPE decode (tokenizer.json is in {DATA}).
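
If Cell 5 does print ids, a stopgap decode could look like the sketch below. It rests on two assumptions that the note above does not confirm: that the ids appear as plain integers in the printed output, and that `tokenizer.json` is in the Hugging Face `tokenizers` format (the package installed in Cell 1). `parse_ids` and `decode_ids` are hypothetical helpers, not part of `sample_eq.py`.

```python
import re

def parse_ids(text: str) -> list[int]:
    """Pull every integer out of a printed sample, e.g. '[12, 345, 6]'."""
    return [int(tok) for tok in re.findall(r"\d+", text)]

def decode_ids(ids: list[int], tokenizer_json: str) -> str:
    """Decode BPE token ids with the tokenizer stored next to the bins."""
    from tokenizers import Tokenizer  # imported lazily; installed in Cell 1
    return Tokenizer.from_file(tokenizer_json).decode(ids)

printed = "[12, 345, 6]"  # hypothetical output line, not real sample_eq.py output
ids = parse_ids(printed)
print(ids)
# In Colab, with DATA defined as in Cell 2:
# print(decode_ids(ids, f"{DATA}/tokenizer.json"))
```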

────────────────────────────────────────────────────────────────────────
NOTES
- The curriculum in Cells 3/4 is a STARTING GUESS for C=2048 (we never got it stable on timan1).
  Cell 3 is there precisely to dial it in fast on the better GPU before committing Cell 4's long run.
- Full-state resume tested on timan1 (step 150 → resumed 151, optimizer/schedule intact).
- Expected cost: A100 fp32 ~2-3x an A6000 → ~0.06-0.1 it/s → 24k steps ≈ 3-5 days of wall-clock
  ACROSS resumes (so leave it, re-run Cell 4 whenever Colab drops you). H100 faster.
- sample_eq.py BPE-decode gap is the one known rough edge; tell me if Cell 5 prints ids.
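
The cost estimate in the notes is plain arithmetic on the step rate, and the same arithmetic gives the number of Colab sessions to expect. A small sketch, assuming the 0.06-0.1 it/s range above and a 12 h session cap (`eta` is a hypothetical helper; the real rate should be read from the training log):

```python
def eta(steps: int, its_per_s: float, session_h: float = 12.0) -> tuple[float, float]:
    """Return (wall-clock days, number of sessions) needed for `steps`."""
    hours = steps / its_per_s / 3600.0
    return hours / 24.0, hours / session_h

for rate in (0.06, 0.10):
    days, sessions = eta(24_000, rate)
    smoke_h = 1_200 / rate / 3600.0   # the 1200-step run in Cell 3
    save_min = 100 / rate / 60.0      # wall-clock between --save_every 100 saves
    print(f"{rate:.2f} it/s: full run {days:.1f} days (~{sessions:.0f} sessions), "
          f"smoke {smoke_h:.1f} h, {save_min:.0f} min between saves")
```

Swap in the measured it/s once the first log lines appear; every figure scales inversely with the rate.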