From b83947778e2c776f757a07d4719b7ce961d7ed55 Mon Sep 17 00:00:00 2001
From: Yuren Hao
Date: Fri, 3 Jul 2026 05:56:50 -0500
Subject: =?UTF-8?q?Initial=20commit:=20ept=20=E2=80=94=20backprop-free=20e?=
 =?UTF-8?q?quilibrium=20transformer=20(EP)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately).

Co-Authored-By: Claude Opus 4.8
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
---
 docs/COLAB_50M.md | 117 ++++
 docs/campaign/C512_PLATEAU_CAMPAIGN.md | 159 ++++++
 docs/campaign/C512_ROUND2_ABCD.md | 118 ++++
 docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md | 260 +++++++++
 docs/campaign/FINDINGS.md | 609 +++++++++++++++++++++
 ...SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md | 81 +++
 docs/hardware/COLLABORATOR_BRIEF.md | 46 ++
 docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md | 125 +++++
 docs/hardware/GPT_PRO_PHYSICS_ANSWER.md | 46 ++
 docs/hardware/HW_RESEARCH_FINDINGS.md | 98 ++++
 .../PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md | 65 +++
 docs/hardware/SCALING_AND_HARDWARE_PLAN.md | 58 ++
 docs/method/ARCHITECTURE.md | 117 ++++
 docs/method/EP_DERIVATION.md | 215 ++++++++
 docs/method/METHODS.md | 576 +++++++++++++++++++
 docs/method/READING.md | 58 ++
 docs/method/READING_EN.md | 54 ++
 docs/outreach/EMAIL_DRAFT_BEN.md | 42 ++
 docs/outreach/OUTREACH_TARGETS.md | 199 +++++++
 docs/outreach/SCELLIER_OUTREACH.md | 62 ++
 docs/paper/PAPER_A_OUTLINE.md | 57 ++
 21 files changed, 3162 insertions(+)
 create mode 100644 docs/COLAB_50M.md
 create mode 100644 docs/campaign/C512_PLATEAU_CAMPAIGN.md
 create mode 100644 docs/campaign/C512_ROUND2_ABCD.md
 create mode 100644 docs/campaign/EP_BELOW210_DIAGNOSIS_FIX.md
 create mode 100644 docs/campaign/FINDINGS.md
 create mode 100644 docs/campaign/SESSION_2026-06-24_HOPF_DIAGNOSIS_RESREG_FIX.md
 create mode 100644 docs/hardware/COLLABORATOR_BRIEF.md
 create mode 100644 docs/hardware/FUGU_CODEX_PHYSICS_ANSWER.md
 create mode 100644 docs/hardware/GPT_PRO_PHYSICS_ANSWER.md
 create mode 100644 docs/hardware/HW_RESEARCH_FINDINGS.md
 create mode 100644 docs/hardware/PHYSICS_QUESTIONS_FOR_DEEP_REASONING.md
 create mode 100644 docs/hardware/SCALING_AND_HARDWARE_PLAN.md
 create mode 100644 docs/method/ARCHITECTURE.md
 create mode 100644 docs/method/EP_DERIVATION.md
 create mode 100644 docs/method/METHODS.md
 create mode 100644 docs/method/READING.md
 create mode 100644 docs/method/READING_EN.md
 create mode 100644 docs/outreach/EMAIL_DRAFT_BEN.md
 create mode 100644 docs/outreach/OUTREACH_TARGETS.md
 create mode 100644 docs/outreach/SCELLIER_OUTREACH.md
 create mode 100644 docs/paper/PAPER_A_OUTLINE.md

(limited to 'docs')

diff --git a/docs/COLAB_50M.md b/docs/COLAB_50M.md
new file mode 100644
index 0000000..9236049
--- /dev/null
+++ b/docs/COLAB_50M.md
@@ -0,0 +1,117 @@
+# Colab 50M EP run — self-contained notebook cells
+
+Goal: train the 50M (C=2048) EP energy-transformer on TinyStories-BPE on a Colab A100/H100,
+which fixes the two things timan1's A6000 can't: fp32 throughput and the big-width init instability
+(more headroom to tune the curriculum + bigger batch). Checkpoints to Google Drive with full-state
+resume, so the 12 h Colab timeout never loses progress — just re-run the training cell to continue.
+
+PREP (once, on your laptop): download from timan1 `~/ept/ept_colab.tar.gz` (16 KB, the code) and
+upload it to your Google Drive root as `ept_colab.tar.gz`. Data is regenerated in-notebook (or
+upload `~/ept/lt_ep_code/.../tinystories_bpe` bins to Drive to skip the ~40 min prep — optional).
+
+────────────────────────────────────────────────────────────────────────
+## Cell 1 — setup, Drive, GPU, deps
+```python
+import torch, subprocess, os
+print(torch.__version__, torch.cuda.get_device_name(0))
+assert torch.__version__ >= "2.1", "need torch>=2.1 for torch.func/compile"
+print(subprocess.run(["nvidia-smi","--query-gpu=name,memory.total","--format=csv,noheader"],
+                     capture_output=True,text=True).stdout)
+from google.colab import drive; drive.mount('/content/drive')
+!pip -q install tokenizers
+WORK="/content/work"; DRIVE="/content/drive/MyDrive"; os.makedirs(WORK, exist_ok=True)
+!tar xzf {DRIVE}/ept_colab.tar.gz -C {WORK}
+print("code:", os.listdir(WORK))
+```
+
+## Cell 2 — data (regenerate, cached to Drive; skip if bins already uploaded)
+```python
+import os
+DATA="/content/drive/MyDrive/ept_data/tinystories_bpe"
+if os.path.exists(f"{DATA}/train.bin"):
+    print("BPE bins found on Drive — reusing.")
+else:
+    os.makedirs("/content/drive/MyDrive/ept_data/tinystories", exist_ok=True)
+    %cd /content/drive/MyDrive/ept_data/tinystories
+    !test -f train.txt || wget -q -O train.txt https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
+    !test -f valid.txt || wget -q -O valid.txt https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt
+    # point the prep script at the Drive paths (longer path first, so the shorter prefix cannot clobber it)
+    src=open(f"{WORK}/prepare_tinystories_bpe.py").read()
+    src=src.replace("/tmp/lt_ep/data/tinystories_bpe",DATA)
+    src=src.replace("/tmp/lt_ep/data/tinystories","/content/drive/MyDrive/ept_data/tinystories")
+    open(f"{WORK}/prep_bpe_colab.py","w").write(src)
+    %cd {WORK}
+    !python prep_bpe_colab.py
+print("data:", os.listdir(DATA))
+```
+
+## Cell 3 — STABILITY SMOKE (always run first; ~15 min). Must survive past warmup without abort.
+```python
+%cd {WORK}
+# C=2048 starting curriculum (muP-scaled from C=1024's lr 4e-4 -> ~2e-4; longer warmup; gentler resinit)
+!python lt_ep_train.py --mode ep --attn_mode thick --B 16 --C 2048 --H 16 --T 512 \
+  --c 1.0 --jacreg 1.0 --jr_floor 0.1 --res_target 1.5e-3 --jr_max 64 --res_ema 0.9 \
+  --holo 2 --hr 0.02 --pema 0.999 --t1max 300 --res_est 1e-4 --t2sel 60 --res_gate 5e-3 \
+  --qknorm --resinit 0.05 --warmup 2500 --compile --T1 150 --T2 20 --lr 2e-4 \
+  --steps 1200 --log 100 --data {DATA}
+# READ THE OUTPUT: if it ABORTs or res spikes >0.1 repeatedly through steps 600-1200, the curriculum
+# is still too hot -> lower lr to 1e-4 and/or resinit 0.03 and/or warmup 4000, re-run this cell.
+# If res stays <1e-2 and val descends past step 1000, the curriculum is good -> go to Cell 4.
+```
+
+## Cell 3b — KEEP-ALIVE (run once, then it auto-clicks connect every 60s to beat the ~90min idle kill)
+Open the browser JS console (F12 → Console) on the Colab tab and paste:
+```javascript
+function keepAlive(){
+  document.querySelector("colab-connect-button")?.shadowRoot?.querySelector("#connect")?.click();
+}
+setInterval(keepAlive, 60000);
+```
+This beats ONLY the idle timeout. The HARD cap (free 12h / Pro 24h, and Pro+ background execution
+is unreliable in 2026) is unbeatable, which is why Cell 4 is built to RESUME. When Colab drops you,
+reconnect, re-run Cells 1-2 if the runtime was reset, then re-run Cell 4; it resumes from the last `--save_every` checkpoint.
+
+## Cell 4 — FULL RUN with Drive full-state resume. Re-run this exact cell after EVERY disconnect.
+```python
+%cd {WORK}
+ST="/content/drive/MyDrive/ept_ckpt/s4_50m.state"; CK="/content/drive/MyDrive/ept_ckpt/s4_50m.best.pt"
+os.makedirs("/content/drive/MyDrive/ept_ckpt", exist_ok=True)
+# --resume loads ST (weights+optimizer+sched+step+jr+best) if present -> idempotent across timeouts.
+# --save_every 100 = atomic full-state save every 100 steps -> a kill loses at most ~100 steps.
+!python lt_ep_train.py --mode ep --attn_mode thick --B 16 --C 2048 --H 16 --T 512 \
+  --c 1.0 --jacreg 1.0 --jr_floor 0.1 --res_target 1.5e-3 --jr_max 64 --res_ema 0.9 \
+  --holo 2 --hr 0.02 --pema 0.999 --t1max 300 --res_est 1e-4 --t2sel 60 --res_gate 5e-3 \
+  --qknorm --resinit 0.05 --warmup 2500 --compile --T1 150 --T2 20 --lr 2e-4 \
+  --steps 24000 --log 200 --save_every 100 --data {DATA} --ckpt {CK} --state {ST} --resume
+# IMPORTANT: match every flag here to the curriculum that PASSED Cell 3 (esp. lr/warmup/resinit).
+# On the FIRST run ST won't exist (fresh start, prints init residual); every re-run prints "[resume] from ...".
+```
+
+### Checkpointing guarantees (tested on timan1)
+- `--state` writes the FULL state (weights + AdamW moments + LR-schedule position + step + λ + best)
+  to `ST.tmp` then `os.replace` → **atomic**: a kill mid-write leaves the previous good `ST` intact.
+- `--resume` continues the LR schedule and optimizer momentum exactly (not a cold restart):
+  verified step 150 → resumed 151 with val still descending monotonically.
+- State size at 50M ≈ 1 GB (weights+pema+opt); `--save_every 100` ≈ a 1 GB Drive write every
+  ~20 min of A100 wall-clock (well under Drive's daily quota). Lower to 50 if you want ≤10-min loss.
+- `--ckpt` (CK) separately keeps the best-val weights for sampling (Cell 5), updated only on improvement.
+
+## Cell 5 — sample stories from the best checkpoint (run anytime; reads CK from Drive)
+```python
+%cd {WORK}
+CK="/content/drive/MyDrive/ept_ckpt/s4_50m.best.pt"
+!python sample_eq.py --ckpt {CK} --data {DATA} --C 2048 --H 16 --T 512 --use_pema --n 4 \
+  --prompt "Once upon a time" --temp 0.8 --topk 40
+```
+Note: sample_eq.py reads vocab from meta.pkl; for BPE it prints token ids unless decoded — if it
+shows numbers not text, ping me and I'll add the BPE decode (tokenizer.json is in {DATA}).
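The atomic-save guarantee listed under "Checkpointing guarantees" (write to `ST.tmp`, then `os.replace`) can be sketched as follows. This is a minimal illustration, not the code in `lt_ep_train.py`: the helper names `atomic_save` and `load_or_init` are hypothetical, and `pickle` stands in for whatever serializer the trainer actually uses.

```python
import os
import pickle
import tempfile


def atomic_save(state: dict, path: str) -> None:
    """Write `state` to `path` so a kill mid-write never corrupts `path`.

    Hypothetical helper: the bytes go to `path + ".tmp"` first, and only a
    fully written file is renamed over the previous good state.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def load_or_init(path: str) -> dict:
    """Return the saved state if present, else a fresh one (assumed layout)."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        st = os.path.join(d, "demo.state")
        state = load_or_init(st)   # fresh start on the first run
        state["step"] += 100       # pretend 100 training steps happened
        atomic_save(state, st)     # every re-run resumes from here
        print(load_or_init(st)["step"])
```

Because the rename is the last step, a process killed during the write leaves at most a stale `.tmp` file, and the previous state stays loadable. How a mounted Drive folder handles the rename is outside what this sketch can show.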
+
+────────────────────────────────────────────────────────────────────────
+NOTES
+- The curriculum in Cells 3/4 is a STARTING GUESS for C=2048 (we never got it stable on timan1).
+  Cell 3 is there precisely to dial it in fast on the better GPU before committing Cell 4's long run.
+- Full-state resume tested on timan1 (step 150 → resumed 151, optimizer/schedule intact).
+- Expected cost: A100 fp32 ~2-3x an A6000 → ~0.06-0.1 it/s → 24k steps ~3-5 days of wall-clock
+  ACROSS resumes (so leave it, re-run Cell 4 whenever Colab drops you). H100 faster.
+- sample_eq.py BPE-decode gap is the one known rough edge; tell me if Cell 5 prints ids.
+```
diff --git a/docs/campaign/C512_PLATEAU_CAMPAIGN.md b/docs/campaign/C512_PLATEAU_CAMPAIGN.md
new file mode 100644
index 0000000..6066c21
--- /dev/null
+++ b/docs/campaign/C512_PLATEAU_CAMPAIGN.md
@@ -0,0 +1,159 @@
+# C512 "Why slow + plateau" diagnostic campaign (2026-06-17)
+
+Standalone record of the 7-experiment campaign the user proposed to diagnose why EP at C=512 BPE
+plateaus at val CE 2.40 (orphan) and never reaches the BP capability band (1.0–1.5). Self-contained;
+companion logs in `/tmp/lt_ep/`, general arc in `FINDINGS.md`.
+
+---
+
+## 0. Setup and the user's hypothesis
+
+- **Models** (TinyStories BPE-4096, C=512 H=16 T=256 B=24; random-guess CE ln 4096 = 8.318):
+  - EP "orphan" (lr 8e-4, warmup 800, resinit 0.1, holo N=2, t2sel 40, jr_floor 0.1, res_target
+    1.5e-3, jr_max 32, pema 0.999): **best val CE 2.4037**, ran full 20k, zero excursions.
+  - Standard BP transformer (same C/H/T, mlp=4): **best val CE 1.6953**.
+  - Target: BP capability band **1.0–1.5** (BP-C512 itself is 1.70; BP needs C≥1024 to enter the band).
+
+- **User's hypothesis (verbatim intent):** the "full-time slow + plateau" is **not** simple EP
+  LR-inequivalence; it is a conflict between the **equilibrium architecture's contractive operating
+  point** and **long-context mixing**, with LR merely expressing it as "8e-4 lives, 9e-4 explodes."
+  Supporting prior: at a matched tight residual operating point EP ≥ BPTT; the apparent gap is
+  EP-tight vs loose-BPTT (non-contractive), and the gap grows with T.
+
+- **Three plateau types the user distinguished:** (1) invalid plateau (res ~1e-2, gate skipping,
+  λ pinned); (2) valid-but-over-contracted plateau (res tiny, cos high, gate-skip low, still far
+  from BP); (3) controller-fight / estimator-controller mismatch.
+
+- **Priority order requested:** exp3 (BPTT+ctl) → exp2 (LR-sweep) → exp4 (warmup/gate decouple)
+  → exp5 (λ grid) → exp6/exp7 (branch + mixing). exp1 (triangulation) foundational.
+
+---
+
+## 1. The seven experiments — spec, what ran, result, read
+
+### exp3 — C512 BPTT+controller (the decisive matched cut)
+- **Spec:** replicate the orphan EP recipe (resinit, qk-norm, warmup, jr_floor, res_target, gate,
+  pema, lr 8e-4) but swap the EP task gradient for exact BPTT. If it plateaus ~2.2–2.4 → tax is
+  architecture/controller; if ~1.9 → EP estimator/noise.
+- **Ran:** `--mode bptt`, all else matched, 20k steps, fuse on.
+- **Result:** descended cleanly to **best 3.85 @step 1400**, then **destabilized at step 1600**
+  (jr→32, res→0.07) and lodged in a broken basin (val 6.27, res 0.068 — just under the 0.1 fuse,
+  no abort) for 8400+ steps. Killed at step ~12.8k. **Worse than the EP orphan (3.85 vs 2.40).**
+- **Read:** the answer came back neither 2.2–2.4 nor 1.9 — BPTT **broke**. At C512 the contraction
+  controller does **not** keep the exact gradient on the manifold; EP's implicit contraction-bias is
+  what kept the orphan alive. Inverts the S0 result (BPTT+ctl 1.635 < EP 1.676). *Caveat: single
+  lr/seed; BPTT might be stable at another lr.*
+
+### exp2 — one-step loss-decrease LR-sweep at the plateau ckpt
+- **Spec:** fixed ckpt + batch, gradients {BPTT, EP-task, EP-task+jac}, one-step update over an lr
+  grid, same-graph eval ΔCE. Answers "is it LR-inequivalence."
+- **Result** (ΔCE, negative = better): BPTT best **−0.161** (lr 1e-2); EP best **−0.042** (lr 1e-4),
+  **diverges for lr ≥ 3e-4**. cos(g_EP,g_BPTT) 0.107, k=|g_EP|/|g_BPTT| ≈ 486.
+- **Read:** **not LR.** EP's best one-step descent over *all* lr is 4× below BPTT's; a magnitude
+  rescale can't fix a 0.10-aligned direction.
+
+### exp1 — gradient triangulation, multi-checkpoint (foundational)
+- **Spec:** at step 0/200/800/2000/plateau, per-group cos of {g_EP, g_BPTT-150, g_BPTT-400},
+  norm-ratio k, batch-to-batch variance, + res.
+- **Result:**
+
+  | ckpt | val | res | EP·BPTT150 | BPTT150·400 | k | EP self-cos | BPTT self-cos |
+  |---|---|---|---|---|---|---|---|
+  | step0 (random) | 9.90 | 2.1e-2 | 0.261 | 0.459 | 0.319 | +0.654 | +0.617 |
+  | step200 | 6.07 | 4.2e-9 | **0.989** | 1.000 | 1.000 | +0.371 | +0.376 |
+  | step800 | 4.11 | 3.9e-4 | 0.964 | 0.935 | 0.844 | +0.755 | +0.721 |
+  | step2000 | 3.56 | 2.9e-4 | 0.885 | 0.845 | 0.675 | +0.445 | +0.358 |
+  | **plateau (2.40)** | 2.40 | 2.6e-5 | **−0.045** | **1.000** | **4223** | **−0.273** | **+0.957** |
+
+- **Read (decisive):** the EP estimator is **faithful while descending** (cos 0.99→0.89, steps
+  200–2000) and undergoes an **SNR/coherence collapse at the optimum** — cos→0, k→4000, and crucially
+  the **batch self-cos goes −0.27** (EP gradients on different batches anti-correlate) while BPTT is
+  +0.96. BPTT-150≡BPTT-400 (cos 1.00) the entire time ⇒ the true gradient is always well-defined.
+  **The 2.40 plateau is an EP estimator bias-floor / batch-incoherence near the optimum** — not
+  horizon ambiguity (ruled out here), not LR (exp2).
+
+### exp4 — warmup / validity-gate decouple (3 arms)
+- **Spec:** armA current warmup; armB full-strength contraction early + task LR still warmup;
+  armC no-warmup + smaller resinit / lower branch LR. Metric: gate-skip / res-under-gate / early val.
+- **Result:** armA (warmup): res stuck ~1.5e-2 for ~50 steps → gate skips the nudge → **no task
+  learning for ~50–60 steps** (cos≈0, the reported grad is pure jacreg); cos jumps to 0.99 once
+  res