author Yuren Hao <yurenh2@illinois.edu> 2026-07-03 06:06:09 -0500
committer Yuren Hao <yurenh2@illinois.edu> 2026-07-03 06:06:09 -0500
commit 8626ac5d6cf6f548157cb349ea99b8b603b268ce
tree 04afed3ed84652154944a735272939980d888f53
parent b7fab6a524c4c5cd29aaf9933fb150e7b7902a3f
Add pull_assets.py one-click restore (HF ept-assets) + ONBOARDING restore note
git clone + python pull_assets.py = full working tree: pulls TinyStories-BPE data (~697M) and the 5 key checkpoints from the private HF dataset repo blackhao0426/ept-assets straight into ep_run/{data,runs}.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
-rw-r--r-- ONBOARDING.md 15
-rw-r--r-- pull_assets.py 38
2 files changed, 48 insertions, 5 deletions
diff --git a/ONBOARDING.md b/ONBOARDING.md
index f8f7dee..1b36a72 100644
--- a/ONBOARDING.md
+++ b/ONBOARDING.md
@@ -80,11 +80,16 @@ Diagnostics: add `--diag_cos 500` (log cos-to-BPTT over training) · `--init_ckp
operator's 4-D fingerprint) · `--eigreg 0.1 --eig_margin 1.0` (leading-abscissa control, alt to `--jacreg`).
BP baseline (fair control): `--mode bptt`. **All experiment processes must use `nohup`.**
-**Getting the data & checkpoints (git-ignored — not in this repo):**
-- **Data** (`ep_run/data/tinystories_bpe/`, ~712 MB): regenerate from the BPE tokenizer pipeline in `ep_run/` (build
- the tokenizer + tokenize TinyStories → `train.bin` / `val.bin` / `meta.pkl`), or copy from the shared location.
-- **Checkpoints** (`ep_run/runs/*.pt`, e.g. `redx_traj/s2000.pt` for warm-starting): ask Yuren for a share link —
- too large for git. `s2000.pt` is the stable warm-start operator (see §5).
+**Getting the data & checkpoints (git-ignored — not in this repo):** one command.
+```
+python pull_assets.py # run from the repo root, after `huggingface-cli login`
+```
+This pulls the TinyStories-BPE data (~697 MB) and the key checkpoints from a **private HF dataset repo**
+(`blackhao0426/ept-assets`) straight into their correct paths — so **`git clone` + `python pull_assets.py` = a full
+working tree**. It restores `ep_run/data/tinystories_bpe/` (`train.bin`/`val.bin`/`meta.pkl`) and
+`ep_run/runs/redx_traj/s2000.pt` plus `ep_run/runs/{ep_rr_ajr, ep_resreg_scratch, ep_fast_adaptive, bptt_clean}.pt`. `s2000.pt` is the
+stable warm-start operator (see §5). *Prereqs:* `pip install -U huggingface_hub` and ask Yuren for access to the
+private repo. (The full `runs/` history is larger; `pull_assets.py` fetches the load-bearing subset — ask for more.)
## 8. Deeper docs (organized under `docs/`)
- **`docs/method/`** — `METHODS.md`, `EP_DERIVATION.md` (the EP/AsymEP gradient derivation), `ARCHITECTURE.md`
diff --git a/pull_assets.py b/pull_assets.py
new file mode 100644
index 0000000..e9b3777
--- /dev/null
+++ b/pull_assets.py
@@ -0,0 +1,38 @@
+#!/usr/bin/env python3
+"""One-click restore of the git-ignored large assets (TinyStories-BPE data + key checkpoints) from the
+private HF dataset repo, into the correct paths. So: `git clone` + `python pull_assets.py` = a full
+working tree (ep_run/data/ + ep_run/runs/ reconstructed in place).
+
+Prereqs:
+ pip install -U huggingface_hub
+ huggingface-cli login # must have access to the private repo below (ask Yuren)
+
+What it restores:
+ ep_run/data/tinystories_bpe/ (train.bin / val.bin / tokenizer.json / meta.pkl)
+ ep_run/runs/redx_traj/s2000.pt (the warm-start operator, §5 of ONBOARDING)
+ ep_run/runs/{ep_rr_ajr, ep_resreg_scratch, ep_fast_adaptive, bptt_clean}.pt (key result checkpoints)
+"""
+import os, sys, subprocess
+
+REPO = "blackhao0426/ept-assets" # private HF dataset repo (mirrors ep_run/ layout)
+HERE = os.path.dirname(os.path.abspath(__file__))
+DEST = os.path.join(HERE, "ep_run")
+
+
+def main():
+    try:
+        from huggingface_hub import snapshot_download
+    except ImportError:
+        subprocess.run([sys.executable, "-m", "pip", "install", "-q", "huggingface_hub"], check=True)
+        from huggingface_hub import snapshot_download
+    print(f"restoring {REPO} -> {DEST}/{{data,runs}} ...", flush=True)
+    snapshot_download(repo_id=REPO, repo_type="dataset", local_dir=DEST)
+    for p in ("data/tinystories_bpe", "runs"):
+        fp = os.path.join(DEST, p)
+        if os.path.isdir(fp):
+            print(f" ok ep_run/{p}/ ({len(os.listdir(fp))} items)", flush=True)
+    print("done — working tree restored.", flush=True)
+
+
+if __name__ == "__main__":
+    main()
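
Review note: `main()` above only reports how many items landed in `data/tinystories_bpe/` and `runs/`, so a partial download would still print "done". A minimal sketch of a stricter post-download check follows. It is not part of this commit. The helper name `missing_assets` and the `EXPECTED` tuple are hypothetical, and the file list is read off the ONBOARDING text and the script docstring, on the assumption that the HF repo really mirrors that layout. `tokenizer.json` is named only in the docstring, so it is left out of the required set.

```python
import os

# Relative paths under ep_run/, taken from ONBOARDING.md and the pull_assets.py docstring.
# Assumption: the HF dataset repo mirrors exactly this layout.
# tokenizer.json is listed only in the docstring, so it is not treated as required here.
EXPECTED = (
    "data/tinystories_bpe/train.bin",
    "data/tinystories_bpe/val.bin",
    "data/tinystories_bpe/meta.pkl",
    "runs/redx_traj/s2000.pt",
    "runs/ep_rr_ajr.pt",
    "runs/ep_resreg_scratch.pt",
    "runs/ep_fast_adaptive.pt",
    "runs/bptt_clean.pt",
)


def missing_assets(dest, expected=EXPECTED):
    """Return the expected relative paths that are absent or empty under `dest`."""
    missing = []
    for rel in expected:
        path = os.path.join(dest, *rel.split("/"))
        # A zero-byte file is treated as missing, since it cannot be a usable asset.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(rel)
    return missing
```

One possible use is to call `missing_assets(DEST)` right after `snapshot_download(...)` and exit non-zero with the returned list if it is non-empty, so that a failed restore is not reported as success.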